Published on Wed May 02 2018

AI safety via debate

Geoffrey Irving, Paul Christiano, Dario Amodei

Abstract

To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self-play on a zero-sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial-time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier's accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.
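To make the protocol concrete, here is a minimal Python sketch of the sparse-pixel debate described above, under assumptions flagged in the comments: a pre-trained sparse classifier plays the judge, the honest agent argues for the true label, the liar argues for the judge's strongest competing label, and both play one-step greedy moves rather than the search-based agents used in the paper.

import numpy as np

def pixel_debate(image, true_label, judge, n_pixels=6):
    # `judge(masked)` is assumed to return class probabilities for a masked
    # image (zeros except revealed pixels); it stands in for the paper's
    # sparse classifier. The agents and their greedy policies are illustrative.
    masked = np.zeros_like(image)
    order = np.argsort(judge(masked))
    lie = int(order[-1]) if order[-1] != true_label else int(order[-2])
    claims = [true_label, lie]
    for turn in range(n_pixels):
        claim = claims[turn % 2]
        candidates = [idx for idx in zip(*np.nonzero(image)) if masked[idx] == 0]
        # Greedy move: reveal the pixel that most raises the judge's
        # probability of this agent's claim.
        best = max(candidates,
                   key=lambda idx: judge(_reveal(masked, image, idx))[claim])
        masked[best] = image[best]
    final = judge(masked)
    return claims[int(final[lie] > final[true_label])]  # the winning claim

def _reveal(masked, image, idx):
    trial = masked.copy()
    trial[idx] = image[idx]
    return trial

The game is zero-sum: one claim wins exactly when the other loses, which is what makes self-play training applicable.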

Mon Nov 11 2019
Artificial Intelligence
(When) Is Truth-telling Favored in AI Debate?
Irving et al. (2018) propose a debate between two AI systems to amplify the problem-solving capabilities of a human judge. They introduce a mathematical framework that can model debates of this type. The quality of debate designs should be measured by the accuracy of the most persuasive answer.
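Stated as a formula (our notation, not the paper's): the quality of a debate design $D$ is the probability that its most persuasive answer is correct,

$\mathrm{quality}(D) = \Pr_{q \sim Q}\left[ a^{*}_{D}(q) = a_{\mathrm{correct}}(q) \right]$,

where $a^{*}_{D}(q)$ denotes the answer that optimal debaters can most persuasively defend under design $D$.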
Wed Jan 29 2020
Artificial Intelligence
Bayesian Reasoning with Trained Neural Networks
We showed how to use trained neural networks to perform Bayesian reasoning in order to solve tasks outside their initial scope. The approach built on top of already trained networks, and the addressable questions grew super-exponentially with the number of available networks.
Sun Nov 04 2018
Artificial Intelligence
Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning
The Bayesian action decoder (BAD) is a new multi-agent learning method. It uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment. BAD introduces a new Markov decision process, the public belief MDP.
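A minimal sketch of the public-belief update (illustrative names and shapes, not the BAD implementation): every agent conditions a shared posterior over an actor's private feature on the action that actor was observed to take.

import numpy as np

def public_belief_update(belief, action, policy):
    # belief: shared prior P(f) over the actor's private feature, shape (F,)
    # policy: assumed (learned) action distribution P(a | f), shape (F, A)
    likelihood = policy[:, action]
    posterior = belief * likelihood
    return posterior / posterior.sum()

In BAD this update runs inside the public belief MDP, where the state is the belief itself and the learned policy supplies the likelihood term.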
Thu Jun 10 2021
Artificial Intelligence
Brittle AI, Causal Confusion, and Bad Mental Models: Challenges and Successes in the XAI Program
Deep neural network driven models have surpassed human-level performance in benchmark autonomy tasks, but the underlying policies for these agents are not easily interpretable. This paper discusses the origins of these takeaways, provides amplifying information, and suggests directions for future work.
Fri Feb 01 2019
Machine Learning
The Hanabi Challenge: A New Frontier for AI Research
Games have been important testbeds for how well machines can do sophisticated decision making. We argue that Hanabi elevates reasoning about the beliefs and intentions of other agents to the foreground. To facilitate future research, we introduce the open-source Hanabi Learning Environment.
Thu Dec 10 2020
Artificial Intelligence
Deep Argumentative Explanations
Deep Argumentative eXplanations (DAXs) are a form of symbolic AI offering useful reasoning abstractions for explanation. DAXs exhibit deep fidelity and low computational cost, and they are competitive with existing approaches to XAI.
Tue Jun 10 2014
Machine Learning
Generative Adversarial Networks
We propose a new framework for estimating generative models via an adversarial process. We simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake.
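The adversarial process is the two-player minimax game over the value function given in the paper:

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

D is trained to assign high probability to training data and low probability to samples from G, while G is trained to drive D toward mistakes.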
Fri Oct 27 2017
Neural Networks
Progressive Growing of GANs for Improved Quality, Stability, and Variation
We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively. We add new layers that model increasingly fine details as training progresses.
Tue Dec 05 2017
Artificial Intelligence
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
The game of chess is the most widely studied domain in the history of artificial intelligence, with the strongest programs built on sophisticated search techniques and handcrafted evaluation functions refined by human experts over decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by tabula rasa reinforcement learning from games of self-play.
Tue Jun 21 2016
Artificial Intelligence
Concrete Problems in AI Safety
Rapid progress in machine learning and artificial intelligence has brought increasing attention to the potential impacts of AI technologies on society. We present a list of five practical research problems related to accident risk. We review previous work and suggest research directions with a focus on relevance to cutting-edge AI systems.
Thu Nov 02 2017
Artificial Intelligence
A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning
Multiagent reinforcement learning (MARL) is the challenge of interacting with other agents in a shared environment. We describe an algorithm for general MARL based on approximate best responses to mixtures of policies. The algorithm generalizes previous ones such as InRL, iterated best response, double oracle, and fictitious play.
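A sketch of the best-response-to-mixtures loop (placeholder callables, not the paper's implementation): grow a population of policies, each trained against a mixture over the current population, then recompute the mixture from empirical payoffs.

def population_loop(initial_policy, best_response, meta_solve, n_iters=10):
    # best_response(mixture): assumed to train an approximate best response
    # against opponents sampled from `mixture` ((policy, weight) pairs).
    # meta_solve(population): assumed to return mixture weights, e.g. from
    # an empirical payoff matrix over the population.
    population = [initial_policy]
    weights = [1.0]
    for _ in range(n_iters):
        population.append(best_response(list(zip(population, weights))))
        weights = meta_solve(population)
    return population, weights

The choice of meta_solve recovers the special cases the summary mentions: a point mass on the latest policy gives iterated best response, while uniform weights give a fictitious-play-like scheme.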
Fri Oct 19 2018
Artificial Intelligence
Supervising strong learners by amplifying weak experts
Real-world learning tasks often involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. We propose an alternative training strategy which progressively builds up a training signal for difficult problems. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors.
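A sketch of one amplification step, where decompose, combine, and model.update are hypothetical placeholders (decompose and combine stand in for the human's role): a hard question is split into subquestions the current model can answer, and the combined answer becomes the training target for the next model.

def amplify(question, model, decompose, combine):
    # decompose(question): assumed to split the task into easier subquestions.
    # combine(...): assumed to assemble subanswers into an overall answer.
    subquestions = decompose(question)
    subanswers = [model(q) for q in subquestions]
    return combine(question, subquestions, subanswers)

def iterated_amplification_step(model, hard_questions, decompose, combine):
    # Distill the amplified system back into the model via a supervised
    # update; model.update is likewise an assumed interface.
    for q in hard_questions:
        model.update(q, target=amplify(q, model, decompose, combine))
    return model

Repeating this step is what lets the training signal for difficult problems be built up progressively from easier ones.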
Wed Jul 10 2019
Artificial Intelligence
The Role of Cooperation in Responsible AI Development
Competitive pressures could incentivize AI companies to underinvest in ensuring their systems are safe, secure, and have a positive social impact. Ensuring that AI systems are developed responsibly may require preventing and solving collective action problems between companies.
Tue Oct 06 2020
Artificial Intelligence
Chess as a Testing Grounds for the Oracle Approach to AI Safety
Super-intelligent AIs that can only send and receive messages could be used to provide chess advice. The player would be uncertain which type of oracle it was interacting with. The oracles would be vastly more intelligent than the player.
Thu Jun 20 2019
Artificial Intelligence
Modeling AGI Safety Frameworks with Causal Influence Diagrams
Proposals for safe AGI systems are typically made at the level of frameworks. We model and compare the most promising AGI safety frameworks using causal influence diagrams. The diagrams show the optimization objective and causal assumptions of the framework.
Thu Sep 12 2019
Artificial Intelligence
Finding Generalizable Evidence by Learning to Convince Q&A Models
We train evidence agents to select the passage sentences that most convince a pretrained QA model of a given answer. Rather than finding evidence that convinces one model alone, we find that agents select evidence that generalizes. This approach improves QA in a robust manner.
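A sketch of such an evidence agent (illustrative, not the paper's code): greedily pick the k sentences that most raise a pretrained QA model's probability of the target answer.

def select_evidence(sentences, question, answer, qa_model, k=3):
    # qa_model(question, passage) is assumed to return a mapping from
    # candidate answers to probabilities; the greedy policy is ours.
    chosen, remaining = [], list(sentences)
    for _ in range(k):
        best = max(remaining,
                   key=lambda s: qa_model(question, " ".join(chosen + [s]))[answer])
        chosen.append(best)
        remaining.remove(best)
    return chosen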
Wed Aug 07 2019
Machine Learning
Advocacy Learning: Learning through Competition and Class-Conditional Representations
Advocacy learning relies on a framework consisting of two connected networks. Each Advocate produces a class-conditional representation with the goal of convincing the Judge that the input example belongs to its class. We show that advocacy learning can lead to small improvements in classification accuracy over an identical supervised baseline.
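A sketch of the two-part setup (names and shapes are illustrative, not the paper's architecture): one Advocate per class produces a class-conditional evidence map over the input, and a Judge classifies from the stacked maps.

import torch
import torch.nn as nn

class AdvocacyModel(nn.Module):
    def __init__(self, n_classes, make_advocate, judge):
        super().__init__()
        # Each advocate is trained to convince the judge of its own class;
        # the judge is trained against the true labels.
        self.advocates = nn.ModuleList([make_advocate() for _ in range(n_classes)])
        self.judge = judge  # maps stacked evidence maps to class logits

    def forward(self, x):
        # Each advocate emits an attention-like map over x arguing its class.
        evidence = torch.stack([adv(x) * x for adv in self.advocates], dim=1)
        return self.judge(evidence)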
Wed May 29 2019
Artificial Intelligence
Asymptotically Unambitious Artificial General Intelligence
Narrow intelligence, the ability to solve a given difficult problem, has seen impressive development. Artificial General Intelligence (AGI) presents dangers that narrow intelligence does not. We present the first algorithm we are aware of for asymptotically unambitious AGI.