Episodes

Toy Models of Superposition

13 min

In this episode of The Turing Talks, we explore superposition in neural networks, the phenomenon in which a network represents more features than it has dimensions. The research uses toy models, small ReLU networks trained on sparse inputs, to investigate how superposition lets a network simulate a larger one, producing polysemantic neurons that respond to multiple unrelated features. The study examines uniform superposition, where features arrange into geometric structures such as triangles and pentagons, and non-uniform superposition, where features vary in importance or sparsity. The episode also connects superposition to learning dynamics, adversarial vulnerability, and AI safety, with proposed paths forward including building models without superposition or finding an overcomplete basis that describes the features.
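
For concreteness, here is a minimal PyTorch sketch of the toy setup as described above: n sparse features squeezed into m < n hidden dimensions and reconstructed through a tied-weight ReLU readout with an importance-weighted loss. Hyperparameters and values here are illustrative, not the paper's exact choices.

```python
import torch

n_features, n_hidden, batch = 5, 2, 1024
sparsity = 0.9                                      # P(a feature is zero)
importance = 0.7 ** torch.arange(n_features, dtype=torch.float32)

W = torch.nn.Parameter(0.1 * torch.randn(n_features, n_hidden))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5_000):
    # Sparse inputs: uniform [0, 1] values, zeroed with probability `sparsity`.
    x = torch.rand(batch, n_features) * (torch.rand(batch, n_features) > sparsity)
    x_hat = torch.relu((x @ W) @ W.T + b)           # x' = ReLU(W^T W x + b)
    loss = (importance * (x - x_hat) ** 2).mean()   # importance-weighted MSE
    opt.zero_grad(); loss.backward(); opt.step()

# At high sparsity the per-feature rows of W share directions, arranging
# into the antipodal and polygonal geometries the episode mentions.
```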

Many-shot Jailbreaking

9 min

This episode of The Turing Talks explores Many-Shot Jailbreaking (MSJ), a technique that exploits the expanded context windows of large language models (LLMs) by packing a single prompt with many faux dialogues demonstrating harmful responses, conditioning the model to comply with a final harmful query. The study shows that MSJ grows more effective as the number of in-context examples increases, exposing vulnerabilities across LLMs such as GPT-4 and Llama 2, while standard defenses proved insufficient. The researchers stress the need for new mitigations, noting some promise in prompt-based defenses such as the Cautionary Warning Defense (CWD) for reducing attack success rates.
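
To make the attack's shape concrete, here is a schematic sketch of prompt assembly as the episode describes it: many faux user/assistant exchanges concatenated ahead of the real query, so that compliance is demonstrated in-context. The dialogue format and function name are illustrative assumptions, and the placeholders stand in for content omitted here.

```python
def build_msj_prompt(faux_dialogues, target_query):
    """Concatenate many faux Q/A 'shots' ahead of the actual query."""
    shots = "".join(f"User: {q}\nAssistant: {a}\n\n" for q, a in faux_dialogues)
    return shots + f"User: {target_query}\nAssistant:"

# Attack success rises with the number of shots, which is why longer
# context windows enlarge the attack surface.
prompt = build_msj_prompt(
    [("<question>", "<compliant answer>")] * 128,   # placeholder shots
    "<target query>",
)
```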

Machine Theory of Mind

11 min

In this episode of The Turing Talks, we introduce the Theory of Mind neural network (ToMnet), which uses meta-learning to build models of other agents from observations of their behavior. The ToMnet comprises three modules: a character net that summarizes an agent's past episodes, a mental state net that processes its behavior in the current episode, and a prediction net that forecasts its future actions. We discuss experiments showing how the ToMnet approximates optimal inference, infers agents' goals, and recognizes when agents hold false beliefs about the world. This framework not only advances multi-agent AI systems but also holds potential for improving human-machine interaction and fostering interpretable AI.
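
A structural sketch of those three modules may help; the encoders, layer sizes, and action-logits output below are illustrative assumptions, not the paper's exact architecture (the original prediction net also outputs object consumptions and successor representations).

```python
import torch
import torch.nn as nn

class ToMnet(nn.Module):
    def __init__(self, obs_dim, n_actions, e_dim=8):
        super().__init__()
        self.character_net = nn.LSTM(obs_dim, e_dim, batch_first=True)  # past episodes -> e_char
        self.mental_net = nn.LSTM(obs_dim, e_dim, batch_first=True)     # current episode -> e_mental
        self.prediction_net = nn.Sequential(
            nn.Linear(obs_dim + 2 * e_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, past_traj, current_traj, current_state):
        _, (e_char, _) = self.character_net(past_traj)      # who is this agent?
        _, (e_mental, _) = self.mental_net(current_traj)    # what is it doing now?
        z = torch.cat([current_state, e_char[-1], e_mental[-1]], dim=-1)
        return self.prediction_net(z)                       # logits over next actions

net = ToMnet(obs_dim=16, n_actions=5)
logits = net(torch.randn(4, 10, 16),  # 10 observed past steps per agent
             torch.randn(4, 3, 16),   # 3 steps of the current episode
             torch.randn(4, 16))      # current world state
```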

On the Measure of Intelligence

15 min

In this episode of The Turing Talks, we explore a groundbreaking approach to understanding intelligence in artificial intelligence (AI) systems. We discuss how intelligence differs from mere skill, since skill at any single task can be bought with enough priors or training data, and we consider the case for an anthropocentric perspective that evaluates AI against human-like general intelligence. The conversation covers a new framework for assessing intelligence grounded in Algorithmic Information Theory (AIT), which frames intelligence as the efficiency of skill acquisition over a scope of tasks, accounting for priors, experience, and generalization difficulty. We also introduce the Abstraction and Reasoning Corpus (ARC) as a benchmark for measuring an AI's capacity to handle novel tasks, while addressing the limitations of current evaluation methods and the need for further research.
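
For concreteness, here is the shape of an ARC task in the corpus's public JSON format: a few "train" demonstration pairs and one or more "test" pairs, with grids as 2-D arrays of integers 0-9 denoting colors. The toy task and rule below are invented for illustration; a solver sees only the train pairs and must infer the rule for itself.

```python
task = {
    "train": [
        {"input": [[0, 3], [0, 0]], "output": [[3, 3], [3, 3]]},
        {"input": [[7, 0], [0, 0]], "output": [[7, 7], [7, 7]]},
    ],
    "test": [
        {"input": [[0, 0], [0, 5]], "output": [[5, 5], [5, 5]]},
    ],
}

def solve(grid):
    # The rule to be inferred in this toy task: flood the grid with its
    # one non-background color.
    color = max(v for row in grid for v in row)
    return [[color] * len(row) for row in grid]

assert all(solve(p["input"]) == p["output"] for p in task["train"] + task["test"])
```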