Votes  Date  Candidate
0 2025-04 Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions from Anthropic (Saffron Huang et al.), 44 pages
3 2025-03 Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation from OpenAI (Bowen Baker et al.), 39 pages
1 2025-03 Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations from Apollo, 10 pages
2 2025-02 Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs from Truthful AI (Jan Betley et al.), 32 pages
0 2025-02 Demonstrating specification gaming in reasoning models from Palisade (Alexander Bondarenko et al.), 15 pages
1 2025-02 Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs from the Center for AI Safety (Mantas Mazeika et al.), 38 pages
3 2025-01 Tell me about yourself: LLMs are aware of their learned behaviors from Truthful AI (Jan Betley et al.), 64 pages
0 2024-12 Deliberative Alignment: Reasoning Enables Safer Language Models from OpenAI (Melody Y. Guan et al.), 24 pages
3 2024-12 Towards Safe and Honest AI Agents with Neural Self-Other Overlap (with accompanying LW post) from AE Studio (Marc Carauleanu et al.), 15 pages
2 2024-12 Alignment faking in large language models from Anthropic (Ryan Greenblatt et al.), 137 pages
0 2024-12 Monet: Mixture of Monosemantic Experts for Transformers by Park et al., 32 pages
0 2024-12 Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration by Chun Hei Yip et al., 33 pages
2 2024-12 Frontier AI systems have surpassed the self-replicating red line from Fudan University (Xudong Pan et al.), 47 pages
0 2024-12 Frontier Models are Capable of In-context Scheming from Apollo (Meinke, Schoen, & Scheurer et al.), 70 pages
0 2024-10 Sabotage Evaluations for Frontier Models from Anthropic (Joe Benton et al.), 69 pages
0 2024-10 Towards a unified and verified understanding of group-operation networks by Wilson Wu et al., 32 pages
0 2024-10 Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient from Timaeus (George Wang et al.), 41 pages
4 2024-08 Showing SAE Latents Are Not Atomic Using Meta-SAEs from MATS (Bart Bussmann et al.), 22 pages
0 2024-08 Verification methods for international AI agreements by Akash R. Wasil et al., 17 pages
1 2024-08 Tamper-Resistant Safeguards for Open-Weight LLMs by Rishub Tamirisa et al., 26 pages
5 2024-07 Rule Based Rewards for Language Model Safety from OpenAI (Tong Mu et al.), 23 pages
1 2024-07 Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs by Rudolf Laine et al., 109 pages
0 2024-06 Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation by Danny Halawi et al., 22 pages
1 2024-06 Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space from Harvard (Core Francisco Park et al.), 32 pages
1 2024-06 WARP: On the Benefits of Weight Averaged Rewarded Policies from DeepMind (Alexandre Ramé et al.), 34 pages
1 2024-06 Compact Proofs of Model Performance via Mechanistic Interpretability by Jason Gross et al., 61 pages
2 2024-06 Refusal in Language Models Is Mediated by a Single Direction by Andy Arditi et al., 39 pages
5 2024-06 Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models from Anthropic (Carson Denison et al.), 29 pages
0 2024-06 Safety Alignment Should Be Made More Than Just a Few Tokens Deep from Princeton (Xiangyu Qi et al.), 25 pages
5 2024-06 Improving Alignment and Robustness with Circuit Breakers from Gray Swan AI (Andy Zou et al.), 22 pages
2 2024-06 To Believe or Not to Believe Your LLM from DeepMind (Yasin Abbasi Yadkori et al.), 33 pages
1 2024-06 Scaling and evaluating sparse autoencoders from OpenAI (Leo Gao et al.), 34 pages
2 2024-05 Debating with More Persuasive LLMs Leads to More Truthful Answers by Akbir Khan et al., 72 pages
4 2024-05 Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet from Anthropic (Adly Templeton et al.), 65 pages
4 2024-04 Let's Think Dot by Dot: Hidden Computation in Transformer Language Models by Jacob Pfau, William Merrill & Samuel R. Bowman, 17 pages
5 2024-03 What are human values, and how do we align AI to them? from the Meaning Alignment Institute (Oliver Klingefjord et al.), 38 pages
1 2024-03 Safety Cases: How to Justify the Safety of Advanced AI Systems by Joshua Clymer et al., 73 pages
2 2024-03 AtP*: An efficient and scalable method for localizing LLM behaviour to components from DeepMind (János Kramár et al.), 46 pages
2 2024-02 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits from Microsoft (Shuming Ma & Hongyu Wang et al.), 8 pages
1 2024-02 Coercing LLMs to do and reveal (almost) anything by Jonas Geiping et al., 32 pages
1 2024-02 Attention SAEs Scale to GPT-2 Small from MATS (Connor Kissane et al.), 14 pages
2 2024-01 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training from Anthropic (Evan Hubinger et al.), 71 pages
1 2024-01 Secure, Governable Chips from CNAS (Onni Aarne, Tim Fist, and Caleb Withers), 45 pages
0 2024-01 LLaMA Pro: Progressive LLaMA with Block Expansion by Chengyue Wu et al., 21 pages
2 2023-12 Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision from OpenAI, 49 pages
2 2023-12 Practices for Governing Agentic AI Systems from OpenAI, 23 pages
2 2023-12 Challenges with unsupervised LLM knowledge discovery from DeepMind (Sebastian Farquhar et al.), 38 pages
0 2023-12 AI Control: Improving Safety Despite Intentional Subversion from Redwood (Ryan Greenblatt et al.), 46 pages
0 2023-12 Mamba: Linear-Time Sequence Modeling with Selective State Spaces by Albert Gu and Tri Dao, 36 pages
1 2023-11 Generative Models: What do they know? Do they know things? Let's find out! from TTIC/Adobe (Xiaodan Du et al.), 28 pages
0 2023-11 Scheming AIs: Will AIs fake alignment during training in order to get power? by Joe Carlsmith, 127 pages
0 2023-10 LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B (Simon Lermen et al.) and BadLlama (Pranav Gade et al.) from Palisade, 10 pages each
3 2023-10 Representation Engineering: A Top-Down Approach to AI Transparency from the Center for AI Safety (Andy Zou et al.), 55 pages
0 2023-09 Explaining grokking through circuit efficiency from DeepMind (Vikrant Varma et al.), 28 pages
2 2023-08 Against Almost Every Theory of Impact of Interpretability by Charbel-Raphaël Ségerie, 40 pages
3 2023-08 Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research by Evan Hubinger et al., 15 pages
1 2023-07 Universal and Transferable Adversarial Attacks on Aligned Language Models by Andy Zou et al., 31 pages
0 2023-06 Textbooks Are All You Need from Microsoft (Suriya Gunasekar et al.), 26 pages (and followup, 16 pages)
1 2023-06 FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness by Tri Dao et al., 34 pages
1 2023-05 Let's Verify Step by Step from OpenAI (Hunter Lightman et al.), 29 pages
1 2023-05 Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafael Rafailov et al., 27 pages
1 2023-05 Steering GPT-2-XL by adding an activation vector by Alex Turner, Monte MacDiarmid, David Udell, Lisa Thiergart, and Ulisse Mini, 60 pages
1 2023-05 Model evaluation for extreme risks from DeepMind (Toby Shevlane et al.), 20 pages
0 2023-05 Sparks of Artificial General Intelligence: Early experiments with GPT-4 from Microsoft (Sébastien Bubeck et al.), 155 pages
0 2023-05 Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting from NYU (Miles Turpin et al.), 32 pages
0 2023-03 Omnigrok: Grokking Beyond Algorithmic Data by Ziming Liu et al., 16 pages
3 2023-03 What does it take to catch a Chinchilla? by Yonadav Shavit, 22 pages
0 2023-02 SolidGoldMagikarp (plus, prompt generation) by Jessica Rumbelow and mwatkins, 17 pages
0 2023-01 Neural networks generalize because of this one weird trick by Jesse Hoogland, 18 pages
2 2022-12 Constitutional AI: Harmlessness from AI Feedback from Anthropic (Yuntao Bai et al.), 34 pages
0 2022-10 Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task by Kenneth Li et al., 17 pages
0 2022-10 Scaling Laws for Reward Model Overoptimization from OpenAI, 28 pages
3 2022-09 Toy Models of Superposition from Anthropic (Nelson Elhage et al.), 70 pages
4 2022-08 A Mechanistic Interpretability Analysis of Grokking by Neel Nanda and Tom Lieberum, 48 pages
2 2022-08 The alignment problem from a deep learning perspective by Richard Ngo, Lawrence Chan, and Sören Mindermann, 21 pages
0 2022-07 A Hazard Analysis Framework for Code Synthesis Large Language Models from OpenAI, 20 pages
2 2022-06 A Path Towards Autonomous Machine Intelligence by Yann LeCun, 62 pages
0 2022-06 Softmax Linear Units from Anthropic (Nelson Elhage et al.), 33 pages
0 2022-05 Scaling Laws and Interpretability of Learning from Repeated Data from Anthropic (Danny Hernandez et al.), 23 pages
0 2022-04 Training language models to follow instructions with human feedback from OpenAI, 68 pages
5 2022-03 In-context Learning and Induction Heads from Anthropic (Catherine Olsson et al.), 61 pages
0 2022-03 Training Compute-Optimal Large Language Models from DeepMind (Jordan Hoffmann et al.), 36 pages (and 1a3orn's commentary, 8 pages)
1 2022-02 Alignment research exercises by Richard Ngo, 9 pages
1 2022-01 Intro to Brain-Like-AGI Safety series by Steven Byrnes, 350 pages across 15 posts
3 2021-12 A Mathematical Framework for Transformer Circuits from Anthropic (Nelson Elhage et al.), 49 pages
8 2021-12 Eliciting Latent Knowledge from the Alignment Research Center (Paul Christiano, Ajeya Cotra, and Mark Xu), 106 pages
0 2021-12 A General Language Assistant as a Laboratory for Alignment from Anthropic (Amanda Askell et al.), 48 pages
0 2021-05 Symmetry, Equilibria, and Robustness in Common-Payoff Games by Scott Emmons et al., 17 pages
0 2021-03 The General Theory of General Intelligence: A Pragmatic Patternist Perspective by Ben Goertzel, 73 pages
0 2021-01 Counterfactual Planning in AGI Systems by Koen Holtman, 39 pages
0 2020-12 Extracting and Using Preference Information from the State of the World, Rohin Shah's PhD thesis, 120 pages
1 2020-12 An overview of 11 proposals for building safe advanced AI by Evan Hubinger, 40 pages
0 2020-02 Reward-rational (implicit) choice: A unifying formalism for reward learning by Hong Jun Jeon, Smitha Milli, Anca D. Dragan, 21 pages
0 2019-12 Deep Double Descent: Where Bigger Models and More Data Hurt by Preetum Nakkiran et al., 24 pages
0 2019-04 Data Shapley: Equitable Valuation of Data for Machine Learning by Amirata Ghorbani and James Zou, 23 pages