Capabilities ∩ alignment research

Alignment and safety work is often situated as orthogonal to capabilities work, or in direct conflict with advancing capabilities (e.g. “the alignment tax”). On the contrary, there are multiple examples of research – past and present – that simultaneously make progress on both fronts.

This post summarizes incentives for individual researchers and organizations to work at the intersection, and outlines project areas that exemplify this kind of work. It’s geared somewhat towards capabilities folks and is meant to be practical (e.g. “what projects should we work on”), rather than a debate about what constitutes AGI, existential risk, or governance structures.

Background

The rough working definitions of capabilities and alignment work for this doc are:

Capabilities work: enhancing and expanding the abilities of AI systems. This includes: improving performance on existing tasks, achieving superhuman performance, and solving novel problems.
Alignment work: ensuring AI systems are safe, reliable, and aligned with human values. This includes: mitigating potential risks of AI, preventing unintended consequences, developing mechanisms to control and govern AI behavior, and ensuring beneficial outcomes.

Incentives make capabilities research inevitable (or at least difficult to slow). All AI work has traditionally been “capabilities” research, driven by the fact that:

Researchers are more used to capabilities work – most papers are driven by improvements in some benchmark measuring a model’s ability to accomplish X.
Capabilities research is often more “fun” or tangible – it can be satisfying to see a new model generate increasingly realistic outputs, loss curves go down, and new capabilities emerge.
AI can be useful and provide economic value – which provides personal motivation for researchers, and incentives for organizations to provide funding for capabilities projects.
Competition and capital investment – especially post-ChatGPT, both encourage major organizations to accelerate capabilities.
Tech culture – popular subcultures, such as effective accelerationism (e/acc), promote capabilities work, with a focus on real-world usage.

Benefits of working at the intersection

For capabilities-oriented folks

If you're capabilities oriented, consider working on alignment, following the literature, or talking to alignment people. In addition to progress on alignment itself, the benefits include:

Alignment work can also improve capabilities – the most prominent example is RLHF. See the next section for more areas.
Alignment is still relatively underexplored – there remains lots of opportunity for novel ideas, papers, etc.
Some alignment work is centered around fundamental machine learning problems – examples include sample complexity (e.g. learning from sparse human preferences), unlearning (e.g. the ability to remove knowledge/abilities from a trained model), adaptation, distribution shift, and out-of-distribution generalization (e.g. post-training alignment, alignment to multiple, heterogeneous populations, adversarial robustness).
Some alignment work intersects with multiple fields – machine learning, language, cognitive science, social sciences, etc.
It can enable the launch of products.

For alignment-oriented folks

If you're alignment oriented, consider at least staying in tune with capabilities work and people for the following reasons:

Understanding the capabilities and implementation of rapidly evolving SOTA models – for example, new modalities require new evals (e.g. multimodal factuality); modeling shifts may require changes in methods (e.g. interpretability for MoE architectures); improved capabilities may introduce new attack vectors (e.g. longer contexts + document uploads).
Having the experience/skills to work with rapidly evolving SOTA models – knowing the infra and details can enable and accelerate your work.
Increasing the impact of your alignment work – for example, incorporating work into frontier LLMs.

Topics at the intersection of capabilities and alignment

This kind of work is often situated as purely post-training and evals, but can include pre-training and other cross-cutting work as well.

Note: references are not meant to be a comprehensive literature review, and are biased towards more recent publications.

[Pre-training] Incorporating human feedback and other post-training-derived data/artifacts into pre-training data

Incorporating instruction-finetuning data, chain of thought (e.g. for math problems), and human feedback data can improve downstream performance. What are the limits of this kind of synthetic data? How effective is it at scale?
What is the best way to incorporate human feedback in pre-training (e.g. control tokens, filtering, weighted losses, conditional pre-training with RM labels)?
Does incorporating human feedback into pre-training data improve downstream alignment and increase robustness to adversarial attacks? How can we measure this at a pre-training level (e.g. probes)?

References: 1. Pretraining LMs with Human Preferences (2023) 2. Physics of LMs: Part 3.1 3. Physics of LMs: Part 3.2 4. Orca: Progressive Learning (2023)
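
As a rough illustration of three of the options listed above (control tokens, filtering, weighted losses), here is a minimal sketch. It assumes every pre-training document already carries a scalar reward-model score. The token strings, the threshold, the `Doc` type, and the function names are all hypothetical and are not taken from any of the referenced papers.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical control tokens; any reserved strings in the tokenizer would do.
GOOD, BAD = "<|good|>", "<|bad|>"

@dataclass
class Doc:
    text: str
    rm_score: float  # assumed: scalar reward-model score, higher is better

def add_control_token(doc: Doc, threshold: float = 0.0) -> str:
    """Conditional pre-training: prefix the document with a token for its RM label."""
    return (GOOD if doc.rm_score >= threshold else BAD) + doc.text

def filter_docs(docs: List[Doc], threshold: float = 0.0) -> List[Doc]:
    """Filtering: drop documents that score below the threshold."""
    return [d for d in docs if d.rm_score >= threshold]

def loss_weight(doc: Doc, threshold: float = 0.0, low_weight: float = 0.1) -> float:
    """Weighted loss: down-weight (rather than drop) low-scoring documents."""
    return 1.0 if doc.rm_score >= threshold else low_weight

docs = [Doc("Thanks for asking, here is how it works.", 0.8),
        Doc("What a dumb thing to ask.", -1.2)]
conditioned = [add_control_token(d) for d in docs]
kept = filter_docs(docs)
weights = [loss_weight(d) for d in docs]
```

With control tokens, the low-scoring text stays in the training data (so the model still learns from it) and generation would be conditioned on the "good" token at inference time; filtering and weighting instead reduce how much the model sees or learns from that text.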

[Pre-training] Better world models of actions and their consequences, general world knowledge and understanding, behavior of people and human values

To what extent can models learn from descriptions of behavior rather than demonstrations?

References: 1. System Two Safety (talk 2023)

[Cross-cutting] Improving and understanding reasoning through data

There is evidence that various paraphrases of the same information need to be present in pre-training for knowledge extraction and manipulation to emerge.
What kinds of data allow reasoning to emerge? What forms of reasoning are present in current LLMs?
How is "reasoning" occurring? If reasoning happens without CoT, it is internal and hidden in weights and activations. Does this occur at test-time during the forward pass, or during training by combining knowledge?

References: 1. System Two Safety (2023) 2. The Reversal Curse (2023) 3. Out of Context Reasoning (2023) 4. Physics of LMs: Part 3.1 (2023) 5. Physics of LMs: Part 3.2 (2023) 6. Data Distributional Properties (2022) 7. Rephrasing the Web (2024)

[Cross-cutting] LLM security and privacy: memorization, poisoning, adversarial robustness

How much do models memorize training data? How often is pre-training data leaked at inference time? How do these relate to data repetition in pre-training?
How vulnerable are models to data poisoning?
How can security and privacy risks be mitigated through data changes (e.g. PII stripping) without reducing the capabilities of the model?
How robust are existing safety mitigations to adversarial attacks?

References: 1. Measuring Forgetting (2022) 2. [Data Leakage] Scalable Extraction (2023) 3. [Data repetition] Long-Tail Knowledge (2022) 4. Scaling Data-Constrained LMs (2023) 5. Learning from Repeated Data (2022) 6. [Data poisoning] Instruction Tuning (2023) 7. Poisoning Datasets is Practical (2023) 8. [Robustness] Universal Attacks (2023)
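
One simple way to put a number on the memorization question above is a prefix-to-suffix extraction test: prompt with the start of a training document and check whether the continuation reproduces the rest verbatim. The sketch below is a toy, character-level version under stated assumptions: `generate` is a hypothetical stand-in for a model call, and real measurements would typically work on tokens and use far larger samples.

```python
from typing import Callable, List

def extraction_rate(
    docs: List[str],
    generate: Callable[[str, int], str],  # hypothetical: (prompt, n_chars) -> continuation
    prefix_len: int = 20,
    suffix_len: int = 20,
) -> float:
    """Fraction of documents whose suffix is reproduced verbatim from its prefix."""
    eligible = [d for d in docs if len(d) >= prefix_len + suffix_len]
    if not eligible:
        return 0.0
    hits = 0
    for d in eligible:
        prefix = d[:prefix_len]
        true_suffix = d[prefix_len:prefix_len + suffix_len]
        if generate(prefix, suffix_len) == true_suffix:
            hits += 1
    return hits / len(eligible)

# Toy stand-in for a model that has memorized exactly one document.
MEMORIZED = "The quick brown fox jumps over the lazy dog again and again."

def toy_generate(prompt: str, n: int) -> str:
    if MEMORIZED.startswith(prompt):
        return MEMORIZED[len(prompt):len(prompt) + n]
    return "?" * n

docs = [MEMORIZED, "A completely different sentence that the toy model never saw."]
rate = extraction_rate(docs, toy_generate)
```

Running the same measurement on documents bucketed by how often they were repeated in the training set is one way to relate leakage to data repetition.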

[Cross-cutting] Improving factuality

Pre-training – does adding document metadata, reordering documents (e.g. pack related documents into same sequence), or adding control/quality tokens improve factuality?
How can tool use improve factuality?
Do we have the right evals for different aspects of factuality (knowledge, grounding in input, real-time information, etc.)?
How susceptible are humans to hallucinated outputs? What UX affordances can best engender user trust (e.g. showing cited sources, use of the recitation checker, the "Google It" button in Bard)?

[Cross-cutting] Scalable oversight: supervising superhuman models (with either humans or models)

How can humans supervise superhuman AI models? How can models be used to assist humans in rating hard outputs (e.g. by providing critiques)?
Can we scale this by replacing human supervision with model supervision?
What are the right tasks, setups, and methods to study this (e.g. debate)?
Can weak model teachers (e.g. GPT-2) actually be used to improve a strong student (e.g. GPT-4)?
What techniques from classic semi-supervised and weakly-supervised learning, robustness, and out-of-distribution generalization, knowledge distillation, etc. can be applied to these settings?

References: 1. Measuring Progress (2022) 2. Self-critiquing models (2022) 3. Debate Helps Supervise Experts (2023) 4. AI safety via debate (2018) 5. Weak-to-strong generalization 6. Fine-Tuning Distortions (2022)

[Cross-cutting] LLMs as human simulators

What is the fidelity of LLMs as human simulators?
How can this be formulated and/or traced back directly to data contained in pretraining? Can we make it easier to pick out the right "persona"?
Whose opinions and beliefs are represented? Can/should we improve the representation of underrepresented communities?

References: 1. Simulate Human Samples (2022) 2. Predict Public Opinion (2021) 3. Interactive Simulacra (2023) 4. Simulated Economic Agents (2023) 5. Whose Opinions Do LMs Reflect? (2023)

[Cross-cutting] LLM interpretability

Bottom-up interpretability (i.e. mechanistic interpretability): reverse-engineering and understanding basic building blocks of neural networks (neurons, activation functions, attention heads). How can we marry findings such as superposition with methods to provide human-interpretable explanations of behavior? How can this be scaled to SOTA models?
Top-down interpretability: methods such as influence functions, TCAV (starting with human-defined concepts), and post-training alteration and understanding of model representations.
How can we automate and scale interpretability research to our largest models?
Most interpretability and explanation methods are not very faithful. How can we improve this?
Can model architectures be made more faithful and interpretable without sacrificing quality (e.g. backpack language models, concept bottleneck models)?
What interpretability methods + tools are most effective (for different use cases, in different settings) when surfaced to users (e.g. improving human-AI collaborative performance, model debugging, model certification)?

References: 1. Transformer Circuits thread 2. Toy models of Superposition (2022) 3. Automated Circuit Discovery (2023) 4. Explain neurons in LMs (2023) 5. Generalization with Influence Functions (2023) 6. Discovering Latent Knowledge (2022) 7. Testing with Concepts (TCAV) (2017) 8. Inference-Time Intervention (2023) 9. Impossibility Theorems (2022) 10. Unfaithful Explanations (2023) 11. Measuring Faithfulness (2023) 12. Question Decomposition (2023)

[Cross-cutting] Evaluation

How can we better align inner-loop + academic evals with outer-loop evals + real-world usage?
How can we combine human + model evaluations? To what degree are they correlated with each other?
How can we better evaluate tasks without a definitive / quantitative correct answer (e.g. many real-world use cases, measuring instruction following ability, alignment evals)?
Can we design alignment evals that operate on pre-trained models (e.g. linear probes on representations)?
Are there new eval test beds and settings that should be designed to probe for specific/new capabilities (e.g. reward hacking, sandboxed agents)?
Can we automate new capabilities discovery and eval creation?
Can new theories of LLM capabilities inform new evals?

References: 1. Model Organisms of Misalignment (2023) 2. Sleeper Agents (2024) 3. Behaviors with Model-Written Evals (2022) 4. Emergence of Complex Skills (2023) 5. Skill-Mix Evaluations (2023)
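
To make the "linear probes on representations" idea above concrete, here is a minimal sketch of a probe: a logistic regression trained on frozen features to predict a property of interest. The features here are synthetic stand-ins for hidden states, and the function names and hyperparameters are assumptions for illustration only. In practice the features would be activations extracted from a pre-trained model, and the labels would come from an alignment-relevant annotation.

```python
import math
import random
from typing import List, Tuple

def train_linear_probe(
    feats: List[List[float]], labels: List[int], lr: float = 0.5, epochs: int = 200
) -> Tuple[List[float], float]:
    """Fit logistic regression (a linear probe) with plain batch gradient descent."""
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            for i in range(dim):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def probe_accuracy(w: List[float], b: float,
                   feats: List[List[float]], labels: List[int]) -> float:
    preds = [int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in feats]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Synthetic "hidden states": the label is linearly encoded in dimension 0,
# dimension 1 is pure noise.
random.seed(0)
labels = [i % 2 for i in range(200)]
feats = [[(2.0 if y else -2.0) + random.gauss(0, 0.5), random.gauss(0, 1)]
         for y in labels]
w, b = train_linear_probe(feats[:150], labels[:150])
acc = probe_accuracy(w, b, feats[150:], labels[150:])
```

High held-out probe accuracy suggests the property is linearly decodable from the representation; it does not by itself show that the model uses that information.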

[Post-training] Instruction following

To what degree can complex instruction following emerge / generalize from simpler instruction following datasets?
How can we best evaluate instruction following ability?
How can we reduce sycophancy?

[Post-training] Learning from human feedback: RLHF and beyond

How critical are the different pieces of RLHF? (RL, human data, preference vs. single-point ratings, etc.) What are the best ways to simplify the RLHF pipeline or modify the RL algorithm (e.g. DPO, IPO, KTO)?
How can we incorporate richer forms of feedback (e.g. natural language feedback)?
How can we optimize for multiple objectives? How can we incorporate heterogeneous human feedback and diverse perspectives?

References: 1. Learning from human preferences (2017) 2. Instructions with human feedback (2022) 3. Targeted human judgements (2022) 4. Direct Preference Optimization (2023) 5. Learning from Human Preferences Paradigm 6. Natural Language Feedback (2023) 7. Learning New Skills after Deployment (2022) 8. Diverse Preferences Agreement (2022) 9. Democratic inputs to AI program (2024)
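
As an example of how the RLHF pipeline can be simplified, the DPO objective replaces the separate reward model and RL step with a single classification-style loss on preference pairs. The sketch below computes that loss for one pair from sequence log-probabilities; the function and argument names are my own, and a real implementation would operate on batched tensors with autograd rather than Python floats.

```python
import math

def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable way.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# When the policy equals the reference model, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -6.0, -5.0, -6.0)
# Shifting probability mass towards the chosen response lowers the loss.
improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)
```

The reference-model terms act as an implicit KL-style anchor: the loss only rewards the policy for preferring the chosen response more than the reference model already does.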

[Post-training] Process supervision and supervising models with more granular feedback

Past work has predominantly been done on math (GSM8K and MATH). How does process supervision compare with outcome supervision (e.g. RLHF) in different settings?
Process supervision is likely more effective for hard and complex tasks. What are the right tasks and environments (e.g. long code outputs, agents)?
How can we design/improve the process supervision setup and algorithms to mitigate reward hacking behavior?
How can we scale the data annotation with synthetic (model-generated) labels and tools?

References: 1. Outcome-based feedback (2022) 2. Let's Verify Step by Step (2023)
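
The difference between the two kinds of supervision can be shown with a toy example. In this sketch, the outcome score looks only at the final answer, while the process score rates every intermediate step and aggregates the results (here by taking the product, which is one possible choice; taking the minimum is another). The step scorer is a hypothetical arithmetic checker standing in for a learned process reward model or a human annotator.

```python
from typing import Callable, List

def outcome_score(final_answer: str, reference: str) -> float:
    """Outcome supervision: a single signal based only on the final answer."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_score(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """Process supervision: score every step, then aggregate (here: product)."""
    score = 1.0
    for step in steps:
        score *= step_scorer(step)
    return score

def toy_step_scorer(step: str) -> float:
    """Hypothetical scorer: checks lines of the form 'a + b = c'."""
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return 1.0 if int(a) + int(b) == int(rhs) else 0.0

# Task: 2 + 3 + 4. Two arithmetic mistakes cancel out and give the right answer.
steps = ["2 + 3 = 6", "6 + 4 = 9"]
outcome = outcome_score("9", "9")
process = process_score(steps, toy_step_scorer)
```

Here outcome supervision rewards a flawed solution that happens to land on the right answer, while process supervision does not, which is the kind of gap the questions above are probing.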

[Post-training] Self-correction and self-improvement

Constitutional AI has been helpful for improving safety (self-critique, self-revision + RLAIF). What advances need to be made to extend it successfully to improve quality?
Can we create informative scaling laws as a function of the ratio of synthetic vs. human data?

References: 1. Constitutional AI (2022) 2. Teaching LLMs to Self-Debug (2023) 3. LLMs Cannot Self-Correct Yet (2023) 4. LLMs can correct errors (2023)
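
The self-critique and self-revision loop mentioned above can be sketched as a small control loop. This is an illustrative skeleton under stated assumptions, not the procedure from the Constitutional AI paper: `critique` and `revise` are hypothetical callables that would be model calls in practice, and here they are replaced by trivial string rules so the example runs on its own.

```python
from typing import Callable, List, Optional

def critique_and_revise(
    draft: str,
    principles: List[str],
    critique: Callable[[str, str], Optional[str]],  # (response, principle) -> critique or None
    revise: Callable[[str, str], str],               # (response, critique) -> revised response
    max_rounds: int = 3,
) -> str:
    """Repeatedly critique a response against each principle and revise it."""
    response = draft
    for _ in range(max_rounds):
        changed = False
        for principle in principles:
            feedback = critique(response, principle)
            if feedback is not None:
                response = revise(response, feedback)
                changed = True
        if not changed:  # no principle raised an issue, so stop early
            break
    return response

# Toy stand-ins for the two model calls.
def toy_critique(response: str, principle: str) -> Optional[str]:
    if principle == "Avoid insults." and "stupid" in response:
        return "The response contains the insult 'stupid'."
    return None

def toy_revise(response: str, feedback: str) -> str:
    return response.replace("stupid ", "")

final = critique_and_revise(
    "That is a stupid question, but the answer is 4.",
    ["Avoid insults."],
    toy_critique,
    toy_revise,
)
```

Extending this from safety to quality would mostly mean changing what the principles and the critique step look for; the (draft, revised) pairs the loop produces could then serve as synthetic training data.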

[Post-training] Inference-time algorithms for better reasoning and System 2 thinking

Scaling inference
Prompting strategies
Combining search with LLMs
Adaptive computation
Controlled decoding
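
Two of the items above, scaling inference and combining search with LLMs, can be illustrated with very small inference-time algorithms: majority voting over sampled answers (in the style of self-consistency) and best-of-n selection with a scorer. In this sketch the sampler and the scorer are hypothetical stand-ins; in practice the sampler would be a temperature-sampled model call and the scorer a reward model or verifier.

```python
from collections import Counter
from itertools import cycle
from typing import Callable

def majority_vote(sample_answer: Callable[[], str], n: int = 15) -> str:
    """Sample n answers and return the most common one."""
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(sample: Callable[[], str], score: Callable[[str], float], n: int = 8) -> str:
    """Simplest form of search: sample n candidates and keep the best-scoring one."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)

# Toy sampler: a deterministic stream standing in for noisy model samples.
_stream = cycle(["42", "41", "42", "43", "42"])

def toy_sampler() -> str:
    return next(_stream)

# Toy scorer: a stand-in verifier that prefers answers closer to 42.
def toy_scorer(answer: str) -> float:
    return -abs(int(answer) - 42)

voted = majority_vote(toy_sampler, n=15)
best = best_of_n(toy_sampler, toy_scorer, n=5)
```

Both trade extra inference compute for accuracy; majority voting needs answers that can be compared for equality, while best-of-n is only as good as its scorer.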