Alignment Research Center

Berkeley, CA

9 people · Founded 2021

A nonprofit research organization focused on theoretical AI alignment research, developing formal mechanistic explanations of neural network behavior to ensure future ML systems are aligned with human interests.

Endorsed by+1

Donate: Every.org · Direct

People: no linked people

Funding Details

Annual Budget: $9,050,000
Current Runway: -
Funding Goal: -
Funding Raised to Date: -

Org Details

The Alignment Research Center (ARC) is a nonprofit research organization whose mission is to align future machine learning systems with human interests. Founded in April 2021 by Paul Christiano, a former OpenAI researcher and one of the principal architects of Reinforcement Learning from Human Feedback (RLHF), ARC is headquartered in Berkeley, California, where it operates out of the Constellation co-working space alongside other AI safety organizations.

ARC focuses on theoretical research addressing the problem of scalable alignment: how to ensure AI systems remain intent-aligned even as they become more capable. The organization employs a builder-breaker methodology, in which a "builder" proposes an alignment strategy and a "breaker" searches for a scenario in which it would fail. The method makes worst-case assumptions rather than extrapolating from current systems, which allows alignment strategies to be evaluated quickly on theoretical grounds without requiring full implementation.

The organization's core research directions include heuristic arguments (machine-checkable reasoning about neural network behavior that does not require the certainty of formal proofs), Eliciting Latent Knowledge (ELK: training systems to report what they actually believe rather than what a human would be predicted to think), mechanistic anomaly detection, and low probability estimation for rare catastrophic failures. Their current work can be understood as an effort to combine mechanistic interpretability with formal verification, producing mathematical frameworks that explain neural network behaviors automatically.

In 2022, Beth Barnes joined ARC from OpenAI to start ARC Evals, a team focused on evaluating the capabilities and alignment of advanced AI models. ARC Evals notably evaluated GPT-4 for OpenAI, assessing whether the model was capable of power-seeking behavior. In late 2023, ARC Evals spun out as METR, an independent 501(c)(3) nonprofit.

Paul Christiano led ARC from its founding until 2024, when he was appointed Head of AI Safety at the U.S. AI Safety Institute within NIST. Leadership transitioned to Jacob Hilton, who serves as President. The board of directors includes Jacob Hilton, Buck Shlegeris, and Ben Hoskin. The team consists of approximately nine full-time staff (six researchers and three operations personnel), supplemented by external research collaborators and visiting researchers.

In 2025, ARC reported making conceptual and theoretical progress at the fastest pace seen since 2022, publishing work on topics including heuristic explanations, surprise accounting, the presumption of independence, and their first empirical paper on estimating probabilities of rare language model outputs. ARC holds a 4-star rating from Charity Navigator with a 90% overall score.

Theory of Change

ARC believes that as ML systems become more capable, current alignment approaches may fail to scale, potentially leading to systems that pursue goals misaligned with human interests. Their theory of change centers on developing rigorous theoretical foundations for alignment before superintelligent systems arrive. By creating formal mechanistic explanations of neural network behavior, combining ideas from mechanistic interpretability and formal verification into heuristic arguments, ARC aims to enable reliable detection of when AI systems might behave in dangerous or deceptive ways. This theoretical groundwork is intended to inform practical alignment techniques that can be applied by AI labs building frontier models, ensuring that powerful AI systems remain genuinely helpful and honest rather than merely appearing aligned.

Grants Received

SFF-2024 - Alignment Research Center

from Survival and Flourishing Fund (survivalandflourishing.fund)

$197,000

General Support

from Coefficient Giving (www.alignment.org)

$1,250,000

LTFF 2022 Q4 - Alignment Research Center

from Long-Term Future Fund (funds.effectivealtruism.org)

$72,000

Projects: no linked projects
