Alignment Pretraining

active

Experimental project studying how discourse about AI systems in pretraining data affects downstream model alignment, introducing "alignment pretraining" as a complement to post-training methods and releasing associated models, datasets, and evaluations.

Endorsements support Geodesic Research.

People: no linked people

Project Details

Alignment Pretraining pretrains 6.9B-parameter language models on different mixes of synthetic documents depicting aligned versus misaligned AI behaviour. Upsampling misalignment discourse increases misaligned behaviours, while upsampling alignment discourse can reduce misalignment scores from roughly 45% to about 9%. These effects are dampened by post-training but persist through it. The project also provides open-source models, datasets, and evaluations for further study.
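To make "upsampling" concrete, here is a minimal sketch of one way a data mix could be built by repeating a subset of documents. It is an illustration only: the listing does not describe the project's actual mixing procedure, so the function names (`build_mix`, `relative_reduction`), the document tags, the toy corpus sizes, and the repetition factor are all hypothetical. The second helper restates the reported drop from roughly 45% to about 9% as a relative reduction.

```python
import random
from collections import Counter


def build_mix(base_docs, target_docs, upsample_factor, seed=0):
    """Hypothetical helper: repeat each target document `upsample_factor`
    times, append the result to the base corpus, and shuffle."""
    rng = random.Random(seed)
    corpus = list(base_docs) + list(target_docs) * upsample_factor
    rng.shuffle(corpus)
    return corpus


def relative_reduction(before, after):
    """Fraction by which a score falls when going from `before` to `after`."""
    return (before - after) / before


# Toy corpus (sizes are made up): 96 general documents, 4 alignment-discourse documents.
base = [("general", i) for i in range(96)]
aligned = [("aligned_ai", i) for i in range(4)]

# Upsample the alignment-discourse documents 3x (an assumed factor).
mix = build_mix(base, aligned, upsample_factor=3)
share = Counter(tag for tag, _ in mix)["aligned_ai"] / len(mix)

print(f"alignment-discourse share of the mix: {share:.3f}")
print(f"relative reduction, 45% -> 9%: {relative_reduction(0.45, 0.09):.2f}")
```

In this toy setup the alignment-discourse documents rise from 4 of 100 documents to 12 of 108. The reported change in misalignment score is a fall of 36 percentage points, or about 80% in relative terms.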

Theory of Change

By deliberately shaping pretraining corpora to include more discourse depicting aligned AI behaviour and less discourse emphasising misaligned personas, the Alignment Pretraining project aims to build beneficial alignment priors directly into models before any post-training. If pretraining data can reliably shift alignment behaviour, frontier labs can treat alignment-focused data curation as an additional lever alongside post-training methods to reduce the risk of misaligned models.

Grants Received: no grants recorded
