Safety Pretraining: Toward the Next Generation of Safe AI
Authors:
Pratyush Maini1,2*,
Sachin Goyal1*,
Dylan Sam1*,
Alex Robey1,4,
Yash Savani1,
Yiding Jiang1,
Andy Zou1,3,4,
Zachary C. Lipton1,
J. Zico Kolter1
1Carnegie Mellon University
2DatologyAI
3Center for AI Safety
4Gray Swan AI
* Equal contribution
TL;DR – We embed safety directly into the pretraining pipeline through data-centric interventions, delivering SafeLM, a 1.7B-parameter model family that is natively safe before any RLHF.
Everything (code, data, and weights) is open-source.
📄 Read the Paper
🔗 HuggingFace Hub