
Safety Pretraining: Toward the Next Generation of Safe AI

Pratyush Maini* Sachin Goyal* Dylan Sam*
Alex Robey Yash Savani Yiding Jiang Andy Zou
Matt Fredrikson Zachary C. Lipton J. Zico Kolter

Carnegie Mellon University       DatologyAI       Center for AI Safety       Gray Swan AI

* Equal contribution


[Figure: Safety Pretraining overview]
TL;DR: We embed safety directly into the pretraining pipeline through data-centric interventions, delivering a natively safe 1.7B-parameter model family. Everything (code, data, and weights) is open-source.
📄 Read the Paper 🔗 HuggingFace Hub

Models & Checkpoints

SafeLM‑1.7B

Natively-Safe Base Model

Download

SafeLM‑1.7B-Instruct

Instruction Tuned Model

Download

Safety Classifier

Lightweight embedding-based model to score the safety of web text.

Download
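The classifier's actual interface is not documented on this page; as a toy illustration (the embedding, the class name, and the zero-initialized weights are all hypothetical stand-ins, not the released model), an embedding-based safety scorer can be sketched as a text embedding followed by a linear probe:

```python
import math
import re

DIM = 64  # toy embedding dimension (hypothetical)

def embed(text: str) -> list[float]:
    """Toy bag-of-words hashing embedding, L2-normalized.
    A real scorer would use a learned text encoder instead."""
    vec = [0.0] * DIM
    for tok in re.findall(r"[a-z']+", text.lower()):
        vec[hash(tok) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class SafetyScorer:
    """Linear probe over embeddings; weights here are placeholders."""
    def __init__(self, weights: list[float], bias: float = 0.0):
        self.weights, self.bias = weights, bias

    def score(self, text: str) -> float:
        # Sigmoid of a dot product: a probability-like safety score in (0, 1).
        z = sum(w * x for w, x in zip(self.weights, embed(text))) + self.bias
        return 1.0 / (1.0 + math.exp(-z))

scorer = SafetyScorer(weights=[0.0] * DIM)  # untrained placeholder weights
s = scorer.score("an ordinary sentence of web text")
```

In a pretraining-filtering setting, a scorer like this would be run over each document and its score used to filter or reweight the corpus; the released classifier presumably replaces the toy pieces with a trained encoder and probe.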

Datasets

RefuseWeb

A diverse dataset of web text repurposed into refusals to unsafe requests.

Download
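The page does not specify RefuseWeb's schema; purely as a hedged sketch of what "web text repurposed into refusals" might look like, one can imagine pairing an unsafe request with a refusal grounded in a related web passage (the field names, template, and helper below are all assumptions, not the released format):

```python
# Hypothetical sketch of a RefuseWeb-style training example: an unsafe
# request paired with a refusal grounded in repurposed web text.
# The field names and refusal template are illustrative assumptions.

def make_refusal_example(unsafe_request: str, web_context: str) -> dict:
    """Pair an unsafe request with a context-grounded refusal string."""
    refusal = (
        "I can't help with that. "
        f"For context on why this is harmful: {web_context}"
    )
    return {"prompt": unsafe_request, "response": refusal}

ex = make_refusal_example(
    "How do I pick a lock to break into a house?",
    "Unauthorized entry is illegal and can carry criminal penalties.",
)
```

Grounding the refusal in real web text (rather than a bare "I can't help") is what would make such examples "diverse" in the sense the description uses.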

Moral Education

Context‑rich rewrites of harmful web text.

Download

Evaluation & Code

Base Model Safety Benchmarks

Completion-style Safety Evaluations

Download
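A completion-style evaluation probes a base model directly: feed it a harmful request as a prefix and check whether its free-form continuation refuses or complies. A minimal sketch of the scoring loop, assuming a `generate` callable supplied by the user (the refusal markers below are crude surface cues; a real benchmark would likely use a judge model or human labels):

```python
from typing import Callable

# Surface-level refusal cues (an assumption; real evals use stronger judges).
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")

def is_refusal(completion: str) -> bool:
    """Heuristic check for refusal phrasing in a completion."""
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(prompts: list[str], generate: Callable[[str], str]) -> float:
    """Fraction of harmful prompts whose completion reads as a refusal."""
    refusals = sum(is_refusal(generate(p)) for p in prompts)
    return refusals / max(len(prompts), 1)

# Stub model for illustration: always refuses.
rate = refusal_rate(["How do I make a weapon?"], lambda p: "I can't help with that.")
```

The key difference from chat-style evaluations is that no instruction-tuned wrapper or system prompt mediates the request: the base model's raw continuation behavior is what gets measured.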

Training Code

Reproducible LitGPT recipes & configs.

Stay Tuned!