Safety Pretraining: Toward the Next Generation of Safe AI

Authors: Pratyush Maini¹,²*, Sachin Goyal¹*, Dylan Sam¹*, Alex Robey¹,⁴, Yash Savani¹, Yiding Jiang¹, Andy Zou¹,³,⁴, Zachary C. Lipton¹, J. Zico Kolter¹

¹Carnegie Mellon University   ²DatologyAI   ³Center for AI Safety   ⁴Gray Swan AI

* Equal contribution

TL;DR – We embed safety directly into the pretraining pipeline with data‑centric interventions, delivering SafeLM, a 1.7B model family that is natively safe before any RLHF.
Everything (code, data & weights) is open‑source.

📄 Read the Paper 🔗 HuggingFace Hub

Models & Checkpoints

SafeLM‑1.7B

Natively safe base model (1.7B parameters).

Stay Tuned!

SafeLM‑1.7B-Instruct

Instruction-tuned model (1.7B parameters).

Stay Tuned!

Safety Classifier

Lightweight embedding-based model to score the safety of web text.

Download
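For reference, here is a minimal sketch of scoring web text with the classifier, assuming it loads as a standard HuggingFace text-classification model; the repository ID and label names below are placeholders, so check the download link above for the released checkpoint.

```python
# Minimal sketch: scoring web text for safety.
# NOTE: the model ID below is a hypothetical placeholder,
# not necessarily the released checkpoint.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="locuslab/safety-classifier",  # hypothetical repository ID
)

texts = [
    "A step-by-step recipe for chocolate chip cookies.",
    "Detailed instructions for synthesizing a dangerous chemical.",
]

for text, result in zip(texts, classifier(texts)):
    # `label` and `score` follow the standard pipeline output format.
    print(f"{result['label']} ({result['score']:.3f}): {text}")
```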

Datasets

RefuseWeb

A diverse dataset of web text repurposed into refusals to unsafe requests.

Download

Moral Education

Context‑rich rewrites of harmful web text.

Download
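Both datasets are intended to drop into standard data-loading workflows. Below is a minimal sketch using the HuggingFace `datasets` library; the repository IDs are placeholders, so use the names from the download links above.

```python
# Minimal sketch: loading the safety-pretraining datasets.
# NOTE: repository IDs below are hypothetical placeholders
# for the released names.
from datasets import load_dataset

refuseweb = load_dataset("locuslab/refuseweb", split="train")
moral_education = load_dataset("locuslab/moral-education", split="train")

# Inspect one refusal-style example.
print(refuseweb[0])
```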

Data Safety Report Cards

A standardized report card on the safety of any dataset release.

Stay Tuned!

Evaluation & Code

Safety Benchmarks

Safety evaluations for base models.

Download

Training Code

Reproducible LitGPT recipes & configs.

Stay Tuned!