This article breaks down the complete training pipeline for large language models, from pretraining and supervised fine-tuning (SFT) to reward modeling and reinforcement learning. It focuses on what problem each stage solves, what resources it consumes, and what capabilities it produces. It is intended for developers who want to understand how ChatGPT-like systems are built. Keywords: pretraining, SFT, RLHF.
Technical specifications are summarized below:
| Parameter | Details |
|---|---|
| Domain | Large language model training pipeline |
| Core stack | Python, CUDA, distributed training frameworks |
| Key protocols/paradigms | Transformer, supervised learning, preference learning, reinforcement learning |
| Reference models | GPT-3, BLOOM, LLaMA, ChatGLM |
| Data scale | Hundreds of billions of tokens for pretraining, tens of thousands of samples for SFT, millions of preference pairs for RM |
| Compute profile | Pretraining is the most expensive stage; post-training is relatively cheaper |
| Core dependencies | Transformer, tokenizer, LoRA, PPO/policy optimization |
LLM training is not a single breakthrough but a staged process of capability shaping
A large language model does not become a chat assistant in one training run. It typically starts by learning statistical patterns of language from massive text corpora, then learns to answer according to instructions through instruction data, and finally aligns with human preferences through preference feedback and reinforcement learning.
You can summarize this process in four stages: pretraining builds the foundation, SFT makes the model usable, RM evaluates quality, and RL continuously optimizes behavior. Understanding this division of labor is a key entry point into modern LLM engineering.
A minimal pipeline illustration
```python
stages = ["Pretraining", "SFT", "RewardModel", "RLHF"]
for stage in stages:
    print(stage)  # Walk through the four core training stages in order
```
This snippet only illustrates that LLM training proceeds through these stages in sequence rather than finishing in a single run.
The pretraining stage determines the model’s knowledge ceiling and language floor
At its core, pretraining teaches the model to perform next-token prediction over massive corpora. The goal is not to answer questions directly, but to continue text as accurately as possible. As a result, the model first learns language structure, factual patterns, and cross-domain knowledge distributions.
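To make the objective concrete, here is a minimal sketch of next-token prediction as a cross-entropy loss. Random logits stand in for a real Transformer's output; the shapes and the shift-by-one indexing are the part that carries over to real training.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))  # a batch of token IDs: [batch, seq]
logits = torch.randn(1, seq_len, vocab_size)         # stand-in for Transformer output

# Position t predicts token t+1, so shift predictions and targets by one.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)                 # the pretraining objective
print(loss.item())
```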
Common data sources include web pages, encyclopedias, books, papers, code, and Q&A communities. Raw corpora cannot be used directly: you must clean them, deduplicate, remove sensitive or private data, and tokenize before training. Otherwise, low-quality data noticeably lowers the model's capability ceiling.
A typical data processing pipeline
```python
raw_corpus = load_data()                  # Load raw corpora such as web pages, books, and code
clean_corpus = filter_noise(raw_corpus)   # Filter garbled text, spam pages, and low-quality content
dedup_corpus = deduplicate(clean_corpus)  # Deduplicate to avoid too many repeated samples
tokens = tokenize(dedup_corpus)           # Split into token sequences the model can process
```
This workflow shows the most important data engineering steps before pretraining.
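As a toy, runnable stand-in for the pipeline above, the sketch below uses exact-match hashing for deduplication and whitespace splitting in place of a real subword tokenizer. Production pipelines rely on fuzzy deduplication such as MinHash and a trained BPE tokenizer; both substitutions here are deliberate simplifications.

```python
import hashlib

def deduplicate(docs):
    # Exact-match dedup by content hash; real pipelines also use fuzzy matching.
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

raw_corpus = ["the cat sat", "the cat sat", "a dog ran"]  # stand-in for web/book/code text
dedup_corpus = deduplicate(raw_corpus)
tokens = [doc.split() for doc in dedup_corpus]            # whitespace split as a stand-in for BPE
print(f"{len(raw_corpus)} -> {len(dedup_corpus)} docs, tokens: {tokens}")
```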
In terms of scale, GPT-3-class models are trained on roughly hundreds of billions of tokens. Training usually requires hundreds to thousands of GPUs running for weeks. Pretraining often accounts for more than 90% of total cost, which makes it the central battleground where capital, engineering, and algorithms all matter.
Instruction fine-tuning turns a base model from “able to continue text” into “able to perform tasks”
Although a pretrained base model contains rich knowledge, it does not inherently know whether the user is asking a question, requesting a translation, asking for a summary, or requesting code generation. The role of SFT is to convert raw language capability into task response capability.
SFT data usually consists of instruction-response pairs. For example, an instruction like “Explain what a token is” maps to a structured answer. The dataset is much smaller than pretraining data, typically tens of thousands of high-quality examples, but it places a high bar on formatting consistency, answer quality, and task coverage.
An example SFT sample structure
```json
{
  "instruction": "How many campuses does Fudan University have?",
  "output": "Fudan University currently has four campuses: Handan, Jiangwan, Fenglin, and Zhangjiang."
}
```
These samples directly teach the model to understand instructions and produce the target answer format.
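SFT reuses the next-token objective, but the loss is typically computed only over response tokens, so the model learns to produce answers rather than to imitate instructions. A minimal sketch of that masking follows; the token IDs are made up, and the -100 ignore index follows the common PyTorch convention.

```python
import torch
import torch.nn.functional as F

vocab_size = 1000
instruction_ids = torch.tensor([5, 17, 42])  # "How many campuses ..." (made-up IDs)
response_ids = torch.tensor([7, 99, 3])      # the target answer (made-up IDs)

input_ids = torch.cat([instruction_ids, response_ids]).unsqueeze(0)
labels = input_ids.clone()
labels[0, : len(instruction_ids)] = -100     # mask instruction tokens out of the loss

logits = torch.randn(1, input_ids.shape[1], vocab_size)  # stand-in for model output
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,                       # only response positions contribute
)
print(loss.item())
```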
It is worth noting that both research and open-source practice suggest that SFT acts more like a capability activator than a capability source. LIMA, for example, shows that roughly a thousand carefully curated examples can approach the results of much larger instruction datasets. The true capability foundation, in other words, still comes from pretraining.
Reward models convert human preferences into an optimizable signal
SFT alone is not enough, because being able to answer does not necessarily mean answering in a way that better matches human preferences. The job of the reward model (RM) is to learn which answers are clearer, safer, and more helpful.
RM training samples usually do not contain a single canonical answer. Instead, they contain multiple responses to the same instruction along with ranking relationships between them. For example, A is better than B, and B is better than C. What the model learns is not knowledge itself, but a preference function.
The core input format for preference learning
```python
prompt = "How can I improve my writing skills?"
answer_a = "Read more, write more, and practice deliberately."
answer_b = "Just write casually."
label = 1  # 1 means A is preferred over B
```
This kind of comparison data helps the RM learn which answer is more worth rewarding.
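The RM itself is typically trained with a pairwise ranking loss of the Bradley-Terry form used in InstructGPT: push the scalar score of the preferred answer above the score of the rejected one. A minimal sketch, with placeholder scalars standing in for the RM's outputs:

```python
import torch
import torch.nn.functional as F

score_a = torch.tensor(1.3, requires_grad=True)  # RM scalar score for the preferred answer
score_b = torch.tensor(0.4, requires_grad=True)  # RM scalar score for the rejected answer

# Pairwise ranking loss: -log(sigmoid(score_chosen - score_rejected)).
loss = -F.logsigmoid(score_a - score_b)
loss.backward()                                  # gradients push score_a up and score_b down
print(loss.item())
```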
The value of RM is that it compresses expensive and hard-to-scale human evaluation into a scorer that you can call in bulk. If the RM learns the wrong preferences, downstream reinforcement learning will optimize in the wrong direction. That is why reward model quality directly affects final alignment quality.
Reinforcement learning pushes the model from “able to answer” to “able to answer in a preference-aligned way”
The reinforcement learning stage typically starts from an SFT-initialized model. The model generates responses, the RM scores them, and a policy optimization method then increases the probability of high-scoring outputs. In industry, this workflow is commonly summarized as RLHF.
Its goal is not to add new knowledge, but to adjust output style, stability, and alignment with preferences. After RL, a model usually behaves more like a production-grade assistant: it follows instructions more willingly, refuses unsafe requests more consistently, and responds in a more stable tone, although it may also become more conservative.
The RLHF loop is closed: a user instruction first enters the SFT model, which generates candidate responses; the reward model scores those responses; and the reward signal feeds back into the policy optimization module to update parameters. Reinforcement learning does not directly teach knowledge. It redistributes output probabilities on top of a fixed knowledge foundation, so the model becomes more likely to produce high-scoring, low-risk, preference-aligned responses.
A simplified RLHF feedback loop
```python
response = policy_model.generate(prompt)       # Generate a response with the current policy model
reward = reward_model.score(prompt, response)  # The reward model assigns a preference score
policy_model.update(response, reward)          # Update policy parameters based on the reward
```
This code captures the core feedback loop of the reinforcement learning stage.
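One detail hidden inside `reward_model.score` is that practical RLHF does not optimize the raw RM score. InstructGPT-style setups subtract a KL penalty against the frozen SFT reference model so the policy cannot drift into degenerate, reward-hacked outputs. A minimal sketch of that shaped reward, with placeholder log-probabilities:

```python
import torch

kl_coef = 0.1                             # strength of the KL penalty (tunable)
rm_score = torch.tensor(2.0)              # reward model score for a sampled response
logp_policy = torch.tensor(-12.5)         # log-prob of the response under the current policy
logp_ref = torch.tensor(-14.0)            # log-prob under the frozen SFT reference model

# Shaped reward: RM score minus a penalty for drifting from the reference model.
kl_penalty = logp_policy - logp_ref       # per-sample KL estimate
reward = rm_score - kl_coef * kl_penalty
print(reward.item())
```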
From an engineering perspective, the four-stage division of labor is very clear
Pretraining determines how much the model knows. SFT determines whether the model can behave as instructed. RM determines how the system measures a good answer. RL determines whether the final product feels stable, smooth, and trustworthy. All four matter, but the investment profile across them is highly uneven.
| Stage | Goal | Primary data | Cost profile | Main output |
|---|---|---|---|---|
| Pretraining | Learn language and world knowledge | Hundreds of billions of tokens | Highest | Base Model |
| SFT | Learn to follow instructions | Tens of thousands of Q&A pairs | Medium to low | Instruction Model |
| RM | Learn preference ranking | Millions of comparison samples | Medium to low | Reward Model |
| RL | Optimize response quality | Hundreds of thousands of prompts | Medium to low | Chat/Assistant Model |
This is also why most teams do not train LLMs from scratch. Instead, they build on open-weight foundation models and apply SFT or parameter-efficient fine-tuning.
For developers, building the full pipeline from scratch is usually not the most practical strategy
If your goal is production delivery, it is usually far more practical to start with a mature base model and then apply domain-specific SFT, LoRA fine-tuning, or preference optimization. Pretraining requires substantial capital, strong data governance, and advanced cluster scheduling capabilities, which makes it unrealistic for most teams.
LoRA is popular because it trains only a very small number of incremental parameters while still letting a 7B or larger model adapt quickly to vertical tasks. This enables smaller teams to complete customized iterations on a single machine or with only a few GPUs.
A low-cost LoRA adaptation example
```python
base_model = load_model("7B-base")    # Load the base model
lora_model = attach_lora(base_model)  # Attach low-rank adaptation layers
train(lora_model, domain_dataset)     # Train only a small number of incremental parameters
```
This code shows the core idea behind LoRA: freeze most original parameters and update only a small adaptation layer.
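Under the hood, LoRA adds a trainable low-rank product next to the frozen weight, so the adapted layer computes base(x) plus (alpha/r) * B A x. A minimal from-scratch sketch in PyTorch follows; the rank, alpha, and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                # freeze the original weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable} / {total} parameters")               # only the low-rank A and B train
```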
FAQ
FAQ 1: Why are many open-source models already quite useful with only SFT?
Because core capability mainly comes from pretraining. As long as the foundation model is strong enough, SFT can organize existing capabilities into conversational and task-execution formats, which is already sufficient for many general-purpose scenarios.
FAQ 2: Are RM and RL required for every project?
No. If your use case mainly involves knowledge Q&A, summarization, or classification, SFT or LoRA is often enough. RM and RL are more suitable for systems that require high consistency, high safety, and product-grade interaction quality.
FAQ 3: Which layer is the most worthwhile investment for smaller teams?
Usually the data construction and fine-tuning layer. High-quality domain data, standardized instruction templates, and solid evaluation sets often improve business outcomes more directly than blindly scaling model size.
References and further reading
- Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (InstructGPT), OpenAI, 2022
- Wang et al., "Self-Instruct: Aligning Language Models with Self-Generated Instructions", 2022
- Zhou et al., "LIMA: Less Is More for Alignment", 2023
- Andrej Karpathy, "State of GPT", Microsoft Build, 2023
Core summary: This article systematically reconstructs the full LLM pipeline from pretraining to SFT, reward modeling, and reinforcement learning. It explains the goal of each stage, data scale, compute cost, and engineering value, and clarifies why pretraining determines the capability ceiling, SFT determines usability, and RM + RL determine alignment quality.