April 13, 2026 5 min read

Your Data Is the Product (and Your Moat)

Most teams obsess over the algorithm (RLHF vs DPO vs PPO). The data you feed those algorithms is what decides whether your model improves.

Most teams treat RLHF as the interesting part of post-training. The dataset underneath it is an afterthought.

That gets the priority backwards. A perfectly implemented DPO pipeline trained on noisy, inconsistent preference labels will produce a model that gives wrong answers with confidence. A simple reward model trained on clean, well-calibrated human judgments usually beats it.

We published a full guide on this: Curating Training Data for LLMs. It covers every stage from pre-training corpora to safety tuning. The part most relevant to production teams today is post-training: the preference and RL data that shapes how a model actually behaves.

The Post-Training Pipeline

Post-training spans two phases with very different data:

Post-Training Pipeline (diagram): SFT Data (prompt → ideal response) → Preference Data (prompt → A vs B → winner) → Reward Model (learns quality signal) → RL Policy (optimizes against reward)

Supervised fine-tuning (SFT) teaches the model to behave like an assistant. The data is straightforward: a user prompt paired with a gold-standard response. That's how you teach format, tone, and task completion.

Preference and RL data is harder. Instead of handing the model the right answer, you show it two answers along with a label for which one is better, and it has to learn what separates them.

Preference Data Formats

Preference data comes in several forms, each with tradeoffs:

  • Pairwise (A vs B): cheapest to annotate, but loses magnitude. You know A beats B, but not by how much.
  • Scalar (rate each 1-5): captures degree of quality, but raters drift. One annotator's "4" is another's "3".
  • N-way ranking: richest signal from ordering N responses, but slow and costly to collect.
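
Richer formats can be reduced to pairwise data when a trainer needs pairs. The sketch below is one way to do that, under assumptions: the function names, the `min_gap` threshold, and the output field names are illustrative and not part of any particular toolkit.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into (winner, loser) pairs.

    A ranking of N responses yields N * (N - 1) / 2 pairs.
    """
    return [
        {"prompt": prompt, "winner": better, "loser": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


def scalar_to_pairs(prompt, scored_responses, min_gap=1):
    """Turn per-response ratings into pairs, skipping near-ties.

    scored_responses maps response text to a 1-5 rating.
    min_gap is a hypothetical threshold that guards against rater drift.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses.items(), 2):
        gap = score_a - score_b
        if abs(gap) < min_gap:
            continue  # too close to call; do not invent a preference
        winner, loser = (resp_a, resp_b) if gap > 0 else (resp_b, resp_a)
        pairs.append({"prompt": prompt, "winner": winner, "loser": loser, "margin": abs(gap)})
    return pairs
```

Going the other way is not possible: once data is stored as pairs, the magnitude and the full ordering are gone, which is the tradeoff the list above describes.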

DPO trains on pairwise data directly. RLHF first trains a reward model on the preference data; that reward model outputs scalar scores, and a policy is then optimized against them. The choice depends on annotation budget, task subjectivity, and whether you need a standalone reward model.
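
To make the difference concrete, here is a minimal sketch of the per-pair losses in their standard textbook form: a Bradley-Terry loss for the reward model and the DPO loss for the policy. It works on plain floats rather than tensors, and the default `beta=0.1` is an illustrative assumption, not a recommendation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_pair_loss(reward_winner, reward_loser):
    """Bradley-Terry loss: pushes the winner's scalar reward above the loser's."""
    return -log_sigmoid(reward_winner - reward_loser)


def dpo_pair_loss(policy_logp_winner, policy_logp_loser,
                  ref_logp_winner, ref_logp_loser, beta=0.1):
    """DPO loss for one pair, computed from sequence log-probabilities.

    The policy is rewarded for raising the winner's likelihood relative to a
    frozen reference model more than it raises the loser's.
    """
    winner_logratio = policy_logp_winner - ref_logp_winner
    loser_logratio = policy_logp_loser - ref_logp_loser
    return -log_sigmoid(beta * (winner_logratio - loser_logratio))
```

Both losses consume the same (winner, loser) label, so a mislabeled pair pushes either method in the wrong direction.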

Reward hacking

If the reward signal is misspecified, the model learns to exploit it instead of solving the task. A model trained with a length-based reward proxy may generate verbose, repetitive responses that score high but help no one. A coding model may pass test cases by hardcoding outputs rather than solving the general problem.
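
A toy illustration of the length case, assuming a deliberately flawed reward that counts words. The reward function, the sample responses, and the `distinct_ratio` diagnostic are all hypothetical and exist only to show the failure mode.

```python
def length_proxy_reward(response):
    """Hypothetical flawed reward: longer responses score higher."""
    return len(response.split())


def pick_best(candidates, reward_fn):
    """Best-of-n selection: return the candidate the reward prefers."""
    return max(candidates, key=reward_fn)


def distinct_ratio(response):
    """Share of unique words; low values suggest repetitive padding."""
    words = response.split()
    return len(set(words)) / len(words)


concise = "Use a parameterized query to prevent SQL injection."
padded = " ".join(["This is a very important point to consider."] * 10)

# The flawed reward prefers the padded response even though it says nothing useful.
chosen = pick_best([concise, padded], length_proxy_reward)
```

A cheap diagnostic like `distinct_ratio` will not fix the reward, but tracking it alongside the reward score can reveal that high-scoring samples are mostly padding.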

A Concrete Example: Preference Data for Code Review

Take an assistant that reviews pull requests. You want it catching real bugs rather than flagging style issues. Here's how the data pipeline breaks down.

Step 1: Generate Candidate Responses

For each PR diff, generate multiple review responses with varying quality:

{
  "prompt": "Review this Python function:\n  def get_user(id):\n    user = db.query(f'SELECT * FROM users WHERE id = {id}')\n    return user",
  "response_a": "Looks fine. The function retrieves a user by ID.",
  "response_b": "SQL injection vulnerability: the id parameter is interpolated directly into the query string. Use parameterized queries: db.query('SELECT * FROM users WHERE id = ?', [id]). An attacker could pass '1; DROP TABLE users' as the ID."
}
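
The fix that response_b recommends can be checked end to end. The sketch below uses Python's standard-library `sqlite3` module in place of the example's unspecified `db` object, so the table contents and function names are assumptions. Because `sqlite3` runs only one statement per `execute` call, the demo uses an `OR 1=1` payload instead of the `DROP TABLE` payload quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", [(1, "ada"), (2, "grace")])


def get_user_unsafe(user_id):
    # Vulnerable: user_id becomes part of the SQL text itself.
    return conn.execute(f"SELECT * FROM users WHERE id = {user_id}").fetchall()


def get_user_safe(user_id):
    # Parameterized: the driver passes user_id as a value, never as SQL.
    return conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchall()


malicious_id = "1 OR 1=1"
leaked_rows = get_user_unsafe(malicious_id)  # condition is always true: every row comes back
safe_rows = get_user_safe(malicious_id)      # compared as a literal string: no row matches
```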

Step 2: Collect Preference Labels

Human annotators (or a strong judge model) label which response is better and why:

{
  "winner": "response_b",
  "annotation": {
    "dimension": "bug_detection",
    "margin": "large",
    "reasoning": "Response A misses a critical security vulnerability. Response B identifies the SQL injection, explains the risk, and provides a fix."
  }
}
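
Labels in this shape are easy to check mechanically before they reach training. A minimal validator sketch follows; the allowed margin vocabulary and the five-word minimum for the reasoning field are assumptions made for illustration, since the example above only shows a single record.

```python
ALLOWED_WINNERS = {"response_a", "response_b"}
# Hypothetical vocabulary: the example above only shows "large".
ALLOWED_MARGINS = {"small", "medium", "large"}


def validate_label(label):
    """Return a list of problems; an empty list means the label is usable."""
    problems = []
    if label.get("winner") not in ALLOWED_WINNERS:
        problems.append("winner must be response_a or response_b")
    annotation = label.get("annotation") or {}
    if not annotation.get("dimension"):
        problems.append("missing dimension")
    if annotation.get("margin") not in ALLOWED_MARGINS:
        problems.append("margin must be one of " + ", ".join(sorted(ALLOWED_MARGINS)))
    reasoning = annotation.get("reasoning") or ""
    if len(reasoning.split()) < 5:
        problems.append("reasoning is missing or too short to audit")
    return problems
```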

Step 3: Quality-Control the Labels

Most teams cut corners on this step, and bad labels poison the reward model:

  • Inter-annotator agreement: if two raters disagree on which response is better, the example is ambiguous. Re-annotate with a tiebreaker, or drop it.
  • Calibration: annotators need to agree on what "better" means. A clear rubric isn't optional.
  • Diversity: if 90% of your preference pairs are about style and 10% about correctness, the model learns that style matters more.
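
The agreement and diversity checks above can be sketched in a few lines. This version assumes each example carries two rater labels under the hypothetical keys `rater_1` and `rater_2` plus a `dimension` field, and it measures agreement with Cohen's kappa, one common chance-corrected statistic.

```python
from collections import Counter


def cohens_kappa(labels_1, labels_2):
    """Agreement between two raters, corrected for chance agreement."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def split_by_agreement(examples):
    """Separate examples where both raters picked the same winner."""
    agreed = [ex for ex in examples if ex["rater_1"] == ex["rater_2"]]
    disputed = [ex for ex in examples if ex["rater_1"] != ex["rater_2"]]
    return agreed, disputed


def dimension_share(examples):
    """Fraction of examples per dimension, to spot a skewed dataset."""
    counts = Counter(ex["dimension"] for ex in examples)
    total = sum(counts.values())
    return {dim: count / total for dim, count in counts.items()}
```

Disputed examples go to a tiebreaker or get dropped; a low kappa across the whole batch points at the rubric rather than at individual examples.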

Weak preference data:
{
  "prompt": "Review this code",
  "winner": "response_b",
  "reason": "better"
}

Strong preference data:
{
  "prompt": "Review this code",
  "winner": "response_b",
  "dimension": "security",
  "margin": "large",
  "rubric_scores": {
    "bug_detection": [1, 5],
    "explanation": [2, 5],
    "actionability": [1, 4]
  }
}

The second format adds rubric-based scores across multiple dimensions (each pair appears to hold the scores for response A and response B). When something goes wrong downstream, you can trace it back to which dimension the reward model is misjudging.
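
A sketch of what that tracing could look like. It assumes each `rubric_scores` pair holds the scores for response A and response B in that order, and that you have the reward model's predicted winner for every record; both are assumptions, as are the function names.

```python
from collections import defaultdict


def rubric_gaps(record):
    """Per-dimension score gap (response B minus response A)."""
    return {dim: b - a for dim, (a, b) in record["rubric_scores"].items()}


def winner_matches_rubric(record):
    """Sanity check: does the labeled winner agree with the summed rubric gaps?"""
    total_gap = sum(rubric_gaps(record).values())
    if total_gap == 0:
        return False  # rubric says tie, so any declared winner is suspect
    rubric_winner = "response_b" if total_gap > 0 else "response_a"
    return rubric_winner == record["winner"]


def error_rate_by_dimension(records, predicted_winners):
    """Share of records per dimension where the reward model picked the wrong winner."""
    totals, errors = defaultdict(int), defaultdict(int)
    for record, predicted in zip(records, predicted_winners):
        totals[record["dimension"]] += 1
        if predicted != record["winner"]:
            errors[record["dimension"]] += 1
    return {dim: errors[dim] / totals[dim] for dim in totals}
```

None of this is possible with the weak format, where the only recorded reason is "better".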

Process vs. Outcome: Where to Spend Annotation Budget

One of the biggest choices in post-training data curation:

  • Outcome supervision labels only the final answer as pass or fail. It is cheap, since a verifier or test suite is enough, but it can also reward a model that reaches the right answer through bad reasoning.
  • Process supervision labels each reasoning step as correct or incorrect. Expensive, but produces more reliable reasoners.

For code review, outcome supervision means checking whether the review caught the bug. Process supervision grades the intermediate steps: did the model identify the vulnerability type correctly, and is the proposed fix actually right?
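
The two labeling schemes can disagree on the same review. A minimal sketch, with a hypothetical `Step` record and a made-up review in which the bug is caught but the proposed fix is wrong:

```python
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    correct: bool


def outcome_label(caught_bug):
    """Outcome supervision: one label for the whole review."""
    return "pass" if caught_bug else "fail"


def process_labels(steps):
    """Process supervision: one label per intermediate step."""
    return [step.correct for step in steps]


def first_error(steps):
    """Index of the first incorrect step, or None if every step is correct."""
    for index, step in enumerate(steps):
        if not step.correct:
            return index
    return None


review_steps = [
    Step("Identifies the issue as SQL injection", True),
    Step("Explains that the id is interpolated into the query", True),
    Step("Proposes escaping quotes by hand as the fix", False),
]

outcome = outcome_label(caught_bug=True)  # the review did flag the bug
steps_ok = process_labels(review_steps)   # but the last step is wrong
```

Under outcome supervision this review is a clean pass; only the per-step labels expose the bad fix.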

Outcome supervision gets used more widely because it scales. Process supervision tends to be reserved for domains where each step matters as much as the final answer, like math proofs or formal reasoning.

Key takeaway

The algorithm debate (RLHF vs DPO vs PPO) is a distraction when your preference data is noisy. Clean labels and a consistent rubric will usually beat a more sophisticated training loop.

Beyond Preference Pairs

Post-training data curation doesn't stop there. Safety tuning teaches the model when to refuse versus redirect. Reliable function calling depends on tool-use data. Self-correction loops are trained with critique-revision examples.

The LLM Data Curation Guide covers all of it: pre-training corpora, SFT pairs, preference formats, safety datasets, and the metadata schemas for each. If you're building or fine-tuning models, it's worth a read.