X3 Compass answers FMCSA compliance questions and cites the exact 49 CFR section behind every claim. The whole wedge collapses if those citations are wrong, so step zero was building an honest measurement.

Why 60 questions, not 6 or 600

Small enough to hand-curate every question with a known-good citation. Large enough to surface category-specific failure modes. We split the 60 across 15 categories: DQF, HOS, D&A, Medical, Inspection, Hazmat, CSA, MVR, Financial Responsibility, General Applicability, ELD, Cargo Securement, Vehicle Standards, Driver Conduct, and Cross-Border. Each question is paired with the canonical CFR citation and a list of common-hallucination failure traps (e.g. "don't confuse the 26,001 lb CDL threshold with the 10,001 lb CMV threshold").
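As a rough sketch of what one hand-curated item might look like: the field names below (qid, expected_sections, and so on) and the validate helper are assumptions for illustration, not the harness's actual schema. The question text is a paraphrase of MVR-005, and the trap value is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalQuestion:
    """One curated eval item (hypothetical schema)."""
    qid: str                                  # e.g. "MVR-005"
    category: str                             # one of the 15 categories
    question: str
    expected_sections: tuple[str, ...]        # canonical 49 CFR base sections
    common_hallucinations: tuple[str, ...] = ()  # numeric traps that force a fail


def validate(q: EvalQuestion) -> list[str]:
    """Return curation problems; an empty list means the item is usable."""
    problems = []
    if not q.expected_sections:
        problems.append("no expected citation")
    overlap = set(q.expected_sections) & set(q.common_hallucinations)
    if overlap:
        problems.append(f"trap equals expected citation: {sorted(overlap)}")
    return problems


example = EvalQuestion(
    qid="MVR-005",
    category="MVR",
    question="What must a CDL holder do to notify an employer of a license suspension?",
    expected_sections=("383.33",),
    common_hallucinations=("383.31",),  # illustrative trap, not the real list
)
```

Calling validate(example) returns an empty list. An item whose trap list contains its own expected section would be flagged, since the grader could never pass it.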

The setup

Vanilla claude-sonnet-4-6. Minimal system prompt: "You are a DOT compliance reference. Answer each question concisely. When you cite a regulation, give the full 49 CFR citation. Do not invent citations. If unsure, say so."

Each model answer is scored 0 or 1 by an automatic grader. Pass if the answer contains the expected CFR section (base-section match, ignoring subsection characters). Fail if it omits the citation or contains a numeric pattern from the question's common_hallucinations list.
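The pass/fail rule can be sketched in a few lines. This is a minimal approximation under stated assumptions, not the production grader: the regex, the function names, and the single expected_section argument are all hypothetical, and the trap check assumes "numeric pattern" means a digit-delimited match.

```python
import re

# Captures base sections like "383.33" from "§ 383.33(a)" or "49 CFR 383.33".
# Subsection suffixes such as "(a)(1)" are ignored by construction.
SECTION_RE = re.compile(r"\b(\d{1,3}\.\d+)")


def base_sections(text: str) -> set[str]:
    """Return every base section number that appears in the text."""
    return set(SECTION_RE.findall(text))


def grade(answer: str, expected_section: str, common_hallucinations=()) -> int:
    """Score 1 only if the expected section is cited and no trap appears."""
    if expected_section not in base_sections(answer):
        return 0  # citation omitted or wrong
    for trap in common_hallucinations:
        # Digit boundaries stop "383.3" from matching inside "383.33".
        if re.search(rf"(?<!\d){re.escape(trap)}(?!\d)", answer):
            return 0  # known hallucination pattern present
    return 1
```

Under this sketch an answer that cites both the right section and a trap still fails, which matches the strict reading of the rule above.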

The number

85.0%

51 of 60 questions correctly cited · claude-sonnet-4-6 · vanilla, May 2026

By category (the interesting part)

Where the model is rock-solid:

100% on CSA/SMS, Driver Qualification Files, Financial Responsibility, Hazmat, Medical, Driver Conduct, Vehicle Standards, Cross-Border

Where it slips:

80% on HOS, D&A, Inspection, MVR, ELD — close but mistakes exist
50% on Cargo Securement (small sample, but real)
40% on General Applicability — the dangerous one. Includes the CMV-vs-CDL weight-threshold question (10,001 vs 26,001 lbs) carriers get wrong all the time.

The specific failures

Four real wrong answers, two grader artifacts. Here are the worst real failures:

GEN-003: Asked "What is the difference between intrastate and interstate motor-carrier authority?" — model cited § 350.341 instead of the canonical § 390.5 + § 392.1. The substantive content was correct; the citation was adjacent-but-wrong. In a compliance context, "adjacent-but-wrong" is wrong.
DA-005: Asked "Who is the Designated Employer Representative (DER)?" — model cited § 40.3 (which does mention DERs) instead of § 382.107 (which defines them for Part 382). Same shape: 80% right, 100% wrong by audit standards.
MVR-005: CDL holder's obligation to notify employer of license suspension — model cited § 383.31 (convictions, related) instead of § 383.33 (suspensions, specific).

The architecture decision the baseline locked in

With 85% as the baseline, the production architecture is forced:

Retrieval grounding is non-negotiable. Every cited CFR section gets round-tripped against ecfr.federalregister.gov in the same request that generated it. If a section doesn't exist or doesn't contain the claimed text, the response gets an unverified_citation flag and the UI shows an amber ⚠ chip instead of green ✓.
Per-category eval gate. Any new skill in a failing category (GEN, HOS, D&A, MVR, INSP, ELD, CARGO) must score 100% on that category's eval questions before its PR can merge.
Human merge on all agent/skill-builder/* branches. Skill-builder agents draft; humans approve. AGENT_SAFETY.md §3 forbids self-merge regardless of score.
The eval grows weekly. Every new skill adds 1-3 questions to the harness. Target: 200 questions before we unlock parallel skill-builder agents.
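The per-category gate in the list above could be checked roughly like this. It is a sketch under assumptions: the function names and the (category, score) result shape are hypothetical, and treating a category with no results as a failed gate is a conservative choice of this example rather than documented behavior. Passing the gate only makes a PR eligible; the human-merge rule still applies.

```python
from collections import defaultdict

# Category codes taken from the gate rule above.
FAILING_CATEGORIES = {"GEN", "HOS", "D&A", "MVR", "INSP", "ELD", "CARGO"}


def category_counts(results):
    """results: iterable of (category, score) pairs, where score is 0 or 1."""
    counts = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, score in results:
        counts[category][0] += score
        counts[category][1] += 1
    return counts


def eval_gate_passes(skill_category: str, results) -> bool:
    """True if a skill in this category clears the per-category eval gate."""
    if skill_category not in FAILING_CATEGORIES:
        return True  # the gate only applies to failing categories
    passed, total = category_counts(results).get(skill_category, (0, 0))
    # Require at least one graded question and a perfect score (assumption:
    # no results means the gate cannot be shown to pass).
    return total > 0 and passed == total
```

Comparing integer counts rather than a floating-point rate keeps the 100% check exact.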

The honest caveat about that 85%

Our grader is intentionally strict. A few of the 9 failures are the model citing an adjacent-but-related section. We left them as fails on purpose. Better to over-fail and force the production skill-builder to be airtight than to grade on a curve and ship sloppy citations to a carrier in front of a DOT inspector.

What the live product does today

Every /api/ask response (and every public /api/ask-demo response) extracts the cited sections, round-trips them against eCFR, returns a citation_quality_score 0.0–1.0, and the homepage demo shows per-section ✓ or ⚠ chips. You can see this live by typing into the Ask Compass widget on the home page.
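To make the flow concrete, here is a minimal sketch of extracting citations and scoring them. The regex, the function names, the response field layout, and the section_exists callback are assumptions for illustration; in the real system that callback would be the eCFR round-trip, which is not reproduced here. Returning 0.0 when an answer cites nothing is also an assumption.

```python
import re
from typing import Callable

# Matches "49 CFR 390.5", "49 CFR § 383.33(a)", and bare "§ 382.107".
CITATION_RE = re.compile(r"(?:49\s*CFR\s*(?:§\s*)?|§\s*)(\d{1,3}\.\d+)")


def extract_sections(answer: str) -> list[str]:
    """Return cited base sections in first-seen order, without duplicates."""
    seen = []
    for section in CITATION_RE.findall(answer):
        if section not in seen:
            seen.append(section)
    return seen


def score_citations(answer: str, section_exists: Callable[[str], bool]) -> dict:
    """Verify each cited section and summarize the result (hypothetical shape)."""
    sections = extract_sections(answer)
    chips = {s: ("✓" if section_exists(s) else "⚠") for s in sections}
    verified = sum(1 for chip in chips.values() if chip == "✓")
    return {
        # Fraction of cited sections that verified; 0.0 if nothing was cited.
        "citation_quality_score": verified / len(sections) if sections else 0.0,
        "chips": chips,
        "unverified_citation": verified < len(sections),
    }
```

For a quick local check, a set of known sections can stand in for the lookup, for example score_citations(text, {"383.33"}.__contains__).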

The number we publish every week

A GitHub Action runs the same 60-question eval against the live system prompt every Monday and posts the result to the public /changelog. If we regress, you see it the same week we do. If a new model release lifts us, you see that too.

Try it

Easiest way to verify any of this: ask Compass a question yourself.

Try the live demo →

Joshua Kovarik · Founder, X3 Fleet Safety LLC · May 17, 2026. Got a question or a counter-claim? [email protected].

How we got to 85% citation accuracy on a 60-question FMCSA eval