Why Tests Fail

Smile sheets correlate with learning at about 10%. Here's what four meta-analyses say we should be measuring instead.

Jun 04, 2026

☕ 7-minute read

I shipped courses with end-of-module quizzes for years. Both in the classroom and as e-learning. Pass at 80%, get the completion, move on. I believed, the way we all believed, that a passing score meant something about what would happen when that person got back to their desk.

Then I went looking for the evidence. And the evidence has been sitting there for decades, quietly saying the opposite.

We have talked about how organizations measure skills without building them. This week is the uncomfortable prequel: the measurement instrument most of us still default to, the knowledge test, was never measuring what we thought it was.

🧭 TL;DR

The ratings we collect at the end of training barely correlate with whether anyone learned. The number across four meta-analyses is about 10%, which is statistically almost nothing. The end-of-course quiz has its own problem: it measures short-term recall, not durable capability.
Knowing and doing are different floors of the same building. Tests live on the bottom floor. The job happens on the top one.
Even the certification industry admits its exams measure “minimal competency.” What’s replacing them is performance: work samples, simulations, demonstrations. With one big new caveat about AI.

📉 The 10% number

Will Thalheimer did something most of us never do: he went and read the meta-analyses on training evaluations. Four of them cover more than 200 studies between them. The finding that should have ended the smile sheet forever: the correlation between how learners rate a course and how much they learned averages about 0.10 on a scale where 1.0 is a perfect relationship, which is the 10% in the headline. Statisticians call anything under 0.30 weak. Ten percent is noise wearing a lab coat.

Sit with what that means. A course with glowing ratings is almost equally likely to be effective or useless. The thing we report to leadership as evidence of quality has roughly the predictive power of a coin flip.
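To put a rough number on "almost equally likely," here is a back-of-the-envelope sketch. It is my own illustration, not a calculation from the meta-analyses. It treats the 10% figure as a correlation of 0.10 and assumes ratings and learning follow a bivariate normal distribution, which is a simplification. The function names are made up for this example.

```python
import math


def variance_explained(r: float) -> float:
    """Share of variance in one measure accounted for by the other (r squared)."""
    return r ** 2


def p_top_half_given_top_half(r: float) -> float:
    """P(learning above median | rating above median).

    Assumes a bivariate normal distribution, where the closed form is
    1/2 + arcsin(r) / pi. The distributional assumption is illustrative,
    not something the studies report.
    """
    return 0.5 + math.asin(r) / math.pi


r = 0.10  # the "about 10%" correlation, read as r = 0.10
print(f"variance explained: {variance_explained(r):.1%}")
print(f"top-half ratings -> top-half learning: {p_top_half_given_top_half(r):.1%}")
```

Under those assumptions, a correlation of 0.10 explains about 1% of the variance, and a course rated in the top half has roughly a 53% chance of landing in the top half for learning. A coin gives you 50%.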

And the quiz at the end isn’t much better, for a reason memory researchers mapped out more than a century ago. Ebbinghaus found that we forget half of new information within about 30 minutes and most of it within a day. An end-of-course test taken five minutes after the content measures short-term recall, not durable capability. We’re testing people at the one moment guaranteed to flatter the training.

There’s a practitioner number that haunts this whole conversation. Robert Brinkerhoff estimated that about 80-85% of training is never consistently applied on the job. Some apply it and revert. Some never try. I’d treat the exact figure loosely: it’s a practitioner estimate, not a peer-reviewed statistic. But anyone who’s run a training function knows the shape of it is right.

🏗️ Knowing and doing are different floors

The cleanest way I’ve found to think about this comes from medical education, of all places. In 1990, George Miller drew a pyramid with four levels: knows, knows how, shows how, and does. The bottom two are knowledge. The top two are behavior. Multiple-choice tests live entirely on the bottom floors.

The line from that research tradition that I keep handing to people: knowledge and behavior correlate poorly. A nurse can ace the infection-control quiz and still skip steps under pressure on a short-staffed Tuesday. We’ve all watched some version of this.

The test asks “do you know the right answer when we hand you four options?” The job asks “do you produce the right action, unprompted, under conditions nobody put on the quiz?”

Those are different skills. Picking a correct answer from a list is a task that almost never appears in real work. The format itself is the contamination.

🔬 What the hiring science just admitted

Here’s the part I find weirdly encouraging. The most rigorous field that studies this, personnel selection research, has just updated its own answer.

For decades, the field ran on a 1998 meta-analysis by Schmidt and Hunter that crowned general cognitive ability as the best predictor of job performance. In 2022, Sackett and colleagues re-ran the math and found the old numbers had been systematically inflated by overly aggressive statistical corrections. The corrected list reordered itself: structured interviews came out on top, followed closely by job knowledge tests, with work samples strong and cognitive ability falling from first place to the middle of the pack.

Two things in there matter for us. First, the predictors that won are the job-specific ones, the ones closest to the actual work. Second, and this is the part I’d tattoo on every assessment conversation: no single measure is strong alone. Combining methods beats every individual one. The lesson was never “find the one perfect test.” It was “stop trusting any single instrument.”
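Why combining helps can be shown with the standard formula for the multiple correlation of two predictors. This is a sketch under stated assumptions: the validity numbers and the overlap between the two measures are hypothetical values I picked for illustration, not figures from Sackett and colleagues, and `combined_validity` is a name I made up.

```python
import math


def combined_validity(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation R of an outcome with two predictors.

    r1, r2: each predictor's correlation with the outcome.
    r12:    correlation between the two predictors (must not be +/-1).
    Standard two-predictor formula:
        R^2 = (r1^2 + r2^2 - 2*r1*r2*r12) / (1 - r12^2)
    """
    r_squared = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
    return math.sqrt(r_squared)


# Hypothetical values for illustration only.
interview, work_sample, overlap = 0.40, 0.35, 0.30
combined = combined_validity(interview, work_sample, overlap)

print(f"best single measure: {max(interview, work_sample):.2f}")
print(f"both combined:       {combined:.2f}")
```

With these made-up inputs the pair comes out around 0.47 against 0.40 for the better single measure. The gain shrinks as the two measures overlap more, which is one reason to stack signals that look at the work from different angles.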

Even the certification world is quietly conceding. CompTIA states in its exam policies that its tests assess “minimal competency” and are not designed to validate whether someone has gained skills from training. SAP went further and rebuilt its certification model around hands-on, scenario-based assessment inside actual SAP environments. When the people who sell exams start replacing exams, we should probably stop defending them harder than they do.

🎭 What’s replacing the test

The direction of travel is consistent everywhere I look: away from recall, toward demonstration.

Work samples and simulations ask people to do a slice of the real job and get scored against a rubric. The authentic-assessment crowd has argued for this since the early 90s, and the AI era has just made their case, because a recall question is now trivially answerable by any model, while a performance under observation is not.

The new tooling is real. AI role-play platforms put a salesperson or a manager in a live scenario with an AI counterpart and score the conversation against a methodology. VR assessment watches where your hands and eyes go during a procedure. We ran simpler versions of this thinking on my last team, digitizing observation checklists so that a manager’s watch-them-do-it judgment became data rather than a vibe. That data told us things no quiz ever had.
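For anyone wondering what "a checklist as data" might look like, here is a minimal sketch. It is not the system my team used. The field names, the critical-step rule, the 80% threshold, and the sample steps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    step: str               # what the observer looks for
    observed: bool          # did the manager see it done?
    critical: bool = False  # a missed critical step fails the observation


def score_observation(items: list[ChecklistItem], pass_rate: float = 0.8) -> dict:
    """Turn one observed performance into a completion rate and a pass flag."""
    if not items:
        raise ValueError("checklist has no items")
    completion = sum(item.observed for item in items) / len(items)
    missed_critical = [i.step for i in items if i.critical and not i.observed]
    return {
        "completion": completion,
        "missed_critical": missed_critical,
        # Pass only if no critical step was missed AND enough steps were seen.
        "passed": not missed_critical and completion >= pass_rate,
    }


# Hypothetical observation of one procedure.
items = [
    ChecklistItem("Performs hand hygiene before contact", True, critical=True),
    ChecklistItem("Puts on gloves", True),
    ChecklistItem("Disinfects equipment after use", False, critical=True),
    ChecklistItem("Documents the procedure", True),
]
result = score_observation(items)
print(result)
```

The design choice worth noticing is the critical flag: three out of four steps would look respectable as a percentage, but the record still shows a fail and names the step that was skipped.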

One honest caveat before we get too excited, because there’s a brand-new landmine here. Anthropic published a randomized controlled trial in January in which 52 engineers learned a new library. The group that used AI assistance finished slightly faster and scored 17% lower on a comprehension quiz afterward, with the biggest gap on debugging. If we start inferring skill from work output, and AI helped produce that output, we can end up measuring the tool instead of the person. Performance assessment survives the AI era. Unsupervised performance assessment might not.

💡 What This All Means

I’m not arguing for zero tests. A well-built knowledge check has real uses when it is spaced out over time and treated as retrieval practice rather than as a verdict. The research on testing as a learning tool is solid. The problem was never the quiz. The problem was the weight we put on it.

What I’d change is the question we ask at the design table. We’ve been asking, “How will we test this?” The better question is, “Where will we watch them do it?” If the honest answer is nowhere, we haven’t designed an assessment yet. We’ve designed a ritual.

And when someone asks how we know the training worked, the defensible answer stacks evidence the way the selection research recommends: multiple signals, none trusted alone. A demonstration. A manager’s structured observation. A work product. A delayed check, weeks later, when the forgetting curve has done its honest work.

The quiz score was always the easiest thing to collect. That was never the same as being the truth.

🔧 From the workbench

The assessment-design layer of the L&D AI Operating System I’ve been building starts exactly here, with the “where will we watch them do it” question baked into the intake template. More on that in a coming issue.

If someone forwarded this to you, the full Learning Upgraded newsletter is at learningupgraded.com. We’re working through the whole skills-measurement problem this month, one uncomfortable layer at a time.

Where in your current programs could a demonstration replace a quiz this quarter?

—Eian

Learning, Upgraded
