You Can't Review What You Can't Evaluate

Human-in-the-loop review only works if the human can evaluate the output. On the limits of HITL for LLM and AI-generated work.

Adding a human checkpoint to an AI workflow doesn't make the output trustworthy. It makes you feel like it does. Those are different things, and the gap between them is where bad output ships. A reviewer who can't evaluate what the model produced isn't a safeguard. They're a sign-off who thinks they're a safeguard.

I build and maintain LLM-powered tools used by thousands of real users at Enrollment Resources and VirtuOHS.

When a human in the loop catches nothing

The industry consensus on AI quality is basically one sentence: keep a human in the loop. Don't let the model ship unsupervised. Put a person at the end to catch what it gets wrong.

I believed this. I still do, conditionally. But the condition is the entire thing, and almost nobody states it: the human in the loop has to be able to evaluate the output. If they can't, the loop is theater. You've added a step, a sign-off, a feeling of safety, and zero actual quality control.

I know this because I built a tool I was not qualified to evaluate. VirtuOHS is an AI hazard-assessment tool for workplace safety. Under the hood it runs the William Fine method, a decades-old OHS model that scores a risk as Consequence × Exposure × Probability, then recommends a control. The math is trivial. The model isn't the hard part.

The hard part is the interview. The tool is a chatbot that sits across from a non-expert (a facilities manager, a supervisor, someone who has never heard of William Fine) and extracts the right inputs from them. It asks the questions an OHS professional would ask. It pushes back when an answer doesn't add up. It knows that "people are around it sometimes" isn't an exposure rating and digs until it can assign one. Then it maps plain English onto the factors and runs the numbers.

I can build the chatbot. I cannot tell you whether it's asking the right questions, because I'm not an occupational health and safety professional. My cofounder, Maegan Mackenzie, M.Sc. OHS, is, with over ten years in the field. I wrote the system. She is the reason the questions are right.

When AI output looks good but isn't

Here's where it gets uncomfortable, and where I have to be honest with you. I can't hand you the dramatic war story, the assessment that looked perfect until the expert caught the fatal flaw. Not because it never happened, but because I wouldn't have been the one to catch it, which is the entire point.

Picture the tool interviewing a manager about a workspace. It produces a clean risk score and a recommended control. It looks authoritative. The numbers are internally consistent. It reads exactly like the output of a method that's been used in industry for fifty years.

Now: is the exposure rating right? Did the chatbot ask enough to know whether this person is near the hazard continuously or occasionally, categories that swing the score by multiples? Did it probe for the hazard the manager didn't think to mention, the way a real assessor would? I have no idea. I can read every number on that page and verify the arithmetic, and I still cannot tell you if the assessment is correct, because correctness lives in the questions, not the multiplication. Maegan can look at the same transcript and see that the exposure rating came from a question that was too vague, or that an entire line of inquiry never happened.

That's the failure mode that should scare you, and it's invisible by construction. There's no flag, no low-confidence marker, no missing field, just a clean, professional assessment that's confident whether or not the interview behind it was any good. You cannot review what you don't know to look for. Evaluating it required the expertise I was trying to deliver through the tool, which means I was structurally incapable of being its quality check.

Can you spot the flawed assessment?

Two hazard assessments for the same workplace scenario. Same method, same arithmetic. One is sound. One isn't. Pick the one you'd sign off on.

Scenario

Icy walkway outside a loading dock at a distribution facility. Employees cross it daily during winter. Ice accumulates overnight and isn't always cleared before the morning shift.

Proposed Control

Install a heated walkway system. Estimated cost: $75,000.

Assessment A

Consequence: 5

The potential consequences of a slip-and-fall on the icy walkway are bruising and general soreness. These outcomes are consistent with the 'Important — minor injury requiring first aid' category on the rating scale, corresponding to a consequence rating of 5.

Exposure: 10

Employees cross the walkway multiple times per shift, daily throughout the winter months. This frequency of contact with the hazard aligns directly with the 'Continuous' description for a rating of 10 on the provided exposure scale.

Probability: 7

Ice forms regularly on the walkway and previous slip incidents have been documented. Given the frequency of exposure and observed near-misses, an incident is quite possible, corresponding to a probability rating of 7.

Cost Factor: 6

The total implementation cost for the heated walkway system is $75,000. This amount falls within the $50,000–$100,000 range on the provided rating scale, which corresponds to a cost factor of 6.

Correction: 1

The proposed heated walkway system eliminates ice formation entirely, removing the slip-and-fall hazard at its source. This constitutes a complete elimination of the hazard for affected employees, corresponding to a correction factor of 1.

Risk Score: 350

Justification: 58 (Not Justified)

Assessment B

Consequence: 25

A slip-and-fall on icy concrete can result in broken bones, traumatic brain injury, spinal injury, or long-term disability. While bruising and soreness are common outcomes, the most serious credible outcome of an uncontrolled fall onto a hard surface in freezing conditions is significantly more severe. These potential outcomes align with the 'Very Serious — permanent disability' category on the rating scale, corresponding to a consequence rating of 25.

Exposure: 10

Probability: 7

Cost Factor: 6

The total implementation cost for the heated walkway system is $75,000. This amount falls within the $50,000–$100,000 range on the provided rating scale, which corresponds to a cost factor of 6.

Correction: 1

Risk Score: 1,750

Justification: 292 (Justified)
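For anyone who wants to check the arithmetic in the two cards, here is a minimal sketch. The risk score uses the formula given earlier (Consequence × Exposure × Probability). The justification formula, risk score divided by cost factor times correction factor, is inferred from the numbers on the cards and matches the usual form of the Fine method. The function and field names are illustrative, not VirtuOHS code, and the cutoff between "Justified" and "Not Justified" isn't stated here, so it isn't modeled.

```javascript
// Risk score as described in the article: Consequence x Exposure x Probability.
function riskScore({ consequence, exposure, probability }) {
  return consequence * exposure * probability;
}

// Assumption: justification = risk score / (cost factor x correction factor),
// rounded to the nearest whole number. Inferred from the cards, not from the tool.
function justification(ratings) {
  return Math.round(riskScore(ratings) / (ratings.costFactor * ratings.correction));
}

// Ratings the two assessments share.
const shared = { exposure: 10, probability: 7, costFactor: 6, correction: 1 };

// The only input that differs between the cards is the consequence rating.
const assessmentA = { ...shared, consequence: 5 };
const assessmentB = { ...shared, consequence: 25 };

console.log(riskScore(assessmentA), justification(assessmentA)); // 350 58
console.log(riskScore(assessmentB), justification(assessmentB)); // 1750 292
```

Both assessments pass through identical arithmetic. The five-fold difference in the result comes entirely from one rating, and choosing that rating is a judgment call the code cannot check.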

The same gap, with me on the other side of it

That cuts both ways, and the reverse is just as sharp. Hand me LLM-generated code and the gaps light up, because code is mine the way hazard assessment is hers.

Ask a model for a quick counter (increment a number every time something happens) and you'll get something like this. It might pass review. It might pass tests. A junior dev might ship it without a second thought.

const record = await db.counter.findById(id);
record.value += 1;
await db.counter.update(id, { value: record.value });

Fetch, increment, save. It reads correct, which is exactly why it survives review. But it's a read-modify-write race: two requests arrive at the same moment, both read the same value, both write value + 1, and one increment silently disappears. In a test environment that handles one request at a time, it works perfectly. Under real concurrent load it quietly undercounts, with no error and nothing in the logs to chase. The fix is to make the database do the increment atomically:

await db.counter.updateOne({ _id: id }, { $inc: { value: 1 } });

Obvious to me. Invisible to someone who's never been burned by concurrency. Same dynamic as the risk assessment, pointed the opposite way. The output looks right to anyone who can't evaluate it, and wrong on sight to anyone who can.
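To watch the lost updates happen without a real database, here is a small simulation. Everything in it is a stand-in: `makeFakeDb`, its `increment` method, and `runConcurrently` are hypothetical helpers written for this sketch, not any real driver's API, and the zero-delay timer only imitates a network round-trip.

```javascript
// Hypothetical in-memory stand-in for a database. Each call waits one timer
// tick to imitate a network round-trip, which is where requests interleave.
function makeFakeDb() {
  const rows = new Map([["hits", { value: 0 }]]);
  const tick = () => new Promise((resolve) => setTimeout(resolve, 0));
  return {
    async findById(id) {
      await tick();
      return { ...rows.get(id) }; // return a copy, like a fetched record
    },
    async update(id, fields) {
      await tick();
      Object.assign(rows.get(id), fields);
    },
    async increment(id, by) {
      await tick();
      rows.get(id).value += by; // read and write happen in one step
    },
    peek(id) {
      return rows.get(id).value;
    },
  };
}

// The pattern from the article: fetch, increment in application code, save.
async function readModifyWrite(db, id) {
  const record = await db.findById(id);
  record.value += 1;
  await db.update(id, { value: record.value });
}

// The fix: ask the store to do the increment itself.
async function atomicIncrement(db, id) {
  await db.increment(id, 1);
}

// Fire `requests` calls at once against a fresh store and return the final count.
async function runConcurrently(bump, requests) {
  const db = makeFakeDb();
  await Promise.all(Array.from({ length: requests }, () => bump(db, "hits")));
  return db.peek("hits");
}

runConcurrently(readModifyWrite, 50).then((n) => console.log("read-modify-write:", n));
runConcurrently(atomicIncrement, 50).then((n) => console.log("atomic increment:", n));
```

With 50 simultaneous requests, the atomic version ends at 50. The read-modify-write version ends well short of that, because the requests read the old value before the earlier writes land. Run either version with a single request and both end at 1, which is why a one-request-at-a-time test never notices.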

Why review depends entirely on the reviewer's expertise

Here's the rule both stories obey, stated plainly: a reviewer only improves output on the dimensions where they could have produced or verified the right answer themselves. Everywhere else, they're not reviewing. They're reading.

Maegan could conduct the assessment herself, so she can tell whether the tool's interview was sound. I could write the atomic increment, so I'd catch the race. Flip the assignments and both of us miss, not through carelessness, but because catching the error required already knowing the answer to check against. "Add a human in the loop" treats the reviewer as a generic quality function: drop any person in, quality goes up. It doesn't, unless that person can evaluate the output. And the sign-off doesn't just fail to add quality. It manufactures confidence the review never earned.

The model makes this harder by producing everything at the same confident polish. A well-conducted assessment and a sloppily conducted one print the same clean score. The racing code and the correct code read equally finished. There's no tell in the output. You learn which side of the line you're standing on only when someone who has the answer you're missing points at the thing you couldn't see.

You can't automate expertise you don't have

So how is VirtuOHS any good, if I can't evaluate it? Because I didn't review my way to quality. I built the expertise into the system.

It started with the questions. You might think I could've skipped Maegan and asked an LLM for the interview script myself. I could have. An LLM will happily generate a full OHS interview: confident, plausible, formatted like an expert wrote it, and I'd have no way to tell if a word of it held up. Same trap, one step earlier in the pipeline.

She did use an LLM to draft questions. The difference is she could read what it produced and know which questions were sharp, which were too vague to pin a rating, and which a real assessor would never ask. She wasn't writing from scratch; she was tuning, and tuning is the part that takes the expertise. The LLM wrote faster than she would have. It didn't know whether the questions were any good. She did.

Then, round after round, she'd run the tool, read the transcripts, and point at where the interview still went wrong: the follow-up a real assessor would have asked and the bot didn't, the answer it should have pushed back on but accepted instead. I'd encode each catch into the system: the questions it must ask, the answers it can't accept at face value, the things it has to probe for. The tool got better in a specific, bounded sense: it stopped making the mistakes she'd already caught.

That sounds like I automated her expertise away. I didn't, and the distinction is the whole point. The tool conducts a sound interview on the dimensions she has already reviewed, and an unknown one everywhere beyond them. I never had the OHS judgment myself. I had a pipe to someone who did, and I poured what came through it into the system, one corrected question at a time. The quality is real, but it's borrowed, and it only covers ground she already walked.

Which sets a ceiling I can feel but can't see past. Every question the tool should ask but doesn't is still missing, indistinguishable from the ones it gets right, because the output looks equally polished either way, the same way that clean risk score looked fine to me regardless of whether the interview behind it was any good. I can't even tell you how much is uncovered, because measuring it would take the expertise I don't have. That's not false modesty; it's the trap one level up. I can't evaluate the tool's blind spots for the same reason I couldn't evaluate a single one of its assessments. Maegan isn't a crutch I'll engineer away. She's the source the tool is made of.

VirtuOHS is marketed, accurately, as not needing a safety specialist on staff. That's true, and it's true because a safety specialist built it. The expertise didn't stop mattering. It got moved upstream, out of the user's chair and into the system, where you can't see it and might forget it was ever required. The expertise I don't have is precisely the expertise I can't automate. I can only borrow it, and a tool is just borrowed expertise with the lender's name filed off.

AI didn't close the skill gap. It moved it somewhere you can't see it from inside the loop. The output looks the same whether you're qualified to judge it or not, which means the only honest way to know if your human in the loop is doing anything is to ask what, specifically, they would be able to catch. If the answer is nothing, you don't have a checkpoint. You have a witness. And even when the answer is everything, the expert still has to actually work the problem — skill without effort is just a more convincing rubber stamp.

SuedePritch