There's a Dirty Secret in the AI Industry

Dr. Sari Sabban
Co-Founder & CTO
Apr 10, 2026 · 5 min

There's a dirty secret in the AI industry: large language models will confidently fabricate information when they don't know the answer. This isn't a bug in any particular vendor's product or something the next model update will fix. It's a fundamental property of how the technology works.

These models generate the most statistically likely next word. They have no mechanism for knowing whether what they're saying is true. Every provider, every model, every version shares this limitation.

In casual conversation, that's amusing. In construction procurement, where a misread score can distort an entire technical evaluation, or a fabricated clause reference can undermine months of due diligence, it's unacceptable.

What does a plausible hallucination look like in practice? A well-written assessment of a criterion the contractor never actually addressed. No obvious error flags. No broken logic. Just a confident, coherent paragraph about something that isn't in the submission. That's the failure mode that matters most in high-stakes procurement, not the error that looks wrong, but the one that looks right.

At TruBuild, we process tender packages that run into thousands of pages: RFP documents, evaluation criteria, technical submissions from multiple contractors, post-tender clarifications across successive rounds. The question we obsessed over from day one wasn't "can AI read these documents?" — it was "how do we make absolutely certain it doesn't lie about what's in them?"

The answer turned out to be deceptively simple in principle and painstaking in practice: use AI for what it's good at, and don't use it for what it isn't.

Architecture of Distrust

Every AI output passes through multiple validation layers before reaching a user:

1. Document Ingestion (deterministic): tender documents are split into processable chunks.
2. AI Assessment (AI processing): the language model reads and evaluates against the criteria.
3. Structural Validation (checkpoint): does the output conform to the expected format? Catches malformed responses.
4. Source Verification (checkpoint): do cited sources point to actual documents? Catches fabricated references.
5. Scoring Validation (deterministic): a fixed mapping from assessment to score.
6. Logic Consistency Check (checkpoint): does the evaluation logic hold across criteria? Catches contradictory assessments.
7. Human Review (checkpoint): the final checkpoint, with full source visibility. Catches any remaining issues.
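To make the shape of this pipeline concrete, here is a minimal sketch of a staged pipeline where any failed checkpoint stops processing and surfaces its issues instead of passing a bad result downstream. The stage names and data shapes are illustrative, not TruBuild's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StageResult:
    ok: bool          # did this stage (or checkpoint) succeed?
    payload: dict     # data handed to the next stage
    issues: list      # problems found, if any

def run_pipeline(package: dict, stages: list) -> StageResult:
    """Run each (name, stage) in order; stop and flag at the first
    failed checkpoint rather than continuing with a bad output."""
    payload = package
    for name, stage in stages:
        result = stage(payload)
        if not result.ok:
            # Surface the concrete problems instead of a confident wrong answer.
            return StageResult(False, payload, [f"{name}: {i}" for i in result.issues])
        payload = result.payload
    return StageResult(True, payload, [])
```

The key design choice is that a checkpoint cannot be skipped: every output either passes all of them or arrives at a human flagged with its specific failures.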

Most of Our System Isn't AI at All

This surprises people. They assume an "AI-powered platform" means a language model is doing everything. In our case, the majority of the pipeline is deterministic: traditional programming, structured data extraction, rule-based validation, arithmetic verification.

The AI handles what genuinely requires comprehension: reading a contractor's methodology statement to assess whether it substantively addresses an evaluation criterion, recognising when a response is thorough versus superficial, identifying where a submission has gaps against the RFP requirements. That's what language models are actually good at — reading with judgment.

But the moment you need a score? That's derived through a fixed, auditable mapping. The model provides a qualitative assessment against defined criteria, the equivalent of an evaluator's professional judgment call, and the numeric score follows deterministically from that assessment. The AI never picks a number out of thin air.
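A sketch of what such a fixed mapping can look like, assuming hypothetical rating bands and point values (the real bands and weights would come from each tender's evaluation framework):

```python
# Hypothetical rating bands; real band names and point values are
# defined by the tender's evaluation framework, not by the model.
RATING_TO_SCORE = {
    "fully_addresses": 5,
    "substantially_addresses": 4,
    "partially_addresses": 2,
    "not_addressed": 0,
}

def score_criterion(rating: str) -> int:
    """Map a qualitative rating to a numeric score via a fixed lookup.
    An unrecognised rating is an error to be flagged, never a guess."""
    if rating not in RATING_TO_SCORE:
        raise ValueError(f"Unrecognised rating: {rating!r}")
    return RATING_TO_SCORE[rating]
```

Because the mapping is a plain lookup, every score is auditable: the only question a reviewer needs to ask is whether the qualitative assessment itself is justified by the evidence.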

Evaluation That Traces Back to the Page

When our system assesses a contractor against a technical criterion (say, their approach to risk management, or the qualifications of key personnel), every finding points back to where in the submission that conclusion came from: the document, the section, and the page. If the AI can't ground a claim in the source material, the claim doesn't make it into the output.
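One simple way to enforce that kind of grounding is to check every citation against an index of the documents actually in the tender package, and reject any finding whose citations don't resolve. This is an illustrative sketch under assumed data structures, not TruBuild's internal code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    document: str   # e.g. a submission filename
    section: str    # e.g. "4.2 Risk Management"
    page: int

def verify_citations(citations, package_index):
    """Return a problem description for every citation that does not
    resolve to a real document and page in the tender package.

    package_index maps document name -> set of page numbers that exist.
    An empty return list means every claim is traceable to the source."""
    problems = []
    for c in citations:
        pages = package_index.get(c.document)
        if pages is None:
            problems.append(f"Unknown document: {c.document}")
        elif c.page not in pages:
            problems.append(f"{c.document} has no page {c.page}")
    return problems
```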

This becomes especially important across evaluation rounds. When contractors submit updated responses after post-tender clarifications, the system needs to understand what changed, what improved, and what stayed the same, all while keeping the assessment anchored to actual documents, not to its memory of what the previous submission said. Each round is evaluated against what's on the page, not against what the model thinks it remembers.

The Architecture of Distrust

We designed our system assuming the AI will hallucinate, and built layers to catch it when it does.

Every output goes through structural validation before it reaches a user. Does the assessment conform to the expected shape? Do the cited sources point to documents that actually exist? Does the evaluation logic hold together across criteria? When it doesn't (and with large language models, sometimes it doesn't), the system retries with targeted corrections. If it still fails, it flags the issue explicitly rather than presenting a confident wrong answer. We use statistical agreement tests and agent-on-agent validation to make the evaluation essentially self-correcting.
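The retry-with-targeted-corrections loop can be sketched as follows. The required fields and correction prompt are hypothetical; the point is the structure: validate, feed concrete problems back, and flag explicitly after the retry budget is spent.

```python
import json

# Hypothetical output schema for a single criterion assessment.
REQUIRED_FIELDS = {"criterion", "rating", "citations", "rationale"}

def validate_structure(raw):
    """Check that a model response parses as JSON and has the expected
    fields. Returns (data, issues); data is None when validation fails."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, [f"Not valid JSON: {e}"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        return None, [f"Missing fields: {sorted(missing)}"]
    return data, []

def assess_with_retries(call_model, prompt, max_retries=2):
    """Retry with the concrete problems fed back; if the output still
    fails, return an explicit flag rather than a broken assessment."""
    issues = []
    for _ in range(max_retries + 1):
        data, issues = validate_structure(call_model(prompt))
        if data is not None:
            return data
        # Targeted correction: tell the model exactly what was wrong.
        prompt = prompt + "\n\nFix these problems: " + "; ".join(issues)
    return {"flagged": True, "issues": issues}
```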

Can we guarantee zero hallucination? No one honestly can. Not us, not any provider, not any research lab (yet). But we can make hallucination structurally difficult, detectable when it occurs, and inconsequential when it slips through, because a human with full source visibility is always the final checkpoint. That's the difference between hoping the AI gets it right and engineering a system where it doesn't matter if it occasionally doesn't.

Documents as the Single Source of Truth

Our analysis never draws from the model's general training knowledge. A contractor's track record is assessed based on what they wrote in their submission, not on what the model might know about them from the internet. When the system says a submission doesn't address a requirement, it's because it looked through the provided documents and didn't find it, not because the model couldn't remember.

Tender packages are frequently too large to process at once; with multi-contractor, multi-round technical evaluations, a package can run to 100M+ words. We've developed custom methods to manage that complexity without losing factual content. The emphasis is always on preserving what's actually in the documents over producing something that reads fluently.
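A common building block for handling oversized documents is overlapping chunking: splitting text into windows that share a margin, so no passage is lost at a boundary and each chunk remains traceable to its source span. This is a generic sketch of that technique, not TruBuild's proprietary method:

```python
def chunk_words(text, chunk_size=2000, overlap=200):
    """Split text into overlapping word-level chunks. Overlap ensures a
    sentence straddling a boundary appears whole in at least one chunk."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end
    return chunks
```

In a real pipeline each chunk would also carry its document name and page range, so any assessment built on a chunk stays citable back to the page.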

Why the Human Stays in the Loop

The most important architectural decision we made wasn't technical. We don't believe AI should make procurement decisions. We believe it should make procurement professionals faster and more thorough, surfacing what they'd find themselves if they had unlimited time to read every page of every submission against every criterion. That's a distinction with real consequences for how the system is built.

So, the system surfaces its assessment, and the evaluator reviews it with full visibility into the evidence. They see the reasoning, they see where it came from, and they adjust scores where their professional judgment differs. The AI did the reading. The human makes the call.

That isn't a limitation of our technology. It's the point of it.

The Real Benchmark

The construction industry has spent decades developing rigorous evaluation frameworks for good reason. The stakes are too high, and the complexity too deep for shortcuts. AI doesn't change that. What changes is the throughput: the ability to apply that rigour consistently across every criterion, every contractor, every page, every round of clarifications.

But only if you can trust what it tells you.

The dirty secret we started with isn't going away. But a system engineered around it — one that treats hallucination as a design constraint rather than a footnote — changes what's possible. That's what we built for.


See how TruBuild ensures accuracy and traceability


Written by Dr. Sari Sabban

Sari Sabban is the CTO of TruBuild, leading the development of AI systems that bring structure and transparency to construction procurement. He is also an Assistant Professor at King Abdulaziz University, specialising in artificial intelligence, biomolecular simulations, and computational drug discovery. A twice Fulbright Scholar, he has conducted research at institutions including Duke University.

