AI Consistency vs Human Evaluator Variance: What Three Vision 2030 Pilots Revealed
Procurement Intelligence

Yiannis Gkrimpizis
CCO at TruBuild
Apr 1, 2026 · 5 min read

I've spent most of my career running procurement processes — at Gleeds, CBRE, Turner & Townsend. I've built scoring frameworks, chaired evaluation panels, and signed off on contractor recommendations worth hundreds of millions. I believed, as most practitioners do, that a well-structured panel of experienced assessors was the closest thing to an objective evaluation.

Earlier this year, running live parallel evaluations alongside three major real estate developers on Vision 2030 programmes in Saudi Arabia, I found data that made me question that assumption directly.

We ran the same tenders through TruBuild simultaneously with each client's own manual process — same criteria, same weightings, blind to each other's scores. Then we compared.

The finding that stopped me wasn't the efficiency gap, though that was significant. Across two of the three evaluations, the average scoring variance between human assessors on the same panel was 7–8.7%. In those same evaluations, TruBuild's deviation from the panel's collective consensus was 2.8% and 6% respectively. In both cases, TruBuild was more consistent than the human assessors were with each other.

[Figure: Scoring Consistency — AI vs Human Panel Variance. Bar chart comparing, for Evaluations 1 and 2, the average scoring variance between human assessors on the same panel ("Human Assessor Variance") against TruBuild's deviation from the panel consensus ("TruBuild AI Deviation"); y-axis: Variance (%).]

Key Finding: In both evaluations, TruBuild's deviation from the panel consensus was lower than the average variance between human assessors themselves, demonstrating greater consistency in scoring application.
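To make the two metrics concrete: the article does not publish the exact formulas behind these percentages, so here is one plausible reading, sketched as a minimal assumption — panel variance as the average pairwise absolute difference between assessor scores (expressed as a percentage of the scoring scale), and AI deviation as the absolute gap between the AI score and the panel mean. The numbers in the example use the 0, 0, 4 criterion discussed below; the AI score of 2 is hypothetical.

```python
from itertools import combinations

def panel_variance(scores, scale=5):
    """Average pairwise absolute difference between assessors,
    as a percentage of the scoring scale (illustrative assumption)."""
    pairs = list(combinations(scores, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs) / scale * 100

def ai_deviation(ai_score, scores, scale=5):
    """Absolute gap between the AI score and the panel consensus
    (mean), as a percentage of the scale (illustrative assumption)."""
    consensus = sum(scores) / len(scores)
    return abs(ai_score - consensus) / scale * 100

# The 0, 0, 4 criterion from the article:
human = panel_variance([0, 0, 4])   # ≈53.3% — extreme disagreement
ai = ai_deviation(2, [0, 0, 4])     # hypothetical AI score of 2 → ≈13.3%
```

On this reading, even a single badly split criterion can dominate a panel's variance figure, which is why averaging the scores without a comment trail hides rather than resolves the disagreement.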

That consistency finding matters most when you look at what's underneath it — the specific scoring decisions driving the variance.

In one evaluation, a single criterion produced this result across three assessors on the same panel: 0, 0, and 4 out of 5. No written comments explained the discrepancy. The scores were averaged, the ranking was produced, and the process moved forward. Nobody flagged it.

[Figure: Unexplained Scoring Discrepancy on a Single Criterion. Bar chart of the scores given by Assessors 1–3 on the same criterion (0, 0, and 4 out of 5), with no written justification for the variance; y-axis: Score (out of 5).]

Critical Issue: A single criterion produced scores of 0, 0, and 4 out of 5 from three assessors on the same panel. No written comments explained the four-point spread. The scores were averaged and the process continued without anyone flagging the discrepancy.

This wasn't isolated. A pattern emerged across the evaluations: high scores rarely came with written justification; low scores occasionally did, and only briefly. The result is an evaluation record where the most consequential decisions are the least documented.

There's a second finding I didn't expect. In one evaluation, a contractor produced a disorganised submission — relevant information buried across misnamed files. Manual assessors, working under time pressure and relying on PDF searches, missed content that was demonstrably there. TruBuild found it regardless of where it sat in the document structure. The contractor's actual capability was assessed. The presentation penalty disappeared.

This cuts both ways: a well-presented weak bid can outscore a disorganised strong one, and nobody on the panel necessarily knows that's happening.

The time and cost numbers varied across the three pilots but were consistent in direction. Two evaluations ran against a four-week manual process costing $31,800 — TruBuild completed both in two hours for $3,500. The third ran against a three-week process costing $21,800 — same two-hour result from TruBuild, at $2,200. Across all three, total evaluation time including report creation dropped from three to four weeks down to four days.

But efficiency isn't the point I'm making. The organisations we work with aren't primarily trying to save money on evaluation. They're running programmes where a challenged procurement decision doesn't just delay a contract — it delays a project, triggers a legal process, and puts careers on the line.

The question I keep returning to is not whether AI should support evaluation. It's whether unexplained scoring variance between assessors on the same panel — with no audit trail — is actually acceptable in an environment where every decision must be defensible. Having sat inside these processes for most of my professional life, I think the honest answer is no.


TruBuild is running pilot evaluations with project owners and consultancies across the GCC and UK. If you're running a complex tender and want to see how your evaluation compares, try the demo.

Written by

Yiannis Gkrimpizis

Yiannis Gkrimpizis is the Chief Commercial Officer at TruBuild, bringing over 10 years of experience in the construction industry with governments and Fortune 500 companies. He has worked with leading firms including Gleeds, Turner & Townsend, and CBRE, where he led product and cross-functional teams building digital solutions for construction.
