Inside the Review Board

Lessons from Applied AI Experimentation.

Why I Built It

Applying to jobs is a black box. You send off applications and get a couple of automated emails back. Even if you get an interview, it is very difficult to get meaningful feedback (believe me, I’ve asked), and without good data, there’s no chance to adjust before it’s too late.

I’ve sat on 42 hiring review boards myself, leading two dozen of them, and I know firsthand how decisions actually get made: not by a single reader, but through the push and pull of multiple perspectives across disciplines and levels. That dynamic is invisible to candidates — and yet critical.

So I built Resumagic’s Review Board, a simulated panel of six AI “reviewers” — HR, Technical, Design, Finance, CEO, and Peer — each with their own rubric, role, and temperament. The goal wasn’t to replace real committees. It was to give candidates credible, early feedback loops: a way to catch blind spots, adapt coverage to the job at hand, and strengthen materials before hitting submit.
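
For concreteness, a persona here is just a small bundle of configuration: a role, a temperament, a rubric, and a sampling temperature. The sketch below is illustrative rather than Resumagic's actual schema (the class name, fields, and rubric wording are all my assumptions), but it shows the shape of the roster.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    """One simulated reviewer on the board (illustrative schema, not Resumagic's)."""
    name: str           # e.g. "HR", "Technical", "CEO"
    role: str           # how this reviewer reads an application
    temperament: str    # tone instruction folded into the system prompt
    rubric: list[str]   # criteria this persona scores
    temperature: float  # sampling temperature tuned per persona and model

REVIEW_BOARD = [
    Persona("HR", "screens for role fit and red flags", "procedural, risk-aware",
            ["role fit", "clarity", "consistency"], 0.4),
    Persona("Technical", "probes depth of hands-on work", "skeptical, detail-oriented",
            ["technical depth", "evidence of impact"], 0.6),
    Persona("Design", "reads for user and craft sensibility", "curious, qualitative",
            ["user focus", "design depth"], 0.6),
    Persona("Finance", "asks what the numbers were", "terse, metrics-first",
            ["quantified results", "business acumen"], 0.5),
    Persona("CEO", "judges strategic altitude", "direct, big-picture",
            ["leadership", "strategic framing"], 0.7),
    Persona("Peer", "asks what you are like to work with", "candid, informal",
            ["collaboration", "communication"], 0.7),
]
```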

The Approach: Build It Like Research, Not a Toy

I treated this less like a hackathon script and more like applied research:

  • Personas crafted deliberately. The six cover hierarchy (executive vs IC), domain (finance vs design), and perspective (manager vs peer), which ensured the mix could surface tradeoffs: you might look like a great boss but a weak direct report, or strong on users but shaky on metrics.
  • Rubrics enforced discipline. Each persona scored against a structured rubric, producing outputs that were comparable across candidates and across experiments.
  • Baselines established first. I ran baseline reviews using frontier models and then smoke-tested them myself. Only once those felt credible did I start iterating with lighter, faster local models.
  • Calibration was continuous. I tested configurations until scores landed consistently within ±5% of baseline; a minimal version of that check is sketched just after this list. That’s when I knew a setup was “stable enough” for real use.
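
The calibration check itself is simple arithmetic. A minimal sketch, assuming mean rubric scores per persona on a 1-to-5 scale (the function and the sample numbers are illustrative, not pulled from real runs):

```python
def within_tolerance(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.05) -> bool:
    """True if every persona's mean score stays within ±tolerance
    (relative) of the frontier-model baseline run."""
    for persona, base in baseline.items():
        drift = abs(candidate[persona] - base) / base
        if drift > tolerance:
            print(f"{persona}: drifted {drift:.1%} from baseline")
            return False
    return True

# Hypothetical mean rubric scores (1-5 scale) per persona.
baseline_run = {"HR": 3.8, "Technical": 3.2, "CEO": 3.5}
local_run    = {"HR": 3.7, "Technical": 3.4, "CEO": 3.5}
print(within_tolerance(baseline_run, local_run))  # Technical drifts ~6%, so False
```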

Key Insights From the Build

Temperature tuning is everything.

One of the biggest surprises: the “right” temperature varied by model.

  • Too low: personas collapsed into consensus, even over-favoring underqualified candidates.
  • Too high: chaos — unreliable and noisy outputs.
  • Tuned correctly: each persona gave differentiated, constructive feedback. The result was a richer, more useful spread (a minimal sweep sketch follows this list).
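
To make that concrete, here is a rough sketch of the kind of sweep involved. It assumes a local Ollama server on its default port and a persona prompt that ends each review with a line like "OVERALL: 3.5/5"; the endpoint, the model tag, and the score format are my assumptions, not a description of Resumagic's internals.

```python
import re
import statistics

import requests  # third-party HTTP client: pip install requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumes a local Ollama server

def review(system_prompt: str, resume: str, model: str, temperature: float) -> str:
    """Ask one persona to review a resume at a given sampling temperature."""
    resp = requests.post(OLLAMA_URL, json={
        "model": model,  # local model tag, e.g. "deepseek-r1:8b"
        "stream": False,
        "options": {"temperature": temperature},
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": resume},
        ],
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["message"]["content"]

def overall_score(review_text: str) -> float:
    """Pull the final score out of a review. Assumes the persona prompt asks
    for a closing line like 'OVERALL: 3.5/5' (my convention, not Resumagic's)."""
    match = re.search(r"OVERALL:\s*([\d.]+)\s*/\s*5", review_text)
    return float(match.group(1)) if match else 0.0

def temperature_sweep(system_prompt: str, resume: str, model: str,
                      temperatures=(0.2, 0.5, 0.8, 1.1), runs: int = 3) -> None:
    """Score the same resume several times per temperature and print the spread:
    near-zero spread means the persona collapses into consensus,
    a large spread means the output has gone noisy."""
    for temp in temperatures:
        scores = [overall_score(review(system_prompt, resume, model, temp))
                  for _ in range(runs)]
        print(f"T={temp}: mean={statistics.mean(scores):.2f} "
              f"spread={statistics.pstdev(scores):.2f}")
```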

Model choice isn’t obvious. In practice, dolphin-3 outperformed deepseek-r1:8b for depth. My hypothesis: dolphin’s uncensored nature made it more willing to offer candid, even critical feedback — a better fit for this problem space.

Anti–people-pleasing prompt design matters. Left unchecked, models tried to flatter the candidate. By continually calibrating against three fit profiles (weak, average, strong), I forced the reviewers to be critical. That discipline made the feedback useful.
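
In practice that came down to two things: blunt instructions in every persona's system prompt, and a standing sanity check against the three reference profiles. Both are sketched below; the wording, file paths, and scoring hook are illustrative, not the production prompt or code.

```python
# Appended to every persona's system prompt (illustrative wording).
ANTI_FLATTERY = """
You are not here to encourage the candidate. Score strictly against the rubric.
If evidence for a criterion is missing, score it low and say what is missing.
Do not soften findings or add unearned praise. A review with no criticism
is a failed review.
"""

# Three fixed reference applications used for calibration (paths are placeholders).
FIT_PROFILES = {"weak": "fixtures/weak.md",
                "average": "fixtures/average.md",
                "strong": "fixtures/strong.md"}

def passes_calibration(score_resume) -> bool:
    """A configuration is kept only if it ranks the reference profiles correctly.
    score_resume(path) -> mean rubric score across the board (hypothetical hook)."""
    weak, average, strong = (score_resume(FIT_PROFILES[k])
                             for k in ("weak", "average", "strong"))
    return weak < average < strong
```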

Rubrics did double duty. They not only structured the candidate feedback, they became instrumentation for my own experiments. By measuring variance, alignment to baseline, and spread of scores, I could evaluate model configs scientifically.
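
Concretely, once every review is a structured set of rubric scores, evaluating a model configuration collapses into a few summary statistics. A rough sketch of that instrumentation, with field names of my own choosing:

```python
import statistics

def config_metrics(scores: dict[str, list[float]],
                   baseline: dict[str, float]) -> dict[str, float]:
    """Summarize one model configuration from its rubric scores.

    scores:   persona name -> overall scores from repeated runs (1-5 scale)
    baseline: persona name -> mean score from the frontier-model baseline
    """
    means = {p: statistics.mean(s) for p, s in scores.items()}
    return {
        # run-to-run noise within each persona, averaged across the board
        "variance": statistics.mean(statistics.pvariance(s) for s in scores.values()),
        # average drift from the baseline run
        "baseline_drift": statistics.mean(abs(means[p] - baseline[p]) for p in means),
        # how much the personas actually disagree with one another
        "persona_spread": max(means.values()) - min(means.values()),
    }
```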

Why It Matters for Candidates

Later in your career, you’ve done a lot, which also means it’s easy to forget something or to underplay skills you take for granted. I once got dinged on “UX experience” even though I’d spent years as a UX designer, taught it professionally, and run campaigns for major clients.

The Review Board is designed to catch those blind spots — and adapt that coverage to each specific job posting. One persona might flag “metrics are thin,” another “design depth is undersold.” Outliers become signals, not noise. The result: applications that are more balanced, credible, and resilient across multiple readers.

And all of this happens privately. Unlike third-party cloud tools, nothing leaves my machine. Sensitive materials stay local.

Generalizable Lessons for Applied AI

The Review Board is just one feature in Resumagic, but the lessons carry further:

  • Always create baselines with both humans and strong models before you iterate.
  • Treat temperature as a variable, not a set-and-forget constant.
  • Design personas to reflect real-world diversity of perspective, not just generic roles.
  • Use rubrics as experimental instrumentation, not just output format.
  • Favor local, explainable loops over flashy but opaque systems — especially in high-stakes contexts.

These aren’t breakthroughs in academic AI research. They’re lessons from building in the wild. But I think they highlight something important: applied experimentation, when done rigorously, can surface insights that are credible, transferable, and useful.

What’s Next

  • Prompt compression to reduce latency.
  • Expanded reuse for other opaque, multi-stakeholder processes: vendor evaluations, performance reviews, job description scoring.
  • Reporting layer improvements for easier digestion of persona outputs.

Closing Thought

LLMs are trained on the data that exists. In tech, where people take up outsized space online, the simulated reviewers come across as credible and constructive. But what about simulating underrepresented populations or emerging fields? My hypothesis: the results would be less convincing.

That’s why our work at Wikimedia — building a more diverse corpus of open data — was so essential. Because if AI is going to play a role in shaping decisions, it has to work for everyone, not just the loudest or most represented groups.
