AI Hiring: A Coin Flip for Your Career

Remember that promise? The one where shiny, new AI would finally slay the dragon of hiring bias? Where algorithms, pure and logical, would give every candidate a fair shot based on nothing but their skills?

Yeah, about that.

We’ve known for decades that hiring is a mess. The classic 2004 study by Bertrand and Mullainathan proved it: résumés with “White-sounding” names like Emily and Greg got 50% more callbacks than identical ones with “African-American-sounding” names like Lakisha and Jamal. In response, corporate America rolled out years of mandatory sensitivity training and anti-prejudice workshops. Then came AI, which was supposed to be the ultimate fix, the objective referee that would finally solve our flawed human judgment.

But what happens when you train a supposedly “objective” system on a world brimming with bias?

I went back to my roots in experimental psychology and started testing. I put the eight AI platforms that account for over 99% of all LLM activity in the U.S. (platforms like ChatGPT, Gemini, and Claude) through their paces. My findings are not just disappointing; they’re alarming. The idea that AI is an unbiased hiring tool is a dangerous myth. These systems aren't just echoing our old prejudices; they're automating them at scale.

The real-world test: No code, no tricks

First, I created the résumés. I deliberately looked past the entry-level roles that fill most studies to focus on candidates who are doing everything right: the ones with degrees and a clear career path. I created résumés for candidates applying to senior white-collar roles: a Director of Finance, an HR Manager, and a Senior Vice President of Fraud, picked to cover a few of the most common corporate functions. The résumés for each role were identical, except for the candidate's name, representing Black, Hispanic, and White females and males.

To get a clear picture, I skipped the fancy code and APIs. I wanted to see what happens when an actual overworked hiring manager uses AI to lighten their load. I opened a web browser, went to these AI sites, pasted in each high-quality résumé, and gave a simple command, essentially “Act as an experienced recruiter and rate this candidate for this job on a scale of 1-100.”

The results were immediate and infuriating.

The bias is a tangled, intersectional mess

Let’s be clear: the biases were real and significant. But this wasn't a simple case of White privilege; the reality was far stranger and more chaotic. The bias was complex and intersectional, changing based on the specific AI model and the job role in question.

In some cases, a model might favor a Black female candidate for one role, only to penalize her for another. In others, models penalized candidates just for having a name that signaled any demographic.

This chaos is arguably more damaging than predictable prejudice. Imagine a company trying to audit its AI for bias; one month it finds favoritism towards Hispanic males, and the next, after a model update, it finds the opposite. It makes the problem impossible to reliably track or fix. This isn't just automating the old, familiar prejudices. It's creating new, arbitrary, and untraceable forms of bias. It's an opaque system that doesn't just discriminate; it discriminates unpredictably.

A coin flip for your career

But unpredictable bias wasn't the only problem. It wasn't just about prejudice. I found a staggering lack of fundamental consistency. Some of these AI models backed by billions in funding are just plain bad at their jobs.

This erratic behavior isn’t limited to hiring; it seems to be a fundamental issue with how these models operate. We’ve seen this publicly, like when Google recommended putting glue on pizza or when Grok referred to itself as “MechaHitler.”

In one instance, a model’s ratings swung so wildly that a candidate could be scored a 53 one minute and a 97 the next. That’s not a rounding error; that’s a coin flip deciding someone’s career. This level of randomness means the tool isn't just biased; it's fundamentally unreliable. It’s a high-tech game of roulette with people’s careers on the line.

Holding up a flawed mirror

An LLM doesn't have to be malicious to produce bigoted results. It just needs to learn from our messy, biased, human world. These models are trained on a vast ocean of text from the internet, books, and articles, data that is saturated with the stereotypes and inequities of our society. The AI is simply holding up a mirror.

The stakes of this flawed reflection are terrifyingly high. As companies rush to adopt AI for the sake of “efficiency,” they risk building a new, invisible wall that blocks out entire communities of talented people, a digital gatekeeper that penalizes you for your name. For any business using these tools, the question is unavoidable: how can you be sure your AI isn’t quietly turning away your best candidates for reasons you can’t see or control?

The companies building these models aren't ignorant of this issue. They have dedicated teams where ethicists work alongside engineers to fine-tune the models and mitigate bias. But their very attempts at a solution may be part of the problem. It is highly likely that these efforts to steer the AI away from one form of bias are inadvertently creating the chaotic, tangled mess of intersectional biases my research uncovered. In trying to apply simple fixes to a complex system, they may be making the problem less predictable, but no less harmful.

So, should we unplug it all? Definitely not.

But we have to stop treating these tools like infallible oracles. We have to start demanding transparent, third-party bias audits. An effective audit must go beyond simple demographic parity checks; it should test for consistency by running the same résumé multiple times and measure core competence by using a mix of qualified and unqualified candidates to see if the AI can even tell the difference. We need to remember that an AI’s output isn't objective truth; it's a reflection of its training data, warts and all.

The dream of a perfectly fair hiring tool is still just a dream. For now, keeping a human in the loop isn't a luxury or a feature; it's a question of accountability. Until AI can be held accountable, the hard work of fairness is still on us, the people using the tools. Our judgment is the last and most critical line of defense against automating society’s prejudices.

If you’d like to read the full research paper, visit https://arxiv.org/abs/2507.11548