Ryan Henderson

Failed chemist

Making a fair technical hiring challenge

How to differentiate applicants while respecting their time

I always hated coding challenges as part of a job interview. The more interesting the company, it seemed, the more vague and broad the task. And after pouring a weekend or two into building something, I’d fire off the email as the deadline approached and — nothing. For weeks. I never got any feedback on any task without follow-up, and in one case I didn’t even get the courtesy of a rejection despite an extensive coding challenge and an onsite interview!

Anyone interviewing frequently for software engineering positions can relate. Exhausting, irrelevant interviews are a staple of our industry. For my part, I swore off coding challenges and actively lobbied against implementing them at subsequent jobs.
I recently reconsidered, and want to share our motivations, how we tried to make the process fair for everyone involved, and what we learned, including some data and suggestions on how the process could be improved. Finally, we speculate on how we could make the pipeline more balanced, for instance with respect to gender.

We were hiring for a Python backend position, and the challenge is available here.

Motivations

What changed my mind about coding challenges? I had one cynical motive, and two nobler ones.
The cynical motive
I might rightfully lose a few of you here. I became the CTO of a new startup (corrux, and yes we are hiring), and was simply buried in the amount of time it took to hire for any position. Candidates with solid resumes that did well on a phone screen simply couldn’t answer more substantial questions in an onsite interview. The few that could turned down the position. I desperately needed a better system.
The candidates asked
Candidates actually asked me for a coding task in lieu of — or in addition to — a technical interview. This was a big surprise for me, but their reasoning was straightforward: corrux is still a small and relatively unknown company. They want to see what kind of work they’d be doing day-to-day, and whether we can come up with a task that’s engaging, respects their time, and is realistic.
In the end, no one opted for the tech interview over the task.

Our interviews were inconsistent
I asked one of our junior developers to sit in on an interview for an internship candidate, a job she herself had interviewed for only a few months earlier. Afterwards, she pointed out that the format of this interview was quite different from the one she had experienced. She was right to insist we introduce some consistency to the process, and pointed me to the canonical Matasano hiring post; if you haven’t read it, you definitely should!

What's fair?

What makes a challenge fair is of course up for debate. We’re dealing with at least two definitions of fair here: fairness in what we ask of candidates and fairness in evaluation. We want our process to be fair to candidates and fair among candidates.
Here are the broad points I wanted to satisfy, informed by the Matasano hiring post and my own negative experiences with hiring challenges.
Objective scoring
A candidate should know exactly how they are being reviewed. This is probably the most important thing you can do to make your programming challenge fair. The benefits to you and the candidate are manifold:
  • Objective scoring forces you to think through how you will evaluate the submissions in a standardized way. This will not only save you and other reviewers a lot of time, it also tells the candidate a lot about your culture and values. For instance, do you put lots of weight on clean, well-architected code, or do you only score on getting the correct answer? Does the complexity of the implementation matter, even if it’s right? Does the candidate get extra points for including tests and documentation?
  • It helps a time-strapped candidate prioritize which parts of the solution to work on.
  • It allows you to quickly build a distribution of scores, which shows you the quality of candidates applying to the position. This can take some time depending on the variance of the responses, but it will tell you when someone gives an exceptional response, or whether your challenge is too easy or too hard.
We spent a lot of time trying to come up with an objective scorecard that still allowed good differentiation between candidates. For reference, here is how we scored submissions.
Respect the candidates’ time
A broad, ill-defined task shows a lack of respect for the candidate’s time and reflects poorly on you as a hiring manager. If this is how you write your hiring tasks, how do you communicate tasks within the company?
A clearly defined, specific task is respectful to candidates of all experience levels. A qualified candidate can immediately estimate how much time the task should take. An unqualified candidate should be able to see immediately that they are out of their depth. In both cases, candidates can decide for themselves whether the time investment is worth the risk.
Was our task well defined? Have a look and see for yourself!
Use blind review when possible
Inexact criteria like “culture fit” have long been suspected of being a more palatable modern incarnation of discrimination. Even if that weren’t true, you’d want to use blind review anyway since it’s a basic tenet of good experimental design. For the subjective parts of our scoring — code style and conventions — we used blind review.
Be realistic
Don’t have your candidates invert binary trees unless that’s the kind of stuff you do in your company, or you have so many candidates you can get away with it. Neither is the case for us. This is, after all, as much a chance for the candidates to review us as for us to review them — they wanted something as close as possible to what they might actually be doing on the job. For our part, we made a toy task that mocked the entire backend at the time of the posting.
Provide feedback
No matter how much you respect the candidates’ time, this is still a risky time investment for them. If you have more submissions than positions, someone is not going to get a job out of it. They deserve to know not only how they did in an objective sense, but also how they compared to the average candidate. In addition, specific feedback from the reviewers should be included if available.
To this end, as soon as we had more than two submissions, we included the mean and standard deviation of the scores across all candidates when responding with the individual candidate’s score.
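As a rough sketch of how that feedback line could be assembled: the helper name `feedback_line` and the scores below are made up for illustration, and the standard deviation is assumed to be the sample standard deviation.

```python
from statistics import mean, stdev


def feedback_line(candidate_score, all_scores):
    """Summarize one candidate's score against the pool (hypothetical helper)."""
    if len(all_scores) <= 2:
        # With two or fewer submissions a comparison is not meaningful,
        # so only the candidate's own score is reported.
        return f"Your score: {candidate_score:.1f}"
    pool_mean = mean(all_scores)
    pool_sd = stdev(all_scores)  # sample standard deviation (assumption)
    return (
        f"Your score: {candidate_score:.1f} "
        f"(all candidates so far: mean {pool_mean:.1f}, "
        f"standard deviation {pool_sd:.1f})"
    )


# Made-up scores for illustration only; these are not the real submissions.
scores = [14.0, 18.0, 22.0]
print(feedback_line(18.0, scores))
# Your score: 18.0 (all candidates so far: mean 18.0, standard deviation 4.0)
```

With a pool this small the standard deviation is a very rough number, so it is worth telling candidates how many submissions it is based on.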
Alert immediately
Give the candidate the feedback as soon as it’s available if you’re evaluating on a rolling basis, or at a fixed date after the deadline. The candidates have invested a lot of time and deserve to know where they stand.

The Data

We sourced our pipeline from the monthly Hacker News “Who’s Hiring” threads, as well as talent.io, a European recruiting platform. In the end, I phone-screened 9 candidates, the team reviewed four code submissions, and we made one hire.
To give something like a baseline, I distributed the CVs to the team and asked some of my colleagues the following: “Give me a rating of these candidates’ resumes on a scale of 0–5, with 5 being a perfect fit for the job, and 0 not a fit at all.”
Candidates are numbered 1–9. Each point represents a CV review score from an engineer at corrux.
The reviews are all over the place: surprisingly subjective, in my view! We didn’t get enough code submissions to make a strong correlation between suitability of CV and score on the code submission (see below), but the spread is surprising. In addition to the usual sources of bias, we wondered: how much does seeing a big name school or company influence us?
Next up are the scores for code review. Remember, this was single-blind, so the reviewers didn’t know which code corresponded to which CV. Although we only have a few results to go on, there does appear to be more consensus among the reviewers:
If I had a lot more incoming submissions, I could probably write another post on screening by CV. Why? Because, in our admittedly small sample, the correlation between resume review score and score on the challenge is, well, tenuous to say the least:
You can see that there’s virtually no correlation at all. In fact, one of the responses tied for best code style received the lowest score in resume review (of the candidates that submitted code)! Of course, there must be some correlation. Someone with no programming background should pretty consistently score zero, as well as have a totally unsuitable resume. To find out how weak or strong that trend is, you’ll need to wait until we’re a much bigger company.
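For anyone who wants to run the same comparison on their own pipeline, here is a minimal sketch of a Pearson correlation between average resume score and code review score. The function name `pearson_r` and the numbers are placeholders, not our data, and with only a handful of points the result should be read as a hint rather than a measurement.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A column where everyone got the same score has no defined correlation.
        raise ValueError("correlation is undefined when one series is constant")
    return covariance / sqrt(var_x * var_y)


# Placeholder numbers, one entry per candidate who submitted code.
mean_cv_score = [2.0, 3.5, 4.0, 1.5]     # average resume rating, 0-5
code_review_score = [8.0, 6.0, 7.0, 8.0]  # blind code review score

print(round(pearson_r(mean_cv_score, code_review_score), 2))
```

A value near +1 or -1 would indicate a strong linear relationship; a value near 0 would indicate none.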
Note that I didn’t include the scores for the correctness section of the challenge. That’s because there was no variance: everyone who could complete the challenge could complete it to the same degree (more on that below). In our case, however, grading on organization and style was enough to differentiate. After all, wouldn’t you want to work with someone who writes informative comments, has a clean directory structure, and adds tests? Even though we specified this would be a huge part of the evaluation, some of the candidates skirted these points.
def hours_since_last_maintaince(now: datetime.datetime):
    """
    Calculates hours since last maintaince as specified in the excavator stats 'most_recent_maintenance' field
    The implementation is straightforward, look for the excavator object with the latest timestamp and get the data
    from there subtracted from now
    :return: hours in decimal format (2 signifact digits) since last maintaince
    """
    records = _get_records(EXCAVATOR_STATS, find_dict={}, sort_list=[("timestamp", -1)], limit=1)

    if len(records) == 1:
        latest_record = records[0]
        seconds_since_maintaince = (
                now - latest_record.get('most_recent_maintenance').replace(tzinfo=pytz.utc)).total_seconds()
        hours = seconds_since_maintaince / 3600
        difference = float('{0:.2f}'.format(hours))
    else:
        difference = {'error': "no excavator data found"}

    return difference
An example of clean code from one of the highest-rated submissions. Used with permission!

Lessons

Grading to a curve

We wanted to respect all the points in the “What’s Fair” section while still making a challenge difficult enough to provide real differentiation between candidates. In our case, scoring on the code quality and test coverage really separated the candidates. Everyone who was able to submit a functioning solution, however, got about the same score on correctness. Does this mean the task was too easy? Perhaps. Scoring correctness at a finer granularity would be one way to get more differentiation there.

Have a deadline

One thing I regret is not having the same start and end dates for all candidates. We are a small company and wanted to fill the position quickly, so we informed all candidates that submissions were accepted on a rolling basis. Later on, I felt a bit guilty telling candidates “we have a coding challenge available, but we’re already in final stages with a couple people.” Fortunately none of them attempted the task (so far as I know), but it’s still unfair. Taking submissions on a rolling basis favors people who happened to have time to work on it right at the moment you told them, and happened to be phone screened a week before. Obviously, that’s not a characteristic that correlates in any way to performance on the job. It also works against people from groups underrepresented in your field.

Build a fair pipeline

The pipeline problem is a hotly debated issue in tech hiring. Whatever your thoughts about the causes and consequences of gender imbalance in tech, one thing is for sure: women are underrepresented and it’s particularly bad in Germany. In our case, out of the nine applicants, two were women. That’s 22%, which is pretty close to the 17% given in the linked article.
If I can make a statement like “I need to have 3 female candidates in the pipeline for this process to be fair,” then I can model how many candidates I’ll need to interview with a negative binomial distribution.
The negative binomial distribution helps answer questions like “how many experiments would I need to run to get n positive outcomes if the probability of a positive outcome is p?” (Unlike gender, Bernoulli trials are binary — please excuse this simplification). More exactly: “How many male candidates should I interview if I want to be 50% sure the pipeline will include 3 women, and the probability of a candidate being a woman is 22%?”
This is the probability mass function for the negative binomial distribution. Here, as in the example above, n=3 and p=0.22. The shaded region from 0 to 9 male applicants indicates 51% probability of 3 female applicants (i.e. the CDF at 9 is 51%).
This model is telling me that by the time I have interviewed 9 men, there is about an even chance I will also have interviewed 3 women. This is encouraging, since I actually did interview 7 male candidates and 2 female. But what if I wanted to be 90% sure? We’d have to interview about 20 male candidates (imagine shading the plot to cover 90% of the area).
That might be a bit burdensome, especially if the candidates trickle in over time. And if I lived somewhere where the pool of candidates was only 5% women (p=0.05 instead of 0.22)? This model suggests I’d need to interview 51 male candidates before I could be half sure my pipeline included 3 women!
If we were in a hurry, we could have just noted that 3 is about 22% of 14 and been done with it. But where’s the fun in that?
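For readers who want to check these numbers, here is a minimal sketch of the negative binomial calculation using only the standard library. The function names are my own, and the model keeps the same simplifying assumption as above: every candidate is an independent trial with a fixed probability of being a woman.

```python
from math import comb


def prob_enough_women(men_interviewed, women_wanted, p_woman):
    """Negative binomial CDF: probability that the `women_wanted`-th woman
    shows up after at most `men_interviewed` men."""
    return sum(
        comb(k + women_wanted - 1, k)
        * p_woman ** women_wanted
        * (1 - p_woman) ** k
        for k in range(men_interviewed + 1)
    )


def men_needed(confidence, women_wanted, p_woman):
    """Smallest number of men at which the CDF reaches `confidence`."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    if not 0 < p_woman <= 1:
        raise ValueError("p_woman must be in (0, 1]")
    men = 0
    while prob_enough_women(men, women_wanted, p_woman) < confidence:
        men += 1
    return men


print(round(prob_enough_women(9, 3, 0.22), 2))  # 0.51, the shaded region
print(men_needed(0.9, 3, 0.22))                 # 20
print(men_needed(0.5, 3, 0.05))                 # 51
```

The counts returned are male candidates only; add the 3 women to get the total number of interviews.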
While this approach will not work for all positions, we hope to use it more in the future for technical positions which are well-understood by the team. We’ve already had some success applying it to business positions as well. It also made us think about how to balance our pipeline and continue to minimize irrelevant bias in our hiring practices.
Thanks for reading! If you have any questions about hiring or corrux, reach out to us! If you’re interested in joining, apply here.
The team at Bauma 2019
