Recruitment testing from a statistical perspective

Psychometric testing in recruitment is increasingly common. Especially for roles with a high number of applicants, or for high-stakes positions, it is almost standard procedure to ask applicants to complete a psychometric assessment at some point in the hiring process.

These tests have some clear advantages for the companies using them:

  1. Tests ensure some degree of fairness, as everybody gets the same or an equivalent set of questions. Testing also encourages a standardized process, where all applicants are treated and assessed in the same way, no matter who the interviewer happens to be or at what time of day the interview takes place.
  2. Tests are designed to measure traits that are important for job performance, such as intelligence and personality. These traits are latent (they cannot be observed directly, only through behaviour) and notoriously hard to assess from a short interview or the text of a resumé, so test results can provide a useful reference.
  3. There is conclusive evidence that psychometric tests of certain kinds show "predictive validity" - that is, scores on the test can be used to predict job performance.

This post is about the third point. I want to elaborate on the argument that using test scores will increase the probability of hiring good candidates.

Predictive validity is almost always expressed as a correlation coefficient, which describes the strength of the relationship between test scores and job performance. This means that under some assumptions, we can estimate probabilities of different outcomes. Let's play around with this!

We can estimate the probability of a good hiring decision by integrating over the bivariate normal probability density function. Something like this was done by Taylor and Russell in 1939 to produce tables of what is also known as precision - the number of good hires (high performers) divided by the total number of hires (high and low performers).
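As a sketch of how such a table entry can be computed today - assuming Python with numpy and scipy, which the original 1939 tables of course did not involve - the integral over the bivariate normal density reduces to a one-dimensional integral over the conditional distribution of performance given a test score:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def precision(r, selection_ratio, base_rate=0.5):
    """Taylor-Russell-style precision: P(high performer | selected).

    r               predictive validity (correlation)
    selection_ratio fraction of applicants selected (top test scorers)
    base_rate       fraction of applicants who are high performers
    """
    x_cut = norm.ppf(1 - selection_ratio)  # test-score cutoff
    y_cut = norm.ppf(1 - base_rate)        # performance cutoff
    # P(selected AND high performer): given test score X = x, performance
    # is normal with mean r*x and variance 1 - r^2, so integrate that
    # conditional tail probability over the selected region.
    joint, _ = integrate.quad(
        lambda x: norm.pdf(x) * norm.sf((y_cut - r * x) / np.sqrt(1 - r**2)),
        x_cut, np.inf,
    )
    return joint / selection_ratio

for r in (0.1, 0.3, 0.5):
    print(f"r = {r:.1f}: precision = {precision(r, 1/3):.3f}")
```

With r = 0.5 and a selection ratio of 1/3 (hiring the top 100 of 300), this gives roughly 0.726.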

Let's simulate a random sample from the bivariate normal distribution, with a correlation of r = 0.5, to visualize the assumptions made by Taylor and Russell. Looking at the scatterplot below, calculating precision is equivalent to dividing the number of green points by the total number of green and red points: 72 / (72 + 28) = 0.72.
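For readers who want to reproduce something like this, a minimal version of the simulation might look as follows (numpy assumed; the exact counts depend on the random seed, so a fresh sample will land near, but not exactly on, the 72/28 split):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed; counts vary with it
r, n, n_select = 0.5, 300, 100

# Draw (test score, job performance) pairs from a standard bivariate
# normal distribution with correlation r.
cov = [[1.0, r], [r, 1.0]]
score, performance = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

selected = score >= np.sort(score)[-n_select]  # top 100 on the test
high_perf = performance > 0                    # above-average performers

green = int(np.sum(selected & high_perf))   # selected and high-performing
red = int(np.sum(selected & ~high_perf))    # selected but low-performing

print(f"precision = {green} / {green + red} = {green / (green + red):.2f}")
```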

(Integrating over the bivariate normal distribution gives 0.726)

Scatter plot of normalized assessment scores and job performance scores. Each point represents a fictional applicant. The position of any applicant shows two things - their test score (higher scores to the right) and their job performance (higher scores on top)


So, given a predictive validity of r = 0.5, and defining high performers as those above average (yellow and green points), selecting 100 of the 300 in the sample based only on the assessment score (green and red points) gives a 72 % probability that a selected candidate is a good one.

Let's put this into context. Imagine a company looking to hire law graduates that receives 300 applications in a year. By using an IQ test to find the top 100, and given all the assumptions above, any candidate chosen at random from that top 100 would have a 72 % chance of being an above-average performer.

This is of course efficient for the company. They no longer have to spend time reading resumés or conducting interviews with 300 applicants; instead, they can focus on the top 100, who are more likely to be high performers. Or perhaps even on the top 10, if they really trust their IQ test.

But let's move focus away from the applicants that passed the test. Instead, let's have a look at a group of applicants that are "misclassified" by using the IQ test - the yellow ones. They are the lost opportunity in this example. If the company hired them, they would have performed above average. However, they did not pass the first hurdle.

Given the situation described above, we can estimate the probability of misclassifying the true high performers. Dividing the number of yellow points in the graph by the total number of green and yellow points, we get 78 / (72 + 78) = 0.52. So in the same situation as above, there is a 52 % chance that any given true high performer was rejected before the company even looked at their resumé. (This is in fact a different metric: 1 - sensitivity, where sensitivity is also known as recall.)
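The 52 % figure can also be checked analytically, under the same assumptions (again a sketch, with scipy assumed): the miss rate is the share of true high performers who fall below the test cutoff.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

r = 0.5
x_cut = norm.ppf(2/3)  # only the top 100 of 300 pass the test

# P(pass AND high performer): given test score X = x, performance is
# normal with mean r*x and variance 1 - r^2, so its tail above 0 is
# sf(-r*x / sqrt(1 - r^2)).
joint, _ = integrate.quad(
    lambda x: norm.pdf(x) * norm.sf(-r * x / np.sqrt(1 - r**2)),
    x_cut, np.inf,
)

recall = joint / 0.5  # P(pass | high performer); P(high performer) = 0.5
print(f"miss rate = {1 - recall:.2f}")  # about 0.52
```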

Are you surprised? The process that sounded very good when we only considered the greens and the reds, sounds a bit less impressive now when we also consider the yellows.

Now, let's take a step back. Imagine that it's not just one company using IQ tests for pre-selection, or "screening" of applicants. Instead, imagine that all companies hiring law graduates are using very similar processes. What would happen to our yellow group? They would get rejected repeatedly, based on the same reasoning over and over again. They would not be on the shortlist of any company and they would have no alternative way of proving their worth.


I worry about the widespread use of psychometric tests, for the reasons described above. It doesn't matter how high the predictive validity of IQ tests is, or how much companies save by using them to screen applicants. If all companies use similar tests to reject applicants, the same yellow group will be consistently mistreated. And that's bad both for them and for the companies losing out on their abilities.

This is a hard problem to tackle. Companies get to know the greens and reds well, since they were selected and proceeded in the hiring process, but they never learn anything about the yellows and blues. Also, many people I have met in this field don't seem to think in statistical terms; instead, they make the incorrect assumption that people with higher IQ scores must necessarily also be better performers on the job. This is clearly wrong, even in this simplified example, and it leads to an even stronger lack of consideration for the yellow group.

Maybe it would be good for companies to make some exceptions from their screening process every now and then, and let some yellows (and potentially blues) through. They may be surprised to see some good performance from highly motivated employees - maybe even better than what they're used to from chasing greens.

What do you think? Please let us know in the comment section below. 



The simulated data in the graph and the calculations are based on the same assumptions that test companies and researchers in psychometrics base their studies on. These are simplifications that turn the hiring problem, which is multidimensional and extremely complex, into a much easier one. First, job performance is reduced to a single continuous scale, encompassing everything from task productivity to teamwork. Second, the trait being assessed by the test is also assumed to be one-dimensional and continuous. Additionally, in the case of significance testing, it is common to also assume that observations are random samples from a population with a bivariate normal distribution.