Case Study 29-1: Pymetrics and the Gamified Hiring Screen

Overview

Pymetrics, acquired by Harver in 2022, offers one of the most technically sophisticated and philosophically interesting examples of AI-powered hiring assessment. Unlike HireVue's video analysis, Pymetrics uses gamified cognitive and emotional tests — short interactive exercises — to generate personality and cognitive profiles that it claims predict job performance with greater accuracy and less bias than traditional screening. Examining Pymetrics in detail shows how even the most technically sophisticated version of algorithmic hiring still raises serious surveillance and fairness concerns.


What Pymetrics Does

Pymetrics presents applicants with a series of games — typically 12 short exercises lasting about 25 minutes total — that measure cognitive and emotional traits through behavioral signals rather than self-report. The games include:

Attention and memory tasks: Games measuring working memory capacity, sustained attention, and information processing speed.

Risk and reward tasks: Games measuring risk tolerance, learning under uncertainty, and response to reward and punishment signals. These draw on behavioral economics and neuroscience research.

Emotion recognition: In earlier versions, games requiring identification of emotions from faces or vocal cues.

Impulse control tasks: Games measuring the ability to inhibit reflexive responses.

The behavioral signals from these games are analyzed by Pymetrics' machine learning system to generate a trait profile, which is then compared against profiles of successful employees at the specific client company. Applicants whose trait profiles match the successful employee model are advanced; those whose profiles do not match are rejected.
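
To make the mechanism concrete, here is a minimal sketch of this kind of profile matching. The trait names, the centroid-plus-cosine-similarity approach, and the threshold are all illustrative assumptions, not Pymetrics' actual method:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length trait vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def build_success_profile(employee_vectors):
    """Average the trait vectors of 'successful' employees into one centroid."""
    n = len(employee_vectors)
    dims = len(employee_vectors[0])
    return [sum(v[i] for v in employee_vectors) / n for i in range(dims)]

def screen(applicant, employee_vectors, threshold=0.9):
    """Advance the applicant only if their profile is close to the centroid."""
    centroid = build_success_profile(employee_vectors)
    return cosine_similarity(applicant, centroid) >= threshold

# Hypothetical trait vectors: (risk_tolerance, working_memory, impulse_control)
employees = [(0.8, 0.6, 0.7), (0.7, 0.7, 0.6), (0.9, 0.5, 0.8)]
print(screen((0.8, 0.6, 0.7), employees))  # True: resembles past hires
print(screen((0.1, 0.9, 0.2), employees))  # False: strong memory, wrong "shape"
```

Note what the sketch makes visible: the second applicant is rejected not for low ability but for dissimilarity to past hires, which is exactly the channel through which demographic homogeneity in the training population can leak into the screen.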


The Matching Problem

The fundamental mechanism — comparing applicants against a model of successful past employees — is both Pymetrics' commercial proposition and its most significant source of bias risk.

If a company's successful employees are demographically homogeneous, as the workforces of many large U.S. employers are, then the model of "successful employee" captures demographic characteristics as well as cognitive and emotional characteristics. An applicant who is demographically similar to successful past employees will match the profile better, regardless of whether the demographic similarity has any actual relationship to job performance.

Pymetrics has published research claiming that their system demonstrates lower racial and gender bias than traditional resume screening and GPA-based screening. The research methodology has been reviewed and critiqued by independent researchers:

What Pymetrics' research shows: When Pymetrics' algorithm is compared against the specific metrics it is trained to predict (hiring decisions made by the employer), it shows less disparate impact on some demographic groups than the employer's previous screening methods.

What the critique highlights: The comparison baseline — the employer's previous screening methods — may have been highly discriminatory. Showing less bias than a discriminatory baseline is not the same as showing non-discriminatory outcomes. Additionally, Pymetrics' validation research is conducted using the same company's past hiring data, creating a circularity problem: the system is validated against outcomes that may themselves reflect historical bias.


Cognitive Science Concerns

Pymetrics claims its games measure stable cognitive and emotional traits with neuroscientific validity. Several cognitive scientists have raised concerns:

Task validity: The games measure specific task performance — how quickly someone responds to a balloon risk game, how well they recognize emotional expressions in a controlled context. Whether these tasks validly measure stable traits that predict job performance in the specific role is an empirical question whose answer depends on the role, the industry, and the specific validation study.

Cultural and neurological variation: Cognitive task performance varies across cultures, neurological profiles, and disability status. Working memory tests, for example, may disadvantage applicants with ADHD. Response inhibition tasks may disadvantage applicants with certain anxiety disorders. Risk tolerance tasks may measure cultural attitudes toward risk as much as cognitive style.

Practice and coaching effects: Unlike facial expressions or vocal patterns, gamified cognitive tasks can be practiced. Applicants who are aware of what the games measure and have practiced similar tasks may perform differently from first-time applicants — introducing a systematic advantage for applicants who have access to coaching resources.


Pymetrics' Bias Reduction Efforts

Unlike many algorithmic hiring vendors, Pymetrics has published methodology for its bias reduction efforts:

The company uses an "auditing" process that tests whether its output differs significantly by demographic group (race, gender) and adjusts model weights to reduce disparate impact when detected. The company also offers the option of using "fair model" outputs that specifically optimize for demographic parity.

These efforts represent genuine engagement with the bias problem. They also illustrate its complexity: optimizing for demographic parity in outputs can trade off against predictive accuracy; what counts as "disparate impact" depends on how you define the comparison groups; and auditing for bias on observed demographic characteristics does not address bias related to unobserved characteristics (socioeconomic background, disability status, intersectional identities).
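
One common statistic in such audits is the adverse-impact ratio, the basis of the EEOC's "four-fifths rule." A minimal sketch, with all group labels and pass counts invented for illustration:

```python
def selection_rate(advanced, screened):
    """Fraction of applicants in a group who were advanced."""
    return advanced / screened

def adverse_impact_ratio(rates):
    """Ratio of the lowest group selection rate to the highest.
    A value below 0.8 is the conventional four-fifths-rule flag
    for potential disparate impact."""
    return min(rates.values()) / max(rates.values())

# Hypothetical audit data: applicants advanced / applicants screened
rates = {
    "group_a": selection_rate(45, 100),
    "group_b": selection_rate(30, 100),
}

ratio = adverse_impact_ratio(rates)
print(round(ratio, 2))  # 0.67
print(ratio < 0.8)      # True: flagged under the four-fifths rule
```

The sketch also illustrates the limits noted above: the ratio depends entirely on how the comparison groups are defined, and it says nothing about unobserved characteristics the audit never measures.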


Transparency Gaps

Applicants who complete Pymetrics games are typically informed that the games measure cognitive and emotional traits and that their results will be used in hiring decisions. They are typically not informed:

  • What specific traits are being measured in each game
  • How their trait profile compares to the "successful employee" model
  • What the demographic distribution of the training data looks like
  • How their specific score was generated and what would need to be different for them to advance

From a GDPR perspective (for EU applicants), these disclosure gaps are legally significant: if the Pymetrics output constitutes a decision based solely on automated processing that produces legal or similarly significant effects, Article 22 restricts such processing, and the transparency provisions (Articles 13–15) entitle applicants to meaningful information about the logic involved.


Implications for Jordan Ellis

If Jordan had applied through a company using Pymetrics rather than HireVue, the surveillance architecture would have been different in form but similar in structure:

  • Jordan's behavioral responses during 25 minutes of games would have been analyzed
  • The analysis would have compared Jordan's profile against successful employees at the company — employees who may not demographically resemble Jordan
  • Jordan would have received a hiring recommendation score without understanding how it was generated
  • Jordan would have no access to their trait profile, the comparison model, or the explanation for their score

The gamification of the assessment makes it feel less intrusive than a facial expression video analysis — it feels like playing games rather than being scrutinized. This is, from a consent perspective, actually more concerning: Jordan is being assessed without the assessment feeling like assessment. The surveillance is disguised as play.


Discussion Questions

  1. Pymetrics claims to measure cognitive and emotional traits with neuroscientific validity. What level of evidence should be required before an employer can use a cognitive assessment in hiring decisions? Who should conduct that validation?

  2. The "gamification" of hiring assessment makes the evaluation feel less threatening than a video interview. Does this make it more or less ethical? Does the fact that applicants are more likely to perform authentically in a game format justify using the game format as a screening mechanism?

  3. Pymetrics' bias reduction efforts — auditing for demographic parity in outputs — represent genuine technical engagement with the discrimination problem. What are the limits of purely technical approaches to bias reduction in hiring algorithms?

  4. An employer might argue that Pymetrics' gamified assessment is fairer than GPA-based screening because it tests actual cognitive capabilities rather than credentialing advantages. Is this argument valid? What does it assume about what GPA measures and what cognitive games measure?

  5. Jordan could potentially do better or worse on Pymetrics games depending on factors unrelated to job qualifications: fatigue, anxiety, internet connection quality, access to quiet space to complete the games. How should algorithmic hiring systems account for the conditions under which assessments are completed?