The short answer is that yes, it is possible to score 100/100. In practice, however, only about 0.2% of candidates do. Candidate scores generally follow a bell curve, which is why you are unlikely to see someone scoring 100/100.
The long answer is that a score mainly reflects how well an answer aligns with the “good” answers the model saw during its training. Here are some more details on exactly how that works.
Think of PHAI as an “algorithmic marker” that holds the rubric for scoring a candidate response. As an analogy, let’s think of an essay marking system.
First, the AI takes measurements of a set of predefined properties called “features” from the text response (see the “Feature Extractor” in the figure below).
Each feature is an attribute that has demonstrated validity in identifying suitable candidates. These include personality traits such as extraversion, openness, drive, and accountability, as well as language fluency, proficiency, sentiment, and more. We currently have over 50 such features.
In an essay marking system, these features might be spelling, grammar, punctuation, paragraph structure, subject matter, and so on. The marker scores each of these features independently.
These measurements are universal regardless of which role family you apply for. For example, if two candidates applying for two different roles (e.g. Sales and Retail) write the same answer to the question “How would you handle a difficult customer?”, both will receive the same measurement values for each feature (e.g. extraversion, openness, drive, language fluency). This is the objective and consistent nature of using AI for marking: it measures features consistently and without bias across all candidates.
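As a rough sketch of this first step, here is what a feature extractor looks like conceptually. The feature names and the scoring logic below are invented toy stand-ins, not the real PHAI features; the point is only that the same answer always yields the same measurements, whatever role it was submitted for.

```python
# Toy sketch of step 1: mapping a free-text answer to a fixed set of
# feature measurements. These features are illustrative stand-ins only.

def extract_features(answer: str) -> dict:
    words = answer.split()
    word_count = len(words)
    unique_ratio = len(set(w.lower() for w in words)) / max(word_count, 1)
    return {
        # toy proxies, not validated selection features
        "language_fluency": min(1.0, word_count / 100),
        "vocabulary_richness": round(unique_ratio, 3),
    }

answer = "I stayed calm, listened to the customer, and offered a refund."
# The measurements are a pure function of the text: the same answer gives
# the same values regardless of which role family the candidate applied for.
assert extract_features(answer) == extract_features(answer)
```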
The second step is aggregating the different measurements into a final score for the response. This is where the AI applies different “rules” based on the requirements of the different role families. For example, in Sales the traits “drive” and “language fluency” might carry higher weights than in Retail. An important point here is how the model comes up with these rules. We use two approaches to generate them.
- Using past hired candidates as a gold standard. Machine learning methods can learn weights automatically when provided with a set of “good” (HIRED) and “bad” (DECLINED) samples. We call models built with this approach “Machine Learning Models”. A downside is that this requires large numbers of HIRED and DECLINED samples, and it can also carry past hiring biases into the model (not gender and race biases, as we specifically test for those in our training process).
- Setting rules manually based on an ideal profile desired for the role family or the customer organisation. Typically this is done in collaboration with the customer, where the previously mentioned measures, such as personality traits and language fluency, are assigned high, medium, or low weights. We call models built with this approach “Rule Based Recommendation Models”.
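Either way, the result of this step is a weighted combination of the feature measurements. A minimal sketch, with weights invented purely for illustration (real weights are learned from HIRED/DECLINED samples or set with the customer):

```python
# Toy sketch of step 2: the same feature measurements are aggregated with
# different role-family weights, producing different raw scores per role.
# All weights below are made up for illustration.

ROLE_WEIGHTS = {
    "sales":  {"drive": 0.5, "language_fluency": 0.3, "extraversion": 0.2},
    "retail": {"drive": 0.3, "language_fluency": 0.3, "extraversion": 0.4},
}

def aggregate(features: dict, role_family: str) -> float:
    weights = ROLE_WEIGHTS[role_family]
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

features = {"drive": 0.8, "language_fluency": 0.6, "extraversion": 0.4}
print(aggregate(features, "sales"))   # drive counts for more in Sales
print(aggregate(features, "retail"))  # a different raw score, same answer
```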
Think again of the essay marking system. An essay in an English class might apply higher weights to spelling and grammar, whereas a science class might weight subject matter more heavily than grammar. This means a student who makes grammar mistakes is penalised more in the English class than in the science class.
The third step is score normalisation using a “norm group” of relevant candidate answers. This ensures that the final score represents a relative standing rather than an absolute one. For example, a candidate who scores 60% on a Retail model may score 40% on the Sales model with the same answers, even if the rules are the same in both models. This is because the norm groups of retail candidates and sales candidates used in normalising have different levels of trait scores. Our norm groups number in the thousands, and in some cases over 100,000, making them representative of the relevant candidate populations.
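One common way to normalise against a norm group is a percentile-style comparison; the sketch below uses tiny invented norm groups (real ones, as noted above, are in the thousands) to show how the same raw score can land at different final scores for different role families:

```python
# Toy sketch of step 3: a raw score is converted to a percentile-style
# standing within a norm group of comparable candidates' scores.
# The norm groups below are tiny invented samples.

def percentile_in(norm_group: list, raw_score: float) -> float:
    """Percentage of the norm group this raw score meets or beats."""
    at_or_below = sum(1 for s in norm_group if s <= raw_score)
    return 100 * at_or_below / len(norm_group)

retail_norms = [0.30, 0.40, 0.50, 0.55, 0.60, 0.70, 0.75, 0.80, 0.85, 0.90]
sales_norms  = [0.50, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.92, 0.95]

raw = 0.66
print(percentile_in(retail_norms, raw))  # 50.0 — beats half the retail group
print(percentile_in(sales_norms, raw))   # 30.0 — a lower standing in sales
```

The same raw score yields a lower final score against the stronger-scoring norm group, which is exactly the Retail-vs-Sales effect described above.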
What all this means is that, to score 100/100 or close to it, a candidate needs to write answers close to those written by candidates the model treated as “preferred candidates” when it was built.
It is important to stress that the model is not doing a simple word-for-word match. This is where advances in Natural Language Processing (NLP) come in: the model can infer similarities in language at levels beyond word matching. Moreover, we do not use “words” directly in scoring; instead we use features derived from the words that are relevant to selection, such as personality and language fluency (see above). This is akin to scoring an essay on grammar and subject matter rather than looking for specific keywords.
Examples: Let’s look at some examples using answers given to the question “Share an example of when you took a different approach to solving a problem” (typically a candidate answers 5-6 questions like these, and all are considered in the final outcome).
A Graduate role family answer with a score above 90%:
“In my medical administration role, we are currently balancing attending to patients needs for service with social distancing policy. On one occasion, all of the phones and internet went down, making the communication lines which are usually integral to striking this balance impossible to use and involving anxiety for clinicians and patients. In order to deliver information, alleviate patient (essentially, client) anxieties and ensure healthcare delivery, we utilised our remaining resources, converting our personal devices for privately-administered telephone operation and emailing to inform external specialists, companies and patients of the new means by which to contact us. At the end of the day, we found that while some issues weren't possible to resolve, our ability to utilise different forms of communication and community outreach minimised the damage to the processes involved for effective workplace function.”
A Retail role family answer with a score above 90%:
”At my previous place of work, the kitchen flooded just before service on New Years Day. Rather than close the store for the day and wait for repairers, the team and I worked together to thoroughly clean the area and were able to prepare the restaurant for service later that day. During this service, the ice machine broke down and rather than telling customers that we had none, I took the initiative and purchased ice from a nereby supermarket. As a result of these resolutions we were able to provide our customers with the best service.”
You can see that the relative complexity of the language differs between the two answers, yet both receive a similar final score, because scoring is relative within each role family.
Our approach to evaluating candidate responses is in line with the method suggested by the well-known psychologist and Nobel laureate Daniel Kahneman*:
“When making decisions, think of options as if they were candidates. Break them up into dimensions and evaluate each dimension separately. Then – Delay forming an intuition too quickly. Instead, focus on the separate points, and when you have the full profile, then you can develop an intuition.”