In the 1960s, Woodrow W. Bledsoe created a secret program that manually identified points on a person’s face and compared the distances between these coordinates with other images.
Facial recognition technology has come a long way since then. The field has evolved quickly and software can now automatically process staggering amounts of facial data in real time, dramatically improving the results (and reliability) of matching across a variety of use cases.
Despite all of the advancements we’ve seen, many organizations still rely on the same algorithm used by Bledsoe’s database – known as “k-nearest neighbors” or k-NN. Since each face has multiple coordinates, a comparison of these distances over millions of facial images requires significant data processing. The k-NN algorithm simplifies this process and makes matching these points easier by considerably reducing the data set. But that’s only part of the equation. Facial recognition also involves finding the location of a feature on a face before evaluating it. This requires a different algorithm such as HOG (we’ll get to it later).
The algorithms used for facial recognition today rely heavily on machine learning (ML) models, which require significant training. Unfortunately, the training process can result in biases in these technologies. If the training doesn’t contain a representative sample of the population, ML will fail to correctly identify the missed population.
While this may not be a significant problem when matching faces for social media platforms, it can be far more damaging when the facial recognition software from Amazon, Google, Clearview AI and others is used by government agencies and law enforcement.
Previous studies on this topic found that facial recognition software suffers from racial biases, but overall, the research on bias has been thin. The consequences of such biases can be dire for both people and companies. Further complicating matters is the fact that even small changes to one’s face, hair or makeup can impact a model’s ability to accurately match faces. If not accounted for, this can create distinct challenges when trying to leverage facial recognition technology to identify women, who generally tend to use beauty and self-care products more than men.
Understanding sexism in facial recognition software
Just how bad are gender-based misidentifications? Our team at WatchGuard conducted some additional facial recognition research, looking solely at gender biases to find out. The results were eye-opening. The solutions we evaluated were misidentifying women 18% more often than men.
You can imagine the terrible consequences this type of bias could generate. For example, a smartphone relying on face recognition could block access, a police officer using facial recognition software could mistakenly identify an innocent bystander as a criminal, or a government agency might call in the wrong person for questioning based on a false match. The list goes on. The reality is that the culprit behind these issues is bias within model training that creates biases in the results.
Let’s explore how we uncovered these results.
Our team performed two separate tests – first using Amazon Rekognition and the second using Dlib. Unfortunately, with Amazon Rekognition we were unable to unpack just how their ML modeling and algorithm works due to transparency issues (although we assume it’s similar to Dlib). Dlib is a different story, and uses local resources to identify faces provided to it. It comes pretrained to identify the location of a face, and with face location finder HOG, a slower CPU-based algorithm, and CNN, a faster algorithm making use of specialized processors found in a graphics cards.
Both services provide match results with additional information. Besides the match found, a similarity score is given that shows how close a face must match to the known face. If the face on file doesn’t exist, a similarity score set to low may incorrectly match a face. However, a face can have a low similarity score and still match when the image doesn’t show the face clearly.
For the data set, we used a database of faces called Labeled Faces in the Wild, and we only investigated faces that matched another face in the database. This allowed us to test matching faces and similarity scores at the same time.
Amazon Rekognition correctly identified all pictures we provided. However, when we looked more closely at the data provided, our team saw a wider distribution of the similarities in female faces than in males. We saw more female faces with higher similarities then men and more female faces with less similarities than men (this actually matches a recent study performed around the same time).
What does this mean? Essentially it means a female face not found in the database is more likely to provide a false match. Also, because of the lower similarity in female faces, our team was confident that we’d see more errors in identifying female faces over male if given enough images with faces.
Amazon Rekognition gave accurate results but lacked in consistency and precision between male and female faces. Male faces on average were 99.06% similar, but female faces on average were 98.43% similar. This might not seem like a big variance, but the gap widened when we looked at the outliers – a standard deviation of 1.64 for males versus 2.83 for females. More female faces fall farther from the average then male faces, meaning female false match is far more likely than the 0.6% difference based on our data.
Dlib didn’t perform as well. On average, Dlib misidentified female faces more than male, leading to an average rate of 5% more misidentified females. When comparing faces using the slower HOG, the differences grew to 18%. Of interest, our team found that on average, female faces have higher similarity scores then male when using Dlib, but like Amazon Rekognition, also have a larger spectrum of similarity scores leading to the low results we found in accuracy.
Tackling facial recognition bias
Unfortunately, facial recognition software providers struggle to be transparent when it comes to the efficacy of their solutions. For example, our team didn’t find any place in Amazon’s documentation in which users could review the processing results before the software made a positive or negative match.
Unfortunately, this assumption of accuracy (and lack of context from providers) will likely lead to more and more instances of unwarranted arrests, like this one. It’s highly unlikely that facial recognition models will reach 100% accuracy anytime soon, but industry participants must focus on improving their effectiveness nonetheless. Knowing that these programs contain biases today, law enforcement and other organizations should use them as one of many tools – not as a definitive resource.
But there is hope. If the industry can honestly acknowledge and address the biases in facial recognition software, we can work together to improve model training and outcomes, which can help reduce misidentifications not only based on gender, but race and other variables, too.