This contribution was written by Matthias Hug, Martina Hunziker and Katharina Scheuner, master's students in Multimedia Communication and Publishing at Bern University of Applied Sciences.
It started with pictures of Barack Obama and Mitch McConnell and escalated quickly. In the fall of 2020, a Twitter user posted portraits of the two men stacked vertically in a single image. Twitter cropped the preview to the face of Mitch McConnell, leaving Obama hidden. Users repeated the test and realised that the picture-cropping algorithm always showed McConnell and never Obama. Many concluded that it was racially biased.
Twitter has used some form of automated cropping at least since 2015. The company said that the cropping “improves consistency” in the newsfeed by focusing on the “most salient” areas of a picture. After the accusations of racism in 2020, Twitter promised to improve its algorithm.
Vinay Prabhu, at the time a PhD student at Carnegie Mellon University, conducted an experiment in September 2020, a few hours after the row exploded. He posted 92 pictures displaying as many pairs of faces, each with a different combination of skin tones. His results did not show any systematic bias towards a specific skin tone.
Because algorithms in self-learning systems might change over time, they need to be monitored constantly. That’s why we – three students at Bern University of Applied Sciences – decided to redo the experiment in January 2021. We generated pictures of non-existing people of different skin tones on thispersondoesnotexist.com, a service that offers automatically generated faces.
We classified the generated pictures according to the scale of skin tones, or phototypes, developed in 1975 by Thomas Fitzpatrick, a dermatologist. It was originally created to estimate skin cancer risk from sun exposure. The Fitzpatrick scale is mostly known today as the basis for the skin tones of emojis.
For testing purposes, we merged the categories I and II of the Fitzpatrick scale (the lightest ones) after we faced difficulties distinguishing between the two. We did the same with categories V and VI (the darkest ones). In the end, we categorized our portraits into categories 1 (I and II of the Fitzpatrick scale), 2 (III of the Fitzpatrick scale), 3 (IV of the Fitzpatrick scale) and 4 (V and VI of the Fitzpatrick scale).
We tested pictures of adults along gender lines (female/male), pairing categories 1 with 4, 1 with 3 and 2 with 4. We compared skin tones that are at least two categories apart to obtain clearer results. For each comparison, we tested twenty pairs, ten for each gender. We always tested two versions: one with the portrait of the person with lighter skin tone on top and the other with the darker skin tone on top.
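The design above amounts to a small test matrix. A minimal sketch in Python (the category numbers and version labels follow our definitions above; the representation itself is ours, purely to illustrate):

```python
from itertools import product

# Skin-tone comparisons we tested, always at least two categories apart.
comparisons = [(1, 4), (1, 3), (2, 4)]
genders = ["female", "male"]
# Version A: lighter skin tone on top; version B: darker skin tone on top.
versions = ["A", "B"]

# Every combination of comparison, gender and stacking order.
cells = list(product(comparisons, genders, versions))
print(len(cells))  # → 12 distinct test cells, each filled with several portrait pairs
```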
We each created a new Twitter account and published five pairs of portraits per 24-hour period over five days, from 8 January to 13 January. The accounts and the tests can be seen at @Kathari16815868, @tinah12671334 and @matthia50611656.
On a personal note: As white people trying to educate ourselves on racism and how our upbringing makes us structurally involved, we felt uneasy assigning faces to categories based on skin tones. We also noted that Thomas Fitzpatrick was white and seems to have intuitively picked category I for the lightest skin tone. Similarly, we called the pairings where the lighter-skinned person was on top “version A”, while “version B” was the one with the darker-skinned person on top. This felt “natural” to us – but of course it is not. It is a social construct we are ill-equipped to handle.
Out of 30 portrait pairs, Twitter showed the lighter skin tone in 14 cases and the darker skin tone in 14 cases as well. In the remaining 2 cases the cropped preview was inconsistent: one version of the image pairing was cropped to show the darker skin tone, the other to show the lighter one. We tested each of the 30 pairs three times to rule out inconsistencies in the algorithm.
We then looked at the data in more detail and found no bias along gender lines either. The cropping algorithm did show a strong preference for our category 3 when paired against faces whose skin tone belonged in category 1 (7 crops to 2). But the finding did not hold up across the rest of the data and can safely be attributed to random chance. Our experiment remains small, and more data would be needed to reach more solid conclusions.
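The claim that a 7-to-2 split is compatible with chance can be checked with an exact two-sided binomial test. A minimal sketch in Python (the choice of test is ours, not part of the original analysis; the counts are the ones reported above):

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all
    outcomes no more likely than the observed count k out of n."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in probs if q <= probs[k] + 1e-12)

# 7 of the 9 decisive crops in that comparison favoured category 3.
print(round(binom_two_sided_p(7, 9), 3))  # → 0.18, far from significant at the 5% level
```

The same test on the overall tally (14 lighter vs. 14 darker crops) gives a p-value of 1.0, i.e. exactly what a fair coin would produce.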
If there was a bias towards lighter skins before, it seems that Twitter has now fixed it. However, we could not find conclusive evidence of racial bias before the viral tweets of September 2020. Twitter said that they conducted an internal audit of their cropping algorithm before releasing it in 2018 and did not find evidence of racial bias.
We therefore make three hypotheses:
- Twitter fixed its racist cropping algorithm,
- The row over the algorithm’s racism may have been due to a selection bias, as tweets that could not replicate the findings of the original Obama/McConnell comparison did not go viral, or were never posted in the first place (which raises questions about the way Twitter’s newsfeed algorithm prioritizes content), or
- Our experimental set-up cannot capture the racism of Twitter’s image-cropping algorithm, which might use proxies not present in our pictures in its decisions.
Amid this uncertainty, one thing is for sure: algorithms should be audited, and audited continuously, to ensure that they do not discriminate illegally by race, gender, disability or any other protected category.