Researchers studied how Google, OpenAI, Anthropic, and DeepSeek identify hate speech. Here’s how they vary
Google, OpenAI, DeepSeek, and Anthropic vary widely in how they identify hate speech, according to new research.
The study, from researchers at the University of Pennsylvania’s Annenberg School for Communication and published in Findings of the Association for Computational Linguistics, is the first large-scale comparative analysis of AI content moderation systems—used by tech companies and social media platforms—that looks at how consistent they are in evaluating hate speech.
Research shows that online hate speech both increases political polarization and damages mental health.
The University of Pennsylvania study found that different systems produce different outcomes for the same content, undermining consistency and predictability, and leading to moderation decisions that appear arbitrary or unfair.
“Private technology companies have become the de facto arbiters of what speech is permissible in the digital public square, yet they do so without any consistent standard,” said Yphtach Lelkes, associate professor at the Annenberg School for Communication and the study’s coauthor.
Lelkes and doctoral student Neil Fasching analyzed seven leading models, some designed specifically for content classification and others more general-purpose. They included two from OpenAI and two from Mistral, along with Claude 3.5 Sonnet, DeepSeek-V3, and Google Perspective API.
Their analysis included 1.3 million synthetic sentences making statements about 125 distinct groups—using both neutral terms and slurs—based on characteristics ranging from religion to disability and age. Each sentence combined “all” or “some,” a group, and a hate speech phrase.
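The template design described above can be sketched as a simple combinatorial generator. The quantifiers, group terms, and phrases below are illustrative stand-ins, not the study's actual lexicon, which covered 125 groups and many more phrases.

```python
from itertools import product

# Illustrative template components -- hypothetical stand-ins for the
# study's actual lists of 125 group terms and hate-speech phrases.
QUANTIFIERS = ["All", "Some"]
GROUPS = ["teachers", "immigrants", "gamers"]
PHRASES = ["are wonderful people", "should be banned"]

def generate_sentences():
    """Build every quantifier x group x phrase combination,
    mirroring the study's synthetic-sentence design."""
    return [f"{q} {g} {p}." for q, g, p in product(QUANTIFIERS, GROUPS, PHRASES)]

sentences = generate_sentences()
print(len(sentences))  # 2 * 3 * 2 = 12 combinations
```

Scaling the same cross-product to the study's group and phrase lists is how a corpus on the order of 1.3 million sentences can be produced from compact inputs.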
Results revealed systematic differences in how models establish decision boundaries around harmful content, highlighting significant implications for automated content moderation.
Key study takeaways
Among the models, one was highly predictable in how it classified similar content, another produced inconsistent results for near-identical content, and others neither over-flagged nor under-detected hate speech.
“These differences highlight the challenge of balancing detection accuracy with avoiding over-moderation,” researchers said.
The models were more similar when they evaluated group statements regarding sexual orientation, race, and gender, and more inconsistent when it came to education level, personal interests, and economic class. Researchers concluded that “systems generally recognize hate speech targeting traditional protected classes more readily than content targeting other groups.”
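One simple way to quantify the kind of cross-model (in)consistency the researchers describe is pairwise agreement: the fraction of sentences on which two systems give the same verdict. This sketch assumes binary hate/not-hate labels and invented values; it is not the paper's actual metric or data.

```python
from itertools import combinations

# Hypothetical binary labels (1 = flagged as hate speech) from three
# models on the same five sentences; values are illustrative only.
labels = {
    "model_a": [1, 0, 1, 1, 0],
    "model_b": [1, 0, 0, 1, 0],
    "model_c": [0, 1, 1, 1, 0],
}

def pairwise_agreement(a, b):
    """Fraction of items on which two models assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

for m1, m2 in combinations(labels, 2):
    print(f"{m1} vs {m2}: {pairwise_agreement(labels[m1], labels[m2]):.2f}")
```

Low agreement rates across model pairs would correspond to the “arbitrary or unfair” moderation outcomes the study warns about, since the same post could be removed on one platform and left up on another.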
Finally, the study found that Claude 3.5 Sonnet and Mistral’s specialized content classification system treated slurs as harmful across the board, while other models prioritized context and intent—with little middle ground between the two.
Meanwhile, a recent survey from Vanderbilt University’s nonpartisan think tank, The Future of Free Speech, concluded there was “low public support for allowing AI tools to generate content that might offend or insult.”