EQ Benchmark

Empathy Quotient Test - Evaluating empathy and social understanding

What is the Empathy Quotient?

The Empathy Quotient (EQ) is a psychological self-assessment questionnaire developed by Simon Baron-Cohen and Sally Wheelwright in 2004. It measures empathy in adults, specifically "the ability to tune into how someone else is feeling, or what they might be thinking." The test was originally designed for autistic adults aged 16+ with IQ ≥80, though it is now widely used in research and self-assessment.

Scoring Method

The EQ consists of 60 statements: 40 empathy items, which are scored, and 20 filler items, which are not. Each empathy item is scored according to its direction, which reduces the effect of response bias:

✓ Positive Items (Pro-Empathy)

Statements where agreement indicates empathy:

Definitely Agree: 2 points
Slightly Agree: 1 point
Slightly/Definitely Disagree: 0 points

Example: "I can easily tell if someone wants to enter a conversation"

↻ Negative Items (Reverse-Scored)

Statements where disagreement indicates empathy:

Definitely Disagree: 2 points
Slightly Disagree: 1 point
Slightly/Definitely Agree: 0 points

Example: "I find it hard to know what to do in a social situation"

Because the two item types are scored in opposite directions, a respondent cannot inflate their score by agreeing or disagreeing with everything: uniform agreement earns nothing on the reverse-scored items, and uniform disagreement earns nothing on the positive ones.
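The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual implementation: the function names, the `"positive"`/`"negative"`/`"filler"` direction labels, and the toy answer sheet are all assumptions made for the example, and the real item key (which statements are positive, negative, or filler) is not reproduced here.

```python
# Points per response for each item direction, as described above.
# Filler items are never scored.
POINTS = {
    "positive": {
        "definitely agree": 2,
        "slightly agree": 1,
        "slightly disagree": 0,
        "definitely disagree": 0,
    },
    "negative": {
        "definitely disagree": 2,
        "slightly disagree": 1,
        "slightly agree": 0,
        "definitely agree": 0,
    },
}


def score_item(direction: str, response: str) -> int:
    """Return 0, 1, or 2 points for one answered statement."""
    if direction == "filler":
        return 0
    return POINTS[direction][response.lower()]


def score_questionnaire(answers: list[tuple[str, str]]) -> int:
    """Sum the points over (direction, response) pairs."""
    return sum(score_item(direction, response) for direction, response in answers)


# Toy answer sheet (hypothetical, not the real item key): agreeing with
# everything only earns points on the positive items.
always_agree = [
    ("positive", "Definitely Agree"),
    ("positive", "Definitely Agree"),
    ("negative", "Definitely Agree"),
    ("negative", "Definitely Agree"),
    ("filler", "Definitely Agree"),
]
print(score_questionnaire(always_agree))  # 4 of a possible 8
```

With 40 scored items worth up to 2 points each, the same function yields totals between 0 and 80.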

Scoring & Performance Range

Score Range: 0-80 points total
Average Male: 42 out of 80
Average Female: 47 out of 80

0-32: Lower empathy (may indicate challenges with emotional recognition or social communication)
33-52: Average range for the general population
53-63: Above-average empathy
64-80: Very high empathy (significant strength in understanding others' emotions)
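Mapping a total score to one of these bands is a simple threshold lookup. The sketch below is illustrative only: the function name and band labels are made up for the example, and the upper bounds (32, 52, 63, 80) are an assumption chosen to agree with the Model Rankings table, where 32/80 is listed as Lower than Average and 33/80 as Average.

```python
import bisect

# Inclusive upper bound of each band (assumed cut-offs, see the note above).
UPPER_BOUNDS = [32, 52, 63, 80]
BAND_LABELS = ["Lower empathy", "Average", "Above average", "Very high empathy"]


def classify_score(score: int) -> str:
    """Return the band label for a total EQ score between 0 and 80."""
    if not 0 <= score <= 80:
        raise ValueError(f"EQ score must be between 0 and 80, got {score}")
    # bisect_left finds the first band whose upper bound is >= score.
    return BAND_LABELS[bisect.bisect_left(UPPER_BOUNDS, score)]


print(classify_score(42))  # Average
print(classify_score(47))  # Average
```

Both population averages quoted above (42 and 47) fall in the same band, so the band label alone does not distinguish them.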

What This Benchmark Measures in AI

The EQ benchmark evaluates how well language models can reason about social and emotional scenarios described in text. Unlike vision-based tests, this measures a model's ability to process verbal descriptions of social situations and predict appropriate empathetic responses.

Key capabilities tested:

Social reasoning: Understanding complex interpersonal dynamics and emotional contexts from text
Perspective-taking: Simulating how others might feel or think in various situations
Bidirectional comprehension: Correctly interpreting both positively and negatively framed statements about empathy
Consistency: Maintaining coherent empathetic reasoning across 60 diverse scenarios

AI models do not experience genuine emotions, so a strong EQ score is best read as evidence of natural language understanding and social reasoning rather than of felt empathy. These capabilities matter for conversational AI, mental health chatbots, customer support systems, and any other application that requires nuanced interpretation of human social dynamics.

Reference: Baron-Cohen, S., & Wheelwright, S. (2004). The empathy quotient: An investigation of adults with Asperger syndrome or high functioning autism, and normal sex differences. Journal of Autism and Developmental Disorders, 34(2), 163-175.
Learn more at Embrace Autism: Empathy Quotient.

Model Rankings

Performance rankings for all tested models on 60 EQ questions (40 empathy items scored).

Rank	Model	Provider	Score	Items	Performance Range
#1	claude-3.7-sonnet	Anthropic	45/80 56.3%	29/40 73%	Average
#2	grok-4-fast	xAI	44/80 55.0%	26/40 65%	Average
#3	gemini-2.5-pro	Google	42/80 52.5%	24/40 60%	Average
#3	gpt-4.1-mini	OpenAI	42/80 52.5%	30/40 75%	Average
#3	qwen3-vl-8b-instruct	Qwen	42/80 52.5%	26/40 65%	Average
#3	grok-4	xAI	42/80 52.5%	24/40 60%	Average
#7	claude-sonnet-4	Anthropic	38/80 47.5%	28/40 70%	Average
#7	mistral-medium-3.1	Mistral	38/80 47.5%	19/40 48%	Average
#7	gpt-4.1-nano	OpenAI	38/80 47.5%	19/40 48%	Average
#10	mistral-small-3.2-24b-instruct	Mistral	36/80 45.0%	20/40 50%	Average
#10	qwen3-vl-235b-a22b-instruct	Qwen	36/80 45.0%	25/40 63%	Average
#12	claude-sonnet-4.5	Anthropic	35/80 43.8%	22/40 55%	Average
#12	qwen3-vl-30b-a3b-thinking	Qwen	35/80 43.8%	18/40 45%	Average
#14	gemini-2.0-flash-001	Google	34/80 42.5%	21/40 53%	Average
#14	gpt-4o-mini	OpenAI	34/80 42.5%	17/40 43%	Average
#16	nova-lite-v1	Amazon	33/80 41.3%	21/40 53%	Average
#16	nova-pro-v1	Amazon	33/80 41.3%	22/40 55%	Average
#16	claude-haiku-4.5	Anthropic	33/80 41.3%	23/40 58%	Average
#19	claude-opus-4	Anthropic	32/80 40.0%	21/40 53%	Lower than Average
#19	gpt-4.1	OpenAI	32/80 40.0%	19/40 48%	Lower than Average
#21	claude-opus-4.1	Anthropic	31/80 38.8%	18/40 45%	Lower than Average
#21	gpt-5-pro	OpenAI	31/80 38.8%	20/40 50%	Lower than Average
#21	gpt-5	OpenAI	31/80 38.8%	20/40 50%	Lower than Average
#24	gpt-5-mini	OpenAI	30/80 37.5%	21/40 53%	Lower than Average
#24	qwen3-vl-8b-thinking	Qwen	30/80 37.5%	18/40 45%	Lower than Average
#26	mistral-small-3.1-24b-instruct	Mistral	29/80 36.3%	16/40 40%	Lower than Average
#27	claude-3.5-haiku	Anthropic	28/80 35.0%	20/40 50%	Lower than Average
#27	gemini-2.5-flash	Google	28/80 35.0%	14/40 35%	Lower than Average
#29	gpt-5-nano	OpenAI	27/80 33.8%	19/40 48%	Lower than Average
#30	gemini-2.5-flash-lite	Google	25/80 31.3%	15/37 41%	Lower than Average
#30	qwen3-vl-30b-a3b-instruct	Qwen	25/80 31.3%	24/40 60%	Lower than Average
#32	llama-4-maverick	Meta	23/80 28.8%	20/40 50%	Lower than Average
#33	llama-4-scout	Meta	16/80 20.0%	11/40 28%	Lower than Average
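
The rank column follows what is usually called standard competition ranking: models with equal scores share a rank, and the next distinct score skips ahead by the number of tied models. The sketch below reproduces that pattern for the top seven rows. The function names are hypothetical, and the half-up rounding to one decimal place is an assumption about how the percentages were produced rather than a documented detail of the benchmark.

```python
from decimal import Decimal, ROUND_HALF_UP


def competition_ranks(results: list[tuple[str, int]]) -> list[tuple[int, str, int]]:
    """Rank (model, score) pairs so that ties share a rank (1, 2, 3, 3, 5, ...)."""
    ordered = sorted(results, key=lambda pair: pair[1], reverse=True)
    ranked: list[tuple[int, str, int]] = []
    for position, (model, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = position       # otherwise the rank is the 1-based position
        ranked.append((rank, model, score))
    return ranked


def percent_of_max(score: int, max_score: int = 80) -> Decimal:
    """Score as a percentage of the maximum, rounded half-up to one decimal."""
    value = Decimal(score * 100) / Decimal(max_score)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Scores copied from the first seven rows of the table above.
top_rows = [
    ("claude-3.7-sonnet", 45),
    ("grok-4-fast", 44),
    ("gemini-2.5-pro", 42),
    ("gpt-4.1-mini", 42),
    ("qwen3-vl-8b-instruct", 42),
    ("grok-4", 42),
    ("claude-sonnet-4", 38),
]
for rank, model, score in competition_ranks(top_rows):
    print(f"#{rank}\t{model}\t{score}/80 {percent_of_max(score)}%")
```

Decimal arithmetic is used instead of floats so that values ending in exactly .x5 round the same way every time.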