IRI Benchmark

Interpersonal Reactivity Index - Measuring multidimensional empathy across 4 distinct subscales

What is the Interpersonal Reactivity Index?

The Interpersonal Reactivity Index (IRI) is a psychological self-assessment questionnaire developed by Mark H. Davis in 1980. It measures dispositional empathy - stable individual differences in empathetic tendencies - through a multidimensional approach that recognizes empathy as a complex construct with both cognitive and affective components.

Unlike unidimensional empathy measures, the IRI acknowledges that empathy encompasses multiple distinct processes, from understanding others' perspectives to experiencing emotional responses to their situations.

The Four Dimensions of Empathy

The IRI measures empathy across four subscales, each capturing a different aspect of empathetic experience:

Cognitive Empathy

PT - Perspective Taking

The tendency to spontaneously adopt the psychological point of view of others in everyday life. Involves imagining another person's thoughts, feelings, and motivations from their perspective.

Example: "I try to look at everybody's side of a disagreement before I make a decision."

FS - Fantasy

The tendency to transpose oneself imaginatively into the feelings and actions of fictitious characters in books, movies, and plays. Measures imaginative engagement with narratives.

Example: "I really get involved with the feelings of the characters in a novel."

Affective Empathy

EC - Empathic Concern

"Other-oriented" feelings of sympathy, compassion, and concern for people experiencing negative situations. Represents the warm, compassionate side of empathy focused on others' wellbeing.

Example: "I often have tender, concerned feelings for people less fortunate than me."

PD - Personal Distress

"Self-oriented" feelings of personal anxiety, unease, and discomfort in tense interpersonal settings. Captures the distressing aspect of witnessing others' negative experiences.

Example: "In emergency situations, I feel apprehensive and ill-at-ease."

Note: Higher PD scores indicate greater personal distress, not necessarily "better" empathy.

Scoring Method

The IRI consists of 28 statements (7 per subscale) that participants rate on a 5-point Likert scale from "Does not describe me well" (A) to "Describes me very well" (E). The test mixes regular and reverse-scored items to reduce acquiescence bias, the tendency to agree with every statement regardless of content:

✓ Regular Items (Pro-Empathy)

Statements where agreement indicates empathy:

• E (Describes me very well): 4 points
• D: 3 points
• C: 2 points
• B: 1 point
• A (Does not describe me well): 0 points

Example: "I try to look at everybody's side of a disagreement before I make a decision."

↻ Reverse-Scored Items

Statements where disagreement indicates empathy:

• A (Does not describe me well): 4 points
• B: 3 points
• C: 2 points
• D: 1 point
• E (Describes me very well): 0 points

Example: "I sometimes find it difficult to see things from the 'other guy's' point of view."

Because the two item types are keyed in opposite directions, a participant who agrees (or disagrees) with every statement gains points on one type and loses them on the other. Uniform responding therefore cannot inflate the score, although reverse keying by itself does not guarantee honest answers.
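The two scoring directions can be sketched as a small Python function. This is a minimal illustration of the point tables above, not the benchmark's actual implementation; the names `OPTIONS` and `score_item` are hypothetical.

```python
# Answer letters in questionnaire order.
# A = "Does not describe me well", E = "Describes me very well".
OPTIONS = "ABCDE"


def score_item(answer: str, reverse: bool = False) -> int:
    """Return 0-4 points for one answer letter.

    Regular items:        A=0, B=1, C=2, D=3, E=4
    Reverse-scored items: A=4, B=3, C=2, D=1, E=0
    """
    letter = answer.strip().upper()
    if len(letter) != 1 or letter not in OPTIONS:
        raise ValueError(f"expected one of A-E, got {answer!r}")
    points = OPTIONS.index(letter)
    return 4 - points if reverse else points


# Answering "E" to everything earns 4 points on a regular item
# but 0 points on a reverse-scored one.
print(score_item("E"), score_item("E", reverse=True))  # 4 0
```

Keeping the reversal in one place (`4 - points`) means the rest of a scoring pipeline can treat every item as "higher is more of the trait".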

Scoring Summary

Point System

• Each subscale: 0-28 points (7 items × 4 max)
• Total IRI score: 0-112 points (sum of 4 subscales)
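As a rough sketch of the arithmetic, the following Python sums already-scored items into the four subscale totals and the 0-112 total. The input format, a list of `(subscale, points)` pairs, and the function name `subscale_totals` are assumptions made for illustration; the item-to-subscale key of the real questionnaire is not reproduced here.

```python
from collections import defaultdict

SUBSCALES = ("PT", "FS", "EC", "PD")
ITEMS_PER_SUBSCALE = 7


def subscale_totals(scored_items):
    """Sum scored items (0-4 points each) per subscale.

    `scored_items` is an iterable of (subscale, points) pairs. A full
    questionnaire has 28 of them, 7 per subscale.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for subscale, points in scored_items:
        if subscale not in SUBSCALES:
            raise ValueError(f"unknown subscale: {subscale!r}")
        if not 0 <= points <= 4:
            raise ValueError(f"points out of range: {points!r}")
        totals[subscale] += points
        counts[subscale] += 1
    for subscale in SUBSCALES:
        if counts[subscale] != ITEMS_PER_SUBSCALE:
            raise ValueError(f"{subscale}: expected 7 items, got {counts[subscale]}")
    return {subscale: totals[subscale] for subscale in SUBSCALES}


# Hypothetical respondent who scores 2 points on every item.
items = [(subscale, 2) for subscale in SUBSCALES for _ in range(7)]
totals = subscale_totals(items)
print(totals, sum(totals.values()))
# {'PT': 14, 'FS': 14, 'EC': 14, 'PD': 14} 56
```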

Interpretation Guidelines

• No established cut-off scores
• Continuous measure of empathy
• Subscales analyzed separately
• Not intended as global empathy score
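To illustrate why the subscales are read separately, the short Python sketch below compares two rows taken from the Model Rankings table on this page. Both models have the same total, yet their subscale profiles differ sharply. The variable names are illustrative only.

```python
# Subscale scores copied from the Model Rankings table on this page.
profiles = {
    "gpt-4o-mini": {"PT": 12, "FS": 6, "EC": 8, "PD": 4},
    "grok-4-fast": {"PT": 23, "FS": 1, "EC": 4, "PD": 2},
}

# Both totals come out to 30 out of 112.
for name, profile in profiles.items():
    print(name, sum(profile.values()))

# Per-subscale gap (grok-4-fast minus gpt-4o-mini).
a, b = profiles["gpt-4o-mini"], profiles["grok-4-fast"]
gaps = {subscale: b[subscale] - a[subscale] for subscale in a}
print(gaps)  # {'PT': 11, 'FS': -5, 'EC': -4, 'PD': -2}
```

The gaps cancel out in the total, which is exactly the information a single summed score hides.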

Research Findings & Population Norms

While the IRI has no official cut-offs, research across diverse populations has identified some general patterns:

Gender Differences

Significant differences exist across all subscales, with women typically scoring higher than men:

• Largest differences: Fantasy (FS) subscale
• Consistent across cultures and age groups

Example Population Means

Sample from pharmacy student research:

• PT: ~18.5 / 28
• FS: ~17.0 / 28
• EC: ~20.2 / 28
• PD: ~15.2 / 28

Note: These are population-specific examples and should not be used as universal benchmarks.

What This Benchmark Measures in AI

The IRI benchmark evaluates how well language models can simulate multidimensional empathetic reasoning through self-assessment of social and emotional capabilities. Unlike vision-based tests or single-dimension measures, this tests a model's ability to differentiate between distinct aspects of empathetic experience.

Key capabilities tested:

• Self-awareness modeling: Ability to assess own capabilities across different empathy dimensions
• Subscale differentiation: Distinguishing between cognitive, imaginative, affective, and distress-based empathy
• Response consistency: Maintaining coherent responses across regular and reverse-scored items
• Nuanced understanding: Recognizing that empathy involves both positive (PT, EC) and challenging (PD) aspects
• Multidimensional reasoning: Understanding that high scores on all subscales aren't necessarily ideal (e.g., PD)

Strong IRI performance suggests that a model treats empathy as a complex, multifaceted construct rather than a simple binary trait. This matters for AI systems designed for therapeutic applications, social support, emotional intelligence training, and other human interaction where the type of empathy needed makes a difference.

Reference: Davis, M. H. (1980). A multidimensional approach to individual differences in empathy.

Model Rankings

Performance rankings for all 33 tested models on 28 IRI items across 4 subscales.

Rank	Model	Provider	Total		Subscales
#1	gemini-2.0-flash-001	Google	63/112 56.3%	14/28 50%	PT 24/28 FS 14/28 EC 17/28 PD 8/28
#2	gpt-4.1-mini	OpenAI	62/112 55.4%	16/28 57%	PT 26/28 FS 14/28 EC 21/28 PD 1/28
#3	claude-sonnet-4	Anthropic	60/112 53.6%	14/28 50%	PT 25/28 FS 11/28 EC 22/28 PD 2/28
#3	gpt-4.1-nano	OpenAI	60/112 53.6%	15/28 54%	PT 20/28 FS 16/28 EC 16/28 PD 8/28
#5	claude-3.7-sonnet	Anthropic	59/112 52.7%	14/28 50%	PT 25/28 FS 9/28 EC 17/28 PD 8/28
#6	gemini-2.5-flash-lite	Google	58/112 51.8%	12/28 43%	PT 25/28 FS 16/28 EC 11/28 PD 6/28
#7	qwen3-vl-235b-a22b-instruct	Qwen	57/112 50.9%	12/28 43%	PT 21/28 FS 12/28 EC 20/28 PD 4/28
#8	claude-3.5-haiku	Anthropic	56/112 50.0%	9/28 32%	PT 21/28 FS 13/28 EC 16/28 PD 6/28
#9	claude-opus-4.1	Anthropic	52/112 46.4%	11/28 39%	PT 24/28 FS 10/28 EC 15/28 PD 3/28
#9	qwen3-vl-8b-instruct	Qwen	52/112 46.4%	13/28 46%	PT 20/28 FS 8/28 EC 16/28 PD 8/28
#11	qwen3-vl-30b-a3b-instruct	Qwen	49/112 43.8%	13/28 46%	PT 20/28 FS 12/28 EC 12/28 PD 5/28
#12	claude-haiku-4.5	Anthropic	47/112 42.0%	10/28 36%	PT 21/28 FS 8/28 EC 14/28 PD 4/28
#12	llama-4-maverick	Meta	47/112 42.0%	8/28 29%	PT 19/28 FS 11/28 EC 14/28 PD 3/28
#14	claude-opus-4	Anthropic	46/112 41.1%	8/28 29%	PT 23/28 FS 8/28 EC 13/28 PD 2/28
#15	gemini-2.5-pro	Google	44/112 39.3%	11/28 39%	PT 28/28 FS 3/28 EC 12/28 PD 1/28
#16	mistral-small-3.2-24b-instruct	Mistral	42/112 37.5%	10/28 36%	PT 15/28 FS 8/28 EC 12/28 PD 7/28
#17	nova-pro-v1	Amazon	41/112 36.6%	10/28 36%	PT 12/28 FS 9/28 EC 13/28 PD 7/28
#18	llama-4-scout	Meta	40/112 35.7%	7/28 25%	PT 12/28 FS 10/28 EC 9/28 PD 9/28
#18	mistral-small-3.1-24b-instruct	Mistral	40/112 35.7%	9/28 32%	PT 12/28 FS 8/28 EC 12/28 PD 8/28
#20	nova-lite-v1	Amazon	39/112 34.8%	10/28 36%	PT 11/28 FS 8/28 EC 12/28 PD 8/28
#20	claude-sonnet-4.5	Anthropic	39/112 34.8%	8/28 29%	PT 21/28 FS 4/28 EC 11/28 PD 3/28
#22	gpt-5-nano	OpenAI	37/112 33.0%	10/28 36%	PT 21/28 FS 8/28 EC 4/28 PD 4/28
#23	mistral-medium-3.1	Mistral	36/112 32.1%	9/28 32%	PT 16/28 FS 4/28 EC 12/28 PD 4/28
#24	gpt-4.1	OpenAI	35/112 31.3%	8/28 29%	PT 18/28 FS 5/28 EC 8/28 PD 4/28
#24	grok-4	xAI	35/112 31.3%	9/28 32%	PT 21/28 FS 7/28 EC 6/28 PD 1/28
#26	qwen3-vl-30b-a3b-thinking	Qwen	32/112 28.6%	8/28 29%	PT 12/28 FS 4/28 EC 8/28 PD 8/28
#27	gpt-4o-mini	OpenAI	30/112 26.8%	7/28 25%	PT 12/28 FS 6/28 EC 8/28 PD 4/28
#27	grok-4-fast	xAI	30/112 26.8%	7/28 25%	PT 23/28 FS 1/28 EC 4/28 PD 2/28
#29	qwen3-vl-8b-thinking	Qwen	29/112 25.9%	7/28 25%	PT 13/28 FS 8/28 EC 0/28 PD 8/28
#30	gpt-5-mini	OpenAI	28/112 25.0%	6/28 21%	PT 20/28 FS 4/28 EC 1/28 PD 3/28
#31	gpt-5	OpenAI	27/112 24.1%	7/28 25%	PT 17/28 FS 7/28 EC 0/28 PD 3/28
#32	gpt-5-pro	OpenAI	25/112 22.3%	7/28 25%	PT 18/28 FS 4/28 EC 0/28 PD 3/28
#33	gemini-2.5-flash	Google	20/112 17.9%	5/28 18%	PT 8/28 FS 0/28 EC 8/28 PD 4/28
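
The rank column follows a pattern in which tied models share a rank and the next rank is skipped (for example #3, #3, #5). The Python sketch below reproduces that pattern for the first five rows, with each total computed as the sum of the four subscale scores. It is an assumption about how the ranks were derived, inferred from the table itself, and `competition_rank` is a hypothetical helper name.

```python
def competition_rank(totals):
    """Rank by descending score; ties share a rank and the next rank is skipped."""
    ordered = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (name, score) in enumerate(ordered, start=1):
        # A tie reuses the previous rank; otherwise the rank is the position.
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks


# (PT, FS, EC, PD) for the first five rows of the table above.
rows = {
    "gemini-2.0-flash-001": (24, 14, 17, 8),
    "gpt-4.1-mini": (26, 14, 21, 1),
    "claude-sonnet-4": (25, 11, 22, 2),
    "gpt-4.1-nano": (20, 16, 16, 8),
    "claude-3.7-sonnet": (25, 9, 17, 8),
}
totals = {name: sum(subscales) for name, subscales in rows.items()}
ranks = competition_rank(totals)
for name, total in totals.items():
    print(f"#{ranks[name]} {name} {total}/112")
```

Note that a ranking by total rewards high Personal Distress scores as much as high Perspective Taking scores, which is worth keeping in mind when reading the table.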