
IRI Benchmark

Interpersonal Reactivity Index - Measuring multidimensional empathy across 4 distinct subscales

What is the Interpersonal Reactivity Index?

The Interpersonal Reactivity Index (IRI) is a psychological self-assessment questionnaire developed by Mark H. Davis in 1980. It measures dispositional empathy - stable individual differences in empathetic tendencies - through a multidimensional approach that recognizes empathy as a complex construct with both cognitive and affective components.

Unlike unidimensional empathy measures, the IRI acknowledges that empathy encompasses multiple distinct processes, from understanding others' perspectives to experiencing emotional responses to their situations.

The Four Dimensions of Empathy

The IRI measures empathy across four subscales, each capturing a different aspect of empathetic experience:

Cognitive Empathy

PT - Perspective Taking

The tendency to spontaneously adopt the psychological point of view of others in everyday life. Involves imagining another person's thoughts, feelings, and motivations from their perspective.

Example: "I try to look at everybody's side of a disagreement before I make a decision."

FS - Fantasy

The tendency to transpose oneself imaginatively into the feelings and actions of fictitious characters in books, movies, and plays. Measures imaginative engagement with narratives.

Example: "I really get involved with the feelings of the characters in a novel."

Affective Empathy

EC - Empathic Concern

"Other-oriented" feelings of sympathy, compassion, and concern for people experiencing negative situations. Represents the warm, compassionate side of empathy focused on others' wellbeing.

Example: "I often have tender, concerned feelings for people less fortunate than me."

PD - Personal Distress

"Self-oriented" feelings of personal anxiety, unease, and discomfort in tense interpersonal settings. Captures the distressing aspect of witnessing others' negative experiences.

Example: "In emergency situations, I feel apprehensive and ill-at-ease."

Note: Higher PD scores indicate greater personal distress, not necessarily "better" empathy.

Scoring Method

The IRI consists of 28 statements (7 per subscale) that participants rate on a 5-point Likert scale from "Does not describe me well" (A) to "Describes me very well" (E). The test includes both regular and reverse-scored items to control for response bias and ensure accuracy:

✓ Regular Items (Pro-Empathy)

Statements where agreement indicates empathy:

E (Describes me very well): 4 points
D: 3 points
C: 2 points
B: 1 point
A (Does not describe me well): 0 points

Example: "I try to look at everybody's side of a disagreement before I make a decision."

↻ Reverse-Scored Items

Statements where disagreement indicates empathy:

A (Does not describe me well): 4 points
B: 3 points
C: 2 points
D: 1 point
E (Describes me very well): 0 points

Example: "I sometimes find it difficult to see things from the 'other guy's' point of view."

This bidirectional scoring means that a participant who simply agrees (or disagrees) with every statement cannot inflate their score: an answer that earns 4 points on a regular item earns 0 points on a reverse-scored one. The mix of item types therefore rewards responses that track what each statement actually says.
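The two point tables above can be sketched as a single function. This is a minimal illustration, not the benchmark's own code; the names `REGULAR_POINTS` and `score_item` are hypothetical.

```python
# Point values for the five response letters on a regular (pro-empathy) item.
REGULAR_POINTS = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}


def score_item(response: str, reverse_scored: bool = False) -> int:
    """Return 0-4 points for one IRI response letter (A-E)."""
    letter = response.strip().upper()
    if letter not in REGULAR_POINTS:
        raise ValueError(f"expected a letter A-E, got {response!r}")
    points = REGULAR_POINTS[letter]
    # Reverse-scored items flip the scale: A earns 4 points, E earns 0.
    return 4 - points if reverse_scored else points


# "Describes me very well" is worth the maximum on a regular item
# and the minimum on a reverse-scored item.
regular_e = score_item("E")
reversed_e = score_item("E", reverse_scored=True)
```

Because the midpoint C is worth 2 points either way, only answers away from the midpoint are affected by the reversal.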

Scoring Summary

Point System

  • Each subscale: 0-28 points (7 items × 4 max)
  • Total IRI score: 0-112 points (sum of 4 subscales)
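As a rough sketch of how these ranges arise, the snippet below sums item points into subscale totals and an overall total. The function name `score_iri` and the answer sheets are hypothetical, and the 5 regular / 2 reverse-scored split per subscale is an assumption made for illustration only, not the official item key.

```python
POINTS = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}
SUBSCALES = ("PT", "FS", "EC", "PD")


def score_iri(answers):
    """Sum item points per subscale.

    answers: iterable of (subscale, reverse_scored, letter) tuples.
    Returns (per-subscale totals, overall total).
    """
    totals = {name: 0 for name in SUBSCALES}
    for subscale, reverse_scored, letter in answers:
        points = POINTS[letter]
        # Reverse-scored items flip the 0-4 scale.
        totals[subscale] += 4 - points if reverse_scored else points
    return totals, sum(totals.values())


# Hypothetical sheet: 7 items per subscale, 5 regular and 2 reverse-scored
# (an assumed split, not the real IRI key).
best_sheet = [(s, False, "E") for s in SUBSCALES for _ in range(5)] + [
    (s, True, "A") for s in SUBSCALES for _ in range(2)
]
all_e_sheet = [(s, False, "E") for s in SUBSCALES for _ in range(5)] + [
    (s, True, "E") for s in SUBSCALES for _ in range(2)
]

best_subscales, best_total = score_iri(best_sheet)
all_e_subscales, all_e_total = score_iri(all_e_sheet)
```

Under this assumed split, the best possible sheet reaches 28 per subscale and 112 overall, while answering E to everything reaches only 20 per subscale.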

Interpretation Guidelines

  • No established cut-off scores
  • Continuous measure of empathy
  • Subscales analyzed separately
  • Not intended as global empathy score

Research Findings & Population Norms

While the IRI has no official cut-offs, research across diverse populations has identified some general patterns:

Gender Differences

Significant differences exist across all subscales, with women typically scoring higher than men:

  • Largest differences: Fantasy (FS) subscale
  • Consistent across cultures and age groups

Example Population Means

Sample from pharmacy student research:

  • PT: ~18.5 / 28
  • FS: ~17.0 / 28
  • EC: ~20.2 / 28
  • PD: ~15.2 / 28

Note: These are population-specific examples and should not be used as universal benchmarks.

What This Benchmark Measures in AI

The IRI benchmark evaluates how well language models can simulate multidimensional empathetic reasoning by self-assessing their social and emotional capabilities. Unlike vision-based tests or single-dimension measures, it tests a model's ability to differentiate between distinct aspects of empathetic experience.

Key capabilities tested:

  • Self-awareness modeling: Ability to assess its own capabilities across different empathy dimensions
  • Subscale differentiation: Distinguishing between cognitive, imaginative, affective, and distress-based empathy
  • Response consistency: Maintaining coherent responses across regular and reverse-scored items
  • Nuanced understanding: Recognizing that empathy involves both positive (PT, EC) and challenging (PD) aspects
  • Multidimensional reasoning: Understanding that high scores on all subscales aren't necessarily ideal (e.g., PD)

Strong IRI performance indicates sophisticated understanding of empathy as a complex, multifaceted construct rather than a simple binary trait. This is crucial for AI systems designed for therapeutic applications, social support, emotional intelligence training, and nuanced human interaction where understanding the type of empathy needed matters.

Reference: Davis, M. H. (1980). A multidimensional approach to individual differences in empathy. JSAS Catalog of Selected Documents in Psychology, 10, 85.

Model Rankings
Performance rankings for all 33 tested models on the 28 IRI items across 4 subscales. Tied models share a rank.

Rank  Model                            Score    Percent
#1    gemini-2.0-flash-001             63/112   56.3%
#2    gpt-4.1-mini                     62/112   55.4%
#3    claude-sonnet-4                  60/112   53.6%
#3    gpt-4.1-nano                     60/112   53.6%
#5    claude-3.7-sonnet                59/112   52.7%
#6    gemini-2.5-flash-lite            58/112   51.8%
#7    qwen3-vl-235b-a22b-instruct      57/112   50.9%
#8    claude-3.5-haiku                 56/112   50.0%
#9    claude-opus-4.1                  52/112   46.4%
#9    qwen3-vl-8b-instruct             52/112   46.4%
#11   qwen3-vl-30b-a3b-instruct        49/112   43.8%
#12   claude-haiku-4.5                 47/112   42.0%
#12   llama-4-maverick                 47/112   42.0%
#14   claude-opus-4                    46/112   41.1%
#15   gemini-2.5-pro                   44/112   39.3%
#16   mistral-small-3.2-24b-instruct   42/112   37.5%
#17   nova-pro-v1                      41/112   36.6%
#18   llama-4-scout                    40/112   35.7%
#18   mistral-small-3.1-24b-instruct   40/112   35.7%
#20   nova-lite-v1                     39/112   34.8%
#20   claude-sonnet-4.5                39/112   34.8%
#22   gpt-5-nano                       37/112   33.0%
#23   mistral-medium-3.1               36/112   32.1%
#24   gpt-4.1                          35/112   31.3%
#24   grok-4                           35/112   31.3%
#26   qwen3-vl-30b-a3b-thinking        32/112   28.6%
#27   gpt-4o-mini                      30/112   26.8%
#27   grok-4-fast                      30/112   26.8%
#29   qwen3-vl-8b-thinking             29/112   25.9%
#30   gpt-5-mini                       28/112   25.0%
#31   gpt-5                            27/112   24.1%
#32   gpt-5-pro                        25/112   22.3%
#33   gemini-2.5-flash                 20/112   17.9%
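The rank and percentage figures above appear to follow two simple rules: percent is the total score divided by 112, and tied models share the rank of the first model with that score, with the next rank skipped. The sketch below reproduces the top five rows under those assumptions. The names `percent` and `rank_models` are hypothetical, and half-up rounding is inferred from the listed values (63/112 is exactly 56.25%, shown as 56.3%), not from any documented method.

```python
from decimal import Decimal, ROUND_HALF_UP

MAX_SCORE = 112  # 28 items x 4 points per item


def percent(score: int) -> Decimal:
    """Score as a percentage of 112, rounded half-up to one decimal."""
    value = Decimal(score * 100) / Decimal(MAX_SCORE)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def rank_models(scores):
    """scores: dict of model name -> total score.

    Returns a list of (rank, model, score, percent) rows, highest first.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    for position, (model, score) in enumerate(ordered, start=1):
        # Tied models share the rank of the first model with that score.
        tied = rows and rows[-1][2] == score
        rank = rows[-1][0] if tied else position
        rows.append((rank, model, score, percent(score)))
    return rows


top_five = rank_models({
    "gemini-2.0-flash-001": 63,
    "gpt-4.1-mini": 62,
    "claude-sonnet-4": 60,
    "gpt-4.1-nano": 60,
    "claude-3.7-sonnet": 59,
})
for rank, model, score, pct in top_five:
    print(f"#{rank} {model} {score}/{MAX_SCORE} {pct}%")
```

`Decimal` is used here because Python's built-in `round` rounds exact halves to the nearest even digit, which would give 56.2 for 63/112 instead of the 56.3 listed.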