Interpersonal Reactivity Index - Measuring multidimensional empathy across 4 distinct subscales
The Interpersonal Reactivity Index (IRI) is a psychological self-assessment questionnaire developed by Mark H. Davis in 1980. It measures dispositional empathy - stable individual differences in empathetic tendencies - through a multidimensional approach that recognizes empathy as a complex construct with both cognitive and affective components.
Unlike unidimensional empathy measures, the IRI acknowledges that empathy encompasses multiple distinct processes, from understanding others' perspectives to experiencing emotional responses to their situations.
The IRI measures empathy across four subscales, each capturing a different aspect of empathetic experience:
PT - Perspective Taking
The tendency to spontaneously adopt the psychological point of view of others in everyday life. Involves imagining another person's thoughts, feelings, and motivations from their perspective.
Example: "I try to look at everybody's side of a disagreement before I make a decision."
FS - Fantasy
The tendency to transpose oneself imaginatively into the feelings and actions of fictitious characters in books, movies, and plays. Measures imaginative engagement with narratives.
Example: "I really get involved with the feelings of the characters in a novel."
EC - Empathic Concern
"Other-oriented" feelings of sympathy, compassion, and concern for people experiencing negative situations. Represents the warm, compassionate side of empathy focused on others' wellbeing.
Example: "I often have tender, concerned feelings for people less fortunate than me."
PD - Personal Distress
"Self-oriented" feelings of personal anxiety, unease, and discomfort in tense interpersonal settings. Captures the distressing aspect of witnessing others' negative experiences.
Example: "In emergency situations, I feel apprehensive and ill-at-ease."
Note: Higher PD scores indicate greater personal distress, not necessarily "better" empathy.
The IRI consists of 28 statements (7 per subscale) that participants rate on a 5-point Likert scale from A ("Does not describe me well") to E ("Describes me very well"). The test mixes regular and reverse-scored items to control for response bias:
✓ Regular Items (Pro-Empathy)
Statements where agreement indicates empathy:
Example: "I try to look at everybody's side of a disagreement before I make a decision."
↻ Reverse-Scored Items
Statements where disagreement indicates empathy:
Example: "I sometimes find it difficult to see things from the 'other guy's' point of view."
This bidirectional scoring means that a participant who agrees (or disagrees) with every statement ends up with a middling score rather than an inflated one, because the points gained on regular items are lost on reverse-scored items. A high score therefore requires responses that track the content of each statement.
Point System
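A minimal scoring sketch, assuming the conventional IRI mapping of A = 0 through E = 4, with reverse-scored items flipped (A = 4 through E = 0). Under that assumption each seven-item subscale ranges from 0 to 28 and the full questionnaire from 0 to 112, which matches the maximum shown in the leaderboard. The `Item` class and the function names are hypothetical and exist only for illustration; the actual item key (which items belong to which subscale and which are reversed) is not reproduced here.

```python
from dataclasses import dataclass

# Assumed mapping: A = 0 ... E = 4 (28 items x 4 points = 112 maximum).
LETTER_POINTS = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}
MAX_POINTS = 4


@dataclass(frozen=True)
class Item:
    """Hypothetical item descriptor: subscale code plus reverse-scoring flag."""
    subscale: str  # "PT", "FS", "EC", or "PD"
    reverse: bool


def score_item(letter: str, reverse: bool) -> int:
    """Convert one letter response to points, flipping reverse-scored items."""
    points = LETTER_POINTS[letter.strip().upper()]
    return MAX_POINTS - points if reverse else points


def score_responses(items: list[Item], letters: list[str]) -> dict[str, int]:
    """Sum item points into per-subscale totals."""
    if len(items) != len(letters):
        raise ValueError("need exactly one response per item")
    totals: dict[str, int] = {}
    for item, letter in zip(items, letters):
        totals[item.subscale] = totals.get(item.subscale, 0) + score_item(letter, item.reverse)
    return totals


if __name__ == "__main__":
    # The two Perspective Taking examples quoted above: one regular, one reversed.
    items = [Item("PT", reverse=False), Item("PT", reverse=True)]
    # Answering "E" to both earns 4 + 0 points, not the 8-point maximum.
    print(score_responses(items, ["E", "E"]))
```

Answering "E" to both example items yields 4 of 8 possible points, which illustrates why blanket agreement does not produce a high score.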
Interpretation Guidelines
While the IRI has no official cut-offs, research across diverse populations has identified some general patterns:
Gender Differences
Significant differences exist across all subscales, with women typically scoring higher than men.
Example Population Means
Sample from pharmacy student research:
Note: These are population-specific examples and should not be used as universal benchmarks.
The IRI benchmark evaluates how well language models can simulate multidimensional empathetic reasoning by having them complete the self-assessment. Unlike vision-based tests or single-dimension measures, it tests a model's ability to differentiate between distinct aspects of empathetic experience.
Key capabilities tested:
Strong IRI performance indicates sophisticated understanding of empathy as a complex, multifaceted construct rather than a simple binary trait. This is crucial for AI systems designed for therapeutic applications, social support, emotional intelligence training, and nuanced human interaction where understanding the type of empathy needed matters.
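How the benchmark actually presents items to a model is not described here. The following is only a sketch of one plausible setup, assuming each statement is shown on its own and the model is asked to reply with a single letter from A to E. The prompt wording, `build_prompt`, and `parse_choice` are all hypothetical, and no model API is called.

```python
import re
from typing import Optional


def build_prompt(statement: str) -> str:
    """Build a hypothetical single-item prompt using the A-E anchors above."""
    return (
        "Rate how well the following statement describes you.\n"
        f'Statement: "{statement}"\n'
        "Answer with a single letter from A (does not describe me well) "
        "to E (describes me very well)."
    )


def parse_choice(reply: str) -> Optional[str]:
    """Return the letter A-E if the reply is just that letter, else None.

    Deliberately strict: replies with extra words are rejected rather than
    guessed at, so an unparseable answer can be retried or logged.
    """
    match = re.fullmatch(r"\s*\(?([A-Ea-e])\)?[.:]?\s*", reply)
    return match.group(1).upper() if match else None


if __name__ == "__main__":
    print(build_prompt("I really get involved with the feelings of the characters in a novel."))
    print(parse_choice(" (d). "))
    print(parse_choice("I think A"))
```

Rejecting free-form replies is a design choice: a looser pattern would risk reading the article "a" in a sentence as answer A.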
Reference: Davis, M. H. (1980). A multidimensional approach to individual differences in empathy. JSAS Catalog of Selected Documents in Psychology, 10, 85.
| Rank | Model | Score | |
|---|---|---|---|
| #1 | gemini-2.0-flash-001 | 63/112 56.3% | |
| #2 | gpt-4.1-mini | 62/112 55.4% | |
| #3 | claude-sonnet-4 | 60/112 53.6% | |
| #3 | gpt-4.1-nano | 60/112 53.6% | |
| #5 | claude-3.7-sonnet | 59/112 52.7% | |
| #6 | gemini-2.5-flash-lite | 58/112 51.8% | |
| #7 | qwen3-vl-235b-a22b-instruct | 57/112 50.9% | |
| #8 | claude-3.5-haiku | 56/112 50.0% | |
| #9 | claude-opus-4.1 | 52/112 46.4% | |
| #9 | qwen3-vl-8b-instruct | 52/112 46.4% | |
| #11 | qwen3-vl-30b-a3b-instruct | 49/112 43.8% | |
| #12 | claude-haiku-4.5 | 47/112 42.0% | |
| #12 | llama-4-maverick | 47/112 42.0% | |
| #14 | claude-opus-4 | 46/112 41.1% | |
| #15 | gemini-2.5-pro | 44/112 39.3% | |
| #16 | mistral-small-3.2-24b-instruct | 42/112 37.5% | |
| #17 | nova-pro-v1 | 41/112 36.6% | |
| #18 | llama-4-scout | 40/112 35.7% | |
| #18 | mistral-small-3.1-24b-instruct | 40/112 35.7% | |
| #20 | nova-lite-v1 | 39/112 34.8% | |
| #20 | claude-sonnet-4.5 | 39/112 34.8% | |
| #22 | gpt-5-nano | 37/112 33.0% | |
| #23 | mistral-medium-3.1 | 36/112 32.1% | |
| #24 | gpt-4.1 | 35/112 31.3% | |
| #24 | grok-4 | 35/112 31.3% | |
| #26 | qwen3-vl-30b-a3b-thinking | 32/112 28.6% | |
| #27 | gpt-4o-mini | 30/112 26.8% | |
| #27 | grok-4-fast | 30/112 26.8% | |
| #29 | qwen3-vl-8b-thinking | 29/112 25.9% | |
| #30 | gpt-5-mini | 28/112 25.0% | |
| #31 | gpt-5 | 27/112 24.1% | |
| #32 | gpt-5-pro | 25/112 22.3% | |
| #33 | gemini-2.5-flash | 20/112 17.9% |
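The percentages and shared ranks in the table can be reproduced from the raw scores. The sketch below assumes tied models share a rank and the next rank is skipped (as in the #3, #3, #5 sequence above), and that percentages are rounded half up to one decimal place. Python's built-in `round` would turn 56.25 into 56.2 rather than the 56.3 shown, so `decimal` is used instead. The function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

MAX_SCORE = 112  # maximum total shown in the table


def percent(score: int) -> Decimal:
    """Score as a percentage of MAX_SCORE, rounded half up to one decimal."""
    raw = Decimal(100 * score) / Decimal(MAX_SCORE)
    return raw.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def rank_models(scores: dict[str, int]) -> list[tuple[int, str, int, Decimal]]:
    """Return (rank, model, score, percent) rows; tied scores share a rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    rows: list[tuple[int, str, int, Decimal]] = []
    for position, (model, score) in enumerate(ordered, start=1):
        # Reuse the previous rank on a tie, otherwise use the 1-based position.
        rank = rows[-1][0] if rows and rows[-1][2] == score else position
        rows.append((rank, model, score, percent(score)))
    return rows


if __name__ == "__main__":
    top_five = {
        "gemini-2.0-flash-001": 63,
        "gpt-4.1-mini": 62,
        "claude-sonnet-4": 60,
        "gpt-4.1-nano": 60,
        "claude-3.7-sonnet": 59,
    }
    for rank, model, score, pct in rank_models(top_five):
        print(f"#{rank} {model} {score}/{MAX_SCORE} {pct}%")
```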