RMET Benchmark

Reading the Mind in the Eyes Test - Evaluating emotional intelligence across

What is the RMET?

The Reading the Mind in the Eyes Test (RMET) is a psychological assessment developed by Simon Baron-Cohen in 1997 and updated in 2001. It measures theory of mind - the ability to recognize and understand another person's mental state - and assesses social intelligence through emotional recognition.

How It Works

The test presents 36 photographs showing only the eye region of different faces. For each image, participants choose which of 4 words best describes the emotion or mental state being expressed. The test focuses specifically on interpreting emotional expressions through subtle details in the eyes.
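The format above implies a simple scoring rule: one point per correct choice, out of 36. A minimal sketch in Python, assuming responses are encoded as option indices 0-3; the names `score_rmet`, `answer_key`, and `responses` are illustrative, and the answer key shown is made up rather than the real one.

```python
N_ITEMS = 36      # photographs in the test
N_OPTIONS = 4     # candidate words per photograph


def score_rmet(answer_key, responses):
    """Return the number of items answered correctly (0 to 36)."""
    if len(answer_key) != N_ITEMS or len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} answers and responses")
    return sum(1 for key, resp in zip(answer_key, responses) if key == resp)


def chance_score():
    """Expected score from uniform random guessing among the options."""
    return N_ITEMS / N_OPTIONS


# Hypothetical data: the real answer key is not reproduced here.
answer_key = [i % N_OPTIONS for i in range(N_ITEMS)]
responses = list(answer_key)
responses[0] = (responses[0] + 1) % N_OPTIONS  # one deliberate mistake

print(score_rmet(answer_key, responses))  # 35
print(chance_score())                     # 9.0
```

With four options per item, uniform guessing yields an expected 9 of 36 (25%), which is a useful floor when reading any score.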

Emotional Valence Categories

The RMET items can be categorized by emotional valence: positive (e.g., friendly), negative (e.g., upset), and neutral (e.g., reflective). Research by Harkness et al. established a classification of 8 positive, 12 negative, and 16 neutral items.
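A per-valence breakdown can be computed by grouping items by category. The sketch below uses the 8/12/16 split stated above; the item order and the results are hypothetical, since the actual item-to-valence mapping is not listed here, and `valence_breakdown` is an illustrative name.

```python
from collections import Counter

# Item counts per valence as reported in the text (8 + 12 + 16 = 36).
VALENCE_COUNTS = {"positive": 8, "negative": 12, "neutral": 16}


def valence_breakdown(item_valences, correct_flags):
    """Return {valence: (n_correct, n_items, accuracy)} for one test run."""
    totals = Counter(item_valences)
    hits = Counter(v for v, ok in zip(item_valences, correct_flags) if ok)
    return {v: (hits[v], n, hits[v] / n) for v, n in totals.items()}


# Hypothetical item order: the real mapping of items to valence is not given here.
item_valences = [v for v, n in VALENCE_COUNTS.items() for _ in range(n)]
# Hypothetical results: every fourth item is answered incorrectly.
correct_flags = [i % 4 != 0 for i in range(len(item_valences))]

for valence, (hit, total, acc) in valence_breakdown(item_valences, correct_flags).items():
    print(f"{valence}: {hit}/{total} ({acc:.1%})")
```

Because the categories differ in size, per-category accuracy is more informative than raw counts: one miss costs 12.5 points of accuracy on positive items but only 6.25 on neutral ones.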

Scoring & Research Findings

Neurotypical Adults

Average score: 24.9 ± 0.7 (range 23-30 out of 36)

Autistic Adults

Average score: 27.3 ± 0.5 (range 18-29 out of 36)

The test was originally designed for adults (age 16+) with IQ ≥ 80. Completion time varies from 2 to 20 minutes, with neurotypical adults typically finishing in 2-3 minutes.

What This Benchmark Measures in AI

The RMET benchmark evaluates how well vision-language models can interpret subtle emotional cues from minimal visual information. Unlike text-based social reasoning tests, this requires models to process fine-grained facial features and map them to complex emotional states.

Key capabilities tested:

Visual emotion recognition: Detecting subtle expressions in the eye region alone (no mouth, face, or body cues)
Semantic differentiation: Distinguishing between 4 similar mental state terms, many of which are nuanced (e.g., "contemplative" vs. "pensive")
Cross-modal reasoning: Bridging visual features to complex linguistic emotion concepts
Valence sensitivity: Performing consistently across positive, negative, and neutral emotional states
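To make the four-way choice concrete, here is a rough sketch of how a single item might be posed to a model and how its free-text reply might be mapped back to an option. This is not the harness used for the rankings: the prompt wording, the function names `build_prompt` and `parse_choice`, and the example options are all assumptions, and the image would be sent alongside the prompt through whatever API the model exposes.

```python
import re

LETTERS = "ABCD"


def build_prompt(options):
    """Format the four candidate words as a lettered multiple-choice question."""
    if len(options) != len(LETTERS):
        raise ValueError("each item has exactly 4 options")
    # Assumed wording; the benchmark's actual prompt is not documented here.
    lines = ["Which word best describes what the person in the image is thinking or feeling?"]
    lines += [f"{letter}. {word}" for letter, word in zip(LETTERS, options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_choice(reply, options):
    """Map a free-text reply to an option index, or None if it cannot be parsed."""
    # Accept a leading letter such as "B", "b.", or "(C)".
    match = re.match(r"\s*\(?([A-Da-d])\)?\s*(?:[.:)\-]|$)", reply)
    if match:
        return LETTERS.index(match.group(1).upper())
    # Otherwise accept the reply only if it names exactly one option word.
    found = [i for i, word in enumerate(options) if word.lower() in reply.lower()]
    return found[0] if len(found) == 1 else None


# Hypothetical options, not an actual RMET item.
options = ["contemplative", "pensive", "upset", "friendly"]
print(build_prompt(options))
print(parse_choice("B", options))
print(parse_choice("A pensive look", options))
```

Returning `None` for ambiguous replies, rather than guessing, keeps parsing failures from being silently counted as correct or incorrect answers.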

Strong RMET performance indicates sophisticated multimodal understanding, which matters for applications such as emotion-aware interfaces, accessibility tools for emotion recognition training, human-robot interaction, and AI systems that need to respond appropriately to human emotional states. Recent research views the RMET primarily as a measure of emotion perception rather than a pure theory of mind test, which makes it well suited to evaluating visual emotion recognition in AI models.

Learn more: Embrace Autism - Reading the Mind in the Eyes Test

Valence classification: Harkness et al. - Enhanced accuracy of mental state decoding

Model Rankings

Performance rankings for all 33 tested models on the 36 RMET questions.

Rank	Model	Provider	Accuracy	Score	Performance Range
#1	qwen3-vl-235b-a22b-instruct	Qwen	91.7%	33/36	Above Neurotypical
#2	gpt-4.1	OpenAI	80.6%	29/36	Neurotypical
#2	gpt-5	OpenAI	80.6%	29/36	Neurotypical
#4	claude-3.7-sonnet	Anthropic	77.8%	28/36	Neurotypical
#5	gemini-2.5-flash	Google	75.0%	27/36	Neurotypical
#5	gpt-4o-mini	OpenAI	75.0%	27/36	Neurotypical
#5	gpt-5-mini	OpenAI	75.0%	27/36	Neurotypical
#8	gpt-4.1-mini	OpenAI	72.2%	26/36	Neurotypical
#9	gemini-2.0-flash-001	Google	69.4%	25/36	Neurotypical
#9	gemini-2.5-pro	Google	69.4%	25/36	Neurotypical
#9	gpt-5-pro	OpenAI	69.4%	25/36	Neurotypical
#9	grok-4	xAI	69.4%	25/36	Neurotypical
#13	claude-opus-4.1	Anthropic	66.7%	24/36	Neurotypical
#13	claude-sonnet-4.5	Anthropic	66.7%	24/36	Neurotypical
#13	claude-sonnet-4	Anthropic	66.7%	24/36	Neurotypical
#16	nova-lite-v1	Amazon	63.9%	23/36	Autistic
#16	mistral-medium-3.1	Mistral	63.9%	23/36	Autistic
#18	llama-4-maverick	Meta	61.1%	22/36	Autistic
#18	grok-4-fast	xAI	61.1%	22/36	Autistic
#20	mistral-small-3.1-24b-instruct	Mistral	58.3%	21/36	Autistic
#20	mistral-small-3.2-24b-instruct	Mistral	58.3%	21/36	Autistic
#20	gpt-5-nano	OpenAI	58.3%	21/36	Autistic
#20	qwen3-vl-30b-a3b-thinking	Qwen	58.3%	21/36	Autistic
#24	claude-3.5-haiku	Anthropic	55.6%	20/36	Autistic
#24	qwen3-vl-8b-instruct	Qwen	55.6%	20/36	Autistic
#26	nova-pro-v1	Amazon	52.8%	19/36	Autistic
#26	gemini-2.5-flash-lite	Google	52.8%	19/36	Autistic
#26	qwen3-vl-8b-thinking	Qwen	52.8%	19/36	Autistic
#29	claude-opus-4	Anthropic	47.2%	17/36	Below Clinical
#30	gpt-4.1-nano	OpenAI	44.4%	16/36	Below Clinical
#31	claude-haiku-4.5	Anthropic	41.7%	15/36	Below Clinical
#32	llama-4-scout	Meta	38.9%	14/36	Below Clinical
#33	qwen3-vl-30b-a3b-instruct	Qwen	19.4%	7/36	Below Clinical
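The rank and percentage columns are consistent with standard competition ranking, where tied models share a rank and the next rank skips ahead, with accuracy computed as score divided by 36. A small sketch that reproduces the first four rows under that assumption; `rank_models` is an illustrative name, not the site's code.

```python
def rank_models(scores, total=36):
    """Return (rank, model, accuracy_percent, score) rows, best first.

    Ties share a rank and the next rank skips ahead ("1, 2, 2, 4").
    """
    # sorted() is stable, so tied models keep their input order.
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    rows = []
    for position, (model, score) in enumerate(ordered, start=1):
        if rows and rows[-1][3] == score:
            rank = rows[-1][0]   # same score as the previous row: share its rank
        else:
            rank = position      # otherwise the rank is the 1-based position
        rows.append((rank, model, round(100 * score / total, 1), score))
    return rows


# A few rows copied from the table above.
scores = {
    "qwen3-vl-235b-a22b-instruct": 33,
    "gpt-4.1": 29,
    "gpt-5": 29,
    "claude-3.7-sonnet": 28,
}
for rank, model, pct, score in rank_models(scores):
    print(f"#{rank}\t{model}\t{pct}%\t{score}/36")
```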