
RMET Benchmark

Reading the Mind in the Eyes Test - Evaluating emotional intelligence across vision-language models

What is the RMET?

The Reading the Mind in the Eyes Test (RMET) is a psychological assessment developed by Simon Baron-Cohen and colleagues in 1997 and revised in 2001. It measures theory of mind - the ability to recognize and understand another person's mental state - and assesses social intelligence through emotional recognition.

How It Works

The test presents 36 photographs showing only the eye region of different faces. For each image, participants choose which of 4 words best describes the emotion or mental state being expressed. The test focuses specifically on interpreting emotional expressions through subtle details in the eyes.
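
To make the protocol concrete for model evaluation, the sketch below shows one way to pose a single RMET-style item to a vision-language model as a four-way forced choice. The `query_model` function and the example words are assumptions for illustration; the actual stimuli and answer choices come from the published test.

```python
# Minimal sketch of posing one RMET-style item to a vision-language model.
# `query_model` is a hypothetical stand-in for whatever VLM API is under
# test, and the item data below is illustrative, not a real RMET stimulus.

def build_prompt(choices: list[str]) -> str:
    options = "\n".join(f"{i + 1}. {word}" for i, word in enumerate(choices))
    return (
        "The image shows only the eye region of a face.\n"
        "Which word best describes what the person is thinking or feeling?\n"
        f"{options}\n"
        "Answer with the number of your choice."
    )

def score_item(image_path: str, choices: list[str], answer: str, query_model) -> bool:
    """Return True if the model picks the correct word for one item."""
    reply = query_model(image_path, build_prompt(choices))
    picked = choices[int(reply.strip()) - 1]  # assumes the model replies with a digit
    return picked == answer

# Example usage with illustrative data:
# score_item("item_01.jpg", ["playful", "comforting", "irritated", "bored"],
#            "playful", query_model)
```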

Emotional Valence Categories

The RMET items can be categorized by emotional valence: positive (e.g., friendly), negative (e.g., upset), and neutral (e.g., reflective). Research by Harkness et al. established a classification of 8 positive, 12 negative, and 16 neutral items.
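
In an AI evaluation, that classification supports a per-valence accuracy breakdown. Below is a minimal sketch assuming the Harkness et al. counts; the specific item-to-valence assignment is a placeholder, since the real mapping must come from the published classification.

```python
from collections import Counter

# Valence label per item number (1-36). The 8/12/16 counts follow
# Harkness et al.; the specific assignment below is a placeholder --
# a real evaluation must use the published item classification.
ITEM_VALENCE = {i: "positive" for i in range(1, 9)}
ITEM_VALENCE.update({i: "negative" for i in range(9, 21)})
ITEM_VALENCE.update({i: "neutral" for i in range(21, 37)})

def valence_breakdown(correct_items: set[int]) -> dict[str, float]:
    """Per-valence accuracy, given the set of correctly answered item numbers."""
    totals = Counter(ITEM_VALENCE.values())  # {"positive": 8, "negative": 12, "neutral": 16}
    hits = Counter(ITEM_VALENCE[i] for i in correct_items)
    return {valence: hits[valence] / totals[valence] for valence in totals}

# valence_breakdown({1, 2, 9, 10, 11, 21})
# -> {"positive": 0.25, "negative": 0.25, "neutral": 0.0625}
```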

Scoring & Research Findings

Neurotypical Adults

Average score: 27.3 ± 0.5 (range 23-30 out of 36)

Autistic Adults

Average score: 24.9 ± 0.7 (range 18-29 out of 36)

The test was originally designed for adults (age 16+) with an IQ of at least 80. Completion time varies from 2 to 20 minutes, with neurotypical adults typically finishing in 2-3 minutes.
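
For comparing these human baselines against the model scores below, raw scores convert to percentages directly:

```python
def pct(score: int, total: int = 36) -> float:
    """Convert a raw RMET score to a percentage."""
    return round(100 * score / total, 1)

# pct(27) -> 75.0   (close to the neurotypical adult average)
# pct(33) -> 91.7   (the top model score in the rankings below)
```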

What This Benchmark Measures in AI

The RMET benchmark evaluates how well vision-language models can interpret subtle emotional cues from minimal visual information. Unlike text-based social reasoning tests, this requires models to process fine-grained facial features and map them to complex emotional states.

Key capabilities tested:

  • Visual emotion recognition: Detecting subtle expressions from the eye region alone (no mouth, lower-face, or body cues)
  • Semantic differentiation: Distinguishing between 4 similar mental state terms, many of which are nuanced (e.g., "contemplative" vs. "pensive")
  • Cross-modal reasoning: Bridging visual features to complex linguistic emotion concepts
  • Valence sensitivity: Performing consistently across positive, negative, and neutral emotional states

Strong RMET performance indicates sophisticated multimodal understanding - crucial for applications like emotion-aware interfaces, accessibility tools for emotion recognition training, human-robot interaction, and AI systems that need to respond appropriately to human emotional states. Recent research views RMET primarily as an emotion perception measure rather than a pure theory of mind test, making it an ideal benchmark for evaluating AI visual emotion recognition capabilities.

Model Rankings
Performance rankings for all 33 tested models on the 36 RMET questions. Tied scores share a rank under standard competition ranking (sketched after the table), and each model's results can be broken down by emotional valence.
Rank  Model                               Score
#1    qwen3-vl-235b-a22b-instruct         33/36
#2    gpt-4.1                             29/36
#2    gpt-5                               29/36
#4    claude-3.7-sonnet                   28/36
#5    gemini-2.5-flash                    27/36
#5    gpt-4o-mini                         27/36
#5    gpt-5-mini                          27/36
#8    gpt-4.1-mini                        26/36
#9    gemini-2.0-flash-001                25/36
#9    gemini-2.5-pro                      25/36
#9    gpt-5-pro                           25/36
#9    grok-4                              25/36
#13   claude-opus-4.1                     24/36
#13   claude-sonnet-4.5                   24/36
#13   claude-sonnet-4                     24/36
#16   nova-lite-v1                        23/36
#16   mistral-medium-3.1                  23/36
#18   llama-4-maverick                    22/36
#18   grok-4-fast                         22/36
#20   mistral-small-3.1-24b-instruct      21/36
#20   mistral-small-3.2-24b-instruct      21/36
#20   gpt-5-nano                          21/36
#20   qwen3-vl-30b-a3b-thinking           21/36
#24   claude-3.5-haiku                    20/36
#24   qwen3-vl-8b-instruct                20/36
#26   nova-pro-v1                         19/36
#26   gemini-2.5-flash-lite               19/36
#26   qwen3-vl-8b-thinking                19/36
#29   claude-opus-4                       17/36
#30   gpt-4.1-nano                        16/36
#31   claude-haiku-4.5                    15/36
#32   llama-4-scout                       14/36
#33   qwen3-vl-30b-a3b-instruct           7/36
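
The tied ranks above (two models at #2, four at #20, and so on) follow standard competition ranking: tied scores share a rank, and the next distinct score is ranked as if each tied entry held its own slot. A minimal sketch:

```python
def competition_rank(scores: dict[str, int]) -> list[tuple[int, str, int]]:
    """Standard competition ranking: tied scores share a rank, and the next
    distinct score is ranked as if each tied entry held its own slot."""
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    ranked, prev_score, rank = [], None, 0
    for position, (model, score) in enumerate(ordered, start=1):
        if score != prev_score:  # new score: rank jumps to the current position
            rank, prev_score = position, score
        ranked.append((rank, model, score))
    return ranked

# competition_rank({"gpt-4.1": 29, "gpt-5": 29, "claude-3.7-sonnet": 28})
# -> [(1, "gpt-4.1", 29), (1, "gpt-5", 29), (3, "claude-3.7-sonnet", 28)]
```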