
For decades, the IQ test has served as a familiar, albeit contested, metric for human intelligence. Now, a startup project named AI IQ is applying this same concept to artificial intelligence, assigning estimated intelligence quotients to over 50 of the world’s leading language models and visualizing their performance on a standard bell curve.
The resulting interactive visualizations at aiiq.org have rapidly gained traction on social media over the past week. Enterprise technologists have lauded the charts for making a complex market more understandable, while researchers and commentators have voiced sharp criticism, arguing the entire framework is fundamentally misleading.
“This is incredibly useful,” commented Thibaut Mélen, a technology commentator, on X. “It’s much easier to grasp model progress when visualized this way, rather than just another lengthy leaderboard.”
Brian Vellmure, a business strategist, echoed this sentiment: “This is helpful. It aligns anecdotally with personal experience.”
However, the backlash emerged swiftly. “It’s nonsensical. AI is far too uneven; the map is not the territory,” posted AI Deeply, an artificial intelligence commentary account, articulating a widespread concern among researchers: that condensing a language model’s diverse and often erratic capabilities into a single number creates a dangerous illusion of precision.

Twelve benchmarks, four dimensions, and one controversial number: how AI IQ actually works
AI IQ was conceived by Ryan Shea, an engineer, entrepreneur, and angel investor best known as a co-founder of the blockchain platform Stacks. Shea also co-founded Voterbase and has been an early-stage investor in several successful companies, including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University.
The site’s methodology is built on a seemingly straightforward formula. AI IQ aggregates performance across 12 benchmarks, categorized into four reasoning dimensions: abstract, mathematical, programmatic, and academic. The overall composite IQ score is a simple average of these four dimensional scores: IQ = ¼ (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad).
The abstract reasoning dimension incorporates results from ARC-AGI-1 and ARC-AGI-2, the challenging pattern-recognition benchmarks designed to evaluate general fluid intelligence. Mathematical reasoning draws on FrontierMath (Tiers 1–3 and Tier 4, counted separately), AIME, and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning draws on Humanity’s Last Exam, CritPt, and GPQA Diamond.
Each raw benchmark score is translated into an implied IQ through what the site calls “hand-calibrated difficulty curves.” Notably, the methodology caps benchmarks it deems easier or more susceptible to data contamination, so they cannot push implied scores above 100. Harder, less gameable benchmarks keep higher ceilings. The system also handles missing data conservatively: a model must have scores in at least two of the four dimensions to receive a derived IQ, and where benchmarks are absent the pipeline is designed to suppress scores rather than inflate them. The platform states, “every derived IQ averages all four dimensions, so missing coverage cannot make a model look better by omission.”
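The scoring steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual pipeline: the anchor points, the linear interpolation between them, and the rule for filling a missing dimension (reusing the model's lowest available dimension score) are all assumptions made here, since the site's exact curves and imputation rule are not published.

```python
from bisect import bisect_right
from statistics import mean
from typing import Optional

DIMENSIONS = ("abstract", "math", "prog", "acad")


def implied_iq(raw_score: float, curve: list[tuple[float, float]], cap: float) -> float:
    """Map a raw benchmark score to an implied IQ, then apply the benchmark's cap.

    `curve` holds (raw_score, iq) anchor points sorted by strictly increasing
    raw_score. Interpolating linearly between anchors is an assumption.
    """
    xs = [x for x, _ in curve]
    if raw_score <= xs[0]:
        iq = curve[0][1]
    elif raw_score >= xs[-1]:
        iq = curve[-1][1]
    else:
        i = bisect_right(xs, raw_score)
        (x0, y0), (x1, y1) = curve[i - 1], curve[i]
        iq = y0 + (y1 - y0) * (raw_score - x0) / (x1 - x0)
    # Easier or contamination-prone benchmarks get a low cap (e.g. 100).
    return min(iq, cap)


def derived_iq(dimension_iqs: dict[str, Optional[float]]) -> Optional[float]:
    """Average all four dimensions; return None if fewer than two are scored."""
    present = [dimension_iqs[d] for d in DIMENSIONS if dimension_iqs.get(d) is not None]
    if len(present) < 2:
        return None  # not enough coverage: suppress the score
    # Hypothetical conservative fill: a missing dimension takes the lowest
    # available dimension score, so omission can never raise the average.
    fill = min(present)
    filled = [
        dimension_iqs[d] if dimension_iqs.get(d) is not None else fill
        for d in DIMENSIONS
    ]
    return mean(filled)


# Hypothetical calibration curve for one benchmark (percent correct -> IQ).
EXAMPLE_CURVE = [(0.0, 70.0), (50.0, 100.0), (100.0, 130.0)]
```

With this example curve, a raw score of 75 maps to an implied IQ of 115 under a cap of 130, while a perfect score on a benchmark capped at 100 still yields 100. A model scored only on the abstract (120) and math (130) dimensions would receive 122.5 rather than the 125 that averaging the two available scores alone would give.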

OpenAI leads the bell curve, but the gap between the top AI models has never been smaller
As of mid-May 2026, the AI IQ charts depict rapid convergence among leading-edge models, while models in lower tiers are more widely spread. OpenAI’s GPT-5.5 currently holds the top position on the bell curve, with an estimated IQ close to 136, the highest among all tracked models. It is closely followed by Anthropic’s Opus 4.7 (approximately 132), GPT-5.4 and Google’s Gemini 3.1 Pro (both approximately 131), and Opus 4.6 (approximately 129), an exceptionally tight cluster at the top tier.
This compression is not unique to the AI IQ framework. Visual Capitalist recently noted a similar trend, citing a separate Mensa-based ranking by TrackingAI, observing that “the biggest takeaway is how compressed the top of the leaderboard has become.” On that specific scale, Grok-4.20 Expert Mode and GPT 5.4 Pro tied at 145, with Gemini 3.1 Pro at 141.
Below the leading cluster, the AI IQ charts reveal a densely populated midfield. Models from Chinese companies, including Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, and MiniMax-M2.7, are grouped between approximately 112 and 118. This suggests increasing competitiveness in the cost-performance segment, a crucial factor for enterprise buyers who may not require the absolute top-tier model for every application. One user on X, ovsky, commented that the data “confirms experience with sonnet 4.6 being an absolute workhorse as opposed to opus 4.5,” highlighting how the charts can validate practical insights that headline rankings often overlook.

Why emotional intelligence scores are becoming the new battleground in AI model rankings
What distinguishes AI IQ from most other benchmarking efforts is its inclusion of an “EQ” — emotional intelligence — score. The site maps each model’s EQ-Bench 3 Elo score and Arena Elo score to an estimated EQ using calibrated piecewise-linear scales, then calculates a 50/50 weighted composite of the two.
The EQ scores yield a meaningfully different ranking compared to IQ alone. In the IQ vs. EQ scatter plot, Anthropic’s Opus 4.7 leads in EQ with a score near 132, positioning it in the upper-right quadrant—the most desirable area, indicating both high cognitive and high emotional intelligence. OpenAI’s GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but exhibit slightly lower EQ scores. Google’s Gemini 3.1 Pro occupies a strong mid-range position on both axes.
A notable methodological choice has drawn attention: EQ-Bench 3 is evaluated by Claude, an Anthropic model. The site acknowledges this “creates potential scoring bias in favor of Anthropic models.” To mitigate this, AI IQ applies a 200-point Elo penalty to the EQ-Bench component for all Anthropic models before mapping to implied EQ. The Arena component, which uses human judges, remains unaffected. This self-correction is unusual in the benchmarking landscape and suggests Shea’s awareness of the methodological complexities. Nevertheless, the EQ dimension captures an aspect that IQ alone cannot: the increasing significance of conversational quality, collaboration, and trust in models deployed for user-facing applications.
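The EQ calculation described in this section can be sketched as follows. The 200-point penalty and the 50/50 weighting come from the site's stated methodology as reported here; the anchor points of both scales and the function names are hypothetical, chosen only to make the example runnable.

```python
from bisect import bisect_right

ANTHROPIC_EQBENCH_PENALTY = 200.0  # Elo points, applied to EQ-Bench only

# Hypothetical (elo, eq) anchors; the site's real calibration is not published.
EQBENCH_ANCHORS = [(800.0, 85.0), (1200.0, 105.0), (1600.0, 125.0)]
ARENA_ANCHORS = [(1200.0, 90.0), (1400.0, 110.0), (1500.0, 130.0)]


def piecewise_linear(x: float, anchors: list[tuple[float, float]]) -> float:
    """Interpolate linearly between sorted anchors, clamping at both ends."""
    xs = [a for a, _ in anchors]
    if x <= xs[0]:
        return anchors[0][1]
    if x >= xs[-1]:
        return anchors[-1][1]
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = anchors[i - 1], anchors[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def estimated_eq(eqbench_elo: float, arena_elo: float, vendor: str) -> float:
    """Blend the two implied EQ values 50/50, penalizing Anthropic on EQ-Bench."""
    if vendor.lower() == "anthropic":
        # EQ-Bench 3 is judged by Claude, so the Elo is reduced before mapping.
        eqbench_elo -= ANTHROPIC_EQBENCH_PENALTY
    eq_from_bench = piecewise_linear(eqbench_elo, EQBENCH_ANCHORS)
    # The Arena component uses human judges and is left unadjusted.
    eq_from_arena = piecewise_linear(arena_elo, ARENA_ANCHORS)
    return 0.5 * eq_from_bench + 0.5 * eq_from_arena
```

Under these illustrative scales, two models with identical Elo inputs (1600 on EQ-Bench, 1400 on Arena) would land at 117.5 for a non-Anthropic vendor and 112.5 for Anthropic, which shows how the penalty affects only half of the composite.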

The AI cost-performance chart that enterprise buyers actually need to see
Arguably the most pragmatically useful chart on the site is the IQ vs. Effective Cost scatter plot. This visualization maps each model’s estimated IQ against an “effective cost” metric, calculated as the token cost for a task involving 2 million input tokens and 1 million output tokens, multiplied by a usage efficiency factor.
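As a rough illustration of that metric, the sketch below computes an effective cost from per-million-token prices. The 2 million input and 1 million output token workload comes from the description above; quoting prices per million tokens and the example price values are assumptions, and the article does not say how the site derives the usage efficiency factor.

```python
INPUT_TOKENS_MILLIONS = 2.0   # reference task: 2 million input tokens
OUTPUT_TOKENS_MILLIONS = 1.0  # reference task: 1 million output tokens


def effective_cost(
    input_price_per_million: float,
    output_price_per_million: float,
    usage_efficiency: float = 1.0,
) -> float:
    """Token cost of the reference task, scaled by a usage efficiency factor.

    Prices are in dollars per million tokens (an assumption of this sketch).
    A factor above 1.0 would represent a model that consumes more tokens
    than the baseline to finish the same work.
    """
    token_cost = (
        INPUT_TOKENS_MILLIONS * input_price_per_million
        + OUTPUT_TOKENS_MILLIONS * output_price_per_million
    )
    return token_cost * usage_efficiency
```

For example, a hypothetical model priced at $5 per million input tokens and $30 per million output tokens would cost $40 for the reference task at a factor of 1.0, and $50 at a factor of 1.25.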
The chart reveals a common pattern in enterprise technology: the most capable models are not always the most cost-effective. GPT-5.5 and Opus 4.7 are positioned in the upper-left quadrant, indicating high IQ but also high cost, with effective per-task expenses exceeding $30 and $50, respectively. In contrast, models such as GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a favorable position, offering respectable IQ scores between 112 and 120 at effective costs of approximately $1 to $5 per task. At the lower end of the cost spectrum, GPT-oss-20b, an open-weight OpenAI model, has an effective cost of around $0.20 and an IQ of approximately 107, making it a potentially very economical option for high-volume classification or extraction workloads.
The platform also provides a 3D visualization that simultaneously displays IQ, EQ, and effective cost. A dashed line across the cube indicates the ideal trajectory: higher IQ, higher EQ, and lower cost. Models positioned near the “green end” of this axis represent stronger overall value; those closer to the “red end” compromise on capability, cost efficiency, or both. For Chief Information Officers managing API expenditures, the implication is clear: the performance gap between a $50 model and a $3 model has narrowed to the point where intelligent routing (using expensive models for complex tasks and cheaper ones for simpler operations) is no longer optional. It is the dominant architecture for sophisticated AI deployments.
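One simple way to picture such routing is a rule that picks the cheapest model meeting a task's capability requirement. The sketch below is only an illustration of that idea: the catalog entries are hypothetical placeholders with figures loosely in the ranges the charts describe, and real routers typically weigh latency, context length, and task type as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    iq: float              # estimated IQ from a chart such as AI IQ's
    effective_cost: float  # dollars per reference task


def route(required_iq: float, catalog: list[ModelOption]) -> ModelOption:
    """Return the cheapest model whose IQ meets the requirement.

    If no model qualifies, fall back to the most capable one available.
    """
    eligible = [m for m in catalog if m.iq >= required_iq]
    if not eligible:
        return max(catalog, key=lambda m: m.iq)
    return min(eligible, key=lambda m: m.effective_cost)


# Hypothetical catalog; names and numbers are illustrative, not site data.
CATALOG = [
    ModelOption("frontier-model", iq=135.0, effective_cost=40.0),
    ModelOption("mid-tier-model", iq=115.0, effective_cost=3.0),
    ModelOption("small-model", iq=107.0, effective_cost=0.2),
]
```

With this catalog, a task requiring an IQ of 100 goes to the small model, one requiring 110 goes to the mid-tier model, and only tasks requiring 130 or more reach the frontier model.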

Critics argue AI’s “jagged” capabilities make a single IQ score dangerously misleading
The most significant criticism leveled against AI IQ is philosophical and targets the core methodology. Critics contend that consolidating a model’s disparate capabilities into a single score obscures more than it clarifies.
“IQ as a proxy is fading; we’re observing reasoning density spikes that don’t correlate with the general intelligence factor (g-factor),” posted Zaya, a technology commentator, on X. “GPT-5.5 has already reached saturation on MMLU-Pro, yet it fails ClockBench 50% of the time.”
This observation touches upon what AI researchers refer to as the “jaggedness” problem: large language models often exhibit highly inconsistent performance, excelling at advanced physics problems while struggling with tasks considered elementary. A composite score can mask these significant performance disparities.
Pressureangle, another user on X, offered a more granular critique, highlighting a “complete lack of transparency” and arguing that the site fails to fully disclose how its calibration curves were created and validated. While AI IQ does list its 12 benchmarks and illustrates the shape of each calibration curve in its methodology section, the raw data and precise mathematical transformations are not published as open datasets. This lack of full reproducibility is a concern for researchers accustomed to transparent methodologies.
Others questioned the fundamental premise. “As useless as human IQ testing,” commented haashim on X. Shubham Sharma, an AI and technology writer, proposed an alternative: “Why not have the models take an official (MENSA-Grade) test? Wouldn’t this be the most accurate and ‘human-comparable’ way to benchmark intelligence?” This approach is already employed by TrackingAI, which administers the Mensa Norway IQ test to language models. However, Mensa-style tests primarily assess abstract pattern recognition, whereas AI IQ aims for a broader composite that includes coding, mathematics, and academic reasoning. As noted by Visual Capitalist, “an IQ-style benchmark captures only one slice of capability.” Each methodology has its trade-offs, and neither has definitively settled the debate.

The real race isn’t for the highest score: it’s for the smartest model stack
Despite the methodological debates, the most impactful insight from the AI IQ data may lie not in individual model scores but in the market dynamics it reveals. The landscape now features more than 50 frontier-class models accessible via APIs from at least 14 major vendors across the United States, China, and Europe. Each vendor publishes its own benchmarks, often highlighting only its strengths, so results measured under different criteria are hard to compare.
Academic research has pointed out that “most benchmarks introduce bias by focusing on a particular type of domain.” The Frontier IQ Over Time chart on AI IQ illustrates the rapid evolution of this field: in October 2023, GPT-4-turbo was positioned around an estimated IQ of 75. By early 2026, the leading models were approaching 135—an improvement of roughly 60 points in just 30 months.
This pace raises a fundamental question about the sustainability of any single scoring system. While the site compresses ceilings for saturated benchmarks, as models continue to achieve top scores on even the most challenging tests—ARC-AGI-2, FrontierMath Tier 4, Humanity’s Last Exam—the framework will eventually face the same ceiling effects that have challenged previous AI evaluation systems. Connor Forsyth highlighted this trend on X: “ARC AGI 3 disagrees,” he stated, referencing a next-generation benchmark that potentially challenges current scoring metrics.
AI IQ has clear imperfections. Its methodology is only partially transparent, the IQ metaphor can mislead, and its creator acknowledges some biases while likely overlooking others. However, the alternative, navigating dozens of vendor-specific benchmark tables with inconsistent testing and scoring conventions, is arguably harder. The site offers enterprise buyers something genuinely valuable: a unified framework for comparing models across vendors, dimensions, and price points. If it is updated regularly and read with sufficient nuance, it makes clear that the optimal model choice “depends on the task” and that no single AI is universally “best.”
As Debdoot Ghosh observed on X after reviewing the charts: “Now a human’s role is just to orchestrate?”
Perhaps. But if the AI IQ data clearly indicates anything, it’s that orchestration (knowing which model to deploy, when, and at what cost) has evolved into its own distinct form of intelligence, one that still has no definitive benchmark.
Business Style Takeaway: The AI IQ project offers a valuable, albeit debated, framework for demystifying the rapidly evolving LLM market, providing a much-needed comparative lens for businesses evaluating AI solutions. Understanding the trade-offs between model performance (IQ/EQ) and cost-effectiveness is crucial for strategic AI adoption and maximizing ROI.
Based on materials from: venturebeat.com
