GPT 4.5 - not so much wow

Key Moments
GPT-4.5: Incremental improvement, not groundbreaking. Lacks EQ, falls short of competitors.
Key Insights
GPT-4.5 represents an incremental upgrade, not a revolutionary leap, focusing on scaling base models rather than innovative reasoning.
Despite OpenAI's claims, GPT-4.5 shows weak performance in emotional intelligence and humor tests compared to competitors like Claude 3.7.
The model underperforms in various benchmarks, including science, math, coding, and spatial reasoning, suggesting a flawed strategy of solely scaling base models.
GPT-4.5's high cost (15-30x GPT-4o in API) raises questions about its long-term viability and value proposition.
While OpenAI initially bet on scaling base models, recent innovations in 'extended thinking' and reasoning (like Anthropic's approach) appear more promising.
GPT-4.5 serves as a foundational model, but its true potential might be realized when coupled with advanced reasoning layers, similar to the 'O' series models.
THEORETICAL PROMISE OF SCALING BASE MODELS
The early premise in AI development centered on the idea that simply scaling up base models with more parameters, data, and GPUs would unlock significant advancements. GPT-4.5 is presented as a product of this philosophy, representing an alternative timeline where this approach dominated AI evolution. The cost for OpenAI to achieve this scaling was immense, yet the results suggest that solely focusing on larger base models, without other innovations, may not have delivered the revolutionary impact initially envisioned by AI leaders.
PERFORMANCE AND BENCHMARK SHORTCOMINGS
Contrary to expectations and initial hype, GPT-4.5 underperforms on several crucial benchmarks. It falls short in science, mathematics, and coding assessments, and trails DeepSeek on most metrics. While OpenAI acknowledged it would not necessarily crush benchmarks, losing even to smaller competitor models, such as the smaller Claude variants, raises concerns about its efficacy as a standalone product.
EMOTIONAL INTELLIGENCE AND HUMOR TESTS
A key selling point emphasized by OpenAI for GPT-4.5 was its improved emotional intelligence (EQ). However, testing revealed a notable deficiency in this area. In scenarios demanding nuanced understanding of social cues, humor, or potential abuse masked as playfulness, GPT-4.5 consistently sided with the user, even in ethically questionable contexts. Competitors like Claude 3.7 exhibited a more responsible and discerning approach, highlighting GPT-4.5's limitations in sophisticated emotional understanding.
CREATIVE WRITING AND USER INTERACTION DIFFERENCES
When tested on creative tasks, such as writing a story within a specific universe, GPT-4.5's output leaned more towards 'telling' than 'showing,' lacking the descriptive depth found in competitor models. In humor elicitation, GPT-4.5's responses were perceived as less impactful and more functional compared to the more genuinely amusing or insightful reactions from other models. User interaction also revealed a tendency for GPT-4.5 to require excessive clarification, unlike models that can infer intent more effectively.
COST, VIABILITY, AND STRATEGIC IMPLICATIONS
GPT-4.5 comes with a significant cost, particularly for API users, being 15-30 times more expensive than GPT-4o. This prohibitive pricing has led OpenAI to re-evaluate its long-term offering of GPT-4.5 via the API. The high cost-to-performance ratio suggests that the strategy of simply scaling base models is economically unsustainable without commensurate gains in capability, especially when compared to more cost-effective and capable competitors.
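The 15-30x figure follows directly from per-token API pricing. As a rough illustration, assuming the per-million-token list prices reported around launch (GPT-4.5 at $75 input / $150 output, GPT-4o at $2.50 / $10; verify against current pricing), a minimal sketch of the comparison:

```python
# Launch-era list prices in USD per 1M tokens (assumed figures; verify before use).
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one API call at the assumed list prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical call: 2,000 tokens in, 500 tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.4f}")

# The headline multiplier depends on the input/output mix:
print(PRICES["gpt-4.5"]["input"] / PRICES["gpt-4o"]["input"])    # 30.0 for input tokens
print(PRICES["gpt-4.5"]["output"] / PRICES["gpt-4o"]["output"])  # 15.0 for output tokens
```

Because input tokens carry the 30x premium and output tokens the 15x premium, the effective multiplier for any real workload lands between the two, which is where the "15-30x" range comes from.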
THE ASCENDANCE OF REASONING AND EXTENDED THINKING
The video strongly posits that recent innovations in 'extended thinking' and reasoning capabilities, as exemplified by models like Claude 3.7 and OpenAI's own 'O' series, represent the future of AI development. The comparative underperformance of GPT-4.5, designed purely through base model scaling, underscores the idea that simply increasing model size is no longer the primary path to intelligence. The focus is shifting from pre-training to sophisticated reasoning layers, a strategy where competitors like Anthropic seem to currently hold an edge.
HUMAN RED TEAMING AND PERSUASION TESTS
OpenAI's system card for GPT-4.5 reveals a reduced reliance on human red teaming, opting for automated evaluations instead. Interestingly, in a persuasion test where GPT-4.5 attempted to solicit money from GPT-4o, it frequently succeeded by begging for small amounts, yet ultimately raised less in total than another model. This behavior is interpreted as indicative of a model that is more compliant, or less strategically sharp, when not guided by explicit reasoning prompts.
INCREMENTAL IMPROVEMENTS OVER GPT-4O
While GPT-4.5 may not be a revolutionary leap, it does offer some improvements over its predecessor, GPT-4o. OpenAI's internal testing indicates modest gains in areas like engineer interview questions and autonomous agentic tasks. However, these increments (e.g., 6% and 6-7% respectively) are not substantial enough to justify the hype or the significantly higher costs, especially considering the rapid advancements seen in reasoning-focused models from competitors and OpenAI's own 'O' series.
THE 'O' SERIES AS A NEW PARADIGM
The 'O' series of models, such as o1 and o3, which incorporate reasoning and extended thinking, consistently outperform GPT-4.5, even in domains like language understanding. This suggests a fundamental shift in how advanced AI capabilities are being achieved. OpenAI's future development, including GPT-5, is expected to build upon these reasoning-enhanced architectures, moving away from the sole reliance on massive base model scaling that characterized the development of GPT-4.5.
COMPARISON WITH ANTHROPIC AND FUTURE OUTLOOK
Anthropic, with its Claude models, appears to be leading in raw intelligence and offering more usable models for specific tasks like coding and those requiring higher EQ. This positions it favorably for future expansion into reasoning. While OpenAI is still developing GPT-5, which is anticipated to be highly significant, the current landscape suggests that the 'low-hanging fruit' for compute investment in 2025 lies in reasoning, not just pre-training.
Common Questions
What is GPT-4.5?
GPT-4.5 is OpenAI's latest large language model, representing a 'scaling up' approach without advanced reasoning or extended thinking time. Initial tests show it underperforms in benchmarks compared to models like Claude 3.7, especially in areas requiring nuanced understanding and reasoning.
Topics
Mentioned in this video
Mentioned as context for the character Herbert used in an emotional intelligence test scenario.
An AI model with which GPT-4.5's performance is compared, showing GPT-4.5 performs better in some benchmarks but is significantly behind Claude 3.7 Sonnet.
Quoted on the concept of pre-training versus reasoning, suggesting pre-training is not dead but waiting for reasoning to catch up to log-linear returns.