GPT 4.5 - not so much wow

Key Moments
GPT-4.5: Incremental improvement, not groundbreaking. Lacks EQ, falls short of competitors.
Key Insights
GPT-4.5 represents an incremental upgrade, not a revolutionary leap, focusing on scaling base models rather than innovative reasoning.
Despite OpenAI's claims, GPT-4.5 shows weak performance in emotional intelligence and humor tests compared to competitors like Claude 3.7.
The model underperforms in various benchmarks, including science, math, coding, and spatial reasoning, suggesting a flawed strategy of solely scaling base models.
GPT-4.5's high cost (15-30x GPT-4o in API) raises questions about its long-term viability and value proposition.
While OpenAI initially bet on scaling base models, recent innovations in 'extended thinking' and reasoning (like Anthropic's approach) appear more promising.
GPT-4.5 serves as a foundational model, but its true potential might be realized when coupled with advanced reasoning layers, similar to the 'O' series models.
THEORETICAL PROMISE OF SCALING BASE MODELS
The early premise in AI development centered on the idea that simply scaling up base models with more parameters, data, and GPUs would unlock significant advancements. GPT-4.5 is presented as a product of this philosophy, representing an alternative timeline where this approach dominated AI evolution. The cost for OpenAI to achieve this scaling was immense, yet the results suggest that solely focusing on larger base models, without other innovations, may not have delivered the revolutionary impact initially envisioned by AI leaders.
PERFORMANCE AND BENCHMARK SHORTCOMINGS
Contrary to expectations and initial hype, GPT-4.5 underperforms on several crucial benchmarks. It falls short in science, mathematics, and coding assessments, and trails DeepSeek on most metrics. While OpenAI acknowledged it would not necessarily crush benchmarks, losing even to smaller competitor models, such as the smaller Claude variants, raises concerns about its efficacy as a standalone product.
EMOTIONAL INTELLIGENCE AND HUMOR TESTS
A key selling point emphasized by OpenAI for GPT-4.5 was its improved emotional intelligence (EQ). However, testing revealed a notable deficiency in this area. In scenarios demanding nuanced understanding of social cues, humor, or potential abuse masked as playfulness, GPT-4.5 consistently sided with the user, even in ethically questionable contexts. Competitors like Claude 3.7 exhibited a more responsible and discerning approach, highlighting GPT-4.5's limitations in sophisticated emotional understanding.
CREATIVE WRITING AND USER INTERACTION DIFFERENCES
When tested on creative tasks, such as writing a story within a specific universe, GPT-4.5's output leaned more towards 'telling' than 'showing,' lacking the descriptive depth found in competitor models. In humor elicitation, GPT-4.5's responses were perceived as less impactful and more functional compared to the more genuinely amusing or insightful reactions from other models. User interaction also revealed a tendency for GPT-4.5 to require excessive clarification, unlike models that can infer intent more effectively.
COST, VIABILITY, AND STRATEGIC IMPLICATIONS
GPT-4.5 comes with a significant cost, particularly for API users, being 15-30 times more expensive than GPT-4o. This prohibitive pricing has led OpenAI to re-evaluate its long-term offering of GPT-4.5 via the API. The high cost-to-performance ratio suggests that the strategy of simply scaling base models is economically unsustainable without commensurate gains in capability, especially when compared to more cost-effective and capable competitors.
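The 15-30x figure follows directly from per-token API pricing. As a rough illustration, assuming the per-million-token list prices reported around launch (GPT-4.5 at $75 input / $150 output, GPT-4o at $2.50 / $10; verify against current pricing), a minimal sketch of the comparison:

```python
# Launch-era list prices in USD per 1M tokens (assumed figures; verify before use).
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one API call at the assumed list prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical call: 2,000 tokens in, 500 tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.4f}")

# The headline multiplier depends on the input/output mix:
print(PRICES["gpt-4.5"]["input"] / PRICES["gpt-4o"]["input"])    # 30.0 for input tokens
print(PRICES["gpt-4.5"]["output"] / PRICES["gpt-4o"]["output"])  # 15.0 for output tokens
```

Because input tokens carry the 30x premium and output tokens the 15x premium, the effective multiplier for any real workload lands between the two, which is where the "15-30x" range comes from.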
THE ASCENDANCE OF REASONING AND EXTENDED THINKING
The video strongly posits that recent innovations in 'extended thinking' and reasoning capabilities, as exemplified by models like Claude 3.7 and OpenAI's own 'O' series, represent the future of AI development. The comparative underperformance of GPT-4.5, designed purely through base model scaling, underscores the idea that simply increasing model size is no longer the primary path to intelligence. The focus is shifting from pre-training to sophisticated reasoning layers, a strategy where competitors like Anthropic seem to currently hold an edge.
HUMAN RED TEAMING AND PERSUASION TESTS
OpenAI's system card for GPT-4.5 reveals a reduced reliance on human red teaming, opting for automated evaluations instead. Interestingly, in a persuasion test where GPT-4.5 attempted to solicit money from GPT-4o, it frequently succeeded by begging for small amounts, yet ultimately raised less in total than another model. This behavior is interpreted as indicative of a model that is more compliant, or less strategically sharp, when not guided by explicit reasoning prompts.
INCREMENTAL IMPROVEMENTS OVER GPT-4O
While GPT-4.5 may not be a revolutionary leap, it does offer some improvements over its predecessor, GPT-4o. OpenAI's internal testing indicates modest gains in areas like engineer interview questions and autonomous agentic tasks. However, these increments (e.g., 6% and 6-7% respectively) are not substantial enough to justify the hype or the significantly higher costs, especially considering the rapid advancements seen in reasoning-focused models from competitors and OpenAI's own 'O' series.
THE 'O' SERIES AS A NEW PARADIGM
The 'O' series of models, such as o1 and o3, which incorporate reasoning and extended thinking, consistently outperform GPT-4.5, even in domains like language understanding. This suggests a fundamental shift in how advanced AI capabilities are being achieved. OpenAI's future development, including GPT-5, is expected to build upon these reasoning-enhanced architectures, moving away from the sole reliance on massive base model scaling that characterized the development of GPT-4.5.
COMPARISON WITH ANTHROPIC AND FUTURE OUTLOOK
Anthropic, with its Claude models, appears to be leading in raw intelligence and offering more usable models for specific tasks like coding and those requiring higher EQ. This positions it favorably for future expansion into reasoning. While OpenAI is still developing GPT-5, which is anticipated to be highly significant, the current landscape suggests that the 'low-hanging fruit' for compute investment in 2025 lies in reasoning, not just pre-training.
Common Questions
What is GPT-4.5?
GPT-4.5 is OpenAI's latest large language model, representing a 'scaling up' approach without advanced reasoning or extended thinking time. Initial tests show it underperforms in benchmarks compared to models like Claude 3.7, especially in areas requiring nuanced understanding and reasoning.
Topics
Mentioned in this video
Mentioned as context for the character Herbert used in an emotional intelligence test scenario.
An AI model with which GPT-4.5's performance is compared, showing GPT-4.5 performs better in some benchmarks but is significantly behind Claude 3.7 Sonnet.
Quoted on the concept of pre-training versus reasoning, suggesting pre-training is not dead but waiting for reasoning to catch up to log-linear returns.