Key Moments

New OpenAI Model 'Imminent' and AI Stakes Get Raised (plus Med Gemini, GPT 2 Chatbot and Scale AI)

AI Explained
Science & Technology · 6 min read · 21 min video
May 2, 2024 · 99,435 views
TL;DR

OpenAI's new model is imminent, potentially a GPT-4.5, but the company has bypassed UK government safety testing, raising concerns about uncontrolled AI deployment.

Key Insights

1. OpenAI and Meta have not allowed the UK government to safety test their latest AI models; only Google DeepMind has complied.

2. Sam Altman confirmed he is personally using an unreleased version of OpenAI's new model, suggesting an imminent release, likely GPT-4.5.

3. The mystery GPT-2 chatbot, potentially a preview of GPT-4.5, performed on par with GPT-4 Turbo across various tests, suggesting data quality matters more than model size.

4. A Scale AI benchmark found that newer AI models may have been exposed to test questions during training (data contamination), inflating performance evaluations.

5. Google's Med-Gemini models perform competitively with or better than doctors on medical diagnosis tasks, reaching roughly 93% accuracy on some benchmarks after benchmark errors are corrected.

6. Despite these advances, benchmarks like GSM8K contain errors, and even advanced models like Claude 3 Opus struggle with basic arithmetic in some tests.

OpenAI navigates safety tests amidst imminent model release rumors

Recent AI developments have been marked by a blend of surprising revelations and a sense of escalating stakes, particularly with an "imminent release" of new models from OpenAI hinted at by company insiders and government officials. This follows a peculiar incident involving a mysterious GPT-2 chatbot that appeared and was quickly withdrawn. While many speculated wildly, firsthand testing of the GPT-2 chatbot revealed its performance was largely on par with GPT-4 Turbo, suggesting its intended role was not a significant leap in AI capabilities, but perhaps a sneak peek at an iterative update. This aligns with rumors that OpenAI might be preparing to launch a GPT-4.5, optimized for reasoning and planning, rather than rushing a full GPT-5. The urgency behind these releases is amplified by a Politico article, which highlighted that, contrary to promises made at the Bletchley AI safety summit, only Google DeepMind has provided early access for safety testing to the UK government. OpenAI, along with Meta, has not. This lack of governmental oversight on cutting-edge models is a significant concern, especially given the potential for surprise or unintended consequences when deploying advanced AI systems, as stated by Sam Altman himself: "AI and surprise don't go well together." This underscores a growing tension between rapid AI development and the imperative for safety and transparency.

Evidence mounts for a GPT-4.5 release before GPT-5

Further evidence points towards an upcoming GPT-4.5 release. Sam Altman, in a recent interview with MIT Technology Review, confidently stated he knew when the next GPT version would be released. This assertion is significant: such certainty would be impossible if the model still required extensive, open-ended safety testing, unlike Google's Gemini Ultra, which faced delays. Additionally, a firsthand account from an AI insider at a Stanford event revealed Altman is personally using an unreleased version of OpenAI's new model. These indicators suggest OpenAI is prioritizing an iterative deployment strategy, aiming for a GPT-4.5 release before the predicted GPT-5, which is expected between November and January. Altman's philosophy emphasizes responsible deployment: "it's much better than the alternative, and in this case in particular I think we really owe it to society to deploy iteratively." This approach contrasts with some internal perceptions of transparency at OpenAI: one former employee who joined Google cited better model visibility there.

The role of data in achieving state-of-the-art AI performance

The performance of the mystery GPT-2 chatbot, despite its potentially smaller size, highlights the pivotal role of data in AI development. As James Betker of OpenAI noted, "behavior is determined by your data set." This suggests that rather than solely focusing on tweaking model architectures or hyperparameters, the quality and scale of training data are paramount. Flaws observed in models like GPT-4 and DALL-E 3 are attributed to a "lack of data in a specific domain." The mantra appears to be that "anything can be state-of-the-art with enough scale, compute, and eval hacking." This perspective challenges the notion of an exclusive proprietary 'secret sauce' among AI giants. Meta's Llama 3 models, performing strongly across different parameter sizes, suggest that the competitive edge may increasingly come down to sheer investment in compute and data rather than unique architectural secrets. The GPU scarcity of a few years ago is giving way to a surge in investment, with companies building out massive AI infrastructure.

Benchmark contamination and model generalization concerns

A recent Scale AI paper introduces a refined benchmark for mathematical reasoning, uncovering crucial issues with existing tests. A primary concern is data contamination, where models may have encountered benchmark questions during their training, inflating their perceived performance. This was evident as Mistral and Phi models lagged on new, unseen questions compared to the original test, while GPT-4 and Claude 3 performed comparably on both. The paper suggests that larger models, even if exposed to test data, can generalize better due to more robust learning. Furthermore, the research identified errors in the original GSM8K benchmark, designed for high schoolers: some answers were non-positive integers when all answers should be positive. This raises questions about the reliability of widely used benchmarks. The authors posit that model builders may inadvertently create training datasets too similar to benchmark questions, leading to inflated scores that don't reflect real-world capabilities. Despite these issues, smaller models like Microsoft's Phi-3 Mini, with just 3.8 billion parameters, showed impressive performance, nearing GPT-4 Turbo on unseen questions, underscoring the impact of high-quality data.
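The contamination check described above can be sketched in a few lines: evaluate a model on the original benchmark and on freshly written, unseen questions, and flag a large accuracy gap as a contamination signal. This is an illustrative toy, not Scale AI's actual harness; `model_answers`, the question format, and the 10% threshold are all assumptions for demonstration.

```python
def accuracy(model_answers, questions):
    """Fraction of questions the model answers correctly."""
    correct = sum(1 for q in questions if model_answers(q["text"]) == q["answer"])
    return correct / len(questions)

def contamination_gap(model_answers, original_qs, unseen_qs, threshold=0.10):
    """Return the accuracy drop on unseen questions and whether it exceeds
    the (arbitrary, illustrative) contamination threshold."""
    gap = accuracy(model_answers, original_qs) - accuracy(model_answers, unseen_qs)
    return gap, gap > threshold

# Toy demonstration: a "model" that has memorized the original answers
# but fails on rephrased, unseen questions.
original = [{"text": "2+2", "answer": "4"}, {"text": "3*3", "answer": "9"}]
unseen = [{"text": "two plus two", "answer": "4"}, {"text": "three squared", "answer": "9"}]
memorized = {"2+2": "4", "3*3": "9"}
gap, flagged = contamination_gap(lambda t: memorized.get(t, "?"), original, unseen)
print(gap, flagged)  # 1.0 True: perfect on originals, 0% on unseen questions
```

A genuinely capable model shows a small gap on both sets, which matches the paper's observation that GPT-4 and Claude 3 performed comparably on old and new questions.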

Google's Med-Gemini achieves doctor-level diagnostic capabilities

A significant breakthrough is Google's Med-Gemini, a multimodal AI system demonstrating performance competitive with medical professionals in diagnosis. The system can analyze vast amounts of medical data, including 700,000-word electronic health records, facilitated by Gemini 1.5's long context window. On the MedQA benchmark, where human doctor pass rates are around 60%, Med-Gemini achieved state-of-the-art performance; when errors in the benchmark questions were accounted for, performance rose to approximately 93%. The model's ability to attach confidence scores to its answers and use search to verify uncertain information, coupled with a fine-tuning loop using correct answers, contributes to its accuracy. This advancement is particularly relevant given the millions of deaths worldwide attributed to medical errors. The potential for Med-Gemini to assist in complex procedures, even during live surgery by assessing video feeds for critical safety criteria, is immense, although ethical and safety considerations have thus far restricted deployment in such scenarios. The non-open-sourced nature of the model raises questions about accessibility, but the ethical imperative to deploy AI that surpasses human diagnostic accuracy is a critical consideration for the future of healthcare.

Rivalry and ongoing challenges in AI benchmark evaluation

The development of Med-Gemini also highlights the competitive landscape and the ongoing challenges of AI benchmarking. Google's fine-tuning approach is contrasted with Microsoft's Medprompt prompting technique, with each company claiming superiority: Microsoft claims Medprompt allows GPT-4 to outperform Google's specially tuned models, while Google asserts its own method is principled and extensible, subsequently showcasing state-of-the-art performance on 10 out of 14 benchmarks. This competitive pressure is seen as positive, driving innovation. However, the recurring theme of benchmark flaws persists: in Med-Gemini's evaluation, 7.4% of questions had quality issues, such as missing information or incorrect answers. This, along with the observation that even advanced models like Claude 3 Opus can err on basic arithmetic, suggests that while AI is making strides, limitations in generalization and potential over-optimization for benchmarks remain critical areas for improvement and careful evaluation.

Benchmark Performance Comparison: New vs. Old Questions

Data extracted from this episode

Model Family     | Performance on Original Benchmark | Performance on New Benchmark
Mistral and Phi  | High                              | Notably lagged
GPT-4 and Claude | High                              | Same or better

MedQA Benchmark Performance

Data extracted from this episode

Model/Clinician  | Performance (with search)      | Performance (with search, errors removed)
Doctor pass rate | ~60%                           | N/A
Med-Gemini       | Outperformed expert clinicians | ~93%

Common Questions

When is OpenAI's next model expected?
Insider reports suggest an imminent release of new OpenAI models, possibly a GPT-4.5 version optimized for reasoning and planning, with GPT-5 predicted for late 2024 or early 2025.

