Key Moments

New OpenAI Model 'Imminent' and AI Stakes Get Raised (plus Med Gemini, GPT 2 Chatbot and Scale AI)

AI Explained
Science & Technology · 6 min read · 21 min video
May 2, 2024 · 99,435 views
TL;DR

OpenAI's new model is imminent, potentially a GPT-4.5, but the company has bypassed UK government safety testing, raising concerns about uncontrolled AI deployment.

Key Insights

1. OpenAI and Meta have not allowed the UK government to safety test their latest AI models; only Google DeepMind has complied.

2. Sam Altman confirmed he is personally using an unreleased version of OpenAI's new model, suggesting an imminent release, likely GPT-4.5.

3. The mystery GPT-2 chatbot, potentially a preview of GPT-4.5, performed on par with GPT-4 Turbo across various tests, suggesting data quality matters more than model size.

4. A Scale AI benchmark found that newer AI models may have been exposed to test questions during training (data contamination), inflating performance evaluations.

5. Google's Med-Gemini models perform competitively with or better than doctors on medical diagnosis tasks, reaching roughly 93% accuracy on some benchmarks after benchmark errors are corrected.

6. Despite these advances, benchmarks like GSM8K contain errors, and even advanced models like Claude 3 Opus struggle with basic arithmetic in some tests.

OpenAI navigates safety tests amidst imminent model release rumors

Recent AI developments have been marked by a blend of surprising revelations and a sense of escalating stakes, particularly with an "imminent release" of new models from OpenAI hinted at by company insiders and government officials. This follows a peculiar incident involving a mysterious GPT-2 chatbot that appeared and was quickly withdrawn. While many speculated wildly, firsthand testing of the GPT-2 chatbot revealed its performance was largely on par with GPT-4 Turbo, suggesting its intended role was not a significant leap in AI capabilities, but perhaps a sneak peek at an iterative update. This aligns with rumors that OpenAI might be preparing to launch a GPT-4.5, optimized for reasoning and planning, rather than rushing a full GPT-5. The urgency behind these releases is amplified by a Politico article, which highlighted that, contrary to promises made at the Bletchley AI safety summit, only Google DeepMind has provided early access for safety testing to the UK government. OpenAI, along with Meta, has not. This lack of governmental oversight on cutting-edge models is a significant concern, especially given the potential for surprise or unintended consequences when deploying advanced AI systems, as stated by Sam Altman himself: "AI and surprise don't go well together." This underscores a growing tension between rapid AI development and the imperative for safety and transparency.

Evidence mounts for a GPT-4.5 release before GPT-5

Further evidence points towards an upcoming GPT-4.5 release. Sam Altman, in a recent interview with MIT Technology Review, confidently stated he knew when the next GPT version would be released. This assertion is significant: such certainty would be impossible if the model still required extensive, open-ended safety testing, unlike Google's Gemini Ultra, which faced delays. Additionally, a firsthand account from an AI insider at a Stanford event revealed Altman is personally using an unreleased version of OpenAI's new model. These indicators suggest OpenAI is prioritizing an iterative deployment strategy, aiming for a GPT-4.5 release before the predicted GPT-5, which is expected between November and January. Altman's philosophy emphasizes responsible deployment: "it's much better than the alternative, and in this case in particular I think we really owe it to society to deploy iteratively." This approach contrasts with some internal perceptions of transparency at OpenAI: one former employee who joined Google cited better model visibility there.

The role of data in achieving state-of-the-art AI performance

The performance of the mystery GPT-2 chatbot, despite its potentially smaller size, highlights the pivotal role of data in AI development. As James Betker of OpenAI noted, "behavior is determined by your data set." This suggests that rather than solely focusing on tweaking model architectures or hyperparameters, the quality and scale of training data are paramount. Flaws observed in models like GPT-4 and DALL-E 3 are attributed to a "lack of data in a specific domain." The mantra appears to be that "anything can be state-of-the-art with enough scale, compute, and eval hacking." This perspective challenges the notion of an exclusive proprietary 'secret sauce' among AI giants. Meta's Llama 3 models, performing strongly across different parameter sizes, suggest that the competitive edge may increasingly come down to sheer investment in compute and data rather than unique architectural secrets. The GPU scarcity of a few years ago is giving way to a surge in investment, with companies building out massive AI infrastructure.

Benchmark contamination and model generalization concerns

A recent Scale AI paper introduces a refined benchmark for mathematical reasoning, uncovering crucial issues with existing tests. A primary concern is data contamination, where models may have encountered benchmark questions during their training, inflating their perceived performance. This was evident as Mistral and Phi models lagged on new, unseen questions compared to the original test, while GPT-4 and Claude 3 performed comparably on both. The paper suggests that larger models, even if exposed to test data, can generalize better due to more robust learning. Furthermore, the research identified errors in the original GSM8K benchmark, designed for high schoolers: some answers were non-positive integers when all answers should be positive. This raises questions about the reliability of widely used benchmarks. The authors posit that model builders may inadvertently create training datasets too similar to benchmark questions, leading to inflated scores that don't reflect real-world capabilities. Despite these issues, smaller models like Microsoft's Phi-3 Mini, with just 3.8 billion parameters, showed impressive performance, nearing GPT-4 Turbo on unseen questions, underscoring the impact of high-quality data.
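The contamination check described above can be sketched in a few lines: evaluate a model on the original benchmark and on freshly written, unseen questions, and flag a large accuracy gap as a contamination signal. This is an illustrative toy, not Scale AI's actual harness; `model_answers`, the question format, and the 10% threshold are all assumptions for demonstration.

```python
def accuracy(model_answers, questions):
    """Fraction of questions the model answers correctly."""
    correct = sum(1 for q in questions if model_answers(q["text"]) == q["answer"])
    return correct / len(questions)

def contamination_gap(model_answers, original_qs, unseen_qs, threshold=0.10):
    """Return the accuracy drop on unseen questions and whether it exceeds
    the (arbitrary, illustrative) contamination threshold."""
    gap = accuracy(model_answers, original_qs) - accuracy(model_answers, unseen_qs)
    return gap, gap > threshold

# Toy demonstration: a "model" that has memorized the original answers
# but fails on rephrased, unseen questions.
original = [{"text": "2+2", "answer": "4"}, {"text": "3*3", "answer": "9"}]
unseen = [{"text": "two plus two", "answer": "4"}, {"text": "three squared", "answer": "9"}]
memorized = {"2+2": "4", "3*3": "9"}
gap, flagged = contamination_gap(lambda t: memorized.get(t, "?"), original, unseen)
print(gap, flagged)  # 1.0 True: perfect on originals, 0% on unseen questions
```

A genuinely capable model shows a small gap on both sets, which matches the paper's observation that GPT-4 and Claude 3 performed comparably on old and new questions.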

Google's Med-Gemini achieves doctor-level diagnostic capabilities

A significant breakthrough is Google's Med-Gemini, a multimodal AI system demonstrating performance competitive with medical professionals in diagnosis. The system can analyze vast amounts of medical data, including 700,000-word electronic health records, facilitated by Gemini 1.5's long context window. On the MedQA benchmark, where human doctor pass rates are around 60%, Med-Gemini achieved state-of-the-art performance; when errors in the benchmark questions were accounted for, performance rose to approximately 93%. The model's ability to attach confidence scores to its answers and use search to verify uncertain information, coupled with a fine-tuning loop using correct answers, contributes to its accuracy. This advancement is particularly relevant given the millions of deaths worldwide attributed to medical errors. The potential for Med-Gemini to assist in complex procedures, even during live surgery by assessing video feeds for critical safety criteria, is immense, although ethical and safety considerations have thus far restricted deployment in such scenarios. The non-open-sourced nature of the model raises questions about accessibility, but the ethical imperative to deploy AI that surpasses human diagnostic accuracy is a critical consideration for the future of healthcare.

Rivalry and ongoing challenges in AI benchmark evaluation

The development of Med-Gemini also highlights the competitive landscape and the ongoing challenges of AI benchmarking. Google's fine-tuning approach is contrasted with Microsoft's Medprompt prompting technique, with each company claiming superiority: Microsoft claims Medprompt allows GPT-4 to outperform Google's specially tuned models, while Google asserts its own method is principled and extensible, subsequently showcasing state-of-the-art performance on 10 out of 14 benchmarks. This competitive pressure is seen as positive, driving innovation. However, the recurring theme of benchmark flaws persists: in Med-Gemini's evaluation, 7.4% of questions had quality issues, such as missing information or incorrect answers. This, along with the observation that even advanced models like Claude 3 Opus can err on basic arithmetic, suggests that while AI is making strides, limitations in generalization and potential over-optimization for benchmarks remain critical areas for improvement and careful evaluation.

Benchmark Performance Comparison: New vs. Old Questions

Data extracted from this episode

Model Family     | Performance on Original Benchmark | Performance on New Benchmark
Mistral and Phi  | High                              | Notably lagged
GPT-4 and Claude | High                              | Same or better

MedQA Benchmark Performance

Data extracted from this episode

Model/Clinician  | Performance (with search)      | Performance (with search, errors removed)
Doctor pass rate | ~60%                           | N/A
Med-Gemini       | Outperformed expert clinicians | ~93%

Common Questions

When is OpenAI's next model expected?
Insider reports suggest an imminent release of new OpenAI models, possibly a GPT-4.5 version optimized for reasoning and planning, with GPT-5 predicted for late 2024 or early 2025.

