AI Fails at 96% of Jobs (New Study)
Key Moments
AI fails 96% of real freelance tasks; it's a tool, not a replacement.
Key Insights
The Upwork-based remote labor index (RLI) study finds AI struggles to meet human-level output in about 96% of paid freelance tasks (best around 3.75–5% success).
Common AI failures include corrupt or unusable files, incomplete work, low-quality results, and inconsistent outputs across variants.
AI shows strengths in narrow, creative domains (idea generation, image/audio work, writing, data retrieval, simple coding, basic video generation).
Current benchmarks can overstate capabilities; RLI reveals AI performance in real-world tasks is near the floor for many job types.
Adoption requires purposeful planning, oversight, and realistic ROI expectations; hype and large investments may outpace durable value.
STUDY DESIGN AND REMOTE LABOR INDEX
Researchers measured AI against real, paid freelance work from Upwork. They sampled 240 tasks across domains like video, design, game development, and architecture, averaging about $630 in pay per task. Both AI and humans received the same brief and attached files, and human evaluators judged the AI's output. The method, called the Remote Labor Index (RLI), counts a success only when the AI performs as well as or better than a human; anything short of that counts as failure. The paper tested older models; updated scores for newer models appear on the project site.
REAL-WORLD RESULTS: WHERE AI FAILED AND WHERE IT SUCCEEDED
Across the 240 jobs, even the strongest models produced acceptable-quality output only rarely: Claude Opus 4.6 succeeded on roughly 5% of tasks, Claude Opus 4.5 on 3.75%, and Gemini on just 1.25%, translating to failure rates of roughly 95–99%. However, AI did show strength in certain areas: generating ideas for audio and visuals, writing, data retrieval, simple coding and report generation, and basic video creation, aligning with common AI use cases.
INTERPRETING THE NUMBERS: LIMITS OF RLI AND WHAT COUNTS AS FAILURE
The RLI metric captures whether AI meets human-level quality in a paid, professional setting. It exposes why current benchmarks often overstate AI capabilities: real-world tasks involve nuance, file formats, and stakeholder expectations that are hard to simulate. The paper argues that AI performance is near the floor on RLI, despite impressive scores on standard benchmarks. Limitations include the use of older models and evolving toolchains; the project website reports scores for newer models than those in the original paper. Taken together, the findings caution against assuming broad replacement of human workers today.
BUSINESS IMPLICATIONS AND THE HYPE CYCLE
Company leaders face a mixed reality: while AI can cut time on routine tasks, the financial returns are unclear. PwC reports many CEOs see no immediate ROI; Gartner predicts some layoffs may reverse as companies re-hire workers; and high-profile claims (e.g., that a large share of Microsoft's code is AI-generated) have coincided with software quality issues. The takeaway is to plan deliberate AI deployments with oversight, recognizing that productivity gains may accrue in narrow domains rather than universal replacement, and that hype can outpace measurable value.
SAFETY, REGULATION, AND THE CASE OF AI MALFUNCTIONS
AI’s use in critical fields carries safety risks. The FDA has logged around 100 AI-related malfunctions, including botched surgeries and misidentified instruments, with some cases linked to adverse outcomes. These incidents underscore the need for human oversight and domain-specific validation. The takeaway is that broad, mission-critical adoption across medicine or other high-stakes areas is premature; at best, AI can assist in well-delimited tasks while requiring professional judgment and corrective feedback loops.
FUTURE OF AI: FOUNDATIONAL RESEARCH OVER SCALE
Industry voices warn that simply increasing data and compute may not produce human-like intelligence. The debate centers on redefining intelligence to focus on understanding the world (reinforcement learning) rather than merely mimicking language (LLMs). Yann LeCun's critique is highlighted: current architectures may be near their peak, and scaling alone won't solve core limitations. The host stresses the need for foundational research, while noting ongoing hype and substantial funding that may misalign with durable, widespread value and practical deployment.
AI model performance on real-world tasks (RLI)
Data extracted from this episode
| Model | Estimated success rate | Notes |
|---|---|---|
| Claude Opus 4.6 | ≈5% | Highest success rate among the models tested; still a ~95% failure rate |
| Claude Opus 4.5 | 3.75% | Close behind 4.6; still very low |
| Gemini | 1.25% | Lowest success among the models tested |
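The failure rates quoted in the episode follow directly from these success rates. As a quick sanity check (a minimal sketch of the arithmetic, not part of the study's methodology; the model names and figures are the episode's reported values):

```python
# Convert the episode's reported RLI success rates into failure rates.
# The arithmetic is simply failure = 1 - success.
success_rates = {
    "Claude Opus 4.6": 0.05,    # ~5% of tasks met or beat human quality
    "Claude Opus 4.5": 0.0375,  # 3.75%
    "Gemini": 0.0125,           # 1.25%
}

for model, success in success_rates.items():
    failure = 1 - success
    print(f"{model}: {success:.2%} success, {failure:.2%} failure")
```

Even the best result leaves a failure rate of about 95%, which is where the headline "AI fails ~96% of jobs" figure comes from.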
Common Questions
What is the Remote Labor Index (RLI)?
The RLI is a method in which paid, real-world tasks from Upwork were given to both humans and AI, and human experts evaluated the AI output against human performance. This contrasts with benchmarks built on simulated or artificial tasks. The goal is to gauge AI effectiveness in actual freelancing contexts.