What is Suno's approach to training data for music generation?

Interestingly, Suno doesn't solely train on music. They incorporate other types of audio, like human vocals that aren't singing, similar to how code generation models are trained on English text to improve pattern recognition.

Are Suno's music generation models large and computationally expensive?

Suno's models are described as relatively small by the standards of today's large language models, and they scale in much the same way text Transformers do. A key challenge is generating audio fast enough to stream, which favors smaller models and may keep parameter counts below those of some LLMs.
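The streaming constraint can be made concrete with a little arithmetic. The sketch below is illustrative only: the token rate and per-token latencies are hypothetical numbers, not figures from Suno or the episode. It shows why a model that takes too long per token cannot keep up with real-time playback.

```python
def real_time_factor(tokens_per_audio_second, ms_per_token):
    """Seconds of compute needed to produce one second of audio."""
    return tokens_per_audio_second * ms_per_token / 1000


def can_stream(tokens_per_audio_second, ms_per_token):
    """Streaming keeps up only if generation is at least as fast as playback."""
    return real_time_factor(tokens_per_audio_second, ms_per_token) <= 1.0


# Hypothetical values: 50 audio tokens per second of sound.
TOKENS_PER_AUDIO_SECOND = 50

small_model_ok = can_stream(TOKENS_PER_AUDIO_SECOND, ms_per_token=10)
large_model_ok = can_stream(TOKENS_PER_AUDIO_SECOND, ms_per_token=40)

print("small model streams in real time:", small_model_ok)
print("large model streams in real time:", large_model_ok)
```

Under these assumed numbers, the faster (smaller) model needs half a second of compute per second of audio, while the slower (larger) one needs two seconds and would stall playback.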

Why did Suno choose music generation over speech, given its background?

Although Suno started with speech models such as Bark, the co-founders fell in love with the intersection of AI and audio more broadly. Music in particular felt uniquely human and exciting, so they pursued it despite advice to focus on the larger speech market.

What are the main use cases for Suno's music generation tool?

Users engage in two primary ways: quick 'meme posting' for humorous, spontaneous songs, and a more involved 'expert mode' for users who already have a specific song in their head and want to craft it carefully. Many users rely on the expert mode.

Does Suno's platform allow users to create AI covers of existing songs?

No, Suno currently does not allow users to input existing lyrics or create covers of copyrighted songs due to publishing rights. Their focus is strictly on enabling the creation of new and original music.

How does Suno see its role in the future of music and its comparison to image generation tools?

Suno aims to be the 'Midjourney of music,' making creation accessible and active. They believe music is fundamentally more social and synchronous than images, offering unique opportunities for collaborative and joyful experiences.

What are the biggest challenges in AI music generation according to Suno?

Key challenges include improving sound fidelity, song quality, and controllability. Suno emphasizes that traditional quantitative benchmarks are insufficient, and 'aesthetics matter'—ultimately, your ears are the best judge of good music.

Key Moments

Making Transformers Sing - with Mikey Shulman of Suno

Latent Space Podcast

Science & Technology | 3 min read | 59 min video

Mar 14, 2024|1,592 views|48|8

mikey shulman | suno ai | ai music | latent space

TL;DR

Suno AI uses Transformer models for music generation, focusing on end-to-end learning and user creativity, expanding audio's role beyond passive consumption.

Key Insights

Suno AI employs Transformer-based models, analogous to language models, for audio generation by predicting sequential audio tokens.

The company prioritizes a general, end-to-end approach to model training, minimizing hard-coded musical knowledge so that the model learns musical structure from the data itself.

Suno trains on diverse audio data, not just music, to improve specific areas like realistic vocal generation.

Suno's platform offers simple text-to-song generation alongside an 'expert mode', serving casual creativity as well as detailed customization.

The core vision for Suno is to make music creation more active and social, akin to a 'collaborative concert' or 'multiplayer mode' in gaming.

While acknowledging the utility of benchmarks, Suno emphasizes 'aesthetics matter', prioritizing the emotional impact and listenability of generated music.

THE STATE OF MUSIC GENERATION

Audio generation, including music, is described as being one to two years behind text and image generation in terms of maturity. While both Transformer-based and diffusion-based models exist for audio, Suno AI favors Transformers, drawing parallels to their success in language modeling. Their core method involves training models to predict the next small segment of audio, treating audio as a sequence of tokens.
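The next-token idea can be sketched with a toy generation loop. This is a minimal illustration, not Suno's implementation: `toy_next_token_probs` is a hypothetical stand-in for a trained Transformer, and the vocabulary size of 8 is an arbitrary assumption. A real system would also need an audio tokenizer and a decoder to turn tokens back into a waveform.

```python
import random

VOCAB_SIZE = 8  # hypothetical number of distinct audio tokens


def toy_next_token_probs(context):
    """Stand-in for a trained model: a distribution over the next token.

    It simply favors tokens close to the most recent one (with wraparound).
    """
    last = context[-1] if context else 0
    weights = []
    for token in range(VOCAB_SIZE):
        distance = min(abs(token - last), VOCAB_SIZE - abs(token - last))
        weights.append(1.0 / (1 + distance))
    total = sum(weights)
    return [w / total for w in weights]


def generate(prompt, num_new_tokens, seed=0):
    """Autoregressive loop: sample one token, append it, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        probs = toy_next_token_probs(tokens)
        next_token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


song_tokens = generate(prompt=[3], num_new_tokens=16)
print(song_tokens)
```

Each new token is sampled conditioned on everything generated so far, which is the same loop a text language model runs over words.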

GENERALITY OVER SPECIALIZATION

Suno AI's philosophy centers on building generalist models rather than highly specialized ones. They avoid embedding domain-specific knowledge, such as musical scales or phonemes, into the models. Instead, they aim for end-to-end learning, allowing the models to discover musical patterns and structures intrinsically from vast amounts of data, similar to how LLMs learn grammar and syntax without explicit programming.

DATA STRATEGIES AND COPYRIGHT

Suno does not train exclusively on music; it also incorporates other audio, including non-singing human vocals, to improve capabilities such as realistic vocal generation, much as code generation models benefit from training on natural language. The approach also reflects potential data limitations and the legal complexities surrounding copyright. The focus remains on enabling users to create new, original music.

USER EXPERIENCE AND CREATIVITY

Suno offers multiple interaction modes, including a simple text-to-song generator and a more advanced 'expert mode' for detailed customization. This caters to diverse user needs, from quick, humorous song creation ('nice shitposting') to finely tuning a desired musical piece. The platform aims to democratize music creation, making it an active and accessible experience for a broader audience.

EXPANDING THE MUSIC PIE

The ultimate vision for Suno is to significantly expand the music industry by making participation more active and social, likening it to the boom in the gaming industry. They aim to facilitate 'collaborative concerts' and 'multiplayer music-making experiences,' moving beyond passive consumption. This involves exploring new interaction paradigms that are more intuitive and engaging for the average person than traditional music production workflows.

BEYOND BENCHMARKS: THE AESTHETICS MATTER

Suno emphasizes that 'aesthetics matter,' recognizing that quantitative benchmarks alone cannot capture the essence of music. They prioritize the emotional impact and listenability of generated music, trusting human ears as the ultimate judge. This perspective, influenced by principles found in economics and social sciences, guides their development towards creating music that genuinely resonates with people.

THE EVOLUTION OF AUDIO GENERATION

The audio generation landscape is broadly categorized into music, speech, and sound effects. Suno competes in the 'net new songs' category, differentiating itself from license-free stock music and from AI cover generation, which faces legal hurdles. The company also sees AI as a tool that augments music production for professionals, enabling more sonic exploration and innovation and changing how music is created and experienced.

Common Questions

Suno uses Transformer-based language models, similar to those for text, but with tokens representing music or audio instead of words. The core idea is to predict the next 'token' of audio sequentially to build a song.

Topics

AI & Machine Learning, Technology & Innovation, Creativity & Media, Generative AI, Prompt Engineering, Deep Learning, Music Generation, AI Creativity, Audio Synthesis

Mentioned in this video

Companies

Kensho Technologies

A company where Mikey Shulman and other Suno co-founders previously worked, known for housing many AI startups.

NVIDIA

Collaborated with Suno on the Parakeet speech recognition (speech-to-text) models.

UMG

Universal Music Group, mentioned in the context of their dispute with TikTok regarding music licensing.

TikTok

Mentioned in the context of their dispute with UMG regarding music licensing.

Seamless

An audio model released by Meta.

SQuAD

A question-answering dataset used to illustrate how models can overfit benchmarks, highlighting the need for clever problem formulation (e.g., questions with no answers).

People

Mikey Shulman

Guest on the podcast, Co-founder and Head of Machine Learning at Suno, with a background in physics and extensive experience in AI startups.

Software & Apps

Midjourney

Used as a comparison for Suno's approach to user interaction and creation, particularly in terms of enabling creative expression through prompts.

Code Llama

A code generation model trained on both code and English, used as an analogy for why Suno trains its music models on data beyond just music.

Bark

An open-source text-to-speech model developed by Suno, which received significant community attention.

Suno

A startup focused on music generation, recently gaining recognition as a top player in the field. They leverage Transformer models for audio creation.

GPT

Mentioned as an example of a language model where implicit knowledge rather than explicit programming is used, similar to Suno's approach to music models.

Media

Return to Monkey

A song generated by a community member and shared on Twitter, which Mikey Shulman highlights as a beautiful example of personalized music creation.

Products

Theremin

An early electronic musical instrument, mentioned as a historical precursor to AI's ability to create novel sounds.

Apple Vision Pro

Mentioned in a demo song about a sad AI, highlighting the model's ability to incorporate real-world entities and current events into its creations.

Concepts

Transformer models

A type of neural network architecture that forms the basis of Suno's approach to music generation, similar to how they are used in text models.

diffusion models

Another type of model used for audio generation, though Suno prefers Transformers for their music models.

Goodhart's Law

A principle discussed by Mikey Shulman, stating that when a measure becomes a target, it ceases to be a good measure. Applied to LLM benchmarking and the importance of aesthetics over purely quantitative metrics.

Organizations

MIT

Mikey Shulman lectured at MIT.
