
Making Transformers Sing - with Mikey Shulman of Suno

Latent Space Podcast
Science & Technology | 3 min read | 59 min video
Mar 14, 2024 | 1,592 views
TL;DR

Suno AI uses Transformer models for music generation, focusing on end-to-end learning and user creativity, expanding audio's role beyond passive consumption.

Key Insights

1. Suno AI employs Transformer-based models, analogous to language models, for audio generation by predicting sequential audio tokens.

2. The company prioritizes a general, end-to-end approach to model training, minimizing hand-coded musical knowledge so that the model learns musical structure from data.

3. Suno trains on diverse audio data, not just music, to improve specific areas like realistic vocal generation.

4. Suno's platform offers simple text-to-song generation and an 'expert mode', catering to both casual creativity and detailed customization.

5. The core vision for Suno is to make music creation more active and social, akin to a 'collaborative concert' or 'multiplayer mode' in gaming.

6. While acknowledging the utility of benchmarks, Suno emphasizes that 'aesthetics matter', prioritizing the emotional impact and listenability of generated music.

THE STATE OF MUSIC GENERATION

Audio generation, including music, is described as roughly one to two years behind text and image generation in maturity. Both Transformer-based and diffusion-based models exist for audio; Suno AI favors Transformers, drawing parallels to their success in language modeling. Suno's core method is to train models to predict the next small segment of audio, treating audio as a sequence of tokens.
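The next-token idea described above can be illustrated with a minimal Python sketch. It is not Suno's implementation: `VOCAB_SIZE`, `toy_next_token_probs`, and `generate` are hypothetical names, and the "model" is a hand-written stand-in for a trained Transformer. The sketch only shows the autoregressive loop, assuming audio has already been converted to integer tokens by some codec.

```python
import random

# Hypothetical codebook size; real audio tokenizers use far larger vocabularies.
VOCAB_SIZE = 8


def toy_next_token_probs(context):
    """Stand-in for a trained Transformer (assumption, not a real model).

    Returns a probability distribution over the next audio token,
    mildly favoring the token that follows the last one in the context.
    """
    last = context[-1] if context else 0
    weights = [1.0] * VOCAB_SIZE
    weights[(last + 1) % VOCAB_SIZE] += 4.0
    total = sum(weights)
    return [w / total for w in weights]


def generate(prompt, n_new, rng):
    """Autoregressive loop: sample one token at a time and append it."""
    tokens = list(prompt)
    for _ in range(n_new):
        probs = toy_next_token_probs(tokens)
        next_token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(generate([0, 1, 2], n_new=10, rng=rng))
```

In a real system the sampled token sequence would be decoded back into a waveform; that step is omitted here because the source does not describe it.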

GENERALITY OVER SPECIALIZATION

Suno AI's philosophy centers on building generalist models rather than highly specialized ones. They avoid embedding domain-specific knowledge, such as musical scales or phonemes, into the models. Instead, they aim for end-to-end learning, letting the models discover musical patterns and structures on their own from vast amounts of data, similar to how LLMs learn grammar and syntax without hand-written rules.

DATA STRATEGIES AND COPYRIGHT

Suno does not train exclusively on music; they incorporate diverse audio data, including non-musical human vocals, to improve specific capabilities like realistic vocal generation. The summary draws a parallel to how code generation models benefit from training on natural language. The approach also acknowledges potential data limitations and the legal complexities surrounding copyright. The focus remains on enabling users to create new, original music.

USER EXPERIENCE AND CREATIVITY

Suno offers multiple interaction modes, including a simple text-to-song generator and a more advanced 'expert mode' for detailed customization. This caters to diverse user needs, from quick, humorous song creation ('nice shitposting') to finely tuning a desired musical piece. The platform aims to democratize music creation, making it an active and accessible experience for a broader audience.

EXPANDING THE MUSIC PIE

The ultimate vision for Suno is to significantly expand the music industry by making participation more active and social, likening it to the boom in the gaming industry. They aim to facilitate 'collaborative concerts' and 'multiplayer music-making experiences,' moving beyond passive consumption. This involves exploring new interaction paradigms that are more intuitive and engaging for the average person than traditional music production workflows.

BEYOND BENCHMARKS: THE AESTHETICS MATTER

Suno emphasizes that 'aesthetics matter,' recognizing that quantitative benchmarks alone cannot capture the essence of music. They prioritize the emotional impact and listenability of generated music, trusting human ears as the ultimate judge. This perspective, influenced by principles found in economics and social sciences, guides their development towards creating music that genuinely resonates with people.

THE EVOLUTION OF AUDIO GENERATION

The audio generation landscape is broadly categorized into music, speech, and sound effects. Suno competes in the 'net new songs' category, distinguishing itself from license-free stock music and AI cover generation, which face legal hurdles. They also see AI as a tool to augment music production for professionals, enabling more sonic exploration and innovation, and so changing how music is created and experienced.

Common Questions

Suno uses Transformer-based language models, similar to those for text, but with tokens representing music or audio instead of words. The core idea is to predict the next 'token' of audio sequentially to build a song.
