Key Moments
Making Transformers Sing - with Mikey Shulman of Suno
Suno AI uses Transformer models for music generation, focusing on end-to-end learning and user creativity, expanding audio's role beyond passive consumption.
Key Insights
Suno AI employs Transformer-based models, analogous to language models, for audio generation by predicting sequential audio tokens.
The company prioritizes a general, end-to-end approach to model training, minimizing hard-coded musical knowledge so that models learn structure from the data itself.
Suno trains on diverse audio data, not just music, to improve specific areas like realistic vocal generation.
Suno's platform offers simple text-to-song generation alongside an 'expert mode', serving both casual creativity and detailed customization.
The core vision for Suno is to make music creation more active and social, akin to a 'collaborative concert' or 'multiplayer mode' in gaming.
While acknowledging the utility of benchmarks, Suno emphasizes that 'aesthetics matter', prioritizing the emotional impact and listenability of generated music.
THE STATE OF MUSIC GENERATION
Audio generation, including music, is described as being one to two years behind text and image generation in terms of maturity. While both Transformer-based and diffusion-based models exist for audio, Suno AI favors Transformers, drawing parallels to their success in language modeling. Their core method involves training models to predict the next small segment of audio, treating audio as a sequence of tokens.
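The idea of treating audio as a sequence of tokens and predicting the next one can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the summary does not describe Suno's tokenizer or model, so uniform amplitude quantization stands in for a learned audio tokenizer, and a bigram count table stands in for a Transformer. The function names (quantize, fit_bigram, generate) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def quantize(samples, levels=16):
    """Map floats in [-1, 1] to integer tokens in [0, levels - 1]."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip out-of-range samples
        tokens.append(min(levels - 1, int((x + 1.0) / 2.0 * levels)))
    return tokens

def dequantize(tokens, levels=16):
    """Map each token back to the centre of its amplitude bin."""
    return [(t + 0.5) / levels * 2.0 - 1.0 for t in tokens]

def fit_bigram(tokens):
    """Count how often each token follows each other token.

    Toy stand-in for a Transformer: the context is only one token long.
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length, seed=0):
    """Autoregressive sampling: draw each new token given the previous one."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        options = counts.get(out[-1])
        if not options:  # context never seen in training: stop early
            break
        choices, weights = zip(*options.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return out

# Toy "audio": ten periods of a sine wave, 40 samples per period.
wave = [math.sin(2 * math.pi * 5 * n / 200) for n in range(400)]
tokens = quantize(wave)            # waveform -> discrete tokens
model = fit_bigram(tokens)         # "training" on the token sequence
generated = generate(model, start=tokens[0], length=50)
audio_out = dequantize(generated)  # tokens -> approximate waveform
```

A real system would replace the count table with a neural network conditioned on a long context, but the generation loop has the same shape: predict a distribution over the next token, sample from it, append, and repeat.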
GENERALITY OVER SPECIALIZATION
Suno AI's philosophy centers on building generalist models rather than highly specialized ones. They avoid embedding domain-specific knowledge, such as musical scales or phonemes, into the models. Instead, they aim for end-to-end learning, allowing the models to discover musical patterns and structures intrinsically from vast amounts of data, similar to how LLMs learn grammar and syntax without explicit programming.
DATA STRATEGIES AND COPYRIGHT
Suno does not train exclusively on music; it also incorporates diverse audio data, including non-musical human vocals, to improve specific capabilities such as realistic vocal generation. The rationale parallels code generation models, which benefit from training on natural language as well as code. The approach also acknowledges potential data limitations and the legal complexity surrounding copyright. The focus remains on enabling users to create new, original music.
USER EXPERIENCE AND CREATIVITY
Suno offers multiple interaction modes, including a simple text-to-song generator and a more advanced 'expert mode' for detailed customization. This caters to diverse user needs, from quick, humorous song creation ('nice shitposting') to finely tuning a desired musical piece. The platform aims to democratize music creation, making it an active and accessible experience for a broader audience.
EXPANDING THE MUSIC PIE
The ultimate vision for Suno is to significantly expand the music industry by making participation more active and social, likening it to the boom in the gaming industry. They aim to facilitate 'collaborative concerts' and 'multiplayer music-making experiences,' moving beyond passive consumption. This involves exploring new interaction paradigms that are more intuitive and engaging for the average person than traditional music production workflows.
BEYOND BENCHMARKS: THE AESTHETICS MATTER
Suno emphasizes that 'aesthetics matter,' recognizing that quantitative benchmarks alone cannot capture the essence of music. They prioritize the emotional impact and listenability of generated music, trusting human ears as the ultimate judge. This perspective, influenced by principles found in economics and social sciences, guides their development towards creating music that genuinely resonates with people.
THE EVOLUTION OF AUDIO GENERATION
The audio generation landscape is broadly divided into music, speech, and sound effects. Suno competes in the 'net new songs' category, differentiating itself from license-free stock music and AI cover generation, which face legal hurdles. Suno also sees AI as a tool that augments music production for professionals by enabling more sonic exploration and innovation, changing how music is created and experienced.
Common Questions
Suno uses Transformer-based language models, similar to those for text, but with tokens representing music or audio instead of words. The core idea is to predict the next 'token' of audio sequentially to build a song.
Mentioned in this video
A company where Mikey Shulman and other Suno co-founders previously worked, known for housing many AI startups.
Collaborated with Suno on the Parakeet text-to-speech model.
Universal Music Group, mentioned in the context of their dispute with TikTok regarding music licensing.
TikTok, mentioned in the context of its dispute with UMG regarding music licensing.
An audio model released by Meta.
A question-answering dataset used to illustrate how models can overfit benchmarks, highlighting the need for clever problem formulation (e.g., questions with no answers).
Used as a comparison for Suno's approach to user interaction and creation, particularly in terms of enabling creative expression through prompts.
A code generation model trained on both code and English, used as an analogy for why Suno trains its music models on data beyond just music.
An open-source text-to-speech model developed by Suno, which received significant community attention.
A startup focused on music generation, recently gaining recognition as a top player in the field. They leverage Transformer models for audio creation.
Mentioned as an example of a language model where implicit knowledge rather than explicit programming is used, similar to Suno's approach to music models.
Transformers, a type of neural network architecture that forms the basis of Suno's approach to music generation, similar to how they are used in text models.
Diffusion models, another type of model used for audio generation, though Suno prefers Transformers for its music models.
Goodhart's Law, a principle discussed by Mikey Shulman, stating that when a measure becomes a target, it ceases to be a good measure. Applied to LLM benchmarking and the importance of aesthetics over purely quantitative metrics.