Technology
Google’s DeepMind Develops V2A Tech to Add Soundtracks to AI-Generated Videos
- DeepMind is developing AI technology called V2A to generate soundtracks for videos, enhancing video generation models by adding synchronized audio elements like music, sound effects, and dialogue.
- Although promising, V2A technology faces challenges with audio quality and misuse concerns, leading DeepMind to withhold public release while gathering feedback and conducting safety assessments.
DeepMind, Google’s AI research lab, announced on Monday that it is developing AI technology that generates soundtracks for videos.
In a post on its official blog, DeepMind says it sees V2A technology (short for “video-to-audio”) as an essential piece of the AI-generated media puzzle. While many organizations, including DeepMind itself, have built video-generating AI models, these models cannot create sound effects to go along with the videos they generate.
“Video generation models are advancing at an incredible pace, but many current systems can only generate silent output,” DeepMind writes. “V2A technology [could] become a promising approach for bringing generated movies to life.”
DeepMind’s V2A tech takes the description of a soundtrack (e.g. “jellyfish pulsating under water, marine life and ocean”) together with a video and generates music, sound effects and dialogue that match the characters and tone of the video. The output is watermarked using SynthID, a technology for combating deepfakes. According to DeepMind, the AI model powering V2A, a diffusion model, was trained on sounds and dialogue transcripts as well as video clips.
“By training on video, audio and the additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts,” DeepMind states.
DeepMind has not said whether any of the training data was copyrighted, nor whether the data’s creators were informed of its use. We’ve reached out to DeepMind for clarification and will update this post if we hear back.
AI-powered sound-generating tools are nothing new: startup Stability AI released one last week, and ElevenLabs unveiled one in May. Models that create sound effects for video aren’t new either. Microsoft has a project that generates talking and singing videos from images, while platforms like Pika and GenreX have trained models that analyze a video and guess which music or effects suit a given scene.
DeepMind claims its V2A tech stands apart because it can interpret the raw pixels of a video and automatically sync generated sounds with the footage, without manual intervention from users or descriptions of the video’s audio-visual elements.
V2A isn’t perfect, and DeepMind acknowledges this. Because the underlying model wasn’t trained on many videos with artifacts or distortions, it doesn’t produce high-quality audio for such videos. And overall, I find the generated sound rather unconvincing; my colleague Natasha Lomas described it as “an eclectic collection of stereotypical sounds,” and I can’t say I disagree.
For these reasons, and to prevent misuse, DeepMind says it will not release the tech to the general public anytime soon, if ever.
“To make sure our V2A technology can have a positive impact on the creative community, we’re gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development,” DeepMind writes. “Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing.”
DeepMind pitches its V2A technology as particularly useful to archivists and people working with historical footage. But generative AI of this kind also threatens to upend the film and television industries. Strong labor protections will be needed to keep such tools from eliminating jobs, or entire professions.