NeurIPS 2021

Soundify: Matching Sound Effects to Video

David Chuan-En Lin¹, Anastasis Germanidis², Cristóbal Valenzuela², Yining Shi², Nikolas Martelaro¹

¹Carnegie Mellon University, ²Runway

📄 Paper · 📊 Full Talk · 📝 Citation (BibTeX)


Soundify matches sound effects (bold) and ambients (italics) by localizing "sound emitters".


In the art of video editing, sound is really half the story. A skilled video editor overlays sounds, such as effects and ambients, over footage to add character to an object or to immerse the viewer in a space. However, formative interviews with professional video editors revealed that this process can be extremely tedious and time-consuming. We introduce Soundify, a system that matches sound effects to video. By leveraging labeled, studio-quality sound effects libraries and extending CLIP, a neural network with impressive zero-shot image classification capabilities, into a "zero-shot detector", we produce high-quality results without resource-intensive correspondence learning or audio generation. We encourage you to have a look at, or better yet, have a listen to, the results at
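To make the general idea of CLIP-style zero-shot matching concrete, here is a minimal sketch. It is not Soundify's implementation. It assumes that frame embeddings (or per-patch embeddings) and text embeddings of the sound-effect labels have already been computed by a shared image-text encoder such as CLIP. The function names, the 0.01 softmax temperature, and the toy 3-dimensional vectors are all hypothetical; real CLIP embeddings have hundreds of dimensions. The patch-wise `localize` function is only a simple stand-in for turning a zero-shot classifier into a "detector". It is not the localization method described in the paper.

```python
from __future__ import annotations

import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def softmax(scores: Sequence[float], temperature: float = 0.01) -> list[float]:
    """Temperature-scaled softmax (CLIP-style); the temperature value is an assumption."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(
    image_emb: Sequence[float],
    label_embs: dict[str, Sequence[float]],
    temperature: float = 0.01,
) -> list[tuple[str, float]]:
    """Zero-shot classification: score every sound-effect label against one frame embedding."""
    names = list(label_embs)
    sims = [cosine(image_emb, label_embs[n]) for n in names]
    probs = softmax(sims, temperature)
    # Highest-probability label first
    return sorted(zip(names, probs), key=lambda p: p[1], reverse=True)


def localize(
    patch_embs: list[list[Sequence[float]]], label_emb: Sequence[float]
) -> tuple[int, int]:
    """Crude 'detector' (hypothetical): (row, col) of the patch most similar to the label."""
    best, best_rc = -2.0, (0, 0)  # cosine similarity is always >= -1
    for r, row in enumerate(patch_embs):
        for c, emb in enumerate(row):
            s = cosine(emb, label_emb)
            if s > best:
                best, best_rc = s, (r, c)
    return best_rc


# Toy usage with made-up 3-D "embeddings" for three hypothetical library labels
labels = {
    "car engine": [1.0, 0.0, 0.0],
    "birds chirping": [0.0, 1.0, 0.0],
    "ocean waves": [0.0, 0.0, 1.0],
}
frame = [0.9, 0.3, 0.1]
ranking = rank_labels(frame, labels)
patches = [
    [[0.0, 1.0, 0.0], [0.2, 0.9, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
]
engine_cell = localize(patches, labels["car engine"])
```

Because the labels come from an existing, labeled library, the top-ranked label can be used to look up a ready-made recording, so no audio has to be generated. The patch with the highest similarity gives a rough on-screen position for the "sound emitter".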

Example Visualization

Example Results