Google Releases Google SoundStorm Paper

Google SoundStorm

Google called it Insane Audio Generation. That’s Google SoundStorm.

SoundStorm is a machine learning model that generates audio files. It is non-autoregressive.


“Non-autoregressive approaches aim to improve the inference speed of translation models by only requiring a single forward pass to generate the output sequence instead of iteratively producing each predicted token.” (Apple Machine Learning)

Generating the output in parallel, in one forward pass or a handful of them rather than one pass per token, makes it really fast.

Blazingly fast!
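To make the difference concrete, here is a toy sketch that only counts forward passes. It is not SoundStorm's architecture: `toy_model`, `autoregressive_decode`, and `parallel_decode` are made-up names, and the "model" is a stand-in that returns fixed values. Real non-autoregressive systems may also use a few refinement passes rather than exactly one.

```python
from typing import List, Tuple

MASK = -1  # placeholder for a token that has not been generated yet


def toy_model(tokens: List[int]) -> List[int]:
    """Stand-in for a network: predicts the value 2 * i at every position i."""
    return [2 * i for i in range(len(tokens))]


def autoregressive_decode(length: int) -> Tuple[List[int], int]:
    """Produce one new token per forward pass, left to right."""
    tokens: List[int] = []
    passes = 0
    for _ in range(length):
        prediction = toy_model(tokens + [MASK])  # one pass per new token
        passes += 1
        tokens.append(prediction[-1])
    return tokens, passes


def parallel_decode(length: int) -> Tuple[List[int], int]:
    """Fill every masked position from a single forward pass."""
    tokens = toy_model([MASK] * length)
    return tokens, 1


ar_tokens, ar_passes = autoregressive_decode(8)
par_tokens, par_passes = parallel_decode(8)
print(ar_passes, par_passes)  # 8 1
```

Both functions return the same tokens here. The only thing that changes is how many times the model has to run, and that count grows with the sequence length in the autoregressive case.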

In fact, Google Research highlights that “When synthesizing dialogue segments of 30 seconds, we measured a runtime of 2 seconds on a single TPU-v4”. (source)
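A quick back-of-the-envelope check of what those numbers mean, using only the two figures from the quote. The function name `speedup_over_real_time` is just an illustrative helper.

```python
def speedup_over_real_time(audio_seconds: float, runtime_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return audio_seconds / runtime_seconds


# 30 seconds of dialogue synthesized in 2 seconds (figures from the quote above)
print(speedup_over_real_time(30.0, 2.0))  # 15.0
```

In other words, on that hardware the model produces audio roughly 15 times faster than it takes to play it back.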

Example Prompt

For example, Google researchers gave it the following dialogue prompt:

Where did you go last summer? | I went to Greece, it was amazing. | Oh, that's great. I've always wanted to go to Greece. What was your favorite part? | Uh it's hard to choose just one favorite part, but yeah I really loved the food. The seafood was especially delicious. | yeah | And the beaches were incredible. | uhhuh | We spent a lot of time swimming, uh sunbathing, and and exploring the islands. | Oh that sounds like a perfect vacation! I'm so jealous. | It was definitely a trip I'll never forget | I really hope I'll get to visit someday!
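If you wanted to prepare a prompt like this in code, a minimal sketch could look as follows. It assumes that each `|` marks a change of speaker and that exactly two speakers alternate. That is how the prompt reads, but it is an assumption rather than a documented input format, and `split_turns` and the speaker labels are made up for illustration.

```python
from typing import List, Sequence, Tuple


def split_turns(prompt: str, speakers: Sequence[str] = ("A", "B")) -> List[Tuple[str, str]]:
    """Split a '|'-separated dialogue into (speaker, text) pairs.

    Assumes speakers simply alternate at every '|' separator.
    """
    turns = [turn.strip() for turn in prompt.split("|") if turn.strip()]
    return [(speakers[i % len(speakers)], text) for i, text in enumerate(turns)]


prompt = "Where did you go last summer? | I went to Greece, it was amazing. | Oh, that's great."
for speaker, text in split_turns(prompt):
    print(f"{speaker}: {text}")
```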

The model generated impressive output from this prompt (source).

Now think about this for a moment. You could create a simple pipeline like this:

  1. Generate dialogues with ChatGPT or the OpenAI API
  2. Feed the dialogues into the SoundStorm model
  3. Upload to a podcasting platform
  4. Repeat!
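The steps above can be sketched as a small loop. Everything here is hypothetical: the three steps are passed in as plain functions because this post does not describe a public SoundStorm API or name a specific podcasting platform, and the `fake_*` stubs only exist so the sketch runs on its own. You would swap in real calls yourself.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    topics: Iterable[str],
    generate_dialogue: Callable[[str], str],    # step 1: e.g. a chat model call
    synthesize_audio: Callable[[str], bytes],   # step 2: a text-to-audio model
    upload_episode: Callable[[str, bytes], str],  # step 3: returns an episode URL
) -> List[str]:
    """Run dialogue -> audio -> upload once per topic and collect the URLs."""
    published = []
    for topic in topics:  # step 4: repeat
        dialogue = generate_dialogue(topic)
        audio = synthesize_audio(dialogue)
        published.append(upload_episode(topic, audio))
    return published


# Placeholder implementations, NOT real APIs.
def fake_generate_dialogue(topic: str) -> str:
    return f"What do you think about {topic}? | It's fascinating."


def fake_synthesize_audio(dialogue: str) -> bytes:
    return dialogue.encode("utf-8")  # pretend these bytes are audio


def fake_upload_episode(title: str, audio: bytes) -> str:
    return f"episode://{title}/{len(audio)}-bytes"


urls = run_pipeline(
    ["Greece"], fake_generate_dialogue, fake_synthesize_audio, fake_upload_episode
)
print(urls)
```

Passing the steps in as functions keeps the loop independent of any particular provider, so each stage can be replaced or tested separately.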

And most listeners probably wouldn’t even notice a difference!

But there are many more applications, such as replacing human readers of audiobooks (yet another job description that will be disrupted soon!), creating truly accessible web apps with human-sounding readers, and rapid prototyping for movies and (YouTube) videos.