Images that Sound: Creating Beautiful Audiovisual Art with AI | by Max Hilsdorf | Aug, 2024

To answer this question, we need to understand two terms:

  1. Waveform
  2. Spectrogram

In the real world, sound is produced by vibrating objects creating acoustic waves (changes in air pressure over time). When sound is captured through a microphone or generated by a digital synthesizer, we can represent this sound wave as a waveform:

Waveform of an acoustic song. Music and image by author.

The waveform is useful for recording and playing audio, but it is typically avoided for music analysis or machine learning with audio data. Instead, a much more informative representation of the signal, the spectrogram, is used.

Mel spectrogram of an acoustic song. Music and image by author.

The spectrogram tells us which frequencies are more or less pronounced in the sound across time. However, for this article, the key thing to note is that a spectrogram is an image. And with that, we come full circle.
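To make the step from waveform to spectrogram concrete, here is a minimal Python sketch. It is not the code behind the figures in this article: it builds a synthetic 1,000 Hz tone instead of loading a song, it computes a plain linear-frequency spectrogram rather than a Mel spectrogram, and the names `sine_wave`, `spectrogram`, `frame_size`, and `hop` are made up for illustration.

```python
import cmath
import math


def sine_wave(freq_hz, sample_rate, n_samples):
    """Return a pure tone as a list of amplitude samples (a waveform)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def spectrogram(waveform, frame_size, hop):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    # A Hann window fades each frame in and out to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    rows = []
    for start in range(0, len(waveform) - frame_size + 1, hop):
        frame = [waveform[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        # Naive discrete Fourier transform, kept simple on purpose (slow but readable).
        for k in range(frame_size // 2 + 1):
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            magnitudes.append(abs(coeff))
        rows.append(magnitudes)
    return rows


sample_rate = 8000  # samples per second (illustrative value)
wave = sine_wave(1000, sample_rate, 512)
spec = spectrogram(wave, frame_size=64, hop=32)

# Find the most pronounced frequency in the first time frame.
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
loudest_hz = loudest_bin * sample_rate / 64
print(len(spec), "frames x", len(spec[0]), "frequency bins; loudest at", loudest_hz, "Hz")
```

Each row of `spec` is one slice of time and each column is one frequency bin, so the nested list is a grid of brightness values that can be drawn as a grayscale image. For this test tone, the brightest column sits at 1,000 Hz in every frame.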

When generating the corgi sound and image above, the AI creates a sound that, when transformed into a spectrogram, looks like a corgi.

This means that the output of this AI is both sound and image at the same time.

Even though you now understand what is meant by an image that sounds, you might still wonder how this is even possible. How does the AI know which sound would produce the desired image? After all, the waveform of the corgi sound looks nothing like a corgi.

Waveform of the corgi sound generated by “Images that Sound”. Image by author.

First, we need to understand one foundational concept: diffusion models. Diffusion models are the technology behind image models like DALL-E 3 or Midjourney. In essence, a diffusion model encodes a user prompt into a mathematical representation (an embedding), which is then used to generate the desired output image step by step from random noise.

Here’s the workflow for creating images with a diffusion model:

  1. Encode the prompt into an embedding (a bunch of numbers) using an artificial neural network.
  2. Initialize an image with white noise (Gaussian noise).
  3. Progressively denoise the image. Based on the prompt embedding, the diffusion model determines an optimal, small denoising step that brings the image closer to the prompt description. Let’s call this the denoising instruction.
  4. Repeat the denoising step until a noiseless, high-quality image is generated.

High-level inner workings of an image diffusion model. Image by author.
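The four steps above can be sketched as a toy loop in Python. This is only an illustration of the control flow, not a real diffusion model: `encode_prompt` stands in for a trained text encoder, `denoising_instruction` stands in for the trained denoising network, and the "image" is just a short list of numbers. All names and parameter values are hypothetical.

```python
import random


def encode_prompt(prompt, size):
    """Stand-in for a text encoder: map a prompt to `size` numbers in [0, 1)."""
    rng = random.Random(prompt)  # seeded by the prompt, so the result is repeatable
    return [rng.random() for _ in range(size)]


def denoising_instruction(image, embedding, step_size=0.2):
    """Stand-in for the trained network: a small step toward what the prompt describes."""
    return [step_size * (e - x) for x, e in zip(image, embedding)]


def generate(prompt, size=16, steps=50, seed=0):
    # Step 1: encode the prompt into an embedding.
    embedding = encode_prompt(prompt, size)
    # Step 2: initialize the image with Gaussian noise.
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in range(size)]
    # Steps 3 and 4: apply a small denoising step, then repeat.
    for _ in range(steps):
        instruction = denoising_instruction(image, embedding)
        image = [x + d for x, d in zip(image, instruction)]
    return image, embedding


image, target = generate("a corgi")
largest_gap = max(abs(x - t) for x, t in zip(image, target))
print("largest remaining difference from the target:", largest_gap)
```

In this toy, the embedding doubles as the target image, which a real model never has access to. A real diffusion model instead learns to predict the noise in the current image, but the loop structure (start from noise, take many small guided steps) is the same.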

To generate “images that sound”, the researchers used a clever technique: they combined two diffusion models into one. One of the diffusion models is a text-to-image model (Stable Diffusion), and the other is a text-to-spectrogram model (Auffusion). Each of these models receives its own prompt, which is encoded into an embedding and determines its own denoising instruction.

However, multiple different denoising instructions are problematic, because the model needs to decide how to denoise the image. In the paper, the authors solve this problem by averaging the denoising instructions from both prompts, effectively guiding the model to optimize for both prompts equally.

High-level inner workings of “Images that Sound”. Image by author.

On a high level, you can think of this as ensuring that the resulting image reflects both the image prompt and the audio prompt equally well. One downside is that the output will always be a mix of the two, and not every sound or image coming out of the model will look or sound great. This inherent tradeoff significantly limits the model’s output quality.
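The averaging of two denoising instructions, and the tradeoff it causes, can be illustrated with a small Python toy. This is not the authors' implementation: the two "models" here are simple stand-in functions that each pull a four-number canvas toward their own made-up target, and the names `step_toward` and `combine` are hypothetical.

```python
import random


def step_toward(canvas, target, step_size=0.2):
    """Stand-in for one model's denoising instruction: a small step toward its target."""
    return [step_size * (t - x) for x, t in zip(canvas, target)]


def combine(image_step, audio_step):
    """Average the two denoising instructions, weighting both prompts equally."""
    return [(a + b) / 2 for a, b in zip(image_step, audio_step)]


# Made-up targets: what each model alone would like the canvas to become.
image_target = [1.0, 0.0, 1.0, 0.0]
audio_target = [0.0, 0.0, 1.0, 1.0]

rng = random.Random(0)
canvas = [rng.gauss(0.0, 1.0) for _ in image_target]  # start from Gaussian noise

for _ in range(60):
    image_step = step_toward(canvas, image_target)
    audio_step = step_toward(canvas, audio_target)
    canvas = [x + s for x, s in zip(canvas, combine(image_step, audio_step))]

print([round(x, 3) for x in canvas])
```

The canvas settles at the midpoint of the two targets, roughly [0.5, 0.0, 1.0, 0.5]. Where the targets agree (the second and third values), both are satisfied. Where they disagree, the result is a compromise that matches neither perfectly, which is the tradeoff described above.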

Is AI simply Mimicking Human Intelligence?

AI is commonly defined as computer systems mimicking human intelligence (e.g. IBM, TechTarget, Coursera). This definition works well for sales forecasting, image classification, and text generation AI models. However, it comes with the inherent restriction that a computer system can only be an AI if it performs a task that humans have historically solved.

In the real world, there exists a high (likely infinite) number of problems solvable through intelligence. While human intelligence has cracked some of these problems, most remain unsolved. Among these unsolved problems, some are known (e.g. curing cancer, quantum computing, the nature of consciousness) and others are unknown. If your goal is to tackle these unsolved problems, mimicking human intelligence does not appear to be an optimal strategy.

Image by the author.

Following the definition above, a computer system that discovers a cure for cancer without mimicking human intelligence would not be considered AI. This is clearly counterintuitive and counterproductive. I don’t intend to start a debate on “the one and only definition”. Instead, I want to emphasize that AI is much more than an automation tool for human intelligence. It has the potential to solve problems that we did not even know existed.

Can Spectrogram Art be Generated with Human Intelligence?

In an article on Mixmag, Becky Buckle explores the “history of artists concealing visuals within the waveforms of their music”. One impressive example of human spectrogram art is the song “∆Mᵢ⁻¹=−α ∑ Dᵢ[η][ ∑ Fjᵢ[η−1]+Fextᵢ [η⁻¹]]” by the British musician Aphex Twin.

Screenshot of the alien face in Aphex Twin’s “∆Mᵢ⁻¹=−α ∑ Dᵢ[η][ ∑ Fjᵢ[η−1]+Fextᵢ [η⁻¹]]”. Link to the video.

Another example is the track “Look” from the album “Songs about my Cats” by the Canadian musician Venetian Snares.

Screenshot of the cat image encoded in Venetian Snares’ “Look”. Link to the video.

While both examples show that humans can encode images into waveforms, there is a clear difference from what “Images that Sound” is capable of.

How is “Images that Sound” Different from Human Spectrogram Art?

If you listen to the above examples of human spectrogram art, you will notice that they sound like noise. For an alien face, this might be a suitable musical underscore. However, listening to the cat example, there seems to be no intentional relationship between the sounds and the spectrogram image. Human composers have been able to generate waveforms that look like a certain thing when transformed into a spectrogram. However, to my knowledge, no human has been able to produce examples where the sound and image match according to predefined criteria.

“Images that Sound” can produce audio that sounds like a cat and looks like a cat. It can also produce audio that sounds like a spaceship and looks like a dolphin. It is capable of producing intentional associations between the sound and the image representation of the audio signal. In this regard, the AI exhibits non-human intelligence.

“Images that Sound” has no Use Case. That’s what Makes it Beautiful

In recent years, AI has mostly been portrayed as a productivity tool that can enhance economic outputs through automation. While most would agree that this is highly desirable to some extent, others feel threatened by this perspective on the future. After all, if AI keeps taking away work from humans, it might end up replacing the work we love doing. Hence, our lives could become more productive, but less meaningful.

“Images that Sound” contrasts with this perspective and is a prime example of beautiful AI art. This work is not driven by an economic problem but by curiosity and creativity. It is unlikely that there will ever be an economic use case for this technology, although we should never say never…

Of all the people I’ve talked to about AI, artists tend to be the most negative about it. This is backed up by a recent study from the German GEMA, showing that over 60% of musicians “believe that the risks of AI use outweigh its potential opportunities” and that only 11% “believe that the opportunities outweigh the risks”.

More works similar to this paper could help artists understand that AI has the potential to bring more beautiful art into the world and that this does not have to happen at the cost of human creators.

“Images that Sound” is not the first use of AI that has the potential to create beautiful art. In this section, I want to showcase a few other approaches that will hopefully inspire you and make you think differently about AI.

Restoring Art

A mosaic of the Battle of Amazons, reconstructed with AI. Taken from this paper.

AI helps restore art by repairing damaged pieces precisely, ensuring that historic works last longer. This combination of technology and creativity keeps our artistic heritage alive for future generations. Read more.

Bringing Paintings to Life

A YouTube video of Mona Lisa rapping Paparazzi (AI-generated).

AI can animate photos to create realistic videos with natural movements and lip-syncing. This can make historical figures or artworks like the Mona Lisa move and speak (or rap). While this technology is certainly dangerous in the context of deepfakes, applied to historical portraits, it can create funny and/or meaningful art. Read more.

Turning Mono Recordings into Stereo

AI has the potential to enhance old recordings by transforming their mono mix into a stereo mix. There are classical algorithmic approaches for this, but AI promises to make artificial stereo mixes sound more and more realistic. Read more here and here.