Contrastive neural audio separation

Audio spatial separation, i.e., isolating sounds arriving from different angles within a mix, is a fundamental problem in audio processing. The task is to leverage the spatial diversity of audio captured by the multiple microphones commonly available on portable devices, such as phones, tablets, smart speakers, and laptops, to separate audio sources in designated angular regions from the remaining interference.

Traditional linear beamformer approaches derive their output as a linear combination of the multi-channel audio inputs in the time/frequency domain. The linear weights are typically derived assuming a fixed microphone geometry and various estimates of speech and noise statistics. In most neural beamformer approaches, the weights can be adaptive and learned from data. While these approaches achieve good spatial separation results with large microphone arrays, they inherently face a quality bottleneck due to the limited capacity of linear processing. This limitation becomes particularly pronounced on devices with only two or three microphones, limiting their spatial separation performance.
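To make the "linear combination" concrete, here is a minimal sketch of a classical delay-and-sum beamformer in the time/frequency domain. It is an illustration of linear processing in general, not the method from the papers discussed here, and the microphone geometry, STFT parameters, and function names are our own assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def steering_weights(mic_positions_m, look_angle_rad, freqs_hz, c=343.0):
    """Per-frequency weights that phase-align the mics toward `look_angle_rad`."""
    direction = np.array([np.cos(look_angle_rad), np.sin(look_angle_rad)])
    # Relative arrival time at each mic for a far-field source in that direction.
    delays_s = -(mic_positions_m @ direction) / c                    # (mics,)
    # Compensate each mic's delay, then average the aligned channels.
    return np.exp(2j * np.pi * freqs_hz[:, None] * delays_s[None, :]) / len(delays_s)

def linear_beamform(audio, mic_positions_m, look_angle_rad, fs=16000):
    """audio: (num_mics, num_samples) -> single-channel beamformed output."""
    freqs, _, spec = stft(audio, fs=fs, nperseg=512)                 # (mics, freqs, frames)
    w = steering_weights(mic_positions_m, look_angle_rad, freqs)     # (freqs, mics)
    out_spec = np.einsum('fm,mft->ft', w, spec)                      # fixed linear combination per bin
    _, out = istft(out_spec, fs=fs, nperseg=512)
    return out
```

The weights here come purely from geometry; adaptive and neural beamformers replace them with weights estimated from signal statistics or learned from data, but the output is still a per-bin linear combination of the input channels, which is the bottleneck noted above.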

People can localize and separate sound from different spatial directions to focus on an audio signal of interest using just the two channel inputs from our ears. Our brain processes these signals by exploiting the subtle differences in the relative timing and intensity of different components of sounds, among other cues. This inspires us to ask: can we train a machine learning (ML) model to exploit the relative delays and gains of different signal components across multiple microphones to optimize the quality of spatial separation?
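The timing and intensity cues mentioned above are commonly formalized as the interaural time difference (ITD) and interaural level difference (ILD). Below is a minimal sketch of estimating both from a two-channel recording, using a cross-correlation peak and an energy ratio; it illustrates the cues themselves and is not part of the models described in this post. The window of plausible delays and the sign convention are illustrative assumptions.

```python
import numpy as np

def interaural_cues(left, right, fs=16000, max_delay_s=1e-3):
    """Estimate ITD (seconds) and ILD (dB) between two single-channel signals."""
    max_lag = int(max_delay_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    # Cross-correlate over a small range of physically plausible delays
    # (np.roll gives a circular shift, which is fine for short lags).
    xcorr = np.array([np.sum(left * np.roll(right, lag)) for lag in lags])
    itd_s = lags[np.argmax(xcorr)] / fs
    # Ratio of channel energies, expressed in decibels.
    ild_db = 10.0 * np.log10(np.sum(left**2) / (np.sum(right**2) + 1e-12))
    return itd_s, ild_db
```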

In “Binaural Angular Separation Network” (BASNet), presented at ICASSP 2024, and “Guided Speech Enhancement Network” (GSENet), presented at ICASSP 2023, we tackle this question by designing a data simulation pipeline, together with a training task, that challenges a model to exploit the delay and gain differences between two microphones. We demonstrate that this solution achieves up to 40 decibels (dB) of interference suppression while preserving the signal of interest when evaluated on real-life recordings.
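As a rough illustration of the kind of training example such a simulation pipeline could produce, the sketch below renders a target source and an interferer at different angles onto a hypothetical two-microphone pair, applying a per-microphone fractional delay and a small random gain difference; the mixture serves as the model input and the clean target rendering as the training label. The geometry, gain distribution, and function names are assumptions made for illustration, not the papers' exact recipe.

```python
import numpy as np

def render_two_mic(source, angle_rad, mic_spacing_m=0.1, fs=16000, c=343.0):
    """Render a mono source at `angle_rad` to two mics via a fractional delay and gain."""
    tdoa_s = mic_spacing_m * np.sin(angle_rad) / c          # inter-mic delay in seconds
    n = len(source)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum = np.fft.rfft(source)
    mic1 = np.fft.irfft(spectrum, n)                        # reference microphone
    # Circular fractional delay applied in the frequency domain (fine for illustration).
    mic2 = np.fft.irfft(spectrum * np.exp(-2j * np.pi * freqs * tdoa_s), n)
    gain2 = 10 ** (np.random.uniform(-1.0, 1.0) / 20)       # small random level difference
    return np.stack([mic1, gain2 * mic2])

def make_training_example(target, interferer, target_angle_rad, interferer_angle_rad):
    """Both sources are assumed to have equal length; returns (input mixture, label)."""
    target_two_mic = render_two_mic(target, target_angle_rad)
    mixture = target_two_mic + render_two_mic(interferer, interferer_angle_rad)
    return mixture, target_two_mic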