Creating deep learning solutions for audio signal processing requires access to large, high-quality datasets. For training sound separation models, recording audio on the actual device has the benefit of capturing the exact acoustic properties to be included in the model training. However, this recording process is time consuming and difficult given the need to use an actual device across a representative number of realistic environments. In contrast, using simulated data (e.g., from a room simulator) is fast and low cost, but may not capture the fine acoustic properties of the device.
In “Quantifying the Effect of Simulator-Based Data Augmentation for Speech Recognition on Augmented Reality Glasses”, presented at IEEE ICASSP 2024, we demonstrate that training on a hybrid set, composed of a small amount of real recordings made with a microphone-equipped head-worn display prototype and large amounts of simulated data, improves model performance. This hybrid approach makes it possible to:
- capture the acoustic properties of the specific hardware (not available in the simulated data), and
- easily and quickly generate large amounts of simulated data for different room sizes and configurations of acoustic scenes, which would be extremely time consuming to record with the actual device.
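To make the hybrid-set idea concrete, here is a minimal sketch of composing a training set from a small pool of real recordings and a large pool of simulated ones. The function name, the `real_fraction` parameter, and the sampling-with-replacement strategy are illustrative assumptions, not the paper's actual data pipeline.

```python
import random

def build_hybrid_set(real_examples, simulated_examples,
                     real_fraction=0.1, total=10000, seed=0):
    """Compose a training set mixing a small share of real recordings
    with a large share of simulated data (illustrative sketch only)."""
    rng = random.Random(seed)
    n_real = int(total * real_fraction)
    n_sim = total - n_real
    # Sample with replacement so a small real pool can still fill its quota.
    batch = [rng.choice(real_examples) for _ in range(n_real)]
    batch += [rng.choice(simulated_examples) for _ in range(n_sim)]
    rng.shuffle(batch)
    return batch
```

The ratio of real to simulated data is the key knob here: the paper's central result is that only a small amount of real data is needed once large-scale simulation covers the variety of rooms and scenes.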
Furthermore, we show that modeling the directivity of the prototype’s microphones, and thereby increasing the realism of the simulations, allows us to further reduce the amount of real data needed in the training set.
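For readers unfamiliar with microphone directivity: a directional microphone attenuates sound depending on its angle of arrival, and a simulator that ignores this treats every microphone as omnidirectional. As a generic illustration (not the paper's specific directivity model), a first-order directivity pattern can be written as a gain that depends on the angle between the microphone axis and the source:

```python
import math

def directivity_gain(theta_rad, alpha=0.5):
    """First-order directivity pattern g(theta) = alpha + (1 - alpha) * cos(theta).

    alpha = 1.0 -> omnidirectional (gain 1 in all directions)
    alpha = 0.5 -> cardioid (unity on-axis, null at the rear)
    alpha = 0.0 -> figure-eight (bidirectional)
    """
    return alpha + (1.0 - alpha) * math.cos(theta_rad)
```

In a room simulator, a gain like this would weight each simulated sound path by its arrival angle at the microphone, so the simulated signals better match what the device's actual microphones would pick up.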