How LLMs are learning to differentiate spatial sounds
Humans have unique sensory abilities, among them binaural hearing: we can identify a sound's type, the direction it is coming from and how far away it is, and we can differentiate multiple sources of sound all occurring at once.
While large language models (LLMs) are impressive in their ability to perform audio question answering and speech recognition, translation and synthesis, they have yet to handle such “in-the-wild” spatial audio input.
A group of researchers is finally starting to crack that code, introducing BAT, which they describe as the first spatial audio-based LLM that can reason about sounds in a 3-D environment.
The model shows impressive precision in classifying audio types (such as laughter, a heartbeat or splashing water), sound direction (right, left, below) and sound distance (anywhere from 1 to 10 feet). It also demonstrates strong spatial reasoning in scenarios where two different sounds overlap.
“The integration of spatial audio into LLMs represents a significant step towards truly multimodal AI systems,” researchers write.
The complexities of spatial audio
Spatial audio — sometimes referred to as ‘virtual surround sound’ — creates the illusion of sound sources in a 3-D space. It is used in applications including virtual reality (VR) and advanced theater systems (as well as other emerging areas, such as the metaverse).
But spatial audio is challenging for AI and machine learning (ML), as intelligent agents in 3-D spaces struggle to localize and interpret sound sources. Scientists have attempted to mitigate this with the development of acoustic simulation techniques and algorithms incorporating spatial audio information (such as YouTube-360 and STARSS23).
However, BAT's developers point out that these resources are often inconsistent in quality and lack "crucial ground truth labels" such as source distance and direction. Similarly, sound event localization and detection (SELD), which fuses sound source localization with sound event detection (SED), often focuses on "shallow spatial audio perception," the researchers note.
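Localization from two-channel audio rests on subtle physical cues that a model must learn implicitly. As a minimal illustration (not from the BAT paper), here is how one classic binaural cue, the interaural time difference (ITD), can be turned into a coarse left-right direction estimate; the constants, signals and simplified far-field geometry are all assumptions for the sketch:

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # approximate human head radius, meters (assumed)
SAMPLE_RATE = 16_000     # Hz (assumed)

def estimate_azimuth(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate a coarse azimuth in degrees from the lag that best
    aligns the two ear signals via cross-correlation."""
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)   # samples; > 0 means left lags
    itd = lag / SAMPLE_RATE                          # seconds
    # Simplified far-field model: ITD ~ (2r / c) * sin(theta).
    sin_theta = np.clip(itd * SPEED_OF_SOUND / (2 * HEAD_RADIUS), -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))   # > 0 means source to the right

# Toy input: a click that reaches the left ear 5 samples after the right.
click = np.zeros(1024)
click[100] = 1.0
left = np.roll(click, 5)
print(f"estimated azimuth: {estimate_azimuth(left, click):+.1f} degrees")
```

The toy example prints roughly +38 degrees, placing the source toward the listener's right; real scenes add reverberation and overlapping sources, which is exactly what makes the problem hard for machines.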
Other applications in the audio domain include AudioGPT, which integrates ChatGPT for a wide range of audio and speech applications; LTU, which trains models to reason about and answer questions on sounds in a clip; and Qwen-Audio, which enables universal audio understanding.
“However, despite their impressive performance in the audio domain, none of these models have the capability to perceive and reason about spatial audio that is situated in diverse, reverberant, and complex 3-D environments,” researchers assert.
Questions on sound type, direction, distance and spatial reasoning
BAT appears to change this, demonstrating strong spatial reasoning over mixed sounds and sources and achieving a nearly 77% accuracy rate.
Its underlying spatial audio encoder, meanwhile, achieved a Mean Average Precision of more than 50% in identifying sound type and a Mean Angular Error of nearly 18 degrees for sound direction; for distance estimation, it posted a Distance Error Rate of 32.54%, measured against a tolerance of 1.64 feet (0.5 meters) around the actual location.
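The article does not spell out how these metrics are defined, but a minimal sketch under common conventions (directions represented as 3-D unit vectors, and a distance estimate counted as an error when it misses the truth by more than 0.5 meters, about 1.64 feet) might look like this:

```python
import numpy as np

def mean_angular_error(pred_dirs: np.ndarray, true_dirs: np.ndarray) -> float:
    """Mean angle in degrees between predicted and true unit direction
    vectors, each of shape (N, 3)."""
    cos = np.clip(np.sum(pred_dirs * true_dirs, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def distance_error_rate(pred_m: np.ndarray, true_m: np.ndarray,
                        tolerance_m: float = 0.5) -> float:
    """Fraction of estimates missing the true distance by more than the
    tolerance (0.5 m is about 1.64 feet). Lower is better."""
    return float(np.mean(np.abs(pred_m - true_m) > tolerance_m))

# A prediction pointing 90 degrees away from the truth:
print(mean_angular_error(np.array([[1.0, 0.0, 0.0]]),
                         np.array([[0.0, 1.0, 0.0]])))  # 90.0
```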
The researchers, from the University of Texas and the Department of Computer Science and Engineering at Shanghai Jiao Tong University in China, began by developing a Spatial Audio Spectrogram Transformer (SPATIAL-AST), which is capable of sound event detection, spatial localization and distance perception; and SPATIALSOUNDQA, a collection of spatial question-answering tasks.
The resulting model, BAT, integrates SPATIAL-AST with the LLaMA-2 LLM.
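The article does not detail the integration, but a common recipe for coupling an audio encoder to a decoder-only LLM is to project the encoder's embeddings into the LLM's token-embedding space and feed them in ahead of the text prompt. A hypothetical sketch of that wiring, with illustrative module names and dimensions:

```python
import torch
import torch.nn as nn

class SpatialAudioLLM(nn.Module):
    """Hypothetical wrapper: audio embeddings are projected into the
    LLM's embedding space and prepended to the question text."""

    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 audio_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.audio_encoder = audio_encoder            # e.g. a SPATIAL-AST-style transformer
        self.project = nn.Linear(audio_dim, llm_dim)  # glue aligning the two spaces
        self.llm = llm                                # e.g. a LLaMA-2-style decoder

    def forward(self, binaural_audio: torch.Tensor,
                prompt_embeds: torch.Tensor) -> torch.Tensor:
        audio_tokens = self.audio_encoder(binaural_audio)       # (B, T_a, audio_dim)
        audio_tokens = self.project(audio_tokens)               # (B, T_a, llm_dim)
        inputs = torch.cat([audio_tokens, prompt_embeds], dim=1)
        # Assumes the decoder accepts precomputed embeddings, as
        # Hugging Face-style models do via `inputs_embeds`.
        return self.llm(inputs_embeds=inputs)
```

In setups like this, the projection layer is often the main new trainable glue, with the encoder and LLM initialized from pretrained weights.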
The model was asked questions in categories including sound type, what direction the sound was coming from and how far away it was. Lastly, it was tasked with spatial reasoning, in which two concurrent sounds came from entirely different distances and directions.
Because previous spatial audio datasets are often limited to music, speech and basic domestic sounds, the researchers curated a binaural set spanning 355 audio event labels using AudioSet and SoundSpaces. For their environmental meshes, they relied on Matterport3D, a large-scale RGB-D dataset that includes renderings of 90 complete buildings, each averaging 24.5 rooms across roughly two-and-a-half floors and 5,550 square feet.
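In broad strokes, rendering a dry mono clip into binaural audio amounts to convolving it with a two-channel room impulse response simulated for a particular source and listener placement. The sketch below uses a random placeholder impulse response; in the pipeline described above, it would come from SoundSpaces simulations over the Matterport3D meshes:

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
sample_rate = 16_000

mono_clip = rng.standard_normal(2 * sample_rate)    # stand-in for a mono AudioSet clip
brir = rng.standard_normal((2, sample_rate // 4))   # stand-in binaural room impulse response

# Convolving the dry clip with each ear's impulse response "places" the
# sound in the simulated room, baking in direction, distance and reverb.
binaural = np.stack([fftconvolve(mono_clip, brir[0]),
                     fftconvolve(mono_clip, brir[1])])  # shape (2, num_samples)
```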
Questions on sound type
- Q: What sound events can you detect in the recording?
- A: A baby’s laughter
- Q: What are the distinct sounds present in this audio clip?
- A: Heartbeat
- Q: Identify the sound events in the audio clip coming from the right, front, below, approximately 9 feet away.
- A: Splashing; speech
- Q: What sound events can you detect in the audio recording emanating from the left, behind, above roughly a foot-and-a-half away?
- A: Music; musical instrument; steel pan
Questions on direction and distance
- Q: In which direction and how far away is the source of the heart sounds?
- A: Left, behind, below; 3 feet away
- Q: Where is the sound of the music coming from?
- A: Left, behind, below; 10 feet away
Questions on spatial reasoning
- Q: Is the wheezing sound closer than the sound from bird flight/flapping wings?
- A: No
- Q: Is the source of both the explosion sounds and speech sounds on your left side?
- A: Yes
- Q: Does the sound of an electric shaver occur behind the sound of the waterfall?
- A: Yes
- Q: Can you estimate the distance from the sound of the speech to the sound of the dog?
- A: 1.64 feet
- Q: What is the sound on the above side of the sound of the vibration?
- A: Croak; frog
- Q: Could you determine whether the singing’s sound is to the left or right of the steam’s sound?
- A: Left
“This task demands both perception and complex reasoning,” researchers write of the latter. “The model must implicitly separate the sound sources based on their unique classes, spatially localize each source and then analyze the relationship between the sources in the context of the question.”
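Once per-source estimates exist, the final comparison is simple; the hard part is producing those estimates from a single mixed recording. A toy illustration with a hypothetical data structure, mirroring the wheezing-versus-flapping question above:

```python
from dataclasses import dataclass

@dataclass
class SourceEstimate:
    label: str
    azimuth_deg: float   # negative = left of the listener, positive = right
    distance_ft: float

def is_closer(a: SourceEstimate, b: SourceEstimate) -> bool:
    return a.distance_ft < b.distance_ft

wheezing = SourceEstimate("wheezing", azimuth_deg=-40.0, distance_ft=6.0)
flapping = SourceEstimate("bird flight/flapping wings", azimuth_deg=25.0, distance_ft=3.0)
print(is_closer(wheezing, flapping))  # False -> the model answers "No"
```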
Spatial audio capabilities open up a multitude of possibilities
Developing LLMs for spatial audio opens up a multitude of possibilities when it comes to virtual reality, gaming, audio engineering and more.
“This can lead to more immersive and realistic experiences in these domains,” researchers write.
The ability to interpret and reason about spatial sounds can also enhance embodied AI systems such as robots and autonomous vehicles. And further development of ambisonics, a full-sphere format that captures sources above and below the listener, could provide an even more immersive and realistic experience.
The researchers conclude: “We are confident that BAT will significantly contribute to the development of spatial audio perception and reasoning, as well as multimodal LLMs.”