October 6, 2023

An Escape from the Noise

At social events like cocktail parties, humans have a unique ability to tune into a single speaker even amidst a lot of extraneous noise--a phenomenon known as selective hearing. Thanks to NTT's research, this capability is no longer limited to humans, as advances in AI are making it possible to replicate the skill electronically. The company's recently introduced 'SpeakerBeam' technology allows deep learning algorithms to single out an individual speaker based on a mere snippet of their voice.

Speech recognition has come a long way and it is now commonplace in devices such as smartphones and smart speakers. However, the persistent weakness of these technologies can be observed in environments with multiple concurrent speakers. While humans are able to listen carefully for particular voice characteristics and spatial cues to single out a desired speaker, conventional technologies struggle to perform in the same way, particularly if the speaker's location changes or is unknown.

The innovative approach of SpeakerBeam addresses these challenges head-on. Instead of relying on the spatial location of the speaker, SpeakerBeam uses the unique voice characteristics of the target speaker to separate their speech from the audio mix. To achieve this, the system simply requires a short 'adaptation utterance'--a recording of the target speaker's voice lasting around 10 seconds. The target speaker's location in the room does not affect the technology's accuracy, which paves the way for more dynamic speech recognition in multi-speaker scenarios.

The ingenuity of SpeakerBeam lies in its dual neural network architecture. The main network accepts the mixed speech input, processing it to output the voice of the target speaker. Central to this network is an adaptive layer, which tweaks its parameters based on the target speaker's voice traits provided by the auxiliary network. At the same time, the auxiliary network processes the adaptation utterance to determine the distinct characteristics of the target speaker's voice.
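To make the two-network idea concrete, here is a minimal sketch in Python with NumPy. It is not NTT's implementation: the layer sizes, the random untrained weights, the class names, and the choice to adapt by scaling hidden units with a speaker vector are all assumptions made for illustration. The auxiliary network summarizes the adaptation utterance into one vector, and the main network uses that vector in its adaptive layer to produce a mask over the mixed input.

```python
import numpy as np

N_FREQ = 8   # frequency bins per frame (toy size, assumed)
N_HID = 16   # hidden units in the main network (assumed)


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class AuxiliaryNetwork:
    """Hypothetical auxiliary network: adaptation utterance -> speaker vector."""

    def __init__(self, rng):
        self.w = rng.normal(scale=0.1, size=(N_FREQ, N_HID))

    def __call__(self, adaptation_frames):
        # Average over time so the vector describes the voice, not the words.
        return relu(adaptation_frames @ self.w).mean(axis=0)  # shape (N_HID,)


class MainNetwork:
    """Hypothetical main network: mixture + speaker vector -> target estimate."""

    def __init__(self, rng):
        self.w_in = rng.normal(scale=0.1, size=(N_FREQ, N_HID))
        self.w_out = rng.normal(scale=0.1, size=(N_HID, N_FREQ))

    def __call__(self, mixture_frames, speaker_vector):
        hidden = relu(mixture_frames @ self.w_in)   # shape (T, N_HID)
        # Adaptive layer (assumed form): the speaker vector rescales each unit.
        adapted = hidden * speaker_vector
        mask = sigmoid(adapted @ self.w_out)        # values between 0 and 1
        return mask * mixture_frames                # masked mixture


rng = np.random.default_rng(0)
aux = AuxiliaryNetwork(rng)
main = MainNetwork(rng)

# Stand-ins for magnitude spectrograms (frames x frequency bins).
adaptation = np.abs(rng.normal(size=(100, N_FREQ)))  # target speaker only
mixture = np.abs(rng.normal(size=(50, N_FREQ)))      # several speakers mixed

speaker_vector = aux(adaptation)
estimate = main(mixture, speaker_vector)
print(estimate.shape)  # (50, 8)
```

With untrained weights the output is meaningless as audio; the sketch only shows how data flows between the two networks. Note that nothing in it depends on where the speaker is located, only on the adaptation recording.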

The two networks are trained jointly to optimize their combined performance. This joint training lets the system learn for itself which voice features best characterize the target speaker, removing the need for manual feature engineering. Notably, SpeakerBeam is not confined to voices in its training data, as its broad training set enables it to handle speakers it did not encounter during training.

Through rigorous pre-commercialization testing, SpeakerBeam has already demonstrated strong results. In experiments with artificial two-speaker mixtures, it delivered a 60% improvement in speech recognition when paired with an eight-microphone array, compared with setups without it. Beyond speech recognition, SpeakerBeam can also enhance audio quality, as it proved in real-world settings with background music.

SpeakerBeam has huge potential. Its ability to focus on a target speaker regardless of their position or the amount of background noise makes it a promising tool for multi-party conversation recognition, smart speakers, voice recorders, and even hearing aids. Nevertheless, work remains before it can be perfected. In scenarios where two similar-sounding speakers talk simultaneously, performance can dip. To overcome this, NTT will conduct further research into refining speaker characteristics and integrating spatial cues, such as direction-of-arrival features.

NTT--Innovating the Future of Sound

Daniel O'Connor joined the NTT Group in 1999 when he began work as the Public Relations Manager of NTT Europe. While in London, he liaised with the local press, created the company's intranet site, wrote technical copy for industry magazines and managed exhibition stands from initial design to finished displays.

Later seconded to the headquarters of NTT Communications in Tokyo, he contributed to the company's first-ever global telecoms award wins and to the digitalisation of internal company information exchange.

Since 2015, Daniel has created content for the Group's Global Leadership Institute and the One NTT Network, and he is currently working with NTT R&D teams to grow public understanding of the cutting-edge research undertaken by the NTT Group.