May 28, 2018
Nippon Telegraph and Telephone Corporation (NTT; head office: Chiyoda-ku, Tokyo; president & CEO: Hiroo Unoura) has developed SpeakerBeam*1, a new technology for extracting the voice of a target speaker (the speaker you want to listen to) from a recording of several people speaking at the same time, based on the characteristics of the target speaker's voice.
Human beings can focus on speech spoken by a target speaker even in the presence of noise or other people speaking in the background. This ability is called selective auditory attention or selective hearing*2 (see figure 1). Selective hearing exploits information about both the characteristics of the target speaker's voice and his/her position. Previously proposed computational systems that mimic human selective hearing relied on information about the target speaker's position*3. Unlike these approaches, SpeakerBeam is the first successful attempt to realize computational selective hearing based on the characteristics of the target speaker's voice. This was made possible by a novel deep learning technology developed at NTT*4.
SpeakerBeam enables the extraction of the voice of a target speaker without knowing his/her position, which opens new possibilities for the speech recognition of multi-party conversations or speech interfaces for assistant devices.
[Video]https://www.youtube.com/watch?v=7FSHgKip6vI
Recently, automatic speech recognition technology has progressed greatly, enabling the rapid adoption of speech interfaces in smartphones and smart speakers. However, the performance of current speech interfaces deteriorates severely when several people speak at the same time, which often happens in everyday life, e.g., when we take part in a discussion or when a television is on in the background. This problem arises mainly from the inability of current speech recognition systems to focus solely on the voice of the target speaker (selective hearing) when several people are speaking.
NTT Communication Science Laboratories has developed*5 SpeakerBeam to make it possible to extract the voice of a target speaker from a recording containing a mixture of several people speaking simultaneously. SpeakerBeam distinguishes the target speaker from the other speakers by using another recording (about 10 seconds long) of the target speaker's voice as auxiliary information. This auxiliary information is employed to compute the characteristics of the voice of the target speaker. SpeakerBeam then extracts the speech from the mixture that matches these voice characteristics. SpeakerBeam can extract the target speaker's voice regardless of the sounds contained in the mixtures, which may include other speakers, music and background noise. It can work using a single microphone but the use of more microphones further improves the quality of the extracted speech.
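The processing flow described above can be sketched in code. The following is a deliberately simplified illustration of the idea (the averaging and the similarity-based mask are stand-ins for trained neural networks, not NTT's actual model): a short enrollment recording of the target speaker is summarized into a voice-characteristic vector, which then conditions a soft mask applied to the mixture.

```python
import numpy as np

def speaker_embedding(enrollment_feats: np.ndarray) -> np.ndarray:
    """Summarize enrollment features (frames x dims) into one vector
    by averaging over time (a stand-in for a learned module that
    computes the target speaker's voice characteristics)."""
    return enrollment_feats.mean(axis=0)

def extract_target(mixture_spec: np.ndarray, embedding: np.ndarray) -> np.ndarray:
    """Apply a soft mask to the mixture spectrogram (frames x freq bins).
    Here the mask is a toy similarity between each frame and the
    embedding; a real system would use a trained neural network."""
    scores = mixture_spec @ embedding         # frame-wise similarity to the target
    mask = 1.0 / (1.0 + np.exp(-scores))      # sigmoid -> values in (0, 1)
    return mixture_spec * mask[:, None]       # attenuate frames unlike the target

rng = np.random.default_rng(0)
enrollment = rng.random((100, 40))   # features of a ~10 s enrollment recording
mixture = rng.random((200, 40))      # spectrogram of the overlapped speech
emb = speaker_embedding(enrollment)
extracted = extract_target(mixture, emb)
print(extracted.shape)               # same shape as the mixture
```

The key point the sketch illustrates is that the only speaker-specific input is the enrollment recording: no position information or speaker count is needed.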
In experiments, we used simulated mixtures of several speakers to prove that SpeakerBeam could successfully extract the voice of a target speaker (see left side of figure 3) and improve speech recognition accuracy by 60% (see right side of figure 3).
There are various elements that characterize a person's voice such as its pitch, timbre, rhythm, intonation, and accent. Human beings can use these characteristics to focus on the voice of a specific speaker, even when it is mixed with other sounds, and ignore all other sounds. By listening once to someone's voice, human beings are able to recognize the characteristics immediately and listen only to that voice. With SpeakerBeam we have realized a system that replicates this selective hearing ability.
Characterizing human voices may require the combination of the various elements described above. However, it is unclear as to which of these elements are important as regards realizing selective hearing. Instead of manually engineering features characterizing the target speaker's voice, we have developed a purely data-driven approach based on deep learning to automatically learn these features. We proposed combining a module to compute the voice characteristics using the auxiliary information and a module to extract the voice of the target speaker from the speech mixture. These two modules are trained jointly to optimize target speech extraction. As a result, we can obtain the voice characteristics of the target speaker from relatively short utterances that are optimized for extracting the target speaker's voice.
Speech source separation is another intensively researched approach for dealing with speech mixtures. Source separation decomposes a mixture of speech signals into its original components, using characteristics of the sound mixture such as the direction of arrival of the sounds to distinguish and separate them. Speech separation can recover all the sounds in the mixture, but to do so it must know, or be able to estimate, the number of speakers in the mixture, the positions of all the speakers, and the background noise statistics. These conditions often change dynamically, making their estimation difficult and limiting the practical use of separation methods. Moreover, to realize selective hearing, we still need to tell the separation system which of the separated signals corresponds to the target speaker.
In contrast, SpeakerBeam avoids the need to estimate the number of speakers, the position, or the noise statistics, by focusing on the simpler task of solely extracting speech that matches the voice characteristics of the target speaker.
We have developed a novel neural network architecture with which to realize SpeakerBeam. This neural network consists of a main network, which extracts the target speaker's voice from the speech mixture, and an auxiliary network, which computes the characteristics of the target speaker's voice from the auxiliary recording.
These two networks are connected to each other and trained jointly to optimize the speech extraction performance. By training the network with a large amount of training data covering various speakers and background noise conditions, SpeakerBeam can learn to realize selective hearing even for speakers that were not included in the training data.
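A toy forward pass can make the coupling between the two networks concrete. In this sketch (the layer sizes and the multiplicative coupling are illustrative assumptions, not the published SpeakerBeam architecture), the auxiliary network turns the enrollment recording into per-unit adaptation weights that rescale a hidden layer of the main network; during training, the extraction loss would propagate gradients through both networks jointly.

```python
import numpy as np

rng = np.random.default_rng(42)
FEAT, HID = 40, 64                                # illustrative layer sizes
W_aux = rng.standard_normal((FEAT, HID)) * 0.1   # auxiliary network weights
W_in = rng.standard_normal((FEAT, HID)) * 0.1    # main network, input layer
W_out = rng.standard_normal((HID, FEAT)) * 0.1   # main network, mask layer

def auxiliary_net(enrollment: np.ndarray) -> np.ndarray:
    """Map enrollment frames to one adaptation-weight vector per hidden unit."""
    h = np.tanh(enrollment @ W_aux)   # frame-wise hidden activations
    return h.mean(axis=0)             # summarize over time -> shape (HID,)

def main_net(mixture: np.ndarray, adapt: np.ndarray) -> np.ndarray:
    """Estimate a target-speaker mask, conditioned on the adaptation weights."""
    h = np.tanh(mixture @ W_in) * adapt           # speaker-dependent rescaling
    mask = 1.0 / (1.0 + np.exp(-(h @ W_out)))     # sigmoid mask in (0, 1)
    return mixture * mask                         # extracted spectrogram

enrollment = rng.random((100, FEAT))  # features of the enrollment recording
mixture = rng.random((200, FEAT))     # spectrogram of the overlapped speech
extracted = main_net(mixture, auxiliary_net(enrollment))
print(extracted.shape)                # same shape as the mixture
```

Because the adaptation weights are produced by a network rather than engineered by hand, the features that characterize a speaker's voice are learned automatically from data, as described above.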
We intend to pursue research to improve speech extraction performance when speakers with similar voices converse. Moreover, we plan to investigate using SpeakerBeam to help realize AI systems that understand people's discussions in everyday life environments.
Fig. 1 Human selective hearing ability
Fig. 2 SpeakerBeam's selective hearing
Fig. 3 Evaluation of speech extraction performance and automatic speech recognition with SpeakerBeam
Fig. 4 Novel deep learning architecture developed for SpeakerBeam
Contact information
Nippon Telegraph and Telephone Corporation
Science and Core Technology Laboratory Group,
Public Relations
Email science_coretech-pr-ml@hco.ntt.co.jp
Information is current as of the date of issue of the individual press release.
Please be advised that information may be outdated after that point.