June 19, 2023

Artificial Neural Networks for Recognizing Natural Sounds Show Human-Like Responses to Changes in Sound Amplitude

NTT Corporation (NTT) has discovered that artificial neural networks (NN)¹ that recognize natural sounds² show human-like responses to changes in sound amplitude. This study provides a unified understanding of the human perception of amplitude modulation (AM)³. In the future, this research is expected to be applied to various fields including the medical and welfare areas, contributing to, for instance, the development of devices with similar mechanisms to human hearing. This research was published in the American scientific journal "Journal of Neuroscience" on May 24, 2023 (U.S. Eastern Time).

Figure 1 Framework of this study. The responses of the NN trained on natural sounds were compared with human perception and brain activity, which advanced our understanding of perceptual functions and their mechanisms.

1. Background

Humans recognize a sound based on various cues. One of the important cues is the pattern of slow temporal changes in the amplitude (Figure 2). NTT Laboratories has been conducting studies using artificial NNs to understand auditory AM processing. In earlier work, AM sounds were fed to NNs trained to recognize natural sounds⁴, and their responses were examined. The responses to AM sounds were similar to those observed in animal brains, which suggests that the response to AM sounds in animal brains might result from adaptation to recognizing natural sounds. Until now, however, we had examined only the relationship between sound recognition and the response properties of single neurons in the brain. The relationship between sound recognition and perception, which results from the activities of many neurons, was not yet understood. Moreover, we had compared our NNs only with non-human animal brains. It was not clear whether the same framework could explain human perception, partly because single-neuron activity cannot be easily measured in humans. Therefore, we conducted a new study comparing NNs with human perception and demonstrated their similarities.

Figure 2 An example of sound AM. When AM is applied to a sound signal, its amplitude changes slowly; the important parameters of AM are its rate and depth.
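As a rough illustration of the two parameters named in the Figure 2 caption, the sketch below applies a sinusoidal envelope to a white-noise carrier. The sinusoidal form, the noise carrier, and the names `am_envelope` and `apply_am` are assumptions chosen for illustration; the press release does not specify the exact stimulus construction.

```python
import math
import random

def am_envelope(rate_hz, depth, t):
    """Sinusoidal envelope that swings between 1 - depth and 1 + depth."""
    return 1.0 + depth * math.sin(2.0 * math.pi * rate_hz * t)

def apply_am(carrier, rate_hz, depth, sample_rate):
    """Multiply each carrier sample by the slowly varying envelope."""
    return [am_envelope(rate_hz, depth, i / sample_rate) * x
            for i, x in enumerate(carrier)]

# Usage: an 8 Hz modulation with depth 0.5 applied to 0.5 s of white noise.
rng = random.Random(0)
sample_rate = 16000
noise = [rng.uniform(-1.0, 1.0) for _ in range(sample_rate // 2)]
am_noise = apply_am(noise, rate_hz=8.0, depth=0.5, sample_rate=sample_rate)
```

In this form, a depth of 0 leaves the carrier unchanged (a non-AM sound), and a depth of 1 lets the amplitude fall to zero once per modulation cycle.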

As a target perceptual property, we focused on the smallest AM depth that a person can detect (AM detection threshold)⁵. This has been investigated in many auditory studies, but little is known about its relationship with sound recognition, which is an essential auditory function in daily life.

2. Findings

Using artificial NNs trained for natural sound recognition, we simulated perceptual experiments and neuronal activity recording experiments. The results showed that the NNs exhibit human-like AM detection threshold patterns, even though we did not take the nature of the human or animal auditory system into account when constructing the NNs (Figure 3). This suggests that the human AM detection threshold might also be a property arising from the adaptation of the auditory system to sound recognition during its evolution and/or development. Furthermore, we found that natural AM patterns during NN training are important for the NN to obtain this property. We also found that the layers in the NN that exhibited human-like AM detection threshold patterns corresponded to the inferior colliculus, the medial geniculate body, and the auditory cortex in the brain. This result provides insight into the brain regions involved in AM detection in humans (Figure 4).

Figure 3 Similarity (left) and dissimilarity (right) of AM detection thresholds between humans and NN layers. The lines correspond to the NN trained on natural sounds, the non-trained NN, the NN trained on sounds that preserve natural AM patterns, and the NN trained on sounds with unnatural AM patterns.

Figure 4 Correspondence between NN layers (horizontal axis) and brain regions (vertical axis). The brightness of the colors indicates similarity. The layers showing human-like AM detection thresholds in Figure 3 (around layers 9-11, gray background on the horizontal axis) are similar to the inferior colliculus, the medial geniculate body, and the auditory cortex (gray background on the vertical axis).

These results provide a unified explanation of previous findings in perceptual psychology and neuroscience from the perspective of adaptation to natural sounds.

3. Key Points

- Simulation of perceptual experiments.

A multilayer (deep) artificial NN was used. To reduce possible researcher bias in the NN construction, it was trained to recognize sounds using sound waveforms as input, without manually designed features. The computer simulation of AM detection was performed using the same sound stimuli as those used in human perception experiments, which made it possible to directly compare the obtained AM detection thresholds with those of humans. When a stimulus sound is fed to the model, a time series of activity values is obtained from each NN unit. To calculate the AM detection threshold of the NN, we time-averaged the unit activities in each layer and estimated from the time-averaged activities whether the stimulus was an AM or non-AM sound (Figure 5). By performing this procedure for AM stimuli with various depths, we calculated the minimum AM depth required to discriminate whether or not the stimulus sound is an AM sound (i.e., the AM detection threshold).

Figure 5 Simulation of an AM detection experiment. A sound was fed to the NN, and from the time-averaged unit activities, logistic regression was performed to discriminate whether the input sound was an AM sound or not.
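The procedure in Figure 5 can be sketched in a few lines. This is a toy illustration under stated assumptions, not the study's implementation: the two "units" in `time_averaged_activity` are a hypothetical stand-in for a trained NN layer, the stimuli are sinusoidally modulated white noise, and the 0.75 accuracy criterion, trial counts, and all function names are invented for the example.

```python
import math
import random

def make_sound(depth, rate_hz, rng, n=2000, sample_rate=8000):
    """White-noise carrier with sinusoidal AM; depth 0 gives a non-AM sound."""
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [(1.0 + depth * math.sin(2.0 * math.pi * rate_hz * i / sample_rate + phase))
            * rng.uniform(-1.0, 1.0) for i in range(n)]

def time_averaged_activity(sound):
    """Hypothetical stand-in for one NN layer: two unit activities averaged over time."""
    n = len(sound)
    return [sum(x * x for x in sound) / n, sum(x ** 4 for x in sound) / n]

def standardize(train, test):
    """Scale both sets using the mean and standard deviation of the training set."""
    cols = list(zip(*train))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scale(train), scale(test)

def fit_logistic(xs, ys, steps=300, lr=0.5):
    """Logistic regression fitted by plain gradient descent; returns (weights, bias)."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            gb += err
            for j, xj in enumerate(x):
                gw[j] += err * xj
        w = [wi - lr * g / len(xs) for wi, g in zip(w, gw)]
        b -= lr * gb / len(xs)
    return w, b

def detection_accuracy(depth, rng, rate_hz=16.0, trials=40):
    """Train on half of the trials and report AM vs. non-AM accuracy on the rest."""
    feats, labels = [], []
    for _ in range(trials):
        for label, d in ((1, depth), (0, 0.0)):
            feats.append(time_averaged_activity(make_sound(d, rate_hz, rng)))
            labels.append(label)
    half = len(feats) // 2
    train_x, test_x = standardize(feats[:half], feats[half:])
    w, b = fit_logistic(train_x, labels[:half])
    correct = sum((b + sum(wi * xi for wi, xi in zip(w, x)) > 0) == bool(y)
                  for x, y in zip(test_x, labels[half:]))
    return correct / len(test_x)

def detection_threshold(depths, criterion=0.75, seed=0):
    """Smallest tested depth whose accuracy reaches the (assumed) criterion."""
    rng = random.Random(seed)
    for depth in sorted(depths):
        if detection_accuracy(depth, rng) >= criterion:
            return depth
    return None

print(detection_threshold([0.1, 0.2, 0.4, 0.8]))
```

Repeating `detection_threshold` for several modulation rates would give a threshold-versus-rate curve for this toy layer, which is the kind of pattern the study compares with human data.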

- Sound features necessary for a human-like AM detection threshold.

We also confirmed that the AM patterns of the natural sounds used for training are important for NNs to acquire human-like AM detection thresholds. We trained NNs to recognize sounds that retained their natural AM structure⁶ and sounds whose AM structure was destroyed⁶. The NNs trained on sounds with a natural AM structure exhibited AM detection thresholds similar to those of humans (Figure 3).

4. Future Development

Auditory studies often try to understand perceptual properties such as detection thresholds by simulating sensory information processing in a multi-stage model. In the future, we will clarify the correspondence between the processing stages in such existing models and our NN, and examine in detail which stages of auditory information processing can or cannot be explained by adaptation to sound recognition.

The present study suggests that AM patterns in natural sounds are important for NNs to acquire a human-like detection threshold. This finding may lead to a better understanding of brain development/plasticity and the mechanisms behind hearing difficulties. For example, signals reaching the brain can change due to some damage in the auditory periphery. If such a condition can be modeled, it will be possible to analyze the effects of hearing loss or its compensation by information processing in the brain. This may lead to the development of devices that more closely resemble the mechanism of human hearing for medical and welfare applications.

The framework of this research can be extended to auditory functions other than AM processing and to sensory functions more generally. For example, the process by which sound information from both ears is integrated has been studied as extensively as AM processing, but there is currently little unified understanding linking the psychophysical and neurophysiological findings regarding human binaural sound processing. The same paradigm adopted for this research can be used to explore these functions.

Research Support

This work was supported by JSPS KAKENHI Grant Number JP20H05957 (Grant-in-Aid for Transformative Research Areas (A) "Analysis and synthesis of deep SHITSUKAN information in the real world").

Paper Information

Human-like Modulation Sensitivity Emerging through Optimization to Natural Sound Recognition. Takuya Koumura, Hiroki Terashima, and Shigeto Furukawa. Journal of Neuroscience 24 May 2023, 43 (21) 3876-3894; https://doi.org/10.1523/JNEUROSCI.2002-22.2023

Glossary

¹ Artificial neural network (NN)
A type of machine learning model that often performs complicated classification tasks with high accuracy. It processes data using a structure consisting of many consecutive layers, each layer consisting of many units. A unit in a layer receives input from the units in the layer below, and after simple processing, its output is transmitted to the units in the next layer.

² Natural sound
Sounds that humans hear on a daily basis, such as animal vocalizations, the sound of rain, sneezing, the sound of a door creaking, and the sound of a car engine.

³ Amplitude modulation (AM)
A pattern of slow changes in the amplitude of a signal (amplitude envelope). Important parameters describing amplitude modulation are its rate (speed) and depth (Figure 2).

⁴ Training a machine learning model for sound recognition
Adjusting parameters of the model to increase the accuracy of sound recognition. In the case of an NN, parameters such as the number of units in a layer and the connection pattern and weights between units are adjusted.

⁵ AM detection threshold
The minimum AM depth required to distinguish whether a sound stimulus is amplitude modulated or not. Experimentally, it is measured by whether AM and non-AM sounds (sounds without slow changes in amplitude) can be discriminated. In general, the deeper the AM, the easier it is to discriminate between them.

⁶ Sounds whose AM structure is preserved or destroyed
A sound was divided into its amplitude envelope, which reflects the AM structure, and its temporal fine structure (TFS), which is a faster variation. By combining the amplitude envelope of the original sound with the TFS of a noise sound, we generated a sound whose AM structure was preserved. By combining a constant amplitude envelope with the TFS of the original sound, we generated a sound whose AM structure was destroyed. The Hilbert transform was used to divide a sound into its amplitude envelope and its TFS.
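The envelope/TFS split can be sketched as follows. This is a minimal full-band illustration using only the Python standard library; whether the study applied the split per frequency band, and what constant envelope level it used, are not stated here, so those choices (and the function names) are assumptions. In practice, a library routine such as `scipy.signal.hilbert` would normally replace the naive DFT below.

```python
import cmath
import math

def analytic_signal(x):
    """Analytic signal via a naive O(N^2) DFT (adequate for short demo signals)."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            gain = 1.0   # keep DC (and Nyquist when n is even) unchanged
        elif k < (n + 1) // 2:
            gain = 2.0   # double the positive frequencies
        else:
            gain = 0.0   # discard the negative frequencies
        spectrum[k] *= gain
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def split_envelope_tfs(x):
    """Return (amplitude envelope, temporal fine structure) of a real signal."""
    z = analytic_signal(x)
    envelope = [abs(v) for v in z]
    tfs = [math.cos(cmath.phase(v)) for v in z]
    return envelope, tfs

def am_preserved(original, noise):
    """Envelope of the original sound imposed on the TFS of a noise sound."""
    envelope, _ = split_envelope_tfs(original)
    _, noise_tfs = split_envelope_tfs(noise)
    return [e * f for e, f in zip(envelope, noise_tfs)]

def am_destroyed(original):
    """Constant (unit) envelope combined with the TFS of the original sound."""
    _, tfs = split_envelope_tfs(original)
    return tfs

# Usage: a tone with 16 cycles per frame, modulated at 2 cycles per frame, depth 0.5.
n = 128
tone = [(1.0 + 0.5 * math.sin(2 * math.pi * 2 * t / n)) * math.cos(2 * math.pi * 16 * t / n)
        for t in range(n)]
envelope, tfs = split_envelope_tfs(tone)
flat = am_destroyed(tone)
```

For this test tone, `envelope` follows the slow modulation, while `flat` keeps the carrier oscillation with the modulation removed.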

About NTT

NTT contributes to a sustainable society through the power of innovation. We are a leading global technology company providing services to consumers and businesses as a mobile operator, infrastructure, networks, applications, and consulting provider. Our offerings include digital business consulting, managed application services, workplace and cloud solutions, data center and edge computing, all supported by our deep global industry expertise. We have over $100B in revenue and 330,000 employees, with $3.6B in annual R&D investments. Our operations span 80+ countries and regions, allowing us to serve clients in over 190 of them. We serve over 75% of Fortune Global 100 companies, thousands of other enterprise and government clients and millions of consumers.

Press contact information

Nippon Telegraph and Telephone Corporation
NTT Science and Core Technology Laboratory Group, Public Relations
E-mail: nttrd-pr@ml.ntt.com

Information is current as of the date of issue of the individual press release.
Please be advised that information may be outdated after that point.
