22 papers authored by NTT Laboratories have been accepted at ICASSP 2025 (the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing),
the flagship conference on signal processing technology, to be held in Hyderabad, India, from April 6 to April 11, 2025. In addition to these accepted papers, we will also present at ICASSP three papers recently accepted for publication in IEEE journals.
Abbreviated names of the laboratories:
CS: NTT Communication Science Laboratories
HI: NTT Human Informatics Laboratories
CD: NTT Computer and Data Science Laboratories
(Affiliations are at the time of submission.)
- ◆Guided Speaker Embedding
- Shota Horiguchi (HI), Takafumi Moriya (HI/CS), Atsushi Ando (HI), Takanori Ashihara (HI), Hiroshi Sato (HI), Naohiro Tawara (CS), Marc Delcroix (CS)
- We proposed a method to extract features corresponding to a specific speaker from audio recordings involving multiple speakers. By leveraging information about the timing of each speaker's utterances, this method enables the effective extraction of features even from overlapping speech segments involving the target speaker and others. This technology is expected to be applied to speech recognition and understanding in scenarios where multiple speakers are involved, such as meetings and business discussions.
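A minimal sketch of the underlying idea, activity-weighted pooling of frame-level features (this toy example is not the paper's architecture; all shapes and names are illustrative assumptions):

```python
import numpy as np

def masked_stats_pooling(frame_feats: np.ndarray, target_activity: np.ndarray) -> np.ndarray:
    """Pool frame-level features into a fixed-size speaker embedding,
    using the target speaker's utterance timing as pooling weights.

    frame_feats:     (T, D) frame-level encoder outputs
    target_activity: (T,)   per-frame activity of the target speaker in [0, 1]
    returns:         (2 * D,) concatenated weighted mean and standard deviation
    """
    w = target_activity / (target_activity.sum() + 1e-8)    # normalized pooling weights
    mean = (w[:, None] * frame_feats).sum(axis=0)           # weighted mean
    var = (w[:, None] * (frame_feats - mean) ** 2).sum(axis=0)
    return np.concatenate([mean, np.sqrt(var + 1e-8)])

# Toy usage: 100 frames of 32-dim features, target speaker active in frames 20-60.
feats = np.random.randn(100, 32)
activity = np.zeros(100)
activity[20:60] = 1.0
print(masked_stats_pooling(feats, activity).shape)  # (64,)
```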
- ◆Multi-channel Speaker Counting for EEND-VC-based Speaker Diarization on Multi-domain Conversation
- Naohiro Tawara (CS), Atsushi Ando (HI), Shota Horiguchi (HI), Marc Delcroix (CS)
- We proposed a novel speaker counting method for conversational speech using multichannel audio obtained from distributed microphones. By combining a speaker diarization method called EEND-VC with state-of-the-art signal processing-based speech enhancement techniques, we achieved robust speaker count estimation that operates reliably regardless of the environment. Integrating this method as the frontend of the NTT remote speech recognition system led to strong performance in the CHiME-8 Challenge, an international competition on remote conversational speech recognition.
- ◆Mamba-based Segmentation Model for Speaker Diarization
- Alexis Plaquet (IRIT, Universite de Toulouse, CNRS), Naohiro Tawara (CS), Marc Delcroix (CS), Shota Horiguchi (HI), Atsushi Ando (HI), Shoko Araki (CS)
- We proposed a novel speaker diarization model that estimates the speech segments of each speaker from multi-talker audio. The proposed method is the first to apply Mamba, a state-of-the-art deep-learning-based state-space model, to this task, enabling speaker diarization that accounts for long-term history. We have released this model as a module compatible with the widely used speaker diarization framework pyannote, making it accessible for anyone to use (a hypothetical usage sketch is shown below). In the future, we aim to combine this system with a speech recognition framework to develop more practical conversation analysis systems, such as automatic meeting transcription.
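Since the model is released as a pyannote-compatible module, usage would follow the standard pyannote.audio pipeline API. The sketch below is hypothetical, and the model identifier is a placeholder rather than the actual released checkpoint name:

```python
from pyannote.audio import Pipeline

# Placeholder model identifier for illustration only; refer to the authors'
# release for the actual Mamba-based segmentation checkpoint.
pipeline = Pipeline.from_pretrained("your-org/mamba-based-diarization")

diarization = pipeline("meeting.wav")  # run diarization on a multi-talker recording
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```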
- ◆Alignment-Free Training for Transducer-based Multi-Talker ASR
- Takafumi Moriya (HI/CS), Shota Horiguchi (HI), Marc Delcroix (CS), Ryo Masumura (HI), Takanori Ashihara (HI), Hiroshi Sato (HI), Kohei Matsuura (HI), Masato Mimura (HI)
- We proposed a streaming speech recognition method to simultaneously transcribe all speakers' speech from audio recordings involving multiple speakers. By providing a prompt that indicates the order of speaker occurrences in multi-speaker audio, we simplify the training pipeline for multi-talker automatic speech recognition and enable simultaneous recognition of multi-speaker utterances with very high accuracy in both offline and streaming modes. This technology is expected to be utilized for voice transcription in situations involving multiple speakers, such as meetings and business discussions.
- ◆Advancing Streaming ASR with Chunk-wise Attention and Trans-chunk Selective State Spaces
- Masato Mimura (HI), Takafumi Moriya (HI/CS), Kohei Matsuura (HI)
- By combining an attention mechanism that focuses on short segments of speech with a selective state-space model capable of efficiently capturing long-range dependencies, we improved both accuracy and computational efficiency in streaming speech recognition. This approach is expected to be useful in applications requiring both precision and speed, such as conversational systems, real-time captioning, and smart assistants.
- ◆Leveraging IPA and Articulatory Features as Effective Inductive Biases for Multilingual ASR Training
- Lee Jaeyoung (Kyoto University), Masato Mimura (HI), Tatsuya Kawahara (Kyoto University)
- We demonstrated that incorporating language-independent phonological knowledge, such as the International Phonetic Alphabet (IPA) and articulatory features describing the movements of human speech organs, can significantly enhance the performance of multilingual speech recognition systems designed to support a large number of languages (up to 120). This approach is considered valuable for applications that support multilingual communication.
- ◆Bridging Speech and Text Foundation Models with ReShape Attention
- Takatomo Kano (CS), Atsunori Ogawa (CS), Marc Delcroix (CS), William Chen (Carnegie Mellon University (CMU)), Ryo Fukuda (CS), Kohei Matsuura (HI), Takanori Ashihara (HI), Shinji Watanabe (CMU)
- Large-scale pre-trained models (foundation models) such as LLMs are used for various tasks (e.g., speech recognition and text translation). Combining a speech foundation model with a text foundation model is one way to build a speech translation model, but it has weaknesses such as mistranslation due to speech recognition errors and the inability of the LLM to understand information specific to speech (e.g., speaking style). Previous works therefore combine different foundation models and optimize the system through fine-tuning. In this study, we propose a method to integrate foundation models that is more efficient and stable than conventional methods. The proposal allows the creation of specialized and general-purpose models with fewer resources (data, computation, and time) by combining various foundation models.
- ◆Speech Emotion Recognition Based on Large-Scale Automatic Speech Recognizer
- Ryo Fukuda (CS), Takatomo Kano (CS), Atsushi Ando (HI), Atsunori Ogawa (CS)
- Speech emotion recognition is a technology that identifies emotions from speech. In this research, we proposed a novel speech emotion recognition method that leverages a large-scale pre-trained speech recognition model, enabling the consideration of both linguistic and prosodic information. This technology holds promising potential for applications in areas such as mental health care and customer service.
- ◆Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression
- Kai Yoshida (Nara Institute of Science and Technology / RIKEN GRP), Masahiro Mizukami (CS/HI), Seiya Kawano (RIKEN GRP/ Nara Institute of Science and Technology), Canasai Kruengkrai (RIKEN GRP), Hiroaki Sugiyama (CS/HI), Koichiro Yoshino (Tokyo Institute of Technology / RIKEN GRP / Nara Institute of Science and Technology)
- As part of a joint study conducted by NTT and the Guardian Robot Project, RIKEN, we proposed a method to improve not only individual responses but also the overall dialogue impression in LLM-based dialogue systems. We used a model that evaluates 12 types of impressions, including personality, consistency, and empathy, and applied reinforcement learning with rewards designed to enhance these impressions, leading to improvements in both automatic and human evaluations. Not only did the overall dialogue impression improve, but the responses also became more natural. This technology is expected to help develop more natural and engaging chatbots and AI assistants in the future.
- ◆A Hybrid Probabilistic-Deterministic Model Recursively Enhancing Speech
- Tomohiro Nakatani (CS), Naoyuki Kamo (CS), Marc Delcroix (CS), Shoko Araki (CS)
- This paper introduces Probabilistic-Deterministic Recursive Enhancement (PDRE), an innovative deep learning approach designed to accurately suppress background noise and reverberation in recorded audio signals. Utilizing recursive estimation with a hybrid of probabilistic and deterministic methodologies, PDRE achieves accurate estimations that are comparable to or even better than conventional state-of-the-art diffusion model-based approaches, while requiring less than 1/100th of the computation time. We aim to leverage this technique to enhance the performance of various audio applications.
- ◆SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model
- Carlos Hernandez-Olivan (University of Zaragoza), Marc Delcroix (CS), Tsubasa Ochiai (CS), Daisuke Niizumi (CS), Naohiro Tawara (CS), Tomohiro Nakatani (CS), Shoko Araki (CS)
- We proposed a novel target sound extraction (TSE) method for isolating the signal of a desired sound from a mixture of arbitrary sounds. The proposed method extends NTT's SoundBeam TSE approach with audio features derived from NTT's pre-trained masked modeling duo (M2D) audio foundation model. We show experimentally that using M2D features makes it possible to better identify the target sound in a mixture and significantly improves extraction performance compared to the previous version of SoundBeam. This technology will help create systems that can select desired sounds from audio recordings, which can be used for audio post-processing or hearables.
- ◆TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models
- Junyi Peng (Brno University of Technology (BUT)), Takanori Ashihara (HI), Marc Delcroix (CS), Tsubasa Ochiai (CS), Plchot Oldřich (BUT), Shoko Araki (CS), Jan Honza Cernocky (BUT)
- We proposed a new benchmark, called the target-speaker speech processing universal performance benchmark (TS-SUPERB), to evaluate self-supervised learning (SSL) models on target-speaker speech processing tasks. TS-SUPERB includes four tasks that require identifying the target speaker and extracting information from a speech mixture: target speech extraction, personalized speech enhancement, personalized voice activity detection, and target-speaker automatic speech recognition. The benchmark results reveal the importance of evaluating SSL models in target-speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. This benchmark will help spur the development of SSL models for technologies that can isolate or transcribe the voice of a desired speaker in cocktail-party-like situations.
- ◆Collision-less and Balanced Sampling for Language-Queried Audio Source Separation
- Binh Thien Nguyen (CS), Daiki Takeuchi (CS), Masahiro Yasuda (CS/CD), Daisuke Niizumi (CS), Noboru Harada (CS)
- In this paper, we proposed a collision-less and balanced sampling scheme for language-queried audio source separation (LASS). In the proposed scheme, the interference signals are sampled so that their audio tags do not conflict with those of the target signal, where the tags are generated using an audio tagging model. To balance the data, we also consider several balanced sampling approaches using tag or caption embeddings. By leveraging their distribution information, we use either weighted or group sampling to boost the occurrence of underrepresented samples while reducing the presence of overrepresented ones (a minimal sampling sketch follows below). Experimental results show the superiority of the proposed method over state-of-the-art LASS systems in DCASE 2024 Challenge Task 9.
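A minimal sketch of the collision-free, frequency-weighted sampling idea (the tag sets, weighting rule, and helper function below are illustrative assumptions, not the paper's exact procedure):

```python
import random
from collections import Counter

def sample_interference(target_tags, candidates, rng=random):
    """Pick an interference clip whose tags do not collide with the target's tags,
    with sampling weights that favor clips carrying under-represented tags.

    target_tags: set of audio tags of the target clip
    candidates:  list of (clip_id, set_of_tags) pairs
    """
    # Collision-less: keep only candidates sharing no tag with the target.
    valid = [(cid, tags) for cid, tags in candidates if not (tags & target_tags)]
    if not valid:
        raise ValueError("no collision-free interference candidate")

    # Balanced: inverse-frequency weights boost rare tags and damp frequent ones.
    tag_freq = Counter(t for _, tags in candidates for t in tags)
    weights = [sum(1.0 / tag_freq[t] for t in tags) / max(len(tags), 1)
               for _, tags in valid]
    return rng.choices(valid, weights=weights, k=1)[0]

# Toy usage: with target tag {"dog"}, clips "a" and "c" are never sampled.
cands = [("a", {"dog"}), ("b", {"siren"}), ("c", {"dog", "speech"}), ("d", {"rain"})]
print(sample_interference({"dog"}, cands))
```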
- ◆30+ Years of Source Separation Research: Achievements and Future Challenges
- Shoko Araki (CS), Nobutaka Ito (University of Tokyo), Reinhold Haeb-Umbach (Paderborn University), Gordon Wichern (Mitsubishi Electric Research Laboratories), Zhong-Qiu Wang (Southern University of Science and Technology), Yuki Mitsufuji (Sony AI)
- On the occasion of ICASSP's 50th anniversary, we review the major contributions and advancements of the past three decades in the speech, audio, and music source separation (SS) research field. We also look back on key efforts to foster a culture of scientific evaluation in the field, including challenges, performance metrics, and datasets. In addition, we discuss future directions in which sound source separation research should be pursued, aiming to contribute to the further development of the research field.
- ◆Rethinking Mean Opinion Scores in Speech Quality Assessment: Score Aggregation through Quantized Distribution Fitting
- Yuto Kondo (CS), Hirokazu Kameoka (CS), Kou Tanaka (CS), Takuhiro Kaneko (CS)
- In recent years, the development of predictive models for automatically evaluating the quality of speech generated by text-to-speech (TTS) systems has been actively pursued. The most fundamental of these models are trained to predict the mean opinion score (MOS), the average of the scores that listeners assign to speech quality in subjective evaluation questionnaires. In this study, we question the use of MOS as the training target and propose a score aggregation method that accounts for the rating processes of individual listeners. By replacing the prediction target during training from MOS to the newly aggregated score, we confirmed improvements in the predictive performance of the models. Furthermore, the proposed method is a versatile approach that not only enhances the prediction of "speech quality" but also improves the performance of various other predictive models, such as those targeting the "coolness" of a voice. By utilizing the proposed method, generated speech can be evaluated more faithfully to human perception, contributing to the realization of a society where AI can engage in natural conversations with humans.
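Purely as an illustration of one possible reading of score aggregation through quantized distribution fitting (not necessarily the paper's formulation), the sketch below fits a continuous latent quality distribution to quantized 1-5 ratings and uses its mean instead of the plain arithmetic mean (MOS):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def aggregate_scores(ratings, edges=(1.5, 2.5, 3.5, 4.5)):
    """Fit a latent Gaussian to quantized 1-5 opinion scores (interval-censored
    maximum likelihood) and return its mean as the aggregated score."""
    ratings = np.asarray(ratings)
    lo = np.array([-np.inf, *edges])[ratings - 1]   # lower bin edge of each rating
    hi = np.array([*edges, np.inf])[ratings - 1]    # upper bin edge of each rating

    def nll(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        p = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
        return -np.sum(np.log(p + 1e-12))

    res = minimize(nll, x0=[float(ratings.mean()), 0.0])
    return res.x[0]

ratings = [4, 4, 5, 3, 4]
print(aggregate_scores(ratings), np.mean(ratings))  # aggregated score vs. plain MOS
```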
- ◆Sound Source Distance Estimation Utilizing Physics-informed Prior for Sound Event Localization and Detection
- Nao Sato (CD), Masahiro Yasuda (CD), Shoichiro Saito (CD), Noboru Harada (CS/CD)
- Sound event localization and detection (SELD) is the task of identifying the type and location of acoustic events. Traditional data-driven distance estimation methods for SELD degrade in performance as data conditions change. To address this problem, we proposed a SELD system that utilizes physics-based prior knowledge and confirmed its effectiveness. This technology is expected to contribute to deploying applications such as sound-based public security systems in various real-world environments.
- ◆Spatial Annotation-free Training for Sound Event Localization and Detection
- Masahiro Yasuda (CD), Shoichiro Saito (CD), Nao Sato (CD), Noboru Harada (CS/CD)
- Sound event localization and detection (SELD) estimates the class, duration, and direction of arrival (DOA) of sound events. This study proposed a new SELD training framework, spatial annotation-free training, which trains a SELD system using only sound class and duration labels. Experimental results show that the proposed method can effectively utilize data without spatial annotations for the SELD task. This technology will enable the effective use of sound data that could not be used before, and it is expected to contribute to providing human support through sound using a variety of familiar devices.
- ◆Multi-Task Learning for Ultrasonic Echo-based Depth Estimation with Audible Frequency Recovery
- Junpei Honma (Tokyo University of Science), Akisato Kimura (CS), Go Irie (Tokyo University of Science)
- We propose a method for estimating indoor depth information by placing loudspeakers in the environment, transmitting ultrasonic waves, and recording their echoes with a microphone. Previous methods for echo-based depth estimation require noisy audible sounds to obtain effective echoes, which can adversely affect the surrounding environment and the human body. Our proposed method estimates indoor depth information from ultrasonic emissions alone: audible sounds are used only in the training stage, through an auxiliary task that predicts, from the inaudible ultrasonic reverberation, the reverberation that audible sounds would produce in the target environment. This technique is expected to pave the way for the analysis and reconstruction of real-world scenes in situations where visible-light cameras are not available.
- ◆3GPP IVAS Codec – Perspectives on Development, Testing and Standardization
- Stefan Bruhn (Dolby), Tomas Toftgård (Ericsson), Stefan Döhla (FhG), Huan-yu Su (Huawei), Lasse Laaksonen (Nokia), Takehiro Moriya (CS), Stéphane Ragot (Orange), Hiroyuki Ehara (Panasonic), Marek Szczerba (Philips), Imre Varga (Qualcomm), Andrey Schevciw (Qualcomm), Milan Jerinec (VoiceAge)
- The standardization of the codec for Immersive Voice and Audio Services (IVAS) was completed by the 3rd Generation Partnership Project (3GPP) in June 2024. The IVAS codec goes beyond traditional mono voice coding by representing and reproducing the spatial characteristics of sound, creating an immersive auditory experience. It opens the space for new applications in the realm of mobile communications and user-generated live content streaming, such as immersive telephony and extended reality (XR) teleconferencing. The present paper provides a brief overview of the IVAS standard framework, covering key features, properties, and unique perspectives that are essential for understanding the development and standardization processes that led to this new 3GPP standard.
- ◆Stereo Downmix in 3GPP IVAS for EVS Compatibility
- Takehiro Moriya (CS), Stephane Ragot (Orange), Arnaud Lefort (Orange), Alexandre Guerin (Orange), Noboru Harada (CS), Ryosuke Sugiura (CS), Yutaka Kamamoto (CS)
- The 3GPP IVAS codec specifies an EVS-compatible stereo downmix as one of its key functionalities. This paper describes how this novel active downmix scheme was devised to achieve high and stable quality from stereo input to the EVS encoder/decoder with no additional algorithmic delay. An example of a network configuration for a multi-party conference using the proposed EVS-compatible downmix is also provided. The paper presents several fundamental active-downmix schemes, including adaptive weighting and phase-compensated weighting between the two channels (a simplified illustration of passive vs. active downmixing follows below). Subjective listening test results show that the devised scheme's quality is better than that of a passive downmix.
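To illustrate the difference between a passive and an active downmix, here is a crude sketch. This is not the IVAS algorithm; the frame-wise energy weighting and gain normalization are illustrative simplifications, and a real codec smooths such weights across frames:

```python
import numpy as np

def passive_downmix(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Fixed (L + R) / 2 downmix; energy can collapse when channels are out of phase."""
    return 0.5 * (left + right)

def active_downmix(left: np.ndarray, right: np.ndarray, frame: int = 512) -> np.ndarray:
    """Frame-wise adaptive-weight downmix: per-frame weights follow the channel
    energies, and the frame gain is normalized to preserve the input level."""
    out = np.zeros(len(left))
    for start in range(0, len(left), frame):
        l, r = left[start:start + frame], right[start:start + frame]
        el, er = np.sum(l ** 2) + 1e-12, np.sum(r ** 2) + 1e-12
        m = (el * l + er * r) / (el + er)              # energy-adaptive weighted sum
        target_energy = 0.5 * (el + er)                # mean input channel energy
        m *= np.sqrt(target_energy / (np.sum(m ** 2) + 1e-12))
        out[start:start + frame] = m
    return out
```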
- ◆Hyperbolic PHATE: Visualizing Continuous Hierarchy of Latent Differentiation Structures
- Masahiro Nakano (CS), Hiroki Sakuma (CS), Ryo Nishikimi (CS), Kenji Komiya (CS), Tomoharu Iwata (CS), Kunio Kashino (CS)
- We propose a technology that visualizes, based on genetic information, the process by which living cells differentiate from their birth into organs and tissues such as the heart and lungs. Focusing on the assumption that cell differentiation is mainly composed of branching structures and diffusion structures, we made it possible to embed the latent diffusion distance between data points in a hyperbolic space that implicitly induces branching structures. In the future, we hope to use this technology to elucidate the causes of diseases and to apply it to regenerative medicine.
- ◆CardioFlow: Learning to Generate ECG from PPG with Rectified Flow
- Yuta Nambu (HI), Masahiro Kohjima (HI), Ryuji Yamamoto (HI)
- We proposed a deep learning model that can quickly generate electrocardiograms (ECG), which are laborious to measure, from photoplethysmograms (PPG), which can be easily measured with a smartwatch. By using Rectified Flow, a cutting-edge generative modeling technique, we achieved more accurate and faster ECG generation than a previous method based on diffusion models (a generic training sketch follows below). We confirmed the effectiveness of the proposed method through experiments on two datasets that recorded PPG and ECG during emotion induction and exercise. We also confirmed that using generated ECGs for training improves classification performance over using raw PPGs in an emotion recognition task.
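A generic rectified-flow training step, shown only to clarify the technique (this is not the CardioFlow architecture; the toy velocity network, signal length, and conditioning are illustrative assumptions): the model learns the constant velocity of the straight path from noise to the ECG, conditioned on the paired PPG.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy conditional velocity field v(x_t, t, ppg) for the sketch below."""
    def __init__(self, sig_len: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sig_len * 2 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, sig_len),
        )

    def forward(self, x_t, t, ppg):
        return self.net(torch.cat([x_t, ppg, t], dim=-1))

def rectified_flow_loss(model, ecg, ppg):
    """One rectified-flow training step: regress the straight-line velocity from
    noise to data along the interpolation x_t = (1 - t) * noise + t * ecg."""
    noise = torch.randn_like(ecg)
    t = torch.rand(ecg.shape[0], 1)
    x_t = (1 - t) * noise + t * ecg
    target_velocity = ecg - noise          # constant along the straight path
    return ((model(x_t, t, ppg) - target_velocity) ** 2).mean()

# Toy usage with random "ECG" and "PPG" batches of length 256.
model = VelocityNet()
loss = rectified_flow_loss(model, torch.randn(8, 256), torch.randn(8, 256))
loss.backward()
```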
The following are the papers recently accepted for publication in IEEE journals:
- ◆Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance
- Tsubasa Ochiai (CS), Kazuma Iwamoto (Doshisha Univ.), Marc Delcroix (CS), Rintaro Ikeshita (CS), Hiroshi Sato (HI), Shoko Araki (CS), Shigeru Katagiri (Doshisha Univ.)
- Although single-channel speech enhancement significantly improves speech enhancement metrics, it has been reported to have a negative effect on speech recognition performance. In this study, we proposed an analysis method based on orthogonal-projection-based error decomposition to identify the cause of such ASR performance degradation. In addition, to mitigate these degradation factors, we proposed an observation-adding post-processing step and a new training objective, and experimentally demonstrated the effectiveness of the proposed schemes (a simplified illustration of the projection-based decomposition follows below). The insights obtained from this study enable the design of single-channel speech enhancement systems that can improve ASR performance.
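A simplified, BSS-eval-style illustration of orthogonal-projection error decomposition (the single-channel toy setup and signal names are assumptions, not the paper's exact formulation):

```python
import numpy as np

def project(x, basis):
    """Least-squares projection of x onto the span of the basis signals (rows)."""
    coeffs, *_ = np.linalg.lstsq(basis.T, x, rcond=None)
    return basis.T @ coeffs

def decompose_error(enhanced, target, noise):
    """Split the enhancement error into components explained by the target,
    by the noise, and an artifact residual, via orthogonal projections."""
    err = enhanced - target
    e_target = project(err, target[None, :])            # error along the target signal
    both = np.stack([target, noise])
    e_noise = project(err, both) - e_target             # additional error along the noise
    e_artifact = err - project(err, both)               # residual: processing artifacts
    return e_target, e_noise, e_artifact

# Toy usage with synthetic target and noise signals.
rng = np.random.default_rng(0)
s, n = rng.standard_normal(1000), rng.standard_normal(1000)
enhanced = 0.9 * s + 0.1 * n + 0.05 * rng.standard_normal(1000)
energies = [float(np.sum(e ** 2)) for e in decompose_error(enhanced, s, n)]
print([round(e, 2) for e in energies])
```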
- ◆Masked Modeling Duo: Towards a Universal Audio Pre-training Framework
- Daisuke Niizumi (CS), Daiki Takeuchi (CS), Yasunori Ohishi (CS), Noboru Harada (CS), Kunio Kashino (CS)
- Audio representation learning can convert the diverse sounds around us into data valuable for a wide range of application systems. We proposed a self-supervised learning method, masked modeling duo (M2D), which improves a learning technique that learns by predicting the masked parts of the input data from the visible parts. We also proposed M2D for X, which extends M2D to learn representations specialized for an application X. Through these methods, which provide more useful audio representations, we contribute to developing future applications that understand sound.
- ◆Sparse Regularization with Reverse Sorted Sum of Squares via an Unrolled Difference-of-Convex Approach
- Takayuki Sasaki (CD), Kazuya Hayase (CD), Masaki Kitahara (CD), Shunsuke Ono (Institute of Science Tokyo)
- In inverse problems of estimating signal values from insufficient observations, sparse regularization approaches that exploit the sparseness of the target signal are widely used. Recently, non-convex sparse regularization functions have been introduced to improve estimation performance, but their theoretical analysis is difficult, which hinders the development of interpretable algorithms. In this study, we propose a novel non-convex sparse regularization function, the reverse sorted sum of squares (RSSS), and achieve both high estimation performance and interpretability by formulating the problem as a difference-of-convex (DC) program (the generic DC template is sketched below). This technique is expected to be beneficial for various inverse problems such as noise reduction, super-resolution, and colorization.
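For reference, the generic difference-of-convex (DC) template and the classical DC-algorithm iteration that such formulations build on (the specific RSSS splitting into convex parts is given in the paper):

```latex
% Generic DC program: the objective is split into a difference of convex functions.
\min_{x}\; F(x) = g(x) - h(x), \qquad g,\ h\ \text{convex}.
% DC algorithm (DCA): linearize the concave part -h at the current iterate.
x^{(k+1)} \in \operatorname*{arg\,min}_{x}\; g(x) - \bigl\langle \nabla h\bigl(x^{(k)}\bigr),\, x \bigr\rangle, \qquad k = 0, 1, 2, \dots
```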