Seventeen papers authored by NTT Laboratories have been accepted at INTERSPEECH 2024 (the 25th INTERSPEECH Conference), the world's largest international conference on spoken language processing, to be held on Kos Island, Greece, from September 1 to 5, 2024. In addition, Dr. Shoko Araki (CS) will give a keynote presentation entitled "Frontier of Frontend for Conversational Speech Processing".
Abbreviated names of the laboratories:
CS: NTT Communication Science Laboratories
HI: NTT Human Informatics Laboratories
(The affiliations are as of the time of submission.)
- ◆M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
- Daisuke Niizumi (CS), Daiki Takeuchi (CS), Yasunori Ohishi (CS), Noboru Harada (CS), Masahiro Yasuda (CS), Shunsuke Tsubaki (Doshisha University), Keisuke Imoto (Doshisha University)
- While CLAP, which aligns language and audio to enable zero-shot inference, has attracted attention, it cannot be applied to problems such as regression. We propose M2D-CLAP, which supports both conventional transfer learning and zero-shot inference by combining the general-purpose audio representation of M2D, previously published in IEEE TASLP, with CLAP training (see the sketch below).
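The core of CLAP-style training is a symmetric contrastive loss that pulls paired audio and caption embeddings together in a shared space. The following is a minimal sketch of that loss, assuming generic encoders and an illustrative embedding size; it is not the exact M2D-CLAP configuration.

```python
# Minimal sketch of CLAP-style contrastive audio-text alignment (illustrative;
# the encoders, dimensions, and temperature are assumptions, not the exact
# M2D-CLAP setup).
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) embeddings of paired clips and captions."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Matching pairs lie on the diagonal; contrast in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy stand-ins for M2D audio features and text-encoder outputs.
print(clap_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```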
- ◆Unified Multi-Talker ASR with and without Target-speaker Enrollment
- Ryo Masumura (HI), Naoki Makishima (HI), Tomohiro Tanaka (HI), Mana Ihori (HI), Naotaka Kawata (HI), Shota Orihashi (HI), Kazutoshi Shinoda (HI), Taiga Yamane (HI), Saki Mizuno (HI), Keita Suzuki (HI), Satoshi Suzuki (HI), Nobukatsu Hojo (HI), Takafumi Moriya (HI/CS), Atsushi Ando (HI)
- We propose a novel multi-talker automatic speech recognition system that can perform both a target-speaker enrollment-driven process and a target-speaker-free process in a unified modeling framework.
- ◆SOMSRED: Sequential Output Modeling for Joint Multi-talker Overlapped Speech Recognition and Speaker Diarization
- Naoki Makishima (HI), Naotaka Kawata (HI), Mana Ihori (HI), Tomohiro Tanaka (HI), Shota Orihashi (HI), Atsushi Ando (HI), Ryo Masumura (HI)
- Conventional studies use separately trained models to estimate who spoke when and what from overlapped speech. This makes the system complex, and non-overlapping speech regions are required to accurately estimate the speakers. In contrast, the proposed method utilizes a single model to estimate who spoke when and what, enabling speaker estimation even in fully overlapped speech.
- ◆Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding
- Takafumi Moriya (HI/CS), Takanori Ashihara (HI), Masato Mimura (HI), Hiroshi Sato (HI), Kohei Matsuura (HI), Ryo Masumura (HI), Taichi Asami (HI)
- Transducer-based automatic speech recognition (ASR) model training can exhibit unstable behavior due to the varying lengths of audio and text pairs. We propose an internal acoustic model (IAM) to stabilize training and enhance ASR performance. Additionally, we introduce a dual blank thresholding method that improves inference speed by leveraging the characteristics of both the Transducer and the IAM, which can predict blank durations (a generic sketch of blank-threshold decoding follows below).
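As a rough illustration of how blank thresholding speeds up Transducer inference, the sketch below skips a frame as soon as the blank posterior is confident enough, avoiding further label-side computation for that frame. The toy predictor and joint network are hypothetical stand-ins, and the paper's dual thresholding with the internal acoustic model is not reproduced.

```python
# Greedy Transducer decoding with a blank threshold (a generic sketch; the
# ToyPredictor/ToyJoint modules are hypothetical stand-ins and the paper's
# dual thresholding with the IAM is not reproduced).
import torch
import torch.nn as nn

class ToyPredictor(nn.Module):
    """Stand-in label predictor network."""
    def __init__(self, vocab=10, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRUCell(dim, dim)
    def initial_state(self):
        return torch.zeros(1, 16)
    def update(self, state, token):
        return self.rnn(self.emb(torch.tensor([token])), state)

class ToyJoint(nn.Module):
    """Stand-in joint network combining encoder frame and predictor state."""
    def __init__(self, enc_dim=16, dec_dim=16, vocab=10):
        super().__init__()
        self.proj = nn.Linear(enc_dim + dec_dim, vocab)
    def forward(self, enc_t, state):
        return self.proj(torch.cat([enc_t, state.squeeze(0)], dim=-1))

def greedy_decode_with_blank_threshold(encoder_out, joint, predictor,
                                       blank_id=0, blank_threshold=0.95,
                                       max_symbols_per_frame=3):
    hyp, state = [], predictor.initial_state()
    for t in range(encoder_out.size(0)):
        for _ in range(max_symbols_per_frame):
            probs = torch.softmax(joint(encoder_out[t], state), dim=-1)
            # Confident (or most likely) blank: stop expanding this frame.
            if probs[blank_id] >= blank_threshold or probs.argmax() == blank_id:
                break
            token = int(probs.argmax())
            hyp.append(token)
            state = predictor.update(state, token)
    return hyp

enc = torch.randn(20, 16)                    # dummy encoder output (T, dim)
print(greedy_decode_with_blank_threshold(enc, ToyJoint(), ToyPredictor()))
```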
- ◆Text-only domain adaptation for CTC-based speech recognition through substitution of implicit linguistic information in the search space
- Tatsunari Takagi (Toyohashi University of Technology), Yukoh Wakabayashi (Toyohashi University of Technology), Atsunori Ogawa (CS), and Norihide Kitaoka (Toyohashi University of Technology)
- We propose an efficient domain adaptation method for a CTC-based end-to-end ASR model that substitutes the implicit linguistic information acquired during model training: the source-domain information is subtracted and target-domain linguistic information is added during decoding (an illustrative score combination is sketched below). Experiments revealed that a 1-gram language model is suitable for the subtraction and a 4-gram language model for the addition.
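A common way to realize such substitution is to adjust each hypothesis score during decoding by subtracting the score of the source-domain (implicit) language model and adding that of a target-domain language model. The sketch below shows this score combination with assumed interpolation weights; the beam search itself and the paper's exact formulation are omitted.

```python
# Illustrative score combination for text-only domain adaptation of a CTC
# model: subtract the implicit source-domain LM and add a target-domain LM
# (weights and the scoring interface are assumptions, not the paper's exact
# decoder).
def adapted_score(log_p_ctc, log_p_src_1gram, log_p_tgt_4gram,
                  sub_weight=0.3, add_weight=0.5):
    """All arguments are log-probabilities of the same hypothesis prefix."""
    return log_p_ctc - sub_weight * log_p_src_1gram + add_weight * log_p_tgt_4gram

# Ranking two candidate prefixes with made-up numbers.
candidates = {"prefix_a": (-12.0, -8.0, -6.5), "prefix_b": (-11.5, -5.0, -9.0)}
for hyp, (ctc, src, tgt) in candidates.items():
    print(hyp, adapted_score(ctc, src, tgt))
```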
- ◆Boosting CTC-based ASR using inter-layer attention-based CTC loss
- Keigo Hojo (Toyohashi University of Technology), Yukoh Wakabayashi (Toyohashi University of Technology), Kengo Ohta (National Institute of Technology, Anan College), Atsunori Ogawa (CS), and Norihide Kitaoka (Toyohashi University of Technology)
- The lower and upper layers of the encoder in a CTC-based end-to-end ASR model play different roles, i.e., phonetic information tends to be localized in the lower layers and linguistic information in the upper layers. Focusing on these different roles, we proposed training the model by attaching different attention-based auxiliary CTC losses to these layers, which improved ASR accuracy (a generic sketch of intermediate-layer CTC losses follows below).
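The general idea of attaching auxiliary CTC losses to intermediate encoder layers can be sketched as follows; the layer indices and weights are illustrative, and the paper's attention-based formulation is not reproduced.

```python
# Generic sketch of auxiliary CTC losses on intermediate encoder layers
# (illustrative layer choices and weights; the paper's attention-based
# auxiliary losses are not reproduced).
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def multi_layer_ctc_loss(layer_outputs, heads, targets, in_lens, tgt_lens,
                         aux_layers=(3, 7), aux_weight=0.3):
    """layer_outputs: list of (T, B, D) per-layer encoder outputs;
    heads: dict mapping layer index -> nn.Linear projecting D to the vocab."""
    def layer_loss(idx):
        log_probs = heads[idx](layer_outputs[idx]).log_softmax(dim=-1)
        return ctc(log_probs, targets, in_lens, tgt_lens)
    main = layer_loss(len(layer_outputs) - 1)
    aux = sum(layer_loss(i) for i in aux_layers) / len(aux_layers)
    return main + aux_weight * aux

# Toy example: a 12-layer encoder, D=64, vocab=30, batch of 2 utterances.
T, B, D, V = 50, 2, 64, 30
outs = [torch.randn(T, B, D) for _ in range(12)]
heads = {i: nn.Linear(D, V) for i in (3, 7, 11)}
targets = torch.randint(1, V, (B, 10))
print(multi_layer_ctc_loss(outs, heads, targets,
                           torch.full((B,), T), torch.full((B,), 10)))
```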
- ◆Factor-Conditioned Speaking Style Captioning
- Atsushi Ando (HI), Takafumi Moriya (HI/CS), Shota Horiguchi (HI), Ryo Masumura (HI)
- We propose factor-conditioned captioning, a novel speaking-style captioning method that first outputs a phrase representing speaking-style factors and then generates the caption, ensuring that the model explicitly learns speaking-style factors. Experiments show that the proposed method generates more accurate and diverse captions than conventional methods.
- ◆Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation
- Kohei Matsuura (HI), Takanori Ashihara (HI), Takafumi Moriya (HI/CS), Masato Mimura (HI), Takatomo Kano (CS), Atsunori Ogawa (CS), Marc Delcroix (CS)
- Transcriptions generated by conventional speech recognition systems often contain redundant expressions and disfluencies, which hinder readability. We propose a novel technology, "Sentence-wise Speech Summarization," which enables users to sequentially refer to concise and readable text. We also propose a data augmentation method that utilizes an external language model for end-to-end modeling.
- ◆Participant-Pair-Wise Bottleneck Transformer for Engagement Estimation from Video Conversation
- Keita Suzuki (HI), Nobukatsu Hojo (HI), Kazutoshi Shinoda (HI), Saki Mizuno (HI), Ryo Masumura (HI)
- We proposed a Transformer-based method for estimating engagement in meetings with multiple participants. The method efficiently represents interactions among high-dimensional inputs, such as audio and video from multiple participants, using a small number of bottleneck tokens (illustrated in the sketch below), and we confirmed that this compact interaction representation improves estimation accuracy.
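One way to read the bottleneck idea is sketched below: a handful of learned tokens gather information from all participants' audio-visual features and feed a compact summary back to each stream. Module names and sizes are assumptions for illustration, not the paper's participant-pair-wise architecture.

```python
# Illustrative bottleneck-token exchange between participant streams (module
# names and dimensions are assumptions; the paper's participant-pair-wise
# architecture is not reproduced).
import torch
import torch.nn as nn

class BottleneckExchange(nn.Module):
    def __init__(self, dim=128, num_bottleneck=4, num_heads=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(num_bottleneck, dim))
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, participant_feats):
        """participant_feats: (batch, participants, frames, dim) features."""
        b, p, s, d = participant_feats.shape
        flat = participant_feats.reshape(b, p * s, d)
        bn = self.bottleneck.unsqueeze(0).repeat(b, 1, 1)
        bn, _ = self.read(bn, flat, flat)     # bottleneck tokens gather info
        out, _ = self.write(flat, bn, bn)     # each token reads the summary back
        return out.reshape(b, p, s, d)

x = torch.randn(2, 4, 50, 128)                # 2 meetings, 4 participants
print(BottleneckExchange()(x).shape)
```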
- ◆Learning from Multiple Annotator Biased Labels in Multimodal Conversation
- Kazutoshi Shinoda (HI), Nobukatsu Hojo (HI), Saki Mizuno (HI), Keita Suzuki (HI), Satoshi Kobashikawa (HI), Ryo Masumura (HI)
- We proposed a debiasing method to prevent overfitting to dataset biases arising from the variability in judgments among multiple annotators in the task of classifying multimodal conversations involving both speaker audio and video. Our method improved accuracy for minority speakers and classes while maintaining performance for majority speakers and classes.
- ◆Geometry-Robust Attention-Based Neural Beamformer for Moving Speakers
- Marvin Tammen (University of Oldenburg), Tsubasa Ochiai (CS), Marc Delcroix (CS), Tomohiro Nakatani (CS), Shoko Araki (CS), Simon Doclo (University of Oldenburg)
- We extended our recently proposed attention-based neural beamforming framework for moving sources so that it can be applied to arbitrary microphone arrays with different geometries. Experimental results showed that the proposed method can automatically track a moving source and accurately extract it even with a microphone array that was not seen during training.
- ◆SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling
- Hiroshi Sato (HI), Takafumi Moriya (HI/CS), Masato Mimura (HI), Shota Horiguchi (HI), Tsubasa Ochiai (CS), Takanori Ashihara (HI), Atsushi Ando (HI), Kentaro Shinayama (HI), Marc Delcroix (CS)
- Implementing real-time speech enhancement is challenging because the computational complexity must be kept low enough for real-time operation. In this study, we propose SpeakerBeam-SS, which incorporates state space modeling (see the generic sketch below) into target speaker extraction, a speech enhancement technique that extracts the desired speaker from the observed signal. This approach achieved about five times faster processing while maintaining performance.
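For readers unfamiliar with the term, a state space model processes a sequence through a fixed-size recurrent state, which keeps per-frame cost and memory constant and is therefore attractive for streaming. The sketch below shows only a generic discrete-time linear state space recurrence; SpeakerBeam-SS's actual layer design and parameterization are not reproduced.

```python
# A generic discrete-time state space recurrence (x_k = A x_{k-1} + B u_k,
# y_k = C x_k + D u_k), shown only to illustrate the kind of sequence model
# referred to as "state space modeling"; SpeakerBeam-SS's actual layers are
# not reproduced here.
import numpy as np

def ssm_scan(u, A, B, C, D):
    """u: (T, in_dim) input frames; returns (T, out_dim) outputs."""
    T = u.shape[0]
    x = np.zeros(A.shape[0])
    y = np.zeros((T, C.shape[0]))
    for k in range(T):
        x = A @ x + B @ u[k]       # state update: O(1) memory per step,
        y[k] = C @ x + D @ u[k]    # which suits streaming operation
    return y

state, din, dout = 8, 4, 4
rng = np.random.default_rng(0)
A = 0.9 * np.eye(state)            # stable toy dynamics
B = rng.standard_normal((state, din)) * 0.1
C = rng.standard_normal((dout, state))
D = rng.standard_normal((dout, din))
print(ssm_scan(rng.standard_normal((100, din)), A, B, C, D).shape)
```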
- ◆Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
- Kenichi Fujita (HI), Takanori Ashihara (HI), Marc Delcroix (CS), Yusuke Ijima (HI)
- We proposed a method that realizes zero-shot speech synthesis, i.e., generating speech similar to a target speaker's voice from a few seconds of reference speech, with a small number of model parameters by using a mixture of adapters (a minimal sketch follows below). This enables high-speed speech synthesis on a CPU without a GPU and is expected to benefit applications that require fast responses, such as spoken dialogue systems.
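A mixture-of-adapters layer can be pictured as a set of small bottleneck adapters whose outputs are blended with weights predicted from a speaker embedding. The sketch below is an illustrative reading of that idea with assumed dimensions and gating; it is not the paper's exact design.

```python
# Illustrative mixture-of-adapters layer: small bottleneck adapters mixed with
# weights predicted from a speaker embedding (dimensions and gating are
# assumptions, not the paper's configuration).
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    def __init__(self, dim=256, bottleneck=32, num_adapters=4, spk_dim=128):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(num_adapters))
        self.gate = nn.Linear(spk_dim, num_adapters)

    def forward(self, hidden, spk_emb):
        """hidden: (batch, seq, dim); spk_emb: (batch, spk_dim)."""
        weights = torch.softmax(self.gate(spk_emb), dim=-1)        # (batch, K)
        outs = torch.stack([a(hidden) for a in self.adapters], dim=-1)
        mixed = (outs * weights[:, None, None, :]).sum(dim=-1)
        return hidden + mixed                                      # residual adapter

moa = MixtureOfAdapters()
print(moa(torch.randn(2, 40, 256), torch.randn(2, 128)).shape)
```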
- ◆FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation
- Takuhiro Kaneko (CS), Hirokazu Kameoka (CS), Kou Tanaka (CS), Yuto Kondo (CS)
- Diffusion-based voice conversion (VC), such as VoiceGrad, has attracted interest owing to its high VC performance. However, a notable limitation is slow inference caused by a multi-step reverse diffusion process. To overcome this limitation, we propose FastVoiceGrad, a novel one-step diffusion-based VC, which reduces the number of iterations from dozens to one while achieving VC performance superior to or comparable to that of the multi-step diffusion-based VC.
- ◆PRVAE-VC2: Non-Parallel Voice Conversion by Distillation of Speech Representations
- Kou Tanaka (CS), Hirokazu Kameoka (CS), Takuhiro Kaneko (CS), Yuto Kondo (CS)
- Voice conversion using self-supervised speech representation learning has garnered significant interest in recent years. In this study, we focus on the fact that a speech representation learning method based on discretization (HuBERT) and our recently proposed perturbation-resistant method (PRVAE-VC) compress information along different axes. We demonstrate that voice conversion performance can be effectively enhanced by combining these complementary information compression methods in an additive manner.
- ◆Knowledge Distillation from Self-Supervised Representation Learning Model with Discrete Speech Units for Any-to-Any Streaming Voice Conversion
- Hiroki Kanagawa (HI), Yusuke Ijima (HI)
- We proposed streamable voice conversion (VC) via knowledge distillation from a speech self-supervised learning (SSL) model that was originally designed for offline operation (a generic sketch of the distillation loss follows below). Thanks to the robustness inherited from the SSL model, our streaming VC achieves performance comparable to that of offline VC for both in-domain and out-of-domain speakers.
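Feature-level knowledge distillation of this kind typically trains the streaming student to reproduce the offline teacher's representations frame by frame. The loss below is a generic sketch with assumed weights; the teacher model, layer selection, and exact objective used in the paper are not reproduced.

```python
# Generic feature-level distillation loss from an offline SSL teacher to a
# streaming student (a sketch; the actual teacher, layer selection, and loss
# weighting in the paper are not reproduced).
import torch
import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats, alpha=1.0, beta=1.0):
    """student_feats, teacher_feats: (batch, frames, dim), time-aligned."""
    l1 = F.l1_loss(student_feats, teacher_feats)
    cos = 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()
    return alpha * l1 + beta * cos

s = torch.randn(2, 100, 768)          # causal (streaming) student features
t = torch.randn(2, 100, 768)          # offline SSL teacher features
print(distillation_loss(s, t))
```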
- ◆Pre-training Neural Transducer-based Streaming Voice Conversion for Faster Convergence and Alignment-free Training
- Hiroki Kanagawa (HI), Takafumi Moriya (HI/CS), Yusuke Ijima (HI)
- For faster training of VC-T, a collapse-robust streaming voice conversion model, we introduced a pre-training stage that aligns speech lengths between the source and target speakers. Our approach reduces the total training time by one-third while obtaining naturalness comparable to that of the conventional VC-T. The suitable initial values obtained in the pre-training stage also eliminate the need for alignment labels, which were previously essential for stabilizing training.