April 15, 2024


20 papers authored by NTT Laboratories have been accepted for publication at ICASSP 2024 (2024 IEEE International Conference on Acoustics, Speech, and Signal Processing), the flagship conference on signal processing technology, to be held in Seoul, Korea from April 14 to 19, 2024. In addition, we present demonstrations at the Show and Tell sessions at the conference. (Affiliations are at the time of submission.)

Abbreviated names of the laboratories:
CS: NTT Communication Science Laboratories
HI: NTT Human Informatics Laboratories
CD: NTT Computer and Data Science Laboratories

■Sparse Regularization Based on Reverse Ordered Weighted L1-Norm and Its Application to Edge-Preserving Smoothing

Takayuki Sasaki (CD), Yukihiro Bandoh (CD), Masaki Kitahara (CD)
 Sparse regularization approaches are widely used for various problems where signal value estimation is difficult, such as noise removal and super-resolution. However, conventional approaches have faced the challenge of edge smoothing and gradation loss when applied to images. In this study, we propose a novel sparse regularization function (ROWL) and verify that it can simultaneously suppress these image quality degradations.

■Online Target Sound Extraction with Knowledge Distillation from Partially Non-Causal Teacher

Keigo Wakayama (CD), Tsubasa Ochiai (CS), Marc Delcroix (CS), Masahiro Yasuda (CD/CS), Shoichiro Saito (CD), Shoko Araki (CS), Akira Nakayama (CD)
 To mitigate the performance degradation caused by replacing a non-causal model with a causal one, we proposed an online target sound extraction (TSE) method that uses knowledge distillation (KD) from a non-causal or partially non-causal teacher. Experiments on a dataset confirmed the effectiveness of the proposed KD scheme.

■6DoF SELD: Sound Event Localization and Detection Using Microphones and Tracking Sensors on Self-Motioning Human

Masahiro Yasuda (CD/CS), Shoichiro Saito (CD), Akira Nakayama (CD), Noboru Harada (CS)
 This study designed a new 6DoF SELD task, in which sound events are classified and localized using microphones worn by a person who is moving, and recorded and published a real dataset for it. Furthermore, to address the performance degradation caused by self-motion, we propose a mechanism that excites valid acoustic features using head motion information acquired by tracking sensors as cues.

■On the Equivalence of Dynamic Mode Decomposition and Complex Nonnegative Matrix Factorization

Masahiro Kohjima (HI)
 This study clarifies the theoretical relationship between two methods for time series analysis that have different origins: dynamic mode decomposition (DMD), commonly applied in fluid analysis, and complex nonnegative matrix factorization (Complex NMF), typically used in acoustic signal processing. Our theoretical analysis provides insight into understanding each method's principles, advantages, and limitations, thereby enriching the framework for time series analysis.

■StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-Supervised Learning Models

Kazuki Yamauchi (Univ. Tokyo), Yusuke Ijima (HI), Yuki Saito (Univ. Tokyo)
 This paper proposed a new task, "speaking-style captioning," and a new method, "StyleCap," for describing speaking styles in natural language instead of identifying predefined classes. The proposed method consists of a speech encoder and a large language model (LLM). Experimental results show that the proposed method can accurately describe speaking-style information in natural language.

■What Do Self-Supervised Speech and Speaker Models Learn? New Findings from a Cross Model Layer-Wise Analysis

Takanori Ashihara (HI), Marc Delcroix (CS), Takafumi Moriya (HI/CS), Kohei Matsuura (HI), Taichi Asami (HI), Yusuke Ijima (HI)
 To improve speaker embeddings used in various tasks, such as speaker verification and target-speaker speech recognition, we analyzed the representations acquired by utterance-level self-supervised learning (SSL). The analysis revealed the relative strengths and weaknesses of utterance-level SSL, frame-level SSL, and supervised speaker models.

■Noise-Robust Zero-Shot Text-to-Speech Synthesis Conditioned on Self-Supervised Speech-Representation Model with Adapters

Kenichi Fujita (HI), Hiroshi Sato (HI), Takanori Ashihara (HI), Hiroki Kanagawa (HI), Marc Delcroix (CS), Takafumi Moriya (HI/CS), Yusuke Ijima (HI)
 We have proposed a text-to-speech (TTS) technique that reproduces speaker characteristics very accurately from a few seconds of a desired speaker's utterance containing noise. This technique enables us to achieve personalized TTS for patients with speech disorders using noisy recordings made before patients lose their voice.

■Talking Face Generation for Impression Conversion Considering Speech Semantics

Saki Mizuno (CD), Nobukatsu Hojo (CD), Kazutoshi Shinoda (CD), Keita Suzuki (CD), Mana Ihori (CD), Hiroshi Sato (HI), Tomohiro Tanaka (CD), Naotaka Kawata (CD), Satoshi Kobashikawa (CD), Ryo Masumura (CD)
 We proposed a talking face generation method to convert the impression of a speaker's video. The impression conversion needs to consider the semantics of the input speech because they influence the impression of the speaker's video along with the facial expression. Therefore, we proposed a novel model that can consider the speech semantics of the input video in addition to the video features and confirmed its effectiveness.

●NTT Speaker Diarization System for CHiME-7: Multi-Domain, Multi-Microphone End-to-End and Vector Clustering Diarization

Naohiro Tawara (CS), Marc Delcroix (CS), Atsushi Ando (HI), Atsunori Ogawa (CS)
 We proposed a robust speaker diarization system capable of handling variations in microphone arrays, recording environments, and speaking styles. The proposed system combined speaker segmentation based on an end-to-end deep learning model, channel integration using result voting, and self-supervised adaptation. The proposed system was implemented as the frontend of the NTT system for the Distant Automatic Speech Recognition (DASR) task in the CHiME-7 challenge, achieving top rankings.

●Discriminative Training of VBx Diarization

Dominik Klement (BUT), Mireia Diez (BUT), Federico Landini (BUT), Lukas Burget (BUT), Anna Silnova (BUT), Marc Delcroix (CS), Naohiro Tawara (CS)
 VBx has been a widely adopted baseline for speaker diarization, the technology that estimates who speaks when in a multi-talker recording. It uses Bayesian inference to cluster speaker embedding vectors, which allows a speaker identity to be associated with each portion of the speech. This paper presents a new framework for updating the VBx parameters using discriminative training, which greatly simplifies the hyper-parameter search of VBx.

●Target Speech Extraction with Pre-Trained Self-Supervised Learning Models

Junyi Peng (BUT), Marc Delcroix (CS), Tsubasa Ochiai (CS), Oldrich Plchot (BUT), Shoko Araki (CS), Jan Cernocky (BUT)
 Target speech extraction (TSE) consists of separating the speech of a target speaker from other speakers' voices in a multi-speaker mixture, using a short pre-recorded enrollment utterance to identify that speaker. In this work, we explore the use of pre-trained self-supervised learning (SSL) models widely used in the speech community to boost the performance of TSE. In particular, we show that the proposed SSL-based TSE system can greatly reduce speaker confusion in the extraction process.

●Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing

William Chen (CMU), Takatomo Kano (CS), Atsunori Ogawa (CS), Marc Delcroix (CS), Shinji Watanabe (CMU)
 The quadratic memory complexity of self-attention has generally restricted Transformer-based models to utterance-based speech processing, preventing models from leveraging long-form contexts. This paper explores the developments in efficient attention mechanisms, such as Flash Attention, and simpler alternatives to allow the encoding of entire documents at once. We also propose a new attention-free self-supervised model, LongHuBERT, capable of handling long inputs. We show that exploiting long context during training can benefit document-level ASR and speech summarization tasks.

●Neural Network-Based Virtual Microphone Estimation with Virtual Microphone and Beamformer-Level Multi-Task Loss

Hanako Segawa (Tsukuba Univ.), Tsubasa Ochiai (CS), Marc Delcroix (CS), Tomohiro Nakatani (CS), Rintaro Ikeshita (CS), Shoko Araki (CS), Takeshi Yamada (Tsukuba Univ.), Shoji Makino (Tsukuba Univ.)
 Virtual microphone estimation is a technique that estimates unobserved microphone signals from actually observed microphone signals, virtually increasing the number of microphones. In this paper, we introduce a novel multi-task training objective that adopts an array-processing-level loss, and we succeeded in generating virtual microphone signals better suited to the array processing back-end.

●How Does End-to-End Speech Recognition Training Impact Speech Enhancement Artifacts?

Kazuma Iwamoto (Doshisha Univ.), Tsubasa Ochiai (CS), Marc Delcroix (CS), Rintaro Ikeshita (CS), Hiroshi Sato (HI), Shoko Araki (CS), Shigeru Katagiri (Doshisha Univ.)
 In the distant automatic speech recognition (ASR) community, the impact of speech enhancement (SE) estimation errors on ASR performance has not been fully explored. In this study, we showed for the first time that end-to-end optimization of SE systems based on ASR criteria reduces artifact errors, one of the components obtained by decomposing the total SE error.

●Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator

Takuhiro Kaneko (CS), Hirokazu Kameoka (CS), Kou Tanaka (CS)
 A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis. However, this data-driven model requires a large amount of data for training. Data augmentation is a promising solution; however, a previous discriminator, agnostic to the augmentation state, may consider augmented speech as the desired real speech. To address this issue, we propose an augmentation-conditional discriminator (AugCondD) that assesses speech conditioned on the augmentation state.

●Selecting N-Lowest Scores for Training MOS Prediction Models

Yuto Kondo (CS), Hirokazu Kameoka (CS), Kou Tanaka (CS), Takuhiro Kaneko (CS)
 In recent years, there has been a surge in the development of deep learning models that predict the Mean Opinion Score (MOS), the average of survey ratings of audio quality, in order to automate the evaluation of speech synthesis systems. This study proposes hypotheses on labeling tendencies in such ratings. An analysis of audio quality score datasets supports the hypotheses and suggests an alternative representative measure of audio quality to replace MOS.
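As a rough illustration of the quantities involved, the sketch below computes a MOS as the plain mean of listener ratings and, for comparison, the mean of the N lowest ratings. The second function is only an assumption based on the paper's title; the function names, the ratings, and the exact definition of the alternative measure are hypothetical and may differ from what the authors propose.

```python
from statistics import mean


def mos(scores):
    """Mean Opinion Score: the arithmetic mean of all listener ratings."""
    return mean(scores)


def n_lowest_mean(scores, n):
    """Mean of the n lowest ratings.

    Illustrative alternative measure only (an assumption drawn from the
    paper title, not the authors' confirmed definition).
    """
    if not 1 <= n <= len(scores):
        raise ValueError("n must be between 1 and the number of scores")
    return mean(sorted(scores)[:n])


# Hypothetical ratings of one utterance on a 1-5 scale.
ratings = [5, 4, 4, 3, 2]
print(mos(ratings))               # mean of all five ratings
print(n_lowest_mean(ratings, 2))  # mean of the two lowest ratings
```

With these made-up ratings, the two measures differ noticeably, which is the kind of gap a choice of representative score would have to account for.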

●Unrestricted Global-Phase-Bias Aware Single-channel Speech Enhancement with Conformer-Based Metric GAN

Shiqi Zhang (Waseda Univ.), Qui Zheng (Waseda Univ.), Daiki Takeuchi (CS), Noboru Harada (CS), Shoji Makino (Waseda Univ.)
 Recently, deep learning methods have been widely used for speech enhancement that extracts speech from noise-mixed sounds. In this study, we proposed a loss function focusing on the phase difference of the audio waveform without restricting the global phase bias that humans cannot perceive. Experiments showed that our method improves objective evaluation metrics based on human perception.

●Sunflower Strategy for Bayesian Relational Data Analysis

Masahiro Nakano (CS), Ryohei Shibue (CS), Kunio Kashino (CS)
 Many types of real-world data are represented in the form of matrices, such as gene expression, user purchase logs, and network adjacencies. Discovering cluster structures in such matrix-type data is an important technique widely used for various signal processing and machine learning problems. This paper proposes a technique for inferring non-overlapping rectangle clusters of the entire data by using the cluster structure from separate views of the data in the row and column directions.

Show and Tell presentations are listed below.

●MeetEval, Show Me the Errors! Interactive Visualization of Transcript Alignments for the Analysis of Conversational ASR

Thilo von Neumann (Paderborn University), Christoph Boeddeker (Paderborn University), Marc Delcroix (CS), Reinhold Haeb-Umbach (Paderborn University)
 We present a demonstration of a new tool to visualize and analyze errors made by conversational ASR systems. The tool displays the alignment between actual and estimated transcripts for multiple speakers and long recordings. It highlights and summarizes different types of errors, such as word insertions, deletions, and substitutions, making it easier to locate areas with high error density.
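The error types mentioned above come from aligning a reference transcript with a recognized one. The following is a minimal sketch of that idea for a single speaker, using a standard word-level edit-distance alignment. It is not the tool's implementation; the function name `align_errors` and the example sentences are hypothetical.

```python
def align_errors(reference, hypothesis):
    """Count word substitutions, deletions, and insertions between a
    reference transcript and a hypothesis via Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()

    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = i
    for j in range(1, len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost,  # match or substitution
                dp[i - 1][j] + 1,         # deletion (reference word missing)
                dp[i][j - 1] + 1,         # insertion (extra hypothesis word)
            )

    # Walk back through the table to label each edit.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + int(mismatch):
            if mismatch:
                counts["substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


print(align_errors("please call me now", "please tell me"))
```

A tool for conversational ASR additionally has to handle multiple speakers and long recordings, which this single-sequence sketch does not attempt.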

●Target Speech Spotting and Extraction Based on ConceptBeam

Yasunori Ohishi (CS), Marc Delcroix (CS), Tsubasa Ochiai (CS), Shoko Araki (CS), Daiki Takeuchi (CS), Daisuke Niizumi (CS), Akisato Kimura (CS), Noboru Harada (CS), Kunio Kashino (CS)
 A target speech spotting and extraction technique called ConceptBeam is demonstrated. To the best of our knowledge, this is the first semantic sound source separation technique that can extract, from a multi-speaker mixture, the speech signals that match a target concept or topic of interest, which a user specifies through spoken words, images, or combinations thereof.

Information is current as of the date of issue of the individual topics.
Please be advised that information may be outdated after that point.