
August 10, 2023


NTT's 19 papers accepted for INTERSPEECH2023, the world's largest international conference on spoken language processing

Nineteen papers (Table 1) authored by NTT Laboratories have been accepted at INTERSPEECH2023 (the 24th INTERSPEECH Conference), the world's largest international conference on spoken language processing, to be held in Dublin, Ireland, from August 20 to 24, 2023. (Affiliations are as of the time of submission.)

Abbreviated names of the laboratories:
CS: NTT Communication Science Laboratories
HI: NTT Human Informatics Laboratories
CD: NTT Computer and Data Science Laboratories
SIC: NTT Software Innovation Center

Table 1 Number of accepted papers for each research area

Research area                              # Papers
Representation learning                    2
Speech recognition                         5
Speech summarization                       1
Speaker diarization/Conversation analysis  2
Speech enhancement                         3
Speech synthesis/Voice conversion          3
Speech perception                          2
Speaker age estimation                     1

■ Representation learning

● Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

Daisuke Niizumi (CS), Daiki Takeuchi (CS), Yasunori Ohishi (CS), Noboru Harada (CS), Kunio Kashino (CS)

We proposed the Masked Modeling Duo at ICASSP2023 in June 2023, demonstrating its ability to learn effective general-purpose audio representations. This study shows that it achieves state-of-the-art performance even when specialized to the highly competitive speech domain, suggesting its potential contribution to various future specialized applications.

● SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

Takanori Ashihara (HI), Takafumi Moriya (HI/CS), Kohei Matsuura (HI), Tomohiro Tanaka (CD), Yusuke Ijima (HI), Taichi Asami (HI), Marc Delcroix (CS), Yukinori Homma (HI)

We explored how well self-supervised speech models capture linguistic knowledge by probing them with natural language understanding tasks. The results show that the self-supervised models outperform the baseline, indicating that they capture some general linguistic knowledge.

■ Speech recognition

● End-to-End Joint Target and Non-Target Speakers ASR

Ryo Masumura (CD), Saki Mizuno (CD), Naoki Makishima (CD), Mana Ihori (CD), Mihiro Uchida (CD), Hiroshi Sato (HI), Tomohiro Tanaka (CD), Satoshi Suzuki (CD), Akihiko Takashima (CD), Shota Orihashi (CD), Takafumi Moriya (HI/CS), Nobukatsu Hojo (CD), Atsushi Ando (HI/CD), Yoshihiro Yamazaki (CD), Taiga Yamane (CD)

We propose a novel automatic speech recognition system that transcribes each speaker's speech from multi-talker overlapped speech while identifying whether they are target or non-target speakers. Our system recursively generates both textual tokens and tokens indicating target or non-target speakers in an end-to-end manner.

● Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Speech Time Estimation

Naoki Makishima (CD), Keita Suzuki (CD), Satoshi Suzuki (CD), Atsushi Ando (HI/CD), Ryo Masumura (CD)

We propose joint modeling of multi-talker automatic speech recognition (ASR) and utterance-level timestamp prediction. By treating timestamp prediction as a classification problem over quantized timestamp tokens, the proposed method solves multi-talker ASR and utterance-level timestamp prediction jointly with the same simple modeling, improving the estimation accuracy of both.
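The quantized-timestamp-token idea can be illustrated with a minimal sketch. The resolution, time range, and helper names here are illustrative assumptions, not the paper's actual settings:

```python
# Hypothetical sketch: quantizing continuous utterance timestamps into
# discrete tokens, so timestamp prediction becomes a classification problem
# over a small token vocabulary. RESOLUTION_S and MAX_TIME_S are assumptions.

RESOLUTION_S = 0.1   # assumed quantization step (100 ms)
MAX_TIME_S = 30.0    # assumed maximum time covered by the token vocabulary

def timestamp_to_token(t: float) -> int:
    """Map a continuous time in seconds to a discrete timestamp-token id."""
    t = min(max(t, 0.0), MAX_TIME_S)  # clamp into the supported range
    return round(t / RESOLUTION_S)

def token_to_timestamp(token_id: int) -> float:
    """Map a timestamp-token id back to the time it represents."""
    return token_id * RESOLUTION_S
```

A decoder could then emit timestamp tokens in the same autoregressive stream as text tokens, e.g. `[<t=12>, "hello", "world", <t=58>]`, letting one model handle both tasks.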

● Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data

Takafumi Moriya (HI/CS), Hiroshi Sato (HI), Tsubasa Ochiai (CS), Marc Delcroix (CS), Takanori Ashihara (HI), Kohei Matsuura (HI), Tomohiro Tanaka (CD), Ryo Masumura (CD), Atsunori Ogawa (CS), Taichi Asami (HI)

For training target-speaker automatic speech recognition (TS-ASR) models, which transcribe only the target speaker's voice from audio in which multiple speakers overlap, we proposed a method that exploits the target speaker's pre-mixture single-talker speech, which conventional training did not use. Experiments confirmed a further improvement in the recognition performance of the TS-ASR model.

● Transcribing Speech as Spoken and Written Dual Text Using an Autoregressive Model

Mana Ihori (CD), Hiroshi Sato (HI), Tomohiro Tanaka (CD), Ryo Masumura (CD), Saki Mizuno (CD), Nobukatsu Hojo (CD)

We propose a method that generates spoken and written dual text from input speech. By generating the joint spoken and written text autoregressively, the proposed method can produce the written text using information from both the input speech and the spoken text, which improves written-text generation performance.

● miniStreamer: Enhancing Small Conformer with Chunked-Context Masking for Streaming Applications on the Edge

Haris Gulzar (SIC), Monikka Busto (SIC), Takeharu Eda (SIC), Katsutoshi Itoyama (Tokyo Institute of Technology), Kazuhiro Nakadai (Tokyo Institute of Technology)

Streaming speech recognition requires real-time processing. In conventional methods, the computational cost grows quadratically with utterance length. The proposed method attends over a fixed context, bounding the computational cost to a constant value and also reducing latency on low-power edge devices.
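The fixed-context idea behind chunked masking can be sketched as an attention mask in which each frame sees only its own chunk plus a limited left context. The chunk size and left-context width below are illustrative assumptions, not the paper's configuration:

```python
# Hypothetical sketch of chunked-context masking: frame i may attend only to
# frames in its own chunk and up to `left_chunks` previous chunks, so the
# number of attended frames per query is bounded by (left_chunks + 1) * chunk,
# a constant independent of utterance length.
import numpy as np

def chunked_context_mask(n_frames: int, chunk: int = 4, left_chunks: int = 1) -> np.ndarray:
    """Boolean mask: mask[i, j] is True iff frame i may attend to frame j."""
    ci = np.arange(n_frames) // chunk            # chunk index of each frame
    qi, kj = np.meshgrid(ci, ci, indexing="ij")  # query/key chunk indices
    # allow the same chunk and up to `left_chunks` chunks to the left
    return (kj <= qi) & (kj >= qi - left_chunks)

mask = chunked_context_mask(8, chunk=2, left_chunks=1)
# Frame 5 (chunk 2) may attend to chunks 1-2 only, i.e. frames 2-5.
```

With this mask applied in self-attention, per-frame cost no longer grows with the full utterance, which is what makes constant-cost streaming on edge devices plausible.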

■ Speech summarization

● Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization

Kohei Matsuura (HI), Takanori Ashihara (HI), Takafumi Moriya (HI/CS), Tomohiro Tanaka (CD), Takatomo Kano (CS), Atsunori Ogawa (CS), Marc Delcroix (CS)

We proposed a method that integrates a pre-trained language model into an end-to-end speech summarization (E2E SSum) model via transfer learning, and confirmed improvements in summary quality and accuracy on the How2 dataset. This study moves the E2E SSum model closer to practical application.

■ Speaker diarization/Conversation analysis

● Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization

Marc Delcroix (CS), Mireia Diez (BUT), Federico Landini (BUT), Anna Silnova (BUT), Atsunori Ogawa (CS), Tomohiro Nakatani (CS), Lukas Burget (BUT), Shoko Araki (CS)

(BUT: Brno University of Technology)
This paper introduces a new approach to clustering speaker embeddings obtained from end-to-end speaker diarization with vector clustering (EEND-VC). The proposed approach, called multi-stream VBx (MS-VBx), extends the classical VBx clustering to handle multiple speakers per speech chunk. MS-VBx improves the performance of a strong EEND-VC baseline on three widely used diarization datasets.

● Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer

Nobukatsu Hojo (CD), Saki Mizuno (CD), Satoshi Kobashikawa (CD), Ryo Masumura (CD), Mana Ihori (CD), Hiroshi Sato (HI), Tomohiro Tanaka (CD)

To support communication skill training, we proposed a method to estimate the existence of preferable behaviors of a speaker in a conversational video. The proposed Transformer model is trained focusing on the relationships between the synchronized time steps across the input modalities, leading to improved estimation accuracy.

■ Speech enhancement

● Target Speech Extraction with Conditional Diffusion Model

Naoyuki Kamo (CS), Marc Delcroix (CS), Tomohiro Nakatani (CS)

We proposed target speech extraction (TSE) from mixed speech based on a diffusion model. We realized TSE using a diffusion model conditioned on the observed speech and the enrollment speech. In experiments, our method outperformed the conventional TSE model in terms of SDR, ESTOI, and PESQ.

● Downstream Task Agnostic Speech Enhancement Conditioned on Self-Supervised Representation Loss

Hiroshi Sato (HI), Ryo Masumura (CD), Tsubasa Ochiai (CS), Marc Delcroix (CS), Takafumi Moriya (HI/CS), Takanori Ashihara (HI), Kentaro Shinayama (HI), Mana Ihori (CD), Saki Mizuno (CD), Tomohiro Tanaka (CD), Nobukatsu Hojo (CD)

To build a speech enhancement model usable universally across various speech tasks, we proposed training the enhancement model so that the enhanced signal approaches the clean signal in terms of the output of a self-supervised learning (SSL) speech model. We showed that the proposed method significantly improves performance on Noisy SUPERB, an extension of the SSL model evaluation benchmark to noisy environments.
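The training objective can be sketched as a loss computed in the feature space of a frozen SSL model rather than (only) in the waveform domain. `FrozenSSLEncoder` below is a stand-in random-projection feature extractor, an assumption for illustration; a real system would use an actual pretrained SSL speech model:

```python
# Hypothetical sketch: push the enhanced signal toward the clean signal in
# the feature space of a frozen self-supervised (SSL) model. The encoder here
# is a stand-in, not a real pretrained SSL model.
import numpy as np

class FrozenSSLEncoder:
    """Stand-in for a frozen pretrained SSL feature extractor."""
    def __init__(self, dim: int = 8, frame: int = 160, seed: int = 0):
        self.frame = frame
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((dim, frame))  # fixed random projection

    def __call__(self, wav: np.ndarray) -> np.ndarray:
        n = len(wav) // self.frame
        frames = wav[: n * self.frame].reshape(n, self.frame)
        return frames @ self.proj.T  # (n_frames, dim) feature sequence

def ssl_representation_loss(encoder, enhanced: np.ndarray, clean: np.ndarray) -> float:
    """Mean absolute distance between SSL features of enhanced and clean speech."""
    return float(np.mean(np.abs(encoder(enhanced) - encoder(clean))))

enc = FrozenSSLEncoder()
clean = np.sin(np.linspace(0.0, 100.0, 1600))
noisy = clean + 0.05 * np.random.default_rng(1).standard_normal(1600)
loss = ssl_representation_loss(enc, noisy, clean)  # > 0; 0 for identical signals
```

Minimizing such a loss ties the enhancement front-end to the representation that downstream SSL-based tasks actually consume, which is the motivation for a task-agnostic objective.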

● Impact of Residual Noise and Artifacts in Speech Enhancement Errors on Intelligibility of Human and Machine

Shoko Araki (CS), Ayako Yamamoto (Wakayama University), Tsubasa Ochiai (CS), Kenichi Arai (CS), Atsunori Ogawa (CS), Tomohiro Nakatani (CS), Toshio Irino (Wakayama University)

It has previously been shown that the major cause of degraded automatic speech recognition performance on speech enhanced by single-channel speech enhancement is non-linear distortion of the speech rather than residual noise. This study investigates and reports how these error factors affect human intelligibility.

■ Speech synthesis/Voice conversion

● iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN

Takuhiro Kaneko (CS), Hirokazu Kameoka (CS), Kou Tanaka (CS), Shogo Seki (CS)

We propose iSTFTNet2, an improved variant of iSTFTNet with a 1D-2D CNN that employs 1D and 2D CNNs to model temporal and spectrogram structures, respectively. This modification facilitates the reduction of the neural temporal upsampling. The experimental results demonstrated that iSTFTNet2 made iSTFTNet faster and more lightweight with comparable speech quality.

● CFVC: Conditional Filtering for Controllable Voice Conversion

Kou Tanaka (CS), Takuhiro Kaneko (CS), Hirokazu Kameoka (CS), Shogo Seki (CS)

We propose CFVC, a many-to-many voice conversion model that filters the speaker vector to control high-level attributes such as speaking rate while preserving voice timbre. The challenge is to train the disentangled speaker representations with no/few annotation data. The experimental results showed that our method disentangled complex attributes without annotation and separately controlled speaking rate and voice timbre.

● VC-T: Streaming Voice Conversion Based on Neural Transducer

Hiroki Kanagawa (HI), Takafumi Moriya (HI/CS), Yusuke Ijima (HI)

For practical voice conversion (VC), we incorporated the neural transducer used in speech recognition (RNN-T) into a VC model for the first time. By exploiting the RNN-T's ability to use both the input and previously generated speech to predict the upcoming output, we achieved both 1) overcoming the collapse of linguistic information that afflicted conventional seq2seq-based VC, and 2) stable streaming operation.

■ Speech perception

● A stimulus-organism-response model of willingness to buy from advertising speech using voice quality

Mizuki Nagano (HI), Yusuke Ijima (HI), Sadao Hiroya (CS)

In-store stimuli (e.g., background music) influence consumers' emotions, and those emotions enhance their willingness to buy. However, it was unclear how the impression of advertising speech influences willingness to buy. This study revealed that advertising speech with warmth or brightness enhances listeners' willingness to buy.

● Influence of Personal Traits on Impressions of One's Own Voice

Hikaru Yanagida (HI), Yusuke Ijima (HI), Naohiro Tawara (CS)

To clarify the impression of one's own recorded voice, we conducted a large-scale subjective evaluation experiment and analyzed the relationship between impressions (e.g., attractiveness, familiarity) of one's own voice and personal traits (e.g., age, gender, personality traits). The results revealed that people who frequently listen to their own recorded voice rate it as more attractive and familiar.

■ Speaker age estimation

● What are differences? Comparing DNN and human by their performance and characteristics in speaker age estimation

Yuki Kitagishi (HI), Naohiro Tawara (CS), Atsunori Ogawa (CS), Ryo Masumura (CD), Taichi Asami (HI)

We compared a state-of-the-art DNN model and human listeners in terms of their performance and characteristics in speaker age estimation. We revealed that 1) the model yields estimation performance comparable to the human listeners, 2) the model's performance is more sensitive to short speech inputs and to mismatches between training and testing acoustic conditions, and 3) the speaker's gender and some acoustic features negatively affect the human listeners' performance.

Information is current as of the date of issue of the individual topics.
Please be advised that information may be outdated after that point.