February 19, 2025

Change Your Voice. Change Your Accent. Speak With Confidence!

NTT has recently developed an advanced generative AI technology for real-time voice conversion: a system able to instantly change both voice and speaking style. Combining high sound quality with low latency, the technology is a potential solution to long-standing obstacles to practical, high-performance voice modification. Through the use of deep learning, NTT's voice conversion system delivers clarity, quick response times, and adaptability to various kinds of voice communication scenarios.

Buffering Not Necessary

The system's core feature is the way it can extract key voice characteristics with minimal speaker dependency. Up until now, voice conversion systems have suffered from delays caused by the need to buffer incoming audio so that future speech frames can be predicted and integrated. The latency this buffering adds can disrupt the natural flow of communication, especially in real-time applications like video calls or live-streaming sessions, where delays can make interactions feel jarring. NTT overcomes this by using a causal model that processes only the speaker's current and past speech frames, without requiring future data. That enables almost instantaneous, high-quality voice conversion. The feature is especially useful for maintaining smooth, uninterrupted communication during interactive sessions.
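To make the difference between causal and buffered processing concrete, here is a minimal sketch in Python. It is not NTT's model: a simple weighted sum over frames stands in for the neural network, and the function names, the sample values, and the 10 ms frame hop are all assumptions chosen purely for illustration. The point it shows is that a causal filter can emit each output as soon as the current frame arrives, while a filter with lookahead has to wait for future frames.

```python
from typing import List, Sequence

FRAME_HOP_MS = 10  # hypothetical frame hop, for illustration only


def causal_outputs(frames: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Output t reads only frames[t], frames[t-1], ... (weights[0] is the current frame)."""
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(weights):
            if t - k >= 0:  # past or current frame only
                acc += w * frames[t - k]
        out.append(acc)
    return out


def lookahead_outputs(
    frames: Sequence[float], weights: Sequence[float], lookahead: int
) -> List[float]:
    """Same weighted sum, shifted so output t also reads up to `lookahead` future frames."""
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t + lookahead - k
            if 0 <= idx < len(frames):  # frames not yet received contribute nothing
                acc += w * frames[idx]
        out.append(acc)
    return out


def algorithmic_delay_ms(lookahead: int) -> int:
    """Minimum wait before output t can be produced, in milliseconds."""
    return lookahead * FRAME_HOP_MS


frames = [0.1, 0.4, 0.3, 0.9, 0.5]  # made-up per-frame feature values
weights = [0.5, 0.3, 0.2]           # made-up filter weights

# Causal: the first three outputs are the same whether or not later frames have arrived.
print(causal_outputs(frames, weights)[:3] == causal_outputs(frames[:3], weights))

# Lookahead: output 2 changes once frame 3 arrives, so it cannot be emitted early.
print(lookahead_outputs(frames, weights, 1)[2] != lookahead_outputs(frames[:3], weights, 1)[2])
```

In this toy setting, a lookahead of three frames would add at least `algorithmic_delay_ms(3)` milliseconds before any output could be produced, while the causal version adds none beyond the computation itself.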

It's Your Voice... But Better

Beyond words alone, a speaker's voice conveys subtle cues—intonation, rhythm, and emotional tone—which affect how messages are received. People would sometimes like to communicate in ways that make their speech clearer, more persuasive, or less affected by nervousness. Until now, however, making nuanced modifications using technology has been difficult.

But times are changing. NTT's AI model can adjust voice characteristics like tone, pitch, and tempo, helping users to sound more confident and fluent, or even to adapt their accents to approximate those of native speakers. And that offers very exciting possibilities. In a customer service setting, for instance, it could clarify an agent's intonation to make responses easier to understand. Meanwhile, in educational contexts, non-native speakers could use the system to communicate more fluently, enhancing both in-person and virtual learning experiences.

Natural-Sounding Voices

The flexibility comes from NTT's deep learning model, which has been engineered to adapt without requiring large, expensive datasets of paired speaker data. Rather than depending on pre-collected voice features, the system learns each new feature independently. This opens the door to broader applications, since the technology can convert voices from only a single speaker's samples. What's more, NTT incorporates its proprietary waveform synthesis technology, which generates natural-sounding voice waveforms from modified speech features, adding a layer of realism to the converted voices.

The Results Are In

Listening tests have already shown that the technology offers big improvements in both sound quality and speaker similarity—in other words, how well the conversion process matches the vocal characteristics, such as tone, pitch, and timbre, of the target speaker. Simply put, it delivered clear, lifelike voice conversions that mirrored the intended speaker's characteristics.

Clarity, Fluency, Privacy

Along with potential applications in customer service and education, the voice conversion technology could also be used in healthcare to assist individuals with voice disorders, adjusting tone and pitch for more comfortable communication. Public speakers might use it to improve their vocal clarity or reduce any tremors caused by being nervous, allowing for a smoother, more confident delivery. For people concerned about privacy, voice conversion could mask their identity while keeping voice quality high, which would be valuable in sensitive, anonymous contexts. Additionally, it could support enhanced communication in web conferencing, VR interactions, and other remote communication scenarios where clear, adaptable voice representation is important.

Innovating Communication

What's next? NTT plans to improve the technology's robustness, making it more resistant to background noise and more stable in real-world conditions. The company also aims to strengthen safeguards against misuse, such as impersonation, ensuring that users can enjoy the technology without compromising security. Once perfected, NTT's real-time voice conversion technology has the potential to empower users to communicate more freely, choosing their vocal style and persona regardless of language or cultural differences.

For further information, please see this link:
https://group.ntt/en/newsrelease/2024/06/17/240617a.html

NTT—Innovating the Future

Daniel O'Connor joined the NTT Group in 1999 when he began work as the Public Relations Manager of NTT Europe. While in London, he liaised with the local press, created the company's intranet site, wrote technical copy for industry magazines and managed exhibition stands from initial design to finished displays.

Later seconded to the headquarters of NTT Communications in Tokyo, he contributed to the company's first global telecoms award wins and to the digitalisation of internal company information exchange.

Since 2015 Daniel has created content for the Group's Global Leadership Institute and the One NTT Network, and he is currently working with NTT R&D teams to grow public understanding of the cutting-edge research undertaken by the NTT Group.