
January 17, 2024

NTT Corporation

Individuality Reproduction Dialogue Technology for efficient reproduction of an individual's speech using LLMs
Digital alter egos can be generated at low cost using NTT's LLM "tsuzumi"

Tokyo - Jan. 17, 2024 - NTT Corporation (NTT) is pursuing research and development of Another Me®, an AI agent that acts like a person and shares experiences with the user, with the aim of realizing personal growth through activities and exchanges in the digital space that transcend the constraints of the physical world. Digital Twin Computing (DTC) [1] is one of the pillars of the IOWN concept [2]. As part of this work, and as an extension of NTT's large language model (LLM) "tsuzumi" [3], we have developed a dialogue technology that reproduces a person's individuality by generating dialogue that reflects the characteristics of the individual's tone and speech content from a small amount of dialogue data. We have also developed Zero/Few-shot Speech Synthesis Technology, which synthesizes speech that reflects an individual's voice tone from a small amount of speech data. Traditionally, a large amount of data about an individual was required to learn and reproduce individual characteristics. Because these technologies work from a small amount of data, many people can easily have their own alter ego in the digital space. We will work toward the practical application of these research results, including a public demonstration of a digital alter ego that can communicate with people and engage in community activities on behalf of the user.

1. Background

While the digitalization of society and the development of AI technology have made lifestyles more efficient, it has also been pointed out that the diversity of individuals and society may be undermined by excessive reliance on AI, such as general-purpose AI, that provides uniform answers to all kinds of problems. Under the IOWN concept, NTT aims to realize a society in which each person can naturally express their diverse individuality, and has been conducting research and development of NTT's LLM "tsuzumi" based on the policy of ensuring diversity through the collective knowledge of relatively small-scale AIs, each with its own expertise and individuality. In addition, NTT is promoting the "Another Me" project, which aims to reflect human diversity in various social and economic activities through AI that learns a wide variety of human personalities and acts autonomously on behalf of people. Last year, we developed personality extraction technology, which estimates a person's hobbies and values based on past behavior, and Individuality Reproduction Dialogue Technology, which reproduces an individual's way of speaking from profiles and attributes [4]. To further advance the social implementation of Another Me, we have now applied an LLM to dialogue and developed a technology that achieves high reproducibility even from small amounts of data.

Figure 1 Another Me Vision

2. Technology Overview

The ability to communicate in a way that is characteristic of the person is essential to realizing Another Me, which can play an active role in society as a proxy for individuals. To enable everyone to have an alter ego, we have developed Individuality Reproduction Dialogue Technology, which generates utterances unique to the person from a small amount of data, and Zero/Few-shot Speech Synthesis Technology, which can synthesize the person's voice from a few seconds to a few minutes of speech.

・Individuality Reproduction Dialogue Technology

With their excellent sentence generation capabilities, LLMs can also be applied to dialogue technologies that generate natural human conversations, such as small talk and discussion, by learning from large amounts of dialogue data. In conventional research on dialogue technology, an LLM was fine-tuned [5] with a large amount of data from an individual to reproduce that person's individuality, but the cost was too high to create digital alter egos for everyone, as Another Me aims to do. Adapter technology [6], by contrast, is a method for efficiently performing additional training of an LLM with a relatively small amount of data. When this method alone is applied to reproducing individuality in dialogue, however, the base LLM has been trained on a large amount of data from a wide variety of people, so a small amount of data is not enough for learning to progress sufficiently. The model then generates utterances that sound like other people with completely different characteristics, which lowers the degree to which the individual is reproduced (Figure 2, top).
 Individuality Reproduction Dialogue Technology solves this issue by combining adapter technology with Persona Dialogue Technology [7]. Adding persona functionality to the base LLM through Persona Dialogue Technology makes the LLM's responses reflect the broad individuality of the person to be reproduced. This brings the initial state of learning closer to that person and enables more efficient learning with less data. It also improves reproduction of the individual in the generation phase: even in interactions completely different from those in the adapter's training data, the model can give a reasonable response that reflects the persona.
 The personal adapter, which applies tsuzumi's adapter technology to the reproduction of individuality, can produce utterances specific to the target individual, such as episodic utterances and speech habits. The model added for each individual as a personal adapter is very small and can be switched dynamically, so the dialogue of a large number of people can be reproduced efficiently (Figure 2, bottom).
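The combination described above, a shared base model that stays fixed, a persona that sets the starting point, and a small personal adapter that is swapped per individual, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not tsuzumi's actual design: the class names, the low-rank form of the adapter, the random weights standing in for trained ones, and the idea of passing the persona as a text prefix are all hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass

# Illustrative sketch only. Class names, sizes, and the low-rank form are
# assumptions for this example, not tsuzumi's actual design or API.


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random values stand in for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


@dataclass
class PersonalAdapter:
    """Small per-person module: delta(x) = up @ (down @ x)."""
    persona: str  # general profile text (residence, hobbies, ...)
    down: list    # rank x dim
    up: list      # dim x rank

    def delta(self, x):
        return matvec(self.up, matvec(self.down, x))

    def num_params(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


def make_adapter(dim, rank, persona, seed):
    """Stand-in for an adapter trained on a small amount of one person's dialogue."""
    rng = random.Random(seed)
    return PersonalAdapter(persona,
                           random_matrix(rank, dim, rng),
                           random_matrix(dim, rank, rng))


class BaseModel:
    """Stand-in for one layer of a frozen base LLM shared by all users."""

    def __init__(self, dim, seed=0):
        self.weight = random_matrix(dim, dim, random.Random(seed))  # never updated
        self.adapters = {}
        self.active = None

    def register(self, user_id, adapter):
        self.adapters[user_id] = adapter

    def switch(self, user_id):
        """Select whose individuality to reproduce; base weights are untouched."""
        if user_id is not None and user_id not in self.adapters:
            raise KeyError(user_id)
        self.active = user_id

    def build_prompt(self, utterance):
        """Attach the active persona to the input (a text prefix is an assumption)."""
        if self.active is None:
            persona = "(none)"
        else:
            persona = self.adapters[self.active].persona
        return f"[persona] {persona}\n[user] {utterance}"

    def forward(self, x):
        """Base output plus the active person's small correction, if any."""
        y = matvec(self.weight, x)
        if self.active is not None:
            y = [a + b for a, b in zip(y, self.adapters[self.active].delta(x))]
        return y


model = BaseModel(dim=8)
model.register("alice", make_adapter(8, 2, "lives in Tokyo; likes hiking", seed=1))
model.register("bob", make_adapter(8, 2, "lives in Osaka; likes jazz", seed=2))
model.switch("alice")
print(model.build_prompt("What did you do last weekend?"))
print(model.adapters["alice"].num_params(), "adapter parameters;",
      8 * 8, "base parameters")
```

Because only the adapter differs between people, switching individuals in this sketch is a dictionary lookup, and the per-person storage grows with the adapter size rather than the base model size.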

Figure 2 Comparison of the Conventional Technology and the Individuality Reproduction Dialogue Technology

・Zero/Few-shot Speech Synthesis Technology

Conventional Text-to-Speech technologies require several tens of minutes of voice data for each speaker and tone of voice to be created (the recording time is several times longer), making it expensive to reproduce the voices of ordinary people and to create a variety of characters and tones of voice. We have realized two technologies that enable the generation of high-quality and diverse expressions from a smaller amount of voice data.
 The first is Zero-shot Text-to-Speech Synthesis Technology. By extracting the characteristics of voice timbre from only a few seconds of a speaker's voice, it generates speech that reproduces those characteristics without training a text-to-speech synthesis model. The aim is to make it easy to reproduce the voice of anyone who wants it, including people who are busy, people who have lost their voice, and people who can only speak at very low volume.
 The second is Few-shot Text-to-Speech Synthesis Technology. Aiming to reflect the voice tone of famous people and famous characters with a higher degree of reproduction, it trains a text-to-speech synthesis model on a few minutes to 10 minutes of speech data containing the tone of voice to be reproduced, and synthesizes highly faithful speech while greatly reducing the amount of speech data required.
 Realizing these technologies requires deep learning models with many parameters. However, by speeding up the computation, we have succeeded in running them on general-purpose CPUs, which keeps operating costs low for speech synthesis services using these technologies.
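The difference between the two approaches can be illustrated with a deliberately simplified sketch. In the zero-shot case, a short reference recording is reduced to a timbre vector that conditions a fixed synthesizer, and no model parameters change. The feature vectors, the averaging "encoder", and the `ToySynthesizer` class below are hypothetical stand-ins invented for this illustration and do not describe NTT's models. A few-shot system would additionally update the synthesizer's parameters on a few minutes of target speech, which is not shown here.

```python
import math

# Toy illustration only: the "encoder" and "synthesizer" below are
# hypothetical stand-ins, not NTT's models.


def speaker_embedding(frames):
    """Summarize timbre as the L2-normalized mean of per-frame feature vectors."""
    dim = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0
    return [v / norm for v in mean]


class ToySynthesizer:
    """Fixed model: zero-shot use never changes `params`."""

    def __init__(self):
        self.params = (0.5, 2.0)  # (text scale, speaker scale), kept frozen

    def synthesize(self, text, embedding):
        """Return one output frame per character, shifted by the speaker vector."""
        text_scale, speaker_scale = self.params
        return [
            [text_scale * ord(ch) / 128.0 + speaker_scale * e for e in embedding]
            for ch in text
        ]


# A few frames of made-up acoustic features stand in for seconds of audio.
reference_a = [[0.9, 0.1, 0.3], [1.1, 0.0, 0.2]]
reference_b = [[0.1, 0.8, 0.5], [0.2, 1.0, 0.4]]

synth = ToySynthesizer()
voice_a = synth.synthesize("hello", speaker_embedding(reference_a))
voice_b = synth.synthesize("hello", speaker_embedding(reference_b))
```

The same text yields different output frames for the two speakers even though `synth.params` is never modified, which is the property that distinguishes zero-shot conditioning from training a model per speaker.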

Figure 3 Zero-shot Speech Synthesis Technology (top), Few-shot Speech Synthesis Technology (bottom)

3. Effects of the technology

These technologies will enable anyone to have a digital alter ego that communicates with others on their behalf, with new digital communication services such as the metaverse as the primary application. While advanced users of such services interact with a variety of people in virtual space, many new users are confused at first about whom to talk to and what to do. A digital alter ego built with these technologies acts as a non-player character (NPC) that operates autonomously even when the user is not logged in. It communicates with other users and their digital alter egos, then brings the content of those exchanges back to the user. This gives users the opportunity to make friends with like-minded users who share their interests, without the psychological barrier of talking to complete strangers or time constraints such as work or housework. Alter egos can also participate on the user's behalf in communities of people who share common interests and connect those communities to the user, thereby energizing community activities. By keeping digital alter egos of celebrities and influencers permanently present within the service, we can expect to expand and revitalize fan communities.
 A prototype of such a digital alter ego, to be implemented on NTT DOCOMO's meta-communication service MetaMe®, was exhibited at docomo Open House'24 [8], held at the Tokyo International Forum on January 17, 2024.

Figure 4 User Experience Image of Digital Alter Ego Prototype Using the Technology

4. Outlook

We plan to start a field experiment on MetaMe® by the end of Japanese fiscal 2023 to examine how users' digital alter egos help create relationships. Through these efforts, we aim to improve the accuracy of our technology by the end of Japanese fiscal 2024 so that we can provide a function that reproduces individuality using NTT's LLM "tsuzumi". This will lead to the realization of digital humans and chatbots that possess highly specialized language skills in specific areas, yet have friendly personalities and can build relationships with customers, employees, and others.

<Terminology>

[1] Towards a new future world creating harmonized relationship between the Earth, society, and people
- Four Grand Challenges of Digital Twin Computing -
https://group.ntt/en/newsrelease/2020/11/13/201113c.html

[2] IOWN Concept
A new communication infrastructure that can provide high-speed broadband communication and enormous computing resources by using innovative technologies including optical technologies
https://www.rd.ntt/iown/

[3] NTT's large-scale language model "tsuzumi"
https://www.rd.ntt/research/LLM_tsuzumi.html

[4] NTT press release from February 1, 2023: Development of next-generation avatar UX technology that creates connections between people
~Trial implementation of human digital twin technology in virtual space MetaMe™ provided by NTT DOCOMO~
https://group.ntt/en/newsrelease/2023/02/01/230201a.html
The 2022 version, announced on February 1, 2023, implements only the Persona Dialogue Technology.

[5] Fine-tuning:
A machine learning technique for imparting data-based knowledge to AI. An AI model that has already been trained on a large amount of data is further trained on another, relatively small amount of data.

[6] Adapter technology:
Technology that enables efficient additional training while keeping the parameters of a pre-trained model fixed, by adding a relatively small model (adapter) outside the pre-trained model
NTT's Large-Scale Language Model "tsuzumi": Flexible Tuning - Base Model + Adapter -
https://www.rd.ntt/e/research/LLM_tsuzumi.html

[7] Persona Dialogue Technology:
Persona Dialogue Technology adds persona functionality to an LLM by having it learn user profiles along with dialogue data. By parameterizing a person's general profile information, such as where they live and their hobbies, it can reproduce speech content appropriate to that person's personality.

[8] docomo Open House'24
https://docomo-openhouse24.smktg.jp/public/application/add/32

MetaMe® is a registered trademark of NTT DOCOMO, INC.

About NTT

NTT contributes to a sustainable society through the power of innovation. We are a leading global technology company providing services to consumers and businesses as a mobile operator and as a provider of infrastructure, networks, applications, and consulting. Our offerings include digital business consulting, managed application services, workplace and cloud solutions, data center and edge computing, all supported by our deep global industry expertise. We have over $97B in revenue and 330,000 employees, with $3.6B in annual R&D investments. Our operations span 80+ countries and regions, allowing us to serve clients in over 190 countries and regions. We serve over 75% of Fortune Global 100 companies, thousands of other enterprise and government clients, and millions of consumers.

Media contact

NTT Service Innovation Laboratory Group
Public Relations
nttrd-pr@ml.ntt.com

Information is current as of the date of issue of the individual press release.
Please be advised that information may be outdated after that point.