Think about those times when you've struggled to find words. Not because you don't know what to say, but perhaps because of physical issues: speech can disappear entirely after a stroke, during intubation, or because of neurological damage. Maybe there have been times when you've felt stress, pain, fatigue, or shock and been unable to communicate. Or even just in daily life, when you're trying to remember something you saw briefly but can't find the words to describe it.
Close your eyes and think about the image you have in mind. A street corner, a moving crowd, the strange expression on someone's face you half noticed. A moment that stuck with you. Now imagine a system with the power to turn that mental image into a written description on a screen.
We're not quite there yet, but NTT is taking steps to turn that science fiction into a potential future.
The NTT Science and Core Technology Laboratory Group recently published a study exploring the concept of mind captioning. From recorded patterns of brain activity, the researchers were able to generate short pieces of text describing the visual content a person was watching, or later recalling from memory.
Humans are visual creatures. Long before we are able to speak, we learn to recognize scenes, movement, and relationships between objects. Even as adults, we often think in images first and words second. NTT's research builds on that fact by linking brain activity related to vision with the same kind of semantic representations used by modern language models. Instead of asking the brain for words, however, it asks for visual concepts and lets its language model do the rest.
The Science and Core Technology Laboratory Group used functional MRI (fMRI), which measures changes in blood flow in the brain. Research volunteers watched short video clips while their brain activity was recorded. Those clips were also described in text, and the descriptions were converted into numerical features used by a language model. A decoder then learned how patterns of brain activity correspond to those features. Later, when a participant was asked to recall a video from memory, the system tried to find text that best matched the predicted semantic pattern.
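For readers who like to see the moving parts, here is a minimal sketch of that decode-then-match idea in Python. To be clear, this is not NTT's actual pipeline: the published method refines candidate text iteratively, and real caption embeddings would come from a language model. This toy version uses random placeholder data and simplifies the matching step to picking the closest caption by cosine similarity.

```python
# A minimal, hypothetical sketch of decoding-by-retrieval.
# All sizes, names, and data below are illustrative placeholders.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(seed=0)
n_clips, n_voxels, n_dims = 200, 1000, 64  # hypothetical sizes

# Training pairs: fMRI activity recorded while watching each clip (X),
# and a language-model embedding of that clip's text description (Y).
X_train = rng.standard_normal((n_clips, n_voxels))
Y_train = rng.standard_normal((n_clips, n_dims))

# The decoder: a regularized linear map from voxel patterns to
# semantic features (ridge regression handles multi-output natively).
decoder = Ridge(alpha=10.0)
decoder.fit(X_train, Y_train)

# Test time: activity recorded while viewing (or recalling) a new clip.
x_new = rng.standard_normal((1, n_voxels))
y_pred = decoder.predict(x_new)[0]  # predicted semantic features

# Candidate captions, pre-embedded with the same language model
# (random vectors stand in for real embeddings here).
captions = [
    "a crowd crosses a busy street corner",
    "a dog runs across a grassy field",
    "two people talk at a kitchen table",
]
caption_embs = rng.standard_normal((len(captions), n_dims))

# Pick the caption whose embedding is most similar (by cosine
# similarity) to the pattern predicted from brain activity.
sims = caption_embs @ y_pred / (
    np.linalg.norm(caption_embs, axis=1) * np.linalg.norm(y_pred)
)
print(captions[int(np.argmax(sims))])
```

The design point worth noticing is that the decoder never predicts words directly; it predicts a position in the language model's semantic space, and text is then chosen to match that position.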
Just to be clear, this is not telepathy. No one's inner thoughts were being deciphered, and the generated captions were not transcripts of inner speech. They tended to capture the general idea of what was shown, such as movement, setting, or action, rather than precise detail. When the captions came from memory rather than direct viewing, they became fuzzier, much as real human recall does. Even so, they still pinpointed the right conceptual areas.
The system continued to function even when brain regions typically linked to language were excluded from the analysis, which suggests the captions were grounded in visual and associative processing, rather than speech. In other words, the system isn't listening for words forming in the head; it's responding to patterns tied to seeing and remembering.
NTT's setup relied on large scanners, careful training, and controlled conditions; the technology is not likely to appear in consumer devices anytime soon! After further development, however, one possible use of the technology could be communication support. Someone temporarily unable to speak, due to injury, anxiety, or medical treatment, might still be able to share basic information by recalling images.
In a similar vein, professionals who rely on visual judgment, including surgeons, pilots, or craftspeople, might use such tools to help explain something they have noticed visually but are unable to describe verbally. Creative work could benefit as well: turning vague mind's-eye scenes into rough text could help writers or designers get unstuck at the earliest stage of an idea.
General research could also benefit. Being able to compare brain activity during perception and recall, using the same language-based reference frame, may come to provide a new way of studying memory and imagination.
It's still early days, but NTT is working to make visual experiences translatable into written language. Watch this space.
Innovating a Sustainable Future for People and Planet
For further information, please see this link:
https://group.ntt/en/newsrelease/2025/11/17/251117a.html
If you have any questions on the content of this article, please contact:
Public Relations
NTT Science and Core Technology Laboratory Group
https://tools.group.ntt/en/news/contact/index.php
Daniel O'Connor joined the NTT Group in 1999 when he began work as the Public Relations Manager of NTT Europe. While in London, he liaised with the local press, created the company's intranet site, wrote technical copy for industry magazines and managed exhibition stands from initial design to finished displays.
Later seconded to the headquarters of NTT Communications in Tokyo, he contributed to the company's first-ever global telecoms award wins and to the digitalisation of internal company information exchange.
Since 2015, Daniel has created content for the Group's Global Leadership Institute and the One NTT Network, and he is currently working with NTT R&D teams to grow public understanding of the cutting-edge research undertaken by the NTT Group.