Better Behavior From Multi-Modal AI

June 8, 2026

Technology

Modern AI has an interesting* habit: it explains itself with complete confidence, even when its explanations aren't actually doing any work. If you’ve ever asked an AI to analyze a chart or a slide, you’ve probably seen it lay out a neat, logical argument before giving you an answer.

*euphemism

The answer you get feels reassuring and looks like genuine thought. But often, what appears to be sound reasoning is just a surface-level explanation. The model can ignore its own reasoning entirely and still arrive at the same answer.

Decorative Reasoning and LVLMs

This kind of decorative reasoning is exactly what NTT has been digging into with its latest research on Large Vision-Language Models (LVLMs).

LVLMs are systems that work with images and text together, interpreting everything from complex diagrams to street photos. As they become more common in document analysis and other applications where reliability matters, we’ve mostly, and naively, assumed that their reasoning was the engine driving their answers.

Show Your Work, Please

A common technique for getting LVLMs to reason step by step is "Chain-of-Thought" prompting, which essentially asks the AI to show its work. The theory is that by generating an intermediate explanation, the model becomes more accurate and transparent. This makes the whole mysterious black-box process feel easier to trust.

However, NTT’s research shows that this trust might be misplaced. In controlled tests, LVLMs were found to ignore their own reasoning. Even when researchers tried swapping out the AI's reasoning with something completely unrelated, the LVLM’s final answer often didn't change. This leads us to a worrying thought: if the reasoning can be removed without changing the AI’s outcome, then that reasoning isn’t really part of the decision-making process at all.
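The swap test described above can be sketched in a few lines of Python. This is only a toy illustration, not NTT's experimental setup: the function `unchanged_rate`, the two stand-in "models" and the example data are all hypothetical, invented here to show the idea of replacing the reasoning and checking whether the answer moves.

```python
def unchanged_rate(model, examples, unrelated_rationale):
    """Fraction of examples whose answer stays the same after the
    original reasoning is replaced with unrelated text."""
    same = 0
    for image, rationale in examples:
        original = model(image, rationale)
        swapped = model(image, unrelated_rationale)
        same += int(original == swapped)
    return same / len(examples)


# Hypothetical stand-in: answers from the image alone and ignores the reasoning.
def decorative_model(image, rationale):
    return image["label"]


# Hypothetical stand-in: answers with the conclusion (last word) of the reasoning.
def grounded_model(image, rationale):
    return rationale.split()[-1]


# Made-up examples: each pairs a toy "image" with a piece of reasoning.
examples = [
    ({"label": "cat"}, "whiskers and pointed ears so cat"),
    ({"label": "dog"}, "floppy ears and a leash so dog"),
]
unrelated = "the weather is sunny"

decorative_score = unchanged_rate(decorative_model, examples, unrelated)
grounded_score = unchanged_rate(grounded_model, examples, unrelated)
print(decorative_score, grounded_score)
```

A score close to 1.0 means the reasoning could be swapped out without consequence, which is the decorative behavior the research describes. A score close to 0.0 means the answer actually depends on the reasoning.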

Two Independent Signals, Fused

To address this problem, NTT has developed a new inference method called rationale-enhanced decoding.

While the math is perhaps a bit tricky to follow, the concept is straightforward. Instead of tossing the image and the reasoning into the model as a single combined input, the system treats them as two distinct signals that have to be processed independently before being fused back together at the final stage of output generation.

Technically speaking, the model generates two separate probability distributions for its next word: one based on the image and one based on the reasoning. By merging the two, the system makes the output explicitly depend on both. The reasoning is no longer a suggestion; the logic is baked into the result.
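To make the fusion step a little more concrete, here is a minimal Python sketch. It assumes the two distributions are merged by multiplying them and renormalizing, with a hypothetical weight `lam` on the reasoning path; the article does not spell out the exact rule, so treat this as one plausible reading. The three-word vocabulary and the probabilities are made up for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(p_image, p_rationale, lam=1.0):
    """Assumed fusion rule: weighted product of the two next-word
    distributions, computed in log space and renormalized.
    `lam` (hypothetical) controls how much the reasoning counts."""
    eps = 1e-12  # avoids log(0) for words with zero probability
    log_scores = [
        math.log(pi + eps) + lam * math.log(pr + eps)
        for pi, pr in zip(p_image, p_rationale)
    ]
    return softmax(log_scores)


# Made-up next-word distributions over a tiny vocabulary.
vocab = ["up", "down", "flat"]
p_image = [0.5, 0.4, 0.1]      # what the image alone suggests
p_rationale = [0.2, 0.7, 0.1]  # what the reasoning alone suggests

fused = fuse(p_image, p_rationale)
best = vocab[max(range(len(vocab)), key=fused.__getitem__)]
print(best, [round(p, 3) for p in fused])
```

In this toy case the image alone slightly favors "up", but once the reasoning is multiplied in, "down" wins. A word has to be plausible under both signals to score well, which is why the output can no longer ignore the reasoning. Setting `lam` to 0 recovers the image-only distribution.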

In a standard LVLM setup, the model is free to ignore its own explanation. With NTT’s method, the reasoning becomes a core part of the machinery. The explanation isn't a story the AI makes up and tells you after the fact. It’s the actual path it took to get there.

One of the handy parts of NTT’s approach is that it doesn't require retraining from scratch; it’s possible to apply rationale-enhanced decoding at inference time (basically, while the AI is running). This makes it much easier for companies to upgrade their existing tools without significant additional computational cost.

Sounds great, but what do the results show?

Better AI Behavior

Accuracy on benchmark tasks went up, and more importantly, the model’s behavior changed. Turns out that if you give it high-quality reasoning, the answers get better. If you feed it bad logic, the answers get worse. For you and me, that’s just common sense, but in the world of AI it’s a real step forward, because it shows the system is actually using the reasoning it is given.

How could this be used practically? In a medical setting, let’s say, a doctor could provide specific clinical notes alongside an image, and the model would then incorporate that human expertise into its final analysis. In a business setting, meanwhile, it means the insights pulled from a financial report are actually rooted in the data on the page.

Right… For The Right Reasons

As we get more used to using AI in our daily lives, we’re starting to realize it can be… eccentric, let’s say? And so we’re starting to demand consistency. It’s not enough for the AI to be correct; we need to know that it was correct for the right reasons.

NTT’s work doesn't solve every mystery of the black box, but it addresses a gap in how reasoning is used. If an AI claims to be reasoning, that reasoning needs to have consequences. Otherwise, we can’t really rely on it when the stakes are high. It’s about ensuring that when an AI tells you why it did something, its answer genuinely reflects the reasoning it used.

For further information, please see this link:
https://group.ntt/en/newsrelease/2026/06/01/260601a.html

If you have any questions on the content of this article, please contact:

Public Relations
NTT
https://tools.group.ntt/en/news/contact/index.php

Daniel O'Connor

Daniel O'Connor joined the NTT Group in 1999 when he began work as the Public Relations Manager of NTT Europe. While in London, he liaised with the local press, created the company's intranet site, wrote technical copy for industry magazines and managed exhibition stands from initial design to finished displays.

Later seconded to the headquarters of NTT Communications in Tokyo, he contributed to the company's first-ever global telecoms award wins and the digitalisation of internal company information exchange.

Since 2015 Daniel has created content for the Group's Global Leadership Institute and the One NTT Network, and is currently working with NTT R&D teams to grow public understanding of the cutting-edge research undertaken by the NTT Group.