Two papers from NTT Laboratories have been accepted for CVPR 2026, a premier conference on computer vision

Two papers authored by NTT Laboratories have been accepted at CVPR (The IEEE/CVF Conference on Computer Vision and Pattern Recognition) 2026, to be held in Denver, Colorado, USA, from June 3 to 7, 2026. CVPR is a flagship conference on computer vision and pattern recognition, where researchers pursue the computational understanding, control, and generation of images and videos, as well as the underlying foundational theories.

Abbreviated names of the laboratories:
HI: NTT Human Informatics Laboratories
CD: NTT Computer and Data Science Laboratories
(The affiliations are at the time of submission.)

■Rationale-Enhanced Decoding for Multi-Modal Chain-of-Thought

Shin'ya Yamaguchi (CD), Kosuke Nishida (HI), Daiki Chijiwa (CD)

Chain-of-Thought (CoT) prompting has been adapted for large vision-language models (LVLMs) to enhance multi-modal reasoning capability by generating intermediate rationales. However, our experiments reveal a key challenge: existing models often ignore the contents of these generated rationales when producing their final output. To address this issue, we propose Rationale-Enhanced Decoding (RED), a novel decoding strategy that requires no additional training. RED harmonizes visual and rationale information by multiplying separate image-conditional and rationale-conditional next-token distributions at decoding time, so that the model's outputs are grounded in the rationale. Extensive experiments demonstrate that RED consistently and significantly improves both the model's dependency on rationales and its reasoning performance across multiple benchmarks. This technology not only improves the performance of LVLMs but also enhances the interpretability required for leveraging AI in critical decision-making. It is expected to find applications in various domains where humans and AI collaborate, including AI Constellation©.
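
The multiplication of the two next-token distributions described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `combine_next_token`, the weighting exponent `lam`, and the three-token toy vocabulary are assumptions introduced here purely for illustration.

```python
import math

def combine_next_token(log_p_image, log_p_rationale, lam=1.0):
    """Multiply two next-token distributions in log space, then renormalize.

    `lam` is a hypothetical exponent on the rationale-conditional
    distribution; lam=1.0 gives a plain product of the two distributions.
    """
    scores = [a + lam * b for a, b in zip(log_p_image, log_p_rationale)]
    # Log-sum-exp for a numerically stable normalizing constant.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

# Toy next-token distributions over a three-token vocabulary (made-up numbers).
p_image = [0.5, 0.4, 0.1]      # conditioned on the image and the question
p_rationale = [0.2, 0.7, 0.1]  # conditioned on the rationale and the question

combined = combine_next_token(
    [math.log(p) for p in p_image],
    [math.log(p) for p in p_rationale],
)
probs = [math.exp(v) for v in combined]

# Greedy choice from the combined distribution.
next_token = max(range(len(probs)), key=probs.__getitem__)
print(probs, next_token)
```

In this toy case the image-conditional distribution alone would select token 0, while the combined distribution selects token 1, the token that the rationale-conditional distribution strongly supports.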

■Parallel In-Context Learning for Large Vision Language Models

Shin'ya Yamaguchi (CD), Tamao Sakao (CD), Daiki Chijiwa (CD), Taku Hasegawa (HI)

Multi-modal in-context learning (MM-ICL) allows large vision-language models (LVLMs) to adapt to new tasks using demonstration examples. However, increasing the number of demonstrations causes significant inference latency due to long image token sequences. To address this, we propose Parallel-ICL, a novel plug-and-play inference algorithm. Based on ensemble learning theory, Parallel-ICL partitions a long demonstration context into shorter, manageable chunks by maximizing inter-chunk diversity, processes the chunks in parallel, and integrates their predictions by weighting each chunk according to its task similarity. Extensive experiments demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL while significantly improving inference speed. This enables LVLMs to perform high-speed, high-accuracy inference even for tasks in specific domains, such as medical image processing, that were not fully covered during training, paving the way for further expansion of AI applications.
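
The partition, parallel processing, and weighted integration steps described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the Parallel-ICL algorithm itself: the round-robin split is only a placeholder for the diversity-maximizing partition, `toy_predict` is a hypothetical stand-in for an LVLM call, and the softmax weighting with arithmetic averaging is one assumed way of integrating chunk predictions.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(demos, num_chunks):
    # Placeholder partition (round-robin). The method described above
    # instead chooses chunks that maximize inter-chunk diversity.
    return [demos[i::num_chunks] for i in range(num_chunks)]

def toy_predict(chunk):
    # Hypothetical stand-in for running an LVLM on one chunk of demonstrations.
    # Each demo is (similarity_to_query, label) with label in {0, 1}.
    positives = sum(1 for _, label in chunk if label == 1)
    p1 = (positives + 1) / (len(chunk) + 2)  # Laplace-smoothed label frequency
    similarity = sum(sim for sim, _ in chunk) / len(chunk)
    return [1.0 - p1, p1], similarity

def integrate(predictions, similarities):
    # Assumed integration rule: softmax over similarity scores gives the
    # chunk weights, then the chunk distributions are averaged with them.
    m = max(similarities)
    exps = [math.exp(s - m) for s in similarities]
    total = sum(exps)
    weights = [e / total for e in exps]
    num_classes = len(predictions[0])
    return [
        sum(w * p[k] for w, p in zip(weights, predictions))
        for k in range(num_classes)
    ]

# Made-up demonstrations: (similarity to the query task, label).
demos = [(0.9, 1), (0.2, 0), (0.8, 1), (0.1, 0), (0.7, 1), (0.3, 1)]
chunks = split_into_chunks(demos, 2)

# Process the short chunks in parallel instead of one long context.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(toy_predict, chunks))

preds = [p for p, _ in results]
sims = [s for _, s in results]
final = integrate(preds, sims)
print(final)
```

Because each chunk is much shorter than the full demonstration context, the per-chunk calls can run concurrently, and the chunk whose demonstrations are more similar to the query contributes more to the final prediction.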

Information is current as of the date of issue of the individual topics.
Please be advised that information may be outdated after that point.
