Five papers authored by NTT Laboratories have been accepted at CVPR (IEEE/CVF Conference on Computer Vision and Pattern Recognition) 2025, held in Nashville, Tennessee, USA, from June 11 to 15, 2025. CVPR is a flagship conference on computer vision and pattern recognition, where researchers pursue the computational understanding, control, and generation of images and videos, as well as their foundational theories. It is known as one of the most competitive conferences in the field: this year's acceptance rate was 22.1% (2,878 papers accepted out of 13,008 submissions). In addition, one paper has been selected for an oral presentation, an honor given to only 0.7% (96 papers) of all submissions, and another has been selected for a highlight presentation (13.5%, 387 papers).
Abbreviated names of the laboratories:
HI: NTT Human Informatics Laboratories
CD: NTT Computer and Data Science Laboratories
CS: NTT Communication Science Laboratories
(The affiliations are at the time of submission.)
- Gromov–Wasserstein Problem with Cyclic Symmetry (Oral presentation)
- Shoichiro Takeda, Associate Distinguished Researcher (HI), Yasunori Akagi, Associate Distinguished Researcher (HI)
- The Gromov–Wasserstein problem finds structural similarities and correspondences between data. For example, it can be used to identify common structures in different proteins for drug discovery, or to compare the structures of old and new buildings for anomaly detection. We have developed a new algorithm that solves this problem faster by exploiting cyclic symmetry hidden in real-world data. This makes it possible to efficiently find structural similarities and correspondences in larger-scale data.
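To make the idea of "structural similarity" concrete, here is a toy illustration (not the paper's algorithm, which is about exploiting cyclic symmetry for speed): the Gromov–Wasserstein discrepancy compares two metric spaces by matching points so that pairwise distances are preserved as well as possible. The brute-force search below only works for tiny, equal-size spaces.

```python
import itertools
import numpy as np

def gw_bruteforce(D1, D2):
    """Brute-force GW-style matching over permutations of equal-size spaces.

    D1, D2 are (n, n) pairwise-distance matrices; returns the matching
    cost and the permutation that best preserves pairwise distances.
    """
    n = D1.shape[0]
    best_cost, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        p = list(perm)
        cost = np.sum((D1 - D2[np.ix_(p, p)]) ** 2)
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

# Two triangles with identical shape but points listed in a different order.
pts1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
pts2 = pts1[[2, 0, 1]]  # same structure, relabelled
D1 = np.linalg.norm(pts1[:, None] - pts1[None, :], axis=-1)
D2 = np.linalg.norm(pts2[:, None] - pts2[None, :], axis=-1)

cost, perm = gw_bruteforce(D1, D2)
print(cost, perm)  # cost 0.0: a relabelling exists that aligns the structures
```

The brute force is factorial in the number of points; the accepted paper's contribution is precisely a faster algorithm for the realistic, large-scale setting.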
- VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
- Ryota Tanaka, Associate Distinguished Researcher (HI), Taichi Iki, Research Engineer (HI), Taku Hasegawa, Research Engineer (HI), Kyosuke Nishida, Senior Distinguished Researcher (HI), Kuniko Saito, Executive Research Engineer (HI), Jun Suzuki, Professor (Tohoku University)
- We propose VDocRAG, a novel retrieval-augmented generation (RAG) framework designed for question answering over a corpus of visually-rich documents (e.g., PDFs). VDocRAG unifies the understanding of diverse documents and modalities by processing them in an image format, enabling direct comprehension of document content. This study contributes to the advancement of technologies such as question answering and retrieval for real-world visually-rich documents.
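For readers unfamiliar with RAG, the sketch below shows the generic retrieve-then-read loop that such frameworks build on, with bag-of-words vectors standing in for learned embeddings. Note that VDocRAG itself encodes whole document images; the documents and query here are made-up placeholders that only illustrate the pipeline shape.

```python
import numpy as np

# Minimal retrieve-then-read sketch: embed a corpus and a query,
# retrieve the closest document, and hand it to a generator.
docs = [
    "the invoice total is 42 dollars",
    "the contract expires in march",
]
query = "what is the invoice total"

vocab = sorted({w for d in docs + [query] for w in d.split()})

def embed(text):
    """Unit-normalized bag-of-words vector (a stand-in for a learned encoder)."""
    v = np.array([text.split().count(w) for w in vocab], dtype=float)
    return v / (np.linalg.norm(v) or 1.0)

doc_vecs = np.stack([embed(d) for d in docs])
scores = doc_vecs @ embed(query)          # cosine similarity (unit vectors)
retrieved = docs[int(np.argmax(scores))]
print(retrieved)  # the document a generator would condition on to answer
```

In VDocRAG, both the retriever and the reader operate on document images rather than extracted text, which is what lets it handle layouts, figures, and tables that plain-text pipelines lose.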
- Post-pre-training for Modality Alignment in Vision-Language Foundation Models
- Shinya Yamaguchi, Associate Distinguished Researcher (CD), Dewei Feng (MIT), Sekitoshi Kanai, Research Scientist (CD), Kazuki Adachi, Researcher (CD), Daiki Chijiwa, Associate Distinguished Researcher (CD)
- Contrastive Language-Image Pre-training (CLIP) is crucial for building modern vision-language foundation models, but it faces a challenge: a gap between image and text features. This gap limits its performance. We introduce "CLIP-Refine," a new post-pre-training method that enhances CLIP's zero-shot performance with minimal data and short training time. It aligns image and text features to a common standard, ensuring the model retains existing knowledge while learning new information. Our experiments show that CLIP-Refine effectively reduces the gap between image and text understanding, improving overall zero-shot performance.
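The "modality gap" mentioned above is commonly quantified as the distance between the centroids of normalized image and text embeddings. The sketch below measures that gap on synthetic features; the cluster offsets are invented for illustration, and this is not CLIP-Refine itself, only the quantity it aims to shrink.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """Project feature vectors onto the unit sphere, as CLIP-style models do."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Synthetic features: image and text clouds offset in opposite directions,
# mimicking the separated clusters observed in real CLIP embedding spaces.
img = normalize(rng.normal(size=(500, 64)) + 2.0)
txt = normalize(rng.normal(size=(500, 64)) - 2.0)

# Modality gap: distance between the two modality centroids.
gap = np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))
print(gap)  # clearly nonzero; a post-pre-training step would reduce this
```

A smaller gap means matched image-text pairs can score higher than mismatched ones across modalities, which is one intuition for why closing it improves zero-shot performance.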
- HuPerFlow: A Comprehensive Benchmark for Human vs. Machine Motion Estimation Comparison
- Yung-Hao Yang (Kyoto University), Zitang Sun (Kyoto University), Taiki Fukiage, Senior Research Scientist (CS), Shin'ya Nishida, Professor (Kyoto University)
- As AI models are increasingly integrated into applications that interact directly with humans, it is essential for AI to accurately understand human visual perception and collaborate effectively. In this study, we conducted large-scale psychophysical experiments to investigate how humans perceive motion in videos. Based on these experiments, we constructed the HuPerFlow dataset and analyzed human perceptual characteristics. Our findings reveal that human-perceived motion systematically deviates from the physical ground truth and does not align with existing AI-based motion estimation. HuPerFlow provides a new foundation for evaluating how well AI aligns with human perception, contributing to the development of human-centered AI technology.
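A standard way to quantify how far one motion field deviates from another, whether model versus ground truth or (as in this benchmark) human percept versus ground truth, is the average endpoint error (AEE) used throughout optical-flow evaluation. The flow fields and the "perceived" bias below are synthetic placeholders, not HuPerFlow data.

```python
import numpy as np

def average_endpoint_error(flow_a, flow_b):
    """Mean Euclidean distance between two (H, W, 2) optical-flow fields."""
    return float(np.mean(np.linalg.norm(flow_a - flow_b, axis=-1)))

gt = np.ones((4, 4, 2))        # uniform ground-truth motion, 1 px per axis
perceived = gt * 0.8           # e.g. a systematic speed underestimation
print(average_endpoint_error(perceived, gt))  # ≈ 0.283 (i.e. 0.2 * sqrt(2))
```

Systematic, structured deviations like this (rather than random noise) are what make human-perceived flow a distinct target from physical ground truth.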
- Structure from Collision (Highlight presentation)
- Takuhiro Kaneko, Distinguished Researcher (CS)
- Recent advancements in neural 3D representations have improved the estimation of 3D structures from multiview images. However, these methods are limited to estimating the external surface of objects, and accurately estimating internal structures hidden behind the surface remains a significant challenge. To address this issue, this study introduces a novel task, structure from collision (SfC). Specifically, we propose a new model, SfC-NeRF, which enables the estimation of both the visible external structure and the invisible internal structure based on appearance changes during collision. This work contributes to accurately predicting shape changes in objects from images, as well as to reliable robotic manipulation and computer-mediated interaction with physical objects.
NTT believes in resolving social issues through our business operations by applying technology for good. An innovative spirit has been part of our culture for over 150 years, making breakthroughs that enable a more naturally connected and sustainable world. NTT Research and Development shares insights, innovations, and knowledge with NTT operating companies and partners to support new ideas and solutions. Around the world, our research laboratories focus on artificial intelligence, photonic networks, theoretical quantum physics, cryptography, health and medical informatics, smart data platforms, and digital twin computing. As a top-five global technology and business solutions provider, our diverse teams deliver services to over 190 countries and regions. We serve over 75% of Fortune Global 100 companies and thousands of other clients and communities worldwide. For more information on NTT, visit https://www.rd.ntt/e/