November 1, 2024

Information

6 papers authored by NTT Laboratories have been accepted for publication at ICIP2024

Six papers authored by NTT Laboratories have been accepted at ICIP2024 (IEEE International Conference on Image Processing), the flagship conference on image and video processing and computer vision, to be held in Abu Dhabi, United Arab Emirates, from October 27 to 30, 2024. (Affiliations are as of the time of submission.)

SIC: NTT Software Innovation Center
CD: NTT Computer and Data Science Laboratories
HI: NTT Human Informatics Laboratories
CS: NTT Communication Science Laboratories

  1. SCENE GENERALIZED MULTI-VIEW PEDESTRIAN DETECTION WITH ROTATION-BASED AUGMENTATION AND REGULARIZATION
    1. Satoshi Suzuki(HI), Shotaro Tora(CD), Ryo Masumura(HI)
    2. Multi-view pedestrian detection aims to predict a bird's eye view (BEV) occupancy map using multiple camera views. Existing deep-learning-based methods struggle to generalize to camera layouts other than those used in training. To address this problem, we propose a novel data augmentation and regularization method for multi-view pedestrian detection. Our key idea is that rotating the features represented in BEV yields features as if they originated from a new camera layout. By exploiting this idea, our method can effectively train the detection model with knowledge of new layouts. (A minimal illustrative sketch of this augmentation follows this list.)
  2. MVAFORMER: RGB-BASED MULTI-VIEW SPATIO-TEMPORAL ACTION RECOGNITION WITH TRANSFORMER
    1. Taiga Yamane(HI), Satoshi Suzuki(HI), Ryo Masumura(HI), Shotaro Tora(CD)
    2. Previous studies on multi-view action recognition, which aims to recognize human actions from multiple camera views, are not practical because they consider only the setting of recognizing a single action from an entire video. In this paper, we tackle a new task, "multi-view spatio-temporal action recognition", which combines multi-view action recognition with the setting of recognizing the continuous actions of multiple people, known as spatio-temporal action recognition. Furthermore, we propose a new transformer-based method for this task that significantly outperforms comparison methods.
  3. COLLABORATIVE INTELLIGENCE FOR VISION TRANSFORMERS: A TOKEN SPARSITY-DRIVEN EDGE-CLOUD FRAMEWORK
    1. Monikka Roslianna Busto(SIC), Shohei Enomoto(CD), Takeharu Eda(SIC)
    2. Collaborative Intelligence (CI) optimizes deep neural network (DNN) deployment in edge-cloud systems by balancing workloads and leveraging data sparsity for compression and reduced computational cost. While Vision Transformers (ViTs) offer performance advantages, their higher computational overhead complicates edge-cloud deployment compared to CNNs, which benefit from feature-map sparsity. Existing CI methods focus on CNNs; ViTs, on the other hand, exhibit token-based sparsity, requiring a different approach. We propose a novel CI method that utilizes token sparsity in ViTs, using a network called an offloading policy to prioritize task-relevant tokens. This reduces ViT computational costs by 41.98-45.75%, with minimal accuracy loss (1.96-3.10 points) and up to 36.85% data compression. (A simplified sketch of the token-selection idea follows this list.)
  4. IMPROVING IMAGE CODING FOR MACHINES THROUGH OPTIMIZING ENCODER VIA AUXILIARY LOSS
    1. Kei Iino(Waseda University), Shunsuke Akamatsu(Columbia University), Hiroshi Watanabe(Waseda University), Shohei Enomoto(CD), Akira Sakamoto(SIC), Takeharu Eda(SIC)
    2. In image coding for machines (ICM), the encoder must prioritize compressing information that is crucial for machine analysis rather than for human perception. However, existing methods, such as optimizing the compression model with a task-specific loss or allocating bits based on Regions of Interest (ROI), face challenges, including training difficulties with deep models and additional overhead during evaluation. This paper proposes a new training method for ICM models, which introduces an auxiliary loss to the encoder to enhance both recognition capability and rate-distortion performance. Experiments demonstrate that this method achieves Bjøntegaard Delta rate improvements of 27.7% in object detection tasks and 20.3% in semantic segmentation tasks. (A rough sketch of the auxiliary-loss idea follows this list.)
  5. CROSS-ACTION CROSS-SUBJECT SKELETON RECOGNITION VIA SIMULTANEOUS ACTION-SUBJECT LEARNING WITH TWO-STEP FEATURE REMOVAL
    1. Yu Mitsuzumi (CS), Akisato Kimura (CS), Go Irie (CS*1), Atsushi Nakazawa (Kyoto University *2)
      *1 Currently affiliated with Tokyo University of Science
      *2 Currently affiliated with Okayama University
    2. In this paper, we tackle a novel skeleton-based action recognition problem named Cross-Action Cross-Subject Skeleton Action Recognition, in which, for each training subject, data are available for only a subset of the target action classes. Existing skeleton-based action recognition methods struggle with this problem because there are few clues for resolving the entanglement of action and subject information, so the trained model confuses those two features. To solve this challenging problem, we propose a method that combines a novel data augmentation technique with a debiasing learning approach to remove the confusing features.
  6. ESTIMATING INDOOR SCENE DEPTH MAPS FROM ULTRASONIC ECHOES
    1. Junpei Honma (Tokyo University of Science), Akisato Kimura (CS), Go Irie (Tokyo University of Science)
    2. Measuring 3D geometric structures of indoor scenes requires dedicated depth sensors, which are not always available. Echo-based depth estimation has recently been studied as a promising alternative solution. While previous studies use audible echoes, one major problem is that audible echoes may be perceived as noise in quiet spaces. In this paper, we consider echo-based depth estimation using inaudible ultrasonic echoes, and propose a novel deep learning method to use audible echoes as auxiliary data only during training.
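
To make the rotation-based augmentation of paper 1 concrete, here is a minimal sketch in Python. It is an illustration only, assuming BEV features shaped [C, H, W] and an occupancy target shaped [H, W]; the function name rotate_bev_pair and all shapes are hypothetical, not the authors' code.

```python
# Minimal illustrative sketch of rotation-based BEV augmentation (assumed
# shapes and names; not the paper's implementation).
import torch
import torchvision.transforms.functional as TF

def rotate_bev_pair(bev_feat: torch.Tensor, occupancy: torch.Tensor, angle_deg: float):
    """Rotate a BEV feature map [C, H, W] and its occupancy target [H, W]
    by the same angle, emulating features from an unseen camera layout."""
    rotated_feat = TF.rotate(bev_feat, angle_deg)
    rotated_occ = TF.rotate(occupancy.unsqueeze(0), angle_deg).squeeze(0)
    return rotated_feat, rotated_occ

# Hypothetical use inside a training step:
feat = torch.randn(128, 200, 200)                  # BEV features from the multi-view backbone
occ = torch.zeros(200, 200)                        # ground-truth occupancy map
angle = float(torch.empty(1).uniform_(0.0, 360.0))
feat_aug, occ_aug = rotate_bev_pair(feat, occ, angle)
# The detection head and loss would then be applied to the rotated pair.
```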
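
The token-selection idea behind paper 3 can be pictured with the following sketch: a small scoring network ranks ViT patch tokens, and only the top-scoring, task-relevant tokens are kept for transmission from the edge to the cloud. The class TokenOffloadingPolicy and the keep_ratio parameter are placeholders introduced for illustration, not the paper's API.

```python
# Illustrative sketch (assumed structure, not the paper's implementation) of
# scoring ViT tokens on the edge device and keeping only task-relevant ones.
import torch
import torch.nn as nn

class TokenOffloadingPolicy(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-token relevance score

    def forward(self, tokens: torch.Tensor, keep_ratio: float = 0.5):
        # tokens: [B, N, D] patch tokens from the edge-side ViT blocks
        scores = self.scorer(tokens).squeeze(-1)              # [B, N]
        k = max(1, int(tokens.shape[1] * keep_ratio))
        idx = scores.topk(k, dim=1).indices                   # indices of kept tokens
        kept = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        )
        return kept, idx  # only `kept` would be compressed and sent to the cloud

policy = TokenOffloadingPolicy(dim=384)
tokens = torch.randn(2, 196, 384)                  # e.g. 14x14 patch tokens per image
kept, idx = policy(tokens, keep_ratio=0.5)         # kept: [2, 98, 384]
```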
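
Finally, the auxiliary-loss idea of paper 4 can be sketched roughly as below: the codec encoder is trained not only against a reconstruction objective (and, in a real codec, a rate term) but also against a recognition-oriented loss computed from its latent. The toy codec, the recognition head, and the loss weight are assumptions made purely for illustration, not the authors' architecture.

```python
# Rough sketch (toy model; not the authors' architecture) of adding an
# auxiliary recognition loss on a codec encoder's latent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCodec(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x):
        y = self.encoder(x)          # latent that a real codec would entropy-code
        return y, self.decoder(y)

codec = ToyCodec()
aux_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 10))

x = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 10, (2,))
y, x_hat = codec(x)
distortion = F.mse_loss(x_hat, x)
aux = F.cross_entropy(aux_head(y), labels)   # auxiliary loss applied to the encoder output
loss = distortion + 0.1 * aux                # rate term omitted for brevity
loss.backward()
```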

Information is current as of the date of issue of each topic.
Please be advised that it may become outdated after that date.