A Study on Systematic Improvement of Transformer Models for Object Pose Estimation

Sensors (Basel). 2025 Feb 18;25(4):1227. doi: 10.3390/s25041227.

Abstract

The transformer architecture, initially developed for natural language processing and time series analysis, has been successfully adapted to generative models in several domains. Object pose estimation, which determines the 3D position and orientation of an object from images, is essential for tasks such as robotic manipulation. This study introduces a transformer-based deep learning model for object pose estimation in computer vision. A baseline model derived from an encoder-only transformer suffers from high GPU memory usage when handling multiple objects. To improve training efficiency and support multi-object inference, the proposed model reduces memory consumption by restructuring the transformer's attention layer and applies low-rank weight decomposition to decrease the number of parameters. In addition, grouped-query attention (GQA) and root mean square (RMS) normalization enhance multi-object pose estimation performance. The improved implementation, which extends the matrix dimensions of the attention computation, reduced GPU memory usage to only 2.5% of that of the baseline model, although it increased the number of weight parameters. To mitigate this, the number of weight parameters was reduced by 28% through low-rank weight decomposition of the linear layers in the attention block. Applying GQA and RMS normalization further yielded a 17% improvement in rotation training accuracy over the baseline model.
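The abstract names three techniques: low-rank weight decomposition of the attention linear layers, grouped-query attention, and RMS normalization. The sketch below is a minimal PyTorch illustration of how these three pieces can fit together in one attention block; the module names, dimensions, and rank are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankLinear(nn.Module):
    """Low-rank weight decomposition: a d_in x d_out weight is replaced by
    two factors of rank r, cutting parameters from d_in*d_out to
    r*(d_in + d_out)."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # d_in -> r
        self.up = nn.Linear(rank, d_out, bias=False)   # r -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class RMSNorm(nn.Module):
    """RMS normalization: rescales by 1/RMS(x) without mean-centering."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms


class GQABlock(nn.Module):
    """Grouped-query attention: n_heads query heads share n_kv_heads
    key/value heads, reducing the size of the K/V projections."""

    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int, rank: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.norm = RMSNorm(d_model)
        # Low-rank projections for queries, keys, values, and output.
        self.q_proj = LowRankLinear(d_model, n_heads * self.head_dim, rank)
        self.k_proj = LowRankLinear(d_model, n_kv_heads * self.head_dim, rank)
        self.v_proj = LowRankLinear(d_model, n_kv_heads * self.head_dim, rank)
        self.o_proj = LowRankLinear(n_heads * self.head_dim, d_model, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        h = self.norm(x)
        q = self.q_proj(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Broadcast each key/value head to its group of query heads.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return x + self.o_proj(out)
```

With, for example, d_model = 512, n_heads = 8, n_kv_heads = 2, and rank = 64, each LowRankLinear holds roughly a quarter of the parameters of the full linear layer it replaces, consistent in spirit with the parameter reduction the abstract reports.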

Keywords: low-rank weight decomposition; object pose estimation; transformer.