TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

1 Apr 2021  ·  Peng Chu, Jiang Wang, Quanzeng You, Haibin Ling, Zicheng Liu

Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named TransMOT, which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. TransMOT effectively models the interactions of a large number of objects by arranging the trajectories of the tracked objects as a set of sparse weighted graphs, and constructing a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial graph transformer decoder layer based on the graphs. TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy. To further improve the tracking speed and accuracy, we propose a cascade association framework to handle low-score detections and long-term occlusions that require large computational resources to model in TransMOT. The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20, and it achieves state-of-the-art performance on all the datasets.
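
To make the idea of a "sparse weighted graph" over tracked objects concrete, here is a minimal sketch of how one frame's objects might be arranged into such a graph. It is not the authors' implementation. The abstract does not say how edge weights are computed, so the bounding-box IoU used here is an assumption, and the names `Box`, `iou`, and `build_sparse_graph` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_sparse_graph(
    boxes: List[Box], min_weight: float = 0.0
) -> Dict[int, Dict[int, float]]:
    """Return an adjacency map: node index -> {neighbor index: edge weight}.

    Assumption: the edge weight is the IoU of the two boxes. Only pairs whose
    weight exceeds min_weight are stored, which keeps the graph sparse.
    """
    graph: Dict[int, Dict[int, float]] = {i: {} for i in range(len(boxes))}
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            w = iou(boxes[i], boxes[j])
            if w > min_weight:
                graph[i][j] = w
                graph[j][i] = w
    return graph


# Two overlapping objects and one far away: only the first pair gets an edge.
frame_boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
frame_graph = build_sparse_graph(frame_boxes)
```

In a sketch like this, an attention layer would only need to consider the stored neighbors of each object rather than all object pairs, which is one plausible reading of why a sparse graph formulation can be cheaper than dense attention.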


Results from the Paper


Ranked #2 on Multi-Object Tracking on 2DMOT15 (using extra training data)

Task                   Dataset  Model  Metric  Value  Global Rank
Multi-Object Tracking  2DMOT15  STGT   MOTA    57     #2
Multi-Object Tracking  2DMOT15  STGT   IDF1    66     #1
Multi-Object Tracking  MOT16    STGT   MOTA    76.7   #4
Multi-Object Tracking  MOT16    STGT   IDF1    76.8   #1
Multi-Object Tracking  MOT17    STGT   MOTA    76.7   #12
Multi-Object Tracking  MOT17    STGT   IDF1    75.1   #13
Multi-Object Tracking  MOT20    STGT   MOTA    77.5   #5
Multi-Object Tracking  MOT20    STGT   IDF1    75.2   #9
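
For readers unfamiliar with the two metrics in the table, the following sketch shows their standard definitions. The function names and the example counts are hypothetical and are not taken from the paper. The functions return fractions; the table values appear to be the same quantities on a 0 to 100 scale.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT.

    num_gt is the total number of ground-truth object instances summed
    over all frames.
    """
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """ID F1 score: 2 * IDTP / (2 * IDTP + IDFP + IDFN).

    IDTP, IDFP and IDFN are identity-level true positives, false positives
    and false negatives after matching predicted to ground-truth trajectories.
    """
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)


# Hypothetical counts, for illustration only.
example_mota = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
example_idf1 = idf1(idtp=80, idfp=20, idfn=20)
```

MOTA penalizes detection errors and identity switches together, while IDF1 measures how consistently identities are preserved along trajectories, which is why a tracker's rank can differ between the two metrics on the same dataset.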

Methods