Pyramidal Transformer with Conv-Patchify for Person Re-identification

Robust and discriminative feature extraction is the key component in person re-identification (Re-ID). The major weakness of conventional convolutional neural network (CNN) based methods is that they cannot extract long-range information from diverse body parts, a limitation that can be alleviated by recently developed Transformers. Existing vision Transformers show their power on various vision tasks. However, they (i) cannot address translation problems and different viewpoints, and (ii) cannot capture the detailed features needed to discriminate people with a similar appearance. In this paper, we propose a powerful Re-ID baseline built on top of a pyramidal transformer with a conv-patchify operation, termed PTCR, which inherits the advantages of both CNNs and Transformers. The pyramidal structure captures multi-scale fine-grained features, while the conv-patchify operation enhances robustness against translation. Moreover, we design two novel modules to improve robust feature learning: a Token Perception module augments the patch embeddings to enhance robustness against perturbation and viewpoint changes, while an Auxiliary Embedding module integrates auxiliary information (camera ID, pedestrian attributes, etc.) to reduce the feature bias caused by non-visual factors. Extensive experiments and abundant ablation studies validate the superior performance of our method. Notably, without re-ranking, we achieve 98.0% Rank-1 on Market-1501 and 88.6% Rank-1 on MSMT17, significantly outperforming counterpart methods. The code is available at: https://github.com/lihe404/PTCR
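To make the two mechanisms named in the abstract concrete, the sketch below illustrates what a convolutional patch-embedding stem and a camera-ID auxiliary embedding can look like in PyTorch. This is a minimal illustration under our own assumptions, not the authors' implementation: all module names, layer choices, and parameters (ConvPatchify, AuxiliaryEmbedding, embed_dim, the two-stage stride-2 stem) are hypothetical; consult the linked repository for the actual PTCR code.

```python
import torch
import torch.nn as nn

class ConvPatchify(nn.Module):
    """Illustrative conv-patchify stem (assumed design, not the paper's code).

    Instead of slicing the image into non-overlapping patches with one
    large-stride projection (as in vanilla ViT), a stack of small stride-2
    convolutions downsamples gradually, which gives the patch tokens some
    robustness to small translations of the pedestrian in the crop.
    """
    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = self.stem(x)                      # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)   # (B, N, C) token sequence


class AuxiliaryEmbedding(nn.Module):
    """Illustrative auxiliary-information embedding (assumed design).

    A learnable embedding per camera ID is added to every patch token so the
    backbone can account for camera-specific, non-visual bias. Other side
    information (e.g. pedestrian attributes) could be embedded the same way.
    """
    def __init__(self, num_cameras, embed_dim=64):
        super().__init__()
        self.cam_embed = nn.Embedding(num_cameras, embed_dim)

    def forward(self, tokens, cam_ids):
        # tokens: (B, N, C); cam_ids: (B,) integer camera indices
        return tokens + self.cam_embed(cam_ids).unsqueeze(1)


# Usage sketch: 256x128 pedestrian crops from a hypothetical 6-camera setup.
imgs = torch.randn(2, 3, 256, 128)
cam_ids = torch.tensor([0, 5])
tokens = ConvPatchify()(imgs)                              # (2, 2048, 64)
tokens = AuxiliaryEmbedding(num_cameras=6)(tokens, cam_ids)
```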
