Favoring One Among Equals - Not a Good Idea: Many-to-one Matching for Robust Transformer based Pedestrian Detection
WACV 2024



[Figure: overview]

Abstract

We investigate why transformer-based pedestrian detection models perform worse than convolutional neural network (CNN) based ones. CNN models generate dense pedestrian proposals, refine each proposal individually, and then apply non-maximum suppression (NMS) to obtain sparse predictions. In contrast, transformer models select one proposal per ground-truth (GT) pedestrian box and backpropagate a positive gradient through it. All other proposals, many of them highly similar to the selected one, receive a negative gradient. Though this leads to sparse predictions and obviates the need for NMS, the arbitrary selection of one among many similar proposals hinders effective training and lowers pedestrian detection accuracy. To mitigate the problem, we replace the commonly used Kuhn-Munkres matching algorithm with a min-cost-flow based formulation that incorporates two constraints: each ground-truth box must be matched to at least one proposal, and many equally good proposals may be matched to a single ground-truth box. We propose the first transformer-based pedestrian detection model incorporating our matching algorithm. Extensive experiments reveal that our approach achieves a miss rate (lower is better) of 3.7 / 17.4 / 21.8 / 8.3 / 2.0 on the Eurocity / TJU-traffic / TJU-campus / Cityperson / Caltech datasets, compared to 4.7 / 18.7 / 24.8 / 8.5 / 3.1 for the current SOTA.
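To make the matching idea concrete, the following is a minimal sketch of many-to-one matching posed as a min-cost flow. It is an illustration under stated assumptions, not the formulation used in the paper: the function name `many_to_one_match`, the cap `max_per_gt`, the "equally good" threshold `tau`, and the bonus edge that enforces at least one match per GT box are all hypothetical choices made for this example. The flow is solved with successive shortest paths (Bellman-Ford), in pure Python.

```python
from math import inf

def many_to_one_match(cost, max_per_gt=3, tau=0.5):
    """cost[g][p] is the matching cost of GT box g and proposal p.
    Returns sorted (g, p) pairs. Hypothetical sketch, not the paper's code."""
    n_gt, n_prop = len(cost), len(cost[0])
    src, sink = 0, 1 + n_gt + n_prop
    n = sink + 1
    # Each edge is [target, capacity, cost, index of its reverse edge].
    graph = [[] for _ in range(n)]

    def add_edge(u, v, cap, w):
        graph[u].append([v, cap, w, len(graph[v])])
        graph[v].append([u, 0, -w, len(graph[u]) - 1])

    # Assumption: a bonus larger than any achievable path cost forces
    # every GT box to receive at least one proposal.
    big = 1.0 + sum(abs(c - tau) for row in cost for c in row)
    for g in range(n_gt):
        add_edge(src, 1 + g, 1, -big)            # mandatory first match
        add_edge(src, 1 + g, max_per_gt - 1, 0)  # optional extra matches
        for p in range(n_prop):
            # Shifted cost: negative only for proposals better than tau.
            add_edge(1 + g, 1 + n_gt + p, 1, cost[g][p] - tau)
    for p in range(n_prop):
        add_edge(1 + n_gt + p, sink, 1, 0)       # one GT per proposal

    while True:
        # Bellman-Ford shortest path in the residual graph.
        dist = [inf] * n
        prev = [None] * n
        dist[src] = 0.0
        for _ in range(n - 1):
            updated = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for i, (v, cap, w, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + w < dist[v] - 1e-9:
                        dist[v] = dist[u] + w
                        prev[v] = (u, i)
                        updated = True
            if not updated:
                break
        if dist[sink] >= -1e-9:  # no augmenting path lowers the total cost
            break
        v = sink
        while v != src:          # push one unit of flow along the path
            u, i = prev[v]
            graph[u][i][1] -= 1
            graph[v][graph[u][i][3]][1] += 1
            v = u

    matches = []
    for g in range(n_gt):
        for v, cap, _, _ in graph[1 + g]:
            if 1 + n_gt <= v < sink and cap == 0:  # saturated GT->proposal edge
                matches.append((g, v - 1 - n_gt))
    return sorted(matches)

# Toy example: proposals 0 and 1 are near-identical good fits for GT 0.
cost = [
    [0.10, 0.12, 0.90, 0.95, 0.80],
    [0.85, 0.90, 0.20, 0.70, 0.90],
]
print(many_to_one_match(cost))
```

In this toy example both near-identical proposals for the first GT box are kept as positives, whereas a one-to-one (Kuhn-Munkres) matching would have to pick one of them arbitrarily and treat the other as a negative.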

Performance Comparison

[Figure: results]
