PA-Dino

Abstract

Despite the significant progress made in pedestrian detection over the last decade, detecting pedestrians under heavy occlusion remains a challenging problem. In state-of-the-art (SOTA) convolutional neural network (CNN) based models, the failure is attributed to non-maximum suppression (NMS), which often erroneously deletes true positives when one pedestrian occludes another. SOTA transformer-based models have no such NMS step, yet they still fail to detect highly occluded pedestrians. In this paper, we study the reasons for these failures. We observe that such models first predict key-points and then compute attention at those specific key-points. Our analysis reveals that the key-points have no preference for semantically important body parts. Under heavy occlusion, such key-points end up attending to non-discriminative regions or background, leading to false negatives. We take inspiration from the conventional wisdom of detecting objects using their parts, and bias the attention of the proposed transformer architecture towards semantically important and highly discriminative human body parts. The intervention leads to SOTA results on the benchmark Citypersons and Caltech datasets, achieving miss-rates (lower is better) of 30.75% and 32.96% respectively, against 32.6% and 38.2% for the current SOTA.
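
The NMS failure mode mentioned in the abstract can be illustrated with a minimal sketch of greedy NMS. This is not code from the paper: the box coordinates, the scores, and the IoU threshold of 0.5 are assumptions chosen only to show how a correct detection of an occluded pedestrian can be suppressed by a higher-scoring neighbour.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop any box overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any already-kept box
        # by more than the threshold.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical scene: three distinct pedestrians, all correctly detected.
front = (100, 50, 160, 230)   # fully visible pedestrian
behind = (115, 50, 175, 230)  # second pedestrian, mostly hidden by the first
far = (300, 60, 350, 220)     # isolated pedestrian elsewhere in the image

boxes = [front, behind, far]
scores = [0.95, 0.80, 0.90]

kept = greedy_nms(boxes, scores)
print(kept)  # indices of surviving boxes; index 1 (behind) is suppressed
```

In this toy scene the boxes for `front` and `behind` have an IoU of 0.6, so the lower-scoring but correct detection of the occluded pedestrian is discarded, which is exactly the kind of false negative the abstract attributes to NMS.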

Parts based Attention for Highly Occluded Pedestrian Detection with Transformers
ICIP 2023
