Fully Deep Simple Online Real-time Tracking: Efficient Re-Identification by Attention without Explicit Similarity Learning
Most existing Multi-Object Tracking methods consider detection and re-identification as two distinct steps. As a result, re-identification cannot leverage object location and relies on appearance alone, leading to ID merges when dealing with highly similar objects. The few works that combine detection and re-identification still generate an appearance descriptor for similarity computation. However, since the detection task conflicts with the tracking task, the network privileges the former and produces similar descriptors for objects of the same class, especially when class instances are visually alike. Besides, when using a motion model or a motion-prediction recurrent neural network to delimit the search area and overcome ID merges, the growth in uncertainty when those models are not updated often leads to ID switches. In this paper, we tackle these issues and propose to use the same model for detection and re-identification by leveraging attention between the features of two frames. By doing so, the network can make motion predictions without providing any appearance descriptor and without computing any learned similarity, thus eliminating the need for a motion prediction model and making the tracking trainable end-to-end. Our experimental results support our main contributions and show that our Fully DeepSORT significantly reduces the number of ID switches and merges, even when using non-class-agnostic non-maximum suppression. Besides, our model is more resistant to variations in the time lapse between two successive images, leading to improved tracking results.
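As a rough illustration only (not the authors' implementation), the core idea of re-identifying objects via attention between the features of two frames can be sketched with plain scaled dot-product attention: previous-frame object features act as queries over next-frame feature locations, and the attention map itself localizes each object, with no separate appearance descriptor or learned similarity. The function name and the toy orthonormal features below are hypothetical:

```python
import numpy as np

def cross_frame_attention(prev_obj_feats, next_frame_feats):
    """Scaled dot-product attention between frames.

    prev_obj_feats:   (N_obj, D) features of objects tracked in frame t.
    next_frame_feats: (N_loc, D) features of candidate locations in frame t+1.
    Returns an (N_obj, N_loc) attention map; its argmax per row gives each
    object's most likely new location, serving as a motion/re-ID prediction.
    """
    d = prev_obj_feats.shape[-1]
    scores = prev_obj_feats @ next_frame_feats.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn

# Toy example: 4 candidate locations with orthonormal features; the two
# tracked objects carry the features of locations 2 and 0, so attention
# re-identifies them there.
next_feats = np.eye(4)
prev_objs = next_feats[[2, 0]]
attn = cross_frame_attention(prev_objs, next_feats)
print(attn.argmax(axis=-1))  # → [2 0]
```

In the paper's setting, the attention would operate inside the shared detection network, so localization and identity preservation come from the same features rather than from a separate similarity head.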
Keyword(s)
Resistance, Visualization, Uncertainty, Recurrent neural networks, Tracking, Computational modeling, Predictive models
Full Text
File | Pages | Size | Access
---|---|---|---
Publisher's official version | 7 | 10 MB |
Author's final draft | 5 | 186 KB |