February 14, 2024 · Open Access

Follow Anything: Open-Set Detection, Tracking, and Following in Real-Time

Abstract

Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this paper, we present a robotic system to detect, track, and follow any object in real time. Our approach, dubbed Follow Anything (FAn), is an open-vocabulary, multimodal model: it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, image, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source our code on our project webpage. We also encourage the reader to watch our 5-minute explainer video.
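
The query-matching step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a pre-trained model has already produced one descriptor per image location and that the query (text, image, or click) has been mapped into the same descriptor space. Detection then reduces to thresholding cosine similarity. The function names `match_query` and `click_query`, the array layout, and the threshold value are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import numpy as np


def match_query(features: np.ndarray, query: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean (H, W) mask of locations whose descriptor resembles the query.

    features: (H, W, D) array of per-location descriptors (assumed layout).
    query:    (D,) descriptor obtained from a text, image, or click query.
    """
    # L2-normalise both sides so the dot product equals cosine similarity.
    feats = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    similarity = feats @ q  # shape (H, W)
    return similarity >= threshold


def click_query(features: np.ndarray, row: int, col: int) -> np.ndarray:
    """A click query (illustrative): reuse the descriptor at the clicked location."""
    return features[row, col]
```

In this sketch a click at `(row, col)` selects every location whose descriptor points in roughly the same direction as the clicked one. A text or image query would plug into `match_query` the same way, provided it is embedded into the same space.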

Authors

Alaa Maalouf

Citigroup

Ninad Jadhav

Citigroup

Krishna Murthy Jatavallabhula

Moscow Institute of Thermal Technology

Journals

IEEE Robotics and Automation Letters

Institutions

Harvard University

Massachusetts Institute of Technology

Citigroup

Cite this study

Maalouf et al. (2024). Follow Anything: Open-Set Detection, Tracking, and Following in Real-Time. IEEE Robotics and Automation Letters.

synapsesocial.com/papers/68e792c7b6db643587703cbb
DOI: https://doi.org/10.1109/lra.2024.3366013

Also consider

Synapse has enriched 5 closely related papers on similar topics. Consider them for comparative context:

Hierarchical Text-Conditional Image Generation with CLIP Latents · 2022 · 2,282 citations
Go Closer to See Better: Camouflaged Object Detection via Object Area Amplification and Figure-Ground Conversion · 2023 · 122 citations
Open Vocabulary Scene Parsing · 2017 · 93 citations
A Multi-Modal Distributed Real-Time IoT System for Urban Traffic Control (Invited Paper) · 2024 · 14,319 citations
Emerging Properties in Self-Supervised Vision Transformers · 2021 · 4,977 citations
