What question did this study set out to answer?

This research aims to enhance machine vision capabilities through a framework that leverages event-based sensors for edge AI computing.

February 2, 2026 · Open Access

Event-Based Machine Vision for Edge AI Computing

Key Points

Developed a machine vision framework utilizing a Dynamic Vision Sensor (DVS).
Implemented timestamp-based, polarity-agnostic recency encoding to process data efficiently.
Applied network optimizations, including architectural reduction and mixed-bit quantization.
Conducted experiments on human/object detection, 2D pose estimation, and hand posture recognition.
Achieved 0.908 action recognition accuracy with recency encoding, versus 0.896 with temporal accumulation, using the same downstream network.
Reduced human detection computation from 5.8 G to 81 M FLOPs and runtime from 172 ms to 15 ms (more than 11× speed-up).
Decreased pose estimation model size from 127 MB to 19 MB with comparable accuracy (mAP from 0.95 to 0.94).
Obtained 99.19% recall and 0.0926% FAR in hand posture recognition with 14.31 ms latency on a single CPU core.
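The reduction factors behind these figures can be checked with simple arithmetic. The short sketch below uses only numbers quoted in the abstract; the helper name `reduction_factor` is illustrative and not taken from the paper. It suggests that the "more than 11×" speed-up refers to runtime (172 ms to 15 ms), while the FLOP count drops by a much larger factor of roughly 72×.

```python
# Sanity check of the reduction factors implied by the reported figures.
# All inputs are numbers quoted in the abstract; the helper is hypothetical.

def reduction_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after

# Human detection: computation (FLOPs) and runtime (ms).
flops_ratio = reduction_factor(5.8e9, 81e6)
runtime_ratio = reduction_factor(172.0, 15.0)

# Pose estimation: model size (MB) and inference time (ms).
size_ratio = reduction_factor(127.0, 19.0)
pose_runtime_ratio = reduction_factor(70.0, 6.0)

print(f"Human detection FLOPs reduction:   {flops_ratio:.1f}x")
print(f"Human detection runtime speed-up:  {runtime_ratio:.1f}x")
print(f"Pose model size reduction:         {size_ratio:.1f}x")
print(f"Pose inference speed-up:           {pose_runtime_ratio:.1f}x")
```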

Abstract

Event-based sensors provide sparse, motion-centric measurements that can reduce data bandwidth and enable always-on perception on resource-constrained edge devices. This paper presents an event-based machine vision framework for smart-home AIoT that couples a Dynamic Vision Sensor (DVS) with compute-efficient algorithms for (i) human/object detection, (ii) 2D human pose estimation, and (iii) hand posture recognition for human–machine interfaces. The main methodological contributions are a timestamp-based, polarity-agnostic recency encoding that preserves moving-edge structure while suppressing static background, and task-specific network optimizations (architectural reduction and mixed-bit quantization) tailored to sparse event images. With a fixed downstream network, the recency encoding improves action recognition accuracy over temporal accumulation (0.908 vs. 0.896). In a 24 h indoor monitoring experiment (640 × 480), the raw DVS stream is about 30× smaller than conventional CMOS video and remains about 5× smaller after standard compression. For human detection, the optimized event processing reduces computation from 5.8 G to 81 M FLOPs and runtime from 172 ms to 15 ms (more than 11× speed-up). For pose estimation, a pruned HRNet reduces model size from 127 MB to 19 MB and inference time from 70 ms to 6 ms on an NVIDIA Titan X while maintaining comparable accuracy (mAP from 0.95 to 0.94) on MS COCO 2017, evaluated on synthetic event streams generated by an event simulator. For hand posture recognition, a compact CNN achieves 99.19% recall and 0.0926% FAR with 14.31 ms latency on a single i5-4590 CPU core using 10-frame sequence voting. These results indicate that event-based sensing combined with lightweight inference is a practical approach to privacy-friendly, real-time perception under strict edge constraints.
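To make the contrast between recency encoding and temporal accumulation concrete, here is a minimal sketch. The abstract does not give the exact encoding formula, so the linear decay over a fixed time window, the function names, and the parameters `t_ref` and `window` are all assumptions made for illustration, not the authors' implementation. Polarity is simply never read, which is one way to be polarity-agnostic.

```python
import numpy as np

def recency_image(xs, ys, ts, t_ref, window, shape):
    """Polarity-agnostic recency map (assumed linear decay).

    1.0 means an event arrived at t_ref; 0.0 means the latest event is
    at least `window` old, or the pixel never fired.
    """
    last_t = np.full(shape, -np.inf)
    # Keep only the most recent timestamp per pixel; polarity is ignored.
    np.maximum.at(last_t, (ys, xs), ts)
    age = t_ref - last_t
    return np.clip(1.0 - age / window, 0.0, 1.0)

def accumulation_image(xs, ys, shape):
    """Temporal accumulation baseline: per-pixel event count."""
    counts = np.zeros(shape, dtype=np.int32)
    np.add.at(counts, (ys, xs), 1)
    return counts

# Toy 1 x 4 sensor with four events; timestamps are in microseconds.
xs = np.array([1, 1, 2, 3])
ys = np.array([0, 0, 0, 0])
ts = np.array([10_000.0, 40_000.0, 30_000.0, 0.0])

rec = recency_image(xs, ys, ts, t_ref=40_000.0, window=40_000.0, shape=(1, 4))
acc = accumulation_image(xs, ys, shape=(1, 4))
print(rec)  # newest pixels are brightest
print(acc)  # busiest pixels are brightest
```

In this toy case the recency map ranks pixels by how recently they fired, whereas accumulation ranks them by how often they fired, so the pixel with a single old event fades to zero in the former but still counts in the latter.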
