In the era of surging robotic applications, robots are becoming increasingly mobile and ubiquitous, participating in diverse tasks ranging from household chores to navigating complex urban environments. This surge is attributed to advances in sensing, computing, and artificial intelligence, leading to the development of sophisticated autonomous systems. In contrast to industrial robots designed for factory automation, mobile robots are designed to operate in complex and unstructured environments, necessitating advanced scene understanding capabilities for safe interactions. One such capability is person detection, which is the focus of this thesis. While mature solutions exist for detecting persons using RGB(-D) cameras and deep learning-based object detectors, these come with limitations such as restricted field of view, increased computation with multi-camera setups, and privacy concerns. LiDAR sensors offer an alternative with advantages including a wide field of view, accurate range measurement, and reduced privacy intrusion. However, efficiently detecting persons with LiDAR sensors presents challenges due to the unique data structure of point clouds and the sparsity of the range scan.

This thesis proposes novel techniques to overcome these challenges and design efficient person detection algorithms for LiDAR sensors. The first contribution is DR-SPAAM, a fast person detector for 2D LiDAR sensors on mobile robots. The key component of DR-SPAAM is a novel SPatial Attention and Auto-regressive Module, which learns to associate and aggregate measurements from consecutive scans, addressing challenges from the sparse point cloud with minimal computational overhead. DR-SPAAM establishes a new state-of-the-art on the DROW dataset, reaching 70.3% AP. In addition, it has a high inference rate of 87.2 FPS on a laptop with a dedicated GPU or 22.6 FPS on an NVIDIA Jetson AGX with an embedded GPU.
Further experiments show that a pretrained DR-SPAAM is robust against changes in sensor specifications such as sampling resolution and rate, making it easy to deploy on platforms equipped with different sensors. The high inference rate and the robustness against changing sensor specifications make DR-SPAAM well-suited for mobile robotic applications.

In contrast to the abundance of labeled images, only a few datasets exist for training 2D-LiDAR-based person detectors such as DR-SPAAM. This scarcity of data may cause the network to rely on spurious correlations that are present only in the training data, leading to poor generalization during deployment. To mitigate this problem, we propose an approach that automatically generates training labels, called pseudo-labels, for the 2D LiDAR sensor, using a calibrated camera with an off-the-shelf image-based detector. We conduct experiments to confirm that training or fine-tuning a detector with these pseudo-labels improves its performance during deployment, and that the performance can be further improved with robust training techniques such as mixup regularization and partially Huberized cross-entropy loss. With the proposed approach, a mobile robot can fine-tune its person detector during deployment, with no additional manual labeling cost.

Despite their effectiveness and affordability, 2D LiDAR sensors are limited to measuring a single plane. In contrast, 3D LiDAR sensors scan multiple planes at various heights due to their additional rotation angle, but they come with higher costs. This difference presents a trade-off that should be considered when selecting sensors for new robots. We conduct experiments to understand the performance gap between these two types of LiDAR sensors for detecting persons, aiming to facilitate informed decisions on sensor selection.
We use the state-of-the-art CenterPoint and DR-SPAAM person detectors as proxies for 3D and 2D LiDAR sensors, respectively, and compare their performance in multiple aspects, including detection accuracy, inference speed, localization performance, and robustness. Our results show that 2D LiDAR sensors can detect visible persons as accurately as 3D LiDAR sensors. However, it is much easier for persons to be fully occluded in the single-plane 2D LiDAR scan, making detection impossible. For applications such as collision avoidance, where detecting nearby visible persons is sufficient, 2D LiDAR sensors are a good choice due to their high inference speed and low cost. However, for applications requiring reliable person detection over an extended range, the more expensive 3D LiDAR sensors should be used.

The permutation-equivariant attention mechanism, originally proposed for language tasks, is a promising operation for processing LiDAR scans. However, its quadratic space complexity with respect to the number of points prohibits its application to large point clouds. To address this issue, we propose a memory-efficient variant of the attention mechanism, called Global Hierarchical Attention (GHA), which approximates the regular attention by computing local attention on a series of resolution hierarchies. GHA offers two key advantages. First, it has linear complexity relative to the number of points, allowing it to process large point clouds. Second, GHA is inherently biased towards spatially close points, giving more attention to local structures while maintaining global connectivity among all points. Combined with a feedforward network, GHA can be inserted into many existing network architectures. Our experiments with multiple baseline networks show that adding GHA consistently improves performance across various tasks and datasets. For the task of semantic segmentation, GHA gives a +1.7% mIoU increase to the MinkowskiEngine baseline on ScanNet.
For the 3D object detection task, GHA improves the CenterPoint baseline by +0.5% mAP on the nuScenes dataset, and the 3DETR baseline by +2.1% mAP25 and +1.5% mAP50 on ScanNet.
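
The memory argument behind GHA (quadratic cost for regular attention, linear cost for local attention over a series of resolution hierarchies) can be illustrated with a small counting sketch. This is not the thesis implementation: the function names, the neighbourhood size `window`, and the coarsening factor `factor` are assumptions introduced only for illustration, and the code counts attention-score entries rather than computing attention.

```python
def full_attention_entries(num_points: int) -> int:
    """Regular attention: every point attends to every point, N * N scores."""
    return num_points * num_points


def hierarchical_local_entries(num_points: int, window: int = 16, factor: int = 2) -> int:
    """Total local-attention scores summed over a coarsening hierarchy.

    Assumptions (hypothetical, for illustration only): each level keeps
    ceil(n / factor) points of the previous level, and every point attends
    to at most `window` neighbours on its own level.
    """
    if factor < 2:
        raise ValueError("factor must be at least 2 so the hierarchy shrinks")
    total = 0
    n = num_points
    while n > 1:
        total += n * min(window, n)  # local attention on this level
        n = -(-n // factor)          # ceiling division: next, coarser level
    return total


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        full = full_attention_entries(n)
        local = hierarchical_local_entries(n)
        print(f"N={n}: full={full}, hierarchical={local}, ratio={full / local:.1f}")
```

Because the level sizes form a geometric series, the hierarchical total stays within a constant multiple of `window * N`, whereas the full count grows with the square of `N`. Under these assumptions, doubling the number of points roughly doubles the hierarchical count but quadruples the full one.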
Dan Jia (Wed,) studied this question.