Demand is growing for real-time intelligent systems that execute multiple deep neural network (DNN) models simultaneously for tasks such as object recognition, detection, and tracking. However, running multiple DNNs concurrently in resource-constrained embedded environments leads to contention for limited system resources, and the resulting execution delays can cause critical problems in latency-sensitive processing. This paper proposes a dynamic scheduling technique that divides DNN models into functional units called blocks and uses these blocks as the units of execution. When different models run in parallel, the technique identifies the blocks for which parallel execution actually increases execution time and runs those blocks sequentially. Furthermore, to minimize execution delays while maintaining accuracy, we propose a dynamic lightweight replacement technique that, at runtime, replaces blocks expected to incur large execution delays with lightweight blocks. This technique uses LAG, a metric that quantifies the degree of execution delay for each block, to dynamically balance execution delay against accuracy. Experimental results show that, when multiple heterogeneous DNNs run simultaneously on a commercial off-the-shelf board, the proposed technique improves latency by up to 29.3% while maintaining 90% of baseline accuracy.
Kim et al. (Mon,) studied this question.
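
As an illustration only, the runtime replacement decision described above might be sketched as follows. The abstract does not define LAG, so the relative-delay formula used here, the `Block` fields, the `choose_variant` helper, and the threshold value are all assumptions made for this sketch and not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A functional unit of a DNN model (all fields are hypothetical)."""
    name: str
    expected_ms: float  # assumed: latency profiled when the block runs alone


def lag(observed_ms: float, expected_ms: float) -> float:
    """Assumed LAG: relative delay of a block versus its profiled latency."""
    return (observed_ms - expected_ms) / expected_ms


def choose_variant(block: Block, observed_ms: float, threshold: float = 0.2) -> str:
    """Pick the lightweight variant when the assumed LAG exceeds the threshold."""
    if lag(observed_ms, block.expected_ms) > threshold:
        return "light"  # trade some accuracy for lower delay
    return "full"       # delay is tolerable, keep the original block


if __name__ == "__main__":
    backbone = Block(name="backbone_stage3", expected_ms=10.0)
    # Observed latencies under contention (made-up numbers for illustration).
    for observed in (11.0, 15.0):
        print(backbone.name, observed, choose_variant(backbone, observed))
```

In this sketch, raising `threshold` favors accuracy (fewer replacements), while lowering it favors latency, which mirrors the balance between execution delay and accuracy that the abstract attributes to LAG.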