Abstract: This study presents an operation‐level scheduling framework for efficient deep learning inference on heterogeneous embedded systems. The work is motivated by the observation that deep neural networks comprise diverse operations whose execution latency depends strongly on the target hardware and input dimensions. The framework rests on the hypothesis that accurate latency prediction and fine‐grained scheduling of individual operations reduce end‐to‐end inference time. It follows a three‐stage approach: (i) offline profiling of operation latencies across varying input sizes and devices; (ii) training latency prediction models using input‐aware features; and (iii) directed acyclic graph (DAG)‐based runtime scheduling that assigns each operation to the central processing unit (CPU), the graphics processing unit (GPU), or both. Evaluated on two embedded platforms (Jetson Nano and ODROID‐XU4), the framework reduces inference latency by up to 74% across multiple deep learning models. These results indicate that the framework is adaptable, lightweight, and effective for resource‐constrained artificial intelligence deployments.
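To make the three-stage idea concrete, the following is a minimal sketch of how predicted per-operation latencies could drive a DAG-based device assignment. It is not the authors' implementation. The linear latency model, its coefficients, the fixed `transfer_ms` cost, and the names `LATENCY_MODEL`, `predict_latency`, and `schedule` are all assumptions made for illustration. The scheduler is a simple greedy earliest-finish-time heuristic, and it omits the option of splitting one operation across both devices.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical stand-in for stages (i) and (ii): a per-(operation, device)
# linear model, latency_ms = a * input_size + b. The coefficients are
# invented for illustration and are not measured values.
LATENCY_MODEL = {
    ("conv", "cpu"): (0.020, 1.0),
    ("conv", "gpu"): (0.004, 3.0),
    ("relu", "cpu"): (0.001, 0.1),
    ("relu", "gpu"): (0.0005, 2.0),
}


def predict_latency(op_type, device, input_size):
    """Return the predicted latency in milliseconds (assumed linear model)."""
    a, b = LATENCY_MODEL[(op_type, device)]
    return a * input_size + b


def schedule(ops, deps, transfer_ms=0.5):
    """Greedy earliest-finish-time assignment over a DAG of operations.

    ops:  name -> (op_type, input_size)
    deps: name -> set of predecessor names
    transfer_ms: assumed fixed cost when a predecessor ran on the other device
    """
    graph = {name: deps.get(name, set()) for name in ops}
    device_free = {"cpu": 0.0, "gpu": 0.0}  # time at which each device is idle
    placement = {}                          # name -> (device, finish_time)

    # Visit operations in dependency order so predecessors are placed first.
    for name in TopologicalSorter(graph).static_order():
        op_type, size = ops[name]
        best = None
        for dev in device_free:
            # An operation can start once all inputs have arrived on `dev`.
            ready = 0.0
            for pred in graph[name]:
                pred_dev, pred_finish = placement[pred]
                hop = transfer_ms if pred_dev != dev else 0.0
                ready = max(ready, pred_finish + hop)
            start = max(ready, device_free[dev])
            finish = start + predict_latency(op_type, dev, size)
            if best is None or finish < best[1]:
                best = (dev, finish)
        placement[name] = best
        device_free[best[0]] = best[1]

    makespan = max(finish for _, finish in placement.values())
    return placement, makespan


# Toy three-operation chain: conv1 -> relu1 -> conv2.
ops = {
    "conv1": ("conv", 1000),
    "relu1": ("relu", 1000),
    "conv2": ("conv", 100),
}
deps = {"relu1": {"conv1"}, "conv2": {"relu1"}}
placement, makespan = schedule(ops, deps)
for name, (device, finish) in placement.items():
    print(f"{name}: {device}, finishes at {finish:.1f} ms")
```

With these invented coefficients, the large convolution goes to the GPU while the cheap activation and the small convolution go to the CPU. This mirrors the abstract's premise that the best device depends on both the operation type and the input size.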
Kim et al. (Thu,) studied this question.