ML deployments in real-world settings are often heterogeneous, spanning local multi-core CPU operations, GPU-accelerated computations, and distributed execution on platforms such as Apache Spark. This heterogeneity arises from practical necessity: a feature engineering step might run locally, a large matrix multiplication offloads to Spark when data exceeds driver memory, and DNN layers execute on GPUs. Or an initial homogeneous setup might evolve over time into a heterogeneous one. This multi-backend reality creates a systems challenge that has received surprisingly little research attention: How to holistically manage computation reuse and memory across backends with fundamentally different execution models, memory hierarchies, and caching primitives?
Arun Kumar (Thu,) studied this question.