What question did this study set out to answer?

This research aims to understand the relationship between performance and energy consumption in data processing systems using Apache Spark.

February 6, 2026. Open Access

A Multi‐Layered Analysis of Energy Consumption in Spark

Key Points

Developed a multi-layered methodology for analyzing energy consumption
Conducted direct energy measurements with a Power Distribution Unit (PDU)
Monitored system-level and application-level resource utilization
Investigated infrastructure choices and workload characteristics
Optimal virtual machine configurations depend on workload type and input size
Provisioning decisions significantly affect energy consumption
System-level metrics such as CPU utilization and disk I/O provide insights into performance and energy consumption
Task distribution and resource under-utilization impact overall energy efficiency

Abstract

Although energy has become a major concern in data processing systems, it is usually hard to get a deep understanding of how performance and energy consumption relate to each other when planning how to configure a computing environment to execute a specific data‐oriented workload. In this paper, we propose a multi‐layered methodology to analyze the energy consumption of big data workloads executed using Apache Spark in virtualized cloud environments. The approach is structured into three layers: resource provisioning, system‐level resource utilization, and application‐level resource utilization. Combining direct energy measurements from a Power Distribution Unit (PDU) with detailed system monitoring, the study investigates how infrastructure choices and workload characteristics influence energy consumption. Results show that optimal virtual machine configurations depend on workload type and input size; while provisioning decisions affect energy consumption, system‐level metrics such as CPU utilization and disk I/O offer a deeper understanding of the final performance versus energy consumption results. Applying the methodology also reveals the impact of task distribution and resource under‐utilization on overall energy efficiency. The findings demonstrate that energy optimization in big data environments requires a comprehensive understanding of factors across infrastructure, system, and application layers. The proposed methodology serves as a practical guide for energy‐aware design and decision‐making in cloud‐based data processing systems.
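
As an illustration of how direct power measurements can be turned into an energy figure for a workload run, the following minimal sketch integrates timestamped power samples with the trapezoidal rule. It is not taken from the paper: the sampling format, the function names (`energy_joules`, `joules_to_wh`), and the example readings are all assumptions made for illustration, and the abstract does not say how the authors aggregate their PDU readings.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power samples (watts) over time (seconds) using the trapezoidal rule.

    Hypothetical helper: assumes the PDU exposes (timestamp, power) samples.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power readings must have the same length")
    if len(timestamps_s) < 2:
        # A single sample spans no time interval, so no energy can be attributed.
        return 0.0
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_to_wh(joules: float) -> float:
    """Convert joules to watt-hours (1 Wh = 3600 J)."""
    return joules / 3600.0


# Made-up readings: one sample per second during a short job.
timestamps = [0.0, 1.0, 2.0, 3.0]
power = [100.0, 110.0, 120.0, 110.0]
job_energy = energy_joules(timestamps, power)
```

Comparing configurations under this sketch would mean running the same workload on each configuration and comparing both the runtime and the integrated energy, since a faster run at higher power draw may or may not use less energy overall.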


Cite This Study

Volpini et al. studied this question.

synapsesocial.com/papers/698585ea8f7c464f23009ae2 https://doi.org/10.1002/cpe.70565
