ABSTRACT Although energy has become a major concern in data processing systems, it is usually hard to understand in depth how performance and energy consumption relate to each other when configuring a computing environment to execute a specific data‐oriented workload. In this paper, we propose a multi‐layered methodology to analyze the energy consumption of big data workloads executed with Apache Spark in virtualized cloud environments. The approach is structured into three layers: resource provisioning, system‐level resource utilization, and application‐level resource utilization. Combining direct energy measurements from a Power Distribution Unit (PDU) with detailed system monitoring, the study investigates how infrastructure choices and workload characteristics influence energy consumption. Results show that optimal virtual machine configurations depend on workload type and input size; while provisioning decisions affect energy consumption, system‐level metrics such as CPU utilization and disk I/O offer a deeper understanding of the observed performance and energy consumption results. Applying the methodology also reveals the impact of task distribution and resource under‐utilization on overall energy efficiency. The findings demonstrate that energy optimization in big data environments requires a comprehensive understanding of factors across the infrastructure, system, and application layers. The proposed methodology serves as a practical guide for energy‐aware design and decision‐making in cloud‐based data processing systems.
Volpini et al. (Sun,) studied this question.