Cloud infrastructures can leverage massive computational and data processing capabilities in virtualized environments. Emerging cloud applications are data intensive, which has led to the adoption of data-parallel frameworks, such as Hadoop and its many descendants, to handle such massive data requirements. Job scheduling in these frameworks is in essence a two-step process in which the distribution of data blocks among resources is followed by the mapping of computations onto those resources. Since most Hadoop-based systems make these two decisions independently, a promising prospect is to map computations onto cloud resources based on the data blocks already distributed to them. This paper proposes the data partitioning and placement aware computation scheduling scheme (DPPACS), a data and computation scheduling framework that improves the co-allocation of computation and data within a Hadoop cloud infrastructure using knowledge of data block availability. Accordingly, this paper proposes a data-partitioning algorithm, a novel partition-cum-placement algorithm, and a computation scheduling algorithm that exploits knowledge of data availability at different clusters. DPPACS has been implemented on a test bed, and its performance is compared against Hadoop's default data placement strategy. The experiments demonstrate the efficacy of the proposed DPPACS.
Reddy et al. studied this question.
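The general idea of mapping computations onto resources that already hold their data blocks can be illustrated with a minimal sketch. This is not the DPPACS algorithm, whose details are not given in the abstract; it is a simple greedy heuristic written for illustration. The function names (`schedule_by_locality`, `locality_ratio`), the per-node slot model, and the tie-breaking rule are all assumptions.

```python
from collections import defaultdict


def schedule_by_locality(task_blocks, block_locations, slots_per_node):
    """Greedy, illustrative placement (an assumption, not DPPACS itself).

    task_blocks:     {task: [block ids the task reads]}
    block_locations: {block id: [nodes holding a replica]}
    slots_per_node:  {node: number of tasks it can run}
    Returns {task: node}.
    """
    free = dict(slots_per_node)  # copy so the caller's dict is untouched
    assignment = {}
    for task, blocks in task_blocks.items():
        # Count how many of this task's input blocks each node already holds.
        local_counts = defaultdict(int)
        for block in blocks:
            for node in block_locations.get(block, ()):
                local_counts[node] += 1
        candidates = [node for node in free if free[node] > 0]
        if not candidates:
            raise RuntimeError("no free slots left for task %r" % task)
        # Prefer the node with the most local blocks; break ties by free slots.
        best = max(candidates, key=lambda n: (local_counts[n], free[n]))
        assignment[task] = best
        free[best] -= 1
    return assignment


def locality_ratio(assignment, task_blocks, block_locations):
    """Fraction of input blocks that are read from the node running the task."""
    total = local = 0
    for task, node in assignment.items():
        for block in task_blocks[task]:
            total += 1
            if node in block_locations.get(block, ()):
                local += 1
    return local / total if total else 1.0
```

In this sketch, a task falls back to a node without its data only when every node holding its blocks is out of slots, and `locality_ratio` gives a rough measure of how well computation and data ended up co-allocated. A scheme that also controls partitioning and placement, as the abstract describes, could raise that ratio by deciding where blocks go before scheduling begins.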