Random Sample Partition: A Distributed Data Model for Big Data Analysis

Abstract

With the ever-increasing volume of data, alternative strategies are required to divide big data into statistically consistent data blocks that can be used directly as representative samples of the entire data set in big data analysis. In this paper, we propose the Random Sample Partition (RSP) distributed data model, which represents a big data set as a set of disjoint data blocks, called RSP blocks. Each RSP block has a probability distribution similar to that of the entire data set. RSP blocks can therefore be used to estimate the statistical properties of the data and to build predictive models without processing the entire data set. We demonstrate the implications of the RSP model for sampling from big data and introduce a new RSP-based method for approximate big data analysis that can be applied to different scenarios in industry. This method significantly reduces the computational burden of big data analysis and increases the productivity of data scientists.
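
The idea of disjoint blocks that each resemble the whole data set can be illustrated with a small single-machine sketch. The abstract does not say how RSP blocks are generated, so the code below assumes the simplest approach: shuffle the records uniformly at random, then cut them into equal-sized blocks. The function names `make_rsp_blocks` and `estimate_mean`, the synthetic Gaussian data, and the block counts are all hypothetical and chosen only for illustration; this is not the paper's algorithm.

```python
import random
import statistics


def make_rsp_blocks(records, n_blocks, seed=0):
    """Shuffle the records uniformly at random and split them into disjoint blocks.

    Assumption: a plain in-memory shuffle stands in for whatever distributed
    partitioning procedure the paper actually uses.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Strided slices of the shuffled list are disjoint and together cover every record.
    return [shuffled[i::n_blocks] for i in range(n_blocks)]


def estimate_mean(blocks, n_selected, seed=0):
    """Estimate the mean of the full data set from a few randomly chosen blocks."""
    rng = random.Random(seed)
    selected = rng.sample(blocks, n_selected)
    # Blocks have equal size here, so averaging the block means is unbiased.
    return statistics.fmean(statistics.fmean(block) for block in selected)


# Synthetic stand-in for a "big" data set: 100,000 draws from N(5, 2^2).
source = random.Random(1)
data = [source.gauss(5.0, 2.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, n_blocks=100)
approx = estimate_mean(blocks, n_selected=5)   # touches only 5% of the records
exact = statistics.fmean(data)                 # touches every record

print(f"block-based estimate: {approx:.3f}, full-data mean: {exact:.3f}")
```

In this sketch the estimate is computed from 5 of the 100 blocks, which is the sense in which block-level analysis avoids processing the entire data set. The same pattern would extend to other statistics or to fitting a model per block, under the same assumption that each block is a random sample of the whole.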
