ABSTRACT This work presents a privacy‐preserving approach to big data publishing that integrates clustering and anonymization techniques within a distributed computing framework. The underlying hypothesis is that combining improved clustering algorithms with a refined k‐anonymity model can significantly reduce information loss while improving data utility, even at large scale. To test this, the methodology takes a hybrid approach that pairs the 2‐means clustering algorithm, enhanced with mean‐center initialization, with a greedy algorithm for data generalization. At its core is a new information loss function, based on information quantity theory, that weights each quasi‐identifier according to its influence on the sensitive attributes, enabling more precise and less destructive generalization. The implementation uses Apache Spark's RDD programming model for parallel processing, so that it scales efficiently to massive datasets. Experimental results show that the proposed algorithms outperform existing methods such as traditional k‐means, SparkDA, MRA, and SKA in both information loss and processing time. Specifically, the fuzzy c‐means variant achieves 25.5% information loss on a dataset of 100 million records, with processing times 31%–55% lower than those of Hadoop‐based solutions. The approach also maintains high data utility for statistical analysis and decision‐making, balancing privacy against utility. Overall, the study confirms that combining improved clustering, weighted information loss assessment, and distributed processing tools such as Apache Spark provides a scalable, efficient, and privacy‐preserving solution for big data anonymization, with promising implications for real‐time data publishing and analysis.
Pan et al. studied this question.
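The weighted information loss idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' function: it assumes, for illustration only, that a quasi-identifier's "influence" on the sensitive attribute is measured by empirical mutual information, and that the loss of generalizing a numeric attribute is its range within the group divided by its range in the whole dataset. The names `mutual_information`, `qi_weights`, and `weighted_loss` are hypothetical, and the code is plain Python rather than the Spark RDD implementation.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def qi_weights(records, qi_names, sensitive):
    """Assumed weighting: each quasi-identifier's share of the total
    mutual information with the sensitive attribute."""
    s = [r[sensitive] for r in records]
    mi = {q: mutual_information([r[q] for r in records], s) for q in qi_names}
    total = sum(mi.values())
    if total == 0:
        # No attribute is informative: fall back to uniform weights.
        return {q: 1 / len(qi_names) for q in qi_names}
    return {q: v / total for q, v in mi.items()}


def weighted_loss(cluster, records, weights):
    """Weighted loss of generalizing each numeric quasi-identifier in
    `cluster` to its [min, max] interval (assumed normalized-range loss)."""
    loss = 0.0
    for q, w in weights.items():
        lo = min(r[q] for r in records)
        hi = max(r[q] for r in records)
        if hi == lo:
            continue  # attribute is constant, so generalizing it loses nothing
        c_lo = min(r[q] for r in cluster)
        c_hi = max(r[q] for r in cluster)
        loss += w * (c_hi - c_lo) / (hi - lo)
    return loss


# Toy data: age fully determines the disease, zip is unrelated to it.
records = [
    {"age": 20, "zip": 1, "disease": "A"},
    {"age": 20, "zip": 2, "disease": "A"},
    {"age": 40, "zip": 1, "disease": "B"},
    {"age": 40, "zip": 2, "disease": "B"},
]
weights = qi_weights(records, ["age", "zip"], "disease")
print(weights)
print(weighted_loss(records[:2], records, weights))
print(weighted_loss([records[0], records[2]], records, weights))
```

In this toy dataset, age carries all of the information about the sensitive attribute and therefore receives all of the weight. A group that mixes ages incurs the maximum loss, while a group that mixes only zip values incurs none, which is the behavior a weighted loss is meant to encourage in a greedy generalization step.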