What question did this study set out to answer?

The aim is to develop a privacy-preserving method for big data publishing that minimizes information loss and maximizes data utility.

March 10, 2026

An Improved k‐Anonymization Clustering Mechanisms Using Apache Spark for Privacy‐Preserving Big Data Publishing for Real‐Time Dataset

Key points

Aims to develop a privacy-preserving method for big data publishing that minimizes information loss and maximizes data utility.
Uses a hybrid approach that combines 2-means clustering with a greedy algorithm for data generalization.
Introduces a new information loss function based on information quantity theory.
Implements the solution with Apache Spark's RDD programming model for parallel processing.
Reports that the proposed algorithms significantly reduce information loss and processing time compared to traditional methods.
Reports that the fuzzy c-means variant achieved 25.5% information loss on a dataset of 100 million records.
Reports processing times 31%-55% lower than Hadoop-based solutions.
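
The clustering and generalization steps listed above can be illustrated with a minimal single-machine sketch. It is not the paper's algorithm: the summary does not define the mean-center initialization or the greedy generalization rule, so the code assumes that the first center is the record farthest from the overall mean, that the second is the record farthest from the first, and that splitting simply stops when a further split would leave a group with fewer than k records. The names `two_means`, `partition`, and `generalize` are hypothetical, and plain Python lists stand in for Spark RDDs.

```python
from statistics import mean


def dist2(a, b):
    """Squared Euclidean distance between two numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows):
    """Column-wise mean of a list of records."""
    return tuple(mean(col) for col in zip(*rows))


def two_means(rows, iters=10):
    """Split rows into two groups with 2-means.

    Assumed initialization (not taken from the paper): the first center is
    the record farthest from the overall mean, the second is the record
    farthest from the first center.
    """
    m = centroid(rows)
    c1 = max(rows, key=lambda r: dist2(r, m))
    c2 = max(rows, key=lambda r: dist2(r, c1))
    for _ in range(iters):
        left = [r for r in rows if dist2(r, c1) <= dist2(r, c2)]
        right = [r for r in rows if dist2(r, c1) > dist2(r, c2)]
        if not left or not right:
            break
        c1, c2 = centroid(left), centroid(right)
    return left, right


def partition(rows, k):
    """Split recursively; stop when a split would create a group smaller than k."""
    if len(rows) < 2 * k:
        return [rows]
    left, right = two_means(rows)
    if len(left) < k or len(right) < k:
        return [rows]
    return partition(left, k) + partition(right, k)


def generalize(group):
    """Replace every record by the (min, max) range of each quasi-identifier."""
    ranges = tuple((min(col), max(col)) for col in zip(*group))
    return [ranges] * len(group)


# Illustrative records: (age, salary) as quasi-identifiers.
rows = [(25, 50000), (27, 52000), (26, 51000),
        (60, 90000), (62, 95000), (61, 91000)]
groups = partition(rows, k=3)
published = [rec for g in groups for rec in generalize(g)]
```

Every record in `published` shares its generalized values with at least k - 1 others, which is the k-anonymity property. The attributes are left unscaled here, so salary dominates the distance; a real pipeline would normally normalize the quasi-identifiers first.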

Abstract

This work presents a comprehensive study of a novel privacy-preserving approach for big data publishing that integrates advanced clustering and anonymization techniques within a distributed computing framework. The underlying hypothesis is that combining improved clustering algorithms with a refined k-anonymity model can significantly reduce information loss while enhancing data utility, even at large scales. To test this, the methodology uses a hybrid approach: the 2-means clustering algorithm, enhanced with a mean-center initialization method, paired with a greedy algorithm for data generalization. At its core is a new information loss function based on information quantity theory, which weights quasi-identifiers according to their influence on sensitive attributes and so enables more precise, less destructive generalization. The implementation uses Apache Spark's RDD programming model for parallel processing, which provides scalability and efficiency on massive datasets. Experimental results show that the proposed algorithms outperform existing methods such as traditional k-means, SparkDA, MRA, and SKA in both information loss and processing time. Specifically, the fuzzy c-means variant achieves 25.5% information loss on a dataset of 100 million records, with processing times reduced by 31%–55% compared to Hadoop-based solutions. The approach also maintains high data utility for statistical analysis and decision-making, balancing privacy and utility. Overall, the study concludes that combining improved clustering, weighted information loss assessment, and distributed processing tools such as Apache Spark provides a scalable, efficient, and privacy-preserving solution for big data anonymization, with promising implications for real-time data publishing and analysis.
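
The idea of weighting quasi-identifiers by their influence on the sensitive attribute can be sketched as follows. The summary does not give the paper's loss formula, so this is only an illustration under stated assumptions: mutual information is used as a stand-in for "influence", weights are normalized to sum to 1, and the loss of a group is the weighted sum of its normalized range widths. The function names and the sample data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned discrete columns."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def qi_weights(qi_columns, sensitive):
    """Assumed weighting: each quasi-identifier's share of total mutual information."""
    mi = {name: mutual_information(col, sensitive)
          for name, col in qi_columns.items()}
    total = sum(mi.values())
    if total == 0:
        # No attribute is informative; fall back to equal weights.
        return {name: 1 / len(mi) for name in mi}
    return {name: value / total for name, value in mi.items()}


def weighted_loss(group_ranges, domain_ranges, weights):
    """Weighted sum of normalized range widths for one generalized group."""
    loss = 0.0
    for name, (lo, hi) in group_ranges.items():
        dlo, dhi = domain_ranges[name]
        width = (hi - lo) / (dhi - dlo) if dhi > dlo else 0.0
        loss += weights[name] * width
    return loss


# Illustrative data: age determines the diagnosis, zip is unrelated to it.
qi = {"age": [25, 25, 61, 61], "zip": [1000, 2000, 1000, 2000]}
diagnosis = ["flu", "flu", "cancer", "cancer"]
weights = qi_weights(qi, diagnosis)

# First two records generalized together: age stays exact, zip becomes a range.
loss = weighted_loss(
    group_ranges={"age": (25, 25), "zip": (1000, 2000)},
    domain_ranges={"age": (25, 61), "zip": (1000, 2000)},
    weights=weights,
)
```

In this toy data, zip carries no information about the diagnosis, so fully generalizing it costs nothing under the weighted loss, while an unweighted average of range widths would charge 0.5 for the same group. That is the sense in which weighting can steer generalization toward less informative attributes.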
