Dirty data are prevalent in time series, such as energy consumption or stock data. Existing data cleaning algorithms fall short in identifying dirty data and often make unsatisfactory cleaning decisions. To address these drawbacks, we leverage the inherent recurrent patterns in time series, treat them as analogous to fixed combinations in textual data, and incorporate the concept of perplexity. The cleaning problem thus becomes one of minimizing the perplexity of the time series under a given cleaning cost, and we design a four-phase algorithmic framework to tackle it. To ensure the framework's feasibility, we also conduct a brief analysis of the impact of dirty data and devise an automatic budget selection strategy. Moreover, to make the framework more generic, we introduce advanced solutions, including an improved probability calculation method grounded in homomorphic pattern aggregation and a greedy heuristic algorithm that saves resources. Experiments on 12 real-world datasets demonstrate the superiority of our methods.
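To make the perplexity analogy concrete, the following is a minimal sketch and not the paper's four-phase framework. It assumes equal-width binning to turn values into symbols, a bigram model with add-one smoothing, and hypothetical helper names (`discretize`, `bigram_model`, `perplexity`); none of these choices are taken from the paper. It only illustrates why a value that breaks a recurrent pattern tends to raise perplexity.

```python
import math
from collections import Counter

def discretize(values, lo, hi, bins):
    """Map each value to an integer symbol by equal-width binning over [lo, hi]."""
    width = (hi - lo) / bins
    return [min(bins - 1, max(0, int((v - lo) / width))) for v in values]

def bigram_model(symbols, vocab_size):
    """Return a function giving add-one-smoothed bigram probabilities (an assumed model)."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    prev_counts = Counter(symbols[:-1])

    def prob(prev, cur):
        return (pair_counts[(prev, cur)] + 1) / (prev_counts[prev] + vocab_size)

    return prob

def perplexity(symbols, prob):
    """Exponential of the average negative log-probability over all transitions."""
    n = len(symbols) - 1
    log_sum = sum(math.log(prob(a, b)) for a, b in zip(symbols, symbols[1:]))
    return math.exp(-log_sum / n)

# A strictly periodic symbol sequence, and a copy with one corrupted point.
clean = [0, 1, 2, 3] * 10
dirty = list(clean)
dirty[5] = 3  # breaks the recurring 0 -> 1 -> 2 pattern

prob = bigram_model(clean, vocab_size=4)
clean_ppl = perplexity(clean, prob)
dirty_ppl = perplexity(dirty, prob)
```

Under these assumptions, `dirty_ppl` exceeds `clean_ppl` because the corrupted point introduces two transitions the model has never seen. A cleaning procedure in this spirit would search for repairs that lower perplexity while keeping the total change within a budget.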
Xiaoyu Han
Haoran Xiong
Zhenying He
Proceedings of the ACM on Management of Data
Tsinghua University
Fudan University
Han et al. studied this question.
www.synapsesocial.com/papers/68e67e1cb6db643587607a59 — DOI: https://doi.org/10.1145/3654993