Data volumes are growing rapidly in many fields, and data often arrives as a stream. When massive streaming data is generated far faster than a single node can process it, and the data must be processed in real time, traditional centralized learning models struggle to meet the efficiency requirements. Online distributed learning models have emerged in response. Moreover, as time progresses, new features may continuously emerge from various sources in a distributed and heterogeneous fashion. We therefore study the problem of online distributed heterogeneous streaming feature selection and propose a novel framework, named DHSFS, to address it. The framework comprises two main components: sub-node streaming feature selection and global information synchronization. In the sub-node component, each sub-node uses a dynamic strategy to select strongly relevant features, discard irrelevant ones, and cache weakly relevant ones. In the global information synchronization stage, each sub-node synchronizes statistical information with the master node so that the global thresholds can be adjusted dynamically. Finally, the features selected by each sub-node are aggregated and output. Experiments on 16 datasets show that DHSFS achieves both high prediction accuracy and high efficiency in online streaming feature selection.
Zhou et al. (Mon,) studied this question.
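The two-component structure described in the abstract (sub-node selection with a cache for weakly relevant features, followed by threshold synchronization through a master node) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the relevance measure (absolute Pearson correlation with the label), the names `SubNode`, `synchronize`, and `summarize`, the default thresholds, and the update rule that sets the global thresholds to the pooled mean relevance plus or minus a fixed `margin`.

```python
from dataclasses import dataclass, field
from math import sqrt


def abs_pearson(x, y):
    """Absolute Pearson correlation; an assumed stand-in for a relevance measure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant feature carries no information about the label
    return abs(cov) / sqrt(vx * vy)


@dataclass
class SubNode:
    """Hypothetical sub-node: selects, caches, or discards each arriving feature."""
    labels: list
    upper: float = 0.6  # at or above: strongly relevant, selected
    lower: float = 0.2  # below: irrelevant, discarded
    selected: dict = field(default_factory=dict)
    cached: dict = field(default_factory=dict)
    score_sum: float = 0.0
    score_count: int = 0

    def _place(self, name, score):
        # Remove any previous placement, then file the feature by its score.
        self.selected.pop(name, None)
        self.cached.pop(name, None)
        if score >= self.upper:
            self.selected[name] = score
        elif score >= self.lower:
            self.cached[name] = score
        # Otherwise the feature is dropped.

    def observe(self, name, values):
        """Process one streaming feature as it arrives."""
        score = abs_pearson(values, self.labels)
        self.score_sum += score
        self.score_count += 1
        self._place(name, score)

    def stats(self):
        """Statistics sent to the master node."""
        return self.score_sum, self.score_count

    def apply_thresholds(self, upper, lower):
        """Adopt new global thresholds and re-evaluate the retained features."""
        self.upper, self.lower = upper, lower
        for name, score in list({**self.selected, **self.cached}.items()):
            self._place(name, score)


def synchronize(nodes, margin=0.2):
    """Master step (assumed rule): thresholds = pooled mean relevance +/- margin."""
    total = sum(node.stats()[0] for node in nodes)
    count = sum(node.stats()[1] for node in nodes)
    if count == 0:
        return
    mean = total / count
    upper, lower = min(1.0, mean + margin), max(0.0, mean - margin)
    for node in nodes:
        node.apply_thresholds(upper, lower)


def summarize(nodes):
    """Collect the features selected across all sub-nodes."""
    return sorted(name for node in nodes for name in node.selected)
```

In this sketch, a cached feature is re-examined only when the thresholds change, so a weakly relevant feature can later be promoted or dropped after a synchronization round. Sending only a running sum and a count keeps the synchronization message small regardless of how many features a sub-node has seen.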