Since the advent of Industry 4.0 and the resulting digitalisation of manufacturing, an ever-increasing amount of data has become available. Machine learning experts aim to exploit this data for various regression and classification tasks. One major challenge is integrating newly gathered sensor data, whether it arrives as additional samples over time or as entirely new features from newly added sensors. This paper presents a software architecture based on Apache Kafka, Flink, and Trino and demonstrates how to automatically integrate and consolidate a dynamically growing set of samples and features as a basis for future machine learning tasks. The approach is validated with an artificial data generation setup in which different machine learning techniques are compared on a regression task to highlight potential improvements.
Neuhauser et al. studied this question.