The rapid rise of artificial intelligence (AI) and machine learning (ML) has changed the requirements that modern data infrastructure must meet and has exposed the latency, fragility, and scalability problems of traditional extract–transform–load (ETL) pipelines. ETL pipelines were originally designed for structured, batch-oriented processing, and they struggle to support real-time analytics, anomaly detection, and on-demand feedback of results into ML models, all of which AI-driven applications require. Three paradigms have emerged to address these limitations: Zero-ETL, which eliminates the need for pre-processing by ingesting events natively with schema-on-read; declarative pipelines, which replace imperative, script-style orchestration with outcome-oriented logic; and data contracts, which formalize producer-consumer agreements on quality, validation, and governance. This article reviews the current state of research and practice in each paradigm and argues that their integration enables AI-ready data ecosystems that are automated, flexible, and trustworthy. Important gaps remain, including the lack of academic validation frameworks for Zero-ETL, weak observability and testing support for declarative pipelines, and the absence of common benchmarks and standardized performance or reliability SLAs. By surveying work on architectures, tools, and governance, the article provides a composite view of how the three paradigms complement one another and identifies open research problems (e.g., schema drift management, autonomous contract enforcement, and federated data governance) that define the frontier of next-generation data engineering. The findings position Zero-ETL, declarative pipelines, and data contracts as the core triad for building reliable, scalable, and AI-ready data infrastructures.
Zain Ali (Fri,) studied this question.
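To make two of the ideas named in the abstract concrete, the following is a minimal sketch of how a data contract might be checked against events that are parsed only at read time (schema-on-read). It is an illustration under stated assumptions, not a method taken from the surveyed work: the names `FieldRule`, `validate`, `read_with_contract`, and `ORDER_CONTRACT`, as well as the order-event fields, are hypothetical and chosen only for the example.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FieldRule:
    """One clause of a hypothetical producer-consumer data contract."""
    name: str
    dtype: type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None  # optional quality rule


def validate(event: Dict[str, Any], contract: List[FieldRule]) -> List[str]:
    """Return a list of contract violations; an empty list means the event conforms."""
    violations = []
    for rule in contract:
        if rule.name not in event:
            if rule.required:
                violations.append(f"missing field: {rule.name}")
            continue
        value = event[rule.name]
        if not isinstance(value, rule.dtype):
            violations.append(
                f"{rule.name}: expected {rule.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if rule.check is not None and not rule.check(value):
            violations.append(f"{rule.name}: failed quality check")
    return violations


def read_with_contract(
    raw_lines: List[str], contract: List[FieldRule]
) -> Tuple[List[Dict[str, Any]], List[Tuple[str, List[str]]]]:
    """Schema-on-read: raw JSON lines are stored as-is and parsed only here.

    Returns (accepted events, rejected raw lines paired with their violations).
    """
    accepted, rejected = [], []
    for line in raw_lines:
        event = json.loads(line)
        problems = validate(event, contract)
        if problems:
            rejected.append((line, problems))
        else:
            accepted.append(event)
    return accepted, rejected


# Hypothetical contract for an order event stream (assumed fields).
# Note: JSON integers decode as int, so a real contract would need to decide
# whether an integer amount is acceptable where a float is declared.
ORDER_CONTRACT = [
    FieldRule("order_id", str),
    FieldRule("amount", float, check=lambda v: v >= 0),
    FieldRule("coupon", str, required=False),
]

if __name__ == "__main__":
    raw = ['{"order_id": "A1", "amount": 9.5}', '{"amount": -1.0}']
    ok, bad = read_with_contract(raw, ORDER_CONTRACT)
    print(ok)
    print(bad)
```

In this sketch the contract is plain data rather than code embedded in a pipeline step, which is one way a producer and a consumer could share and version the same agreement; rejected events are returned with their violations instead of being dropped, so that enforcement remains observable.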