A comprehensive framework for an Extract, Transform, Load (ETL) pipeline is developed using Apache Airflow, Docker, and Azure services. The study identifies gaps in current ETL pipelines, focusing on automation, workflow optimization, and the integration of modern technologies. Key objectives include the development of a data pipeline using Azure Databricks, automation of ETL processes, and performance analysis through containerization. The methodology relies on cloud technologies, with a middleware architecture that provides efficient metadata management, automated file download, and conversion utilities, the latter implemented as a separate function for each file-conversion module. The second part of the methodology concerns the system's ability to handle large files efficiently. The system processes large datasets on a distributed Spark cluster by splitting them into smaller chunks of 200 megabytes; the main data node of the Azure Databricks cluster is a Standard_DS3_v2 instance, which allows files larger than 2 GB to be processed in a distributed fashion. The third part uses a separate job cluster for each smaller job of the ETL pipeline. Separate job clusters for Airflow and Databricks jobs are also introduced to optimize resource utilization and reduce costs: memory utilization, previously only 34% (i.e., about 11 GB of a 32 GB pool), rises to 78% when the job cluster is configured with a DS3_v2 VM instance.
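The chunking and utilization figures quoted above can be checked with a few lines of arithmetic. The abstract does not say how the 200-megabyte chunks are produced, so the following is only a minimal pure-Python sketch of the arithmetic, not the authors' implementation. The function names `plan_chunks` and `memory_utilization` are hypothetical, and binary units (1 MB = 1024² bytes) are an assumption.

```python
# Illustrative sketch only; names and unit conventions are assumptions,
# not taken from the paper.
MB = 1024 ** 2  # assumed binary megabyte
GB = 1024 ** 3  # assumed binary gigabyte


def plan_chunks(file_size_bytes, chunk_size_bytes=200 * MB):
    """Return (offset, length) pairs that cover a file in fixed-size chunks.

    The last chunk may be shorter than chunk_size_bytes.
    """
    if file_size_bytes < 0 or chunk_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and chunk size positive")
    chunks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(chunk_size_bytes, file_size_bytes - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


def memory_utilization(used_gb, pool_gb):
    """Return memory utilization as a percentage of the pool."""
    if pool_gb <= 0:
        raise ValueError("pool size must be positive")
    return 100.0 * used_gb / pool_gb


if __name__ == "__main__":
    chunks = plan_chunks(2 * GB)
    print(f"chunks for a 2 GB file: {len(chunks)}")
    print(f"last chunk size: {chunks[-1][1] // MB} MB")
    print(f"utilization at 11 GB of 32 GB: {memory_utilization(11, 32):.1f}%")
```

Under these assumptions a 2 GB file splits into ten full 200 MB chunks plus one 48 MB remainder, and 11 GB out of a 32 GB pool corresponds to roughly 34% utilization, matching the figure stated in the abstract.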
Bhatlawande et al. studied this question.