This thesis evaluates the scalability and performance characteristics of the Dask framework for large-scale analytical workloads across two contrasting computing environments: a local Kubernetes-based deployment and a high-performance computing (HPC) cluster managed by SLURM. Although Dask is increasingly used as a flexible alternative to frameworks such as Apache Spark, its behavior under compute-intensive workloads remains poorly known. To address this gap, we conduct extensive strong-scaling and weak-scaling experiments using different-scale (TPC-H–like) datasets, analyze Dask’s adaptive autoscaling mechanism, and integrate DuckDB as an embedded execution engine within Dask workers.The results show that the local Kubernetes deployment achieves limited scalability due to shared-resource contention, with performance saturating after only a few workers. In contrast, the HPC system maintains high parallel efficiency across many nodes, particularly when running one worker per node. This demonstrates that distributed memory bandwidth and reduced spill-to-disk activity are essential for scaling shuffle-intensive queries. Weak scaling on the HPC system is more stable than on the local cluster, although absolute runtimes are higher because distributed execution introduces significant network and I/O overheads.Adaptive scaling performs poorly in both environments. On Kubernetes, the autoscaler behaves unpredictably, frequently terminating workers too aggressively and destabilizing long-running queries. On the HPC system, adaptive scaling consistently underperforms fixed-size clusters because the ramp-up phase forces memory-bound workloads to execute with insufficient resources.Finally, experiments combining Dask with DuckDB reveal substantial performance improvements. Distributed DuckDB consistently outperforms Dask DataFrame execution—often by a factor of three to four—due to more efficient local processing, reduced I/O per worker, and optimized query execution. Even single-worker DuckDB is faster than Dask-only execution, emphasizing the importance of leveraging specialized query engines within distributed data-processing frameworks.Overall, this thesis provides a detailed empirical analysis of Dask’s behavior under large-scale analytical workloads and offers practical recommendations for configuring Dask on HPC systems, understanding its scaling limits, and integrating complementary technologies such as DuckDB to improve performance.
Teodor Chakarov (Sun,) studied this question.