In the last two decades, almost all fields of science have become "data rich" due to growing digitalization, increased connectedness of systems and disciplines, the pervasiveness of digital devices, and new experimental techniques. At the same time, the analyses to be performed on these data sets have become more complex, which has led to the need for a modularized development approach in which individual analysis steps can be designed and implemented independently of one another. Furthermore, the sheer size of the data to be analyzed increasingly requires the use of distributed compute resources to achieve sufficient throughput and scalability. To keep development efficient despite these three properties (large data sets, complex analyses, distributed execution), specialized software infrastructures have emerged, namely scientific workflow management systems (SWMS). In essence, a SWMS is a software system that allows the specification of data analysis workflows over large scientific data sets and that is capable of steering the execution of such workflows on a distributed compute infrastructure. These key functionalities are often accompanied by additional features, such as graphical user interfaces, provenance management and analysis, runtime monitoring and debugging, or repositories for workflow exchange between groups and communities. In this chapter, we describe the anatomy of a typical (idealized) SWMS from a technical perspective. We first highlight the most salient features of SWMS and then propose a simple reference architecture as a basis for our further description. We characterize existing workflow languages regarding their expressiveness and highlight the impact of different language features on a system’s architecture. Furthermore, we discuss alternative architectures for specialized use cases, delineate SWMS from related classes of systems, and give an outlook on present and future topics regarding the advancement of workflow systems.
Ulf Leser studied this question.