May 16, 2026 (Open Access)

Forecasting software runtime metrics: A comparative study of classical statistical, neural network, and foundation models

Federico Di Menna (University of L'Aquila), Luca Traini (University of L'Aquila), Vittorio Cortellessa (University of L'Aquila)

Key Points

This study aims to evaluate the performance of various time series forecasting models on software runtime metrics.
The authors conducted a comprehensive empirical evaluation on 110 real-world software runtime metrics recorded over about one year.
The evaluation covered eight models: three classical statistical models, three neural network models, and two time series foundation models.
Model performance was analyzed in terms of the ability to forecast software runtime behavior and identify potential anomalies.
Foundation models achieved state-of-the-art performance in forecasting software runtime metrics, outperforming classical and neural network models.
The performance differences were statistically significant, even though the foundation models were applied zero-shot, having been trained exclusively on time series from other domains.
No model was superior on every time series, so there is no “silver bullet” solution.

Abstract

Modern software applications generate a wide range of runtime metrics, which are vital to many quality assurance activities. These data are often recorded and aggregated as time series to observe patterns and trends of various runtime aspects over time. In this context, Time Series Forecasting (TSF) offers unique opportunities for predicting software runtime behavior and identifying potential anomalies. Although TSF models have been successfully applied in fields such as economics and climatology, their capabilities for forecasting software runtime metrics remain relatively underexplored. In this paper, we conduct a comprehensive empirical evaluation of 8 TSF models on 110 real-world software runtime metrics recorded over the course of about one year. Our evaluation encompasses three classical statistical models, three neural network models, and two time series foundation models. Results show that the foundation models achieve state-of-the-art performance on TSF of software runtime metrics, outperforming other models with strong statistical significance. Our findings indicate that foundation models, despite being trained exclusively on time series data from other domains, can effectively generalize to software runtime metrics in a zero-shot setting. This makes them a convenient plug-and-play solution for practitioners and researchers aiming to integrate TSF into their software quality assurance processes. Yet, their performance is not uniformly superior across all the time series, underscoring the absence of a “silver bullet” solution.
