July 25, 2024

Assessing Evaluation Metrics for Neural Test Oracle Generation

Abstract

Recently, deep learning models have shown promising results in test oracle generation. Static evaluation metrics from Natural Language Generation (NLG) such as BLEU, CodeBLEU, ROUGE-L, METEOR, and Accuracy, which are mainly based on textual comparisons, have been widely adopted to measure the performance of Neural Oracle Generation (NOG) models. However, these NLG-based metrics may not reflect the testing effectiveness of the generated oracle within a test suite, which is often measured by dynamic (execution-based) test adequacy metrics such as code coverage and mutation score. In this work, we revisit existing oracle generation studies plus ChatGPT to empirically investigate the current standing of their performance in both NLG-based and test adequacy metrics. Specifically, we train and run four state-of-the-art test oracle generation models on five NLG-based and two test adequacy metrics for our analysis. We apply two different correlation analyses between these two sets of metrics. Surprisingly, we found no significant correlation between the NLG-based metrics and the test adequacy metrics. For instance, oracles generated by ChatGPT on the project activemq-artemis had the highest performance on all the NLG-based metrics among the studied NOGs; however, ChatGPT also had the largest number of projects with a decrease in test adequacy metrics among all the studied NOGs. We further conduct a qualitative analysis to explore the reasons behind our observations. We found that oracles with high NLG-based metrics but low test adequacy metrics tend to have complex or multiple chained method invocations within the oracle's parameters, which are hard for the model to generate completely and which therefore affect the test adequacy metrics. On the other hand, oracles with low NLG-based metrics but high test adequacy metrics tend to call different assertion types or a different method that functions similarly to the one in the ground truth. Overall, this work complements prior studies on test oracle generation with an extensive performance evaluation using both NLG-based and test adequacy metrics, and provides guidelines for better assessment of deep learning applications in software test generation in the future.
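
To make the correlation step concrete, the following is a minimal sketch of how one static metric could be compared against one test adequacy metric across projects. The abstract does not name the two correlation analyses used, so the choice of Spearman rank correlation here is an assumption, and the function names (`average_ranks`, `pearson`, `spearman`) are illustrative rather than taken from the study.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values equal to the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

In use, `xs` would hold a static score per project (for example BLEU) and `ys` the matching dynamic score (for example the change in mutation score), so that `spearman(xs, ys)` gives a value in [-1, 1]. A coefficient near zero would be in line with the reported absence of a significant correlation, although a significance test would still be needed to support that conclusion.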
