Automatic text summarization is frequently evaluated using standard automatic metrics such as ROUGE, BLEU, and BERTScore. These metrics are widely adopted due to their ease of computation and reproducibility. However, their interpretation becomes challenging in specialized domains such as legal text, where document length, formal language, and information density differ significantly from general-purpose datasets. This paper examines how commonly used evaluation metrics influence the interpretation of summarization performance for Indian legal documents. Using results obtained from a comparative evaluation of extractive and abstractive summarization models under uniform experimental settings, we analyze how different metrics emphasize different aspects of summary quality. The study highlights that conclusions regarding model effectiveness may vary depending on the chosen evaluation metric, underscoring the importance of careful metric interpretation in legal text summarization research.
Tanmay Dayal (Wed,) studied this question.