March 30, 2026 | Open Access

Technical evaluation of language models adapted for the automation of legal contracts: clause extraction, classification, and summarization

Jaime Govea (Universidad de las Américas), Iván Ortiz-Garcés (Universidad de las Américas), Pablo Palacios (Diego Portales University)

Key Points

The central aim is to evaluate and adapt language models for effective legal contract automation.
Developed a methodological pipeline for evaluation and adaptation of language models.
Conducted domain-specific fine-tuning of open-source models.
Performed controlled comparative assessments against general-purpose language models.
Curated a legal corpus and performed clause-level annotation.
Evaluated across classification, clause extraction, and summarization tasks.
Achieved a Macro-F1 score of 0.921 in contract classification.
Attained a span-level F1 score of 0.903 in clause extraction.
Obtained a ROUGE-L score of 0.886 in summarization.
Demonstrated consistent performance improvements over general-purpose models with statistical significance (p < 0.01).
Confirmed model stability across different contract types.

Abstract

The growing demand for automation in legal contract management exposes a persistent limitation of current language models: insufficient adaptation to the semantic, structural, and regulatory constraints of legal language. While large language models perform well on general NLP tasks, their direct application to legal document classification, clause extraction, and contract summarization often yields unstable, legally unreliable outputs. This work presents a structured methodological pipeline for evaluating and adapting language models for legal contract automation, combining domain-specific fine-tuning of open-source models with a controlled comparative assessment against large general-purpose LLMs used exclusively in inference mode. The methodology integrates legal corpus curation, clause-level annotation, and efficient adaptation techniques, and is evaluated across three core tasks: contract document classification, normative clause extraction, and regulatory summarization. The evaluation protocol is explicitly designed to disentangle the effects of supervision from deployment constraints arising in regulated legal settings. Experimental results show consistent and statistically significant performance gains for legally adapted models over general-purpose baselines, achieving a Macro-F1 of 0.921 in classification, a span-level F1 of 0.903 in clause extraction, and a ROUGE-L of 0.886 in summarization (p < 0.01). Robustness analysis and cross-validation confirm stability across heterogeneous private-sector contract types. The findings should be interpreted under the evaluated comparison regime. They highlight that, in legally constrained multi-stage workflows, task-aligned supervision provides measurable structural benefits that are not reducible to model scale alone when general-purpose LLMs are restricted to inference-only deployment.
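For readers less familiar with the three reported metrics, the following is a minimal sketch of how Macro-F1, span-level F1, and ROUGE-L are commonly computed. It is not the authors' evaluation code. The function names, the exact-match criterion for spans, the whitespace tokenization, and the balanced F-measure for ROUGE-L are all assumptions made for illustration; the paper may use different matching rules or a standard library implementation.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def span_f1(gold_spans, pred_spans):
    """F1 over (start, end, label) spans; a span counts only on exact match (assumption)."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def rouge_l(reference, candidate):
    """Balanced ROUGE-L F-measure from the longest common subsequence of tokens."""
    ref, cand = reference.split(), candidate.split()
    # Row-by-row dynamic programme for the LCS length.
    prev = [0] * (len(cand) + 1)
    for ref_tok in ref:
        curr = [0]
        for j, cand_tok in enumerate(cand, start=1):
            if ref_tok == cand_tok:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy inputs (hypothetical, not taken from the paper's corpus).
doc_gold = ["nda", "lease", "nda", "lease"]
doc_pred = ["nda", "nda", "nda", "lease"]
gold_spans = [(0, 10, "termination"), (20, 35, "liability")]
pred_spans = [(0, 10, "termination"), (20, 30, "liability")]

print(round(macro_f1(doc_gold, doc_pred), 4))
print(round(span_f1(gold_spans, pred_spans), 4))
print(round(rouge_l("the tenant shall pay rent monthly",
                    "the tenant pays rent monthly"), 4))
```

Macro-F1 weights every contract class equally regardless of frequency, which matters when some contract types are rare. Exact-match span F1 is a strict criterion: a predicted clause with a boundary off by a few characters earns no credit, so relaxed-overlap variants would yield higher scores on the same predictions.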
