Judicial documents have become a significant data source for crime geography research, offering greater accessibility and scale than police-recorded crime data, which are highly restricted. However, extracting crime addresses from these texts is challenging because the address information is sparse, inconsistent, and often incomplete. Without proper classification of the extracted addresses, errors arise in geocoding and spatial analysis, compromising data quality. To address these limitations, we employ large language models (LLMs) with a structured prompt engineering strategy tailored to this task. Specifically, we propose a fine-tuned LLM, named CAECLLM, that extracts addresses from judicial documents and classifies them into categories at different spatial scales. Experimental results show that the model achieves an F1-score of 0.79 for address extraction and a classification accuracy of up to 0.74 for the best-performing category, significantly outperforming other LLMs. This study makes two primary contributions: (1) an address classification scheme designed specifically for crime addresses, and (2) a fine-tuned LLM that extracts crime addresses from Chinese judicial documents and classifies them by spatial scale. These advancements facilitate more accurate crime pattern analysis and data-driven urban planning.
Wang et al. studied this question.
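For readers unfamiliar with the extraction metric reported above, the following is a minimal sketch of how an F1-score for address extraction could be computed. It assumes exact string matching between predicted and annotated addresses, which the abstract does not specify; the function name `extraction_f1` and the sample addresses are hypothetical and are not taken from the study.

```python
from collections import Counter


def extraction_f1(predicted, gold):
    """Return (precision, recall, F1) for two lists of address strings.

    Assumption: an extracted address counts as correct only when it
    matches an annotated address exactly (after trimming whitespace).
    """
    pred_counts = Counter(a.strip() for a in predicted)
    gold_counts = Counter(a.strip() for a in gold)

    # Multiset intersection: each gold address can be matched at most once.
    true_pos = sum((pred_counts & gold_counts).values())

    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three extracted addresses, four annotated ones.
predicted = ["Road A No. 1", "Village B", "Hotel C"]
gold = ["Road A No. 1", "Village B", "Market D", "District E"]
precision, recall, f1 = extraction_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A looser matching rule (for example, partial overlap between address spans) would yield different scores, so any comparison with the reported 0.79 depends on the matching criterion actually used.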