What question did this study set out to answer?

This research aims to enhance access to geoscience resources via a structured Korean question-answer dataset for conversational AI.

April 1, 2026 | Open Access

GeoBot: Construction of a Structured Korean Question-Answer Dataset for Conversational Artificial Intelligence Services on the Geo Big Data Open Platform

Key Points

Constructed a structured QA dataset from 1,200 real user queries.
Collected metadata from various geoscience resources including reports and FAQs.
Applied preprocessing steps including schema validation, metadata normalization, and coordinate transformation.
Implemented sentence embedding using a Korean model based on KoSimCSE-RoBERTa.
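Indexed the resulting documents in Elasticsearch to evaluate semantic similarity and spatial intersection jointly.
Produced a final dataset of 1,200 structured Korean QA records containing raw questions, refined questions, keywords, evidence-grounded answers, spatial information, and source identifiers.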
Achieved a factual consistency rate of 96.83% in expert evaluations.
Enabled semantic retrieval, geospatial filtering, and retrieval-augmented response generation.
Converted coordinate strings into GeoJSON point or polygon objects in WGS84 for geospatial queries.
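
The coordinate-to-GeoJSON step named in the key points can be illustrated with a minimal Python sketch. The source does not describe the platform's raw coordinate format, so the "lon,lat;lon,lat" string layout and the function names below are assumptions made for illustration only. The sketch also assumes the values are already WGS84 degrees; reprojection from another coordinate system is outside its scope.

```python
def parse_pairs(coord_string):
    """Parse a hypothetical 'lon,lat;lon,lat' string into [lon, lat] pairs."""
    pairs = []
    for chunk in coord_string.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        lon_text, lat_text = chunk.split(",")
        lon, lat = float(lon_text), float(lat_text)
        # Reject values that cannot be WGS84 longitude/latitude degrees.
        if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            raise ValueError(f"not a WGS84 lon/lat pair: {chunk!r}")
        pairs.append([lon, lat])
    return pairs


def to_geojson(coord_string):
    """Return a GeoJSON Point for one pair, or a Polygon for three or more."""
    pairs = parse_pairs(coord_string)
    if not pairs:
        raise ValueError("no coordinates found")
    if len(pairs) == 1:
        return {"type": "Point", "coordinates": pairs[0]}
    # GeoJSON linear rings must be closed: first and last positions are equal.
    ring = pairs if pairs[0] == pairs[-1] else pairs + [pairs[0]]
    if len(ring) < 4:
        raise ValueError("a polygon ring needs at least four positions")
    return {"type": "Polygon", "coordinates": [ring]}
```

Calling `to_geojson("127.0,37.5")` yields a Point, while a string with three or more pairs yields a Polygon whose ring is closed automatically.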

Abstract

The Geo Big Data Open Platform operated by the Korea Institute of Geoscience and Mineral Resources (KIGAM) integrates diverse geoscience resources, including geological maps, thematic maps, reports, metadata, and frequently asked questions, and provides public access through web services and open APIs. Despite this rich content, practical access through natural-language queries remains limited because most records are organized around structured metadata and domain-specific terminology. This study presents GeoBot, a conversational artificial intelligence service for the platform, and describes the construction of a structured Korean question-answer (QA) dataset designed to support semantic retrieval, geospatial filtering, and retrieval-augmented response generation. The retrieval corpus was assembled from open platform metadata, geo data, paper metadata, report metadata, and FAQ content, while the released QA records were derived from 1,200 real user queries collected during the beta service. The workflow comprised source-data collection, schema validation, metadata normalization, coordinate transformation, sentence embedding, vector indexing, similarity retrieval, and large language model-based answer generation. During preprocessing, fragmented metadata records were merged into document-level objects, coordinate strings were converted into GeoJSON point or polygon objects in WGS84, and normalized text fields were embedded using a Korean sentence-embedding model based on KoSimCSE-RoBERTa. The resulting documents were indexed in Elasticsearch using dense_vector and geo_shape fields to enable joint evaluation of semantic similarity and spatial intersection. The final dataset contains 1,200 structured Korean QA records, including raw questions, refined questions, keywords, evidence-grounded answers, spatial information, and source identifiers. Expert evaluation showed a factual consistency rate of 96.83%, demonstrating the dataset's reliability for geoscience conversational services.
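
To make the joint semantic and spatial evaluation concrete, the following Python sketch builds an Elasticsearch index mapping and a kNN search body with a geo_shape filter as plain dictionaries. It is not the authors' configuration: the field names (`title`, `embedding`, `location`), the 768-dimension vector size, the cosine similarity setting, and the helper names are all assumptions chosen for illustration. The request shape follows the Elasticsearch 8.x `knn` search option.

```python
import json

EMBEDDING_DIMS = 768  # assumption: RoBERTa-base sized sentence vectors


def build_index_mapping(dims=EMBEDDING_DIMS):
    """Mapping with one dense_vector field and one geo_shape field."""
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},  # hypothetical text field
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
                "location": {"type": "geo_shape"},
            }
        }
    }


def build_search_body(query_vector, region, k=5, num_candidates=50):
    """kNN search restricted to documents whose shape intersects `region`.

    `region` is a GeoJSON geometry dict, e.g. a Polygon in WGS84.
    """
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": "embedding",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
            # The filter narrows kNN candidates to spatially matching documents.
            "filter": {
                "geo_shape": {
                    "location": {"shape": region, "relation": "intersects"}
                }
            },
        }
    }


if __name__ == "__main__":
    area = {
        "type": "Polygon",
        "coordinates": [[[126.9, 37.4], [127.1, 37.4], [127.1, 37.6],
                         [126.9, 37.6], [126.9, 37.4]]],
    }
    body = build_search_body([0.0] * EMBEDDING_DIMS, area, k=3)
    print(json.dumps(body["knn"]["filter"], indent=2))
```

Because both helpers return ordinary dictionaries, they can be passed to whichever Elasticsearch client is in use, and the query vector would come from the sentence-embedding model applied to the user's question.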
