June 12, 2026 | Open Access

Scientific Data Reusability Identification Using Deep Learning Method

Key points

This study aims to establish an automated process for assessing the reusability of biomedical scientific data.
Extracted data availability statements from biomedical papers on PubMed Central using Python.
Developed a 5-level grading standard based on data sharing policies from major publishers and existing research.
Utilized various machine learning and deep learning models for training and evaluation on labeled data.
Deep learning models outperformed traditional models in identifying data reusability.
The BioBert_TextCNN model achieved an accuracy rate of 88% in grading data reusability.
Misclassifications stemmed from diverse sharing scenarios and a lack of standardization in DAS descriptions.
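
The extraction step in the key points can be illustrated with a minimal sketch. It assumes the full text is available as JATS-style XML and that the statement sits in a `sec` or `notes` element marked with a `data-availability` type or a matching title; real PubMed Central articles tag this section inconsistently, so the attribute values, title keywords, sample fragment, and the `extract_das` helper are all hypothetical and not the authors' code.

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS-like fragment; real PMC articles vary in how they tag the DAS.
SAMPLE_XML = """<article>
  <body>
    <sec><title>Methods</title><p>We sequenced 20 samples.</p></sec>
  </body>
  <back>
    <sec sec-type="data-availability">
      <title>Data Availability Statement</title>
      <p>The datasets are available in GEO under the accession listed in the paper.</p>
    </sec>
  </back>
</article>"""

# Assumed markers for a DAS section (not taken from the study).
DAS_TYPES = {"data-availability"}
DAS_TITLES = ("data availability", "availability of data")


def extract_das(xml_text):
    """Return the text of the first section that looks like a DAS, or None."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag not in ("sec", "notes"):
            continue
        kind = elem.get("sec-type") or elem.get("notes-type") or ""
        title = (elem.findtext("title") or "").strip().lower()
        if kind in DAS_TYPES or title.startswith(DAS_TITLES):
            # Join all paragraph text in the section, normalizing whitespace.
            paragraphs = [
                " ".join("".join(p.itertext()).split()) for p in elem.iter("p")
            ]
            return " ".join(paragraphs) or None
    return None


if __name__ == "__main__":
    print(extract_das(SAMPLE_XML))
```

Matching on both the type attribute and the section title is a deliberate hedge: either signal alone would miss articles that use only the other.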

Abstract

With the development of open science, the reusability of scientific data has become increasingly important, especially for high-throughput biomedical data. However, effective standards and methods for identifying the reusability of scientific data are currently lacking. This study proposes an automated process for judging the reusability of biomedical scientific data. A large number of biomedical papers from PubMed Central served as the data source. First, the data availability statement (DAS) section of each full-text article was extracted using Python. By analyzing the data sharing and reuse policies issued by the world’s five largest publishers and Cell Press, and combining them with existing relevant research, a detailed 5-level data reusability grading standard was formulated, and the extracted statements were manually labeled accordingly. After data preprocessing, a variety of machine learning and deep learning models were trained and evaluated. The results show that deep learning models generally outperform traditional ones, and that the BioBert_TextCNN model performs well on the data reusability grading task, reaching an accuracy of 88%. An analysis of misclassified statements points to diverse data sharing scenarios that lack unified descriptions and to a lack of DAS standardization. This study provides an effective method and reference for the sharing, reuse, and traceability of scientific data.
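
To make the accuracy figure concrete, here is a small sketch of how accuracy and per-grade confusions could be computed for a 5-level grading task. The grade numbering (1 to 5), the toy labels, and the helper names are assumptions for illustration only; they are not the study's data or results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of statements whose predicted grade equals the labeled grade."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count (labeled grade, predicted grade) pairs to show where errors fall."""
    return Counter(zip(y_true, y_pred))


# Hypothetical toy labels on an assumed 1-5 reusability scale.
gold = [1, 2, 3, 4, 5, 3, 2, 1]
pred = [1, 2, 3, 4, 4, 3, 3, 1]

if __name__ == "__main__":
    print(accuracy(gold, pred))
    # List only the misclassified (labeled, predicted) pairs.
    for (t, p), n in sorted(confusion_counts(gold, pred).items()):
        if t != p:
            print(f"labeled {t}, predicted {p}: {n}")
```

Inspecting the off-diagonal pairs is one simple way to examine the kind of misclassification the abstract discusses, for example statements confused between adjacent grades.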
