With the development of open science, the reusability of scientific data has become increasingly important, especially for high-throughput biomedical data. However, effective standards and methods for assessing the reusability of scientific data are currently lacking. This study proposes an automated process for grading the reusability of biomedical scientific data. A large number of biomedical papers from PubMed Central were selected as the data source. First, the data availability statement (DAS) section of each full-text article was extracted using Python. Based on an analysis of the data sharing and reuse policies issued by the world’s five largest publishers and Cell Press, combined with existing relevant research, a detailed five-level data reusability grading standard was formulated, and the extracted data were manually labeled accordingly. After data preprocessing, a variety of machine learning and deep learning models were trained and evaluated. The results show that deep learning models generally outperform traditional machine learning models, and that the BioBertTextCNN model performs well on the data reusability grading task, reaching an accuracy of 88%. The causes of misclassification in DAS grading were also explored, highlighting issues such as diverse data sharing scenarios that lack unified descriptions and the lack of DAS standardization. This study provides an effective method and reference for the sharing, reuse, and traceability of scientific data.
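As a rough illustration of the DAS extraction step, the following minimal sketch pulls candidate statements out of a JATS-style full-text XML document. It is not the authors' implementation: the function name `extract_das`, the attribute value `data-availability`, and the title keywords are all assumptions made for this example, and real PubMed Central articles mark up the DAS in more varied ways.

```python
import xml.etree.ElementTree as ET

# Assumed markers for a data availability statement (illustrative only).
DAS_TYPES = {"data-availability"}
DAS_TITLE_HINTS = ("data availability", "availability of data")


def extract_das(jats_xml: str) -> list[str]:
    """Return the text of sections that look like a data availability statement."""
    root = ET.fromstring(jats_xml)
    statements = []
    for elem in root.iter():
        # Only <sec> and <notes> elements are considered in this sketch.
        if elem.tag not in ("sec", "notes"):
            continue
        type_attr = (elem.get("sec-type") or elem.get("notes-type") or "").lower()
        title = (elem.findtext("title") or "").lower()
        if type_attr in DAS_TYPES or any(hint in title for hint in DAS_TITLE_HINTS):
            # Join the text of every paragraph and normalize whitespace.
            paragraphs = (" ".join("".join(p.itertext()).split()) for p in elem.iter("p"))
            text = " ".join(t for t in paragraphs if t)
            if text:
                statements.append(text)
    return statements
```

Matching on both a type attribute and the section title is a deliberate choice in this sketch, since an unstandardized DAS may carry only one of the two signals.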
Zhou et al. studied this question.