To the Editor: Data linkage is a method of identifying and combining information derived from various sources that belongs to the same individual or event.[1] In recent years, data linkage has become an increasingly common practice in many countries. The socioeconomic and health benefits of data linkage in the healthcare field can be clearly demonstrated. From the perspective of scientific research, the combination of multidimensional health-related data at the individual level (e.g., risk factor data, clinical diagnosis and treatment data, community management and follow-up data, disease surveillance data, and birth and vital statistics) with public records (e.g., environmental monitoring data, motor vehicle crash data, criminal data) can create scientific inspiration in multiple disciplines and spur innovation. From the perspective of health management, data linkage can be used to better monitor the population's well-being and its influencing factors and to evaluate the quality and outcome of government services.[2] The extensive practice of data linkage requires two kinds of support: technological feasibility and an efficient administrative structure for data sharing. The advent of sophisticated computer technologies has made data linkage feasible. However, generally accepted guidance for database preparation, linkage, and quality assessment methods in the healthcare field in China is still lacking. In addition, issues concerning ethical principles, privacy protection policies, and approaches to balancing the interests of all parties involved have not been fully discussed in China.[3] Against this background, we carried out a pilot study to link two national surveillance databases at the individual level. In this paper, we mainly introduce our data linkage and quality control techniques and the data linkage outcome. We hope to provide a reference for other studies in this field.
In this study, we linked the China Chronic Disease and Risk Factor Surveillance (CCDRFS) database with the Population-based Cancer Registries (PBCR) database. Detailed introductions to the two databases are provided in Section "Data Sources Description" in Supplementary Materials, https://links.lww.com/CM9/B949. Database preparation was performed by the respective database owners. The residential identification (RID) number and basic demographic information (age, sex, date of birth, ethnicity, marital status, and address) of individuals were collected in both databases and were defined as common variables. These common variables were used for data matching. The coding and quality control of common variables were identical in the two databases and followed national standards. Other variables involved in the data linkage process were defined as unique variables, and they were the key variables for subsequent research. Unique variables in each database underwent quality control procedures according to the respective study protocols. A total of 547,963 baseline records (BRs) from 335 districts/counties in the CCDRFS database and 12,689,999 cancer incidence records (CIRs) from 1152 districts/counties in the PBCR database were prepared for data matching. We used exact matching and fuzzy matching methods. The exact matching method was used for individuals with correct RID numbers, and the fuzzy matching method was used for individuals with no or incorrect RID numbers. In fuzzy matching, we generated unique index variables using different combinations of common variables (except RID) and tested the matching performance of each combination. We selected the combination with the best matching rate, accuracy rate, and error rate as the algorithm for fuzzy matching. Detailed descriptions of the matching methods are provided in Section "Data Matching Methods" in Supplementary Materials, https://links.lww.com/CM9/B949.
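The two-pass idea described above (exact matching on a usable RID, then fuzzy matching on a composite key built from the other common variables) can be sketched as follows. This is a minimal illustration, not the algorithm documented in the Supplementary Materials: the field names, the 18-character RID check, the choice of sex, date of birth, and district as the composite key, and the rule that a fuzzy match is accepted only when exactly one candidate shares the key are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    index_id: str          # de-identified primary index (hypothetical field name)
    rid: Optional[str]     # residential identification number; None if missing
    sex: str
    birth_date: str        # ISO date string
    district: str          # coarse address code (hypothetical field)


def rid_is_usable(rid: Optional[str]) -> bool:
    # Assumption for this sketch: a usable RID is simply an 18-character string.
    return rid is not None and len(rid) == 18


def fuzzy_key(p: Person) -> tuple:
    # Composite key from common variables other than the RID (illustrative choice).
    return (p.sex, p.birth_date, p.district)


def link(baseline: list, registry: list) -> dict:
    """Return {baseline index_id: (registry index_id, "exact" or "fuzzy")}."""
    # For brevity, one registry record per RID is assumed here.
    by_rid = {r.rid: r for r in registry if rid_is_usable(r.rid)}
    by_key = {}
    for r in registry:
        by_key.setdefault(fuzzy_key(r), []).append(r)

    matches = {}
    for b in baseline:
        # Pass 1: exact match when the baseline RID is usable and present in the registry.
        if rid_is_usable(b.rid) and b.rid in by_rid:
            matches[b.index_id] = (by_rid[b.rid].index_id, "exact")
            continue
        # Pass 2: fuzzy match, skipping pairs where both RIDs are usable
        # (such pairs would already have matched exactly if they were the same person).
        candidates = [
            r for r in by_key.get(fuzzy_key(b), [])
            if not (rid_is_usable(b.rid) and rid_is_usable(r.rid))
        ]
        if len(candidates) == 1:  # accept only unambiguous fuzzy matches
            matches[b.index_id] = (candidates[0].index_id, "fuzzy")
    return matches
```

Requiring a single candidate in the fuzzy pass trades some missed matches for fewer false matches; a real implementation would tune the key combination against measured matching, accuracy, and error rates, as the letter describes.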
For records of the CCDRFS database that matched more than one cancer incidence outcome, International Agency for Research on Cancer (IARC) rules were applied to distinguish multiple primary cancer cases from duplicated records.[4] Mortality information in the CCDRFS database (which was derived from the Disease Surveillance Points (DSPs) database; see details in Section "Data Sources Description" in Supplementary Materials, https://links.lww.com/CM9/B949) and cancer mortality information in the PBCR database were compared. If either database reported cancer mortality information for a record, it was taken as the death outcome for that record. If the two databases reported inconsistent mortality information for the same record, the death outcome from the PBCR database was retained, since the coding in this database was considered to be more accurate. If a record had a cancer mortality outcome but no cancer incidence outcome, it was considered a death certificate only (DCO) case. According to IARC rules, for such cases we set the incidence code in the International Statistical Classification of Diseases and Related Health Problems 10th Revision (ICD-10) equal to the ICD-10 cause-of-death code, and the date of diagnosis equal to the date of death.[5]

After data matching, 547,963 BRs in the CCDRFS database were matched with 9263 CIRs from the PBCR database. For quality control of the linked database, regional and record inclusion criteria were set up. The units of survey in the CCDRFS and PBCR databases were district/county administrative divisions, and only regions that had carried out CCDRFS surveys and had well-established cancer registries were included. After excluding 52 regions with no cancer registry and 22 regions with poor cancer registries, 425,520 BRs with 9161 CIRs from 261 regions were included. Record inclusion criteria: (1) The date of cancer diagnosis was before December 31, 2020 (according to the study deadline).
(2) The date of cancer diagnosis was later than the CCDRFS baseline survey date (to ensure that risk factor information was collected before the occurrence of disease).

One major concern for the quality of the merged database was possible underreporting of cancer incidence caused by incomplete matching. We proposed the following method to evaluate the degree of underreporting. First, we calculated the 35–64-year truncated incidence rates (35–64 TIRs) by region in the merged database. Then, we used the cancer registry data of the 261 included regions to calculate the 35–64 TIRs and took them as the population references. We compared the 35–64 TIRs of the merged database and the population references by region and excluded regions according to the following criteria: (1) the 35–64 TIR in the merged database was ≥70% lower than the population reference when the total number of BRs in the region was <1000; (2) the 35–64 TIR in the merged database was ≥60% lower than the population reference when the total number of BRs was 1000–2000; and (3) the 35–64 TIR in the merged database was ≥50% lower than the population reference when the total number of BRs was ≥2000. Thirty-one regions with 56,283 BRs and 345 CIRs were excluded. Supplementary Figure 1 (https://links.lww.com/CM9/B949) shows the whole process of database merging and quality control.

Finally, 368,470 BRs in the CCDRFS database with 8049 cancer incidence records from the PBCR database from 230 regions were pooled together as the Epidemiology Database of Cancer Incidence (EDCI), which included 6198 (77.0%) exact matches and 1851 (23.0%) fuzzy matches. The 230 regions were distributed across provinces of China (Supplementary Table 1, https://links.lww.com/CM9/B949). The crude incidence rate (CIR), age-standardized incidence rate (ASIR) by Segi's world standard population, and 35–64 TIR were calculated and compared for the EDCI and cancer registry population references.
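The tiered regional exclusion rule above can be written as a small function. This is a sketch under stated assumptions: the relative difference is taken as (merged − reference) / reference × 100, which is not spelled out in the text but reproduces the total 35–64 TIR difference reported in Table 1, and a region with exactly 2000 BRs is treated as falling in the ≥2000 tier because the stated ranges overlap at that value. The function names are hypothetical.

```python
def relative_difference(merged_tir: float, reference_tir: float) -> float:
    """Percentage difference of the merged-database TIR relative to the reference.

    Assumed formula: (merged - reference) / reference * 100.
    """
    return (merged_tir - reference_tir) / reference_tir * 100.0


def shortfall_threshold(n_baseline_records: int) -> float:
    """Minimum percentage shortfall that triggers exclusion, by region size."""
    if n_baseline_records < 1000:
        return 70.0
    if n_baseline_records < 2000:  # assumption: exactly 2000 BRs uses the 50% tier
        return 60.0
    return 50.0


def exclude_region(merged_tir: float, reference_tir: float,
                   n_baseline_records: int) -> bool:
    """True if the merged TIR falls short of the reference by at least the threshold."""
    shortfall = -relative_difference(merged_tir, reference_tir)
    return shortfall >= shortfall_threshold(n_baseline_records)


# Total 35-64 TIRs from Table 1: EDCI 309.54 vs. reference 314.65.
print(round(relative_difference(309.54, 314.65), 2))  # -1.62
```

Smaller regions get a more lenient threshold because their rates rest on fewer cases and fluctuate more by chance, so a large apparent shortfall is weaker evidence of underreporting.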
The differences in the 35–64 TIR between the EDCI and population references by sex and area were between −5.80% and 5.07%, which was quite small (Table 1). Supplementary Figure 2 (https://links.lww.com/CM9/B949) shows the comparison of the cumulative proportions of cancer cases by cancer site in the EDCI and the population reference. Both cumulative trends were very similar. These results indicated that the EDCI is of high quality.

Table 1 - Comparison of the incidence rate (per 100,000) of the EDCI database and the cancer registry population reference database.

| Sex | Area | EDCI CIR | EDCI ASIR | EDCI 35–64 TIR | Reference* CIR | Reference* ASIR | Reference* 35–64 TIR | 35–64 TIR difference (%) |
|---|---|---|---|---|---|---|---|---|
| Total | Total | 341.98 | 233.45 | 309.54 | 312.19 | 187.90 | 314.65 | −1.62 |
| Total | Urban | 374.66 | 252.98 | 335.99 | 337.73 | 193.91 | 325.21 | 3.31 |
| Total | Rural | 314.76 | 216.75 | 288.40 | 259.60 | 173.85 | 291.05 | −0.91 |
| Males | Total | 364.92 | 233.15 | 297.24 | 337.64 | 205.52 | 309.8 | −4.05 |
| Males | Urban | 387.15 | 243.66 | 304.14 | 359.64 | 206.96 | 309.65 | −1.78 |
| Males | Rural | 347.49 | 224.48 | 292.07 | 293.43 | 201.67 | 310.05 | −5.80 |
| Females | Total | 323.46 | 232.14 | 317.56 | 286.41 | 172.32 | 319.72 | −0.68 |
| Females | Urban | 365.16 | 259.91 | 358.21 | 315.89 | 182.94 | 340.94 | 5.07 |
| Females | Rural | 287.00 | 207.43 | 283.22 | 224.19 | 147.60 | 271.59 | 4.28 |

*Reference: cancer registry population reference database, derived from the population-based cancer registry data during 2013−2017 in corresponding areas. Rural: Counties were defined as rural; Urban: Districts were defined as urban areas. 35−64 TIR: 35−64 years truncated incidence rate; ASIR: Age-standardized incidence rate by Segi's world standard population; CIR: Crude incidence rate; EDCI: Epidemiology database of cancer incidence.

To ensure participants' privacy during the data linkage process, the whole process was split into two stages: the matching of common variables and the matching of unique variables. Primary index variables were created in the original databases by each institution separately.
The two institutions used different rules to create their primary index variables, but each primary index variable was guaranteed to be a unique identifier within its own database and contained no personally identifiable information. In stage I, common variables and primary index variables were extracted from both databases, and data matching was carried out on a computer physically isolated from the external network (the secure computer) at the National Cancer Center. After double checking and quality control, only the matching results of primary index variables were provided for later use. In stage II, the de-identified CCDRFS database and the successfully matched PBCR records, which carried primary index variables but no sensitive personal information, were uploaded to a secure computer to form the merged database. Only employees directly involved in the data matching process had access to the sensitive personal information required for linkage. All employees were required to sign a confidentiality statement. The final merged database did not include sensitive personal information.

Data linkage and sharing is the foundation of the high-quality development of medical big data research. In this study, we proposed a method to realize multisource database linkage and established a high-quality longitudinal dynamic cancer cohort. However, this study also had some limitations. First, there were records with missing RID numbers in both the CCDRFS database and the PBCR database, and although fuzzy matching and double-checking were performed, false matches and missed matches may remain. In addition, the risk factors covered by the CCDRFS survey were mainly risk factors for common chronic diseases (e.g., hypertension and diabetes), and some cancer-specific risk factors, such as HBV or HPV infection, were not included. This might reduce the application value of the EDCI in the future.
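The two-stage, privacy-preserving linkage described earlier in this letter can be illustrated with a short sketch. It is a simplified model under assumptions, not the procedure actually run at the National Cancer Center: records are plain dictionaries, matching in stage I uses the RID only, the field names (`rid`, `name`, `address`, `index_id`) are hypothetical, and the primary index is generated with Python's standard `secrets.token_hex`, whereas the letter says only that each institution used its own creation rules.

```python
import secrets


def assign_index(records: list, prefix: str) -> list:
    """Owner-side step: attach a random primary index that encodes no personal data."""
    return [{**rec, "index_id": f"{prefix}-{secrets.token_hex(8)}"} for rec in records]


def stage_one(baseline: list, registry: list, common=("rid",)) -> list:
    """Secure-computer step: match on common variables, release index pairs only."""
    lookup = {}
    for r in registry:
        lookup.setdefault(tuple(r[c] for c in common), []).append(r["index_id"])
    pairs = []
    for b in baseline:
        for registry_index in lookup.get(tuple(b[c] for c in common), []):
            pairs.append((b["index_id"], registry_index))
    return pairs


def strip_identifiers(records: list, sensitive=("rid", "name", "address")) -> list:
    """Drop sensitive fields so that only index and research variables remain."""
    return [{k: v for k, v in rec.items() if k not in sensitive} for rec in records]


def stage_two(baseline_deid: list, registry_deid: list, pairs: list) -> list:
    """Merge de-identified records using only the released index pairs."""
    registry_by_index = {r["index_id"]: r for r in registry_deid}
    linked = {}
    for baseline_index, registry_index in pairs:
        linked.setdefault(baseline_index, []).append(registry_index)
    merged = []
    for b in baseline_deid:
        outcomes = [registry_by_index[i] for i in linked.get(b["index_id"], [])
                    if i in registry_by_index]
        merged.append({**b, "cancer_records": outcomes})
    return merged
```

The point of the split is that the only artifact leaving stage I is the list of index pairs, so the merged database in stage II can be built without any party handling identifiers and research variables together.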
In conclusion, we explored the data linkage procedures of two large health surveillance systems at the national level in this study. After stringent quality control, we obtained one merged database for cancer incidence, which lays a solid foundation for subsequent in-depth cancer epidemiologic studies.

Acknowledgments

We would like to thank the participants, project staff, and diligent provincial and local staff of the CDCs and cancer registries for their participation and contributions.

Funding

This study was supported by grants from the National Key Research and Development Program of China (Nos. 2021YFF1201101, 2018YFC1311704, and 2018YFC1311706).

Conflicts of interest

None.
Zang et al. (Thu,) studied this question.