Background: Missing data is common in database research. Cancer patients with missing data are often excluded from database analyses. However, this practice may result in selection bias. We sought to clarify the presence of selection bias in such studies with respect to the accreditation status of the treating facility (cancer center vs non-cancer center). Study Design: We evaluated the prevalence of missing data in 2018-2020 Surveillance, Epidemiology, and End Results (SEER) data for patients with breast, pancreas, colon, or non-small cell lung cancer (NSCLC), by Commission on Cancer (CoC) accreditation of the treating center, and 3-year overall survival (OS) by missing-data status and treatment center. Results: We identified 328,030 patients. Across disease sites, patients were predominantly treated at CoC centers (breast 82%, pancreas 83%, colon 75%, NSCLC 80%), and missing data were more prevalent at non-CoC centers than at CoC centers (breast 23% vs 9%, pancreas 36% vs 14%, colon 30% vs 13%, NSCLC 42% vs 13%). The odds of missing data were significantly higher at non-CoC centers than at CoC centers. Patients with missing data had significantly lower 3-year OS than patients with known data (breast 63% vs 81%, pancreas 5% vs 12%, colon 43% vs 61%, NSCLC 17% vs 27%; p<0.001 for all). Conclusions: Disproportionately more missing data were observed from non-CoC centers than from CoC centers. Patients with missing data had lower OS than those with known data, and survival was lowest among patients with missing data who were treated at non-CoC centers. SEER studies that exclude patients with missing data will predominantly exclude patients from non-CoC centers, and may report erroneously superior outcomes because they approximate registry-based rather than population-based findings.
White et al. studied this question.
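The abstract states that the odds of missing data were higher at non-CoC centers but does not give the odds ratios themselves. As a rough illustration only, the sketch below derives crude (unadjusted) odds ratios from the rounded missing-data percentages reported above. The names `missing_pct`, `odds`, and `unadjusted_odds_ratio` are hypothetical, and the values it produces are not the authors' estimates, which may have been adjusted for covariates and computed from patient-level counts.

```python
# Percentages of patients with missing data, as reported in the abstract:
# (non-CoC %, CoC %) for each disease site.
missing_pct = {
    "breast": (23, 9),
    "pancreas": (36, 14),
    "colon": (30, 13),
    "NSCLC": (42, 13),
}


def odds(p):
    """Convert a proportion strictly between 0 and 1 to odds."""
    return p / (1.0 - p)


def unadjusted_odds_ratio(pct_non_coc, pct_coc):
    """Crude odds ratio of missing data, non-CoC vs CoC, from two percentages."""
    return odds(pct_non_coc / 100.0) / odds(pct_coc / 100.0)


# Illustrative crude odds ratios; rounding in the source percentages
# limits their precision.
crude_or = {
    site: unadjusted_odds_ratio(non_coc, coc)
    for site, (non_coc, coc) in missing_pct.items()
}

for site, value in crude_or.items():
    print(f"{site}: crude OR = {value:.2f}")
```

A crude ratio above 1 for every site would be consistent with the direction of the reported finding, though it says nothing about statistical significance or confounding.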