Web content extraction, the task of isolating a page’s main content from boilerplate, is critical for web mining, search indexing, and dataset construction. Existing extractors achieve strong results on articles but degrade significantly on other page types. We present a page-type-aware extraction pipeline that classifies pages into seven types (article, forum, product, collection, listing, documentation, service) and applies type-specific extraction profiles. A three-stage classifier combining URL heuristics, HTML signal analysis, and an XGBoost model achieves 86.6% classification accuracy. Evaluated on the 1,497-page development set of the Web Content Extraction Benchmark (WCXB) from 1,295 domains, our system achieves F1 = 0.859, outperforming Trafilatura (0.791), MinerU-HTML (0.827), and ReaderLM-v2 (0.741) while processing pages in 44 ms. On the 511-page held-out test set, the system achieves F1 = 0.893, confirming generalization. An ML-based quality predictor identifies the 8% of pages where heuristic extraction is unreliable, enabling a hybrid pipeline that routes these pages to a 0.6B neural model and reaches F1 = 0.910 on the held-out set. The gains come from page types that existing extractors neglect: forums (+0.103 F1 over the type-agnostic baseline), service pages, and listings. We release the system as an open-source Rust library.
Murrough Foley studied this question.
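The control flow described in the abstract (classify the page type, apply a type-specific profile, then route low-confidence pages to a neural model) can be illustrated with a minimal Python sketch. This is not the released Rust library's API. Every name here (`classify_by_url`, `classify_by_html`, `classify_page`, `extract`, `Extraction`), the specific URL and HTML rules, and the `threshold` value are hypothetical. The sketch also assumes the three classifier stages run as a fall-through cascade, which the abstract does not state; the learned model, the extraction profiles, the quality predictor, and the neural fallback are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# The seven page types named in the abstract.
PAGE_TYPES = ("article", "forum", "product", "collection",
              "listing", "documentation", "service")


def classify_by_url(url: str) -> Optional[str]:
    """Stage 1: cheap URL heuristics (illustrative rules only)."""
    lowered = url.lower()
    if "/forum" in lowered or "/thread" in lowered:
        return "forum"
    if "/docs/" in lowered:
        return "documentation"
    if "/product/" in lowered:
        return "product"
    return None  # undecided, defer to the next stage


def classify_by_html(html: str) -> Optional[str]:
    """Stage 2: simple HTML signals (illustrative rules only)."""
    lowered = html.lower()
    if 'property="og:type" content="article"' in lowered:
        return "article"
    if 'itemtype="https://schema.org/product"' in lowered:
        return "product"
    return None  # undecided, defer to the learned model


def classify_page(url: str, html: str,
                  model: Callable[[str, str], str]) -> str:
    """Assumed cascade: first stage that returns a label wins."""
    return classify_by_url(url) or classify_by_html(html) or model(url, html)


@dataclass
class Extraction:
    text: str
    page_type: str
    used_fallback: bool


def extract(url: str, html: str, *,
            model: Callable[[str, str], str],
            profiles: Dict[str, Callable[[str], str]],
            predict_quality: Callable[[str, str], float],
            neural_fallback: Callable[[str], str],
            threshold: float = 0.5) -> Extraction:
    """Heuristic extraction first; neural fallback when quality looks low."""
    page_type = classify_page(url, html, model)
    text = profiles[page_type](html)
    # Hypothetical rule: scores below `threshold` mark the heuristic
    # result as unreliable, so the page is routed to the neural model.
    if predict_quality(html, text) < threshold:
        return Extraction(neural_fallback(html), page_type, True)
    return Extraction(text, page_type, False)
```

In a design like this, the expensive neural model runs only on the pages the quality predictor flags, which is how a hybrid pipeline could keep average latency close to that of the heuristic path.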