Background: Accurate interpretation of hepatitis B virus (HBV) polymerase sequences is essential for identifying antiviral resistance, particularly for high-genetic-barrier agents such as entecavir. Current resistance interpretation relies largely on deterministic rule-based systems that do not quantify uncertainty and are difficult to evaluate across independent datasets. We aimed to develop and externally validate a transparent probabilistic framework for reconstructing a predefined entecavir resistance pathway from HBV polymerase sequences.
Methods: HBV polymerase sequences were retrieved from the NCBI GenBank database and curated through translation, quality control, and deduplication to create the development dataset. Reverse transcriptase (RT) positions were indexed using motif-anchored numbering based on the YMDD-family motif. A genotypic proxy for the entecavir resistance pathway was defined by lamivudine-associated background substitutions combined with entecavir-associated RT substitutions. A logistic regression model with probability calibration was trained and internally validated using prespecified performance metrics and thresholds. External validation was performed on an independent HBVdb dataset with preprocessing, model parameters, and thresholds frozen prior to evaluation.
Results: The development dataset comprised 1,174 unique polymerase sequences, of which 268 met the resistance pathway definition. Internal validation demonstrated perfect discrimination, consistent with the deterministic genotypic definition of the outcome. External validation on 11,513 independent HBVdb sequences demonstrated reproducible performance across repositories despite a markedly lower prevalence of the resistance pathway (2.2%), with preserved discrimination and stable threshold-based performance.
Conclusions: This study presents a transparent and externally validated machine learning framework for probabilistic identification of the entecavir resistance pathway in HBV. The approach offers a reproducible probabilistic formalization of an established genotypic resistance definition and may serve as a methodological framework for standardized sequence-based resistance interpretation.
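As a rough illustration of the Methods, the Python sketch below shows how motif-anchored numbering and a rule-based pathway label could be computed from a translated polymerase sequence. It is not the authors' code. The motif set (YMDD, YVDD, YIDD), the ungapped offset mapping, the function names `rt_residues` and `has_pathway`, and the chosen positions (rtM204V/I as the lamivudine background; rt169, rt184, rt202, and rt250 as entecavir-associated sites) are all assumptions based on commonly reported HBV resistance patterns, not on the paper's exact definition.

```python
import re

# Assumption: "YMDD family" = wild-type YMDD plus YVDD/YIDD variants.
MOTIF = re.compile(r"Y[MVI]DD")
MOTIF_START_RT = 203  # the Y of YMDD is rt203 in standard RT numbering

# Assumption: illustrative pathway definition, not the paper's exact rule set.
LAM_BACKGROUND = {204: {"V", "I"}}  # rtM204V/I
ETV_WILD_TYPE = {169: "I", 184: "T", 202: "S", 250: "M"}


def rt_residues(protein, positions):
    """Return {rt_position: residue} by offset from a unique motif hit, else None."""
    hits = list(MOTIF.finditer(protein))
    if len(hits) != 1:  # no anchor, or an ambiguous one
        return None
    # Assumes no insertions or deletions between the motif and each position.
    offset = hits[0].start() - MOTIF_START_RT
    residues = {}
    for pos in positions:
        i = pos + offset
        residues[pos] = protein[i] if 0 <= i < len(protein) else None
    return residues


def has_pathway(protein):
    """True/False for the proxy label; None when the sequence cannot be numbered."""
    res = rt_residues(protein, [*LAM_BACKGROUND, *ETV_WILD_TYPE])
    if res is None:
        return None
    background = all(res[p] in allowed for p, allowed in LAM_BACKGROUND.items())
    # Any known residue other than wild type counts; X (unknown) is ignored.
    etv = any(res[p] not in (None, "X", wt) for p, wt in ETV_WILD_TYPE.items())
    return background and etv
```

Because a label of this kind is a deterministic function of the sequence, a classifier trained on the same positions can reproduce it exactly, which is consistent with the perfect internal discrimination reported in the Results.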
Kapatais et al. (Wed,) studied this question.