Integrating genomic datasets from homogeneous or disparate sources to identify genes that are commonly or uniquely expressed remains a largely underexplored area. Such integrative analysis can reveal biologically relevant genes that are common or exclusive across datasets or within specific conditions or cohorts. Identifying these gene expression profiles and using them to classify disease status can aid the development of vaccines, diagnostics, and targeted therapeutics effective against difficult-to-treat, medically important pathogens and cancer. This work develops new methodologies to integrate transcriptomic patterns from lung and spleen tissues infected by two Francisella tularensis strains, Schu4 and the Live Vaccine Strain (LVS). Our objectives are to (i) identify biologically relevant gene features indicative of respiratory infection, disease severity, and bacterial dissemination to the spleen, and (ii) develop a Weighted \(\ell_1\)-norm Non-Parallel Support Vector Machine (\(\ell_1\)-WNPSVM) that uses the selected genes to predict disease status. The \(\ell_1\)-WNPSVM is trained on the lung data and validated on the spleen data, introducing a form of transfer learning, with uninfected controls and Schu4 or LVS samples as classes. Currently, directly applying existing NPSVM-type methods to gene expression datasets, where the number of genes greatly exceeds the number of samples, is computationally impractical because of their large memory requirements. This work addresses that limitation by incorporating dimensionality reduction and gene selection into NPSVM-type frameworks, and the approach generalizes to models with similar formulations. The \(\ell_1\)-WNPSVM method outperforms traditional machine learning techniques such as ANN, XGBoost, AdaBoost, GradBoost, KNN, SVM, Naive Bayes, Random Forest, Logistic Regression, and Decision Tree, achieving 97% balanced accuracy on imbalanced data.
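Because the reported figure is a balanced accuracy on imbalanced data, a small illustration of that metric may help. The sketch below assumes the standard definition (the mean of per-class recall); the function name `balanced_accuracy` and the toy labels are hypothetical and are not taken from the study's data or code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall (standard balanced accuracy)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, deliberately imbalanced labels: 2 controls, 8 infected samples.
y_true = ["control"] * 2 + ["infected"] * 8
# One of the two controls is misclassified as infected.
y_pred = ["control", "infected"] + ["infected"] * 8

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
print(f"plain accuracy:    {plain_accuracy:.2f}")
print(f"balanced accuracy: {score:.2f}")
```

In this toy case the plain accuracy is dominated by the majority class, while the balanced accuracy weights each class equally, which is why it is the more informative figure when controls are scarce.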
We identified sets of 235 genes exclusively expressed in the lung and spleen tissues and used them to classify bacterial strains and controls, enabling prediction of disease status. Gene ontology analysis is performed to reveal the underlying metabolic pathways. Our analysis shows that signal transduction and disease (cancer) pathways are the most significant pathways activated in the lungs, while gene expression (transcription), immune system, and disease (cancer) pathways are activated in the spleen. Collectively, these pathways indicate a significant host response to infection and reflect how the bacteria interact with host tissues during dissemination.
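The notion of genes that are common to, or exclusive to, each tissue can be illustrated with plain set operations. This is only a minimal sketch: it assumes each tissue has already been reduced to a list of expressed gene identifiers, and the helper `partition_genes` and the placeholder gene names are hypothetical, not the study's selection procedure.

```python
def partition_genes(lung_genes, spleen_genes):
    """Split two gene lists into common and tissue-exclusive sets."""
    lung, spleen = set(lung_genes), set(spleen_genes)
    return {
        "common": lung & spleen,        # expressed in both tissues
        "lung_only": lung - spleen,     # expressed only in the lungs
        "spleen_only": spleen - lung,   # expressed only in the spleen
    }


# Placeholder identifiers, not real gene symbols from the datasets.
lung_expressed = ["GeneA", "GeneB", "GeneC"]
spleen_expressed = ["GeneB", "GeneC", "GeneD"]

groups = partition_genes(lung_expressed, spleen_expressed)
for name, genes in groups.items():
    print(name, sorted(genes))
```

A real analysis would first have to decide what counts as "expressed" (for example, through a differential expression threshold), which this sketch leaves out.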
Ugwu et al. studied this question.