After a U.S. Coast Guard (USCG) search and rescue (SAR) case, USCG personnel write an after-action report containing a textual narrative of the situation and the Coast Guard response. Data analysts explored how to identify reports of cases involving a verified person in the water. Because access to compute resources was restricted and policy limited their use, large language models (LLMs) could not be used, so statistical (‘classical’, non-neural) methods were considered for training a classification model to identify SAR case outcomes from report texts. The dataset was severely imbalanced toward the negative class, and the texts were extremely messy, with many typos and abbreviations. An extensive text cleaning pipeline was therefore developed and tested for its effect on classification performance. The Iterative Token Elimination Algorithm (iTEA) was developed to increase the differences in vocabulary between classes. Model improvement was further explored by augmenting the feature space with non-text data. The best model was an XGBoost model, which achieved 0.762 for both recall and precision and 0.959 accuracy. Errors on the test set are analyzed to guide future improvements until LLMs, which are expected to improve performance and reduce text cleaning requirements, can be used.
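The gap between the reported accuracy and the reported recall and precision is typical of a dataset imbalanced toward the negative class. The following is a minimal sketch of how the three metrics relate, not the authors' evaluation code. The function name and all confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper. Note that precision and recall are equal exactly when the number of false positives equals the number of false negatives.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Hypothetical counts (NOT from the paper): 1000 reports, 100 of them positive.
tp, fp, fn, tn = 80, 20, 20, 880
precision, recall, accuracy = classification_metrics(tp, fp, fn, tn)
print(f"model:    precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")

# A trivial baseline that labels every report negative still scores high
# accuracy on the same hypothetical data, while finding no positive cases.
base_precision, base_recall, base_accuracy = classification_metrics(0, 0, 100, 900)
print(f"baseline: precision={base_precision:.3f} recall={base_recall:.3f} accuracy={base_accuracy:.3f}")
```

In this illustrative setting the always-negative baseline reaches 0.9 accuracy with zero recall, which is why recall and precision on the positive class are the more informative figures for a rare outcome.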
Zachary Kudlak
United States Coast Guard Academy
Justin Sherman
United States Coast Guard Academy
The Journal of Defense Modeling and Simulation Applications Methodology Technology
United States Coast Guard Academy
Kudlak et al. studied this question.
synapsesocial.com/papers/69eb0bc7553a5433e34b552e — DOI: https://doi.org/10.1177/15485129261440549