Despite recent advances, medical question answering systems still struggle with domain-specific reasoning and data efficiency. This paper presents Med-LLaMA3, a family of medical large language models developed by parameter-efficient fine-tuning of the LLaMA-3.1 (8 billion) and LLaMA-3.2 (1 and 3 billion) architectures using quantized low-rank adaptation (QLoRA) and low-rank adaptation (LoRA) with 4-bit quantization. Beyond model training, this work contributes the following: (1) a formalized dataset curation taxonomy (source type × clinical granularity × task format) with a source-category ablation confirming that the multi-source combination drives benchmark gains beyond any single category; (2) a systematic characterization of low-rank-adaptation rank-scaling behavior for the LLaMA-3 family in the medical domain (monotonic improvement up to rank 128, with no observed plateau); and (3) statistically validated comparisons using McNemar’s test and 95% bootstrap confidence intervals. We curated a medical instruction dataset of over 1.5 million samples spanning medical examinations, clinical dialogues, and biomedical literature. Our approach trains only ∼4% of the base model’s parameters and, consistent with prior studies of parameter-efficient methods in the medical domain, achieves performance comparable to full fine-tuning at a fraction of the memory footprint. Evaluated with five in-context examples per prompt, the 8-billion-parameter model attains a mean accuracy of 75.71% across the eight medical-domain subsets of the Massive Multitask Language Understanding benchmark; improvements over the unmodified LLaMA-3.1-8B-Instruct baseline are statistically significant on the medical multiple-choice benchmark MedMCQA and, after Bonferroni correction across the eight subsets, on three subsets (Clinical Knowledge, Medical Genetics, and Nutrition), with two further subsets being significant only before correction. 
A structured named-entity-recognition evaluation on 100 hospital discharge summaries (macro-averaged F1 0.94; dual-annotator agreement κ=0.87) provides complementary evidence of clinical-text utility. A safety mitigation pilot shows that context-disambiguation preprocessing reduces the highest-severity abbreviation-ambiguity error rate from 30% to 10% on a 30-case held-out set. These results show that parameter-efficient fine-tuning can deliver high-performance medical large language models while training only ∼4% of the model’s parameters and reducing memory use by roughly 75%, enabling development on low-cost consumer-grade hardware.
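The statistical validation named above (McNemar's test, 95% bootstrap confidence intervals, and Bonferroni correction across the eight subsets) can be illustrated with a minimal sketch. It makes assumptions the abstract does not state: the exact (binomial) form of McNemar's test, a percentile bootstrap with 2,000 resamples, and per-question correctness given as 0/1 lists. The function names are hypothetical and this is not the authors' evaluation code.

```python
import math
import random


def mcnemar_exact(base_correct, tuned_correct):
    """Two-sided exact McNemar test on paired per-question correctness (0/1)."""
    # Only discordant pairs matter: questions where exactly one model is right.
    b = sum(1 for x, y in zip(base_correct, tuned_correct) if x and not y)
    c = sum(1 for x, y in zip(base_correct, tuned_correct) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy (assumed variant)."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample questions with replacement and record the accuracy each time.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)  # e.g. 0.05 / 8 for eight subsets
    return [p < threshold for p in p_values]
```

With eight subsets, the corrected threshold is 0.05 / 8 = 0.00625, which is why a subset can be significant before correction yet fail after it.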
El-Enen et al. (Wed,) studied this question.