What question did this study set out to answer?

This study set out to build NPC Fin 32B, a domain-specialized financial reasoning model, and to document a reproducible multi-GPU QLoRA recipe for training it.

April 28, 2026 | Open Access

NPC Fin 32B: A Domain-Specialized Financial Reasoning Model via Multi-GPU QLoRA

Key Points

Fine-tuned from Qwen2.5-32B-Instruct via QLoRA on 32,496 supervised examples (59.7M tokens).
Trained with DeepSpeed ZeRO-3 and full CPU offload across 12 NVIDIA H100 GPUs for approximately 72 hours (864 H100-hours).
Generated training labels synthetically with Qwen2.5-72B-Instruct from signals exported from a production MongoDB database.
Achieved 93.6% accuracy on a 500-question internal financial-reasoning benchmark.
Documented a reproducible 32B-scale fine-tuning recipe that fits a single multi-GPU node.
Identified a config-vs-runtime drift: the training YAML declared an effective batch size of 32, but the 12-GPU run realized approximately 384, and the peak learning rate of 2e-4 was not retuned.

Abstract

We describe NPC Fin 32B, a 32B-parameter financial-reasoning model fine-tuned from Qwen2.5-32B-Instruct via QLoRA on 32,496 supervised examples (59.7M tokens) drawn from five domain tags: crypto-signal analysis, broad crypto knowledge, multi-path logic-tree reasoning, equities and macroeconomic analysis, and cross-asset correlation. Training labels were generated synthetically from Qwen2.5-72B-Instruct over signals exported from a production MongoDB. The model achieves 93.6% on a 500-question internal financial-reasoning benchmark. The training run used DeepSpeed ZeRO-3 with full CPU offload across 12 NVIDIA H100 SXM5 80 GB GPUs for approximately 72 hours of wall-clock time, totalling 864 H100-hours. We document this as a recipe for 32B-scale domain-specialized supervised fine-tuning that fits the engineering surface of a small lab: a single multi-GPU node, open-weight base model, and synthetically-generated training labels. The paper's central honest observation is a config-vs-runtime drift that is invisible from the published model card alone. The training YAML inherited from an earlier single-GPU plan declared an effective batch size of 32 (per-device 4 × grad-accum 8), but the realized run, distributed across 12 GPUs under DeepSpeed ZeRO-3, scaled the global effective batch to approximately 384. The optimizer's peak learning rate of 2e-4 was tuned for the planned batch and was not retuned for the realized 12× scale-up; standard scaling rules would have suggested a peak nearer 7e-4. We document the discrepancy, discuss why the under-scaled LR did not destabilize training, and treat it as a real limitation of the recipe. The contribution is recipe-level: a documented, reproducible pipeline for a domain-specialized 32B reasoner on accessible multi-GPU hardware, with the config-drift bug and other unmet experiments reported alongside the wins.
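The batch-size and learning-rate figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical and nothing here is taken from the authors' training code. The abstract does not say which scaling rule it means by "standard scaling rules"; the assumption made here is that square-root scaling is intended, since it gives roughly 6.9e-4 (close to the quoted 7e-4), whereas linear scaling would give 2.4e-3.

```python
import math


def effective_batch(per_device: int, grad_accum: int, world_size: int) -> int:
    """Global effective batch = per-device batch x grad-accum steps x data-parallel ranks."""
    return per_device * grad_accum * world_size


def scaled_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "sqrt") -> float:
    """Rescale a learning rate for a new batch size (hypothetical helper).

    rule="sqrt" multiplies by sqrt(new/base); rule="linear" multiplies by new/base.
    """
    ratio = new_batch / base_batch
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    if rule == "linear":
        return base_lr * ratio
    raise ValueError(f"unknown scaling rule: {rule!r}")


# Values quoted in the abstract.
PER_DEVICE, GRAD_ACCUM, BASE_LR = 4, 8, 2e-4

planned = effective_batch(PER_DEVICE, GRAD_ACCUM, world_size=1)    # single-GPU plan
realized = effective_batch(PER_DEVICE, GRAD_ACCUM, world_size=12)  # 12-GPU run
scale = realized // planned

sqrt_lr = scaled_lr(BASE_LR, planned, realized, rule="sqrt")
linear_lr = scaled_lr(BASE_LR, planned, realized, rule="linear")

# Compute budget: 12 GPUs for roughly 72 wall-clock hours.
gpu_hours = 12 * 72

if __name__ == "__main__":
    print(f"planned batch:  {planned}")
    print(f"realized batch: {realized} ({scale}x)")
    print(f"sqrt-scaled LR:   {sqrt_lr:.2e}")
    print(f"linear-scaled LR: {linear_lr:.2e}")
    print(f"H100-hours: {gpu_hours}")
```

A check like `effective_batch(...)` run against the launcher's actual world size before training starts is one way such a drift between a config file and the realized run could be caught early.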

Cite This Study

Rama Krishna Bachu studied this question.

synapsesocial.com/papers/69f04e9b727298f751e728ad https://doi.org/10.5281/zenodo.19802598
