Vision–language models (VLMs) show promise for remote-sensing scene classification but still struggle with fine-grained categories and distribution shifts. We present a hierarchical prompting framework that decomposes recognition into a coarse-to-fine decision process with structured outputs, paired with parameter-efficient adaptation (LoRA/QLoRA). To assess robustness without relying on multiple external datasets, we construct five protocol variants of the AID dataset (V0–V4) that systematically vary label granularity, class consolidation, and augmentation settings. The design goals and construction rules of these variants, and their alignment with prompt styles, are summarized in Section 3.1.1 and Table 1. We enforce a split-before-augment pipeline, in which only the training split is augmented, to preclude leakage [27]. We further audit for leakage across splits using rotation/flip-invariant perceptual hashing [28], which supports reproducibility. Experiments across these AID variants show that hierarchical prompting consistently outperforms non-hierarchical prompts and matches or exceeds full fine-tuning while requiring substantially less compute. Ablations on prompt design, adaptation strategy, and model capacity, together with confusion matrices and class-wise metrics, demonstrate improved coarse- and fine-grained recognition as well as resilience to rotations and flips. The approach provides a strong, reproducible baseline for remote-sensing classification under constrained compute, with complete prompt templates and processing scripts supplied for replication.
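The split-before-augment rule and the rotation/flip-invariant leakage audit described above can be illustrated with a minimal sketch. It is not the processing script supplied with the paper. It assumes images have already been reduced to small grayscale grids, uses a simple average hash as a stand-in for whichever perceptual hash the audit actually employs, and flags only exact hash matches (a real audit would likely also threshold on Hamming distance). All function names (`dihedral_variants`, `invariant_hash`, `split_then_augment`, `audit_leakage`) are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Grid = Tuple[Tuple[int, ...], ...]  # small grayscale image as nested tuples


def rot90(g: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise (reverse rows, then transpose)."""
    return tuple(zip(*g[::-1]))


def flip_h(g: Grid) -> Grid:
    """Mirror a grid left-to-right."""
    return tuple(tuple(row[::-1]) for row in g)


def dihedral_variants(g: Sequence[Sequence[int]]) -> List[Grid]:
    """Return all 8 rotation/flip variants of a grid."""
    cur: Grid = tuple(tuple(row) for row in g)
    variants = []
    for _ in range(4):
        variants.append(cur)
        variants.append(flip_h(cur))
        cur = rot90(cur)
    return variants


def average_hash(g: Grid) -> int:
    """Average hash: one bit per pixel, set when the pixel exceeds the mean."""
    pixels = [p for row in g for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def invariant_hash(g: Sequence[Sequence[int]]) -> int:
    """Canonical hash: the minimum over all 8 variants, so any rotated or
    flipped copy of the same image maps to the same value."""
    return min(average_hash(v) for v in dihedral_variants(g))


def split_then_augment(grids, test_fraction=0.2, seed=0):
    """Split first, then augment the training split only (assumed protocol)."""
    rng = random.Random(seed)
    idx = list(range(len(grids)))
    rng.shuffle(idx)
    n_test = max(1, int(len(grids) * test_fraction))
    test = [grids[i] for i in idx[:n_test]]
    train = [v for i in idx[n_test:] for v in dihedral_variants(grids[i])]
    return train, test


def audit_leakage(train, test) -> List[Tuple[int, int]]:
    """Return (train_index, test_index) pairs whose invariant hashes match."""
    train_index = {}
    for i, g in enumerate(train):
        train_index.setdefault(invariant_hash(g), []).append(i)
    hits = []
    for j, g in enumerate(test):
        for i in train_index.get(invariant_hash(g), []):
            hits.append((i, j))
    return hits
```

Taking the minimum over the eight variants makes the hash invariant by construction, since any rotated or flipped copy generates the same set of eight grids. An empty result from `audit_leakage(train, test)` means that, under this hash, no test image is a rotated or flipped copy of a training image.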
Chen et al. (Thu,) studied this question.