Large Language Models (LLMs) differ widely in their performance across tasks, making efficient model selection essential for reliable and cost-effective deployment. This paper proposes a zero-shot LLM ranking framework that predicts the most suitable model for a given prompt without executing any candidate models. Using data from the TREC Million LLM Track, which includes 14,950 prompts evaluated across 1,130 LLMs, the framework integrates prompt-aware, cluster-aware, and LLM metadata-aware embeddings within an end-to-end neural architecture. The proposed model achieves an nDCG@10 of 0.3451 and an MRR of 0.2550, a 38% improvement over single-feature baselines. Analysis across 2,990 test prompts shows that ranking effectiveness varies with prompt type, prompt length, and search intent. The results demonstrate that fusing heterogeneous features enables accurate zero-shot LLM selection while avoiding the computational cost of running every candidate model. This work provides a scalable and energy-efficient alternative to brute-force evaluation and establishes a foundation for adaptive, prompt-aware routing in multi-LLM systems.
Bashir et al. (Wed,) studied this question.
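
For readers unfamiliar with the two ranking metrics reported above, the following is a minimal sketch of how nDCG@10 and MRR are commonly computed for a ranked list of candidate LLMs. It assumes the linear-gain form of DCG and treats any relevance label greater than zero as relevant; the exact gain function and relevance labels used in the paper are not stated here, and the function names are illustrative only.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain, an assumption)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the predicted order divided by DCG of the ideal order."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant candidate for this prompt
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

def reciprocal_rank(ranked_relevances):
    """1 / rank of the first relevant item, or 0 if none is relevant."""
    for rank, rel in enumerate(ranked_relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings):
    """MRR: the reciprocal rank averaged over all prompts."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)
```

Each input list holds the relevance labels of the candidate LLMs for one prompt, in the order the ranker placed them. A score of 1.0 on either metric means the best model was ranked first for every prompt.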