What question did this study set out to answer?

When should practitioners use prompt compression versus model routing to reduce the cost of large language model inference?

January 22, 2026 | Open Access

Compress or Route? Task-Dependent Strategies for Cost-Efficient Large Language Model Inference

Key Points

Explored cost-reduction strategies for large language model inference, focusing on prompt compression and model routing.
Conducted a factorial experiment with 72 conditions and 2,650 trials.
Analyzed performance in code generation and chain-of-thought reasoning tasks.
Developed a decision framework based on task characteristics for efficient inference.
Identified a threshold compression ratio for code tasks (r ≥ 0.6) where quality remains high.
Achieved up to 93% cost reduction compared to premium inference options.
For reasoning tasks, showed effective cost savings through model routing despite gradual quality degradation.

Abstract

Large Language Models (LLMs) have revolutionized AI applications, but their inference costs remain a significant barrier to widespread deployment. We investigate a fundamental question: when should practitioners use prompt compression versus model routing to reduce costs? Through a factorial experiment (72 conditions, 2,650 trials), we reveal a critical task-dependent dichotomy. Code generation tasks exhibit threshold behavior: quality is maintained at compression ratios r ≥ 0.6 (and is perfect at r ≥ 0.7), then degrades sharply below r = 0.6 (the "cliff effect"), which makes compression above this threshold an effective cost-reduction strategy. In contrast, chain-of-thought (CoT) reasoning tasks degrade gradually under compression but achieve comparable cost savings through intelligent routing to appropriately capable models. We formalize this finding into a task-aware decision framework that applies compression-first strategies to code tasks and routing-first strategies to reasoning tasks. The framework achieves substantial cost reduction (up to 93% relative to premium inference, exceeding our target of more than 60%) with only 6.2% quality degradation. We release our benchmark suite and decision framework to facilitate reproducible research.
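The task-aware rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the released decision framework: the names `Plan`, `choose_plan`, and `CODE_MIN_RATIO` are hypothetical, and the sketch assumes that r denotes the fraction of the prompt retained after compression (so r = 1.0 means no compression). Only the 0.6 threshold for code tasks is taken from the abstract.

```python
from dataclasses import dataclass

# Threshold reported in the abstract: code-generation quality holds for r >= 0.6.
CODE_MIN_RATIO = 0.6


@dataclass(frozen=True)
class Plan:
    """Hypothetical container for the chosen cost-reduction strategy."""
    strategy: str             # "compress" or "route"
    compression_ratio: float  # assumed: fraction of the prompt kept (1.0 = uncompressed)


def choose_plan(task_type: str, requested_ratio: float = CODE_MIN_RATIO) -> Plan:
    """Pick compression-first for code tasks and routing-first for reasoning tasks."""
    if not 0.0 < requested_ratio <= 1.0:
        raise ValueError("requested_ratio must be in (0, 1]")

    if task_type == "code":
        # Never compress past the cliff: clamp the ratio to the reported threshold.
        return Plan("compress", max(requested_ratio, CODE_MIN_RATIO))

    if task_type == "reasoning":
        # Reasoning quality degrades gradually under compression, so leave the
        # prompt intact and defer model selection to a router (not modeled here).
        return Plan("route", 1.0)

    raise ValueError(f"unknown task type: {task_type!r}")


if __name__ == "__main__":
    print(choose_plan("code", requested_ratio=0.4))
    print(choose_plan("reasoning"))
```

Clamping rather than rejecting an aggressive ratio is a design choice made for this sketch; a stricter variant could raise an error whenever a code task requests r < 0.6.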
