Large Language Models (LLMs) have revolutionized AI applications, but their inference costs remain a significant barrier to widespread deployment. We investigate a fundamental question: when should practitioners use prompt compression, and when model routing, to reduce costs? Through a factorial experiment (72 conditions, 2,650 trials), we reveal a task-dependent dichotomy. Code generation tasks exhibit threshold behavior: quality is maintained at compression ratios r ≥ 0.6 (and is perfect at r ≥ 0.7) but degrades sharply below r = 0.6 (the "cliff effect"), which makes compression above this threshold an effective cost-reduction strategy. In contrast, chain-of-thought (CoT) reasoning tasks degrade gradually under compression but achieve comparable cost savings through routing to appropriately capable models. We formalize this finding into a task-aware decision framework that applies compression-first strategies to code tasks and routing-first strategies to reasoning tasks. The framework reduces cost by up to 93% relative to premium inference, exceeding our target of more than 60%, with only 6.2% quality degradation. We release our benchmark suite and decision framework to facilitate reproducible research.
Warren Johnson studied this question.
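The task-aware decision rule summarized in the abstract can be illustrated with a minimal sketch. This is not the released framework: the names `Plan`, `choose_plan`, and `CODE_CLIFF_RATIO` are hypothetical, r is assumed to be the fraction of the prompt that is kept, and leaving reasoning prompts uncompressed is an assumption made for simplicity. Only the 0.6 threshold comes from the abstract.

```python
from dataclasses import dataclass

# From the abstract: code-generation quality holds at r >= 0.6 and
# drops sharply below it (the "cliff effect").
CODE_CLIFF_RATIO = 0.6


@dataclass(frozen=True)
class Plan:
    """Hypothetical container for the chosen cost-reduction strategy."""
    strategy: str  # "compress" or "route"
    ratio: float   # assumed meaning: fraction of the prompt kept (1.0 = uncompressed)


def choose_plan(task_type: str, requested_ratio: float) -> Plan:
    """Pick compression-first for code tasks and routing-first for reasoning tasks."""
    if not 0.0 < requested_ratio <= 1.0:
        raise ValueError("requested_ratio must be in (0, 1]")
    if task_type == "code":
        # Never compress past the cliff: clamp the ratio to the threshold.
        return Plan("compress", max(requested_ratio, CODE_CLIFF_RATIO))
    if task_type == "reasoning":
        # Assumption: reasoning prompts are left uncompressed and are
        # instead sent to a cheaper, sufficiently capable model.
        return Plan("route", 1.0)
    raise ValueError(f"unknown task type: {task_type!r}")
```

For example, `choose_plan("code", 0.4)` clamps the request to the threshold rather than compressing into the region where quality collapses, while `choose_plan("reasoning", 0.4)` ignores the requested ratio and selects routing.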