When a language model generates multiple candidate answers, how should we pick the best one? The default strategy - majority voting - treats the model as a black box, discarding everything except final answer strings. We show that the model's internal computations already contain a usable signal for answer quality, and that a remarkably simple method can extract it. We propose trajectory probes: lightweight linear classifiers trained on hidden state features aggregated across the generation process. From each candidate answer, we extract mean, standard deviation, and final-token activations at eight evenly spaced layers, projected to 256 dimensions - a 6,144-dimensional trajectory fingerprint. A logistic regression probe trained with a pairwise ranking objective (RankNet) learns to prefer correct answers over incorrect ones from the same question. On TriviaQA (Llama-3.1-8B-Instruct, K=4, T=0.3; mean ± std over 3 seeds), the probe reaches 56.4% ± 3.9 versus 51.3% ± 3.9 for majority voting, recovering 58.4% ± 3.0% of the gap to the oracle upper bound, with a selection precision (PickAcc) of 91.2% ± 1.7% on questions where at least one sampled answer is correct. On MATH, gains are smaller and strongly K-dependent: at K_eval=2 the probe improves over majority voting by +2.1 points (3/3 seeds positive), while at the canonical K_eval=4 the improvement narrows to +0.6 ± 1.0 points and is not statistically significant. Two findings surprised us. First, the choice of training objective can matter more than standalone classifier quality: a binary classifier with higher cross-validated AUC can underperform a pairwise probe with lower AUC, because ranking among candidates is a different task than classifying correctness in isolation.
Second, the per-layer signal distribution acts as a domain fingerprint - factual recall spreads information across layers while mathematical reasoning concentrates it in the final third - yet a single probe trained on mixed-domain data can match domain-specific specialists. Our results suggest that the "verifier" for best-of-K selection need not be a separate model or an additional LLM call. It can be a linear function of what the model already computes.
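
As a rough illustration of the pipeline described above, the following NumPy sketch builds a 6,144-dimensional fingerprint (3 statistics × 8 layers × 256 dimensions) and fits a linear probe with a RankNet-style pairwise logistic loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the random projection matrix `proj`, the choice to project after pooling, the plain gradient-descent optimizer and its hyperparameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def trajectory_fingerprint(hidden, layer_ids, proj):
    """Pool one candidate's activations into a single feature vector.

    hidden:    (num_layers, num_tokens, d_model) hidden states.
    layer_ids: indices of the layers to read (eight in the abstract).
    proj:      (d_model, 256) projection matrix (assumed; the abstract does
               not say how the projection is obtained or when it is applied).
    """
    feats = []
    for layer in layer_ids:
        h = hidden[layer]                                # (num_tokens, d_model)
        for stat in (h.mean(axis=0), h.std(axis=0), h[-1]):
            feats.append(stat @ proj)                    # (256,)
    return np.concatenate(feats)

def pairwise_differences(X, y, qid):
    """Feature differences (correct - incorrect) within each question."""
    diffs = []
    for q in np.unique(qid):
        idx = np.flatnonzero(qid == q)
        pos, neg = idx[y[idx] == 1], idx[y[idx] == 0]
        diffs.extend(X[i] - X[j] for i in pos for j in neg)
    if not diffs:
        raise ValueError("no question has both correct and incorrect answers")
    return np.asarray(diffs)

def train_ranknet(X, y, qid, lr=0.1, epochs=200, l2=1e-3):
    """Linear scorer trained with the loss log(1 + exp(-w . (x_pos - x_neg)))."""
    D = pairwise_differences(X, y, qid)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margin = D @ w
        # sigmoid(-margin), written with tanh for numerical stability
        sig_neg = 0.5 * (1.0 - np.tanh(margin / 2.0))
        grad = -(D * sig_neg[:, None]).mean(axis=0) + l2 * w
        w -= lr * grad
    return w

def select_answer(X_q, w):
    """Index of the highest-scoring candidate for one question."""
    return int(np.argmax(X_q @ w))

rng = np.random.default_rng(0)

# Fingerprint shape check on random "activations" (toy sizes, not Llama's).
hidden = rng.normal(size=(32, 10, 16))
layer_ids = np.linspace(0, 31, 8).round().astype(int)    # eight evenly spaced layers
proj = rng.normal(size=(16, 256)) / 4.0
fp = trajectory_fingerprint(hidden, layer_ids, proj)     # shape (6144,)

# Synthetic ranking problem: correctness depends on feature 0 only.
K, n_q, dim = 4, 50, 5
X = rng.normal(size=(n_q * K, dim))
y = (X[:, 0] > 0).astype(int)
qid = np.repeat(np.arange(n_q), K)
w = train_ranknet(X, y, qid)

# PickAcc: accuracy restricted to questions with at least one correct candidate.
hits, total = 0, 0
for q in range(n_q):
    idx = np.flatnonzero(qid == q)
    if y[idx].any():
        total += 1
        hits += int(y[idx[select_answer(X[idx], w)]])
pick_acc = hits / total
```

Because the loss depends only on within-question differences, the probe is never asked to calibrate scores across questions, which is one way to read the abstract's observation that a ranking objective can beat a binary classifier with higher AUC.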
Nikolay Yudin (Sun,) studied this question.
synapsesocial.com/papers/6994058c4e9c9e835dfd67fd — DOI: https://doi.org/10.5281/zenodo.18649682