This technical note presents a minimal stress test for referential binding in small instruction-tuned language models. The experiment uses a controlled synthetic dataset of 60 queries divided into three balanced families: entity binding, relational binding, and compositional binding. In all cases, the information required to answer correctly is explicitly present in the prompt. The goal is therefore not to test knowledge retrieval, but to test whether a model can select the correct referential target from a valid support set. Three metrics are used: Atomic Support Score (ASS), Binding Legitimacy Score (BLS), and Illegitimate Binding Rate (IBR). The results show that models can remain inside the valid support set while selecting the wrong relational or compositional target. The note reports results for Qwen/Qwen2.5-1.5B-Instruct and HuggingFaceTB/SmolLM2-1.7B-Instruct, with a deterministic repeat run for Qwen. The accompanying package includes result CSV files, summary tables, requirements, README, and a reproducibility script. The contribution is intended as a narrow diagnostic stress test. It does not propose a general theory of grounding or a general benchmark for language models.
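The abstract names the three metrics without defining them, so the sketch below shows only one plausible reading, not the note's actual definitions. It assumes ASS is the share of answers that fall inside the prompt's valid support set, BLS is the share that match the intended target, and IBR is the share that fall inside the support set but pick the wrong target. The `Record` type, the `score` function, and the toy records are hypothetical illustrations and are not drawn from the accompanying package or its CSV files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One scored query (hypothetical schema, not the note's CSV layout)."""
    family: str          # "entity", "relational", or "compositional"
    support: frozenset   # candidate referents explicitly present in the prompt
    target: str          # the referent the query actually asks for
    answer: str          # the referent the model selected


def score(records):
    """Compute ASS, BLS, and IBR under the assumed definitions above."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    # Assumed ASS: the answer is any member of the valid support set.
    in_support = sum(r.answer in r.support for r in records)
    # Assumed BLS: the answer is exactly the intended target.
    legitimate = sum(r.answer == r.target for r in records)
    # Assumed IBR: the answer is in the support set but is the wrong target.
    illegitimate = sum(
        r.answer in r.support and r.answer != r.target for r in records
    )
    return {
        "ASS": in_support / n,
        "BLS": legitimate / n,
        "IBR": illegitimate / n,
    }


# Toy records for illustration only; these are not results from the note.
names = frozenset({"Anna", "Marco"})
toy = [
    Record("entity", names, "Anna", "Anna"),          # correct binding
    Record("relational", names, "Marco", "Anna"),     # in support, wrong target
    Record("compositional", names, "Marco", "Luca"),  # outside the support set
    Record("relational", names, "Anna", "Anna"),      # correct binding
]
result = score(toy)
print(result)
```

Under this reading, a model can score high on ASS while BLS stays low and IBR rises, which is the pattern the abstract describes: the answer stays inside the valid support set but binds to the wrong referent.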
Danilo Tavella
Danilo Tavella (Fri,) studied this question.
www.synapsesocial.com/papers/69f6e5ac8071d4f1bdfc6540 — DOI: https://doi.org/10.5281/zenodo.19944707