What question did this study set out to answer?

This note aims to evaluate whether small instruction-tuned language models can correctly select referential targets from a valid support set.

May 3, 2026 | Open Access

Referential Binding Stress Test: Supported Fragments Do Not Guarantee Referential Legitimacy in Small Instruction-Tuned Language Models

Key points

Tests whether small instruction-tuned language models select the correct referential target when a valid support set is given in the prompt.
Uses a synthetic dataset of 60 queries divided into three balanced families: entity binding, relational binding, and compositional binding.
Applies three metrics: Atomic Support Score (ASS), Binding Legitimacy Score (BLS), and Illegitimate Binding Rate (IBR).
Reports results for Qwen/Qwen2.5-1.5B-Instruct and HuggingFaceTB/SmolLM2-1.7B-Instruct.
Finds that models often select the wrong relational or compositional target even though the correct support set is available to them.
Finds that the metrics indicate a significant illegitimate binding rate in the tested models.

Abstract

This technical note presents a minimal stress test for referential binding in small instruction-tuned language models. The experiment uses a controlled synthetic dataset of 60 queries divided into three balanced families: entity binding, relational binding, and compositional binding. In all cases, the information required to answer correctly is explicitly present in the prompt. The goal is therefore not to test knowledge retrieval, but to test whether a model can select the correct referential target from a valid support set. Three metrics are used: Atomic Support Score (ASS), Binding Legitimacy Score (BLS), and Illegitimate Binding Rate (IBR). The results show that models can remain inside the valid support set while selecting the wrong relational or compositional target. The note reports results for Qwen/Qwen2.5-1.5B-Instruct and HuggingFaceTB/SmolLM2-1.7B-Instruct, with a deterministic repeat run for Qwen. The accompanying package includes result CSV files, summary tables, requirements, README, and a reproducibility script. The contribution is intended as a narrow diagnostic stress test. It does not propose a general theory of grounding or a general benchmark for language models.
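The abstract names the three metrics but does not define them here. The sketch below shows one plausible way such scores could be computed, purely as an illustration. The definitions are assumptions, not the note's own: ASS is taken as the share of answers that fall inside the support set, BLS as the share that match the correct target, and IBR as the share that fall inside the support set but name the wrong target. The `Trial` class, the `score` function, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    family: str             # "entity", "relational", or "compositional"
    support_set: frozenset  # referents explicitly present in the prompt
    target: str             # the single correct referent
    answer: str             # the model's normalized answer


def score(trials):
    """Return the ASSUMED ASS, BLS, and IBR as fractions of all trials."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    # Assumed ASS: the answer is one of the supported referents.
    supported = sum(t.answer in t.support_set for t in trials)
    # Assumed BLS: the answer is the correct referent.
    legitimate = sum(t.answer == t.target for t in trials)
    # Assumed IBR: the answer is supported but is not the correct referent.
    illegitimate = sum(
        t.answer in t.support_set and t.answer != t.target for t in trials
    )
    return {"ASS": supported / n, "BLS": legitimate / n, "IBR": illegitimate / n}


# Toy data, invented for illustration only (not from the note's dataset).
support = frozenset({"Alice", "Bob", "Carol"})
trials = [
    Trial("entity", support, "Alice", "Alice"),         # correct target
    Trial("relational", support, "Bob", "Carol"),       # supported, wrong target
    Trial("compositional", support, "Carol", "Carol"),  # correct target
    Trial("relational", support, "Bob", "Dave"),        # outside the support set
]
scores = score(trials)
print(scores)
```

The second toy trial is the case the abstract highlights: the answer stays inside the valid support set yet binds to the wrong referent, so it counts toward the assumed ASS and IBR but not toward BLS. The definitions used in the actual study should be taken from the accompanying package.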

Cite This Study

Danilo Tavella (Fri,) studied this question.

synapsesocial.com/papers/69f6e5ac8071d4f1bdfc6540 https://doi.org/10.5281/zenodo.19944707
