In the behavioural sciences, researchers often report transformed effect size measures like correlation coefficients and Cohen’s ds, purportedly to facilitate interpretability and comparability to other studies. These values are often referred to as standardised and treated as unitless. In reality, they are relative to the sample-local estimate of variability. Variances differ from sample to sample for reasons unrelated to the size of the effect, such as selection bias and measurement error. Heterogeneity in the local sample variances will in turn introduce heterogeneity into the effect sizes standardised using them when no such heterogeneity exists in the raw effects. Psychologists have resisted past calls to report raw effect sizes instead of “standardised” effects, at least in part because latent variables and Likert scales lack natural units. Recognising the continuing necessity for effect size standardisation in some subfields, we argue that replacing local with common standards for the purpose of standardisation would improve the interpretability and comparability of results reported by psychological studies. Whenever available, test norms (i.e., population estimates of the mean and standard deviation of our outcome measure) could serve as the field-wide common standards to which smaller study samples are compared and calibrated. Concretely, variables would be transformed to standard scores or z-scores according to the norm values, reports of sample descriptive statistics would include the average norm-standardised scores, and effects would be reported in those norm-standardised units. Changing reporting in this manner would make it easy to diagnose selection bias at a glance (manifested in range restriction for example) in addition to yielding more interpretable and comparable effect sizes. In the long term, we need to invest significantly more resources into deriving test norms to increase the availability of high-quality common standards. 
In the short term, we argue that even subpar common standards are better than local ones.
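The contrast between local and common standards can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the norm values (mean 100, SD 15), the two made-up studies, and the helper names `to_norm_z`, `local_d`, and `norm_effect` do not come from the paper. Both studies share the same raw group difference of 3 points, but the second sample is range-restricted.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical test norms, assumed purely for illustration.
NORM_MEAN = 100.0
NORM_SD = 15.0


def to_norm_z(scores, norm_mean=NORM_MEAN, norm_sd=NORM_SD):
    """Express raw scores as z-scores relative to the norm, not the sample."""
    return [(x - norm_mean) / norm_sd for x in scores]


def pooled_sd(a, b):
    """Pooled sample standard deviation of two groups (the local standard)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return sqrt(pooled_var)


def local_d(treated, control):
    """Cohen's d standardised by the sample-local pooled SD."""
    return (mean(treated) - mean(control)) / pooled_sd(treated, control)


def norm_effect(treated, control):
    """Mean difference expressed in norm-standardised (z-score) units."""
    return mean(to_norm_z(treated)) - mean(to_norm_z(control))


# Made-up data: identical raw effect (+3 points) in both studies.
study_a_control = [85, 95, 100, 105, 115]    # spread close to the norm
study_a_treated = [x + 3 for x in study_a_control]
study_b_control = [108, 109, 110, 111, 112]  # range-restricted, selected sample
study_b_treated = [x + 3 for x in study_b_control]

for name, treated, control in [
    ("A", study_a_treated, study_a_control),
    ("B", study_b_treated, study_b_control),
]:
    print(
        f"Study {name}: "
        f"control mean (norm z) = {mean(to_norm_z(control)):.2f}, "
        f"local d = {local_d(treated, control):.2f}, "
        f"norm-standardised effect = {norm_effect(treated, control):.2f}"
    )
```

Under these assumptions, the locally standardised d differs sharply between the two studies even though the raw effect is identical, whereas the norm-standardised effect is the same in both (3 / 15 = 0.2). The control group's mean norm z-score in study B also sits well above zero, which is the kind of at-a-glance signal of selection the abstract describes.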
Alsalti et al. studied this question.