This study examined the stability of scores on two types of performance assessments, an observed hands‐on investigation and a notebook surrogate. Twenty‐nine sixth‐grade students in a hands‐on inquiry‐based science curriculum completed three investigations on two occasions separated by 5 months. Results indicated that: (a) the generalizability across occasions for relative decisions was, on average, moderate for the observed investigations (.52) and the notebooks (.50); (b) the generalizability for absolute decisions was only slightly lower; (c) the major source of measurement error was the person by occasion (residual) interaction; and (d) the procedures students used to carry out the investigations tended to change from one occasion to the other.
Ruiz‐Primo et al. studied this question.