Computer professionals have a need for robust, easy-to-use usability evaluation methods (UEMs) to help them systematically improve the usability of computer artifacts. However, 3 of the most widely used UEMs, namely cognitive walkthrough (CW), heuristic evaluation (HE), and thinking-aloud study (TA), suffer from a substantial evaluator effect in that multiple evaluators evaluating the same interface with the same UEM detect markedly different sets of problems. A review of 11 studies of these 3 UEMs reveals that the evaluator effect exists for both novice and experienced evaluators, for both cosmetic and severe problems, for both problem detection and severity assessment, and for evaluations of both simple and complex systems. The average agreement between any 2 evaluators who have evaluated the same system using the same UEM ranges from 5% to 65%, and no 1 of the 3 UEMs is consistently better than the others. Although evaluator effects of this magnitude may not be surprising for a UEM as informal as HE, it is certainly notable that a substantial evaluator effect persists for evaluators who apply the strict procedure of CW or observe users thinking out loud. Hence, it is highly questionable to use a TA with 1 evaluator as an authoritative statement about what problems an interface contains. Generally, the application of the UEMs is characterized by (a) vague goal analyses leading to variability in the task scenarios, (b) vague evaluation procedures leading to anchoring, or (c) vague problem criteria leading to anything being accepted as a usability problem, or all of these. The
Hertzum et al. studied this question.
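The abstract reports an average agreement between any 2 evaluators but does not spell out how it is computed. One plausible reading, offered here only as a minimal sketch, is the overlap between two evaluators' problem sets divided by their union, averaged over all evaluator pairs. That formula, the function name `any_two_agreement`, and the toy problem labels below are assumptions for illustration, not data or definitions taken from the reviewed studies.

```python
from itertools import combinations


def any_two_agreement(problem_sets):
    """Average pairwise overlap (intersection / union) of problem sets.

    Assumed measure: the abstract does not define the formula.
    """
    sets = [set(p) for p in problem_sets]
    if len(sets) < 2:
        raise ValueError("need at least two evaluators")
    scores = []
    for a, b in combinations(sets, 2):
        union = a | b
        # Assumption: two evaluators who both report nothing agree fully.
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical problem sets found by three evaluators of one interface.
evaluators = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3", "p4"},
    "C": {"p3", "p5"},
}

agreement = any_two_agreement(evaluators.values())
print(f"any-two agreement: {agreement:.2f}")
```

In this toy case the pairwise overlaps are 2/4, 1/4, and 1/4, so the average is about 0.33, which sits inside the 5% to 65% range the abstract reports. A measure built on pairs has the practical advantage that it does not depend on knowing the total number of problems in the interface.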