Understanding how humans interact with and are influenced by intelligent systems is essential for improving their design and effectiveness. Although computer vision research has largely centered on algorithmic advances, these increasingly complex systems are ultimately used by people whose decisions and interpretations shape how they function in practice. As models grow more sophisticated and are rapidly deployed in real-world settings, their internal processes often appear as ``black boxes,'' especially to non-experts, creating uncertainty about how to interpret outputs, how to communicate intent, and how much to rely on system feedback. This disconnect underscores the importance of integrating humans into the design and evaluation of computer vision systems to ensure alignment with user needs and capabilities. This dissertation addresses this gap through a comprehensive evaluation of human involvement throughout the modern computer vision pipeline. We focus on two primary roles: annotators, who create and refine training data, and users, who engage with deployed systems and rely on explanations to inform their decisions. These roles are central to four stages of the pipeline where human decision-making, input, and interpretation directly affect system performance and outcomes: data labeling, training, deployment, and explainability. We conducted human-centered evaluations across these four stages. For the data labeling stage, we examined how varying the amount of context available to annotators influenced efficiency and accuracy in an object matching task. Reduced context improved efficiency without compromising accuracy, while additional context was helpful when objects were less distinctive or image quality was poor.
At the training stage, where annotators’ norms and biases shape the training data, we substituted gendered terms in existing image captioning datasets with gender-neutral equivalents to reduce gender bias, and we examined the impact on model outputs and perceived caption quality. Models trained on neutral data produced fewer gendered descriptions while maintaining, and in some cases improving, descriptive quality. During deployment, we compared two interaction modes for intent communication in multimodal instruction-based image editing systems: post-edit correction and proactive clarification. Although both modes led to similar task performance, post-edit correction helped users better understand how the system functioned and gave them a greater sense of communicative ease. Finally, for explainability, we evaluated the impact of AI-generated visual explanations on users’ decision-making in AI-assisted systems. Although the explanations did not affect performance or confidence, they helped users develop better mental models of system behavior. Users’ AI literacy shaped how they approached the task and how they used the explanations. By centering the roles of annotators and users, these contributions collectively identify opportunities to improve efficiency, interpretability, and the overall effectiveness of human interactions across these stages. This dissertation aims to guide the development of human-centered computer vision systems that effectively meet the needs of diverse users.
Albatool Wazzan studied this question.