July 26, 2022 · Open Access

American == White in Multimodal Language-and-Image AI

Abstract

Three state-of-the-art language-and-image AI models, CLIP, SLIP, and BLIP, are evaluated for evidence of a bias previously observed in social and experimental psychology: equating American identity with being White. Embedding association tests (EATs) using standardized images of self-identified Asian, Black, Latina/o, and White individuals from the Chicago Face Database (CFD) reveal that White individuals are more associated with collective in-group words than are Asian, Black, or Latina/o individuals, with effect sizes >.4 for White vs. Asian comparisons across all models. In assessments of three core aspects of American identity reported by social psychologists, single-category EATs reveal that images of White individuals are more associated with patriotism and with being born in America, but that, consistent with prior findings in psychology, White individuals are associated with being less likely to treat people of all races and backgrounds equally. Additional tests reveal that the number of images of Black individuals returned by an image ranking task is more strongly correlated with state-level implicit bias scores for White individuals (Pearson's ρ=.63 in CLIP, ρ=.69 in BLIP) than are state demographics (ρ=.60), suggesting a relationship between regional prototypicality and implicit bias. Three downstream machine learning tasks demonstrate biases associating American with White. In a visual question answering task using BLIP, 97% of White individuals are identified as American, compared to only 3% of Asian individuals. When asked in what state the depicted individual lives, the model responds China 53% of the time for Asian individuals, but always with an American state for White individuals. In an image captioning task, BLIP remarks upon the race of Asian individuals as much as 36% of the time, and the race of Black individuals as much as 18% of the time, but never remarks upon race for White individuals. Finally, when provided with an initialization image of individuals from the CFD and the text "an American person," a synthetic image generator (VQGAN) using the text-based guidance of CLIP consistently lightens the skin tone of individuals of all races (by 35% for Black individuals, based on mean pixel brightness), and generates output images of White individuals with blonde hair. The results indicate that societal biases equating American identity with being White are learned by multimodal language-and-image AI, and that these biases propagate to downstream applications of such models.
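
The effect sizes quoted in the abstract come from embedding association tests. As a rough illustration only, and not the authors' code, the sketch below computes a two-target effect size, assuming the EATs follow the standard WEAT-style formulation (differential cosine similarity, normalized by the standard deviation over both target groups). The function names, the array shapes, and the choice of the sample standard deviation (`ddof=1`) are assumptions made for this example; the random vectors merely stand in for image and text embeddings that a model such as CLIP would produce.

```python
import numpy as np


def _unit_rows(m):
    """Scale each row of a 2-D array to unit length."""
    m = np.asarray(m, dtype=float)
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def association(W, A, B):
    """For each row w of W: mean cosine similarity to A minus mean cosine similarity to B."""
    W, A, B = _unit_rows(W), _unit_rows(A), _unit_rows(B)
    return (W @ A.T).mean(axis=1) - (W @ B.T).mean(axis=1)


def eat_effect_size(X, Y, A, B):
    """WEAT-style effect size for target sets X, Y and attribute sets A, B.

    Assumption: the pooled standard deviation uses ddof=1 (sample std).
    """
    s_x = association(X, A, B)
    s_y = association(Y, A, B)
    pooled = np.concatenate([s_x, s_y])
    return (s_x.mean() - s_y.mean()) / pooled.std(ddof=1)


# Hypothetical stand-in data: rows are embeddings, columns are dimensions.
rng = np.random.default_rng(0)
dim = 16
X = rng.normal(size=(10, dim))  # e.g. image embeddings for one group
Y = rng.normal(size=(10, dim))  # e.g. image embeddings for another group
A = rng.normal(size=(8, dim))   # e.g. text embeddings for one attribute word set
B = rng.normal(size=(8, dim))   # e.g. text embeddings for the contrasting word set
print(eat_effect_size(X, Y, A, B))
```

A positive value indicates that the first target group is more associated with the first attribute set than the second target group is; swapping the two target groups flips the sign.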
