To the general public, text-to-image generators such as Midjourney and DALL-E seem to work by magic, and their inner workings are indeed often frustratingly opaque. This opacity stems partly from big tech companies' lack of transparency about aspects such as training data and how the algorithms powering their generators work, and partly from the deep technical knowledge of computer science and machine learning required to understand these workings. Acknowledging these aspects, this qualitative examination seeks to better understand the black box of algorithmic vision by first asking a large language model to describe two sets of visually distinct journalistic images. The resulting descriptions are then fed back into the same large language model to see how the AI tool remediates these images. In doing so, this study evaluates how machines process the images in each set and which specific visual style elements, across three dimensions (representational, aesthetic and technical), machine vision regards as important for the description and which it does not. Taken together, this exploration helps scholars understand more about how computers process, describe and render images, including the attributes they focus on and those they tend to ignore when doing so.
T.J. Thomson (Fri,) studied this question.