Recent AI breakthroughs have significantly improved performance across a wide range of tasks, yet machine learning models still struggle with scene understanding and abstract reasoning, challenges that humans solve effortlessly. We explore how structured representations can enhance AI’s ability to tackle these problems. First, we examine how organizing multimodal information into structured formats helps summarize scene content and address data imbalances, achieving state-of-the-art results in Controllable Image Captioning (CIC). Next, we focus on learning structured scene representations from raw pixel information for Abstract Visual Reasoning (AVR), leading to interpretable representations, enhanced generative capabilities, and improved generalization both in- and out-of-distribution. Through these advances, we highlight how structured representations can drive more adaptable and explainable AI systems.
The dissertation consists of two parts. In the first part, we focus on the problem of CIC, which aims to generate natural language descriptions of an image conditioned on information provided by end users, e.g., regions, entities, or events of interest. However, available image-language datasets mainly contain captions that describe the entirety of an image, making them ineffective for training CIC models that can potentially attend to any subset of regions or relationships. To tackle this challenge, we propose a novel, fully automatic method to sample additional focused and visually grounded captions using a unified structured semantic representation built on top of the existing set of captions associated with an image. We leverage Abstract Meaning Representation (AMR), a cross-lingual graph-based semantic formalism, to encode all possible spatio-semantic relations between entities, beyond the typical spatial-relations-only focus of current methods.
We use this Structured Semantic Augmentation (SSA) framework to augment existing image-caption datasets with the grounded controlled captions, increasing their spatial and semantic diversity and focal coverage. We then develop a new model, CIC-BART-SSA, specifically tailored for the CIC task, that sources its control signals from SSA-diversified datasets. We empirically show that, compared to state-of-the-art (SOTA) CIC models, CIC-BART-SSA generates captions that are superior in diversity and text quality, are competitive in controllability, and, importantly, minimize the gap between broad and highly focused controlled captioning performance by efficiently generalizing to the challenging highly focused scenarios.
In the second part, we explore the ability of AI models to tackle AVR problems. In particular, we focus on Raven’s Progressive Matrices (RPMs), an established benchmark for examining the ability to perform high-level AVR. Although current algorithms solve this task successfully, they remain locked into solving a fixed puzzle from a curated choice list, whereas humans can generalize beyond a given puzzle and create new puzzles given a set of rules. We propose Generative Visual Puzzles (GenVP), a framework to model the entire RPM generation process, a substantially more challenging task. We focus on capturing the characteristics of objects in a structured representation, along with their relationships with other puzzle objects. Our model’s capability spans from generating multiple solutions for one specific problem prompt to creating complete new puzzles from a desired set of rules. Experiments on five different datasets indicate that GenVP achieves SOTA performance both in puzzle-solving accuracy and out-of-distribution (OOD) generalization in 22 OOD scenarios. Compared to SOTA generative approaches, which struggle to solve RPMs when the feasible solution space increases, GenVP efficiently generalizes to these challenging setups.
Moreover, our model demonstrates the ability to produce a wide range of complete RPMs given a set of abstract rules by effectively capturing the relationships between abstract rules and visual object properties.
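To make the idea of producing a complete RPM from a set of abstract rules concrete, the following is a minimal symbolic sketch in Python. It is not the GenVP model, which learns structured representations from raw pixels; it only illustrates how row-wise rules can determine object attributes in a 3x3 puzzle. The attribute names, value ranges, rule names, and all function names are hypothetical choices made for illustration.

```python
import random

# Hypothetical attribute vocabulary; real RPM datasets define their own.
ATTRIBUTE_RANGES = {"size": range(1, 7), "count": range(1, 10)}


def make_row(rule, start):
    """Return three attribute values that follow `rule` across one row."""
    if rule == "constant":
        return [start, start, start]
    if rule == "progression":
        return [start, start + 1, start + 2]
    raise ValueError(f"unknown rule: {rule}")


def generate_puzzle(rules, rng):
    """Build a 3x3 symbolic puzzle; each cell maps attribute name -> value.

    `rules` maps each attribute to the rule it obeys along every row.
    """
    puzzle = [[{} for _ in range(3)] for _ in range(3)]
    for attr, rule in rules.items():
        values = ATTRIBUTE_RANGES[attr]
        # Keep the start low enough that a progression stays in range.
        max_start = values[-1] - 2 if rule == "progression" else values[-1]
        for r in range(3):
            start = rng.randint(values[0], max_start)
            for c, value in enumerate(make_row(rule, start)):
                puzzle[r][c][attr] = value
    return puzzle


def row_follows(rule, vals):
    """Check whether three values in a row obey `rule`."""
    if rule == "constant":
        return vals[0] == vals[1] == vals[2]
    if rule == "progression":
        return vals[1] - vals[0] == vals[2] - vals[1] == 1
    raise ValueError(f"unknown rule: {rule}")


def satisfies(puzzle, rules):
    """True if every row of the puzzle obeys every attribute rule."""
    return all(
        row_follows(rule, [cell[attr] for cell in row])
        for attr, rule in rules.items()
        for row in puzzle
    )
```

Under these assumptions, calling `generate_puzzle({"size": "progression", "count": "constant"}, random.Random(0))` yields a full puzzle whose bottom-right cell can be hidden to form a problem prompt, and `satisfies` can then score any candidate completion. Different random seeds give different valid puzzles for the same rule set.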
Kalliopi Basioti studied this question.