What does this research mean for the field?

The value alignment problem in artificial intelligence is better understood as a set of context-driven issues arising from the dynamics of human delegation to AI systems, rather than a single intractable problem.

What question did this study set out to answer?

The work aims to redefine the value alignment problem in artificial intelligence, emphasizing its context-dependency and complexity.

March 14, 2026

Artificial Intelligence and the Value Alignment Problem: A Philosophical Introduction. T. LaCroix, 2025. Peterborough, Broadview Press. 354 pp, £32.95 (pb)

Key Points

The work aims to redefine the value alignment problem in artificial intelligence, emphasizing its context-dependency and complexity.
Detailed examination of AI ethics and its implications for marginalized groups.
Analysis of case studies including credit scoring and predictive policing.
Introduction of a structural definition of the value alignment problem through a principal-agent framework.
Proposes a new framework for understanding value alignment as a set of context-driven problems.
Explains the implications of information asymmetry in AI decision-making processes.
Demonstrates how misaligned objectives can lead to ethical issues in AI applications.

Abstract

This timely book provides a detailed examination of ethics in artificial intelligence and is sure to become a valuable reference for educators and researchers looking for a fundamental introduction to the topic. Rather than focusing on distant existential threats of artificial intelligence, the work is grounded in present-day AI systems which are already generating significant harm to vulnerable or marginalised groups. This book does not shy away from explaining the technical background of AI, and it covers more than generative forms such as ChatGPT (though these are mentioned). The numerous case studies throughout this book detail diverse AI applications which are already having an impact on society, such as credit scoring models, predictive policing, hiring decisions, retail pricing, and healthcare. This overcomes one issue with many mainstream discussions of AI that treat the topic as one monolithic whole, when it is in fact a diverse field with many different applications. The author notes early on that the book was derived from lecture and course material; as such it falls somewhere in between a teaching aid and an applied work of philosophy. Many sections can be directly put into a course on AI ethics or ethics of technology more broadly. While this book has utility as a teaching aid, it also serves as a standalone philosophical work offering novel insights into the AI value alignment problem. Early chapters give detailed and technical insights into the history and current workings of AI and might specifically appeal to those on a computer science course on machine learning and other related areas. Section 1 (Chapters 1, 2, and 3) provides a history of AI in order to contextualise the current state of the art. Chapter 1 gives a detailed history of AI development over the past 70 or so years. This details the waves of so-called AI summers and winters: shifts from AI hype to disappointment in the limitations of the technology.
It raises the question of whether today's AI hype cycle is similarly set up for an oncoming winter or if there is something truly different about what we are currently experiencing. Chapter 2 goes into the more technical details of AI as it stands today. As the author writes, this is a book about real-world AI problems, and therefore the technical side of things should be understood by those who want to understand this technology. For the less technically minded it might be tempting to skip this part, but there are worthwhile insights for those in the field of AI ethics. The ideas are clearly explained and someone more mathematically inclined than this reviewer would likely get a lot out of it. After laying the groundwork for the topic, Chapter 3 comes to the novel philosophical strength of the work: a ‘reconceptualisation’ of the value alignment problem. To understand the original philosophical contribution of the work to the value alignment problem, we have to have an idea of what this problem is. The author describes the ‘standard definition’ of the value alignment problem as the problem of ensuring that AI systems align with the values of humanity (p. 69). However, on this definition further problems arise of how values can be encoded into an AI system (technical), and what these values should be (normative). LaCroix argues that this standard definition has limited utility. The ambiguity in defining the ‘values of humanity’ does little to help those designing AI systems and those tasked with making them safe. Chapter 3 proposes a new ‘structural definition’ of the value alignment problem. This definition sees value alignment not as one intractable problem, but as a set of problems which arise from the dynamics at play when humans employ artificial agents to act on their behalf. In doing so it argues that the alignment problem is better seen as a set of context-driven problems, rather than one problem that can be ‘solved’.
Specifically, these alignment problems occur when humans delegate decisions to AI systems. The aim is to see under what circumstances misalignment generally can occur, what the ‘structure of the problem’ is, and specifically in what contexts it is a problem (p. 75). To explain this definition, LaCroix draws an analogy from the ‘principal–agent’ framework in economics: where a principal appoints an agent to act on their behalf. For example, shareholders (principals) appoint a CEO (agent) to make decisions on their behalf; or citizens (principals) elect party officials (agents) to do the same. LaCroix adapts this framework, using cases where a human (principal) delegates decisions to an AI system (agent). To quote the definition given, alignment problems arise ‘from the dynamics of multi-agent interactions involving the delegation of tasks from one actor (a human principal) to another (an AI agent)’ (p. 82). The book points to two ‘axes’ on which this can happen: (a) misaligned objectives, and (b) information asymmetries between the human (principal) and the AI system (agent). The core philosophical work of the book follows from a description of these axes. First, Chapter 4 explains how agents can have objectives or incentives that differ from (or are misaligned with) the true objectives of the principal. Consider citizens who vote for a party based on certain promises, only to have these promises broken in favour of decisions that were of benefit to the party or some other agenda. Similarly, an AI agent might make a decision that contradicts the true objective of the human principal. The real-world objectives we want AI models to tackle can be highly complex and messy, but this needs to be translated into a simplified representation or goal so an algorithm can work. Chapter 4 illustrates that the goals given to AI models are actually proxies for our real objectives. Think of something such as AI in hiring. 
The human's true objective might be ‘hire the best person possible’. In order to delegate this task to an AI, the human has to come up with criteria by which this is measured. For example, use all the CVs of past successful candidates, and look for commonalities. However, if past hiring processes showed bias toward only hiring certain demographics, this will be re-inscribed in the AI system's future decisions. LaCroix rightly points out that AI systems do not have inherent ‘values’; rather all goals or objectives are programmed. Misaligned objectives occur because it is difficult (if not impossible) to specify an objective completely correctly for an AI system to follow. Rather, humans give AI systems objectives that are proxies for their true objectives. The author clarifies that objectives here are situation-dependent; as such it is possible to look at individual cases of misalignment on the ‘objectives’ axis. The discussion of algorithmic bias in this chapter is comprehensive; it does not just look at the issue from one angle but follows different ways in which algorithmic bias can manifest in AI. Chapter 4 is of particular use for an introductory course on AI ethics and algorithmic bias. It not only looks at the technical aspect of AI bias, but also how this impacts society more broadly. For example, the AI system COMPAS is given as a case study to illustrate bias and fairness in algorithmic decision-making (pp. 105–7). The system uses statistics and algorithms to predict recidivism risks in criminal cases. The model provides a risk score (from 1 to 10) to predict whether an inmate, if released, would commit a violent or non-violent crime within one to three years of being released. Of particular interest is not the accuracy of the model, which seems to be similar for both black and white offenders, but the difference in what the model gets wrong. 
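The gap between overall accuracy and the type of error made can be illustrated with a small sketch. The counts below are invented for illustration and are not COMPAS data; the class and field names are likewise hypothetical. It shows two groups for which a risk model has identical accuracy while its false positive and false negative rates diverge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for one group. All numbers used below are invented."""
    true_pos: int   # predicted to re-offend, did re-offend
    false_pos: int  # predicted to re-offend, did not re-offend
    true_neg: int   # predicted not to re-offend, did not re-offend
    false_neg: int  # predicted not to re-offend, did re-offend

    def accuracy(self) -> float:
        total = self.true_pos + self.false_pos + self.true_neg + self.false_neg
        return (self.true_pos + self.true_neg) / total

    def false_positive_rate(self) -> float:
        # Share of people who did not re-offend but were flagged as high risk.
        return self.false_pos / (self.false_pos + self.true_neg)

    def false_negative_rate(self) -> float:
        # Share of people who did re-offend but were scored as low risk.
        return self.false_neg / (self.false_neg + self.true_pos)


# Hypothetical groups: same accuracy, opposite error profiles.
group_a = ConfusionCounts(true_pos=400, false_pos=300, true_neg=200, false_neg=100)
group_b = ConfusionCounts(true_pos=200, false_pos=100, true_neg=400, false_neg=300)

for name, group in (("group_a", group_a), ("group_b", group_b)):
    print(
        name,
        f"accuracy={group.accuracy():.2f}",
        f"FPR={group.false_positive_rate():.2f}",
        f"FNR={group.false_negative_rate():.2f}",
    )
```

A model judged only on accuracy would look equally good for both hypothetical groups; the disparity only appears once the error rates are reported separately.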
For black offenders there is a higher rate of false positives (predicting they would re-offend when they did not) and for white offenders a higher rate of false negatives (predicting they would not re-offend when they did). This is a prime example of how ‘fairness’ depends on which statistical metrics are chosen and measured in the first place. Misaligned objectives are particularly important with the latest push toward AI agents or agentic AI. These are AI models designed to independently go out and complete tasks for their user; for example, finding a recipe for authentic Japanese food and ordering the ingredients online. This type of delegation is precisely captured by the principal–agent framework. However, it can lead to problems if the objectives are not sufficiently specified. To fulfil its task, the AI agent may order ingredients directly from Japan, costing the user hundreds of dollars. While frustrating for the user, this is not something we would normally consider to be a value alignment problem. However, on LaCroix's account it appears it would be. The result of LaCroix's definition is to make value alignment highly context-dependent, rather than one intractable problem. This definition is more specific than the ‘standard’ definition of value alignment but also widens the number of cases that could be considered as value alignment problems. The second ‘axis’ of the structural definition (detailed in Chapter 5) is an information asymmetry between a principal and an agent. For example, a CEO (agent) might have more information than their shareholders (principal) when making a decision. Chapter 5 looks at the information asymmetries between a human and AI as a form of misalignment. An AI system might make a decision based on information unknown to the user, or the user might not know what specific piece of information is used to make the decision.
Reading this I am struck by another analogy from economics, taken from George A. Akerlof's seminal paper “The Market for ‘Lemons’” (1970). Akerlof argues that in the used car market there is an inherent information asymmetry between seller and buyer due to the information the seller has which the buyer does not. The seller knows more about the car's history, the previous owner's driving habits, maintenance or any imperfections. This information can be hidden from the buyer's view, yet the buyer has to make a decision based on the information the seller gives. Do they trust the seller? Even if they do, how can they be certain they will be buying a good car (a peach) or a bad one (a lemon)? This is a classic example of information asymmetry in the marketplace. To stretch this analogy, AI becomes the used car sellers (agents) and we humans are the buyers (principals) deciding whether to accept the information given to us. An AI system draws on countless sources of information to provide decisions. In this ‘black box’ style system the exact source or piece of information the decision is made upon can be unknown to us. If we are unsure how the AI makes its decision, what specific source of information it is drawing from, then we cannot be sure of the reliability of its decision any more than we can know if we bought a peach or a lemon. A large dataset used to train an AI might contain skewed or biased information, unbeknownst to those training or using the model. There may be parts of an AI model that are not possible to monitor or verify, an AI model might conduct activities unknown to the user, or the user might not have sufficient information about the AI model's abilities. Each of these equates to an information asymmetry between the AI model and the user. This calls into question the reliability of the decision the human makes based on the AI model's recommendation. On the argument put forward in this book, this information asymmetry is a value alignment problem. 
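The peach-or-lemon reasoning can be sketched numerically. This is a simplified textbook-style reading of the used car example, not Akerlof's full model and not something taken from the book; the function names and all prices are hypothetical. It assumes a risk-neutral buyer who cannot tell cars apart and so offers at most the expected value of a random car.

```python
def max_buyer_offer(share_peaches: float, value_peach: float, value_lemon: float) -> float:
    """Most a risk-neutral buyer will pay when quality is hidden (assumed model)."""
    return share_peaches * value_peach + (1 - share_peaches) * value_lemon


def peaches_stay_on_market(
    share_peaches: float,
    value_peach: float,
    value_lemon: float,
    peach_seller_reserve: float,
) -> bool:
    """Owners of good cars sell only if the pooled offer meets their reserve price."""
    offer = max_buyer_offer(share_peaches, value_peach, value_lemon)
    return offer >= peach_seller_reserve


# Invented numbers: half the cars are peaches, and peach owners want at least 8000.
print(max_buyer_offer(0.5, 10_000, 4_000))                 # pooled offer
print(peaches_stay_on_market(0.5, 10_000, 4_000, 8_000))   # do good cars stay?
print(peaches_stay_on_market(0.8, 10_000, 4_000, 8_000))   # with more peaches
```

With these invented figures the pooled offer falls below what peach owners will accept, so good cars leave the market. On the reviewer's analogy, a user who cannot inspect what an AI system's decision rests on is in a similar position to the buyer.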
When thinking of value alignment, we have traditionally thought of aligning to an abstract set of human values. However, the question quickly arises of whose values we are aligning to. The third ‘axis’ of the structural definition given in this work tackles this issue, focusing on who the principal in the principal–agent framework is (p. 138). In doing so it differentiates between shareholders and stakeholders of AI systems. The shareholders are those involved in the AI system, in some form or another, either through creation or use. They may be those who make, design, sell, and research the models, as well as those who regulate, those who ensure compliance, and the end users. The stakeholders are broader groups who are directly or indirectly affected by an AI system. While these sound similar, where they come apart is relevant for alignment. An example is given of a pedestrian struck and killed by an autonomous vehicle (pp. 141–2). The victim had no say in the autonomous system's creation, nor were they a user of it, yet they were directly affected by it as a stakeholder. Though there are crossovers between shareholders and stakeholders (those who are affected by a model may very well be users of it), making a distinction between those who design and those who are affected by a model is useful. Both parties might have very different goals and objectives. Chapter 6 gives several pertinent examples of the effect of AI models on stakeholders, showing the human costs of AI models, including issues of copyright, privacy, and environmental concerns. Consider the artists who have no say in the creation of an AI image generator and do not use it. These artists are nonetheless profoundly affected by it as stakeholders. The summary of this chapter returns to the point that the ‘standard’ definition of the value alignment problem takes for granted that there is a unifying set of human values. 
A further issue is that the term alignment could be used for ‘ethics washing’ and to justify perpetuating the status quo. However, on the definition put forward by LaCroix, it is not sufficient to use the term alignment, unless we specify for whom the system is aligned (p. 157). Section 3 moves on to AI safety (Chapter 7) and machine ethics (Chapter 8). Chapter 7 examines the technical aspects of AI safety including designs and risk mitigations. While important work is being done in this field, we find that solutions to AI safety are not purely technical. Chapter 8 gives a solid background to normative ethical theory, aimed largely at readers who might not have a formal ethical or philosophical background. There is value here in using AI alignment as a vehicle to explain normative ethical theory, and whether top-down or bottom-up approaches to artificial moral agency should be employed. The usefulness of the author's structural definition is to isolate issues of misalignment, rather than seeing value alignment as one monolithic problem. As such, mitigation strategies are proposed (Chapter 9) based on the ‘axes’ of misalignment. If there is information asymmetry, then increasing transparency in AI systems could be a useful way of ensuring alignment on this axis. Measuring degrees of alignment on the ‘objectives’ axis means measuring the degree to which a ‘proxy’ (AI agent) is aligned with the system's intended goal. However, the more removed these agents are from our initial goal, the more difficult (or impossible) it will be to measure their alignment. Unfortunately, while the structural definition might give new ways to frame the issues AI raises, it does not offer simple solutions. Chapter 10 introduces normativity and language, arguing that attention should be paid to the role that communication plays in aligning values. This tackles a philosophical question of whether AI output from something like ChatGPT has any meaning.
Chapter 11 takes a broader look at what value truly means, comparing epistemic and non-epistemic values. This touches on many key points related to the perception of AI models being value-free. As the chapter points out, there are questions about who determines the objectives, selects the features, and sets the parameters for these algorithms. The argument is made that research and design necessarily means imparting personal, social, cultural, organizational, or political values. This is particularly pertinent with regard to AI and control in the political sphere, something not explicitly covered in this book. We have seen Chinese AI models censor certain information, and the US government rallying against ‘woke’ AI, calling instead for ‘objective’ and unbiased AI. The danger in discussing the alignment problem in broad terms, such as aligning to human values, or unbiased AI, is that those in power are able to set the agenda for what these human values should be. The broader concern is that the language AI uses can be co-opted to suit political agendas. Particularly when language models are now used as epistemic tools giving users information about the world, the way they talk about things matters. The ending of the book calls for regulation of AI. In the global political climate, this is something that may or may not be possible. If there is one convincing argument the book makes, it is that the value alignment problem is fundamentally and inextricably social. Humans ultimately shape the output of the models, either through design or training. This means there are important questions about who determines the objectives, selects the features, and sets the parameters for these algorithms (p. 251). AI models do not appear in a vacuum and thus they are subject to many outside influences, from funding agencies, universities, and those pursuing political agendas. 
A concern with LaCroix's ‘structural definition’, and one that is addressed in the conclusion, is that it allows too many things to be alignment problems. Issues of algorithmic bias and AI black boxes are already widely written about as being harmful problems; does classifying them as alignment problems add anything? LaCroix addresses this concern by arguing that value misalignment is not a single issue but should be seen as a ‘broader category or class of problems’ (p. 264). The utility here seems to be a pragmatic one. It is a call to redirect attention away from superficial talk of ‘the’ value alignment problem or of ‘solving’ it, and toward mitigating problems in individual contexts. LaCroix gives us a structure by which to do this: look at misaligned objectives, information asymmetries, and the stakeholders of the systems. This might mean that addressing value alignment has added complexity, but no one said it would be easy. Any book or paper published on AI currently will need to be seen as a snapshot of the time at which it was published. AI innovations seem to be happening weekly; as such anything published will soon have out-of-date information. One thing about the alignment problem is that it does not seem like it will be solved in the near future. On this topic at least this book will remain relevant for a number of years. Overall, the benefit of the work being based on educational material is that fundamental concepts of AI alignment are well explained to the extent that someone with little background knowledge of the topic can come away with a foundational understanding. At times, the technical and mathematical explanations in the work lend themselves more toward computer science students wanting to understand the deeper philosophical impacts of AI. This is not a bad thing at all.
It is vitally important for those on such to have ethical This book does an of issues of value alignment in AI, not as some distant existential but grounded in the of its impact in the
