What question did this study set out to answer?

The study aims to evaluate the reasoning quality of large language models in urban infrastructure decision-making while assessing governance risks.

May 16, 2026 | Open Access

Governance risks of AI reasoning in urban infrastructure through Delphi audit of human and large language model judgment

Key Points

The study aims to evaluate the reasoning quality of large language models in urban infrastructure decision-making while assessing governance risks.
The study conducted a sociotechnical audit of six commercial large language models using a Delphi-derived rubric.
Twenty infrastructure professionals were engaged to derive the expert reasoning criteria.
The analysis examined how well AI responses aligned with expert judgment across scenarios of varying complexity.
Across all models, 51.3% of the sources cited by LLMs were unverifiable or fabricated, indicating significant reliability issues.
LLM self-reported confidence was negatively correlated with actual reasoning quality (r = -0.23): the lowest-performing models projected the greatest certainty.
Alignment with expert judgment degraded as scenario complexity increased, with strong agreement on operational triage but near-complete divergence on strategic capital allocation.

Abstract

Cities are increasingly considering large language models (LLMs) to support smart city operations and infrastructure decision-making. While these tools promise efficiency, their use in public institutions raises concerns about accountability, reliability, and institutional risk. This study presents a sociotechnical audit of six commercial LLMs by comparing their reasoning with a Delphi-derived rubric constructed from the responses of twenty infrastructure professionals. The Delphi process elicited and refined expert reasoning criteria, producing a rubric that emphasized public safety, regulatory compliance, contextual judgment, financial stewardship, and system reliability. Results show that LLMs often generate responses with the structural clarity associated with early-career engineers, yet they display persistent weaknesses in factual grounding and contextual interpretation. Across all models, 51.3% of cited sources were unverifiable or fabricated, and LLM self-reported confidence was negatively correlated with actual reasoning quality (r = -0.23), meaning the lowest-performing models projected the greatest certainty. Decision alignment with expert judgment degraded as scenario complexity increased, with strong agreement on operational triage but near-complete divergence on strategic capital allocation. Many responses misinterpreted regulatory requirements or relied on shallow justification. These failures extend beyond technical accuracy and introduce risks for governance, fiscal responsibility, and regulatory compliance. Methodologically, the study demonstrates how expert reasoning can be operationalized as a benchmark for evaluating AI systems in urban infrastructure contexts, addressing gaps in empirical assessment and governance tools. 
The findings carry direct implications for accountability, institutional integrity, and public trust in urban governance, and contribute to ongoing discourse on responsible AI adoption in cities aligned with global sustainability priorities.
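
As an illustration of the confidence finding reported above, the following minimal sketch shows how a Pearson correlation between self-reported confidence and rubric-based reasoning scores could be computed. The function name `pearson_r`, the variable names, and all numeric values are hypothetical placeholders chosen for illustration; they are not the study's data or its analysis code, and the abstract does not specify how the reported coefficient was calculated beyond the value r = -0.23.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT data from the study:
# one self-reported confidence and one rubric score per response.
confidence = [0.95, 0.90, 0.80, 0.70, 0.60, 0.85]
rubric_score = [2.0, 3.5, 3.0, 4.0, 3.5, 2.5]

r = pearson_r(confidence, rubric_score)
print(f"r = {r:.2f}")
```

A negative value of `r` indicates that responses stating higher confidence tend to receive lower rubric scores, which is the direction of the relationship the abstract describes.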
