Cities are increasingly considering large language models (LLMs) to support smart city operations and infrastructure decision-making. While these tools promise efficiency, their use in public institutions raises concerns about accountability, reliability, and institutional risk. This study presents a sociotechnical audit of six commercial LLMs by comparing their reasoning with a Delphi-derived rubric constructed from the responses of twenty infrastructure professionals. The Delphi process elicited and refined expert reasoning criteria, producing a rubric that emphasized public safety, regulatory compliance, contextual judgment, financial stewardship, and system reliability. Results show that LLMs often generate responses with the structural clarity associated with early-career engineers, yet they display persistent weaknesses in factual grounding and contextual interpretation. Across all models, 51.3% of cited sources were unverifiable or fabricated, and LLM self-reported confidence was negatively correlated with actual reasoning quality (r = -0.23), indicating that lower-performing models tended to project greater certainty. Decision alignment with expert judgment degraded as scenario complexity increased, with strong agreement on operational triage but near-complete divergence on strategic capital allocation. Many responses misinterpreted regulatory requirements or relied on shallow justification. These failures extend beyond technical accuracy and introduce risks for governance, fiscal responsibility, and regulatory compliance. Methodologically, the study demonstrates how expert reasoning can be operationalized as a benchmark for evaluating AI systems in urban infrastructure contexts, addressing gaps in empirical assessment and governance tools.
The findings carry direct implications for accountability, institutional integrity, and public trust in urban governance, and contribute to ongoing discourse on responsible AI adoption in cities aligned with global sustainability priorities.
Poudel et al. studied this question.