This dissertation builds on the growing prominence of artificial intelligence and presents three studies that span methodological and applied research in machine learning and natural language processing within finance. The first research paper (Chapter I) makes a methodological contribution to the machine learning literature, with important implications for financial research, and examines a core component of any supervised learning task: model validation. Model validation refers to the process of estimating a model’s prediction error, that is, assessing how well a trained model is expected to perform on new data that were not used during training. The paper shows that the finance literature contains widespread misconceptions about model validation for multilevel data, i.e., data in which lower-level units (such as loans) are nested within higher-level units (such as countries or years). For such data structures, model validation can be conducted either by sampling individual lower-level units or by sampling entire higher-level units into training and test sets. Drawing on a theoretical framework and a Monte Carlo simulation, the paper shows that an appropriate validation strategy must reflect the typical use case of the prediction model. Selecting an inappropriate validation approach has important implications for both model assessment and model selection: (i) it yields a biased estimate of a model’s prediction error and (ii) it adversely affects model selection by increasing the likelihood of choosing a model that is either too simple or overly complex. A detailed discussion reveals that many empirical finance studies rely on validation strategies that are misaligned with the typical use case of the prediction model, rendering their statements about model assessment and model selection potentially invalid.
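The two sampling schemes for multilevel data described above can be illustrated with a minimal sketch. The function names, the toy loan records, and the 25% test fraction are hypothetical and chosen for illustration only; they are not taken from Chapter I. The sketch shows only the mechanics of the two splits, not the paper's theoretical framework or simulation design.

```python
import random


def unit_level_split(records, test_fraction=0.25, seed=0):
    """Sample individual lower-level units (e.g. loans) into train and test."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def group_level_split(records, group_key, test_fraction=0.25, seed=0):
    """Sample entire higher-level units (e.g. countries) into train and test."""
    rng = random.Random(seed)
    groups = sorted({r[group_key] for r in records})
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


def shared_groups(train, test, group_key):
    """Higher-level units that appear on both sides of a split."""
    return {r[group_key] for r in train} & {r[group_key] for r in test}


# Hypothetical toy data: 4 countries with 5 loans each.
loans = [{"loan_id": f"{c}-{i}", "country": c} for c in "ABCD" for i in range(5)]

unit_train, unit_test = unit_level_split(loans)
group_train, group_test = group_level_split(loans, "country")

# Under the group-level split, no country contributes to both sets.
print(shared_groups(group_train, group_test, "country"))
# Under the unit-level split, countries typically appear in both sets.
print(shared_groups(unit_train, unit_test, "country"))
```

Which split is appropriate depends, as the paper argues, on the typical use case of the prediction model. One plausible reading, stated here as an assumption, is that a model intended for higher-level units absent from the training data would be validated with the group-level split.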
The first paper aims to raise awareness of this often-overlooked issue and to provide clear guidance on how to define an appropriate validation strategy. While the first research paper provides a methodological contribution at the intersection of machine learning and finance, the second paper (Chapter II) is applied work in this domain. In particular, it examines what is arguably the most common supervised learning task in finance: credit risk modeling. Numerous studies have documented that machine learning (ML) can drastically improve credit risk screening (e.g., Bastos 2010; Qi and Zhao 2011; Altman and Kalotay 2014; Kalotay and Altman 2017; Yao et al. 2017; Nagl et al. 2025). Yet the adoption of ML in banking practice remains limited, with simpler linear models still predominating (European Banking Authority 2023). The paper argues that this may be due to a new challenge that highly flexible ML models create in forecasting tasks: model multiplicity, where models that are equally accurate at the aggregate level produce divergent individual-level predictions (predictive multiplicity) or differ in their decision surface (procedural multiplicity). The paper is the first in finance to document both facets of model multiplicity in an applied regression context and proposes heterogeneous ML ensembles as a natural solution. Such ensembles ensure that individual borrowers are not exposed to arbitrary fluctuations in any single ML model’s predictions, thereby mitigating issues related to predictive multiplicity, and they reduce procedural multiplicity by yielding more stable and robust estimates of the features that improve out-of-sample performance. The third paper (Chapter III) investigates whether firm-specific media coverage prior to an earnings conference call causally affects managers’ incentives to obfuscate in their communication during the call.
Using transcripts of earnings conference calls, the paper applies a rule-based NLP method to quantify obfuscation through the linguistic complexity of managerial speech. To establish causality, the study exploits restructuring events at the Wall Street Journal within a stacked difference-in-differences (DID) framework and documents that increased media coverage reduces managerial obfuscation. To explore effect heterogeneity, the paper then pursues two complementary analyses. First, it measures text similarity between call transcripts and news articles by computing the cosine similarity of TF–IDF representations, a traditional statistical NLP approach, to test whether the effect varies with the editorial content of the media coverage. Second, it proxies for analyst monitoring by quantifying, for instance, the number of questions asked by financial analysts during the call, another rule-based NLP application, to assess whether the impact of media coverage varies with the strength of other corporate governance mechanisms. The paper finds that the disciplining role of the news media is concentrated in (i) articles with substantial editorial content and (ii) firms that receive limited attention from alternative corporate governance channels.
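The text-similarity measure mentioned above, cosine similarity of TF–IDF representations, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in Chapter III: the tokenizer, the smoothed IDF weighting, and the three toy texts are all hypothetical choices, since the abstract does not specify them.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters (assumed tokenizer)."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document.

    Assumption: raw term counts weighted by a smoothed IDF,
    idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy texts standing in for a call transcript and two news articles.
call = "Revenue grew and margins improved this quarter"
article_a = "Quarterly revenue grew while margins improved"
article_b = "The company opened a new office downtown"

vec_call, vec_a, vec_b = tfidf_vectors([call, article_a, article_b])
sim_a = cosine_similarity(vec_call, vec_a)
sim_b = cosine_similarity(vec_call, vec_b)
print(sim_a > sim_b)
```

In this toy corpus the first article shares several terms with the call and the second shares none, so the first receives the higher similarity score. In practice the IDF weights would be estimated on a much larger corpus.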
Noah Urban (Fri,) studied this question.