Modern cloud and enterprise data centers rely on large numbers of interdependent configuration variables that shape system performance, cost, reliability, and fault recovery. As these environments become increasingly dynamic, distributed, and service-oriented, managing configurations manually becomes both inefficient and error-prone. Frequent updates, the limited availability of labeled data for configuration-performance modeling, and the growing complexity of cross-service dependencies make it difficult to optimize system behavior and diagnose failures quickly. This dissertation addresses these challenges through a set of domain-knowledge-assisted methods for automated configuration and misconfiguration diagnosis in data centers.
The first part of the dissertation focuses on dynamic configuration optimization in agile software environments. It introduces a stochastic optimization framework that enhances Simulated Annealing with domain knowledge to prioritize the most influential configuration variables and explore the search space more effectively. This approach is designed to improve configuration quality while reducing search cost and convergence time in rapidly changing cloud settings.
The second part addresses the problem of sparse labeled data for configuration-performance modeling. It presents a domain-knowledge-assisted semi-supervised learning framework that uses configuration health measures to identify new data points likely to improve prediction accuracy. It also develops an Intelligent Configuration Space Coverage methodology for cases where a limited number of additional experiments is feasible: the methodology determines where in the configuration space those experiments should be placed to gain the maximum benefit from the additional data collection. These methods improve prediction quality by expanding useful coverage of the configuration space without relying on exhaustive experimentation.
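As a rough illustration of the knowledge-guided search described in the first part, the sketch below runs a plain Simulated Annealing loop in which the variable to perturb is sampled in proportion to an influence weight. It is only a minimal sketch under stated assumptions, not the dissertation's framework: the function `anneal`, the `weights` table standing in for domain knowledge, the toy cost function, and all variable names and values are hypothetical.

```python
import math
import random


def anneal(cost, start, domains, weights, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated Annealing over discrete configuration variables.

    `weights` is a hypothetical stand-in for domain knowledge: variables
    with a larger weight are perturbed more often.
    """
    rng = random.Random(seed)
    names = list(domains)
    w = [weights[n] for n in names]
    current = dict(start)
    current_cost = cost(current)
    best, best_cost = dict(current), current_cost
    temp = t0
    for _ in range(steps):
        # Pick one variable, favouring those believed to be influential.
        name = rng.choices(names, weights=w, k=1)[0]
        candidate = dict(current)
        candidate[name] = rng.choice(domains[name])
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        # Standard Metropolis acceptance rule.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
        temp *= cooling  # geometric cooling schedule
    return best, best_cost


# Toy setup (all names and values are illustrative assumptions).
domains = {"threads": [1, 2, 4, 8, 16], "cache_mb": [64, 128, 256, 512], "log_level": [0, 1, 2]}
weights = {"threads": 5.0, "cache_mb": 3.0, "log_level": 0.5}
start = {"threads": 1, "cache_mb": 64, "log_level": 2}


def toy_cost(cfg):
    # Lower is better; a synthetic cost, not a measured one.
    return abs(cfg["threads"] - 8) + abs(cfg["cache_mb"] - 256) / 64 + 0.1 * cfg["log_level"]


best, best_cost = anneal(toy_cost, start, domains, weights)
```

Weighting the choice of variable, rather than the acceptance rule, leaves the annealing dynamics unchanged while spending more of the evaluation budget on the variables assumed to matter most.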
The final part of the dissertation develops automated root-cause analysis methods for service and network misconfigurations in enterprise and geographically distributed systems. At the core of this contribution is ConfExp, a confidence-driven, test-based diagnosis framework that uses a sequential testing procedure to isolate the faulty configuration variables behind a failure. The work further extends this framework with distributed diagnosis support, AI-guided trouble ticket generation, and retrieval-augmented generation (RAG) mechanisms that use prior incidents and service dependencies to improve diagnostic efficiency while preserving test-driven validation. Together, these contributions advance the design of adaptive, scalable, and data-efficient configuration management systems that support both performance optimization and reliable fault diagnosis in modern data center environments.
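To give a sense of what a confidence-driven sequential test can look like, the sketch below applies Wald's sequential probability ratio test (SPRT) to a single suspect variable. This is a generic textbook technique swapped in for illustration and is not claimed to be the procedure ConfExp uses. It assumes each trial resets the suspect variable to a known-good value and records whether the service test then passes; the pass probabilities `p0` and `p1`, the error rates `alpha` and `beta`, and the function name `sprt` are all hypothetical.

```python
import math


def sprt(observations, p0=0.3, p1=0.9, alpha=0.05, beta=0.05):
    """Wald's SPRT on pass/fail test outcomes for one suspect variable.

    H1: the variable is the root cause (tests pass with probability p1
        once it is reset to a known-good value).
    H0: it is not (tests pass only with probability p0).
    All parameter values are illustrative assumptions.
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    llr = 0.0
    n = 0
    for n, passed in enumerate(observations, start=1):
        # Accumulate the log-likelihood ratio of H1 against H0.
        llr += math.log(p1 / p0) if passed else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "root cause", n
        if llr <= lower:
            return "not root cause", n
    return "undecided", n


# Hypothetical outcomes of repeated tests after resetting one variable.
verdict, tests_used = sprt([True, True, True, True])
```

The appeal of a sequential test in this setting is that it stops as soon as the evidence is strong enough, so clearly innocent or clearly faulty variables consume only a few test runs.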
Negar Mohammadi Koushki studied this question.