Series: K11tech Agentic AI QA System -- Paper 2
Extends: Jadhav, K. (2026). Autonomous CI/CD Quality Assurance Using LangGraph Multi-Agent Orchestration and Risk-Proportionate Human-in-the-Loop Control. Zenodo. doi: 10.5281/zenodo.20543872

---

OVERVIEW

Static quality gates in continuous integration pipelines suffer from three compounding limitations: single-model risk assessment provides no uncertainty signal, fixed escalation thresholds accumulate miscalibration over time, and detected defects require manual developer remediation. This paper extends the K11tech Agentic AI QA System -- a LangGraph-orchestrated 14-agent pipeline that achieved 91.2% defect detection (F1 = 0.913) and 87% execution time reduction in Paper 1 -- with three interlocking innovations that close a detect-fix-learn feedback loop and transform the static quality gate into a continuously self-improving system.

---

INNOVATIONS

1. Multi-LLM Consensus Gate (Epistemic Uncertainty as a Safety Signal)

Replaces the single-model score_risk node with a parallel fan-out to GPT-4o, Claude Sonnet 4.6, and Gemini 1.5 Pro. When models disagree -- measured by risk-bucket placement and score standard deviation -- the gate forces Human-in-the-Loop (HITL) escalation independently of the numeric risk score. Three configurable agreement modes (unanimous, majority, weighted) allow teams to tune uncertainty sensitivity for their risk tolerance. Majority consensus reduces the false-positive HITL rate from 8.4% to 6.8% and the false-negative rate from 8.8% to 7.2% versus the single-model baseline. Critically, 71% of consensus-forced escalations were independently assessed by reviewers as genuinely ambiguous -- confirming that model disagreement is a meaningful uncertainty signal, not noise.

2. AutoRemediationAgent -- Confidence-Gated Patch Generation (Phase 2.5)

A 15th pipeline agent, executing after Phase 2 test agents complete and before Phase 3 reporting, that automatically generates and opens remediation pull requests for five safe, deterministic defect classes: accessibility violations, missing docstrings, unused imports, missing type hints, and test coverage gaps. Patches are generated as unified diffs with LLM self-assessed confidence; patches below the 0.80 confidence threshold are skipped and routed to Jira. High-confidence patches are committed to a dedicated branch via the GitHub MCP server. Remediation PRs are never auto-merged -- human approval is always required. Across 120 evaluated pull requests: 85.7% of safe-class defects produced patches above the confidence threshold; reviewer acceptance rate was 89.3% (Cohen's kappa = 0.87). Eliminates post-detection developer toil for the majority of deterministic defects.

3. Adaptive HITL Threshold Learner -- Online Calibration from Reviewer Decisions

Replaces the fixed risk_score >= 0.85 escalation threshold with a per-repository exponential moving average (EMA) over the reviewer's implied preferred threshold, bounded within safety constraints [0.70, 0.95]. A Bayesian correction term jointly adapts the consensus variance threshold from reviewer feedback -- tightening it when the consensus gate passes PRs a reviewer would have blocked, and relaxing it when the gate unnecessarily escalates approved PRs. Simulation over 500 synthetic reviewer decisions demonstrates convergence to within MAE = 0.018 of the true preferred threshold within 68-81 decisions. Reduces unnecessary HITL activations by 34% versus the fixed-threshold baseline. Escape rate remains 0% across all scenarios, including distribution shift.

---

COMPOSITION SAFETY

The three mechanisms are designed to interlock safely. The consensus gate produces the uncertainty signal the adaptive learner uses to weight its variance threshold updates.
The AutoRemediationAgent operates independently and does not influence threshold calibration -- remediation PRs do not generate new HITL decisions. End-to-end composition across 120 PRs confirmed threshold stability (std = 0.031), no safety bound violations, and zero production escapes.

---

KEY RESULTS

- Consensus majority mode: F1 = 0.927 vs 0.913 baseline; false-positive rate -1.6pp; false-negative rate -1.6pp
- Adaptive learner: convergence within 80 decisions; 34% reduction in unnecessary HITL activations; 0% escape rate across all scenarios including distribution shift
- AutoRemediationAgent: 89.3% patch acceptance rate; 100% acceptance for unused import and missing type hint classes; Cohen's kappa = 0.87
- Composition: threshold std = 0.031 over 120 runs; zero safety bound violations

---

IMPLEMENTATION

Open-source implementation extending the K11tech Agentic AI QA System: https://github.com/K11-Software-Solutions/k11techlab-agentic-ai-qa-system

New modules: pipeline/consensus.py (Consensus Gate), pipeline/adaptive_threshold.py (Threshold Learner), AutoRemediationAgent as a conditional Phase 2.5 node.

Stack additions: langchain-anthropic, langchain-google-genai, GPT-4o patch generation, GitHub MCP PR creation, SQLite/PostgreSQL threshold persistence.

---

RELATED WORK IN THIS SERIES

Paper 1 -- Jadhav, K. (2026). Autonomous CI/CD Quality Assurance Using LangGraph Multi-Agent Orchestration and Risk-Proportionate Human-in-the-Loop Control. doi: 10.5281/zenodo.20543872

Paper 3 (forthcoming) -- Cross-Repository Dependency Analysis: system-level impact assessment from a single PR trigger via Knowledge Store MCP inter-service contract capture.

---

(c) 2026 Kavita Jadhav, K11 Software Solutions LLC. All rights reserved.
Contact: kavita.jadhav@k11softwaresolutions.com
GitHub: https://github.com/K11-Software-Solutions
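
ILLUSTRATIVE SKETCH: CONSENSUS GATE

The Multi-LLM Consensus Gate described under INNOVATIONS can be illustrated with a minimal sketch. It is not the code in pipeline/consensus.py. The bucket edges, the max_std spread limit, the model names used as dictionary keys, and the function names are all assumptions made for illustration; the source only states that disagreement is measured by risk-bucket placement and score standard deviation, and that the modes are unanimous, majority, and weighted.

```python
from collections import Counter
from statistics import pstdev

# Hypothetical bucket lower edges; the source does not state the real ones.
BUCKETS = ((0.0, "low"), (0.5, "medium"), (0.85, "high"))


def bucket(score):
    """Map a risk score in [0, 1] to a coarse risk bucket."""
    label = BUCKETS[0][1]
    for lower_edge, name in BUCKETS:
        if score >= lower_edge:
            label = name
    return label


def consensus_gate(scores, mode="majority", weights=None, max_std=0.15):
    """Return (escalate_to_hitl, reason) for per-model risk scores.

    scores maps a model name to its risk score. max_std is an assumed
    limit on the population standard deviation of the scores.
    """
    labels = {model: bucket(s) for model, s in scores.items()}
    spread = pstdev(list(scores.values()))

    if mode == "unanimous":
        agreed = len(set(labels.values())) == 1
    elif mode == "majority":
        top_votes = Counter(labels.values()).most_common(1)[0][1]
        agreed = top_votes > len(labels) / 2
    elif mode == "weighted":
        weights = weights or {model: 1.0 for model in scores}
        totals = Counter()
        for model, label in labels.items():
            totals[label] += weights[model]
        agreed = max(totals.values()) > sum(weights[m] for m in labels) / 2
    else:
        raise ValueError(f"unknown agreement mode: {mode}")

    # Disagreement forces escalation regardless of the numeric risk score.
    if not agreed:
        return True, "bucket disagreement"
    if spread > max_std:
        return True, "score spread"
    return False, "consensus"
```

In this sketch the gate only answers whether model disagreement alone should force escalation; the numeric risk threshold would be checked separately.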
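
ILLUSTRATIVE SKETCH: CONFIDENCE-GATED PATCH ROUTING

The routing rule of the AutoRemediationAgent can be sketched as below, assuming a simple in-memory Patch record. The class and function names, the string labels for defect classes and routes, and the treatment of a confidence of exactly 0.80 as passing are assumptions; the handling of defects outside the five safe classes is not described in the source and is marked as such in the code. The five classes, the 0.80 threshold, the Jira fallback, and the rule that PRs are never auto-merged come from the text.

```python
from dataclasses import dataclass

# The five safe, deterministic defect classes named in the text
# (string labels are hypothetical).
SAFE_CLASSES = frozenset({
    "accessibility_violation",
    "missing_docstring",
    "unused_import",
    "missing_type_hint",
    "test_coverage_gap",
})
CONFIDENCE_THRESHOLD = 0.80  # from the text


@dataclass(frozen=True)
class Patch:
    defect_class: str
    confidence: float  # LLM self-assessed confidence in [0, 1]
    diff: str          # unified diff text


def route_patch(patch):
    """Decide where a generated patch goes; nothing is ever auto-merged."""
    if patch.defect_class not in SAFE_CLASSES:
        # Assumption: out-of-scope defects are left to the normal flow.
        return "not_eligible"
    if patch.confidence < CONFIDENCE_THRESHOLD:
        return "jira_ticket"
    # High-confidence patch: open a PR that still needs human approval.
    return "open_pr_for_review"
```

For example, route_patch(Patch("unused_import", 0.93, "...")) selects the pull-request path, while the same defect at confidence 0.55 would be routed to Jira.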
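
ILLUSTRATIVE SKETCH: BOUNDED EMA THRESHOLD UPDATE

The bounded EMA update at the core of the Adaptive HITL Threshold Learner can be sketched as follows. The smoothing factor alpha and the function names are assumptions, and the sketch takes the reviewer's implied preferred threshold as a given input because the source does not say how it is derived. The Bayesian correction of the consensus variance threshold is not modelled here. Only the [0.70, 0.95] safety bounds and the 0.85 starting point come from the text.

```python
LOWER_BOUND, UPPER_BOUND = 0.70, 0.95  # safety constraints from the text
INITIAL_THRESHOLD = 0.85               # the fixed threshold being replaced


def ema_update(current, implied, alpha=0.1):
    """One EMA step toward the reviewer's implied threshold, then clamp.

    alpha is a hypothetical smoothing factor; larger values adapt faster
    but react more strongly to a single reviewer decision.
    """
    blended = (1 - alpha) * current + alpha * implied
    return min(UPPER_BOUND, max(LOWER_BOUND, blended))


def replay(implied_thresholds, start=INITIAL_THRESHOLD, alpha=0.1):
    """Apply ema_update over a sequence of implied thresholds."""
    threshold = start
    for implied in implied_thresholds:
        threshold = ema_update(threshold, implied, alpha)
    return threshold
```

Clamping after every step means no sequence of reviewer decisions can move the threshold outside the safety bounds, which is the property the composition results rely on.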