What does this research mean for the field?

Integrating multi-LLM consensus, adaptive human-in-the-loop thresholds, and automated remediation into an agentic CI/CD pipeline improves defect detection (F1 of 0.927 versus 0.913 for the single-model baseline), reduces unnecessary human escalations by 34%, and achieves an 89.3% patch acceptance rate for deterministic defects. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

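This study asks whether quality assurance in CI/CD pipelines can be improved by adding three AI mechanisms (a multi-LLM consensus gate, automated remediation, and an adaptive escalation threshold) that together close the detect-fix-learn loop.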

June 11, 2026. Open Access.

Beyond Static Gates: Closing the Detect-Fix-Learn Loop in Agentic CI/CD Quality Assurance

Key Points

This study aims to enhance quality assurance in CI/CD pipelines by integrating innovative AI mechanisms that close the detect-fix-learn loop.
Developed a multi-agent pipeline incorporating a consensus gate for uncertainty assessment.
Introduced an AutoRemediationAgent that generates patches for specific defect classes.
Implemented an adaptive threshold learner that continuously calibrates escalation thresholds based on reviewer feedback.
Majority consensus mode improved F1 score to 0.927 (previously 0.913) with reduced false-positive and false-negative rates.
Adaptive learner achieved a 34% reduction in unnecessary human-in-the-loop activations, maintaining a 0% escape rate.
The AutoRemediationAgent demonstrated an acceptance rate of 89.3% across patch proposals.
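
The consensus gate in the key points above can be illustrated with a minimal sketch. It is not the paper's implementation: the bucket boundaries (0.4 and 0.7), the standard-deviation limit `max_std`, the model keys, and the function names `bucket` and `consensus_gate` are all assumptions made for illustration. Only the three agreement modes and the two disagreement measures (risk-bucket placement and score standard deviation) come from the source.

```python
from collections import defaultdict
from statistics import pstdev


def bucket(score: float) -> str:
    """Map a risk score in [0, 1] to a coarse bucket (boundaries are assumed)."""
    if score < 0.4:
        return "low"
    if score < 0.7:
        return "medium"
    return "high"


def consensus_gate(scores, mode="majority", max_std=0.15, weights=None):
    """Decide whether model disagreement alone should force HITL escalation.

    scores  -- mapping of model name to risk score
    mode    -- "unanimous", "majority", or "weighted"
    max_std -- assumed limit on the population standard deviation of scores
    weights -- per-model weights, used only in "weighted" mode
    """
    if mode not in ("unanimous", "majority", "weighted"):
        raise ValueError(f"unknown mode: {mode}")
    if mode != "weighted" or weights is None:
        weights = {model: 1.0 for model in scores}

    # Accumulate the weight behind each risk bucket.
    share = defaultdict(float)
    for model, score in scores.items():
        share[bucket(score)] += weights[model]
    top_share = max(share.values()) / sum(weights[m] for m in scores)

    if mode == "unanimous":
        agree = len(share) == 1      # every model lands in the same bucket
    else:
        agree = top_share > 0.5      # one bucket holds more than half the weight

    spread = pstdev(scores.values())
    # Escalate on bucket disagreement or on a wide score spread,
    # regardless of how high or low the scores themselves are.
    return {"agree": agree, "spread": spread,
            "escalate": (not agree) or spread > max_std}
```

For example, scores of 0.65, 0.68, and 0.75 fall into two buckets with a small spread, so this sketch lets the PR through in majority mode but escalates it in unanimous mode.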

Abstract

Series: K11tech Agentic AI QA System -- Paper 2
Extends: Jadhav, K. (2026). Autonomous CI/CD Quality Assurance Using LangGraph Multi-Agent Orchestration and Risk-Proportionate Human-in-the-Loop Control. Zenodo. doi: 10.5281/zenodo.20543872

---

OVERVIEW

Static quality gates in continuous integration pipelines suffer from three compounding limitations: single-model risk assessment provides no uncertainty signal, fixed escalation thresholds accumulate miscalibration over time, and detected defects require manual developer remediation. This paper extends the K11tech Agentic AI QA System -- a LangGraph-orchestrated 14-agent pipeline that achieved 91.2% defect detection (F1 = 0.913) and 87% execution time reduction in Paper 1 -- with three interlocking innovations that close a detect-fix-learn feedback loop and transform the static quality gate into a continuously self-improving system.

---

INNOVATIONS

1. Multi-LLM Consensus Gate (Epistemic Uncertainty as a Safety Signal)

Replaces the single-model score_risk node with a parallel fan-out to GPT-4o, Claude Sonnet 4.6, and Gemini 1.5 Pro. When models disagree -- measured by risk-bucket placement and score standard deviation -- the gate forces Human-in-the-Loop (HITL) escalation independently of the numeric risk score. Three configurable agreement modes (unanimous, majority, weighted) allow teams to tune uncertainty sensitivity for their risk tolerance. Majority consensus reduces the false-positive HITL rate from 8.4% to 6.8% and the false-negative rate from 8.8% to 7.2% versus the single-model baseline. Critically, 71% of consensus-forced escalations were independently assessed by reviewers as genuinely ambiguous -- confirming that model disagreement is a meaningful uncertainty signal, not noise.

2. AutoRemediationAgent -- Confidence-Gated Patch Generation (Phase 2.5)

A 15th pipeline agent, executing after Phase 2 test agents complete and before Phase 3 reporting, that automatically generates and opens remediation pull requests for five safe, deterministic defect classes: accessibility violations, missing docstrings, unused imports, missing type hints, and test coverage gaps. Patches are generated as unified diffs with LLM self-assessed confidence; patches below the 0.80 confidence threshold are skipped and routed to Jira. High-confidence patches are committed to a dedicated branch via the GitHub MCP server. Remediation PRs are never auto-merged -- human approval is always required. Across 120 evaluated pull requests: 85.7% of safe-class defects produced patches above the confidence threshold; reviewer acceptance rate was 89.3% (Cohen's kappa = 0.87). Eliminates post-detection developer toil for the majority of deterministic defects.

3. Adaptive HITL Threshold Learner -- Online Calibration from Reviewer Decisions

Replaces the fixed risk_score >= 0.85 escalation threshold with a per-repository exponential moving average (EMA) over the reviewer's implied preferred threshold, bounded within safety constraints [0.70, 0.95]. A Bayesian correction term jointly adapts the consensus variance threshold from reviewer feedback -- tightening it when the consensus gate passes PRs a reviewer would have blocked, and relaxing it when the gate unnecessarily escalates approved PRs. Simulation over 500 synthetic reviewer decisions demonstrates convergence to within MAE = 0.018 of the true preferred threshold within 68-81 decisions. Reduces unnecessary HITL activations by 34% versus the fixed-threshold baseline. Escape rate remains 0% across all scenarios, including distribution shift.

---

COMPOSITION SAFETY

The three mechanisms are designed to interlock safely. The consensus gate produces the uncertainty signal the adaptive learner uses to weight its variance threshold updates. The AutoRemediationAgent operates independently and does not influence threshold calibration -- remediation PRs do not generate new HITL decisions. End-to-end composition across 120 PRs confirmed threshold stability (std = 0.031), no safety bound violations, and zero production escapes.

---

KEY RESULTS

- Consensus majority mode: F1 = 0.927 vs 0.913 baseline; false-positive rate -1.6pp; false-negative rate -1.6pp
- Adaptive learner: convergence within 80 decisions; 34% reduction in unnecessary HITL activations; 0% escape rate across all scenarios including distribution shift
- AutoRemediationAgent: 89.3% patch acceptance rate; 100% acceptance for unused import and missing type hint classes; Cohen's kappa = 0.87
- Composition: threshold std = 0.031 over 120 runs; zero safety bound violations

---

IMPLEMENTATION

Open-source implementation extending the K11tech Agentic AI QA System: https://github.com/K11-Software-Solutions/k11techlab-agentic-ai-qa-system

New modules: pipeline/consensus.py (Consensus Gate), pipeline/adaptive_threshold.py (Threshold Learner), AutoRemediationAgent as a conditional Phase 2.5 node.

Stack additions: langchain-anthropic, langchain-google-genai, GPT-4o patch generation, GitHub MCP PR creation, SQLite/PostgreSQL threshold persistence.

---

RELATED WORK IN THIS SERIES

Paper 1 -- Jadhav, K. (2026). Autonomous CI/CD Quality Assurance Using LangGraph Multi-Agent Orchestration and Risk-Proportionate Human-in-the-Loop Control. doi: 10.5281/zenodo.20543872

Paper 3 (forthcoming) -- Cross-Repository Dependency Analysis: system-level impact assessment from a single PR trigger via Knowledge Store MCP inter-service contract capture.

---

(c) 2026 Kavita Jadhav, K11 Software Solutions LLC. All rights reserved.
Contact: kavita.jadhav@k11softwaresolutions.com
GitHub: https://github.com/K11-Software-Solutions
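
The bounded EMA update behind the adaptive threshold learner described in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in the project's threshold module: the class name `ThresholdLearner`, the smoothing factor `alpha = 0.1`, and the idea that the caller supplies the reviewer's implied preferred threshold directly are all hypothetical. The starting value of 0.85 and the safety bounds 0.70 and 0.95 are the figures quoted in the abstract. The Bayesian correction of the consensus variance threshold is omitted.

```python
from dataclasses import dataclass


@dataclass
class ThresholdLearner:
    """Per-repository escalation threshold updated by an EMA (sketch only)."""

    threshold: float = 0.85   # fixed threshold that the learner replaces
    alpha: float = 0.1        # assumed smoothing factor, not from the paper
    lower: float = 0.70       # safety bounds quoted in the abstract
    upper: float = 0.95

    def update(self, implied_threshold: float) -> float:
        """Blend in one reviewer decision and clamp to the safety bounds.

        How the implied threshold is derived from a reviewer decision is
        not specified in the abstract; here the caller provides it.
        """
        blended = (1 - self.alpha) * self.threshold + self.alpha * implied_threshold
        self.threshold = min(self.upper, max(self.lower, blended))
        return self.threshold

    def should_escalate(self, risk_score: float) -> bool:
        """Escalate to a human when the score reaches the learned threshold."""
        return risk_score >= self.threshold


# One learner per repository, keyed by a hypothetical repository name.
learners = {"example/repo": ThresholdLearner()}
learners["example/repo"].update(0.80)
```

The clamp is what keeps the threshold inside the safety bounds: however many reviewer decisions imply a lower value, the threshold in this sketch never drops below 0.70.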
