Artificial intelligence (AI) coding systems now range from inline completion to repository-level agents and platform-supported application builders, yet software engineering still lacks a code-generation-centered operational taxonomy for describing how much work is delegated, under what conditions, and with what responsibility structure. This study proposes Levels of Automated Code Generation (LACG), a six-level taxonomy (L0–L5) for classifying automation in AI-augmented software construction. LACG is organized around four responsibility-aware concepts: the Software Development Task (SDT), the Operational Capability Domain (OCD), fallback responsibility, and the minimal risk condition. A level is assigned to a declared configuration–SDT–OCD tuple rather than to a vendor brand or model family in the abstract. To reduce the risk that public vendor documentation reproduces marketing bias, the method separates declared affordance evidence from routine capability evidence and adopts an evidence-triangulation design. Public documentation is used only to identify configuration boundaries and declared affordances; independent software engineering benchmarks, agent studies, productivity studies, and taxonomy-evaluation literature are used to calibrate the level boundaries and constrain the claims. LACG is then applied to 30 representative current AI coding tool configurations using time-stamped public-documentation records, with boundary logic cross-checked against independent evidence on repository-level issue solving, agent tool use, and context-dependent productivity outcomes. Three anonymized human raters, selected for software engineering or AI-coding-tool expertise and independent of the authors and evaluated vendors, then classified the same prepared, blinded public-documentation records using the LACG coding manual.
Exact three-rater agreement was 28/30 (93.3%); adjacent-level and majority agreement were both 30/30 (100.0%); mean pairwise quadratic-weighted Cohen’s kappa was 0.963; and Krippendorff’s alpha for ordinal ratings was 0.963. These agreement statistics test classification consistency over a structured documentary evidence base; they do not test actual tool behavior, direct execution, product performance, safety, productivity, or deployment outcomes. After adjudication, the final sample contains six L1 configurations, nine L2 configurations, and fifteen L3 configurations; no public configuration is classified as L4 or L5 under the fallback-responsibility criterion. The study supports preliminary, documentation-bound classification applicability, boundary calibration, and discriminative vocabulary development, not predictive validation or product-level performance claims. LACG provides an operational vocabulary for future empirical work on AI-augmented software construction, benchmark design, tool comparison, and responsibility allocation, while leaving outcome validation for governance, security, productivity, and procurement to subsequent empirical studies.
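For readers who want to see how agreement figures of this kind are typically derived, the following is a minimal sketch and not the study's analysis code. It assumes ordinal ratings encoded as integers 0 to 5 (L0 to L5), uses the standard quadratic-weighted form of Cohen's kappa, and runs on a small synthetic rating table invented purely for illustration. The names `quadratic_weighted_kappa`, `agreement_summary`, and `toy_ratings` are hypothetical, and Krippendorff's alpha is omitted for brevity.

```python
from collections import Counter
from itertools import combinations


def quadratic_weighted_kappa(a, b, n_levels=6):
    """Cohen's kappa with quadratic disagreement weights for two raters."""
    n = len(a)
    # Observed joint distribution of (rater A level, rater B level).
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1.0 / n
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    weighted_obs = 0.0
    weighted_exp = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = (i - j) ** 2 / (n_levels - 1) ** 2  # quadratic weight
            weighted_obs += w * observed[i][j]
            weighted_exp += w * marg_a[i] * marg_b[j]
    # Undefined (division by zero) if both raters use one identical level throughout.
    return 1.0 - weighted_obs / weighted_exp


def agreement_summary(ratings, n_levels=6):
    """ratings: one tuple per item, holding one integer level per rater."""
    exact = sum(len(set(r)) == 1 for r in ratings)
    adjacent = sum(max(r) - min(r) <= 1 for r in ratings)
    majority = sum(max(Counter(r).values()) > len(r) / 2 for r in ratings)
    per_rater = list(zip(*ratings))  # transpose: one sequence per rater
    kappas = [
        quadratic_weighted_kappa(x, y, n_levels)
        for x, y in combinations(per_rater, 2)
    ]
    return {
        "n_items": len(ratings),
        "exact": exact,
        "adjacent": adjacent,
        "majority": majority,
        "mean_pairwise_qwk": sum(kappas) / len(kappas),
    }


# Synthetic ratings for illustration only; these are NOT the study's data.
toy_ratings = [
    (1, 1, 1), (1, 1, 1), (2, 2, 2), (2, 2, 2), (2, 2, 3),
    (3, 3, 3), (3, 3, 3), (3, 3, 3), (1, 1, 1), (3, 3, 3),
]

summary = agreement_summary(toy_ratings)
print(summary)
```

In this sketch, exact agreement counts items on which all raters chose the same level, adjacent-level agreement counts items whose ratings span at most one level, and majority agreement counts items on which more than half of the raters chose the same level. Those definitions are assumptions about how the terms are used here, not definitions quoted from the coding manual.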
Chen et al. (Mon,) studied this question.