What question did this study set out to answer?

The study aims to address bias and inconsistency in hackathon evaluation processes through an AI-driven framework.

May 27, 2026

HackEval: An Intelligent Multi-Agent Framework for Automated, Bias-Mitigated Assessment in Competitive Hackathon Ecosystems

Key Points

Developed the HackEval framework utilizing six specialized agents for various evaluation tasks.
Conducted empirical evaluations on 27 hackathon submissions to compare AI scores with expert evaluations.
Implemented a multi-tenant Software-as-a-Service model for scalability and efficiency.
Achieved a Pearson correlation of r = 0.93 between AI composite scores and expert evaluations.
Reduced per-team evaluation time by 92.8%, demonstrating significant efficiency gains.
Improved cross-evaluator scoring consistency by 77.4%, reducing the score variance seen among human evaluators.
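The two headline metrics above (a Pearson correlation and a percentage reduction in evaluation time) can be illustrated with a minimal sketch. The helper names `pearson_r` and `percent_reduction` and the sample scores are hypothetical, chosen only for illustration; they are not the study's data or code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def percent_reduction(baseline, new):
    """Relative reduction from baseline to new, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - new) / baseline


# Made-up scores for five teams (NOT the study's data).
ai_scores = [7.8, 6.1, 9.0, 5.4, 8.2]
expert_scores = [8.0, 6.5, 8.7, 5.0, 8.4]

print(f"Pearson r: {pearson_r(ai_scores, expert_scores):.2f}")
```

A value of r close to 1 means the AI composite scores rank and space the teams much as the expert consensus does; it does not by itself show that the absolute scores agree.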

Abstract

Contemporary hackathon adjudication is burdened by four structural deficiencies inherent to human-centric evaluation: inconsistent rubric application, substantial inter-rater score variance, prohibitive assessment latency, and a near-total absence of granular, actionable post-event diagnostic feedback. This paper introduces HackEval, a production-grade multi-agent artificial intelligence framework designed to systematically resolve these limitations through real-time, bias-mitigated evaluation across heterogeneous project submission modalities. Six functionally specialized agents operate in parallel: (i) a Code Quality Agent performing deep multi-criterion static analysis on GitHub repositories; (ii) a Presentation Analyzer Agent realizing a four-stage pipeline integrating LLM semantic reasoning with contrastive vision-language embeddings; (iii) a UI/UX Evaluation Agent leveraging CLIP-based aesthetic regression [13]; (iv) an Innovation Agent quantifying originality via semantic embedding distance; (v) a Feasibility Agent applying chain-of-thought LLM reasoning [12]; and (vi) a Plagiarism Detection Agent employing sentence-transformer cosine similarity with FAISS indexing [15]. Empirical evaluation across 27 authentic hackathon submissions yields a Pearson correlation of r = 0.93 between AI composite scores and consensus expert evaluations, a 92.8% reduction in per-team evaluation time, and a 77.4% improvement in cross-evaluator scoring consistency. The platform is delivered as a multi-tenant Software-as-a-Service system built on a MERN + FastAPI + LangChain stack, demonstrating concurrent scalability exceeding 500 simultaneous teams with sub-10-second feedback delivery.
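As a rough illustration of the cosine-similarity check attributed to the Plagiarism Detection Agent, the sketch below compares one submission's embedding against a small corpus. It makes several assumptions: the three-dimensional vectors are toy stand-ins for sentence-transformer embeddings, the brute-force scan stands in for a FAISS index, and the function names and the 0.9 threshold are hypothetical rather than taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def flag_similar(query, corpus, threshold=0.9):
    """Return (name, similarity) pairs at or above threshold, most similar first.

    Brute-force scan; an approximate-nearest-neighbour index would replace
    this loop at scale. The threshold is a hypothetical value.
    """
    hits = [(name, cosine_similarity(query, vec)) for name, vec in corpus.items()]
    hits = [(name, sim) for name, sim in hits if sim >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy embeddings (assumed values, not real model output).
corpus = {
    "team_a": [0.9, 0.1, 0.0],
    "team_b": [0.0, 1.0, 0.0],
    "team_c": [0.7, 0.7, 0.1],
}
query = [1.0, 0.1, 0.0]

for name, sim in flag_similar(query, corpus):
    print(f"{name}: similarity {sim:.3f}")
```

The same distance computation could, under a similar assumption, serve an originality score such as the one the Innovation Agent is described as producing, where a larger distance from prior submissions suggests a more novel project.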
