What question did this study set out to answer?

The aim is to develop a machine learning framework for accurate crop yield prediction utilizing district and crop networks.

March 7, 2026 · Open Access

Network-enhanced machine learning framework for multi-crop yield prediction: a comprehensive analysis of Indian agricultural data

Key Points

Developed a network-enhanced machine learning framework.
Analyzed 52 years of agricultural data from 311 Indian districts.
Constructed district similarity and crop co-occurrence networks.
Computed centrality indicators and integrated them with temporal features.
Compared advanced models against simple baselines using time-series cross-validation.
The Random Forest model outperformed all others for every crop, with R² values above 0.94.
Advanced models significantly outperformed the baselines for five of six crops (p < 0.05).
Network features contributed less than 1% to feature importance, highlighting the dominance of temporal patterns.
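
The time-series cross-validation and baseline comparison listed above can be illustrated with a minimal sketch. It is not the authors' code: the expanding-window scheme, the function names, the window length, and the synthetic yield series are all assumptions made for illustration. It only shows the core idea that every forecast uses data from strictly earlier years.

```python
from statistics import mean


def expanding_window_splits(n_obs, n_splits, min_train):
    """Yield (train_idx, test_idx) pairs in which training years always precede test years."""
    fold_size = (n_obs - min_train) // n_splits
    if fold_size < 1:
        raise ValueError("not enough observations for the requested splits")
    for k in range(n_splits):
        train_end = min_train + k * fold_size
        # The last fold absorbs any leftover observations.
        test_end = train_end + fold_size if k < n_splits - 1 else n_obs
        yield list(range(train_end)), list(range(train_end, test_end))


def naive_forecast(history):
    """Naive baseline: next year's yield equals the last observed yield."""
    return history[-1]


def rolling_mean_forecast(history, window=3):
    """Rolling-mean baseline: average of the last `window` observed yields."""
    return mean(history[-window:])


def mae_one_step(series, forecaster, n_splits=4, min_train=4):
    """Mean absolute error of one-step-ahead forecasts over all test folds.

    The baselines have no parameters to fit, so the training indices only
    mark where evaluation starts; each forecast sees the values before year t.
    """
    errors = []
    for _, test_idx in expanding_window_splits(len(series), n_splits, min_train):
        for t in test_idx:
            errors.append(abs(series[t] - forecaster(series[:t])))
    return mean(errors)


# Hypothetical yield series (not taken from the study's data).
yields = [2.0, 2.1, 2.3, 2.2, 2.4, 2.6, 2.5, 2.7, 2.9, 2.8, 3.0, 3.2]
naive_mae = mae_one_step(yields, naive_forecast)
rolling_mae = mae_one_step(yields, rolling_mean_forecast)
print(f"naive MAE: {naive_mae:.4f}, rolling-mean MAE: {rolling_mae:.4f}")
```

A fitted model such as Random Forest would be trained on the training indices of each fold and scored on the matching test indices, so it is compared with the baselines on identical held-out years.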

Abstract

Accurate crop yield prediction is a cornerstone for food security, agricultural planning, and evidence-based policy design. In this work, we develop a network-enhanced machine learning framework that combines district similarity structures and crop co-occurrence patterns with rich temporal features to forecast yields for multiple crops across India. The empirical analysis relies on 52 years of district-level agricultural data (1966–2017) from 311 districts and focuses on six key crops: rice, wheat, maize, groundnut, cotton, and sugarcane. We construct two complementary network representations: a district similarity network derived from long-term yield trajectories (311 nodes, 2,996 edges, 6.2% density) and a crop co-occurrence network spanning 23 crops (253 edges). From these networks, we compute several centrality indicators and integrate them with temporal covariates, including lagged yields, rolling statistics, volatility measures, and diversification indices. We used a strict time-series cross-validation setup to compare simple baselines (Naive, Rolling Mean) with more advanced models (Ridge Regression, Random Forest, Gradient Boosting), both with and without network-based features. Among all evaluated models, Random Forest achieved the strongest performance for every crop, yielding R² values above 0.94 (rice: 0.988, wheat: 0.976, maize: 0.971, groundnut: 0.946, cotton: 0.969, sugarcane: 0.986). Statistical tests showed that the advanced models significantly outperformed the baselines for five of the six crops (p < 0.05). However, network features contributed less than 1% to overall feature importance, indicating that temporal patterns are the main drivers of prediction. Together with temporal stability checks and residual diagnostics, this evaluation setup offers a solid framework for agricultural forecasting and for designing practical crop yield prediction and decision-support systems.
This study is primarily positioned as a rigorous benchmarking and methodological validation framework rather than a performance breakthrough, providing empirical evidence on the relative value of different feature-engineering strategies and establishing best practices for time-series cross-validation in agricultural machine learning. The finding that static network features provide negligible incremental value beyond temporal covariates is itself a significant contribution, guiding practitioners toward investments in data quality rather than complex network constructions.
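
The network sizes quoted in the abstract can be sanity-checked with the standard density formula for a simple undirected graph, edges divided by n(n-1)/2. The sketch below assumes both networks are undirected with no self-loops, which the abstract does not state explicitly; the function name is illustrative.

```python
def undirected_density(n_nodes, n_edges):
    """Share of possible undirected node pairs (no self-loops) that are connected."""
    possible_pairs = n_nodes * (n_nodes - 1) // 2
    return n_edges / possible_pairs


# Node and edge counts as reported in the abstract.
district_density = undirected_density(311, 2996)
crop_density = undirected_density(23, 253)

print(f"district similarity network density: {district_density:.3f}")
print(f"crop co-occurrence network density: {crop_density:.3f}")
```

Under that assumption the district figure agrees with the reported 6.2% density, and 253 edges is exactly the number of possible pairs among 23 crops. If so, every crop pair co-occurs somewhere, and unweighted centrality scores on that network would be identical for all crops.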
