Electrocardiogram (ECG) heartbeat classification is an essential component of automated arrhythmia detection and intelligent cardiac monitoring systems. Traditionally, ECG analysis has depended on manual interpretation by clinicians and on conventional machine learning approaches based on handcrafted features, which are labor-intensive, noise-sensitive, and inadequate for capturing the complex nonlinear morphological and temporal characteristics of ECG signals. In addition, real-world ECG datasets are highly imbalanced and noisy, and exhibit overlapping waveform patterns across heartbeat classes, leading to biased learning, poor minority-class detection, and unreliable predictions. To address these challenges, this paper presents a calibration-aware, reliability-oriented evaluation framework for ECG heartbeat classification, incorporating hybrid deep learning architectures that combine convolutional feature extraction, bidirectional GRU-based temporal modeling, and attention mechanisms. The framework assesses probabilistic reliability using calibration metrics, such as the Brier Score and Expected Calibration Error (ECE), rather than explicitly modeling predictive uncertainty. Experimental results on the ECG Heartbeat dataset show that a CNN achieves the highest testing accuracy (98.44%), largely due to strong performance on the majority class in an imbalanced setting. Among hybrid approaches, a representative hybrid CNN + BiGRU + Attention model attains a competitive accuracy of 97.80%, along with a higher macro F1-score (0.9052), improved training stability, and good calibration behavior (Brier Score = 0.0417, ECE = 0.1023). Because the experiments are conducted on preprocessed, fixed-length segments, the results reflect performance under controlled conditions rather than real-world clinical deployment and should therefore be interpreted as a benchmark-level evaluation. Moreover, no single model consistently outperforms the others across all evaluation criteria, as different metrics capture distinct aspects of performance.
Rani et al. (Thu,) studied this question.
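
The two calibration metrics named in the abstract, the Brier Score and ECE, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the multiclass Brier definition (squared error summed over classes, averaged over samples), the top-label form of ECE, the equal-width binning, and the default of 10 bins are all assumptions made here for illustration, and the function names and toy data are hypothetical.

```python
import numpy as np


def brier_score(probs, labels):
    """Multiclass Brier score (assumed definition): mean over samples of the
    squared distance between the probability vector and the one-hot label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE with equal-width confidence bins (n_bins is an assumption).

    For each bin, take |accuracy - mean confidence| and weight it by the
    fraction of samples whose top-class confidence falls in that bin.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    conf = probs.max(axis=1)          # confidence of the predicted class
    pred = probs.argmax(axis=1)       # predicted class index
    correct = (pred == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)


if __name__ == "__main__":
    # Hypothetical two-class toy predictions, not data from the paper.
    toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
    toy_labels = [0, 0, 1, 1]
    print("Brier:", brier_score(toy_probs, toy_labels))
    print("ECE:  ", expected_calibration_error(toy_probs, toy_labels))
```

Both metrics are zero for perfectly confident, correct predictions and grow as predicted probabilities drift from observed outcomes. Note that reported values depend on the exact definition and the bin count, so numbers from this sketch are not directly comparable to those in the abstract.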