In this paper, we propose a novel Depth from Focus (DFF) framework that formulates depth estimation as an energy minimization problem and unrolls the corresponding iterative optimization into a trainable neural architecture. Given a focal stack, a deep feature extractor constructs a learned focus volume that encodes defocus and structural cues. From this representation, multiple candidate depth maps are generated using a plane-based probabilistic formulation, and an attention mechanism adaptively assigns pixel-wise confidence weights to each candidate. Depth is then estimated through an iterative refinement process in which each stage corresponds to a learned proximal update implemented via lightweight conditional networks. These updates incorporate focus consistency, adaptive step sizes, and learned regularization priors, which allows physical imaging constraints to be integrated with data-driven modeling. A final refinement module further improves prediction accuracy by fusing the refined depth, focus volume features, and candidate hypotheses to estimate residual corrections. The entire framework is trained end-to-end, so that all components are optimized jointly. Experimental results demonstrate that the proposed method achieves improved robustness and accuracy, particularly in low-texture and noisy regions, while preserving interpretability through its unfolding-based design.
Muhammad Tariq Mahmood studied this question.
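To make the pipeline in the abstract concrete, the following is a minimal NumPy sketch of its three stages: candidate depth maps computed from a focus volume, attention-weighted fusion, and a fixed number of unrolled refinement stages. It is an illustration under stated assumptions, not the architecture of the paper. In particular, the candidates are assumed to be softmax expectations over focal planes at different temperatures, the attention logits are assumed to be given as an array, the data term is assumed to be quadratic, and a 3x3 box filter stands in for the learned proximal network. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def candidate_depths(focus_volume, focus_positions, temperatures):
    """Return (K, H, W) candidate depth maps from an (S, H, W) focus volume.

    Assumption: each candidate is the expected focus position under a
    per-pixel softmax over the S focal planes, one per temperature.
    """
    cands = []
    for t in temperatures:
        prob = softmax(focus_volume / t, axis=0)  # (S, H, W)
        cands.append(np.tensordot(focus_positions, prob, axes=(0, 0)))
    return np.stack(cands)


def fuse_candidates(cands, attn_logits):
    """Pixel-wise convex combination of candidates; weights sum to 1 over K."""
    weights = softmax(attn_logits, axis=0)
    return (weights * cands).sum(axis=0)


def box_filter(x):
    """3x3 mean filter with edge padding (stand-in for a learned prior)."""
    h, w = x.shape
    p = np.pad(x, 1, mode="edge")
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0


def unrolled_refine(d_init, d_obs, step_sizes, blend_weights):
    """Run one refinement stage per (step size, blend weight) pair.

    Each stage takes a gradient step on the assumed data term
    0.5 * ||d - d_obs||^2, then applies a smoothing step that plays the
    role of the proximal update. In a trained model the step sizes and
    the proximal map would be learned; here they are fixed by hand.
    """
    d = d_init.copy()
    for eta, lam in zip(step_sizes, blend_weights):
        d = d - eta * (d - d_obs)                # data-consistency step
        d = (1.0 - lam) * d + lam * box_filter(d)  # stand-in proximal step
    return d
```

Unrolling here simply means that the loop length is fixed in advance and each stage has its own parameters, which is what would let a training procedure adjust them stage by stage.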