What question did this study set out to answer?

This research aims to improve salient object detection by bridging the gap between CNN and Transformer features while utilizing edge cues for better structural guidance.

April 18, 2026 | Open Access

Salient Object Detection with Semantic-Aware Edge Refinement and Edge-Guided Cross-Attention Feature Aggregation

Key Points

Developed SECA-Net to combine CNN and Transformer representations effectively.
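Introduced a level-wise feature interaction (LFI) module that performs discrepancy-aware fusion of CNN and Transformer features across feature levels.
Designed a semantic-aware edge refinement (SAER) module that produces clean multi-scale edge priors under high-level semantic guidance.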
Implemented an edge-guided cross-attention module to integrate structural constraints into saliency decoding.
SECA-Net outperformed 19 state-of-the-art methods on five benchmark datasets.
Ranked first in Fβ and BDE metrics across all datasets tested.
Achieved a notable 1.54% improvement in Fβ on the challenging DUTS-TE dataset.
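
For readers unfamiliar with the Fβ metric cited above, the following is a minimal sketch of how an F-measure is typically computed for a single saliency map. It is not taken from the paper: the function name `f_beta`, the fixed 0.5 binarization threshold, and the weight β² = 0.3 (a common convention in salient object detection) are assumptions, and the study's actual evaluation protocol may differ.

```python
def f_beta(pred, gt, beta_sq=0.3, threshold=0.5):
    """F-beta score between a saliency map and a binary ground-truth mask.

    pred: flat list of saliency values in [0, 1]
    gt:   flat list of 0/1 ground-truth labels, same length as pred
    beta_sq and threshold are assumed defaults, not values from the paper.
    """
    # Binarize the predicted saliency map at the chosen threshold.
    binary = [1 if p >= threshold else 0 for p in pred]

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for b, g in zip(binary, gt) if b == 1 and g == 1)
    fp = sum(1 for b, g in zip(binary, gt) if b == 1 and g == 0)
    fn = sum(1 for b, g in zip(binary, gt) if b == 0 and g == 1)

    # With no true positives, precision and recall are both zero.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    # Weighted harmonic mean; beta_sq < 1 puts more weight on precision.
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Four-pixel toy example: one false positive, no false negatives.
    print(round(f_beta([0.9, 0.8, 0.2, 0.6], [1, 1, 0, 0]), 4))
```

In the toy example, precision is 2/3 and recall is 1, so the score is 1.3 × (2/3) / (0.3 × 2/3 + 1) ≈ 0.7222.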

Abstract

Hybrid multi-backbone architectures and the utilization of edge cues for auxiliary training have become two major research trends in salient object detection (SOD). It is widely acknowledged that CNNs can effectively model local spatial structures, while Transformers can capture long-range global dependencies. However, the representation discrepancy between CNN and Transformer features, together with boundary-detail degradation during multi-scale fusion, remains a major challenge. In addition, how to effectively leverage edge cues as reliable structural guidance without introducing texture-induced false boundaries or boundary leakages remains an open issue. In this paper, we present SECA-Net, a unified framework that establishes a profound synergy between CNN and Transformer representations. It explicitly bridges their inherent discrepancies through level-dependent interaction strategies, while resolving structural degradation via a sequential “purify-and-guide” mechanism. This approach enables the network to extract and utilize edge cues effectively, thereby alleviating boundary degradation and texture-induced false contours. Specifically, we design a dual-encoder structure to extract features. A level-wise feature interaction (LFI) module is introduced to perform discrepancy-aware fusion across feature levels, stabilizing CNN–Transformer aggregation. Meanwhile, the features extracted from the CNN branch are projected into a semantic-aware edge refinement (SAER) module to produce clean multi-scale edge priors under high-level semantic guidance, suppressing texture-induced spurious edges. Finally, we design an edge-guided cross-attention feature aggregation (ECFA) module, which progressively injects refined edge priors as structural constraints into multi-scale saliency decoding via cascaded cross-attention, enabling effective structural refinement. 
Overall, LFI reduces cross-branch discrepancy, SAER purifies boundary priors, and ECFA integrates semantics and structure in a progressive decoding manner, forming a unified SECA-Net framework. Extensive experimental results on five benchmark SOD datasets show that SECA-Net outperforms 19 state-of-the-art methods, demonstrating its effectiveness. Specifically, our proposed method ranks first in Fβ and BDE across all datasets, notably improving Fβ by 1.54% on the challenging DUTS-TE dataset.
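
The idea of injecting edge priors into saliency decoding through cascaded cross-attention can be illustrated with a small sketch. This is not the authors' ECFA module: the function names are hypothetical, learned query/key/value projections are omitted, and every level is assumed to have the same token count and channel width so that no resampling is needed. It only shows the general pattern in which saliency tokens act as queries and edge tokens supply keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def edge_guided_cross_attention(saliency_tokens, edge_tokens):
    """Scaled dot-product cross-attention with a residual connection.

    Queries come from the saliency tokens; keys and values are the edge
    tokens. Learned projections are left out (an assumption of this sketch).
    """
    dim = len(saliency_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in saliency_tokens:
        # Attention weights of this saliency token over all edge tokens.
        weights = softmax([dot(q, k) * scale for k in edge_tokens])
        # Weighted sum of edge tokens, one channel at a time.
        attended = [
            sum(w * v[d] for w, v in zip(weights, edge_tokens))
            for d in range(dim)
        ]
        # Residual add keeps the original saliency information.
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def cascaded_decode(saliency_levels, edge_levels):
    """Apply cross-attention level by level, carrying the result forward.

    Assumes all levels share the same shape (hypothetical simplification).
    """
    state = None
    for sal, edge in zip(saliency_levels, edge_levels):
        if state is not None:
            # Merge the previous level's output into the current features.
            sal = [[s + p for s, p in zip(tok, prev)]
                   for tok, prev in zip(sal, state)]
        state = edge_guided_cross_attention(sal, edge)
    return state


if __name__ == "__main__":
    saliency_levels = [[[1.0, 0.0]], [[0.0, 1.0]]]
    edge_levels = [[[0.5, 0.5]], [[0.25, 0.25]]]
    print(cascaded_decode(saliency_levels, edge_levels))
```

A real decoder would add learned linear projections, multiple heads, normalization, and upsampling between levels; the sketch leaves these out to keep the query/key/value roles visible.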
