Tactile data can enhance the environmental perception and interaction capabilities of intelligent agents, serving as a foundational component for the development of embodied intelligence. Despite its critical role, tactile data acquisition remains cost-prohibitive and labor-intensive, resulting in severe data scarcity. Cross-modal generation offers a promising solution by leveraging abundant visual and textual data. However, effectively aligning heterogeneous visual and textual modalities under data-scarce and sparsely annotated conditions remains a significant challenge. To address these challenges, a visual-textual information-driven tactile data generation (VTTac) framework is proposed, which features three key innovations. First, a multi-granularity text enhancement strategy is introduced to mitigate annotation sparsity through hierarchical semantic enrichment. Second, a cascaded dual cross-attention mechanism is designed to ensure cross-modal alignment. Third, a condition adapter injects a low-frequency background prior, enabling the generative backbone to focus on high-frequency texture synthesis. A wavelet transform then fuses the synthesized details with the real background. Extensive evaluations across three datasets demonstrate that VTTac consistently outperforms representative baselines. Furthermore, downstream tasks validate the physical faithfulness of the synthesized data for material classification and semantic reasoning, and zero-shot experiments confirm generalization to unseen objects.
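The wavelet fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the wavelet family, the number of decomposition levels, or the image format, so the code below assumes a single-level 2D Haar transform on a single-channel image with even height and width. The function names (`haar_dwt2`, `haar_idwt2`, `fuse_background_and_texture`) are hypothetical and are not taken from VTTac. The idea shown is only the general one: keep the low-frequency sub-band of the real background and the high-frequency sub-bands of the synthesized image, then invert the transform.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar transform (assumed here; not specified by the source).

    Returns the low-frequency approximation and a tuple of three detail sub-bands.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    low = (a + b + c + d) / 2.0
    d_cols = (a - b + c - d) / 2.0   # differences across columns
    d_rows = (a + b - c - d) / 2.0   # differences across rows
    d_diag = (a - b - c + d) / 2.0   # diagonal differences
    return low, (d_cols, d_rows, d_diag)


def haar_idwt2(low, details):
    """Inverse of haar_dwt2; reconstructs the full-resolution image."""
    d_cols, d_rows, d_diag = details
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (low + d_cols + d_rows + d_diag) / 2.0
    out[0::2, 1::2] = (low - d_cols + d_rows - d_diag) / 2.0
    out[1::2, 0::2] = (low + d_cols - d_rows - d_diag) / 2.0
    out[1::2, 1::2] = (low - d_cols - d_rows + d_diag) / 2.0
    return out


def fuse_background_and_texture(background, synthesized):
    """Combine the low-frequency band of `background` with the
    high-frequency bands of `synthesized` (hypothetical helper)."""
    bg = np.asarray(background, dtype=np.float64)
    syn = np.asarray(synthesized, dtype=np.float64)
    if bg.ndim != 2 or bg.shape != syn.shape:
        raise ValueError("inputs must be 2D arrays of the same shape")
    if bg.shape[0] % 2 or bg.shape[1] % 2:
        raise ValueError("height and width must be even")
    low_bg, _ = haar_dwt2(bg)          # background prior: low frequencies only
    _, details_syn = haar_dwt2(syn)    # generated texture: high frequencies only
    return haar_idwt2(low_bg, details_syn)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    background = rng.random((8, 8))   # stand-in for a real tactile background
    synthesized = rng.random((8, 8))  # stand-in for a generated tactile image
    fused = fuse_background_and_texture(background, synthesized)
    print(fused.shape)
```

Because the Haar transform is invertible, fusing an image with itself returns that image unchanged. This gives a quick sanity check for any variant of the sketch that uses a different wavelet or more decomposition levels.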
Song et al. (Thu,) studied this question.