The primary aim of this study is to combine self-supervised learning with transparent transformer-based architectures to enable robust, end-to-end speech and language understanding, pretraining deep transformer models on unannotated speech and text corpora. This work proposes an explainable transformer-based framework for speech and language understanding that integrates self-supervised learning with built-in explainability. Across multiple datasets, the proposed model achieved a low word error rate, high accuracy, and interpretable predictions. The framework also has limitations, which are discussed in the work: the deep transformer architecture is computationally expensive, and its interpretability is assessed in part against rough benchmarks, so feature relevance is judged using indirect proxies and not direct ground truth. Future work will extend the framework to multilingual and cross-domain datasets, develop more efficient transformer variants for real-time use, and add human-centered evaluation to verify that the model's explanations are interpreted correctly.
Mahfuzul Huda (Thu,) studied this question.