X-ray diffraction (XRD) is a powerful analytical technique for identifying crystalline phases in unknown mixtures. However, traditional phase identification methods are time-consuming and require substantial human intervention. To accelerate this process, machine learning has become increasingly important in XRD phase identification. By framing phase identification as an image classification task, Convolutional Neural Network (CNN)-based methods have achieved notable performance. However, the inherent limitation of CNNs in capturing long-range dependencies makes multiphase identification difficult, because the characteristic peaks of a phase may be widely separated. The Vision Transformer (ViT) architecture, with its self-attention mechanism, offers a promising alternative by effectively modeling global relationships. However, the differences between XRD data and natural images limit ViT's performance in the phase identification task. To align ViT architectures with XRD domain knowledge, we propose the XRD-VisionTransformer (XViT), a new network for multiphase identification of XRD patterns. To address ViT's sensitivity to data set size, we introduce a statistical positional embedding module in XViT that encodes crystallographic position priors using global intensity statistics rather than fully learnable embeddings. This allows the model to be trained on smaller data sets while maintaining performance. Furthermore, to better capture interpeak dependencies, we introduce a deep classifier tail that uses all of the features in the last transformer layer rather than a subset of them. This lets the model learn the relationships between different characteristic peaks and yields better identification of phases, each of which appears as a combination of characteristic peaks. Comprehensive experiments on two inorganic data sets demonstrate that XViT outperforms both CNN and ViT models in XRD phase identification.
Wei et al. studied this question.
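The two components named in the abstract can be illustrated with a minimal NumPy sketch. It is not the XViT implementation: the abstract does not specify which intensity statistics are used, how they are mapped to embeddings, or how the classifier tail is built. The sketch assumes per-patch mean and standard deviation over the training set, a linear projection `proj`, and a two-layer head over the flattened tokens. All function names, shapes, and the random weights are hypothetical, and the transformer layers themselves are omitted.

```python
import numpy as np

def patchify(patterns, patch_len):
    """Split 1-D patterns of shape (n, length) into (n, n_patches, patch_len)."""
    n, length = patterns.shape
    n_patches = length // patch_len
    return patterns[:, : n_patches * patch_len].reshape(n, n_patches, patch_len)

def statistical_positional_embedding(train_patterns, patch_len, proj):
    """Fixed positional embedding built from data-set-wide intensity statistics.

    Assumption: each patch position is summarized by the mean and standard
    deviation of its intensities over the training set, then projected to
    the embedding width by `proj` of shape (2, dim).
    """
    patches = patchify(train_patterns, patch_len)
    mean = patches.mean(axis=(0, 2))          # (n_patches,)
    std = patches.std(axis=(0, 2))            # (n_patches,)
    stats = np.stack([mean, std], axis=1)     # (n_patches, 2)
    return stats @ proj                       # (n_patches, dim)

def all_token_head(tokens, w_hidden, w_out):
    """Multi-label head over every last-layer token, not a single class token."""
    flat = tokens.reshape(tokens.shape[0], -1)    # (n, n_patches * dim)
    hidden = np.maximum(flat @ w_hidden, 0.0)     # ReLU
    logits = hidden @ w_out                       # (n, n_phases)
    return 1.0 / (1.0 + np.exp(-logits))          # independent per-phase scores

# Toy data: 8 synthetic patterns of 64 points, 4 patches of 16 points each.
rng = np.random.default_rng(0)
patterns = rng.random((8, 64))
dim, patch_len, n_phases = 4, 16, 3

pos = statistical_positional_embedding(patterns, patch_len, rng.normal(size=(2, dim)))
# Patch embedding plus positional prior; transformer layers would go here.
tokens = patchify(patterns, patch_len) @ rng.normal(size=(patch_len, dim)) + pos
probs = all_token_head(tokens, rng.normal(size=(4 * dim, 8)), rng.normal(size=(8, n_phases)))
```

Because `pos` is computed from data statistics instead of being learned, it adds no trainable parameters per position, which is one plausible reading of why such an embedding could help on small data sets. The sigmoid output treats each phase independently, which suits mixtures where several phases are present at once.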