Computational approaches that accurately predict materials properties could greatly accelerate materials research by enabling large-scale screening. For crystalline materials, however, capturing both the 3D structure and its periodicity is critical to accurate property prediction. Although traditional deep learning approaches have shown significant improvements, they cannot fully convey structure, periodicity, and complex crystallographic information. Text descriptions of crystals, by contrast, can be very expressive, which makes LLM-based property prediction models promising. Here, I finetuned a state-of-the-art LLM-based model developed by Rubungo et al. [1], which predicts crystal bandgaps from crystal text descriptions, and used it as my baseline. I then developed an architecture that explicitly leverages the numerical tokens in the text descriptions. This architecture improved the test mean absolute error (MAE) by 15 meV over my baseline, more than the gain reported by Rubungo et al. over other state-of-the-art models [1]. I also investigated newer embedding models to test how well they extract information from text descriptions that contain numerical information. The best of these embedding-based architectures, obtained with the E5-large model, yielded results comparable to my baseline without finetuning the LLM embedding component. These newly developed architectures offer ways to push the limits of materials property prediction.
Alexis Geslin (Tue,) studied this question.
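To make the idea of "explicitly leveraging numerical tokens" more concrete, here is a minimal sketch of one possible preprocessing step: separating the numbers in a crystal text description from the surrounding words, so that a downstream model could treat them as values and not as ordinary subword tokens. The abstract does not specify how the architecture handles numbers, so this is an illustrative assumption and not the method used in this work. The names `split_numeric_tokens`, `NUM_PATTERN`, and the `[NUM]` placeholder are hypothetical.

```python
import re

# Hypothetical pattern: standalone integers, decimals, and simple scientific
# notation (e.g. 4, 1.61, -0.5, 1e-3). Digits glued to letters, such as the
# "2" in "SiO2" or the "3" in "Fd-3m", are deliberately left untouched.
NUM_PATTERN = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?(?!\w)")


def split_numeric_tokens(description, placeholder="[NUM]"):
    """Replace each standalone number with a placeholder.

    Returns the masked text and the list of extracted values, in order of
    appearance. This is an assumed preprocessing scheme for illustration only.
    """
    values = []

    def _swap(match):
        # Record the numeric value and substitute the placeholder token.
        values.append(float(match.group(0)))
        return placeholder

    masked = NUM_PATTERN.sub(_swap, description)
    return masked, values


if __name__ == "__main__":
    text = "SiO2 crystallizes with a Si-O bond length of 1.61 and an O-Si-O angle of 109.5 degrees."
    masked_text, numbers = split_numeric_tokens(text)
    print(masked_text)
    print(numbers)
```

In a scheme like this, the masked text could be passed to a text encoder while the extracted values are embedded separately, for example through a small numeric projection layer. Whether that matches the architecture described in the abstract is not stated in the source.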