Language models (LMs) are widely used in software engineering, especially for tasks such as code generation, where they are referred to as code LMs. These models have proven effective at generating code, making it easier for developers to automate coding activities. However, research has highlighted a significant limitation: despite their effectiveness, LMs often produce code that is incorrect, buggy, or not fully functional. Addressing this limitation is essential to fully realize LMs' potential in practical applications. Yet updating LMs, especially with only a small amount of feedback data, is highly challenging. In such cases, hot-fix techniques, which facilitate agile model updates with limited data, may be a promising way to maintain reliability and usefulness. In this paper, we propose Model Improvement via Neuron Targeting (MINT), a novel approach for repairing code LMs. MINT leverages the semantic property of language models to perform neuron-level repairs. By analyzing the relationships between the model's latent representations, the incorrect outputs, and the desired outputs, MINT determines which neurons are worth updating. This ensures that only the neurons crucial to the model's failure are targeted, avoiding unnecessary changes and allowing for a more efficient and precise repair process. MINT is effective, efficient, and reliable, capable of correcting a neural model by patching a minimal number of neurons (usually one neuron for short code, and two or more for long code). We introduce new measures to evaluate its generalizability and develop a new benchmark, which is made available for further study. Our approach is evaluated on three coding tasks: line-level code generation, shellcode generation, and intent-to-bash translation. The experimental results demonstrate that the proposed approach significantly outperforms the state of the art in both effectiveness and efficiency.
With respect to the ExactMatch score, MINT achieves improvements of \(5.7\%\) to \(20.8\%\) on StarCoder2-3B and \(3.9\%\) to \(18.5\%\) on CodeLlama-7B over the state of the art. Regarding efficiency, MINT is \(32.3\%\) to \(74.8\%\) faster than the state of the art. In addition, we analyze and discuss the side effects of model repair techniques, including the balance between generalization and specificity, and the performance after multiple repairs in succession.
Gu et al. (Fri,) studied this question.
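For readers unfamiliar with the ExactMatch score reported above, the following is a minimal sketch of how such a metric is commonly computed. It is an illustration under stated assumptions, not the evaluation code used in this work: the function name `exact_match` and the whitespace normalization step are hypothetical choices, and the paper's own definition may differ.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions that equal their reference.

    Assumption (not taken from the paper): runs of whitespace are collapsed
    and leading/trailing whitespace is dropped before comparison.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0

    def normalize(text: str) -> str:
        # Collapse every run of whitespace into a single space.
        return " ".join(text.split())

    hits = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["ls -la", "echo hi"]
    refs = ["ls  -la", "echo bye"]
    print(exact_match(preds, refs))
```

Under this sketch, a generated line counts as correct only when it matches the reference in full after normalization, which is why the metric is strict for longer code.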