In this report, we introduce the dataset we curated to replicate and extend experiments on R programming tasks, in particular code summarization and method name prediction. The dataset was built by collecting R repositories from GitHub, parsing their code with the tree-sitter parser, and pairing each snippet with a natural language description drawn from its Roxygen2 documentation. Building on this dataset, we analyze in depth how Pre-trained Language Models for Code (Code-PLMs) perform on R code. We highlight the challenges posed by R’s two dominant programming styles, Tidyverse and Base R, and show that current models, including Large Language Models, exhibit varying degrees of performance degradation when applied to R code. These findings underscore how difficult it is to apply Code-PLMs effectively to R, given its diverse programming styles and language features.
Zhao et al. studied this question.
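To make the dataset construction step concrete, the following is a minimal sketch of how a Roxygen2 comment block might be paired with the function it documents. It is not the pipeline used in this report: it substitutes simple regular expressions for the tree-sitter parser, and the names `extract_pairs`, `ROXYGEN_LINE`, and `FUNC_DEF` are hypothetical, introduced only for illustration.

```python
import re

# Assumption: Roxygen2 comment lines start with "#'" (optionally indented).
ROXYGEN_LINE = re.compile(r"^\s*#'\s?(.*)$")
# Assumption: functions are defined as `name <- function(` or `name = function(`.
# This regex is a stand-in for a real parser such as tree-sitter.
FUNC_DEF = re.compile(r"^\s*([A-Za-z.][A-Za-z0-9._]*)\s*(?:<-|=)\s*function\s*\(")


def extract_pairs(r_source):
    """Return (name, description) records for documented R functions."""
    pairs = []
    doc_lines = []  # Roxygen2 lines seen immediately before the current line
    for line in r_source.splitlines():
        doc_match = ROXYGEN_LINE.match(line)
        if doc_match:
            doc_lines.append(doc_match.group(1))
            continue
        func_match = FUNC_DEF.match(line)
        if func_match and doc_lines:
            # Use the first non-empty, non-tag line (the title) as the description.
            description = next(
                (d for d in doc_lines if d and not d.startswith("@")), ""
            )
            pairs.append({"name": func_match.group(1), "description": description})
        # Any non-Roxygen line ends the current documentation block.
        doc_lines = []
    return pairs


SAMPLE = """#' Add two numbers
#'
#' @param x A number.
#' @param y A number.
#' @return The sum.
add_numbers <- function(x, y) {
  x + y
}

helper <- function(z) z
"""

print(extract_pairs(SAMPLE))
```

In this sketch, undocumented functions such as `helper` are skipped, since they have no natural language description to pair with. A parser-based approach would handle cases the regular expressions miss, such as multi-line definitions or functions assigned inside other expressions.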