Published on Tue Jan 11 2022
Automatic Detection and Analysis of Technical Debts in Peer-Review Documentation of R Packages
Technical debt (TD) is a metaphor for code-related problems that arise as a
result of prioritizing speedy delivery over perfect code. Because reducing TDs
can have a long-term positive impact on the software development life cycle
(SDLC), TDs are studied extensively in the literature. However, very little
existing research has focused on technical debt in the R programming
language, despite its popularity and usage. Recent research by Codabux et al.
[21] found, by analyzing peer-review documentation, that R packages can exhibit
10 diverse TD types. However, those findings are based on a manual analysis of a
small sample of R package review comments. In this paper, we develop a suite of
Machine Learning (ML) classifiers to detect the 10 TDs automatically. The best
performing classifier is based on the deep learning model BERT, which achieves
F1-scores of 0.71 to 0.91. We then apply the trained BERT models to all
available peer-review issue comments from two platforms, rOpenSci and
BioConductor (13.5K review comments coming from a total of 1297 R packages). We
conduct an empirical study on the prevalence and evolution of the 10 TDs in the two
R platforms. We find that documentation debt is the most prevalent of all
TD types and that it is also growing rapidly. We also find that R packages on the
generic platform (i.e., rOpenSci) are more prone to TD than those on the
domain-specific platform (i.e., BioConductor). Our empirical findings can
guide future improvements in R package documentation. Our ML
models can be used to automatically monitor the prevalence and evolution of TDs
in R package documentation.
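
To make the two quantities reported above concrete, the following is a minimal sketch of how a per-type F1-score and the prevalence of each TD type could be computed from classifier output. It is not the authors' implementation: it assumes one label per review comment, and the label names and the helper functions `f1_per_label` and `prevalence` are hypothetical.

```python
from collections import Counter


def f1_per_label(y_true, y_pred):
    """Return {label: F1}, scoring each label one-vs-rest.

    Assumption: each review comment carries exactly one TD label.
    """
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def prevalence(predicted_labels):
    """Return the share of comments assigned to each predicted TD type."""
    counts = Counter(predicted_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical labels, for illustration only.
y_true = ["documentation", "documentation", "test", "code"]
y_pred = ["documentation", "test", "test", "code"]

f1_scores = f1_per_label(y_true, y_pred)
shares = prevalence(y_true)
```

Tracking `prevalence` over comments grouped by review date would give a simple view of how each TD type evolves over time.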