Objective of the CTVsuggest Package
The CRAN Task Views are maintained by volunteers, and always welcome contributions of additional content from members of the community. Given the huge number of packages on CRAN, it is infeasible for a Task View maintainer to review all of the available packages. A model that produces suggestions for the maintainer to review would therefore be useful.
The aim of CTVsuggest is to give suggestions for packages to be added to CRAN Task Views. There is a single function, CTVsuggest(), that outputs these suggestions.
CTVsuggest Example
First install the CTVsuggest package:
library(devtools)
install_github("DylanDijk/CTVsuggest")
Then attach the package:
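library(CTVsuggest)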
To output the top 5 suggested packages for the NaturalLanguageProcessing Task View:
CTVsuggest(n = 5, taskview = "NaturalLanguageProcessing")
#>                NaturalLanguageProcessing       Packages
#> LSX                            0.9946226            LSX
#> doc2vec                        0.9940239        doc2vec
#> jiebaRD                        0.9892844        jiebaRD
#> morestopwords                  0.9802875  morestopwords
#> text.alignment                 0.9770789 text.alignment
The Package Workflow
Alongside the CTVsuggest package, there is the CTVsuggestTrain package. CTVsuggestTrain contains functions that execute the training of the model. In particular, CTVsuggestTrain::Train_model() trains a multinomial logistic regression model, where the outcome categories are the available CRAN Task Views plus an additional "none" category. In addition, after training the model, Train_model() returns a data.frame containing the predicted classification probabilities of CRAN packages to CRAN Task Views.
The current workflow is that I run the CTVsuggestTrain::Train_model() function weekly, each time storing the data.frame of predicted probabilities on the CTVsuggestTrain GitHub repo. The CTVsuggest package then loads this data.frame to output suggestions for different Task Views, as shown in the example.
The Model
I now provide a high-level view of how the model is trained. The CTVsuggestTrain section then gives an overview of the CTVsuggestTrain package, which executes the model training; that section also contains links to the functions' documentation and source code.
For a more comprehensive description of the model-building process, see Section 4 of a long-form report.
Multinomial Logistic Regression
The model aims to give a Task View suggestion when given the set of features of an unassigned CRAN R package. There exist packages that are assigned to multiple Task Views, but I have looked at classifying packages to a single Task View. I have therefore set this up as a multi-class classification problem rather than a multi-label problem.
However, it would not be reasonable to assign a Task View to every package. This would depart from the objective of Task Views, which is to give a sharp focus on the packages that are needed for a task¹. For this reason, the model also has the possibility of assigning a package to no Task View. Therefore, I have included an additional "none" outcome category, and hence the number of possible labels in this classification problem is the number of Task Views + 1.
To perform this multi-class classification I have used a multinomial logistic model with a LASSO penalty, trained using the glmnet::cv.glmnet() function. The cv.glmnet() function performs cross-validation over a grid of lambda values, providing a measure of performance for each lambda value. I then select the model with the lowest average multinomial deviance across the folds (lambda.min).
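The sketch below illustrates this fitting step on simulated stand-in data; the object names and toy features are my own, and in CTVsuggestTrain the columns of x would instead be the features described in the next section.

library(glmnet)

# Toy stand-in data: 300 "packages" with 10 random features, and a
# label that is either a Task View or "none"
set.seed(1)
x <- matrix(rnorm(300 * 10), nrow = 300,
            dimnames = list(paste0("pkg", 1:300), paste0("feature", 1:10)))
y <- factor(sample(c("Econometrics", "NaturalLanguageProcessing", "none"),
                   300, replace = TRUE))

# Cross-validated multinomial logistic regression with a LASSO penalty
cvfit <- glmnet::cv.glmnet(x, y,
                           family = "multinomial",
                           type.measure = "deviance",  # multinomial deviance
                           alpha = 1)                  # alpha = 1 => LASSO

# Predicted classification probabilities at lambda.min
probs <- predict(cvfit, newx = x, s = "lambda.min", type = "response")[, , 1]
head(probs)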
Features
In order to construct the multinomial logistic model, I need to create a set of features for each of the CRAN packages.
I have used three different types of data to create the features for this model (letting \(n\) denote the number of currently available Task Views):
- Text data from Package and Task View descriptions. \(n\) features
- Package dependencies. \(n + 1\) features
- Other Packages developed by the Authors. \(n + 1\) features
I will now go into more detail on how I constructed features from this data.
Text data from Package and Task View descriptions
- Each Task View has a page with text describing its scope
- Example: Econometrics Task View description
- Each package on CRAN has a title and description describing the
purpose of the package
- Example: fxregime package web page
I extract the Task View text data from the corresponding markdown files, which are stored in the GitHub repository of the corresponding Task View. I also extract the titles and descriptions of each of the packages. I then vectorize the texts using the TF-IDF method, and compute the cosine similarity of each vectorized package text to each of the vectorized Task View texts to create \(n\) features.
View Section 3.3.2 for a detailed description of the text vectorization.
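As an illustration, here is a minimal sketch of a TF-IDF and cosine-similarity computation of this kind, using the text2vec package with invented toy texts; it is not necessarily the exact implementation used in CTVsuggestTrain.

library(text2vec)

# Invented toy texts standing in for one package description and two
# Task View scope descriptions
texts <- c(
  fxregime     = "exchange rate regimes structural breaks time series",
  Econometrics = "econometric modelling regression time series panels",
  NLP          = "text mining corpora tokenization natural language"
)

# TF-IDF vectorization of all texts over a shared vocabulary
it    <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer)
dtm   <- create_dtm(it, vocab_vectorizer(create_vocabulary(it)))
tfidf <- fit_transform(dtm, TfIdf$new())

# Cosine similarity of the package text to each Task View text:
# one feature per Task View
sim2(tfidf[1, , drop = FALSE], tfidf[-1, ], method = "cosine")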
Package Dependencies
For each package, I look at the immediate hard dependencies² on other packages, and then calculate the proportion of these dependencies that are assigned to each Task View (or to none). This creates \(n + 1\) features for each CRAN R package.
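A minimal sketch of this computation, with invented objects: ctv_of maps package names to their Task View assignment, and deps holds one package's hard dependencies.

# Invented lookup of package -> Task View assignment ("none" if unassigned)
ctv_of <- c(zoo = "Econometrics", sandwich = "Econometrics",
            tm = "NaturalLanguageProcessing", Rcpp = "none")
taskviews <- c("Econometrics", "NaturalLanguageProcessing", "none")

# Immediate hard dependencies; in practice these could be obtained with,
# e.g., tools::package_dependencies("fxregime", which = "strong")
deps <- c("zoo", "sandwich", "Rcpp")

# Proportion of the dependencies assigned to each Task View plus "none",
# giving n + 1 features
prop.table(table(factor(ctv_of[deps], levels = taskviews)))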
Training Data
To train and test the model, I used packages that were assigned a Task View, together with a set of packages that I labelled as not belonging to a Task View.
The large proportion of packages not belonging to a Task View will not have been reviewed by Task View maintainers, and so would not be representative of packages belonging to the "none" category. Therefore, to choose the packages to label as belonging to the "none" category, I select unassigned packages that have a high number of monthly downloads.
I then split this set of packages into training and testing sets with an 80:20 ratio.
View Section 4.3.1 for a detailed description of how the training and testing sets are constructed.
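A minimal sketch of such a split, using an invented data.frame labelled standing in for the labelled packages and their features:

# Toy stand-in for the labelled packages: features plus a Task View label
set.seed(1)
labelled <- data.frame(sim = runif(100), dep = runif(100),
                       taskview = sample(c("Econometrics", "none"),
                                         100, replace = TRUE))

# 80:20 split into training and testing sets
train_idx <- sample(nrow(labelled), size = floor(0.8 * nrow(labelled)))
train <- labelled[train_idx, ]
test  <- labelled[-train_idx, ]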
Predictions
Using the trained model, I can get predictions by looking at the predicted classification probabilities given the set of features of a package. I output the predicted classification probabilities for the packages that are not assigned a Task View and that do not meet the monthly-download threshold mentioned in the previous section.
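For example, with a probability matrix such as probs from the glmnet sketch above (one row per package, one column per outcome category), the top suggestions for a single Task View can be ranked as follows; this mirrors the output of CTVsuggest() but is not its actual source code.

# Rank packages by their predicted probability for one Task View
tv <- "NaturalLanguageProcessing"
head(sort(probs[, tv], decreasing = TRUE), n = 5)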
Model Performance
The accuracy of the model can then be measured by comparing the trained model's predictions against the testing set.
In particular, I take the predicted Task View to be the one with the largest predicted probability, and measure accuracy as the proportion of correct predictions on the testing set. For packages that are assigned to multiple Task Views, I count a prediction as correct if it is one of the assigned Task Views.
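A minimal sketch of this accuracy computation, with invented objects: test_probs holds the predicted probabilities for the testing set, and assigned lists each test package's assigned Task Views.

# Invented predicted probabilities for two test packages
test_probs <- matrix(c(0.7, 0.2, 0.1,
                       0.1, 0.6, 0.3),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("pkgA", "pkgB"),
                                     c("Econometrics", "NLP", "none")))

# Invented ground truth; pkgA is assigned to two Task Views
assigned <- list(pkgA = c("Econometrics", "Finance"), pkgB = "none")

# Predicted Task View = the category with the largest probability
predicted <- colnames(test_probs)[max.col(test_probs)]

# A prediction is correct if it is among the package's assigned Task Views
correct <- mapply(function(pred, truth) pred %in% truth, predicted, assigned)
mean(correct)  # proportion of correct predictions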
The current (2024-01-30) accuracy is 80.77%.
CTVsuggestTrain
The CTVsuggestTrain package contains all of the code used for training the model.
To install the CTVsuggestTrain package:
library(devtools)
install_github("DylanDijk/CTVsuggestTrain")
As mentioned above, the training of the model is performed with the CTVsuggestTrain::Train_model() function. This function uses four internal functions of the CTVsuggestTrain package, which are run in order.
For a description of the steps taken by these functions, see the details sections of their documentation.