Objective of the CTVsuggest Package
The CRAN Task Views are maintained by volunteers, and always welcome contributions of additional content from members of the community. Given the huge number of packages on CRAN, it is infeasible for a Task View maintainer to review all of the available packages. A model that produces suggestions for the maintainer to review would therefore be useful.
The aim of CTVsuggest is to give suggestions for packages to be added to CRAN Task Views. There is a single function, CTVsuggest(), that outputs these suggestions.
CTVsuggest Example
First install the CTVsuggest package:
library(devtools)
install_github("DylanDijk/CTVsuggest")
Then attach the package:
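library(CTVsuggest)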
To output the top 5 suggested packages for the NaturalLanguageProcessing Task View:
CTVsuggest(n = 5, taskview = "NaturalLanguageProcessing")
#>                NaturalLanguageProcessing       Packages
#> LSX                            0.9946226            LSX
#> doc2vec                        0.9940239        doc2vec
#> jiebaRD                        0.9892844        jiebaRD
#> morestopwords                  0.9802875  morestopwords
#> text.alignment                 0.9770789 text.alignment
The Package Workflow
Alongside the CTVsuggest package, there is the CTVsuggestTrain package. CTVsuggestTrain contains functions that execute the training of the model. In particular, CTVsuggestTrain::Train_model() trains a multinomial logistic regression model, where the outcome categories are the available CRAN Task Views plus an additional "none" category. In addition, after training the model, Train_model() returns a data.frame containing the predicted classification probabilities of CRAN packages to CRAN Task Views.
The current workflow is that I run the CTVsuggestTrain::Train_model() function weekly, each time storing the data.frame of predicted probabilities on the CTVsuggestTrain GitHub repo. The CTVsuggest package then loads this data.frame to output suggestions for different Task Views, as shown in the example.
The Model
I now provide a high-level view of how the model is trained. The CTVsuggestTrain section then gives an overview of the CTVsuggestTrain package, which executes the model training; that section also contains links to the functions' documentation and source code.
For a more comprehensive description of the model-building process, see Section 4 of a long-form report.
Multinomial Logistic Regression
The model aims to give a Task View suggestion when given the set of features of an unassigned CRAN R package. There exist packages that are assigned to multiple Task Views, but I have looked at classifying packages to a single Task View. I have therefore set this up as a multi-class classification problem rather than a multi-label problem.
However, it would not be reasonable to assign a Task View to every package. This would depart from the objective of Task Views, which is to give a sharp focus on the packages that are needed for a task¹. For this reason, the model also has the possibility of assigning a package to no Task View. Therefore, I have included an additional "none" outcome category, and hence the number of possible labels in this classification problem is the number of Task Views + 1.
To perform this multi-class classification I have used a multinomial logistic model with a LASSO penalty, trained using the glmnet::cv.glmnet() function. The cv.glmnet() function performs cross-validation over a grid of lambda values, providing a measure of performance for each lambda value. I then select the model with the lowest average multinomial deviance across the folds (lambda.min).
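The sketch below illustrates this fitting step on simulated stand-in data; the object names and toy features are my own, and in CTVsuggestTrain the columns of x would instead be the features described in the next section.

library(glmnet)

# Toy stand-in data: 300 "packages" with 10 random features, and a
# label that is either a Task View or "none"
set.seed(1)
x <- matrix(rnorm(300 * 10), nrow = 300,
            dimnames = list(paste0("pkg", 1:300), paste0("feature", 1:10)))
y <- factor(sample(c("Econometrics", "NaturalLanguageProcessing", "none"),
                   300, replace = TRUE))

# Cross-validated multinomial logistic regression with a LASSO penalty
cvfit <- glmnet::cv.glmnet(x, y,
                           family = "multinomial",
                           type.measure = "deviance",  # multinomial deviance
                           alpha = 1)                  # alpha = 1 => LASSO

# Predicted classification probabilities at lambda.min
probs <- predict(cvfit, newx = x, s = "lambda.min", type = "response")[, , 1]
head(probs)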
Features
In order to construct the multinomial logistic model, I need to create a set of features for each of the CRAN packages.
I have used three different types of data to create the features for this model (letting \(n\) denote the number of currently available Task Views):
- Text data from Package and Task View descriptions. \(n\) features
- Package dependencies. \(n + 1\) features
- Other Packages developed by the Authors. \(n + 1\) features
I will now go into more detail on how I constructed features from this data.
Text data from Package and Task View descriptions
- Each Task View has a page with text describing its scope
- Example: Econometrics Task View description
- Each package on CRAN has a title and description describing the
purpose of the package
- Example: fxregime package web page
I extract the Task View text data from the corresponding markdown files, which are stored in the GitHub repository of the corresponding Task View. I also extract the titles and descriptions of each of the packages. I then vectorize the texts using the TF-IDF method, and compute the cosine similarity of each vectorized package text to each of the vectorized Task View texts to create \(n\) features.
View Section 3.3.2 for a detailed description of the text vectorization.
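As an illustration, here is a minimal sketch of a TF-IDF and cosine-similarity computation of this kind, using the text2vec package with invented toy texts; it is not necessarily the exact implementation used in CTVsuggestTrain.

library(text2vec)

# Invented toy texts standing in for one package description and two
# Task View scope descriptions
texts <- c(
  fxregime     = "exchange rate regimes structural breaks time series",
  Econometrics = "econometric modelling regression time series panels",
  NLP          = "text mining corpora tokenization natural language"
)

# TF-IDF vectorization of all texts over a shared vocabulary
it    <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer)
dtm   <- create_dtm(it, vocab_vectorizer(create_vocabulary(it)))
tfidf <- fit_transform(dtm, TfIdf$new())

# Cosine similarity of the package text to each Task View text:
# one feature per Task View
sim2(tfidf[1, , drop = FALSE], tfidf[-1, ], method = "cosine")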
Package Dependencies
For each package, I look at the immediate hard dependencies² on other packages, and then calculate the proportion of these dependencies that are assigned to each Task View (or to none). This creates \(n + 1\) features for each CRAN R package.
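A minimal sketch of this computation, with invented objects: ctv_of maps package names to their Task View assignment, and deps holds one package's hard dependencies.

# Invented lookup of package -> Task View assignment ("none" if unassigned)
ctv_of <- c(zoo = "Econometrics", sandwich = "Econometrics",
            tm = "NaturalLanguageProcessing", Rcpp = "none")
taskviews <- c("Econometrics", "NaturalLanguageProcessing", "none")

# Immediate hard dependencies; in practice these could be obtained with,
# e.g., tools::package_dependencies("fxregime", which = "strong")
deps <- c("zoo", "sandwich", "Rcpp")

# Proportion of the dependencies assigned to each Task View plus "none",
# giving n + 1 features
prop.table(table(factor(ctv_of[deps], levels = taskviews)))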
Training Data
To train and test the model, I used packages that were assigned a Task View, together with a set of packages that I labelled as not belonging to a Task View.
The large proportion of packages not belonging to a Task View will not have been reviewed by Task View maintainers, and so would not be representative of packages belonging to the "none" category. Therefore, to choose the packages to label as belonging to the "none" category, I select unassigned packages that have a high number of monthly downloads.
I then split this set of packages into training and testing sets with an 80:20 ratio.
View Section 4.3.1 for a detailed description of how the training and testing sets are constructed.
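A minimal sketch of such a split, using an invented data.frame labelled standing in for the labelled packages and their features:

# Toy stand-in for the labelled packages: features plus a Task View label
set.seed(1)
labelled <- data.frame(sim = runif(100), dep = runif(100),
                       taskview = sample(c("Econometrics", "none"),
                                         100, replace = TRUE))

# 80:20 split into training and testing sets
train_idx <- sample(nrow(labelled), size = floor(0.8 * nrow(labelled)))
train <- labelled[train_idx, ]
test  <- labelled[-train_idx, ]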
Predictions
Using the trained model, I can get predictions by looking at the predicted classification probabilities given the set of features of a package. I output the predicted classification probabilities for the packages that are not assigned a Task View and that do not meet the monthly-download threshold mentioned in the previous section.
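For example, with a probability matrix such as probs from the glmnet sketch above (one row per package, one column per outcome category), the top suggestions for a single Task View can be ranked as follows; this mirrors the output of CTVsuggest() but is not its actual source code.

# Rank packages by their predicted probability for one Task View
tv <- "NaturalLanguageProcessing"
head(sort(probs[, tv], decreasing = TRUE), n = 5)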
Model Performance
The accuracy of the model can then be measured by comparing the trained model's predictions against the testing set.
In particular, I take the predicted Task View to be the one with the largest predicted probability, and measure accuracy as the proportion of correct predictions on the testing set. For packages that are assigned to multiple Task Views, I count a prediction as correct if it is one of the assigned Task Views.
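A minimal sketch of this accuracy computation, with invented objects: test_probs holds the predicted probabilities for the testing set, and assigned lists each test package's assigned Task Views.

# Invented predicted probabilities for two test packages
test_probs <- matrix(c(0.7, 0.2, 0.1,
                       0.1, 0.6, 0.3),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("pkgA", "pkgB"),
                                     c("Econometrics", "NLP", "none")))

# Invented ground truth; pkgA is assigned to two Task Views
assigned <- list(pkgA = c("Econometrics", "Finance"), pkgB = "none")

# Predicted Task View = the category with the largest probability
predicted <- colnames(test_probs)[max.col(test_probs)]

# A prediction is correct if it is among the package's assigned Task Views
correct <- mapply(function(pred, truth) pred %in% truth, predicted, assigned)
mean(correct)  # proportion of correct predictions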
The current (2024-01-30) accuracy is 80.77%.
CTVsuggestTrain
The CTVsuggestTrain package contains all of the code used for training the model.
To install the CTVsuggestTrain package:
library(devtools)
install_github("DylanDijk/CTVsuggestTrain")
As mentioned above, the training of the model is performed with the CTVsuggestTrain::Train_model() function. This function uses four internal functions of the CTVsuggestTrain package, which are run in order.
For a description of the steps taken by these functions, see the details sections of their documentation.