Train_model() trains a multinomial logistic regression model with a LASSO penalty, where the outcome categories are the current CRAN Task Views and an additional "None" category.
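For orientation, the textbook form of an L1-penalised multinomial logistic regression can be written as below. This is the standard formulation, not one taken from the package source: the symbols (N packages, K outcome categories, feature vector x_i, indicator y_ik, penalty weight lambda) are notation assumed here for illustration.

```latex
\min_{\{\beta_{0k},\,\beta_k\}_{k=1}^{K}}
  -\frac{1}{N}\sum_{i=1}^{N}\left[
      \sum_{k=1}^{K} y_{ik}\left(\beta_{0k} + x_i^{\top}\beta_k\right)
      - \log\sum_{\ell=1}^{K} \exp\left(\beta_{0\ell} + x_i^{\top}\beta_\ell\right)
  \right]
  + \lambda \sum_{k=1}^{K} \lVert \beta_k \rVert_1
```

The first term is the average multinomial negative log-likelihood; the second is the LASSO penalty, which shrinks coefficients and sets some of them exactly to zero.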

Usage

Train_model(
  TEST = FALSE,
  limiting_n_observations = 100,
  get_input_stored = FALSE,
  get_input_path =
    "tests/testthat/fixtures/get_CRAN_logs_output/get_CRAN_logs_output.rds",
  save_output = FALSE,
  save_path = "OUTPUT/"
)

Arguments

TEST

logical. Default is FALSE. If TRUE, only a subset of the data extracted from CRAN is used, which speeds up testing.

More precisely, if TRUE, a random sample of rows is drawn from CRAN_data, with the number of rows given by limiting_n_observations.

limiting_n_observations

integer. Default is 100. Sets the number of rows in the subset of CRAN_data when TEST is TRUE.
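The row subsetting described for TEST and limiting_n_observations could look roughly like the sketch below. It is not the package's actual code: the toy CRAN_data data frame, its columns, and the use of sample() are assumptions made for illustration.

```r
# Hypothetical stand-in for the data extracted from CRAN.
CRAN_data <- data.frame(
  package = paste0("pkg", 1:500),
  downloads = seq_len(500)
)

TEST <- TRUE
limiting_n_observations <- 100

if (TEST) {
  set.seed(1)  # only so that this sketch is reproducible
  # Draw row indices without replacement; never ask for more rows than exist.
  keep <- sample(nrow(CRAN_data),
                 size = min(limiting_n_observations, nrow(CRAN_data)))
  CRAN_data <- CRAN_data[keep, , drop = FALSE]
}

nrow(CRAN_data)
```

Capping the sample size with min() avoids an error when limiting_n_observations exceeds the number of available rows.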

get_input_stored

logical. Default is FALSE. If TRUE, the function uses pre-saved data as input; otherwise it runs the internal CTVsuggestTrain function get_data().

get_input_path

string. If get_input_stored is TRUE, get_input_path gives the path to the pre-saved data.

save_output

logical. Default is FALSE. If TRUE, then the list that is returned is saved to the path set by save_path.

save_path

string. Default is "OUTPUT/". Sets the path where the list returned by the function is saved. Only used when save_output is TRUE.
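As a rough illustration of how a returned list can be written to disk and read back, the sketch below does a round trip through an .rds file. The default get_input_path ends in .rds, so the stored input is presumably in RDS format; that the output is written with saveRDS(), and the file name used here, are assumptions. The list contents are placeholders, not real results.

```r
# Placeholder list with the three element names from the "Value" section.
result <- list(
  predicted_probs_for_suggestions = data.frame(),
  model = "placeholder for the fitted model object",
  model_accuracy = NA_real_
)

# Write into a temporary directory so the sketch leaves no files behind.
save_path <- file.path(tempdir(), "OUTPUT")
dir.create(save_path, showWarnings = FALSE, recursive = TRUE)
out_file <- file.path(save_path, "train_model_output.rds")  # hypothetical name

saveRDS(result, out_file)

# Reading the file back mirrors what get_input_stored = TRUE does for input.
restored <- readRDS(out_file)
names(restored)
```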

Value

Returns a list with the following elements:

  • predicted_probs_for_suggestions - data.frame in which each row is the predicted probability vector for one CRAN package that is not assigned to a Task View and does not meet the monthly download threshold. It is created by calling predict() on the model object.

  • model - Model object

  • model_accuracy - The accuracy of the model on a test set, expressed as a percentage.
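To show how a table of predicted probabilities relates to an accuracy percentage, here is a small base R sketch. The probabilities, package names and true labels are invented for illustration, and taking the highest-probability category as the prediction is an assumption about how accuracy is computed, not something stated above.

```r
# Made-up predicted probabilities: one row per package, one column per category.
probs <- data.frame(
  Bayesian = c(0.7, 0.1, 0.2),
  Spatial  = c(0.2, 0.3, 0.1),
  None     = c(0.1, 0.6, 0.7),
  row.names = c("pkgA", "pkgB", "pkgC")
)

# Made-up true categories for the same three packages.
truth <- c("Bayesian", "None", "Spatial")

# Predicted category = column with the largest probability in each row.
predicted <- colnames(probs)[max.col(as.matrix(probs), ties.method = "first")]

# Share of correct predictions, as a percentage.
accuracy_pct <- 100 * mean(predicted == truth)
accuracy_pct
```

Here two of the three toy packages are classified correctly, so the accuracy is about 66.7 percent.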

Details

The Train_model() function relies on four internal functions, including get_data(), get_create_features() and get_CRAN_logs().

These four internal functions are nested, each one starting by calling the next. For example, Train_model() begins by running get_CRAN_logs(), which in turn begins by running get_create_features(). The entire pipeline therefore starts with get_data().

The Train_model() function itself, after running get_CRAN_logs(), carries out the model training using the response matrix and feature matrix that were constructed with the four internal functions mentioned above.
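The training step can be pictured with the sketch below, which fits a LASSO-penalised multinomial model to toy data. It assumes the glmnet package, which this page does not name as the engine behind Train_model(); the helper fit_lasso_multinomial(), the random data and the choice of lambda.min are likewise illustrative.

```r
# Hypothetical helper: cross-validated multinomial regression with an L1 penalty.
fit_lasso_multinomial <- function(x, y) {
  if (!requireNamespace("glmnet", quietly = TRUE)) {
    stop("This sketch assumes the glmnet package is installed.")
  }
  # alpha = 1 selects the pure LASSO penalty; lambda is tuned by cross-validation.
  glmnet::cv.glmnet(x, y, family = "multinomial", alpha = 1)
}

# Toy feature matrix and response (random, so the fit itself is meaningless).
set.seed(1)
x <- matrix(rnorm(150 * 5), nrow = 150, ncol = 5)
y <- factor(sample(c("Bayesian", "Spatial", "None"), 150, replace = TRUE))

if (requireNamespace("glmnet", quietly = TRUE)) {
  fit <- fit_lasso_multinomial(x, y)
  # type = "response" returns class probabilities; drop the lambda dimension.
  probs <- predict(fit, newx = x, s = "lambda.min", type = "response")[, , 1]
  head(probs)
}
```

glmnet accepts the multinomial response either as a factor, as here, or as a matrix with one column per category, which is closer to the response matrix mentioned above.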

Examples

if (FALSE) {
Train_model(save_output = TRUE, save_path = "OUTPUT/")
}