Extracts data used to create features for the model

Usage

get_data(
  TEST = FALSE,
  limiting_n_observations = 100,
  save_output = FALSE,
  save_path = "tests/testthat/fixtures/get_data_output",
  file_name
)

Arguments

TEST

logical. Default is FALSE. If TRUE, then a subset of the data that is extracted from CRAN is selected. This is to speed up testing.

More precisely, if TRUE, a random sample of rows is drawn from CRAN_data, with the number of rows given by limiting_n_observations.

limiting_n_observations

integer. Default is 100. Sets the number of rows in the subset of CRAN_data when TEST is TRUE.

save_output

logical. Default is FALSE. If TRUE, then the list that is returned is saved to the path set by save_path.

save_path

string. Sets the path where the list created by the function is saved. Only used when save_output is TRUE.

file_name

string. Sets the file name for the saved object.
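
The interaction between TEST and limiting_n_observations can be illustrated with a minimal sketch. It assumes the subset is drawn with base R's sample(); the toy CRAN_data below is a made-up stand-in for the real data, and the guard against requesting more rows than exist is an addition for illustration, not documented behaviour of get_data().

```r
# Toy stand-in for CRAN_data (assumption: the real object is the data.frame
# returned by tools::CRAN_package_db(), which has many more columns).
CRAN_data <- data.frame(
  Package = paste0("pkg", seq_len(500)),
  Version = "1.0.0",
  stringsAsFactors = FALSE
)

TEST <- TRUE
limiting_n_observations <- 100

if (TEST) {
  # Never ask for more rows than are available (illustrative guard).
  n_keep <- min(limiting_n_observations, nrow(CRAN_data))
  # sample() draws row indices without replacement by default.
  keep <- sample(nrow(CRAN_data), size = n_keep)
  CRAN_data <- CRAN_data[keep, , drop = FALSE]
}

nrow(CRAN_data)
```

Because the rows are chosen at random, two runs generally return different subsets unless a seed is set beforehand.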

Value

get_data returns a list of the data objects required by the rest of the scripts involved in training the model:

  • CRAN_data - Data extracted from the CRAN package repository using tools::CRAN_package_db(). Duplicated packages are removed. If TEST = TRUE, a random selection of limiting_n_observations rows of CRAN_data is returned instead of the full data.

  • all_CRAN_pks - Package names that have data included in the CRAN_data object.

  • CRAN_cranly_data - data.frame with class cranly_db that is created using cranly::clean_CRAN_db(). The function cranly::clean_CRAN_db() cleans the data.frame generated by tools::CRAN_package_db(); the result has the same variables as CRAN_data.

  • tvdb - list object of class ctvlist that contains information about the Task Views. This is downloaded using the function CTVsuggest:::download_taskview_data(), which is a modified version of RWsearch::tvdb_down().

  • TEST - the TEST value used in the function call. Because this function is used within get_NLP(), information about whether a subset of the full data is being used needs to be carried forward.
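
As a rough illustration of how this return value might be consumed, the sketch below builds a miniature list with the element names listed above. The contents are invented, and the variables data_scope and names_match are hypothetical names used only for this example.

```r
# Hypothetical miniature of the list returned by get_data(); the element names
# come from the Value section, the contents are made up.
out <- list(
  CRAN_data        = data.frame(Package = c("pkgA", "pkgB"),
                                stringsAsFactors = FALSE),
  all_CRAN_pks     = c("pkgA", "pkgB"),
  CRAN_cranly_data = data.frame(Package = c("pkgA", "pkgB"),
                                stringsAsFactors = FALSE),
  tvdb             = list(),
  TEST             = TRUE
)

# Downstream code can branch on the carried-forward TEST flag.
data_scope <- if (isTRUE(out$TEST)) "subset" else "full"

# The package names should correspond to the rows present in CRAN_data.
names_match <- setequal(out$all_CRAN_pks, out$CRAN_data$Package)
```

Carrying TEST inside the list means a caller such as get_NLP() does not need a separate argument to know whether it is working on a subset.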

Details

The get_data() function is run inside get_NLP().

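get_data() extracts the following types of data: CRAN package data, using tools::CRAN_package_db(), and Task View data, using CTVsuggest:::download_taskview_data().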

get_data() then runs cranly::clean_CRAN_db() on the data extracted from the CRAN package repository.
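
The CRAN side of this extraction could be sketched as follows. This is not the package's actual implementation: drop_duplicated_packages() is a hypothetical helper, identifying duplicates by the Package column is an assumption, and whether cranly::clean_CRAN_db() receives the de-duplicated or the raw data inside get_data() is not stated here. The real calls need network access, so they are wrapped in if (FALSE).

```r
# Hypothetical helper: drop repeated package entries, keeping the first row
# for each name. Assumption: duplicates are identified by the Package column.
drop_duplicated_packages <- function(db) {
  db[!duplicated(db$Package), , drop = FALSE]
}

# Toy data with one repeated package.
toy_db <- data.frame(
  Package = c("pkgA", "pkgB", "pkgA"),
  Version = c("1.0", "2.1", "1.0"),
  stringsAsFactors = FALSE
)
toy_clean <- drop_duplicated_packages(toy_db)

# With the real data (requires network access and the cranly package).
if (FALSE) {
  CRAN_data <- drop_duplicated_packages(tools::CRAN_package_db())
  all_CRAN_pks <- CRAN_data$Package
  CRAN_cranly_data <- cranly::clean_CRAN_db(CRAN_data)
}
```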

Examples

if (FALSE) {
CTVsuggestTrain:::get_data(TEST = TRUE, limiting_n_observations = 100)
}
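
The save_output, save_path and file_name arguments might work together roughly as in the sketch below. The helper save_output_sketch() is hypothetical and not part of CTVsuggestTrain; treating save_path as a directory and using saveRDS() as the storage format are both assumptions.

```r
# Hypothetical helper; not part of CTVsuggestTrain.
save_output_sketch <- function(x, save_path, file_name) {
  # Create the target directory if it does not exist yet.
  dir.create(save_path, recursive = TRUE, showWarnings = FALSE)
  out_file <- file.path(save_path, file_name)
  saveRDS(x, out_file)  # assumption: .rds is the storage format
  invisible(out_file)
}

# Made-up stand-in for the list returned by get_data().
toy_output <- list(all_CRAN_pks = c("pkgA", "pkgB"), TEST = TRUE)

saved_to <- save_output_sketch(
  toy_output,
  save_path = file.path(tempdir(), "get_data_output"),
  file_name = "toy_output.rds"
)

# Reading the file back recovers the same list.
restored <- readRDS(saved_to)
```

Writing to tempdir() keeps the example from touching the package's test fixtures.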