Extracts the data used to create features for the model
Usage
get_data(
  TEST = FALSE,
  limiting_n_observations = 100,
  save_output = FALSE,
  save_path = "tests/testthat/fixtures/get_data_output",
  file_name
)
Arguments
- TEST
logical. Default is FALSE. If TRUE, then a subset of the data extracted from CRAN is selected, to speed up testing. More precisely, if TRUE, a random selection of rows from CRAN_data is selected, where the number of rows chosen is given by limiting_n_observations.
- limiting_n_observations
Integer that sets the size of the subset of CRAN_data when TEST is TRUE.
- save_output
logical. Default is FALSE. If TRUE, then the list that is returned is saved to the path set by save_path.
- save_path
string. Sets the path where the list created by the function is saved when save_output is set to TRUE.
- file_name
string. Sets the file name for the saved object.
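The way TEST and limiting_n_observations interact can be sketched on a toy data frame. This is an illustration only, not the package's code: subset_for_test() and toy_CRAN_data are hypothetical names, and writing the result with saveRDS() is an assumption, since the format of the saved object is not stated here.

```r
# Hypothetical helper: not part of the package, shown only to illustrate
# how TEST and limiting_n_observations could interact.
subset_for_test <- function(CRAN_data, TEST = FALSE,
                            limiting_n_observations = 100) {
  if (!TEST) {
    return(CRAN_data)
  }
  # Never ask for more rows than are available.
  n <- min(limiting_n_observations, nrow(CRAN_data))
  CRAN_data[sample(seq_len(nrow(CRAN_data)), n), , drop = FALSE]
}

# Toy stand-in for the data frame returned by tools::CRAN_package_db().
toy_CRAN_data <- data.frame(
  Package = paste0("pkg", 1:500),
  Title = paste("Title of package", 1:500),
  stringsAsFactors = FALSE
)

set.seed(1)
small <- subset_for_test(toy_CRAN_data, TEST = TRUE,
                         limiting_n_observations = 100)
nrow(small)

# Assumption: the returned list is written with saveRDS(); the real
# function may use a different format.
out <- list(CRAN_data = small, TEST = TRUE)
out_file <- file.path(tempdir(), "get_data_output.rds")
saveRDS(out, out_file)
```

With TEST = FALSE the helper returns its input unchanged, which matches the documented default of working on the full data.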
Value
get_data returns the data objects required by the other scripts involved in training the model:
- CRAN_data: Data extracted from the CRAN package repository using tools::CRAN_package_db(). Duplicated packages are removed. If TEST = TRUE, then a random selection of limiting_n_observations rows of CRAN_data is selected.
- all_CRAN_pks: Package names that have data included in the CRAN_data object.
- CRAN_cranly_data: data.frame with class cranly_db that is created using cranly::clean_CRAN_db(). The function cranly::clean_CRAN_db() cleans the data.frame generated by tools::CRAN_package_db(); the result has the same variables as CRAN_data.
- tvdb: list object of class ctvlist that contains information about the Task Views. This is downloaded using the function CTVsuggest:::download_taskview_data(), which is a modified version of RWsearch::tvdb_down().
- TEST: The TEST value used in the function. Because this function is used within the get_NLP() function, information about whether a subset of the full data is being used needs to be carried forward.
Details
The get_data() function is run inside get_NLP().
get_data() extracts the following types of data:
- Task View data, using download_taskview_data().
- CRAN data from the CRAN package repository, using tools::CRAN_package_db().
get_data() then also runs the cranly::clean_CRAN_db() function on the CRAN data.
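The CRAN part of these steps can be sketched as follows. This is a rough outline under stated assumptions, not the package source: sketch_get_data() is a hypothetical name, the order of de-duplication, subsetting, and cleaning is assumed, and the Task View download is omitted because download_taskview_data() is internal to the package.

```r
# Sketch only: mirrors the steps described above, not the package source.
# sketch_get_data() is a hypothetical name.
sketch_get_data <- function(TEST = FALSE, limiting_n_observations = 100) {
  # Download the CRAN package database (needs internet access).
  CRAN_data <- tools::CRAN_package_db()

  # Drop duplicated packages, keeping the first row for each name.
  CRAN_data <- CRAN_data[!duplicated(CRAN_data$Package), , drop = FALSE]

  # Assumption: the random subset is taken before the cranly cleaning step.
  if (TEST) {
    n <- min(limiting_n_observations, nrow(CRAN_data))
    CRAN_data <- CRAN_data[sample(seq_len(nrow(CRAN_data)), n), , drop = FALSE]
  }

  list(
    CRAN_data = CRAN_data,
    all_CRAN_pks = CRAN_data$Package,
    CRAN_cranly_data = cranly::clean_CRAN_db(CRAN_data),
    TEST = TEST
  )
}

# Only run in an interactive session with the cranly package installed,
# because the call downloads data from CRAN.
if (interactive() && requireNamespace("cranly", quietly = TRUE)) {
  res <- sketch_get_data(TEST = TRUE, limiting_n_observations = 50)
  str(res, max.level = 1)
}
```

Carrying TEST in the returned list follows the Value section: the caller can tell whether it received the full data or a subset.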