get_NLP()
Creates the NLP features for the model using Task View text and package description text.
The function extracts the Task View text from the source markdown files on GitHub and cleans it.
From the corpus of words in the Task View text, it computes a TF-IDF weight for each word in each Task View, giving one TF-IDF vector per Task View.
It then computes the cosine similarity between the text in each package's title and description and the TF-IDF vector of each Task View.
This generates a set of features for each package, where the number of features equals the number of Task Views.
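For reference, the two quantities named above can be written out as follows. This is the common textbook form, given here as an assumption: the page does not state which TF-IDF variant (for example, smoothed or log-scaled) the function actually uses. The symbols are introduced only for this note: $t$ is a word, $v$ a Task View, $N$ the number of Task Views, $n_t$ the number of Task Views containing $t$, and $\mathbf{p}$ the TF-IDF vector of a package's text.

```latex
\begin{align}
  \mathrm{tf}_{t,v}  &= \frac{\text{count of } t \text{ in } v}{\text{total words in } v}, \\
  \mathrm{idf}_{t}   &= \log\frac{N}{n_t}, \\
  w_{t,v}            &= \mathrm{tf}_{t,v}\,\mathrm{idf}_{t}, \\
  \cos(\mathbf{p}, \mathbf{w}_v)
                     &= \frac{\mathbf{p}\cdot\mathbf{w}_v}
                             {\lVert\mathbf{p}\rVert\,\lVert\mathbf{w}_v\rVert}.
\end{align}
```

Because the IDF term is taken from the Task View corpus, the package vector $\mathbf{p}$ is weighted by the same $\mathrm{idf}_t$ values as the Task View vectors.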
Usage
get_NLP(
  TEST = FALSE,
  limiting_n_observations = 100,
  get_input_stored = FALSE,
  get_input_path = "tests/testthat/fixtures/get_data_output/get_data_output.rds",
  save_output = FALSE,
  save_path = "tests/testthat/fixtures/get_NLP_output",
  file_name
)
Arguments
- TEST
logical. Default is FALSE. If TRUE, a random selection of rows from CRAN_data (the data extracted from CRAN) is used, where the number of rows is given by limiting_n_observations. This is to speed up testing.
- limiting_n_observations
integer. Sets the size of the subset of CRAN_data when TEST is TRUE.
- get_input_stored
logical. If TRUE, the function uses pre-saved data as input; otherwise it runs the CTVsuggestTrain internal get_data() function.
- get_input_path
string. If get_input_stored is set to TRUE, get_input_path gives the path of the pre-saved data.
- save_output
logical. Default is FALSE. If TRUE, the list that is returned is saved to the path set by save_path.
- save_path
string. Sets the path where the list created by the function is saved when save_output is set to TRUE.
- file_name
string. Sets the file name for the saved object.
Value
Returns
feature_matrix_titles_descriptions_packages_cosine - list. Contains one element for each package: a vector whose length is the number of Task Views. Each entry of the vector is the cosine similarity between the TF-IDF vector of the corresponding Task View and the TF-IDF vector of the package text. The IDF term for the package text TF-IDF vector is computed from the Task View text corpus.
input_CRAN_data - list. Contains all of the data created by the CTVsuggest:::get_data function, so that it is carried forward.
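As an illustration of how the returned features might be consumed, the sketch below builds a mock object with the two documented element names and stacks the per-package vectors into a matrix. The package names, Task View names, and similarity values are invented, and it is an assumption that each package's vector is named by Task View; the real output may differ in these details.

```r
# Mock of the documented return structure (all values invented for illustration).
mock_result <- list(
  feature_matrix_titles_descriptions_packages_cosine = list(
    pkgA = c(Bayesian = 0.62, Spatial = 0.00),
    pkgB = c(Bayesian = 0.05, Spatial = 0.48)
  ),
  input_CRAN_data = list()
)

# Stack the per-package vectors into a packages x Task Views matrix.
feature_list <- mock_result$feature_matrix_titles_descriptions_packages_cosine
feature_matrix <- do.call(rbind, feature_list)

# Task View with the highest cosine similarity for each package.
best_task_view <- colnames(feature_matrix)[apply(feature_matrix, 1, which.max)]
names(best_task_view) <- rownames(feature_matrix)
```

A matrix with one row per package and one column per Task View is usually a convenient shape for passing to a downstream model.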
Details
The get_NLP() function is run inside get_create_features().
get_NLP() carries out the following steps:
First, the markdown files that generate the CRAN Task View description pages are imported. The text is then cleaned; for example, links are removed.
Using the text extracted for each Task View, a data frame is created that gives the count of each word in each Task View.
Using this data frame, we compute the TF-IDF weighting of each word. The result is a data frame with the same dimensions as the word-count data frame.
Next, we use code provided by Dirk Eddelbuettel that extracts the titles and descriptions of each of the packages on CRAN. The result is a matrix with a row for each package.
Then we create a list of data frames, one per package, that give the word counts in each package's text.
For each package, we take the cosine similarity between the package text and the TF-IDF vector of each Task View.
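The steps above can be sketched on toy data in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the texts are invented, the helper names (tokenise, count_words, tf, cosine) are hypothetical, matrices stand in for the data frames described above, and the unsmoothed log(N / n_t) form of IDF is assumed.

```r
# Toy Task View and package texts (invented; not real CRAN content).
task_view_text <- c(
  Bayesian = "bayesian inference with mcmc sampling and prior distributions",
  Spatial  = "spatial data analysis with maps and spatial regression"
)
package_text <- c(
  pkgA = "mcmc sampling for bayesian models",
  pkgB = "plotting maps of spatial data"
)

# Hypothetical helper: lower-case the text and split it into words.
tokenise <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}

# The vocabulary is fixed by the Task View corpus.
tv_tokens <- lapply(task_view_text, tokenise)
vocab <- sort(unique(unlist(tv_tokens)))

# Hypothetical helper: word counts over the fixed vocabulary.
# Words outside the vocabulary are dropped.
count_words <- function(tokens) {
  as.numeric(table(factor(tokens, levels = vocab)))
}

# Word counts: one row per word, one column per Task View.
tv_counts <- sapply(tv_tokens, count_words)
rownames(tv_counts) <- vocab

# IDF from the Task View corpus (assumed unsmoothed form).
idf <- log(ncol(tv_counts) / rowSums(tv_counts > 0))

# Term frequency, then TF-IDF with the same dimensions as tv_counts.
tf <- function(counts) if (sum(counts) > 0) counts / sum(counts) else counts
tv_tfidf <- apply(tv_counts, 2, tf) * idf

# Cosine similarity; defined as 0 when either vector is all zeros.
cosine <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) 0 else sum(a * b) / denom
}

# One feature vector per package, with one entry per Task View.
# The package TF-IDF reuses the IDF computed from the Task View corpus.
features <- lapply(package_text, function(txt) {
  pkg_tfidf <- tf(count_words(tokenise(txt))) * idf
  apply(tv_tfidf, 2, cosine, b = pkg_tfidf)
})
```

In this toy setup, words that appear in every Task View (such as "with" and "and") receive an IDF of zero, so only words that distinguish one Task View from another contribute to the similarity.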