Creates the NLP features — get_NLP • CTVsuggestTrain

get_NLP() creates the NLP features for the model from Task View text and package description text. The function extracts the Task View text from the source markdown files on GitHub and cleans it. From the corpus of words in the Task View text, it computes TF-IDF weights for each word, giving a TF-IDF vector for each Task View. It then computes the cosine similarity between the text of each package's title and description and the TF-IDF vector of each Task View. This yields a set of features for each package, with one feature per Task View.

Usage

get_NLP(
  TEST = FALSE,
  limiting_n_observations = 100,
  get_input_stored = FALSE,
  get_input_path = "tests/testthat/fixtures/get_data_output/get_data_output.rds",
  save_output = FALSE,
  save_path = "tests/testthat/fixtures/get_NLP_output",
  file_name
)

Arguments

TEST

logical. Default is FALSE. If TRUE, only a subset of the data extracted from CRAN is used, which speeds up testing.

More precisely, if TRUE, a random sample of rows is drawn from CRAN_data, and the number of rows drawn is given by limiting_n_observations.

limiting_n_observations

integer. Sets the size of the subset of CRAN_data when TEST is TRUE.

get_input_stored

logical. If TRUE, the function uses pre-saved data as input; otherwise it runs the CTVsuggestTrain internal get_data() function.

get_input_path

string. If get_input_stored is TRUE, get_input_path gives the path to the pre-saved data.

save_output

logical. Default is FALSE. If TRUE, then the list that is returned is saved to the path set by save_path.

save_path

string. Sets the path where the list created by the function is saved. Only used when save_output is TRUE.

file_name

string. Sets the file name for the saved object.
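
The effect of TEST and limiting_n_observations can be pictured with a minimal base R sketch. This is an illustration under assumptions, not the package's own code: the toy CRAN_data data frame, its columns, and the fixed seed are hypothetical.

```r
# Hypothetical stand-in for CRAN_data: one row per package.
CRAN_data <- data.frame(
  Package = paste0("pkg", seq_len(500)),
  Title   = paste("Title of package", seq_len(500)),
  stringsAsFactors = FALSE
)

TEST <- TRUE
limiting_n_observations <- 100

if (TEST) {
  set.seed(42)  # assumption: fixed seed only to make this sketch reproducible
  # Draw row indices at random, without replacement.
  keep <- sample(nrow(CRAN_data), min(limiting_n_observations, nrow(CRAN_data)))
  CRAN_data <- CRAN_data[keep, , drop = FALSE]
}

nrow(CRAN_data)
```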

Value

Returns

feature_matrix_titles_descriptions_packages_cosine - list. Contains one element per package, each a vector whose length equals the number of Task Views. Each entry of the vector is the cosine similarity between the TF-IDF vector of the corresponding Task View and the TF-IDF vector of the package text. The IDF term for the package text's TF-IDF vector is computed from the Task View text corpus.
input_CRAN_data - list. Contains all of the data created by the CTVsuggest:::get_data function, so that it is carried forward.

Details

The get_NLP() function is run inside get_create_features().

get_NLP() carries out the following steps:

First, the markdown files that generate the CRAN Task View description pages are imported. The text is then cleaned; for example, links are removed.
Using the text extracted for each Task View, a data frame is created which gives the count of each word for each Task View.
Using this object, we compute the TF-IDF weighting for each word. The result is a data frame with the same dimensions as the word-count data frame.
Next, we use code provided by Dirk Eddelbuettel to extract the titles and descriptions of all packages on CRAN. The result is a matrix with one row per package.
Then we create a list containing one data frame per package, each giving the word counts for that package's text.
For each package, we take the cosine similarity between the package text and the TF-IDF vector of each Task View.
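
The steps above can be sketched end to end in base R on toy data. This is a rough illustration, not the implementation used by get_NLP(): the texts, the tokenize(), count_words() and cosine() helpers, and the particular TF and IDF formulas (relative frequency and log(N / document frequency)) are all assumptions made for the example.

```r
# Toy Task View texts (illustrative stand-ins, not real Task View content).
task_view_text <- c(
  Bayesian   = "bayesian inference with markov chain monte carlo sampling",
  TimeSeries = "time series forecasting and seasonal time series models",
  Spatial    = "spatial data maps and spatial regression models"
)

# Toy package title + description texts (also illustrative).
package_text <- c(
  pkgA = "tools for forecasting seasonal time series",
  pkgB = "markov chain monte carlo sampling for bayesian models"
)

# Hypothetical helper: split lower-cased text into words.
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z]+")[[1]]
  words[nzchar(words)]
}

# The vocabulary comes from the Task View corpus only.
tv_tokens <- lapply(task_view_text, tokenize)
vocab <- sort(unique(unlist(tv_tokens)))

# Hypothetical helper: count vocabulary words (out-of-vocabulary words are dropped).
count_words <- function(tokens) {
  as.numeric(table(factor(tokens, levels = vocab)))
}

# Word counts: one row per Task View, one column per word.
tv_counts <- t(vapply(tv_tokens, count_words, numeric(length(vocab))))
colnames(tv_counts) <- vocab

# IDF from the Task View corpus; TF as relative frequency within each text.
idf <- log(nrow(tv_counts) / colSums(tv_counts > 0))
tv_tfidf <- sweep(tv_counts / rowSums(tv_counts), 2, idf, `*`)

# Package word counts, weighted with the Task View IDF.
pkg_counts <- t(vapply(lapply(package_text, tokenize), count_words,
                       numeric(length(vocab))))
colnames(pkg_counts) <- vocab
pkg_tfidf <- sweep(pkg_counts / pmax(rowSums(pkg_counts), 1), 2, idf, `*`)

# Cosine similarity, returning 0 when either vector is all zeros.
cosine <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) 0 else sum(a * b) / denom
}

# One vector per package, with one similarity per Task View.
features <- lapply(rownames(pkg_tfidf), function(p) {
  apply(tv_tfidf, 1, function(tv) cosine(pkg_tfidf[p, ], tv))
})
names(features) <- rownames(pkg_tfidf)

features
```

In this toy setup each package's largest similarity falls on the Task View whose text shares the most distinctive words with it, which is the signal the features are meant to carry.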