
get_create_features() returns a feature matrix and a response matrix for all packages whose data has been extracted. A subset of these packages is then used for model training and testing. The package features are also required to generate predictions from the trained model.

Usage

get_create_features(
  TEST = FALSE,
  limiting_n_observations = 100,
  get_input_stored = FALSE,
  get_input_path = "tests/testthat/fixtures/get_NLP_output/get_NLP_output.rds",
  save_output = FALSE,
  save_path = "tests/testthat/fixtures/get_create_features_output",
  file_name = "get_create_features_output.rds"
)

Arguments

TEST

logical. Default is FALSE. If TRUE, only a subset of the data extracted from CRAN is used, which speeds up testing.

More precisely, if TRUE, a random sample of rows is drawn from CRAN_data, and the number of rows is given by limiting_n_observations.

limiting_n_observations

integer. Default is 100. Sets the number of rows in the subset of CRAN_data when TEST is TRUE.
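As a rough illustration of how TEST and limiting_n_observations interact, the sketch below draws a random subset of rows from a toy data frame. The toy CRAN_data, the seed, and the use of sample() are assumptions made for this example; the function's actual subsetting code may differ.

```r
# Hypothetical stand-in for the extracted CRAN data (500 fake packages).
set.seed(1)  # only so that the illustration is reproducible
CRAN_data <- data.frame(
  Package = paste0("pkg", 1:500),
  stringsAsFactors = FALSE
)

TEST <- TRUE
limiting_n_observations <- 100

if (TEST) {
  # Draw row indices without replacement, never asking for more rows than exist.
  keep <- sample(nrow(CRAN_data),
                 size = min(limiting_n_observations, nrow(CRAN_data)))
  CRAN_data <- CRAN_data[keep, , drop = FALSE]
}

nrow(CRAN_data)
```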

get_input_stored

logical. Default is FALSE. If TRUE, the function uses pre-saved data as input; otherwise it runs the internal CTVsuggestTrain function get_data().

get_input_path

string. If get_input_stored is TRUE, get_input_path gives the path to the pre-saved data.

save_output

logical. Default is FALSE. If TRUE, then the list that is returned is saved to the path set by save_path.

save_path

string. Sets the path where the list created by the function is saved when save_output is TRUE.

file_name

string. Sets the file name for the saved object.

Value

Returns a list with the following elements:

  • response_matrix - Matrix with a row for each CRAN package and a column for each CRAN Task View. A value of 1 denotes that the package is assigned to the Task View of the corresponding column, and a value of 0 that it is not.

  • features - Matrix with a row for each CRAN package and a column for each variable. It combines the three types of features described at the end of the Details section.

  • All_data - List containing a package and an author network created with cranly.

  • pac_network_igraph - igraph version of the cranly package network.

  • input_CRAN_data - List containing all of the data created by get_NLP(), so that it is carried forward.

Details

The get_create_features() function is run inside get_CRAN_logs().

get_create_features() carries out the following steps:

  • First, CRAN packages with no author have their maintainer set as the author in the CRAN_cranly_data object.

  • Then the author and package cranly networks are built using cranly::build_network().

  • Next a list is created, with an element for each CRAN package that is assigned to at least one Task View. Each element is a character vector with the names of the Task Views that the corresponding package is assigned to.

  • The response matrix is created. This object is described in the Value section of the documentation.

  • The feature matrices are then created. In each of these matrices, a row is the feature vector for one CRAN package. The final feature matrix is a combination of these individual matrices.
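The step from the list of Task View assignments to the response matrix can be sketched in base R as follows. The package names, Task View names, and object names (task_view_list, all_packages) are made up for illustration, and this is not necessarily how the function builds the matrix internally.

```r
# Hypothetical toy input: one element per package that has at least one Task View.
task_view_list <- list(
  pkgA = c("Bayesian", "Econometrics"),
  pkgB = "Bayesian"
)

# All packages under consideration; pkgC is assigned to no Task View.
all_packages <- c("pkgA", "pkgB", "pkgC")
task_views <- sort(unique(unlist(task_view_list)))

# Start with all zeros: one row per package, one column per Task View.
response_matrix <- matrix(
  0L,
  nrow = length(all_packages),
  ncol = length(task_views),
  dimnames = list(all_packages, task_views)
)

# Set a 1 wherever a package is assigned to a Task View.
for (pkg in names(task_view_list)) {
  response_matrix[pkg, task_view_list[[pkg]]] <- 1L
}

response_matrix
```

Packages without any Task View, such as pkgC here, keep a row of zeros.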

In the description of the feature matrices below, let \(x\) denote an example CRAN package.

Package Dependencies

The feature vector of a package \(x\) is the distribution of Task View assignments among the hard dependencies of \(x\). For example, if a quarter of the hard dependencies of \(x\) belong to the Bayesian Task View, then the corresponding element of the vector is 0.25.
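A minimal sketch of this calculation, under the assumption that each element is the proportion of hard dependencies assigned to the corresponding Task View. The helper dependency_feature(), the dependency names, and the toy response matrix are hypothetical; whether the package normalises the vector any further is not stated here.

```r
# Hypothetical Task View assignments of four packages that x depends on.
task_views <- c("Bayesian", "Econometrics", "Spatial")
response_matrix <- matrix(
  c(1, 0, 0,   # dep1: Bayesian
    0, 1, 0,   # dep2: Econometrics
    0, 1, 1,   # dep3: Econometrics and Spatial
    0, 0, 0),  # dep4: no Task View
  nrow = 4, byrow = TRUE,
  dimnames = list(c("dep1", "dep2", "dep3", "dep4"), task_views)
)
hard_deps_of_x <- c("dep1", "dep2", "dep3", "dep4")

# Proportion of the given dependencies that belong to each Task View.
dependency_feature <- function(deps, response_matrix) {
  known <- intersect(deps, rownames(response_matrix))
  if (length(known) == 0) {
    # No known dependencies: return an all-zero vector.
    return(setNames(numeric(ncol(response_matrix)), colnames(response_matrix)))
  }
  colMeans(response_matrix[known, , drop = FALSE])
}

feature_x <- dependency_feature(hard_deps_of_x, response_matrix)
feature_x
```

Here one of the four dependencies is in Bayesian, so that element is 0.25, matching the example in the text.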

Other Author Packages

The feature vector of a package \(x\) is the distribution of Task View assignments among the other packages developed by the authors of \(x\).
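The lookup of "other packages by the same authors" can be illustrated with a toy author list. The author names, package names, and the use of a plain list instead of the cranly author network are assumptions for this sketch only.

```r
# Hypothetical authorship data: package name -> character vector of authors.
package_authors <- list(
  x    = c("Ann", "Bob"),
  pkg1 = "Ann",
  pkg2 = c("Bob", "Cy"),
  pkg3 = "Cy"
)

# Hypothetical Task View assignments of the other packages.
response_matrix <- matrix(
  c(1, 0,   # pkg1: Bayesian
    0, 1,   # pkg2: Econometrics
    1, 1),  # pkg3: both
  nrow = 3, byrow = TRUE,
  dimnames = list(c("pkg1", "pkg2", "pkg3"), c("Bayesian", "Econometrics"))
)

# Packages that share at least one author with x, excluding x itself.
shares_author <- vapply(
  package_authors,
  function(authors) length(intersect(authors, package_authors[["x"]])) > 0,
  logical(1)
)
other_pkgs <- setdiff(names(package_authors)[shares_author], "x")

# Proportion of those packages assigned to each Task View.
author_feature <- colMeans(response_matrix[other_pkgs, , drop = FALSE])
author_feature
```

In this toy case pkg3 is ignored because it shares no author with x. A package whose authors have no other packages would need separate handling, since colMeans() of zero rows returns NaN.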

Text Data

The feature_matrix_titles_descriptions_packages_cosine object is created by the get_NLP() function.