get_create_features()
Returns a set of features and a response matrix for all packages whose data has been extracted.
A subset of these packages is then used for model training and testing. The features of the packages are also required to generate predictions with the trained model.
Usage
get_create_features(
TEST = FALSE,
limiting_n_observations = 100,
get_input_stored = FALSE,
get_input_path = "tests/testthat/fixtures/get_NLP_output/get_NLP_output.rds",
save_output = FALSE,
save_path = "tests/testthat/fixtures/get_create_features_output",
file_name = "get_create_features_output.rds"
)
Arguments
- TEST
logical. Default is FALSE. If TRUE, then a subset of the data that is extracted from CRAN is selected. This is to speed up testing. More precisely, if TRUE, a random selection of rows from CRAN_data is selected, where the number of rows chosen is given by limiting_n_observations.
- limiting_n_observations
Integer that decides the size of the subset of CRAN_data, when TEST is TRUE.
- get_input_stored
logical. If TRUE then the function uses pre-saved data as input, otherwise it runs the CTVsuggestTrain internal get_data() function.
- get_input_path
string. If get_input_stored is set to TRUE, get_input_path gives the path location of the pre-saved data.
- save_output
logical. Default is FALSE. If TRUE, then the list that is returned is saved to the path set by save_path.
- save_path
string. Sets the path where the list created by the function will be saved, which is when save_output is set to TRUE.
- file_name
string. Sets the file name for the saved object.
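The effect of TEST and limiting_n_observations can be illustrated with a minimal sketch. This is not the package's own code: the toy CRAN_data data frame and the use of sample() are assumptions made purely to show what "a random selection of rows" of a given size looks like.

```r
# Hypothetical stand-in for CRAN_data; the real object is built from CRAN.
CRAN_data <- data.frame(
  Package = paste0("pkg", 1:500),
  stringsAsFactors = FALSE
)

TEST <- TRUE
limiting_n_observations <- 100

if (TEST) {
  # Draw row indices at random, without replacement. The min() guard is an
  # assumption so the sketch also works when fewer rows are available.
  n_keep <- min(limiting_n_observations, nrow(CRAN_data))
  keep <- sample(nrow(CRAN_data), size = n_keep)
  CRAN_data <- CRAN_data[keep, , drop = FALSE]
}

nrow(CRAN_data)
```

With the toy data above, 100 of the 500 rows are retained.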
Value
Returns a list containing:
response_matrix - Matrix with a row for each CRAN package, and a column for each CRAN Task View. A value of 1 denotes that the package is assigned to the Task View of the corresponding column, and a value of 0 denotes that it is not.
features - Matrix with a row for each CRAN package, and a column for each variable. This feature matrix is constructed using each of the three different types of features as described at the end of the Details section.
All_data - List containing a package and an author network created with cranly.
pac_network_igraph - igraph version of the cranly package network.
input_CRAN_data - A list containing all of the data created by get_NLP(), so that it is carried forward.
Details
The get_create_features() function is run inside get_CRAN_logs().
get_create_features() carries out the following steps:
Firstly, CRAN packages with no author have their maintainer set as the author in the CRAN_cranly_data object.
Then, using cranly::build_network(), Author and Package cranly networks are built.
Next, a list is created, with an element for each CRAN package that is assigned to at least one Task View. Each element is a character vector with the names of the Task Views that the corresponding package is assigned to.
The response matrix is then created. This object is described in the Value section of the documentation.
Finally, the feature matrices are created. These are matrices where each row corresponds to a feature vector for a CRAN package. The final feature matrix is a combination of these individual matrices.
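As an illustration of how the list of Task View assignments relates to the response matrix, here is a minimal sketch. It is not the package's implementation; the object names task_views_by_package, all_packages and all_task_views, and the toy values, are hypothetical.

```r
# Hypothetical toy input: one element per package that is assigned to at
# least one Task View, holding the names of those Task Views.
task_views_by_package <- list(
  pkgA = c("Bayesian", "Econometrics"),
  pkgB = "Bayesian"
)
all_packages <- c("pkgA", "pkgB", "pkgC")  # pkgC has no Task View
all_task_views <- c("Bayesian", "Econometrics", "Optimization")

# Start from an all-zero matrix: one row per package, one column per Task View.
response_matrix <- matrix(
  0,
  nrow = length(all_packages),
  ncol = length(all_task_views),
  dimnames = list(all_packages, all_task_views)
)

# Set a 1 wherever a package is assigned to a Task View.
for (pkg in names(task_views_by_package)) {
  response_matrix[pkg, task_views_by_package[[pkg]]] <- 1
}

response_matrix
```

Packages that do not appear in the list, such as pkgC here, keep a row of zeros.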
In the description of the feature matrices below, let \(x\) denote an example CRAN package.
- Package Dependencies
The feature vector of a package \(x\) is the distribution of the Task View assignments of the hard dependencies of \(x\). For example, if a quarter of the hard dependencies of \(x\) belong to Bayesian, then the corresponding element of the vector will be 0.25.
- Other Author Packages
The feature vector of a package \(x\) is the distribution of the Task View assignments of other packages developed by the authors of \(x\).
- Text Data
The feature_matrix_titles_descriptions_packages_cosine object is created by the get_NLP() function.
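The Package Dependencies feature can be sketched as follows. This is a rough illustration under stated assumptions, not the package's code: the names hard_deps_of_x, task_views_by_package and all_task_views are hypothetical, and the sketch assumes that each element is the fraction of all hard dependencies of \(x\) assigned to that Task View, so dependencies with no Task View contribute to no element.

```r
# Hypothetical toy data: the hard dependencies of an example package x and
# the Task Views of those dependencies.
hard_deps_of_x <- c("dep1", "dep2", "dep3", "dep4")
task_views_by_package <- list(
  dep1 = "Bayesian",
  dep2 = "Econometrics",
  dep3 = "Econometrics"
  # dep4 is not assigned to any Task View
)
all_task_views <- c("Bayesian", "Econometrics", "Optimization")

# For each Task View, compute the share of x's hard dependencies assigned to it.
dependency_feature <- vapply(
  all_task_views,
  function(tv) {
    in_tv <- vapply(
      hard_deps_of_x,
      # A missing list element is NULL, and `tv %in% NULL` is FALSE.
      function(dep) tv %in% task_views_by_package[[dep]],
      logical(1)
    )
    mean(in_tv)
  },
  numeric(1)
)

dependency_feature
```

Here one of the four dependencies belongs to Bayesian, so that element is 0.25, matching the example in the text. The same calculation could be applied to the Other Author Packages feature by replacing the hard dependencies with the other packages developed by the authors of \(x\).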