Learning
This section walks you through setting up the configuration file for the machine learning part of the pipeline. It is divided into the following subsections:
Experiment Design Parameters
This set of parameters defines the experiment design (data splitting method, splitting proportion, etc.). It is organized as follows:
{
"testSets": ["Define method here"],
"method name": "Define method here"
}
Now let’s specify the parameters for the selected method; for instance, in the case of the Random and CV methods:
Splitting methods
Type of sets to create (object). Properties:
    Random (object): Random splitting method. Properties:
        method (string): Method of splitting the data. Options:
            SubSampling (string): The data will be randomly split.
            The data will be split based on institutions (string).
        nSplits (int): Number of splits to create.
        stratifyInstitutions (bool): If
        testProportion (float): Proportion of the test set.
        seed (int): Seed for the random number generator.
    CV (object): Cross-validation splitting method. Properties:
        Number of folds to use (int).
        Seed for the random number generator (int).
Example
{
"Random": {
"method": "SubSampling",
"nSplits": 10,
"stratifyInstitutions": 1,
"testProportion": 0.33,
"seed": 54288
}
}
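The following Python sketch illustrates what the `nSplits`, `testProportion` and `seed` values in the example above mean for a purely random split. It is a minimal illustration under stated assumptions, not the pipeline's own implementation: the function name `random_splits` and the patient identifiers are hypothetical, and institution stratification (`stratifyInstitutions`) is deliberately left out.

```python
import random


def random_splits(sample_ids, n_splits, test_proportion, seed):
    """Return n_splits (train, test) pairs drawn by repeated random subsampling."""
    rng = random.Random(seed)  # a seeded generator makes the splits reproducible
    n_test = round(len(sample_ids) * test_proportion)
    splits = []
    for _ in range(n_splits):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        # The first n_test shuffled samples form the test set, the rest the training set.
        test = sorted(shuffled[:n_test])
        train = sorted(shuffled[n_test:])
        splits.append((train, test))
    return splits


# Hypothetical patient identifiers, used only for illustration.
patients = [f"patient{i:03d}" for i in range(100)]
splits = random_splits(patients, n_splits=10, test_proportion=0.33, seed=54288)
train, test = splits[0]
print(len(splits), len(train), len(test))
```

Calling the function again with the same seed returns the same splits, which is the reason a seed is part of the configuration.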
Data Cleaning Parameters
This set of parameters defines the data cleaning process. It is organized as follows:
{
"method name": {
"define parameters here"
},
"another method": {
"define parameters here"
}
}
Cleaning methods
Feature cleaning method name (object). Properties:
    default (string): Default cleaning method.
Now let’s specify the parameters for the selected cleaning method; for instance, in the case of the default method:
Chosen method’s parameters
feature (object): Feature cleaning parameters. Properties:
    continuous (object): Continuous feature cleaning parameters. Properties:
        missingCutoffps (float): Maximum percentage of missing features per sample. Samples with more missing features than this cut-off will be removed.
        covCutoff (float): Minimum coefficient of variation over samples per variable. Variables with a lower coefficient of variation than this cut-off will be removed.
        missingCutoffpf (float): Maximum percentage of missing samples per variable. Features with more missing samples than this cut-off will be removed.
        imputation (string): Imputation method for missing values. Default is
            Options:
            mean (string): Impute missing values with the mean of the feature.
            Impute missing values with the median of the feature (string).
            Impute missing values with a random value from the feature set (string).
Example
{
"default":
{
"feature": {
"continuous": {
"missingCutoffps": 0.25,
"covCutoff": 0.1,
"missingCutoffpf": 0.1,
"imputation": "mean"
}
}
}
}
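The cut-offs in the example above can be read as the following Python sketch. It is only an illustration under several assumptions: the function name `clean_continuous` and the sample table are hypothetical, the cut-offs are treated as fractions between 0 and 1, the coefficient of variation is taken as the population standard deviation divided by the absolute mean, and the order of the steps is a guess rather than the pipeline's documented behaviour. Only mean imputation is shown.

```python
from statistics import mean, pstdev


def clean_continuous(table, missing_cutoff_ps, cov_cutoff, missing_cutoff_pf):
    """table maps a sample id to {feature name: float, or None when missing}."""
    features = sorted({name for row in table.values() for name in row})
    if not features:
        return {}

    def missing(row, name):
        return row.get(name) is None

    # 1. Remove samples with too many missing features (missingCutoffps).
    kept = {sid: row for sid, row in table.items()
            if sum(missing(row, f) for f in features) / len(features) <= missing_cutoff_ps}
    if not kept:
        return {}

    # 2. Remove features with too many missing samples (missingCutoffpf).
    features = [f for f in features
                if sum(missing(row, f) for row in kept.values()) / len(kept) <= missing_cutoff_pf]

    # 3. Remove features whose coefficient of variation is below covCutoff.
    def cov(name):
        values = [row[name] for row in kept.values() if not missing(row, name)]
        if not values:
            return float("-inf")  # nothing observed: always dropped
        centre = mean(values)
        return pstdev(values) / abs(centre) if centre else float("inf")

    features = [f for f in features if cov(f) >= cov_cutoff]

    # 4. Fill the remaining gaps with the mean of each feature ("imputation": "mean").
    cleaned = {sid: {} for sid in kept}
    for f in features:
        fill = mean(row[f] for row in kept.values() if not missing(row, f))
        for sid, row in kept.items():
            cleaned[sid][f] = fill if missing(row, f) else row[f]
    return cleaned


# Hypothetical feature table, used only for illustration.
table = {
    "s1": {"volume": 10.0, "energy": 5.0, "flat": 1.0, "sparse": None},
    "s2": {"volume": 12.0, "energy": None, "flat": 1.0, "sparse": None},
    "s3": {"volume": 14.0, "energy": 7.0, "flat": 1.0, "sparse": 3.0},
    "s4": {"volume": None, "energy": None, "flat": None, "sparse": None},
}
cleaned = clean_continuous(table, missing_cutoff_ps=0.5, cov_cutoff=0.1, missing_cutoff_pf=0.4)
print(cleaned)
```

In this toy table, sample `s4` is removed for having only missing values, `sparse` is removed for being missing in most samples, `flat` is removed for having no variation, and the missing `energy` value of `s2` is replaced by the mean of the observed values.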
Note
You can add as many methods as you want, for other feature types (categorical, ordinal, etc.) and for other cleaning methods (e.g. PCA).
Data Normalization Parameters
Data normalization aims to remove batch effects from the data. This set of parameters defines the data normalization process. It is organized as follows:
{
"standardCombat": {
"define parameters here"
}
}
Chosen method parameters
Normalization method name (string). Options:
    standardCombat (string): Standard Combat normalization method.
Note
For now, only the standardCombat method is available, and it does not require any parameters.
Feature Set Reduction Parameters
Feature set reduction consists of reducing the number of features in the data by removing correlated features, selecting important features, etc. This set of parameters defines the feature set reduction process. It is organized as follows:
{
"selected method": {
"define parameters here"
}
}
method name
Feature set reduction method name (string). Options:
    FDA (string): False discovery avoidance method. Read the paper.
    FDAbalanced (string): Balanced version of the False discovery avoidance method, where the selected number of features is the same for each table.
Now let’s specify the parameters for the selected feature set reduction method; for instance, in the case of the FDA method:
FDA method
Feature set reduction parameters (object). Properties:
    FDA (object): FDA method’s parameters. Properties:
        nSplits (int): Number of splits to use for the FDA algorithm.
        corrType (string): Type of correlation to use for the FDA algorithm. Default is
            Options:
            Spearman (string): Spearman correlation.
            Pearson correlation (string).
        threshStableStart (float): Stability threshold used to cut off the unstable features at the beginning of the FDA algorithm.
        threshInterCorr (float): Threshold used to cut off the inter-correlated features.
        minNfeatStable (int): Minimum number of stable features to keep before the inter-correlation step.
        minNfeatInterCorr (int): Minimum number of inter-correlated features to keep.
        minNfeat (int): Minimum number of features to keep at the end of the FDA algorithm.
        seed (int): Seed for the random number generator.
Example
{
"FDA": {
"nSplits": 100,
"corrType": "Spearman",
"threshStableStart": 0.5,
"threshInterCorr": 0.7,
"minNfeatStable": 100,
"minNfeatInterCorr": 60,
"minNfeat": 5,
"seed": 54288
}
}
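To give an intuition for `corrType` and `threshInterCorr`, the Python sketch below removes inter-correlated features with a simple greedy rule. This is not the FDA algorithm itself, which the source does not describe in detail. It assumes that features arrive ordered from most to least preferred, and that a feature is dropped when its absolute correlation with an already kept feature reaches the threshold. All function names and the toy data are hypothetical.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm if norm else 0.0


def correlation(x, y, corr_type="Spearman"):
    # Spearman correlation is the Pearson correlation of the ranks.
    if corr_type == "Spearman":
        x, y = ranks(x), ranks(y)
    return pearson(x, y)


def drop_inter_correlated(features, thresh_inter_corr, corr_type="Spearman"):
    """features: dict of name -> values, ordered from most to least preferred."""
    kept = []
    for name, values in features.items():
        if all(abs(correlation(values, features[other], corr_type)) < thresh_inter_corr
               for other in kept):
            kept.append(name)
    return kept


# Toy data: "b" is a monotonic function of "a", "c" is only weakly related to it.
features = {
    "a": [1, 2, 3, 4, 5],
    "b": [1, 4, 9, 16, 25],
    "c": [3, 1, 5, 2, 4],
}
kept = drop_inter_correlated(features, thresh_inter_corr=0.7)
print(kept)
```

With a threshold of 0.7, `b` is dropped because its rank correlation with `a` is 1, while `c` is kept because its rank correlation with `a` is 0.3.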
Note
Only the FDA and FDAbalanced methods are available for now, and they share the same parameters.
Machine Learning Parameters
This set of parameters defines the machine learning process, algorithm, and parameters. It is organized as follows:
{
"selected algorithm": {
"define parameters here"
}
}
Now let’s specify the parameters for the selected machine learning algorithm; for instance, in the case of the XGBoost algorithm:
ML Algorithm
Machine learning algorithm name (object). Properties:
    XGBoost (object): XGBoost algorithm. Properties:
        varImportanceThreshold (float): Variable importance threshold. Default is
        optimalThreshold (float): If
        optimizationMetric (string): Model’s optimization metric. Default is
        method (string): Method to use for the XGBoost algorithm. Default is
            Options:
            pycaret (string): Automated using PyCaret.
            Random search using a pre-defined grid of parameters (string).
            Grid search using a pre-defined grid of parameters (string).
        nameSave (string): Name of the file to save the model.
        seed (int): Seed for the random number generator.
Example
{
"XGBoost": {
"varImportanceThreshold": 0.3,
"optimalThreshold": null,
"optimizationMetric": "AUC",
"method": "pycaret",
"nameSave": "XGBoost03AUC",
"seed": 54288
}
}
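Before running the pipeline, it can help to check that the values in this block have the documented types. The Python sketch below does that for the example above. The function name `check_xgboost_block` is hypothetical, the check is not part of the pipeline, and accepting `null` for `optimalThreshold` is an assumption based only on the example. Keys that are absent are not reported, since several of them have defaults.

```python
import json

# Python types accepted for each documented key of the "XGBoost" block.
XGBOOST_TYPES = {
    "varImportanceThreshold": (int, float),
    "optimalThreshold": (int, float, type(None)),  # null appears in the example
    "optimizationMetric": (str,),
    "method": (str,),
    "nameSave": (str,),
    "seed": (int,),
}


def check_xgboost_block(config):
    """Return a list of problems found; an empty list means no type mismatch."""
    block = config.get("XGBoost")
    if not isinstance(block, dict):
        return ['"XGBoost" must be an object']
    problems = []
    for key, accepted in XGBOOST_TYPES.items():
        if key not in block:
            continue  # absent keys are left to the pipeline's defaults
        value = block[key]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, accepted):
            problems.append(f'"{key}" has unexpected type {type(value).__name__}')
    return problems


config = json.loads("""
{
    "XGBoost": {
        "varImportanceThreshold": 0.3,
        "optimalThreshold": null,
        "optimizationMetric": "AUC",
        "method": "pycaret",
        "nameSave": "XGBoost03AUC",
        "seed": 54288
    }
}
""")
problems = check_xgboost_block(config)
print(problems)
```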
Note
Only the XGBoost algorithm is available for now.
Variables Definition
This set of parameters defines the variables to use for the machine learning process. It is organized as follows:
{
"selected variable": {
"define parameters here"
},
"combinations": [
"Insert combinations of variables here"
]
}
Variables
Variables to use for the machine learning process (object). Properties:
    combinations (List[str]): List of variable combinations to use for the study.
For the selected variable, you can specify the following parameters:
selected variable
Variable name to use for the machine learning process (object). Properties:
    nameType (string): Type of variable to use. Must contain
    path (string): Path to the variable file. Use
    scans (List[str]): List of scans to use for the variable. For example is
    rois (List[str]): List of ROIs to include in the study (will be used to identify the features file). For example is
    imSpaces (List[str]): Radiomics level, the features file must end with this level. For example is
    var_datacleaning (string): Data cleaning method to use for the variable. Default is
    var_normalization (string): Data normalization method to use for the variable. Default is
    var_fSetReduction (string): Feature set reduction method to use for the variable. Default is
Example
{
"var1": {
"nameType": "RadiomicsMorph",
"path": "setToMyFeaturesInWorkspace",
"scans": ["T1CE"],
"rois": ["GTV"],
"imSpaces": ["morph"],
"var_datacleaning": "default",
"var_normalization": "combat",
"var_fSetReduction": "FDA"
},
"combinations": [
"var1"
]
}