Learning

This section walks you through setting up the configuration file for the machine learning part of the pipeline. It is divided into the following subsections:

Experiment Design Parameters

This set of parameters defines the experiment design (data splitting method, splitting proportion, etc.). It is organized as follows:

{
    "testSets": ["Define method here"],
    "method name": "Define method here"
}

Now let’s specify the parameters for the selected method; for instance, in the case of the Random and CV methods:

Splitting methods

Type of sets to create.

type

object

properties

  • Random

Random splitting method.

type

object

properties

  • method

Method of splitting the data.

type

string

options

SubSampling

The data will be randomly split

type

string

Institutions

The data will be split based on institutions

type

string

  • nSplits

Number of splits to create.

type

int

  • stratifyInstitutions

If True, the data will be stratified based on institutions.

type

bool

  • testProportion

Proportion of the test set.

type

float

  • seed

Seed for the random number generator.

type

int

  • CV

Cross-validation splitting method.

type

object

properties

  • nFolds

Number of folds to use.

type

int

  • seed

Seed for the random number generator.

type

int

  • Example

{
    "Random": {
        "method": "SubSampling",
        "nSplits": 10,
        "stratifyInstitutions": 1,
        "testProportion": 0.33,
        "seed": 54288
    }
}
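To make the Random parameters concrete, here is a minimal Python sketch of repeated random subsampling driven by nSplits, testProportion and seed. It is an illustration only, not the pipeline's own implementation: the function name subsample_splits and the patient identifiers are hypothetical, and stratification by institution is deliberately left out.

```python
import random


def subsample_splits(patient_ids, n_splits, test_proportion, seed):
    """Return n_splits (train, test) pairs drawn by repeated random subsampling.

    Hypothetical helper: it ignores stratifyInstitutions and only shows how
    nSplits, testProportion and seed could interact.
    """
    rng = random.Random(seed)  # one seeded generator makes every split reproducible
    n_test = round(len(patient_ids) * test_proportion)
    splits = []
    for _ in range(n_splits):
        shuffled = list(patient_ids)
        rng.shuffle(shuffled)
        # The first n_test shuffled IDs form the test set, the rest the training set.
        splits.append((sorted(shuffled[n_test:]), sorted(shuffled[:n_test])))
    return splits


config = {
    "Random": {
        "method": "SubSampling",
        "nSplits": 10,
        "stratifyInstitutions": 1,
        "testProportion": 0.33,
        "seed": 54288,
    }
}
params = config["Random"]
patients = [f"patient{i:03d}" for i in range(100)]  # made-up identifiers
splits = subsample_splits(
    patients, params["nSplits"], params["testProportion"], params["seed"]
)
train, test = splits[0]
print(len(splits), len(train), len(test))
```

Because the generator is seeded once, calling the function again with the same seed reproduces the same splits, which is the usual reason for exposing a seed parameter.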

Data Cleaning Parameters

This set of parameters defines the data cleaning process. It is organized as follows:

{
    "method name": {
        "define parameters here"
    },
    "another method": {
        "define parameters here"
    }
}

Cleaning methods

Feature cleaning method name.

type

object

properties

  • default

Default cleaning method.

type

string

Now let’s specify the parameters for the selected cleaning method; for instance, in the case of the default method:

Chosen method’s parameters

Feature cleaning parameters.

type

object

properties

  • continuous

Continuous feature cleaning parameters.

type

object

properties

  • missingCutoffps

Maximum percentage cut-offs of missing features per sample. Samples with more missing features than this cut-off will be removed.

type

float

  • covCutoff

Minimum coefficient of variation cut-off over samples per variable. Variables with a coefficient of variation lower than this cut-off will be removed.

type

float

  • missingCutoffpf

Maximal percentage cut-offs of missing samples per variable. Features with more missing samples than this cut-off will be removed.

type

float

  • imputation

Imputation method for missing values. Default is mean.

type

string

options

mean

Impute missing values with the mean of the feature.

type

string

median

Impute missing values with the median of the feature.

type

string

random

Impute missing values with a random value from the feature set.

type

string

  • Example

{
    "default": {
        "feature": {
            "continuous": {
                "missingCutoffps": 0.25,
                "covCutoff": 0.1,
                "missingCutoffpf": 0.1,
                "imputation": "mean"
            }
        }
    }
}
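The following Python sketch shows one way the four continuous-feature parameters could act on a small table. It is not the pipeline's code. The function name clean_continuous, the order of the steps, and the reading of the cut-offs as fractions between 0 and 1 are all assumptions, and the demo uses looser cut-offs than the example above so that every step is visible on four samples.

```python
import math
import statistics


def clean_continuous(table, missing_cutoff_ps, cov_cutoff, missing_cutoff_pf):
    """Hypothetical cleaning pass over table[sample][feature]; NaN marks a missing value."""
    features = sorted({f for row in table.values() for f in row})

    def is_missing(sample, feature):
        return math.isnan(table[sample].get(feature, math.nan))

    # 1. Drop samples whose fraction of missing features exceeds missingCutoffps.
    samples = [
        s for s in table
        if sum(is_missing(s, f) for f in features) / len(features) <= missing_cutoff_ps
    ]
    # 2. Drop features whose fraction of missing samples exceeds missingCutoffpf.
    features = [
        f for f in features
        if sum(is_missing(s, f) for s in samples) / len(samples) <= missing_cutoff_pf
    ]
    cleaned = {s: {} for s in samples}
    for f in features:
        observed = [table[s][f] for s in samples if not is_missing(s, f)]
        mean = statistics.fmean(observed)
        # 3. Drop features whose coefficient of variation (std / |mean|) is below covCutoff.
        cov = statistics.pstdev(observed) / abs(mean) if mean else math.inf
        if cov < cov_cutoff:
            continue
        # 4. Fill the remaining gaps with the feature mean ("imputation": "mean").
        for s in samples:
            cleaned[s][f] = mean if is_missing(s, f) else table[s][f]
    return cleaned


nan = math.nan
table = {  # made-up values for illustration
    "s1": {"a": 1.0, "b": 5.0, "c": 10.0},
    "s2": {"a": 2.0, "b": 5.0, "c": nan},
    "s3": {"a": 3.0, "b": 5.0, "c": 30.0},
    "s4": {"a": nan, "b": nan, "c": nan},
}
cleaned = clean_continuous(table, missing_cutoff_ps=0.5, cov_cutoff=0.1, missing_cutoff_pf=0.4)
print(cleaned)
```

In this toy table, sample s4 is removed for having too many missing features, feature b is removed because it never varies, and the gap in feature c is filled with the mean of its observed values.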

Note

Note that you can add as many methods as you want, for other feature types (categorical, ordinal, etc.) and for other cleaning methods (e.g. PCA).

Data Normalization Parameters

Data normalization aims to remove batch effects from the data. This set of parameters defines the data normalization process. It is organized as follows:

{
    "standardCombat": {
        "define parameters here"
    }
}

Chosen method parameters

Normalization method name.

type

string

options

standardCombat

Standard Combat normalization method.

type

string

Note

For now only the standardCombat method is available and it does not require any parameters.

Feature Set Reduction Parameters

Feature set reduction reduces the number of features in the data by removing correlated features, selecting important features, etc. This set of parameters defines the feature set reduction process. It is organized as follows:

{
    "selected method": {
        "define parameters here"
    }
}

method name

Feature set reduction method name.

type

string

options

FDA

False discovery avoidance method. Read the paper.

type

string

FDAbalanced

Balanced version of the False discovery avoidance method, where the selected number of features is the same for each table.

type

string

Now let’s specify the parameters for the selected feature set reduction method; for instance, in the case of the FDA method:

FDA method

Feature set reduction parameters.

type

object

properties

  • FDA

FDA method’s parameters.

type

object

properties

  • nSplits

Number of splits to use for the FDA algorithm.

type

int

  • corrType

Type of correlation to use for the FDA algorithm. Default is Spearman.

type

string

options

Spearman

Spearman correlation.

type

string

Pearson

Pearson correlation.

type

string

  • threshStableStart

Stability threshold to cut-off the unstable features at the beginning of the FDA algorithm.

type

float

  • threshInterCorr

Threshold to cut-off the inter-correlated features.

type

float

  • minNfeatStable

Minimum number of stable features to keep before inter-correlation step.

type

int

  • minNfeatInterCorr

Minimum number of inter-correlated features to keep.

type

int

  • minNfeat

Minimum number of features to keep at the end of the FDA algorithm.

type

int

  • seed

Seed for the random number generator.

type

int

  • Example

{
    "FDA": {
        "nSplits": 100,
        "corrType": "Spearman",
        "threshStableStart": 0.5,
        "threshInterCorr": 0.7,
        "minNfeatStable": 100,
        "minNfeatInterCorr": 60,
        "minNfeat": 5,
        "seed": 54288
    }
}
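To give an intuition for what an inter-correlation threshold and a minimum feature count do, here is a generic greedy correlation filter in Python. It is not the FDA algorithm, which is described in the referenced paper. The function names, the pre-computed ranking, the use of Pearson correlation, and the "top up to the minimum" rule are all assumptions made for this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_inter_correlated(features, ranking, thresh_inter_corr, min_n_feat):
    """Hypothetical greedy filter, NOT the FDA algorithm.

    Walks the features from best to worst rank and keeps one only if its
    absolute correlation with every feature kept so far is below the threshold.
    """
    kept = []
    for name in ranking:
        if all(abs(pearson(features[name], features[k])) < thresh_inter_corr for k in kept):
            kept.append(name)
    # One possible reading of a minimum count: top up with the best-ranked dropped features.
    for name in ranking:
        if len(kept) >= min_n_feat:
            break
        if name not in kept:
            kept.append(name)
    return kept


features = {  # made-up values for illustration
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 11],  # nearly collinear with f1
    "f3": [5, 1, 4, 2, 3],   # weakly correlated with f1
}
ranking = ["f1", "f2", "f3"]  # assumed to come from an earlier stability step
kept = drop_inter_correlated(features, ranking, thresh_inter_corr=0.7, min_n_feat=2)
print(kept)
```

With a threshold of 0.7, f2 is discarded because it is almost a linear function of f1, while f3 survives. Raising min_n_feat to 3 would bring f2 back, which shows how a minimum count can override the threshold.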

Note

Only FDA and FDAbalanced methods are available for now and they share the same parameters.

Machine Learning Parameters

This set of parameters defines the machine learning process, algorithm, and parameters. It is organized as follows:

{
    "selected algorithm": {
        "define parameters here"
    }
}

Now let’s specify the parameters for the selected machine learning algorithm; for instance, in the case of the XGBoost algorithm:

ML Algorithm

Machine learning algorithm name.

type

object

properties

  • XGBoost

XGBoost algorithm.

type

object

properties

  • varImportanceThreshold

Variable importance threshold. Default is 0.3. Variables with importance below this threshold will be removed.

type

float

  • optimalThreshold

If null, the optimal threshold will be computed. Default is 0.5.

type

float

  • optimizationMetric

Model’s optimization metric. Default is AUC. Only used if method is pycaret.

type

string

  • method

Method to use for the XGBoost algorithm. Default is pycaret.

type

string

options

pycaret

Automated using PyCaret.

type

string

random_search

Random search using a pre-defined grid of parameters.

type

string

grid_search

Grid search using a pre-defined grid of parameters.

type

string

  • nameSave

Name of the file to save the model.

type

string

  • seed

Seed for the random number generator.

type

int

  • Example

{
    "XGBoost": {
        "varImportanceThreshold": 0.3,
        "optimalThreshold": null,
        "optimizationMetric": "AUC",
        "method": "pycaret",
        "nameSave": "XGBoost03AUC",
        "seed": 54288
    }
}
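The defaults listed above can be combined with a user configuration as in the Python sketch below. It assumes that an omitted key falls back to its documented default, while an explicit null for optimalThreshold requests that the threshold be computed. The function resolve_xgboost_settings and the computeOptimalThreshold key are hypothetical and exist only in this example.

```python
import json

# Defaults as documented in this section.
XGBOOST_DEFAULTS = {
    "varImportanceThreshold": 0.3,
    "optimalThreshold": 0.5,
    "optimizationMetric": "AUC",
    "method": "pycaret",
}
ALLOWED_METHODS = {"pycaret", "random_search", "grid_search"}


def resolve_xgboost_settings(raw_json):
    """Hypothetical helper: merge user settings over the documented defaults."""
    user = json.loads(raw_json)["XGBoost"]
    settings = {**XGBOOST_DEFAULTS, **user}  # user values win over defaults
    if settings["method"] not in ALLOWED_METHODS:
        raise ValueError(f"Unknown method: {settings['method']!r}")
    # JSON null becomes Python None; the sketch treats it as "compute the threshold".
    settings["computeOptimalThreshold"] = settings["optimalThreshold"] is None
    return settings


raw = """
{
    "XGBoost": {
        "varImportanceThreshold": 0.3,
        "optimalThreshold": null,
        "optimizationMetric": "AUC",
        "method": "pycaret",
        "nameSave": "XGBoost03AUC",
        "seed": 54288
    }
}
"""
settings = resolve_xgboost_settings(raw)
minimal = resolve_xgboost_settings('{"XGBoost": {}}')
print(settings["computeOptimalThreshold"], minimal["optimalThreshold"])
```

Validating the method name early gives a clear error for a misspelled option instead of a failure later in the run.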

Note

Only the XGBoost algorithm is available for now.

Variables Definition

This set of parameters defines the variables to use for the machine learning process. It is organized as follows:

{
    "selected variable": {
        "define parameters here"
    },
    "combinations": [
        "Insert combinations of variables here"
    ]
}

Variables

Variables to use for the machine learning process.

type

object

properties

  • combinations

List of variable combinations to use for the study.

type

List[str]

For the selected variable, you can specify the following parameters:

selected variable

Variable name to use for the machine learning process.

type

object

properties

  • nameType

Type of variable to use. Must contain Radiomics for radiomics features.

type

string

  • path

Path to the variable file. Use "setToFolderNameinWorkspace" to set the features folder to FolderName in the workspace.

type

string

  • scans

List of scans to use for the variable. For example, T1C.

type

List[str]

  • rois

List of ROIs to include in the study (will be used to identify the features file). For example, GTV.

type

List[str]

  • imSpaces

Radiomics level. The features file name must end with this level. For example, morph.

type

List[str]

  • var_datacleaning

Data cleaning method to use for the variable. Default is default.

type

string

  • var_normalization

Data normalization method to use for the variable. Default is combat.

type

string

  • var_fSetReduction

Feature set reduction method to use for the variable. Default is FDA.

type

string

  • Example

{
    "var1": {
        "nameType": "RadiomicsMorph",
        "path": "setToMyFeaturesInWorkspace",
        "scans": ["T1CE"],
        "rois": ["GTV"],
        "imSpaces": ["morph"],
        "var_datacleaning": "default",
        "var_normalization": "combat",
        "var_fSetReduction": "FDA"
    },
    "combinations": [
        "var1"
    ]
}
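A quick consistency check of this part of the configuration could look like the Python sketch below. The function check_variables and the choice of which keys to treat as required are assumptions for illustration. How several variables are joined inside one combination entry is not covered in this section, so the sketch only handles entries that name a single variable, as in the example above.

```python
REQUIRED_KEYS = ("nameType", "path", "scans", "rois", "imSpaces")  # assumed required
LIST_KEYS = ("scans", "rois", "imSpaces")  # documented as List[str]


def check_variables(config):
    """Hypothetical validator: return a list of human-readable problems."""
    problems = []
    # Every top-level key except "combinations" is treated as a variable definition.
    variables = {k: v for k, v in config.items() if k != "combinations"}
    for name, var in variables.items():
        for key in REQUIRED_KEYS:
            if key not in var:
                problems.append(f"{name}: missing '{key}'")
        for key in LIST_KEYS:
            if key in var and not isinstance(var[key], list):
                problems.append(f"{name}: '{key}' must be a list of strings")
    # Only single-variable combinations are checked (see the assumption above).
    for combo in config.get("combinations", []):
        if combo not in variables:
            problems.append(f"combinations: '{combo}' is not a defined variable")
    return problems


good = {
    "var1": {
        "nameType": "RadiomicsMorph",
        "path": "setToMyFeaturesInWorkspace",
        "scans": ["T1CE"],
        "rois": ["GTV"],
        "imSpaces": ["morph"],
        "var_datacleaning": "default",
        "var_normalization": "combat",
        "var_fSetReduction": "FDA",
    },
    "combinations": ["var1"],
}
bad = {
    "var1": {**good["var1"], "scans": "T1CE"},  # string instead of a list
    "combinations": ["var2"],                   # refers to an undefined variable
}
print(check_variables(good))
print(check_variables(bad))
```

Running such a check before launching an experiment catches the two most common slips, a scalar where a list is expected and a combination that points to a variable that was never defined.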