Learning

This section walks you through setting up the configuration file for the machine learning part of the pipeline. It is divided into the following subsections:

Experiment Design
Data Cleaning
Data Normalization
Feature Set Reduction
Machine Learning
Variables Definition

Experiment Design Parameters

These parameters define the experiment design (data splitting, splitting proportion…) and are organized as follows:

{
    "testSets": ["Define method here"],
    "method name": "Define method here"
}

Now let’s specify the parameters for the selected method; for instance, in the case of the Random and CV methods:

Splitting methods

Type of sets to create.
type	object
properties
Random	Random splitting method.
	type	object
	properties
	method	Method of splitting the data.
		type	string
		options
		`SubSampling`	The data will be randomly split
			type	string
		`Institutions`	The data will be split based on institutions
			type	string
	nSplits	Number of splits to create.
		type	int
	stratifyInstitutions	If `True`, the data will be stratified based on institutions.
		type	bool
	testProportion	Proportion of the test set.
		type	float
	seed	Seed for the random number generator.
		type	int
CV	Cross-validation splitting method.
	type	object
	properties
	nFolds	Number of folds to use.
		type	int
	seed	Seed for the random number generator.
		type	int

Example

{
    "Random": {
        "method": "SubSampling",
        "nSplits": 10,
        "stratifyInstitutions": 1,
        "testProportion": 0.33,
        "seed": 54288
    }
}
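To make the roles of `nSplits`, `testProportion`, `stratifyInstitutions` and `seed` more concrete, here is a minimal Python sketch of seeded, institution-stratified random sub-sampling. It is not the pipeline's own implementation: the function `subsample_splits`, the rounding rule for the test-set size and the toy cohort are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def subsample_splits(institutions, n_splits, test_proportion, seed):
    """Return n_splits (train_ids, test_ids) pairs, stratified by institution.

    Hypothetical helper: `institutions` maps each sample ID to its institution.
    """
    rng = random.Random(seed)  # one seeded generator keeps every run reproducible
    by_institution = defaultdict(list)
    for sample_id in sorted(institutions):
        by_institution[institutions[sample_id]].append(sample_id)

    splits = []
    for _ in range(n_splits):
        train, test = [], []
        # Draw the test share inside each institution so proportions are preserved.
        for ids in by_institution.values():
            shuffled = ids[:]
            rng.shuffle(shuffled)
            n_test = round(len(shuffled) * test_proportion)  # assumed rounding rule
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        splits.append((sorted(train), sorted(test)))
    return splits


# Parameters copied from the example above.
params = {"method": "SubSampling", "nSplits": 10, "stratifyInstitutions": 1,
          "testProportion": 0.33, "seed": 54288}

# Toy cohort (assumption): 20 patients from institution A, 10 from institution B.
institutions = {f"patient{i:02d}": ("A" if i < 20 else "B") for i in range(30)}

splits = subsample_splits(institutions, params["nSplits"],
                          params["testProportion"], params["seed"])
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 20 10
```

Because the generator is seeded, calling the function again with the same arguments returns the same splits, which is the purpose of the `seed` parameter.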

Data Cleaning Parameters

These parameters define the data cleaning process and are organized as follows:

{
    "method name": {
        "define parameters here"
    },
    "another method": {
        "define parameters here"
    }
}

Cleaning methods

Feature cleaning method name.
type	object
properties
default	Default cleaning method.
	type	string

Now let’s specify the parameters for the selected cleaning method; for instance, in the case of the default method:

Chosen method’s parameters

Feature cleaning parameters.
type	object
properties
continuous	Continuous feature cleaning parameters.
	type	object
	properties
	missingCutoffps	Maximum percentage (expressed as a fraction, e.g. `0.25`) of missing features per sample. Samples with more missing features than this cut-off will be removed.
		type	float
	covCutoff	Minimum coefficient of variation over samples per variable. Variables with a lower coefficient of variation than this cut-off will be removed.
		type	float
	missingCutoffpf	Maximum percentage (expressed as a fraction, e.g. `0.1`) of missing samples per variable. Features with more missing samples than this cut-off will be removed.
		type	float
	imputation	Imputation method for missing values. Default is `mean`.
		type	string
		options
		`mean`	Impute missing values with the mean of the feature.
			type	string
		`median`	Impute missing values with the median of the feature.
			type	string
		`random`	Impute missing values with a random value from the feature set.
			type	string

Example

{
    "default": {
        "feature": {
            "continuous": {
                "missingCutoffps": 0.25,
                "covCutoff": 0.1,
                "missingCutoffpf": 0.1,
                "imputation": "mean"
            }
        }
    }
}
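The following Python sketch illustrates how the three cut-offs and the imputation option could act on a small feature table. It is an assumption-laden illustration, not the pipeline's code: the function `clean_continuous`, the order of the steps, the use of the population standard deviation, the handling of a zero mean and the toy table are all hypothetical, and the `random` imputation option is left out.

```python
from statistics import mean, median, pstdev

IMPUTERS = {"mean": mean, "median": median}  # "random" is omitted in this sketch


def coefficient_of_variation(values):
    mu = mean(values)
    if mu == 0:
        return float("inf")  # undefined; this sketch chooses to keep such features
    return pstdev(values) / abs(mu)


def clean_continuous(table, params):
    """Clean a {sample_id: [value or None, ...]} table and return a new table."""
    n_features = len(next(iter(table.values())))

    # Step 1 (assumed order): drop samples with too many missing features.
    kept = {sample: row for sample, row in table.items()
            if sum(v is None for v in row) / n_features <= params["missingCutoffps"]}

    # Step 2: keep features that are present often enough and vary enough.
    impute = IMPUTERS[params["imputation"]]
    fill_values = {}  # column index -> value used to fill the gaps
    for j in range(n_features):
        present = [row[j] for row in kept.values() if row[j] is not None]
        if not present:
            continue
        missing_fraction = (len(kept) - len(present)) / len(kept)
        if missing_fraction > params["missingCutoffpf"]:
            continue
        if coefficient_of_variation(present) < params["covCutoff"]:
            continue
        fill_values[j] = impute(present)

    # Step 3: impute the remaining gaps in the retained features.
    return {sample: [row[j] if row[j] is not None else fill_values[j]
                     for j in fill_values]
            for sample, row in kept.items()}


params = {"missingCutoffps": 0.25, "covCutoff": 0.1,
          "missingCutoffpf": 0.1, "imputation": "mean"}

# Toy table (assumption): 4 features; the second one is nearly constant.
table = {f"s{i}": [float(i), 10.0 + 0.01 * i, float(2 * i), float(i * i)]
         for i in range(1, 11)}
table["s3"][2] = None                   # one gap: sample kept, value imputed
table["s11"] = [None, None, 1.0, 1.0]   # half missing: sample removed

cleaned = clean_continuous(table, params)
```

In this toy run, sample `s11` is removed by `missingCutoffps`, the nearly constant second feature is removed by `covCutoff`, and the single gap in `s3` is filled with the mean of the remaining values of that feature.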

Note

You can add as many methods as you want, for other feature types (categorical, ordinal, etc.) and for other cleaning methods (e.g. PCA).

Data Normalization Parameters

Data normalization aims to remove batch effects from the data. These parameters define the data normalization process and are organized as follows:

{
    "standardCombat": {
        "define parameters here"
    }
}

Chosen method parameters

Normalization method name.
type	string
options
`standardCombat`	Standard Combat normalization method.
	type	string

Note

For now only the standardCombat method is available and it does not require any parameters.

Feature Set Reduction Parameters

Feature set reduction reduces the number of features in the data by removing correlated features, selecting important features, etc. These parameters define the feature set reduction process and are organized as follows:

{
    "selected method": {
        "define parameters here"
    }
}

method name

Feature set reduction method name.
type	string
options
`FDA`	False discovery avoidance method. Read the paper.
	type	string
`FDAbalanced`	Balanced version of the False discovery avoidance method, where the selected number of features is the same for each table.
	type	string

Now let’s specify the parameters for the selected feature set reduction method; for instance, in the case of the FDA method:

FDA method

Feature set reduction parameters.
type	object
properties
FDA	FDA method’s parameters.
	type	object
	properties
	nSplits	Number of splits to use for the FDA algorithm.
		type	int
	corrType	Type of correlation to use for the FDA algorithm. Default is `Spearman`.
		type	string
		options
		`Spearman`	Spearman correlation.
			type	string
		`Pearson`	Pearson correlation.
			type	string
	threshStableStart	Stability threshold used to cut off the unstable features at the beginning of the FDA algorithm.
		type	float
	threshInterCorr	Threshold used to cut off the inter-correlated features.
		type	float
	minNfeatStable	Minimum number of stable features to keep before inter-correlation step.
		type	int
	minNfeatInterCorr	Minimum number of inter-correlated features to keep.
		type	int
	minNfeat	Minimum number of features to keep at the end of the FDA algorithm.
		type	int
	seed	Seed for the random number generator.
		type	int

Example

{
    "FDA": {
        "nSplits": 100,
        "corrType": "Spearman",
        "threshStableStart": 0.5,
        "threshInterCorr": 0.7,
        "minNfeatStable": 100,
        "minNfeatInterCorr": 60,
        "minNfeat": 5,
        "seed": 54288
    }
}
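As an intuition for what a cut-off such as `threshInterCorr` does, here is a minimal Python sketch of a greedy inter-correlation filter. It is not the FDA algorithm and does not reproduce its stability step or its `minNfeat*` safeguards. The function names, the greedy strategy, the use of Pearson correlation (the example above uses `Spearman`) and the toy features are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def drop_intercorrelated(features, ranking, thresh_inter_corr):
    """Walk features from best to worst rank and keep a feature only if its
    absolute correlation with every already-kept feature stays at or below
    the threshold."""
    kept = []
    for name in ranking:
        if all(abs(pearson(features[name], features[k])) <= thresh_inter_corr
               for k in kept):
            kept.append(name)
    return kept


# Toy features (assumption): "surface" is almost a linear function of "volume".
features = {
    "volume": [1, 2, 3, 4, 5, 6],
    "surface": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "sphericity": [0.9, 0.2, 0.7, 0.1, 0.8, 0.3],
}
ranking = ["volume", "surface", "sphericity"]  # hypothetical importance order

kept = drop_intercorrelated(features, ranking, thresh_inter_corr=0.7)
print(kept)  # ['volume', 'sphericity']
```

A lower threshold removes more features, which is presumably why the configuration also exposes minimum feature counts such as `minNfeatInterCorr`.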

Note

Only FDA and FDAbalanced methods are available for now and they share the same parameters.

Machine Learning Parameters

These parameters define the machine learning process, the algorithm and its parameters. They are organized as follows:

{
    "selected algorithm": {
        "define parameters here"
    }
}

Now let’s specify the parameters for the selected machine learning algorithm; for instance, in the case of the XGBoost algorithm:

ML Algorithm

Machine learning algorithm name.
type	object
properties
XGBoost	XGBoost algorithm.
	type	object
	properties
	varImportanceThreshold	Variable importance threshold. Default is `0.3`. Variables with importance below this threshold will be removed.
		type	float
	optimalThreshold	If `null`, the optimal threshold will be computed. Default is `0.5`.
		type	float
	optimizationMetric	Model’s optimization metric. Default is `AUC`. Only used if `method` is `pycaret`.
		type	string
	method	Method to use for the XGBoost algorithm. Default is `pycaret`.
		type	string
		options
		`pycaret`	Automated using PyCaret.
			type	string
		`random_search`	Random search using a pre-defined grid of parameters.
			type	string
		`grid_search`	Grid search using a pre-defined grid of parameters.
			type	string
	nameSave	Name of the file to save the model.
		type	string
	seed	Seed for the random number generator.
		type	int

Example

{
    "XGBoost": {
        "varImportanceThreshold": 0.3,
        "optimalThreshold": null,
        "optimizationMetric": "AUC",
        "method": "pycaret",
        "nameSave": "XGBoost03AUC",
        "seed": 54288
    }
}
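The documented defaults and allowed `method` values can be checked before a run. The sketch below is one hypothetical way to do this in Python: the default values and option names come from the table above, while the function `resolve_xgboost_settings` and its error handling are assumptions and not part of the pipeline.

```python
import json

# Defaults and options as documented above.
XGBOOST_DEFAULTS = {
    "varImportanceThreshold": 0.3,
    "optimalThreshold": 0.5,
    "optimizationMetric": "AUC",
    "method": "pycaret",
}
ALLOWED_METHODS = {"pycaret", "random_search", "grid_search"}


def resolve_xgboost_settings(config_text):
    """Parse a JSON config string and fill in the documented defaults."""
    user = json.loads(config_text)["XGBoost"]
    settings = {**XGBOOST_DEFAULTS, **user}  # explicit user values win
    if settings["method"] not in ALLOWED_METHODS:
        raise ValueError(f"Unknown method: {settings['method']!r}")
    return settings


full = resolve_xgboost_settings("""
{
    "XGBoost": {
        "varImportanceThreshold": 0.3,
        "optimalThreshold": null,
        "optimizationMetric": "AUC",
        "method": "pycaret",
        "nameSave": "XGBoost03AUC",
        "seed": 54288
    }
}
""")

# A shorter, hypothetical config that relies on the defaults.
minimal = resolve_xgboost_settings('{"XGBoost": {"nameSave": "model", "seed": 1}}')
```

Note that JSON `null` becomes Python `None`, so an explicit `"optimalThreshold": null` stays distinct from an omitted key, which falls back to the default `0.5`.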

Note

Only the XGBoost algorithm is available for now.

Variables Definition

These parameters define the variables to use for the machine learning process and are organized as follows:

{
    "selected variable": {
        "define parameters here"
    },
    "combinations": [
        "Insert combinations of variables here"
    ]
}

Variables

Variables to use for the machine learning process.
type	object
properties
combinations	List of variable combinations to use for the study.
	type	List[str]

For the selected variable, you can specify the following parameters:

selected variable

Variable name to use for the machine learning process.
type	object
properties
nameType	Type of variable to use. Must contain `Radiomics` for radiomics features.
	type	string
path	Path to the variable file. Use `"setToFolderNameinWorkspace"` to set the features folder to `FolderName` in the workspace.
	type	string
scans	List of scans to use for the variable. For example, `T1C`.
	type	List[str]
rois	List of ROIs to include in the study (will be used to identify the features file). For example, `GTV`.
	type	List[str]
imSpaces	Radiomics level; the features file name must end with this level. For example, `morph`.
	type	List[str]
var_datacleaning	Data cleaning method to use for the variable. Default is `default`.
	type	string
var_normalization	Data normalization method to use for the variable. Default is `combat`.
	type	string
var_fSetReduction	Feature set reduction method to use for the variable. Default is `FDA`.
	type	string

Example

{
    "var1": {
        "nameType": "RadiomicsMorph",
        "path": "setToMyFeaturesInWorkspace",
        "scans": ["T1CE"],
        "rois": ["GTV"],
        "imSpaces": ["morph"],
        "var_datacleaning": "default",
        "var_normalization": "combat",
        "var_fSetReduction": "FDA"
    },
    "combinations": [
        "var1"
    ]
}
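The documented defaults for the `var_*` fields and the `Radiomics` naming rule can be applied as in the Python sketch below. The default values and the naming rule come from the table above. The function `resolve_variables` and the `isRadiomics` key it adds are hypothetical and only serve the illustration.

```python
# Defaults as documented above.
VARIABLE_DEFAULTS = {
    "var_datacleaning": "default",
    "var_normalization": "combat",
    "var_fSetReduction": "FDA",
}


def resolve_variables(config):
    """Return {variable name: settings} with defaults filled in.

    The "combinations" entry is skipped because it is not a variable definition.
    """
    variables = {}
    for name, spec in config.items():
        if name == "combinations":
            continue
        resolved = {**VARIABLE_DEFAULTS, **spec}
        # Radiomics variables are recognized by their nameType (documented rule);
        # the "isRadiomics" key itself is a hypothetical addition.
        resolved["isRadiomics"] = "Radiomics" in resolved["nameType"]
        variables[name] = resolved
    return variables


# Same variable as in the example above, with the var_* fields left out.
config = {
    "var1": {
        "nameType": "RadiomicsMorph",
        "path": "setToMyFeaturesInWorkspace",
        "scans": ["T1CE"],
        "rois": ["GTV"],
        "imSpaces": ["morph"],
    },
    "combinations": ["var1"],
}

variables = resolve_variables(config)
```

Here `variables["var1"]` ends up with the same three `var_*` values as the full example, since those values match the documented defaults.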