Configuration#
Example of the Gordo model configuration:
evaluation:
  cv:
    sklearn.model_selection.TimeSeriesSplit:
      n_splits: 5
model:
  gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector:
    base_estimator:
      sklearn.pipeline.Pipeline:
        steps:
          - sklearn.preprocessing.MinMaxScaler
          - gordo.machine.model.models.KerasAutoEncoder:
              batch_size: 128
              compression_factor: 0.6
              encoding_layers: 1
              epochs: 100
              func: tanh
              kind: feedforward_hourglass
              loss: mse
              optimizer: Adam
              out_func: linear
              validation_split: 0.1
    scaler: sklearn.preprocessing.MinMaxScaler
    shuffle: true
    smoothing_method: smm
We can deserialize this configuration into a model object using the gordo.serializer.serializer module.
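To illustrate the idea behind this kind of deserialization, here is a minimal sketch that turns a nested definition of the form {dotted.path.Class: {parameter: value}} into a Python object. It is not Gordo's implementation: resolve and build are hypothetical helpers, standard-library classes stand in for the scikit-learn and Gordo classes so the snippet runs without extra dependencies, and bare dotted-path strings (such as the MinMaxScaler step above) are left out for brevity.

```python
import importlib


def resolve(dotted_path):
    """Import and return the object named by a path like 'package.module.Class'."""
    module_name, _, attribute = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attribute)


def build(node):
    """Hypothetical helper, not part of Gordo: build an object from a config node."""
    if isinstance(node, list):
        return [build(item) for item in node]
    if isinstance(node, dict) and len(node) == 1:
        (key, params), = node.items()
        if isinstance(key, str) and "." in key:
            # {"pkg.mod.Class": {param: value}} -> Class(param=built_value, ...)
            kwargs = {name: build(value) for name, value in (params or {}).items()}
            return resolve(key)(**kwargs)
    return node  # plain values (numbers, strings such as "tanh") pass through


# Stand-in definition with the same nesting pattern as the model config:
# an outer class whose parameter is itself defined by a class path.
definition = {
    "datetime.timezone": {
        "offset": {"datetime.timedelta": {"hours": 2}},
    }
}
tz = build(definition)
print(tz.utcoffset(None))
```

The recursion is what lets a single config describe a wrapper, the pipeline inside it, and each pipeline step in one nested structure.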
A Gordo model is typically wrapped by the class gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector.
This class holds generic methods for model cross-validation, training and fitting.
In turn, this class is wrapped by the class gordo.builder.build_model.ModelBuilder, which is the top-level
We will focus on the output created by the sklearn.model_selection.cross_validate() function,
which comes into play when the model is used for predictions.
The gordo.machine.machine.Machine class holds essentially all the information contained in a single Gordo config.
The gordo.builder.build_model.ModelBuilder class takes a gordo.machine.machine.Machine and does the heavy lifting
when it comes to data fetching, cross-validation and model training.
Evaluation specification#
Alongside the ML model itself, all aspects of the cross-validation evaluation are parameterized in the config:
- evaluation:
    cv:
      sklearn.model_selection.TimeSeriesSplit:
        n_splits: 3
    cv_mode: full_build
    scoring_scaler: sklearn.preprocessing.MinMaxScaler
    metrics:
      - explained_variance_score
      - r2_score
      - mean_squared_error
      - mean_absolute_error
Alternatively, cv_mode can be set to cross_val_only, in which case the final model is not fitted.
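The four names under metrics match functions of the same name in sklearn.metrics. As a rough guide to what they measure, the following pure-Python sketch computes them for a single output. It assumes that mapping, ignores the scoring_scaler step and multi-output averaging, and regression_metrics is a hypothetical helper rather than anything Gordo provides.

```python
def mean(values):
    return sum(values) / len(values)


def regression_metrics(y_true, y_pred):
    """Single-output versions of the four metrics listed in the config."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(y_true)
    y_mean = mean(y_true)
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)   # total sum of squares
    ss_res = sum(e ** 2 for e in errors)              # residual sum of squares
    err_mean = mean(errors)
    var_err = mean([(e - err_mean) ** 2 for e in errors])
    var_true = ss_tot / n
    return {
        # 1 - Var(error) / Var(target): insensitive to a constant bias
        "explained_variance_score": 1 - var_err / var_true,
        # 1 - SS_res / SS_tot: penalizes bias as well as scatter
        "r2_score": 1 - ss_res / ss_tot,
        "mean_squared_error": ss_res / n,
        "mean_absolute_error": mean([abs(e) for e in errors]),
    }


scores = regression_metrics(
    y_true=[3.0, -0.5, 2.0, 7.0],
    y_pred=[2.5, 0.0, 2.0, 8.0],
)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The first two metrics are scores where higher is better (at most 1), while the last two are errors where lower is better.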
Cross-validation methods#
With cv set to sklearn.model_selection.TimeSeriesSplit, the dataset is split as depicted below.
Across all splits, the test set is always the same size; only the training set grows from one split to the next.
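The splitting scheme can also be spelled out in a few lines of code. The sketch below is a pure-Python approximation of the default behaviour of TimeSeriesSplit (no gap, no maximum training size), written so it runs without scikit-learn; time_series_split is a hypothetical helper, and in a real build the scikit-learn class performs the split.

```python
def time_series_split(n_samples, n_splits):
    """Yield (train_indices, test_indices) with an expanding training window."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(test_start))                       # everything before the test block
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(time_series_split(n_samples=12, n_splits=3))
for fold, (train, test) in enumerate(splits):
    print(
        f"fold {fold}: train rows {train[0]}..{train[-1]} ({len(train)}), "
        f"test rows {test[0]}..{test[-1]} ({len(test)})"
    )
```

Each test block lies strictly after its training rows, so no fold is evaluated on data older than what it was trained on.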
An alternative is to use k-fold cross-validation. Here, one can choose to shuffle the data before it is split into folds. In contrast to the time-series split above, which extends the training data in each fold with time-consecutive observations, this method is decoupled from the time dimension. This must be taken into account when comparing results from different folds.
The parameters can then be set as follows:
- evaluation:
    cv:
      sklearn.model_selection.KFold:
        n_splits: 3
        shuffle: True
        random_state: 0
Borrowed from scikit-learn, which performs the actual split/train for us.
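For comparison with the time-series split, here is a pure-Python sketch of shuffled k-fold splitting. It follows the general k-fold scheme (every observation lands in exactly one test fold) but uses Python's random module, so the concrete permutation will differ from what scikit-learn's KFold produces for the same random_state; k_fold is a hypothetical helper for illustration only.

```python
import random


def k_fold(n_samples, n_splits, shuffle=False, random_state=None):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    if shuffle:
        # Seeded shuffle so the folds are reproducible between runs
        random.Random(random_state).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # first folds absorb the remainder
        test = sorted(indices[start:start + size])
        train = sorted(indices[:start] + indices[start + size:])
        yield train, test
        start += size


folds = list(k_fold(n_samples=10, n_splits=3, shuffle=True, random_state=0))
for fold, (train, test) in enumerate(folds):
    print(f"fold {fold}: test rows {test}")
```

With shuffle enabled, the test rows of a fold are scattered across the whole time range and the training rows include observations recorded after them, which is why fold results are not directly comparable to those of the time-series split.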