Configuration

Example of the Gordo model configuration:

evaluation:
  cv:
    sklearn.model_selection.TimeSeriesSplit:
      n_splits: 5
model:
  gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector:
    base_estimator:
      sklearn.pipeline.Pipeline:
        steps:
          - sklearn.preprocessing.MinMaxScaler
          - gordo.machine.model.models.KerasAutoEncoder:
              batch_size: 128
              compression_factor: 0.6
              encoding_layers: 1
              epochs: 100
              func: tanh
              kind: feedforward_hourglass
              loss: mse
              optimizer: Adam
              out_func: linear
              validation_split: 0.1
    scaler: sklearn.preprocessing.MinMaxScaler
    shuffle: true
    smoothing_method: smm

We can deserialize this configuration into a model object using the gordo.serializer.serializer module.
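The general idea of turning such a nested definition into objects can be illustrated with a small standard-library sketch. This is not Gordo's serializer: `locate` and `build` are hypothetical helpers, and the example uses standard-library classes in place of the scikit-learn and Gordo classes so that it runs without extra dependencies. It assumes that every string containing a dot is an importable path.

```python
import importlib
from typing import Any


def locate(dotted_path: str) -> Any:
    """Import and return the object a dotted path points to."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def build(definition: Any) -> Any:
    """Recursively instantiate a nested {dotted_path: kwargs} definition.

    Hypothetical helper for illustration; not Gordo's implementation.
    Any string containing a dot is treated as an importable path.
    """
    if isinstance(definition, str) and "." in definition:
        # A bare path (like the `scaler` entry) means "use default arguments".
        return locate(definition)()
    if isinstance(definition, list):
        return [build(item) for item in definition]
    if isinstance(definition, dict):
        if len(definition) == 1:
            (key, value), = definition.items()
            if isinstance(key, str) and "." in key:
                kwargs = {k: build(v) for k, v in (value or {}).items()}
                return locate(key)(**kwargs)
        return {k: build(v) for k, v in definition.items()}
    return definition  # numbers, booleans and plain strings pass through


# Same shape as the config above: one dotted path mapping to its keyword arguments.
definition = {
    "types.SimpleNamespace": {
        "scaler": "fractions.Fraction",
        "step": {"datetime.timedelta": {"hours": 1}},
        "shuffle": True,
    }
}
model = build(definition)
print(model)
```

The nesting mirrors the config: the outer key names the class to construct, and each value is either a literal, a bare path, or another single-key mapping that is built first and passed in as an argument.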

A Gordo model is typically wrapped by the class gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector. This class holds generic methods for model cross-validation, training and fitting.

In turn, this class is wrapped by the class gordo.builder.build_model.ModelBuilder, which is the top-level

We will focus on the output created by the sklearn.model_selection.cross_validate() function, which is used when using the model for predictions.

The gordo.machine.machine.Machine class holds essentially all the information contained in a single Gordo config. The gordo.builder.build_model.ModelBuilder class takes a gordo.machine.machine.Machine and does the heavy lifting when it comes to data fetching, cross-validation and model training.

Evaluation specification

Alongside the ML model itself, all aspects of the cross-validation evaluation are parameterized in the config:

- evaluation:
     cv:
       sklearn.model_selection.TimeSeriesSplit:
         n_splits: 3
     cv_mode: full_build
     scoring_scaler: sklearn.preprocessing.MinMaxScaler
     metrics:
     - explained_variance_score
     - r2_score
     - mean_squared_error
     - mean_absolute_error

Alternatively, cv_mode can be set to cross_val_only, which will not fit the final model.
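For reference, the four metric names listed above match functions in sklearn.metrics. The sketch below re-implements their single-output definitions in plain Python to show what each one measures. It is an illustration only: it ignores the scoring_scaler step, and how Gordo combines scores across several outputs is not covered here.

```python
from statistics import fmean, pvariance
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the squared residuals."""
    return fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the absolute residuals."""
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def explained_variance_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - Var(residuals) / Var(y_true); unlike R2, a constant bias is not penalised."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
scores = {
    "explained_variance_score": explained_variance_score(y_true, y_pred),
    "r2_score": r2_score(y_true, y_pred),
    "mean_squared_error": mean_squared_error(y_true, y_pred),
    "mean_absolute_error": mean_absolute_error(y_true, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The first two are scores where higher is better (at most 1), while the last two are errors where lower is better, which matters when comparing folds.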

Cross-validation methods

When cv is set to sklearn.model_selection.TimeSeriesSplit, the dataset is split as depicted below. The test set has the same size in every fold, while the training set grows to include all earlier observations.
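To make the expanding-window behaviour concrete, here is a plain-Python sketch of the index arithmetic. The function `time_series_split` is a hypothetical stand-in that follows the default behaviour of scikit-learn's TimeSeriesSplit as understood here (no gap, no maximum training size); in practice scikit-learn performs the split.

```python
from typing import Iterator, List, Tuple


def time_series_split(n_samples: int, n_splits: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists with an expanding training window."""
    # Every test fold has the same size; leftover samples go to the first training set.
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))  # everything before the test window
        test = list(range(test_start, test_start + test_size))
        yield train, test


for fold, (train, test) in enumerate(time_series_split(n_samples=10, n_splits=3)):
    print(f"fold {fold}: train={train} test={test}")
# fold 0: train=[0, 1, 2, 3] test=[4, 5]
# fold 1: train=[0, 1, 2, 3, 4, 5] test=[6, 7]
# fold 2: train=[0, 1, 2, 3, 4, 5, 6, 7] test=[8, 9]
```

Each test window lies strictly after its training data, so no fold is evaluated on observations that precede what it was trained on.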

An alternative is to use k-fold cross-validation. Here, one can choose to shuffle the data before it is split into folds. In contrast to the time-series split above, which augments the considered data in each fold with time-consecutive observations, this method is decoupled from the time dimension. This must be considered when comparing results from different folds.

The parameters can then be set as follows:

- evaluation:
     cv:
       sklearn.model_selection.KFold:
         n_splits: 3
         shuffle: True
         random_state: 0

Borrowed from scikit-learn, which performs the actual split/train for us.
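As a rough illustration of what the shuffled k-fold settings above do, the following plain-Python sketch partitions the sample indices into folds. The function `k_fold` is a hypothetical stand-in: it uses Python's random module, so the permutation for a given random_state will differ from the one scikit-learn's KFold produces, but the fold sizes and the partition property are the same.

```python
import random
from typing import Iterator, List, Optional, Tuple


def k_fold(
    n_samples: int,
    n_splits: int,
    shuffle: bool = False,
    random_state: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    if shuffle:
        # Seeded shuffle, so the same random_state always gives the same folds.
        random.Random(random_state).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = sorted(indices[start:start + size])
        train = sorted(indices[:start] + indices[start + size:])
        yield train, test
        start += size


for fold, (train, test) in enumerate(k_fold(10, n_splits=3, shuffle=True, random_state=0)):
    print(f"fold {fold}: train={train} test={test}")
```

With shuffling enabled, a training set can contain observations from both before and after its test set, which is the decoupling from the time dimension described above.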