Anomaly Models#

Models which implement gordo.machine.model.anomaly.base.AnomalyDetectorBase.anomaly() and can be served under the model server's POST /prediction endpoint.

AnomalyDetectorBase#

The base class for all other anomaly detector models.

class gordo.machine.model.anomaly.base.AnomalyDetectorBase(**kwargs)[source]#

Bases: BaseEstimator, GordoBase

Initialize the model

abstract anomaly(X: DataFrame | DataArray, y: DataFrame | DataArray, frequency: timedelta | None = None) DataFrame | Dataset[source]#

Takes X, y and, optionally, frequency, and returns a dataframe containing anomaly score(s).
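The contract described above can be sketched without gordo installed. The snippet below is only an illustration: AnomalyDetectorSketch is a hypothetical stand-in for the abstract base, and MeanAbsDiffDetector and the "anomaly-score" column name are invented for this example, not taken from gordo.

```python
from abc import ABC, abstractmethod
from datetime import timedelta
from typing import Optional

import pandas as pd


class AnomalyDetectorSketch(ABC):
    """Hypothetical stand-in mirroring the anomaly() contract; not gordo's class."""

    @abstractmethod
    def anomaly(
        self,
        X: pd.DataFrame,
        y: pd.DataFrame,
        frequency: Optional[timedelta] = None,
    ) -> pd.DataFrame:
        """Return a dataframe containing anomaly score(s)."""


class MeanAbsDiffDetector(AnomalyDetectorSketch):
    """Toy detector that treats X itself as the prediction of y."""

    def anomaly(self, X, y, frequency=None):
        # Mean absolute difference per row between target and "prediction".
        score = (y - X.to_numpy()).abs().mean(axis=1)
        out = y.copy()
        out["anomaly-score"] = score  # hypothetical column name
        return out


X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
y = pd.DataFrame({"a": [1.0, 4.0], "b": [3.0, 8.0]})
result = MeanAbsDiffDetector().anomaly(X, y)
print(result)
```

A subclass that does not override anomaly() cannot be instantiated, which is the practical effect of marking the method abstract.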

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') AnomalyDetectorBase#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

DiffBasedAnomalyDetector#

Calculates the element-wise absolute differences between y and yhat, as well as the overall error between the two matrices via numpy.linalg.norm().

class gordo.machine.model.anomaly.diff.DiffBasedAnomalyDetector(base_estimator: BaseEstimator = tensorflow.keras.wrappers.scikit_learn.KerasRegressor, scaler: TransformerMixin = MinMaxScaler(), require_thresholds: bool = True, shuffle: bool = False, window: int | None = None, smoothing_method: str | None = None)[source]#

Bases: AnomalyDetectorBase

Estimator which wraps a base_estimator and provides a diff error based approach to anomaly detection.

It fits a scaler to the target after training, purely for error calculations. The underlying base_estimator is trained on the original, unscaled y.

Threshold calculation is based on a rolling statistic of the validation errors on the last fold of cross-validation.
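The error calculation described above can be sketched with NumPy alone. This is a minimal sketch under stated assumptions, not gordo's implementation: diff_anomaly_scores is a hypothetical helper, the min-max scaling is done by hand and fitted on y only, and taking the norm per row (axis=1) is an assumption about how the matrices are reduced.

```python
import numpy as np


def diff_anomaly_scores(y, yhat):
    """Return per-feature absolute errors and a per-row norm of the error."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)

    # Fit a min-max scaling on the target only, then apply it to both arrays.
    low = y.min(axis=0)
    span = y.max(axis=0) - low
    span[span == 0] = 1.0  # avoid dividing by zero for constant columns
    y_scaled = (y - low) / span
    yhat_scaled = (yhat - low) / span

    per_feature = np.abs(yhat_scaled - y_scaled)
    total = np.linalg.norm(yhat_scaled - y_scaled, axis=1)
    return per_feature, total


y = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
yhat = np.array([[0.0, 10.0], [1.0, 30.0], [2.0, 30.0]])
per_feature, total = diff_anomaly_scores(y, yhat)
print(per_feature)
print(total)
```

Scaling both arrays with statistics from y keeps features with large raw magnitudes from dominating the combined error.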

Parameters:
  • base_estimator – The model whose normal .fit and .predict methods will be used. Defaults to gordo.machine.model.models.KerasAutoEncoder with kind='feedforward_hourglass'.

  • scaler – Defaults to sklearn.preprocessing.MinMaxScaler, as shown in the class signature. Used for transforming the model output and the original y to calculate the difference/error between the model output and the expected output.

  • require_thresholds – Requires calculating thresholds_ via a call to cross_validate(). If this is set (the default) but cross_validate() was not called before anomaly(), an AttributeError is raised.

  • shuffle – Flag indicating whether to shuffle the data in .fit so that the model, if relevant, will be trained on a sample of data across the time range and not just the last elements according to model arg validation_split.

  • window – Window size for smoothed thresholds

  • smoothing_method – Method to be used together with window to smooth metrics. Must be one of: ‘smm’: simple moving median, ‘sma’: simple moving average or ‘ewma’: exponential weighted moving average.
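The three smoothing methods named above map naturally onto pandas rolling and exponentially weighted windows. The sketch below assumes that mapping, and assumes that window is passed as the span for 'ewma'; smooth is a hypothetical helper, not a gordo function.

```python
import pandas as pd


def smooth(metric: pd.Series, window: int, smoothing_method: str) -> pd.Series:
    """Smooth a metric series with one of 'smm', 'sma' or 'ewma'."""
    if smoothing_method == "smm":  # simple moving median
        return metric.rolling(window).median()
    if smoothing_method == "sma":  # simple moving average
        return metric.rolling(window).mean()
    if smoothing_method == "ewma":  # exponential weighted moving average
        return metric.ewm(span=window).mean()
    raise ValueError(f"Unknown smoothing_method: {smoothing_method!r}")


# One large spike at the end of an otherwise smooth error series.
errors = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
smm = smooth(errors, window=3, smoothing_method="smm")
sma = smooth(errors, window=3, smoothing_method="sma")
ewma = smooth(errors, window=3, smoothing_method="ewma")
print(pd.DataFrame({"errors": errors, "smm": smm, "sma": sma, "ewma": ewma}))
```

In this toy series the moving median stays at 4.0 on the last row while the moving average jumps to about 35.7, which shows why the choice of method affects how strongly single spikes influence the smoothed metric.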

anomaly(X: DataFrame | DataArray, y: DataFrame | DataArray, frequency: timedelta | None = None) DataFrame | Dataset[source]#

Create an anomaly dataframe from the base provided dataframe.

Parameters:
  • X – Dataframe representing the data to go into the model.

  • y – Dataframe representing the target output of the model.

Returns:

A superset of the original base dataframe with added anomaly-specific features.

cross_validate(*, X: DataFrame | ndarray, y: DataFrame | ndarray, cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None), **kwargs)[source]#

Runs time-series cross-validation on the model and updates the model's threshold values based on the cross-validation folds.
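To make the default splitter concrete, the sketch below shows how scikit-learn's TimeSeriesSplit with n_splits=3 partitions eight samples. It illustrates only the splitting, not how gordo derives thresholds from the folds; the toy array is made up for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # eight time-ordered samples
cv = TimeSeriesSplit(n_splits=3)

folds = [(train.tolist(), test.tolist()) for train, test in cv.split(X)]
for train, test in folds:
    print("train:", train, "validate:", test)
# train: [0, 1] validate: [2, 3]
# train: [0, 1, 2, 3] validate: [4, 5]
# train: [0, 1, 2, 3, 4, 5] validate: [6, 7]
```

Each validation block lies strictly after its training data, so the last fold validates on the most recent samples.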

fit(X: ndarray, y: ndarray)[source]#

get_metadata()[source]#

Generates model metadata.

get_params(deep=True)[source]#

Get parameters for this estimator.

score(X: ndarray | DataFrame, y: ndarray | DataFrame, sample_weight: ndarray | None = None) float[source]#

Score the model; must implement the correct default scorer based on model type

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') DiffBasedAnomalyDetector#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

class gordo.machine.model.anomaly.diff.DiffBasedKFCVAnomalyDetector(base_estimator: BaseEstimator = tensorflow.keras.wrappers.scikit_learn.KerasRegressor, scaler: TransformerMixin = MinMaxScaler(), require_thresholds: bool = True, shuffle: bool = True, window: int = 144, smoothing_method: str = 'smm', threshold_percentile: float = 0.99)[source]#

Bases: DiffBasedAnomalyDetector

Estimator which wraps a base_estimator and provides a diff error based approach to anomaly detection.

It trains a scaler to the target after training, purely for error calculations. The underlying base_estimator is trained with the original, unscaled, y.

Threshold calculation is based on a percentile of the smoothed validation errors as calculated from cross-validation predictions.

Parameters:
  • base_estimator – The model whose normal .fit and .predict methods will be used. Defaults to gordo.machine.model.models.KerasAutoEncoder with kind='feedforward_hourglass'.

  • scaler – Defaults to sklearn.preprocessing.MinMaxScaler, as shown in the class signature. Used for transforming the model output and the original y to calculate the difference/error between the model output and the expected output.

  • require_thresholds – Requires calculating thresholds_ via a call to cross_validate(). If this is set (the default) but cross_validate() was not called before anomaly(), an AttributeError is raised.

  • shuffle – Flag indicating whether to shuffle the data in .fit so that the model, if relevant, will be trained on a sample of data across the time range and not just the last elements according to model arg validation_split.

  • window – Window size for smooth metrics and threshold calculation.

  • smoothing_method – Method to be used together with window to smooth metrics. Must be one of: ‘smm’: simple moving median, ‘sma’: simple moving average or ‘ewma’: exponential weighted moving average.

  • threshold_percentile – Percentile of the validation data to be used to calculate the threshold.
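The interaction of window, smoothing_method and threshold_percentile can be sketched as follows. This is a simplified illustration assuming 'smm' smoothing and pandas' default linear quantile interpolation; percentile_threshold is a hypothetical helper and the error values are made up.

```python
import pandas as pd


def percentile_threshold(
    errors: pd.Series, window: int, threshold_percentile: float
) -> float:
    """Smooth validation errors with a moving median, then take a percentile."""
    smoothed = errors.rolling(window).median()
    # quantile() skips the NaN values produced by the warm-up of the window.
    return float(smoothed.quantile(threshold_percentile))


errors = pd.Series([float(i) for i in range(1, 11)])  # 1.0 .. 10.0
threshold = percentile_threshold(errors, window=3, threshold_percentile=0.99)
print(threshold)
```

A percentile below 1.0 leaves the most extreme smoothed validation errors above the threshold, so a few outliers in the validation data do not inflate it.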

cross_validate(*, X: DataFrame | ndarray, y: DataFrame | ndarray, cv=KFold(n_splits=5, random_state=0, shuffle=True), **kwargs)[source]#

Runs K-fold cross-validation on the model and updates the model's threshold values based on a percentile of the validation metrics.
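A short sketch of the default splitter shown in the signature, KFold(n_splits=5, shuffle=True, random_state=0), on ten toy samples. It demonstrates only that every sample lands in exactly one validation fold; how gordo turns the resulting validation metrics into thresholds is not shown here.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Collect the validation indices of every fold.
validated = np.concatenate([test for _, test in cv.split(X)])

# Every sample is validated exactly once across the five folds.
covered_once = sorted(validated.tolist()) == list(range(10))
print(covered_once)
```

Unlike the time-series split, the shuffled folds draw validation samples from across the whole time range.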

get_metadata()[source]#

Generates model metadata.

get_params(deep=True)[source]#

Get parameters for this estimator.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') DiffBasedKFCVAnomalyDetector#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object