Ensemble Learning for Anomaly Detection | by Alex Davis | Oct, 2024

Let's look at a simple example that uses the isolation forest model to detect anomalies in time-series data. Below, we have imported a sales data set that contains the date of each order, information about the product, geographical information about the customer, and the amount of the sale. To keep this example simple, let's look at just one feature (sales) over time.

See the data here: https://www.kaggle.com/datasets/rohitsahoo/sales-forecasting (GPL 2.0)

# packages for data manipulation
import pandas as pd
from datetime import datetime

# packages for modeling
from sklearn.ensemble import IsolationForest

# packages for data visualization
import matplotlib.pyplot as plt

# import sales data
sales = pd.read_excel("Data/Sales Data.xlsx")

# subset to date and sales (copy so later edits do not touch a view of the original)
income = sales[['Order Date', 'Sales']].copy()
income.head()

Image by the author

As you can see above, we have the total sale amount for every order on a particular day. Since we have a sufficient amount of data (4 years' worth), let's try to detect months where total sales are noticeably higher or lower than expected.

First, we need to do some preprocessing and sum the sales for every month. Then we visualize monthly sales.

# convert the order date to a monthly period (assumes the column was already parsed as dates)
income['Order Date'] = pd.to_datetime(income['Order Date']).dt.to_period('M')

# sum sales by month and year
income = income.groupby('Order Date').sum()

# format the date index as month-year strings
income.index = income.index.strftime('%m-%Y')

# set the figure size
plt.figure(figsize=(8, 5))

# create the line chart (the dates are now the index, not a column)
plt.plot(income.index, income['Sales'])

# add labels and a title
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.title('Monthly Sales')

# rotate x-axis labels by 90 degrees for better visibility
plt.xticks(rotation=90)

# show the chart
plt.show()

Image by the author

Using the line chart above, we can see that while sales fluctuate from month to month, total sales trend upward over time. Ideally, our model will identify months where total sales fluctuate more than expected and are highly influential to the overall trend.

Now we need to initialize and fit our model. The model below uses the default parameters, except for contamination, which is set to 0.1 (the scikit-learn default is 'auto'). I've highlighted these parameters as they are the most important to the model's performance.

  • n_estimators: The number of base estimators in the ensemble.
  • max_samples: The number of samples to draw from X to train each base estimator (if “auto”, then max_samples = min(256, n_samples)).
  • contamination: The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the scores of the samples.
  • max_features: The number of features to draw from X to train each base estimator.

# set up the isolation forest model and fit it to the sales
model = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.1, max_features=1.0)
model.fit(income[['Sales']])

Next, let's use the model to show the anomalies and their anomaly scores. The anomaly score is the mean measure of normality of an observation among the base estimators. The lower the score, the more abnormal the observation. Negative scores represent outliers, positive scores represent inliers. The predict method turns this into a label: -1 for outliers and 1 for inliers.

# add anomaly scores and predictions
income['scores'] = model.decision_function(income[['Sales']])
income['anomaly'] = model.predict(income[['Sales']])

Image by the author
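To read the results, it can help to pull out only the flagged rows and sort them by score. The sketch below is one way to do that, not necessarily how the table above was produced. It runs on hypothetical monthly totals (the `demo` frame, its injected spike, and `random_state=0` are assumptions made so the example is self-contained and repeatable) and uses the same score and label columns as above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# hypothetical monthly totals standing in for the real sales data
rng = np.random.default_rng(0)
months = pd.period_range("2015-01", periods=48, freq="M").strftime("%m-%Y")
demo = pd.DataFrame({"Sales": rng.normal(50_000, 8_000, size=48)}, index=months)
demo.iloc[10, 0] = 150_000  # inject one obvious spike for illustration

# same settings as in the article, plus a fixed seed for repeatability
model = IsolationForest(n_estimators=100, contamination=0.1, random_state=0)
model.fit(demo[["Sales"]])

demo["scores"] = model.decision_function(demo[["Sales"]])
demo["anomaly"] = model.predict(demo[["Sales"]])

# predict() labels a row -1 exactly when its decision_function score is negative
flagged = demo[demo["anomaly"] == -1].sort_values("scores")
print(flagged)
```

With the article's data, the same filter would be `income[income['anomaly'] == -1]`.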

Finally, let's bring up the same line chart from before, but highlight the anomalies with plt.scatter.
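The plotting code for this chart is not shown, so here is a minimal sketch of how it could look. The helper name `plot_anomalies`, the red markers, and the hypothetical `demo` data are all assumptions for illustration. It expects a frame indexed by month with a `Sales` column and an `anomaly` column holding the -1/1 labels from predict.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest


def plot_anomalies(df):
    """Line chart of monthly sales with the flagged months drawn as red points."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(df.index, df["Sales"], label="Total sales")

    # keep only the rows the model labelled as outliers (-1)
    outliers = df[df["anomaly"] == -1]
    ax.scatter(outliers.index, outliers["Sales"], color="red", zorder=3, label="Anomaly")

    ax.set_xlabel("Month")
    ax.set_ylabel("Total Sales")
    ax.set_title("Monthly Sales with Anomalies")
    ax.tick_params(axis="x", rotation=90)
    ax.legend()
    return ax


# hypothetical monthly totals standing in for the real sales data
rng = np.random.default_rng(0)
months = pd.period_range("2015-01", periods=48, freq="M").strftime("%m-%Y")
demo = pd.DataFrame({"Sales": rng.normal(50_000, 8_000, size=48)}, index=months)

model = IsolationForest(contamination=0.1, random_state=0)
demo["anomaly"] = model.fit_predict(demo[["Sales"]])

ax = plot_anomalies(demo)
```

Call plt.show() afterwards to display the chart. With the article's data, the equivalent call would be plot_anomalies(income) once the anomaly column has been added.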

Image by the author

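The model appears to do well. Since the data fluctuates so much from month to month, a worry could be that inliers would get marked as anomalies, but that does not appear to be the case here. Each tree isolates points with random splits and the scores are averaged across the whole ensemble, so ordinary month-to-month variation is not enough to be flagged; with contamination set to 0.1, only roughly the 10% of months with the lowest scores are. The anomalies appear to be the larger fluctuations where sales deviated from the trend by a ‘significant’ amount.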
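One thing worth keeping in mind when reading these results is how strongly the contamination setting drives the number of flagged months. The sketch below illustrates this on hypothetical data that contains no injected anomalies at all (the `demo` frame, the three contamination values, and the fixed seeds are assumptions for illustration): the model still flags a share of the points that grows with contamination.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# hypothetical monthly totals: pure noise around a constant level, no real anomalies
rng = np.random.default_rng(1)
demo = pd.DataFrame({"Sales": rng.normal(50_000, 8_000, size=48)})

flag_counts = {}
for contamination in (0.05, 0.1, 0.2):
    # same seed each time, so only the score threshold changes between runs
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(demo[["Sales"]])
    flag_counts[contamination] = int((labels == -1).sum())
    print(f"contamination={contamination}: {flag_counts[contamination]} of {len(demo)} flagged")
```

So a flagged month is best read as "among the most unusual in this data set" rather than as proof that something went wrong.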

However, knowing the data is important here, as some of the anomalies should come with a caveat. Let's look at the first (February 2015) and last (November 2018) anomalies detected. At first glance, we see that both are large deviations from the mean.

However, the first anomaly (February 2015) is only our second month of recorded sales, and the business may have just started operating. Sales are definitely low, and we see a large spike the next month. But is it fair to mark the second month of business as an anomaly because sales were low? Or is this the norm for a new business?

For our last anomaly (November 2018), we see a huge spike in sales that appears to deviate from the overall trend. However, we have run out of data. As more data is recorded, it may turn out not to have been an anomaly, but rather an indicator of a steeper upward trend.