Identifying and Leveraging Leading Indicators for Time Series Forecasting | by Afolabi Lagunju

Python Implementation

Before we start, create an account with Federal Reserve Economic Data (FRED), and get an API key using this link. Please note that this product uses the FRED® API but is not endorsed or certified by the Federal Reserve Bank of St. Louis.

We start by installing and loading the required modules.
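If any of these libraries are missing from your environment, they can be installed from a terminal with pip (a sketch; the package names below are the standard PyPI distributions for the imports that follow):

pip install requests pandas numpy matplotlib scikit-learn statsmodels pmdarima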

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from pandas.tseries.offsets import MonthEnd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_percentage_error
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import grangercausalitytests

from pmdarima import auto_arima

Next, we’ll create a custom function to read the data using the FRED API.

FRED_API_KEY = '__YOUR_API_KEY__'

# Function to read data from the FRED API
def get_fred_data(data_id, data_name):
    # Request the observations for the given series from the FRED API
    response = requests.get(f'https://api.stlouisfed.org/fred/series/observations?series_id={data_id}&api_key={FRED_API_KEY}&file_type=json')
    # Keep only the date and value columns, renaming value to the friendly series name
    df = pd.DataFrame(response.json()['observations'])[['date', 'value']].rename(columns={'value': data_name})
    df[data_name] = pd.to_numeric(df[data_name], errors='coerce')
    # Align each observation to the end of its month
    df['date'] = pd.to_datetime(df['date']) + MonthEnd(1)
    df.set_index('date', inplace=True)
    df.index.freq = 'M'
    return df

Now, let’s read our data and store it in a pandas dataframe.

dependent_timeseries_id = 'MRTSSM4453USN'
dependent_timeseries_name = 'liquor_sales'

potential_leading_indicators = {
'USACPIALLMINMEI': 'consumer_price_index',
'PCU44534453': 'liquor_ppi',
'DSPIC96': 'real_income',
'LFWA25TTUSM647S': 'working_age_population',
'UNRATENSA': 'unemployment_rate',
'LNU01300000': 'labor_force_participation',
'A063RC1': 'social_benefits',
'CCLACBM027NBOG': 'consumer_loans',
}

# Read the dependent time series
timeseries = get_fred_data(dependent_timeseries_id, dependent_timeseries_name)

# Join timeseries with the potential leading indicators
for data_id, data_name in potential_leading_indicators.items():
    df = get_fred_data(data_id, data_name)
    timeseries = timeseries.join(df)

# We will start our analysis from Jan-2010
timeseries = timeseries['2010':]

# Add the month for which we want to predict liquor_sales
timeseries = timeseries.reindex(timeseries.index.union([timeseries.index[-1] + timeseries.index.freq]))

timeseries

timeseries dataframe

A quick visual analysis of our data shows that our dependent time series (liquor sales) roughly follows the same cycle every 12 months. We will use this 12-month cycle as a parameter later on in our time series forecasting.

timeseries[dependent_timeseries_name].plot(figsize=(20,8));
Liquor sales trend
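To make this yearly pattern explicit, we can optionally decompose the series into trend, seasonal and residual components. This is a quick sketch (not part of the original flow) using statsmodels’ seasonal_decompose with an additive model and a 12-month period; the dropna() call removes the empty future month we appended earlier.

from statsmodels.tsa.seasonal import seasonal_decompose

# Optional check: split liquor_sales into trend, seasonal and residual parts
# (assumes an additive seasonal structure with a 12-month cycle)
decomposition = seasonal_decompose(timeseries[dependent_timeseries_name].dropna(), model='additive', period=12)
decomposition.plot();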

Before we test for causality, we need to confirm the stationarity of our time series data. To achieve this, we will use the Augmented Dickey–Fuller test. If our dataset fails this stationarity test, we must employ a recursive differencing approach until it satisfies the test criteria.

# Create a copy of the timeseries to use for the tests. Be sure to exclude the additional row we added in the previous step
timeseries_for_gc_tests = timeseries[:-1]
all_cols = timeseries_for_gc_tests.columns

stationary_cols = []
diff_times = 0

while True:

    # Test each column for stationarity
    for col in all_cols:
        adf, pvalue, lagsused, observations, critical_values, icbest = adfuller(timeseries_for_gc_tests[col])
        if pvalue <= 0.05:
            stationary_cols.append(col)

    # Difference the time series if at least one column fails the stationarity test
    if set(stationary_cols) != set(all_cols):
        timeseries_for_gc_tests = timeseries_for_gc_tests.diff().dropna()
        diff_times += 1
        stationary_cols = []
    else:
        print(f'No of Differencing: {diff_times}')
        break

Now that we have loaded our time series data into a pandas dataframe and it passes the stationarity test, we can test for causality using the Granger causality test.

maxlag = 6 # the maximum number of past time periods to check for potential causality. We cap ours at 6 months
leading_indicators = []

for x in all_cols[1:]:
    gc_res = grangercausalitytests(timeseries_for_gc_tests[[dependent_timeseries_name, x]], maxlag=maxlag, verbose=0)
    leading_indicators_tmp = []
    for lag in range(1, maxlag+1):
        ftest_stat = gc_res[lag][0]['ssr_ftest'][0]
        ftest_pvalue = gc_res[lag][0]['ssr_ftest'][1]
        if ftest_pvalue <= 0.05:
            leading_indicators_tmp.append({'x': x, 'lag': lag, 'ftest_pvalue': ftest_pvalue, 'ftest_stat': ftest_stat, 'xlabel': f'{x}__{lag}_mths_ago'})
    # Keep only the most significant lag (highest F-statistic) for each indicator
    if leading_indicators_tmp:
        leading_indicators.append(max(leading_indicators_tmp, key=lambda x: x['ftest_stat']))

# Display the leading indicators as a dataframe
pd.DataFrame(leading_indicators).reset_index(drop=True)

leading indicators of liquor sales

From our tests, we can see that the current month’s liquor sales are affected by the Consumer Price Indexᵈ² and Consumer Loansᵈ¹⁰ of 2 months ago, and by the Labour Force Participationᵈ⁷ of 6 months ago.

Now that we have established our leading indicators, we will shift their records so that their lagged figures sit in the same row as the current liquor_sales data which they “cause”.

# Shift the leading indicators by their corresponding lag periods
for i in leading_indicators:
    timeseries[i['xlabel']] = timeseries[i['x']].shift(periods=i['lag'], freq='M')

# Select only the dependent_timeseries_name and the leading indicators for further analysis
timeseries = timeseries[[dependent_timeseries_name, *[i['xlabel'] for i in leading_indicators]]].dropna(subset=[i['xlabel'] for i in leading_indicators], axis=0)
timeseries

leading indicators presented as X variables of liquor_sales

Next, we scale our data so that all features are within the same range, then apply PCA to eliminate multicollinearity between our leading indicators.

# Scale dependent timeseries
y_scaler = StandardScaler()
dependent_timeseries_scaled = y_scaler.fit_transform(timeseries[[dependent_timeseries_name]])

# Scale leading indicators
X_scaler = StandardScaler()
leading_indicators_scaled = X_scaler.fit_transform(timeseries[[i['xlabel'] for i in leading_indicators]])

# Reduce the dimensionality of the leading indicators
pca = PCA(n_components=0.90)
leading_indicators_scaled_components = pca.fit_transform(leading_indicators_scaled)

leading_indicators_scaled_components.shape
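As a quick sanity check (an optional addition, not part of the original flow), we can inspect how much of the indicators’ variance the retained components capture, using scikit-learn’s explained_variance_ratio_ attribute:

# Variance captured by each retained principal component
print(pca.explained_variance_ratio_.round(3))

# Since n_components=0.90, the total retained variance should be at least 90%
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.1%}")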

Finally, we can build our SARIMAX model with the help of auto_arima. For the purpose of this implementation, we will leave all parameters at their defaults, apart from the seasonality flag and the number of periods in each cycle (m).

We will train our model using the timeseries data up until '2024-05-31', test with the '2024-06-30' data, then predict the '2024-07-31' liquor sales.

# Build the SARIMAX model
periods_in_cycle = 12 # number of periods per cycle. In our case, it's 12 months
model = auto_arima(y=dependent_timeseries_scaled[:-2], X=leading_indicators_scaled_components[:-2], seasonal=True, m=periods_in_cycle)
model.summary()
SARIMAX model summary
# Forecast the next two periods
preds_scaled = model.predict(n_periods=2, X=leading_indicators_scaled_components[-2:])
pred_2024_06_30, pred_2024_07_31 = np.round(y_scaler.inverse_transform([preds_scaled]))[0]

print("TESTn----")
print(f"Precise Liquor Gross sales for 2024-06-30: {timeseries[dependent_timeseries_name]['2024-06-30']}")
print(f"Predicted Liquor Gross sales for 2024-06-30: {pred_2024_06_30}")
print(f"MAPE: {mean_absolute_percentage_error([timeseries[dependent_timeseries_name]['2024-06-30']], [pred_2024_06_30]):.1%}")

print("nFORECASTn--------")
print(f"Forecasted Liquor Gross sales for 2024-07-31: {pred_2024_07_31}")

Test and Forecast result

By following this process step by step, we forecasted the liquor sales figure for July 2024 with an estimated MAPE of just 0.4%.

To further improve the accuracy of our prediction, we can explore adding more potential leading indicators and fine-tuning the models used.

Conclusion

Leading indicators, as we’ve explored, serve as early signals of future trends, providing a crucial edge in anticipating changes before they fully materialise. By leveraging techniques such as Granger causality tests to identify leading indicator series and incorporating them into forecasting models, we can significantly enhance the precision and robustness of our predictions.