Glass Jar Limited
Time Series Analysis

Univariate Approach

The following time series analysis uses daily domestic gas consumption figures as an example. The series spans several years but contains gaps. By analysing the historic patterns in the data we will gain an understanding that allows us to make a forecast. The need for a forecast is often the primary motivator for time series analysis.

We'll use Python's Pandas library for this. Pandas comes with many tools designed to make time series analysis easier. The first task is to import our data, which we do with the function 'read_csv'. Note that we identify the date column via the parse_dates parameter. Handling of dates is often a source of problems when ingesting data because there are so many variants of date format. By default read_csv will make an intelligent guess, but it also offers the option of a custom date parser if necessary.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

x = pd.read_csv('gas.csv',
   sep=',', usecols=['gas', 'dt'],
   parse_dates=['dt'],
   index_col='dt', dayfirst=True,
   dtype='float')
x['date_raw'] = x.index
x['date_id'] = (x.index - x.index.min())/pd.to_timedelta(1, 'd') #days since first reading
x['days'] = (x['date_raw'] - x['date_raw'].shift(1)).dt.days #length of period covered by each reading
x['gas_m3'] = x['gas'] - x['gas'].shift(1) #volume consumed since the previous reading
x['gas_m3_daily'] = x['gas_m3'] / x['days'] #average daily volume over the period

x.groupby('days').size()

Out[1]:
days
1.0  1420
2.0   239
3.0    86
4.0    61
5.0    29
6.0    18
7.0    10
8.0     3
9.0     1
10.0    7
11.0    3
12.0    1
15.0    1
31.0    1
189.0   1
dtype: int64

As the above table shows, our next issue is the varying granularity of the data. Most meter readings cover a single day, but some cover a longer period and one covers several months. Our first solution will be to calculate period averages. Where, for example, we have the usage over 3 days, we will assume the usage on each day of the period is a third of the total. We do this by first expanding the data frame to include the missing dates, then spreading the usage data over the introduced days. Once we have daily data we convert the volume of gas consumed (cubic metres) into energy (kilowatt-hours), using the fact that National Grid methane is supplied with a calorific value of a little under 40MJ/m3 (about 39,780kJ/m3) and that kJ can be converted to kWh by dividing by the number of seconds in an hour (i.e. 3600). For later reference we'll also create a flag to mark out the days with estimated values.

x = x.asfreq('D', method=None) #expand series to include missing dates, new records populated with nans
x['date_raw'] = x.index
x['date_id'] = (x.index - x.index.min())/pd.to_timedelta(1, 'd')
x['estimated'] = x['gas_m3_daily'].isna() #flag the days whose values we are about to estimate
x['gas_m3_daily'] = x['gas_m3_daily'].bfill() #each reading covers the days before it, so fill backwards
x['daily_kwh'] = x['gas_m3_daily'] * 39780 / 3600 #cubic metres -> kWh (39,780kJ/m3, 3600kJ per kWh)
x = x.iloc[1:] #drop the first row, which has no preceding reading

Now that we have a complete series, we can start to analyse it. Let's create some moving averages and use linear regression to make a trend line. (We might consider using the moving average values as a better estimate for the gaps in our data, especially for the shorter gaps. We could use the period totals to scale the moving average values to avoid any distortion - but for now let's move on).

from sklearn import linear_model
x['ma7'] = x['daily_kwh'].shift(-3).rolling(7).mean() #centred 7-day moving average
x['ma30'] = x['daily_kwh'].shift(-15).rolling(30).mean() #centred 30-day moving average
m = linear_model.LinearRegression()
m.fit(x[['date_id']], x[['daily_kwh']]) #nb fit expects 2d arrays
print(m.coef_, m.score(x[['date_id']], x[['daily_kwh']]))

[[-0.00998603]] 0.02562331738429191

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20,10), dpi=500)
ax.grid(True, linestyle='--')
ax.scatter(x.date_raw, x.daily_kwh, c='red', marker='.', s=10)
ax.plot(x.date_raw, x.ma7, c='red', linewidth=0.5)
ax.plot(x.date_raw, x.ma30, c='black', linewidth=1)
ax.plot(x.date_raw, m.predict(x[['date_id']]), color='blue', linewidth=2) #predict also expects a 2d array

The data shows a strong seasonal pattern (as we might expect - domestic gas use is dominated by heating). The linear regression shows daily gas use decreasing by about 0.01kWh per day (over the course of a year that is a reduction in total use of about 650kWh, or about 3%) - the overall regression line fit is poor due to the strong seasonality. Decreasing consumption is always a good thing, but let's be a little cautious - the seasonal peaks and troughs have strong leverage on the trend line - note in particular that the series starts on a peak and ends in a trough. What we need to do is isolate and remove the seasonal pattern. The Python library Statsmodels gives us some functionality to decompose the time series into its components: trend, seasonality and residual. We'll choose a multiplicative model, i.e.

consumption[t] = trend[t] * seasonality[t] * residual[t]

import statsmodels.tsa.seasonal
y = x['daily_kwh'].dropna()
kwh_decomp = statsmodels.tsa.seasonal.seasonal_decompose(y, model='multiplicative', period=365, extrapolate_trend=False) #trend extrapolation is just a straight line - may not be that good
fig = kwh_decomp.plot()
plt.tight_layout()
fig.set_size_inches(16,16)

Notice a couple of things: 1) the seasonality looks pretty stable; 2) the fit, as evidenced by the residuals, is poor in 2021, and this coincides with the long period of missing data. We should also be cautious regarding the recent change in trend because it coincides with the period of poor fit. Let's use the seasonal pattern to improve our missing-value estimates and then use that improved data to get a better decomposition. We'll need to scale the seasonal factors so that the estimated usage over the gap matches the metered total.

gap_end = x[x.days==189].date_raw.values[0]
gap_start = gap_end - pd.to_timedelta(189-1, 'd')
x['daily_kwh_seasonal'] = kwh_decomp.seasonal
scale_factor = (x.loc[gap_start:gap_end, 'daily_kwh'].sum() / x.loc[gap_start:gap_end, 'daily_kwh_seasonal'].sum())
estimate = x['daily_kwh_seasonal'] * scale_factor
x['daily_kwh_new'] = np.where((x.date_raw>=gap_start) & (x.date_raw<=gap_end), estimate, x['daily_kwh'])
y = x['daily_kwh_new'].dropna()
y.loc[y[y==0].index] = 0.0001 #multiplicative decomposition can't cope with zeros
kwh_decomp2 = statsmodels.tsa.seasonal.seasonal_decompose(y, model='multiplicative', period=365, extrapolate_trend=False)
x['daily_kwh_trend'] = kwh_decomp2.trend #keep the re-estimated trend for use in the forecast below
fig = kwh_decomp2.plot()
fig.set_size_inches(16,10)

Univariate forecast

Looking at the residuals, the fit looks a bit better. There are still some reservations regarding the recent trend from the seasonal decomposition, so let's make a forecast using a long-term trend together with the seasonality. Note that the trend is not estimated at the ends of the dataset - we need to take that into account when making the forecast. (We could get statsmodels to extrapolate the trend to the ends of the dataset, but it just uses a linear fit on the last few points, which really just dilutes the information content of the data.)

y = x[~x.daily_kwh_trend.isna()] #top and tail data to get region for which trend is defined
m = linear_model.LinearRegression()
m.fit(y[['date_id']], y[['daily_kwh_trend']])
print(m.coef_, m.score(y[['date_id']], y[['daily_kwh_trend']]) )

[[-0.00386417]] 0.14049423049127996

We'll extend our dataframe to cover the forecast period. Our previous linear regression model can then be used to create the forecast values for the trend. Trend multiplied by seasonality gives us the final forecast.

#extend main dataset for forecast period
fcast_start = x.index.max()+pd.to_timedelta(1, 'd')
fcast = pd.DataFrame(index= pd.date_range(start=fcast_start, periods=365, freq='1D'))
xnew = x.reindex(pd.date_range(start = x.index.min(), end=x.index.max()+ pd.to_timedelta(365, 'day'), freq='1d'))
xnew['date_id'] = (xnew.index - x.index.min())/pd.Timedelta('1D')
xnew['daily_kwh_trend1'] = m.predict(xnew[['date_id']]) #trend forecast from the linear regression
xnew.loc[fcast_start:,'daily_kwh_seasonal'] = xnew.loc[fcast_start - pd.to_timedelta(365, 'd'): fcast_start - pd.to_timedelta(1, 'd'),'daily_kwh_seasonal'].values #repeat last year's seasonal factors
xnew.loc[fcast_start:,'daily_kwh_new'] = np.nan
xnew.loc[fcast_start:,'daily_kwh_new'] = xnew.daily_kwh_seasonal * xnew.daily_kwh_trend1 #trend x seasonality = forecast

With the benefit of hindsight, let's load the actual data for the forecast period and compare.
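As a rough sketch of that comparison (the file name and layout of the actuals are assumptions - in practice the new readings would be processed exactly as before):

actual = pd.read_csv('gas_actual.csv', sep=',', usecols=['gas', 'dt'],
   parse_dates=['dt'], index_col='dt', dayfirst=True) #hypothetical file of readings for the forecast year
actual['gas_m3_daily'] = actual['gas'].diff() / actual.index.to_series().diff().dt.days
actual = actual.asfreq('D')
actual['gas_m3_daily'] = actual['gas_m3_daily'].bfill()
actual['daily_kwh'] = actual['gas_m3_daily'] * 39780 / 3600 #same m3 -> kWh conversion as before

fig, ax = plt.subplots(figsize=(20,10))
ax.plot(xnew.loc[fcast_start:].index, xnew.loc[fcast_start:, 'daily_kwh_new'], c='blue', label='forecast')
ax.plot(actual.index, actual['daily_kwh'], c='red', label='actual')
ax.legend()
print(xnew.loc[fcast_start:, 'daily_kwh_new'].sum(), actual['daily_kwh'].sum()) #compare annual totals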

Mmm, not bad. Actual consumption was less than forecast - perhaps energy conservation measures are taking effect! But note we have tried to forecast using only the univariate history when we know the series is driven by temperature. Can we use this to improve the forecast? Let's see.

Part Two - Forecasting Using Dependent Variables

Heat loss is driven by outdoor temperature, but only when it is below the required indoor temperature, so we can't use weather data directly. The metric we need is "heating degree days" (HDD). This is an integral of temperature below a set point over a day. So, for example, if the set point is 18C and the external temperature is 17C over a whole day, then that's one heating degree day; 16C for 12 hours followed by 18C for 12 hours would also be one heating degree day. Here we choose a set point of 18C.

Plotting heating degree days against kWh shows a strong relationship (R-squared 73%), so if we can forecast HDD, we can forecast the energy consumption. Let's use the statsmodels seasonal decomposition on HDD and assume the trend picks up where the statsmodels value leaves off and is flat over the coming year. The code is similar to that shown previously so we'll skip it. The next stage is to use sklearn linear regression to estimate a model. Previously there was some evidence of decreases in consumption over time, so we could use time *and* HDD as explanatory variables.
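Before fitting, here is a minimal sketch of how the HDD series and the combined data frame xtemp (holding daily_kwh, date_id and hdd) might be constructed - the temperature file, its column names and the hourly granularity are all assumptions:

temps = pd.read_csv('temps.csv', parse_dates=['dt'], index_col='dt')['temp_c'] #hypothetical hourly external temperatures
SET_POINT = 18.0
hdd = (SET_POINT - temps).clip(lower=0).resample('D').mean().rename('hdd') #degrees below set point, averaged over each day
#e.g. 16C for 12 hours then 18C for 12 hours gives (2*12 + 0*12)/24 = 1 heating degree day
xtemp = x.join(hdd).dropna(subset=['daily_kwh', 'hdd']) #combined frame used in the regression below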

m = linear_model.LinearRegression()
m.fit(xtemp[['date_id', 'hdd']], xtemp[['daily_kwh']]) #xtemp holds daily_kwh joined with the hdd series
print(m.coef_, m.intercept_, m.score(xtemp[['date_id', 'hdd']], xtemp[['daily_kwh']]))

[[-5.89400382e-03 9.66103335e+00]] [15.17878606] 0.7558621670512726

x.loc[fcast_start:, 'daily_kwh'] = m.predict(x.loc[fcast_start:, ['date_id', 'hdd']]) #nb this assumes the frame has been extended over the forecast period and populated with the forecast hdd values

Closing Thoughts

We should go on to analyse and compare the performance of the two methodologies against the observed 2023 data. Secondly, have a look back at the scatter plot. There are two aspects of the data that our model doesn't address: (i) at low heating demand (below 3-4 HDD) consumption is often flat, suggesting a breakdown in the relationship between HDD and kWh. This could reflect an error in the choice of HDD set point (perhaps move it down 3-4C), but it also reflects hot water usage, which is less strongly influenced by HDD. We could try to separate the original kWh series into kWh-heating vs kWh-hot-water and remodel. (ii) at any point on the HDD curve there is a wide spread in consumption. This might reflect other weather factors (e.g. solar gain, wind chill) which we could try to bring into our regression equation. But it might also reflect lags in the physical system (any property takes time to warm up and cool down and has significant heat storage capacity). We could analyse our residuals for lags over time using autocorrelation and partial autocorrelation measures, as sketched below. If we find some, we could then consider bringing these into our modelling - perhaps through an autoregressive integrated moving average (ARIMA) model.
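As a sketch of that residual-lag analysis (reusing the regression fitted above and assuming xtemp as before):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
resid = xtemp['daily_kwh'] - m.predict(xtemp[['date_id', 'hdd']])[:, 0] #residuals from the hdd regression
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(12,8))
plot_acf(resid.dropna(), lags=30, ax=axes[0]) #autocorrelation of residuals
plot_pacf(resid.dropna(), lags=30, ax=axes[1]) #partial autocorrelation of residuals
plt.tight_layout()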

Lastly, we have introduced a pure time component into our model - the output shows a coefficient of about -0.006kWh/day. This sounds small but represents a significant drop in consumption when considered over a period of years. We could bring other statistical routines to bear to look at the confidence interval for this parameter, as sketched below. What external factors might be responsible for decreases in energy consumption over time? Occupant behaviour? Property changes? If these are known, they can be modelled. We could then adjust the series to take account of them (in the same way that we might adjust historical financial data to account for share splits, buy-backs etc). See here for an example of energy modelling.
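For instance, a minimal sketch refitting the same regression with statsmodels OLS, which reports standard errors and confidence intervals directly (again assuming xtemp as above):

import statsmodels.api as sm
X = sm.add_constant(xtemp[['date_id', 'hdd']])
ols = sm.OLS(xtemp['daily_kwh'], X, missing='drop').fit()
print(ols.conf_int(alpha=0.05).loc['date_id']) #95% confidence interval for the pure-time coefficient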
