We'll use Python's Pandas library for this. Pandas comes with many tools designed to make time series analysis easier. The first task is to import our data, which we do with the function read_csv. Note that we flag the date column in the parse_dates parameter. Date handling is a frequent source of problems when ingesting data, since date formats vary widely. By default read_csv makes an intelligent guess, but it also gives the option of a custom date parser if necessary.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

x = pd.read_csv('gas.csv',
                sep=',', usecols=['gas', 'dt'],
                parse_dates=['dt'],
                index_col='dt', dayfirst=True,
                dtype={'gas': 'float'})  # restrict dtype to 'gas'; 'dt' is parsed as dates
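As an aside: if the default guess were to misread the dates, we could parse them explicitly after loading. A sketch only - the format string below is illustrative, not taken from the actual file.
# illustrative alternative: read the raw strings, then parse explicitly
x = pd.read_csv('gas.csv', sep=',', usecols=['gas', 'dt'], dtype={'gas': 'float'})
x['dt'] = pd.to_datetime(x['dt'], format='%d/%m/%Y')  # assumed day-first format; adjust to the file
x = x.set_index('dt')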
x['date_raw'] = x.index  # keep a plain date column for plotting
x['date_id'] = (x.index - x.index.min())/pd.to_timedelta(1, 'd')  # numeric day count since first reading
x['days'] = (x['date_raw'] - x['date_raw'].shift(1)).dt.days  # days covered by each reading
x['gas_m3'] = x['gas'] - x['gas'].shift(1)  # usage since the previous reading
x['gas_m3_daily'] = x['gas_m3'] / x['days']  # average daily usage over the period
x.groupby('days').size()  # distribution of period lengths between readings
Out[1]:
days
1.0 1420
2.0 239
3.0 86
4.0 61
5.0 29
6.0 18
7.0 10
8.0 3
9.0 1
10.0 7
11.0 3
12.0 1
15.0 1
31.0 1
189.0 1
dtype: int64
As the table above shows, our next issue is the varying granularity of the data. Most meter readings cover a single day, but some cover a longer period, and one covers several months. Our first solution will be to calculate period averages: where, for example, we have the usage over 3 days, we will assume the usage on each day of the period is a third of the total. We do this by first expanding the data frame to include the missing dates, then spreading the usage data over the introduced days. Once we have daily data we convert the volume of gas consumed (cubic metres) into energy (kilowatt-hours), using the fact that National Grid methane is supplied with a calorific value of a little under 40MJ/m3 (39,780kJ/m3 here) and that kJ can be converted to kWh by dividing by the number of seconds in an hour (i.e. 3600) - so one cubic metre is about 11.05kWh. For reference later we'll also create a flag to mark out the days with estimated values.
x = x.asfreq('D')  # expand series to include missing dates; new records populated with NaNs
x['date_raw'] = x.index
x['date_id'] = (x.index - x.index.min())/pd.to_timedelta(1, 'd')
x['estimated'] = x['gas_m3_daily'].isna()  # flag the introduced days before filling them
x['gas_m3_daily'] = x['gas_m3_daily'].bfill()  # each reading's daily average covers the days before it
x['daily_kwh'] = x['gas_m3_daily'] * 39780 / 3600  # m3 -> kJ (39,780kJ/m3) -> kWh (3600kJ/kWh)
x = x.iloc[1:]  # drop the first row, which has no previous reading to diff against
Now that we have a complete series, we can start to analyse it. Let's create some moving averages and use linear regression to fit a trend line. (We might consider using the moving-average values as a better estimate for the gaps in our data, especially for the shorter gaps. We could use the period totals to scale the moving-average values to avoid any distortion - there's a sketch of this after the regression output below - but for now let's move on.)
from sklearn import linear_model
x['ma7'] = x['daily_kwh'].shift(-3).rolling(7).mean()  # centred 7-day moving average
x['ma30'] = x['daily_kwh'].shift(-15).rolling(30).mean()  # (approximately) centred 30-day moving average
m = linear_model.LinearRegression()
m.fit(x[['date_id']], x[['daily_kwh']]) #nb fit expects 2d arrays
print(m.coef_, m.score(x[['date_id']], x[['daily_kwh']]))
[[-0.00998603]] 0.02562331738429191
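As an aside, the scaled moving-average gap fill mentioned above might look something like the following. This is a sketch only (the column daily_kwh_ma_fill is ours and isn't used in the rest of the analysis): it takes the shape of a centred 7-day moving average and rescales it so each metered period still sums to the same total as the flat fill.
# aside sketch: re-spread each metered period (run of estimated days plus
# the reading day that closes it) using the moving-average shape
shape = x['daily_kwh'].rolling(7, center=True, min_periods=1).mean()
grp = (~x['estimated']).cumsum() - (~x['estimated']).astype(int)  # one id per metered period
x['daily_kwh_ma_fill'] = x.groupby(grp)['daily_kwh'].transform(
    lambda s: shape.loc[s.index] * s.sum() / shape.loc[s.index].sum())
Single-day periods come back unchanged, since rescaling a one-element group reproduces its own value.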
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20,10), dpi=500)
ax.grid(True, linestyle='--')
ax.scatter(x.date_raw, x.daily_kwh, c='red', marker='.', s=10)
ax.plot(x.date_raw, x.ma7, c='red', linewidth=0.5)
ax.plot(x.date_raw, x.ma30, c='black', linewidth=1)
ax.plot(x.date_raw, m.predict(x[['date_id']]), color='blue', linewidth=2)  # predict also expects a 2d array
The data shows a strong seasonal pattern (as we might expect - domestic gas use is dominated by heating). The linear regression shows daily gas use decreasing by 0.01kWh/day; over the course of a year that is a reduction in total use of about 650kWh, or about 3%. The overall regression-line fit is poor due to the strong seasonality. Decreasing consumption is always a good thing, but let's be a little cautious: the seasonal peaks and troughs have strong leverage on the trend line - note in particular that the series starts on a peak and ends in a trough. What we need to do is isolate and remove the seasonal pattern. The Python library Statsmodels gives us some functionality to decompose the time series into its components: trend, seasonality and residual. We'll choose a multiplicative model, i.e.
consumption[t] = trend[t] * seasonality[t] * residual[t]
import statsmodels.tsa.seasonal
y = x['daily_kwh'].dropna()
kwh_decomp = statsmodels.tsa.seasonal.seasonal_decompose(y, model='multiplicative', period=365,
    extrapolate_trend=False)  # trend extrapolation is just a straight line - may not be that good
fig = kwh_decomp.plot()
plt.tight_layout()
fig.set_size_inches(16,16)
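The returned object exposes the components (trend, seasonal, resid) along with the observed series, so we can sanity-check that they do multiply back together:
recon = kwh_decomp.trend * kwh_decomp.seasonal * kwh_decomp.resid
print((recon - kwh_decomp.observed).abs().max())  # effectively zero wherever the trend is defined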
Notice a couple of things: 1) the seasonality looks pretty stable; 2) the fit, as evidenced by the residuals, is poor in 2021, and this coincides with the long period of missing data. We should also be cautious about the recent change in trend, because it coincides with the period of poor fit. Let's use the seasonal pattern to improve our missing-value estimates and then use that improved data to get a better decomposition. We'll need to scale the seasonal factors so that the total usage over the gap matches the metered total.
gap_end = x[x.days==189].date_raw.values[0]  # the reading that closed the 189-day gap
gap_start = gap_end - pd.to_timedelta(189-1, 'd')
x['daily_kwh_seasonal'] = kwh_decomp.seasonal
scale_factor = (x.loc[gap_start:gap_end, 'daily_kwh'].sum() / x.loc[gap_start:gap_end, 'daily_kwh_seasonal'].sum())
estimate = x['daily_kwh_seasonal'] * scale_factor  # seasonal shape rescaled to the metered total
x['daily_kwh_new'] = np.where((x.date_raw>=gap_start) & (x.date_raw<=gap_end), estimate, x['daily_kwh'])
y = x['daily_kwh_new'].dropna()
y[y == 0] = 0.0001  # a multiplicative decomposition can't handle zeros
kwh_decomp2 = statsmodels.tsa.seasonal.seasonal_decompose(y, model='multiplicative', period=365, extrapolate_trend=False)
fig = kwh_decomp2.plot()
fig.set_size_inches(16,10)
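Did the refit actually help? One simple check: compare the spread of the residuals from the two decompositions - for a multiplicative model they sit around 1, so a smaller standard deviation means a tighter fit.
print(kwh_decomp.resid.std(), kwh_decomp2.resid.std())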
x['daily_kwh_trend'] = kwh_decomp2.trend  # bring the refitted trend back into the main frame
y = x[x.daily_kwh_trend.notna()]  # top and tail data to get the region for which the trend is defined
m = linear_model.LinearRegression()
m.fit(y[['date_id']], y[['daily_kwh_trend']])
print(m.coef_, m.score(y[['date_id']], y[['daily_kwh_trend']]))
[[-0.00386417]] 0.14049423049127996
We'll extend our dataframe to cover the forecast period. Our previous linear regression model can then be used to create the forecast values for the trend. Trend multiplied by seasonality gives us the final forecast.
# extend main dataset for forecast period
fcast_start = x.index.max() + pd.to_timedelta(1, 'd')
xnew = x.reindex(pd.date_range(start=x.index.min(), end=x.index.max() + pd.to_timedelta(365, 'day'), freq='1d'))
xnew['date_id'] = (xnew.index - x.index.min())/pd.Timedelta('1D')
xnew['daily_kwh_trend1'] = m.predict(xnew[['date_id']])  # trend forecast from the regression above
xnew.loc[fcast_start:,'daily_kwh_seasonal'] = xnew.loc[fcast_start - pd.to_timedelta(365, 'd'): fcast_start - pd.to_timedelta(1, 'd'),'daily_kwh_seasonal'].values  # repeat last year's seasonal factors
xnew.loc[fcast_start:,'daily_kwh_new'] = xnew.daily_kwh_seasonal * xnew.daily_kwh_trend1  # final forecast = trend * seasonality
With the benefit of hindsight, let's load the actual data for the forecast period and compare.
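Here's a sketch of that comparison. It assumes the newer readings have already been processed into a daily kWh series; the name actual_kwh and that preparation step are assumptions, mirroring the pipeline above.
# actual_kwh: daily kWh Series over the forecast period, built from the
# newer readings with the same diff / asfreq / bfill steps used earlier (assumed)
err = actual_kwh - xnew.loc[fcast_start:, 'daily_kwh_new']
print(err.sum(), err.abs().mean())  # overall bias and mean absolute daily error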
Mmm, not bad. Actual consumption was less than forecast - perhaps energy conservation measures are taking effect! But note that we have tried to forecast using only the univariate history, when we know the series is driven by temperature. Can we use temperature to improve the forecast? Let's see.
# xtemp: the daily history joined with a heating degree days column ('hdd') computed from temperature data
m = linear_model.LinearRegression()
m.fit(xtemp[['date_id', 'hdd']], xtemp[['daily_kwh']])
print(m.coef_, m.intercept_, m.score(xtemp[['date_id', 'hdd']], xtemp[['daily_kwh']]))
[[-5.89400382e-03 9.66103335e+00]] [15.17878606] 0.7558621670512726
x.loc[fcast_start:, 'daily_kwh'] = m.predict(x.loc[fcast_start:, ['date_id', 'hdd']])  # nb assumes x has been extended over the forecast period and populated with forecast 'hdd' values
Lastly, we have introduced a pure time component into our model - the output shows a coefficient of about -0.006kWh/day. This sounds small, but it represents a significant drop in consumption when considered over a period of years. We could bring other statistical routines to bear to look at the confidence interval for this parameter (see the sketch below). What external factors might be responsible for decreases in energy consumption over time? Occupant behaviour? Property changes? If these are known, they can be modelled. We could then adjust the series to take account of them (in the same way that we might adjust historical financial data to account for share splits, buy-backs etc). See here for an example of energy modelling.
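For instance, refitting the same regression with statsmodels' OLS reports confidence intervals directly (a sketch using the same data as the sklearn fit above):
import statsmodels.api as sm
X = sm.add_constant(xtemp[['date_id', 'hdd']])  # add an explicit intercept column
ols = sm.OLS(xtemp['daily_kwh'], X).fit()
print(ols.conf_int())  # 95% confidence interval for each coefficient, date_id included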