Time Series Analysis Basics

Using Python

Contents

  1. What is Time Series?
  2. Data Extract and Processing
  3. Time Series Analysis
  4. Resample Data
  5. Conclusion

You can download the updated notebook from my GitHub page

What is time series data?¶

A time series is simply a sequence of data points ordered in time. Time series exist everywhere in nature and play a major role in business. Examples include temperatures, heartbeats, births, internet traffic, stock prices, sales and industrial production. Efficient processing and forecasting of time series data provide decisive advantages: they can help businesses implement their strategies ahead of time.

Time Series Analysis (TSA) uses methods for analyzing time series data in order to identify useful patterns and extract meaningful statistics of the data. There are two major goals of TSA:

  1. Identifying patterns and features represented by the data; and
  2. Forecasting (in some cases using models) to predict future values based on previous data. Cool stuff eh?
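To make the definition concrete, here is a minimal sketch (with made-up temperature readings, not the Superstore data used below) of what a time series looks like in pandas: values indexed by a DatetimeIndex.

```python
import pandas as pd

# A toy time series: five days of (made-up) temperature readings,
# indexed by date -- values ordered in time.
dates = pd.date_range(start="2021-01-01", periods=5, freq="D")
temps = pd.Series([21.5, 22.0, 19.8, 20.3, 21.1], index=dates, name="Temperature")
print(temps)
```

The DatetimeIndex is what unlocks pandas' time series machinery (sorting, slicing by date, resampling), which we rely on throughout this notebook.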

Data Extract and Processing¶

Import Libraries¶

In [211]:
import numpy as np, pandas as pd
import itertools #provides various functions that work on iterators
import matplotlib
import matplotlib.pyplot as plt
import xlrd # in the case where raw data is in Excel/xls format 
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
In [193]:
%matplotlib inline
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['axes.titlesize'] = 15
matplotlib.rcParams['text.color'] = 'k'
matplotlib.rcParams['figure.figsize'] = 14, 5

View data in Pandas¶

In [212]:
df = pd.read_excel("Superstore.xls")
print('Data downloaded and read into a dataframe!')
Data downloaded and read into a dataframe!
In [215]:
df.head(3) # A quick preview of 3 rows of the dataframe
Out[215]:
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City ... Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit
0 1 CA-2016-152156 2016-11-08 2016-11-11 Second Class CG-12520 Claire Gute Consumer United States Henderson ... 42420 South FUR-BO-10001798 Furniture Bookcases Bush Somerset Collection Bookcase 261.96 2 0.0 41.9136
1 2 CA-2016-152156 2016-11-08 2016-11-11 Second Class CG-12520 Claire Gute Consumer United States Henderson ... 42420 South FUR-CH-10000454 Furniture Chairs Hon Deluxe Fabric Upholstered Stacking Chairs,... 731.94 3 0.0 219.5820
2 3 CA-2016-138688 2016-06-12 2016-06-16 Second Class DV-13045 Darrin Van Huff Corporate United States Los Angeles ... 90036 West OFF-LA-10000240 Office Supplies Labels Self-Adhesive Address Labels for Typewriters b... 14.62 2 0.0 6.8714

3 rows × 21 columns

A quick view of data properties¶

  1. A look at column names
  2. Checks for any missing values in the data (it turns out there are none)
  3. A look at the statistics of numeric values in the data.
  4. Viewing unique categories of products in the data and their value counts

Category is an interesting column (or feature) of the 'Super Store' data: it lets us divide the data into subsets according to the store's three product categories, Office Supplies, Furniture and Technology.

In [195]:
# 1.
df.columns.values
Out[195]:
array(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'Country', 'City',
       'State', 'Postal Code', 'Region', 'Product ID', 'Category',
       'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount',
       'Profit'], dtype=object)
In [196]:
# 2.
df.isnull().any()
Out[196]:
Row ID           False
Order ID         False
Order Date       False
Ship Date        False
Ship Mode        False
Customer ID      False
Customer Name    False
Segment          False
Country          False
City             False
State            False
Postal Code      False
Region           False
Product ID       False
Category         False
Sub-Category     False
Product Name     False
Sales            False
Quantity         False
Discount         False
Profit           False
dtype: bool
In [216]:
# 3.
df.describe() # View basic statistics
Out[216]:
Row ID Postal Code Sales Quantity Discount Profit
count 9994.000000 9994.000000 9994.000000 9994.000000 9994.000000 9994.000000
mean 4997.500000 55190.379428 229.858001 3.789574 0.156203 28.656896
std 2885.163629 32063.693350 623.245101 2.225110 0.206452 234.260108
min 1.000000 1040.000000 0.444000 1.000000 0.000000 -6599.978000
25% 2499.250000 23223.000000 17.280000 2.000000 0.000000 1.728750
50% 4997.500000 56430.500000 54.490000 3.000000 0.200000 8.666500
75% 7495.750000 90008.000000 209.940000 5.000000 0.200000 29.364000
max 9994.000000 99301.000000 22638.480000 14.000000 0.800000 8399.976000
In [198]:
# 4
df.Category.unique()
Out[198]:
array(['Furniture', 'Office Supplies', 'Technology'], dtype=object)
In [199]:
df.Category.value_counts()
Out[199]:
Office Supplies    6026
Furniture          2121
Technology         1847
Name: Category, dtype: int64

Time Series Analysis of Office Supplies¶

For a basic time series analysis, we use the Office Supplies segment as an example. We have a good four years of office supplies sales data, and we will try to find some meaningful patterns in it.

In [217]:
# Extract the data of Office Supplies
office_supplies = df.loc[df['Category'] == 'Office Supplies']
In [218]:
office_supplies.head(3)
Out[218]:
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City ... Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit
2 3 CA-2016-138688 2016-06-12 2016-06-16 Second Class DV-13045 Darrin Van Huff Corporate United States Los Angeles ... 90036 West OFF-LA-10000240 Office Supplies Labels Self-Adhesive Address Labels for Typewriters b... 14.620 2 0.0 6.8714
4 5 US-2015-108966 2015-10-11 2015-10-18 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale ... 33311 South OFF-ST-10000760 Office Supplies Storage Eldon Fold 'N Roll Cart System 22.368 2 0.2 2.5164
6 7 CA-2014-115812 2014-06-09 2014-06-14 Standard Class BH-11710 Brosina Hoffman Consumer United States Los Angeles ... 90032 West OFF-AR-10002833 Office Supplies Art Newell 322 7.280 4 0.0 1.9656

3 rows × 21 columns

What are the minimum and maximum order dates?

In [202]:
# Answer:
office_supplies['Order Date'].min(), office_supplies['Order Date'].max()
Out[202]:
(Timestamp('2014-01-03 00:00:00'), Timestamp('2017-12-30 00:00:00'))
  1. We'll start by dropping all columns except 'Order Date' and 'Sales'.
  2. Sales occurring on the same date should be aggregated together, so we sort by date and sum the sales per day.
In [203]:
# 1.
office_supplies = office_supplies[['Order Date','Sales']]
office_supplies.head()
Out[203]:
Order Date Sales
2 2016-06-12 14.620
4 2015-10-11 22.368
6 2014-06-09 7.280
8 2014-06-09 18.504
9 2014-06-09 114.900
In [204]:
# 2.
office_supplies = office_supplies.sort_values('Order Date')
office_supplies = office_supplies.groupby('Order Date')['Sales'].sum().reset_index()
office_supplies.head(10)
Out[204]:
Order Date Sales
0 2014-01-03 16.448
1 2014-01-04 288.060
2 2014-01-05 19.536
3 2014-01-06 685.340
4 2014-01-07 10.430
5 2014-01-09 9.344
6 2014-01-10 2.890
7 2014-01-13 2027.116
8 2014-01-16 48.660
9 2014-01-18 64.864

Set the index to 'Order Date' (the time series data points)¶

Take a quick view of the date time index

In [205]:
office_supplies = office_supplies.set_index('Order Date')
office_supplies.head()
Out[205]:
Sales
Order Date
2014-01-03 16.448
2014-01-04 288.060
2014-01-05 19.536
2014-01-06 685.340
2014-01-07 10.430
In [206]:
office_supplies.index
Out[206]:
DatetimeIndex(['2014-01-03', '2014-01-04', '2014-01-05', '2014-01-06',
               '2014-01-07', '2014-01-09', '2014-01-10', '2014-01-13',
               '2014-01-16', '2014-01-18',
               ...
               '2017-12-21', '2017-12-22', '2017-12-23', '2017-12-24',
               '2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
               '2017-12-29', '2017-12-30'],
              dtype='datetime64[ns]', name='Order Date', length=1148, freq=None)

Resample data¶

Hmm! As we can see from the index, our current date-time index is not continuous, which can make analysis tricky. Instead, we will use the average daily sales value for each month, with the start of each month as the timestamp. This requires us to resample the data.

In [207]:
df_monthly = office_supplies['Sales'].resample('MS').mean()
In [208]:
df_monthly.head(10)
Out[208]:
Order Date
2014-01-01     285.357647
2014-02-01      63.042588
2014-03-01     391.176318
2014-04-01     464.794750
2014-05-01     324.346545
2014-06-01     588.774409
2014-07-01     756.060400
2014-08-01     541.879143
2014-09-01    1015.677704
2014-10-01     267.078815
Freq: MS, Name: Sales, dtype: float64
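To see exactly what `resample('MS').mean()` does, here is a tiny hand-checkable sketch on made-up data (not the Superstore figures): two January readings and one February reading.

```python
import pandas as pd

# Made-up daily values: two in January, one in February.
s = pd.Series(
    [10.0, 30.0, 50.0],
    index=pd.to_datetime(["2014-01-05", "2014-01-20", "2014-02-10"]),
)
monthly = s.resample("MS").mean()
print(monthly)
# January's mean is (10 + 30) / 2 = 20 and February's is 50; each month
# is stamped at its first day because of the 'MS' (Month Start) rule.
```

Note that `resample` also fills in any months with no observations at all (as NaN by default), which is how it turns our gappy daily index into a regular monthly one.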

Visualizing the Sales Time Series Data¶

In [209]:
df_monthly.plot()
plt.xlabel('Order Date')
plt.ylabel('Sales')
plt.title('Superstore Sales')
plt.show()

Some peculiar patterns appear in the plot. The time series has a seasonality pattern; it appears that

  1. Sales are always low at the beginning of the year and high at the end of the year; and
  2. There is usually an upward trend within any single year, with some monthly 'lows' in the middle of the year.
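One way to sanity-check a suspected seasonal pattern like this is to group the monthly series by calendar month and compare the averages across years. The sketch below does this on synthetic sales numbers (constructed to rise toward year-end, mimicking the observation above), not the actual Superstore series.

```python
import numpy as np
import pandas as pd

# Synthetic monthly sales over four years that rise toward year-end,
# plus some noise -- a stand-in for a series with year-end seasonality.
rng = np.random.default_rng(0)
idx = pd.date_range("2014-01-01", "2017-12-01", freq="MS")
sales = pd.Series(100 + 20 * idx.month + rng.normal(0, 5, len(idx)), index=idx)

# Average sales for each calendar month, pooled across the four years.
by_month = sales.groupby(sales.index.month).mean()
print(by_month)
# For this synthetic series, December's average is the highest --
# exactly the kind of year-end peak observed in the plot.
```

On the real data, the same `groupby(index.month).mean()` trick on `df_monthly` would quantify how strong the year-end effect actually is.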

Finally, to round out these time series analysis basics, we can also visualize our data using a method called time-series decomposition, which decomposes a time series into three distinct components:

  1. Trend
  2. Seasonality, and
  3. Noise.
In [210]:
decomposition = sm.tsa.seasonal_decompose(df_monthly, model = 'additive')
fig = decomposition.plot()
plt.show()

Conclusion¶

Resampling plays an important role in time series analysis, where our features are typically represented as a function of time. The last figure above clearly shows the seasonality in our data, and the trend gradually increases through the years.

Moving forward in time series analysis, forecasting features like sales can positively influence business decisions. Autoregressive Integrated Moving Average (ARIMA) is a commonly used method for time series prediction. Another interesting forecasting method is time series modeling with Prophet. Prophet is designed for analyzing time series that display patterns on different time scales, such as yearly, weekly and daily. It also has advanced capabilities for modeling the effects of holidays on a time series and implementing custom change-points.