# Pre-Processing Data for Time Series Analysis: Outlier Analysis, Missing Values, and Time Series Seasonality

Many time series models require or assume the input time series to be “well behaved”. That is, the series is stationary, ergodic, and free of outliers and missing values. To meet these requirements, it is usually necessary to pre-process raw time series data before doing any formal analysis. In this blog, we discuss methods to address outliers, missing values, and seasonal patterns using different functions in IMSL’s time series package.

- Pre-Processing Data for Time Series Analysis
- Estimation of Missing Values
- Accounting for Time Series Seasonality
- Identifying Outliers in Time Series
- Final Thoughts

## Pre-Processing Data for Time Series Analysis

As stated above, real data must often be pre-processed before it is suitable for time series analysis. Such pre-processing can involve estimating missing values, removing outliers, and accounting for seasonal variations. Fortunately, beyond the initial exploratory methods, algorithmic methods have been developed to help. We illustrate some of these methods in the examples to follow.

## Estimation of Missing Values in Time Series Analysis

Missing values occur for a variety of reasons in the data collection and observation process. Time series models require that there be no gaps in data along the time index, and so simply omitting observations with missing values (and re-indexing as if there were no gaps) is not an option, as it might be for non-time indexed data. Instead, the missing values need to be replaced with judiciously chosen values before fitting a model. Replacement of values in this context is known as “imputation”.

The IMSL function, estimate_missing, provides 4 methods for imputing missing values. The first method uses the median of the non-missing values leading up to the missing value. Method 2 uses spline interpolation, while methods 3 and 4 use auto-regressive models of different orders. (Methods 2, 3, and 4 use data observations before and after the gap.) To illustrate, we consider the following example of quarterly gasoline prices.
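Before turning to the gasoline data, the first two ideas can be sketched in a few lines of NumPy. This is a minimal illustration and not the IMSL routine: the function names `impute_median` and `impute_linear` are hypothetical, and linear interpolation is used here as a simpler stand-in for the spline interpolation of method 2.

```python
import numpy as np

def impute_median(series):
    """Replace each NaN with the median of the observed values before it."""
    original = np.asarray(series, dtype=float)
    filled = original.copy()
    for i in np.flatnonzero(np.isnan(original)):
        history = original[:i]
        history = history[~np.isnan(history)]  # observed values only
        if history.size:                       # a leading gap is left as NaN
            filled[i] = np.median(history)
    return filled

def impute_linear(series):
    """Fill NaNs by linear interpolation between neighbouring observations.

    Assumption: a simplified stand-in for spline interpolation.
    """
    filled = np.asarray(series, dtype=float).copy()
    missing = np.isnan(filled)
    idx = np.arange(filled.size)
    filled[missing] = np.interp(idx[missing], idx[~missing], filled[~missing])
    return filled

# Toy values for illustration only (not the gasoline data).
toy = [1.0, 1.2, np.nan, 1.6, 1.5, np.nan, 1.9]
by_median = impute_median(toy)
by_linear = impute_linear(toy)
```

The median version only looks backward, so it cannot follow a trend across the gap, whereas the interpolating version uses observations on both sides of the gap.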

### Applied Example

The quarterly average price of gasoline from 1978 through the second quarter of 2003 is used as a test data set for the estimation of missing values. While the data set originally contained all values, we removed three of the 101 observations for illustration purposes. The four different methods mentioned above are used to estimate the missing values. The estimated values are shown in TABLE 3 alongside the actual observed values of the original time series, and the percent differences are shown in TABLE 4.

In this instance, method 3 performs the best overall, with method 4 providing very similar results since method 3 is a special case (p=1) of method 4. Method 2 performs very poorly, underestimating values considerably, while median replacement (method 1) works adequately. Results of course will vary depending on the data set.

Once missing values have been filled in, there is no special consideration given to them with respect to seasonal differencing, outlier detection, or forecasting. That is, there is no internal information kept by the routine to give less weight or importance to these estimated values. In other words, so far as the model is concerned, the imputed values are actual values.

## Accounting for Time Series Seasonality

Financial and economic time series are often subjected to seasonal variations due to natural phenomena, normal business cycles, socio-economic behaviors, and a myriad of other factors. To account for seasonal trends, a common technique is to remove seasonal variation from the data series before fitting a model. This technique is known as seasonal differencing.

Seasonal differencing requires two parameters for each seasonal cycle: the seasonal lag, s, and the number of differences, d. For each pair (s, d), a differenced series is generated by subtracting from each observation the observation s time units earlier, and repeating the process d times. Then, the selected time series model is applied to the resulting differenced series. In the following, we apply seasonal differencing to historical automobile sales.
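Before looking at the sales data, the differencing operation itself can be sketched with NumPy. This is an illustrative sketch rather than IMSL code; the function name `seasonal_difference` and the toy series are assumptions made for the example.

```python
import numpy as np

def seasonal_difference(series, s, d=1):
    """Apply the lag-s difference y[t] - y[t - s] to the series d times."""
    out = np.asarray(series, dtype=float)
    for _ in range(d):
        if out.size <= s:
            raise ValueError("series too short for the requested differencing")
        out = out[s:] - out[:-s]  # each pass shortens the series by s points
    return out

# Toy series: a linear trend plus a pattern that repeats every 4 steps.
t = np.arange(16)
toy = 0.5 * t + np.tile([3.0, 0.0, -2.0, 1.0], 4)

adjusted = seasonal_difference(toy, s=4, d=1)
# The repeating pattern cancels, leaving the constant 0.5 * 4 = 2.0
# at each of the 16 - 4 = 12 remaining points.
```

Note that each pass drops s observations from the start of the series, so the differenced series has s times d fewer points than the original.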

### Applied Example

Much sales data is seasonal, and automobile sales are a prime example, traditionally peaking in the fall when new models arrive on the showroom floor. The data set for this example is monthly vehicle sales in the United States from January 1971 to December 1991.

For this example, we use IMSL’s seasonal_fit function to determine the best values for the necessary seasonality parameters s and d, which were found to be 2 and 1 respectively.

The original time series is shown in FIGURE 1 as the black line, with the next eight predicted values in red. The modified time series with the seasonal trend removed appears as the blue line, with the prediction in green. We can see that the blue line has trends removed and should be a more appropriate series for ARIMA modeling. (Note that in the chart, there is a small artifact of the overlay that truncates the top of the blue line). For both series, forecasts are generated using the ARIMA(2,2,2) model. This is the auto-regressive integrated moving average model with 2 autoregressive lags, an integrated parameter of 2 (meaning the twice-differenced series is expected to be stationary), and 2 moving average lags. The low-frequency trend in the data has been removed in the seasonally adjusted time series so that the prediction is influenced more dominantly by the higher frequency oscillations.
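For reference, an ARIMA(2,2,2) model can be written in backshift notation as follows. The symbols are assumptions introduced here, not taken from the IMSL documentation: B is the backshift operator (B y_t = y_{t-1}), the phi and theta terms are the autoregressive and moving average coefficients, and a_t is a white-noise error. Any constant term is omitted, and the sign convention for the moving average terms varies between texts and software packages.

$$
(1 - \phi_1 B - \phi_2 B^2)\,(1 - B)^2\, y_t = (1 - \theta_1 B - \theta_2 B^2)\, a_t
$$

The factor (1 - B)^2 is the "integrated parameter of 2": the model is an ARMA(2,2) applied to the twice-differenced series.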

## Identifying Outliers in Time Series Analysis

Outliers are extreme observations relative to the rest of the data. Outliers can corrupt model estimates and consequently result in less accurate predictions. Below we consider the gasoline price data once again, this time analyzing it for outliers using the IMSL function, ts_outlier_identification.

### Applied Example

The same gasoline price data used in the Missing Values section are used in this showcase of outlier identification. In this example, the original time series (thin black line in FIGURE 2) is used to generate a forecast (dashed black line) using an AR(6) model. (auto_arima selected this model as the best fit for the gasoline price data.)

The function ts_outlier_identification is employed to identify any outliers. The function implements the algorithm of Chen & Liu and identifies outliers as being one of 5 types: Innovational (IO), Additive (AO), Level Shift (LS), Temporary Change (TC), and Unable to Identify (UI). These classifications have to do with the impact the outlier has on the mean of the series, and whether its effect persists or dampens out quickly.
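To make the difference between persisting and dampening effects concrete, the sketch below generates the effect pattern that a single additive, level shift, or temporary change outlier adds to a series. It is an illustration under stated assumptions, not the IMSL implementation: the function name `outlier_effect` is hypothetical, and the damping factor `delta=0.7` is a commonly cited choice that should be treated as an assumption. Innovational outliers are left out because their effect depends on the fitted ARMA model.

```python
import numpy as np

def outlier_effect(kind, n, t0, omega=1.0, delta=0.7):
    """Return the length-n effect of one outlier of size omega at index t0."""
    effect = np.zeros(n)
    if kind == "AO":      # additive: affects a single observation
        effect[t0] = omega
    elif kind == "LS":    # level shift: a permanent step in the mean
        effect[t0:] = omega
    elif kind == "TC":    # temporary change: a jump that decays geometrically
        effect[t0:] = omega * delta ** np.arange(n - t0)
    else:
        raise ValueError(f"unsupported outlier type: {kind}")
    return effect

ao = outlier_effect("AO", n=5, t0=2)  # [0, 0, 1, 0,   0   ]
ls = outlier_effect("LS", n=5, t0=2)  # [0, 0, 1, 1,   1   ]
tc = outlier_effect("TC", n=5, t0=2)  # [0, 0, 1, 0.7, 0.49]
```

Subtracting an estimated effect of this kind from the observed series is the general idea behind "removing" an outlier before refitting the model.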

To analyze the effects of outlier removal, the original data set is taken without modification and eight future values are forecasted. Then, with the input parameters described, the time series is passed through the outlier identification algorithm to find and identify any outliers. Another forecast is computed using the same AR(6) model.

For this data set, four points were identified. Two temporary change (TC) outliers were found at April 1989 and October 1990. An additive outlier (AO) was identified at April 2001. Lastly, an innovational outlier (IO) was determined to exist at April 2002.

The results of this analysis are shown in FIGURE 2. The original time series is plotted as a thin black line, with its computed forecast a black dashed line. Outliers are marked with an asterisk and labeled with the type of outlier. The time series with the effects of these four points removed is overlaid as a thick blue line in the figure. The forecasted values using the adjusted time series appear as a dotted blue line. Eight points (two years) are forecasted beyond the end of the time series.

Notice that while the forecasted trends are similar, their magnitudes differ by an average of 13% over the eight points. By accounting for outliers in input time series, more accurate forecasts can be obtained. Note that IMSL’s auto_arima function performs outlier detection automatically using the same algorithms as in ts_outlier_identification.
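One plausible way to compute an average difference like the 13% above is the mean absolute percent difference between the two forecasts. The article does not state the exact formula used, so the function below is an assumption, and the numbers are toy values rather than the gasoline forecasts.

```python
import numpy as np

def mean_percent_difference(baseline, adjusted):
    """Mean absolute percent difference of adjusted relative to baseline."""
    baseline = np.asarray(baseline, dtype=float)
    adjusted = np.asarray(adjusted, dtype=float)
    return float(np.mean(np.abs(adjusted - baseline) / np.abs(baseline)) * 100.0)

# Toy forecasts for illustration only: every point differs by 10%.
toy_baseline = [100.0, 100.0, 200.0, 200.0]
toy_adjusted = [110.0, 90.0, 220.0, 180.0]
average_gap = mean_percent_difference(toy_baseline, toy_adjusted)  # about 10
```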

## Final Thoughts

Pre-processing data is a necessary component in time series analysis. Visual inspection and simple summaries are always useful, but for the size and complexity of real-world time series, manual inspection alone is not practical. Fortunately, computational methods exist to auto-detect, impute, and auto-configure time series for modeling and prediction.

### References

Chen, C., and L. Liu. 1993. Joint estimation of model parameters and outlier effects in time series. J. Am. Stat. Assoc. 88:284-297.

### Add Time Series Analysis Functionality With IMSL

IMSL features a robust selection of algorithms for creating advanced time series models and analyzing time series data. Talk with an IMSL expert today to see how you can add time series analysis functionality to your application without the cost of creating or maintaining in-house solutions.