Data Issues Part 1

In a recent public training workshop (for @RISK for Excel) I was reminded of an unusual fact regarding data.

@RISK for Excel is commonly used to fit distributions to historical data for use in risk modelling, and it sure beats wildly guessing obscure parameters. However, there is (naturally) a litany of woe-inducing problems with all historical data sets: non-stationary data series, extreme values/outliers, data recording errors, seasonality and heteroskedasticity, to name a few. Excessive ‘cleansing’ of the data set is commonly prescribed, but the statistician in me cringes to even type those words! Quality control and transforming the data will help to eliminate most of those problems, but what about outliers?

In the early Noughties I was working for a large Australian bank, forecasting their daily call centre volumes for the purpose of planning staff levels and predicting service levels. A particular call centre averaged 30,000 calls per weekday. Yet on September 12th, 2001, calls dropped to fewer than 10,000. Along with the rest of the world, Australians were watching the terrorist attacks on television and the internet rather than calling to fix spelling mistakes in their contact details or transfer small sums of money between accounts. But what to do with that data point? Presuming the forecasting model is not intended to include such extreme events as terrorist attacks, the point could simply be filtered out of the data set and not thought of again.

But now consider a process that should include rarer events, such as flood damage or operational risk, as one of the risks in a model. Suppose you have 10 years of good data (say), but the set includes an event that should only occur once every 100 years. This level of impact is thus drastically overrepresented in the data, and any fitted distribution will be biased toward such extremes. Yet the data point cannot be completely ignored, as such values can occur and the simulation models must have the capacity to sample such values (though with a reasonable likelihood). In this case the artistry that is fitting distributions to data comes to the fore. The data point could be removed from the set but not from our decision making process.
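To put rough numbers on that overrepresentation, here is a minimal Python sketch. It assumes, purely for illustration, that years are independent and that the event has a constant 1-in-100 chance each year; the function names are my own and have nothing to do with @RISK.

```python
def prob_at_least_once(annual_prob: float, years: int) -> float:
    """Chance of seeing the event at least once in `years` independent years."""
    return 1.0 - (1.0 - annual_prob) ** years


def overrepresentation_factor(annual_prob: float, years: int, occurrences: int = 1) -> float:
    """Ratio of the naive empirical frequency to the assumed true annual frequency."""
    empirical_frequency = occurrences / years
    return empirical_frequency / annual_prob


# Assumed values from the example in the text: a 1-in-100-year event, 10 years of data.
ANNUAL_PROB = 0.01
YEARS_OF_DATA = 10

print("Chance the event lands in the sample:",
      round(prob_at_least_once(ANNUAL_PROB, YEARS_OF_DATA), 4))
print("Overrepresentation if it does:",
      round(overrepresentation_factor(ANNUAL_PROB, YEARS_OF_DATA), 1))
```

Under those assumptions there is a bit under a one-in-ten chance that the event turns up in the sample at all, and when it does, the raw data suggests it happens about ten times as often as it should.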

From the range of distributions that can be selected, the optimal choice should not only represent the remaining data well but also have a tail that samples events in the vicinity of those that have been excluded from the analysis with reasonable probability. No, that’s not always easy to do. But as with many elements of probabilistic modelling it simply must be done in order to provide useful information to decision makers.
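One way to sanity-check a candidate's tail is to fit it to the remaining data and then ask how likely it thinks the excluded value is. The sketch below does this in plain Python rather than in @RISK. The loss figures, the excluded value and the choice of lognormal and exponential candidates are all made up for illustration, and each data point is treated as one year's loss.

```python
import math
import statistics


def fit_lognormal(data):
    """Maximum-likelihood lognormal fit: mean and population std dev of the logs."""
    logs = [math.log(x) for x in data]
    return statistics.fmean(logs), statistics.pstdev(logs)


def lognormal_exceedance(x, mu, sigma):
    """P(X > x) for a lognormal with log-mean mu and log-std sigma."""
    z = (math.log(x) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def fit_exponential(data):
    """Maximum-likelihood exponential fit: rate = 1 / sample mean."""
    return 1.0 / statistics.fmean(data)


def exponential_exceedance(x, rate):
    """P(X > x) for an exponential with the given rate."""
    return math.exp(-rate * x)


# Hypothetical annual losses (in $m) for the nine years left after filtering.
remaining = [1.2, 0.8, 2.5, 1.9, 0.6, 3.1, 1.4, 2.2, 0.9]
excluded = 8.0         # hypothetical size of the removed 1-in-100-year event
target_prob = 0.01     # the annual likelihood we believe it should have

mu, sigma = fit_lognormal(remaining)
rate = fit_exponential(remaining)

p_lognormal = lognormal_exceedance(excluded, mu, sigma)
p_exponential = exponential_exceedance(excluded, rate)

print(f"Target annual probability: {target_prob}")
print(f"Lognormal   P(loss > {excluded}): {p_lognormal:.6f}")
print(f"Exponential P(loss > {excluded}): {p_exponential:.6f}")
```

The candidate whose tail probability sits closest to the target is not automatically the winner, since it still has to represent the bulk of the data well, but a candidate that makes the excluded event practically impossible is a warning sign.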

Thus the context of the modelling can go a long way towards determining the most appropriate steps to take with your data set. If that sounds like a subjective guideline then you read it correctly. Not enough people realise just how important experience and intuition can be in the seemingly prescriptive fields of mathematics and statistics. Fitting distributions to data is no different.

And yet that isn’t the unusual fact I was reminded of in the workshop! But I’ll leave that for Part 2 of my Data Issues blog.

Rishi Prabhakar
