I have been meaning to write about the stability of correlation coefficients for a while, and was spurred into action by a recent article in the International Herald Tribune (“Eat Quickly, for the Economy’s Sake,” May 8, 2009). The article discusses the relationship between economic growth and time spent eating, based on a recent OECD study that looked at a variety of social indicators across the 22 countries surveyed. It reports a negative correlation between eating time and economic growth: the U.S., Canada and Mexico are the countries in which people spend the least time eating (and which have higher economic growth), while France is the country where the most time is spent eating (“to no one’s surprise,” according to the author).
Keen to replicate the results of the study for myself, I downloaded the eating survey data from the OECD web-site. However, I was for some reason unable to find the same economic growth figures used in the article, and so searched for similar data myself, with the result that the time period covered in measuring economic growth differed from that used in the article. The key point is that my analysis produced a positive (+16%) correlation coefficient between eating time and economic growth, rather than a negative one (…maybe a bit of relaxation does help creativity?…back to the land of dreams…).
In fact, correlation coefficients measured between data samples are not particularly stable until the data sets become very large. The @RISK tool (risk analysis using Monte Carlo simulation in Excel) can be used to explore this issue. For example, one can set up two data sets of a given size (e.g. 22 points) containing independent distributions and, at each iteration of a simulation, calculate the correlation coefficient between the sample points drawn from the independent distributions, with this correlation coefficient defined as the simulation’s output cell. A few observations can be made from such an experiment:
- Data sets containing only two points each will have a correlation coefficient of either +100% or −100% (except in the rare case where two data points have exactly the same value), since the two points can only be ordered the same way in both sets (both low-high) or in opposite ways (one low-high, the other high-low). This already provides some intuition as to the lack of stability of the coefficient.
- The correlation coefficient as measured from two independent samples of 22 data points has a standard deviation of around 20% (and a mean of zero); in roughly one-third of cases, the measured correlation coefficient between these two independent samples will be more than 20% in magnitude (in either the positive or negative direction).
- About 100 data points are required in each set before the standard deviation of the correlation coefficient drops below 10%.
- The inverse square-root law applies: doubling the sample size reduces the standard deviation of the correlation coefficient to approximately 71% (≈1/√2) of its previous value.
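The observations above can also be checked outside of @RISK. The sketch below (in Python with numpy, standing in for the Excel/@RISK setup described earlier, and using normal distributions as an illustrative choice of independent inputs) draws two independent samples of a given size, measures their correlation, and repeats this over many iterations to estimate the spread of the coefficient.

```python
import numpy as np

# A sketch of the experiment described above, with numpy standing in
# for the Excel/@RISK setup: draw two independent samples of n points,
# measure their sample correlation, and repeat over many iterations.

rng = np.random.default_rng(2009)

def corr_coefficients(n_points, n_iterations=20_000):
    """Sample correlation coefficients between pairs of independent
    standard-normal samples of size n_points."""
    xs = rng.standard_normal((n_iterations, n_points))
    ys = rng.standard_normal((n_iterations, n_points))
    # Row-wise sample correlation: centre each sample, then apply the
    # usual covariance / (stdev * stdev) formula, vectorised over rows.
    xs -= xs.mean(axis=1, keepdims=True)
    ys -= ys.mean(axis=1, keepdims=True)
    return (xs * ys).sum(axis=1) / np.sqrt(
        (xs**2).sum(axis=1) * (ys**2).sum(axis=1))

r22 = corr_coefficients(22)
print(f"n=22: mean ≈ {r22.mean():+.3f}, stdev ≈ {r22.std():.3f}")
# Roughly a third of the measured coefficients exceed 20% in magnitude:
print(f"fraction with |r| > 0.20: {np.mean(np.abs(r22) > 0.20):.2f}")

# The inverse square-root law: each doubling of n shrinks the
# standard deviation by a factor of about 1/sqrt(2) ≈ 0.71.
for n in (22, 44, 100, 200):
    print(f"n={n:3d}: stdev ≈ {corr_coefficients(n).std():.3f}")
```

For two independent normal samples, the standard deviation of the measured coefficient is approximately 1/√(n−1), which for n = 22 gives about 0.22, consistent with the ~20% figure quoted above.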
Referring back to the original article, my own feeling is therefore that the relationship discussed is driven by the inherent uncertainty in dealing with small sample sizes. As the article’s author states: “Such correlations may be nothing but coincidence, of course.” As we saw when @RISK was applied to the problem, any statistical analysis can be subject to these traps.
Undaunted, the author continues: “But if the data are genuine, a contribution to world growth is rendered by any institution that enables people to eat rapidly and gain weight. Take a bow, McDonalds.”
Dr. Michael Rees
Director of Training and Consulting