Use of Excel for Statistical Analysis
Neil Cox, Statistician, AgResearch Ruakura
Private Bag 3123, Hamilton, New Zealand
16 May 2000
This article gives an assessment of the practical implications of deficiencies reported by McCullough and Wilson (1999) in Excel’s statistical procedures. I outline what testing was done, discuss what deficiencies were found, assess the likely impact of the deficiencies, and give my opinion on the role of Excel in the analysis of data.
My overall assessment is that, while Excel uses algorithms that are not robust and can lead to errors in extreme cases, the errors are very unlikely to arise in typical scientific data analysis in AgResearch.
THE DEFICIENCIES OF EXCEL’S STATISTICAL ALGORITHMS
What Aspects Were Examined?
Excel’s calculation of distributions (tail probabilities), mean and standard deviation calculations, analysis of variance, linear regression, non-linear regression (using Solver) and random numbers were scrutinised using data sets designed to reveal any shortcomings in the numerical procedures used in the calculations of statistics packages. The distributions were tested by Knusel (1998), the other aspects by McCullough and Wilson (1999). McCullough (1998, 1999) describes the methodology and the performance of SAS, SPSS and S-Plus.
How Did Excel Rate?
Generally Excel performed worse than the 3 statistics packages (SAS, SPSS, S-Plus) also examined, particularly in the non-linear regression problems. See below for more detail. The conclusion from these tests is that, in many cases, Excel uses naïve algorithms that are vulnerable to rounding and truncation errors and may produce very inaccurate results in extreme cases.
Distributions
Excel failed to give results for some discrete distributions; the failures occur when the number of cases is high and result from Excel producing, in its calculations, numbers too big to handle. The results are reliable when an answer is given. For the continuous distributions, such as the normal distribution, Excel’s results for extreme tails beyond about 10-6 are poor; this is not normally an issue for significance testing.
Means, Standard Deviations, Analysis of Variance
Various data sets were used to check Excel’s ability to get accurate results. The data sets are designed to discover whether the algorithms used are robust. For instance, the 2 data sets 90000001, 90000002, 90000003 and 1, 2, 3 have the same standard deviation (1) but Excel fails to get this answer in the first case. This is because it uses a naïve algorithm that results in subtracting two nearly equal very large numbers and the correct answer gets lost because computers store numbers with finite precision. Better algorithms avoid this problem. Excel failed to give satisfactory results in several of the more testing anova data sets but SAS (Anova) and SPSS did no better.
Linear Regression
Excel gave satisfactory results on all but one data set that had very high collinearity. SAS and SPSS report this problem and their inability to find a solution while Excel happily found a "solution" that is wrong.
Non-Linear Regression
Excel’s "Solver" (an Excel add-in) was not able to give satisfactory results for several of the non-linear problems while the more sophisticated routines in the statistics packages gave satisfactory results for most of the problems. I have used Excel’s Solver for a few problems and it has performed well (once I have parameterised the problem sensibly and found a reliable way of choosing starting values), giving results in close agreement to statistics packages. However, its performance for any particular application needs to be checked against a better package. And as you get no standard errors with the estimates, its usefulness is limited.
Random Number Generator
Excel’s random number generator failed more of the tests of randomness than did the statistics packages examined. Hence bootstrap methods should not be used without further testing of the implications of the deficiencies in the generator. My own experience, using simulations to check difficult (for me) theoretical probability calculations, has been that the random number generator is very satisfactory.
THE IMPACT OF THESE DEFICIENCIES
In What Circumstances Will Excel be Unreliable?
Will These Problems Affect You?
If you are using Excel for simple summaries, simple tests (t-tests, Chi-square, etc), regression analysis, it is most unlikely you will have any problems; Excel will give the right answers. The impact of the poorer algorithms used by Excel is less now that numbers are stored with about 15 significant digits than some years ago when numbers were often stored with only 7 significant digits. If you’re dealing with very large numbers, scaling and/or re-centring your numbers can easily ensure you don’t strike any rounding errors. Any serious statistics package will look after this for you; Excel does not.
CONCLUSIONS
What is Excel’s Use in the Analysis of Scientific Data?
Excel is not a statistics package, more so for the very limited range of analysis tools available in it than for its naïve numerical algorithms. Nevertheless, it has a useful role in the analysis of data. Data analysis is much more than doing formal analyses and calculating P-values. When used effectively, Excel can be very useful in the exploratory analysis of data:
While some statistics packages have much more powerful exploratory graphing capability, Excel can often do all that is needed quite easily. Excel is of very limited use in the formal statistical analysis of data unless your experimental design is very simple. It is possible to write procedures in Excel to do more complex analyses and many people have produced statistical add-ins. Some producers of add-ins have used numerically sound procedures and have not relied on Excel’s functions. However, the "Data Analysis Toolpack" provided with Excel is no easier to use than most statistics packages, has very limited capability, has known bugs and so, on the whole, is not worth bothering with. In AgResearch, we have a number of good statistics packages available and it is very easy to simply cut and paste your data into them to do formal statistical analyses. Any new statistical package (whether it be an Excel add-in or a stand-alone package) should be regarded with caution until it has been thoroughly checked out.
References
Knusel, L., (1998) On the accuracy of statistical distributions in Microsoft Excel 97. Computational Statistics and Data Analysis 26, 375-377
McCullough, B.D., (1998) Assessing the reliability of statistical software: Part I. The American Statistician 52, 358-366
McCullough, B.D., (1999) Assessing the reliability of statistical software: Part II. The American Statistician 53, 149-159
McCullough B.D. and Wilson B., (1999) On the accuracy of statistical procedures in Microsoft Excel 97. Computational Statistics and Data Analysis 31, 27-37