16. Skeptical Conclusions

Nathan Divinsky in The Chess Encyclopedia calls the Elo System "a mathematically sound and universally accepted (1970) rating system for chess players." [D]  The year refers to the adoption of the Elo System by FIDE.  Aside from a 1965 contribution by Elo to The Journal of Gerontology, there has been virtually no peer review of the system beyond the world of organized chess.  One of the few external references appears in The Mathematics of Games, by J. D. Beasley [B, p.61].  Beasley offered this scathing footnote on the work of the late Professor Elo:

Chess enthusiasts may be surprised that the name of Elo has not figured more prominently in this discussion, since the Elo rating system has been in use internationally since 1970.  However, Elo's work as described in his book The rating of chessplayers, past and present (Batsford, 1978) is open to serious criticism.  His statistical testing is unsatisfactory to the point of being meaningless; he calculates standard deviations without allowing for draws, he does not always appear to allow for the extent to which his test results have contributed to the ratings which they purport to be testing, and he fails to make the important distinction between proving a proposition true and merely failing to prove it false.  In particular, an analysis of 4795 games from Milwaukee Open tournaments, which he represents as demonstrating the normal distribution function to be the appropriate expectation function for chess, is actually no more than an incorrect analysis of the variation within his data.  He appears not to realize that changes in the overall strength of a pool cannot be detected, and that his 'deflation control', which claims to stabilize the implied reference level, is a delusion.  Administrators of other sports (for example tennis) currently publish only rankings.  The limitations of these are obvious, but at least they do not encourage illusory comparisons between today's champions and those of the past.  

The proof of the pudding, it has been said, is the actual operation of a rating system, and the Elo System has been grinding out chess ratings for over four decades now with hardly a grumble from the rating pool.  One is tempted to say that the system works despite its theory rather than because of it.  The reputation of the Elo System rests largely on its supposed ability to predict chess outcomes.  There is even the occasional inquiry as to whether the system can predict outcomes in sports such as basketball, football, golf and soccer.  As this treatise has attempted to show, the predictive powers of the Elo System are not due to its application of probability theory, which in the final analysis must be characterized as a misapplication, but rather to principles of averaging which have hardly been articulated elsewhere.

The main weakness of the Elo System arises from the scientist’s habit of overvaluing the artifacts of his profession, in this instance probability distributions. The system is not so much an attempt to apply statistical principles to the rating problem as an effort to shape mathematical intuitions to the judgments of abstract theory. As a case in point, Elo began his development of the established rating logically enough with cumulative averaging, but the argument then took a detour into the expectancy curve. Again, he took great pains with nonlinear probability functions, only to find that over their “most used” portions they behave much like a linear system. And again, the development of the logistic system begins with a principle that this treatise has offered as the basis of ratio systems, but the Elo System recasts that principle, by a dubious multiplication of odds, as a probability distribution. Equally dubious are the attempts to explain the association of rating difference with percentage expectancy by the overlapping of hypothetical normal distributions. As an alternative to these tortured arguments, this treatise has postulated simple relations between rating differences or ratios and relative performance.
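
The near-linearity of the expectancy curves over their most used portions can be checked numerically. What follows is a minimal sketch in Python, not part of the original argument. It assumes the commonly cited forms of the two curves: a normal curve with standard deviation 200 * sqrt(2) for the rating difference, and a logistic curve with base 10 and divisor 400. The linear function is simply the tangent to the normal curve at zero difference, chosen here for illustration, and all function names are hypothetical.

```python
import math


def normal_expectancy(d):
    """Expected score for rating difference d under the normal-curve model.

    Assumption: the difference in performance has standard deviation
    200 * sqrt(2), so Phi(d / (200 * sqrt(2))) reduces to the erf form below.
    """
    return 0.5 * (1.0 + math.erf(d / 400.0))


def logistic_expectancy(d):
    """Expected score under the logistic model (assumed base 10, divisor 400)."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))


def linear_expectancy(d):
    """Illustrative linear model: tangent to the normal curve at d = 0,
    clipped to the interval [0, 1]."""
    slope = 1.0 / (400.0 * math.sqrt(math.pi))
    return min(1.0, max(0.0, 0.5 + slope * d))


def odds(p):
    """Convert an expected score p (0 < p < 1) into odds p : (1 - p)."""
    return p / (1.0 - p)


def max_gap(f, g, lo=-200, hi=200):
    """Largest absolute difference between two curves over whole-point
    rating differences from lo to hi inclusive."""
    return max(abs(f(d) - g(d)) for d in range(lo, hi + 1))


if __name__ == "__main__":
    # Tabulate the three curves at a few rating differences.
    for d in (0, 50, 100, 200, 400):
        print(d,
              round(normal_expectancy(d), 3),
              round(logistic_expectancy(d), 3),
              round(linear_expectancy(d), 3))
    print("max gap, linear vs normal, |d| <= 200:",
          round(max_gap(linear_expectancy, normal_expectancy), 4))
```

Under these assumptions the tangent line stays within about 0.03 of the normal curve for differences up to 200 points, and the gap widens beyond that range. The odds helper shows the sense in which odds multiply under the logistic curve: the odds at a difference of 250 equal the product of the odds at 100 and at 150.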

Probability theory, as it happens, does explain much of the success of the Elo System, but theory of a different sort than its author took for granted. The understanding of percentage scores with respect to rating differences or ratios as tending to long-term limits is quite absent from the system, though application of the frequency theory of probability seems natural enough. If Elo misapplied theory, he also made considerable use of mathematical intuition, which he otherwise disparaged.  The result is a system that is a marked improvement over those that preceded it, but a system that falls short of the scientific rigor that Elo envisioned for it.  The lesson perhaps is that no one system is likely to be the last word in statistical precision.  What, then, lies in the future for chess rating systems? There is, to be sure, no predicting the winds of change, but the fascination of chess ratings lies in their controversial nature and their capacity for inspiring new ideas. Let us hope that the pronouncements of experts do not prematurely put an end to the controversy.