13. Skeptical Conclusions

 

  



Alfred North Whitehead, in his 1911 An Introduction to Mathematics, observed:

It is a profoundly erroneous truism, repeated by all copy-books and by eminent people when they are making speeches, that we should cultivate the habit of thinking of what we are doing. The precise opposite is the case. Civilization advances by extending the number of important operations which we can perform without thinking about them. . . .

From a slightly different perspective, this is precisely a formula for the stagnation of civilization, for someone at some time must do the hard thinking upon which such important operations are built, especially if civilization is to advance by the overthrow of thoughtless dogma.

Nathan Divinsky in The Chess Encyclopedia [D] calls the Elo System "a mathematically sound and universally accepted (1970) rating system for chess players." The year refers to the adoption of the Elo System by FIDE. Aside from a 1965 contribution by Elo to The Journal of Gerontology, there has been virtually no peer review of the system beyond the world of organized chess. One of the few external references is to be found in The Mathematics of Games, by J. D. Beasley [B, p.61]. Beasley offers this scathing footnote on the work of the late Professor Elo:

Chess enthusiasts may be surprised that the name of Elo has not figured more prominently in this discussion, since the Elo rating system has been in use internationally since 1970.  However, Elo's work as described in his book The rating of chessplayers, past and present (Batsford, 1978) is open to serious criticism.  His statistical testing is unsatisfactory to the point of being meaningless; he calculates standard deviations without allowing for draws, he does not always appear to allow for the extent to which his test results have contributed to the ratings which they purport to be testing, and he fails to make the important distinction between proving a proposition true and merely failing to prove it false.  In particular, an analysis of 4795 games from Milwaukee Open tournaments, which he represents as demonstrating the normal distribution function to be the appropriate expectation function for chess, is actually no more than an incorrect analysis of the variation within his data.  He appears not to realize that changes in the overall strength of a pool cannot be detected, and that his 'deflation control', which claims to stabilize the implied reference level, is a delusion.  Administrators of other sports (for example tennis) currently publish only rankings.  The limitations of these are obvious, but at least they do not encourage illusory comparisons between today's champions and those of the past.  

The proof of the pudding, it has been said, is the actual operation of a rating system, and the Elo System has been grinding out chess ratings for half a century now with hardly a grumble from the rating pool.  One is tempted to say that the system works despite its theory rather than because of it.  The reputation of the Elo System, on the other hand, rests largely on its supposed ability to predict chess outcomes.  There is even the occasional inquiry as to whether the system can predict outcomes in sports such as basketball, football, golf and soccer.  As this treatise has attempted to show, the predictive powers of the Elo System are not due to its application of probability theory, which in the final analysis must be characterized as a misapplication, but rather to principles of averaging which have hardly been articulated elsewhere.

Probability theory, as it happens, does explain much of the success of the Elo System, but it is theory of a different sort from what its author took for granted. If Elo misapplied theory, he also made considerable use of mathematical intuition, which in other contexts he disparaged. The result is a system that is a marked improvement over those that preceded it, but one that falls short of the scientific rigor that Elo envisioned for it. If there is any lesson to be learned from his celebrated work, it is that no single system is likely to satisfy the requirements of statistical precision. Rating systems in the past, as Elo notes, "received acceptance because they produced ranking lists which agreed generally with the personal estimates of rankings made by knowledgeable chess players" [E2, Part 1]. Even now popular taste may have a role in deciding which system is to be sanctioned by organized chess and how it is to be administered.

The principles of rating theory undoubtedly have applications beyond chess. As Elo said of his own system, it is "applicable to any type of competitive activity in which individuals or teams engage in pairwise competition" [E1, preface]. To this may be added applications to noncompetitive pairwise comparisons, such as opinion sampling for marketing research. One would hope that the current controversy is resolved before such wholesale applications are attempted. For some, however, the allure of rating theory lies in the controversy itself. It is a controversy that has not yet been played out in organized chess, and it is a cautionary tale for all involved.