2. Probability in Rating Systems



The first assumption of the Elo System is that the chess performance of an individual player is a random variable that can be described by the normal curve [E1, 1.31]. But what is varying in this random variable? As applied to a single game, performance is an abstraction that cannot be measured objectively. It consists of all the judgments, decisions, and actions of the contestant in the course of the game. Perhaps a panel of experts in the art of the game could evaluate each move on some arbitrary scale and crudely express the total performance numerically, even as is done in boxing and gymnastics [E1, 1.32].

Performance in this light is not a promising candidate for mathematical treatment, but there is a simple definition of performance that is as old as the game. If a player outperforms the opponent, the player scores a point; if the player is outperformed, the opponent scores the point; and if a draw occurs, the point is divided. This simple definition allows an interesting probability treatment.

Under the heading "Sundry Theoretical Topics" Elo pointed out that, if we know the probabilities P(win), P(loss), and P(draw), the probability of a specific outcome of W wins, L losses, and D draws in N = W + L + D games can be calculated precisely as [E1, 8.9]

$$P(W, L, D) = \frac{N!}{W!\,L!\,D!}\; P(\mathrm{win})^{W}\, P(\mathrm{loss})^{L}\, P(\mathrm{draw})^{D} \qquad [2.1]$$

Let us consider the outcomes for ten games (N = 10), where P(win) = .5, P(loss) = .2, and P(draw) = .3, and let us express the results in terms of points scored. Since [2.1] must be calculated for each of the 66 weak three-part compositions of 10, this is best done with a computer program (Downloads). The results are as follows:

Elo presented [2.1] as an afterthought, but it provides the only convincing demonstration of the "first and basic assumption" of his system [E2, Part 1, "Form Varies"]. Contrary to the assumption of a constant variance, the distribution of the score, expressed on a fixed scale such as points per ten games, becomes increasingly narrow as N increases.
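The calculation described for [2.1] can be sketched in a few lines of Python. This is not the program offered under Downloads; it is a minimal illustration, and the function names (`outcome_probability`, `score_distribution`, `per_game_mean_and_sd`) are hypothetical. It evaluates [2.1] for every composition of N games, pools the outcomes that yield the same number of points (win = 1, draw = 1/2), and reports the mean and standard deviation of the score per game so that different values of N can be compared on one scale.

```python
from collections import defaultdict
from math import factorial


def outcome_probability(w, l, d, p_win, p_loss, p_draw):
    """Probability of exactly w wins, l losses, and d draws, as in [2.1]."""
    n = w + l + d
    coeff = factorial(n) // (factorial(w) * factorial(l) * factorial(d))
    return coeff * p_win**w * p_loss**l * p_draw**d


def score_distribution(n, p_win, p_loss, p_draw):
    """Return ({points: probability}, number of compositions) for n games."""
    dist = defaultdict(float)
    compositions = 0
    for w in range(n + 1):
        for d in range(n - w + 1):
            l = n - w - d
            points = w + 0.5 * d  # a draw divides the point
            dist[points] += outcome_probability(w, l, d, p_win, p_loss, p_draw)
            compositions += 1
    return dict(dist), compositions


def per_game_mean_and_sd(dist, n):
    """Mean and standard deviation of the score expressed per game."""
    mean = sum((s / n) * p for s, p in dist.items())
    var = sum((s / n - mean) ** 2 * p for s, p in dist.items())
    return mean, var**0.5


if __name__ == "__main__":
    dist, count = score_distribution(10, 0.5, 0.2, 0.3)
    print("compositions evaluated:", count)
    for points in sorted(dist):
        print(f"{points:4.1f} points: {dist[points]:.4f}")

    # Compare the spread of the per-game score for increasing N.
    for n in (10, 40, 160):
        d_n, _ = score_distribution(n, 0.5, 0.2, 0.3)
        mean, sd = per_game_mean_and_sd(d_n, n)
        print(f"N = {n:3d}: mean per game = {mean:.3f}, sd per game = {sd:.4f}")
```

For N = 10 the loop visits the 66 compositions, and the mean corresponds to 6.5 points in ten games. The standard deviation of the per-game score falls in proportion to the inverse square root of N, which is the narrowing at issue.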
As N goes to infinity, the distribution becomes a vertical bar at the average score, 6.5. It is this long-term average that is the key to probability in rating systems. The problem then becomes one of relating the long-term average to rating difference.

Elo's nebulous definition of performance leads to the central concept of his system: the Percentage Expectancy Curve, which is patterned on a well-known probability function, either the normal or the logistic. The Percentage Expectancy Curve relates percentage score to rating difference. Exactly how it does this is a crucial question. We are shown two overlapping distributions and are told that the shaded portion of one "represents the probability that the lower rated player will outperform the higher" [E1, 8.23]. Apparently, if a player's performance is greater than that of the opponent, the player wins; if the performance is lower, the player loses. This argument, aside from the objection that it leaves draws out of account, makes distinctions in distributions that are already ill-defined. Outperforming the opponent seems equivalent in every respect to winning; yet this would suggest a binomial distribution, if not a trinomial one as in the illustration above.

It is sometimes noted in support of the Percentage Expectancy Curve that the distribution on which it is based appears quite often in natural phenomena, in everything indeed from IQ scores to errors of measurement. This fact was not lost on Elo:

Eminent mathematicians have tried many times to deduce the normal distribution curve from pure theory, with little notable success. "Everybody firmly believes it," the great mathematician Henri Poincaré remarked, "because mathematicians imagine that it is a fact of observation, and observers that it is a theorem of mathematics." (Poincaré 1892) [E1, 1.39]

This somewhat mysterious observation is explained by the prevalence of binary phenomena in nature.
It applies, in any case, only to phenomena that are measurable and whose measurement is independent of the normal curve itself.

For Elo the Percentage Expectancy Curve
was patently a probability function, and he would entertain no
objection that it was not. Since a percentage score
may be thought of as an estimate of probability defined as a long-term
percentage, there is reason enough to regard the Percentage Expectancy
Curve as a function that relates probability to rating difference.
In this broad sense it is a probability function. More precisely, the term
applies to those functions that arise
in probability theory from a mathematical analysis of variability, such as
the normal curve or the logistic, and these we may call true
probability functions. A function that merely maps probability to
another variable without some justification based on variability analysis
would thus be called an arbitrary probability function. Such
functions include those based arbitrarily on a true probability function,
which is the case of the Percentage Expectancy Curve. If the
Percentage Expectancy Curve itself were a true probability function, it
would be derived independently from distributions of rating difference,
however these might arise, but these can hardly be known without a
pre-established definition of ratings.
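For concreteness, the two functions on which the Percentage Expectancy Curve is said to be patterned can be sketched as follows. The constants are assumptions not stated in this section: a scale of 400 points for the logistic form and a per-player standard deviation of 200 points for the normal form are the values conventionally associated with the Elo system. The function names are hypothetical. The normal version computes the overlap of two performance distributions, that is, the probability that one normally distributed performance exceeds another, which is the construction questioned above.

```python
from math import erf, sqrt


def expected_score_logistic(diff, scale=400.0):
    """Logistic form: expected score for a rating advantage of `diff`.

    `scale` = 400 is the conventional value, assumed here.
    """
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


def expected_score_normal(diff, sigma=200.0):
    """Normal form: probability that one performance exceeds the other.

    Each performance is taken as normal with standard deviation `sigma`
    (200 is the conventional value, assumed here), so the difference of
    two independent performances has standard deviation sigma * sqrt(2).
    """
    z = diff / (sigma * sqrt(2.0))
    # Standard normal cumulative distribution evaluated at z.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


if __name__ == "__main__":
    for diff in (0, 100, 200, 400):
        print(
            f"difference {diff:3d}: "
            f"logistic {expected_score_logistic(diff):.3f}, "
            f"normal {expected_score_normal(diff):.3f}"
        )
```

Both functions map a rating difference to a number between 0 and 1 and return one half at a difference of zero. Nothing in the code decides whether that number is a probability in the strict sense discussed here; it only shows the mapping.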