4. Interval Ratings

A statistical theory of ratings begins with the discovery that a typical rating system, such as the Ingo, is tacitly dealing with differences in percentage score.  Consider a sequence of games between two players, A and B.  If A's percentage score in this sequence is P, then B's score is 1 – P, and the difference in percentage score P(A) – P(B) is

                         P – (1 – P)  =  2P – 1  =  2(P – .50) .

This last expression recalls the basic Ingo formula [1.1] and suggests its motivation.  We can generalize this discovery by postulating a principle of interval rating systems:  differences in rating reflect differences in percentage score.  A formula that captures this principle is

[4.1]          R  =  ERc + K(P - Pc) .

The symbol E (expected) is used here to designate an arithmetic mean, so that ERc represents the mean rating of opponents.  K is an arbitrary constant (-50 in the Ingo System).  P and Pc are the percentage scores of the player and of the opposition.  The difference in percentage score may also be written as

[4.2]          (W - L) / N

where W and L are the points won and lost in N games.  A term of convenience for the difference in percentage score is relative performance, which will later be broadened to include ratios.
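
As an illustration of [4.1] with the difference in percentage score written in the form [4.2], the following is a minimal sketch in Python.  The function name, the argument names, and the figures are hypothetical.  The sketch also assumes that each game distributes exactly one point, so that N is the sum of points won and lost.

```python
def linear_rating(mean_opp_rating, k, points_won, points_lost):
    """Rating by [4.1], with P - Pc written as (W - L) / N as in [4.2]."""
    # Assumption: every game distributes one point (a draw gives half a
    # point to each side), so N = W + L.
    n_games = points_won + points_lost
    relative_performance = (points_won - points_lost) / n_games
    return mean_opp_rating + k * relative_performance


# Hypothetical figures: 7 points won and 3 lost against opponents whose
# mean rating is 120, with K = -50 as in the Ingo System.
rating = linear_rating(mean_opp_rating=120.0, k=-50.0,
                       points_won=7.0, points_lost=3.0)
print(rating)
```

With these figures the relative performance is .4, and the rating lies 20 points from the mean rating of the opposition, on the lower side because K is negative.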

The effect of [4.1] when applied to a competing field is to generate rating differences in proportion to relative performance, which is more easily seen by writing the formula as

[4.3]          R - ERc  =  K(P - Pc) .

The latter may be viewed as an equation of means over game instances,

[4.4]           E[R - Rc]  =  E[K(S - Sc)],

where the relative score, S - Sc, evaluates in chess to 1, -1, or 0 (for a win, a loss, or a draw).  For individual games, rating difference may be thought of as predicting relative score as an approximation, and the question naturally arises as to how good this approximation is.  The efficiency of linear rating systems rests on a basic statistical argument:  simply put, the mean of a set of scores is the value that minimizes the sum of their squared deviations.  This can be demonstrated by the mathematically trained as an exercise in differential calculus, but the idea is basic enough to be taken for granted.  We shall call the argument Gauss's principle, since he appears to have been the first to use it.
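
Readers who prefer a numerical check to the calculus may find the following sketch useful.  It relies on a standard identity: the sum of squared deviations about any value c exceeds the sum about the mean by exactly n times the square of (c - mean), which can never be negative.  The function name and the sample scores are hypothetical.

```python
def sum_sq_dev(scores, c):
    """Sum of squared deviations of the scores from the value c."""
    return sum((x - c) ** 2 for x in scores)


# Hypothetical relative scores for eight games (1 = win, -1 = loss, 0 = draw).
scores = [1, -1, 0, 1, 1, 0, -1, 1]
n = len(scores)
mean = sum(scores) / n

# For each trial value c, compare the excess over the minimum with n(c - mean)^2.
for c in (-1.0, 0.0, mean, 0.5, 1.0):
    excess = sum_sq_dev(scores, c) - sum_sq_dev(scores, mean)
    print(c, excess, n * (c - mean) ** 2)
```

The two printed quantities agree for every trial value, and the excess vanishes only when c is the mean itself.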

If the rating of Formula [4.1] is represented as the mean

[4.5]          R  =  E[Rc + K(S - Sc)] ,   

it follows directly from Gauss's principle that

                   S(R - [Rc + K(S - Sc)]) 2      

is an absolute minimum over real values of R when [4.1] holds true.  We have only to regroup terms as

                    S [(R - Rc)  -  K(S - Sc)] 2

to show that difference in rating predicts relative score in this least-squares sense.  The approximation is optimal for ratings calculated by the general linear formula, regardless of the consistency of data on which the ratings are based.
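
The argument can be checked on a small example.  The sketch below computes a rating as the mean [4.5] over game instances, compares it with the value given directly by [4.1], and evaluates the sum of squares at nearby trial ratings.  The function names and the game data are hypothetical, and K = -50 is borrowed from the Ingo System purely for illustration.

```python
def rating_from_games(games, k):
    """Rating as the mean [4.5] of Rc + K(S - Sc) over game instances.

    Each game is a pair (opponent_rating, relative_score), where the
    relative score S - Sc is 1, -1, or 0.
    """
    per_game = [rc + k * rel for rc, rel in games]
    return sum(per_game) / len(per_game)


def squared_error(r, games, k):
    """Sum over games of [(R - Rc) - K(S - Sc)]^2 for a trial rating r."""
    return sum(((r - rc) - k * rel) ** 2 for rc, rel in games)


# Hypothetical results for one player.
games = [(110.0, 1), (130.0, -1), (120.0, 0), (125.0, 1), (115.0, 1)]
k = -50.0

r = rating_from_games(games, k)

# The same rating by [4.1]: mean opponent rating plus K times relative performance.
mean_rc = sum(rc for rc, _ in games) / len(games)
relative_performance = sum(rel for _, rel in games) / len(games)
r_direct = mean_rc + k * relative_performance

print(r, r_direct)
for trial in (r - 5.0, r, r + 5.0):
    print(trial, squared_error(trial, games, k))
```

The two ratings coincide, and the sum of squares is smaller at the calculated rating than at either neighboring trial value.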

The foregoing argument would seem to be conclusive in favor of linear ratings, but since Elo's characterization continues to hold sway, some elucidation is called for.  First of all, what is meant by the statement that relative score is predicted by rating difference in individual games?  Imagine a set of linear ratings that have been calculated for a set of results in a competing field.  We shall suppose for the sake of simplicity that all of the results are either wins or losses.  We shall further suppose that the ratings are consistent, that is, that they are all calculated from the same set of ratings.  The latter condition, as we shall see, is not easily achievable in practice, but for now we simply state it as given.

You, as arbiter in this issue, are given only one piece of data from this collection, namely, a result that occurred between players X and Y, perhaps one of many, with X rated R and Y rated Rc.  You are asked to guess whether the result was a win or a loss for player X.  By manipulating [4.1] you can easily determine that the percentage score between player X and players rated the same as Y was some value P from the viewpoint of X.  If P is greater than .5, your best guess would be that X won.  If P is less than .5, your best guess is that X lost.
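
The arbiter's calculation can be sketched as follows.  Since the opposition's score against X is 1 - P, the difference in percentage score is 2P - 1, and [4.3] can be solved for P as .5 + (R - Rc) / 2K.  The function names are hypothetical, and the label returned when P is exactly .5 is an assumption of the sketch, since that case is not decided by the rule just stated.

```python
def implied_score(r, rc, k):
    """Solve R - Rc = K(2P - 1) for P, the score of X against players rated Rc."""
    return 0.5 + (r - rc) / (2.0 * k)


def guess_result(r, rc, k):
    """Best guess at a single result from the viewpoint of X."""
    p = implied_score(r, rc, k)
    if p > 0.5:
        return "win"
    if p < 0.5:
        return "loss"
    return "undecided"  # assumption: P = .5 favors neither guess


# Hypothetical ratings, with K = -50 as in the Ingo System, where the
# lower rating belongs to the stronger performer.
print(implied_score(100.0, 120.0, -50.0), guess_result(100.0, 120.0, -50.0))
```

Note that the sign of K matters:  with a negative constant, the player with the lower rating is the one guessed to have won.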

The prediction you make is a statistical one.  It assumes nothing about the playing strength of the players in the competing field.  For all you know, they could be a bunch of machines programmed to make random moves.  Now suppose that you are asked to predict a further result between X and Y beyond the data given.  If the players really are machines like the ones described, you might as well toss a coin.  It is conceivable, on the other hand, that by continuing the competition indefinitely the percentage scores would tend toward fixed values as limits.  The original scores would then represent estimates of their limiting values or probabilities, and the competing field would qualify as a collective in the sense used by Richard von Mises in his theory of probability [V].  Your further prediction would then be plausible.

Linear rating systems, to conclude, involve predictions in several senses of the term, and the probability issues that are raised are by no means trivial.  The persistence of a popular rating system, however flawed, is therefore not to be underestimated.