12. A Test

 

 

The testing of rating systems suffers the same illusion that bedevils rating theory in general, namely, that ratings are literally measurements of playing strength. We know by now that ratings are statistics, sometimes metaphorically called measurements, but actually quite distinct from the physical sort.  Imagine an experiment that set out to compare the efficiency of the arithmetic mean with that of the statistical median.  The results would be not so much incorrect as misplaced.  Statistics embody in their definitions their own criteria of success and put on the data they purport to measure their own stamp of accuracy.  

What, then, are we to make of Elo's tests of his rating system?  The tests are a recurrent theme in his work, appearing for example in sections 2.6 and 5.55 of his main treatise [E1].  Assuming first of all that such tests are to be taken seriously, a primary consideration is the data on which the tests are based.  Historical data, such as the Hoogovens International Tournament of Elo's treatment, lends an air of reality to a test but, regarded as an experiment, has the drawback of not being repeatable.  A computer, on the other hand, allows an inexhaustible supply of data to be generated, as well as experimental controls that would otherwise be impossible.  The tendency in previous studies has been to grant the assumptions Elo made about the data, e.g. that it is normally distributed or that it demonstrates transitivity, but this approach verges on parody.  Ideally, the data should be neutral with respect to probability models.  

Data used for the present test consists of a crosstable of result probabilities for 800 imaginary players, ranked in order of playing strength.  For each ordered pairing, AB, where A is the stronger player, the probability of A defeating B is  

                        P(A, B)  =  1 – P(B, A) .  

This probability is determined partly by a strength component and partly by a random component.  If the strength component is s, the random component is determined as a random portion of 1 – s, which is generated in typical computer code as  

                        rand() * (1 – s) .

Consequently,

                         P(A, B)  =  s  +  rand() * (1 – s) ,

and P(B, A) is determined from this result.  For example, with s = .5, P(A, B) would have a value in the range from .5 to 1, with .75 as its expected value. 
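The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_crosstable`, the convention that index 0 is the strongest player, and the seed are all hypothetical, and Python's `random.random()` stands in for the generic `rand()` of the text.

```python
import random

def make_crosstable(n, s, rng):
    """Return an n x n table p, where p[a][b] is the probability that a defeats b.

    Players are indexed by strength rank (assumption: index 0 is the strongest).
    The diagonal is left at 0.0 and is never used.
    """
    p = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            # a is the stronger player: strength component s plus
            # a random portion of the remainder 1 - s.
            p[a][b] = s + rng.random() * (1.0 - s)
            # The weaker player's probability follows from P(A, B) = 1 - P(B, A).
            p[b][a] = 1.0 - p[a][b]
    return p

rng = random.Random(12)  # arbitrary seed, used only to make the run repeatable
table = make_crosstable(800, 0.5, rng)

# Average probability that the stronger player wins, over all pairings.
upper = [table[a][b] for a in range(800) for b in range(a + 1, 800)]
mean_stronger = sum(upper) / len(upper)
```

Since the random portion averages half of 1 – s, `mean_stronger` should come out close to s + (1 – s)/2, which is .75 for s = .5.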

A crosstable generated with s = .5 was used to generate in turn a sequence of sixteen-player round-robin tournaments.  The contestants for each tournament were selected at random from the field of 800 without replacement, which is equivalent to partitioning the shuffled field into 50 tournaments.  Statistics were generated once every player had participated in one of the 50 tournaments, after which the field was reshuffled.  This cycle was run 25 times in all, so that each player participated in a total of 375 games (15 games in each of 25 tournaments).  Outcomes for individual pairings were generated randomly, but with probabilities determined by the crosstable.  Ratings were initialized to an arbitrary value of 2000, and the delta or "established" form of the rating formulas was used.  The experiment was then repeated with a crosstable generated with s = .3.
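One cycle of this procedure might be sketched as follows. The sketch is not the code used for the test: the helper names are hypothetical, and the rating update shown is only one plausible reading of the delta form, namely new rating = old rating + K × (actual score – expected score), with a logistic expectancy curve on a 400-point scale and K = 32. Neither the curve nor the value of K is given in the text; both are assumptions.

```python
import random

def make_crosstable(n, s, rng):
    # Compact version of the crosstable described earlier (index 0 = strongest).
    p = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            p[a][b] = s + rng.random() * (1.0 - s)
            p[b][a] = 1.0 - p[a][b]
    return p

def expected_score(r_a, r_b):
    # ASSUMPTION: logistic expectancy on a 400-point scale; not stated in the text.
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def play_cycle(table, ratings, rng, size=16, k=32.0):
    """Run one cycle: each player plays exactly one round-robin of `size` players.

    Updates `ratings` in place and returns {player: actual tournament score}.
    """
    field = list(range(len(table)))
    rng.shuffle(field)  # partitioning the shuffled field = sampling without replacement
    scores = {}
    for start in range(0, len(field), size):
        group = field[start:start + size]
        actual = {p: 0.0 for p in group}
        expected = {p: 0.0 for p in group}
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                # Outcome is random, with the probability taken from the crosstable.
                if rng.random() < table[a][b]:
                    actual[a] += 1.0
                else:
                    actual[b] += 1.0
                e = expected_score(ratings[a], ratings[b])
                expected[a] += e
                expected[b] += 1.0 - e
        # Delta-form update from pre-tournament ratings (K = 32 is an assumption).
        for p in group:
            ratings[p] += k * (actual[p] - expected[p])
        scores.update(actual)
    return scores

rng = random.Random(25)           # arbitrary seed
table = make_crosstable(800, 0.5, rng)
ratings = [2000.0] * 800          # arbitrary starting value, as in the text
for _ in range(25):               # 25 cycles of 15 games = 375 games per player
    scores = play_cycle(table, ratings, rng)
```

Because the expected scores within a tournament sum to the number of games played, this update leaves the total of all ratings unchanged, which makes a convenient sanity check on any implementation.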

The tests were analyzed with their own set of statistics, calculated after each of the 25 cycles.  The first of these is the standard deviation of the tournament scores generated from the crosstable about the scores predicted by each of the tested systems from the ratings calculated up to that point.  This error statistic is in the form of an absolute score difference ranging from 0 to 15 points.  A second statistic uses Spearman's rank-difference correlation coefficient to compare the ranking of players by their calculated ratings after each cycle with their predefined ranking.  This coefficient ranges from +1 for a perfect relationship to -1 for a perfect inverse relationship.  Figure 1 shows the error statistic, and Figure 2 the correlation coefficient, for s = .5.  Figure 3 and Figure 4 show the respective statistics for s = .3.
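Both statistics are simple to compute. The following Python sketch gives one reading of them: the error statistic is taken to be the root-mean-square difference between actual and predicted tournament scores, and the correlation uses the rank-difference formula 1 – 6Σd²/(n(n² – 1)). The function names are hypothetical, and the convention that index 0 is the strongest player is an assumption carried over for illustration.

```python
import math

def rms_score_error(actual, predicted):
    """Root-mean-square difference between actual and predicted tournament scores."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def spearman_rank_difference(ratings):
    """Spearman's coefficient between rating order and predefined order.

    ratings[i] is the rating of the player whose predefined rank is i
    (assumption: index 0 is the strongest player).
    """
    n = len(ratings)
    # Players sorted by rating, highest first; tied ratings keep index order.
    order = sorted(range(n), key=lambda i: -ratings[i])
    rank_by_rating = [0] * n
    for rank, player in enumerate(order):
        rank_by_rating[player] = rank
    # Sum of squared differences between the two rankings.
    d2 = sum((rank_by_rating[i] - i) ** 2 for i in range(n))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

perfect = spearman_rank_difference([2400.0, 2200.0, 2000.0, 1800.0])
inverse = spearman_rank_difference([1800.0, 2000.0, 2200.0, 2400.0])
error = rms_score_error([10.0, 5.0], [7.0, 9.0])
```

Note that breaking ties by index order favors the predefined ranking, so the coefficient is flattered when many ratings coincide, as they do at the start when every rating is 2000. Assigning average ranks to tied players would be the usual remedy.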

The only certain conclusion to be drawn from these tests is that simple theoretical concepts in a random environment yield complicated results.  Comparisons of the systems, although each was rating precisely the same outcomes, are less clear.  Perhaps the most telling observation to be made is the effect of reducing the strength component from .5 to .3.  With s = .5, we see the Elo System performing slightly better than the other systems tested.  With s = .3, it is performing slightly worse.  It may be argued that a strength component of .3 produces data that is too chaotic to be meaningful.  It is true that the only structure in the resulting data lies in the fact that the stronger player outperforms the weaker player with a probability, on average, of .65 (that is, .3 plus an expected random portion of .35).  This means, of course, that the average probability of an upset is .35.  The counterargument is to reflect on the very purpose of a rating system, which is to render meaning from data that is not always consistent and that may, in fact, be quite chaotic.