12. A Test
The testing of rating systems suffers from the same
illusion that bedevils rating theory in general, namely, that ratings
are literally measurements of playing strength. We know by now that
ratings are statistics, sometimes metaphorically called measurements,
but actually quite distinct from the physical sort.
Imagine an experiment that set out to compare the efficiency of
the arithmetic mean with that of the statistical median.
The results would be not so much incorrect as misplaced.
Statistics embody in their definitions their own criteria of
success and put on the data they purport to measure their own stamp of
accuracy. What, then, are we to make of Elo's tests of his
rating system? The tests are
a recurrent theme in his work, appearing for example in sections 2.6 and
5.55 of his main treatise [E1].
Assuming first of all that such tests are to be taken seriously,
a primary consideration is the data on which the tests are based.
Historical data, such as the results of the Hoogoven International Tournament in
Elo's treatment, lends an air of reality to a test but has the drawback,
when the test is regarded as an experiment, of not being repeatable.
A computer, on the other hand, allows an inexhaustible supply of
data to be generated, as well as experimental controls that would
otherwise be impossible. The
tendency in previous studies has been to grant assumptions about the
data made by Elo, e.g. that it is normally distributed or that it
demonstrates transitivity, but this approach verges on parody.
The data ideally should be neutral with respect to probability
models. Data used for the present test consists of a
crosstable of result probabilities for 800 imaginary players, ranked in
order of playing strength. For
each ordered pairing, AB, where A is the stronger player, the
probability of A defeating B is
P(A, B) = 1 – P(B, A). This probability is determined partly by a strength
component and partly by a random component.
If the strength component is s,
the random component is determined as a random portion of 1 – s, which is generated in typical computer code as rand( ) * (1 – s).

The only certain conclusion to be drawn from these tests is that simple theoretical concepts in a random environment yield complicated results. Comparisons of the systems, although each was rating precisely the same outcomes, are not as clear. Perhaps the most telling observation is the effect of reducing the strength component from .5 to .3. With s = .5, the Elo System performs slightly better than the other systems tested. With s = .3, it performs slightly worse. It may be argued that a strength component of .3 produces data that is too chaotic to be meaningful. It is true that the only structure in the resulting data lies in the fact that the stronger player outperforms the weaker player with a probability, on average, of .65. This means, of course, that the average probability of an upset is .35. The counterargument is to reflect on the very purpose of a rating system, which is to render meaning from data that is not always consistent and that may, in fact, be quite chaotic.
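
The construction of the result-probability crosstable described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for the test: the text says the probability is determined "partly" by each component, and the sketch assumes the two are simply added, which agrees with the quoted average of .65 for s = .3. The function names, the seed parameter, and the use of a lower index for a stronger player are all hypothetical choices.

```python
import random


def make_crosstable(n_players, s, seed=0):
    """Build a table p where p[a][b] is the probability that a defeats b.

    Players are indexed by rank; a lower index means a stronger player.
    ASSUMPTION: the strength component s and the random component are
    summed, so P(A, B) = s + rand() * (1 - s) for the stronger player A.
    """
    rng = random.Random(seed)
    p = [[None] * n_players for _ in range(n_players)]
    for a in range(n_players):
        for b in range(a + 1, n_players):
            # Strength component plus a random portion of the remainder 1 - s.
            p_ab = s + rng.random() * (1 - s)
            p[a][b] = p_ab
            # The weaker player's probability is the complement.
            p[b][a] = 1 - p_ab
    return p


def mean_favorite_probability(p):
    """Average probability that the stronger player of a pairing wins."""
    n = len(p)
    values = [p[a][b] for a in range(n) for b in range(a + 1, n)]
    return sum(values) / len(values)
```

Under this assumption the random portion averages half of 1 – s, so the stronger player's mean probability is (1 + s) / 2: about .65 for s = .3 and about .75 for s = .5. Calling `mean_favorite_probability(make_crosstable(800, 0.3))` should return a value close to the first figure.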