15. Tests

Tests of a rating system, such as those offered by Elo in his main work [E1, 2.6], tend not to be worth the paper they are written on, mainly because a rating is a statistic.  The predictions such tests examine are self-fulfilling.  By analogy, a test of arithmetic averaging to determine whether it yields central values would be pointless.  Ratings do not predict changes in playing strength.  Rather, they assume that the playing strength exhibited by past results will be exhibited in future results.  Their predictions are an extrapolation of demonstrated playing strength.  If the ratings tend toward a long-term limit, their predictions will by and large hold true.

The typical rating test, including those employed by this author, uses sequential calculations to measure convergence toward an assumed playing strength.  It must be said that such a demonstration is more a measure of cumulative averaging than of rating validity.  For rating systems in general the convergence is slow, and differences from one rating system to another are of doubtful significance.  A more meaningful test would employ simultaneous calculations, and rating systems that can be adapted to this process, such as linear systems and the Berkin System, have a distinct advantage.  In such a test there is no need to postulate "true" rating strengths.  The measure in this case is how well percentage scores match rating differences or ratios, and Gauss' principle assures us that the match cannot be improved.
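
As an illustration of the simultaneous approach, the following Python sketch fits all ratings at once under a linear model and then measures how well the percentage scores match the rating differences. The expectancy formula, the scale constant SPAN, the 1500 anchor, the sample results, and the function names are all assumptions made for this example; they are not taken from any of the systems discussed here. Each sweep resets a player's rating to the game-weighted average of (opponent rating + SPAN * (score - 0.5)), which is the least-squares condition for this particular linear model.

```python
SPAN = 800.0  # hypothetical scale: a 400-point edge maps to a 100% expected score


def expected(r_i, r_j):
    """Assumed linear expectancy: score fraction from the rating difference."""
    return 0.5 + (r_i - r_j) / SPAN


def fit_simultaneous(results, players, mean=1500.0, sweeps=500):
    """Fit all ratings together by least squares (illustrative sketch).

    results: list of (player_i, player_j, games, score_fraction_of_i).
    """
    ratings = {p: mean for p in players}
    for _ in range(sweeps):
        for p in players:
            num = den = 0.0
            for i, j, games, score in results:
                if i == p:
                    num += games * (ratings[j] + SPAN * (score - 0.5))
                    den += games
                elif j == p:
                    # From j's side the score fraction is 1 - score.
                    num += games * (ratings[i] + SPAN * (0.5 - score))
                    den += games
            if den:
                ratings[p] = num / den
        # Only differences are determined, so re-anchor the average rating.
        shift = mean - sum(ratings.values()) / len(ratings)
        for p in ratings:
            ratings[p] += shift
    return ratings


def rms_mismatch(results, ratings):
    """Game-weighted RMS gap between actual and expected score fractions."""
    total = sum(g * (s - expected(ratings[i], ratings[j])) ** 2
                for i, j, g, s in results)
    games = sum(g for _, _, g, _ in results)
    return (total / games) ** 0.5


# Made-up results, chosen to be exactly consistent with the linear model.
results = [("A", "B", 10, 0.625), ("A", "C", 10, 0.75), ("B", "C", 10, 0.625)]
ratings = fit_simultaneous(results, ["A", "B", "C"])
mismatch = rms_mismatch(results, ratings)
```

Because the sample scores were constructed to be consistent, the fit should settle on differences of 100 points between neighbors with a mismatch of essentially zero. With real results a residual mismatch would remain, and that residual, not closeness to any postulated "true" rating, is what such a test would report.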

A rating test that uses sequential ratings actually measures progress toward consistency.  Consistent ratings are predictive in a statistical sense, that is, one can predict from rating relationships what results would have occurred, even though many may already be known.  The rate at which sequential ratings tend toward consistency is a measure of their efficiency in one respect, namely, how well they accommodate the averaging process, but it is not a reliable measure of their overall efficiency.  An averaging process designed to reveal long-term limits may not be adequate to the task of detecting changes in rating strength, and sequential ratings are especially vulnerable to such changes.  Anti-deflationary measures, such as Elo's "feedback," may help by boosting the efficiency of the averaging process, but they are at best stopgap measures.
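
The progress toward consistency described above can be sketched with a game-by-game update. The example below uses the familiar logistic Elo expectancy; the coefficient K, the starting ratings, and the function names are assumptions for illustration. As a further simplification, every game is credited with the same fractional score (the assumed long-run percentage), so the run is deterministic and shows only the averaging behavior, not the scatter of real results.

```python
import math

K = 16.0  # hypothetical update coefficient


def elo_expected(diff):
    """Logistic Elo expectancy for a rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def sequential_mismatch(score, games, r_a=1500.0, r_b=1500.0):
    """Rate two players game by game.

    Returns the final ratings and the mismatch |score - expected|
    recorded before each update.
    """
    history = []
    for _ in range(games):
        gap = score - elo_expected(r_a - r_b)
        history.append(abs(gap))
        r_a += K * gap  # the winner's gain ...
        r_b -= K * gap  # ... is the loser's loss
    return r_a, r_b, history


# Player A is assumed to score 75% against B in every game.
r_a, r_b, history = sequential_mismatch(0.75, 400)
consistent_diff = 400.0 * math.log10(3.0)  # difference at which expectancy is 0.75
```

In this sketch the mismatch shrinks with every game but does so gradually, and the rating difference only approaches the consistent value after a long series. If the assumed percentage were to change partway through, the same slow averaging would have to work off the old difference before the new one could appear.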