Here's a summary of how well my predictions worked out this year. There's a section by week and then overall. Within each section there is a table that gives an idea of how well-calibrated the results are.

The entry under "MAD" is the average absolute difference between the prediction and the final result. If one were to just flip a coin for each team, this would be .5, and if you were to just guess that every team had a 50% chance to make it in it would also be around .5. The entry under "rms" is the root mean squared error; its square is the Brier score.
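Concretely, with each prediction a probability in [0, 1] and each outcome 1 (made it) or 0 (didn't), the two metrics can be computed like this. This is a sketch of the definitions, not the script that produced the tables below:

```python
def mad(predictions, outcomes):
    """Mean absolute difference between predictions and actual results."""
    return sum(abs(p - o) for p, o in zip(predictions, outcomes)) / len(predictions)

def rms(predictions, outcomes):
    """Root mean squared error; squaring this gives the Brier score."""
    return (sum((p - o) ** 2 for p, o in zip(predictions, outcomes))
            / len(predictions)) ** 0.5

# Coin-flip baseline: predicting 0.5 for everything scores 0.5 on both.
print(mad([0.5, 0.5], [1, 0]))  # 0.5
print(rms([0.5, 0.5], [1, 0]))  # 0.5
```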

```
mar10
Predicted   Observed  Samples
0<x<10      0.0          5
10<=x<20    0.09        34
20<=x<30    0.28        18
30<=x<40    0.34        53
40<=x<50    0.71         7
50<=x<60    0.83         6
60<=x<70    0.64        11
70<=x<80    1.0          4
80<=x<90    1.0          2
90<=x<100   1.0          6
x=100       1.0          9
mad: 0.336321935484
rms: 0.400023810824

mar17
Predicted   Observed  Samples
x=0         0.01       234
0<x<10      0.01       205
10<=x<20    0.06       262
20<=x<30    0.24       291
30<=x<40    0.34       147
40<=x<50    0.53       160
50<=x<60    0.65        98
60<=x<70    0.84        55
70<=x<80    0.92        50
80<=x<90    0.97        32
90<=x<100   0.96        47
x=100       1.0        248
mad: 0.22786440678
rms: 0.323887001132

mar24
Predicted   Observed  Samples
x=0         0.0        484
0<x<10      0.03       186
10<=x<20    0.08       222
20<=x<30    0.25       170
30<=x<40    0.34        93
40<=x<50    0.51        73
50<=x<60    0.8         54
60<=x<70    0.79        43
70<=x<80    0.98        43
80<=x<90    0.91        22
90<=x<100   0.95        43
x=100       0.98       396
mad: 0.15088354292
rms: 0.267576994126

mar31
Predicted   Observed  Samples
x=0         0.01       848
0<x<10      0.08       111
10<=x<20    0.06        88
20<=x<30    0.4         58
30<=x<40    0.38        29
40<=x<50    0.4         30
50<=x<60    0.76        21
60<=x<70    0.73        11
70<=x<80    0.85        13
80<=x<90    0.8          5
90<=x<100   0.92        24
x=100       0.97       591
mad: 0.0734882449426
rms: 0.210646096619

Overall:
Predicted   Observed  Samples
x=0         0.01      1566
0<x<10      0.04       507
10<=x<20    0.07       606
20<=x<30    0.26       537
30<=x<40    0.34       322
40<=x<50    0.51       270
50<=x<60    0.72       179
60<=x<70    0.79       120
70<=x<80    0.94       110
80<=x<90    0.93        61
90<=x<100   0.95       120
x=100       0.98      1244
mad: 0.155843654732
rms: 0.275676432301
```
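For what it's worth, a table like the ones above can be built by bucketing each prediction into a 10%-wide bin (with exact 0% and 100% predictions kept separate) and reporting the fraction of outcomes in each bin that came true. A hypothetical sketch, not the actual code behind these numbers:

```python
def calibration_table(predictions, outcomes):
    """Map each bin label to (observed frequency, sample count)."""
    bins = {}  # label -> (hits, total)
    for p, o in zip(predictions, outcomes):
        if p == 0.0:
            label = "x=0"
        elif p == 1.0:
            label = "x=100"
        else:
            lo = int(p * 10) * 10
            label = "0<x<10" if lo == 0 else "%d<=x<%d" % (lo, lo + 10)
        hits, total = bins.get(label, (0, 0))
        bins[label] = (hits + o, total + 1)
    return {label: (hits / total, total)
            for label, (hits, total) in bins.items()}
```

A well-calibrated predictor is one where each bin's observed frequency lands near the middle of the bin's range.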

Overall, the results are way ahead of random, but I'd like to have a better baseline to compare against.

There's definitely room for improvement here. It should be possible to get the 0% and 100% bins to be right all of the time, and the calibration shows an underestimation of teams that are doing well and an overestimation of teams that are doing poorly. Part of this could be fixed by a model of team skill, or by just adding some calibration tables.
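The calibration-table fix could be as simple as remapping each raw prediction to the observed frequency of its bin. A hypothetical sketch using the "Overall" numbers above as the lookup table:

```python
# Observed frequency per bin, keyed by the bin's lower edge,
# copied from the Overall table above.
OBSERVED = {
    0: 0.04, 10: 0.07, 20: 0.26, 30: 0.34, 40: 0.51,
    50: 0.72, 60: 0.79, 70: 0.94, 80: 0.93, 90: 0.95,
}

def recalibrate(p):
    """Replace a raw prediction with its bin's historical frequency."""
    if p == 0.0 or p == 1.0:
        return p  # the endpoint bins were already nearly right
    return OBSERVED[int(p * 10) * 10]

print(recalibrate(0.55))  # 0.72: a raw 55% was historically closer to 72%
```

The obvious caveat is that this corrects next year's predictions using this year's misses, so it only helps if the biases are stable from year to year.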