This paper provides a comparison of common statistical prediction methods in order to determine which have the most predictive power. To see each model’s predictions for every match during the period 2008-2016, as well as each team’s rating before each of those matches, go to the corresponding workbook. The “Data Summary and Methodology” workbook contains details on each model, a FAQ, a summary of each model’s predictive capabilities, and a side-by-side comparison of the models for the year 2016.
I am continuing on my journey of building a predictive model for the 2017 season. Here, I compared a bunch of different predictive methods to determine where my efforts will be best spent. The extremely short version is that, in order of most predictive to least predictive, we have:
Calculated Contribution to Score (OPR)
WM Elo
Average Score
Calculated Contribution to WM (CCWM)
Average WM
Calculated Contribution to Win
Adjusted Winning Record
I was surprised by how predictive average score was, and more generally by how similar the “average” methods were to the “calculated contribution” methods. Moving forward, I am planning to continue developing the WM Elo and Calculated Contribution to Score methods and to take some kind of weighted average of the two.
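For anyone who wants to follow along, here is a minimal sketch of the calculated-contribution (OPR) idea using made-up match data; the team numbers, scores, and the bare least-squares solve are purely for illustration and are not the code behind the workbooks.

```python
import numpy as np

# Toy data: 4 teams, 4 alliance scores (2-team alliances to keep it small).
# These numbers are made up purely for illustration.
teams = [1114, 2056, 254, 1678]
alliances = [([1114, 2056], 120),   # (teams on the alliance, alliance score)
             ([254, 1678],   95),
             ([1114, 254],  110),
             ([2056, 1678], 105)]

# One row per alliance: a 1 for each team on it, and that alliance's score.
A = np.zeros((len(alliances), len(teams)))
b = np.zeros(len(alliances))
for row, (alliance, score) in enumerate(alliances):
    for t in alliance:
        A[row, teams.index(t)] = 1.0
    b[row] = score

# The calculated contribution (OPR) is the least-squares solution of A x = b:
# each team's estimated contribution to its alliance's score.
opr, *_ = np.linalg.lstsq(A, b, rcond=None)
print(dict(zip(teams, np.round(opr, 1))))

# To predict a match, compare the summed contributions of the two alliances.
```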
I have a couple of questions after looking at it more deeply.
Isn’t the best linear combination kind of cheating? It seems like you’re picking the weights after already knowing the outcomes.
Do you have any high-level takeaways for what you think will contribute to a better predictive model moving forward? (Assuming the 2017 game isn’t super weird)
Essentially, yes, it is cheating. The linear combination wasn’t a particularly important part of this effort; I kind of tacked it on at the end. My interest in the linear combination is twofold:
I wanted to know how much “better” the models could get by adding two or more of them together. The result was that the combination has reasonable, but not extremely high, value, so it is worth more investigation.
I wanted to know how “different” each model’s predictions were from the others’. If two models both have reasonable predictive power and their correct predictions are uncorrelated, a combination of the two will provide much better predictions than either could provide individually. It turned out that the good models all made pretty similar predictions.
But yes, it is cheating in that I knew the results as I was finding the best linear combination. In comparison, almost everything else I did was tuned using the period 2012-2014, so the 2016 predictions are actually true predictions for the other models.
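To make that concrete, here is a sketch of the kind of check I mean, using hypothetical per-match probabilities; the numbers, the single mixing weight, and the grid search are all just for illustration, not the workbook’s actual procedure.

```python
import numpy as np

def brier(probs, outcomes):
    # Mean squared error of the predicted red-win probabilities; lower is better.
    return np.mean((probs - outcomes) ** 2)

# Hypothetical per-match red-win probabilities from two models, plus the
# actual outcomes (1 = red win, 0 = blue win). Real data comes from the workbooks.
opr_probs = np.array([0.70, 0.55, 0.20, 0.90, 0.40])
elo_probs = np.array([0.65, 0.60, 0.35, 0.80, 0.45])
outcomes  = np.array([1.0,  1.0,  0.0,  1.0,  0.0])

# The "cheating" part: the mixing weight is chosen on the same matches it is scored on.
weights = np.linspace(0.0, 1.0, 101)
scores = [brier(w * opr_probs + (1 - w) * elo_probs, outcomes) for w in weights]
best = int(np.argmin(scores))
print(f"best weight on first model: {weights[best]:.2f}, Brier score: {scores[best]:.4f}")

# Correlation between the two models' predictions: if this is high, blending
# them cannot help much more than either model alone.
print(np.corrcoef(opr_probs, elo_probs)[0, 1])
```

Because the weight is both fit and scored on the same matches, the blended Brier score comes out optimistic, which is exactly the sense in which it is cheating.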
Do you have any high-level takeaways for what you think will contribute to a better predictive model moving forward? (Assuming the 2017 game isn’t super weird)
I will be looking more at building a better model over the next 2 weeks, and I will gladly share my insights then. One big takeaway that I didn’t put in here was how “predictive” the calculated contribution to score model was when it knew teams’ contributions from the end of that event. That model had a Brier score of .1232, but it was clearly cheating because it knew the results of matches when it was making predictions. However, that value is still important: I think it is a good approximation of an upper bound on the predictive power of any model, which, since lower Brier scores are better, means a lower bound of roughly .1232 on the Brier score an honest model can achieve. Alternatively, this value could be used to approximate the inherent uncertainty in all FRC matches.
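For anyone who hasn’t run into Brier scores, they are just the mean squared error of the predicted probabilities against the 0/1 outcomes; the snippet below is a self-contained definition (the function name and the red-win convention are mine, not from the workbooks).

```python
import numpy as np

def brier_score(predicted_probs, outcomes):
    """Brier score: mean squared error between predicted red-win probabilities
    and what actually happened (1 = red win, 0 = blue win).
    0 is perfect and always predicting 50% earns 0.25, so LOWER is better."""
    p = np.asarray(predicted_probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - o) ** 2))
```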
My biggest takeaway, though, was best summarized by Richard Wallace above.
If you haven’t read the following thread in a while, you might want to revisit it and see if any of wgardner’s suggestions might be worth comparing to your other methods:
I have added an Ether Power Rating (EPR) model. I also added two more answers to the FAQ. I did not update the 2016 linear combination to include EPR.
This EPR model has close to the same predictive power as the best previous models. It is generally a bit better than Elo, and a bit worse than calculated contribution to total points.
I have added a Winning Margin Power Rating model. The results are pretty bad; it is the worst of the models by a significant margin. I think that using only as many equations as there are matches causes this model to be far too overfit to have good predictive power.
I didn’t include 2008-2010 because this model was unable to predict anything for the 2010 Israel Regional, where too few matches were played.
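To make the overfitting point concrete, here is a minimal sketch of the setup as I understand it: one equation per match, with the red alliance’s rating sum minus the blue alliance’s rating sum set equal to the winning margin. The team numbers and margins are made up, and the least-squares solve is illustrative rather than the workbook’s actual code.

```python
import numpy as np

teams = [1114, 2056, 254, 1678, 118, 148]
# (red alliance, blue alliance, red score - blue score) for each match played so far.
matches = [([1114, 2056, 254], [1678, 118, 148],  25),
           ([1114, 1678, 118], [2056, 254, 148], -10)]

# One equation per match: sum(red ratings) - sum(blue ratings) = winning margin.
A = np.zeros((len(matches), len(teams)))
b = np.zeros(len(matches))
for row, (red, blue, margin) in enumerate(matches):
    for t in red:
        A[row, teams.index(t)] = 1.0
    for t in blue:
        A[row, teams.index(t)] = -1.0
    b[row] = margin

# Two equations, six unknowns: the system is badly underdetermined early in an
# event, so the fitted ratings chase whatever noise is in the few margins seen so far.
ratings, *_ = np.linalg.lstsq(A, b, rcond=None)
print(dict(zip(teams, np.round(ratings, 1))))
```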
I also added a correlation coefficient table to the “Data Summary and Methodology” book.
Something is going on that I don’t understand. How are all of the prediction methods able to give predictions for matches in which not all of the teams have competed yet? I believe WM Elo carries over information from previous years, but I don’t get how the other predictors are able to give any prediction other than 50%.