This is a solid summary; I’ll try to remember to link to it in the future.
I’m planning to play around more with OPRm sometime soon, and I’ll post the results as I get them. I don’t actually know off-hand whether it has more or less predictive power than ixOPR, but if it has more, that’s what I’ll probably use moving forward.
I think OPRm is probably similar in predictive power, but has a bit more “tunability” than xOPR/ixOPR, which may make it easier to eke out a few extra percent in performance.
A few points:
The weight that xOPR/ixOPR give to the prior OPR estimate depends on how many matches each team plays in an event, which seems odd and arbitrary to me. If you’re playing in a 12-match event, after your first match xOPR/ixOPR add in 11 fake matches, while if you’re playing in a 10-match event, after your first match they add in 9 fake matches. IMHO, after 1 match played your “new” OPR should be the same regardless of how many matches you have left to play in the event. You could instead add a single fake match with a weighting factor to bias the relative importance of the previous estimate and the new match results, and then you basically just end up doing OPRm.
xOPR/ixOPR depend on the particular match schedule for unplayed matches. This also seems odd and arbitrary. After 1 match, the xOPR/ixOPR results will be different depending on what the unplayed remaining match schedule is. Change that unplayed schedule and you’ll get different values for xOPR/ixOPR. OPRm doesn’t have this issue.
Regressing to the mean definitely helps predictive ability (at least it did when I studied it in my paper from a few years ago). OPRm does this naturally. You could do this with xOPR/ixOPR too by adding a fake match with match scores pulled towards the mean, or just by doing that a bit for every fake match you currently use: i.e., instead of making the fake match score exactly the sum of the OPR estimates, make it x*(overall mean score) + (1-x)*(sum of the OPR estimates for the specific teams). There’s a small sketch of this after the list.
At the end of an event, xOPR/ixOPR are 100% based on the event results. I think Caleb has found that better predictive results can be gained by weighting multiple events, at least in the Elo space. OPRm allows a weighted blending of results from previous estimates and from the current event. This could be done with xOPR/ixOPR by still having 1 or more fake matches added in even after all matches have been played.
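To make that regression-to-the-mean idea a bit more concrete, here’s a minimal sketch in Python (the blend factor x and all of the numbers are made up for illustration, not pulled from any actual implementation):

```python
def regressed_fake_score(opr_estimates, overall_mean_score, x=0.3):
    """Fake-match score pulled toward the overall mean.

    opr_estimates: prior OPR estimates for the teams on one alliance.
    overall_mean_score: mean alliance score over whatever data set you like.
    x: blend factor (0 = pure sum of OPRs, 1 = pure mean); the best value
       would have to be tuned empirically.
    """
    return x * overall_mean_score + (1 - x) * sum(opr_estimates)


# e.g. prior OPRs of 20, 15, and 10 with an overall mean score of 40:
# 0.3 * 40 + 0.7 * (20 + 15 + 10) = 43.5
print(regressed_fake_score([20.0, 15.0, 10.0], 40.0, x=0.3))
```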
First, I see that your prediction percentage averages around 70%. Is this number what you expected? How does this compare to other Elo-rated sports?
Secondly, I’ve read about using Elo ratings for pro sports, where they use modifiers like home team, starting pitchers, and weather. Other than your veteran/rookie modifiers, what other modifiers do you think you can use to improve the prediction percentage? Matches played? Matches since last match? Difference between best and worst team in an alliance?
If you weight contributions to the alliance Elo in a non-constant way, can you gain predictive power? My understanding is that right now, you are taking the average.
Many FRC games have diminishing opportunities for scoring beyond two robots (Destination: Deep Space is one of them). As such, it would seem like the two highest-Elo robots on an alliance matter much more than the contribution of the third.
My understanding is probably oversimplified, but doesn’t Caleb call his metric Winning Margin Elo? Seems to imply that preventing opponents from scoring weighs equally with scoring for your own alliance.
Is 70% what I expected? I don’t really recall anymore, to be honest. I built the first iteration of my system a few years ago and have only nudged up the performance by 2ish percent since then. I don’t know exactly what number I expected; probably around there, since I expected it to be reasonably close to OPR, which was already known to be in the 70% vicinity.
That sounds like a good question for someone else to look up.
My bs guesses would be something like:
According to the 538 forecasts, here are the Brier scores for those 3 (I’m assuming their reliability is very close to 0 since they don’t state it and their charts look well calibrated):
Remember that lower is better with Brier scores. Brier scores from my Elo model are:
So every FRC game from the past few years has been more predictable than these pro sports (or I’m just way smarter than Nate Silver, take your pick).
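For anyone who hasn’t seen the metric before, here’s a rough sketch of how a Brier score can be computed from predicted win probabilities and actual outcomes (the numbers below are hypothetical, not taken from my model):

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared error between predicted win probability and result.

    predicted_probs: predicted probability that the red alliance wins each match.
    outcomes: 1 if red actually won that match, 0 otherwise.
    Lower is better; always guessing 50% gives exactly 0.25.
    """
    n = len(predicted_probs)
    return sum((p - o) ** 2 for p, o in zip(predicted_probs, outcomes)) / n


# hypothetical example: three matches
print(brier_score([0.7, 0.6, 0.9], [1, 0, 1]))  # ~0.153
```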
I can’t really think of any others that I haven’t already attempted. I have tried modifiers for all 3 of the options you mention and none of them provide a large enough consistent improvement in predictive power for me to consider them worth adding. My future work will probably be more in the direction of finding year-specific attributes that can improve predictive power instead of features that improve performance in all years.
Yes, you can definitely improve predictive power by giving different weights to the Elos of the strongest, middlest, and weakest teams. The problem is, the best “weights” to use vary drastically year to year. In 2016, for instance, I could improve performance by multiplying the highest team’s Elo by 1.5 and multiplying the middle and weakest teams’ ratings by 1. In 2017 though, this same structure caused a decrease in predictive power, and the best weights were 1, 1, 1. In 2018, I could very drastically improve predictive power by using the weights 1.5, 1, 0.5 (as in, the weakest team is only weighted a third as much as the strongest, and only half as much as the middle team). I haven’t tried for 2019, but my rough guess would be that the optimal weights would be around 1.5, 1, 1.
I’ve tried some other ways of mixing the Elos other than just taking their raw sum, but I haven’t found any alternative arrangement that improves performance in all years. As such, I opt for simplicity and take raw sums. At some point I’ll incorporate year-specific features, and that should really help predictive power.
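To illustrate what that kind of weighting looks like in practice, here’s a small sketch using the 2018-style 1.5/1/0.5 weights mentioned above (everything else here is just an illustration, not code from my actual model):

```python
def weighted_alliance_rating(elos, weights=(1.5, 1.0, 0.5)):
    """Combine three team Elos, weighting the strongest team the most.

    elos: the three teams' Elo ratings, in any order.
    weights: multipliers for the strongest, middle, and weakest team.
    """
    strongest, middle, weakest = sorted(elos, reverse=True)
    return weights[0] * strongest + weights[1] * middle + weights[2] * weakest


# e.g. an alliance rated 1700, 1550, and 1400:
# 1.5 * 1700 + 1.0 * 1550 + 0.5 * 1400 = 4800.0
print(weighted_alliance_rating([1550, 1700, 1400]))
```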
Correct: any way you improve your winning margin will improve your Elo, offense or defense.
It seems to me that there must be a limit to the predictive power of any type of rating for events that have an “unpredictable random” element (as opposed to rolling loaded dice, which would have a “predictable random” element). Current Elo-based predictions appear to run in the low 70% range, but this community continues to explore ways to tweak the algorithm and squeeze another few percent out of the data. Do the predictions asymptotically approach some maximum value? Would it ever be possible to know if/when a model had reached its ideal state?
I would expect that the “ideal” model for any particular game could be identified retroactively (via AI or some automated iterative approach, etc.), but it would change for each game.
I think that there is an inherent uncertainty to every match, and that the more effective way to think about match prediction and its ceiling isn’t in terms of correctly predicting the outcome, but correctly predicting the uncertainty (probability). Similarly, the error metric of choice shouldn’t be accuracy, but rather mean squared error or log loss, which are more suitable for probabilistic models. Predicting a 51% chance of a loser winning is a much smaller failure than predicting a 90% chance of a loser winning.
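To put rough numbers on that last sentence, here’s the illustrative arithmetic using squared error and log loss as the penalty (just an example, not anyone’s actual model output):

```python
import math

# Penalty when the predicted favorite loses, for a 51% prediction vs. a 90% one.
for p in (0.51, 0.90):
    squared_error = p ** 2          # squared error: prediction was p, outcome was 0
    log_loss = -math.log(1.0 - p)   # log loss: probability assigned to the actual result
    print(f"{p:.2f} favorite loses -> squared error {squared_error:.2f}, log loss {log_loss:.2f}")

# 0.51 -> squared error 0.26, log loss 0.71
# 0.90 -> squared error 0.81, log loss 2.30
```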
I’m guessing that the weights for the first two are close, but higher than the third. That’s because a single defender can focus on the best bot, which then requires the second bot to step up to win the match.
Good catch! I had an error in the event key reporting and Einstein years. Here is a fixed book. I had fixed this issue on another version of my book but apparently I uploaded the wrong one. Here’s also a fixed top 100 list:
Fixed top 100