2012, 2016, 2017, and 2018 have featured bonus RPs during qualification matches.
Many successful teams in these seasons sacrificed winning margin for RPs to varying degrees, and I’m not sure how to quantify that in a summary statistic. Is there a way to incorporate these in the Elo model that captures all the nuances (some fraction of RPs correlate with whether you win the match, but the other fraction is independent of opponent)? Maybe some combination of Elo + ranking point OPR?
Another reason why 254 in 2018 may seem less dominant than 1114 in 2008 might be how Caleb’s Elo factors in winning margin. Scores in 2018 were time-based, so it was probably harder for 254 to stand out (statistically) in terms of scores and winning margins than it was for 1114 in 2008.
I’ll have to think about this more. It’s really a tough problem that I’ve wanted to address but can’t think of a great way to do so. My best idea at the moment would be to try to quantify the opportunity cost (in terms of winning margin) of each different kind of ranking point in quals matches. You could then create adjusted quals scores by adding this cost to the total score if the team gets the ranking point.
For example, this year I would guess the opportunity cost of choosing to go for the auto RP to be ~5 points. The autos don’t vary a ton between quals and playoffs, but teams will be more likely to go for riskier (but on average higher scoring) autos in playoffs, since the cost of one robot not crossing the line or no one getting a cube in the switch is so much lower in playoffs than in quals. Likewise, I would estimate that going for the climb RP has a much higher opportunity cost, maybe ~20 points. Even without the climbing RP, many teams still climb in the playoffs because the time-based scoring makes switch/scale scoring not as valuable in the last 30 seconds. However, in playoffs, teams seemed to spend a few seconds less prepping for the climb, and would use that time to score other points or play defense.
Those are just guesses, though. Sometime I might try making “adjusted” quals scores that incorporate the opportunity costs described above, selecting the opportunity costs that provide the most predictive power for future team performance.
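A minimal sketch of the adjusted quals score idea, assuming the two 2018 bonus RPs are labeled `"auto"` and `"climb"`. The function name, the labels, and the dictionary are hypothetical, and the ~5 and ~20 point costs are just the guesses from the post above, not fitted values.

```python
# Hypothetical opportunity costs in points; the ~5 / ~20 figures are the
# rough guesses from the discussion, not values tuned for predictive power.
OPPORTUNITY_COST = {"auto": 5.0, "climb": 20.0}


def adjusted_quals_score(raw_score, rps_earned, costs=OPPORTUNITY_COST):
    """Add back the estimated points an alliance gave up to earn each bonus RP."""
    return raw_score + sum(costs[rp] for rp in rps_earned)


# An alliance that scored 300 and earned both bonus RPs is credited with 325.
print(adjusted_quals_score(300, ["auto", "climb"]))
```

Tuning would then amount to trying different cost values and keeping the ones that predict later matches best.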
This is very true. If my model were based on WLT instead of winning margin, it would be very plausible that 254 in 2018 would look the best.
I’ve uploaded a new book. This is the Elo model I am currently planning to use going into the 2019 season (although that is subject to change). The only key change in this book compared to my previous books is the addition of 2002-2004 matches, which has negligible impact on 2018+ Elos.
I advised caution when looking at 2005-2007 due to poor data quality, so I need to caution twice as hard when looking at 2002-2004 data. Even for events in which the data quality is solid, the incentive structure in those years can potentially cause Elo to be misleading. The predictive power in these years is worse than basically every other year on record; however, they are worth including in the model if only to improve predictive power in the following years.
Interestingly, the key model parameters for 2002-2004 are essentially the same as the parameters for 3-team alliance years. I could have made them a bit different to improve predictive power a touch, but I have opted to keep them the same for simplicity.
I didn’t make any major model changes other than including 2002-2004, but it certainly wasn’t for a lack of trying. Here are the big changes I attempted but ended up opting not to include:
 Varying kQuals and kPlayoffs for 2-team alliances in 2002-2004: I mentioned this above, but the predictive power increase was small enough that this change didn’t merit inclusion.
 Using the log of match winning margins instead of the raw match winning margin (stolen from xkcd and initiated by conversations with Bob Croucher): This was to hopefully counteract the annoying effects of ridiculous penalty matches and/or red cards. Unfortunately, this didn’t improve predictive power in every year I applied it, although some years did see improvement. In a similar vein, I tried taking the log of the difference between the predicted match score and the actual match score, but this had similarly disappointing results.
 Updating Elo values differently for the highest, second highest, and lowest Elo teams (stolen from Eugene Fang): The idea here is that each team doesn’t contribute equally to matches, so it might make more sense to, for example, weight the Elo change more for the “better” teams than the worse teams, or vice versa. Unfortunately, this had negligible impact on predictive power while also being inconsistent between years.
 Using a weighted sum of Elos instead of just taking a raw sum when predicting match results (stolen from Kevin Leonard on discord): This is a similar idea to the one above, but it would be applied before the match instead of after the match. For example, I found that, in 2018, it would have been best to take 1.5*(best team’s Elo) + 1.0*(second best team’s Elo) + 0.5*(worst team’s Elo) when determining the alliance’s strength instead of just taking a raw sum. This actually provided a very large amount of predictive power increase when applied to 2018, but the optimal weights varied drastically across years. For example, 2017 had weights very close to 1.0, 1.0, and 1.0, and 2016 was close to 1.5, 1.0, and 1.0. Because of this, I had to leave this improvement out of the model.
 Using a geometric mean instead of an arithmetic mean when predicting match results: This is tied very closely to the above change. The idea would be to have a weighting of sorts for teams of varying skills, since geometric means (depending on where you place the zero) will either be higher or lower than arithmetic means but will also factor in alliance composition. The results were very similar to the different weightings: the optimal placement for the zero varied drastically year to year, but could potentially provide a drastic predictive power increase in years like 2018.
 Setting some cap on the amount a team’s Elo rating can change between matches: This was another approach to dealing with the ridiculous penalty matches. Unfortunately, the predictive power increase with this change was negligible. Perhaps waiting until after a team has played a few matches to implement the cap could be valuable, since teams’ ratings move around more early in the season? I’m not too hopeful though.
 Using a variant of the FiveThirtyEight Elo update formula: The model currently only looks at the winning margin when updating Elos after matches. This seems a bit counterintuitive since, for example, it will cause a team’s Elo to drop if they win a match by 190 points when they were expected to win by 210. I’ve tried this change in the past unsuccessfully, but I thought I’d take another crack at it since I think I’m at least a little smarter than I was two years ago. Nothing much came out of it though. I could be misunderstanding how to implement it, but I manually tweaked all of the parameters and couldn’t get anything even slightly more predictive than my current model. I’m starting to believe that, in most matches, perhaps teams aren’t actually trying to win, but are rather trying to maximize the match winning margin. This makes intuitive sense to me since matches are so short, there is no perfect information about the game state, any team trying to strictly win has two partners they don’t have direct control over, and there are frequently alternative incentives in a match instead of winning (showing off for alliance selection, going for other RPs, etc…). If this is true, then it would make more sense for a model to not incorporate who wins or loses, but instead to look only at the winning margin.
 Adjusting the Elo distribution to be “standard” between years: I analyzed Elo distributions in different years here. Using the 2008-2018 average distribution, I was able to get a really solid predictive performance increase in most years in that range. Unfortunately, there were a couple years that saw their predictive power drop with this change. There’s certainly potential here though if I ever decide to update parameters each year.
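To make the weighted-sum idea from the list above concrete, here is a small sketch. The function name and signature are hypothetical; the 1.5 / 1.0 / 0.5 weights are the 2018 values quoted above, and equal weights reproduce the plain raw sum the model actually uses.

```python
def alliance_strength(elos, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of an alliance's Elos, with weights applied best team first."""
    if len(elos) != len(weights):
        raise ValueError("need exactly one weight per team")
    ranked = sorted(elos, reverse=True)  # best team first
    return sum(w * e for w, e in zip(weights, ranked))


alliance = [1500, 1700, 1600]
print(alliance_strength(alliance))                   # raw sum
print(alliance_strength(alliance, (1.5, 1.0, 0.5)))  # 2018-style weighting
```

Because these weights still add up to 3, the weighted strength stays on roughly the same scale as the raw sum, which is one reason the variants are easy to compare.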
It’s a bit frustrating that I don’t really have anything concrete to show from all of these investigations other than the addition of 2002-2004 data. I think I’m getting close to maximizing the predictive power I can get using the strict restrictions I have placed on myself, which means it might be time to branch out and break some of those restrictions. Roughly, here are the key restrictions I may loosen in the future to improve my model:
 Demanding that model parameters stay the same every year: With the exception of the standard deviations, all of my model parameters are the same in 2018 as they were in 2008 or even 2002. The huge advantage of this approach is that I have a model that will probably work pretty darn well in 2019 and beyond, even though we know basically nothing about those games at the moment. The drawback is that I can’t incorporate things that clearly improve predictive power in some years if they reduce predictive power in other years. If I loosened this restriction, I would probably restrict myself to measurements I could take using only week 1 data (like stdevs). Possible additions after loosening this restriction would be different team weights or geometric mean usage as well as looking at the Elo distribution change between years.
 Restricting my analyses only to raw match scores and match type: For much of the history of FRC data, this has been basically the only type of data we have had available, so looking historically this is about as good as we’re going to get. However, the addition of Twitter score breakdowns (~2009?) opened up the possibility to use other match data, and the detailed API score breakdowns in 2015 expanded on this data drastically. There’s a ton of potential to use these data sets to improve upon predictions. I’ve played around with incorporating these data into my Elo ratings in the past, so maybe it’s time to officially add some things like this. Looking forward, I think the TBA live scoring breakdowns will also add a whole new dimension to the metrics we can use.
 Demanding a team’s rating be restricted to a single number (my current Elo rating): There are lots of supplementary metrics that I have envisioned that could be added onto raw Elo ratings to provide better predictions in certain scenarios. For example, giving teams a playoff Elo boost to improve playoff predictions or rating the region a team hails from to make better champs predictions. Additions such as these could certainly improve predictive power, but would require that I track extra metrics instead of just the simple Elo rating.
I’m not strictly opposed to loosening any of the above restrictions, although I do think I might make two separate ratings if I do, one for the tighter restrictions and one for the looser restrictions. It’s important to me that my model can be quickly applied in a new year, which my current model does very well, and loosening these restrictions will only make the transition to a new year more difficult, but certainly manageable.
I’ve uploaded a revision to my 2002-2018 Elos here: FRC_Elo_20022018_v2.xlsm (26.9 MB)
There’s only one major change, which is that I use max Elos instead of end of season Elos to carry over team ratings between years. This provides appreciable predictive power improvement for every year in my data set. On average, I should now be able to predict roughly 0.3% more matches “correctly” in the sense of which side of 50% the predictions are. I also added a sheet which shows each team’s max Elo in a season.
score vs wm stdevs:
I didn’t end up including this change, but something I attempted was to use the standard deviation of winning margins in week 1 of each year as the year-specific feature that I get from week 1 data. I have been using the standard deviation of all scores, but I thought that the standard deviation of winning margins might provide a better baseline to compare years. For most years, there’s not much of a distinction, with the stdev of wms being roughly 1.4X the stdev of scores, but in years like 2018 where your score is closely correlated to the opponent’s score, or 2015 where your score has negligible correlation to your opponent’s score, there can be a difference. Here are the stdevs of scores and wms by year.
year  score stdev  wm stdev  ratio 

2002  11.3  17.6  1.56 
2003  31.4  50.9  1.62 
2004  33.7  45.5  1.35 
2005  15.5  24.5  1.58 
2006  20.5  28.3  1.38 
2007  32.9  46  1.40 
2008  24.4  31.7  1.30 
2009  21  32.1  1.53 
2010  2.7  3.7  1.37 
2011  28.4  37.4  1.32 
2012  15.5  19.1  1.23 
2013  31.1  39.8  1.28 
2014  49.3  70.6  1.43 
2015  33.2  32.4  0.98 
2016  27.5  29.8  1.08 
2017  70.6  86.2  1.22 
2018  106.9  182.2  1.70 
I couldn’t get an across-the-board predictive power increase with this though, so I chose to continue using score stdevs.
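For reference, here is a sketch of how the two stdevs and their ratio could be computed from a list of match results. It assumes the winning margin is the signed red-minus-blue difference and uses the population standard deviation; both choices, and the function name, are assumptions rather than details confirmed above. If both alliances’ scores share the same variance and are uncorrelated, the ratio works out to sqrt(2) ≈ 1.41, which lines up with the “roughly 1.4X” seen in most years.

```python
from statistics import pstdev


def stdev_summary(matches):
    """Return (score stdev, winning-margin stdev, ratio) for (red, blue) score pairs."""
    scores = [score for match in matches for score in match]
    margins = [red - blue for red, blue in matches]  # signed margin (assumption)
    score_sd = pstdev(scores)
    margin_sd = pstdev(margins)
    return score_sd, margin_sd, margin_sd / score_sd


# Toy data, not real match results.
toy_matches = [(10, 20), (30, 10), (20, 20), (40, 30)]
score_sd, margin_sd, ratio = stdev_summary(toy_matches)
print(round(score_sd, 2), round(margin_sd, 2), round(ratio, 2))
```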
Using max Elos:
Inspired by conversation here, I decided to investigate the possibility of using average Elos and max Elos as the primary method for setting start of season Elos in place of end of season Elo. The first thing I tried was to use average Elos. I could get a slight predictive power boost out of this when using it in sync with year-end Elos, but it was inconsistent across years, so I left it out for simplicity.
Max Elos, on the other hand, much to my surprise, turn out to be a much better predictor of future season success than end of season Elos. In hindsight, this maybe shouldn’t have surprised me as much as it did, considering I saw something similar with max OPRs, but I guess I’m a slow learner. I found that instead of using just the raw max Elo of the previous seasons, it was better to ignore each team’s first 8 matches, and choose the max Elo after that. This makes it so that teams don’t have as their max rating a value that was just carried over from the previous season. If a team did not have at least 8 matches in a year, I instead just use their max rating over all of their matches. The other key model parameters ended up staying largely the same, the only difference being that now, instead of mean-reverting toward 1550, I mean-revert toward 1450. This makes sense, as max Elos will always be greater than or equal to end of season Elos, so the mean reversion point needs to be much lower to avoid rating inflation.
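A rough sketch of the max Elo carryover described above. The function names are hypothetical, the input is assumed to be a team’s rating after each match in chronological order, and the `reversion` fraction of 0.2 is a made-up placeholder since the actual amount of mean reversion isn’t stated here. Only the 8-match skip and the 1450 reversion target come from the post.

```python
def season_max_elo(post_match_elos, skip=8):
    """Max Elo in a season, ignoring the first `skip` matches when possible."""
    if not post_match_elos:
        raise ValueError("need at least one rating")
    # Assumption: a team with `skip` or fewer matches falls back to the max
    # over all of its matches.
    if len(post_match_elos) > skip:
        considered = post_match_elos[skip:]
    else:
        considered = post_match_elos
    return max(considered)


def start_of_season_elo(prev_max_elo, mean=1450.0, reversion=0.2):
    """Pull last season's max Elo part of the way toward `mean`.

    reversion=0.2 is an illustrative placeholder, not the model's real value.
    """
    return prev_max_elo + reversion * (mean - prev_max_elo)


ratings = [1700, 1690, 1680, 1670, 1660, 1650, 1640, 1630, 1600, 1620, 1610]
carried = season_max_elo(ratings)  # early carried-over highs are ignored
print(carried, start_of_season_elo(carried))
```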
Overall, I’m really happy to have found an improvement to make to this model for 2019, as I previously tried a lot of things unsuccessfully and didn’t want to feel like that was wasted effort. I’ll post updated 2019 start of season ratings soon.
Here is my Elo book updated through 2019: FRC_Elo_20022019.xlsm (30.8 MB)
I’ll probably transition this to GitHub sometime, but for now I’m just putting it here.
Also, here are the top 25 teams based on 2020 Start of Season Elo.
team  2020 SOS Elo 

2056  1889 
254  1859 
2910  1852 
1678  1848 
1114  1840 
1323  1838 
148  1835 
2046  1832 
3538  1830 
118  1804 
1619  1800 
133  1797 
225  1795 
2767  1794 
330  1791 
195  1786 
1519  1785 
1796  1784 
5172  1783 
2481  1783 
5460  1782 
33  1781 
27  1780 
971  1778 
3707  1777 
The full list of start of 2020 Elos can be found here:
2020 SOS Elos.xlsx (100.3 KB)
I cry…
Super happy to see a MN team in the list finally!
Here is the percentage of matches “correctly” predicted by year for Elo, meaning matches where the alliance with the higher combined Elo won:
year  matches “correctly” predicted 

2002  56.0% 
2003  63.4% 
2004  66.5% 
2005  67.5% 
2006  66.8% 
2007  67.5% 
2008  68.4% 
2009  70.1% 
2010  72.9% 
2011  75.7% 
2012  71.1% 
2013  73.2% 
2014  70.9% 
2015  71.4% 
2016  71.9% 
2017  66.9% 
2018  73.9% 
2019  72.4% 
So 2019 was a pretty average year if you’re looking at the range 2010+, not quite as good as 2018, but not nearly as bad as 2017.
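For anyone wanting to reproduce the “correctly” predicted metric, a minimal sketch follows. The tuple layout and function name are hypothetical, and skipping tied matches and exactly equal Elo sums is an assumption, since how those cases are counted isn’t specified above.

```python
def fraction_correct(matches):
    """Share of matches won by the alliance with the higher combined Elo.

    Each match is (red_elo_sum, blue_elo_sum, red_score, blue_score).
    """
    correct = 0
    counted = 0
    for red_elo, blue_elo, red_score, blue_score in matches:
        if red_elo == blue_elo or red_score == blue_score:
            continue  # assumption: ties and coin-flip predictions are skipped
        counted += 1
        if (red_elo > blue_elo) == (red_score > blue_score):
            correct += 1
    return correct / counted if counted else float("nan")


# Toy data, not real matches: two correct calls, one miss, one tie skipped.
sample = [
    (4800, 4700, 100, 90),
    (4800, 4900, 100, 90),
    (4700, 4800, 50, 80),
    (4800, 4700, 60, 60),
]
print(fraction_correct(sample))
```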
OPR correct predictions tend to follow the Elo trend pretty closely, but usually get about 2 percentage points fewer matches correct.
Out of curiosity, in what percentage of matches do Elo and OPR agree, and of those how many are wrong? And in cases where they disagree, which one is correct more often?
Elo is computed and updated for each team after each match (if I understand your methodology correctly). OPRs are usually “batch” computed from a number of matches (often all of the matches in an event).
For your OPR prediction comparison, what OPRs are you using for each match prediction? The ones from the previous event? The “World” OPR up to that match? Current event OPR? Current event OPR but preloading an MMSE computation with the previous event’s OPRs for all teams?
Or are you simply looking at the end-of-season Elo and overall World OPR and then back-predicting matches using those numbers?
It seems like it would be tricky to do an apples-to-apples comparison where all of the same historical match info was known for both Elo and OPR, but neither used the match info being predicted to get the numbers used in the prediction.
The “OPR” predictions are actually what I call “predicted contribution” predictions, which use max previous OPR for the first match, and then ixOPRs with 2 iterations for remaining matches seeded with the max previous OPR. So it is an apples-to-apples comparison in that both Elo and OPR are using all of the information available to them leading up to the match.
I’ll get around to this eventually, but no promises on a timeframe.
Cool. Thanks for the clarification.
How do you initialize the xOPR calculations at the start of a new season?
Do you use elimination matches for Elo and ixOPR calculations or no? [I imagine that Elo might still work just fine during playoffs, but OPR-type calculations might be thrown off with superior defense played in playoff matches compared with quals?]
Has any detailed investigation been done on the xOPR/ixOPR methodology to minimize the error in match predictions? How much better does using maxOPR do vs. mostRecentOPR? xOPR vs. ixOPR (1 iteration) vs. ixOPR (2 iterations), etc.? I saw a few charts that Eugene put up 3 years ago: has anyone done anything beyond that?
I’d love to see the match prediction error probability over an event for both Elo and ixOPR. I’d imagine the error probabilities would drop from the 1st match in an event to the last one. How much?
Another comment coming soon describing a common way of viewing xOPR, ixOPR, and OPRm (MMSE OPR)…
A Common OPR framework
OPR
Standard “vanilla” OPRs are calculated to minimize the squared error between the match scores and the sum of the OPRs of the teams on the alliances that produced the match scores.
Vanilla OPRs are very poor at predicting match results when the number of match scores is less than or equal to the number of teams (and they are still quite poor until a lot more matches have been played).
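As a concrete illustration of the vanilla calculation, here is a small NumPy sketch. The function name and the toy 4-team, 2-team-alliance schedule are made up for the example; real events use 3-team alliances and actual match data.

```python
import numpy as np


def vanilla_opr(alliances, scores, n_teams):
    """Least-squares OPR.

    alliances: one tuple of team indices per alliance per match.
    scores:    the score each of those alliances earned.
    """
    A = np.zeros((len(alliances), n_teams))
    for row, teams in enumerate(alliances):
        A[row, list(teams)] = 1.0  # 1 where a team played on that alliance
    # Minimize ||A @ opr - scores||^2. With too few matches the system is
    # underdetermined and lstsq returns the minimum-norm solution.
    opr, *_ = np.linalg.lstsq(A, np.asarray(scores, dtype=float), rcond=None)
    return opr


# Toy schedule; scores are generated from known contributions so the
# least-squares fit can recover them.
toy_alliances = [(0, 1), (2, 3), (0, 2), (1, 3), (0, 3), (1, 2)]
true_opr = np.array([12.0, 18.0, 33.0, 37.0])
toy_scores = [true_opr[list(a)].sum() for a in toy_alliances]
print(vanilla_opr(toy_alliances, toy_scores, n_teams=4))
```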
xOPR
xOPRs address this issue by adding additional terms to the error minimization procedure. xOPRs are calculated by minimizing the same squared error as the vanilla OPRs, plus squared errors between “fake” matches played with match scores that are set to be exactly the sum of prior OPR estimates of the teams in the matches. This adds additional constraints and essentially pulls the xOPR values towards the prior OPR estimates.
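One way to read this in code, as a sketch only: treat every scheduled but unplayed match as a “fake” match whose score is the sum of the prior OPRs, then solve the usual least-squares problem over the whole schedule. The function name, argument layout, and toy schedule are assumptions for illustration.

```python
import numpy as np


def xopr(schedule, played_scores, prior):
    """xOPR sketch: real scores for played rows, prior-based fake scores for the rest.

    schedule:      alliances (tuples of team indices) for the whole event, in order.
    played_scores: real scores for the first len(played_scores) rows of schedule.
    prior:         prior OPR estimate for every team.
    """
    prior = np.asarray(prior, dtype=float)
    A = np.zeros((len(schedule), len(prior)))
    for row, teams in enumerate(schedule):
        A[row, list(teams)] = 1.0
    scores = A @ prior  # fake score = sum of prior OPRs on that alliance
    scores[: len(played_scores)] = played_scores  # overwrite with real results
    estimate, *_ = np.linalg.lstsq(A, scores, rcond=None)
    return estimate


toy_schedule = [(0, 1), (2, 3), (0, 2), (1, 3), (0, 3), (1, 2)]
toy_prior = [10.0, 20.0, 30.0, 40.0]
# With only the first two alliance scores known, estimates stay near the prior.
print(xopr(toy_schedule, [34.0, 75.0], toy_prior))
```

Before any match is played this returns the prior itself (as long as the schedule determines every team), and once every match is played it matches vanilla OPR.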
ixOPR
ixOPRs do the same thing as xOPRs but iterate further. So first, the xOPRs are computed using some previous estimates of the OPRs, then ixOPRs are computed using the same procedure but using the xOPRs as the prior estimates. This can be continued indefinitely, using the results from one iteration as the previous estimates in the next iteration.
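The iteration can be sketched as a loop around a single xOPR step. As before, the helper names and the toy schedule are hypothetical, and the default of 2 iterations simply mirrors the setting mentioned earlier in the thread.

```python
import numpy as np


def xopr_step(schedule, played_scores, prior):
    """One xOPR solve: unplayed rows get the sum of prior OPRs as a fake score."""
    prior = np.asarray(prior, dtype=float)
    A = np.zeros((len(schedule), len(prior)))
    for row, teams in enumerate(schedule):
        A[row, list(teams)] = 1.0
    scores = A @ prior
    scores[: len(played_scores)] = played_scores
    estimate, *_ = np.linalg.lstsq(A, scores, rcond=None)
    return estimate


def ixopr(schedule, played_scores, prior, iterations=2):
    """Feed each xOPR result back in as the prior for the next solve."""
    estimate = np.asarray(prior, dtype=float)
    for _ in range(iterations):
        estimate = xopr_step(schedule, played_scores, estimate)
    return estimate


toy_schedule = [(0, 1), (2, 3), (0, 2), (1, 3), (0, 3), (1, 2)]
toy_prior = [10.0, 20.0, 30.0, 40.0]
print(ixopr(toy_schedule, [34.0, 75.0], toy_prior, iterations=2))
```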
OPRm (MMSE OPR)
OPRm values are computed in a similar way. The error minimized in the OPRm solution is the same standard squared OPR error, plus squared errors in “fake” matches played by each team by itself, leading to a match score that is its own individual prior OPR estimate. As in xOPRs and ixOPRs, this tends to pull the resulting OPRm values towards the prior estimates.
The second “fake” error in the OPRm calculation is also usually weighted by an arbitrary term which roughly measures how confident the prior estimate is. Note that this could also be done in the xOPR/ixOPR methodology but AFAIK has not been previously done.
The OPRm calculation usually only adds 1 “fake” match per team, so as the number of real matches in the event increases the OPRm values converge towards the vanilla OPR values. The xOPR/ixOPR calculations add more “fake” matches per team at the start of an event and progressively replace these fake matches with real matches, so again, the xOPR/ixOPR values converge to the vanilla OPR values by the end of the event.
The OPRm calculation can also add another “fake” match per team that pulls their OPR value towards the overall mean OPR value slightly, which is a way of implementing a “regression towards the mean” that improves match prediction performance slightly. This could also be added to the xOPR/ixOPR methodology as well, and could slightly improve the prediction performance of these metrics. This could also be implemented using only 1 single “fake” match per team, with the prior teamspecific OPR estimate being some weighted combination of that team’s prior OPR estimate and an estimate of the average of all OPRs for the teams at the event.
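Here is a compact sketch of that OPRm-style solve, written as ridge-like normal equations. The names `prior_weight` and `mean_pull` are hypothetical, and using the average of the priors as the “overall mean OPR” is an assumption; this is one reading of the description above, not a reference implementation.

```python
import numpy as np


def oprm(alliances, scores, prior, prior_weight=1.0, mean_pull=0.0):
    """Minimize ||A @ x - scores||^2 + prior_weight * ||x - target||^2.

    target blends each team's prior with the mean prior (single fake match
    per team); prior_weight must be positive so the system is solvable.
    """
    prior = np.asarray(prior, dtype=float)
    n_teams = len(prior)
    A = np.zeros((len(alliances), n_teams))
    for row, teams in enumerate(alliances):
        A[row, list(teams)] = 1.0
    s = np.asarray(scores, dtype=float)
    # Assumption: the mean of the priors stands in for the overall mean OPR.
    target = (1.0 - mean_pull) * prior + mean_pull * prior.mean()
    lhs = A.T @ A + prior_weight * np.eye(n_teams)
    rhs = A.T @ s + prior_weight * target
    return np.linalg.solve(lhs, rhs)


toy_prior = [10.0, 20.0, 30.0, 40.0]
# After one match (two alliance scores) the estimates move only part of the
# way from the prior toward the new results.
print(oprm([(0, 1), (2, 3)], [34.0, 75.0], toy_prior, prior_weight=1.0))
```

A larger `prior_weight` keeps the estimates closer to the prior, and as real matches accumulate the data term dominates, so the values drift toward vanilla OPR.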
This is a solid summary. I’ll try to remember to link to this in the future.
I’m planning to play around more with OPRm sometime around here, and I’ll post the results as I get them. I don’t actually know offhand whether it has more or less predictive power than ixOPR, but if it has more, that’s what I’ll probably use moving forward.
I think OPRm is probably similar in predictive power, but has a bit more “tunability” than xOPR/ixOPR which may make it easier to eke out a few extra percent in performance.
A few points:

The amount that xOPR/ixOPR weight based on the prior OPR estimate depends on how many matches each team plays in an event, which seems odd and arbitrary to me. If you’re playing in a 12 match event, after your first match xOPR/ixOPR add in 11 fake matches, while if you’re playing in a 10 match event, after your first match xOPR/ixOPR add in 9 fake matches. IMHO after 1 match played your “new” OPR should be the same regardless of how many matches you have left to play in the event. You could just add a single fake match with a weighting factor to bias the relative importance of the previous estimate and the new match results, and then you basically just end up doing OPRm.

xOPR/ixOPR depend on the particular match schedule for unplayed matches. This also seems odd and arbitrary. After 1 match, the xOPR/ixOPR results will be different depending on what the unplayed remaining match schedule is. Change that unplayed schedule and you’ll get different values for xOPR/ixOPR. OPRm doesn’t have this issue.

Regressing to the mean definitely helps predictive ability (at least it did when I studied it in my paper from a few years ago). OPRm does this naturally. You could do this with xOPR/ixOPR too by adding a fake match with match scores pulled towards the mean, or just by doing that a bit for every fake match you currently are using: i.e., instead of making the fake match score exactly the sum of the OPR estimates, make the fake match score x*OverallMeanScore + (1-x)*(sum of OPR estimates for the specific teams).

At the end of an event, xOPR/ixOPR are 100% based on the event results. I think Caleb has found that better predictive results can be gained by weighting multiple events, at least in the Elo space. OPRm allows a weighted blending of results from previous estimates and from the current event. This could be done with xOPR/ixOPR by still having 1 or more fake matches added in even after all matches have been played.
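The mean-regressed fake match score from the third point can be written out as a one-liner. The function name is hypothetical and `x` is the blending weight from that point, assumed to lie between 0 and 1.

```python
def fake_match_score(prior_oprs, overall_mean_score, x):
    """Blend the overall mean alliance score with the sum of the teams' prior OPRs."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x is assumed to lie in [0, 1]")
    return x * overall_mean_score + (1.0 - x) * sum(prior_oprs)


# x = 0 reproduces the usual xOPR fake score; larger x pulls it toward the mean.
print(fake_match_score([10.0, 20.0, 30.0], overall_mean_score=50.0, x=0.25))
```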
I have a couple of questions.
First, I see that your prediction percentage averages around 70%. Is this number what you expected? How does this compare to other Elo rated sports?
Secondly, I read about using Elo ratings for pro sports and they use modifiers like home team, starting pitchers, weather. Other than your veteran/rookie modifiers, what other modifiers do you think you can use to improve the prediction percentage? Matches played? Matches since last match? Difference between best and worst team in an alliance?
If you weigh contributions to the alliance Elo in a non-constant way, can you gain predictive power? My understanding is that right now, you are taking the average.
Many FRC games have diminishing opportunities for scoring beyond two robots (Destination: Deep Space is one of them). As such, it would seem like the two highest Elo robots on an alliance matter much more than the contribution of the third.