paper: FRC Elo 2008-2016


2012, 2016, 2017, and 2018 have featured bonus RPs during qualification matches.
Many successful teams in these seasons sacrificed winning margin for RPs to varying degrees, and I’m not sure how to quantify that in a summary statistic. Is there a way to incorporate these into the Elo model that captures all the nuances (some fraction of RPs correlates with whether you win the match, but the rest is independent of the opponent)? Maybe some combination of Elo + ranking point OPR?



Another reason why 254 in 2018 may seem less dominant than 1114 in 2008 might be how Caleb’s Elo factors in winning margin. Scores in 2018 were time-based, so it was probably harder for 254 to stand out (statistically) in terms of scores and winning margins than it was for 1114 in 2008.



I’ll have to think about this more. It’s really a tough problem that I’ve wanted to address but can’t think of a great way to do so. My best idea at the moment would be to try to quantify the opportunity cost (in terms of winning margin) of each different kind of ranking point in quals matches. You could then create adjusted quals scores by adding this cost to the total score if the team gets the ranking point.

For example, this year I would guess the opportunity cost of choosing to go for the auto RP to be ~5 points, since the autos don’t vary a ton between quals and playoffs, but teams will be more likely to go for riskier (but on average higher scoring) autos in playoffs since the cost of one robot not crossing the line or no one getting a cube in the switch is so much lower in playoffs than in quals. Likewise, I would estimate that going for the climb RP has a much higher opportunity cost, maybe ~20 points. Even without the climbing RP, many teams still climb in the playoffs because the time-based scoring makes switch/scale scoring not as valuable in the last 30 seconds. However, in playoffs, teams seemed to spend a few seconds less prepping for the climb, and would use that time to score other points or play defense.

Those are just guesses though. Sometime I might try making “adjusted” quals scores that incorporate the opportunity costs described above, and then select the opportunity costs that provide the most predictive power for future team performance.
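To make the “adjusted” quals score idea concrete, here is a minimal sketch. The RP names (`auto_rp`, `climb_rp`) and function names are made up for illustration, and the 5 and 20 point costs are just the rough guesses above, not fitted values.

```python
# Hypothetical opportunity costs in match points. The ~5 and ~20 figures are
# the rough guesses from the post; they would need to be tuned for
# predictive power before being trusted.
OPPORTUNITY_COST = {"auto_rp": 5.0, "climb_rp": 20.0}


def adjusted_quals_score(raw_score, rps_earned, costs=OPPORTUNITY_COST):
    """Add back the points an alliance plausibly gave up to earn each bonus RP."""
    return raw_score + sum(costs[rp] for rp in rps_earned)


def adjusted_margin(red_score, red_rps, blue_score, blue_rps, costs=OPPORTUNITY_COST):
    """Winning margin (red minus blue) computed from adjusted quals scores."""
    red = adjusted_quals_score(red_score, red_rps, costs)
    blue = adjusted_quals_score(blue_score, blue_rps, costs)
    return red - blue
```

With these placeholder costs, a 300-290 red win where red got the auto RP and blue got the climb RP would have an adjusted margin of -5 instead of +10. Picking the costs would then mean searching over `costs` for the values that best predict later matches.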



This is very true. If my model were based on WLT instead of winning margin, it would be very plausible that 254 in 2018 would look the best.



I’ve uploaded a new book. This is the Elo model I am currently planning to use going into the 2019 season (although that is subject to change). The only key change in this book compared to my previous books is the addition of 2002-2004 matches, which has negligible impact on 2018+ Elos.

I advised caution when looking at 2005-2007 due to poor data quality, so I need to caution twice as hard about this when looking at 2002-2004 data. Even for events in which the data quality is solid, the incentive structure in those years can potentially cause Elo to be misleading. The predictive power in these years is worse than in basically every other year on record. However, they are worth including in the model if only to improve predictive power in the following years.

Interestingly, the key model parameters for 2002-2004 are essentially the same as the parameters for 3-team alliance years. I could have made them a bit different to improve predictive power a touch, but I have opted to keep them the same for simplicity.

I didn’t make any major model changes other than including 2002-2004, but it certainly wasn’t for a lack of trying. Here are the big changes I attempted but ended up opting not to include:

  • Varying kQuals and kPlayoffs for 2-team alliances in 2002-2004: I mentioned this above, but the predictive power increase was small enough that this change didn’t merit inclusion.
  • Using the log of match winning margins instead of the raw match winning margin (stolen from xkcd and initiated by conversations with Bob Croucher): This was to hopefully counteract the annoying effects of ridiculous penalty matches and/or red cards. Unfortunately, this didn’t improve predictive power in every year I applied it, although some years did see improvement. In a similar vein, I tried taking the log of the difference between the predicted match score and the actual match score, which had similarly disappointing results.
  • Updating Elo values differently for the highest, second highest, and lowest Elo teams (stolen from Eugene Fang): The idea here is that each team doesn’t contribute equally to matches, so it might make more sense to, for example, weight the Elo change more for the “better” teams than the worse teams, or vice versa. Unfortunately, this had negligible impact on predictive power while also being inconsistent between years.
  • Using a weighted sum of Elos instead of just taking a raw sum when predicting match results (stolen from Kevin Leonard on Discord): This is a similar idea to the one above, but it would be applied before the match instead of after the match. For example, I found that, in 2018, it would have been best to take 1.5*(best team’s Elo) + 1.0*(second best team’s Elo) + 0.5*(worst team’s Elo) when determining the alliance’s strength instead of just taking a raw sum. This actually provided a very large predictive power increase when applied to 2018, but the optimal weights varied drastically across years. For example, 2017 had weights very close to 1.0, 1.0, and 1.0, and 2016 was close to 1.5, 1.0, and 1.0. Because of this, I had to leave this improvement out of the model.
  • Using a geometric mean instead of an arithmetic mean when predicting match results: This is tied very closely to the above change. The idea would be to have a weighting of sorts for teams of varying skills, since geometric means (depending on where you place the zero) will either be higher or lower than arithmetic means but will also factor in alliance composition. The results were very similar to those for the different weightings: the optimal placement for the zero varied drastically year-to-year, but it could potentially provide a drastic predictive power increase in years like 2018.
  • Setting some cap on the amount a team’s Elo rating can change between matches: This was another approach to dealing with the ridiculous penalty matches. Unfortunately, the predictive power increase with this change was negligible. Perhaps waiting until after a team has played a few matches to implement the cap could be valuable, since teams’ ratings move around more early in the season? I’m not too hopeful though.
  • Using a variant of the FiveThirtyEight Elo update formula: The model currently only looks at the winning margin when updating Elos after matches. This seems a bit counterintuitive since, for example, it will cause a team’s Elo to drop if they win a match by 190 points when they were expected to win by 210. I’ve tried this change in the past unsuccessfully, but I thought I’d take another crack at it since I think I’m at least a little smarter than I was two years ago. Nothing much came out of it though. I could be misunderstanding how to implement it, but I manually tweaked all of the parameters and couldn’t get anything even slightly more predictive than my current model. I’m starting to believe that, in most matches, perhaps teams aren’t actually trying to win, but are rather trying to maximize the match winning margin. This makes intuitive sense to me since matches are so short, there is no perfect information about the game state, any team trying to strictly win has two partners they don’t have direct control over, and there are frequently alternative incentives in a match besides winning (showing off for alliance selection, going for other RPs, etc.). If this is true, then it would make more sense for a model to not incorporate who wins or loses, but instead to look only at the winning margin.
  • Adjusting the Elo distribution to be “standard” between years: I analyzed Elo distributions in different years here. Using the 2008-2018 average distribution, I was able to get a really solid predictive performance increase in most years in that range. Unfortunately, there were a couple years that saw their predictive power drop with this change. There’s certainly potential here though if I ever decide to update parameters each year.
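
A few of the ideas in the list above are easier to see in code. This is only a sketch with made-up function names, not the code in the workbook. In particular, the `log1p` form of the log margin (chosen so zero and negative margins behave) and the scaling factor `k` are my assumptions here, since the list doesn’t pin those details down.

```python
import math


def alliance_strength(elos, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of member Elos, with weights applied best-to-worst.

    weights=(1.0, 1.0, 1.0) is the raw sum; (1.5, 1.0, 0.5) is the 2018
    example from the list. Extra weights are ignored for 2-team alliances.
    """
    ranked = sorted(elos, reverse=True)
    return sum(w * e for w, e in zip(weights, ranked))


def signed_log_margin(margin):
    """Compress a winning margin with a log while keeping its sign.

    Assumption: log(1 + |margin|), so a margin of 0 maps to 0.
    """
    return math.copysign(math.log1p(abs(margin)), margin)


def margin_only_change(actual_margin, predicted_margin, k):
    """Rating change that depends only on the margin error, not on who won.

    k is a hypothetical scaling factor.
    """
    return k * (actual_margin - predicted_margin)


def capped_change(change, cap):
    """Limit a single-match rating change to the range [-cap, cap]."""
    return max(-cap, min(cap, change))
```

For example, `alliance_strength([1500, 1700, 1600], (1.5, 1.0, 0.5))` gives 4900 versus a raw sum of 4800, and `margin_only_change(190, 210, k)` is negative for any positive `k`, which is the “win by 190 but still drop” behavior described in the FiveThirtyEight bullet.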

It’s a bit frustrating that I don’t really have anything concrete to show from all of these investigations other than the addition of 2002-2004 data. I think I’m getting close to maximizing the predictive power I can get using the strict restrictions I have placed on myself, which means it might be time to branch out and break some of those restrictions. Roughly, here are the key restrictions I may loosen in the future to improve my model:

  • Demanding that model parameters stay the same every year: With the exception of the standard deviations, all of my model parameters are the same in 2018 as they were in 2008 or even 2002. The huge advantage of this approach is that I have a model that will probably work pretty darn well in 2019 and beyond, even though we know basically nothing about those games at the moment. The drawback is that I can’t incorporate things that clearly improve predictive power in some years if they reduce predictive power in other years. If I loosened this restriction, I would probably restrict myself to measurements I could take using only week 1 data (like stdevs). Possible additions after loosening this restriction would be different team weights or geometric mean usage as well as looking at the Elo distribution change between years.
  • Restricting my analyses only to raw match scores and match type: For much of the history of FRC data, this has been basically the only type of data we have had available, so looking historically this is about as good as we’re going to get. However, the addition of Twitter score breakdowns (~2009?) opened up the possibility to use other match data, and the detailed API score breakdowns in 2015 expanded on this data drastically. There’s a ton of potential to use these data sets to improve upon predictions. I’ve played around with incorporating these data into my Elo ratings in the past, so maybe it’s time to officially add some things like this. Looking forward, I think the TBA live scoring breakdowns will also add a whole new dimension to the metrics we can use.
  • Demanding a team’s rating be restricted to a single number (my current Elo rating): There are lots of supplementary metrics that I have envisioned that could be added onto raw Elo ratings to provide better predictions in certain scenarios. For example, giving teams a playoff Elo boost to improve playoff predictions or rating the region a team hails from to make better champs predictions. Additions such as these could certainly improve predictive power, but would require that I track extra metrics instead of just the simple Elo rating.

I’m not strictly opposed to loosening any of the above restrictions, although I do think I might make two separate ratings if I do, one for the tighter restrictions and one for the looser restrictions. It’s important to me that my model can be quickly applied in a new year, which my current model does very well, and loosening these restrictions will only make the transition to a new year more difficult, but certainly manageable.



I’ve uploaded a revision to my 2002-2018 Elos here: FRC_Elo_2002-2018_v2.xlsm (26.9 MB)

There’s only one major change, which is that I use max Elos instead of end of season Elos to carry over team ratings between years. This provides appreciable predictive power improvement for every year in my data set. On average, I should now be able to predict roughly 0.3% more matches “correctly” in the sense of which side of 50% the predictions are. I also added a sheet which shows each team’s max Elo in a season.
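
For anyone wanting to check that kind of number on their own predictions, the “which side of 50%” accuracy can be computed roughly like this. It is a sketch with a made-up function name. Dropping ties beforehand and skipping predictions of exactly 50% are my assumptions, not necessarily how the workbook handles them.

```python
def side_of_50_accuracy(predictions, red_won):
    """Fraction of matches where the predicted favorite actually won.

    predictions: red alliance win probabilities in [0, 1]
    red_won: True if red won, False if blue won (ties assumed already removed)
    Predictions of exactly 0.5 pick no favorite, so they are skipped.
    """
    scored = [(p, w) for p, w in zip(predictions, red_won) if p != 0.5]
    if not scored:
        return float("nan")
    correct = sum((p > 0.5) == w for p, w in scored)
    return correct / len(scored)
```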

score vs wm stdevs:
I didn’t end up including this change, but something I attempted was to use the standard deviation of winning margins in week 1 of each year as the year-specific feature that I get from week 1 data. I have been using the standard deviation of all scores, but I thought that the standard deviation of winning margins might provide a better baseline to compare years. For most years, there’s not much of a distinction, with the stdev of wms being roughly 1.4X the stdev of scores, but in years like 2018 where your score is closely correlated to the opponent’s score, or 2015 where your score has negligible correlation to your opponent’s score, there can be a difference. Here are the stdevs of scores and wms by year.

year  score stdev  wm stdev  ratio
2002         11.3      17.6   1.56
2003         31.4      50.9   1.62
2004         33.7      45.5   1.35
2005         15.5      24.5   1.58
2006         20.5      28.3   1.38
2007         32.9        46   1.40
2008         24.4      31.7   1.30
2009           21      32.1   1.53
2010          2.7       3.7   1.37
2011         28.4      37.4   1.32
2012         15.5      19.1   1.23
2013         31.1      39.8   1.28
2014         49.3      70.6   1.43
2015         33.2      32.4   0.98
2016         27.5      29.8   1.08
2017         70.6      86.2   1.22
2018        106.9     182.2   1.70

I couldn’t get an across-the-board predictive power increase with this though, so I chose to continue using score stdevs.
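
As a side note on where the roughly 1.4X comes from: if the two alliances’ scores have the same variance and correlation rho, then Var(red - blue) = 2 * Var(score) * (1 - rho), so the stdev ratio is sqrt(2 * (1 - rho)), which is about 1.41 when the scores are uncorrelated. The sketch below just encodes that textbook identity. The equal-variance assumption and the function names are mine, and the stdevs above may have been computed in a way this identity doesn’t fully capture.

```python
import math


def wm_to_score_stdev_ratio(rho):
    """stdev(winning margin) / stdev(score), assuming both alliances' scores
    have equal variance and correlation rho."""
    return math.sqrt(2.0 * (1.0 - rho))


def implied_correlation(ratio):
    """Invert the identity: the rho that would produce a given stdev ratio."""
    return 1.0 - ratio ** 2 / 2.0
```

Under these assumptions, a ratio above about 1.41 points to scores that move in opposite directions, and a ratio below it points to scores that move together.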

Using max Elos:
Inspired by conversation here, I decided to investigate the possibility of using average Elos and max Elos as the primary method for setting start of season Elos in place of end of season Elo. The first thing I tried was to use average Elos. I could get a slight predictive power boost out of this when using it in sync with year-end Elos, but it was inconsistent across years, so I left it out for simplicity.

Max Elos, on the other hand, much to my surprise, turn out to be a much better predictor of future season success than end of season Elos. In hindsight, this maybe shouldn’t have surprised me as much as it did, considering I saw something similar with max OPRs, but I guess I’m a slow learner. I found that instead of using just the raw max Elo of the previous seasons, it was better to ignore each team’s first 8 matches, and choose the max Elo after that. This makes it so that teams don’t have as their max rating a value that was just carried over from the previous season. If a team did not have at least 8 matches in a year, I instead just use their max rating over all of their matches. The other key model parameters ended up staying largely the same, the only difference being that now, instead of mean-reverting toward 1550, I mean-revert toward 1450. This makes sense, as max Elos will be greater than or equal to end of season Elos, so the mean reversion point needs to be much lower to avoid rating inflation.
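
Here is a minimal sketch of that carry-over rule. The 8-match skip and the 1450 target come from the description above, but the reversion fraction of 0.2 is a placeholder (I haven’t stated the real value here), the function names are made up, and it only looks at a single previous season for simplicity.

```python
MEAN_REVERSION_TARGET = 1450.0  # from the post
SKIP_MATCHES = 8                # from the post
REVERSION_FRACTION = 0.2        # hypothetical placeholder, not the model's value


def season_max_elo(post_match_elos, skip=SKIP_MATCHES):
    """Max Elo a team reached in a season, ignoring its first `skip` matches.

    post_match_elos: the team's rating after each match, in order played.
    Assumption: a team needs more than `skip` matches for the skip to apply;
    otherwise the max over all of its matches is used.
    """
    if not post_match_elos:
        raise ValueError("team played no matches this season")
    if len(post_match_elos) > skip:
        return max(post_match_elos[skip:])
    return max(post_match_elos)


def start_of_season_elo(prev_max_elo, target=MEAN_REVERSION_TARGET,
                        fraction=REVERSION_FRACTION):
    """Pull last season's max Elo part of the way toward the target."""
    return prev_max_elo + fraction * (target - prev_max_elo)
```

So a team whose post-skip max was 1650 would start the next season at 1610 with these placeholder numbers.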

Overall, I’m really happy to have found an improvement to make to this model for 2019, as I previously tried a lot of things unsuccessfully and didn’t want to feel like that was wasted effort. I’ll post updated 2019 start of season ratings soon.

