Analyzing Robot Performance Across Years


I’ve been looking for a way to measure team performance in a way that’s independent of the individual season, so that you can track performance on a line graph and get a good idea of how much better/worse the team has done over the years. Below is a summary of my research. I’d be interested to see if anyone has any commentary or experience with what they use.

Too Long, Didn’t Read (TLDR)

Elo is good but has a couple minor problems. I tried fixing the problems but encountered bigger problems. The best I’ve found is a normalized version of CCWM. Most simpler systems have huge flaws.

Easy Rule-Outs

OPR scaling differs from year to year. Win-loss record is too dependent on the individual schedule of each team, and can reward teams for narrowly missing out on qualifying for DCMP/CMP. District points/rank is also too dependent on the events you go to. We need something that can compare across regions and seasons.

Elo

Caleb Sykes’ adaptation of the Elo rating is great. It’s what I’ve been using for this purpose and I’ve gotten some pretty good insights out of it. Unfortunately, I have a couple issues with it. The algorithm relies on each team’s Elo score going into a match to determine their updated score. Within a season, this is great, but it has the consequence of requiring some starting value for each season. There are two possible solutions.

First, you could use something based off of past seasons. This relies on the assumption that a team’s performance can be approximated by their performances in previous seasons. While this is often true, it means that each year’s data isn’t strictly generated by performances from that year. Two teams with equal performances in 2019 will end up with different scores depending on their 2018 and prior performances. It’s tolerable but not ideal. Also, rookie teams can be seriously over/undervalued. They are presumed to have an average score (~1500) for the purposes of early calculations, but this means it could take several years before they approach their proper score. If you were making a line graph of a far-from-average rookie team’s Elo scores, you would see some slope to it over several years, even if their score remained consistent. Again, tolerable but not ideal. This is the solution Caleb Sykes uses, and for good reason.

Second, you could start everyone off at the same score each season. This would address both issues with the above solution, but it comes with two downsides. The reality of FRC is that the vast majority of teams won’t see a huge (>100 pt) difference in Elo score from year to year. By ignoring previous data, the algorithm just has less to go off of to sort teams into the right place. Also, early matches would be devalued. Elo awards more points for bigger upsets. If all teams start at the same score, the same upset would result in fewer points exchanged at the beginning of the season than at the end, even if all teams perform the exact same way. Overall, the downsides of this solution are too severe to bother with, even if it’s a bit more elegant.
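To make the "early matches are devalued" point concrete, here is a minimal sketch using the textbook Elo update. This is an assumption for illustration only: it is not Caleb Sykes’ FRC adaptation (which works off winning margins), and the K-factor of 32 and the example ratings are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo win probability for side A against side B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def points_exchanged(winner: float, loser: float, k: float = 32.0) -> float:
    """Rating points the winner gains (and the loser gives up) for one win."""
    return k * (1.0 - expected_score(winner, loser))


# Early season with a flat reset: both sides still sit at 1500.
early = points_exchanged(1500.0, 1500.0)

# Late season: the same two sides have drifted apart to 1400 and 1600,
# and the lower-rated side wins again.
late = points_exchanged(1400.0, 1600.0)

print(f"early-season upset: {early:.1f} points")
print(f"late-season upset:  {late:.1f} points")
```

With identical starting ratings the algorithm cannot tell that the result was an upset, so it moves fewer points than it does once the ratings have separated.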

CMR

CMR, or Contribution to Match Result, is an algorithm that I developed to experiment with a more elegant alternative to Elo that addresses the other concerns. I’m not entirely happy with it, but developing it made the core problems more clear, so that’s neat. If you’re interested, the GitHub is here.

The core idea of CMR is to use a linear algebra calculation, similar to OPR, to calculate each team’s contribution to whether the match is a win, loss, or tie. There are two main differences between OPR and CMR. First, the CMR left-hand “match schedule” matrix includes both alliances, meaning the teams’ scores depend both on what they contribute to their alliance and what they take away from their opponents’ alliance. If this sounds familiar, bear with me until the next section. Second, the CMR right-hand “match result” matrix includes just a 1 for red win, 0 for tie, and -1 for blue win, rather than the specific match scores. This means that rather than calculating how many points you contribute to your alliance, CMR calculates how much you contribute to the actual match result.
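The calculation described above can be sketched roughly as follows. This is a guess at the shape of the computation, not the code from the GitHub repository: the signed one-row-per-match layout, the `cmr` function name, and the 0-based team indices are all assumptions made for illustration.

```python
import numpy as np


def cmr(matches, num_teams):
    """Least-squares contribution to match result (illustrative sketch).

    matches: list of (red_teams, blue_teams, result), where result is
    1 for a red win, 0 for a tie and -1 for a blue win.
    """
    schedule = np.zeros((len(matches), num_teams))
    results = np.zeros(len(matches))
    for row, (red, blue, result) in enumerate(matches):
        schedule[row, list(red)] = 1.0    # red teams push the result toward +1
        schedule[row, list(blue)] = -1.0  # blue teams push the result toward -1
        results[row] = result
    # Every row sums to zero, so the system is rank deficient;
    # lstsq returns the minimum-norm least-squares solution.
    ratings, *_ = np.linalg.lstsq(schedule, results, rcond=None)
    return ratings


# Toy 2v2 schedule in which team 0 is on the winning (red) side every time.
toy_matches = [
    ([0, 1], [2, 3], 1),
    ([0, 2], [1, 3], 1),
    ([0, 3], [1, 2], 1),
]
toy_ratings = cmr(toy_matches, num_teams=4)
print(toy_ratings)
```

In the toy schedule, team 0 ends up with the highest rating and the other three teams tie, since each of them won once and lost twice.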

There are a couple advantages as compared to Elo. In short, it works completely independently of both the region and season. First, each season’s CMR value depends only on matches played during that season. Second, scores are determined based on final results, rather than cumulative results. With CMR, each team’s value is determined simultaneously, evaluating their performance in the context of who they were playing with and against. This means that the data shouldn’t be distorted by teams that started off with an incorrect value. Third, regional differences in performance can be communicated through many fewer matches. With Elo, cross-region communication requires directly playing against teams that have played out-of-region. With CMR, this is automatically adjusted for as long as a few teams have crossed regions and their performances were fairly consistent.

However, it does have a fatal flaw: a binary representation of the match performance just isn’t enough data to measure teams on. I was finding that each team needed to play almost ten times more matches to get a comparable deviation on a “Performance Measure vs. OPR” graph. Without the additional matches, you could end up with the third-highest-OPR team at CMP rated as contributing nothing to their alliance on average, only due to the schedule they got. It’s just not suitable for real FRC data. I tried changing the right-hand “match result” matrix to instead be the difference in scores over the sum of scores. Unfortunately, this made the data much worse, as that isn’t a linear function between actual performance and measured performance. Fortunately, it led me to the next performance measure.

Normalized CCWM

CCWM is great. It’s basically OPR but better. The gist of CCWM (calculated contribution to winning margin) is that it’s how many points you contribute to your alliance plus how many points you take away from your opponents. It’s probably the best method for analyzing a robot in a given year. However, it still has two main flaws.

First, it’s also non-linear, but in a slightly different way. In short, many FRC games don’t use scoring systems with each action representing the same number of points each occurrence, meaning that teams that are just a little better can be rated a lot higher.

Second, it depends on the details of the game. It was a lot easier to contribute a ton of points in 2018 than it was in 2019. This is really easily solved though: just divide the CCWM by the median CCWM that season. This normalized CCWM compares across regions and seasons without the inaccuracy of CMR and the lack of elegance of Elo. From what I’ve seen, its only flaw is the non-linearity. As that’s an inherent problem with FRC scoring, I’m pretty happy to just use normalized CCWM and call it a day.

Conclusion

There are three core problems with these algorithms I’ve found: comparing across regions, comparing across seasons, and getting good data. Here’s a table of each measurement system’s performance at each of those, rated from 1 (worst) to 5 (best). Normalized CCWM is the best I’ve seen. However, if anyone knows of a better system, I would love to see it.

| System | Comparing Across Regions | Comparing Across Seasons | Getting Good Data |
|---|---|---|---|
| Normalized CCWM | 5 | 5 | 5 |
| Normalized OPR | 5 | 5 | 4 |
| CMR | 4 | 5 | 1 |
| Elo (using past data) | 4 | 4 | 4 |
| Elo (not using past data) | 4 | 5 | 3 |
| District Points | 3 | 4 | 3 |
| Win-Loss | 1 | 3 | 2 |
| CCWM | 5 | 1 | 5 |
| OPR | 5 | 1 | 4 |

11 Likes

Did you actually run the numbers for all of these or is this just an elaborate explanation of a weighted decision matrix?

Edit: This post is why I miss white papers being more visible

3 Likes

I ran numbers. I can gather and post them here later tonight.

2 Likes

I am definitely interested to see the numbers. I’m glad folks are looking at new ways to slice the available data.

Keep in mind the ultimate goal that I think most teams care about: They want a simple way to understand, given teams at a particular event, what is the expected match outcome given different combinations of robots?

Showing how one method outperforms another would be the first step in convincing folks to change their current methodology.

2 Likes

Cool stuff @csully! Keep at it

On my Elo:
My Elo does indeed use past data to set start-of-season Elos, as that is how I gain the most match predictive power. You are correct, though, that a team’s Elo is never fully defined by a single season; it is a combination of in-season results and the team’s seed at the start of the season. It is possible, as you mentioned, to start the season with all teams at the same seed, but this method (even with some tweaks) does not converge as quickly to stable/predictive ratings as a method with reasonable seeds.
On the topic of seeds, rookies are very difficult to seed properly, since they have no history to use, which is why excellent rookies can be badly underestimated by Elo.
I think you understand all of this pretty well, I’m just clarifying.

On CMR:
I actually do calculate in my scouting database and event simulator a metric called calculated contribution to win, which I believe is equivalent to CMR at the event level (although I assume the CMR you used was based on all matches in a season, not a single event). This is an intriguing idea for a metric, but not particularly useful as you have found. WLT is just not a granular enough data point to be useful when teams only play on the order of dozens of matches in a season.

On CCWM vs OPR:
In terms of predictive power, I have consistently seen OPR to be a better predictor of match results than CCWM. When I first started out doing FRC analytics work, I was also under the impression that CCWM would be superior to OPR as it has some very nice properties. However, this has not been borne out in my testing. The fundamental problem to me comes from the relationship CCWM = OPR - DPR. Since DPR is essentially noise, this equation can be rewritten as CCWM = OPR + noise, meaning that CCWM is just a noisier version of OPR. Obviously this is dramatically simplified, but it’s useful for understanding why I prefer OPR. Now, CCWM isn’t useless by any means as a metric; I am just saying I personally prefer the higher predictive power of OPR in most situations.
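For anyone who wants to see where CCWM = OPR - DPR comes from, here is a small sketch using the usual least-squares formulation (one row per alliance per match). The schedule and scores are random toy data, and the variable names are just for illustration; none of it is taken from my scouting database.

```python
import numpy as np

rng = np.random.default_rng(0)
num_teams, num_matches = 12, 40

rows, own, opp = [], [], []
for _ in range(num_matches):
    teams = rng.choice(num_teams, size=6, replace=False)
    red, blue = teams[:3], teams[3:]
    red_score, blue_score = rng.integers(0, 100, size=2)  # made-up scores
    for alliance, scored, allowed in ((red, red_score, blue_score),
                                      (blue, blue_score, red_score)):
        row = np.zeros(num_teams)
        row[alliance] = 1.0  # mark the three teams on this alliance
        rows.append(row)
        own.append(scored)
        opp.append(allowed)

A = np.array(rows)
own = np.array(own, dtype=float)
opp = np.array(opp, dtype=float)


def solve(target):
    """Least-squares fit of per-team contributions to the given target."""
    return np.linalg.lstsq(A, target, rcond=None)[0]


opr = solve(own)         # contribution to own alliance score
dpr = solve(opp)         # "contribution" to the opposing alliance score
ccwm = solve(own - opp)  # contribution to winning margin

print(np.allclose(ccwm, opr - dpr))
```

Because the least-squares solution is linear in the right-hand side, the identity holds up to floating-point error no matter what the scores are, which is why any noise in DPR flows straight into CCWM.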

One quick point of clarification:

Do you actually mean that you divide by the median of the absolute value of all CCWMs in a season? The median CCWM in a season will be close to 0.

Overall I’m impressed despite my small critiques; you explain your thoughts very coherently. Looking forward to diving into your numbers more.

11 Likes

Here’s the data. This is my first time analyzing data like this, so I might miss things I should’ve included. I’d appreciate any pointers so I can get better at analyzing and presenting data.

Elo

The raw Elo data can be found here. There’s one aspect of it that I wanted to highlight: distribution.

It forms a pretty nice bell curve, skewed to the left a little. I’m confident that the rise on the far left is due to teams that didn’t attend all their events all five years. I highlight this graph just to contrast it against the CMR histograms below.

CMR

The only full season I’ve run the CMR algorithm on is 2002, the season with the smallest amount of match data while still actually having match data. It takes such a long time to get the data from The Blue Alliance that I’ve decided not to run it on any other full seasons unless there’s another purpose for it. For the purposes of choosing a rating system, the 2002 data is enough to see that there are some flaws.

At first glance, the distribution doesn’t seem to be too bad. The problem is that teams are not nearly where you would expect them to be. You can see there’s a team way ahead of the pack at 141 points. That’s not the result of 71’s legendary championship-winning crawler bot. That’s 497, who went 5-2 at their one competition that season and most likely happened to just beat the right team to rank high. 71 is all the way down at a CMR of 10.

The same thing happens with more modern data. Here’s the 2019 Alpena 2 District Event.

It’s not exactly a normal distribution. Team 33 won the event with an OPR of 35.3, the second highest at the event, but had a CMR of 12. There just aren’t enough matches in an event or season to get good data.

Simulating more matches reveals it does have some potential. The following simulation assigns OPRs to teams with a linear spacing between 0 and 100. It then calculates the CMR for a given number of matches with random match-ups. Theoretically, a team that contributes more points to their alliance should contribute more to the win-loss-tie result of the match, so CMR and OPR should be highly correlated.

This is a typical district event: 40 teams and 100 matches, for roughly 15 matches per team. The correlation coefficient is 0.814. A correlation of roughly 0.81 doesn’t sound too bad, but as we saw with 71 and 33, it has consequences. We can see in this example that the second worst team at the event was rated above the average team, and the tenth best and ninth worst were rated about the same.

Giving each team 75 matches instead of 15 gives much better results. The correlation coefficient is 0.985. Very few teams are ranked out of order by more than one position.

This is the closest I’ve done to a simulation of an entire season: 5000 teams, 50 matches per team. The correlation coefficient is 0.964, which is honestly better than I expected. It’s still not good enough, as roughly equal teams can expect a spread of nearly 60 points. This is likely due to a combination of limited match count and flaws in the simulation. In a real FRC season, teams primarily compete against teams from their region. The possible number of teams anyone could play against doesn’t really exceed 250. Here, everyone is equally likely to play each other. This isn’t so much simulating a season as it is simulating an FRC-wide competition. The effect of this is that the simulation ignores the concerns of communicating across all regions. The process of selecting teams for DCMP and CMP also affects the accuracy of scores for teams who didn’t play with a large variety of teams.

In the future, I’d like to simulate isolated regions and see if a certain number of matches evens them out. I’m pretty close to the limits of my computer, so that might be a while. Until then, I’m not confident enough in the data to use it over other measures.
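For reference, the kind of simulation described in this section (linearly spaced OPRs, random match-ups, then CMR versus OPR) could be sketched as below. This is a simplified reconstruction rather than the exact code that produced the numbers above. It assumes 3v3 matches, that the alliance with the higher summed OPR always wins (no scoring noise), and a signed one-row-per-match schedule matrix; the function name and parameters are made up for the example.

```python
import numpy as np


def simulate_cmr_vs_opr(num_teams=40, num_matches=100, seed=0):
    """Correlation between assigned OPR and fitted CMR for a random schedule."""
    rng = np.random.default_rng(seed)
    true_opr = np.linspace(0.0, 100.0, num_teams)  # linear spacing, 0 to 100

    schedule = np.zeros((num_matches, num_teams))
    results = np.zeros(num_matches)
    for m in range(num_matches):
        teams = rng.choice(num_teams, size=6, replace=False)
        red, blue = teams[:3], teams[3:]
        schedule[m, red] = 1.0
        schedule[m, blue] = -1.0
        # Assumption: the alliance with the higher total OPR wins outright.
        margin = true_opr[red].sum() - true_opr[blue].sum()
        results[m] = np.sign(margin)  # 1 red win, 0 tie, -1 blue win

    cmr, *_ = np.linalg.lstsq(schedule, results, rcond=None)
    return np.corrcoef(true_opr, cmr)[0, 1]


for matches in (100, 500):
    r = simulate_cmr_vs_opr(num_teams=40, num_matches=matches)
    print(f"{matches} matches: r = {r:.3f}")
```

The exact coefficients depend on the random schedule and on these assumptions, so they will not necessarily match the figures quoted above.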

Normalized OPR

After taking @Caleb_Sykes’s response into account, I decided to switch from normalized CCWM to normalized OPR. Here are histograms of the 2019 Alpena 2 District Event, with OPR instead of CMR.

I haven’t gotten around to any more advanced analysis than this. I’m not aware of a database that offers season-wide OPR data, so I’ll have to program it myself to make deeper comparisons to Elo and CMR.

@cadandcookies @gerthworm here’s the data

4 Likes

I just posted the data below.

While you’re right that that’s what most teams care about, that’s not what I’m doing this for. My purpose isn’t to make match predictions. It’s to make comparisons across seasons, so at the end of your season, you can draw conclusions like “We’re doing as well as we did in 20XX” and “We aren’t improving as much as we have in past years.” I’m not aware of any methods designed to do that, so this is a first look at possible methods.

4 Likes

statbotics.io has OPR data under the Years tab (click ‘Show OPR’, and then there is a download csv button). The OPRs listed are the team’s highest OPR for an event that season. Let me know if you have any difficulty accessing it.

2 Likes

I posit that DPR (and thereby CCWM) won’t be a truly valuable stat unless we have a game where a significant number (>25%) of teams are incentivized to treat defense as their primary strategy, or a majority of teams focus on offensive output similarly to defensive output. Until that point there will be too much noise in these stats.

4 Likes

Thank you for the detailed response and kind words! It’s neat to see that your calculated contribution to win metric achieved the same result as CMR. I think you’re right about CCWM vs OPR. If I were to do any analysis with normalized CCWM, I would use the median absolute value, but that shouldn’t be a problem with OPR. You mentioned that CCWM isn’t useless. What situations would cause you to favor it over OPR?

xRC simulator has been running multiple full-scale 2020 FRC game events where this has been the case: for many events, a virtually full-time defender is the norm (one per alliance). We also have perfect data on who played essentially full-time defense and who did not.

This format also normalizes the robot connection issues/robots getting better or worse over time. Will investigate some of these and see if it looks better than traditional events. (paging @cobbie as well)

1 Like

Brennan and I ran some analysis on the most recent full-scale xRC event. We have a relatively large amount of data for this event because we know exactly who played defense in each match and exactly how many points each robot scored in each match. When we compared the number of defensive matches played with DPR, we got a positive correlation of $r = .205$, which is very unexpected: if better defenders are supposed to have lower DPR, then playing more defense should result in a lower DPR, which should give a negative correlation. This result was not statistically significant with $H_0: r = 0$ and $\alpha = .05$. We then ran this against the average score of each player’s opponents throughout the competition. This resulted in a much stronger correlation of $r = .578$, which was statistically significant with $H_0: r > \frac{1}{7}$* and $\alpha = .05$. Defense was played in all 28 matches, and $\frac{1}{3}$ of players focused on defense in the majority of their matches, which eclipses the 25% threshold mentioned by Karthik.

The raw data for this event can be found here and some of the basic analysis can be found here.

*Note: the reason it is $H_0: r > \frac{1}{7}$ and not $H_0: r > 0$ is that, since each player played seven matches, one-seventh of the correlation can be assumed to be dependent.
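For readers who want to try a test against a non-zero baseline like the one-seventh above, one common option is the Fisher z-transformation. The sketch below is only an illustration of that technique: it is not necessarily the test used for the numbers in this post, the function name is made up, and the sample size in the example is hypothetical rather than the real number of players.

```python
from math import atanh, sqrt
from statistics import NormalDist


def fisher_z_test(r: float, r0: float, n: int):
    """One-sided test of whether the true correlation exceeds r0.

    Uses the Fisher z-transformation with a normal approximation.
    Returns (z, p), where a small p favors a correlation above r0.
    """
    z = (atanh(r) - atanh(r0)) * sqrt(n - 3)
    p = 1.0 - NormalDist().cdf(z)
    return z, p


# Hypothetical sample size; substitute the real number of players.
n_players = 24
z_stat, p_value = fisher_z_test(r=0.578, r0=1 / 7, n=n_players)
print(f"z = {z_stat:.2f}, one-sided p = {p_value:.4f}")
```

The approximation is rough for small samples, so treat the p-value as a guide rather than an exact figure.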

5 Likes

Be careful with this, I would advise against just normalizing by dividing by the mean or median OPR in a given year. A better way to normalize would be to take each team’s OPR, subtract out the mean OPR, and then divide by the standard deviation of all OPRs that season. Say hypothetically you have two games in different seasons, the first has a mean/median OPR of 10, with a standard deviation of 10. The second game has a mean/median OPR of 10, with a standard deviation of 1. Team A has an OPR of 20 in the first game (contender for top 8 robot at their event, 85th ish percentile) and an OPR of 20 in the second game (easily the best robot in the world that season and probably of all time). Dividing by mean/median OPR to normalize would give team A a normalized OPR of 2.0 in both seasons, while subtracting 10 and dividing by each season’s respective stdev gives a normalized OPR of 1.0 in game 1, and 10.0 in game 2.
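The hypothetical above is easy to check with a few lines of Python. The two "seasons" below are toy two-team data sets chosen only so that the mean and standard deviation match the example (mean 10 with SD 10, and mean 10 with SD 1); the function names are made up for the sketch.

```python
from statistics import mean, pstdev


def ratio_normalize(opr, season_oprs):
    """Normalize by dividing by the season's mean OPR."""
    return opr / mean(season_oprs)


def z_normalize(opr, season_oprs):
    """Normalize by subtracting the mean and dividing by the standard deviation."""
    return (opr - mean(season_oprs)) / pstdev(season_oprs)


game_1 = [0, 20]  # toy season: mean 10, standard deviation 10
game_2 = [9, 11]  # toy season: mean 10, standard deviation 1
team_a_opr = 20

for name, season in (("game 1", game_1), ("game 2", game_2)):
    ratio = ratio_normalize(team_a_opr, season)
    z = z_normalize(team_a_opr, season)
    print(f"{name}: ratio = {ratio:.1f}, z-score = {z:.1f}")
```

Dividing by the mean gives 2.0 in both seasons, while the z-score separates them (1.0 versus 10.0), matching the numbers in the example.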

Oh man, asking me to argue against myself. Well let’s give it a shot.
To start, I think CCWM is better scaled than OPR since CCWM’s average will be near 0. You can quickly tell if a team is above or below average based on the sign of their CCWM. For OPR you have to know the average OPR and work from there (perfectly manageable but less quick and dirty). I also think it’s helpful to get people thinking about winning margins, even if CCWM isn’t the best at objectively measuring how good a team is at achieving them. As a side note, my Elo also works off of winning margins.

In games for which DPR is at least somewhat effective, CCWM will also be somewhat effective. 2018 is the only game in recent memory in which DPR was something other than essentially noise. In that season, all teams effectively were playing “defense” just by playing offense since the vast majority of points you could earn came at the expense of your opponent’s points. This made DPR an actually somewhat usable metric and because of that CCWM was the best it has ever been, although still demonstrably worse predictively than OPR.

There could hypothetically be future games in which DPR/CCWM has more value than OPR. One such hypothetical game would be a modified version of PowerUp in which the only ways to score are from switches and scales (no auto line crossings, endgame parks/climbs, etc.). If penalties added to your opponent’s score, you would likely end up with a game where DPR and CCWM would be superior metrics to OPR (depending on how the penalties are earned).

One last thought, I suspect that DPR (and thus CCWM) actually has some value for teams that are consistently defended. The standard usage for DPR is to look at the team playing defense, but any team playing defense will not be scoring for their own team, and thus any team that consistently draws defense should also get a low DPR and a better CCWM as a result. I haven’t tested this theory rigorously, mostly due to lack of data about which teams are defending/being defended, but I’ll probably research more as Zebra Data becomes more prevalent.

2 Likes

The real problem with DPR is that the proposed formulation doesn’t actually measure defensive effect unless all of the teams play all of the other teams. A true DPR would measure how much the OPR of the opposing teams is reduced by the presence of that team, so you need the OPR of the opposing teams in the calculation. That’s not the case in this calculation. The “DPR” is actually the OPR of the opposing alliances that were faced: if a team faces weak alliances, the DPR improves regardless of whether there was any defense.

A possible correction is to normalize the DPR by the strength of schedule metric.

Note the distributions of OPR can vary widely from season to season as measured by the standard deviation. For example in 2008 1114’s OPR was 6 standard deviations out from the mean OPR. In 2015 I calculated that about a half dozen teams had OPRs 6 SDs out. In other years, the top team typically is 3 SDs above the mean. This variation derives from differences in game design and the ease of domination by a single team in a match.