Scouting Revolution: How much does your data suck? (Survey on scouting/submit your data)

My data sucks? What??

You might say:

I can confidently say this team has a 70% climb rate!
Team x averages 1 more cycle a match than team y!
We scout every team in every match every time!

Is this actually true?


tl;dr

DM me or post in this thread full datasets from your 2019 events so I can review overall FRC data accuracy, and fill out this survey about how good you think your data is.


Over the last few years I have become more and more interested in ways to validate my data, so that I can have higher confidence that it is correct. It’s something we went all in on for 2020, and it has upped our game a lot when it comes to the quality of our scouting data. With the access we have to the TBA/FIRST APIs, making sure your data is validated has never been easier. I’m also fairly confident that this step of regularly checking that your data is actually correct isn’t done, even in the most basic ways, by a large percentage of teams. Bad data leads to incorrect picks, and often these incorrect picks are never recognized in the post mortem after a team gets knocked out of an event.

One of the puzzling things that seems to come up regularly in many of the FRC friend groups I’ve been in is something to the effect of: “Why the heck would team x pick team y? My data says team z is better than team y. What a bad pick!” Was it a bad pick? Was their data bad? Was your data bad?

A few things hit me pretty hard at the start of this project, and I want to highlight some of the work that already exists to promote the use of datasets for analysis and the development of new techniques for making scouting better.

Starting with a quote from a relatively recent thread on ELO vs OPR.

I’m curious to see exactly how accurate, trustworthy, and detailed scouting data is, but going beyond that I’m interested to see how good you think your data actually is. In general I think teams over-estimate how good their data is.

I was also super inspired by 1678’s phenomenal scouting whitepaper (seriously, go read it), specifically chapter 5, where they analyze their 2020 data from 2020caln against TBA API data.

Prior to this paper coming out we had also implemented a live version of this for all of our scouting in 2020. Before each submission, through a data connection via a tethered phone, we would check in with the API to see if the counts/climbs/autos matched up, and any individual scout or group of scouts would get instant feedback on mismatches. The scouts would then attempt to reconcile the mismatch (“Oh, I wasn’t sure this one time when team x shot, they might have not shot there”, etc.). Any mismatch not resolved would then be recorded in the notes for that data. The scouts also get near-instant feedback on areas to improve when it comes to accurately recording data.
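For anyone who wants to try something similar, here is a minimal sketch of that per-submission check in Python (using 2019 game pieces as the example, since that is the data being reviewed here). It uses the real TBA v3 `/match/{match_key}` endpoint, but the scouting record schema and the score breakdown field names are assumptions on my part (field names change every season), so verify them against the TBA documentation before relying on this.

```python
import requests

TBA_AUTH_KEY = "your-read-api-key"  # from your TBA account page
BASE_URL = "https://www.thebluealliance.com/api/v3"


def fetch_breakdown(match_key: str, color: str) -> dict:
    """Pull the official score breakdown for one alliance in one match."""
    resp = requests.get(
        f"{BASE_URL}/match/{match_key}",
        headers={"X-TBA-Auth-Key": TBA_AUTH_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    breakdown = resp.json().get("score_breakdown")
    return breakdown[color] if breakdown else {}


def check_alliance_totals(scouted_entries: list, match_key: str, color: str) -> list:
    """Compare summed scouted counts against the official breakdown.

    scouted_entries is a list of per-robot dicts, e.g.
    {"team": 4476, "hatches": 3, "cargo": 4} (hypothetical schema).
    Returns human-readable mismatch messages for the scouts to reconcile.
    """
    official = fetch_breakdown(match_key, color)
    if not official:
        return ["No official breakdown posted yet for this match"]

    scouted_hatches = sum(e["hatches"] for e in scouted_entries)
    scouted_cargo = sum(e["cargo"] for e in scouted_entries)

    # Field names below are my recollection of the 2019 breakdown; adjust per season.
    official_hatches = official.get("hatchPanelPoints", 0) // 2  # 2 points per panel
    official_cargo = official.get("cargoPoints", 0) // 3         # 3 points per cargo

    mismatches = []
    if scouted_hatches != official_hatches:
        mismatches.append(f"Hatches: scouted {scouted_hatches}, TBA says {official_hatches}")
    if scouted_cargo != official_cargo:
        mismatches.append(f"Cargo: scouted {scouted_cargo}, TBA says {official_cargo}")
    return mismatches
```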

If we look at an excerpt from 1678’s paper, we see essentially what we can do to provide a more robust understanding of our dataset.

The flaw with this method, of course, is that just because the total number of game pieces adds up doesn’t mean the allocation of who scored what is correct, but it does give us a good idea of how accurate our data is. That being said, the level of precision in the API data has been increasing as of late, to the point where for the last several years we can confidently know which teams crossed auto lines and what their endgame scores were. We can use this to our advantage to really drill into how good our data actually is.
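As a rough illustration of those per-robot checks, here is a sketch that compares a scouted entry against the sandstorm line and endgame fields in a TBA match object. The scouting schema is made up, and the 2019 breakdown field names (“habLineRobot1”, “endgameRobot1”, …) and values are written from memory, so double-check them against the API docs.

```python
def check_per_robot_fields(match: dict, color: str, scouted: dict) -> list:
    """match is a full TBA match object; scouted maps team number to a dict like
    {"crossed_hab_line": True, "climb_level": "HabLevel2"} (hypothetical schema)."""
    issues = []
    team_keys = match["alliances"][color]["team_keys"]  # e.g. ["frc4476", ...]
    breakdown = (match.get("score_breakdown") or {}).get(color, {})

    for slot, team_key in enumerate(team_keys, start=1):
        team = int(team_key[3:])  # strip the "frc" prefix
        if team not in scouted:
            issues.append(f"No scouting entry for {team}")
            continue
        crossed = breakdown.get(f"habLineRobot{slot}") == "CrossedHabLineInSandstorm"
        if scouted[team]["crossed_hab_line"] != crossed:
            issues.append(f"{team}: hab line cross disagrees with TBA")
        if scouted[team]["climb_level"] != breakdown.get(f"endgameRobot{slot}"):
            issues.append(f"{team}: endgame level disagrees with TBA")
    return issues
```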


Fast forward to today:

I was super inspired by the Beach Blitz crew (check out the event, folks!) and saw their scouting hackathon info. I decided I would write up a quick little program to take a full set of data and review its overall integrity to see how good it is. Unfortunately, due to the lack of 2020 events, I decided to review 2019 data, as it is the most recent and the most likely to be complete enough for teams to review. It also has the added benefit of coming from what should be the most up-to-date and sophisticated scouting systems that exist today.


The ask:

There is a great dataset featured in the Beach Blitz competition from team 2791’s semi-finalist performance in the Darwin division as the first pick of alliance 1. Since then I have been collecting full datasets from those that were willing, including multi-time event winners, championship alliance captains, and even Einstein-level teams, along with our own data (4476) from 2019. I’m currently up to 10 full events of data.

I’m looking for full datasets from 2019 that I can run through and evaluate based on this method of determining accuracy. You can choose (and I encourage you) to censor your comment fields. I’ll then publish the average accuracy of all the submissions, the code, and individual numbers for any teams that want to know how accurate their data is.

If you are interested in submitting a dataset for this, I would be HUGELY appreciative; you can either DM me on Chief or post your dataset on this thread. Individual stats from submitted datasets will not be published unless the data is posted publicly or the submitter specifically indicates it is OK to do so. Please don’t try to fix up your data from the state it was in when pick decisions were made from it.


The ask part two:

I’m also extremely curious to see what people think of their own data. You can find a survey here that will cover some of the same data points I am tracking within the datasets. I appreciate you taking the time to read/fill out the survey!


I hope to publish everything in a week or two, and hopefully there will be some interesting discussion coming from that.


Thanks for bringing this up. This is a topic that’s been on and off my mind for the last few years. I’m very interested in your results — especially if there is any association with team size. My last team was fairly small, so we tried to automate as much as we could.

In general, automated data cleaning/validating is not a solved problem. One interesting line of research coming out of MIT right now is to use Bayesian methods to detect and correct outliers on arbitrary datasets with unknown distributions. (Video available here.)

Seems to me that you could do something similar in FRC. Ideally, you would flag anything that is inconsistent with both the API and past performance. For example: you can use the API to validate that the correct number of panels scored by an alliance was recorded by your scouting, but this only validates the total for each alliance. You might also benefit from looking at the past panel performance of each team and validating each team’s individual panel scores to make sure they’re consistent.

Perhaps an easy way would be to develop confidence intervals for the different metrics you’re tracking and flag things that don’t fall into that interval.
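For what it’s worth, a minimal sketch of that interval-based flagging might look like the following. The 2.5 standard deviation threshold and the metric used are my own placeholder choices, not a statistically rigorous recommendation.

```python
import statistics


def flag_outlier(history: list, new_value: float, k: float = 2.5) -> bool:
    """Return True if new_value falls outside mean +/- k*stdev of a team's
    history for one metric (e.g. hatches per match)."""
    if len(history) < 3:  # too few matches to say anything useful yet
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_value != mean
    return abs(new_value - mean) > k * stdev


# Example: a team scouted at 2-4 hatches all event suddenly gets entered as 11.
# flag_outlier([3, 2, 4, 3], 11) -> True, so ask the scout to double-check.
```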


Another Clickbait Scouting Title… :roll_eyes:


I know we aren’t a high-level team by any means (especially not in 2019), but I think our data collection system is solid. Attached is 2708’s data from Durham 2019 (all 75 columns of it).

I uploaded one file with all of the data from the event; however, due to our poor performance we cut back on scouts at the end of day 1, so there are fewer than 6 data points per match from then on. The cutoff file has had that data removed.
If you want the data condensed, let me know and I’ll combine columns to fit your spec.

If you have any questions about collection (like why we have 75 columns) I’ll be happy to never shut up about our scouting app :smile:. (Also if you want all of our data we also have hundreds of rows of timestamps for calculating relative time between things robots did, but we didn’t find them useful in 2019)

Sidenote: we asked scouts to rate the quality of their own data; it would be interesting to see how accurate that was.

full data.xlsx (116.6 KB)
cutoff data.xlsx (88.3 KB)


This sounds pretty fun.


This topic has always been really interesting to me. Keep me posted!

Still looking for data if anyone has some from 2019 kicking around! I will be releasing things shortly after Beach Blitz finishes up, in case someone there has the same idea for some of the data.

And the results are in!

Thank you to all those that submitted datasets. Of course it’s still an extremely small sample, but I think some of the results are rather surprising to look at. Some of the teams/events will stay anonymous because they did not explicitly indicate that they would like their numbers published here; teams that published their data publicly (thank you 2708 and 180!) are listed.

The composition of the submitting teams definitely skews toward the upper level. All of the submitted data comes from teams that have won multiple events within the last four years as an alliance captain or first pick, with around 80% of the submitted data coming from teams that reached division finals or higher in the same time period. This is representative of a well-above-average team, and I expect that lower-level teams would have significantly worse results.


My takeaways:


More data fields = worse data

The team that performed the best in this simple data validation had the fewest things scouted and the least complex scouting system in terms of what was recorded (17 columns of recordable fields), while teams with more data fields consistently performed worse in terms of data accuracy.


Shocking numbers

The level of inconsistency in submitting correct team number and match pairings was shocking to me. This is really important to the data validation process, yet an average of nearly 13% of matches had data for only 5 teams or fewer, which was extremely surprising. Generally I had higher hopes for the accuracy of match numbers, which can be pre-populated in multiple ways.

Overall missing data was about what I expected, but some datasets had extremely large amounts of missing data (anything over 5% is fairly shocking). As to what causes this number to be so large, I’m not sure. Of course no data is a better option than bad data, but based on the calibre of the teams I would have expected much better performance.

The inaccuracy in recording stationary robot locations (pre- and post-match scouting) was also very surprising to me. Scouts had the incorrect starting level marked 8.03% of the time, and that’s not including the Left/Right/Center variants that many teams chose to scout but which were not verifiable using this method.

Climbs were also surprisingly inaccurate, with some teams (including my own) getting them wrong around 15%+ of the time. This especially hit home with flashbacks to the many teams who come to match strategy meetings and very confidently say “this team climbed 50% of the time.” Clearly there is a large error range on this kind of statistic, and despite scouts being able to observe whether robots climbed for a long period after the match completed in 2019, performance on this statistic was still very poor. We can see just how weak it is with some teams being about a level 2 climb off in points PER ALLIANCE, or approximately messing up one level 3 climb per match. This level of consistency isn’t good for teams of this calibre, and I worry what less successful teams’ accuracy may be.

Generally the game piece inaccuracy didn’t surprise me too much. 2019 had fairly low visibility on about half of the field’s scoring locations, so I expect more error in these numbers. Some teams were still exceptionally high, with the highest being 4.11 hatches and 6.48 cargo of error per match, which of course are ludicrously high numbers. That error range basically wipes out the credibility of the data completely; you may as well have a random number generator putting in values from 0-15 in every category.

Even where the absolute error for game pieces wasn’t that extreme, it is probably still too high for the level of confidence that teams generally have in their data. How many times have you made an argument for one team over another because they average 1-2 more game pieces a match? Many of these datasets cannot support that claim with any level of certainty.


Easily fixable mistakes

Many of these errors can be completely eradicated during an event, or during a picklist session, by simply overwriting our data with data we know to be true. TBA provides exact data on all climbs, starting levels, and line crosses for each team, so there really is no reason not to use this data, which is certifiably 100% correct. In some cases you can live-update this over cellular data by connecting to the TBA API periodically with each match submission. We can also (partially) resolve our issues with missing data, incorrect team numbers, and the correct allocation of teams to specific matches easily through TBA integrations.
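As a concrete example of that overwrite step, here is a sketch that pulls every match for an event from the real `/event/{event_key}/matches` TBA endpoint and stamps the official start level and endgame level over the scouted values. The scouting row schema and the 2019 breakdown field names are assumptions on my part, so adapt them to your own system.

```python
import requests


def pull_event_matches(event_key: str, auth_key: str) -> list:
    """Fetch every match object for an event from TBA."""
    resp = requests.get(
        f"https://www.thebluealliance.com/api/v3/event/{event_key}/matches",
        headers={"X-TBA-Auth-Key": auth_key},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


def overwrite_verifiable_fields(scouting_rows: list, matches: list) -> None:
    """scouting_rows: dicts with "match_key", "team", "start_level", "climb_level"
    (hypothetical schema). Overwrites the last two with the official TBA values."""
    official = {}
    for match in matches:
        for color in ("red", "blue"):
            breakdown = (match.get("score_breakdown") or {}).get(color, {})
            for slot, team_key in enumerate(match["alliances"][color]["team_keys"], start=1):
                official[(match["key"], int(team_key[3:]))] = {
                    # 2019 field names from memory; verify for other seasons.
                    "start_level": breakdown.get(f"preMatchLevelRobot{slot}"),
                    "climb_level": breakdown.get(f"endgameRobot{slot}"),
                }
    for row in scouting_rows:
        truth = official.get((row["match_key"], row["team"]))
        if truth and truth["climb_level"] is not None:  # skip unplayed matches
            row.update(truth)
```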


Absolute error in points

Not counting autonomous cross points, we can see what the absolute error is per alliance in a match on average. This number works out to 10.15 points of absolute error per alliance. Looking at TBA insights, we can see the average qualification score for the entire year was 52.38, and the average elimination score was 67.43. That means our absolute error accounts for 19.4% of the average match score in quals and 15.1% in elims, not including auto line bonuses.


Lastly, I ask that every team implement some sort of validation in their scouting process. It’s quite easy to get something going, even for the most inexperienced teams. TBA is a fantastic resource that many more teams should be involving in their data collection process, either during the submission process or at least prior to the picklist meeting. Simple flags can be added even without API or coding knowledge: flag outliers in a team’s dataset, flag team numbers that don’t match an approved list, and mentally check that the data you are entering lines up with what you know about a team.
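For teams starting from a plain spreadsheet export, even something as simple as the sketch below catches a lot of the problems described above (wrong team numbers, missing rows). The column names are placeholders for whatever your sheet actually uses.

```python
def basic_sanity_checks(rows: list, team_list: set, match_count: int) -> list:
    """rows: scouting entries as dicts with "team" and "match" keys (placeholder
    names). team_list: the event's team numbers. Returns a list of problems."""
    problems = []

    # 1. Every team number should be on the event's team list.
    for row in rows:
        if row["team"] not in team_list:
            problems.append(f"Match {row['match']}: team {row['team']} is not at this event")

    # 2. Every qualification match should have exactly six scouting entries.
    entries_per_match = {}
    for row in rows:
        entries_per_match[row["match"]] = entries_per_match.get(row["match"], 0) + 1
    for match_num in range(1, match_count + 1):
        count = entries_per_match.get(match_num, 0)
        if count != 6:
            problems.append(f"Match {match_num}: {count} entries instead of 6")

    return problems
```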

This is not a perfect solution to data validation, and there are many pitfalls, but it’s a simple and easy approach that drastically improves the quality of your data, and it’s a good starting point before moving on to more advanced statistical modeling to better approximate how much error is likely attributable to a specific team.

Also… not enough people filled out the survey to draw any interesting conclusions from it, so I didn’t end up doing much with it, but thanks to those that did!

