My data sucks? What??
You might say:
I can confidently say this team has a 70% climb rate!
Team x averages 1 more cycle a match than team y!
We scout every team in every match every time!
Is this actually true?
tl;dr
DM me or post full 2019 datasets from your events on this thread so I can review overall FRC data accuracy, and fill out this survey about how good you think your data is.
Over the last few years I have become more and more interested in ways to validate my data, so that I can have higher confidence that it is correct. It’s something we went all in on for 2020, and it has upped our game a lot when it comes to the quality of our scouting data. With the access we have to the TBA/FIRST APIs, making sure your data is validated has never been easier. I’m also fairly confident that this step of regularly checking that your data is actually correct isn’t done, even in the most basic ways, by a large percentage of teams. Bad data leads to incorrect picks, and often those incorrect picks go unrecognized in the post mortem after a team gets knocked out of an event.
One of the puzzling things that seems to come up regularly in many of the FRC friend groups I’ve been in is something to the effect of: “Why the heck would team x pick team y? My data says team z is better than team y. What a bad pick!” Was it a bad pick? Was their data bad? Was your data bad?
A few things hit me pretty hard at the start of this project, and I want to highlight some of the work that already exists promoting the use of datasets for analysis to develop new techniques for making scouting better.
Starting with a quote from a relatively recent thread on ELO vs OPR.
I’m curious to see exactly how accurate, trustworthy, and detailed scouting data is, but beyond that I’m interested in how good you think your data actually is. In general, I think teams overestimate how good their data is.
I was also super inspired by 1678’s phenomenal scouting whitepaper (seriously, go read it), specifically Chapter 5, where they analyze their 2020 data from 2020caln against TBA API data.
Prior to that paper coming out, we had already implemented a live version of this for all our scouting in 2020. Before each submission, over a data connection via a tethered phone, we would check in with the API to see if the counts/climbs/autos matched up, and any individual scout or group of scouts would get instant feedback on mismatches in their data. The scouts would then attempt to reconcile the mismatch (“Oh, I wasn’t sure about that one time team x shot; they might not have shot there,” etc.). Any mismatch not resolved would then be recorded in the notes for that data. The scouts also got near-instant feedback on where they could improve at accurately recording data.
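To make that live check concrete, here is a minimal sketch of the kind of comparison involved, assuming a 2020-style scouting record and TBA API v3. The auth key is a placeholder, and the score_breakdown field names are my best understanding of what TBA reports for 2020; verify them against the API docs before relying on them. This is not our production code.

```python
# Minimal sketch of a live per-submission check against TBA (API v3).
# TBA_KEY is a placeholder; the 2020 score_breakdown field names are an
# assumption to be verified against the TBA API documentation.
import requests

TBA_KEY = "YOUR_TBA_AUTH_KEY"
BASE = "https://www.thebluealliance.com/api/v3"

def fetch_match(match_key):
    """Pull one match record from TBA, e.g. '2020caln_qm12'."""
    resp = requests.get(f"{BASE}/match/{match_key}",
                        headers={"X-TBA-Auth-Key": TBA_KEY})
    resp.raise_for_status()
    return resp.json()

def check_alliance_cells(match_key, color, scouted_cells):
    """Compare the summed power cells scouted for one alliance ('red'/'blue')
    against the official totals, so scouts get immediate mismatch feedback."""
    breakdown = fetch_match(match_key)["score_breakdown"][color]
    tba_total = sum(breakdown.get(k, 0) for k in (
        "autoCellsBottom", "autoCellsOuter", "autoCellsInner",
        "teleopCellsBottom", "teleopCellsOuter", "teleopCellsInner"))
    return scouted_cells == tba_total, tba_total
```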
If we look at an excerpt from 1678’s paper, we see essentially what we can do to provide a more robust understanding of our dataset.
The flaw with this method, of course, is that just because the total number of game pieces adds up doesn’t mean the allocation of who scored what is correct, but it does give us a good idea of how accurate our data is. That said, the precision of the API data has been increasing lately, to the point where, for the last several years, we can confidently know which teams crossed the auto line and what each team’s endgame score was. We can use this to our advantage to really drill into how good our data actually is.
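As a sketch of what those per-robot fields make possible for 2019, the check below compares a scouted hab-line crossing and climb level against the per-station entries in a match’s score_breakdown. The habLineRobotN / endgameRobotN names are what I believe TBA reports for 2019; treat them as an assumption and confirm against the API docs.

```python
# Per-robot 2019 checks: hab-line crossings and endgame climb levels are
# reported per driver station, so they can be compared one-to-one.
# Field names (habLineRobotN / endgameRobotN) are assumed from TBA's 2019
# score_breakdown and should be double-checked.
def check_robot_fields(match, color, station, scouted_crossed, scouted_climb):
    """station is 1-3 (the robot's slot within the alliance).
    scouted_climb should use TBA's strings, e.g. 'HabLevel2' or 'None'."""
    breakdown = match["score_breakdown"][color]
    tba_crossed = breakdown[f"habLineRobot{station}"] != "None"
    tba_climb = breakdown[f"endgameRobot{station}"]
    return {
        "hab_line_ok": scouted_crossed == tba_crossed,
        "climb_ok": scouted_climb == tba_climb,
    }
```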
Fast forward to today:
I was super inspired by the Beach Blitz crew (check out the event, folks!) when I saw their scouting hackathon info. I decided I would write up a quick little program to take a full set of data and review its overall integrity to see how good it is. Unfortunately, due to the lack of 2020 events, I decided to review 2019 data, as it is the most recent and the most likely to be complete enough for teams to review. It also has the added benefit of ideally reflecting the most up-to-date and sophisticated scouting systems that exist today.
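The review program is roughly the shape sketched below: pull every qualification match for an event, run a per-season comparison against the submitted scouting records, and roll the mismatches up into a single accuracy number. The compare_breakdown callback and the (match key, alliance) keying of scouting_data are hypothetical stand-ins for whatever format a submitted dataset actually uses.

```python
# Rough shape of the event-level integrity review. compare_breakdown() and
# the (match_key, alliance) keying of scouting_data are hypothetical stand-ins
# for however a submitted dataset is actually structured.
import requests

TBA_KEY = "YOUR_TBA_AUTH_KEY"
BASE = "https://www.thebluealliance.com/api/v3"

def review_event(event_key, scouting_data, compare_breakdown):
    matches = requests.get(f"{BASE}/event/{event_key}/matches",
                           headers={"X-TBA-Auth-Key": TBA_KEY}).json()
    checked = mismatched = 0
    for match in matches:
        if match["comp_level"] != "qm" or not match.get("score_breakdown"):
            continue
        for color in ("red", "blue"):
            scouted = scouting_data.get((match["key"], color))
            if scouted is None:
                continue  # uncovered alliance; worth tracking separately
            checked += 1
            if not compare_breakdown(match, color, scouted):
                mismatched += 1
    return 1 - mismatched / checked if checked else None
```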
The ask:
There is a great dataset from team 2791 featured in the Beach Blitz competition, from 2791’s semifinalist performance in the Darwin division as the first pick of Alliance 1. Since then I have been collecting full datasets from those who were willing, including multi-time event winners, championship alliance captains, and even Einstein-level teams, along with our own 2019 data (4476). I’m currently up to 10 full events of data.
I’m looking for full datasets from 2019 that I can run through and evaluate with this method of determining accuracy. You can choose (and I encourage you) to censor your comment fields. I will then publish the average accuracy across all submissions, the code, and individual accuracy numbers for any teams that want to know theirs.
If you are interested in submitting a dataset, I would be HUGELY appreciative; you can either DM me on Chief or post your dataset on this thread. Individual stats from submitted datasets will not be published unless the data is posted publicly or the submitter specifically indicates that it is OK to do so. Please don’t try to fix up your data from the state it was in when pick decisions were made from it.
The ask part two:
I’m also extremely curious to see what people think of their own data. You can find a survey here that covers some of the same data points I am tracking within the datasets. I appreciate you taking the time to read this and fill out the survey!
I hope to publish everything in a week or two, and hopefully there will be some interesting discussion coming from that.