Predictive Power of Qualitative Ratings

DISCLAIMER: The following analysis and data are based on a sample of one event, with incomplete scouting data (stats reported for a little over 5/9 of a team’s matches on average). Take it with a grain of salt, my true evil motive is to solicit more data to see if it checks out.

DISCLAIMER II: This post also got a little long :/, whoops

Recently (desperately avoiding college apps), I was looking back through our team’s scouting data from 2019mnmi and found something interesting. For that regional, I had decided to try something new and have scouts include on their report a qualitative rating of each robot from 1 to 4.* We didn’t use the data for much at the time, but when I went back 8 months later and used conditional formatting to look at the average ratings, I noticed they mapped pretty well to the points we had scouted for each team. Being a stats nerd (with no stats education), I thought “hmm, that’s pretty funky, I wonder how strong the correlation is?” So I went ahead and graphed the average rating against scouted points and found the correlation coefficient:

The correlation coefficient was 0.8457. As mentioned above, I have no idea what I’m doing with statistics and couldn’t judge how good that was, so I decided to graph OPR against our scouted scoring numbers for comparison:

OPR correlated with real scoring with an R-value of only 0.8117, worse than our qualitative ratings (though not significantly)! That blew my mind! I will note, however, that since our scoring data was incomplete, OPR probably had a higher true correlation, and I don’t know whether the same would hold for more complete qualitative data.
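For anyone who wants to run the same check on their own data, here’s a minimal sketch, assuming a per-team CSV with hypothetical columns avg_rating, scouted_points, and opr (the file and column names are mine, not from the attached spreadsheet):

```python
import pandas as pd

# One row per team; file and column names are hypothetical, not from the attached sheet.
df = pd.read_csv("qualitative_scouting.csv")

# pandas' .corr() computes the Pearson correlation coefficient by default.
r_rating = df["avg_rating"].corr(df["scouted_points"])
r_opr = df["opr"].corr(df["scouted_points"])

print(f"avg rating vs scouted points: r = {r_rating:.4f}")
print(f"OPR vs scouted points:        r = {r_opr:.4f}")
```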

I think the application of this finding is limited in games like 2019 and 2020, where one can deterministically tell how many points a team scores, but it could be very useful in other games (e.g., 2017 or 2018) where contributions are more convoluted.

So… those were my observations, what are yours? I am 100% sure I’m not the first person to come up with the novel idea to have people rate robots, and I beg anyone else who’s collected data to share it. I have no idea if this is a legit thing or not, but I’m excited to find out.

Cheers,
Zach

*an idea primarily stolen from the Citrus Circuits, especially the insight to use an even number to force scouts to sort the robot as either above or below average.

Here are my data:
Qualitative Scouting Predictive Power.xlsx (14.8 KB)


It’s legit.

Qualitative scouting is a completely valid way to track teams. The sample size is well within a scouting team’s ability to rank teams and find ways to use partners and defeat opponents. OPR does none of this beyond being a ranking metric built on other, mostly API-driven, scoring metrics.

I like OPR, but it’s only one more data point to me. I use it and CCWM to look for any outliers the qualitative rankings may have “missed” in our 28-rank radar, which is rare. I have never had to add more than two teams for observation that we missed on our qualitative radar, and usually those one or two get eliminated on Day 2 anyhow. The API-based metrics depend on who a team happened to have as partners, and that can skew data a lot: schedules are not fair in terms of alliance power levels, so every API-based metric is subject to that, since it’s an alliance metric.

Qualitative scouting is never alliance-based; it’s team-based and much more focused, which eliminates that alliance bias. It starts with team history, as in past event wins (class) and performances at earlier events this season (recent races), then a set of game-specific objectives tracked over time (eyes on bots).

The Citrus Circuits’ 1-4 scale shows that even a very simple 1 through 4 can order the set pretty well. Imagine what would happen if that got expanded with several more levels. In the end it’s not about ranking; ask any scout, rank simply does not matter. It’s about which two teams plus yourself make up a strong enough alliance to defeat three other teams that you know very well through qualitative notes.

Were the holistic scores per robot per match, and then averaged? Was there discussion pre-event as to what a 1 is vs a 4?
I realize that gut feelings are pretty important. There have been plenty of times, when discussing possible alliance members, that the team puts a decent scoring robot on the do-not-pick list due to quality of play.

Hey Zach,
It’s cool you are getting into statistical analysis. I play around a bit with statistics and I think it’s pretty fun :). Thanks also for sharing your work, I know that it might be a bit intimidating to post, but you’re not going to learn unless you put yourself out there.

There are two related problems with the correlation analysis you’ve posted. The first is the incomplete scouting data that you mention. I do think OPR’s correlation with your scouted points would be higher if your dataset were more complete; 55% of matches is pretty low. Good job identifying that as a potential problem, though.

The second is that your scouted points and your qualitative assessments of robots are not independently derived, and as such correlations between them can be misleading. This is related to the above point on missing data: if you are missing a match where a team does very well, both your qualitative assessments and your scouted points will underrate that team, while OPR will not. Likewise, say that you had a scout who mis-scored a level 3 climb and said a team missed it when they actually scored it. The scouted points will drop, but that scout’s assessment of the team’s holistic rating might also drop from 3 to 2 for that match.
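To make the shared-error point concrete, here’s a toy simulation (the numbers are made up, not taken from any event) where the scouted points and the qualitative ratings inherit the same per-team errors from the scout reports, while an OPR-like metric has independent errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_teams = 60

true_ability = rng.normal(30, 10, n_teams)     # hypothetical true points per match
shared_error = rng.normal(0, 6, n_teams)       # error baked into the scout reports
independent_error = rng.normal(0, 6, n_teams)  # error in an independently derived metric

scouted_points = true_ability + shared_error                           # tallied from the reports
qualitative = true_ability + shared_error + rng.normal(0, 3, n_teams)  # rated from the same reports
opr_like = true_ability + independent_error                            # derived from official scores

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

print("scouted vs qualitative:", r(scouted_points, qualitative))  # inflated by the shared error
print("scouted vs OPR-like:   ", r(scouted_points, opr_like))
print("qualitative vs truth:  ", r(qualitative, true_ability))
print("OPR-like vs truth:     ", r(opr_like, true_ability))
```

Even though the OPR-like metric tracks true ability about as well as the qualitative ratings here, it will look worse when judged against the scouted points, simply because it doesn’t share their errors.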

Just to show why this doesn’t work so well, let’s look at unpenalized OPR as a reference point to determine if OPR or your qualitative scouting is a superior methodology. Here are graphs of OPR, holistic rating points, and scouted points plotted against unpenalized OPR:
[Three scatter plots vs. unpenalized OPR: OPR, holistic rating points, and scouted points]
If you just look at the correlations, it would seem that OPR is a much superior metric than scouted points or your qualitative ratings. Similar to your analysis though, this isn’t a fair comparison, because OPR and unpenalized OPR are not independently derived.

Basically what I’m saying is that in order to effectively compare two datasets to determine which one is superior (in your case, OPR and your team’s qualitative scouting) you need an unbiased third dataset to reference. If another team from mnmi has scouting data that they would be willing to share that would probably be the best source. Without something like that though, you can’t draw many useful conclusions about which methodology is superior.

Incidentally, I did a very similar comparison at Iowa in 2016 where I took 4 teams’ scouting data sets and compared them against each other and with component OPRs. Basically, I gave each team’s data a similarity score with the median of all 5 datasets. I could publish that if anyone is interested, but I’d want to re-do my analysis as I wasn’t quite as good at analysis then as I am now.


Okay, this got me curious so I went and dug up the old data and formatted it. The event was Iowa 2016 which had 89 total qual matches. I collected datasets from 4 teams at the event and found equivalent metrics that all 4 as well as OPR could track. These were:
auto low
auto high
auto cross
tele low
tele high
challenge
scale

There were only 1 or 2 teams at Iowa that could score auto low, so that field can essentially be ignored due to small sample size. The 4 teams that we collected data from were 3130, 3883, 1156, and 4536. Each scouting team scouted a different number of matches, with the smallest being 59 matches and the largest being all 89.

Here is a link to the summary data showing teams’ correlations with the median datapoints; teams have been anonymized.
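For anyone wanting to try this at their own events, here’s a rough sketch of the consensus comparison, assuming each dataset is a pandas DataFrame indexed by team number with the same per-team average metric columns (auto_cross, tele_high, and so on); the structure and names are my assumptions, not the actual files:

```python
import pandas as pd

def consensus_correlations(datasets):
    """Correlate each dataset's per-team values with the per-team median
    taken across all datasets, for every shared metric column."""
    metrics = next(iter(datasets.values())).columns
    result = {}
    for metric in metrics:
        # Per-team median of this metric across every dataset (the "consensus")
        consensus = pd.concat(
            [d[metric] for d in datasets.values()], axis=1
        ).median(axis=1)
        result[metric] = {name: df[metric].corr(consensus)
                          for name, df in datasets.items()}
    return pd.DataFrame(result)  # rows = datasets, columns = metrics

# e.g. consensus_correlations({"Team A": df_a, "Team B": df_b, ..., "OPR": df_opr})
```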

General thoughts:
It seems that roughly we have, in order of best to worst scouting systems:
Team D > Team A > Team C >> OPR >> Team B

Teams A and D seem to have the best systems across the board, with high correlations and nearly all matches scouted. Team B has both the lowest number of matches scouted and the lowest correlations with the consensus. It should be noted that we only collected data from teams that had formal scouting systems and were organized enough to compile this kind of data and submit it to us after the event.

This leads me to believe that if you don’t have a reasonably well-organized scouting system, you will likely be better off borrowing another team’s scouting data or using component OPRs than you would using poor-quality private data. I think that (very roughly) <20% of teams have a good enough scouting system to beat component OPRs, although the ones that do can do so by a considerable margin. I’ve long dreamed of collecting many many scouting datasets and comparing them against each other like this on a grander scale, but it is quite a difficult logistical challenge. I would encourage teams to do this at their local events though, I know 4536 found the insights useful for our scouting system. It’s just so hard to know if you are “doing it right” with scouting unless you can see if your data aligns well with some kind of consensus like this.


I very much agree with this analysis; in particular, the example of a scout mis-scoring a climb gave me a tangible source of error that was very insightful.

It only just occurred to me why we only have 5 matches per team: as I definitely should’ve remembered, we do our analysis on Friday night and don’t have the bandwidth for live data entry on Saturday besides a component OPR script. I think your sources of error are spot on, but I wonder if they are outweighed by the inaccuracy of OPR before it settles down at the end of quals, whereas the holistic rating converges on a value comparatively quickly. In other words, which has a stronger correlation with even just final OPR instead of points: Friday-night OPR or qualitative ratings? I might do that analysis later today if I have time.
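In case it’s useful, here’s a minimal sketch of how one might compute OPR by least squares so it can be run on Friday-only matches and again on the full schedule; the match-tuple format and function name are my own assumptions, not any particular API’s:

```python
import numpy as np

def compute_opr(matches, teams):
    """matches: list of (red_teams, blue_teams, red_score, blue_score) tuples;
    returns {team: estimated contribution} via least squares."""
    idx = {t: i for i, t in enumerate(teams)}
    rows, scores = [], []
    for red, blue, red_score, blue_score in matches:
        for alliance, score in ((red, red_score), (blue, blue_score)):
            row = np.zeros(len(teams))
            for t in alliance:
                row[idx[t]] = 1.0
            rows.append(row)
            scores.append(score)
    # Solve A x ≈ b, where each row of A marks one alliance and b is its score
    opr, *_ = np.linalg.lstsq(np.array(rows), np.array(scores), rcond=None)
    return dict(zip(teams, opr))

# e.g. compare compute_opr(friday_matches, teams) and compute_opr(all_matches, teams)
# against each other and against the average qualitative ratings.
```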

This scares me a bit, because there are a couple of problems with qualitative observation (which made me skeptical of my result in the first place, despite the attempt to quantitatively tie it down). People are pretty bad at keeping track of data, but REALLY bad at providing unbiased opinions. For a team that’s perceived as good, I would fear even experienced scouts would gloss over terrible matches, so long as there was at least one good performance to tie their expectations to, and scouts may similarly underrate a team that died in its first match but absolutely demolished thereafter.

I think the greatest problem with qualitative analysis, however, lies in communication. As I wrote the OP I recalled an anecdote from Superforecasting by Philip Tetlock (a great book for prediction geeks). Before the Bay of Pigs invasion of Cuba, JFK asked his generals what the chances of success were. His generals said there was a “good” likelihood of establishing a beachhead. When the invasion was an absolute failure, the generals weren’t all that surprised; they thought there was a “good” (~30%) chance of success for such a hard operation. Kennedy, on the other hand, was shocked because he thought there was a “good” (>75%) chance of victory. Similarly, what is “good” robot performance? Is it enough to win a match? Is it good for a rookie? It’s hard to be consistent.

Yep, we just took an average across matches played. We didn’t really train scouts on the qualitative rating, so I also assume there was variation in whether people tried to assign scores along a bell curve, an even distribution, etc. It seems like it did average out well though, given that we rotate scouts and randomly assign robots.


I’m curious, I know your sample was limited, but over time are there any aspects of scouting that stand out to you about how consistent scouting systems are developed?

If you do something like what I or TBA do with ixOPRs/predicted contributions, I’m reasonably confident that would correlate better with final OPR at the end of Friday than your qualitative ratings would, since it converges to final OPR more quickly than normal OPR does. Not sure about regular OPR.

Brian and Katie cover this question far better than I could, here and here. I personally don’t have a ton of experience with in-event scouting, although I did my best to help design the systems before competitions from 2011-2017. The general advice I’d give to teams is to do everything you can to get a 9+ person scouting team that actually cares about scouting, so you can have rotations and the scouts don’t get burnt out. Burnt-out scouts create poor data. Create scouting alliances with other teams if you can’t hit that number on your own, or just ask a big team nicely to use their data.

I’d also just stick to simple metrics until you both have enough scouts and you feel like there are important insights you are missing. This year for example, knowing robot shooting locations would certainly be helpful, but I’d much rather leave this out of my scouting system if my scouts were few or new just to make sure they are counting shots as well as possible.


It does not matter until the selection phase; in-game scouting tracks by team number and is not skewed by what is perceived as “good,” but rather by what “is” on the field in that match for each watched team.

This may be minor, but I’d clarify this as “most accurate data collection” as opposed to “best scouting systems.” The goal of scouting isn’t to collect accurate data but to get an accurate idea of how good a team is and predict how well they will do in future matches. I’d argue that the best scouting system would NOT be perfectly, or even close to perfectly, correlated with accurate scoring data.


In this system, is it possible for all three teams in an alliance to score a “1” or a “4”? If that’s the case, is the “1” or the “4” rated against the other teams in the match or across the entire field? Is it relative to the team’s previous performances at all? For example, if 254 is getting “4” in all of its matches and then has a letdown match, would it get a “3” or a “1”?
