# The Case for the Median

As Karthik has said in one of his lectures, “a lot of people are just like: ‘what’s the average points per match? what’s the average?’” In Karthik’s case, he was advocating for people to pay attention to standard deviations: for people to care about the spread of a team’s performance as well as measures of central tendency like the mean. I’d like to approach what Karthik observed from a different angle: teams are always saying “what’s the average?”, and I’d like to encourage them to ask: “what’s the median?”*

Let’s be real: it’s extremely unlikely for any official competition to have more than 12 qualification matches per team. The Waterloo regional had 13 qual matches per team, but I can’t think of any other official event with that many. So, when we go to make statistical decisions about teams for match strategy and for picklisting, teams ought to remember that they are working with a maximum of n = 12, and often far fewer matches. Usually, important picklisting decisions are made with data from around 8 or 9 matches, even at district events. I haven’t competed in the regional system yet, but I expect that regional teams picklisting the night before alliance selection are working with even less data.

That sample size matters when it comes to outliers. The median is resistant to outliers, while the mean isn’t. If you rely on the mean alone, you are choosing to be influenced strongly by outlier matches. In some situations that can be valuable, but for most alliance captains, and when planning strategy for most qualification matches, teams care more about a team’s likely outcome than about its tail results. It’s not like the median is difficult to calculate, either. More teams than ever use sophisticated electronic scouting systems, and advanced statistics like OPR are commonplace among teams making any sort of strategic assessment. If you’re doing the linear algebra to calculate OPR, how hard is it to type =MEDIAN(range) into Excel?
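To make that concrete, here’s a tiny sketch (Python, with entirely made-up per-match scores) of how a single dead match drags the mean while barely moving the median:

```python
# Hypothetical per-match scores: one dead match (a 0) among otherwise steady outings.
from statistics import mean, median

scores = [42, 45, 40, 44, 0, 43, 41, 46]

print(mean(scores))    # 37.625 -- pulled well below the team's typical output by the single 0
print(median(scores))  # 42.5   -- still close to what the team usually scores
```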

Of course, the mean has its place. It’s a good measure of central tendency, and it has the advantage of combining all the available data points into one summary statistic rather than selecting one representative (or splitting the difference between two). The mean is the basis of most methods of statistical inference, so if teams want to use z-scores, confidence intervals, or p-values in their scouting and strategy decisions, the mean is critical. But how many FRC teams use statistical inference? I remember seeing z-scores in 1678’s scouting documentation from 2016, and I know that 1712 used gear confidence intervals and tried to approximate the probability of a 4-rotor or 40 kPa match in 2017, but realistically most teams aren’t building statistical inference into their scouting decisions.

Spending half a minute to add the median to your analysis toolbox is going to be more valuable than spending half a week to add OPR to your analysis toolbox. That’s not to say that OPR isn’t a helpful statistic (which is an entirely different conversation), but rather that the median is heavily undervalued for the minimal time and effort it takes to use effectively. Teams should care about the median, not just the mean.

* Yes, the median is technically an average. But it’s pretty obvious that people asking “what’s the average” are asking about the arithmetic mean.

More topics like this on the scouting forum please!

I’ve tended to look to mode and range more often than mean for evaluation. Mean has also given me mixed results for match prediction. I’ll keep median in mind in the future.

The issue with mode is that it can be quite sensitive to the size of the bins, especially for small data sets. For example, if a team’s scores (sorted by score for clarity) after ten matches were:
0 0 131 134 136 139 147 160 161 163
Binning by 1 would give 0 as the mode, by 5 would give 160-164, by 10 would give 130-139. It’s hard to attach much significance to a measurement which is almost as much a function of the details of the measuring as of that which is being measured. Bimodal/multimodal distributions are also not uncommon with such small data sets, even if the underlying phenomenon is Gaussian.
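A quick Python sketch of that bin sensitivity, using the same ten scores (the binning helper is just for illustration):

```python
# Mode of the same ten scores under three different bin widths.
from collections import Counter

scores = [0, 0, 131, 134, 136, 139, 147, 160, 161, 163]

def binned_mode(data, width):
    """Most common bin, reported by its lower edge, for a given bin width."""
    counts = Counter((x // width) * width for x in data)
    return counts.most_common(1)[0]

print(binned_mode(scores, 1))   # (0, 2)   -> mode is 0
print(binned_mode(scores, 5))   # (160, 3) -> mode is the 160-164 bin
print(binned_mode(scores, 10))  # (130, 4) -> mode is the 130-139 bin
```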

I strongly agree. What value the mode has comes from its intuitiveness.

Captain_Kirch also brought up the range as an estimator of spread: it’s interesting to look at robust alternatives to standard deviation for measuring spread. The range isn’t robust either: by definition it depends entirely on the maximum and minimum. Personally, I’d advocate for the IQR (= Q3 − Q1), since it’s resistant to outliers and provides more specificity than something like the MAD (median absolute deviation). On the other hand, if you’re trying to determine how consistent a team is, you might want to be strongly influenced by outliers.
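For anyone who wants to play with these, here’s a minimal sketch (Python/numpy, made-up gear counts) of the spread measures being discussed; note that tools differ slightly in their quartile conventions, so IQR values may not match a spreadsheet exactly:

```python
# Range, IQR, MAD, and sample standard deviation on hypothetical per-match gear counts.
import numpy as np

gears = np.array([2, 3, 3, 3, 4, 3, 0, 3, 4, 3, 3, 5])

rng = gears.max() - gears.min()                    # range: set entirely by the two extremes
q1, q3 = np.percentile(gears, [25, 75])            # quartiles (numpy's default interpolation)
iqr = q3 - q1                                      # interquartile range
mad = np.median(np.abs(gears - np.median(gears)))  # median absolute deviation
sd = gears.std(ddof=1)                             # sample standard deviation, for comparison

print(rng, iqr, mad, round(sd, 3))
```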

So, I went into my Mid-Atlantic Champs data and did some comparisons. I’ve pulled out, with respect to the number of gears scored, the most consistent teams by IQR and Standard Deviation.

Here are the 5 most consistent teams in the dataset by IQR, along with their datasets.

56: 3, 4, 3, 2, 3, 3, 5, 3, 0, 3, 3, 2 IQR = 0
3974: 6, 4, 3, 4, 5, 4, 4, 4, 6, 4, 4, 3 IQR = 0.25
5407: 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 2, 3 IQR = 0.25
4285: 3, 2, 2, 3, 3, 3, 4, 3, 3, 2, 3, 4 IQR = 0.25
1257: 3, 4, 0, 3, 4, 4, 4, 4, 4, 4, 3, 0, 4 IQR = 0.5

Here are the 5 most consistent teams in the dataset by Standard Deviation, along with the dataset and the calculated (sample) standard deviation.

5407: 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 2, 3 StDev: 0.651
2600: 4, 3, 4, 3, 4, 4, 4, 4, 2, 3, 3, 3 StDev: 0.669
3929: 4, 4, 3, 4, 4, 4, 4, 3, 4, 4, 3, 2 StDev: 0.669
4285: 3, 2, 2, 3, 3, 3, 4, 3, 3, 2, 3, 4 StDev: 0.669
303: 4, 3, 3, 3, 3, 2, 2, 3, 4, 4, 3, 2 StDev: 0.739

Feel free to make of this what you wish. To me, it looks like the story of outliers: IQR is resistant, StDev isn’t. I’d personally advocate for people to look at both when making a decision and to understand what each one means. IQR definitely has value: a team like 3974 or 1257, with one or two outlier matches but otherwise absurdly consistent performances, will be overlooked by StDev, but IQR will pull it out. Similarly, 3929’s performance is definitely desirable*, but because all the deviations are on one side of the mean, their IQR comes out a bit higher. Teams like 5407 and 4285, with small deviations on both sides of the center, are the types that score well by both metrics.

* I mean, they won MAR Champs.

As a statistician with training in game theory, I am going to plead with everyone not to place a lot of reliance on any of the numbers. I believe GKrotkov is completely correct that with the sample sizes we have in a typical FRC competition, the median is generally going to be a better predictor of any single team’s performance than the mean. But in general, sample sizes aren’t big enough for any of the typical measures to be good choices.

We try to record all of the hard numerical data we can in scouting, but often our most useful information comes from qualitative observations. My gut feeling is that the overall effectiveness of scouting has not improved at all in the past half decade or so. In particular, I see teams relying on overall numbers and not paying enough attention to what is actually happening in matches. Sometimes a team may be great but can be shut down by a particular strategy. Sometimes a team had a really rough first day of competition, but their last three or four matches were great. (We never make any final decisions on Friday.)

Big +1 there!

Yes! Understanding WHY something happens may be more important than WHETHER it happened. You may be able to mitigate your partner’s weakness or exploit your opponents’, but only if you understand what that weakness is. Leave space on your scouting forms for free form comments, and read them!

Very good advice.

Instead of just looking at the median/mode/mean, etc., I find myself trying to look at the trend of the data over the course of the event. For example, look at climb data (an easy example because it’s a boolean; N = not climbed, Y = climbed):

Team 1:
NNNYNYYNYY

Team 2:
YYYNYYNNNN

Team 3:
YNNYNYYNNY

All three teams have the same climb percentage (50%), but Team 1 is the best pick. I have yet to determine a good way to produce a quantitative metric for this type of data analysis.

(Sorry if this doesn’t make sense or derails thread too much off topic)

Whenever attempting to make use of trend data, it’s often helpful to actually understand the root cause of the trend (whether by observation or conversation with the team).

This idea is one of the core beliefs of my scouting philosophy.

For example, on Saturday morning at Tech Valley this year, my head scout and I were going through teams and noticed that while 5952 could reliably score three gears a match, their climb rate was ~50%. We observed that missed climbs clearly increased in frequency over time, so we talked to them. We learned that the problem was their Velcro roller not sticking to their rope strongly enough. There were no other problems with the climber.

Ultimately, they were unpicked by the time our alliance’s second chance to pick came along (likely due to their ~50% climb rate), and we begged our captain to pick them. After lunch, their climber was good to go and they played a key role in helping us score the fourth rotor and walk away from the event with our first ever blue banner.

This is very good advice. As Brian pointed out, if you know the reason for something, it can completely change your analysis. I was chatting with my co-coach the other day, and we think that the emphasis on people collecting and sharing data (which is a good thing; I am not trying to get teams not to collaborate and share) has led to a decrease in the number of people actually watching what is going on and trying to figure out the whys. We also think that teams tend to do a really bad job of putting thought into which teams will best help them implement which strategies.

There are different levels of understanding and interpreting performance. At the worst, you have seed points/ranking, then stats like CCWM and OPR, and some of the other top-down metrics like Elo, TrueSkill, etc. Next comes scouting the individual matches, and finally talking to the teams to figure out why they’re performing the way they are. Each level takes a bit more effort to understand than the last, but there’s a fundamental limit teams have: people. If teams don’t have scouting information, the best they can do is use the top-down metrics. Teams generally do this when they have very small teams (that is, they can’t get enough people to sit in the stands and watch all robots in all matches). By having several small teams that can each contribute just a couple of people, they can get to the next level of understanding. But when you’re on a small team and your lead strategist is also the person in the pit who needs to change out wheels, grease gearboxes, and coach the match, sometimes you miss out on opportunities to find out why a team’s climb has slowed down a half second over the past three matches.

Now that said, this is for small, resource-limited teams using collaborative scouting information. Large teams have fewer excuses (but I would accept some).

I am not criticizing teams that don’t have enough people to scout. Or really even the ones that choose not to scout. But I do think that there is a ton of scouting that is going on that is not particularly useful, or maybe a better way to say it is that it is not as useful as it could be. Writing down notes after each match, and reading those notes, is important. We try to have a page (or more than one page) for each team. All of the match data gets collected, along with the observations of the scouts. After each match the lead scout(s) initiate a quick discussion to see if anyone saw anything noteworthy in the match. Those observations often show useful information.

I remember once (in 2013, when I was a referee) a team telling me that they would never pick a mecanum drive team as an ally. They passed up on picking the eventual event champion, who clearly had the best driven robot at the event, because of this. Even if you had a bias against mecanum drive, if a bunch of scouts had all written down comments like “Wow, that driver is awesome!” or “Very good driver” you might well be convinced to make that pick. The same team opted to play defense against the team they did not pick because “They are mecanum and we will be able to shove them all over the place.” That didn’t work, and observation of the matches would have told them it wouldn’t work. But the mecanum team had been unlucky with alliance partners and seeded only 11th.

This post got me thinking about OPR and if we could do some kind of median adjustment to OPR to improve its performance. Since OPR is a minimization of least squares, OPRs can be sensitive to outlier values, in a similar way to means. I know Ether played around with an L1 alternative to OPR* which alleviates some of these concerns, but unfortunately L1 solutions are not unique and are also computationally intensive.

I’m imagining a procedure which would be something like this:

1. Find conventional L2 OPRs for all teams
2. For each team, find the median residual error for their matches and add that error to their normal OPR.
3. Possibly do something along these lines iteratively a few times?

For example, say a team has an OPR of 10 and the following residuals (actual score − sum of alliance OPRs) for their six matches: -4, -3, -1, 0, 1, 5. Since their median residual is -0.5, their median-adjusted OPR would be 9.5. Their new residuals would then be -3.5, -2.5, -0.5, 0.5, 1.5, 5.5, assuming that their partners’ OPRs are unchanged.
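For what it’s worth, here is a rough sketch of that procedure (Python/numpy). The teams, alliances, and scores are invented placeholders, and it simplifies real OPR in several ways (one row per alliance, no penalty handling, etc.):

```python
# Step 1: conventional L2 OPR via least squares; Step 2: median-residual adjustment per team.
import numpy as np

teams = ["A", "B", "C", "D", "E", "F"]  # hypothetical team labels
alliances = [                           # (alliance, that alliance's score) -- made-up data
    (("A", "B", "C"), 95), (("D", "E", "F"), 110),
    (("A", "D", "E"), 105), (("B", "C", "F"), 100),
    (("A", "C", "F"), 120), (("B", "D", "E"), 90),
    (("A", "B", "F"), 98), (("C", "D", "E"), 104),
]

idx = {t: i for i, t in enumerate(teams)}
A = np.zeros((len(alliances), len(teams)))  # alliance membership matrix
b = np.zeros(len(alliances))                # alliance scores
for row, (alliance, score) in enumerate(alliances):
    for t in alliance:
        A[row, idx[t]] = 1.0
    b[row] = score

# Step 1: conventional (L2) OPR is the least-squares solution of A @ opr ~= b.
opr, *_ = np.linalg.lstsq(A, b, rcond=None)

# Step 2: add each team's median residual (actual - predicted alliance score) to its OPR.
residuals = b - A @ opr
adjusted = opr.copy()
for t, i in idx.items():
    adjusted[i] = opr[i] + np.median(residuals[A[:, i] == 1.0])

for t, i in idx.items():
    print(f"{t}: OPR {opr[i]:.1f} -> median-adjusted {adjusted[i]:.1f}")
```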

I might play around with something like this to see if it can be used to improve the predictive power of OPR.

*I would put a link here but I can’t find the post or whitepaper

You could track trend data using a simple IIR (infinite impulse response) filter. Letting X be your data stream (in this case just 0s and 1s), A be the running average, and D a running total of the weights:

D_0 = 1
A_0 = X_0

D_(i+1) = k * D_i + 1
A_(i+1) = (k * D_i * A_i + X_(i+1)) / D_(i+1)

The weight of the older data decays exponentially; a small k is a fast decay; large k (though less than 1!) is slower. With a k of 0.9, after twelve matches, the first match is weighted about 31% as much as the most recent.
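A minimal sketch of that recursion in Python, applied to the three hypothetical climb strings from earlier in the thread (k = 0.9, as in the example above):

```python
# Recency-weighted climb rate: older matches decay by a factor of k per match.
def iir_average(xs, k=0.9):
    d = 1.0           # D_0: running total of the weights
    a = float(xs[0])  # A_0 = X_0
    for x in xs[1:]:
        d_next = k * d + 1.0          # D_(i+1) = k*D_i + 1
        a = (k * d * a + x) / d_next  # A_(i+1) = (k*D_i*A_i + X_(i+1)) / D_(i+1)
        d = d_next
    return a

def climbs(s):
    return [1 if c == "Y" else 0 for c in s]

print(iir_average(climbs("NNNYNYYNYY")))  # Team 1: trending up, comes out above 0.5
print(iir_average(climbs("YYYNYYNNNN")))  # Team 2: trending down, comes out below 0.5
print(iir_average(climbs("YNNYNYYNNY")))  # Team 3: no clear trend, stays near 0.5
```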

I spent a while trying to use this method to predict results, and I was unable to create anything which had an appreciable improvement over conventional OPR.