"standard error" of OPR values

I’m thinking of the parameter standard errors, i.e., the error estimate around the OPR parameter itself for each team. That can be computed from the matrix; it’s a primary output of any statistical software package.

The second quote above is from a dialog I’ve been having with Citrus Dad, and I have his permission to post it here.

I’d like to hear what others have to say.

Do you think the concept of standard error applies to the individual computed values of OPR, given the way OPR is computed and the data from which it is computed?

Why or why not?

If yes: explain how you would propose to compute the standard error for each OPR value, what assumptions would need to be made about the model and the data in order for said computed standard error values to be meaningful, and how the standard error values should be interpreted.

Just to check that I understand this correctly: standard error is basically the standard deviation from the “correct” value, and you’re asking if OPR values have this distribution from the “correct” value (i.e. the given OPR)?

Also, is OPR calculated by taking t1+t2+t3 = redScore1, t4+t5+t6 = blueScore1, etc. and then solving that series of linear equations?

I would guess it would depend on what you mean by OPR. I always assumed, perhaps incorrectly, that OPR was the solution of the above calculations, and thus it is just a number, neither correct nor incorrect.
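For what it’s worth, that series of equations is normally solved as a least-squares problem, since there are more match equations than teams. Here is a toy sketch (team numbers, scores, and two-team alliances are all made up to keep it small; real FRC alliances have three teams):

```python
import numpy as np

# Hypothetical toy data: 4 teams, 3 matches, 2 teams per alliance.
teams = [101, 102, 103, 104]
# Each entry: (alliance member list, alliance score); one row per alliance.
alliances = [
    ([101, 102], 60), ([103, 104], 50),   # match 1, red and blue
    ([101, 103], 55), ([102, 104], 52),   # match 2
    ([101, 104], 58), ([102, 103], 54),   # match 3
]

idx = {t: i for i, t in enumerate(teams)}
A = np.zeros((len(alliances), len(teams)))  # 1 where a team played on that alliance
b = np.zeros(len(alliances))                # alliance scores
for row, (members, score) in enumerate(alliances):
    for t in members:
        A[row, idx[t]] = 1.0
    b[row] = score

# OPR is the least-squares solution of the over-determined system A x = b.
opr, *_ = np.linalg.lstsq(A, b, rcond=None)
for t in teams:
    print(t, round(opr[idx[t]], 2))
```

The least-squares solution minimizes the total squared difference between predicted and actual alliance scores, so it is “just a number” in the sense above: the best linear fit, neither correct nor incorrect.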

If OPR is meant to indicate actual scoring ability, this would change. However, I’m not sure how to figure out how many points a team contributes; if one team stacked 6 totes and another capped, does the first get 12 and the second 24, or do they each get 36, or some other combination?

I suppose one way to do it would be to take the difference between a team’s OPR and 1/3 of their alliance’s points from each match they played in, and see the change in that difference. Comparing that between teams will be tricky, since the very top/bottom teams will have a greater difference than an average one. Similarly, looking at a team’s OPR after X matches and 1/3 of the match X+1 score would be interesting but would also have that problem.

(Or I could just be very confused about what OPR and standard error really are–I’ve tried to piece together what I’ve read here and on the internet but haven’t formally taken linear algebra or statistics.)

I’m not sure if there is a good clean method which produces some sort of statistical standard deviation or the such, although I would be happy to be proven wrong.
However, I believe that the following method should give a useful result:

If you start out with the standard OPR calculation, the matrix equation A * x = b, where x is an n x 1 vector containing all the OPRs, A is the matrix describing which teams played together in each match, and b contains the corresponding match scores, then in order to compute a useful error value we would do the following:

  1. Calculate the expected score from each match (using OPR), storing the result in a matrix exp, which is m x 1. Also, store all the actual scores in another m x 1 matrix, act.
  2. Calculate the square of the error for each match, in the matrix err = (act - exp)^2 (using the squared notation to refer to squaring individual elements). You could also try taking the absolute value of each element, which would result in a similar distinction as that between the L1 and L2 norm.
  3. Sum up the squared err for each match into the matrix errsum, which will replace the b from the original OPR calculation.
  4. Solve for y in A * y = errsum (obviously, this is over-determined, just like the original OPR calculation). To get things into the right units, take the square root of every element of y; that gives something like each team’s typical deviation.

This should give each team’s typical contribution to the variation in their match scores.
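The steps above can be sketched as follows, using a made-up schedule matrix A and score vector b (a real event would build these from the match schedule):

```python
import numpy as np

# Toy stand-ins for the schedule matrix A (one row per alliance per match,
# 1 where a team played) and the score vector b.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
b = np.array([60, 50, 55, 52, 58, 54], dtype=float)

x, *_ = np.linalg.lstsq(A, b, rcond=None)       # the standard OPR fit

exp = A @ x                                     # step 1: expected score per match
act = b                                         # step 1: actual score per match
err = (act - exp) ** 2                          # step 2: element-wise squared error
# step 3: err takes the place of b in the original calculation (errsum)
# step 4: solve A y = errsum, then take element-wise square roots
y, *_ = np.linalg.lstsq(A, err, rcond=None)
per_team_spread = np.sqrt(np.clip(y, 0, None))  # clip: the fit can dip slightly negative
print(np.round(per_team_spread, 2))
```

Note the clip: nothing forces the least-squares solution for y to stay non-negative, which is one hint that the output is a heuristic spread measure rather than a true standard deviation.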

added-in note:
I’m not sure what statistical meaning the values generated by this method would have, but I do believe that they would have some useful meaning, unlike the values generated by just directly computing the total least-squares error of the original calculation (i.e., (A*x - b)^2). If no one else does, I may implement this method just to see how it performs.

Calculation of the standard error in OPR for each team sounds straightforward - the RMS of the residuals between the linear model and the match data for the matches in which a team participated. However, this number would probably not cast much if any light on the source of the scatter. One obvious source of scatter is the actual match-to-match performance variation of each team: a team that puts up two stacks per match, but in *that* match set the stack on some litter and knocked over the first. Another is non-linearity in the combined scoring (e.g. two good teams that perform very well with mediocre partners but run out of game pieces when allied, or a tote specialist allied with an RC specialist who do much better together than separately).

There are two types of error:

The first is the prediction residual which measures how well the OPR model is predicting match outcomes. In games where there is a lot of match-to-match variation, the prediction residual will be high no matter how many matches each team plays.

The second is the error in measuring the actual, underlying OPR value (if you buy into the linear model). If teams actually had an underlying OPR value, then as teams play 10, 100, 1000 matches the error in computing this value will go to zero.

So, the question is, what exactly are you trying to measure? If you want confidence in the underlying OPR values or perhaps the rankings produced by the OPR values, then the second error is the one you want to figure out and the prediction residual won’t really answer that. If you want to know how well the OPR model will predict match outcomes, then the first error is the one you care about.

Unfortunately, this is tenuous - there’s no real reason to believe that each team contributes linearly to score by some flat amount per match, and that variance beyond that is a random variable whose distribution does not change match-to-match.

If one were to assume that this is actually the case, though, then one would just take the error from the first part and divide it by sqrt(n) to find the error in the estimation of the mean.
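Under that (strong) assumption, the arithmetic is just the usual standard-error-of-the-mean formula. A sketch with hypothetical residuals (real values would come from the OPR fit for one team’s n matches):

```python
import math

# Hypothetical per-match residuals (actual minus predicted) for one team.
residuals = [4.0, -3.0, 5.0, -2.0, 1.0, -5.0]
n = len(residuals)

rms = math.sqrt(sum(r * r for r in residuals) / n)  # match-to-match scatter
se_of_mean = rms / math.sqrt(n)                     # error of the estimated mean
print(round(rms, 2), round(se_of_mean, 2))          # → 3.65 1.49
```

With only ~10 matches per team at a typical event, dividing by sqrt(n) doesn’t shrink the error much, which is the crux of the precision concern raised later in the thread.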

I agree with most of what people have said so far. I would like to add my observations and opinions on this topic.

First of all, it is important to understand how OPR is calculated and what it means from a mathematical standpoint. Next it is important to understand all the reasons why OPR does not perfectly reflect what a team actually scores in a match.

To put things in perspective, I would like to categorize all the reasons into two bins: things that are beyond a team’s control, and things that reflect the actual “performance” of the team. I consider anything that is beyond a team’s control as noise. This is something that will always be there. Some examples, as others have also pointed out, are bad calls by refs, compatibility with partners’ robots, non-linearity of scoring, accidents that are not due to carelessness, field faults not being recognized, robot failures that are not repeatable, etc.

The second bin is things that truly reflect the “performance” of a team. This measures what a team can potentially contribute to a match, and it takes into account how consistent a team is. The variation here includes factors like how careful they are in not knocking stacks down, getting fouls, or the robot not functioning due to avoidable wiring problems. The problem is that this measure is meaningful only if no teams are allowed to modify their robots between matches, meaning the robot is in the exact same condition in every match. However, in reality there are three scenarios. 1) The robot keeps getting better as the team works out the kinks or tunes it better. 2) The robot keeps getting worse as things wear down quickly due to an inappropriate choice of motors, bearings (or the lack thereof), or the design and construction techniques used. Performance can also get worse as some teams keep tinkering with their robot or programming without fully validating the changes. 3) The robot stays the same.

I understand what some people are trying to do. We want a measure of expected variability around each team’s OPR numbers, some kind of a confidence band. If we have that information, then there will be a max and min prediction of the outcome of the score of each alliance. Mathematically, this can be done relatively easily. However the engineer in me tells me that it is a waste of time. Based on the noise factors I listed above and that the robot performance may change over time, this becomes just a mathematical exercise and does not have much contribution to the prediction of outcome of the next match.

However I do support the publication of the R^2 coefficient of determination. It will give an overall number as to how well the actual outcome fits the statistical model.
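Computing R^2 for the OPR fit is a one-liner once you have actual and predicted match scores. A sketch with hypothetical numbers (real ones would come from the event data and the fitted model):

```python
import numpy as np

# Hypothetical actual and OPR-predicted match scores.
act = np.array([60, 50, 55, 52, 58, 54], dtype=float)
pred = np.array([58, 51, 56, 50, 59, 53], dtype=float)

ss_res = np.sum((act - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((act - act.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                    # coefficient of determination
print(round(r2, 3))                         # → 0.826
```

A single event-wide R^2 like this sidesteps the per-team interpretation problems: it just reports what fraction of the match-score variance the linear model explains.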

This year may be an anomaly, but it seems to me like, for some teams anyway, this is a reasonable model. Teams have built robots that are very predictable and task-oriented. For example: grab a bin, drive to the feeder station, stack, stack, stack, push, stack, stack, stack, push, etc. Knowing how quickly our human player and stack mechanism are, we can predict with reasonable accuracy how many points we can typically score in a match, with the only real variance coming from when things go wrong.

I have to strongly agree with what Ed had to say above. Errors in OPR happen when its assumptions go unmet: partner or opponent interaction, team inconsistency (including improvement), etc. If one of these factors caused significantly more variation than the others, then the standard error might be a reasonable estimate of that factor. However, I don’t believe that is the case.

Another option would be to take this measure in the same way that we take OPR. We know that OPR is not a perfect depiction of a team’s robot quality or even a team’s contribution to its alliance, but we use OPR anyway. In the same way, we know the standard error is an imperfect depiction of a team’s variation in contribution, but we could use it anyway.

People constantly use the same example in discussing consistency in FRC. A low-seeded captain, when considering two similarly contributing teams, is generally better off selecting an inconsistent team over a consistent one. Standard error could be a reasonable measure of this inconsistency (whether due to simple variation or improvement). At a scouting meeting, higher standard error could indicate “teams to watch” (for improvement).

But without having tried it, I suspect a team’s standard error will ultimately be mostly unintelligible noise.

Has anyone ever attempted a validation study to compare “actual contribution” (based on scouting data or a review of match video) to OPR values? It seems like this would be fairly easy and accurate for Recycle Rush (and very difficult for Aerial Assist). I did that with our performance at one district event and found the result to be very close (OPR=71 vs “Actual”= 74).

In some ways, OPR is probably more relevant than “actual contribution”. For example, a good strategist in Aerial Assist could extract productivity from teams that might otherwise just drive around aimlessly. This sort of contribution would show up in OPR, but a scout wouldn’t attribute it to them as an “actual contribution”.

It would be interesting to see if OPR error was the same (magnitude and direction) for low, medium, and high OPR teams, etc.

The first error term is generally reported and Ether produced a measure of those residuals.

It’s the second error term that I haven’t seen reported. And in my experience working with econometric models, having only 10 observations likely leads to a very large standard error around this parameter estimate. I don’t think that calculating this will change the OPR per se, but it will provide a useful measure of the (im)precision of the estimates that I don’t think most students and mentors are aware of.
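For the linear model, that second error term has a textbook form: the OLS standard errors sqrt(sigma^2 * diag((A^T A)^-1)), which assume the model is right and the match-to-match errors are independent with constant variance. A sketch with a made-up schedule (real A and b would come from the event):

```python
import numpy as np

# Toy schedule matrix A (one row per alliance per match) and score vector b.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
b = np.array([60, 50, 55, 52, 58, 54], dtype=float)

m, n = A.shape
x, *_ = np.linalg.lstsq(A, b, rcond=None)   # the OPR estimates

resid = b - A @ x
sigma2 = resid @ resid / (m - n)            # unbiased residual variance estimate
cov = sigma2 * np.linalg.inv(A.T @ A)       # OLS covariance of the estimates
se = np.sqrt(np.diag(cov))                  # standard error of each team's OPR
print(np.round(se, 2))
```

With m only a few times larger than n (as at a real event), m - n degrees of freedom is small and these standard errors come out large, which is exactly the imprecision point being made here.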

Also, as you imply, a linear model may not be the most appropriate structure even though it is by far the easiest to compute with Excel. For example, the cap on resource availability probably creates a log-linear relationship.

I have done this in the past (2010, 2011, 2012, 2013). 2013 was pretty close. 2010 was pretty close at a given event though interesting strategies could result in interesting scores. 2011 was useful at District level of competition, but not very useful at MSC or Worlds. 2012, semi useful if using some sort of co-op balance partial contribution factor.

Someone did a study for Archimedes this year. I would say it is similar to 2011 where 3 really impressive scorers would put up a really great score, but if you expected 3X, you would instead get more like 2.25 to 2.5…

This is a different question than whether the OPR accurately measures true contribution. (Another benefit of that exercise however is to determine whether the OPR estimate has a bias, e.g., related to relative scoring). There will always be error terms around the OPR parameter, so the question to be answered is what are the statistical properties of those error terms.

I produced predicted scores for Newton using the OPR components to eliminate potential double counting of auto and adjust for coop points. I predicted 118 would average 200 and they averaged 198. I have to check the distribution of OPR vs actual points.

The statistician in me led me to ask this question. One aspect is that I believe publishing standard errors for parameter estimates provides greater transparency. Plus it is very educational. I suspect that most students looking at OPRs don’t understand that they are actually statistical estimates with large error bands around the parameter estimates. Providing that education is directly in line with our STEM mission. Too many engineers don’t understand the implications and importance of statistical properties in their own work (I see it constantly in my professional life).

And regardless, I think seeing the SEs lets us see whether a team has a more variable performance than another. That’s another piece of information that we can then use to explore further. For example, is the variability arising because parts keep breaking, or is there an underlying improvement trend through the competition? Either one would increase the SE compared to a steady performance rate. There are other tools for digging into that data, but we may not look unless we have that SE measure first.

Kind of reminds me of a joke I heard this past weekend that was accidentally butchered:

A physicist, engineer and a statistician are out hunting. Suddenly, a deer appears 50 yards away.

The physicist does some basic ballistic calculations, assuming a vacuum, lifts his rifle to a specific angle, and shoots. The bullet lands 5 yards short.

The engineer adds a fudge factor for air resistance, lifts his rifle slightly higher, and shoots. The bullet lands 5 yards long.

The statistician yells “We got him!”

A really interesting read into “what is important” from stats in basketball:

+/- system is probably the most similar “stat” to OPR utilized in basketball. It is figured a different way, but is a good way of estimating impact from a player vs. just using points/rebounds and…

The article does a really good job of comparing a metric like that to more typical event-driven stats, and to the actual impactful details of a particularly difficult-to-scout player.

I really enjoy the line where it discusses trying to find undervalued mid-pack players. Often with scouting, this is exactly what you, too, are trying to do: rank the #16-#24 teams at an event as accurately as possible in order to give your alliance its best chance at advancing.

If you enjoy this topic, enjoy the article, and have not read Moneyball, it is well worth the read. I enjoyed the movie, but the book is so much better about the details.

Of course! I’m absolutely successful every time I go hunting!

There’s an equivalent economists’ joke in which trying to feed a group on a desert island ends with “assume a can opener!”:smiley:

Wholly endorse Moneyball to anyone reading this thread. It’s what FRC scouting is all about. We call our system “MoneyBot.”

In baseball, this use of statistics is called “sabermetrics.” Bill James is the originator of this method.

Getting back to the original question:

Do you think the concept of standard error applies to the individual computed values of OPR, given the way OPR is computed and the data from which it is computed?

Why or why not?

If yes: explain how you would propose to compute the standard error for each OPR value, what assumptions would need to be made about the model and the data in order for said computed standard error values to be meaningful, and how the standard error values should be interpreted.

So for those of you who answered “yes”:

Pick an authoritative (within the field of statistics) definition for standard error, and compute that “standard error” for each Team’s OPR for the attached example.


Archimedes.zip (41.9 KB)


… and for those of you who think the answer is “no”, explain why none of the well-defined “standard errors” (within the field of statistics) can be meaningfully applied to the example data (provided in the linked post) in a statistically valid way.

Here’s a poor-man’s approach to approximating the error of the OPR value calculation (as opposed to the prediction error aka regression error):

  1. Collect all of a team’s match results.

  2. Compute the normal OPR.

  3. Then, re-compute the OPR but excluding the result from the first match.

  4. Repeat this process by removing the results from only the 2nd match, then only the 3rd, etc. This will give you a set of OPR values computed by excluding a single match. So for example, if a team played 6 matches, there would be the original OPR plus 6 additional “OPR-” values.

  5. Compute the standard deviation of the set of OPR- values. This should give you some idea of how much variability a particular match contributes to the team’s OPR. Note that this will even vary team-by-team.
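That leave-one-match-out procedure (essentially a jackknife) can be sketched directly, again with a made-up schedule matrix and scores standing in for real event data:

```python
import numpy as np

# Toy schedule matrix A (one row per alliance per match, 1 where a team
# played) and score vector b; a real event would build these from data.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
b = np.array([60, 50, 55, 52, 58, 54], dtype=float)

def opr(A, b):
    """Least-squares OPR fit."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

full = opr(A, b)                       # step 2: the normal OPR
team = 0                               # column index of the team under study
rows = np.flatnonzero(A[:, team] > 0)  # step 1: the matches that team played in

# Steps 3-4: re-fit with each of that team's matches excluded in turn.
loo = np.array([opr(np.delete(A, r, axis=0), np.delete(b, r))[team]
                for r in rows])

# Step 5: the spread of the leave-one-out values.
print(round(full[team], 2), round(loo.std(), 2))
```

One caveat worth flagging: dropping a row can leave the remaining system poorly conditioned for sparsely scheduled teams, so the spread should be read as a rough sensitivity measure rather than a formal jackknife variance.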