"standard error" of OPR values

I think one solution is to use a fixed-effects model that includes a separate variable for each team, and the SE for each team will show up there. To be honest, issues like that for FE models are getting beyond my econometric experience; maybe someone else could research that and check. FE models (as well as random effects models) have become quite popular in the last decade.

This sounds very similar to bootstrap resampling (http://www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf), which should measure the variation in estimated OPR from the “true” OPR values rather than how consistently individual teams perform. This may be why the values are virtually identical.

Yep, though my derivation is for straight bootstrapping (Figure #1 in your attachment) rather than re-sampled bootstrapping (Figure #3). And yes, given this, the standard errors I compute are the variations of the OPR estimates if they fit the model, all of which assumes that there is no variation in the way individual teams perform other than their mean contribution. Obviously, this final assumption is suspect.

I think this assumption can be restated a different way. We might assume that the contribution of each team is normally distributed with a certain mean and variance. If the variance is fixed and assumed to be the same for each team, then the maximum likelihood estimate of the means of the distributions should be the same as the least squares estimate as in usual OPR. This assumes that there is some hidden distribution of the contribution of each team which is normal and that the variance of each distribution is the same.
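As a minimal sketch of that equivalence (my notation, not from the thread): if each alliance score $s_i$ is modeled as the sum of the mean contributions $\mu_t$ of its three teams plus Gaussian noise of common variance $\sigma^2$, then

$$
\log L(\mu) \;=\; -\frac{1}{2\sigma^2}\sum_i \Big(s_i - \sum_{t \in \text{alliance } i} \mu_t\Big)^2 \;+\; \text{const},
$$

so maximizing the likelihood over the $\mu_t$ is exactly minimizing the sum of squared residuals, i.e., the usual least-squares OPR fit.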

Hi all,

As a student going into his first year of undergrad this fall, this kind of stuff interests me. What level (or what course equivalent or student experience) is this kind of stuff typically taught at?

I have looked into interpolation, as I would like to spend some time independently developing spline path generation for auton modes, and that particular area requires a bit of knowledge of linear algebra, which I will begin teaching myself soon enough.

As for this topic, what would be the equivalent prerequisite, the way linear algebra is for interpolation?

I don’t mean to hijack the thread, but it feels like the most appropriate place to ask…

Getting back to this after a long hiatus.

If you are asking for individual standard error associated with each OPR value, no one ever posts them because the official FRC match data doesn’t contain enough information to make a meaningful computation of those individual values.

In a situation, unlike FRC OPR, where you know the variance of each observed value (either by repeated observations using the same values for the predictor variables, or if you are measuring something with an instrument of known accuracy) you can put those variances into the design matrix for each observation and compute a meaningful standard error for each of the model parameters.

Or if, unlike FRC OPR, you have good reason to believe the observations are homoscedastic, you can compute the variance of the residuals and use that to back-calculate standard errors for the model parameters. If you do this for FRC data the result will be standard errors which are very nearly the same for each OPR value… which is clearly not the expected result.
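For concreteness, here is a minimal Octave/MATLAB-style sketch of the known-variance (weighted) case described above; the variable names are illustrative, and when all the entries of v are equal this collapses to the ordinary homoscedastic computation that appears later in the thread:

% Weighted least squares when the variance of each observed alliance
% score is known a priori (which is NOT the case for FRC match data).
% A = design matrix (one row per alliance score), b = alliance scores,
% v = column vector of known per-observation variances.
Aw = A ./ sqrt(v);                 % scale each row by 1/std of that observation
bw = b ./ sqrt(v);
x  = Aw \ bw;                      % weighted least-squares parameter estimates
Cx = inv(Aw'*Aw);                  % parameter covariance matrix
SE_of_parameters = sqrt(diag(Cx))  % meaningful here because v was known, not assumed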

The standard errors for the OPR values can be computed, but they are in fact quite large relative to the parameter values. That is actually my point: the statistical precision of the OPR values is really quite poor because there are so few observations, which are in fact not independent. Rather than ignoring the SEs because they show how poor the OPR estimators are performing, the SEs should be reported to show how poorly the estimators perform for everyone's consideration.

I think you missed my point entirely.

Yes, they can be computed, but that doesn’t mean they are statistically valid. They are not, because the data does not conform to the necessary assumptions.

…but they are in fact quite large relative to the parameter values.

Yes they are, but they are also nearly all the same value… which is obviously incorrect… and a result of assumptions which the data does not meet.

the statistical precision of the OPR values is really quite poor because there are so few observations, which are in fact not independent.

Lack of independence is only one of the assumptions which the data do not meet.

Rather than ignoring the SEs because they show how poor the OPR estimators are performing…

They are not being ignored “because they show how poor the OPR estimators are performing”; they are not being reported because they are invalid and misleading.

the SEs should be reported to show how poorly the estimators perform for everyone’s consideration.

There are better metrics to report to show how poorly the estimators perform.

So it’s not possible to perform a statistically valid calculation for standard deviation? Are there no ways to solve for it with a system that is dependent on other robots’ performances?

+1, +/- 0.3.

It would be great if standard error could be used as a measure of a team's consistency, but that's not its only function. I agree with Richard that one of the benefits of an error value is to provide an indication of how much difference is (or is not) significant. If the error bars on the OPRs are all (for example) about 10 points, then a 4-point difference in OPR between two teams probably means less in sorting a pick list than does a qualitative difference in a scouting report.

As it turns out, I was recently asked for the average time it takes members of my branch to produce environmental support products. Because we get requests that range from a 10 mile square box on one day to seasonal variability for a whole ocean basin, the (requested) mean production time means nothing. For one class of product, the standard deviation of production times was greater than the mean. Without the scatter info, the reader would have probably assumed that we were making essentially identical widgets and that the scatter was +/- 1 or 2 in the last reported digit.

We’re discussing standard error of the model parameters, also known as standard error of the regression coefficients. So in our particular case, that would be standard error of the OPRs.

Standard error of the model parameters is a very useful statistic in those cases where it applies. I mentioned one such situation in my previous post:

In a situation, unlike FRC OPR, where you know the variance of each observed value (either by repeated observations using the same values for the predictor variables, or if you are measuring something with an instrument of known accuracy) you can put those variances into the design matrix for each observation and compute a meaningful standard error for each of the model parameters.

An example of the above would be analysis and correction of land surveying network measurement data. The standard deviation of the measurements is known a priori from the manufacturer’s specs for the measurement instruments and from the surveyor’s prior experience with those instruments.

In such a case, computing standard error of the model parameters is justified, and the results are meaningful. All modern land surveying measurement adjustment apps include it in their reports.

Are there no ways to solve for it with a system that is dependent on other robots’ performances?

That’s a large (but not the only) part of the problem.

I briefly addressed this in my previous post:

Or if, unlike FRC OPR, you have good reason to believe the observations are homoscedastic, you can compute the variance of the residuals and use that to back-calculate standard errors for the model parameters. If you do this for FRC data the result will be standard errors which are very nearly the same for each OPR value… which is clearly not the expected result.

In the case of computing OPRs using only FIRST-provided match results data (no manual scouting), the data does not meet the requirements for using the above technique.

In fact, when you use the above technique for OPR you are essentially assuming that all teams are identical in their consistency of scoring, so it’s not surprising that when you put that assumption into the calculation you get it back out in the results. GIGO.

Posting invalid and misleading statistics is a bad idea, especially when there are better, more meaningful statistics to fill the role.

For Richard and Gus: If all you are looking for is one overall ballpark number for “how bad are the OPR calculations for this event”, let's explore better ways to present that.

But based on this response, the OPR estimates themselves should not be reported because they are not statistically valid either. Instead, by not reporting some measure of the potential error, they give the impression of precision to the OPRs.

I just discussed this problem as a major failing for engineers in general–if they are not fully comfortable in reporting a parameter, e.g., a measure of uncertainty, they often will simply ignore the parameter entirely. (I was discussing how the value of solar PV is being estimated across a dozen studies. I’ve seen this tendency over and over in almost 30 years of professional work.) Instead, the appropriate method ALWAYS, ALWAYS, ALWAYS is to report the uncertain or unknown parameter with some sort of estimate and all sorts of caveats. Instead what happens is that decisionmakers and stakeholders much too often accept the values given as having much greater precision than they actually have.

While calculating the OPR really is of no true consequence, because we are working with high school students who are very likely to be engineers, it is imperative that they understand and use the correct method of presenting their results.

So, the SEs should be reported as the best available approximation of the error term around the OPR estimates. And the caveats about the properties of the distribution can be reported with a discussion about the likely biases in the parameters due to the probability distributions.

Sez who? They are the valid least-squares fit to the model. That is all they are. According to what criteria are they then not valid?

Instead, by not reporting some measure of the potential error, they give the impression of precision to the OPRs.

Who is suggesting not to report some measure of the potential error? Certainly not me. Read my posts.

I just discussed this problem as a major failing for engineers in general–if they are not fully comfortable in reporting a parameter, e.g., a measure of uncertainty, they often will simply ignore the parameter entirely.

I do not have the above failing, if that is what you were implying.

ALWAYS, ALWAYS, ALWAYS is to report the uncertain or unknown parameter with some sort of estimate and all sorts of caveats.

You are saying this as if you think I disagree. If so, you would be wrong.

Instead what happens is that decisionmakers and stakeholders much too often accept the values given as having much greater precision than they actually have.

Exactly. And perhaps more often than you realize, those values they are given shouldn’t have been reported in the first place because the data does not support them. Different (more valid) measures of uncertainty should have been reported.

While calculating the OPR really is of no true consequence, because we are working with high school students who are very likely to be engineers, it is imperative that they understand and use the correct method of presenting their results.

Well, I couldn't agree more, and it is why we are having this discussion.

So, the SEs should be reported as the best available approximation of the error term around the OPR estimates

Assigning a separate standard error to each OPR value computed from the FIRST match results data is totally meaningless and statistically invalid. As you said above, “it is imperative that they understand and use the correct method of presenting their results”.

Let’s explore alternative ways to demonstrate the shortcomings of the OPR values.

the caveats about the properties of the distribution can be reported with a discussion about the likely biases in the parameters due to the probability distributions

“Likely” is an understatement. The individual (per-OPR) computed standard error values are obviously and demonstrably wrong (this can be verified with manual scouting data). And what’s more, we know why they are wrong.

As I've suggested in my previous two posts, let's explore alternative, valid ways to demonstrate the shortcomings of the OPR values.

One place to start might be to ask whether or not the average value of the vector of standard errors of OPRs might be meaningful, and if so, what exactly it means.

Ether

I wasn’t quite sure why you dug up my original post to start this discussion. It seemed out of context with all of your other discussion about adding error estimates. That said, my request was more general, and it seems to be answered more generally by the other computational efforts that have been going on in the 2 related threads.

But one point I will say: using a fixed-effects model with a separate match-progression parameter (to capture the most likely source of heteroskedasticity) should lead to parameter estimates that will provide valid error terms using FRC data. But computing fixed-effects models is a much more complex process. It is something that can be done in R.
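A minimal sketch of one such specification (my notation; a linear match-progression term is just one plausible choice):

$$
s_{m,a} \;=\; \sum_{t \in \text{alliance}(m,a)} \beta_t \;+\; \gamma\, m \;+\; \varepsilon_{m,a},
\qquad \varepsilon_{m,a} \sim N(0, \sigma_m^2),
$$

where $s_{m,a}$ is the score of alliance $a$ in match $m$, the $\beta_t$ are the team fixed effects (the OPR-like parameters), and the $\gamma\, m$ term is meant to absorb the systematic score growth over the event that would otherwise show up as heteroskedasticity in $\varepsilon$.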

That one can calculate a number doesn't mean that the number is meaningful. Without a report of the error around the parameter estimates, the least squares fit is not statistically valid and the meaning cannot be interpreted. This is a fundamental principle in econometrics (and I presume in statistics in general).

I’m glad you agree with me on this very important point. It’s what I have been saying about your request for SE estimates for each individual OPR.

Without a report of the error around the parameter estimates, the least squares fit is not statistically valid

Without knowing your private definition of “statistically valid” I can neither agree nor disagree.

and the meaning cannot be interpreted.

The meaning can be interpreted as follows: It is the set of model parameters which minimizes the sum of the squares of the differences between the actual and model-predicted alliance scores. This is universally understood. Now once you’ve done that regression, proceeding to do inferential statistics based on the fitted model is where you hit a speed bump because the data does not satisfy the assumptions required for many of the common statistics.
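(In symbols, with $A$ the alliance design matrix and $b$ the vector of alliance scores, the OPR vector is simply

$$
\hat{x} \;=\; \arg\min_{x}\, \lVert b - A x \rVert^2 ,
$$

a definition that stands on its own without any distributional assumptions.)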

The usefulness of the fitted model can, however, be assessed without using said statistics.

I wasn’t quite sure why you dug up my original post to start this discussion.

I had spent quite some time researching the OP question and came back to tie up loose ends.

It seemed out of context with all of your other discussion about adding error estimates.

How so? I think I have been fairly consistent throughout this thread.

That said, my request was more general, and it seems to be answered more generally by the other computational efforts that have been going on in the 2 related threads.

Your original request was (emphasis mine):

I’m thinking of the parameter standard errors, i.e., the error estimate around the OPR parameter itself for each team. That can be computed from the matrix–it’s a primary output of any statistical software package.

During the hiatus I researched this extensively. The standard error of model parameters (regression coefficients) is reported by SPSS, SAS, MINITAB, R, ASP, MicrOsiris, Tanagra, and even Excel. All these packages compute the same set of values, so they are all doing the same thing.

Given [A][x] = [b], the following computation produces the same values as those packages:

x = A\b;                          % OPR estimates (ordinary least squares)
residuals = b - A*x;              % prediction residual for each alliance score
SSres = residuals'*residuals;     % sum of squared residuals
VARres = SSres/(alliances-teams); % residual variance; alliances = # of alliance scores, teams = # of teams

Av = A/sqrt(VARres);              % scale the design matrix by the assumed common std deviation
Nvi = inv(Av'*Av);                % = VARres*inv(A'*A), the parameter covariance matrix
SE_of_parameters = sqrt(diag(Nvi))

The above code clearly shows that this computation is assuming that the standard deviation is constant for all measurements (alliance scores) and thus for all teams… which we know is clearly not the case. That’s one reason it produces meaningless results in the case of FRC match results data.

But one point I will say: using a fixed-effects model with a separate match-progression parameter (to capture the most likely source of heteroskedasticity) should lead to parameter estimates that will provide valid error terms using FRC data. But computing fixed-effects models is a much more complex process. It is something that can be done in R.

That’s an interesting suggestion, but I doubt it would be successful. I’d be pleased to be proven wrong. If you are willing to try it, I will provide whatever raw data you need in your format of choice.


One definition of statistical validity:
https://explorable.com/statistical-validity

Statistical validity refers to whether a statistical study is able to draw conclusions that are in agreement with statistical and scientific laws. This means if a conclusion is drawn from a given data set after experimentation, it is said to be scientifically valid if the conclusion drawn from the experiment is scientific and relies on mathematical and statistical laws.

It is the set of model parameters which minimizes the sum of the squares of the differences between the actual and model-predicted alliance scores. This is universally understood.

This is the point upon which we disagree. This is not a mathematical exercise–it is a statistical one. And statistical analysis requires inference about the validity of the estimated parameters. And I strongly believe that the many students who will be working in engineering in the future who read this need to understand that this is a statistical exercise which requires all of the caveats of such analysis.

Here’s a discussion for fixed effects from the SAS manual:

One of the two textbooks for my intermediate mechanics lab (sophomores and juniors in physics and engineering) was entitled How to Lie with Statistics. Chapter 4 is entitled “Much Ado about Practically Nothing.” For me, the takeaway sentence from this chapter is:

Unfortunately, not many high schoolers have been exposed to this concept.

Finally, if standard errors could be validly produced for each team as a measure of its consistency/reliability, that would be outstanding. Given that teams change strategy and modify robots between matches (and this year's nonlinear scoring), it is not surprising that per-team standard error calculations are not valid. (And by the way, Ether's finding that the numbers could be calculated but did not communicate variability is at least qualitatively similar to Richard's argument concerning OPR.)

This does not negate the need for a “standard error” or “probable error” of the whole data set. OPR is ultimately a measurement, and anyone using OPR to drive a decision needs to understand the accuracy. That is, does a difference of 5 points in OPR mean that one team is better than the other with 10% confidence, 50% confidence, or 90% confidence?
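As a rough back-of-envelope illustration only (it assumes every OPR estimate carries the same standard error and that the two estimates are independent, both of which are questionable per the discussion above), an OPR gap could be turned into a crude confidence like this:

% Crude confidence that team 1 is truly better, given an observed OPR gap.
% s is an assumed event-wide standard error figure (illustrative only),
% d is the observed difference in OPR.
s = 11.5;                          % assumed; in the ballpark of the figure discussed later in the thread
d = 5;                             % observed OPR difference
conf = 0.5*(1 + erf(d/(2*s)))      % = Phi(d/(s*sqrt(2))), about 0.62 here, i.e. roughly 62%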

Hi All,

Ether and I have been having some private discussions and running some simulations on this topic. I thought I'd report the general results here. I think Ether agrees with what I say below, but I'll leave that for him to confirm or deny. :)

Executive Summary:

  1. The mean of the standard error vector for the OPR estimates is a decent approximation for the standard deviation of the team-specific OPR estimates themselves, and is a very good approximation for the mean of the standard deviations of the team-specific OPR estimates taken across all of the teams in the tournament.

  2. Teams with more variability in their offensive contributions (e.g., teams that contribute a huge amount to their alliance’s score by performing some high-scoring feats, but fail at doing so 1/2 the time) will have slightly more uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

  3. Teams with less variability in their offensive contributions (e.g., consistent teams that always contribute about the same amount to their alliance’s score every match) will have slightly less uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

Details:

I simulated match scores in the following way (a rough code sketch of steps 1 through 5 follows the numbered list).

  1. I computed the actual OPRs from the actual match data (in this case, from the 2014 misjo tournament as suggested by Ether).

  2. I computed the sum of the squared values of the prediction residual and divided this sum by (#matches - #teams) to get an estimate of the per-match randomness that exists after the OPR prediction is performed.

  3. I divided the result from step #2 above by 3 to get a per-team estimate of the variance of each team's offensive contribution (each alliance score is modeled as the sum of three team contributions, so its variance is the sum of three per-team variances). I took the square root of this to get the per-team estimate of the standard deviation of each team's offensive contribution.

  4. I then simulated 1000 tournaments using the same match schedule as the 2014 misjo tournament. The simulated match scores were the sum of the 3 OPRs for the teams in that match plus 3 zero-mean, variance-1 normally distributed random numbers scaled by the 3 per-team offensive standard deviations computed in step #3. Note that at this point, each team has the same value for the per-team offensive standard deviations.

  5. I then computed the OPR estimates from the match scores for each simulated tournament and computed the actual standard deviation of the 1000 OPR estimates for each team. These standard deviations were all close to 11.5 (between 11 and 12) which was the average of the elements of the traditional standard error vector calculation performed on the original data. This makes sense, as the standard error is supposed to be the standard deviation of the estimates if the randomness of the match scores had equal variance for all matches, as was simulated. As a reminder, all of the individual elements of the standard error vector were extremely close to 11.5 in this case.

  6. But then I tried something different. Instead of having the per-team standard deviation of the offensive contributions be constant, I instead added a random variable to these standard deviations and then renormalized all of them so that the average variance of the match scores would be unchanged. In other words, now some teams have a larger variance in their offensive contributions (e.g., team A might have an OPR of 30 but have its score contribution typically vary between 15 and 45) while other teams might have a smaller variance in their contributions (e.g., team B might also have an OPR of 30 but have its score contribution only typically vary between 25 and 35).

  7. Now I resimulated another 1000 tournaments using this model. So now, some match scores might have greater variances and some match scores might have smaller variances. But the way OPR was calculated was not changed.

  8. Then I calculated the OPRs for these new 1000 simulated tournaments and calculated the standard deviations of these 1000 new per-team OPR estimates.
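Here is a rough Octave/MATLAB sketch of steps 1 through 5, with illustrative variable names (A is the 0/1 alliance design matrix for the event and b is the vector of actual alliance scores); steps 6 through 8 differ only in how sd_team is filled in (randomized, then renormalized):

x = A\b;                                        % step 1: actual OPRs
res = b - A*x;
VARmatch = (res'*res) / (size(A,1)-size(A,2));  % step 2: residual variance per alliance score
sd_team = sqrt(VARmatch/3) * ones(size(A,2),1); % step 3: equal per-team contribution std dev

Ntrials = 1000;
opr_sims = zeros(size(A,2), Ntrials);
for k = 1:Ntrials
  % step 4: simulated score = sum of the three OPRs plus an independent
  % zero-mean normal draw for each team in each of its matches
  noise = (A .* randn(size(A))) * sd_team;
  opr_sims(:,k) = A \ (A*x + noise);            % step 5: re-estimate the OPRs
end
sd_of_opr_estimates = std(opr_sims, 0, 2);      % empirical std dev of each team's 1000 OPR estimates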

What I found was that the OPR estimates did vary more for teams that had a greater offensive variance and did vary less for teams that had a smaller offensive variance. So, if you’re convinced that different teams have substantially different variances in their offensive contributions, then just using the one average standard error computation to estimate how reliable all of the different OPR estimates are is not completely accurate.

But the differences were not that large. For example, in one set of simulations, team A had an offensive contribution with a standard deviation of 8 while team B had an offensive contribution with a standard deviation of 29. So in this case, team B had a LOT more variability in their offensive contribution than team A did (almost 4x as much). But the standard deviation of the 1000 OPR estimates for team A was 10.8 while the standard deviation of the 1000 OPR estimates for team B was 12.9. So yes, team B had a much bigger offensive variability and that made the confidence in their OPR estimates worse than the 11.5 that the standard error would suggest, but it only went up by 1.4, while team A had a much smaller offensive variability but that only improved the confidence in their OPR estimates by 0.7.

And also, the average of the standard deviations of the OPR estimates for the teams in the 1000 tournaments was still very close to the average of the standard error vector computed assuming that the match scores had identical variances.

So, repeating the Executive Summary:

  1. The mean of the standard error vector for the OPR estimates is a decent approximation for the standard deviation of the team-specific OPR estimates themselves, and is a very good approximation for the mean of the standard deviations of the team-specific OPR estimates taken across all of the teams in the tournament.

  2. Teams with more variability in their offensive contributions (e.g., teams that contribute a huge amount to their alliance’s score by performing some high-scoring feats, but fail at doing so 1/2 the time) will have slightly more uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

  3. Teams with less variability in their offensive contributions (e.g., consistent teams that always contribute about the same amount to their alliance’s score every match) will have slightly less uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

Couldn't one generate an estimate for each team's “contribution to variance” by doing the same least-squares fit used to generate OPR in the first place (using the vector of squared residuals in place of the vector of scores)? This might run the risk of assigning some team a negative contribution to variance (good luck making sense of that one), but other than that (seemingly unlikely) case I can't think of why this wouldn't work.
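A minimal sketch of that idea, under the poster's own assumptions (A and b as in the earlier code; whether the resulting numbers are meaningful is exactly the open question):

% Apportion the squared prediction residuals among teams with the same
% least-squares machinery used for OPR itself.
x = A\b;                 % the usual OPRs
res2 = (b - A*x).^2;     % squared residual for each alliance score
var_contrib = A \ res2;  % per-team "contribution to variance" (entries may come out negative)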

The bottom line here seems to be that even assuming that an alliance's expected score is a simple sum of each team's contributions, the statistics tend to properly report the **global** match-to-match variation, while under-reporting **each team's** match-to-match variation.
The elephant in the room here is the assumption that the alliance score is equal to the sum of its members' contributions. For example, consider a 2015 (Recycle Rush) robot with a highly effective 2-can grab during autonomous, and the ability to build, score, cap and noodle one stack of six from the HP station, or cap five stacks of up to six totes during a match, or cap four stacks with noodles loaded over the wall. For argument's sake, it is essentially 100% proficient at these tasks, selecting which to do based on its alliance partners. I will also admit up front that the alliance match-ups are somewhat contrived, but none are truly unrealistic. If I'd wanted to really stack the deck, I'd have assumed that the robot was the consummate RC specialist and had no tote manipulators at all.

  • If the robot had the field to itself, it could score 42 points (one noodled, capped stack of 6). The canburglar is useless, except as a defensive measure.
  • If paired with two HP robots that could combine to score 2 or 3 capped stacks, this robot would add at most a few noodles to the final score. It either can't get to the HP station, or it would displace another robot that would have been using the station. Again, the canburglar has no offensive value.
  • If paired with an HP robot that could score two capped & noodled stacks, and a landfill miner that could build and cap two non-noodled stacks, the marginal value of this robot would be 66 points (42 points for its own noodled, capped stack, and 24 points for the fourth stack that the landfill robot could cap). The canburglar definitely contributes here!
  • If allied with two HP robots that could put up 4 or 5 6-stacks of totes (but no RCs), the marginal value of this robot would be a whopping 120 points (cap 4 6-stacks with RCs and noodles, or cap 5 6-stacks with RCs). Couldn't do it without that canburglar!

The real point is that this variation is based on the alliance composition, not on “performance variation” of the robot in the same situation. I also left HP littering out, which would provide additional wrinkles.

My takeaway on this thread is that it would be good and useful information to know the rms (root-mean-square) of the residuals for an OPR/DPR data set (tournament or season). This would provide some understanding as to how much difference really is a difference, and a clue as to when the statistics mean about as much as the scouting.

On another slightly related matter, I have wondered why CCWM (Calculated Contribution to Winning Margin) is calculated by combining separate calculations of OPR and DPR, rather than by a single least-squares solution against the winning margins. I suspect that the single calculation would prove to be more consistent for games with robot-based defense (**not** Recycle Rush); if a robot plays offense five matches and defense five matches, then both OPR and DPR would each have a lot of noise, whereas true CCWM should be a more consistent number.
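A minimal sketch of the single-regression version being suggested, with illustrative names (b_own and b_opp would be each alliance's own and opposing scores, row-aligned with the design matrix A):

% Regress winning margin directly instead of differencing OPR and DPR.
margin = b_own - b_opp;    % per-alliance winning margin
ccwm_direct = A \ margin;  % compare against OPR - DPR from separate fits

One caveat worth noting: if the same design matrix A is used for OPR (against b_own) and DPR (against b_opp), then A\(b_own - b_opp) is algebraically identical to A\b_own - A\b_opp, because the least-squares solution is linear in the right-hand side; so the single regression would reproduce OPR - DPR exactly rather than yield a more consistent number.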