
"standard error" of OPR values


Ether
12-05-2015, 22:38
It's well known that OPR doesn't reflect actual scoring ability. It's a regression analysis that computes each team's implied "contribution" to scores. Unfortunately, no one ever posts the estimates' standard errors (which I imagine to be enormous with only 10 or so observations).

I'm thinking of the parameter standard errors, i.e., the error estimate around the OPR parameter itself for each team. That can be computed from the matrix--it's a primary output of any statistical software package.

The second quote above is from a dialog I've been having with Citrus Dad, and I have his permission to post it here.

I'd like to hear what others have to say.

Do you think the concept of standard error applies to the individual computed values of OPR, given the way OPR is computed and the data from which it is computed?

Why or why not?

If yes: explain how you would propose to compute the standard error for each OPR value, what assumptions would need to be made about the model and the data in order for said computed standard error values to be meaningful, and how the standard error values should be interpreted.

Rachel Lim
12-05-2015, 23:29
Just to check that I understand this correctly: standard error is basically the standard deviation of an estimate around the "correct" value, and you're asking whether OPR values have such a distribution around the "correct" value (i.e., the given OPR)?

Also, is OPR calculated by taking t1+t2+t3 = redScore1, t4+t5+t6 = blueScore1, etc., and then solving that system of linear equations?


I would guess it would depend on what you mean by OPR. I always assumed, perhaps incorrectly, that OPR was the solution of the above calculations, and thus it is just a number, neither correct nor incorrect.

If OPR is meant to indicate actual scoring ability, this would change. However, I'm not sure how to figure out how many points a team contributes--if one team stacked 6 totes and another capped, does the first get 12 and the second 24, or does each get 36, or some other combination?

I suppose one way to do it would be to take the difference between a team's OPR and 1/3 of their alliance's points from each match they played in, and see the change in that difference. Comparing that between teams will be tricky, since the very top/bottom teams will have a greater difference than an average one. Similarly, looking at a team's OPR after X matches and 1/3 of the match X+1 score would be interesting but would also have that problem.


(Or I could just be very confused about what OPR and standard error really are--I've tried to piece together what I've read here and on the internet but haven't formally taken linear algebra or statistics.)

James Kuszmaul
12-05-2015, 23:30
I'm not sure if there is a good, clean method which produces some sort of statistical standard deviation or the like, although I would be happy to be proven wrong.
However, I believe that the following method should give a useful result:

If you start out with the standard OPR calculations, with the matrix equation A * x = b, where x is an n x 1 matrix containing all the OPRs, A is the matrix describing which teams each team has played with, and b contains the sum of the scores from the matches each team played in, then in order to compute a useful error value we would do the following:
1) Calculate the expected score from each match (using OPR), storing the result in a matrix exp, which is m x 1. Also, store all the actual scores in another m x 1 matrix, act.
2) Calculate the square of the error for each match, in the matrix err = (act - exp)^2 (using the squared notation to refer to squaring individual elements). You could also try taking the absolute value of each element, which would result in a similar distinction as that between the L1 and L2 norm.
3) Sum up the squared err for each match into the matrix errsum, which will replace the b from the original OPR calculation.
4) Solve for y in A * y = errsum (obviously, this would be over-determined, just like the original OPR calculation). In order to get things into the right units, you should then take the square root of every element of y, and that will give a team's typical spread.

This should give each team's typical contribution to the change in their match scores.

added-in note:
I'm not sure what statistical meaning the values generated by this method would have, but I do believe that they would have some useful meaning, unlike the values generated by just directly computing the total least-squared error of the original calculation (ie, (A*x - b)^2). If no one else does, I may implement this method just to see how it performs.
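
A minimal numpy sketch of the steps above, using the per-alliance-score form of the matrices (the same least-squares OPR falls out either way). The function and variable names, and the clipping of negative fitted values before the square root, are illustrative choices rather than part of the method as stated:

import numpy as np

def per_team_residual_spread(A, b):
    """Sketch of the residual-regression idea described above.

    A : (m, t) 0/1 incidence matrix, A[i, j] = 1 if team j contributed
        to alliance score i.
    b : (m,) vector of alliance scores.
    Returns (opr, spread), where spread[j] is the square root of team j's
    fitted share of the squared match residuals.
    """
    # Standard OPR: least-squares solution of A x ~= b
    opr, *_ = np.linalg.lstsq(A, b, rcond=None)

    # Squared residual for every alliance score (step 2)
    err = (b - A @ opr) ** 2

    # Fit each team's share of the squared residuals the same way (steps 3-4)
    y, *_ = np.linalg.lstsq(A, err, rcond=None)

    # y can come out slightly negative for some teams; clip before the sqrt
    return opr, np.sqrt(np.clip(y, 0.0, None))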

GeeTwo
13-05-2015, 05:00
Calculation of the standard error in OPR for each team sounds straightforward - the RMS of the residuals between the linear model and the match data for the matches in which a team participated. However, this number would probably not cast much if any light on the source of this scatter. One obvious source of scatter is the actual match-to-match performance variation of each team - e.g., a team that usually puts up two stacks per match, but in one match they set a stack on some litter and it knocked over the first. Another is non-linearity in the combined scoring (e.g. two good teams that perform very well with mediocre partners, but run out of game pieces when allied, or a tote specialist allied with an RC specialist who do much better together than separately).

wgardner
13-05-2015, 08:51
There are two types of error:

The first is the prediction residual which measures how well the OPR model is predicting match outcomes. In games where there is a lot of match-to-match variation, the prediction residual will be high no matter how many matches each team plays.

The second is the error in measuring the actual, underlying OPR value (if you buy into the linear model). If teams actually had an underlying OPR value, then as teams play 10, 100, 1000 matches the error in computing this value will go to zero.

So, the question is, what exactly are you trying to measure? If you want confidence in the underlying OPR values or perhaps the rankings produced by the OPR values, then the second error is the one you want to figure out and the prediction residual won't really answer that. If you want to know how well the OPR model will predict match outcomes, then the first error is the one you care about.

Oblarg
13-05-2015, 10:43
The second is the error in measuring the actual, underlying OPR value (if you buy into the linear model). If teams actually had an underlying OPR value, then as teams play 10, 100, 1000 matches the error in computing this value will go to zero.

Unfortunately, this is tenuous - there's no real reason to believe that each team contributes linearly to score by some flat amount per match, and that variance beyond that is a random variable whose distribution does not change match-to-match.

If one were to assume that this is actually the case, though, then one would just take the error from the first part and divide it by sqrt(n) to find the error in the estimation of the mean.

Ed Law
13-05-2015, 10:49
I agree with most of what people have said so far. I would like to add my observations and opinions on this topic.

First of all, it is important to understand how OPR is calculated and what it means from a mathematical standpoint. Next it is important to understand all the reasons why OPR does not perfectly reflect what a team actually scores in a match.

To put things in perspective, I would like to categorize all the reasons into two bins: things that are beyond a team's control, and things that reflect the actual "performance" of the team. I consider anything that is beyond a team's control as noise. This is something that will always be there. Some examples, as others have also pointed out, are bad calls by refs, compatibility with partners' robots, non-linearity of scoring, accidents not due to carelessness, field faults not being recognized, non-repeatable robot failures, etc.

The second bin will be things that truly reflect the "performance" of a team. This will measure what a team can potentially contribute to a match, and it will take into account how consistent a team is. The variation here will include factors like how careful they are about not knocking stacks down, avoiding fouls, and avoidable failures such as bad wiring. The problem is that this measure is meaningful only if no team is allowed to modify its robot between matches, meaning the robot is in the exact same condition in every match. However, in reality there are three scenarios. 1) The robot keeps getting better as the team works out the kinks or tunes it better. 2) The robot keeps getting worse as things wear down quickly due to an inappropriate choice of motors or bearings (or the lack thereof), or the design or construction techniques used. Performance can also get worse as some teams keep tinkering with their robot or programming without fully validating the changes. 3) The robot stays the same.

I understand what some people are trying to do. We want a measure of expected variability around each team's OPR numbers, some kind of confidence band. If we have that information, then there will be max and min predictions of each alliance's score. Mathematically, this can be done relatively easily. However, the engineer in me tells me that it is a waste of time. Given the noise factors I listed above and the fact that robot performance may change over time, this becomes just a mathematical exercise and does not contribute much to predicting the outcome of the next match.

However, I do support the publication of the R^2 coefficient of determination. It gives an overall number for how well the actual outcomes fit the statistical model.
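
For reference, that R^2 drops straight out of the least-squares fit; a minimal sketch, assuming A is the m x t team/alliance incidence matrix and b the vector of alliance scores (names are illustrative):

import numpy as np

def opr_r_squared(A, b):
    """Coefficient of determination for the OPR least-squares fit."""
    opr, *_ = np.linalg.lstsq(A, b, rcond=None)
    ss_res = np.sum((b - A @ opr) ** 2)   # residual sum of squares
    ss_tot = np.sum((b - b.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot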

GreyingJay
13-05-2015, 11:53
Unfortunately, this is tenuous - there's no real reason to believe that each team contributes linearly to score by some flat amount per match, and that variance beyond that is a random variable whose distribution does not change match-to-match.


This year may be an anomaly, but it seems to me that, for some teams anyway, this is a reasonable model. Teams have built robots that are very predictable and task-oriented. For example: grab a bin, drive to the feeder station, stack, stack, stack, push, stack, stack, stack, push, etc. Knowing how quick our human player and stack mechanism are, we can predict with reasonable accuracy how many points we can typically score in a match, with the only real variance coming from when things go wrong.

Basel A
13-05-2015, 12:31
I have to strongly agree with what Ed had to say above. Errors in OPR happen when its assumptions go unmet: partner or opponent interaction, team inconsistency (including improvement), etc. If one of these factors caused significantly more variation than the others, then the standard error might be a reasonable estimate of that factor. However, I don't believe that this is the case.

Another option would be to take this measure in the same way that we take OPR. We know that OPR is not a perfect depiction of a team's robot quality or even a team's contribution to its alliance, but we use OPR anyway. In the same way, we know the standard error is an imperfect depiction of a team's variation in contribution.

People constantly use the same example in discussing consistency in FRC. A low-seeded captain, when considering two similarly contributing teams, is generally better off selecting an inconsistent team over a consistent one. Standard error could be a reasonable measure of this inconsistency (whether due to simple variation or improvement). At a scouting meeting, higher standard error could indicate "teams to watch" (for improvement).

But without having tried it, I suspect a team's standard error will ultimately be mostly unintelligible noise.

Wayne TenBrink
13-05-2015, 13:47
Has anyone ever attempted a validation study to compare "actual contribution" (based on scouting data or a review of match video) to OPR values? It seems like this would be fairly easy and accurate for Recycle Rush (and very difficult for Aerial Assist). I did that with our performance at one district event and found the result to be very close (OPR=71 vs "Actual"= 74).

In some ways, OPR is probably more relevant than "actual contribution". For example, a good strategist in Aerial Assist could extract productivity from teams that might otherwise just drive around aimlessly. This sort of contribution would show up in OPR, but a scout wouldn't attribute it to them as an "actual contribution".

It would be interesting to see if OPR error was the same (magnitude and direction) for low, medium, and high OPR teams, etc.

Citrus Dad
13-05-2015, 13:50
There are two types of error:

The first is the prediction residual which measures how well the OPR model is predicting match outcomes. In games where there is a lot of match-to-match variation, the prediction residual will be high no matter how many matches each team plays.

The second is the error in measuring the actual, underlying OPR value (if you buy into the linear model). If teams actually had an underlying OPR value, then as teams play 10, 100, 1000 matches the error in computing this value will go to zero.

So, the question is, what exactly are you trying to measure? If you want confidence in the underlying OPR values or perhaps the rankings produced by the OPR values, then the second error is the one you want to figure out and the prediction residual won't really answer that. If you want to know how well the OPR model will predict match outcomes, then the first error is the one you care about.

The first error term is generally reported and Ether produced a measure of those residuals.

It's the second error term that I haven't seen reported. And in my experience working with econometric models, having only 10 observations likely leads to a very large standard error around this parameter estimate. I don't think that calculating this will change the OPR per se, but it will provide a useful measure of the (im)precision of the estimates that I don't think most students and mentors are aware of.

Also, as you imply, a linear model may not be the most appropriate structure even though it is by far the easiest to compute with Excel. For example, the cap on resource availability probably creates a log-linear relationship.

IKE
13-05-2015, 13:52
Has anyone ever attempted a validation study to compare "actual contribution" (based on scouting data or a review of match video) to OPR values? It seems like this would be fairly easy and accurate for Recycle Rush (and very difficult for Aerial Assist). I did that with our performance at one district event and found the result to be very close (OPR=71 vs "Actual"= 74).

In some ways, OPR is probably more relevant than "actual contribution". For example, a good strategist in Aerial Assist could extract productivity from teams that might otherwise just drive around aimlessly. This sort of contribution would show up in OPR, but a scout wouldn't attribute it to them as an "actual contribution".

It would be interesting to see if OPR error was the same (magnitude and direction) for low, medium, and high OPR teams, etc.

I have done this in the past (2010, 2011, 2012, 2013). 2013 was pretty close. 2010 was pretty close at a given event though interesting strategies could result in interesting scores. 2011 was useful at District level of competition, but not very useful at MSC or Worlds. 2012, semi useful if using some sort of co-op balance partial contribution factor.

Someone did a study for Archimedes this year. I would say it is similar to 2011 where 3 really impressive scorers would put up a really great score, but if you expected 3X, you would instead get more like 2.25 to 2.5....

Citrus Dad
13-05-2015, 13:52
Has anyone ever attempted a validation study to compare "actual contribution" (based on scouting data or a review of match video) to OPR values? It seems like this would be fairly easy and accurate for Recycle Rush (and very difficult for Aerial Assist). I did that with our performance at one district event and found the result to be very close (OPR=71 vs "Actual"= 74).

In some ways, OPR is probably more relevant than "actual contribution". For example, a good strategist in Aerial Assist could extract productivity from teams that might otherwise just drive around aimlessly. This sort of contribution would show up in OPR, but a scout wouldn't attribute it to them as an "actual contribution".

It would be interesting to see if OPR error was the same (magnitude and direction) for low, medium, and high OPR teams, etc.

This is a different question than whether the OPR accurately measures true contribution. (Another benefit of that exercise however is to determine whether the OPR estimate has a bias, e.g., related to relative scoring). There will always be error terms around the OPR parameter, so the question to be answered is what are the statistical properties of those error terms.

Citrus Dad
13-05-2015, 13:55
I have done this in the past (2010, 2011, 2012, 2013). 2013 was pretty close. 2010 was pretty close at a given event though interesting strategies could result in interesting scores. 2011 was useful at District level of competition, but not very useful at MSC or Worlds. 2012, semi useful if using some sort of co-op balance partial contribution factor.

Someone did a study for Archimedes this year. I would say it is similar to 2011 where 3 really impressive scorers would put up a really great score, but if you expected 3X, you would instead get more like 2.25 to 2.5....

I produced predicted scores for Newton using the OPR components to eliminate potential double counting of auto and adjust for coop points. I predicted 118 would average 200 and they averaged 198. I have to check the distribution of OPR vs actual points.

Citrus Dad
13-05-2015, 14:03
However, the engineer in me tells me that it is a waste of time. Given the noise factors I listed above and the fact that robot performance may change over time, this becomes just a mathematical exercise and does not contribute much to predicting the outcome of the next match.

However, I do support the publication of the R^2 coefficient of determination. It gives an overall number for how well the actual outcomes fit the statistical model.

The statistician in me led me to ask this question. One aspect is that I believe publishing standard errors for parameter estimates provides greater transparency. Plus, it is very educational. I suspect that most students looking at OPRs don't understand that they are actually statistical estimates with large error bands around the parameter estimates. Providing that education is directly in line with our STEM mission. Too many engineers don't understand the implications and importance of statistical properties in their own work (I see it constantly in my professional life).

And regardless, I think seeing the SEs lets us see whether one team has more variable performance than another. That's another piece of information that we can then use to explore further. For example, is the variability arising because parts keep breaking, or is there an underlying improvement trend through the competition? Either one would increase the SE compared to a steady performance rate. There are other tools for digging into that data, but we may not look unless we have that SE measure first.

IKE
13-05-2015, 17:37
Kind of reminds me of a joke I heard this past weekend that was accidentally butchered:

A physicist, engineer and a statistician are out hunting. Suddenly, a deer appears 50 yards away.

The physicist does some basic ballistic calculations, assuming a vacuum, lifts his rifle to a specific angle, and shoots. The bullet lands 5 yards short.

The engineer adds a fudge factor for air resistance, lifts his rifle slightly higher, and shoots. The bullet lands 5 yards long.

The statistician yells "We got him!"
**************************************************

A really interesting read into "what is important" from stats in basketball:
http://www.nytimes.com/2009/02/15/magazine/15Battier-t.html?pagewanted=1&_r=0

The +/- system is probably the basketball "stat" most similar to OPR. It is figured a different way, but it is a good way of estimating a player's impact versus just using points, rebounds, and so on.

The article does a really good job of comparing a metric like that, and more typical event-driven stats, to the actually impactful details of a particularly difficult-to-scout player.

I really enjoy the part where it discusses trying to find undervalued mid-pack players. Often with scouting, this is exactly what you are trying to do: rank the #16-#24 teams at an event as accurately as possible in order to give your alliance its best chance at advancing.

If you enjoy this topic, enjoy the article, and have not read Moneyball, it is well worth the read. I enjoyed the movie, but the book is so much better about the details.

Citrus Dad
13-05-2015, 18:35
Kind of reminds me of a joke I heard this past weekend that was accidentally butchered:

A physicist, engineer and a statistician are out hunting. Suddenly, a deer appears 50 yards away.

The physicist does some basic ballistic calculations, assuming a vacuum, lifts his rifle to a specific angle, and shoots. The bullet lands 5 yards short.

The engineer adds a fudge factor for air resistance, lifts his rifle slightly higher, and shoots. The bullet lands 5 yards long.

The statistician yells "We got him!"

Of course! I'm absolutely successful every time I go hunting!

There's an equivalent economists' joke in which trying to feed a group on a desert island ends with "assume a can opener!":D
**************************************************


If you enjoy this topic, enjoy the article, and have not read Moneyball, it is well worth the read. I enjoyed the movie, but the book is so much better about the details.

Wholly endorse Moneyball to anyone reading this thread. It's what FRC scouting is all about. We call our system "MoneyBot."

In baseball, this use of statistics is called "sabermetrics" (http://sabr.org/sabermetrics). Bill James is the originator of this method.

Ether
15-05-2015, 18:27
Getting back to the original question:

Do you think the concept of standard error applies to the individual computed values of OPR, given the way OPR is computed and the data from which it is computed?

Why or why not?

If yes: explain how you would propose to compute the standard error for each OPR value, what assumptions would need to be made about the model and the data in order for said computed standard error values to be meaningful, and how the standard error values should be interpreted.


So for those of you who answered "yes":

Pick an authoritative (within the field of statistics) definition for standard error, and compute that "standard error" for each Team's OPR for the attached example.

Ether
16-05-2015, 09:42
So for those of you who answered "yes"

... and for those of you who think the answer is "no", explain why none of the well-defined "standard errors" (within the field of statistics) can be meaningfully applied to the example data (http://www.chiefdelphi.com/forums/attachment.php?attachmentid=18999&d=1431728628) (provided in the linked post (http://www.chiefdelphi.com/forums/showpost.php?p=1482422&postcount=18)) in a statistically valid way.

wgardner
16-05-2015, 14:12
Here's a poor-man's approach to approximating the error of the OPR value calculation (as opposed to the prediction error aka regression error):

1. Collect all of a team's match results.

2. Compute the normal OPR.

3. Then, re-compute the OPR but excluding the result from the first match.

4. Repeat this process by removing the results from only the 2nd match, then only the 3rd, etc. This will give you a set of OPR values computed by excluding a single match. So for example, if a team played 6 matches, there would be the original OPR plus 6 additional "OPR-" values.

5. Compute the standard deviation of the set of OPR- values. This should give you some idea of how much variability a particular match contributes to the team's OPR. Note that this will even vary team-by-team.

Thoughts?
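
A minimal sketch of those steps for a single team, assuming A is the m x t incidence matrix of alliance scores and b the m x 1 score vector (names are illustrative; each pass refits the OPR with one of that team's scores dropped):

import numpy as np

def loo_opr_spread(A, b, team_idx):
    """Standard deviation of the "OPR-" values for one team (steps 3-5)."""
    played = np.flatnonzero(A[:, team_idx])   # alliance scores this team was part of
    oprs = []
    for row in played:
        keep = np.ones(len(b), dtype=bool)
        keep[row] = False                     # drop one of the team's matches
        x, *_ = np.linalg.lstsq(A[keep], b[keep], rcond=None)
        oprs.append(x[team_idx])
    return float(np.std(oprs))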

Ether
16-05-2015, 14:31
Here's a poor-man's approach to approximating the error of the OPR value calculation (as opposed to the prediction error aka regression error):

1. Collect all of a team's match results.

2. Compute the normal OPR.

3. Then, re-compute the OPR but excluding the result from the first match.

4. Repeat this process by removing the results from only the 2nd match, then only the 3rd, etc. This will give you a set of OPR values computed by excluding a single match. So for example, if a team played 6 matches, there would be the original OPR plus 6 additional "OPR-" values.

5. Compute the standard deviation of the set of OPR- values. This should give you some idea of how much variability a particular match contributes to the team's OPR. Note that this will even vary team-by-team.

Thoughts?

This is interesting but not what I'm looking for.

The question in this thread is how (or if) a standard, textbook, widely used, statistically valid "standard error" (as mentioned by Citrus Dad and quoted in the original post of this thread) can be computed for OPR from official FRC qual match results data, unsupplemented by manual scouting data or any other data.

James Kuszmaul
16-05-2015, 16:31
Here's a poor-man's approach to approximating the error of the OPR value calculation (as opposed to the prediction error aka regression error):

1. Collect all of a team's match results.

2. Compute the normal OPR.

3. Then, re-compute the OPR but excluding the result from the first match.

4. Repeat this process by removing the results from only the 2nd match, then only the 3rd, etc. This will give you a set of OPR values computed by excluding a single match. So for example, if a team played 6 matches, there would be the original OPR plus 6 additional "OPR-" values.

5. Compute the standard deviation of the set of OPR- values. This should give you some idea of how much variability a particular match contributes to the team's OPR. Note that this will even vary team-by-team.

Thoughts?

Using Ether's data, I just did essentially this, where I randomly* selected 200 (I just chose this because it excludes enough matches to ensure variation in OPRs, but should include enough matches to keep the system sufficiently over-determined) of the 254 alliance scores to use for the OPR calculation. I ran this 200 times and got the following:

Team Original OPR Mean OPR Standard Deviation StdDev / Mean
1023 119.9222385 120.0083320153 11.227427964 0.0935554038
234 73.13049299 72.801129356 8.9138064084 0.1224404963
135 71.73803792 72.0499437529 7.953512079 0.1103888728
1310 68.29454232 69.3467152712 14.1978070751 0.2047365476
1538 66.51660956 65.739882921 10.0642899215 0.1530926049
1640 63.89355804 63.1124212044 12.5486944006 0.1988308191
4213 59.83218159 60.3799737845 9.7581954471 0.1616131117
2383 59.3454496 58.4390556944 8.8170835924 0.1508765583
5687 58.89565276 58.0801454327 8.5447703278 0.1471203328
2338 57.52050487 57.8998084926 9.9345796042 0.1715822533
68 57.31570571 57.5000280561 7.3734953486 0.1282346391
2342 56.91016998 57.2987212179 6.6038945531 0.115253786
2974 55.52108592 57.1342122847 8.3752237419 0.1465885921
857 56.58983207 56.5258351411 7.2736015551 0.1286774718
2619 55.87939909 55.7690519681 8.4202867997 0.150984937
314 54.93283739 54.2189755764 9.2781646413 0.1711239385
4201 54.36868175 53.4393101098 10.5474638148 0.1973727541
2907 52.20131966 52.8528874425 7.542822466 0.1427135362
360 50.27624758 50.4115562132 7.0992892482 0.1408266235
5403 50.29915841 50.3683881678 6.7117433122 0.133253089
201 45.9115291 44.7743914139 8.4846178186 0.189497111
2013 44.91032156 44.6243506137 6.8765159824 0.1540978387
3602 44.27190346 44.0845482182 9.1690079569 0.2079868872
207 43.76003325 43.534273676 9.6975195297 0.2227559739
1785 42.88695283 43.4312399486 8.2699452851 0.1904146714
1714 43.01192386 42.548981107 10.4744349747 0.2461735793
2848 42.09926229 42.3315382699 5.5963086425 0.1322018729
5571 41.52437471 41.7434170692 9.1647109829 0.2195486528
3322 41.46602143 41.5494849767 7.1743838875 0.1726708259
4334 40.44991373 41.05033774 8.7102627815 0.2121849237
5162 40.45440709 40.9929568271 8.2624477928 0.2015577414
5048 39.89000748 40.3308767357 11.0199899828 0.2732395344
2363 39.94545778 40.1152579819 6.6177263936 0.1649678134
280 39.5619946 39.5341268065 7.3717432763 0.1864653117
4207 38.2684727 39.4991498122 6.9528849981 0.1760261938
5505 39.67352888 38.9668291926 11.3348728596 0.2908851732
217 36.77649547 37.4492632177 6.4891284445 0.1732778668
836 36.43648963 37.0437210956 12.1307341233 0.3274707228
503 36.81699351 36.7802949819 7.9491833149 0.2161261436
1322 36.38199798 36.7254993257 8.5268395114 0.2321776332
4451 35.19372256 35.3483644749 9.807710599 0.2774586815
623 34.52165055 35.1189107974 7.930898959 0.2258298671
1648 35.50610406 35.0638323174 10.815198205 0.3084431304
51 34.66010328 34.6703806244 5.4485310273 0.157152328
122 34.32806143 33.5962803896 7.5092149942 0.223513285
115 31.91437124 31.3399395607 8.4108320311 0.2683742263
5212 30.01729221 30.4525516362 8.9862156315 0.2950890861
1701 29.87650404 30.3212455768 6.3833025833 0.2105224394
3357 29.17742219 29.6022237315 6.381280757 0.2155676146
1572 29.88934385 29.5148636895 7.882621955 0.2670729582
3996 29.80296599 29.071104692 12.1221539603 0.4169829144
2655 26.12997208 26.8414199039 8.2799141902 0.3084752677
3278 27.75400612 26.676383757 8.7090459236 0.3264702593
2605 26.77170149 26.4416718205 7.2093344642 0.2726504781
2914 25.16358084 25.6405460981 8.2266061339 0.3208436397
5536 25.12712518 25.537683706 8.9692243899 0.3512152666
108 25.12900331 24.9994393089 8.1059495087 0.3242452524
4977 23.84091367 24.1678220977 8.8309117942 0.3653995697
931 20.64386303 20.6395850124 9.7862519781 0.4741496485
3284 20.6263851 20.3004828941 7.7358872421 0.3810691244
5667 20.24853487 20.2012572648 10.5728126478 0.5233739915
188 19.63432177 19.5009951172 8.527091207 0.4372644142
5692 17.52522898 16.9741593261 9.9533189003 0.5863806689
1700 15.35451961 15.0093164719 7.5208523959 0.5010789405
4010 12.26210563 13.9952121466 9.8487154699 0.7037203414
1706 12.6972477 11.7147928015 6.1811481569 0.5276361487
3103 12.14379904 11.6822069225 8.4008681879 0.7191165371
378 11.36567533 11.6581748916 8.2483175766 0.7075136248
3238 8.946537399 9.2298154231 9.6683698675 1.0475149745
5581 9.500192257 8.7380812257 8.2123397521 0.9398333044
5464 4.214298451 5.4505495437 7.2289498778 1.326279088
41 5.007828439 4.3002816244 9.0353666405 2.1011104457
2220 4.381189923 4.2360658386 6.880055327 1.6241615662
4364 4.923793169 3.504087428 8.6917749423 2.4804674886
1089 1.005273551 0.9765385053 6.9399339807 7.1066670109
691 -1.731531162 -1.2995295456 11.9708242834 9.2116599609


Original OPR is just copied straight from Ether's OPR.csv; Mean OPR is the mean OPR that a team received across the 200 iterations; Standard Deviation is the standard deviation of all the OPRs a team received; and the final column is the standard deviation divided by the Mean OPR. The data is sorted by Mean OPR.

In terms of whether this is a valid way of looking at it, I'm not sure--the results seem to have some meaning, but I'm not sure how much of it is just that only looking at 200 scores is even worse than just 254, or if there is something more meaningful going on.

*Using python's random.sample() function. This means that I did nothing to prevent duplicate runs (which are extremely unlikely; 254 choose 200 is ~7.2 * 10^55) and nothing to ensure that a team didn't "play" <3 times in the selection of 200 scores.
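
For anyone who wants to reproduce this, a rough sketch of the subsampling procedure (the defaults mirror the 200-of-254 scores and 200 iterations described above; like the original run, nothing here stops a team from appearing only a handful of times in a given subsample):

import numpy as np

def subsample_opr_spread(A, b, n_keep=200, n_iter=200, seed=0):
    """Per-team mean and standard deviation of OPR over random subsamples."""
    rng = np.random.default_rng(seed)
    m, t = A.shape
    samples = np.empty((n_iter, t))
    for i in range(n_iter):
        rows = rng.choice(m, size=n_keep, replace=False)  # keep n_keep scores
        x, *_ = np.linalg.lstsq(A[rows], b[rows], rcond=None)
        samples[i] = x
    return samples.mean(axis=0), samples.std(axis=0)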

wgardner
16-05-2015, 18:07
This is interesting but not what I'm looking for.

The question in this thread is how (or if) a standard, textbook, widely used, statistically valid "standard error" (as mentioned by Citrus Dad and quoted in the original post of this thread) can be computed for OPR from official FRC qual match results data, unsupplemented by manual scouting data or any other data.




I guess I'm not sure how you're defining "standard error." I assume you're trying to get some confidence on the OPR value itself (not in how well the OPR can predict match results, which is the other error I referred to previously).

The method I propose above gives a standard deviation measure on how much a single match changes a team's OPR. I would think this is something like what you want. If not, can you define what you're looking for more precisely?

Also, rather than taking 200 of 254 matches and looking at the standard deviation of all OPRs, I suggest just removing a single match (e.g., compute OPR based on 253 of the 254 matches) and looking at how that removal affects only the OPRs of the teams involved in the removed match.

So if you had 254 matches in a tournament, you'd compute 254 different sets of OPRs (1 for each possible match removal) and then look at the variability of the OPRs only for the teams involved in each specific removed match.

This only uses the actual qualification match results, no scouting or other data as you want.

wgardner
16-05-2015, 18:24
And just to make sure I'm being clear (because I fear that I may not be):

Let's say that team 1234 played in a tournament and was involved in matches 5, 16, 28, 39, 51, and 70.

You compute team 1234's OPR using all matches except match 5. Say it's 55.
Then you compute team 1234's OPR using all matches except match 16. Say it's 60.
Keep repeating this, removing each of that team's matches, which will give you 6 different OPR numbers. Let's say that they're 55, 60, 50, 44, 61, and 53. Then you can compute the standard deviation of those 6 numbers to give you a confidence on what team 1234's OPR is.

Of course, you can do this for every team in the tournament and get team-specific OPR standard deviations and an overall tournament OPR standard deviation.

Team 1234 may have a large standard deviation (because maybe 1/3 of the time they always knock over a stack in the last second) while team 5678 may have a small standard deviation (because they always contribute the exactly same point value to their alliance's final score).

And hopefully the standard deviations will be lower in tournaments with more matches per team because you have more data points to average.

Ether
16-05-2015, 18:29
I guess I'm not sure how you're defining "standard error."

I am not defining "standard error".

I am asking you (or anyone who cares to weigh in) to pick a definition from an authoritative source and use that definition to compute said standard errors of the OPRs (or state why not):

for those of you who answered "yes":

Pick an authoritative (within the field of statistics) definition for standard error, and compute that "standard error" for each Team's OPR for the attached example (http://www.chiefdelphi.com/forums/attachment.php?attachmentid=18999&d=1431728628).

... and for those of you who think the answer is "no", explain why none of the well-defined "standard errors" (within the field of statistics) can be meaningfully applied to the example data (http://www.chiefdelphi.com/forums/attachment.php?attachmentid=18999&d=1431728628) (provided in the linked post (http://www.chiefdelphi.com/forums/showpost.php?p=1482422&postcount=18)) in a statistically valid way.

Ether
16-05-2015, 18:39
I assume you're trying to get some confidence on the OPR value itself

No. I am not trying to do this. I will try to be clearer:

Citrus Dad asked why no-one ever reports "the" standard error for the OPRs.

"Standard Error" is a concept within the field of statistics. There are several well-defined meanings depending on the context.

So what I am trying to do is this: have a discussion about what "the" standard error might mean in the context of OPR.

Ether
16-05-2015, 18:47
And just to make sure I'm being clear (because I fear that I may not be)

No, your original post was quite clear. And interesting. But it's not what I am asking about.

Ether
16-05-2015, 19:11
@ Citrus Dad:

If you are reading this thread, would you please weigh in here and reveal what you mean by "the standard errors" of the OPRs, and how you would compute them, using only the data in the example I posted (http://www.chiefdelphi.com/forums/showpost.php?p=1482422&postcount=18)?

Also, what assumptions do you have to make about the data and the model in order for the computed standard errors to be statistically valid/relevant/meaningful, and what is the statistical meaning of those computed errors?

Oblarg
17-05-2015, 03:20
So what I am trying to do is this: have a discussion about what "the" standard error might mean in the context of OPR.

Let us assume that the model OPR uses is a good description of FRC match performance - that is, match scores are given by a linear sum of team performance values, and each team's performance in a given match is described by a random variable whose distribution is identical between matches.

OPR should then yield an estimate of the mean of this distribution. An estimate of the standard deviation can be obtained, as mentioned, by taking the RMS of the residuals.

To approximate the standard deviation of the mean (which is what is usually meant by "standard error" of these sorts of measurements), one would then divide this by sqrt(n) (for those interested in a proof of this, simply consider the fact that when summing random variables, variances add), where n is the number of matches used in the team's OPR calculation.

This, of course, fails if the assumptions we made at the outset aren't good (e.g. OPR is not a good model of team performance). Moreover, even if the assumptions hold, if the distribution of the random variable describing a team's performance in a given match is sufficiently wonky that the distribution of the mean is not particularly Gaussian then one is fairly limited in the conclusions they can draw from the standard deviation, anyway.
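
Under those assumptions the estimate is only a couple of lines on top of the usual fit; a rough sketch (the per-team counts are read off the incidence matrix, and the later posts in this thread examine how the schedule really couples the teams):

import numpy as np

def naive_standard_errors(A, b):
    """RMS residual divided by sqrt(n_i), n_i = alliance scores team i played in."""
    opr, *_ = np.linalg.lstsq(A, b, rcond=None)
    rms = np.sqrt(np.mean((b - A @ opr) ** 2))  # estimate of per-score std dev
    n_played = A.sum(axis=0)                    # matches per team
    return rms / np.sqrt(n_played)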

wgardner
17-05-2015, 05:51
To approximate the standard deviation of the mean (which is what is usually meant by "standard error" of these sorts of measurements), one would then divide this by sqrt(n) (for those interested in a proof of this, simply consider the fact that when summing random variables, variances add), where n is the number of matches used in the team's OPR calculation.


I am interested in a proof of this, because I don't think the normal assumptions hold. Can you explain this more in the full context of how OPR is computed? [Edit: I spent more time trying to derive this whole thing. See my next posts for an attempt at the derivation].

What you say holds if one is taking a number of independent, noisy measurements of a value and computing the mean of the measurements as the estimate of the underlying value. So that would work if OPR were computed by simply averaging the match scores for a team (and dividing by 3 to account for 1/3 of the match score being due to each team's contribution).

But that's not the way OPR is computed at all. It's computed using linear regressions and all of the OPRs for all of the teams are computed simultaneously in one big matrix operation.

For example, it isn't clear to me what n should be. You say "n is the number of matches used in the team's OPR calculation." But all OPRs are computed at the same time using all of the available match data. Does n count matches that a team didn't play in, but that are still used in the computation? Is n the number of matches a team has played? Or the total matches? OPR can be computed based on whatever matches have already occurred at any time. So if some teams have played 4 matches and some have played 5, it would seem like the OPRs for the teams that have played fewer matches should have more uncertainty than the OPRs for the teams that have played more. And the fact that the computation is all intertwined and that the OPRs for different teams are not independent (e.g., if one alliance has a huge score in one match, that affects 3 OPRs directly and the rest of them indirectly through the computation) seems to make the standard assumptions and arguments suspect.

Thoughts?

wgardner
17-05-2015, 06:34
No. I am not trying to do this. I will try to be clearer:

Citrus Dad asked why no-one ever reports "the" standard error for the OPRs.

"Standard Error" is a concept within the field of statistics. There are several well-defined meanings depending on the context.


I guess my uncertainty is not about what "standard error" means but what you mean by "the OPRs."

Wikipedia gives the following definition: "The standard error (SE) is the standard deviation of the sampling distribution of a statistic, most commonly of the mean. The term may also be used to refer to an estimate of that standard deviation, derived from a particular sample used to compute the estimate."

So my real question is perhaps what is the statistic, or what are we trying to estimate, or what are we "computing the mean" of? At each tournament, we have many matches and can get a standard error for the predicted match results because we have many predicted match results and can compute the standard deviation of the distribution of the errors in the predictions. But each tournament only provides one single OPR estimate for each team. It's tough to compute a standard error on these OPR estimates based on this 1 sample because you only have the 1 data point.

If OPRs are a value that we estimate for each team at each tournament and we expect them to stay the same from tournament to tournament (stop laughing now), we can compute the standard deviation in each team's independent OPR values across all of the tournaments in a season to get a standard error for those values. Then you could use the standard error to estimate the distribution of a team's OPR in future tournaments based on their previous tournament results. And I suppose you could also view this same standard error to estimate how much a team's OPR might vary if the same tournament was run again and we had a different set of random match outcomes.

But I'm guessing that what you're really interested in is: if the same tournament were run multiple times and if the match results varied randomly as we modeled (yeah, yeah, and if everybody had a can opener), what would be the standard error of the OPR estimates? Or in other words, what if the same teams with the same robots and the same drivers played in 100 tournaments back-to-back and we computed the OPR for each team for all 100 tournaments, what would be the standard error for these 100 different OPR estimates?

If this is the question you're interested in, then now we have a statistic that we can compute the standard error for. Let's look into this.

Let's let the OPR vector for all of the teams be called O (t x 1 vector, where t is the # of teams).
Let's let the match scores be called M (m x 1, where m is the number of scores or 2x the number of actual matches).

So we're modeling the matches as:

M = A O + N

where A is an m x t matrix with the i,jth element equal to 1 if team j was a member of the alliance leading to the ith match score and 0 otherwise, and where N is an m x 1 noise vector with variance equal to the variance of the prediction residual for each match score. Let's call this variance sig^2.

Given M, the least squares estimate for OPR is calculated as
Oest = Inv(A' A) A' M = Inv(A' A) A' (A O + N) = O + Inv(A' A) A' N
As N is zero-mean, Oest has mean O (which we want) and variance equal to the variance of the second term, Inv(A' A) A' N.
Note that Inv(A' A) A' is a t x m matrix that is solely a function of the match schedule.

The variance of the estimated OPR for the ith team is the variance of the ith element of Oest, which is sig^2 * (sum of the squared values of the elements in the ith row of Inv(A' A) A' ). This can be different for each team if the match schedule represented in A is unbalanced (e.g., if when a live OPR is being computed during a tournament, some teams have played more matches than others). I would hope for a complete tournament with a balanced match schedule that these variances would be equal or very nearly so. But it would be interesting to compute Inv (A' A) A' for a tournament and see if the sum of the squared values of each row are truly the same.

Then finally the standard error for each estimate is just the standard deviation, or the square root of the variance we just computed.

To summarize the whole thing:

If a tournament has random match scores created by

M = A O + N

where N is zero mean and variance = sig^2, and if you estimate the underlying O values by computing Oest = Inv (A' A) A' M, then the ith team's OPR estimate which is the ith value of the Oest vector will have mean equal to the ith value of the O vector, will have variance = sig^2 * (sum of the squared values of the elements in the ith row of the matrix Inv(A' A) A'), and thus will have a "standard error" equal to the square root of this variance.

To estimate this for a particular tournament, you first compute the OPR estimate O and compute sig^2 as the variance of the regression error in the predicted match results. Then you compute the matrix Inv(A' A) A' from the match schedule and then finally compute the standard errors as described.

Too much for a Sunday morning. Thoughts?
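
A direct transcription of that recipe into numpy, for anyone who wants to try it on real data (the row sums of squares of Inv(A'A) A' reduce to the diagonal of Inv(A'A); sig^2 follows the wording above, though a textbook OLS estimate would divide the residual sum of squares by m - t instead):

import numpy as np

def opr_standard_errors(A, b):
    """Per-team standard errors following the derivation above.

    A : (m, t) incidence matrix, b : (m,) alliance scores.
    Returns (Oest, se) where se[i] = sqrt(sig^2 * sum of squared elements
    of row i of Inv(A'A) A').
    """
    AtA_inv = np.linalg.inv(A.T @ A)
    P = AtA_inv @ A.T              # t x m, depends only on the match schedule
    Oest = P @ b                   # the usual least-squares OPR

    resid = b - A @ Oest
    sig2 = np.var(resid)           # variance of the match-score residuals

    row_sq = np.sum(P ** 2, axis=1)   # equals np.diag(AtA_inv)
    return Oest, np.sqrt(sig2 * row_sq)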

wgardner
17-05-2015, 07:20
And a follow up:

Take the above derivation, but let's pretend that each match score is only the result of 1 team's efforts, not 3. So in this case, each row of A would only have a single 1 in it, not 3.

In this pretend case, the OPR IS exactly just computing the average of that team's match scores(!). A' A is diagonal and the diagonal elements are the number of matches that a team has played, so Inv (A' A) is diagonal with diagonal elements that are 1/ the number of matches that a team has played.

Then the i,jth elements of Inv (A' A) A' are just 1/the number of matches a team has played if team i played in match j or 0 otherwise.

The variance of the Oest values in this pretend case is the variance of the prediction residual / number of matches that a team has played, and thus the standard error of the Oest value is the standard error of the match predictions divided by the square root of the number of matches that a team has played.

So this connects Oblarg's statements to the derivation. If match results were solely the result of one team's efforts, then the standard error of the OPR would just be the standard error of the match prediction / sqrt(n), where n is the number of matches that a team has played. But match results aren't solely the result of one team's efforts, so the previous derivation holds in the more complicated, real case.

Oblarg
17-05-2015, 10:27
I guess my uncertainty is not about what "standard error" means but what you mean by "the OPRs."

Wikipedia gives the following definition: "The standard error (SE) is the standard deviation of the sampling distribution of a statistic, most commonly of the mean. The term may also be used to refer to an estimate of that standard deviation, derived from a particular sample used to compute the estimate."

So my real question is perhaps what is the statistic, or what are we trying to estimate, or what are we "computing the mean" of? At each tournament, we have many matches and can get a standard error for the predicted match results because we have many predicted match results and can compute the standard deviation of the distribution of the errors in the predictions. But each tournament only provides one single OPR estimate for each team. It's tough to compute a standard error on these OPR estimates based on this 1 sample because you only have the 1 data point.

If OPRs are a value that we estimate for each team at each tournament and we expect them to stay the same from tournament to tournament (stop laughing now), we can compute the standard deviation in each team's independent OPR values across all of the tournaments in a season to get a standard error for those values. Then you could use the standard error to estimate the distribution of a team's OPR in future tournaments based on their previous tournament results. And I suppose you could also view this same standard error to estimate how much a team's OPR might vary if the same tournament was run again and we had a different set of random match outcomes.

But I'm guessing that what you're really interested in is: if the same tournament were run multiple times and if the match results varied randomly as we modeled (yeah, yeah, and if everybody had a can opener), what would be the standard error of the OPR estimates? Or in other words, what if the same teams with the same robots and the same drivers played in 100 tournaments back-to-back and we computed the OPR for each team for all 100 tournaments, what would be the standard error for these 100 different OPR estimates?

If this is the question you're interested in, then now we have a statistic that we can compute the standard error for. Let's look into this.

Let's let the OPR vector for all of the teams be called O (t x 1 vector, where t is the # of teams).
Let's let the match scores be called M (m x 1, where m is the number of scores or 2x the number of actual matches).

So we're modeling the matches as:

M = A O + N

where A is an m x t matrix with the i,jth element equal to 1 if team j was a member of the alliance leading to the ith match score and 0 otherwise, and where N is an m x 1 noise vector with variance equal to the variance of the prediction residual for each match score. Let's call this variance sig^2.

Given M, the least squares estimate for OPR is calculated as
Oest = Inv(A' A) A' M = Inv(A' A) A' (A O + N) = O + Inv(A' A) A' N
As N is zero-mean, Oest has mean O (which we want) and variance equal to the variance of the second term, Inv(A' A) A' N.
Note that Inv(A' A) A' is a t x m matrix that is solely a function of the match schedule.

The variance of the estimated OPR for the ith team is the variance of the ith element of Oest, which is sig^2 * (sum of the squared values of the elements in the ith row of Inv(A' A) A' ). This can be different for each team if the match schedule represented in A is unbalanced (e.g., if when a live OPR is being computed during a tournament, some teams have played more matches than others). I would hope for a complete tournament with a balanced match schedule that these variances would be equal or very nearly so. But it would be interesting to compute Inv (A' A) A' for a tournament and see if the sum of the squared values of each row are truly the same.

Then finally the standard error for each estimate is just the standard deviation, or the square root of the variance we just computed.

To summarize the whole thing:

If a tournament has random match scores created by

M = A O + N

where N is zero mean and variance = sig^2, and if you estimate the underlying O values by computing Oest = Inv (A' A) A' M, then the ith team's OPR estimate which is the ith value of the Oest vector will have mean equal to the ith value of the O vector, will have variance = sig^2 * (sum of the squared values of the elements in the ith row of the matrix Inv(A' A) A'), and thus will have a "standard error" equal to the square root of this variance.

To estimate this for a particular tournament, you first compute the OPR estimate O and compute sig^2 as the variance of the regression error in the predicted match results. Then you compute the matrix Inv(A' A) A' from the match schedule and then finally compute the standard errors as described.

Too much for a Sunday morning. Thoughts?

This appears to be correct, with one minor quibble: why not just write M = AO where the elements of O are random variables with the appropriate variance? It doesn't really make it cleaner to split it into a zero-mean "noise variable" plus a flat mean.

wgardner
17-05-2015, 10:47
This appears to be correct, with one minor quibble: why not just write M = AO where the elements of O are random variables with the appropriate variance? It doesn't really make it cleaner to split it into a zero-mean "noise variable" plus a flat mean.

That would be different, I think. N is match noise and an m x 1 vector. If I understand your equation correctly, O would be the OPR random variable with mean of the "actual" OPR and some variance, but O is t x 1 and not m x 1, so I don't think they're the same. And the noise that the regression is computing is truly the noise to be expected in each match outcome, not the noise in the OPR estimates themselves. Or am I misunderstanding what you're saying?

Oblarg
17-05-2015, 11:12
That would be different, I think. N is match noise and an m x 1 vector. If I understand your equation correctly, O would be the OPR random variable with mean of the "actual" OPR and some variance, but O is t x 1 and not m x 1, so I don't think they're the same. And the noise that the regression is computing is truly the noise to be expected in each match outcome, not the noise in the OPR estimates themselves. Or am I misunderstanding what you're saying?

We need to distinguish, in our notation, between our model and our measurements.

What I'm saying is that our model is that M = AO, where M and O are both vectors whose elements are random variables. Writing O as a vector of flat means and adding a noise vector N doesn't really gain you anything - in our underlying model, the *teams* have fundamental variances, not the matches. The match variances can be computed from the variances of each team's O variable.

Now, we have the problem that we cannot directly measure the variance of each element of O, because the only residuals we can measure are total for each match (the elements of the "noise vector" N). However, we can do another linear least-squares fit to assign estimated variance values for each team, which I believe is precisely what your solution ends up doing.

wgardner
17-05-2015, 12:19
in our underlying model, the *teams* have fundamental variances, not the matches. The match variances can be computed from the variances of each team's O variable.

I view it differently: it's the matches that have the fundamental variances. I view the equations as estimating the variance of the estimates of the constant O vector due to the introduction of match noise, not as saying that the underlying O itself is a random variable.

Perhaps the truth of the matter is that there's variability in both: for example if a driver screws up or an autonomous run doesn't work quite perfectly, then I guess that's team-specific OPR variability, but if the litter gets thrown a particular random way that hinders performance or score, then that's match variability.

But I guess the bottom line is that if we're in agreement on the algorithm and the results of the equations, then it probably doesn't matter much if we think about the underlying process differently. :)

wgardner
17-05-2015, 13:15
One more thought on this:

I guess I wrote the equations the way I did because that's always the way that I've seen the linear regression derived in the first place.

Namely, the way you compute the OPR values is to view the problem as
M = A O + N
then form the squared error as
N' N = (M - A O)' (M - A O)
then compute the derivative of N' N with respect to O, set it to zero, and solve for O, which gives you Oest = Inv(A' A) A' M.

Is there a different way of expressing this derivation without resorting to a vector N of the errors that are being minimized?

Ether
17-05-2015, 13:42
Is there a different way of expressing this derivation without resorting to a vector N of the errors that are being minimized?

AO=M;
254 equations in 76 unknowns; (for the example I posted (http://www.chiefdelphi.com/forums/showpost.php?p=1482422&postcount=18))
system is overdetermined;
there is no exact solution for O;

A'AO=A'M;
76 equations in 76 unknowns;
Exact solution O for this system will be the least squares solution for the original 254x76 system.
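
In code the two formulations give the same vector; a minimal sketch, assuming A (254 x 76) and M (254 x 1) have already been built from the posted example data:

import numpy as np

def solve_opr(A, M):
    # Over-determined system A O = M: no exact solution, so take the
    # least-squares solution directly...
    O_lstsq, *_ = np.linalg.lstsq(A, M, rcond=None)

    # ...or form the 76 x 76 normal equations A'A O = A'M and solve exactly.
    O_normal = np.linalg.solve(A.T @ A, A.T @ M)

    # The two agree up to numerical precision.
    assert np.allclose(O_lstsq, O_normal)
    return O_normal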

Ether
17-05-2015, 13:45
Guys,

Can we all agree on the following?

Computing OPR, as done here on CD, is a problem in multiple linear regression (one dependent variable and 2 or more independent variables).

The dependent variable for each measurement is alliance final score in a qual match.

Each qual match consists of 2 measurements (red alliance final score and blue alliance final score).

If the game has any defense or coopertition, those two measurements are not independent of each other.

For Archimedes (http://www.chiefdelphi.com/forums/showpost.php?p=1482422&postcount=18), there were 127 qual matches, producing 254 measurements (alliance final scores).

Let [b] be the column vector of those 254 measurements.


For Archimedes, there were 76 teams, so there are 76 independent dichotomous variables (each having value 0 or 1).

For each measurement, all the independent variables are 0 except for 3 of them which are 1.

Let [A] be the 254 by 76 matrix whose ith row is a vector of the values of the independent variables for measurement i.


Let [x] be the 76x1 column vector of model parameters. [x] is what we are trying to find.


[A][x]=[b] is a set of 254 simultaneous equations in 76 variables. The variables in those 254 equations are the 76 (unknown) model parameters in [x]. We want to solve that system for [x].
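
To make the construction concrete, here is an illustrative Octave-style sketch of how one measurement (one alliance score) adds a row to [A] and an entry to [b]. The variable names (teamList, allianceTeams, allianceScore) are made up for illustration only:

% teamList: 1 x 76 vector of the team numbers at the event
% allianceTeams: the 3 team numbers on one alliance in one qual match
% allianceScore: that alliance's final score
row = zeros(1, length(teamList));          % all 76 dichotomous variables start at 0
for k = 1:3
  row(teamList == allianceTeams(k)) = 1;   % set the 3 variables for this alliance's teams to 1
end
A = [A; row];                              % append the row to [A]
b = [b; allianceScore];                    % append the measurement to [b]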

Since there are more equations (254) than unknowns (76), the system is overdetermined, and there is no exact solution for [x].

Since there's no exact solution for [x], we use least squares to find the "best" solution1. The solution will be a 76x1 column vector of Team OPR. Let that solution be known as [OPR].

Citrus Dad wants to know "the standard error" of each element in [OPR].

Are we in agreement so far? If so, I will continue.


1Yes, I know there are other ways to define "best", but every OPR computation I've ever seen on CD uses least squares, so I infer that's what Citrus Dad had in mind.

wgardner
17-05-2015, 15:53
Citrus Dad wants to know "the standard error" of each element in [OPR].


You're still not answering the basic question. The standard error applies to multiple measurements of a particular value. Which set of multiple measurements are you proposing to use to generate your standard error value?

For 1 tournament, you have a single estimate of each element of OPR. There is no standard error.

If you have multiple tournaments, then you will have multiple estimates of the underlying OPR and can compute the standard error of these estimates.

If you use the baseline model to create a hypothetical set of random tournaments as I described, then you can compute the standard error of these estimates from the hypothetical set of random tournaments.

wgardner
17-05-2015, 16:01
Is there a different way of expressing this derivation without resorting to a vector N of the errors that are being minimized?

AO=M;
254 equations in 76 unknowns; (for the example I posted (http://www.chiefdelphi.com/forums/showpost.php?p=1482422&postcount=18))
system is overdetermined;
there is no exact solution for O;

A'AO=A'M;
76 equations in 76 unknowns;
Exact solution O for this system will be the least squares solution for the original 254x76 system.


OK, but this doesn't derive that this is the least squares solution: it merely states the result without explaining where it came from. The only derivations I've ever seen start with a formulation like I laid out and find the O that minimizes the squared error term N' N by taking the derivative of N' N with respect to O, setting it equal to zero, and solving for O. Is there another way to show this derivation without N? That was my question, as Oblarg was asking why I bothered to introduce N in the first place.

Citrus Dad
17-05-2015, 17:08
For 1 tournament, you have a single estimate of each element of OPR. There is no standard error.

There is a standard error for the OPR estimate for a single tournament. That standard error tells you the probability range that your estimate falls within, given some fundamental assumptions. The normality assumption derives from the Central Limit Theorem. The OPR is essentially an estimate of the average point contribution across all of the matches in the tournament. The OPR itself assumes that in a perfect world the robot would contribute the same in each match, which of course isn't true. The variation in the contribution in each match (which we don't always observe directly) is the source of the standard error.

Ether
17-05-2015, 17:11
OK, but this doesn't derive that this is the least squares solution: it merely states the result without explaining where it came from.

See attached.

Do you agree with this post (http://www.chiefdelphi.com/forums/showpost.php?p=1482647&postcount=39)?

Citrus Dad
17-05-2015, 17:15
I guess my uncertainty is not about what "standard error" means but what you mean by "the OPRs."

But I'm guessing that what you're really interested in is: if the same tournament were run multiple times and if the match results varied randomly as we modeled (yeah, yeah, and if everybody had a can opener), what would be the standard error of the OPR estimates? Or in other words, what if the same teams with the same robots and the same drivers played in 100 tournaments back-to-back and we computed the OPR for each team for all 100 tournaments, what would be the standard error for these 100 different OPR estimates?

Too much for a Sunday morning. Thoughts?

The OPR measures the expected contribution per MATCH. We usually compute it for a tournament as representing the average contribution per match. So if we run the same match over and over, we would expect to see a similar OPR. The SE tells us the probability range that we expect the OPR to fall in if we kept running that match over and over. Confidence intervals (e.g. 95%) tell us that we have 95% confidence that the OPR will fall into this set range if we ran the same match (with complete amnesia by the participants) over and over.

wgardner
17-05-2015, 17:17
See attached.

Yes, and r_x in that post is exactly the N that I described.

wgardner
17-05-2015, 17:23
The OPR measures the expected contribution per MATCH. We usually compute it for a tournament as representing the average contribution per match. So if we run the same match over and over, we would expect to see a similar OPR. The SE tells us the probability range that we expect the OPR to fall in if we kept running that match over and over. Confidence intervals (e.g. 95%) tell us that we have 95% confidence that the OPR will fall into this set range if we ran the same match (with complete amnesia by the participants) over and over.

OPR is not computed per match; it is computed per tournament (or at least based on a large number of matches).

We use OPR to estimate the score of an alliance in a match. Or to be even more precise, we compute the OPR values as the ones that result in the best linear prediction of the match results.

If we have an alliance run the same match over and over, we will see a variability in the match results and a variability in the prediction error we get when we subtract the actual match results from the OPR-based prediction. We can compute the standard error of this prediction error. This SE tells us the probability range that we would expect the match result to fall in, but doesn't tell us anything about the range that we would expect OPR estimates to fall in over a full tournament.

I'm confused by this sentence: "So if we run the same match over and over, we would expect to see a similar OPR." ???

wgardner
17-05-2015, 17:25
Do you agree with this post (http://www.chiefdelphi.com/forums/showpost.php?p=1482647&postcount=39)?




BTW, I'm not trying to dodge the question. I can't say if I agree with the post until you answer my bigger picture question about what exactly you're computing the standard error of.

wgardner
17-05-2015, 17:33
There is a standard error for the OPR estimate for a single tournament. That standard error tells you the probability range that your estimate falls within, given some fundamental assumptions. The normality assumption derives from the Central Limit Theorem. The OPR is essentially an estimate of the average point contribution across all of the matches in the tournament. The OPR itself assumes that in a perfect world the robot would contribute the same in each match, which of course isn't true. The variation in the contribution in each match (which we don't always observe directly) is the source of the standard error.

OK, so can we agree that you're looking for the standard error of the OPR estimate (not the standard error of the match predictions or match results)?

Can we agree that, if we have multiple measurements of a value and make fundamental assumptions that these multiple measurements are representative of the underlying distribution, then we can model the underlying distribution and look at the standard error of the estimates assuming they are computed from the underlying distribution?

If you're willing to accept this, then I humbly suggest that my long derivation from this morning is what you're looking for.

One topic of confusion is this statement: "The OPR is essentially an estimate of the average point contribution across all of the matches in the tournament."

I will agree that if you truly computed the average contribution across all matches by just computing the average of the match results, then you could simply compute the standard error because you have multiple estimates of the average: the individual match results.

But in fact OPR is not computed by averaging the match results. It's a single, simultaneous joint optimization of ALL OPRs for ALL teams at the same time. That's why, for example, if a new match is played and new match results are provided, ALL of the OPRs change, not just the OPRs for the teams in that match.

We don't actually have a bunch of OPR estimates that we just average together to compute our final estimate. That's the rub. If we did, we could compute the standard error of these separate estimates. But in fact, we don't have them: only the single estimate computed from the whole tournament.

wgardner
17-05-2015, 17:43
I know we're beating a dead horse here. Here's my attempt at trying to create a simpler example of my confusion.

Let's say we have a single bag of 10 apples. We compute the average weight of the 10 apples. Let's say we want to know the confidence we have in the estimate of the average weight of the apples.

I claim there are a few ways to do this.

Ideally we'd get some more bags of apples, compute the average weights of the apples in each bag, and compute the standard error of these different measurements.

Or, we could compute the average weight of the 10 apples we know, compute the standard deviation of the weights of the 10 apples, then assume that the true distribution of the weights of all apples has this average weight and standard deviation. If we buy into this assumed distribution, then we can look at the standard error of all estimates of the average weight of 10 apples as if each set of 10 apples was actually taken from this modeled distribution.

Does this make sense? Are there other ways y'all are thinking about this?

To relate this to OPRs, I'd claim that the "get lots of bags of apples" approach is like the "get lots of results from different tournaments and see what the standard error in those OPR estimates is".

I'd claim that the "model the underlying distribution and then look at how the estimates will vary if the data is truly taken from this distribution" approach is like what I derived this morning.

Ether
17-05-2015, 18:35
BTW, I'm not trying to dodge the question. I can't say if I agree with the post until you answer my bigger picture question about what exactly you're computing the standard error of.

Do you agree with everything in the post except for the sentence

"Citrus Dad wants to know "the standard error" of each element in [OPR]"

?

Ether
17-05-2015, 18:39
There is a standard error for the OPR estimate for a single tournament.

I posted a simple example here (http://www.chiefdelphi.com/forums/showpost.php?p=1482422&postcount=18). Please compute your standard error for that example.

I think that would help greatly to clear the confusion about what you mean.

wgardner
17-05-2015, 18:46
Do you agree with everything in the post except for the sentence

"Citrus Dad wants to know "the standard error" of each element in [OPR]"

?




Yep. The rest of the post describes how the OPR is computed for the data set you're using, and I have no qualms about that.

wgardner
17-05-2015, 18:48
I posted a simple example here (http://www.chiefdelphi.com/forums/showpost.php?p=1482422&postcount=18). Please compute your standard error for that example.

I think that would help greatly to clear the confusion about what you mean.




Yeah, I can do this but it might take me a day or two as I don't have the environment set up to do the matrix operations quickly.

I do compute OPRs and match predictions in Java in my Android apps available on Google Play (Watch FTC Tournament and FTC Online), but it may take me a few days to find the time either to translate the new equations from this morning's derivation into the Java or to bring the code up in Octave, Scilab, or something similar, as I haven't had to do that for a while.

Ether
17-05-2015, 18:53
Do you agree with everything in the post except for the sentence

"Citrus Dad wants to know "the standard error" of each element in [OPR]"

?

Yep. The rest of the post describes how the OPR is computed for the data set you're using, and I have no qualms about that.

Well I would argue the sentence "Citrus Dad wants to know "the standard error" of each element in [OPR]" is true, but neither you nor I know what he means (yet).

I am hoping he will do the specific computation for the example I posted; I think that will make things clear.

wgardner
17-05-2015, 19:30
Well I would argue the sentence "Citrus Dad wants to know "the standard error" of each element in [OPR]" is true, but neither you nor I know what he means (yet).

I am hoping he will do the specific computation for the example I posted; I think that will make things clear.

Fair enough. I didn't want to agree to the statement because it would seem to imply that I understood what you meant by the standard error of each element in [OPR], which I don't. :)

Ether
17-05-2015, 19:31
Yes, and r_x in that post is exactly the N that I described.

That's where the similarity ends though.

Your proof starts with N, squares it, finds the least-squares minimum, and shows that it's the solution to the normal equations.

The proof I posted starts with the solution to the normal equations, and shows that it minimizes (in the least-squares sense) the residuals of the associated overdetermined system.

Ether
17-05-2015, 19:38
Well I would argue the sentence "Citrus Dad wants to know "the standard error" of each element in [OPR]" is true, but neither you nor I know what he means (yet).

Fair enough. I didn't want to agree to the statement because it would seem to imply that I understood what you meant by the standard error of each element in [OPR], which I don't. :)

Perhaps it would have been better had I repeated the Citrus Dad quote from post #1 (http://www.chiefdelphi.com/forums/showpost.php?p=1481777&postcount=1) verbatim, instead of paraphrasing:

I'm thinking of the parameter standard errors, i.e., the error estimate around the OPR parameter itself for each team. That can be computed from the matrix--it's a primary output of any statistical software package.

wgardner
17-05-2015, 20:21
I'm thinking of the parameter standard errors, i.e., the error estimate around the OPR parameter itself for each team. That can be computed from the matrix--it's a primary output of any statistical software package.

Perhaps it would have been better had I repeated the Citrus Dad quote from post #1 (http://www.chiefdelphi.com/forums/showpost.php?p=1481777&postcount=1) verbatim, instead of paraphrasing:


What is "the matrix" (no, not trying to make a Keanu Reeves reference) and if the standard error is a primary output of any statistical package, what is/are the values for the Archimedes data you've provided?

Ether
17-05-2015, 20:40
What is "the matrix"

I don't know either. Hopefully Citrus Dad will answer that question when he responds to this post (http://www.chiefdelphi.com/forums/showpost.php?p=1482695&postcount=51).

and if the standard error is a primary output of any statistical package, what is/are the values for the Archimedes data you've provided?

That is the question I am asking here (http://www.chiefdelphi.com/forums/showpost.php?p=1482700&postcount=54) and here (http://www.chiefdelphi.com/forums/showpost.php?p=1482695&postcount=51) and here (http://www.chiefdelphi.com/forums/showpost.php?p=1482799&postcount=61).

wgardner
18-05-2015, 07:24
Scilab code is in the attachment.

Note that there is a very real chance that there's a bug in the code, so please check it over before you trust anything I say below. :)

----------------------

Findings:

stdev(M)=47 (var = 2209) Match scores have a standard deviation of 47 points.

stdev(M - A O)=32.5 (var = 1060) OPR prediction residuals have a standard deviation of about 32 points.

So OPR linear prediction can account for about 1/2 the variance in match outcomes (1-1060/2209).
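
Those numbers fall straight out of the fitted model; as an Octave-style sketch (assuming A, M, and the fitted OPR vector O are already in the workspace):

resid = M - A*O;               % per-alliance prediction residuals
std(M)                         % ~47 for this data set, as reported above
std(resid)                     % ~32.5 for this data set
1 - var(resid)/var(M)          % fraction of match-score variance the OPR model accounts for (~0.5)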


What is the standard error for the OPR estimates (assuming the modeled distribution is valid) after the full tournament?

about 11.4 per team. Some teams have a bit more or a bit less, but the standard deviation of this was only 0.1 so all teams were pretty close to 11.4.

To be as clear as I can about this: This says that if we compute the OPRs based on the full data set, compute the match prediction residuals based on the full data set, then run lots of different tournaments with match results generated by adding the OPRs for the teams in the match and random match noise with the same match noise variance, and then compute the OPR estimates for all of these different randomly generated tournaments, we would expect to see the OPR estimates themselves have a standard deviation around 11.4.

If you choose to accept these assumptions, you might be willing to say that the OPR estimates have a 1 std-deviation confidence of +/- 11.4 points.
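
In code form, the thought experiment described above is roughly the following Octave-style sketch. It is not the attached Scilab code; the number of simulated tournaments and the Gaussian shape of the match noise are assumptions on my part:

m = size(A, 1);  t = size(A, 2);              % 254 alliance scores, 76 teams for Archimedes
O     = A \ M;                                % OPR fit from the real tournament
resid = M - A*O;
sig2  = (resid'*resid) / (m - t);             % residual (match-noise) variance estimate
Nsim  = 1000;                                 % number of simulated tournaments
Osim  = zeros(t, Nsim);
for k = 1:Nsim
  Msim      = A*O + sqrt(sig2)*randn(m, 1);   % same schedule, same OPRs, fresh match noise
  Osim(:,k) = A \ Msim;                       % re-estimate OPRs from the simulated tournament
end
SE_opr = std(Osim, 0, 2);                     % per-team standard error of the OPR estimates (~11.4 here)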


How does the standard error of the OPR (assuming the modeled distribution is valid) decrease as the number of matches increases?

I ran simulations through only the first 3 full matches per team up to 10 full matches per team, or with match totals of:
76, 102, 128, 152, 178, 204, 228, 254


sig^2 (the variance of the per-match residual prediction error) from 3 matches per team to 10 matches per team was
0.0, 19.7, 26.1, 29.3, 30.2, 30.8, 32.5, 32.5

(With only 3 matches played per team, the "least squares" solution can perfectly fit the data, since we have 76 measurements and 76 parameters. With 4 or 5 matches per team, the model is still a bit "overfit", since we have only 102 or 128 measurements being fit by 76 parameters.)


mean (StdErr of OPR) from 3 matches per team to 10 is
0.0, 16.5, 16.3, 15.2, 13.6, 12.8, 12.2, 11.4

(so the uncertainty in the OPR estimates decreases as the number of matches increases, as expected)

stdev (StdErr of OPR) from 3 matches per team to 10 is
0.0, 1.3, 0.6, 0.4, 0.3, 0.2, 0.1, 0.1

(so there isn't much variability in team-to-team uncertainty in the OPR measurements, though the uncertainty does drop as the number of matches increases. There could be more variability if we only ran a number of matches where, say, 1/2 the teams played 5 matches and 1/2 played 4?)


And for the record, sqrt(sig^2/matchesPerTeam) was
0.0, 9.9, 12.0, 12.0, 11.4, 11.1, 10.8, 10.3

(compare this with "mean (StdErr of OPR)" above. As the number of matches per team grows, the OPR will eventually approach the simple average match score per team/3 and then these two values should approach each other. They're in the same general range but still apart by 1.1 (or 11.4 - 10.3) with 10 matches played per team.)

Ether
18-05-2015, 11:09
What is the standard error for the OPR estimates (assuming the modeled distribution is valid) after the full tournament?

about 11.4 per team. Some teams have a bit more or a bit less, but the standard deviation of this was only 0.1 so all teams were pretty close to 11.4.

I sincerely appreciate the time and effort you spent on this.

I could be wrong, but I doubt this is what Citrus Dad had in mind.

Can we all agree that 0.1 is real-world meaningless? There is without a doubt far more variation in consistency of performance from team to team.

Manual scouting data would surely confirm this.


@ Citrus Dad: you wrote:
I'm thinking of the parameter standard errors, i.e., the error estimate around the OPR parameter itself for each team. That can be computed from the matrix--it's a primary output of any statistical software package. ... so would you please compute the parameter standard errors for this example (http://www.chiefdelphi.com/forums/showthread.php?p=1482422#post1482422) using your statistical software package and post results here? Thank you.

Basel A
18-05-2015, 11:35
Scilab code is in the attachment.

Note that there is a very real chance that there's a bug in the code, so please check it over before you trust anything I say below. :)

/snip


Very interesting results. I wonder if you could run the same analysis on the 2015 Waterloo Regional. The reason I'm asking for that event in particular is because it had the ideal situation for OPR: high matches per team (13) and a small number of teams (30).

wgardner
18-05-2015, 13:40
Can we all agree that 0.1 is real-world meaningless?

There is without a doubt far more variation in consistency of performance from team to team.

Manual scouting data would surely confirm this.


Sure. To reiterate though for others on the thread, it looks like the OPR estimates (assuming the model) for a tournament like the one in the data provided had a 1 standard deviation confidence range of around +/- 11.4 for nearly all teams (some teams might have been 11.3, some might have been 11.5, depending on their match schedules but as Ether says these very slight variations are essentially meaningless).

For example, this means that if a team had, say, an OPR of 50, and they played another identical tournament with the same matches and the same kind of randomness in the match results, the OPR computed from that tournament would probably be between 39 and 61 (if you're being picky: 68% of the time the estimate would lie in this range, if the data is sufficiently normal or Gaussian).

So picking a team for your alliance that has an OPR of 55 over a different team that has an OPR of 52 is silly. But picking a team that has an OPR of 80 over a team that has an OPR of 52 is probably a safe bet. :)


In response to the latest post, this could be run on any other tournament for which the data is present. Ether made this particularly easy to do by providing the A match matrix and the vector of match results in nice csv files.

BTW, the code is attached and scilab is free, so anybody can do this for whatever data they happen to have on hand.

Ether
18-05-2015, 15:10
Ether made this particularly easy to do by providing the A match matrix and the vector of match results in nice csv files.

[A] and [b] CSV files for all 117 events in 2015 (9878 qual matches, 2872 teams) can be found at the link below, at or near the bottom of the attachments list:

http://www.chiefdelphi.com/media/papers/3132

wgardner
18-05-2015, 19:57
[A] and [b] CSV files for all 117 events in 2015 (9878 qual matches, 2872 teams) can be found at the link below, at or near the bottom of the attachments list:

http://www.chiefdelphi.com/media/papers/3132


Thanks for the awesome data, Ether!

Here are the results for the Waterloo tournament:

mpt = matches per team (so the last row is for the whole tournament and earlier rows are for the tournament through 4 matches per team, through 5, etc.)

varM = variance of the match scores

stdevM = standard deviation of the match scores

varR and stdevR are the same for the match prediction residual
so varR/varM is the fraction of the match variance that can't be predicted by the OPR linear prediction model.

/sqrt(mpt) = the standard deviation of the OPRs we would have if we were simply averaging a team's match scores to estimate their OPR, which is just stdevR/sqrt(mpt)

StdErrO = the standard error of the OPRs using my complicated model derivation.

stdevO = the standard deviation of the StdErrO values taken across all teams, which is big if some teams have more standard error on their OPR values than other teams do.


mpt varM stdevM varR stdevR /sqrt(mpt) StdErrO stdevO
4 3912.31 62.55 206.90 14.38 7.19 12.22 1.60
5 4263.97 65.30 290.28 17.04 7.62 10.44 0.71
6 3818.40 61.79 346.49 18.61 7.60 9.44 0.43
7 3611.50 60.10 379.83 19.49 7.37 8.64 0.30
8 3617.25 60.14 429.42 20.72 7.33 8.28 0.17
9 3592.06 59.93 469.44 21.67 7.22 8.00 0.11
10 3623.44 60.20 539.33 23.22 7.34 8.01 0.10
11 3530.91 59.42 548.08 23.41 7.06 7.58 0.08
12 3440.36 58.65 578.65 24.06 6.94 7.38 0.07
13 3356.17 57.93 645.25 25.40 7.05 7.42 0.06

And for comparison, here's the same data for the Archimedes division results:


mpt varM stdevM varR stdevR /sqrt(mpt) StdErrO stdevO
4 1989.58 44.60 389.80 19.74 9.87 16.51 1.28
5 2000.09 44.72 714.81 26.74 11.96 16.31 0.57
6 2157.47 46.45 863.88 29.39 12.00 15.17 0.37
7 2225.99 47.18 916.16 30.27 11.44 13.64 0.29
8 2204.03 46.95 985.63 31.39 11.10 12.77 0.24
9 2235.14 47.28 1053.26 32.45 10.82 12.21 0.10
10 2209.46 47.00 1056.14 32.50 10.28 11.37 0.12


The OPR seems to do a much better job of predicting the match results in the Waterloo tournament (removing 80% of the match variance vs. 50% in Archimedes), and the standard deviation of the OPR estimates themselves is smaller (7.42 in Waterloo vs. 11.37 in Archimedes).

Citrus Dad
19-05-2015, 15:07
I sincerely appreciate the time and effort you spent on this.

I could be wrong, but I doubt this is what Citrus Dad had in mind.


@ Citrus Dad: you wrote:
... so would you please compute the parameter standard errors for this example (http://www.chiefdelphi.com/forums/showthread.php?p=1482422#post1482422) using your statistical software package and post results here? Thank you.




I believe that the SEs that have been posted are what I was interested in. (I have to think harder about this than I can right now.) I think that using the pooled time series method, which is essentially what's been done, results in SEs that will be largely the same for each participant, because the OPRs are estimated across all participants.

To be honest setting up a pooled time series with this data would take me more time than I have at the moment. I've thought about it and maybe it will be a summer project (maybe my son Jake (themccannman) can do it!)

Note that the 1 SD SE of 11.5 corresponds to the 68% confidence interval. For 10 or so observations, the 95% confidence interval is about 2 SD, or about 23.0. The t-statistic is the relevant tool for finding the confidence interval metric.
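
In symbols, with the residual degrees of freedom taken as the number of alliance scores minus the number of teams (the same divisor used elsewhere in this thread), the interval being described is:

CI_{95\%} = \widehat{OPR}_i \pm t_{0.975,\,\nu} \cdot SE_i, \qquad \nu = \#\text{alliance scores} - \#\text{teams}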

Ether
19-05-2015, 17:09
I believe that the SEs that have been posted are what I was interested in.

You stated earlier that "the parameter standard errors, i.e., the error estimate around the OPR parameter itself for each team ... [is] a primary output of any statistical software package."

From this and other prior statements, I had the very strong impression you were seeking a separate error estimate for each team's OPR.

Such estimates would certainly not be virtually identical for every team!

It would be very helpful if you would please provide more information about statistical software packages you know that provide "parameter standard errors".

I couldn't find any that could provide such estimates for the multiple-regression model we are talking about for OPR computation using FRC-provided match score data. I suspect that's because it's simply not possible to get such estimates for that model and data.

wgardner
19-05-2015, 20:16
From this and other prior statements, I had the very strong impression you were seeking a separate error estimate for each team's OPR.

Such estimates would certainly not be virtually identical for every team!


The approach I described does find a separate error estimate for each team and, at least in this approach, they are virtually identical. Why do you think they would "certainly not be virtually identical"?

Note that this is computing the confidence of each OPR estimate for each team. This is different from trying to compute the variance of score contribution from match to match for each team, which is a very different (and also very interesting) question. I think it would be reasonable to hypothesize that the variance of score contribution for each team might vary from team to team, possibly substantially.

For example, it might be interesting to know that team A scores 50 points +/- 10 points with 68% confidence but team B scores 50 points +/- 40 points with 68% confidence. At the very least, if you saw that one team had a particularly large score variance, it might make you investigate this robot and see what the underlying root cause was (maybe 50% of the time they have an awesome autonomous but 50% of the time it completely messes up, for example).

Hmmm....

Ether
19-05-2015, 21:44
Why do you think they would "certainly not be virtually identical"?

Because there's no reason whatsoever to believe there's virtually no variation in consistency of performance from team to team.

Manual scouting data would surely confirm this.


Consider the following thought experiment.

Team A gets actual scores of 40,40,40,40,40,40,40,40,40,40 in each of its 10 qual matches.

Team B gets actual scores of 0,76,13,69,27,23,16,88,55,33

The simulation you described assigns virtually the same standard error to their OPR values.

If what is being sought is a metric which is somehow correlated to the real-world trustworthiness of the OPR for each individual team (I thought that's what Citrus Dad was seeking), then the standard error coming out of the simulation is not that metric.


My guess is that the 0.1 number is just measuring how well your random number generator is conforming to the sample distribution you requested.

wgardner
19-05-2015, 22:09
Because there's no reason whatsoever to believe there's virtually no variation in consistency of performance from team to team.

Manual scouting data would surely confirm this.

[Edit: darn it! I tried to reply and mistakenly edited my previous post. I'll try to reconstruct it here.]

Your model certainly might be valid, and my derivation explicitly does not deal with this case.

The derivation is for a model where OPRs are computed, then multiple tournaments are generated using those OPRs and adding the same amount of noise to each match, and then seeing what the standard error of the resulting OPR estimates is across these multiple tournaments.

If you know that the variances for each team's score contribution are different, then the model fails. For that matter, the least squares solution for computing the OPRs in the first place is also a failed model in this case. If you knew the variances of the teams' contributions, then you should use weighted-least-squares to get a better estimate of the OPRs.

I wonder if some iterative approach might work: First compute OPRs assuming all teams have equal variance of contribution, then estimate the actual variances of contributions for each team, then recompute the OPRs via weighted-least-squares taking this into account, then repeat the variance estimates, etc., etc., etc. Would it converge?

[Edit: 2nd part of post, added here a day later]

http://en.wikipedia.org/wiki/Generalized_least_squares

OPRs are computed with an ordinary-least-squares (OLS) analysis.

If we knew ahead of time the variances we expected for each team's scoring contribution, we could use weighted-least-squares (WLS) to get a better estimate of the OPRs.

The link also describes something like I was suggesting above, called "Feasible generalized least squares (FGLS)". In FGLS, you use OLS to get your initial OPRs, then estimate the variances, then compute WLS to improve the OPR estimate. It discusses iterating this approach also.

But, the link also includes this comment: "For finite samples, FGLS may be even less efficient than OLS in some cases. Thus, while (FGLS) can be made feasible, it is not always wise to apply this method when the sample is small."

If we have 254 match results and we're trying to estimate 76 OPRs and 76 OPR variances (152 parameters total), we have a pretty small sample size. So this approach would probably suffer from too small a sample size.
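
A minimal Octave-style sketch of the WLS step being described, under the assumption that a per-team contribution-variance vector teamVar were somehow available (getting a trustworthy teamVar from this little data is exactly the problem above). Iterating it FGLS-style would just mean re-estimating teamVar from the new residuals and repeating:

matchVar = A * teamVar;                   % each row of A has three 1s, so this sums the 3 teams' variances
W        = diag(1 ./ matchVar);           % weight each alliance score by the inverse of its variance
O_wls    = (A' * W * A) \ (A' * W * M);   % weighted-least-squares OPR estimates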

wgardner
20-05-2015, 07:27
See also this link:
http://en.wikipedia.org/wiki/Heteroscedasticity

"In statistics, a collection of random variables is heteroscedastic if there are sub-populations that have different variabilities from others. Here "variability" could be quantified by the variance or any other measure of statistical dispersion."

And see particularly the "Consequences" section which says, "Heteroscedasticity does not cause ordinary least squares coefficient estimates to be biased, although it can cause ordinary least squares estimates of the variance (and, thus, standard errors) of the coefficients to be biased, possibly above or below the true or population variance. Thus, regression analysis using heteroscedastic data will still provide an unbiased estimate for the relationship between the predictor variable and the outcome, but standard errors and therefore inferences obtained from data analysis are suspect. Biased standard errors lead to biased inference..."

Citrus Dad
21-05-2015, 20:58
You stated earlier that "the parameter standard errors, i.e., the error estimate around the OPR parameter itself for each team ... [is] a primary output of any statistical software package."

From this and other prior statements, I had the very strong impression you were seeking a separate error estimate for each team's OPR.

Such estimates would certainly not be virtually identical for every team!

It would be very helpful if you would please provide more information about statistical software packages you know that provide "parameter standard errors".

I couldn't find any that could provide such estimates for the multiple-regression model we are talking about for OPR computation using FRC-provided match score data. I suspect that's because it's simply not possible to get such estimates for that model and data.



I think one solution is to use a fixed-effects model that includes a separate variable for each team; the SE for each team will show up there. To be honest, issues like that for FE models are getting beyond my econometric experience. Maybe someone else could research that and check. FE models (as well as random effects models) have become quite popular in the last decade.

sur
24-05-2015, 13:43
To be as clear as I can about this: This says that if we compute the OPRs based on the full data set, compute the match prediction residuals based on the full data set, then run lots of different tournaments with match results generated by adding the OPRs for the teams in the match and random match noise with the same match noise variance, and then compute the OPR estimates for all of these different randomly generated tournaments, we would expect to see the OPR estimates themselves have a standard deviation around 11.4.

This sounds very similar to bootstrap resampling (http://www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf), which should measure the variation in estimated OPR from the "true" OPR values rather than how consistently individual teams perform. This may be why the values are virtually identical.

wgardner
24-05-2015, 14:06
This sounds very similar to bootstrap resampling (http://www.stat.cmu.edu/~cshalizi/402/lectures/08-bootstrap/lecture-08.pdf), which should measure the variation in estimated OPR from the "true" OPR values rather than how consistently individual teams perform. This may be why the values are virtually identical.

Yep, though my derivation is for straight bootstrapping (Figure #1 in your attachment) rather than re-sampled bootstrapping (Figure #3). And yes, given this, the standard errors I compute are the variations of the OPR estimates if they fit the model, all of which assumes that there is no variation in the way individual teams perform other than their mean contribution. Obviously, this final assumption is suspect.
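
For contrast, the re-sampled flavor would look something like the Octave-style sketch below; resampling whole alliance scores (rows of A) with replacement is my choice of resampling unit, not something taken from the linked notes or the attachments:

m = size(A, 1);  t = size(A, 2);
Nboot = 1000;
Oboot = zeros(t, Nboot);
for k = 1:Nboot
  idx        = randi(m, m, 1);            % draw m alliance scores with replacement
  Oboot(:,k) = pinv(A(idx,:)) * M(idx);   % pinv in case a team drops out of a resample entirely
end
SE_boot = std(Oboot, 0, 2);               % per-team bootstrap standard error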

sur
24-05-2015, 16:52
And yes, given this, the standard errors I compute are the variations of the OPR estimates if they fit the model, all of which assumes that there is no variation in the way individual teams perform other than their mean contribution. Obviously, this final assumption is suspect.

I think this assumption can be restated a different way. We might assume that the contribution of each team is normally distributed with a certain mean and variance. If the variance is fixed and assumed to be the same for each team, then the maximum likelihood estimate of the means of the distributions should be the same as the least squares estimate as in usual OPR. This assumes that there is some hidden distribution of the contribution of each team which is normal and that the variance of each distribution is the same.

jtrv
27-05-2015, 17:45
hi all,

As a student going into my first year of undergrad this fall, I find this kind of stuff interesting. At what level (or in what course, or with what experience) is this kind of material typically taught?

I have looked into interpolation, as I would like to spend some time independently developing spline path generation for auton modes, and that particular area requires a bit of knowledge of Linear Algebra, which I will begin teaching myself soon enough.

As for this topic, what would be the equivalent of interpolation : linear algebra, i.e., what subject would I need to learn first?

I don't mean to hijack the thread, but it feels like the most appropriate place to ask...

Ether
23-06-2015, 21:32
It's been known that OPR doesn't reflect actual scoring ability. It's a regression analysis that computes the implied "contribution" to scores. Unfortunately no one ever posts the estimates' standard errors

Getting back to this after a long hiatus.

If you are asking for individual standard error associated with each OPR value, no one ever posts them because the official FRC match data doesn't contain enough information to make a meaningful computation of those individual values.

In a situation, unlike FRC OPR, where you know the variance of each observed value (either by repeated observations using the same values for the predictor variables, or if you are measuring something with an instrument of known accuracy) you can put those variances into the design matrix for each observation and compute a meaningful standard error for each of the model parameters.

Or if, unlike FRC OPR, you have good reason to believe the observations are homoscedastic, you can compute the variance of the residuals and use that to back-calculate standard errors for the model parameters. If you do this for FRC data the result will be standard errors which are very nearly the same for each OPR value... which is clearly not the expected result.

Citrus Dad
29-06-2015, 16:52
Getting back to this after a long hiatus.

If you are asking for individual standard error associated with each OPR value, no one ever posts them because the official FRC match data doesn't contain enough information to make a meaningful computation of those individual values.

In a situation, unlike FRC OPR, where you know the variance of each observed value (either by repeated observations using the same values for the predictor variables, or if you are measuring something with an instrument of known accuracy) you can put those variances into the design matrix for each observation and compute a meaningful standard error for each of the model parameters.

Or if, unlike FRC OPR, you have good reason to believe the observations are homoscedastic, you can compute the variance of the residuals and use that to back-calculate standard errors for the model parameters. If you do this for FRC data the result will be standard errors which are very nearly the same for each OPR value... which is clearly not the expected result.




The standard errors for the OPR values can be computed, but they are in fact quite large relative to the parameter values. Which is actually my point--the statistical precision of the OPR values is really quite poor because there are so few observations, which are in fact not independent. Rather than ignoring the SEs because they show how poor the OPR estimators are performing, the SEs should be reported to show how poorly the estimators perform for everyone's consideration.

Ether
29-06-2015, 17:20
I think you missed my point entirely.

The standard errors for the OPR values can be computed...

Yes, they can be computed, but that doesn't mean they are statistically valid. They are not, because the data does not conform to the necessary assumptions.

...but they are in fact quite large relative to the parameter values.

Yes they are, but they are also nearly all the same value... which is obviously incorrect... and a result of assumptions which the data does not meet.

the statistical precision of the OPR values is really quite poor because there are so few observations, which are in fact not independent.

Lack of independence is only one of the assumptions which the data do not meet.

Rather than ignoring the SEs because they show how poor the OPR estimators are performing...

They are not being ignored "because they show how poor the OPR estimators are performing"; they are not being reported because they are invalid and misleading.

the SEs should be reported to show how poorly the estimators perform for everyone's consideration.

There are better metrics to report to show how poorly the estimators perform.

asid61
29-06-2015, 21:38
So it's not possible to perform a statistically valid calculation for standard deviation? Are there no ways to solve for it with a system that is dependent on other robots' performances?

GeeTwo
30-06-2015, 01:20
The standard errors for the OPR values can be computed, but they are in fact quite large relative to the parameter values. Which is actually my point--the statistical precision of the OPR values is really quite poor because there are so few observations, which are in fact not independent. Rather than ignoring the SEs because they show how poor the OPR estimators are performing, the SEs should be reported to show how poorly the estimators perform for everyone's consideration.

+1, +/- 0.3.


There are better metrics to report to show how poorly the estimators perform.

It would be great if standard error could be used as a measure of the consistency of a team, but that's not its only function. I agree with Richard that one of the benefits of an error value is to provide an indication of how much difference is (or is not) significant. If the error bars on the OPRs are all (for example) about 10 points, then a 4 point difference in OPR between two teams probably means less in sorting a pick list than does a qualitative difference in a scouting report.

As it turns out, I was recently asked for the average time it takes members of my branch to produce environmental support products. Because we get requests that range from a 10 mile square box on one day to seasonal variability for a whole ocean basin, the (requested) mean production time means nothing. For one class of product, the standard deviation of production times was greater than the mean. Without the scatter info, the reader would have probably assumed that we were making essentially identical widgets and that the scatter was +/- 1 or 2 in the last reported digit.

Ether
30-06-2015, 10:01
So it's not possible to perform a statistically valid calculation for standard deviation?

We're discussing standard error of the model parameters, also known as standard error of the regression coefficients. So in our particular case, that would be standard error of the OPRs.

Standard error of the model parameters is a very useful statistic in those cases where it applies. I mentioned one such situation in my previous post:

In a situation, unlike FRC OPR, where you know the variance of each observed value (either by repeated observations using the same values for the predictor variables, or if you are measuring something with an instrument of known accuracy) you can put those variances into the design matrix for each observation and compute a meaningful standard error for each of the model parameters.

An example of the above would be analysis and correction of land surveying network measurement data. The standard deviation of the measurements is known a priori from the manufacturer's specs for the measurement instruments and from the surveyor's prior experience with those instruments.

In such a case, computing the standard error of the model parameters is justified, and the results are meaningful. All modern land surveying measurement adjustment apps include it in their reports.


Are there no ways to solve for it with a system that is dependent on other robots' performances?

That's a large (but not the only) part of the problem.

I briefly addressed this in my previous post:

Or if, unlike FRC OPR, you have good reason to believe the observations are homoscedastic, you can compute the variance of the residuals and use that to back-calculate standard errors for the model parameters. If you do this for FRC data the result will be standard errors which are very nearly the same for each OPR value... which is clearly not the expected result.

In the case computing OPRs using only FIRST-provided match results data (no manual scouting), the data does not meet the requirements for using the above technique.

In fact, when you use the above technique for OPR you are essentially assuming that all teams are identical in their consistency of scoring, so it's not surprising that when you put that assumption into the calculation you get it back out in the results. GIGO.

Posting invalid and misleading statistics is a bad idea, especially when there are better, more meaningful statistics to fill the role.

For Richard and Gus: if all you are looking for is one overall ballpark number for "how bad are the OPR calculations for this event", let's explore better ways to present that.

Citrus Dad
30-06-2015, 13:59
I think you missed my point entirely.



Yes, they can be computed, but that doesn't mean they are statistically valid. They are not, because the data does not conform to the necessary assumptions.



Yes they are, but they are also nearly all the same value... which is obviously incorrect... and a result of assumptions which the data does not meet.



Lack of independence is only one of the assumptions which the data do not meet.



They are not being ignored "because they show how poor the OPR estimators are performing"; they are not being reported because they are invalid and misleading.



There are better metrics to report to show how poorly the estimators perform.





But based on this response, the OPR estimates themselves should not be reported because they are not statistically valid either. Instead by not reporting some measure of the potential error, they give the impression of precision to the OPRs.

I just discussed this problem as a major failing for engineers in general--if they are not fully comfortable in reporting a parameter, e.g., a measure of uncertainty, they often will simply ignore the parameter entirely. (I was discussing how the value of solar PV is being estimated across a dozen studies. I've seen this tendency over and over in almost 30 years of professional work.) Instead, the appropriate method ALWAYS, ALWAYS, ALWAYS is to report the uncertain or unknown parameter with some sort of estimate and all sorts of caveats. Instead what happens is that decisionmakers and stakeholders much too often accept the values given as having much greater precision than they actually have.

While calculating the OPR really is of no true consequence, because we are working with high school students who are very likely to be engineers, it is imperative that they understand and use the correct method of presenting their results.

So, the SEs should be reported as the best available approximation of the error term around the OPR estimates. And the caveats about the properties of the distribution can be reported with a discussion about the likely biases in the parameters due to the probability distributions.

Ether
30-06-2015, 15:00
But based on this response, the OPR estimates themselves should not be reported because they are not statistically valid either.

Sez who? They are the valid least-squares fit to the model. That is all they are. According to what criteria are they then not valid?

Instead by not reporting some measure of the potential error, they give the impression of precision to the OPRs.

Who is suggesting not to report some measure of the potential error? Certainly not me. Read my posts.

I just discussed this problem as a major failing for engineers in general--if they are not fully comfortable in reporting a parameter, e.g., a measure of uncertainty, they often will simply ignore the parameter entirely.

I do not have the above failing, if that is what you were implying.


ALWAYS, ALWAYS, ALWAYS is to report the uncertain or unknown parameter with some sort of estimate and all sorts of caveats.

You are saying this as if you think I disagree. If so, you would be wrong.


Instead what happens is that decisionmakers and stakeholders much too often accept the values given as having much greater precision than they actually have.

Exactly. And perhaps more often than you realize, those values they are given shouldn't have been reported in the first place because the data does not support them. Different (more valid) measures of uncertainty should have been reported.


While calculating the OPR really is of no true consequence, because we are working with high school students who are very likely to be engineers, it is imperative that they understand and use the correct method of presenting their results.

Well I couldn't agree more, and it is why we are having this discussion.

So, the SEs should be reported as the best available approximation of the error term around the OPR estimates

Assigning a separate standard error to each OPR value computed from the FIRST match results data is totally meaningless and statistically invalid. As you said above, "it is imperative that they understand and use the correct method of presenting their results".

Let's explore alternative ways to demonstrate the shortcomings of the OPR values.

the caveats about the properties of the distribution can be reported with a discussion about the likely biases in the parameters due to the probability distributions

"Likely" is an understatement. The individual (per-OPR) computed standard error values are obviously and demonstrably wrong (this can be verified with manual scouting data). And what's more, we know why they are wrong.

As I've suggested in my previous two posts, how about let's explore alternative, valid ways to demonstrate the shortcomings of the OPR values.

One place to start might be to ask whether or not the average value of the vector of standard errors of OPRs might be meaningful, and if so, what exactly it means.

Citrus Dad
01-07-2015, 18:07
Ether

I wasn't quite sure why you dug up my original post to start this discussion. It seemed out of context with all of your other discussion about adding error estimates. That said, my request was more general, and it seems to be answered more generally by the other computational efforts that have been going on in the 2 related threads.

But one point: I will say that using a fixed-effects model with a separate match progression parameter (to capture the most likely source of heteroskedasticity) should lead to parameter estimates that will provide valid error terms using FRC data. But computing fixed-effects models is a much more complex process. It is something that can be done in R.

Sez who? They are the valid least-squares fit to the model. That is all they are. According to what criteria are they then not valid?

That one can calculate a number doesn't mean that the number is meaningful. Without a report of the error around the parameter estimates, the least squares fit is not statistically valid and the meaning cannot be interpreted. This is a fundamental principle in econometrics (and I presume in statistics in general.)

Ether
01-07-2015, 20:15
That one can calculate a number doesn't mean that the number is meaningful.

I'm glad you agree with me on this very important point. It's what I have been saying about your request for SE estimates for each individual OPR.


Without a report of the error around the parameter estimates, the least squares fit is not statistically valid

Without knowing your private definition of "statistically valid" I can neither agree nor disagree.


and the meaning cannot be interpreted.

The meaning can be interpreted as follows: It is the set of model parameters which minimizes the sum of the squares of the differences between the actual and model-predicted alliance scores. This is universally understood. Now once you've done that regression, proceeding to do inferential statistics based on the fitted model is where you hit a speed bump because the data does not satisfy the assumptions required for many of the common statistics.

The usefulness of the fitted model can, however, be assessed without using said statistics.


I wasn't quite sure why you dug up my original post to start this discussion.

I had spent quite some time researching the OP question and came back to tie up loose ends.

It seemed out of context with all of your other discussion about adding error estimates.

How so? I think I have been fairly consistent throughout this thread.

That said, my request was more general, and it seems to be answered more generally by the other computational efforts that have been going on in the 2 related threads.

Your original request was (emphasis mine):
I'm thinking of the parameter standard errors, i.e., the error estimate around the OPR parameter itself for each team. That can be computed from the matrix--it's a primary output of any statistical software package.

During the hiatus I researched this extensively. The standard error of model parameters (regression coefficients) is reported by SPSS, SAS, MINITAB, R, ASP, MicrOsiris, Tanagra, and even Excel. All these packages compute the same set of values, so they are all doing the same thing.

Given [A][x]=[b], the following computation produces the same values as those packages:

alliances = size(A,1);               % number of measurements (alliance scores) = rows of [A]
teams     = size(A,2);               % number of model parameters (teams) = columns of [A]

x = A\b;                             % least-squares solution, i.e. [OPR]
residuals = b-A*x;                   % per-alliance prediction residuals
SSres = residuals'*residuals;        % residual sum of squares
VARres = SSres/(alliances-teams);    % residual variance, with (alliances - teams) degrees of freedom

Av = A/sqrt(VARres);                 % scale the design matrix by the assumed-common measurement std dev
Nvi = inv(Av'*Av);                   % = VARres * inv(A'A), the parameter covariance matrix
SE_of_parameters = sqrt(diag(Nvi))   % standard error of each regression coefficient (each OPR)

The above code clearly shows that this computation is assuming that the standard deviation is constant for all measurements (alliance scores) and thus for all teams... which we know is clearly not the case. That's one reason it produces meaningless results in the case of FRC match results data.


But one point: I will say that using a fixed-effects model with a separate match progression parameter (to capture the most likely source of heteroskedasticity) should lead to parameter estimates that will provide valid error terms using FRC data. But computing fixed-effects models is a much more complex process. It is something that can be done in R.

That's an interesting suggestion, but I doubt it would be successful. I'd be pleased to be proven wrong. If you are willing to try it, I will provide whatever raw data you need in your format of choice.

Citrus Dad
02-07-2015, 18:25
One definition of statistical validity:
https://explorable.com/statistical-validity

Statistical validity refers to whether a statistical study is able to draw conclusions that are in agreement with statistical and scientific laws. This means if a conclusion is drawn from a given data set after experimentation, it is said to be scientifically valid if the conclusion drawn from the experiment is scientific and relies on mathematical and statistical laws.

It is the set of model parameters which minimizes the sum of the squares of the differences between the actual and model-predicted alliance scores. This is universally understood.

This is the point upon which we disagree. This is not a mathematical exercise--it is a statistical one. And statistical analysis requires inference about the validity of the estimated parameters. And I strongly believe that the many students who will be working in engineering in the future who read this need to understand that this is a statistical exercise which requires all of the caveats of such analysis.

Here's a discussion for fixed effects from the SAS manual:
http://www.sas.com/storefront/aux/en/spfixedeffect/58348_excerpt.pdf

GeeTwo
02-07-2015, 20:21
This is not a mathematical exercise--it is a statistical one. And statistical analysis requires inference about the validity of the estimated parameters.

One of the two textbooks for my intermediate mechanics lab (sophomores and juniors in physics and engineering) was entitled How to Lie with Statistics (http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728). Chapter 4 is entitled "Much Ado about Practically Nothing." For me, the takeaway sentence from this chapter is:
You must always keep that plus-or-minus in mind, even (or especially) when it is not stated.


Unfortunately, not many high schoolers have been exposed to this concept.

Finally, if standard errors could be validly produced for each team as a measure of its consistency/reliability, that would be outstanding. Given that teams change strategy and modify robots between matches (and given this year's nonlinear scoring), it is not surprising that per-team standard error calculations are not valid. (And by the way, Ether's finding that the numbers could be calculated but did not communicate variability is at least qualitatively similar to Richard's argument concerning OPR.)

This does not negate the need for a "standard error" or "probable error" of the whole data set. OPR is ultimately a measurement, and anyone using OPR to drive a decision needs to understand its accuracy. That is, does a difference of 5 points in OPR mean that one team is better than the other with 10% confidence, 50% confidence, or 90% confidence?
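
For illustration only (hypothetical numbers, and assuming the per-team standard errors were valid and roughly independent), that confidence could be estimated like this:

se1 = 11.5;  se2 = 11.5;            % hypothetical per-team OPR standard errors
gap = 5;                            % observed OPR difference
z   = gap / sqrt(se1^2 + se2^2);    % z-score of the difference
conf = 0.5*(1 + erf(z/sqrt(2)))     % about 0.62 -- barely better than a coin flip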

wgardner
12-07-2015, 09:25
As I've suggested in my previous two posts, how about let's explore alternative, valid ways to demonstrate the shortcomings of the OPR values.

One place to start might be to ask whether or not the average value of the vector of standard errors of OPRs might be meaningful, and if so, what exactly it means.


Hi All,

Ether and I have been having some private discussions and running some simulations on this topic. I thought I'd report the general results here. I think Ether agrees with what I say below, but I'll leave that for him to confirm or deny. :)


Executive Summary:

1. The mean of the standard error vector for the OPR estimates is a decent approximation for the standard deviation of the team-specific OPR estimates themselves, and is a very good approximation for the mean of the standard deviations of the team-specific OPR estimates taken across all of the teams in the tournament.

2. Teams with more variability in their offensive contributions (e.g., teams that contribute a huge amount to their alliance's score by performing some high-scoring feats, but fail at doing so 1/2 the time) will have slightly more uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

3. Teams with less variability in their offensive contributions (e.g., consistent teams that always contribute about the same amount to their alliance's score every match) will have slightly less uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

Details:

I simulated match scores in the following way.

1. I computed the actual OPRs from the actual match data (in this case, from the 2014 misjo tournament as suggested by Ether).

2. I computed the sum of the squared values of the prediction residual and divided this sum by (#matches - #teams) to get an estimate of the per-match randomness that exists after the OPR prediction is performed.

3. I divided the result from step#2 above by 3 to get a per-team estimate of the variance of each team's offensive contribution. I took the square root of this to get the per-team estimate of the standard deviation of each team's offensive contribution.

4. I then simulated 1000 tournaments using the same match schedule as the 2014 misjo tournament. The simulated match scores were the sum of the 3 OPRs for the teams in that match plus 3 zero-mean, variance-1 normally distributed random numbers scaled by the 3 per-team offensive standard deviations computed in step #3. Note that at this point, each team has the same value for the per-team offensive standard deviations.

5. I then computed the OPR estimates from the match scores for each simulated tournament and computed the actual standard deviation of the 1000 OPR estimates for each team. These standard deviations were all close to 11.5 (between 11 and 12) which was the average of the elements of the traditional standard error vector calculation performed on the original data. This makes sense, as the standard error is supposed to be the standard deviation of the estimates if the randomness of the match scores had equal variance for all matches, as was simulated. As a reminder, all of the individual elements of the standard error vector were extremely close to 11.5 in this case.

6. But then I tried something different. Instead of having the per-team standard deviation of the offensive contributions be constant, I instead added a random variable to these standard deviations and then renormalized all of them so that the average variance of the match scores would be unchanged. In other words, now some teams have a larger variance in their offensive contributions (e.g., team A might have an OPR of 30 but have its score contribution typically vary between 15 and 45) while other teams might have a smaller variance in their contributions (e.g., team B might also have an OPR of 30 but have its score contribution only typically vary between 25 and 35).

7. Now I resimulated another 1000 tournaments using this model. So now, some match scores might have greater variances and some match scores might have smaller variances. But the way OPR was calculated was not changed.

8. Then I calculated the OPRs for these new 1000 simulated tournaments and calculated the standard deviations of these 1000 new per-team OPR estimates.
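
A minimal Octave-style sketch of steps 1-8 above (variable names are illustrative; as in the earlier standard-error code, A is the 0/1 alliance-score indicator matrix and b is the vector of alliance scores):

[nAll, nTeams] = size(A);
opr  = A\b;                                  % step 1: OPRs from the real match data
res  = b - A*opr;
varM = (res'*res)/(nAll - nTeams);           % step 2: residual variance per alliance score
sdT  = sqrt(varM/3)*ones(nTeams,1);          % step 3: equal per-team offensive std devs

nSim   = 1000;
oprSim = zeros(nTeams, nSim);
for k = 1:nSim
  noise = (A .* randn(nAll, nTeams)) * sdT;  % step 4: fresh per-team, per-match noise
  oprSim(:,k) = A \ (A*opr + noise);         % step 5: re-estimate OPRs from simulated scores
end
sdOPR = std(oprSim, 0, 2);                   % spread of each team's 1000 OPR estimates

% Steps 6-8: perturb the per-team std devs (the 0.5 below is illustrative only),
% renormalize so the total variance is roughly unchanged, and rerun the loop.
sdT2 = abs(sdT .* (1 + 0.5*randn(nTeams,1)));
sdT2 = sdT2 * (norm(sdT)/norm(sdT2));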

What I found was that the OPR estimates did vary more for teams that had a greater offensive variance and did vary less for teams that had a smaller offensive variance. So, if you're convinced that different teams have substantially different variances in their offensive contributions, then just using the one average standard error computation to estimate how reliable all of the different OPR estimates are is not completely accurate.

But the differences were not that large. For example, in one set of simulations, team A had an offensive contribution with a standard deviation of 8 while team B had an offensive contribution with a standard deviation of 29. So in this case, team B had a LOT more variability in their offensive contribution than team A did (almost 4x as much). But the standard deviation of the 1000 OPR estimates for team A was 10.8 while the standard deviation of the 1000 OPR estimates for team B was 12.9. So yes, team B had a much bigger offensive variability and that made the confidence in their OPR estimates worse than the 11.5 that the standard error would suggest, but it only went up by 1.4, while team A had a much smaller offensive variability but that only improved the confidence in their OPR estimates by 0.7.

And also, the average of the standard deviations of the OPR estimates for the teams in the 1000 tournaments was still very close to the average of the standard error vector computed assuming that the match scores had identical variances.

So, repeating the Executive Summary:

1. The mean of the standard error vector for the OPR estimates is a decent approximation for the standard deviation of the team-specific OPR estimates themselves, and is a very good approximation for the mean of the standard deviations of the team-specific OPR estimates taken across all of the teams in the tournament.

2. Teams with more variability in their offensive contributions (e.g., teams that contribute a huge amount to their alliance's score by performing some high-scoring feats, but fail at doing so 1/2 the time) will have slightly more uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

3. Teams with less variability in their offensive contributions (e.g., consistent teams that always contribute about the same amount to their alliance's score every match) will have slightly less uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

Oblarg
12-07-2015, 20:21
Executive Summary:

1. The mean of the standard error vector for the OPR estimates is a decent approximation for the standard deviation of the team-specific OPR estimates themselves, and is a very good approximation for the mean of the standard deviations of the team-specific OPR estimates taken across all of the teams in the tournament.

2. Teams with more variability in their offensive contributions (e.g., teams that contribute a huge amount to their alliance's score by performing some high-scoring feats, but fail at doing so 1/2 the time) will have slightly more uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

3. Teams with less variability in their offensive contributions (e.g., consistent teams that always contribute about the same amount to their alliance's score every match) will have slightly less uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

Couldn't one generate an estimate for each team's "contribution to variance" by doing the same least-squares fit used to generate OPR in the first place (using the matrix of squared residuals rather than of scores)? This might run the risk of assigning some team a negative contribution to variance (good luck making sense of that one), but other than that (seemingly unlikely) case I can't think of why this wouldn't work.
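
Something like this, I mean (a sketch only, reusing the A and b from earlier in the thread):

opr  = A\b;
res2 = (b - A*opr).^2;           % squared residual for each alliance score
v    = A\res2;                   % least-squares per-team "contribution to variance"
sdTeamEst = sqrt(max(v, 0));     % crude per-team std-dev estimate (clamping negative fits)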

GeeTwo
12-07-2015, 22:46
So, repeating the Executive Summary:

1. The mean of the standard error vector for the OPR estimates is a decent approximation for the standard deviation of the team-specific OPR estimates themselves, and is a very good approximation for the mean of the standard deviations of the team-specific OPR estimates taken across all of the teams in the tournament.

2. Teams with more variability in their offensive contributions (e.g., teams that contribute a huge amount to their alliance's score by performing some high-scoring feats, but fail at doing so 1/2 the time) will have slightly more uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

3. Teams with less variability in their offensive contributions (e.g., consistent teams that always contribute about the same amount to their alliance's score every match) will have slightly less uncertainty in their OPR estimate than the mean of the standard error vector would indicate, but not by too much.

The bottom line here seems to be that even assuming that an alliance's expected score is a simple sum of each team's contributions, the statistics tend to properly report the global match-to-match variation, while under-reporting each team's match-to-match variation.
The elephant in the room here is the assumption that the alliance is equal to the sum of its members. For example, consider a 2015 (Recycle Rush) robot with a highly effective 2-can grab during autonomous, and the ability to build, score, cap and noodle one stack of six from the HP station, or cap five stacks of up to six totes during a match, or cap four stacks with noodles loaded over the wall. For argument's sake, it is essentially 100% proficient at these tasks, selecting which to do based on its alliance partners. I will also admit up front that the alliance match-ups are somewhat contrived, but none of them is truly unrealistic. If I'd wanted to really stack the deck, I'd have assumed that the robot was the consummate RC specialist and had no tote manipulators at all.

If the robot had the field to itself, it could score 42 points (one noodled, capped stack of 6). The canburglar is useless, except as a defensive measure.
If paired with two HP robots that could combine to score 2 or 3 capped stacks, this robot would add at most a few noodles to the final score. It either can't get to the HP station, or it would displace another robot that would have been using the station. Again, the canburglar has no offensive value.
If paired with an HP robot that could score two capped & noodled stacks, and a landfill miner that could build and cap two non-noodled stacks, the margin for this robot would be 66 points (42 points for its own noodled, capped stack, and 24 points for the fourth stack that the landfill robot could cap). The canburglar definitely contributes here!
If allied with two HP robots that could put up 4 or 5 6-stacks of totes (but no RCs), the margin value of this robot would be a whopping 120 points (cap 4 6-stacks with RCs and noodles, or cap 5 6-stacks with RCs). Couldn't do it without that canburglar!


The real point is that this variation is based on the alliance composition, not on "performance variation" of the robot in the same situation. I also left HP littering out, which would provide additional wrinkles.

My takeaway on this thread is that it would be good and useful information to know the rms (root-mean-square) of the residuals for an OPR/DPR data set (tournament or season). This would provide some understanding as to how much difference really is a difference, and a clue as to when the statistics mean about as much as the scouting.
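
For what it's worth, with the same A and b used earlier in the thread, that figure is a one-line computation (for OPR; DPR would be analogous):

rmsResidual = sqrt(mean((b - A*(A\b)).^2))   % rms of the OPR prediction residuals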

On another slightly related matter, I have wondered why CCWM (Calculated Contribution to Winning Margin) is calculated by combining separate calculations of OPR and DPR, rather than by solving a single matrix equation for the winning margin. I suspect that the single calculation would prove to be more consistent for games with robot-based defense (not Recycle Rush); if a robot plays offense five matches and defense five matches, then both OPR and DPR would each have a lot of noise, whereas true CCWM should be a more consistent number.

Oblarg
13-07-2015, 00:22
The elephant in the room here is that assumption that the alliance is equal to the sum of its members.

This was directly addressed on the earlier pages, and it's known that there's no real way we can account for the degree to which the underlying OPR model is inaccurate (short of positing some complicated nonlinear model and using that instead).

wgardner
13-07-2015, 06:28
Couldn't one generate an estimate for each team's "contribution to variance" by doing the same least-squares fit used to generate OPR in the first place (using the matrix of squared residuals rather than of scores)? This might run the risk of assigning some team a negative contribution to variance (good luck making sense of that one), but other than that (seemingly unlikely) case I can't think of why this wouldn't work.

:) We just tried this about a day ago. Unfortunately, there isn't enough data in a typical tournament to get reliable estimates of the per-team offensive variation. With much larger tournament sizes, it does work OK, but it doesn't work when you only have about 5-10 matches played by each team. I'll send you some of our private messages where this is discussed and where the results are shown.

wgardner
13-07-2015, 06:39
On another slightly related matter, I have wondered why CCWM (Calculated Contribution to Winning Margin) is calculated by combining separate calculations of OPR and DPR, rather than by solving a single matrix equation for the winning margin. I suspect that the single calculation would prove to be more consistent for games with robot-based defense (not Recycle Rush); if a robot plays offense five matches and defense five matches, then both OPR and DPR would each have a lot of noise, whereas true CCWM should be a more consistent number.

Yes, read the paper attached in the first post of this thread (http://www.chiefdelphi.com/forums/showthread.php?t=137451). What you described is called the Winning Margin Power Rating (WMPR) or Combined Power Rating (CPR) in that paper depending on how you choose to normalize it (called WMPR if the means are 0 like CCWM or called CPR if the means equal the means of the OPRs). If combined with MMSE estimation to address some overfitting issues, it can occasionally result in improved match prediction compared to OPR, DPR, or CCWM measures. Even in years with a lot of defense though, it's not a whole lot better.
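
Schematically, the single winning-margin fit looks like this (a rough sketch; see the paper for the precise WMPR/CPR definitions):

% D is (#matches) x (#teams): +1 for the three red teams and -1 for the three
% blue teams in each match; margin is redScore - blueScore for each match.
% D is rank deficient (adding a constant to every team leaves every margin
% unchanged), which is why a normalization choice is needed; pinv picks the
% zero-mean, CCWM-like solution.
wmpr = pinv(D) * margin;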

wgardner
13-07-2015, 06:50
My takeaway on this thread is that it would be good and useful information to know the rms (root-mean-square) of the residuals for an OPR/DPR data set (tournament or season). This would provide some understanding as to how much difference really is a difference, and a clue as to when the statistics mean about as much as the scouting.

Yes. In the paper in the other thread that I just posted about, the appendices show how much percentage reduction in the mean-squared residual is achieved by all of the different metrics (OPR, CCWM, WMPR, etc). An interesting thing to note is that the metrics are often much worse at predicting match results that they haven't included in their computation, indicating overfitting in many cases.

The paper discusses MMSE-based estimation of the metrics (as opposed to the traditional least-squares method) which reduces the overfitting effects, does better at predicting previously unseen matches (as measured by the size of the squared prediction residual in "testing set" matches), and is better at predicting the actual underlying metric values in tournaments which are simulated using the actual metric models.
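
Without reproducing the paper's exact formulation, a ridge-style shrinkage toward a prior mean gives the flavor of the MMSE approach (the prior and lambda below are illustrative choices only):

nTeams  = size(A,2);
mu      = mean(b)/3;           % prior guess: each team contributes about 1/3 of an average alliance score
lambda  = 1;                   % shrinkage strength; could be tuned on held-out matches
oprMMSE = (A'*A + lambda*eye(nTeams)) \ (A'*b + lambda*mu*ones(nTeams,1));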

Oblarg
13-07-2015, 21:48
Yes. In the paper in the other thread that I just posted about, the appendices show how much percentage reduction in the mean-squared residual is achieved by all of the different metrics (OPR, CCWM, WMPR, etc). An interesting thing to note is that the metrics are often much worse at predicting match results that they haven't included in their computation, indicating overfitting in many cases.

I don't think this necessarily indicates "overfitting" in the traditional sense of the word - you're always going to get an artificially-low estimate of your error when you test your model against the same data you used to tune it, whether your model is overfitting or not (the only way to avoid this is to partition your data into model and verification sets). This is "double dipping."

Rather, it would be overfitting if the predictive power of the model (when tested against data not used to tune it) did not increase with the amount of data available to tune the parameters. I highly doubt that is the case here.
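
A rough sketch of that kind of check, using the same A and b as earlier (the 80/20 split is arbitrary, and with only 5-10 matches per team some teams may be poorly determined in the training fit):

nAll     = size(A,1);
idx      = randperm(nAll);
nTest    = round(nAll/5);                    % hold out ~20% of the alliance scores
test     = idx(1:nTest);
train    = idx(nTest+1:end);
oprTrain = A(train,:) \ b(train);
rmsTrain = sqrt(mean((b(train) - A(train,:)*oprTrain).^2));
rmsTest  = sqrt(mean((b(test)  - A(test,:)*oprTrain).^2));   % typically noticeably larger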

GeeTwo
13-07-2015, 22:31
Yes, read the paper attached in the first post of this thread (http://www.chiefdelphi.com/forums/showthread.php?t=137451). What you described is called the Winning Margin Power Rating (WMPR) or Combined Power Rating (CPR) in that paper depending on how you choose to normalize it (called WMPR if the means are 0 like CCWM or called CPR if the means equal the means of the OPRs). If combined with MMSE estimation to address some overfitting issues, it can occasionally result in improved match prediction compared to OPR, DPR, or CCWM measures. Even in years with a lot of defense though, it's not a whole lot better.

Seems a shame for such a meaningful statistic to be referenced not with a bang, but with a WMPR. ;->

wgardner
14-07-2015, 09:22
I don't think this necessarily indicates "overfitting" in the traditional sense of the word - you're always going to get an artificially-low estimate of your error when you test your model against the same data you used to tune it, whether your model is overfitting or not (the only way to avoid this is to partition your data into model and verification sets). This is "double dipping."

Rather, it would be overfitting if the predictive power of the model (when tested against data not used to tune it) did not increase with the amount of data available to tune the parameters. I highly doubt that is the case here.

From Wikipedia on Overfitting: "In statistics and machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations."

On the first sentence of that quote, I previously found that if I replaced the data from the 2014 casa tournament (which had the greatest number of matches per team of the tournaments I worked with) with completely random noise, the OPR could "predict" 26% of the variance and WMPR could "predict" 47% of it. So they're clearly describing the random noise in this case where a "properly fit" model would come closer to finding no relationship between the model parameters and the data, as should be the case when the data is purely random.
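
A rough sketch of that kind of sanity check (illustrative only, not the exact script):

bRand = mean(b) + std(b)*randn(size(b));     % pure noise on the same scale as the real scores
oprR  = A \ bRand;                           % fit OPR to the noise
R2    = 1 - sum((bRand - A*oprR).^2)/sum((bRand - mean(bRand)).^2)   % "variance explained" by noise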

On the second sentence, again for the 2014 casa tournament, the OPR calculation only has 4 data points per parameter and the WMPR only has 2, which again sounds like "having too many parameters relative to the number of observations" to me. BTW, I think the model is appropriate, so I view it more as a problem of having too few observations rather than too many parameters.

And again, the casa tournament is one of the best cases. Most other tournaments have even fewer observations per parameter.

So that's why I think it's overfitting. Your opinion may differ. No worries either way. :)

This is also discussed a bit in the section on "Effects of Tournament Size" in my "Overview and Analysis of First Stats" paper.

Oblarg
14-07-2015, 13:35
From Wikipedia on Overfitting: "In statistics and machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations."

On the first sentence of that quote, I previously found that if I replaced the data from the 2014 casa tournament (which had the greatest number of matches per team of the tournaments I worked with) with completely random noise, the OPR could "predict" 26% of the variance and WMPR could "predict" 47% of it. So they're clearly describing the random noise in this case where a "properly fit" model would come closer to finding no relationship between the model parameters and the data, as should be the case when the data is purely random.


Well, any nontrivial model at all that's looking at a purely random process without sufficient data is going to "overfit," nearly by definition, because no nontrivial model is going to be at all correct.

The problem here is that there are two separate things in that Wikipedia article that are called "overfitting:" errors caused by fundamentally sound models with insufficient data, and errors caused by improperly revising the model specifically to fit the available training data (and thus causing a failure to generalize).

If one is reasoning purely based on patterns seen in the data, then there is no difference between the two (since the only way to know that one's model fits the data would be through validation against those data). However, these aren't necessarily the same thing if one has an externally motivated model (and I believe OPR has reasonable, albeit clearly imperfect, motivation).

We may be veering off-topic, though.