The Girls of Steel (FRC 3504; FTC 9820, 9821, 9981) Data Science team is proud to release “Clutch Matches are in the Middle”, a paper detailing an improvement on OPR/Calculated Contribution as a metric for FRC. We will be presenting the paper at the Carnegie Mellon Sports Analytics Conference on November 2nd, and we would love to hear feedback, especially in advance of the conference.
In this paper, we present an improvement to Offensive Power Rating (OPR), a popular linear regression model for assessing team performance at a given event. One key assumption of linear regression is the independence of the errors, but in the FIRST Robotics Competition (FRC) context, this assumption does not strictly hold. Using data from all district events between 2009 and 2024, we model the unweighted errors as a function of tournament progression and generate weightings that improve the regression fit through Weighted Least Squares. The best weightings show that the most representative matches for a team’s overall performance are those midway through the tournament. That is, the real clutch matches are in the middle.
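For anyone who wants to play with the core idea before reading, here is a minimal sketch of the two fits the paper compares. The matrix, scores, and weights below are toy values of ours, not data from the paper:

```python
import numpy as np

# Toy example: 6 teams, 3 matches, two alliances per match. Row i of A has
# a 1 for each team on that alliance; y holds the alliance scores. All
# numbers here are made up for illustration, not data from the paper.
A = np.array([
    [1, 1, 1, 0, 0, 0],   # match 1, red:  teams 0, 1, 2
    [0, 0, 0, 1, 1, 1],   # match 1, blue: teams 3, 4, 5
    [1, 0, 0, 1, 1, 0],   # match 2, red
    [0, 1, 1, 0, 0, 1],   # match 2, blue
    [1, 0, 1, 0, 0, 1],   # match 3, red
    [0, 1, 0, 1, 1, 0],   # match 3, blue
], dtype=float)
y = np.array([42.0, 35.0, 51.0, 38.0, 47.0, 40.0])

# Plain OPR / calculated contribution: ordinary least squares.
opr_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

# WLS variant: one weight per match (shared by both of its alliance rows);
# these toy weights bump the middle of the schedule. Scaling the rows of A
# and y by sqrt(w) turns the weighted problem back into an OLS solve.
w = np.array([0.8, 0.8, 1.4, 1.4, 0.8, 0.8])
sw = np.sqrt(w)
opr_wls, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
```

(With real event data, A has many more rows than columns and full column rank; this toy matrix is rank-deficient, so `lstsq` returns the minimum-norm solution.)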
Thank you for your question! As we’ve acknowledged in the first paragraph of our motivation section, EPA is a better measure for predicting match results. However, there is still a place for OPR regression methods to help summarize the performance throughout qualification rounds.
Hello! My name is Anuva, and I am a co-author of this whitepaper who worked primarily on the Introduction and Discussion sections. Although I am now an alumnus, I’m happy to answer any questions you may have about these sections or the paper as a whole. Please feel free to reach out!
Weighted Least Squares improved our estimation of teams by 0.015 Crescendo points
Changing the score prediction by one or two hundredths of a point seems like an argument that WLS does not make a practical difference. I think I’m interpreting the results correctly, but perhaps I’m missing something.
Shifting from practical to academic conclusions, what are the 95% confidence intervals of the mean OLS/WLS ratios in Figures 3a and 3b? Is the 0.1% improvement in prediction statistically significant?
Thanks for reading the paper! We definitely agree; using WLS instead of regular regression doesn’t make a big difference in the final coefficients, and we don’t want to give the impression that it does. Was there something in the paper that implied that we did?
What we’re most interested in and excited by is optimizing and interpreting the weightings. We didn’t have enough time to feature it prominently in this paper (it is in the technical appendix), but here are some hyperparameter tuning results. Stepwise optimization over the weights roughly triples the effect size (still not a really big deal). We haven’t run any whole-grid optimization; our back-of-the-napkin math says that would take 4 days even if we parallelized the process over 4 cores. The hope is that we could find optimal weightings and then, as we outlined in “Applications Outside the Regression Context”, apply those weights to other methods.
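To make “stepwise” concrete, here is a rough sketch of the coordinate-at-a-time idea. The names (`loocv_error`, `stepwise_optimize`, `match_idx`) and the weight grid are illustrative, and this sketch leaves single alliance rows out in the LOOCV rather than whole matches, so it is not exactly what we ran:

```python
import numpy as np

def loocv_error(slot_weights, A, y, match_idx):
    """LOOCV error of WLS with one weight per schedule slot.
    match_idx maps each row of A to its position in the schedule."""
    sw = np.sqrt(slot_weights[match_idx])       # per-row sqrt-weights
    Aw, yw = A * sw[:, None], y * sw            # transform WLS -> OLS
    H = Aw @ np.linalg.pinv(Aw.T @ Aw) @ Aw.T   # hat matrix
    loo = (yw - H @ yw) / (1.0 - np.diag(H))    # leave-one-out residuals
    return float(np.mean(loo ** 2))

def stepwise_optimize(A, y, match_idx, n_slots,
                      grid=(0.5, 0.75, 1.0, 1.25, 1.5)):
    weights = np.ones(n_slots)                  # start flat (plain OLS)
    best = loocv_error(weights, A, y, match_idx)
    improved = True
    while improved:
        improved = False
        for slot in range(n_slots):             # one slot at a time,
            for candidate in grid:              # holding the rest fixed
                trial = weights.copy()
                trial[slot] = candidate
                err = loocv_error(trial, A, y, match_idx)
                if err < best - 1e-12:
                    best, weights, improved = err, trial, True
    return weights, best
```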
In the end we’re running up against Gauss-Markov here: as long as we’re using an unbiased estimator like regression, it basically doesn’t get more statistically efficient*, and the FRC sample size is really, really small. But regression has computational advantages that we can leverage to get the optimized weights, which is what we’re really after.
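For anyone who hasn’t met Gauss-Markov, here is the standard textbook statement we’re leaning on, in our notation:

```latex
% Gauss-Markov, as it applies here: with
%   y = X\beta + \varepsilon,  E[\varepsilon] = 0,  Var(\varepsilon_i) = \sigma_i^2,
% OLS is the best (minimum-variance) linear unbiased estimator only when
% the \sigma_i^2 are all equal. Under heteroskedasticity, the efficient
% linear unbiased estimator is WLS:
\hat{\beta}_{\mathrm{WLS}} = (X^\top W X)^{-1} X^\top W y,
\qquad W = \mathrm{diag}\!\left(1/\sigma_1^2, \dots, 1/\sigma_n^2\right)
```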
As for the CIs, I think you’re referring to Figures 4a and 4b; LaTeX puts the figure label below the image, and 3a/3b don’t have a ratio in them. We haven’t computed CIs for the LOOCV ratios yet. Did you have a particular procedure in mind? As far as I can tell, the only valid LOOCV CI procedures were developed in 2020 (Bayle, Bayle, Janson, and Mackey; Austern and Zhou), and we would need to be careful adapting them to a ratio.
*Or, more precisely, it only gets more efficient insofar as the assumptions are violated.
The title of the paper strongly implies that there is a significant optimization to be made to OLS OPR by weighting matches in the middle of an event differently. While there may be a statistically detectable effect in the 0.1% range when using the power conferred by 800-ish events over 14 years, there would likely be few, if any, matches over that 800-event span where the WLS estimator would differ from the OLS estimator at the integer-point level.
Yes. Apologies for my misread.
Each data point in Figure 4b has uncertainty, which could be represented by the CI of the LOOCV procedure. However, you have 800+ data points, which form their own distribution. You want the CI of the mean of that distribution, because that is the statistic on which an improvement/optimization over OLS is claimed. Even though Figure 4b shows a small amount of skew, a simple standard-deviation-based CI should be accurate enough:
x_bar +/- z * (s / sqrt(n))
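In code, that computation is only a few lines (the `ratios` array below is simulated placeholder data, just to show the mechanics, not your actual per-event values):

```python
import numpy as np
from scipy import stats

# Placeholder standing in for the 800+ per-event OLS/WLS LOOCV ratios;
# these numbers are made up, not from the paper.
rng = np.random.default_rng(0)
ratios = rng.normal(loc=0.999, scale=0.004, size=800)

n = len(ratios)
x_bar = ratios.mean()
s = ratios.std(ddof=1)          # sample standard deviation
z = stats.norm.ppf(0.975)       # two-sided 95%, z ~ 1.96
half = z * s / np.sqrt(n)
print(f"95% CI for the mean ratio: ({x_bar - half:.5f}, {x_bar + half:.5f})")
```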
A more rigorous approach, if you are concerned about non-normality, would be to use bootstrapping (e.g., with Python’s SciPy).
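For example, continuing the sketch above:

```python
from scipy.stats import bootstrap

# BCa bootstrap CI for the mean; no normality assumption. `ratios` is the
# same placeholder array as in the previous snippet.
res = bootstrap((ratios,), np.mean, confidence_level=0.95,
                n_resamples=9999, method="BCa", random_state=0)
print(res.confidence_interval)
```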
My guess from looking at the data is that you will find the CI of the mean does not come close to including 1.0, but it is important to make that computation.
What matters much more in practice is that the prediction interval for a single observation from the distribution almost certainly DOES include 1.0. If one were to use a WLS formulation of OPR at a particular event, one could not be assured that the results would be an improvement over an OLS formulation at that event.
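Under the same normal approximation, the prediction interval is one more line on top of the earlier sketch:

```python
# 95% prediction interval for a single event's ratio, under the same
# normal approximation: wider than the CI of the mean by about sqrt(n).
half_pi = z * s * np.sqrt(1 + 1 / n)
print(f"95% PI for one event: ({x_bar - half_pi:.5f}, {x_bar + half_pi:.5f})")
```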
This isn’t how we read it, but I’m glad to have a different perspective that helps us be more careful.
Ah, I see what you’re saying. Thanks for pointing this out; as you say, the skew is pretty small and the simple CI is probably fine. @anuva I think you’d be especially interested in this. Let’s talk about this point, make the required changes, and improve our validity for CMSAC. Once we’ve done the baseline, this would also be a good opportunity to explore the bootstrap.