Predicting Ranking Points from COPR

Edit: upon further review, my results contained errors. See thread below.

I had some free time and did some machine learning to predict ranking points over the last few seasons. Here are some sample results:

2019 season:

Metric   Accuracy   Precision   Recall   Brier Score
Win      86.3%      86.3%       86.3%    0.136
Hab      88.1%      88.2%       88.2%    0.118
Rocket   98.5%      98.4%       98.5%    0.015
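
For reference, the Brier Score column is the mean squared error between the predicted probability and the 0/1 outcome (lower is better). A minimal sketch with made-up numbers:

```python
# Brier score: mean squared error between predicted probabilities
# and the 0/1 outcomes; lower is better. Toy numbers for illustration.
preds = [0.9, 0.2, 0.7, 0.4]
outcomes = [1, 0, 1, 0]
brier = sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)
# brier ~ 0.075
```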

2018 season:

Metric   Accuracy   Precision   Recall   Brier Score
Win      86.6%      86.6%       86.6%    0.134
Auto     79.2%      79.2%       79.2%    0.207
Climb    95.8%      95.6%       95.8%    0.042

The full post, including methodology and an explanation on the machine learning algorithms I used, is available on my blog.

8 Likes

I also experimented with splitting the train and test datasets by time – that is, using events from weeks 1, 2, and 3 to predict events in week 4.
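
A split like that might look like the following sketch; the record layout and `week` field here are hypothetical, just to show the idea:

```python
# Hypothetical week-based train/test split. The `week` field and the
# shape of the match records are made up for illustration.
def split_by_week(matches, test_week):
    train = [m for m in matches if m["week"] < test_week]
    test = [m for m in matches if m["week"] == test_week]
    return train, test

matches = [{"week": w} for w in (1, 1, 2, 3, 3, 4, 4)]
train, test = split_by_week(matches, test_week=4)
# len(train) == 5, len(test) == 2
```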

Were you using post-event data to compute cOPRs and then RP predictions? Or results from matches 1-N to predict the result of match N+1?

86% accuracy for win prediction would be very impressive (75% is typical for elo), but it looks like you’re including data from both the match you’re predicting, and future matches. Sure you have a test dataset and a train dataset, but the cOPR data still includes the data from all of the matches the team played, right?

The tables I mentioned in this thread (corresponding to tables 2 and 3 in the blog post) used data from weeks 1, 2, and 3 as the training set and were tested on data from week 4 of competition:

I also experimented with implementing the train and test datasets based on timing – that is, using events from weeks 1, 2, and 3 to predict events in week 4. This decreased performance, but is probably a more realistic test.

It depends on how you calculate it. You could simultaneously solve OPR (and cOPR) for all matches in a season. If you do that, I recommend getting some marshmallows – my computer turned into a space heater for a while.

For the results I reported in those tables, I calculated cOPR across weeks 1, 2, and 3 simultaneously (my training data). I calculated cOPR for week 4 separately.
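
As a rough illustration of that simultaneous solve (team names and scores invented here; a real solve spans every alliance in the training window):

```python
import numpy as np

# Toy OPR-style least-squares solve: each row of A marks which teams
# played on an alliance, y holds that alliance's score, and lstsq
# recovers per-team contributions. The rank-deficient cases are where
# OPR's instability shows up.
teams = ["frc1", "frc2", "frc3", "frc4"]
idx = {t: i for i, t in enumerate(teams)}
alliances = [(["frc1", "frc2"], 60.0),
             (["frc3", "frc4"], 40.0),
             (["frc1", "frc3"], 55.0),
             (["frc2", "frc4"], 45.0)]

A = np.zeros((len(alliances), len(teams)))
y = np.array([score for _, score in alliances])
for r, (members, _) in enumerate(alliances):
    for t in members:
        A[r, idx[t]] = 1.0

# lstsq returns the minimum-norm solution when A is rank deficient
opr, *_ = np.linalg.lstsq(A, y, rcond=None)
```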

Did you include the rp and related ranking point fields as part of your training set? It seems as though your model has a circular dependency on this data.

You can’t calculate cOPR for week 4 then use that to predict results. Is that what you’re doing? Or are you calculating cOPR as matches are played? How do you deal with the instability problem of OPR?

Sorry if my question wasn’t clear. When testing on week 4 data, were you

  1. Taking the full week 4 data, calculating cOPRs, using those cOPRs to calculate RP predictions (using the model tuned on weeks 1-3), and then checking the accuracy of those predictions

or

  2. Taking a week 4 event's matches 1-N, calculating cOPRs with that data, using those cOPRs to calculate an RP prediction for match N+1, and then checking the accuracy of that prediction

From my understanding you’re doing (1), which uses completed match results to re-predict those same matches, as opposed to (2), which would measure accuracy as if the matches were being predicted while an event was happening.

1 Like

Your results seem odd. At most, your accuracy, precision, and recall differ by 0.2%, and they are often equal. For all three to be equal, your model needs to produce exactly the same number of false positives and false negatives, as well as exactly the same number of true positives and true negatives. It seems like the chances of that happening repeatedly, across multiple target variables and multiple seasons, would be remote. I’m not sure it is even possible unless the dataset starts out exactly balanced between positive and negative outcomes, and bonus RP outcomes are far from balanced.

Are you confident that these classification metrics are being computed correctly? What does the confusion matrix look like for the 2019 season Rocket RP, for example?

Thanks for the feedback everyone. I ran most of these experiments in March and did the methodology write up a while back too. I can’t recall exactly how I was handling this. For the sake of accuracy and completeness, I’m re-running my experiments. I will update the post and this thread once I have results back.

They aren’t equal – I just chose to round to the tenths place. Remember that there are 72 data points generated for each match (6 permutations of red * 6 of blue * 2 alliances). I’ll provide a confusion matrix from the results I’m generating right now.
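
For what it’s worth, that count is easy to reproduce – this is my reading of the 6 * 6 * 2 breakdown, with placeholder team labels:

```python
from itertools import permutations

# 6 orderings of the red robots x 6 orderings of the blue robots,
# seen from each alliance's perspective (x2): 6 * 6 * 2 = 72 rows.
red = ("r1", "r2", "r3")
blue = ("b1", "b2", "b3")
rows = [(us, them)
        for a, b in ((red, blue), (blue, red))
        for us in permutations(a)
        for them in permutations(b)]
# len(rows) == 72
```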

1 Like

The differences I’m talking about are well beyond hundredths of percents. To use a nice round number, consider 10000 opportunities for a bonus ranking point (5000 matches). Let’s say it was the 2019 Hab ranking point and assume that it was achieved in 25% of opportunities. Let’s also consider a very good model that predicts correctly 90% of the time that the RP was achieved as well as 90% of the time that the RP was not achieved. The confusion matrix would be:

               Hab RP     Hab RP
               achieved   not achieved
Hab RP
predicted       2250        750

Hab RP
not predicted    250       6750

In this case, the classification metrics would be:
Accuracy = (2250+6750)/10000 = 90%
Precision = 2250/(2250+750) = 75%
Recall = 2250/(2250+250) = 90%

Precision would be significantly different than accuracy and recall. The difference would be much more than rounding error. If the model was much better at avoiding false positives such that there were only 250 of them for all 7500 opportunities where the RP was not achieved, you would get

               Hab RP     Hab RP
               achieved   not achieved
Hab RP
predicted       2250        250

Hab RP
not predicted    250       7250

In this case, the classification metrics would be:
Accuracy = (2250+7250)/10000 = 95%
Precision = 2250/(2250+250) = 90%
Recall = 2250/(2250+250) = 90%

Accuracy would be significantly different than both precision and recall. Again, a much bigger difference than rounding error.
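
Both worked examples above check out numerically; here they are as a quick script:

```python
# Recomputing the metrics for the two worked confusion matrices above.
def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return (tp + tn) / total, tp / (tp + fp), tp / (tp + fn)

first = metrics(tp=2250, fp=750, fn=250, tn=6750)   # ~ (0.90, 0.75, 0.90)
second = metrics(tp=2250, fp=250, fn=250, tn=7250)  # ~ (0.95, 0.90, 0.90)
```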

I look forward to seeing the results after you re-run your analysis.

There was a bug with how I was calculating precision and recall where I was weighting different classes incorrectly – probably a holdover from some nonbinary classifier I was hacking on or something. Unfortunately, I only noticed it after I had re-run my experiments…so my results are still baking in the oven.

I re-ran my experiments. I found a few problems:

  • I was incorrectly weighting precision and recall scores by class. I believe it was a holdover from earlier iterations of my experiments, where I was exploring non-binary classification.
  • Based on these new numbers, it’s pretty clear that my previous experiments had an issue computing cOPRs using matches from the future. My new Brier scores are far higher (worse), and my accuracy, precision, and recall are far lower (also worse).
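
For concreteness, here is my guess at what that weighting bug amounts to (reusing the 10000-opportunity worked example from earlier in the thread): averaging per-class precision weighted by class support drags the score toward accuracy on imbalanced data, which would explain the near-identical metrics.

```python
# Assumed mechanism for the class-weighting bug, using the counts from
# the first worked confusion matrix earlier in the thread.
tp, fp, fn, tn = 2250, 750, 250, 6750

precision_pos = tp / (tp + fp)   # correct binary precision: 0.75

# A support-weighted average over BOTH classes treats "RP not achieved"
# as its own class and pulls the score toward accuracy (0.90):
precision_neg = tn / (tn + fn)
support_pos, support_neg = tp + fn, tn + fp
weighted = (precision_pos * support_pos
            + precision_neg * support_neg) / (support_pos + support_neg)
# weighted ~ 0.911, far from the true binary precision of 0.75
```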

To tell you the truth, I’m pretty embarrassed I didn’t catch these errors sooner – and I want to apologize for not validating my results more rigorously. I have a few trusted peers in the community review my CD papers, but I don’t have the same policy for personal blog posts.

I currently believe that the model is learning but is overfitting (results available below). I still think there’s potential here, but I don’t have more time to debug things. In the meantime, I’ve unpublished the blog post and edited my original post in this thread.

2019
[win train] brier: 0.0
[win train] accuracy: 1.0
[win train] precision: 1.0
[win train] recall: 1.0
[win train] confusion:
array([[226908,      0],
       [     0, 217044]])

[win test] brier: 0.28642322097378276
[win test] accuracy: 0.7135767790262172
[win test] precision: 0.7177978983543718
[win test] recall: 0.6888001014713343
[win test] confusion:
array([[12002,  4270],
       [ 4907, 10861]])

[hab train] brier: 0.0
[hab train] accuracy: 1.0
[hab train] precision: 1.0
[hab train] recall: 1.0
[hab train] confusion:
array([[313992,      0],
       [     0, 129960]])

[hab test] brier: 0.24079275905118602
[hab test] accuracy: 0.759207240948814
[hab test] precision: 0.7346512235478534
[hab test] recall: 0.6437728937728938
[hab test] confusion:
array([[15889,  3047],
       [ 4668,  8436]])

[rkt train] brier: 0.0
[rkt train] accuracy: 1.0
[rkt train] precision: 1.0
[rkt train] recall: 1.0
[rkt train] confusion:
array([[433368,      0],
       [     0,  10584]])

[rkt test] brier: 0.05811485642946317
[rkt test] accuracy: 0.9418851435705369
[rkt test] precision: 0.16758241758241757
[rkt test] recall: 0.037654320987654324
[rkt test] confusion:
array([[30117,   303],
       [ 1559,    61]])

2018
[win train] brier: 0.0
[win train] accuracy: 1.0
[win train] precision: 1.0
[win train] recall: 1.0
[win train] confusion:
array([[212328,      0],
       [     0, 211824]])

[win test] brier: 0.2878498727735369
[win test] accuracy: 0.7121501272264631
[win test] precision: 0.7145939725239157
[win test] recall: 0.7040107709750567
[win test] confusion:
array([[10216,  3968],
       [ 4177,  9935]])

[auto train] brier: 0.0
[auto train] accuracy: 1.0
[auto train] precision: 1.0
[auto train] recall: 1.0
[auto train] confusion:
array([[276228,      0],
       [     0, 147924]])

[auto test] brier: 0.39351851851851855
[auto test] accuracy: 0.6064814814814815
[auto test] precision: 0.7225664092336417
[auto test] recall: 0.5197802197802198
[auto test] confusion:
array([[8647, 3269],
       [7866, 8514]])

[climb train] brier: 0.0
[climb train] accuracy: 1.0
[climb train] precision: 1.0
[climb train] recall: 1.0
[climb train] confusion:
array([[407520,      0],
       [     0,  16632]])

[climb test] brier: 0.046296296296296294
[climb test] accuracy: 0.9537037037037037
[climb test] precision: 0.5316091954022989
[climb test] recall: 0.1388888888888889
[climb test] confusion:
array([[26801,   163],
       [ 1147,   185]])

2017
[win train] brier: 0.0
[win train] accuracy: 1.0
[win train] precision: 1.0
[win train] recall: 1.0
[win train] confusion:
array([[195444,      0],
       [     0, 188028]])

[win test] brier: 0.371354543263437
[win test] accuracy: 0.628645456736563
[win test] precision: 0.6293059453942332
[win test] recall: 0.6022588522588522
[win test] confusion:
array([[11001,  5811],
       [ 6515,  9865]])

[kpa train] brier: 0.0
[kpa train] accuracy: 1.0
[kpa train] precision: 1.0
[kpa train] recall: 1.0
[kpa train] confusion:
array([[379908,      0],
       [     0,   3564]])

[kpa test] brier: 0.01654013015184382
[kpa test] accuracy: 0.9834598698481561
[kpa test] precision: 0.2289156626506024
[kpa test] recall: 0.037698412698412696
[kpa test] confusion:
array([[32624,    64],
       [  485,    19]])

[rotor train] brier: 0.0
[rotor train] accuracy: 1.0
[rotor train] precision: 1.0
[rotor train] recall: 1.0
[rotor train] confusion:
array([[381096,      0],
       [     0,   2376]])

[rotor test] brier: 0.04775247047481321
[rotor test] accuracy: 0.9522475295251868
[rotor test] precision: 0.3323076923076923
[rotor test] recall: 0.07317073170731707
[rotor test] confusion:
array([[31499,   217],
       [ 1368,   108]])
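
As a spot check, the reported 2019 [win test] numbers are internally consistent – recomputing them from the confusion matrix, assuming sklearn’s layout (rows are the true class, columns the predicted class, negative class first), which the reported precision and recall agree with:

```python
# 2019 [win test] confusion matrix from above, sklearn-style layout.
tn, fp, fn, tp = 12002, 4270, 4907, 10861

accuracy = (tp + tn) / (tn + fp + fn + tp)  # ~ 0.71358
precision = tp / (tp + fp)                  # ~ 0.71780
recall = tp / (tp + fn)                     # ~ 0.68880
```
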
9 Likes

No worries, almost all of us have been there at least once. :slight_smile: You’re fine in my book as long as you own up to the mistake.

4 Likes

It’s a good reminder for us all of the importance of validation. It’s also nice to see that the community is thinking critically, but respectfully. Thanks for such a quick re-do and for the transparency.

1 Like