FRC Match Predictor

Hey guys, former FRC team member here. It’s my first time posting here, and this is also my first data science project, so it isn’t perfect, but it was fun to make!

To cut to the chase: using data from past matches, I was able to build a model that predicts the outcome of FRC matches with what looks like an accuracy of 73%. The best model in the end turned out to be logistic regression. The write-up and the code for this project can be found here: https://github.com/Mark-of-JP/FRC-Match-Simulator/blob/master/FRC%20Match%20Simulator.ipynb

Unfortunately, due to the changing nature of FRC seasons, I had to generalize a lot of the data. Any tips/criticisms are welcome!

9 Likes

Can you elaborate on how you did your feature engineering? How do you define “red_points_ratio” and the like? What led you to select those as predictors of match outcomes?

Also, I am a bit concerned with this line here:

X_train, X_test, y_train_prime, y_test_prime = train_test_split(X, y, test_size=0.25, random_state=5406)

If I’m understanding your notebook right, this means that your training data could include results from matches that are chronologically in the future – which could mean that you’re using data from the future to inform predictions for matches in the past.
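One way to avoid that is to split chronologically rather than randomly. For example, something like this rough sketch (assuming your cleaned match DataFrame has a 'time' column; the feature and label column names here are just placeholders):

import pandas as pd

def chronological_split(match_df: pd.DataFrame, feature_cols, label_col, train_frac=0.75):
    # Sort by match time, then train on the earliest 75% of matches so that
    # every test match happens after every training match.
    df = match_df.sort_values(by='time').reset_index(drop=True)
    split_idx = int(len(df) * train_frac)
    train, test = df.iloc[:split_idx], df.iloc[split_idx:]
    return train[feature_cols], test[feature_cols], train[label_col], test[label_col]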

So “red_points_ratio” was computed by looking at an alliance and taking the average number of points each team in that alliance scored that season, then averaging those per-team averages to get the average score of the alliance. I then divided the red alliance’s average score by the blue alliance’s. For this one in particular I also debated whether I should change the way I got the average score for each alliance. Another way I thought about it was to take the average point contribution each team gives and then add those contributions together.
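In code, the idea is roughly this (a simplified sketch with made-up numbers, not my exact implementation):

def alliance_avg_score(team_season_avgs):
    # team_season_avgs: each team's average match score so far that season
    return sum(team_season_avgs) / len(team_season_avgs)

def red_points_ratio(red_team_avgs, blue_team_avgs):
    # ratio of the red alliance's average score to the blue alliance's
    return alliance_avg_score(red_team_avgs) / alliance_avg_score(blue_team_avgs)

# e.g. red teams averaging 40, 55 and 62 points vs. blue teams averaging 45, 50 and 48
ratio = red_points_ratio([40, 55, 62], [45, 50, 48])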

Most of the predictors I chose came from intuition and were then backed up with graphs that I looked through. I wasn’t sure whether to include all the graphs since it may be a lot, so I provided the intuition for one of them.

For the train/test split: while yes, technically I used future data to predict matches from the past, I aimed to make my model mostly time- and team-agnostic. What I mean is that the model looks at the current state of each match, and each match contains the stats and accolades of all the teams up until that point. The model does not know how well team 5406 specifically is doing; it just knows how well each alliance is doing for that match. So a match from 2016 and a match from 2019 could have the exact same stats, and thus the same prediction would be made.

What it seems like, though, is that I did not have enough documentation about my features and variables, which I will rectify.

Also, I am no expert, so if I am doing something horribly wrong then please tell me! Thanks for your tips and questions.

1 Like

Is this based on OPR, or are you doing a straight arithmetic mean? And are you using the points a team has scored up to that point in the season, or are you potentially using the points the team has scored in the future?

From personal experience, I can tell you that using points from future matches doesn’t work. My COPR random forest was able to exploit this and dramatically overfit – even across multiple years of data.

Not to self-promote, but I recently wrote up two blog posts (here and here) on match prediction methods. Another good resource is this thread, where a bunch of folks predicted 2019 championship matches as a competition.

1 Like

It’s not OPR. It’s what I believe you are referring to as a straight arithmetic mean. Also, all data for each match is set up so it only uses data from previous matches, and no future data affects matches in the past.

For the second point, can you explain why using future data does not work? No future data is touching past data, so a match from 2017 has no idea what any matches from 2019 look like. Is using the future data to train my model potentially tainting my test set?

Also, I checked out your match prediction posts and they seem cool and interesting! I might look into your methods more after I play around and finish my stuff.

This looks really cool, thanks for sharing! I was interested in seeing how these models compared against an Elo or OPR approach. Between 2016 and 2020, my Elo model had an accuracy of 0.70, OPR had an accuracy of 0.69, and a mixed approach had an accuracy of 0.71. So it looks like you managed to squeeze out a few percentage point improvements! Have you tried calculating probabilities instead of binary classification yet?
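If you do try probabilities, logistic regression in scikit-learn already exposes them through predict_proba, so it’s a small change. A sketch, assuming X_train/y_train/X_test are the feature matrices and labels from your existing split:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict_proba returns one column per class (ordered as in model.classes_);
# the column for the "red wins" class gives a win probability you can compare
# directly against Elo or OPR win probabilities instead of just accuracies.
win_probs = model.predict_proba(X_test)[:, 1]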

Thanks man! I haven’t tried calculating probabilities yet, but that sounds like an interesting idea. I might look into that once I polish this current project and make sure that it is valid.

Edit: I also just checked out your Statbotics page and it seems pretty impressive.

1 Like

I checked out your data_cleaner.py code and it looks like you’ve taken great care to make sure that your independent variables are based only on previous results, not future results. Very nice!

Unfortunately, I think you have a little bug in your code that means you need to re-run your analysis.

In line 49 of data_cleaner.py, you attempt to sort the matches by time. However

match_df.sort_values(by='time')

does not assign the sorted DataFrame back to match_df. Your script continues processing with the unsorted match_df.

Line 49 needs to be

match_df.sort_values(by='time', inplace=True)

to apply the sort to the DataFrame for further processing.

You can see the problem manifest in frc_use_case.csv. The first row with match 2016week0_qm1 should have no history of avg_winrate or avg_games_played, yet there are non-zero values in those fields. Conversely, row 9835 for match 2016abca_f1m1 has avg_winrate and avg_games_played as zeros where they should have non-zero values since there is history from the 2016abca event and all the 2016 events from earlier weeks. 2016abca_f1m1 happened to be the first row of the unsorted match_df when you started engineering your features match-by-match.

I could re-run all of this for you since you made your code available, but I’ll let you do it and give us an update on the results. It will be interesting to see if predictions are better or worse.

While doing that, it would also be of great interest if you could compute the importance of each independent variable on the win prediction for your various model forms. For example, how much do the match awards contribute to the prediction vs the average winrate, average team age, or any other variable?
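For example, scikit-learn’s permutation_importance works with any fitted estimator, so a rough sketch (assuming model is your fitted classifier and X_test/y_test are your held-out features and labels) could be:

from sklearn.inspection import permutation_importance

# Shuffle one feature at a time and measure how much the test accuracy drops;
# a big drop means the model leans heavily on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for name, score in sorted(zip(X_test.columns, result.importances_mean), key=lambda p: -p[1]):
    print(f"{name}: {score:.4f}")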

By the way, you should probably do feature scaling with your kNN model. Since it’s distance-based, the “nearest neighbors” are going to be influenced by the relative scales of the features.
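The simplest fix is probably to wrap the kNN model in a pipeline with a scaler, roughly like this (again a sketch with assumed variable names):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Standardize each feature to zero mean and unit variance before the distance
# computation, so large-scale features (point totals) don't drown out
# small-scale ones (ratios, winrates) in the nearest-neighbor search.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))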

2 Likes

So the zeros for all values in the very first match are actually kind of expected. My thought process was that even though there is data before 2016, I wasn’t sure how outdated that data is. For example, in some matches in previous years they only played with two-team alliances. FRC changes so much that I didn’t want to use data that was too outdated, and going back to 2016 felt far enough to still give me a healthy amount of data while not being too outdated. That is my intuition though, and if you feel like I should go back further I can try it out.

Everything else does sound interesting, like the bug you talked about (oof). I will be implementing the feature scaling and variable importance!

I agree that zeros for the first match are expected with the way you are doing things. That’s OK, and very reasonable. Every season we start off with a bit of a blank slate in terms of how a team’s new robot will perform. In your case, your slate is a little more blank for the 2016 matches since you don’t have the match history from previous years to predict with, but you’ve got to start somewhere, and 4+ years of data seems more than enough to drown out the extra startup inaccuracy in the early 2016 matches.

You could always build your dataset starting in 2016, but only use 2017 and beyond for training and test. That way you have at least one year of history for all matches rather than starting that first year with none.
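Concretely, that could be as small as a filter on the match year after the feature engineering pass runs over the full 2016+ history. A sketch, assuming a match_key column holding keys that start with the year (like 2016abca_f1m1):

def restrict_to_trainable_years(match_df, first_year=2017):
    # Keep only matches from first_year onward for training/testing, while the
    # earlier 2016 matches still contribute to each team's rolling history.
    years = match_df['match_key'].str[:4].astype(int)
    return match_df[years >= first_year]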

After looking over my model and researching some more, there seems to be a good amount of stuff I want to add or change: for example, better explaining the feature selection, altering some features to remove their correlation with other features, and fixing any bugs that exist. Please keep giving critiques and tips if you guys have time. I am going to try and overhaul some parts of the model and the notebook!

1 Like

Ooof, thanks for the bug report! When I fixed the bug, it caused another bug to occur, since I had made an assumption about the awards based on the unsorted DataFrame. The accuracy is now 70%. I am not finished, but as a quick update, I added Elo to the model as well as feature scaling.
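For anyone unfamiliar, the Elo feature is built on the standard expected-score and update rule; a stripped-down sketch of that idea (not my exact implementation, the K-factor and alliance handling are placeholders):

def expected_red_win(red_rating, blue_rating):
    # Standard Elo expected score: probability that red beats blue
    return 1.0 / (1.0 + 10 ** ((blue_rating - red_rating) / 400.0))

def update_ratings(red_rating, blue_rating, red_won, k=32):
    # Move both ratings toward the observed result after each match
    expected = expected_red_win(red_rating, blue_rating)
    actual = 1.0 if red_won else 0.0
    return red_rating + k * (actual - expected), blue_rating - k * (actual - expected)

# An alliance rating can just be the mean of its three teams' ratings, with the
# update split evenly among them.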

The next step is to add variable importance.

1 Like