I checked out your data_cleaner.py code and it looks like you’ve taken great care to make sure that your independent variables are based only on previous results, not future results. Very nice!
Unfortunately, I think you have a little bug in your code that means you need to re-run your analysis.
In line 49 of data_cleaner.py, you attempt to sort the matches by time. However, the sort call on that line does not assign the sorted DataFrame back to match_df, so your script continues processing with the unsorted match_df.
Line 49 needs to assign the sorted result back to match_df so that the sort applies to the DataFrame for further processing.
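To illustrate the assignment issue described above, here is a minimal sketch on a toy DataFrame. The column names "match_key" and "time" and the timestamp values are assumptions for the demo, not taken from data_cleaner.py; the point is only that pandas' sort_values returns a new DataFrame by default and leaves the original untouched.

```python
import pandas as pd

# Toy stand-in for match_df; column names and times are assumed for illustration.
match_df = pd.DataFrame({
    "match_key": ["2016abca_f1m1", "2016week0_qm1", "2016week0_qm2"],
    "time": [300, 100, 200],
})

# sort_values returns a new DataFrame by default; match_df itself is unchanged.
match_df.sort_values(by="time")
first_without_assignment = match_df.iloc[0]["match_key"]

# Assigning the result back makes the sorted order visible to later code.
match_df = match_df.sort_values(by="time").reset_index(drop=True)
first_with_assignment = match_df.iloc[0]["match_key"]

print(first_without_assignment)  # still the unsorted first row
print(first_with_assignment)     # earliest match after the sort is applied
```

Passing inplace=True to sort_values would also work, but assigning the result back is the more common idiom.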
You can see the problem manifest in frc_use_case.csv. The first row with match 2016week0_qm1 should have no history of avg_winrate or avg_games_played, yet there are non-zero values in those fields. Conversely, row 9835 for match 2016abca_f1m1 has avg_winrate and avg_games_played as zeros where they should have non-zero values since there is history from the 2016abca event and all the 2016 events from earlier weeks. 2016abca_f1m1 happened to be the first row of the unsorted match_df when you started engineering your features match-by-match.
I could re-run all of this for you since you made your code available, but I’ll let you do it and give us an update on the results. It will be interesting to see if predictions are better or worse.
While doing that, it would also be of great interest if you could compute the importance of each independent variable on the win prediction for your various model forms. For example, how much do the match awards contribute to the prediction vs the average winrate, average team age, or any other variable?
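One model-agnostic way to get at this is permutation importance, which scikit-learn provides as sklearn.inspection.permutation_importance. The sketch below is only an illustration on synthetic data: the feature values are random, the "noise" column is made up, and logistic regression is a stand-in for whichever model forms you actually used. The same call should work for any fitted scikit-learn estimator.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): only the first column actually drives the label.
rng = np.random.default_rng(0)
feature_names = ["avg_winrate", "avg_games_played", "noise"]
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
importances = dict(zip(feature_names, result.importances_mean))

for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Because the importance is measured as a drop in held-out score, the numbers are comparable across model forms, which is handy when comparing kNN against a tree-based or linear model.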
By the way, you should probably do feature scaling with your kNN model. Since it’s distance-based, the “nearest neighbors” are going to be influenced by the relative scales of the features.
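A rough sketch of what that could look like with scikit-learn, using a pipeline so the scaler is fit on the training data only. The data here is synthetic and deliberately exaggerated (one informative feature on a scale of about 1, one uninformative feature on a scale of about 1000); it is not your dataset, and n_neighbors=5 is just a default choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): "small" determines the label, "large" is noise.
rng = np.random.default_rng(0)
n = 400
small = rng.normal(size=n)
large = rng.normal(scale=1000, size=n)
X = np.column_stack([small, large])
y = (small > 0).astype(int)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Without scaling, distances are dominated by the large-scale feature.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# With scaling, both features contribute comparably to the distance.
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_train, y_train)

raw_acc = raw_knn.score(X_test, y_test)
scaled_acc = scaled_knn.score(X_test, y_test)
print(f"unscaled accuracy: {raw_acc:.2f}")
print(f"scaled accuracy:   {scaled_acc:.2f}")
```

The effect on your real features will depend on how different their scales actually are, so it is worth comparing both versions on your held-out matches.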