There’s a ton of data on The Blue Alliance. We have data about individual matches, OPR for an event, insights over years, tons of match videos, and so much more. But how do we use that data effectively?
There’s a lot to be said about casually browsing the site and taking in the information as a human. But that’s just one way we can use the data. What if we used a bit of machine learning? Ideally, a machine learning model would allow us to do a few things:
Effectively summarize our data
Predict things about our data
Provide us with insights (ideally ones we wouldn’t have found on our own)
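First, though, we have to get the data out of The Blue Alliance. Here’s a rough sketch of what that can look like against the v3 REST API; the event key “2023casj” and the auth key are placeholders, and you’d generate your own read key from your TBA account settings:

```python
import requests

TBA = "https://www.thebluealliance.com/api/v3"
# Placeholder key -- get a real read key from your TBA account page.
headers = {"X-TBA-Auth-Key": "YOUR_READ_KEY"}

# "2023casj" is just a placeholder event key.
matches = requests.get(f"{TBA}/event/2023casj/matches/simple",
                       headers=headers).json()
for match in matches[:3]:
    red, blue = match["alliances"]["red"], match["alliances"]["blue"]
    print(match["key"], red["score"], "-", blue["score"])
```

Once you have match results like these, you can start assembling feature matrices and labels for a model.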
One question that came up: “One thing that may not be clear to many is what you mean by ‘overfitting’. I think I understand that, but I don’t see how you keep it from being a problem.”
If someone doesn’t know what overfitting is, the Wikipedia article is pretty good.
One of the telltale signs of overfitting is that your training error keeps falling while your testing error starts to rise. Thus, a common technique is to train on the training data and then, after each training iteration or so, test against an intermediate, separate dataset (typically known as the validation dataset). If the error on this validation dataset starts increasing, you can assume you’re getting to the point where you’re overfitting and stop training early (hence the name “early stopping”). Or you can just cheat and use the testing dataset.
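Here’s a minimal sketch of that pattern. The data is made up, and I’m using scikit-learn’s SGDClassifier with logistic loss purely as a stand-in model:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))           # stand-in for real FRC features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in labels

# Hold out a validation set that training never sees.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# loss="log_loss" on recent scikit-learn; older versions call it "log".
model = SGDClassifier(loss="log_loss", random_state=0)
best_val_error, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(200):
    model.partial_fit(X_train, y_train, classes=[0, 1])
    val_error = 1.0 - model.score(X_val, y_val)
    if val_error < best_val_error:
        best_val_error, bad_epochs = val_error, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:
        # Validation error stopped improving: stop early.
        print(f"stopping at epoch {epoch}; best val error {best_val_error:.3f}")
        break
```

The “patience” counter just keeps a few bad epochs from triggering a premature stop; in practice you’d also keep a copy of the best model seen so far.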
There are lots of algorithm-specific ways to curb overfitting too. In practice, though, this is one of those areas that’s more art than science. I find it’s useful to pick a model with the right number of parameters: a neural net can easily have over a hundred million parameters, whereas a logistic regression for FRC purposes can run in the 10-50 range. The more parameters you have, the more room your model has to overfit.
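As one concrete example of an algorithm-specific defense, here’s a hedged sketch of an L2-penalized logistic regression in scikit-learn; the features, labels, and the C value are all placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 30))               # ~30 made-up features, FRC-sized
y = (X[:, :3].sum(axis=1) > 0).astype(int)   # made-up labels

# Smaller C means a stronger L2 penalty, shrinking weights toward zero
# and making it harder for the model to memorize noise.
model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

# Parameter count: one weight per feature plus an intercept.
n_params = model.coef_.size + model.intercept_.size
print(f"{n_params} parameters")              # 31 here, squarely in that 10-50 range
```

For neural nets the analogous knobs are things like weight decay and dropout, but the idea is the same: penalize complexity so the model can’t memorize the training set.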