#44 | 14-07-2015, 09:22
wgardner (Registered User, Team Role: Coach)
Join Date: Feb 2013 | Rookie Year: 2012 | Location: Charlottesville, VA | Posts: 172
Re: "standard error" of OPR values

Quote:
Originally Posted by Oblarg
I don't think this necessarily indicates "overfitting" in the traditional sense of the word - you're always going to get an artificially-low estimate of your error when you test your model against the same data you used to tune it, whether your model is overfitting or not (the only way to avoid this is to partition your data into model and verification sets). This is "double dipping."

Rather, it would be overfitting if the predictive power of the model (when tested against data not used to tune it) did not increase with the amount of data available to tune the parameters. I highly doubt that is the case here.
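
(For reference, the kind of model/verification split described in that quote looks roughly like the Python sketch below; the design matrix and scores are random placeholders standing in for a real schedule and real match data.)

[CODE]
# Fit OPR on one subset of score rows, then evaluate on held-out rows.
# A and y below are random placeholders, not real match data.
import numpy as np

rng = np.random.default_rng(1)
A = (rng.random((96, 24)) < 0.1).astype(float)   # stand-in alliance indicator rows
y = rng.normal(100, 20, size=96)                 # stand-in alliance scores

idx = rng.permutation(len(y))
fit, verify = idx[:72], idx[72:]                 # ~75/25 model/verification split

opr, *_ = np.linalg.lstsq(A[fit], y[fit], rcond=None)

def r2(rows):
    resid = y[rows] - A[rows] @ opr
    return 1 - resid.var() / y[rows].var()

print(f"in-sample R^2:  {r2(fit):.2f}")
print(f"held-out R^2:   {r2(verify):.2f}")   # on pure noise this is near zero or negative
[/CODE]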
From Wikipedia on Overfitting: "In statistics and machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations."

On the first sentence of that quote: I previously found that if I replaced the data from the 2014 casa tournament (which had the greatest number of matches per team of the tournaments I worked with) with completely random noise, OPR could still "predict" 26% of the variance and WMPR could "predict" 47% of it. So the models are clearly describing random noise in that case; a properly fit model would come closer to finding no relationship between the model parameters and the data, as it should when the data is purely random.
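
For concreteness, here is a rough sketch of that noise test (the team count, schedule, and score distribution are made up to roughly match a casa-sized event with 4 score observations per parameter, not the actual data):

[CODE]
# Fit OPR to purely random scores and see how much variance it "explains".
# 24 teams, 48 matches, 2 teams per alliance -> 96 score rows, 4 per parameter.
import numpy as np

rng = np.random.default_rng(0)
n_teams, n_matches = 24, 48

A_rows, scores = [], []
for _ in range(n_matches):
    picks = rng.choice(n_teams, size=4, replace=False)   # red pair + blue pair
    for alliance in (picks[:2], picks[2:]):
        row = np.zeros(n_teams)
        row[alliance] = 1.0
        A_rows.append(row)
        scores.append(rng.normal(100, 20))               # pure noise, no team effect

A, y = np.array(A_rows), np.array(scores)

# OPR = least-squares solution of A @ opr ~= y
opr, *_ = np.linalg.lstsq(A, y, rcond=None)

resid = y - A @ opr
r2 = 1 - resid.var() / y.var()
print(f"in-sample R^2 on pure noise: {r2:.2f}")   # typically around 0.25 here
[/CODE]

Roughly speaking, fitting k free parameters to n pure-noise observations gives an in-sample R^2 on the order of k/n, which is 24/96 = 25% in this made-up setup.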

On the second sentence: again for the 2014 casa tournament, the OPR calculation has only 4 data points per parameter and WMPR has only 2, which again sounds like "having too many parameters relative to the number of observations" to me. BTW, I think the model itself is appropriate, so I view it more as a problem of having too few observations than of having too many parameters.
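
Those per-parameter counts fall straight out of the schedule. Here is the back-of-the-envelope version, assuming the standard 2-team-per-alliance format, 8 matches per team, two score equations per match for OPR, and one winning-margin equation per match for WMPR (the 8 is an assumption chosen to reproduce the 4 and 2 figures, not taken from the actual casa schedule):

[CODE]
# Observations per parameter for OPR vs. WMPR.
# Each match fields 4 teams, so matches M and teams N satisfy 4*M = N * matches_per_team.
matches_per_team = 8                          # assumed, not from the real schedule
matches_per_param = matches_per_team / 4      # = M / N
opr_obs_per_param = 2 * matches_per_param     # OPR: 2 score equations per match
wmpr_obs_per_param = 1 * matches_per_param    # WMPR: 1 winning-margin equation per match
print(opr_obs_per_param, wmpr_obs_per_param)  # -> 4.0 2.0
[/CODE]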

And again, the casa tournament is one of the best cases. Most other tournaments have even fewer observations per parameter.

So that's why I think it's overfitting. Your opinion may differ. No worries either way.

This is also discussed a bit in the section on "Effects of Tournament Size" in my "Overview and Analysis of First Stats" paper.
__________________
CHEER4FTC website and Facebook: online FTC resources.
Providing support for FTC Teams in the Charlottesville, VA area and beyond.

Last edited by wgardner : 14-07-2015 at 09:25.