In response to requests for more model validation, I decided to pull all of the top 4, top 8, and top 12 predictions my current simulator would have made at the week 3-4 events, at four different points during each event: before the schedule was released, after the schedule was released, after 1/3 of qual matches had been played, and after 2/3 of qual matches had been played. The results can be seen in my “week3-4_calibration_results.xlsx” workbook, which contains the raw data, a sheet with calibration results, and a sheet of graphs showing those results.
I’m still toying around with how best to represent the calibration results. I’ve used this format in the past, but it is limited by arbitrary bin sizes and by its inability to show how many predictions of each type I make. One idea I had was to use my normal calibration curve format, but with smaller bins and with dot sizes corresponding to the number of predictions that fall into each bin. That way, people can see both how many predictions of each type I make and how well calibrated those predictions are. Here’s one of these graphs:
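For anyone who wants to reproduce this style of graph, here is a rough sketch of how it could be built with numpy and matplotlib. The bin count and the dot-size scaling factor are arbitrary illustrative choices, not the values used in my workbook.

```python
# Sketch of a calibration curve where dot area reflects how many
# predictions landed in each bin. Bin count and size scaling are arbitrary.
import numpy as np
import matplotlib.pyplot as plt

def calibration_curve(predicted, actual, n_bins=20):
    """Plot mean predicted probability vs. observed frequency per bin,
    with dot area proportional to the number of predictions in the bin."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(predicted, edges) - 1, 0, n_bins - 1)

    xs, ys, sizes = [], [], []
    for b in range(n_bins):
        mask = bin_idx == b
        count = mask.sum()
        if count == 0:
            continue
        xs.append(predicted[mask].mean())   # average predicted probability in the bin
        ys.append(actual[mask].mean())      # observed frequency of the outcome in the bin
        sizes.append(20 * count)            # dot area scales with bin population

    plt.scatter(xs, ys, s=sizes, alpha=0.6)
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.xlabel("Predicted probability")
    plt.ylabel("Observed frequency")
    plt.legend()
    plt.show()
```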
Although this is a pretty cool format to look at, there’s not much actionable information I can pull out of it. So I also made simple scatter plots of the predicted probabilities versus the actual results; here’s one of those graphs:
This isn’t very visually intuitive, but the linear regression tells me both how much predictive power I have (from the R^2 value) and how well calibrated my model is (from the slope of the fitted line). In the graph shown, the R^2 value of 0.30 indicates that my model explains about 30% of the variance in the top 12 seeds after the schedule is released but before any matches have been played. The slope of 0.82 indicates that, if I wanted to be well calibrated for top 12 predictions at this point in the event, I would need to mean-revert all of the top 12 predictions by about 18%.
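Here is a rough sketch of how those two numbers can be computed, and how the implied mean-reversion would be applied. The use of scipy and the choice to revert toward the average prediction are illustrative assumptions on my part, not necessarily exactly what the workbook does.

```python
# Sketch: regress binary outcomes on predicted probabilities to get the
# calibration slope and R^2, then shrink predictions by (1 - slope).
import numpy as np
from scipy.stats import linregress

def calibration_regression(predicted, actual):
    """predicted: probabilities in [0, 1]; actual: 0/1 outcomes.
    A slope below 1 suggests overconfidence; R^2 measures predictive power."""
    fit = linregress(predicted, actual)
    return fit.slope, fit.rvalue ** 2

def mean_revert(predicted, slope):
    """Shrink each prediction toward the average prediction by (1 - slope),
    e.g. a slope of 0.82 implies shrinking deviations by about 18%."""
    predicted = np.asarray(predicted, dtype=float)
    base_rate = predicted.mean()
    return base_rate + slope * (predicted - base_rate)
```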
The results are pretty much what I would have expected. I’m aware that my model has a small but appreciable overconfidence problem, particularly early in the event; you can see from the graphs, though, that the calibration gets better and better as the event goes on. The main spots in my model where I need to inject more uncertainty are:
The non-WLT RPs. I threw together the predictions for these really quickly this year and didn’t get around to properly calibrating them, so I expect them to be pretty overconfident.
The second- and third-order RP sorts. I don’t have any uncertainty in these predictions at all; they are completely deterministic (that is, there is no variation from one simulation to another).
Simulations running “cold” instead of “hot”. I don’t account for the possibility that a team will consistently perform better or worse than their metrics would suggest. This is fine near the end of the event, since all of the teams’ skill levels are pretty well known by then, but early in the event it is a poorer assumption (a rough sketch of what accounting for this could look like follows this list).
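On that last point, the basic idea would be to draw one persistent offset per team per simulated event rather than fresh noise every match, so a team can run consistently hot or cold within a single simulation. The spread value and the simple additive performance model below are placeholder assumptions for illustration, not how my simulator actually works internally.

```python
# Sketch: a persistent per-team "hot/cold" offset drawn once per simulated
# event, plus independent match-to-match noise on top of it.
import numpy as np

rng = np.random.default_rng()

def simulate_event(team_means, team_sds, n_matches, hot_cold_sd=0.5):
    """team_means/team_sds: per-team metric estimates (e.g. predicted contribution).
    Returns an (n_teams, n_matches) array of simulated per-match performances."""
    team_means = np.asarray(team_means, dtype=float)
    team_sds = np.asarray(team_sds, dtype=float)
    n_teams = len(team_means)
    # One persistent offset per team, held constant for this entire simulated event.
    hot_cold = rng.normal(0.0, hot_cold_sd, size=n_teams)
    # Independent match-to-match noise layered on top of the persistent offset.
    match_noise = rng.normal(0.0, 1.0, size=(n_teams, n_matches)) * team_sds[:, None]
    return team_means[:, None] + hot_cold[:, None] + match_noise
```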
I could throw a blanket mean-reversion on my predictions to fix the overconfidence, but I’d prefer to fix the above problems first; the overconfidence should then largely be taken care of by the added uncertainty those changes introduce.
I’m not really sure how to respond to accusations that I personally am overconfident, since my model isn’t particularly overconfident in a statistical sense. Any major uncertainties I have are either already built into my model or listed above with plans to improve them in the future. I don’t see posting my predictions as any different from predicting that a coin flip has a 50% chance of landing heads. Sure, if you knew more about the coin, the surrounding air, the person flipping the coin, and the local gravitational field, you could make a better prediction than 50%, but that doesn’t mean the 50% prediction is bad. It’s just a well calibrated prediction that recognizes what it knows and doesn’t know. That’s all I try to do with my predictions: maximize predictive power given the known bounds of the metrics available to me. I’m nowhere remotely close to predicting everything perfectly, but I’m doing the best I can with the information I have available.