paper: Miscellaneous Statistics Projects

I am investing some of my effort now into improving calculated contribution (OPR) predictions. The first thing I really want to figure out is what the best “seed” OPR is for a team going into an event. We have many choices for how to calculate this seed value from past results, so I’d like to narrow my options down before building a formal model. To accomplish this, I investigated the years 2011-2014 to find which choice of seed correlates best with teams’ calculated contributions at the championship. I used this point in the season because it is the spot where we have the most data on teams before the season is over. The best seeds should have the strongest correlation with a team’s championship OPR, and by using correlations instead of building a model, I can ignore linear offsets in seed values.
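This offset-invariance is the main reason correlations are convenient here: Pearson’s r is unchanged by any positive linear transformation of either variable. A minimal sketch of that property, with made-up numbers (nothing below comes from the actual dataset):

```python
import numpy as np

# Hypothetical seed values and championship OPRs for five teams.
seed_opr = np.array([25.0, 40.0, 33.0, 55.0, 18.0])
champ_opr = np.array([30.0, 45.0, 35.0, 60.0, 22.0])

# Pearson's r is invariant under positive linear rescaling/offsets of
# either variable, so a seed that runs systematically high or low scores
# exactly the same as an unbiased version of itself.
r_raw = np.corrcoef(seed_opr, champ_opr)[0, 1]
r_shifted = np.corrcoef(2.0 * seed_opr + 10.0, champ_opr)[0, 1]
assert np.isclose(r_raw, r_shifted)
print(f"r = {r_raw:.3f}")
```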

When a team has only a single event in a season, my choices of metrics for generating a seed value are basically restricted to either their OPR at their only event or their pre-champs world OPR. There is potential for using normalized OPRs from previous seasons as seeds, but I chose not to investigate this since team performance varies quite drastically from year to year.

When a team has attended 2+ events, I have many more options for metrics that can be used to determine their seed value:
The team’s OPR at their first event of the season
The team’s OPR at their second event of the season
The team’s OPR at their second-to-last pre-champs event of the season
The team’s OPR at their last pre-champs event of the season
The team’s highest pre-champs OPR of the season
The team’s second highest pre-champs OPR of the season
The team’s lowest pre-champs OPR of the season
The team’s pre-champs world OPR

Many of these metrics will overlap for a given team (for example, a team with exactly two pre-champs events has identical first and second-to-last events, and identical second and last events), but they are all distinct metrics.
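To make the definitions concrete, here is a rough sketch of how the per-event metrics relate to a team’s list of event OPRs (the function and data layout are illustrative on my part, not taken from the workbook):

```python
def seed_metrics(event_oprs):
    """Derive the candidate seed metrics from a team's pre-champs event
    OPRs, listed in chronological order (assumes at least two events).
    World OPR is omitted here: it is computed from the full season match
    matrix, not from per-event OPRs.
    """
    ranked = sorted(event_oprs, reverse=True)
    return {
        "first": event_oprs[0],
        "second": event_oprs[1],
        "second_to_last": event_oprs[-2],
        "last": event_oprs[-1],
        "highest": ranked[0],
        "second_highest": ranked[1],
        "lowest": ranked[-1],
    }

# For a team with exactly two events, first == second_to_last and
# second == last, so the metrics overlap but remain distinct definitions.
print(seed_metrics([28.5, 41.2]))
```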

For each of these seed options, I found the correlation coefficient between the metric and championship OPR across all championship-attending teams. I did this for each year 2011-2014, as well as an average correlation over all four years (I didn’t weight by number of teams since there were ~400 champ teams in each of these years). The results are summarized in this table, and can also be found in the “summary” tab of the “OPR seed investigator.xlsx” spreadsheet. Raw data can be found in the year sheets of the workbook as well.
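In code form, the per-year correlations and the unweighted four-year average might look something like the sketch below (the data layout is an assumption of mine; the actual calculations live in the spreadsheet):

```python
import numpy as np

def metric_correlations(data_by_year, metric):
    """For one seed metric, compute Pearson's r against championship OPR
    in each year, plus the unweighted mean over all years.

    `data_by_year` maps a year to (seed_dict, champ_opr_array) for that
    year's championship-attending teams; names are illustrative.
    """
    yearly = {
        year: np.corrcoef(seeds[metric], champ)[0, 1]
        for year, (seeds, champ) in data_by_year.items()
    }
    return yearly, float(np.mean(list(yearly.values())))
```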

As can be seen in the table, the metrics rank roughly as follows, from strongest to weakest correlation:
highest > world > last > second >>> second highest > second-to-last >>> lowest > first
Going into this analysis, I had anticipated that the top three seed metrics would be highest, world, and last, but my expected ordering probably would have been something like last > highest >> world.

I was actually hoping that there would be a clearer difference between these top three metrics so that I could throw out one or two of the options going into my model creation. I had always been pretty skeptical of world OPR; it seemed to me that, although it has a larger sample size than conventional single-event OPR, it would perform worse since it incorporates early-season matches that may not reflect teams accurately by the time champs rolls around. However, world OPR correlated better with champs performance than my previous metric of choice, last-event OPR, so my fears about world OPR are probably not very justified.

I also tried combining metrics with a weighted average. The optimal weightings I found, along with their correlation coefficients, can also be found in the “summary” tab. For example, when combining first OPR and second OPR, the optimal weighted average was 0.3*(first OPR) + 0.7*(second OPR). I did not find much of interest in this effort: highest OPR is consistently the best predictor of champs OPR no matter which other metric it is paired with. Some of the optimal weightings are mildly interesting though, particularly the negative weightings given to poor metrics when paired with world OPR.
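As a sketch of how such an optimal weighting can be found (the spreadsheet uses its own solver; this simple grid search is just illustrative and assumes the two metrics and champ OPRs are aligned numpy arrays):

```python
import numpy as np

def best_weight(metric_a, metric_b, champ_opr, step=0.05):
    """Grid-search the weight w that maximizes the correlation of
    w*metric_a + (1 - w)*metric_b with championship OPR. The search
    range extends past [0, 1] so that one metric can receive a
    negative weighting."""
    weights = np.arange(-0.5, 1.5 + step, step)
    corrs = [
        np.corrcoef(w * metric_a + (1 - w) * metric_b, champ_opr)[0, 1]
        for w in weights
    ]
    best = int(np.argmax(corrs))
    return weights[best], corrs[best]
```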

Moving forward, I will probably have to try all three of highest OPR, world OPR, and last-event OPR when building a predictive model. I will also have to determine the best linear offsets to use for these metrics, and determine whether the best seed metrics remain the same throughout the season, since this effort only looked at a single point in the season.