#1 | Caleb Sykes (inkling16) | 05-11-2018, 07:11 PM
paper: Miscellaneous Statistics Projects 2018

Thread created automatically to discuss a document in CD-Media.

Miscellaneous Statistics Projects 2018 by Caleb Sykes

This whitepaper is a continuation of my miscellaneous statistics projects whitepapers from last year. For those not familiar, here is a summary of why I do this:
I frequently work on small projects that I don't believe merit entire threads on their own, so I have decided to upload them here and make a post about them in an existing thread. I also generally want my whitepapers to have instructions sheets so that anyone can pick them up and understand them. However, I don't want to bother with this for my smaller projects.

I have decided to make a new thread this year in order to not overload my other thread with too many whitepapers, and because I will be analyzing 2018 specific things here. As always, feel free to provide feedback of any kind, including pointing out flaws in my data or my analysis.
#2 | Caleb Sykes (inkling16) | 05-11-2018, 07:24 PM

My first book for this year is an investigation of what my Elo model might look like if I tried to incorporate non-WLT RPs. This idea was spawned by posts 41-44 in this thread. This workbook has identical data to my normal FRC Elo book for 2005-2015, but for 2016-2018 I make adjustments to incorporate the other ranking points. I was unable to find a nice data set to use for 2012 coop RPs; if someone knows of one, let me know and I might try to do this same analysis for that year. For each year in 2016-2018, there were two additional non-WLT ranking points available in each quals match. In 2016 and 2017, the tasks required to achieve these ranking points were also worth bonus points in playoff matches.

The concern that spawned this effort is that, in quals matches, many/most teams are not strictly trying to win, but rather are trying to maximize the number of ranking points they earn. Without some kind of RP correction, this means that teams who are good at earning these RPs might be under-rated by Elo, since they might be more likely to win matches if they weren't expending effort on the RPs. Additionally, since playoffs had different scoring structures than quals in 2016-2017, the teams that do well earning these RPs in quals will presumably be even more competitive in playoffs due to the bonuses.

My approach for this effort was to find the optimal value to assign to the qualification RPs, and to add this value to teams' winning margins for the quals matches in which they achieve this RP. I wanted to find the optimal value for each of the six types of RPs between 2016-2018. Although there are other approaches to incorporating RP strength into an all-encompassing team rating, I always prefer methods that can be used to maximize predictive power over methods that can't, since I can then justify why I chose the values I did instead of just guessing how much different things are worth. There are a few different metrics I could have chosen to optimize, but I settled on overall playoff predictive power over the full 2016-2018 period. I chose to optimize for playoff performance since in playoff matches teams are almost strictly trying to maximize their winning margin (or win). This contrasts with quals matches, where teams may have other considerations, potentially including going for RPs or showing off so they are more likely to be selected. I also chose to maximize predictive power over the full 2016-2018 period instead of each year separately, since Elo ratings carry over somewhat between years; the optimal value for 2016 RPs when maximizing predictive power for 2016 alone will be a bit different than the optimal value when maximizing over all three years, since the latter accounts for how well the rating carries over between years.
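Here is a rough sketch of that search (not the code I actually use): a candidate bonus value adjusts quals winning margins by the non-WLT RPs each alliance earned, the Elo model is re-run, and the candidate that minimizes the playoff Brier score is kept. The run_elo_model hook is just a placeholder for the full Elo pipeline.

Code:
def adjusted_margin(red_score, blue_score, red_rps, blue_rps, bonus):
    """Red winning margin with each non-WLT RP worth `bonus` extra points."""
    return (red_score - blue_score) + bonus * (red_rps - blue_rps)

def playoff_brier(win_probs, outcomes):
    """Mean squared error of predicted red-win probabilities vs. results (1 = red win)."""
    return sum((p - o) ** 2 for p, o in zip(win_probs, outcomes)) / len(outcomes)

def best_bonus(candidates, run_elo_model):
    # run_elo_model(bonus) would re-run Elo over all years, using adjusted_margin
    # for 2016-2018 quals, and return (playoff win probs, playoff outcomes).
    briers = {bonus: playoff_brier(*run_elo_model(bonus)) for bonus in candidates}
    return min(briers, key=briers.get)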

Here were the optimal (±20% or 1 point, whichever is greater) values I found for each of the 6 RPs, measured in units of their respective year's points:
2016 Teleop Defenses Breached: 2
2016 Teleop Tower Captured: 8
2017 kPa Ranking Point Achieved: 80
2017 Rotor Ranking Point Achieved: 40
2018 Auto Quest Ranking Point: 7
2018 Face The Boss Ranking Point: 45
All of these values are positive, which indicates that on average teams that get these RPs in quals are more likely to do better in playoffs than similar teams who do not. You can see the effects of these adjustments in the attached book by looking at the "Adjusted Red winning margin" column. This value should be equal to the red score minus the blue score, with additional additions/subtractions depending on the RPs both alliances received. For example, in 2018 Great Northern qm 31, blue wins 305 to 288, so red's unadjusted winning margin is -17. Red got the auto RP and blue got the climb RP in this match though, so after accounting for these, red's adjusted winning margin is -17+7-45=-55.

Here are my probably BS rationalizations of why these RPs have the values above:
2016 Teleop Defenses Breached: It really doesn't surprise me that this value is so low. Teams tended to deal with the defenses in quals in much the same way they dealt with them during playoffs. Although there was a 20 point bonus in playoffs for the breach, any alliance worth their salt was going to get this anyway, so a team that got this RP consistently in quals wasn't set up to do that much better in playoffs than a similar team who got this RP less consistently.
2016 Teleop Tower Captured: I don't want to analyze this RP too much since its definition changed for championships, an event where teams were getting this RP much more frequently than at a standard regional/district. I wouldn't have expected this value to exceed 10, since it generally took at least a pair of competent scorers to get 8 or 10 balls, and the 20 point playoff bonus divided by 2 is 10. I don't think teams would have played much differently in quals if this RP had not existed, except maybe being more conservative in the last 30 seconds to make sure everyone surrounded the tower.
2017 kPa Ranking Point Achieved: This is by far the RP that had the most value. There are a couple of reasons I think it is so high. To start, there was a 20 point playoff bonus for this task that was unavailable in quals, and unlike the teleop tower captured in 2016, getting this RP was generally an individual effort, so a team that gets this RP consistently in quals should be worth at least 20 points more in playoffs than a similar team that does not. On top of this, because there were so few ways to score additional points in playoffs, the 40-70 fuel points scored in playoffs are in a sense more valuable than the points scored with other methods. There were diminishing returns on gear scoring after getting the third rotor, and no value at all in scoring gears after the fourth rotor, and there's not much teams could do to get more climbing points except potentially lining up a bit earlier to avoid mistakes. Fuel points though were unbounded, so a team that consistently got the kPa RP in quals was going to be so much better off in playoffs just because they could get 60-90 points that were unachievable for a non-fuel opposing alliance.
2017 Rotor Ranking Point Achieved: Similar to 2016 teleop tower captured, I think most of the value of this RP comes from the playoff bonus of 100 points. This task required at least two competent robots to perform, which means I would have expected the value of this RP to be bounded above by 50. I don't think the strategy changed much in playoffs due to this RP, since the 40 points + RP available in quals is comparably lucrative to the 140 points available in playoffs.
2018 Auto Quest Ranking Point: I expected this RP to be worth around 5 points and I was correct. Teams likely opted for higher risk and higher average reward autonomous modes in playoffs than they did in quals because they could afford to have one robot miss out on the crossing or be okay with not getting the switch if they could get one more cube on the scale. This wasn't a huge effect but it does exist.
2018 Face The Boss Ranking Point: I expected the value of this RP to be around 20 points because there is no playoff bonus for this task and I didn't think the opportunity cost was particularly high, although certainly higher than the auto RP. This was the value that most surprised me at 45 points. In my original analysis, I was thinking of the opportunity cost of going for the climb RP, not the extra value of a team implied by said team achieving the climb RP. I think the distinction is important because relatively few teams were able to consistently achieve the climb RP, and the teams that did so were generally very competitive teams. This means that in the playoffs they can afford to spend a few more seconds scoring elsewhere on the field before going for the climb, and can climb much faster on average than teams that were not consistently getting the climb RP in quals. If I had thought about it more from this perspective, I might have predicted this RP to be worth around 30 points instead of 20. The remaining 15 still surprises me though; one possible explanation is that this value is over-rated since we haven't had the 2019 season yet, so the model doesn't properly account for teams' future success.


Overall, this was an interesting analysis, but I will almost certainly not be incorporating a change like this into my Elo ratings moving forward for the following reasons:
The adjustments made here do not provide enough predictive power for me to consider them worthwhile. These adjustments improved the Brier score for playoff matches in 2016-2018 by about 0.001. I would have needed it to be at least 0.003 to consider it worthwhile, since I am reasonably sure there exist other improvements to my model which can provide this much or more improvement.
We have no guarantee that future games will have similar RP incentives. I try hard to keep the number of assumptions in my model to a minimum. I do this because I want my model to be valuable even when we get thrown a curveball for some aspect of the game like we did this year for time-dependent scoring. Assuming we will continue getting games with this RP structure is just not a very good assumption in my opinion.
There isn't a clear way to find good values to use for the RPs during the season in some years. I am back-fitting data right now so I have a good sample size of quals matches where teams get the RPs. However, if we get a game like 2017 again, where we didn't get above a 2% success rate for either RP until week 4, there just wouldn't be a good sample size of matches to use to find good values until late in the season.
#3 | Caleb Sykes (inkling16) | 05-14-2018, 07:31 PM

Continuing the investigation of autonomous mobility I did last year, I thought it would be interesting to look at auto mobility rates for every year since 2016. I have attached a book titled "2016-2018_successful_auto_movement" which provides a summary of this investigation. For each team that competed in 2018, it shows their matches played, successful mobilities, and success rates for each year 2016-2018. It also contains these metrics in aggregate over all of these years, as well as a reference to the first match in which the team missed auto mobility. I counted both "Crossed" and "Reached" in 2016 as successful mobilities.

Note that this is using data provided by the TBA API, which pulls directly from FIRST, so there are certainly some matches where teams are incorrectly credited with (or denied) auto mobility. There are many possible reasons for this, but one that I identified last year was that referees at some events were entering mobilities based on team positions and not team numbers.
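For reference, here is roughly how one could pull this data from the TBA API (a sketch, not my actual code; the score_breakdown field names below are from memory and should be checked against the TBA API documentation for each season):

Code:
import requests

TBA = "https://www.thebluealliance.com/api/v3"
HEADERS = {"X-TBA-Auth-Key": "YOUR_TBA_KEY"}  # placeholder key

# year -> (per-robot breakdown key pattern, values counted as successful mobility)
AUTO_FIELDS = {
    2016: ("robot{}Auto", {"Reached", "Crossed"}),
    2017: ("robot{}Auto", {"Mobility"}),
    2018: ("autoRobot{}", {"AutoRun"}),
}

def mobility_record(team, year):
    """Return (successful mobilities, matches with breakdown data) for one team-year."""
    matches = requests.get(f"{TBA}/team/frc{team}/matches/{year}", headers=HEADERS).json()
    pattern, success_values = AUTO_FIELDS[year]
    played = succeeded = 0
    for m in matches:
        if not m.get("score_breakdown"):
            continue
        for color in ("red", "blue"):
            keys = m["alliances"][color]["team_keys"]
            if f"frc{team}" in keys:
                slot = keys.index(f"frc{team}") + 1   # station 1, 2, or 3
                played += 1
                if m["score_breakdown"][color].get(pattern.format(slot)) in success_values:
                    succeeded += 1
    return succeeded, played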

Here's a fun graph of 2017 auto mobility rates versus 2018 auto mobility rates:


Here are the teams that have competed each year 2016-2018, and have never missed auto mobility points according to this dataset:
Code:
team	matches
1506	149
5554	112
4550	86
4050	77
5031	69
3061	61
6175	59
6026	55
1178	48
3293	45
4462	39
6054	36
6167	35
5119	35
6155	35
5508	34
884	32
3511	32
4630	30
4728	29
6164	28
1230	28
2264	28
4054	27
4648	26
5171	26
The only other per-team metric available is climb rates, so I might do an investigation of that in the future. Climbing was wildly different each year though, whereas just moving forward a few feet is essentially the same each year, so it wouldn't be as easy a comparison.
#4 | Caleb Sykes (inkling16) | 06-08-2018, 07:39 PM

I'm looking to make predicted schedules soon to use for a couple of projects. I would like the capability to do this even before the total number of matches at the event is known. With this in mind, I have attached a simple book which looks at, for every 2018 event, the number of teams at the event versus the number of qual matches/team.

Here is a plot for all events:


And here is a plot for regional events:


I'll likely just set district events (including district champs) to 12 matches/team, champs events to 10 matches/team (although I may change this depending on the structure of champs and the game in future years), and regional events according to the formula matches/team = 17.0 - (0.13 * teams).
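As a sketch, that heuristic amounts to something like the following (the event type names are just placeholders):

Code:
def predicted_matches_per_team(event_type, num_teams):
    """Rough matches-per-team guess before the schedule is known."""
    if event_type == "district":      # includes district champs
        return 12
    if event_type == "champs":
        return 10
    # regionals: fitted line from the attached workbook
    return round(17.0 - 0.13 * num_teams)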
#5 | Caleb Sykes (inkling16) | 07-15-2018, 04:01 PM

I just uploaded a workbook called "2018_schedule_strengths_v1".

I'm planning to make a new thread soon to discuss "strength of schedule" in FRC, so I made this book to hopefully inform that discussion. I labeled it v1 because I imagine I'll need to go back and calculate other metrics as the discussion in my upcoming thread progresses.

Essentially, all I did was run my event simulator at each event twice, once before the schedule was released and once after. By looking at each team's ranking distribution change between these two runs, we can pinpoint exactly what effect the schedule had. At least, that's the idea. I have included summary statistics of each team's ranking predictions for both of these periods, as well as the changes between them.

Additionally, I have what is my first pass at a "strength of schedule" metric. I calculate this by finding the probability that the given team will seed better with the actual schedule than they would have with a random schedule. So a "schedule strength" of 0% means that you will never seed higher with the existing schedule than you would have with a random schedule, and a "schedule strength" of 100% means that you are guaranteed to seed higher with the actual schedule than you would have with a random schedule.

What I like about this metric:
It compares the given schedule against other hypothetical schedules
It is customized for each team, that is, it compares your hypothetical results with a random schedule against your hypothetical results with the given schedule. I'm not the biggest fan of team-independent metrics since, for example, a schedule full of buddy-climb-capable partners is amazing for a team without a buddy climber, but just alright for a team that has a good buddy climber, and team-independent metrics would have to give the schedule a single score for both of these teams.
It's on an interpretable scale (0% to 100%), and each value has a concrete probabilistic meaning
It's able to be calculated before the event occurs (I don't like metrics that require hindsight unless maybe we want to use SoS as a tiebreaker for something)

What I don't like about this metric:
It requires a full event simulator to calculate
Teams that are basically guaranteed to seed first (like 1678 at their later regionals) will inevitably be shown to have bad schedules, since there is no schedule that would give them much of a better chance of seeding higher than their expectation (1st). Switching to greater than or equal ranks just flips the problem to high scores instead of low scores for these scenarios
The average value is 48.1% instead of 50%


Anyway, feel free to use this as proof of how bad your schedule was. The worst schedules this year according to my metric were (excluding the expected 1 seeds):
2096 on Hopper
4065 at Orlando
6459 on Roebling

And the best schedules were:
2220 on Archimedes
5104 on Newton
1806 on Turing

#6 | AriMB (Ari Meles-Braverman, FRC 5987 mentor) | 07-15-2018, 06:38 PM

Quote: Originally Posted by Caleb Sykes
I just uploaded a workbook called "2018_schedule_strengths_v1". [...]
How do you calculate the schedule strength metric? I'm assuming it's at least partially based on the changes due to schedule numbers, but I don't see any direct correlation (team A having a larger positive change in average rank than team B does not seem to imply that team A's schedule strength will be lower than team B's, or vice versa).
#7 | Caleb Sykes (inkling16) | 07-15-2018, 10:24 PM

Quote: Originally Posted by AriMB
How do you calculate the schedule strength metric? [...]
The schedule strength is the probability that the given team will seed higher with the actual schedule than they would have with a random schedule according to the simulator. This is found by the following formula:

schedule strength = sum over all ranks r of [ P_actual(rank = r) * sum over all ranks q > r of P_random(rank = q) ]

where P_actual is the simulated rank distribution after the schedule is released, P_random is the distribution from before the schedule is released, and q > r means a numerically worse rank than r. Changing the second summation to run over all q >= r would provide a very similar metric, just that it would err on the high side instead of the low side.

For example, say that before the schedule is released a team is predicted to have a 20% chance of seeding first, 30% second, and 50% third. After the schedule is released, they have a 5% chance of seeding first, 25% second, and 70% third. Their schedule strength would then be:
0.05*(0.3+0.5)+0.25*(0.5)=0.04+0.125=0.165 or 16.5%.
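As a quick sketch, the calculation looks like this (using the example numbers above):

Code:
def schedule_strength(p_actual, p_random):
    """P(rank with the actual schedule is numerically lower, i.e. better, than with a random one)."""
    return sum(
        p_actual[r] * sum(p for q, p in p_random.items() if q > r)
        for r in p_actual
    )

p_random = {1: 0.20, 2: 0.30, 3: 0.50}   # before the schedule was released
p_actual = {1: 0.05, 2: 0.25, 3: 0.70}   # after the schedule was released
print(schedule_strength(p_actual, p_random))   # 0.165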



Looks like a pretty strong correlation to me. Excepting the teams which are heavy favorites to seed first, it seems to be doing its job.
#8 | Caleb Sykes (inkling16) | 07-25-2018, 10:38 PM

As I'm delving more into schedule analysis, I decided I should probably do a proper comparison of the Cheesy Arena schedules, which are what I have been using instead of generating my own schedules, and the actual schedules used this year, which were generated with the IdleLoop algorithm at each event. I am unfamiliar with exactly how this algorithm is run at each event, but I have heard that the scorekeeper generally will select a minimum match gap based on a number of factors, and then run the algorithm at the "best" (5 million attempts) setting. How exactly the minimum gap is selected is unknown to me, and likely varies across events.

For all 174 events which had qual matches, I compared the actual schedule for the event with the Cheesy Arena schedule for the same number of teams and the same number of matches per team. The metrics I looked at to determine how good the schedules were:
number of surrogates used
red/blue balance for each team
gap between matches for all consecutive pairs of matches for all teams
repeated partners for all teams
repeated opponents for all teams

In some cases, these criteria will interfere with each other; most notably, a large minimum match gap will tend to cause more duplicate partners and opponents than a small minimum match gap. I also made the decision to treat surrogates the same as normal teams. Although for some metrics it might make sense to ignore surrogates, it seems to me that, if you were up against 254 in 4 different matches, it would be of little consolation to you if 254 was a surrogate in one of them.
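For what it's worth, here's roughly how the repeat-partner, repeat-opponent, and match-gap counts can be computed from a schedule (a sketch assuming a schedule is just a list of (red teams, blue teams) pairs, with surrogates treated like any other team):

Code:
from collections import Counter
from itertools import combinations

def partner_pair_counts(schedule):
    pairs = Counter()
    for red, blue in schedule:
        for alliance in (red, blue):
            for a, b in combinations(sorted(alliance), 2):
                pairs[(a, b)] += 1
    return pairs          # counts > 1 are repeated partners

def opponent_pair_counts(schedule):
    pairs = Counter()
    for red, blue in schedule:
        for r in red:
            for b in blue:
                pairs[tuple(sorted((r, b)))] += 1
    return pairs          # counts > 1 are repeated opponents

def match_gap_counts(schedule):
    last_seen = {}
    gaps = Counter()
    for i, (red, blue) in enumerate(schedule):
        for team in (*red, *blue):
            if team in last_seen:
                gaps[i - last_seen[team]] += 1   # gaps[1] = back-to-back matches
            last_seen[team] = i
    return gaps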

The results show that overall, the Cheesy Arena schedules and the IdleLoop schedules are very similar. There were 11 events this season for which the Cheesy Arena schedules were as good as or better than the IdleLoop schedules across all categories according to my metrics, and 6 events for which the IdleLoop schedules were as good as or better than the Cheesy Arena schedules across all metrics. Here is a breakdown by category:
Surrogates: The number of surrogates used was identical for all events; this isn't surprising since surrogate usage should be bounded above by 6 for any combination of teams and matches/team.
Red/blue balance: There were 80 events for which the red/blue balance was better achieved by the Cheesy Arena schedules than the IdleLoop schedules, and 64 for the reverse.
Match gap: There were 85 events for which the match gap was better in the Cheesy Arena schedules than the IdleLoop schedules, and 89 for the reverse.
Duplicate partners: There were 14 events for which the Cheesy Arena schedules were better at avoiding duplicated partners than the IdleLoop schedules, and 9 for the reverse.
Duplicate opponents: There were 70 events for which the Cheesy Arena schedules were better at avoiding duplicated opponents than the IdleLoop schedules, and 88 for the reverse.

Overall, it seems that the Cheesy Arena schedules might be slightly better at red/blue balance and avoiding duplicate partners, whereas the IdleLoop schedules might be slightly better at having larger match gaps and avoiding duplicate opponents, although the sample sizes might be too small to draw too much of a conclusion.

The full results can be seen in my IdleLoop_and_Cheesy_Arena_comparison.xlsx file. The "summary" page shows a quick look at which schedule was "better" according to my metric for each category. The "raw data" sheet shows a more in-depth look at each event. I will walk through the results for the Great Northern Regional here in order to explain them. You can stop reading now if you're not interested in a deeper dive into the data:
Columns 2-5 show information about the event. Columns 8-36 show data on the actual (Idle Loop) schedule used for this event. Columns 58-86 show data on the Cheesy schedule with the specified number of teams and matches/team. Columns 108-136 show the difference between the actual schedule and the cheesy schedule.
Starting in column 8, we see that the actual schedule had 5 surrogates used, and in column 58 we see the cheesy schedule also had 5 surrogates used. The difference between these in column 108 is 0.
In columns 9-17, we see the red/blue balance for teams in the actual schedule. There are 0 teams at this event with an equal number of red and blue matches, which is unsurprising given that each team had 11 matches. There are 40 teams that had either one more red match than blue matches or one more blue match than red matches, which is as balanced as you can get with 11 matches. There were 5 teams (the surrogate teams) that had a red/blue difference of 2 matches. Finally, there were 2 teams that had a 3 match red/blue difference. The Cheesy schedule is similar, except that 4 of the surrogate teams had 0 red/blue difference (columns 59-67). This difference can be seen in columns 109-117, with the +4 value in the 2 match difference column indicating that the actual schedule had 4 more instances of a 2 match red/blue difference than the Cheesy schedule did. Since the worst red/blue balance is worse for the actual schedule than the Cheesy schedule, I mark in the "summary" sheet that the Cheesy Arena schedule is better for red/blue balance.
Moving on in a similar way to the next category, gap between matches, we see that with the actual schedule, no teams had back-to-back matches or consecutive matches with only a single match between them. There were, however, 38 occurrences of teams that had consecutive matches that were 3 matches apart. In the Cheesy schedule, there were only 34 of these occurrences. This gives the Cheesy schedule the edge in "match gap" in my summary.
For pairs of partners, both schedules had 1044 occurrences of teams having another partner only once. There were no duplicated partners in either schedule. 1044 is to be expected since there are 87 matches, 2 alliances per match, 3 teams per alliance, and 2 partners per team, and 87*2*3*2 = 1044. Neither schedule is superior by this metric.
Finally, looking at opponent pair occurrences, we see that the actual schedule had 2 instances where a team had to face another team in 3 separate matches (2500 facing 7257 and 7257 facing 2500). In contrast, the cheesy schedule has 6 such occurrences. This gives the Idle Loop schedule the edge in my summary sheet.

Anyway, let me know if you see any flaws in my analysis or have any questions.
#9 | Caleb Sykes (inkling16) | 07-28-2018, 01:07 PM

In response to requests for more model validation, I decided to pull all of the top 4, top 8, and top 12 predictions my current simulator would have made on the week 3-4 events at 4 different points in the event: before the schedule was released, after the schedule was released, after 1/3 of qual matches had been played, and after 2/3 of qual matches had been played. The results can be seen in my "week3-4_calibration_results.xlsx" book. This book contains the raw data, a sheet that has calibration results, and a sheet with a bunch of graphs showing these results.

I'm still toying around with how best to represent the calibration results. I've used this format in the past, but it is limited by arbitrary bin sizes and the inability to show how many predictions of each type I make. One idea I had was to use my normal calibration curve format, but with smaller bins, and with dot sizes corresponding to the number of points that fall into each bin. This way, people can see how many predictions of each type I make, as well as how well calibrated those predictions are. Here's one of these graphs:


Although this is a pretty cool format to look at, there's not much actionable information I can pull out of it. So I also made simple scatter plots of the predicted probabilities versus actual results, here's one of those graphs:


This isn't very visually intuitive, but the linear regression tells me both how much predictive power I have (according to the R^2 value) and how well calibrated my model is (looking at the slope of the line). In the shown graph, the R^2 value of 0.30 indicates that my model can explain about 30% of the variance in the top 12 seeds after the schedule is released, but before any matches have been played. The slope of 0.82 indicates that, if I wanted to be well calibrated for top 12 predictions, I would need to mean-revert all of the top 12 predictions at this point in the event by about 18%.
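As a rough sketch, the regression and the corresponding mean-reversion look something like this (assuming predicted is an array of predicted probabilities and outcomes is the matching 0/1 results array):

Code:
import numpy as np

def calibration_check(predicted, outcomes):
    """Regress 0/1 outcomes on predicted probabilities; a slope < 1 suggests overconfidence."""
    slope, intercept = np.polyfit(predicted, outcomes, 1)
    r2 = np.corrcoef(predicted, outcomes)[0, 1] ** 2
    return slope, intercept, r2

def mean_revert(p, slope, base_rate):
    """Shrink a prediction toward the base rate, e.g. slope ~0.82 -> revert by ~18%."""
    return base_rate + slope * (p - base_rate)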

The results are pretty much what I would have expected. I'm aware that my model has a small but appreciable overconfidence problem, particularly early in the event; you can see from the graphs though that the calibration gets better and better as the event goes on. The main spots in my model where I need to inject more uncertainty are:
The non-WLT RPs: I threw together the predictions for these really quickly this year and didn't get around to any proper calibration of them, so I expect them to be pretty overconfident.
The second and third order RP sorts. I don't have any uncertainty in these predictions at all, they are completely deterministic (that is, no variation from one simulation to another).
Simulations running "cold" instead of "hot". I don't have any accounting for the possibility that a team will consistently perform better or worse than their metrics would suggest. This is fine near the end of the event, since all of the teams' skill levels are pretty well known, but early in the event this is a poorer assumption.

I could throw a blanket mean-reversion on my predictions to fix the over-confidence, but I'd prefer to fix the above problems first; the over-confidence should then largely be taken care of by the added uncertainty these changes introduce.

Not really sure how to respond to accusations that I personally am overconfident, since my model isn't particularly overconfident in a statistical sense. Any major uncertainties I have are either already introduced into my model, or listed above and planned to be improved upon in the future. I don't see my posting of my predictions as any different from predicting a coin flip has a 50% chance of heads. Sure, if you knew more properties of the coin, the surrounding air, the person flipping the coin, and the local gravitational field, you could make a prediction that is better than the 50% prediction, but that doesn't mean the 50% prediction is bad. It's just a well calibrated prediction that recognizes what it knows and doesn't know. That's all I try to do with my predictions, to maximize predictive power given the known bounds of the metrics available to me. I'm nowhere remotely close to predicting everything perfectly, but I'm doing the best I can with the information I have available.
#10 | Caleb Sykes (inkling16) | 08-06-2018, 11:28 AM

I'm going to try soon to add some normalization between years for my Elo ratings. Presently, I find start-of-season Elos by taking 70% of the previous season's end-of-season Elo plus 30% of the end-of-season Elo from two seasons ago, and then reverting this sum toward 1550 by 20%. My concern with this method is that I don't think it's fair to directly sum Elos from different seasons, since the Elo distributions vary so greatly year to year based on the game. If we had the same game every year, this wouldn't be a problem.
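In code form, the current carryover is essentially (a minimal sketch of the rule described above):

Code:
def start_of_season_elo(last_season_end, two_seasons_ago_end, mean=1550, reversion=0.20):
    """70/30 blend of the last two end-of-season Elos, reverted 20% toward 1550."""
    carried = 0.7 * last_season_end + 0.3 * two_seasons_ago_end
    return carried + reversion * (mean - carried)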

To start, I measured the average, stdev, skew, and kurtosis for the end of season Elo distributions in each year. The results are shown in this table:


The average hovers right around 1500 each year, but this is due to how I designed my Elo ratings, and doesn't actually tell us much. Some actual measure of "true skill" would probably have higher averages in recent years, since most would agree the average robot in 2018 is much better than the average robot was in 2002.
The stdevs move around each year, likely due to the game structure. 2018 had the highest stdev on record by a pretty solid margin. I have previously speculated that this could be due to the snowballing scoring structure of Power Up.
The skewness is interesting. For those of you unfamiliar with skewness, a positive skew indicates a larger positive "tail" on the distribution than negative "tail". Every year on record has had a positive skew, which indicates that there are always more "outlier" good teams than "outlier" bad teams. Some years have had much higher skews than others though. For example, 2015 had an incredibly positive skew, which means there were a large number of very dominant teams. 2017, in contrast, had one of the smallest skews on record. This is probably due to the severely limited scoring opportunities for the strong teams after the climb and 3 rotors, as well as the fact that the teams that lacked climbing ability were a severe hindrance to their alliances. The difference in skews between 2015 and 2017 can be seen in histograms of their Elo distributions. Notice how much longer the 2015 positive tail is than the 2017 one.



I also threw in kurtosis, which is a rough measure of how "outlier-y" or "taily" a distribution is. Kurtosis tracks very closely with skew every year. This means that the "outlier" teams driving the high kurtosis in some years are "good team" outliers and not "bad team" outliers. A high kurtosis with low skew would instead indicate lots of both good team and bad team outliers. Plots of stdev vs skew and skew vs kurtosis can be seen below.
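For reference, the summary stats in the table can be computed along these lines (a sketch; elos_by_year is assumed to map year -> array of end-of-season Elos):

Code:
import numpy as np
from scipy.stats import skew, kurtosis

def elo_distribution_stats(elos_by_year):
    stats = {}
    for year, elos in elos_by_year.items():
        elos = np.asarray(elos, dtype=float)
        stats[year] = {
            "average": elos.mean(),
            "stdev": elos.std(ddof=1),
            "skew": skew(elos),
            "kurtosis": kurtosis(elos),   # excess kurtosis (normal distribution = 0)
        }
    return stats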



Next, I'll be trying to normalize end-of-season Elos so that I can get better start-of-season Elos. We've now had two years in a row of games that have low skew/kurtosis, which means that without adjustment the 2019 start-of-season Elos will also have low skew/kurtosis even though the 2019 game likely will not. It'll all come down to predictive power though: if I can get enough of a predictive power increase I'll add it in, otherwise I won't.
#11 | Caleb Sykes (inkling16) | 08-11-2018, 04:33 PM

I'm currently working on analyzing the awesome timeseries data from TBA. I'll have plenty more to come, but I'm at a point where I got some really sweet graphs, so I thought I'd share, and describe a rough outline of my live model at the same time.

I am currently analyzing the ~1500 matches that have the best timeseries data. It's possible that I'll go back later and clean up the messier data, but I wanted to focus my early analysis on data I could have high trust in. What I'm currently working on is a way to predict the match winner in real-time based on this data. Here is a Brier score graph of my current model:


The first five seconds of the match just use my pre-match Elo win probability, but from then on, I begin incorporating the real-time scoring (in conjunction with the pre-match Elo prediction) to create win probabilities. The Brier score is basically steady for the first 5 seconds (when I'm not incorporating match data) but also from ~19 to ~23 seconds, which is probably because teams have by this point scored their first set of cubes and are picking up the second set. Also, note that even at t = 150 seconds, the Brier score is not zero because the actual final score can differ from the last score shown on the screen.

I mentioned that I incorporate Elo ratings into the predictions, here is a graph showing how much weight I give to Elo versus live match data at each second:


This graph and the following ones were created by tuning my prediction model, so the values you see are the most predictive ones I found. After the first 5 seconds, the importance of Elo drops sharply down to ~65% by the end of auto, where it holds roughly steady for the same 19-23 second interval described above. This makes sense, since if there isn't much scoring, we wouldn't expect the live scoring to increase in importance much. After that, the Elo weight decays roughly exponentially down to 0.

The general form of my model (excluding Elo) is red win probability = 1 - 1/(1 + 10^((current red winning margin)/scale)), where "scale" is how much of a lead red would need in order to have a 10/11 ≈ 91% chance of winning at that point in the match. Let's call this "scale" the "big lead" amount so as not to confuse it with the scale on the field. If a team is up by 40 at a point in the match where the "big lead" value is 40, that team has a 91% chance of winning, but if they are up by 80 (two big leads), that team has a 99% chance of winning. Obviously, what is considered a big lead will vary over the match, so here is a graph showing that change over time:


I excluded the first 5 seconds since the values there are indeterminate because I don't incorporate match data then. The next few seconds of auto are also a bit weird, probably since not much happens at this time in most matches, and even in the matches where things do happen, a "big lead" of 30+ points is not very intuitive since there is no way a team could even have this much of a lead this early (excluding penalties). By the end of auto though, we see the big lead value settle at around 20, which sounds about right: teams who are up by 20 after auto are probably feeling pretty good since they probably have control of both the switch and the scale. After auto, what is considered a big lead increases steadily until peaking at around 60 points at around 60 seconds in. This seems to make sense, because a team up by 20 after auto should be up by 60 about 40 seconds later if they control the scale the whole time and nothing else changes. After this, the "big lead" holds steady until 110 seconds, when it sharply drops but then recovers at ~122 seconds. I don't know the explanation for this, but my gut tells me it has to do with climbing positioning. After that, the "big lead" drops until ending at 29 points. This means that, if you are ahead by 30 points on the screen at the end of the match, you are about 90% likely to win in the final score, and if you are up by 60 points, you are about 99% likely to win.
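Written out as a sketch (treating the Elo blend as a simple weighted average, with big_lead and elo_weight taken from the tuned per-second curves shown in the graphs):

Code:
def live_red_win_prob(adjusted_margin, big_lead, elo_prob, elo_weight):
    """Blend the pre-match Elo probability with the live-margin probability."""
    live_prob = 1 - 1 / (1 + 10 ** (adjusted_margin / big_lead))
    return elo_weight * elo_prob + (1 - elo_weight) * live_prob

# Up by one "big lead" with no Elo weight -> 10/11 ~ 91%
print(live_red_win_prob(40, 40, elo_prob=0.5, elo_weight=0.0))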

I mentioned that the form of my model uses the red winning margin, but that's not precisely true. In fact, I use an adjusted red winning margin where I account for ownership of the scale and of the switch. Basically, I found how much "value" to give to switch and scale ownership at each point in the match. What I mean by "value" is this: if red is down by X points but controls the scale, what is the value of X such that red and blue have an equal chance of winning the match? Here is a graph of value versus time for the scale:


Again, skipping the first 10 seconds of auto, we see scale ownership to be worth ~30 points after auto. It then drops in value to a minimum of 25 points at 23 seconds. This drop might be due to the initial scuffle for the scale. By 45 seconds, the scale has peaked in value at 49 points, and then it has a jittery drop until the end of the match. The same dip seen in "big lead" also appears in scale value at around 120 seconds. Interestingly, scale value does not go to 0 at the end of the match, but rather ends at 8 points. Perhaps scale ownership provides some indication of climb success?

Here is a similar graph for switch value:


Most of the same trends as in the previous graph also appear in this one. The biggest difference though is that switch value actually does go to 0 by the end of the match.
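As a sketch of how these value curves feed into the adjusted margin (treating ownership as a net +1/0/-1 from red's perspective for each element, which is a simplification of however the scoring data actually encodes it):

Code:
def adjusted_red_margin(raw_margin, scale_owned_by, switch_owned_by, scale_value, switch_value):
    """scale_owned_by / switch_owned_by: +1 if red, -1 if blue, 0 if neither."""
    return raw_margin + scale_owned_by * scale_value + switch_owned_by * switch_value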


Let me know if you have any questions. I'll have more to come soon, including win probability graphs and match "excitement" and "comeback/upset" scores.