I noticed some people were requesting DPR scores. While the meaning of a DPR number isn't as straightforward as OPR, I think we may be able to improve the OPR calculation by taking it into account. If a team tends to play heavy defense, the teams they play against shouldn't have their OPR reduced when they play below average. Plus I love linear algebra so this gave me an excuse to use it.
<complex math warning>
So here's the equation:
Code:
( M -N ) ( p ) = ( s_t )
( N -M ) ( d ) = ( s_o )
Where (n = total # of teams):
M = n x n matrix with M(ij) = # of times i played with j. M(ii) = # of times i played. (same as M from before)
N = n x n matrix with N(ij) = # of times i played against j. N(ii) = 0.
p = n x 1 column vector of OPRs. p(i) = OPR for team i. (same as p from before)
d = n x 1 column vector of DPRs. d(i) = DPR for team i.
s_t = n x 1 column vector of total scores. s_t(i) = Sum of all of team i's match scores. (same as s from before)
s_o = n x 1 column vector of total opponent scores. s_o(i) = Sum of all of team i's opponents' match scores.
In other words, the first n equations add up all the offense played by team i and its allies, subtract all the defense played by team i's opponents, and equate the result with team i's total score. The second n equations sum all the offense played by team i's opponents, subtract all the defense played by team i and its allies, and equate the result with team i's opponents' total score.
We can rewrite the equation as Ax = y where A = (M -N; N -M), x = (p; d), and y = (s_t; s_o).
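For anyone who wants to try this at home, here is a minimal sketch of how M, N, s_t and s_o could be tallied from a match list. The function name `build_system` and the match tuple layout `(red_teams, blue_teams, red_score, blue_score)` are assumptions made for illustration, not the format of any csv posted in this thread.

```python
def build_system(matches, teams):
    """Tally M, N, s_t and s_o as defined above.

    matches: list of (red_teams, blue_teams, red_score, blue_score) tuples
             (a hypothetical layout chosen for this sketch).
    teams:   list of team identifiers; its order fixes the row/column order.
    """
    index = {team: k for k, team in enumerate(teams)}
    n = len(teams)
    M = [[0] * n for _ in range(n)]
    N = [[0] * n for _ in range(n)]
    s_t = [0] * n
    s_o = [0] * n
    for red, blue, red_score, blue_score in matches:
        # Visit the match once from each alliance's point of view.
        for allies, opponents, own, other in (
            (red, blue, red_score, blue_score),
            (blue, red, blue_score, red_score),
        ):
            for team in allies:
                i = index[team]
                s_t[i] += own      # team i's own alliance score
                s_o[i] += other    # score of team i's opponents
                for ally in allies:
                    # ally == team is included, so M[i][i] counts matches played
                    M[i][index[ally]] += 1
                for opp in opponents:
                    N[i][index[opp]] += 1
    return M, N, s_t, s_o
```

With three-team alliances, every row of M and every row of N should sum to 3 * M(ii), which is a quick sanity check on the tally.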
In the data set I used, there are 2 isolated sets of teams that played no matches with teams outside their set: the Israeli and non-Israeli teams. We can separate these sets and write an equation for each one, and I think it's easier if we do:
Code:
A_1 * x_1 = y_1
A_2 * x_2 = y_2
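The isolated sets don't have to be found by hand. As a rough sketch (not necessarily how the split was done here), teams can be grouped with a small union-find over the match list. The name `isolated_sets` and the match tuple layout `(red_teams, blue_teams, ...)` are assumptions for illustration.

```python
def isolated_sets(matches):
    """Group teams into sets that never played with or against an outsider.

    matches: iterable of tuples whose first two items are the red and blue
             team lists (a hypothetical layout chosen for this sketch).
    """
    parent = {}

    def find(team):
        # Return the representative of team's set, registering new teams.
        parent.setdefault(team, team)
        while parent[team] != team:
            parent[team] = parent[parent[team]]  # path halving
            team = parent[team]
        return team

    for red, blue, *_ in matches:
        in_match = list(red) + list(blue)
        root = find(in_match[0])
        for team in in_match[1:]:
            # Everyone in the same match ends up in the same set.
            parent[find(team)] = root

    groups = {}
    for team in list(parent):
        groups.setdefault(find(team), []).append(team)
    return sorted(sorted(group) for group in groups.values())
```

Each returned group then gets its own A, x and y.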
We can solve each equation completely independently, so I'm just going to focus on one of them and call it Ax = y. A has a null space of dimension 1, so it's not invertible. Every predicted alliance score is three OPRs minus three DPRs, so we can increase all the OPRs and DPRs by the same amount without having any effect on the scores; that means the null space is the span of x = (1 1 1 ... 1). We can get a unique solution by adding one more equation. I (somewhat arbitrarily) chose the equation by saying: if there were no defense, scores would be 25% higher. In equation form that is:
Code:
M(11)*p(1) + M(22)*p(2) + ... + M(nn)*p(n) = 1.25 * (sum(s_t) / 3)
or
Code:
( E 0 ) * (p; d) = 1.25 * (sum(s_t) / 3)
E = ( M(11) M(22) ... M(nn) )
You can tack the last equation onto the end of A like so:
Code:
A = (M -N; N -M; E 0)
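Then append 1.25 * (sum(s_t) / 3) to the end of y and just ask Matlab to solve Ax = y for you (A is no longer square, so this is a least-squares solve). Or replace any one row of A with ( E 0 ), and the matching entry of y with 1.25 * (sum(s_t) / 3), so A becomes invertible and solve x = A_inv * y.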
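For anyone without Matlab, here is a minimal stdlib-only sketch of the second option (replace one row, then solve the square system). It is not the code used for the results below. The function name `solve_opr_dpr`, the choice to replace the last row, and the `boost` and `alliance_size` parameters (defaulting to the 1.25 and 3 used above) are all assumptions for illustration. It uses exact fractions and plain Gauss-Jordan elimination, which is fine for a sketch but slow for a full season of data.

```python
from fractions import Fraction


def solve_opr_dpr(M, N, s_t, s_o, boost=Fraction(5, 4), alliance_size=3):
    """Solve (M -N; N -M)(p; d) = (s_t; s_o) with one row swapped for (E 0).

    M, N are n x n lists of lists; s_t, s_o are length-n lists.
    Returns (p, d) as lists of Fractions.
    """
    n = len(M)
    # Stack the two block rows into a 2n x 2n matrix A and a 2n vector y.
    A = [[Fraction(M[i][j]) for j in range(n)]
         + [Fraction(-N[i][j]) for j in range(n)] for i in range(n)]
    A += [[Fraction(N[i][j]) for j in range(n)]
          + [Fraction(-M[i][j]) for j in range(n)] for i in range(n)]
    y = [Fraction(v) for v in list(s_t) + list(s_o)]

    # Replace the last row (an arbitrary choice) with the extra equation
    # sum_i M(ii) * p(i) = boost * sum(s_t) / alliance_size.
    A[-1] = [Fraction(M[i][i]) for i in range(n)] + [Fraction(0)] * n
    y[-1] = boost * sum(Fraction(v) for v in s_t) / alliance_size

    # Gauss-Jordan elimination with row swaps.
    size = 2 * n
    for col in range(size):
        pivot = next((r for r in range(col, size) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system is singular; more than one isolated set?")
        A[col], A[pivot] = A[pivot], A[col]
        y[col], y[pivot] = y[pivot], y[col]
        for r in range(size):
            if r != col and A[r][col] != 0:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
                y[r] -= factor * y[col]

    x = [y[i] / A[i][i] for i in range(size)]
    return x[:n], x[n:]
```

If the elimination reports a singular system, the null space is bigger than dimension 1, which usually means the teams haven't been split into isolated sets yet or there aren't enough matches.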
</complex math warning>
I ran this against the first csv Greg posted and here are the results (top 50, ordered by OPR):
Code:
Team OPR DPR OPR + DPR
1114 71.6377 0.9474 72.5852
1124 53.2773 15.2694 68.5467
2056 51.7991 6.8767 58.6759
217 51.6703 13.4995 65.1698
233 51.5346 11.4470 62.9816
39 51.0832 4.8484 55.9316
330 50.1059 0.1292 50.2351
525 50.0129 1.2725 51.2855
175 47.6712 11.6710 59.3422
40 46.3240 10.1761 56.5001
1731 46.1172 -0.0044 46.1128
987 45.9656 7.0006 52.9662
103 45.0985 10.9291 56.0276
191 44.6783 12.1649 56.8432
79 44.1938 6.1087 50.3025
1024 43.9389 5.5522 49.4911
16 43.2490 6.0931 49.3421
67 43.2308 11.1761 54.4070
20 42.3096 6.3370 48.6466
469 41.9469 -5.4304 36.5165
494 41.2950 -1.9009 39.3941
1806 40.8038 5.2194 46.0232
365 40.6742 2.5704 43.2446
47 40.3067 -1.6335 38.6732
148 39.3002 9.0662 48.3663
1493 39.0307 -0.9984 38.0323
383 38.9932 5.5749 44.5681
1625 38.8912 5.1900 44.0813
1519 38.8147 6.3316 45.1463
1126 38.6616 0.8032 39.4648
141 38.6570 7.6568 46.3137
1718 38.4372 3.7425 42.1797
663 38.2419 14.4349 52.6767
126 37.8160 6.3778 44.1938
121 37.7131 12.7949 50.5080
195 37.7043 -1.5850 36.1192
1477 37.4595 10.8694 48.3289
368 37.1072 -2.7458 34.3614
25 37.0417 -3.1295 33.9121
1717 36.5859 6.3136 42.8995
71 36.1253 8.4642 44.5895
836 35.9330 6.4528 42.3859
93 35.5093 1.6895 37.1989
69 35.3987 2.3330 37.7317
61 35.3566 5.7965 41.1532
968 34.7835 4.2148 38.9984
2345 34.6020 1.6196 36.2216
1086 34.5104 10.6218 45.1321
58 34.4129 6.7595 41.1725
935 34.3941 7.7033 42.0974
Compare this with Guy's results from the same dataset. The results are fairly similar, but there's definitely some movement in the rankings. Make of it what you will.
Personally, I don't think it tells you a whole lot to know a team's DPR. The two OPRs, though, tell you slightly different things about a team. The old OPR tries to tell you how much a team actually scored each match. The new OPR tries to tell you how much a team could have scored each match if there were no defense. They are both potentially useful numbers.
Finally, knowing both OPR and DPR does allow you to better predict the score of a match. If you define error as:
Code:
error = actual_red_score - ( p(red1) + p(red2) + p(red3) - d(blue1) - d(blue2) - d(blue3) )
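then both methods are the least squares solution for their respective vector spaces (method #1 is the special case where every DPR is 0), but method #2 has a lower MSE (mean squared error) and ME (mean absolute error) because it has a bigger vector space to fit in, with a DPR per team on top of the OPR, so less information is lost.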
Method #1 MSE = 245.0446, ME = 12.306
Method #2 MSE = 180.8867, ME = 10.514
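To make the comparison concrete, here is a small sketch of how MSE and ME could be computed from the error definition above. It is not the script that produced the numbers above. The function name `prediction_errors`, the match tuple layout, and the use of dicts keyed by team are assumptions for illustration, and the blue alliance's error is assumed to be defined symmetrically to the red one. For method #1, pass a `d` that is 0 for every team.

```python
def prediction_errors(matches, p, d):
    """Return (MSE, ME) over every alliance score in matches.

    matches: list of (red_teams, blue_teams, red_score, blue_score) tuples
             (a hypothetical layout chosen for this sketch).
    p, d:    dicts mapping team -> OPR and team -> DPR.
    """
    errors = []
    for red, blue, red_score, blue_score in matches:
        for allies, opponents, actual in (
            (red, blue, red_score),
            (blue, red, blue_score),
        ):
            # Predicted score: allies' offense minus opponents' defense.
            predicted = sum(p[t] for t in allies) - sum(d[t] for t in opponents)
            errors.append(actual - predicted)
    mse = sum(e * e for e in errors) / len(errors)
    me = sum(abs(e) for e in errors) / len(errors)
    return mse, me
```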
So it's better at predicting past scores. Is it better at predicting future scores? I guess we'll see.