OPR Programming Challenge
Given the dataset Michael Hill posted here (http://www.chiefdelphi.com/forums/showpost.php?p=1403503&postcount=17), calculate OPR, DPR, and CCWM; and create a report sorted by team number.
The goal is to see who can go from raw input data to finished output report the fastest.
@ Michael Hill: please refrain since you and I have discussed this.
by "fastest" do you mean the first response or fastest operational efficiency?
Nathan Streeter
11-10-2014, 19:14
I believe operational efficiency is the goal; however, Ether can speak for himself on that...
Michael Hill
11-10-2014, 19:44
*Actively Refraining* ...if that's possible
saikiranra
05-12-2014, 01:40
I'm about 2 months late, but I think I have generated an accurate report. Some of the OPR and CCWM values are identical to the 1114/2834 databases, while others are a few values off. I don't know whether this is an input error, matrix inversion error, or general code bug.
I'm about 2 months late, but I think I have generated an accurate report.
Thanks for posting. Maybe this will revive interest in the thread.
Did you time how long it took the computer to go from raw input data to finished output report? That's the challenge.
Some of the OPR and CCWM values are identical to the 1114/2834 databases, while others are a few values off. I don't know whether this is an input error, matrix inversion error, or general code bug.
http://www.chiefdelphi.com/forums/showpost.php?p=1403829&postcount=7
saikiranra
05-12-2014, 20:59
Did you time how long it took the computer to go from raw input data to finished output report? That's the challenge.
http://www.chiefdelphi.com/forums/showpost.php?p=1403829&postcount=7
Oops, sorry for not posting it. Over 10 runs, it averaged 0.823266 seconds to generate the output.
As for the matrix inversion, I didn't realize that the Cholesky solving library I used didn't find the inverse; rather, it found the Cholesky factorization of the matrix.
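For reference, scipy's Cholesky routines (for example) return a factorization that is meant to be passed to a triangular solver, not the inverse itself. A minimal illustration with a small synthetic system:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

# A small symmetric positive-definite system N x = d (synthetic data)
N = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
d = np.array([1.0, 2.0, 3.0])

c_and_lower = cho_factor(N)          # the Cholesky factor of N, NOT inv(N)
x = cho_solve(c_and_lower, d)        # two triangular solves using that factor

# Same answer as the slower explicit inverse:
print(np.allclose(x, np.linalg.inv(N) @ d))   # True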
Over 10 runs, it averaged 0.823266 seconds to generate the output.
That's pretty good.
The challenge includes the total time to do all these steps*:
- read Michael's 8-column fixed-width data file
- convert that data into a set of linear equations
- compute the solution vectors OPR, DPR, & CCWM
- sort the solution vector elements in ascending order by team number
- write the output report to a disk file.
Please report the times for each of the above steps.
What hardware did you run this on? What software did you use?
*if you used different steps, please explain and provide timings
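If it helps, one simple pattern for collecting the per-step timings in Python (the step functions named in the comments are placeholders, not anyone's actual code):

import time

def timed(label, fn, *args):
    # Run fn(*args), print the elapsed wall time, and return the result.
    t0 = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - t0:.4f} s")
    return result

# Hypothetical usage -- substitute your own step functions:
# matches = timed("read file", read_matches, "scores.dat")
# A, b = timed("build equations", build_normal_equations, matches)
# opr, dpr, ccwm = timed("solve", solve, A, b)
# timed("write report", write_report, "report.txt", opr, dpr, ccwm)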
saikiranra
05-12-2014, 23:56
Out of 50 runs, these are the average times to:
1. Read from file to create data structures: 0.3636 seconds
2. Make sorted team lists and Bs in Ax=B: 0.0019 seconds
3. Make matrix A in Ax=B: 0.0675 seconds
4. Create Cholesky factorization and find OPR and CCWM values (DPR = OPR - CCWM): 0.2793 seconds
5. Write to file: 0.0242 seconds
(Overhead of recording these times: 0.0128 seconds)
Total: 0.7494 seconds
I wrote this in Python, using the numpy and scipy libraries to manipulate matrices. It ran on a Windows 8 machine with an Intel i7-4700MQ at 2.4 GHz and 12 GB of RAM.
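For readers who want to try the same kind of pipeline, here is a rough numpy/scipy sketch of the approach described above (illustrative only, not the poster's actual script; it assumes the 8-column whitespace-delimited format r1 r2 r3 b1 b2 b3 redScore blueScore and uses placeholder file names):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Read the 8-column file: r1 r2 r3 b1 b2 b3 redScore blueScore
data = np.loadtxt("scores.dat", dtype=np.int64)
red_teams, blue_teams = data[:, 0:3], data[:, 3:6]
red_score, blue_score = data[:, 6], data[:, 7]

# Map team numbers to matrix indices (np.unique also sorts them)
teams = np.unique(data[:, 0:6])
idx = {t: i for i, t in enumerate(teams)}
n = len(teams)

# Build the Normal Equations directly:
#   N[i][j]   = number of alliances in which teams i and j played together
#   b_off[i]  = sum of the scores of the alliances containing team i
#   b_ccwm[i] = sum of (own score - opponent score) for those alliances
N = np.zeros((n, n))
b_off = np.zeros(n)
b_ccwm = np.zeros(n)
for alliances, own, opp in ((red_teams, red_score, blue_score),
                            (blue_teams, blue_score, red_score)):
    for row, s_own, s_opp in zip(alliances, own, opp):
        ix = [idx[t] for t in row]
        N[np.ix_(ix, ix)] += 1
        b_off[ix] += s_own
        b_ccwm[ix] += s_own - s_opp

# One Cholesky factorization, reused for both right-hand sides
c = cho_factor(N)
opr = cho_solve(c, b_off)
ccwm = cho_solve(c, b_ccwm)
dpr = opr - ccwm

# Write the report sorted by team number (teams is already sorted)
with open("report.txt", "w") as f:
    f.write(f"{'team':>5} {'OPR':>8} {'DPR':>8} {'CCWM':>8}\n")
    for t, o, dd, cc in zip(teams, opr, dpr, ccwm):
        f.write(f"{t:5d} {o:8.2f} {dd:8.2f} {cc:8.2f}\n")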
Out of 50 runs ... Total: 0.7494 seconds ... I wrote this in Python, using the numpy and scipy libraries to manipulate matrices.
Nice work.
For those of you reading along, the timing above is for a dataset containing 2696 teams, 8921 matches, 17842 alliance scores.
@ Michael Hill: feel free to jump in here now if you want :-)
Michael Hill
06-12-2014, 16:35
Fair enough. It's been a while since I've messed around with this, but I dug up the MATLAB script I originally did this with. I got it down to 0.518726 seconds (average over 10 runs).
There are some optimizations in there. For example, using sparse matrices lets MATLAB do a much quicker job of the left divides (which is not the same as inverting the matrix). MATLAB is actually fairly smart and decides which form of factorization to use based on the input data, so it (generally) will choose the fastest method. Sparse is very fast.
tic
alldata = dlmread('C:\Users\Michael Hill\Documents\Projects\OPR\scores.dat');
teams = [alldata(:,1:3); alldata(:,4:6)];
reddiff = [alldata(:,7)-alldata(:,8)];
bluediff = [alldata(:,8)-alldata(:,7)];
scores = [alldata(:,7); alldata(:,8)];
diffscores = [reddiff; bluediff];
a = size(teams);
A = zeros(max(max(teams)));
B = zeros(max(max(teams)),2);
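% Accumulate the Normal Equations directly: A(i,j) counts how many alliances
% teams i and j shared; B column 1 sums each team's alliance scores, and
% column 2 sums each team's alliance score margins.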
for i = 1:a(1)
A(teams(i,1),teams(i,1)) = A(teams(i,1),teams(i,1)) + 1;
A(teams(i,1),teams(i,2)) = A(teams(i,1),teams(i,2)) + 1;
A(teams(i,1),teams(i,3)) = A(teams(i,1),teams(i,3)) + 1;
A(teams(i,2),teams(i,1)) = A(teams(i,2),teams(i,1)) + 1;
A(teams(i,2),teams(i,2)) = A(teams(i,2),teams(i,2)) + 1;
A(teams(i,2),teams(i,3)) = A(teams(i,2),teams(i,3)) + 1;
A(teams(i,3),teams(i,1)) = A(teams(i,3),teams(i,1)) + 1;
A(teams(i,3),teams(i,2)) = A(teams(i,3),teams(i,2)) + 1;
A(teams(i,3),teams(i,3)) = A(teams(i,3),teams(i,3)) + 1;
B(teams(i,1),1) = B(teams(i,1),1) + scores(i);
B(teams(i,2),1) = B(teams(i,2),1) + scores(i);
B(teams(i,3),1) = B(teams(i,3),1) + scores(i);
B(teams(i,1),2) = B(teams(i,1),2) + diffscores(i);
B(teams(i,2),2) = B(teams(i,2),2) + diffscores(i);
B(teams(i,3),2) = B(teams(i,3),2) + diffscores(i);
end
%gpuA = gpuArray(A);
%gpuB = gpuArray(B);
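% Convert to sparse and drop the all-zero rows/columns left by team numbers
% that never appear in the data.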
A = sparse(A);
B = sparse(B);
A(~any(A,2), :) = [];
A(:, ~any(A,1)) = [];
B(~any(B,2), :) = [];
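% One left-divide per right-hand side: X = OPR, Y = CCWM, D = X - Y = DPR.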
X = A\B(:,1);
Y = A\B(:,2);
D = X-Y;
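% Build a sorted, de-duplicated team list and assemble the output rows.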
tdupe = sort([teams(:,1);teams(:,2);teams(:,3)]);
[junk, index] = unique(tdupe,'first');
teamList = tdupe(sort(index));
numTeams = size(teamList);
out = zeros(numTeams(1),4);
for i = 1:size(teamList)
out(i,1) = teamList(i);
out(i,2) = gather(X(i));
out(i,3) = gather(Y(i));
out(i,4) = gather(D(i));
end
% expectedRed = zeros(a(1)/2,1);
% expectedBlue = zeros(a(1)/2,1);
% for i=1:a(1)/2
% index1 = find(out(:,1)==alldata(i,1));
% index2 = find(out(:,1)==alldata(i,2));
% index3 = find(out(:,1)==alldata(i,3));
% index4 = find(out(:,1)==alldata(i,4));
% index5 = find(out(:,1)==alldata(i,5));
% index6 = find(out(:,1)==alldata(i,6));
% opr1 = out(index1,2);
% opr2 = out(index2,2);
% opr3 = out(index3,2);
% opr4 = out(index4,2);
% opr5 = out(index5,2);
% opr6 = out(index6,2);
% expectedRed(i) = opr1+opr2+opr3;
% expectedBlue(i) = opr4+opr5+opr6;
% end
%
% diffExpRed = expectedRed - alldata(:,7);
% diffExpBlue = expectedBlue - alldata(:,8);
%
% diffExp = [diffExpBlue; diffExpRed];
%
% C = zeros(gather(max(max(teams))));
% for i = 1:a(1)
% C(teams(i,1)) = C(teams(i,1)) + diffExp(i);
% C(teams(i,2)) = C(teams(i,2)) + diffExp(i);
% C(teams(i,3)) = C(teams(i,3)) + diffExp(i);
% end
% gpuC = gpuArray(C);
% gpuC(~any(gpuC,2), :) = [];
% Z = gpuA\gpuC(:,1);
% for i = 1:size(teamList)
% out(i,5) = gather(Z(i));
% end
disp(out);
toc
There are some remnants in the comments of me messing with gpuArrays in MATLAB, and I'm too lazy to get rid of them. I have a Quadro K1100M graphics card in my laptop, so I was experimenting with using its CUDA cores. It turns out that MATLAB has not implemented sparse matrices for CUDA processing, so they have to be full matrices, which actually slows it down fairly significantly compared with using sparse matrices on the CPU. For the record, my CPU is an Intel i7-4702HQ @ 2.2 GHz with 16 GB of RAM (my computer is a Dell Precision M3800).
EDIT: I actually just realized there was some code in there that slows it down. The gather calls used when building the out matrix are more remnants of the GPU computing that I didn't take out; they copy variables out of GPU memory into RAM. After removing them, total computation time dropped to about 0.48 seconds.
...After removing them, total computation time dropped to about 0.48 seconds.
Here's another optimization you can try. Replace this code:
X = A\B(:,1);
Y = A\B(:,2);
... with this:
XY = A\B;
... then you'll have only one left-divide, and XY will be a two-column matrix whose columns are X and Y.
Michael Hill
06-12-2014, 19:13
Made it a bit faster (by about 0.02 seconds). I expected it to be a lot more, but surprisingly, reading the file into MATLAB is the longest operation at 0.31 seconds (that's reading in the file and creating my A and B matrices). It surprises me that it takes that long with the SSD I have.
reading the file into MATLAB is the longest operation at 0.31 seconds (That's reading in the file and creating my A and B matrices)
Could I talk you into putting separate tic/toc statements around
alldata = dlmread('C:\Users\Michael Hill\Documents\Projects\OPR\scores.dat');
and
for i = 1:a(1)
A(teams(i,1),teams(i,1)) = A(teams(i,1),teams(i,1)) + 1;
A(teams(i,1),teams(i,2)) = A(teams(i,1),teams(i,2)) + 1;
A(teams(i,1),teams(i,3)) = A(teams(i,1),teams(i,3)) + 1;
A(teams(i,2),teams(i,1)) = A(teams(i,2),teams(i,1)) + 1;
A(teams(i,2),teams(i,2)) = A(teams(i,2),teams(i,2)) + 1;
A(teams(i,2),teams(i,3)) = A(teams(i,2),teams(i,3)) + 1;
A(teams(i,3),teams(i,1)) = A(teams(i,3),teams(i,1)) + 1;
A(teams(i,3),teams(i,2)) = A(teams(i,3),teams(i,2)) + 1;
A(teams(i,3),teams(i,3)) = A(teams(i,3),teams(i,3)) + 1;
B(teams(i,1),1) = B(teams(i,1),1) + scores(i);
B(teams(i,2),1) = B(teams(i,2),1) + scores(i);
B(teams(i,3),1) = B(teams(i,3),1) + scores(i);
B(teams(i,1),2) = B(teams(i,1),2) + diffscores(i);
B(teams(i,2),2) = B(teams(i,2),2) + diffscores(i);
B(teams(i,3),2) = B(teams(i,3),2) + diffscores(i);
end
Michael Hill
11-12-2014, 17:15
Sorry it's taken so long; I've been somewhat busy at home the last few days.
The first one is ~0.025 sec and the second (around the for loop) is ~0.235 sec.
...the second (around the for loop) is ~0.235 sec
I wonder what would happen if you wrote that in C and compiled it to a callable routine.
Greg McKaskle
13-12-2014, 21:02
I've been a bit busy, but since the light is at the end of the tunnel, I met with Jim, a mathematician on our team, to see if he would do this with our sparse matrix tools. The first screenshot shows the breakdown of times. The second is the code, written in LV.
He is going to tinker to see if he can find a better way to build the sparse matrix, since most of the time is spent before invoking the solver.
This was timed on a Windows VM with 4 cores running on my MacBook's 2.7 GHz Core i7. Jim was running on a desktop machine I don't have details for, and he wasn't writing the file. His was somewhat faster. My cores are only about 70% utilized, since most of the time isn't spent in the solver.
Greg McKaskle
OK guys.
I just wrote, compiled, and ran a 32-bit single-core native app on an 8-year-old Pentium D machine running 32-bit XP Pro SP3, and timed it using RDTSC instructions embedded in the code.
It took 11.9 milliseconds to read the raw data file (cached in RAM) and generate the alliance score vectors and the sparse design matrix.
Using 16-bit unsigned integers for the team numbers and scores, generating the sparse matrix directly from the raw data, and compiling to native code saves a lot of runtime.
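As an illustration of the "generate the sparse matrix directly from the raw data" idea, here is roughly what it looks like with scipy's sparse types (this is not the attached Delphi code; variable names and file names are placeholders):

import numpy as np
from scipy import sparse

# One row per match: r1 r2 r3 b1 b2 b3 redScore blueScore
data = np.loadtxt("scores.dat", dtype=np.int64)
teams = np.unique(data[:, 0:6])
col = {t: i for i, t in enumerate(teams)}        # team number -> column index
m = data.shape[0]

# Sparse binary design matrix [A]: one row per alliance, a 1 for each of its teams
rows = np.repeat(np.arange(2 * m), 3)            # red alliances first, then blue
cols = np.array([col[t] for t in np.vstack([data[:, 0:3], data[:, 3:6]]).ravel()])
A = sparse.coo_matrix((np.ones(len(rows)), (rows, cols)),
                      shape=(2 * m, len(teams))).tocsr()
b = np.concatenate([data[:, 6], data[:, 7]]).astype(float)   # alliance scores

# The Normal Equations stay sparse: [N] = [A]'[A], [d] = [A]'[b]
N = (A.T @ A).tocsc()
d = A.T @ b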
Here's the Delphi code. I wonder how fast it would run on a modern 64-bit machine. Maybe somebody would like to port it to C and try it.
Based on some PMs I have received, I should clarify a few things.
The posted Delphi code reads the raw 8-column (r1 r2 r3 b1 b2 b3 rs bs) whitespace-delimited alliance scores text data file (from cached RAM) and constructs two matrices [At] and [b]. It takes ~12ms to do this on an 8-year-old Pentium D machine.
- [A]=[At]' is the binary design matrix
- [b] is the matrix of alliance scores (not the team OPR scores)
- [A][x]≈[b] is the overdetermined system of linear equations
- [A][x]≈[b] can be solved with one line of code in Octave (or MATLAB) as follows: [x]=[A]\[b]. [x] will be the matrix whose columns minimize the sums of squares of the corresponding columns of the residual matrix [r]=[b]-[A][x]
- but it is much faster (and acceptably stable and accurate for OPR purposes) to create and solve the Normal Equations [N][x]=[d] instead
- [N] and [d] can be formed from [At] and [b] as follows: [N]=[At][At]' and [d]=[At][b]
- [N][x]=[d] can be solved with one line of code in Octave (or MATLAB) as follows: [x]=[N]\[d]
- solving [N][x]=[d] is faster than solving [A][x]≈[b] because 1) [N] is a smaller matrix than [A], and 2) [N] is symmetric positive definite, so Cholesky factorization can be used
- the computations [A]=[At]', [N]=[At][A], and [d]=[At][b] take about 20 ms on the 8-year-old Pentium D machine
- the Normal Equations solution [x]=[N]\[d] takes about 210 ms on the 8-year-old Pentium D machine, using a single core
- there may be Cholesky factoring algorithms which would permit multiple cores to be used for the factoring
See this post for some additional details:
http://www.chiefdelphi.com/forums/showthread.php?p=1404300#post1404300
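For anyone who wants to sanity-check the equivalence described above, here is a small numpy sketch on synthetic data: the direct least-squares solve and the Normal Equations give the same [x]:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)

# Synthetic overdetermined system [A][x] ~ [b]: 50 equations, 8 unknowns,
# two right-hand-side columns (like the score and score-margin columns)
A = rng.normal(size=(50, 8))
b = rng.normal(size=(50, 2))

# Direct least-squares solve (what [x]=[A]\[b] does in Octave/MATLAB)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Normal Equations: [N]=[A]'[A], [d]=[A]'[b], then solve [N][x]=[d]
N = A.T @ A
d = A.T @ b
x_normal = cho_solve(cho_factor(N), d)           # Cholesky works because [N] is SPD

print(np.allclose(x_lstsq, x_normal))            # True (up to rounding)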
Greg McKaskle
19-12-2014, 22:32
Less busy now, so I met with Jim and we made a few more passes. The attached image shows the top-level LV diagram, and the zip has the code saved in LV 2014. The other image shows the time breakdown for the different portions.
The code runs in around 20ms on my laptop running a VM. It iterates until the residual is about 3 digits after the decimal.
Building the sparse matrix creates the diagonal terms and upper portion independently, which substantially speeds the elimination of duplicate terms. The complete solver was swapped out for one based on conjugate gradient.
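For comparison, here is the same conjugate-gradient idea expressed with scipy's sparse solver (an illustration only, not the LabVIEW code in the zip; the matrix here is synthetic, standing in for the sparse Normal Equations matrix):

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)

# Synthetic sparse symmetric positive-definite system standing in for [N][x] = [d]
M = sparse.random(300, 300, density=0.02, random_state=0, format="csr")
N = M @ M.T + 10.0 * sparse.identity(300)
d = rng.normal(size=300)

# Conjugate gradient starting from the zero vector (scipy's default x0)
x, info = cg(N, d)
print(info)                          # 0 means the iteration converged
print(np.linalg.norm(N @ x - d))     # residual norm of the solution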
The commented-out code loads from disk; the enabled code has the data in RAM, in a constant. Loading from disk adds another 20 ms.
Greg McKaskle
Less busy now, so I met with Jim and we made a few more passes ...
Thanks for posting this Greg. Very impressive performance.
It iterates until the residual is about 3 digits after the decimal.
What did you use for the initial guess for the iterative solver?
Greg McKaskle
20-12-2014, 07:11
The initial guess is the zero vector. It takes 11 iterations to converge.
Greg McKaskle