OPR-computation-related linear algebra problem

Looking for interested parties to crunch the numbers and report how long it takes to solve Nx=d for x with the tools, libraries, and computing platforms you use.

The attached ZIP file contains N and d. N is a symmetric positive definite 2509x2509 square matrix; d is a 2509-element column vector.

Nx=d.ZIP (135 KB)



My linear algebra is very rusty and isn’t part of my day job, so nothing special and I hope I did it right.

The time to invert and multiply is shown on the panel and is about 5.5 seconds plus another 1.something to load them. This is in a VM on a Macbook Pro. It was clearly running on a single CPU. If you are interested I can talk to the math guys on Tuesday.

CD seems to have problems uploading at the moment, so the first few elements are 10.1758, 29.6333, 11.1155.

Greg McKaskle

Using MATLAB 2012a on an Intel Core i7-3615QM:

Using linear equation solver (backslash operator): 0.26977 seconds
Using invert-and-multiply: 2.4433 seconds

N = dlmread('N.dat');
d = dlmread('d.dat');

numIters = 100;
tic;
for i=1:numIters
    r = N \ d;
end
disp(['Linear solver ' num2str(toc/numIters)]);

numIters = 10;
tic;
for i=1:numIters
    r = inv(N) * d;
end
disp(['Invert and multiply ' num2str(toc/numIters)]);
r.dat.txt (17.4 KB)

Wouldn’t it be much less computationally intensive to actually reduce the matrix to reduced row echelon form?

Using Python with Numpy

System:


Ubuntu 12.04 32-bit
Kernel Linux 3.2.0-43-generic-pae
Memory 3.8 GiB
Processor Intel Core 2 Duo T9400 @ 2.53 GHz x 2

Code:


import sys
import numpy
import time
import scipy
import psutil

n_runs = 1000

print ""
print ""
print "Python version %s" % (sys.version)
print "Numpy version %s" % (numpy.__version__)
print "Scipy version %s" % (scipy.__version__)
print "Psutil version %s" % (psutil.__version__)
print ""


N = numpy.loadtxt(open('N.dat'))
d = numpy.loadtxt(open('d.dat'))

data = []
for i in range(1,n_runs+1):
    start = time.time()
    x = numpy.linalg.solve(N,d)
    end = time.time()
    row = [end - start]
    row.extend(psutil.cpu_percent(interval=1,percpu=True))
    s = "\t".join([str(item) for item in row])
    data.append(s)
    
f = open('times.dat','w')
f.write("\n".join(data))
f.close()

x = numpy.linalg.solve(N,d)
print ", ".join([str(f) for f in x])
print ""

Average run time: 10.1 seconds
Standard Deviation: 5.1 seconds

The file output.txt contains versions and the solution for x.
The file runs.txt contains the run data. Note that I was doing other work while this ran in the background, which skews the times. I collected CPU usage data to try to account for this. One interesting note: there are two different clusters of execution times. I believe this is from my laptop throttling the CPU when I unplugged it and ran off battery for a while (if you plot runs over time, you will see three distinct sections where the execution times are consistently higher).

output.txt (37.1 KB)


runs.txt (23.2 KB)
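Since background load and CPU throttling skew a plain average, one option is to report the minimum and median of repeated runs alongside the mean. A minimal Python 3 sketch of such a harness follows; `time_call`, `summarize`, and the stand-in workload are hypothetical names for illustration, not part of the script above.

```python
import statistics
import time


def time_call(func, n_runs=20):
    """Run func() n_runs times and return the list of wall-clock durations."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()   # monotonic, high-resolution timer
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (min, median, mean); outliers from background load inflate the mean most."""
    return (min(durations), statistics.median(durations), statistics.mean(durations))


if __name__ == "__main__":
    # Stand-in workload; swap in e.g. lambda: numpy.linalg.solve(N, d).
    workload = lambda: sum(i * i for i in range(100000))
    best, median, mean = summarize(time_call(workload))
    print("min %.6f s, median %.6f s, mean %.6f s" % (best, median, mean))
```

The minimum is usually the closest estimate of what the hardware can do undisturbed, while a large gap between median and mean hints at interference such as the throttling described above.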



Interestingly, no. Gaussian elimination is O(N^3), which gets ugly really fast. When you get into the realm of hundreds or thousands of elements, there are much better ways to do it, which computational packages like MATLAB take full advantage of. I’ve attached a graph showing the computation times for Gaussian elimination, invert-and-multiply, and direct solve for a variety of matrix sizes. By the time you get to 300 elements, Gaussian elimination is already painfully slow, but even the invert-and-multiply has hardly broken a sweat (less than 0.02 seconds).

This test was run on my 6-year-old Core 2 Duo (T7200 @ 2.00GHz) laptop with MATLAB R2010a. Sometime later this week I’ll see about running the matrix solve on a real computer, maybe one with a little extra horsepower.

sizes = floor(logspace(1, 2.5, 10));
times = zeros(length(sizes), 3);

for s = 1:length(sizes);
  A = rand(sizes(s));
  b = rand(sizes(s), 1);
  
  %% Gaussian elimination
  tic;
  nIters = 1;
  for ii = 1:nIters;
    r = rref([A b]);
    x = r(:, end);
  end
  times(s, 1) = toc / nIters;
  
  %% Invert and multiply
  tic;
  nIters = 50;
  for ii = 1:nIters;
    x2 = inv(A) * b;
  end
  times(s, 2) = toc / nIters;
  
  %% Direct solve in MATLAB
  tic;
  nIters = 50;
  for ii = 1:nIters;
    x3 = A \ b;
  end
  times(s, 3) = toc / nIters;

end

plot(sizes, times, '-x');
xlabel('Matrix size');
ylabel('Computation time [s]');
legend('Gaussian elimination (rref)', 'Invert and multiply', 'Direct solve')

EDIT: It’s been pointed out to me that a matrix inversion is also inherently O(n^3), and so there’s something else at work making it slow. In this case, the catch is that rref() is written in plain MATLAB code (try “edit rref”), while inversion and solving are implemented as highly-optimized C code. Gaussian elimination is not the fastest, but it’s not nearly as bad as I made it out to be.
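To make the O(n^3) point concrete, here is a small Python 3 sketch that solves a system by Gaussian elimination and counts the multiply-subtract updates applied to matrix entries. The function names and the random test problem are made up for illustration; this is plain interpreted code, so it shows the growth in work, not realistic timings.

```python
import random


def gaussian_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Returns (x, ops), where ops counts the multiply-subtract updates applied
    to matrix entries during forward elimination.
    """
    n = len(A)
    A = [row[:] for row in A]   # work on copies
    b = b[:]
    ops = 0
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry in column k up.
        pivot = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        b[k], b[pivot] = b[pivot], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= m * A[k][j]
                ops += 1
            b[i] -= m * b[k]
    # Back substitution on the upper triangle.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x, ops


def random_system(n, seed=0):
    """Hypothetical test problem: random and diagonally dominant, hence nonsingular."""
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        A[i][i] += n
    b = [rng.random() for _ in range(n)]
    return A, b


if __name__ == "__main__":
    for n in (20, 40, 80):
        A, b = random_system(n)
        _, ops = gaussian_solve(A, b)
        print(n, ops)
```

Each doubling of n multiplies the count by roughly eight, which is the cubic growth. Forming an inverse has the same order of cost, so the gap in the plot comes from how the routines are implemented rather than from the asymptotics.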

Thanks to those who pointed this out. Obviously I need to go study some more linear algebra. :o That’s on the schedule for the fall.

computation_time.png



CD let me attach again. I attached the things I intended for the previous post.

As with many of the analysis functions, this calls into a DLL, to the function InvMatrixLUDri_head. So it seems to be using LU decomposition. I think the matrix qualifies as sparse, so that helps with performance.

The direct solver was almost ten seconds.

Greg McKaskle


x.dat.zip (7.81 KB)




Ryan, could you please re-run this, without iterating? I want to eliminate the possibility that Matlab is optimizing out the iteration.

tic;
    r1 = N \ d;
t1 = toc;

% also save r1 to a file here so the computation is not optimized out.

disp(['Linear solver ' num2str(t1)]);


tic;
    r2 = inv(N) * d;
t2 = toc;

% also save r2 to a file here so the computation is not optimized out.

disp(['Invert and multiply ' num2str(t2)]);

Thanks.

PS - can someone with a working Octave installation please run this? Also SciLab and R.

Couple of things.

In a PDE class I took for CFD, we had to solve really large sparse matrices. The trick was to never actually store the entire matrix. However, ours was much more structured and more sparse, so I'm not sure if I can apply something similar in this case.

What accuracy are you looking for? Iterative methods could give much faster results. I think you could pick a tolerance of 1e-1 (inf norm) and be fine for OPRs.

Loading it into my GTX 580 GPU right now to get some values. Will do that with and without the time taken to load it into the GPU memory and back.

I had tried this originally, and the results were consistent with the iterated/averaged result, but I was getting some variation in timing so I wanted to take the average case. Interestingly, the average from the iterated trials was consistently higher than any of the trials running single-shot.

Linear solver 0.19269
Invert and multiply 1.8698

N = dlmread('N.dat');
d = dlmread('d.dat');

tic;
    r1 = N \ d;
t1 = toc;

% also save r1 to a file here so the computation is not optimized out.
dlmwrite('r_solver.dat', r1);

disp(['Linear solver ' num2str(t1)]);


tic;
    r2 = inv(N) * d;
t2 = toc;

% also save r2 to a file here so the computation is not optimized out.
dlmwrite('r_invmult.dat', r2);

disp(['Invert and multiply ' num2str(t2)]);

This matrix is quite small compared to those generally solved in finite elements, CFD, or other common codes. As was mentioned a little bit earlier, the biggest speedup comes from processing everything as sparse matrices.

On my 2.0 GHz Macbook Air running Matlab Student R2012a, I can run:

tic
d = load('d.dat');
N = load('N.dat');
toc
tic
output = N\d;
toc

and get the output:
Elapsed time is 2.768235 seconds. <-- loading files into memory
Elapsed time is 0.404477 seconds. <-- solving the matrix

If I now change the code to:

tic
d = load('d.dat');
N = load('N.dat');
toc
tic
Ns = sparse(N);
toc
tic
output = Ns\d;
toc

With output:
Elapsed time is 2.723927 seconds. <-- load files
Elapsed time is 0.040358 seconds. <-- conversion to sparse
Elapsed time is 0.017368 seconds. <-- solving

There are only 82267 nonzero elements in the N matrix (vs 2509*2509 ~ 6.3 million, so about 1.3% of the entries), so the sparse solve runs much faster: it essentially skips over entries that are zero and so avoids most of the arithmetic a dense solve performs.
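To illustrate what "skipping the zeros" means, here is a Python 3 sketch of compressed sparse row (CSR) storage with a matrix-vector product that touches only the stored entries. The helper names and the 4x4 matrix are made up for illustration; MATLAB's `sparse` uses its own internal layout, so treat this as the idea rather than its implementation.

```python
def dense_to_csr(A):
    """Convert a dense list-of-lists matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))   # marks where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x while visiting only the stored nonzeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y


if __name__ == "__main__":
    A = [[4, 0, 0, 1],
         [0, 3, 0, 0],
         [0, 0, 2, 0],
         [1, 0, 0, 5]]
    values, col_idx, row_ptr = dense_to_csr(A)
    print(len(values), "stored entries out of", len(A) ** 2)
    print(csr_matvec(values, col_idx, row_ptr, [1, 1, 1, 1]))
    # Fill fraction quoted in the post: 82267 nonzeros in a 2509 x 2509 matrix.
    print("fill: %.2f%%" % (100.0 * 82267 / 2509 ** 2))
```

The work per product scales with the number of stored entries instead of n squared, which is where most of the savings for iterative methods comes from.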

Here’s an iterative method solving the problem. I haven’t tuned any iteration parameters for bicgstab (biconjugate gradients, stabilized) so it could be a bit better but the mean squared error is pretty small.

tic
d = load('d.dat');
N = load('N.dat');
toc
tic
Ns = sparse(N);
toc
tic
output = bicgstab(Ns,d);
toc
% compute a true output
output_true = Ns\d;
% compute mean squared error of OPR
output_mse = sum((output_true - output).^2)/length(output)

Elapsed time is 2.728844 seconds.
Elapsed time is 0.040895 seconds.
bicgstab stopped at iteration 20 without converging to the desired tolerance 1e-06
because the maximum number of iterations was reached.
The iterate returned (number 20) has relative residual 2e-06.
Elapsed time is 0.015128 seconds.

output_mse =

9.0544e-07

Not much benefit in the iterative method here; the matrix is quite small. The speedup is much more considerable when you are solving similarly sparse matrices that are huge. In my industry and research work, finite element models can produce matrices that are millions by millions or more, and at that point you need sophisticated algorithms. But for the size of the OPR matrix, unless we get TONS more FRC teams soon, just running it with sparse tools should be sufficient for it to run quite fast. Octave and MATLAB have it built in, and I believe NumPy/SciPy distributions do as well. There are also C++ and Java libraries for sparse computation.

A final suggestion would be that if you construct your matrices in the sparse form explicitly from the get-go (not N, but the precursor to it) you can alleviate even the data loading time to a small fraction of what it is now.
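As a sketch of building the system in sparse form from the start, the Python 3 code below assembles N and d straight from per-alliance records, storing only nonzeros. It assumes the usual OPR setup, which this thread does not spell out: each alliance score contributes one row of a 0/1 membership matrix A, with N = AᵀA and d = Aᵀs. The function name and the data are hypothetical.

```python
from collections import defaultdict


def assemble_normal_equations(alliances):
    """Build sparse N = A^T A and d = A^T s without forming dense A or N.

    alliances: list of (team_indices, score) pairs, one per alliance per match.
    Returns (N, d), where N is a dict keyed by (row, col) holding only nonzeros.
    """
    N = defaultdict(float)
    d = defaultdict(float)
    for teams, score in alliances:
        for i in teams:
            d[i] += score
            for j in teams:
                N[(i, j)] += 1.0   # teams i and j appeared together once more
    return dict(N), dict(d)


if __name__ == "__main__":
    # Made-up data: four teams (indices 0..3) playing in two-team alliances.
    alliances = [((0, 1), 30.0), ((2, 3), 20.0), ((0, 2), 25.0), ((1, 3), 35.0)]
    N, d = assemble_normal_equations(alliances)
    print(len(N), "nonzeros stored")
    print(sorted(d.items()))
```

A coordinate-style dictionary like this can be handed to a sparse library's constructor, so the dense 2509x2509 text file never needs to be written or parsed.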

Hope that helps.

Added: I did check the structure of N, and it is consistent with a sparse least squares matrix. It is also symmetric and positive definite. These properties are why I chose bicgstab instead of gmres or another iterative algorithm. If you don’t want to solve it iteratively, Cholesky Factorization is also very good for dealing with symmetric positive definite matrices.
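For reference, a dense Cholesky solve can be sketched in a few lines of Python 3. This is a textbook version on a made-up 3x3 matrix, written for readability rather than speed; the function names are illustrative and a real run on N would use a library routine.

```python
import math


def cholesky(A):
    """Return lower-triangular L with A = L L^T for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def cholesky_solve(A, b):
    """Solve A x = b via L y = b (forward) and then L^T x = y (backward)."""
    n = len(A)
    L = cholesky(A)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


if __name__ == "__main__":
    A = [[4.0, 2.0, 0.0],
         [2.0, 5.0, 1.0],
         [0.0, 1.0, 3.0]]
    b = [6.0, 8.0, 4.0]   # chosen so that the exact solution is (1, 1, 1)
    print(cholesky_solve(A, b))
```

Because the factorization exploits symmetry, it needs about half the arithmetic of a general LU factorization, and the positive-definiteness check comes for free as a failed square root.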

Sounds great. I had to actually code up some different solvers in C. We could use MATLAB, but were not allowed to use any functions more complicated than addition, etc.

Nice to see some of the MATLAB tools to do that.

Just wondering, where do you work?

Thank you, Borna. I am currently a Ph.D. student in mechatronics and control systems at Purdue University. I did my Master’s Degree in Heat Transfer and Design Optimization, and the tools I learned through that included finite element methods for structural, thermal, and fluid flow analysis, as well as the mathematical underpinnings of those methods and the numerical implementation. I also spent a lot of time looking at optimization algorithms. Some of my work was industry sponsored and so I got to help solve large problems that way.

I also did an internship at Alcatel-Lucent Bell Labs where I did CFD modeling for electronics cooling. I also use finite elements often when designing parts for my current research.

For coding some of these algorithms in C by hand, if you are interested, one of the best possible references is Matrix Computations by Golub and Van Loan, which will get you much of the way there.

C code implementing Cholesky decomposition-based solver. With minimal optimization, the calculation runs in 3.02 seconds on my system.

Cholesky.c (3.43 KB)



The reason I say it’s computationally intensive is this article: http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

That article is 100% correct. The solutions above that are solving in a handful of seconds or less are not inverting the matrix. Reducing the matrix to reduced-row echelon form is related to what methods like LU and Cholesky factorization do.

Even normal Gaussian elimination will be pretty fast on a sparse matrix (though still slower than most methods above). However, it has problems with numerical stability that get worse and worse as matrix size increases, and that is the main reason most people solving numerical linear algebra problems avoid it.
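A textbook illustration of the stability issue (not taken from the OPR data): with a tiny pivot, naive elimination loses the answer entirely, while swapping rows first recovers it. The Python 3 function below is a hypothetical 2x2-only helper written for this demonstration.

```python
def eliminate_2x2(A, b, pivot):
    """Solve a 2x2 system by Gaussian elimination, optionally with partial pivoting."""
    (a11, a12), (a21, a22) = A
    b1, b2 = b
    if pivot and abs(a21) > abs(a11):
        # Swap rows so the larger entry becomes the pivot.
        a11, a12, a21, a22 = a21, a22, a11, a12
        b1, b2 = b2, b1
    m = a21 / a11          # elimination multiplier
    a22 -= m * a12
    b2 -= m * b1
    x2 = b2 / a22
    x1 = (b1 - a12 * x2) / a11
    return x1, x2


if __name__ == "__main__":
    A = [[1e-20, 1.0], [1.0, 1.0]]   # exact solution is very close to (1, 1)
    b = [1.0, 2.0]
    print("no pivoting:  ", eliminate_2x2(A, b, pivot=False))
    print("with pivoting:", eliminate_2x2(A, b, pivot=True))
```

Without pivoting the first unknown comes out as 0.0 instead of roughly 1.0, because the huge multiplier swamps the other entries. Library solvers guard against this with pivoting, and for symmetric positive definite matrices such as N a Cholesky factorization avoids the issue without needing any row swaps.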

Re-ran using Scipy’s sparse matrix solver.

Average run time: 0.085s
Standard deviation: 0.005s


import sys
import numpy
import time
import scipy
import scipy.sparse
import scipy.sparse.linalg
import psutil

n_runs = 1000

print ""
print ""
print "Python version %s" % (sys.version)
print "Numpy version %s" % (numpy.__version__)
print "Scipy version %s" % (scipy.__version__)
print "Psutil version %s" % (psutil.__version__)
print ""


N = numpy.loadtxt(open('N.dat'))
d = numpy.loadtxt(open('d.dat'))

Ns = scipy.sparse.csr_matrix(N)

data = []
for i in range(1,n_runs+1):
    start = time.time()
    x = scipy.sparse.linalg.spsolve(Ns,d)
    end = time.time()
    row = [end - start]
    row.extend(psutil.cpu_percent(interval=1,percpu=True))
    s = "\t".join([str(item) for item in row])
    data.append(s)
    
f = open('times2.dat','w')
f.write("\n".join(data))
f.close()

_x = scipy.sparse.linalg.spsolve(Ns,d)
print ", ".join([str(f) for f in _x])
print ""

runs2.txt (24.3 KB)

output2.txt (37.1 KB)

Nikhil,
Is there a reason why you are not using the “pcg” function, which assumes symmetric positive definite inputs? This should be faster. Also, please consider using the diagonal as a preconditioner. Unfortunately I do not have access to MATLAB at the moment. Could you please try the following? And sorry in advance for any bugs:

Ns = sparse(N);
D = diag(Ns);
Ds = sparse(diag(D)); #This was a bug… maybe it still is!

Reference Solution

tic
output = Ns\d;
toc

CG Solution

tic
output = pcg(Ns,d)
toc

Diagonal PCG Solution

tic
output = pcg(Ns,d,[],[],Ds)
toc

Reverse Cuthill-McKee re-ordering

tic
p = symrcm(Ns); # permutation array
Nr = Ns(p,p); # re-ordered problem
toc

Re-ordered Solve

tic
output = Nr\d; #answer is stored in a permuted matrix indexed by ‘p’
toc

Another advantage to the conjugate gradient methods is concurrent form of the solution within each iteration (parallel processing).
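For readers without MATLAB, the diagonally preconditioned conjugate gradient iteration can be sketched in Python 3 as follows. This is a plain textbook version using dense lists for brevity; `pcg_jacobi` and its helpers are hypothetical names, and it is not a description of how MATLAB's `pcg` is implemented.

```python
def matvec(A, x):
    """Dense matrix-vector product; a sparse version would skip the zeros."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg_jacobi(A, b, tol=1e-6, max_iter=100):
    """Preconditioned conjugate gradients with M = diag(A).

    A must be symmetric positive definite.
    Returns (x, iterations, relative_residual).
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = b[:]                                     # residual b - A x for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]   # preconditioned residual
    p = z[:]
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    if b_norm == 0.0:
        return x, 0, 0.0
    rel_res = 1.0
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rel_res = dot(r, r) ** 0.5 / b_norm
        if rel_res < tol:
            return x, it, rel_res
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter, rel_res


if __name__ == "__main__":
    A = [[4.0, 2.0, 0.0],
         [2.0, 5.0, 1.0],
         [0.0, 1.0, 3.0]]
    b = [6.0, 8.0, 4.0]   # exact solution is (1, 1, 1)
    print(pcg_jacobi(A, b))
```

Each iteration costs one matrix-vector product plus a few vector operations, all of which split naturally across rows, which is the parallelism mentioned above.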

Best regards

New code, based on what James put up. I just added some disp calls so that the results would be clearer; the disps are outside of the tics and tocs. I did not find any bugs, though I had to change the #'s to %'s.

clc
disp('Loading Data...')
tic
d = load('d.dat');
N = load('N.dat');
toc
Ns = sparse(N);
D = diag(Ns);
Ds = sparse(diag(D)); %This was a bug... maybe it still is!

% Reference Solution 
disp('Reference Solution:')
tic
output1 = Ns\d;
toc


% CG Solution
disp('CG Solution:');
tic
output2 = pcg(Ns,d);
toc

% Diagonal PCG Solution
disp('Diagonal PCG Solution:');
tic
output3 = pcg(Ns,d,[],[],Ds);
toc

% Reverse Cuthill-McKee re-ordering
disp('Re-ordering (Reverse Cutthill-McKee:');
tic
p = symrcm(Ns); % permutation array
Nr = Ns(p,p); % re-ordered problem
toc

% Re-ordered Solve
disp('Re-ordered Solution:');
tic
output4 = Nr\d; %answer is stored in a permuted matrix indexed by 'p'
toc

Output:

Loading Data...
Elapsed time is 3.033846 seconds.
Reference Solution:
Elapsed time is 0.014136 seconds.
CG Solution:
pcg stopped at iteration 20 without converging to the desired tolerance 1e-06
because the maximum number of iterations was reached.
The iterate returned (number 20) has relative residual 4.8e-05.
Elapsed time is 0.007545 seconds.
Diagonal PCG Solution:
pcg converged at iteration 17 to a solution with relative residual 8.9e-07.
Elapsed time is 0.009216 seconds.
Re-ordering (Reverse Cutthill-McKee:
Elapsed time is 0.004523 seconds.
Re-ordered Solution:
Elapsed time is 0.015021 seconds.
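One detail worth checking in the re-ordered solve: when the rows and columns of N are permuted by p, the right-hand side has to be permuted the same way, and the result mapped back afterwards. The Python 3 sketch below shows that bookkeeping on a made-up 3x3 system with a hand-picked permutation (not an actual Reverse Cuthill-McKee ordering); the function names are illustrative.

```python
def solve_dense(A, b):
    """Small Gaussian elimination with partial pivoting (for illustration only)."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]   # augmented matrix [A | b]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def solve_reordered(A, b, p):
    """Solve A x = b through the symmetrically permuted system A[p][p] y = b[p]."""
    n = len(p)
    Ar = [[A[p[i]][p[j]] for j in range(n)] for i in range(n)]
    br = [b[p[i]] for i in range(n)]     # permute the right-hand side too
    y = solve_dense(Ar, br)
    x = [0.0] * n
    for i in range(n):
        x[p[i]] = y[i]                   # undo the permutation
    return x


if __name__ == "__main__":
    A = [[4.0, 2.0, 0.0],
         [2.0, 5.0, 1.0],
         [0.0, 1.0, 3.0]]
    b = [6.0, 9.0, 7.0]      # exact solution is (1, 1, 2)
    print(solve_reordered(A, b, [2, 0, 1]))
```

In MATLAB terms that corresponds to `y = Nr \ d(p); x(p) = y;`. The timing of the solve does not depend on which right-hand side is used, so the measurement above stands either way.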

I didn’t precondition earlier because I was being sloppy/lazy :). Thanks for calling me out. :yikes: And you’re right, I should have used pcg. Thanks for the suggestion.

Since no-one has done Octave yet, I’ll go ahead and do it (along with MATLAB for comparison). I can’t do SciLab or R because I don’t know how to use those :stuck_out_tongue:

MATLAB 2012b:

>> N = dlmread('N.dat');
>> d = dlmread('d.dat');
>> tic ; r = N \ d; toc
Elapsed time is 0.797772 seconds.

GNU Octave 3.6.2:

octave:1> N = dlmread('N.dat');
octave:2> d = dlmread('d.dat');
octave:3> tic ; r = N \ d; toc
Elapsed time is 0.624047 seconds.

This is on an Intel i5 (2 core + hyperthreading) with Linux as the host OS (kernel version 3.7.6).