Loading a COO file into Python

*Greetings Python gurus.

I’m looking for a better (faster) way to load a plaintext COO (row,column,value tuple) file into Python. The values are all integers.

File Aijv.dat is a ~7.5 MB plaintext file in COO format representing a
178420x2696 sparse matrix with 535260 non-zero entries.
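(For reference, each line of the file is one space-separated row/column/value triple of integers; the particular numbers below are made up purely for illustration.)

1 5 3
1 17 1
2 5 2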

Here’s a portion of code showing Python taking almost 19 seconds to load Aijv.dat:

Python 2.7.5 (default, May 15 2013, 22:43:36)
[MSC v.1500 32 bit (Intel)] on win32

>>> import numpy
>>> import time
>>> import scipy
>>> import scipy.sparse as sp
>>> import scipy.sparse.linalg

>>> start=time.time()
>>> Aijv = numpy.loadtxt('Aijv.dat', 'int')
>>> time.time()-start
18.844000101089478

The exact same file takes only 5.5 seconds to load in Octave:

GNU Octave, version 3.6.4

+ tic;
+ Aijv = dlmread ('Aijv.dat');
+ toc
Elapsed time is 5.547 seconds.

Write a native module to do it faster? :rolleyes:

*I could certainly do that as a last resort if Python doesn’t already have a built-in function to do it.

Long experience has taught me to explore the capabilities of the language to avoid re-inventing the wheel.

The simplest solution is often the best one. I’d personally just parse it myself and read line by line with a file reader, parse into an object and add each one into an array.

My 2c.

*Here’s a link to the 7zipped Aijv.dat file.

If your solution is the simplest and easiest one, please show us how you propose to “parse it myself and read line by line with a file reader, parse into an object and add each one into an array”.

Apparently pandas has something for this that outperforms numpy.loadtxt?

And here:

http://akuederle.com/stop-using-numpy-loadtxt
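For anyone who wants to try the pandas route, here’s a minimal sketch of what that article suggests (assuming the file is space-delimited with no header row; pandas.read_csv’s compiled C parser is where the speedup comes from):

import time
import pandas as pd

start = time.time()
# read_csv's C parser handles simple delimited integer data much
# faster than numpy.loadtxt's pure-Python line handling
df = pd.read_csv('Aijv.dat', sep=' ', header=None, dtype=int)
Aijv = df.values    # plain (nnz, 3) numpy integer array
print('{0:.2f} secs'.format(time.time() - start))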

*Thanks Dustin.

I’ll try that.

I’m curious as to why you’re using Python in the first place if performance is a concern.

*Well this is a bit unexpected. Looks like I get to play the role of Python guru momentarily :slight_smile:

Python is supported by an immense library of highly optimized math and science routines. It’s as fast as Matlab/Octave (faster in some cases), and sometimes actually easier to use.

It’s free, has a large installation base, and the code is quite readable… all of which makes it a convenient vehicle for sharing solutions… such as how to efficiently compute World OPR or other metrics using both min |Ax-b|[sub]2[/sub] and min |Ax-b|[sub]1[/sub], etc.
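As an aside, here’s a sketch of what the min |Ax-b|[sub]2[/sub] part looks like, using the Aijv array loaded earlier. The right-hand side b is just a hypothetical stand-in, and the file’s indices are assumed to be 1-based (drop the -1 if they’re not):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg

# build the sparse matrix from the (row, col, value) triples,
# assuming 1-based indices in the file
r, c, v = Aijv[:, 0], Aijv[:, 1], Aijv[:, 2]
A = sp.coo_matrix((v, (r - 1, c - 1)), shape=(178420, 2696)).tocsr()

b = np.ones(A.shape[0])                  # hypothetical right-hand side
x = scipy.sparse.linalg.lsqr(A, b)[0]    # x minimizing |Ax - b|_2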

This only took 3.4 seconds to run on my Win 8.1 system.

import os
import time

time_start = time.time( )

data = [ ]

# read the file line by line, converting each space-separated triple to ints
for line in open( r'D:\temp\Aijv.dat', 'r' ):
    data.append( [ int( x ) for x in line.rstrip( ).split( ' ' ) ] )

print( '{0:.2f} secs'.format( time.time( ) - time_start ) )
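Once it’s loaded this way, data is a plain list of lists; converting it to a numpy array (from which the coo_matrix construction sketched above would apply) is one more line:

import numpy as np

arr = np.array(data, dtype=int)    # shape (535260, 3): one (row, col, value) per row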

Thanks. That’s quite an improvement.

It took ~5.3 seconds on my 10-year-old PentiumD Win32 machine:

>>> start = time.time( )
>>> data = [ ]
>>> for line in open( r'k:/data/Aijv big data.dat', 'r' ):
...     data.append( [ int( x ) for x in line.rstrip( ).split( ' ' ) ] )
...
>>> time.time()-start
**5.296999931335449**

>>> start=time.time()
>>> Aijv = **np.loadtxt**('k:/data/Aijv big data.dat', 'int');
>>> time.time()-start
**18.858999967575073**

Also:

>>> start=time.time()
>>> Ajiv = **np.transpose(Aijv)**
>>> time.time()-start
**0.0**

>>> start=time.time()
>>> datajiv = **np.transpose(data)**
>>> time.time()-start
**3.937999963760376**

Why does it take so much longer to transpose data than Aijv?

BTW, the benchmark time on my machine appears to be about 0.5 seconds. That’s the total time it took a compiled Win32 app to read Aijv.dat into a transposed array.

WOW. Pandas is FAST.

Just ran it on a slower machine and it loaded Aijv.dat in less than one second.

You are my Python guru, Dustin. :slight_smile:

Glad to help. :slight_smile:

My guess is that because data isn’t already a numpy array, it has to do a lot of work to convert it to one first, and then do the transpose operation. If I’m correct, a second transpose on the result of the first transpose would be very fast, similar to the operation on Ajiv.

If I understand what transpose does, it most likely just moves the pointers to the various axes around, which is why it’s so fast. My understanding of how numpy arrays work is that they try really hard not to actually move data around; many common operations can be done by just creating or moving pointers.
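A quick way to check that, for what it’s worth:

>>> import numpy as np
>>> a = np.array([[1, 2, 3], [4, 5, 6]])
>>> t = a.T          # constant time: a new view with swapped strides
>>> t.base is a      # True means t shares a's buffer; no data was copied
True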