Chief Delphi


Ether 06-10-2016 15:40

loading a COO file into Python
 

Greetings Python gurus.

I'm looking for a better (faster) way to load a plaintext COO (row,column,value tuple) file into Python. The values are all integers.

File Aijv.dat is a ~7.5 MB plaintext file in COO format representing a
178420x2696 sparse matrix with 535260 non-zero entries.

Here's a portion of code showing Python taking almost 19 seconds to load Aijv.dat:
Code:

Python 2.7.5 (default, May 15 2013, 22:43:36)
[MSC v.1500 32 bit (Intel)] on win32

>>> import numpy
>>> import time
>>> import scipy
>>> import scipy.sparse as sp
>>> import scipy.sparse.linalg

>>> start=time.time()
>>> Aijv = numpy.loadtxt('Aijv.dat', 'int')
>>> time.time()-start
18.844000101089478

The exact same file takes only 5.5 seconds to load in Octave:
Code:

GNU Octave, version 3.6.4

+ tic;
+ Aijv = dlmread ('Aijv.dat');
+ toc
Elapsed time is 5.547 seconds.
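
For readers following along: once the (row, column, value) triplets are in memory as an N x 3 integer array, they can be handed to scipy.sparse to build the actual sparse matrix. The following is only a sketch. The helper name triplets_to_coo and the tiny demo array are made up for illustration, and whether the indices in Aijv.dat are 1-based (the Octave convention) or 0-based is an assumption that is not stated in this thread.

```python
import numpy as np
import scipy.sparse as sp


def triplets_to_coo(triplets, shape, one_based=True):
    """Build a scipy.sparse COO matrix from an (N, 3) array of (row, col, value).

    one_based=True is an ASSUMPTION (Octave-style indices); set it to False
    if the file already uses 0-based indices.
    """
    triplets = np.asarray(triplets)
    offset = 1 if one_based else 0
    rows = triplets[:, 0] - offset
    cols = triplets[:, 1] - offset
    vals = triplets[:, 2]
    return sp.coo_matrix((vals, (rows, cols)), shape=shape)


# Hypothetical stand-in data; the real matrix would use shape=(178420, 2696).
demo = np.array([[1, 1, 5],
                 [2, 3, 7],
                 [4, 2, 9]])
A = triplets_to_coo(demo, shape=(4, 3))
print(A.shape, A.nnz)
```

Passing the shape explicitly avoids scipy inferring it from the largest index, which would silently drop trailing all-zero rows or columns.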


euhlmann 06-10-2016 16:34

Re: loading a COO file into Python
 
Write a native module to do it faster? :rolleyes:

Ether 06-10-2016 16:47

Re: loading a COO file into Python
 

I could certainly do that as a last resort if Python doesn't already have a built-in function to do it.

Long experience has taught me to explore the capabilities of the language to avoid re-inventing the wheel.



tjf 06-10-2016 18:00

Re: loading a COO file into Python
 
The simplest solution is often the easiest one. I'd personally just parse it myself and read line by line with a file reader, parse into an object and add each one into an array.

My 2c.

Ether 06-10-2016 18:12

Re: loading a COO file into Python
 

Here's a link to the 7zipped Aijv.dat file.

If your solution is the simplest and easiest one, please show us how you propose to "parse it myself and read line by line with a file reader, parse into an object and add each one into an array".



virtuald 06-10-2016 23:36

Re: loading a COO file into Python
 
Apparently pandas has something for this that outperforms numpy.loadtxt?

http://wesmckinney.com/blog/a-new-hi...ne-for-pandas/

And here:

http://akuederle.com/stop-using-numpy-loadtxt
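
A minimal sketch of what the pandas route might look like, assuming the file holds three whitespace-separated integers per line with no header row (consistent with the description in the first post, but the exact layout is an assumption). The function name load_coo_triplets and the in-memory sample are hypothetical; for the real file you would pass its path instead of the StringIO object.

```python
import io

import numpy as np
import pandas as pd


def load_coo_triplets(source):
    """Read integer (row, col, value) lines into an (N, 3) numpy array.

    `source` may be a file path or any file-like object.
    """
    frame = pd.read_csv(source, sep=r"\s+", header=None, dtype=np.int64)
    return frame.values


# Tiny in-memory stand-in for Aijv.dat (made-up numbers).
sample = io.StringIO("1 1 5\n2 3 7\n4 2 9\n")
triplets = load_coo_triplets(sample)
print(triplets.shape)
```

The result is already a numpy array, so it can be sliced by column or transposed without any further conversion.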

Ether 07-10-2016 00:36

Re: loading a COO file into Python
 

Thanks Dustin.

I'll try that.



euhlmann 07-10-2016 09:12

Re: loading a COO file into Python
 
I'm curious as to why you're using Python in the first place if performance is a concern.

Ether 07-10-2016 10:12

Re: loading a COO file into Python
 

Well this is a bit unexpected. Looks like I get to play the role of Python guru momentarily :)

Python is supported by an immense library of highly optimized math and science routines. It's as fast as Matlab/Octave (faster in some cases), and in some cases actually easier to use.

It's free, has a large installed base, and the code is quite readable... all of which makes it a convenient vehicle for sharing solutions... such as how to efficiently compute World OPR or other metrics using both min ||Ax-b||_2 and min ||Ax-b||_1 etc.



vScourge 07-10-2016 15:17

Re: loading a COO file into Python
 
This only took 3.4 seconds to run on my Win 8.1 system.

Code:

import time

time_start = time.time( )

data = [ ]

# Parse each space-separated line into a list of ints (row, column, value).
with open( r'D:\temp\Aijv.dat', 'r' ) as f:
    for line in f:
        data.append( [ int( x ) for x in line.rstrip( ).split( ' ' ) ] )

print( '{0:.2f} secs'.format( time.time( ) - time_start ) )


Ether 10-10-2016 14:29

Re: loading a COO file into Python
 
Quote:

Originally Posted by vScourge (Post 1610912)
This only took 3.4 seconds to run on my Win 8.1 system.

Thanks. That's quite an improvement.

It took ~5.3 seconds on my 10-year-old PentiumD Win32 machine:
Code:

>>> start = time.time( )
>>> data = [ ]
>>> for line in open( r'k:/data/Aijv big data.dat', 'r' ):
...    data.append( [ int( x ) for x in line.rstrip( ).split( ' ' ) ] )
...
>>> time.time()-start
5.296999931335449

>>> start=time.time()
>>> Aijv = np.loadtxt('k:/data/Aijv big data.dat', 'int');
>>> time.time()-start
18.858999967575073


Also:
Code:

>>> start=time.time()
>>> Ajiv = np.transpose(Aijv)
>>> time.time()-start
0.0

>>> start=time.time()
>>> datajiv = np.transpose(data)
>>> time.time()-start
3.937999963760376

Why does it take so much longer to transpose data (the list of lists) than Aijv (the numpy array)?


BTW, the benchmark time on my machine appears to be about 0.5 seconds. That's the total time it took a compiled Win32 app to read Aijv.dat into a transposed array.



Ether 10-10-2016 14:55

Re: loading a COO file into Python
 
Quote:

Originally Posted by virtuald (Post 1610807)
Apparently pandas has something for this that outperforms numpy.loadtxt?

WOW. Pandas is FAST.

Just ran it on a slower machine and it loaded Aijv.dat in less than one second.

You are my Python guru Dustin:)



virtuald 10-10-2016 21:00

Re: loading a COO file into Python
 
Quote:

Originally Posted by Ether (Post 1611198)
WOW. Pandas is FAST.

Just ran it on a slower machine and it loaded Aijv.dat in less than one second.

You are my Python guru Dustin:)



Glad to help. :)

My guess is that because data isn't already a numpy array, numpy has to do a lot of work to convert it to one first, and only then can it do the transpose operation. If I'm correct, a second transpose on the result of the first transpose would be very fast, similar to the operation on Aijv.

If I understand what transpose does, it most likely is just moving the pointers to the various axes around, which is why it's so fast. My understanding of how numpy arrays work is that they try really hard to not actually move data around, and many common operations can be done by just creating/moving various pointers around.
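
A small sketch that illustrates the guess above, using made-up data rather than the real file. For a numpy array, np.transpose returns a view that shares the original buffer and simply swaps the strides. For a plain list of lists, numpy must first build a new array from the Python objects, and that conversion is where the time goes.

```python
import numpy as np

# Case 1: already a numpy array, so the transpose is a view (no data copied).
arr = np.arange(6).reshape(2, 3)
arr_t = np.transpose(arr)
print(np.shares_memory(arr, arr_t))   # True: both use the same buffer
print(arr.strides, arr_t.strides)     # same two numbers, order swapped

# Case 2: a list of lists has to be converted to an array before transposing.
nested = [[0, 1, 2], [3, 4, 5]]
nested_t = np.transpose(nested)       # conversion happens inside this call

# Converting once up front makes every later transpose cheap again.
as_array = np.array(nested)
cheap_t = as_array.T
print(np.shares_memory(as_array, cheap_t))
```

So if the list-based loader is used, converting data with np.array(data) once and working with the array afterwards should avoid paying the conversion cost on each operation.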


All times are GMT -5.
