[SOLVED] Convert single-column txt with 300 million rows to NumPy array faster

Issue

I have a txt file that contains more than 300 million rows in a single column. I'm trying to read it and convert it to a NumPy array. So far I have tried

label = np.loadtxt('/path/to/file')

and

for lines in fileinput.input('/path/to/file'):
    do_something_with(lines)

np.loadtxt seems slightly faster, but it still takes hours to read a single txt file. Although the file has more than 300 million rows (1 column), it is only around 950 MB. I suspect np.loadtxt is also reading the file line by line, which is why the processing takes so long.

Is there any method that can speed up this reading and converting process while preserving the order of the rows?

Solution

Sounds like the file is simple enough that readlines may work.

Make a small sample file:

In [2]: arr = np.random.random((100,1))
In [4]: np.savetxt('test.txt', arr, fmt='%f')
In [6]: !head test.txt
0.872225
0.365394
0.802365
0.140455
0.041390
0.531483
0.415459
0.906439
0.789604
0.493369

The straightforward loadtxt:

In [8]: arr1 = np.loadtxt('test.txt')
In [9]: arr1.shape
Out[9]: (100,)

I think there's a parameter (ndmin) to force a (100,1) shape; I'll leave that for now.
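For the record, loadtxt's ndmin parameter does exactly that. A small sketch (it rewrites the sample file so it stands alone):

```python
import numpy as np

# Recreate the small sample file so this snippet is self-contained
np.savetxt('test.txt', np.random.random((100, 1)), fmt='%f')

# ndmin=2 keeps the trailing singleton dimension instead of squeezing it away
arr1 = np.loadtxt('test.txt', ndmin=2)
print(arr1.shape)  # (100, 1)
```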

Let's try readlines, using np.array to convert the list of strings to a float array:

In [11]: arr2 = np.array(open('test.txt').readlines(), dtype=float)
In [12]: arr2.shape
Out[12]: (100,)
In [13]: np.allclose(arr1,arr2)
Out[13]: True

Compare times:

In [14]: timeit arr2 = np.array(open('test.txt').readlines(), dtype=float)
77.5 µs ± 961 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [15]: timeit arr1 = np.loadtxt('test.txt')
605 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A lot faster with readlines.
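One caveat for a 300-million-row file: readlines materializes one Python string per line, all at once. If peak memory is a concern, a chunked variant keeps it bounded while preserving row order. A sketch (the chunk size is an arbitrary choice, not something tuned):

```python
from itertools import islice

import numpy as np

def txt_to_array(path, chunk=1_000_000):
    """Read a one-column text file into a 1-D float array, `chunk` lines at a time."""
    parts = []
    with open(path) as f:
        while True:
            block = list(islice(f, chunk))  # next `chunk` lines (fewer at EOF)
            if not block:
                break
            parts.append(np.array(block, dtype=float))
    return np.concatenate(parts)

# Demo on a small sample file like the one above
np.savetxt('test.txt', np.random.random((100, 1)), fmt='%f')
arr = txt_to_array('test.txt', chunk=32)
print(arr.shape)  # (100,)
```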

Another approach is fromfile:

In [18]: arr3 = np.fromfile('test.txt',dtype=float, sep=' ')
In [19]: arr3.shape
Out[19]: (100,)
In [20]: np.allclose(arr1,arr3)
Out[20]: True
In [21]: timeit arr3 = np.fromfile('test.txt',dtype=float, sep=' ')
118 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Not quite as fast, but still better than loadtxt. Yes, loadtxt reads the file line by line.

genfromtxt is a bit better than loadtxt.
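For completeness, the genfromtxt call looks the same as loadtxt for a file this simple (a quick sketch on the sample file):

```python
import numpy as np

np.savetxt('test.txt', np.random.random((100, 1)), fmt='%f')

arr1 = np.loadtxt('test.txt')
arr5 = np.genfromtxt('test.txt')  # same parsing defaults for this simple file
print(np.allclose(arr1, arr5))  # True
```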

pandas is supposed to have a fast csv reader, but that doesn’t seem to be the case here:

In [33]: timeit arr4=pd.read_csv('test.txt',header=None).to_numpy()
1.12 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
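On a 100-row file, per-call overhead dominates that timing; on a large file, read_csv's C parser usually fares much better. Passing an explicit dtype also skips pandas' type inference. A sketch (other read_csv options left at their defaults):

```python
import numpy as np
import pandas as pd

np.savetxt('test.txt', np.random.random((100, 1)), fmt='%f')

# Explicit dtype skips type inference; ravel() drops the column axis
arr4 = pd.read_csv('test.txt', header=None, dtype=np.float64).to_numpy().ravel()
print(arr4.shape)  # (100,)
```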

Answered By – hpaulj

Answer Checked By – Katrina (BugsFixing Volunteer)
