I have a txt file that contains more than 300 million rows in a single column. I’m trying to read it and convert it to a numpy array. Currently I have tried:
label = np.loadtxt('/path/to/file')
import fileinput

for lines in fileinput.input('/path/to/file'):
    do_something_with(lines)
np.loadtxt is slightly faster, but it still needs hours to read a single txt file. The file has more than 300 million rows in one column, yet its size is only around 950 MB. I suspect that np.loadtxt is also reading the file line by line, which is why the processing takes so long.
I’m wondering if there is any method that can speed up this reading and converting process while keeping the sequence of the rows.
Sounds like the file is simple enough that readlines may work.
Make a small sample file:
In : arr = np.random.random((100,1))
In : np.savetxt('test.txt', arr, fmt='%f')
In : !head test.txt
0.872225
0.365394
0.802365
0.140455
0.041390
0.531483
0.415459
0.906439
0.789604
0.493369
In : arr1 = np.loadtxt('test.txt')
In : arr1.shape
Out: (100,)
I think there’s a parameter to force a (100,1) shape, I leave that for now.
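For reference, the parameter is loadtxt’s `ndmin` keyword, which keeps the result 2-D instead of squeezing the single column — a minimal sketch:

```python
import numpy as np

# recreate the sample file from above
np.savetxt('test.txt', np.random.random((100, 1)), fmt='%f')

# ndmin=2 forces at least a 2-D result, preserving the (100, 1) shape
arr1 = np.loadtxt('test.txt', ndmin=2)
print(arr1.shape)  # -> (100, 1)
```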
Let’s try np.array to convert the list of strings to a float array:
In : arr2 = np.array(open('test.txt').readlines(), dtype=float)
In : arr2.shape
Out: (100,)
In : np.allclose(arr1, arr2)
Out: True
In : timeit arr2 = np.array(open('test.txt').readlines(), dtype=float)
77.5 µs ± 961 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In : timeit arr1 = np.loadtxt('test.txt')
605 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A lot faster with readlines.
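One caveat for the real file: readlines builds 300 million small Python strings. A variant that should behave similarly is one big read plus split — a sketch, not timed here:

```python
import numpy as np

# recreate the sample file from above
np.savetxt('test.txt', np.random.random((100, 1)), fmt='%f')

# One read() call plus split() builds a single string and one list of
# tokens; np.array then parses the tokens as floats in file order
with open('test.txt') as f:
    arr2b = np.array(f.read().split(), dtype=float)
print(arr2b.shape)  # -> (100,)
```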
Another approach is np.fromfile:
In : arr3 = np.fromfile('test.txt', dtype=float, sep=' ')
In : arr3.shape
Out: (100,)
In : np.allclose(arr1, arr3)
Out: True
In : timeit arr3 = np.fromfile('test.txt', dtype=float, sep=' ')
118 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Not quite as fast as readlines, but still better than loadtxt, which reads the file line by line. genfromtxt is a bit better than loadtxt.
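For comparison, the genfromtxt call on the same file would look like this (a sketch; timings will vary):

```python
import numpy as np

# recreate the sample file from above
np.savetxt('test.txt', np.random.random((100, 1)), fmt='%f')

# genfromtxt handles messier input (missing values, comments, etc.),
# but for a clean single column it returns the same (100,) array
arr_g = np.genfromtxt('test.txt')
print(arr_g.shape)  # -> (100,)
```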
pandas is supposed to have a fast csv reader, but that doesn’t seem to be the case here:
In : timeit arr4 = pd.read_csv('test.txt', header=None).to_numpy()
1.12 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
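On a 100-row sample, pandas’ fixed startup cost dominates; on the real 300-million-row file its C parser may fare better, especially if you pass the dtype up front so it can skip type inference — a sketch, untimed:

```python
import numpy as np
import pandas as pd

# recreate the sample file from above
np.savetxt('test.txt', np.random.random((100, 1)), fmt='%f')

# header=None: the file has no header row; dtype skips inference;
# ravel() flattens the (100, 1) frame to match loadtxt's (100,) result
arr4 = pd.read_csv('test.txt', header=None, dtype=np.float64).to_numpy().ravel()
print(arr4.shape)  # -> (100,)
```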
Answered By – hpaulj