[SOLVED] Python fastest access to nth line in huge file

Issue

I have an ASCII table in a file from which I want to read a particular set of lines (e.g. lines 4003 to 4005). The issue is that this file could be very long (hundreds of thousands to millions of lines), and I’d like to do this as quickly as possible.

a) Bad Solution: Read in the entire file, then slice out those lines,

with open('filename') as f:
    # the slice end is exclusive, so this yields indices 4003 and 4004
    lines = f.readlines()[4003:4005]

b) Better Solution: enumerate over each line so that it’s not all in memory (a la https://stackoverflow.com/a/2081880/230468)

lines = []
with open('filename') as f:
    for i, line in enumerate(f):       # i is 0-based
        if 4003 <= i <= 4005:
            lines.append(line)
        elif i > 4005:                 # @Wooble
            break

c) Best Solution?

But b) still requires going through each line.

Is there a better (in terms of speed/efficiency) method of accessing a particular line from a huge file?

  • Should I use a linecache even though I will only access the file once (typically)?
  • Using a binary file instead, in which case it might be easier to skip-ahead, is an option — but I’d much rather avoid it.

Solution

I would probably just use itertools.islice. Using islice over an iterable like a file handle means the whole file is never read into memory, and the first 4003 lines (indices 0 through 4002) are discarded as quickly as possible. You could even collect the two lines you need into a list pretty cheaply (assuming the lines themselves aren’t very long). Then you can exit the with block, closing the file handle.

from itertools import islice
with open('afile') as f:
    lines = list(islice(f, 4003, 4005))
do_something_with(lines)
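
The islice call above can be wrapped in a small helper. This is only a sketch: the name read_line_range and its 1-based, inclusive arguments are my own choices for illustration, not part of the original answer.

```python
from itertools import islice

def read_line_range(path, first, last):
    """Return lines first..last (1-based, inclusive) from the file at path.

    Hypothetical helper for illustration: islice itself is 0-based with an
    exclusive stop, so line number n lives at index n - 1.
    """
    if first < 1 or last < first:
        raise ValueError("need 1 <= first <= last")
    with open(path) as f:
        # Skips the first (first - 1) lines, then stops reading after `last`.
        return list(islice(f, first - 1, last))
```

With this convention, read_line_range('afile', 4004, 4005) would return the same two lines as islice(f, 4003, 4005). If the file is shorter than requested, the result is simply shorter.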

Update

But holy cow is linecache faster for multiple accesses. I created a million-line file to compare islice and linecache, and linecache blew it away. (The transcript below is from Python 2, hence the print statements.)

>>> timeit("x=islice(open('afile'), 4003, 4005); print next(x) + next(x)", 'from itertools import islice', number=1)
4003
4004

0.00028586387634277344
>>> timeit("print getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=1)
4002
4003

2.193450927734375e-05

>>> timeit("getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=10**5)
0.14125394821166992
>>> timeit("''.join(islice(open('afile'), 4003, 4005))", 'from itertools import islice', number=10**5)
14.732316970825195
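
Note that the two calls in the transcript do not fetch the same lines: islice counts from 0, while linecache.getline counts from 1, which is why the printed values differ by one. Here is a small Python 3 sketch of the correspondence, assuming a throwaway temporary file whose lines hold their own 0-based index (the file and its contents are made up for illustration):

```python
import linecache
import os
import tempfile
from itertools import islice

# Build a throwaway file whose k-th line (0-based) contains the text "k".
# The file and its contents are illustrative assumptions.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.writelines("%d\n" % i for i in range(5000))

with open(path) as f:
    via_islice = list(islice(f, 4003, 4005))  # 0-based indices 4003 and 4004

# getline is 1-based, so add 1 to each index to reach the same two lines.
via_linecache = [linecache.getline(path, n) for n in (4004, 4005)]

print(via_islice == via_linecache)  # True

linecache.clearcache()  # drop the cached copy of the temporary file
os.remove(path)
```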

Constantly re-importing and re-reading the file:

This is not a practical test, but even when linecache is re-imported on every iteration, it’s only about a second slower than islice (15.6 s versus 14.7 s for 10**5 runs).

>>> timeit("from linecache import getline; getline('afile', 4003) + getline('afile', 4004)", number=10**5)
15.613967180252075
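
To rerun the comparison on Python 3, something like the following could be used. It is a sketch under assumptions: make_numbered_file and compare are hypothetical helper names, the test file is generated rather than being the original 'afile', and the timings will depend on the machine, so no expected numbers are shown.

```python
import os
import tempfile
from timeit import timeit

def make_numbered_file(n_lines):
    """Write n_lines lines holding 0, 1, 2, ... and return the file path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.writelines("%d\n" % i for i in range(n_lines))
    return path

def compare(path, number=1000):
    """Time both approaches on the same two lines; returns (islice, linecache) seconds."""
    islice_stmt = (
        "with open(path) as f:\n"
        "    ''.join(islice(f, 4003, 4005))"
    )
    # getline is 1-based, so 4004 and 4005 match islice indices 4003 and 4004.
    linecache_stmt = "getline(path, 4004) + getline(path, 4005)"
    t_islice = timeit(islice_stmt, "from itertools import islice",
                      number=number, globals={"path": path})
    t_linecache = timeit(linecache_stmt, "from linecache import getline",
                         number=number, globals={"path": path})
    return t_islice, t_linecache

if __name__ == "__main__":
    path = make_numbered_file(10**6)
    try:
        print("islice: %.4f s, linecache: %.4f s" % compare(path))
    finally:
        os.remove(path)
```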

Conclusion

Yes, linecache is faster than islice in every case except when linecache is re-imported on every call, but who does that? For the likely scenarios (reading only a few lines, once, and reading many lines, once) linecache is faster and presents a terse syntax. The islice syntax is quite clean and fast as well, and unlike linecache it never reads the whole file into memory. In a RAM-tight environment, the islice solution may be the right choice. For very high speed requirements, linecache may be the better choice. Practically, though, in most environments both times are small enough that it almost doesn’t matter.
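
If memory is the worry but the linecache syntax is appealing, the cache can be dropped once the lines have been fetched. This is a hedged sketch: fetch_lines is a hypothetical helper name, and whether clearing the cache is worthwhile depends on how often the file is revisited.

```python
import linecache

def fetch_lines(path, line_numbers, keep_cache=False):
    """Fetch the given 1-based line numbers via linecache.

    linecache loads the whole file on first access; unless keep_cache is
    true, the cache is cleared afterwards so the contents can be freed.
    Note that clearcache() empties the cache for every file, not just path.
    """
    try:
        # getline returns '' for line numbers past the end of the file.
        return [linecache.getline(path, n) for n in line_numbers]
    finally:
        if not keep_cache:
            linecache.clearcache()
```

Calling fetch_lines('afile', [4004, 4005]) would return the same two lines as the islice example, at the cost of briefly holding the whole file in memory.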

Answered By – kojiro

Answer Checked By – Katrina (BugsFixing Volunteer)
