[SOLVED] Any faster alternative to reading nth line of a file

Issue

I have to read a file starting from a particular line number, and I know that line number in advance, say n.
I have been thinking of two ways:

1.
for i in range(n):
    fname.readline()
k = fname.readline()
print k

2.
i = 0
for line in fname:
    dictionary[i] = line
    i = i + 1

But I want a faster alternative, as I might have to perform this on different files 20,000 times.
Are there any better alternatives?

Also, are there other performance enhancements for simple looping, as my code has nested loops?

Solution

If the files aren’t too huge, the linecache module of the standard library is pretty good — it lets you very directly ask for the Nth line of such-and-such file.
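
For example, a minimal sketch along those lines (the filename is just a placeholder, and n is the 1-based line number you want):

import linecache

# getline returns '' rather than raising if the file or line doesn't exist
line = linecache.getline('somefile.txt', n)
print line.rstrip()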

If the files are huge, I recommend something like (warning, untested code):

import cStringIO

def readlinenum(filepath, n, BUFSIZ=65536):
  # bufs holds the current block and the previous one, in case the
  # target line straddles a block boundary
  bufs = ['', '']
  previous_lines = lines_so_far = 0
  with open(filepath, 'rb') as f:
    while True:
      bufs[0] = f.read(BUFSIZ)
      if not bufs[0]:
        raise ValueError('File %s has only %d lines, not %d'
                         % (filepath, lines_so_far, n))
      lines_this_block = bufs[0].count('\n')
      updated_lines_count = lines_so_far + lines_this_block
      if n < updated_lines_count:
        break
      previous_lines = lines_so_far
      lines_so_far = updated_lines_count
      bufs[1] = bufs[0]
    if n == lines_so_far:
      # line split between blocks
      buf = bufs[1] + bufs[0]
      delta = n - previous_lines
    else:  # normal case
      buf = bufs[0]
      delta = n - lines_so_far
    f = cStringIO.StringIO(buf)
    for i, line in enumerate(f):
      if i == delta:
        break
    return line.rstrip()
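
Usage would then look something like this (file name and line number are purely illustrative; note that n is 0-based in this sketch, so 0 means the first line):

print readlinenum('huge_log.txt', 123456)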

The general idea is to read in the file as binary, in large blocks (at least as large as the longest possible line) — the processing (on Windows) from binary to “text” is costly on huge files — and use the fast .count method of strings on most blocks. At the end we can do the line parsing on a single block (two at most in the anomalous case where the line being sought spans block boundaries).

This kind of code requires careful testing and checking (which I haven’t performed in this case), being prone to off-by-one and other boundary errors, so I’d recommend it only for truly huge files: ones that would essentially bust memory if using linecache (which just sucks up the whole file into memory instead of working by blocks). On a typical modern machine with 4 GB of RAM, for example, I’d start thinking about such techniques for text files that are over a GB or two.
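
One caveat if you do take the linecache route over 20,000 different files: linecache keeps each file it has read cached in memory, so you may want to drop the cache between files, roughly like this (path and n stand for whatever the calling code supplies):

import linecache

line = linecache.getline(path, n)
linecache.clearcache()  # free the cached file contents before moving on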

Edit: a commenter does not believe that reading a file in binary mode is much faster than the processing required by text mode (on Windows only). To show how wrong this is, let’s use the 'U' (“universal newlines”) option, which forces the line-end processing to happen on Unix machines too (as I don’t have a Windows machine to run this on;-). Using the usual kjv.txt file:

$ wc kjv.txt
  114150  821108 4834378 kjv.txt

(4.8 MB, 114 Klines) — about 1/1000th of the kind of file sizes I was mentioning earlier:

$ python -mtimeit 'f=open("kjv.txt", "rb")' 'f.seek(0); f.read()'
100 loops, best of 3: 13.9 msec per loop
$ python -mtimeit 'f=open("kjv.txt", "rU")' 'f.seek(0); f.read()'
10 loops, best of 3: 39.2 msec per loop

i.e., close to a factor-of-3 cost for the line-end processing (this is on an old-ish laptop, but the ratio should be pretty repeatable elsewhere, too).

Reading by a loop on lines, of course, is even slower:

$ python -mtimeit 'f=open("kjv.txt", "rU")' 'f.seek(0)' 'for x in f: pass'
10 loops, best of 3: 54.6 msec per loop

and using readline, as the commenter suggested (with less efficient buffering than looping directly on the file), is the slowest:

$ python -mtimeit 'f=open("kjv.txt", "rU")' 'f.seek(0); x=1' 'while x: x=f.readline()'
10 loops, best of 3: 81.1 msec per loop

If, as the question mentions, there are 20,000 files to read (say they’re all small-ish, on the order of this kjv.txt), the fastest approach (reading each file in binary mode in a single gulp) should take about 280 seconds (13.9 ms × 20,000), i.e. 4-5 minutes, while the slowest one (based on readline) should take about 1,620 seconds (81.1 ms × 20,000), almost half an hour: a pretty significant difference for many, I’d say most, actual applications.

Answered By – Alex Martelli

