[SOLVED] More efficient way to copy file line by line in python?


I have a 10GB file with this pattern:

aaa, HO222222222222, AD, CE 
bbb, HO222222222222, AS, AE 
ccc, HO222222222222, AD, CE 
ddd, HO222222222222, BD, CE 
eee, HO222222222222, AD, CE 
fff, HO222222222222, BD, CE 
ggg, HO222222222222, AD, AE 
hhh, HO222222222222, AD, CE 
aaa, HO333333333333, AG, CE 
bbb, HO333333333333, AT, AE 
ccc, HO333333333333, AD, CT 
ddd, HO333333333333, BD, CE 
eee, HO333333333333, AD, CE 
fff, HO333333333333, BD, CE 
ggg, HO333333333333, AU, AE 
hhh, HO333333333333, AD, CE 

Let’s say the second column is an ID. In the whole file there are 4000 people, and each has 50k records.

I can’t use my prepared analysis scripts on a file that big (10GB; the scripts use pandas, and I have too little memory. I know I should refactor them, and I’m working on it), so I need to divide the file into 4 parts. But I can’t split an ID between files, i.e. I can’t have part of one person’s records in a separate file.
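(A side note on the memory problem: pandas can also process a file this size piecewise via the `chunksize` parameter of `read_csv`, aggregating per chunk instead of loading everything at once. A minimal sketch; the column names are made up, and a small in-memory sample stands in for the real file path:)

```python
import io
import pandas as pd

# Small in-memory sample standing in for the real 10GB file path.
sample = (
    "aaa, HO222222222222, AD, CE\n"
    "bbb, HO222222222222, AS, AE\n"
    "aaa, HO333333333333, AG, CE\n"
)
# Hypothetical column names; the real file has no header row.
cols = ["name", "id", "c3", "c4"]

totals = {}
# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is in memory at a time.
for chunk in pd.read_csv(io.StringIO(sample), names=cols,
                         chunksize=2, skipinitialspace=True):
    # aggregate per chunk, then combine; here: count records per ID
    counts = chunk["id"].value_counts()
    for id_, n in counts.items():
        totals[id_] = totals.get(id_, 0) + int(n)
```

With the real file you would pass the path instead of the `StringIO` object and pick a much larger `chunksize`.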

So I wrote a script. It divides the file into 4 parts based on the ID.

Here is the code:

file1 = open('file.txt', 'r')
count = 0
list_of_ids = set()
while True:
    if len(list_of_ids) < 1050:
        a = "out1.csv"
    elif len(list_of_ids) >= 1049 and len(list_of_ids) < 2100:
        a = "out2.csv"
    elif len(list_of_ids) >= 2099 and len(list_of_ids) < 3200:
        a = "out3.csv"
    else:
        a = "out4.csv"
    line = file1.readline()
    if not line:
        break
    try:
        list_of_ids.add(line.split(',')[1])
    except IndexError as e:
        pass
    # the output file is reopened and closed for every single line
    out = open(a, "a")
    out.write(line)
    out.close()
    count += 1

But it’s sooooo slow, and I need to speed it up.
There are many ifs, and I open the output file for every line, but I can’t figure out how to get better performance.
Maybe someone has some tips?


I think you want something more like this:

# this number is arbitrary, of course
ids_per_file = 1000
# use with, so the file always closes when you're done, or something happens
with open('20220317_EuroG_MD_v3_XT_POL_FinalReport.txt', 'r') as f:
    # an easier way to loop over all the lines:
    n = 0
    ids = set()
    out_f = None
    for line in f:
        try:
            ids.add(line.split(',')[1])
        except IndexError:
            # you don't want to break, you just want to ignore the line and continue
            continue
        # when the number of ids reaches the limit (or at the start), start a new file
        if not n or len(ids) > ids_per_file:
            # close the previous one, unless it's the first
            if n > 0:
                out_f.close()
            # on to the next
            n += 1
            out_f = open(f'out{n}.csv', 'w')
            # reset ids
            ids = {line.split(',')[1]}
        # write the line; if you get here, it's a record
        out_f.write(line)
    # close the last file
    out_f.close()

Edit: my first version actually had a bug; it would write the first new identifier to the previous file. I think this version is better.
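As a sanity check on the approach, here is a small self-contained variant of the same idea (names, sample data, and thresholds are illustrative, not from the thread) that splits into in-memory buckets instead of files, which makes it easy to verify that no ID ever lands in two buckets:

```python
def split_by_id(lines, ids_per_file):
    """Split lines into buckets without splitting any ID across buckets."""
    buckets = []  # each bucket is a list of lines
    ids = set()   # ids seen in the current bucket
    for line in lines:
        try:
            id_ = line.split(',')[1]
        except IndexError:
            continue  # skip malformed lines
        # start a new bucket at the beginning, or when the current one is
        # full AND the line belongs to an ID not already in this bucket
        if not buckets or (len(ids) >= ids_per_file and id_ not in ids):
            buckets.append([])
            ids = set()
        ids.add(id_)
        buckets[-1].append(line)
    return buckets

# tiny made-up sample: three IDs, limit of two IDs per bucket
lines = [
    "aaa,HO1,AD,CE",
    "bbb,HO1,AS,AE",
    "aaa,HO2,AG,CE",
    "bbb,HO2,AT,AE",
    "aaa,HO3,AD,CT",
]
buckets = split_by_id(lines, ids_per_file=2)
```

The extra `id_ not in ids` check means a bucket only closes on the first line of a *new* ID, so records for a person already in the bucket are never pushed into the next one.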

Answered By – Grismar

Answer Checked By – David Goodson (BugsFixing Volunteer)
