[SOLVED] Why is reading multiple files at the same time slower than reading sequentially?

Issue

I am trying to parse many files found in a directory, however using multiprocessing slows my program.

# Calling my parsing function from Client.
L = getParsedFiles('/home/tony/Lab/slicedFiles') <--- 1000 .txt files found here.
                                                       combined ~100MB

Following this example from python documentation:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))

I’ve written this piece of code:

from multiprocessing import Pool
from api.ttypes import *

import gc
import os

def _parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y = [int(v) for v in s]
            obj = CoresetPoint(x, y)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return Points(myList)

def getParsedFiles(pathToFile):
    myList = []
    p = Pool(2)
    for filename in os.listdir(pathToFile):
        if filename.endswith(".txt"):
            myList.append(filename)
    return p.map(_pars, , myList)

I followed the example, put all the names of the files that end with a .txt in a list, then created Pools, and mapped them to my function. Then I want to return a list of objects. Each object holds the parsed data of a file. However it amazes me that I got the following results:

#Pool 32  ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2 ---> ~130(s)

Graph:
enter image description here

Machine specification:

62.8 GiB RAM
Intel® Core™ i7-6850K CPU @ 3.60GHz × 12   

What am I missing here ?
Thanks in advance!

Solution

Looks like you’re I/O bound:

In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.

You probably need to have your main thread do the reading and add the data to the pool when a subprocess becomes available. This will be different to using map.

As you are processing a line at a time, and the inputs are split, you can use fileinput to iterate over lines of multiple files, and map to a function processing lines instead of files:

Passing one line at a time might be too slow, so we can ask map to pass chunks, and can adjust until we find a sweet-spot. Our function parses chunks of lines:

def _parse_coreset_points(lines):
    return Points([_parse_coreset_point(line) for line in lines])

def _parse_coreset_point(line):
    s = line.split()
    x, y = [int(v) for v in s]
    return CoresetPoint(x, y)

And our main function:

import fileinput

def getParsedFiles(directory):
    pool = Pool(2)

    txts = [filename for filename in os.listdir(directory):
            if filename.endswith(".txt")]

    return pool.imap(_parse_coreset_points, fileinput.input(txts), chunksize=100)

Answered By – Peter Wood

Answer Checked By – Mildred Charles (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *