Issue
I use multiprocessing.sharedctypes.RawArray
to share large numpy arrays between multiple processes. I've noticed that when the array is large (> 1 or 2 Gb) it becomes very slow to initialize and also much slower to read and write to (and read/write time is not predictable: sometimes fairly fast, sometimes very, very slow).
I've made a small sample script that uses just one process, initializes a shared array, writes to it several times, and measures the time taken by these operations.
import argparse
import ctypes
import multiprocessing as mp
import multiprocessing.sharedctypes as mpsc
import numpy as np
import time


def main():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('-c', '--block-count', type=int, default=1,
                        help='Number of blocks to write')
    parser.add_argument('-w', '--block-width', type=int, default=20000,
                        help='Block width')
    parser.add_argument('-d', '--block-depth', type=int, default=15000,
                        help='Block depth')
    args = parser.parse_args()

    blocks = args.block_count
    blockwidth = args.block_width
    depth = args.block_depth

    # Allocate the shared buffer and time the allocation
    start = time.perf_counter()
    shared_array = mpsc.RawArray(ctypes.c_uint16, blocks*blockwidth*depth)
    finish = time.perf_counter()
    print('Init shared array of size {:.2f} Gb: {:.2f} s'.format(blocks*blockwidth*depth*ctypes.sizeof(ctypes.c_uint16)/1024/1024/1024, (finish-start)))

    # Wrap the shared buffer in a numpy view and time block-sized writes to it
    numpy_array = np.ctypeslib.as_array(shared_array).reshape(blocks*blockwidth, depth)
    start = time.perf_counter()
    for i in range(blocks):
        begin = time.perf_counter()
        numpy_array[i*blockwidth:(i+1)*blockwidth, :] = np.ones((blockwidth, depth), dtype=np.uint16)
        end = time.perf_counter()
        print('Write = %.2f s' % (end-begin))
    finish = time.perf_counter()
    print('Total time = %.2f s' % (finish-start))


if __name__ == '__main__':
    main()
When I run this code I get the following on my PC:
$ python shared-minimal.py -c 1
Init shared array of size 0.56 Gb: 0.36 s
Write = 0.13 s
Total time = 0.13 s
$ python shared-minimal.py -c 2
Init shared array of size 1.12 Gb: 0.72 s
Write = 0.12 s
Write = 0.13 s
Total time = 0.25 s
$ python shared-minimal.py -c 4
Init shared array of size 2.24 Gb: 5.40 s
Write = 1.17 s
Write = 1.17 s
Write = 1.17 s
Write = 1.57 s
Total time = 5.08 s
In the last case, when the array size is more than 2 Gb, initialization time is no longer linearly dependent on array size, and assigning same-size slices to the array is more than 5 times slower.
I wonder why that happens. I'm running the script on Ubuntu 16.04 using Python 3.5. Using iotop I also noticed that while initializing and writing to the array there is disk write activity of the same size as the shared array, but I'm not sure whether a real file is created or it's an in-memory-only operation (I suppose it should be). In general, my system becomes less responsive as well when the shared array is large. There is no swapping, as checked with top, ipcs -mu, and vmstat.
Solution
After more research I've found that Python actually creates folders in /tmp
starting with pymp-
, and though no files are visible within them using file viewers, it looks exactly like /tmp
is used by Python for shared memory. Performance seems to degrade when file caches are flushed.
The working solution in the end was to mount /tmp as tmpfs
:
sudo mount -t tmpfs tmpfs /tmp
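Note that a plain mount like this lasts only until the next reboot. To make it permanent, one option is an /etc/fstab entry along these lines (a sketch only; the size limit is an example value that should be tuned to the machine):

# Example /etc/fstab entry (illustrative): back /tmp with RAM, capped at 8 GiB
tmpfs   /tmp   tmpfs   defaults,size=8G   0   0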
If you are using a recent version of Docker, the equivalent is to pass the --tmpfs /tmp
argument to the docker run
command.
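For example, an invocation might look like this (the image name and size limit are placeholders):

# Illustrative only: mount an 8 GiB tmpfs at /tmp inside the container
docker run --tmpfs /tmp:rw,size=8g my-image python shared-minimal.py -c 4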
After doing this, read/write operations are done in RAM, and performance is fast and stable.
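To double-check that /tmp is really backed by tmpfs (and not the root filesystem), something like the following should report a tmpfs filesystem type:

df -h /tmp
mount | grep ' /tmp '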
I still wonder why /tmp
is used for shared memory rather than /dev/shm
, which is already mounted as tmpfs
and is supposed to be used for shared memory.
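A possible alternative (an untested sketch): as far as I can tell, multiprocessing picks its pymp-* directory via tempfile.gettempdir(), which honors the TMPDIR environment variable, so pointing TMPDIR at /dev/shm should move the backing files there without remounting /tmp:

# Untested: redirect Python's temporary directory (and thus the pymp-* folders)
# to the existing tmpfs at /dev/shm
TMPDIR=/dev/shm python shared-minimal.py -c 4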
Answered By – nasedil-genio
Answer Checked By – Jay B. (BugsFixing Admin)