On Windows + Python 3.7 + an i5 laptop, it takes 200 ms to receive 100 MB of data via a
socket, which is obviously very slow compared to the RAM speed.
How to improve this socket speed on Windows?
```python
# SERVER
import socket, time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 1234))
s.listen()
conn, addr = s.accept()
t0 = time.time()
while True:
    data = conn.recv(8192)  # 8192 instead of 1024 improves from 0.5s to 0.2s
    if data == b'':
        break
print(time.time() - t0)  # ~ 0.200s
```

```python
# CLIENT
import socket, time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 1234))
a = b"a" * 100_000_000  # 100 MB of data
t0 = time.time()
s.send(a)
print(time.time() - t0)  # ~ 0.020s
```
Note: the question How to improve the send/receive speed of this Python socket? is about a wrapper around socket, so I wanted to test a pure socket directly, with no wrapper.
TL;DR: based on facts gathered mainly on my machine (and confirmed on another machine on which I can reproduce a similar behaviour), the issue appears to come mainly from an inefficient implementation of the Windows TCP networking stack. More specifically, Windows performs a lot of temporary-buffer copies that cause the RAM to be used intensively. Furthermore, the overall resources are not used efficiently. That being said, the benchmark itself can also be improved.
The main target platform used to perform the benchmark has the following attributes:
- OS: Windows 10 Famille N (version 21H1)
- Processor: i5-9600KF
- RAM: 2 x 8 GiB DDR4 channels @ 3200 MT/s, capable of reaching up to 40 GiB/s in practice.
- CPython 3.8.1
Please keep in mind that results can differ from one platform to another.
Improving the code/benchmark
First of all, the line
a = b"a" * 100_000_000 takes a bit of time, and this time is included in the server's measurement since the client connects before executing it and the server accepts the client during this time. It is better to move this line before the connect call.
Additionally, a buffer of 8192 bytes is very small. Reading 100 MB in chunks of 8 KiB means that about 12,208 C calls must be performed, and probably a similar number of system calls. Since system calls are fairly expensive, typically taking at least a few microseconds on most platforms, it is better to increase the buffer size to at least 32 KiB on mainstream processors. The buffer should be small enough to fit in the fast CPU caches, but big enough to reduce the number of system calls. On my machine, using a 256 KiB buffer results in a 70% speed-up.
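The effect of the buffer size can be measured with a small sketch like the one below. This is a hypothetical micro-benchmark, not the answer's own code: the `run` helper, the port choice (0, so the OS picks a free one), and the list of tested sizes are all arbitrary assumptions.

```python
import socket, threading, time

def run(bufsize, total=100_000_000):
    # Time how long the server side takes to receive `total` bytes over
    # the loopback with a given recv() buffer size.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(('127.0.0.1', 0))          # port 0: let the OS pick a free port
    srv.listen()
    port = srv.getsockname()[1]

    payload = b"a" * total              # built before connecting (see above)

    def client():
        c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        c.connect(('127.0.0.1', port))
        c.sendall(payload)              # sendall() retries partial sends
        c.close()

    t = threading.Thread(target=client)
    t.start()
    conn, _ = srv.accept()
    received = 0
    t0 = time.perf_counter()
    while received < total:
        data = conn.recv(bufsize)
        if not data:                    # peer closed the connection early
            break
        received += len(data)
    elapsed = time.perf_counter() - t0
    t.join(); conn.close(); srv.close()
    return received, elapsed

if __name__ == "__main__":
    for bufsize in (8 * 1024, 32 * 1024, 256 * 1024):
        n, dt = run(bufsize)
        print(f"{bufsize // 1024:4d} KiB buffer: {n / dt / 2**30:.2f} GiB/s")
```

The absolute numbers depend heavily on the OS and hardware, but the relative gain when moving from 8 KiB to 256 KiB should be visible on most machines.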
Moreover, you need to close the socket in the client code so that the server code does not hang: otherwise
conn.recv waits forever for incoming data. In fact, checking whether
data == b'' is not a safe way to check whether the stream is over: for example, the stream can be interrupted prematurely. You should instead send the size of the payload first, or wait for a predefined size. Alternatively, the client can close the connection, but the server will not always be notified immediately (it can sometimes take a very long time, although it is fast on the loopback).
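The size-prefix idea mentioned above can be sketched as follows. The helper names (`send_msg`, `recv_exact`, `recv_msg`) are hypothetical; the scheme simply prefixes each message with its 8-byte length so the receiver knows exactly how much to read.

```python
import socket, struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # Prefix the payload with its length as an 8-byte big-endian integer.
    sock.sendall(struct.pack("!Q", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    # Keep calling recv() until exactly n bytes have arrived.
    chunks = []
    while n:
        data = sock.recv(min(n, 256 * 1024))
        if not data:
            raise ConnectionError("stream ended prematurely")
        chunks.append(data)
        n -= len(data)
    return b"".join(chunks)

def recv_msg(sock: socket.socket) -> bytes:
    (size,) = struct.unpack("!Q", recv_exact(sock, 8))
    return recv_exact(sock, size)
```

With this framing, a truncated stream raises an error instead of being silently mistaken for a normal end-of-stream.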
Here is the modified/improved benchmark:
```python
# CLIENT
import socket, time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a = b"a" * 100_000_000  # 100 MB of data, built before connecting
s.connect(('127.0.0.1', 1234))
t0 = time.time()
s.sendall(a)  # sendall() guarantees the full payload is sent
s.close()
print(time.time() - t0)
```

```python
# SERVER
import socket, time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 1234))
s.listen()
conn, addr = s.accept()
total = 0  # byte counter (named `total` so it does not shadow the socket `s`)
t0 = time.time()
while total < 100_000_000:
    data = conn.recv(256 * 1024)
    if not data:  # connection closed before all the data arrived
        break
    total += len(data)
print(time.time() - t0)
```
I repeated the
s.send call and the
recv-based loop 100 times to get stable results. With that I can reach 2.2 GiB/s. TCP sockets tend to be pretty slow on most platforms, but this result is clearly not great (Linux achieves a substantially better throughput).
On a different machine with Windows 10 Professional, a Skylake Xeon processor and RAM reaching 40 GiB/s, I achieved 0.8~1.0 GiB/s, which is very bad.
A profiling analysis shows that the client process often saturates the TCP buffer and sleeps for a short time (20~40 ms) waiting for the server to receive data. Here is an example of the scheduling of the two processes (the top one is the server, the middle one is the client, the bottom one is a kernel thread, and the light-green parts are idle time):
One can see that the server is not awoken immediately when the client fills the TCP buffer, which is a missed optimization of the Windows scheduler. In fact, the scheduler could wake up the client before the server starves, so as to reduce latency issues. Note that a non-negligible part of the time is spent in a kernel process, and that its time slices match the client activity.
Overall, 55% of the time is spent in the
recv function of ws2_32.dll, 10% in the
send function of the same DLL, 25% in synchronization functions, and 10% in other functions, including those of the CPython interpreter. Thus, the modified benchmark is not slowed down by CPython. Additionally, synchronizations are not the main source of the slowdown.
When the processes are scheduled, the memory throughput goes from 16 GiB/s up to 34 GiB/s, with an average of ~20 GiB/s, which is pretty big (especially considering the time taken by synchronizations). This means Windows performs a lot of big temporary-buffer copies, especially during the recv calls.
Note that the reason the Xeon-based platform is slower is certainly that its processor only reaches 14 GiB/s in sequential accesses, while the i5-9600KF reaches 24 GiB/s. The Xeon processor also operates at a lower frequency. Such things are common for server-grade processors, which mainly focus on scalability.
A deeper analysis of ws2_32.dll shows that nearly all of the time of
recv is spent in the obscure instruction
call qword ptr [rip+0x3440f], which I guess is a kernel call to copy data from a kernel buffer to the user one. The same applies to
send. This means that the copies are not done in user-land but in the Windows kernel itself…
If you want to share data between two processes on Windows, I strongly advise you to use shared memory instead of sockets. Some message passing libraries provide an abstraction on top of this (like ZeroMQ for example).
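A minimal sketch of the shared-memory approach using the standard multiprocessing.shared_memory module is shown below. For brevity both sides run in one process here; in a real setup the segment name would be passed to the reader process (e.g. over a small control socket), and the sizes are arbitrary assumptions.

```python
from multiprocessing import shared_memory

payload = b"a" * 100_000_000  # 100 MB of data

# Writer side: create the segment and copy the payload into it once.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Reader side (normally another process): attach by name and read in place.
view = shared_memory.SharedMemory(name=shm.name)
data = bytes(view.buf[:len(payload)])  # single copy out of shared memory

view.close()
shm.close()
shm.unlink()  # free the segment once both sides are done
```

The key difference from sockets is that the payload is written once and read in place, with no intermediate kernel-buffer copies.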
Here are some notes, as pointed out in the comments:
If increasing the buffer size does not significantly impact the performance, then it certainly means that the code is already memory-bound on the target machine. For example, with a single DDR4 memory channel @ 2400 MT/s, common on 3-year-old PCs, the maximum practical throughput is about 14 GiB/s, and I expect the socket throughput to be clearly less than 1 GiB/s. On much older PCs with a basic single-channel DDR3, the throughput should even be close to 500 MiB/s. The speed should be bounded by something like
maxMemThroughput / K where
K = (N+1) * P and where:
- N is the number of copies the operating system performs;
- P is equal to 2 on processors with a write-through cache policy or operating systems using non-temporal SIMD instructions, and 3 otherwise.
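As a worked example of this bound, with assumed values (N = 3 OS-side copies, P = 3 for a write-allocate cache with no non-temporal stores) on the single-channel DDR4 machine described above:

```python
# All numbers below are assumptions for illustration, not measurements.
max_mem_throughput = 14.0  # GiB/s, practical single-channel DDR4-2400
N = 3                      # assumed number of copies done by the OS
P = 3                      # write-allocate cache, no non-temporal stores
K = (N + 1) * P            # = 12
bound = max_mem_throughput / K
print(f"K = {K}, socket throughput bound ~ {bound:.2f} GiB/s")
```

With these assumptions the bound comes out slightly above 1 GiB/s, which is consistent with the "clearly less than 1 GiB/s" expectation once other overheads are accounted for.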
Low-level profilers show that
K ~= 8 on Windows. They also show that
send performs an efficient copy that benefits from non-temporal stores and nearly saturates the RAM throughput, while
recv does not seem to use non-temporal stores, clearly does not saturate the RAM throughput, and performs a lot more reads than writes (for some unknown reason).
On NUMA systems, like recent AMD processors (Zen) or multi-socket systems, this should be even worse, since the interconnect and the saturation of NUMA nodes can slow down transfers. Windows is known to behave badly in this case.
AFAIK, ZeroMQ has multiple backends (aka "transports"), one of which operates over TCP (the default) while another operates over shared memory.
Answered By – Jérôme Richard