[SOLVED] Speed of socket send/recv on Windows

Issue

On Windows + Python 3.7 + i5 laptop, it takes 200 ms to receive 100 MB of data via a socket, which is obviously very slow compared to RAM speed.

How to improve this socket speed on Windows?

# SERVER
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 1234))
s.listen()
conn, addr = s.accept()
t0 = time.time()
while True:
    data = conn.recv(8192)  # 8192 instead of 1024 improves from 0.5s to 0.2s
    if data == b'':
        break
print(time.time() - t0)  # ~ 0.200s

# CLIENT
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 1234))
a = b"a" * 100_000_000  # 100 MB of data
t0 = time.time()
s.send(a)
print(time.time() - t0)  # ~ 0.020s

Note: The question "How to improve the send/receive speed of this Python socket?" is about a wrapper around socket, so I wanted to test directly with a pure socket, without any wrapper.

Solution

TL;DR: based on the facts gathered mainly on my machine (and confirmed on another machine) on which I can reproduce a similar behaviour, the issue appears to mainly come from an inefficient implementation of the Windows TCP networking stack. More specifically, Windows performs a lot of temporary-buffer copies that cause the RAM to be used intensively. Furthermore, the overall resources are not used efficiently. That being said, the benchmark can also be improved.


Setup

The main target platform used to perform the benchmark has the following attributes:

  • OS: Windows 10 Famille N (version 21H1)
  • Processor: i5-9600KF
  • RAM: 2 x 8 GiB DDR4 (dual channel) @ 3200 MHz, capable of reaching up to 40 GiB/s in practice.
  • CPython 3.8.1

Please keep in mind that results can differ from one platform to another.


Improving the code/benchmark

First of all, the line a = b"a" * 100_000_000 takes a bit of time, and this time is included in the server timing since the client connects before executing it and the server accepts the client during this period. It is better to move this line before the s.connect call.

Additionally, a buffer of 8192 bytes is very small. Reading 100 MB by chunks of 8 KiB means that 12208 C calls must be performed, and probably a similar number of system calls. Since system calls are pretty expensive, taking at least a few microseconds each on most platforms, it is better to increase the buffer size to at least 32 KiB on mainstream processors. The buffer should be small enough to fit in fast CPU caches but big enough to reduce the number of system calls. On my machine, using a 256 KiB buffer results in a 70% speed-up.
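To give an idea of the call overhead, here is a tiny back-of-the-envelope sketch (the chunk sizes are just the ones discussed above) computing how many recv calls are needed per chunk size:

import math
SIZE = 100_000_000  # payload size of the benchmark
for chunk in (8 * 1024, 32 * 1024, 256 * 1024):
    calls = math.ceil(SIZE / chunk)
    print(f"{chunk // 1024:>4} KiB chunks -> at least {calls} recv calls")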

Moreover, you need to close the socket in the client code so that the server code does not hang: otherwise, conn.recv keeps waiting for incoming data. In fact, checking if data == b'' is not a good idea as it is not a safe way to detect that the stream is over: the stream can be interrupted prematurely, and when the client closes the connection the server is not always notified directly (it can sometimes take a very long time, although it is fast on the loopback). You should either send the size of the payload first or wait for a predefined size, as in the sketch below.
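For illustration, here is a minimal length-prefix sketch (the send_msg/recv_msg helper names are mine, not part of the benchmark): the sender prefixes the payload with its size so the receiver knows exactly how many bytes to wait for.

import socket, struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # 8-byte big-endian length header followed by the payload itself.
    sock.sendall(struct.pack("!Q", len(payload)))
    sock.sendall(payload)

def recv_exactly(sock: socket.socket, size: int, bufsize: int = 256 * 1024) -> bytes:
    parts, received = [], 0
    while received < size:
        chunk = sock.recv(min(bufsize, size - received))
        if not chunk:
            raise ConnectionError("stream ended prematurely")
        parts.append(chunk)
        received += len(chunk)
    return b"".join(parts)

def recv_msg(sock: socket.socket) -> bytes:
    (size,) = struct.unpack("!Q", recv_exactly(sock, 8))
    return recv_exactly(sock, size)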

Here is the modified/improved benchmark:

# CLIENT
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a = b"a" * 100_000_000  # 100 MB of data
s.connect(('127.0.0.1', 1234))
t0 = time.time()
s.send(a)
s.close()
print(time.time() - t0)

# SERVER
import socket, time
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 1234))
s.listen()
conn, addr = s.accept()
received = 0  # total number of bytes received so far
t0 = time.time()
while True:
    data = conn.recv(256*1024)
    received += len(data)
    if received == 100_000_000:
        break
print(time.time() - t0)

I repeated the s.send call and the recv-based loop 100 times so as to get stable results. With that I can reach 2.2 GiB/s. TCP sockets tend to be pretty slow on most platforms, but this result is clearly not great (Linux achieves a substantially better throughput).
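For reference, the repeated variant of the client might look like the following sketch (the REPEAT constant and the use of sendall are my own choices, not necessarily the exact script used above); the server-side stop condition then becomes REPEAT * 100_000_000 bytes:

# CLIENT (repeated variant)
import socket, time

REPEAT = 100
a = b"a" * 100_000_000
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 1234))
t0 = time.time()
for _ in range(REPEAT):
    s.sendall(a)  # sendall retries until the whole buffer has been handed to the kernel
s.close()
elapsed = time.time() - t0
print(REPEAT * len(a) / elapsed / 2**30, "GiB/s")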

On a different machine with Windows 10 Professional, a Skylake Xeon processor and RAM reaching 40 GiB/s, I achieved 0.8~1.0 GiB/s, which is very bad.


Analysis

A profiling analysis shows that the client process often saturates the TCP buffer and sleeps for a short time (20~40 ms) waiting for the server to receive the data. Here is an example of the scheduling of the two processes (the top one is the server, the middle one is the client, the bottom one is a kernel thread, and the light-green parts are idle time):

[image: process scheduling]

One can see that the server is not immediately awakened when the client fills the TCP buffer, which is a missed optimization of the Windows scheduler. In fact, the scheduler could wake up the client before the server starves so as to reduce latency issues. Note that a non-negligible part of the time is spent in a kernel process, and its time slices match the client activity.

Overall, 55% of the time is spent in the recv function of ws2_32.dll, 10% in the send function of the same DLL, 25% in synchronization functions, and 10% in other functions, including ones of the CPython interpreter. Thus, the modified benchmark is not slowed down by CPython. Additionally, synchronizations are not the main source of the slowdown.

When the processes are scheduled, the memory throughput goes from 16 GiB/s up to 34 GiB/s, with an average of ~20 GiB/s, which is pretty big (especially considering the time taken by synchronizations). This means Windows performs a lot of big temporary-buffer copies, especially during the recv calls.

Note that the reason why the Xeon-based platform is slower is certainly that its processor only reaches 14 GiB/s for sequential accesses, while the i5-9600KF reaches 24 GiB/s. The Xeon processor also operates at a lower frequency. Such things are common for server-grade processors that mainly focus on scalability.

A deeper analysis of ws2_32.dll shows that nearly all the time of recv is spent in the obscure instruction call qword ptr [rip+0x3440f], which I guess is a kernel call to copy data from a kernel buffer to the user one. The same applies to send. This means that the copies are not done in user-land but in the Windows kernel itself.

If you want to share data between two processes on Windows, I strongly advise you to use shared memory instead of sockets. Some message passing libraries provide an abstraction on top of this (like ZeroMQ for example).
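As an illustration, here is a minimal sketch using Python's multiprocessing.shared_memory module (the block name "demo_block" and the signalling mechanism are assumptions; ZeroMQ or another message-passing library could play the same role):

# Producer side: write the payload into a named shared-memory block.
from multiprocessing import shared_memory

data = b"a" * 100_000_000
shm = shared_memory.SharedMemory(name="demo_block", create=True, size=len(data))
shm.buf[:len(data)] = data  # a single copy into the shared mapping
# ... tell the consumer the block name and size (e.g. via a tiny socket message) ...

# Consumer side (normally another process): attach to the same block by name.
view = shared_memory.SharedMemory(name="demo_block")
payload = bytes(view.buf[:len(data)])  # or work on the memoryview without any extra copy
view.close()

# The creator releases the block once both sides are done with it.
shm.close()
shm.unlink()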


Notes

Here are some notes, as pointed out in the comments:

If increasing the buffer size does not significantly impact the performance, then it certainly means that the code is already memory-bound on the target machine. For example, with a single DDR4 memory channel @ 2400 MHz, common on 3-year-old PCs, the maximum practical throughput will be about 14 GiB/s, and I expect the socket throughput to be clearly less than 1 GiB/s. On much older PCs with a basic single-channel DDR3 setup, the throughput should even be close to 500 MiB/s. The speed should be bounded by something like maxMemThroughput / K where K = (N+1) * P and where:

  • N is the number of copies the operating system performs;
  • P is equal to 2 on processors with a write-through cache policy or on operating systems using non-temporal SIMD instructions, and 3 otherwise.

Low-level profilers show that K ~= 8 on Windows. They also show that send performs an efficient copy that benefits from non-temporal stores and nearly saturates the RAM throughput, while recv does not seem to use non-temporal stores, clearly does not saturate the RAM throughput, and performs a lot more reads than writes (for some unknown reason).
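Here is a small back-of-the-envelope sketch of this bound; the values of N and P below are illustrative assumptions (chosen so that K = 8, as measured above), not profiled numbers:

def socket_throughput_bound(max_mem_throughput, n_copies, p):
    # maxMemThroughput / K with K = (N+1) * P, as described above
    k = (n_copies + 1) * p
    return max_mem_throughput / k

# e.g. 40 GiB/s of practical RAM bandwidth with N = 3 copies and P = 2 gives K = 8,
# i.e. an upper bound of about 5 GiB/s for the socket transfer.
print(socket_throughput_bound(40, 3, 2))  # GiB/s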

On NUMA systems like recent AMD processors (Zen) or multi-socket systems, this should be even worse since the interconnect and the saturation of NUMA nodes can slow down transfers. Windows is known to behave badly in this case.

AFAIK, ZeroMQ has multiple backends (aka "multi-transport"), one of which operates over TCP (the default) while another operates over shared memory.

Answered By – Jérôme Richard
