Subject: Samba with multichannel and io_uring
From: Stefan Metzmacher @ 2020-10-15  9:58 UTC (permalink / raw)
  To: Samba Technical, io-uring



Hi,

Related to my talk at the virtual Storage Developer Conference,
"multichannel / iouring Status Update within Samba"
(https://www.samba.org/~metze/presentations/2020/SDC/),
I have some additional updates.

DDN was kind enough to sponsor about a week of research on real-world
hardware with 100 GBit/s interfaces and two NUMA nodes per server.

I was able to improve the performance drastically.

I concentrated on SMB2 read performance, but similar improvements would be expected for writes too.

We used "server multi channel support = yes" and the network interface is RSS capable,
it means that a Windows client uses 4 connections by default.
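
For reference, here is a minimal smb.conf sketch of the server-side setting
used in these tests (the share definition is only an illustration, not the
actual test configuration):

  [global]
    server multi channel support = yes

  [shm]
    path = /dev/shm-numanode1
    read only = no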

I first tested a share using /dev/shm and the results were really slow:
it was not possible to get more than ~30 GBits/s on the wire and ~3.8 GBytes/s
from fio.exe.

smbd uses pread() from within a pthread-based threadpool for file I/O
and sendmsg() to deliver the response to the socket. All multichannel
connections are served by the same smbd process (based on the client GUID).
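
As a rough illustration (not the actual Samba code; fds, sizes and error
handling are placeholders), the baseline pattern looks like this in C:

  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>
  #include <unistd.h>

  struct read_job {
      int fd;         /* open file handle */
      off_t offset;   /* offset of the SMB2 READ */
      size_t length;  /* requested length */
      char *buf;      /* response buffer (header + payload) */
  };

  /* runs on one of the threadpool helper threads */
  static void *read_worker(void *arg)
  {
      struct read_job *job = arg;

      /* first copy: tmpfs/page cache -> user space buffer */
      ssize_t n = pread(job->fd, job->buf, job->length, job->offset);
      (void)n; /* error/short-read handling omitted in this sketch */
      return NULL;
  }

  /* runs in the main smbd once the worker has finished */
  static void send_response(int sock, char *buf, size_t len)
  {
      struct iovec iov = { .iov_base = buf, .iov_len = len };
      struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

      /* second copy: user space buffer -> socket buffers; this is the
       * synchronous call that can block smbd for up to 0.5 msec */
      sendmsg(sock, &msg, 0);
  }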

The main smbd is cpu bound and the helper threads also use quite a lot of cpu,
about ~600% in total!

https://www.samba.org/~metze/presentations/2020/SDC/future/read-32GBit-4M-2T-shm-sendmsg-top-02.png

It turned out that cross-NUMA memory access caused a lot of the slowdown.
The network adapter was connected to NUMA node 1, so we pinned
the ramdisk and smbd to that node:

  mount -t tmpfs -o size=60g,mpol=bind:1 tmpfs /dev/shm-numanode1
  numactl --cpunodebind=netdev:ens3f0 --membind=netdev:ens3f0 smbd

With that it was possible to reach ~5 GBytes/s from fio.exe.

But the main problem remains: the kernel is busy copying data,
and sendmsg() takes up to 0.5 msec, which means that we don't process new requests
during these 0.5 msec.

I created a prototype that uses IORING_OP_SENDMSG with IOSQE_ASYNC (I used a 5.8.12 kernel)
instead of the synchronous sendmsg() calls. That means one kernel thread
(io_wqe_work, ~50% cpu) per connection does the memory copy to the socket
and the main smbd only uses ~11% cpu, but we still use >400% cpu in total.

https://www.samba.org/~metze/presentations/2020/SDC/future/read-57GBit-4M-2T-shm-io-uring-sendmsg-async-top-02.png
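
For illustration only (not the actual prototype; the ring, socket and msghdr
setup are assumed to exist elsewhere), queueing the response roughly looks
like this with liburing:

  #include <liburing.h>
  #include <sys/socket.h>

  /* queue one SMB2 response on the ring; IOSQE_ASYNC forces the copy into
   * an io_wqe_work kernel thread instead of doing it inline in smbd */
  static int queue_sendmsg_async(struct io_uring *ring, int sock,
                                 struct msghdr *msg)
  {
      struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

      if (sqe == NULL)
          return -1;

      io_uring_prep_sendmsg(sqe, sock, msg, 0);
      io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);

      return io_uring_submit(ring);
  }

The completion would be reaped later with io_uring_wait_cqe()/io_uring_cqe_seen()
on a ring set up with io_uring_queue_init().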

But the NUMA binding for the io_wqe_work threads doesn't seem to work as expected,
so the results vary between 5.0 GBytes/s and 7.6 GBytes/s, depending on which NUMA node
the io_wqe_work kernel threads are running on. Also note that the threadpool with pread()
was still faster than using IORING_OP_READV towards the filesystem; the reason might also
be NUMA dependent.

https://www.samba.org/~metze/presentations/2020/SDC/future/read-57GBit-4M-2T-shm-io-uring-sendmsg-async-numatop-02.png
https://www.samba.org/~metze/presentations/2020/SDC/future/read-57GBit-4M-2T-shm-io-uring-sendmsg-async-perf-top-02.png

The main problem is still copy_user_enhanced_fast_string, so I tried to use
IORING_OP_SPLICE (from the filesystem via a pipe to the socket) in order to avoid
copying memory around.
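
As a sketch of the technique (placeholder fds; not the actual patches):

  #include <liburing.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* read path: file -> pipe -> socket, the payload never enters user space */
  static int queue_spliced_read(struct io_uring *ring, int file_fd,
                                off_t file_off, int sock, unsigned int len)
  {
      int pfd[2];
      struct io_uring_sqe *sqe;

      if (pipe(pfd) != 0)
          return -1;

      /* first hop: tmpfs/page cache pages -> pipe */
      sqe = io_uring_get_sqe(ring);
      io_uring_prep_splice(sqe, file_fd, file_off, pfd[1], -1, len, 0);
      sqe->flags |= IOSQE_IO_LINK | IOSQE_ASYNC;

      /* second hop: pipe -> socket, linked so it runs after the first */
      sqe = io_uring_get_sqe(ring);
      io_uring_prep_splice(sqe, pfd[0], -1, sock, -1, len, 0);
      sqe->flags |= IOSQE_ASYNC;

      return io_uring_submit(ring);
  }

In a real server the pipe pair (and its capacity, via fcntl(F_SETPIPE_SZ))
would be managed per connection rather than created per request, and both
completions need to be checked for errors and short transfers.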

With that I was able to reduce the cpu usage of the main smbd to ~6% cpu with
io_wqe_work threads using between ~3-6% cpu (filesystem to pipe) and
6-30% cpu (pipe to socket).

But the Windows client wasn't able to reach better numbers than 7.6 GBytes/s (65 GBits/s).
Only using "Set-SmbClientConfiguration -ConnectionCountPerRssNetworkInterface 16" helped to
get up to 8.9 GBytes/s (76 GBits/s).

With 8 MByte IOs smbd is quite idle at ~5% cpu, with the io_wqe_work threads at ~100% cpu in total.
https://www.samba.org/~metze/presentations/2020/SDC/future/read-75GBit-8M-20T-RSS16-shm-io-uring-splice-top-02.png

With 512 KByte IOs smbd uses ~56% cpu, with the io_wqe_work threads at ~130% cpu in total.
https://www.samba.org/~metze/presentations/2020/SDC/future/read-76GBit-512k-10T-RSS16-shm-io-uring-splice-02.png

With 256 KByte IOs smbd uses ~87% cpu, with the io_wqe_work threads at ~180% cpu in total.
https://www.samba.org/~metze/presentations/2020/SDC/future/read-76GBit-256k-10T-RSS16-shm-io-uring-splice-02.png

In order to get higher numbers I also tested with smbclient.

- With the default configuration (sendmsg and threadpool pread) I was able to get
  4.2 GBytes/s over a single connection, while smbd with all threads uses ~150% cpu.
  https://www.samba.org/~metze/presentations/2020/SDC/future/read-4.2G-smbclient-shm-sendmsg-pthread.png

- With IORING_OP_SPLICE I was able to get 5 GBytes/s over a single connection,
  while smbd uses ~ 6% cpu, with 2 io_wqe_work threads (filesystem to pipe) at 5.3% cpu each +
  1 io_wqe_work thread (pipe to socket) at ~29% cpu. This is only ~55% cpu in total on the server
  and the client is the bottleneck here.
  https://www.samba.org/~metze/presentations/2020/SDC/future/read-5G-smbclient-shm-io-uring-sendmsg-splice-async.png

- With a modified smbclient using a forced client guid I used 4 connections into
  a single smbd on the server. With that I was able to reach ~ 11 GBytes/s (92 GBits/s)
  (This is similar to what 4 iperf instances are able to reach).
  The main smbd uses 8.6 % cpu with 4 io_wqe_work threads (pipe to socket) at ~20% cpu each.
  https://www.samba.org/~metze/presentations/2020/SDC/future/read-11G-smbclient-same-client-guid-shm-io-uring-splice-async.png

- With 8 smbclient instances over loopback we are able to reach ~ 22 GBytes/s (180 GBits/s)
  and smbd uses 22 % cpu.
  https://www.samba.org/~metze/presentations/2020/SDC/future/read-22G-smbclient-8-same-client-guid-localhost-shm-io-uring-splice.png

So IORING_OP_SPLICE will bring us into very good shape for streaming reads.
Also note that NUMA pinning is not really needed here, as the payload memory is not really touched by the CPU at all.

It's very likely that IORING_OP_RECVMSG in combination with IORING_OP_SPLICE would also improve the write path.
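
That path is not in the branches below, but as a rough sketch of the idea
(placeholder fds and buffers): receive the 4-byte length prefix plus the SMB2
header with IORING_OP_RECVMSG, then splice the bulk payload
socket -> pipe -> file just like in the read sketch above:

  #include <liburing.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* queue a receive for the transport + SMB2 header only; the WRITE payload
   * would afterwards be spliced socket -> pipe -> file without a user copy.
   * msg and iov must stay alive until the request completes. */
  static int queue_recv_header(struct io_uring *ring, int sock,
                               void *hdr_buf, size_t hdr_len,
                               struct msghdr *msg, struct iovec *iov)
  {
      struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

      if (sqe == NULL)
          return -1;

      iov->iov_base = hdr_buf;
      iov->iov_len = hdr_len;
      memset(msg, 0, sizeof(*msg));
      msg->msg_iov = iov;
      msg->msg_iovlen = 1;

      io_uring_prep_recvmsg(sqe, sock, msg, 0);
      io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);

      return io_uring_submit(ring);
  }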

Using an AF_KCM socket (Kernel Connection Multiplexor) as a wrapper around the
(TCP) stream socket might avoid wakeups for incoming packets and
should allow better buffer management for incoming packets within smbd.
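
For reference, attaching an existing connected TCP socket to a KCM
multiplexor looks roughly like this (the BPF socket-filter program that
returns the length of each PDU from the 4-byte length prefix has to be
loaded separately; bpf_prog_fd is a placeholder for its fd):

  #include <linux/kcm.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <unistd.h>

  /* wrap a connected TCP socket so smbd receives whole SMB2 PDUs
   * instead of an unframed byte stream */
  static int kcm_wrap(int tcp_fd, int bpf_prog_fd)
  {
      struct kcm_attach attach = {
          .fd = tcp_fd,          /* the connected TCP socket */
          .bpf_fd = bpf_prog_fd, /* socket filter returning the PDU length */
      };
      int kcm_fd = socket(AF_KCM, SOCK_SEQPACKET, KCMPROTO_CONNECTED);

      if (kcm_fd < 0)
          return -1;
      if (ioctl(kcm_fd, SIOCKCMATTACH, &attach) != 0) {
          close(kcm_fd);
          return -1;
      }
      return kcm_fd;
  }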

The prototype/work-in-progress patches are available here:
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/v4-13-multichannel
and
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master-multichannel

Also note the missing generic multichannel pieces tracked by this meta bug:
https://bugzilla.samba.org/show_bug.cgi?id=14534

I'm not sure when all this will be production ready, but it's great to know
the potential we have on a modern Linux kernel!

Later, SMB-Direct should be able to reduce the cpu load of the io_wqe_work threads (pipe to socket)...

metze






Thread overview: 11+ messages
2020-10-15  9:58 Samba with multichannel and io_uring Stefan Metzmacher
2020-10-15 10:06 ` Ralph Boehme
2020-10-15 15:45 ` Jeremy Allison
2020-10-15 16:11 ` Jens Axboe
2020-10-16 11:49   ` Stefan Metzmacher
2020-10-16 12:28     ` Stefan Metzmacher
2020-10-16 12:40       ` Stefan Metzmacher
2020-10-16 18:56         ` Jens Axboe
2020-10-16 15:57     ` Jens Axboe
2020-10-16 16:03       ` Stefan Metzmacher
2020-10-16 16:06         ` Jens Axboe
