netdev.vger.kernel.org archive mirror
* epoll_wait() performance
@ 2019-11-22 11:17 David Laight
  2019-11-27  9:50 ` Marek Majkowski
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2019-11-22 11:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: network dev

I'm trying to optimise some code that reads UDP messages (RTP and RTCP) from a lot of sockets.
The 'normal' data pattern is that there is no data on half the sockets (RTCP) and
one message every 20ms on the others (RTP).
However there can be more than one message on each socket, and they all need to be read.
Since the code processing the data runs every 10ms, the message receiving code
also runs every 10ms (a massive gain when using poll()).

While using recvmmsg() to read multiple messages might seem a good idea, it is much
slower than recv() when there is only one message (even recvmsg() is a lot slower).
(I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
and faffing with the user iov[].)

So using poll() we repoll the fd after calling recv() to find if there is a second message.
However the second poll has a significant performance cost (but less than using recvmmsg()).
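
The poll()/recv()/repoll pattern can be sketched roughly as follows. This is a simplified illustration: it repolls the whole set rather than only the fds that returned data, and poll_and_read is a hypothetical name.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Read everything currently queued on a set of UDP sockets.
 * Each pfds[i].events is assumed to be POLLIN.
 * Returns the number of datagrams read, or -1 if poll() fails.
 */
static int poll_and_read(struct pollfd *pfds, nfds_t nfds)
{
    char buf[2048];
    int total = 0;
    int ready;

    /* Timeout 0: never sleep, just report what is readable right now. */
    while ((ready = poll(pfds, nfds, 0)) > 0) {
        int got = 0;

        for (nfds_t i = 0; i < nfds; i++) {
            /* One recv() per ready socket; the next poll() finds any extras. */
            if ((pfds[i].revents & POLLIN) &&
                recv(pfds[i].fd, buf, sizeof(buf), MSG_DONTWAIT) >= 0)
                got++;
        }
        if (got == 0)
            break;              /* only error conditions left; do not spin */
        total += got;
    }
    return ready < 0 ? -1 : total;
}
```

In the common case of one message per socket this costs two poll() calls per 10ms tick, the second of which finds nothing.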

If we use epoll() in level triggered mode a second epoll_wait() call (after the recv()) will
indicate that there is more data.

For poll() it doesn't make much difference how many fd are supplied to each system call.
The overall performance is much the same for 32, 64 or 500 (all the sockets).

For epoll_wait() that isn't true.
Supplying a buffer that is shorter than the list of 'ready' fds gives a massive penalty.
With a buffer long enough for all the events epoll() is somewhat faster than poll().
But with a 64 entry buffer it is much slower.
I've looked at the code and can't see why splicing the unread events back is expensive.
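
A minimal sketch of a level-triggered sweep with a configurable event buffer, so that the 'short buffer' and 'long buffer' cases can be timed against each other. The function name sweep_once is hypothetical, and the sketch assumes each socket was registered with EPOLLIN and its fd stored in data.fd.

```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * One level-triggered sweep: ask for up to max_events ready sockets and
 * read one datagram from each of them.
 * max_events must be greater than zero.
 * Returns the number of datagrams read, or -1 on error.
 */
static int sweep_once(int epfd, int max_events)
{
    struct epoll_event *ev;
    char buf[2048];
    int i, n, got = 0;

    if (max_events <= 0)
        return -1;
    ev = calloc((size_t)max_events, sizeof(*ev));
    if (!ev)
        return -1;

    n = epoll_wait(epfd, ev, max_events, 0);    /* timeout 0: do not sleep */
    for (i = 0; i < n; i++) {
        if (recv(ev[i].data.fd, buf, sizeof(buf), MSG_DONTWAIT) >= 0)
            got++;
    }
    free(ev);
    return n < 0 ? -1 : got;
}
```

Calling sweep_once() repeatedly until it returns 0 picks up second messages, because in level-triggered mode a socket that still has data is reported again.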

I'd like to be able to change the code so that multiple threads are reading from the epoll fd.
This would mean I'd have to run it in edge mode and each thread reading a smallish
block of events.
Any suggestions on how to efficiently read the 'unusual' additional messages from
the sockets?

FWIW the fastest way to read 1 RTP message every 20ms is to do non-blocking recv() every 10ms.
The failing recv() is actually faster than either epoll() or two poll() actions.
(Although something is needed to pick up the occasional second message.) 
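
A minimal sketch of the non-blocking approach, extended to keep reading until the socket is empty so that an occasional second message is also picked up. The name drain_socket and the handle callback are illustrative assumptions.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Read every datagram currently queued on one UDP socket without blocking.
 * handle() is a hypothetical per-message callback and may be NULL.
 * Returns the number of datagrams read, or -1 on a real error.
 */
static int drain_socket(int fd, void (*handle)(const void *buf, size_t len))
{
    char buf[2048];
    int count = 0;

    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);

        if (n >= 0) {
            if (handle)
                handle(buf, (size_t)n);
            count++;
            continue;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return count;       /* queue is empty: the cheap failing recv() */
        if (errno == EINTR)
            continue;
        return -1;
    }
}
```

With one message queued this costs two recv() calls per socket, one that succeeds and one that fails with EAGAIN.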

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: epoll_wait() performance
  2019-11-22 11:17 epoll_wait() performance David Laight
@ 2019-11-27  9:50 ` Marek Majkowski
  2019-11-27 10:39   ` David Laight
  0 siblings, 1 reply; 20+ messages in thread
From: Marek Majkowski @ 2019-11-27  9:50 UTC (permalink / raw)
  To: David Laight
  Cc: linux-kernel, network dev, kernel-team, Jesper Dangaard Brouer

On Fri, Nov 22, 2019 at 12:18 PM David Laight <David.Laight@aculab.com> wrote:
> I'm trying to optimise some code that reads UDP messages (RTP and RTCP) from a lot of sockets.
> The 'normal' data pattern is that there is no data on half the sockets (RTCP) and
> one message every 20ms on the others (RTP).
> However there can be more than one message on each socket, and they all need to be read.
> Since the code processing the data runs every 10ms, the message receiving code
> also runs every 10ms (a massive gain when using poll()).

How many sockets are we talking about? More like 500 or 500k? We had a very
bad experience with connected UDP sockets: if you are using them,
the RX path is super slow, mostly consumed by udp_lib_lookup()
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/udp.c#L445

Then we might argue that using thousands of unconnected UDP sockets - like
192.0.2.1:1234, 192.0.2.2:1234, etc - creates little value. I guess the only
reasonable case for a large number of UDP sockets is when you need a
large number of source ports.

In such case we experimented with abusing TPROXY:
https://web.archive.org/web/20191115081000/https://blog.cloudflare.com/how-we-built-spectrum/

> While using recvmmsg() to read multiple messages might seem a good idea, it is much
> slower than recv() when there is only one message (even recvmsg() is a lot slower).
> (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> and faffing with the user iov[].)
>
> So using poll() we repoll the fd after calling recv() to find if there is a second message.
> However the second poll has a significant performance cost (but less than using recvmmsg()).

That sounds wrong. Single recvmmsg(), even when receiving only a
single message, should be faster than two syscalls - recv() and
poll().

> If we use epoll() in level triggered mode a second epoll_wait() call (after the recv()) will
> indicate that there is more data.
>
> For poll() it doesn't make much difference how many fd are supplied to each system call.
> The overall performance is much the same for 32, 64 or 500 (all the sockets).
>
> For epoll_wait() that isn't true.
> Supplying a buffer that is shorter than the list of 'ready' fds gives a massive penalty.
> With a buffer long enough for all the events epoll() is somewhat faster than poll().
> But with a 64 entry buffer it is much slower.
> I've looked at the code and can't see why splicing the unread events back is expensive.

Again, this is surprising.

> I'd like to be able to change the code so that multiple threads are reading from the epoll fd.
> This would mean I'd have to run it in edge mode and each thread reading a smallish
> block of events.
> Any suggestions on how to efficiently read the 'unusual' additional messages from
> the sockets?

Random ideas:
1. Perhaps reducing the number of sockets could help - with iptables or TPROXY.
TPROXY has some performance impact though, so be careful.

2. I played with io_submit for syscall batching, but in my experiments I wasn't
able to show performance boost:
https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/
Perhaps the newer io_uring with networking support could help:
https://twitter.com/axboe/status/1195047335182524416

3. SO_BUSY_POLL drastically reduces latency, but I've only used it with
a single socket.

4. If you want to get the number of outstanding packets, there is SIOCINQ
and SO_MEMINFO.

My older writeups:
https://blog.cloudflare.com/how-to-receive-a-million-packets/
https://blog.cloudflare.com/how-to-achieve-low-latency/

Cheers,
   Marek

> FWIW the fastest way to read 1 RTP message every 20ms is to do non-blocking recv() every 10ms.
> The failing recv() is actually faster than either epoll() or two poll() actions.
> (Although something is needed to pick up the occasional second message.)
>
>         David

* RE: epoll_wait() performance
  2019-11-27  9:50 ` Marek Majkowski
@ 2019-11-27 10:39   ` David Laight
  2019-11-27 15:48     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2019-11-27 10:39 UTC (permalink / raw)
  To: 'Marek Majkowski'
  Cc: linux-kernel, network dev, kernel-team, Jesper Dangaard Brouer

From: Marek Majkowski
> Sent: 27 November 2019 09:51
> On Fri, Nov 22, 2019 at 12:18 PM David Laight <David.Laight@aculab.com> wrote:
> > I'm trying to optimise some code that reads UDP messages (RTP and RTCP) from a lot of sockets.
> > The 'normal' data pattern is that there is no data on half the sockets (RTCP) and
> > one message every 20ms on the others (RTP).
> > However there can be more than one message on each socket, and they all need to be read.
> > Since the code processing the data runs every 10ms, the message receiving code
> > also runs every 10ms (a massive gain when using poll()).
> 
> How many sockets we are talking about? More like 500 or 500k? We had very
> bad experience with UDP connected sockets, so if you are using UDP connected
> sockets, the RX path is super slow, mostly consumed by udp_lib_lookup()
> https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/udp.c#L445

For my tests I have 900, but that is nothing like the limit for the application.
The test system is over 50% idle and running at its minimal clock speed.
The sockets are all unconnected, I believe the remote application is allowed
to change the source IP mid-flow!

...
> > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > and faffing with the user iov[].)
> >
> > So using poll() we repoll the fd after calling recv() to find if there is a second message.
> > However the second poll has a significant performance cost (but less than using recvmmsg()).
> 
> That sounds wrong. Single recvmmsg(), even when receiving only a
> single message, should be faster than two syscalls - recv() and
> poll().

My suspicion is the extra two copy_from_user() needed for each recvmsg are a
significant overhead, most likely due to the crappy code that tries to stop
the kernel buffer being overrun.
I need to run the tests on a system with a 'home built' kernel to see how much
difference this makes (by seeing how much slower duplicating the copy makes it).

The system call cost of poll() gets factored over a reasonable number of sockets.
So doing poll() on a socket with no data is a lot faster than the setup for recvmsg
even allowing for looking up the fd.

This could be fixed by an extra flag to recvmmsg() to indicate that you only really
expect one message and to call the poll() function before each subsequent receive.

There is also the 'reschedule' that Eric added to the loop in recvmmsg.
I don't know how much that actually costs.
In this case the process is likely to be running at a RT priority and pinned to a cpu.
In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.

We really do want to receive all these UDP packets in a timely manner.
Although very low latency isn't itself an issue.
The data is telephony audio with (typically) one packet every 20ms.
The code only looks for packets every 10ms - that helps no end since, in principle,
only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.

> > If we use epoll() in level triggered mode a second epoll_wait() call (after the recv()) will
> > indicate that there is more data.
> >
> > For poll() it doesn't make much difference how many fd are supplied to each system call.
> > The overall performance is much the same for 32, 64 or 500 (all the sockets).
> >
> > For epoll_wait() that isn't true.
> > Supplying a buffer that is shorter than the list of 'ready' fds gives a massive penalty.
> > With a buffer long enough for all the events epoll() is somewhat faster than poll().
> > But with a 64 entry buffer it is much slower.
> > I've looked at the code and can't see why splicing the unread events back is expensive.
> 
> Again, this is surprising.

Yep, but repeatedly measurable.
If no one else has seen this I'll have to try to instrument it in the kernel somehow.
I'm pretty sure it isn't a userspace issue.

> > I'd like to be able to change the code so that multiple threads are reading from the epoll fd.
> > This would mean I'd have to run it in edge mode and each thread reading a smallish
> > block of events.
> > Any suggestions on how to efficiently read the 'unusual' additional messages from
> > the sockets?
> 
> Random ideas:
> 1. Perhaps reducing the number of sockets could help - with iptables or TPROXY.
> TPROXY has some performance impact though, so be careful.

We'd then have to use recvmsg() - which is measurably slower than recv().

> 2. I played with io_submit for syscall batching, but in my experiments I wasn't
> able to show performance boost:
> https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/
> Perhaps the newer io_uring with networking support could help:
> https://twitter.com/axboe/status/1195047335182524416

You need an OS that actually does async IO - Like RSX11/M or windows.
Just deferring the request to a kernel thread can mean you get stuck
behind other processes blocking reads.

> 3. SO_BUSYPOLL drastically reduces latency, but I've only used it with
> a single socket..

We need to minimise the cpu cost more than the absolute latency.

> 4. If you want to get number of outstanding packets, there is SIOCINQ
> and SO_MEMINFO.

That's another system call.
poll() can tell us whether there is any data on a lot of sockets quicker.

	David


* Re: epoll_wait() performance
  2019-11-27 10:39   ` David Laight
@ 2019-11-27 15:48     ` Jesper Dangaard Brouer
  2019-11-27 16:04       ` David Laight
  2019-11-27 16:26       ` Paolo Abeni
  0 siblings, 2 replies; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2019-11-27 15:48 UTC (permalink / raw)
  To: David Laight
  Cc: 'Marek Majkowski',
	linux-kernel, network dev, kernel-team, brouer, Paolo Abeni


On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <David.Laight@ACULAB.COM> wrote:

> ...
> > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > and faffing with the user iov[].)
> > >
> > > > So using poll() we repoll the fd after calling recv() to find if there is a second message.
> > > However the second poll has a significant performance cost (but less than using recvmmsg()).  
> > 
> > That sounds wrong. Single recvmmsg(), even when receiving only a
> > single message, should be faster than two syscalls - recv() and
> > poll().  
> 
> My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> significant overhead, most likely due to the crappy code that tries to stop
> the kernel buffer being overrun.
>
> I need to run the tests on a system with a 'home built' kernel to see how much
> difference this makes (by seeing how much slower duplicating the copy makes it).
> 
> The system call cost of poll() gets factored over a reasonable number of sockets.
> So doing poll() on a socket with no data is a lot faster than the setup for recvmsg
> even allowing for looking up the fd.
> 
> This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> expect one message and to call the poll() function before each subsequent receive.
> 
> There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> I don't know how much that actually costs.
> In this case the process is likely to be running at a RT priority and pinned to a cpu.
> In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> 
> We really do want to receive all these UDP packets in a timely manner.
> Although very low latency isn't itself an issue.
> The data is telephony audio with (typically) one packet every 20ms.
> The code only looks for packets every 10ms - that helps no end since, in principle,
> only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.

I have a simple udp_sink tool[1] that cycles through the different
receive socket system calls.  I gave it a quick spin on a F31 kernel
5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
to see a significant regression/slowdown for recvMmsg.

$ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
          	run      count   	ns/pkt	pps		cycles	payload
recvMmsg/32  	run:  0	10000000	1461.41	684270.96	5261	18	 demux:1
recvmsg   	run:  0	10000000	889.82	1123824.84	3203	18	 demux:1
read      	run:  0	10000000	974.81	1025841.68	3509	18	 demux:1
recvfrom  	run:  0	10000000	1056.51	946513.44	3803	18	 demux:1

Normal recvmsg has about 1.6 times the throughput of recvMmsg:
 recvMmsg/32 = 684,270 pps
 recvmsg     = 1,123,824 pps

[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

For connected UDP socket:

$ sudo ./udp_sink --port 9 --repeat 1 --connect
          	run      count   	ns/pkt	pps		cycles	payload
recvMmsg/32  	run:  0	 1000000	1240.06	806411.73	4464	18	 demux:1 c:1
recvmsg   	run:  0	 1000000	768.80	1300724.75	2767	18	 demux:1 c:1
read      	run:  0	 1000000	823.40	1214478.40	2964	18	 demux:1 c:1
recvfrom  	run:  0	 1000000	889.19	1124616.11	3201	18	 demux:1 c:1


Found some old results (approx v4.10-rc1):

[brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
 recvMmsg/32    run: 0 10000000 537.89  1859106.74      2155    21559353816
 recvmsg        run: 0 10000000 552.69  1809344.44      2215    22152468673
 read           run: 0 10000000 476.65  2097970.76      1910    19104864199
 recvfrom       run: 0 10000000 450.76  2218492.60      1806    18066972794




* RE: epoll_wait() performance
  2019-11-27 15:48     ` Jesper Dangaard Brouer
@ 2019-11-27 16:04       ` David Laight
  2019-11-27 19:48         ` Willem de Bruijn
  2019-11-28 11:12         ` Jesper Dangaard Brouer
  2019-11-27 16:26       ` Paolo Abeni
  1 sibling, 2 replies; 20+ messages in thread
From: David Laight @ 2019-11-27 16:04 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer'
  Cc: 'Marek Majkowski',
	linux-kernel, network dev, kernel-team, Paolo Abeni

From: Jesper Dangaard Brouer
> Sent: 27 November 2019 15:48
> On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > ...
> > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > > and faffing with the user iov[].)
> > > >
> > > > So using poll() we repoll the fd after calling recv() to find if there is a second message.
> > > > However the second poll has a significant performance cost (but less than using recvmmsg()).
> > >
> > > That sounds wrong. Single recvmmsg(), even when receiving only a
> > > single message, should be faster than two syscalls - recv() and
> > > poll().
> >
> > My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> > significant overhead, most likely due to the crappy code that tries to stop
> > the kernel buffer being overrun.
> >
> > I need to run the tests on a system with a 'home built' kernel to see how much
> > difference this makes (by seeing how much slower duplicating the copy makes it).
> >
> > The system call cost of poll() gets factored over a reasonable number of sockets.
> > So doing poll() on a socket with no data is a lot faster than the setup for recvmsg
> > even allowing for looking up the fd.
> >
> > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > expect one message and to call the poll() function before each subsequent receive.
> >
> > There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> > I don't know how much that actually costs.
> > In this case the process is likely to be running at a RT priority and pinned to a cpu.
> > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> >
> > We really do want to receive all these UDP packets in a timely manner.
> > Although very low latency isn't itself an issue.
> > The data is telephony audio with (typically) one packet every 20ms.
> > The code only looks for packets every 10ms - that helps no end since, in principle,
> > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
> 
> I have a simple udp_sink tool[1] that cycle through the different
> receive socket system calls.  I gave it a quick spin on a F31 kernel
> 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> to see a significant regression/slowdown for recvMmsg.
> 
> $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
>           	run      count   	ns/pkt	pps		cycles	payload
> recvMmsg/32  	run:  0	10000000	1461.41	684270.96	5261	18	 demux:1
> recvmsg   	run:  0	10000000	889.82	1123824.84	3203	18	 demux:1
> read      	run:  0	10000000	974.81	1025841.68	3509	18	 demux:1
> recvfrom  	run:  0	10000000	1056.51	946513.44	3803	18	 demux:1
> 
> Normal recvmsg almost have double performance that recvmmsg.
>  recvMmsg/32 = 684,270 pps
>  recvmsg     = 1,123,824 pps

Can you test recv() as well?
I think it might be faster than read().

...
> Found some old results (approx v4.10-rc1):
> 
> [brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
>  recvMmsg/32    run: 0 10000000 537.89  1859106.74      2155    21559353816
>  recvmsg        run: 0 10000000 552.69  1809344.44      2215    22152468673
>  read           run: 0 10000000 476.65  2097970.76      1910    19104864199
>  recvfrom       run: 0 10000000 450.76  2218492.60      1806    18066972794

That is probably nearer what I am seeing on a 4.15 Ubuntu 18.04 kernel.
recvmmsg() and recvmsg() are similar - but both a lot slower than recv().

	David


* Re: epoll_wait() performance
  2019-11-27 15:48     ` Jesper Dangaard Brouer
  2019-11-27 16:04       ` David Laight
@ 2019-11-27 16:26       ` Paolo Abeni
  2019-11-27 17:30         ` David Laight
  1 sibling, 1 reply; 20+ messages in thread
From: Paolo Abeni @ 2019-11-27 16:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David Laight
  Cc: 'Marek Majkowski', linux-kernel, network dev, kernel-team

On Wed, 2019-11-27 at 16:48 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > ...
> > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > > and faffing with the user iov[].)
> > > > 
> > > > So using poll() we repoll the fd after calling recv() to find if there is a second message.
> > > > However the second poll has a significant performance cost (but less than using recvmmsg()).  
> > > 
> > > That sounds wrong. Single recvmmsg(), even when receiving only a
> > > single message, should be faster than two syscalls - recv() and
> > > poll().  
> > 
> > My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> > significant overhead, most likely due to the crappy code that tries to stop
> > the kernel buffer being overrun.
> > 
> > I need to run the tests on a system with a 'home built' kernel to see how much
> > difference this makes (by seeing how much slower duplicating the copy makes it).
> > 
> > The system call cost of poll() gets factored over a reasonable number of sockets.
> > So doing poll() on a socket with no data is a lot faster than the setup for recvmsg
> > even allowing for looking up the fd.
> > 
> > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > expect one message and to call the poll() function before each subsequent receive.
> > 
> > There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> > I don't know how much that actually costs.
> > In this case the process is likely to be running at a RT priority and pinned to a cpu.
> > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> > 
> > We really do want to receive all these UDP packets in a timely manner.
> > Although very low latency isn't itself an issue.
> > The data is telephony audio with (typically) one packet every 20ms.
> > The code only looks for packets every 10ms - that helps no end since, in principle,
> > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
> 
> I have a simple udp_sink tool[1] that cycle through the different
> receive socket system calls.  I gave it a quick spin on a F31 kernel
> 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> to see a significant regression/slowdown for recvMmsg.
> 
> $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
>           	run      count   	ns/pkt	pps		cycles	payload
> recvMmsg/32  	run:  0	10000000	1461.41	684270.96	5261	18	 demux:1
> recvmsg   	run:  0	10000000	889.82	1123824.84	3203	18	 demux:1
> read      	run:  0	10000000	974.81	1025841.68	3509	18	 demux:1
> recvfrom  	run:  0	10000000	1056.51	946513.44	3803	18	 demux:1
> 
> Normal recvmsg almost have double performance that recvmmsg.

For stream tests, the above is true, if the BH is able to push the
packets to the socket fast enough. Otherwise the recvmmsg() will make
the user space even faster, the BH will find the user space process
sleeping more often and the BH will have to spend more time waking-up
the process.

If a single receive queue is in use this condition is not easy to meet.

Before the spectre/meltdown and other mitigations, using connected sockets
and removing ct/nf was usually sufficient - at least in my scenarios -
to make the BH fast enough.

But that is no longer the case, and I have to use 2 or more different
receive queues.

@David: If I read your message correctly, the pkt rate you are dealing
with is quite low... are we talking about tput or latency? I guess
latency could be measurably higher with recvmmsg() than with the other
syscalls. How do you measure the relative performance of recvmmsg()
and recv()? With a micro-benchmark/rdtsc()? Am I right that you are
usually getting a single packet per recvmmsg() call?

Thanks,

Paolo



* RE: epoll_wait() performance
  2019-11-27 16:26       ` Paolo Abeni
@ 2019-11-27 17:30         ` David Laight
  2019-11-27 17:46           ` Eric Dumazet
  2019-11-27 17:50           ` Paolo Abeni
  0 siblings, 2 replies; 20+ messages in thread
From: David Laight @ 2019-11-27 17:30 UTC (permalink / raw)
  To: 'Paolo Abeni', Jesper Dangaard Brouer
  Cc: 'Marek Majkowski', linux-kernel, network dev, kernel-team

From: Paolo Abeni
> Sent: 27 November 2019 16:27
...
> @David: If I read your message correctly, the pkt rate you are dealing
> with is quite low... are we talking about tput or latency? I guess
> latency could be measurably higher with recvmmsg() in respect to other
> syscall. How do you measure the relative performances of recvmmsg()
> and recv() ? with micro-benchmark/rdtsc()? Am I right that you are
> usually getting a single packet per recvmmsg() call?

The packet rate per socket is low, typically one packet every 20ms.
This is RTP, so telephony audio.
However we have a lot of audio channels and hence a lot of sockets.
So there can be 1000s of sockets we need to receive the data from.
The test system I'm using has 16 E1 TDM links each of which can handle
31 audio channels.
Forwarding all these to/from RTP (one of the things it might do) is 496
audio channels - so 496 RTP sockets and 496 RTCP ones.
Although the test I'm doing is pure RTP and doesn't use TDM.

What I'm measuring is the total time taken to receive all the packets
(on all the sockets) that are available to be read every 10ms.
So poll + recv + add_to_queue.
(The data processing is done by other threads.)
I use the time difference (actually CLOCK_MONOTONIC - from rdtsc)
to generate a 64 entry (self scaling) histogram of the elapsed times.
Then look for the histogram's peak value.
(I need to work on the max value, but that is a different (more important!) problem.)
Depending on the poll/recv method used this takes 1.5 to 2ms
in each 10ms period.
(It is faster if I run the cpu at full speed, but it usually idles along
at 800MHz.)

If I use recvmmsg() I only expect to see one packet because there
is (almost always) only one packet on each socket every 20ms.
However there might be more than one, and if there is they
all need to be read (well at least 2 of them) in that block of receives.

The outbound traffic goes out through a small number of raw sockets.
Annoyingly we have to work out the local IPv4 address that will be used
for each destination in order to calculate the UDP checksum.
(I've a pending patch to speed up the x86 checksum code on a lot of
cpus.)

	David


* Re: epoll_wait() performance
  2019-11-27 17:30         ` David Laight
@ 2019-11-27 17:46           ` Eric Dumazet
  2019-11-28 10:17             ` David Laight
  2019-11-27 17:50           ` Paolo Abeni
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2019-11-27 17:46 UTC (permalink / raw)
  To: David Laight, 'Paolo Abeni', Jesper Dangaard Brouer
  Cc: 'Marek Majkowski', linux-kernel, network dev, kernel-team



On 11/27/19 9:30 AM, David Laight wrote:
> From: Paolo Abeni
>> Sent: 27 November 2019 16:27
> ...
>> @David: If I read your message correctly, the pkt rate you are dealing
>> with is quite low... are we talking about tput or latency? I guess
>> latency could be measurably higher with recvmmsg() in respect to other
>> syscall. How do you measure the relative performances of recvmmsg()
>> and recv() ? with micro-benchmark/rdtsc()? Am I right that you are
>> usually getting a single packet per recvmmsg() call?
> 
> The packet rate per socket is low, typically one packet every 20ms.
> This is RTP, so telephony audio.
> However we have a lot of audio channels and hence a lot of sockets.
> So there can be 1000s of sockets we need to receive the data from.
> The test system I'm using has 16 E1 TDM links each of which can handle
> 31 audio channels.
> Forwarding all these to/from RTP (one of the things it might do) is 496
> audio channels - so 496 RTP sockets and 496 RTCP ones.
> Although the test I'm doing is pure RTP and doesn't use TDM.
> 
> What I'm measuring is the total time taken to receive all the packets
> (on all the sockets) that are available to be read every 10ms.
> So poll + recv + add_to_queue.
> (The data processing is done by other threads.)
> I use the time difference (actually CLOCK_MONOTONIC - from rdtsc)
> to generate a 64 entry (self scaling) histogram of the elapsed times.
> Then look for the histogram's peak value.
> (I need to work on the max value, but that is a different (more important!) problem.)
> Depending on the poll/recv method used this takes 1.5 to 2ms
> in each 10ms period.
> (It is faster if I run the cpu at full speed, but it usually idles along
> at 800MHz.)
> 
> If I use recvmmsg() I only expect to see one packet because there
> is (almost always) only one packet on each socket every 20ms.
> However there might be more than one, and if there is they
> all need to be read (well at least 2 of them) in that block of receives.
> 
> The outbound traffic goes out through a small number of raw sockets.
> Annoyingly we have to work out the local IPv4 address that will be used
> for each destination in order to calculate the UDP checksum.
> (I've a pending patch to speed up the x86 checksum code on a lot of
> cpus.)
> 
> 	David

A QUIC server handles hundreds of thousands of 'UDP flows', all using only one UDP socket
per cpu.

This is really the only way to scale, and does not need kernel changes to efficiently
organize millions of UDP sockets (huge memory footprint even if we get right how
we manage them)

Given that UDP has no state, there is really no point trying to have one UDP
socket per flow, and having to deal with epoll()/poll() overhead.





^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: epoll_wait() performance
  2019-11-27 17:30         ` David Laight
  2019-11-27 17:46           ` Eric Dumazet
@ 2019-11-27 17:50           ` Paolo Abeni
  1 sibling, 0 replies; 20+ messages in thread
From: Paolo Abeni @ 2019-11-27 17:50 UTC (permalink / raw)
  To: David Laight, Jesper Dangaard Brouer
  Cc: 'Marek Majkowski', linux-kernel, network dev, kernel-team

Hi,

Thanks for the additional details.

On Wed, 2019-11-27 at 17:30 +0000, David Laight wrote:
> From: Paolo Abeni
> > Sent: 27 November 2019 16:27
> ...
> > @David: If I read your message correctly, the pkt rate you are dealing
> > with is quite low... are we talking about tput or latency? I guess
> > latency could be measurably higher with recvmmsg() in respect to other
> > syscall. How do you measure the releative performances of recvmmsg()
> > and recv() ? with micro-benchmark/rdtsc()? Am I right that you are
> > usually getting a single packet per recvmmsg() call?
> 
> The packet rate per socket is low, typically one packet every 20ms.
> This is RTP, so telephony audio.
> However we have a lot of audio channels and hence a lot of sockets.
> So there are can be 1000s of sockets we need to receive the data from.
> The test system I'm using has 16 E1 TDM links each of which can handle
> 31 audio channels.
> Forwarding all these to/from RTP (one of the things it might do) is 496
> audio channels - so 496 RTP sockets and 496 RTCP ones.
> Although the test I'm doing is pure RTP and doesn't use TDM.

OK, I think this is not exactly the preferred recvmmsg() use case ;)

> What I'm measuring is the total time taken to receive all the packets
> (on all the sockets) that are available to be read every 10ms.
> So poll + recv + add_to_queue.
> (The data processing is done by other threads.)
> I use the time difference (actually CLOCK_MONOTONIC - from rdtsc)
> to generate a 64 entry (self scaling) histogram of the elapsed times.
> Then look for the histograms peak value.
> (I need to work on the max value, but that is a different (more important!) problem.)
> Depending on the poll/recv method used this takes 1.5 to 2ms
> in each 10ms period.
> (It is faster if I run the cpu at full speed, but it usually idles along
> at 800MHz.)
> 
> If I use recvmmsg() I only expect to see one packet because there
> is (almost always) only one packet on each socket every 20ms.
> However there might be more than one, and if there is they
> all need to be read (well at least 2 of them) in that block of receives.

My wild guess is that recvmmsg() would be faster than 2 recv() calls when
there are exactly 2 pkts to read and user space provides exactly 2
msg entries, but that is likely not very relevant for the overall scenario.
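
A rough sketch of that case, assuming a non-blocking read with exactly two mmsghdr entries. The helper name recv_up_to_two() and the MAX_PKT buffer size are made up for the example and are not taken from anyone's code in this thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* recvmmsg() is a Linux/GNU extension */
#endif
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define MAX_PKT 2048   /* assumed maximum datagram size for this sketch */

/* Hypothetical helper: read at most two datagrams from `fd` without blocking.
 * Stores the received lengths in lens[] and returns the number of datagrams
 * read (0 if the socket was empty), or -1 on any other error. */
int recv_up_to_two(int fd, char bufs[2][MAX_PKT], size_t lens[2])
{
    struct mmsghdr msgs[2];
    struct iovec iov[2];

    memset(msgs, 0, sizeof msgs);
    for (int i = 0; i < 2; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = MAX_PKT;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* One syscall; returns as soon as the socket queue is empty. */
    int n = recvmmsg(fd, msgs, 2, MSG_DONTWAIT, NULL);
    if (n < 0)
        return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;

    for (int i = 0; i < n; i++)
        lens[i] = msgs[i].msg_len;
    return n;
}
```

Whether this beats recv() plus a second poll() for the common single-packet case is exactly what is disputed in this thread, so it would need measuring on the target kernel.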

Sorry, I don't have any good suggestion here.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: epoll_wait() performance
  2019-11-27 16:04       ` David Laight
@ 2019-11-27 19:48         ` Willem de Bruijn
  2019-11-28 16:25           ` David Laight
  2019-11-28 11:12         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 20+ messages in thread
From: Willem de Bruijn @ 2019-11-27 19:48 UTC (permalink / raw)
  To: David Laight
  Cc: Jesper Dangaard Brouer, Marek Majkowski, linux-kernel,
	network dev, kernel-team, Paolo Abeni

On Wed, Nov 27, 2019 at 11:04 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Jesper Dangaard Brouer
> > Sent: 27 November 2019 15:48
> > On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <David.Laight@ACULAB.COM> wrote:
> >
> > > ...
> > > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > > > and faffing with the user iov[].)
> > > > >
> > > > > So using poll() we repoll the fd after calling recv() to find is there is a second message.
> > > > > However the second poll has a significant performance cost (but less than using recvmmsg()).
> > > >
> > > > That sounds wrong. Single recvmmsg(), even when receiving only a
> > > > single message, should be faster than two syscalls - recv() and
> > > > poll().
> > >
> > > My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> > > significant overhead, most likely due to the crappy code that tries to stop
> > > the kernel buffer being overrun.
> > >
> > > I need to run the tests on a system with a 'home built' kernel to see how much
> > > difference this make (by seeing how much slower duplicating the copy makes it).
> > >
> > > The system call cost of poll() gets factored over a reasonable number of sockets.
> > > So doing poll() on a socket with no data is a lot faster that the setup for recvmsg
> > > even allowing for looking up the fd.
> > >
> > > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > > expect one message and to call the poll() function before each subsequent receive.
> > >
> > > There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> > > I don't know how much that actually costs.
> > > In this case the process is likely to be running at a RT priority and pinned to a cpu.
> > > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> > >
> > > We really do want to receive all these UDP packets in a timely manner.
> > > Although very low latency isn't itself an issue.
> > > The data is telephony audio with (typically) one packet every 20ms.
> > > The code only looks for packets every 10ms - that helps no end since, in principle,
> > > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
> >
> > I have a simple udp_sink tool[1] that cycle through the different
> > receive socket system calls.  I gave it a quick spin on a F31 kernel
> > 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> > to see a significant regression/slowdown for recvMmsg.
> >
> > $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
> >               run      count          ns/pkt  pps             cycles  payload
> > recvMmsg/32   run:  0 10000000        1461.41 684270.96       5261    18       demux:1
> > recvmsg       run:  0 10000000        889.82  1123824.84      3203    18       demux:1
> > read          run:  0 10000000        974.81  1025841.68      3509    18       demux:1
> > recvfrom      run:  0 10000000        1056.51 946513.44       3803    18       demux:1
> >
> > Normal recvmsg almost have double performance that recvmmsg.
> >  recvMmsg/32 = 684,270 pps
> >  recvmsg     = 1,123,824 pps
>
> Can you test recv() as well?
> I think it might be faster than read().
>
> ...
> > Found some old results (approx v4.10-rc1):
> >
> > [brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
> >  recvMmsg/32    run: 0 10000000 537.89  1859106.74      2155    21559353816
> >  recvmsg        run: 0 10000000 552.69  1809344.44      2215    22152468673
> >  read           run: 0 10000000 476.65  2097970.76      1910    19104864199
> >  recvfrom       run: 0 10000000 450.76  2218492.60      1806    18066972794
>
> That is probably nearer what I am seeing on a 4.15 Ubuntu 18.04 kernel.
> recvmmsg() and recvmsg() are similar - but both a lot slower then recv().

Indeed, surprising that recv(from) would be less efficient than recvmsg.

Are the latest numbers with CONFIG_HARDENED_USERCOPY?

I assume that the poll() after recv() is non-blocking. If using
recvmsg, that extra syscall could be avoided by implementing a cmsg
inq hint for udp sockets analogous to TCP_CM_INQ/tcp_inq_hint.

More outlandish would be to abuse the mmsghdr->msg_len field to pass
file descriptors and amortize the kernel page-table isolation cost
across sockets. Blocking semantics would be weird, for starters.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: epoll_wait() performance
  2019-11-27 17:46           ` Eric Dumazet
@ 2019-11-28 10:17             ` David Laight
  2019-11-30  1:07               ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2019-11-28 10:17 UTC (permalink / raw)
  To: 'Eric Dumazet', 'Paolo Abeni', Jesper Dangaard Brouer
  Cc: 'Marek Majkowski', linux-kernel, network dev, kernel-team

From: Eric Dumazet
> Sent: 27 November 2019 17:47
...
> A QUIC server handles hundred of thousands of ' UDP flows' all using only one UDP socket
> per cpu.
> 
> This is really the only way to scale, and does not need kernel changes to efficiently
> organize millions of UDP sockets (huge memory footprint even if we get right how
> we manage them)
> 
> Given that UDP has no state, there is really no point trying to have one UDP
> socket per flow, and having to deal with epoll()/poll() overhead.

How can you do that when all the UDP flows have different destination port numbers?
These are message flows not idempotent requests.
I don't really want to collect the packets before they've been processed by IP.

I could write a driver that uses kernel udp sockets to generate a single message queue
that can be efficiently processed from userspace - but it is a faff compiling it for
the system's kernel version.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: epoll_wait() performance
  2019-11-27 16:04       ` David Laight
  2019-11-27 19:48         ` Willem de Bruijn
@ 2019-11-28 11:12         ` Jesper Dangaard Brouer
  2019-11-28 16:37           ` David Laight
  1 sibling, 1 reply; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2019-11-28 11:12 UTC (permalink / raw)
  To: David Laight
  Cc: 'Marek Majkowski',
	linux-kernel, network dev, kernel-team, Paolo Abeni, brouer

On Wed, 27 Nov 2019 16:04:12 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Jesper Dangaard Brouer
> > Sent: 27 November 2019 15:48
> > On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <David.Laight@ACULAB.COM> wrote:
> >   
> > > ...  
> > > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > > > and faffing with the user iov[].)
> > > > >
> > > > > So using poll() we repoll the fd after calling recv() to find is there is a second message.
> > > > > However the second poll has a significant performance cost (but less than using recvmmsg()).  
> > > >
> > > > That sounds wrong. Single recvmmsg(), even when receiving only a
> > > > single message, should be faster than two syscalls - recv() and
> > > > poll().  
> > >
> > > My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> > > significant overhead, most likely due to the crappy code that tries to stop
> > > the kernel buffer being overrun.
> > >
> > > I need to run the tests on a system with a 'home built' kernel to see how much
> > > difference this make (by seeing how much slower duplicating the copy makes it).
> > >
> > > The system call cost of poll() gets factored over a reasonable number of sockets.
> > > So doing poll() on a socket with no data is a lot faster that the setup for recvmsg
> > > even allowing for looking up the fd.
> > >
> > > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > > expect one message and to call the poll() function before each subsequent receive.
> > >
> > > There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> > > I don't know how much that actually costs.
> > > In this case the process is likely to be running at a RT priority and pinned to a cpu.
> > > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> > >
> > > We really do want to receive all these UDP packets in a timely manner.
> > > Although very low latency isn't itself an issue.
> > > The data is telephony audio with (typically) one packet every 20ms.
> > > The code only looks for packets every 10ms - that helps no end since, in principle,
> > > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.  
> > 
> > I have a simple udp_sink tool[1] that cycle through the different
> > receive socket system calls.  I gave it a quick spin on a F31 kernel
> > 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> > to see a significant regression/slowdown for recvMmsg.
> > 
> > $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
> >           	run      count   	ns/pkt	pps		cycles	payload
> > recvMmsg/32	run:  0	10000000	1461.41	684270.96	5261	18	 demux:1
> > recvmsg   	run:  0	10000000	889.82	1123824.84	3203	18	 demux:1
> > read      	run:  0	10000000	974.81	1025841.68	3509	18	 demux:1
> > recvfrom  	run:  0	10000000	1056.51	946513.44	3803	18	 demux:1
> > 
> > Normal recvmsg almost have double performance that recvmmsg.
> >  recvMmsg/32 = 684,270 pps
> >  recvmsg     = 1,123,824 pps  
> 
> Can you test recv() as well?

Sure: https://github.com/netoptimizer/network-testing/commit/9e3c8b86a2d662

$ sudo taskset -c 1 ./udp_sink --port 9  --count $((10**6*2))
          	run      count   	ns/pkt	pps		cycles	payload
recvMmsg/32  	run:  0	 2000000	653.29	1530704.29	2351	18	 demux:1
recvmsg   	run:  0	 2000000	631.01	1584760.06	2271	18	 demux:1
read      	run:  0	 2000000	582.24	1717518.16	2096	18	 demux:1
recvfrom  	run:  0	 2000000	547.26	1827269.12	1970	18	 demux:1
recv      	run:  0	 2000000	547.37	1826930.39	1970	18	 demux:1

> I think it might be faster than read().

Slightly, but same speed as recvfrom.

Strangely recvMmsg is not that bad in this test run, and it is on the
same kernel 5.3.12-300.fc31.x86_64 and hardware.  I have CPU pinned
udp_sink, because if it jumps to the CPU doing RX-NAPI it will be fighting
for CPU time with softirq (which Eric has mitigated a bit), and
results are bad and look like this:

[broadwell src]$ sudo taskset -c 5 ./udp_sink --port 9  --count $((10**6*2))
          	run      count   	ns/pkt	pps		cycles	payload
recvMmsg/32  	run:  0	 2000000	1252.44	798439.60	4508	18	 demux:1
recvmsg   	run:  0	 2000000	1917.65	521470.72	6903	18	 demux:1
read      	run:  0	 2000000	1817.31	550263.37	6542	18	 demux:1
recvfrom  	run:  0	 2000000	1742.44	573909.46	6272	18	 demux:1
recv      	run:  0	 2000000	1741.51	574213.08	6269	18	 demux:1
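
For reading these tables: the pps column appears to be just 1e9 divided by ns/pkt, and cycles divided by ns/pkt gives the implied clock rate. This is inferred from the numbers, not from the udp_sink source. For the pinned recv row, 547.37 ns/pkt works out to roughly 1.83 million pps, and 1970 cycles / 547.37 ns is about 3.6 GHz. The two helper names below are invented for the example.

```c
/* Packets per second implied by a per-packet cost in nanoseconds. */
double pps_from_ns(double ns_per_pkt)
{
    return 1e9 / ns_per_pkt;
}

/* Clock rate in GHz implied by a cycles-per-packet and ns-per-packet pair
 * (cycles per nanosecond is numerically the same as GHz). */
double implied_ghz(double cycles_per_pkt, double ns_per_pkt)
{
    return cycles_per_pkt / ns_per_pkt;
}
```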


> [...]
> > Found some old results (approx v4.10-rc1):
> > 
> > [brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
> >  recvMmsg/32    run: 0 10000000 537.89  1859106.74      2155    21559353816
> >  recvmsg        run: 0 10000000 552.69  1809344.44      2215    22152468673
> >  read           run: 0 10000000 476.65  2097970.76      1910    19104864199
> >  recvfrom       run: 0 10000000 450.76  2218492.60      1806    18066972794  
> 
> That is probably nearer what I am seeing on a 4.15 Ubuntu 18.04 kernel.
> recvmmsg() and recvmsg() are similar - but both a lot slower then recv().

Note that the tool can also test connected UDP sockets, which is what was done above.
I did a quick run with --connect:

$ sudo taskset -c 1 ./udp_sink --port 9  --count $((10**6*2)) --connect
          	run      count   	ns/pkt	pps		cycles	payload
recvMmsg/32  	run:  0	 2000000	500.72	1997107.02	1802	18	 demux:1 c:1
recvmsg   	run:  0	 2000000	662.52	1509380.46	2385	18	 demux:1 c:1
read      	run:  0	 2000000	613.46	1630103.14	2208	18	 demux:1 c:1
recvfrom  	run:  0	 2000000	577.71	1730974.34	2079	18	 demux:1 c:1
recv      	run:  0	 2000000	578.27	1729305.35	2081	18	 demux:1 c:1

And now, recvMmsg is actually the fastest...?!


p.s.
DISCLAIMER: Do notice that this udp_sink tool is a network-overload
micro-benchmark, that does not represent the use-case you are
describing.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: epoll_wait() performance
  2019-11-27 19:48         ` Willem de Bruijn
@ 2019-11-28 16:25           ` David Laight
  0 siblings, 0 replies; 20+ messages in thread
From: David Laight @ 2019-11-28 16:25 UTC (permalink / raw)
  To: 'Willem de Bruijn'
  Cc: Jesper Dangaard Brouer, Marek Majkowski, linux-kernel,
	network dev, kernel-team, Paolo Abeni

From: Willem de Bruijn
> Sent: 27 November 2019 19:48
...
> Are the latest numbers with CONFIG_HARDENED_USERCOPY?

According to /boot/config-`uname -r` it is enabled on my system.
I suspect it has a measurable effect on these tests.

> I assume that the poll() after recv() is non-blocking. If using
> recvmsg, that extra syscall could be avoided by implementing a cmsg
> inq hint for udp sockets analogous to TCP_CM_INQ/tcp_inq_hint.

All the poll() calls are non-blocking.
The first poll() has all the sockets in it.
The second poll() only those that returned data the first time around.
The code then sleeps elsewhere for the rest of the 10ms interval.
(Actually the polls are done in blocks of 64, filling up the pfd[] each time.)

This avoids the problem of repeatedly setting up and tearing down the
per-fd data for poll().
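
The two-pass scheme just described can be sketched roughly as below. This is a simplified reading of it, not the real code: it does both passes per block of 64 rather than over the whole socket list, discards the payload instead of queueing it, and the names receive_pass() and poll_and_recv() are invented for the example.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#define POLL_BLOCK 64   /* block size mentioned above */

/* Poll up to POLL_BLOCK fds without blocking and recv() one datagram from
 * each readable fd. If `ready` is non-NULL, the fds that delivered data are
 * appended to it. Returns the number of datagrams received. */
static size_t poll_and_recv(const int *fds, size_t n, int *ready, size_t *n_ready)
{
    struct pollfd pfd[POLL_BLOCK];
    char buf[2048];
    size_t got = 0;

    for (size_t i = 0; i < n; i++) {
        pfd[i].fd = fds[i];
        pfd[i].events = POLLIN;
        pfd[i].revents = 0;
    }
    if (poll(pfd, n, 0) <= 0)   /* timeout 0: never sleeps here */
        return 0;

    for (size_t i = 0; i < n; i++) {
        if (!(pfd[i].revents & POLLIN))
            continue;
        if (recv(pfd[i].fd, buf, sizeof buf, MSG_DONTWAIT) < 0)
            continue;
        got++;   /* the real code would add the packet to a queue here */
        if (ready)
            ready[(*n_ready)++] = fds[i];
    }
    return got;
}

/* One receive pass over all sockets: the first poll covers every fd in the
 * block, the second poll covers only those that returned data. */
size_t receive_pass(const int *fds, size_t nfds)
{
    size_t total = 0;

    for (size_t off = 0; off < nfds; off += POLL_BLOCK) {
        size_t n = nfds - off < POLL_BLOCK ? nfds - off : POLL_BLOCK;
        int again[POLL_BLOCK];
        size_t n_again = 0;

        total += poll_and_recv(fds + off, n, again, &n_again);
        if (n_again > 0)
            total += poll_and_recv(again, n_again, NULL, NULL);
    }
    return total;
}
```

As written this reads at most two datagrams per socket per pass, which matches the "at least 2 of them" requirement but would leave a third message for the next 10ms tick.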

> More outlandish would be to abuse the mmsghdr->msg_len field to pass
> file descriptors and amortize the kernel page-table isolation cost
> across sockets. Blocking semantics would be weird, for starters.

It would be better to allow a single UDP socket to be bound to multiple ports,
and then use the cmsg data to sort out the actual destination port.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: epoll_wait() performance
  2019-11-28 11:12         ` Jesper Dangaard Brouer
@ 2019-11-28 16:37           ` David Laight
  2019-11-28 16:52             ` Willy Tarreau
  2019-12-19  7:57             ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 20+ messages in thread
From: David Laight @ 2019-11-28 16:37 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer'
  Cc: 'Marek Majkowski',
	linux-kernel, network dev, kernel-team, Paolo Abeni

From: Jesper Dangaard Brouer
> Sent: 28 November 2019 11:12
...
> > Can you test recv() as well?
> 
> Sure: https://github.com/netoptimizer/network-testing/commit/9e3c8b86a2d662
> 
> $ sudo taskset -c 1 ./udp_sink --port 9  --count $((10**6*2))
>           	run      count   	ns/pkt	pps		cycles	payload
> recvMmsg/32  	run:  0	 2000000	653.29	1530704.29	2351	18	 demux:1
> recvmsg   	run:  0	 2000000	631.01	1584760.06	2271	18	 demux:1
> read      	run:  0	 2000000	582.24	1717518.16	2096	18	 demux:1
> recvfrom  	run:  0	 2000000	547.26	1827269.12	1970	18	 demux:1
> recv      	run:  0	 2000000	547.37	1826930.39	1970	18	 demux:1
> 
> > I think it might be faster than read().
> 
> Slightly, but same speed as recvfrom.

I notice that your recvfrom() code doesn't request the source address.
So it is probably identical to recv().

My test system tends to increase its clock rate when busy.
(The fans speed up immediately, the cpu has a passive heatsink and all the
case fans are connected (via buffers) to the motherboard 'cpu fan' header.)
I could probably work out how to lock the frequency, but for some tests I run:
$ while :; do :; done
Putting 1 cpu into a userspace infinite loop makes them all run flat out
(until thermally throttled).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: epoll_wait() performance
  2019-11-28 16:37           ` David Laight
@ 2019-11-28 16:52             ` Willy Tarreau
  2019-12-19  7:57             ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 20+ messages in thread
From: Willy Tarreau @ 2019-11-28 16:52 UTC (permalink / raw)
  To: David Laight
  Cc: 'Jesper Dangaard Brouer', 'Marek Majkowski',
	linux-kernel, network dev, kernel-team, Paolo Abeni

On Thu, Nov 28, 2019 at 04:37:01PM +0000, David Laight wrote:
> My test system tends to increase its clock rate when busy.
> (The fans speed up immediately, the cpu has a passive heatsink and all the
> case fans are connected (via buffers) to the motherboard 'cpu fan' header.)
> I could probably work out how to lock the frequency, but for some tests I run:
> $ while :; do :; done
> Putting 1 cpu into a userspace infinite loop make them all run flat out
> (until thermally throttled).

It would be way more efficient to only make the CPUs spin in the idle
loop. I wrote a small module a few years ago for this, which allows me
to do the equivalent of "idle=poll" at runtime. It's very convenient
in VMs as it significantly reduces your latency and jitter by preventing
them from sleeping. It's also quite effective at stabilizing CPUs that have
a large difference between their highest and lowest frequencies.

I'm attaching the patch here, it's straightforward, it was made on
3.14 and still worked unmodified on 4.19, I'm sure it still does with
more recent kernels.

Hoping this helps,
Willy

---

From 22d67389c2b28d924260b8ced78111111006ed94 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 27 Jan 2016 17:24:54 +0100
Subject: staging: add a new "idle_poll" module to disable idle loop

Sometimes it's convenient to be able to switch to polled mode for the
idle loop. This module does just that, and reverts back to the original
mode once unloaded.
---
 drivers/staging/Kconfig               |  2 ++
 drivers/staging/Makefile              |  1 +
 drivers/staging/idle_poll/Kconfig     |  8 ++++++++
 drivers/staging/idle_poll/Makefile    |  7 +++++++
 drivers/staging/idle_poll/idle_poll.c | 22 ++++++++++++++++++++++
 kernel/cpu/idle.c                     |  1 +
 6 files changed, 41 insertions(+)
 create mode 100644 drivers/staging/idle_poll/Kconfig
 create mode 100644 drivers/staging/idle_poll/Makefile
 create mode 100644 drivers/staging/idle_poll/idle_poll.c

diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig
index 9594f204d4fc..936a2721b0f7 100644
--- a/drivers/staging/Kconfig
+++ b/drivers/staging/Kconfig
@@ -148,4 +148,6 @@ source "drivers/staging/dgnc/Kconfig"
 
 source "drivers/staging/dgap/Kconfig"
 
+source "drivers/staging/idle_poll/Kconfig"
+
 endif # STAGING
diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index 6ca1cf3dbcd4..d3d45aff73d2 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -66,3 +66,4 @@ obj-$(CONFIG_XILLYBUS)		+= xillybus/
 obj-$(CONFIG_DGNC)			+= dgnc/
 obj-$(CONFIG_DGAP)			+= dgap/
 obj-$(CONFIG_MTD_SPINAND_MT29F)	+= mt29f_spinand/
+obj-$(CONFIG_IDLE_POLL)		+= idle_poll/
diff --git a/drivers/staging/idle_poll/Kconfig b/drivers/staging/idle_poll/Kconfig
new file mode 100644
index 000000000000..4c96a21f66aa
--- /dev/null
+++ b/drivers/staging/idle_poll/Kconfig
@@ -0,0 +1,8 @@
+config IDLE_POLL
+	tristate "IDLE_POLL enabler"
+	help
+	    This module automatically enables polling-based idle loop.
+	    It is convenient in certain situations to simply load the
+	    module to disable the idle loop, or unload it to re-enable
+	    it.
+
diff --git a/drivers/staging/idle_poll/Makefile b/drivers/staging/idle_poll/Makefile
new file mode 100644
index 000000000000..60ad176f11f6
--- /dev/null
+++ b/drivers/staging/idle_poll/Makefile
@@ -0,0 +1,7 @@
+# This rule extracts the directory part from the location where this Makefile
+# is found, strips last slash and retrieves the last component which is used
+# to make a file name. It is a generic way of building modules which always
+# have the name of the directory they're located in. $(lastword) could have
+# been used instead of $(word $(words)) but it's bogus on some versions.
+
+obj-m += $(notdir $(patsubst %/,%,$(dir $(word $(words $(MAKEFILE_LIST)),$(MAKEFILE_LIST))))).o
diff --git a/drivers/staging/idle_poll/idle_poll.c b/drivers/staging/idle_poll/idle_poll.c
new file mode 100644
index 000000000000..6f39f85cc61d
--- /dev/null
+++ b/drivers/staging/idle_poll/idle_poll.c
@@ -0,0 +1,22 @@
+#include <linux/module.h>
+#include <linux/cpu.h>
+
+static int __init modinit(void)
+{
+	cpu_idle_poll_ctrl(true);
+	return 0;
+}
+
+static void __exit modexit(void)
+{
+	cpu_idle_poll_ctrl(false);
+	return;
+}
+
+module_init(modinit);
+module_exit(modexit);
+
+MODULE_DESCRIPTION("idle_poll enabler");
+MODULE_AUTHOR("Willy Tarreau");
+MODULE_VERSION("0.0.1");
+MODULE_LICENSE("GPL");
diff --git a/kernel/cpu/idle.c b/kernel/cpu/idle.c
index 277f494c2a9a..fbf648bc52b2 100644
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -22,6 +22,7 @@ void cpu_idle_poll_ctrl(bool enable)
 		WARN_ON_ONCE(cpu_idle_force_poll < 0);
 	}
 }
+EXPORT_SYMBOL(cpu_idle_poll_ctrl);
 
 #ifdef CONFIG_GENERIC_IDLE_POLL_SETUP
 static int __init cpu_idle_poll_setup(char *__unused)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: epoll_wait() performance
  2019-11-28 10:17             ` David Laight
@ 2019-11-30  1:07               ` Eric Dumazet
  2019-11-30 13:29                 ` Jakub Sitnicki
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2019-11-30  1:07 UTC (permalink / raw)
  To: David Laight, 'Paolo Abeni', Jesper Dangaard Brouer
  Cc: 'Marek Majkowski', linux-kernel, network dev, kernel-team



On 11/28/19 2:17 AM, David Laight wrote:
> From: Eric Dumazet
>> Sent: 27 November 2019 17:47
> ...
>> A QUIC server handles hundred of thousands of ' UDP flows' all using only one UDP socket
>> per cpu.
>>
>> This is really the only way to scale, and does not need kernel changes to efficiently
>> organize millions of UDP sockets (huge memory footprint even if we get right how
>> we manage them)
>>
>> Given that UDP has no state, there is really no point trying to have one UDP
>> socket per flow, and having to deal with epoll()/poll() overhead.
> 
> How can you do that when all the UDP flows have different destination port numbers?
> These are message flows not idempotent requests.
> I don't really want to collect the packets before they've been processed by IP.
> 
> I could write a driver that uses kernel udp sockets to generate a single message queue
> than can be efficiently processed from userspace - but it is a faff compiling it for
> the systems kernel version.

Well, if destination ports are not under your control,
you could also use AF_PACKET sockets; there is no need for 'UDP sockets' to receive UDP traffic,
especially if the rate is small.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: epoll_wait() performance
  2019-11-30  1:07               ` Eric Dumazet
@ 2019-11-30 13:29                 ` Jakub Sitnicki
  2019-12-02 12:24                   ` David Laight
  0 siblings, 1 reply; 20+ messages in thread
From: Jakub Sitnicki @ 2019-11-30 13:29 UTC (permalink / raw)
  To: David Laight
  Cc: Eric Dumazet, 'Paolo Abeni',
	Jesper Dangaard Brouer, 'Marek Majkowski',
	linux-kernel, network dev, kernel-team

On Sat, Nov 30, 2019 at 02:07 AM CET, Eric Dumazet wrote:
> On 11/28/19 2:17 AM, David Laight wrote:
>> From: Eric Dumazet
>>> Sent: 27 November 2019 17:47
>> ...
>>> A QUIC server handles hundred of thousands of ' UDP flows' all using only one UDP socket
>>> per cpu.
>>>
>>> This is really the only way to scale, and does not need kernel changes to efficiently
>>> organize millions of UDP sockets (huge memory footprint even if we get right how
>>> we manage them)
>>>
>>> Given that UDP has no state, there is really no point trying to have one UDP
>>> socket per flow, and having to deal with epoll()/poll() overhead.
>>
>> How can you do that when all the UDP flows have different destination port numbers?
>> These are message flows not idempotent requests.
>> I don't really want to collect the packets before they've been processed by IP.
>>
>> I could write a driver that uses kernel udp sockets to generate a single message queue
>> than can be efficiently processed from userspace - but it is a faff compiling it for
>> the systems kernel version.
>
> Well if destinations ports are not under your control,
> you also could use AF_PACKET sockets, no need for 'UDP sockets' to receive UDP traffic,
> especially it the rate is small.

Alternatively, you could steer UDP flows coming to a certain port range
to one UDP socket using TPROXY [0, 1].

TPROXY has the same downside as AF_PACKET, meaning that it requires at
least CAP_NET_RAW to create/set up the socket.

OTOH, with TPROXY you can gracefully co-reside with other services,
filtering on just the destination addresses you want in iptables/nftables.

Fan-out / load-balancing with reuseport to have one socket per CPU is
not possible, though. You would need to do that with Netfilter.

-Jakub

[0] https://www.kernel.org/doc/Documentation/networking/tproxy.txt
[1] https://blog.cloudflare.com/how-we-built-spectrum/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: epoll_wait() performance
  2019-11-30 13:29                 ` Jakub Sitnicki
@ 2019-12-02 12:24                   ` David Laight
  2019-12-02 16:47                     ` Willem de Bruijn
  0 siblings, 1 reply; 20+ messages in thread
From: David Laight @ 2019-12-02 12:24 UTC (permalink / raw)
  To: 'Jakub Sitnicki'
  Cc: Eric Dumazet, 'Paolo Abeni',
	Jesper Dangaard Brouer, 'Marek Majkowski',
	linux-kernel, network dev, kernel-team

From: Jakub Sitnicki <jakub@cloudflare.com>
> Sent: 30 November 2019 13:30
> On Sat, Nov 30, 2019 at 02:07 AM CET, Eric Dumazet wrote:
> > On 11/28/19 2:17 AM, David Laight wrote:
...
> >> How can you do that when all the UDP flows have different destination port numbers?
> >> These are message flows not idempotent requests.
> >> I don't really want to collect the packets before they've been processed by IP.
> >>
> >> I could write a driver that uses kernel udp sockets to generate a single message queue
> >> than can be efficiently processed from userspace - but it is a faff compiling it for
> >> the systems kernel version.
> >
> > Well if destinations ports are not under your control,
> > you also could use AF_PACKET sockets, no need for 'UDP sockets' to receive UDP traffic,
> > especially it the rate is small.
> 
> Alternatively, you could steer UDP flows coming to a certain port range
> to one UDP socket using TPROXY [0, 1].

I don't think that can work, we don't really know the list of valid UDP port
numbers ahead of time.

> TPROXY has the same downside as AF_PACKET, meaning that it requires at
> least CAP_NET_RAW to create/set up the socket.

CAP_NET_RAW wouldn't be a problem - we already send from a 'raw' socket.
(Which is a PITA for IPv4 because we have to determine the local IP address
in order to calculate the UDP checksum - so we have to have a local copy
of what (hopefully) matches the kernel routing tables.)

> OTOH, with TPROXY you can gracefully co-reside with other services,
> filtering on just the destination addresses you want in iptables/nftables.

Yes, the packets need to be extracted from normal processing - otherwise
the UDP code would send ICMP port unreachable errors.

> 
> Fan-out / load-balancing with reuseport to have one socket per CPU is
> not possible, though. You would need to do that with Netfilter.

Or put different port ranges onto different sockets.
 
> [0] https://www.kernel.org/doc/Documentation/networking/tproxy.txt
> [1] https://blog.cloudflare.com/how-we-built-spectrum/

The latter explains it a bit better than the former....

I think I'll try to understand the 'ftrace' documentation enough to add some
extra trace points to those displayed by schedviz.
So far I've only found some trivial example uses, not any actual docs.
Hopefully there is a faster way than reading the kernel sources :-(

	David




* Re: epoll_wait() performance
  2019-12-02 12:24                   ` David Laight
@ 2019-12-02 16:47                     ` Willem de Bruijn
  0 siblings, 0 replies; 20+ messages in thread
From: Willem de Bruijn @ 2019-12-02 16:47 UTC (permalink / raw)
  To: David Laight
  Cc: Jakub Sitnicki, Eric Dumazet, Paolo Abeni,
	Jesper Dangaard Brouer, Marek Majkowski, linux-kernel,
	network dev, kernel-team

On Mon, Dec 2, 2019 at 7:24 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Jakub Sitnicki <jakub@cloudflare.com>
> > Sent: 30 November 2019 13:30
> > On Sat, Nov 30, 2019 at 02:07 AM CET, Eric Dumazet wrote:
> > > On 11/28/19 2:17 AM, David Laight wrote:
> ...
> > >> How can you do that when all the UDP flows have different destination port numbers?
> > >> These are message flows not idempotent requests.
> > >> I don't really want to collect the packets before they've been processed by IP.
> > >>
> > >> I could write a driver that uses kernel udp sockets to generate a single message queue
> > > >> that can be efficiently processed from userspace - but it is a faff compiling it for
> > > >> the system's kernel version.
> > >
> > > > Well if destination ports are not under your control,
> > > > you also could use AF_PACKET sockets, no need for 'UDP sockets' to receive UDP traffic,
> > > > especially if the rate is small.
> >
> > Alternatively, you could steer UDP flows coming to a certain port range
> > to one UDP socket using TPROXY [0, 1].
>
> I don't think that can work, we don't really know the list of valid UDP port
> numbers ahead of time.

How about -j REDIRECT. That does not require all ports to be known
ahead of time.

> > TPROXY has the same downside as AF_PACKET, meaning that it requires at
> > least CAP_NET_RAW to create/set up the socket.
>
> CAP_NET_RAW wouldn't be a problem - we already send from a 'raw' socket.

One other issue when comparing udp and packet sockets is ip
defragmentation. That is critical code that is not at all trivial to
duplicate in userspace.

Even when choosing packet sockets, which normally would not
defragment, there is a trick. A packet socket with fanout and flag
PACKET_FANOUT_FLAG_DEFRAG will defragment before fanout.
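
A minimal sketch of that setup, following the pattern in packet(7). The
helper names fanout_arg() and open_defrag_fanout_socket() are hypothetical,
the choice of PACKET_FANOUT_HASH as the mode is an assumption, and creating
the socket needs CAP_NET_RAW.

```c
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Build the PACKET_FANOUT option value: group id in the low 16 bits,
 * fanout mode plus flags in the high 16 bits. */
static int fanout_arg(uint16_t group_id, uint16_t mode_and_flags)
{
    return (int)(group_id | ((uint32_t)mode_and_flags << 16));
}

/* Open a packet socket for IPv4 and join fanout group `group_id` with
 * defragmentation requested before fanout. Returns the fd, or -1 on error
 * (for example when the caller lacks CAP_NET_RAW). */
static int open_defrag_fanout_socket(uint16_t group_id)
{
    int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));
    int arg = fanout_arg(group_id,
                         PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Every socket that joins the same group must pass the same mode and flags.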


* Re: epoll_wait() performance
  2019-11-28 16:37           ` David Laight
  2019-11-28 16:52             ` Willy Tarreau
@ 2019-12-19  7:57             ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-19  7:57 UTC (permalink / raw)
  To: David Laight
  Cc: 'Marek Majkowski',
	linux-kernel, network dev, kernel-team, Paolo Abeni, brouer

On Thu, 28 Nov 2019 16:37:01 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Jesper Dangaard Brouer
> > Sent: 28 November 2019 11:12  
> ...
> > > Can you test recv() as well?  
> > 
> > Sure: https://github.com/netoptimizer/network-testing/commit/9e3c8b86a2d662
> > 
> > $ sudo taskset -c 1 ./udp_sink --port 9  --count $((10**6*2))
> >           	run      count   	ns/pkt	pps		cycles	payload
> > recvMmsg/32  	run:  0	 2000000	653.29	1530704.29	2351	18	 demux:1
> > recvmsg   	run:  0	 2000000	631.01	1584760.06	2271	18	 demux:1
> > read      	run:  0	 2000000	582.24	1717518.16	2096	18	 demux:1
> > recvfrom  	run:  0	 2000000	547.26	1827269.12	1970	18	 demux:1
> > recv      	run:  0	 2000000	547.37	1826930.39	1970	18	 demux:1
> >   
> > > I think it might be faster than read().  
> > 
> > Slightly, but same speed as recvfrom.  
> 
> I notice that your recvfrom() code doesn't request the source address.
> So it is probably identical to recv().

Created a GitHub issue/bug on this:
 https://github.com/netoptimizer/network-testing/issues/5

Feel free to fix this and send a patch/PR.
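
For reference, a minimal sketch of the two call shapes being compared. This
is not the udp_sink code; udp_loopback_read() is a hypothetical helper that
sends one datagram over loopback and reads it back either way.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send "ping" from one loopback UDP socket to another and read it back.
 * want_addr == 0: recvfrom(..., NULL, NULL), which behaves like recv().
 * want_addr != 0: the kernel also copies out the sender's address.
 * Stores the sender's bound port in *tx_port. Returns the port reported by
 * recvfrom() in host order (0 if not requested), or -1 on error. */
static int udp_loopback_read(int want_addr, unsigned *tx_port)
{
    struct sockaddr_in rx_addr = { .sin_family = AF_INET };
    struct sockaddr_in tx_addr = { .sin_family = AF_INET };
    struct sockaddr_in from;
    socklen_t len, from_len = sizeof(from);
    char buf[64];
    int rx, tx, ret = -1;
    ssize_t n;

    rx = socket(AF_INET, SOCK_DGRAM, 0);
    tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        goto out;

    /* Port 0 lets the kernel pick free ports; read them back afterwards. */
    rx_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    tx_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(rx, (struct sockaddr *)&rx_addr, sizeof(rx_addr)) < 0 ||
        bind(tx, (struct sockaddr *)&tx_addr, sizeof(tx_addr)) < 0)
        goto out;
    len = sizeof(rx_addr);
    if (getsockname(rx, (struct sockaddr *)&rx_addr, &len) < 0)
        goto out;
    len = sizeof(tx_addr);
    if (getsockname(tx, (struct sockaddr *)&tx_addr, &len) < 0)
        goto out;
    *tx_port = ntohs(tx_addr.sin_port);

    if (sendto(tx, "ping", 4, 0,
               (struct sockaddr *)&rx_addr, sizeof(rx_addr)) != 4)
        goto out;

    if (want_addr)
        n = recvfrom(rx, buf, sizeof(buf), 0,
                     (struct sockaddr *)&from, &from_len);
    else
        n = recvfrom(rx, buf, sizeof(buf), 0, NULL, NULL);
    if (n != 4)
        goto out;
    ret = want_addr ? ntohs(from.sin_port) : 0;
out:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return ret;
}
```

A benchmark that is meant to measure the cost of returning the source
address would need the want_addr != 0 form.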

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



end of thread, other threads:[~2019-12-19  7:57 UTC | newest]

Thread overview: 20+ messages
2019-11-22 11:17 epoll_wait() performance David Laight
2019-11-27  9:50 ` Marek Majkowski
2019-11-27 10:39   ` David Laight
2019-11-27 15:48     ` Jesper Dangaard Brouer
2019-11-27 16:04       ` David Laight
2019-11-27 19:48         ` Willem de Bruijn
2019-11-28 16:25           ` David Laight
2019-11-28 11:12         ` Jesper Dangaard Brouer
2019-11-28 16:37           ` David Laight
2019-11-28 16:52             ` Willy Tarreau
2019-12-19  7:57             ` Jesper Dangaard Brouer
2019-11-27 16:26       ` Paolo Abeni
2019-11-27 17:30         ` David Laight
2019-11-27 17:46           ` Eric Dumazet
2019-11-28 10:17             ` David Laight
2019-11-30  1:07               ` Eric Dumazet
2019-11-30 13:29                 ` Jakub Sitnicki
2019-12-02 12:24                   ` David Laight
2019-12-02 16:47                     ` Willem de Bruijn
2019-11-27 17:50           ` Paolo Abeni
