* [PATCH] softirq: let ksoftirqd do its job
From: Eric Dumazet @ 2016-08-31 17:42 UTC
  To: Peter Zijlstra, David Miller
  Cc: Rik van Riel, Paolo Abeni, Hannes Frederic Sowa,
	Jesper Dangaard Brouer, linux-kernel, netdev, Jonathan Corbet

From: Eric Dumazet <edumazet@google.com>

A while back, Paolo and Hannes sent an RFC patch adding support for a
threadable napi poll loop: https://patchwork.ozlabs.org/patch/620657/

The problem seems to be that softirqs are very aggressive and are often
handled by the current process, even if we are under stress and
ksoftirqd was scheduled precisely so that innocent threads would have a
better chance to make progress.

This patch makes sure that if ksoftirqd is running, we let it
perform the softirq work.

Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/

Tested:

 - NIC receiving traffic handled by CPU 0
 - UDP receiver running on CPU 0, using a single UDP socket.
 - Incoming flood of UDP packets targeting the UDP socket.

Before the patch, the UDP receiver could almost never get cpu cycles and
could only receive ~2,000 packets per second.

After the patch, cpu cycles are split 50/50 between the user application
and ksoftirqd/0, and we can effectively read ~900,000 packets per
second, a huge improvement in a DoS situation. (Note that more packets
are now dropped by the NIC itself, since the BH handlers get fewer cpu
cycles to drain the RX ring buffer.)

Since the load now runs in a well-identified thread context, an admin
can more easily tune process scheduling parameters if needed.

Reported-by: Paolo Abeni <pabeni@redhat.com>
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
---
 kernel/softirq.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..8ed90e3a88d6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -78,6 +78,17 @@ static void wakeup_softirqd(void)
 }
 
 /*
+ * If ksoftirqd is scheduled, we do not want to process pending softirqs
+ * right now. Let ksoftirqd handle this at its own rate, to get fairness.
+ */
+static bool ksoftirqd_running(void)
+{
+	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
+
+	return tsk && (tsk->state == TASK_RUNNING);
+}
+
+/*
  * preempt_count and SOFTIRQ_OFFSET usage:
  * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
  *   softirq processing.
@@ -313,7 +324,7 @@ asmlinkage __visible void do_softirq(void)
 
 	pending = local_softirq_pending();
 
-	if (pending)
+	if (pending && !ksoftirqd_running())
 		do_softirq_own_stack();
 
 	local_irq_restore(flags);
@@ -340,6 +351,9 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
+	if (ksoftirqd_running())
+		return;
+
 	if (!force_irqthreads) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
 		/*


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Jesper Dangaard Brouer @ 2016-08-31 19:40 UTC
  To: Eric Dumazet
  Cc: Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

On Wed, 31 Aug 2016 10:42:29 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> From: Eric Dumazet <edumazet@google.com>
> 
> A while back, Paolo and Hannes sent an RFC patch adding support for a
> threadable napi poll loop: https://patchwork.ozlabs.org/patch/620657/
> 
> The problem seems to be that softirqs are very aggressive and are often
> handled by the current process, even if we are under stress and
> ksoftirqd was scheduled precisely so that innocent threads would have a
> better chance to make progress.
> 
> This patch makes sure that if ksoftirqd is running, we let it
> perform the softirq work.
> 
> Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/
> 
> Tested:
> 
>  - NIC receiving traffic handled by CPU 0
>  - UDP receiver running on CPU 0, using a single UDP socket.
>  - Incoming flood of UDP packets targeting the UDP socket.
> 
> Before the patch, the UDP receiver could almost never get cpu cycles and
> could only receive ~2,000 packets per second.
> 
> After the patch, cpu cycles are split 50/50 between the user application
> and ksoftirqd/0, and we can effectively read ~900,000 packets per
> second, a huge improvement in a DoS situation. (Note that more packets
> are now dropped by the NIC itself, since the BH handlers get fewer cpu
> cycles to drain the RX ring buffer.)

I can confirm the improvement of approx 900Kpps (no wonder people have
been complaining about DoS against UDP/DNS servers).

BUT during my extensive testing of this patch, I also think that we
have not gotten to the bottom of this.  I was expecting to see a higher
(collective) PPS number as I add more UDP servers, but I don't.

Running many UDP netperf's with command:
 super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N

With 'top' I can see ksoftirq are still getting a higher %CPU time:

    PID   %CPU     TIME+  COMMAND
     3   36.5   2:28.98  ksoftirqd/0
 10724    9.6   0:01.05  netserver
 10722    9.3   0:01.05  netserver
 10723    9.3   0:01.05  netserver
 10725    9.3   0:01.05  netserver


> Since the load runs in well identified threads context, an admin can
> more easily tune process scheduling parameters if needed.

With this patch applied, I found that changing the UDP server process's
scheduler policy to SCHED_RR or SCHED_FIFO gave me a performance boost
from 900Kpps to 1.7Mpps, with not a single UDP packet dropped (even with
a single UDP stream; also tested with more)

Command used:
 sudo chrt --rr -p 20 $(pgrep netserver)

The scheduling picture also changes a lot:

   PID  %CPU   TIME+   COMMAND
 10783  24.3  0:21.53  netserver
 10784  24.3  0:21.53  netserver
 10785  24.3  0:21.52  netserver
 10786  24.3  0:21.50  netserver
     3   2.7  3:12.18  ksoftirqd/0
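
(For illustration, the same policy change can also be made from inside
the receiver itself via sched_setscheduler(2); a minimal sketch,
equivalent in spirit to the chrt command above and not part of the
patch under discussion:)

  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
          /* SCHED_RR priority 20, like "chrt --rr -p 20 <pid>";
           * pid 0 means the calling process; needs CAP_SYS_NICE */
          struct sched_param sp = { .sched_priority = 20 };

          if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
                  perror("sched_setscheduler");
                  return 1;
          }
          /* ... run the UDP receive loop under SCHED_RR ... */
          return 0;
  }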

 
> Reported-by: Paolo Abeni <pabeni@redhat.com>
> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: David Miller <davem@davemloft.net
> Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  kernel/softirq.c |   16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 17caf4b63342..8ed90e3a88d6 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -78,6 +78,17 @@ static void wakeup_softirqd(void)
>  }
>  
>  /*
> + * If ksoftirqd is scheduled, we do not want to process pending softirqs
> + * right now. Let ksoftirqd handle this at its own rate, to get fairness.
> + */
> +static bool ksoftirqd_running(void)
> +{
> +	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
> +
> +	return tsk && (tsk->state == TASK_RUNNING);
> +}
> +
> +/*
>   * preempt_count and SOFTIRQ_OFFSET usage:
>   * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
>   *   softirq processing.
> @@ -313,7 +324,7 @@ asmlinkage __visible void do_softirq(void)
>  
>  	pending = local_softirq_pending();
>  
> -	if (pending)
> +	if (pending && !ksoftirqd_running())
>  		do_softirq_own_stack();
>  
>  	local_irq_restore(flags);
> @@ -340,6 +351,9 @@ void irq_enter(void)
>  
>  static inline void invoke_softirq(void)
>  {
> +	if (ksoftirqd_running())
> +		return;
> +
>  	if (!force_irqthreads) {
>  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
>  		/*
> 
> 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Eric Dumazet @ 2016-08-31 20:42 UTC
  To: Jesper Dangaard Brouer
  Cc: Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:

> I can confirm the improvement of approx 900Kpps (no wonder people have
> been complaining about DoS against UDP/DNS servers).
> 
> BUT during my extensive testing, of this patch, I also think that we
> have not gotten to the bottom of this.  I was expecting to see a higher
> (collective) PPS number as I add more UDP servers, but I don't.
> 
> Running many UDP netperf's with command:
>  super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N

Are you sure the sender can send fast enough?

> 
> With 'top' I can see ksoftirq are still getting a higher %CPU time:
> 
>     PID   %CPU     TIME+  COMMAND
>      3   36.5   2:28.98  ksoftirqd/0
>  10724    9.6   0:01.05  netserver
>  10722    9.3   0:01.05  netserver
>  10723    9.3   0:01.05  netserver
>  10725    9.3   0:01.05  netserver

Looks much better on my machine, with "udprcv -n 4" (using 4 threads,
and 4 sockets using SO_REUSEPORT)

10755 root      20   0   34948      4      0 S  79.7  0.0   0:33.66 udprcv                                                                                                                                 
    3 root      20   0       0      0      0 R  19.9  0.0   0:25.49 ksoftirqd/0                 

Pressing 'H' in top gives:

    3 root      20   0       0      0      0 R 19.9  0.0   0:47.84 ksoftirqd/0                                                                                                                             
10756 root      20   0   34948      4      0 R 19.9  0.0   0:30.76 udprcv                                                                                                                                  
10757 root      20   0   34948      4      0 R 19.9  0.0   0:30.76 udprcv                                                                                                                                  
10758 root      20   0   34948      4      0 S 19.9  0.0   0:30.76 udprcv                                                                                                                                  
10759 root      20   0   34948      4      0 S 19.9  0.0   0:30.76 udprcv


Patch was on top of commit 071e31e254e0e0c438eecba3dba1d6e2d0da36c2
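
(udprcv is not a public tool; as a rough sketch of what such a receiver
looks like -- one socket per thread, all bound to the same UDP port via
SO_REUSEPORT so the kernel spreads the load -- consider the following
illustration, which is an assumption about its shape, not its source:)

  #define _GNU_SOURCE
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/socket.h>

  #define NTHREADS 4

  static void *sink(void *arg)
  {
          struct sockaddr_in addr;
          char buf[2048];
          int one = 1;
          int fd = socket(AF_INET, SOCK_DGRAM, 0);

          memset(&addr, 0, sizeof(addr));
          addr.sin_family = AF_INET;
          addr.sin_port = htons(9999);        /* example port */

          /* each thread has its own socket; SO_REUSEPORT lets them
           * all bind the same port and share the incoming flood */
          setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
          if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
                  perror("bind");
                  exit(1);
          }
          for (;;)                    /* drain as fast as possible */
                  recv(fd, buf, sizeof(buf), 0);
          return arg;
  }

  int main(void)
  {
          pthread_t tid[NTHREADS];
          int i;

          for (i = 0; i < NTHREADS; i++)
                  pthread_create(&tid[i], NULL, sink, NULL);
          for (i = 0; i < NTHREADS; i++)
                  pthread_join(tid[i], NULL);
          return 0;
  }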
                          
> 
> 
> > Since the load runs in well identified threads context, an admin can
> > more easily tune process scheduling parameters if needed.
> 
> With this patch applied, I found that changing the UDP server process's
> scheduler policy to SCHED_RR or SCHED_FIFO gave me a performance boost
> from 900Kpps to 1.7Mpps, with not a single UDP packet dropped (even with
> a single UDP stream; also tested with more)
> 
> Command used:
>  sudo chrt --rr -p 20 $(pgrep netserver)


Sure, this is what I mentioned in my changelog: once we properly
schedule and rely on ksoftirqd, tuning is available.

> 
> The scheduling picture also change a lot:
> 
>    PID  %CPU   TIME+   COMMAND
>  10783  24.3  0:21.53  netserver
>  10784  24.3  0:21.53  netserver
>  10785  24.3  0:21.52  netserver
>  10786  24.3  0:21.50  netserver
>      3   2.7  3:12.18  ksoftirqd/0
> 
>  


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Jesper Dangaard Brouer @ 2016-08-31 21:51 UTC
  To: Eric Dumazet
  Cc: Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

On Wed, 31 Aug 2016 13:42:30 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:
> 
> > I can confirm the improvement of approx 900Kpps (no wonder people have
> > been complaining about DoS against UDP/DNS servers).
> > 
> > BUT during my extensive testing, of this patch, I also think that we
> > have not gotten to the bottom of this.  I was expecting to see a higher
> > (collective) PPS number as I add more UDP servers, but I don't.
> > 
> > Running many UDP netperf's with command:
> >  super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N  
> 
> Are you sure sender can send fast enough ?

Yes, as I can see drops (UdpRcvbufErrors, i.e. the UDP receive buffer
limit being overrun). Switching to pktgen and udp_sink to be sure.

> > 
> > With 'top' I can see ksoftirq are still getting a higher %CPU time:
> > 
> >     PID   %CPU     TIME+  COMMAND
> >      3   36.5   2:28.98  ksoftirqd/0
> >  10724    9.6   0:01.05  netserver
> >  10722    9.3   0:01.05  netserver
> >  10723    9.3   0:01.05  netserver
> >  10725    9.3   0:01.05  netserver  
> 
> Looks much better on my machine, with "udprcv -n 4" (using 4 threads,
> and 4 sockets using SO_REUSEPORT)
> 
> 10755 root      20   0   34948      4      0 S  79.7  0.0   0:33.66 udprcv 
>     3 root      20   0       0      0      0 R  19.9  0.0   0:25.49 ksoftirqd/0                 
> 
> Pressing 'H' in top gives :
> 
>     3 root      20   0       0      0      0 R 19.9  0.0   0:47.84 ksoftirqd/0
> 10756 root      20   0   34948      4      0 R 19.9  0.0   0:30.76 udprcv 
> 10757 root      20   0   34948      4      0 R 19.9  0.0   0:30.76 udprcv 
> 10758 root      20   0   34948      4      0 S 19.9  0.0   0:30.76 udprcv
> 10759 root      20   0   34948      4      0 S 19.9  0.0   0:30.76 udprcv

Yes, I'm seeing the same when running 5 instances of my own udp_sink[1]:
 sudo taskset -c 0 ./udp_sink --port 10003 --recvmsg --reuse-port --count $((10**10))

 PID  S  %CPU     TIME+  COMMAND
    3 R  21.6   2:21.33  ksoftirqd/0
 3838 R  15.9   0:02.18  udp_sink
 3856 R  15.6   0:02.16  udp_sink
 3862 R  15.6   0:02.16  udp_sink
 3844 R  15.3   0:02.15  udp_sink
 3850 S  15.3   0:02.15  udp_sink

This is the expected result: adding more userspace receivers scales
up.  I needed 5 udp_sinks before I saw no drops; either this says the
job performed by ksoftirqd is 5 times faster than a single receiver, or
the collective queue size of the programs was large enough to absorb
the scheduling jitter.

The result from this run was 1,517,248 pps handled, without any
drops, all processes pinned to the same CPU.

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    1517225            0.0
 IpInDelivers                    1517224            0.0
 UdpInDatagrams                  1517248            0.0
 IpExtInOctets                   69793408           0.0
 IpExtInNoECTPkts                1517246            0.0

I'm acking this patch:

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

> 
> Patch was on top of commit 071e31e254e0e0c438eecba3dba1d6e2d0da36c2

Mine on top of commit 84fd1b191a9468

> > 
> >   
> > > Since the load runs in well identified threads context, an admin can
> > > more easily tune process scheduling parameters if needed.  
> > 
> > With this patch applied, I found that changing the UDP server process,
> > scheduler policy to SCHED_RR or SCHED_FIFO gave me a performance boost
> > from 900Kpps to 1.7Mpps, and not a single UDP packet dropped (even with
> > a single UDP stream, also tested with more)
> > 
> > Command used:
> >  sudo chrt --rr -p 20 $(pgrep netserver)  
> 
> 
> Sure, this is what I mentioned in my changelog : Once we properly
> schedule and rely on ksoftirqd, tuning is available.
> 
> > 
> > The scheduling picture also change a lot:
> > 
> >    PID  %CPU   TIME+   COMMAND
> >  10783  24.3  0:21.53  netserver
> >  10784  24.3  0:21.53  netserver
> >  10785  24.3  0:21.52  netserver
> >  10786  24.3  0:21.50  netserver
> >      3   2.7  3:12.18  ksoftirqd/0
> > 


[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Eric Dumazet @ 2016-08-31 22:27 UTC
  To: Jesper Dangaard Brouer
  Cc: Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

On Wed, 2016-08-31 at 23:51 +0200, Jesper Dangaard Brouer wrote:

> 
> The result from this run was 1,517,248 pps handled, without any
> drops, all processes pinned to the same CPU.
> 
>  $ nstat > /dev/null && sleep 1 && nstat
>  #kernel
>  IpInReceives                    1517225            0.0
>  IpInDelivers                    1517224            0.0
>  UdpInDatagrams                  1517248            0.0
>  IpExtInOctets                   69793408           0.0
>  IpExtInNoECTPkts                1517246            0.0
> 
> I'm acking this patch:
> 
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 

Thanks a lot for bringing the issue back to me again, and for all your
tests!


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Rick Jones @ 2016-08-31 22:47 UTC
  To: Eric Dumazet, Jesper Dangaard Brouer
  Cc: Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

With regard to drops, are both of you sure you're using the same socket 
buffer sizes?

In the meantime, is anything interesting happening with TCP_RR or 
TCP_STREAM?

happy benchmarking,

rick jones


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Eric Dumazet @ 2016-08-31 23:11 UTC
  To: Rick Jones
  Cc: Jesper Dangaard Brouer, Peter Zijlstra, David Miller,
	Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel,
	netdev, Jonathan Corbet

On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:
> With regard to drops, are both of you sure you're using the same socket 
> buffer sizes?

Does it really matter?

I used the standard /proc/sys/net/core/rmem_default, but under flood the
receive queue is almost always full, even if you make it bigger.

By varying its size, you only make batches bigger, and the number of
context switches should be lower if only two threads are competing for
the cpu.

The exact 'optimal' size would depend on various factors and on
application and platform constraints.
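
(For completeness, the per-socket counterpart of rmem_default is
SO_RCVBUF; a small sketch with a purely illustrative value -- note the
kernel doubles the requested size for bookkeeping overhead, and caps it
at net.core.rmem_max unless privileged SO_RCVBUFFORCE is used:)

  #include <stdio.h>
  #include <sys/socket.h>

  static int set_rcvbuf(int fd, int bytes)
  {
          socklen_t len = sizeof(bytes);
          int eff = 0;

          if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                         &bytes, sizeof(bytes)) != 0)
                  return -1;
          /* read back the effective (doubled, capped) size */
          getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &eff, &len);
          printf("requested %d, effective %d\n", bytes, eff);
          return 0;
  }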

> 
> In the meantime, is anything interesting happening with TCP_RR or 
> TCP_STREAM?

TCP_RR is driven by the network latency; we do not drop packets in the
socket itself.

TCP_STREAM is normally paced by the ability of the receiver to send ACK
packets. TCP has this self-regulating mode, unless the sender violates
the RFC(s).

If your question is: what happens if thousands of threads on the host
want the cpu, and ksoftirqd does not get enough cycles by virtue of
being a normal thread?

Then you are back to typical provisioning problems, and normally people
play with priorities and containers/cgroups, and/or various techniques
like RPS/RFS.

(You can change ksoftirqd's priority if you like.)


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Rick Jones @ 2016-08-31 23:29 UTC
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Peter Zijlstra, David Miller,
	Rik van Riel, Paolo Abeni, Hannes Frederic Sowa, linux-kernel,
	netdev, Jonathan Corbet

On 08/31/2016 04:11 PM, Eric Dumazet wrote:
> On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:
>> With regard to drops, are both of you sure you're using the same socket
>> buffer sizes?
>
> Does it really matter ?

At least at points in the past I have seen different drop counts at the 
SO_RCVBUF level depending on using (sometimes much) larger sizes.  The 
hypothesis I was operating under at the time was that this dealt with 
those situations where the netserver was held off from running for "a 
little while" from time to time.  It didn't change things for a 
sustained overload situation though.

>> In the meantime, is anything interesting happening with TCP_RR or
>> TCP_STREAM?
>
> TCP_RR is driven by the network latency, we do not drop packets in the
> socket itself.

I've been of the opinion that it (single stream) is driven by path 
length, sometimes by NIC latency.  But then I'm almost always measuring 
in the LAN rather than across the WAN.

happy benchmarking,

rick


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Jesper Dangaard Brouer @ 2016-09-01 10:38 UTC
  To: Rick Jones
  Cc: brouer, Eric Dumazet, Peter Zijlstra, David Miller, Rik van Riel,
	Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev,
	Jonathan Corbet


On Wed, 31 Aug 2016 16:29:56 -0700 Rick Jones <rick.jones2@hpe.com> wrote:
> On 08/31/2016 04:11 PM, Eric Dumazet wrote:
> > On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:  
> >> With regard to drops, are both of you sure you're using the same socket
> >> buffer sizes?  
> >
> > Does it really matter ?  
> 
> At least at points in the past I have seen different drop counts at the 
> SO_RCVBUF based on using (sometimes much) larger sizes.  The hypothesis 
> I was operating under at the time was that this dealt with those 
> situations where the netserver was held-off from running for "a little 
> while" from time to time.  It didn't change things for a sustained 
> overload situation though.

Yes, Rick, your hypothesis corresponds to my measurements.  The
userspace program is held off from running for "a little while" from
time to time.  I've measured this with perf sched record/latency.  It
is sort of a natural scheduler characteristic.
 The userspace UDP socket program consumes/needs more cycles to perform
its job than ksoftirqd does.  Thus the UDP program uses up its scheduler
time-slice, and ksoftirqd periodically gets scheduled several times in a
row, because the UDP program has run out of credits.

WARNING: Do not increase the socket queue size to paper over this
issue; it is the WRONG solution and it will give horrible latency
issues.

With the above warning given, I can tell you that yes, you are also
right that increasing the socket buffer size can be used to
mitigate/hide the packet drops.  You can even increase the socket size
so much that the drop problem "goes away".  The queue simply needs to be
deep enough to absorb the worst/maximum time the UDP program was
scheduled out.  The hidden effect that makes this work (without
contradicting queueing theory) is that it also slows down / costs more
cycles for ksoftirqd/NAPI, as enqueueing costs more than dropping
packets on a full queue.

You can measure the sched "Maximum delay" using:
 sudo perf sched record -C 0 sleep 10
 sudo perf sched latency

On my setup I measured a "Maximum delay" of approx 9 ms.  Given I can
see an incoming packet rate of 2.4Mpps (880Kpps reach the UDP program),
and knowing the network stack accounts memory via skb->truesize (approx
2048 bytes on this driver), I can calculate that I need an approx
45MBytes buffer ((2.4*10^6)*(9/1000)*2048 = 44.2MB).
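
(As a back-of-envelope check of that sizing, bytes = pps * max_delay *
truesize; a trivial program with the numbers above, purely illustrative:)

  #include <stdio.h>

  int main(void)
  {
          double pps      = 2.4e6;   /* incoming packet rate         */
          double delay    = 9e-3;    /* max sched delay, perf sched  */
          double truesize = 2048;    /* skb->truesize on this driver */

          printf("%.1f MB\n", pps * delay * truesize / 1e6);
          /* prints: 44.2 MB */
          return 0;
  }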

The PPS measurement comes from:

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    2335926            0.0
 IpInDelivers                    2335925            0.0
 UdpInDatagrams                  880086             0.0
 UdpInErrors                     1455850            0.0
 UdpRcvbufErrors                 1455850            0.0
 IpExtInOctets                   107453056          0.0

Changing queue size to 50MBytes :
 sysctl -w net/core/rmem_max=$((50*1024*1024)) ;\
 sysctl -w net.core.rmem_default=$((50*1024*1024))

The new result looks "nice", with no drops and 1.42Mpps delivered to
the UDP program, but in reality it is not nice for latency...

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    1425013            0.0
 IpInDelivers                    1425017            0.0
 UdpInDatagrams                  1432139            0.0
 IpExtInOctets                   65539328           0.0
 IpExtInNoECTPkts                1424771            0.0

Tracking of queue size, max, min and average:

 while (true); do netstat -uan | grep '0.0.0.0:9'; sleep 0.3; done |
  awk 'BEGIN {max=0;min=0xffffffff;sum=0;n=0} \
   {if ($2 > max) max=$2;
    if ($2 < min) min=$2;
    n++; sum+=$2;
    printf "%s Recv-Q: %d max: %d min: %d ave: %.3f\n",$1,$2,max,min,sum/n;}';
 Result:
  udp Recv-Q: 23624832 max: 47058176 min: 4352 ave: 25092687.698

I see a max queue of 47MBytes and, worse, an average standing queue of
25MBytes, which is really bad for the latency seen by the
application. Having this much outstanding memory is also bad for
CPU cache effects, and it stresses the memory allocator.
 I'm actually using this huge queue "misconfig" to stress the page
allocator and my page_pool implementation into worst-case situations ;-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Jesper Dangaard Brouer @ 2016-09-01 11:02 UTC
  To: Eric Dumazet
  Cc: brouer, Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

On Wed, 31 Aug 2016 23:51:16 +0200
Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:

> On Wed, 31 Aug 2016 13:42:30 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:
> >   
> > > I can confirm the improvement of approx 900Kpps (no wonder people have
> > > been complaining about DoS against UDP/DNS servers).
> > > 
> > > BUT during my extensive testing, of this patch, I also think that we
> > > have not gotten to the bottom of this.  I was expecting to see a higher
> > > (collective) PPS number as I add more UDP servers, but I don't.
> > > 
> > > Running many UDP netperf's with command:
> > >  super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N    
> > 
> > Are you sure sender can send fast enough ?  
> 
> Yes, as I can see drops (UdpRcvbufErrors, i.e. the UDP receive buffer
> limit being overrun). Switching to pktgen and udp_sink to be sure.
> 
> > > 
> > > With 'top' I can see ksoftirq are still getting a higher %CPU time:
> > > 
> > >     PID   %CPU     TIME+  COMMAND
> > >      3   36.5   2:28.98  ksoftirqd/0
> > >  10724    9.6   0:01.05  netserver
> > >  10722    9.3   0:01.05  netserver
> > >  10723    9.3   0:01.05  netserver
> > >  10725    9.3   0:01.05  netserver    
> > 
> > Looks much better on my machine, with "udprcv -n 4" (using 4 threads,
> > and 4 sockets using SO_REUSEPORT)
> > 
> > 10755 root      20   0   34948      4      0 S  79.7  0.0   0:33.66 udprcv 
> >     3 root      20   0       0      0      0 R  19.9  0.0   0:25.49 ksoftirqd/0                 
> > 
> > Pressing 'H' in top gives :
> > 
> >     3 root      20   0       0      0      0 R 19.9  0.0   0:47.84 ksoftirqd/0
> > 10756 root      20   0   34948      4      0 R 19.9  0.0   0:30.76 udprcv 
> > 10757 root      20   0   34948      4      0 R 19.9  0.0   0:30.76 udprcv 
> > 10758 root      20   0   34948      4      0 S 19.9  0.0   0:30.76 udprcv
> > 10759 root      20   0   34948      4      0 S 19.9  0.0   0:30.76 udprcv  
> 
> Yes, I'm seeing the same when running 5 instances of my own udp_sink[1]:
>  sudo taskset -c 0 ./udp_sink --port 10003 --recvmsg --reuse-port --count $((10**10))
> 
>  PID  S  %CPU     TIME+  COMMAND
>     3 R  21.6   2:21.33  ksoftirqd/0
>  3838 R  15.9   0:02.18  udp_sink
>  3856 R  15.6   0:02.16  udp_sink
>  3862 R  15.6   0:02.16  udp_sink
>  3844 R  15.3   0:02.15  udp_sink
>  3850 S  15.3   0:02.15  udp_sink
> 
> This is the expected result: adding more userspace receivers scales
> up.  I needed 5 udp_sinks before I saw no drops; either this says the
> job performed by ksoftirqd is 5 times faster than a single receiver, or
> the collective queue size of the programs was large enough to absorb
> the scheduling jitter.

I need some help from scheduler people explaining this!

In the above run of udp_sink (which had the expected behavior), I ran
udp_sink in 5 different xterms/shells.  Below, I'm running all 5
udp_sink programs from the same bash shell (just backgrounding them).

   PID  S  %CPU     TIME+  COMMAND
     3  R  50.0  29:02.23  ksoftirqd/0
 10881  R  10.7   1:01.61  udp_sink
 10837  R  10.0   1:05.20  udp_sink
 10852  S  10.0   1:01.78  udp_sink
 10862  R  10.0   1:05.19  udp_sink
 10844  S   9.7   1:01.91  udp_sink

This is strange: why is ksoftirqd/0 getting 50% of the CPU time???


And I'm no longer getting the full throughput delivered into userspace
(as I did before with 5 receivers).

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    1234368            0.0
 IpInDelivers                    1234368            0.0
 UdpInDatagrams                  1133971            0.0
 UdpInErrors                     80332              0.0
 UdpRcvbufErrors                 80332              0.0
 IpExtInOctets                   56792704           0.0
 IpExtInNoECTPkts                1234624            0.0

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Hannes Frederic Sowa @ 2016-09-01 11:11 UTC
  To: Jesper Dangaard Brouer, Eric Dumazet
  Cc: Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni,
	linux-kernel, netdev, Jonathan Corbet

On 01.09.2016 13:02, Jesper Dangaard Brouer wrote:
> On Wed, 31 Aug 2016 23:51:16 +0200
> Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
> 
>> On Wed, 31 Aug 2016 13:42:30 -0700
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>> On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:
>>>   
>>>> I can confirm the improvement of approx 900Kpps (no wonder people have
>>>> been complaining about DoS against UDP/DNS servers).
>>>>
>>>> BUT during my extensive testing, of this patch, I also think that we
>>>> have not gotten to the bottom of this.  I was expecting to see a higher
>>>> (collective) PPS number as I add more UDP servers, but I don't.
>>>>
>>>> Running many UDP netperf's with command:
>>>>  super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N    
>>>
>>> Are you sure sender can send fast enough ?  
>>
>> Yes, as I can see drops (UdpRcvbufErrors, i.e. the UDP receive buffer
>> limit being overrun). Switching to pktgen and udp_sink to be sure.
>>
>>>>
>>>> With 'top' I can see ksoftirq are still getting a higher %CPU time:
>>>>
>>>>     PID   %CPU     TIME+  COMMAND
>>>>      3   36.5   2:28.98  ksoftirqd/0
>>>>  10724    9.6   0:01.05  netserver
>>>>  10722    9.3   0:01.05  netserver
>>>>  10723    9.3   0:01.05  netserver
>>>>  10725    9.3   0:01.05  netserver    
>>>
>>> Looks much better on my machine, with "udprcv -n 4" (using 4 threads,
>>> and 4 sockets using SO_REUSEPORT)
>>>
>>> 10755 root      20   0   34948      4      0 S  79.7  0.0   0:33.66 udprcv 
>>>     3 root      20   0       0      0      0 R  19.9  0.0   0:25.49 ksoftirqd/0                 
>>>
>>> Pressing 'H' in top gives :
>>>
>>>     3 root      20   0       0      0      0 R 19.9  0.0   0:47.84 ksoftirqd/0
>>> 10756 root      20   0   34948      4      0 R 19.9  0.0   0:30.76 udprcv 
>>> 10757 root      20   0   34948      4      0 R 19.9  0.0   0:30.76 udprcv 
>>> 10758 root      20   0   34948      4      0 S 19.9  0.0   0:30.76 udprcv
>>> 10759 root      20   0   34948      4      0 S 19.9  0.0   0:30.76 udprcv  
>>
>> Yes, I'm seeing the same when running 5 instances of my own udp_sink[1]:
>>  sudo taskset -c 0 ./udp_sink --port 10003 --recvmsg --reuse-port --count $((10**10))
>>
>>  PID  S  %CPU     TIME+  COMMAND
>>     3 R  21.6   2:21.33  ksoftirqd/0
>>  3838 R  15.9   0:02.18  udp_sink
>>  3856 R  15.6   0:02.16  udp_sink
>>  3862 R  15.6   0:02.16  udp_sink
>>  3844 R  15.3   0:02.15  udp_sink
>>  3850 S  15.3   0:02.15  udp_sink
>>
>> This is the expected result: adding more userspace receivers scales
>> up.  I needed 5 udp_sinks before I saw no drops; either this says the
>> job performed by ksoftirqd is 5 times faster than a single receiver, or
>> the collective queue size of the programs was large enough to absorb
>> the scheduling jitter.
> 
> I need some help from scheduler people explaining this!
> 
> In the above run of udp_sink (which had the expected behavior), I ran
> udp_sink in 5 different xterms/shells.  Below, I'm running all 5
> udp_sink programs from the same bash shell (just backgrounding them).
> 
>    PID  S  %CPU     TIME+  COMMAND
>      3  R  50.0  29:02.23  ksoftirqd/0
>  10881  R  10.7   1:01.61  udp_sink
>  10837  R  10.0   1:05.20  udp_sink
>  10852  S  10.0   1:01.78  udp_sink
>  10862  R  10.0   1:05.19  udp_sink
>  10844  S   9.7   1:01.91  udp_sink

Could you enable schedstats (sysctl schedstats) and show
/proc/ksoftirq*/sched?

Thanks,
Hannes


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Peter Zijlstra @ 2016-09-01 11:53 UTC
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:
>    PID  S  %CPU     TIME+  COMMAND
>      3  R  50.0  29:02.23  ksoftirqd/0
>  10881  R  10.7   1:01.61  udp_sink
>  10837  R  10.0   1:05.20  udp_sink
>  10852  S  10.0   1:01.78  udp_sink
>  10862  R  10.0   1:05.19  udp_sink
>  10844  S   9.7   1:01.91  udp_sink
> 
> This is strange, why is ksoftirqd/0 getting 50% of the CPU time???

Do you run your udp_sink thingy in a cpu-cgroup?


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Hannes Frederic Sowa @ 2016-09-01 12:01 UTC
  To: Eric Dumazet, Peter Zijlstra, David Miller
  Cc: Rik van Riel, Paolo Abeni, Jesper Dangaard Brouer, linux-kernel,
	netdev, Jonathan Corbet

On 31.08.2016 19:42, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> A while back, Paolo and Hannes sent an RFC patch adding support for a
> threadable napi poll loop: https://patchwork.ozlabs.org/patch/620657/
> 
> The problem seems to be that softirqs are very aggressive and are often
> handled by the current process, even if we are under stress and
> ksoftirqd was scheduled precisely so that innocent threads would have a
> better chance to make progress.
> 
> This patch makes sure that if ksoftirqd is running, we let it
> perform the softirq work.
> 
> Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/
> 
> Tested:
> 
>  - NIC receiving traffic handled by CPU 0
>  - UDP receiver running on CPU 0, using a single UDP socket.
>  - Incoming flood of UDP packets targeting the UDP socket.
> 
> Before the patch, the UDP receiver could almost never get cpu cycles and
> could only receive ~2,000 packets per second.
> 
> After the patch, cpu cycles are split 50/50 between the user application
> and ksoftirqd/0, and we can effectively read ~900,000 packets per
> second, a huge improvement in a DoS situation. (Note that more packets
> are now dropped by the NIC itself, since the BH handlers get fewer cpu
> cycles to drain the RX ring buffer.)
> 
> Since the load runs in well identified threads context, an admin can
> more easily tune process scheduling parameters if needed.
> 
> Reported-by: Paolo Abeni <pabeni@redhat.com>
> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: David Miller <davem@davemloft.net>
> Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Thanks,
Hannes


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Hannes Frederic Sowa @ 2016-09-01 12:05 UTC
  To: Eric Dumazet, Jesper Dangaard Brouer
  Cc: Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni,
	linux-kernel, netdev, Jonathan Corbet

On 31.08.2016 22:42, Eric Dumazet wrote:
> On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:
> 
>> I can confirm the improvement of approx 900Kpps (no wonder people have
>> been complaining about DoS against UDP/DNS servers).
>>
>> BUT during my extensive testing, of this patch, I also think that we
>> have not gotten to the bottom of this.  I was expecting to see a higher
>> (collective) PPS number as I add more UDP servers, but I don't.
>>
>> Running many UDP netperf's with command:
>>  super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N
> 
> Are you sure sender can send fast enough ?
> 
>>
>> With 'top' I can see ksoftirq are still getting a higher %CPU time:
>>
>>     PID   %CPU     TIME+  COMMAND
>>      3   36.5   2:28.98  ksoftirqd/0
>>  10724    9.6   0:01.05  netserver
>>  10722    9.3   0:01.05  netserver
>>  10723    9.3   0:01.05  netserver
>>  10725    9.3   0:01.05  netserver
> 
> Looks much better on my machine, with "udprcv -n 4" (using 4 threads,
> and 4 sockets using SO_REUSEPORT)

Would it make sense to include the used socket backlog in the udp
socket lookup compute_score calculation? Just want to throw out the
idea; I could actually imagine it also causing bad side effects.


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Jesper Dangaard Brouer @ 2016-09-01 12:29 UTC
  To: Peter Zijlstra
  Cc: Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet,
	brouer

On Thu, 1 Sep 2016 13:53:56 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:
> >    PID  S  %CPU     TIME+  COMMAND
> >      3  R  50.0  29:02.23  ksoftirqd/0
> >  10881  R  10.7   1:01.61  udp_sink
> >  10837  R  10.0   1:05.20  udp_sink
> >  10852  S  10.0   1:01.78  udp_sink
> >  10862  R  10.0   1:05.19  udp_sink
> >  10844  S   9.7   1:01.91  udp_sink
> > 
> > This is strange, why is ksoftirqd/0 getting 50% of the CPU time???  
> 
> Do you run your udp_sink thingy in a cpu-cgroup?

That was also Paolo's feedback (IRC).  I'm not aware of it, but it
might be some distribution (Fedora 22) default thing.

How do I verify/check if I have enabled a cpu-cgroup?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Jesper Dangaard Brouer @ 2016-09-01 12:38 UTC
  To: Peter Zijlstra
  Cc: Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet,
	brouer

On Thu, 1 Sep 2016 14:29:25 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Thu, 1 Sep 2016 13:53:56 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:  
> > >    PID  S  %CPU     TIME+  COMMAND
> > >      3  R  50.0  29:02.23  ksoftirqd/0
> > >  10881  R  10.7   1:01.61  udp_sink
> > >  10837  R  10.0   1:05.20  udp_sink
> > >  10852  S  10.0   1:01.78  udp_sink
> > >  10862  R  10.0   1:05.19  udp_sink
> > >  10844  S   9.7   1:01.91  udp_sink
> > > 
> > > This is strange, why is ksoftirqd/0 getting 50% of the CPU time???    
> > 
> > Do you run your udp_sink thingy in a cpu-cgroup?  
> 
> That was also Paolo's feedback (IRC).  I'm not aware of it, but it
> might be some distribution (Fedora 22) default thing.

Correction, on the server-under-test, I'm actually running RHEL7.2


> How do I verify/check if I have enabled a cpu-cgroup?

Hannes says I can look in "/proc/self/cgroup"

 $ cat /proc/self/cgroup
 7:net_cls:/
 6:blkio:/
 5:devices:/
 4:perf_event:/
 3:cpu,cpuacct:/
 2:cpuset:/
 1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
 
And that "/" indicates I've not enabled cgroups, right?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Peter Zijlstra @ 2016-09-01 12:48 UTC
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

On Thu, Sep 01, 2016 at 02:38:59PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 1 Sep 2016 14:29:25 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > On Thu, 1 Sep 2016 13:53:56 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:  
> > > >    PID  S  %CPU     TIME+  COMMAND
> > > >      3  R  50.0  29:02.23  ksoftirqd/0
> > > >  10881  R  10.7   1:01.61  udp_sink
> > > >  10837  R  10.0   1:05.20  udp_sink
> > > >  10852  S  10.0   1:01.78  udp_sink
> > > >  10862  R  10.0   1:05.19  udp_sink
> > > >  10844  S   9.7   1:01.91  udp_sink
> > > > 
> > > > This is strange, why is ksoftirqd/0 getting 50% of the CPU time???    
> > > 
> > > Do you run your udp_sink thingy in a cpu-cgroup?  
> > 
> > That was also Paolo's feedback (IRC).  I'm not aware of it, but it
> > might be some distribution (Fedora 22) default thing.
> 
> Correction, on the server-under-test, I'm actually running RHEL7.2
> 
> 
> > How do I verify/check if I have enabled a cpu-cgroup?
> 
> Hannes says I can look in "/proc/self/cgroup"
> 
>  $ cat /proc/self/cgroup
>  7:net_cls:/
>  6:blkio:/
>  5:devices:/
>  4:perf_event:/
>  3:cpu,cpuacct:/
>  2:cpuset:/
>  1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
>  
> And that "/" indicate I've not enabled cgroups, right?

Mostly so. I think RHEL/Fedora has SCHED_AUTOGROUP enabled, and you can
find that through:

cat /proc/self/autogroup

And disable with the noautogroup boot param, or:

echo 0 > /proc/sys/kernel/sched_autogroup_enabled

although the latter will leave the current state intact while avoiding
the creation of any further autogroups, iirc.


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Eric Dumazet @ 2016-09-01 12:51 UTC
  To: Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, Peter Zijlstra, David Miller,
	Rik van Riel, Paolo Abeni, linux-kernel, netdev, Jonathan Corbet

On Thu, 2016-09-01 at 14:05 +0200, Hannes Frederic Sowa wrote:

> Would it make sense to include used socket backlog in udp socket lookup
> compute_score calculation? Just want to throw out the idea, I actually
> could imagine to also cause bad side effects.

Hopefully we can get rid of the backlog for UDP, by no longer having to
lock the socket in the RX path, and by performing memory charging in a
better way.

The backlog for TCP is problematic for high speed flows, and for UDP it
is problematic in flood situations, as a single recvmsg() might have to
process thousands of skbs before returning to user space.

What you suggest is going to be difficult:

1) Packets of a 5-tuple (eg a QUIC flow) won't all land in the same
silo, which will cause reorders or application issues.

2) SO_ATTACH_REUSEPORT_CBPF won't have access to the socket(s) backlog
to perform the choice.

Thanks.


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Eric Dumazet @ 2016-09-01 12:57 UTC
  To: Jesper Dangaard Brouer
  Cc: Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

On Thu, 2016-09-01 at 14:38 +0200, Jesper Dangaard Brouer wrote:

> Correction, on the server-under-test, I'm actually running RHEL7.2
> 
> 
> > How do I verify/check if I have enabled a cpu-cgroup?
> 
> Hannes says I can look in "/proc/self/cgroup"
> 
>  $ cat /proc/self/cgroup
>  7:net_cls:/
>  6:blkio:/
>  5:devices:/
>  4:perf_event:/
>  3:cpu,cpuacct:/
>  2:cpuset:/
>  1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
>  
> And that "/" indicate I've not enabled cgroups, right?
> 

In my experience, I have found that the times displayed by top are often
off for softirq processing.

Before applying my patch, top shows a very small amount of cpu time for
udp_rcv and ksoftirqd/0, while obviously cpu 0 is completely busy.

Make sure to try the latest Linus tree, as I did yesterday, because
apparently things are better than a few weeks back.

BTW, even 'perf top' sometimes has problems showing me cycles spent in
softirq. I need to make sure the cpu processing NIC interrupts also
spends cycles in some user space program to get meaningful results.


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Hannes Frederic Sowa @ 2016-09-01 13:00 UTC
  To: Eric Dumazet, Jesper Dangaard Brouer
  Cc: Peter Zijlstra, David Miller, Rik van Riel, Paolo Abeni,
	linux-kernel, netdev, Jonathan Corbet

On 01.09.2016 14:57, Eric Dumazet wrote:
> On Thu, 2016-09-01 at 14:38 +0200, Jesper Dangaard Brouer wrote:
> 
>> Correction, on the server-under-test, I'm actually running RHEL7.2
>>
>>
>>> How do I verify/check if I have enabled a cpu-cgroup?
>>
>> Hannes says I can look in "/proc/self/cgroup"
>>
>>  $ cat /proc/self/cgroup
>>  7:net_cls:/
>>  6:blkio:/
>>  5:devices:/
>>  4:perf_event:/
>>  3:cpu,cpuacct:/
>>  2:cpuset:/
>>  1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
>>  
>> And that "/" indicate I've not enabled cgroups, right?
>>
> 
> In my experience, I have found that the times displayed by top are often
> off for softirq processing.
> 
> Before applying my patch, top shows a very small amount of cpu time for
> udp_rcv and ksoftirqd/0, while obviously cpu 0 is completely busy.
> 
> Make sure to try the latest Linus tree, as I did yesterday, because
> apparently things are better than a few weeks back.
> 
> BTW, even 'perf top' sometimes has problems showing me cycles spent in
> softirq. I need to make sure the cpu processing NIC interrupts also
> spends cycles in some user space program to get meaningful results.

I think that ksoftirqd time is actually accounted to system:

excerpt from irqtime_account_process_tick in kernel/sched/cputime.c

	if (this_cpu_ksoftirqd() == p) {
		/*
		 * ksoftirqd time do not get accounted in cpu_softirq_time.
		 * So, we have to handle it separately here.
		 * Also, p->stime needs to be updated for ksoftirqd.
		 */
		__account_system_time(p, cputime, scaled, CPUTIME_SOFTIRQ);
	} else if (user_tick) {


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Eric Dumazet @ 2016-09-01 13:06 UTC
  To: Jesper Dangaard Brouer
  Cc: Rick Jones, Peter Zijlstra, David Miller, Rik van Riel,
	Paolo Abeni, Hannes Frederic Sowa, linux-kernel, netdev,
	Jonathan Corbet

On Thu, 2016-09-01 at 12:38 +0200, Jesper Dangaard Brouer wrote:

> I see a max queue of 47MBytes and, worse, an average standing queue of
> 25MBytes, which is really bad for the latency seen by the
> application. Having this much outstanding memory is also bad for
> CPU cache effects, and it stresses the memory allocator.
>  I'm actually using this huge queue "misconfig" to stress the page
> allocator and my page_pool implementation into worst-case situations ;-)
> 

Since commit 95766fff6b9a78d11f ("[UDP]: Add memory accounting."),
it is dangerous to have a big SO_RCVBUF value, since it adds unexpected
recvmsg() latencies.

1) User thread locks the socket.
2) Gets one skb from the receive queue.
   3) An incoming flood of UDP packets is processed by softirq.
   4) The socket is found 'owned by the user'.
   5) Packets are parked into the 'socket backlog', up to the SO_RCVBUF
      limit.
6) User thread releases the socket.
7) It finds many skbs in the backlog, has to process them _all_, and
   re-inject them into the socket receive queue.
8) Return to user space.


Time spent in 7) can be in the order of millions of cpu cycles...

At least starting from 5413d1babe8f10d ("net: do not block BH while
processing socket backlog"), we no longer block BH while doing 7), and
we have cond_resched() points.
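
(For readers unfamiliar with step 7: the backlog flush done at socket
release time looks roughly like the following simplified paraphrase of
__release_sock() in net/core/sock.c -- a sketch, not the verbatim
kernel source:)

  /* every skb parked in sk_backlog while the socket was owned by
   * the user is replayed through the protocol receive handler
   * before recvmsg() can return -- the cost described in 7) */
  static void flush_backlog(struct sock *sk)
  {
          struct sk_buff *skb = sk->sk_backlog.head;

          sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
          while (skb) {
                  struct sk_buff *next = skb->next;

                  skb->next = NULL;
                  sk_backlog_rcv(sk, skb);  /* e.g. UDP enqueue   */
                  cond_resched();           /* since 5413d1babe8f */
                  skb = next;
          }
  }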


* Re: [PATCH] softirq: let ksoftirqd do its job
From: Eric Dumazet @ 2016-09-01 13:25 UTC
  To: Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, Peter Zijlstra, David Miller,
	Rik van Riel, Paolo Abeni, linux-kernel, netdev, Jonathan Corbet

On Thu, 2016-09-01 at 15:00 +0200, Hannes Frederic Sowa wrote:
> On 01.09.2016 14:57, Eric Dumazet wrote:
> > On Thu, 2016-09-01 at 14:38 +0200, Jesper Dangaard Brouer wrote:
> > 
> >> Correction, on the server-under-test, I'm actually running RHEL7.2
> >>
> >>
> >>> How do I verify/check if I have enabled a cpu-cgroup?
> >>
> >> Hannes says I can look in "/proc/self/cgroup"
> >>
> >>  $ cat /proc/self/cgroup
> >>  7:net_cls:/
> >>  6:blkio:/
> >>  5:devices:/
> >>  4:perf_event:/
> >>  3:cpu,cpuacct:/
> >>  2:cpuset:/
> >>  1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
> >>  
> >> And that "/" indicate I've not enabled cgroups, right?
> >>
> > 
> > In my experience, I have found that the times displayed by top are often
> > off for softirq processing.
> > 
> > Before applying my patch, top shows a very small amount of cpu time for
> > udp_rcv and ksoftirqd/0, while obviously cpu 0 is completely busy.
> > 
> > Make sure to try the latest Linus tree, as I did yesterday, because
> > apparently things are better than a few weeks back.
> > 
> > BTW, even 'perf top' sometimes has problems showing me cycles spent in
> > softirq. I need to make sure the cpu processing NIC interrupts also
> > spends cycles in some user space program to get meaningful results.
> 
> I think that ksoftirqd time is actually accounted to system:
> 
> excerpt from irqtime_account_process_tick in kernel/sched/cputime.c
> 
> 	if (this_cpu_ksoftirqd() == p) {
> 		/*
> 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
> 		 * So, we have to handle it separately here.
> 		 * Also, p->stime needs to be updated for ksoftirqd.
> 		 */
> 		__account_system_time(p, cputime, scaled, CPUTIME_SOFTIRQ);
> 	} else if (user_tick) {
> 

Tell me more about kernel/sched/cputime.c stability over recent linux
versions ;)

git log --oneline v4.2.. kernel/sched/cputime.c
03cbc732639ddcad15218c4b2046d255851ff1e3 sched/cputime: Resync steal time when guest & host lose sync
173be9a14f7b2e901cf77c18b1aafd4d672e9d9e sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression
26f2c75cd2cf10a6120ef02ca9a94db77cc9c8e0 sched/cputime: Fix omitted ticks passed in parameter
f9bcf1e0e0145323ba2cf72ecad5264ff3883eb1 sched/cputime: Fix steal time accounting
08fd8c17686c6b09fa410a26d516548dd80ff147 Merge tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
553bf6bbfd8a540c70aee28eb50e24caff456a03 sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
0cfdf9a198b0d4f5ad6c87d894db7830b796b2cc sched/cputime: Clean up the old vtime gen irqtime accounting completely
b58c35840521bb02b150e1d0d34ca9197f8b7145 sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
57430218317e5b280a80582a139b26029c25de6c sched/cputime: Count actually elapsed irq & softirq time
ecb23dc6f2eff0ce64dd60351a81f376f13b12cc xen: add steal_clock support on x86
807e5b80687c06715d62df51a5473b231e3e8b15 sched/cputime: Add steal time support to full dynticks CPU time accounting
f9c904b7613b8b4c85b10cd6b33ad41b2843fa9d sched/cputime: Fix steal_account_process_tick() to always return jiffies
ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
c9bed1cf51011c815d88288b774865d013ca78a8 Merge tag 'for-linus-4.5-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
1fe7c4ef88bd32e039f5f4126537c3f20c340414 missing include asm/paravirt.h in cputime.c
b7ce2277f087fd052e7e1bbf432f7fecbee82bb6 sched/cputime: Convert vtime_seqlock to seqcount
e592539466380279a9e6e6fdfe4545aa54f22593 sched/cputime: Introduce vtime accounting check for readers
55dbdcfa05533f44c9416070b8a9f6432b22314a sched/cputime: Rename vtime_accounting_enabled() to vtime_accounting_cpu_enabled()
cab245d68c38afff1a4c4d018ab7e1d316982f5d sched/cputime: Correctly handle task guest time on housekeepers
7098c1eac75dc03fdbb7249171a6e68ce6044a5a sched/cputime: Clarify vtime symbols and document them
7877a0ba5ec63c7b0111b06c773f1696fa17b35a sched/cputime: Remove extra cost in task_cputime()
2541117b0cf79977fa11a0d6e17d61010677bd7b sched/cputime: Fix invalid gtime in proc
9eec50b8bbe1535c440a1ee88c1958f78fc55957 kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME support
9d7fb04276481c59610983362d8e023d262b58ca sched/cputime: Guarantee stime + utime == rtime
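
FWIW, independent of top, ksoftirqd's accumulated utime/stime can be read
straight out of procfs to sanity check the accounting (fields 14 and 15
of /proc/<pid>/stat are utime and stime, in clock ticks):

 $ pid=$(pgrep -x 'ksoftirqd/0')
 $ awk '{print "utime:", $14, "stime:", $15}' /proc/$pid/stat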

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] softirq: let ksoftirqd do its job
  2016-09-01 12:48                             ` Peter Zijlstra
@ 2016-09-01 13:30                               ` Jesper Dangaard Brouer
  2016-09-01 15:28                                 ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-01 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet,
	brouer

On Thu, 1 Sep 2016 14:48:39 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 01, 2016 at 02:38:59PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 1 Sep 2016 14:29:25 +0200
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >   
> > > On Thu, 1 Sep 2016 13:53:56 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > >   
> > > > On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:    
> > > > >    PID  S  %CPU     TIME+  COMMAND
> > > > >      3  R  50.0  29:02.23  ksoftirqd/0
> > > > >  10881  R  10.7   1:01.61  udp_sink
> > > > >  10837  R  10.0   1:05.20  udp_sink
> > > > >  10852  S  10.0   1:01.78  udp_sink
> > > > >  10862  R  10.0   1:05.19  udp_sink
> > > > >  10844  S   9.7   1:01.91  udp_sink
> > > > > 
> > > > > This is strange, why is ksoftirqd/0 getting 50% of the CPU time???      
> > > > 
> > > > Do you run your udp_sink thingy in a cpu-cgroup?    
> > > 
> > > That was also Paolo's feedback (IRC).  I'm not aware of it, but it
> > > might be some distribution (Fedora 22) default thing.  
> > 
> > Correction, on the server-under-test, I'm actually running RHEL7.2
> > 
> >   
> > > How do I verify/check if I have enabled a cpu-cgroup?  
> > 
> > Hannes says I can look in "/proc/self/cgroup"
> > 
> >  $ cat /proc/self/cgroup
> >  7:net_cls:/
> >  6:blkio:/
> >  5:devices:/
> >  4:perf_event:/
> >  3:cpu,cpuacct:/
> >  2:cpuset:/
> >  1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
> >  
> > And that "/" indicates I've not enabled cgroups, right?  
> 
> Mostly so. I think RHEL/Fedora has SCHED_AUTOGROUP enabled, and you can
> find that through:
> 
> cat /proc/self/autogroup

$ cat /proc/self/autogroup
/autogroup-88 nice 0

> And disable with the noautogroup boot param, or:
> 
> echo 0 > /proc/sys/kernel/sched_autogroup_enabled

Looks like it is enabled on my system:

$ grep -H . /proc/sys/kernel/sched_autogroup_enabled
/proc/sys/kernel/sched_autogroup_enabled:1


> although the latter will leave the current state intact while avoiding
> creation of any further autogroups, iirc.

$ sudo sh -c 'echo 0 > /proc/sys/kernel/sched_autogroup_enabled'
$ grep -H . /proc/sys/kernel/sched_autogroup_enabled
/proc/sys/kernel/sched_autogroup_enabled:0

$ sudo systemctl restart sshd

Starting new SSH login:

$ cat /proc/self/autogroup
/autogroup-153 nice 0

Hmmm, still enabled...

$ sudo systemctl stop sshd
$ sudo systemctl start sshd
$ grep -H . /proc/sys/kernel/sched_autogroup_enabled
/proc/sys/kernel/sched_autogroup_enabled:0
$ cat /proc/self/autogroup
/autogroup-158 nice 0

Still... enabled!
Hmmm... any more ideas on how to disable this?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] softirq: let ksoftirqd do its job
  2016-09-01 13:30                               ` Jesper Dangaard Brouer
@ 2016-09-01 15:28                                 ` Peter Zijlstra
  2016-09-02  8:35                                   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-01 15:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet

On Thu, Sep 01, 2016 at 03:30:42PM +0200, Jesper Dangaard Brouer wrote:
> Still... enabled!
> Hmmm... any more ideas on how to disable this?

I think you ought to be able to assign yourself to the root cgroup,
something like:

  echo $$ > /cgroup/tasks

or wherever the cpu-cgroup controller is mounted at.

But it's been a fair while since I touched any of that; it's not a CONFIG
I have enabled much.
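
Concretely, something like this, assuming the usual cgroup-v1 layout where
systemd mounts the cpu controller under /sys/fs/cgroup (adjust the path to
wherever it is mounted on your box):

  $ mount -t cgroup | grep cpu
  $ echo $$ | sudo tee /sys/fs/cgroup/cpu,cpuacct/tasks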

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] softirq: let ksoftirqd do its job
  2016-08-31 17:42             ` [PATCH] softirq: let ksoftirqd do its job Eric Dumazet
  2016-08-31 19:40               ` Jesper Dangaard Brouer
  2016-09-01 12:01               ` Hannes Frederic Sowa
@ 2016-09-02  6:39               ` David Miller
  2016-09-23 11:35                 ` Daniel Borkmann
  2016-09-30 11:55               ` [tip:irq/core] softirq: Let " tip-bot for Eric Dumazet
  3 siblings, 1 reply; 31+ messages in thread
From: David Miller @ 2016-09-02  6:39 UTC (permalink / raw)
  To: eric.dumazet
  Cc: peterz, riel, pabeni, hannes, jbrouer, linux-kernel, netdev, corbet

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 31 Aug 2016 10:42:29 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> A while back, Paolo and Hannes sent an RFC patch adding threaded-able
> napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/) 
> 
> The problem seems to be that softirqs are very aggressive and are often
> handled by the current process, even if we are under stress and that
> ksoftirqd was scheduled, so that innocent threads would have more chance
> to make progress.
> 
> This patch makes sure that if ksoftirq is running, we let it
> perform the softirq work.
> 
> Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/
> 
> Tested:
> 
>  - NIC receiving traffic handled by CPU 0
>  - UDP receiver running on CPU 0, using a single UDP socket.
>  - Incoming flood of UDP packets targeting the UDP socket.
> 
> Before the patch, the UDP receiver could almost never get cpu cycles and
> could only receive ~2,000 packets per second.
> 
> After the patch, cpu cycles are split 50/50 between user application and
> ksoftirqd/0, and we can effectively read ~900,000 packets per second,
> a huge improvement in DOS situation. (Note that more packets are now
> dropped by the NIC itself, since the BH handlers get less cpu cycles to
> drain RX ring buffer)
> 
> Since the load runs in well identified threads context, an admin can
> more easily tune process scheduling parameters if needed.
> 
> Reported-by: Paolo Abeni <pabeni@redhat.com>
> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I'm just kind of assuming this won't go through my tree, but I can take
it if that's what everyone agrees to.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] softirq: let ksoftirqd do its job
  2016-09-01 15:28                                 ` Peter Zijlstra
@ 2016-09-02  8:35                                   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-02  8:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Dumazet, David Miller, Rik van Riel, Paolo Abeni,
	Hannes Frederic Sowa, linux-kernel, netdev, Jonathan Corbet,
	brouer

On Thu, 1 Sep 2016 17:28:02 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 01, 2016 at 03:30:42PM +0200, Jesper Dangaard Brouer wrote:
> > Still... enabled!
> > Hmmm... any more ideas on how to disable this?
> 
> I think you ought to be able to assign yourself to the root cgroup,
> something like:
> 
>   echo $$ > /cgroup/tasks
> 
> or wherever the cpu-cgroup controller is mounted at.
> 
> But it's been a fair while since I touched any of that; it's not a CONFIG
> I have enabled much.

I could not figure out how to disable autogroups, so I ended up
compiling the kernel without CONFIG_SCHED_AUTOGROUP.

  PID   PR   S  %CPU     TIME+ 	COMMAND
    3   20   R  20.7   0:53.05 	ksoftirqd/0
 9299   20   R  16.3   0:03.62 	udp_sink
 9296   20   S  16.0   0:03.59 	udp_sink
 9297   20   R  16.0   0:03.58 	udp_sink
 9298   20   R  16.0   0:03.57 	udp_sink
 9295   20   R  15.3   0:03.43 	udp_sink

Top now shows the CPU distribution is more correct, thus we can
conclude that the artifact I saw was indeed caused by autogroup.

I can also confirm that my netperf UDP_STREAM tests now work again,
but I need around 32 parallel netperf instances to counter the
effectiveness of the ksoftirqd process, while I only need 5 udp_sink
programs.
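
For reference, the sender side of that test looks roughly like this (a
sketch; target host, duration and message size are placeholders, not the
exact parameters used):

 $ for i in $(seq 32); do netperf -H $TARGET -t UDP_STREAM -l 60 -- -m 1472 & done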

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] softirq: let ksoftirqd do its job
  2016-09-02  6:39               ` David Miller
@ 2016-09-23 11:35                 ` Daniel Borkmann
  2016-09-23 11:53                   ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel Borkmann @ 2016-09-23 11:35 UTC (permalink / raw)
  To: David Miller, eric.dumazet
  Cc: peterz, riel, pabeni, hannes, jbrouer, linux-kernel, netdev, corbet

On 09/02/2016 08:39 AM, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 31 Aug 2016 10:42:29 -0700
>
>> From: Eric Dumazet <edumazet@google.com>
>>
>> A while back, Paolo and Hannes sent an RFC patch adding threaded-able
>> napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)
>>
>> The problem seems to be that softirqs are very aggressive and are often
>> handled by the current process, even if we are under stress and that
>> ksoftirqd was scheduled, so that innocent threads would have more chance
>> to make progress.
>>
>> This patch makes sure that if ksoftirq is running, we let it
>> perform the softirq work.
>>
>> Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/
>>
>> Tested:
>>
>>   - NIC receiving traffic handled by CPU 0
>>   - UDP receiver running on CPU 0, using a single UDP socket.
>>   - Incoming flood of UDP packets targeting the UDP socket.
>>
>> Before the patch, the UDP receiver could almost never get cpu cycles and
>> could only receive ~2,000 packets per second.
>>
>> After the patch, cpu cycles are split 50/50 between user application and
>> ksoftirqd/0, and we can effectively read ~900,000 packets per second,
>> a huge improvement in DOS situation. (Note that more packets are now
>> dropped by the NIC itself, since the BH handlers get less cpu cycles to
>> drain RX ring buffer)
>>
>> Since the load runs in well identified threads context, an admin can
>> more easily tune process scheduling parameters if needed.
>>
>> Reported-by: Paolo Abeni <pabeni@redhat.com>
>> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> I'm just kind of assuming this won't go through my tree, but I can take
> it if that's what everyone agrees to.

Was this actually picked up somewhere in the meantime?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] softirq: let ksoftirqd do its job
  2016-09-23 11:35                 ` Daniel Borkmann
@ 2016-09-23 11:53                   ` Peter Zijlstra
  2016-09-23 16:51                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-23 11:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David Miller, eric.dumazet, riel, pabeni, hannes, jbrouer,
	linux-kernel, netdev, corbet, Ingo Molnar

On Fri, Sep 23, 2016 at 01:35:59PM +0200, Daniel Borkmann wrote:
> On 09/02/2016 08:39 AM, David Miller wrote:
> >
> >I'm just kind of assuming this won't go through my tree, but I can take
> >it if that's what everyone agrees to.
> 
> Was this actually picked up somewhere in the meantime?

I can queue it for tip. In fact, I've just done so to avoid losing it.
If anybody else wants it, holler.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] softirq: let ksoftirqd do its job
  2016-09-23 11:53                   ` Peter Zijlstra
@ 2016-09-23 16:51                     ` Jesper Dangaard Brouer
  2016-09-23 21:16                       ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-23 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Borkmann, David Miller, eric.dumazet, riel, pabeni,
	hannes, linux-kernel, netdev, corbet, Ingo Molnar

On Fri, 23 Sep 2016 13:53:33 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Sep 23, 2016 at 01:35:59PM +0200, Daniel Borkmann wrote:
> > On 09/02/2016 08:39 AM, David Miller wrote:  
> > >
> > >I'm just kind of assuming this won't go through my tree, but I can take
> > >it if that's what everyone agrees to.  
> > 
> > Was this actually picked up somewhere in the mean time?  
> 
> I can queue it for tip. In fact, I've just done so to avoid losing it.
> If anybody else wants it, holler.

Good that you are picking this up! It is a very important fix, at least
for networking.

This is your git tree, right:
 https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/

Doesn't look like you pushed it yet, or do I need to look at a specific
branch?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] softirq: let ksoftirqd do its job
  2016-09-23 16:51                     ` Jesper Dangaard Brouer
@ 2016-09-23 21:16                       ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-23 21:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, David Miller, eric.dumazet, riel, pabeni,
	hannes, linux-kernel, netdev, corbet, Ingo Molnar

On Fri, Sep 23, 2016 at 06:51:04PM +0200, Jesper Dangaard Brouer wrote:

> This is your git tree, right:
>  https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/
> 
> Doesn't look like you pushed it yet, or do I need to look at a specific
> branch?

I mainly work from a local quilt queue which I feed to mingo. I
occasionally push out to get build-bot coverage or have people look at
bits I poked together.

That said, I'll try and do a push later tonight.

Do note, however, that the git tree is a complete wipe and rebuild; don't
expect any kind of continuity from it.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [tip:irq/core] softirq: Let ksoftirqd do its job
  2016-08-31 17:42             ` [PATCH] softirq: let ksoftirqd do its job Eric Dumazet
                                 ` (2 preceding siblings ...)
  2016-09-02  6:39               ` David Miller
@ 2016-09-30 11:55               ` tip-bot for Eric Dumazet
  3 siblings, 0 replies; 31+ messages in thread
From: tip-bot for Eric Dumazet @ 2016-09-30 11:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-kernel, hannes, edumazet, hpa, riel, jbrouer, hannes,
	peterz, davem, corbet, pabeni, mingo, torvalds

Commit-ID:  4cd13c21b207e80ddb1144c576500098f2d5f882
Gitweb:     http://git.kernel.org/tip/4cd13c21b207e80ddb1144c576500098f2d5f882
Author:     Eric Dumazet <edumazet@google.com>
AuthorDate: Wed, 31 Aug 2016 10:42:29 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 30 Sep 2016 10:43:36 +0200

softirq: Let ksoftirqd do its job

A while back, Paolo and Hannes sent an RFC patch adding threaded-able
napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)

The problem seems to be that softirqs are very aggressive and are often
handled by the current process, even if we are under stress and that
ksoftirqd was scheduled, so that innocent threads would have more chance
to make progress.

This patch makes sure that if ksoftirq is running, we let it
perform the softirq work.

Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/

Tested:

 - NIC receiving traffic handled by CPU 0
 - UDP receiver running on CPU 0, using a single UDP socket.
 - Incoming flood of UDP packets targeting the UDP socket.

Before the patch, the UDP receiver could almost never get CPU cycles and
could only receive ~2,000 packets per second.

After the patch, CPU cycles are split 50/50 between user application and
ksoftirqd/0, and we can effectively read ~900,000 packets per second,
a huge improvement in DOS situation. (Note that more packets are now
dropped by the NIC itself, since the BH handlers get less CPU cycles to
drain RX ring buffer)

Since the load runs in well identified threads context, an admin can
more easily tune process scheduling parameters if needed.

Reported-by: Paolo Abeni <pabeni@redhat.com>
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1472665349.14381.356.camel@edumazet-glaptop3.roam.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/softirq.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b..8ed90e3 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -78,6 +78,17 @@ static void wakeup_softirqd(void)
 }
 
 /*
+ * If ksoftirqd is scheduled, we do not want to process pending softirqs
+ * right now. Let ksoftirqd handle this at its own rate, to get fairness.
+ */
+static bool ksoftirqd_running(void)
+{
+	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
+
+	return tsk && (tsk->state == TASK_RUNNING);
+}
+
+/*
  * preempt_count and SOFTIRQ_OFFSET usage:
  * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
  *   softirq processing.
@@ -313,7 +324,7 @@ asmlinkage __visible void do_softirq(void)
 
 	pending = local_softirq_pending();
 
-	if (pending)
+	if (pending && !ksoftirqd_running())
 		do_softirq_own_stack();
 
 	local_irq_restore(flags);
@@ -340,6 +351,9 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
+	if (ksoftirqd_running())
+		return;
+
 	if (!force_irqthreads) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
 		/*

^ permalink raw reply related	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2016-09-30 11:56 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4f1c4b38528762619fff1fa963de8971006c1234.1472460085.git.pabeni@redhat.com>
     [not found] ` <20160831100854.23dad2d8@redhat.com>
     [not found]   ` <1472650472.14381.317.camel@edumazet-glaptop3.roam.corp.google.com>
     [not found]     ` <1472650688.32433.115.camel@redhat.com>
     [not found]       ` <1472652643.14381.320.camel@edumazet-glaptop3.roam.corp.google.com>
     [not found]         ` <20160831164216.2901190c@redhat.com>
     [not found]           ` <1472661956.14381.335.camel@edumazet-glaptop3.roam.corp.google.com>
2016-08-31 17:42             ` [PATCH] softirq: let ksoftirqd do its job Eric Dumazet
2016-08-31 19:40               ` Jesper Dangaard Brouer
2016-08-31 20:42                 ` Eric Dumazet
2016-08-31 21:51                   ` Jesper Dangaard Brouer
2016-08-31 22:27                     ` Eric Dumazet
2016-08-31 22:47                       ` Rick Jones
2016-08-31 23:11                         ` Eric Dumazet
2016-08-31 23:29                           ` Rick Jones
2016-09-01 10:38                             ` Jesper Dangaard Brouer
2016-09-01 13:06                               ` Eric Dumazet
2016-09-01 11:02                     ` Jesper Dangaard Brouer
2016-09-01 11:11                       ` Hannes Frederic Sowa
2016-09-01 11:53                       ` Peter Zijlstra
2016-09-01 12:29                         ` Jesper Dangaard Brouer
2016-09-01 12:38                           ` Jesper Dangaard Brouer
2016-09-01 12:48                             ` Peter Zijlstra
2016-09-01 13:30                               ` Jesper Dangaard Brouer
2016-09-01 15:28                                 ` Peter Zijlstra
2016-09-02  8:35                                   ` Jesper Dangaard Brouer
2016-09-01 12:57                             ` Eric Dumazet
2016-09-01 13:00                               ` Hannes Frederic Sowa
2016-09-01 13:25                                 ` Eric Dumazet
2016-09-01 12:05                   ` Hannes Frederic Sowa
2016-09-01 12:51                     ` Eric Dumazet
2016-09-01 12:01               ` Hannes Frederic Sowa
2016-09-02  6:39               ` David Miller
2016-09-23 11:35                 ` Daniel Borkmann
2016-09-23 11:53                   ` Peter Zijlstra
2016-09-23 16:51                     ` Jesper Dangaard Brouer
2016-09-23 21:16                       ` Peter Zijlstra
2016-09-30 11:55               ` [tip:irq/core] softirq: Let " tip-bot for Eric Dumazet
