* [RFC 0/2] Net softirq deferring to ksoftirqd
@ 2018-01-09 13:36 Dmitry Safonov
  2018-01-09 13:36 ` [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context Dmitry Safonov
  2018-01-09 13:36 ` [RFC 2/2] softirq: Introduce mask for __do_softirq() Dmitry Safonov
  0 siblings, 2 replies; 57+ messages in thread
From: Dmitry Safonov @ 2018-01-09 13:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: 0x7f454c46, Dmitry Safonov, Andrew Morton, David Miller,
      Eric Dumazet, Frederic Weisbecker, Hannes Frederic Sowa,
      Ingo Molnar, Levin, Alexander (Sasha Levin), Linus Torvalds,
      Paolo Abeni, Paul E. McKenney, Peter Zijlstra, Radu Rendec,
      Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li

Based on events I've seen on out-of-tree drivers, but I believe the
same can happen on mainstream ones. I managed to get some results in
Qemu VMs, but this lacks testing on real mainstream hardware. Also,
I'm not sure this is a completely sane thing to do, so I'm sending it
early to get some reviews.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: "Paul E.
McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radu Rendec <rrendec@arista.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>

Dmitry Safonov (2):
  softirq: Defer net rx/tx processing to ksoftirqd context
  softirq: Introduce mask for __do_softirq()

 include/linux/interrupt.h |  8 ++++----
 kernel/softirq.c          | 48 ++++++++++++++++++++++++++++++++++-------------
 2 files changed, 39 insertions(+), 17 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 57+ messages in thread
* [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context
  2018-01-09 13:36 [RFC 0/2] Net softirq deferring to ksoftirqd Dmitry Safonov
@ 2018-01-09 13:36 ` Dmitry Safonov
  2018-01-09 16:03   ` Frederic Weisbecker
  2018-01-09 18:02   ` Eric Dumazet
  2018-01-09 13:36 ` [RFC 2/2] softirq: Introduce mask for __do_softirq() Dmitry Safonov
  1 sibling, 2 replies; 57+ messages in thread
From: Dmitry Safonov @ 2018-01-09 13:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: 0x7f454c46, Dmitry Safonov, Andrew Morton, David Miller,
      Eric Dumazet, Frederic Weisbecker, Hannes Frederic Sowa,
      Ingo Molnar, Levin, Alexander (Sasha Levin), Linus Torvalds,
      Paolo Abeni, Paul E. McKenney, Peter Zijlstra, Radu Rendec,
      Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li

Warning: Not merge-ready

I. Current workflow of ksoftirqd.
Softirqs are processed in the context of ksoftirqd iff they are
being raised very frequently. How it works:
do_softirq() and invoke_softirq() defer pending softirqs iff
ksoftirqd is in the runqueue. Ksoftirqd is mostly scheduled at the
end of softirq processing, if 2ms were not enough to process all
pending softirqs.

Here is a pseudo-picture of the workflow (for simplicity on UP):
------------- ------------------ ------------------
| ksoftirqd | | User's process | | Softirqs |
------------- ------------------ ------------------
Not scheduled Running
|
o------------------------o
|
__do_softirq()
|
2ms & softirq pending?
Schedule ksoftirqd
|
Scheduled o------------------------o
|
o--------------------o
|
Running Scheduled
|
o--------------------o
|
Not scheduled Running

Timegraph for the workflow,
dash (-) means ksoftirqd not scheduled;
equal (=) means ksoftirqd is scheduled, a softirq may still be pending

Pending softirqs
| | | | | | | | |
v v v v | | | | v
Processing o-----o | | | | o--o
softirqs | | | | | | | |
| | | | | | | |
| | | | | | | |
Userspace o-o o=========o | | | | o----o o---------o
<-2ms-> | | | | | |
| v v v v |
Ksoftirqd o----------o

II.
Corner-conditions.
During testing of commit [1] on some non-mainstream driver, I've
found that due to platform specifics, the IRQ is raised too late
(after the softirq has been processed). As a result, softirqs steal
time from the userspace process, leaving it starved of CPU time and
never/rarely scheduling ksoftirqd:

Pending softirqs
| | | | | |
v v v v v v
Processing o-----o o-----o o-----o o-----o o-----o o ...
softirqs | | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
Userspace o-o o-o o-o o-o o-o o-o (starving)
Ksoftirqd (rarely scheduled)

Afterwards I thought that the same may happen on mainstream if the
PPS rate is chosen so that an IRQ is raised just after the previous
softirq was processed. I managed to reproduce the conjecture, see
(IV).

III. RFC proposal.
Firstly, I tried to account all time spent in softirq processing to
the ksoftirqd thread that serves the local CPU, and to compare the
vruntime of ksoftirqd and the current task to decide whether a
softirq should be delayed. You may imagine what disgraceful hacks
were involved. The current RFC has nothing of that kind and relies on
fair scheduling of ksoftirqd and other tasks. To do that, we check
pending softirqs and serve them in the current context only if
non-net softirqs are pending. The following patch adds a mask to
__do_softirq() so that net softirqs are processed only in ksoftirqd
context if multiple softirqs are pending.

IV. Test results.
Unfortunately, I wasn't able to test it on hardware with a mainstream
kernel, so I've only results from Qemu VMs with Fedora 26. The first
VM stresses the second with UDP packets generated by pktgen. The
receiver VM runs the udp_sink[2] program and prints the amount of PPS
served. The VMs have virtio network cards, run with rt priority and
are assigned to different CPUs on the host. The host's CPU is an
Intel Core i7-7600U @ 2.80GHz. The RFC definitely needs testing on
real HW (because I don't expect anyone would quite believe VM perf
testing) - any help with testing it would be appreciated.
 Source | Destination
--------|-------------------------------
        |    master     |      RFC      |
        |  (4.15-rc4)   |               |
--------|---------------|---------------|
   5000 |     5000.7    |     4999.7    |
   7000 |     6997.42   |     6995.88   |
   8000 |     7999.55   |     7999.86   |
   9000 |     8951.37   |     8986.30   |
  10000 |     9864.96   |     9972.05   |
  11000 |    10711.92   |    10976.26   |
  12000 |    11494.79   |    11962.40   |
  13000 |    12161.76   |    12946.91   |
  14000 |    11152.07   |    13942.96   |
  15000 |     8650.22   |    14878.26   |
  16000 |     7662.55   |    15880.60   |
  17000 |     6485.49   |    16814.07   |
  18000 |     5489.48   |    17679.69   |
  19000 |     4679.59   |    18543.60   |
  20000 |     4738.24   |    19233.56   |
  21000 |     4015.00   |    20247.50   |
  22000 |     4376.99   |    20654.62   |
  23000 |     9429.80   |    20925.07   |
  24000 |     8872.33   |    21336.31   |
  25000 |    19824.67   |    21486.84   |
  30000 |    20779.49   |    21487.15   |
  40000 |    24559.83   |    21452.74   |
  50000 |    18469.20   |    21191.34   |
 100000 |    19773.00   |    22592.28   |
--------|---------------|---------------|

Note that I tested in VMs and I've found that if I
produce more hw irqs on the host, then the results for master are not
that dramatically bad, but still much worse than with the RFC.
For that reason I have doubts whether my test results are correct.

V. References:
[1] 4cd13c21b207 ("softirq: Let ksoftirqd do its job")
[2] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 kernel/softirq.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f1bae2..ee48f194dcec 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -88,6 +88,28 @@ static bool ksoftirqd_running(void)
 	return tsk && (tsk->state == TASK_RUNNING);
 }
 
+static bool defer_softirq(void)
+{
+	__u32 pending = local_softirq_pending();
+
+	if (!pending)
+		return true;
+
+	if (ksoftirqd_running())
+		return true;
+
+	/*
+	 * Defer net rx/tx softirqs to ksoftirqd processing as they may
+	 * starve userspace of CPU time.
+	 */
+	if (pending & ((1 << NET_RX_SOFTIRQ) | (1 << NET_TX_SOFTIRQ))) {
+		wakeup_softirqd();
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * preempt_count and SOFTIRQ_OFFSET usage:
  * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
@@ -315,7 +337,6 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 
 asmlinkage __visible void do_softirq(void)
 {
-	__u32 pending;
 	unsigned long flags;
 
 	if (in_interrupt())
@@ -323,9 +344,7 @@ asmlinkage __visible void do_softirq(void)
 
 	local_irq_save(flags);
 
-	pending = local_softirq_pending();
-
-	if (pending && !ksoftirqd_running())
+	if (!defer_softirq())
 		do_softirq_own_stack();
 
 	local_irq_restore(flags);
@@ -352,7 +371,7 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
-	if (ksoftirqd_running())
+	if (defer_softirq())
 		return;
 
 	if (!force_irqthreads) {
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-09 13:36 ` [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context Dmitry Safonov @ 2018-01-09 16:03 ` Frederic Weisbecker 2018-01-10 21:20 ` Dmitry Safonov 2018-01-09 18:02 ` Eric Dumazet 1 sibling, 1 reply; 57+ messages in thread From: Frederic Weisbecker @ 2018-01-09 16:03 UTC (permalink / raw) To: Dmitry Safonov Cc: linux-kernel, 0x7f454c46, Andrew Morton, David Miller, Eric Dumazet, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Linus Torvalds, Paolo Abeni, Paul E. McKenney, Peter Zijlstra, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li 2018-01-09 14:36 UTC+01:00, Dmitry Safonov <dima@arista.com>: > Warning: Not merge-ready > > I. Current workflow of ksoftirqd. > Softirqs are processed in the context of ksoftirqd iff they are > being raised very frequently. How it works: > do_softirq() and invoke_softirq() deffer pending softirq iff > ksoftirqd is in runqueue. Ksoftirqd is scheduled mostly in the > end of processed softirqs if 2ms were not enough to process all > pending softirqs. > > Here is pseudo-picture of the workflow (for simplicity on UMP): > ------------- ------------------ ------------------ > | ksoftirqd | | User's process | | Softirqs | > ------------- ------------------ ------------------ > Not scheduled Running > | > o------------------------o > | > __do_softirq() > | > 2ms & softirq pending? 
> Schedule ksoftirqd > | > Scheduled o------------------------o > | > o--------------------o > | > Running Scheduled > | > o--------------------o > | > Not scheduled Running > > Timegraph for the workflow, > dash (-) means ksoftirqd not scheduled; > equal(=) ksoftirqd is scheduled, a softirq may still be pending > > Pending softirqs > | | | | | | | | | > v v v v | | | | v > Processing o-----o | | | | o--o > softirqs | | | | | | | | > | | | | | | | | > | | | | | | | | > Userspace o-o o=========o | | | | o----o o---------o > <-2ms-> | | | | | | > | v v v v | > Ksoftirqd o----------o > > II. Corner-conditions. > During testing of commit [1] on some non-mainstream driver, > I've found that due to platform specifics, the IRQ is being > raised too late (after softirq has been processed). I'm a bit confused about that part. I would expect the softirq to be raised by the IRQ. So I guess in this scenario the softirq is raised by something else and you expect the upcoming IRQ to handle the softirq, right? (sorry I'm not used to networking code). Thanks. > In result softirqs steal time from userspace process, leaving > it starving for CPU time and never/rarely scheduling ksoftirqd: ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-09 16:03 ` Frederic Weisbecker @ 2018-01-10 21:20 ` Dmitry Safonov 0 siblings, 0 replies; 57+ messages in thread From: Dmitry Safonov @ 2018-01-10 21:20 UTC (permalink / raw) To: Frederic Weisbecker Cc: linux-kernel, 0x7f454c46, Andrew Morton, David Miller, Eric Dumazet, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Linus Torvalds, Paolo Abeni, Paul E. McKenney, Peter Zijlstra, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Tue, 2018-01-09 at 17:03 +0100, Frederic Weisbecker wrote: > 2018-01-09 14:36 UTC+01:00, Dmitry Safonov <dima@arista.com>: > > Warning: Not merge-ready > > > > I. Current workflow of ksoftirqd. > > Softirqs are processed in the context of ksoftirqd iff they are > > being raised very frequently. How it works: > > do_softirq() and invoke_softirq() deffer pending softirq iff > > ksoftirqd is in runqueue. Ksoftirqd is scheduled mostly in the > > end of processed softirqs if 2ms were not enough to process all > > pending softirqs. > > > > Here is pseudo-picture of the workflow (for simplicity on UMP): > > ------------- ------------------ ------------------ > > | ksoftirqd | | User's process | | Softirqs | > > ------------- ------------------ ------------------ > > Not scheduled Running > > | > > o------------------------o > > | > > __do_softirq() > > | > > 2ms & softirq > > pending? 
> > Schedule ksoftirqd
> > |
> > Scheduled o------------------------o
> > |
> > o--------------------o
> > |
> > Running Scheduled
> > |
> > o--------------------o
> > |
> > Not scheduled Running
> >
> > Timegraph for the workflow,
> > dash (-) means ksoftirqd not scheduled;
> > equal(=) ksoftirqd is scheduled, a softirq may still be pending
> >
> > Pending softirqs
> > | | | | | | | | |
> > v v v v | | | | v
> > Processing o-----o | | | | o--o
> > softirqs | | | | | | | |
> > | | | | | | | |
> > | | | | | | | |
> > Userspace o-o o=========o | | | | o----o o---------o
> > <-2ms-> | | | | | |
> > | v v v v |
> > Ksoftirqd o----------o
> >
> > II. Corner-conditions.
> > During testing of commit [1] on some non-mainstream driver,
> > I've found that due to platform specifics, the IRQ is being
> > raised too late (after softirq has been processed).
>
> I'm a bit confused about that part. I would expect the softirq to be
> raised by the IRQ.

The rx-softirq is raised by napi_schedule(), which is called on
receiving an interrupt from the device, yes.

> So I guess in this scenario the softirq is raised by something else
> and you expect the upcoming IRQ to handle the softirq, right? (sorry
> I'm not used to networking code).

So, the softirq is served after the upcoming IRQ. But at the end of
that softirq processing there is no pending softirq yet. And just
after irq_exit(), another softirq becomes pending. ITOW, here is what
I see in the trace:

python-8597 [001] d... 16635.495480: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495498: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495503: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495533: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495542: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495567: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495576: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495601: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495615: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495627: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495637: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495668: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495684: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495703: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495710: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495723: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495736: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495755: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495769: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495788: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495804: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495822: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495834: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495857: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495868: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495891: rcu_irq_exit <-irq_exit
python-8597 [001] d.s1 16635.495907: rcu_irq_exit <-irq_exit
python-8597 [001] d..1 16635.495908: __do_softirq <- do_softirq_own_stack
python-8597 [001] d... 16635.495939: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495959: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495973: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495988: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496002: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496014: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496027: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496044: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496059: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496088: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496102: __do_softirq <-irq_exit
python-8597 [001] d.s. 16635.496108: rcu_irq_exit <-irq_exit
python-8597 [001] dN.. 16635.496131: rcu_irq_exit <-irq_exit
ksoftirqd/1-14 [001] d... 16635.496132: __do_softirq <- run_ksoftirqd
ksoftirqd/1-14 [001] d.s1 16635.496145: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496171: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496191: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496204: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496218: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496231: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496250: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496263: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496283: rcu_irq_exit <-irq_exit

Ksoftirqd here is very rarely scheduled; the python process here is
the receiver, and it looks CPU-starved:

%Cpu1 : 0.6 us, 9.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 35.3 hi, 54.6 si, 0.0 st
[..]
  PID USER PR NI VIRT RES  SHR  S %CPU %MEM TIME+   COMMAND
 8597 root 20  0 7664 5736 3924 R   10  0.0 4:00.69 python

That's what I tried to picture there with the diagrams.
(the source of __do_softirq() also shows how ksoftirqd got scheduled)

-- 
Thanks,
Dmitry

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context
  2018-01-09 13:36 ` [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context Dmitry Safonov
  2018-01-09 16:03   ` Frederic Weisbecker
@ 2018-01-09 18:02   ` Eric Dumazet
  2018-01-10 21:48     ` Dmitry Safonov
  1 sibling, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2018-01-09 18:02 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: LKML, 0x7f454c46, Andrew Morton, David Miller,
      Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar,
      Levin, Alexander (Sasha Levin), Linus Torvalds, Paolo Abeni,
      Paul E. McKenney, Peter Zijlstra, Radu Rendec, Rik van Riel,
      Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li

On Tue, Jan 9, 2018 at 5:36 AM, Dmitry Safonov <dima@arista.com> wrote:
> Warning: Not merge-ready
>
> I. Current workflow of ksoftirqd.
> Softirqs are processed in the context of ksoftirqd iff they are
> being raised very frequently. How it works:
> do_softirq() and invoke_softirq() deffer pending softirq iff
> ...
>
> Note, that I tested in VMs and I've found that if I produce more
> hw irqs on the host, than the results for master are not that
> dramatically bad, but still much worse then with RFC.
> By that reason I have qualms if my test's results are correct.

Note that deferring all NET RX/TX to ksoftirqd is going to
dramatically hurt tail latencies.

You really should test with RPC-like workloads (netperf -t TCP_RR)
and hundreds of threads per CPU :/

It seems we are going to revert/adapt 4cd13c21b2, not defer more
stuff to ksoftirqd.

Thanks

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context
  2018-01-09 18:02 ` Eric Dumazet
@ 2018-01-10 21:48   ` Dmitry Safonov
  2018-01-11  2:13     ` Linus Torvalds
  0 siblings, 1 reply; 57+ messages in thread
From: Dmitry Safonov @ 2018-01-10 21:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: LKML, 0x7f454c46, Andrew Morton, David Miller,
      Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar,
      Levin, Alexander (Sasha Levin), Linus Torvalds, Paolo Abeni,
      Paul E. McKenney, Peter Zijlstra, Radu Rendec, Rik van Riel,
      Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li

On Tue, 2018-01-09 at 10:02 -0800, Eric Dumazet wrote:
> On Tue, Jan 9, 2018 at 5:36 AM, Dmitry Safonov <dima@arista.com>
> wrote:
> > Warning: Not merge-ready
> >
> > I. Current workflow of ksoftirqd.
> > Softirqs are processed in the context of ksoftirqd iff they are
> > being raised very frequently. How it works:
> > do_softirq() and invoke_softirq() deffer pending softirq iff
> >
> > ...
> >
> > Note, that I tested in VMs and I've found that if I produce more
> > hw irqs on the host, than the results for master are not that
> > dramatically bad, but still much worse then with RFC.
> > By that reason I have qualms if my test's results are correct.
>
> Note that deferring all NET RX/TX to ksoftirqd is going to
> dramatically hurt tail latencies.
>
> You really should test with RPC like workloads (netperf -t TCP_RR)
> and hundred of threads per cpu :/

Yeah, thanks for the reply, will give that a shot.

> It seems we are going to revert/adapt 4cd13c21b2 , not defer more
> stuff to ksoftirqd.

Hmm, what if we use some other logic for deferring/non-deferring,
like checking how many softirqs were serviced during the process's
timeslice, and deciding whether to proceed with __do_softirq() or
defer it so as not to starve a task? Might that make sense?

-- 
Thanks,
Dmitry

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-10 21:48 ` Dmitry Safonov @ 2018-01-11 2:13 ` Linus Torvalds 2018-01-11 3:22 ` Frederic Weisbecker 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2018-01-11 2:13 UTC (permalink / raw) To: Dmitry Safonov Cc: Eric Dumazet, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Peter Zijlstra, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Wed, Jan 10, 2018 at 1:48 PM, Dmitry Safonov <dima@arista.com> wrote: > Hmm, what if we use some other logic for deferring/non-deferring > like checking how many softirqs where serviced during process's > timeslice and decide if proceed with __do_softirq() or defer it > not to starve a task? Might that make sense? Yes, but it might also be hard to come up with a good heuristic. We actually *have* a fairly good heuristic right now: we end up punting to softirqd if we have too much work at one synchronous event. We simply count, and refuse to do too much, and say "Ok, wake up ksoftirqd". That has worked fairly well for a long time, and I think it's fundamentally the right thing to do. I think that the problem with the "once you punt to ksoftirqd, _keep_ punting to it" in commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job") was that it simply went much too far. Doing it under heavy load once is fine. But then what happens is that ksoftirqd keeps running (for the same reason that we woke it up in the first place), and then eventually it gets scheduled away because it's doing a lot of work. And I think THAT is when the ksoftirqd scheduling latencies get bad. Not on initial "push things to ksoftirqd". If ksoftirqd hasn't been running, then the scheduler will be pretty eager to pick it. 
But if ksoftirqd has been using CPU time, and gets preempted by other
things (and it's pretty eager to do so - see the whole
"need_resched()" in __do_softirq()), now we're really talking long
latencies when there are other runnable processes. And dammit, softirq
latencies are *MORE IMPORTANT* than some random user process
scheduling. But the ksoftirqd_running() code will just see "ok, it's
runnable, I'm not going to run anything synchronously", and let those
softirq things wait. We're talking packet scheduling, but we're
talking other things too.

So just saying "hey, ksoftirqd is runnable - but maybe not running
_now_" and ignoring softirqs entirely is just stupid. Even if we could
easily do another small bunch of them, at least the non-networking
ones.

So maybe that "ksoftirqd_running()" check should actually be something like

    static bool ksoftirqd_running(void)
    {
            struct task_struct *tsk = __this_cpu_read(ksoftirqd);

            return tsk == current;
    }

which actually checks that ksoftirqd is running right *now*, and not
scheduled away because somebody is running a perl script.

               Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context
  2018-01-11  2:13 ` Linus Torvalds
@ 2018-01-11  3:22   ` Frederic Weisbecker
  2018-01-11  4:19     ` Linus Torvalds
  0 siblings, 1 reply; 57+ messages in thread
From: Frederic Weisbecker @ 2018-01-11 3:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Safonov, Eric Dumazet, LKML, Dmitry Safonov,
      Andrew Morton, David Miller, Frederic Weisbecker,
      Hannes Frederic Sowa, Ingo Molnar,
      Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney,
      Peter Zijlstra, Radu Rendec, Rik van Riel, Stanislaw Gruszka,
      Thomas Gleixner, Wanpeng Li

On Wed, Jan 10, 2018 at 06:13:01PM -0800, Linus Torvalds wrote:
> So just saying "hey, ksoftirq is runnable - but maybe not running
> _now"" and ignoring softirqs entirely is just stupid. Even if we could
> easily do another small bunch of them, at least the non-networking
> ones.
>
> So maybe that "ksoftirqd_running()" check should actually be something like
>
> static bool ksoftirqd_running(void)
> {
>         struct task_struct *tsk = __this_cpu_read(ksoftirqd);
>
>         return tsk == current;
> }
>
> which actually checks that ksoftirq is running right *now*, and not
> scheduled away because somebody is running a perl script.

Makes sense, but I think you need to keep the TASK_RUNNING check.
In case the hardirq is interrupting ksoftirqd in TASK_INTERRUPTIBLE
state right before it's going to sleep. In that case neither ksoftirqd
nor the hardirq are going to serve the poor pending softirqs. And if
we are in nohz mode, they may not be served for a while.

So perhaps it should be:

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f..6e5d7bc 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -85,7 +85,7 @@ static bool ksoftirqd_running(void)
 {
 	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
 
-	return tsk && (tsk->state == TASK_RUNNING);
+	return (tsk == current) && (tsk->state == TASK_RUNNING);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 3:22 ` Frederic Weisbecker @ 2018-01-11 4:19 ` Linus Torvalds 2018-01-11 4:44 ` Frederic Weisbecker 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2018-01-11 4:19 UTC (permalink / raw) To: Frederic Weisbecker Cc: Dmitry Safonov, Eric Dumazet, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Peter Zijlstra, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Wed, Jan 10, 2018 at 7:22 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > Makes sense, but I think you need to keep the TASK_RUNNING check. Yes, good point. > So perhaps it should be: > > - return tsk && (tsk->state == TASK_RUNNING); > + return (tsk == current) && (tsk->state == TASK_RUNNING); Looks good to me - definitely worth trying. Maybe that weakens the thing so much that it doesn't actually help the UDP packet storm case? And maybe it's not sufficient for the dvb issue. But I think it's worth at least testing. Maybe it makes neither side entirely happy, but maybe it might be a good halfway point? Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 4:19 ` Linus Torvalds @ 2018-01-11 4:44 ` Frederic Weisbecker 2018-01-11 14:31 ` Dmitry Safonov 0 siblings, 1 reply; 57+ messages in thread From: Frederic Weisbecker @ 2018-01-11 4:44 UTC (permalink / raw) To: Linus Torvalds Cc: Dmitry Safonov, Eric Dumazet, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Peter Zijlstra, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Wed, Jan 10, 2018 at 08:19:49PM -0800, Linus Torvalds wrote: > On Wed, Jan 10, 2018 at 7:22 PM, Frederic Weisbecker > <frederic@kernel.org> wrote: > > > > Makes sense, but I think you need to keep the TASK_RUNNING check. > > Yes, good point. > > > So perhaps it should be: > > > > - return tsk && (tsk->state == TASK_RUNNING); > > + return (tsk == current) && (tsk->state == TASK_RUNNING); > > Looks good to me - definitely worth trying. > > Maybe that weakens the thing so much that it doesn't actually help the > UDP packet storm case? > > And maybe it's not sufficient for the dvb issue. > > But I think it's worth at least testing. Maybe it makes neither side > entirely happy, but maybe it might be a good halfway point? Yes I believe Dmitry is facing a different problem where he would rather see ksoftirqd scheduled more often to handle the queue as a deferred batch instead of having it served one by one on the tails of IRQ storms. (Dmitry correct me if I misunderstood). But your patch still seems to make sense for the case you described: when ksoftirqd is voluntarily preempted off and the current IRQ could handle the queue. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context
  2018-01-11  4:44 ` Frederic Weisbecker
@ 2018-01-11 14:31   ` Dmitry Safonov
  2018-01-11 16:20     ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: Dmitry Safonov @ 2018-01-11 14:31 UTC (permalink / raw)
  To: Frederic Weisbecker, Linus Torvalds
  Cc: Eric Dumazet, LKML, Dmitry Safonov, Andrew Morton, David Miller,
      Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar,
      Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney,
      Peter Zijlstra, Radu Rendec, Rik van Riel, Stanislaw Gruszka,
      Thomas Gleixner, Wanpeng Li

On Thu, 2018-01-11 at 05:44 +0100, Frederic Weisbecker wrote:
> On Wed, Jan 10, 2018 at 08:19:49PM -0800, Linus Torvalds wrote:
> > On Wed, Jan 10, 2018 at 7:22 PM, Frederic Weisbecker
> > <frederic@kernel.org> wrote:
> > >
> > > Makes sense, but I think you need to keep the TASK_RUNNING check.
> >
> > Yes, good point.
> >
> > > So perhaps it should be:
> > >
> > > -       return tsk && (tsk->state == TASK_RUNNING);
> > > +       return (tsk == current) && (tsk->state == TASK_RUNNING);
> >
> > Looks good to me - definitely worth trying.
> >
> > Maybe that weakens the thing so much that it doesn't actually help
> > the UDP packet storm case?
> >
> > And maybe it's not sufficient for the dvb issue.
> >
> > But I think it's worth at least testing. Maybe it makes neither
> > side entirely happy, but maybe it might be a good halfway point?
>
> Yes I believe Dmitry is facing a different problem where he would
> rather see ksoftirqd scheduled more often to handle the queue as a
> deferred batch instead of having it served one by one on the tails
> of IRQ storms. (Dmitry correct me if I misunderstood).

Quite so, what I see is that ksoftirqd is rarely (close to never)
scheduled in case of a UDP packet storm. That's because the upcoming
irq is too late in __do_softirq().
So, there is no wakeup on a UDP storm here: : pending = local_softirq_pending(); : if (pending & mask) { : if (time_before(jiffies, end) && !need_resched() && : --max_restart) : goto restart; : : wakeup_softirqd(); : } (as there is no pending softirq yet). It comes too late to schedule ksoftirqd, and as a result the next softirq is again processed in the context of the task, not in the scheduled ksoftirqd. That results in CPU-time starvation for the process during an IRQ storm. While I saw that on an out-of-tree driver, I believe that at some rates (lower than a storm) one can observe the same on mainstream drivers. And I *think* I've reproduced that on mainstream with the virtio driver and a packet size of 1500 in VMs (though I don't quite like perf testing in VMs). So, in other words, maybe there is a better way to *detect* that the CPU time spent serving softirqs is close to a storm and that userspace is starting to starve? (and then launch ksoftirqd, or balance between deferring and serving softirqs right there). > But your patch still seems to make sense for the case you described: > when > ksoftirqd is voluntarily preempted off and the current IRQ could > handle the > queue. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 14:31 ` Dmitry Safonov @ 2018-01-11 16:20 ` Eric Dumazet 2018-01-11 16:32 ` Peter Zijlstra 0 siblings, 1 reply; 57+ messages in thread From: Eric Dumazet @ 2018-01-11 16:20 UTC (permalink / raw) To: Dmitry Safonov Cc: Frederic Weisbecker, Linus Torvalds, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Peter Zijlstra, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 6:31 AM, Dmitry Safonov <dima@arista.com> wrote: > On Thu, 2018-01-11 at 05:44 +0100, Frederic Weisbecker wrote: >> On Wed, Jan 10, 2018 at 08:19:49PM -0800, Linus Torvalds wrote: >> > On Wed, Jan 10, 2018 at 7:22 PM, Frederic Weisbecker >> > <frederic@kernel.org> wrote: >> > > >> > > Makes sense, but I think you need to keep the TASK_RUNNING check. >> > >> > Yes, good point. >> > >> > > So perhaps it should be: >> > > >> > > - return tsk && (tsk->state == TASK_RUNNING); >> > > + return (tsk == current) && (tsk->state == TASK_RUNNING); >> > >> > Looks good to me - definitely worth trying. >> > >> > Maybe that weakens the thing so much that it doesn't actually help >> > the >> > UDP packet storm case? >> > >> > And maybe it's not sufficient for the dvb issue. >> > >> > But I think it's worth at least testing. Maybe it makes neither >> > side >> > entirely happy, but maybe it might be a good halfway point? >> >> Yes I believe Dmitry is facing a different problem where he would >> rather >> see ksoftirqd scheduled more often to handle the queue as a deferred >> batch >> instead of having it served one by one on the tails of IRQ storms. >> (Dmitry correct me if I misunderstood). > > Quite so, what I see is that ksoftirqd is rarely (close to never) > scheduled in case of UDP packet storm. That's because the up coming irq > is too late in __do_softirq(). 
> So, there is no wakeup on UDP storm here: > : pending = local_softirq_pending(); > : if (pending & mask) { > : if (time_before(jiffies, end) && !need_resched() && > : --max_restart) > : goto restart; > : > : wakeup_softirqd(); > : } > (as there is yet no pending softirq). It comes a bit late to schedule > ksoftirqd and in result the next softirq is processed on the context of > the task again, not in the scheduled ksoftirqd. > That results in cpu-time starvation for the process on irq storm. > > While I saw that on out-of-tree driver, I believe that on some > frequencies (lower than storm) one can observe the same on mainstream > drivers. And I *think* that I've reproduced that on mainstream with > virtio driver and package size of 1500 in VMs (thou I don't quite like > the perf testing in VMs). > > So, ITOW, maybe there is a bit better way to *detect* that cpu time > spent on serving softirqs is close to storm and that userspace starts > starving? (and launch ksoftirqd in the result or balance between > deferring and serving softirq right-there). > >> But your patch still seems to make sense for the case you described: >> when >> ksoftirqd is voluntarily preempted off and the current IRQ could >> handle the >> queue. Note that ksoftirqd being kicked (TASK_RUNNING) is the sign of softirq pressure. Or maybe we lack one bit to signal that __do_softirq() had to wakeup_softirqd() because of pressure. (If I remember correctly, I added such a state when submitting my first patch, https://www.spinics.net/lists/netdev/msg377172.html and then Peter suggested using tsk->state == TASK_RUNNING: https://www.spinics.net/lists/netdev/msg377210.html) Maybe the problem is not the new patch, but the use of need_resched() in __do_softirq() that I added in 2013 (commit c10d73671ad30f54692f7f69f0e09e75d3a8926a) combined with the new patch. 
diff --git a/kernel/softirq.c b/kernel/softirq.c index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip); /* * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times, - * but break the loop if need_resched() is set or after 2 ms. + * but break the loop after 2 ms. * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in * certain cases, such as stop_machine(), jiffies may cease to * increment and so we need the MAX_SOFTIRQ_RESTART limit as @@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) pending = local_softirq_pending(); if (pending) { - if (time_before(jiffies, end) && !need_resched() && - --max_restart) + if (time_before(jiffies, end) && --max_restart) goto restart; wakeup_softirqd(); ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 16:20 ` Eric Dumazet @ 2018-01-11 16:32 ` Peter Zijlstra 2018-01-11 16:38 ` David Miller 2018-01-11 18:48 ` Linus Torvalds 0 siblings, 2 replies; 57+ messages in thread From: Peter Zijlstra @ 2018-01-11 16:32 UTC (permalink / raw) To: Eric Dumazet Cc: Dmitry Safonov, Frederic Weisbecker, Linus Torvalds, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 08:20:18AM -0800, Eric Dumazet wrote: > diff --git a/kernel/softirq.c b/kernel/softirq.c > index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba > 100644 > --- a/kernel/softirq.c > +++ b/kernel/softirq.c > @@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip); > > /* > * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times, > - * but break the loop if need_resched() is set or after 2 ms. > + * but break the loop after 2 ms. > * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in > * certain cases, such as stop_machine(), jiffies may cease to > * increment and so we need the MAX_SOFTIRQ_RESTART limit as > @@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) > > pending = local_softirq_pending(); > if (pending) { > - if (time_before(jiffies, end) && !need_resched() && > - --max_restart) > + if (time_before(jiffies, end) && --max_restart) > goto restart; > > wakeup_softirqd(); You just introduced a 2ms preempt-disable region I think, that's not cool for PREEMPT and a plain bug on PREEMPT_RT. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 16:32 ` Peter Zijlstra @ 2018-01-11 16:38 ` David Miller 2018-01-11 16:43 ` Peter Zijlstra 2018-01-11 18:48 ` Linus Torvalds 1 sibling, 1 reply; 57+ messages in thread From: David Miller @ 2018-01-11 16:38 UTC (permalink / raw) To: peterz Cc: edumazet, dima, frederic, torvalds, linux-kernel, 0x7f454c46, akpm, fweisbec, hannes, mingo, alexander.levin, pabeni, paulmck, rrendec, riel, sgruszka, tglx, wanpeng.li From: Peter Zijlstra <peterz@infradead.org> Date: Thu, 11 Jan 2018 17:32:04 +0100 > You just introduced a 2ms preempt-disable region I think, that's not > cool for PREEMPT and a plain bug on PREEMPT_RT. I'd just like to know what the difference is between ksoftirqd running softirq work for 2ms and the real softirq context doing it. The latter already has this "2ms preempt-disable region" of sorts, right? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 16:38 ` David Miller @ 2018-01-11 16:43 ` Peter Zijlstra 0 siblings, 0 replies; 57+ messages in thread From: Peter Zijlstra @ 2018-01-11 16:43 UTC (permalink / raw) To: David Miller Cc: edumazet, dima, frederic, torvalds, linux-kernel, 0x7f454c46, akpm, fweisbec, hannes, mingo, alexander.levin, pabeni, paulmck, rrendec, riel, sgruszka, tglx, wanpeng.li On Thu, Jan 11, 2018 at 11:38:40AM -0500, David Miller wrote: > From: Peter Zijlstra <peterz@infradead.org> > Date: Thu, 11 Jan 2018 17:32:04 +0100 > > > You just introduced a 2ms preempt-disable region I think, that's not > > cool for PREEMPT and a plain bug on PREEMPT_RT. > > I'd just like to know what the difference is between ksoftirqd running > softirq work for 2ms and the real softirq context doing it. The > latter already has this "2ms preempt-disable region" of sorts, right? True I suppose.. and PREEMPT_RT does the whole softirq business differently anyway. I'll go back to staring at PTI crud.. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 16:32 ` Peter Zijlstra 2018-01-11 16:38 ` David Miller @ 2018-01-11 18:48 ` Linus Torvalds 2018-01-11 19:15 ` Eric Dumazet 1 sibling, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2018-01-11 18:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 8:32 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Jan 11, 2018 at 08:20:18AM -0800, Eric Dumazet wrote: >> diff --git a/kernel/softirq.c b/kernel/softirq.c >> index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba >> 100644 >> --- a/kernel/softirq.c >> +++ b/kernel/softirq.c >> @@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip); >> >> /* >> * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times, >> - * but break the loop if need_resched() is set or after 2 ms. >> + * but break the loop after 2 ms. >> * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in >> * certain cases, such as stop_machine(), jiffies may cease to >> * increment and so we need the MAX_SOFTIRQ_RESTART limit as >> @@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) >> >> pending = local_softirq_pending(); >> if (pending) { >> - if (time_before(jiffies, end) && !need_resched() && >> - --max_restart) >> + if (time_before(jiffies, end) && --max_restart) >> goto restart; >> >> wakeup_softirqd(); > > You just introduced a 2ms preempt-disable region I think, that's not > cool for PREEMPT and a plain bug on PREEMPT_RT. I actually think that he whole "time_before()" is garbage and should be removed. 
If doing 10 rounds of softirq processing takes several jiffies, you have a bigger issue. Looking at the history, the original time_before() was _replacing_ the "max_restart" count. See commit c10d73671ad3 ("softirq: reduce latencies"). But then "max_restart" was put back in because jiffies just isn't reliable in the first place. And I think *this* is the real issue. Networking does a *shitton* of things in softirq context, and the problem is not the softirq code, it's the networking code. The softirq code doesn't understand that the networking code does billions of operations, and one softirq can take ages and ages because it is handling millions of packets. My feeling is that it's the networking code that should notice "I have a billion packets in my softirq queue, and I just keep getting more, now *I* should start doing things in a thread instead". Because none of those limits really make sense for any of the other softirq users. Everybody else just wants a low-latency "soft interrupt", not some stupid thread. Who knew? Maybe the name should have given the thing away? If you want a thread, use a tasklet and task_work_add(). Maybe networking should reconsider its use of softirqs entirely? Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 18:48 ` Linus Torvalds @ 2018-01-11 19:15 ` Eric Dumazet 2018-01-11 19:43 ` Linus Torvalds 0 siblings, 1 reply; 57+ messages in thread From: Eric Dumazet @ 2018-01-11 19:15 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 10:48 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jan 11, 2018 at 8:32 AM, Peter Zijlstra <peterz@infradead.org> wrote: >> On Thu, Jan 11, 2018 at 08:20:18AM -0800, Eric Dumazet wrote: >>> diff --git a/kernel/softirq.c b/kernel/softirq.c >>> index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba >>> 100644 >>> --- a/kernel/softirq.c >>> +++ b/kernel/softirq.c >>> @@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip); >>> >>> /* >>> * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times, >>> - * but break the loop if need_resched() is set or after 2 ms. >>> + * but break the loop after 2 ms. >>> * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in >>> * certain cases, such as stop_machine(), jiffies may cease to >>> * increment and so we need the MAX_SOFTIRQ_RESTART limit as >>> @@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void) >>> >>> pending = local_softirq_pending(); >>> if (pending) { >>> - if (time_before(jiffies, end) && !need_resched() && >>> - --max_restart) >>> + if (time_before(jiffies, end) && --max_restart) >>> goto restart; >>> >>> wakeup_softirqd(); >> >> You just introduced a 2ms preempt-disable region I think, that's not >> cool for PREEMPT and a plain bug on PREEMPT_RT. 
> > I actually think that he whole "time_before()" is garbage and should be removed. > > If doing 10 rounds of softirq processing takes several jiffies, you > have bigger issue. > > Looking at the history, the original time_before() was _replacing_ the > "max_retry". See commit c10d73671ad3 ("softirq: reduce latencies"). > > But then "max_restart" was put back in because jiffies just isn't > reliable in the first place. > > And I think *this* is the real issue. Networking does a *shitton* of > things in softirq context, and the problem is not the softirq code, > it's the networking code. The softirq code doesn't understand that the > networking code does billions of operations, and one softirq can take > ages and ages because it is handling millions of packets. > > My feeling is that it's the networking code that should notice "I have > a billion packets in my softirq queue, and I just keep getting more, > now *I* should start doing things in a thread instead". > > Because none of those limits really make sense for any of the other > softirq users. Everybody else just wants a low-latency "soft > interrupt", not some stupid thread. Who knew? Maybe the name should > have given the thing away? > > If you want a thread, use a tasklet and add_task_work(). > > Maybe networking should reconsider its use of softirqs entirely? Issue can happen when one packet is delivered (and handled) per NIC IRQ. Not 'billions' :/ Suppose host receives 99,000 packets per second on one queue (one CPU handling the particular IRQ), each packet taking 10 usec to handle (various cache misses and all the overhead) How are we supposed to know this particular workload is problematic for innocent user threads on the same cpu ? Is 'networking' supposed to perform some kind of mid-term analysis of what happened in the past and/or predict the future ? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 19:15 ` Eric Dumazet @ 2018-01-11 19:43 ` Linus Torvalds 2018-01-11 19:48 ` Eric Dumazet 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2018-01-11 19:43 UTC (permalink / raw) To: Eric Dumazet Cc: Peter Zijlstra, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 11:15 AM, Eric Dumazet <edumazet@google.com> wrote: > > How are we supposed to know this particular workload is problematic > for innocent user threads on the same cpu ? Put that question another way: how is the _softirq_ code supposed to know? If you can't know, then the softirq code definitely can't know either. It has even less information. It doesn't know about NAPI, it doesn't know about irq steering, it doesn't know anything at all. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 19:43 ` Linus Torvalds @ 2018-01-11 19:48 ` Eric Dumazet 2018-01-11 20:03 ` Linus Torvalds 0 siblings, 1 reply; 57+ messages in thread From: Eric Dumazet @ 2018-01-11 19:48 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 11:43 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jan 11, 2018 at 11:15 AM, Eric Dumazet <edumazet@google.com> wrote: >> >> How are we supposed to know this particular workload is problematic >> for innocent user threads on the same cpu ? > > Put that question another way: how is the _softirq_ code supposed to know? That was the purpose of the last patch: as soon as ksoftirqd is scheduled (by some kind of jitter in the 99,000 pps workload, or an antagonist wakeup), we switch to a mode where the process scheduler can make decisions based on thread priorities and CPU usage. Then, as soon as the load manages to finish the pending irqs within its quantum, we re-enter the mode where softirqs are immediately serviced. > > If you can't know, then the softirq code definitely can't know either. > It has even less information. It doesn't know about NAPI, it doesn't > know about irq steering, it doesn't know anything at all. That is true. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 19:48 ` Eric Dumazet @ 2018-01-11 20:03 ` Linus Torvalds 2018-01-11 20:16 ` Eric Dumazet 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2018-01-11 20:03 UTC (permalink / raw) To: Eric Dumazet Cc: Peter Zijlstra, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 11:48 AM, Eric Dumazet <edumazet@google.com> wrote: > That was the purpose on the last patch : As soon as ksoftirqd is scheduled > (by some kind of jitter in the 99,000 pps workload, or antagonist wakeup), > we then switch to a mode where process scheduler can make decisions > based on threads prios and cpu usage. Yeah, but that really screws up everybody else. It really is a soft *interrupt*. That was what it was designed for. The thread handling is not primary, it's literally a fallback to avoid complete starvation. The fact that networking has now - for several years - tried to make it some kind of thread and get fairness with user threads is all entirely antithetical to what softirq was designed for. > Then, as soon as the load was able to finish in its quantum the > pending irqs, we re-enter the mode > where softirq are immediately serviced. Except that's not at all how the code works. As I pointed out, the softirq thread can be scheduled away, but "ksoftirqd_running()" will still return true - and the networking code has now screwed up all the *other* softirqs too! I really suspect that what networking wants is more like the workqueues. Or at least more isolation between different softirq users, but that's fairly fundamentally hard, given how softirqs are designed. My dvb-fixing patch was an *extremely* stupid version of that "more isolation". 
But it really is a complete hack, saying that tasklets are special and shouldn't trigger ksoftirqd. They're not really all that special, but it at least isolated the USB/DVB usage from the "networking wants softirqd" problem. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 20:03 ` Linus Torvalds @ 2018-01-11 20:16 ` Eric Dumazet 2018-01-11 20:22 ` Linus Torvalds 0 siblings, 1 reply; 57+ messages in thread From: Eric Dumazet @ 2018-01-11 20:16 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 12:03 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jan 11, 2018 at 11:48 AM, Eric Dumazet <edumazet@google.com> wrote: >> That was the purpose on the last patch : As soon as ksoftirqd is scheduled >> (by some kind of jitter in the 99,000 pps workload, or antagonist wakeup), >> we then switch to a mode where process scheduler can make decisions >> based on threads prios and cpu usage. > > Yeah, but that really screws up everybody else. > > It really is a soft *interrupt*. That was what it was designed for. > The thread handling is not primary, it's literally a fallback to avoid > complete starvation. > > The fact that networking has now - for several years - tried to make > it some kind of thread and get fairness with user threads is all > entirely antithetical to what softirq was designed for. > >> Then, as soon as the load was able to finish in its quantum the >> pending irqs, we re-enter the mode >> where softirq are immediately serviced. > > Except that's not at all how the code works. > > As I pointed out, the softirq thread can be scheduled away, but the > "softiq_running()" wil stilll return true - and the networking code > has now screwed up all the *other* softirqs too! > > I really suspect that what networking wants is more like the > workqueues. 
Or at least more isolation between different softirq > users, but that's fairly fundamentally hard, given how softirqs are > designed. > Note that when I implemented TCP Small queues, I did experiments between using a work queue or a tasklet, and workqueues added unacceptable P99 latencies when many user threads were competing with kernel threads. I suspect that firing a workqueue for networking RX will likely have the same effect :/ Note that the current __do_softirq() implementation suffers from the following: Say we receive a NET_RX softirq -> While processing the packet, we wake up a thread (thus need_resched() becomes true), but also raise a tasklet (because a particular driver needs some extra processing in tasklet context instead of NET_RX ???) -> Then we exit __do_softirq() _and_ schedule ksoftirqd (because the tasklet needs to be serviced) ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 20:16 ` Eric Dumazet @ 2018-01-11 20:22 ` Linus Torvalds 2018-01-11 20:34 ` Dmitry Safonov 2018-01-12 5:23 ` Mike Galbraith 0 siblings, 2 replies; 57+ messages in thread From: Linus Torvalds @ 2018-01-11 20:22 UTC (permalink / raw) To: Eric Dumazet Cc: Peter Zijlstra, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <edumazet@google.com> wrote: > > Note that when I implemented TCP Small queues, I did experiments between > using a work queue or a tasklet, and workqueues added unacceptable P99 > latencies, when many user threads are competing with kernel threads. Yes. So I think one solution might be to have a hybrid system, where we do the softirqs synchronously normally (which is what you really want for good latency). But then fall back to a threaded model - but that fallback case should be per-softirq, not global. So if one softirq uses a lot of CPU time, that shouldn't affect the latency of other softirqs. So maybe we could get rid of the per-cpu ksoftirqd entirely, and replace it with per-cpu and per-softirq workqueues? Would something like that sound sane? Just a SMOP/SMOT (small matter of programming/testing). Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 20:22 ` Linus Torvalds @ 2018-01-11 20:34 ` Dmitry Safonov 2018-01-11 20:37 ` Eric Dumazet 2018-01-11 20:40 ` Linus Torvalds 0 siblings, 2 replies; 57+ messages in thread From: Dmitry Safonov @ 2018-01-11 20:34 UTC (permalink / raw) To: Linus Torvalds, Eric Dumazet Cc: Peter Zijlstra, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote: > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <edumazet@google.com> > wrote: > > > > Note that when I implemented TCP Small queues, I did experiments > > between > > using a work queue or a tasklet, and workqueues added unacceptable > > P99 > > latencies, when many user threads are competing with kernel > > threads. > > Yes. > > So I think one solution might be to have a hybrid system, where we do > the softirq's synchronously normally (which is what you really want > for good latency). > > But then fall down on a threaded model - but that fallback case > should > be per-softirq, not global. So if one softirq uses a lot of CPU time, > that shouldn't affect the latency of other softirqs. > > So maybe we could get rid of the per-cpu ksoftirqd entirely, and > replace it with with per-cpu and per-softirq workqueues? > > Would something like that sound sane? > > Just a SMOP/SMOT (small matter of programming/testing). I could try to write a PoC for that. What should be the trigger to fall back to the workqueue? How do we tell if there are too many softirqs of a kind? The current logic with if (pending) at the end of __do_softirq() seems to work only selectively; it still looks possible to starve a CPU. 
-- Thanks, Dmitry ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 20:34 ` Dmitry Safonov @ 2018-01-11 20:37 ` Eric Dumazet 2018-01-11 20:40 ` Linus Torvalds 1 sibling, 0 replies; 57+ messages in thread From: Eric Dumazet @ 2018-01-11 20:37 UTC (permalink / raw) To: Dmitry Safonov Cc: Linus Torvalds, Peter Zijlstra, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <dima@arista.com> wrote: > On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote: >> On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <edumazet@google.com> >> wrote: >> > >> > Note that when I implemented TCP Small queues, I did experiments >> > between >> > using a work queue or a tasklet, and workqueues added unacceptable >> > P99 >> > latencies, when many user threads are competing with kernel >> > threads. >> >> Yes. >> >> So I think one solution might be to have a hybrid system, where we do >> the softirq's synchronously normally (which is what you really want >> for good latency). >> >> But then fall down on a threaded model - but that fallback case >> should >> be per-softirq, not global. So if one softirq uses a lot of CPU time, >> that shouldn't affect the latency of other softirqs. >> >> So maybe we could get rid of the per-cpu ksoftirqd entirely, and >> replace it with with per-cpu and per-softirq workqueues? >> >> Would something like that sound sane? >> >> Just a SMOP/SMOT (small matter of programming/testing). > > I could try to write a PoC for that.. > What should be the trigger to fall into workqueue? > How to tell if there're too many softirqs of the kind? > Current logic with if (pending) in the end of __do_softirq() > looks working selectively.. > It looks to be still possible to starve a cpu. 
I guess we would need to track the amount of time spent processing softirqs (while interrupting a non-idle task). ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 20:34 ` Dmitry Safonov 2018-01-11 20:37 ` Eric Dumazet @ 2018-01-11 20:40 ` Linus Torvalds 2018-01-11 20:46 ` Dmitry Safonov 1 sibling, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2018-01-11 20:40 UTC (permalink / raw) To: Dmitry Safonov Cc: Eric Dumazet, Peter Zijlstra, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <dima@arista.com> wrote: > > I could try to write a PoC for that.. > What should be the trigger to fall into workqueue? > How to tell if there're too many softirqs of the kind? I suspect it would have to be time-based, probably using the scheduler clock. Most softirqs are really really small. So just counting them probably isn't all that meaningful, although the count is good as a fallback (as shown by the jiffy issues). The good news is that we only have a fairly small handful of softirqs, so counting/timing them separately is still mainly a pretty small array (which needs to be percpu, of course). Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 20:40 ` Linus Torvalds @ 2018-01-11 20:46 ` Dmitry Safonov 2018-01-11 20:53 ` Eric Dumazet 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Safonov @ 2018-01-11 20:46 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, Peter Zijlstra, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, 2018-01-11 at 12:40 -0800, Linus Torvalds wrote: > On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <dima@arista.com> > wrote: > > > > I could try to write a PoC for that.. > > What should be the trigger to fall into workqueue? > > How to tell if there're too many softirqs of the kind? > > I suspect it would have to be time-based, probably using the > scheduler clock. I thought about this, but I was a bit afraid of how pricey it would be to recalculate it on every tick. Well, I might just try to write that and measure the impact. > Most softirqs are really really small. So just counting them probably > isn't all that meaningful, although the count is good as a fallback > (as shown by the jiffy issues). > > The good news is that we only have a fairly small handful of > softirqs, > so counting/timing them separately is still mainly a pretty small > array (which needs to be percpu, of course). ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 20:46 ` Dmitry Safonov @ 2018-01-11 20:53 ` Eric Dumazet 2018-01-11 21:13 ` Dmitry Safonov 0 siblings, 1 reply; 57+ messages in thread From: Eric Dumazet @ 2018-01-11 20:53 UTC (permalink / raw) To: Dmitry Safonov Cc: Linus Torvalds, Peter Zijlstra, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 12:46 PM, Dmitry Safonov <dima@arista.com> wrote: > On Thu, 2018-01-11 at 12:40 -0800, Linus Torvalds wrote: >> On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <dima@arista.com> >> wrote: >> > >> > I could try to write a PoC for that.. >> > What should be the trigger to fall into workqueue? >> > How to tell if there're too many softirqs of the kind? >> >> I suspect it would have to be time-based, probably using the >> scheduler clock. > > I thought about this, but I was a bit afraid of how much pricey it > would be recalculate it each clock. Well, might just try to write that > and measure the impact. > >> Most softirqs are really really small. So just counting them probably >> isn't all that meaningful, although the count is good as a fallback >> (as shown by the jiffy issues). >> >> The good news is that we only have a fairly small handful of >> softirqs, >> so counting/timing them separately is still mainly a pretty small >> array (which needs to be percpu, of course). Note that using (scheduler) clock might also help to break net_rx_action() not on a stupid netdev_budget, but on a more precise time limit as well. 
netdev_budget of 300 packets is quite big :/ (The time_limit based on jiffies + 2 does not work on hosts with one CPU, since jiffies won't make progress while net_rx_action() is running)
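Eric's two exit conditions — the 300-packet netdev_budget and a time limit — can be modelled in userspace to show why a nanosecond clock works where jiffies + 2 cannot: on a single CPU the timer tick (and hence jiffies) makes no progress while the loop runs, so only the packet budget ever fires. The function below is a toy with an artificial clock; names and numbers are illustrative:

```c
/*
 * Toy model of net_rx_action()'s exit conditions.  now_ns stands in for
 * sched_clock(); cost_ns is an artificial per-packet cost.  A jiffies-based
 * limit cannot even be modelled on UP: jiffies would stay constant for the
 * whole loop, so such a check would simply never fire.
 */
#include <assert.h>
#include <stdint.h>

#define NETDEV_BUDGET 300               /* the stock packet budget */

static uint64_t now_ns;                 /* fake monotonic clock */

static int poll_with_ns_limit(uint64_t limit_ns, uint64_t cost_ns, int backlog)
{
    int done = 0;
    uint64_t start = now_ns;

    while (done < backlog) {
        now_ns += cost_ns;              /* "process" one packet */
        done++;
        if (done >= NETDEV_BUDGET)      /* packet budget, as today */
            break;
        if (now_ns - start >= limit_ns) /* ns limit fires even on one CPU */
            break;
    }
    return done;
}
```

With a limit of 10 us and a 1 us per-packet cost, the loop exits after 10 packets instead of grinding through all 300.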
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 20:53 ` Eric Dumazet @ 2018-01-11 21:13 ` Dmitry Safonov 2018-01-12 5:41 ` Frederic Weisbecker 0 siblings, 1 reply; 57+ messages in thread From: Dmitry Safonov @ 2018-01-11 21:13 UTC (permalink / raw) To: Eric Dumazet Cc: Linus Torvalds, Peter Zijlstra, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, 2018-01-11 at 12:53 -0800, Eric Dumazet wrote: > On Thu, Jan 11, 2018 at 12:46 PM, Dmitry Safonov <dima@arista.com> > wrote: > > On Thu, 2018-01-11 at 12:40 -0800, Linus Torvalds wrote: > > > On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <dima@arista.com > > > > > > > wrote: > > > > > > > > I could try to write a PoC for that.. > > > > What should be the trigger to fall into workqueue? > > > > How to tell if there're too many softirqs of the kind? > > > > > > I suspect it would have to be time-based, probably using the > > > scheduler clock. > > > > I thought about this, but I was a bit afraid of how much pricey it > > would be recalculate it each clock. Well, might just try to write > > that > > and measure the impact. > > > > > Most softirqs are really really small. So just counting them > > > probably > > > isn't all that meaningful, although the count is good as a > > > fallback > > > (as shown by the jiffy issues). > > > > > > The good news is that we only have a fairly small handful of > > > softirqs, > > > so counting/timing them separately is still mainly a pretty small > > > array (which needs to be percpu, of course). > > Note that using (scheduler) clock might also help to break > net_rx_action() > not on a stupid netdev_budget, but on a more precise time limit as > well. 
> > netdev_budget of 300 packets is quite big :/ > > (The time_limit based on jiffies + 2 does not work on hosts with one > cpu, since jiffies wont make progress while net_rx_action() is > running) Thanks for the details, Eric. I'll try to come up with poc if no one beats me at it. -- Thanks, Dmitry ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 21:13 ` Dmitry Safonov @ 2018-01-12 5:41 ` Frederic Weisbecker 0 siblings, 0 replies; 57+ messages in thread From: Frederic Weisbecker @ 2018-01-12 5:41 UTC (permalink / raw) To: Dmitry Safonov Cc: Eric Dumazet, Linus Torvalds, Peter Zijlstra, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 09:13:42PM +0000, Dmitry Safonov wrote: > On Thu, 2018-01-11 at 12:53 -0800, Eric Dumazet wrote: > > On Thu, Jan 11, 2018 at 12:46 PM, Dmitry Safonov <dima@arista.com> > > wrote: > > > On Thu, 2018-01-11 at 12:40 -0800, Linus Torvalds wrote: > > > > On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <dima@arista.com > > > > > > > > > wrote: > > > > > > > > > > I could try to write a PoC for that.. > > > > > What should be the trigger to fall into workqueue? > > > > > How to tell if there're too many softirqs of the kind? > > > > > > > > I suspect it would have to be time-based, probably using the > > > > scheduler clock. > > > > > > I thought about this, but I was a bit afraid of how much pricey it > > > would be recalculate it each clock. Well, might just try to write > > > that > > > and measure the impact. > > > > > > > Most softirqs are really really small. So just counting them > > > > probably > > > > isn't all that meaningful, although the count is good as a > > > > fallback > > > > (as shown by the jiffy issues). > > > > > > > > The good news is that we only have a fairly small handful of > > > > softirqs, > > > > so counting/timing them separately is still mainly a pretty small > > > > array (which needs to be percpu, of course). 
> > > > Note that using (scheduler) clock might also help to break > > net_rx_action() > > not on a stupid netdev_budget, but on a more precise time limit as > > well. > > > > netdev_budget of 300 packets is quite big :/ > > > > (The time_limit based on jiffies + 2 does not work on hosts with one > > cpu, since jiffies wont make progress while net_rx_action() is > > running) > > Thanks for the details, Eric. > I'll try to come up with poc if no one beats me at it. I just gave it a try. Sorry I couldn't resist :-s ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-11 20:22 ` Linus Torvalds 2018-01-11 20:34 ` Dmitry Safonov @ 2018-01-12 5:23 ` Mike Galbraith 2018-01-12 10:13 ` Peter Zijlstra ` (2 more replies) 1 sibling, 3 replies; 57+ messages in thread From: Mike Galbraith @ 2018-01-12 5:23 UTC (permalink / raw) To: Linus Torvalds, Eric Dumazet Cc: Peter Zijlstra, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote: > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <edumazet@google.com> wrote: > > > > Note that when I implemented TCP Small queues, I did experiments between > > using a work queue or a tasklet, and workqueues added unacceptable P99 > > latencies, when many user threads are competing with kernel threads. > > Yes. > > So I think one solution might be to have a hybrid system, where we do > the softirq's synchronously normally (which is what you really want > for good latency). > > But then fall down on a threaded model - but that fallback case should > be per-softirq, not global. So if one softirq uses a lot of CPU time, > that shouldn't affect the latency of other softirqs. > > So maybe we could get rid of the per-cpu ksoftirqd entirely, and > replace it with with per-cpu and per-softirq workqueues? How would that be better than what RT used to do, and I still do for my RT kernels via boot option, namely split ksoftirqd into per-softirq threads. -Mike ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 5:23 ` Mike Galbraith @ 2018-01-12 10:13 ` Peter Zijlstra 2018-01-12 17:26 ` Linus Torvalds 2018-01-12 14:58 ` Frederic Weisbecker 2018-01-12 17:24 ` Linus Torvalds 2 siblings, 1 reply; 57+ messages in thread From: Peter Zijlstra @ 2018-01-12 10:13 UTC (permalink / raw) To: Mike Galbraith Cc: Linus Torvalds, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote: > On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote: > > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <edumazet@google.com> wrote: > > > > > > Note that when I implemented TCP Small queues, I did experiments between > > > using a work queue or a tasklet, and workqueues added unacceptable P99 > > > latencies, when many user threads are competing with kernel threads. > > > > Yes. > > > > So I think one solution might be to have a hybrid system, where we do > > the softirq's synchronously normally (which is what you really want > > for good latency). > > > > But then fall down on a threaded model - but that fallback case should > > be per-softirq, not global. So if one softirq uses a lot of CPU time, > > that shouldn't affect the latency of other softirqs. > > > > So maybe we could get rid of the per-cpu ksoftirqd entirely, and > > replace it with with per-cpu and per-softirq workqueues? > > How would that be better than what RT used to do, and I still do for my > RT kernels via boot option, namely split ksoftirqd into per-softirq > threads. Since we mention it; one of the problems RT has is that all RX is through a single softirq context, which generates a priority inversion between devices. 
If we're splitting things, it would be awesome if we could get per-device context.
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 10:13 ` Peter Zijlstra @ 2018-01-12 17:26 ` Linus Torvalds 2018-01-12 17:44 ` Mike Galbraith 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2018-01-12 17:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Mike Galbraith, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Fri, Jan 12, 2018 at 2:13 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote: >> >> How would that be better than what RT used to do, and I still do for my >> RT kernels via boot option, namely split ksoftirqd into per-softirq >> threads. > > Since we mention it; one of the problems RT has is that all RX is > through a single softirq context, which generates a priority inversion > between devices. Oh, yes, that is complete and utter shit. Not acceptable at all. As mentioned, it really would need to be per-cpu _and_ per-softirq. Which is why I thought workqueues might be the thing. Whatever RT is doing is apparently just pure and utter garbage. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 17:26 ` Linus Torvalds @ 2018-01-12 17:44 ` Mike Galbraith 2018-01-12 17:51 ` Linus Torvalds 0 siblings, 1 reply; 57+ messages in thread From: Mike Galbraith @ 2018-01-12 17:44 UTC (permalink / raw) To: Linus Torvalds, Peter Zijlstra Cc: Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Fri, 2018-01-12 at 09:26 -0800, Linus Torvalds wrote: > On Fri, Jan 12, 2018 at 2:13 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote: > >> > >> How would that be better than what RT used to do, and I still do for my > >> RT kernels via boot option, namely split ksoftirqd into per-softirq > >> threads. > > > > Since we mention it; one of the problems RT has is that all RX is > > through a single softirq context, which generates a priority inversion > > between devices. > > Oh, yes, that is complete and utter shit. Not acceptable at all. > > As mentioned, it really would need to be per-cpu _and_ per-softirq. > > Which is why I thought workqueues might be the thing. Whatever RT is > doing is apparently just pure and utter garbage. Nah, a misunderstanding happened. RT that still offers full threading creates per-softirq threads per cpu. The regular trees split ksoftirqd into only two threads per cpu, one processes timer/hrtimer softriqs, the other processes the rest. -Mike ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 17:44 ` Mike Galbraith @ 2018-01-12 17:51 ` Linus Torvalds 2018-01-12 18:15 ` Mike Galbraith 2018-01-12 19:28 ` Thomas Gleixner 0 siblings, 2 replies; 57+ messages in thread From: Linus Torvalds @ 2018-01-12 17:51 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <efault@gmx.de> wrote: > > Nah, a misunderstanding happened. RT that still offers full threading > creates per-softirq threads per cpu. The regular trees split ksoftirqd > into only two threads per cpu, one processes timer/hrtimer softriqs, > the other processes the rest. Ok, that sounds like it should work, but it also sounds like it's very specific to RT itself. For example, the dvb issue was not about the timer softirqs, but about the tasklet ones. So maybe we wouldn't need to split it for _every_ softirq, but we'd need to split it more than just along the timer case. And it does sound a bit excessive to have ten fixed threads for every CPU. The days when tens of CPU's meant "huge system" are gone. These days it can be a phone. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 17:51 ` Linus Torvalds @ 2018-01-12 18:15 ` Mike Galbraith 2018-01-12 18:44 ` Mike Galbraith 2018-01-12 19:28 ` Thomas Gleixner 1 sibling, 1 reply; 57+ messages in thread From: Mike Galbraith @ 2018-01-12 18:15 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Fri, 2018-01-12 at 09:51 -0800, Linus Torvalds wrote: > On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <efault@gmx.de> wrote: > > > > Nah, a misunderstanding happened. RT that still offers full threading > > creates per-softirq threads per cpu. The regular trees split ksoftirqd > > into only two threads per cpu, one processes timer/hrtimer softriqs, > > the other processes the rest. > > Ok, that sounds like it should work, but it also sounds like it's very > specific to RT itself. > > For example, the dvb issue was not about the timer softirqs, but about > the tasklet ones. > > So maybe we wouldn't need to split it for _every_ softirq, but we'd > need to split it more than just along the timer case. > > And it does sound a bit excessive to have ten fixed threads for every > CPU. The days when tens of CPU's meant "huge system" are gone. These > days it can be a phone. Yeah, it is excessive more often than not. You get to prioritize, and segregate, which is nice, but you pay for it. -Mike ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 18:15 ` Mike Galbraith @ 2018-01-12 18:44 ` Mike Galbraith 2018-01-17 20:30 ` David Miller 0 siblings, 1 reply; 57+ messages in thread From: Mike Galbraith @ 2018-01-12 18:44 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Fri, 2018-01-12 at 19:15 +0100, Mike Galbraith wrote: > On Fri, 2018-01-12 at 09:51 -0800, Linus Torvalds wrote: > > On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <efault@gmx.de> wrote: > > > > > > Nah, a misunderstanding happened. RT that still offers full threading > > > creates per-softirq threads per cpu. The regular trees split ksoftirqd > > > into only two threads per cpu, one processes timer/hrtimer softriqs, > > > the other processes the rest. > > > > Ok, that sounds like it should work, but it also sounds like it's very > > specific to RT itself. > > > > For example, the dvb issue was not about the timer softirqs, but about > > the tasklet ones. > > > > So maybe we wouldn't need to split it for _every_ softirq, but we'd > > need to split it more than just along the timer case. > > > > And it does sound a bit excessive to have ten fixed threads for every > > CPU. The days when tens of CPU's meant "huge system" are gone. These > > days it can be a phone. > > Yeah, it is excessive more often than not. You get to prioritize, and > segregate, which is nice, but you pay for it. BTW, much of the softirq load in RT is processed by the raising task. 
tbench_srv-6985 [000] d...112 293.902511: softirq_raise: vec=3 [action=NET_RX]
tbench_srv-6985 [000] .....13 293.902511: softirq_entry: vec=3 [action=NET_RX]
tbench_srv-6985 [000] .....13 293.902515: softirq_exit: vec=3 [action=NET_RX]
tbench-6984 [003] d...112 293.902520: softirq_raise: vec=3 [action=NET_RX]
tbench-6984 [003] .....13 293.902520: softirq_entry: vec=3 [action=NET_RX]
tbench-6984 [003] .....13 293.902523: softirq_exit: vec=3 [action=NET_RX]
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 18:44 ` Mike Galbraith @ 2018-01-17 20:30 ` David Miller 2018-01-17 21:06 ` Linus Torvalds 0 siblings, 1 reply; 57+ messages in thread From: David Miller @ 2018-01-17 20:30 UTC (permalink / raw) To: efault Cc: torvalds, peterz, edumazet, dima, frederic, linux-kernel, 0x7f454c46, akpm, fweisbec, hannes, mingo, alexander.levin, pabeni, paulmck, rrendec, riel, sgruszka, tglx, wanpeng.li From: Mike Galbraith <efault@gmx.de> Date: Fri, 12 Jan 2018 19:44:30 +0100 > On Fri, 2018-01-12 at 19:15 +0100, Mike Galbraith wrote: >> On Fri, 2018-01-12 at 09:51 -0800, Linus Torvalds wrote: >> > On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <efault@gmx.de> wrote: >> > > >> > > Nah, a misunderstanding happened. RT that still offers full threading >> > > creates per-softirq threads per cpu. The regular trees split ksoftirqd >> > > into only two threads per cpu, one processes timer/hrtimer softriqs, >> > > the other processes the rest. >> > >> > Ok, that sounds like it should work, but it also sounds like it's very >> > specific to RT itself. >> > >> > For example, the dvb issue was not about the timer softirqs, but about >> > the tasklet ones. >> > >> > So maybe we wouldn't need to split it for _every_ softirq, but we'd >> > need to split it more than just along the timer case. >> > >> > And it does sound a bit excessive to have ten fixed threads for every >> > CPU. The days when tens of CPU's meant "huge system" are gone. These >> > days it can be a phone. >> >> Yeah, it is excessive more often than not. You get to prioritize, and >> segregate, which is nice, but you pay for it. > > BTW, much of the softirq load in RT is processed by the raising task. 
>
> tbench_srv-6985 [000] d...112 293.902511: softirq_raise: vec=3 [action=NET_RX]
> tbench_srv-6985 [000] .....13 293.902511: softirq_entry: vec=3 [action=NET_RX]
> tbench_srv-6985 [000] .....13 293.902515: softirq_exit: vec=3 [action=NET_RX]
> tbench-6984 [003] d...112 293.902520: softirq_raise: vec=3 [action=NET_RX]
> tbench-6984 [003] .....13 293.902520: softirq_entry: vec=3 [action=NET_RX]
> tbench-6984 [003] .....13 293.902523: softirq_exit: vec=3 [action=NET_RX]

And this is because tbench runs over loopback, which triggers softirqs in the context of whatever generates the loopback packet transmit. Which in this case is the tbench process calling sendmsg(). I wanted to chime in about this earlier, and make it clear that it isn't just IRQs that can trigger softirqs. User context actions in the kernel can trigger softirqs too.
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 20:30 ` David Miller @ 2018-01-17 21:06 ` Linus Torvalds 2018-01-17 21:49 ` David Miller 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2018-01-17 21:06 UTC (permalink / raw) To: David Miller Cc: Mike Galbraith, Peter Zijlstra, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, Linux Kernel Mailing List, Dmitry Safonov, Andrew Morton, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Sasha Levin, Paolo Abeni, Paul McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Wed, Jan 17, 2018 at 12:30 PM, David Miller <davem@davemloft.net> wrote: > > I wanted to chime in about this earlier, and make it clear that it isn't > just IRQs that can trigger softirqs. User context actions in the kernel > can trigger softirqs too. Yes, anybody can do that "raise_softirq()" thing, although the common thing tends to be for it to be from an interrupt that wants to delay more long-running things to a non-irq-dsiabled context. It was in some way always a "poor mans interrupt thread" (with no blocking like a real thread context, but at least not impacting actual interrupt latency). That said, this made me wonder a bit. I wonder how bounded the latency is for raising a softirq from process context. We only _check_ the softirq on the last hardirq exit, I think. We latency was traditionally bounded by the timer irq, and non-idle CPU's should still be running timers. But I do note that networking does seem to have some private hacks for the latency problem (ie net_tx_action())? I wonder if we should run softirqs on return to user mode (and make softirq set a thread flag if not in interrupt context). Although maybe some people actually want the "delay until next interrupt" kind of behavior to throttle things if they do softirqs from process context. That sounds unlikely, but who knows. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 21:06 ` Linus Torvalds @ 2018-01-17 21:49 ` David Miller 2018-01-17 21:54 ` Thomas Gleixner 2018-01-17 21:54 ` Linus Torvalds 0 siblings, 2 replies; 57+ messages in thread From: David Miller @ 2018-01-17 21:49 UTC (permalink / raw) To: torvalds Cc: efault, peterz, edumazet, dima, frederic, linux-kernel, 0x7f454c46, akpm, fweisbec, hannes, mingo, alexander.levin, pabeni, paulmck, rrendec, riel, sgruszka, tglx, wanpeng.li From: Linus Torvalds <torvalds@linux-foundation.org> Date: Wed, 17 Jan 2018 13:06:58 -0800 > It was in some way always a "poor mans interrupt thread" (with no > blocking like a real thread context, but at least not impacting actual > interrupt latency). Or in this loopback device case (and tunnel decapsulation) a poor man's longjmp, releasing the current stack frame to keep the depth in check. Anyways... > That said, this made me wonder a bit. I wonder how bounded the latency > is for raising a softirq from process context. We only _check_ the > softirq on the last hardirq exit, I think. System call return checks it, otherwise this situation would be completely bolixed. > I wonder if we should run softirqs on return to user mode (and make > softirq set a thread flag if not in interrupt context). I'm pretty sure we already do. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 21:49 ` David Miller @ 2018-01-17 21:54 ` Thomas Gleixner 2018-01-17 21:58 ` Linus Torvalds 2018-01-17 21:54 ` Linus Torvalds 1 sibling, 1 reply; 57+ messages in thread From: Thomas Gleixner @ 2018-01-17 21:54 UTC (permalink / raw) To: David Miller Cc: torvalds, efault, peterz, edumazet, dima, frederic, linux-kernel, 0x7f454c46, akpm, fweisbec, hannes, mingo, alexander.levin, pabeni, paulmck, rrendec, riel, sgruszka, wanpeng.li On Wed, 17 Jan 2018, David Miller wrote: > From: Linus Torvalds <torvalds@linux-foundation.org> > Date: Wed, 17 Jan 2018 13:06:58 -0800 > > > It was in some way always a "poor mans interrupt thread" (with no > > blocking like a real thread context, but at least not impacting actual > > interrupt latency). > > Or in this loopback device case (and tunnel decapsulation) a poor > man's longjmp, releasing the current stack frame to keep the depth > in check. > > Anyways... > > > That said, this made me wonder a bit. I wonder how bounded the latency > > is for raising a softirq from process context. We only _check_ the > > softirq on the last hardirq exit, I think. > > System call return checks it, otherwise this situation would be > completely bolixed. Errm. No. > > > I wonder if we should run softirqs on return to user mode (and make > > softirq set a thread flag if not in interrupt context). > > I'm pretty sure we already do. Nope. raise_softirq() -> raise_softirq_irqoff() set_softirq_bit(); if (!in_interrupt()) wake_softirqd(); So if the caller is not in hard or soft interrupt context, which includes bottom half disabled regions softirqd is woken. If the caller is in a bottom half disabled region then local_bh_enable() will run the pending softirqs. Thanks, tglx ^ permalink raw reply [flat|nested] 57+ messages in thread
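The dispatch rules Thomas spells out can be made concrete with a compilable model: a raise outside any interrupt context wakes ksoftirqd, while a raise under a bottom-half-disabled region is only recorded and then run by local_bh_enable(). Context tracking is faked with plain flags here; this is a sketch of the logic, not kernel code:

```c
/*
 * Model of the raise_softirq() dispatch described above.  in_interrupt_ctx
 * folds hardirq, softirq and bh-disabled context into one flag, and
 * "waking ksoftirqd" is just a boolean; the names are illustrative.
 */
#include <assert.h>
#include <stdbool.h>

static unsigned int pending;        /* local_softirq_pending() stand-in */
static bool in_interrupt_ctx;       /* in_interrupt(): hard/soft irq or bh off */
static bool softirqd_woken;

static void raise_softirq_model(unsigned int nr)
{
    pending |= 1u << nr;
    if (!in_interrupt_ctx)          /* plain process context: wake the thread */
        softirqd_woken = true;
}

/* end of a bh-disabled region: run whatever was raised inside it */
static unsigned int local_bh_enable_model(void)
{
    unsigned int ran = pending;     /* __do_softirq() would run these now */

    pending = 0;
    in_interrupt_ctx = false;
    return ran;
}
```

This also shows why the thread wakeup is rare in practice: the networking raises usually happen with bottom halves already disabled, so the pending work runs at local_bh_enable() instead of in ksoftirqd.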
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 21:54 ` Thomas Gleixner @ 2018-01-17 21:58 ` Linus Torvalds 2018-01-17 22:00 ` Thomas Gleixner 0 siblings, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2018-01-17 21:58 UTC (permalink / raw) To: Thomas Gleixner Cc: David Miller, Mike Galbraith, Peter Zijlstra, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, Linux Kernel Mailing List, Dmitry Safonov, Andrew Morton, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Sasha Levin, Paolo Abeni, Paul McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Wanpeng Li On Wed, Jan 17, 2018 at 1:54 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > raise_softirq() -> raise_softirq_irqoff() > > set_softirq_bit(); > > if (!in_interrupt()) > wake_softirqd(); > > So if the caller is not in hard or soft interrupt context, which includes > bottom half disabled regions softirqd is woken. That does seem unnecessarily expensive, and maybe we could just do it with thread flag (TIF_NOTIFY_RESUME or whatever). In fact, that was what I *thought* we did. Maybe I just remember some historical behavior. Since networking seems to largely prefer softirqd anyway, maybe that wake_softirqd() is the right thing to do anyway. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 21:58 ` Linus Torvalds @ 2018-01-17 22:00 ` Thomas Gleixner 2018-01-17 22:02 ` Eric Dumazet 0 siblings, 1 reply; 57+ messages in thread From: Thomas Gleixner @ 2018-01-17 22:00 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, Mike Galbraith, Peter Zijlstra, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, Linux Kernel Mailing List, Dmitry Safonov, Andrew Morton, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Sasha Levin, Paolo Abeni, Paul McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Wanpeng Li On Wed, 17 Jan 2018, Linus Torvalds wrote: > On Wed, Jan 17, 2018 at 1:54 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > raise_softirq() -> raise_softirq_irqoff() > > > > set_softirq_bit(); > > > > if (!in_interrupt()) > > wake_softirqd(); > > > > So if the caller is not in hard or soft interrupt context, which includes > > bottom half disabled regions softirqd is woken. > > That does seem unnecessarily expensive, and maybe we could just do it > with thread flag (TIF_NOTIFY_RESUME or whatever). > > In fact, that was what I *thought* we did. Maybe I just remember some > historical behavior. > > Since networking seems to largely prefer softirqd anyway, maybe that > wake_softirqd() is the right thing to do anyway. Well, but we only do it when we are not in a bh disabled region. The places where thread context raises the network softirqs is usually inside a bh disabled region, so the softirq is executed on local_bh_enable(). The thread is woken up rarely. Thanks, tglx ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 22:00 ` Thomas Gleixner @ 2018-01-17 22:02 ` Eric Dumazet 2018-01-17 22:24 ` David Miller 0 siblings, 1 reply; 57+ messages in thread From: Eric Dumazet @ 2018-01-17 22:02 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, David Miller, Mike Galbraith, Peter Zijlstra, Dmitry Safonov, Frederic Weisbecker, Linux Kernel Mailing List, Dmitry Safonov, Andrew Morton, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Sasha Levin, Paolo Abeni, Paul McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Wanpeng Li On Wed, Jan 17, 2018 at 2:00 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Wed, 17 Jan 2018, Linus Torvalds wrote: > >> On Wed, Jan 17, 2018 at 1:54 PM, Thomas Gleixner <tglx@linutronix.de> wrote: >> > raise_softirq() -> raise_softirq_irqoff() >> > >> > set_softirq_bit(); >> > >> > if (!in_interrupt()) >> > wake_softirqd(); >> > >> > So if the caller is not in hard or soft interrupt context, which includes >> > bottom half disabled regions softirqd is woken. >> >> That does seem unnecessarily expensive, and maybe we could just do it >> with thread flag (TIF_NOTIFY_RESUME or whatever). >> >> In fact, that was what I *thought* we did. Maybe I just remember some >> historical behavior. >> >> Since networking seems to largely prefer softirqd anyway, maybe that >> wake_softirqd() is the right thing to do anyway. > > Well, but we only do it when we are not in a bh disabled region. The places > where thread context raises the network softirqs is usually inside a bh > disabled region, so the softirq is executed on local_bh_enable(). The > thread is woken up rarely. There is also the netif_rx_ni() stuff. Can't remember right now why it is not using local_bh_{diable,enable}() pair instead of preempt_disable() ... if (local_softirq_pending()) do_softirq(); ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 22:02 ` Eric Dumazet @ 2018-01-17 22:24 ` David Miller 2018-01-17 22:54 ` Thomas Gleixner 2018-02-06 16:56 ` Sebastian Andrzej Siewior 0 siblings, 2 replies; 57+ messages in thread From: David Miller @ 2018-01-17 22:24 UTC (permalink / raw) To: edumazet Cc: tglx, torvalds, efault, peterz, dima, frederic, linux-kernel, 0x7f454c46, akpm, fweisbec, hannes, mingo, alexander.levin, pabeni, paulmck, rrendec, riel, sgruszka, wanpeng.li From: Eric Dumazet <edumazet@google.com> Date: Wed, 17 Jan 2018 14:02:43 -0800 > On Wed, Jan 17, 2018 at 2:00 PM, Thomas Gleixner <tglx@linutronix.de> wrote: >> On Wed, 17 Jan 2018, Linus Torvalds wrote: >> >>> On Wed, Jan 17, 2018 at 1:54 PM, Thomas Gleixner <tglx@linutronix.de> wrote: >>> > raise_softirq() -> raise_softirq_irqoff() >>> > >>> > set_softirq_bit(); >>> > >>> > if (!in_interrupt()) >>> > wake_softirqd(); >>> > >>> > So if the caller is not in hard or soft interrupt context, which includes >>> > bottom half disabled regions softirqd is woken. >>> >>> That does seem unnecessarily expensive, and maybe we could just do it >>> with thread flag (TIF_NOTIFY_RESUME or whatever). >>> >>> In fact, that was what I *thought* we did. Maybe I just remember some >>> historical behavior. >>> >>> Since networking seems to largely prefer softirqd anyway, maybe that >>> wake_softirqd() is the right thing to do anyway. >> >> Well, but we only do it when we are not in a bh disabled region. The places >> where thread context raises the network softirqs is usually inside a bh >> disabled region, so the softirq is executed on local_bh_enable(). The >> thread is woken up rarely. > > There is also the netif_rx_ni() stuff. > > Can't remember right now why it is not using > local_bh_{diable,enable}() pair instead > of preempt_disable() ... if (local_softirq_pending()) do_softirq(); Hmmm, that code predates the initial GIT repository build. 
I do remember we had some back and forth with that stuff. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 22:24 ` David Miller @ 2018-01-17 22:54 ` Thomas Gleixner 2018-02-06 16:56 ` Sebastian Andrzej Siewior 1 sibling, 0 replies; 57+ messages in thread From: Thomas Gleixner @ 2018-01-17 22:54 UTC (permalink / raw) To: David Miller Cc: edumazet, torvalds, efault, peterz, dima, frederic, linux-kernel, 0x7f454c46, akpm, fweisbec, hannes, mingo, alexander.levin, pabeni, paulmck, rrendec, riel, sgruszka, wanpeng.li On Wed, 17 Jan 2018, David Miller wrote: > From: Eric Dumazet <edumazet@google.com> > Date: Wed, 17 Jan 2018 14:02:43 -0800 > > > On Wed, Jan 17, 2018 at 2:00 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > >> On Wed, 17 Jan 2018, Linus Torvalds wrote: > >> > >>> On Wed, Jan 17, 2018 at 1:54 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > >>> > raise_softirq() -> raise_softirq_irqoff() > >>> > > >>> > set_softirq_bit(); > >>> > > >>> > if (!in_interrupt()) > >>> > wake_softirqd(); > >>> > > >>> > So if the caller is not in hard or soft interrupt context, which includes > >>> > bottom half disabled regions softirqd is woken. > >>> > >>> That does seem unnecessarily expensive, and maybe we could just do it > >>> with thread flag (TIF_NOTIFY_RESUME or whatever). > >>> > >>> In fact, that was what I *thought* we did. Maybe I just remember some > >>> historical behavior. > >>> > >>> Since networking seems to largely prefer softirqd anyway, maybe that > >>> wake_softirqd() is the right thing to do anyway. > >> > >> Well, but we only do it when we are not in a bh disabled region. The places > >> where thread context raises the network softirqs is usually inside a bh > >> disabled region, so the softirq is executed on local_bh_enable(). The > >> thread is woken up rarely. > > > > There is also the netif_rx_ni() stuff. > > > > Can't remember right now why it is not using > > local_bh_{diable,enable}() pair instead > > of preempt_disable() ... 
if (local_softirq_pending()) do_softirq(); > > Hmmm, that code predates the initial GIT repository build. > > I do remember we had some back and forth with that stuff. commit 3ace873f40916682583562c1e1f324a4a50a2811 in https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git deinlines that function and drops a comment while at it: -/* Post buffer to the network code from _non interrupt_ context. - * see net/core/dev.c for netif_rx description. - */ -static inline int netif_rx_ni(struct sk_buff *skb) It was added somewhere around 2.5.0 Thanks, tglx ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 22:24 ` David Miller 2018-01-17 22:54 ` Thomas Gleixner @ 2018-02-06 16:56 ` Sebastian Andrzej Siewior 2018-02-13 16:36 ` Thomas Gleixner 1 sibling, 1 reply; 57+ messages in thread From: Sebastian Andrzej Siewior @ 2018-02-06 16:56 UTC (permalink / raw) To: David Miller Cc: edumazet, tglx, torvalds, efault, peterz, dima, frederic, linux-kernel, 0x7f454c46, akpm, fweisbec, hannes, mingo, alexander.levin, pabeni, paulmck, rrendec, riel, sgruszka, wanpeng.li On 2018-01-17 17:24:47 [-0500], David Miller wrote: > From: Eric Dumazet <edumazet@google.com> > Date: Wed, 17 Jan 2018 14:02:43 -0800 > > There is also the netif_rx_ni() stuff. > > > > Can't remember right now why it is not using > > local_bh_{diable,enable}() pair instead > > of preempt_disable() ... if (local_softirq_pending()) do_softirq(); > > Hmmm, that code predates the initial GIT repository build. > > I do remember we had some back and forth with that stuff. So I did a little research and tried to replace preempt_disable() with local_bh_disable() [0] a while ago. [0] https://lkml.kernel.org/r/20170616172400.10809-1-bigeasy@linutronix.de Sebastian ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-02-06 16:56 ` Sebastian Andrzej Siewior @ 2018-02-13 16:36 ` Thomas Gleixner 0 siblings, 0 replies; 57+ messages in thread From: Thomas Gleixner @ 2018-02-13 16:36 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: David Miller, edumazet, torvalds, efault, peterz, dima, frederic, linux-kernel, 0x7f454c46, akpm, fweisbec, hannes, mingo, alexander.levin, pabeni, paulmck, rrendec, riel, sgruszka, wanpeng.li On Tue, 6 Feb 2018, Sebastian Andrzej Siewior wrote: > On 2018-01-17 17:24:47 [-0500], David Miller wrote: > > From: Eric Dumazet <edumazet@google.com> > > Date: Wed, 17 Jan 2018 14:02:43 -0800 > > > There is also the netif_rx_ni() stuff. > > > > > > Can't remember right now why it is not using > > > local_bh_{diable,enable}() pair instead > > > of preempt_disable() ... if (local_softirq_pending()) do_softirq(); > > > > Hmmm, that code predates the initial GIT repository build. > > > > I do remember we had some back and forth with that stuff. > > So I did a little research and tried to replace preempt_disable() with > local_bh_disable() [0] a while ago. > > [0] https://lkml.kernel.org/r/20170616172400.10809-1-bigeasy@linutronix.de Maybe you should just try again with a slightly better changelog :) ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 21:49 ` David Miller 2018-01-17 21:54 ` Thomas Gleixner @ 2018-01-17 21:54 ` Linus Torvalds 2018-01-17 21:58 ` Thomas Gleixner 1 sibling, 1 reply; 57+ messages in thread From: Linus Torvalds @ 2018-01-17 21:54 UTC (permalink / raw) To: David Miller Cc: Mike Galbraith, Peter Zijlstra, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, Linux Kernel Mailing List, Dmitry Safonov, Andrew Morton, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Sasha Levin, Paolo Abeni, Paul McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Wed, Jan 17, 2018 at 1:49 PM, David Miller <davem@davemloft.net> wrote: > >> That said, this made me wonder a bit. I wonder how bounded the latency >> is for raising a softirq from process context. We only _check_ the >> softirq on the last hardirq exit, I think. > > System call return checks it, otherwise this situation would be > completely bolixed. That's what I thought too. But then I went and looked, and I can't find it. But you're probably right, and I just missed it. Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-17 21:54 ` Linus Torvalds @ 2018-01-17 21:58 ` Thomas Gleixner 0 siblings, 0 replies; 57+ messages in thread From: Thomas Gleixner @ 2018-01-17 21:58 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, Mike Galbraith, Peter Zijlstra, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, Linux Kernel Mailing List, Dmitry Safonov, Andrew Morton, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Sasha Levin, Paolo Abeni, Paul McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Wanpeng Li On Wed, 17 Jan 2018, Linus Torvalds wrote: > On Wed, Jan 17, 2018 at 1:49 PM, David Miller <davem@davemloft.net> wrote: > > > >> That said, this made me wonder a bit. I wonder how bounded the latency > >> is for raising a softirq from process context. We only _check_ the > >> softirq on the last hardirq exit, I think. > > > > System call return checks it, otherwise this situation would be > > completely bolixed. > > That's what I thought too. But then I went and looked, and I can't find it. > > But you're probably right, and I just missed it. Not really. There is nothing to see there. Thanks, tglx ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 17:51 ` Linus Torvalds 2018-01-12 18:15 ` Mike Galbraith @ 2018-01-12 19:28 ` Thomas Gleixner 2018-01-12 19:53 ` Frederic Weisbecker 1 sibling, 1 reply; 57+ messages in thread From: Thomas Gleixner @ 2018-01-12 19:28 UTC (permalink / raw) To: Linus Torvalds Cc: Mike Galbraith, Peter Zijlstra, Eric Dumazet, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Wanpeng Li On Fri, 12 Jan 2018, Linus Torvalds wrote: > On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <efault@gmx.de> wrote: > > > > Nah, a misunderstanding happened. RT that still offers full threading > > creates per-softirq threads per cpu. The regular trees split ksoftirqd > > into only two threads per cpu, one processes timer/hrtimer softirqs, > > the other processes the rest. > > Ok, that sounds like it should work, but it also sounds like it's very > specific to RT itself. > > For example, the dvb issue was not about the timer softirqs, but about > the tasklet ones. > > So maybe we wouldn't need to split it for _every_ softirq, but we'd > need to split it more than just along the timer case. > > And it does sound a bit excessive to have ten fixed threads for every > CPU. The days when tens of CPU's meant "huge system" are gone. These > days it can be a phone. That's true. One thing which might be worth a try is utilizing the threaded irq infrastructure, and that might also pave the way to address Peter's request for per-device splitting. I've experimented with that in the past but never got around to finishing it completely. I should have half-baked patches somewhere in the poison cabinet. Thanks, tglx ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 19:28 ` Thomas Gleixner @ 2018-01-12 19:53 ` Frederic Weisbecker 0 siblings, 0 replies; 57+ messages in thread From: Frederic Weisbecker @ 2018-01-12 19:53 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Mike Galbraith, Peter Zijlstra, Eric Dumazet, Dmitry Safonov, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Wanpeng Li On Fri, Jan 12, 2018 at 08:28:06PM +0100, Thomas Gleixner wrote: > On Fri, 12 Jan 2018, Linus Torvalds wrote: > > On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <efault@gmx.de> wrote: > > > > > > Nah, a misunderstanding happened. RT that still offers full threading > > > creates per-softirq threads per cpu. The regular trees split ksoftirqd > > > into only two threads per cpu, one processes timer/hrtimer softriqs, > > > the other processes the rest. > > > > Ok, that sounds like it should work, but it also sounds like it's very > > specific to RT itself. > > > > For example, the dvb issue was not about the timer softirqs, but about > > the tasklet ones. > > > > So maybe we wouldn't need to split it for _every_ softirq, but we'd > > need to split it more than just along the timer case. > > > > And it does sound a bit excessive to have ten fixed threads for every > > CPU. The days when tens of CPU's meant "huge system" are gone. These > > days it can be a phone. > > That's true. > > One thing which might worth a try is utilizing the threaded irq > infrastructure and that might also pave the way to address Peters request > for per device splitting. I've experimented with that in the past but never > got around to finish it completely. I should have half baken patches > somewhere in the poison cabinet. I'll gladly have a look at them to see what I can do. Thanks. 
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 5:23 ` Mike Galbraith 2018-01-12 10:13 ` Peter Zijlstra @ 2018-01-12 14:58 ` Frederic Weisbecker 2018-01-12 15:15 ` Mike Galbraith 2018-01-12 17:24 ` Linus Torvalds 2 siblings, 1 reply; 57+ messages in thread From: Frederic Weisbecker @ 2018-01-12 14:58 UTC (permalink / raw) To: Mike Galbraith Cc: Linus Torvalds, Eric Dumazet, Peter Zijlstra, Dmitry Safonov, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote: > On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote: > > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <edumazet@google.com> wrote: > > > > > > Note that when I implemented TCP Small queues, I did experiments between > > > using a work queue or a tasklet, and workqueues added unacceptable P99 > > > latencies, when many user threads are competing with kernel threads. > > > > Yes. > > > > So I think one solution might be to have a hybrid system, where we do > > the softirq's synchronously normally (which is what you really want > > for good latency). > > > > But then fall down on a threaded model - but that fallback case should > > be per-softirq, not global. So if one softirq uses a lot of CPU time, > > that shouldn't affect the latency of other softirqs. > > > > So maybe we could get rid of the per-cpu ksoftirqd entirely, and > > replace it with per-cpu and per-softirq workqueues? > > How would that be better than what RT used to do, and I still do for my > RT kernels via boot option, namely split ksoftirqd into per-softirq > threads. Workqueues are probably simpler. Unless you need to set specific prios for your ksoftirqds? Not sure if that's tunable on workqueues.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 14:58 ` Frederic Weisbecker @ 2018-01-12 15:15 ` Mike Galbraith 2018-01-12 17:00 ` Frederic Weisbecker 0 siblings, 1 reply; 57+ messages in thread From: Mike Galbraith @ 2018-01-12 15:15 UTC (permalink / raw) To: Frederic Weisbecker Cc: Linus Torvalds, Eric Dumazet, Peter Zijlstra, Dmitry Safonov, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Fri, 2018-01-12 at 15:58 +0100, Frederic Weisbecker wrote: > On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote: > > On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote: > > > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > Note that when I implemented TCP Small queues, I did experiments between > > > > using a work queue or a tasklet, and workqueues added unacceptable P99 > > > > latencies, when many user threads are competing with kernel threads. > > > > > > Yes. > > > > > > So I think one solution might be to have a hybrid system, where we do > > > the softirq's synchronously normally (which is what you really want > > > for good latency). > > > > > > But then fall down on a threaded model - but that fallback case should > > > be per-softirq, not global. So if one softirq uses a lot of CPU time, > > > that shouldn't affect the latency of other softirqs. > > > > > > So maybe we could get rid of the per-cpu ksoftirqd entirely, and > > > replace it with with per-cpu and per-softirq workqueues? > > > > How would that be better than what RT used to do, and I still do for my > > RT kernels via boot option, namely split ksoftirqd into per-softirq > > threads. > > Workqueue are probably more simple. Unless you need to set specific prios > to your ksoftirqds? 
Not sure if that's tunable on workqueues. No, you can't prioritize workqueues, and they spawn threads whenever they bloody well feel like. I carry a hack to give users minimal control over kthread/workqueue priority. Very handy thing to have, especially if you're doing high utilization stuff, and would prefer your box actually survive it. -Mike ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 15:15 ` Mike Galbraith @ 2018-01-12 17:00 ` Frederic Weisbecker 2018-01-12 18:08 ` Mike Galbraith 0 siblings, 1 reply; 57+ messages in thread From: Frederic Weisbecker @ 2018-01-12 17:00 UTC (permalink / raw) To: Mike Galbraith Cc: Linus Torvalds, Eric Dumazet, Peter Zijlstra, Dmitry Safonov, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Fri, Jan 12, 2018 at 04:15:04PM +0100, Mike Galbraith wrote: > On Fri, 2018-01-12 at 15:58 +0100, Frederic Weisbecker wrote: > > On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote: > > > On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote: > > > > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > Note that when I implemented TCP Small queues, I did experiments between > > > > > using a work queue or a tasklet, and workqueues added unacceptable P99 > > > > > latencies, when many user threads are competing with kernel threads. > > > > > > > > Yes. > > > > > > > > So I think one solution might be to have a hybrid system, where we do > > > > the softirq's synchronously normally (which is what you really want > > > > for good latency). > > > > > > > > But then fall down on a threaded model - but that fallback case should > > > > be per-softirq, not global. So if one softirq uses a lot of CPU time, > > > > that shouldn't affect the latency of other softirqs. > > > > > > > > So maybe we could get rid of the per-cpu ksoftirqd entirely, and > > > > replace it with with per-cpu and per-softirq workqueues? > > > > > > How would that be better than what RT used to do, and I still do for my > > > RT kernels via boot option, namely split ksoftirqd into per-softirq > > > threads. 
> > > > Workqueue are probably more simple. Unless you need to set specific prios > > to your ksoftirqds? Not sure if that's tunable on workqueues. > > No, you can't prioritize workqueues, and they spawn threads whenever > they bloody well feel like. > > I carry a hack to give users minimal control over kthread/workqueue > priority. Very handy thing to have, especially if you're doing high > utilization stuff, and would prefer your box actually survive it. How useful system_highpri_wq can be in this regard? ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 17:00 ` Frederic Weisbecker @ 2018-01-12 18:08 ` Mike Galbraith 0 siblings, 0 replies; 57+ messages in thread From: Mike Galbraith @ 2018-01-12 18:08 UTC (permalink / raw) To: Frederic Weisbecker Cc: Linus Torvalds, Eric Dumazet, Peter Zijlstra, Dmitry Safonov, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Fri, 2018-01-12 at 18:00 +0100, Frederic Weisbecker wrote: > On Fri, Jan 12, 2018 at 04:15:04PM +0100, Mike Galbraith wrote: > > On Fri, 2018-01-12 at 15:58 +0100, Frederic Weisbecker wrote: > > > On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote: > > > > On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote: > > > > > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > Note that when I implemented TCP Small queues, I did experiments between > > > > > > using a work queue or a tasklet, and workqueues added unacceptable P99 > > > > > > latencies, when many user threads are competing with kernel threads. > > > > > > > > > > Yes. > > > > > > > > > > So I think one solution might be to have a hybrid system, where we do > > > > > the softirq's synchronously normally (which is what you really want > > > > > for good latency). > > > > > > > > > > But then fall down on a threaded model - but that fallback case should > > > > > be per-softirq, not global. So if one softirq uses a lot of CPU time, > > > > > that shouldn't affect the latency of other softirqs. > > > > > > > > > > So maybe we could get rid of the per-cpu ksoftirqd entirely, and > > > > > replace it with with per-cpu and per-softirq workqueues? 
> > > > > > > > How would that be better than what RT used to do, and I still do for my > > > > RT kernels via boot option, namely split ksoftirqd into per-softirq > > > > threads. > > > > > > Workqueue are probably more simple. Unless you need to set specific prios > > > to your ksoftirqds? Not sure if that's tunable on workqueues. > > > > No, you can't prioritize workqueues, and they spawn threads whenever > > they bloody well feel like. > > > > I carry a hack to give users minimal control over kthread/workqueue > > priority. Very handy thing to have, especially if you're doing high > > utilization stuff, and would prefer your box actually survive it. > > How useful system_highpri_wq can be in this regard? Not at all. You could make the system protect itself via boosting, but that just makes a noisy mess. -Mike ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context 2018-01-12 5:23 ` Mike Galbraith 2018-01-12 10:13 ` Peter Zijlstra 2018-01-12 14:58 ` Frederic Weisbecker @ 2018-01-12 17:24 ` Linus Torvalds 2 siblings, 0 replies; 57+ messages in thread From: Linus Torvalds @ 2018-01-12 17:24 UTC (permalink / raw) To: Mike Galbraith Cc: Eric Dumazet, Peter Zijlstra, Dmitry Safonov, Frederic Weisbecker, LKML, Dmitry Safonov, Andrew Morton, David Miller, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Paolo Abeni, Paul E. McKenney, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li On Thu, Jan 11, 2018 at 9:23 PM, Mike Galbraith <efault@gmx.de> wrote: > On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote: >> >> So maybe we could get rid of the per-cpu ksoftirqd entirely, and >> replace it with with per-cpu and per-softirq workqueues? > > How would that be better than what RT used to do, and I still do for my > RT kernels via boot option, namely split ksoftirqd into per-softirq > threads. So I think testing and playing around with things is probably the way to go, but the reason I would suggest at least looking at workqueues is that I feel that we already have a *ton* of percpu threads, and multiplying the ksoftirqd threads by a factor of ten (or however many softirqs we have) sounds insane. And yes, they do need to be percpu. Networking wants to do all that packet handling concurrently across CPUs, so we can't just replace the per-cpu ksoftirqd threads with per-softirq threads: we really do want both per-cpu _and_ per-softirq. And honestly, that sounds insane to do with dedicated threads, most of which do basically nothing at all.. And it's what the workqueue threads are designed for - expanding as needed, and not tying up a dedicated thread for every little work. But code and testing talks, Linus ^ permalink raw reply [flat|nested] 57+ messages in thread
* [RFC 2/2] softirq: Introduce mask for __do_softirq() 2018-01-09 13:36 [RFC 0/2] Net softirq deferring to ksoftirqd Dmitry Safonov 2018-01-09 13:36 ` [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context Dmitry Safonov @ 2018-01-09 13:36 ` Dmitry Safonov 1 sibling, 0 replies; 57+ messages in thread From: Dmitry Safonov @ 2018-01-09 13:36 UTC (permalink / raw) To: linux-kernel Cc: 0x7f454c46, Dmitry Safonov, Andrew Morton, David Miller, Eric Dumazet, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar, Levin, Alexander (Sasha Levin), Linus Torvalds, Paolo Abeni, Paul E. McKenney, Peter Zijlstra, Radu Rendec, Rik van Riel, Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li

Warning: Not merge-ready, tested only on x86_64 & arm32.

For the reason of deferring net-softirqs till ksoftirqd run, __do_softirq() needs to process softirqs with mask, depending if it's called from ksoftirqd thread or on the context of some task.

Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 include/linux/interrupt.h |  8 ++++----
 kernel/softirq.c          | 41 ++++++++++++++++++++++-------------------
 2 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 69c238210325..1e943959b31a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -489,14 +489,14 @@ struct softirq_action
 };

 asmlinkage void do_softirq(void);
-asmlinkage void __do_softirq(void);
+asmlinkage void __do_softirq(__u32 mask);

 #ifdef __ARCH_HAS_DO_SOFTIRQ
-void do_softirq_own_stack(void);
+void do_softirq_own_stack(__u32 mask);
 #else
-static inline void do_softirq_own_stack(void)
+static inline void do_softirq_own_stack(__u32 mask)
 {
-	__do_softirq();
+	__do_softirq(mask);
 }
 #endif

diff --git a/kernel/softirq.c b/kernel/softirq.c
index ee48f194dcec..5459b079bf73 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -88,26 +88,26 @@ static bool ksoftirqd_running(void)
 	return tsk && (tsk->state == TASK_RUNNING);
 }

-static bool defer_softirq(void)
+#define DEFER_SOFTIRQS (NET_RX_SOFTIRQ | NET_TX_SOFTIRQ)
+
+/* Mask of softirqs that are pending to be processed on the current context. */
+static __u32 current_softirq_pending(void)
 {
 	__u32 pending = local_softirq_pending();

-	if (!pending)
-		return true;
-
 	if (ksoftirqd_running())
-		return true;
+		return 0;

 	/*
 	 * Defer net-rx softirqs to ksoftirqd processing as they may
 	 * make userspace starving cpu time.
 	 */
-	if (pending & (NET_RX_SOFTIRQ | NET_TX_SOFTIRQ)) {
+	if (pending & DEFER_SOFTIRQS)
 		wakeup_softirqd();
-		return true;
-	}

-	return false;
+	pending &= ~DEFER_SOFTIRQS;
+
+	return pending;
 }

 /*
@@ -261,7 +261,7 @@ static inline bool lockdep_softirq_start(void) { return false; }
 static inline void lockdep_softirq_end(bool in_hardirq) { }
 #endif

-asmlinkage __visible void __softirq_entry __do_softirq(void)
+asmlinkage __visible void __softirq_entry __do_softirq(__u32 mask)
 {
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
 	unsigned long old_flags = current->flags;
@@ -286,7 +286,8 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

 restart:
 	/* Reset the pending bitmask before enabling irqs */
-	set_softirq_pending(0);
+	set_softirq_pending(pending & ~mask);
+	pending &= mask;

 	local_irq_enable();
@@ -320,7 +321,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	local_irq_disable();

 	pending = local_softirq_pending();
-	if (pending) {
+	if (pending & mask) {
 		if (time_before(jiffies, end) && !need_resched() &&
 		    --max_restart)
 			goto restart;
@@ -338,14 +339,14 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 asmlinkage __visible void do_softirq(void)
 {
 	unsigned long flags;
+	__u32 pending = current_softirq_pending();

-	if (in_interrupt())
+	if (in_interrupt() || !pending)
 		return;

 	local_irq_save(flags);

-	if (!defer_softirq())
-		do_softirq_own_stack();
+	do_softirq_own_stack(pending);

 	local_irq_restore(flags);
 }
@@ -371,7 +372,9 @@ void irq_enter(void)

 static inline void invoke_softirq(void)
 {
-	if (defer_softirq())
+	__u32 pending = current_softirq_pending();
+
+	if (!pending)
 		return;

 	if (!force_irqthreads) {
@@ -381,14 +384,14 @@ static inline void invoke_softirq(void)
 		 * it is the irq stack, because it should be near empty
 		 * at this stage.
 		 */
-		__do_softirq();
+		__do_softirq(pending);
 #else
 		/*
 		 * Otherwise, irq_exit() is called on the task stack that can
 		 * be potentially deep already. So call softirq in its own stack
 		 * to prevent from any overrun.
 		 */
-		do_softirq_own_stack();
+		do_softirq_own_stack(pending);
 #endif
 	} else {
 		wakeup_softirqd();
@@ -682,7 +685,7 @@ static void run_ksoftirqd(unsigned int cpu)
 	 * We can safely run softirq on inline stack, as we are not deep
 	 * in the task stack here.
 	 */
-	__do_softirq();
+	__do_softirq(~0);
 	local_irq_enable();
 	cond_resched_rcu_qs();
 	return;
--
2.13.6

^ permalink raw reply related [flat|nested] 57+ messages in thread
end of thread, other threads:[~2018-02-13 16:36 UTC | newest] Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-01-09 13:36 [RFC 0/2] Net softirq deferring to ksoftirqd Dmitry Safonov 2018-01-09 13:36 ` [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context Dmitry Safonov 2018-01-09 16:03 ` Frederic Weisbecker 2018-01-10 21:20 ` Dmitry Safonov 2018-01-09 18:02 ` Eric Dumazet 2018-01-10 21:48 ` Dmitry Safonov 2018-01-11 2:13 ` Linus Torvalds 2018-01-11 3:22 ` Frederic Weisbecker 2018-01-11 4:19 ` Linus Torvalds 2018-01-11 4:44 ` Frederic Weisbecker 2018-01-11 14:31 ` Dmitry Safonov 2018-01-11 16:20 ` Eric Dumazet 2018-01-11 16:32 ` Peter Zijlstra 2018-01-11 16:38 ` David Miller 2018-01-11 16:43 ` Peter Zijlstra 2018-01-11 18:48 ` Linus Torvalds 2018-01-11 19:15 ` Eric Dumazet 2018-01-11 19:43 ` Linus Torvalds 2018-01-11 19:48 ` Eric Dumazet 2018-01-11 20:03 ` Linus Torvalds 2018-01-11 20:16 ` Eric Dumazet 2018-01-11 20:22 ` Linus Torvalds 2018-01-11 20:34 ` Dmitry Safonov 2018-01-11 20:37 ` Eric Dumazet 2018-01-11 20:40 ` Linus Torvalds 2018-01-11 20:46 ` Dmitry Safonov 2018-01-11 20:53 ` Eric Dumazet 2018-01-11 21:13 ` Dmitry Safonov 2018-01-12 5:41 ` Frederic Weisbecker 2018-01-12 5:23 ` Mike Galbraith 2018-01-12 10:13 ` Peter Zijlstra 2018-01-12 17:26 ` Linus Torvalds 2018-01-12 17:44 ` Mike Galbraith 2018-01-12 17:51 ` Linus Torvalds 2018-01-12 18:15 ` Mike Galbraith 2018-01-12 18:44 ` Mike Galbraith 2018-01-17 20:30 ` David Miller 2018-01-17 21:06 ` Linus Torvalds 2018-01-17 21:49 ` David Miller 2018-01-17 21:54 ` Thomas Gleixner 2018-01-17 21:58 ` Linus Torvalds 2018-01-17 22:00 ` Thomas Gleixner 2018-01-17 22:02 ` Eric Dumazet 2018-01-17 22:24 ` David Miller 2018-01-17 22:54 ` Thomas Gleixner 2018-02-06 16:56 ` Sebastian Andrzej Siewior 2018-02-13 16:36 ` Thomas Gleixner 2018-01-17 21:54 ` Linus Torvalds 2018-01-17 21:58 ` Thomas Gleixner 2018-01-12 19:28 ` Thomas Gleixner 
2018-01-12 19:53 ` Frederic Weisbecker 2018-01-12 14:58 ` Frederic Weisbecker 2018-01-12 15:15 ` Mike Galbraith 2018-01-12 17:00 ` Frederic Weisbecker 2018-01-12 18:08 ` Mike Galbraith 2018-01-12 17:24 ` Linus Torvalds 2018-01-09 13:36 ` [RFC 2/2] softirq: Introduce mask for __do_softirq() Dmitry Safonov