* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
       [not found] <877dly6ooz.fsf@toke.dk>
@ 2021-03-23 16:43 ` Paul E. McKenney
  2021-03-23 17:29   ` Toke Høiland-Jørgensen
From: Paul E. McKenney @ 2021-03-23 16:43 UTC
  To: Toke Høiland-Jørgensen; +Cc: bpf, Magnus Karlsson

On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> Hi Paul
> 
> Magnus and I have been debugging an issue where close() on a bpf_link
> file descriptor would hang indefinitely when the system was under load
> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> 
> The issue is triggered reliably by loading up a system with network
> traffic (causing 100% softirq CPU load on one or more cores), and then
> attaching an freplace bpf_link and closing it again. The close() will
> hang until the network traffic load is lowered.
> 
> Digging further, it appears that the hang happens in
> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> 
> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> Attaching 2 probes...
> enter
> exit after 54 ms
> enter
> exit after 3249 ms
> 
> (the two enter/exit pairs are, respectively, from an unloaded system,
> and from a loaded system where I stopped the network traffic after a
> couple of seconds).
> 
> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> 
> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> 
> And because it does this while holding trampoline_mutex, even deferring
> the put to a worker (as a previously applied-then-reverted patch did[0])
> doesn't help: that'll fix the initial hang on close(), but any
> subsequent use of BPF trampolines will then be blocked because of the
> mutex.
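
Roughly, the shape of the problem is the following (a simplified sketch
paraphrasing the function behind the link above, not the actual kernel
source; the struct layout and refcount handling are abbreviated):

#include <linux/filter.h>
#include <linux/mutex.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct bpf_trampoline {
	refcount_t refcnt;
	void *image;		/* JITed trampoline text */
	/* ... more fields elided ... */
};

static DEFINE_MUTEX(trampoline_mutex);

void bpf_trampoline_put(struct bpf_trampoline *tr)
{
	if (!tr)
		return;
	mutex_lock(&trampoline_mutex);
	if (!refcount_dec_and_test(&tr->refcnt))
		goto out;
	/*
	 * Wait for every task to be done executing the trampoline text.
	 * On a PREEMPT kernel with the CPUs pegged in softirq this can
	 * take a very long time -- and trampoline_mutex is held for all
	 * of it, so every other trampoline operation is stuck behind it.
	 */
	synchronize_rcu_tasks();
	bpf_jit_free_exec(tr->image);
	kfree(tr);
out:
	mutex_unlock(&trampoline_mutex);
}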
> 
> Also, if I just keep the network traffic running I will eventually get a
> kernel panic with:
> 
> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> 
> I've created a reproducer for the issue here:
> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> 
> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> 
> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> $ cd bpf-examples/bpf-link-hang
> $ make
> $ sudo ./bpf-link-hang
> 
> you'll need to load up the system to trigger the hang; I'm using pktgen
> from a separate machine to do this.
> 
> My question is, of course, as ever, What Is To Be Done? Is it expected
> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> or can this be fixed? And if it is expected, how can the BPF code be
> fixed so it doesn't deadlock because of this?
> 
> Hoping you can help us with this - many thanks in advance! :)

Let me start with the usual question...  Is the network traffic intense
enough that one of the CPUs might remain in a loop handling softirqs
indefinitely?

If so, does the (untested, probably does not build) patch below help?

Please note that this is only a diagnostic patch.  It has the serious
side effect of making __do_softirq() and anything that calls it implicitly
noinstr.  But it might at least be a decent starting point for a real fix.
Or might be part of the real fix, who knows?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b06be5..e21e7b0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -242,6 +242,7 @@ void rcu_softirq_qs(void)
 {
 	rcu_qs();
 	rcu_preempt_deferred_qs(current);
+	rcu_tasks_qs(current, true);
 }
 
 /*


* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-23 16:43 ` BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels Paul E. McKenney
@ 2021-03-23 17:29   ` Toke Høiland-Jørgensen
  2021-03-23 17:57     ` Paul E. McKenney
From: Toke Høiland-Jørgensen @ 2021-03-23 17:29 UTC
  To: paulmck; +Cc: bpf, Magnus Karlsson

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> Hi Paul
>> 
>> Magnus and I have been debugging an issue where close() on a bpf_link
>> file descriptor would hang indefinitely when the system was under load
>> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> 
>> The issue is triggered reliably by loading up a system with network
>> traffic (causing 100% softirq CPU load on one or more cores), and then
>> attaching an freplace bpf_link and closing it again. The close() will
>> hang until the network traffic load is lowered.
>> 
>> Digging further, it appears that the hang happens in
>> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> 
>> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> Attaching 2 probes...
>> enter
>> exit after 54 ms
>> enter
>> exit after 3249 ms
>> 
>> (the two enter/exit pairs are, respectively, from an unloaded system,
>> and from a loaded system where I stopped the network traffic after a
>> couple of seconds).
>> 
>> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> 
>> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> 
>> And because it does this while holding trampoline_mutex, even deferring
>> the put to a worker (as a previously applied-then-reverted patch did[0])
>> doesn't help: that'll fix the initial hang on close(), but any
>> subsequent use of BPF trampolines will then be blocked because of the
>> mutex.
>> 
>> Also, if I just keep the network traffic running I will eventually get a
>> kernel panic with:
>> 
>> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> 
>> I've created a reproducer for the issue here:
>> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> 
>> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> 
>> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> $ cd bpf-examples/bpf-link-hang
>> $ make
>> $ sudo ./bpf-link-hang
>> 
>> you'll need to load up the system to trigger the hang; I'm using pktgen
>> from a separate machine to do this.
>> 
>> My question is, of course, as ever, What Is To Be Done? Is it expected
>> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> or can this be fixed? And if it is expected, how can the BPF code be
>> fixed so it doesn't deadlock because of this?
>> 
>> Hoping you can help us with this - many thanks in advance! :)
>
> Let me start with the usual question...  Is the network traffic intense
> enough that one of the CPUs might remain in a loop handling softirqs
> indefinitely?

Yup, I'm pegging all CPUs in softirq:

$ mpstat -P ALL 1
[...]
18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00

> If so, does the (untested, probably does not build) patch below help?

Doesn't appear to, no. It builds fine, but I still get:

Attaching 2 probes...
enter
exit after 8480 ms

(that was me interrupting the network traffic again)

-Toke



* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-23 17:29   ` Toke Høiland-Jørgensen
@ 2021-03-23 17:57     ` Paul E. McKenney
  2021-03-23 19:50       ` Toke Høiland-Jørgensen
From: Paul E. McKenney @ 2021-03-23 17:57 UTC
  To: Toke Høiland-Jørgensen; +Cc: bpf, Magnus Karlsson

On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> >> Hi Paul
> >> 
> >> Magnus and I have been debugging an issue where close() on a bpf_link
> >> file descriptor would hang indefinitely when the system was under load
> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> >> 
> >> The issue is triggered reliably by loading up a system with network
> >> traffic (causing 100% softirq CPU load on one or more cores), and then
> >> attaching an freplace bpf_link and closing it again. The close() will
> >> hang until the network traffic load is lowered.
> >> 
> >> Digging further, it appears that the hang happens in
> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> >> 
> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> >> Attaching 2 probes...
> >> enter
> >> exit after 54 ms
> >> enter
> >> exit after 3249 ms
> >> 
> >> (the two enter/exit pairs are, respectively, from an unloaded system,
> >> and from a loaded system where I stopped the network traffic after a
> >> couple of seconds).
> >> 
> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> >> 
> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> >> 
> >> And because it does this while holding trampoline_mutex, even deferring
> >> the put to a worker (as a previously applied-then-reverted patch did[0])
> >> doesn't help: that'll fix the initial hang on close(), but any
> >> subsequent use of BPF trampolines will then be blocked because of the
> >> mutex.
> >> 
> >> Also, if I just keep the network traffic running I will eventually get a
> >> kernel panic with:
> >> 
> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> >> 
> >> I've created a reproducer for the issue here:
> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> >> 
> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> >> 
> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> >> $ cd bpf-examples/bpf-link-hang
> >> $ make
> >> $ sudo ./bpf-link-hang
> >> 
> >> you'll need to load up the system to trigger the hang; I'm using pktgen
> >> from a separate machine to do this.
> >> 
> >> My question is, of course, as ever, What Is To Be Done? Is it expected
> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> >> or can this be fixed? And if it is expected, how can the BPF code be
> >> fixed so it doesn't deadlock because of this?
> >> 
> >> Hoping you can help us with this - many thanks in advance! :)
> >
> > Let me start with the usual question...  Is the network traffic intense
> > enough that one of the CPUs might remain in a loop handling softirqs
> > indefinitely?
> 
> Yup, I'm pegging all CPUs in softirq:
> 
> $ mpstat -P ALL 1
> [...]
> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> 
> > If so, does the (untested, probably does not build) patch below help?
> 
> Doesn't appear to, no. It builds fine, but I still get:
> 
> Attaching 2 probes...
> enter
> exit after 8480 ms
> 
> (that was me interrupting the network traffic again)

Is your kernel properly shifting from back-of-interrupt softirq processing
to ksoftirqd under heavy load?  If not, my patch will not have any effect.

							Thanx, Paul


* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-23 17:57     ` Paul E. McKenney
@ 2021-03-23 19:50       ` Toke Høiland-Jørgensen
  2021-03-23 19:59         ` Andrii Nakryiko
From: Toke Høiland-Jørgensen @ 2021-03-23 19:50 UTC
  To: paulmck; +Cc: bpf, Magnus Karlsson

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> >> Hi Paul
>> >> 
>> >> Magnus and I have been debugging an issue where close() on a bpf_link
>> >> file descriptor would hang indefinitely when the system was under load
>> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> >> 
>> >> The issue is triggered reliably by loading up a system with network
>> >> traffic (causing 100% softirq CPU load on one or more cores), and then
>> >> attaching an freplace bpf_link and closing it again. The close() will
>> >> hang until the network traffic load is lowered.
>> >> 
>> >> Digging further, it appears that the hang happens in
>> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> >> 
>> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> >> Attaching 2 probes...
>> >> enter
>> >> exit after 54 ms
>> >> enter
>> >> exit after 3249 ms
>> >> 
>> >> (the two enter/exit pairs are, respectively, from an unloaded system,
>> >> and from a loaded system where I stopped the network traffic after a
>> >> couple of seconds).
>> >> 
>> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> >> 
>> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> >> 
>> >> And because it does this while holding trampoline_mutex, even deferring
>> >> the put to a worker (as a previously applied-then-reverted patch did[0])
>> >> doesn't help: that'll fix the initial hang on close(), but any
>> >> subsequent use of BPF trampolines will then be blocked because of the
>> >> mutex.
>> >> 
>> >> Also, if I just keep the network traffic running I will eventually get a
>> >> kernel panic with:
>> >> 
>> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> >> 
>> >> I've created a reproducer for the issue here:
>> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> >> 
>> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> >> 
>> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> >> $ cd bpf-examples/bpf-link-hang
>> >> $ make
>> >> $ sudo ./bpf-link-hang
>> >> 
>> >> you'll need to load up the system to trigger the hang; I'm using pktgen
>> >> from a separate machine to do this.
>> >> 
>> >> My question is, of course, as ever, What Is To Be Done? Is it expected
>> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> >> or can this be fixed? And if it is expected, how can the BPF code be
>> >> fixed so it doesn't deadlock because of this?
>> >> 
>> >> Hoping you can help us with this - many thanks in advance! :)
>> >
>> > Let me start with the usual question...  Is the network traffic intense
>> > enough that one of the CPUs might remain in a loop handling softirqs
>> > indefinitely?
>> 
>> Yup, I'm pegging all CPUs in softirq:
>> 
>> $ mpstat -P ALL 1
>> [...]
>> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> 
>> > If so, does the (untested, probably does not build) patch below help?
>> 
>> Doesn't appear to, no. It builds fine, but I still get:
>> 
>> Attaching 2 probes...
>> enter
>> exit after 8480 ms
>> 
>> (that was me interrupting the network traffic again)
>
> Is your kernel properly shifting from back-of-interrupt softirq processing
> to ksoftirqd under heavy load?  If not, my patch will not have any
> effect.

Seems to be - this is from top:

     12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0                                                           
     24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2                                                           
     34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4                                                           
     39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5                                                           
     19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1                                                           
     29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3     

Any other ideas? :)

(And thanks for taking a look, BTW!)

-Toke



* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-23 19:50       ` Toke Høiland-Jørgensen
@ 2021-03-23 19:59         ` Andrii Nakryiko
  2021-03-23 21:04           ` Toke Høiland-Jørgensen
From: Andrii Nakryiko @ 2021-03-23 19:59 UTC
  To: Toke Høiland-Jørgensen; +Cc: Paul E . McKenney, bpf, Magnus Karlsson

On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> "Paul E. McKenney" <paulmck@kernel.org> writes:
>
> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >>
> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> Hi Paul
> >> >>
> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
> >> >> file descriptor would hang indefinitely when the system was under load
> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> >> >>
> >> >> The issue is triggered reliably by loading up a system with network
> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
> >> >> attaching an freplace bpf_link and closing it again. The close() will
> >> >> hang until the network traffic load is lowered.
> >> >>
> >> >> Digging further, it appears that the hang happens in
> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> >> >>
> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> >> >> Attaching 2 probes...
> >> >> enter
> >> >> exit after 54 ms
> >> >> enter
> >> >> exit after 3249 ms
> >> >>
> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
> >> >> and from a loaded system where I stopped the network traffic after a
> >> >> couple of seconds).
> >> >>
> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> >> >>
> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> >> >>
> >> >> And because it does this while holding trampoline_mutex, even deferring
> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
> >> >> doesn't help: that'll fix the initial hang on close(), but any
> >> >> subsequent use of BPF trampolines will then be blocked because of the
> >> >> mutex.
> >> >>
> >> >> Also, if I just keep the network traffic running I will eventually get a
> >> >> kernel panic with:
> >> >>
> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> >> >>
> >> >> I've created a reproducer for the issue here:
> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> >> >>
> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> >> >>
> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> >> >> $ cd bpf-examples/bpf-link-hang
> >> >> $ make
> >> >> $ sudo ./bpf-link-hang
> >> >>
> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
> >> >> from a separate machine to do this.
> >> >>
> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> >> >> or can this be fixed? And if it is expected, how can the BPF code be
> >> >> fixed so it doesn't deadlock because of this?
> >> >>
> >> >> Hoping you can help us with this - many thanks in advance! :)
> >> >
> >> > Let me start with the usual question...  Is the network traffic intense
> >> > enough that one of the CPUs might remain in a loop handling softirqs
> >> > indefinitely?
> >>
> >> Yup, I'm pegging all CPUs in softirq:
> >>
> >> $ mpstat -P ALL 1
> >> [...]
> >> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> >> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >>
> >> > If so, does the (untested, probably does not build) patch below help?
> >>
> >> Doesn't appear to, no. It builds fine, but I still get:
> >>
> >> Attaching 2 probes...
> >> enter
> >> exit after 8480 ms
> >>
> >> (that was me interrupting the network traffic again)
> >
> > Is your kernel properly shifting from back-of-interrupt softirq processing
> > to ksoftirqd under heavy load?  If not, my patch will not have any
> > effect.
>
> Seems to be - this is from top:
>
>      12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
>      24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
>      34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
>      39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
>      19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
>      29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3
>
> Any other ideas? :)

bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
Fix fexit trampoline."); it doesn't call synchronize_rcu_tasks()
anymore. Please give it a try. It's in the bpf tree.
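
For context, the general direction of that fix (a hypothetical sketch
only -- the names below are made up and the actual commit is more
involved) is to stop waiting synchronously under trampoline_mutex and
instead hand the old trampoline image to an RCU-tasks callback, with
the actual freeing pushed to a workqueue:

#include <linux/filter.h>
#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Hypothetical type for illustration only. */
struct tramp_image {
	void *image;			/* JITed trampoline text */
	struct rcu_head rcu;
	struct work_struct work;
};

static void tramp_image_free_work(struct work_struct *work)
{
	struct tramp_image *im = container_of(work, struct tramp_image, work);

	bpf_jit_free_exec(im->image);	/* may sleep, so do it here */
	kfree(im);
}

static void tramp_image_free_rcu(struct rcu_head *rcu)
{
	struct tramp_image *im = container_of(rcu, struct tramp_image, rcu);

	/* The callback context may not allow sleeping, so defer the
	 * actual freeing to process context via the system workqueue.
	 */
	INIT_WORK(&im->work, tramp_image_free_work);
	schedule_work(&im->work);
}

/* Called in place of the old synchronous wait: returns immediately,
 * so no mutex is held across the RCU-tasks grace period.
 */
static void tramp_image_put(struct tramp_image *im)
{
	call_rcu_tasks(&im->rcu, tramp_image_free_rcu);
}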

>
> (And thanks for taking a look, BTW!)
>
> -Toke
>


* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-23 19:59         ` Andrii Nakryiko
@ 2021-03-23 21:04           ` Toke Høiland-Jørgensen
  2021-03-23 21:52             ` Paul E. McKenney
From: Toke Høiland-Jørgensen @ 2021-03-23 21:04 UTC
  To: Andrii Nakryiko; +Cc: Paul E . McKenney, bpf, Magnus Karlsson

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>>
>> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >>
>> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> Hi Paul
>> >> >>
>> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
>> >> >> file descriptor would hang indefinitely when the system was under load
>> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> >> >>
>> >> >> The issue is triggered reliably by loading up a system with network
>> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
>> >> >> attaching an freplace bpf_link and closing it again. The close() will
>> >> >> hang until the network traffic load is lowered.
>> >> >>
>> >> >> Digging further, it appears that the hang happens in
>> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> >> >>
>> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> >> >> Attaching 2 probes...
>> >> >> enter
>> >> >> exit after 54 ms
>> >> >> enter
>> >> >> exit after 3249 ms
>> >> >>
>> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
>> >> >> and from a loaded system where I stopped the network traffic after a
>> >> >> couple of seconds).
>> >> >>
>> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> >> >>
>> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> >> >>
>> >> >> And because it does this while holding trampoline_mutex, even deferring
>> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
>> >> >> doesn't help: that'll fix the initial hang on close(), but any
>> >> >> subsequent use of BPF trampolines will then be blocked because of the
>> >> >> mutex.
>> >> >>
>> >> >> Also, if I just keep the network traffic running I will eventually get a
>> >> >> kernel panic with:
>> >> >>
>> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> >> >>
>> >> >> I've created a reproducer for the issue here:
>> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> >> >>
>> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> >> >>
>> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> >> >> $ cd bpf-examples/bpf-link-hang
>> >> >> $ make
>> >> >> $ sudo ./bpf-link-hang
>> >> >>
>> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
>> >> >> from a separate machine to do this.
>> >> >>
>> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
>> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> >> >> or can this be fixed? And if it is expected, how can the BPF code be
>> >> >> fixed so it doesn't deadlock because of this?
>> >> >>
>> >> >> Hoping you can help us with this - many thanks in advance! :)
>> >> >
>> >> > Let me start with the usual question...  Is the network traffic intense
>> >> > enough that one of the CPUs might remain in a loop handling softirqs
>> >> > indefinitely?
>> >>
>> >> Yup, I'm pegging all CPUs in softirq:
>> >>
>> >> $ mpstat -P ALL 1
>> >> [...]
>> >> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> >> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >>
>> >> > If so, does the (untested, probably does not build) patch below help?
>> >>
>> >> Doesn't appear to, no. It builds fine, but I still get:
>> >>
>> >> Attaching 2 probes...
>> >> enter
>> >> exit after 8480 ms
>> >>
>> >> (that was me interrupting the network traffic again)
>> >
>> > Is your kernel properly shifting from back-of-interrupt softirq processing
>> > to ksoftirqd under heavy load?  If not, my patch will not have any
>> > effect.
>>
>> Seems to be - this is from top:
>>
>>      12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
>>      24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
>>      34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
>>      39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
>>      19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
>>      29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3
>>
>> Any other ideas? :)
>
> bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
> Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
> anymore. Please give it a try. It's in bpf tree.

Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
indeed works better; awesome!

And sorry for bothering you with this, Paul; guess I should have looked
harder for fixes first... :/

-Toke



* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-23 21:04           ` Toke Høiland-Jørgensen
@ 2021-03-23 21:52             ` Paul E. McKenney
  2021-03-23 22:06               ` Toke Høiland-Jørgensen
From: Paul E. McKenney @ 2021-03-23 21:52 UTC
  To: Toke Høiland-Jørgensen; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson

On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> 
> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >>
> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >>
> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> Hi Paul
> >> >> >>
> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
> >> >> >> file descriptor would hang indefinitely when the system was under load
> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> >> >> >>
> >> >> >> The issue is triggered reliably by loading up a system with network
> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
> >> >> >> hang until the network traffic load is lowered.
> >> >> >>
> >> >> >> Digging further, it appears that the hang happens in
> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> >> >> >>
> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> >> >> >> Attaching 2 probes...
> >> >> >> enter
> >> >> >> exit after 54 ms
> >> >> >> enter
> >> >> >> exit after 3249 ms
> >> >> >>
> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
> >> >> >> and from a loaded system where I stopped the network traffic after a
> >> >> >> couple of seconds).
> >> >> >>
> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> >> >> >>
> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> >> >> >>
> >> >> >> And because it does this while holding trampoline_mutex, even deferring
> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
> >> >> >> mutex.
> >> >> >>
> >> >> >> Also, if I just keep the network traffic running I will eventually get a
> >> >> >> kernel panic with:
> >> >> >>
> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> >> >> >>
> >> >> >> I've created a reproducer for the issue here:
> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> >> >> >>
> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> >> >> >>
> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> >> >> >> $ cd bpf-examples/bpf-link-hang
> >> >> >> $ make
> >> >> >> $ sudo ./bpf-link-hang
> >> >> >>
> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
> >> >> >> from a separate machine to do this.
> >> >> >>
> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
> >> >> >> fixed so it doesn't deadlock because of this?
> >> >> >>
> >> >> >> Hoping you can help us with this - many thanks in advance! :)
> >> >> >
> >> >> > Let me start with the usual question...  Is the network traffic intense
> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
> >> >> > indefinitely?
> >> >>
> >> >> Yup, I'm pegging all CPUs in softirq:
> >> >>
> >> >> $ mpstat -P ALL 1
> >> >> [...]
> >> >> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> >> >> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >>
> >> >> > If so, does the (untested, probably does not build) patch below help?
> >> >>
> >> >> Doesn't appear to, no. It builds fine, but I still get:
> >> >>
> >> >> Attaching 2 probes...
> >> >> enter
> >> >> exit after 8480 ms
> >> >>
> >> >> (that was me interrupting the network traffic again)
> >> >
> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
> >> > to ksoftirqd under heavy load?  If not, my patch will not have any
> >> > effect.
> >>
> >> Seems to be - this is from top:
> >>
> >>      12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
> >>      24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
> >>      34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
> >>      39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
> >>      19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
> >>      29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3
> >>
> >> Any other ideas? :)
> >
> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
> > anymore. Please give it a try. It's in bpf tree.
> 
> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
> indeed works better; awesome!
> 
> And sorry for bothering you with this, Paul; guess I should have looked
> harder for fixes first... :/

Glad it is now working!

And in any case, my patch needed an s/true/false/.  :-/

Hey, I did say "untested"!  ;-)

							Thanx, Paul


* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-23 21:52             ` Paul E. McKenney
@ 2021-03-23 22:06               ` Toke Høiland-Jørgensen
  2021-03-24  2:41                 ` Paul E. McKenney
From: Toke Høiland-Jørgensen @ 2021-03-23 22:06 UTC
  To: paulmck; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> 
>> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >>
>> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >>
>> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> Hi Paul
>> >> >> >>
>> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
>> >> >> >> file descriptor would hang indefinitely when the system was under load
>> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> >> >> >>
>> >> >> >> The issue is triggered reliably by loading up a system with network
>> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
>> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
>> >> >> >> hang until the network traffic load is lowered.
>> >> >> >>
>> >> >> >> Digging further, it appears that the hang happens in
>> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> >> >> >>
>> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> >> >> >> Attaching 2 probes...
>> >> >> >> enter
>> >> >> >> exit after 54 ms
>> >> >> >> enter
>> >> >> >> exit after 3249 ms
>> >> >> >>
>> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
>> >> >> >> and from a loaded system where I stopped the network traffic after a
>> >> >> >> couple of seconds).
>> >> >> >>
>> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> >> >> >>
>> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> >> >> >>
>> >> >> >> And because it does this while holding trampoline_mutex, even deferring
>> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
>> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
>> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
>> >> >> >> mutex.
>> >> >> >>
>> >> >> >> Also, if I just keep the network traffic running I will eventually get a
>> >> >> >> kernel panic with:
>> >> >> >>
>> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> >> >> >>
>> >> >> >> I've created a reproducer for the issue here:
>> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> >> >> >>
>> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> >> >> >>
>> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> >> >> >> $ cd bpf-examples/bpf-link-hang
>> >> >> >> $ make
>> >> >> >> $ sudo ./bpf-link-hang
>> >> >> >>
>> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
>> >> >> >> from a separate machine to do this.
>> >> >> >>
>> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
>> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
>> >> >> >> fixed so it doesn't deadlock because of this?
>> >> >> >>
>> >> >> >> Hoping you can help us with this - many thanks in advance! :)
>> >> >> >
>> >> >> > Let me start with the usual question...  Is the network traffic intense
>> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
>> >> >> > indefinitely?
>> >> >>
>> >> >> Yup, I'm pegging all CPUs in softirq:
>> >> >>
>> >> >> $ mpstat -P ALL 1
>> >> >> [...]
>> >> >> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> >> >> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >>
>> >> >> > If so, does the (untested, probably does not build) patch below help?
>> >> >>
>> >> >> Doesn't appear to, no. It builds fine, but I still get:
>> >> >>
>> >> >> Attaching 2 probes...
>> >> >> enter
>> >> >> exit after 8480 ms
>> >> >>
>> >> >> (that was me interrupting the network traffic again)
>> >> >
>> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
>> >> > to ksoftirqd under heavy load?  If not, my patch will not have any
>> >> > effect.
>> >>
>> >> Seems to be - this is from top:
>> >>
>> >>      12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
>> >>      24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
>> >>      34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
>> >>      39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
>> >>      19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
>> >>      29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3
>> >>
>> >> Any other ideas? :)
>> >
>> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
>> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
>> > anymore. Please give it a try. It's in bpf tree.
>> 
>> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
>> indeed works better; awesome!
>> 
>> And sorry for bothering you with this, Paul; guess I should have looked
>> harder for fixes first... :/
>
> Glad it is now working!
>
> And in any case, my patch needed an s/true/false/.  :-/
>
> Hey, I did say "untested"!  ;-)

Haha, right, well at least you won't run afoul of the 'truth in advertising'
committee ;)

-Toke



* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-23 22:06               ` Toke Høiland-Jørgensen
@ 2021-03-24  2:41                 ` Paul E. McKenney
  2021-03-24 11:33                   ` Toke Høiland-Jørgensen
From: Paul E. McKenney @ 2021-03-24  2:41 UTC
  To: Toke Høiland-Jørgensen; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson

On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >> 
> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >>
> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >>
> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> >> Hi Paul
> >> >> >> >>
> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
> >> >> >> >> file descriptor would hang indefinitely when the system was under load
> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> >> >> >> >>
> >> >> >> >> The issue is triggered reliably by loading up a system with network
> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
> >> >> >> >> hang until the network traffic load is lowered.
> >> >> >> >>
> >> >> >> >> Digging further, it appears that the hang happens in
> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> >> >> >> >>
> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> >> >> >> >> Attaching 2 probes...
> >> >> >> >> enter
> >> >> >> >> exit after 54 ms
> >> >> >> >> enter
> >> >> >> >> exit after 3249 ms
> >> >> >> >>
> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
> >> >> >> >> and from a loaded system where I stopped the network traffic after a
> >> >> >> >> couple of seconds).
> >> >> >> >>
> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> >> >> >> >>
> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> >> >> >> >>
> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
> >> >> >> >> mutex.
> >> >> >> >>
> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
> >> >> >> >> kernel panic with:
> >> >> >> >>
> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> >> >> >> >>
> >> >> >> >> I've created a reproducer for the issue here:
> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> >> >> >> >>
> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> >> >> >> >>
> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> >> >> >> >> $ cd bpf-examples/bpf-link-hang
> >> >> >> >> $ make
> >> >> >> >> $ sudo ./bpf-link-hang
> >> >> >> >>
> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
> >> >> >> >> from a separate machine to do this.
> >> >> >> >>
> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
> >> >> >> >> fixed so it doesn't deadlock because of this?
> >> >> >> >>
> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
> >> >> >> >
> >> >> >> > Let me start with the usual question...  Is the network traffic intense
> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
> >> >> >> > indefinitely?
> >> >> >>
> >> >> >> Yup, I'm pegging all CPUs in softirq:
> >> >> >>
> >> >> >> $ mpstat -P ALL 1
> >> >> >> [...]
> >> >> >> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> >> >> >> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >>
> >> >> >> > If so, does the (untested, probably does not build) patch below help?
> >> >> >>
> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
> >> >> >>
> >> >> >> Attaching 2 probes...
> >> >> >> enter
> >> >> >> exit after 8480 ms
> >> >> >>
> >> >> >> (that was me interrupting the network traffic again)
> >> >> >
> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
> >> >> > to ksoftirqd under heavy load?  If not, my patch will not have any
> >> >> > effect.
> >> >>
> >> >> Seems to be - this is from top:
> >> >>
> >> >>      12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
> >> >>      24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
> >> >>      34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
> >> >>      39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
> >> >>      19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
> >> >>      29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3
> >> >>
> >> >> Any other ideas? :)
> >> >
> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
> >> > anymore. Please give it a try. It's in bpf tree.
> >> 
> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
> >> indeed works better; awesome!
> >> 
> >> And sorry for bothering you with this, Paul; guess I should have looked
> >> harder for fixes first... :/
> >
> > Glad it is now working!
> >
> > And in any case, my patch needed an s/true/false/.  :-/
> >
> > Hey, I did say "untested"!  ;-)
> 
> Haha, right, well at least you run afoul of the 'truth in advertising'
> committee ;)

If you get a chance, could you please test the (hopefully) corrected
patch shown below?  This issue might affect other use cases.

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b06be5..e21e7b0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -242,6 +242,7 @@ void rcu_softirq_qs(void)
 {
 	rcu_qs();
 	rcu_preempt_deferred_qs(current);
+	rcu_tasks_qs(current, false);
 }
 
 /*
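
For reference, the reason the second argument matters: an involuntary
preemption is not a quiescent state for RCU Tasks, so the classic part
of the rcu_tasks_qs() hook only clears the holdout flag when its
"preempt" argument is false.  Paraphrased (check include/linux/rcupdate.h
in your tree for the exact definition, which varies by version):

#define rcu_tasks_classic_qs(t, preempt)				\
	do {								\
		if (!(preempt) && READ_ONCE((t)->rcu_tasks_holdout))	\
			WRITE_ONCE((t)->rcu_tasks_holdout, false);	\
	} while (0)

which is why the earlier version of the patch, passing true, was a no-op.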


* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-24  2:41                 ` Paul E. McKenney
@ 2021-03-24 11:33                   ` Toke Høiland-Jørgensen
  2021-03-24 16:11                     ` Paul E. McKenney
From: Toke Høiland-Jørgensen @ 2021-03-24 11:33 UTC
  To: paulmck; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
>> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> >> 
>> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >>
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >>
>> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >>
>> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> Hi Paul
>> >> >> >> >>
>> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
>> >> >> >> >> file descriptor would hang indefinitely when the system was under load
>> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> >> >> >> >>
>> >> >> >> >> The issue is triggered reliably by loading up a system with network
>> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
>> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
>> >> >> >> >> hang until the network traffic load is lowered.
>> >> >> >> >>
>> >> >> >> >> Digging further, it appears that the hang happens in
>> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> >> >> >> >>
>> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> >> >> >> >> Attaching 2 probes...
>> >> >> >> >> enter
>> >> >> >> >> exit after 54 ms
>> >> >> >> >> enter
>> >> >> >> >> exit after 3249 ms
>> >> >> >> >>
>> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
>> >> >> >> >> and from a loaded system where I stopped the network traffic after a
>> >> >> >> >> couple of seconds).
>> >> >> >> >>
>> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> >> >> >> >>
>> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> >> >> >> >>
>> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
>> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
>> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
>> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
>> >> >> >> >> mutex.
>> >> >> >> >>
>> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
>> >> >> >> >> kernel panic with:
>> >> >> >> >>
>> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> >> >> >> >>
>> >> >> >> >> I've created a reproducer for the issue here:
>> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> >> >> >> >>
>> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> >> >> >> >>
>> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> >> >> >> >> $ cd bpf-examples/bpf-link-hang
>> >> >> >> >> $ make
>> >> >> >> >> $ sudo ./bpf-link-hang
>> >> >> >> >>
>> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
>> >> >> >> >> from a separate machine to do this.
>> >> >> >> >>
>> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
>> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
>> >> >> >> >> fixed so it doesn't deadlock because of this?
>> >> >> >> >>
>> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
>> >> >> >> >
>> >> >> >> > Let me start with the usual question...  Is the network traffic intense
>> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
>> >> >> >> > indefinitely?
>> >> >> >>
>> >> >> >> Yup, I'm pegging all CPUs in softirq:
>> >> >> >>
>> >> >> >> $ mpstat -P ALL 1
>> >> >> >> [...]
>> >> >> >> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> >> >> >> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >>
>> >> >> >> > If so, does the (untested, probably does not build) patch below help?
>> >> >> >>
>> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
>> >> >> >>
>> >> >> >> Attaching 2 probes...
>> >> >> >> enter
>> >> >> >> exit after 8480 ms
>> >> >> >>
>> >> >> >> (that was me interrupting the network traffic again)
>> >> >> >
>> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
>> >> >> > to ksoftirqd under heavy load?  If not, my patch will not have any
>> >> >> > effect.
>> >> >>
>> >> >> Seems to be - this is from top:
>> >> >>
>> >> >>      12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
>> >> >>      24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
>> >> >>      34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
>> >> >>      39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
>> >> >>      19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
>> >> >>      29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3
>> >> >>
>> >> >> Any other ideas? :)
>> >> >
>> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
>> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
>> >> > anymore. Please give it a try. It's in bpf tree.
>> >> 
>> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
>> >> indeed works better; awesome!
>> >> 
>> >> And sorry for bothering you with this, Paul; guess I should have looked
>> >> harder for fixes first... :/
>> >
>> > Glad it is now working!
>> >
>> > And in any case, my patch needed an s/true/false/.  :-/
>> >
>> > Hey, I did say "untested"!  ;-)
>> 
>> Haha, right, well at least you won't run afoul of the 'truth in advertising'
>> committee ;)
>
> If you get a chance, could you please test the (hopefully) corrected
> patch shown below?  This issue might affect other use cases.

Yup, that does seem to help:

Attaching 2 probes...
enter
exit after 136 ms

-Toke
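
[The bpftrace one-liner above times a single synchronize_rcu_tasks() call.
To watch the distribution across many grace periods instead, the same two
probes can be keyed per-thread and fed into a histogram; a sketch using
only standard bpftrace builtins (not tested as part of this thread):

	bpftrace -e '
	kprobe:synchronize_rcu_tasks { @start[tid] = nsecs; }
	kretprobe:synchronize_rcu_tasks /@start[tid]/ {
		@ms = hist((nsecs - @start[tid]) / 1000000);
		delete(@start[tid]);
	}'

The @ms histogram (in milliseconds) is printed when the script is stopped
with Ctrl-C.]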


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-24 11:33                   ` Toke Høiland-Jørgensen
@ 2021-03-24 16:11                     ` Paul E. McKenney
  2021-03-24 19:17                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2021-03-24 16:11 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson

On Wed, Mar 24, 2021 at 12:33:47PM +0100, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> 
> >> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >> >> 
> >> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >> >>
> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >>
> >> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >>
> >> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> Hi Paul
> >> >> >> >> >>
> >> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
> >> >> >> >> >> file descriptor would hang indefinitely when the system was under load
> >> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> >> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> >> >> >> >> >>
> >> >> >> >> >> The issue is triggered reliably by loading up a system with network
> >> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
> >> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
> >> >> >> >> >> hang until the network traffic load is lowered.
> >> >> >> >> >>
> >> >> >> >> >> Digging further, it appears that the hang happens in
> >> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> >> >> >> >> >>
> >> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> >> >> >> >> >> Attaching 2 probes...
> >> >> >> >> >> enter
> >> >> >> >> >> exit after 54 ms
> >> >> >> >> >> enter
> >> >> >> >> >> exit after 3249 ms
> >> >> >> >> >>
> >> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
> >> >> >> >> >> and from a loaded system where I stopped the network traffic after a
> >> >> >> >> >> couple of seconds).
> >> >> >> >> >>
> >> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> >> >> >> >> >>
> >> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> >> >> >> >> >>
> >> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
> >> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
> >> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
> >> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
> >> >> >> >> >> mutex.
> >> >> >> >> >>
> >> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
> >> >> >> >> >> kernel panic with:
> >> >> >> >> >>
> >> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> >> >> >> >> >>
> >> >> >> >> >> I've created a reproducer for the issue here:
> >> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> >> >> >> >> >>
> >> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> >> >> >> >> >>
> >> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> >> >> >> >> >> $ cd bpf-examples/bpf-link-hang
> >> >> >> >> >> $ make
> >> >> >> >> >> $ sudo ./bpf-link-hang
> >> >> >> >> >>
> >> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
> >> >> >> >> >> from a separate machine to do this.
> >> >> >> >> >>
> >> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
> >> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> >> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
> >> >> >> >> >> fixed so it doesn't deadlock because of this?
> >> >> >> >> >>
> >> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
> >> >> >> >> >
> >> >> >> >> > Let me start with the usual question...  Is the network traffic intense
> >> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
> >> >> >> >> > indefinitely?
> >> >> >> >>
> >> >> >> >> Yup, I'm pegging all CPUs in softirq:
> >> >> >> >>
> >> >> >> >> $ mpstat -P ALL 1
> >> >> >> >> [...]
> >> >> >> >> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> >> >> >> >> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >>
> >> >> >> >> > If so, does the (untested, probably does not build) patch below help?
> >> >> >> >>
> >> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
> >> >> >> >>
> >> >> >> >> Attaching 2 probes...
> >> >> >> >> enter
> >> >> >> >> exit after 8480 ms
> >> >> >> >>
> >> >> >> >> (that was me interrupting the network traffic again)
> >> >> >> >
> >> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
> >> >> >> > to ksoftirqd under heavy load?  If not, my patch will not have any
> >> >> >> > effect.
> >> >> >>
> >> >> >> Seems to be - this is from top:
> >> >> >>
> >> >> >>      12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
> >> >> >>      24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
> >> >> >>      34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
> >> >> >>      39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
> >> >> >>      19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
> >> >> >>      29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3
> >> >> >>
> >> >> >> Any other ideas? :)
> >> >> >
> >> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
> >> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
> >> >> > anymore. Please give it a try. It's in bpf tree.
> >> >> 
> >> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
> >> >> indeed works better; awesome!
> >> >> 
> >> >> And sorry for bothering you with this, Paul; guess I should have looked
> >> >> harder for fixes first... :/
> >> >
> >> > Glad it is now working!
> >> >
> >> > And in any case, my patch needed an s/true/false/.  :-/
> >> >
> >> > Hey, I did say "untested"!  ;-)
> >> 
> >> Haha, right, well at least you won't run afoul of the 'truth in advertising'
> >> committee ;)
> >
> > If you get a chance, could you please test the (hopefully) corrected
> > patch shown below?  This issue might affect other use cases.
> 
> Yup, that does seem to help:
> 
> Attaching 2 probes...
> enter
> exit after 136 ms

Thank you very much!  May I please apply your Tested-by?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-24 16:11                     ` Paul E. McKenney
@ 2021-03-24 19:17                       ` Toke Høiland-Jørgensen
  2021-03-25 16:28                         ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-03-24 19:17 UTC (permalink / raw)
  To: paulmck; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Wed, Mar 24, 2021 at 12:33:47PM +0100, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> 
>> >> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> >> >> 
>> >> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >> >>
>> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >>
>> >> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >>
>> >> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> Hi Paul
>> >> >> >> >> >>
>> >> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
>> >> >> >> >> >> file descriptor would hang indefinitely when the system was under load
>> >> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> >> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> >> >> >> >> >>
>> >> >> >> >> >> The issue is triggered reliably by loading up a system with network
>> >> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
>> >> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
>> >> >> >> >> >> hang until the network traffic load is lowered.
>> >> >> >> >> >>
>> >> >> >> >> >> Digging further, it appears that the hang happens in
>> >> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> >> >> >> >> >>
>> >> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> >> >> >> >> >> Attaching 2 probes...
>> >> >> >> >> >> enter
>> >> >> >> >> >> exit after 54 ms
>> >> >> >> >> >> enter
>> >> >> >> >> >> exit after 3249 ms
>> >> >> >> >> >>
>> >> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
>> >> >> >> >> >> and from a loaded system where I stopped the network traffic after a
>> >> >> >> >> >> couple of seconds).
>> >> >> >> >> >>
>> >> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> >> >> >> >> >>
>> >> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> >> >> >> >> >>
>> >> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
>> >> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
>> >> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
>> >> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
>> >> >> >> >> >> mutex.
>> >> >> >> >> >>
>> >> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
>> >> >> >> >> >> kernel panic with:
>> >> >> >> >> >>
>> >> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> >> >> >> >> >>
>> >> >> >> >> >> I've created a reproducer for the issue here:
>> >> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> >> >> >> >> >>
>> >> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> >> >> >> >> >>
>> >> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> >> >> >> >> >> $ cd bpf-examples/bpf-link-hang
>> >> >> >> >> >> $ make
>> >> >> >> >> >> $ sudo ./bpf-link-hang
>> >> >> >> >> >>
>> >> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
>> >> >> >> >> >> from a separate machine to do this.
>> >> >> >> >> >>
>> >> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
>> >> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> >> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
>> >> >> >> >> >> fixed so it doesn't deadlock because of this?
>> >> >> >> >> >>
>> >> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
>> >> >> >> >> >
>> >> >> >> >> > Let me start with the usual question...  Is the network traffic intense
>> >> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
>> >> >> >> >> > indefinitely?
>> >> >> >> >>
>> >> >> >> >> Yup, I'm pegging all CPUs in softirq:
>> >> >> >> >>
>> >> >> >> >> $ mpstat -P ALL 1
>> >> >> >> >> [...]
>> >> >> >> >> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> >> >> >> >> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >>
>> >> >> >> >> > If so, does the (untested, probably does not build) patch below help?
>> >> >> >> >>
>> >> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
>> >> >> >> >>
>> >> >> >> >> Attaching 2 probes...
>> >> >> >> >> enter
>> >> >> >> >> exit after 8480 ms
>> >> >> >> >>
>> >> >> >> >> (that was me interrupting the network traffic again)
>> >> >> >> >
>> >> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
>> >> >> >> > to ksoftirqd under heavy load?  If not, my patch will not have any
>> >> >> >> > effect.
>> >> >> >>
>> >> >> >> Seems to be - this is from top:
>> >> >> >>
>> >> >> >>      12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
>> >> >> >>      24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
>> >> >> >>      34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
>> >> >> >>      39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
>> >> >> >>      19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
>> >> >> >>      29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3
>> >> >> >>
>> >> >> >> Any other ideas? :)
>> >> >> >
>> >> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
>> >> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
>> >> >> > anymore. Please give it a try. It's in bpf tree.
>> >> >> 
>> >> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
>> >> >> indeed works better; awesome!
>> >> >> 
>> >> >> And sorry for bothering you with this, Paul; guess I should have looked
>> >> >> harder for fixes first... :/
>> >> >
>> >> > Glad it is now working!
>> >> >
>> >> > And in any case, my patch needed an s/true/false/.  :-/
>> >> >
>> >> > Hey, I did say "untested"!  ;-)
>> >> 
>> >> Haha, right, well at least you won't run afoul of the 'truth in advertising'
>> >> committee ;)
>> >
>> > If you get a chance, could you please test the (hopefully) corrected
>> > patch shown below?  This issue might affect other use cases.
>> 
>> Yup, that does seem to help:
>> 
>> Attaching 2 probes...
>> enter
>> exit after 136 ms
>
> Thank you very much!  May I please apply your Tested-by?

Sure!

Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-24 19:17                       ` Toke Høiland-Jørgensen
@ 2021-03-25 16:28                         ` Paul E. McKenney
  2021-03-25 21:13                           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2021-03-25 16:28 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson

On Wed, Mar 24, 2021 at 08:17:35PM +0100, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Wed, Mar 24, 2021 at 12:33:47PM +0100, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> 
> >> > On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> 
> >> >> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >> >> >> 
> >> >> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >> >> >>
> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >>
> >> >> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> >> Hi Paul
> >> >> >> >> >> >>
> >> >> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
> >> >> >> >> >> >> file descriptor would hang indefinitely when the system was under load
> >> >> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> >> >> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> >> >> >> >> >> >>
> >> >> >> >> >> >> The issue is triggered reliably by loading up a system with network
> >> >> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
> >> >> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
> >> >> >> >> >> >> hang until the network traffic load is lowered.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Digging further, it appears that the hang happens in
> >> >> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> >> >> >> >> >> >>
> >> >> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> >> >> >> >> >> >> Attaching 2 probes...
> >> >> >> >> >> >> enter
> >> >> >> >> >> >> exit after 54 ms
> >> >> >> >> >> >> enter
> >> >> >> >> >> >> exit after 3249 ms
> >> >> >> >> >> >>
> >> >> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
> >> >> >> >> >> >> and from a loaded system where I stopped the network traffic after a
> >> >> >> >> >> >> couple of seconds).
> >> >> >> >> >> >>
> >> >> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> >> >> >> >> >> >>
> >> >> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> >> >> >> >> >> >>
> >> >> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
> >> >> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
> >> >> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
> >> >> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
> >> >> >> >> >> >> mutex.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
> >> >> >> >> >> >> kernel panic with:
> >> >> >> >> >> >>
> >> >> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> >> >> >> >> >> >>
> >> >> >> >> >> >> I've created a reproducer for the issue here:
> >> >> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> >> >> >> >> >> >>
> >> >> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> >> >> >> >> >> >>
> >> >> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> >> >> >> >> >> >> $ cd bpf-examples/bpf-link-hang
> >> >> >> >> >> >> $ make
> >> >> >> >> >> >> $ sudo ./bpf-link-hang
> >> >> >> >> >> >>
> >> >> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
> >> >> >> >> >> >> from a separate machine to do this.
> >> >> >> >> >> >>
> >> >> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
> >> >> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> >> >> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
> >> >> >> >> >> >> fixed so it doesn't deadlock because of this?
> >> >> >> >> >> >>
> >> >> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
> >> >> >> >> >> >
> >> >> >> >> >> > Let me start with the usual question...  Is the network traffic intense
> >> >> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
> >> >> >> >> >> > indefinitely?
> >> >> >> >> >>
> >> >> >> >> >> Yup, I'm pegging all CPUs in softirq:
> >> >> >> >> >>
> >> >> >> >> >> $ mpstat -P ALL 1
> >> >> >> >> >> [...]
> >> >> >> >> >> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> >> >> >> >> >> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> >> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> >> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> >> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> >> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> >> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> >> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
> >> >> >> >> >>
> >> >> >> >> >> > If so, does the (untested, probably does not build) patch below help?
> >> >> >> >> >>
> >> >> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
> >> >> >> >> >>
> >> >> >> >> >> Attaching 2 probes...
> >> >> >> >> >> enter
> >> >> >> >> >> exit after 8480 ms
> >> >> >> >> >>
> >> >> >> >> >> (that was me interrupting the network traffic again)
> >> >> >> >> >
> >> >> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
> >> >> >> >> > to ksoftirqd under heavy load?  If not, my patch will not have any
> >> >> >> >> > effect.
> >> >> >> >>
> >> >> >> >> Seems to be - this is from top:
> >> >> >> >>
> >> >> >> >>      12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
> >> >> >> >>      24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
> >> >> >> >>      34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
> >> >> >> >>      39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
> >> >> >> >>      19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
> >> >> >> >>      29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3
> >> >> >> >>
> >> >> >> >> Any other ideas? :)
> >> >> >> >
> >> >> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
> >> >> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
> >> >> >> > anymore. Please give it a try. It's in bpf tree.
> >> >> >> 
> >> >> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
> >> >> >> indeed works better; awesome!
> >> >> >> 
> >> >> >> And sorry for bothering you with this, Paul; guess I should have looked
> >> >> >> harder for fixes first... :/
> >> >> >
> >> >> > Glad it is now working!
> >> >> >
> >> >> > And in any case, my patch needed an s/true/false/.  :-/
> >> >> >
> >> >> > Hey, I did say "untested"!  ;-)
> >> >> 
> >> >> Haha, right, well at least you won't run afoul of the 'truth in advertising'
> >> >> committee ;)
> >> >
> >> > If you get a chance, could you please test the (hopefully) corrected
> >> > patch shown below?  This issue might affect other use cases.
> >> 
> >> Yup, that does seem to help:
> >> 
> >> Attaching 2 probes...
> >> enter
> >> exit after 136 ms
> >
> > Thank you very much!  May I please apply your Tested-by?
> 
> Sure!
> 
> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>

Applied, and thank you!

							Thanx, Paul

------------------------------------------------------------------------

commit 1a0dfc099c1e61e22045705265a1323ac294bea6
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed Mar 24 17:08:48 2021 -0700

    rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
    
    Heavy networking load can cause a CPU to execute continuously and
    indefinitely within ksoftirqd, in which case there will be no voluntary
    task switches and thus no RCU-tasks quiescent states.  This commit
    therefore causes the existing rcu_softirq_qs() to provide an RCU-tasks
    quiescent state.
    
    This of course means that __do_softirq() and its callers cannot be
    invoked from within a tracing trampoline.
    
    Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b06be5..9d7cb74 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -242,6 +242,7 @@ void rcu_softirq_qs(void)
 {
 	rcu_qs();
 	rcu_preempt_deferred_qs(current);
+	rcu_tasks_qs(current, false);
 }
 
 /*
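
[For context on where this quiescent state gets reported under the workload
from this thread: once softirq work has been pushed to ksoftirqd,
__do_softirq() already calls rcu_softirq_qs() after running the pending
handlers, so a CPU pegged at 100% softirq load passes through the patched
hook on every pass.  A rough, version-dependent sketch of that call site
in kernel/softirq.c (not a verbatim copy):

	while ((softirq_bit = ffs(pending))) {
		...
		h->action(h);	/* run one pending softirq handler */
		...
	}

	if (__this_cpu_read(ksoftirqd) == current)
		rcu_softirq_qs();	/* with this patch, also a Tasks-RCU QS */
	local_irq_disable();

This is also why the commit log notes that __do_softirq() and its callers
cannot run from within a tracing trampoline: reporting a Tasks-RCU
quiescent state there would let a trampoline still on that call stack be
freed out from under it.]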

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
  2021-03-25 16:28                         ` Paul E. McKenney
@ 2021-03-25 21:13                           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-03-25 21:13 UTC (permalink / raw)
  To: paulmck; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Wed, Mar 24, 2021 at 08:17:35PM +0100, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> 
>> > On Wed, Mar 24, 2021 at 12:33:47PM +0100, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> 
>> >> > On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> 
>> >> >> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> >> >> >> 
>> >> >> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >> >> >>
>> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >>
>> >> >> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> >>
>> >> >> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> >> Hi Paul
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
>> >> >> >> >> >> >> file descriptor would hang indefinitely when the system was under load
>> >> >> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> >> >> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> The issue is triggered reliably by loading up a system with network
>> >> >> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
>> >> >> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
>> >> >> >> >> >> >> hang until the network traffic load is lowered.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Digging further, it appears that the hang happens in
>> >> >> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> >> >> >> >> >> >> Attaching 2 probes...
>> >> >> >> >> >> >> enter
>> >> >> >> >> >> >> exit after 54 ms
>> >> >> >> >> >> >> enter
>> >> >> >> >> >> >> exit after 3249 ms
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
>> >> >> >> >> >> >> and from a loaded system where I stopped the network traffic after a
>> >> >> >> >> >> >> couple of seconds).
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
>> >> >> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
>> >> >> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
>> >> >> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
>> >> >> >> >> >> >> mutex.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
>> >> >> >> >> >> >> kernel panic with:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I've created a reproducer for the issue here:
>> >> >> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> >> >> >> >> >> >> $ cd bpf-examples/bpf-link-hang
>> >> >> >> >> >> >> $ make
>> >> >> >> >> >> >> $ sudo ./bpf-link-hang
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
>> >> >> >> >> >> >> from a separate machine to do this.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
>> >> >> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> >> >> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
>> >> >> >> >> >> >> fixed so it doesn't deadlock because of this?
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
>> >> >> >> >> >> >
>> >> >> >> >> >> > Let me start with the usual question...  Is the network traffic intense
>> >> >> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
>> >> >> >> >> >> > indefinitely?
>> >> >> >> >> >>
>> >> >> >> >> >> Yup, I'm pegging all CPUs in softirq:
>> >> >> >> >> >>
>> >> >> >> >> >> $ mpstat -P ALL 1
>> >> >> >> >> >> [...]
>> >> >> >> >> >> 18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> >> >> >> >> >> 18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> >> 18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> >> 18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> >> 18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> >> 18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> >> 18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> >> 18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
>> >> >> >> >> >>
>> >> >> >> >> >> > If so, does the (untested, probably does not build) patch below help?
>> >> >> >> >> >>
>> >> >> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
>> >> >> >> >> >>
>> >> >> >> >> >> Attaching 2 probes...
>> >> >> >> >> >> enter
>> >> >> >> >> >> exit after 8480 ms
>> >> >> >> >> >>
>> >> >> >> >> >> (that was me interrupting the network traffic again)
>> >> >> >> >> >
>> >> >> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
>> >> >> >> >> > to ksoftirqd under heavy load?  If not, my patch will not have any
>> >> >> >> >> > effect.
>> >> >> >> >>
>> >> >> >> >> Seems to be - this is from top:
>> >> >> >> >>
>> >> >> >> >>      12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
>> >> >> >> >>      24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
>> >> >> >> >>      34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
>> >> >> >> >>      39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
>> >> >> >> >>      19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
>> >> >> >> >>      29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3
>> >> >> >> >>
>> >> >> >> >> Any other ideas? :)
>> >> >> >> >
>> >> >> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
>> >> >> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
>> >> >> >> > anymore. Please give it a try. It's in bpf tree.
>> >> >> >> 
>> >> >> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
>> >> >> >> indeed works better; awesome!
>> >> >> >> 
>> >> >> >> And sorry for bothering you with this, Paul; guess I should have looked
>> >> >> >> harder for fixes first... :/
>> >> >> >
>> >> >> > Glad it is now working!
>> >> >> >
>> >> >> > And in any case, my patch needed an s/true/false/.  :-/
>> >> >> >
>> >> >> > Hey, I did say "untested"!  ;-)
>> >> >> 
>> >> >> Haha, right, well at least you won't run afoul of the 'truth in advertising'
>> >> >> committee ;)
>> >> >
>> >> > If you get a chance, could you please test the (hopefully) corrected
>> >> > patch shown below?  This issue might affect other use cases.
>> >> 
>> >> Yup, that does seem to help:
>> >> 
>> >> Attaching 2 probes...
>> >> enter
>> >> exit after 136 ms
>> >
>> > Thank you very much!  May I please apply your Tested-by?
>> 
>> Sure!
>> 
>> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
>
> Applied, and thank you!

Awesome! You're welcome, and thank you for the fix (and the quick turnaround)! :)

-Toke


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-03-25 21:14 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <877dly6ooz.fsf@toke.dk>
2021-03-23 16:43 ` BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels Paul E. McKenney
2021-03-23 17:29   ` Toke Høiland-Jørgensen
2021-03-23 17:57     ` Paul E. McKenney
2021-03-23 19:50       ` Toke Høiland-Jørgensen
2021-03-23 19:59         ` Andrii Nakryiko
2021-03-23 21:04           ` Toke Høiland-Jørgensen
2021-03-23 21:52             ` Paul E. McKenney
2021-03-23 22:06               ` Toke Høiland-Jørgensen
2021-03-24  2:41                 ` Paul E. McKenney
2021-03-24 11:33                   ` Toke Høiland-Jørgensen
2021-03-24 16:11                     ` Paul E. McKenney
2021-03-24 19:17                       ` Toke Høiland-Jørgensen
2021-03-25 16:28                         ` Paul E. McKenney
2021-03-25 21:13                           ` Toke Høiland-Jørgensen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.