* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
From: Paul E. McKenney @ 2021-03-23 16:43 UTC
To: Toke Høiland-Jørgensen; +Cc: bpf, Magnus Karlsson
On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> Hi Paul
>
> Magnus and I have been debugging an issue where close() on a bpf_link
> file descriptor would hang indefinitely when the system was under load
> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>
> The issue is triggered reliably by loading up a system with network
> traffic (causing 100% softirq CPU load on one or more cores), and then
> attaching a freplace bpf_link and closing it again. The close() will
> hang until the network traffic load is lowered.
>
> Digging further, it appears that the hang happens in
> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>
> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> Attaching 2 probes...
> enter
> exit after 54 ms
> enter
> exit after 3249 ms
>
> (the two enter/exit pairs are, respectively, from an unloaded system,
> and from a loaded system where I stopped the network traffic after a
> couple of seconds).
>
> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>
> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>
> And because it does this while holding trampoline_mutex, even deferring
> the put to a worker (as a previously applied-then-reverted patch did[0])
> doesn't help: that'll fix the initial hang on close(), but any
> subsequent use of BPF trampolines will then be blocked because of the
> mutex.
>
> Also, if I just keep the network traffic running I will eventually get a
> kernel panic with:
>
> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>
> I've created a reproducer for the issue here:
> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>
> To compile, simply do this (a recent llvm/clang is needed to compile the BPF program):
>
> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> $ cd bpf-examples/bpf-link-hang
> $ make
> $ sudo ./bpf-link-hang
>
> you'll need to load up the system to trigger the hang; I'm using pktgen
> from a separate machine to do this.
>
> My question is, of course, as ever, What Is To Be Done? Is it expected
> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> or can this be fixed? And if it is expected, how can the BPF code be
> fixed so it doesn't deadlock because of this?
>
> Hoping you can help us with this - many thanks in advance! :)
Let me start with the usual question... Is the network traffic intense
enough that one of the CPUs might remain in a loop handling softirqs
indefinitely?
If so, does the (untested, probably does not build) patch below help?
Please note that this is only a diagnostic patch. It has the serious
side effect of making __do_softirq() and anything that calls it implicitly
noinstr. But it might at least be a decent starting point for a real fix.
Or might be part of the real fix, who knows?
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b06be5..e21e7b0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -242,6 +242,7 @@ void rcu_softirq_qs(void)
 {
 	rcu_qs();
 	rcu_preempt_deferred_qs(current);
+	rcu_tasks_qs(current, true);
 }
 
 /*
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
From: Toke Høiland-Jørgensen @ 2021-03-23 17:29 UTC
To: paulmck; +Cc: bpf, Magnus Karlsson
"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> [...]
>
> Let me start with the usual question... Is the network traffic intense
> enough that one of the CPUs might remain in a loop handling softirqs
> indefinitely?
Yup, I'm pegging all CPUs in softirq:
$ mpstat -P ALL 1
[...]
18:26:52 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
18:26:53 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
18:26:53 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
18:26:53 1 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
18:26:53 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
18:26:53 3 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
18:26:53 4 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
18:26:53 5 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> If so, does the (untested, probably does not build) patch below help?
Doesn't appear to, no. It builds fine, but I still get:
Attaching 2 probes...
enter
exit after 8480 ms
(that was me interrupting the network traffic again)
-Toke
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
From: Paul E. McKenney @ 2021-03-23 17:57 UTC
To: Toke Høiland-Jørgensen; +Cc: bpf, Magnus Karlsson
On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
>
> > [...]
> > Let me start with the usual question... Is the network traffic intense
> > enough that one of the CPUs might remain in a loop handling softirqs
> > indefinitely?
>
> Yup, I'm pegging all CPUs in softirq:
>
> $ mpstat -P ALL 1
> [...]
> 18:26:52 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 18:26:53 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> 18:26:53 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> 18:26:53 1 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> 18:26:53 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> 18:26:53 3 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> 18:26:53 4 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> 18:26:53 5 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>
> > If so, does the (untested, probably does not build) patch below help?
>
> Doesn't appear to, no. It builds fine, but I still get:
>
> Attaching 2 probes...
> enter
> exit after 8480 ms
>
> (that was me interrupting the network traffic again)
Is your kernel properly shifting from back-of-interrupt softirq processing
to ksoftirqd under heavy load? If not, my patch will not have any effect.
Thanx, Paul
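For reference, one way to check this from userspace (a sketch; exact tool options and output vary by distro) is to watch the per-CPU ksoftirqd threads and the per-CPU softirq counters:

```shell
# Are the per-CPU ksoftirqd threads the ones burning CPU? If so, softirq
# processing has shifted out of back-of-interrupt context.
ps -e -o pid,comm,%cpu --sort=-%cpu | grep ksoftirqd

# Per-CPU softirq counters; NET_RX climbing fast on the loaded CPUs
# confirms where the network work is being processed.
grep -E 'CPU|NET_RX' /proc/softirqs
```

Running the second command twice a second apart and comparing the NET_RX columns shows which CPUs are absorbing the load.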
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
From: Toke Høiland-Jørgensen @ 2021-03-23 19:50 UTC
To: paulmck; +Cc: bpf, Magnus Karlsson
"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> [...]
>
> Is your kernel properly shifting from back-of-interrupt softirq processing
> to ksoftirqd under heavy load? If not, my patch will not have any
> effect.
Seems to be - this is from top:
12 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/0
24 root 20 0 0 0 0 R 99.3 0.0 0:43.62 ksoftirqd/2
34 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/4
39 root 20 0 0 0 0 R 99.3 0.0 0:43.65 ksoftirqd/5
19 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/1
29 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/3
Any other ideas? :)
(And thanks for taking a look, BTW!)
-Toke
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
From: Andrii Nakryiko @ 2021-03-23 19:59 UTC
To: Toke Høiland-Jørgensen; +Cc: Paul E. McKenney, bpf, Magnus Karlsson
On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> > [...]
> >
> > Is your kernel properly shifting from back-of-interrupt softirq processing
> > to ksoftirqd under heavy load? If not, my patch will not have any
> > effect.
>
> Seems to be - this is from top:
>
> 12 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/0
> 24 root 20 0 0 0 0 R 99.3 0.0 0:43.62 ksoftirqd/2
> 34 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/4
> 39 root 20 0 0 0 0 R 99.3 0.0 0:43.65 ksoftirqd/5
> 19 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/1
> 29 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/3
>
> Any other ideas? :)
bpf_trampoline_put() got significantly changed by commit e21aa341785c ("bpf:
Fix fexit trampoline"); it no longer calls synchronize_rcu_tasks().
Please give it a try. It's in the bpf tree.
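A quick way to check whether a given kernel checkout already carries that fix (assuming the abbreviated hash e21aa341785c resolves in your clone) is to ask git whether the commit is an ancestor of the current HEAD:

```shell
# Does this kernel tree contain e21aa341785c ("bpf: Fix fexit trampoline")?
# Run from inside a kernel git checkout.
if git merge-base --is-ancestor e21aa341785c HEAD 2>/dev/null; then
    echo "fix present"
else
    echo "fix missing (or unknown hash)"
fi
```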
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
From: Toke Høiland-Jørgensen @ 2021-03-23 21:04 UTC
To: Andrii Nakryiko; +Cc: Paul E. McKenney, bpf, Magnus Karlsson
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> [...]
>>
>> Any other ideas? :)
>
> bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
> Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
> anymore. Please give it a try. It's in bpf tree.
Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
indeed works better; awesome!
And sorry for bothering you with this, Paul; guess I should have looked
harder for fixes first... :/
-Toke
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
From: Paul E. McKenney @ 2021-03-23 21:52 UTC
To: Toke Høiland-Jørgensen; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson
On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >>
> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >>
> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> Hi Paul
> >> >> >>
> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
> >> >> >> file descriptor would hang indefinitely when the system was under load
> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> >> >> >>
> >> >> >> The issue is triggered reliably by loading up a system with network
> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
> >> >> >> hang until the network traffic load is lowered.
> >> >> >>
> >> >> >> Digging further, it appears that the hang happens in
> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> >> >> >>
> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> >> >> >> Attaching 2 probes...
> >> >> >> enter
> >> >> >> exit after 54 ms
> >> >> >> enter
> >> >> >> exit after 3249 ms
> >> >> >>
> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
> >> >> >> and from a loaded system where I stopped the network traffic after a
> >> >> >> couple of seconds).
> >> >> >>
> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> >> >> >>
> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> >> >> >>
> >> >> >> And because it does this while holding trampoline_mutex, even deferring
> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
> >> >> >> mutex.
> >> >> >>
> >> >> >> Also, if I just keep the network traffic running I will eventually get a
> >> >> >> kernel panic with:
> >> >> >>
> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> >> >> >>
> >> >> >> I've created a reproducer for the issue here:
> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> >> >> >>
> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> >> >> >>
> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> >> >> >> $ cd bpf-examples/bpf-link-hang
> >> >> >> $ make
> >> >> $ sudo ./bpf-link-hang
> >> >> >>
> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
> >> >> >> from a separate machine to do this.
> >> >> >>
> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
> >> >> >> fixed so it doesn't deadlock because of this?
> >> >> >>
> >> >> >> Hoping you can help us with this - many thanks in advance! :)
> >> >> >
> >> >> > Let me start with the usual question... Is the network traffic intense
> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
> >> >> > indefinitely?
> >> >>
> >> >> Yup, I'm pegging all CPUs in softirq:
> >> >>
> >> >> $ mpstat -P ALL 1
> >> >> [...]
> >> >> 18:26:52 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> >> >> 18:26:53 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> 18:26:53 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> 18:26:53 1 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> 18:26:53 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> 18:26:53 3 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> 18:26:53 4 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> 18:26:53 5 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >>
> >> >> > If so, does the (untested, probably does not build) patch below help?
> >> >>
> >> >> Doesn't appear to, no. It builds fine, but I still get:
> >> >>
> >> >> Attaching 2 probes...
> >> >> enter
> >> >> exit after 8480 ms
> >> >>
> >> >> (that was me interrupting the network traffic again)
> >> >
> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
> >> > to ksoftirqd under heavy load? If not, my patch will not have any
> >> > effect.
> >>
> >> Seems to be - this is from top:
> >>
> >> 12 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/0
> >> 24 root 20 0 0 0 0 R 99.3 0.0 0:43.62 ksoftirqd/2
> >> 34 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/4
> >> 39 root 20 0 0 0 0 R 99.3 0.0 0:43.65 ksoftirqd/5
> >> 19 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/1
> >> 29 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/3
> >>
> >> Any other ideas? :)
> >
> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
> > anymore. Please give it a try. It's in bpf tree.
>
> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
> indeed works better; awesome!
>
> And sorry for bothering you with this, Paul; guess I should have looked
> harder for fixes first... :/
Glad it is now working!

And in any case, my patch needed an s/true/false/. :-/

Hey, I did say "untested"! ;-)

							Thanx, Paul
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
2021-03-23 21:52 ` Paul E. McKenney
@ 2021-03-23 22:06 ` Toke Høiland-Jørgensen
2021-03-24 2:41 ` Paul E. McKenney
0 siblings, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-03-23 22:06 UTC (permalink / raw)
To: paulmck; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson
"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >>
>> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >>
>> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> Hi Paul
>> >> >> >>
>> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
>> >> >> >> file descriptor would hang indefinitely when the system was under load
>> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> >> >> >>
>> >> >> >> The issue is triggered reliably by loading up a system with network
>> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
>> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
>> >> >> >> hang until the network traffic load is lowered.
>> >> >> >>
>> >> >> >> Digging further, it appears that the hang happens in
>> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> >> >> >>
>> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> >> >> >> Attaching 2 probes...
>> >> >> >> enter
>> >> >> >> exit after 54 ms
>> >> >> >> enter
>> >> >> >> exit after 3249 ms
>> >> >> >>
>> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
>> >> >> >> and from a loaded system where I stopped the network traffic after a
>> >> >> >> couple of seconds).
>> >> >> >>
>> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> >> >> >>
>> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> >> >> >>
>> >> >> >> And because it does this while holding trampoline_mutex, even deferring
>> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
>> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
>> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
>> >> >> >> mutex.
>> >> >> >>
>> >> >> >> Also, if I just keep the network traffic running I will eventually get a
>> >> >> >> kernel panic with:
>> >> >> >>
>> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> >> >> >>
>> >> >> >> I've created a reproducer for the issue here:
>> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> >> >> >>
>> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> >> >> >>
>> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> >> >> >> $ cd bpf-examples/bpf-link-hang
>> >> >> >> $ make
>> >> >> >> $ sudo ./bpf-link-hang
>> >> >> >>
>> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
>> >> >> >> from a separate machine to do this.
>> >> >> >>
>> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
>> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
>> >> >> >> fixed so it doesn't deadlock because of this?
>> >> >> >>
>> >> >> >> Hoping you can help us with this - many thanks in advance! :)
>> >> >> >
>> >> >> > Let me start with the usual question... Is the network traffic intense
>> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
>> >> >> > indefinitely?
>> >> >>
>> >> >> Yup, I'm pegging all CPUs in softirq:
>> >> >>
>> >> >> $ mpstat -P ALL 1
>> >> >> [...]
>> >> >> 18:26:52 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> >> >> 18:26:53 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> 18:26:53 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> 18:26:53 1 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> 18:26:53 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> 18:26:53 3 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> 18:26:53 4 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> 18:26:53 5 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >>
>> >> >> > If so, does the (untested, probably does not build) patch below help?
>> >> >>
>> >> >> Doesn't appear to, no. It builds fine, but I still get:
>> >> >>
>> >> >> Attaching 2 probes...
>> >> >> enter
>> >> >> exit after 8480 ms
>> >> >>
>> >> >> (that was me interrupting the network traffic again)
>> >> >
>> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
>> >> > to ksoftirqd under heavy load? If not, my patch will not have any
>> >> > effect.
>> >>
>> >> Seems to be - this is from top:
>> >>
>> >> 12 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/0
>> >> 24 root 20 0 0 0 0 R 99.3 0.0 0:43.62 ksoftirqd/2
>> >> 34 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/4
>> >> 39 root 20 0 0 0 0 R 99.3 0.0 0:43.65 ksoftirqd/5
>> >> 19 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/1
>> >> 29 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/3
>> >>
>> >> Any other ideas? :)
>> >
>> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
>> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
>> > anymore. Please give it a try. It's in bpf tree.
>>
>> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
>> indeed works better; awesome!
>>
>> And sorry for bothering you with this, Paul; guess I should have looked
>> harder for fixes first... :/
>
> Glad it is now working!
>
> And in any case, my patch needed an s/true/false/. :-/
>
> Hey, I did say "untested"! ;-)
Haha, right, well at least you didn't run afoul of the 'truth in advertising'
committee ;)
-Toke
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
2021-03-23 22:06 ` Toke Høiland-Jørgensen
@ 2021-03-24 2:41 ` Paul E. McKenney
2021-03-24 11:33 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2021-03-24 2:41 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson
On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
>
> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >>
> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >>
> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> >> Hi Paul
> >> >> >> >>
> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
> >> >> >> >> file descriptor would hang indefinitely when the system was under load
> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> >> >> >> >>
> >> >> >> >> The issue is triggered reliably by loading up a system with network
> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
> >> >> >> >> hang until the network traffic load is lowered.
> >> >> >> >>
> >> >> >> >> Digging further, it appears that the hang happens in
> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> >> >> >> >>
> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> >> >> >> >> Attaching 2 probes...
> >> >> >> >> enter
> >> >> >> >> exit after 54 ms
> >> >> >> >> enter
> >> >> >> >> exit after 3249 ms
> >> >> >> >>
> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
> >> >> >> >> and from a loaded system where I stopped the network traffic after a
> >> >> >> >> couple of seconds).
> >> >> >> >>
> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> >> >> >> >>
> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> >> >> >> >>
> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
> >> >> >> >> mutex.
> >> >> >> >>
> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
> >> >> >> >> kernel panic with:
> >> >> >> >>
> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> >> >> >> >>
> >> >> >> >> I've created a reproducer for the issue here:
> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> >> >> >> >>
> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> >> >> >> >>
> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> >> >> >> >> $ cd bpf-examples/bpf-link-hang
> >> >> >> >> $ make
> >> >> >> >> $ ./sudo bpf-link-hang
> >> >> >> >>
> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
> >> >> >> >> from a separate machine to do this.
> >> >> >> >>
> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
> >> >> >> >> fixed so it doesn't deadlock because of this?
> >> >> >> >>
> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
> >> >> >> >
> >> >> >> > Let me start with the usual question... Is the network traffic intense
> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
> >> >> >> > indefinitely?
> >> >> >>
> >> >> >> Yup, I'm pegging all CPUs in softirq:
> >> >> >>
> >> >> >> $ mpstat -P ALL 1
> >> >> >> [...]
> >> >> >> 18:26:52 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> >> >> >> 18:26:53 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> 18:26:53 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> 18:26:53 1 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> 18:26:53 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> 18:26:53 3 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> 18:26:53 4 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> 18:26:53 5 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >>
> >> >> >> > If so, does the (untested, probably does not build) patch below help?
> >> >> >>
> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
> >> >> >>
> >> >> >> Attaching 2 probes...
> >> >> >> enter
> >> >> >> exit after 8480 ms
> >> >> >>
> >> >> >> (that was me interrupting the network traffic again)
> >> >> >
> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
> >> >> > to ksoftirqd under heavy load? If not, my patch will not have any
> >> >> > effect.
> >> >>
> >> >> Seems to be - this is from top:
> >> >>
> >> >> 12 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/0
> >> >> 24 root 20 0 0 0 0 R 99.3 0.0 0:43.62 ksoftirqd/2
> >> >> 34 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/4
> >> >> 39 root 20 0 0 0 0 R 99.3 0.0 0:43.65 ksoftirqd/5
> >> >> 19 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/1
> >> >> 29 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/3
> >> >>
> >> >> Any other ideas? :)
> >> >
> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
> >> > anymore. Please give it a try. It's in bpf tree.
> >>
> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
> >> indeed works better; awesome!
> >>
> >> And sorry for bothering you with this, Paul; guess I should have looked
> >> harder for fixes first... :/
> >
> > Glad it is now working!
> >
> > And in any case, my patch needed an s/true/false/. :-/
> >
> > Hey, I did say "untested"! ;-)
>
> Haha, right, well at least you didn't run afoul of the 'truth in advertising'
> committee ;)
If you get a chance, could you please test the (hopefully) corrected
patch shown below? This issue might affect other use cases.
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b06be5..e21e7b0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -242,6 +242,7 @@ void rcu_softirq_qs(void)
 {
 	rcu_qs();
 	rcu_preempt_deferred_qs(current);
+	rcu_tasks_qs(current, false);
 }
 
 /*
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
2021-03-24 2:41 ` Paul E. McKenney
@ 2021-03-24 11:33 ` Toke Høiland-Jørgensen
2021-03-24 16:11 ` Paul E. McKenney
0 siblings, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-03-24 11:33 UTC (permalink / raw)
To: paulmck; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson
"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>>
>> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
>> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> >>
>> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >>
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >>
>> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >>
>> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> Hi Paul
>> >> >> >> >>
>> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
>> >> >> >> >> file descriptor would hang indefinitely when the system was under load
>> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> >> >> >> >>
>> >> >> >> >> The issue is triggered reliably by loading up a system with network
>> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
>> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
>> >> >> >> >> hang until the network traffic load is lowered.
>> >> >> >> >>
>> >> >> >> >> Digging further, it appears that the hang happens in
>> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> >> >> >> >>
>> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> >> >> >> >> Attaching 2 probes...
>> >> >> >> >> enter
>> >> >> >> >> exit after 54 ms
>> >> >> >> >> enter
>> >> >> >> >> exit after 3249 ms
>> >> >> >> >>
>> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
>> >> >> >> >> and from a loaded system where I stopped the network traffic after a
>> >> >> >> >> couple of seconds).
>> >> >> >> >>
>> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> >> >> >> >>
>> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> >> >> >> >>
>> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
>> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
>> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
>> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
>> >> >> >> >> mutex.
>> >> >> >> >>
>> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
>> >> >> >> >> kernel panic with:
>> >> >> >> >>
>> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> >> >> >> >>
>> >> >> >> >> I've created a reproducer for the issue here:
>> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> >> >> >> >>
>> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> >> >> >> >>
>> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> >> >> >> >> $ cd bpf-examples/bpf-link-hang
>> >> >> >> >> $ make
>> >> >> >> >> $ sudo ./bpf-link-hang
>> >> >> >> >>
>> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
>> >> >> >> >> from a separate machine to do this.
>> >> >> >> >>
>> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
>> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
>> >> >> >> >> fixed so it doesn't deadlock because of this?
>> >> >> >> >>
>> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
>> >> >> >> >
>> >> >> >> > Let me start with the usual question... Is the network traffic intense
>> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
>> >> >> >> > indefinitely?
>> >> >> >>
>> >> >> >> Yup, I'm pegging all CPUs in softirq:
>> >> >> >>
>> >> >> >> $ mpstat -P ALL 1
>> >> >> >> [...]
>> >> >> >> 18:26:52 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> >> >> >> 18:26:53 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> 18:26:53 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> 18:26:53 1 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> 18:26:53 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> 18:26:53 3 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> 18:26:53 4 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> 18:26:53 5 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >>
>> >> >> >> > If so, does the (untested, probably does not build) patch below help?
>> >> >> >>
>> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
>> >> >> >>
>> >> >> >> Attaching 2 probes...
>> >> >> >> enter
>> >> >> >> exit after 8480 ms
>> >> >> >>
>> >> >> >> (that was me interrupting the network traffic again)
>> >> >> >
>> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
>> >> >> > to ksoftirqd under heavy load? If not, my patch will not have any
>> >> >> > effect.
>> >> >>
>> >> >> Seems to be - this is from top:
>> >> >>
>> >> >> 12 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/0
>> >> >> 24 root 20 0 0 0 0 R 99.3 0.0 0:43.62 ksoftirqd/2
>> >> >> 34 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/4
>> >> >> 39 root 20 0 0 0 0 R 99.3 0.0 0:43.65 ksoftirqd/5
>> >> >> 19 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/1
>> >> >> 29 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/3
>> >> >>
>> >> >> Any other ideas? :)
>> >> >
>> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
>> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
>> >> > anymore. Please give it a try. It's in bpf tree.
>> >>
>> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
>> >> indeed works better; awesome!
>> >>
>> >> And sorry for bothering you with this, Paul; guess I should have looked
>> >> harder for fixes first... :/
>> >
>> > Glad it is now working!
>> >
>> > And in any case, my patch needed an s/true/false/. :-/
>> >
>> > Hey, I did say "untested"! ;-)
>>
>> Haha, right, well at least you didn't run afoul of the 'truth in advertising'
>> committee ;)
>
> If you get a chance, could you please test the (hopefully) corrected
> patch shown below? This issue might affect other use cases.
Yup, that does seem to help:
Attaching 2 probes...
enter
exit after 136 ms
-Toke
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
2021-03-24 11:33 ` Toke Høiland-Jørgensen
@ 2021-03-24 16:11 ` Paul E. McKenney
2021-03-24 19:17 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2021-03-24 16:11 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson
On Wed, Mar 24, 2021 at 12:33:47PM +0100, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
>
> > On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >>
> >> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >> >>
> >> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >> >>
> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >>
> >> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >>
> >> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> Hi Paul
> >> >> >> >> >>
> >> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
> >> >> >> >> >> file descriptor would hang indefinitely when the system was under load
> >> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> >> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> >> >> >> >> >>
> >> >> >> >> >> The issue is triggered reliably by loading up a system with network
> >> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
> >> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
> >> >> >> >> >> hang until the network traffic load is lowered.
> >> >> >> >> >>
> >> >> >> >> >> Digging further, it appears that the hang happens in
> >> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> >> >> >> >> >>
> >> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> >> >> >> >> >> Attaching 2 probes...
> >> >> >> >> >> enter
> >> >> >> >> >> exit after 54 ms
> >> >> >> >> >> enter
> >> >> >> >> >> exit after 3249 ms
> >> >> >> >> >>
> >> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
> >> >> >> >> >> and from a loaded system where I stopped the network traffic after a
> >> >> >> >> >> couple of seconds).
> >> >> >> >> >>
> >> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> >> >> >> >> >>
> >> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> >> >> >> >> >>
> >> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
> >> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
> >> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
> >> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
> >> >> >> >> >> mutex.
> >> >> >> >> >>
> >> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
> >> >> >> >> >> kernel panic with:
> >> >> >> >> >>
> >> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> >> >> >> >> >>
> >> >> >> >> >> I've created a reproducer for the issue here:
> >> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> >> >> >> >> >>
> >> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> >> >> >> >> >>
> >> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> >> >> >> >> >> $ cd bpf-examples/bpf-link-hang
> >> >> >> >> >> $ make
> >> >> >> >> >> $ sudo ./bpf-link-hang
> >> >> >> >> >>
> >> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
> >> >> >> >> >> from a separate machine to do this.
> >> >> >> >> >>
> >> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
> >> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> >> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
> >> >> >> >> >> fixed so it doesn't deadlock because of this?
> >> >> >> >> >>
> >> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
> >> >> >> >> >
> >> >> >> >> > Let me start with the usual question... Is the network traffic intense
> >> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
> >> >> >> >> > indefinitely?
> >> >> >> >>
> >> >> >> >> Yup, I'm pegging all CPUs in softirq:
> >> >> >> >>
> >> >> >> >> $ mpstat -P ALL 1
> >> >> >> >> [...]
> >> >> >> >> 18:26:52 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> >> >> >> >> 18:26:53 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> 18:26:53 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> 18:26:53 1 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> 18:26:53 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> 18:26:53 3 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> 18:26:53 4 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> 18:26:53 5 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >>
> >> >> >> >> > If so, does the (untested, probably does not build) patch below help?
> >> >> >> >>
> >> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
> >> >> >> >>
> >> >> >> >> Attaching 2 probes...
> >> >> >> >> enter
> >> >> >> >> exit after 8480 ms
> >> >> >> >>
> >> >> >> >> (that was me interrupting the network traffic again)
> >> >> >> >
> >> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
> >> >> >> > to ksoftirqd under heavy load? If not, my patch will not have any
> >> >> >> > effect.
> >> >> >>
> >> >> >> Seems to be - this is from top:
> >> >> >>
> >> >> >> 12 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/0
> >> >> >> 24 root 20 0 0 0 0 R 99.3 0.0 0:43.62 ksoftirqd/2
> >> >> >> 34 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/4
> >> >> >> 39 root 20 0 0 0 0 R 99.3 0.0 0:43.65 ksoftirqd/5
> >> >> >> 19 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/1
> >> >> >> 29 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/3
> >> >> >>
> >> >> >> Any other ideas? :)
> >> >> >
> >> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
> >> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
> >> >> > anymore. Please give it a try. It's in bpf tree.
> >> >>
> >> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
> >> >> indeed works better; awesome!
> >> >>
> >> >> And sorry for bothering you with this, Paul; guess I should have looked
> >> >> harder for fixes first... :/
> >> >
> >> > Glad it is now working!
> >> >
> >> > And in any case, my patch needed an s/true/false/. :-/
> >> >
> >> > Hey, I did say "untested"! ;-)
> >>
> >> Haha, right, well at least you didn't run afoul of the 'truth in advertising'
> >> committee ;)
> >
> > If you get a chance, could you please test the (hopefully) corrected
> > patch shown below? This issue might affect other use cases.
>
> Yup, that does seem to help:
>
> Attaching 2 probes...
> enter
> exit after 136 ms
Thank you very much! May I please apply your Tested-by?
Thanx, Paul
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
2021-03-24 16:11 ` Paul E. McKenney
@ 2021-03-24 19:17 ` Toke Høiland-Jørgensen
2021-03-25 16:28 ` Paul E. McKenney
0 siblings, 1 reply; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-03-24 19:17 UTC (permalink / raw)
To: paulmck; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson
"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Wed, Mar 24, 2021 at 12:33:47PM +0100, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>>
>> > On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >>
>> >> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> >> >>
>> >> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >> >>
>> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >>
>> >> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >>
>> >> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> Hi Paul
>> >> >> >> >> >>
>> >> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
>> >> >> >> >> >> file descriptor would hang indefinitely when the system was under load
>> >> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> >> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> >> >> >> >> >>
>> >> >> >> >> >> The issue is triggered reliably by loading up a system with network
>> >> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
>> >> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
>> >> >> >> >> >> hang until the network traffic load is lowered.
>> >> >> >> >> >>
>> >> >> >> >> >> Digging further, it appears that the hang happens in
>> >> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> >> >> >> >> >>
>> >> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> >> >> >> >> >> Attaching 2 probes...
>> >> >> >> >> >> enter
>> >> >> >> >> >> exit after 54 ms
>> >> >> >> >> >> enter
>> >> >> >> >> >> exit after 3249 ms
>> >> >> >> >> >>
>> >> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
>> >> >> >> >> >> and from a loaded system where I stopped the network traffic after a
>> >> >> >> >> >> couple of seconds).
>> >> >> >> >> >>
>> >> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> >> >> >> >> >>
>> >> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> >> >> >> >> >>
>> >> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
>> >> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
>> >> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
>> >> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
>> >> >> >> >> >> mutex.
>> >> >> >> >> >>
>> >> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
>> >> >> >> >> >> kernel panic with:
>> >> >> >> >> >>
>> >> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> >> >> >> >> >>
>> >> >> >> >> >> I've created a reproducer for the issue here:
>> >> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> >> >> >> >> >>
>> >> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> >> >> >> >> >>
>> >> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> >> >> >> >> >> $ cd bpf-examples/bpf-link-hang
>> >> >> >> >> >> $ make
>> >> >> >> >> >> $ sudo ./bpf-link-hang
>> >> >> >> >> >>
>> >> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
>> >> >> >> >> >> from a separate machine to do this.
>> >> >> >> >> >>
>> >> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
>> >> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> >> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
>> >> >> >> >> >> fixed so it doesn't deadlock because of this?
>> >> >> >> >> >>
>> >> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
>> >> >> >> >> >
>> >> >> >> >> > Let me start with the usual question... Is the network traffic intense
>> >> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
>> >> >> >> >> > indefinitely?
>> >> >> >> >>
>> >> >> >> >> Yup, I'm pegging all CPUs in softirq:
>> >> >> >> >>
>> >> >> >> >> $ mpstat -P ALL 1
>> >> >> >> >> [...]
>> >> >> >> >> 18:26:52 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> >> >> >> >> 18:26:53 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> 18:26:53 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> 18:26:53 1 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> 18:26:53 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> 18:26:53 3 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> 18:26:53 4 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> 18:26:53 5 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >>
>> >> >> >> >> > If so, does the (untested, probably does not build) patch below help?
>> >> >> >> >>
>> >> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
>> >> >> >> >>
>> >> >> >> >> Attaching 2 probes...
>> >> >> >> >> enter
>> >> >> >> >> exit after 8480 ms
>> >> >> >> >>
>> >> >> >> >> (that was me interrupting the network traffic again)
>> >> >> >> >
>> >> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
>> >> >> >> > to ksoftirqd under heavy load? If not, my patch will not have any
>> >> >> >> > effect.
>> >> >> >>
>> >> >> >> Seems to be - this is from top:
>> >> >> >>
>> >> >> >> 12 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/0
>> >> >> >> 24 root 20 0 0 0 0 R 99.3 0.0 0:43.62 ksoftirqd/2
>> >> >> >> 34 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/4
>> >> >> >> 39 root 20 0 0 0 0 R 99.3 0.0 0:43.65 ksoftirqd/5
>> >> >> >> 19 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/1
>> >> >> >> 29 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/3
>> >> >> >>
>> >> >> >> Any other ideas? :)
>> >> >> >
>> >> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
>> >> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
>> >> >> > anymore. Please give it a try. It's in bpf tree.
>> >> >>
>> >> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
>> >> >> indeed works better; awesome!
>> >> >>
>> >> >> And sorry for bothering you with this, Paul; guess I should have looked
>> >> >> harder for fixes first... :/
>> >> >
>> >> > Glad it is now working!
>> >> >
>> >> > And in any case, my patch needed an s/true/false/. :-/
>> >> >
>> >> > Hey, I did say "untested"! ;-)
>> >>
>> >> >> Haha, right, well at least you didn't run afoul of the 'truth in advertising'
>> >> committee ;)
>> >
>> > If you get a chance, could you please test the (hopefully) corrected
>> > patch shown below? This issue might affect other use cases.
>>
>> Yup, that does seem to help:
>>
>> Attaching 2 probes...
>> enter
>> exit after 136 ms
>
> Thank you very much! May I please apply your Tested-by?
Sure!
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
2021-03-24 19:17 ` Toke Høiland-Jørgensen
@ 2021-03-25 16:28 ` Paul E. McKenney
2021-03-25 21:13 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2021-03-25 16:28 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson
On Wed, Mar 24, 2021 at 08:17:35PM +0100, Toke Høiland-Jørgensen wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
>
> > On Wed, Mar 24, 2021 at 12:33:47PM +0100, Toke Høiland-Jørgensen wrote:
> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >>
> >> > On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >>
> >> >> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >> >> >>
> >> >> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >> >> >>
> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >>
> >> >> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
> >> >> >> >> >> >> Hi Paul
> >> >> >> >> >> >>
> >> >> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
> >> >> >> >> >> >> file descriptor would hang indefinitely when the system was under load
> >> >> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
> >> >> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
> >> >> >> >> >> >>
> >> >> >> >> >> >> The issue is triggered reliably by loading up a system with network
> >> >> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
> >> >> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
> >> >> >> >> >> >> hang until the network traffic load is lowered.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Digging further, it appears that the hang happens in
> >> >> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
> >> >> >> >> >> >>
> >> >> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
> >> >> >> >> >> >> Attaching 2 probes...
> >> >> >> >> >> >> enter
> >> >> >> >> >> >> exit after 54 ms
> >> >> >> >> >> >> enter
> >> >> >> >> >> >> exit after 3249 ms
> >> >> >> >> >> >>
> >> >> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
> >> >> >> >> >> >> and from a loaded system where I stopped the network traffic after a
> >> >> >> >> >> >> couple of seconds).
> >> >> >> >> >> >>
> >> >> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
> >> >> >> >> >> >>
> >> >> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
> >> >> >> >> >> >>
> >> >> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
> >> >> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
> >> >> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
> >> >> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
> >> >> >> >> >> >> mutex.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
> >> >> >> >> >> >> kernel panic with:
> >> >> >> >> >> >>
> >> >> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
> >> >> >> >> >> >>
> >> >> >> >> >> >> I've created a reproducer for the issue here:
> >> >> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
> >> >> >> >> >> >>
> >> >> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
> >> >> >> >> >> >>
> >> >> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
> >> >> >> >> >> >> $ cd bpf-examples/bpf-link-hang
> >> >> >> >> >> >> $ make
> >> >> >> >> >> >> $ sudo ./bpf-link-hang
> >> >> >> >> >> >>
> >> >> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
> >> >> >> >> >> >> from a separate machine to do this.
> >> >> >> >> >> >>
> >> >> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
> >> >> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
> >> >> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
> >> >> >> >> >> >> fixed so it doesn't deadlock because of this?
> >> >> >> >> >> >>
> >> >> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
> >> >> >> >> >> >
> >> >> >> >> >> > Let me start with the usual question... Is the network traffic intense
> >> >> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
> >> >> >> >> >> > indefinitely?
> >> >> >> >> >>
> >> >> >> >> >> Yup, I'm pegging all CPUs in softirq:
> >> >> >> >> >>
> >> >> >> >> >> $ mpstat -P ALL 1
> >> >> >> >> >> [...]
> >> >> >> >> >> 18:26:52 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> >> >> >> >> >> 18:26:53 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> >> 18:26:53 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> >> 18:26:53 1 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> >> 18:26:53 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> >> 18:26:53 3 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> >> 18:26:53 4 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> >> 18:26:53 5 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
> >> >> >> >> >>
> >> >> >> >> >> > If so, does the (untested, probably does not build) patch below help?
> >> >> >> >> >>
> >> >> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
> >> >> >> >> >>
> >> >> >> >> >> Attaching 2 probes...
> >> >> >> >> >> enter
> >> >> >> >> >> exit after 8480 ms
> >> >> >> >> >>
> >> >> >> >> >> (that was me interrupting the network traffic again)
> >> >> >> >> >
> >> >> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
> >> >> >> >> > to ksoftirqd under heavy load? If not, my patch will not have any
> >> >> >> >> > effect.
> >> >> >> >>
> >> >> >> >> Seems to be - this is from top:
> >> >> >> >>
> >> >> >> >> 12 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/0
> >> >> >> >> 24 root 20 0 0 0 0 R 99.3 0.0 0:43.62 ksoftirqd/2
> >> >> >> >> 34 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/4
> >> >> >> >> 39 root 20 0 0 0 0 R 99.3 0.0 0:43.65 ksoftirqd/5
> >> >> >> >> 19 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/1
> >> >> >> >> 29 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/3
> >> >> >> >>
> >> >> >> >> Any other ideas? :)
> >> >> >> >
> >> >> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
> >> >> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
> >> >> >> > anymore. Please give it a try. It's in bpf tree.
> >> >> >>
> >> >> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
> >> >> >> indeed works better; awesome!
> >> >> >>
> >> >> >> And sorry for bothering you with this, Paul; guess I should have looked
> >> >> >> harder for fixes first... :/
> >> >> >
> >> >> > Glad it is now working!
> >> >> >
> >> >> > And in any case, my patch needed an s/true/false/. :-/
> >> >> >
> >> >> > Hey, I did say "untested"! ;-)
> >> >>
> >> >> >> Haha, right, well at least you didn't run afoul of the 'truth in advertising'
> >> >> committee ;)
> >> >
> >> > If you get a chance, could you please test the (hopefully) corrected
> >> > patch shown below? This issue might affect other use cases.
> >>
> >> Yup, that does seem to help:
> >>
> >> Attaching 2 probes...
> >> enter
> >> exit after 136 ms
> >
> > Thank you very much! May I please apply your Tested-by?
>
> Sure!
>
> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Applied, and thank you!
Thanx, Paul
------------------------------------------------------------------------
commit 1a0dfc099c1e61e22045705265a1323ac294bea6
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Wed Mar 24 17:08:48 2021 -0700
rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
Heavy networking load can cause a CPU to execute continuously and
indefinitely within ksoftirqd, in which case there will be no voluntary
task switches and thus no RCU-tasks quiescent states. This commit
therefore causes the existing rcu_softirq_qs() to provide an RCU-tasks
quiescent state.
This of course means that __do_softirq() and its callers cannot be
invoked from within a tracing trampoline.
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b06be5..9d7cb74 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -242,6 +242,7 @@ void rcu_softirq_qs(void)
 {
 	rcu_qs();
 	rcu_preempt_deferred_qs(current);
+	rcu_tasks_qs(current, false);
 }

 /*
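[Editor's note] The mechanism the commit message describes can be sketched as a toy model (pure Python illustration with made-up names, not kernel code): an RCU Tasks grace period completes only once every task has reported a quiescent state, so a CPU that never voluntarily leaves ksoftirqd stalls the grace period indefinitely unless rcu_softirq_qs() reports one on its behalf, which is exactly what the one-line patch adds.

```python
# Toy model of an RCU Tasks grace period. All names here are invented
# for illustration; none of this is actual kernel code or kernel API.

def grace_period_done(tasks):
    """synchronize_rcu_tasks() can return only when every task
    has passed through a quiescent state."""
    return all(t["reported_qs"] for t in tasks)

def softirq_loop(task, patched):
    """Model of a CPU stuck processing softirqs in ksoftirqd.
    It never sleeps voluntarily, so it only reports a tasks-RCU
    quiescent state if rcu_softirq_qs() does so (the patched case)."""
    if patched:
        task["reported_qs"] = True

tasks = [
    {"name": "ksoftirqd/0", "reported_qs": False},  # pegged at 100% softirq
    {"name": "shell",       "reported_qs": True},   # slept voluntarily
]

softirq_loop(tasks[0], patched=False)
print(grace_period_done(tasks))  # unpatched: grace period never ends

softirq_loop(tasks[0], patched=True)
print(grace_period_done(tasks))  # patched: grace period can complete
```

The unpatched run prints False (the close() hang observed in the thread); the patched run prints True.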
* Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels
2021-03-25 16:28 ` Paul E. McKenney
@ 2021-03-25 21:13 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 14+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-03-25 21:13 UTC (permalink / raw)
To: paulmck; +Cc: Andrii Nakryiko, bpf, Magnus Karlsson
"Paul E. McKenney" <paulmck@kernel.org> writes:
> On Wed, Mar 24, 2021 at 08:17:35PM +0100, Toke Høiland-Jørgensen wrote:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>>
>> > On Wed, Mar 24, 2021 at 12:33:47PM +0100, Toke Høiland-Jørgensen wrote:
>> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >>
>> >> > On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >>
>> >> >> > On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> >> >> >>
>> >> >> >> > On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >> >> >>
>> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >>
>> >> >> >> >> > On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> >> >> >> >> >>
>> >> >> >> >> >> > On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:
>> >> >> >> >> >> >> Hi Paul
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Magnus and I have been debugging an issue where close() on a bpf_link
>> >> >> >> >> >> >> file descriptor would hang indefinitely when the system was under load
>> >> >> >> >> >> >> on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
>> >> >> >> >> >> >> to synchronize_rcu_tasks(), so I'm hoping you can help us with it.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> The issue is triggered reliably by loading up a system with network
>> >> >> >> >> >> >> traffic (causing 100% softirq CPU load on one or more cores), and then
>> >> >> >> >> >> >> attaching an freplace bpf_link and closing it again. The close() will
>> >> >> >> >> >> >> hang until the network traffic load is lowered.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Digging further, it appears that the hang happens in
>> >> >> >> >> >> >> synchronize_rcu_tasks(), as seen by running a bpftrace script like:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
>> >> >> >> >> >> >> Attaching 2 probes...
>> >> >> >> >> >> >> enter
>> >> >> >> >> >> >> exit after 54 ms
>> >> >> >> >> >> >> enter
>> >> >> >> >> >> >> exit after 3249 ms
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> (the two enter/exit pairs are, respectively, from an unloaded system,
>> >> >> >> >> >> >> and from a loaded system where I stopped the network traffic after a
>> >> >> >> >> >> >> couple of seconds).
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> And because it does this while holding trampoline_mutex, even deferring
>> >> >> >> >> >> >> the put to a worker (as a previously applied-then-reverted patch did[0])
>> >> >> >> >> >> >> doesn't help: that'll fix the initial hang on close(), but any
>> >> >> >> >> >> >> subsequent use of BPF trampolines will then be blocked because of the
>> >> >> >> >> >> >> mutex.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Also, if I just keep the network traffic running I will eventually get a
>> >> >> >> >> >> >> kernel panic with:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I've created a reproducer for the issue here:
>> >> >> >> >> >> >> https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> To compile simply do this (needs a recent llvm/clang for compiling the BPF program):
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> $ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
>> >> >> >> >> >> >> $ cd bpf-examples/bpf-link-hang
>> >> >> >> >> >> >> $ make
>> >> >> >> >> >> >> $ sudo ./bpf-link-hang
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> you'll need to load up the system to trigger the hang; I'm using pktgen
>> >> >> >> >> >> >> from a separate machine to do this.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> My question is, of course, as ever, What Is To Be Done? Is it expected
>> >> >> >> >> >> >> that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
>> >> >> >> >> >> >> or can this be fixed? And if it is expected, how can the BPF code be
>> >> >> >> >> >> >> fixed so it doesn't deadlock because of this?
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Hoping you can help us with this - many thanks in advance! :)
>> >> >> >> >> >> >
>> >> >> >> >> >> > Let me start with the usual question... Is the network traffic intense
>> >> >> >> >> >> > enough that one of the CPUs might remain in a loop handling softirqs
>> >> >> >> >> >> > indefinitely?
>> >> >> >> >> >>
>> >> >> >> >> >> Yup, I'm pegging all CPUs in softirq:
>> >> >> >> >> >>
>> >> >> >> >> >> $ mpstat -P ALL 1
>> >> >> >> >> >> [...]
>> >> >> >> >> >> 18:26:52 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> >> >> >> >> >> 18:26:53 all 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> >> 18:26:53 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> >> 18:26:53 1 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> >> 18:26:53 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> >> 18:26:53 3 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> >> 18:26:53 4 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> >> 18:26:53 5 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
>> >> >> >> >> >>
>> >> >> >> >> >> > If so, does the (untested, probably does not build) patch below help?
>> >> >> >> >> >>
>> >> >> >> >> >> Doesn't appear to, no. It builds fine, but I still get:
>> >> >> >> >> >>
>> >> >> >> >> >> Attaching 2 probes...
>> >> >> >> >> >> enter
>> >> >> >> >> >> exit after 8480 ms
>> >> >> >> >> >>
>> >> >> >> >> >> (that was me interrupting the network traffic again)
>> >> >> >> >> >
>> >> >> >> >> > Is your kernel properly shifting from back-of-interrupt softirq processing
>> >> >> >> >> > to ksoftirqd under heavy load? If not, my patch will not have any
>> >> >> >> >> > effect.
>> >> >> >> >>
>> >> >> >> >> Seems to be - this is from top:
>> >> >> >> >>
>> >> >> >> >> 12 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/0
>> >> >> >> >> 24 root 20 0 0 0 0 R 99.3 0.0 0:43.62 ksoftirqd/2
>> >> >> >> >> 34 root 20 0 0 0 0 R 99.3 0.0 0:43.64 ksoftirqd/4
>> >> >> >> >> 39 root 20 0 0 0 0 R 99.3 0.0 0:43.65 ksoftirqd/5
>> >> >> >> >> 19 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/1
>> >> >> >> >> 29 root 20 0 0 0 0 R 99.0 0.0 0:43.63 ksoftirqd/3
>> >> >> >> >>
>> >> >> >> >> Any other ideas? :)
>> >> >> >> >
>> >> >> >> > bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
>> >> >> >> > Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
>> >> >> >> > anymore. Please give it a try. It's in bpf tree.
>> >> >> >>
>> >> >> >> Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
>> >> >> >> indeed works better; awesome!
>> >> >> >>
>> >> >> >> And sorry for bothering you with this, Paul; guess I should have looked
>> >> >> >> harder for fixes first... :/
>> >> >> >
>> >> >> > Glad it is now working!
>> >> >> >
>> >> >> > And in any case, my patch needed an s/true/false/. :-/
>> >> >> >
>> >> >> > Hey, I did say "untested"! ;-)
>> >> >>
>> >> >> >> Haha, right, well at least you didn't run afoul of the 'truth in advertising'
>> >> >> committee ;)
>> >> >
>> >> > If you get a chance, could you please test the (hopefully) corrected
>> >> > patch shown below? This issue might affect other use cases.
>> >>
>> >> Yup, that does seem to help:
>> >>
>> >> Attaching 2 probes...
>> >> enter
>> >> exit after 136 ms
>> >
>> > Thank you very much! May I please apply your Tested-by?
>>
>> Sure!
>>
>> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
>
> Applied, and thank you!
Awesome! You're welcome, and thank you for the fix (and the quick turnaround)! :)
-Toke
end of thread, other threads:[~2021-03-25 21:14 UTC | newest]
Thread overview: 14+ messages
[not found] <877dly6ooz.fsf@toke.dk>
2021-03-23 16:43 ` BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels Paul E. McKenney
2021-03-23 17:29 ` Toke Høiland-Jørgensen
2021-03-23 17:57 ` Paul E. McKenney
2021-03-23 19:50 ` Toke Høiland-Jørgensen
2021-03-23 19:59 ` Andrii Nakryiko
2021-03-23 21:04 ` Toke Høiland-Jørgensen
2021-03-23 21:52 ` Paul E. McKenney
2021-03-23 22:06 ` Toke Høiland-Jørgensen
2021-03-24 2:41 ` Paul E. McKenney
2021-03-24 11:33 ` Toke Høiland-Jørgensen
2021-03-24 16:11 ` Paul E. McKenney
2021-03-24 19:17 ` Toke Høiland-Jørgensen
2021-03-25 16:28 ` Paul E. McKenney
2021-03-25 21:13 ` Toke Høiland-Jørgensen