deadlock bug related to bpf,audit subsystems

From: Serhei Makarov <smakarov@redhat.com>
To: linux-audit@redhat.com, bpf@vger.kernel.org
Cc: Jerome Marchand <jmarchan@redhat.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	ast@kernel.org, Serhei Makarov <smakarov@redhat.com>,
	Frank Eigler <fche@redhat.com>, Jiri Olsa <jolsa@redhat.com>,
	guro@fb.com
Subject: deadlock bug related to bpf,audit subsystems
Date: Thu, 18 Mar 2021 10:43:00 -0400	[thread overview]
Message-ID: <CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com> (raw)
In-Reply-To: <YFM+Ijeu4bN4IzH1@krava>

Moving this discussion to kernel mailing lists.

Problem description:

Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
bpf_probe_read_compat call within a sched_switch tracepoint. The
problem is reproducible with the reg_alloc3 testcase from SystemTap's
BPF backend testsuite on x86_64 as well as the runqlat,runqslower
tools from bcc on ppc64le. Example stack trace from [1]:

[  730.868702] stack backtrace:
[  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted
5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
[  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 1.13.0-2.fc32 04/01/2014
[  730.873278] Call Trace:
[  730.873770]  dump_stack+0x7f/0xa1
[  730.874433]  check_noncircular+0xdf/0x100
[  730.875232]  __lock_acquire+0x1202/0x1e10
[  730.876031]  ? __lock_acquire+0xfc0/0x1e10
[  730.876844]  lock_acquire+0xc2/0x3a0
[  730.877551]  ? __wake_up_common_lock+0x52/0x90
[  730.878434]  ? lock_acquire+0xc2/0x3a0
[  730.879186]  ? lock_is_held_type+0xa7/0x120
[  730.880044]  ? skb_queue_tail+0x1b/0x50
[  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
[  730.881656]  ? __wake_up_common_lock+0x52/0x90
[  730.882532]  __wake_up_common_lock+0x52/0x90
[  730.883375]  audit_log_end+0x5b/0x100
[  730.884104]  slow_avc_audit+0x69/0x90
[  730.884836]  avc_has_perm+0x8b/0xb0
[  730.885532]  selinux_lockdown+0xa5/0xd0
[  730.886297]  security_locked_down+0x20/0x40
[  730.887133]  bpf_probe_read_compat+0x66/0xd0
[  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
[  730.888917]  trace_call_bpf+0xe9/0x240
[  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
[  730.890579]  perf_trace_sched_switch+0x142/0x180
[  730.891485]  ? __schedule+0x6d8/0xb20
[  730.892209]  __schedule+0x6d8/0xb20
[  730.892899]  schedule+0x5b/0xc0
[  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
[  730.894457]  syscall_exit_to_user_mode+0x27/0x70
[  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Jiri Olsa also reports seeing a similar deadlock at v5.10. I'm in the
middle of double-checking my bisection which ended up at a
seemingly-unrelated commit [2]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1938312
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.11-rc7&id=2dcb3964544177c51853a210b6ad400de78ef17d

Reasonable amount of context below:

On Thu, Mar 18, 2021 at 7:48 AM Jiri Olsa <jolsa@redhat.com> wrote:
> > In that case the issue is in the selinux / audit department, not on bpf side.
> >
> > To be honest, I'm actually puzzled that from bpf_probe_read_*() we end up sending
> > audit messages, this seems highly questionable given those BPF helpers are used in
> > performance critical code, and they can be called from any contexts. So going and
> > allocating an skb for audit is just completely wrong. It should probably be at min
> > avc_has_perm_noaudit() if anything ...
>
> I just noticed this discussion is not on the list ;-)
> let's bring it there and include some audit folks

Yes, my apologies. This started as a quick note from me to Daniel to
glance at the RHBZ and the cc:s gradually snowballed from there.

- Serhei

> jirka
>
> >
> > > ----
> > > [   56.866377] =============================
> > > [   56.866397] [ BUG: Invalid wait context ]
> > > [   56.866407] 5.11.0 #4 Tainted: G            E
> > > [   56.866420] -----------------------------
> > > [   56.866438] swapper/69/0 is trying to lock:
> > > [   56.866458] c000000002120038 (notif_lock){....}-{3:3}, at: avc_compute_av.isra.0+0x14c/0x430
> > > [   56.866508] other info that might help us debug this:
> > > [   56.866528] context-{2:2}
> > > [   56.866545] 3 locks held by swapper/69/0:
> > > [   56.866566]  #0: c000001fff1f7a98 (&rq->lock){-.-.}-{2:2}, at: sched_ttwu_pending+0x5c/0x1e0
> > > [   56.866613]  #1: c00000000208b9d8 (rcu_read_lock){....}-{1:3}, at: bpf_trace_run1+0x8/0x240
> > > [   56.866659]  #2: c00000000208b9d8 (rcu_read_lock){....}-{1:3}, at: avc_compute_av.isra.0+0x7c/0x430
> > > [   56.866704] stack backtrace:
> > > [   56.866724] CPU: 69 PID: 0 Comm: swapper/69 Tainted: G            E     5.11.0 #4
> > > [   56.866761] Call Trace:
> > > [   56.866778] [c0000000109fb310] [c000000000a42784] dump_stack+0xe8/0x144 (unreliable)
> > > [   56.866817] [c0000000109fb360] [c0000000001f02a0] __lock_acquire+0xaa0/0x2800
> > > [   56.866857] [c0000000109fb490] [c0000000001f2b40] lock_acquire.part.0+0xc0/0x390
> > > [   56.866885] [c0000000109fb570] [c00000000118af0c] _raw_spin_lock_irqsave+0x6c/0xc0
> > > [   56.866923] [c0000000109fb5b0] [c00000000089cc4c] avc_compute_av.isra.0+0x14c/0x430
> > > [   56.866961] [c0000000109fb670] [c00000000089e0a0] avc_has_perm+0x2c0/0x300
> > > [   56.866997] [c0000000109fb780] [c0000000008a7d34] selinux_lockdown+0xd4/0x100
> > > [   56.867034] [c0000000109fb810] [c000000000891140] security_locked_down+0x50/0xb0
> > > [   56.867086] [c0000000109fb840] [c000000000346b7c] bpf_probe_read_compat+0xbc/0x130
> > > [   56.867125] [c0000000109fb880] [c00800000e63bd38] bpf_prog_3de2db9929262fab_raw_tracepoint__sched_wakeup+0x5c/0x4324
> > > [   56.867167] [c0000000109fb8f0] [c000000000349784] bpf_trace_run1+0xe4/0x240
> > > [   56.867204] [c0000000109fb940] [c00000000018f238] __bpf_trace_sched_wakeup_template+0x18/0x30
> > > [   56.867243] [c0000000109fb960] [c000000000190834] trace_sched_wakeup+0xe4/0x200
> > > [   56.867281] [c0000000109fb9a0] [c0000000001983bc] ttwu_do_wakeup+0x4c/0x1f0
> > > [   56.867317] [c0000000109fba20] [c00000000019c190] sched_ttwu_pending+0x120/0x1e0
> > > [   56.867355] [c0000000109fbac0] [c00000000026cd6c] flush_smp_call_function_queue+0x1bc/0x3c0
> > > [   56.867397] [c0000000109fbb50] [c000000000059fd4] smp_ipi_demux_relaxed+0xf4/0x100
> > > [   56.867436] [c0000000109fbb90] [c0000000000537fc] doorbell_exception+0xbc/0x370
> > > [   56.867474] [c0000000109fbbd0] [c0000000000168d4] replay_soft_interrupts+0x1f4/0x2d0
> > > [   56.867512] [c0000000109fbdb0] [c000000000016a20] arch_local_irq_restore+0x70/0xe0
> > > [   56.867550] [c0000000109fbde0] [c000000000df9d34] cpuidle_enter_state+0x124/0x500
> > > [   56.867587] [c0000000109fbe40] [c000000000dfa1ac] cpuidle_enter+0x4c/0x70
> > > [   56.867613] [c0000000109fbe80] [c0000000001a5dc8] do_idle+0x338/0x450
> > > [   56.867649] [c0000000109fbf10] [c0000000001a62bc] cpu_startup_entry+0x3c/0x40
> > > [   56.867686] [c0000000109fbf40] [c00000000005ac34] start_secondary+0x2a4/0x2b0
> > > [   56.867727] [c0000000109fbf90] [c00000000000c054] start_secondary_prolog+0x10/0x14
> > >
> >
>

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit