* [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist
@ 2020-02-14 13:39 Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines Thomas Gleixner
                   ` (20 more replies)
  0 siblings, 21 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

Hi!

This is a follow up to the initial patch series which David posted a while
ago:

 https://lore.kernel.org/bpf/20191207.160357.828344895192682546.davem@davemloft.net/

which was (while non-functional on RT) a good starting point for further
investigations.

PREEMPT_RT aims to make the kernel fully preemptible except for a few low
level operations which have to be unpreemptible, like scheduler, low level
entry/exit code, low level interrupt handling, locking internals etc.

This is achieved by the following main transformations (there are some
more minor ones, but they don't matter in this context):

   1) All interrupts which are not explicitly marked IRQF_NO_THREAD are
      forced into threaded interrupt handlers. This applies to almost all
      device interrupt handlers.

   2) hrtimers which are not explicitly marked to expire in hard interrupt
      context are expired in soft interrupt context.

   3) Soft interrupts are also forced into thread context. They run either
      in the context of a threaded interrupt handler (similar to the !RT
      mode of running on return from interrupt), in ksoftirqd or in a
      thread context which raised a soft interrupt (same as in !RT).

      Soft interrupts (BH) are serialized per CPU so the usual bottom half
      protections still work.

   4) spinlocks and rwlocks are substituted with 'sleeping spin/rwlocks'.
      The internal representation is a priority inheritance aware rtmutex
      which has glue code attached to preserve the spin/rwlock semantics
      vs. task state. They neither disable preemption nor interrupts, which
      is fine as the interrupt handler they want to be protected against is
      forced into thread context.

      The true hard interrupt handlers do not affect these sections as
      they are completely independent.

      The code paths which need to be atomic are using raw_spinlocks which
      disable preemption (and/or interrupts).

      On a non RT kernel spinlocks are mapped to raw_spinlocks so everything
      works as usual.

As a consequence allocation of memory is not possible from truly atomic
contexts on RT, even with GFP_ATOMIC set. This is required because the slab
allocator can run into a situation even with GFP_ATOMIC where it needs to
call into the page allocator, which in turn takes regular spinlocks. As the
RT substitution of regular spinlocks might sleep, it's obvious that memory
allocations from truly atomic contexts on RT are not permitted. The page
allocator locks cannot be converted to raw locks as the length of the
resulting preempt/interrupt disabled sections is way above the tolerance
level of demanding realtime applications.
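
A condensed (purely illustrative, not from the patches) example of the
constraint; 'some_lock', 'some_raw_lock' and 'p' are just placeholders:

	/*
	 * Truly atomic on RT as well because raw_spin_lock() disables
	 * preemption. kmalloc(GFP_ATOMIC) can still end up in the page
	 * allocator, which takes regular spinlocks that sleep on RT.
	 */
	raw_spin_lock(&some_raw_lock);
	p = kmalloc(size, GFP_ATOMIC);		/* Not allowed on RT */
	raw_spin_unlock(&some_raw_lock);

	/*
	 * Fine on RT: the substituted 'sleeping' spinlock keeps the
	 * section preemptible, so the allocation can take the page
	 * allocator locks if needed.
	 */
	spin_lock(&some_lock);
	p = kmalloc(size, GFP_ATOMIC);		/* Works on RT and !RT */
	spin_unlock(&some_lock);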

So this leads to a conflict with the current BPF implementation. BPF
disables preemption around:

  1) The invocation of BPF programs. This is required to guarantee
     that the program runs to completion on one CPU.

  2) The invocation of map operations from sys_bpf(). This is required
     to protect the operation against BPF programs which access the
     same map from perf, kprobes or tracing context because the syscall
     operation has to lock the hashtab bucket lock.

     If the perf NMI, kprobe or tracepoint happens inside the bucket lock
     held region and the BPF program needs to access the same bucket then
     the system would deadlock.

     The mechanism used for this is to increment the per CPU recursion
     protection which prevents such programs from being executed. The same
     per CPU variable is used to prevent recursion of the programs
     themselves, which might happen when e.g. a kprobe attached BPF program
     triggers another kprobe attached BPF program. A condensed sketch of
     the pattern follows below.
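
     The sketch (error handling omitted; the same pattern shows up in the
     syscall.c hunks later in this series):

	preempt_disable();			/* Pin the task to this CPU */
	__this_cpu_inc(bpf_prog_active);	/* Fend off BPF programs */

	/* ... lock the hashtab bucket and update/delete the element ... */

	__this_cpu_dec(bpf_prog_active);
	preempt_enable();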

In principle this works on RT as well, as long as the BPF programs use
preallocated maps, which is required for trace type programs to prevent
deadlocks of all sorts. Other BPF programs which run in softirq context,
e.g. packet filtering, or in thread context, e.g. syscall auditing, have no
requirement for preallocated maps. They can allocate memory when required.

But as explained above on RT memory allocation from truly atomic context is
not permitted, which requires a few tweaks to the BPF code.

The initial misunderstanding on my side was that the preempt disable around
the invocations of BPF programs is not only required to prevent migration
to a different CPU (in case the program is invoked from preemptible
context) but is also required to prevent reentrancy from a preempting
task. In hindsight I should have figured out myself that this is not the
case because the same program can run concurrently on a different CPU or
from a different context (e.g. interrupt). Alexei thankfully enlightened me
recently over a beer that the real intent here is to guarantee that the
program runs to completion on the same CPU where it started. The same is
true for the syscall side as this just has to guarantee the per CPU
recursion protection.

This part of the problem is trivial. RT provides a true migrate_disable()
mechanism which does not disable preemption. So the preempt_disable /
enable() pairs can be replaced with migrate_disable / enable() pairs.
migrate_disable / enable() maps to preempt_disable / enable() on a
not-RT kernel, so there is no functional change when RT is disabled.

But there is another issue which is not as straightforward to solve:

The map operations which deal with elements in hashtab buckets (or other
map mechanisms) have spinlocks to protect them against concurrent access
from other CPUs. These spinlocks are raw_spinlocks, so they disable
preemption and interrupts (_irqsave()). Not a problem per se, but some of
these code paths invoke memory allocations with a bucket lock held. This
obviously conflicts with the RT semantics.

The easy way out would be to convert these locks to regular spinlocks which
can be substituted by RT. Works like a charm for both RT and !RT for BPF
programs which run in thread context (includes soft interrupt and device
interrupt handler context on RT).

But there are also the BPF programs which run in truly atomic context even
on a RT kernel, i.e. tracing types (perf, perf NMI, kprobes and trace).

For obvious reasons these contexts cannot take regular spinlocks on RT
because RT substitutes them with sleeping spinlocks. might_sleep() splats
from NMI context are not really desired and on lock contention the
scheduler explodes in colourful ways. Same problem for kprobes and
tracepoints. So these program types need a raw spinlock which brings us
back to square one.

But there is an important detail that comes to the rescue. The trace type programs
require preallocated maps to prevent deadlocks of all sorts in the memory
allocator. Preallocated maps never call into the memory allocator or other
code paths which might sleep on a RT kernel. The allocation type is known
when the program and the map are initialized.

This allows differentiating between lock types for preallocated and
run-time allocated maps. While not pretty, the proposed solution is to have
a lock union in the bucket:

	union {
		raw_spinlock_t	raw_lock;
		spinlock_t	lock;
	};

and have init/lock/unlock helpers which handle the lock type depending on
the allocation mechanism, i.e. for preallocated maps it uses the raw lock
and for dynamic allocations it uses the regular spinlock. The locks in the
percpu_freelist need to stay raw as well, as they nest into the bucket lock
held section, which works for both the raw and the regular spinlock
variant.
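
A minimal sketch of how such a lock helper could look (the helper name
htab_use_raw_lock() and the details are assumptions here; the real thing is
in the hashtab patches at the end of the series):

	static inline unsigned long htab_lock_bucket(const struct bpf_htab *htab,
						     struct bucket *b)
	{
		unsigned long flags;

		if (htab_use_raw_lock(htab))	/* Preallocated map */
			raw_spin_lock_irqsave(&b->raw_lock, flags);
		else				/* Run-time allocated map */
			spin_lock_irqsave(&b->lock, flags);
		return flags;
	}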

I'm not proud of that, but I really couldn't come up with anything better
aside from completely splitting the code paths, which would be even worse
due to the resulting code duplication.

The lock in the LPM trie map needs to become a regular spinlock as well,
as the trie map is based on dynamic allocation. But again this is not a
problem as this map cannot be used for the critical types anyway.

I kept the LRU and the stack map locks raw for now as they do not use
dynamic allocations, but I haven't done any testing on latency impact
yet. The stack map critical sections are truly short, so they should not
matter at all, but the LRU ones could. That can be revisited once we have
numbers. I assume that LRU is not the right choice for the trace type
programs anyway, but who knows.

The last issue, which is trivial to solve (Dave already solved it), is the
non-owner release of mmap_sem, which is forbidden on RT as well. So this
just forces the IP fallback path.

On a non RT kernel this does not change anything. The compiler optimizes
the lock magic out and everything maps to the state before these changes.

This survives the selftests and a bunch of different invocations of
bpftrace on mainline and on a backport to 5.4-rt.

But of course I surely have missed some details here and there, so please
have a close look at this.

The diffstat of hashtab.c is so large because I sat down and documented the
above write up in condensed form in a comment on top of the file as I
wanted to spare others the dubious experience of having to reverse engineer
the inner workings of BPF.

Thanks,

	tglx

8<----------------
 include/linux/bpf.h          |    8 +-
 include/linux/filter.h       |   33 ++++++--
 include/linux/kernel.h       |    7 +
 include/linux/preempt.h      |   30 +++++++
 kernel/bpf/hashtab.c         |  164 ++++++++++++++++++++++++++++++++-----------
 kernel/bpf/lpm_trie.c        |   12 +--
 kernel/bpf/percpu_freelist.c |   20 ++---
 kernel/bpf/stackmap.c        |   18 +++-
 kernel/bpf/syscall.c         |   16 ++--
 kernel/bpf/trampoline.c      |    9 +-
 kernel/events/core.c         |    2 
 kernel/seccomp.c             |    4 -
 kernel/trace/bpf_trace.c     |    6 -
 lib/test_bpf.c               |    4 -
 net/bpf/test_run.c           |    8 +-
 net/core/flow_dissector.c    |    4 -
 net/core/skmsg.c             |    8 --
 net/kcm/kcmsock.c            |    4 -
 18 files changed, 248 insertions(+), 109 deletions(-)







* [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 02/19] sched: Provide cant_migrate() Thomas Gleixner
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman

Currently code which solely needs to prevent migration of a task uses
preempt_disable()/enable() pairs. This is the only reliable way to do so as
setting the task affinity to a single CPU can be undone by a setaffinity
operation from a different task/process. It's also significantly faster.

RT provides a separate migrate_disable/enable() mechanism which does not
disable preemption to achieve the semantic requirements of an (almost) fully
preemptible kernel.

As it is unclear from looking at a given code path whether the intention is
to disable preemption or migration, introduce migrate_disable/enable()
inline functions which can be used to annotate code which merely needs to
disable migration. Map them to preempt_disable/enable() for now. The RT
substitution will be provided later.

Code which is annotated that way documents that it has no requirement to
protect against reentrancy of a preempting task. Either this is not
required at all or the call sites are already serialized by other means.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
---
 include/linux/preempt.h |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -322,4 +322,34 @@ static inline void preempt_notifier_init
 
 #endif
 
+/**
+ * migrate_disable - Prevent migration of the current task
+ *
+ * Maps to preempt_disable() which also disables preemption. Use
+ * migrate_disable() to annotate that the intent is to prevent migration
+ * but not necessarily preemption.
+ *
+ * Can be invoked nested like preempt_disable() and needs the corresponding
+ * number of migrate_enable() invocations.
+ */
+static __always_inline void migrate_disable(void)
+{
+	preempt_disable();
+}
+
+/**
+ * migrate_enable - Allow migration of the current task
+ *
+ * Counterpart to migrate_disable().
+ *
+ * As migrate_disable() can be invoked nested, only the outermost invocation
+ * reenables migration.
+ *
+ * Currently mapped to preempt_enable().
+ */
+static __always_inline void migrate_enable(void)
+{
+	preempt_enable();
+}
+
 #endif /* __LINUX_PREEMPT_H */



* [RFC patch 02/19] sched: Provide cant_migrate()
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 03/19] bpf: Update locking comment in hashtab code Thomas Gleixner
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

Some code paths rely on preempt_disable() to prevent migration on a non RT
enabled kernel. These preempt_disable/enable() pairs are substituted by
migrate_disable/enable() pairs or other forms of RT specific protection. On
RT these protections prevent migration but not preemption. Obviously a
cant_sleep() check in such a section will trigger on RT because preemption
is not disabled.

Provide a cant_migrate() macro which maps to cant_sleep() on a non RT
kernel and an empty placeholder for RT for now. The placeholder will be
changed to a proper debug check along with the RT specific migration
protection mechanism.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/kernel.h |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -257,6 +257,13 @@ extern void __cant_sleep(const char *fil
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
 
+#ifndef CONFIG_PREEMPT_RT
+# define cant_migrate()		cant_sleep()
+#else
+  /* Placeholder for now */
+# define cant_migrate()		do { } while (0)
+#endif
+
 /**
  * abs - return absolute value of an argument
  * @x: the value.  If it is unsigned type, it is converted to signed type first.



* [RFC patch 03/19] bpf: Update locking comment in hashtab code
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 02/19] sched: Provide cant_migrate() Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() Thomas Gleixner
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

The comment where the bucket lock is acquired says:

  /* bpf_map_update_elem() can be called in_irq() */

which is not really helpful and, aside from that, does not explain the
subtle details of the hash bucket locks, especially in the context of BPF
and perf, kprobes and tracing.

Add a comment at the top of the file which explains the protection scopes
and the details how potential deadlocks are prevented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/hashtab.c |   24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -27,6 +27,26 @@
 	.map_delete_batch =			\
 	generic_map_delete_batch
 
+/*
+ * The bucket lock has two protection scopes:
+ *
+ * 1) Serializing concurrent operations from BPF programs on different
+ *    CPUs
+ *
+ * 2) Serializing concurrent operations from BPF programs and sys_bpf()
+ *
+ * BPF programs can execute in any context including perf, kprobes and
+ * tracing. As there are almost no limits where perf, kprobes and tracing
+ * can be invoked from, the lock operations need to be protected against
+ * deadlocks. Deadlocks can be caused by recursion and by an invocation in
+ * the lock held section when functions which acquire this lock are invoked
+ * from sys_bpf(). BPF recursion is prevented by incrementing the per CPU
+ * variable bpf_prog_active, which prevents BPF programs attached to perf
+ * events, kprobes and tracing from being invoked before the prior invocation
+ * from one of these contexts completed. sys_bpf() uses the same mechanism
+ * by pinning the task to the current CPU and incrementing the recursion
+ * protection across the map operation.
+ */
 struct bucket {
 	struct hlist_nulls_head head;
 	raw_spinlock_t lock;
@@ -872,7 +892,6 @@ static int htab_map_update_elem(struct b
 		 */
 	}
 
-	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&b->lock, flags);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
@@ -952,7 +971,6 @@ static int htab_lru_map_update_elem(stru
 		return -ENOMEM;
 	memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size);
 
-	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&b->lock, flags);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
@@ -1007,7 +1025,6 @@ static int __htab_percpu_map_update_elem
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&b->lock, flags);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
@@ -1071,7 +1088,6 @@ static int __htab_lru_percpu_map_update_
 			return -ENOMEM;
 	}
 
-	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&b->lock, flags);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);



* [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (2 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 03/19] bpf: Update locking comment in hashtab code Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-19 16:54   ` Steven Rostedt
  2020-02-14 13:39 ` [RFC patch 05/19] perf/bpf: Remove preempt disable around BPF invocation Thomas Gleixner
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

__bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.

This is redundant because __bpf_trace_run() is invoked from a trace point
via __DO_TRACE() which already disables preemption _before_ invoking any of
the functions which are attached to a trace point.

Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/trace/bpf_trace.c |    2 --
 1 file changed, 2 deletions(-)

--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1476,9 +1476,7 @@ static __always_inline
 void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
 {
 	rcu_read_lock();
-	preempt_disable();
 	(void) BPF_PROG_RUN(prog, args);
-	preempt_enable();
 	rcu_read_unlock();
 }
 



* [RFC patch 05/19] perf/bpf: Remove preempt disable around BPF invocation
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (3 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 06/19] bpf: Dont iterate over possible CPUs with interrupts disabled Thomas Gleixner
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

The BPF invocation from the perf event overflow handler does not require
preemption to be disabled because it is called from NMI or at least hard
interrupt context, which is already non-preemptible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/events/core.c |    2 --
 1 file changed, 2 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9206,7 +9206,6 @@ static void bpf_overflow_handler(struct
 	int ret = 0;
 
 	ctx.regs = perf_arch_bpf_user_pt_regs(regs);
-	preempt_disable();
 	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
 		goto out;
 	rcu_read_lock();
@@ -9214,7 +9213,6 @@ static void bpf_overflow_handler(struct
 	rcu_read_unlock();
 out:
 	__this_cpu_dec(bpf_prog_active);
-	preempt_enable();
 	if (!ret)
 		return;
 



* [RFC patch 06/19] bpf: Dont iterate over possible CPUs with interrupts disabled
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (4 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 05/19] perf/bpf: Remove preempt disable around BPF invocation Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro Thomas Gleixner
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

pcpu_freelist_populate() is disabling interrupts and then iterates over the
possible CPUs. The reason why this disables interrupts is to silence
lockdep because the invoked ___pcpu_freelist_push() takes spin locks.

Neither the interrupt disabling nor the locking are required in this
function because it's called during initialization and the resulting map is
not yet visible to anything.

Split out the actual push assignment into an inline, call it from the loop
and remove the interrupt disable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/percpu_freelist.c |   20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

--- a/kernel/bpf/percpu_freelist.c
+++ b/kernel/bpf/percpu_freelist.c
@@ -25,12 +25,18 @@ void pcpu_freelist_destroy(struct pcpu_f
 	free_percpu(s->freelist);
 }
 
+static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head,
+					   struct pcpu_freelist_node *node)
+{
+	node->next = head->first;
+	head->first = node;
+}
+
 static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head,
 					 struct pcpu_freelist_node *node)
 {
 	raw_spin_lock(&head->lock);
-	node->next = head->first;
-	head->first = node;
+	pcpu_freelist_push_node(head, node);
 	raw_spin_unlock(&head->lock);
 }
 
@@ -56,21 +62,16 @@ void pcpu_freelist_populate(struct pcpu_
 			    u32 nr_elems)
 {
 	struct pcpu_freelist_head *head;
-	unsigned long flags;
 	int i, cpu, pcpu_entries;
 
 	pcpu_entries = nr_elems / num_possible_cpus() + 1;
 	i = 0;
 
-	/* disable irq to workaround lockdep false positive
-	 * in bpf usage pcpu_freelist_populate() will never race
-	 * with pcpu_freelist_push()
-	 */
-	local_irq_save(flags);
 	for_each_possible_cpu(cpu) {
 again:
 		head = per_cpu_ptr(s->freelist, cpu);
-		___pcpu_freelist_push(head, buf);
+		/* No locking required as this is not visible yet. */
+		pcpu_freelist_push_node(head, buf);
 		i++;
 		buf += elem_size;
 		if (i == nr_elems)
@@ -78,7 +79,6 @@ void pcpu_freelist_populate(struct pcpu_
 		if (i % pcpu_entries)
 			goto again;
 	}
-	local_irq_restore(flags);
 }
 
 struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *s)



* [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (5 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 06/19] bpf: Dont iterate over possible CPUs with interrupts disabled Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 18:50   ` Mathieu Desnoyers
  2020-02-14 13:39 ` [RFC patch 08/19] bpf: Replace cant_sleep() with cant_migrate() Thomas Gleixner
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

BPF programs are required to run to completion on one CPU as they use per CPU
storage, but according to Alexei they don't need reentrancy protection as
obviously BPF programs running in thread context can always be 'preempted'
by hard and soft interrupts and instrumentation and the same program can
run concurrently on a different CPU.

The mechanism currently used to ensure this is to wrap the invocation
into a preempt_disable/enable() pair. Disabling preemption also disables
migration for a task.

preempt_disable/enable() is used because there is no explicit way to
reliably disable only migration.

Provide a separate macro to invoke a BPF program which can be used in
migratable task context.

It wraps BPF_PROG_RUN() in a migrate_disable/enable() pair which maps on
non RT enabled kernels to preempt_disable/enable(). On RT enabled kernels
this merely disables migration. Both methods ensure that the invoked BPF
program runs on one CPU to completion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/filter.h |   22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -576,8 +576,26 @@ DECLARE_STATIC_KEY_FALSE(bpf_stats_enabl
 	}								\
 	ret; })
 
-#define BPF_PROG_RUN(prog, ctx) __BPF_PROG_RUN(prog, ctx,		\
-					       bpf_dispatcher_nopfunc)
+#define BPF_PROG_RUN(prog, ctx)						\
+	__BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc)
+
+/*
+ * Use in preemptible and therefore migratable context to make sure that
+ * the execution of the BPF program runs on one CPU.
+ *
+ * This uses migrate_disable/enable() explicitly to document that the
+ * invocation of a BPF program does not require reentrancy protection
+ * against a BPF program which is invoked from a preempting task.
+ *
+ * For non RT enabled kernels migrate_disable/enable() maps to
+ * preempt_disable/enable(), i.e. it disables also preemption.
+ */
+#define BPF_PROG_RUN_PIN_ON_CPU(prog, ctx) ({				\
+	u32 ret;							\
+	migrate_disable();						\
+	ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc);	\
+	migrate_enable();						\
+	ret; })
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 



* [RFC patch 08/19] bpf: Replace cant_sleep() with cant_migrate()
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (6 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites Thomas Gleixner
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

As already discussed in the previous change which introduced
BPF_PROG_RUN_PIN_ON_CPU(), BPF only requires migration to be disabled to
guarantee that a program runs to completion on one CPU.

If RT substitutes the preempt disable based migration protection then the
cant_sleep() check will obviously trigger as preemption is not disabled.

Replace it by cant_migrate() which maps to cant_sleep() on a non RT kernel
and will verify that migration is disabled on a full RT kernel.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/filter.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -561,7 +561,7 @@ DECLARE_STATIC_KEY_FALSE(bpf_stats_enabl
 
 #define __BPF_PROG_RUN(prog, ctx, dfunc)	({			\
 	u32 ret;							\
-	cant_sleep();							\
+	cant_migrate();							\
 	if (static_branch_unlikely(&bpf_stats_enabled_key)) {		\
 		struct bpf_prog_stats *stats;				\
 		u64 start = sched_clock();				\



* [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (7 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 08/19] bpf: Replace cant_sleep() with cant_migrate() Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-19  1:39   ` Vinicius Costa Gomes
  2020-02-14 13:39 ` [RFC patch 10/19] trace/bpf: Use migrate disable in trace_call_bpf() Thomas Gleixner
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

From: David Miller <davem@davemloft.net>

All of these cases are strictly of the form:

	preempt_disable();
	BPF_PROG_RUN(...);
	preempt_enable();

Replace this with BPF_PROG_RUN_PIN_ON_CPU() which wraps BPF_PROG_RUN()
with:

	migrate_disable();
	BPF_PROG_RUN(...);
	migrate_enable();

On non RT enabled kernels this maps to preempt_disable/enable() and on RT
enabled kernels this solely prevents migration, which is sufficient as
there is no requirement to prevent reentrancy to any BPF program from a
preempting task. The only requirement is that the program stays on the same
CPU.

Therefore, this is a trivially correct transformation.

[ tglx: Converted to BPF_PROG_RUN_PIN_ON_CPU() ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/filter.h    |    4 +---
 kernel/seccomp.c          |    4 +---
 net/core/flow_dissector.c |    4 +---
 net/core/skmsg.c          |    8 ++------
 net/kcm/kcmsock.c         |    4 +---
 5 files changed, 6 insertions(+), 18 deletions(-)

--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -713,9 +713,7 @@ static inline u32 bpf_prog_run_clear_cb(
 	if (unlikely(prog->cb_access))
 		memset(cb_data, 0, BPF_SKB_CB_LEN);
 
-	preempt_disable();
-	res = BPF_PROG_RUN(prog, skb);
-	preempt_enable();
+	res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
 	return res;
 }
 
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
 	 */
-	preempt_disable();
 	for (; f; f = f->prev) {
-		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
+		u32 cur_ret = BPF_PROG_RUN_PIN_ON_CPU(f->prog, sd);
 
 		if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
 			ret = cur_ret;
 			*match = f;
 		}
 	}
-	preempt_enable();
 	return ret;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -920,9 +920,7 @@ bool bpf_flow_dissect(struct bpf_prog *p
 		     (int)FLOW_DISSECTOR_F_STOP_AT_ENCAP);
 	flow_keys->flags = flags;
 
-	preempt_disable();
-	result = BPF_PROG_RUN(prog, ctx);
-	preempt_enable();
+	result = BPF_PROG_RUN_PIN_ON_CPU(prog, ctx);
 
 	flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, nhoff, hlen);
 	flow_keys->thoff = clamp_t(u16, flow_keys->thoff,
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -628,7 +628,6 @@ int sk_psock_msg_verdict(struct sock *sk
 	struct bpf_prog *prog;
 	int ret;
 
-	preempt_disable();
 	rcu_read_lock();
 	prog = READ_ONCE(psock->progs.msg_parser);
 	if (unlikely(!prog)) {
@@ -638,7 +637,7 @@ int sk_psock_msg_verdict(struct sock *sk
 
 	sk_msg_compute_data_pointers(msg);
 	msg->sk = sk;
-	ret = BPF_PROG_RUN(prog, msg);
+	ret = BPF_PROG_RUN_PIN_ON_CPU(prog, msg);
 	ret = sk_psock_map_verd(ret, msg->sk_redir);
 	psock->apply_bytes = msg->apply_bytes;
 	if (ret == __SK_REDIRECT) {
@@ -653,7 +652,6 @@ int sk_psock_msg_verdict(struct sock *sk
 	}
 out:
 	rcu_read_unlock();
-	preempt_enable();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(sk_psock_msg_verdict);
@@ -665,9 +663,7 @@ static int sk_psock_bpf_run(struct sk_ps
 
 	skb->sk = psock->sk;
 	bpf_compute_data_end_sk_skb(skb);
-	preempt_disable();
-	ret = BPF_PROG_RUN(prog, skb);
-	preempt_enable();
+	ret = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
 	/* strparser clones the skb before handing it to a upper layer,
 	 * meaning skb_orphan has been called. We NULL sk on the way out
 	 * to ensure we don't trigger a BUG_ON() in skb/sk operations
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -380,9 +380,7 @@ static int kcm_parse_func_strparser(stru
 	struct bpf_prog *prog = psock->bpf_prog;
 	int res;
 
-	preempt_disable();
-	res = BPF_PROG_RUN(prog, skb);
-	preempt_enable();
+	res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
 	return res;
 }
 



* [RFC patch 10/19] trace/bpf: Use migrate disable in trace_call_bpf()
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (8 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 11/19] bpf/tests: Use migrate disable instead of preempt disable Thomas Gleixner
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

BPF does not require preemption to be disabled. It only requires staying on the
same CPU while running a program. Reflect this by replacing
preempt_disable/enable() with migrate_disable/enable() pairs.

On a non-RT kernel this maps to preempt_disable/enable().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/trace/bpf_trace.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -83,7 +83,7 @@ unsigned int trace_call_bpf(struct trace
 	if (in_nmi()) /* not supported yet */
 		return 1;
 
-	preempt_disable();
+	migrate_disable();
 
 	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
 		/*
@@ -115,7 +115,7 @@ unsigned int trace_call_bpf(struct trace
 
  out:
 	__this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 
 	return ret;
 }



* [RFC patch 11/19] bpf/tests: Use migrate disable instead of preempt disable
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (9 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 10/19] trace/bpf: Use migrate disable in trace_call_bpf() Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 12/19] bpf: Use migrate_disable/enable() in trampoline code Thomas Gleixner
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

From: David Miller <davem@davemloft.net>

Replace the preemption disable/enable with migrate_disable/enable() to
reflect the actual requirement and to allow PREEMPT_RT to substitute it
with an actual migration disable mechanism which does not disable
preemption.

[ tglx: Switched it over to migrate disable ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 lib/test_bpf.c     |    4 ++--
 net/bpf/test_run.c |    8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6660,14 +6660,14 @@ static int __run_one(const struct bpf_pr
 	u64 start, finish;
 	int ret = 0, i;
 
-	preempt_disable();
+	migrate_disable();
 	start = ktime_get_ns();
 
 	for (i = 0; i < runs; i++)
 		ret = BPF_PROG_RUN(fp, data);
 
 	finish = ktime_get_ns();
-	preempt_enable();
+	migrate_enable();
 
 	*duration = finish - start;
 	do_div(*duration, runs);
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -37,7 +37,7 @@ static int bpf_test_run(struct bpf_prog
 		repeat = 1;
 
 	rcu_read_lock();
-	preempt_disable();
+	migrate_disable();
 	time_start = ktime_get_ns();
 	for (i = 0; i < repeat; i++) {
 		bpf_cgroup_storage_set(storage);
@@ -54,18 +54,18 @@ static int bpf_test_run(struct bpf_prog
 
 		if (need_resched()) {
 			time_spent += ktime_get_ns() - time_start;
-			preempt_enable();
+			migrate_enable();
 			rcu_read_unlock();
 
 			cond_resched();
 
 			rcu_read_lock();
-			preempt_disable();
+			migrate_disable();
 			time_start = ktime_get_ns();
 		}
 	}
 	time_spent += ktime_get_ns() - time_start;
-	preempt_enable();
+	migrate_enable();
 	rcu_read_unlock();
 
 	do_div(time_spent, repeat);



* [RFC patch 12/19] bpf: Use migrate_disable/enable() in trampoline code.
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (10 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 11/19] bpf/tests: Use migrate disable instead of preempt disable Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 13/19] bpf: Use migrate_disable/enable in array macros and cgroup/lirc code Thomas Gleixner
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

From: David Miller <davem@davemloft.net>

Use migrate_disable/enable() instead of preempt_disable/enable() to reflect
the purpose. This allows PREEMPT_RT to substitute it with an actual
migration disable implementation. On non RT kernels this is still mapped to
preempt_disable/enable().

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/bpf/trampoline.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -367,8 +367,9 @@ void bpf_trampoline_put(struct bpf_tramp
 	mutex_unlock(&trampoline_mutex);
 }
 
-/* The logic is similar to BPF_PROG_RUN, but with explicit rcu and preempt that
- * are needed for trampoline. The macro is split into
+/* The logic is similar to BPF_PROG_RUN, but with an explicit
+ * rcu_read_lock() and migrate_disable() which are required
+ * for the trampoline. The macro is split into
  * call _bpf_prog_enter
  * call prog->bpf_func
  * call __bpf_prog_exit
@@ -378,7 +379,7 @@ u64 notrace __bpf_prog_enter(void)
 	u64 start = 0;
 
 	rcu_read_lock();
-	preempt_disable();
+	migrate_disable();
 	if (static_branch_unlikely(&bpf_stats_enabled_key))
 		start = sched_clock();
 	return start;
@@ -401,7 +402,7 @@ void notrace __bpf_prog_exit(struct bpf_
 		stats->nsecs += sched_clock() - start;
 		u64_stats_update_end(&stats->syncp);
 	}
-	preempt_enable();
+	migrate_enable();
 	rcu_read_unlock();
 }
 



* [RFC patch 13/19] bpf: Use migrate_disable/enable in array macros and cgroup/lirc code.
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (11 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 12/19] bpf: Use migrate_disable/enable() in trampoline code Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code Thomas Gleixner
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

From: David Miller <davem@davemloft.net>

Replace the preemption disable/enable with migrate_disable/enable() to
reflect the actual requirement and to allow PREEMPT_RT to substitute it
with an actual migration disable mechanism which does not disable
preemption.

Including the code paths that go via __bpf_prog_run_save_cb().

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/bpf.h    |    8 ++++----
 include/linux/filter.h |    5 +++--
 2 files changed, 7 insertions(+), 6 deletions(-)

--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -885,7 +885,7 @@ int bpf_prog_array_copy(struct bpf_prog_
 		struct bpf_prog *_prog;			\
 		struct bpf_prog_array *_array;		\
 		u32 _ret = 1;				\
-		preempt_disable();			\
+		migrate_disable();			\
 		rcu_read_lock();			\
 		_array = rcu_dereference(array);	\
 		if (unlikely(check_non_null && !_array))\
@@ -898,7 +898,7 @@ int bpf_prog_array_copy(struct bpf_prog_
 		}					\
 _out:							\
 		rcu_read_unlock();			\
-		preempt_enable();			\
+		migrate_enable();			\
 		_ret;					\
 	 })
 
@@ -932,7 +932,7 @@ int bpf_prog_array_copy(struct bpf_prog_
 		u32 ret;				\
 		u32 _ret = 1;				\
 		u32 _cn = 0;				\
-		preempt_disable();			\
+		migrate_disable();			\
 		rcu_read_lock();			\
 		_array = rcu_dereference(array);	\
 		_item = &_array->items[0];		\
@@ -944,7 +944,7 @@ int bpf_prog_array_copy(struct bpf_prog_
 			_item++;			\
 		}					\
 		rcu_read_unlock();			\
-		preempt_enable();			\
+		migrate_enable();			\
 		if (_ret)				\
 			_ret = (_cn ? NET_XMIT_CN : NET_XMIT_SUCCESS);	\
 		else					\
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -673,6 +673,7 @@ static inline u8 *bpf_skb_cb(struct sk_b
 	return qdisc_skb_cb(skb)->data;
 }
 
+/* Must be invoked with migration disabled */
 static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog,
 					 struct sk_buff *skb)
 {
@@ -698,9 +699,9 @@ static inline u32 bpf_prog_run_save_cb(c
 {
 	u32 res;
 
-	preempt_disable();
+	migrate_disable();
 	res = __bpf_prog_run_save_cb(prog, skb);
-	preempt_enable();
+	migrate_enable();
 	return res;
 }
 



* [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (12 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 13/19] bpf: Use migrate_disable/enable in array macros and cgroup/lirc code Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 19:11   ` Mathieu Desnoyers
  2020-02-14 13:39 ` [RFC patch 15/19] bpf: Use migrate_disable() in sys_bpf() Thomas Gleixner
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

The required protection is that the caller cannot be migrated to a
different CPU as these places take either a hash bucket lock or might
trigger a kprobe inside the memory allocator. Both scenarios can lead to
deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
variable which temporarily blocks the invocation of BPF programs from perf
and kprobes.

Replace the preempt_disable/enable() pairs with migrate_disable/enable()
pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
kernel this maps to preempt_disable/enable(), i.e. no functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/hashtab.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -698,11 +698,11 @@ static void htab_elem_free_rcu(struct rc
 	 * we're calling kfree, otherwise deadlock is possible if kprobes
 	 * are placed somewhere inside of slub
 	 */
-	preempt_disable();
+	migrate_disable();
 	__this_cpu_inc(bpf_prog_active);
 	htab_elem_free(htab, l);
 	__this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 }
 
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
@@ -1327,7 +1327,7 @@ static int
 	}
 
 again:
-	preempt_disable();
+	migrate_disable();
 	this_cpu_inc(bpf_prog_active);
 	rcu_read_lock();
 again_nocopy:
@@ -1347,7 +1347,7 @@ static int
 		raw_spin_unlock_irqrestore(&b->lock, flags);
 		rcu_read_unlock();
 		this_cpu_dec(bpf_prog_active);
-		preempt_enable();
+		migrate_enable();
 		goto after_loop;
 	}
 
@@ -1356,7 +1356,7 @@ static int
 		raw_spin_unlock_irqrestore(&b->lock, flags);
 		rcu_read_unlock();
 		this_cpu_dec(bpf_prog_active);
-		preempt_enable();
+		migrate_enable();
 		kvfree(keys);
 		kvfree(values);
 		goto alloc;
@@ -1406,7 +1406,7 @@ static int
 
 	rcu_read_unlock();
 	this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 	if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
 	    key_size * bucket_cnt) ||
 	    copy_to_user(uvalues + total * value_size, values,



* [RFC patch 15/19] bpf: Use migrate_disable() in sys_bpf()
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (13 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 16/19] bpf: Factor out hashtab bucket lock operations Thomas Gleixner
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

The required protection is that the caller cannot be migrated to a
different CPU as these functions end up in places which take either a hash
bucket lock or might trigger a kprobe inside the memory allocator. Both
scenarios can lead to deadlocks. The deadlock prevention is per CPU by
incrementing a per CPU variable which temporarily blocks the invocation of
BPF programs from perf and kprobes.

Replace the preempt_disable/enable() pairs with migrate_disable/enable()
pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
kernel this maps to preempt_disable/enable(), i.e. no functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/syscall.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -174,7 +174,7 @@ static int bpf_map_update_value(struct b
 	/* must increment bpf_prog_active to avoid kprobe+bpf triggering from
 	 * inside bpf map update or delete otherwise deadlocks are possible
 	 */
-	preempt_disable();
+	migrate_disable();
 	__this_cpu_inc(bpf_prog_active);
 	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
@@ -207,7 +207,7 @@ static int bpf_map_update_value(struct b
 		rcu_read_unlock();
 	}
 	__this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 	maybe_wait_bpf_programs(map);
 
 	return err;
@@ -222,7 +222,7 @@ static int bpf_map_copy_value(struct bpf
 	if (bpf_map_is_dev_bound(map))
 		return bpf_map_offload_lookup_elem(map, key, value);
 
-	preempt_disable();
+	migrate_disable();
 	this_cpu_inc(bpf_prog_active);
 	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
@@ -269,7 +269,7 @@ static int bpf_map_copy_value(struct bpf
 	}
 
 	this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 	maybe_wait_bpf_programs(map);
 
 	return err;
@@ -1136,13 +1136,13 @@ static int map_delete_elem(union bpf_att
 		goto out;
 	}
 
-	preempt_disable();
+	migrate_disable();
 	__this_cpu_inc(bpf_prog_active);
 	rcu_read_lock();
 	err = map->ops->map_delete_elem(map, key);
 	rcu_read_unlock();
 	__this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 	maybe_wait_bpf_programs(map);
 out:
 	kfree(key);
@@ -1254,13 +1254,13 @@ int generic_map_delete_batch(struct bpf_
 			break;
 		}
 
-		preempt_disable();
+		migrate_disable();
 		__this_cpu_inc(bpf_prog_active);
 		rcu_read_lock();
 		err = map->ops->map_delete_elem(map, key);
 		rcu_read_unlock();
 		__this_cpu_dec(bpf_prog_active);
-		preempt_enable();
+		migrate_enable();
 		maybe_wait_bpf_programs(map);
 		if (err)
 			break;



* [RFC patch 16/19] bpf: Factor out hashtab bucket lock operations
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (14 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 15/19] bpf: Use migrate_disable() in sys_bpf() Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 17/19] bpf: Prepare hashtab locking for PREEMPT_RT Thomas Gleixner
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

As a preparation for making the BPF locking RT friendly, factor out the
hash bucket lock operations into inline functions. This allows doing the
necessary RT modifications in one place instead of sprinkling them all over
the place. No functional change.

The now unused htab argument of the lock/unlock functions will be used in
the next step which adds PREEMPT_RT support.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/hashtab.c |   69 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 46 insertions(+), 23 deletions(-)

--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -87,6 +87,32 @@ struct htab_elem {
 	char key[0] __aligned(8);
 };
 
+static void htab_init_buckets(struct bpf_htab *htab)
+{
+	unsigned i;
+
+	for (i = 0; i < htab->n_buckets; i++) {
+		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
+		raw_spin_lock_init(&htab->buckets[i].lock);
+	}
+}
+
+static inline unsigned long htab_lock_bucket(const struct bpf_htab *htab,
+					     struct bucket *b)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&b->lock, flags);
+	return flags;
+}
+
+static inline void htab_unlock_bucket(const struct bpf_htab *htab,
+				      struct bucket *b,
+				      unsigned long flags)
+{
+	raw_spin_unlock_irqrestore(&b->lock, flags);
+}
+
 static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
 
 static bool htab_is_lru(const struct bpf_htab *htab)
@@ -336,8 +362,8 @@ static struct bpf_map *htab_map_alloc(un
 	bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
 	bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
 	struct bpf_htab *htab;
-	int err, i;
 	u64 cost;
+	int err;
 
 	htab = kzalloc(sizeof(*htab), GFP_USER);
 	if (!htab)
@@ -399,10 +425,7 @@ static struct bpf_map *htab_map_alloc(un
 	else
 		htab->hashrnd = get_random_int();
 
-	for (i = 0; i < htab->n_buckets; i++) {
-		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
-		raw_spin_lock_init(&htab->buckets[i].lock);
-	}
+	htab_init_buckets(htab);
 
 	if (prealloc) {
 		err = prealloc_init(htab);
@@ -610,7 +633,7 @@ static bool htab_lru_map_delete_node(voi
 	b = __select_bucket(htab, tgt_l->hash);
 	head = &b->head;
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
 		if (l == tgt_l) {
@@ -618,7 +641,7 @@ static bool htab_lru_map_delete_node(voi
 			break;
 		}
 
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 
 	return l == tgt_l;
 }
@@ -892,7 +915,7 @@ static int htab_map_update_elem(struct b
 		 */
 	}
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
 
@@ -933,7 +956,7 @@ static int htab_map_update_elem(struct b
 	}
 	ret = 0;
 err:
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	return ret;
 }
 
@@ -971,7 +994,7 @@ static int htab_lru_map_update_elem(stru
 		return -ENOMEM;
 	memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size);
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
 
@@ -990,7 +1013,7 @@ static int htab_lru_map_update_elem(stru
 	ret = 0;
 
 err:
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 
 	if (ret)
 		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
@@ -1025,7 +1048,7 @@ static int __htab_percpu_map_update_elem
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
 
@@ -1048,7 +1071,7 @@ static int __htab_percpu_map_update_elem
 	}
 	ret = 0;
 err:
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	return ret;
 }
 
@@ -1088,7 +1111,7 @@ static int __htab_lru_percpu_map_update_
 			return -ENOMEM;
 	}
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
 
@@ -1110,7 +1133,7 @@ static int __htab_lru_percpu_map_update_
 	}
 	ret = 0;
 err:
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	if (l_new)
 		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
 	return ret;
@@ -1148,7 +1171,7 @@ static int htab_map_delete_elem(struct b
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l = lookup_elem_raw(head, hash, key, key_size);
 
@@ -1158,7 +1181,7 @@ static int htab_map_delete_elem(struct b
 		ret = 0;
 	}
 
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	return ret;
 }
 
@@ -1180,7 +1203,7 @@ static int htab_lru_map_delete_elem(stru
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l = lookup_elem_raw(head, hash, key, key_size);
 
@@ -1189,7 +1212,7 @@ static int htab_lru_map_delete_elem(stru
 		ret = 0;
 	}
 
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	if (l)
 		bpf_lru_push_free(&htab->lru, &l->lru_node);
 	return ret;
@@ -1335,7 +1358,7 @@ static int
 	dst_val = values;
 	b = &htab->buckets[batch];
 	head = &b->head;
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	bucket_cnt = 0;
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
@@ -1344,7 +1367,7 @@ static int
 	if (bucket_cnt > (max_count - total)) {
 		if (total == 0)
 			ret = -ENOSPC;
-		raw_spin_unlock_irqrestore(&b->lock, flags);
+		htab_unlock_bucket(htab, b, flags);
 		rcu_read_unlock();
 		this_cpu_dec(bpf_prog_active);
 		migrate_enable();
@@ -1353,7 +1376,7 @@ static int
 
 	if (bucket_cnt > bucket_size) {
 		bucket_size = bucket_cnt;
-		raw_spin_unlock_irqrestore(&b->lock, flags);
+		htab_unlock_bucket(htab, b, flags);
 		rcu_read_unlock();
 		this_cpu_dec(bpf_prog_active);
 		migrate_enable();
@@ -1395,7 +1418,7 @@ static int
 		dst_val += value_size;
 	}
 
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	/* If we are not copying data, we can go to next bucket and avoid
 	 * unlocking the rcu.
 	 */



* [RFC patch 17/19] bpf: Prepare hashtab locking for PREEMPT_RT
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (15 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 16/19] bpf: Factor out hashtab bucket lock operations Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 18/19] bpf, lpm: Make locking RT friendly Thomas Gleixner
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

PREEMPT_RT forbids certain operations like memory allocations (even with
GFP_ATOMIC) from atomic contexts. This is required because even with
GFP_ATOMIC the memory allocator calls into code paths which acquire locks
with long held lock sections. To ensure the deterministic behaviour these
locks are regular spinlocks, which are converted to 'sleepable' spinlocks
on RT. The only true atomic contexts on an RT kernel are the low level
hardware handling, scheduling, low level interrupt handling, NMIs etc. None
of these contexts should ever do memory allocations.

As regular device interrupt handlers and soft interrupts are forced into
thread context, the existing code which does
  spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*();
just works.

In theory the BPF locks could be converted to regular spinlocks as well,
but the bucket locks and percpu_freelist locks can be taken from arbitrary
contexts (perf, kprobes, tracepoints) which are required to be atomic
contexts even on RT. These mechanisms require preallocated maps, so there
is no need to invoke memory allocations within the lock held sections.

BPF maps which need dynamic allocation are only used from (forced) thread
context on RT and can therefore use regular spinlocks, which in turn allows
memory allocations from within the lock held section.

To achieve this make the hash bucket lock a union of a raw and a regular
spinlock and initialize and lock/unlock either the raw spinlock for
preallocated maps or the regular variant for maps which require memory
allocations.

On a non-RT kernel this distinction is neither possible nor required.
spinlock maps to raw_spinlock and the extra code and conditional are
optimized out by the compiler. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/hashtab.c |   65 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 56 insertions(+), 9 deletions(-)

--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -46,10 +46,43 @@
  * from one of these contexts completed. sys_bpf() uses the same mechanism
  * by pinning the task to the current CPU and incrementing the recursion
  * protection across the map operation.
+ *
+ * This has subtle implications on PREEMPT_RT. PREEMPT_RT forbids certain
+ * operations like memory allocations (even with GFP_ATOMIC) from atomic
+ * contexts. This is required because even with GFP_ATOMIC the memory
+ * allocator calls into code paths which acquire locks with long held lock
+ * sections. To ensure the deterministic behaviour these locks are regular
+ * spinlocks, which are converted to 'sleepable' spinlocks on RT. The only
+ * true atomic contexts on an RT kernel are the low level hardware
+ * handling, scheduling, low level interrupt handling, NMIs etc. None of
+ * these contexts should ever do memory allocations.
+ *
+ * As regular device interrupt handlers and soft interrupts are forced into
+ * thread context, the existing code which does
+ *   spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*();
+ * just works.
+ *
+ * In theory the BPF locks could be converted to regular spinlocks as well,
+ * but the bucket locks and percpu_freelist locks can be taken from
+ * arbitrary contexts (perf, kprobes, tracepoints) which are required to be
+ * atomic contexts even on RT. These mechanisms require preallocated maps,
+ * so there is no need to invoke memory allocations within the lock held
+ * sections.
+ *
+ * BPF maps which need dynamic allocation are only used from (forced)
+ * thread context on RT and can therefore use regular spinlocks which in
+ * turn allows memory allocations from within the lock held section.
+ *
+ * On a non-RT kernel this distinction is neither possible nor required.
+ * spinlock maps to raw_spinlock and the extra code is optimized out by the
+ * compiler.
  */
 struct bucket {
 	struct hlist_nulls_head head;
-	raw_spinlock_t lock;
+	union {
+		raw_spinlock_t raw_lock;
+		spinlock_t     lock;
+	};
 };
 
 struct bpf_htab {
@@ -87,13 +120,26 @@ struct htab_elem {
 	char key[0] __aligned(8);
 };
 
+static inline bool htab_is_prealloc(const struct bpf_htab *htab)
+{
+	return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
+}
+
+static inline bool htab_use_raw_lock(const struct bpf_htab *htab)
+{
+	return (!IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab));
+}
+
 static void htab_init_buckets(struct bpf_htab *htab)
 {
 	unsigned i;
 
 	for (i = 0; i < htab->n_buckets; i++) {
 		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
-		raw_spin_lock_init(&htab->buckets[i].lock);
+		if (htab_use_raw_lock(htab))
+			raw_spin_lock_init(&htab->buckets[i].raw_lock);
+		else
+			spin_lock_init(&htab->buckets[i].lock);
 	}
 }
 
@@ -102,7 +148,10 @@ static inline unsigned long htab_lock_bu
 {
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	if (htab_use_raw_lock(htab))
+		raw_spin_lock_irqsave(&b->raw_lock, flags);
+	else
+		spin_lock_irqsave(&b->lock, flags);
 	return flags;
 }
 
@@ -110,7 +159,10 @@ static inline void htab_unlock_bucket(co
 				      struct bucket *b,
 				      unsigned long flags)
 {
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	if (htab_use_raw_lock(htab))
+		raw_spin_unlock_irqrestore(&b->raw_lock, flags);
+	else
+		spin_unlock_irqrestore(&b->lock, flags);
 }
 
 static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
@@ -127,11 +179,6 @@ static bool htab_is_percpu(const struct
 		htab->map.map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH;
 }
 
-static bool htab_is_prealloc(const struct bpf_htab *htab)
-{
-	return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
-}
-
 static inline void htab_elem_set_ptr(struct htab_elem *l, u32 key_size,
 				     void __percpu *pptr)
 {



* [RFC patch 18/19] bpf, lpm: Make locking RT friendly
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (16 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 17/19] bpf: Prepare hashtab locking for PREEMPT_RT Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 19/19] bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled Thomas Gleixner
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

The LPM trie map cannot be used in contexts like perf, kprobes and tracing
as this map type dynamically allocates memory.

The memory allocation happens with a raw spinlock held, which on a PREEMPT_RT
enabled kernel is a truly spinning lock that disables preemption and
interrupts.

As RT does not allow memory allocation from such a section for various
reasons, convert the raw spinlock to a regular spinlock.

On an RT enabled kernel these locks are substituted by 'sleeping' spinlocks
which provide the proper protection but keep the code preemptible.

On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does
not cause any functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/lpm_trie.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -34,7 +34,7 @@ struct lpm_trie {
 	size_t				n_entries;
 	size_t				max_prefixlen;
 	size_t				data_size;
-	raw_spinlock_t			lock;
+	spinlock_t			lock;
 };
 
 /* This trie implements a longest prefix match algorithm that can be used to
@@ -315,7 +315,7 @@ static int trie_update_elem(struct bpf_m
 	if (key->prefixlen > trie->max_prefixlen)
 		return -EINVAL;
 
-	raw_spin_lock_irqsave(&trie->lock, irq_flags);
+	spin_lock_irqsave(&trie->lock, irq_flags);
 
 	/* Allocate and fill a new node */
 
@@ -422,7 +422,7 @@ static int trie_update_elem(struct bpf_m
 		kfree(im_node);
 	}
 
-	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
+	spin_unlock_irqrestore(&trie->lock, irq_flags);
 
 	return ret;
 }
@@ -442,7 +442,7 @@ static int trie_delete_elem(struct bpf_m
 	if (key->prefixlen > trie->max_prefixlen)
 		return -EINVAL;
 
-	raw_spin_lock_irqsave(&trie->lock, irq_flags);
+	spin_lock_irqsave(&trie->lock, irq_flags);
 
 	/* Walk the tree looking for an exact key/length match and keeping
 	 * track of the path we traverse.  We will need to know the node
@@ -518,7 +518,7 @@ static int trie_delete_elem(struct bpf_m
 	kfree_rcu(node, rcu);
 
 out:
-	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
+	spin_unlock_irqrestore(&trie->lock, irq_flags);
 
 	return ret;
 }
@@ -575,7 +575,7 @@ static struct bpf_map *trie_alloc(union
 	if (ret)
 		goto out_err;
 
-	raw_spin_lock_init(&trie->lock);
+	spin_lock_init(&trie->lock);
 
 	return &trie->map;
 out_err:



* [RFC patch 19/19] bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (17 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 18/19] bpf, lpm: Make locking RT friendly Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 17:53 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist David Miller
  2020-02-15 20:09 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Jakub Kicinski
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

From: David Miller <davem@davemloft.net>

In an RT kernel down_read_trylock() cannot be used from NMI context and
up_read_non_owner() is another problematic issue.

So in such a configuration, simply elide the annotated stackmap and
just report the raw IPs.

In the longer term, it might be possible to provide an atomic friendly
version of the page cache traversal which will at least provide the info
whether the pages are resident and don't need to be paged in.

[ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work
  	callback and add a comment ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/bpf/stackmap.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -40,6 +40,9 @@ static void do_up_read(struct irq_work *
 {
 	struct stack_map_irq_work *work;
 
+	if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
+		return;
+
 	work = container_of(entry, struct stack_map_irq_work, irq_work);
 	up_read_non_owner(work->sem);
 	work->sem = NULL;
@@ -288,10 +291,19 @@ static void stack_map_get_build_id_offse
 	struct stack_map_irq_work *work = NULL;
 
 	if (irqs_disabled()) {
-		work = this_cpu_ptr(&up_read_work);
-		if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY)
-			/* cannot queue more up_read, fallback */
+		if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+			work = this_cpu_ptr(&up_read_work);
+			if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) {
+				/* cannot queue more up_read, fallback */
+				irq_work_busy = true;
+			}
+		} else {
+			/*
+			 * PREEMPT_RT does not allow to trylock mmap sem in
+			 * interrupt disabled context. Force the fallback code.
+			 */
 			irq_work_busy = true;
+		}
 	}
 
 	/*



* Re: [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (18 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 19/19] bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled Thomas Gleixner
@ 2020-02-14 17:53 ` David Miller
  2020-02-14 18:36   ` Thomas Gleixner
  2020-02-15 20:09 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Jakub Kicinski
  20 siblings, 1 reply; 42+ messages in thread
From: David Miller @ 2020-02-14 17:53 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, bpf, netdev, ast, daniel, bigeasy, peterz,
	williams, rostedt, juri.lelli, mingo

From: Thomas Gleixner <tglx@linutronix.de>
Date: Fri, 14 Feb 2020 14:39:17 +0100

> This is a follow up to the initial patch series which David posted a
> while ago:
> 
>  https://lore.kernel.org/bpf/20191207.160357.828344895192682546.davem@davemloft.net/
> 
> which was (while non-functional on RT) a good starting point for further
> investigations.

This looks really good after a cursory review, thanks for doing this work.

I was personally unaware of the pre-allocation rules for MAPs used by
tracing et al.  And that definitely shapes how this should be handled.


* Re: [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist
  2020-02-14 17:53 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist David Miller
@ 2020-02-14 18:36   ` Thomas Gleixner
  2020-02-17 12:59     ` [PATCH] bpf: Enforce map preallocation for all instrumentation programs Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 18:36 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, bpf, netdev, ast, daniel, bigeasy, peterz,
	williams, rostedt, juri.lelli, mingo

David Miller <davem@davemloft.net> writes:

> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Fri, 14 Feb 2020 14:39:17 +0100
>
>> This is a follow up to the initial patch series which David posted a
>> while ago:
>> 
>>  https://lore.kernel.org/bpf/20191207.160357.828344895192682546.davem@davemloft.net/
>> 
>> which was (while non-functional on RT) a good starting point for further
>> investigations.
>
> This looks really good after a cursory review, thanks for doing this work.
>
> I was personally unaware of the pre-allocation rules for MAPs used by
> tracing et al.  And that definitely shapes how this should be handled.

Hmm. I just noticed that my analysis only holds for PERF events. But
that's broken on mainline already.

Assume the following simplified callchain:

       kmalloc() from regular non BPF context
         cache empty
           freelist empty
             lock(zone->lock);
                tracepoint or kprobe
                  BPF()
                    update_elem()
                      lock(bucket)
                        kmalloc()
                          cache empty
                            freelist empty
                              lock(zone->lock);  <- DEADLOCK

So really, preallocation _must_ be enforced for all variants of
intrusive instrumentation. There are no ifs and buts; it's simply mandatory
as all intrusive instrumentation has to follow the only sensible
principle: KISS = Keep It Safe and Simple.

The above is a perfectly valid scenario and works with perf and tracing,
so it has to work with BPF in the same safe way.

I might be missing some magic enforcement of that, but I got lost in the
maze.

Thanks,

        tglx



* Re: [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro
  2020-02-14 13:39 ` [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro Thomas Gleixner
@ 2020-02-14 18:50   ` Mathieu Desnoyers
  2020-02-14 19:36     ` Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Mathieu Desnoyers @ 2020-02-14 18:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar

On 14-Feb-2020 02:39:24 PM, Thomas Gleixner wrote:
[...]
> +#define BPF_PROG_RUN_PIN_ON_CPU(prog, ctx) ({				\
> +	u32 ret;							\
> +	migrate_disable();						\
> +	ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc);	\
> +	migrate_enable();						\
> +	ret; })

Does it really have to be a statement expression with a local variable?

If so, we should consider renaming "ret" to "__ret" to minimize the
chances of a caller issuing BPF_PROG_RUN_PIN_ON_CPU with "ret" as
prog or ctx argument, which would lead to unexpected results.
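
For illustration, a sketch of the renamed variant, based purely on the
macro quoted above (not a patch):

#define BPF_PROG_RUN_PIN_ON_CPU(prog, ctx) ({				\
	u32 __ret;							\
	migrate_disable();						\
	__ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc);	\
	migrate_enable();						\
	__ret; })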

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-14 13:39 ` [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code Thomas Gleixner
@ 2020-02-14 19:11   ` Mathieu Desnoyers
  2020-02-14 19:56     ` Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Mathieu Desnoyers @ 2020-02-14 19:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar

On 14-Feb-2020 02:39:31 PM, Thomas Gleixner wrote:
> The required protection is that the caller cannot be migrated to a
> different CPU as these places take either a hash bucket lock or might
> trigger a kprobe inside the memory allocator. Both scenarios can lead to
> deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
> variable which temporarily blocks the invocation of BPF programs from perf
> and kprobes.
> 
> Replace the preempt_disable/enable() pairs with migrate_disable/enable()
> pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
> kernel this maps to preempt_disable/enable(), i.e. no functional change.

Will that _really_ work on RT?

I'm puzzled about what will happen in the following scenario on RT:

Thread A is preempted within e.g. htab_elem_free_rcu, and Thread B is
scheduled and runs through a bunch of tracepoints. Both are on the
same CPU's runqueue:

CPU 1

Thread A is scheduled
(Thread A) htab_elem_free_rcu()
(Thread A)   migrate disable
(Thread A)   __this_cpu_inc(bpf_prog_active); -> per-cpu variable for
                                               deadlock prevention.
Thread A is preempted
Thread B is scheduled
(Thread B) Runs through various tracepoints:
           trace_call_bpf()
           if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
               -> will skip any instrumentation that happens to be on
                  this CPU until...
Thread B is preempted
Thread A is scheduled
(Thread A)  __this_cpu_dec(bpf_prog_active);
(Thread A)  migrate enable

Having all those events randomly and silently discarded might be quite
unexpected from a user standpoint. This turns the deadlock prevention
mechanism into a random tracepoint-dropping facility, which is
unsettling. One alternative approach we could consider to solve this
is to make this deadlock prevention nesting counter per-thread rather
than per-cpu.
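
A minimal sketch of that per-thread variant (the task_struct field and the
helper names below are made up for illustration, they are not part of this
series):

/* Hypothetical per-task recursion guard instead of the per-CPU counter. */
static inline bool bpf_prog_enter_task(void)
{
	if (current->bpf_prog_active)	/* hypothetical task_struct field */
		return false;		/* nested invocation, skip the program */
	current->bpf_prog_active = 1;
	return true;
}

static inline void bpf_prog_exit_task(void)
{
	current->bpf_prog_active = 0;
}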

Also, I don't think using __this_cpu_inc() without preempt-disable or
irq off is safe. You'll probably want to move to this_cpu_inc/dec
instead, which can be heavier on some architectures.

Thanks,

Mathieu


> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/bpf/hashtab.c |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -698,11 +698,11 @@ static void htab_elem_free_rcu(struct rc
>  	 * we're calling kfree, otherwise deadlock is possible if kprobes
>  	 * are placed somewhere inside of slub
>  	 */
> -	preempt_disable();
> +	migrate_disable();
>  	__this_cpu_inc(bpf_prog_active);
>  	htab_elem_free(htab, l);
>  	__this_cpu_dec(bpf_prog_active);
> -	preempt_enable();
> +	migrate_enable();
>  }
>  
>  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> @@ -1327,7 +1327,7 @@ static int
>  	}
>  
>  again:
> -	preempt_disable();
> +	migrate_disable();
>  	this_cpu_inc(bpf_prog_active);
>  	rcu_read_lock();
>  again_nocopy:
> @@ -1347,7 +1347,7 @@ static int
>  		raw_spin_unlock_irqrestore(&b->lock, flags);
>  		rcu_read_unlock();
>  		this_cpu_dec(bpf_prog_active);
> -		preempt_enable();
> +		migrate_enable();
>  		goto after_loop;
>  	}
>  
> @@ -1356,7 +1356,7 @@ static int
>  		raw_spin_unlock_irqrestore(&b->lock, flags);
>  		rcu_read_unlock();
>  		this_cpu_dec(bpf_prog_active);
> -		preempt_enable();
> +		migrate_enable();
>  		kvfree(keys);
>  		kvfree(values);
>  		goto alloc;
> @@ -1406,7 +1406,7 @@ static int
>  
>  	rcu_read_unlock();
>  	this_cpu_dec(bpf_prog_active);
> -	preempt_enable();
> +	migrate_enable();
>  	if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
>  	    key_size * bucket_cnt) ||
>  	    copy_to_user(uvalues + total * value_size, values,
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro
  2020-02-14 18:50   ` Mathieu Desnoyers
@ 2020-02-14 19:36     ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 19:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:

> On 14-Feb-2020 02:39:24 PM, Thomas Gleixner wrote:
> [...]
>> +#define BPF_PROG_RUN_PIN_ON_CPU(prog, ctx) ({				\
>> +	u32 ret;							\
>> +	migrate_disable();						\
>> +	ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc);	\
>> +	migrate_enable();						\
>> +	ret; })
>
> Does it really have to be a statement expression with a local variable?
>
> If so, we should consider renaming "ret" to "__ret" to minimize the
> chances of a caller issuing BPF_PROG_RUN_PIN_ON_CPU with "ret" as
> prog or ctx argument, which would lead to unexpected results.

Indeed. That really can be an inline.
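
Roughly along these lines, a sketch that just follows the macro quoted
above (not necessarily what the final patch will look like):

static __always_inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
						   const void *ctx)
{
	u32 ret;

	migrate_disable();
	ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc);
	migrate_enable();
	return ret;
}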

Thanks,

        tglx


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-14 19:11   ` Mathieu Desnoyers
@ 2020-02-14 19:56     ` Thomas Gleixner
  2020-02-18 23:36       ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 19:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> On 14-Feb-2020 02:39:31 PM, Thomas Gleixner wrote:
>> Replace the preempt_disable/enable() pairs with migrate_disable/enable()
>> pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
>> kernel this maps to preempt_disable/enable(), i.e. no functional change.

...

> Having all those events randomly and silently discarded might be quite
> unexpected from a user standpoint. This turns the deadlock prevention
> mechanism into a random tracepoint-dropping facility, which is
> unsettling.

Well, it randomly drops events which might be unrelated to the syscall
target today already; this will just drop some more. Shrug.

> One alternative approach we could consider to solve this is to make
> this deadlock prevention nesting counter per-thread rather than
> per-cpu.

That should work both on !RT and RT.

> Also, I don't think using __this_cpu_inc() without preempt-disable or
> irq off is safe. You'll probably want to move to this_cpu_inc/dec
> instead, which can be heavier on some architectures.

Good catch.

Thanks,

        tglx

      


* Re: [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (19 preceding siblings ...)
  2020-02-14 17:53 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist David Miller
@ 2020-02-15 20:09 ` Jakub Kicinski
  20 siblings, 0 replies; 42+ messages in thread
From: Jakub Kicinski @ 2020-02-15 20:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar

On Fri, 14 Feb 2020 14:39:17 +0100 Thomas Gleixner wrote:
>    1) All interrupts, which are not explicitely marked IRQF_NO_THREAD, are

nit: s/explicitely/explicitly/ throughout (incl. code comments)


* [PATCH] bpf: Enforce map preallocation for all instrumentation programs
  2020-02-14 18:36   ` Thomas Gleixner
@ 2020-02-17 12:59     ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-17 12:59 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, bpf, netdev, ast, daniel, bigeasy, peterz,
	williams, rostedt, juri.lelli, mingo

The assumption that only programs attached to perf NMI events can deadlock
on memory allocators is wrong. Assume the following simplified callchain:

 kmalloc() from regular non BPF context
  cache empty
   freelist empty
    lock(zone->lock);
     tracepoint or kprobe
      BPF()
       update_elem()
        lock(bucket)
          kmalloc()
           cache empty
            freelist empty
             lock(zone->lock);  <- DEADLOCK

There are also other ways to create wreckage:

 kmalloc() from regular non BPF context
  local_irq_save();
   ...
    obj = slab_first();
     kprobe()
      BPF()
       update_elem()
        lock(bucket)
         kmalloc()
          local_irq_save();
           ...
            obj = slab_first(); <- Same object as above ...

So preallocation _must_ be enforced for all variants of intrusive
instrumentation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/verifier.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8144,19 +8144,23 @@ static int check_map_prog_compatibility(
 					struct bpf_prog *prog)
 
 {
-	/* Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
-	 * preallocated hash maps, since doing memory allocation
-	 * in overflow_handler can crash depending on where nmi got
-	 * triggered.
+	/*
+	 * Make sure that trace type programs use preallocated hash maps.
+	 * Perf programs obviously can't do memory allocation in NMI
+	 * context and all other types can deadlock on a memory allocator
+	 * lock when a tracepoint/kprobe triggers a BPF program inside a
+	 * lock held region or create inconsistent state when the probe is
+	 * within an interrupts disabled critical region in the memory
+	 * allocator.
 	 */
-	if (prog->type == BPF_PROG_TYPE_PERF_EVENT) {
+	if (is_tracing_prog_type(prog->type)) {
 		if (!check_map_prealloc(map)) {
-			verbose(env, "perf_event programs can only use preallocated hash map\n");
+			verbose(env, "tracing programs can only use preallocated hash map\n");
 			return -EINVAL;
 		}
 		if (map->inner_map_meta &&
 		    !check_map_prealloc(map->inner_map_meta)) {
-			verbose(env, "perf_event programs can only use preallocated inner hash map\n");
+			verbose(env, "tracing programs can only use preallocated inner hash map\n");
 			return -EINVAL;
 		}
 	}


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-14 19:56     ` Thomas Gleixner
@ 2020-02-18 23:36       ` Alexei Starovoitov
  2020-02-19  0:49         ` Thomas Gleixner
  2020-02-19 15:17         ` Mathieu Desnoyers
  0 siblings, 2 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2020-02-18 23:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar

On Fri, Feb 14, 2020 at 08:56:12PM +0100, Thomas Gleixner wrote:
> 
> > Also, I don't think using __this_cpu_inc() without preempt-disable or
> > irq off is safe. You'll probably want to move to this_cpu_inc/dec
> > instead, which can be heavier on some architectures.
> 
> Good catch.

Overall looks great.
Thank you for taking time to write commit logs and detailed cover letter.
I think s/__this_cpu_inc/this_cpu_inc/ is the only bit that needs to be
addressed for it to be merged.
There were few other suggestions from Mathieu and Jakub.
Could you address them and resend?
I saw patch 1 landing in tip tree, but it needs to be in bpf-next as well
along with the rest of the series. Does it really need to be in the tip?
I would prefer to take the whole thing and avoid conflicts around
migrate_disable() especially if nothing in tip is going to use it in this
development cycle. So just drop patch 1 from the tip?

Regarding
union {
   raw_spinlock_t  raw_lock;
   spinlock_t      lock;
};
yeah. it's not pretty, but I also don't have better ideas.

Regarding migrate_disable()... can you enable it without the rest of RT?
I haven't seen its implementation. I suspect it's scheduler only change?
If I can use migrate_disable() without RT it will help my work on sleepable
BPF programs. I would only have to worry about rcu_read_lock() since
preempt_disable() is nicely addressed.


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-18 23:36       ` Alexei Starovoitov
@ 2020-02-19  0:49         ` Thomas Gleixner
  2020-02-19  1:23           ` Alexei Starovoitov
  2020-02-19 15:17         ` Mathieu Desnoyers
  1 sibling, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-19  0:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar

Alexei,

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> Overall looks great.
> Thank you for taking time to write commit logs and detailed cover letter.
> I think s/__this_cpu_inc/this_cpu_inc/ is the only bit that needs to be
> addressed for it to be merged.
> There were few other suggestions from Mathieu and Jakub.
> Could you address them and resend?

I have them fixed up already, but I was waiting for further
comments. I'll send it out tomorrow morning as I'm dead tired by now.

> I saw patch 1 landing in tip tree, but it needs to be in bpf-next as well
> along with the rest of the series. Does it really need to be in the tip?
> I would prefer to take the whole thing and avoid conflicts around
> migrate_disable() especially if nothing in tip is going to use it in this
> development cycle. So just drop patch 1 from the tip?

I'll add patch 2 to a tip branch as well and I'll give you a tag to pull
into BPF (which has only those two commits). That allows us to further
tweak the relevant files without creating conflicts in next.

> Regarding
> union {
>    raw_spinlock_t  raw_lock;
>    spinlock_t      lock;
> };
> yeah. it's not pretty, but I also don't have better ideas.

Yeah. I really tried hard to avoid it, but the alternative solution was
code duplication which was even more horrible.

> Regarding migrate_disable()... can you enable it without the rest of RT?
> I haven't seen its implementation. I suspect it's scheduler only change?
> If I can use migrate_disable() without RT it will help my work on sleepable
> BPF programs. I would only have to worry about rcu_read_lock() since
> preempt_disable() is nicely addressed.

You have to talk to Peter Zijlstra about this as this is really
scheduler relevant stuff. FYI, he fundamentally hates migrate_disable()
from a schedulability POV, but as with the above lock construct the
amount of better solutions is also close to zero.

Thanks,

        tglx


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-19  0:49         ` Thomas Gleixner
@ 2020-02-19  1:23           ` Alexei Starovoitov
  0 siblings, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2020-02-19  1:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar

On Wed, Feb 19, 2020 at 01:49:57AM +0100, Thomas Gleixner wrote:
> Alexei,
> 
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> > Overall looks great.
> > Thank you for taking time to write commit logs and detailed cover letter.
> > I think s/__this_cpu_inc/this_cpu_inc/ is the only bit that needs to be
> > addressed for it to be merged.
> > There were few other suggestions from Mathieu and Jakub.
> > Could you address them and resend?
> 
> I have them fixed up already, but I was waiting for further
> comments. I'll send it out tomorrow morning as I'm dead tired by now.
> 
> > I saw patch 1 landing in tip tree, but it needs to be in bpf-next as well
> > along with the rest of the series. Does it really need to be in the tip?
> > I would prefer to take the whole thing and avoid conflicts around
> > migrate_disable() especially if nothing in tip is going to use it in this
> > development cycle. So just drop patch 1 from the tip?
> 
> I'll add patch 2 to a tip branch as well and I'll give you a tag to pull
> into BPF (which has only those two commits). 

That works too.

> > Regarding
> > union {
> >    raw_spinlock_t  raw_lock;
> >    spinlock_t      lock;
> > };
> > yeah. it's not pretty, but I also don't have better ideas.
> 
> Yeah. I really tried hard to avoid it, but the alternative solution was
> code duplication which was even more horrible.
> 
> > Regarding migrate_disable()... can you enable it without the rest of RT?
> > I haven't seen its implementation. I suspect it's scheduler only change?
> > If I can use migrate_disable() without RT it will help my work on sleepable
> > BPF programs. I would only have to worry about rcu_read_lock() since
> > preempt_disable() is nicely addressed.
> 
> You have to talk to Peter Zijlstra about this as this is really
> scheduler relevant stuff. FYI, he fundamentally hates migrate_disable()
> from a schedulability POV, but as with the above lock construct the
> amount of better solutions is also close to zero.

I would imagine that migrate_disable is like temporary cpu pinning. The
scheduler has to have all the logic to make scheduling decisions quickly in
the presence of pinned tasks, and additional migrate_disable shouldn't
introduce slowdowns or complexity to the critical path. Anyway, we'll
discuss further when the migrate_disable patches hit the mailing lists.


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-14 13:39 ` [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites Thomas Gleixner
@ 2020-02-19  1:39   ` Vinicius Costa Gomes
  2020-02-19  9:00     ` Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Vinicius Costa Gomes @ 2020-02-19  1:39 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

Hi,

Thomas Gleixner <tglx@linutronix.de> writes:

> From: David Miller <davem@davemloft.net>
>
> All of these cases are strictly of the form:
>
> 	preempt_disable();
> 	BPF_PROG_RUN(...);
> 	preempt_enable();
>
> Replace this with BPF_PROG_RUN_PIN_ON_CPU() which wraps BPF_PROG_RUN()
> with:
>
> 	migrate_disable();
> 	BPF_PROG_RUN(...);
> 	migrate_enable();
>
> On non RT enabled kernels this maps to preempt_disable/enable() and on RT
> enabled kernels this solely prevents migration, which is sufficient as
> there is no requirement to prevent reentrancy to any BPF program from a
> preempting task. The only requirement is that the program stays on the same
> CPU.
>
> Therefore, this is a trivially correct transformation.
>
> [ tglx: Converted to BPF_PROG_RUN_PIN_ON_CPU() ]
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
> ---
>  include/linux/filter.h    |    4 +---
>  kernel/seccomp.c          |    4 +---
>  net/core/flow_dissector.c |    4 +---
>  net/core/skmsg.c          |    8 ++------
>  net/kcm/kcmsock.c         |    4 +---
>  5 files changed, 6 insertions(+), 18 deletions(-)
>
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -713,9 +713,7 @@ static inline u32 bpf_prog_run_clear_cb(
>  	if (unlikely(prog->cb_access))
>  		memset(cb_data, 0, BPF_SKB_CB_LEN);
>  
> -	preempt_disable();
> -	res = BPF_PROG_RUN(prog, skb);
> -	preempt_enable();
> +	res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
>  	return res;
>  }
>  
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str
>  	 * All filters in the list are evaluated and the lowest BPF return
>  	 * value always takes priority (ignoring the DATA).
>  	 */
> -	preempt_disable();
>  	for (; f; f = f->prev) {
> -		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
> +		u32 cur_ret = BPF_PROG_RUN_PIN_ON_CPU(f->prog, sd);
>

More a question really, isn't the behavior changing here? i.e. shouldn't
migrate_disable()/migrate_enable() be moved to outside the loop? Or is
running seccomp filters on different cpus not a problem?
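
For concreteness, the hoisted variant I have in mind would look roughly
like this (sketch only, loop body as in seccomp_run_filters()):

	/* Pin to the CPU once across the whole filter list walk. */
	migrate_disable();
	for (; f; f = f->prev) {
		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);

		if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
			ret = cur_ret;
			*match = f;
		}
	}
	migrate_enable();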

-- 
Vinicius


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-19  1:39   ` Vinicius Costa Gomes
@ 2020-02-19  9:00     ` Thomas Gleixner
  2020-02-19 16:38       ` Alexei Starovoitov
  2020-02-21  0:20       ` Kees Cook
  0 siblings, 2 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-19  9:00 UTC (permalink / raw)
  To: Vinicius Costa Gomes, LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar, Kees Cook, Will Drewry, Andy Lutomirski

Vinicius Costa Gomes <vinicius.gomes@intel.com> writes:

Cc+: seccomp folks 

> Thomas Gleixner <tglx@linutronix.de> writes:
>
>> From: David Miller <davem@davemloft.net>

Leaving content for reference

>> All of these cases are strictly of the form:
>>
>> 	preempt_disable();
>> 	BPF_PROG_RUN(...);
>> 	preempt_enable();
>>
>> Replace this with BPF_PROG_RUN_PIN_ON_CPU() which wraps BPF_PROG_RUN()
>> with:
>>
>> 	migrate_disable();
>> 	BPF_PROG_RUN(...);
>> 	migrate_enable();
>>
>> On non RT enabled kernels this maps to preempt_disable/enable() and on RT
>> enabled kernels this solely prevents migration, which is sufficient as
>> there is no requirement to prevent reentrancy to any BPF program from a
>> preempting task. The only requirement is that the program stays on the same
>> CPU.
>>
>> Therefore, this is a trivially correct transformation.
>>
>> [ tglx: Converted to BPF_PROG_RUN_PIN_ON_CPU() ]
>>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>>
>> ---
>>  include/linux/filter.h    |    4 +---
>>  kernel/seccomp.c          |    4 +---
>>  net/core/flow_dissector.c |    4 +---
>>  net/core/skmsg.c          |    8 ++------
>>  net/kcm/kcmsock.c         |    4 +---
>>  5 files changed, 6 insertions(+), 18 deletions(-)
>>
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -713,9 +713,7 @@ static inline u32 bpf_prog_run_clear_cb(
>>  	if (unlikely(prog->cb_access))
>>  		memset(cb_data, 0, BPF_SKB_CB_LEN);
>>  
>> -	preempt_disable();
>> -	res = BPF_PROG_RUN(prog, skb);
>> -	preempt_enable();
>> +	res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
>>  	return res;
>>  }
>>  
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str
>>  	 * All filters in the list are evaluated and the lowest BPF return
>>  	 * value always takes priority (ignoring the DATA).
>>  	 */
>> -	preempt_disable();
>>  	for (; f; f = f->prev) {
>> -		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
>> +		u32 cur_ret = BPF_PROG_RUN_PIN_ON_CPU(f->prog, sd);
>>
>
> More a question really, isn't the behavior changing here? i.e. shouldn't
> migrate_disable()/migrate_enable() be moved to outside the loop? Or is
> running seccomp filters on different cpus not a problem?

In my understanding this is a list of filters and they are independent
of each other.

Kees, Will, Andy?

Thanks,

        tglx


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-18 23:36       ` Alexei Starovoitov
  2020-02-19  0:49         ` Thomas Gleixner
@ 2020-02-19 15:17         ` Mathieu Desnoyers
  2020-02-20  4:19           ` Alexei Starovoitov
  1 sibling, 1 reply; 42+ messages in thread
From: Mathieu Desnoyers @ 2020-02-19 15:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Gleixner, linux-kernel, David S. Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Andrzej Siewior,
	Peter Zijlstra, Clark Williams, rostedt, Juri Lelli, Ingo Molnar

----- On Feb 18, 2020, at 6:36 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:

[...]

> If I can use migrate_disable() without RT it will help my work on sleepable
> BPF programs. I would only have to worry about rcu_read_lock() since
> preempt_disable() is nicely addressed.

Hi Alexei,

You may want to consider using SRCU rather than RCU if you need to sleep while
holding an RCU read-side lock.

This is the synchronization approach I consider for adding the ability to take page
faults when doing syscall tracing.

Then you'll be able to replace preempt_disable() by combining SRCU and
migrate_disable():

AFAIU eBPF currently uses preempt_disable() for two reasons:

- Ensure the thread is not migrated,
  -> can be replaced by migrate_disable() in RT
- Provide RCU existence guarantee through sched-RCU
  -> can be replaced by SRCU, which allows sleeping and taking page faults.

I wonder if it would be acceptable to take a page fault while migration is
disabled though?
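
A minimal sketch of that combination (the function name and the SRCU domain
are made up here, nothing like this exists in the series):

DEFINE_STATIC_SRCU(bpf_run_srcu);	/* hypothetical SRCU domain */

static u32 bpf_prog_run_sleepable(const struct bpf_prog *prog, const void *ctx)
{
	int idx;
	u32 ret;

	migrate_disable();			/* stay on this CPU, remain preemptible */
	idx = srcu_read_lock(&bpf_run_srcu);	/* existence guarantee, may sleep */
	ret = BPF_PROG_RUN(prog, ctx);
	srcu_read_unlock(&bpf_run_srcu, idx);
	migrate_enable();
	return ret;
}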

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-19  9:00     ` Thomas Gleixner
@ 2020-02-19 16:38       ` Alexei Starovoitov
  2020-02-21  0:20       ` Kees Cook
  1 sibling, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2020-02-19 16:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vinicius Costa Gomes, LKML, David Miller, bpf,
	Network Development, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar, Kees Cook, Will Drewry, Andy Lutomirski

On Wed, Feb 19, 2020 at 1:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Vinicius Costa Gomes <vinicius.gomes@intel.com> writes:
>
> Cc+: seccomp folks
>
> > Thomas Gleixner <tglx@linutronix.de> writes:
> >
> >> From: David Miller <davem@davemloft.net>
>
> Leaving content for reference
>
> >> All of these cases are strictly of the form:
> >>
> >>      preempt_disable();
> >>      BPF_PROG_RUN(...);
> >>      preempt_enable();
> >>
> >> Replace this with BPF_PROG_RUN_PIN_ON_CPU() which wraps BPF_PROG_RUN()
> >> with:
> >>
> >>      migrate_disable();
> >>      BPF_PROG_RUN(...);
> >>      migrate_enable();
> >>
> >> On non RT enabled kernels this maps to preempt_disable/enable() and on RT
> >> enabled kernels this solely prevents migration, which is sufficient as
> >> there is no requirement to prevent reentrancy to any BPF program from a
> >> preempting task. The only requirement is that the program stays on the same
> >> CPU.
> >>
> >> Therefore, this is a trivially correct transformation.
> >>
> >> [ tglx: Converted to BPF_PROG_RUN_PIN_ON_CPU() ]
> >>
> >> Signed-off-by: David S. Miller <davem@davemloft.net>
> >> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >>
> >> ---
> >>  include/linux/filter.h    |    4 +---
> >>  kernel/seccomp.c          |    4 +---
> >>  net/core/flow_dissector.c |    4 +---
> >>  net/core/skmsg.c          |    8 ++------
> >>  net/kcm/kcmsock.c         |    4 +---
> >>  5 files changed, 6 insertions(+), 18 deletions(-)
> >>
> >> --- a/include/linux/filter.h
> >> +++ b/include/linux/filter.h
> >> @@ -713,9 +713,7 @@ static inline u32 bpf_prog_run_clear_cb(
> >>      if (unlikely(prog->cb_access))
> >>              memset(cb_data, 0, BPF_SKB_CB_LEN);
> >>
> >> -    preempt_disable();
> >> -    res = BPF_PROG_RUN(prog, skb);
> >> -    preempt_enable();
> >> +    res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
> >>      return res;
> >>  }
> >>
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str
> >>       * All filters in the list are evaluated and the lowest BPF return
> >>       * value always takes priority (ignoring the DATA).
> >>       */
> >> -    preempt_disable();
> >>      for (; f; f = f->prev) {
> >> -            u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
> >> +            u32 cur_ret = BPF_PROG_RUN_PIN_ON_CPU(f->prog, sd);
> >>
> >
> > More a question really, isn't the behavior changing here? i.e. shouldn't
> > migrate_disable()/migrate_enable() be moved to outside the loop? Or is
> > running seccomp filters on different cpus not a problem?
>
> In my understanding this is a list of filters and they are independent
> of each other.

Yes. It's fine to be preempted between filters.


* Re: [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()
  2020-02-14 13:39 ` [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() Thomas Gleixner
@ 2020-02-19 16:54   ` Steven Rostedt
  2020-02-19 17:26     ` Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2020-02-19 16:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Juri Lelli, Ingo Molnar

On Fri, 14 Feb 2020 14:39:21 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> __bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.
> 
> This is redundant because __bpf_trace_run() is invoked from a trace point
> via __DO_TRACE() which already disables preemption _before_ invoking any of
> the functions which are attached to a trace point.
> 
> Remove it.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/trace/bpf_trace.c |    2 --
>  1 file changed, 2 deletions(-)
> 
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1476,9 +1476,7 @@ static __always_inline
>  void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
>  {

Should there be a "cant_migrate()" added here?

-- Steve

>  	rcu_read_lock();
> -	preempt_disable();
>  	(void) BPF_PROG_RUN(prog, args);
> -	preempt_enable();
>  	rcu_read_unlock();
>  }
>  



* Re: [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()
  2020-02-19 16:54   ` Steven Rostedt
@ 2020-02-19 17:26     ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-19 17:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Juri Lelli, Ingo Molnar

Steven Rostedt <rostedt@goodmis.org> writes:

> On Fri, 14 Feb 2020 14:39:21 +0100
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> __bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.
>> 
>> This is redundant because __bpf_trace_run() is invoked from a trace point
>> via __DO_TRACE() which already disables preemption _before_ invoking any of
>> the functions which are attached to a trace point.
>> 
>> Remove it.
>> 
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> ---
>>  kernel/trace/bpf_trace.c |    2 --
>>  1 file changed, 2 deletions(-)
>> 
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -1476,9 +1476,7 @@ static __always_inline
>>  void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
>>  {
>
> Should there be a "cant_migrate()" added here?

A cant_sleep() is the right thing to add as this really needs to stay
non-preemptible. Hmm?

>>  	rcu_read_lock();
>> -	preempt_disable();
>>  	(void) BPF_PROG_RUN(prog, args);
>> -	preempt_enable();
>>  	rcu_read_unlock();
>>  }
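
I.e. something like the sketch below (illustration only, assuming
cant_sleep() is the annotation we settle on):

static __always_inline
void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
{
	cant_sleep();
	rcu_read_lock();
	(void) BPF_PROG_RUN(prog, args);
	rcu_read_unlock();
}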

Thanks,

        tglx



* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-19 15:17         ` Mathieu Desnoyers
@ 2020-02-20  4:19           ` Alexei Starovoitov
  0 siblings, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2020-02-20  4:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, David S. Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Andrzej Siewior,
	Peter Zijlstra, Clark Williams, rostedt, Juri Lelli, Ingo Molnar

On Wed, Feb 19, 2020 at 10:17:28AM -0500, Mathieu Desnoyers wrote:
> ----- On Feb 18, 2020, at 6:36 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> 
> [...]
> 
> > If I can use migrate_disable() without RT it will help my work on sleepable
> > BPF programs. I would only have to worry about rcu_read_lock() since
> > preempt_disable() is nicely addressed.
> 
> Hi Alexei,
> 
> You may want to consider using SRCU rather than RCU if you need to sleep while
> holding an RCU read-side lock.
> 
> This is the synchronization approach I consider for adding the ability to take page
> faults when doing syscall tracing.
> 
> Then you'll be able to replace preempt_disable() by combining SRCU and
> migrate_disable():
> 
> AFAIU eBPF currently uses preempt_disable() for two reasons:
> 
> - Ensure the thread is not migrated,
>   -> can be replaced by migrate_disable() in RT
> - Provide RCU existence guarantee through sched-RCU
>   -> can be replaced by SRCU, which allows sleeping and taking page faults.

bpf is using normal rcu to protect map values
and rcu+preempt to protect per-cpu map values.
srcu is certainly under consideration. It hasn't been used due to performance
implications. atomics and barriers are too heavy for certain use cases. So we
have to keep rcu where performance matters, but cannot fork map implementations
to rcu and srcu due to huge code bloat. So far I've been thinking of
introducing an explicit helper bpf_rcu_read_lock() and letting programs use
it directly instead of the implicit rcu_read_lock() that is done outside of
the bpf prog. The tricky part is teaching the verifier to enforce the
critical section.
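
Purely as an illustration of the idea (nothing like this exists today; the
helper bodies are trivial, the func_proto plumbing and the verifier side,
which is the hard part, are not shown):

BPF_CALL_0(bpf_rcu_read_lock)
{
	rcu_read_lock();
	return 0;
}

BPF_CALL_0(bpf_rcu_read_unlock)
{
	rcu_read_unlock();
	return 0;
}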


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-19  9:00     ` Thomas Gleixner
  2020-02-19 16:38       ` Alexei Starovoitov
@ 2020-02-21  0:20       ` Kees Cook
  2020-02-21 14:00         ` Thomas Gleixner
  1 sibling, 1 reply; 42+ messages in thread
From: Kees Cook @ 2020-02-21  0:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vinicius Costa Gomes, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar, Will Drewry, Andy Lutomirski

On Wed, Feb 19, 2020 at 10:00:56AM +0100, Thomas Gleixner wrote:
> Vinicius Costa Gomes <vinicius.gomes@intel.com> writes:
> 
> Cc+: seccomp folks 
> 
> > Thomas Gleixner <tglx@linutronix.de> writes:
> >
> >> From: David Miller <davem@davemloft.net>
> 
> Leaving content for reference
> 
> >> All of these cases are strictly of the form:
> >>
> >> 	preempt_disable();
> >> 	BPF_PROG_RUN(...);
> >> 	preempt_enable();
> >>
> >> Replace this with BPF_PROG_RUN_PIN_ON_CPU() which wraps BPF_PROG_RUN()
> >> with:
> >>
> >> 	migrate_disable();
> >> 	BPF_PROG_RUN(...);
> >> 	migrate_enable();
> >>
> >> On non RT enabled kernels this maps to preempt_disable/enable() and on RT
> >> enabled kernels this solely prevents migration, which is sufficient as
> >> there is no requirement to prevent reentrancy to any BPF program from a
> >> preempting task. The only requirement is that the program stays on the same
> >> CPU.
> >>
> >> Therefore, this is a trivially correct transformation.
> >>
> >> [ tglx: Converted to BPF_PROG_RUN_PIN_ON_CPU() ]
> >>
> >> Signed-off-by: David S. Miller <davem@davemloft.net>
> >> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >>
> >> ---
> >>  include/linux/filter.h    |    4 +---
> >>  kernel/seccomp.c          |    4 +---
> >>  net/core/flow_dissector.c |    4 +---
> >>  net/core/skmsg.c          |    8 ++------
> >>  net/kcm/kcmsock.c         |    4 +---
> >>  5 files changed, 6 insertions(+), 18 deletions(-)
> >>
> >> --- a/include/linux/filter.h
> >> +++ b/include/linux/filter.h
> >> @@ -713,9 +713,7 @@ static inline u32 bpf_prog_run_clear_cb(
> >>  	if (unlikely(prog->cb_access))
> >>  		memset(cb_data, 0, BPF_SKB_CB_LEN);
> >>  
> >> -	preempt_disable();
> >> -	res = BPF_PROG_RUN(prog, skb);
> >> -	preempt_enable();
> >> +	res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
> >>  	return res;
> >>  }
> >>  
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str
> >>  	 * All filters in the list are evaluated and the lowest BPF return
> >>  	 * value always takes priority (ignoring the DATA).
> >>  	 */
> >> -	preempt_disable();
> >>  	for (; f; f = f->prev) {
> >> -		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
> >> +		u32 cur_ret = BPF_PROG_RUN_PIN_ON_CPU(f->prog, sd);
> >>
> >
> > More a question really, isn't the behavior changing here? i.e. shouldn't
> > migrate_disable()/migrate_enable() be moved to outside the loop? Or is
> > running seccomp filters on different cpus not a problem?
> 
> In my understanding this is a list of filters and they are independent
> of each other.
> 
> Kees, Will, Andy?

They're technically independent, but they are related to each
other. (i.e. order matters, process hierarchy matters, etc). There's no
reason I can see that we can't switch CPUs between running them, though.
(AIUI, nothing here would suddenly make these run in parallel, right?)

As long as "current" is still "current", and they run in the same order,
we'll get the same final result as far as seccomp is concerned.

-- 
Kees Cook
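
As a reading aid for the conversion quoted above, a minimal sketch of what
the BPF_PROG_RUN_PIN_ON_CPU() wrapper is described as doing. This is based
only on the commit message, not on the actual patch introducing the macro,
so details may differ:

/* Sketch, not the series' literal code; assumes kernel context (u32,
 * migrate_disable() and BPF_PROG_RUN() as discussed in this thread).
 * On a !RT kernel migrate_disable() maps to preempt_disable(); on RT it
 * only pins the task to the current CPU, which is sufficient because
 * reentrancy from a preempting task is not a problem -- the program
 * merely has to run to completion on one CPU. */
#define BPF_PROG_RUN_PIN_ON_CPU(prog, ctx) ({		\
	u32 __ret;					\
	migrate_disable();				\
	__ret = BPF_PROG_RUN(prog, ctx);		\
	migrate_enable();				\
	__ret;						\
})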


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-21  0:20       ` Kees Cook
@ 2020-02-21 14:00         ` Thomas Gleixner
  2020-02-21 14:05           ` Peter Zijlstra
  2020-02-21 22:15           ` Kees Cook
  0 siblings, 2 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-21 14:00 UTC (permalink / raw)
  To: Kees Cook
  Cc: Vinicius Costa Gomes, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar, Will Drewry, Andy Lutomirski

Kees Cook <keescook@chromium.org> writes:
> On Wed, Feb 19, 2020 at 10:00:56AM +0100, Thomas Gleixner wrote:
>> Vinicius Costa Gomes <vinicius.gomes@intel.com> writes:
>> > More a question really, isn't the behavior changing here? i.e. shouldn't
>> > migrate_disable()/migrate_enable() be moved to outside the loop? Or is
>> > running seccomp filters on different cpus not a problem?
>> 
>> In my understanding this is a list of filters and they are independent
>> of each other.
>> 
>> Kees, Will, Andy?
>
> They're technically independent, but they are related to each
> other. (i.e. order matters, process hierarchy matters, etc). There's no
> reason I can see that we can't switch CPUs between running them, though.
> (AIUI, nothing here would suddenly make these run in parallel, right?)

Of course not. If we'd run the same thread on multiple CPUs in parallel
the ordering of your BPF programs would be the least of your worries.

> As long as "current" is still "current", and they run in the same order,
> we'll get the same final result as far as seccomp is concerned.

Right.

Thanks,

        tglx
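
To make the ordering argument concrete, a toy user-space model (names and
constants are invented, loosely mimicking the seccomp return values; this is
not kernel code): the verdict is a pure reduction over the same syscall
snapshot of current, so which CPU happens to evaluate a given filter cannot
change the result.

#include <stdint.h>
#include <stdio.h>

struct sd_model { int nr; uint64_t args[6]; };
typedef uint32_t (*filter_fn)(const struct sd_model *sd);

static uint32_t f_allow(const struct sd_model *sd) { (void)sd; return 0x7fff0000u; }
static uint32_t f_errno(const struct sd_model *sd) { (void)sd; return 0x00050001u; }

static uint32_t run_filters(filter_fn *f, int n, const struct sd_model *sd)
{
	uint32_t ret = 0x7fff0000u;		/* "allow" in this model */

	for (int i = 0; i < n; i++) {
		uint32_t cur = f[i](sd);	/* may migrate between iterations */

		if ((cur & 0xffff0000u) < (ret & 0xffff0000u))
			ret = cur;		/* lowest action takes priority */
	}
	return ret;
}

int main(void)
{
	filter_fn filters[] = { f_allow, f_errno };
	struct sd_model sd = { .nr = 2 };

	printf("verdict: 0x%08x\n", run_filters(filters, 2, &sd));
	return 0;
}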


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-21 14:00         ` Thomas Gleixner
@ 2020-02-21 14:05           ` Peter Zijlstra
  2020-02-21 22:15           ` Kees Cook
  1 sibling, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2020-02-21 14:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kees Cook, Vinicius Costa Gomes, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar,
	Will Drewry, Andy Lutomirski

On Fri, Feb 21, 2020 at 03:00:54PM +0100, Thomas Gleixner wrote:
> Of course not. If we'd run the same thread on multiple CPUs in parallel
> the ordering of your BPF programs would be the least of your worries.

Been there, done that. It goes sideways *REALLY* fast :-)


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-21 14:00         ` Thomas Gleixner
  2020-02-21 14:05           ` Peter Zijlstra
@ 2020-02-21 22:15           ` Kees Cook
  1 sibling, 0 replies; 42+ messages in thread
From: Kees Cook @ 2020-02-21 22:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vinicius Costa Gomes, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar, Will Drewry, Andy Lutomirski

On Fri, Feb 21, 2020 at 03:00:54PM +0100, Thomas Gleixner wrote:
> Kees Cook <keescook@chromium.org> writes:
> > They're technically independent, but they are related to each
> > other. (i.e. order matters, process hierarchy matters, etc). There's no
> > reason I can see that we can't switch CPUs between running them, though.
> > (AIUI, nothing here would suddenly make these run in parallel, right?)
> 
> Of course not. If we'd run the same thread on multiple CPUs in parallel
> the ordering of your BPF programs would be the least of your worries.

Right, okay, good. I just wanted to be extra sure. :)

-- 
Kees Cook


Thread overview: 42+ messages
2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 02/19] sched: Provide cant_migrate() Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 03/19] bpf: Update locking comment in hashtab code Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() Thomas Gleixner
2020-02-19 16:54   ` Steven Rostedt
2020-02-19 17:26     ` Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 05/19] perf/bpf: Remove preempt disable around BPF invocation Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 06/19] bpf: Dont iterate over possible CPUs with interrupts disabled Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro Thomas Gleixner
2020-02-14 18:50   ` Mathieu Desnoyers
2020-02-14 19:36     ` Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 08/19] bpf: Replace cant_sleep() with cant_migrate() Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites Thomas Gleixner
2020-02-19  1:39   ` Vinicius Costa Gomes
2020-02-19  9:00     ` Thomas Gleixner
2020-02-19 16:38       ` Alexei Starovoitov
2020-02-21  0:20       ` Kees Cook
2020-02-21 14:00         ` Thomas Gleixner
2020-02-21 14:05           ` Peter Zijlstra
2020-02-21 22:15           ` Kees Cook
2020-02-14 13:39 ` [RFC patch 10/19] trace/bpf: Use migrate disable in trace_call_bpf() Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 11/19] bpf/tests: Use migrate disable instead of preempt disable Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 12/19] bpf: Use migrate_disable/enabe() in trampoline code Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 13/19] bpf: Use migrate_disable/enable in array macros and cgroup/lirc code Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code Thomas Gleixner
2020-02-14 19:11   ` Mathieu Desnoyers
2020-02-14 19:56     ` Thomas Gleixner
2020-02-18 23:36       ` Alexei Starovoitov
2020-02-19  0:49         ` Thomas Gleixner
2020-02-19  1:23           ` Alexei Starovoitov
2020-02-19 15:17         ` Mathieu Desnoyers
2020-02-20  4:19           ` Alexei Starovoitov
2020-02-14 13:39 ` [RFC patch 15/19] bpf: Use migrate_disable() in sys_bpf() Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 16/19] bpf: Factor out hashtab bucket lock operations Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 17/19] bpf: Prepare hashtab locking for PREEMPT_RT Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 18/19] bpf, lpm: Make locking RT friendly Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 19/19] bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled Thomas Gleixner
2020-02-14 17:53 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist David Miller
2020-02-14 18:36   ` Thomas Gleixner
2020-02-17 12:59     ` [PATCH] bpf: Enforce map preallocation for all instrumentation programs Thomas Gleixner
2020-02-15 20:09 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Jakub Kicinski
