* [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist
@ 2020-02-14 13:39 Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines Thomas Gleixner
                   ` (20 more replies)
  0 siblings, 21 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

Hi!

This is a follow up to the initial patch series which David posted a while
ago:

 https://lore.kernel.org/bpf/20191207.160357.828344895192682546.davem@davemloft.net/

which was (while non-functional on RT) a good starting point for further
investigations.

PREEMPT_RT aims to make the kernel fully preemptible except for a few low
level operations which have to be unpreemptible, like scheduler, low level
entry/exit code, low level interrupt handling, locking internals etc.

This is achieved by the following main transformations (there are some
more minor ones, but they don't matter in this context):

   1) All interrupts which are not explicitly marked IRQF_NO_THREAD are
      forced into threaded interrupt handlers. This applies to almost all
      device interrupt handlers.

   2) hrtimers which are not explicitly marked to expire in hard interrupt
      context are expired in soft interrupt context.

   3) Soft interrupts are also forced into thread context. They run either
      in the context of a threaded interrupt handler (similar to the !RT
      mode of running on return from interrupt), in ksoftirqd or in a
      thread context which raised a soft interrupt (same as in !RT).

      Soft interrupts (BH) are serialized per CPU so the usual bottom half
      protections still work.

   4) spinlocks and rwlocks are substituted with 'sleeping spin/rwlocks'.
      The internal representation is a priority inheritance aware rtmutex
      which has glue code attached to preserve the spin/rwlock semantics
      vs. task state. They neither disable preemption nor interrupts, which
      is fine as the interrupt handler they want to be protected against is
      forced into thread context.

      The true hard interrupt handlers do not affect these sections as
      they are completely independent.

      The code paths which need to be atomic are using raw_spinlocks which
      disable preemption (and/or interrupts).

      On a non RT kernel spinlocks are mapped to raw_spinlocks so everything
      works as usual.

As a consequence allocation of memory is not possible from truly atomic
contexts on RT, even with GFP_ATOMIC set. This is required because the slab
allocator can run into a situation even with GFP_ATOMIC where it needs to
call into the page allocator, which in turn takes regular spinlocks. As the
RT substitution of regular spinlocks might sleep, it's obvious that memory
allocations from truly atomic contexts on RT are not permitted. The page
allocator locks cannot be converted to raw locks as the length of the
resulting preempt/interrupt disabled sections is way above the tolerance
level of demanding realtime applications.
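
A condensed (purely illustrative, not from the patches) example of the
constraint; 'some_lock', 'some_raw_lock' and 'p' are just placeholders:

	/*
	 * Truly atomic on RT as well because raw_spin_lock() disables
	 * preemption. kmalloc(GFP_ATOMIC) can still end up in the page
	 * allocator, which takes regular spinlocks that sleep on RT.
	 */
	raw_spin_lock(&some_raw_lock);
	p = kmalloc(size, GFP_ATOMIC);		/* Not allowed on RT */
	raw_spin_unlock(&some_raw_lock);

	/*
	 * Fine on RT: the substituted 'sleeping' spinlock keeps the
	 * section preemptible, so the allocation can take the page
	 * allocator locks if needed.
	 */
	spin_lock(&some_lock);
	p = kmalloc(size, GFP_ATOMIC);		/* Works on RT and !RT */
	spin_unlock(&some_lock);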

So this leads to a conflict with the current BPF implementation. BPF
disables preemption around:

  1) The invocation of BPF programs. This is required to guarantee
     that the program runs to completion on one CPU.

  2) The invocation of map operations from sys_bpf(). This is required
     to protect the operation against BPF programs which access the
     same map from perf, kprobes or tracing context because the syscall
     operation has to lock the hashtab bucket lock.

     If the perf NMI, kprobe or tracepoint happens inside the bucket lock
     held region and the BPF program needs to access the same bucket then
     the system would deadlock.

     The mechanism used for this is to increment the per CPU recursion
     protection which prevents such programs from being executed. The same
     per CPU variable is used to prevent recursion of the programs
     themselves, which might happen when e.g. a kprobe attached BPF program
     triggers another kprobe attached BPF program. A condensed sketch of
     the pattern follows below.
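
     The sketch (error handling omitted; the same pattern shows up in the
     syscall.c hunks later in this series):

	preempt_disable();			/* Pin the task to this CPU */
	__this_cpu_inc(bpf_prog_active);	/* Fend off BPF programs */

	/* ... lock the hashtab bucket and update/delete the element ... */

	__this_cpu_dec(bpf_prog_active);
	preempt_enable();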

In principle this works on RT as well, as long as the BPF programs use
preallocated maps, which is required for trace type programs to prevent
deadlocks of all sorts. Other BPF programs which run in softirq context,
e.g. packet filtering, or in thread context, e.g. syscall auditing, have no
requirement for preallocated maps. They can allocate memory when required.

But as explained above on RT memory allocation from truly atomic context is
not permitted, which requires a few tweaks to the BPF code.

The initial misunderstanding on my side was that the preempt disable around
the invocations of BPF programs is not only required to prevent migration
to a different CPU (in case the program is invoked from preemptible
context) but is also required to prevent reentrancy from a preempting
task. In hindsight I should have figured out myself that this is not the
case because the same program can run concurrently on a different CPU or
from a different context (e.g. interrupt). Alexei thankfully enlightened me
recently over a beer that the real intent here is to guarantee that the
program runs to completion on the same CPU where it started. The same is
true for the syscall side as this just has to guarantee the per CPU
recursion protection.

This part of the problem is trivial. RT provides a true migrate_disable()
mechanism which does not disable preemption. So the preempt_disable /
enable() pairs can be replaced with migrate_disable / enable() pairs.
migrate_disable / enable() maps to preempt_disable / enable() on a
not-RT kernel, so there is no functional change when RT is disabled.

But there is another issue which is not as straightforward to solve:

The map operations which deal with elements in hashtab buckets (or other
map mechanisms) have spinlocks to protect them against concurrent access
from other CPUs. These spinlocks are raw_spinlocks, so they disable
preemption and interrupts (_irqsave()). Not a problem per se, but some of
these code paths invoke memory allocations with a bucket lock held. This
obviously conflicts with the RT semantics.

The easy way out would be to convert these locks to regular spinlocks which
can be substituted by RT. Works like a charm for both RT and !RT for BPF
programs which run in thread context (includes soft interrupt and device
interrupt handler context on RT).

But there are also the BPF programs which run in truly atomic context even
on a RT kernel, i.e. tracing types (perf, perf NMI, kprobes and trace).

For obvious reasons these contexts cannot take regular spinlocks on RT
because RT substitutes them with sleeping spinlocks. might_sleep() splats
from NMI context are not really desired and on lock contention the
scheduler explodes in colourful ways. Same problem for kprobes and
tracepoints. So these program types need a raw spinlock which brings us
back to square one.

But there is an important detail that comes to the rescue. The trace type programs
require preallocated maps to prevent deadlocks of all sorts in the memory
allocator. Preallocated maps never call into the memory allocator or other
code paths which might sleep on a RT kernel. The allocation type is known
when the program and the map are initialized.

This allows differentiating between lock types for preallocated and
run-time allocated maps. While not pretty, the proposed solution is to have
a lock union in the bucket:

	union {
		raw_spinlock_t	raw_lock;
		spinlock_t	lock;
	};

and have init/lock/unlock helpers which handle the lock type depending on
the allocation mechanism, i.e. for preallocated maps it uses the raw lock
and for dynamic allocations it uses the regular spinlock. The locks in the
percpu_freelist need to stay raw as well, as they nest into the bucket lock
held section, which works for both the raw and the regular spinlock
variant.
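
A minimal sketch of how such a lock helper could look (the helper name
htab_use_raw_lock() and the details are assumptions here; the real thing is
in the hashtab patches at the end of the series):

	static inline unsigned long htab_lock_bucket(const struct bpf_htab *htab,
						     struct bucket *b)
	{
		unsigned long flags;

		if (htab_use_raw_lock(htab))	/* Preallocated map */
			raw_spin_lock_irqsave(&b->raw_lock, flags);
		else				/* Run-time allocated map */
			spin_lock_irqsave(&b->lock, flags);
		return flags;
	}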

I'm not proud of that, but I really couldn't come up with anything better
aside from completely splitting the code paths, which would be even worse
due to the resulting code duplication.

The lock in the LPM trie map needs to become a regular spinlock as well,
as the trie map is based on dynamic allocation. But again this is not a
problem as this map cannot be used for the critical types anyway.

I kept the LRU and the stack map locks raw for now as they do not use
dynamic allocations, but I haven't done any testing on latency impact
yet. The stack map critical sections are truly short, so they should not
matter at all, but the LRU ones could. That can be revisited once we have
numbers. I assume that LRU is not the right choice for the trace type
programs anyway, but who knows.

The last issue, which is trivial to solve (Dave already solved it), is the
non-owner release of mmap_sem, which is forbidden on RT as well. So this
just forces the IP fallback path.

On a non RT kernel this does not change anything. The compiler optimizes
the lock magic out and everything maps to the state before these changes.

This survives the selftests and a bunch of different invocations of
bpftrace on mainline and on a backport to 5.4-rt.

But of course I surely have missed some details here and there, so please
have a close look at this.

The diffstat of hashtab.c is so large because I sat down and documented the
above write up in condensed form in a comment on top of the file as I
wanted to spare others the dubious experience of having to reverse engineer
the inner workings of BPF.

Thanks,

	tglx

8<----------------
 include/linux/bpf.h          |    8 +-
 include/linux/filter.h       |   33 ++++++--
 include/linux/kernel.h       |    7 +
 include/linux/preempt.h      |   30 +++++++
 kernel/bpf/hashtab.c         |  164 ++++++++++++++++++++++++++++++++-----------
 kernel/bpf/lpm_trie.c        |   12 +--
 kernel/bpf/percpu_freelist.c |   20 ++---
 kernel/bpf/stackmap.c        |   18 +++-
 kernel/bpf/syscall.c         |   16 ++--
 kernel/bpf/trampoline.c      |    9 +-
 kernel/events/core.c         |    2 
 kernel/seccomp.c             |    4 -
 kernel/trace/bpf_trace.c     |    6 -
 lib/test_bpf.c               |    4 -
 net/bpf/test_run.c           |    8 +-
 net/core/flow_dissector.c    |    4 -
 net/core/skmsg.c             |    8 --
 net/kcm/kcmsock.c            |    4 -
 18 files changed, 248 insertions(+), 109 deletions(-)







* [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 02/19] sched: Provide cant_migrate() Thomas Gleixner
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman

Currently code which solely needs to prevent migration of a task uses
preempt_disable()/enable() pairs. This is the only reliable way to do so as
setting the task affinity to a single CPU can be undone by a setaffinity
operation from a different task/process. It's also significantly faster.

RT provides a separate migrate_disable/enable() mechanism which does not
disable preemption to achieve the semantic requirements of an (almost) fully
preemptible kernel.

As it is unclear from looking at a given code path whether the intention is
to disable preemption or migration, introduce migrate_disable/enable()
inline functions which can be used to annotate code which merely needs to
disable migration. Map them to preempt_disable/enable() for now. The RT
substitution will be provided later.

Code which is annotated that way documents that it has no requirement to
protect against reentrancy of a preempting task. Either this is not
required at all or the call sites are already serialized by other means.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
---
 include/linux/preempt.h |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -322,4 +322,34 @@ static inline void preempt_notifier_init
 
 #endif
 
+/**
+ * migrate_disable - Prevent migration of the current task
+ *
+ * Maps to preempt_disable() which also disables preemption. Use
+ * migrate_disable() to annotate that the intent is to prevent migration
+ * but not necessarily preemption.
+ *
+ * Can be invoked nested like preempt_disable() and needs the corresponding
+ * number of migrate_enable() invocations.
+ */
+static __always_inline void migrate_disable(void)
+{
+	preempt_disable();
+}
+
+/**
+ * migrate_enable - Allow migration of the current task
+ *
+ * Counterpart to migrate_disable().
+ *
+ * As migrate_disable() can be invoked nested, only the outermost invocation
+ * reenables migration.
+ *
+ * Currently mapped to preempt_enable().
+ */
+static __always_inline void migrate_enable(void)
+{
+	preempt_enable();
+}
+
 #endif /* __LINUX_PREEMPT_H */



* [RFC patch 02/19] sched: Provide cant_migrate()
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 03/19] bpf: Update locking comment in hashtab code Thomas Gleixner
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

Some code paths rely on preempt_disable() to prevent migration on a non RT
enabled kernel. These preempt_disable/enable() pairs are substituted by
migrate_disable/enable() pairs or other forms of RT specific protection. On
RT these protections prevent migration but not preemption. Obviously a
cant_sleep() check in such a section will trigger on RT because preemption
is not disabled.

Provide a cant_migrate() macro which maps to cant_sleep() on a non RT
kernel and an empty placeholder for RT for now. The placeholder will be
changed to a proper debug check along with the RT specific migration
protection mechanism.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/kernel.h |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -257,6 +257,13 @@ extern void __cant_sleep(const char *fil
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
 
+#ifndef CONFIG_PREEMPT_RT
+# define cant_migrate()		cant_sleep()
+#else
+  /* Placeholder for now */
+# define cant_migrate()		do { } while (0)
+#endif
+
 /**
  * abs - return absolute value of an argument
  * @x: the value.  If it is unsigned type, it is converted to signed type first.



* [RFC patch 03/19] bpf: Update locking comment in hashtab code
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 02/19] sched: Provide cant_migrate() Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() Thomas Gleixner
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

The comment where the bucket lock is acquired says:

  /* bpf_map_update_elem() can be called in_irq() */

which is not really helpful and, aside from that, does not explain the
subtle details of the hash bucket locks, especially in the context of BPF
and perf, kprobes and tracing.

Add a comment at the top of the file which explains the protection scopes
and the details how potential deadlocks are prevented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/hashtab.c |   24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -27,6 +27,26 @@
 	.map_delete_batch =			\
 	generic_map_delete_batch
 
+/*
+ * The bucket lock has two protection scopes:
+ *
+ * 1) Serializing concurrent operations from BPF programs on different
+ *    CPUs
+ *
+ * 2) Serializing concurrent operations from BPF programs and sys_bpf()
+ *
+ * BPF programs can execute in any context including perf, kprobes and
+ * tracing. As there are almost no limits where perf, kprobes and tracing
+ * can be invoked from, the lock operations need to be protected against
+ * deadlocks. Deadlocks can be caused by recursion and by an invocation in
+ * the lock held section when functions which acquire this lock are invoked
+ * from sys_bpf(). BPF recursion is prevented by incrementing the per CPU
+ * variable bpf_prog_active, which prevents BPF programs attached to perf
+ * events, kprobes and tracing from being invoked before the prior invocation
+ * from one of these contexts completed. sys_bpf() uses the same mechanism
+ * by pinning the task to the current CPU and incrementing the recursion
+ * protection across the map operation.
+ */
 struct bucket {
 	struct hlist_nulls_head head;
 	raw_spinlock_t lock;
@@ -872,7 +892,6 @@ static int htab_map_update_elem(struct b
 		 */
 	}
 
-	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&b->lock, flags);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
@@ -952,7 +971,6 @@ static int htab_lru_map_update_elem(stru
 		return -ENOMEM;
 	memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size);
 
-	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&b->lock, flags);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
@@ -1007,7 +1025,6 @@ static int __htab_percpu_map_update_elem
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&b->lock, flags);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
@@ -1071,7 +1088,6 @@ static int __htab_lru_percpu_map_update_
 			return -ENOMEM;
 	}
 
-	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&b->lock, flags);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);



* [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (2 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 03/19] bpf: Update locking comment in hashtab code Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-19 16:54   ` Steven Rostedt
  2020-02-14 13:39 ` [RFC patch 05/19] perf/bpf: Remove preempt disable around BPF invocation Thomas Gleixner
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

__bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.

This is redundant because __bpf_trace_run() is invoked from a trace point
via __DO_TRACE() which already disables preemption _before_ invoking any of
the functions which are attached to a trace point.

Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/trace/bpf_trace.c |    2 --
 1 file changed, 2 deletions(-)

--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1476,9 +1476,7 @@ static __always_inline
 void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
 {
 	rcu_read_lock();
-	preempt_disable();
 	(void) BPF_PROG_RUN(prog, args);
-	preempt_enable();
 	rcu_read_unlock();
 }
 



* [RFC patch 05/19] perf/bpf: Remove preempt disable around BPF invocation
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (3 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 06/19] bpf: Dont iterate over possible CPUs with interrupts disabled Thomas Gleixner
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

The BPF invocation from the perf event overflow handler does not require
preemption to be disabled because it is called from NMI or at least hard
interrupt context, which is already non-preemptible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/events/core.c |    2 --
 1 file changed, 2 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9206,7 +9206,6 @@ static void bpf_overflow_handler(struct
 	int ret = 0;
 
 	ctx.regs = perf_arch_bpf_user_pt_regs(regs);
-	preempt_disable();
 	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
 		goto out;
 	rcu_read_lock();
@@ -9214,7 +9213,6 @@ static void bpf_overflow_handler(struct
 	rcu_read_unlock();
 out:
 	__this_cpu_dec(bpf_prog_active);
-	preempt_enable();
 	if (!ret)
 		return;
 



* [RFC patch 06/19] bpf: Dont iterate over possible CPUs with interrupts disabled
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (4 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 05/19] perf/bpf: Remove preempt disable around BPF invocation Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro Thomas Gleixner
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

pcpu_freelist_populate() is disabling interrupts and then iterates over the
possible CPUs. The reason why this disables interrupts is to silence
lockdep because the invoked ___pcpu_freelist_push() takes spin locks.

Neither the interrupt disabling nor the locking are required in this
function because it's called during initialization and the resulting map is
not yet visible to anything.

Split out the actual push assignment into an inline, call it from the loop
and remove the interrupt disable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/percpu_freelist.c |   20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

--- a/kernel/bpf/percpu_freelist.c
+++ b/kernel/bpf/percpu_freelist.c
@@ -25,12 +25,18 @@ void pcpu_freelist_destroy(struct pcpu_f
 	free_percpu(s->freelist);
 }
 
+static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head,
+					   struct pcpu_freelist_node *node)
+{
+	node->next = head->first;
+	head->first = node;
+}
+
 static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head,
 					 struct pcpu_freelist_node *node)
 {
 	raw_spin_lock(&head->lock);
-	node->next = head->first;
-	head->first = node;
+	pcpu_freelist_push_node(head, node);
 	raw_spin_unlock(&head->lock);
 }
 
@@ -56,21 +62,16 @@ void pcpu_freelist_populate(struct pcpu_
 			    u32 nr_elems)
 {
 	struct pcpu_freelist_head *head;
-	unsigned long flags;
 	int i, cpu, pcpu_entries;
 
 	pcpu_entries = nr_elems / num_possible_cpus() + 1;
 	i = 0;
 
-	/* disable irq to workaround lockdep false positive
-	 * in bpf usage pcpu_freelist_populate() will never race
-	 * with pcpu_freelist_push()
-	 */
-	local_irq_save(flags);
 	for_each_possible_cpu(cpu) {
 again:
 		head = per_cpu_ptr(s->freelist, cpu);
-		___pcpu_freelist_push(head, buf);
+		/* No locking required as this is not visible yet. */
+		pcpu_freelist_push_node(head, buf);
 		i++;
 		buf += elem_size;
 		if (i == nr_elems)
@@ -78,7 +79,6 @@ void pcpu_freelist_populate(struct pcpu_
 		if (i % pcpu_entries)
 			goto again;
 	}
-	local_irq_restore(flags);
 }
 
 struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *s)



* [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (5 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 06/19] bpf: Dont iterate over possible CPUs with interrupts disabled Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 18:50   ` Mathieu Desnoyers
  2020-02-14 13:39 ` [RFC patch 08/19] bpf: Replace cant_sleep() with cant_migrate() Thomas Gleixner
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

BPF programs are required to run to completion on one CPU as they use per CPU
storage, but according to Alexei they don't need reentrancy protection as
obviously BPF programs running in thread context can always be 'preempted'
by hard and soft interrupts and instrumentation and the same program can
run concurrently on a different CPU.

The mechanism currently used to ensure this is to wrap the invocation
into a preempt_disable/enable() pair. Disabling preemption also disables
migration for a task.

preempt_disable/enable() is used because there is no explicit way to
reliably disable only migration.

Provide a separate macro to invoke a BPF program which can be used in
migratable task context.

It wraps BPF_PROG_RUN() in a migrate_disable/enable() pair which maps on
non RT enabled kernels to preempt_disable/enable(). On RT enabled kernels
this merely disables migration. Both methods ensure that the invoked BPF
program runs on one CPU to completion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/filter.h |   22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -576,8 +576,26 @@ DECLARE_STATIC_KEY_FALSE(bpf_stats_enabl
 	}								\
 	ret; })
 
-#define BPF_PROG_RUN(prog, ctx) __BPF_PROG_RUN(prog, ctx,		\
-					       bpf_dispatcher_nopfunc)
+#define BPF_PROG_RUN(prog, ctx)						\
+	__BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc)
+
+/*
+ * Use in preemptible and therefore migratable context to make sure that
+ * the execution of the BPF program runs on one CPU.
+ *
+ * This uses migrate_disable/enable() explicitly to document that the
+ * invocation of a BPF program does not require reentrancy protection
+ * against a BPF program which is invoked from a preempting task.
+ *
+ * For non RT enabled kernels migrate_disable/enable() maps to
+ * preempt_disable/enable(), i.e. it disables also preemption.
+ */
+#define BPF_PROG_RUN_PIN_ON_CPU(prog, ctx) ({				\
+	u32 ret;							\
+	migrate_disable();						\
+	ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc);	\
+	migrate_enable();						\
+	ret; })
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 



* [RFC patch 08/19] bpf: Replace cant_sleep() with cant_migrate()
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (6 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites Thomas Gleixner
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

As already discussed in the previous change which introduced
BPF_PROG_RUN_PIN_ON_CPU(), BPF only requires migration to be disabled to
guarantee that a program runs to completion on one CPU.

If RT substitutes the preempt disable based migration protection then the
cant_sleep() check will obviously trigger as preemption is not disabled.

Replace it by cant_migrate() which maps to cant_sleep() on a non RT kernel
and will verify that migration is disabled on a full RT kernel.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/filter.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -561,7 +561,7 @@ DECLARE_STATIC_KEY_FALSE(bpf_stats_enabl
 
 #define __BPF_PROG_RUN(prog, ctx, dfunc)	({			\
 	u32 ret;							\
-	cant_sleep();							\
+	cant_migrate();							\
 	if (static_branch_unlikely(&bpf_stats_enabled_key)) {		\
 		struct bpf_prog_stats *stats;				\
 		u64 start = sched_clock();				\



* [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (7 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 08/19] bpf: Replace cant_sleep() with cant_migrate() Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-19  1:39   ` Vinicius Costa Gomes
  2020-02-14 13:39 ` [RFC patch 10/19] trace/bpf: Use migrate disable in trace_call_bpf() Thomas Gleixner
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

From: David Miller <davem@davemloft.net>

All of these cases are strictly of the form:

	preempt_disable();
	BPF_PROG_RUN(...);
	preempt_enable();

Replace this with BPF_PROG_RUN_PIN_ON_CPU() which wraps BPF_PROG_RUN()
with:

	migrate_disable();
	BPF_PROG_RUN(...);
	migrate_enable();

On non RT enabled kernels this maps to preempt_disable/enable() and on RT
enabled kernels this solely prevents migration, which is sufficient as
there is no requirement to prevent reentrancy to any BPF program from a
preempting task. The only requirement is that the program stays on the same
CPU.

Therefore, this is a trivially correct transformation.

[ tglx: Converted to BPF_PROG_RUN_PIN_ON_CPU() ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/filter.h    |    4 +---
 kernel/seccomp.c          |    4 +---
 net/core/flow_dissector.c |    4 +---
 net/core/skmsg.c          |    8 ++------
 net/kcm/kcmsock.c         |    4 +---
 5 files changed, 6 insertions(+), 18 deletions(-)

--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -713,9 +713,7 @@ static inline u32 bpf_prog_run_clear_cb(
 	if (unlikely(prog->cb_access))
 		memset(cb_data, 0, BPF_SKB_CB_LEN);
 
-	preempt_disable();
-	res = BPF_PROG_RUN(prog, skb);
-	preempt_enable();
+	res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
 	return res;
 }
 
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
 	 */
-	preempt_disable();
 	for (; f; f = f->prev) {
-		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
+		u32 cur_ret = BPF_PROG_RUN_PIN_ON_CPU(f->prog, sd);
 
 		if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
 			ret = cur_ret;
 			*match = f;
 		}
 	}
-	preempt_enable();
 	return ret;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -920,9 +920,7 @@ bool bpf_flow_dissect(struct bpf_prog *p
 		     (int)FLOW_DISSECTOR_F_STOP_AT_ENCAP);
 	flow_keys->flags = flags;
 
-	preempt_disable();
-	result = BPF_PROG_RUN(prog, ctx);
-	preempt_enable();
+	result = BPF_PROG_RUN_PIN_ON_CPU(prog, ctx);
 
 	flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, nhoff, hlen);
 	flow_keys->thoff = clamp_t(u16, flow_keys->thoff,
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -628,7 +628,6 @@ int sk_psock_msg_verdict(struct sock *sk
 	struct bpf_prog *prog;
 	int ret;
 
-	preempt_disable();
 	rcu_read_lock();
 	prog = READ_ONCE(psock->progs.msg_parser);
 	if (unlikely(!prog)) {
@@ -638,7 +637,7 @@ int sk_psock_msg_verdict(struct sock *sk
 
 	sk_msg_compute_data_pointers(msg);
 	msg->sk = sk;
-	ret = BPF_PROG_RUN(prog, msg);
+	ret = BPF_PROG_RUN_PIN_ON_CPU(prog, msg);
 	ret = sk_psock_map_verd(ret, msg->sk_redir);
 	psock->apply_bytes = msg->apply_bytes;
 	if (ret == __SK_REDIRECT) {
@@ -653,7 +652,6 @@ int sk_psock_msg_verdict(struct sock *sk
 	}
 out:
 	rcu_read_unlock();
-	preempt_enable();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(sk_psock_msg_verdict);
@@ -665,9 +663,7 @@ static int sk_psock_bpf_run(struct sk_ps
 
 	skb->sk = psock->sk;
 	bpf_compute_data_end_sk_skb(skb);
-	preempt_disable();
-	ret = BPF_PROG_RUN(prog, skb);
-	preempt_enable();
+	ret = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
 	/* strparser clones the skb before handing it to a upper layer,
 	 * meaning skb_orphan has been called. We NULL sk on the way out
 	 * to ensure we don't trigger a BUG_ON() in skb/sk operations
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -380,9 +380,7 @@ static int kcm_parse_func_strparser(stru
 	struct bpf_prog *prog = psock->bpf_prog;
 	int res;
 
-	preempt_disable();
-	res = BPF_PROG_RUN(prog, skb);
-	preempt_enable();
+	res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
 	return res;
 }
 



* [RFC patch 10/19] trace/bpf: Use migrate disable in trace_call_bpf()
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (8 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 11/19] bpf/tests: Use migrate disable instead of preempt disable Thomas Gleixner
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

BPF does not require preemption to be disabled. It only requires staying on the
same CPU while running a program. Reflect this by replacing
preempt_disable/enable() with migrate_disable/enable() pairs.

On a non-RT kernel this maps to preempt_disable/enable().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/trace/bpf_trace.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -83,7 +83,7 @@ unsigned int trace_call_bpf(struct trace
 	if (in_nmi()) /* not supported yet */
 		return 1;
 
-	preempt_disable();
+	migrate_disable();
 
 	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
 		/*
@@ -115,7 +115,7 @@ unsigned int trace_call_bpf(struct trace
 
  out:
 	__this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 
 	return ret;
 }



* [RFC patch 11/19] bpf/tests: Use migrate disable instead of preempt disable
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (9 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 10/19] trace/bpf: Use migrate disable in trace_call_bpf() Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 12/19] bpf: Use migrate_disable/enable() in trampoline code Thomas Gleixner
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

From: David Miller <davem@davemloft.net>

Replace the preemption disable/enable with migrate_disable/enable() to
reflect the actual requirement and to allow PREEMPT_RT to substitute it
with an actual migration disable mechanism which does not disable
preemption.

[ tglx: Switched it over to migrate disable ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 lib/test_bpf.c     |    4 ++--
 net/bpf/test_run.c |    8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6660,14 +6660,14 @@ static int __run_one(const struct bpf_pr
 	u64 start, finish;
 	int ret = 0, i;
 
-	preempt_disable();
+	migrate_disable();
 	start = ktime_get_ns();
 
 	for (i = 0; i < runs; i++)
 		ret = BPF_PROG_RUN(fp, data);
 
 	finish = ktime_get_ns();
-	preempt_enable();
+	migrate_enable();
 
 	*duration = finish - start;
 	do_div(*duration, runs);
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -37,7 +37,7 @@ static int bpf_test_run(struct bpf_prog
 		repeat = 1;
 
 	rcu_read_lock();
-	preempt_disable();
+	migrate_disable();
 	time_start = ktime_get_ns();
 	for (i = 0; i < repeat; i++) {
 		bpf_cgroup_storage_set(storage);
@@ -54,18 +54,18 @@ static int bpf_test_run(struct bpf_prog
 
 		if (need_resched()) {
 			time_spent += ktime_get_ns() - time_start;
-			preempt_enable();
+			migrate_enable();
 			rcu_read_unlock();
 
 			cond_resched();
 
 			rcu_read_lock();
-			preempt_disable();
+			migrate_disable();
 			time_start = ktime_get_ns();
 		}
 	}
 	time_spent += ktime_get_ns() - time_start;
-	preempt_enable();
+	migrate_enable();
 	rcu_read_unlock();
 
 	do_div(time_spent, repeat);



* [RFC patch 12/19] bpf: Use migrate_disable/enable() in trampoline code.
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (10 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 11/19] bpf/tests: Use migrate disable instead of preempt disable Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 13/19] bpf: Use migrate_disable/enable in array macros and cgroup/lirc code Thomas Gleixner
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

From: David Miller <davem@davemloft.net>

Use migrate_disable/enable() instead of preempt_disable/enable() to reflect
the purpose. This allows PREEMPT_RT to substitute it with an actual
migration disable implementation. On non RT kernels this is still mapped to
preempt_disable/enable().

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/bpf/trampoline.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -367,8 +367,9 @@ void bpf_trampoline_put(struct bpf_tramp
 	mutex_unlock(&trampoline_mutex);
 }
 
-/* The logic is similar to BPF_PROG_RUN, but with explicit rcu and preempt that
- * are needed for trampoline. The macro is split into
+/* The logic is similar to BPF_PROG_RUN, but with an explicit
+ * rcu_read_lock() and migrate_disable() which are required
+ * for the trampoline. The macro is split into
  * call _bpf_prog_enter
  * call prog->bpf_func
  * call __bpf_prog_exit
@@ -378,7 +379,7 @@ u64 notrace __bpf_prog_enter(void)
 	u64 start = 0;
 
 	rcu_read_lock();
-	preempt_disable();
+	migrate_disable();
 	if (static_branch_unlikely(&bpf_stats_enabled_key))
 		start = sched_clock();
 	return start;
@@ -401,7 +402,7 @@ void notrace __bpf_prog_exit(struct bpf_
 		stats->nsecs += sched_clock() - start;
 		u64_stats_update_end(&stats->syncp);
 	}
-	preempt_enable();
+	migrate_enable();
 	rcu_read_unlock();
 }
 



* [RFC patch 13/19] bpf: Use migrate_disable/enable in array macros and cgroup/lirc code.
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (11 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 12/19] bpf: Use migrate_disable/enable() in trampoline code Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code Thomas Gleixner
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

From: David Miller <davem@davemloft.net>

Replace the preemption disable/enable with migrate_disable/enable() to
reflect the actual requirement and to allow PREEMPT_RT to substitute it
with an actual migration disable mechanism which does not disable
preemption.

Including the code paths that go via __bpf_prog_run_save_cb().

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/bpf.h    |    8 ++++----
 include/linux/filter.h |    5 +++--
 2 files changed, 7 insertions(+), 6 deletions(-)

--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -885,7 +885,7 @@ int bpf_prog_array_copy(struct bpf_prog_
 		struct bpf_prog *_prog;			\
 		struct bpf_prog_array *_array;		\
 		u32 _ret = 1;				\
-		preempt_disable();			\
+		migrate_disable();			\
 		rcu_read_lock();			\
 		_array = rcu_dereference(array);	\
 		if (unlikely(check_non_null && !_array))\
@@ -898,7 +898,7 @@ int bpf_prog_array_copy(struct bpf_prog_
 		}					\
 _out:							\
 		rcu_read_unlock();			\
-		preempt_enable();			\
+		migrate_enable();			\
 		_ret;					\
 	 })
 
@@ -932,7 +932,7 @@ int bpf_prog_array_copy(struct bpf_prog_
 		u32 ret;				\
 		u32 _ret = 1;				\
 		u32 _cn = 0;				\
-		preempt_disable();			\
+		migrate_disable();			\
 		rcu_read_lock();			\
 		_array = rcu_dereference(array);	\
 		_item = &_array->items[0];		\
@@ -944,7 +944,7 @@ int bpf_prog_array_copy(struct bpf_prog_
 			_item++;			\
 		}					\
 		rcu_read_unlock();			\
-		preempt_enable();			\
+		migrate_enable();			\
 		if (_ret)				\
 			_ret = (_cn ? NET_XMIT_CN : NET_XMIT_SUCCESS);	\
 		else					\
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -673,6 +673,7 @@ static inline u8 *bpf_skb_cb(struct sk_b
 	return qdisc_skb_cb(skb)->data;
 }
 
+/* Must be invoked with migration disabled */
 static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog,
 					 struct sk_buff *skb)
 {
@@ -698,9 +699,9 @@ static inline u32 bpf_prog_run_save_cb(c
 {
 	u32 res;
 
-	preempt_disable();
+	migrate_disable();
 	res = __bpf_prog_run_save_cb(prog, skb);
-	preempt_enable();
+	migrate_enable();
 	return res;
 }
 



* [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (12 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 13/19] bpf: Use migrate_disable/enable in array macros and cgroup/lirc code Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 19:11   ` Mathieu Desnoyers
  2020-02-14 13:39 ` [RFC patch 15/19] bpf: Use migrate_disable() in sys_bpf() Thomas Gleixner
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

The required protection is that the caller cannot be migrated to a
different CPU as these places take either a hash bucket lock or might
trigger a kprobe inside the memory allocator. Both scenarios can lead to
deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
variable which temporarily blocks the invocation of BPF programs from perf
and kprobes.

Replace the preempt_disable/enable() pairs with migrate_disable/enable()
pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
kernel this maps to preempt_disable/enable(), i.e. no functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/hashtab.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -698,11 +698,11 @@ static void htab_elem_free_rcu(struct rc
 	 * we're calling kfree, otherwise deadlock is possible if kprobes
 	 * are placed somewhere inside of slub
 	 */
-	preempt_disable();
+	migrate_disable();
 	__this_cpu_inc(bpf_prog_active);
 	htab_elem_free(htab, l);
 	__this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 }
 
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
@@ -1327,7 +1327,7 @@ static int
 	}
 
 again:
-	preempt_disable();
+	migrate_disable();
 	this_cpu_inc(bpf_prog_active);
 	rcu_read_lock();
 again_nocopy:
@@ -1347,7 +1347,7 @@ static int
 		raw_spin_unlock_irqrestore(&b->lock, flags);
 		rcu_read_unlock();
 		this_cpu_dec(bpf_prog_active);
-		preempt_enable();
+		migrate_enable();
 		goto after_loop;
 	}
 
@@ -1356,7 +1356,7 @@ static int
 		raw_spin_unlock_irqrestore(&b->lock, flags);
 		rcu_read_unlock();
 		this_cpu_dec(bpf_prog_active);
-		preempt_enable();
+		migrate_enable();
 		kvfree(keys);
 		kvfree(values);
 		goto alloc;
@@ -1406,7 +1406,7 @@ static int
 
 	rcu_read_unlock();
 	this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 	if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
 	    key_size * bucket_cnt) ||
 	    copy_to_user(uvalues + total * value_size, values,



* [RFC patch 15/19] bpf: Use migrate_disable() in sys_bpf()
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (13 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 16/19] bpf: Factor out hashtab bucket lock operations Thomas Gleixner
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

The required protection is that the caller cannot be migrated to a
different CPU as these functions end up in places which take either a hash
bucket lock or might trigger a kprobe inside the memory allocator. Both
scenarios can lead to deadlocks. The deadlock prevention is per CPU by
incrementing a per CPU variable which temporarily blocks the invocation of
BPF programs from perf and kprobes.

Replace the preempt_disable/enable() pairs with migrate_disable/enable()
pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
kernel this maps to preempt_disable/enable(), i.e. no functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/syscall.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -174,7 +174,7 @@ static int bpf_map_update_value(struct b
 	/* must increment bpf_prog_active to avoid kprobe+bpf triggering from
 	 * inside bpf map update or delete otherwise deadlocks are possible
 	 */
-	preempt_disable();
+	migrate_disable();
 	__this_cpu_inc(bpf_prog_active);
 	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
@@ -207,7 +207,7 @@ static int bpf_map_update_value(struct b
 		rcu_read_unlock();
 	}
 	__this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 	maybe_wait_bpf_programs(map);
 
 	return err;
@@ -222,7 +222,7 @@ static int bpf_map_copy_value(struct bpf
 	if (bpf_map_is_dev_bound(map))
 		return bpf_map_offload_lookup_elem(map, key, value);
 
-	preempt_disable();
+	migrate_disable();
 	this_cpu_inc(bpf_prog_active);
 	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
@@ -269,7 +269,7 @@ static int bpf_map_copy_value(struct bpf
 	}
 
 	this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 	maybe_wait_bpf_programs(map);
 
 	return err;
@@ -1136,13 +1136,13 @@ static int map_delete_elem(union bpf_att
 		goto out;
 	}
 
-	preempt_disable();
+	migrate_disable();
 	__this_cpu_inc(bpf_prog_active);
 	rcu_read_lock();
 	err = map->ops->map_delete_elem(map, key);
 	rcu_read_unlock();
 	__this_cpu_dec(bpf_prog_active);
-	preempt_enable();
+	migrate_enable();
 	maybe_wait_bpf_programs(map);
 out:
 	kfree(key);
@@ -1254,13 +1254,13 @@ int generic_map_delete_batch(struct bpf_
 			break;
 		}
 
-		preempt_disable();
+		migrate_disable();
 		__this_cpu_inc(bpf_prog_active);
 		rcu_read_lock();
 		err = map->ops->map_delete_elem(map, key);
 		rcu_read_unlock();
 		__this_cpu_dec(bpf_prog_active);
-		preempt_enable();
+		migrate_enable();
 		maybe_wait_bpf_programs(map);
 		if (err)
 			break;



* [RFC patch 16/19] bpf: Factor out hashtab bucket lock operations
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (14 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 15/19] bpf: Use migrate_disable() in sys_bpf() Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 17/19] bpf: Prepare hashtab locking for PREEMPT_RT Thomas Gleixner
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

As a preparation for making the BPF locking RT friendly, factor out the
hash bucket lock operations into inline functions. This allows doing the
necessary RT modifications in one place instead of sprinkling them all over
the place. No functional change.

The now unused htab argument of the lock/unlock functions will be used in
the next step which adds PREEMPT_RT support.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/hashtab.c |   69 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 46 insertions(+), 23 deletions(-)

--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -87,6 +87,32 @@ struct htab_elem {
 	char key[0] __aligned(8);
 };
 
+static void htab_init_buckets(struct bpf_htab *htab)
+{
+	unsigned i;
+
+	for (i = 0; i < htab->n_buckets; i++) {
+		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
+		raw_spin_lock_init(&htab->buckets[i].lock);
+	}
+}
+
+static inline unsigned long htab_lock_bucket(const struct bpf_htab *htab,
+					     struct bucket *b)
+{
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&b->lock, flags);
+	return flags;
+}
+
+static inline void htab_unlock_bucket(const struct bpf_htab *htab,
+				      struct bucket *b,
+				      unsigned long flags)
+{
+	raw_spin_unlock_irqrestore(&b->lock, flags);
+}
+
 static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
 
 static bool htab_is_lru(const struct bpf_htab *htab)
@@ -336,8 +362,8 @@ static struct bpf_map *htab_map_alloc(un
 	bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
 	bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
 	struct bpf_htab *htab;
-	int err, i;
 	u64 cost;
+	int err;
 
 	htab = kzalloc(sizeof(*htab), GFP_USER);
 	if (!htab)
@@ -399,10 +425,7 @@ static struct bpf_map *htab_map_alloc(un
 	else
 		htab->hashrnd = get_random_int();
 
-	for (i = 0; i < htab->n_buckets; i++) {
-		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
-		raw_spin_lock_init(&htab->buckets[i].lock);
-	}
+	htab_init_buckets(htab);
 
 	if (prealloc) {
 		err = prealloc_init(htab);
@@ -610,7 +633,7 @@ static bool htab_lru_map_delete_node(voi
 	b = __select_bucket(htab, tgt_l->hash);
 	head = &b->head;
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
 		if (l == tgt_l) {
@@ -618,7 +641,7 @@ static bool htab_lru_map_delete_node(voi
 			break;
 		}
 
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 
 	return l == tgt_l;
 }
@@ -892,7 +915,7 @@ static int htab_map_update_elem(struct b
 		 */
 	}
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
 
@@ -933,7 +956,7 @@ static int htab_map_update_elem(struct b
 	}
 	ret = 0;
 err:
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	return ret;
 }
 
@@ -971,7 +994,7 @@ static int htab_lru_map_update_elem(stru
 		return -ENOMEM;
 	memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size);
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
 
@@ -990,7 +1013,7 @@ static int htab_lru_map_update_elem(stru
 	ret = 0;
 
 err:
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 
 	if (ret)
 		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
@@ -1025,7 +1048,7 @@ static int __htab_percpu_map_update_elem
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
 
@@ -1048,7 +1071,7 @@ static int __htab_percpu_map_update_elem
 	}
 	ret = 0;
 err:
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	return ret;
 }
 
@@ -1088,7 +1111,7 @@ static int __htab_lru_percpu_map_update_
 			return -ENOMEM;
 	}
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l_old = lookup_elem_raw(head, hash, key, key_size);
 
@@ -1110,7 +1133,7 @@ static int __htab_lru_percpu_map_update_
 	}
 	ret = 0;
 err:
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	if (l_new)
 		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
 	return ret;
@@ -1148,7 +1171,7 @@ static int htab_map_delete_elem(struct b
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l = lookup_elem_raw(head, hash, key, key_size);
 
@@ -1158,7 +1181,7 @@ static int htab_map_delete_elem(struct b
 		ret = 0;
 	}
 
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	return ret;
 }
 
@@ -1180,7 +1203,7 @@ static int htab_lru_map_delete_elem(stru
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	l = lookup_elem_raw(head, hash, key, key_size);
 
@@ -1189,7 +1212,7 @@ static int htab_lru_map_delete_elem(stru
 		ret = 0;
 	}
 
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	if (l)
 		bpf_lru_push_free(&htab->lru, &l->lru_node);
 	return ret;
@@ -1335,7 +1358,7 @@ static int
 	dst_val = values;
 	b = &htab->buckets[batch];
 	head = &b->head;
-	raw_spin_lock_irqsave(&b->lock, flags);
+	flags = htab_lock_bucket(htab, b);
 
 	bucket_cnt = 0;
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
@@ -1344,7 +1367,7 @@ static int
 	if (bucket_cnt > (max_count - total)) {
 		if (total == 0)
 			ret = -ENOSPC;
-		raw_spin_unlock_irqrestore(&b->lock, flags);
+		htab_unlock_bucket(htab, b, flags);
 		rcu_read_unlock();
 		this_cpu_dec(bpf_prog_active);
 		migrate_enable();
@@ -1353,7 +1376,7 @@ static int
 
 	if (bucket_cnt > bucket_size) {
 		bucket_size = bucket_cnt;
-		raw_spin_unlock_irqrestore(&b->lock, flags);
+		htab_unlock_bucket(htab, b, flags);
 		rcu_read_unlock();
 		this_cpu_dec(bpf_prog_active);
 		migrate_enable();
@@ -1395,7 +1418,7 @@ static int
 		dst_val += value_size;
 	}
 
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	htab_unlock_bucket(htab, b, flags);
 	/* If we are not copying data, we can go to next bucket and avoid
 	 * unlocking the rcu.
 	 */



* [RFC patch 17/19] bpf: Prepare hashtab locking for PREEMPT_RT
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (15 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 16/19] bpf: Factor out hashtab bucket lock operations Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 18/19] bpf, lpm: Make locking RT friendly Thomas Gleixner
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

PREEMPT_RT forbids certain operations like memory allocations (even with
GFP_ATOMIC) from atomic contexts. This is required because even with
GFP_ATOMIC the memory allocator calls into code paths which acquire locks
with long held lock sections. To ensure the deterministic behaviour these
locks are regular spinlocks, which are converted to 'sleepable' spinlocks
on RT. The only true atomic contexts on an RT kernel are the low level
hardware handling, scheduling, low level interrupt handling, NMIs etc. None
of these contexts should ever do memory allocations.

As regular device interrupt handlers and soft interrupts are forced into
thread context, the existing code which does
  spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*();
just works.

In theory the BPF locks could be converted to regular spinlocks as well,
but the bucket locks and percpu_freelist locks can be taken from arbitrary
contexts (perf, kprobes, tracepoints) which are required to be atomic
contexts even on RT. These mechanisms require preallocated maps, so there
is no need to invoke memory allocations within the lock held sections.

BPF maps which need dynamic allocation are only used from (forced) thread
context on RT and can therefore use regular spinlocks, which in turn allows
memory allocations from within the lock held section.

To achieve this make the hash bucket lock a union of a raw and a regular
spinlock and initialize and lock/unlock either the raw spinlock for
preallocated maps or the regular variant for maps which require memory
allocations.

On a non-RT kernel this distinction is neither possible nor required.
spinlock maps to raw_spinlock and the extra code and conditional are
optimized out by the compiler. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/hashtab.c |   65 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 56 insertions(+), 9 deletions(-)

--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -46,10 +46,43 @@
  * from one of these contexts completed. sys_bpf() uses the same mechanism
  * by pinning the task to the current CPU and incrementing the recursion
  * protection across the map operation.
+ *
+ * This has subtle implications on PREEMPT_RT. PREEMPT_RT forbids certain
+ * operations like memory allocations (even with GFP_ATOMIC) from atomic
+ * contexts. This is required because even with GFP_ATOMIC the memory
+ * allocator calls into code paths which acquire locks with long held lock
+ * sections. To ensure the deterministic behaviour these locks are regular
+ * spinlocks, which are converted to 'sleepable' spinlocks on RT. The only
+ * true atomic contexts on an RT kernel are the low level hardware
+ * handling, scheduling, low level interrupt handling, NMIs etc. None of
+ * these contexts should ever do memory allocations.
+ *
+ * As regular device interrupt handlers and soft interrupts are forced into
+ * thread context, the existing code which does
+ *   spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*();
+ * just works.
+ *
+ * In theory the BPF locks could be converted to regular spinlocks as well,
+ * but the bucket locks and percpu_freelist locks can be taken from
+ * arbitrary contexts (perf, kprobes, tracepoints) which are required to be
+ * atomic contexts even on RT. These mechanisms require preallocated maps,
+ * so there is no need to invoke memory allocations within the lock held
+ * sections.
+ *
+ * BPF maps which need dynamic allocation are only used from (forced)
+ * thread context on RT and can therefore use regular spinlocks which in
+ * turn allows memory allocations from within the lock held section.
+ *
+ * On a non-RT kernel this distinction is neither possible nor required.
+ * spinlock maps to raw_spinlock and the extra code is optimized out by the
+ * compiler.
  */
 struct bucket {
 	struct hlist_nulls_head head;
-	raw_spinlock_t lock;
+	union {
+		raw_spinlock_t raw_lock;
+		spinlock_t     lock;
+	};
 };
 
 struct bpf_htab {
@@ -87,13 +120,26 @@ struct htab_elem {
 	char key[0] __aligned(8);
 };
 
+static inline bool htab_is_prealloc(const struct bpf_htab *htab)
+{
+	return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
+}
+
+static inline bool htab_use_raw_lock(const struct bpf_htab *htab)
+{
+	return (!IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab));
+}
+
 static void htab_init_buckets(struct bpf_htab *htab)
 {
 	unsigned i;
 
 	for (i = 0; i < htab->n_buckets; i++) {
 		INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
-		raw_spin_lock_init(&htab->buckets[i].lock);
+		if (htab_use_raw_lock(htab))
+			raw_spin_lock_init(&htab->buckets[i].raw_lock);
+		else
+			spin_lock_init(&htab->buckets[i].lock);
 	}
 }
 
@@ -102,7 +148,10 @@ static inline unsigned long htab_lock_bu
 {
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&b->lock, flags);
+	if (htab_use_raw_lock(htab))
+		raw_spin_lock_irqsave(&b->raw_lock, flags);
+	else
+		spin_lock_irqsave(&b->lock, flags);
 	return flags;
 }
 
@@ -110,7 +159,10 @@ static inline void htab_unlock_bucket(co
 				      struct bucket *b,
 				      unsigned long flags)
 {
-	raw_spin_unlock_irqrestore(&b->lock, flags);
+	if (htab_use_raw_lock(htab))
+		raw_spin_unlock_irqrestore(&b->raw_lock, flags);
+	else
+		spin_unlock_irqrestore(&b->lock, flags);
 }
 
 static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
@@ -127,11 +179,6 @@ static bool htab_is_percpu(const struct
 		htab->map.map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH;
 }
 
-static bool htab_is_prealloc(const struct bpf_htab *htab)
-{
-	return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
-}
-
 static inline void htab_elem_set_ptr(struct htab_elem *l, u32 key_size,
 				     void __percpu *pptr)
 {



* [RFC patch 18/19] bpf, lpm: Make locking RT friendly
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (16 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 17/19] bpf: Prepare hashtab locking for PREEMPT_RT Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 19/19] bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled Thomas Gleixner
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

The LPM trie map cannot be used in contexts like perf, kprobes and tracing
as this map type dynamically allocates memory.

The memory allocation happens with a raw spinlock held, which on a PREEMPT_RT
enabled kernel is a truly spinning lock that disables preemption and
interrupts.

As RT does not allow memory allocation from such a section for various
reasons, convert the raw spinlock to a regular spinlock.

On an RT enabled kernel these locks are substituted by 'sleeping' spinlocks
which provide the proper protection but keep the code preemptible.

On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does
not cause any functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/lpm_trie.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -34,7 +34,7 @@ struct lpm_trie {
 	size_t				n_entries;
 	size_t				max_prefixlen;
 	size_t				data_size;
-	raw_spinlock_t			lock;
+	spinlock_t			lock;
 };
 
 /* This trie implements a longest prefix match algorithm that can be used to
@@ -315,7 +315,7 @@ static int trie_update_elem(struct bpf_m
 	if (key->prefixlen > trie->max_prefixlen)
 		return -EINVAL;
 
-	raw_spin_lock_irqsave(&trie->lock, irq_flags);
+	spin_lock_irqsave(&trie->lock, irq_flags);
 
 	/* Allocate and fill a new node */
 
@@ -422,7 +422,7 @@ static int trie_update_elem(struct bpf_m
 		kfree(im_node);
 	}
 
-	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
+	spin_unlock_irqrestore(&trie->lock, irq_flags);
 
 	return ret;
 }
@@ -442,7 +442,7 @@ static int trie_delete_elem(struct bpf_m
 	if (key->prefixlen > trie->max_prefixlen)
 		return -EINVAL;
 
-	raw_spin_lock_irqsave(&trie->lock, irq_flags);
+	spin_lock_irqsave(&trie->lock, irq_flags);
 
 	/* Walk the tree looking for an exact key/length match and keeping
 	 * track of the path we traverse.  We will need to know the node
@@ -518,7 +518,7 @@ static int trie_delete_elem(struct bpf_m
 	kfree_rcu(node, rcu);
 
 out:
-	raw_spin_unlock_irqrestore(&trie->lock, irq_flags);
+	spin_unlock_irqrestore(&trie->lock, irq_flags);
 
 	return ret;
 }
@@ -575,7 +575,7 @@ static struct bpf_map *trie_alloc(union
 	if (ret)
 		goto out_err;
 
-	raw_spin_lock_init(&trie->lock);
+	spin_lock_init(&trie->lock);
 
 	return &trie->map;
 out_err:



* [RFC patch 19/19] bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (17 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 18/19] bpf, lpm: Make locking RT friendly Thomas Gleixner
@ 2020-02-14 13:39 ` Thomas Gleixner
  2020-02-14 17:53 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist David Miller
  2020-02-15 20:09 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Jakub Kicinski
  20 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
  To: LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

From: David Miller <davem@davemloft.net>

In an RT kernel down_read_trylock() cannot be used from NMI context and
up_read_non_owner() is another problematic issue.

So in such a configuration, simply elide the annotated stackmap and
just report the raw IPs.

In the longer term, it might be possible to provide an atomic friendly
version of the page cache traversal which will at least provide the info
whether the pages are resident and don't need to be paged in.

[ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work
  	callback and add a comment ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/bpf/stackmap.c |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -40,6 +40,9 @@ static void do_up_read(struct irq_work *
 {
 	struct stack_map_irq_work *work;
 
+	if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
+		return;
+
 	work = container_of(entry, struct stack_map_irq_work, irq_work);
 	up_read_non_owner(work->sem);
 	work->sem = NULL;
@@ -288,10 +291,19 @@ static void stack_map_get_build_id_offse
 	struct stack_map_irq_work *work = NULL;
 
 	if (irqs_disabled()) {
-		work = this_cpu_ptr(&up_read_work);
-		if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY)
-			/* cannot queue more up_read, fallback */
+		if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+			work = this_cpu_ptr(&up_read_work);
+			if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) {
+				/* cannot queue more up_read, fallback */
+				irq_work_busy = true;
+			}
+		} else {
+			/*
+			 * PREEMPT_RT does not allow to trylock mmap sem in
+			 * interrupt disabled context. Force the fallback code.
+			 */
 			irq_work_busy = true;
+		}
 	}
 
 	/*



* Re: [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (18 preceding siblings ...)
  2020-02-14 13:39 ` [RFC patch 19/19] bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled Thomas Gleixner
@ 2020-02-14 17:53 ` David Miller
  2020-02-14 18:36   ` Thomas Gleixner
  2020-02-15 20:09 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Jakub Kicinski
  20 siblings, 1 reply; 42+ messages in thread
From: David Miller @ 2020-02-14 17:53 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, bpf, netdev, ast, daniel, bigeasy, peterz,
	williams, rostedt, juri.lelli, mingo

From: Thomas Gleixner <tglx@linutronix.de>
Date: Fri, 14 Feb 2020 14:39:17 +0100

> This is a follow up to the initial patch series which David posted a
> while ago:
> 
>  https://lore.kernel.org/bpf/20191207.160357.828344895192682546.davem@davemloft.net/
> 
> which was (while non-functional on RT) a good starting point for further
> investigations.

This looks really good after a cursory review, thanks for doing this work.

I was personally unaware of the pre-allocation rules for MAPs used by
tracing et al.  And that definitely shapes how this should be handled.


* Re: [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist
  2020-02-14 17:53 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist David Miller
@ 2020-02-14 18:36   ` Thomas Gleixner
  2020-02-17 12:59     ` [PATCH] bpf: Enforce map preallocation for all instrumentation programs Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 18:36 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, bpf, netdev, ast, daniel, bigeasy, peterz,
	williams, rostedt, juri.lelli, mingo

David Miller <davem@davemloft.net> writes:

> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Fri, 14 Feb 2020 14:39:17 +0100
>
>> This is a follow up to the initial patch series which David posted a
>> while ago:
>> 
>>  https://lore.kernel.org/bpf/20191207.160357.828344895192682546.davem@davemloft.net/
>> 
>> which was (while non-functional on RT) a good starting point for further
>> investigations.
>
> This looks really good after a cursory review, thanks for doing this work.
>
> I was personally unaware of the pre-allocation rules for MAPs used by
> tracing et al.  And that definitely shapes how this should be handled.

Hmm. I just noticed that my analysis only holds for PERF events. But
that's broken on mainline already.

Assume the following simplified callchain:

       kmalloc() from regular non BPF context
         cache empty
           freelist empty
             lock(zone->lock);
                tracepoint or kprobe
                  BPF()
                    update_elem()
                      lock(bucket)
                        kmalloc()
                          cache empty
                            freelist empty
                              lock(zone->lock);  <- DEADLOCK

So really, preallocation _must_ be enforced for all variants of
intrusive instrumentation. There are no ifs and buts; it's simply mandatory
as all intrusive instrumentation has to follow the only sensible
principle: KISS = Keep It Safe and Simple.

The above is a perfectly valid scenario and works with perf and tracing,
so it has to work with BPF in the same safe way.

I might be missing some magic enforcement of that, but I got lost in the
maze.

Thanks,

        tglx



* Re: [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro
  2020-02-14 13:39 ` [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro Thomas Gleixner
@ 2020-02-14 18:50   ` Mathieu Desnoyers
  2020-02-14 19:36     ` Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Mathieu Desnoyers @ 2020-02-14 18:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar

On 14-Feb-2020 02:39:24 PM, Thomas Gleixner wrote:
[...]
> +#define BPF_PROG_RUN_PIN_ON_CPU(prog, ctx) ({				\
> +	u32 ret;							\
> +	migrate_disable();						\
> +	ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc);	\
> +	migrate_enable();						\
> +	ret; })

Does it really have to be a statement expression with a local variable?

If so, we should consider renaming "ret" to "__ret" to minimize the
chances of a caller issuing BPF_PROG_RUN_PIN_ON_CPU with "ret" as
prog or ctx argument, which would lead to unexpected results.
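
For illustration, a sketch of the renamed variant, based purely on the
macro quoted above (not a patch):

#define BPF_PROG_RUN_PIN_ON_CPU(prog, ctx) ({				\
	u32 __ret;							\
	migrate_disable();						\
	__ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc);	\
	migrate_enable();						\
	__ret; })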

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-14 13:39 ` [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code Thomas Gleixner
@ 2020-02-14 19:11   ` Mathieu Desnoyers
  2020-02-14 19:56     ` Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Mathieu Desnoyers @ 2020-02-14 19:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar

On 14-Feb-2020 02:39:31 PM, Thomas Gleixner wrote:
> The required protection is that the caller cannot be migrated to a
> different CPU as these places take either a hash bucket lock or might
> trigger a kprobe inside the memory allocator. Both scenarios can lead to
> deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
> variable which temporarily blocks the invocation of BPF programs from perf
> and kprobes.
> 
> Replace the preempt_disable/enable() pairs with migrate_disable/enable()
> pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
> kernel this maps to preempt_disable/enable(), i.e. no functional change.

Will that _really_ work on RT?

I'm puzzled about what will happen in the following scenario on RT:

Thread A is preempted within e.g. htab_elem_free_rcu, and Thread B is
scheduled and runs through a bunch of tracepoints. Both are on the
same CPU's runqueue:

CPU 1

Thread A is scheduled
(Thread A) htab_elem_free_rcu()
(Thread A)   migrate disable
(Thread A)   __this_cpu_inc(bpf_prog_active); -> per-cpu variable for
                                               deadlock prevention.
Thread A is preempted
Thread B is scheduled
(Thread B) Runs through various tracepoints:
           trace_call_bpf()
           if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
               -> will skip any instrumentation that happens to be on
                  this CPU until...
Thread B is preempted
Thread A is scheduled
(Thread A)  __this_cpu_dec(bpf_prog_active);
(Thread A)  migrate enable

Having all those events randomly and silently discarded might be quite
unexpected from a user standpoint. This turns the deadlock prevention
mechanism into a random tracepoint-dropping facility, which is
unsettling. One alternative approach we could consider to solve this
is to make this deadlock prevention nesting counter per-thread rather
than per-cpu.
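
A minimal sketch of that per-thread variant (the task_struct field and the
helper names below are made up for illustration, they are not part of this
series):

/* Hypothetical per-task recursion guard instead of the per-CPU counter. */
static inline bool bpf_prog_enter_task(void)
{
	if (current->bpf_prog_active)	/* hypothetical task_struct field */
		return false;		/* nested invocation, skip the program */
	current->bpf_prog_active = 1;
	return true;
}

static inline void bpf_prog_exit_task(void)
{
	current->bpf_prog_active = 0;
}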

Also, I don't think using __this_cpu_inc() without preempt-disable or
irq off is safe. You'll probably want to move to this_cpu_inc/dec
instead, which can be heavier on some architectures.

Thanks,

Mathieu


> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/bpf/hashtab.c |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -698,11 +698,11 @@ static void htab_elem_free_rcu(struct rc
>  	 * we're calling kfree, otherwise deadlock is possible if kprobes
>  	 * are placed somewhere inside of slub
>  	 */
> -	preempt_disable();
> +	migrate_disable();
>  	__this_cpu_inc(bpf_prog_active);
>  	htab_elem_free(htab, l);
>  	__this_cpu_dec(bpf_prog_active);
> -	preempt_enable();
> +	migrate_enable();
>  }
>  
>  static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> @@ -1327,7 +1327,7 @@ static int
>  	}
>  
>  again:
> -	preempt_disable();
> +	migrate_disable();
>  	this_cpu_inc(bpf_prog_active);
>  	rcu_read_lock();
>  again_nocopy:
> @@ -1347,7 +1347,7 @@ static int
>  		raw_spin_unlock_irqrestore(&b->lock, flags);
>  		rcu_read_unlock();
>  		this_cpu_dec(bpf_prog_active);
> -		preempt_enable();
> +		migrate_enable();
>  		goto after_loop;
>  	}
>  
> @@ -1356,7 +1356,7 @@ static int
>  		raw_spin_unlock_irqrestore(&b->lock, flags);
>  		rcu_read_unlock();
>  		this_cpu_dec(bpf_prog_active);
> -		preempt_enable();
> +		migrate_enable();
>  		kvfree(keys);
>  		kvfree(values);
>  		goto alloc;
> @@ -1406,7 +1406,7 @@ static int
>  
>  	rcu_read_unlock();
>  	this_cpu_dec(bpf_prog_active);
> -	preempt_enable();
> +	migrate_enable();
>  	if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
>  	    key_size * bucket_cnt) ||
>  	    copy_to_user(uvalues + total * value_size, values,
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro
  2020-02-14 18:50   ` Mathieu Desnoyers
@ 2020-02-14 19:36     ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 19:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:

> On 14-Feb-2020 02:39:24 PM, Thomas Gleixner wrote:
> [...]
>> +#define BPF_PROG_RUN_PIN_ON_CPU(prog, ctx) ({				\
>> +	u32 ret;							\
>> +	migrate_disable();						\
>> +	ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc);	\
>> +	migrate_enable();						\
>> +	ret; })
>
> Does it really have to be a statement expression with a local variable?
>
> If so, we should consider renaming "ret" to "__ret" to minimize the
> chances of a caller issuing BPF_PROG_RUN_PIN_ON_CPU with "ret" as
> prog or ctx argument, which would lead to unexpected results.

Indeed. That really can be an inline.
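
Roughly along these lines, a sketch that just follows the macro quoted
above (not necessarily what the final patch will look like):

static __always_inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
						   const void *ctx)
{
	u32 ret;

	migrate_disable();
	ret = __BPF_PROG_RUN(prog, ctx, bpf_dispatcher_nopfunc);
	migrate_enable();
	return ret;
}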

Thanks,

        tglx


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-14 19:11   ` Mathieu Desnoyers
@ 2020-02-14 19:56     ` Thomas Gleixner
  2020-02-18 23:36       ` Alexei Starovoitov
  0 siblings, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-14 19:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> On 14-Feb-2020 02:39:31 PM, Thomas Gleixner wrote:
>> Replace the preempt_disable/enable() pairs with migrate_disable/enable()
>> pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
>> kernel this maps to preempt_disable/enable(), i.e. no functional change.

...

> Having all those events randomly and silently discarded might be quite
> unexpected from a user standpoint. This turns the deadlock prevention
> mechanism into a random tracepoint-dropping facility, which is
> unsettling.

Well, it randomly drops events which might be unrelated to the syscall
target today already; this will just drop some more. Shrug.

> One alternative approach we could consider to solve this is to make
> this deadlock prevention nesting counter per-thread rather than
> per-cpu.

That should work both on !RT and RT.

> Also, I don't think using __this_cpu_inc() without preempt-disable or
> irq off is safe. You'll probably want to move to this_cpu_inc/dec
> instead, which can be heavier on some architectures.

Good catch.

Thanks,

        tglx

      


* Re: [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist
  2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
                   ` (19 preceding siblings ...)
  2020-02-14 17:53 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist David Miller
@ 2020-02-15 20:09 ` Jakub Kicinski
  20 siblings, 0 replies; 42+ messages in thread
From: Jakub Kicinski @ 2020-02-15 20:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar

On Fri, 14 Feb 2020 14:39:17 +0100 Thomas Gleixner wrote:
>    1) All interrupts, which are not explicitely marked IRQF_NO_THREAD, are

nit: s/explicitely/explicitly/ throughout (incl. code comments)


* [PATCH] bpf: Enforce map preallocation for all instrumentation programs
  2020-02-14 18:36   ` Thomas Gleixner
@ 2020-02-17 12:59     ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-17 12:59 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, bpf, netdev, ast, daniel, bigeasy, peterz,
	williams, rostedt, juri.lelli, mingo

The assumption that only programs attached to perf NMI events can deadlock
on memory allocators is wrong. Assume the following simplified callchain:

 kmalloc() from regular non BPF context
  cache empty
   freelist empty
    lock(zone->lock);
     tracepoint or kprobe
      BPF()
       update_elem()
        lock(bucket)
          kmalloc()
           cache empty
            freelist empty
             lock(zone->lock);  <- DEADLOCK

There are also other ways to create wreckage:

 kmalloc() from regular non BPF context
  local_irq_save();
   ...
    obj = slab_first();
     kprobe()
      BPF()
       update_elem()
        lock(bucket)
         kmalloc()
          local_irq_save();
           ...
            obj = slab_first(); <- Same object as above ...

So preallocation _must_ be enforced for all variants of intrusive
instrumentation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/bpf/verifier.c |   18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8144,19 +8144,23 @@ static int check_map_prog_compatibility(
 					struct bpf_prog *prog)
 
 {
-	/* Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
-	 * preallocated hash maps, since doing memory allocation
-	 * in overflow_handler can crash depending on where nmi got
-	 * triggered.
+	/*
+	 * Make sure that trace type programs use preallocated hash maps.
+	 * Perf programs obviously can't do memory allocation in NMI
+	 * context and all other types can deadlock on a memory allocator
+	 * lock when a tracepoint/kprobe triggers a BPF program inside a
+	 * lock held region or create inconsistent state when the probe is
+	 * within an interrupts disabled critical region in the memory
+	 * allocator.
 	 */
-	if (prog->type == BPF_PROG_TYPE_PERF_EVENT) {
+	if (is_tracing_prog_type(prog->type)) {
 		if (!check_map_prealloc(map)) {
-			verbose(env, "perf_event programs can only use preallocated hash map\n");
+			verbose(env, "tracing programs can only use preallocated hash map\n");
 			return -EINVAL;
 		}
 		if (map->inner_map_meta &&
 		    !check_map_prealloc(map->inner_map_meta)) {
-			verbose(env, "perf_event programs can only use preallocated inner hash map\n");
+			verbose(env, "tracing programs can only use preallocated inner hash map\n");
 			return -EINVAL;
 		}
 	}


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-14 19:56     ` Thomas Gleixner
@ 2020-02-18 23:36       ` Alexei Starovoitov
  2020-02-19  0:49         ` Thomas Gleixner
  2020-02-19 15:17         ` Mathieu Desnoyers
  0 siblings, 2 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2020-02-18 23:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar

On Fri, Feb 14, 2020 at 08:56:12PM +0100, Thomas Gleixner wrote:
> 
> > Also, I don't think using __this_cpu_inc() without preempt-disable or
> > irq off is safe. You'll probably want to move to this_cpu_inc/dec
> > instead, which can be heavier on some architectures.
> 
> Good catch.

Overall looks great.
Thank you for taking time to write commit logs and detailed cover letter.
I think s/__this_cpu_inc/this_cpu_inc/ is the only bit that needs to be
addressed for it to be merged.
There were few other suggestions from Mathieu and Jakub.
Could you address them and resend?
I saw patch 1 landing in tip tree, but it needs to be in bpf-next as well
along with the rest of the series. Does it really need to be in the tip?
I would prefer to take the whole thing and avoid conflicts around
migrate_disable() especially if nothing in tip is going to use it in this
development cycle. So just drop patch 1 from the tip?

Regarding
union {
   raw_spinlock_t  raw_lock;
   spinlock_t      lock;
};
yeah. it's not pretty, but I also don't have better ideas.

Regarding migrate_disable()... can you enable it without the rest of RT?
I haven't seen its implementation. I suspect it's scheduler only change?
If I can use migrate_disable() without RT it will help my work on sleepable
BPF programs. I would only have to worry about rcu_read_lock() since
preempt_disable() is nicely addressed.


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-18 23:36       ` Alexei Starovoitov
@ 2020-02-19  0:49         ` Thomas Gleixner
  2020-02-19  1:23           ` Alexei Starovoitov
  2020-02-19 15:17         ` Mathieu Desnoyers
  1 sibling, 1 reply; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-19  0:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar

Alexei,

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> Overall looks great.
> Thank you for taking time to write commit logs and detailed cover letter.
> I think s/__this_cpu_inc/this_cpu_inc/ is the only bit that needs to be
> addressed for it to be merged.
> There were few other suggestions from Mathieu and Jakub.
> Could you address them and resend?

I have them fixed up already, but I was waiting for further
comments. I'll send it out tomorrow morning as I'm dead tired by now.

> I saw patch 1 landing in tip tree, but it needs to be in bpf-next as well
> along with the rest of the series. Does it really need to be in the tip?
> I would prefer to take the whole thing and avoid conflicts around
> migrate_disable() especially if nothing in tip is going to use it in this
> development cycle. So just drop patch 1 from the tip?

I'll add patch 2 to a tip branch as well and I'll give you a tag to pull
into BPF (which has only those two commits). That allows us to further
tweak the relevant files without creating conflicts in next.

> Regarding
> union {
>    raw_spinlock_t  raw_lock;
>    spinlock_t      lock;
> };
> yeah. it's not pretty, but I also don't have better ideas.

Yeah. I really tried hard to avoid it, but the alternative solution was
code duplication which was even more horrible.

> Regarding migrate_disable()... can you enable it without the rest of RT?
> I haven't seen its implementation. I suspect it's scheduler only change?
> If I can use migrate_disable() without RT it will help my work on sleepable
> BPF programs. I would only have to worry about rcu_read_lock() since
> preempt_disable() is nicely addressed.

You have to talk to Peter Zijlstra about this as this is really
scheduler relevant stuff. FYI, he fundamentally hates migrate_disable()
from a schedulability POV, but as with the above lock construct the
amount of better solutions is also close to zero.

Thanks,

        tglx


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-19  0:49         ` Thomas Gleixner
@ 2020-02-19  1:23           ` Alexei Starovoitov
  0 siblings, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2020-02-19  1:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar

On Wed, Feb 19, 2020 at 01:49:57AM +0100, Thomas Gleixner wrote:
> Alexei,
> 
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> > Overall looks great.
> > Thank you for taking time to write commit logs and detailed cover letter.
> > I think s/__this_cpu_inc/this_cpu_inc/ is the only bit that needs to be
> > addressed for it to be merged.
> > There were few other suggestions from Mathieu and Jakub.
> > Could you address them and resend?
> 
> I have them fixed up already, but I was waiting for further
> comments. I'll send it out tomorrow morning as I'm dead tired by now.
> 
> > I saw patch 1 landing in tip tree, but it needs to be in bpf-next as well
> > along with the rest of the series. Does it really need to be in the tip?
> > I would prefer to take the whole thing and avoid conflicts around
> > migrate_disable() especially if nothing in tip is going to use it in this
> > development cycle. So just drop patch 1 from the tip?
> 
> I'll add patch 2 to a tip branch as well and I'll give you a tag to pull
> into BPF (which has only those two commits). 

That works too.

> > Regarding
> > union {
> >    raw_spinlock_t  raw_lock;
> >    spinlock_t      lock;
> > };
> > yeah. it's not pretty, but I also don't have better ideas.
> 
> Yeah. I really tried hard to avoid it, but the alternative solution was
> code duplication which was even more horrible.
> 
> > Regarding migrate_disable()... can you enable it without the rest of RT?
> > I haven't seen its implementation. I suspect it's scheduler only change?
> > If I can use migrate_disable() without RT it will help my work on sleepable
> > BPF programs. I would only have to worry about rcu_read_lock() since
> > preempt_disable() is nicely addressed.
> 
> You have to talk to Peter Zijlstra about this as this is really
> scheduler relevant stuff. FYI, he fundamentally hates migrate_disable()
> from a schedulability POV, but as with the above lock construct the
> amount of better solutions is also close to zero.

I would imagine that migrate_disable is like temporary cpu pinning. The
scheduler has to have all the logic to make scheduling decisions quickly in
the presence of pinned tasks, and additional migrate_disable shouldn't
introduce slowdowns or complexity to the critical path. Anyway, we'll
discuss further when the migrate_disable patches hit the mailing lists.


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-14 13:39 ` [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites Thomas Gleixner
@ 2020-02-19  1:39   ` Vinicius Costa Gomes
  2020-02-19  9:00     ` Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Vinicius Costa Gomes @ 2020-02-19  1:39 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar

Hi,

Thomas Gleixner <tglx@linutronix.de> writes:

> From: David Miller <davem@davemloft.net>
>
> All of these cases are strictly of the form:
>
> 	preempt_disable();
> 	BPF_PROG_RUN(...);
> 	preempt_enable();
>
> Replace this with BPF_PROG_RUN_PIN_ON_CPU() which wraps BPF_PROG_RUN()
> with:
>
> 	migrate_disable();
> 	BPF_PROG_RUN(...);
> 	migrate_enable();
>
> On non RT enabled kernels this maps to preempt_disable/enable() and on RT
> enabled kernels this solely prevents migration, which is sufficient as
> there is no requirement to prevent reentrancy to any BPF program from a
> preempting task. The only requirement is that the program stays on the same
> CPU.
>
> Therefore, this is a trivially correct transformation.
>
> [ tglx: Converted to BPF_PROG_RUN_PIN_ON_CPU() ]
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
> ---
>  include/linux/filter.h    |    4 +---
>  kernel/seccomp.c          |    4 +---
>  net/core/flow_dissector.c |    4 +---
>  net/core/skmsg.c          |    8 ++------
>  net/kcm/kcmsock.c         |    4 +---
>  5 files changed, 6 insertions(+), 18 deletions(-)
>
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -713,9 +713,7 @@ static inline u32 bpf_prog_run_clear_cb(
>  	if (unlikely(prog->cb_access))
>  		memset(cb_data, 0, BPF_SKB_CB_LEN);
>  
> -	preempt_disable();
> -	res = BPF_PROG_RUN(prog, skb);
> -	preempt_enable();
> +	res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
>  	return res;
>  }
>  
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str
>  	 * All filters in the list are evaluated and the lowest BPF return
>  	 * value always takes priority (ignoring the DATA).
>  	 */
> -	preempt_disable();
>  	for (; f; f = f->prev) {
> -		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
> +		u32 cur_ret = BPF_PROG_RUN_PIN_ON_CPU(f->prog, sd);
>

More a question really, isn't the behavior changing here? i.e. shouldn't
migrate_disable()/migrate_enable() be moved to outside the loop? Or is
running seccomp filters on different cpus not a problem?
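
For concreteness, the hoisted variant I have in mind would look roughly
like this (sketch only, loop body as in seccomp_run_filters()):

	/* Pin to the CPU once across the whole filter list walk. */
	migrate_disable();
	for (; f; f = f->prev) {
		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);

		if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
			ret = cur_ret;
			*match = f;
		}
	}
	migrate_enable();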

-- 
Vinicius


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-19  1:39   ` Vinicius Costa Gomes
@ 2020-02-19  9:00     ` Thomas Gleixner
  2020-02-19 16:38       ` Alexei Starovoitov
  2020-02-21  0:20       ` Kees Cook
  0 siblings, 2 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-19  9:00 UTC (permalink / raw)
  To: Vinicius Costa Gomes, LKML
  Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar, Kees Cook, Will Drewry, Andy Lutomirski

Vinicius Costa Gomes <vinicius.gomes@intel.com> writes:

Cc+: seccomp folks 

> Thomas Gleixner <tglx@linutronix.de> writes:
>
>> From: David Miller <davem@davemloft.net>

Leaving content for reference

>> All of these cases are strictly of the form:
>>
>> 	preempt_disable();
>> 	BPF_PROG_RUN(...);
>> 	preempt_enable();
>>
>> Replace this with BPF_PROG_RUN_PIN_ON_CPU() which wraps BPF_PROG_RUN()
>> with:
>>
>> 	migrate_disable();
>> 	BPF_PROG_RUN(...);
>> 	migrate_enable();
>>
>> On non RT enabled kernels this maps to preempt_disable/enable() and on RT
>> enabled kernels this solely prevents migration, which is sufficient as
>> there is no requirement to prevent reentrancy to any BPF program from a
>> preempting task. The only requirement is that the program stays on the same
>> CPU.
>>
>> Therefore, this is a trivially correct transformation.
>>
>> [ tglx: Converted to BPF_PROG_RUN_PIN_ON_CPU() ]
>>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>>
>> ---
>>  include/linux/filter.h    |    4 +---
>>  kernel/seccomp.c          |    4 +---
>>  net/core/flow_dissector.c |    4 +---
>>  net/core/skmsg.c          |    8 ++------
>>  net/kcm/kcmsock.c         |    4 +---
>>  5 files changed, 6 insertions(+), 18 deletions(-)
>>
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -713,9 +713,7 @@ static inline u32 bpf_prog_run_clear_cb(
>>  	if (unlikely(prog->cb_access))
>>  		memset(cb_data, 0, BPF_SKB_CB_LEN);
>>  
>> -	preempt_disable();
>> -	res = BPF_PROG_RUN(prog, skb);
>> -	preempt_enable();
>> +	res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
>>  	return res;
>>  }
>>  
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str
>>  	 * All filters in the list are evaluated and the lowest BPF return
>>  	 * value always takes priority (ignoring the DATA).
>>  	 */
>> -	preempt_disable();
>>  	for (; f; f = f->prev) {
>> -		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
>> +		u32 cur_ret = BPF_PROG_RUN_PIN_ON_CPU(f->prog, sd);
>>
>
> More a question really, isn't the behavior changing here? i.e. shouldn't
> migrate_disable()/migrate_enable() be moved to outside the loop? Or is
> running seccomp filters on different cpus not a problem?

In my understanding this is a list of filters and they are independent
of each other.

Kees, Will, Andy?

Thanks,

        tglx


* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-18 23:36       ` Alexei Starovoitov
  2020-02-19  0:49         ` Thomas Gleixner
@ 2020-02-19 15:17         ` Mathieu Desnoyers
  2020-02-20  4:19           ` Alexei Starovoitov
  1 sibling, 1 reply; 42+ messages in thread
From: Mathieu Desnoyers @ 2020-02-19 15:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Gleixner, linux-kernel, David S. Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Andrzej Siewior,
	Peter Zijlstra, Clark Williams, rostedt, Juri Lelli, Ingo Molnar

----- On Feb 18, 2020, at 6:36 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:

[...]

> If I can use migrate_disable() without RT it will help my work on sleepable
> BPF programs. I would only have to worry about rcu_read_lock() since
> preempt_disable() is nicely addressed.

Hi Alexei,

You may want to consider using SRCU rather than RCU if you need to sleep while
holding an RCU read-side lock.

This is the synchronization approach I consider for adding the ability to take page
faults when doing syscall tracing.

Then you'll be able to replace preempt_disable() by combining SRCU and
migrate_disable():

AFAIU eBPF currently uses preempt_disable() for two reasons:

- Ensure the thread is not migrated,
  -> can be replaced by migrate_disable() in RT
- Provide RCU existence guarantee through sched-RCU
  -> can be replaced by SRCU, which allows sleeping and taking page faults.

I wonder if it would be acceptable to take a page fault while migration is
disabled though?
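
A minimal sketch of that combination (the function name and the SRCU domain
are made up here, nothing like this exists in the series):

DEFINE_STATIC_SRCU(bpf_run_srcu);	/* hypothetical SRCU domain */

static u32 bpf_prog_run_sleepable(const struct bpf_prog *prog, const void *ctx)
{
	int idx;
	u32 ret;

	migrate_disable();			/* stay on this CPU, remain preemptible */
	idx = srcu_read_lock(&bpf_run_srcu);	/* existence guarantee, may sleep */
	ret = BPF_PROG_RUN(prog, ctx);
	srcu_read_unlock(&bpf_run_srcu, idx);
	migrate_enable();
	return ret;
}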

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-19  9:00     ` Thomas Gleixner
@ 2020-02-19 16:38       ` Alexei Starovoitov
  2020-02-21  0:20       ` Kees Cook
  1 sibling, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2020-02-19 16:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vinicius Costa Gomes, LKML, David Miller, bpf,
	Network Development, Alexei Starovoitov, Daniel Borkmann,
	Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
	Juri Lelli, Ingo Molnar, Kees Cook, Will Drewry, Andy Lutomirski

On Wed, Feb 19, 2020 at 1:01 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Vinicius Costa Gomes <vinicius.gomes@intel.com> writes:
>
> Cc+: seccomp folks
>
> > Thomas Gleixner <tglx@linutronix.de> writes:
> >
> >> From: David Miller <davem@davemloft.net>
>
> Leaving content for reference
>
> >> All of these cases are strictly of the form:
> >>
> >>      preempt_disable();
> >>      BPF_PROG_RUN(...);
> >>      preempt_enable();
> >>
> >> Replace this with BPF_PROG_RUN_PIN_ON_CPU() which wraps BPF_PROG_RUN()
> >> with:
> >>
> >>      migrate_disable();
> >>      BPF_PROG_RUN(...);
> >>      migrate_enable();
> >>
> >> On non RT enabled kernels this maps to preempt_disable/enable() and on RT
> >> enabled kernels this solely prevents migration, which is sufficient as
> >> there is no requirement to prevent reentrancy to any BPF program from a
> >> preempting task. The only requirement is that the program stays on the same
> >> CPU.
> >>
> >> Therefore, this is a trivially correct transformation.
> >>
> >> [ tglx: Converted to BPF_PROG_RUN_PIN_ON_CPU() ]
> >>
> >> Signed-off-by: David S. Miller <davem@davemloft.net>
> >> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >>
> >> ---
> >>  include/linux/filter.h    |    4 +---
> >>  kernel/seccomp.c          |    4 +---
> >>  net/core/flow_dissector.c |    4 +---
> >>  net/core/skmsg.c          |    8 ++------
> >>  net/kcm/kcmsock.c         |    4 +---
> >>  5 files changed, 6 insertions(+), 18 deletions(-)
> >>
> >> --- a/include/linux/filter.h
> >> +++ b/include/linux/filter.h
> >> @@ -713,9 +713,7 @@ static inline u32 bpf_prog_run_clear_cb(
> >>      if (unlikely(prog->cb_access))
> >>              memset(cb_data, 0, BPF_SKB_CB_LEN);
> >>
> >> -    preempt_disable();
> >> -    res = BPF_PROG_RUN(prog, skb);
> >> -    preempt_enable();
> >> +    res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
> >>      return res;
> >>  }
> >>
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str
> >>       * All filters in the list are evaluated and the lowest BPF return
> >>       * value always takes priority (ignoring the DATA).
> >>       */
> >> -    preempt_disable();
> >>      for (; f; f = f->prev) {
> >> -            u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
> >> +            u32 cur_ret = BPF_PROG_RUN_PIN_ON_CPU(f->prog, sd);
> >>
> >
> > More a question really, isn't the behavior changing here? i.e. shouldn't
> > migrate_disable()/migrate_enable() be moved to outside the loop? Or is
> > running seccomp filters on different cpus not a problem?
>
> In my understanding this is a list of filters and they are independent
> of each other.

Yes. It's fine to be preempted between filters.


* Re: [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()
  2020-02-14 13:39 ` [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() Thomas Gleixner
@ 2020-02-19 16:54   ` Steven Rostedt
  2020-02-19 17:26     ` Thomas Gleixner
  0 siblings, 1 reply; 42+ messages in thread
From: Steven Rostedt @ 2020-02-19 16:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Juri Lelli, Ingo Molnar

On Fri, 14 Feb 2020 14:39:21 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> __bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.
> 
> This is redundant because __bpf_trace_run() is invoked from a trace point
> via __DO_TRACE() which already disables preemption _before_ invoking any of
> the functions which are attached to a trace point.
> 
> Remove it.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/trace/bpf_trace.c |    2 --
>  1 file changed, 2 deletions(-)
> 
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1476,9 +1476,7 @@ static __always_inline
>  void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
>  {

Should there be a "cant_migrate()" added here?

-- Steve

>  	rcu_read_lock();
> -	preempt_disable();
>  	(void) BPF_PROG_RUN(prog, args);
> -	preempt_enable();
>  	rcu_read_unlock();
>  }
>  



* Re: [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()
  2020-02-19 16:54   ` Steven Rostedt
@ 2020-02-19 17:26     ` Thomas Gleixner
  0 siblings, 0 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-19 17:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, David Miller, bpf, netdev, Alexei Starovoitov,
	Daniel Borkmann, Sebastian Sewior, Peter Zijlstra,
	Clark Williams, Juri Lelli, Ingo Molnar

Steven Rostedt <rostedt@goodmis.org> writes:

> On Fri, 14 Feb 2020 14:39:21 +0100
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> __bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.
>> 
>> This is redundant because __bpf_trace_run() is invoked from a trace point
>> via __DO_TRACE() which already disables preemption _before_ invoking any of
>> the functions which are attached to a trace point.
>> 
>> Remove it.
>> 
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> ---
>>  kernel/trace/bpf_trace.c |    2 --
>>  1 file changed, 2 deletions(-)
>> 
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -1476,9 +1476,7 @@ static __always_inline
>>  void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
>>  {
>
> Should there be a "cant_migrate()" added here?

A cant_sleep() is the right thing to add as this really needs to stay
non-preemptible. Hmm?

>>  	rcu_read_lock();
>> -	preempt_disable();
>>  	(void) BPF_PROG_RUN(prog, args);
>> -	preempt_enable();
>>  	rcu_read_unlock();
>>  }
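
I.e. something like the sketch below (illustration only, assuming
cant_sleep() is the annotation we settle on):

static __always_inline
void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
{
	cant_sleep();
	rcu_read_lock();
	(void) BPF_PROG_RUN(prog, args);
	rcu_read_unlock();
}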

Thanks,

        tglx



* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-19 15:17         ` Mathieu Desnoyers
@ 2020-02-20  4:19           ` Alexei Starovoitov
  0 siblings, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2020-02-20  4:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, David S. Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Andrzej Siewior,
	Peter Zijlstra, Clark Williams, rostedt, Juri Lelli, Ingo Molnar

On Wed, Feb 19, 2020 at 10:17:28AM -0500, Mathieu Desnoyers wrote:
> ----- On Feb 18, 2020, at 6:36 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> 
> [...]
> 
> > If I can use migrate_disable() without RT it will help my work on sleepable
> > BPF programs. I would only have to worry about rcu_read_lock() since
> > preempt_disable() is nicely addressed.
> 
> Hi Alexei,
> 
> You may want to consider using SRCU rather than RCU if you need to sleep while
> holding an RCU read-side lock.
> 
> This is the synchronization approach I consider for adding the ability to take page
> faults when doing syscall tracing.
> 
> Then you'll be able to replace preempt_disable() by combining SRCU and
> migrate_disable():
> 
> AFAIU eBPF currently uses preempt_disable() for two reasons:
> 
> - Ensure the thread is not migrated,
>   -> can be replaced by migrate_disable() in RT
> - Provide RCU existence guarantee through sched-RCU
>   -> can be replaced by SRCU, which allows sleeping and taking page faults.

bpf is using normal rcu to protect map values
and rcu+preempt to protect per-cpu map values.
srcu is certainly under consideration. It hasn't been used due to performance
implications. atomics and barriers are too heavy for certain use cases. So we
have to keep rcu where performance matters, but cannot fork map implementations
to rcu and srcu due to huge code bloat. So far I've been thinking of
introducing an explicit helper bpf_rcu_read_lock() and letting programs use
it directly instead of the implicit rcu_read_lock() that is done outside of
the bpf prog. The tricky part is teaching the verifier to enforce the
critical section.
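
Purely as an illustration of the idea (nothing like this exists today; the
helper bodies are trivial, the func_proto plumbing and the verifier side,
which is the hard part, are not shown):

BPF_CALL_0(bpf_rcu_read_lock)
{
	rcu_read_lock();
	return 0;
}

BPF_CALL_0(bpf_rcu_read_unlock)
{
	rcu_read_unlock();
	return 0;
}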


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-19  9:00     ` Thomas Gleixner
  2020-02-19 16:38       ` Alexei Starovoitov
@ 2020-02-21  0:20       ` Kees Cook
  2020-02-21 14:00         ` Thomas Gleixner
  1 sibling, 1 reply; 42+ messages in thread
From: Kees Cook @ 2020-02-21  0:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vinicius Costa Gomes, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar, Will Drewry, Andy Lutomirski

On Wed, Feb 19, 2020 at 10:00:56AM +0100, Thomas Gleixner wrote:
> Vinicius Costa Gomes <vinicius.gomes@intel.com> writes:
> 
> Cc+: seccomp folks 
> 
> > Thomas Gleixner <tglx@linutronix.de> writes:
> >
> >> From: David Miller <davem@davemloft.net>
> 
> Leaving content for reference
> 
> >> All of these cases are strictly of the form:
> >>
> >> 	preempt_disable();
> >> 	BPF_PROG_RUN(...);
> >> 	preempt_enable();
> >>
> >> Replace this with BPF_PROG_RUN_PIN_ON_CPU() which wraps BPF_PROG_RUN()
> >> with:
> >>
> >> 	migrate_disable();
> >> 	BPF_PROG_RUN(...);
> >> 	migrate_enable();
> >>
> >> On non RT enabled kernels this maps to preempt_disable/enable() and on RT
> >> enabled kernels this solely prevents migration, which is sufficient as
> >> there is no requirement to prevent reentrancy to any BPF program from a
> >> preempting task. The only requirement is that the program stays on the same
> >> CPU.
> >>
> >> Therefore, this is a trivially correct transformation.
> >>
> >> [ tglx: Converted to BPF_PROG_RUN_PIN_ON_CPU() ]
> >>
> >> Signed-off-by: David S. Miller <davem@davemloft.net>
> >> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >>
> >> ---
> >>  include/linux/filter.h    |    4 +---
> >>  kernel/seccomp.c          |    4 +---
> >>  net/core/flow_dissector.c |    4 +---
> >>  net/core/skmsg.c          |    8 ++------
> >>  net/kcm/kcmsock.c         |    4 +---
> >>  5 files changed, 6 insertions(+), 18 deletions(-)
> >>
> >> --- a/include/linux/filter.h
> >> +++ b/include/linux/filter.h
> >> @@ -713,9 +713,7 @@ static inline u32 bpf_prog_run_clear_cb(
> >>  	if (unlikely(prog->cb_access))
> >>  		memset(cb_data, 0, BPF_SKB_CB_LEN);
> >>  
> >> -	preempt_disable();
> >> -	res = BPF_PROG_RUN(prog, skb);
> >> -	preempt_enable();
> >> +	res = BPF_PROG_RUN_PIN_ON_CPU(prog, skb);
> >>  	return res;
> >>  }
> >>  
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const str
> >>  	 * All filters in the list are evaluated and the lowest BPF return
> >>  	 * value always takes priority (ignoring the DATA).
> >>  	 */
> >> -	preempt_disable();
> >>  	for (; f; f = f->prev) {
> >> -		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
> >> +		u32 cur_ret = BPF_PROG_RUN_PIN_ON_CPU(f->prog, sd);
> >>
> >
> > More a question really, isn't the behavior changing here? i.e. shouldn't
> > migrate_disable()/migrate_enable() be moved to outside the loop? Or is
> > running seccomp filters on different cpus not a problem?
> 
> In my understanding this is a list of filters and they are independent
> of each other.
> 
> Kees, Will, Andy?

They're technically independent, but they are related to each
other. (i.e. order matters, process hierarchy matters, etc). There's no
reason I can see that we can't switch CPUs between running them, though.
(AIUI, nothing here would suddenly make these run in parallel, right?)

As long as "current" is still "current", and they run in the same order,
we'll get the same final result as far as seccomp is concerned.

-- 
Kees Cook
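
As a reading aid for the conversion quoted above, a minimal sketch of what
the BPF_PROG_RUN_PIN_ON_CPU() wrapper is described as doing. This is based
only on the commit message, not on the actual patch introducing the macro,
so details may differ:

/* Sketch, not the series' literal code; assumes kernel context (u32,
 * migrate_disable() and BPF_PROG_RUN() as discussed in this thread).
 * On a !RT kernel migrate_disable() maps to preempt_disable(); on RT it
 * only pins the task to the current CPU, which is sufficient because
 * reentrancy from a preempting task is not a problem -- the program
 * merely has to run to completion on one CPU. */
#define BPF_PROG_RUN_PIN_ON_CPU(prog, ctx) ({		\
	u32 __ret;					\
	migrate_disable();				\
	__ret = BPF_PROG_RUN(prog, ctx);		\
	migrate_enable();				\
	__ret;						\
})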


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-21  0:20       ` Kees Cook
@ 2020-02-21 14:00         ` Thomas Gleixner
  2020-02-21 14:05           ` Peter Zijlstra
  2020-02-21 22:15           ` Kees Cook
  0 siblings, 2 replies; 42+ messages in thread
From: Thomas Gleixner @ 2020-02-21 14:00 UTC (permalink / raw)
  To: Kees Cook
  Cc: Vinicius Costa Gomes, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar, Will Drewry, Andy Lutomirski

Kees Cook <keescook@chromium.org> writes:
> On Wed, Feb 19, 2020 at 10:00:56AM +0100, Thomas Gleixner wrote:
>> Vinicius Costa Gomes <vinicius.gomes@intel.com> writes:
>> > More a question really, isn't the behavior changing here? i.e. shouldn't
>> > migrate_disable()/migrate_enable() be moved to outside the loop? Or is
>> > running seccomp filters on different cpus not a problem?
>> 
>> In my understanding this is a list of filters and they are independent
>> of each other.
>> 
>> Kees, Will, Andy?
>
> They're technically independent, but they are related to each
> other. (i.e. order matters, process hierarchy matters, etc). There's no
> reason I can see that we can't switch CPUs between running them, though.
> (AIUI, nothing here would suddenly make these run in parallel, right?)

Of course not. If we'd run the same thread on multiple CPUs in parallel
the ordering of your BPF programs would be the least of your worries.

> As long as "current" is still "current", and they run in the same order,
> we'll get the same final result as far as seccomp is concerned.

Right.

Thanks,

        tglx
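
To make the ordering argument concrete, a toy user-space model (names and
constants are invented, loosely mimicking the seccomp return values; this is
not kernel code): the verdict is a pure reduction over the same syscall
snapshot of current, so which CPU happens to evaluate a given filter cannot
change the result.

#include <stdint.h>
#include <stdio.h>

struct sd_model { int nr; uint64_t args[6]; };
typedef uint32_t (*filter_fn)(const struct sd_model *sd);

static uint32_t f_allow(const struct sd_model *sd) { (void)sd; return 0x7fff0000u; }
static uint32_t f_errno(const struct sd_model *sd) { (void)sd; return 0x00050001u; }

static uint32_t run_filters(filter_fn *f, int n, const struct sd_model *sd)
{
	uint32_t ret = 0x7fff0000u;		/* "allow" in this model */

	for (int i = 0; i < n; i++) {
		uint32_t cur = f[i](sd);	/* may migrate between iterations */

		if ((cur & 0xffff0000u) < (ret & 0xffff0000u))
			ret = cur;		/* lowest action takes priority */
	}
	return ret;
}

int main(void)
{
	filter_fn filters[] = { f_allow, f_errno };
	struct sd_model sd = { .nr = 2 };

	printf("verdict: 0x%08x\n", run_filters(filters, 2, &sd));
	return 0;
}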


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-21 14:00         ` Thomas Gleixner
@ 2020-02-21 14:05           ` Peter Zijlstra
  2020-02-21 22:15           ` Kees Cook
  1 sibling, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2020-02-21 14:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kees Cook, Vinicius Costa Gomes, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Clark Williams, Steven Rostedt, Juri Lelli, Ingo Molnar,
	Will Drewry, Andy Lutomirski

On Fri, Feb 21, 2020 at 03:00:54PM +0100, Thomas Gleixner wrote:
> Of course not. If we'd run the same thread on multiple CPUs in parallel
> the ordering of your BPF programs would be the least of your worries.

Been there, done that. It goes sideways *REALLY* fast :-)


* Re: [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites.
  2020-02-21 14:00         ` Thomas Gleixner
  2020-02-21 14:05           ` Peter Zijlstra
@ 2020-02-21 22:15           ` Kees Cook
  1 sibling, 0 replies; 42+ messages in thread
From: Kees Cook @ 2020-02-21 22:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vinicius Costa Gomes, LKML, David Miller, bpf, netdev,
	Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
	Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
	Ingo Molnar, Will Drewry, Andy Lutomirski

On Fri, Feb 21, 2020 at 03:00:54PM +0100, Thomas Gleixner wrote:
> Kees Cook <keescook@chromium.org> writes:
> > They're technically independent, but they are related to each
> > other. (i.e. order matters, process hierarchy matters, etc). There's no
> > reason I can see that we can't switch CPUs between running them, though.
> > (AIUI, nothing here would suddenly make these run in parallel, right?)
> 
> Of course not. If we'd run the same thread on multiple CPUs in parallel
> the ordering of your BPF programs would be the least of your worries.

Right, okay, good. I just wanted to be extra sure. :)

-- 
Kees Cook


Thread overview: 42+ messages
2020-02-14 13:39 [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 02/19] sched: Provide cant_migrate() Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 03/19] bpf: Update locking comment in hashtab code Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 04/19] bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() Thomas Gleixner
2020-02-19 16:54   ` Steven Rostedt
2020-02-19 17:26     ` Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 05/19] perf/bpf: Remove preempt disable around BPF invocation Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 06/19] bpf: Dont iterate over possible CPUs with interrupts disabled Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 07/19] bpf: Provide BPF_PROG_RUN_PIN_ON_CPU() macro Thomas Gleixner
2020-02-14 18:50   ` Mathieu Desnoyers
2020-02-14 19:36     ` Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 08/19] bpf: Replace cant_sleep() with cant_migrate() Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 09/19] bpf: Use BPF_PROG_RUN_PIN_ON_CPU() at simple call sites Thomas Gleixner
2020-02-19  1:39   ` Vinicius Costa Gomes
2020-02-19  9:00     ` Thomas Gleixner
2020-02-19 16:38       ` Alexei Starovoitov
2020-02-21  0:20       ` Kees Cook
2020-02-21 14:00         ` Thomas Gleixner
2020-02-21 14:05           ` Peter Zijlstra
2020-02-21 22:15           ` Kees Cook
2020-02-14 13:39 ` [RFC patch 10/19] trace/bpf: Use migrate disable in trace_call_bpf() Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 11/19] bpf/tests: Use migrate disable instead of preempt disable Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 12/19] bpf: Use migrate_disable/enabe() in trampoline code Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 13/19] bpf: Use migrate_disable/enable in array macros and cgroup/lirc code Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code Thomas Gleixner
2020-02-14 19:11   ` Mathieu Desnoyers
2020-02-14 19:56     ` Thomas Gleixner
2020-02-18 23:36       ` Alexei Starovoitov
2020-02-19  0:49         ` Thomas Gleixner
2020-02-19  1:23           ` Alexei Starovoitov
2020-02-19 15:17         ` Mathieu Desnoyers
2020-02-20  4:19           ` Alexei Starovoitov
2020-02-14 13:39 ` [RFC patch 15/19] bpf: Use migrate_disable() in sys_bpf() Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 16/19] bpf: Factor out hashtab bucket lock operations Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 17/19] bpf: Prepare hashtab locking for PREEMPT_RT Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 18/19] bpf, lpm: Make locking RT friendly Thomas Gleixner
2020-02-14 13:39 ` [RFC patch 19/19] bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled Thomas Gleixner
2020-02-14 17:53 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist David Miller
2020-02-14 18:36   ` Thomas Gleixner
2020-02-17 12:59     ` [PATCH] bpf: Enforce map preallocation for all instrumentation programs Thomas Gleixner
2020-02-15 20:09 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist Jakub Kicinski
