bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] bpf: Remove in_atomic() from bpf_link_put().
@ 2023-05-09 13:24 Sebastian Andrzej Siewior
  2023-05-10  7:19 ` Andrii Nakryiko
  0 siblings, 1 reply; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-05-09 13:24 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Paul E. McKenney, Peter Zijlstra, Thomas Gleixner

bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
invoked within softirq context. By setting rcutree.use_softirq=0 boot
option the RCU callbacks will be invoked in a per-CPU kthread with
bottom halves disabled which implies a RCU read section.

On PREEMPT_RT the context remains fully preemptible. The RCU read
section however does not allow schedule() invocation. The latter happens
in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
from bpf_link_put().

Remove the context checks and use the workqueue unconditionally.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
The warning can be observed as:
| BUG: sleeping function called from invalid context at kernel/locking/rtmutex_api.c:510
| in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 47, name: rcuc/3
| preempt_count: 0, expected: 0
| RCU nest depth: 2, expected: 0
| CPU: 3 PID: 47 Comm: rcuc/3 Tainted: G            E      v6.3-rt12 #1
| Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.3a 01/06/2021
| Call Trace:
|  <TASK>
|  dump_stack_lvl+0x43/0x60
|  __might_resched+0x137/0x190
|  mutex_lock+0x1a/0x50
|  bpf_trampoline_unlink_prog+0x1b/0x100
|  bpf_tracing_link_release+0x12/0x40
|  bpf_link_free+0x70/0x90
|  bpf_free_inode+0x3e/0x80
|  rcu_core+0x4ff/0x7c0
|  rcu_cpu_kthread+0xa9/0x2f0
|  smpboot_thread_fn+0x141/0x2c0
|  kthread+0x110/0x130
|  ret_from_fork+0x2c/0x50
|  </TASK>

 kernel/bpf/syscall.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 14f39c1e573ee..0adaa1bfbb0d2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2785,12 +2785,8 @@ void bpf_link_put(struct bpf_link *link)
 	if (!atomic64_dec_and_test(&link->refcnt))
 		return;
 
-	if (in_atomic()) {
-		INIT_WORK(&link->work, bpf_link_put_deferred);
-		schedule_work(&link->work);
-	} else {
-		bpf_link_free(link);
-	}
+	INIT_WORK(&link->work, bpf_link_put_deferred);
+	schedule_work(&link->work);
 }
 EXPORT_SYMBOL(bpf_link_put);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] bpf: Remove in_atomic() from bpf_link_put().
  2023-05-09 13:24 [RFC PATCH] bpf: Remove in_atomic() from bpf_link_put() Sebastian Andrzej Siewior
@ 2023-05-10  7:19 ` Andrii Nakryiko
  2023-05-17  5:26   ` Alexei Starovoitov
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2023-05-10  7:19 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Paul E. McKenney, Peter Zijlstra, Thomas Gleixner

On Tue, May 9, 2023 at 6:24 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> invoked within softirq context. By setting rcutree.use_softirq=0 boot
> option the RCU callbacks will be invoked in a per-CPU kthread with
> bottom halves disabled which implies a RCU read section.
>
> On PREEMPT_RT the context remains fully preemptible. The RCU read
> section however does not allow schedule() invocation. The latter happens
> in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> from bpf_link_put().
>
> Remove the context checks and use the workqueue unconditionally.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---

Please see [0] and corresponding revert commit. We do want
bpf_link_free() to happen synchronously if it's caused by close()
syscall.

f00f2f7fe860 ("Revert "bpf: Fix potential call bpf_link_free() in
atomic context"")

  [0] https://lore.kernel.org/bpf/CAEf4BzZ9zwA=SrLTx9JT50OeM6fVPg0Py0Gx+K9ah2we8YtCRA@mail.gmail.com/

> The warning can be observed as:
> | BUG: sleeping function called from invalid context at kernel/locking/rtmutex_api.c:510
> | in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 47, name: rcuc/3
> | preempt_count: 0, expected: 0
> | RCU nest depth: 2, expected: 0
> | CPU: 3 PID: 47 Comm: rcuc/3 Tainted: G            E      v6.3-rt12 #1
> | Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.3a 01/06/2021
> | Call Trace:
> |  <TASK>
> |  dump_stack_lvl+0x43/0x60
> |  __might_resched+0x137/0x190
> |  mutex_lock+0x1a/0x50
> |  bpf_trampoline_unlink_prog+0x1b/0x100
> |  bpf_tracing_link_release+0x12/0x40
> |  bpf_link_free+0x70/0x90
> |  bpf_free_inode+0x3e/0x80
> |  rcu_core+0x4ff/0x7c0
> |  rcu_cpu_kthread+0xa9/0x2f0
> |  smpboot_thread_fn+0x141/0x2c0
> |  kthread+0x110/0x130
> |  ret_from_fork+0x2c/0x50
> |  </TASK>
>
>  kernel/bpf/syscall.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 14f39c1e573ee..0adaa1bfbb0d2 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2785,12 +2785,8 @@ void bpf_link_put(struct bpf_link *link)
>         if (!atomic64_dec_and_test(&link->refcnt))
>                 return;
>
> -       if (in_atomic()) {
> -               INIT_WORK(&link->work, bpf_link_put_deferred);
> -               schedule_work(&link->work);
> -       } else {
> -               bpf_link_free(link);
> -       }
> +       INIT_WORK(&link->work, bpf_link_put_deferred);
> +       schedule_work(&link->work);
>  }
>  EXPORT_SYMBOL(bpf_link_put);
>
> --
> 2.40.1
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] bpf: Remove in_atomic() from bpf_link_put().
  2023-05-10  7:19 ` Andrii Nakryiko
@ 2023-05-17  5:26   ` Alexei Starovoitov
  2023-05-17  6:09     ` Sebastian Andrzej Siewior
  2023-05-17 16:26     ` Andrii Nakryiko
  0 siblings, 2 replies; 17+ messages in thread
From: Alexei Starovoitov @ 2023-05-17  5:26 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Sebastian Andrzej Siewior, bpf, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Paul E. McKenney,
	Peter Zijlstra, Thomas Gleixner

On Wed, May 10, 2023 at 12:19 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, May 9, 2023 at 6:24 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> >
> > bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> > invoked within softirq context. By setting rcutree.use_softirq=0 boot
> > option the RCU callbacks will be invoked in a per-CPU kthread with
> > bottom halves disabled which implies a RCU read section.
> >
> > On PREEMPT_RT the context remains fully preemptible. The RCU read
> > section however does not allow schedule() invocation. The latter happens
> > in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> > from bpf_link_put().
> >
> > Remove the context checks and use the workqueue unconditionally.
> >
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > ---
>
> Please see [0] and corresponding revert commit. We do want
> bpf_link_free() to happen synchronously if it's caused by close()
> syscall.
>
> f00f2f7fe860 ("Revert "bpf: Fix potential call bpf_link_free() in
> atomic context"")
>
>   [0] https://lore.kernel.org/bpf/CAEf4BzZ9zwA=SrLTx9JT50OeM6fVPg0Py0Gx+K9ah2we8YtCRA@mail.gmail.com/

Sebastian,

Andrii is correct. We cannot do this unconditionally,
but we can do it for IS_ENABLED(CONFIG_PREEMPT_RT)
if it's causing issues on RT, but BPF users won't be happy
with non deterministic prog detach.
Do you see a different way of solving it?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] bpf: Remove in_atomic() from bpf_link_put().
  2023-05-17  5:26   ` Alexei Starovoitov
@ 2023-05-17  6:09     ` Sebastian Andrzej Siewior
  2023-05-17 16:26     ` Andrii Nakryiko
  1 sibling, 0 replies; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-05-17  6:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner

On 2023-05-16 22:26:01 [-0700], Alexei Starovoitov wrote:
> Sebastian,
Hi Alexei,

> Andrii is correct. We cannot do this unconditionally,
> but we can do it for IS_ENABLED(CONFIG_PREEMPT_RT)
> if it's causing issues on RT, but BPF users won't be happy
> with non deterministic prog detach.
> Do you see a different way of solving it?

Yes. I've been distracted with other things, I get back to it.

Sebastian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] bpf: Remove in_atomic() from bpf_link_put().
  2023-05-17  5:26   ` Alexei Starovoitov
  2023-05-17  6:09     ` Sebastian Andrzej Siewior
@ 2023-05-17 16:26     ` Andrii Nakryiko
  2023-05-25 14:18       ` [PATCH v2] " Sebastian Andrzej Siewior
  1 sibling, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2023-05-17 16:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Sebastian Andrzej Siewior, bpf, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Paul E. McKenney,
	Peter Zijlstra, Thomas Gleixner

On Tue, May 16, 2023 at 10:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, May 10, 2023 at 12:19 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, May 9, 2023 at 6:24 AM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> > >
> > > bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> > > invoked within softirq context. By setting rcutree.use_softirq=0 boot
> > > option the RCU callbacks will be invoked in a per-CPU kthread with
> > > bottom halves disabled which implies a RCU read section.
> > >
> > > On PREEMPT_RT the context remains fully preemptible. The RCU read
> > > section however does not allow schedule() invocation. The latter happens
> > > in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> > > from bpf_link_put().
> > >
> > > Remove the context checks and use the workqueue unconditionally.
> > >
> > > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > ---
> >
> > Please see [0] and corresponding revert commit. We do want
> > bpf_link_free() to happen synchronously if it's caused by close()
> > syscall.
> >
> > f00f2f7fe860 ("Revert "bpf: Fix potential call bpf_link_free() in
> > atomic context"")
> >
> >   [0] https://lore.kernel.org/bpf/CAEf4BzZ9zwA=SrLTx9JT50OeM6fVPg0Py0Gx+K9ah2we8YtCRA@mail.gmail.com/
>
> Sebastian,
>
> Andrii is correct. We cannot do this unconditionally,
> but we can do it for IS_ENABLED(CONFIG_PREEMPT_RT)
> if it's causing issues on RT, but BPF users won't be happy
> with non deterministic prog detach.
> Do you see a different way of solving it?

Is struct file_operations.release() callback guaranteed to be called
from user context? If yes, then perhaps the most straightforward way
to guarantee synchronous bpf_link cleanup on close(link_fd) is to have
a bpf_link_put() variant that will be only called from user-context
and will just do bpf_link_free(link) directly, without checking
in_atomic().

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2] bpf: Remove in_atomic() from bpf_link_put().
  2023-05-17 16:26     ` Andrii Nakryiko
@ 2023-05-25 14:18       ` Sebastian Andrzej Siewior
  2023-05-25 17:51         ` Andrii Nakryiko
  0 siblings, 1 reply; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-05-25 14:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner

bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
invoked within softirq context. By setting rcutree.use_softirq=0 boot
option the RCU callbacks will be invoked in a per-CPU kthread with
bottom halves disabled which implies a RCU read section.

On PREEMPT_RT the context remains fully preemptible. The RCU read
section however does not allow schedule() invocation. The latter happens
in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
from bpf_link_put().

It was pointed out that the bpf_link_put() invocation should not be
delayed if originated from close().

Remove the context checks and use the workqueue unconditionally. For the
close() callback add bpf_link_put_direct() which invokes free function
directly.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

v1…v2:
   - Add bpf_link_put_direct() to be used from bpf_link_release() as
     suggested.

On 2023-05-17 09:26:19 [-0700], Andrii Nakryiko wrote:
> Is struct file_operations.release() callback guaranteed to be called
> from user context? If yes, then perhaps the most straightforward way
> to guarantee synchronous bpf_link cleanup on close(link_fd) is to have
> a bpf_link_put() variant that will be only called from user-context
> and will just do bpf_link_free(link) directly, without checking
> in_atomic().

Yes. __fput() has a might_sleep() and it invokes
file_operations::release.

 kernel/bpf/syscall.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 14f39c1e573ee..85159428e5fee 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2785,20 +2785,23 @@ void bpf_link_put(struct bpf_link *link)
 	if (!atomic64_dec_and_test(&link->refcnt))
 		return;
 
-	if (in_atomic()) {
-		INIT_WORK(&link->work, bpf_link_put_deferred);
-		schedule_work(&link->work);
-	} else {
-		bpf_link_free(link);
-	}
+	INIT_WORK(&link->work, bpf_link_put_deferred);
+	schedule_work(&link->work);
 }
 EXPORT_SYMBOL(bpf_link_put);
 
+static void bpf_link_put_direct(struct bpf_link *link)
+{
+	if (!atomic64_dec_and_test(&link->refcnt))
+		return;
+	bpf_link_free(link);
+}
+
 static int bpf_link_release(struct inode *inode, struct file *filp)
 {
 	struct bpf_link *link = filp->private_data;
 
-	bpf_link_put(link);
+	bpf_link_put_direct(link);
 	return 0;
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] bpf: Remove in_atomic() from bpf_link_put().
  2023-05-25 14:18       ` [PATCH v2] " Sebastian Andrzej Siewior
@ 2023-05-25 17:51         ` Andrii Nakryiko
  2023-05-26 11:23           ` [PATCH v3] " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2023-05-25 17:51 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner

On Thu, May 25, 2023 at 7:18 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> invoked within softirq context. By setting rcutree.use_softirq=0 boot
> option the RCU callbacks will be invoked in a per-CPU kthread with
> bottom halves disabled which implies a RCU read section.
>
> On PREEMPT_RT the context remains fully preemptible. The RCU read
> section however does not allow schedule() invocation. The latter happens
> in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> from bpf_link_put().
>
> It was pointed out that the bpf_link_put() invocation should not be
> delayed if originated from close().
>
> Remove the context checks and use the workqueue unconditionally. For the
> close() callback add bpf_link_put_direct() which invokes free function
> directly.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>
> v1…v2:
>    - Add bpf_link_put_direct() to be used from bpf_link_release() as
>      suggested.

Looks good to me, but it's not sufficient. See kernel/bpf/inode.c, we
need to do bpf_link_put_direct() from bpf_put_any(), which should be
safe as well because it all is either triggered from bpf() syscall or
by unlink()'ing BPF FS file. For file deletion we have the same
requirement to have deterministic release of bpf_link.

>
> On 2023-05-17 09:26:19 [-0700], Andrii Nakryiko wrote:
> > Is struct file_operations.release() callback guaranteed to be called
> > from user context? If yes, then perhaps the most straightforward way
> > to guarantee synchronous bpf_link cleanup on close(link_fd) is to have
> > a bpf_link_put() variant that will be only called from user-context
> > and will just do bpf_link_free(link) directly, without checking
> > in_atomic().
>
> Yes. __fput() has a might_sleep() and it invokes
> file_operations::release.
>
>  kernel/bpf/syscall.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 14f39c1e573ee..85159428e5fee 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2785,20 +2785,23 @@ void bpf_link_put(struct bpf_link *link)
>         if (!atomic64_dec_and_test(&link->refcnt))
>                 return;
>
> -       if (in_atomic()) {
> -               INIT_WORK(&link->work, bpf_link_put_deferred);
> -               schedule_work(&link->work);
> -       } else {
> -               bpf_link_free(link);
> -       }
> +       INIT_WORK(&link->work, bpf_link_put_deferred);
> +       schedule_work(&link->work);
>  }
>  EXPORT_SYMBOL(bpf_link_put);
>
> +static void bpf_link_put_direct(struct bpf_link *link)
> +{
> +       if (!atomic64_dec_and_test(&link->refcnt))
> +               return;
> +       bpf_link_free(link);
> +}
> +
>  static int bpf_link_release(struct inode *inode, struct file *filp)
>  {
>         struct bpf_link *link = filp->private_data;
>
> -       bpf_link_put(link);
> +       bpf_link_put_direct(link);
>         return 0;
>  }
>
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3] bpf: Remove in_atomic() from bpf_link_put().
  2023-05-25 17:51         ` Andrii Nakryiko
@ 2023-05-26 11:23           ` Sebastian Andrzej Siewior
  2023-05-31 19:00             ` Andrii Nakryiko
  0 siblings, 1 reply; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-05-26 11:23 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner

bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
invoked within softirq context. By setting rcutree.use_softirq=0 boot
option the RCU callbacks will be invoked in a per-CPU kthread with
bottom halves disabled which implies a RCU read section.

On PREEMPT_RT the context remains fully preemptible. The RCU read
section however does not allow schedule() invocation. The latter happens
in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
from bpf_link_put().

It was pointed out that the bpf_link_put() invocation should not be
delayed if originated from close(). It was also pointed out that other
invocations from within a syscall should also avoid the workqueue.
After an audit of all bpf_link_put() callers it looks that the only
atomic caller is the RCU callback. Everything else is called from
preemptible context because it is a syscall, a mutex_t is acquired near
by or due a GFP_KERNEL memory allocation.

Let bpf_link_put() free the resources directly. Add
bpf_link_put_from_atomic() which uses the kworker to free the
resources. Let bpf_any_put() invoke one or the other depending on the
context it is called from (RCU or not).

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
v2…v3:
  - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
    and add bpf_link_put_from_atomic() to do the delayed free via the
    worker.

v1…v2:
   - Add bpf_link_put_direct() to be used from bpf_link_release() as
     suggested.

> Looks good to me, but it's not sufficient. See kernel/bpf/inode.c, we
> need to do bpf_link_put_direct() from bpf_put_any(), which should be
> safe as well because it all is either triggered from bpf() syscall or
> by unlink()'ing BPF FS file. For file deletion we have the same
> requirement to have deterministic release of bpf_link.

Okay. I checked all callers and it seems that the only atomic caller is
the RCU callback.

 include/linux/bpf.h  |  5 +++++
 kernel/bpf/inode.c   | 13 ++++++++-----
 kernel/bpf/syscall.c | 21 ++++++++++++---------
 3 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e53ceee1df370..dced1f880cfa6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2073,6 +2073,7 @@ int bpf_link_settle(struct bpf_link_primer *primer);
 void bpf_link_cleanup(struct bpf_link_primer *primer);
 void bpf_link_inc(struct bpf_link *link);
 void bpf_link_put(struct bpf_link *link);
+void bpf_link_put_from_atomic(struct bpf_link *link);
 int bpf_link_new_fd(struct bpf_link *link);
 struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
@@ -2432,6 +2433,10 @@ static inline void bpf_link_put(struct bpf_link *link)
 {
 }
 
+static inline void bpf_link_put_from_atomic(struct bpf_link *link)
+{
+}
+
 static inline int bpf_obj_get_user(const char __user *pathname, int flags)
 {
 	return -EOPNOTSUPP;
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 9948b542a470e..2e1e9f3c7f701 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -49,7 +49,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
 	return raw;
 }
 
-static void bpf_any_put(void *raw, enum bpf_type type)
+static void bpf_any_put(void *raw, enum bpf_type type, bool may_sleep)
 {
 	switch (type) {
 	case BPF_TYPE_PROG:
@@ -59,7 +59,10 @@ static void bpf_any_put(void *raw, enum bpf_type type)
 		bpf_map_put_with_uref(raw);
 		break;
 	case BPF_TYPE_LINK:
-		bpf_link_put(raw);
+		if (may_sleep)
+			bpf_link_put(raw);
+		else
+			bpf_link_put_from_atomic(raw);
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -490,7 +493,7 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
 
 	ret = bpf_obj_do_pin(pathname, raw, type);
 	if (ret != 0)
-		bpf_any_put(raw, type);
+		bpf_any_put(raw, type, true);
 
 	return ret;
 }
@@ -552,7 +555,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
 		return -ENOENT;
 
 	if (ret < 0)
-		bpf_any_put(raw, type);
+		bpf_any_put(raw, type, true);
 	return ret;
 }
 
@@ -617,7 +620,7 @@ static void bpf_free_inode(struct inode *inode)
 	if (S_ISLNK(inode->i_mode))
 		kfree(inode->i_link);
 	if (!bpf_inode_type(inode, &type))
-		bpf_any_put(inode->i_private, type);
+		bpf_any_put(inode->i_private, type, false);
 	free_inode_nonrcu(inode);
 }
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 14f39c1e573ee..87b07ebd6d146 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2777,20 +2777,23 @@ static void bpf_link_put_deferred(struct work_struct *work)
 	bpf_link_free(link);
 }
 
-/* bpf_link_put can be called from atomic context, but ensures that resources
- * are freed from process context
+/* bpf_link_put_from_atomic is called from atomic context. It needs to be called
+ * from sleepable context in order to acquire sleeping locks during the process.
  */
-void bpf_link_put(struct bpf_link *link)
+void bpf_link_put_from_atomic(struct bpf_link *link)
 {
 	if (!atomic64_dec_and_test(&link->refcnt))
 		return;
 
-	if (in_atomic()) {
-		INIT_WORK(&link->work, bpf_link_put_deferred);
-		schedule_work(&link->work);
-	} else {
-		bpf_link_free(link);
-	}
+	INIT_WORK(&link->work, bpf_link_put_deferred);
+	schedule_work(&link->work);
+}
+
+void bpf_link_put(struct bpf_link *link)
+{
+	if (!atomic64_dec_and_test(&link->refcnt))
+		return;
+	bpf_link_free(link);
 }
 EXPORT_SYMBOL(bpf_link_put);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3] bpf: Remove in_atomic() from bpf_link_put().
  2023-05-26 11:23           ` [PATCH v3] " Sebastian Andrzej Siewior
@ 2023-05-31 19:00             ` Andrii Nakryiko
  2023-06-05 16:37               ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2023-05-31 19:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner, Linux-Fsdevel

On Fri, May 26, 2023 at 4:24 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> invoked within softirq context. By setting rcutree.use_softirq=0 boot
> option the RCU callbacks will be invoked in a per-CPU kthread with
> bottom halves disabled which implies a RCU read section.
>
> On PREEMPT_RT the context remains fully preemptible. The RCU read
> section however does not allow schedule() invocation. The latter happens
> in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> from bpf_link_put().
>
> It was pointed out that the bpf_link_put() invocation should not be
> delayed if originated from close(). It was also pointed out that other
> invocations from within a syscall should also avoid the workqueue.
> After an audit of all bpf_link_put() callers it looks that the only
> atomic caller is the RCU callback. Everything else is called from
> preemptible context because it is a syscall, a mutex_t is acquired near
> by or due a GFP_KERNEL memory allocation.
>
> Let bpf_link_put() free the resources directly. Add
> bpf_link_put_from_atomic() which uses the kworker to free the
> resources. Let bpf_any_put() invoke one or the other depending on the
> context it is called from (RCU or not).
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
> v2…v3:
>   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
>     and add bpf_link_put_from_atomic() to do the delayed free via the
>     worker.

This seems like an unsafe "default put" choice. I think it makes more
sense to have bpf_link_put() do a safe scheduled put, and then having
a separate bpf_link_put_direct() for those special cases where we care
the most (in kernel/bpf/inode.c and kernel/bpf/syscall.c).

>
> v1…v2:
>    - Add bpf_link_put_direct() to be used from bpf_link_release() as
>      suggested.
>
> > Looks good to me, but it's not sufficient. See kernel/bpf/inode.c, we
> > need to do bpf_link_put_direct() from bpf_put_any(), which should be
> > safe as well because it all is either triggered from bpf() syscall or
> > by unlink()'ing BPF FS file. For file deletion we have the same
> > requirement to have deterministic release of bpf_link.
>
> Okay. I checked all callers and it seems that the only atomic caller is
> the RCU callback.
>
>  include/linux/bpf.h  |  5 +++++
>  kernel/bpf/inode.c   | 13 ++++++++-----
>  kernel/bpf/syscall.c | 21 ++++++++++++---------
>  3 files changed, 25 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e53ceee1df370..dced1f880cfa6 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2073,6 +2073,7 @@ int bpf_link_settle(struct bpf_link_primer *primer);
>  void bpf_link_cleanup(struct bpf_link_primer *primer);
>  void bpf_link_inc(struct bpf_link *link);
>  void bpf_link_put(struct bpf_link *link);
> +void bpf_link_put_from_atomic(struct bpf_link *link);
>  int bpf_link_new_fd(struct bpf_link *link);
>  struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
>  struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> @@ -2432,6 +2433,10 @@ static inline void bpf_link_put(struct bpf_link *link)
>  {
>  }
>
> +static inline void bpf_link_put_from_atomic(struct bpf_link *link)
> +{
> +}
> +
>  static inline int bpf_obj_get_user(const char __user *pathname, int flags)
>  {
>         return -EOPNOTSUPP;
> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 9948b542a470e..2e1e9f3c7f701 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -49,7 +49,7 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
>         return raw;
>  }
>
> -static void bpf_any_put(void *raw, enum bpf_type type)
> +static void bpf_any_put(void *raw, enum bpf_type type, bool may_sleep)
>  {
>         switch (type) {
>         case BPF_TYPE_PROG:
> @@ -59,7 +59,10 @@ static void bpf_any_put(void *raw, enum bpf_type type)
>                 bpf_map_put_with_uref(raw);
>                 break;
>         case BPF_TYPE_LINK:
> -               bpf_link_put(raw);
> +               if (may_sleep)
> +                       bpf_link_put(raw);
> +               else
> +                       bpf_link_put_from_atomic(raw);

Do we need to do this in two different ways here? The only situation
that has may_sleep=false is when called from superblock->free_inode.
According to documentation:

  Freeing memory in the callback is fine; doing
  more than that is possible, but requires a lot of care and is best
  avoided.

So it feels like cleaning up link should be safe to do from this
context as well? I've cc'ed linux-fsdevel@, hopefully they can advise.


>                 break;
>         default:
>                 WARN_ON_ONCE(1);
> @@ -490,7 +493,7 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname)
>
>         ret = bpf_obj_do_pin(pathname, raw, type);
>         if (ret != 0)
> -               bpf_any_put(raw, type);
> +               bpf_any_put(raw, type, true);
>
>         return ret;
>  }
> @@ -552,7 +555,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
>                 return -ENOENT;
>
>         if (ret < 0)
> -               bpf_any_put(raw, type);
> +               bpf_any_put(raw, type, true);
>         return ret;
>  }
>
> @@ -617,7 +620,7 @@ static void bpf_free_inode(struct inode *inode)
>         if (S_ISLNK(inode->i_mode))
>                 kfree(inode->i_link);
>         if (!bpf_inode_type(inode, &type))
> -               bpf_any_put(inode->i_private, type);
> +               bpf_any_put(inode->i_private, type, false);
>         free_inode_nonrcu(inode);
>  }
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 14f39c1e573ee..87b07ebd6d146 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2777,20 +2777,23 @@ static void bpf_link_put_deferred(struct work_struct *work)
>         bpf_link_free(link);
>  }
>
> -/* bpf_link_put can be called from atomic context, but ensures that resources
> - * are freed from process context
> +/* bpf_link_put_from_atomic is called from atomic context. It needs to be called
> + * from sleepable context in order to acquire sleeping locks during the process.
>   */
> -void bpf_link_put(struct bpf_link *link)
> +void bpf_link_put_from_atomic(struct bpf_link *link)
>  {
>         if (!atomic64_dec_and_test(&link->refcnt))
>                 return;
>
> -       if (in_atomic()) {
> -               INIT_WORK(&link->work, bpf_link_put_deferred);
> -               schedule_work(&link->work);
> -       } else {
> -               bpf_link_free(link);
> -       }
> +       INIT_WORK(&link->work, bpf_link_put_deferred);
> +       schedule_work(&link->work);
> +}
> +
> +void bpf_link_put(struct bpf_link *link)
> +{
> +       if (!atomic64_dec_and_test(&link->refcnt))
> +               return;
> +       bpf_link_free(link);
>  }
>  EXPORT_SYMBOL(bpf_link_put);
>
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3] bpf: Remove in_atomic() from bpf_link_put().
  2023-05-31 19:00             ` Andrii Nakryiko
@ 2023-06-05 16:37               ` Sebastian Andrzej Siewior
  2023-06-05 22:47                 ` Andrii Nakryiko
  0 siblings, 1 reply; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-05 16:37 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner, Linux-Fsdevel

On 2023-05-31 12:00:56 [-0700], Andrii Nakryiko wrote:
> > On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
> > v2…v3:
> >   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
> >     and add bpf_link_put_from_atomic() to do the delayed free via the
> >     worker.
> 
> This seems like an unsafe "default put" choice. I think it makes more
> sense to have bpf_link_put() do a safe scheduled put, and then having
> a separate bpf_link_put_direct() for those special cases where we care
> the most (in kernel/bpf/inode.c and kernel/bpf/syscall.c).

I audited them and ended up with them all being safe except for the one
from inode.c. I can reverse the logic if you want.

> > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > index 9948b542a470e..2e1e9f3c7f701 100644
> > --- a/kernel/bpf/inode.c
> > +++ b/kernel/bpf/inode.c
> > @@ -59,7 +59,10 @@ static void bpf_any_put(void *raw, enum bpf_type type)
> >                 bpf_map_put_with_uref(raw);
> >                 break;
> >         case BPF_TYPE_LINK:
> > -               bpf_link_put(raw);
> > +               if (may_sleep)
> > +                       bpf_link_put(raw);
> > +               else
> > +                       bpf_link_put_from_atomic(raw);
> 
> Do we need to do this in two different ways here? The only situation
> that has may_sleep=false is when called from superblock->free_inode.
> According to documentation:
> 
>   Freeing memory in the callback is fine; doing
>   more than that is possible, but requires a lot of care and is best
>   avoided.
> 
> So it feels like cleaning up link should be safe to do from this
> context as well? I've cc'ed linux-fsdevel@, hopefully they can advise.

This is invoked from the RCU callback (which is usually softirq):
	destroy_inode()
	 -> call_rcu(&inode->i_rcu, i_callback);

Freeing memory is fine but there is a mutex that is held in the process.

Sebastian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3] bpf: Remove in_atomic() from bpf_link_put().
  2023-06-05 16:37               ` Sebastian Andrzej Siewior
@ 2023-06-05 22:47                 ` Andrii Nakryiko
  2023-06-09 14:19                   ` Sebastian Andrzej Siewior
  2023-06-14  8:34                   ` [PATCH v4] " Sebastian Andrzej Siewior
  0 siblings, 2 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2023-06-05 22:47 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner, Linux-Fsdevel

On Mon, Jun 5, 2023 at 9:37 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2023-05-31 12:00:56 [-0700], Andrii Nakryiko wrote:
> > > On 2023-05-25 10:51:23 [-0700], Andrii Nakryiko wrote:
> > > v2…v3:
> > >   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
> > >     and add bpf_link_put_from_atomic() to do the delayed free via the
> > >     worker.
> >
> > This seems like an unsafe "default put" choice. I think it makes more
> > sense to have bpf_link_put() do a safe scheduled put, and then having
> > a separate bpf_link_put_direct() for those special cases where we care
> > the most (in kernel/bpf/inode.c and kernel/bpf/syscall.c).
>
> I audited them and ended up with them all being safe except for the one
> from inode.c. I can reverse the logic if you want.

I understand it's safe today, but I'm more worried for future places
that will do bpf_link_put(). Given it's only close() and BPF FS
unlink() that require synchronous semantics, I think it's fine to make
bpf_link_put() to be async, and have bpf_link_put_sync() (or direct,
or whatever suffix) as a conscious special case.

>
> > > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > > index 9948b542a470e..2e1e9f3c7f701 100644
> > > --- a/kernel/bpf/inode.c
> > > +++ b/kernel/bpf/inode.c
> > > @@ -59,7 +59,10 @@ static void bpf_any_put(void *raw, enum bpf_type type)
> > >                 bpf_map_put_with_uref(raw);
> > >                 break;
> > >         case BPF_TYPE_LINK:
> > > -               bpf_link_put(raw);
> > > +               if (may_sleep)
> > > +                       bpf_link_put(raw);
> > > +               else
> > > +                       bpf_link_put_from_atomic(raw);
> >
> > Do we need to do this in two different ways here? The only situation
> > that has may_sleep=false is when called from superblock->free_inode.
> > According to documentation:
> >
> >   Freeing memory in the callback is fine; doing
> >   more than that is possible, but requires a lot of care and is best
> >   avoided.
> >
> > So it feels like cleaning up link should be safe to do from this
> > context as well? I've cc'ed linux-fsdevel@, hopefully they can advise.
>
> This is invoked from the RCU callback (which is usually softirq):
>         destroy_inode()
>          -> call_rcu(&inode->i_rcu, i_callback);
>
> Freeing memory is fine but there is a mutex that is held in the process.

I think it should be fine for BPF link destruction then?

>
> Sebastian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3] bpf: Remove in_atomic() from bpf_link_put().
  2023-06-05 22:47                 ` Andrii Nakryiko
@ 2023-06-09 14:19                   ` Sebastian Andrzej Siewior
  2023-06-14  8:34                   ` [PATCH v4] " Sebastian Andrzej Siewior
  1 sibling, 0 replies; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-09 14:19 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner, Linux-Fsdevel

On 2023-06-05 15:47:23 [-0700], Andrii Nakryiko wrote:
> I understand it's safe today, but I'm more worried for future places
> that will do bpf_link_put(). Given it's only close() and BPF FS
> unlink() that require synchronous semantics, I think it's fine to make
> bpf_link_put() to be async, and have bpf_link_put_sync() (or direct,
> or whatever suffix) as a conscious special case.

Okay, let me do that then.

> > This is invoked from the RCU callback (which is usually softirq):
> >         destroy_inode()
> >          -> call_rcu(&inode->i_rcu, i_callback);
> >
> > Freeing memory is fine but there is a mutex that is held in the process.
> 
> I think it should be fine for BPF link destruction then?

bpf_any_put() needs that "may sleep" exception for BPF_TYPE_LINK if it
comes from RCU.
I will swap that patch to be async by default and make sync for
bpf_any_put() if called from close (except for the RCU case).

Sebastian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4] bpf: Remove in_atomic() from bpf_link_put().
  2023-06-05 22:47                 ` Andrii Nakryiko
  2023-06-09 14:19                   ` Sebastian Andrzej Siewior
@ 2023-06-14  8:34                   ` Sebastian Andrzej Siewior
  2023-06-15 16:43                     ` Paul E. McKenney
  2023-06-16 16:30                     ` patchwork-bot+netdevbpf
  1 sibling, 2 replies; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-14  8:34 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Paul E. McKenney, Peter Zijlstra,
	Thomas Gleixner, Linux-Fsdevel

bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
invoked within softirq context. By setting rcutree.use_softirq=0 boot
option the RCU callbacks will be invoked in a per-CPU kthread with
bottom halves disabled which implies a RCU read section.

On PREEMPT_RT the context remains fully preemptible. The RCU read
section however does not allow schedule() invocation. The latter happens
in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
from bpf_link_put().

It was pointed out that the bpf_link_put() invocation should not be
delayed if originated from close(). It was also pointed out that other
invocations from within a syscall should also avoid the workqueue.
Everyone else should use workqueue by default to remain safe in the
future (while auditing the code, every caller was preemptible except for
the RCU case).

Let bpf_link_put() use the worker unconditionally. Add
bpf_link_put_direct() which will directly free the resources and is used
by close() and from within __sys_bpf().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v3…v4:
  - Revert back to bpf_link_put_direct() to the direct free and let
    bpf_link_put() use the worker. Let close() and all invocations from
    within the syscall use bpf_link_put_direct() which are all instances
    within syscall.c here.

v2…v3:
  - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
    and add bpf_link_put_from_atomic() to do the delayed free via the
    worker.

v1…v2:
   - Add bpf_link_put_direct() to be used from bpf_link_release() as
     suggested.

 kernel/bpf/syscall.c | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 14f39c1e573ee..8f09aef5949d4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2777,28 +2777,31 @@ static void bpf_link_put_deferred(struct work_struct *work)
 	bpf_link_free(link);
 }
 
-/* bpf_link_put can be called from atomic context, but ensures that resources
- * are freed from process context
+/* bpf_link_put might be called from atomic context. It needs to be called
+ * from sleepable context in order to acquire sleeping locks during the process.
  */
 void bpf_link_put(struct bpf_link *link)
 {
 	if (!atomic64_dec_and_test(&link->refcnt))
 		return;
 
-	if (in_atomic()) {
-		INIT_WORK(&link->work, bpf_link_put_deferred);
-		schedule_work(&link->work);
-	} else {
-		bpf_link_free(link);
-	}
+	INIT_WORK(&link->work, bpf_link_put_deferred);
+	schedule_work(&link->work);
 }
 EXPORT_SYMBOL(bpf_link_put);
 
+static void bpf_link_put_direct(struct bpf_link *link)
+{
+	if (!atomic64_dec_and_test(&link->refcnt))
+		return;
+	bpf_link_free(link);
+}
+
 static int bpf_link_release(struct inode *inode, struct file *filp)
 {
 	struct bpf_link *link = filp->private_data;
 
-	bpf_link_put(link);
+	bpf_link_put_direct(link);
 	return 0;
 }
 
@@ -4764,7 +4767,7 @@ static int link_update(union bpf_attr *attr)
 	if (ret)
 		bpf_prog_put(new_prog);
 out_put_link:
-	bpf_link_put(link);
+	bpf_link_put_direct(link);
 	return ret;
 }
 
@@ -4787,7 +4790,7 @@ static int link_detach(union bpf_attr *attr)
 	else
 		ret = -EOPNOTSUPP;
 
-	bpf_link_put(link);
+	bpf_link_put_direct(link);
 	return ret;
 }
 
@@ -4857,7 +4860,7 @@ static int bpf_link_get_fd_by_id(const union bpf_attr *attr)
 
 	fd = bpf_link_new_fd(link);
 	if (fd < 0)
-		bpf_link_put(link);
+		bpf_link_put_direct(link);
 
 	return fd;
 }
@@ -4934,7 +4937,7 @@ static int bpf_iter_create(union bpf_attr *attr)
 		return PTR_ERR(link);
 
 	err = bpf_iter_new_fd(link);
-	bpf_link_put(link);
+	bpf_link_put_direct(link);
 
 	return err;
 }
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v4] bpf: Remove in_atomic() from bpf_link_put().
  2023-06-14  8:34                   ` [PATCH v4] " Sebastian Andrzej Siewior
@ 2023-06-15 16:43                     ` Paul E. McKenney
  2023-06-15 19:13                       ` Sebastian Andrzej Siewior
  2023-06-16 16:30                     ` patchwork-bot+netdevbpf
  1 sibling, 1 reply; 17+ messages in thread
From: Paul E. McKenney @ 2023-06-15 16:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrii Nakryiko, Alexei Starovoitov, bpf, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Linux-Fsdevel

On Wed, Jun 14, 2023 at 10:34:30AM +0200, Sebastian Andrzej Siewior wrote:
> bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> invoked within softirq context. By setting rcutree.use_softirq=0 boot
> option the RCU callbacks will be invoked in a per-CPU kthread with
> bottom halves disabled which implies a RCU read section.
> 
> On PREEMPT_RT the context remains fully preemptible. The RCU read
> section however does not allow schedule() invocation. The latter happens
> in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> from bpf_link_put().

Just to make sure that I understand, you are proposing that the RCU
callbacks continue to run with BH disabled, but that BH-disabled regions
are preemptible in kernels built with CONFIG_PREEMPT_RT=y?

Or did I miss a turn in there somewhere?

							Thanx, Paul

> It was pointed out that the bpf_link_put() invocation should not be
> delayed if originated from close(). It was also pointed out that other
> invocations from within a syscall should also avoid the workqueue.
> Everyone else should use workqueue by default to remain safe in the
> future (while auditing the code, every caller was preemptible except for
> the RCU case).
> 
> Let bpf_link_put() use the worker unconditionally. Add
> bpf_link_put_direct() which will directly free the resources and is used
> by close() and from within __sys_bpf().
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> v3…v4:
>   - Revert back to bpf_link_put_direct() to the direct free and let
>     bpf_link_put() use the worker. Let close() and all invocations from
>     within the syscall use bpf_link_put_direct() which are all instances
>     within syscall.c here.
> 
> v2…v3:
>   - Drop bpf_link_put_direct(). Let bpf_link_put() do the direct free
>     and add bpf_link_put_from_atomic() to do the delayed free via the
>     worker.
> 
> v1…v2:
>    - Add bpf_link_put_direct() to be used from bpf_link_release() as
>      suggested.
> 
>  kernel/bpf/syscall.c | 29 ++++++++++++++++-------------
>  1 file changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 14f39c1e573ee..8f09aef5949d4 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2777,28 +2777,31 @@ static void bpf_link_put_deferred(struct work_struct *work)
>  	bpf_link_free(link);
>  }
>  
> -/* bpf_link_put can be called from atomic context, but ensures that resources
> - * are freed from process context
> +/* bpf_link_put might be called from atomic context. It needs to be called
> + * from sleepable context in order to acquire sleeping locks during the process.
>   */
>  void bpf_link_put(struct bpf_link *link)
>  {
>  	if (!atomic64_dec_and_test(&link->refcnt))
>  		return;
>  
> -	if (in_atomic()) {
> -		INIT_WORK(&link->work, bpf_link_put_deferred);
> -		schedule_work(&link->work);
> -	} else {
> -		bpf_link_free(link);
> -	}
> +	INIT_WORK(&link->work, bpf_link_put_deferred);
> +	schedule_work(&link->work);
>  }
>  EXPORT_SYMBOL(bpf_link_put);
>  
> +static void bpf_link_put_direct(struct bpf_link *link)
> +{
> +	if (!atomic64_dec_and_test(&link->refcnt))
> +		return;
> +	bpf_link_free(link);
> +}
> +
>  static int bpf_link_release(struct inode *inode, struct file *filp)
>  {
>  	struct bpf_link *link = filp->private_data;
>  
> -	bpf_link_put(link);
> +	bpf_link_put_direct(link);
>  	return 0;
>  }
>  
> @@ -4764,7 +4767,7 @@ static int link_update(union bpf_attr *attr)
>  	if (ret)
>  		bpf_prog_put(new_prog);
>  out_put_link:
> -	bpf_link_put(link);
> +	bpf_link_put_direct(link);
>  	return ret;
>  }
>  
> @@ -4787,7 +4790,7 @@ static int link_detach(union bpf_attr *attr)
>  	else
>  		ret = -EOPNOTSUPP;
>  
> -	bpf_link_put(link);
> +	bpf_link_put_direct(link);
>  	return ret;
>  }
>  
> @@ -4857,7 +4860,7 @@ static int bpf_link_get_fd_by_id(const union bpf_attr *attr)
>  
>  	fd = bpf_link_new_fd(link);
>  	if (fd < 0)
> -		bpf_link_put(link);
> +		bpf_link_put_direct(link);
>  
>  	return fd;
>  }
> @@ -4934,7 +4937,7 @@ static int bpf_iter_create(union bpf_attr *attr)
>  		return PTR_ERR(link);
>  
>  	err = bpf_iter_new_fd(link);
> -	bpf_link_put(link);
> +	bpf_link_put_direct(link);
>  
>  	return err;
>  }
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4] bpf: Remove in_atomic() from bpf_link_put().
  2023-06-15 16:43                     ` Paul E. McKenney
@ 2023-06-15 19:13                       ` Sebastian Andrzej Siewior
  2023-06-15 19:32                         ` Paul E. McKenney
  0 siblings, 1 reply; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-06-15 19:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrii Nakryiko, Alexei Starovoitov, bpf, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Linux-Fsdevel

On 2023-06-15 09:43:11 [-0700], Paul E. McKenney wrote:
> On Wed, Jun 14, 2023 at 10:34:30AM +0200, Sebastian Andrzej Siewior wrote:
> > bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> > invoked within softirq context. By setting rcutree.use_softirq=0 boot
> > option the RCU callbacks will be invoked in a per-CPU kthread with
> > bottom halves disabled which implies a RCU read section.
> > 
> > On PREEMPT_RT the context remains fully preemptible. The RCU read
> > section however does not allow schedule() invocation. The latter happens
> > in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> > from bpf_link_put().
> 
> Just to make sure that I understand, you are proposing that the RCU
> callbacks continue to run with BH disabled, but that BH-disabled regions
> are preemptible in kernels built with CONFIG_PREEMPT_RT=y?
> 
> Or did I miss a turn in there somewhere?

I'm not proposing anything, just stating what we have. On PREEMPT_RT
you are preemptible within the RCU callback but must not invoke
schedule(). Similar to the RCU read section on CONFIG_PREEMPT where you
are preemptible but must not invoke schedule(). 

> 
> 							Thanx, Paul

Sebastian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4] bpf: Remove in_atomic() from bpf_link_put().
  2023-06-15 19:13                       ` Sebastian Andrzej Siewior
@ 2023-06-15 19:32                         ` Paul E. McKenney
  0 siblings, 0 replies; 17+ messages in thread
From: Paul E. McKenney @ 2023-06-15 19:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrii Nakryiko, Alexei Starovoitov, bpf, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Peter Zijlstra, Thomas Gleixner,
	Linux-Fsdevel

On Thu, Jun 15, 2023 at 09:13:41PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-06-15 09:43:11 [-0700], Paul E. McKenney wrote:
> > On Wed, Jun 14, 2023 at 10:34:30AM +0200, Sebastian Andrzej Siewior wrote:
> > > bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> > > invoked within softirq context. By setting rcutree.use_softirq=0 boot
> > > option the RCU callbacks will be invoked in a per-CPU kthread with
> > > bottom halves disabled which implies a RCU read section.
> > > 
> > > On PREEMPT_RT the context remains fully preemptible. The RCU read
> > > section however does not allow schedule() invocation. The latter happens
> > > in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> > > from bpf_link_put().
> > 
> > Just to make sure that I understand, you are proposing that the RCU
> > callbacks continue to run with BH disabled, but that BH-disabled regions
> > are preemptible in kernels built with CONFIG_PREEMPT_RT=y?
> > 
> > Or did I miss a turn in there somewhere?
> 
> I'm not proposing anything, just stating what we have. On PREEMPT_RT
> you are preemptible within the RCU callback but must not invoke
> schedule(). Similar to the RCU read section on CONFIG_PREEMPT where you
> are preemptible but must not invoke schedule(). 

Thank you for the clarification!

The main risk of preemptible RCU callbacks is callback flooding, but
RCU priority boosting should take care of that.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4] bpf: Remove in_atomic() from bpf_link_put().
  2023-06-14  8:34                   ` [PATCH v4] " Sebastian Andrzej Siewior
  2023-06-15 16:43                     ` Paul E. McKenney
@ 2023-06-16 16:30                     ` patchwork-bot+netdevbpf
  1 sibling, 0 replies; 17+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-06-16 16:30 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: andrii.nakryiko, alexei.starovoitov, bpf, ast, daniel,
	john.fastabend, paulmck, peterz, tglx, linux-fsdevel

Hello:

This patch was applied to bpf/bpf-next.git (master)
by Andrii Nakryiko <andrii@kernel.org>:

On Wed, 14 Jun 2023 10:34:30 +0200 you wrote:
> bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are
> invoked within softirq context. By setting rcutree.use_softirq=0 boot
> option the RCU callbacks will be invoked in a per-CPU kthread with
> bottom halves disabled which implies a RCU read section.
> 
> On PREEMPT_RT the context remains fully preemptible. The RCU read
> section however does not allow schedule() invocation. The latter happens
> in mutex_lock() performed by bpf_trampoline_unlink_prog() originated
> from bpf_link_put().
> 
> [...]

Here is the summary with links:
  - [v4] bpf: Remove in_atomic() from bpf_link_put().
    https://git.kernel.org/bpf/bpf-next/c/ab5d47bd41b1

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-06-16 16:30 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-09 13:24 [RFC PATCH] bpf: Remove in_atomic() from bpf_link_put() Sebastian Andrzej Siewior
2023-05-10  7:19 ` Andrii Nakryiko
2023-05-17  5:26   ` Alexei Starovoitov
2023-05-17  6:09     ` Sebastian Andrzej Siewior
2023-05-17 16:26     ` Andrii Nakryiko
2023-05-25 14:18       ` [PATCH v2] " Sebastian Andrzej Siewior
2023-05-25 17:51         ` Andrii Nakryiko
2023-05-26 11:23           ` [PATCH v3] " Sebastian Andrzej Siewior
2023-05-31 19:00             ` Andrii Nakryiko
2023-06-05 16:37               ` Sebastian Andrzej Siewior
2023-06-05 22:47                 ` Andrii Nakryiko
2023-06-09 14:19                   ` Sebastian Andrzej Siewior
2023-06-14  8:34                   ` [PATCH v4] " Sebastian Andrzej Siewior
2023-06-15 16:43                     ` Paul E. McKenney
2023-06-15 19:13                       ` Sebastian Andrzej Siewior
2023-06-15 19:32                         ` Paul E. McKenney
2023-06-16 16:30                     ` patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).