[PATCH RFC bpf-next] bpf: defer bpf_link dealloc to after RCU grace period

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC bpf-next] bpf: defer bpf_link dealloc to after RCU grace period
@ 2024-03-25 21:51 Andrii Nakryiko
  2024-03-26  2:16 ` Alexei Starovoitov
  0 siblings, 1 reply; 4+ messages in thread
From: Andrii Nakryiko @ 2024-03-25 21:51 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau
  Cc: andrii, kernel-team, syzbot+981935d9485a560bfbcb,
	syzbot+2cb5a6c573e98db598cc, syzbot+62d8b26793e8a2bd0516

BPF link for some program types is passed as a "context" which can be
used by those BPF programs to look up additional information. E.g., for
BPF raw tracepoints, link is used to fetch BPF cookie value, similarly
for BPF multi-kprobes and multi-uprobes.

Because of this runtime dependency, when bpf_link refcnt drops to zero
that could be still active BPF programs running accessing link data
(cookie, program pointer, etc).

This patch accommodates this by delaying freeing memory to after RCU GP,
which will fix BPF raw tp, multi-kprobe, and non-sleepable multi-uprobe.

Perhaps a better approach would be to have a per-link flag specifying
desired behavior: no delay, RCU delay, or task_trace RCU delay? So
sending this as an RFC fix to discuss desired final solution.

Fixes: d4dfc5700e86 ("bpf: pass whole link instead of prog when triggering raw tracepoint")
Reported-by: syzbot+981935d9485a560bfbcb@syzkaller.appspotmail.com
Reported-by: syzbot+2cb5a6c573e98db598cc@syzkaller.appspotmail.com
Reported-by: syzbot+62d8b26793e8a2bd0516@syzkaller.appspotmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h  |  8 +++++++-
 kernel/bpf/syscall.c | 12 ++++++++++--
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 62762390c93d..d73a8978c800 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1573,7 +1573,13 @@ struct bpf_link {
 	enum bpf_link_type type;
 	const struct bpf_link_ops *ops;
 	struct bpf_prog *prog;
-	struct work_struct work;
+	/* rcu is used before freeing, work can be used to schedule that
+	 * RCU-based freeing before that, so they never overlap
+	 */
+	union {
+		struct rcu_head rcu;
+		struct work_struct work;
+	};
 };
 
 struct bpf_link_ops {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e44c276e8617..af1591af10bb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3024,6 +3024,14 @@ void bpf_link_inc(struct bpf_link *link)
 	atomic64_inc(&link->refcnt);
 }
 
+static void bpf_link_dealloc_deferred(struct rcu_head *rcu)
+{
+	struct bpf_link *link = container_of(rcu, struct bpf_link, rcu);
+
+	/* free bpf_link and its containing memory */
+	link->ops->dealloc(link);
+}
+
 /* bpf_link_free is guaranteed to be called from process context */
 static void bpf_link_free(struct bpf_link *link)
 {
@@ -3033,8 +3041,8 @@ static void bpf_link_free(struct bpf_link *link)
 		link->ops->release(link);
 		bpf_prog_put(link->prog);
 	}
-	/* free bpf_link and its containing memory */
-	link->ops->dealloc(link);
+	/* schedule BPF link deallocation after RCU grace period */
+	call_rcu(&link->rcu, bpf_link_dealloc_deferred);
 }
 
 static void bpf_link_put_deferred(struct work_struct *work)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC bpf-next] bpf: defer bpf_link dealloc to after RCU grace period
  2024-03-25 21:51 [PATCH RFC bpf-next] bpf: defer bpf_link dealloc to after RCU grace period Andrii Nakryiko
@ 2024-03-26  2:16 ` Alexei Starovoitov
  2024-03-26  2:59   ` Andrii Nakryiko
  0 siblings, 1 reply; 4+ messages in thread
From: Alexei Starovoitov @ 2024-03-26  2:16 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Kernel Team, syzbot, syzbot+2cb5a6c573e98db598cc,
	syzbot+62d8b26793e8a2bd0516

On Mon, Mar 25, 2024 at 2:51 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> BPF link for some program types is passed as a "context" which can be
> used by those BPF programs to look up additional information. E.g., for
> BPF raw tracepoints, link is used to fetch BPF cookie value, similarly
> for BPF multi-kprobes and multi-uprobes.
>
> Because of this runtime dependency, when bpf_link refcnt drops to zero
> that could be still active BPF programs running accessing link data
> (cookie, program pointer, etc).
>
> This patch accommodates this by delaying freeing memory to after RCU GP,
> which will fix BPF raw tp, multi-kprobe, and non-sleepable multi-uprobe.
>
> Perhaps a better approach would be to have a per-link flag specifying
> desired behavior: no delay, RCU delay, or task_trace RCU delay? So
> sending this as an RFC fix to discuss desired final solution.
>
> Fixes: d4dfc5700e86 ("bpf: pass whole link instead of prog when triggering raw tracepoint")
> Reported-by: syzbot+981935d9485a560bfbcb@syzkaller.appspotmail.com
> Reported-by: syzbot+2cb5a6c573e98db598cc@syzkaller.appspotmail.com
> Reported-by: syzbot+62d8b26793e8a2bd0516@syzkaller.appspotmail.com
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/bpf.h  |  8 +++++++-
>  kernel/bpf/syscall.c | 12 ++++++++++--
>  2 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 62762390c93d..d73a8978c800 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1573,7 +1573,13 @@ struct bpf_link {
>         enum bpf_link_type type;
>         const struct bpf_link_ops *ops;
>         struct bpf_prog *prog;
> -       struct work_struct work;
> +       /* rcu is used before freeing, work can be used to schedule that
> +        * RCU-based freeing before that, so they never overlap
> +        */
> +       union {
> +               struct rcu_head rcu;
> +               struct work_struct work;
> +       };
>  };
>
>  struct bpf_link_ops {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index e44c276e8617..af1591af10bb 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -3024,6 +3024,14 @@ void bpf_link_inc(struct bpf_link *link)
>         atomic64_inc(&link->refcnt);
>  }
>
> +static void bpf_link_dealloc_deferred(struct rcu_head *rcu)
> +{
> +       struct bpf_link *link = container_of(rcu, struct bpf_link, rcu);
> +
> +       /* free bpf_link and its containing memory */
> +       link->ops->dealloc(link);
> +}
> +
>  /* bpf_link_free is guaranteed to be called from process context */
>  static void bpf_link_free(struct bpf_link *link)
>  {
> @@ -3033,8 +3041,8 @@ static void bpf_link_free(struct bpf_link *link)
>                 link->ops->release(link);
>                 bpf_prog_put(link->prog);
>         }
> -       /* free bpf_link and its containing memory */
> -       link->ops->dealloc(link);
> +       /* schedule BPF link deallocation after RCU grace period */
> +       call_rcu(&link->rcu, bpf_link_dealloc_deferred);

Do we have any sleepable progs that access link as well?
If so then call_rcu_tasks_trace ?
and check rcu_trace_implies_rcu_gp to avoid extra call_rcu.

Doing that for all link types like networking that don't look
into link is not great. Probably needs to look into link type?
Or look into prog->type and do the extra work only for raw_tp,
multi-[ku]probe ?

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC bpf-next] bpf: defer bpf_link dealloc to after RCU grace period
  2024-03-26  2:16 ` Alexei Starovoitov
@ 2024-03-26  2:59   ` Andrii Nakryiko
  2024-03-26  3:15     ` Alexei Starovoitov
  0 siblings, 1 reply; 4+ messages in thread
From: Andrii Nakryiko @ 2024-03-26  2:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Kernel Team, syzbot,
	syzbot+2cb5a6c573e98db598cc, syzbot+62d8b26793e8a2bd0516

On Mon, Mar 25, 2024 at 7:16 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Mar 25, 2024 at 2:51 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > BPF link for some program types is passed as a "context" which can be
> > used by those BPF programs to look up additional information. E.g., for
> > BPF raw tracepoints, link is used to fetch BPF cookie value, similarly
> > for BPF multi-kprobes and multi-uprobes.
> >
> > Because of this runtime dependency, when bpf_link refcnt drops to zero
> > that could be still active BPF programs running accessing link data
> > (cookie, program pointer, etc).
> >
> > This patch accommodates this by delaying freeing memory to after RCU GP,
> > which will fix BPF raw tp, multi-kprobe, and non-sleepable multi-uprobe.
> >
> > Perhaps a better approach would be to have a per-link flag specifying
> > desired behavior: no delay, RCU delay, or task_trace RCU delay? So
> > sending this as an RFC fix to discuss desired final solution.
> >
> > Fixes: d4dfc5700e86 ("bpf: pass whole link instead of prog when triggering raw tracepoint")
> > Reported-by: syzbot+981935d9485a560bfbcb@syzkaller.appspotmail.com
> > Reported-by: syzbot+2cb5a6c573e98db598cc@syzkaller.appspotmail.com
> > Reported-by: syzbot+62d8b26793e8a2bd0516@syzkaller.appspotmail.com
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/bpf.h  |  8 +++++++-
> >  kernel/bpf/syscall.c | 12 ++++++++++--
> >  2 files changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 62762390c93d..d73a8978c800 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1573,7 +1573,13 @@ struct bpf_link {
> >         enum bpf_link_type type;
> >         const struct bpf_link_ops *ops;
> >         struct bpf_prog *prog;
> > -       struct work_struct work;
> > +       /* rcu is used before freeing, work can be used to schedule that
> > +        * RCU-based freeing before that, so they never overlap
> > +        */
> > +       union {
> > +               struct rcu_head rcu;
> > +               struct work_struct work;
> > +       };
> >  };
> >
> >  struct bpf_link_ops {
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index e44c276e8617..af1591af10bb 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -3024,6 +3024,14 @@ void bpf_link_inc(struct bpf_link *link)
> >         atomic64_inc(&link->refcnt);
> >  }
> >
> > +static void bpf_link_dealloc_deferred(struct rcu_head *rcu)
> > +{
> > +       struct bpf_link *link = container_of(rcu, struct bpf_link, rcu);
> > +
> > +       /* free bpf_link and its containing memory */
> > +       link->ops->dealloc(link);
> > +}
> > +
> >  /* bpf_link_free is guaranteed to be called from process context */
> >  static void bpf_link_free(struct bpf_link *link)
> >  {
> > @@ -3033,8 +3041,8 @@ static void bpf_link_free(struct bpf_link *link)
> >                 link->ops->release(link);
> >                 bpf_prog_put(link->prog);
> >         }
> > -       /* free bpf_link and its containing memory */
> > -       link->ops->dealloc(link);
> > +       /* schedule BPF link deallocation after RCU grace period */
> > +       call_rcu(&link->rcu, bpf_link_dealloc_deferred);
>
> Do we have any sleepable progs that access link as well?
> If so then call_rcu_tasks_trace ?

yep, multi-uprobe, can be sleepable

> and check rcu_trace_implies_rcu_gp to avoid extra call_rcu.

yep, saw that for maps

>
> Doing that for all link types like networking that don't look
> into link is not great. Probably needs to look into link type?
> Or look into prog->type and do the extra work only for raw_tp,
> multi-[ku]probe ?

so I was thinking that bpf_link_init() can accept an enum or flags to
specify what needs to be done, instead of trying to guess/derive that.
That seems less error-prone, more explicit, easier to audit. WDYT?

This should work ok given this is normally dependent on program type,
the only exception (that we currently don't have) would be if
LINK_UPDATE command replaces non-sleepable BPF program (say uprobe)
with sleepable one. But we don't support that for kprobe/uprobe and
other tracing programs, so it's a future problem (but we'll be able to
support it, when we get to this).

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC bpf-next] bpf: defer bpf_link dealloc to after RCU grace period
  2024-03-26  2:59   ` Andrii Nakryiko
@ 2024-03-26  3:15     ` Alexei Starovoitov
  0 siblings, 0 replies; 4+ messages in thread
From: Alexei Starovoitov @ 2024-03-26  3:15 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Kernel Team, syzbot,
	syzbot+2cb5a6c573e98db598cc, syzbot+62d8b26793e8a2bd0516

On Mon, Mar 25, 2024 at 7:59 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Mar 25, 2024 at 7:16 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Mar 25, 2024 at 2:51 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > BPF link for some program types is passed as a "context" which can be
> > > used by those BPF programs to look up additional information. E.g., for
> > > BPF raw tracepoints, link is used to fetch BPF cookie value, similarly
> > > for BPF multi-kprobes and multi-uprobes.
> > >
> > > Because of this runtime dependency, when bpf_link refcnt drops to zero
> > > that could be still active BPF programs running accessing link data
> > > (cookie, program pointer, etc).
> > >
> > > This patch accommodates this by delaying freeing memory to after RCU GP,
> > > which will fix BPF raw tp, multi-kprobe, and non-sleepable multi-uprobe.
> > >
> > > Perhaps a better approach would be to have a per-link flag specifying
> > > desired behavior: no delay, RCU delay, or task_trace RCU delay? So
> > > sending this as an RFC fix to discuss desired final solution.
> > >
> > > Fixes: d4dfc5700e86 ("bpf: pass whole link instead of prog when triggering raw tracepoint")
> > > Reported-by: syzbot+981935d9485a560bfbcb@syzkaller.appspotmail.com
> > > Reported-by: syzbot+2cb5a6c573e98db598cc@syzkaller.appspotmail.com
> > > Reported-by: syzbot+62d8b26793e8a2bd0516@syzkaller.appspotmail.com
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  include/linux/bpf.h  |  8 +++++++-
> > >  kernel/bpf/syscall.c | 12 ++++++++++--
> > >  2 files changed, 17 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 62762390c93d..d73a8978c800 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1573,7 +1573,13 @@ struct bpf_link {
> > >         enum bpf_link_type type;
> > >         const struct bpf_link_ops *ops;
> > >         struct bpf_prog *prog;
> > > -       struct work_struct work;
> > > +       /* rcu is used before freeing, work can be used to schedule that
> > > +        * RCU-based freeing before that, so they never overlap
> > > +        */
> > > +       union {
> > > +               struct rcu_head rcu;
> > > +               struct work_struct work;
> > > +       };
> > >  };
> > >
> > >  struct bpf_link_ops {
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index e44c276e8617..af1591af10bb 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -3024,6 +3024,14 @@ void bpf_link_inc(struct bpf_link *link)
> > >         atomic64_inc(&link->refcnt);
> > >  }
> > >
> > > +static void bpf_link_dealloc_deferred(struct rcu_head *rcu)
> > > +{
> > > +       struct bpf_link *link = container_of(rcu, struct bpf_link, rcu);
> > > +
> > > +       /* free bpf_link and its containing memory */
> > > +       link->ops->dealloc(link);
> > > +}
> > > +
> > >  /* bpf_link_free is guaranteed to be called from process context */
> > >  static void bpf_link_free(struct bpf_link *link)
> > >  {
> > > @@ -3033,8 +3041,8 @@ static void bpf_link_free(struct bpf_link *link)
> > >                 link->ops->release(link);
> > >                 bpf_prog_put(link->prog);
> > >         }
> > > -       /* free bpf_link and its containing memory */
> > > -       link->ops->dealloc(link);
> > > +       /* schedule BPF link deallocation after RCU grace period */
> > > +       call_rcu(&link->rcu, bpf_link_dealloc_deferred);
> >
> > Do we have any sleepable progs that access link as well?
> > If so then call_rcu_tasks_trace ?
>
> yep, multi-uprobe, can be sleepable
>
> > and check rcu_trace_implies_rcu_gp to avoid extra call_rcu.
>
> yep, saw that for maps
>
> >
> > Doing that for all link types like networking that don't look
> > into link is not great. Probably needs to look into link type?
> > Or look into prog->type and do the extra work only for raw_tp,
> > multi-[ku]probe ?
>
> so I was thinking that bpf_link_init() can accept an enum or flags to
> specify what needs to be done, instead of trying to guess/derive that.
> That seems less error-prone, more explicit, easier to audit. WDYT?

yeah. makes sense.

> This should work ok given this is normally dependent on program type,
> the only exception (that we currently don't have) would be if
> LINK_UPDATE command replaces non-sleepable BPF program (say uprobe)
> with sleepable one. But we don't support that for kprobe/uprobe and
> other tracing programs, so it's a future problem (but we'll be able to
> support it, when we get to this).

that would be weird. future problem indeed.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-03-26  3:15 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-25 21:51 [PATCH RFC bpf-next] bpf: defer bpf_link dealloc to after RCU grace period Andrii Nakryiko
2024-03-26  2:16 ` Alexei Starovoitov
2024-03-26  2:59   ` Andrii Nakryiko
2024-03-26  3:15     ` Alexei Starovoitov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.