linux-kernel.vger.kernel.org archive mirror
* [RFC] Add BPF_SYNCHRONIZE bpf(2) command
@ 2018-07-07  1:56 Daniel Colascione
  2018-07-07  2:54 ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Colascione @ 2018-07-07  1:56 UTC (permalink / raw)
  To: joelaf, ast; +Cc: linux-kernel, timmurray, Daniel Colascione

BPF_SYNCHRONIZE waits for any BPF programs active at the time of
BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
RCU data structure operations with respect to active programs. For
example, userspace can update a map->map entry to point to a new map,
use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
complete, and then drain the old map without fear that BPF programs
may still be updating it.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 include/uapi/linux/bpf.h |  1 +
 kernel/bpf/syscall.c     | 14 ++++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b7db3261c62d..4365c50e8055 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -98,6 +98,7 @@ enum bpf_cmd {
 	BPF_BTF_LOAD,
 	BPF_BTF_GET_FD_BY_ID,
 	BPF_TASK_FD_QUERY,
+	BPF_SYNCHRONIZE,
 };
 
 enum bpf_map_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d10ecd78105f..60ec7811846e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	if (cmd == BPF_SYNCHRONIZE) {
+		if (uattr != NULL || size != 0)
+			return -EINVAL;
+		err = security_bpf(cmd, NULL, 0);
+		if (err < 0)
+			return err;
+		/* BPF programs are run with preempt disabled, so
+		 * synchronize_sched is sufficient even with
+		 * RCU_PREEMPT.
+		 */
+		synchronize_sched();
+		return 0;
+	}
+
 	err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
 	if (err)
 		return err;
-- 
2.18.0.203.gfac676dfb9-goog


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-07  1:56 [RFC] Add BPF_SYNCHRONIZE bpf(2) command Daniel Colascione
@ 2018-07-07  2:54 ` Alexei Starovoitov
  2018-07-07  3:22   ` Daniel Colascione
  2018-07-07 20:33   ` Joel Fernandes
  0 siblings, 2 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  2:54 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: joelaf, ast, linux-kernel, timmurray, daniel, netdev

On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> RCU data structure operations with respect to active programs. For
> example, userspace can update a map->map entry to point to a new map,
> use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> complete, and then drain the old map without fear that BPF programs
> may still be updating it.
> 
> Signed-off-by: Daniel Colascione <dancol@google.com>
> ---
>  include/uapi/linux/bpf.h |  1 +
>  kernel/bpf/syscall.c     | 14 ++++++++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b7db3261c62d..4365c50e8055 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -98,6 +98,7 @@ enum bpf_cmd {
>  	BPF_BTF_LOAD,
>  	BPF_BTF_GET_FD_BY_ID,
>  	BPF_TASK_FD_QUERY,
> +	BPF_SYNCHRONIZE,
>  };
>  
>  enum bpf_map_type {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d10ecd78105f..60ec7811846e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>  		return -EPERM;
>  
> +	if (cmd == BPF_SYNCHRONIZE) {
> +		if (uattr != NULL || size != 0)
> +			return -EINVAL;
> +		err = security_bpf(cmd, NULL, 0);
> +		if (err < 0)
> +			return err;
> +		/* BPF programs are run with preempt disabled, so
> +		 * synchronize_sched is sufficient even with
> +		 * RCU_PREEMPT.
> +		 */
> +		synchronize_sched();
> +		return 0;

I don't think it's necessary. sys_membarrier() can do this already
and some folks use it exactly for this use case.



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-07  2:54 ` Alexei Starovoitov
@ 2018-07-07  3:22   ` Daniel Colascione
  2018-07-07 20:33   ` Joel Fernandes
  1 sibling, 0 replies; 30+ messages in thread
From: Daniel Colascione @ 2018-07-07  3:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joel Fernandes, ast, linux-kernel, Tim Murray, daniel, netdev

On Fri, Jul 6, 2018 at 7:54 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
>> BPF_SYNCHRONIZE waits for any BPF programs active at the time of
>> BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
>> RCU data structure operations with respect to active programs. For
>> example, userspace can update a map->map entry to point to a new map,
>> use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>> complete, and then drain the old map without fear that BPF programs
>> may still be updating it.
>>
>> Signed-off-by: Daniel Colascione <dancol@google.com>
>> ---
>>  include/uapi/linux/bpf.h |  1 +
>>  kernel/bpf/syscall.c     | 14 ++++++++++++++
>>  2 files changed, 15 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index b7db3261c62d..4365c50e8055 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -98,6 +98,7 @@ enum bpf_cmd {
>>       BPF_BTF_LOAD,
>>       BPF_BTF_GET_FD_BY_ID,
>>       BPF_TASK_FD_QUERY,
>> +     BPF_SYNCHRONIZE,
>>  };
>>
>>  enum bpf_map_type {
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index d10ecd78105f..60ec7811846e 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>>       if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>>               return -EPERM;
>>
>> +     if (cmd == BPF_SYNCHRONIZE) {
>> +             if (uattr != NULL || size != 0)
>> +                     return -EINVAL;
>> +             err = security_bpf(cmd, NULL, 0);
>> +             if (err < 0)
>> +                     return err;
>> +             /* BPF programs are run with preempt disabled, so
>> +              * synchronize_sched is sufficient even with
>> +              * RCU_PREEMPT.
>> +              */
>> +             synchronize_sched();
>> +             return 0;
>
> I don't think it's necessary. sys_membarrier() can do this already
> and some folks use it exactly for this use case.

True --- implementation-wise, today. The point of the patch is to sort
out contractual guarantees. membarrier isn't documented to have the
BPF_SYNCHRONIZE effect right now, although it does. If we want to
extend membarrier's contract, great, but that might make it more
expensive than necessary for callers who, well, just want a memory
barrier. It's worthwhile to have a dedicated BPF command for this
task, especially considering the simplicity of the implementation.


* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-07  2:54 ` Alexei Starovoitov
  2018-07-07  3:22   ` Daniel Colascione
@ 2018-07-07 20:33   ` Joel Fernandes
  2018-07-08 20:54     ` Mathieu Desnoyers
  1 sibling, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2018-07-07 20:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Colascione, ast, linux-kernel, timmurray, daniel, netdev,
	mathieu.desnoyers, fengc

On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > RCU data structure operations with respect to active programs. For
> > example, userspace can update a map->map entry to point to a new map,
> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > complete, and then drain the old map without fear that BPF programs
> > may still be updating it.
> > 
> > Signed-off-by: Daniel Colascione <dancol@google.com>
> > ---
> >  include/uapi/linux/bpf.h |  1 +
> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> >  2 files changed, 15 insertions(+)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index b7db3261c62d..4365c50e8055 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >  	BPF_BTF_LOAD,
> >  	BPF_BTF_GET_FD_BY_ID,
> >  	BPF_TASK_FD_QUERY,
> > +	BPF_SYNCHRONIZE,
> >  };
> >  
> >  enum bpf_map_type {
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index d10ecd78105f..60ec7811846e 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >  		return -EPERM;
> >  
> > +	if (cmd == BPF_SYNCHRONIZE) {
> > +		if (uattr != NULL || size != 0)
> > +			return -EINVAL;
> > +		err = security_bpf(cmd, NULL, 0);
> > +		if (err < 0)
> > +			return err;
> > +		/* BPF programs are run with preempt disabled, so
> > +		 * synchronize_sched is sufficient even with
> > +		 * RCU_PREEMPT.
> > +		 */
> > +		synchronize_sched();
> > +		return 0;
> 
> I don't think it's necessary. sys_membarrier() can do this already
> and some folks use it exactly for this use case.

Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
though. Nowhere does the manpage say membarrier should be implemented this
way, so what happens if the implementation changes?

Further, the membarrier manpage says that a memory barrier should be matched
with a matching barrier. In this use case there is no matching barrier, which
makes it weirder.

Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
it's a bit fragile to depend on it for this?

        case MEMBARRIER_CMD_GLOBAL:
                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
                if (tick_nohz_full_enabled())
                        return -EINVAL;
                if (num_online_cpus() > 1)
                        synchronize_sched();
                return 0;


Adding Mathieu as well who I believe is author/maintainer of membarrier.

thanks!

- Joel



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-07 20:33   ` Joel Fernandes
@ 2018-07-08 20:54     ` Mathieu Desnoyers
  2018-07-09 21:09       ` Alexei Starovoitov
  2018-07-10  5:13       ` [RFC] Add BPF_SYNCHRONIZE " Joel Fernandes
  0 siblings, 2 replies; 30+ messages in thread
From: Mathieu Desnoyers @ 2018-07-08 20:54 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Alexei Starovoitov, Daniel Colascione, Alexei Starovoitov,
	linux-kernel, Tim Murray, Daniel Borkmann, netdev, fengc

----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:

> On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
>> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
>> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
>> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
>> > RCU data structure operations with respect to active programs. For
>> > example, userspace can update a map->map entry to point to a new map,
>> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>> > complete, and then drain the old map without fear that BPF programs
>> > may still be updating it.
>> > 
>> > Signed-off-by: Daniel Colascione <dancol@google.com>
>> > ---
>> >  include/uapi/linux/bpf.h |  1 +
>> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
>> >  2 files changed, 15 insertions(+)
>> > 
>> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> > index b7db3261c62d..4365c50e8055 100644
>> > --- a/include/uapi/linux/bpf.h
>> > +++ b/include/uapi/linux/bpf.h
>> > @@ -98,6 +98,7 @@ enum bpf_cmd {
>> >  	BPF_BTF_LOAD,
>> >  	BPF_BTF_GET_FD_BY_ID,
>> >  	BPF_TASK_FD_QUERY,
>> > +	BPF_SYNCHRONIZE,
>> >  };
>> >  
>> >  enum bpf_map_type {
>> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> > index d10ecd78105f..60ec7811846e 100644
>> > --- a/kernel/bpf/syscall.c
>> > +++ b/kernel/bpf/syscall.c
>> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
>> > uattr, unsigned int, siz
>> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>> >  		return -EPERM;
>> >  
>> > +	if (cmd == BPF_SYNCHRONIZE) {
>> > +		if (uattr != NULL || size != 0)
>> > +			return -EINVAL;
>> > +		err = security_bpf(cmd, NULL, 0);
>> > +		if (err < 0)
>> > +			return err;
>> > +		/* BPF programs are run with preempt disabled, so
>> > +		 * synchronize_sched is sufficient even with
>> > +		 * RCU_PREEMPT.
>> > +		 */
>> > +		synchronize_sched();
>> > +		return 0;
>> 
>> I don't think it's necessary. sys_membarrier() can do this already
>> and some folks use it exactly for this use case.
> 
> Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> though. Nowhere does the manpage say membarrier should be implemented this
> way, so what happens if the implementation changes?
> 
> Further, the membarrier manpage says that a memory barrier should be matched
> with a matching barrier. In this use case there is no matching barrier, which
> makes it weirder.
> 
> Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
> it's a bit fragile to depend on it for this?
> 
>        case MEMBARRIER_CMD_GLOBAL:
>                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
>                if (tick_nohz_full_enabled())
>                        return -EINVAL;
>                if (num_online_cpus() > 1)
>                        synchronize_sched();
>                return 0;
> 
> 
> Adding Mathieu as well who I believe is author/maintainer of membarrier.

See commit 907565337
"Fix: Disable sys_membarrier when nohz_full is enabled"

"Userspace applications should be allowed to expect the membarrier system
call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
nohz_full CPUs, but synchronize_sched() does not take those into
account."

So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
only care about kernel preempt off critical sections.

Clearly bpf code does not run in user-space, so it would "work".

But the guarantees provided by membarrier are not to synchronize against
preempt off per se. It's just that the current implementation happens to
do that. The point of membarrier is to turn user-space memory barriers
into compiler barriers.

If what you need is to wait for an RCU grace period for whatever RCU flavor
ebpf is using, I would advise against using membarrier for this. I would
rather recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
implementation details to user-space, *and* you can eventually change your
RCU implementation to e.g. SRCU in the future if needed.

Thanks,

Mathieu

> 
> thanks!
> 
> - Joel

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-08 20:54     ` Mathieu Desnoyers
@ 2018-07-09 21:09       ` Alexei Starovoitov
  2018-07-09 21:35         ` Mathieu Desnoyers
  2018-07-09 21:36         ` Daniel Colascione
  2018-07-10  5:13       ` [RFC] Add BPF_SYNCHRONIZE " Joel Fernandes
  1 sibling, 2 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2018-07-09 21:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Joel Fernandes, Daniel Colascione, Alexei Starovoitov,
	linux-kernel, Tim Murray, Daniel Borkmann, netdev, fengc

On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> 
> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> > RCU data structure operations with respect to active programs. For
> >> > example, userspace can update a map->map entry to point to a new map,
> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> > complete, and then drain the old map without fear that BPF programs
> >> > may still be updating it.
> >> > 
> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> >> > ---
> >> >  include/uapi/linux/bpf.h |  1 +
> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> >> >  2 files changed, 15 insertions(+)
> >> > 
> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> > index b7db3261c62d..4365c50e8055 100644
> >> > --- a/include/uapi/linux/bpf.h
> >> > +++ b/include/uapi/linux/bpf.h
> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> >  	BPF_BTF_LOAD,
> >> >  	BPF_BTF_GET_FD_BY_ID,
> >> >  	BPF_TASK_FD_QUERY,
> >> > +	BPF_SYNCHRONIZE,
> >> >  };
> >> >  
> >> >  enum bpf_map_type {
> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> > index d10ecd78105f..60ec7811846e 100644
> >> > --- a/kernel/bpf/syscall.c
> >> > +++ b/kernel/bpf/syscall.c
> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> > uattr, unsigned int, siz
> >> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> >  		return -EPERM;
> >> >  
> >> > +	if (cmd == BPF_SYNCHRONIZE) {
> >> > +		if (uattr != NULL || size != 0)
> >> > +			return -EINVAL;
> >> > +		err = security_bpf(cmd, NULL, 0);
> >> > +		if (err < 0)
> >> > +			return err;
> >> > +		/* BPF programs are run with preempt disabled, so
> >> > +		 * synchronize_sched is sufficient even with
> >> > +		 * RCU_PREEMPT.
> >> > +		 */
> >> > +		synchronize_sched();
> >> > +		return 0;
> >> 
> >> I don't think it's necessary. sys_membarrier() can do this already
> >> and some folks use it exactly for this use case.
> > 
> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > though. Nowhere does the manpage say membarrier should be implemented this
> > way, so what happens if the implementation changes?
> > 
> > Further, the membarrier manpage says that a memory barrier should be matched
> > with a matching barrier. In this use case there is no matching barrier, which
> > makes it weirder.
> > 
> > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
> > it's a bit fragile to depend on it for this?
> > 
> >        case MEMBARRIER_CMD_GLOBAL:
> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> >                if (tick_nohz_full_enabled())
> >                        return -EINVAL;
> >                if (num_online_cpus() > 1)
> >                        synchronize_sched();
> >                return 0;
> > 
> > 
> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> 
> See commit 907565337
> "Fix: Disable sys_membarrier when nohz_full is enabled"
> 
> "Userspace applications should be allowed to expect the membarrier system
> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> nohz_full CPUs, but synchronize_sched() does not take those into
> account."
> 
> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> only care about kernel preempt off critical sections.
> 
> Clearly bpf code does not run in user-space, so it would "work".
> 
> But the guarantees provided by membarrier are not to synchronize against
> preempt off per se. It's just that the current implementation happens to
> do that. The point of membarrier is to turn user-space memory barriers
> into compiler barriers.
> 
> If what you need is to wait for an RCU grace period for whatever RCU flavor
> ebpf is using, I would advise against using membarrier for this. I would
> rather recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> implementation details to user-space, *and* you can eventually change your
> RCU implementation to e.g. SRCU in the future if needed.

The point about future changes to underlying bpf mechanisms is valid.
There is work already on the way to reduce the scope of preempt_off+rcu_lock
that currently lasts the whole prog. We will have new prog types that won't
have such wrappers and will do rcu_lock/unlock and preempt on/off only
when necessary.
So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot
guarantee when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
also won't make sense for the same reason.
What we can do instead is define a synchronization barrier for
programs accessing maps. Maybe call it something like
BPF_SYNC_MAP_ACCESS?
uapi/bpf.h would need an extensive comment on what this barrier is doing.
The implementation should probably call synchronize_rcu() and not play games
with synchronize_sched(), since that goes too deep into implementation
details.
Also, should such a sys_bpf command be root only?
I'm not sure whether a DoS attack can be mounted by spamming synchronize_rcu()
(and synchronize_sched(), for that matter).
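
In rough terms, the barrier Alexei sketches here could look like the
following in the syscall dispatch path (a hypothetical adaptation of the
original patch, not code posted in this thread; the command name and the
switch to synchronize_rcu() follow the suggestion above):

```c
	/* Hypothetical BPF_SYNC_MAP_ACCESS: same shape as the
	 * BPF_SYNCHRONIZE hunk, but documented as a map-access
	 * barrier rather than a program-completion barrier. */
	if (cmd == BPF_SYNC_MAP_ACCESS) {
		if (uattr != NULL || size != 0)
			return -EINVAL;
		err = security_bpf(cmd, NULL, 0);
		if (err < 0)
			return err;
		/* Wait a full RCU grace period: any program that could
		 * still see the old inner map has finished by the time
		 * this returns, regardless of how long the prog's
		 * rcu_read_lock/preempt-off sections end up being. */
		synchronize_rcu();
		return 0;
	}
```

Whether such a command should be root-only, as Alexei asks below the
naming question, is left open in this sketch.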



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-09 21:09       ` Alexei Starovoitov
@ 2018-07-09 21:35         ` Mathieu Desnoyers
  2018-07-09 22:19           ` Paul E. McKenney
  2018-07-09 21:36         ` Daniel Colascione
  1 sibling, 1 reply; 30+ messages in thread
From: Mathieu Desnoyers @ 2018-07-09 21:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joel Fernandes, Daniel Colascione, Alexei Starovoitov,
	linux-kernel, Tim Murray, Daniel Borkmann, netdev, fengc,
	Paul E. McKenney



----- On Jul 9, 2018, at 5:09 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:

> On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
>> 
>> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
>> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
>> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
>> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
>> >> > RCU data structure operations with respect to active programs. For
>> >> > example, userspace can update a map->map entry to point to a new map,
>> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>> >> > complete, and then drain the old map without fear that BPF programs
>> >> > may still be updating it.
>> >> > 
>> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
>> >> > ---
>> >> >  include/uapi/linux/bpf.h |  1 +
>> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
>> >> >  2 files changed, 15 insertions(+)
>> >> > 
>> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> >> > index b7db3261c62d..4365c50e8055 100644
>> >> > --- a/include/uapi/linux/bpf.h
>> >> > +++ b/include/uapi/linux/bpf.h
>> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
>> >> >  	BPF_BTF_LOAD,
>> >> >  	BPF_BTF_GET_FD_BY_ID,
>> >> >  	BPF_TASK_FD_QUERY,
>> >> > +	BPF_SYNCHRONIZE,
>> >> >  };
>> >> >  
>> >> >  enum bpf_map_type {
>> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> >> > index d10ecd78105f..60ec7811846e 100644
>> >> > --- a/kernel/bpf/syscall.c
>> >> > +++ b/kernel/bpf/syscall.c
>> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
>> >> > uattr, unsigned int, siz
>> >> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>> >> >  		return -EPERM;
>> >> >  
>> >> > +	if (cmd == BPF_SYNCHRONIZE) {
>> >> > +		if (uattr != NULL || size != 0)
>> >> > +			return -EINVAL;
>> >> > +		err = security_bpf(cmd, NULL, 0);
>> >> > +		if (err < 0)
>> >> > +			return err;
>> >> > +		/* BPF programs are run with preempt disabled, so
>> >> > +		 * synchronize_sched is sufficient even with
>> >> > +		 * RCU_PREEMPT.
>> >> > +		 */
>> >> > +		synchronize_sched();
>> >> > +		return 0;
>> >> 
>> >> I don't think it's necessary. sys_membarrier() can do this already
>> >> and some folks use it exactly for this use case.
>> > 
>> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
>> > though. Nowhere does the manpage say membarrier should be implemented this
>> > way, so what happens if the implementation changes?
>> > 
>> > Further, the membarrier manpage says that a memory barrier should be matched
>> > with a matching barrier. In this use case there is no matching barrier, which
>> > makes it weirder.
>> > 
>> > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
>> > it's a bit fragile to depend on it for this?
>> > 
>> >        case MEMBARRIER_CMD_GLOBAL:
>> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
>> >                if (tick_nohz_full_enabled())
>> >                        return -EINVAL;
>> >                if (num_online_cpus() > 1)
>> >                        synchronize_sched();
>> >                return 0;
>> > 
>> > 
>> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
>> 
>> See commit 907565337
>> "Fix: Disable sys_membarrier when nohz_full is enabled"
>> 
>> "Userspace applications should be allowed to expect the membarrier system
>> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
>> nohz_full CPUs, but synchronize_sched() does not take those into
>> account."
>> 
>> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
>> only care about kernel preempt off critical sections.
>> 
>> Clearly bpf code does not run in user-space, so it would "work".
>> 
>> But the guarantees provided by membarrier are not to synchronize against
>> preempt off per se. It's just that the current implementation happens to
>> do that. The point of membarrier is to turn user-space memory barriers
>> into compiler barriers.
>> 
>> If what you need is to wait for an RCU grace period for whatever RCU flavor
>> ebpf is using, I would advise against using membarrier for this. I would
>> rather recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
>> implementation details to user-space, *and* you can eventually change your
>> RCU implementation to e.g. SRCU in the future if needed.
> 
> The point about future changes to underlying bpf mechanisms is valid.
> There is work already on the way to reduce the scope of preempt_off+rcu_lock
> that currently lasts the whole prog. We will have new prog types that won't
> have such wrappers and will do rcu_lock/unlock and preempt on/off only
> when necessary.
> So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> also won't make sense for the same reason.
> What we can do it instead is to define synchronization barrier for
> programs accessing maps. May be call it something like:
> BPF_SYNC_MAP_ACCESS ?
> uapi/bpf.h would need to have extensive comment what this barrier is doing.
> Implementation should probably call synchronize_rcu() and not play games
> with synchronize_sched(), since that's going too much into implementation.
> Also should such sys_bpf command be root only?
> I'm not sure whether dos attack can be made by spamming synchronize_rcu()
> and synchronize_sched() for that matter.

Adding Paul E. McKenney in CC. He may want to share his thoughts on the matter.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-09 21:09       ` Alexei Starovoitov
  2018-07-09 21:35         ` Mathieu Desnoyers
@ 2018-07-09 21:36         ` Daniel Colascione
  2018-07-09 22:10           ` Alexei Starovoitov
  1 sibling, 1 reply; 30+ messages in thread
From: Daniel Colascione @ 2018-07-09 21:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, Joel Fernandes, Alexei Starovoitov,
	linux-kernel, Tim Murray, Daniel Borkmann, netdev, Chenbo Feng

On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
>>
>> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
>> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
>> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
>> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
>> >> > RCU data structure operations with respect to active programs. For
>> >> > example, userspace can update a map->map entry to point to a new map,
>> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>> >> > complete, and then drain the old map without fear that BPF programs
>> >> > may still be updating it.
>> >> >
>> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
>> >> > ---
>> >> >  include/uapi/linux/bpf.h |  1 +
>> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
>> >> >  2 files changed, 15 insertions(+)
>> >> >
>> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> >> > index b7db3261c62d..4365c50e8055 100644
>> >> > --- a/include/uapi/linux/bpf.h
>> >> > +++ b/include/uapi/linux/bpf.h
>> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
>> >> >          BPF_BTF_LOAD,
>> >> >          BPF_BTF_GET_FD_BY_ID,
>> >> >          BPF_TASK_FD_QUERY,
>> >> > +        BPF_SYNCHRONIZE,
>> >> >  };
>> >> >
>> >> >  enum bpf_map_type {
>> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> >> > index d10ecd78105f..60ec7811846e 100644
>> >> > --- a/kernel/bpf/syscall.c
>> >> > +++ b/kernel/bpf/syscall.c
>> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
>> >> > uattr, unsigned int, siz
>> >> >          if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>> >> >                  return -EPERM;
>> >> >
>> >> > +        if (cmd == BPF_SYNCHRONIZE) {
>> >> > +                if (uattr != NULL || size != 0)
>> >> > +                        return -EINVAL;
>> >> > +                err = security_bpf(cmd, NULL, 0);
>> >> > +                if (err < 0)
>> >> > +                        return err;
>> >> > +                /* BPF programs are run with preempt disabled, so
>> >> > +                 * synchronize_sched is sufficient even with
>> >> > +                 * RCU_PREEMPT.
>> >> > +                 */
>> >> > +                synchronize_sched();
>> >> > +                return 0;
>> >>
>> >> I don't think it's necessary. sys_membarrier() can do this already
>> >> and some folks use it exactly for this use case.
>> >
>> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
>> > though. Nowhere does the manpage say membarrier should be implemented this
>> > way, so what happens if the implementation changes?
>> >
>> > Further, the membarrier manpage says that a memory barrier should be matched
>> > with a matching barrier. In this use case there is no matching barrier, which
>> > makes it weirder.
>> >
>> > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
>> > it's a bit fragile to depend on it for this?
>> >
>> >        case MEMBARRIER_CMD_GLOBAL:
>> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
>> >                if (tick_nohz_full_enabled())
>> >                        return -EINVAL;
>> >                if (num_online_cpus() > 1)
>> >                        synchronize_sched();
>> >                return 0;
>> >
>> >
>> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
>>
>> See commit 907565337
>> "Fix: Disable sys_membarrier when nohz_full is enabled"
>>
>> "Userspace applications should be allowed to expect the membarrier system
>> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
>> nohz_full CPUs, but synchronize_sched() does not take those into
>> account."
>>
>> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
>> only care about kernel preempt off critical sections.
>>
>> Clearly bpf code does not run in user-space, so it would "work".
>>
>> But the guarantees provided by membarrier are not to synchronize against
>> preempt off per se. It's just that the current implementation happens to
>> do that. The point of membarrier is to turn user-space memory barriers
>> into compiler barriers.
>>
>> If what you need is to wait for an RCU grace period for whatever RCU flavor
>> ebpf is using, I would be against using membarrier for this. I would rather
>> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
>> implementation details to user-space, *and* you can eventually change your
>> RCU implementation to e.g. SRCU in the future if needed.
>
> The point about future changes to underlying bpf mechanisms is valid.
> There is work already under way to reduce the scope of preempt_off+rcu_lock
> that currently lasts the whole prog. We will have new prog types that won't
> have such wrappers and will do rcu_lock/unlock and preempt on/off only
> when necessary.
> So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot
> guarantee when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> also won't make sense, for the same reason.
> What we can do instead is to define a synchronization barrier for
> programs accessing maps. Maybe call it something like
> BPF_SYNC_MAP_ACCESS?

I'm not sure what you're proposing. In the case the commit message
describes, a user-space program that wants to "drain" a map needs to
be confident that the map won't change under it, even across multiple
bpf system calls on that map. One way of doing that is to ensure that
nothing that could possibly hold a reference to that map is still
running. Are you proposing some kind of refcount-draining approach?
Simple locking won't work, since BPF programs can't block, and I don't
see right now how a simple barrier would help.

> Also, should such a sys_bpf command be root only?
> I'm not sure whether a DoS attack can be made by spamming synchronize_rcu(),
> and synchronize_sched() for that matter.

membarrier being globally available seems not to have caused any kind
of DoS catastrophe.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-09 21:36         ` Daniel Colascione
@ 2018-07-09 22:10           ` Alexei Starovoitov
  2018-07-09 22:21             ` Daniel Colascione
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2018-07-09 22:10 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Mathieu Desnoyers, Joel Fernandes, Alexei Starovoitov,
	linux-kernel, Tim Murray, Daniel Borkmann, netdev, Chenbo Feng

On Mon, Jul 09, 2018 at 02:36:37PM -0700, Daniel Colascione wrote:
> On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> >>
> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> >> > RCU data structure operations with respect to active programs. For
> >> >> > example, userspace can update a map->map entry to point to a new map,
> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> >> > complete, and then drain the old map without fear that BPF programs
> >> >> > may still be updating it.
> >> >> >
> >> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> >> >> > ---
> >> >> >  include/uapi/linux/bpf.h |  1 +
> >> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> >> >> >  2 files changed, 15 insertions(+)
> >> >> >
> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> >> > index b7db3261c62d..4365c50e8055 100644
> >> >> > --- a/include/uapi/linux/bpf.h
> >> >> > +++ b/include/uapi/linux/bpf.h
> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> >> >          BPF_BTF_LOAD,
> >> >> >          BPF_BTF_GET_FD_BY_ID,
> >> >> >          BPF_TASK_FD_QUERY,
> >> >> > +        BPF_SYNCHRONIZE,
> >> >> >  };
> >> >> >
> >> >> >  enum bpf_map_type {
> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> >> > index d10ecd78105f..60ec7811846e 100644
> >> >> > --- a/kernel/bpf/syscall.c
> >> >> > +++ b/kernel/bpf/syscall.c
> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> >> > uattr, unsigned int, siz
> >> >> >          if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> >> >                  return -EPERM;
> >> >> >
> >> >> > +        if (cmd == BPF_SYNCHRONIZE) {
> >> >> > +                if (uattr != NULL || size != 0)
> >> >> > +                        return -EINVAL;
> >> >> > +                err = security_bpf(cmd, NULL, 0);
> >> >> > +                if (err < 0)
> >> >> > +                        return err;
> >> >> > +                /* BPF programs are run with preempt disabled, so
> >> >> > +                 * synchronize_sched is sufficient even with
> >> >> > +                 * RCU_PREEMPT.
> >> >> > +                 */
> >> >> > +                synchronize_sched();
> >> >> > +                return 0;
> >> >>
> >> >> I don't think it's necessary. sys_membarrier() can do this already
> >> >> and some folks use it exactly for this use case.
> >> >
> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to
> >> > me though. Nowhere does the manpage say membarrier should be implemented
> >> > this way, so what happens if the implementation changes?
> >> >
> >> > Further, the membarrier manpage says that a memory barrier should be
> >> > matched with a corresponding barrier. In this use case there is no
> >> > matching barrier, which makes it weirder.
> >> >
> >> > Lastly, sys_membarrier seems like it will not work on nohz-full systems,
> >> > so it's a bit fragile to depend on it for this?
> >> >
> >> >        case MEMBARRIER_CMD_GLOBAL:
> >> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> >> >                if (tick_nohz_full_enabled())
> >> >                        return -EINVAL;
> >> >                if (num_online_cpus() > 1)
> >> >                        synchronize_sched();
> >> >                return 0;
> >> >
> >> >
> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> >>
> >> See commit 907565337
> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> >>
> >> "Userspace applications should be allowed to expect the membarrier system
> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> >> nohz_full CPUs, but synchronize_sched() does not take those into
> >> account."
> >>
> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> >> only care about kernel preempt off critical sections.
> >>
> >> Clearly bpf code does not run in user-space, so it would "work".
> >>
> >> But the guarantees provided by membarrier are not to synchronize against
> >> preempt off per se. It's just that the current implementation happens to
> >> do that. The point of membarrier is to turn user-space memory barriers
> >> into compiler barriers.
> >>
> >> If what you need is to wait for an RCU grace period for whatever RCU flavor
> >> ebpf is using, I would be against using membarrier for this. I would rather
> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> >> implementation details to user-space, *and* you can eventually change your
> >> RCU implementation to e.g. SRCU in the future if needed.
> >
> > The point about future changes to underlying bpf mechanisms is valid.
> > There is work already under way to reduce the scope of preempt_off+rcu_lock
> > that currently lasts the whole prog. We will have new prog types that won't
> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > when necessary.
> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot
> > guarantee when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > also won't make sense, for the same reason.
> > What we can do instead is to define a synchronization barrier for
> > programs accessing maps. Maybe call it something like
> > BPF_SYNC_MAP_ACCESS?
> 
> I'm not sure what you're proposing. In the case the commit message
> describes, a user-space program that wants to "drain" a map needs to
> be confident that the map won't change under it, even across multiple
> bpf system calls on that map. One way of doing that is to ensure that
> nothing that could possibly hold a reference to that map is still
> running. Are you proposing some kind of refcount-draining approach?
> Simple locking won't work, since BPF programs can't block, and I don't
> see right now how a simple barrier would help.

I'm proposing a few changes for your patch:
s/BPF_SYNCHRONIZE/BPF_SYNC_MAP_ACCESS/
and s/synchronize_sched/synchronize_rcu/
with a detailed comment in uapi/bpf.h that has an example of why folks
would want to use this new cmd.
I think the bpf maps will be rcu protected for the foreseeable future
even when rcu_read_lock/unlock is done by the programs instead of
kernel wrappers.
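[Applied to the patch at the top of the thread, those changes would amount to something like the following sketch; the comment wording and the final command name were still open questions at this point, not settled uapi.]

```diff
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@
 	BPF_TASK_FD_QUERY,
-	BPF_SYNCHRONIZE,
+	/* BPF_SYNC_MAP_ACCESS waits until no program that was running
+	 * at the time of the call can still observe the old value of
+	 * a map entry, e.g. so userspace can repoint a map-in-map
+	 * slot and then safely drain the old inner map.
+	 */
+	BPF_SYNC_MAP_ACCESS,
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@
-	if (cmd == BPF_SYNCHRONIZE) {
+	if (cmd == BPF_SYNC_MAP_ACCESS) {
 		if (uattr != NULL || size != 0)
 			return -EINVAL;
 		err = security_bpf(cmd, NULL, 0);
 		if (err < 0)
 			return err;
-		synchronize_sched();
+		/* Map values are RCU-protected, so a plain RCU grace
+		 * period is the user-visible contract, independent of
+		 * how programs happen to be invoked.
+		 */
+		synchronize_rcu();
 		return 0;
 	}
```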



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-09 21:35         ` Mathieu Desnoyers
@ 2018-07-09 22:19           ` Paul E. McKenney
  2018-07-09 22:19             ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2018-07-09 22:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Alexei Starovoitov, Joel Fernandes, Daniel Colascione,
	Alexei Starovoitov, linux-kernel, Tim Murray, Daniel Borkmann,
	netdev, fengc

On Mon, Jul 09, 2018 at 05:35:34PM -0400, Mathieu Desnoyers wrote:
> 
> 
> ----- On Jul 9, 2018, at 5:09 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> 
> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> >> 
> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> >> > RCU data structure operations with respect to active programs. For
> >> >> > example, userspace can update a map->map entry to point to a new map,
> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> >> > complete, and then drain the old map without fear that BPF programs
> >> >> > may still be updating it.
> >> >> > 
> >> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> >> >> > ---
> >> >> >  include/uapi/linux/bpf.h |  1 +
> >> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> >> >> >  2 files changed, 15 insertions(+)
> >> >> > 
> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> >> > index b7db3261c62d..4365c50e8055 100644
> >> >> > --- a/include/uapi/linux/bpf.h
> >> >> > +++ b/include/uapi/linux/bpf.h
> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> >> >  	BPF_BTF_LOAD,
> >> >> >  	BPF_BTF_GET_FD_BY_ID,
> >> >> >  	BPF_TASK_FD_QUERY,
> >> >> > +	BPF_SYNCHRONIZE,
> >> >> >  };
> >> >> >  
> >> >> >  enum bpf_map_type {
> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> >> > index d10ecd78105f..60ec7811846e 100644
> >> >> > --- a/kernel/bpf/syscall.c
> >> >> > +++ b/kernel/bpf/syscall.c
> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> >> > uattr, unsigned int, siz
> >> >> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> >> >  		return -EPERM;
> >> >> >  
> >> >> > +	if (cmd == BPF_SYNCHRONIZE) {
> >> >> > +		if (uattr != NULL || size != 0)
> >> >> > +			return -EINVAL;
> >> >> > +		err = security_bpf(cmd, NULL, 0);
> >> >> > +		if (err < 0)
> >> >> > +			return err;
> >> >> > +		/* BPF programs are run with preempt disabled, so
> >> >> > +		 * synchronize_sched is sufficient even with
> >> >> > +		 * RCU_PREEMPT.
> >> >> > +		 */
> >> >> > +		synchronize_sched();
> >> >> > +		return 0;
> >> >> 
> >> >> I don't think it's necessary. sys_membarrier() can do this already
> >> >> and some folks use it exactly for this use case.
> >> > 
> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to
> >> > me though. Nowhere does the manpage say membarrier should be implemented
> >> > this way, so what happens if the implementation changes?
> >> >
> >> > Further, the membarrier manpage says that a memory barrier should be
> >> > matched with a corresponding barrier. In this use case there is no
> >> > matching barrier, which makes it weirder.
> >> >
> >> > Lastly, sys_membarrier seems like it will not work on nohz-full systems,
> >> > so it's a bit fragile to depend on it for this?
> >> > 
> >> >        case MEMBARRIER_CMD_GLOBAL:
> >> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> >> >                if (tick_nohz_full_enabled())
> >> >                        return -EINVAL;
> >> >                if (num_online_cpus() > 1)
> >> >                        synchronize_sched();
> >> >                return 0;
> >> > 
> >> > 
> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> >> 
> >> See commit 907565337
> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> >> 
> >> "Userspace applications should be allowed to expect the membarrier system
> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> >> nohz_full CPUs, but synchronize_sched() does not take those into
> >> account."
> >> 
> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> >> only care about kernel preempt off critical sections.
> >> 
> >> Clearly bpf code does not run in user-space, so it would "work".
> >> 
> >> But the guarantees provided by membarrier are not to synchronize against
> >> preempt off per se. It's just that the current implementation happens to
> >> do that. The point of membarrier is to turn user-space memory barriers
> >> into compiler barriers.
> >> 
> >> If what you need is to wait for an RCU grace period for whatever RCU flavor
> >> ebpf is using, I would be against using membarrier for this. I would rather
> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> >> implementation details to user-space, *and* you can eventually change your
> >> RCU implementation to e.g. SRCU in the future if needed.
> > 
> > The point about future changes to underlying bpf mechanisms is valid.
> > There is work already under way to reduce the scope of preempt_off+rcu_lock
> > that currently lasts the whole prog. We will have new prog types that won't
> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > when necessary.
> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot
> > guarantee when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > also won't make sense, for the same reason.
> > What we can do instead is to define a synchronization barrier for
> > programs accessing maps. Maybe call it something like
> > BPF_SYNC_MAP_ACCESS?
> > uapi/bpf.h would need an extensive comment on what this barrier is doing.
> > Implementation should probably call synchronize_rcu() and not play games
> > with synchronize_sched(), since that goes too far into the implementation.
> > Also, should such a sys_bpf command be root only?
> > I'm not sure whether a DoS attack can be made by spamming synchronize_rcu(),
> > and synchronize_sched() for that matter.
> 
> Adding Paul E. McKenney in CC. He may want to share his thoughts on the matter.

Let's see...

Spamming synchronize_rcu() and synchronize_sched() should be a non-event,
at least aside from the CPUs doing the spamming.  The reason for this
is that a given task can only fire off a single synchronize_sched() or
synchronize_rcu() every few milliseconds, so you need a -lot- of tasks
to have much effect, at which point the sheer number of tasks is much
more a problem than the large number of outstanding synchronize_rcu()
or synchronize_sched() invocations.

I very strongly agree that usermode should have an operation that
synchronizes with whatever eBPF uses, rather than something that forces
a specific type of RCU grace period.

Finally, in a few releases, synchronize_sched() will be retiring in favor
of synchronize_rcu(), which will wait on preemption-disabled regions of
code in addition to waiting on RCU read-side critical sections.  Not a
big deal, as I expect to enlist Coccinelle's aid in this.
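[For what it's worth, the call-site part of such a conversion is mechanical enough for a one-rule Coccinelle semantic patch along these lines; this is a sketch, not the patch actually used for the transition.]

```
@@
@@
- synchronize_sched();
+ synchronize_rcu();
```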

Did I manage to hit all the high points?

							Thanx, Paul



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-09 22:19           ` Paul E. McKenney
@ 2018-07-09 22:19             ` Alexei Starovoitov
  2018-07-09 22:48               ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2018-07-09 22:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mathieu Desnoyers, Joel Fernandes, Daniel Colascione,
	Alexei Starovoitov, linux-kernel, Tim Murray, Daniel Borkmann,
	netdev, fengc

On Mon, Jul 09, 2018 at 03:19:03PM -0700, Paul E. McKenney wrote:
> On Mon, Jul 09, 2018 at 05:35:34PM -0400, Mathieu Desnoyers wrote:
> > 
> > 
> > ----- On Jul 9, 2018, at 5:09 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> > 
> > > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> > >> 
> > >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > >> >> > RCU data structure operations with respect to active programs. For
> > >> >> > example, userspace can update a map->map entry to point to a new map,
> > >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > >> >> > complete, and then drain the old map without fear that BPF programs
> > >> >> > may still be updating it.
> > >> >> > 
> > >> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> > >> >> > ---
> > >> >> >  include/uapi/linux/bpf.h |  1 +
> > >> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> > >> >> >  2 files changed, 15 insertions(+)
> > >> >> > 
> > >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> >> > index b7db3261c62d..4365c50e8055 100644
> > >> >> > --- a/include/uapi/linux/bpf.h
> > >> >> > +++ b/include/uapi/linux/bpf.h
> > >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > >> >> >  	BPF_BTF_LOAD,
> > >> >> >  	BPF_BTF_GET_FD_BY_ID,
> > >> >> >  	BPF_TASK_FD_QUERY,
> > >> >> > +	BPF_SYNCHRONIZE,
> > >> >> >  };
> > >> >> >  
> > >> >> >  enum bpf_map_type {
> > >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > >> >> > index d10ecd78105f..60ec7811846e 100644
> > >> >> > --- a/kernel/bpf/syscall.c
> > >> >> > +++ b/kernel/bpf/syscall.c
> > >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > >> >> > uattr, unsigned int, siz
> > >> >> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > >> >> >  		return -EPERM;
> > >> >> >  
> > >> >> > +	if (cmd == BPF_SYNCHRONIZE) {
> > >> >> > +		if (uattr != NULL || size != 0)
> > >> >> > +			return -EINVAL;
> > >> >> > +		err = security_bpf(cmd, NULL, 0);
> > >> >> > +		if (err < 0)
> > >> >> > +			return err;
> > >> >> > +		/* BPF programs are run with preempt disabled, so
> > >> >> > +		 * synchronize_sched is sufficient even with
> > >> >> > +		 * RCU_PREEMPT.
> > >> >> > +		 */
> > >> >> > +		synchronize_sched();
> > >> >> > +		return 0;
> > >> >> 
> > >> >> I don't think it's necessary. sys_membarrier() can do this already
> > >> >> and some folks use it exactly for this use case.
> > >> > 
> > >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to
> > >> > me though. Nowhere does the manpage say membarrier should be implemented
> > >> > this way, so what happens if the implementation changes?
> > >> >
> > >> > Further, the membarrier manpage says that a memory barrier should be
> > >> > matched with a corresponding barrier. In this use case there is no
> > >> > matching barrier, which makes it weirder.
> > >> >
> > >> > Lastly, sys_membarrier seems like it will not work on nohz-full systems,
> > >> > so it's a bit fragile to depend on it for this?
> > >> > 
> > >> >        case MEMBARRIER_CMD_GLOBAL:
> > >> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > >> >                if (tick_nohz_full_enabled())
> > >> >                        return -EINVAL;
> > >> >                if (num_online_cpus() > 1)
> > >> >                        synchronize_sched();
> > >> >                return 0;
> > >> > 
> > >> > 
> > >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > >> 
> > >> See commit 907565337
> > >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> > >> 
> > >> "Userspace applications should be allowed to expect the membarrier system
> > >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > >> nohz_full CPUs, but synchronize_sched() does not take those into
> > >> account."
> > >> 
> > >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > >> only care about kernel preempt off critical sections.
> > >> 
> > >> Clearly bpf code does not run in user-space, so it would "work".
> > >> 
> > >> But the guarantees provided by membarrier are not to synchronize against
> > >> preempt off per se. It's just that the current implementation happens to
> > >> do that. The point of membarrier is to turn user-space memory barriers
> > >> into compiler barriers.
> > >> 
> > >> If what you need is to wait for an RCU grace period for whatever RCU flavor
> > >> ebpf is using, I would be against using membarrier for this. I would rather
> > >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> > >> implementation details to user-space, *and* you can eventually change your
> > >> RCU implementation to e.g. SRCU in the future if needed.
> > > 
> > > The point about future changes to underlying bpf mechanisms is valid.
> > > There is work already under way to reduce the scope of preempt_off+rcu_lock
> > > that currently lasts the whole prog. We will have new prog types that won't
> > > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > > when necessary.
> > > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot
> > > guarantee when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > > also won't make sense, for the same reason.
> > > What we can do instead is to define a synchronization barrier for
> > > programs accessing maps. Maybe call it something like
> > > BPF_SYNC_MAP_ACCESS?
> > > uapi/bpf.h would need an extensive comment on what this barrier is doing.
> > > Implementation should probably call synchronize_rcu() and not play games
> > > with synchronize_sched(), since that goes too far into the implementation.
> > > Also, should such a sys_bpf command be root only?
> > > I'm not sure whether a DoS attack can be made by spamming synchronize_rcu(),
> > > and synchronize_sched() for that matter.
> > 
> > Adding Paul E. McKenney in CC. He may want to share his thoughts on the matter.
> 
> Let's see...
> 
> Spamming synchronize_rcu() and synchronize_sched() should be a non-event,
> at least aside from the CPUs doing the spamming.  The reason for this
> is that a given task can only fire off a single synchronize_sched() or
> synchronize_rcu() every few milliseconds, so you need a -lot- of tasks
> to have much effect, at which point the sheer number of tasks is much
> more a problem than the large number of outstanding synchronize_rcu()
> or synchronize_sched() invocations.
> 
> I very strongly agree that usermode should have an operation that
> synchronizes with whatever eBPF uses, rather than something that forces
> a specific type of RCU grace period.
> 
> Finally, in a few releases, synchronize_sched() will be retiring in favor
> of synchronize_rcu(), which will wait on preemption-disabled regions of
> code in addition to waiting on RCU read-side critical sections.  Not a
> big deal, as I expect to enlist Coccinelle's aid in this.
> 
> Did I manage to hit all the high points?

Thanks. It's ok for the new cmd to be unprivileged then.



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-09 22:10           ` Alexei Starovoitov
@ 2018-07-09 22:21             ` Daniel Colascione
  2018-07-09 22:34               ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Colascione @ 2018-07-09 22:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, Joel Fernandes, Alexei Starovoitov,
	linux-kernel, Tim Murray, Daniel Borkmann, netdev, Chenbo Feng

On Mon, Jul 9, 2018 at 3:10 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Jul 09, 2018 at 02:36:37PM -0700, Daniel Colascione wrote:
>> On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
>> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
>> >>
>> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
>> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
>> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
>> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
>> >> >> > RCU data structure operations with respect to active programs. For
>> >> >> > example, userspace can update a map->map entry to point to a new map,
>> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>> >> >> > complete, and then drain the old map without fear that BPF programs
>> >> >> > may still be updating it.
>> >> >> >
>> >> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
>> >> >> > ---
>> >> >> >  include/uapi/linux/bpf.h |  1 +
>> >> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
>> >> >> >  2 files changed, 15 insertions(+)
>> >> >> >
>> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> >> >> > index b7db3261c62d..4365c50e8055 100644
>> >> >> > --- a/include/uapi/linux/bpf.h
>> >> >> > +++ b/include/uapi/linux/bpf.h
>> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
>> >> >> >          BPF_BTF_LOAD,
>> >> >> >          BPF_BTF_GET_FD_BY_ID,
>> >> >> >          BPF_TASK_FD_QUERY,
>> >> >> > +        BPF_SYNCHRONIZE,
>> >> >> >  };
>> >> >> >
>> >> >> >  enum bpf_map_type {
>> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> >> >> > index d10ecd78105f..60ec7811846e 100644
>> >> >> > --- a/kernel/bpf/syscall.c
>> >> >> > +++ b/kernel/bpf/syscall.c
>> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
>> >> >> > uattr, unsigned int, siz
>> >> >> >          if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
>> >> >> >                  return -EPERM;
>> >> >> >
>> >> >> > +        if (cmd == BPF_SYNCHRONIZE) {
>> >> >> > +                if (uattr != NULL || size != 0)
>> >> >> > +                        return -EINVAL;
>> >> >> > +                err = security_bpf(cmd, NULL, 0);
>> >> >> > +                if (err < 0)
>> >> >> > +                        return err;
>> >> >> > +                /* BPF programs are run with preempt disabled, so
>> >> >> > +                 * synchronize_sched is sufficient even with
>> >> >> > +                 * RCU_PREEMPT.
>> >> >> > +                 */
>> >> >> > +                synchronize_sched();
>> >> >> > +                return 0;
>> >> >>
>> >> >> I don't think it's necessary. sys_membarrier() can do this already
>> >> >> and some folks use it exactly for this use case.
>> >> >
>> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to
>> >> > me though. Nowhere does the manpage say membarrier should be implemented
>> >> > this way, so what happens if the implementation changes?
>> >> >
>> >> > Further, the membarrier manpage says that a memory barrier should be
>> >> > matched with a corresponding barrier. In this use case there is no
>> >> > matching barrier, which makes it weirder.
>> >> >
>> >> > Lastly, sys_membarrier seems like it will not work on nohz-full systems,
>> >> > so it's a bit fragile to depend on it for this?
>> >> >
>> >> >        case MEMBARRIER_CMD_GLOBAL:
>> >> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
>> >> >                if (tick_nohz_full_enabled())
>> >> >                        return -EINVAL;
>> >> >                if (num_online_cpus() > 1)
>> >> >                        synchronize_sched();
>> >> >                return 0;
>> >> >
>> >> >
>> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
>> >>
>> >> See commit 907565337
>> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
>> >>
>> >> "Userspace applications should be allowed to expect the membarrier system
>> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
>> >> nohz_full CPUs, but synchronize_sched() does not take those into
>> >> account."
>> >>
>> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
>> >> only care about kernel preempt off critical sections.
>> >>
>> >> Clearly bpf code does not run in user-space, so it would "work".
>> >>
>> >> But the guarantees provided by membarrier are not to synchronize against
>> >> preempt off per se. It's just that the current implementation happens to
>> >> do that. The point of membarrier is to turn user-space memory barriers
>> >> into compiler barriers.
>> >>
>> >> If what you need is to wait for an RCU grace period for whatever RCU flavor
>> >> ebpf is using, I would be against using membarrier for this. I would rather
>> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
>> >> implementation details to user-space, *and* you can eventually change your
>> >> RCU implementation to e.g. SRCU in the future if needed.
>> >
>> > The point about future changes to underlying bpf mechanisms is valid.
>> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
>> > that currently lasts the whole prog. We will have new prog types that won't
>> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
>> > when necessary.
>> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
>> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
>> > also won't make sense for the same reason.
>> > What we can do instead is to define a synchronization barrier for
>> > programs accessing maps. Maybe call it something like:
>> > BPF_SYNC_MAP_ACCESS ?
>>
>> I'm not sure what you're proposing. In the case the commit message
>> describes, a user-space program that wants to "drain" a map needs to
>> be confident that the map won't change under it, even across multiple
>> bpf system calls on that map. One way of doing that is to ensure that
>> nothing that could possibly hold a reference to that map is still
>> running. Are you proposing some kind of refcount-draining approach?
>> Simple locking won't work, since BPF programs can't block, and I don't
>> see right now how a simple barrier would help.
>
> I'm proposing a few changes for your patch:
> s/BPF_SYNCHRONIZE/BPF_SYNC_MAP_ACCESS/
> and s/synchronize_sched/synchronize_rcu/
> with detailed comment in uapi/bpf.h that has an example why folks
> would want to use this new cmd.

Thanks for clarifying.

> I think the bpf maps will be rcu protected for foreseeable future
> even when rcu_read_lock/unlock will be done by the programs instead of
> kernel wrappers.

Can we guarantee that we always obtain a map reference and dispose of
that reference inside the same critical section? If so, can BPF
programs then disable preemption for as long as they'd like?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-09 22:21             ` Daniel Colascione
@ 2018-07-09 22:34               ` Alexei Starovoitov
  2018-07-10  5:25                 ` Chenbo Feng
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2018-07-09 22:34 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Mathieu Desnoyers, Joel Fernandes, Alexei Starovoitov,
	linux-kernel, Tim Murray, Daniel Borkmann, netdev, Chenbo Feng

On Mon, Jul 09, 2018 at 03:21:43PM -0700, Daniel Colascione wrote:
> On Mon, Jul 9, 2018 at 3:10 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Mon, Jul 09, 2018 at 02:36:37PM -0700, Daniel Colascione wrote:
> >> On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> >> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> >> >>
> >> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> >> >> > RCU data structure operations with respect to active programs. For
> >> >> >> > example, userspace can update a map->map entry to point to a new map,
> >> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> >> >> > complete, and then drain the old map without fear that BPF programs
> >> >> >> > may still be updating it.
> >> >> >> >
> >> >> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> >> >> >> > ---
> >> >> >> >  include/uapi/linux/bpf.h |  1 +
> >> >> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> >> >> >> >  2 files changed, 15 insertions(+)
> >> >> >> >
> >> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> >> >> > index b7db3261c62d..4365c50e8055 100644
> >> >> >> > --- a/include/uapi/linux/bpf.h
> >> >> >> > +++ b/include/uapi/linux/bpf.h
> >> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> >> >> >          BPF_BTF_LOAD,
> >> >> >> >          BPF_BTF_GET_FD_BY_ID,
> >> >> >> >          BPF_TASK_FD_QUERY,
> >> >> >> > +        BPF_SYNCHRONIZE,
> >> >> >> >  };
> >> >> >> >
> >> >> >> >  enum bpf_map_type {
> >> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> >> >> > index d10ecd78105f..60ec7811846e 100644
> >> >> >> > --- a/kernel/bpf/syscall.c
> >> >> >> > +++ b/kernel/bpf/syscall.c
> >> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> >> >> > uattr, unsigned int, siz
> >> >> >> >          if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> >> >> >                  return -EPERM;
> >> >> >> >
> >> >> >> > +        if (cmd == BPF_SYNCHRONIZE) {
> >> >> >> > +                if (uattr != NULL || size != 0)
> >> >> >> > +                        return -EINVAL;
> >> >> >> > +                err = security_bpf(cmd, NULL, 0);
> >> >> >> > +                if (err < 0)
> >> >> >> > +                        return err;
> >> >> >> > +                /* BPF programs are run with preempt disabled, so
> >> >> >> > +                 * synchronize_sched is sufficient even with
> >> >> >> > +                 * RCU_PREEMPT.
> >> >> >> > +                 */
> >> >> >> > +                synchronize_sched();
> >> >> >> > +                return 0;
> >> >> >>
> >> >> >> I don't think it's necessary. sys_membarrier() can do this already
> >> >> >> and some folks use it exactly for this use case.
> >> >> >
> >> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> >> >> > though. Nowhere does the manpage say membarrier should be implemented this
> >> >> > way so what happens if the implementation changes?
> >> >> >
> >> >> > Further, membarrier manpage says that a memory barrier should be matched with
> >> >> > a matching barrier. In this use case there is no matching barrier, so it
> >> >> > makes it weirder.
> >> >> >
> >> >> > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
> >> >> > it's a bit fragile to depend on it for this?
> >> >> >
> >> >> >        case MEMBARRIER_CMD_GLOBAL:
> >> >> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> >> >> >                if (tick_nohz_full_enabled())
> >> >> >                        return -EINVAL;
> >> >> >                if (num_online_cpus() > 1)
> >> >> >                        synchronize_sched();
> >> >> >                return 0;
> >> >> >
> >> >> >
> >> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> >> >>
> >> >> See commit 907565337
> >> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> >> >>
> >> >> "Userspace applications should be allowed to expect the membarrier system
> >> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> >> >> nohz_full CPUs, but synchronize_sched() does not take those into
> >> >> account."
> >> >>
> >> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> >> >> only care about kernel preempt off critical sections.
> >> >>
> >> >> Clearly bpf code does not run in user-space, so it would "work".
> >> >>
> >> >> But the guarantees provided by membarrier are not to synchronize against
> >> >> preempt off per se. It's just that the current implementation happens to
> >> >> do that. The point of membarrier is to turn user-space memory barriers
> >> >> into compiler barriers.
> >> >>
> >> >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> >> >> ebpf is using, I would advise against using membarrier for this. I would rather
> >> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> >> >> implementation details to user-space, *and* you can eventually change your
> >> >> RCU implementation for e.g. SRCU in the future if needed.
> >> >
> >> > The point about future changes to underlying bpf mechanisms is valid.
> >> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> >> > that currently lasts the whole prog. We will have new prog types that won't
> >> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> >> > when necessary.
> >> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> >> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> >> > also won't make sense for the same reason.
> >> > What we can do instead is to define a synchronization barrier for
> >> > programs accessing maps. Maybe call it something like:
> >> > BPF_SYNC_MAP_ACCESS ?
> >>
> >> I'm not sure what you're proposing. In the case the commit message
> >> describes, a user-space program that wants to "drain" a map needs to
> >> be confident that the map won't change under it, even across multiple
> >> bpf system calls on that map. One way of doing that is to ensure that
> >> nothing that could possibly hold a reference to that map is still
> >> running. Are you proposing some kind of refcount-draining approach?
> >> Simple locking won't work, since BPF programs can't block, and I don't
> >> see right now how a simple barrier would help.
> >
> > I'm proposing a few changes for your patch:
> > s/BPF_SYNCHRONIZE/BPF_SYNC_MAP_ACCESS/
> > and s/synchronize_sched/synchronize_rcu/
> > with detailed comment in uapi/bpf.h that has an example why folks
> > would want to use this new cmd.
> 
> Thanks for clarifying.
> 
> > I think the bpf maps will be rcu protected for foreseeable future
> > even when rcu_read_lock/unlock will be done by the programs instead of
> > kernel wrappers.
> 
> Can we guarantee that we always obtain a map reference and dispose of
> that reference inside the same critical section? 

yep. the verifier will guarantee that.

> If so, can BPF
> programs then disable preemption for as long as they'd like?

you mean after the program finishes? no. only while running.
The verifier will match things like lookup/release, lock/unlock, preempt on/off
and will make sure there is no dangling preempt disable after the program returns.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-09 22:19             ` Alexei Starovoitov
@ 2018-07-09 22:48               ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2018-07-09 22:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, Joel Fernandes, Daniel Colascione,
	Alexei Starovoitov, linux-kernel, Tim Murray, Daniel Borkmann,
	netdev, fengc

On Mon, Jul 09, 2018 at 03:19:55PM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 09, 2018 at 03:19:03PM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 09, 2018 at 05:35:34PM -0400, Mathieu Desnoyers wrote:
> > > 
> > > 
> > > ----- On Jul 9, 2018, at 5:09 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> > > 
> > > > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > > >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> > > >> 
> > > >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > > >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > > >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > > >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > > >> >> > RCU data structure operations with respect to active programs. For
> > > >> >> > example, userspace can update a map->map entry to point to a new map,
> > > >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > > >> >> > complete, and then drain the old map without fear that BPF programs
> > > >> >> > may still be updating it.
> > > >> >> > 
> > > >> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> > > >> >> > ---
> > > >> >> >  include/uapi/linux/bpf.h |  1 +
> > > >> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> > > >> >> >  2 files changed, 15 insertions(+)
> > > >> >> > 
> > > >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > >> >> > index b7db3261c62d..4365c50e8055 100644
> > > >> >> > --- a/include/uapi/linux/bpf.h
> > > >> >> > +++ b/include/uapi/linux/bpf.h
> > > >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > > >> >> >  	BPF_BTF_LOAD,
> > > >> >> >  	BPF_BTF_GET_FD_BY_ID,
> > > >> >> >  	BPF_TASK_FD_QUERY,
> > > >> >> > +	BPF_SYNCHRONIZE,
> > > >> >> >  };
> > > >> >> >  
> > > >> >> >  enum bpf_map_type {
> > > >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > >> >> > index d10ecd78105f..60ec7811846e 100644
> > > >> >> > --- a/kernel/bpf/syscall.c
> > > >> >> > +++ b/kernel/bpf/syscall.c
> > > >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > > >> >> > uattr, unsigned int, siz
> > > >> >> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > >> >> >  		return -EPERM;
> > > >> >> >  
> > > >> >> > +	if (cmd == BPF_SYNCHRONIZE) {
> > > >> >> > +		if (uattr != NULL || size != 0)
> > > >> >> > +			return -EINVAL;
> > > >> >> > +		err = security_bpf(cmd, NULL, 0);
> > > >> >> > +		if (err < 0)
> > > >> >> > +			return err;
> > > >> >> > +		/* BPF programs are run with preempt disabled, so
> > > >> >> > +		 * synchronize_sched is sufficient even with
> > > >> >> > +		 * RCU_PREEMPT.
> > > >> >> > +		 */
> > > >> >> > +		synchronize_sched();
> > > >> >> > +		return 0;
> > > >> >> 
> > > >> >> I don't think it's necessary. sys_membarrier() can do this already
> > > >> >> and some folks use it exactly for this use case.
> > > >> > 
> > > >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > > >> > though. Nowhere does the manpage say membarrier should be implemented this
> > > >> > way so what happens if the implementation changes?
> > > >> > 
> > > >> > Further, membarrier manpage says that a memory barrier should be matched with
> > > >> > a matching barrier. In this use case there is no matching barrier, so it
> > > >> > makes it weirder.
> > > >> > 
> > > >> > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
> > > >> > it's a bit fragile to depend on it for this?
> > > >> > 
> > > >> >        case MEMBARRIER_CMD_GLOBAL:
> > > >> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > > >> >                if (tick_nohz_full_enabled())
> > > >> >                        return -EINVAL;
> > > >> >                if (num_online_cpus() > 1)
> > > >> >                        synchronize_sched();
> > > >> >                return 0;
> > > >> > 
> > > >> > 
> > > >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > > >> 
> > > >> See commit 907565337
> > > >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> > > >> 
> > > >> "Userspace applications should be allowed to expect the membarrier system
> > > >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > > >> nohz_full CPUs, but synchronize_sched() does not take those into
> > > >> account."
> > > >> 
> > > >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > > >> only care about kernel preempt off critical sections.
> > > >> 
> > > >> Clearly bpf code does not run in user-space, so it would "work".
> > > >> 
> > > >> But the guarantees provided by membarrier are not to synchronize against
> > > >> preempt off per se. It's just that the current implementation happens to
> > > >> do that. The point of membarrier is to turn user-space memory barriers
> > > >> into compiler barriers.
> > > >> 
> > > >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> > > >> ebpf is using, I would advise against using membarrier for this. I would rather
> > > >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> > > >> implementation details to user-space, *and* you can eventually change your
> > > >> RCU implementation for e.g. SRCU in the future if needed.
> > > > 
> > > > The point about future changes to underlying bpf mechanisms is valid.
> > > > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> > > > that currently lasts the whole prog. We will have new prog types that won't
> > > > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > > > when necessary.
> > > > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> > > > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > > > also won't make sense for the same reason.
> > > > What we can do instead is to define a synchronization barrier for
> > > > programs accessing maps. Maybe call it something like:
> > > > BPF_SYNC_MAP_ACCESS ?
> > > > uapi/bpf.h would need to have extensive comment what this barrier is doing.
> > > > Implementation should probably call synchronize_rcu() and not play games
> > > > with synchronize_sched(), since that's going too much into implementation.
> > > > Also should such sys_bpf command be root only?
> > > > I'm not sure whether a DoS attack can be mounted by spamming synchronize_rcu()
> > > > and synchronize_sched() for that matter.
> > > 
> > > Adding Paul E. McKenney in CC. He may want to share his thoughts on the matter.
> > 
> > Let's see...
> > 
> > Spamming synchronize_rcu() and synchronize_sched() should be a non-event,
> > at least aside from the CPUs doing the spamming.  The reason for this
> > is that a given task can only fire off a single synchronize_sched() or
> > synchronize_rcu() every few milliseconds, so you need a -lot- of tasks
> > to have much effect, at which point the sheer number of tasks is much
> > more a problem than the large number of outstanding synchronize_rcu()
> > or synchronize_sched() invocations.
> > 
> > I very strongly agree that usermode should have an operation that
> > synchronizes with whatever eBPF uses, rather than something that forces
> > a specific type of RCU grace period.
> > 
> > Finally, in a few releases, synchronize_sched() will be retiring in favor
> > of synchronize_rcu(), which will wait on preemption-disabled regions of
> > code in addition to waiting on RCU read-side critical sections.  Not a
> > big deal, as I expect to enlist Coccinelle's aid in this.
> > 
> > Did I manage to hit all the high points?
> 
> Thanks. It's ok for the new cmd to be unprivileged then.

Yes.  After all, users can cause RCU far more pain by doing close(open())
in a tight loop.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-08 20:54     ` Mathieu Desnoyers
  2018-07-09 21:09       ` Alexei Starovoitov
@ 2018-07-10  5:13       ` Joel Fernandes
  2018-07-10 16:42         ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2018-07-10  5:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Joel Fernandes, Alexei Starovoitov, Daniel Colascione,
	Alexei Starovoitov, linux-kernel, Tim Murray, Daniel Borkmann,
	netdev, fengc, paulmck

On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> 
> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> > RCU data structure operations with respect to active programs. For
> >> > example, userspace can update a map->map entry to point to a new map,
> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> > complete, and then drain the old map without fear that BPF programs
> >> > may still be updating it.
> >> > 
> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> >> > ---
> >> >  include/uapi/linux/bpf.h |  1 +
> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> >> >  2 files changed, 15 insertions(+)
> >> > 
> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> > index b7db3261c62d..4365c50e8055 100644
> >> > --- a/include/uapi/linux/bpf.h
> >> > +++ b/include/uapi/linux/bpf.h
> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> >  	BPF_BTF_LOAD,
> >> >  	BPF_BTF_GET_FD_BY_ID,
> >> >  	BPF_TASK_FD_QUERY,
> >> > +	BPF_SYNCHRONIZE,
> >> >  };
> >> >  
> >> >  enum bpf_map_type {
> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> > index d10ecd78105f..60ec7811846e 100644
> >> > --- a/kernel/bpf/syscall.c
> >> > +++ b/kernel/bpf/syscall.c
> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> > uattr, unsigned int, siz
> >> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> >  		return -EPERM;
> >> >  
> >> > +	if (cmd == BPF_SYNCHRONIZE) {
> >> > +		if (uattr != NULL || size != 0)
> >> > +			return -EINVAL;
> >> > +		err = security_bpf(cmd, NULL, 0);
> >> > +		if (err < 0)
> >> > +			return err;
> >> > +		/* BPF programs are run with preempt disabled, so
> >> > +		 * synchronize_sched is sufficient even with
> >> > +		 * RCU_PREEMPT.
> >> > +		 */
> >> > +		synchronize_sched();
> >> > +		return 0;
> >> 
> >> I don't think it's necessary. sys_membarrier() can do this already
> >> and some folks use it exactly for this use case.
> > 
> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > though. Nowhere does the manpage say membarrier should be implemented this
> > way so what happens if the implementation changes?
> > 
> > Further, membarrier manpage says that a memory barrier should be matched with
> > a matching barrier. In this use case there is no matching barrier, so it
> > makes it weirder.
> > 
> > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
> > it's a bit fragile to depend on it for this?
> > 
> >        case MEMBARRIER_CMD_GLOBAL:
> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> >                if (tick_nohz_full_enabled())
> >                        return -EINVAL;
> >                if (num_online_cpus() > 1)
> >                        synchronize_sched();
> >                return 0;
> > 
> > 
> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> 
> See commit 907565337
> "Fix: Disable sys_membarrier when nohz_full is enabled"
> 
> "Userspace applications should be allowed to expect the membarrier system
> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> nohz_full CPUs, but synchronize_sched() does not take those into
> account."
> 
> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> only care about kernel preempt off critical sections.

Mathieu, thanks a lot for your reply. I understand what you said and agree
with you. Slightly OT, but I tried to go back to first principles and
understand how membarrier() uses synchronize_sched() for the "slow path", and
it didn't make immediate sense to me. Let me clarify my dilemma.

My understanding is that membarrier's MEMBARRIER_CMD_GLOBAL will employ
synchronize_sched() to make sure all other CPUs are no longer executing in a
section of usercode that happens to be accessing memory that was written to
before the membarrier call was made. To do this, the system call uses
synchronize_sched() to try to guarantee that all user-mode execution that
started before the membarrier call has completed when the membarrier
call returns. This guarantees that, without using a real memory barrier on the
"fast path", things work just fine and everyone wins.

But, going through the RCU code, I see that an "RCU-sched quiescent state" on
a CPU may be reached when the CPU receives a timer tick while executing in
user mode:

void rcu_check_callbacks(int user)
{
	trace_rcu_utilization(TPS("Start scheduler-tick"));
	increment_cpu_stall_ticks();
	if (user || rcu_is_cpu_rrupt_from_idle()) {
[...]
		rcu_sched_qs();
		rcu_bh_qs();

The problem I see is the CPU could be executing usermode code at the time of
the RCU sched-QS. This IMO is enough reason for synchronize_sched() to
return, because the CPU in question just reported a QS (assuming all other
CPUs also happen to do so if they needed to).

Then I am wondering how does the membarrier call even work, the tick could
very well have interrupted the CPU while it was executing usermode code in
the middle of a set of instructions performing memory accesses. Reporting a
quiescent state at such an inopportune time would cause the membarrier call
to prematurely return, no? Sorry if I missed something.

The other question I have is about the whole "nohz-full doesn't work" thing.
I didn't fully understand why. RCU is already tracking the state of nohz-full
CPUs because the rcu dynticks code (in kernel/rcu/tree.c) monitors
transitions to and from usermode even if the timer tick is turned off. So why
would it not work?

thanks a lot!

 - Joel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-09 22:34               ` Alexei Starovoitov
@ 2018-07-10  5:25                 ` Chenbo Feng
  2018-07-10 23:52                   ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Chenbo Feng @ 2018-07-10  5:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Colascione, mathieu.desnoyers, Joel Fernandes,
	Alexei Starovoitov, linux-kernel, timmurray, Daniel Borkmann,
	netdev, Lorenzo Colitti

On Mon, Jul 9, 2018 at 3:34 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jul 09, 2018 at 03:21:43PM -0700, Daniel Colascione wrote:
> > On Mon, Jul 9, 2018 at 3:10 PM, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > > On Mon, Jul 09, 2018 at 02:36:37PM -0700, Daniel Colascione wrote:
> > >> On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
> > >> <alexei.starovoitov@gmail.com> wrote:
> > >> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > >> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> > >> >>
> > >> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > >> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > >> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > >> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > >> >> >> > RCU data structure operations with respect to active programs. For
> > >> >> >> > example, userspace can update a map->map entry to point to a new map,
> > >> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > >> >> >> > complete, and then drain the old map without fear that BPF programs
> > >> >> >> > may still be updating it.
> > >> >> >> >
> > >> >> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> > >> >> >> > ---
> > >> >> >> >  include/uapi/linux/bpf.h |  1 +
> > >> >> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> > >> >> >> >  2 files changed, 15 insertions(+)
> > >> >> >> >
> > >> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> >> >> > index b7db3261c62d..4365c50e8055 100644
> > >> >> >> > --- a/include/uapi/linux/bpf.h
> > >> >> >> > +++ b/include/uapi/linux/bpf.h
> > >> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > >> >> >> >          BPF_BTF_LOAD,
> > >> >> >> >          BPF_BTF_GET_FD_BY_ID,
> > >> >> >> >          BPF_TASK_FD_QUERY,
> > >> >> >> > +        BPF_SYNCHRONIZE,
> > >> >> >> >  };
> > >> >> >> >
> > >> >> >> >  enum bpf_map_type {
> > >> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > >> >> >> > index d10ecd78105f..60ec7811846e 100644
> > >> >> >> > --- a/kernel/bpf/syscall.c
> > >> >> >> > +++ b/kernel/bpf/syscall.c
> > >> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > >> >> >> > uattr, unsigned int, siz
> > >> >> >> >          if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > >> >> >> >                  return -EPERM;
> > >> >> >> >
> > >> >> >> > +        if (cmd == BPF_SYNCHRONIZE) {
> > >> >> >> > +                if (uattr != NULL || size != 0)
> > >> >> >> > +                        return -EINVAL;
> > >> >> >> > +                err = security_bpf(cmd, NULL, 0);
> > >> >> >> > +                if (err < 0)
> > >> >> >> > +                        return err;
> > >> >> >> > +                /* BPF programs are run with preempt disabled, so
> > >> >> >> > +                 * synchronize_sched is sufficient even with
> > >> >> >> > +                 * RCU_PREEMPT.
> > >> >> >> > +                 */
> > >> >> >> > +                synchronize_sched();
> > >> >> >> > +                return 0;
> > >> >> >>
> > >> >> >> I don't think it's necessary. sys_membarrier() can do this already
> > >> >> >> and some folks use it exactly for this use case.
> > >> >> >
> > >> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > >> >> > though. Nowhere does the manpage say membarrier should be implemented this
> > >> >> > way so what happens if the implementation changes?
> > >> >> >
> > >> >> > Further, membarrier manpage says that a memory barrier should be matched with
> > >> >> > a matching barrier. In this use case there is no matching barrier, so it
> > >> >> > makes it weirder.
> > >> >> >
> > >> >> > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
> > >> >> > it's a bit fragile to depend on it for this?
> > >> >> >
> > >> >> >        case MEMBARRIER_CMD_GLOBAL:
> > >> >> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > >> >> >                if (tick_nohz_full_enabled())
> > >> >> >                        return -EINVAL;
> > >> >> >                if (num_online_cpus() > 1)
> > >> >> >                        synchronize_sched();
> > >> >> >                return 0;
> > >> >> >
> > >> >> >
> > >> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > >> >>
> > >> >> See commit 907565337
> > >> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> > >> >>
> > >> >> "Userspace applications should be allowed to expect the membarrier system
> > >> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > >> >> nohz_full CPUs, but synchronize_sched() does not take those into
> > >> >> account."
> > >> >>
> > >> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > >> >> only care about kernel preempt off critical sections.
> > >> >>
> > >> >> Clearly bpf code does not run in user-space, so it would "work".
> > >> >>
> > >> >> But the guarantees provided by membarrier are not to synchronize against
> > >> >> preempt off per se. It's just that the current implementation happens to
> > >> >> do that. The point of membarrier is to turn user-space memory barriers
> > >> >> into compiler barriers.
> > >> >>
> > >> >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> > >> >> ebpf is using, I would advise against using membarrier for this. I would rather
> > >> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> > >> >> implementation details to user-space, *and* you can eventually change your
> > >> >> RCU implementation for e.g. SRCU in the future if needed.
> > >> >
> > >> > The point about future changes to underlying bpf mechanisms is valid.
> > >> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> > >> > that currently lasts the whole prog. We will have new prog types that won't
> > >> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > >> > when necessary.
> > >> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> > >> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > >> > also won't make sense for the same reason.
> > >> > What we can do instead is to define a synchronization barrier for
> > >> > programs accessing maps. Maybe call it something like:
> > >> > BPF_SYNC_MAP_ACCESS ?
> > >>
> > >> I'm not sure what you're proposing. In the case the commit message
> > >> describes, a user-space program that wants to "drain" a map needs to
> > >> be confident that the map won't change under it, even across multiple
> > >> bpf system calls on that map. One way of doing that is to ensure that
> > >> nothing that could possibly hold a reference to that map is still
> > >> running. Are you proposing some kind of refcount-draining approach?
> > >> Simple locking won't work, since BPF programs can't block, and I don't
> > >> see right now how a simple barrier would help.
> > >
> > > I'm proposing few changes for your patch:
> > > s/BPF_SYNCHRONIZE/BPF_SYNC_MAP_ACCESS/
> > > and s/synchronize_sched/synchronize_rcu/
> > > with detailed comment in uapi/bpf.h that has an example why folks
> > > would want to use this new cmd.
> >
> > Thanks for clarifying.
> >
> > > I think the bpf maps will be rcu protected for foreseeable future
> > > even when rcu_read_lock/unlock will be done by the programs instead of
> > > kernel wrappers.
> >
> > Can we guarantee that we always obtain a map reference and dispose of
> > that reference inside the same critical section?
>
> yep. the verifier will guarantee that.
>
> > If so, can BPF
> > programs then disable preemption for as long as they'd like?
>
> you mean after the finish? no. only while running.
> The verifier will match things like lookup/release, lock/unlock, preempt on/off
> and will make sure there is no dangling preempt disable after program returns.
>
It might be a silly question, but does this new bpf program type
provide a way to customize where to place the rcu_lock/unlock and
preempt enable/disable when creating the program? The
BPF_SYNC_MAP_ACCESS cmd does not sound very useful if it only protects
each single map access instead of the whole program. For example,
sometimes we have to read the same map multiple times when running an
eBPF program, and if userspace changes the map value between the two
reads, it may cause trouble. It would be great if we could hold an RCU
lock across a block of the eBPF program; with that, I think the
BPF_SYNC_MAP_ACCESS cmd would be enough to handle userspace/kernel races.

Thanks
Chenbo Feng

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-10  5:13       ` [RFC] Add BPF_SYNCHRONIZE " Joel Fernandes
@ 2018-07-10 16:42         ` Paul E. McKenney
  2018-07-10 16:57           ` Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2018-07-10 16:42 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Mathieu Desnoyers, Joel Fernandes, Alexei Starovoitov,
	Daniel Colascione, Alexei Starovoitov, linux-kernel, Tim Murray,
	Daniel Borkmann, netdev, fengc

On Mon, Jul 09, 2018 at 10:13:47PM -0700, Joel Fernandes wrote:
> On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> > 
> > > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > >> > RCU data structure operations with respect to active programs. For
> > >> > example, userspace can update a map->map entry to point to a new map,
> > >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > >> > complete, and then drain the old map without fear that BPF programs
> > >> > may still be updating it.
> > >> > 
> > >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> > >> > ---
> > >> >  include/uapi/linux/bpf.h |  1 +
> > >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> > >> >  2 files changed, 15 insertions(+)
> > >> > 
> > >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> > index b7db3261c62d..4365c50e8055 100644
> > >> > --- a/include/uapi/linux/bpf.h
> > >> > +++ b/include/uapi/linux/bpf.h
> > >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > >> >  	BPF_BTF_LOAD,
> > >> >  	BPF_BTF_GET_FD_BY_ID,
> > >> >  	BPF_TASK_FD_QUERY,
> > >> > +	BPF_SYNCHRONIZE,
> > >> >  };
> > >> >  
> > >> >  enum bpf_map_type {
> > >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > >> > index d10ecd78105f..60ec7811846e 100644
> > >> > --- a/kernel/bpf/syscall.c
> > >> > +++ b/kernel/bpf/syscall.c
> > >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > >> > uattr, unsigned int, siz
> > >> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > >> >  		return -EPERM;
> > >> >  
> > >> > +	if (cmd == BPF_SYNCHRONIZE) {
> > >> > +		if (uattr != NULL || size != 0)
> > >> > +			return -EINVAL;
> > >> > +		err = security_bpf(cmd, NULL, 0);
> > >> > +		if (err < 0)
> > >> > +			return err;
> > >> > +		/* BPF programs are run with preempt disabled, so
> > >> > +		 * synchronize_sched is sufficient even with
> > >> > +		 * RCU_PREEMPT.
> > >> > +		 */
> > >> > +		synchronize_sched();
> > >> > +		return 0;
> > >> 
> > >> I don't think it's necessary. sys_membarrier() can do this already
> > >> and some folks use it exactly for this use case.
> > > 
> > > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > > though. Nowhere does the manpage say membarrier should be implemented this
> > > way, so what happens if the implementation changes?
> > > 
> > > Further, the membarrier manpage says that a memory barrier should be matched
> > > with a matching barrier. In this use case there is no matching barrier, which
> > > makes it weirder.
> > > 
> > > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
> > > it's a bit fragile to depend on it for this?
> > > 
> > >        case MEMBARRIER_CMD_GLOBAL:
> > >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > >                if (tick_nohz_full_enabled())
> > >                        return -EINVAL;
> > >                if (num_online_cpus() > 1)
> > >                        synchronize_sched();
> > >                return 0;
> > > 
> > > 
> > > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > 
> > See commit 907565337
> > "Fix: Disable sys_membarrier when nohz_full is enabled"
> > 
> > "Userspace applications should be allowed to expect the membarrier system
> > call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > nohz_full CPUs, but synchronize_sched() does not take those into
> > account."
> > 
> > So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > only care about kernel preempt off critical sections.
> 
> Mathieu, Thanks a lot for your reply. I understand what you said and agree
> with you. Slight OT, but I tried to go back to first principles and
> understand how membarrier() uses synchronize_sched() for the "slow path" and
> it didn't make immediate sense to me. Let me clarify my dilemma.
> 
> My understanding is membarrier's MEMBARRIER_CMD_GLOBAL will employ
> synchronize_sched to make sure all other CPUs are no longer executing in a
> section of user code that happens to be accessing memory that was written to
> before the membarrier call was made. To do this, the system call will use
> synchronize_sched to try to guarantee that all user-mode execution that
> started before the membarrier call would be completed when the membarrier
> call returns. This guarantees that without using a real memory barrier on the
> "fast path", things work just fine and everyone wins.
> 
> But, going through RCU code, I see that an "RCU-sched quiescent state" on a CPU
> may be reached when the CPU receives a timer tick while executing in user
> mode:
> 
> void rcu_check_callbacks(int user)
> {
> 	trace_rcu_utilization(TPS("Start scheduler-tick"));
> 	increment_cpu_stall_ticks();
> 	if (user || rcu_is_cpu_rrupt_from_idle()) {
> [...]
> 		rcu_sched_qs();
> 		rcu_bh_qs();
> 
> The problem I see is the CPU could be executing usermode code at the time of
> the RCU sched-QS. This IMO is enough reason for synchronize_sched() to
> return, because the CPU in question just reported a QS (assuming all other
> CPUs also happen to do so if they needed to).

This scenario will have inserted the needed smp_mb() into the userspace
instruction execution stream, as is required by the sys_membarrier
use cases.

> Then I am wondering how the membarrier call even works: the tick could
> very well have interrupted the CPU while it was executing usermode code in
> the middle of a set of instructions performing memory accesses. Reporting a
> quiescent state at such an inopportune time would cause the membarrier call
> to prematurely return, no? Sorry if I missed something.

One way to think of sys_membarrier() is as something that promotes a
barrier() to an smp_mb().  This barrier then separates the target CPU's
accesses that the caller saw before the sys_membarrier() from that same
CPU's accesses that the caller will see after the sys_membarrier().

> The other question I have is about the whole "nohz-full doesn't work" thing.
> I didn't fully understand why. RCU is already tracking the state of nohz-full
> CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
> transitions to and from usermode even if the timer tick is turned off. So why
> would it not work?

In the nohz_full case, there is no need for sys_membarrier()'s call to
synchronize_sched() to interact directly with the nohz_full CPU.  It
can instead look at the target CPU's dyntick-idle state, and that state
would potentially have been set in the dim distant past, thus having
no effect on the target CPU's current execution.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-10 16:42         ` Paul E. McKenney
@ 2018-07-10 16:57           ` Joel Fernandes
  2018-07-10 17:12             ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2018-07-10 16:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Mathieu Desnoyers, Alexei Starovoitov,
	Daniel Colascione, Alexei Starovoitov, linux-kernel, Tim Murray,
	Daniel Borkmann, netdev, fengc

On Tue, Jul 10, 2018 at 09:42:12AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 09, 2018 at 10:13:47PM -0700, Joel Fernandes wrote:
> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > > ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> > > 
> > > > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > > >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > > >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > > >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > > >> > RCU data structure operations with respect to active programs. For
> > > >> > example, userspace can update a map->map entry to point to a new map,
> > > >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > > >> > complete, and then drain the old map without fear that BPF programs
> > > >> > may still be updating it.
> > > >> > 
> > > >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> > > >> > ---
> > > >> >  include/uapi/linux/bpf.h |  1 +
> > > >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> > > >> >  2 files changed, 15 insertions(+)
> > > >> > 
> > > >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > >> > index b7db3261c62d..4365c50e8055 100644
> > > >> > --- a/include/uapi/linux/bpf.h
> > > >> > +++ b/include/uapi/linux/bpf.h
> > > >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > > >> >  	BPF_BTF_LOAD,
> > > >> >  	BPF_BTF_GET_FD_BY_ID,
> > > >> >  	BPF_TASK_FD_QUERY,
> > > >> > +	BPF_SYNCHRONIZE,
> > > >> >  };
> > > >> >  
> > > >> >  enum bpf_map_type {
> > > >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > >> > index d10ecd78105f..60ec7811846e 100644
> > > >> > --- a/kernel/bpf/syscall.c
> > > >> > +++ b/kernel/bpf/syscall.c
> > > >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > > >> > uattr, unsigned int, siz
> > > >> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > >> >  		return -EPERM;
> > > >> >  
> > > >> > +	if (cmd == BPF_SYNCHRONIZE) {
> > > >> > +		if (uattr != NULL || size != 0)
> > > >> > +			return -EINVAL;
> > > >> > +		err = security_bpf(cmd, NULL, 0);
> > > >> > +		if (err < 0)
> > > >> > +			return err;
> > > >> > +		/* BPF programs are run with preempt disabled, so
> > > >> > +		 * synchronize_sched is sufficient even with
> > > >> > +		 * RCU_PREEMPT.
> > > >> > +		 */
> > > >> > +		synchronize_sched();
> > > >> > +		return 0;
> > > >> 
> > > >> I don't think it's necessary. sys_membarrier() can do this already
> > > >> and some folks use it exactly for this use case.
> > > > 
> > > > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > > > though. Nowhere does the manpage say membarrier should be implemented this
> > > > way, so what happens if the implementation changes?
> > > > 
> > > > Further, the membarrier manpage says that a memory barrier should be matched
> > > > with a matching barrier. In this use case there is no matching barrier, which
> > > > makes it weirder.
> > > > 
> > > > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
> > > > it's a bit fragile to depend on it for this?
> > > > 
> > > >        case MEMBARRIER_CMD_GLOBAL:
> > > >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > > >                if (tick_nohz_full_enabled())
> > > >                        return -EINVAL;
> > > >                if (num_online_cpus() > 1)
> > > >                        synchronize_sched();
> > > >                return 0;
> > > > 
> > > > 
> > > > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > > 
> > > See commit 907565337
> > > "Fix: Disable sys_membarrier when nohz_full is enabled"
> > > 
> > > "Userspace applications should be allowed to expect the membarrier system
> > > call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > > nohz_full CPUs, but synchronize_sched() does not take those into
> > > account."
> > > 
> > > So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > > only care about kernel preempt off critical sections.
> > 
> > Mathieu, Thanks a lot for your reply. I understand what you said and agree
> > with you. Slight OT, but I tried to go back to first principles and
> > understand how membarrier() uses synchronize_sched() for the "slow path" and
> > it didn't make immediate sense to me. Let me clarify my dilemma.
> > 
> > My understanding is membarrier's MEMBARRIER_CMD_GLOBAL will employ
> > synchronize_sched to make sure all other CPUs are no longer executing in a
> > section of user code that happens to be accessing memory that was written to
> > before the membarrier call was made. To do this, the system call will use
> > synchronize_sched to try to guarantee that all user-mode execution that
> > started before the membarrier call would be completed when the membarrier
> > call returns. This guarantees that without using a real memory barrier on the
> > "fast path", things work just fine and everyone wins.
> > 
> > But, going through RCU code, I see that an "RCU-sched quiescent state" on a CPU
> > may be reached when the CPU receives a timer tick while executing in user
> > mode:
> > 
> > void rcu_check_callbacks(int user)
> > {
> > 	trace_rcu_utilization(TPS("Start scheduler-tick"));
> > 	increment_cpu_stall_ticks();
> > 	if (user || rcu_is_cpu_rrupt_from_idle()) {
> > [...]
> > 		rcu_sched_qs();
> > 		rcu_bh_qs();
> > 
> > The problem I see is the CPU could be executing usermode code at the time of
> > the RCU sched-QS. This IMO is enough reason for synchronize_sched() to
> > return, because the CPU in question just reported a QS (assuming all other
> > CPUs also happen to do so if they needed to).
> 
> This scenario will have inserted the needed smp_mb() into the userspace
> instruction execution stream, as is required by the sys_membarrier
> use cases.

Oh ok, that makes sense!

> > Then I am wondering how the membarrier call even works: the tick could
> > very well have interrupted the CPU while it was executing usermode code in
> > the middle of a set of instructions performing memory accesses. Reporting a
> > quiescent state at such an inopportune time would cause the membarrier call
> > to prematurely return, no? Sorry if I missed something.
> 
> One way to think of sys_membarrier() is as something that promotes a
> barrier() to an smp_mb().  This barrier then separates the target CPU's
> accesses that the caller saw before the sys_membarrier() from that same
> CPU's accesses that the caller will see after the sys_membarrier().

Got it!

> > The other question I have is about the whole "nohz-full doesn't work" thing.
> > I didn't fully understand why. RCU is already tracking the state of nohz-full
> > CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
> > transitions to and from usermode even if the timer tick is turned off. So why
> > would it not work?
> 
> In the nohz_full case, there is no need for sys_membarrier()'s call to
> synchronize_sched() to interact directly with the nohz_full CPU.  It
> can instead look at the target CPU's dyntick-idle state, and that state
> would potentially have been set in the dim distant past, thus having
> no effect on the target CPU's current execution.

In the nohz-idle case though, there's nothing to promote the barrier() to
smp_mb() if you were to purely look at the dynticks-idle state on the
nohz-full CPU executing in user mode?

So it makes sense to me now that nohz-full needs something to IPI that
CPU in order to enforce the needed memory barrier, and pure synchronize_sched()
wouldn't work. That makes me think the expedited versions of
synchronize_sched should be able to do the job, but I could be off on a
different track.

Thanks a lot,

-Joel



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-10 16:57           ` Joel Fernandes
@ 2018-07-10 17:12             ` Paul E. McKenney
  2018-07-10 17:29               ` Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2018-07-10 17:12 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Joel Fernandes, Mathieu Desnoyers, Alexei Starovoitov,
	Daniel Colascione, Alexei Starovoitov, linux-kernel, Tim Murray,
	Daniel Borkmann, netdev, fengc

On Tue, Jul 10, 2018 at 09:57:44AM -0700, Joel Fernandes wrote:
> On Tue, Jul 10, 2018 at 09:42:12AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 09, 2018 at 10:13:47PM -0700, Joel Fernandes wrote:
> > > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > > > ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> > > > 
> > > > > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > > > >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > > > >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > > > >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > > > >> > RCU data structure operations with respect to active programs. For
> > > > >> > example, userspace can update a map->map entry to point to a new map,
> > > > >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > > > >> > complete, and then drain the old map without fear that BPF programs
> > > > >> > may still be updating it.
> > > > >> > 
> > > > >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> > > > >> > ---
> > > > >> >  include/uapi/linux/bpf.h |  1 +
> > > > >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> > > > >> >  2 files changed, 15 insertions(+)
> > > > >> > 
> > > > >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > >> > index b7db3261c62d..4365c50e8055 100644
> > > > >> > --- a/include/uapi/linux/bpf.h
> > > > >> > +++ b/include/uapi/linux/bpf.h
> > > > >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > > > >> >  	BPF_BTF_LOAD,
> > > > >> >  	BPF_BTF_GET_FD_BY_ID,
> > > > >> >  	BPF_TASK_FD_QUERY,
> > > > >> > +	BPF_SYNCHRONIZE,
> > > > >> >  };
> > > > >> >  
> > > > >> >  enum bpf_map_type {
> > > > >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > > >> > index d10ecd78105f..60ec7811846e 100644
> > > > >> > --- a/kernel/bpf/syscall.c
> > > > >> > +++ b/kernel/bpf/syscall.c
> > > > >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > > > >> > uattr, unsigned int, siz
> > > > >> >  	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > > >> >  		return -EPERM;
> > > > >> >  
> > > > >> > +	if (cmd == BPF_SYNCHRONIZE) {
> > > > >> > +		if (uattr != NULL || size != 0)
> > > > >> > +			return -EINVAL;
> > > > >> > +		err = security_bpf(cmd, NULL, 0);
> > > > >> > +		if (err < 0)
> > > > >> > +			return err;
> > > > >> > +		/* BPF programs are run with preempt disabled, so
> > > > >> > +		 * synchronize_sched is sufficient even with
> > > > >> > +		 * RCU_PREEMPT.
> > > > >> > +		 */
> > > > >> > +		synchronize_sched();
> > > > >> > +		return 0;
> > > > >> 
> > > > >> I don't think it's necessary. sys_membarrier() can do this already
> > > > >> and some folks use it exactly for this use case.
> > > > > 
> > > > > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > > > > though. Nowhere does the manpage say membarrier should be implemented this
> > > > > way, so what happens if the implementation changes?
> > > > > 
> > > > > Further, the membarrier manpage says that a memory barrier should be matched
> > > > > with a matching barrier. In this use case there is no matching barrier, which
> > > > > makes it weirder.
> > > > > 
> > > > > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so
> > > > > it's a bit fragile to depend on it for this?
> > > > > 
> > > > >        case MEMBARRIER_CMD_GLOBAL:
> > > > >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > > > >                if (tick_nohz_full_enabled())
> > > > >                        return -EINVAL;
> > > > >                if (num_online_cpus() > 1)
> > > > >                        synchronize_sched();
> > > > >                return 0;
> > > > > 
> > > > > 
> > > > > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > > > 
> > > > See commit 907565337
> > > > "Fix: Disable sys_membarrier when nohz_full is enabled"
> > > > 
> > > > "Userspace applications should be allowed to expect the membarrier system
> > > > call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > > > nohz_full CPUs, but synchronize_sched() does not take those into
> > > > account."
> > > > 
> > > > So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > > > only care about kernel preempt off critical sections.
> > > 
> > > Mathieu, Thanks a lot for your reply. I understand what you said and agree
> > > with you. Slight OT, but I tried to go back to first principles and
> > > understand how membarrier() uses synchronize_sched() for the "slow path" and
> > > it didn't make immediate sense to me. Let me clarify my dilemma.
> > > 
> > > My understanding is membarrier's MEMBARRIER_CMD_GLOBAL will employ
> > > synchronize_sched to make sure all other CPUs are no longer executing in a
> > > section of user code that happens to be accessing memory that was written to
> > > before the membarrier call was made. To do this, the system call will use
> > > synchronize_sched to try to guarantee that all user-mode execution that
> > > started before the membarrier call would be completed when the membarrier
> > > call returns. This guarantees that without using a real memory barrier on the
> > > "fast path", things work just fine and everyone wins.
> > > 
> > > But, going through RCU code, I see that an "RCU-sched quiescent state" on a CPU
> > > may be reached when the CPU receives a timer tick while executing in user
> > > mode:
> > > 
> > > void rcu_check_callbacks(int user)
> > > {
> > > 	trace_rcu_utilization(TPS("Start scheduler-tick"));
> > > 	increment_cpu_stall_ticks();
> > > 	if (user || rcu_is_cpu_rrupt_from_idle()) {
> > > [...]
> > > 		rcu_sched_qs();
> > > 		rcu_bh_qs();
> > > 
> > > The problem I see is the CPU could be executing usermode code at the time of
> > > the RCU sched-QS. This IMO is enough reason for synchronize_sched() to
> > > return, because the CPU in question just reported a QS (assuming all other
> > > CPUs also happen to do so if they needed to).
> > 
> > This scenario will have inserted the needed smp_mb() into the userspace
> > instruction execution stream, as is required by the sys_membarrier
> > use cases.
> 
> Oh ok, that makes sense!
> 
> > > Then I am wondering how the membarrier call even works: the tick could
> > > very well have interrupted the CPU while it was executing usermode code in
> > > the middle of a set of instructions performing memory accesses. Reporting a
> > > quiescent state at such an inopportune time would cause the membarrier call
> > > to prematurely return, no? Sorry if I missed something.
> > 
> > One way to think of sys_membarrier() is as something that promotes a
> > barrier() to an smp_mb().  This barrier then separates the target CPU's
> > accesses that the caller saw before the sys_membarrier() from that same
> > CPU's accesses that the caller will see after the sys_membarrier().
> 
> Got it!
> 
> > > The other question I have is about the whole "nohz-full doesn't work" thing.
> > > I didn't fully understand why. RCU is already tracking the state of nohz-full
> > > CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
> > > transitions to and from usermode even if the timer tick is turned off. So why
> > > would it not work?
> > 
> > In the nohz_full case, there is no need for sys_membarrier()'s call to
> > synchronize_sched() to interact directly with the nohz_full CPU.  It
> > can instead look at the target CPU's dyntick-idle state, and that state
> > would potentially have been set in the dim distant past, thus having
> > no effect on the target CPU's current execution.
> 
> In the nohz-idle case though, there's nothing to promote the barrier() to
> smp_mb() if you were to purely look at the dynticks-idle state on the
> nohz-full CPU executing in user mode?
> 
> So it makes sense to me now that nohz-full needs something to IPI that
> CPU in order to enforce the needed memory barrier, and pure synchronize_sched()
> wouldn't work. That makes me think the expedited versions of
> synchronize_sched should be able to do the job, but I could be off on a
> different track.

The problem is that the expedited versions also check the dyntick-idle
state and don't touch idle (or nohz_full usermode) CPUs.  This is by
design for the battery-powered embedded use case.  ;-)

							Thanx, Paul

> Thanks a lot,
> 
> -Joel
> 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-10 17:12             ` Paul E. McKenney
@ 2018-07-10 17:29               ` Joel Fernandes
  2018-07-10 17:42                 ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2018-07-10 17:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Mathieu Desnoyers, Alexei Starovoitov,
	Daniel Colascione, Alexei Starovoitov, linux-kernel, Tim Murray,
	Daniel Borkmann, netdev, fengc

On Tue, Jul 10, 2018 at 10:12:29AM -0700, Paul E. McKenney wrote:
[..]
> > > > The other question I have is about the whole "nohz-full doesn't work" thing.
> > > > I didn't fully understand why. RCU is already tracking the state of nohz-full
> > > > CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
> > > > transitions to and from usermode even if the timer tick is turned off. So why
> > > > would it not work?
> > > 
> > > In the nohz_full case, there is no need for sys_membarrier()'s call to
> > > synchronize_sched() to interact directly with the nohz_full CPU.  It
> > > can instead look at the target CPU's dyntick-idle state, and that state
> > > would potentially have been set in the dim distant past, thus having
> > > no effect on the target CPU's current execution.
> > 
> > In the nohz-idle case though, there's nothing to promote the barrier() to
> > smp_mb() if you were to purely look at the dynticks-idle state on the
> > nohz-full CPU executing in user mode?
> > 
> > So it makes sense to me now that nohz-full needs something to IPI that
> > CPU in order to enforce the needed memory barrier, and pure synchronize_sched()
> > wouldn't work. That makes me think the expedited versions of
> > synchronize_sched should be able to do the job, but I could be off on a
> > different track.
> 
> The problem is that the expedited versions also check the dyntick-idle
> state and don't touch idle (or nohz_full usermode) CPUs.  This is by
> design for the battery-powered embedded use case.  ;-)

Oh ok! ;)

I guess there's also a MEMBARRIER_CMD_GLOBAL_EXPEDITED which seems to IPI
CPUs (I'm guessing regardless of dynticks state) and execute smp_mb within
the IPI, so userspace can fall back to using that in case MEMBARRIER_CMD_GLOBAL
returns -EINVAL.

thanks!

-Joel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-10 17:29               ` Joel Fernandes
@ 2018-07-10 17:42                 ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2018-07-10 17:42 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Joel Fernandes, Mathieu Desnoyers, Alexei Starovoitov,
	Daniel Colascione, Alexei Starovoitov, linux-kernel, Tim Murray,
	Daniel Borkmann, netdev, fengc

On Tue, Jul 10, 2018 at 10:29:57AM -0700, Joel Fernandes wrote:
> On Tue, Jul 10, 2018 at 10:12:29AM -0700, Paul E. McKenney wrote:
> [..]
> > > > > The other question I have is about the whole "nohz-full doesn't work" thing.
> > > > > I didn't fully understand why. RCU is already tracking the state of nohz-full
> > > > > CPUs because the rcu dynticks code in (kernel/rcu/tree.c) monitors
> > > > > transitions to and from usermode even if the timer tick is turned off. So why
> > > > > would it not work?
> > > > 
> > > > In the nohz_full case, there is no need for sys_membarrier()'s call to
> > > > synchronize_sched() to interact directly with the nohz_full CPU.  It
> > > > can instead look at the target CPU's dyntick-idle state, and that state
> > > > would potentially have been set in the dim distant past, thus having
> > > > no effect on the target CPU's current execution.
> > > 
> > > In the nohz-idle case though, there's nothing to promote the barrier() to
> > > smp_mb() if you were to purely look at the dynticks-idle state on the
> > > nohz-full CPU executing in user mode?
> > > 
> > > So it makes sense to me now that nohz-full needs something to IPI that
> > > CPU in order to enforce the needed memory barrier, and pure synchronize_sched()
> > > wouldn't work. That makes me think the expedited versions of
> > > synchronize_sched should be able to do the job, but I could be off on a
> > > different track.
> > 
> > The problem is that the expedited versions also check the dyntick-idle
> > state and don't touch idle (or nohz_full usermode) CPUs.  This is by
> > design for the battery-powered embedded use case.  ;-)
> 
> Oh ok! ;)
> 
> I guess there's also a MEMBARRIER_CMD_GLOBAL_EXPEDITED which seems to IPI
> CPUs (I'm guessing regardless of dynticks state) and execute smp_mb within
> the IPI, so userspace can fall back to using that in case MEMBARRIER_CMD_GLOBAL
> returns -EINVAL.

Yes, and this avoids IPIing idle CPUs via the ->mm checks.  But it will
IPI nohz_full CPUs in that same process, as it must for correctness.

							Thanx, Paul



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-10  5:25                 ` Chenbo Feng
@ 2018-07-10 23:52                   ` Alexei Starovoitov
  2018-07-11  2:46                     ` Lorenzo Colitti
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2018-07-10 23:52 UTC (permalink / raw)
  To: Chenbo Feng
  Cc: Daniel Colascione, mathieu.desnoyers, Joel Fernandes,
	Alexei Starovoitov, linux-kernel, timmurray, Daniel Borkmann,
	netdev, Lorenzo Colitti

On Mon, Jul 09, 2018 at 10:25:04PM -0700, Chenbo Feng wrote:
> On Mon, Jul 9, 2018 at 3:34 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jul 09, 2018 at 03:21:43PM -0700, Daniel Colascione wrote:
> > > On Mon, Jul 9, 2018 at 3:10 PM, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > > On Mon, Jul 09, 2018 at 02:36:37PM -0700, Daniel Colascione wrote:
> > > >> On Mon, Jul 9, 2018 at 2:09 PM, Alexei Starovoitov
> > > >> <alexei.starovoitov@gmail.com> wrote:
> > > >> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> > > >> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> > > >> >>
> > > >> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> > > >> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> > > >> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> > > >> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> > > >> >> >> > RCU data structure operations with respect to active programs. For
> > > >> >> >> > example, userspace can update a map->map entry to point to a new map,
> > > >> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> > > >> >> >> > complete, and then drain the old map without fear that BPF programs
> > > >> >> >> > may still be updating it.
> > > >> >> >> >
> > > >> >> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> > > >> >> >> > ---
> > > >> >> >> >  include/uapi/linux/bpf.h |  1 +
> > > >> >> >> >  kernel/bpf/syscall.c     | 14 ++++++++++++++
> > > >> >> >> >  2 files changed, 15 insertions(+)
> > > >> >> >> >
> > > >> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > >> >> >> > index b7db3261c62d..4365c50e8055 100644
> > > >> >> >> > --- a/include/uapi/linux/bpf.h
> > > >> >> >> > +++ b/include/uapi/linux/bpf.h
> > > >> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> > > >> >> >> >          BPF_BTF_LOAD,
> > > >> >> >> >          BPF_BTF_GET_FD_BY_ID,
> > > >> >> >> >          BPF_TASK_FD_QUERY,
> > > >> >> >> > +        BPF_SYNCHRONIZE,
> > > >> >> >> >  };
> > > >> >> >> >
> > > >> >> >> >  enum bpf_map_type {
> > > >> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > >> >> >> > index d10ecd78105f..60ec7811846e 100644
> > > >> >> >> > --- a/kernel/bpf/syscall.c
> > > >> >> >> > +++ b/kernel/bpf/syscall.c
> > > >> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> > > >> >> >> > uattr, unsigned int, siz
> > > >> >> >> >          if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > >> >> >> >                  return -EPERM;
> > > >> >> >> >
> > > >> >> >> > +        if (cmd == BPF_SYNCHRONIZE) {
> > > >> >> >> > +                if (uattr != NULL || size != 0)
> > > >> >> >> > +                        return -EINVAL;
> > > >> >> >> > +                err = security_bpf(cmd, NULL, 0);
> > > >> >> >> > +                if (err < 0)
> > > >> >> >> > +                        return err;
> > > >> >> >> > +                /* BPF programs are run with preempt disabled, so
> > > >> >> >> > +                 * synchronize_sched is sufficient even with
> > > >> >> >> > +                 * RCU_PREEMPT.
> > > >> >> >> > +                 */
> > > >> >> >> > +                synchronize_sched();
> > > >> >> >> > +                return 0;
> > > >> >> >>
> > > >> >> >> I don't think it's necessary. sys_membarrier() can do this already
> > > >> >> >> and some folks use it exactly for this use case.
> > > >> >> >
> > > >> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> > > >> >> > though. Nowhere does the manpage say membarrier should be implemented this
> > > >> >> > way, so what happens if the implementation changes?
> > > >> >> >
> > > >> >> > Further, membarrier manpage says that a memory barrier should be matched with
> > > >> >> > a matching barrier. In this use case there is no matching barrier, so it
> > > >> >> > makes it weirder.
> > > >> >> >
> > > >> >> > Lastly, sys_membarrier seems like it will not work on nohz-full systems, so it's a bit
> > > >> >> > fragile to depend on it for this?
> > > >> >> >
> > > >> >> >        case MEMBARRIER_CMD_GLOBAL:
> > > >> >> >                /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> > > >> >> >                if (tick_nohz_full_enabled())
> > > >> >> >                        return -EINVAL;
> > > >> >> >                if (num_online_cpus() > 1)
> > > >> >> >                        synchronize_sched();
> > > >> >> >                return 0;
> > > >> >> >
> > > >> >> >
> > > >> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> > > >> >>
> > > >> >> See commit 907565337
> > > >> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> > > >> >>
> > > >> >> "Userspace applications should be allowed to expect the membarrier system
> > > >> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> > > >> >> nohz_full CPUs, but synchronize_sched() does not take those into
> > > >> >> account."
> > > >> >>
> > > >> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> > > >> >> only care about kernel preempt off critical sections.
> > > >> >>
> > > >> >> Clearly bpf code does not run in user-space, so it would "work".
> > > >> >>
> > > >> >> But the guarantees provided by membarrier are not to synchronize against
> > > >> >> preempt off per se. It's just that the current implementation happens to
> > > >> >> do that. The point of membarrier is to turn user-space memory barriers
> > > >> >> into compiler barriers.
> > > >> >>
> > > >> >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> > > >> >> ebpf is using, I would against using membarrier for this. I would rather
> > > >> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> > > >> >> implementation details to user-space, *and* you can eventually change your
> > > >> >> RCU implementation for e.g. SRCU in the future if needed.
> > > >> >
> > > >> > The point about future changes to underlying bpf mechanisms is valid.
> > > >> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> > > >> > that currently lasts the whole prog. We will have new prog types that won't
> > > >> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > > >> > when necessary.
> > > >> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> > > >> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > > >> > also won't make sense for the same reason.
> > > >> > What we can do instead is to define a synchronization barrier for
> > > >> > programs accessing maps. Maybe call it something like
> > > >> > BPF_SYNC_MAP_ACCESS ?
> > > >>
> > > >> I'm not sure what you're proposing. In the case the commit message
> > > >> describes, a user-space program that wants to "drain" a map needs to
> > > >> be confident that the map won't change under it, even across multiple
> > > >> bpf system calls on that map. One way of doing that is to ensure that
> > > >> nothing that could possibly hold a reference to that map is still
> > > >> running. Are you proposing some kind of refcount-draining approach?
> > > >> Simple locking won't work, since BPF programs can't block, and I don't
> > > >> see right now how a simple barrier would help.
> > > >
> > > > I'm proposing few changes for your patch:
> > > > s/BPF_SYNCHRONIZE/BPF_SYNC_MAP_ACCESS/
> > > > and s/synchronize_sched/synchronize_rcu/
> > > > with detailed comment in uapi/bpf.h that has an example why folks
> > > > would want to use this new cmd.
> > >
> > > Thanks for clarifying.
> > >
> > > > I think the bpf maps will be rcu protected for foreseeable future
> > > > even when rcu_read_lock/unlock will be done by the programs instead of
> > > > kernel wrappers.
> > >
> > > Can we guarantee that we always obtain a map reference and dispose of
> > > that reference inside the same critical section?
> >
> > yep. the verifier will guarantee that.
> >
> > > If so, can BPF
> > > programs then disable preemption for as long as they'd like?
> >
> > you mean after the finish? no. only while running.
> > The verifier will match things like lookup/release, lock/unlock, preempt on/off
> > and will make sure there is no dangling preempt disable after program returns.
> >
> It might be a silly question, but does this new bpf program type
> provide a way to customize where to hold the rcu_lock/unlock and
> enable/disable preemption when creating the program? The
> BPF_SYNC_MAP_ACCESS cmd sounds not very useful if it only protects each
> single map access instead of the whole program. For example, sometimes
> we have to read the same map multiple times when running an eBPF
> program, and if userspace changed the map value between two reads,
> it may cause trouble. It would be great if we could hold an RCU lock over
> a block of eBPF program, and in that way I think the BPF_SYNC_MAP_ACCESS
> cmd should be enough to handle userspace/kernel racing problems.

we need to make sure we have a detailed description of BPF_SYNC_MAP_ACCESS
in uapi/bpf.h, since I feel the confusion regarding its usage is already starting.
This new cmd will only make sense for map-in-map type of maps.
Expecting that BPF_SYNC_MAP_ACCESS somehow implies the end of
the program or does some other map synchronization is not correct.
Commit log of this patch got it right:
"""
 For example, userspace can update a map->map entry to point to a new map,
 use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
 complete, and then drain the old map without fear that BPF programs
 may still be updating it.
"""
Now consider two cases:
1. a single program does two lookups (inner and outer map) and the program is called
  twice (triggered by two events, tail-called, or invoked through a prog_array)
2. a future single program without rcu wrappers does:
   rcu_lock + outer lookup + inner lookup + rcu_unlock + .. some other bpf stuff .. +
   rcu_lock + outer lookup + inner lookup + rcu_unlock + .. yet another bpf stuff ..
   All of the above as part of a single program that is called once.
These two cases are the same from the point of view of BPF_SYNC_MAP_ACCESS,
and completion of this cmd does not guarantee that the program has finished.
In both cases user space will be correct to start draining the inner map
it replaced after BPF_SYNC_MAP_ACCESS is done
(assuming the implementation does the synchronize_rcu() I mentioned earlier).

Preemption on/off, as done today by prog wrappers or in the future explicitly
by the prog code, is orthogonal to this new cmd and to map-in-map draining.



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-10 23:52                   ` Alexei Starovoitov
@ 2018-07-11  2:46                     ` Lorenzo Colitti
  2018-07-11  3:40                       ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Lorenzo Colitti @ 2018-07-11  2:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Chenbo Feng, dancol, mathieu.desnoyers, Joel Fernandes,
	Alexei Starovoitov, lkml, Tim Murray, Daniel Borkmann, netdev

On Wed, Jul 11, 2018 at 8:52 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> we need to make sure we have detailed description of BPF_SYNC_MAP_ACCESS
> in uapi/bpf.h, since I feel the confusion regarding its usage is starting already.
> This new cmd will only make sense for map-in-map type of maps.
> Expecting that BPF_SYNC_MAP_ACCESS is somehow implies the end of
> the program or doing some other map synchronization is not correct.
> Commit log of this patch got it right:
> """
>  For example, userspace can update a map->map entry to point to a new map,
>  use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
>  complete, and then drain the old map without fear that BPF programs
>  may still be updating it.
> """

+1 for detailed documentation. For example, consider what happens if
we have two map fds, one active and one standby, and a map-in-map with
one element that contains a pointer to the currently-active map fd.
The kernel program might do:

=====
const int current_map_key = 1;
void *current_map = bpf_map_lookup_elem(outer_map, &current_map_key);

int stats_key = 42;
uint64_t *stats_value = bpf_map_lookup_elem(current_map, &stats_key);
if (stats_value)
        __sync_fetch_and_add(stats_value, 1);
=====

If a userspace does:

1. Write new fd to outer_map[1].
2. Call BPF_SYNC_MAP_ACCESS.
3. Start deleting everything in the old map.

How can we guarantee that the __sync_fetch_and_add will not add to the
old map? If it does, we'll lose data. Will the verifier automatically
hold the RCU lock for as long as a pointer to an inner map is valid?


* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-11  2:46                     ` Lorenzo Colitti
@ 2018-07-11  3:40                       ` Alexei Starovoitov
  2018-07-14 18:18                         ` Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2018-07-11  3:40 UTC (permalink / raw)
  To: Lorenzo Colitti
  Cc: Chenbo Feng, dancol, mathieu.desnoyers, Joel Fernandes,
	Alexei Starovoitov, lkml, Tim Murray, Daniel Borkmann, netdev

On Wed, Jul 11, 2018 at 11:46:19AM +0900, Lorenzo Colitti wrote:
> On Wed, Jul 11, 2018 at 8:52 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > we need to make sure we have detailed description of BPF_SYNC_MAP_ACCESS
> > in uapi/bpf.h, since I feel the confusion regarding its usage is starting already.
> > This new cmd will only make sense for map-in-map type of maps.
> > Expecting that BPF_SYNC_MAP_ACCESS is somehow implies the end of
> > the program or doing some other map synchronization is not correct.
> > Commit log of this patch got it right:
> > """
> >  For example, userspace can update a map->map entry to point to a new map,
> >  use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >  complete, and then drain the old map without fear that BPF programs
> >  may still be updating it.
> > """
> 
> +1 for detailed documentation. For example, consider what happens if
> we have two map fds, one active and one standby, and a map-in-map with
> one element that contains a pointer to the currently-active map fd.

yes. that's exactly the use case that folks use.

> The kernel program might do:
> 
> =====
> const int current_map_key = 1;
> void *current_map = bpf_map_lookup_elem(outer_map, &current_map_key);
> 
> int stats_key = 42;
> uint64_t *stats_value = bpf_map_lookup_elem(current_map, &stats_key);
> if (stats_value)
>         __sync_fetch_and_add(stats_value, 1);
> =====
> 
> If a userspace does:
> 
> 1. Write new fd to outer_map[1].
> 2. Call BPF_SYNC_MAP_ACCESS.
> 3. Start deleting everything in the old map.
> 
> How can we guarantee that the __sync_fetch_and_add will not add to the
> old map? 

Without any changes to the kernel, sys_membarrier will work,
and that's what folks use already.
BPF_SYNC_MAP_ACCESS implemented via synchronize_rcu() will work
as well, both in the current implementation, where rcu_lock/unlock
is done outside of the program, and in the future, when
rcu_lock/unlock are called by the program itself.

> Will the verifier automatically
> hold the RCU lock for as long as a pointer to an inner map is valid?

the verifier will guarantee that a future explicit
lock/unlock by the program is equivalent to the current
implicit lock/unlock by the kernel.
The verifier will track that bpf_map_lookup_elem() is done
after rcu_lock and that the value returned by this helper is
not accessed after rcu_unlock. Baby steps of dataflow analysis.



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-11  3:40                       ` Alexei Starovoitov
@ 2018-07-14 18:18                         ` Joel Fernandes
  2018-07-16 15:29                           ` Daniel Colascione
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2018-07-14 18:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Lorenzo Colitti, Chenbo Feng, dancol, mathieu.desnoyers,
	Joel Fernandes, Alexei Starovoitov, lkml, Tim Murray,
	Daniel Borkmann, netdev

On Tue, Jul 10, 2018 at 08:40:19PM -0700, Alexei Starovoitov wrote:
[..]
> > The kernel program might do:
> > 
> > =====
> > const int current_map_key = 1;
> > void *current_map = bpf_map_lookup_elem(outer_map, &current_map_key);
> > 
> > int stats_key = 42;
> > uint64_t *stats_value = bpf_map_lookup_elem(current_map, &stats_key);
> > if (stats_value)
> >         __sync_fetch_and_add(stats_value, 1);
> > =====
> > 
> > If a userspace does:
> > 
> > 1. Write new fd to outer_map[1].
> > 2. Call BPF_SYNC_MAP_ACCESS.
> > 3. Start deleting everything in the old map.
> > 
> > How can we guarantee that the __sync_fetch_and_add will not add to the
> > old map? 
> 
> without any changes to the kernel sys_membarrier will work.
> And that's what folks use already.
> BPF_SYNC_MAP_ACCESS implemented via synchronize_rcu() will work
> as well whether in the current implementation where rcu_lock/unlock
> is done outside of the program and in the future when
> rcu_lock/unlock are called by the program itself.

Cool, Alexei and Lorenzo, sounds great to me. Daniel, want to send a follow-up
patch with the BPF_SYNC_MAP_ACCESS changes then?

> > Will the verifier automatically
> > hold the RCU lock for as long as a pointer to an inner map is valid?
> 
> the verifier will guarantee the equivalency of future explicit
> lock/unlock by the program vs current situation of implicit
> lock/unlock by the kernel.
> The verifier will track that bpf_map_lookup_elem() is done
> after rcu_lock and that the value returned by this helper is
> not accessed after rcu_unlock. Baby steps of dataflow analysis.

Nice!

By the way, just curious: I was briefly going through kernel/bpf/arraymap.c.
How are you protecting against load-store tearing of array map values
during updates/lookups?

For example, if userspace reads an array map at a particular index while
another CPU is updating it, then userspace can read partial /
half-updated values, right? Since rcu_read_lock is in use, I was hoping to
find something like rcu_assign_pointer there to protect readers against
concurrent updates.  Thanks for any clarification.

Regards,

- Joel



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-14 18:18                         ` Joel Fernandes
@ 2018-07-16 15:29                           ` Daniel Colascione
  2018-07-16 20:23                             ` Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel Colascione @ 2018-07-16 15:29 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Alexei Starovoitov, Lorenzo Colitti, Chenbo Feng,
	Mathieu Desnoyers, Joel Fernandes, Alexei Starovoitov, lkml,
	Tim Murray, Daniel Borkmann, netdev

On Sat, Jul 14, 2018 at 11:18 AM, Joel Fernandes <joel@joelfernandes.org> wrote:
> On Tue, Jul 10, 2018 at 08:40:19PM -0700, Alexei Starovoitov wrote:
> [..]
>> > The kernel program might do:
>> >
>> > =====
>> > const int current_map_key = 1;
>> > void *current_map = bpf_map_lookup_elem(outer_map, &current_map_key);
>> >
>> > int stats_key = 42;
>> > uint64_t *stats_value = bpf_map_lookup_elem(current_map, &stats_key);
>> > if (stats_value)
>> >         __sync_fetch_and_add(stats_value, 1);
>> > =====
>> >
>> > If a userspace does:
>> >
>> > 1. Write new fd to outer_map[1].
>> > 2. Call BPF_SYNC_MAP_ACCESS.
>> > 3. Start deleting everything in the old map.
>> >
>> > How can we guarantee that the __sync_fetch_and_add will not add to the
>> > old map?
>>
>> without any changes to the kernel sys_membarrier will work.
>> And that's what folks use already.
>> BPF_SYNC_MAP_ACCESS implemented via synchronize_rcu() will work
>> as well whether in the current implementation where rcu_lock/unlock
>> is done outside of the program and in the future when
>> rcu_lock/unlock are called by the program itself.
>
> Cool Alexei and Lorenzo, sounds great to me. Daniel want to send a follow up
> patch with BPF_SYNC_MAP_ACCESS changes then?

Will do. Mind if I just mine this thread for the doc comment?


* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-16 15:29                           ` Daniel Colascione
@ 2018-07-16 20:23                             ` Joel Fernandes
  2018-07-26 16:51                               ` [PATCH v2] Add BPF_SYNCHRONIZE_MAPS " Daniel Colascione
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2018-07-16 20:23 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Alexei Starovoitov, Lorenzo Colitti, Chenbo Feng,
	Mathieu Desnoyers, Joel Fernandes, Alexei Starovoitov, lkml,
	Tim Murray, Daniel Borkmann, netdev

On Mon, Jul 16, 2018 at 08:29:47AM -0700, Daniel Colascione wrote:
> On Sat, Jul 14, 2018 at 11:18 AM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > On Tue, Jul 10, 2018 at 08:40:19PM -0700, Alexei Starovoitov wrote:
> > [..]
> >> > The kernel program might do:
> >> >
> >> > =====
> >> > const int current_map_key = 1;
> >> > void *current_map = bpf_map_lookup_elem(outer_map, &current_map_key);
> >> >
> >> > int stats_key = 42;
> >> > uint64_t *stats_value = bpf_map_lookup_elem(current_map, &stats_key);
> >> > if (stats_value)
> >> >         __sync_fetch_and_add(stats_value, 1);
> >> > =====
> >> >
> >> > If a userspace does:
> >> >
> >> > 1. Write new fd to outer_map[1].
> >> > 2. Call BPF_SYNC_MAP_ACCESS.
> >> > 3. Start deleting everything in the old map.
> >> >
> >> > How can we guarantee that the __sync_fetch_and_add will not add to the
> >> > old map?
> >>
> >> without any changes to the kernel sys_membarrier will work.
> >> And that's what folks use already.
> >> BPF_SYNC_MAP_ACCESS implemented via synchronize_rcu() will work
> >> as well whether in the current implementation where rcu_lock/unlock
> >> is done outside of the program and in the future when
> >> rcu_lock/unlock are called by the program itself.
> >
> > Cool Alexei and Lorenzo, sounds great to me. Daniel want to send a follow up
> > patch with BPF_SYNC_MAP_ACCESS changes then?
> 
> Will do. Mind if I just mine this thread for the doc comment?

Do you mean the changelog? Yes, I believe you could use the discussion in this
thread for the rationale, as Alexei described.

thanks,

- Joel




* [PATCH v2] Add BPF_SYNCHRONIZE_MAPS bpf(2) command
  2018-07-16 20:23                             ` Joel Fernandes
@ 2018-07-26 16:51                               ` Daniel Colascione
  0 siblings, 0 replies; 30+ messages in thread
From: Daniel Colascione @ 2018-07-26 16:51 UTC (permalink / raw)
  To: joelaf
  Cc: linux-kernel, timmurray, netdev, Alexei Starovoitov,
	Lorenzo Colitti, Chenbo Feng, Mathieu Desnoyers,
	Alexei Starovoitov, Daniel Borkmann, Daniel Colascione

BPF_SYNCHRONIZE_MAPS waits for the release of any references to a BPF
map made by a BPF program that is running at the time the
BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command is
to provide a means for userspace to replace a BPF map with another,
newer version, then ensure that no component is still using the "old"
map before manipulating the "old" map in some way.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 include/uapi/linux/bpf.h |  9 +++++++++
 kernel/bpf/syscall.c     | 13 +++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b7db3261c62d..5b27e9117d3e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -75,6 +75,14 @@ struct bpf_lpm_trie_key {
 	__u8	data[0];	/* Arbitrary size */
 };
 
+/* BPF_SYNCHRONIZE_MAPS waits for the release of any references to a
+ * BPF map made by a BPF program that is running at the time the
+ * BPF_SYNCHRONIZE_MAPS command is issued. The purpose of this command
+ * is to provide a means for userspace to replace a BPF map with
+ * another, newer version, then ensure that no component is still
+ * using the "old" map before manipulating the "old" map in some way.
+ */
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
 	BPF_MAP_CREATE,
@@ -98,6 +106,7 @@ enum bpf_cmd {
 	BPF_BTF_LOAD,
 	BPF_BTF_GET_FD_BY_ID,
 	BPF_TASK_FD_QUERY,
+	BPF_SYNCHRONIZE_MAPS,
 };
 
 enum bpf_map_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a31a1ba0f8ea..8bbd1a5d01d1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2274,6 +2274,19 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	if (cmd == BPF_SYNCHRONIZE_MAPS) {
+		if (uattr != NULL || size != 0)
+			return -EINVAL;
+		err = security_bpf(cmd, NULL, 0);
+		if (err < 0)
+			return err;
+		/* BPF programs always enter a critical section while
+		 * they have a map reference outstanding.
+		 */
+		synchronize_rcu();
+		return 0;
+	}
+
 	err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
 	if (err)
 		return err;
-- 
2.18.0.233.g985f88cf7e-goog



* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
  2018-07-29 15:57 Alexei Starovoitov
@ 2018-07-30 22:26 ` Joel Fernandes
  0 siblings, 0 replies; 30+ messages in thread
From: Joel Fernandes @ 2018-07-30 22:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Colascione, Lorenzo Colitti, Chenbo Feng,
	Mathieu Desnoyers, Joel Fernandes, Alexei Starovoitov, lkml,
	Tim Murray, Daniel Borkmann, netdev

On Sun, Jul 29, 2018 at 06:57:06PM +0300, Alexei Starovoitov wrote:
> On Fri, Jul 27, 2018 at 10:17 PM, Daniel Colascione <dancol@google.com> wrote:
> > On Sat, Jul 14, 2018 at 11:18 AM, Joel Fernandes <joel@joelfernandes.org>
> > wrote:
> >>
> >> By the way just curious I was briefly going through kernel/bpf/arraymap.c.
> >> How are you protecting against load-store tearing of values of array map
> >> updates/lookups?
> >>
> >> For example, if userspace reads an array map at a particular index, while
> >> another CPU is updating it, then userspace can read partial values /
> >> half-updated values right? Since rcu_read_lock is in use, I was hoping to
> >> find something like rcu_assign_pointer there to protect readers against
> >> concurrent updates.  Thanks for any clarification.
> >
> >
> > I'm also curious about the answer to this question.
> 
> i'm not sure I understand the question.
> bpf_map_type_array is a C-like array.
> There is no locking of elements.
> If single program executing on two cpus
> and accesses the same value it will collide.
> Same goes for user space vs prog parallel access.
> bpf long_memcpy is an attempt to provide minimal
> level of automicity when values are aligned and
> size==long.

Thanks for the answer. I think you answered the question.

Regards,

- Joel


* Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
@ 2018-07-29 15:57 Alexei Starovoitov
  2018-07-30 22:26 ` Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2018-07-29 15:57 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Joel Fernandes, Lorenzo Colitti, Chenbo Feng, Mathieu Desnoyers,
	Joel Fernandes, Alexei Starovoitov, lkml, Tim Murray,
	Daniel Borkmann, netdev

On Fri, Jul 27, 2018 at 10:17 PM, Daniel Colascione <dancol@google.com> wrote:
> On Sat, Jul 14, 2018 at 11:18 AM, Joel Fernandes <joel@joelfernandes.org>
> wrote:
>>
>> By the way just curious I was briefly going through kernel/bpf/arraymap.c.
>> How are you protecting against load-store tearing of values of array map
>> updates/lookups?
>>
>> For example, if userspace reads an array map at a particular index, while
>> another CPU is updating it, then userspace can read partial values /
>> half-updated values right? Since rcu_read_lock is in use, I was hoping to
>> find something like rcu_assign_pointer there to protect readers against
>> concurrent updates.  Thanks for any clarification.
>
>
> I'm also curious about the answer to this question.

I'm not sure I understand the question.
bpf_map_type_array is a C-like array.
There is no locking of elements.
If a single program executes on two CPUs
and accesses the same value, it will collide.
The same goes for user space vs prog parallel access.
bpf long_memcpy is an attempt to provide a minimal
level of atomicity when values are aligned and
size==long.


Thread overview: 30+ messages
2018-07-07  1:56 [RFC] Add BPF_SYNCHRONIZE bpf(2) command Daniel Colascione
2018-07-07  2:54 ` Alexei Starovoitov
2018-07-07  3:22   ` Daniel Colascione
2018-07-07 20:33   ` Joel Fernandes
2018-07-08 20:54     ` Mathieu Desnoyers
2018-07-09 21:09       ` Alexei Starovoitov
2018-07-09 21:35         ` Mathieu Desnoyers
2018-07-09 22:19           ` Paul E. McKenney
2018-07-09 22:19             ` Alexei Starovoitov
2018-07-09 22:48               ` Paul E. McKenney
2018-07-09 21:36         ` Daniel Colascione
2018-07-09 22:10           ` Alexei Starovoitov
2018-07-09 22:21             ` Daniel Colascione
2018-07-09 22:34               ` Alexei Starovoitov
2018-07-10  5:25                 ` Chenbo Feng
2018-07-10 23:52                   ` Alexei Starovoitov
2018-07-11  2:46                     ` Lorenzo Colitti
2018-07-11  3:40                       ` Alexei Starovoitov
2018-07-14 18:18                         ` Joel Fernandes
2018-07-16 15:29                           ` Daniel Colascione
2018-07-16 20:23                             ` Joel Fernandes
2018-07-26 16:51                               ` [PATCH v2] Add BPF_SYNCHRONIZE_MAPS " Daniel Colascione
2018-07-10  5:13       ` [RFC] Add BPF_SYNCHRONIZE " Joel Fernandes
2018-07-10 16:42         ` Paul E. McKenney
2018-07-10 16:57           ` Joel Fernandes
2018-07-10 17:12             ` Paul E. McKenney
2018-07-10 17:29               ` Joel Fernandes
2018-07-10 17:42                 ` Paul E. McKenney
2018-07-29 15:57 Alexei Starovoitov
2018-07-30 22:26 ` Joel Fernandes
