All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>,
	davem@davemloft.net, daniel@iogearbox.net,
	jakub.kicinski@netronome.com, netdev@vger.kernel.org,
	kernel-team@fb.com, mingo@redhat.com, will.deacon@arm.com,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	jannh@google.com
Subject: Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock
Date: Thu, 24 Jan 2019 15:58:59 -0800	[thread overview]
Message-ID: <20190124235857.xyb5xx2ufr6x5mbt@ast-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <20190124180109.GA27771@hirez.programming.kicks-ass.net>

On Thu, Jan 24, 2019 at 07:01:09PM +0100, Peter Zijlstra wrote:
> 
> Thanks for having kernel/locking people on Cc...
> 
> On Wed, Jan 23, 2019 at 08:13:55PM -0800, Alexei Starovoitov wrote:
> 
> > Implementation details:
> > - on !SMP bpf_spin_lock() becomes nop
> 
> Because no BPF program is preemptible? I don't see any assertions or
> even a comment that says this code is non-preemptible.
> 
> AFAICT some of the BPF_RUN_PROG things are under rcu_read_lock() only,
> which is not sufficient.

nope. all bpf prog types disable preemption. That is must have for all
sorts of things to work properly.
If there is a prog type that doing rcu_read_lock only it's a serious bug.
About a year or so ago we audited everything specifically to make
sure everything disables preemption before calling bpf progs.
I'm pretty sure nothing crept in in the mean time.

> > - on architectures that don't support queued_spin_lock trivial lock is used.
> >   Note that arch_spin_lock cannot be used, since not all archs agree that
> >   zero == unlocked and sizeof(arch_spinlock_t) != sizeof(__u32).
> 
> I really don't much like direct usage of qspinlock; esp. not as a
> surprise.
> 
> Why does it matter if 0 means unlocked; that's what
> __ARCH_SPIN_LOCK_UNLOCKED is for.
> 
> I get the sizeof(__u32) thing, but why not key off of that?

what do you mean by 'key off of that' ?
to use arch_spinlock_t instead of qspinlock ?
That was my first attempt, but then I painfully found that
its size on parisc is 16 bytes and we're not going to penalize bpf
to waste that much space because of single architecture.
sizeof(arch_spinlock_t) can be 1 byte too (on sparc).
That would fit in __u32, but I figured it's cleaner to use qspinlock
on all archs that support it and dumb_spin_lock on archs that dont.

Another option is use to arch_spinlock_t when its sizeof==4
and use dumb_spin_lock otherwise.
It's doable, but imo still less clean than using qspinlock
due to zero init. Since zero init is a lot less map work
that zero inits all elements already.

If arch_spinlock_t is used than at map init time we would need to
walk all elements and do __ARCH_SPIN_LOCK_UNLOCKED assignment
(and maps can have millions of elements).
Not horrible, but 100% waste of cycles for x86/arm64 where qspinlock
is used. Such waste can be workaround further by doing ugly
#idef __ARCH_SPIN_LOCK_UNLOCKED == 0 -> don't do init loop.
And then add another #ifdef for archs with sizeof(arch_spinlock_t)!=4
to keep zero init for all map types that support bpf_spin_lock
via dumb_spin_lock.
Clearly at that point we're getting into ugliness everywhere.
Hence I've used qspinlock directly.

> > Next steps:
> > - allow bpf_spin_lock in other map types (like cgroup local storage)
> > - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
> >   to request kernel to grab bpf_spin_lock before rewriting the value.
> >   That will serialize access to map elements.
> 
> So clearly this map stuff is shared between bpf proglets, otherwise
> there would not be a need for locking. But what happens if one is from
> task context and another from IRQ context?
> 
> I don't see a local_irq_save()/restore() anywhere. What avoids the
> trivial lock inversion?

> and from NMI ...

progs are not preemptable and map syscall accessors have bpf_prog_active counters.
So nmi/kprobe progs will not be running when syscall is running.
Hence dead lock is not possible and irq_save is not needed.

> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index a74972b07e74..2e98e4caf5aa 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -221,6 +221,63 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
> >  	.arg2_type	= ARG_CONST_SIZE,
> >  };
> >  
> > +#ifndef CONFIG_QUEUED_SPINLOCKS
> > +struct dumb_spin_lock {
> > +	atomic_t val;
> > +};
> > +#endif
> > +
> > +notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
> > +{
> > +#if defined(CONFIG_SMP)
> > +#ifdef CONFIG_QUEUED_SPINLOCKS
> > +	struct qspinlock *qlock = (void *)lock;
> > +
> > +	BUILD_BUG_ON(sizeof(*qlock) != sizeof(*lock));
> > +	queued_spin_lock(qlock);
> > +#else
> > +	struct dumb_spin_lock *qlock = (void *)lock;
> > +
> > +	BUILD_BUG_ON(sizeof(*qlock) != sizeof(*lock));
> > +	do {
> > +		while (atomic_read(&qlock->val) != 0)
> > +			cpu_relax();
> > +	} while (atomic_cmpxchg(&qlock->val, 0, 1) != 0);
> > +#endif
> > +#endif
> > +	return 0;
> > +}
> > +
> > +const struct bpf_func_proto bpf_spin_lock_proto = {
> > +	.func		= bpf_spin_lock,
> > +	.gpl_only	= false,
> > +	.ret_type	= RET_VOID,
> > +	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
> > +};
> > +
> > +notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
> > +{
> > +#if defined(CONFIG_SMP)
> > +#ifdef CONFIG_QUEUED_SPINLOCKS
> > +	struct qspinlock *qlock = (void *)lock;
> > +
> > +	queued_spin_unlock(qlock);
> > +#else
> > +	struct dumb_spin_lock *qlock = (void *)lock;
> > +
> > +	atomic_set(&qlock->val, 0);
> 
> And this is broken... That should've been atomic_set_release() at the
> very least.

right. good catch.

> And this would again be the moment where I go pester you about the BPF
> memory model :-)

hehe :)
How do you propose to define it in a way that it applies to all archs
and yet doesn't penalize x86 ?
"Assume minimum execution ordering model" the way kernel does
unfortunately is not usable, since bpf doesn't have a luxury
of using nice #defines that convert into nops on x86.


  parent reply	other threads:[~2019-01-24 23:59 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-24  4:13 [PATCH v4 bpf-next 0/9] introduce bpf_spin_lock Alexei Starovoitov
2019-01-24  4:13 ` [PATCH v4 bpf-next 1/9] bpf: " Alexei Starovoitov
2019-01-24 18:01   ` Peter Zijlstra
2019-01-24 18:56     ` Peter Zijlstra
2019-01-24 23:42       ` Paul E. McKenney
2019-01-25  0:05         ` Alexei Starovoitov
2019-01-25  1:22           ` Paul E. McKenney
2019-01-25  1:46             ` Jann Horn
2019-01-25  2:38               ` Alexei Starovoitov
2019-01-25  4:27                 ` Alexei Starovoitov
2019-01-25  4:31                   ` Paul E. McKenney
2019-01-25  4:47                     ` Alexei Starovoitov
2019-01-25 16:02                       ` Paul E. McKenney
2019-01-25  4:11               ` Paul E. McKenney
2019-01-25 16:18                 ` Jann Horn
2019-01-25 22:51                   ` Paul E. McKenney
2019-01-25 23:44                     ` Alexei Starovoitov
2019-01-26  0:43                       ` Jann Horn
2019-01-26  0:59                         ` Jann Horn
2019-01-24 23:58     ` Alexei Starovoitov [this message]
2019-01-25  0:18       ` Jann Horn
2019-01-25  2:49         ` Alexei Starovoitov
2019-01-25  2:29       ` Eric Dumazet
2019-01-25  2:34         ` Alexei Starovoitov
2019-01-25  2:44           ` Eric Dumazet
2019-01-25  2:57             ` Alexei Starovoitov
2019-01-25  8:38               ` Peter Zijlstra
2019-01-25  9:10       ` Peter Zijlstra
2019-01-25 23:42         ` Alexei Starovoitov
2019-01-28  8:24           ` Peter Zijlstra
2019-01-28  8:31           ` Peter Zijlstra
2019-01-28  8:35             ` Peter Zijlstra
2019-01-28 20:49               ` Alexei Starovoitov
2019-01-28  8:43           ` Peter Zijlstra
2019-01-28 21:37             ` Alexei Starovoitov
2019-01-29  8:59               ` Peter Zijlstra
2019-01-30  2:20                 ` Alexei Starovoitov
2019-01-25  9:59       ` Peter Zijlstra
2019-01-25 10:09       ` Peter Zijlstra
2019-01-25 10:23       ` Peter Zijlstra
2019-01-26  0:17         ` bpf memory model. Was: " Alexei Starovoitov
2019-01-28  9:24           ` Peter Zijlstra
2019-01-28 21:56             ` Alexei Starovoitov
2019-01-29  9:16               ` Peter Zijlstra
2019-01-30  2:32                 ` Alexei Starovoitov
2019-01-30  8:58                   ` Peter Zijlstra
2019-01-30 19:36                     ` Alexei Starovoitov
2019-01-30 18:11               ` Will Deacon
2019-01-30 18:36                 ` Paul E. McKenney
2019-01-30 19:51                   ` Alexei Starovoitov
2019-01-30 21:05                     ` Paul E. McKenney
2019-01-30 22:57                       ` Alexei Starovoitov
2019-01-31 14:01                         ` Paul E. McKenney
2019-01-31 18:47                           ` Alexei Starovoitov
2019-02-01 14:05                             ` Paul E. McKenney
2019-01-30 19:50                 ` Alexei Starovoitov
2019-01-24  4:13 ` [PATCH v4 bpf-next 2/9] bpf: add support for bpf_spin_lock to cgroup local storage Alexei Starovoitov
2019-01-24  4:13 ` [PATCH v4 bpf-next 3/9] tools/bpf: sync include/uapi/linux/bpf.h Alexei Starovoitov
2019-01-24  4:13 ` [PATCH v4 bpf-next 4/9] selftests/bpf: add bpf_spin_lock tests Alexei Starovoitov
2019-01-24  4:13 ` [PATCH v4 bpf-next 5/9] selftests/bpf: add bpf_spin_lock C test Alexei Starovoitov
2019-01-24  4:14 ` [PATCH v4 bpf-next 6/9] bpf: introduce BPF_F_LOCK flag Alexei Starovoitov
2019-01-24  4:14 ` [PATCH v4 bpf-next 7/9] tools/bpf: sync uapi/bpf.h Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190124235857.xyb5xx2ufr6x5mbt@ast-mbp.dhcp.thefacebook.com \
    --to=alexei.starovoitov@gmail.com \
    --cc=ast@kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=jakub.kicinski@netronome.com \
    --cc=jannh@google.com \
    --cc=kernel-team@fb.com \
    --cc=mingo@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.