Re: bpf memory model. Was: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

From: Will Deacon <will.deacon@arm.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Alexei Starovoitov <ast@kernel.org>,
	davem@davemloft.net, daniel@iogearbox.net,
	jakub.kicinski@netronome.com, netdev@vger.kernel.org,
	kernel-team@fb.com, mingo@redhat.com,
	Paul McKenney <paulmck@linux.vnet.ibm.com>,
	jannh@google.com
Subject: Re: bpf memory model. Was: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock
Date: Wed, 30 Jan 2019 18:11:00 +0000
Message-ID: <20190130181100.GA18558@fuggles.cambridge.arm.com>
In-Reply-To: <20190128215623.6eqskzhklydhympa@ast-mbp>

Hi Alexei,

On Mon, Jan 28, 2019 at 01:56:24PM -0800, Alexei Starovoitov wrote:
> On Mon, Jan 28, 2019 at 10:24:08AM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 25, 2019 at 04:17:26PM -0800, Alexei Starovoitov wrote:
> > > What I want to avoid is to define the whole execution ordering model upfront.
> > > We cannot say that BPF ISA is weakly ordered like alpha.
> > > Most of the bpf progs are written and running on x86. We shouldn't
> > > twist bpf developer's arm by artificially relaxing memory model.
> > > BPF memory model is equal to memory model of underlying architecture.
> > > What we can do is to make bpf progs a bit more portable with
> > > smp_rmb instructions, but we must not force weak execution on the developer.
> > 
> > Well, I agree with only introducing bits you actually need, and my
> > smp_rmb() example might have been poorly chosen, smp_load_acquire() /
> > smp_store_release() might have been a far more useful example.
> > 
> > But I disagree with the last part; we have to pick a model now;
> > otherwise you'll paint yourself into a corner.
> > 
> > Also; Alpha isn't very relevant these days; however ARM64 does seem to
> > be gaining a lot of attention and that is very much a weak architecture.
> > Adding strongly ordered assumptions to BPF now, will penalize them in
> > the long run.
> 
> arm64 is gaining attention just like riscV is gaining it too.
> BPF jit for arm64 is very solid, while BPF jit for riscV is being worked on.
> BPF is not picking sides in CPU HW and ISA battles.

It's not about picking a side, it's about providing an abstraction of the
various CPU architectures out there so that the programmer doesn't need to
worry about where their program may run. Hell, even if you just said "eBPF
follows x86 semantics" that would be better than saying nothing (and then we
could have a discussion about whether x86 semantics are really what you
want).

> Memory model is CPU HW design decision. BPF ISA cannot dictate HW design.
> We're not saying today that BPF is strongly ordered.
> BPF load/stores are behaving differently on x86 vs arm64.
> We can add new instructions, but we cannot 'define' how load/stores behave
> from memory model perspective.
> For example, take atomicity of single byte load/store.
> Not all archs have them atomic, but we cannot say to bpf programmers
> to always assume non-atomic byte loads.

Hmm, I don't think this is a good approach to take for the future of eBPF.
Assuming that a desirable property of an eBPF program is portability between
CPU architectures, then you're effectively forcing the programmer to "assume
the worst", where the worst is almost certainly unusable for practical
purposes.

One easy thing you could consider would be to allow tagging of an eBPF
program with its supported target architectures (the JIT will refuse to
accept it for other architectures). This would at least remove the
illusion of portability and force the programmer to be explicit.
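
To make the tagging idea concrete, here is a minimal sketch of the check a
loader or JIT could perform. Everything in it is hypothetical: the
prog_arch bits, struct prog_desc and prog_arch_allowed() are names made up
for illustration and are not part of any existing BPF interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical architecture bits; not part of any real BPF UAPI. */
enum prog_arch {
	PROG_ARCH_X86_64 = 1u << 0,
	PROG_ARCH_ARM64  = 1u << 1,
	PROG_ARCH_RISCV  = 1u << 2,
};

/* Hypothetical descriptor: the set of architectures the author targeted. */
struct prog_desc {
	uint32_t supported_archs;
};

/* Returns true if a JIT for 'host' may accept the program, false to reject. */
static bool prog_arch_allowed(const struct prog_desc *prog, enum prog_arch host)
{
	return (prog->supported_archs & (uint32_t)host) != 0;
}
```

A program tagged for x86-64 and arm64 only would be accepted there and
rejected by a riscv JIT, so the lack of portability is at least explicit.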

However, I think we'd be much better off if we defined some basic ordering
primitives such as relaxed and RCpc-style acquire/release operations
(including atomic RmW), single-copy atomic memory accesses up to the native
machine size and a full-fence instruction. If your program uses something
that the underlying arch doesn't support, then it is rejected (e.g. 64-bit
READ_ONCE on a 32-bit arch).
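
As a rough illustration of that set of primitives, the sketch below uses
C11 <stdatomic.h> as a stand-in: a release store paired with an acquire
load to publish data, a relaxed atomic RmW for a counter, and a full fence.
This is only an analogy for the kind of operations meant here, not a
proposed BPF encoding, and struct mailbox and its helpers are hypothetical
names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared object used only to illustrate the primitives. */
struct mailbox {
	uint64_t payload;   /* plain data, published via 'ready' */
	atomic_int ready;   /* written with release, read with acquire */
	atomic_ulong hits;  /* updated with a relaxed atomic RmW */
};

/* Producer: write the data, then publish it with a release store. */
static void mailbox_publish(struct mailbox *mb, uint64_t value)
{
	mb->payload = value;
	atomic_store_explicit(&mb->ready, 1, memory_order_release);
}

/* Consumer: an acquire load that observes 'ready' also observes 'payload'. */
static bool mailbox_consume(struct mailbox *mb, uint64_t *out)
{
	if (!atomic_load_explicit(&mb->ready, memory_order_acquire))
		return false;
	*out = mb->payload;
	/* Relaxed RmW: atomic, but imposes no ordering on other accesses. */
	atomic_fetch_add_explicit(&mb->hits, 1, memory_order_relaxed);
	return true;
}

/* Full fence: C11 sequentially consistent fence as the nearest analogue. */
static void full_fence(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}
```

The point of the sketch is that the programmer states the ordering they
need at each access, and each JIT picks the cheapest instruction sequence
that provides it on its own architecture.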

That should map straightforwardly to all modern architectures and allow for
efficient codegen on x86 and arm64. It would probably require a bunch of new
BPF instructions that would be defined to be atomic (you already have XADD
as a relaxed atomic add).

Apologies if this sounds patronising, but I'm keen to help figure out the
semantics *now* so that we don't end up having to infer them later on, which
is the usual painful case for memory models. I suspect Peter and Paul would
also prefer to attack it that way around. I appreciate that the temptation
is to avoid the problem by deferring to the underlying hardware memory
model, but I think that will create more problems than it solves and we're
here to help you get this right.

Will