Re: [PATCH bpf-next 0/9] xdp: Support multiple programs on a single interface through chain calls - Toke Høiland-Jørgensen

From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Lorenz Bauer <lmb@cloudflare.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>,
	Alexei Starovoitov <ast@kernel.org>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	Marek Majkowski <marek@cloudflare.com>,
	David Miller <davem@davemloft.net>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>
Subject: Re: [PATCH bpf-next 0/9] xdp: Support multiple programs on a single interface through chain calls
Date: Wed, 02 Oct 2019 20:54:31 +0200	[thread overview]
Message-ID: <87zhijq8pk.fsf@toke.dk> (raw)
In-Reply-To: <CACAyw9860eDGU9meO0wQ82OgWNPv3LXAQqrJNf-mQFA0yu7rWQ@mail.gmail.com>

Lorenz Bauer <lmb@cloudflare.com> writes:

> On Wed, 2 Oct 2019 at 14:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> This series adds support for executing multiple XDP programs on a single
>> interface in sequence, through the use of chain calls, as discussed at the Linux
>> Plumbers Conference last month:
>
> Hi Toke,
>
> Thanks for posting the patch set! As I mentioned, this is a big pain
> point for us as well. Right now, all of our different XDP components
> are managed by a single daemon called (drumroll) xdpd. We'd like to
> separate this out into individual bits and pieces that operate
> separately from each other.

Yes, I'm counting on you guys to be one of the potential users and help
me iron out the API and semantics! ;)

> I've looked at the kernel side of your patch set, here are my thoughts:
>
>> # HIGH-LEVEL IDEA
>>
>> The basic idea is to express the chain call sequence through a special map type,
>> which contains a mapping from a (program, return code) tuple to another program
>> to run in next in the sequence. Userspace can populate this map to express
>> arbitrary call sequences, and update the sequence by updating or replacing the
>> map.
>
> How do you imagine this would work in practice? From what I can tell,
> the map is keyed by program id, which makes it difficult to reliably
> construct the control flow that I want.

The way I've been picturing this would work, is that programs would
cooperatively manage this by updating the data structures to insert and
remove themselves as needed. I haven't actually fleshed out this in code
yet (that's next on my agenda), but I imagine it would be something like
this:

> As an example, I'd like to split the daemon into two parts, A and B,
> which I want to be completely independent. So:
>
> - A runs before B, if both are active
> - If A is not active, B is called first
> - If B is not active, only A is called
>
> Both A and B may at any point in time replace their current XDP
> program with a new one. This means that there is no stable program ID
> I can use.

They would have to cooperate. Either by a central management daemon, or
by using the chain call map as a data structure to coordinate.

If A switches out its program, it would need to first load the new
version, then update the map to replace the old ID with the new one.

> Another problem are deletions: if I delete A (because that component
> is shut down) B will never be called, since the program ID that linked
> B into the control flow is gone. This means that B needs to know about
> A and vice versa.

Say both programs are running. If B shuts down, it just removes itself
from the map. If A shuts down, it also needs to remove itself from the
map, and move up all its "descendants" in the call sequence. So in this
case, A would look up itself in the chain call map, find the next
program in the sequence, and install that as the new program on the
interface.

The operations are kinda like linked list manipulations. I had some
examples in my slides at LPC:

- Simple updates: *linked-list like* operations (map stays the same)
# Insert after id 3
  --> id = load(prog.o);
  --> map_update(map, {3, PASS}, id) # atomic update
# Insert before id 2
  --> id = load(prog.o);
  --> map_update(map, {id, PASS}, 2); # no effect on chain sequence
  --> map_update(map, {1, PASS}, id); # atomic update

- More complex operations: /*replace the whole thing*/

# Replace ID 3 with new program
  --> id = load(prog.o); map = new_map();
  --> map_update(map, {1, PASS}, 2);
  --> map_update(map, {1, TX}, id);
  --> map_update(map, {2, PASS}, id);
  --> xdp_attach(eth0, 1, map, FORCE); # atomic replace

The tricky part is avoiding read-modify-update races. I was imagining
that we could provide "cmpxchg-like" semantics by adding an "expected
old value" to the netlink load and/or to map updates. The kernel could
then ensure that the update fails if the actual value being replaced has
changed since userspace read it, in which case userspace could retry the
whole operation.

>> The actual execution of the program sequence is done in
>> bpf_prog_run_xdp(), which will lookup the chain sequence map, and if
>> found, will loop through calls to BPF_PROG_RUN, looking up the next
>> XDP program in the sequence based on the previous program ID and
>> return code.
>
> I think that the tail call chain is useful for other eBPF programs as
> well. How hard is it to turn the logic in bpf_prog_run_xdp into a
> helper instead?

Heh. In principle, I guess we could move the logic *into* BPF_PROG_RUN
instead of wrapping around it. The tricky part would be figuring out
where to stash the map itself.

One way to do this, which your other question about why both prog fd and
map fd was needed made me think about, is to make programs and sequence
maps interchangeable *everywhere*. I.e., every time we attach an eBPF
program anywhere, we pass an fd to that program. So from a UAPI PoV, we
could just extend that so all attach points would accept *either* a
program fd *or* a chain call map fd, and then do the appropriate thing.
This is quite intrusive, though, so let's leave that aside from now and
maybe come back to it in the future if this turns out to be a huge
success :)

-Toke