Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface - Toke Høiland-Jørgensen

From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Andrey Ignatov <rdna@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>, Andrii Nakryiko <andriin@fb.com>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	John Fastabend <john.fastabend@gmail.com>,
	Lorenz Bauer <lmb@cloudflare.com>,
	netdev@vger.kernel.org, bpf@vger.kernel.org, ctakshak@fb.com
Subject: Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
Date: Sat, 29 Feb 2020 11:36:31 +0100	[thread overview]
Message-ID: <87v9npu1cg.fsf@toke.dk> (raw)
In-Reply-To: <20200228221519.GE51456@rdna-mbp>

Andrey Ignatov <rdna@fb.com> writes:

> The main challenges I see for applying this approach in fb case is the
> need to recreate the dispatcher every time a new program has to be
> added.
>
> Imagine there there are a few containers and every container wants to
> run an application that attaches XDP program to the "dispatcher" via
> freplace. Every application may have a "priority" reserved for it, but
> recreating the dispatcher may have race condition, for example:

Yeah, I did realise this is potentially racy, but so is any loading of
XDP programs right now (i.e., two applications can both try loading a
single XDP program at the same time, and end up stomping on each others'
feet). So we'll need to solve that in any case. I've managed to come up
with two possible ways to solve this:

1. Locking: Make it possible for a process to temporarily lock the
XDP program loaded onto an interface so no other program can modify it
until the lock is released.

2. A cmpxchg operation: Add a new field to the XDP load netlink message
containing an fd of the old program that the load call is expecting to
replace. I.e., instead of attach(ifindex, prog_fd, flags), you have
attach(ifindex, prog_fd, old_fd, flags). The kernel can then check that
the old_fd matches the program currently loaded before replacing
anything, and reject the operation otherwise.

With either of these mechanisms it should be possible for userspace to
do the right thing if the kernel state changes underneath it. I'm
leaning towards (2) because I think it is simpler to implement and
doesn't require any new state be kept in the kernel. The drawback is
that it may lead to a lot of retries if many processes are trying to
load their programs at the same time. Some data would be good here: How
often do you expect programs to be loaded/unloaded in your use case?

As for your other suggestion:

> Also I see at least one other way to do it w/o regenerating dispatcher
> every time:
>
> It can be created and attached once with "big enough" number of slots,
> for example with 100 and programs may use use their corresponding slot
> to freplace w/o regenerating the dispatcher. Having those big number of
> no-op slots should not be a big deal from what I understand and kernel
> can optimize it.

I thought about having the dispatcher stay around for longer, and just
replacing more function slots as new programs are added/removed. The
reason I didn't go with this is the following: Modifying the dispatcher
while it is loaded means that the modifications will apply to traffic on
the interface immediately. This is fine for simple add/remove of a
single program, but it limits which operations you can do atomically.
E.g., you can't switch the order of two programs, or add or remove more
than one, in a way that is atomic from the PoV of the traffic on the
interface.

Since I expect that we will need to support atomic operations even for
these more complex cases, that means we'll need to support rebuilding
the dispatcher anyway, and solving the race condition problem for that.
And once we've done that, the simple add/remove in the existing
dispatcher becomes just an additional code path that we'll need to
maintain, so why bother? :)

I am also not sure it's as simple as you say for the kernel to optimise
a more complex dispatcher: The current dead code elimination relies on
map data being frozen at verification time, so it's not applicable to
optimising a dispatcher as it is being changed later. Now, this could
probably be fixed and/or we could try doing clever tricks with the flow
control in the dispatcher program itself. But again, why bother if we
have to support the dispatcher rebuild mode of operation anyway?

I may have missed something, of course, so feel free to point out if you
see anything wrong with my reasoning above!

> This is the main thing so far, I'll likely provide more feedback when
> have some more time to read the code ..

Sounds good! You're already being very helpful, so thank you! :)

-Toke