Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Andrey Ignatov <rdna@fb.com>
Cc: "Toke Høiland-Jørgensen" <toke@redhat.com>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Martin KaFai Lau" <kafai@fb.com>,
	"Song Liu" <songliubraving@fb.com>, "Yonghong Song" <yhs@fb.com>,
	"Andrii Nakryiko" <andriin@fb.com>,
	"David S. Miller" <davem@davemloft.net>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Jesper Dangaard Brouer" <brouer@redhat.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"Lorenz Bauer" <lmb@cloudflare.com>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
	ctakshak@fb.com
Subject: Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
Date: Tue, 3 Mar 2020 11:53:49 -0800	[thread overview]
Message-ID: <CAEf4BzaC=dVV-vc-KRYmErdSOLo94LGTngzu6wfs-jFLLuPcsA@mail.gmail.com> (raw)
In-Reply-To: <20200303010318.GB84713@rdna-mbp>

On Mon, Mar 2, 2020 at 5:03 PM Andrey Ignatov <rdna@fb.com> wrote:
>
> Toke Høiland-Jørgensen <toke@redhat.com> [Sat, 2020-02-29 02:36 -0800]:
> > Andrey Ignatov <rdna@fb.com> writes:
> >
> > > The main challenges I see for applying this approach in fb case is the
> > > need to recreate the dispatcher every time a new program has to be
> > > added.
> > >
> > > Imagine there there are a few containers and every container wants to
> > > run an application that attaches XDP program to the "dispatcher" via
> > > freplace. Every application may have a "priority" reserved for it, but
> > > recreating the dispatcher may have race condition, for example:
> >
> > Yeah, I did realise this is potentially racy, but so is any loading of
> > XDP programs right now (i.e., two applications can both try loading a
> > single XDP program at the same time, and end up stomping on each others'
> > feet). So we'll need to solve that in any case. I've managed to come up
> > with two possible ways to solve this:
> >
> > 1. Locking: Make it possible for a process to temporarily lock the
> > XDP program loaded onto an interface so no other program can modify it
> > until the lock is released.
> >
> > 2. A cmpxchg operation: Add a new field to the XDP load netlink message
> > containing an fd of the old program that the load call is expecting to
> > replace. I.e., instead of attach(ifindex, prog_fd, flags), you have
> > attach(ifindex, prog_fd, old_fd, flags). The kernel can then check that
> > the old_fd matches the program currently loaded before replacing
> > anything, and reject the operation otherwise.
> >
> > With either of these mechanisms it should be possible for userspace to
> > do the right thing if the kernel state changes underneath it. I'm
> > leaning towards (2) because I think it is simpler to implement and
> > doesn't require any new state be kept in the kernel.
>
> Yep, that will solve the race.
>
> (2) sounds good to me, in fact I did similar thing for cgroup-bpf in:
>
> 7dd68b3279f1 ("bpf: Support replacing cgroup-bpf program in MULTI mode")
>
> where user can pass replace_bpf_fd and BPF_F_REPLACE flag and it
> guarantees that the program, users wants, will be replaced, not a new
> program that was attached by somebody else just a moment ago.
>
>
> > The drawback is
> > that it may lead to a lot of retries if many processes are trying to
> > load their programs at the same time. Some data would be good here: How
> > often do you expect programs to be loaded/unloaded in your use case?
>
>
> In the case I mentioned it's more about having multiple applications
> that may start/restart at the same time, not about frequency. It'll be a
> few (one digit number) apps, what means having a few retries should be
> fine if "the old program doesn't exist" can be detected easily (e.g.
> ENOENT should work) not to do retry for errors that are obviously
> unrelated to the race condition.
>
>
> > As for your other suggestion:
> >
> > > Also I see at least one other way to do it w/o regenerating dispatcher
> > > every time:
> > >
> > > It can be created and attached once with "big enough" number of slots,
> > > for example with 100 and programs may use use their corresponding slot
> > > to freplace w/o regenerating the dispatcher. Having those big number of
> > > no-op slots should not be a big deal from what I understand and kernel
> > > can optimize it.
> >
> > I thought about having the dispatcher stay around for longer, and just
> > replacing more function slots as new programs are added/removed. The
> > reason I didn't go with this is the following: Modifying the dispatcher
> > while it is loaded means that the modifications will apply to traffic on
> > the interface immediately. This is fine for simple add/remove of a
> > single program, but it limits which operations you can do atomically.
> > E.g., you can't switch the order of two programs, or add or remove more
> > than one, in a way that is atomic from the PoV of the traffic on the
> > interface.
>
> Right, simple add/remove cases is the only ones I've seen so far since
> programs are usually more or less independent and they just should be
> chained properly w/o anything like "order of programs should be changed
> atomically" or "two programs must be enabled atomically".
>
> But okay, I can imagine that this may happen in the wild. In this case
> yes, full regeneration of the dispatcher looks like the option ..
>
>
> > Since I expect that we will need to support atomic operations even for
> > these more complex cases, that means we'll need to support rebuilding
> > the dispatcher anyway, and solving the race condition problem for that.
> > And once we've done that, the simple add/remove in the existing
> > dispatcher becomes just an additional code path that we'll need to
> > maintain, so why bother? :)
> >
> > I am also not sure it's as simple as you say for the kernel to optimise
> > a more complex dispatcher: The current dead code elimination relies on
> > map data being frozen at verification time, so it's not applicable to
> > optimising a dispatcher as it is being changed later. Now, this could
> > probably be fixed and/or we could try doing clever tricks with the flow
> > control in the dispatcher program itself. But again, why bother if we
> > have to support the dispatcher rebuild mode of operation anyway?
>
> Yeah, having the ability to regenerate the full dispatcher helps to
> avoid dealing with those no-ops programs.
>
> This kinda solves another problem of allocating positions in the list of
> noop_fun1, noop_func2, ..., noop_funcN, since the N is limited and
> keeping "enough space" between existing programs to be able to attach
> something else between them in the future can be challenging in general
> case.
>
> > I may have missed something, of course, so feel free to point out if you
> > see anything wrong with my reasoning above!
> >
> > > This is the main thing so far, I'll likely provide more feedback when
> > > have some more time to read the code ..
> >
> > Sounds good! You're already being very helpful, so thank you! :)
>
> I've spent more time reading the library and like the static global data
> idea that allows to "regenerate" dispatcher w/o actually recompiling it
> so that it can still be compiled once and distributed to all relevant
> hosts.  It simplifies a bunch of things discussed above.
>
> But this part in the "missing pieces":
>
> > > - There is no way to re-attach an already loaded program to another function;
> > >   this is needed for updating the call sequence: When a new program is loaded,
> > >   libxdp should get the existing list of component programs on the interface and
> > >   insert the new one into the chain in the appropriate place. To do this it
> > >   needs to build a new dispatcher and reattach all the old programs to it.
> > >   Ideally, this should be doable without detaching them from the old dispatcher;
> > >   that way, we can build the new dispatcher completely, and atomically replace
> > >   it on the interface by the usual XDP attach mechanism.
>
> seems to be "must-have", including the "Ideally" section, since IMO
> simply adding a new program should not interrupt what previously
> attached programs are doing.
>
> If there is a container A that attached progA to dispatcher some time
> ago, and then container B is regenerating dispatcher to add progB, that
> shouldn't stop progA from being executed even for short period of time
> since for some programs it's just no-go (e.g. if progA is a firewall and
> disabling it would mean allowing traffic that otherwise is denied).
>
> I'm not the one who can answer the question how hard would it be to
> support this kind of "re-attaching" on kernel side and curios myself. I
> do see though that current implementation of ext programs has a single
> (prog->aux->linked_prog, prog->aux->attach_btf_id) pair.
>
> Also it's not clear what to do with fd returned by
> bpf_tracing_prog_attach (whether it can be pined or not), e.g. if
> container A generated dispatcher with ext progA attached to it and got
> this "link" fd, but then dispatcher was regenerated and the progA was
> reattached to the new dispatcher, how to make sure that the "link" fd is
> still the right one and cleanup will happen when a process in container
> A closes the fd it has (or unpins corresponding file in bpf fs).

I just replied to Daniel on another thread (it's weird how we have
inter-related discussions on separate thread, but whatever..).
Basically, once you establish XDP bpf_link for XDP dispatcher, you
can't attach another bpf_link anymore. It will keep failing until
bpf_link goes away (close FD, unpin, etc). But, once you have
bpf_link, you should be able to replace underlying BPF program, as
long as you still own that link). This way it should be possible to
re-generate XDP dispatcher and safely switch it.

>
>
> --
> Andrey Ignatov