Re: [PATCH net-next v2] samples/bpf: fixup some tools to be able to support xdp multibuffer - Toke Høiland-Jørgensen

From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Cc: Tariq Toukan <ttoukan.linux@gmail.com>,
	Lorenzo Bianconi <lorenzo@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>,
	Andy Gospodarek <andrew.gospodarek@broadcom.com>,
	ast@kernel.org, daniel@iogearbox.net, davem@davemloft.net,
	hawk@kernel.org, john.fastabend@gmail.com, andrii@kernel.org,
	kafai@fb.com, songliubraving@fb.com, yhs@fb.com,
	kpsingh@kernel.org, lorenzo.bianconi@redhat.com,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	gal@nvidia.com, Saeed Mahameed <saeedm@nvidia.com>,
	tariqt@nvidia.com
Subject: Re: [PATCH net-next v2] samples/bpf: fixup some tools to be able to support xdp multibuffer
Date: Thu, 05 Jan 2023 23:07:42 +0100	[thread overview]
Message-ID: <87v8lkzlch.fsf@toke.dk> (raw)
In-Reply-To: <Y7cBfE7GpX04EI97@C02YVCJELVCG.dhcp.broadcom.net>

Andy Gospodarek <andrew.gospodarek@broadcom.com> writes:

> On Thu, Jan 05, 2023 at 04:43:28PM +0100, Toke Høiland-Jørgensen wrote:
>> Tariq Toukan <ttoukan.linux@gmail.com> writes:
>> 
>> > On 04/01/2023 14:28, Toke Høiland-Jørgensen wrote:
>> >> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>> >> 
>> >>>> On Tue, 03 Jan 2023 16:19:49 +0100 Toke Høiland-Jørgensen wrote:
>> >>>>> Hmm, good question! I don't think we've ever explicitly documented any
>> >>>>> assumptions one way or the other. My own mental model has certainly
>> >>>>> always assumed the first frag would continue to be the same size as in
>> >>>>> non-multi-buf packets.
>> >>>>
>> >>>> Interesting! :) My mental model was closer to GRO by frags
>> >>>> so the linear part would have no data, just headers.
>> >>>
>> >>> That is assumption as well.
>> >> 
>> >> Right, okay, so how many headers? Only Ethernet, or all the way up to
>> >> L4 (TCP/UDP)?
>> >> 
>> >> I do seem to recall a discussion around the header/data split for TCP
>> >> specifically, but I think I mentally put that down as "something people
>> >> may way to do at some point in the future", which is why it hasn't made
>> >> it into my own mental model (yet?) :)
>> >> 
>> >> -Toke
>> >> 
>> >
>> > I don't think that all the different GRO layers assume having their 
>> > headers/data in the linear part. IMO they will just perform better if 
>> > these parts are already there. Otherwise, the GRO flow manages, and 
>> > pulls the needed amount into the linear part.
>> > As examples, see calls to gro_pull_from_frag0 in net/core/gro.c, and the 
>> > call to pskb_may_pull() from skb_gro_header_slow().
>> >
>> > This resembles the bpf_xdp_load_bytes() API used here in the xdp prog.
>> 
>> Right, but that is kernel code; what we end up doing with the API here
>> affects how many programs need to make significant changes to work with
>> multibuf, and how many can just set the frags flag and continue working.
>> Which also has a performance impact, see below.
>> 
>> > The context of my questions is that I'm looking for the right memory 
>> > scheme for adding xdp-mb support to mlx5e striding RQ.
>> > In striding RQ, the RX buffer consists of "strides" of a fixed size set 
>> > by pthe driver. An incoming packet is written to the buffer starting from 
>> > the beginning of the next available stride, consuming as much strides as 
>> > needed.
>> >
>> > Due to the need for headroom and tailroom, there's no easy way of 
>> > building the xdp_buf in place (around the packet), so it should go to a 
>> > side buffer.
>> >
>> > By using 0-length linear part in a side buffer, I can address two 
>> > challenging issues: (1) save the in-driver headers memcpy (copy might 
>> > still exist in the xdp program though), and (2) conform to the 
>> > "fragments of the same size" requirement/assumption in xdp-mb. 
>> > Otherwise, if we pull from frag[0] into the linear part, frag[0] becomes 
>> > smaller than the next fragments.
>> 
>> Right, I see.
>> 
>> So my main concern would be that if we "allow" this, the only way to
>> write an interoperable XDP program will be to use bpf_xdp_load_bytes()
>> for every packet access. Which will be slower than DPA, so we may end up
>> inadvertently slowing down all of the XDP ecosystem, because no one is
>> going to bother with writing two versions of their programs. Whereas if
>> you can rely on packet headers always being in the linear part, you can
>> write a lot of the "look at headers and make a decision" type programs
>> using just DPA, and they'll work for multibuf as well.
>
> The question I would have is what is really the 'slow down' for
> bpf_xdp_load_bytes() vs DPA?  I know you and Jesper can tell me how many
> instructions each use. :)

I can try running some benchmarks to compare the two, sure!

-Toke