From: Alexander H Duyck <alexander.duyck@gmail.com>
To: Paolo Abeni <pabeni@redhat.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Cc: netdev <netdev@vger.kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Alexander Duyck <alexanderduyck@fb.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Greg Thelen <gthelen@google.com>
Subject: Re: [PATCH net] net: avoid 32 x truesize under-estimation for tiny skbs
Date: Thu, 08 Sep 2022 07:53:32 -0700
Message-ID: <cf264865cb1613c0e7acf4bbc1ed345533767822.camel@gmail.com>
In-Reply-To: <498a25e4f7ba4e21d688ca74f335b28cadcb3381.camel@redhat.com>

On Thu, 2022-09-08 at 13:00 +0200, Paolo Abeni wrote:
> On Wed, 2022-09-07 at 14:36 -0700, Alexander H Duyck wrote:
> > On Wed, 2022-09-07 at 22:19 +0200, Paolo Abeni wrote:
> > > What is outlined above will allow for 10 min-size frags per order-0
> > > page, as SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) ==
> > > 384. I'm not sure that anything will allocate such small frags.
> > > With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page.
> > 
> > That doesn't account for any headroom though. 
> 
> Yes, the 0-size data packet was just a theoretical example to show the
> really worst-case scenario.
> 
> > Most of the time you have
> > to reserve some space for headroom so that, if this buffer ends up
> > getting routed off somewhere to be tunneled, there is room for adding to
> > the header. I think the default ends up being NET_SKB_PAD, though many
> > NICs use larger values. So adding any data onto that will push you up
> > to a minimum of 512 per skb, with the first 64B reserved for header data.
> > 
> > With that said, it would probably put you in the range of 8 or fewer
> > skbs per page, assuming at least 1 byte for data:
> >   512 =	SKB_DATA_ALIGN(NET_SKB_PAD + 1) +
> > 	SKB_DATA_ALIGN(struct skb_shared_info)
> 
> In most builds, GRO_MAX_HEAD packets are even larger (should be 640)

Right, which is why I am thinking we may want to default to a 1K slice.
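
Just to put numbers on that (the constants below only restate the figures
already quoted in this thread, so treat them as illustrative rather than
exact for any particular config):

	/* Illustrative only: mirrors the figures quoted above. */
	#define EX_SHINFO	384	/* SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) */
	#define EX_MIN_HEAD	128	/* SKB_DATA_ALIGN(NET_SKB_PAD + 1) */
	#define EX_GRO_HEAD	256	/* SKB_DATA_ALIGN(GRO_MAX_HEAD) */

	/* minimal skb:      128 + 384 = 512 bytes
	 * GRO_MAX_HEAD skb: 256 + 384 = 640 bytes
	 * so a fixed 1K slice covers both, and an order-0 page
	 * yields 4 slices.
	 */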

> > > The maximum truesize underestimation in both cases will be lower than
> > > what we can get with the current code in the worst case (almost 32x
> > > AFAICS). 
> > > 
> > > Is the above schema safe enough, or should the requested size be
> > > artificially inflated to fit at most 4 allocations per order-0 page?
> > > Am I missing something else? Apart from omitting a good deal of testing
> > > in the above list ;)
> > 
> > If we are working with an order-0 page we may just want to split it up
> > into fixed 1K fragments and not bother with a variable pagecnt bias.
> > Doing that we would likely simplify this quite a bit and avoid having
> > to do as much page count manipulation, which could get expensive if we
> > are not getting many uses out of the page. An added advantage is that
> > we can get rid of the pagecnt_bias and just work based on the page
> > offset.
> > 
> > As such I am not sure the page frag cache would really be that good of
> > a fit, since we have quite a bit of overhead in maintaining the
> > pagecnt_bias, which assumes the page is a bit longer lived so that the
> > ratio of refcnt updates to pagecnt_bias updates is better.
> 
> I see. With the above schema there will be 4-6 frags per page. I'm
> wild-guessing that the pagecnt_bias optimization still gives some gain
> in that case, but I really should collect some data points.

As I recall one of the big advantages of the 32k page was that we were
reducing the atomic ops by nearly half. Essentially we did a
page_ref_add at the start and a page_ref_sub_and_test when we were out
of space. Whereas a single 4K allocation would be 2 atomic ops per
allocation, we were only averaging 1.13 per 2k slice. With the Intel
NICs I was able to get even closer to 1 since I was able to do the 2k
flip/flop setup and could get up to 64K uses off of a single page.
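
(Spelling out that 1.13, under the assumption of a 32K compound page
carved into sixteen 2K slices:)

	/* Rough accounting, assuming 16 x 2K slices per 32K page:
	 *  - 1 page_ref_add() when the page enters the cache
	 *  - 1 page_ref_sub_and_test() when the cache moves past it
	 *  - 1 reference drop per slice as each consumer frees its skb
	 * => (16 + 2) / 16 ~= 1.13 atomic ops per 2K slice, versus 2 per
	 *    allocation when every skb gets its own page.
	 */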

Then again, I am not sure how much of an atomic ops penalty there will be
for your use case. Where it came into play is that MMIO writes to the
device will block atomic ops until they can be completed, so in a device
driver atomic ops become very expensive and we want to batch them as
much as possible.

> If the pagecnt optimization should be dropped, it would probably be
> more straightforward to use/adapt 'page_frag' for the order-0 page
> allocator.

That would make sense. Basically we could get rid of the pagecnt bias
and add the fixed number of slices to the page count at allocation, so we
would just need to track the offset to decide when we need to allocate
a new page. In addition, if we are flushing the page when it is depleted
we don't have to mess with the pfmemalloc logic.
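
Something like the following is what I have in mind. This is only a
hand-waving sketch; the names, the fixed 1K slice size, and the structure
are made up for illustration, not taken from any existing code:

	#include <linux/gfp.h>
	#include <linux/mm.h>

	/* Hypothetical fixed-slice order-0 cache, illustration only. */
	struct slice_cache {
		struct page	*page;
		unsigned int	offset;		/* start of the next free slice */
	};

	#define SLICE_SIZE		1024
	#define SLICES_PER_PAGE		(PAGE_SIZE / SLICE_SIZE)

	static void *slice_alloc(struct slice_cache *sc, gfp_t gfp)
	{
		void *va;

		if (!sc->page || sc->offset >= PAGE_SIZE) {
			/* the old page, if any, frees itself once the
			 * remaining consumers drop their references; no
			 * pagecnt_bias or pfmemalloc state to unwind */
			sc->page = alloc_page(gfp);
			if (!sc->page)
				return NULL;
			/* take all the slice references up front; each
			 * consumer simply does put_page() on skb free */
			page_ref_add(sc->page, SLICES_PER_PAGE - 1);
			sc->offset = 0;
		}

		va = page_address(sc->page) + sc->offset;
		sc->offset += SLICE_SIZE;
		return va;
	}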

> BTW it's quite strange/confusing having two very similar APIs (page_frag
> and page_frag_cache) with very similar names and no references between
> them.

I'm not sure what you are getting at here. There are plenty of
references between them; they just aren't direct. I viewed it as being
more like the get_free_pages() type of logic, where you end up allocating
a page but what you get back is the address of the page fragment instead
of the page_frag struct.
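
For what it's worth, the two structs look roughly like this (simplified
from memory; the real definitions switch field widths based on config, so
don't take this as exact):

	/* roughly: caller-visible fragment state */
	struct page_frag {
		struct page	*page;
		__u32		offset;
		__u32		size;
	};

	/* roughly: the internal cache behind page_frag_alloc() */
	struct page_frag_cache {
		void		*va;
		__u16		offset;
		__u16		size;
		/* refs are taken in bulk so page->_refcount is not
		 * dirtied on every allocation */
		unsigned int	pagecnt_bias;
		bool		pfmemalloc;
	};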
