netdev.vger.kernel.org archive mirror
From: Alexander Duyck <alexander.duyck@gmail.com>
To: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, netdev <netdev@vger.kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Alexander Duyck <alexanderduyck@fb.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Greg Thelen <gthelen@google.com>
Subject: Re: [PATCH net] net: avoid 32 x truesize under-estimation for tiny skbs
Date: Thu, 8 Sep 2022 12:26:21 -0700	[thread overview]
Message-ID: <CAKgT0UeV9+=AcQ1J+UA=KGWKAV2E4CW566qYHNv_XxQMC3Us-Q@mail.gmail.com> (raw)
In-Reply-To: <f3f867cf6814510817b253e6aca997cdd3acc48a.camel@redhat.com>

On Thu, Sep 8, 2022 at 11:01 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2022-09-08 at 07:53 -0700, Alexander H Duyck wrote:
> > On Thu, 2022-09-08 at 13:00 +0200, Paolo Abeni wrote:
> > > In most builds GRO_MAX_HEAD packets are even larger (should be 640)
> >
> > Right, which is why I am thinking we may want to default to a 1K slice.
>
> Ok it looks like there is agreement to force a minimum frag size of 1K.
> Side note: that should not cause a memory usage increase compared to
> the slab allocator as kmalloc(640) should use the kmalloc-1k slab.
>
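As a quick sanity check on that size-class claim, here is a userspace sketch (not kernel code; as an assumption it models only the power-of-two kmalloc-* caches and ignores the kmalloc-96/192 ones):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the general-purpose slab size classes: a request
 * is rounded up to the next power-of-two cache size (the real slab
 * also has 96- and 192-byte caches, elided here for simplicity). */
static size_t kmalloc_size_class(size_t size)
{
	size_t class = 8; /* smallest general-purpose cache assumed here */

	while (class < size)
		class <<= 1;
	return class;
}
```

So a kmalloc(640) request lands in the kmalloc-1k cache, and forcing a 1K minimum frag size should not cost more memory than the slab path did.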
> [...]
>
> > > >
> > > If the pagecnt optimization should be dropped, it would be probably
> > > more straight-forward to use/adapt 'page_frag' for the page_order0
> > > allocator.
> >
> > That would make sense. Basically we could get rid of the pagecnt bias
> > and add the fixed number of slices to the count at allocation so we
> > would just need to track the offset to decide when we need to allocate
> > a new page. In addition, if we are flushing the page when it is depleted
> > we don't have to mess with the pfmemalloc logic.
>
> Uhmm... it looks like the existing page_frag allocator does not
> always flush the depleted page:
>
> bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
> {
>         if (pfrag->page) {
>                 if (page_ref_count(pfrag->page) == 1) {
>                         pfrag->offset = 0;
>                         return true;
>                 }

Right, we have the option to reuse the page once the page count drops
back to 1. However, in the case of the 4K page with 1K slices scenario
it means you are having to bump the count back up once for every 3
frags. So you would be looking at roughly 1.3 atomic accesses per
frag. Just doing the bump once at the start and using all 4 slices
would give you 1.25 atomic accesses per frag. That is why I assumed it
would be better to just let the page go.
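For what it's worth, the 1.25 figure falls out of a simple cost model (a sketch of the accounting above, not kernel code): one atomic operation to seed the refcount at allocation, plus one atomic put per slice on free.

```c
#include <assert.h>

/* Cost model for the bump-once-at-allocation scheme: seed the page
 * refcount with the slice count in a single atomic operation, then pay
 * one atomic decrement per slice as each fragment is freed. */
static double bias_atomics_per_frag(int nslices)
{
	return (1.0 + nslices) / nslices;
}
```

With a 4K page carved into four 1K slices that is (1 + 4) / 4 = 1.25 atomic accesses per frag, which the recycling check cannot quite match.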

> so I'll try adding some separate/specialized code and see if the
> overall complexity would be reasonable.

The other thing to keep in mind is that once you start adding the
recycling you will have best-case and worst-case scenarios to
consider. The code above, it seems, is for recycling the frag in
place, or allocating a new one in its place.

> > > BTW it's quite strange/confusing having two very similar APIs (page_frag
> > > and page_frag_cache) with very similar names and no references between
> > > them.
> >
> > I'm not sure what you are getting at here. There are plenty of
> > references between them, they just aren't direct.
>
> Looking/grepping the tree I could not trivially understand when 'struct
> page_frag' should be preferred over 'struct page_frag_cache' and/or
> vice versa; I had to look at the respective implementation details.

The page_frag_cache is mostly there to store a higher-order page that
is sliced up to generate the page fragments that are stored in the
page_frag struct. Honestly, I am surprised we still have page_frag
floating around. I thought we replaced that with bio_vec some time
ago; at least that is the structure that skb_frag_t is typedef'ed as.
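For anyone else hitting the same confusion, the split between the two roughly looks like this (field layout simplified from my reading of the kernel headers; the #ifdef size variants are elided, so treat the exact field types as an assumption):

```c
#include <stdbool.h>
#include <stdint.h>

struct page; /* opaque here; provided by the kernel */

/* A single handed-out fragment: a page plus an offset/size window into
 * it. This is what consumers such as skb_page_frag_refill() fill in. */
struct page_frag {
	struct page *page;
	uint32_t offset;
	uint32_t size;
};

/* The allocator side: state for carving fragments out of one (possibly
 * higher-order) backing page. pagecnt_bias is the batch of references
 * charged up front so each fragment does not need its own atomic op. */
struct page_frag_cache {
	void *va;              /* mapped address of the current page */
	uint16_t offset;       /* position of the next carve-out */
	uint16_t size;         /* size of the backing page */
	unsigned int pagecnt_bias;
	bool pfmemalloc;       /* page came from an emergency reserve */
};
```

In short: page_frag_cache is the producer state, page_frag is the consumer-visible result.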


Thread overview: 35+ messages
2021-01-13 16:18 [PATCH net] net: avoid 32 x truesize under-estimation for tiny skbs Eric Dumazet
2021-01-13 18:00 ` Alexander Duyck
2021-01-13 19:19 ` Michael S. Tsirkin
2021-01-13 22:23 ` David Laight
2021-01-14  5:16   ` Eric Dumazet
2021-01-14  9:29     ` David Laight
2021-01-14 19:00 ` patchwork-bot+netdevbpf
     [not found] ` <1617007696.5731978-1-xuanzhuo@linux.alibaba.com>
2021-03-29  9:06   ` Eric Dumazet
2021-03-31  8:11     ` Michael S. Tsirkin
2021-03-31  8:36       ` Eric Dumazet
2021-03-31  8:46         ` Eric Dumazet
2021-03-31  8:49           ` Eric Dumazet
2021-03-31  8:54             ` Eric Dumazet
     [not found]               ` <1617248264.4993114-2-xuanzhuo@linux.alibaba.com>
2021-04-01  5:06                 ` Eric Dumazet
     [not found]                   ` <1617357110.3822439-1-xuanzhuo@linux.alibaba.com>
2021-04-02 12:52                     ` Eric Dumazet
2021-04-01 13:51         ` Michael S. Tsirkin
2021-04-01 14:08           ` Eric Dumazet
2021-04-01  7:14       ` Jason Wang
     [not found]         ` <1617267183.5697193-1-xuanzhuo@linux.alibaba.com>
2021-04-01  9:58           ` Eric Dumazet
2021-04-02  2:52             ` Jason Wang
     [not found]               ` <1617361253.1788838-2-xuanzhuo@linux.alibaba.com>
2021-04-02 12:53                 ` Eric Dumazet
2021-04-06  2:04                 ` Jason Wang
     [not found]       ` <1617190239.1035674-1-xuanzhuo@linux.alibaba.com>
2021-03-31 12:08         ` Eric Dumazet
2021-04-01 13:36         ` Michael S. Tsirkin
2022-09-07 20:19 ` Paolo Abeni
2022-09-07 20:40   ` Eric Dumazet
2022-09-08 10:48     ` Paolo Abeni
2022-09-08 12:20       ` Eric Dumazet
2022-09-08 14:26         ` Paolo Abeni
2022-09-08 16:00           ` Eric Dumazet
2022-09-07 21:36   ` Alexander H Duyck
2022-09-08 11:00     ` Paolo Abeni
2022-09-08 14:53       ` Alexander H Duyck
2022-09-08 18:01         ` Paolo Abeni
2022-09-08 19:26           ` Alexander Duyck [this message]
