From: Arjun Roy <arjunroy@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Arjun Roy <arjunroy.kdev@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
David Miller <davem@davemloft.net>,
netdev <netdev@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Cgroups <cgroups@vger.kernel.org>, Linux MM <linux-mm@kvack.org>,
Shakeel Butt <shakeelb@google.com>,
Eric Dumazet <edumazet@google.com>,
Soheil Hassas Yeganeh <soheil@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Michal Hocko <mhocko@kernel.org>, Yang Shi <shy828301@gmail.com>,
Roman Gushchin <guro@fb.com>
Subject: Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy
Date: Tue, 23 Mar 2021 11:42:05 -0700 [thread overview]
Message-ID: <CAOFY-A2dfWS91b10R9Pu-5T-uT2qF9h9Lm8GaJfV9shfjP4Wbg@mail.gmail.com> (raw)
In-Reply-To: <YFoe8BO0JsbXTHHF@cmpxchg.org>
On Tue, Mar 23, 2021 at 10:01 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Mar 22, 2021 at 02:35:11PM -0700, Arjun Roy wrote:
> > To make sure we're on the same page, then, here's a tentative
> > mechanism - I'd rather get buy in before spending too much time on
> > something that wouldn't pass muster afterwards.
> >
> > A) An opt-in mechanism, that a driver needs to explicitly support, in
> > order to get properly accounted receive zerocopy.
>
> Yep, opt-in makes sense. That allows piece-by-piece conversion and
> avoids us having to have a flag day.
>
> > B) Failure to opt-in (e.g. unchanged old driver) can either lead to
> > unaccounted zerocopy (ie. existing behaviour) or, optionally,
> > effectively disabled zerocopy (ie. any call to zerocopy will return
> > something like EINVAL) (perhaps controlled via some sysctl, which
> > either lets zerocopy through or not with/without accounting).
>
> I'd suggest letting it fail gracefully (i.e. no -EINVAL) to not
> disturb existing/working setups during the transition period. But the
> exact policy is easy to change later on if we change our minds on it.
>
> > The proposed mechanism would involve:
> > 1) Some way of marking a page as being allocated by a driver that has
> > decided to opt into this mechanism. Say, a page flag, or a memcg flag.
>
> Right. I would stress it should not be a memcg flag or any direct
> channel from the network to memcg, as this would limit its usefulness
> while having the same maintenance overhead.
>
> It should make the network page a first class MM citizen - like an LRU
> page or a slab page - which can be accounted and introspected as such,
> including from the memcg side.
>
> So definitely a page flag.
Works for me.
>
> > 2) A callback provided by the driver, that takes a struct page*, and
> > returns a boolean. The value of the boolean being true indicates that
> > any and all refs on the page are held by the driver. False means there
> > exists at least one reference that is not held by the driver.
>
> I was thinking the PageNetwork flag would cover this, but maybe I'm
> missing something?
>
The main reason for a driver callback is to handle whatever
driver-specific behaviour needs handling (i.e. while a driver may use
code from net/core/page_pool.c, it may also roll its own arbitrary
behaviour and data structures). It also covers the fact that a driver
does not necessarily take exactly one ref of its own on the page.
> > 3) A branch in put_page() that, for pages marked thus, will consult
> > the driver callback and if it returns true, will uncharge the memcg
> > for the page.
>
> The way I picture it, put_page() (and release_pages) should do this:
>
> void __put_page(struct page *page)
> {
> if (is_zone_device_page(page)) {
> put_dev_pagemap(page->pgmap);
>
> /*
> * The page belongs to the device that created pgmap. Do
> * not return it to page allocator.
> */
> return;
> }
> +
> + if (PageNetwork(page)) {
> + put_page_network(page);
> + /* Page belongs to the network stack, not the page allocator */
> + return;
> + }
>
> if (unlikely(PageCompound(page)))
> __put_compound_page(page);
> else
> __put_single_page(page);
> }
>
> where put_page_network() is the network-side callback that uncharges
> the page.
>
> (..and later can be extended to do all kinds of things when informed
> that the page has been freed: update statistics (mod_page_state), put
> it on a private network freelist, or ClearPageNetwork() and give it
> back to the page allocator etc.
>
Yes, this is more or less what I had in mind, though
put_page_network() would also need to avail itself of the callback
mentioned previously.
> But for starters it can set_page_count(page, 1) after the uncharge to
> retain the current silent recycling behavior.)
>
This is one example of where a driver could conceivably hold more than
one ref:
(https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/mellanox/mlx4/en_rx.c#L495)
where it looks like mlx4 can take 2 refs on a page, perhaps storing 2
x 1500B packets on a single 4KB page.
> > The anonymous struct you defined above is part of a union that I think
> > normally is one qword in length (well, could be more depending on the
> > typedefs I saw there) and I think that can be co-opted to provide the
> > driver callback - though, it might require growing the struct by one
> > more qword, since there may be drivers like mlx5 that are already
> > using the field in there for dma_addr.
>
> The page cache / anonymous struct it's shared with is 5 words (double
> linked list pointers, mapping, index, private), and the network struct
> is currently one word, so you can add 4 words to a PageNetwork() page
> without increasing the size of struct page. That should be plenty of
> space to store auxiliary data for drivers, right?
>
Ah, I think I was looking more narrowly at an older version of the
struct. The new one is much easier to parse. :)
4 words should be plenty, I think.
> > Anyways, the callback could then be used by the driver to handle the
> > other accounting quirks you mentioned, without needing to scan the
> > full pool.
>
> Right.
>
> > Of course there are corner cases and such to properly account for, but
> > I just wanted to provide a really rough sketch to see if this
> > (assuming it were properly implemented) was what you had in mind. If
> > so I can put together a v3 patch.
>
> Yeah, makes perfect sense. We can keep iterating like this any time
> you feel you accumulate too many open questions. Not just for MM but
> also for the networking folks - although I suspect that the first step
> would be mostly about the MM infrastructure, and I'm not sure how much
> they care about the internals there ;)
>
> > Per my response to Andrew earlier, this would make it even more
> > confusing whether this is to be applied against net-next or mm trees.
> > But that's a bridge to cross when we get to it.
>
> The mm tree includes -next, so it should be a safe development target
> for the time being.
>
> I would then decide it based on how many changes your patch interacts
> with on either side. Changes to struct page and the put path are not
> very frequent, so I suspect it'll be easy to rebase to net-next and
> route everything through there. And if there are heavy changes on both
> sides, the -mm tree is the better route anyway.
>
> Does that sound reasonable?
This sounds good to me.
To summarize then, it seems to me that we're on the same page now.
I'll put together a tentative v3 such that:
1. It uses pre-charging, as previously discussed.
2. It uses a page flag to delineate pages of a certain networking sort
(i.e. this mechanism).
3. It avails itself of up to 4 words of data inside struct page,
inside the networking specific struct.
4. And it sets up this opt-in lifecycle notification for drivers that
choose to use it, falling back to existing behaviour for those that do
not.
Thanks,
-Arjun
Thread overview:
2021-03-16 4:16 [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy Arjun Roy
2021-03-16 4:20 ` Arjun Roy
2021-03-16 4:29 ` Shakeel Butt
2021-03-16 6:22 ` Arjun Roy
2021-03-16 6:28 ` Arjun Roy
2021-03-16 21:02 ` Jakub Kicinski
2021-03-16 10:26 ` Johannes Weiner
2021-03-17 6:05 ` Arjun Roy
2021-03-17 22:12 ` Johannes Weiner
2021-03-22 21:35 ` Arjun Roy
2021-03-23 17:01 ` Johannes Weiner
2021-03-23 18:42 ` Arjun Roy [this message]
2021-03-24 18:25 ` Shakeel Butt
2021-03-24 22:21 ` Arjun Roy
2021-03-23 14:34 ` Michal Hocko
2021-03-23 18:47 ` Arjun Roy
2021-03-24 9:12 ` Michal Hocko
2021-03-24 20:39 ` Arjun Roy
2021-03-24 20:53 ` Shakeel Butt
2021-03-24 21:56 ` Michal Hocko
2021-03-24 21:24 ` Johannes Weiner
2021-03-24 22:49 ` Arjun Roy
2021-03-25 9:02 ` Michal Hocko
2021-03-25 16:47 ` Johannes Weiner
2021-03-25 17:50 ` Michal Hocko
-- strict thread matches above, loose matches on Subject: below --
2021-03-16 1:30 Arjun Roy
2021-03-18 3:21 ` Andrew Morton
2021-03-22 21:19 ` Arjun Roy