Re: pageless memory & zsmalloc

From: Matthew Wilcox <willy@infradead.org>
To: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-mm@kvack.org, Minchan Kim <minchan@kernel.org>,
	Nitin Gupta <ngupta@vflare.org>,
	Sergey Senozhatsky <senozhatsky@chromium.org>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: pageless memory & zsmalloc
Date: Wed, 6 Oct 2021 04:21:06 +0100	[thread overview]
Message-ID: <YV0WImeZ1YL5nL+g@casper.infradead.org> (raw)
In-Reply-To: <YVzZD+ZQCQJGwCdv@moria.home.lan>

On Tue, Oct 05, 2021 at 07:00:31PM -0400, Kent Overstreet wrote:
> On Tue, Oct 05, 2021 at 10:28:23PM +0100, Matthew Wilcox wrote:
> > I'm still not convinced of the need for allocator + allocatee words.
> > But I don't think we need to resolve that point of disagreement in
> > order to make progress towards the things we do agree on.
> 
> It's not so much that I disagree, I just don't see how your one-word idea is
> possible and you haven't outlined it :) Can you sketch it out for us?

Sure!  Assuming we want to support allocating page cache memory from
slab (a proposition I'm not yet sold on, but I'm willing to believe that
we end up wanting to do that), let's suppose slab allocates an order-3
slab to allocate from.  It allocates a struct slab, then sets each of the
8 pages' compound_head entries to point to it.  When we allocate a
single page from this slab, we allocate the folio that this page will
point to, then change that page's compound_head to point to this folio.
We need to stash the slab compound_head entry from this page in the
struct folio so we know where to free it back to when we free it.
Yes, this is going to require a bit of a specialist interface, but we're
already talking about adding specialist interfaces for allocating
memory anyway, so that doesn't concern me.

> > > And in my mind, compound_head is allocator state, not allocatee state, and it's
> > > always been using for getting to the head page, which... this is not, so using
> > > it this way, as slick as it is... eh, not sure that's quite what we want to do.
> > 
> > In my mind, in the future where all memory descriptors are dynamically
> > allocated, when we allocate an order-3 page, we initialise the
> > 'allocatee state' of each of the 8 consecutive pages to point to the
> > memory descriptor that we just allocated.  We probably also encode the
> > type of the memory descriptor in the allocatee state (I think we've
> > probably got about 5 bits for that).
> 
> Yep, I've been envisioning using the low bits of the pointer to the allocatee
> state as a type tag. Where does compound_order go, though?

Ah!  I think it can go in page->flags on 64 bit and _last_cpupid on
32-bit.  It's only perhaps 4 bits on 32-bit and 5 on 64-bit (128MB
and 8TB seem like reasonable limits on 32 and 64 bit respectively).

> > The lock state has to be in the memory descriptor.  It can't be in the
> > individual page.  So I think all memory descriptors needs to start with
> > a flags word.  Memory compaction could refrain from locking pages if
> > the memory descriptor is of the wrong type, of course.
> 
> Memory compaction inherently has to switch on the allocatee type, so if the page
> is of a type that can't be migrated, it would make sense to just not bother with
> locking it. On the other hand, the type isn't stable without first locking the
> page.

I think it's stable once you get a refcount on the page; you don't need
a lock to prevent freeing.

> There's another synchronization thing we have to work out: with the lock behind
> a pointer, we can race with the page, and the allocatee state, being freed.
> Which implies that we're always going to have to RCU-free allocatee state, and
> that after chasing the pointer and taking the lock we'll have to check that the
> page is still a member of that allocatee state.

Right, it's just like looking things up in the page cache.  Get the
pointer, inc the refcount if not zero, check the pointer still matches.
I think I mentioned SLAB_TYPESAFE_BY_RCU somewhere.

> This race is something that we'll have to handle every place we deref the
> allocatee state pointer - and in many cases we won't want to lock the allocatee
> state, so page->ref will also have to be part of this common page allocatee
> state.

I think we can avoid it for slab.  But maybe not if we want to support
compaction?  I need to think about that some more.

> > Eventually, I think lock_page() disappears in favour of folio_lock().
> > That doesn't quite work for compaction, but maybe we could do something
> > like this ...
> 
> Question is, do other types of pages besides just folios need lock_page() and
> get_page()? If so, maybe folio_lock() doesn't make sense at all and we should
> just have functions that operate on your (expanded, to include a refcount)
> pgflags_t.

I think folio_lock() makes sense for, eg, filesystems.  We should have
a complete API that operates on folios instead of punching through the
layers into the implementation detail of the folio.  For some parts of
the core MM, something like pgflags_lock() makes more sense.  Probably.
I reserve the right to change my mind on this one ... David Hildenbrand
put the idea into me earlier today and it's not entirely settled down yet.