linux-mm.kvack.org archive mirror
* pageless memory & zsmalloc
@ 2021-10-05 17:51 Matthew Wilcox
  2021-10-05 20:13 ` Kent Overstreet
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Matthew Wilcox @ 2021-10-05 17:51 UTC (permalink / raw)
  To: linux-mm
  Cc: Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Kent Overstreet,
	Johannes Weiner

We're trying to tidy up the mess in struct page, and as part of removing
slab from struct page, zsmalloc came on my radar because it's using some
of slab's fields.  The eventual endgame is to get struct page down to a
single word which points to the "memory descriptor" (ie the current
zspage).

zsmalloc, like vmalloc, allocates order-0 pages.  Unlike vmalloc,
zsmalloc allows compaction.  Currently (from the file):

 * Usage of struct page fields:
 *      page->private: points to zspage
 *      page->freelist(index): links together all component pages of a zspage
 *              For the huge page, this is always 0, so we use this field
 *              to store handle.
 *      page->units: first object offset in a subpage of zspage
 *
 * Usage of struct page flags:
 *      PG_private: identifies the first component page
 *      PG_owner_priv_1: identifies the huge component page

This isn't quite everything.  For compaction, zsmalloc also uses
page->mapping (set in __SetPageMovable()), PG_lock (to sync with
compaction) and page->_refcount (compaction gets a refcount on the page).

Since zsmalloc is so well-contained, I propose we completely stop
using struct page in it, as we intend to do for the rest of the users
of struct page.  That is, the _only_ element of struct page we use is
compound_head and it points to struct zspage.

That means every single page allocated by zsmalloc is PageTail().  It also
means that when isolate_movable_page() calls trylock_page(), it redirects
to the zspage.  That means struct zspage must now have page flags as its
first element.  Also, zspage->_refcount and zspage->mapping must match
their locations in struct page.  That's something we'll clean up
eventually, but for now we're relying on offsetof() assertions.

The good news is that trylock_zspage() no longer needs to walk the
list of pages, calling trylock_page() on each of them.

Anyway, is there a good test suite for zsmalloc()?  Particularly something
that would exercise its interactions with compaction / migration?
I don't have any code written yet.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: pageless memory & zsmalloc
  2021-10-05 17:51 pageless memory & zsmalloc Matthew Wilcox
@ 2021-10-05 20:13 ` Kent Overstreet
  2021-10-05 21:28   ` Matthew Wilcox
  2021-10-07 15:03 ` Vlastimil Babka
  2021-10-08 20:43 ` Minchan Kim
  2 siblings, 1 reply; 8+ messages in thread
From: Kent Overstreet @ 2021-10-05 20:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Johannes Weiner

On Tue, Oct 05, 2021 at 06:51:32PM +0100, Matthew Wilcox wrote:
> We're trying to tidy up the mess in struct page, and as part of removing
> slab from struct page, zsmalloc came on my radar because it's using some
> of slab's fields.  The eventual endgame is to get struct page down to a
> single word which points to the "memory descriptor" (ie the current
> zspage).
> 
> [...]
> 
> This isn't quite everything.  For compaction, zsmalloc also uses
> page->mapping (set in __SetPageMovable()), PG_lock (to sync with
> compaction) and page->_refcount (compaction gets a refcount on the page).
> 
> Since zsmalloc is so well-contained, I propose we completely stop
> using struct page in it, as we intend to do for the rest of the users
> of struct page.  That is, the _only_ element of struct page we use is
> compound_head and it points to struct zspage.
> 
> That means every single page allocated by zsmalloc is PageTail().  Also it
> means that when isolate_movable_page() calls trylock_page(), it redirects
> to the zspage.  That means struct zspage must now have page flags as its
> first element.  Also, zspage->_refcount, and zspage->mapping must match
> their locations in struct page.  That's something that we'll get cleaned
> up eventually, but for now, we're relying on offsetof() assertions.
> 
> The good news is that trylock_zspage() no longer needs to walk the
> list of pages, calling trylock_page() on each of them.
> 
> Anyway, is there a good test suite for zsmalloc()?  Particularly something
> that would exercise its interactions with compaction / migration?
> I don't have any code written yet.

This is some deep hackery.

So to restate - making sure I understand correctly - the reason for doing it
this way is that the compaction code calls lock_page(); using compound_head
(instead of page->private) for the pointer to the zspage means that
compound_head() will return the pointer to the zspage, and since lock_page()
uses compound_head(), the compaction code, when it calls lock_page(), will
actually be taking the lock in struct zspage.

So on the one hand, getting struct page down to two words means that we're going
to be moving the page lock bit into those external structs (maybe we _should_ be
doing some kind of inheritance thing for the allocatee interface?) - so it is
cool that that all lines up.

But long term, we're going to need two words in struct page, not just one: We
need to store the order of the allocation, and a type tagged pointer, and I
don't think we can realistically cram compound_order into the same word as the
type tagged pointer. Plus, transparently using slab for larger allocations -
that today use the buddy allocator interface - means slab can't be using the
allocatee pointer field.

And in my mind, compound_head is allocator state, not allocatee state, and it's
always been used for getting to the head page, which... this is not, so using
it this way, as slick as it is... eh, I'm not sure that's quite what we want to do.

Q: should lock_page() be calling compound_head() at all? If the goal of the type
system stuff that we're doing is to move the compound_head() calls up to the
places where we're dealing with tail pages, and never below that point, then
that call in lock_page() should go - and we'll need to figure out something
different for the migration code to take the lock in the allocatee-private
structure.

I've been kind of feeling that since folios are furthest along, separately
allocating them and figuring out how to make that all work might be the logical
next step.



* Re: pageless memory & zsmalloc
  2021-10-05 20:13 ` Kent Overstreet
@ 2021-10-05 21:28   ` Matthew Wilcox
  2021-10-05 23:00     ` Kent Overstreet
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2021-10-05 21:28 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-mm, Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Johannes Weiner

On Tue, Oct 05, 2021 at 04:13:12PM -0400, Kent Overstreet wrote:
> On Tue, Oct 05, 2021 at 06:51:32PM +0100, Matthew Wilcox wrote:
> > [...]
> 
> This is some deep hackery.

... thank you?  ;-)  Actually, it's indicative that we need to work
through what we're doing in a bit more detail.

> So to restate - making sure I understand correctly - the reason for doing it
> this way is that the compaction code calls lock_page(); using compound_head
> (instead of page->private) for the pointer to the zspage means that
> compound_head() will return the pointer to the zspage, and since lock_page()
> uses compound_head(), the compaction code, when it calls lock_page(), will
> actually be taking the lock in struct zspage.

Yes.  Just like when we call lock_page() on a THP or hugetlbfs tail
page, we actually take the lock in the head page.

> So on the one hand, getting struct page down to two words means that we're going
> to be moving the page lock bit into those external structs (maybe we _should_ be
> doing some kind of inheritence thing, for the allocatee interface?) - so it is
> cool that that all lines up.
> 
> But long term, we're going to need two words in struct page, not just one: We
> need to store the order of the allocation, and a type tagged pointer, and I
> don't think we can realistically cram compound_order into the same word as the
> type tagged pointer. Plus, transparently using slab for larger allocations -
> that today use the buddy allocator interface - means slab can't be using the
> allocatee pointer field.

I'm still not convinced of the need for allocator + allocatee words.
But I don't think we need to resolve that point of disagreement in
order to make progress towards the things we do agree on.

> And in my mind, compound_head is allocator state, not allocatee state, and it's
> always been using for getting to the head page, which... this is not, so using
> it this way, as slick as it is... eh, not sure that's quite what we want to do.

In my mind, in the future where all memory descriptors are dynamically
allocated, when we allocate an order-3 page, we initialise the
'allocatee state' of each of the 8 consecutive pages to point to the
memory descriptor that we just allocated.  We probably also encode the
type of the memory descriptor in the allocatee state (I think we've
probably got about 5 bits for that).

The lock state has to be in the memory descriptor.  It can't be in the
individual page.  So I think all memory descriptors need to start with
a flags word.  Memory compaction could refrain from locking pages if
the memory descriptor is of the wrong type, of course.

> Q: should lock_page() be calling compound_head() at all? If the goal of the type
> system stuff that we're doing is to move the compound_head() calls up to where
> we're dealing with tail pages and never more after that, then that call in
> lock_page() should go - and we'll need to figure out something different for the
> migration code to take the lock in the allocatee-private structure.

Eventually, I think lock_page() disappears in favour of folio_lock().
That doesn't quite work for compaction, but maybe we could do something
like this ...

typedef struct { unsigned long f; } pgflags_t;

void pgflags_lock(pgflags_t *);

struct folio {
	pgflags_t flags;
	...
};

static inline void folio_lock(struct folio *folio)
{
	pgflags_lock(&folio->flags);
}

That way, compaction doesn't need to know what kind of memory descriptor
this page belongs to.

Something similar could happen for user-mappable memory.  eg:

struct mmapable {
	atomic_t _refcount;
	atomic_t _mapcount;
	struct address_space *mapping;
};

struct folio {
	pgflags_t flags;
	struct mmapable m;
	struct list_head lru;
	pgoff_t index;
	unsigned long private;
...
};

None of this can happen before we move struct slab out of struct page,
because we can't mess with the alignment of freelist+counters and
struct mmapable is 3 words on 32-bit and 2 on 64-bit.  So I'm trying
to break off pieces that can be done to get us a little bit closer.

> I've been kind of feeling that since folios are furthest along, separately
> allocating them and figuring out how to make that all work might be the logical
> next step.

I'd rather not put anything else on top of the folio work until some of
it -- any of it -- is merged.  Fortunately, our greatest bottleneck is
reviewer bandwidth, and it's actually parallelisable; the people who need
to approve of slab are independent of anon, file and zsmalloc reviewers.



* Re: pageless memory & zsmalloc
  2021-10-05 21:28   ` Matthew Wilcox
@ 2021-10-05 23:00     ` Kent Overstreet
  2021-10-06  3:21       ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Kent Overstreet @ 2021-10-05 23:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Johannes Weiner

On Tue, Oct 05, 2021 at 10:28:23PM +0100, Matthew Wilcox wrote:
> I'm still not convinced of the need for allocator + allocatee words.
> But I don't think we need to resolve that point of disagreement in
> order to make progress towards the things we do agree on.

It's not so much that I disagree; I just don't see how your one-word idea is
possible, and you haven't outlined it :) Can you sketch it out for us?

> > And in my mind, compound_head is allocator state, not allocatee state, and it's
> > always been using for getting to the head page, which... this is not, so using
> > it this way, as slick as it is... eh, not sure that's quite what we want to do.
> 
> In my mind, in the future where all memory descriptors are dynamically
> allocated, when we allocate an order-3 page, we initialise the
> 'allocatee state' of each of the 8 consecutive pages to point to the
> memory descriptor that we just allocated.  We probably also encode the
> type of the memory descriptor in the allocatee state (I think we've
> probably got about 5 bits for that).

Yep, I've been envisioning using the low bits of the pointer to the allocatee
state as a type tag. Where does compound_order go, though?

> The lock state has to be in the memory descriptor.  It can't be in the
> individual page.  So I think all memory descriptors needs to start with
> a flags word.  Memory compaction could refrain from locking pages if
> the memory descriptor is of the wrong type, of course.

Memory compaction inherently has to switch on the allocatee type, so if the page
is of a type that can't be migrated, it would make sense to just not bother with
locking it. On the other hand, the type isn't stable without first locking the
page.

There's another synchronization thing we have to work out: with the lock behind
a pointer, we can race with the page, and the allocatee state, being freed.
Which implies that we're always going to have to RCU-free allocatee state, and
that after chasing the pointer and taking the lock we'll have to check that the
page is still a member of that allocatee state.

This race is something that we'll have to handle every place we deref the
allocatee state pointer - and in many cases we won't want to lock the allocatee
state, so page->ref will also have to be part of this common page allocatee
state.

> Eventually, I think lock_page() disappears in favour of folio_lock().
> That doesn't quite work for compaction, but maybe we could do something
> like this ...

Question is, do other types of pages besides just folios need lock_page() and
get_page()? If so, maybe folio_lock() doesn't make sense at all and we should
just have functions that operate on your (expanded, to include a refcount)
pgflags_t.



* Re: pageless memory & zsmalloc
  2021-10-05 23:00     ` Kent Overstreet
@ 2021-10-06  3:21       ` Matthew Wilcox
  0 siblings, 0 replies; 8+ messages in thread
From: Matthew Wilcox @ 2021-10-06  3:21 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-mm, Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Johannes Weiner

On Tue, Oct 05, 2021 at 07:00:31PM -0400, Kent Overstreet wrote:
> On Tue, Oct 05, 2021 at 10:28:23PM +0100, Matthew Wilcox wrote:
> > I'm still not convinced of the need for allocator + allocatee words.
> > But I don't think we need to resolve that point of disagreement in
> > order to make progress towards the things we do agree on.
> 
> It's not so much that I disagree, I just don't see how your one-word idea is
> possible and you haven't outlined it :) Can you sketch it out for us?

Sure!  Assuming we want to support allocating page cache memory from
slab (a proposition I'm not yet sold on, but I'm willing to believe that
we end up wanting to do that), let's suppose slab allocates an order-3
slab to allocate from.  It allocates a struct slab, then sets each of the
8 pages' compound_head entries to point to it.  When we allocate a
single page from this slab, we allocate the folio that this page will
point to, then change that page's compound_head to point to this folio.
We need to stash the slab compound_head entry from this page in the
struct folio so we know where to free it back to when we free it.
Yes, this is going to require a bit of a specialist interface, but we're
already talking about adding specialist interfaces for allocating
memory anyway, so that doesn't concern me.

> > > And in my mind, compound_head is allocator state, not allocatee state, and it's
> > > always been using for getting to the head page, which... this is not, so using
> > > it this way, as slick as it is... eh, not sure that's quite what we want to do.
> > 
> > In my mind, in the future where all memory descriptors are dynamically
> > allocated, when we allocate an order-3 page, we initialise the
> > 'allocatee state' of each of the 8 consecutive pages to point to the
> > memory descriptor that we just allocated.  We probably also encode the
> > type of the memory descriptor in the allocatee state (I think we've
> > probably got about 5 bits for that).
> 
> Yep, I've been envisioning using the low bits of the pointer to the allocatee
> state as a type tag. Where does compound_order go, though?

Ah!  I think it can go in page->flags on 64 bit and _last_cpupid on
32-bit.  It's only perhaps 4 bits on 32-bit and 5 on 64-bit (128MB
and 8TB seem like reasonable limits on 32 and 64 bit respectively).

> > The lock state has to be in the memory descriptor.  It can't be in the
> > individual page.  So I think all memory descriptors needs to start with
> > a flags word.  Memory compaction could refrain from locking pages if
> > the memory descriptor is of the wrong type, of course.
> 
> Memory compaction inherently has to switch on the allocatee type, so if the page
> is of a type that can't be migrated, it would make sense to just not bother with
> locking it. On the other hand, the type isn't stable without first locking the
> page.

I think it's stable once you get a refcount on the page; you don't need
a lock to prevent freeing.

> There's another synchronization thing we have to work out: with the lock behind
> a pointer, we can race with the page, and the allocatee state, being freed.
> Which implies that we're always going to have to RCU-free allocatee state, and
> that after chasing the pointer and taking the lock we'll have to check that the
> page is still a member of that allocatee state.

Right, it's just like looking things up in the page cache.  Get the
pointer, inc the refcount if not zero, check the pointer still matches.
I think I mentioned SLAB_TYPESAFE_BY_RCU somewhere.

> This race is something that we'll have to handle every place we deref the
> allocatee state pointer - and in many cases we won't want to lock the allocatee
> state, so page->ref will also have to be part of this common page allocatee
> state.

I think we can avoid it for slab.  But maybe not if we want to support
compaction?  I need to think about that some more.

> > Eventually, I think lock_page() disappears in favour of folio_lock().
> > That doesn't quite work for compaction, but maybe we could do something
> > like this ...
> 
> Question is, do other types of pages besides just folios need lock_page() and
> get_page()? If so, maybe folio_lock() doesn't make sense at all and we should
> just have functions that operate on your (expanded, to include a refcount)
> pgflags_t.

I think folio_lock() makes sense for, eg, filesystems.  We should have
a complete API that operates on folios instead of punching through the
layers into the implementation detail of the folio.  For some parts of
the core MM, something like pgflags_lock() makes more sense.  Probably.
I reserve the right to change my mind on this one ... David Hildenbrand
put the idea into me earlier today and it's not entirely settled down yet.



* Re: pageless memory & zsmalloc
  2021-10-05 17:51 pageless memory & zsmalloc Matthew Wilcox
  2021-10-05 20:13 ` Kent Overstreet
@ 2021-10-07 15:03 ` Vlastimil Babka
  2021-10-07 18:11   ` Matthew Wilcox
  2021-10-08 20:43 ` Minchan Kim
  2 siblings, 1 reply; 8+ messages in thread
From: Vlastimil Babka @ 2021-10-07 15:03 UTC (permalink / raw)
  To: Matthew Wilcox, linux-mm
  Cc: Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Kent Overstreet,
	Johannes Weiner

On 10/5/21 19:51, Matthew Wilcox wrote:
> [...]
> 
> That means every single page allocated by zsmalloc is PageTail().  Also it

I would be worried that there is code, e.g. some pfn scanner, that will see a
PageTail page, look up its compound_head() and order, and use those to skip
over the rest of the tail pages.  That would fail spectacularly if
compound_head() pointed somewhere other than to a struct page in the same
memmap array.



* Re: pageless memory & zsmalloc
  2021-10-07 15:03 ` Vlastimil Babka
@ 2021-10-07 18:11   ` Matthew Wilcox
  0 siblings, 0 replies; 8+ messages in thread
From: Matthew Wilcox @ 2021-10-07 18:11 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Minchan Kim, Nitin Gupta, Sergey Senozhatsky,
	Kent Overstreet, Johannes Weiner

On Thu, Oct 07, 2021 at 05:03:12PM +0200, Vlastimil Babka wrote:
> On 10/5/21 19:51, Matthew Wilcox wrote:
> > [...]
> > That means every single page allocated by zsmalloc is PageTail().  Also it
> 
> I would be worried there is code, i.e. some pfn scanner that will see a
> PageTail, lookup its compound_head() and order and use it to skip over the
> rest of tail pages. Which would fail spectacularly if compound_head()
> pointed somewhere else than to the same memmap array to a struct page.

Yes, that's definitely a concern.  What does work is a pfn scanner
doing pfn |= (1 << page_order(page)) - 1; (because page_order() of a
zsmalloc page is 0, this is a noop).  It's something that will need
to be audited before we do this.



* Re: pageless memory & zsmalloc
  2021-10-05 17:51 pageless memory & zsmalloc Matthew Wilcox
  2021-10-05 20:13 ` Kent Overstreet
  2021-10-07 15:03 ` Vlastimil Babka
@ 2021-10-08 20:43 ` Minchan Kim
  2 siblings, 0 replies; 8+ messages in thread
From: Minchan Kim @ 2021-10-08 20:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Nitin Gupta, Sergey Senozhatsky, Kent Overstreet,
	Johannes Weiner

[-- Attachment #1: Type: text/plain, Size: 2853 bytes --]

On Tue, Oct 05, 2021 at 06:51:32PM +0100, Matthew Wilcox wrote:
> [...]
> 
> Since zsmalloc is so well-contained, I propose we completely stop
> using struct page in it, as we intend to do for the rest of the users
> of struct page.  That is, the _only_ element of struct page we use is
> compound_head and it points to struct zspage.

Then, do you mean zsmalloc couldn't use page.lru to link the tail pages
from the head page? IOW, does the zspage need to have a subpage list or array?

> 
> That means every single page allocated by zsmalloc is PageTail().  Also it
> means that when isolate_movable_page() calls trylock_page(), it redirects
> to the zspage.  That means struct zspage must now have page flags as its
> first element.  Also, zspage->_refcount, and zspage->mapping must match
> their locations in struct page.  That's something that we'll get cleaned
> up eventually, but for now, we're relying on offsetof() assertions.
> 
> The good news is that trylock_zspage() no longer needs to walk the
> list of pages, calling trylock_page() on each of them.

Sounds good if we could remove the mess.

> 
> Anyway, is there a good test suite for zsmalloc()?  Particularly something
> that would exercise its interactions with compaction / migration?
> I don't have any code written yet.

This is a toy of mine to put stress on those paths. I ran it on KVM with
8 cores and 8GB of RAM.

[attached memhog.c]

#!/bin/bash

swapoff -a

echo 1 > /sys/block/zram0/reset
echo 6g > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0

for comp_ratio in $(seq 10 10 100)
do
    ./memhog -m 1g -c $comp_ratio &
done

while true
do
    echo 1 > /sys/block/zram0/compact &
    echo 1 > /proc/sys/vm/compact_memory &
    sleep 2
done

[-- Attachment #2: memhog.c --]
[-- Type: text/x-csrc, Size: 5779 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <getopt.h>
#include <sys/mman.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define CHUNK_SIZE (20UL<<20)
#ifndef PAGE_SIZE
#define PAGE_SIZE (4096)
#endif

/*
 * For native build, use aarch64-linux-android-clang -pie -o memhog memhog.c
 */
void usage(char *exe)
{
    fprintf(stderr,
            "Usage: %s [options] size[k|m|g]\n"
            "    -c|--compress ratio fill memory with ratio compressible data\n"
            "    -m|--memory SIZE    allocate memory in SIZE byte chunks\n"
            "    -M|--mlock          mlock() the memory\n"
            "    -s|--sleep SEC      sleep SEC seconds during repeat cycle\n"
            "    -r|--repeat N       repeat read/write N times\n"
            "    -h|--help           show this message\n",
            exe);

    exit(1);
}

static const struct option opts[] = {
    { "compress", 1, NULL, 'c' },
    { "memory"  , 1, NULL, 'm' },
    { "mlock"   , 0, NULL, 'M' },
    { "sleep"   , 1, NULL, 's' },
    { "repeat"  , 1, NULL, 'r' },
    { "help"    , 0, NULL, 'h' },
    { NULL      , 0, NULL, 0 }
};

unsigned long long memparse(const char *ptr, char **retptr)
{
    char *endptr;

    unsigned long long ret = strtoull(ptr, &endptr, 0);

    switch (*endptr) {
        case 'G':
        case 'g':
            ret <<= 10;
            /* fall through */
        case 'M':
        case 'm':
            ret <<= 10;
            /* fall through */
        case 'K':
        case 'k':
            ret <<= 10;
            endptr++;
            break;
        default:
            break;
    }

    if (retptr)
        *retptr = endptr;

    return ret;
}

void allocate_mem(unsigned long long size, void *alloc_ptr[], int *len)
{
    int i;
    void *ptr;
    unsigned long nr_chunk = size / CHUNK_SIZE;
    int allocated = 0;

    for (i = 0; i < nr_chunk; i++) {
        ptr = mmap(NULL, CHUNK_SIZE, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANON, -1, 0);
        if (ptr == MAP_FAILED) {
            printf("fail to allocate %d\n", i);
            break;
        }

        alloc_ptr[allocated++] = ptr;
    }

    *len = allocated;
}

void free_mem(void *alloc_ptr[], int len)
{
    int i;

    for (i = 0; i < len; i++)
        munmap(alloc_ptr[i], CHUNK_SIZE);
}

void fill_mem(void *ptr, void *rand_page, long int comp_ratio)
{
    int i;
    static int nr_page = CHUNK_SIZE / PAGE_SIZE;
    int zero_size = PAGE_SIZE * comp_ratio / 100;

    for (i = 0; i < nr_page; i++, ptr += PAGE_SIZE) {
        memset(ptr, 0, zero_size);
        memcpy(ptr + zero_size, rand_page, PAGE_SIZE - zero_size);
    }
}

int fill_chunk(void *alloc_ptr[], int len, long int comp_ratio)
{
    int i, ret;
    char rand_buf[PAGE_SIZE];
    int fd = open("/dev/urandom", O_RDONLY);

    if (fd < 0) {
        perror("Fail to open /dev/urandom");
        return 1;
    }

    ret = read(fd, rand_buf, PAGE_SIZE);
    if (ret != PAGE_SIZE) {
        perror("Fail to read /dev/urandom");
        close(fd);
        return 1;
    }

    for (i = 0; i < len; i++)
        fill_mem(alloc_ptr[i], rand_buf, comp_ratio);

    close(fd);
    return 0;
}

int main(int argc, char *argv[])
{
    char buf[256] = {0,};
    unsigned long long opt_mem = 100 << 20;
    long int opt_sleep = 1;
    long int opt_reps = 10000;
    long int opt_comp_ratio = 30;
    unsigned long opt_mlock = 0;
    long int loops;
    int pid = getpid();
    int err, c, count;

    while ((c = getopt_long(argc, argv,
                    "m:Ms:r:c:h", opts, NULL)) != -1) {
        switch (c) {
            case 'c':
                opt_comp_ratio = strtol(optarg, NULL, 10);
                break;
            case 'm':
                opt_mem = memparse(optarg, NULL);
                break;
            case 'M':
                opt_mlock = 1;
                break;
            case 's':
                opt_sleep = strtol(optarg, NULL, 10);
                break;
            case 'r':
                opt_reps = strtol(optarg, NULL, 10);
                break;
            case 'h':
                usage(argv[0]);
                break;
            default:
                usage(argv[0]);
        }
    }

    if (opt_mem < CHUNK_SIZE) {
        printf("memory size should be greater than %lu\n", CHUNK_SIZE);
        return 1;
    }

    /* Disable LMK/OOM killer */
    sprintf(buf, "echo -1000 > /proc/%d/oom_score_adj\n", pid);
    if (WEXITSTATUS(system(buf))) {
        fprintf(stderr, "fail to disable OOM. Maybe you need root permission\n");
        return 1;
    }

    if (opt_mlock) {
        err = mlockall(MCL_CURRENT|MCL_FUTURE|MCL_ONFAULT);
        if (err) {
            perror("Fail to mlockall");
            return err;
        }
    }

    printf("%llu MB allocated\n", opt_mem >> 20);
    printf("%ld loops\n", opt_reps);
    printf("%ld sec sleep\n", opt_sleep);
    printf("%ld%% comp_ratio\n", opt_comp_ratio);
    count = 0;
    loops = opt_reps;

    while (loops) {
        /* 20M * 4096 = 80G is enough */
        void *alloc_ptr[PAGE_SIZE];
        int len;

        count++;
retry:
        allocate_mem(opt_mem, alloc_ptr, &len);
        if (len == 0) {
            /*
             * If we couldn't allocate any memory, let's try again
             * after a while
             */
            sleep(1);
            goto retry;
        }

        if (fill_chunk(alloc_ptr, len, opt_comp_ratio)) {
            printf("Fail to fill chunk\n");
            return 1;
        }

        if (opt_sleep == -1) {
            while (1) {
                printf("Forever sleep, Bye\n");
                sleep(100000);
            }
        }

        sleep(opt_sleep);
        free_mem(alloc_ptr, len);
        if (loops != -1)
            loops--;
        printf("[%d] Pass %d\n", pid, count);
    }

    return 0;
}


end of thread, other threads:[~2021-10-08 20:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-05 17:51 pageless memory & zsmalloc Matthew Wilcox
2021-10-05 20:13 ` Kent Overstreet
2021-10-05 21:28   ` Matthew Wilcox
2021-10-05 23:00     ` Kent Overstreet
2021-10-06  3:21       ` Matthew Wilcox
2021-10-07 15:03 ` Vlastimil Babka
2021-10-07 18:11   ` Matthew Wilcox
2021-10-08 20:43 ` Minchan Kim
