From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=pyM+=OZ=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 37391C433F5
	for <linux-mm@archiver.kernel.org>; Tue,  5 Oct 2021 21:29:46 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 4826E61159
	for <linux-mm@archiver.kernel.org>; Tue,  5 Oct 2021 21:29:45 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 4826E61159
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org
Received: by kanga.kvack.org (Postfix)
	id CC2876B006C; Tue,  5 Oct 2021 17:29:44 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id C726F6B0071; Tue,  5 Oct 2021 17:29:44 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id B131C6B0073; Tue,  5 Oct 2021 17:29:44 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0055.hostedemail.com [216.40.44.55])
	by kanga.kvack.org (Postfix) with ESMTP id A3BC86B006C
	for <linux-mm@kvack.org>; Tue,  5 Oct 2021 17:29:44 -0400 (EDT)
Received: from smtpin26.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay03.hostedemail.com (Postfix) with ESMTP id 619CA8249980
	for <linux-mm@kvack.org>; Tue,  5 Oct 2021 21:29:44 +0000 (UTC)
X-FDA: 78663675888.26.5679CD3
Received: from casper.infradead.org (casper.infradead.org [90.155.50.34])
	by imf08.hostedemail.com (Postfix) with ESMTP id 9B5953001852
	for <linux-mm@kvack.org>; Tue,  5 Oct 2021 21:29:43 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version:
	References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To:
	Content-Transfer-Encoding:Content-ID:Content-Description;
	bh=mW57PwYjkGwcxpWBmwfdZZaSgRKP2J3esecomzGpA9o=; b=W4NKpRTBmuoBiDoxRamNNSW3ej
	0Vr7wvcvaECuGWMnb0izQSBWcfpYjEDXT7vWhdHONxLVVHp7kBcce6lOg7P+PELEEuosdOvfF0mTK
	l+hbo+lEB2rXDR6j+tj+RkzbCAacWkEA/bYH1NJzW4ou+t0GVzE8wSf1dXM+1/ZJGzg+6G3es5LGQ
	7YxJN8YffsrMby++MkS/lK4Y/gV72MBX/QP3lTGV4wgH8rpiGDJY4b7TlPqaGT4W3ZWXSeUIN5wR1
	dGpgQa9M4e6ZMnNbjKuCWuSWaGTXuQHXtgbCOnh9eucgb59bEwU2Db/NatlDx8AZfT9+U6rHbx4Ys
	+2PvI6Ng==;
Received: from willy by casper.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux))
	id 1mXrz5-000I0r-Ho; Tue, 05 Oct 2021 21:28:53 +0000
Date: Tue, 5 Oct 2021 22:28:23 +0100
From: Matthew Wilcox <willy@infradead.org>
To: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-mm@kvack.org, Minchan Kim <minchan@kernel.org>,
	Nitin Gupta <ngupta@vflare.org>,
	Sergey Senozhatsky <senozhatsky@chromium.org>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: pageless memory & zsmalloc
Message-ID: <YVzDd4AGOVAkRimw@casper.infradead.org>
References: <YVyQpPuwIGFSSEQ8@casper.infradead.org>
 <YVyx2CMlnSBuY/4v@moria.home.lan>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <YVyx2CMlnSBuY/4v@moria.home.lan>
X-Rspamd-Server: rspam03
X-Rspamd-Queue-Id: 9B5953001852
X-Stat-Signature: tqy433xzjrmucjztck4ibsjzwgajqmes
Authentication-Results: imf08.hostedemail.com;
	dkim=pass header.d=infradead.org header.s=casper.20170209 header.b=W4NKpRTB;
	dmarc=none;
	spf=none (imf08.hostedemail.com: domain of willy@infradead.org has no SPF policy when checking 90.155.50.34) smtp.mailfrom=willy@infradead.org
X-HE-Tag: 1633469383-73535
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Tue, Oct 05, 2021 at 04:13:12PM -0400, Kent Overstreet wrote:
> On Tue, Oct 05, 2021 at 06:51:32PM +0100, Matthew Wilcox wrote:
> > We're trying to tidy up the mess in struct page, and as part of removing
> > slab from struct page, zsmalloc came on my radar because it's using some
> > of slab's fields.  The eventual endgame is to get struct page down to a
> > single word which points to the "memory descriptor" (ie the current
> > zspage).
> > 
> > zsmalloc, like vmalloc, allocates order-0 pages.  Unlike vmalloc,
> > zsmalloc allows compaction.  Currently (from the file):
> > 
> >  * Usage of struct page fields:
> >  *      page->private: points to zspage
> >  *      page->freelist(index): links together all component pages of a zspage
> >  *              For the huge page, this is always 0, so we use this field
> >  *              to store handle.
> >  *      page->units: first object offset in a subpage of zspage
> >  *
> >  * Usage of struct page flags:
> >  *      PG_private: identifies the first component page
> >  *      PG_owner_priv_1: identifies the huge component page
> > 
> > This isn't quite everything.  For compaction, zsmalloc also uses
> > page->mapping (set in __SetPageMovable()), PG_lock (to sync with
> > compaction) and page->_refcount (compaction gets a refcount on the page).
> > 
> > Since zsmalloc is so well-contained, I propose we completely stop
> > using struct page in it, as we intend to do for the rest of the users
> > of struct page.  That is, the _only_ element of struct page we use is
> > compound_head and it points to struct zspage.
> > 
> > That means every single page allocated by zsmalloc is PageTail().  Also it
> > means that when isolate_movable_page() calls trylock_page(), it redirects
> > to the zspage.  That means struct zspage must now have page flags as its
> > first element.  Also, zspage->_refcount, and zspage->mapping must match
> > their locations in struct page.  That's something that we'll get cleaned
> > up eventually, but for now, we're relying on offsetof() assertions.
> > 
> > The good news is that trylock_zspage() no longer needs to walk the
> > list of pages, calling trylock_page() on each of them.
> > 
> > Anyway, is there a good test suite for zsmalloc()?  Particularly something
> > that would exercise its interactions with compaction / migration?
> > I don't have any code written yet.
> 
> This is some deep hackery.

... thank you?  ;-)  Actually, it's indicative that we need to work
through what we're doing in a bit more detail.

> So to restate - and making sure I understand correctly - the reason for doing it
> this way is that the compaction code calls lock_page(); by using compound_head
> (instead of page->private) for the pointer to the zspage is so that
> compound_head() will return the pointer to zspage, and lock_page() uses
> compound_page(), so the compaction code, when it calls lock_page(), will
> actually be taking the lock in struct zspage.

Yes.  Just like when we call lock_page() on a THP or hugetlbfs tail
page, we actually take the lock in the head page.

> So on the one hand, getting struct page down to two words means that we're going
> to be moving the page lock bit into those external structs (maybe we _should_ be
> doing some kind of inheritence thing, for the allocatee interface?) - so it is
> cool that that all lines up.
> 
> But long term, we're going to need two words in struct page, not just one: We
> need to store the order of the allocation, and a type tagged pointer, and I
> don't think we can realistically cram compound_order into the same word as the
> type tagged pointer. Plus, transparently using slab for larger allocations -
> that today use the buddy allocator interface - means slab can't be using the
> allocatee pointer field.

I'm still not convinced of the need for allocator + allocatee words.
But I don't think we need to resolve that point of disagreement in
order to make progress towards the things we do agree on.

> And in my mind, compound_head is allocator state, not allocatee state, and it's
> always been using for getting to the head page, which... this is not, so using
> it this way, as slick as it is... eh, not sure that's quite what we want to do.

In my mind, in the future where all memory descriptors are dynamically
allocated, when we allocate an order-3 page, we initialise the
'allocatee state' of each of the 8 consecutive pages to point to the
memory descriptor that we just allocated.  We probably also encode the
type of the memory descriptor in the allocatee state (I think we've
probably got about 5 bits for that).

The lock state has to be in the memory descriptor.  It can't be in the
individual page.  So I think all memory descriptors needs to start with
a flags word.  Memory compaction could refrain from locking pages if
the memory descriptor is of the wrong type, of course.

> Q: should lock_page() be calling compound_head() at all? If the goal of the type
> system stuff that we're doing is to move the compound_head() calls up to where
> we're dealing with tail pages and never more after that, then that call in
> lock_page() should go - and we'll need to figure out something different for the
> migration code to take the lock in the allocatee-private structure.

Eventually, I think lock_page() disappears in favour of folio_lock().
That doesn't quite work for compaction, but maybe we could do something
like this ...

typedef struct { unsigned long f; } pgflags_t;

void pgflags_lock(pgflags_t *);

struct folio {
	pgflags_t flags;
	...
};

static inline void folio_lock(struct folio *folio)
{
	pgflags_lock(&folio->flags);
}

That way, compaction doesn't need to know what kind of memory descriptor
this page belongs to.

Something similar could happen for user-mappable memory.  eg:

struct mmapable {
	atomic_t _refcount;
	atomic_t _mapcount;
	struct address_space *mapping;
};

struct folio {
	pgflags_t flags;
	struct mmapable m;
	struct list_head lru;
	pgoff_t index;
	unsigned long private;
...
};

None of this can happen before we move struct slab out of struct page,
because we can't mess with the alignment of freelist+counters and
struct mmapable is 3 words on 32-bit and 2 on 64-bit.  So I'm trying
to break off pieces that can be done to get us a little bit closer.

> I've been kind of feeling that since folios are furthest along, separately
> allocating them and figuring out how to make that all work might be the logical
> next step.

I'd rather not put anything else on top of the folio work until some of
it -- any of it -- is merged.  Fortunately, our greatest bottleneck is
reviewer bandwidth, and it's actually parallelisable; the people who need
to approve of slab are independent of anon, file and zsmalloc reviewers.