From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
To: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"rppt@kernel.org" <rppt@kernel.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"keescook@chromium.org" <keescook@chromium.org>,
	"Weiny, Ira" <ira.weiny@intel.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"x86@kernel.org" <x86@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"rppt@linux.ibm.com" <rppt@linux.ibm.com>,
	"Lutomirski, Andy" <luto@kernel.org>
Subject: Re: [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations
Date: Mon, 23 Aug 2021 20:02:55 +0000
Message-ID: <1b49324674fd75294625f725c7f074efd8480efc.camel@intel.com>
In-Reply-To: <20210823132513.15836-1-rppt@kernel.org>

On Mon, 2021-08-23 at 16:25 +0300, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> This is an early prototype for the addition of a cache of pte-mapped
> pages to the page allocator. It survives boot and some cache
> shrinking, but there is still a long way to go before it is ready
> for a non-RFC posting.
> 
> The example use case for the pte-mapped cache is protection of page
> tables, keeping them read-only except for the designated code that
> is allowed to modify the page tables.
> 
> I'd like to get early feedback on the approach to see what would be
> the best way to move forward with something like this.
> 
> This set is x86-specific at the moment because other architectures
> either do not support set_memory APIs that split the direct^w linear
> map (e.g. PowerPC) or only enable set_memory APIs when the linear
> map uses the basic page size (as arm64 does).
> 
> == Motivation ==
> 
> There are use cases that need to remove pages from the direct map,
> or at least map them with 4K granularity. Whenever this is done,
> e.g. with the set_memory/set_direct_map APIs, the PUD- and PMD-sized
> mappings in the direct map are split into smaller pages.
> 
> To reduce the performance hit caused by the fragmentation of the
> direct map, it makes sense to group and/or cache the 4K pages
> removed from the direct map so that the split large pages won't be
> all over the place.
> 
If you tied this into debug page alloc, you shouldn't need to group
the pages. Are you thinking this PKS-less page table usage would be a
security feature or a debug-time thing?
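
For anyone skimming, the split under discussion is triggered by
changing the protection of a single 4K direct map alias. A minimal
sketch of that pattern, using the plain x86 set_memory API (nothing
here is specific to this RFC):

/*
 * Write-protect one 4K page via its direct map alias.  If the page
 * sits inside a 2MB (PMD) mapping of the direct map, set_memory_ro()
 * first splits that mapping into 512 PTE mappings.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <asm/set_memory.h>

static struct page *alloc_ro_page(void)
{
	struct page *page = alloc_page(GFP_KERNEL);

	if (!page)
		return NULL;

	/* May split the covering PMD in the direct map. */
	if (set_memory_ro((unsigned long)page_address(page), 1)) {
		__free_page(page);
		return NULL;
	}
	return page;
}

Each such split permanently fragments that 2MB region unless something
later merges it back.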

> There were RFCs for grouped page allocations for vmalloc permissions
> [1] and for using PKS to protect page tables [2], as well as an
> attempt to use a pool of large pages in secretmem [3].
> 
> == Implementation overview ==
> 
> This set leverages ideas from the patches that added PKS protection
> to page tables, but instead of adding per-user grouped allocations
> it tries to move the cache of pte-mapped pages closer to the page
> allocator.
> 
> The idea is to use a gfp flag that will instruct the page allocator
> to use the cache of pte-mapped pages, because the caller needs to
> remove them from the direct map or change their attributes.
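
As a usage sketch (the flag name is the one patch 3 of this set
introduces; the set_memory call just stands in for whatever the
caller does next):

	/* Ask the allocator for a page that is already PTE-mapped. */
	struct page *page = alloc_page(GFP_KERNEL | __GFP_PTE_MAPPED);

	/* The direct map alias is 4K-mapped, so changing its
	 * protection does not split a large page. */
	if (page)
		set_memory_ro((unsigned long)page_address(page), 1);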
> 
> When the cache is empty there is an attempt to refill it using a
> PMD-sized allocation, so that once the direct map is split we'll be
> able to use all the 4K pages made available by the split.
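
A sketch of that refill step (hypothetical helper and constant names,
not the RFC's actual code):

#define PTE_CACHE_ORDER	(PMD_SHIFT - PAGE_SHIFT)	/* 9 on x86-64 */

/* Refill the cache with one PMD-sized block so that a single direct
 * map split yields 512 usable 4K pages. */
static int pte_cache_refill(struct list_lru *cache)
{
	struct page *page = alloc_pages(GFP_KERNEL, PTE_CACHE_ORDER);
	int i;

	if (!page)
		return -ENOMEM;

	/* Split the covering large mapping once, up front. */
	set_memory_4k((unsigned long)page_address(page),
		      1 << PTE_CACHE_ORDER);

	/* Turn the high-order page into independent order-0 pages. */
	split_page(page, PTE_CACHE_ORDER);
	for (i = 0; i < (1 << PTE_CACHE_ORDER); i++)
		list_lru_add(cache, &page[i].lru);
	return 0;
}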
> 
> If the high-order allocation fails, we fall back to order-0 and
> mark the entire pageblock as pte-mapped. When pages from that
> pageblock are freed to the page allocator they are put into the
> pte-mapped cache. There is also an unimplemented provision to add
> the free pages from such a pageblock to the pte-mapped cache along
> with the page that was allocated, and to cause the split of the
> pageblock.
> 
> For now only order-0 allocations of pte-mapped pages are supported,
> which prevents, for instance, allocation of the PGD with PTI
> enabled.
> 
> The free pages in the cache may be reclaimed using a shrinker, but
> for now they will remain mapped with PTEs in the direct map.
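
The shrinker side would presumably follow the usual pattern; a sketch
with hypothetical names and callback bodies, against the stock
list_lru/shrinker APIs:

static struct list_lru pte_cache_lru;	/* hypothetical */

/* Hypothetical isolate callback that unlinks a page from the cache. */
static enum lru_status pte_cache_isolate(struct list_head *item,
					 struct list_lru_one *lru,
					 spinlock_t *lock, void *arg);

static unsigned long pte_cache_count(struct shrinker *sh,
				     struct shrink_control *sc)
{
	return list_lru_shrink_count(&pte_cache_lru, sc);
}

static unsigned long pte_cache_scan(struct shrinker *sh,
				    struct shrink_control *sc)
{
	/* Pages freed here keep their PTE mappings in the direct map
	 * for now, as noted above. */
	return list_lru_shrink_walk(&pte_cache_lru, sc,
				    pte_cache_isolate, NULL);
}

static struct shrinker pte_cache_shrinker = {
	.count_objects	= pte_cache_count,
	.scan_objects	= pte_cache_scan,
	.seeks		= DEFAULT_SEEKS,
};

(registered with register_shrinker() at init time)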
> 
> == TODOs ==
> 
> Whenever the pte-mapped cache is being shrunk, it is possible to add
> some kind of compaction to move all the free pages into PMD-sized
> chunks, free these chunks at once and restore the large pages in the
> direct map.
I had made a POC to do this a while back that hooked into the buddy
code in the page allocator, where this coalescing is already happening
for freed pages. The problem was that most pages that get their direct
map alias broken end up using a page from the same 2MB page for the
page table in the split. But then the direct map page table never gets
freed, so the large page can never be restored when the freed
allocation page is checked. Grouping permissioned pages OR page tables
would resolve that, and my plan was to try again after something like
this happened. It was just an experiment, but I can share it if you
are interested.

> 
> There is also a possibility to add heuristics and knobs to control
> the greediness of the cache vs. memory pressure, so that freed
> pte-mapped pages won't necessarily be put back into the pte-mapped
> cache.
> 
> Another thing that can be implemented is pre-populating the
> pte-mapped cache at boot time with the free pages that are anyway
> mapped by PTEs.
> 
> == Alternatives ==
> 
> The current implementation uses a single global cache.
> 
> Another option is to have per-user caches, e.g. one for the page
> tables, another for vmalloc, etc. This approach provides better
> control of the permissions of the pages allocated from these caches
> and allows the user to decide when (if at all) the pages can be
> accessed, e.g. for cache compaction. The downside of this approach
> is that it complicates the freeing path. A page allocated from a
> dedicated cache cannot be freed with put_page()/free_page() etc.,
> but has to be freed with a dedicated API, or there should be some
> back pointer in struct page so that the page allocator will know
> what cache this page came from.
This needs to reset the permissions before freeing, so it doesn't seem
too different from freeing them a special way.
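
i.e. roughly this (hypothetical helper; set_memory_rw/__free_page are
the stock APIs):

/* Any dedicated-cache free path already has to restore the default
 * protection before the page can go back to the buddy, so a special
 * free API is not much extra burden. */
static void pte_cache_free_page(struct page *page)
{
	/* Restore the default RW mapping of the direct map alias. */
	set_memory_rw((unsigned long)page_address(page), 1);
	__free_page(page);
}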
> 
> Yet another possibility is to make the pte-mapped cache a
> migratetype of its own. Creating a new migratetype would allow
> higher-order allocations of pte-mapped pages, but I don't have
> enough understanding of page allocator and reclaim internals to
> estimate the complexity associated with this approach.
> 
I've been thinking about two categories of direct map permission
usages.

One is limiting the use of the direct map alias when it's not in use
and the primary alias is getting some other permission. Examples are
modules, secretmem, XPFO, the KVM guest memory unmapping stuff, etc.
In this case re-allocations can share unmapped pages without doing
any expensive maintenance, and it helps to have one big cache. If you
are going to convert pages to 4K and cache them, you might as well
convert them to NP (not present) at the same time, since it's cheap
to restore them or set their permission from that state.

Two is setting permissions on the direct map as the only alias to be
used. This includes this RFC, some PKS usages, but also possibly some
set_pages_uc() callers and the like. It seems that this category
could still make use of a big cache of unmapped pages: just ask for
unmapped pages and convert them without a flush.

So something would have a big cache of grouped unmapped pages that
category one usages could share, and then the smaller category two
allocators could have their own caches that feed on it too. What do
you think? This is regardless of whether they live in the page
allocator or not.
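
Roughly, with made-up names for the shared pool (only the
set_direct_map/set_memory calls are real APIs):

/* Hypothetical shared pool of 4K pages whose direct map aliases are
 * already not-present (NP). */
struct page *unmapped_pool_get(void);

/* Category one: only the primary alias will be used, so the page can
 * be taken as-is. */
static struct page *get_unmapped_page(void)
{
	return unmapped_pool_get();
}

/* Category two: the direct map alias is the one that will be used, so
 * re-establish it with the desired protection.  Since no TLB entry
 * can exist for an NP mapping, this is the step where the flush could
 * in principle be skipped. */
static struct page *get_ro_page(void)
{
	struct page *page = unmapped_pool_get();

	if (page) {
		set_direct_map_default_noflush(page);
		set_memory_ro((unsigned long)page_address(page), 1);
	}
	return page;
}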


> [1] https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com/
> [2] https://lore.kernel.org/lkml/20210505003032.489164-1-rick.p.edgecombe@intel.com
> [3] https://lore.kernel.org/lkml/20210121122723.3446-8-rppt@kernel.org/
> 
> Mike Rapoport (2):
>   mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
>   x86/mm: write protect (most) page tables
> 
> Rick Edgecombe (2):
>   list: Support getting most recent element in list_lru
>   list: Support list head not in object for list_lru
> 
>  arch/Kconfig                            |   8 +
>  arch/x86/Kconfig                        |   1 +
>  arch/x86/boot/compressed/ident_map_64.c |   3 +
>  arch/x86/include/asm/pgalloc.h          |   2 +
>  arch/x86/include/asm/pgtable.h          |  21 +-
>  arch/x86/include/asm/pgtable_64.h       |  33 ++-
>  arch/x86/mm/init.c                      |   2 +-
>  arch/x86/mm/pgtable.c                   |  72 ++++++-
>  include/asm-generic/pgalloc.h           |   2 +-
>  include/linux/gfp.h                     |  11 +-
>  include/linux/list_lru.h                |  26 +++
>  include/linux/mm.h                      |   2 +
>  include/linux/pageblock-flags.h         |  26 +++
>  init/main.c                             |   1 +
>  mm/internal.h                           |   3 +-
>  mm/list_lru.c                           |  38 +++-
>  mm/page_alloc.c                         | 261 +++++++++++++++++++++++-
>  17 files changed, 496 insertions(+), 16 deletions(-)
> 
