From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
To: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"rppt@kernel.org" <rppt@kernel.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"keescook@chromium.org" <keescook@chromium.org>,
	"Weiny, Ira" <ira.weiny@intel.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"x86@kernel.org" <x86@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"rppt@linux.ibm.com" <rppt@linux.ibm.com>,
	"Lutomirski, Andy" <luto@kernel.org>
Subject: Re: [RFC PATCH 0/4] mm/page_alloc: cache pte-mapped allocations
Date: Mon, 23 Aug 2021 20:02:55 +0000
Message-ID: <1b49324674fd75294625f725c7f074efd8480efc.camel@intel.com>
In-Reply-To: <20210823132513.15836-1-rppt@kernel.org>

On Mon, 2021-08-23 at 16:25 +0300, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> This is an early prototype for the addition of a cache of pte-mapped
> pages to the page allocator. It survives boot and some cache
> shrinking, but there is still a long way to go before it is ready
> for a non-RFC posting.
> 
> The example use case for the pte-mapped cache is protection of page
> tables, keeping them read-only except for the designated code that
> is allowed to modify the page tables.
> 
> I'd like to get early feedback on the approach to see what would be
> the best way to move forward with something like this.
> 
> This set is x86-specific at the moment because other architectures
> either do not support set_memory APIs that split the direct^w linear
> map (e.g. PowerPC) or only enable set_memory APIs when the linear
> map uses the basic page size (as arm64 does).
> 
> == Motivation ==
> 
> There are use cases that need to remove pages from the direct map,
> or at least map them with 4K granularity. Whenever this is done,
> e.g. with the set_memory/set_direct_map APIs, the PUD- and PMD-sized
> mappings in the direct map are split into smaller pages.
> 
> To reduce the performance hit caused by the fragmentation of the
> direct map, it makes sense to group and/or cache the 4K pages
> removed from the direct map so that the split large pages won't be
> all over the place.
> 
If you tied this into debug page alloc, you shouldn't need to group
the pages. Are you thinking this PKS-less page table usage would be a
security feature or a debug-time thing?
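
For anyone skimming, the split under discussion is triggered by
changing the protection of a single 4K direct map alias. A minimal
sketch of that pattern, using the plain x86 set_memory API (nothing
here is specific to this RFC):

/*
 * Write-protect one 4K page via its direct map alias.  If the page
 * sits inside a 2MB (PMD) mapping of the direct map, set_memory_ro()
 * first splits that mapping into 512 PTE mappings.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <asm/set_memory.h>

static struct page *alloc_ro_page(void)
{
	struct page *page = alloc_page(GFP_KERNEL);

	if (!page)
		return NULL;

	/* May split the covering PMD in the direct map. */
	if (set_memory_ro((unsigned long)page_address(page), 1)) {
		__free_page(page);
		return NULL;
	}
	return page;
}

Each such split permanently fragments that 2MB region unless something
later merges it back.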

> There were RFCs for grouped page allocations for vmalloc permissions
> [1] and for using PKS to protect page tables [2], as well as an
> attempt to use a pool of large pages in secretmem [3].
> 
> == Implementation overview ==
> 
> This set leverages ideas from the patches that added PKS protection
> to page tables, but instead of adding per-user grouped allocations
> it tries to move the cache of pte-mapped pages closer to the page
> allocator.
> 
> The idea is to use a gfp flag that will instruct the page allocator
> to use the cache of pte-mapped pages, because the caller needs to
> remove them from the direct map or change their attributes.
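
As a usage sketch (the flag name is the one patch 3 of this set
introduces; the set_memory call just stands in for whatever the
caller does next):

	/* Ask the allocator for a page that is already PTE-mapped. */
	struct page *page = alloc_page(GFP_KERNEL | __GFP_PTE_MAPPED);

	/* The direct map alias is 4K-mapped, so changing its
	 * protection does not split a large page. */
	if (page)
		set_memory_ro((unsigned long)page_address(page), 1);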
> 
> When the cache is empty there is an attempt to refill it using a
> PMD-sized allocation, so that once the direct map is split we'll be
> able to use all the 4K pages made available by the split.
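
A sketch of that refill step (hypothetical helper and constant names,
not the RFC's actual code):

#define PTE_CACHE_ORDER	(PMD_SHIFT - PAGE_SHIFT)	/* 9 on x86-64 */

/* Refill the cache with one PMD-sized block so that a single direct
 * map split yields 512 usable 4K pages. */
static int pte_cache_refill(struct list_lru *cache)
{
	struct page *page = alloc_pages(GFP_KERNEL, PTE_CACHE_ORDER);
	int i;

	if (!page)
		return -ENOMEM;

	/* Split the covering large mapping once, up front. */
	set_memory_4k((unsigned long)page_address(page),
		      1 << PTE_CACHE_ORDER);

	/* Turn the high-order page into independent order-0 pages. */
	split_page(page, PTE_CACHE_ORDER);
	for (i = 0; i < (1 << PTE_CACHE_ORDER); i++)
		list_lru_add(cache, &page[i].lru);
	return 0;
}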
> 
> If the high-order allocation fails, we fall back to order-0 and
> mark the entire pageblock as pte-mapped. When pages from that
> pageblock are freed to the page allocator they are put into the
> pte-mapped cache. There is also an unimplemented provision to add
> the free pages from such a pageblock to the pte-mapped cache along
> with the page that was allocated, and to cause the split of the
> pageblock.
> 
> For now only order-0 allocations of pte-mapped pages are supported,
> which prevents, for instance, allocation of the PGD with PTI
> enabled.
> 
> The free pages in the cache may be reclaimed using a shrinker, but
> for now they will remain mapped with PTEs in the direct map.
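
The shrinker side would presumably follow the usual pattern; a sketch
with hypothetical names and callback bodies, against the stock
list_lru/shrinker APIs:

static struct list_lru pte_cache_lru;	/* hypothetical */

/* Hypothetical isolate callback that unlinks a page from the cache. */
static enum lru_status pte_cache_isolate(struct list_head *item,
					 struct list_lru_one *lru,
					 spinlock_t *lock, void *arg);

static unsigned long pte_cache_count(struct shrinker *sh,
				     struct shrink_control *sc)
{
	return list_lru_shrink_count(&pte_cache_lru, sc);
}

static unsigned long pte_cache_scan(struct shrinker *sh,
				    struct shrink_control *sc)
{
	/* Pages freed here keep their PTE mappings in the direct map
	 * for now, as noted above. */
	return list_lru_shrink_walk(&pte_cache_lru, sc,
				    pte_cache_isolate, NULL);
}

static struct shrinker pte_cache_shrinker = {
	.count_objects	= pte_cache_count,
	.scan_objects	= pte_cache_scan,
	.seeks		= DEFAULT_SEEKS,
};

(registered with register_shrinker() at init time)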
> 
> == TODOs ==
> 
> Whenever the pte-mapped cache is being shrunk, it is possible to add
> some kind of compaction to move all the free pages into PMD-sized
> chunks, free these chunks at once and restore the large pages in the
> direct map.
I had made a POC to do this a while back that hooked into the buddy
code in the page allocator, where this coalescing is already happening
for freed pages. The problem was that most pages that get their direct
map alias broken end up using a page from the same 2MB page for the
page table in the split. But then the direct map page table never gets
freed, so the large page can never be restored when the freed
allocation page is checked. Grouping permissioned pages OR page tables
would resolve that, and my plan was to try again after something like
this happened. It was just an experiment, but I can share it if you
are interested.

> 
> There is also a possibility to add heuristics and knobs to control
> the greediness of the cache vs. memory pressure, so that freed
> pte-mapped pages won't necessarily be put back into the pte-mapped
> cache.
> 
> Another thing that can be implemented is pre-populating the
> pte-mapped cache at boot time with the free pages that are anyway
> mapped by PTEs.
> 
> == Alternatives ==
> 
> The current implementation uses a single global cache.
> 
> Another option is to have per-user caches, e.g. one for the page
> tables, another for vmalloc, etc. This approach provides better
> control of the permissions of the pages allocated from these caches
> and allows the user to decide when (if at all) the pages can be
> accessed, e.g. for cache compaction. The downside of this approach
> is that it complicates the freeing path. A page allocated from a
> dedicated cache cannot be freed with put_page()/free_page() etc.,
> but has to be freed with a dedicated API, or there should be some
> back pointer in struct page so that the page allocator will know
> what cache this page came from.
This needs to reset the permissions before freeing, so it doesn't seem
too different from freeing them a special way.
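
i.e. roughly this (hypothetical helper; set_memory_rw/__free_page are
the stock APIs):

/* Any dedicated-cache free path already has to restore the default
 * protection before the page can go back to the buddy, so a special
 * free API is not much extra burden. */
static void pte_cache_free_page(struct page *page)
{
	/* Restore the default RW mapping of the direct map alias. */
	set_memory_rw((unsigned long)page_address(page), 1);
	__free_page(page);
}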
> 
> Yet another possibility is to make the pte-mapped cache a
> migratetype of its own. Creating a new migratetype would allow
> higher-order allocations of pte-mapped pages, but I don't have
> enough understanding of page allocator and reclaim internals to
> estimate the complexity associated with this approach.
> 
I've been thinking about two categories of direct map permission
usages.

One is limiting the use of the direct map alias when it's not in use
and the primary alias is getting some other permission. Examples are
modules, secretmem, XPFO, the KVM guest memory unmapping stuff, etc.
In this case re-allocations can share unmapped pages without doing
any expensive maintenance, and it helps to have one big cache. If you
are going to convert pages to 4K and cache them, you might as well
convert them to NP (not present) at the same time, since it's cheap
to restore them or set their permission from that state.

Two is setting permissions on the direct map as the only alias to be
used. This includes this RFC, some PKS usages, but also possibly some
set_pages_uc() callers and the like. It seems that this category
could still make use of a big cache of unmapped pages: just ask for
unmapped pages and convert them without a flush.

So something would have a big cache of grouped unmapped pages that
category one usages could share, and then the smaller category two
allocators could have their own caches that feed on it too. What do
you think? This is regardless of whether they live in the page
allocator or not.
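
Roughly, with made-up names for the shared pool (only the
set_direct_map/set_memory calls are real APIs):

/* Hypothetical shared pool of 4K pages whose direct map aliases are
 * already not-present (NP). */
struct page *unmapped_pool_get(void);

/* Category one: only the primary alias will be used, so the page can
 * be taken as-is. */
static struct page *get_unmapped_page(void)
{
	return unmapped_pool_get();
}

/* Category two: the direct map alias is the one that will be used, so
 * re-establish it with the desired protection.  Since no TLB entry
 * can exist for an NP mapping, this is the step where the flush could
 * in principle be skipped. */
static struct page *get_ro_page(void)
{
	struct page *page = unmapped_pool_get();

	if (page) {
		set_direct_map_default_noflush(page);
		set_memory_ro((unsigned long)page_address(page), 1);
	}
	return page;
}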


> [1] https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com/
> [2] https://lore.kernel.org/lkml/20210505003032.489164-1-rick.p.edgecombe@intel.com
> [3] https://lore.kernel.org/lkml/20210121122723.3446-8-rppt@kernel.org/
> 
> Mike Rapoport (2):
>   mm/page_alloc: introduce __GFP_PTE_MAPPED flag to allocate pte-mapped pages
>   x86/mm: write protect (most) page tables
> 
> Rick Edgecombe (2):
>   list: Support getting most recent element in list_lru
>   list: Support list head not in object for list_lru
> 
>  arch/Kconfig                            |   8 +
>  arch/x86/Kconfig                        |   1 +
>  arch/x86/boot/compressed/ident_map_64.c |   3 +
>  arch/x86/include/asm/pgalloc.h          |   2 +
>  arch/x86/include/asm/pgtable.h          |  21 +-
>  arch/x86/include/asm/pgtable_64.h       |  33 ++-
>  arch/x86/mm/init.c                      |   2 +-
>  arch/x86/mm/pgtable.c                   |  72 ++++++-
>  include/asm-generic/pgalloc.h           |   2 +-
>  include/linux/gfp.h                     |  11 +-
>  include/linux/list_lru.h                |  26 +++
>  include/linux/mm.h                      |   2 +
>  include/linux/pageblock-flags.h         |  26 +++
>  init/main.c                             |   1 +
>  mm/internal.h                           |   3 +-
>  mm/list_lru.c                           |  38 +++-
>  mm/page_alloc.c                         | 261 +++++++++++++++++++++++-
>  17 files changed, 496 insertions(+), 16 deletions(-)
> 
