Date: Tue, 31 Aug 2021 13:14:16 +0300
From: Mike Rapoport
To: Rick Edgecombe
Cc: dave.hansen@intel.com, luto@kernel.org, peterz@infradead.org,
	x86@kernel.org, akpm@linux-foundation.org, keescook@chromium.org,
	shakeelb@google.com, vbabka@suse.cz, linux-mm@kvack.org,
	linux-hardening@vger.kernel.org, kernel-hardening@lists.openwall.com,
	ira.weiny@intel.com, dan.j.williams@intel.com,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 17/19] x86/mm/cpa: PKS protect direct map page tables
References: <20210830235927.6443-1-rick.p.edgecombe@intel.com>
 <20210830235927.6443-18-rick.p.edgecombe@intel.com>
In-Reply-To: <20210830235927.6443-18-rick.p.edgecombe@intel.com>

On Mon, Aug 30, 2021 at 04:59:25PM -0700, Rick Edgecombe wrote:
> Protecting direct map page tables is a bit more difficult because a page
> table may be needed for a page split as part of setting the PKS
> permission on the new page table. So in the case of an empty cache of
> page tables, the page table allocator could get into a situation where
> it cannot create any more page tables.
> 
> Several solutions were looked at:
> 
> 1. Break the direct map with pages allocated from the large page being
> converted to PKS. This would result in a window where the table could be
> written to right before it was linked into the page tables. It also
> depends on high order pages being available, and so would regress from
> the unprotected behavior in that respect.
> 
> 2. Hold some page tables in reserve to be able to break the large page
> for a new 2MB page, but if there are no 2MB pages available, we may need
> to add a single page to the cache, in which case we would use up the
> reserve of page tables needed to break a new page without getting enough
> page tables back to replenish the reserve.
> 
> 3. Always map the direct map at 4k when protecting page tables so that
> pages don't need to be broken to map them with a PKS key. This would have
> an undesirable performance cost.
> 
> 4. Lastly, the strategy employed in this patch: have a separate cache of
> page tables used just for the direct map. Early in boot, squirrel away
> enough page tables to map the direct map at 4k. This comes with the same
> memory overhead as mapping the direct map at 4k, but gets the other
> benefits of mapping the direct map as large pages.
> 
> There is also the problem of protecting page tables that are allocated
> during boot. Instead of recording the tables to protect later, create a
> page table traversal infrastructure to walk every page table in init_mm
> and apply protection. This also covers non-direct map odds-and-ends page
> tables that are allocated during boot. The existing page table walking
> code in pagewalk.c cannot be used for this purpose because there are no
> actual VMAs for all of the kernel address space.
> 
> The algorithm for protecting the direct map page table cache, while also
> allocating from it for direct map splits, is described in the comments of
> init_pks_dmap_tables().
> 
> Signed-off-by: Rick Edgecombe
> ---
>  arch/x86/include/asm/set_memory.h |   2 +
>  arch/x86/mm/init.c                |  89 ++++++++++
>  arch/x86/mm/pat/set_memory.c      | 263 +++++++++++++++++++++++++++++-
>  3 files changed, 350 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index 1ba2fb45ed05..9f8d0d0ae063 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -90,6 +90,8 @@ bool kernel_page_present(struct page *page);
>  
>  extern int kernel_set_to_readonly;
>  
> +void add_dmap_table(unsigned long addr);
> +
>  #ifdef CONFIG_X86_64
>  /*
>   * Prevent speculative access to the page by either unmapping
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index c8933c6d5efd..a91696e3da96 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -6,6 +6,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  
>  #include
> @@ -26,6 +27,7 @@
>  #include
>  #include
>  #include
> +#include
>  
>  /*
>   * We need to define the tracepoints somewhere, and tlb.c
> @@ -119,6 +121,17 @@ __ref void *alloc_low_pages(unsigned int num)
>  	if (after_bootmem) {
>  		unsigned int order;
>  
> +		if (cpu_feature_enabled(X86_FEATURE_PKS_TABLES)) {
> +			struct page *page;
> +
> +			/* 64 bit only allocates order 0 pages */
> +			WARN_ON(num != 1);
> +
> +			page = alloc_table(GFP_ATOMIC | __GFP_ZERO);
> +			if (!page)
> +				return NULL;
> +			return (void *)page_address(page);
> +		}
>  		order = get_order((unsigned long)num << PAGE_SHIFT);
>  		return (void *)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
>  	}
> @@ -504,6 +517,79 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
>  	return false;
>  }
>  
> +#ifdef CONFIG_PKS_PG_TABLES
> +/* Page tables needed in bytes */
> +static u64 calc_tables_needed(unsigned int size)
> +{
> +	unsigned int puds = size >> PUD_SHIFT;
> +	unsigned int pmds = size >> PMD_SHIFT;
> +
> +	/*
> +	 * Catch if direct map ever might need more page tables to split
> +	 * down to 4k.
> +	 */
> +	BUILD_BUG_ON(p4d_huge(foo));
> +	BUILD_BUG_ON(pgd_huge(foo));
> +
> +	return (puds + pmds) << PAGE_SHIFT;
> +}
> +
> +/*
> + * If pre boot, reserve large pages from memory that will be mapped. It's ok that this is not
> + * mapped as PKS, other init code in CPA will handle the conversion.
> + */
> +static unsigned int __init reserve_pre_boot(u64 start, u64 end)
> +{
> +	u64 cur = memblock_find_in_range(start, end, HPAGE_SIZE, HPAGE_SIZE);
> +	int i;

Please use memblock_phys_alloc_range() here. Besides, it seems these
reserved pages are not accessed until late_initcall time, so there is no
need to limit the allocation to already mapped areas;
memblock_alloc_raw() would suffice.

> +
> +	if (!cur)
> +		return 0;
> +	memblock_reserve(cur, HPAGE_SIZE);
> +	for (i = 0; i < HPAGE_SIZE; i += PAGE_SIZE)
> +		add_dmap_table((unsigned long)__va(cur + i));
> +	return HPAGE_SIZE;
> +}
> +
> +/* If post boot, memblock is not available. Just reserve from other memory regions */
> +static unsigned int __init reserve_post_boot(void)
> +{
> +	struct page *page = alloc_table(GFP_KERNEL);
> +
> +	if (!page)
> +		return 0;
> +
> +	add_dmap_table((unsigned long)page_address(page));

The add_dmap_table() calls use casts everywhere, maybe make it
add_dmap_table(void *)?

> +
> +	return PAGE_SIZE;
> +}
> +
> +static void __init reserve_page_tables(u64 start, u64 end)
> +{
> +	u64 reserve_size = calc_tables_needed(end - start);
> +	u64 reserved = 0;
> +	u64 cur_reserved;
> +
> +	while (reserved < reserve_size) {
> +		if (after_bootmem)
> +			cur_reserved = reserve_post_boot();
> +		else
> +			cur_reserved = reserve_pre_boot(start, end);
> +
> +		if (!cur_reserved) {
> +			WARN(1, "Could not reserve direct map page tables %llu/%llu\n",
> +			     reserved,
> +			     reserve_size);
> +			return;
> +		}
> +
> +		reserved += cur_reserved;
> +	}
> +}
> +#else
> +static inline void reserve_page_tables(u64 start, u64 end) { }
> +#endif
> +
>  /*
>   * Setup the direct mapping of the physical memory at PAGE_OFFSET.
>   * This runs before bootmem is initialized and gets pages directly from
> @@ -529,6 +615,9 @@ unsigned long __ref init_memory_mapping(unsigned long start,
>  
>  	add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
>  
> +	if (cpu_feature_enabled(X86_FEATURE_PKS_TABLES))
> +		reserve_page_tables(start, end);
> +
>  	return ret >> PAGE_SHIFT;
>  }

-- 
Sincerely yours,
Mike.
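P.S. For illustration only, a rough, untested sketch of one way the
memblock_phys_alloc_range() suggestion above could look (it is not part of
the posted patch; the rest of the function is assumed unchanged, and the
explicit memblock_reserve() goes away because memblock_phys_alloc_range()
both finds and reserves the range):

static unsigned int __init reserve_pre_boot(u64 start, u64 end)
{
	/* Allocate and reserve one huge page worth of page tables */
	u64 cur = memblock_phys_alloc_range(HPAGE_SIZE, HPAGE_SIZE,
					    start, end);
	int i;

	if (!cur)
		return 0;

	/* The range is already reserved, just hand the pages to the cache */
	for (i = 0; i < HPAGE_SIZE; i += PAGE_SIZE)
		add_dmap_table((unsigned long)__va(cur + i));
	return HPAGE_SIZE;
}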