netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
To: "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"song@kernel.org" <song@kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Cc: "songliubraving@fb.com" <songliubraving@fb.com>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"ast@kernel.org" <ast@kernel.org>,
	"kernel-team@fb.com" <kernel-team@fb.com>,
	"andrii@kernel.org" <andrii@kernel.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"iii@linux.ibm.com" <iii@linux.ibm.com>
Subject: Re: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP
Date: Sat, 26 Mar 2022 00:06:45 +0000	[thread overview]
Message-ID: <5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com> (raw)
In-Reply-To: <20220204185742.271030-2-song@kernel.org>

On Fri, 2022-02-04 at 10:57 -0800, Song Liu wrote:
> From: Song Liu <songliubraving@fb.com>
> 
> This enables module_alloc() to allocate huge page for 2MB+ requests.
> To check the difference of this change, we need enable config
> CONFIG_PTDUMP_DEBUGFS, and call module_alloc(2MB). Before the change,
> /sys/kernel/debug/page_tables/kernel shows pte for this map. With the
> change, /sys/kernel/debug/page_tables/ show pmd for thie map.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  arch/x86/Kconfig | 1 +
>  1 file changed, 1 insertion(+)

Hi,

I just saw this upstream today. Glad to see this functionality, but I
think turning on huge vmalloc pages for x86 needs a bit more. I’ll
describe a couple possible failure modes I haven’t actually tested.

One problem is that the direct map permission reset part in vmalloc
assumes any special permissioned pages are mapped 4k on the direct map.
Otherwise the operation could fail to reset a page RW if a PTE page
allocation fails when it tries to split the page to toggle a 4k sized
region NP/P. If you are not familiar, x86 CPA generally leaves the
direct map page sizes mirroring the primary alias (vmalloc). So once
vmalloc has huge pages, the special permissioned direct map aliases
will have them too. This limitation of HAVE_ARCH_HUGE_VMALLOC is
actually hinted about in the Kconfig comments, but I guess it wasn’t
specific that x86 has these properties.

I think to make the vmalloc resetting part safe:
1. set_direct_map_invalid/default() needs to support multiple pages
like this[0].
2. vm_remove_mappings() needs to call them with the correct page size
in the hpage case so they don't cause a split[1].
3. Then hibernate needs to be blocked during this operation so it
doesn’t encounter the now sometimes huge NP pages, which it can’t
handle. Not sure what the right way to do this is, but potentially like
in the diff below[1].

Another problem is that CPA will sometimes now split pages of vmalloc
mappings in cases where it sets a region of an allocation to a
different permission than the rest (for example regular modules calling
set_memory_x() on the text section). Before this change, these couldn’t
fail since the module space mapping would never require a split.
Modules doesn’t check for failure there, so I’m thinking now it would
proceed to try to execute NX memory if the split failed. It could only
happen on allocation of especially large modules. Maybe it should just
be avoided for now by having regular module allocations pass
VM_NO_HUGE_VMAP on x86. And BPF could call __vmalloc_node_range()
directly to get 2MB vmallocs.

[0] 
https://lore.kernel.org/lkml/20210208084920.2884-5-rppt@kernel.org/#t

[1] Untested, but something like this possibly:
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 99e0f3e8d1a5..97c4ca3a29b1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -42,6 +42,7 @@
 #include <linux/sched/mm.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
+#include <linux/suspend.h>
 
 #include "internal.h"
 #include "pgalloc-track.h"
@@ -2241,7 +2242,7 @@ EXPORT_SYMBOL(vm_map_ram);
 
 static struct vm_struct *vmlist __initdata;
 
-static inline unsigned int vm_area_page_order(struct vm_struct *vm)
+static inline unsigned int vm_area_page_order(const struct vm_struct
*vm)
 {
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
        return vm->page_order;
@@ -2560,12 +2561,12 @@ struct vm_struct *remove_vm_area(const void
*addr)
 static inline void set_area_direct_map(const struct vm_struct *area,
                                       int (*set_direct_map)(struct
page *page))
 {
+       unsigned int page_order = vm_area_page_order(area);
        int i;
 
-       /* HUGE_VMALLOC passes small pages to set_direct_map */
-       for (i = 0; i < area->nr_pages; i++)
+       for (i = 0; i < area->nr_pages; i += 1U << page_order)
                if (page_address(area->pages[i]))
-                       set_direct_map(area->pages[i]);
+                       set_direct_map(area->pages[i], 1U <<
page_order);
 }
 
 /* Handle removing and resetting vm mappings related to the vm_struct.
*/
@@ -2592,6 +2593,10 @@ static void vm_remove_mappings(struct vm_struct
*area, int deallocate_pages)
                return;
        }
 
+       /* Hibernate can't handle large NP pages */
+       if (page_order)
+               lock_system_sleep();
+
        /*
         * If execution gets here, flush the vm mapping and reset the
direct
         * map. Find the start and end range of the direct mappings to
make sure
@@ -2617,6 +2622,9 @@ static void vm_remove_mappings(struct vm_struct
*area, int deallocate_pages)
        set_area_direct_map(area, set_direct_map_invalid_noflush);
        _vm_unmap_aliases(start, end, flush_dmap);
        set_area_direct_map(area, set_direct_map_default_noflush);
+
+       if (page_order)
+               unlock_system_sleep();
 }
 
 static void __vunmap(const void *addr, int deallocate_pages)

  parent reply	other threads:[~2022-03-26  0:07 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20220204185742.271030-1-song@kernel.org>
2022-02-08  2:30 ` [PATCH v9 bpf-next 0/9] bpf_prog_pack allocator patchwork-bot+netdevbpf
     [not found] ` <20220204185742.271030-2-song@kernel.org>
2022-03-26  0:06   ` Edgecombe, Rick P [this message]
2022-03-28 23:27     ` [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP Song Liu
2022-03-29  0:18       ` Edgecombe, Rick P
2022-03-29  8:23         ` Song Liu
2022-03-29 18:39           ` Edgecombe, Rick P
2022-03-29 19:13             ` Song Liu
2022-03-29 21:36               ` Edgecombe, Rick P
2022-03-29 22:12                 ` Song Liu
2022-03-26 18:46   ` BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP) Paul Menzel
2022-03-27 10:36     ` Paul Menzel
2022-03-28  6:37       ` Song Liu
2022-03-28  6:51         ` Paul Menzel
2022-03-28 19:24           ` Song Liu
2022-03-28 20:14             ` Paul Menzel
2022-03-28 21:57               ` Song Liu
2022-03-28 19:21         ` Edgecombe, Rick P
     [not found] ` <20220204185742.271030-10-song@kernel.org>
2022-02-08  2:24   ` [PATCH v9 bpf-next 9/9] bpf, x86_64: use bpf_jit_binary_pack_alloc Alexei Starovoitov
2022-07-03  3:02   ` Andres Freund
2022-07-03  3:03     ` Alexei Starovoitov
2022-07-03  3:14       ` Andres Freund

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com \
    --to=rick.p.edgecombe@intel.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=iii@linux.ibm.com \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=song@kernel.org \
    --cc=songliubraving@fb.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).