All of lore.kernel.org
 help / color / mirror / Atom feed
From: Song Liu <songliubraving@meta.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	Kernel Team <Kernel-team@fb.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"song@kernel.org" <song@kernel.org>, "hch@lst.de" <hch@lst.de>,
	"x86@kernel.org" <x86@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"Hansen, Dave" <dave.hansen@intel.com>,
	"urezki@gmail.com" <urezki@gmail.com>
Subject: Re: [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text
Date: Tue, 11 Oct 2022 16:25:51 +0000	[thread overview]
Message-ID: <2B66E2E7-7D32-418C-9DFD-1E17180300B4@fb.com> (raw)
In-Reply-To: <3842f1e7cfdde4f848e164872f62c0c1da654fec.camel@intel.com>



> On Oct 10, 2022, at 1:09 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Mon, 2022-10-10 at 19:08 +0000, Song Liu wrote:
>>> On Oct 10, 2022, at 11:32 AM, Edgecombe, Rick P <
>>> rick.p.edgecombe@intel.com> wrote:
>>> 
>>> On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
>>>> On x86 kernel, we allocate 2MB pages for kernel text up to
>>>> round_down(_etext, 2MB). Therefore, some of the kernel text is
>>>> still
>>>> on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
>>>> round_up(_etext, 2MB), and use the rest of the page for modules
>>>> and
>>>> BPF programs.
>>>> 
>>>> Here is an example:
>>>> 
>>>> [root@eth50-1 ~]# grep _etext /proc/kallsyms
>>>> ffffffff82202a08 T _etext
>>>> 
>>>> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
>>>> ffffffff8220f920 t
>>>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
>>>> ffffffff8220fa28 t
>>>> bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
>>>> ffffffff8220fad4 t
>>>> bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
>>>> 
>>>> [root@eth50-1 ~]#  grep 0xffffffff82200000
>>>> /sys/kernel/debug/page_tables/kernel
>>>> 0xffffffff82200000-
>>>> 0xffffffff82400000     2M     ro   PSE         x  pmd
>>>> 
>>>> [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
>>>> ffffffff822ba910 t xfs_flush_inodes_worker      [xfs]
>>>> ffffffff822bc580 t xfs_flush_inodes     [xfs]
>>>> 
>>>> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel
>>>> text,
>>>> xfs
>>>> module, and bpf programs.
>>> 
>>> Can this memory range be freed as part of a vfree_exec() call then?
>>> Does vmalloc actually try to unmap it? If so, it could get
>>> complicated
>>> with PTI.
>>> 
>>> It probably should be a special case that never gets fully freed.
>> 
>> Right, this is never freed. 
> 
> Can we get a comment somewhere highlighting how this is avoided?

vfree_exec() only frees the range when we get PMD aligned range >=
PMD_SIZE. Since this range is smaller than PMD_SIZe, it is never
freed. I will add a comment for this in the never version. 
 
> 
> Maybe this is just me missing some vmalloc understanding, but this
> pointer to an all zero vm_struct seems weird too. Are there other vmap
> allocations like this? Which vmap APIs work with this and which don't?

There are two vmap trees at the moment: free_area_ tree and 
vmap_area_ tree. free_area_ tree uses vmap->subtree_max_size, while 
vmap_area_ tree contains vmap backed by vm_struct, and thus uses 
vmap->vm. 

This set add a new tree, free_text_area_. This tree is different to 
the other two, as it uses subtree_max_size, and it is also backed 
by vm_struct. To handle this requirement without growing vmap_struct,
we introduced all_text_vm to store the vm_struct for free_text_area_
tree. 

free_text_area_ tree is different to vmap_area_ tree. Each vmap in
vmap_area_ tree has its own vm_struct (1 to 1 mapping), while 
multiple vmap in free_text_area_ tree map to a single vm_struct.

Also, free_text_area_ handles granularity < PAGE_SIZE; while the
other two trees only work with PAGE_SIZE aligned memory. 

Does this answer your questions? 

> 
>> 
>>> 
>>>> 
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> ---
>>>> arch/x86/mm/init_64.c |  3 ++-
>>>> mm/vmalloc.c          | 24 ++++++++++++++++++++++++
>>>> 2 files changed, 26 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>>> index 0fe690ebc269..d94f196c541a 100644
>>>> --- a/arch/x86/mm/init_64.c
>>>> +++ b/arch/x86/mm/init_64.c
>>>> @@ -1367,12 +1367,13 @@ int __init
>>>> deferred_page_init_max_threads(const struct cpumask
>>>> *node_cpumask)
>>>> 
>>>> int kernel_set_to_readonly;
>>>> 
>>>> +#define PMD_ALIGN(x)        (((unsigned long)(x) + (PMD_SIZE -
>>>> 1)) &
>>>> PMD_MASK)
>>>> void mark_rodata_ro(void)
>>>> {
>>>>      unsigned long start = PFN_ALIGN(_text);
>>>>      unsigned long rodata_start = PFN_ALIGN(__start_rodata);
>>>>      unsigned long end = (unsigned
>>>> long)__end_rodata_hpage_align;
>>>> -    unsigned long text_end = PFN_ALIGN(_etext);
>>>> +    unsigned long text_end = PMD_ALIGN(_etext);
>>> 
>>> This should probably have more logic and adjustments. If etext is
>>> PMD
>>> aligned, some of the stuff outside the diff won't do anything.
>> 
>> Hmm.. I don't quite follow this comment. If the etext is PMD
>> aligned, 
>> we can still use vmalloc_exec to allocate memory. So it shouldn't 
>> matter, no?
> 
> Maybe this doesn't matter since PMD alignment must happen naturally
> sometimes. I was just noticing the attempts to operate on this region
> between etext and start_rodata (free_init_pages(), etc). If this was
> never not PMD aligned they could be dropped. But if you are going to
> adjust the behavior for !CONFIG_MODULES, etc, then it is still needed.

I guess we can add special handling for !CONFIG_MODULES && !CONFIG_BPF
&& !CONFIG_FTRACE cases, where we will not allocate this memory. 

Thanks,
Song


  reply	other threads:[~2022-10-11 16:26 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-07 23:43 [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
2022-10-07 23:43 ` [RFC v2 1/4] vmalloc: introduce vmalloc_exec and vfree_exec Song Liu
2022-10-10 18:13   ` Edgecombe, Rick P
2022-10-10 19:04     ` Song Liu
2022-10-10 19:59       ` Edgecombe, Rick P
2022-10-07 23:43 ` [RFC v2 2/4] bpf: use vmalloc_exec Song Liu
2022-10-07 23:43 ` [RFC v2 3/4] modules, x86: use vmalloc_exec for module core Song Liu
2022-10-14  3:48   ` Aaron Lu
2022-10-14  6:07     ` Song Liu
2022-10-14 15:42   ` Edgecombe, Rick P
2022-10-14 18:26     ` Song Liu
2022-10-14 18:41       ` Edgecombe, Rick P
2022-10-07 23:43 ` [RFC v2 4/4] vmalloc_exec: share a huge page with kernel text Song Liu
2022-10-10 18:32   ` Edgecombe, Rick P
2022-10-10 19:08     ` Song Liu
2022-10-10 20:09       ` Edgecombe, Rick P
2022-10-11 16:25         ` Song Liu [this message]
2022-10-11 20:40           ` Edgecombe, Rick P
2022-10-12  5:37             ` Song Liu
2022-10-12 18:38               ` Edgecombe, Rick P
2022-10-12 19:01                 ` Song Liu
2022-10-08  0:17 ` [RFC v2 0/4] vmalloc_exec for modules and BPF programs Song Liu
2022-10-12 19:03 ` Song Liu
2022-10-17  7:26 ` Christoph Hellwig
2022-10-17 16:23   ` Song Liu
2022-10-18 14:50     ` Christoph Hellwig
2022-10-18 15:05       ` Song Liu
2022-10-18 15:40         ` Christoph Hellwig
2022-10-18 15:40           ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2B66E2E7-7D32-418C-9DFD-1E17180300B4@fb.com \
    --to=songliubraving@meta.com \
    --cc=Kernel-team@fb.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=hch@lst.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peterz@infradead.org \
    --cc=rick.p.edgecombe@intel.com \
    --cc=song@kernel.org \
    --cc=urezki@gmail.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.