From: Song Liu <song@kernel.org>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, peterz@infradead.org,
akpm@linux-foundation.org, x86@kernel.org, hch@lst.de,
rick.p.edgecombe@intel.com, aaron.lu@intel.com, rppt@kernel.org,
mcgrof@kernel.org
Subject: Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
Date: Thu, 1 Dec 2022 11:31:43 -0800
Message-ID: <CAPhsuW5g45D+CFHBYR53nR17zG3dJ=3UJem-GCJwT0v6YCsxwg@mail.gmail.com>
In-Reply-To: <87v8mvsd8d.ffs@tglx>
Hi Thomas,
Thanks for these insights! They are really helpful!
On Thu, Dec 1, 2022 at 1:08 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Song!
>
> On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
> > On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> You are not making anything easier. You are violating the basic
> >> engineering principle of "Fix the root cause, not the symptom".
> >>
> >
> > I am not sure what is the root cause and the symptom here.
>
[...]
>
> This made me look at your allocator again:
>
> > +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> > +#define EXEC_MEM_START MODULES_VADDR
> > +#define EXEC_MEM_END MODULES_END
> > +#else
> > +#define EXEC_MEM_START VMALLOC_START
> > +#define EXEC_MEM_END VMALLOC_END
> > +#endif
>
> The #else part is completely broken on x86/64 and any other
> architecture, which has PC relative restricted displacement.
Yeah, the #else part is just to make it build. It is not really usable.
>
> Even if modules are disabled in Kconfig the only safe place to allocate
> executable kernel text from (on these architectures) is the modules
> address space. The ISA restrictions do not go magically away when
> modules are disabled.
>
> In the early version of the SKX retbleed mitigation work I had
>
> https://lore.kernel.org/all/20220716230953.442937066@linutronix.de
>
> exactly to handle this correctly for the !MODULE case. It went nowhere
> as we did not need the trampolines in the final version.
I remember there was other work to make module_alloc available for
ftrace, etc. without CONFIG_MODULES. One of those versions could work here.
>
> This is why Peter suggested to 'split' the module address range into a
> top down and bottom up part:
>
> https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
>
> That obviously separates text and data, but keeps everything within the
> defined working range.
>
> It immediately solves the text problem for _all_ module_alloc() users
> and still leaves the data split into 4k pages due to RO/RW sections.
>
> But after staring at it for a while I think this top down and bottom up
> dance is too much effort for not much gain. The module address space is
> sized generously, so the straightforward solution is to split that
> space into two blocks and use them to allocate text and data separately.
>
> The rest of Peter's suggestions how to migrate there still apply.
>
> The init sections of a module are obviously separate as they are freed
> after the module is initialized, but they are not really special either.
> Today they leave holes in the address range. With the new scheme these
> holes will be in the memory backed large mapping, but I don't see a real
> issue with that, especially as those holes at least in text can be
> reused for small allocations (kprobes, trace, bpf).
>
> As a logical next step we make that three blocks and allocate text,
> data and rodata separately, which will preserve the large mappings for
> text and data. rodata still needs to be split because we need a space to
> accommodate ro_after_init data.
>
> Alternatively, instead of splitting the module address space, the
> allocation mechanism can keep track of the types (text, data, rodata)
> and manage large mapping blocks per type. There are pros and cons for
> both approaches, so that needs some thought.
AFAICT, the new allocator (let's call it module_alloc_new here)
requires quite different logic from the existing vmalloc (or
module_alloc) logic:

1. vmalloc works at PAGE_SIZE granularity at minimum, while ftrace,
   bpf, etc. would benefit from a much smaller granularity.
2. vmalloc maintains a 1-to-1 mapping between a virtual address range
   (vmap_area in vmap_area_root) and physical pages (vm_struct), while
   module_alloc_new allocates physical pages in 2MB chunks and
   maintains multiple vmap_areas within a single 2MB chunk.

To solve this, I introduced a new tree, free_text_area_root. Address
ranges in this tree are backed with ROX physical pages, but not yet
handed out to any user. I think some logic like this is always needed.
With this logic in place, I think we don't really need to split the
module address space. Instead, we can have 3 trees:

    free_module_text_area_root;
    free_module_data_area_root;
    free_module_ro_data_area_root;

Similar to free_text_area_root, we add virtual addresses and physical
pages to these trees in 2MB chunks, and hand virtual address ranges
out to users at smaller granularity.
What do you think about this idea?
>
> But at the end we want an allocation mechanism which:
>
> - preserves large mappings
> - handles a distinct address range
> - is mapping type aware
>
> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
> go. See?
I think the user still needs to use module_alloc_new() differently. At the
moment, the user does something like:
my_text = module_alloc(size);
set_vm_flush_reset_perms(my_text);
update_my_text(my_text);
set_memory_ro(my_text);
set_memory_x(my_text);
/* use my_text */
With module_alloc_new(), the my_text buffer is RX right out of the
allocator, so some text_poke mechanism is needed. In some cases, the
user also needs some logic to handle relative calls/jumps. It would
be something like:
my_text = module_alloc_new(size, MODULE_MEM_TEXT);
my_tmp_buf = vmalloc(size);
update_my_text(my_tmp_buf);
adjust_rela_calls(my_tmp_buf, my_text);
text_poke_copy(my_text, my_tmp_buf, size);
vfree(my_tmp_buf);
/* use my_text */
There are also archs that do not support text_poke, so we need some
logic, especially for modules, to handle them properly. For example,
Rick suggested something like:
For non-text_poke() architectures, the way you can make it work is have
the API look like:
execmem_alloc() <- Does the allocation, but not necessarily usable yet
execmem_write() <- Loads the mapping, doesn't work after finish()
execmem_finish() <- Makes the mapping live (loaded, executable, ready)
So for text_poke():
execmem_alloc() <- reserves the mapping
execmem_write() <- text_pokes() to the mapping
execmem_finish() <- does nothing
And non-text_poke():
execmem_alloc() <- Allocates a regular RW vmalloc allocation
execmem_write() <- Writes normally to it
execmem_finish() <- does set_memory_ro()/set_memory_x() on it
Does this sound like the best path forward to you?
Also, do you have suggestions on the name of the API? Maybe
something like:

    enum module_mem_type {
            MODULE_MEM_TEXT,
            MODULE_MEM_DATA,
            MODULE_MEM_RODATA,
    };

    module_alloc_type(size_t len, enum module_mem_type type);
    module_free_type(ptr); /* we may or may not need the type here */
Thanks,
Song