linux-mm.kvack.org archive mirror
From: Song Liu <song@kernel.org>
To: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Mike Rapoport <rppt@kernel.org>,
	"Edgecombe, Rick P" <rick.p.edgecombe@intel.com>,
	 "peterz@infradead.org" <peterz@infradead.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	 "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"hch@lst.de" <hch@lst.de>, "x86@kernel.org" <x86@kernel.org>,
	 "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	 "Lu, Aaron" <aaron.lu@intel.com>,
	 "linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs
Date: Wed, 9 Nov 2022 17:50:39 -0800
Message-ID: <CAPhsuW5wtWQHMhjvi4hmOsVDZF-kosr7Eb8Gj2Jo4R5LFqE-qA@mail.gmail.com>
In-Reply-To: <d60266dc-6a10-b234-954c-a899a7ad054f@csgroup.eu>

On Wed, Nov 9, 2022 at 1:24 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> + linuxppc-dev list as we start mentioning powerpc.
>
> On 09/11/2022 at 18:43, Song Liu wrote:
> > On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@kernel.org> wrote:
> >>
> > [...]
> >
> >>>>
> >>>> The proposed execmem_alloc() looks to me very much tailored for x86
> >>>> to be used as a replacement for module_alloc(). Some architectures
> >>>> have module_alloc() that is quite different from the default or x86
> >>>> version, so I'd expect at least some explanation how modules etc can
> >>>> use execmem_ APIs without breaking !x86 architectures.
> >>>
> >>> I think this is fair, but I think we should ask ourselves - how
> >>> much should we do in one step?
> >>
> >> I think that at least we need evidence that execmem_alloc() etc. can be
> >> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
> >> for him at all, so having a core MM API for code allocation that only works
> >> with BPF on x86 seems not right to me.
> >
> > While using execmem_alloc() et al. in module support is difficult, folks are
> > making progress with it. For example, the prototype would be more difficult
> > before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> > (introduced by Christophe).
>
> By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> was completely different: This was because on powerpc book3s/32, no-exec
> flagging is per segment of size 256 Mbytes, so in order to provide
> STRICT_MODULES_RWX it was necessary to put data outside of the segment
> that holds module text in order to be able to flag RW data as no-exec.

Yeah, I only noticed the actual motivation of this work earlier today. :)

>
> But I'm happy if it can also serve other purposes.
>
> >
> > We also have other users that we can onboard soon: BPF trampoline on
> > x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> > s390.
> >
> >>
> >>> For non-text_poke() architectures, the way you can make it work is have
> >>> the API look like:
> >>> execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> >>> execmem_write()  <- Loads the mapping, doesn't work after finish()
> >>> execmem_finish() <- Makes the mapping live (loaded, executable, ready)
> >>>
> >>> So for text_poke():
> >>> execmem_alloc()  <- reserves the mapping
> >>> execmem_write()  <- text_pokes() to the mapping
> >>> execmem_finish() <- does nothing
> >>>
> >>> And non-text_poke():
> >>> execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> >>> execmem_write()  <- Writes normally to it
> >>> execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> >>>
> >>> Non-text_poke() only gets the benefits of centralized logic, but the
> >>> interface works for both. This is pretty much what the perm_alloc() RFC
> >>> did to make it work with other arch's and modules. But to fit with the
> >>> existing modules code (which is actually spread all over) and also
> >>> handle RO sections, it also needed some additional bells and whistles.
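
For readers following along, the non-text_poke() sequence quoted above
(allocate RW, write normally, then flip to read-only and executable) can be
modelled in userspace. This is only a rough sketch under stated assumptions:
struct execmem_region and the *_sketch function names are hypothetical,
mmap()/mprotect() stand in for a vmalloc allocation and
set_memory_ro()/set_memory_x(), and none of it is the API proposed in this
series.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical userspace stand-in for one code allocation. */
struct execmem_region {
	void *addr;
	size_t size;
	int finished;	/* set once the region is read-only + executable */
};

/* "alloc": plain RW anonymous mapping, not executable yet. */
static int execmem_alloc_sketch(struct execmem_region *r, size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;
	r->addr = p;
	r->size = size;
	r->finished = 0;
	return 0;
}

/* "write": ordinary memcpy; refused after finish or when out of bounds. */
static int execmem_write_sketch(struct execmem_region *r, size_t off,
				const void *src, size_t len)
{
	if (r->finished || off > r->size || len > r->size - off)
		return -1;
	memcpy((char *)r->addr + off, src, len);
	return 0;
}

/* "finish": drop write permission, add exec, so the region is never W+X. */
static int execmem_finish_sketch(struct execmem_region *r)
{
	if (mprotect(r->addr, r->size, PROT_READ | PROT_EXEC) != 0)
		return -1;
	r->finished = 1;
	return 0;
}
```

A caller would do alloc, one or more writes, then finish; any write after
finish fails. In the text_poke() variant described above, the write step
would do the poking and finish would be a no-op.
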
> >>
> >> I'm less concerned about non-text_poke() part, but rather about
> >> restrictions where code and data can live on different architectures and
> >> whether these restrictions won't lead to inability to use the centralized
> >> logic on, say, arm64 and powerpc.
>
> Until recently, powerpc CPUs didn't implement PC-relative data access.
> Only very recent powerpc CPUs (power10 only ?) have the capability to do
> PC-relative accesses, but the kernel doesn't use it yet. So there's no
> constraint about distance between text and data. What matters is the
> distance between core kernel text and module text to avoid trampolines.

Ah, this is great. I guess this means powerpc can benefit from this work
with much less effort than x86_64.

>
> >>
> >> For instance, if we use execmem_alloc() for modules, it means that data
> >> sections should be allocated separately with plain vmalloc(). Will this
> >> work universally? Or will this require special care with additional
> >> complexity in the modules code?
> >>
> >>> So the question I'm trying to ask is, how much should we target for the
> >>> next step? I first thought that this functionality was so intertwined,
> >>> it would be too hard to do iteratively. So if we want to try
> >>> iteratively, I'm ok if it doesn't solve everything.
> >>
> >> With execmem_alloc() as the first step I'm failing to see the large
> >> picture. If we want to use it for modules, how will we allocate RO data?
> >> With a similar rodata_alloc() that uses yet another tree in vmalloc?
> >> How can the caching of large pages in vmalloc be made useful for use cases
> >> like secretmem and PKS?
> >
> > If RO data causes problems with direct map fragmentation, we can use
> > similar logic. I think we will need another tree in vmalloc for this case.
> > Since the logic will be mostly identical, I personally don't think adding
> > another tree is a big overhead.
>
> On powerpc, kernel core RAM is not mapped by pages but is mapped by
> blocks. There are only two blocks: One ROX block which contains both
> text and rodata, and one RW block that contains everything else. Maybe
> the same can be done for modules. What matters is to be sure you never
> have WX memory. Having ROX rodata is not an issue.

Got it. Thanks!

Song

