All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Andy Lutomirski" <luto@kernel.org>
To: "Mike Rapoport" <rppt@kernel.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Kees Cook" <keescook@chromium.org>
Cc: "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Christophe Leroy" <christophe.leroy@csgroup.eu>,
	"David S. Miller" <davem@davemloft.net>,
	"Dinh Nguyen" <dinguyen@kernel.org>,
	"Heiko Carstens" <hca@linux.ibm.com>,
	"Helge Deller" <deller@gmx.de>,
	"Huacai Chen" <chenhuacai@kernel.org>,
	"Kent Overstreet" <kent.overstreet@linux.dev>,
	"Luis Chamberlain" <mcgrof@kernel.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Naveen N. Rao" <naveen.n.rao@linux.ibm.com>,
	"Palmer Dabbelt" <palmer@dabbelt.com>,
	"Puranjay Mohan" <puranjay12@gmail.com>,
	"Rick P Edgecombe" <rick.p.edgecombe@intel.com>,
	"Russell King (Oracle)" <linux@armlinux.org.uk>,
	"Song Liu" <song@kernel.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Thomas Bogendoerfer" <tsbogend@alpha.franken.de>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Will Deacon" <will@kernel.org>,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-mips@vger.kernel.org, linux-mm@kvack.org,
	linux-modules@vger.kernel.org, linux-parisc@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev,
	netdev@vger.kernel.org, sparclinux@vger.kernel.org,
	"the arch/x86 maintainers" <x86@kernel.org>
Subject: Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
Date: Mon, 19 Jun 2023 10:09:02 -0700	[thread overview]
Message-ID: <a17c65c6-863f-4026-9c6f-a04b659e9ab4@app.fastmail.com> (raw)
In-Reply-To: <20230618080027.GA52412@kernel.org>



On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>> >
>> > module_alloc() is used everywhere as a mean to allocate memory for code.
>> >
>> > Beside being semantically wrong, this unnecessarily ties all subsystems
>> > that need to allocate code, such as ftrace, kprobes and BPF to modules
>> > and puts the burden of code allocation to the modules code.
>> >
>> > Several architectures override module_alloc() because of various
>> > constraints where the executable memory can be located and this causes
>> > additional obstacles for improvements of code allocation.
>> >
>> > Start splitting code allocation from modules by introducing
>> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >
>> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> > module_alloc() and execmem_free() and jit_free() are replacements of
>> > module_memfree() to allow updating all call sites to use the new APIs.
>> >
>> > The intention semantics for new allocation APIs:
>> >
>> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >   close to the kernel image, like loadable kernel modules and generated
>> >   code that is restricted by relative addressing.
>> >
>> > * jit_text_alloc() should be used to allocate memory for generated code
>> >   when there are no restrictions for the code placement. For
>> >   architectures that require that any code is within certain distance
>> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >   execmem_text_alloc().
>> >
>> 
>> Is there anything in this series to help users do the appropriate
>> synchronization when the actually populate the allocated memory with
>> code?  See here, for example:
>
> This series only factors out the executable allocations from modules and
> puts them in a central place.
> Anything else would go on top after this lands.

Hmm.

On the one hand, there's nothing wrong with factoring out common code. On the other hand, this is probably the right time to at least start thinking about synchronization, at least to the extent that it might make us want to change this API.  (I'm not at all saying that this series should require changes -- I'm just saying that this is a good time to think about how this should work.)

The current APIs, *and* the proposed jit_text_alloc() API, don't actually look like the one think in the Linux ecosystem that actually intelligently and efficiently maps new text into an address space: mmap().

On x86, you can mmap() an existing file full of executable code PROT_EXEC and jump to it with minimal synchronization (just the standard implicit ordering in the kernel that populates the pages before setting up the PTEs and whatever user synchronization is needed to avoid jumping into the mapping before mmap() finishes).  It works across CPUs, and the only possible way userspace can screw it up (for a read-only mapping of read-only text, anyway) is to jump to the mapping too early, in which case userspace gets a page fault.  Incoherence is impossible, and no one needs to "serialize" (in the SDM sense).

I think the same sequence (from userspace's perspective) works on other architectures, too, although I think more cache management is needed on the kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V has very developer-unfriendly icache management, but I don't remember the details.)

Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is rather fraught, and I bet many things do it wrong when userspace is multithreaded.  But not in production because it's mostly not used in production.)

But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.

For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:

jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
jit_text_write() -> write to that handle
jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)

could be more efficient and/or safer.

(Modules could use this too.  Getting alternatives right might take some fiddling, because off the top of my head, this doesn't match how it works now.)

To make alternatives easier, this could work, maybe (haven't fully thought it through):

jit_text_alloc()
jit_text_map_rw_inplace() -> map at the target address, but RW, !X

write the text and apply alternatives

jit_text_finalize() -> change from RW to RX *and synchronize*

jit_text_finalize() would either need to wait for RCU (possibly extra heavy weight RCU to get "serialization") or send an IPI.

This is slower than the alloc, write, map solution, but allows alternatives to be applied at the final address.


Even fancier variants where the writing is some using something like use_temporary_mm() might even make sense.


To what extent does performance matter for the various users?  module loading is slow, and I don't think we care that much.  eBPF loaded is not super fast, and we care to a limited extent.  I *think* the bcachefs use case needs to be very fast, but I'm not sure it can be fast and supportable.

Anyway, food for thought.


WARNING: multiple messages have this Message-ID (diff)
From: "Andy Lutomirski" <luto@kernel.org>
To: "Mike Rapoport" <rppt@kernel.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Kees Cook" <keescook@chromium.org>
Cc: "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Christophe Leroy" <christophe.leroy@csgroup.eu>,
	"David S. Miller" <davem@davemloft.net>,
	"Dinh Nguyen" <dinguyen@kernel.org>,
	"Heiko Carstens" <hca@linux.ibm.com>,
	"Helge Deller" <deller@gmx.de>,
	"Huacai Chen" <chenhuacai@kernel.org>,
	"Kent Overstreet" <kent.overstreet@linux.dev>,
	"Luis Chamberlain" <mcgrof@kernel.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Naveen N. Rao" <naveen.n.rao@linux.ibm.com>,
	"Palmer Dabbelt" <palmer@dabbelt.com>,
	"Puranjay Mohan" <puranjay12@gmail.com>,
	"Rick P Edgecombe" <rick.p.edgecombe@intel.com>,
	"Russell King (Oracle)" <linux@armlinux.org.uk>,
	"Song Liu" <song@kernel.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Thomas Bogendoerfer" <tsbogend@alpha.franken.de>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Will Deacon" <will@kernel.org>,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-mips@vger.kernel.org, linux-mm@kvack.org,
	linux-modules@vger.kernel.org, linux-parisc@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev,
	netdev@vger.kernel.org, sparclinux@vger.kernel.org,
	"the arch/x86 maintainers" <x86@kernel.org>
Subject: Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
Date: Mon, 19 Jun 2023 10:09:02 -0700	[thread overview]
Message-ID: <a17c65c6-863f-4026-9c6f-a04b659e9ab4@app.fastmail.com> (raw)
In-Reply-To: <20230618080027.GA52412@kernel.org>



On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>> >
>> > module_alloc() is used everywhere as a mean to allocate memory for code.
>> >
>> > Beside being semantically wrong, this unnecessarily ties all subsystems
>> > that need to allocate code, such as ftrace, kprobes and BPF to modules
>> > and puts the burden of code allocation to the modules code.
>> >
>> > Several architectures override module_alloc() because of various
>> > constraints where the executable memory can be located and this causes
>> > additional obstacles for improvements of code allocation.
>> >
>> > Start splitting code allocation from modules by introducing
>> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >
>> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> > module_alloc() and execmem_free() and jit_free() are replacements of
>> > module_memfree() to allow updating all call sites to use the new APIs.
>> >
>> > The intention semantics for new allocation APIs:
>> >
>> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >   close to the kernel image, like loadable kernel modules and generated
>> >   code that is restricted by relative addressing.
>> >
>> > * jit_text_alloc() should be used to allocate memory for generated code
>> >   when there are no restrictions for the code placement. For
>> >   architectures that require that any code is within certain distance
>> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >   execmem_text_alloc().
>> >
>> 
>> Is there anything in this series to help users do the appropriate
>> synchronization when the actually populate the allocated memory with
>> code?  See here, for example:
>
> This series only factors out the executable allocations from modules and
> puts them in a central place.
> Anything else would go on top after this lands.

Hmm.

On the one hand, there's nothing wrong with factoring out common code. On the other hand, this is probably the right time to at least start thinking about synchronization, at least to the extent that it might make us want to change this API.  (I'm not at all saying that this series should require changes -- I'm just saying that this is a good time to think about how this should work.)

The current APIs, *and* the proposed jit_text_alloc() API, don't actually look like the one think in the Linux ecosystem that actually intelligently and efficiently maps new text into an address space: mmap().

On x86, you can mmap() an existing file full of executable code PROT_EXEC and jump to it with minimal synchronization (just the standard implicit ordering in the kernel that populates the pages before setting up the PTEs and whatever user synchronization is needed to avoid jumping into the mapping before mmap() finishes).  It works across CPUs, and the only possible way userspace can screw it up (for a read-only mapping of read-only text, anyway) is to jump to the mapping too early, in which case userspace gets a page fault.  Incoherence is impossible, and no one needs to "serialize" (in the SDM sense).

I think the same sequence (from userspace's perspective) works on other architectures, too, although I think more cache management is needed on the kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V has very developer-unfriendly icache management, but I don't remember the details.)

Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is rather fraught, and I bet many things do it wrong when userspace is multithreaded.  But not in production because it's mostly not used in production.)

But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.

For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:

jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
jit_text_write() -> write to that handle
jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)

could be more efficient and/or safer.

(Modules could use this too.  Getting alternatives right might take some fiddling, because off the top of my head, this doesn't match how it works now.)

To make alternatives easier, this could work, maybe (haven't fully thought it through):

jit_text_alloc()
jit_text_map_rw_inplace() -> map at the target address, but RW, !X

write the text and apply alternatives

jit_text_finalize() -> change from RW to RX *and synchronize*

jit_text_finalize() would either need to wait for RCU (possibly extra heavy weight RCU to get "serialization") or send an IPI.

This is slower than the alloc, write, map solution, but allows alternatives to be applied at the final address.


Even fancier variants where the writing is some using something like use_temporary_mm() might even make sense.


To what extent does performance matter for the various users?  module loading is slow, and I don't think we care that much.  eBPF loaded is not super fast, and we care to a limited extent.  I *think* the bcachefs use case needs to be very fast, but I'm not sure it can be fast and supportable.

Anyway, food for thought.


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

WARNING: multiple messages have this Message-ID (diff)
From: "Andy Lutomirski" <luto@kernel.org>
To: "Mike Rapoport" <rppt@kernel.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Kees Cook" <keescook@chromium.org>
Cc: "Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Christophe Leroy" <christophe.leroy@csgroup.eu>,
	"David S. Miller" <davem@davemloft.net>,
	"Dinh Nguyen" <dinguyen@kernel.org>,
	"Heiko Carstens" <hca@linux.ibm.com>,
	"Helge Deller" <deller@gmx.de>,
	"Huacai Chen" <chenhuacai@kernel.org>,
	"Kent Overstreet" <kent.overstreet@linux.dev>,
	"Luis Chamberlain" <mcgrof@kernel.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Naveen N. Rao" <naveen.n.rao@linux.ibm.com>,
	"Palmer Dabbelt" <palmer@dabbelt.com>,
	"Puranjay Mohan" <puranjay12@gmail.com>,
	"Rick P Edgecombe" <rick.p.edgecombe@intel.com>,
	"Russell King (Oracle)" <linux@armlinux.org.uk>,
	"Song Liu" <song@kernel.org>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Thomas Bogendoerfer" <tsbogend@alpha.franken.de>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Will Deacon" <will@kernel.org>,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-mips@vger.kernel.org, linux-mm@kvack.org,
	linux-modules@vger.kernel.org, linux-parisc@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, loongarch@lists.linux.dev,
	netdev@vger.kernel.org, sparclinux@vger.kernel.org,
	"the arch/x86 maintainers" <x86@kernel.org>
Subject: Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
Date: Mon, 19 Jun 2023 10:09:02 -0700	[thread overview]
Message-ID: <a17c65c6-863f-4026-9c6f-a04b659e9ab4@app.fastmail.com> (raw)
In-Reply-To: <20230618080027.GA52412@kernel.org>



On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>> >
>> > module_alloc() is used everywhere as a mean to allocate memory for code.
>> >
>> > Beside being semantically wrong, this unnecessarily ties all subsystems
>> > that need to allocate code, such as ftrace, kprobes and BPF to modules
>> > and puts the burden of code allocation to the modules code.
>> >
>> > Several architectures override module_alloc() because of various
>> > constraints where the executable memory can be located and this causes
>> > additional obstacles for improvements of code allocation.
>> >
>> > Start splitting code allocation from modules by introducing
>> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >
>> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> > module_alloc() and execmem_free() and jit_free() are replacements of
>> > module_memfree() to allow updating all call sites to use the new APIs.
>> >
>> > The intention semantics for new allocation APIs:
>> >
>> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >   close to the kernel image, like loadable kernel modules and generated
>> >   code that is restricted by relative addressing.
>> >
>> > * jit_text_alloc() should be used to allocate memory for generated code
>> >   when there are no restrictions for the code placement. For
>> >   architectures that require that any code is within certain distance
>> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >   execmem_text_alloc().
>> >
>> 
>> Is there anything in this series to help users do the appropriate
>> synchronization when the actually populate the allocated memory with
>> code?  See here, for example:
>
> This series only factors out the executable allocations from modules and
> puts them in a central place.
> Anything else would go on top after this lands.

Hmm.

On the one hand, there's nothing wrong with factoring out common code. On the other hand, this is probably the right time to at least start thinking about synchronization, at least to the extent that it might make us want to change this API.  (I'm not at all saying that this series should require changes -- I'm just saying that this is a good time to think about how this should work.)

The current APIs, *and* the proposed jit_text_alloc() API, don't actually look like the one think in the Linux ecosystem that actually intelligently and efficiently maps new text into an address space: mmap().

On x86, you can mmap() an existing file full of executable code PROT_EXEC and jump to it with minimal synchronization (just the standard implicit ordering in the kernel that populates the pages before setting up the PTEs and whatever user synchronization is needed to avoid jumping into the mapping before mmap() finishes).  It works across CPUs, and the only possible way userspace can screw it up (for a read-only mapping of read-only text, anyway) is to jump to the mapping too early, in which case userspace gets a page fault.  Incoherence is impossible, and no one needs to "serialize" (in the SDM sense).

I think the same sequence (from userspace's perspective) works on other architectures, too, although I think more cache management is needed on the kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V has very developer-unfriendly icache management, but I don't remember the details.)

Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is rather fraught, and I bet many things do it wrong when userspace is multithreaded.  But not in production because it's mostly not used in production.)

But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.

For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:

jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
jit_text_write() -> write to that handle
jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)

could be more efficient and/or safer.

(Modules could use this too.  Getting alternatives right might take some fiddling, because off the top of my head, this doesn't match how it works now.)

To make alternatives easier, this could work, maybe (haven't fully thought it through):

jit_text_alloc()
jit_text_map_rw_inplace() -> map at the target address, but RW, !X

write the text and apply alternatives

jit_text_finalize() -> change from RW to RX *and synchronize*

jit_text_finalize() would either need to wait for RCU (possibly extra heavy weight RCU to get "serialization") or send an IPI.

This is slower than the alloc, write, map solution, but allows alternatives to be applied at the final address.


Even fancier variants where the writing is some using something like use_temporary_mm() might even make sense.


To what extent does performance matter for the various users?  module loading is slow, and I don't think we care that much.  eBPF loaded is not super fast, and we care to a limited extent.  I *think* the bcachefs use case needs to be very fast, but I'm not sure it can be fast and supportable.

Anyway, food for thought.


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: "Andy Lutomirski" <luto@kernel.org>
To: "Mike Rapoport" <rppt@kernel.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Kees Cook" <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>,
	the arch/x86 maintainers <x86@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	linux-mips@vger.kernel.org, Song Liu <song@kernel.org>,
	sparclinux@vger.kernel.org, linux-riscv@lists.infradead.org,
	Nadav Amit <nadav.amit@gmail.com>,
	linux-s390@vger.kernel.org, Helge Deller <deller@gmx.de>,
	Huacai Chen <chenhuacai@kernel.org>,
	"Russell King \(Oracle\)" <linux@armlinux.org.uk>,
	"Naveen N. Rao" <naveen.n.rao@linux.ibm.com>,
	linux-trace-kernel@vger.kernel.org, Will Deacon <will@kernel.org>,
	Heiko Carstens <hca@linux.ibm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	loongarch@lists.linux.dev, Thomas Gleixner <tglx@linutronix.de>,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	linux-parisc@vger.kernel.org,
	Puranjay Mohan <puranjay12@gmail.com>,
	linux-mm@kvack.org, netdev@vger.kernel.org,
	Kent Overstreet <kent.overstreet@linux.dev>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Dinh Nguyen <dinguyen@kernel.org>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Palmer Dabbelt <palmer@dabbelt.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Rick P Edgecombe <rick.p.edgecombe@intel.com>,
	linuxppc-dev@lists.ozlabs.org,
	"David S. Miller" <davem@davemloft.net>,
	linux-modules@vger.kernel.org
Subject: Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()
Date: Mon, 19 Jun 2023 10:09:02 -0700	[thread overview]
Message-ID: <a17c65c6-863f-4026-9c6f-a04b659e9ab4@app.fastmail.com> (raw)
In-Reply-To: <20230618080027.GA52412@kernel.org>



On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>> >
>> > module_alloc() is used everywhere as a mean to allocate memory for code.
>> >
>> > Beside being semantically wrong, this unnecessarily ties all subsystems
>> > that need to allocate code, such as ftrace, kprobes and BPF to modules
>> > and puts the burden of code allocation to the modules code.
>> >
>> > Several architectures override module_alloc() because of various
>> > constraints where the executable memory can be located and this causes
>> > additional obstacles for improvements of code allocation.
>> >
>> > Start splitting code allocation from modules by introducing
>> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >
>> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> > module_alloc() and execmem_free() and jit_free() are replacements of
>> > module_memfree() to allow updating all call sites to use the new APIs.
>> >
>> > The intention semantics for new allocation APIs:
>> >
>> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >   close to the kernel image, like loadable kernel modules and generated
>> >   code that is restricted by relative addressing.
>> >
>> > * jit_text_alloc() should be used to allocate memory for generated code
>> >   when there are no restrictions for the code placement. For
>> >   architectures that require that any code is within certain distance
>> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >   execmem_text_alloc().
>> >
>> 
>> Is there anything in this series to help users do the appropriate
>> synchronization when the actually populate the allocated memory with
>> code?  See here, for example:
>
> This series only factors out the executable allocations from modules and
> puts them in a central place.
> Anything else would go on top after this lands.

Hmm.

On the one hand, there's nothing wrong with factoring out common code. On the other hand, this is probably the right time to at least start thinking about synchronization, at least to the extent that it might make us want to change this API.  (I'm not at all saying that this series should require changes -- I'm just saying that this is a good time to think about how this should work.)

The current APIs, *and* the proposed jit_text_alloc() API, don't actually look like the one think in the Linux ecosystem that actually intelligently and efficiently maps new text into an address space: mmap().

On x86, you can mmap() an existing file full of executable code PROT_EXEC and jump to it with minimal synchronization (just the standard implicit ordering in the kernel that populates the pages before setting up the PTEs and whatever user synchronization is needed to avoid jumping into the mapping before mmap() finishes).  It works across CPUs, and the only possible way userspace can screw it up (for a read-only mapping of read-only text, anyway) is to jump to the mapping too early, in which case userspace gets a page fault.  Incoherence is impossible, and no one needs to "serialize" (in the SDM sense).

I think the same sequence (from userspace's perspective) works on other architectures, too, although I think more cache management is needed on the kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V has very developer-unfriendly icache management, but I don't remember the details.)

Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is rather fraught, and I bet many things do it wrong when userspace is multithreaded.  But not in production because it's mostly not used in production.)

But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.

For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:

jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
jit_text_write() -> write to that handle
jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)

could be more efficient and/or safer.

(Modules could use this too.  Getting alternatives right might take some fiddling, because off the top of my head, this doesn't match how it works now.)

To make alternatives easier, this could work, maybe (haven't fully thought it through):

jit_text_alloc()
jit_text_map_rw_inplace() -> map at the target address, but RW, !X

write the text and apply alternatives

jit_text_finalize() -> change from RW to RX *and synchronize*

jit_text_finalize() would either need to wait for RCU (possibly extra heavy weight RCU to get "serialization") or send an IPI.

This is slower than the alloc, write, map solution, but allows alternatives to be applied at the final address.


Even fancier variants where the writing is some using something like use_temporary_mm() might even make sense.


To what extent does performance matter for the various users?  module loading is slow, and I don't think we care that much.  eBPF loaded is not super fast, and we care to a limited extent.  I *think* the bcachefs use case needs to be very fast, but I'm not sure it can be fast and supportable.

Anyway, food for thought.


  reply	other threads:[~2023-06-19 17:09 UTC|newest]

Thread overview: 260+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-16  8:50 [PATCH v2 00/12] mm: jit/text allocator Mike Rapoport
2023-06-16  8:50 ` Mike Rapoport
2023-06-16  8:50 ` Mike Rapoport
2023-06-16  8:50 ` Mike Rapoport
2023-06-16  8:50 ` [PATCH v2 01/12] nios2: define virtual address space for modules Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 16:00   ` Edgecombe, Rick P
2023-06-16 16:00     ` Edgecombe, Rick P
2023-06-16 16:00     ` Edgecombe, Rick P
2023-06-16 16:00     ` Edgecombe, Rick P
2023-06-17  5:52     ` Mike Rapoport
2023-06-17  5:52       ` Mike Rapoport
2023-06-17  5:52       ` Mike Rapoport
2023-06-17  5:52       ` Mike Rapoport
2023-06-16 18:14   ` Song Liu
2023-06-16 18:14     ` Song Liu
2023-06-16 18:14     ` Song Liu
2023-06-16 18:14     ` Song Liu
2023-06-16  8:50 ` [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc() Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 16:48   ` Kent Overstreet
2023-06-16 16:48     ` Kent Overstreet
2023-06-16 16:48     ` Kent Overstreet
2023-06-16 16:48     ` Kent Overstreet
2023-06-16 18:18     ` Song Liu
2023-06-16 18:18       ` Song Liu
2023-06-16 18:18       ` Song Liu
2023-06-16 18:18       ` Song Liu
2023-06-17  5:57     ` Mike Rapoport
2023-06-17  5:57       ` Mike Rapoport
2023-06-17  5:57       ` Mike Rapoport
2023-06-17  5:57       ` Mike Rapoport
2023-06-17 20:38   ` Andy Lutomirski
2023-06-17 20:38     ` Andy Lutomirski
2023-06-17 20:38     ` Andy Lutomirski
2023-06-17 20:38     ` Andy Lutomirski
2023-06-18  8:00     ` Mike Rapoport
2023-06-18  8:00       ` Mike Rapoport
2023-06-18  8:00       ` Mike Rapoport
2023-06-18  8:00       ` Mike Rapoport
2023-06-19 17:09       ` Andy Lutomirski [this message]
2023-06-19 17:09         ` Andy Lutomirski
2023-06-19 17:09         ` Andy Lutomirski
2023-06-19 17:09         ` Andy Lutomirski
2023-06-19 20:18         ` Nadav Amit
2023-06-19 20:18           ` Nadav Amit
2023-06-19 20:18           ` Nadav Amit
2023-06-19 20:18           ` Nadav Amit
2023-06-20 17:24           ` Andy Lutomirski
2023-06-20 17:24             ` Andy Lutomirski
2023-06-20 17:24             ` Andy Lutomirski
2023-06-20 17:24             ` Andy Lutomirski
2023-06-25 16:14         ` Mike Rapoport
2023-06-25 16:14           ` Mike Rapoport
2023-06-25 16:14           ` Mike Rapoport
2023-06-25 16:14           ` Mike Rapoport
2023-06-25 16:59           ` Andy Lutomirski
2023-06-25 16:59             ` Andy Lutomirski
2023-06-25 16:59             ` Andy Lutomirski
2023-06-25 16:59             ` Andy Lutomirski
2023-06-25 17:42             ` Mike Rapoport
2023-06-25 17:42               ` Mike Rapoport
2023-06-25 17:42               ` Mike Rapoport
2023-06-25 17:42               ` Mike Rapoport
2023-06-25 18:07               ` Kent Overstreet
2023-06-25 18:07                 ` Kent Overstreet
2023-06-25 18:07                 ` Kent Overstreet
2023-06-25 18:07                 ` Kent Overstreet
2023-06-26  6:13                 ` Song Liu
2023-06-26  6:13                   ` Song Liu
2023-06-26  6:13                   ` Song Liu
2023-06-26  6:13                   ` Song Liu
2023-06-26  9:54                   ` Puranjay Mohan
2023-06-26  9:54                     ` Puranjay Mohan
2023-06-26  9:54                     ` Puranjay Mohan
2023-06-26  9:54                     ` Puranjay Mohan
2023-06-26 12:23                     ` Mark Rutland
2023-06-26 12:23                       ` Mark Rutland
2023-06-26 12:23                       ` Mark Rutland
2023-06-26 12:23                       ` Mark Rutland
2023-06-26 12:31           ` Mark Rutland
2023-06-26 12:31             ` Mark Rutland
2023-06-26 12:31             ` Mark Rutland
2023-06-26 12:31             ` Mark Rutland
2023-06-26 17:48             ` Song Liu
2023-06-26 17:48               ` Song Liu
2023-06-26 17:48               ` Song Liu
2023-06-26 17:48               ` Song Liu
2023-07-17 17:23               ` Andy Lutomirski
2023-07-17 17:23                 ` Andy Lutomirski
2023-07-17 17:23                 ` Andy Lutomirski
2023-07-17 17:23                 ` Andy Lutomirski
2023-06-26 13:01         ` Mark Rutland
2023-06-26 13:01           ` Mark Rutland
2023-06-26 13:01           ` Mark Rutland
2023-06-26 13:01           ` Mark Rutland
2023-06-19 11:34     ` Kent Overstreet
2023-06-19 11:34       ` Kent Overstreet
2023-06-19 11:34       ` Kent Overstreet
2023-06-19 11:34       ` Kent Overstreet
2023-06-16  8:50 ` [PATCH v2 03/12] mm/execmem, arch: convert simple overrides of module_alloc to execmem Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 18:36   ` Song Liu
2023-06-16 18:36     ` Song Liu
2023-06-16 18:36     ` Song Liu
2023-06-16 18:36     ` Song Liu
2023-06-16  8:50 ` [PATCH v2 04/12] mm/execmem, arch: convert remaining " Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 16:16   ` Edgecombe, Rick P
2023-06-16 16:16     ` Edgecombe, Rick P
2023-06-16 16:16     ` Edgecombe, Rick P
2023-06-16 16:16     ` Edgecombe, Rick P
2023-06-17  6:10     ` Mike Rapoport
2023-06-17  6:10       ` Mike Rapoport
2023-06-17  6:10       ` Mike Rapoport
2023-06-17  6:10       ` Mike Rapoport
2023-06-16 18:53   ` Song Liu
2023-06-16 18:53     ` Song Liu
2023-06-16 18:53     ` Song Liu
2023-06-16 18:53     ` Song Liu
2023-06-17  6:14     ` Mike Rapoport
2023-06-17  6:14       ` Mike Rapoport
2023-06-17  6:14       ` Mike Rapoport
2023-06-17  6:14       ` Mike Rapoport
2023-06-16  8:50 ` [PATCH v2 05/12] modules, execmem: drop module_alloc Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 18:56   ` Song Liu
2023-06-16 18:56     ` Song Liu
2023-06-16 18:56     ` Song Liu
2023-06-16 18:56     ` Song Liu
2023-06-16  8:50 ` [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc() Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 16:55   ` Edgecombe, Rick P
2023-06-16 16:55     ` Edgecombe, Rick P
2023-06-16 16:55     ` Edgecombe, Rick P
2023-06-16 16:55     ` Edgecombe, Rick P
2023-06-17  6:44     ` Mike Rapoport
2023-06-17  6:44       ` Mike Rapoport
2023-06-17  6:44       ` Mike Rapoport
2023-06-17  6:44       ` Mike Rapoport
2023-06-16 20:01   ` Song Liu
2023-06-16 20:01     ` Song Liu
2023-06-16 20:01     ` Song Liu
2023-06-16 20:01     ` Song Liu
2023-06-17  6:51     ` Mike Rapoport
2023-06-17  6:51       ` Mike Rapoport
2023-06-17  6:51       ` Mike Rapoport
2023-06-17  6:51       ` Mike Rapoport
2023-06-18 22:32   ` Thomas Gleixner
2023-06-18 22:32     ` Thomas Gleixner
2023-06-18 22:32     ` Thomas Gleixner
2023-06-18 22:32     ` Thomas Gleixner
2023-06-18 23:14     ` Kent Overstreet
2023-06-18 23:14       ` Kent Overstreet
2023-06-18 23:14       ` Kent Overstreet
2023-06-18 23:14       ` Kent Overstreet
2023-06-19  0:43       ` Thomas Gleixner
2023-06-19  0:43         ` Thomas Gleixner
2023-06-19  0:43         ` Thomas Gleixner
2023-06-19  0:43         ` Thomas Gleixner
2023-06-19  2:12         ` Kent Overstreet
2023-06-19  2:12           ` Kent Overstreet
2023-06-19  2:12           ` Kent Overstreet
2023-06-19  2:12           ` Kent Overstreet
2023-06-20 14:51         ` Steven Rostedt
2023-06-20 14:51           ` Steven Rostedt
2023-06-20 14:51           ` Steven Rostedt
2023-06-20 14:51           ` Steven Rostedt
2023-06-20 15:32           ` Alexei Starovoitov
2023-06-20 15:32             ` Alexei Starovoitov
2023-06-20 15:32             ` Alexei Starovoitov
2023-06-20 15:32             ` Alexei Starovoitov
2023-06-19 15:23     ` Mike Rapoport
2023-06-19 15:23       ` Mike Rapoport
2023-06-19 15:23       ` Mike Rapoport
2023-06-19 15:23       ` Mike Rapoport
2023-06-16  8:50 ` [PATCH v2 07/12] arm64, execmem: extend execmem_params for generated code definitions Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 20:05   ` Song Liu
2023-06-16 20:05     ` Song Liu
2023-06-16 20:05     ` Song Liu
2023-06-16 20:05     ` Song Liu
2023-06-17  6:57     ` Mike Rapoport
2023-06-17  6:57       ` Mike Rapoport
2023-06-17  6:57       ` Mike Rapoport
2023-06-17  6:57       ` Mike Rapoport
2023-06-17 15:36       ` Kent Overstreet
2023-06-17 15:36         ` Kent Overstreet
2023-06-17 15:36         ` Kent Overstreet
2023-06-17 15:36         ` Kent Overstreet
2023-06-17 16:38         ` Song Liu
2023-06-17 16:38           ` Song Liu
2023-06-17 16:38           ` Song Liu
2023-06-17 16:38           ` Song Liu
2023-06-17 20:37           ` Kent Overstreet
2023-06-17 20:37             ` Kent Overstreet
2023-06-17 20:37             ` Kent Overstreet
2023-06-17 20:37             ` Kent Overstreet
2023-06-16  8:50 ` [PATCH v2 08/12] riscv: extend execmem_params for kprobes allocations Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 20:09   ` Song Liu
2023-06-16 20:09     ` Song Liu
2023-06-16 20:09     ` Song Liu
2023-06-16 20:09     ` Song Liu
2023-06-16  8:50 ` [PATCH v2 09/12] powerpc: " Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 20:09   ` Song Liu
2023-06-16 20:09     ` Song Liu
2023-06-16 20:09     ` Song Liu
2023-06-16 20:09     ` Song Liu
2023-06-16  8:50 ` [PATCH v2 10/12] arch: make execmem setup available regardless of CONFIG_MODULES Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 20:17   ` Song Liu
2023-06-16 20:17     ` Song Liu
2023-06-16 20:17     ` Song Liu
2023-06-16 20:17     ` Song Liu
2023-06-16  8:50 ` [PATCH v2 11/12] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 20:18   ` Song Liu
2023-06-16 20:18     ` Song Liu
2023-06-16 20:18     ` Song Liu
2023-06-16 20:18     ` Song Liu
2023-06-16  8:50 ` [PATCH v2 12/12] kprobes: remove dependcy on CONFIG_MODULES Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16  8:50   ` Mike Rapoport
2023-06-16 11:44   ` Björn Töpel
2023-06-16 11:44     ` Björn Töpel
2023-06-16 11:44     ` Björn Töpel
2023-06-16 11:44     ` Björn Töpel
2023-06-17  6:52     ` Mike Rapoport
2023-06-17  6:52       ` Mike Rapoport
2023-06-17  6:52       ` Mike Rapoport
2023-06-17  6:52       ` Mike Rapoport
2023-06-16 17:02 ` [PATCH v2 00/12] mm: jit/text allocator Edgecombe, Rick P
2023-06-16 17:02   ` Edgecombe, Rick P
2023-06-16 17:02   ` Edgecombe, Rick P
2023-06-16 17:02   ` Edgecombe, Rick P

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a17c65c6-863f-4026-9c6f-a04b659e9ab4@app.fastmail.com \
    --to=luto@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=bpf@vger.kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=chenhuacai@kernel.org \
    --cc=christophe.leroy@csgroup.eu \
    --cc=davem@davemloft.net \
    --cc=deller@gmx.de \
    --cc=dinguyen@kernel.org \
    --cc=hca@linux.ibm.com \
    --cc=keescook@chromium.org \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-modules@vger.kernel.org \
    --cc=linux-parisc@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=loongarch@lists.linux.dev \
    --cc=mark.rutland@arm.com \
    --cc=mcgrof@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=nadav.amit@gmail.com \
    --cc=naveen.n.rao@linux.ibm.com \
    --cc=netdev@vger.kernel.org \
    --cc=palmer@dabbelt.com \
    --cc=puranjay12@gmail.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=rostedt@goodmis.org \
    --cc=rppt@kernel.org \
    --cc=song@kernel.org \
    --cc=sparclinux@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=tsbogend@alpha.franken.de \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.