linux-hardening.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
       [not found] ` <20230509165657.1735798-8-kent.overstreet@linux.dev>
@ 2023-05-10 15:05   ` Johannes Thumshirn
  2023-05-11 22:28     ` Kees Cook
  0 siblings, 1 reply; 24+ messages in thread
From: Johannes Thumshirn @ 2023-05-10 15:05 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs, Kees Cook
  Cc: Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On 09.05.23 18:56, Kent Overstreet wrote:
> +/**
> + * vmalloc_exec - allocate virtually contiguous, executable memory
> + * @size:	  allocation size
> + *
> + * Kernel-internal function to allocate enough pages to cover @size
> + * the page level allocator and map them into contiguous and
> + * executable kernel virtual space.
> + *
> + * For tight control over page level allocator and protection flags
> + * use __vmalloc() instead.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> +			NUMA_NO_NODE, __builtin_return_address(0));
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);

Uh W+X memory reagions.
The 90s called, they want their shellcode back.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-10 15:05   ` [PATCH 07/32] mm: Bring back vmalloc_exec Johannes Thumshirn
@ 2023-05-11 22:28     ` Kees Cook
  2023-05-12 18:41       ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Kees Cook @ 2023-05-11 22:28 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On Wed, May 10, 2023 at 03:05:48PM +0000, Johannes Thumshirn wrote:
> On 09.05.23 18:56, Kent Overstreet wrote:
> > +/**
> > + * vmalloc_exec - allocate virtually contiguous, executable memory
> > + * @size:	  allocation size
> > + *
> > + * Kernel-internal function to allocate enough pages to cover @size
> > + * the page level allocator and map them into contiguous and
> > + * executable kernel virtual space.
> > + *
> > + * For tight control over page level allocator and protection flags
> > + * use __vmalloc() instead.
> > + *
> > + * Return: pointer to the allocated memory or %NULL on error
> > + */
> > +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> > +{
> > +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> > +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> > +			NUMA_NO_NODE, __builtin_return_address(0));
> > +}
> > +EXPORT_SYMBOL_GPL(vmalloc_exec);
> 
> Uh W+X memory reagions.
> The 90s called, they want their shellcode back.

Just to clarify: the kernel must never create W+X memory regions. So,
no, do not reintroduce vmalloc_exec().

Dynamic code areas need to be constructed in a non-executable memory,
then switched to read-only and verified to still be what was expected,
and only then made executable.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-11 22:28     ` Kees Cook
@ 2023-05-12 18:41       ` Kent Overstreet
  2023-05-16 21:02         ` Kees Cook
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-05-12 18:41 UTC (permalink / raw)
  To: Kees Cook
  Cc: Johannes Thumshirn, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On Thu, May 11, 2023 at 03:28:40PM -0700, Kees Cook wrote:
> On Wed, May 10, 2023 at 03:05:48PM +0000, Johannes Thumshirn wrote:
> > On 09.05.23 18:56, Kent Overstreet wrote:
> > > +/**
> > > + * vmalloc_exec - allocate virtually contiguous, executable memory
> > > + * @size:	  allocation size
> > > + *
> > > + * Kernel-internal function to allocate enough pages to cover @size
> > > + * the page level allocator and map them into contiguous and
> > > + * executable kernel virtual space.
> > > + *
> > > + * For tight control over page level allocator and protection flags
> > > + * use __vmalloc() instead.
> > > + *
> > > + * Return: pointer to the allocated memory or %NULL on error
> > > + */
> > > +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> > > +{
> > > +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> > > +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> > > +			NUMA_NO_NODE, __builtin_return_address(0));
> > > +}
> > > +EXPORT_SYMBOL_GPL(vmalloc_exec);
> > 
> > Uh W+X memory reagions.
> > The 90s called, they want their shellcode back.
> 
> Just to clarify: the kernel must never create W+X memory regions. So,
> no, do not reintroduce vmalloc_exec().
> 
> Dynamic code areas need to be constructed in a non-executable memory,
> then switched to read-only and verified to still be what was expected,
> and only then made executable.

So if we're opening this up to the topic if what an acceptible API would
look like - how hard is this requirement?

The reason is that the functions we're constructing are only ~50 bytes,
so we don't want to be burning a full page per function (particularly
for the 64kb page architectures...)

If we were to build an allocator for sub-page dynamically constructed
code, we'd have to flip the whole page to W+X while copying in the new
function. But, we could construct it in non executable memory and then
hand it off to this new allocator to do the copy, which would also do
the page permission flipping.

It seem like this is something BPF might want eventually too, depending
on average BPF program size...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-12 18:41       ` Kent Overstreet
@ 2023-05-16 21:02         ` Kees Cook
  2023-05-16 21:20           ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Kees Cook @ 2023-05-16 21:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Johannes Thumshirn, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On Fri, May 12, 2023 at 02:41:50PM -0400, Kent Overstreet wrote:
> On Thu, May 11, 2023 at 03:28:40PM -0700, Kees Cook wrote:
> > On Wed, May 10, 2023 at 03:05:48PM +0000, Johannes Thumshirn wrote:
> > > On 09.05.23 18:56, Kent Overstreet wrote:
> > > > +/**
> > > > + * vmalloc_exec - allocate virtually contiguous, executable memory
> > > > + * @size:	  allocation size
> > > > + *
> > > > + * Kernel-internal function to allocate enough pages to cover @size
> > > > + * the page level allocator and map them into contiguous and
> > > > + * executable kernel virtual space.
> > > > + *
> > > > + * For tight control over page level allocator and protection flags
> > > > + * use __vmalloc() instead.
> > > > + *
> > > > + * Return: pointer to the allocated memory or %NULL on error
> > > > + */
> > > > +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> > > > +{
> > > > +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> > > > +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> > > > +			NUMA_NO_NODE, __builtin_return_address(0));
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vmalloc_exec);
> > > 
> > > Uh W+X memory reagions.
> > > The 90s called, they want their shellcode back.
> > 
> > Just to clarify: the kernel must never create W+X memory regions. So,
> > no, do not reintroduce vmalloc_exec().
> > 
> > Dynamic code areas need to be constructed in a non-executable memory,
> > then switched to read-only and verified to still be what was expected,
> > and only then made executable.
> 
> So if we're opening this up to the topic if what an acceptible API would
> look like - how hard is this requirement?
> 
> The reason is that the functions we're constructing are only ~50 bytes,
> so we don't want to be burning a full page per function (particularly
> for the 64kb page architectures...)

For something that small, why not use the text_poke API?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-16 21:02         ` Kees Cook
@ 2023-05-16 21:20           ` Kent Overstreet
  2023-05-16 21:47             ` Matthew Wilcox
  2023-06-17  4:13             ` Andy Lutomirski
  0 siblings, 2 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-05-16 21:20 UTC (permalink / raw)
  To: Kees Cook
  Cc: Johannes Thumshirn, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> For something that small, why not use the text_poke API?

This looks like it's meant for patching existing kernel text, which
isn't what I want - I'm generating new functions on the fly, one per
btree node.

I'm working up a new allocator - a (very simple) slab allocator where
you pass a buffer, and it gives you a copy of that buffer mapped
executable, but not writeable.

It looks like we'll be able to convert bpf, kprobes, and ftrace
trampolines to it; it'll consolidate a fair amount of code (particularly
in bpf), and they won't have to burn a full page per allocation anymore.

bpf has a neat trick where it maps the same page in two different
locations, one is the executable location and the other is the writeable
location - I'm stealing that.

external api will be:

void *jit_alloc(void *buf, size_t len, gfp_t gfp);
void jit_free(void *buf);
void jit_update(void *buf, void *new_code, size_t len); /* update an existing allocation */

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-16 21:20           ` Kent Overstreet
@ 2023-05-16 21:47             ` Matthew Wilcox
  2023-05-16 21:57               ` Kent Overstreet
  2023-05-17  5:28               ` Kent Overstreet
  2023-06-17  4:13             ` Andy Lutomirski
  1 sibling, 2 replies; 24+ messages in thread
From: Matthew Wilcox @ 2023-05-16 21:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kees Cook, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote:
> On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > For something that small, why not use the text_poke API?
> 
> This looks like it's meant for patching existing kernel text, which
> isn't what I want - I'm generating new functions on the fly, one per
> btree node.
> 
> I'm working up a new allocator - a (very simple) slab allocator where
> you pass a buffer, and it gives you a copy of that buffer mapped
> executable, but not writeable.
> 
> It looks like we'll be able to convert bpf, kprobes, and ftrace
> trampolines to it; it'll consolidate a fair amount of code (particularly
> in bpf), and they won't have to burn a full page per allocation anymore.
> 
> bpf has a neat trick where it maps the same page in two different
> locations, one is the executable location and the other is the writeable
> location - I'm stealing that.

How does that avoid the problem of being able to construct an arbitrary
gadget that somebody else will then execute?  IOW, what bpf has done
seems like it's working around & undoing the security improvements.

I suppose it's an improvement that only the executable address is
passed back to the caller, and not the writable address.

> external api will be:
> 
> void *jit_alloc(void *buf, size_t len, gfp_t gfp);
> void jit_free(void *buf);
> void jit_update(void *buf, void *new_code, size_t len); /* update an existing allocation */
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-16 21:47             ` Matthew Wilcox
@ 2023-05-16 21:57               ` Kent Overstreet
  2023-05-17  5:28               ` Kent Overstreet
  1 sibling, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-05-16 21:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kees Cook, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Tue, May 16, 2023 at 10:47:13PM +0100, Matthew Wilcox wrote:
> On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote:
> > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > For something that small, why not use the text_poke API?
> > 
> > This looks like it's meant for patching existing kernel text, which
> > isn't what I want - I'm generating new functions on the fly, one per
> > btree node.
> > 
> > I'm working up a new allocator - a (very simple) slab allocator where
> > you pass a buffer, and it gives you a copy of that buffer mapped
> > executable, but not writeable.
> > 
> > It looks like we'll be able to convert bpf, kprobes, and ftrace
> > trampolines to it; it'll consolidate a fair amount of code (particularly
> > in bpf), and they won't have to burn a full page per allocation anymore.
> > 
> > bpf has a neat trick where it maps the same page in two different
> > locations, one is the executable location and the other is the writeable
> > location - I'm stealing that.
> 
> How does that avoid the problem of being able to construct an arbitrary
> gadget that somebody else will then execute?  IOW, what bpf has done
> seems like it's working around & undoing the security improvements.
> 
> I suppose it's an improvement that only the executable address is
> passed back to the caller, and not the writable address.

That's my thinking; grepping around finds several uses of module_alloc()
that are all doing different variations on the page permissions dance.
Let's just do it once and do it right...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-16 21:47             ` Matthew Wilcox
  2023-05-16 21:57               ` Kent Overstreet
@ 2023-05-17  5:28               ` Kent Overstreet
  2023-05-17 14:04                 ` Mike Rapoport
  1 sibling, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-05-17  5:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kees Cook, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Tue, May 16, 2023 at 10:47:13PM +0100, Matthew Wilcox wrote:
> On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote:
> > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > For something that small, why not use the text_poke API?
> > 
> > This looks like it's meant for patching existing kernel text, which
> > isn't what I want - I'm generating new functions on the fly, one per
> > btree node.
> > 
> > I'm working up a new allocator - a (very simple) slab allocator where
> > you pass a buffer, and it gives you a copy of that buffer mapped
> > executable, but not writeable.
> > 
> > It looks like we'll be able to convert bpf, kprobes, and ftrace
> > trampolines to it; it'll consolidate a fair amount of code (particularly
> > in bpf), and they won't have to burn a full page per allocation anymore.
> > 
> > bpf has a neat trick where it maps the same page in two different
> > locations, one is the executable location and the other is the writeable
> > location - I'm stealing that.
> 
> How does that avoid the problem of being able to construct an arbitrary
> gadget that somebody else will then execute?  IOW, what bpf has done
> seems like it's working around & undoing the security improvements.
> 
> I suppose it's an improvement that only the executable address is
> passed back to the caller, and not the writable address.

Ok, here's what I came up with. Have not tested all corner cases, still
need to write docs - but I think this gives us a nicer interface than
what bpf/kprobes/etc. have been doing, and it does the sub-page sized
allocations I need.

With an additional tweak to module_alloc() (not done in this patch yet)
we avoid ever mapping in pages both writeable and executable:

-->--

From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Wed, 17 May 2023 01:22:06 -0400
Subject: [PATCH] mm: jit/text allocator

This provides a new, very simple slab allocator for jit/text, i.e. bpf,
ftrace trampolines, or bcachefs unpack functions.

With this API we can avoid ever mapping pages both writeable and
executable (not implemented in this patch: need to tweak
module_alloc()), and it also supports sub-page sized allocations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
new file mode 100644
index 0000000000..f1549d60e8
--- /dev/null
+++ b/include/linux/jitalloc.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_JITALLOC_H
+#define _LINUX_JITALLOC_H
+
+void jit_update(void *buf, void *new_buf, size_t len);
+void jit_free(void *buf);
+void *jit_alloc(void *buf, size_t len);
+
+#endif /* _LINUX_JITALLOC_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f..ff26a4f0c9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1202,6 +1202,9 @@ config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config JITALLOC
+	bool
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index c03e1e5859..25e82db9e8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_JITALLOC) += jitalloc.o
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
new file mode 100644
index 0000000000..7c4d621802
--- /dev/null
+++ b/mm/jitalloc.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/gfp.h>
+#include <linux/highmem.h>
+#include <linux/jitalloc.h>
+#include <linux/mm.h>
+#include <linux/moduleloader.h>
+#include <linux/mutex.h>
+#include <linux/set_memory.h>
+#include <linux/vmalloc.h>
+
+#include <asm/text-patching.h>
+
+static DEFINE_MUTEX(jit_alloc_lock);
+
+struct jit_cache {
+	unsigned		obj_size_bits;
+	unsigned		objs_per_slab;
+	struct list_head	partial;
+};
+
+#define JITALLOC_MIN_SIZE	16
+#define NR_JIT_CACHES		ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE)
+
+static struct jit_cache jit_caches[NR_JIT_CACHES];
+
+struct jit_slab {
+	unsigned long		__page_flags;
+
+	struct jit_cache	*cache;
+	void			*executably_mapped;;
+	unsigned long		*objs_allocated; /* bitmap of free objects */
+	struct list_head	list;
+};
+
+#define folio_jit_slab(folio)		(_Generic((folio),			\
+	const struct folio *:		(const struct jit_slab *)(folio),	\
+	struct folio *:			(struct jit_slab *)(folio)))
+
+#define jit_slab_folio(s)		(_Generic((s),				\
+	const struct jit_slab *:	(const struct folio *)s,		\
+	struct jit_slab *:		(struct folio *)s))
+
+static struct jit_slab *jit_slab_alloc(struct jit_cache *cache)
+{
+	void *executably_mapped = module_alloc(PAGE_SIZE);
+	struct page *page;
+	struct folio *folio;
+	struct jit_slab *slab;
+	unsigned long *objs_allocated;
+
+	if (!executably_mapped)
+		return NULL;
+
+	objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL);
+	if (!objs_allocated ) {
+		vfree(executably_mapped);
+		return NULL;
+	}
+
+	set_vm_flush_reset_perms(executably_mapped);
+	set_memory_rox((unsigned long) executably_mapped, 1);
+
+	page = vmalloc_to_page(executably_mapped);
+	folio = page_folio(page);
+
+	__folio_set_slab(folio);
+	slab			= folio_jit_slab(folio);
+	slab->cache		= cache;
+	slab->executably_mapped	= executably_mapped;
+	slab->objs_allocated = objs_allocated;
+	INIT_LIST_HEAD(&slab->list);
+
+	return slab;
+}
+
+static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
+{
+	struct jit_slab *s =
+		list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
+		jit_slab_alloc(cache);
+	unsigned obj_idx, nr_allocated;
+
+	if (!s)
+		return NULL;
+
+	obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
+
+	BUG_ON(obj_idx >= cache->objs_per_slab);
+	__set_bit(obj_idx, s->objs_allocated);
+
+	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
+
+	if (nr_allocated == s->cache->objs_per_slab) {
+		list_del_init(&s->list);
+	} else if (nr_allocated == 1) {
+		list_del(&s->list);
+		list_add(&s->list, &s->cache->partial);
+	}
+
+	return s->executably_mapped + (obj_idx << cache->obj_size_bits);
+}
+
+void jit_update(void *buf, void *new_buf, size_t len)
+{
+	text_poke_copy(buf, new_buf, len);
+}
+EXPORT_SYMBOL_GPL(jit_update);
+
+void jit_free(void *buf)
+{
+	struct page *page;
+	struct folio *folio;
+	struct jit_slab *s;
+	unsigned obj_idx, nr_allocated;
+	size_t offset;
+
+	if (!buf)
+		return;
+
+	page	= vmalloc_to_page(buf);
+	folio	= page_folio(page);
+	offset	= offset_in_folio(folio, buf);
+
+	if (!folio_test_slab(folio)) {
+		vfree(buf);
+		return;
+	}
+
+	s = folio_jit_slab(folio);
+
+	mutex_lock(&jit_alloc_lock);
+	obj_idx = offset >> s->cache->obj_size_bits;
+
+	__clear_bit(obj_idx, s->objs_allocated);
+
+	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
+
+	if (nr_allocated == 0) {
+		list_del(&s->list);
+		kfree(s->objs_allocated);
+		folio_put(folio);
+	} else if (nr_allocated + 1 == s->cache->objs_per_slab) {
+		list_del(&s->list);
+		list_add(&s->list, &s->cache->partial);
+	}
+
+	mutex_unlock(&jit_alloc_lock);
+}
+EXPORT_SYMBOL_GPL(jit_free);
+
+void *jit_alloc(void *buf, size_t len)
+{
+	unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16);
+	void *p;
+
+	if (jit_cache_idx < NR_JIT_CACHES) {
+		mutex_lock(&jit_alloc_lock);
+		p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]);
+		mutex_unlock(&jit_alloc_lock);
+	} else {
+		p = module_alloc(len);
+		if (p) {
+			set_vm_flush_reset_perms(p);
+			set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE));
+		}
+	}
+
+	if (p && buf)
+		jit_update(p, buf, len);
+
+	return p;
+}
+EXPORT_SYMBOL_GPL(jit_alloc);
+
+static int __init jit_alloc_init(void)
+{
+	for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) {
+		jit_caches[i].obj_size_bits	= ilog2(JITALLOC_MIN_SIZE) + i;
+		jit_caches[i].objs_per_slab	= PAGE_SIZE >> jit_caches[i].obj_size_bits;
+
+		INIT_LIST_HEAD(&jit_caches[i].partial);
+	}
+
+	return 0;
+}
+core_initcall(jit_alloc_init);

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-17  5:28               ` Kent Overstreet
@ 2023-05-17 14:04                 ` Mike Rapoport
  2023-05-17 14:18                   ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Mike Rapoport @ 2023-05-17 14:04 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Kees Cook, Johannes Thumshirn, linux-kernel,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening

On Wed, May 17, 2023 at 01:28:43AM -0400, Kent Overstreet wrote:
> On Tue, May 16, 2023 at 10:47:13PM +0100, Matthew Wilcox wrote:
> > On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote:
> > > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > > For something that small, why not use the text_poke API?
> > > 
> > > This looks like it's meant for patching existing kernel text, which
> > > isn't what I want - I'm generating new functions on the fly, one per
> > > btree node.
> > > 
> > > I'm working up a new allocator - a (very simple) slab allocator where
> > > you pass a buffer, and it gives you a copy of that buffer mapped
> > > executable, but not writeable.
> > > 
> > > It looks like we'll be able to convert bpf, kprobes, and ftrace
> > > trampolines to it; it'll consolidate a fair amount of code (particularly
> > > in bpf), and they won't have to burn a full page per allocation anymore.
> > > 
> > > bpf has a neat trick where it maps the same page in two different
> > > locations, one is the executable location and the other is the writeable
> > > location - I'm stealing that.
> > 
> > How does that avoid the problem of being able to construct an arbitrary
> > gadget that somebody else will then execute?  IOW, what bpf has done
> > seems like it's working around & undoing the security improvements.
> > 
> > I suppose it's an improvement that only the executable address is
> > passed back to the caller, and not the writable address.
> 
> Ok, here's what I came up with. Have not tested all corner cases, still
> need to write docs - but I think this gives us a nicer interface than
> what bpf/kprobes/etc. have been doing, and it does the sub-page sized
> allocations I need.
> 
> With an additional tweak to module_alloc() (not done in this patch yet)
> we avoid ever mapping in pages both writeable and executable:
> 
> -->--
> 
> From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
> From: Kent Overstreet <kent.overstreet@linux.dev>
> Date: Wed, 17 May 2023 01:22:06 -0400
> Subject: [PATCH] mm: jit/text allocator
> 
> This provides a new, very simple slab allocator for jit/text, i.e. bpf,
> ftrace trampolines, or bcachefs unpack functions.
> 
> With this API we can avoid ever mapping pages both writeable and
> executable (not implemented in this patch: need to tweak
> module_alloc()), and it also supports sub-page sized allocations.

This looks like yet another workaround for that module_alloc() was not
designed to handle permission changes. Rather than create more and more
wrappers for module_alloc() we need to have core API for code allocation,
apparently on top of vmalloc, and then use that API for modules, bpf,
tracing and whatnot.

There was quite lengthy discussion about how to handle code allocations
here:

https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@kernel.org/
 
and Song is already working on improvements for module_alloc(), e.g. see
commit ac3b43283923 ("module: replace module_layout with module_memory")

Another thing, the code below will not even compile on !x86.

> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> 
> diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
> new file mode 100644
> index 0000000000..f1549d60e8
> --- /dev/null
> +++ b/include/linux/jitalloc.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_JITALLOC_H
> +#define _LINUX_JITALLOC_H
> +
> +void jit_update(void *buf, void *new_buf, size_t len);
> +void jit_free(void *buf);
> +void *jit_alloc(void *buf, size_t len);
> +
> +#endif /* _LINUX_JITALLOC_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 4751031f3f..ff26a4f0c9 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1202,6 +1202,9 @@ config LRU_GEN_STATS
>  	  This option has a per-memcg and per-node memory overhead.
>  # }
>  
> +config JITALLOC
> +	bool
> +
>  source "mm/damon/Kconfig"
>  
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index c03e1e5859..25e82db9e8 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
>  obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
>  obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
>  obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
> +obj-$(CONFIG_JITALLOC) += jitalloc.o
> diff --git a/mm/jitalloc.c b/mm/jitalloc.c
> new file mode 100644
> index 0000000000..7c4d621802
> --- /dev/null
> +++ b/mm/jitalloc.c
> @@ -0,0 +1,187 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/gfp.h>
> +#include <linux/highmem.h>
> +#include <linux/jitalloc.h>
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/mutex.h>
> +#include <linux/set_memory.h>
> +#include <linux/vmalloc.h>
> +
> +#include <asm/text-patching.h>
> +
> +static DEFINE_MUTEX(jit_alloc_lock);
> +
> +struct jit_cache {
> +	unsigned		obj_size_bits;
> +	unsigned		objs_per_slab;
> +	struct list_head	partial;
> +};
> +
> +#define JITALLOC_MIN_SIZE	16
> +#define NR_JIT_CACHES		ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE)
> +
> +static struct jit_cache jit_caches[NR_JIT_CACHES];
> +
> +struct jit_slab {
> +	unsigned long		__page_flags;
> +
> +	struct jit_cache	*cache;
> +	void			*executably_mapped;;
> +	unsigned long		*objs_allocated; /* bitmap of free objects */
> +	struct list_head	list;
> +};
> +
> +#define folio_jit_slab(folio)		(_Generic((folio),			\
> +	const struct folio *:		(const struct jit_slab *)(folio),	\
> +	struct folio *:			(struct jit_slab *)(folio)))
> +
> +#define jit_slab_folio(s)		(_Generic((s),				\
> +	const struct jit_slab *:	(const struct folio *)s,		\
> +	struct jit_slab *:		(struct folio *)s))
> +
> +static struct jit_slab *jit_slab_alloc(struct jit_cache *cache)
> +{
> +	void *executably_mapped = module_alloc(PAGE_SIZE);
> +	struct page *page;
> +	struct folio *folio;
> +	struct jit_slab *slab;
> +	unsigned long *objs_allocated;
> +
> +	if (!executably_mapped)
> +		return NULL;
> +
> +	objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL);
> +	if (!objs_allocated ) {
> +		vfree(executably_mapped);
> +		return NULL;
> +	}
> +
> +	set_vm_flush_reset_perms(executably_mapped);
> +	set_memory_rox((unsigned long) executably_mapped, 1);
> +
> +	page = vmalloc_to_page(executably_mapped);
> +	folio = page_folio(page);
> +
> +	__folio_set_slab(folio);
> +	slab			= folio_jit_slab(folio);
> +	slab->cache		= cache;
> +	slab->executably_mapped	= executably_mapped;
> +	slab->objs_allocated = objs_allocated;
> +	INIT_LIST_HEAD(&slab->list);
> +
> +	return slab;
> +}
> +
> +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
> +{
> +	struct jit_slab *s =
> +		list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
> +		jit_slab_alloc(cache);
> +	unsigned obj_idx, nr_allocated;
> +
> +	if (!s)
> +		return NULL;
> +
> +	obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
> +
> +	BUG_ON(obj_idx >= cache->objs_per_slab);
> +	__set_bit(obj_idx, s->objs_allocated);
> +
> +	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> +	if (nr_allocated == s->cache->objs_per_slab) {
> +		list_del_init(&s->list);
> +	} else if (nr_allocated == 1) {
> +		list_del(&s->list);
> +		list_add(&s->list, &s->cache->partial);
> +	}
> +
> +	return s->executably_mapped + (obj_idx << cache->obj_size_bits);
> +}
> +
> +void jit_update(void *buf, void *new_buf, size_t len)
> +{
> +	text_poke_copy(buf, new_buf, len);
> +}
> +EXPORT_SYMBOL_GPL(jit_update);
> +
> +void jit_free(void *buf)
> +{
> +	struct page *page;
> +	struct folio *folio;
> +	struct jit_slab *s;
> +	unsigned obj_idx, nr_allocated;
> +	size_t offset;
> +
> +	if (!buf)
> +		return;
> +
> +	page	= vmalloc_to_page(buf);
> +	folio	= page_folio(page);
> +	offset	= offset_in_folio(folio, buf);
> +
> +	if (!folio_test_slab(folio)) {
> +		vfree(buf);
> +		return;
> +	}
> +
> +	s = folio_jit_slab(folio);
> +
> +	mutex_lock(&jit_alloc_lock);
> +	obj_idx = offset >> s->cache->obj_size_bits;
> +
> +	__clear_bit(obj_idx, s->objs_allocated);
> +
> +	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> +	if (nr_allocated == 0) {
> +		list_del(&s->list);
> +		kfree(s->objs_allocated);
> +		folio_put(folio);
> +	} else if (nr_allocated + 1 == s->cache->objs_per_slab) {
> +		list_del(&s->list);
> +		list_add(&s->list, &s->cache->partial);
> +	}
> +
> +	mutex_unlock(&jit_alloc_lock);
> +}
> +EXPORT_SYMBOL_GPL(jit_free);
> +
> +void *jit_alloc(void *buf, size_t len)
> +{
> +	unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16);
> +	void *p;
> +
> +	if (jit_cache_idx < NR_JIT_CACHES) {
> +		mutex_lock(&jit_alloc_lock);
> +		p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]);
> +		mutex_unlock(&jit_alloc_lock);
> +	} else {
> +		p = module_alloc(len);
> +		if (p) {
> +			set_vm_flush_reset_perms(p);
> +			set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE));
> +		}
> +	}
> +
> +	if (p && buf)
> +		jit_update(p, buf, len);
> +
> +	return p;
> +}
> +EXPORT_SYMBOL_GPL(jit_alloc);
> +
> +static int __init jit_alloc_init(void)
> +{
> +	for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) {
> +		jit_caches[i].obj_size_bits	= ilog2(JITALLOC_MIN_SIZE) + i;
> +		jit_caches[i].objs_per_slab	= PAGE_SIZE >> jit_caches[i].obj_size_bits;
> +
> +		INIT_LIST_HEAD(&jit_caches[i].partial);
> +	}
> +
> +	return 0;
> +}
> +core_initcall(jit_alloc_init);
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-17 14:04                 ` Mike Rapoport
@ 2023-05-17 14:18                   ` Kent Overstreet
  2023-05-17 15:44                     ` Mike Rapoport
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-05-17 14:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Matthew Wilcox, Kees Cook, Johannes Thumshirn, linux-kernel,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening, song

On Wed, May 17, 2023 at 05:04:27PM +0300, Mike Rapoport wrote:
> On Wed, May 17, 2023 at 01:28:43AM -0400, Kent Overstreet wrote:
> > On Tue, May 16, 2023 at 10:47:13PM +0100, Matthew Wilcox wrote:
> > > On Tue, May 16, 2023 at 05:20:33PM -0400, Kent Overstreet wrote:
> > > > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > > > For something that small, why not use the text_poke API?
> > > > 
> > > > This looks like it's meant for patching existing kernel text, which
> > > > isn't what I want - I'm generating new functions on the fly, one per
> > > > btree node.
> > > > 
> > > > I'm working up a new allocator - a (very simple) slab allocator where
> > > > you pass a buffer, and it gives you a copy of that buffer mapped
> > > > executable, but not writeable.
> > > > 
> > > > It looks like we'll be able to convert bpf, kprobes, and ftrace
> > > > trampolines to it; it'll consolidate a fair amount of code (particularly
> > > > in bpf), and they won't have to burn a full page per allocation anymore.
> > > > 
> > > > bpf has a neat trick where it maps the same page in two different
> > > > locations, one is the executable location and the other is the writeable
> > > > location - I'm stealing that.
> > > 
> > > How does that avoid the problem of being able to construct an arbitrary
> > > gadget that somebody else will then execute?  IOW, what bpf has done
> > > seems like it's working around & undoing the security improvements.
> > > 
> > > I suppose it's an improvement that only the executable address is
> > > passed back to the caller, and not the writable address.
> > 
> > Ok, here's what I came up with. Have not tested all corner cases, still
> > need to write docs - but I think this gives us a nicer interface than
> > what bpf/kprobes/etc. have been doing, and it does the sub-page sized
> > allocations I need.
> > 
> > With an additional tweak to module_alloc() (not done in this patch yet)
> > we avoid ever mapping in pages both writeable and executable:
> > 
> > -->--
> > 
> > From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
> > From: Kent Overstreet <kent.overstreet@linux.dev>
> > Date: Wed, 17 May 2023 01:22:06 -0400
> > Subject: [PATCH] mm: jit/text allocator
> > 
> > This provides a new, very simple slab allocator for jit/text, i.e. bpf,
> > ftrace trampolines, or bcachefs unpack functions.
> > 
> > With this API we can avoid ever mapping pages both writeable and
> > executable (not implemented in this patch: need to tweak
> > module_alloc()), and it also supports sub-page sized allocations.
> 
> This looks like yet another workaround for that module_alloc() was not
> designed to handle permission changes. Rather than create more and more
> wrappers for module_alloc() we need to have core API for code allocation,
> apparently on top of vmalloc, and then use that API for modules, bpf,
> tracing and whatnot.
> 
> There was quite lengthy discussion about how to handle code allocations
> here:
> 
> https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@kernel.org/

Thanks for the link!

Added Song to the CC.

Song, I'm looking at your code now - switching to hugepages is great,
but I wonder if we might be able to combine our two approaches - with
the slab allocator I did, do we have to bother with VMAs at all? And
then it gets us sub-page sized allocations.

> and Song is already working on improvements for module_alloc(), e.g. see
> commit ac3b43283923 ("module: replace module_layout with module_memory")
> 
> Another thing, the code below will not even compile on !x86.

Due to text_poke(), which I see is abstracted better in that patchset.

I'm very curious why text_poke() does tlb flushing at all; it seems like
flush_icache_range() is actually what's needed?

text_poke() also only touching up to two pages, without that being
documented, is also a footgun...

And I'm really curious why text_poke() is needed at all. Seems like we
could just use kmap_local() to create a temporary writeable mapping,
except in my testing that got me a RO mapping. Odd.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-17 14:18                   ` Kent Overstreet
@ 2023-05-17 15:44                     ` Mike Rapoport
  2023-05-17 15:59                       ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Mike Rapoport @ 2023-05-17 15:44 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Kees Cook, Johannes Thumshirn, linux-kernel,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening, song

On Wed, May 17, 2023 at 10:18:11AM -0400, Kent Overstreet wrote:
> On Wed, May 17, 2023 at 05:04:27PM +0300, Mike Rapoport wrote:
> 
> And I'm really curious why text_poke() is needed at all. Seems like we
> could just use kmap_local() to create a temporary writeable mapping,

On 64 bit kmap_local_page() is aliased to page_address() and does not map
anything. text_poke() is needed to actually create a temporary writable
mapping without touching page tables in vmalloc and/or direct map.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-17 15:44                     ` Mike Rapoport
@ 2023-05-17 15:59                       ` Kent Overstreet
  0 siblings, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-05-17 15:59 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Matthew Wilcox, Kees Cook, Johannes Thumshirn, linux-kernel,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening, song

On Wed, May 17, 2023 at 06:44:12PM +0300, Mike Rapoport wrote:
> On Wed, May 17, 2023 at 10:18:11AM -0400, Kent Overstreet wrote:
> > On Wed, May 17, 2023 at 05:04:27PM +0300, Mike Rapoport wrote:
> > 
> > And I'm really curious why text_poke() is needed at all. Seems like we
> > could just use kmap_local() to create a temporary writeable mapping,
> 
> On 64 bit kmap_local_page() is aliased to page_address() and does not map
> anything. text_poke() is needed to actually create a temporary writable
> mapping without touching page tables in vmalloc and/or direct map.

Duh - thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-16 21:20           ` Kent Overstreet
  2023-05-16 21:47             ` Matthew Wilcox
@ 2023-06-17  4:13             ` Andy Lutomirski
  2023-06-17 15:34               ` Kent Overstreet
  1 sibling, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2023-06-17  4:13 UTC (permalink / raw)
  To: Kent Overstreet, Kees Cook
  Cc: Johannes Thumshirn, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On 5/16/23 14:20, Kent Overstreet wrote:
> On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
>> For something that small, why not use the text_poke API?
> 
> This looks like it's meant for patching existing kernel text, which
> isn't what I want - I'm generating new functions on the fly, one per
> btree node.

Dynamically generating code is a giant can of worms.

Kees touched on a basic security thing: a linear address mapped W+X is a big
no-no.  And that's just scratching the surface -- ideally we would have a
strong protocol for generating code: the code is generated in some
extra-secure context, then it's made immutable and double-checked, then
it becomes live.  (And we would offer this to userspace, some day.)
Just having a different address for the W and X aliases is pretty weak.

(When x86 modifies itself at boot or for static keys, it changes out the
page tables temporarily.)

And even beyond security, we have correctness.  x86 is a fairly 
forgiving architecture.  If you go back in time about 20 years, modify
some code *at the same linear address at which you intend to execute 
it*, and jump to it, it works.  It may even work if you do it through
an alias (the manual is vague).  But it's not 20 years ago, and you have
multiple cores.  This does *not* work with multiple CPUs -- you need to 
serialize on the CPU executing the modified code.  On all the but the 
very newest CPUs, you need to kludge up the serialization, and that's
sloooooooooooooow.  Very new CPUs have the SERIALIZE instruction, which
is merely sloooooow.

(The manual is terrible.  It's clear that a way to do this without 
serializing must exist, because that's what happens when code is paged 
in from a user program.)

And remember that x86 is the forgiving architecture.  Other 
architectures have their own rules that may involve all kinds of 
terrifying cache management.  IIRC ARM (32-bit) is really quite nasty in 
this regard.  I've seen some references suggesting that RISC-V has a 
broken design of its cache management and this is a real mess.

x86 low level stuff on Linux gets away with it because the 
implementation is conservative and very slow, but it's very rarely invoked.

eBPF gets away with it in ways that probably no one really likes, but 
also no one expects eBPF to load programs particularly quickly.

You are proposing doing this when a btree node is loaded.  You could 
spend 20 *thousand* cycles, on *each CPU*, the first time you access 
that node, not to mention the extra branch to decide whether you need to 
spend those 20k cycles.  Or you could use IPIs.

Or you could just not do this.  I think you should just remove all this 
dynamic codegen stuff, at least for now.

> 
> I'm working up a new allocator - a (very simple) slab allocator where
> you pass a buffer, and it gives you a copy of that buffer mapped
> executable, but not writeable.
> 
> It looks like we'll be able to convert bpf, kprobes, and ftrace
> trampolines to it; it'll consolidate a fair amount of code (particularly
> in bpf), and they won't have to burn a full page per allocation anymore.
> 
> bpf has a neat trick where it maps the same page in two different
> locations, one is the executable location and the other is the writeable
> location - I'm stealing that.
> 
> external api will be:
> 
> void *jit_alloc(void *buf, size_t len, gfp_t gfp);
> void jit_free(void *buf);
> void jit_update(void *buf, void *new_code, size_t len); /* update an existing allocation */

Based on the above, I regret to inform you that jit_update() will either 
need to sync all cores via IPI or all cores will need to check whether a 
sync is needed and do it themselves.

That IPI could be, I dunno, 500k cycles?  1M cycles?  Depends on what 
cores are asleep at the time.  (I have some old Sandy Bridge machines 
where, if you tick all the boxes wrong, you might spend tens of 
milliseconds doing this due to power savings gone wrong.)  Or are you 
planning to implement a fancy mostly-lockless thing to track which cores 
actually need the IPI so you can avoid waking up sleeping cores?

Sorry to be a party pooper.

--Andy

P.S. I have given some thought to how to make a JIT API that was 
actually (somewhat) performant.  It's nontrivial, and it would involve 
having at least phone calls and possibly actual meetings with people who 
understand the microarchitecture of various CPUs to get all the details 
hammered out and documented properly.

I don't think it would be efficient for teeny little functions like 
bcachefs wants, but maybe?  That would be even more complex and messy.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-17  4:13             ` Andy Lutomirski
@ 2023-06-17 15:34               ` Kent Overstreet
  2023-06-17 19:19                 ` Andy Lutomirski
  2023-06-19 19:45                 ` Kees Cook
  0 siblings, 2 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-06-17 15:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Fri, Jun 16, 2023 at 09:13:22PM -0700, Andy Lutomirski wrote:
> On 5/16/23 14:20, Kent Overstreet wrote:
> > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > For something that small, why not use the text_poke API?
> > 
> > This looks like it's meant for patching existing kernel text, which
> > isn't what I want - I'm generating new functions on the fly, one per
> > btree node.
> 
> Dynamically generating code is a giant can of worms.
> 
> Kees touched on a basic security thing: a linear address mapped W+X is a big
> no-no.  And that's just scratching the surface -- ideally we would have a
> strong protocol for generating code: the code is generated in some
> extra-secure context, then it's made immutable and double-checked, then
> it becomes live.

"Double checking" arbitrary code is is fantasy. You can't "prove the
security" of arbitrary code post compilation.

Rice's theorem states that any nontrivial property of a program is
either a direct consequence of the syntax, or is undecidable. It's why
programs in statically typed languages are easier to reason about, and
it's also why the borrow checker in Rust is a syntactic construct.

You just have to be able to trust the code that generates the code. Just
like you have to be able to trust any other code that lives in kernel
space.

This is far safer and easier to reason about than what BPF is doing
because we're not compiling arbitrary code, the actual codegen part is
200 loc and the input is just a single table.

> 
> (When x86 modifies itself at boot or for static keys, it changes out the
> page tables temporarily.)
> 
> And even beyond security, we have correctness.  x86 is a fairly forgiving
> architecture.  If you go back in time about 20 years, modify
> some code *at the same linear address at which you intend to execute it*,
> and jump to it, it works.  It may even work if you do it through
> an alias (the manual is vague).  But it's not 20 years ago, and you have
> multiple cores.  This does *not* work with multiple CPUs -- you need to
> serialize on the CPU executing the modified code.  On all the but the very
> newest CPUs, you need to kludge up the serialization, and that's
> sloooooooooooooow.  Very new CPUs have the SERIALIZE instruction, which
> is merely sloooooow.

If what you were saying was true, it would be an issue any time we
mapped in new executable code for userspace - minor page faults would be
stupidly slow.

This code has been running on thousands of machines for years, and the
only issues that have come up have been due to the recent introduction
of indirect branch tracking. x86 doesn't have such broken caches, and
architectures that do have utterly broken caches (because that's what
you're describing: you're describing caches that _are not coherent
across cores_) are not high on my list of things I care about.

Also, SERIALIZE is a spectre thing. Not relevant here.

> Based on the above, I regret to inform you that jit_update() will either
> need to sync all cores via IPI or all cores will need to check whether a
> sync is needed and do it themselves.

text_poke() doesn't even send IPIs.

I think you've been misled about some things :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-17 15:34               ` Kent Overstreet
@ 2023-06-17 19:19                 ` Andy Lutomirski
  2023-06-17 20:08                   ` Kent Overstreet
  2023-06-19 19:45                 ` Kees Cook
  1 sibling, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2023-06-17 19:19 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kees Cook, Johannes Thumshirn, Linux Kernel Mailing List,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening

On Sat, Jun 17, 2023, at 8:34 AM, Kent Overstreet wrote:
> On Fri, Jun 16, 2023 at 09:13:22PM -0700, Andy Lutomirski wrote:
>> On 5/16/23 14:20, Kent Overstreet wrote:
>> > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
>> > > For something that small, why not use the text_poke API?
>> > 
>> > This looks like it's meant for patching existing kernel text, which
>> > isn't what I want - I'm generating new functions on the fly, one per
>> > btree node.
>> 
>> Dynamically generating code is a giant can of worms.
>> 
>> Kees touched on a basic security thing: a linear address mapped W+X is a big
>> no-no.  And that's just scratching the surface -- ideally we would have a
>> strong protocol for generating code: the code is generated in some
>> extra-secure context, then it's made immutable and double-checked, then
>> it becomes live.
>
> "Double checking" arbitrary code is is fantasy. You can't "prove the
> security" of arbitrary code post compilation.
>
> Rice's theorem states that any nontrivial property of a program is
> either a direct consequence of the syntax, or is undecidable. It's why
> programs in statically typed languages are easier to reason about, and
> it's also why the borrow checker in Rust is a syntactic construct.

If you want security in some theoretical sense, sure, you're probably right.  But that doesn't stop people from double-checking executable code to quite good effect.  For example:

https://www.bitdefender.com/blog/businessinsights/bitdefender-releases-landmark-open-source-software-project-hypervisor-based-memory-introspection/

(I have no personal experience with this, but I know people who do.  It's obviously not perfect, but I think it provides meaningful benefits.)

I'm not saying Linux should do this internally, but it might not be a terrible idea some day.

>
> You just have to be able to trust the code that generates the code. Just
> like you have to be able to trust any other code that lives in kernel
> space.
>
> This is far safer and easier to reason about than what BPF is doing
> because we're not compiling arbitrary code, the actual codegen part is
> 200 loc and the input is just a single table.

Great, then propose a model where the codegen operates in an extra-safe protected context.  Or pre-generate the most common variants, have them pull their constants from memory instead of immediates, and use that.

>
>> 
>> (When x86 modifies itself at boot or for static keys, it changes out the
>> page tables temporarily.)
>> 
>> And even beyond security, we have correctness.  x86 is a fairly forgiving
>> architecture.  If you go back in time about 20 years, modify
>> some code *at the same linear address at which you intend to execute it*,
>> and jump to it, it works.  It may even work if you do it through
>> an alias (the manual is vague).  But it's not 20 years ago, and you have
>> multiple cores.  This does *not* work with multiple CPUs -- you need to
>> serialize on the CPU executing the modified code.  On all the but the very
>> newest CPUs, you need to kludge up the serialization, and that's
>> sloooooooooooooow.  Very new CPUs have the SERIALIZE instruction, which
>> is merely sloooooow.
>
> If what you were saying was true, it would be an issue any time we
> mapped in new executable code for userspace - minor page faults would be
> stupidly slow.

I literally mentioned this in the email.

I don't know _precisely_ what's going on, but I assume it's that it's impossible (assuming the kernel gets TLB invalidation right) for a CPU to have anything buffered for a linear address that is unmapped, so when it gets mapped, the CPU can't have anything stale in its buffers.  (By buffers, I mean any sort of instruction or decoded instruction cache.)

Having *this* conversation is what I was talking about in regard to possible fancy future optimization.

>
> This code has been running on thousands of machines for years, and the
> only issues that have come up have been due to the recent introduction
> of indirect branch tracking. x86 doesn't have such broken caches, and
> architectures that do have utterly broken caches (because that's what
> you're describing: you're describing caches that _are not coherent
> across cores_) are not high on my list of things I care about.

I care.  And a bunch of people who haven't gotten their filesystem corrupted because of a missed serialization.

>
> Also, SERIALIZE is a spectre thing. Not relevant here.

Nope, try again.  SERIALIZE "serializes" in the rather vague sense in the Intel SDM.  I don't think it's terribly useful for Spectre.

(Yes, I know what I'm talking about.)

>
>> Based on the above, I regret to inform you that jit_update() will either
>> need to sync all cores via IPI or all cores will need to check whether a
>> sync is needed and do it themselves.
>
> text_poke() doesn't even send IPIs.

text_poke() and the associated machinery is unbelievably complicated.  

Also, arch/x86/kernel/alternative.c contains:

void text_poke_sync(void)
{
	on_each_cpu(do_sync_core, NULL, 1);
}

The magic in text_poke() was developed over the course of years, and Intel architects were involved.

(And I think some text_poke() stuff uses RCU, which is another way to sync without IPI.  I doubt the performance characteristics are appropriate for bcachefs, but I could be wrong.)

>
> I think you've been misled about some things :)

I wish.


I like bcachefs.  I really don't want to have to put on my maintainer hat here, and I do indeed generally stay in the background.  (And I haven't had nearly as much time for this kind of the work in the last couple years as I'd like, sigh.) But I personally have a fairly strict opinion that, if someone (including myself!) wants to merge something that plays clever games that may cause x86 architecture code (especially mm code) to do things it shouldn't in corner cases, even if no one has directly observed that corner case or even knows how to get it to misbehave, then they had better have a very convincing argument that it's safe.  No one likes debugging bugs when something that should be coherent becomes incoherent.

So, if you really really want self-modifying code in bcachefs, its correctness needs to be very, very well argued, and it needs to be maintainable.  Otherwise I will NAK it.  Sorry.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-17 19:19                 ` Andy Lutomirski
@ 2023-06-17 20:08                   ` Kent Overstreet
  2023-06-17 20:35                     ` Andy Lutomirski
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-06-17 20:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Johannes Thumshirn, Linux Kernel Mailing List,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening

On Sat, Jun 17, 2023 at 12:19:41PM -0700, Andy Lutomirski wrote:
> On Sat, Jun 17, 2023, at 8:34 AM, Kent Overstreet wrote:
> > On Fri, Jun 16, 2023 at 09:13:22PM -0700, Andy Lutomirski wrote:
> >> On 5/16/23 14:20, Kent Overstreet wrote:
> >> > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> >> > > For something that small, why not use the text_poke API?
> >> > 
> >> > This looks like it's meant for patching existing kernel text, which
> >> > isn't what I want - I'm generating new functions on the fly, one per
> >> > btree node.
> >> 
> >> Dynamically generating code is a giant can of worms.
> >> 
> >> Kees touched on a basic security thing: a linear address mapped W+X is a big
> >> no-no.  And that's just scratching the surface -- ideally we would have a
> >> strong protocol for generating code: the code is generated in some
> >> extra-secure context, then it's made immutable and double-checked, then
> >> it becomes live.
> >
> > "Double checking" arbitrary code is is fantasy. You can't "prove the
> > security" of arbitrary code post compilation.
> >
> > Rice's theorem states that any nontrivial property of a program is
> > either a direct consequence of the syntax, or is undecidable. It's why
> > programs in statically typed languages are easier to reason about, and
> > it's also why the borrow checker in Rust is a syntactic construct.
> 
> If you want security in some theoretical sense, sure, you're probably right.  But that doesn't stop people from double-checking executable code to quite good effect.  For example:
> 
> https://www.bitdefender.com/blog/businessinsights/bitdefender-releases-landmark-open-source-software-project-hypervisor-based-memory-introspection/
> 
> (I have no personal experience with this, but I know people who do.  It's obviously not perfect, but I think it provides meaningful benefits.)
> 
> I'm not saying Linux should do this internally, but it might not be a terrible idea some day.

So you want to pull a virus scanner into the kernel.

> > You just have to be able to trust the code that generates the code. Just
> > like you have to be able to trust any other code that lives in kernel
> > space.
> >
> > This is far safer and easier to reason about than what BPF is doing
> > because we're not compiling arbitrary code, the actual codegen part is
> > 200 loc and the input is just a single table.
> 
> Great, then propose a model where the codegen operates in an
> extra-safe protected context.  Or pre-generate the most common
> variants, have them pull their constants from memory instead of
> immediates, and use that.

I'll do no such nonsense.

> > If what you were saying was true, it would be an issue any time we
> > mapped in new executable code for userspace - minor page faults would be
> > stupidly slow.
> 
> I literally mentioned this in the email.

No, you didn't. Feel free to link or cite if you think otherwise.

> 
> I don't know _precisely_ what's going on, but I assume it's that it's impossible (assuming the kernel gets TLB invalidation right) for a CPU to have anything buffered for a linear address that is unmapped, so when it gets mapped, the CPU can't have anything stale in its buffers.  (By buffers, I mean any sort of instruction or decoded instruction cache.)
> 
> Having *this* conversation is what I was talking about in regard to possible fancy future optimization.
> 
> >
> > This code has been running on thousands of machines for years, and the
> > only issues that have come up have been due to the recent introduction
> > of indirect branch tracking. x86 doesn't have such broken caches, and
> > architectures that do have utterly broken caches (because that's what
> > you're describing: you're describing caches that _are not coherent
> > across cores_) are not high on my list of things I care about.
> 
> I care.  And a bunch of people who haven't gotten their filesystem corrupted because of a missed serialization.
> 
> >
> > Also, SERIALIZE is a spectre thing. Not relevant here.
> 
> Nope, try again.  SERIALIZE "serializes" in the rather vague sense in the Intel SDM.  I don't think it's terribly useful for Spectre.
> 
> (Yes, I know what I'm talking about.)
> 
> >
> >> Based on the above, I regret to inform you that jit_update() will either
> >> need to sync all cores via IPI or all cores will need to check whether a
> >> sync is needed and do it themselves.
> >
> > text_poke() doesn't even send IPIs.
> 
> text_poke() and the associated machinery is unbelievably complicated.  

It's not that bad.

The only reference to IPIs in text_poke() is the comment that indicates
that flush_tlb_mm_range() may sometimes do IPIs, but explicitly
indicates that it does _not_ do IPIs the way text_poke() is using it.

> Also, arch/x86/kernel/alternative.c contains:
> 
> void text_poke_sync(void)
> {
> 	on_each_cpu(do_sync_core, NULL, 1);
> }

...which is for modifying code that is currently being executed, not the
text_poke() or text_poke_copy() paths.

> 
> The magic in text_poke() was developed over the course of years, and
> Intel architects were involved.
> 
> (And I think some text_poke() stuff uses RCU, which is another way to
> sync without IPI.  I doubt the performance characteristics are
> appropriate for bcachefs, but I could be wrong.)

No, it doesn't use RCU.

> > I think you've been misled about some things :)
> 
> I wish.

Given your comments on text_poke(), I think you are. You're confusing
synchronization requirements for _self modifying_ code with the
synchronization requirements for writing new code to memory, and then
executing it.

And given that bcachefs is not doing anything new here - we're doing a
more limited form of what BPF is already doing - I don't think this is
even the appropriate place for this discussion. There is a new
executable memory allocator being developed and posted, which is
expected to wrap text_poke() in an arch-independent way so that
allocations can share pages, and so that we can remove the need to have
pages mapped both writeable and executable.

If you've got knowledge you wish to share on how to get cache coherency
right, I think that might be a more appropriate thread.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-17 20:08                   ` Kent Overstreet
@ 2023-06-17 20:35                     ` Andy Lutomirski
  0 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2023-06-17 20:35 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Kees Cook, Johannes Thumshirn, Linux Kernel Mailing List,
	linux-fsdevel, linux-bcachefs, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, hch, linux-mm, linux-hardening



On Sat, Jun 17, 2023, at 1:08 PM, Kent Overstreet wrote:
> On Sat, Jun 17, 2023 at 12:19:41PM -0700, Andy Lutomirski wrote:
>> On Sat, Jun 17, 2023, at 8:34 AM, Kent Overstreet wrote:

>> Great, then propose a model where the codegen operates in an
>> extra-safe protected context.  Or pre-generate the most common
>> variants, have them pull their constants from memory instead of
>> immediates, and use that.
>
> I'll do no such nonsense.

You can avoid generating code beyond what gcc generates at all, or you can pre-generate code but not on an ongoing basis at runtime, or you can generate code at runtime correctly.  I don't think there are many other options.

>
>> > If what you were saying was true, it would be an issue any time we
>> > mapped in new executable code for userspace - minor page faults would be
>> > stupidly slow.
>> 
>> I literally mentioned this in the email.
>
> No, you didn't. Feel free to link or cite if you think otherwise.

"It's clear that a way to do this without 
serializing must exist, because that's what happens when code is paged 
in from a user program."

>> > text_poke() doesn't even send IPIs.
>> 
>> text_poke() and the associated machinery is unbelievably complicated.  
>
> It's not that bad.

This is a useless discussion.

>
> The only reference to IPIs in text_poke() is the comment that indicates
> that flush_tlb_mm_range() may sometimes do IPIs, but explicitly
> indicates that it does _not_ do IPIs the way text_poke() is using it.
>
>> Also, arch/x86/kernel/alternative.c contains:
>> 
>> void text_poke_sync(void)
>> {
>> 	on_each_cpu(do_sync_core, NULL, 1);
>> }
>
> ...which is for modifying code that is currently being executed, not the
> text_poke() or text_poke_copy() paths.
>
>> 
>> The magic in text_poke() was developed over the course of years, and
>> Intel architects were involved.
>> 
>> (And I think some text_poke() stuff uses RCU, which is another way to
>> sync without IPI.  I doubt the performance characteristics are
>> appropriate for bcachefs, but I could be wrong.)
>
> No, it doesn't use RCU.

It literally says in alternative.c:

 * Not safe against concurrent execution; useful for JITs to dump
 * new code blocks into unused regions of RX memory. Can be used in
 * conjunction with synchronize_rcu_tasks() to wait for existing
 * execution to quiesce after having made sure no existing functions
 * pointers are live.

I don't know whether any callers actually do this.  I didn't look.

>
>> > I think you've been misled about some things :)
>> 
>> I wish.
>
> Given your comments on text_poke(), I think you are. You're confusing
> synchronization requirements for _self modifying_ code with the
> synchronization requirements for writing new code to memory, and then
> executing it.

No, you are misunderstanding the difference.

Version A:

User mmap()s an executable file (DSO, whatever).  At first, there is either no PTE or a not-present PTE.  At some point, in response to a page fault or just the kernel prefetching, the kernel fills in the backing page and then creates the PTE.  From the CPU's perspective, the virtual address atomically transitions from having nothing there to having the final code there.  It works (despite the manual having nothing to say about this case).  It's also completely unavoidable.

Version B:

Kernel vmallocs some space *and populates the pagetables*.  There is backing storage, that is executable (or it's a non-NX system, although those are quite rare these days).

Because the CPU hates you, it speculatively executes that code.  (Maybe you're under attack.  Maybe you're just unlucky.  Doesn't matter.)  It populates the instruction cache, remembers the decoded instructions, etc.  It does all the things that make the manual say scary things about serialization.  It notices that the speculative execution was wrong and backs it out, but nothing is invalidated.

Now you write code into there.  Either you do this from a different CPU or you do it at a different linear address, so the magic hardware that invalidates for you does not trigger.

Now you jump into that code, and you tell yourself that it was new code because it was all zeros before and you never intentionally executed it.  But the CPU could not care less what you think, and you lose.

>
> And given that bcachefs is not doing anything new here - we're doing a
> more limited form of what BPF is already doing - I don't think this is
> even the appropriate place for this discussion. There is a new
> executable memory allocator being developed and posted, which is
> expected to wrap text_poke() in an arch-independent way so that
> allocations can share pages, and so that we can remove the need to have
> pages mapped both writeable and executable.

I don't really care what BPF is doing, and BPF may well have the same problem.

But if I understood what bcachefs is doing, it's creating code vastly more frequently than BPF, in response to entirely unprivileged operations from usermode.  It's a whole different amount of exposure.

>
> If you've got knowledge you wish to share on how to get cache coherency
> right, I think that might be a more appropriate thread.

I'll look.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-17 15:34               ` Kent Overstreet
  2023-06-17 19:19                 ` Andy Lutomirski
@ 2023-06-19 19:45                 ` Kees Cook
  2023-06-20  0:39                   ` Kent Overstreet
  1 sibling, 1 reply; 24+ messages in thread
From: Kees Cook @ 2023-06-19 19:45 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Andy Lutomirski, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Sat, Jun 17, 2023 at 11:34:31AM -0400, Kent Overstreet wrote:
> On Fri, Jun 16, 2023 at 09:13:22PM -0700, Andy Lutomirski wrote:
> > On 5/16/23 14:20, Kent Overstreet wrote:
> > > On Tue, May 16, 2023 at 02:02:11PM -0700, Kees Cook wrote:
> > > > For something that small, why not use the text_poke API?
> > > 
> > > This looks like it's meant for patching existing kernel text, which
> > > isn't what I want - I'm generating new functions on the fly, one per
> > > btree node.
> > 
> > Dynamically generating code is a giant can of worms.
> > 
> > Kees touched on a basic security thing: a linear address mapped W+X is a big
> > no-no.  And that's just scratching the surface -- ideally we would have a
> > strong protocol for generating code: the code is generated in some
> > extra-secure context, then it's made immutable and double-checked, then
> > it becomes live.
> 
> "Double checking" arbitrary code is is fantasy. You can't "prove the
> security" of arbitrary code post compilation.

I think there's a misunderstanding here about the threat model I'm
interested in protecting against for JITs. While making sure the VM of a
JIT is safe in itself, that's separate from what I'm concerned about.

The threat model is about flaws _elsewhere_ in the kernel that can
leverage the JIT machinery to convert a "write anything anywhere anytime"
exploit primitive into an "execute anything" primitive. Arguments can
be made to say "a write anything flaw means the total collapse of the
security model so there's no point defending against it", but both that
type of flaw and the slippery slope argument don't stand up well to
real-world situations.

The kinds of flaws we've seen are frequently limited in scope (write
1 byte, write only NULs, write only in a specific range, etc), but
when chained together, the weakest link is what ultimately compromises
the kernel. As such, "W^X" is a basic building block of the kernel's
self-defense methods, because it is such a potent target for a
write->execute attack upgrades.

Since a JIT constructs something that will become executable, it needs
to defend itself against stray writes from other threads. Since Linux
doesn't (really) use per-CPU page tables, the workspace for a JIT can be
targeted by something that isn't the JIT. To deal with this, JITs need
to use 3 phases: a writing pass (into W memory), then switch it to RO
and perform a verification pass (construct it again, but compare results
to the RO version), and finally switch it executable. Or, it can use
writes to memory that only the local CPU can perform (i.e. text_poke(),
which uses a different set of page tables with different permissions).

Without basic W^X, it becomes extremely difficult to build further
defenses (e.g. protecting page tables themselves, etc) since WX will
remain the easiest target.

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-06-19 19:45                 ` Kees Cook
@ 2023-06-20  0:39                   ` Kent Overstreet
  0 siblings, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-06-20  0:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Johannes Thumshirn, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	hch, linux-mm, linux-hardening

On Mon, Jun 19, 2023 at 12:45:43PM -0700, Kees Cook wrote:
> I think there's a misunderstanding here about the threat model I'm
> interested in protecting against for JITs. While making sure the VM of a
> JIT is safe in itself, that's separate from what I'm concerned about.
> 
> The threat model is about flaws _elsewhere_ in the kernel that can
> leverage the JIT machinery to convert a "write anything anywhere anytime"
> exploit primitive into an "execute anything" primitive. Arguments can
> be made to say "a write anything flaw means the total collapse of the
> security model so there's no point defending against it", but both that
> type of flaw and the slippery slope argument don't stand up well to
> real-world situations.

Hey Kees, thanks for the explanation - I don't think this is a concern
for what bcachefs is doing, since we're not doing a full jit. The unpack
functions we generate only write to the 40 bytes pointed to by rsi; not
terribly useful as an execute anything primitive :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
       [not found] ` <20230509165657.1735798-30-kent.overstreet@linux.dev>
@ 2023-07-12 19:58   ` Kees Cook
  2023-07-12 20:19     ` Kent Overstreet
  2023-07-12 20:23     ` Kent Overstreet
  0 siblings, 2 replies; 24+ messages in thread
From: Kees Cook @ 2023-07-12 19:58 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	linux-hardening

On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <kent.overstreet@gmail.com>
> 
> printbuf now needs to know the number of characters that would have been
> written if the buffer was too small, like snprintf(); this changes
> string_get_size() to return the the return value of snprintf().

Unfortunately, snprintf doesn't return characters written, it return
what it TRIED to write, and can cause a lot of problems[1]. This patch
would be fine with me if the snprintf was also replaced by scnprintf,
which will return the actual string length copied (or 0) *not* including
the trailing %NUL.

> [...]
> @@ -126,8 +126,8 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
>  	else
>  		unit = units_str[units][i];
>  
> -	snprintf(buf, len, "%u%s %s", (u32)size,
> -		 tmp, unit);
> +	return snprintf(buf, len, "%u%s %s", (u32)size,
> +			tmp, unit);

-Kees

[1] https://github.com/KSPP/linux/issues/105

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
  2023-07-12 19:58   ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote Kees Cook
@ 2023-07-12 20:19     ` Kent Overstreet
  2023-07-12 22:38       ` Kees Cook
  2023-07-12 20:23     ` Kent Overstreet
  1 sibling, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-07-12 20:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	linux-hardening

On Wed, Jul 12, 2023 at 12:58:54PM -0700, Kees Cook wrote:
> On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> > printbuf now needs to know the number of characters that would have been
> > written if the buffer was too small, like snprintf(); this changes
> > string_get_size() to return the the return value of snprintf().
> 
> Unfortunately, snprintf doesn't return characters written, it return
> what it TRIED to write, and can cause a lot of problems[1]. This patch
> would be fine with me if the snprintf was also replaced by scnprintf,
> which will return the actual string length copied (or 0) *not* including
> the trailing %NUL.

...All of which would be solved if we were converting code away from raw
char * buffers to a proper string building type.

Which I tried to address when I tried to push printbufs upstream, but
that turned into a giant exercise in frustration in dealing with
maintainers.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
  2023-07-12 19:58   ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote Kees Cook
  2023-07-12 20:19     ` Kent Overstreet
@ 2023-07-12 20:23     ` Kent Overstreet
  1 sibling, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-07-12 20:23 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	linux-hardening

On Wed, Jul 12, 2023 at 12:58:54PM -0700, Kees Cook wrote:
> On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <kent.overstreet@gmail.com>
> > 
> > printbuf now needs to know the number of characters that would have been
> > written if the buffer was too small, like snprintf(); this changes
> > string_get_size() to return the the return value of snprintf().
> 
> Unfortunately, snprintf doesn't return characters written, it return
> what it TRIED to write, and can cause a lot of problems[1]. This patch
> would be fine with me if the snprintf was also replaced by scnprintf,
> which will return the actual string length copied (or 0) *not* including
> the trailing %NUL.

Anyways, I can't use scnprintf here, printbufs/seq_buf both need the
number of characters that would have been written, but I'll update the
comment.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
  2023-07-12 20:19     ` Kent Overstreet
@ 2023-07-12 22:38       ` Kees Cook
  2023-07-12 23:53         ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Kees Cook @ 2023-07-12 22:38 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	linux-hardening

On Wed, Jul 12, 2023 at 04:19:31PM -0400, Kent Overstreet wrote:
> On Wed, Jul 12, 2023 at 12:58:54PM -0700, Kees Cook wrote:
> > On Tue, May 09, 2023 at 12:56:54PM -0400, Kent Overstreet wrote:
> > > From: Kent Overstreet <kent.overstreet@gmail.com>
> > > 
> > > printbuf now needs to know the number of characters that would have been
> > > written if the buffer was too small, like snprintf(); this changes
> > > string_get_size() to return the the return value of snprintf().
> > 
> > Unfortunately, snprintf doesn't return characters written, it return
> > what it TRIED to write, and can cause a lot of problems[1]. This patch
> > would be fine with me if the snprintf was also replaced by scnprintf,
> > which will return the actual string length copied (or 0) *not* including
> > the trailing %NUL.
> 
> ...All of which would be solved if we were converting code away from raw
> char * buffers to a proper string building type.
> 
> Which I tried to address when I tried to push printbufs upstream, but
> that turned into a giant exercise in frustration in dealing with
> maintainers.

Heh, yeah, I've been trying to aim people at using seq_buf instead of
a long series of snprintf/strlcat/etc calls. Where can I look at how
you wired this up to seq_buf/printbuf? I had trouble finding it when I
looked before. I'd really like to find a way to do it without leaving
around foot-guns for future callers of string_get_size(). :)

I found the printbuf series:
https://lore.kernel.org/lkml/20220808024128.3219082-1-willy@infradead.org/
It seems there are some nice improvements in there. It'd be really nice
if seq_buf could just grow those changes. Adding a static version of
seq_buf_init to be used like you have PRINTBUF_EXTERN would be nice
(or even a statically sized initializer). And much of the conversions
is just changing types and functions. If we can leave all that alone,
things become MUCH easier to review, etc, etc. I'd *love* to see an
incremental improvement for seq_buf, especially the heap-allocation
part.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote
  2023-07-12 22:38       ` Kees Cook
@ 2023-07-12 23:53         ` Kent Overstreet
  0 siblings, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-07-12 23:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	linux-hardening

On Wed, Jul 12, 2023 at 03:38:44PM -0700, Kees Cook wrote:
> Heh, yeah, I've been trying to aim people at using seq_buf instead of
> a long series of snprintf/strlcat/etc calls. Where can I look at how
> you wired this up to seq_buf/printbuf? I had trouble finding it when I
> looked before. I'd really like to find a way to do it without leaving
> around foot-guns for future callers of string_get_size(). :)
> 
> I found the printbuf series:
> https://lore.kernel.org/lkml/20220808024128.3219082-1-willy@infradead.org/
> It seems there are some nice improvements in there. It'd be really nice
> if seq_buf could just grow those changes. Adding a static version of
> seq_buf_init to be used like you have PRINTBUF_EXTERN would be nice
> (or even a statically sized initializer). And much of the conversions
> is just changing types and functions. If we can leave all that alone,
> things become MUCH easier to review, etc, etc. I'd *love* to see an
> incremental improvement for seq_buf, especially the heap-allocation
> part.

Well, I raised that with Steve way back when I was starting on the
conversions of existing code, and I couldn't get any communication out
him regarding making those changes to seq_buf.

So, I'd _love_ to resurrect that patch series and get it in after the
bcachefs merger, but don't expect me to go back and redo everything :)
the amount of code in existing seq_buf users is fairly small compared to
bcachef's printbuf usage, and what that patch series does in the rest of
the kernel anyways.

I'd rather save that energy for ditching the seq_file interface and
making that just use a printbuf - clean up that bit of API
fragmentation.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-07-12 23:53 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20230509165657.1735798-1-kent.overstreet@linux.dev>
     [not found] ` <20230509165657.1735798-8-kent.overstreet@linux.dev>
2023-05-10 15:05   ` [PATCH 07/32] mm: Bring back vmalloc_exec Johannes Thumshirn
2023-05-11 22:28     ` Kees Cook
2023-05-12 18:41       ` Kent Overstreet
2023-05-16 21:02         ` Kees Cook
2023-05-16 21:20           ` Kent Overstreet
2023-05-16 21:47             ` Matthew Wilcox
2023-05-16 21:57               ` Kent Overstreet
2023-05-17  5:28               ` Kent Overstreet
2023-05-17 14:04                 ` Mike Rapoport
2023-05-17 14:18                   ` Kent Overstreet
2023-05-17 15:44                     ` Mike Rapoport
2023-05-17 15:59                       ` Kent Overstreet
2023-06-17  4:13             ` Andy Lutomirski
2023-06-17 15:34               ` Kent Overstreet
2023-06-17 19:19                 ` Andy Lutomirski
2023-06-17 20:08                   ` Kent Overstreet
2023-06-17 20:35                     ` Andy Lutomirski
2023-06-19 19:45                 ` Kees Cook
2023-06-20  0:39                   ` Kent Overstreet
     [not found] ` <20230509165657.1735798-30-kent.overstreet@linux.dev>
2023-07-12 19:58   ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote Kees Cook
2023-07-12 20:19     ` Kent Overstreet
2023-07-12 22:38       ` Kees Cook
2023-07-12 23:53         ` Kent Overstreet
2023-07-12 20:23     ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).