bpf.vger.kernel.org archive mirror
* [PATCH RFC 00/10] New permission vmalloc interface
@ 2020-11-20 20:24 Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation Rick Edgecombe
                   ` (9 more replies)
  0 siblings, 10 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

This is a proposal to address some inefficiencies in how memory permissions
are handled on vmalloc mappings. The way the interfaces are defined across
vmalloc and cpa makes it hard to fully address problems underneath the
existing interfaces. So this creates a new interface in vmalloc that
encapsulates what vmalloc memory permission usages need, but with more
details handled on the back end. This allows for optimizations and shared caches
of resources. The genesis for this was this conversation[0] and many of the
ideas were suggested by Andy Lutomirski. In its current state it takes module
loads down to usually one kernel range shootdown, and BPF JIT loads down to
usually zero on x86. It also minimizes the direct map 4k breakage when possible.

For the future, x86 also has new kernel memory permission types that would
benefit from efficiently handling the direct map permissions/unmapping, for
example [1]. However, this patchset just targets the performance
inefficiencies of existing usages in modules and normal eBPF JITs.

The code is early and very lightly tested. I was hoping to get some feedback on
the approach.

The Problem
===========

For a little background, the way executable code is loaded (modules, BPF
JITs, etc) on x86 goes something like this:

ptr = vmalloc_node_range(..., PAGE_KERNEL)
   alloc_page() - Get random pages from page allocator
   map_kernel_range() - Map vmalloc range allocation as RW to those pages

set_memory_ro(ptr)
   set vmalloc alias to RO, break direct map aliases to 4k/RO, all-cpu shootdown
   vm_unmap_aliases() flushes any lazy RW aliases to the now-RO pages, all-cpu shootdown

set_memory_x(ptr)
   set vmalloc alias to executable (clear NX), all-cpu shootdown
   vm_unmap_aliases(), possible all-cpu shootdown

So during this load operation, four shootdowns may take place, and the direct
map will be broken to 4k pages across whichever random pages got used in the
executable region. When a large page split is required, there can be even more.
Besides the direct map work, the other source of shootdowns is having to change
the permission of the vmalloc mapping several times: it is loaded while
writable and then transitioned to its final permission.

Ideally we would unmap pages from the direct map in bulk to share a shootdown.
For changing the vmalloc mapping's permission, we should instead map it at its
final permission from the start and load the data through a temporary per-cpu
mapping (as text_poke() does), so that only a local TLB flush is required.

For large page breakage on the direct map, if multiple JITs happen to get
pages from the same 2MB physical region, the damage is limited to that smaller
region. However, currently this depends on the luck of which physical pages
vmalloc happens to pick, and today it seems more likely to happen if
allocations are made close together in time. Ideally we would make an effort
to group pages used for permissioned vmallocs together physically so that the
direct map breakage is minimized.

But trying to improve this doesn't fit into the existing interfaces very well.

 - vmalloc_node_range() doesn't know what its final permission will be.

 - There isn't any cross-arch way to describe to vmalloc what the permissions
will be, since permissions are encoded into the name of the set_memory_foo()
functions.

 - text_poke() only exists on x86, and other HW ways of temporarily writing to
RO mappings don't necessarily have standardized semantics.

Proposed solution
=================

For text or RO allocations, to oversimplify, what usages want to do is just
say (see the sketch after the list):

1. Give me a kva for this particular permission and size
2. Load this data into it
3. Make it "live" (no writable mapping, no direct map mapping, whatever
permissions are set on it and ready to go)
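
Concretely, that sequence would look roughly like the following sketch
(vstart/vend, insns, and len stand in for the caller's address range and
payload):

	/* 1. Reserve a kva range at its final permission */
	struct perm_allocation *alloc = perm_alloc(vstart, vend, page_cnt, PERM_RX);
	unsigned long text = perm_alloc_address(alloc);

	/* 2. Load the data through the (possibly separate) writable address */
	memcpy((void *)perm_writable_addr(alloc, text), insns, len);

	/* 3. Make it live */
	perm_writable_finish(alloc);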

So this implements a new interface to do just that (the sketch above uses the
function names added in patch 1). I had in mind that this interface should try
to support the following optimizations on x86, even if they weren't all
implemented right away:

1. Draw from 2MB physical pages that can be unmapped from the direct map in
contiguous chunks

In memory pressure situations a shrinker callback can free unused pages
from the cache. These can get re-mapped on the fly without any flush since the
direct map would be transitioning NP->RW. Since we can re-map the direct map
cheaply, it's better to unmap more than we need. This part is close to
secretmem[2] in implementation, and should possibly share infrastructure or
caches of unmapped pages.

2. Load text/special data via per-cpu mappings

The mapping can be created at its "final" permission and loaded via text_poke().
This reduces shootdowns during loads to zero in most cases, with just local
flushes. The new interface provides a writable buffer for usages to stage their
changes in, and to trigger the copying into the RO mapping via text_poke().

3. Caching of virtual mappings as well as pages

Normally executable mappings need to be zapped and flushed before the pages
return to the page allocator, to prevent whatever random memory uses the pages
later from having an executable alias. But we could cache these live mappings
and delay the flush until a page is needed for an allocation of a larger size
or a different permission. The "free" operation could just zero the memory
through a per-cpu mapping to prevent unwanted text from remaining mapped.

4. 2MB module space mappings

It would be nice if virtual mappings of the same permission type could be
placed next to each other so that they could share 2MB pages. This way modules
or PKS memory could be mapped with 2MB pages. Of course, allocating from a
2MB block could cause internal fragmentation and wasted memory; however, it
might be possible to break the virtual mapping later and allow the wasted
memory to be unmapped and freed from the formerly 2MB page. Often a bunch of
modules are loaded at boot. If we placed the long-lived "core sections" of
these modules sequentially into the 2MB blocks, there is probably a good
chance we could get some decent utilization out of one.


This RFC just has 1 and 2 actually implemented on x86.

Module loader changes
=====================

Of the text allocation usages, kernel modules are the most complex because a
single vmalloc allocation has many memory permissions across it (RO+X for the
text, RO, RO after init, and RW). Besides preventing module text from being
mapped with 2MB pages (since the text ends up scattered across these mixed
allocations), supporting such an allocation would also add complexity to the
new interface.

However, at least for x86, it doesn't seem like there is any requirement for
a module allocation to be virtually contiguous. Instead we could have the
module loader treat each of its 4 permission regions as separate allocations.
Then the new interface could be simpler and it could have the option of putting
similar permission allocations next to each other to achieve 2MB pages or more
opportunities to reuse existing mappings.
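
Roughly, each permission region of a module core would then get its own
allocation. A sketch in terms of the struct module_layout fields added in
patch 3 (page counts are placeholders, and details like the handling of
ro_after_init are managed by the loader rather than literally like this):

	mod->core_layout.text.alloc          = perm_alloc(start, end, text_pages, PERM_RX);
	mod->core_layout.ro.alloc            = perm_alloc(start, end, ro_pages, PERM_R);
	mod->core_layout.ro_after_init.alloc = perm_alloc(start, end, rai_pages, PERM_R);
	mod->core_layout.rw.alloc            = perm_alloc(start, end, rw_pages, PERM_RW);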

The challenge in changing this in the module loader is that most of it is
cross-arch code, and there could be relocation rules required by various
arches that depend on the existing virtual address distances. To try to
transition to this interface without disturbing anything, the default
module.c behavior is to lay out the modules as they were before, in both
location and permissions, but wrapped separately as multiple instances of the
new type of allocation. This way there is no functional change for other
architectures at first, but any of them can implement similar optimizations
in their arch module.c breakouts. So this RFC also looks at handling things
as separate allocations, and actually allocates them separately for x86.

Of course, there are several areas outside of the module loader that are
involved in modifying module text and data, such as alternatives, ORC
unwinding, etc. These components are changed to be aware that they may need
to operate on the writable staging buffer.

[0] https://lore.kernel.org/lkml/CALCETrV_tGk=B3Hw0h9viW45wMqB_W+rwWzx6LnC3-vSATOUOA@mail.gmail.com/
[1] https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.weiny@intel.com/
[2] https://lore.kernel.org/lkml/20200924132904.1391-1-rppt@kernel.org/

This RFC has been acked by Dave Hansen.

Rick Edgecombe (10):
  vmalloc: Add basic perm alloc implementation
  bpf: Use perm_alloc() for BPF JIT filters
  module: Use perm_alloc() for modules
  module: Support separate writable allocation
  x86/modules: Use real perm_allocations
  x86/alternatives: Handle perm_allocs for modules
  x86/unwind: Unwind orc at module writable address
  jump_label: Handle module writable address
  ftrace: Use module writable address
  vmalloc: Add perm_alloc x86 implementation

 arch/Kconfig                      |   3 +
 arch/arm/net/bpf_jit_32.c         |   3 +-
 arch/arm64/net/bpf_jit_comp.c     |   5 +-
 arch/mips/net/bpf_jit.c           |   2 +-
 arch/mips/net/ebpf_jit.c          |   3 +-
 arch/powerpc/net/bpf_jit_comp.c   |   2 +-
 arch/powerpc/net/bpf_jit_comp64.c |  10 +-
 arch/s390/net/bpf_jit_comp.c      |   5 +-
 arch/sparc/net/bpf_jit_comp_32.c  |   2 +-
 arch/sparc/net/bpf_jit_comp_64.c  |   5 +-
 arch/x86/Kconfig                  |   1 +
 arch/x86/include/asm/set_memory.h |   2 +
 arch/x86/kernel/alternative.c     |  25 +-
 arch/x86/kernel/jump_label.c      |  18 +-
 arch/x86/kernel/module.c          |  84 ++++-
 arch/x86/kernel/unwind_orc.c      |   8 +-
 arch/x86/mm/Makefile              |   1 +
 arch/x86/mm/pat/set_memory.c      |  13 +
 arch/x86/mm/vmalloc.c             | 438 +++++++++++++++++++++++
 arch/x86/net/bpf_jit_comp.c       |  15 +-
 arch/x86/net/bpf_jit_comp32.c     |   3 +-
 include/linux/filter.h            |  30 +-
 include/linux/module.h            |  66 +++-
 include/linux/vmalloc.h           |  82 +++++
 kernel/bpf/core.c                 |  48 ++-
 kernel/jump_label.c               |   2 +-
 kernel/module.c                   | 561 ++++++++++++++++++------------
 kernel/trace/ftrace.c             |   2 +-
 mm/nommu.c                        |  66 ++++
 mm/vmalloc.c                      | 135 +++++++
 30 files changed, 1308 insertions(+), 332 deletions(-)
 create mode 100644 arch/x86/mm/vmalloc.c

-- 
2.20.1



* [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
@ 2020-11-20 20:24 ` Rick Edgecombe
  2020-11-22  4:10   ` Andy Lutomirski
                     ` (2 more replies)
  2020-11-20 20:24 ` [PATCH RFC 02/10] bpf: Use perm_alloc() for BPF JIT filters Rick Edgecombe
                   ` (8 subsequent siblings)
  9 siblings, 3 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

In order to allow for future arch-specific optimizations for vmalloc
permissions, first add an implementation of a new interface that will
work cross-arch by using the existing set_memory_() functions.

When allocating some memory that will be RO, for example, it should be used
like this:

/* Reserve va */
struct perm_allocation *alloc = perm_alloc(vstart, vend, page_cnt, PERM_R);
unsigned long ro = (unsigned long)perm_alloc_address(alloc);

/* Write to writable address */
strcpy((char *)perm_writable_addr(alloc, ro), "Some data to be RO");
/* Signal that writing is done and mapping should be live */
perm_writable_finish(alloc);
/* Print from RO address */
printk("Read only data is: %s\n", (char *)ro);

Create some new flags to handle the memory permissions currently defined
cross-architecturally in the set_memory_() function names themselves. The
PAGE_ defines are not uniform across the architectures, so they couldn't be
used without unifying them. However, in the future there may also be other
flags, for example requesting to try to allocate into part of a 2MB page
for longer-lived allocations.
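
For an allocation whose permission needs to change after it is live (for
example ro_after_init style data), a caller could do something like the
following (a hypothetical usage, not taken verbatim from a later patch):

	struct perm_allocation *alloc = perm_alloc(vstart, vend, page_cnt, PERM_RW);

	/* ... load data via perm_writable_addr() and perm_writable_finish() ... */

	/* Later, seal the data read-only */
	perm_change(alloc, PERM_R);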

Have the default implementation use the primary address for loading the
data as is done today for special kernel permission usages. However, make
the interface compatible with having the writable data loaded at a
separate address or via some PKS backed solution. Allocate using
module_alloc() in the default implementation in order to allocate from
each arch's chosen place for executable code.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/Kconfig            |   3 +
 include/linux/vmalloc.h |  82 ++++++++++++++++++++++++
 mm/nommu.c              |  66 ++++++++++++++++++++
 mm/vmalloc.c            | 135 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 286 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc0e32d..0fa42f76548d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -259,6 +259,9 @@ config ARCH_HAS_SET_MEMORY
 config ARCH_HAS_SET_DIRECT_MAP
 	bool
 
+config ARCH_HAS_PERM_ALLOC_IMPLEMENTATION
+	bool
+
 #
 # Select if the architecture provides the arch_dma_set_uncached symbol to
 # either provide an uncached segement alias for a DMA allocation, or
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 938eaf9517e2..4a6b30014fff 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -248,4 +248,86 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
 int register_vmap_purge_notifier(struct notifier_block *nb);
 int unregister_vmap_purge_notifier(struct notifier_block *nb);
 
+#define PERM_R	1
+#define PERM_W	2
+#define PERM_X	4
+#define PERM_RWX	(PERM_R | PERM_W | PERM_X)
+#define PERM_RW		(PERM_R | PERM_W)
+#define PERM_RX		(PERM_R | PERM_X)
+
+typedef u8 virtual_perm;
+
+struct perm_allocation {
+	struct page **pages;
+	virtual_perm cur_perm;
+	virtual_perm orig_perm;
+	struct vm_struct *area;
+	unsigned long offset;
+	unsigned long size;
+	void *writable;
+};
+
+/*
+ * Allocate a special permission kva region. The region may not be mapped
+ * until a call to perm_writable_finish(). A writable region will be mapped
+ * immediately at the address returned by perm_writable_addr(). The allocation
+ * will be made between the start and end virtual addresses.
+ */
+struct perm_allocation *perm_alloc(unsigned long vstart, unsigned long vend, unsigned long page_cnt,
+				   virtual_perm perms);
+
+/* The writable address for data to be loaded into the allocation */
+unsigned long perm_writable_addr(struct perm_allocation *alloc, unsigned long addr);
+
+/* Signal that writing is done and the mapping should be made live */
+bool perm_writable_finish(struct perm_allocation *alloc);
+
+/* Change the permission of an allocation that is already live */
+bool perm_change(struct perm_allocation *alloc, virtual_perm perms);
+
+/* Free an allocation */
+void perm_free(struct perm_allocation *alloc);
+
+/* Helper for memsetting an allocation. Should be called before perm_writable_finish() */
+void perm_memset(struct perm_allocation *alloc, char val);
+
+/* The final address of the allocation */
+static inline unsigned long perm_alloc_address(const struct perm_allocation *alloc)
+{
+	return (unsigned long)alloc->area->addr + alloc->offset;
+}
+
+/* The size of the allocation */
+static inline unsigned long perm_alloc_size(const struct perm_allocation *alloc)
+{
+	return alloc->size;
+}
+
+static inline unsigned long within_perm_alloc(const struct perm_allocation *alloc,
+					      unsigned long addr)
+{
+	unsigned long base, size;
+
+	if (!alloc)
+		return false;
+
+	base = perm_alloc_address(alloc);
+	size = perm_alloc_size(alloc);
+
+	return base <= addr && addr < base + size;
+}
+
+static inline unsigned long perm_writable_base(struct perm_allocation *alloc)
+{
+	return perm_writable_addr(alloc, perm_alloc_address(alloc));
+}
+
+static inline bool perm_is_writable(struct perm_allocation *alloc)
+{
+	if (!alloc)
+		return false;
+
+	return (alloc->cur_perm & PERM_W) || alloc->writable;
+}
+
 #endif /* _LINUX_VMALLOC_H */
diff --git a/mm/nommu.c b/mm/nommu.c
index 0faf39b32cdb..6458bd23de3e 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1810,6 +1810,72 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
 	return 0;
 }
 
+struct perm_allocation *perm_alloc(unsigned long vstart, unsigned long vend, unsigned long page_cnt,
+				   virtual_perm perms)
+{
+	struct perm_allocation *alloc;
+	struct vm_struct *area;
+	unsigned long size = page_cnt << PAGE_SHIFT;
+	void *ptr;
+
+	if (!size)
+		return NULL;
+
+	alloc = kmalloc(sizeof(*alloc), GFP_KERNEL | __GFP_ZERO);
+
+	if (!alloc)
+		return NULL;
+
+	area = kmalloc(sizeof(*area), GFP_KERNEL | __GFP_ZERO);
+
+	if (!area)
+		goto free_alloc;
+
+	alloc->area = area;
+
+	ptr = vmalloc(size);
+
+	if (!ptr)
+		goto free_area;
+	area->addr = ptr;
+	alloc->size = size;
+	alloc->cur_perm = PERM_RWX;
+
+	return alloc;
+
+free_area:
+	kfree(area);
+free_alloc:
+	kfree(alloc);
+	return NULL;
+}
+
+unsigned long perm_writable_addr(struct perm_allocation *alloc, unsigned long addr)
+{
+	return addr;
+}
+
+bool perm_writable_finish(struct perm_allocation *alloc)
+{
+	return true;
+}
+
+bool perm_change(struct perm_allocation *alloc, virtual_perm perms)
+{
+	return true;
+}
+
+void perm_free(struct perm_allocation *alloc)
+{
+	if (!alloc)
+		return;
+
+	kfree(alloc->area);
+	kfree(alloc);
+}
+
+void perm_memset(struct perm_allocation *alloc, char val) {}
+
 /*
  * Initialise sysctl_user_reserve_kbytes.
  *
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6ae491a8b210..3e8e54a75dfc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -34,6 +34,7 @@
 #include <linux/bitops.h>
 #include <linux/rbtree_augmented.h>
 #include <linux/overflow.h>
+#include <linux/moduleloader.h>
 
 #include <linux/uaccess.h>
 #include <asm/tlbflush.h>
@@ -3088,6 +3089,140 @@ void free_vm_area(struct vm_struct *area)
 }
 EXPORT_SYMBOL_GPL(free_vm_area);
 
+#ifndef CONFIG_ARCH_HAS_PERM_ALLOC_IMPLEMENTATION
+
+#ifndef CONFIG_MODULES
+/* If modules are not configured, provide stubs so perm_alloc() can use its fallback logic. */
+void *module_alloc(unsigned long size)
+{
+	return NULL;
+}
+
+void module_memfree(void *module_region) { }
+#endif /* !CONFIG_MODULES */
+
+struct perm_allocation *perm_alloc(unsigned long vstart, unsigned long vend, unsigned long page_cnt,
+				   virtual_perm perms)
+{
+	struct perm_allocation *alloc;
+	unsigned long size = page_cnt << PAGE_SHIFT;
+	void *ptr;
+
+	if (!size)
+		return NULL;
+
+	alloc = kmalloc(sizeof(*alloc), GFP_KERNEL | __GFP_ZERO);
+
+	if (!alloc)
+		return NULL;
+
+	ptr = module_alloc(size);
+
+	if (!ptr) {
+		kfree(alloc);
+		return NULL;
+	}
+
+	/*
+	 * In order to work with all arches, call the arch's module_alloc(), which is the only
+	 * cross-arch place that knows where an executable allocation should go. If the caller
+	 * passed in a specific range (vend != 0) and the result falls outside of it, we could
+	 * try a vmalloc_node_range() at this point, but just return NULL for now.
+	 */
+	if (vend && ((unsigned long)ptr < vstart || (unsigned long)ptr >= vend)) {
+		module_memfree(ptr);
+		kfree(alloc);
+		return NULL;
+	}
+
+	alloc->area = find_vm_area(ptr);
+	alloc->size = size;
+
+	if (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_X86))
+		alloc->cur_perm = PERM_RW;
+	else
+		alloc->cur_perm = PERM_RWX;
+
+	alloc->orig_perm = perms;
+
+	return alloc;
+}
+
+unsigned long perm_writable_addr(struct perm_allocation *alloc, unsigned long addr)
+{
+	return addr;
+}
+
+bool perm_writable_finish(struct perm_allocation *alloc)
+{
+	if (!alloc)
+		return false;
+
+	return perm_change(alloc, alloc->orig_perm);
+}
+
+bool perm_change(struct perm_allocation *alloc, virtual_perm perm)
+{
+	unsigned long start, npages;
+	virtual_perm unset, set;
+
+	if (!alloc)
+		return false;
+
+	npages = alloc->size >> PAGE_SHIFT;
+
+	start = perm_alloc_address(alloc);
+
+	set = ~alloc->cur_perm & perm;
+	unset = alloc->cur_perm & ~perm;
+
+	if (set & PERM_W)
+		set_memory_rw(start, npages);
+
+	if (unset & PERM_W)
+		set_memory_ro(start, npages);
+
+	if (set & PERM_X)
+		set_memory_x(start, npages);
+
+	if (unset & PERM_X)
+		set_memory_nx(start, npages);
+
+	alloc->cur_perm = perm;
+
+	return true;
+}
+
+static inline bool perms_need_reset(struct perm_allocation *alloc)
+{
+	return (alloc->cur_perm & PERM_X) || (~alloc->cur_perm & PERM_W);
+}
+
+void perm_free(struct perm_allocation *alloc)
+{
+	unsigned long addr;
+
+	if (!alloc)
+		return;
+
+	addr = perm_alloc_address(alloc);
+
+	if (perms_need_reset(alloc))
+		set_vm_flush_reset_perms((void *)addr);
+
+	module_memfree((void *)addr);
+
+	kfree(alloc);
+}
+
+void perm_memset(struct perm_allocation *alloc, char val)
+{
+	if (!alloc)
+		return;
+	memset((void *)perm_writable_base(alloc), val, perm_alloc_size(alloc));
+}
+#endif /* CONFIG_ARCH_HAS_PERM_ALLOC_IMPLEMENTATION */
+
 #ifdef CONFIG_SMP
 static struct vmap_area *node_to_va(struct rb_node *n)
 {
-- 
2.20.1



* [PATCH RFC 02/10] bpf: Use perm_alloc() for BPF JIT filters
  2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation Rick Edgecombe
@ 2020-11-20 20:24 ` Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 03/10] module: Use perm_alloc() for modules Rick Edgecombe
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

eBPF has other executable allocations besides filters, but just convert
over the filters for now.

Since struct perm_allocation has size information, no longer track this
separately.

For x86, write the JIT to the address provided by perm_writable_addr()
so that in later patches this can be directed to a separate writable
staging area.
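
The resulting pattern on x86 looks roughly like the below (a simplified
sketch of what this patch does in the JIT, not the exact code):

	u8 *image;
	struct perm_allocation *alloc;

	alloc = bpf_jit_binary_alloc(proglen, &image, BPF_IMAGE_ALIGNMENT, jit_fill_hole);

	/* Emit instructions through the writable alias of the image */
	memcpy((void *)perm_writable_addr(alloc, (unsigned long)image), temp, ilen);

	/* Make the RX mapping live (perm_writable_finish() under the covers) */
	bpf_jit_binary_lock_ro(alloc);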

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/arm/net/bpf_jit_32.c         |  3 +-
 arch/arm64/net/bpf_jit_comp.c     |  5 ++--
 arch/mips/net/bpf_jit.c           |  2 +-
 arch/mips/net/ebpf_jit.c          |  3 +-
 arch/powerpc/net/bpf_jit_comp.c   |  2 +-
 arch/powerpc/net/bpf_jit_comp64.c | 10 +++----
 arch/s390/net/bpf_jit_comp.c      |  5 ++--
 arch/sparc/net/bpf_jit_comp_32.c  |  2 +-
 arch/sparc/net/bpf_jit_comp_64.c  |  5 ++--
 arch/x86/net/bpf_jit_comp.c       | 15 ++++++++--
 arch/x86/net/bpf_jit_comp32.c     |  3 +-
 include/linux/filter.h            | 30 ++++++-------------
 kernel/bpf/core.c                 | 48 +++++++++++++++----------------
 13 files changed, 65 insertions(+), 68 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 0207b6ea6e8a..83b7d7a7a833 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1874,7 +1874,7 @@ bool bpf_jit_needs_zext(void)
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
 	struct bpf_prog *tmp, *orig_prog = prog;
-	struct bpf_binary_header *header;
+	struct perm_allocation *header;
 	bool tmp_blinded = false;
 	struct jit_ctx ctx;
 	unsigned int tmp_idx;
@@ -1971,6 +1971,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		prog = orig_prog;
 		goto out_imms;
 	}
+	prog->alloc = header;
 
 	/* 2.) Actual pass to generate final JIT code */
 	ctx.target = (u32 *) image_ptr;
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index ef9f1d5e989d..371a1b8a7fa3 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -970,7 +970,7 @@ static inline void bpf_flush_icache(void *start, void *end)
 }
 
 struct arm64_jit_data {
-	struct bpf_binary_header *header;
+	struct perm_allocation *header;
 	u8 *image;
 	struct jit_ctx ctx;
 };
@@ -979,7 +979,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
 	int image_size, prog_size, extable_size;
 	struct bpf_prog *tmp, *orig_prog = prog;
-	struct bpf_binary_header *header;
+	struct perm_allocation *header;
 	struct arm64_jit_data *jit_data;
 	bool was_classic = bpf_prog_was_classic(prog);
 	bool tmp_blinded = false;
@@ -1055,6 +1055,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		prog = orig_prog;
 		goto out_off;
 	}
+	prog->alloc = header;
 
 	/* 2. Now, the actual pass. */
 
diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
index 0af88622c619..c8ea5c00ce55 100644
--- a/arch/mips/net/bpf_jit.c
+++ b/arch/mips/net/bpf_jit.c
@@ -1264,7 +1264,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
 void bpf_jit_free(struct bpf_prog *fp)
 {
 	if (fp->jited)
-		module_memfree(fp->bpf_func);
+		bpf_jit_binary_free(fp->alloc);
 
 	bpf_prog_unlock_free(fp);
 }
diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c
index 561154cbcc40..e251d0cd33d8 100644
--- a/arch/mips/net/ebpf_jit.c
+++ b/arch/mips/net/ebpf_jit.c
@@ -1799,7 +1799,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	struct bpf_prog *orig_prog = prog;
 	bool tmp_blinded = false;
 	struct bpf_prog *tmp;
-	struct bpf_binary_header *header = NULL;
+	struct perm_allocation *header = NULL;
 	struct jit_ctx ctx;
 	unsigned int image_size;
 	u8 *image_ptr;
@@ -1889,6 +1889,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 				      sizeof(u32), jit_fill_hole);
 	if (header == NULL)
 		goto out_err;
+	prog->alloc = header;
 
 	ctx.target = (u32 *)image_ptr;
 
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index e809cb5a1631..7891a8920c68 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -677,7 +677,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
 void bpf_jit_free(struct bpf_prog *fp)
 {
 	if (fp->jited)
-		module_memfree(fp->bpf_func);
+		bpf_jit_binary_free(fp->alloc);
 
 	bpf_prog_unlock_free(fp);
 }
diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index 022103c6a201..d78e582580a5 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -1062,7 +1062,7 @@ static int bpf_jit_fixup_subprog_calls(struct bpf_prog *fp, u32 *image,
 }
 
 struct powerpc64_jit_data {
-	struct bpf_binary_header *header;
+	struct perm_allocation *header;
 	u32 *addrs;
 	u8 *image;
 	u32 proglen;
@@ -1085,7 +1085,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 	struct codegen_context cgctx;
 	int pass;
 	int flen;
-	struct bpf_binary_header *bpf_hdr;
+	struct perm_allocation *bpf_hdr;
 	struct bpf_prog *org_fp = fp;
 	struct bpf_prog *tmp_fp;
 	bool bpf_blinded = false;
@@ -1173,6 +1173,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 		fp = org_fp;
 		goto out_addrs;
 	}
+	fp->alloc = bpf_hdr;
 
 skip_init_ctx:
 	code_base = (u32 *)(image + FUNCTION_DESCR_SIZE);
@@ -1249,11 +1250,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 /* Overriding bpf_jit_free() as we don't set images read-only. */
 void bpf_jit_free(struct bpf_prog *fp)
 {
-	unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
-	struct bpf_binary_header *bpf_hdr = (void *)addr;
-
 	if (fp->jited)
-		bpf_jit_binary_free(bpf_hdr);
+		bpf_jit_binary_free(fp->alloc);
 
 	bpf_prog_unlock_free(fp);
 }
diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index 0a4182792876..e0440307f539 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -1686,7 +1686,7 @@ bool bpf_jit_needs_zext(void)
 }
 
 struct s390_jit_data {
-	struct bpf_binary_header *header;
+	struct perm_allocation *header;
 	struct bpf_jit ctx;
 	int pass;
 };
@@ -1721,7 +1721,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 {
 	u32 stack_depth = round_up(fp->aux->stack_depth, 8);
 	struct bpf_prog *tmp, *orig_fp = fp;
-	struct bpf_binary_header *header;
+	struct perm_allocation *header;
 	struct s390_jit_data *jit_data;
 	bool tmp_blinded = false;
 	bool extra_pass = false;
@@ -1785,6 +1785,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 		fp = orig_fp;
 		goto free_addrs;
 	}
+	fp->alloc = header;
 skip_init_ctx:
 	if (bpf_jit_prog(&jit, fp, extra_pass, stack_depth)) {
 		bpf_jit_binary_free(header);
diff --git a/arch/sparc/net/bpf_jit_comp_32.c b/arch/sparc/net/bpf_jit_comp_32.c
index b1dbf2fa8c0a..2a29d2523fd6 100644
--- a/arch/sparc/net/bpf_jit_comp_32.c
+++ b/arch/sparc/net/bpf_jit_comp_32.c
@@ -758,7 +758,7 @@ cond_branch:			f_offset = addrs[i + filter[i].jf];
 void bpf_jit_free(struct bpf_prog *fp)
 {
 	if (fp->jited)
-		module_memfree(fp->bpf_func);
+		bpf_jit_binary_free(fp->alloc);
 
 	bpf_prog_unlock_free(fp);
 }
diff --git a/arch/sparc/net/bpf_jit_comp_64.c b/arch/sparc/net/bpf_jit_comp_64.c
index 3364e2a00989..3ad639166076 100644
--- a/arch/sparc/net/bpf_jit_comp_64.c
+++ b/arch/sparc/net/bpf_jit_comp_64.c
@@ -1458,7 +1458,7 @@ bool bpf_jit_needs_zext(void)
 }
 
 struct sparc64_jit_data {
-	struct bpf_binary_header *header;
+	struct perm_allocation *header;
 	u8 *image;
 	struct jit_ctx ctx;
 };
@@ -1467,7 +1467,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
 	struct bpf_prog *tmp, *orig_prog = prog;
 	struct sparc64_jit_data *jit_data;
-	struct bpf_binary_header *header;
+	struct perm_allocation *header;
 	u32 prev_image_size, image_size;
 	bool tmp_blinded = false;
 	bool extra_pass = false;
@@ -1559,6 +1559,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		prog = orig_prog;
 		goto out_off;
 	}
+	prog->alloc = header;
 
 	ctx.image = (u32 *)image_ptr;
 skip_init_ctx:
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 796506dcfc42..e11ee6b71d40 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1476,11 +1476,14 @@ xadd:			if (is_imm8(insn->off))
 		}
 
 		if (image) {
+			unsigned long writable = perm_writable_addr(bpf_prog->alloc,
+								       (unsigned long)image);
+
 			if (unlikely(proglen + ilen > oldproglen)) {
 				pr_err("bpf_jit: fatal error\n");
 				return -EFAULT;
 			}
-			memcpy(image + proglen, temp, ilen);
+			memcpy((void *)writable + proglen, temp, ilen);
 		}
 		proglen += ilen;
 		addrs[i] = proglen;
@@ -1965,16 +1968,21 @@ int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs)
 }
 
 struct x64_jit_data {
-	struct bpf_binary_header *header;
+	struct perm_allocation *header;
 	int *addrs;
 	u8 *image;
 	int proglen;
 	struct jit_context ctx;
 };
 
+struct perm_allocation *bpf_jit_alloc_exec(unsigned long size)
+{
+	return perm_alloc(MODULES_VADDR, MODULES_END, size >> PAGE_SHIFT, PERM_RX);
+}
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
-	struct bpf_binary_header *header = NULL;
+	struct perm_allocation *header = NULL;
 	struct bpf_prog *tmp, *orig_prog = prog;
 	struct x64_jit_data *jit_data;
 	int proglen, oldproglen = 0;
@@ -2078,6 +2086,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 				prog = orig_prog;
 				goto out_addrs;
 			}
+			prog->alloc = header;
 			prog->aux->extable = (void *) image + roundup(proglen, align);
 		}
 		oldproglen = proglen;
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index 96fde03aa987..680ca859b829 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -2298,7 +2298,7 @@ bool bpf_jit_needs_zext(void)
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
-	struct bpf_binary_header *header = NULL;
+	struct perm_allocation *header = NULL;
 	struct bpf_prog *tmp, *orig_prog = prog;
 	int proglen, oldproglen = 0;
 	struct jit_context ctx = {};
@@ -2370,6 +2370,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 				prog = orig_prog;
 				goto out_addrs;
 			}
+			prog->alloc = header;
 		}
 		oldproglen = proglen;
 		cond_resched();
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1b62397bd124..1ffd93e01550 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -517,11 +517,6 @@ struct sock_fprog_kern {
 /* Some arches need doubleword alignment for their instructions and/or data */
 #define BPF_IMAGE_ALIGNMENT 8
 
-struct bpf_binary_header {
-	u32 pages;
-	u8 image[] __aligned(BPF_IMAGE_ALIGNMENT);
-};
-
 struct bpf_prog {
 	u16			pages;		/* Number of allocated pages */
 	u16			jited:1,	/* Is our filter JIT'ed? */
@@ -544,6 +539,8 @@ struct bpf_prog {
 	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
 	unsigned int		(*bpf_func)(const void *ctx,
 					    const struct bpf_insn *insn);
+	struct perm_allocation *alloc;
+
 	/* Instructions for interpreter */
 	struct sock_filter	insns[0];
 	struct bpf_insn		insnsi[];
@@ -818,20 +815,9 @@ static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
 #endif
 }
 
-static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
+static inline void bpf_jit_binary_lock_ro(struct perm_allocation *alloc)
 {
-	set_vm_flush_reset_perms(hdr);
-	set_memory_ro((unsigned long)hdr, hdr->pages);
-	set_memory_x((unsigned long)hdr, hdr->pages);
-}
-
-static inline struct bpf_binary_header *
-bpf_jit_binary_hdr(const struct bpf_prog *fp)
-{
-	unsigned long real_start = (unsigned long)fp->bpf_func;
-	unsigned long addr = real_start & PAGE_MASK;
-
-	return (void *)addr;
+	perm_writable_finish(alloc);
 }
 
 int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap);
@@ -986,14 +972,14 @@ extern long bpf_jit_limit;
 
 typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size);
 
-struct bpf_binary_header *
+struct perm_allocation *
 bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     unsigned int alignment,
 		     bpf_jit_fill_hole_t bpf_fill_ill_insns);
-void bpf_jit_binary_free(struct bpf_binary_header *hdr);
+void bpf_jit_binary_free(struct perm_allocation *hdr);
 u64 bpf_jit_alloc_exec_limit(void);
-void *bpf_jit_alloc_exec(unsigned long size);
-void bpf_jit_free_exec(void *addr);
+struct perm_allocation *bpf_jit_alloc_exec(unsigned long size);
+void bpf_jit_free_exec(struct perm_allocation *alloc);
 void bpf_jit_free(struct bpf_prog *fp);
 
 int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 55454d2278b1..c0088014c65c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -530,13 +530,13 @@ long bpf_jit_limit   __read_mostly;
 static void
 bpf_prog_ksym_set_addr(struct bpf_prog *prog)
 {
-	const struct bpf_binary_header *hdr = bpf_jit_binary_hdr(prog);
-	unsigned long addr = (unsigned long)hdr;
+	const struct perm_allocation *alloc = prog->alloc;
+	unsigned long addr = perm_alloc_address(alloc);
 
 	WARN_ON_ONCE(!bpf_prog_ebpf_jited(prog));
 
-	prog->aux->ksym.start = (unsigned long) prog->bpf_func;
-	prog->aux->ksym.end   = addr + hdr->pages * PAGE_SIZE;
+	prog->aux->ksym.start = (unsigned long)addr;
+	prog->aux->ksym.end   = addr +  alloc->size;
 }
 
 static void
@@ -843,22 +843,23 @@ static void bpf_jit_uncharge_modmem(u32 pages)
 	atomic_long_sub(pages, &bpf_jit_current);
 }
 
-void *__weak bpf_jit_alloc_exec(unsigned long size)
+struct perm_allocation * __weak bpf_jit_alloc_exec(unsigned long size)
 {
-	return module_alloc(size);
+	/* Note: Range ignored for default perm_alloc implementation */
+	return perm_alloc(0, 0, size >> PAGE_SHIFT, PERM_RX);
 }
 
-void __weak bpf_jit_free_exec(void *addr)
+void __weak bpf_jit_free_exec(struct perm_allocation *alloc)
 {
-	module_memfree(addr);
+	perm_free(alloc);
 }
 
-struct bpf_binary_header *
+struct perm_allocation *
 bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     unsigned int alignment,
 		     bpf_jit_fill_hole_t bpf_fill_ill_insns)
 {
-	struct bpf_binary_header *hdr;
+	struct perm_allocation *alloc;
 	u32 size, hole, start, pages;
 
 	WARN_ON_ONCE(!is_power_of_2(alignment) ||
@@ -868,36 +869,35 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 	 * fill a page, allow at least 128 extra bytes to insert a
 	 * random section of illegal instructions.
 	 */
-	size = round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE);
+	size = round_up(proglen + 128, PAGE_SIZE);
 	pages = size / PAGE_SIZE;
 
 	if (bpf_jit_charge_modmem(pages))
 		return NULL;
-	hdr = bpf_jit_alloc_exec(size);
-	if (!hdr) {
+	alloc = bpf_jit_alloc_exec(size);
+	if (!alloc) {
 		bpf_jit_uncharge_modmem(pages);
 		return NULL;
 	}
 
 	/* Fill space with illegal/arch-dep instructions. */
-	bpf_fill_ill_insns(hdr, size);
+	bpf_fill_ill_insns((void *)perm_writable_base(alloc), size);
 
-	hdr->pages = pages;
-	hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)),
-		     PAGE_SIZE - sizeof(*hdr));
+	hole = min_t(unsigned int, size - proglen,
+		     PAGE_SIZE);
 	start = (get_random_int() % hole) & ~(alignment - 1);
 
 	/* Leave a random number of instructions before BPF code. */
-	*image_ptr = &hdr->image[start];
+	*image_ptr = (void *)perm_alloc_address(alloc) + start;
 
-	return hdr;
+	return alloc;
 }
 
-void bpf_jit_binary_free(struct bpf_binary_header *hdr)
+void bpf_jit_binary_free(struct perm_allocation *alloc)
 {
-	u32 pages = hdr->pages;
+	u32 pages = alloc->size >> PAGE_SHIFT;
 
-	bpf_jit_free_exec(hdr);
+	bpf_jit_free_exec(alloc);
 	bpf_jit_uncharge_modmem(pages);
 }
 
@@ -908,9 +908,7 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr)
 void __weak bpf_jit_free(struct bpf_prog *fp)
 {
 	if (fp->jited) {
-		struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
-
-		bpf_jit_binary_free(hdr);
+		bpf_jit_binary_free(fp->alloc);
 
 		WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
 	}
-- 
2.20.1



* [PATCH RFC 03/10] module: Use perm_alloc() for modules
  2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 02/10] bpf: Use perm_alloc() for BPF JIT filters Rick Edgecombe
@ 2020-11-20 20:24 ` Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 04/10] module: Support separate writable allocation Rick Edgecombe
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

Today the module loader uses a single allocation for each of the core and init
allocation lifetimes. Subsections of these allocations are broken into
different permission types. In order to allow arch breakouts to allocate
permissions more efficiently, change the module loader to operate on
these as separate allocations. Track these broken out subsection
allocations in struct module_layout.

Drop the unneeded frob_() functions previously used for each permission
transition. Since transitioning ro_after_init still requires a flush,
instead provide module_map() to handle the initial mapping and
ro_after_init transition.

Similar to how a flag was stuffed into the upper bits of sh_entsize in
order to communicate which (core or init) allocation a section should
be placed in, create 4 new flags to communicate which permission allocation
a subsection should be in.

To prevent disturbing any architectures that may have relocation
limitations that require subsections to be close together, lay out these
perm_allocations from a single module_alloc() allocation by default.
This way architectures can work as they did before, but they can have the
opportunity to do things more efficiently if supported.

So this should not have any functional change yet. It is just a change
to how the different regions of the module allocations are tracked in
module.c such that future patches can actually make the regions separate
allocations.
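
With the new sh_entsize flag bits, resolving where a section's contents end
up could look roughly like this (a sketch of the idea, not the exact code in
this patch):

	unsigned long mask = shdr->sh_entsize & ALL_OFFSET_MASK;
	unsigned long offset = shdr->sh_entsize & ~ALL_OFFSET_MASK;
	struct module_layout *layout = (mask & INIT_OFFSET_MASK) ? &mod->init_layout
								 : &mod->core_layout;
	struct perm_allocation *alloc = get_alloc_from_layout(layout, mask);
	void *dest = (void *)perm_writable_addr(alloc, perm_alloc_address(alloc) + offset);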

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/module.h |  44 ++--
 kernel/module.c        | 547 +++++++++++++++++++++++++----------------
 2 files changed, 362 insertions(+), 229 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 6264617bab4d..9964f909d879 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -26,6 +26,7 @@
 #include <linux/tracepoint-defs.h>
 #include <linux/srcu.h>
 #include <linux/static_call_types.h>
+#include <linux/vmalloc.h>
 
 #include <linux/percpu.h>
 #include <asm/module.h>
@@ -318,23 +319,32 @@ struct mod_tree_node {
 	struct latch_tree_node node;
 };
 
-struct module_layout {
-	/* The actual code + data. */
-	void *base;
-	/* Total size. */
+struct mod_alloc {
+	struct perm_allocation *alloc;
 	unsigned int size;
-	/* The size of the executable code.  */
-	unsigned int text_size;
-	/* Size of RO section of the module (text+rodata) */
-	unsigned int ro_size;
-	/* Size of RO after init section */
-	unsigned int ro_after_init_size;
 
 #ifdef CONFIG_MODULES_TREE_LOOKUP
 	struct mod_tree_node mtn;
 #endif
 };
 
+struct module_layout {
+	/*
+	 * The start of the allocations for arch's that have
+	 * them as contiguous regions.
+	 */
+	void *base;
+
+	/* The actual code + data. */
+	struct mod_alloc text;
+	struct mod_alloc ro;
+	struct mod_alloc ro_after_init;
+	struct mod_alloc rw;
+
+	/* Total size. */
+	unsigned int size;
+};
+
 #ifdef CONFIG_MODULES_TREE_LOOKUP
 /* Only touch one cacheline for common rbtree-for-core-layout case. */
 #define __module_layout_align ____cacheline_aligned
@@ -562,19 +572,25 @@ bool is_module_address(unsigned long addr);
 bool __is_module_percpu_address(unsigned long addr, unsigned long *can_addr);
 bool is_module_percpu_address(unsigned long addr);
 bool is_module_text_address(unsigned long addr);
+struct perm_allocation *module_get_allocation(struct module *mod, unsigned long addr);
+bool module_perm_alloc(struct module_layout *layout);
+void module_perm_free(struct module_layout *layout);
 
 static inline bool within_module_core(unsigned long addr,
 				      const struct module *mod)
 {
-	return (unsigned long)mod->core_layout.base <= addr &&
-	       addr < (unsigned long)mod->core_layout.base + mod->core_layout.size;
+	return within_perm_alloc(mod->core_layout.text.alloc, addr) ||
+		within_perm_alloc(mod->core_layout.ro.alloc, addr) ||
+		within_perm_alloc(mod->core_layout.ro_after_init.alloc, addr) ||
+		within_perm_alloc(mod->core_layout.rw.alloc, addr);
 }
 
 static inline bool within_module_init(unsigned long addr,
 				      const struct module *mod)
 {
-	return (unsigned long)mod->init_layout.base <= addr &&
-	       addr < (unsigned long)mod->init_layout.base + mod->init_layout.size;
+	return within_perm_alloc(mod->init_layout.text.alloc, addr) ||
+		within_perm_alloc(mod->init_layout.ro.alloc, addr) ||
+		within_perm_alloc(mod->init_layout.rw.alloc, addr);
 }
 
 static inline bool within_module(unsigned long addr, const struct module *mod)
diff --git a/kernel/module.c b/kernel/module.c
index a4fa44a652a7..0b31c44798e2 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -80,6 +80,12 @@
 
 /* If this is set, the section belongs in the init part of the module */
 #define INIT_OFFSET_MASK (1UL << (BITS_PER_LONG-1))
+#define TEXT_OFFSET_MASK (1UL << (BITS_PER_LONG-2))
+#define RO_OFFSET_MASK (1UL << (BITS_PER_LONG-3))
+#define RO_AFTER_INIT_OFFSET_MASK (1UL << (BITS_PER_LONG-4))
+#define RW_OFFSET_MASK (1UL << (BITS_PER_LONG-5))
+#define ALL_OFFSET_MASK (INIT_OFFSET_MASK | TEXT_OFFSET_MASK | RO_OFFSET_MASK | \
+			 RO_AFTER_INIT_OFFSET_MASK | RW_OFFSET_MASK)
 
 /*
  * Mutex protects:
@@ -109,16 +115,16 @@ static LLIST_HEAD(init_free_list);
 
 static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
 {
-	struct module_layout *layout = container_of(n, struct module_layout, mtn.node);
+	struct mod_alloc *modalloc = container_of(n, struct mod_alloc, mtn.node);
 
-	return (unsigned long)layout->base;
+	return perm_alloc_address(modalloc->alloc);
 }
 
 static __always_inline unsigned long __mod_tree_size(struct latch_tree_node *n)
 {
-	struct module_layout *layout = container_of(n, struct module_layout, mtn.node);
+	struct mod_alloc *modalloc = container_of(n, struct mod_alloc, mtn.node);
 
-	return (unsigned long)layout->size;
+	return (unsigned long)modalloc->size;
 }
 
 static __always_inline bool
@@ -170,29 +176,51 @@ static void __mod_tree_remove(struct mod_tree_node *node)
 	latch_tree_erase(&node->node, &mod_tree.root, &mod_tree_ops);
 }
 
+static void __mod_tree_insert_layout(struct module_layout *layout, struct module *mod)
+{
+	layout->text.mtn.mod = mod;
+	layout->ro.mtn.mod = mod;
+	layout->ro_after_init.mtn.mod = mod;
+	layout->rw.mtn.mod = mod;
+
+	if (layout->text.alloc)
+		__mod_tree_insert(&layout->text.mtn);
+	if (layout->ro.alloc)
+		__mod_tree_insert(&layout->ro.mtn);
+	if (layout->ro_after_init.alloc)
+		__mod_tree_insert(&layout->ro_after_init.mtn);
+	if (layout->rw.alloc)
+		__mod_tree_insert(&layout->rw.mtn);
+}
+
+static void __mod_tree_remove_layout(struct module_layout *layout)
+{
+	__mod_tree_remove(&layout->text.mtn);
+	__mod_tree_remove(&layout->ro.mtn);
+	__mod_tree_remove(&layout->ro_after_init.mtn);
+	__mod_tree_remove(&layout->rw.mtn);
+}
+
 /*
  * These modifications: insert, remove_init and remove; are serialized by the
  * module_mutex.
  */
 static void mod_tree_insert(struct module *mod)
 {
-	mod->core_layout.mtn.mod = mod;
-	mod->init_layout.mtn.mod = mod;
-
-	__mod_tree_insert(&mod->core_layout.mtn);
+	__mod_tree_insert_layout(&mod->core_layout, mod);
 	if (mod->init_layout.size)
-		__mod_tree_insert(&mod->init_layout.mtn);
+		__mod_tree_insert_layout(&mod->init_layout, mod);
 }
 
 static void mod_tree_remove_init(struct module *mod)
 {
 	if (mod->init_layout.size)
-		__mod_tree_remove(&mod->init_layout.mtn);
+		__mod_tree_remove_layout(&mod->init_layout);
 }
 
 static void mod_tree_remove(struct module *mod)
 {
-	__mod_tree_remove(&mod->core_layout.mtn);
+	__mod_tree_remove_layout(&mod->core_layout);
 	mod_tree_remove_init(mod);
 }
 
@@ -234,10 +262,16 @@ static struct module *mod_find(unsigned long addr)
  * Bounds of module text, for speeding up __module_address.
  * Protected by module_mutex.
  */
-static void __mod_update_bounds(void *base, unsigned int size)
+static void __mod_update_bounds(struct mod_alloc *modalloc)
 {
-	unsigned long min = (unsigned long)base;
-	unsigned long max = min + size;
+	unsigned long min;
+	unsigned long max;
+
+	if (!modalloc->alloc)
+		return;
+
+	min = (unsigned long)perm_alloc_address(modalloc->alloc);
+	max = min + perm_alloc_size(modalloc->alloc);
 
 	if (min < module_addr_min)
 		module_addr_min = min;
@@ -247,9 +281,14 @@ static void __mod_update_bounds(void *base, unsigned int size)
 
 static void mod_update_bounds(struct module *mod)
 {
-	__mod_update_bounds(mod->core_layout.base, mod->core_layout.size);
-	if (mod->init_layout.size)
-		__mod_update_bounds(mod->init_layout.base, mod->init_layout.size);
+	__mod_update_bounds(&mod->core_layout.text);
+	__mod_update_bounds(&mod->core_layout.ro);
+	__mod_update_bounds(&mod->core_layout.ro_after_init);
+	__mod_update_bounds(&mod->core_layout.rw);
+	__mod_update_bounds(&mod->init_layout.text);
+	__mod_update_bounds(&mod->init_layout.ro);
+	__mod_update_bounds(&mod->init_layout.ro_after_init);
+	__mod_update_bounds(&mod->init_layout.rw);
 }
 
 #ifdef CONFIG_KGDB_KDB
@@ -1995,100 +2034,48 @@ static void mod_sysfs_teardown(struct module *mod)
 	mod_sysfs_fini(mod);
 }
 
-/*
- * LKM RO/NX protection: protect module's text/ro-data
- * from modification and any data from execution.
- *
- * General layout of module is:
- *          [text] [read-only-data] [ro-after-init] [writable data]
- * text_size -----^                ^               ^               ^
- * ro_size ------------------------|               |               |
- * ro_after_init_size -----------------------------|               |
- * size -----------------------------------------------------------|
- *
- * These values are always page-aligned (as is base)
- */
-
-/*
- * Since some arches are moving towards PAGE_KERNEL module allocations instead
- * of PAGE_KERNEL_EXEC, keep frob_text() and module_enable_x() outside of the
- * CONFIG_STRICT_MODULE_RWX block below because they are needed regardless of
- * whether we are strict.
- */
-#ifdef CONFIG_ARCH_HAS_STRICT_MODULE_RWX
-static void frob_text(const struct module_layout *layout,
-		      int (*set_memory)(unsigned long start, int num_pages))
+#ifdef CONFIG_STRICT_MODULE_RWX
+static void map_core_regions(const struct module *mod,
+			     virtual_perm text_perm,
+			     virtual_perm ro_perm,
+			     virtual_perm ro_after_init_perm,
+			     virtual_perm rw_perm)
 {
-	BUG_ON((unsigned long)layout->base & (PAGE_SIZE-1));
-	BUG_ON((unsigned long)layout->text_size & (PAGE_SIZE-1));
-	set_memory((unsigned long)layout->base,
-		   layout->text_size >> PAGE_SHIFT);
-}
+	const struct module_layout *layout = &mod->core_layout;
 
-static void module_enable_x(const struct module *mod)
-{
-	frob_text(&mod->core_layout, set_memory_x);
-	frob_text(&mod->init_layout, set_memory_x);
+	perm_change(layout->text.alloc, text_perm);
+	perm_change(layout->ro.alloc, ro_perm);
+	perm_change(layout->ro_after_init.alloc, ro_after_init_perm);
+	perm_change(layout->rw.alloc, rw_perm);
 }
-#else /* !CONFIG_ARCH_HAS_STRICT_MODULE_RWX */
-static void module_enable_x(const struct module *mod) { }
-#endif /* CONFIG_ARCH_HAS_STRICT_MODULE_RWX */
 
-#ifdef CONFIG_STRICT_MODULE_RWX
-static void frob_rodata(const struct module_layout *layout,
-			int (*set_memory)(unsigned long start, int num_pages))
-{
-	BUG_ON((unsigned long)layout->base & (PAGE_SIZE-1));
-	BUG_ON((unsigned long)layout->text_size & (PAGE_SIZE-1));
-	BUG_ON((unsigned long)layout->ro_size & (PAGE_SIZE-1));
-	set_memory((unsigned long)layout->base + layout->text_size,
-		   (layout->ro_size - layout->text_size) >> PAGE_SHIFT);
-}
 
-static void frob_ro_after_init(const struct module_layout *layout,
-				int (*set_memory)(unsigned long start, int num_pages))
+static void map_init_regions(const struct module *mod,
+			     virtual_perm text_perm, virtual_perm ro_perm)
 {
-	BUG_ON((unsigned long)layout->base & (PAGE_SIZE-1));
-	BUG_ON((unsigned long)layout->ro_size & (PAGE_SIZE-1));
-	BUG_ON((unsigned long)layout->ro_after_init_size & (PAGE_SIZE-1));
-	set_memory((unsigned long)layout->base + layout->ro_size,
-		   (layout->ro_after_init_size - layout->ro_size) >> PAGE_SHIFT);
-}
+	const struct module_layout *layout = &mod->init_layout;
 
-static void frob_writable_data(const struct module_layout *layout,
-			       int (*set_memory)(unsigned long start, int num_pages))
-{
-	BUG_ON((unsigned long)layout->base & (PAGE_SIZE-1));
-	BUG_ON((unsigned long)layout->ro_after_init_size & (PAGE_SIZE-1));
-	BUG_ON((unsigned long)layout->size & (PAGE_SIZE-1));
-	set_memory((unsigned long)layout->base + layout->ro_after_init_size,
-		   (layout->size - layout->ro_after_init_size) >> PAGE_SHIFT);
+	perm_change(layout->text.alloc, text_perm);
+	perm_change(layout->ro.alloc, ro_perm);
 }
 
-static void module_enable_ro(const struct module *mod, bool after_init)
+static void module_map(const struct module *mod, bool after_init)
 {
-	if (!rodata_enabled)
-		return;
+	virtual_perm after_init_perm = PERM_R;
 
-	set_vm_flush_reset_perms(mod->core_layout.base);
-	set_vm_flush_reset_perms(mod->init_layout.base);
-	frob_text(&mod->core_layout, set_memory_ro);
+	perm_writable_finish(mod->core_layout.text.alloc);
+	perm_writable_finish(mod->core_layout.ro.alloc);
+	perm_writable_finish(mod->init_layout.text.alloc);
+	perm_writable_finish(mod->init_layout.ro.alloc);
 
-	frob_rodata(&mod->core_layout, set_memory_ro);
-	frob_text(&mod->init_layout, set_memory_ro);
-	frob_rodata(&mod->init_layout, set_memory_ro);
+	if (!rodata_enabled)
+		return;
 
-	if (after_init)
-		frob_ro_after_init(&mod->core_layout, set_memory_ro);
-}
+	if (!after_init)
+		after_init_perm |= PERM_W;
 
-static void module_enable_nx(const struct module *mod)
-{
-	frob_rodata(&mod->core_layout, set_memory_nx);
-	frob_ro_after_init(&mod->core_layout, set_memory_nx);
-	frob_writable_data(&mod->core_layout, set_memory_nx);
-	frob_rodata(&mod->init_layout, set_memory_nx);
-	frob_writable_data(&mod->init_layout, set_memory_nx);
+	map_core_regions(mod, PERM_RX, PERM_R, after_init_perm, PERM_RW);
+	map_init_regions(mod, PERM_RX, PERM_R);
 }
 
 static int module_enforce_rwx_sections(Elf_Ehdr *hdr, Elf_Shdr *sechdrs,
@@ -2109,8 +2096,7 @@ static int module_enforce_rwx_sections(Elf_Ehdr *hdr, Elf_Shdr *sechdrs,
 }
 
 #else /* !CONFIG_STRICT_MODULE_RWX */
-static void module_enable_nx(const struct module *mod) { }
-static void module_enable_ro(const struct module *mod, bool after_init) {}
+static void module_map(const struct module *mod, bool after_init) {}
 static int module_enforce_rwx_sections(Elf_Ehdr *hdr, Elf_Shdr *sechdrs,
 				       char *secstrings, struct module *mod)
 {
@@ -2211,6 +2197,8 @@ void __weak module_arch_freeing_init(struct module *mod)
 {
 }
 
+void module_perm_free(struct module_layout *layout);
+
 /* Free a module, remove from lists, etc. */
 static void free_module(struct module *mod)
 {
@@ -2252,15 +2240,16 @@ static void free_module(struct module *mod)
 
 	/* This may be empty, but that's OK */
 	module_arch_freeing_init(mod);
-	module_memfree(mod->init_layout.base);
+	module_perm_free(&mod->init_layout);
 	kfree(mod->args);
 	percpu_modfree(mod);
 
 	/* Free lock-classes; relies on the preceding sync_rcu(). */
-	lockdep_free_key_range(mod->core_layout.base, mod->core_layout.size);
+	lockdep_free_key_range((void *)perm_alloc_address(mod->core_layout.rw.alloc),
+			       mod->core_layout.rw.size);
 
 	/* Finally, free the core (containing the module structure) */
-	module_memfree(mod->core_layout.base);
+	module_perm_free(&mod->core_layout);
 }
 
 void *__symbol_get(const char *symbol)
@@ -2436,10 +2425,56 @@ static long get_offset(struct module *mod, unsigned int *size,
 
 	*size += arch_mod_section_prepend(mod, section);
 	ret = ALIGN(*size, sechdr->sh_addralign ?: 1);
-	*size = ret + sechdr->sh_size;
+	*size = debug_align(ret + sechdr->sh_size);
+
 	return ret;
 }
 
+static inline unsigned int *get_size_from_layout(struct module_layout *layout,
+						 unsigned long offset_mask)
+{
+	if (offset_mask & TEXT_OFFSET_MASK)
+		return &layout->text.size;
+	if (offset_mask & RO_OFFSET_MASK)
+		return &layout->ro.size;
+	if (offset_mask & RO_AFTER_INIT_OFFSET_MASK)
+		return &layout->ro_after_init.size;
+	return &layout->rw.size; /*RW_OFFSET_MASK*/
+}
+
+static inline struct perm_allocation *get_alloc_from_layout(struct module_layout *layout,
+							    unsigned long offset_mask)
+{
+	if (offset_mask & TEXT_OFFSET_MASK)
+		return layout->text.alloc;
+	if (offset_mask & RO_OFFSET_MASK)
+		return layout->ro.alloc;
+	if (offset_mask & RO_AFTER_INIT_OFFSET_MASK)
+		return layout->ro_after_init.alloc;
+	return layout->rw.alloc; /*RW_OFFSET_MASK*/
+}
+
+static void set_total_size(struct module_layout *layout)
+{
+	layout->size = debug_align(layout->text.size + layout->ro.size +
+		       layout->ro_after_init.size + layout->rw.size);
+}
+
+struct perm_allocation *module_get_allocation(struct module *mod, unsigned long addr)
+{
+	struct perm_allocation *allocs[] = { mod->core_layout.text.alloc, mod->core_layout.ro.alloc,
+					mod->core_layout.ro_after_init.alloc,
+					mod->core_layout.rw.alloc, mod->init_layout.text.alloc,
+					mod->init_layout.ro.alloc, mod->init_layout.rw.alloc };
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(allocs); i++)
+		if (within_perm_alloc(allocs[i], addr))
+			return allocs[i];
+
+	return NULL;
+}
+
 /* Lay out the SHF_ALLOC sections in a way not dissimilar to how ld
    might -- code, read-only data, read-write data, small data.  Tally
    sizes, and place the offsets into sh_entsize fields: high bit means it
@@ -2456,6 +2491,7 @@ static void layout_sections(struct module *mod, struct load_info *info)
 		{ SHF_WRITE | SHF_ALLOC, ARCH_SHF_SMALL },
 		{ ARCH_SHF_SMALL | SHF_ALLOC, 0 }
 	};
+	unsigned int *section_size;
 	unsigned int m, i;
 
 	for (i = 0; i < info->hdr->e_shnum; i++)
@@ -2463,6 +2499,8 @@ static void layout_sections(struct module *mod, struct load_info *info)
 
 	pr_debug("Core section allocation order:\n");
 	for (m = 0; m < ARRAY_SIZE(masks); ++m) {
+		unsigned long offset_mask = (1UL << (BITS_PER_LONG - (m + 2)));
+
 		for (i = 0; i < info->hdr->e_shnum; ++i) {
 			Elf_Shdr *s = &info->sechdrs[i];
 			const char *sname = info->secstrings + s->sh_name;
@@ -2472,30 +2510,19 @@ static void layout_sections(struct module *mod, struct load_info *info)
 			    || s->sh_entsize != ~0UL
 			    || module_init_section(sname))
 				continue;
-			s->sh_entsize = get_offset(mod, &mod->core_layout.size, s, i);
+			section_size = get_size_from_layout(&mod->core_layout, offset_mask);
+			s->sh_entsize = get_offset(mod, section_size, s, i) | offset_mask;
+
 			pr_debug("\t%s\n", sname);
 		}
-		switch (m) {
-		case 0: /* executable */
-			mod->core_layout.size = debug_align(mod->core_layout.size);
-			mod->core_layout.text_size = mod->core_layout.size;
-			break;
-		case 1: /* RO: text and ro-data */
-			mod->core_layout.size = debug_align(mod->core_layout.size);
-			mod->core_layout.ro_size = mod->core_layout.size;
-			break;
-		case 2: /* RO after init */
-			mod->core_layout.size = debug_align(mod->core_layout.size);
-			mod->core_layout.ro_after_init_size = mod->core_layout.size;
-			break;
-		case 4: /* whole core */
-			mod->core_layout.size = debug_align(mod->core_layout.size);
-			break;
-		}
 	}
 
+	set_total_size(&mod->core_layout);
+
 	pr_debug("Init section allocation order:\n");
 	for (m = 0; m < ARRAY_SIZE(masks); ++m) {
+		unsigned long offset_mask = (1UL << (BITS_PER_LONG - (m + 2)));
+
 		for (i = 0; i < info->hdr->e_shnum; ++i) {
 			Elf_Shdr *s = &info->sechdrs[i];
 			const char *sname = info->secstrings + s->sh_name;
@@ -2505,31 +2532,15 @@ static void layout_sections(struct module *mod, struct load_info *info)
 			    || s->sh_entsize != ~0UL
 			    || !module_init_section(sname))
 				continue;
-			s->sh_entsize = (get_offset(mod, &mod->init_layout.size, s, i)
-					 | INIT_OFFSET_MASK);
+
+			section_size = get_size_from_layout(&mod->init_layout, offset_mask);
+			s->sh_entsize = get_offset(mod, section_size, s, i);
+			s->sh_entsize |= INIT_OFFSET_MASK | offset_mask;
 			pr_debug("\t%s\n", sname);
 		}
-		switch (m) {
-		case 0: /* executable */
-			mod->init_layout.size = debug_align(mod->init_layout.size);
-			mod->init_layout.text_size = mod->init_layout.size;
-			break;
-		case 1: /* RO: text and ro-data */
-			mod->init_layout.size = debug_align(mod->init_layout.size);
-			mod->init_layout.ro_size = mod->init_layout.size;
-			break;
-		case 2:
-			/*
-			 * RO after init doesn't apply to init_layout (only
-			 * core_layout), so it just takes the value of ro_size.
-			 */
-			mod->init_layout.ro_after_init_size = mod->init_layout.ro_size;
-			break;
-		case 4: /* whole init */
-			mod->init_layout.size = debug_align(mod->init_layout.size);
-			break;
-		}
 	}
+
+	set_total_size(&mod->init_layout);
 }
 
 static void set_license(struct module *mod, const char *license)
@@ -2724,7 +2735,7 @@ static void layout_symtab(struct module *mod, struct load_info *info)
 
 	/* Put symbol section at end of init part of module. */
 	symsect->sh_flags |= SHF_ALLOC;
-	symsect->sh_entsize = get_offset(mod, &mod->init_layout.size, symsect,
+	symsect->sh_entsize = get_offset(mod, &mod->init_layout.rw.size, symsect,
 					 info->index.sym) | INIT_OFFSET_MASK;
 	pr_debug("\t%s\n", info->secstrings + symsect->sh_name);
 
@@ -2742,27 +2753,28 @@ static void layout_symtab(struct module *mod, struct load_info *info)
 	}
 
 	/* Append room for core symbols at end of core part. */
-	info->symoffs = ALIGN(mod->core_layout.size, symsect->sh_addralign ?: 1);
-	info->stroffs = mod->core_layout.size = info->symoffs + ndst * sizeof(Elf_Sym);
-	mod->core_layout.size += strtab_size;
-	info->core_typeoffs = mod->core_layout.size;
+	info->symoffs = ALIGN(mod->core_layout.rw.size, symsect->sh_addralign ?: 1);
+	info->stroffs = mod->core_layout.rw.size = info->symoffs + ndst * sizeof(Elf_Sym);
+	mod->core_layout.rw.size += strtab_size;
+	info->core_typeoffs = mod->core_layout.rw.size;
-	mod->core_layout.size += ndst * sizeof(char);
+	mod->core_layout.rw.size += ndst * sizeof(char);
-	mod->core_layout.size = debug_align(mod->core_layout.size);
 
 	/* Put string table section at end of init part of module. */
 	strsect->sh_flags |= SHF_ALLOC;
-	strsect->sh_entsize = get_offset(mod, &mod->init_layout.size, strsect,
+	strsect->sh_entsize = get_offset(mod, &mod->init_layout.rw.size, strsect,
 					 info->index.str) | INIT_OFFSET_MASK;
 	pr_debug("\t%s\n", info->secstrings + strsect->sh_name);
 
 	/* We'll tack temporary mod_kallsyms on the end. */
-	mod->init_layout.size = ALIGN(mod->init_layout.size,
-				      __alignof__(struct mod_kallsyms));
-	info->mod_kallsyms_init_off = mod->init_layout.size;
-	mod->init_layout.size += sizeof(struct mod_kallsyms);
-	info->init_typeoffs = mod->init_layout.size;
-	mod->init_layout.size += nsrc * sizeof(char);
-	mod->init_layout.size = debug_align(mod->init_layout.size);
+	mod->init_layout.rw.size = ALIGN(mod->init_layout.rw.size,
+					 __alignof__(struct mod_kallsyms));
+	info->mod_kallsyms_init_off = mod->init_layout.rw.size;
+	mod->init_layout.rw.size += sizeof(struct mod_kallsyms);
+	info->init_typeoffs = mod->init_layout.rw.size;
+	mod->init_layout.rw.size += nsrc * sizeof(char);
+
+	set_total_size(&mod->core_layout);
+	set_total_size(&mod->init_layout);
 }
 
 /*
@@ -2777,23 +2789,25 @@ static void add_kallsyms(struct module *mod, const struct load_info *info)
 	Elf_Sym *dst;
 	char *s;
 	Elf_Shdr *symsec = &info->sechdrs[info->index.sym];
+	void *core_rw = (void *)perm_alloc_address(mod->core_layout.rw.alloc);
+	void *init_rw = (void *)perm_alloc_address(mod->init_layout.rw.alloc);
 
 	/* Set up to point into init section. */
-	mod->kallsyms = mod->init_layout.base + info->mod_kallsyms_init_off;
+	mod->kallsyms = init_rw + info->mod_kallsyms_init_off;
 
 	mod->kallsyms->symtab = (void *)symsec->sh_addr;
 	mod->kallsyms->num_symtab = symsec->sh_size / sizeof(Elf_Sym);
 	/* Make sure we get permanent strtab: don't use info->strtab. */
 	mod->kallsyms->strtab = (void *)info->sechdrs[info->index.str].sh_addr;
-	mod->kallsyms->typetab = mod->init_layout.base + info->init_typeoffs;
+	mod->kallsyms->typetab = init_rw + info->init_typeoffs;
 
 	/*
 	 * Now populate the cut down core kallsyms for after init
 	 * and set types up while we still have access to sections.
 	 */
-	mod->core_kallsyms.symtab = dst = mod->core_layout.base + info->symoffs;
-	mod->core_kallsyms.strtab = s = mod->core_layout.base + info->stroffs;
-	mod->core_kallsyms.typetab = mod->core_layout.base + info->core_typeoffs;
+	mod->core_kallsyms.symtab = dst = core_rw + info->symoffs;
+	mod->core_kallsyms.strtab = s = core_rw + info->stroffs;
+	mod->core_kallsyms.typetab = core_rw + info->core_typeoffs;
 	src = mod->kallsyms->symtab;
 	for (ndst = i = 0; i < mod->kallsyms->num_symtab; i++) {
 		mod->kallsyms->typetab[i] = elf_type(src + i, info);
@@ -2845,6 +2859,110 @@ bool __weak module_init_section(const char *name)
 	return strstarts(name, ".init");
 }
 
+static void set_perm_alloc(struct perm_allocation *alloc, struct vm_struct *area,
+			   unsigned long offset, unsigned long size)
+{
+	if (!alloc)
+		return;
+
+	alloc->area = area;
+	alloc->offset = offset;
+	alloc->size = size;
+}
+
+/*
+ * Lay out the perm_allocs as they would have been laid out by a straight module_alloc()
+ * call, in case any archs care about that.
+ */
+bool __weak module_perm_alloc(struct module_layout *layout)
+{
+	void *addr = module_alloc(layout->size);
+	gfp_t flags = GFP_KERNEL | __GFP_ZERO;
+	struct vm_struct *area;
+
+	layout->text.alloc = NULL;
+	layout->ro.alloc = NULL;
+	layout->ro_after_init.alloc = NULL;
+	layout->rw.alloc = NULL;
+
+	if (!addr)
+		return false;
+
+	layout->base = addr;
+
+	if (IS_ENABLED(CONFIG_MMU)) {
+		area = find_vm_area(addr);
+	} else {
+		area = kmalloc(sizeof(*area), flags);
+		if (!area) {
+			module_memfree(addr);
+			return false;
+		}
+		area->addr = addr;
+	}
+
+	/*
+	 * This is a bit of an awkward hack. modules.c now expects all of the different permission
+	 * ranges to be in separate allocations, but we want to make this perm_allocation transition
+	 * without disturbing any arch, which may have unknown relocation rules. So to make sure
+	 * there is no functional change, allocate using the arch's module_alloc() and lay fake
+	 * perm_allocations over the top of it. This way all the sections would be laid out exactly
+ * as before struct perm_allocation was introduced, until an arch adds its own
+	 * module_perm_alloc().
+	 */
+	layout->text.alloc = kmalloc(sizeof(*layout->text.alloc), flags);
+	layout->ro.alloc = kmalloc(sizeof(*layout->ro.alloc), flags);
+	layout->ro_after_init.alloc = kmalloc(sizeof(*layout->ro_after_init.alloc), flags);
+	layout->rw.alloc = kmalloc(sizeof(*layout->rw.alloc), flags);
+	if (!(layout->text.alloc && layout->ro.alloc && layout->ro_after_init.alloc &&
+	      layout->rw.alloc)) {
+		module_perm_free(layout);
+		layout->text.alloc = NULL;
+		layout->ro.alloc = NULL;
+		layout->ro_after_init.alloc = NULL;
+		layout->rw.alloc = NULL;
+		return false;
+	}
+
+	set_perm_alloc(layout->text.alloc, area, 0, layout->text.size);
+	set_perm_alloc(layout->ro.alloc, area, layout->text.size, layout->ro.size);
+	set_perm_alloc(layout->ro_after_init.alloc, area, layout->ro.alloc->offset +
+		       layout->ro.size, layout->ro_after_init.size);
+	set_perm_alloc(layout->rw.alloc, area,
+		       layout->ro_after_init.alloc->offset + layout->ro_after_init.size,
+		       layout->rw.size);
+
+	set_vm_flush_reset_perms(addr);
+
+	return true;
+}
+
+static void unset_layout_allocs(struct module_layout *layout)
+{
+	layout->text.alloc = NULL;
+	layout->ro.alloc = NULL;
+	layout->ro_after_init.alloc = NULL;
+	layout->rw.alloc = NULL;
+}
+
+void __weak module_perm_free(struct module_layout *layout)
+{
+	void *addr;
+
+	if (!layout->text.alloc)
+		return;
+	addr = (void *)perm_alloc_address(layout->text.alloc);
+
+	/* Clean up the fake perm_allocations in a special way */
+	kfree(layout->text.alloc);
+	kfree(layout->ro.alloc);
+	kfree(layout->ro_after_init.alloc);
+	kfree(layout->rw.alloc);
+	unset_layout_allocs(layout);
+
+	module_memfree(addr);
+}
+
 bool __weak module_exit_section(const char *name)
 {
 	return strstarts(name, ".exit");
@@ -3306,54 +3424,52 @@ static int find_module_sections(struct module *mod, struct load_info *info)
 static int move_module(struct module *mod, struct load_info *info)
 {
 	int i;
-	void *ptr;
 
-	/* Do the allocs. */
-	ptr = module_alloc(mod->core_layout.size);
 	/*
 	 * The pointer to this block is stored in the module structure
 	 * which is inside the block. Just mark it as not being a
 	 * leak.
 	 */
-	kmemleak_not_leak(ptr);
-	if (!ptr)
+	if (!module_perm_alloc(&mod->core_layout))
 		return -ENOMEM;
 
-	memset(ptr, 0, mod->core_layout.size);
-	mod->core_layout.base = ptr;
+	perm_memset(mod->core_layout.text.alloc, 0xcc);
+	perm_memset(mod->core_layout.ro.alloc, 0);
+	perm_memset(mod->core_layout.ro_after_init.alloc, 0);
+	perm_memset(mod->core_layout.rw.alloc, 0);
 
 	if (mod->init_layout.size) {
-		ptr = module_alloc(mod->init_layout.size);
 		/*
 		 * The pointer to this block is stored in the module structure
 		 * which is inside the block. This block doesn't need to be
 		 * scanned as it contains data and code that will be freed
 		 * after the module is initialized.
 		 */
-		kmemleak_ignore(ptr);
-		if (!ptr) {
-			module_memfree(mod->core_layout.base);
+		if (!module_perm_alloc(&mod->init_layout))
 			return -ENOMEM;
-		}
-		memset(ptr, 0, mod->init_layout.size);
-		mod->init_layout.base = ptr;
-	} else
-		mod->init_layout.base = NULL;
+		perm_memset(mod->init_layout.text.alloc, 0xcc);
+		perm_memset(mod->init_layout.ro.alloc, 0);
+		perm_memset(mod->init_layout.rw.alloc, 0);
+	} else {
+		unset_layout_allocs(&mod->init_layout);
+	}
 
 	/* Transfer each section which specifies SHF_ALLOC */
 	pr_debug("final section addresses:\n");
 	for (i = 0; i < info->hdr->e_shnum; i++) {
 		void *dest;
+		struct perm_allocation *alloc;
+
 		Elf_Shdr *shdr = &info->sechdrs[i];
 
 		if (!(shdr->sh_flags & SHF_ALLOC))
 			continue;
 
 		if (shdr->sh_entsize & INIT_OFFSET_MASK)
-			dest = mod->init_layout.base
-				+ (shdr->sh_entsize & ~INIT_OFFSET_MASK);
+			alloc = get_alloc_from_layout(&mod->init_layout, shdr->sh_entsize);
 		else
-			dest = mod->core_layout.base + shdr->sh_entsize;
+			alloc = get_alloc_from_layout(&mod->core_layout, shdr->sh_entsize);
+		dest = (void *)perm_alloc_address(alloc) + (shdr->sh_entsize & ~ALL_OFFSET_MASK);
 
 		if (shdr->sh_type != SHT_NOBITS)
 			memcpy(dest, (void *)shdr->sh_addr, shdr->sh_size);
@@ -3414,12 +3530,13 @@ static void flush_module_icache(const struct module *mod)
 	 * Do it before processing of module parameters, so the module
 	 * can provide parameter accessor functions of its own.
 	 */
-	if (mod->init_layout.base)
-		flush_icache_range((unsigned long)mod->init_layout.base,
-				   (unsigned long)mod->init_layout.base
-				   + mod->init_layout.size);
-	flush_icache_range((unsigned long)mod->core_layout.base,
-			   (unsigned long)mod->core_layout.base + mod->core_layout.size);
+	if (mod->init_layout.text.alloc)
+		flush_icache_range((unsigned long)perm_alloc_address(mod->init_layout.text.alloc),
+				   (unsigned long)perm_alloc_address(mod->init_layout.text.alloc)
+				   + mod->init_layout.text.size);
+	flush_icache_range((unsigned long)perm_alloc_address(mod->core_layout.text.alloc),
+			   (unsigned long)perm_alloc_address(mod->core_layout.text.alloc) +
+			   mod->core_layout.text.size);
 }
 
 int __weak module_frob_arch_sections(Elf_Ehdr *hdr,
@@ -3515,8 +3632,8 @@ static void module_deallocate(struct module *mod, struct load_info *info)
 {
 	percpu_modfree(mod);
 	module_arch_freeing_init(mod);
-	module_memfree(mod->init_layout.base);
-	module_memfree(mod->core_layout.base);
+	module_perm_free(&mod->init_layout);
+	module_perm_free(&mod->core_layout);
 }
 
 int __weak module_finalize(const Elf_Ehdr *hdr,
@@ -3576,7 +3693,7 @@ static void do_mod_ctors(struct module *mod)
 /* For freeing module_init on success, in case kallsyms traversing */
 struct mod_initfree {
 	struct llist_node node;
-	void *module_init;
+	struct module_layout layout;
 };
 
 static void do_free_init(struct work_struct *w)
@@ -3590,7 +3707,7 @@ static void do_free_init(struct work_struct *w)
 
 	llist_for_each_safe(pos, n, list) {
 		initfree = container_of(pos, struct mod_initfree, node);
-		module_memfree(initfree->module_init);
+		module_perm_free(&initfree->layout);
 		kfree(initfree);
 	}
 }
@@ -3606,12 +3723,11 @@ static noinline int do_init_module(struct module *mod)
 	int ret = 0;
 	struct mod_initfree *freeinit;
 
-	freeinit = kmalloc(sizeof(*freeinit), GFP_KERNEL);
+	freeinit = kmalloc(sizeof(*freeinit) * 4, GFP_KERNEL);
 	if (!freeinit) {
 		ret = -ENOMEM;
 		goto fail;
 	}
-	freeinit->module_init = mod->init_layout.base;
 
 	/*
 	 * We want to find out whether @mod uses async during init.  Clear
@@ -3659,8 +3775,9 @@ static noinline int do_init_module(struct module *mod)
 	if (!mod->async_probe_requested && (current->flags & PF_USED_ASYNC))
 		async_synchronize_full();
 
-	ftrace_free_mem(mod, mod->init_layout.base, mod->init_layout.base +
-			mod->init_layout.size);
+	ftrace_free_mem(mod, (void *)perm_alloc_address(mod->init_layout.text.alloc),
+			(void *)perm_alloc_address(mod->init_layout.text.alloc) +
+			perm_alloc_size(mod->init_layout.text.alloc));
 	mutex_lock(&module_mutex);
 	/* Drop initial reference. */
 	module_put(mod);
@@ -3669,14 +3786,10 @@ static noinline int do_init_module(struct module *mod)
 	/* Switch to core kallsyms now init is done: kallsyms may be walking! */
 	rcu_assign_pointer(mod->kallsyms, &mod->core_kallsyms);
 #endif
-	module_enable_ro(mod, true);
+	module_map(mod, true);
 	mod_tree_remove_init(mod);
 	module_arch_freeing_init(mod);
-	mod->init_layout.base = NULL;
-	mod->init_layout.size = 0;
-	mod->init_layout.ro_size = 0;
-	mod->init_layout.ro_after_init_size = 0;
-	mod->init_layout.text_size = 0;
+
 	/*
 	 * We want to free module_init, but be aware that kallsyms may be
 	 * walking this with preempt disabled.  In all the failure paths, we
@@ -3690,6 +3803,9 @@ static noinline int do_init_module(struct module *mod)
 	 * be cleaned up needs to sync with the queued work - ie
 	 * rcu_barrier()
 	 */
+	freeinit->layout = mod->init_layout;
+	unset_layout_allocs(&mod->init_layout);
+	mod->init_layout.size = 0;
 	if (llist_add(&freeinit->node, &init_free_list))
 		schedule_work(&init_free_wq);
 
@@ -3775,9 +3891,7 @@ static int complete_formation(struct module *mod, struct load_info *info)
 	/* This relies on module_mutex for list integrity. */
 	module_bug_finalize(info->hdr, info->sechdrs, mod);
 
-	module_enable_ro(mod, false);
-	module_enable_nx(mod);
-	module_enable_x(mod);
+	module_map(mod, false);
 
 	/* Mark state as coming so strong_try_module_get() ignores us,
 	 * but kallsyms etc. can see us. */
@@ -4018,7 +4132,8 @@ static int load_module(struct load_info *info, const char __user *uargs,
 	mutex_unlock(&module_mutex);
  free_module:
 	/* Free lock-classes; relies on the preceding sync_rcu() */
-	lockdep_free_key_range(mod->core_layout.base, mod->core_layout.size);
+	lockdep_free_key_range((void *)perm_alloc_address(mod->core_layout.rw.alloc),
+			       mod->core_layout.rw.size);
 
 	module_deallocate(mod, info);
  free_copy:
@@ -4110,9 +4225,11 @@ static const char *find_kallsyms_symbol(struct module *mod,
 
 	/* At worse, next value is at end of module */
 	if (within_module_init(addr, mod))
-		nextval = (unsigned long)mod->init_layout.base+mod->init_layout.text_size;
+		nextval = perm_alloc_address(mod->init_layout.text.alloc) +
+					     mod->init_layout.text.size;
 	else
-		nextval = (unsigned long)mod->core_layout.base+mod->core_layout.text_size;
+		nextval = perm_alloc_address(mod->core_layout.text.alloc) +
+					     mod->core_layout.text.size;
 
 	bestval = kallsyms_symbol_value(&kallsyms->symtab[best]);
 
@@ -4404,7 +4521,7 @@ static int m_show(struct seq_file *m, void *p)
 		   mod->state == MODULE_STATE_COMING ? "Loading" :
 		   "Live");
 	/* Used by oprofile and other similar tools. */
-	value = m->private ? NULL : mod->core_layout.base;
+	value = m->private ? NULL : (void *)perm_alloc_address(mod->core_layout.text.alloc);
 	seq_printf(m, " 0x%px", value);
 
 	/* Taints info */
@@ -4524,11 +4641,8 @@ struct module *__module_address(unsigned long addr)
 	module_assert_mutex_or_preempt();
 
 	mod = mod_find(addr);
-	if (mod) {
-		BUG_ON(!within_module(addr, mod));
-		if (mod->state == MODULE_STATE_UNFORMED)
-			mod = NULL;
-	}
+	BUG_ON(mod && !within_module(addr, mod));
+
 	return mod;
 }
 
@@ -4562,9 +4676,12 @@ struct module *__module_text_address(unsigned long addr)
 {
 	struct module *mod = __module_address(addr);
 	if (mod) {
+		void *init_text = (void *)perm_alloc_address(mod->init_layout.text.alloc);
+		void *core_text = (void *)perm_alloc_address(mod->core_layout.text.alloc);
+
 		/* Make sure it's within the text section. */
-		if (!within(addr, mod->init_layout.base, mod->init_layout.text_size)
-		    && !within(addr, mod->core_layout.base, mod->core_layout.text_size))
+		if (!within(addr, init_text, mod->init_layout.text.size) &&
+		    !within(addr, core_text, mod->core_layout.text.size))
 			mod = NULL;
 	}
 	return mod;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 04/10] module: Support separate writable allocation
  2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
                   ` (2 preceding siblings ...)
  2020-11-20 20:24 ` [PATCH RFC 03/10] module: Use perm_alloc() for modules Rick Edgecombe
@ 2020-11-20 20:24 ` Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 05/10] x86/modules: Use real perm_allocations Rick Edgecombe
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

The perm_alloc interface allows architectures to direct writes intended
for an allocation to a separate writable staging area before the mapping
is made live. In order to support this, change modules to write to the
address provided by perm_writable_addr(). Currently this is the same
address as the final allocation, so this patch should not create any
functional change yet.

To facilitate redirection to separate writable staging areas, create a
helper, module_adjust_writable_addr(). This function returns an
allocation's writable address if the passed address belongs to a module
that is in the process of being loaded; otherwise it simply returns the
address passed in. This helper, while a little heavyweight, will allow
callers in upcoming patches to retrieve the writable address without
needing context on which module is being loaded.
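
As a rough illustration of the intended usage (the helper below is made
up for this example; only module_adjust_writable_addr() comes from this
patch), a loader-side fixup that writes into a module image might look
like:

static void example_patch_table(u32 *table, size_t nents, u32 val)
{
	/* Redirected to the writable staging alias while the module loads */
	u32 *writable = module_adjust_writable_addr(table);
	size_t i;

	for (i = 0; i < nents; i++)
		writable[i] = val;
}

Reads of the table still go through the original address; only the
stores are redirected.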

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 include/linux/module.h | 22 ++++++++++++++++++++++
 kernel/module.c        | 14 +++++++++-----
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 9964f909d879..32dd22b2a38a 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -576,6 +576,23 @@ struct perm_allocation *module_get_allocation(struct module *mod, unsigned long
 bool module_perm_alloc(struct module_layout *layout);
 void module_perm_free(struct module_layout *layout);
 
+static inline void *module_adjust_writable_addr(void *addr)
+{
+	unsigned long laddr = (unsigned long)addr;
+	struct module *mod;
+
+	mutex_lock(&module_mutex);
+	mod = __module_address(laddr);
+	if (!mod) {
+		mutex_unlock(&module_mutex);
+		return addr;
+	}
+	mutex_unlock(&module_mutex);
+	/* The module shouldn't be going away if someone is trying to write to it */
+
+	return (void *)perm_writable_addr(module_get_allocation(mod, laddr), laddr);
+}
+
 static inline bool within_module_core(unsigned long addr,
 				      const struct module *mod)
 {
@@ -853,6 +870,11 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr)
 	return ptr;
 }
 
+static inline void *module_adjust_writable_addr(void *addr)
+{
+	return addr;
+}
+
 #endif /* CONFIG_MODULES */
 
 #ifdef CONFIG_SYSFS
diff --git a/kernel/module.c b/kernel/module.c
index 0b31c44798e2..d0afedd36cea 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3457,7 +3457,7 @@ static int move_module(struct module *mod, struct load_info *info)
 	/* Transfer each section which specifies SHF_ALLOC */
 	pr_debug("final section addresses:\n");
 	for (i = 0; i < info->hdr->e_shnum; i++) {
-		void *dest;
+		void *dest, *wdest;
 		struct perm_allocation *alloc;
 
 		Elf_Shdr *shdr = &info->sechdrs[i];
@@ -3470,9 +3470,10 @@ static int move_module(struct module *mod, struct load_info *info)
 		else
 			alloc = get_alloc_from_layout(&mod->core_layout, shdr->sh_entsize);
 		dest = (void *)perm_alloc_address(alloc) + (shdr->sh_entsize & ~ALL_OFFSET_MASK);
+		wdest = (void *)perm_writable_addr(alloc, (unsigned long)dest);
 
 		if (shdr->sh_type != SHT_NOBITS)
-			memcpy(dest, (void *)shdr->sh_addr, shdr->sh_size);
+			memcpy(wdest, (void *)shdr->sh_addr, shdr->sh_size);
 		/* Update sh_addr to point to copy in image. */
 		shdr->sh_addr = (unsigned long)dest;
 		pr_debug("\t0x%lx %s\n",
@@ -3645,12 +3646,15 @@ int __weak module_finalize(const Elf_Ehdr *hdr,
 
 static int post_relocation(struct module *mod, const struct load_info *info)
 {
+	struct exception_table_entry *extable_writ = module_adjust_writable_addr(mod->extable);
+	void *pcpu = (void *)info->sechdrs[info->index.pcpu].sh_addr;
+	void *percpu_writ = module_adjust_writable_addr(pcpu);
+
 	/* Sort exception table now relocations are done. */
-	sort_extable(mod->extable, mod->extable + mod->num_exentries);
+	sort_extable(extable_writ, mod->extable + mod->num_exentries);
 
 	/* Copy relocated percpu area over. */
-	percpu_modcopy(mod, (void *)info->sechdrs[info->index.pcpu].sh_addr,
-		       info->sechdrs[info->index.pcpu].sh_size);
+	percpu_modcopy(mod, percpu_writ, info->sechdrs[info->index.pcpu].sh_size);
 
 	/* Setup kallsyms-specific fields. */
 	add_kallsyms(mod, info);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 05/10] x86/modules: Use real perm_allocations
  2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
                   ` (3 preceding siblings ...)
  2020-11-20 20:24 ` [PATCH RFC 04/10] module: Support separate writable allocation Rick Edgecombe
@ 2020-11-20 20:24 ` Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 06/10] x86/alternatives: Handle perm_allocs for modules Rick Edgecombe
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

x86 relocations require all of the text sections to be within 2GB of
each other. So as long as the allocations are somewhere in the module
area, relocations can be applied. Relax the restriction that the regions
for a module's core or init area be virtually contiguous.

Also, apply relocations at the writable address of the perm_allocation to
support a future implementation that has the writable address in a
different allocation.
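
To sketch the pattern (simplified from the apply_relocate_add() change
below, not a complete relocation handler): the relocation value is still
computed against the final address, only the store is redirected to the
writable alias:

	loc = (void *)sechdrs[sechdrs[relsec].sh_info].sh_addr + rel[i].r_offset;
	alloc = module_get_allocation(me, (unsigned long)loc);
	writable_loc = (void *)perm_writable_addr(alloc, (unsigned long)loc);

	/* PC-relative math uses loc, the address the CPU will see... */
	val -= (u64)loc;
	/* ...but the bytes land at the writable alias */
	write(writable_loc, &val, 4);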

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/module.c | 84 +++++++++++++++++++++++++++++++++-------
 1 file changed, 71 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 34b153cbd4ac..7b369f5ffdb7 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -85,6 +85,58 @@ void *module_alloc(unsigned long size)
 	return p;
 }
 
+bool module_perm_alloc(struct module_layout *layout)
+{
+	unsigned long vstart = MODULES_VADDR + get_module_load_offset();
+	unsigned long vend = MODULES_END;
+	unsigned int text_page_cnt = PAGE_ALIGN(layout->text.size) >> PAGE_SHIFT;
+	unsigned int ro_page_cnt = PAGE_ALIGN(layout->ro.size) >> PAGE_SHIFT;
+	unsigned int ro_after_init_page_cnt = PAGE_ALIGN(layout->ro_after_init.size) >> PAGE_SHIFT;
+	unsigned int rw_page_cnt = PAGE_ALIGN(layout->rw.size) >> PAGE_SHIFT;
+
+	layout->text.alloc = NULL;
+	layout->ro.alloc = NULL;
+	layout->ro_after_init.alloc = NULL;
+	layout->rw.alloc = NULL;
+
+	layout->text.alloc = perm_alloc(vstart, vend, text_page_cnt, PERM_RX);
+	if (!layout->text.alloc && layout->text.size)
+		goto out;
+
+	layout->ro.alloc = perm_alloc(vstart, vend, ro_page_cnt, PERM_R);
+	if (!layout->ro.alloc && layout->ro.size)
+		goto text_free_out;
+
+	layout->ro_after_init.alloc = perm_alloc(vstart, vend, ro_after_init_page_cnt, PERM_RW);
+	if (!layout->ro_after_init.alloc && layout->ro_after_init.size)
+		goto ro_free_out;
+
+	layout->rw.alloc = perm_alloc(vstart, vend, rw_page_cnt, PERM_RW);
+	if (!layout->rw.alloc && layout->rw.size)
+		goto ro_after_init_out;
+
+	return true;
+ro_after_init_out:
+	perm_free(layout->ro_after_init.alloc);
+	layout->ro_after_init.alloc = NULL;
+ro_free_out:
+	perm_free(layout->ro.alloc);
+	layout->ro.alloc = NULL;
+text_free_out:
+	perm_free(layout->text.alloc);
+	layout->text.alloc = NULL;
+out:
+	return false;
+}
+
+void module_perm_free(struct module_layout *layout)
+{
+	perm_free(layout->text.alloc);
+	perm_free(layout->ro.alloc);
+	perm_free(layout->ro_after_init.alloc);
+	perm_free(layout->rw.alloc);
+}
+
 #ifdef CONFIG_X86_32
 int apply_relocate(Elf32_Shdr *sechdrs,
 		   const char *strtab,
@@ -134,9 +186,11 @@ static int __apply_relocate_add(Elf64_Shdr *sechdrs,
 		   void *(*write)(void *dest, const void *src, size_t len))
 {
 	unsigned int i;
+	struct perm_allocation *alloc;
 	Elf64_Rela *rel = (void *)sechdrs[relsec].sh_addr;
 	Elf64_Sym *sym;
 	void *loc;
+	void *writable_loc;
 	u64 val;
 
 	DEBUGP("Applying relocate section %u to %u\n",
@@ -146,6 +200,9 @@ static int __apply_relocate_add(Elf64_Shdr *sechdrs,
 		loc = (void *)sechdrs[sechdrs[relsec].sh_info].sh_addr
 			+ rel[i].r_offset;
 
+		alloc = module_get_allocation(me, (unsigned long)loc);
+		writable_loc = (void *)perm_writable_addr(alloc, (unsigned long)loc);
+
 		/* This is the symbol it is referring to.  Note that all
 		   undefined symbols have been resolved.  */
 		sym = (Elf64_Sym *)sechdrs[symindex].sh_addr
@@ -161,40 +218,40 @@ static int __apply_relocate_add(Elf64_Shdr *sechdrs,
 		case R_X86_64_NONE:
 			break;
 		case R_X86_64_64:
-			if (*(u64 *)loc != 0)
+			if (*(u64 *)writable_loc != 0)
 				goto invalid_relocation;
-			write(loc, &val, 8);
+			write(writable_loc, &val, 8);
 			break;
 		case R_X86_64_32:
-			if (*(u32 *)loc != 0)
+			if (*(u32 *)writable_loc != 0)
 				goto invalid_relocation;
-			write(loc, &val, 4);
-			if (val != *(u32 *)loc)
+			write(writable_loc, &val, 4);
+			if (val != *(u32 *)writable_loc)
 				goto overflow;
 			break;
 		case R_X86_64_32S:
-			if (*(s32 *)loc != 0)
+			if (*(s32 *)writable_loc != 0)
 				goto invalid_relocation;
-			write(loc, &val, 4);
-			if ((s64)val != *(s32 *)loc)
+			write(writable_loc, &val, 4);
+			if ((s64)val != *(s32 *)writable_loc)
 				goto overflow;
 			break;
 		case R_X86_64_PC32:
 		case R_X86_64_PLT32:
-			if (*(u32 *)loc != 0)
+			if (*(u32 *)writable_loc != 0)
 				goto invalid_relocation;
 			val -= (u64)loc;
-			write(loc, &val, 4);
+			write(writable_loc, &val, 4);
 #if 0
-			if ((s64)val != *(s32 *)loc)
+			if ((s64)val != *(s32 *)writable_loc)
 				goto overflow;
 #endif
 			break;
 		case R_X86_64_PC64:
-			if (*(u64 *)loc != 0)
+			if (*(u64 *)writable_loc != 0)
 				goto invalid_relocation;
 			val -= (u64)loc;
-			write(loc, &val, 8);
+			write(writable_loc, &val, 8);
 			break;
 		default:
 			pr_err("%s: Unknown rela relocation: %llu\n",
@@ -273,6 +330,7 @@ int module_finalize(const Elf_Ehdr *hdr,
 		void *aseg = (void *)alt->sh_addr;
 		apply_alternatives(aseg, aseg + alt->sh_size);
 	}
+
 	if (locks && text) {
 		void *lseg = (void *)locks->sh_addr;
 		void *tseg = (void *)text->sh_addr;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 06/10] x86/alternatives: Handle perm_allocs for modules
  2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
                   ` (4 preceding siblings ...)
  2020-11-20 20:24 ` [PATCH RFC 05/10] x86/modules: Use real perm_allocations Rick Edgecombe
@ 2020-11-20 20:24 ` Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 07/10] x86/unwind: Unwind orc at module writable address Rick Edgecombe
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

Modules being loaded using perm_allocs may have a separate writable
address. Handle this case in alternatives for operations called during
module loading.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/alternative.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 2400ad62f330..e6d8a696540b 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -373,7 +373,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 						  struct alt_instr *end)
 {
 	struct alt_instr *a;
-	u8 *instr, *replacement;
+	u8 *instr, *writ_instr, *replacement, *writ_replacement;
 	u8 insn_buff[MAX_PATCH_LEN];
 
 	DPRINTK("alt table %px, -> %px", start, end);
@@ -391,11 +391,13 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 
 		instr = (u8 *)&a->instr_offset + a->instr_offset;
 		replacement = (u8 *)&a->repl_offset + a->repl_offset;
+		writ_instr = (u8 *)module_adjust_writable_addr(instr);
+		writ_replacement = (u8 *)module_adjust_writable_addr(replacement);
 		BUG_ON(a->instrlen > sizeof(insn_buff));
 		BUG_ON(a->cpuid >= (NCAPINTS + NBUGINTS) * 32);
 		if (!boot_cpu_has(a->cpuid)) {
 			if (a->padlen > 1)
-				optimize_nops(a, instr);
+				optimize_nops(a, writ_instr);
 
 			continue;
 		}
@@ -403,13 +405,13 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 		DPRINTK("feat: %d*32+%d, old: (%pS (%px) len: %d), repl: (%px, len: %d), pad: %d",
 			a->cpuid >> 5,
 			a->cpuid & 0x1f,
-			instr, instr, a->instrlen,
+			writ_instr, instr, a->instrlen,
 			replacement, a->replacementlen, a->padlen);
 
 		DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr);
 		DUMP_BYTES(replacement, a->replacementlen, "%px: rpl_insn: ", replacement);
 
-		memcpy(insn_buff, replacement, a->replacementlen);
+		memcpy(insn_buff, writ_replacement, a->replacementlen);
 		insn_buff_sz = a->replacementlen;
 
 		/*
@@ -435,7 +437,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 		}
 		DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
 
-		text_poke_early(instr, insn_buff, insn_buff_sz);
+		text_poke_early(writ_instr, insn_buff, insn_buff_sz);
 	}
 }
 
@@ -496,6 +498,10 @@ void __init_or_module alternatives_smp_module_add(struct module *mod,
 						  void *text,  void *text_end)
 {
 	struct smp_alt_module *smp;
+	void *w_locks = module_adjust_writable_addr(locks);
+	void *w_locks_end = module_adjust_writable_addr(locks_end);
+	void *w_text = module_adjust_writable_addr(text);
+	void *w_text_end = module_adjust_writable_addr(text_end);
 
 	mutex_lock(&text_mutex);
 	if (!uniproc_patched)
@@ -522,7 +528,7 @@ void __init_or_module alternatives_smp_module_add(struct module *mod,
 
 	list_add_tail(&smp->next, &smp_alt_modules);
 smp_unlock:
-	alternatives_smp_unlock(locks, locks_end, text, text_end);
+	alternatives_smp_unlock(w_locks, w_locks_end, w_text, w_text_end);
 unlock:
 	mutex_unlock(&text_mutex);
 }
@@ -601,17 +607,18 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
 
 	for (p = start; p < end; p++) {
 		unsigned int used;
+		void *writable = module_adjust_writable_addr(p->instr);
 
 		BUG_ON(p->len > MAX_PATCH_LEN);
 		/* prep the buffer with the original instructions */
-		memcpy(insn_buff, p->instr, p->len);
-		used = pv_ops.init.patch(p->type, insn_buff, (unsigned long)p->instr, p->len);
+		memcpy(insn_buff, writable, p->len);
+		used = pv_ops.init.patch(p->type, insn_buff, (unsigned long)writable, p->len);
 
 		BUG_ON(used > p->len);
 
 		/* Pad the rest with nops */
 		add_nops(insn_buff + used, p->len - used);
-		text_poke_early(p->instr, insn_buff, p->len);
+		text_poke_early(writable, insn_buff, p->len);
 	}
 }
 extern struct paravirt_patch_site __start_parainstructions[],
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 07/10] x86/unwind: Unwind orc at module writable address
  2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
                   ` (5 preceding siblings ...)
  2020-11-20 20:24 ` [PATCH RFC 06/10] x86/alternatives: Handle perm_allocs for modules Rick Edgecombe
@ 2020-11-20 20:24 ` Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 08/10] jump_label: Handle " Rick Edgecombe
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

Since modules can have a separate writable address during loading, sort
the module's ORC unwind tables at the writable address.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/unwind_orc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 73f800100066..41f9022a10cc 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -238,8 +238,8 @@ static int orc_sort_cmp(const void *_a, const void *_b)
 void unwind_module_init(struct module *mod, void *_orc_ip, size_t orc_ip_size,
 			void *_orc, size_t orc_size)
 {
-	int *orc_ip = _orc_ip;
-	struct orc_entry *orc = _orc;
+	int *orc_ip = module_adjust_writable_addr(_orc_ip);
+	struct orc_entry *orc = module_adjust_writable_addr(_orc);
 	unsigned int num_entries = orc_ip_size / sizeof(int);
 
 	WARN_ON_ONCE(orc_ip_size % sizeof(int) != 0 ||
@@ -257,8 +257,8 @@ void unwind_module_init(struct module *mod, void *_orc_ip, size_t orc_ip_size,
 	sort(orc_ip, num_entries, sizeof(int), orc_sort_cmp, orc_sort_swap);
 	mutex_unlock(&sort_mutex);
 
-	mod->arch.orc_unwind_ip = orc_ip;
-	mod->arch.orc_unwind = orc;
+	mod->arch.orc_unwind_ip = _orc_ip;
+	mod->arch.orc_unwind = _orc;
 	mod->arch.num_orcs = num_entries;
 }
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 08/10] jump_label: Handle module writable address
  2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
                   ` (6 preceding siblings ...)
  2020-11-20 20:24 ` [PATCH RFC 07/10] x86/unwind: Unwind orc at module writable address Rick Edgecombe
@ 2020-11-20 20:24 ` Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 09/10] ftrace: Use " Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 10/10] vmalloc: Add perm_alloc x86 implementation Rick Edgecombe
  9 siblings, 0 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

Since modules can have a separate writable address during loading,
apply the jump label NOPs at the writable address.

While the information about whether the operation is happening during a
module load is on hand, skip the full text_poke() when writing to an
address that is still writable.
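
The check this boils down to is roughly the following
(mod_text_is_staged() is a name made up for this sketch;
module_get_allocation() and perm_is_writable() come from earlier patches
in this series):

static bool mod_text_is_staged(struct module *mod, unsigned long addr)
{
	struct perm_allocation *alloc = module_get_allocation(mod, addr);

	/*
	 * Still writable means the mapping is not live yet, so the cheap
	 * text_poke_early() path (no breakpoints, no IPIs) is enough.
	 */
	return alloc && perm_is_writable(alloc);
}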

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kernel/jump_label.c | 18 ++++++++++++++++--
 kernel/jump_label.c          |  2 +-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index 5ba8477c2cb7..7a50148a63dd 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -63,6 +63,18 @@ static inline void __jump_label_transform(struct jump_entry *entry,
 					  int init)
 {
 	const void *opcode = __jump_label_set_jump_code(entry, type, init);
+	unsigned long addr = jump_entry_code(entry);
+	struct module *mod = __module_address(addr);
+	bool mod_writable = false;
+
+	if (mod) {
+		struct perm_allocation *alloc = module_get_allocation(mod, addr);
+
+		if (perm_is_writable(alloc)) {
+			addr = perm_writable_addr(alloc, addr);
+			mod_writable = true;
+		}
+	}
 
 	/*
 	 * As long as only a single processor is running and the code is still
@@ -74,9 +86,11 @@ static inline void __jump_label_transform(struct jump_entry *entry,
 	 * At the time the change is being done, just ignore whether we
 	 * are doing nop -> jump or jump -> nop transition, and assume
 	 * always nop being the 'currently valid' instruction
+	 *
+	 * If this is a module being loaded, text_poke_early can also be used.
 	 */
-	if (init || system_state == SYSTEM_BOOTING) {
-		text_poke_early((void *)jump_entry_code(entry), opcode,
+	if (init || system_state == SYSTEM_BOOTING || mod_writable) {
+		text_poke_early((void *)addr, opcode,
 				JUMP_LABEL_NOP_SIZE);
 		return;
 	}
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 015ef903ce8c..3919e78fce12 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -595,7 +595,7 @@ static void __jump_label_mod_update(struct static_key *key)
  */
 void jump_label_apply_nops(struct module *mod)
 {
-	struct jump_entry *iter_start = mod->jump_entries;
+	struct jump_entry *iter_start = module_adjust_writable_addr(mod->jump_entries);
 	struct jump_entry *iter_stop = iter_start + mod->num_jump_entries;
 	struct jump_entry *iter;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 09/10] ftrace: Use module writable address
  2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
                   ` (7 preceding siblings ...)
  2020-11-20 20:24 ` [PATCH RFC 08/10] jump_label: Handle " Rick Edgecombe
@ 2020-11-20 20:24 ` Rick Edgecombe
  2020-11-20 20:24 ` [PATCH RFC 10/10] vmalloc: Add perm_alloc x86 implementation Rick Edgecombe
  9 siblings, 0 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

Use the module writable address to accommodate architectures that have
a separate writable address for perm_alloc allocations.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 kernel/trace/ftrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8185f7240095..ea6377108bab 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -6134,7 +6134,7 @@ static int ftrace_process_locs(struct module *mod,
 	if (!count)
 		return 0;
 
-	sort(start, count, sizeof(*start),
+	sort(module_adjust_writable_addr(start), count, sizeof(*start),
 	     ftrace_cmp_ips, NULL);
 
 	start_pg = ftrace_allocate_pages(count);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 10/10] vmalloc: Add perm_alloc x86 implementation
  2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
                   ` (8 preceding siblings ...)
  2020-11-20 20:24 ` [PATCH RFC 09/10] ftrace: Use " Rick Edgecombe
@ 2020-11-20 20:24 ` Rick Edgecombe
  9 siblings, 0 replies; 21+ messages in thread
From: Rick Edgecombe @ 2020-11-20 20:24 UTC (permalink / raw)
  To: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams
  Cc: elena.reshetova, ira.weiny, Rick Edgecombe

Create a custom perm_alloc() implementation for x86. Allocate a separate
buffer to be returned by perm_writable_addr(), and text_poke() it into
the final allocation in perm_writable_finish(). Map the allocation at its
final permissions so that it does not need to be shot down as it
transitions from writable to read-only.

Since x86 direct map permissions are set via set_memory_() calls, whose
alias mappings can be scattered across the direct map, instead draw the
backing pages from 2MB blocks whose direct map entries have been
unmapped. Previously the direct map was marked RO for an RO vmalloc
alias; unmapping the direct map instead allows the pages to be remapped
later without a shootdown.

Register a shrinker callback to allow this pool of pages to be freed.
This should reduce large page breakage on the direct map compared to the
existing set_memory_() based solutions. Since the pages are left as 4k
not-present pages, they can be remapped in the shrinker callback without
any split or flush.

An algorithm for how many unmapped pages to keep in the cache is still a
todo. Currently it will grow until the shrinker triggers.

By grabbing whole 2MB blocks this may reduce the availability of high
page order blocks in some cases. However, it could just as easily improve
the availability of high order pages for cases where the vmalloc
allocations are long lived, such as modules, by grouping these long lived
allocations together. Caching pages does, however, reduce the pages
available to other parts of the kernel for use in other caches.
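
For illustration, an executable allocation on top of this implementation
is intended to flow roughly like the following (error handling omitted;
npages, image and image_size are placeholders for this sketch):

	/* Reserve VA in module space, backed by cached unmapped pages */
	struct perm_allocation *alloc = perm_alloc(MODULES_VADDR, MODULES_END,
						   npages, PERM_RX);
	unsigned long final = perm_alloc_address(alloc);

	/* Writes go to a temporary vmalloc() staging buffer, final map stays RX */
	memcpy((void *)perm_writable_addr(alloc, final), image, image_size);

	/* text_poke() the staged bytes into place */
	perm_writable_finish(alloc);

	/* Later: the pages return to the unmapped pool for reuse or shrinking */
	perm_free(alloc);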

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/Kconfig                  |   1 +
 arch/x86/include/asm/set_memory.h |   2 +
 arch/x86/mm/Makefile              |   1 +
 arch/x86/mm/pat/set_memory.c      |  13 +
 arch/x86/mm/vmalloc.c             | 438 ++++++++++++++++++++++++++++++
 5 files changed, 455 insertions(+)
 create mode 100644 arch/x86/mm/vmalloc.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..aa3c19fa23f0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -78,6 +78,7 @@ config X86
 	select ARCH_HAS_COPY_MC			if X86_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_SET_DIRECT_MAP
+	select ARCH_HAS_PERM_ALLOC_IMPLEMENTATION	if X86_64
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
 	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 5948218f35c5..0d3efed009bb 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -49,6 +49,8 @@ int set_memory_decrypted(unsigned long addr, int numpages);
 int set_memory_np_noalias(unsigned long addr, int numpages);
 int set_memory_nonglobal(unsigned long addr, int numpages);
 int set_memory_global(unsigned long addr, int numpages);
+int set_memory_noalias_noflush(unsigned long addr, int numpages,
+			       pgprot_t mask_set, pgprot_t mask_clr);
 
 int set_pages_array_uc(struct page **pages, int addrinarray);
 int set_pages_array_wc(struct page **pages, int addrinarray);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5864219221ca..fb7a566bfedc 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -35,6 +35,7 @@ obj-$(CONFIG_PTDUMP_CORE)	+= dump_pagetables.o
 obj-$(CONFIG_PTDUMP_DEBUGFS)	+= debug_pagetables.o
 
 obj-$(CONFIG_HIGHMEM)		+= highmem_32.o
+obj-$(CONFIG_ARCH_HAS_PERM_ALLOC_IMPLEMENTATION)	+= vmalloc.o
 
 KASAN_SANITIZE_kasan_init_$(BITS).o := n
 obj-$(CONFIG_KASAN)		+= kasan_init_$(BITS).o
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 40baa90e74f4..59f42de1e6ab 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1954,6 +1954,19 @@ int set_memory_np_noalias(unsigned long addr, int numpages)
 					cpa_flags, NULL);
 }
 
+int set_memory_noalias_noflush(unsigned long addr, int numpages,
+			       pgprot_t mask_set, pgprot_t mask_clr)
+{
+	struct cpa_data cpa = { .vaddr = &addr,
+				.pgd = NULL,
+				.numpages = numpages,
+				.mask_set = mask_set,
+				.mask_clr = mask_clr,
+				.flags = CPA_NO_CHECK_ALIAS};
+
+	return __change_page_attr_set_clr(&cpa, 0);
+}
+
 int set_memory_4k(unsigned long addr, int numpages)
 {
 	return change_page_attr_set_clr(&addr, numpages, __pgprot(0),
diff --git a/arch/x86/mm/vmalloc.c b/arch/x86/mm/vmalloc.c
new file mode 100644
index 000000000000..2bc6381cbd2d
--- /dev/null
+++ b/arch/x86/mm/vmalloc.c
@@ -0,0 +1,438 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
+#include <linux/shrinker.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <asm/set_memory.h>
+#include <asm/tlbflush.h>
+#include <asm/text-patching.h>
+#include <linux/memory.h>
+
+#define DIRECTMAP_PROTECTED	0
+#define DIRECTMAP_UNPROTECTED	1
+
+static LIST_HEAD(pages);
+static int page_cnt;
+static DEFINE_SPINLOCK(pages_lock);
+
+#define DEFINE_RANGE_CALC(x) struct range_calc x = {.min = ULONG_MAX, .max = 0}
+
+struct range_calc {
+	unsigned long min;
+	unsigned long max;
+};
+
+static inline void grow_range(struct range_calc *grower, unsigned long addr, unsigned long size)
+{
+	if (addr < grower->min)
+		grower->min = addr;
+	if (addr + size > grower->max)
+		grower->max = addr + size;
+}
+
+static inline void grow_range_page(struct range_calc *grower, struct page *page)
+{
+	unsigned long addr = (unsigned long)page_address(page);
+
+	if (!addr)
+		return;
+
+	grow_range(grower, addr, PAGE_SIZE);
+}
+
+static inline bool has_range(struct range_calc *grower)
+{
+	return grower->min != ULONG_MAX && grower->max != 0;
+}
+
+static unsigned long perm_shrink_count(struct shrinker *shrinker, struct shrink_control *sc)
+{
+	unsigned long ret;
+
+	spin_lock(&pages_lock);
+	ret = page_cnt;
+	spin_unlock(&pages_lock);
+
+	return ret;
+}
+
+static struct page *__get_perm_page(void)
+{
+	struct page *page;
+
+	spin_lock(&pages_lock);
+	page = list_first_entry_or_null(&pages, struct page, lru);
+	if (page) {
+		list_del(&page->lru);
+		page_cnt--;
+	}
+	spin_unlock(&pages_lock);
+
+	return page;
+}
+
+static unsigned long perm_shrink_scan(struct shrinker *shrinker, struct shrink_control *sc)
+{
+	DEFINE_RANGE_CALC(range);
+	static LIST_HEAD(to_free_list);
+	struct page *page, *tmp;
+	unsigned int i, cnt = 0;
+
+	for (i = 0; i < sc->nr_to_scan; i++) {
+		page = __get_perm_page();
+		if (!page)
+			continue;
+
+		grow_range_page(&range, page);
+		set_direct_map_default_noflush(page);
+		list_add(&page->lru, &to_free_list);
+		cnt++;
+	}
+
+	if (has_range(&range))
+		flush_tlb_kernel_range(range.min, range.max);
+
+	list_for_each_entry_safe(page, tmp, &to_free_list, lru)
+		__free_pages(page, 0);
+
+	return cnt;
+}
+
+static struct shrinker perm_shrinker = {
+	.count_objects = perm_shrink_count,
+	.scan_objects = perm_shrink_scan,
+	.seeks = DEFAULT_SEEKS
+};
+
+static bool replenish_pages_one(void)
+{
+	struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);
+
+	if (!page)
+		return false;
+
+	spin_lock(&pages_lock);
+	list_add(&page->lru, &pages);
+	page->private = DIRECTMAP_UNPROTECTED;
+	page_cnt++;
+	spin_unlock(&pages_lock);
+
+	return true;
+}
+
+static bool replenish_pages(void)
+{
+	struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 9); /* 2MB */
+	DEFINE_RANGE_CALC(range);
+	int convert_ret = 0;
+	int i;
+
+	if (!page)
+		return replenish_pages_one();
+
+	/* Split the 2MB block so its 4k pages can be listed and freed individually */
+	split_page(page, 9);
+
+	for (i = 0; i < 512; i++)
+		convert_ret |= set_direct_map_invalid_noflush(&page[i]);
+
+	if (convert_ret)
+		goto convert_fail;
+
+	spin_lock(&pages_lock);
+	for (i = 0; i < 512; i++) {
+		list_add(&page[i].lru, &pages);
+		page[i].private = DIRECTMAP_PROTECTED;
+		grow_range_page(&range, &page[i]);
+	}
+	page_cnt += 512;
+	spin_unlock(&pages_lock);
+
+	flush_tlb_kernel_range(range.min, range.max);
+
+	vm_unmap_aliases();
+
+	return true;
+
+convert_fail:
+	for (i = 0; i < 512; i++) {
+		set_direct_map_default_noflush(&page[i]);
+		__free_pages(&page[i], 0);
+	}
+
+	return false;
+}
+
+static struct page *get_perm_page(void)
+{
+	struct page *page;
+
+	spin_lock(&pages_lock);
+	if (!page_cnt) {
+		spin_unlock(&pages_lock);
+		if (!replenish_pages())
+			return NULL;
+		spin_lock(&pages_lock);
+	}
+
+	page = list_first_entry(&pages, struct page, lru);
+	page_cnt--;
+	list_del(&page->lru);
+	spin_unlock(&pages_lock);
+
+	return page;
+}
+
+static void __perm_free_page(struct page *page)
+{
+	spin_lock(&pages_lock);
+	list_add(&page->lru, &pages);
+	page_cnt++;
+	spin_unlock(&pages_lock);
+}
+
+static void __perm_free_pages(struct page **page, int count)
+{
+	int i;
+
+	spin_lock(&pages_lock);
+	for (i = 0; i < count; i++)
+		list_add(&page[i]->lru, &pages);
+	page_cnt += count;
+	spin_unlock(&pages_lock);
+}
+
+static inline pgprot_t perms_to_prot(virtual_perm perms)
+{
+	switch (perms) {
+	case PERM_RX:
+		return PAGE_KERNEL_ROX;
+	case PERM_RWX:
+		return PAGE_KERNEL_EXEC;
+	case PERM_RW:
+		return PAGE_KERNEL;
+	case PERM_R:
+		return PAGE_KERNEL_RO;
+	default:
+		return __pgprot(0);
+	}
+}
+
+static bool map_alloc(struct perm_allocation *alloc)
+{
+	return !map_kernel_range(perm_alloc_address(alloc), get_vm_area_size(alloc->area),
+				 perms_to_prot(alloc->cur_perm), alloc->pages);
+}
+
+struct perm_allocation *perm_alloc(unsigned long vstart, unsigned long vend, unsigned long page_cnt,
+				   virtual_perm perms)
+{
+	struct perm_allocation *alloc;
+	DEFINE_RANGE_CALC(range);
+	int i, j;
+
+	if (!page_cnt)
+		return NULL;
+
+	alloc = kmalloc(sizeof(*alloc), GFP_KERNEL | __GFP_ZERO);
+
+	if (!alloc)
+		return NULL;
+
+	alloc->area = __get_vm_area_caller(page_cnt << PAGE_SHIFT, VM_MAP, vstart, vend,
+					   __builtin_return_address(0));
+
+	if (!alloc->area)
+		goto free_alloc;
+
+	alloc->pages = kvmalloc(page_cnt * sizeof(struct page *), GFP_KERNEL);
+	if (!alloc->pages)
+		goto free_area;
+
+	alloc->size = (unsigned long)get_vm_area_size(alloc->area);
+	alloc->offset = 0;
+	alloc->writable = NULL;
+	alloc->cur_perm = perms;
+
+	/* TODO if this will be RW, we don't need unmapped pages, better to conserve those */
+	for (i = 0; i < page_cnt; i++) {
+		alloc->pages[i] = get_perm_page();
+		if (!alloc->pages[i])
+			break;
+		if (alloc->pages[i]->private == DIRECTMAP_PROTECTED)
+			continue;
+
+		grow_range_page(&range, alloc->pages[i]);
+		if (set_direct_map_invalid_noflush(alloc->pages[i]))
+			goto convert_fail;
+		alloc->pages[i]->private = DIRECTMAP_PROTECTED;
+	}
+
+	/*
+	 * Flush any pages that were removed in the loop above. In the event of no pages in the
+	 * cache, these may be scattered single pages, so flush here to only have a single
+	 * flush instead of one for each replenish_pages_one() call.
+	 */
+	if (has_range(&range)) {
+		flush_tlb_kernel_range(range.min, range.max);
+		vm_unmap_aliases();
+	}
+
+	if (i != page_cnt)
+		goto free_pages;
+
+	/* TODO: Need to zero these pages */
+	if (!map_alloc(alloc))
+		goto free_pages;
+
+	return alloc;
+
+free_pages:
+	__perm_free_pages(alloc->pages, i);
+	kvfree(alloc->pages);
+free_area:
+	remove_vm_area(alloc->area->addr);
+free_alloc:
+	kfree(alloc);
+
+	return NULL;
+
+convert_fail:
+	for (j = 0; j <= i; j++)
+		__perm_free_page(alloc->pages[j]);
+	kvfree(alloc->pages);
+	remove_vm_area(alloc->area->addr);
+	kfree(alloc);
+
+	return NULL;
+}
+
+unsigned long perm_writable_addr(struct perm_allocation *alloc, unsigned long addr)
+{
+	/* If this is already mapped and writable, just write to the actual kva */
+	if (alloc->cur_perm & PERM_W)
+		return addr;
+
+	/* TODO lock or callers need to synchronize? */
+	/* TODO could fail, set primary alias writable in this case or alloc up front */
+	if (!alloc->writable)
+		alloc->writable = vmalloc(alloc->size);
+
+	return (unsigned long)alloc->writable + (addr - perm_alloc_address(alloc));
+}
+
+bool perm_writable_finish(struct perm_allocation *alloc)
+{
+	unsigned long addr;
+	void *writable;
+	int i;
+
+	if (!alloc)
+		return false;
+	if (!alloc->writable)
+		return true;
+
+	addr = perm_alloc_address(alloc);
+
+	writable = (void *)perm_writable_base(alloc);
+
+	mutex_lock(&text_mutex);
+	for (i = 0; i < perm_alloc_size(alloc); i += PAGE_SIZE)
+		text_poke((void *)(addr + i), writable + i, PAGE_SIZE);
+	mutex_unlock(&text_mutex);
+
+	vfree(alloc->writable);
+
+	alloc->writable = NULL;
+
+	return true;
+}
+
+static inline pgprot_t get_set(virtual_perm perms)
+{
+	pteval_t ret = 0;
+
+	if (perms)
+		ret = _PAGE_PRESENT;
+
+	if (perms & PERM_W)
+		ret |= _PAGE_RW;
+
+	if (~perms & PERM_X)
+		ret |= _PAGE_NX;
+
+	return __pgprot(ret);
+}
+
+static inline pgprot_t get_unset(virtual_perm perms)
+{
+	pteval_t ret = 0;
+
+	if (!perms)
+		ret = _PAGE_PRESENT;
+
+	if (~perms & PERM_W)
+		ret |= _PAGE_RW;
+
+	if (perms & PERM_X)
+		ret |= _PAGE_NX;
+
+	return __pgprot(ret);
+}
+
+bool perm_change(struct perm_allocation *alloc, virtual_perm perms)
+{
+	pgprot_t set = get_set(perms);
+	pgprot_t clr = get_unset(perms);
+	unsigned long start;
+
+	if (!alloc)
+		return false;
+
+	if (!(alloc->cur_perm ^ perms))
+		return true;
+
+	start = perm_alloc_address(alloc);
+
+	set_memory_noalias_noflush(start, perm_alloc_size(alloc) >> PAGE_SHIFT, set, clr);
+
+	flush_tlb_kernel_range(start, start + perm_alloc_size(alloc));
+
+	return true;
+}
+
+static inline bool perms_need_flush(struct perm_allocation *alloc)
+{
+	return alloc->cur_perm & PERM_X;
+}
+
+void perm_free(struct perm_allocation *alloc)
+{
+	unsigned long size, addr;
+	int page_cnt;
+
+	if (!alloc)
+		return;
+
+	if (alloc->writable)
+		vfree(alloc->writable);
+
+	size = get_vm_area_size(alloc->area);
+	addr = perm_alloc_address(alloc);
+	page_cnt = size >> PAGE_SHIFT;
+
+	vunmap((void *)addr);
+
+	if (perms_need_flush(alloc))
+		flush_tlb_kernel_range(addr, addr + size);
+
+	__perm_free_pages(alloc->pages, page_cnt);
+
+	kvfree(alloc->pages);
+	kfree(alloc);
+}
+
+void perm_memset(struct perm_allocation *alloc, char val)
+{
+	if (!alloc)
+		return;
+
+	memset((void *)perm_writable_base(alloc), val, perm_alloc_size(alloc));
+}
+
+static int __init perm_alloc_init(void)
+{
+	return register_shrinker(&perm_shrinker);
+}
+device_initcall(perm_alloc_init);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-11-20 20:24 ` [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation Rick Edgecombe
@ 2020-11-22  4:10   ` Andy Lutomirski
  2020-11-23  0:01     ` Edgecombe, Rick P
  2020-11-23  9:00   ` Christoph Hellwig
  2020-12-04 23:24   ` Sean Christopherson
  2 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2020-11-22  4:10 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: Andrew Morton, Jessica Yu, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrew Lutomirski, Dave Hansen, Peter Zijlstra,
	X86 ML, Mike Rapoport, Linux-MM, LKML, Dan Williams,
	Elena Reshetova, Weiny Ira

On Fri, Nov 20, 2020 at 12:30 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> In order to allow for future arch specific optimizations for vmalloc
> permissions, first add an implementation of a new interface that will
> work cross arch by using the existing set_memory_() functions.
>
> When allocating some memory that will be RO, for example it should be used
> like:
>
> /* Reserve va */
> struct perm_allocation *alloc = perm_alloc(vstart, vend, page_cnt, PERM_R);

I'm sure I could reverse-engineer this from the code, but:

Where do vstart and vend come from?  Does perm_alloc() allocate memory
or just virtual addresses?  Is the caller expected to call vmalloc()?
How does one free this thing?

> unsigned long ro = (unsigned long)perm_alloc_address(alloc);
>
> /* Write to writable address */
> strcpy((char *)perm_writable_addr(alloc, ro), "Some data to be RO");
> /* Signal that writing is done and mapping should be live */
> perm_writable_finish(alloc);
> /* Print from RO address */
> printk("Read only data is: %s\n", (char *)ro);
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-11-22  4:10   ` Andy Lutomirski
@ 2020-11-23  0:01     ` Edgecombe, Rick P
  2020-11-24 10:16       ` Christoph Hellwig
  0 siblings, 1 reply; 21+ messages in thread
From: Edgecombe, Rick P @ 2020-11-23  0:01 UTC (permalink / raw)
  To: luto
  Cc: linux-kernel, daniel, peterz, jeyu, bpf, rppt, ast, linux-mm,
	dave.hansen, Weiny, Ira, x86, akpm, Reshetova, Elena, Williams,
	Dan J

On Sat, 2020-11-21 at 20:10 -0800, Andy Lutomirski wrote:
> On Fri, Nov 20, 2020 at 12:30 PM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> > In order to allow for future arch specific optimizations for
> > vmalloc
> > permissions, first add an implementation of a new interface that
> > will
> > work cross arch by using the existing set_memory_() functions.
> > 
> > When allocating some memory that will be RO, for example it should
> > be used
> > like:
> > 
> > /* Reserve va */
> > struct perm_allocation *alloc = perm_alloc(vstart, vend, page_cnt,
> > PERM_R);
> 
> I'm sure I could reverse-engineer this from the code, but:
> 
> Where do vstart and vend come from?

They are the start and end virtual address range to try to allocate in,
like __vmalloc_node_range() has. So like, MODULES_VADDR and
MODULES_END. Sorry for the terse example. The header in this patch has
some comments about each of the new functions to supplement it a bit.

> Does perm_alloc() allocate memory or just virtual addresses? Is the
> caller expected to call vmalloc()?

The caller does not need to call vmalloc(). perm_alloc() behaves
similar to __vmalloc_node_range(), where it allocates both memory and
virtual addresses. I left a little wiggle room in the descriptions in
the header, that the virtual address range doesn't actually need to be
mapped until after perm_writable_finish(). But both of the
implementations in this series will map it right away like a vmalloc().

So the interface could actually pretty easily be changed to look like
another flavor of vmalloc() that just returns a pointer to the
allocation.
The reason why it returns this new struct instead is that, unlike most
vmalloc()'s, the callers will be looking up metadata about the
allocation a bunch of times (the writable address). Having this
metadata stored in some struct inside vmalloc would mean
perm_writable_addr() would have to do something like find_vmap_area()
every time in order to find the writable address from the allocation's
address passed in. So returning a struct lets the writable translation
skip a global lock and lookup.
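
To make that concrete, here is a rough sketch of what a lookup-based
perm_writable_addr() would end up doing if the metadata were hidden
behind the vmalloc interfaces; find_vm_area() is the existing lookup,
and writable_base is a hypothetical field, used here for illustration
only:

unsigned long lookup_based_writable_addr(unsigned long addr)
{
	struct vm_struct *area;

	/* Global lookup (rbtree walk under a spinlock) on every call */
	area = find_vm_area((void *)addr);
	if (!area)
		return 0;

	/* Hypothetical field holding the base of the writable alias */
	return (unsigned long)area->writable_base +
	       (addr - (unsigned long)area->addr);
}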

Another option could be putting the new metadata in vm_struct and just
return that, like get_vm_area(). Then we don't need to invent a new
struct. But then normal vmalloc()'s would have a bit of wasted memory
since they don't need this metadata.

A nice thing about that, though, is that there would be a central place to
translate to the writable addresses in cases where only a virtual
address is available. In later patches here, a similar lookup happens
anyway for modules using __module_address() to get the writable
address. This is due to some existing code where plumbing the new
struct all the way through would have resulted in too many changes.

I'm not sure which is best.

> How does one free this thing?

void perm_free(struct perm_allocation *alloc);


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-11-20 20:24 ` [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation Rick Edgecombe
  2020-11-22  4:10   ` Andy Lutomirski
@ 2020-11-23  9:00   ` Christoph Hellwig
  2020-11-23 20:44     ` Edgecombe, Rick P
  2020-12-04 23:24   ` Sean Christopherson
  2 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2020-11-23  9:00 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams, elena.reshetova,
	ira.weiny

First thanks for doing this, having a vmalloc variant that starts out
with proper permissions has been on my todo list for a while.


> +#define PERM_R	1
> +#define PERM_W	2
> +#define PERM_X	4
> +#define PERM_RWX	(PERM_R | PERM_W | PERM_X)
> +#define PERM_RW		(PERM_R | PERM_W)
> +#define PERM_RX		(PERM_R | PERM_X)

Why can't this use the normal pgprot flags?

> +typedef u8 virtual_perm;

This would need __bitwise annotations to allow sparse to typecheck the
flags.

> +/*
> + * Allocate a special permission kva region. The region may not be mapped
> + * until a call to perm_writable_finish(). A writable region will be mapped
> + * immediately at the address returned by perm_writable_addr(). The allocation
> + * will be made between the start and end virtual addresses.
> + */
> +struct perm_allocation *perm_alloc(unsigned long vstart, unsigned long vend, unsigned long page_cnt,
> +				   virtual_perm perms);

Please avoid totally pointless overly long line (all over the series)

Also I find the unsigned long for kernel virtual address interface
strange, but I'll take a look at the callers later.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-11-23  9:00   ` Christoph Hellwig
@ 2020-11-23 20:44     ` Edgecombe, Rick P
  2020-11-24 10:19       ` hch
  0 siblings, 1 reply; 21+ messages in thread
From: Edgecombe, Rick P @ 2020-11-23 20:44 UTC (permalink / raw)
  To: hch
  Cc: linux-kernel, daniel, peterz, jeyu, bpf, rppt, ast, linux-mm,
	dave.hansen, Weiny, Ira, x86, akpm, Reshetova, Elena, Williams,
	Dan J, luto

On Mon, 2020-11-23 at 09:00 +0000, Christoph Hellwig wrote:
> First thanks for doing this, having a vmalloc variant that starts out
> with proper permissions has been on my todo list for a while.
> 
> > +#define PERM_R	1
> > +#define PERM_W	2
> > +#define PERM_X	4
> > +#define PERM_RWX	(PERM_R | PERM_W | PERM_X)
> > +#define PERM_RW		(PERM_R | PERM_W)
> > +#define PERM_RX		(PERM_R | PERM_X)
> 
> Why can't this use the normal pgprot flags?
> 

Well, there were two reasons:
1. Non-standard naming for the PAGE_FOO flags. For example,
PAGE_KERNEL_ROX vs PAGE_KERNEL_READ_EXEC. This could be unified. I
think it's just riscv that breaks the conventions. Others are just
missing some.

2. The need to translate between the flags and set_memory_foo() calls.
For example if a permission is RW and the caller is asking to change it
to RWX. Some architectures have an X permission and others an NX
permission, and it's the same with read only vs writable. So these
flags are trying to be more analogous of the cross-arch set_memory_()
function names rather than pgprot flags.

I guess you could do something like (pgprot_val(PAGE_KERNEL_EXEC) &
~pgprot_val(PAGE_KERNEL)) and assume that if any bits are set it is a
positive permission, and from that deduce whether to call
set_memory_nx() or set_memory_x().
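
Roughly, such a deduction could look like the sketch below (x86-flavored
and purely illustrative, assuming _PAGE_NX is the only executability bit
that matters here; the helper name is made up):

static int exec_perm_from_pgprot(unsigned long addr, int numpages,
				 pgprot_t prot)
{
	/* Bits that differ from the baseline kernel mapping */
	pteval_t diff = pgprot_val(prot) ^ pgprot_val(PAGE_KERNEL);

	if (!(diff & _PAGE_NX))
		return 0;	/* executability is unchanged */

	if (pgprot_val(prot) & _PAGE_NX)
		return set_memory_nx(addr, numpages);

	return set_memory_x(addr, numpages);
}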

But I thought that using those pgprot flags was still sort of overloading
the meaning of pgprot. My understanding was that it is supposed to hold
the actual bits set in the PTE. For example large pages or TLB hints
(like PAGE_KERNEL_EXEC_CONT) could set or unset extra bits, so asking
for PAGE_KERNEL_EXEC wouldn't necessarily mean "set these bits in all
of the PTEs", it could mean something more like "infer what I want from
these bits and do that".

x86's cpa will also avoid changing NX if it is not supported, so if the
caller asked for PAGE_KERNEL->PAGE_KERNEL_EXEC in perm_change() it
should not necessarily bother setting all of the PAGE_KERNEL_EXEC bits
in the actual PTEs. Asking for PERM_RW->PERM_RWX on the other hand,
would let the implementation do whatever it needs to set the memory
executable, like set_memory_x() does. It should work either way but
seems like the expectations would be a little clearer with the PERM_
flags.

On the other hand, creating a whole new set of flags is not ideal
either. But that was just my reasoning. Does it seem worth it?

> > +typedef u8 virtual_perm;
> 
> This would need __bitwise annotations to allow sparse to typecheck
> the
> flags.
> 

Ok, thanks.

> > +/*
> > + * Allocate a special permission kva region. The region may not be
> > mapped
> > + * until a call to perm_writable_finish(). A writable region will
> > be mapped
> > + * immediately at the address returned by perm_writable_addr().
> > The allocation
> > + * will be made between the start and end virtual addresses.
> > + */
> > +struct perm_allocation *perm_alloc(unsigned long vstart, unsigned
> > long vend, unsigned long page_cnt,
> > +				   virtual_perm perms);
> 
> Please avoid totally pointless overly long line (all over the series)

Could easily wrap this one, but just to clarify, do you mean lines over
80 chars? There were already some over 80 in vmalloc before the move to
100 chars, so figured it was ok to stretch out now.

> Also I find the unsigned long for kernel virtual address interface
> strange, but I'll take a look at the callers later.

Yea, some of the callers need to cast either way. I think I changed it
to unsigned long, because casting (void *) was smaller in the code than
(unsigned long) and it shortened some line lengths.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-11-23  0:01     ` Edgecombe, Rick P
@ 2020-11-24 10:16       ` Christoph Hellwig
  2020-11-24 20:00         ` Edgecombe, Rick P
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2020-11-24 10:16 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: luto, linux-kernel, daniel, peterz, jeyu, bpf, rppt, ast,
	linux-mm, dave.hansen, Weiny, Ira, x86, akpm, Reshetova, Elena,
	Williams, Dan J

On Mon, Nov 23, 2020 at 12:01:35AM +0000, Edgecombe, Rick P wrote:
> Another option could be putting the new metadata in vm_struct and just
> return that, like get_vm_area(). Then we don't need to invent a new
> struct. But then normal vmalloc()'s would have a bit of wasted memory
> since they don't need this metadata.

That would seem most natural to me.  We'll need to figure out how we
can do that without bloating vm_struct too much.  One option would
be a bigger structure that embeds vm_struct and can be retrieved using
container_of().
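
For illustration, the embedding could look something like the sketch
below (names made up here; cur_perm/writable mirror the fields from the
RFC's perm_allocation):

struct perm_vm_struct {
	struct vm_struct vm;	/* plain vmalloc users never allocate this */
	virtual_perm cur_perm;
	virtual_perm orig_perm;
	void *writable;
};

/* Only valid for areas that were set up through perm_alloc() */
static inline struct perm_vm_struct *to_perm_vm(struct vm_struct *vm)
{
	return container_of(vm, struct perm_vm_struct, vm);
}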

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-11-23 20:44     ` Edgecombe, Rick P
@ 2020-11-24 10:19       ` hch
  2020-11-24 19:59         ` Edgecombe, Rick P
  0 siblings, 1 reply; 21+ messages in thread
From: hch @ 2020-11-24 10:19 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hch, linux-kernel, daniel, peterz, jeyu, bpf, rppt, ast,
	linux-mm, dave.hansen, Weiny, Ira, x86, akpm, Reshetova, Elena,
	Williams, Dan J, luto

On Mon, Nov 23, 2020 at 08:44:12PM +0000, Edgecombe, Rick P wrote:
> Well, there were two reasons:
> 1. Non-standard naming for the PAGE_FOO flags. For example,
> PAGE_KERNEL_ROX vs PAGE_KERNEL_READ_EXEC. This could be unified. I
> think it's just riscv that breaks the conventions. Others are just
> missing some.

We need to standardize those anyway.  I've done that for a few
PAGE_* constants already but as you see there is more work to do.

> But I thought that using those pgprot flags was still sort overloading
> the meaning of pgprot. My understanding was that it is supposed to hold
> the actual bits set in the PTE. For example large pages or TLB hints
> (like PAGE_KERNEL_EXEC_CONT) could set or unset extra bits, so asking
> for PAGE_KERNEL_EXEC wouldn't necessarily mean "set these bits in all
> of the PTEs", it could mean something more like "infer what I want from
> these bits and do that".
> 
> x86's cpa will also avoid changing NX if it is not supported, so if the
> caller asked for PAGE_KERNEL->PAGE_KERNEL_EXEC in perm_change() it
> should not necessarily bother setting all of the PAGE_KERNEL_EXEC bits
> in the actual PTEs. Asking for PERM_RW->PERM_RWX on the other hand,
> would let the implementation do whatever it needs to set the memory
> executable, like set_memory_x() does. It should work either way but
> seems like the expectations would be a little clearer with the PERM_
> flags.

Ok, maybe that is an argument, and we should use the new flags more
broadly.

> Could easily wrap this one, but just to clarify, do you mean lines over
> 80 chars? There were already some over 80 in vmalloc before the move to
> 100 chars, so figured it was ok to stretch out now.

CodingStyle still says 80 characters unless you have an exception where
a longer line improves the readability.  The quoted code absolutely
does not fit the definition of an exception or improve readability.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-11-24 10:19       ` hch
@ 2020-11-24 19:59         ` Edgecombe, Rick P
  0 siblings, 0 replies; 21+ messages in thread
From: Edgecombe, Rick P @ 2020-11-24 19:59 UTC (permalink / raw)
  To: hch
  Cc: linux-kernel, daniel, peterz, jeyu, bpf, rppt, linux-mm,
	dave.hansen, ast, Weiny, Ira, x86, akpm, Williams, Dan J,
	Reshetova, Elena, luto

On Tue, 2020-11-24 at 10:19 +0000, hch@infradead.org wrote:
> But I thought that using those pgprot flags was still sort
> overloading
> > the meaning of pgprot. My understanding was that it is supposed to
> > hold
> > the actual bits set in the PTE. For example large pages or TLB
> > hints
> > (like PAGE_KERNEL_EXEC_CONT) could set or unset extra bits, so
> > asking
> > for PAGE_KERNEL_EXEC wouldn't necessarily mean "set these bits in
> > all
> > of the PTEs", it could mean something more like "infer what I want
> > from
> > these bits and do that".
> > 
> > x86's cpa will also avoid changing NX if it is not supported, so if
> > the
> > caller asked for PAGE_KERNEL->PAGE_KERNEL_EXEC in perm_change() it
> > should not necessarily bother setting all of the PAGE_KERNEL_EXEC
> > bits
> > in the actual PTEs. Asking for PERM_RW->PERM_RWX on the other hand,
> > would let the implementation do whatever it needs to set the memory
> > executable, like set_memory_x() does. It should work either way but
> > seems like the expectations would be a little clearer with the
> > PERM_
> > flags.
> 
> Ok, maybe that is an argument, and we should use the new flags more
> broadly.

They might make sense to live in set_memory.h then. Separately from
this patchset, a call like set_memory(addr, numpages, PERM_R) could be
more efficient than two calls to set_memory_ro() and set_memory_nx().
Not that it happens very much outside of vmalloc usages, but just
trying to think where else it could be used.
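
As a naive sketch of the shape (hypothetical helper; this version still
makes two passes underneath, whereas the point of a real combined call
would be to fold both changes into a single cpa pass and one TLB flush):

static int set_memory(unsigned long addr, int numpages, virtual_perm perm)
{
	int ret;

	ret = (perm & PERM_W) ? set_memory_rw(addr, numpages)
			      : set_memory_ro(addr, numpages);
	if (ret)
		return ret;

	return (perm & PERM_X) ? set_memory_x(addr, numpages)
			       : set_memory_nx(addr, numpages);
}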

> > Could easily wrap this one, but just to clarify, do you mean lines
> > over
> > 80 chars? There were already some over 80 in vmalloc before the
> > move to
> > 100 chars, so figured it was ok to stretch out now.
> 
> CodingStyle still says 80 characters unless you have an exception
> where
> a longer line improves the readability.  The quoted code absolutely
> does not fit the definition of an exception or improves readability.

Fair enough.

And to the other comment in your first mail, glad to do this and
finally send it out. This series has been sitting in a local branch for
most of the year while stuff kept interrupting it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-11-24 10:16       ` Christoph Hellwig
@ 2020-11-24 20:00         ` Edgecombe, Rick P
  0 siblings, 0 replies; 21+ messages in thread
From: Edgecombe, Rick P @ 2020-11-24 20:00 UTC (permalink / raw)
  To: hch
  Cc: linux-kernel, daniel, peterz, jeyu, bpf, rppt, ast, linux-mm,
	dave.hansen, Weiny, Ira, x86, akpm, Reshetova, Elena, Williams,
	Dan J, luto

On Tue, 2020-11-24 at 10:16 +0000, Christoph Hellwig wrote:
> On Mon, Nov 23, 2020 at 12:01:35AM +0000, Edgecombe, Rick P wrote:
> > Another option could be putting the new metadata in vm_struct and
> > just
> > return that, like get_vm_area(). Then we don't need to invent a new
> > struct. But then normal vmalloc()'s would have a bit of wasted
> > memory
> > since they don't need this metadata.
> 
> That would seem most natural to me.  We'll need to figure out how we
> can do that without bloating vm_struct too much.  One option would
> be a bigger structure that embedds vm_struct and can be retreived
> using
> container_of().

Hmm, neat. I can change this in the next version.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-11-20 20:24 ` [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation Rick Edgecombe
  2020-11-22  4:10   ` Andy Lutomirski
  2020-11-23  9:00   ` Christoph Hellwig
@ 2020-12-04 23:24   ` Sean Christopherson
  2020-12-07 23:55     ` Edgecombe, Rick P
  2 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2020-12-04 23:24 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: akpm, jeyu, bpf, ast, daniel, luto, dave.hansen, peterz, x86,
	rppt, linux-mm, linux-kernel, dan.j.williams, elena.reshetova,
	ira.weiny

On Fri, Nov 20, 2020, Rick Edgecombe wrote:
> +struct perm_allocation {
> +	struct page **pages;
> +	virtual_perm cur_perm;
> +	virtual_perm orig_perm;
> +	struct vm_struct *area;
> +	unsigned long offset;
> +	unsigned long size;
> +	void *writable;
> +};
> +
> +/*
> + * Allocate a special permission kva region. The region may not be mapped
> + * until a call to perm_writable_finish(). A writable region will be mapped
> + * immediately at the address returned by perm_writable_addr(). The allocation
> + * will be made between the start and end virtual addresses.
> + */
> +struct perm_allocation *perm_alloc(unsigned long vstart, unsigned long vend, unsigned long page_cnt,
> +				   virtual_perm perms);

IMO, 'perm' as the root namespace is too generic, and perm_ is already very
prevalent throughout the kernel.  E.g. it's not obvious when looking at the
callers that perm_alloc() is the first step in setting up an alternate kernel
VA->PA mapping.

I don't have a suggestion for a more intuitive name, but in the absence of a
perfect name, I'd vote for an acronym that is easy to grep.  Something like
pvmap?  That isn't currently used in the kernel, though I can't help but read it
as "paravirt map"...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation
  2020-12-04 23:24   ` Sean Christopherson
@ 2020-12-07 23:55     ` Edgecombe, Rick P
  0 siblings, 0 replies; 21+ messages in thread
From: Edgecombe, Rick P @ 2020-12-07 23:55 UTC (permalink / raw)
  To: seanjc
  Cc: linux-kernel, daniel, peterz, jeyu, bpf, rppt, ast, linux-mm,
	dave.hansen, Weiny, Ira, x86, akpm, Reshetova, Elena, Williams,
	Dan J, luto

On Fri, 2020-12-04 at 15:24 -0800, Sean Christopherson wrote:
> On Fri, Nov 20, 2020, Rick Edgecombe wrote:
> > +struct perm_allocation {
> > +	struct page **pages;
> > +	virtual_perm cur_perm;
> > +	virtual_perm orig_perm;
> > +	struct vm_struct *area;
> > +	unsigned long offset;
> > +	unsigned long size;
> > +	void *writable;
> > +};
> > +
> > +/*
> > + * Allocate a special permission kva region. The region may not be
> > mapped
> > + * until a call to perm_writable_finish(). A writable region will
> > be mapped
> > + * immediately at the address returned by perm_writable_addr().
> > The allocation
> > + * will be made between the start and end virtual addresses.
> > + */
> > +struct perm_allocation *perm_alloc(unsigned long vstart, unsigned
> > long vend, unsigned long page_cnt,
> > +				   virtual_perm perms);
> 
> IMO, 'perm' as the root namespace is too generic, and perm_ is
> already very
> prevelant throughout the kernel.  E.g. it's not obvious when looking
> at the
> callers that perm_alloc() is the first step in setting up an
> alternate kernel
> VA->PA mapping.
> 
> I don't have a suggestion for a more intuitive name, but in the
> absence of a
> perfect name, I'd vote for an acronym that is easy to
> grep.  Something like
> pvmap?  That isn't currently used in the kernel, though I can't help
> but read it
> as "paravirt map"...

Good point, thanks.

After Christoph's comment about returning a vm_struct pointer, I was
going to try to pick some more vmalloc-like names, like vmalloc_perm(),
vmalloc_writable_finish(), etc. Still have to play around with it some
more.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-12-07 23:55 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-20 20:24 [PATCH RFC 00/10] New permission vmalloc interface Rick Edgecombe
2020-11-20 20:24 ` [PATCH RFC 01/10] vmalloc: Add basic perm alloc implementation Rick Edgecombe
2020-11-22  4:10   ` Andy Lutomirski
2020-11-23  0:01     ` Edgecombe, Rick P
2020-11-24 10:16       ` Christoph Hellwig
2020-11-24 20:00         ` Edgecombe, Rick P
2020-11-23  9:00   ` Christoph Hellwig
2020-11-23 20:44     ` Edgecombe, Rick P
2020-11-24 10:19       ` hch
2020-11-24 19:59         ` Edgecombe, Rick P
2020-12-04 23:24   ` Sean Christopherson
2020-12-07 23:55     ` Edgecombe, Rick P
2020-11-20 20:24 ` [PATCH RFC 02/10] bpf: Use perm_alloc() for BPF JIT filters Rick Edgecombe
2020-11-20 20:24 ` [PATCH RFC 03/10] module: Use perm_alloc() for modules Rick Edgecombe
2020-11-20 20:24 ` [PATCH RFC 04/10] module: Support separate writable allocation Rick Edgecombe
2020-11-20 20:24 ` [PATCH RFC 05/10] x86/modules: Use real perm_allocations Rick Edgecombe
2020-11-20 20:24 ` [PATCH RFC 06/10] x86/alternatives: Handle perm_allocs for modules Rick Edgecombe
2020-11-20 20:24 ` [PATCH RFC 07/10] x86/unwind: Unwind orc at module writable address Rick Edgecombe
2020-11-20 20:24 ` [PATCH RFC 08/10] jump_label: Handle " Rick Edgecombe
2020-11-20 20:24 ` [PATCH RFC 09/10] ftrace: Use " Rick Edgecombe
2020-11-20 20:24 ` [PATCH RFC 10/10] vmalloc: Add perm_alloc x86 implementation Rick Edgecombe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).