linux-efi.vger.kernel.org archive mirror
* [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration
@ 2024-02-20 20:32 Maxwell Bland
  2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-20 20:32 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas,
	christophe.leroy, cl, daniel, dave.hansen, david, dennis,
	dvyukov, glider, gor, guoren, haoluo, hca, hch, john.fastabend,
	jolsa, kasan-dev, kpsingh, linux-arch, linux, linux-efi,
	linux-kernel, linux-mm, linuxppc-dev, linux-riscv, linux-s390,
	lstoakes, mark.rutland, martin.lau, meted, michael.christie,
	mjguzik, mpe, mst, muchun.song, naveen.n.rao, npiggin, palmer,
	paul.walmsley, quic_nprakash, quic_pkondeti, rick.p.edgecombe,
	ryabinin.a.a, ryan.roberts, samitolvanen, sdf, song, surenb,
	svens, tj, urezki, vincenzo.frascino, will, wuqiang.matt,
	yonghong.song, zlim.lnx, mbland, awheeler

Reworks arm64's virtual memory allocation infrastructure to support
dynamic enforcement of page middle directory (PMD) PXNTable
restrictions rather than enforcing them only during the initial memory
mapping. Runtime enforcement of this bit prevents write-then-execute
attacks, where malicious code is staged in vmalloc'd data regions and
the page table is later changed to make this code executable.

Previously the entire region from VMALLOC_START to VMALLOC_END was
vulnerable, but now the vulnerable region is restricted to the 2GB
reserved by module_alloc, a region which is generally read-only and more
difficult to inject staging code into, e.g., data must pass the BPF
verifier. These changes also set the stage for other systems, such as
KVM-level (EL2) changes to mark page tables immutable and code page
verification changes, forging a path toward complete mitigation of
kernel exploits on ARM.

Implementing this required minimal changes to the generic vmalloc
interface in the kernel to allow architecture overrides of some vmalloc
wrapper functions, refactoring vmalloc calls to use a standard interface
in the generic kernel, and passing the address parameter already passed
into PTE allocation to the pte_allocate child function call.
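
To give a feel for the generic-side change, __vmalloc_node now takes a
vm_flags argument:

	void *__vmalloc_node(unsigned long size, unsigned long align,
			     gfp_t gfp_mask, unsigned long vm_flags, int node,
			     const void *caller);

and callers that open-coded the default range are converted, e.g. in
kernel/fork.c (see patch 1):

	/* before */
	stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
				     VMALLOC_START, VMALLOC_END,
				     THREADINFO_GFP & ~__GFP_ACCOUNT,
				     PAGE_KERNEL,
				     0, node, __builtin_return_address(0));
	/* after */
	stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
			       THREADINFO_GFP & ~__GFP_ACCOUNT,
			       0, node, __builtin_return_address(0));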

The new arm64 vmalloc wrapper functions ensure vmalloc data is not
allocated into the region reserved for module_alloc. arm64's BPF and
kprobe code also see a two-line change ensuring their allocations abide
by the segmentation of code from data. Finally, arm64's
pmd_populate_kernel function is modified to set the PXNTable bit
appropriately.
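
The crux of that enforcement (patch 4) is small; roughly:

	pmdval_t pmd = PMD_TYPE_TABLE | PMD_TABLE_UXN;

	if (IS_DATA_VMALLOC_ADDR(address) &&
	    IS_DATA_VMALLOC_ADDR(address + PMD_SIZE))
		pmd |= PMD_TABLE_PXN;
	__pmd_populate(pmdp, __pa(ptep), pmd);

where IS_DATA_VMALLOC_ADDR() checks that the range lies within vmalloc
space but outside the reserved code region.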

Signed-off-by: Maxwell Bland <mbland@motorola.com>

---

After Mark Rutland's feedback last week on my more minimal patch, see

<CAP5Mv+ydhk=Ob4b40ZahGMgT-5+-VEHxtmA=-LkJiEOOU+K6hw@mail.gmail.com>

I adopted a more sweeping and more correct overhaul of arm64's virtual
memory allocation infrastructure to support these changes. This series
guarantees our ability to build future systems with a strong and
accessible distinction between code and data at the page allocation
layer, bolstering the guarantees of complementary contributions, e.g.
W^X and kCFI.

The current patch only minimally reduces available vmalloc space,
removing the 2GB that should be reserved for code allocations
regardless, and I feel it really benefits the kernel by making several
memory allocation interfaces more uniform and by providing hooks for
non-ARM architectures to follow suit.

I have done some minimal runtime testing using Torvalds' test-tlb
script on a QEMU VM, but maybe more extensive benchmarking is needed?

Size: Before Patch -> After Patch
4k: 4.09ns  4.15ns  4.41ns  4.43ns -> 3.68ns  3.73ns  3.67ns  3.73ns 
8k: 4.22ns  4.19ns  4.30ns  4.15ns -> 3.99ns  3.89ns  4.12ns  4.04ns 
16k: 3.97ns  4.31ns  4.30ns  4.28ns -> 4.03ns  3.98ns  4.06ns  4.06ns 
32k: 3.82ns  4.51ns  4.25ns  4.31ns -> 3.99ns  4.09ns  4.07ns  5.17ns 
64k: 4.50ns  5.59ns  6.13ns  6.14ns -> 4.23ns  4.26ns  5.91ns  5.93ns 
128k: 5.06ns  4.47ns  6.75ns  6.69ns -> 4.47ns  4.71ns  6.54ns  6.44ns 
256k: 4.83ns  4.43ns  6.62ns  6.21ns -> 4.39ns  4.62ns  6.71ns  6.65ns 
512k: 4.45ns  4.75ns  6.19ns  6.65ns -> 4.86ns  5.26ns  7.77ns  6.68ns 
1M: 4.72ns  4.73ns  6.74ns  6.47ns -> 4.29ns  4.45ns  6.87ns  6.59ns 
2M: 4.66ns  4.86ns  14.49ns  15.00ns -> 4.53ns  4.57ns  15.91ns  15.90ns 
4M: 4.85ns  4.95ns  15.90ns  15.98ns -> 4.48ns  4.74ns  17.27ns  17.36ns 
6M: 4.94ns  5.03ns  17.19ns  17.31ns -> 4.70ns  4.93ns  18.02ns  18.23ns 
8M: 5.05ns  5.18ns  17.49ns  17.64ns -> 4.96ns  5.07ns  18.84ns  18.72ns 
16M: 5.55ns  5.79ns  20.99ns  23.70ns -> 5.46ns  5.72ns  22.76ns  26.51ns
32M: 8.54ns  9.06ns  124.61ns 125.07ns -> 8.43ns  8.59ns  116.83ns 138.83ns
64M: 8.42ns  8.63ns  196.17ns 204.52ns -> 8.26ns  8.43ns  193.49ns 203.85ns
128M: 8.31ns  8.58ns  230.46ns 242.63ns -> 8.22ns  8.39ns  227.99ns 240.29ns
256M: 8.80ns  8.80ns  248.24ns 261.68ns -> 8.35ns  8.55ns  250.18ns 262.20ns

Note I also chose to enforce PXNTable at the PMD layer only (for now),
since the 194 descriptors which are affected by this change on my
testing setup are not sufficient to warrant enforcement at a coarser
granularity.

The architecture-independent changes (which I term "generic") can be
classified only as refactoring, but I feel they are also major
improvements in that they standardize most uses of the vmalloc
interface across the kernel.

Note this patch reduces the arm64 allocated region for BPF and
kprobes, but only to match the existing allocation choices made by the
generic kernel. I will admit I do not understand why BPF JIT allocation
code was duplicated into arm64, but I feel this was either an artifact
or that these overrides of generic allocation should require a specific
Kconfig option, as they trade off between security and space. That
said, I have chosen not to wrap this patch in a Kconfig interface, as I
feel the changes provide significant benefit to the arm64 kernel's
baseline security, though a Kconfig option could certainly be added if
the maintainers see the need.

Maxwell Bland (4):
  mm/vmalloc: allow arch-specific vmalloc_node overrides
  mm: pgalloc: support address-conditional pmd allocation
  arm64: separate code and data virtual memory allocation
  arm64: dynamic enforcement of pmd-level PXNTable

 arch/arm/kernel/irq.c               |  2 +-
 arch/arm64/include/asm/pgalloc.h    | 11 +++++-
 arch/arm64/include/asm/vmalloc.h    |  8 ++++
 arch/arm64/include/asm/vmap_stack.h |  2 +-
 arch/arm64/kernel/efi.c             |  2 +-
 arch/arm64/kernel/module.c          |  7 ++++
 arch/arm64/kernel/probes/kprobes.c  |  2 +-
 arch/arm64/mm/Makefile              |  3 +-
 arch/arm64/mm/trans_pgd.c           |  2 +-
 arch/arm64/mm/vmalloc.c             | 57 +++++++++++++++++++++++++++++
 arch/arm64/net/bpf_jit_comp.c       |  5 ++-
 arch/powerpc/kernel/irq.c           |  2 +-
 arch/riscv/include/asm/irq_stack.h  |  2 +-
 arch/s390/hypfs/hypfs_diag.c        |  2 +-
 arch/s390/kernel/setup.c            |  6 +--
 arch/s390/kernel/sthyi.c            |  2 +-
 include/asm-generic/pgalloc.h       | 18 +++++++++
 include/linux/mm.h                  |  4 +-
 include/linux/vmalloc.h             | 15 +++++++-
 kernel/bpf/syscall.c                |  4 +-
 kernel/fork.c                       |  4 +-
 kernel/scs.c                        |  3 +-
 lib/objpool.c                       |  2 +-
 lib/test_vmalloc.c                  |  6 +--
 mm/hugetlb_vmemmap.c                |  4 +-
 mm/kasan/init.c                     | 22 ++++++-----
 mm/memory.c                         |  4 +-
 mm/percpu.c                         |  2 +-
 mm/pgalloc-track.h                  |  3 +-
 mm/sparse-vmemmap.c                 |  2 +-
 mm/util.c                           |  3 +-
 mm/vmalloc.c                        | 39 +++++++-------------
 32 files changed, 176 insertions(+), 74 deletions(-)
 create mode 100644 arch/arm64/mm/vmalloc.c


base-commit: b401b621758e46812da61fa58a67c3fd8d91de0d
-- 
2.39.2



* [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
@ 2024-02-20 20:32 ` Maxwell Bland
  2024-02-21  5:43   ` Christoph Hellwig
  2024-02-21  6:59   ` Christophe Leroy
  2024-02-20 20:32 ` [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation Maxwell Bland
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-20 20:32 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas,
	christophe.leroy, cl, daniel, dave.hansen, david, dennis,
	dvyukov, glider, gor, guoren, haoluo, hca, hch, john.fastabend,
	jolsa, kasan-dev, kpsingh, linux-arch, linux, linux-efi,
	linux-kernel, linux-mm, linuxppc-dev, linux-riscv, linux-s390,
	lstoakes, mark.rutland, martin.lau, meted, michael.christie,
	mjguzik, mpe, mst, muchun.song, naveen.n.rao, npiggin, palmer,
	paul.walmsley, quic_nprakash, quic_pkondeti, rick.p.edgecombe,
	ryabinin.a.a, ryan.roberts, samitolvanen, sdf, song, surenb,
	svens, tj, urezki, vincenzo.frascino, will, wuqiang.matt,
	yonghong.song, zlim.lnx, mbland, awheeler

Present non-uniform use of __vmalloc_node and __vmalloc_node_range
makes enforcing appropriate code and data separation untenable on
certain microarchitectures, as VMALLOC_START and VMALLOC_END are
monolithic while use of the vmalloc interface is not: in particular,
appropriate randomness in ASLR requires that code regions can fall
anywhere between VMALLOC_START and VMALLOC_END, but this necessitates
that code pages are intermingled with data pages, meaning code-specific
protections, such as arm64's PXNTable, cannot be performantly enforced
at runtime.

The solution to this problem allows architectures to override the
vmalloc wrapper functions. It does so by ensuring the rest of the
kernel does not reimplement __vmalloc_node by calling
__vmalloc_node_range with the same parameters as __vmalloc_node, and by
adding a __weak tag to those wrapper functions which still call
__vmalloc_node_range with parameters matching those of __vmalloc_node.

Two benefits of this approach are (1) greater flexibility for each
architecture in handling virtual memory without compromising the
kernel's vmalloc logic and (2) more uniform use of the __vmalloc_node
interface, reserving the more specialized __vmalloc_node_range for
genuinely specialized cases, such as kasan's shadow memory.
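
As an illustration of the override this enables, an architecture can
now supply a strong definition of the __weak generic function; a
minimal sketch (the range macros here are illustrative placeholders,
see the arm64 implementation in a later patch for the real thing):

	void *__vmalloc_node(unsigned long size, unsigned long align,
			     gfp_t gfp_mask, unsigned long vm_flags, int node,
			     const void *caller)
	{
		/* restrict generic vmalloc data to an arch-chosen window */
		return __vmalloc_node_range(size, align, ARCH_VMALLOC_DATA_START,
					    ARCH_VMALLOC_DATA_END, gfp_mask,
					    PAGE_KERNEL, vm_flags, node, caller);
	}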

Signed-off-by: Maxwell Bland <mbland@motorola.com>
---
 arch/arm/kernel/irq.c               |  2 +-
 arch/arm64/include/asm/vmap_stack.h |  2 +-
 arch/arm64/kernel/efi.c             |  2 +-
 arch/powerpc/kernel/irq.c           |  2 +-
 arch/riscv/include/asm/irq_stack.h  |  2 +-
 arch/s390/hypfs/hypfs_diag.c        |  2 +-
 arch/s390/kernel/setup.c            |  6 ++---
 arch/s390/kernel/sthyi.c            |  2 +-
 include/linux/vmalloc.h             | 15 ++++++++++-
 kernel/bpf/syscall.c                |  4 +--
 kernel/fork.c                       |  4 +--
 kernel/scs.c                        |  3 +--
 lib/objpool.c                       |  2 +-
 lib/test_vmalloc.c                  |  6 ++---
 mm/util.c                           |  3 +--
 mm/vmalloc.c                        | 39 +++++++++++------------------
 16 files changed, 47 insertions(+), 49 deletions(-)

diff --git a/arch/arm/kernel/irq.c b/arch/arm/kernel/irq.c
index fe28fc1f759d..109f4f363621 100644
--- a/arch/arm/kernel/irq.c
+++ b/arch/arm/kernel/irq.c
@@ -61,7 +61,7 @@ static void __init init_irq_stacks(void)
 						       THREAD_SIZE_ORDER);
 		else
 			stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
-					       THREADINFO_GFP, NUMA_NO_NODE,
+					       THREADINFO_GFP, 0, NUMA_NO_NODE,
 					       __builtin_return_address(0));
 
 		if (WARN_ON(!stack))
diff --git a/arch/arm64/include/asm/vmap_stack.h b/arch/arm64/include/asm/vmap_stack.h
index 20873099c035..57a7eaa720d5 100644
--- a/arch/arm64/include/asm/vmap_stack.h
+++ b/arch/arm64/include/asm/vmap_stack.h
@@ -21,7 +21,7 @@ static inline unsigned long *arch_alloc_vmap_stack(size_t stack_size, int node)
 
 	BUILD_BUG_ON(!IS_ENABLED(CONFIG_VMAP_STACK));
 
-	p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node,
+	p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, 0, node,
 			__builtin_return_address(0));
 	return kasan_reset_tag(p);
 }
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 0228001347be..48efa31a9161 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -205,7 +205,7 @@ static int __init arm64_efi_rt_init(void)
 		return 0;
 
 	p = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_KERNEL,
-			   NUMA_NO_NODE, &&l);
+			   0, NUMA_NO_NODE, &&l);
 l:	if (!p) {
 		pr_warn("Failed to allocate EFI runtime stack\n");
 		clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 6f7d4edaa0bc..ceb7ea07ca28 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -308,7 +308,7 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(do_IRQ)
 static void *__init alloc_vm_stack(void)
 {
 	return __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, THREADINFO_GFP,
-			      NUMA_NO_NODE, (void *)_RET_IP_);
+			      0, NUMA_NO_NODE, (void *)_RET_IP_);
 }
 
 static void __init vmap_irqstack_init(void)
diff --git a/arch/riscv/include/asm/irq_stack.h b/arch/riscv/include/asm/irq_stack.h
index 6441ded3b0cf..d2410735bde0 100644
--- a/arch/riscv/include/asm/irq_stack.h
+++ b/arch/riscv/include/asm/irq_stack.h
@@ -24,7 +24,7 @@ static inline unsigned long *arch_alloc_vmap_stack(size_t stack_size, int node)
 {
 	void *p;
 
-	p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node,
+	p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, 0, node,
 			__builtin_return_address(0));
 	return kasan_reset_tag(p);
 }
diff --git a/arch/s390/hypfs/hypfs_diag.c b/arch/s390/hypfs/hypfs_diag.c
index 279b7bba4d43..16359d854288 100644
--- a/arch/s390/hypfs/hypfs_diag.c
+++ b/arch/s390/hypfs/hypfs_diag.c
@@ -70,7 +70,7 @@ void *diag204_get_buffer(enum diag204_format fmt, int *pages)
 			return ERR_PTR(-EOPNOTSUPP);
 	}
 	diag204_buf = __vmalloc_node(array_size(*pages, PAGE_SIZE),
-				     PAGE_SIZE, GFP_KERNEL, NUMA_NO_NODE,
+				     PAGE_SIZE, GFP_KERNEL, 0, NUMA_NO_NODE,
 				     __builtin_return_address(0));
 	if (!diag204_buf)
 		return ERR_PTR(-ENOMEM);
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index d1f3b56e7afc..2c25b4e9f20a 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -254,7 +254,7 @@ static void __init conmode_default(void)
 		cpcmd("QUERY TERM", query_buffer, 1024, NULL);
 		ptr = strstr(query_buffer, "CONMODE");
 		/*
-		 * Set the conmode to 3215 so that the device recognition 
+		 * Set the conmode to 3215 so that the device recognition
 		 * will set the cu_type of the console to 3215. If the
 		 * conmode is 3270 and we don't set it back then both
 		 * 3215 and the 3270 driver will try to access the console
@@ -314,7 +314,7 @@ static inline void setup_zfcpdump(void) {}
 
  /*
  * Reboot, halt and power_off stubs. They just call _machine_restart,
- * _machine_halt or _machine_power_off. 
+ * _machine_halt or _machine_power_off.
  */
 
 void machine_restart(char *command)
@@ -364,7 +364,7 @@ unsigned long stack_alloc(void)
 	void *ret;
 
 	ret = __vmalloc_node(THREAD_SIZE, THREAD_SIZE, THREADINFO_GFP,
-			     NUMA_NO_NODE, __builtin_return_address(0));
+			     0, NUMA_NO_NODE, __builtin_return_address(0));
 	kmemleak_not_leak(ret);
 	return (unsigned long)ret;
 #else
diff --git a/arch/s390/kernel/sthyi.c b/arch/s390/kernel/sthyi.c
index 30bb20461db4..5bf239bcdae9 100644
--- a/arch/s390/kernel/sthyi.c
+++ b/arch/s390/kernel/sthyi.c
@@ -318,7 +318,7 @@ static void fill_diag(struct sthyi_sctns *sctns)
 		return;
 
 	diag204_buf = __vmalloc_node(array_size(pages, PAGE_SIZE),
-				     PAGE_SIZE, GFP_KERNEL, NUMA_NO_NODE,
+				     PAGE_SIZE, GFP_KERNEL, 0, NUMA_NO_NODE,
 				     __builtin_return_address(0));
 	if (!diag204_buf)
 		return;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..f13bd711ad7d 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -150,7 +150,8 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller) __alloc_size(1);
 void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
-		int node, const void *caller) __alloc_size(1);
+		unsigned long vm_flags, int node, const void *caller)
+		__alloc_size(1);
 void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
 
 extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
@@ -295,4 +296,16 @@ bool vmalloc_dump_obj(void *object);
 static inline bool vmalloc_dump_obj(void *object) { return false; }
 #endif
 
+#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
+#define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
+#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
+#define GFP_VMALLOC32 (GFP_DMA | GFP_KERNEL)
+#else
+/*
+ * 64b systems should always have either DMA or DMA32 zones. For others
+ * GFP_DMA32 should do the right thing and use the normal zone.
+ */
+#define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
+#endif
+
 #endif /* _LINUX_VMALLOC_H */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a1f18681721c..79c11307ff40 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -303,8 +303,8 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
 			return area;
 	}
 
-	return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
-			gfp | GFP_KERNEL | __GFP_RETRY_MAYFAIL, PAGE_KERNEL,
+	return __vmalloc_node(size, align,
+			gfp | GFP_KERNEL | __GFP_RETRY_MAYFAIL,
 			flags, numa_node, __builtin_return_address(0));
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d944e92a43f..800bb1c76000 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -304,10 +304,8 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node)
 	 * so memcg accounting is performed manually on assigning/releasing
 	 * stacks to tasks. Drop __GFP_ACCOUNT.
 	 */
-	stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
-				     VMALLOC_START, VMALLOC_END,
+	stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
 				     THREADINFO_GFP & ~__GFP_ACCOUNT,
-				     PAGE_KERNEL,
 				     0, node, __builtin_return_address(0));
 	if (!stack)
 		return -ENOMEM;
diff --git a/kernel/scs.c b/kernel/scs.c
index d7809affe740..5b89fb08a392 100644
--- a/kernel/scs.c
+++ b/kernel/scs.c
@@ -43,8 +43,7 @@ static void *__scs_alloc(int node)
 		}
 	}
 
-	s = __vmalloc_node_range(SCS_SIZE, 1, VMALLOC_START, VMALLOC_END,
-				    GFP_SCS, PAGE_KERNEL, 0, node,
+	s = __vmalloc_node(SCS_SIZE, 1, GFP_SCS, 0, node,
 				    __builtin_return_address(0));
 
 out:
diff --git a/lib/objpool.c b/lib/objpool.c
index cfdc02420884..f0acd421a652 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -80,7 +80,7 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs,
 			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
 		else
 			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
-				cpu_to_node(i), __builtin_return_address(0));
+				0, cpu_to_node(i), __builtin_return_address(0));
 		if (!slot)
 			return -ENOMEM;
 		memset(slot, 0, size);
diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index 3718d9886407..6bde73f892f9 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -97,7 +97,7 @@ static int random_size_align_alloc_test(void)
 		size = ((rnd % 10) + 1) * PAGE_SIZE;
 
 		ptr = __vmalloc_node(size, align, GFP_KERNEL | __GFP_ZERO, 0,
-				__builtin_return_address(0));
+				0, __builtin_return_address(0));
 		if (!ptr)
 			return -1;
 
@@ -120,7 +120,7 @@ static int align_shift_alloc_test(void)
 		align = ((unsigned long) 1) << i;
 
 		ptr = __vmalloc_node(PAGE_SIZE, align, GFP_KERNEL|__GFP_ZERO, 0,
-				__builtin_return_address(0));
+				0, __builtin_return_address(0));
 		if (!ptr)
 			return -1;
 
@@ -138,7 +138,7 @@ static int fix_align_alloc_test(void)
 	for (i = 0; i < test_loop_count; i++) {
 		ptr = __vmalloc_node(5 * PAGE_SIZE, THREAD_ALIGN << 1,
 				GFP_KERNEL | __GFP_ZERO, 0,
-				__builtin_return_address(0));
+				0, __builtin_return_address(0));
 		if (!ptr)
 			return -1;
 
diff --git a/mm/util.c b/mm/util.c
index 5a6a9802583b..c6b7111215e2 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -639,8 +639,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	 * about the resulting pointer, and cannot play
 	 * protection games.
 	 */
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-			flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+	return __vmalloc_node(size, 1, flags, VM_ALLOW_HUGE_VMAP,
 			node, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(kvmalloc_node);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..18ece28e79d3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3119,7 +3119,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 
 	/* Please note that the recursion is strictly bounded. */
 	if (array_size > PAGE_SIZE) {
-		area->pages = __vmalloc_node(array_size, 1, nested_gfp, node,
+		area->pages = __vmalloc_node(array_size, 1, nested_gfp, 0, node,
 					area->caller);
 	} else {
 		area->pages = kmalloc_node(array_size, nested_gfp, node);
@@ -3379,11 +3379,12 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *__vmalloc_node(unsigned long size, unsigned long align,
-			    gfp_t gfp_mask, int node, const void *caller)
+__weak void *__vmalloc_node(unsigned long size, unsigned long align,
+			    gfp_t gfp_mask, unsigned long vm_flags, int node,
+			    const void *caller)
 {
 	return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
-				gfp_mask, PAGE_KERNEL, 0, node, caller);
+				gfp_mask, PAGE_KERNEL, vm_flags, node, caller);
 }
 /*
  * This is only for performance analysis of vmalloc and stress purpose.
@@ -3396,7 +3397,7 @@ EXPORT_SYMBOL_GPL(__vmalloc_node);
 
 void *__vmalloc(unsigned long size, gfp_t gfp_mask)
 {
-	return __vmalloc_node(size, 1, gfp_mask, NUMA_NO_NODE,
+	return __vmalloc_node(size, 1, gfp_mask, 0, NUMA_NO_NODE,
 				__builtin_return_address(0));
 }
 EXPORT_SYMBOL(__vmalloc);
@@ -3415,7 +3416,7 @@ EXPORT_SYMBOL(__vmalloc);
  */
 void *vmalloc(unsigned long size)
 {
-	return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
+	return __vmalloc_node(size, 1, GFP_KERNEL, 0, NUMA_NO_NODE,
 				__builtin_return_address(0));
 }
 EXPORT_SYMBOL(vmalloc);
@@ -3432,7 +3433,7 @@ EXPORT_SYMBOL(vmalloc);
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
+__weak void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
 {
 	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
 				    gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
@@ -3455,7 +3456,7 @@ EXPORT_SYMBOL_GPL(vmalloc_huge);
  */
 void *vzalloc(unsigned long size)
 {
-	return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE,
+	return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, 0, NUMA_NO_NODE,
 				__builtin_return_address(0));
 }
 EXPORT_SYMBOL(vzalloc);
@@ -3469,7 +3470,7 @@ EXPORT_SYMBOL(vzalloc);
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vmalloc_user(unsigned long size)
+__weak void *vmalloc_user(unsigned long size)
 {
 	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
 				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
@@ -3493,7 +3494,7 @@ EXPORT_SYMBOL(vmalloc_user);
  */
 void *vmalloc_node(unsigned long size, int node)
 {
-	return __vmalloc_node(size, 1, GFP_KERNEL, node,
+	return __vmalloc_node(size, 1, GFP_KERNEL, 0, node,
 			__builtin_return_address(0));
 }
 EXPORT_SYMBOL(vmalloc_node);
@@ -3511,23 +3512,11 @@ EXPORT_SYMBOL(vmalloc_node);
  */
 void *vzalloc_node(unsigned long size, int node)
 {
-	return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, node,
+	return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, 0, node,
 				__builtin_return_address(0));
 }
 EXPORT_SYMBOL(vzalloc_node);
 
-#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
-#define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
-#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
-#define GFP_VMALLOC32 (GFP_DMA | GFP_KERNEL)
-#else
-/*
- * 64b systems should always have either DMA or DMA32 zones. For others
- * GFP_DMA32 should do the right thing and use the normal zone.
- */
-#define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
-#endif
-
 /**
  * vmalloc_32 - allocate virtually contiguous memory (32bit addressable)
  * @size:	allocation size
@@ -3539,7 +3528,7 @@ EXPORT_SYMBOL(vzalloc_node);
  */
 void *vmalloc_32(unsigned long size)
 {
-	return __vmalloc_node(size, 1, GFP_VMALLOC32, NUMA_NO_NODE,
+	return __vmalloc_node(size, 1, GFP_VMALLOC32, 0, NUMA_NO_NODE,
 			__builtin_return_address(0));
 }
 EXPORT_SYMBOL(vmalloc_32);
@@ -3553,7 +3542,7 @@ EXPORT_SYMBOL(vmalloc_32);
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vmalloc_32_user(unsigned long size)
+__weak void *vmalloc_32_user(unsigned long size)
 {
 	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
 				    GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL,
-- 
2.39.2


* [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
  2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
@ 2024-02-20 20:32 ` Maxwell Bland
  2024-02-21  7:13   ` Christophe Leroy
  2024-02-20 20:32 ` [PATCH 3/4] arm64: separate code and data virtual memory allocation Maxwell Bland
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Maxwell Bland @ 2024-02-20 20:32 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas,
	christophe.leroy, cl, daniel, dave.hansen, david, dennis,
	dvyukov, glider, gor, guoren, haoluo, hca, hch, john.fastabend,
	jolsa, kasan-dev, kpsingh, linux-arch, linux, linux-efi,
	linux-kernel, linux-mm, linuxppc-dev, linux-riscv, linux-s390,
	lstoakes, mark.rutland, martin.lau, meted, michael.christie,
	mjguzik, mpe, mst, muchun.song, naveen.n.rao, npiggin, palmer,
	paul.walmsley, quic_nprakash, quic_pkondeti, rick.p.edgecombe,
	ryabinin.a.a, ryan.roberts, samitolvanen, sdf, song, surenb,
	svens, tj, urezki, vincenzo.frascino, will, wuqiang.matt,
	yonghong.song, zlim.lnx, mbland, awheeler

While other descriptors (e.g. pud) allow allocations conditional on
which virtual address is allocated, pmd descriptor allocations do not.
However, adding support for this is straightforward and is beneficial to
future kernel development targeting the PMD memory granularity.

As many architectures already implement pmd_populate_kernel in an
address-generic manner, it is necessary to roll out support
incrementally. For this purpose a preprocessor flag,
__HAVE_ARCH_ADDR_COND_PMD, is introduced to capture whether the
architecture supports some feature requiring PMD allocation conditional
on the virtual address. Some microarchitectures (e.g. arm64) support
configurations for table descriptors, for example to enforce Privilege
eXecute Never, which benefit from knowing the virtual memory addresses
referenced by PMDs.

Thus two major arguments in favor of this change are (1) uniformity of
allocation between PMD and other table descriptor types and (2) the
capability of address-specific PMD allocation.
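
To make the opt-in concrete, a sketch assembled from this patch and
the later arm64 patch:

	/* arch/<arch>/include/asm/pgalloc.h: define the flag before the
	 * generic header and provide the four-argument variant */
	#define __HAVE_ARCH_ADDR_COND_PMD
	#include <asm-generic/pgalloc.h>

	/* generic call sites are switched to the _at helper, e.g. in
	 * mm/memory.c, which forwards the address only when the flag is set */
	pmd_populate_kernel_at(&init_mm, pmd, new, address);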

Signed-off-by: Maxwell Bland <mbland@motorola.com>
---
 include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
 include/linux/mm.h            |  4 ++--
 mm/hugetlb_vmemmap.c          |  4 ++--
 mm/kasan/init.c               | 22 +++++++++++++---------
 mm/memory.c                   |  4 ++--
 mm/percpu.c                   |  2 +-
 mm/pgalloc-track.h            |  3 ++-
 mm/sparse-vmemmap.c           |  2 +-
 8 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 879e5f8aa5e9..e5cdce77c6e4 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -142,6 +142,24 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 }
 #endif
 
+#ifdef __HAVE_ARCH_ADDR_COND_PMD
+static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
+			pte_t *ptep, unsigned long address);
+#else
+static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
+			pte_t *ptep);
+#endif
+
+static inline void pmd_populate_kernel_at(struct mm_struct *mm, pmd_t *pmdp,
+			pte_t *ptep, unsigned long address)
+{
+#ifdef __HAVE_ARCH_ADDR_COND_PMD
+	pmd_populate_kernel(mm, pmdp, ptep, address);
+#else
+	pmd_populate_kernel(mm, pmdp, ptep);
+#endif
+}
+
 #ifndef __HAVE_ARCH_PMD_FREE
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..6a9d5ded428d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2782,7 +2782,7 @@ static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
 #endif
 
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
-int __pte_alloc_kernel(pmd_t *pmd);
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
 
 #if defined(CONFIG_MMU)
 
@@ -2977,7 +2977,7 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
-	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
+	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address)) ? \
 		NULL: pte_offset_kernel(pmd, address))
 
 #if USE_SPLIT_PMD_PTLOCKS
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index da177e49d956..1f5664b656f1 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -58,7 +58,7 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
 	if (!pgtable)
 		return -ENOMEM;
 
-	pmd_populate_kernel(&init_mm, &__pmd, pgtable);
+	pmd_populate_kernel_at(&init_mm, &__pmd, pgtable, addr);
 
 	for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
 		pte_t entry, *pte;
@@ -81,7 +81,7 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
 
 		/* Make pte visible before pmd. See comment in pmd_install(). */
 		smp_wmb();
-		pmd_populate_kernel(&init_mm, pmd, pgtable);
+		pmd_populate_kernel_at(&init_mm, pmd, pgtable, addr);
 		if (!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH))
 			flush_tlb_kernel_range(start, start + PMD_SIZE);
 	} else {
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index 89895f38f722..1e31d965a14e 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -116,8 +116,9 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 		next = pmd_addr_end(addr, end);
 
 		if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
-			pmd_populate_kernel(&init_mm, pmd,
-					lm_alias(kasan_early_shadow_pte));
+			pmd_populate_kernel_at(&init_mm, pmd,
+					lm_alias(kasan_early_shadow_pte),
+					addr);
 			continue;
 		}
 
@@ -131,7 +132,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 			if (!p)
 				return -ENOMEM;
 
-			pmd_populate_kernel(&init_mm, pmd, p);
+			pmd_populate_kernel_at(&init_mm, pmd, p, addr);
 		}
 		zero_pte_populate(pmd, addr, next);
 	} while (pmd++, addr = next, addr != end);
@@ -157,8 +158,9 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 			pud_populate(&init_mm, pud,
 					lm_alias(kasan_early_shadow_pmd));
 			pmd = pmd_offset(pud, addr);
-			pmd_populate_kernel(&init_mm, pmd,
-					lm_alias(kasan_early_shadow_pte));
+			pmd_populate_kernel_at(&init_mm, pmd,
+					lm_alias(kasan_early_shadow_pte),
+					addr);
 			continue;
 		}
 
@@ -203,8 +205,9 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 			pud_populate(&init_mm, pud,
 					lm_alias(kasan_early_shadow_pmd));
 			pmd = pmd_offset(pud, addr);
-			pmd_populate_kernel(&init_mm, pmd,
-					lm_alias(kasan_early_shadow_pte));
+			pmd_populate_kernel_at(&init_mm, pmd,
+					lm_alias(kasan_early_shadow_pte),
+					addr);
 			continue;
 		}
 
@@ -266,8 +269,9 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 			pud_populate(&init_mm, pud,
 					lm_alias(kasan_early_shadow_pmd));
 			pmd = pmd_offset(pud, addr);
-			pmd_populate_kernel(&init_mm, pmd,
-					lm_alias(kasan_early_shadow_pte));
+			pmd_populate_kernel_at(&init_mm, pmd,
+					lm_alias(kasan_early_shadow_pte),
+					addr);
 			continue;
 		}
 
diff --git a/mm/memory.c b/mm/memory.c
index 15f8b10ea17c..15702822d904 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -447,7 +447,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 	return 0;
 }
 
-int __pte_alloc_kernel(pmd_t *pmd)
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 {
 	pte_t *new = pte_alloc_one_kernel(&init_mm);
 	if (!new)
@@ -456,7 +456,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
 	spin_lock(&init_mm.page_table_lock);
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		smp_wmb(); /* See comment in pmd_install() */
-		pmd_populate_kernel(&init_mm, pmd, new);
+		pmd_populate_kernel_at(&init_mm, pmd, new, address);
 		new = NULL;
 	}
 	spin_unlock(&init_mm.page_table_lock);
diff --git a/mm/percpu.c b/mm/percpu.c
index 4e11fc1e6def..7312e584c1b5 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -3238,7 +3238,7 @@ void __init __weak pcpu_populate_pte(unsigned long addr)
 		new = memblock_alloc(PTE_TABLE_SIZE, PTE_TABLE_SIZE);
 		if (!new)
 			goto err_alloc;
-		pmd_populate_kernel(&init_mm, pmd, new);
+		pmd_populate_kernel_at(&init_mm, pmd, new, addr);
 	}
 
 	return;
diff --git a/mm/pgalloc-track.h b/mm/pgalloc-track.h
index e9e879de8649..0984681c03d4 100644
--- a/mm/pgalloc-track.h
+++ b/mm/pgalloc-track.h
@@ -45,7 +45,8 @@ static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
 
 #define pte_alloc_kernel_track(pmd, address, mask)			\
 	((unlikely(pmd_none(*(pmd))) &&					\
-	  (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
+	  (__pte_alloc_kernel(pmd, address) ||				\
+		({*(mask) |= PGTBL_PMD_MODIFIED; 0; }))) ?		\
 		NULL: pte_offset_kernel(pmd, address))
 
 #endif /* _LINUX_PGALLOC_TRACK_H */
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a2cbe44c48e1..d876cc4dc700 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -191,7 +191,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
-		pmd_populate_kernel(&init_mm, pmd, p);
+		pmd_populate_kernel_at(&init_mm, pmd, p, addr);
 	}
 	return pmd;
 }
-- 
2.39.2



* [PATCH 3/4] arm64: separate code and data virtual memory allocation
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
  2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
  2024-02-20 20:32 ` [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation Maxwell Bland
@ 2024-02-20 20:32 ` Maxwell Bland
  2024-02-21  7:20   ` Christophe Leroy
  2024-02-20 20:32 ` [PATCH 4/4] arm64: dynamic enforcement of pmd-level PXNTable Maxwell Bland
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Maxwell Bland @ 2024-02-20 20:32 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas,
	christophe.leroy, cl, daniel, dave.hansen, david, dennis,
	dvyukov, glider, gor, guoren, haoluo, hca, hch, john.fastabend,
	jolsa, kasan-dev, kpsingh, linux-arch, linux, linux-efi,
	linux-kernel, linux-mm, linuxppc-dev, linux-riscv, linux-s390,
	lstoakes, mark.rutland, martin.lau, meted, michael.christie,
	mjguzik, mpe, mst, muchun.song, naveen.n.rao, npiggin, palmer,
	paul.walmsley, quic_nprakash, quic_pkondeti, rick.p.edgecombe,
	ryabinin.a.a, ryan.roberts, samitolvanen, sdf, song, surenb,
	svens, tj, urezki, vincenzo.frascino, will, wuqiang.matt,
	yonghong.song, zlim.lnx, mbland, awheeler

Current BPF and kprobe instruction allocation interfaces do not match
the base kernel and intermingle code and data pages within the same
sections. In the case of BPF, this appears to be a result of code
duplication between the kernel's JIT compiler and arm64's JIT. However,
this is no longer necessary given the possibility of overriding the
vmalloc wrapper functions.

arm64's vmalloc_node routines now include a layer of indirection which
splits the vmalloc region into two segments surrounding the middle
module_alloc region determined by ASLR. To support this,
code_region_start and code_region_end are defined to match the 2GB
boundary chosen by the kernel module ASLR initialization routine.

The result is a large benefit to overall kernel security, as code
pages now remain protected by this ASLR routine and protections can be
defined linearly for code regions rather than through PTE-level
tracking.
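
With the boundaries chosen at boot by the module ASLR code, the
resulting vmalloc layout is roughly:

	VMALLOC_START ... code_region_start    : data-only vmalloc
	code_region_start ... code_region_end  : 2GB code region (modules,
	                                         BPF, kprobes via module_alloc)
	code_region_end ... VMALLOC_END        : data-only vmalloc

The overridden __vmalloc_node and friends first try the range below
code_region_start and fall back to the range above code_region_end.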

Signed-off-by: Maxwell Bland <mbland@motorola.com>
---
 arch/arm64/include/asm/vmalloc.h   |  3 ++
 arch/arm64/kernel/module.c         |  7 ++++
 arch/arm64/kernel/probes/kprobes.c |  2 +-
 arch/arm64/mm/Makefile             |  3 +-
 arch/arm64/mm/vmalloc.c            | 57 ++++++++++++++++++++++++++++++
 arch/arm64/net/bpf_jit_comp.c      |  5 +--
 6 files changed, 73 insertions(+), 4 deletions(-)
 create mode 100644 arch/arm64/mm/vmalloc.c

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 38fafffe699f..dbcf8ad20265 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -31,4 +31,7 @@ static inline pgprot_t arch_vmap_pgprot_tagged(pgprot_t prot)
 	return pgprot_tagged(prot);
 }
 
+extern unsigned long code_region_start __ro_after_init;
+extern unsigned long code_region_end __ro_after_init;
+
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index dd851297596e..c4fe753a71a9 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -29,6 +29,10 @@
 static u64 module_direct_base __ro_after_init = 0;
 static u64 module_plt_base __ro_after_init = 0;
 
+/* For pre-init vmalloc, assume the worst-case code range */
+unsigned long code_region_start __ro_after_init = (u64) (_end - SZ_2G);
+unsigned long code_region_end __ro_after_init = (u64) (_text + SZ_2G);
+
 /*
  * Choose a random page-aligned base address for a window of 'size' bytes which
  * entirely contains the interval [start, end - 1].
@@ -101,6 +105,9 @@ static int __init module_init_limits(void)
 		module_plt_base = random_bounding_box(SZ_2G, min, max);
 	}
 
+	code_region_start = module_plt_base;
+	code_region_end = module_plt_base + SZ_2G;
+
 	pr_info("%llu pages in range for non-PLT usage",
 		module_direct_base ? (SZ_128M - kernel_size) / PAGE_SIZE : 0);
 	pr_info("%llu pages in range for PLT usage",
diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index 70b91a8c6bb3..c9e109d6c8bc 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -131,7 +131,7 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
 
 void *alloc_insn_page(void)
 {
-	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
+	return __vmalloc_node_range(PAGE_SIZE, 1, code_region_start, code_region_end,
 			GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
 			NUMA_NO_NODE, __builtin_return_address(0));
 }
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..730b805d8388 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -2,7 +2,8 @@
 obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   cache.o copypage.o flush.o \
 				   ioremap.o mmap.o pgd.o mmu.o \
-				   context.o proc.o pageattr.o fixmap.o
+				   context.o proc.o pageattr.o fixmap.o \
+				   vmalloc.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
 obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
diff --git a/arch/arm64/mm/vmalloc.c b/arch/arm64/mm/vmalloc.c
new file mode 100644
index 000000000000..b6d2fa841f90
--- /dev/null
+++ b/arch/arm64/mm/vmalloc.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+
+static void *__vmalloc_node_range_split(unsigned long size, unsigned long align,
+			unsigned long start, unsigned long end,
+			unsigned long exclusion_start, unsigned long exclusion_end, gfp_t gfp_mask,
+			pgprot_t prot, unsigned long vm_flags, int node,
+			const void *caller)
+{
+	void *res = NULL;
+
+	res = __vmalloc_node_range(size, align, start, exclusion_start,
+				gfp_mask, prot, vm_flags, node, caller);
+	if (!res)
+		res = __vmalloc_node_range(size, align, exclusion_end, end,
+				gfp_mask, prot, vm_flags, node, caller);
+
+	return res;
+}
+
+void *__vmalloc_node(unsigned long size, unsigned long align,
+			    gfp_t gfp_mask, unsigned long vm_flags, int node,
+			    const void *caller)
+{
+	return __vmalloc_node_range_split(size, align, VMALLOC_START,
+				VMALLOC_END, code_region_start, code_region_end,
+				gfp_mask, PAGE_KERNEL, vm_flags, node, caller);
+}
+
+void *vmalloc_huge(unsigned long size, gfp_t gfp_mask)
+{
+	return __vmalloc_node_range_split(size, 1, VMALLOC_START, VMALLOC_END,
+				code_region_start, code_region_end,
+				gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+				NUMA_NO_NODE, __builtin_return_address(0));
+}
+
+void *vmalloc_user(unsigned long size)
+{
+	return __vmalloc_node_range_split(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
+				code_region_start, code_region_end,
+				GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
+				VM_USERMAP, NUMA_NO_NODE,
+				__builtin_return_address(0));
+}
+
+void *vmalloc_32_user(unsigned long size)
+{
+	return __vmalloc_node_range_split(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
+				code_region_start, code_region_end,
+				GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL,
+				VM_USERMAP, NUMA_NO_NODE,
+				__builtin_return_address(0));
+}
+
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 8955da5c47cf..40426f3a9bdf 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -13,6 +13,7 @@
 #include <linux/memory.h>
 #include <linux/printk.h>
 #include <linux/slab.h>
+#include <linux/moduleloader.h>
 
 #include <asm/asm-extable.h>
 #include <asm/byteorder.h>
@@ -1690,12 +1691,12 @@ u64 bpf_jit_alloc_exec_limit(void)
 void *bpf_jit_alloc_exec(unsigned long size)
 {
 	/* Memory is intended to be executable, reset the pointer tag. */
-	return kasan_reset_tag(vmalloc(size));
+	return kasan_reset_tag(module_alloc(size));
 }
 
 void bpf_jit_free_exec(void *addr)
 {
-	return vfree(addr);
+	return module_memfree(addr);
 }
 
 /* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */
-- 
2.39.2



* [PATCH 4/4] arm64: dynamic enforcement of pmd-level PXNTable
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
                   ` (2 preceding siblings ...)
  2024-02-20 20:32 ` [PATCH 3/4] arm64: separate code and data virtual memory allocation Maxwell Bland
@ 2024-02-20 20:32 ` Maxwell Bland
  2024-02-21  7:32 ` [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Christophe Leroy
  2024-02-21 14:50 ` Conor Dooley
  5 siblings, 0 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-20 20:32 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas,
	christophe.leroy, cl, daniel, dave.hansen, david, dennis,
	dvyukov, glider, gor, guoren, haoluo, hca, hch, john.fastabend,
	jolsa, kasan-dev, kpsingh, linux-arch, linux, linux-efi,
	linux-kernel, linux-mm, linuxppc-dev, linux-riscv, linux-s390,
	lstoakes, mark.rutland, martin.lau, meted, michael.christie,
	mjguzik, mpe, mst, muchun.song, naveen.n.rao, npiggin, palmer,
	paul.walmsley, quic_nprakash, quic_pkondeti, rick.p.edgecombe,
	ryabinin.a.a, ryan.roberts, samitolvanen, sdf, song, surenb,
	svens, tj, urezki, vincenzo.frascino, will, wuqiang.matt,
	yonghong.song, zlim.lnx, mbland, awheeler

In an attempt to protect against write-then-execute attacks, wherein
an adversary stages malicious code into a data page and then later uses
a write gadget to mark the data page executable, arm64 enforces
PXNTable when allocating pmd descriptors during the init process.
However, these protections are not maintained for dynamic memory
allocations, creating an extensive attack surface for
write-then-execute attacks targeting pages allocated through the
vmalloc interface.

Straightforward modifications to the pgalloc interface allow for the
dynamic enforcement of PXNTable, restricting writable and
privileged-executable code pages to known kernel text, bpf-allocated
programs, and kprobe-allocated pages, all of which have more extensive
verification interfaces than the generic vmalloc region.

This patch adds a preprocessor define to check whether a pmd is
allocated by vmalloc and exists outside of a known code region, and if
so, marks the pmd as PXNTable, protecting over 100 last-level page
tables from manipulation in the process.

Signed-off-by: Maxwell Bland <mbland@motorola.com>
---
 arch/arm64/include/asm/pgalloc.h | 11 +++++++++--
 arch/arm64/include/asm/vmalloc.h |  5 +++++
 arch/arm64/mm/trans_pgd.c        |  2 +-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 237224484d0f..5e9262241e8b 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -13,6 +13,7 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
+#define __HAVE_ARCH_ADDR_COND_PMD
 #define __HAVE_ARCH_PGD_FREE
 #include <asm-generic/pgalloc.h>
 
@@ -74,10 +75,16 @@ static inline void __pmd_populate(pmd_t *pmdp, phys_addr_t ptep,
  * of the mm address space.
  */
 static inline void
-pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp, pte_t *ptep)
+pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp, pte_t *ptep,
+			unsigned long address)
 {
+	pmdval_t pmd = PMD_TYPE_TABLE | PMD_TABLE_UXN;
 	VM_BUG_ON(mm && mm != &init_mm);
-	__pmd_populate(pmdp, __pa(ptep), PMD_TYPE_TABLE | PMD_TABLE_UXN);
+	if (IS_DATA_VMALLOC_ADDR(address) &&
+		IS_DATA_VMALLOC_ADDR(address + PMD_SIZE)) {
+		pmd |= PMD_TABLE_PXN;
+	}
+	__pmd_populate(pmdp, __pa(ptep), pmd);
 }
 
 static inline void
diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index dbcf8ad20265..6f254ab83f4a 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -34,4 +34,9 @@ static inline pgprot_t arch_vmap_pgprot_tagged(pgprot_t prot)
 extern unsigned long code_region_start __ro_after_init;
 extern unsigned long code_region_end __ro_after_init;
 
+#define IS_DATA_VMALLOC_ADDR(vaddr) (((vaddr) < code_region_start || \
+				      (vaddr) > code_region_end) && \
+				      ((vaddr) >= VMALLOC_START && \
+				       (vaddr) < VMALLOC_END))
+
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 7b14df3c6477..7f903c51e1eb 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -69,7 +69,7 @@ static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
 	dst_ptep = trans_alloc(info);
 	if (!dst_ptep)
 		return -ENOMEM;
-	pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
+	pmd_populate_kernel_at(NULL, dst_pmdp, dst_ptep, addr);
 	dst_ptep = pte_offset_kernel(dst_pmdp, start);
 
 	src_ptep = pte_offset_kernel(src_pmdp, start);
-- 
2.39.2



* Re: [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides
  2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
@ 2024-02-21  5:43   ` Christoph Hellwig
  2024-02-21  7:38     ` Christophe Leroy
  2024-02-21  6:59   ` Christophe Leroy
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-02-21  5:43 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: linux-arm-kernel, gregkh, agordeev, akpm, andreyknvl, andrii,
	aneesh.kumar, aou, ardb, arnd, ast, borntraeger, bpf, brauner,
	catalin.marinas, christophe.leroy, cl, daniel, dave.hansen,
	david, dennis, dvyukov, glider, gor, guoren, haoluo, hca, hch,
	john.fastabend, jolsa, kasan-dev, kpsingh, linux-arch, linux,
	linux-efi, linux-kernel, linux-mm, linuxppc-dev, linux-riscv,
	linux-s390, lstoakes, mark.rutland, martin.lau, meted,
	michael.christie, mjguzik, mpe, mst, muchun.song, naveen.n.rao,
	npiggin, palmer, paul.walmsley, quic_nprakash, quic_pkondeti,
	rick.p.edgecombe, ryabinin.a.a, ryan.roberts, samitolvanen, sdf,
	song, surenb, svens, tj, urezki, vincenzo.frascino, will,
	wuqiang.matt, yonghong.song, zlim.lnx, awheeler

On Tue, Feb 20, 2024 at 02:32:53PM -0600, Maxwell Bland wrote:
> Present non-uniform use of __vmalloc_node and __vmalloc_node_range makes
> enforcing appropriate code and data seperation untenable on certain
> microarchitectures, as VMALLOC_START and VMALLOC_END are monolithic
> while the use of the vmalloc interface is non-monolithic: in particular,
> appropriate randomness in ASLR makes it such that code regions must fall
> in some region between VMALLOC_START and VMALLOC_end, but this
> necessitates that code pages are intermingled with data pages, meaning
> code-specific protections, such as arm64's PXNTable, cannot be
> performantly runtime enforced.

That's not actually true.  We have MODULE_START/END to separate them,
which is used by mips only for now.

> 
> The solution to this problem allows architectures to override the
> vmalloc wrapper functions by enforcing that the rest of the kernel does
> not reimplement __vmalloc_node by using __vmalloc_node_range with the
> same parameters as __vmalloc_node or provides a __weak tag to those
> functions using __vmalloc_node_range with parameters repeating those of
> __vmalloc_node.

I'm really not too happy about overriding the functions.  Especially
as the separation is a generally good idea and it would be good to
move everyone (or at least all modern architectures) over to a scheme
like this.


* Re: [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides
  2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
  2024-02-21  5:43   ` Christoph Hellwig
@ 2024-02-21  6:59   ` Christophe Leroy
  2024-02-21 17:19     ` Maxwell Bland
  1 sibling, 1 reply; 17+ messages in thread
From: Christophe Leroy @ 2024-02-21  6:59 UTC (permalink / raw)
  To: Maxwell Bland, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler



On 20/02/2024 at 21:32, Maxwell Bland wrote:
> 
> Present non-uniform use of __vmalloc_node and __vmalloc_node_range makes
> enforcing appropriate code and data seperation untenable on certain
> microarchitectures, as VMALLOC_START and VMALLOC_END are monolithic
> while the use of the vmalloc interface is non-monolithic: in particular,
> appropriate randomness in ASLR makes it such that code regions must fall
> in some region between VMALLOC_START and VMALLOC_end, but this
> necessitates that code pages are intermingled with data pages, meaning
> code-specific protections, such as arm64's PXNTable, cannot be
> performantly runtime enforced.
> 
> The solution to this problem allows architectures to override the
> vmalloc wrapper functions by enforcing that the rest of the kernel does
> not reimplement __vmalloc_node by using __vmalloc_node_range with the
> same parameters as __vmalloc_node or provides a __weak tag to those
> functions using __vmalloc_node_range with parameters repeating those of
> __vmalloc_node.
> 
> Two benefits of this approach are (1) greater flexibility to each
> architecture for handling of virtual memory while not compromising the
> kernel's vmalloc logic and (2) more uniform use of the __vmalloc_node
> interface, reserving the more specialized __vmalloc_node_range for more
> specialized cases, such as kasan's shadow memory.

I'm not sure I understand the message. What I understand is that you 
allow architectures to override vmalloc_node().

In the code you add __weak for that. But you also add the flags to the 
parameters and I can't understand why when reading the above description.

Christophe


* Re: [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation
  2024-02-20 20:32 ` [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation Maxwell Bland
@ 2024-02-21  7:13   ` Christophe Leroy
  2024-02-21  9:27     ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Christophe Leroy @ 2024-02-21  7:13 UTC (permalink / raw)
  To: Maxwell Bland, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler



On 20/02/2024 at 21:32, Maxwell Bland wrote:
> 
> While other descriptors (e.g. pud) allow allocations conditional on
> which virtual address is allocated, pmd descriptor allocations do not.
> However, adding support for this is straightforward and is beneficial to
> future kernel development targeting the PMD memory granularity.
> 
> As many architectures already implement pmd_populate_kernel in an
> address-generic manner, it is necessary to roll out support
> incrementally. For this purpose a preprocessor flag,

Is it really worth it? Only 48 call sites need to be updated. It would
avoid that preprocessor flag and avoid introducing that
pmd_populate_kernel_at() in the kernel core.

$ git grep -l pmd_populate_kernel -- arch/ | wc -l
48


> __HAVE_ARCH_ADDR_COND_PMD is introduced to capture whether the
> architecture supports some feature requiring PMD allocation conditional
> on virtual address. Some microarchitectures (e.g. arm64) support
> configurations for table descriptors, for example to enforce Privilege
> eXecute Never, which benefit from knowing the virtual memory addresses
> referenced by PMDs.
> 
> Thus two major arguments in favor of this change are (1) unformity of
> allocation between PMD and other table descriptor types and (2) the
> capability of address-specific PMD allocation.

Can you give more details on that uniformity ? I can't find any function 
called pud_populate_kernel().

Previously, pmd_populate_kernel() had the same arguments as pmd_populate().
Why not also update pmd_populate() to keep consistency? (This can be done
in a follow-up patch, not in this patch.)

> 
> Signed-off-by: Maxwell Bland <mbland@motorola.com>
> ---
>   include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
>   include/linux/mm.h            |  4 ++--
>   mm/hugetlb_vmemmap.c          |  4 ++--
>   mm/kasan/init.c               | 22 +++++++++++++---------
>   mm/memory.c                   |  4 ++--
>   mm/percpu.c                   |  2 +-
>   mm/pgalloc-track.h            |  3 ++-
>   mm/sparse-vmemmap.c           |  2 +-
>   8 files changed, 41 insertions(+), 18 deletions(-)
> 
> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
> index 879e5f8aa5e9..e5cdce77c6e4 100644
> --- a/include/asm-generic/pgalloc.h
> +++ b/include/asm-generic/pgalloc.h
> @@ -142,6 +142,24 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
>   }
>   #endif
> 
> +#ifdef __HAVE_ARCH_ADDR_COND_PMD
> +static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
> +                       pte_t *ptep, unsigned long address);
> +#else
> +static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
> +                       pte_t *ptep);
> +#endif
> +
> +static inline void pmd_populate_kernel_at(struct mm_struct *mm, pmd_t *pmdp,
> +                       pte_t *ptep, unsigned long address)
> +{
> +#ifdef __HAVE_ARCH_ADDR_COND_PMD
> +       pmd_populate_kernel(mm, pmdp, ptep, address);
> +#else
> +       pmd_populate_kernel(mm, pmdp, ptep);
> +#endif
> +}
> +
>   #ifndef __HAVE_ARCH_PMD_FREE
>   static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
>   {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f5a97dec5169..6a9d5ded428d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2782,7 +2782,7 @@ static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
>   #endif
> 
>   int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
> -int __pte_alloc_kernel(pmd_t *pmd);
> +int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
> 
>   #if defined(CONFIG_MMU)
> 
> @@ -2977,7 +2977,7 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>                   NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
> 
>   #define pte_alloc_kernel(pmd, address)                 \
> -       ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
> +       ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address)) ? \
>                  NULL: pte_offset_kernel(pmd, address))
> 
>   #if USE_SPLIT_PMD_PTLOCKS
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index da177e49d956..1f5664b656f1 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -58,7 +58,7 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
>          if (!pgtable)
>                  return -ENOMEM;
> 
> -       pmd_populate_kernel(&init_mm, &__pmd, pgtable);
> +       pmd_populate_kernel_at(&init_mm, &__pmd, pgtable, addr);
> 
>          for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
>                  pte_t entry, *pte;
> @@ -81,7 +81,7 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, unsigned long start,
> 
>                  /* Make pte visible before pmd. See comment in pmd_install(). */
>                  smp_wmb();
> -               pmd_populate_kernel(&init_mm, pmd, pgtable);
> +               pmd_populate_kernel_at(&init_mm, pmd, pgtable, addr);
>                  if (!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH))
>                          flush_tlb_kernel_range(start, start + PMD_SIZE);
>          } else {
> diff --git a/mm/kasan/init.c b/mm/kasan/init.c
> index 89895f38f722..1e31d965a14e 100644
> --- a/mm/kasan/init.c
> +++ b/mm/kasan/init.c
> @@ -116,8 +116,9 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
>                  next = pmd_addr_end(addr, end);
> 
>                  if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
> -                       pmd_populate_kernel(&init_mm, pmd,
> -                                       lm_alias(kasan_early_shadow_pte));
> +                       pmd_populate_kernel_at(&init_mm, pmd,
> +                                       lm_alias(kasan_early_shadow_pte),
> +                                       addr);
>                          continue;
>                  }
> 
> @@ -131,7 +132,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
>                          if (!p)
>                                  return -ENOMEM;
> 
> -                       pmd_populate_kernel(&init_mm, pmd, p);
> +                       pmd_populate_kernel_at(&init_mm, pmd, p, addr);
>                  }
>                  zero_pte_populate(pmd, addr, next);
>          } while (pmd++, addr = next, addr != end);
> @@ -157,8 +158,9 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
>                          pud_populate(&init_mm, pud,
>                                          lm_alias(kasan_early_shadow_pmd));
>                          pmd = pmd_offset(pud, addr);
> -                       pmd_populate_kernel(&init_mm, pmd,
> -                                       lm_alias(kasan_early_shadow_pte));
> +                       pmd_populate_kernel_at(&init_mm, pmd,
> +                                       lm_alias(kasan_early_shadow_pte),
> +                                       addr);
>                          continue;
>                  }
> 
> @@ -203,8 +205,9 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
>                          pud_populate(&init_mm, pud,
>                                          lm_alias(kasan_early_shadow_pmd));
>                          pmd = pmd_offset(pud, addr);
> -                       pmd_populate_kernel(&init_mm, pmd,
> -                                       lm_alias(kasan_early_shadow_pte));
> +                       pmd_populate_kernel_at(&init_mm, pmd,
> +                                       lm_alias(kasan_early_shadow_pte),
> +                                       addr);
>                          continue;
>                  }
> 
> @@ -266,8 +269,9 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
>                          pud_populate(&init_mm, pud,
>                                          lm_alias(kasan_early_shadow_pmd));
>                          pmd = pmd_offset(pud, addr);
> -                       pmd_populate_kernel(&init_mm, pmd,
> -                                       lm_alias(kasan_early_shadow_pte));
> +                       pmd_populate_kernel_at(&init_mm, pmd,
> +                                       lm_alias(kasan_early_shadow_pte),
> +                                       addr);
>                          continue;
>                  }
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 15f8b10ea17c..15702822d904 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -447,7 +447,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
>          return 0;
>   }
> 
> -int __pte_alloc_kernel(pmd_t *pmd)
> +int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
>   {
>          pte_t *new = pte_alloc_one_kernel(&init_mm);
>          if (!new)
> @@ -456,7 +456,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
>          spin_lock(&init_mm.page_table_lock);
>          if (likely(pmd_none(*pmd))) {   /* Has another populated it ? */
>                  smp_wmb(); /* See comment in pmd_install() */
> -               pmd_populate_kernel(&init_mm, pmd, new);
> +               pmd_populate_kernel_at(&init_mm, pmd, new, address);
>                  new = NULL;
>          }
>          spin_unlock(&init_mm.page_table_lock);
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 4e11fc1e6def..7312e584c1b5 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -3238,7 +3238,7 @@ void __init __weak pcpu_populate_pte(unsigned long addr)
>                  new = memblock_alloc(PTE_TABLE_SIZE, PTE_TABLE_SIZE);
>                  if (!new)
>                          goto err_alloc;
> -               pmd_populate_kernel(&init_mm, pmd, new);
> +               pmd_populate_kernel_at(&init_mm, pmd, new, addr);
>          }
> 
>          return;
> diff --git a/mm/pgalloc-track.h b/mm/pgalloc-track.h
> index e9e879de8649..0984681c03d4 100644
> --- a/mm/pgalloc-track.h
> +++ b/mm/pgalloc-track.h
> @@ -45,7 +45,8 @@ static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
> 
>   #define pte_alloc_kernel_track(pmd, address, mask)                     \
>          ((unlikely(pmd_none(*(pmd))) &&                                 \
> -         (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
> +         (__pte_alloc_kernel(pmd, address) ||                          \
> +               ({*(mask) |= PGTBL_PMD_MODIFIED; 0; }))) ?              \
>                  NULL: pte_offset_kernel(pmd, address))
> 
>   #endif /* _LINUX_PGALLOC_TRACK_H */
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index a2cbe44c48e1..d876cc4dc700 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -191,7 +191,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
>                  void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>                  if (!p)
>                          return NULL;
> -               pmd_populate_kernel(&init_mm, pmd, p);
> +               pmd_populate_kernel_at(&init_mm, pmd, p, addr);
>          }
>          return pmd;
>   }
> --
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] arm64: separate code and data virtual memory allocation
  2024-02-20 20:32 ` [PATCH 3/4] arm64: separate code and data virtual memory allocation Maxwell Bland
@ 2024-02-21  7:20   ` Christophe Leroy
  0 siblings, 0 replies; 17+ messages in thread
From: Christophe Leroy @ 2024-02-21  7:20 UTC (permalink / raw)
  To: Maxwell Bland, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler



On 20/02/2024 at 21:32, Maxwell Bland wrote:
> 
> Current BPF and kprobe instruction allocation interfaces do not match
> the base kernel and intermingle code and data pages within the same
> sections. In the case of BPF, this appears to be a result of code
> duplication between the kernel's JIT compiler and arm64's JIT.  However,
> this is no longer necessary given the possibility of overriding vmalloc
> wrapper functions.

Why do you need to override vmalloc wrapper functions for that?

See powerpc, for kprobes, alloc_insn_page() uses module_alloc().
On powerpc, the approach is that vmalloc() provides non-exec memory 
while module_alloc() provides executable memory.
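
For reference, a simplified sketch of that kprobes pattern (abridged;
the in-tree powerpc version also checks the strict module RWX
configuration before changing permissions):

/* executable instruction pages come from the module allocator and are
 * then made read-only and executable. */
void *alloc_insn_page(void)
{
	void *page = module_alloc(PAGE_SIZE);

	if (!page)
		return NULL;

	set_memory_rox((unsigned long)page, 1);
	return page;
}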

Christophe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
                   ` (3 preceding siblings ...)
  2024-02-20 20:32 ` [PATCH 4/4] arm64: dynamic enforcement of pmd-level PXNTable Maxwell Bland
@ 2024-02-21  7:32 ` Christophe Leroy
  2024-02-21 17:57   ` Maxwell Bland
  2024-02-21 14:50 ` Conor Dooley
  5 siblings, 1 reply; 17+ messages in thread
From: Christophe Leroy @ 2024-02-21  7:32 UTC (permalink / raw)
  To: Maxwell Bland, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler



On 20/02/2024 at 21:32, Maxwell Bland wrote:
> 
> Reworks ARM's virtual memory allocation infrastructure to support
> dynamic enforcement of page middle directory PXNTable restrictions
> rather than only during the initial memory mapping. Runtime enforcement
> of this bit prevents write-then-execute attacks, where malicious code is
> staged in vmalloc'd data regions, and later the page table is changed to
> make this code executable.
> 
> Previously the entire region from VMALLOC_START to VMALLOC_END was
> vulnerable, but now the vulnerable region is restricted to the 2GB
> reserved by module_alloc, a region which is generally read-only and more
> difficult to inject staging code into, e.g., data must pass the BPF
> verifier. These changes also set the stage for other systems, such as
> KVM-level (EL2) changes to mark page tables immutable and code page
> verification changes, forging a path toward complete mitigation of
> kernel exploits on ARM.
> 
> Implementing this required minimal changes to the generic vmalloc
> interface in the kernel to allow architecture overrides of some vmalloc
> wrapper functions, refactoring vmalloc calls to use a standard interface
> in the generic kernel, and passing the address parameter already passed
> into PTE allocation to the pte_allocate child function call.
> 
> The new arm64 vmalloc wrapper functions ensure vmalloc data is not
> allocated into the region reserved for module_alloc. arm64 BPF and
> kprobe code also see a two-line-change ensuring their allocations abide
> by the segmentation of code from data. Finally, arm64's pmd_populate
> function is modified to set the PXNTable bit appropriately.

On powerpc (book3s/32) we have more or less the same although it is not 
directly linked to PMDs: the virtual 4G address space is split into
segments of 256M. On each segment there's a bit called NX to forbid
execution. Vmalloc space is allocated in a segment with the NX bit set while
module space is allocated in a segment with the NX bit unset. We never have
to override vmalloc wrappers. All consumers of exec memory allocate it 
using module_alloc() while vmalloc() provides non-exec memory.
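
In other words the split lives in the allocator itself, roughly as below
(simplified; a real module_alloc also deals with things like KASLR
placement and fallback ranges):

/* code allocations are placed in the module region; plain vmalloc()
 * allocations land elsewhere and can stay non-executable. */
void *module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, MODULE_ALIGN,
				    MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL, 0,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}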

For modules, all you have to do is select 
ARCH_WANTS_MODULES_DATA_IN_VMALLOC and module data will be allocated 
using vmalloc() hence non-exec memory in our case.

Christophe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides
  2024-02-21  5:43   ` Christoph Hellwig
@ 2024-02-21  7:38     ` Christophe Leroy
  0 siblings, 0 replies; 17+ messages in thread
From: Christophe Leroy @ 2024-02-21  7:38 UTC (permalink / raw)
  To: Christoph Hellwig, Maxwell Bland
  Cc: linux-arm-kernel, gregkh, agordeev, akpm, andreyknvl, andrii,
	aneesh.kumar, aou, ardb, arnd, ast, borntraeger, bpf, brauner,
	catalin.marinas, cl, daniel, dave.hansen, david, dennis, dvyukov,
	glider, gor, guoren, haoluo, hca, john.fastabend, jolsa,
	kasan-dev, kpsingh, linux-arch, linux, linux-efi, linux-kernel,
	linux-mm, linuxppc-dev, linux-riscv, linux-s390, lstoakes,
	mark.rutland, martin.lau, meted, michael.christie, mjguzik, mpe,
	mst, muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler



On 21/02/2024 at 06:43, Christoph Hellwig wrote:
> On Tue, Feb 20, 2024 at 02:32:53PM -0600, Maxwell Bland wrote:
>> Present non-uniform use of __vmalloc_node and __vmalloc_node_range makes
>> enforcing appropriate code and data separation untenable on certain
>> microarchitectures, as VMALLOC_START and VMALLOC_END are monolithic
>> while the use of the vmalloc interface is non-monolithic: in particular,
>> appropriate randomness in ASLR makes it such that code regions must fall
>> in some region between VMALLOC_START and VMALLOC_END, but this
>> necessitates that code pages are intermingled with data pages, meaning
>> code-specific protections, such as arm64's PXNTable, cannot be
>> enforced performantly at runtime.
> 
> That's not actually true.  We have MODULE_START/END to separate them,
> which is used by mips only for now.

We have MODULES_VADDR and MODULES_END that are used by arm, arm64, 
loongarcg, powerpc, riscv, s390, sparc, x86_64

is_vmalloc_or_module_addr() is using MODULES_VADDR so I guess this 
function fails on mips?
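
For reference, the generic helper in mm/vmalloc.c is roughly the
following, so on an architecture that only defines MODULE_START/END the
module special case would indeed be compiled out:

int is_vmalloc_or_module_addr(const void *x)
{
	/* architectures that define MODULES_VADDR keep modules in a
	 * dedicated range outside the vmalloc area */
#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
	unsigned long addr = (unsigned long)kasan_reset_tag(x);

	if (addr >= MODULES_VADDR && addr < MODULES_END)
		return 1;
#endif
	return is_vmalloc_addr(x);
}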

> 
>>
>> The solution to this problem allows architectures to override the
>> vmalloc wrapper functions by enforcing that the rest of the kernel does
>> not reimplement __vmalloc_node by using __vmalloc_node_range with the
>> same parameters as __vmalloc_node or provides a __weak tag to those
>> functions using __vmalloc_node_range with parameters repeating those of
>> __vmalloc_node.
> 
> I'm really not too happy about overriding the functions.  Especially
> as the separation is a generally good idea and it would be good to
> move everyone (or at least all modern architectures) over to a scheme
> like this.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation
  2024-02-21  7:13   ` Christophe Leroy
@ 2024-02-21  9:27     ` David Hildenbrand
  2024-02-21 15:54       ` [External] " Maxwell Bland
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2024-02-21  9:27 UTC (permalink / raw)
  To: Christophe Leroy, Maxwell Bland, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	awheeler

On 21.02.24 08:13, Christophe Leroy wrote:
> 
> 
> On 20/02/2024 at 21:32, Maxwell Bland wrote:
>>
>> While other descriptors (e.g. pud) allow allocations conditional on
>> which virtual address is allocated, pmd descriptor allocations do not.
>> However, adding support for this is straightforward and is beneficial to
>> future kernel development targeting the PMD memory granularity.
>>
>> As many architectures already implement pmd_populate_kernel in an
>> address-generic manner, it is necessary to roll out support
>> incrementally. For this purpose a preprocessor flag,
> 
> Is it really worth it ? It is only 48 call sites that need to be
> updated. It would avoid that processor flag and avoid introducing that
> pmd_populate_kernel_at() in kernel core.

+1, let's avoid that if possible.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration
  2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
                   ` (4 preceding siblings ...)
  2024-02-21  7:32 ` [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Christophe Leroy
@ 2024-02-21 14:50 ` Conor Dooley
  2024-02-21 15:42   ` [External] " Maxwell Bland
  5 siblings, 1 reply; 17+ messages in thread
From: Conor Dooley @ 2024-02-21 14:50 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: linux-arm-kernel, gregkh, agordeev, akpm, andreyknvl, andrii,
	aneesh.kumar, aou, ardb, arnd, ast, borntraeger, bpf, brauner,
	catalin.marinas, christophe.leroy, cl, daniel, dave.hansen,
	david, dennis, dvyukov, glider, gor, guoren, haoluo, hca, hch,
	john.fastabend, jolsa, kasan-dev, kpsingh, linux-arch, linux,
	linux-efi, linux-kernel, linux-mm, linuxppc-dev, linux-riscv,
	linux-s390, lstoakes, mark.rutland, martin.lau, meted,
	michael.christie, mjguzik, mpe, mst, muchun.song, naveen.n.rao,
	npiggin, palmer, paul.walmsley, quic_nprakash, quic_pkondeti,
	rick.p.edgecombe, ryabinin.a.a, ryan.roberts, samitolvanen, sdf,
	song, surenb, svens, tj, urezki, vincenzo.frascino, will,
	wuqiang.matt, yonghong.song, zlim.lnx, awheeler


Hey Maxwell,

FYI:

>   mm/vmalloc: allow arch-specific vmalloc_node overrides
>   mm: pgalloc: support address-conditional pmd allocation

With these two, the arch/riscv/configs/* builds are broken by calls to
undeclared functions.

>   arm64: separate code and data virtual memory allocation
>   arm64: dynamic enforcement of pmd-level PXNTable

And with these two, the 32-bit and nommu builds are broken.

Cheers,
Conor.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [External] Re: [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration
  2024-02-21 14:50 ` Conor Dooley
@ 2024-02-21 15:42   ` Maxwell Bland
  0 siblings, 0 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-21 15:42 UTC (permalink / raw)
  To: Conor Dooley
  Cc: linux-arm-kernel, gregkh, agordeev, akpm, andreyknvl, andrii,
	aneesh.kumar, aou, ardb, arnd, ast, borntraeger, bpf, brauner,
	catalin.marinas, christophe.leroy, cl, daniel, dave.hansen,
	david, dennis, dvyukov, glider, gor, guoren, haoluo, hca, hch,
	john.fastabend, jolsa, kasan-dev, kpsingh, linux-arch, linux,
	linux-efi, linux-kernel, linux-mm, linuxppc-dev, linux-riscv,
	linux-s390, lstoakes, mark.rutland, martin.lau, meted,
	michael.christie, mjguzik, mpe, mst, muchun.song, naveen.n.rao,
	npiggin, palmer, paul.walmsley, quic_nprakash, quic_pkondeti,
	rick.p.edgecombe, ryabinin.a.a, ryan.roberts, samitolvanen, sdf,
	song, surenb, svens, tj, urezki, vincenzo.frascino, will,
	wuqiang.matt, yonghong.song, zlim.lnx, Andrew Wheeler

> From: Conor Dooley <conor@kernel.org>
> FYI:
> 
> >   mm/vmalloc: allow arch-specific vmalloc_node overrides
> >   mm: pgalloc: support address-conditional pmd allocation
> 
> With these two arch/riscv/configs/* are broken with calls to undeclared
> functions.

Will fix, thanks! I will also figure out how to make sure this doesn't
happen again on other architectures.

> >   arm64: separate code and data virtual memory allocation
> >   arm64: dynamic enforcement of pmd-level PXNTable
> 
> And with these two the 32-bit and nommu builds are broken.

Was not aware there was a dependency here. I will see what I can do.

Thank you,
Maxwell

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [External] Re: [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation
  2024-02-21  9:27     ` David Hildenbrand
@ 2024-02-21 15:54       ` Maxwell Bland
  0 siblings, 0 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-21 15:54 UTC (permalink / raw)
  To: David Hildenbrand, Christophe Leroy, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	Andrew Wheeler

> On February 21, 2024 3:27 AM David Hildenbrand wrote
> On 21.02.24 08:13, Christophe Leroy wrote:
> > On 20/02/2024 at 21:32, Maxwell Bland wrote:
> >>
> >> While other descriptors (e.g. pud) allow allocations conditional on
> >> which virtual address is allocated, pmd descriptor allocations do not.
> >> However, adding support for this is straightforward and is beneficial to
> >> future kernel development targeting the PMD memory granularity.
> >>
> >> As many architectures already implement pmd_populate_kernel in an
> >> address-generic manner, it is necessary to roll out support
> >> incrementally. For this purpose a preprocessor flag,
> >
> > Is it really worth it ? It is only 48 call sites that need to be
> > updated. It would avoid that processor flag and avoid introducing that
> > pmd_populate_kernel_at() in kernel core.
> 
> +1, let's avoid that if possible.

Will fix, thank you!

Maxwell

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides
  2024-02-21  6:59   ` Christophe Leroy
@ 2024-02-21 17:19     ` Maxwell Bland
  0 siblings, 0 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-21 17:19 UTC (permalink / raw)
  To: Christophe Leroy, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	Andrew Wheeler

> On Wednesday, February 21, 2024 12:59 AM, Christophe Leroy wrote:
> 
> In the code you add __weak for that. But you also add the flags to the
> parameters and I can't understand why when reading the above description.

This change was made to let most kernel interfaces use vmalloc_node and
enable the overrides to work. It also reduces the number of kernel locations
that would need to be changed if the vmalloc_node_range interface ever
changed.

However, there is pushback against overriding the vmalloc interface, so this
change will likely not show up in my final patch.

Regards,
Maxwell


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration
  2024-02-21  7:32 ` [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Christophe Leroy
@ 2024-02-21 17:57   ` Maxwell Bland
  0 siblings, 0 replies; 17+ messages in thread
From: Maxwell Bland @ 2024-02-21 17:57 UTC (permalink / raw)
  To: Christophe Leroy, linux-arm-kernel
  Cc: gregkh, agordeev, akpm, andreyknvl, andrii, aneesh.kumar, aou,
	ardb, arnd, ast, borntraeger, bpf, brauner, catalin.marinas, cl,
	daniel, dave.hansen, david, dennis, dvyukov, glider, gor, guoren,
	haoluo, hca, hch, john.fastabend, jolsa, kasan-dev, kpsingh,
	linux-arch, linux, linux-efi, linux-kernel, linux-mm,
	linuxppc-dev, linux-riscv, linux-s390, lstoakes, mark.rutland,
	martin.lau, meted, michael.christie, mjguzik, mpe, mst,
	muchun.song, naveen.n.rao, npiggin, palmer, paul.walmsley,
	quic_nprakash, quic_pkondeti, rick.p.edgecombe, ryabinin.a.a,
	ryan.roberts, samitolvanen, sdf, song, surenb, svens, tj, urezki,
	vincenzo.frascino, will, wuqiang.matt, yonghong.song, zlim.lnx,
	Andrew Wheeler

> On Wednesday, February 21, 2024 at 1:32 AM, Christophe Leroy wrote:
> 
> On powerpc (book3s/32) we have more or less the same although it is not
> directly linked to PMDs: the virtual 4G address space is split into
> segments of 256M. On each segment there's a bit called NX to forbid
> execution. Vmalloc space is allocated in a segment with the NX bit set while
> module space is allocated in a segment with the NX bit unset. We never have
> to override vmalloc wrappers. All consumers of exec memory allocate it
> using module_alloc() while vmalloc() provides non-exec memory.
> 
> For modules, all you have to do is select
> ARCH_WANTS_MODULES_DATA_IN_VMALLOC and module data will be allocated
> using vmalloc() hence non-exec memory in our case.

This critique has led me to some valuable ideas, and I can definitely find a simpler
approach without overrides.

I do want to mention that changes to how the VMALLOC_* and MODULE_* constants
are used on arm64 may introduce other issues. See the discussion and code on
the patch that motivated this series at:

https://lore.kernel.org/all/CAP5Mv+ydhk=Ob4b40ZahGMgT-5+-VEHxtmA=-LkJiEOOU+K6hw@mail.gmail.com/

In short, the issue of code/data intermixing may require a rework of arm64's
memory infrastructure, but I see a potentially elegant solution here based on
the comments given on this patch.

Thanks,
Maxwell


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-02-21 17:57 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-20 20:32 [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Maxwell Bland
2024-02-20 20:32 ` [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides Maxwell Bland
2024-02-21  5:43   ` Christoph Hellwig
2024-02-21  7:38     ` Christophe Leroy
2024-02-21  6:59   ` Christophe Leroy
2024-02-21 17:19     ` Maxwell Bland
2024-02-20 20:32 ` [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation Maxwell Bland
2024-02-21  7:13   ` Christophe Leroy
2024-02-21  9:27     ` David Hildenbrand
2024-02-21 15:54       ` [External] " Maxwell Bland
2024-02-20 20:32 ` [PATCH 3/4] arm64: separate code and data virtual memory allocation Maxwell Bland
2024-02-21  7:20   ` Christophe Leroy
2024-02-20 20:32 ` [PATCH 4/4] arm64: dynamic enforcement of pmd-level PXNTable Maxwell Bland
2024-02-21  7:32 ` [PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration Christophe Leroy
2024-02-21 17:57   ` Maxwell Bland
2024-02-21 14:50 ` Conor Dooley
2024-02-21 15:42   ` [External] " Maxwell Bland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).