[PATCH] mm: introduce reference pages

From: Peter Collingbourne @ 2020-07-31 20:32 UTC
  To: Andrew Morton, Catalin Marinas
  Cc: linux-mm, Peter Collingbourne, linux-arm-kernel, Evgenii Stepanov

Introduce a new mmap flag, MAP_REFPAGE, that creates a mapping similar
to an anonymous mapping, but with clean pages backed by a so-called
reference page rather than by the zero page. The address of the
reference page is specified via the offset argument to mmap. Loads
from the mapping read directly from the reference page, and the first
store to each page copies-on-write from the reference page.
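
To illustrate the intended usage, here is a minimal userspace sketch
(not part of the patch; error handling is omitted, and the
MAP_REFPAGE value is the one proposed below):

  #include <string.h>
  #include <sys/mman.h>

  #define MAP_REFPAGE 0x200000 /* value proposed in this patch */

  void *alloc_pattern_filled(size_t size)
  {
          /* Create the reference page: an ordinary anonymous page
           * filled with the desired pattern byte. */
          char *refpage = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          memset(refpage, 0xaa, 4096);

          /* Map the allocation backed by the reference page; no fd
           * is needed, and the offset argument carries the
           * page-aligned address of the reference page. Loads from
           * untouched pages return 0xaa without allocating memory;
           * the first store to a page copies-on-write from refpage. */
          return mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_REFPAGE, -1,
                      (off_t)(unsigned long)refpage);
  }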

Reference pages are useful in circumstances where anonymous mappings
combined with manual stores to memory would impose undesirable costs,
either in terms of performance or RSS. Use cases are focused on heap
allocators and include:

- Pattern initialization for the heap. This is where malloc(3) gives
  you memory whose contents are filled with a non-zero pattern
  byte, in order to help detect and mitigate bugs involving use
  of uninitialized memory. Typically this is implemented by having
  the allocator memset the allocation with the pattern byte before
  returning it to the user, but for large allocations this can result
  in a significant increase in RSS, especially for allocations that
  are used sparsely. Even for dense allocations there is a needless
  up-front cost to startup performance that may be better amortized
  throughout the program's lifetime. By creating allocations using a
  reference page filled with the pattern byte (as in the sketch
  above), we can avoid these costs.

- Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
  feature which allows for memory to be tagged in order to detect
  certain kinds of memory errors with low overhead. In order to set
  up an allocation to allow memory errors to be detected, the entire
  allocation needs to have the same tag. The issue here is similar to
  pattern initialization in the sense that large tagged allocations
  will be expensive if the tagging is done up front. The idea is that
  the allocator would create reference pages with each of the possible
  memory tags, and use those reference pages for the large allocations
  (see the sketch after this list).
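
To make the tagging use case concrete, here is a rough sketch of the
tag-selection idea. This is hypothetical code: set_page_tag() stands
in for the MTE tag-store instructions used to pre-tag each page at
startup, and PROT_MTE-style mapping details are elided:

  #include <sys/mman.h>

  #define MAP_REFPAGE 0x200000 /* value proposed in this patch */
  #define NUM_TAGS 16          /* MTE provides 16 possible tags */

  /* One reference page per tag, each pre-tagged once at startup
   * with the hypothetical set_page_tag() helper. */
  static char *tag_refpage[NUM_TAGS];

  /* Map a large allocation whose untouched pages already carry the
   * requested tag, without writing every page up front. */
  static void *alloc_large_tagged(size_t size, unsigned int tag)
  {
          return mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_REFPAGE, -1,
                      (off_t)(unsigned long)tag_refpage[tag]);
  }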

In order to measure the performance and RSS impact of reference pages,
a version of this patch backported to kernel version 4.14 was tested on
a Pixel 4 together with a modified [2] version of the Scudo allocator
that uses reference pages to implement pattern initialization. A
PDFium test program was used to collect the measurements like so:

$ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
$ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf

and the median of 100 runs was taken with three variants of the
allocator:

- "anon" is the baseline (no pattern init)
- "memset" is with pattern init of allocator pages implemented by
  initializing anonymous pages with memset
- "refpage" is with pattern init of allocator pages implemented
  by creating reference pages

All three variants are measured using the allocator change linked at
[2]: "anon" is without the change, "refpage" is with it, and "memset"
is with it but with "#if 0" in place of "#if 1" in linux.cpp. The
measurements are as follows:

          Real time (s)    Max RSS (KiB)
anon        2.237081         107088
memset      2.252241         112180
refpage     2.251220         103504

We can see that the real time for refpage is about the same as, or
perhaps slightly faster than, memset. At this point it is unclear
where the performance discrepancy between anon and refpage comes from;
the Pixel 4 kernel has transparent hugepages disabled, so that cannot
be the cause.

I wouldn't trust the RSS number for reference pages: with a test
program that uses an anonymous page as a reference page, I saw the
following output in dmesg:

[75768.572560] BUG: Bad rss-counter state mm:00000000f1cdec59 idx:1 val:-2
[75768.572577] BUG: Bad rss-counter state mm:00000000f1cdec59 idx:3 val:2

indicating that I might not have implemented RSS accounting for
reference pages correctly. That said, we can see straight away an RSS
impact of about 5% for memset versus anon. Assuming that accounting
for anonymous pages has been implemented correctly, we can expect the
true RSS number for refpage to be similar to that measured for anon.

As an alternative to extending mmap(2), I considered using
userfaultfd to implement reference pages. However, after taking a
detailed look at the interface, it does not seem suitable for use in
a general purpose allocator. For example, UFFD_FEATURE_EVENT_FORK
support would be required in order to correctly support fork(2) in a
process that uses the allocator: although POSIX does not guarantee
support for allocating after fork, many allocators including Scudo
support it, and nothing stops the forked process from page faulting
pre-existing allocations after forking anyway. But
UFFD_FEATURE_EVENT_FORK has been restricted to root by commit
3c1c24d91ffd ("userfaultfd: require CAP_SYS_PTRACE for
UFFD_FEATURE_EVENT_FORK"), making it unsuitable for use in an
allocator. Furthermore, even if the interface issues are resolved, I
suspect (but have not measured) that the cost of the multiple context
switches between kernel and userspace on each fault would be too high
for an allocator anyway.
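
For comparison, this is roughly the fault-servicing loop a
userfaultfd-based implementation would need. A minimal sketch,
assuming uffd has already been created and registered over the region
with UFFDIO_REGISTER_MODE_MISSING, with error handling omitted:

  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  /* Handler thread: every first touch of a clean page blocks the
   * faulting thread, wakes this thread, and requires an ioctl to
   * resolve -- several kernel/user transitions per fault, versus
   * none beyond the fault itself for MAP_REFPAGE. */
  static void service_faults(int uffd, char *pattern_page,
                             unsigned long page_size)
  {
          for (;;) {
                  struct uffd_msg msg;

                  if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                          continue;
                  if (msg.event != UFFD_EVENT_PAGEFAULT)
                          continue;

                  struct uffdio_copy copy = {
                          .dst = msg.arg.pagefault.address &
                                 ~(__u64)(page_size - 1),
                          .src = (unsigned long)pattern_page,
                          .len = page_size,
                  };
                  /* Copies the pattern page in and wakes the faulter. */
                  ioctl(uffd, UFFDIO_COPY, &copy);
          }
  }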

There are unresolved issues with this patch:

- We need to decide on the semantics associated with remapping or
  unmapping the reference page. As currently implemented, the page is
  looked up by address on each page fault, and a segfault ensues if the
  address is not mapped. It may be better to have the mmap(2) call take
  a reference to the page (failing if not mapped) and the underlying
  vma so that future remappings or unmappings have no effect.

- I have not yet looked at interaction with transparent hugepages.

- We probably need to restrict which kinds of pages are supported as
  reference pages (probably only anonymous and file-backed pages). This
  is somewhat tied to the remapping semantics as we would need
  to decide what happens if a supported page is replaced with an
  unsupported page.

- Finally, the accounting issues as previously mentioned.

However, I am sending this first version of the patch in order to get
early feedback on the idea and whether it is suitable to be added to
the kernel.

[1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
[2] https://github.com/pcc/llvm-project/commit/a05f88aaebc7daf262d6885444d9845052026f4b

Signed-off-by: Peter Collingbourne <pcc@google.com>
---
 arch/mips/kernel/vdso.c                |  2 +-
 include/linux/mm.h                     |  2 +-
 include/uapi/asm-generic/mman-common.h |  1 +
 mm/mmap.c                              | 46 +++++++++++++++++++++++---
 4 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 242dc5e83847..403c00cc1ac3 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -101,7 +101,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		/* Map delay slot emulation page */
 		base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 				VM_READ | VM_EXEC |
-				VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
+				VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC, 0,
 				0, NULL);
 		if (IS_ERR_VALUE(base)) {
 			ret = base;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 256e1bc83460..3b3efa2e3283 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2576,7 +2576,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf);
+	unsigned long refpage, struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..f57552dcf99a 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,7 @@
 #define MAP_HUGETLB		0x040000	/* create a huge page mapping */
 #define MAP_SYNC		0x080000 /* perform synchronous page faults for the mapping */
 #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_REFPAGE		0x200000	/* use the offset argument as a pointer to a reference page */
 
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
diff --git a/mm/mmap.c b/mm/mmap.c
index d43cc3b0187c..d74d0963d460 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -47,6 +47,7 @@
 #include <linux/pkeys.h>
 #include <linux/oom.h>
 #include <linux/sched/mm.h>
+#include <linux/compat.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1371,6 +1372,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	vm_flags_t vm_flags;
 	int pkey = 0;
+	unsigned long refpage = 0;
 
 	*populate = 0;
 
@@ -1441,6 +1443,16 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (mlock_future_check(mm, vm_flags, len))
 		return -EAGAIN;
 
+	if (flags & MAP_REFPAGE) {
+		refpage = pgoff << PAGE_SHIFT;
+		if (in_compat_syscall()) {
+			/* The offset argument may have been sign extended at some
+			 * point, so we need to mask out the high bits.
+			 */
+			refpage &= 0xffffffff;
+		}
+	}
+
 	if (file) {
 		struct inode *inode = file_inode(file);
 		unsigned long flags_mask;
@@ -1541,8 +1553,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		if (file && is_file_hugepages(file))
 			vm_flags |= VM_NORESERVE;
 	}
-
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, refpage, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1557,7 +1568,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 	struct file *file = NULL;
 	unsigned long retval;
 
-	if (!(flags & MAP_ANONYMOUS)) {
+	if (!(flags & (MAP_ANONYMOUS | MAP_REFPAGE))) {
 		audit_mmap_fd(fd, flags);
 		file = fget(fd);
 		if (!file)
@@ -1684,9 +1695,33 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
+static vm_fault_t refpage_fault(struct vm_fault *vmf)
+{
+	struct page *page;
+
+	if (get_user_pages((unsigned long)vmf->vma->vm_private_data, 1, 0,
+			   &page, 0) != 1)
+		return VM_FAULT_SIGSEGV;
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static void refpage_close(struct vm_area_struct *vma)
+{
+	/* This function exists only to prevent is_mergeable_vma from allowing a
+	 * reference page mapping to be merged with an anonymous mapping.
+	 */
+}
+
+const struct vm_operations_struct refpage_vmops = {
+	.fault = refpage_fault,
+	.close = refpage_close,
+};
+
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		unsigned long refpage, struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1788,6 +1823,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		error = shmem_zero_setup(vma);
 		if (error)
 			goto free_vma;
+	} else if (refpage) {
+		vma->vm_ops = &refpage_vmops;
+		vma->vm_private_data = (void *)refpage;
 	} else {
 		vma_set_anonymous(vma);
 	}
-- 
2.28.0.163.g6104cc2f0b6-goog

