From: "Jason A. Donenfeld" <Jason@zx2c4.com>
To: linux-kernel@vger.kernel.org, patches@lists.linux.dev,
	tglx@linutronix.de
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>,
	linux-crypto@vger.kernel.org, linux-api@vger.kernel.org,
	x86@kernel.org, Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>,
	Carlos O'Donell <carlos@redhat.com>,
	Florian Weimer <fweimer@redhat.com>,
	Arnd Bergmann <arnd@arndb.de>, Jann Horn <jannh@google.com>,
	Christian Brauner <brauner@kernel.org>,
	linux-mm@kvack.org
Subject: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings
Date: Sun,  1 Jan 2023 17:29:05 +0100
Message-ID: <20230101162910.710293-3-Jason@zx2c4.com>
In-Reply-To: <20230101162910.710293-1-Jason@zx2c4.com>

The vDSO getrandom() implementation works with a buffer, allocated by a
new system call, that has certain requirements:

- It shouldn't be written to core dumps.
  * Easy: VM_DONTDUMP.
- It should be zeroed on fork.
  * Easy: VM_WIPEONFORK.

- It shouldn't be written to swap.
  * Uh-oh: mlock is rlimited.
  * Uh-oh: mlock isn't inherited by forks.

- It shouldn't reserve actual memory, but it also shouldn't crash when
  page faulting in memory if none is available.
  * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
  * Uh-oh: VM_NORESERVE means segfaults.
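
For reference, a minimal userspace sketch of how far the existing
interfaces get. It covers the two easy requirements with madvise() and
notes where the others run into trouble. The helper name
alloc_state_sketch() and its missing_hints counter are hypothetical and
not part of this series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous buffer and apply the hints that
 * already exist. Returns NULL only if mmap() itself fails; hints that the
 * running kernel or libc lacks are counted in *missing_hints.
 */
static void *alloc_state_sketch(size_t len, int *missing_hints)
{
	void *p;

	assert(len > 0 && missing_hints != NULL);
	*missing_hints = 0;

	/*
	 * With vm.overcommit_memory=2 the kernel still accounts this mapping
	 * despite MAP_NORESERVE, so this does not solve the reservation issue.
	 */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* Easy: keep the buffer out of core dumps (sets VM_DONTDUMP). */
	if (madvise(p, len, MADV_DONTDUMP) != 0)
		(*missing_hints)++;

#ifdef MADV_WIPEONFORK
	/* Easy: a child sees zeroes after fork() (sets VM_WIPEONFORK). */
	if (madvise(p, len, MADV_WIPEONFORK) != 0)
		(*missing_hints)++;
#else
	(*missing_hints)++;
#endif

	/*
	 * Deliberately not done: mlock(p, len) would keep the pages out of
	 * swap, but it is limited by RLIMIT_MEMLOCK and is not inherited by
	 * forks.
	 */
	return p;
}
```

Nothing in this sketch prevents the pages from reaching swap without
mlock, which is the gap the rest of this message is about.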

It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:

1) Due to being wiped during fork(), the vDSO code is already robust to
   having the contents of the pages it reads zeroed out midway through
   the function's execution.

2) In the absolute worst case of whatever contingency we're coding for,
   we have the option to fall back to the getrandom() syscall, and
   everything is fine.

3) The buffers the function uses are only ever useful for a maximum of
   60 seconds -- a sort of cache, rather than a long term allocation.
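
A rough userspace sketch of characteristic 2, under stated assumptions:
the struct layout and the names rng_state_sketch and fast_random_sketch()
are hypothetical and are not the vDSO code from this series. It only
models state that is found zeroed on entry; it does not model pages being
zeroed midway through the function, which the real implementation has to
handle separately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/random.h>
#include <sys/types.h>

/* Hypothetical state living in a buffer that the kernel may zero. */
struct rng_state_sketch {
	unsigned char pool[64];	/* cached random bytes */
	size_t avail;		/* unused bytes left in pool; 0 after a wipe */
};

static ssize_t fast_random_sketch(struct rng_state_sketch *st,
				  void *out, size_t len)
{
	assert(st != NULL && out != NULL);

	/* Fast path: serve from the cache if it holds enough bytes. */
	if (len > 0 && len <= st->avail && st->avail <= sizeof(st->pool)) {
		st->avail -= len;
		memcpy(out, st->pool + st->avail, len);
		/* Erase what was handed out so it is never served twice. */
		memset(st->pool + st->avail, 0, len);
		return (ssize_t)len;
	}

	/* A wiped state has avail == 0, so we land on the syscall. */
	return getrandom(out, len, 0);
}
```

Because a zeroed state is indistinguishable from an empty cache here, a
wipe costs one syscall rather than correctness.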

These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:

a) It is never written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
   zero when read back again).
c) If there's not enough memory to service a page fault, it's not fatal,
   and no signal is sent. Instead, writes are simply lost.
d) It is inherited by fork.
e) It doesn't count against the mlock budget, since nothing is locked.

This is fairly simple to implement, with the one snag that we have to
use 64-bit VM_* flags, but this shouldn't be a problem, since the only
consumers will probably be 64-bit anyway.
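
To make the 64-bit snag concrete: the flag lands on bit 37
(VM_HIGH_ARCH_BIT_5 in this patch), which cannot be represented in a
32-bit flags word. The sketch below models that in userspace; the
function names are hypothetical, and tying the zero fallback to word size
is a simplification of the CONFIG_NEED_VM_DROPPABLE switch used in the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VM_DROPPABLE_BIT_MODEL 37	/* VM_HIGH_ARCH_BIT_5 in this patch */

/* Can a flag at position 'bit' live in a flags word of 'word_bits' bits? */
static bool flag_bit_fits_model(unsigned int bit, unsigned int word_bits)
{
	assert(word_bits == 32 || word_bits == 64);
	return bit < word_bits;
}

/*
 * Model of the fallback: where the bit is unavailable the flag is defined
 * as 0, so every "flags & VM_DROPPABLE" test is simply false.
 */
static uint64_t vm_droppable_model(unsigned int word_bits)
{
	if (!flag_bit_fits_model(VM_DROPPABLE_BIT_MODEL, word_bits))
		return 0;
	return UINT64_C(1) << VM_DROPPABLE_BIT_MODEL;
}
```

Defining the flag as 0 keeps the call sites free of #ifdefs, at the cost
of the feature being silently absent there.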

This way, allocations used by vDSO getrandom() can use:

    VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE

And there will be no problem with OOMing, crashing on overcommitment,
using memory when not in use, not wiping on fork(), coredumps, or
writing out to swap.

At the moment, rather than skipping writes on OOM, the fault handler
just returns to userspace, and the instruction is retried. This isn't
terrible, but it's not quite what is intended. The actual instruction
skipping has to be implemented arch-by-arch, but so does this whole
vDSO series, so that's fine. The following commit addresses it for x86.
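
The retry behaviour follows from how the OOM bit is masked. Below is a
small userspace model of that masking; the _MODEL constants and
mask_droppable_oom_model() are illustrative placeholders rather than the
kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder fault result bits for this model only. */
#define VM_FAULT_OOM_MODEL	0x1u
#define VM_FAULT_SIGBUS_MODEL	0x2u

/*
 * Clear the OOM bit for droppable mappings and leave every other result
 * bit alone. With the bit cleared the fault looks handled even though no
 * page was installed, so the faulting instruction runs again.
 */
static unsigned int mask_droppable_oom_model(unsigned int ret, bool droppable)
{
	if (droppable)
		ret &= ~VM_FAULT_OOM_MODEL;
	return ret;
}
```

Other failure bits still propagate, so only the out-of-memory case is
softened.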

Cc: linux-mm@kvack.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 fs/proc/task_mmu.c             | 3 +++
 include/linux/mm.h             | 8 ++++++++
 include/trace/events/mmflags.h | 7 +++++++
 mm/Kconfig                     | 3 +++
 mm/memory.c                    | 4 ++++
 mm/mempolicy.c                 | 3 +++
 mm/mprotect.c                  | 2 +-
 mm/rmap.c                      | 5 +++--
 8 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..47c7c046f2be 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_NEED_VM_DROPPABLE
+		[ilog2(VM_DROPPABLE)]	= "dp",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..fba3f1e8616b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,11 +315,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -335,6 +337,12 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define VM_DROPPABLE VM_HIGH_ARCH_5
+#else
+# define VM_DROPPABLE 0
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..82b2fb811d06 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -163,6 +163,12 @@ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
 # define IF_HAVE_UFFD_MINOR(flag, name)
 #endif
 
+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
+#else
+# define IF_HAVE_VM_DROPPABLE(flag, name)
+#endif
+
 #define __def_vmaflag_names						\
 	{VM_READ,			"read"		},		\
 	{VM_WRITE,			"write"		},		\
@@ -195,6 +201,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
+IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,	"droppable"	)		\
 	{VM_MERGEABLE,			"mergeable"	}		\
 
 #define show_vma_flags(flags)						\
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..91fd0be96ca4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1030,6 +1030,9 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config NEED_VM_DROPPABLE
+	select ARCH_USES_HIGH_VMA_FLAGS
+	bool
 
 config ARCH_USES_PG_ARCH_X
 	bool
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..1ade407ccbf9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5220,6 +5220,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 
 	lru_gen_exit_fault();
 
+	/* If the mapping is droppable, then errors due to OOM aren't fatal. */
+	if (vma->vm_flags & VM_DROPPABLE)
+		ret &= ~VM_FAULT_OOM;
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..ebf2e3694a0a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2173,6 +2173,9 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 	int preferred_nid;
 	nodemask_t *nmask;
 
+	if (vma->vm_flags & VM_DROPPABLE)
+		gfp |= __GFP_NOWARN | __GFP_NORETRY;
+
 	pol = get_vma_policy(vma, addr);
 
 	if (pol->mode == MPOL_INTERLEAVE) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 908df12caa26..a679cc5d1c75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -593,7 +593,7 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				may_expand_vm(mm, oldflags, nrpages))
 			return -ENOMEM;
 		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
-						VM_SHARED|VM_NORESERVE))) {
+				  VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
 			charged = nrpages;
 			if (security_vm_enough_memory_mm(mm, charged))
 				return -ENOMEM;
diff --git a/mm/rmap.c b/mm/rmap.c
index b616870a09be..5ed46e59dfcd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1294,7 +1294,8 @@ void page_add_new_anon_rmap(struct page *page,
 	int nr;
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-	__SetPageSwapBacked(page);
+	if (!(vma->vm_flags & VM_DROPPABLE))
+		__SetPageSwapBacked(page);
 
 	if (likely(!PageCompound(page))) {
 		/* increment count (starts at -1) */
@@ -1683,7 +1684,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * plus the rmap(s) (dropped by discard:).
 				 */
 				if (ref_count == 1 + map_count &&
-				    !folio_test_dirty(folio)) {
+				    (!folio_test_dirty(folio) || (vma->vm_flags & VM_DROPPABLE))) {
 					/* Invalidate as we cleared the pte */
 					mmu_notifier_invalidate_range(mm,
 						address, address + PAGE_SIZE);
-- 
2.39.0


