[RFC]: mm,power: introduce MADV_WIPEONSUSPEND

From: Catangiu, Adrian Costin @ 2020-07-03 10:34 UTC
  To: linux-mm, linux-pm, virtualization, linux-api
  Cc: akpm, rjw, len.brown, pavel, mhocko, fweimer, keescook, luto,
	wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

Cryptographic libraries carry pseudo-random number generators to
quickly provide randomness when needed. If such a random pool gets
cloned, secrets may be revealed, because the same random numbers may
be used more than once. For fork, this was fixed using the WIPEONFORK
madvise flag [1].

Unfortunately, the same problem surfaces when a virtual machine gets
cloned. The existing flag does not help there. This patch introduces a
new flag to automatically clear memory contents on VM suspend/resume,
which will allow random number generators to reseed when virtual
machines get cloned.

Examples of this are:
 - PKCS#11 API reinitialization check (mandated by specification)
 - glibc's upcoming PRNG (reseed after wake)
 - OpenSSL PRNG (reseed after wake)

Benefits fall into two areas:
 - The security benefits of a cloned virtual machine having a
   re-initialized PRNG in every process are straightforward.
   Without reinitialization, two or more cloned VMs could produce
   identical random numbers, which are often used to generate secure
   keys.
 - The flag provides a simple mechanism to avoid RAM exfiltration
   during traditional sleep/hibernate on a laptop or desktop, when
   memory, and thus secrets, are vulnerable to offline tampering or
   inspection.

This RFC is foremost aimed at defining a userspace interface that lets
applications and libraries which store or cache sensitive information
know that they need to regenerate it after process memory has been
exposed to potential copying. The proposed userspace interface is
a new MADV_WIPEONSUSPEND 'madvise()' flag used to mark pages which
contain such data. This newly added flag would only be available on
64-bit architectures, since we've run out of 32-bit VMA flags.
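
As a rough illustration of the proposed interface, a library might mark
a buffer as sketched below. This is only a sketch under assumptions:
alloc_wipe_on_suspend() is a hypothetical helper, and the value 20 for
MADV_WIPEONSUSPEND is the one proposed in this RFC, not a mainline
constant. Mainline Linux since 5.4 uses 20 for MADV_COLD, so on an
unpatched kernel the call may succeed without providing any wipe
semantics.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * ASSUMPTION: value proposed by this RFC, not a mainline constant.
 * Mainline Linux >= 5.4 assigns 20 to MADV_COLD.
 */
#ifndef MADV_WIPEONSUSPEND
#define MADV_WIPEONSUSPEND 20
#endif

/*
 * Hypothetical helper: map `len` bytes of private anonymous memory and
 * ask the kernel to wipe it on suspend.
 *
 * Returns the mapping, or NULL if mmap() fails. *marked is set to 1 if
 * madvise() accepted the advice and to 0 otherwise; in both cases the
 * buffer itself remains usable.
 */
static void *alloc_wipe_on_suspend(size_t len, int *marked)
{
	void *p;

	assert(len > 0 && marked != NULL);

	/* The RFC only accepts private, anonymous mappings. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	*marked = (madvise(p, len, MADV_WIPEONSUSPEND) == 0);
	return p;
}
```

A caller that gets *marked == 0 would have to fall back to some other
way of detecting a clone or resume.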

The mechanism through which the kernel marks the application's
sensitive data as potentially copied is a secondary objective of this
RFC. In the current PoC proposal, the RFC kernel code combines
MADV_WIPEONSUSPEND semantics with ACPI suspend/wake transitions to zero
out all process pages that fall in VMAs marked as MADV_WIPEONSUSPEND,
and thus allows applications and libraries to be notified and to
regenerate their sensitive data. Marking VMAs as MADV_WIPEONSUSPEND
results in the VMAs being empty in the process after any suspend/wake
cycle. Similar to MADV_WIPEONFORK, if the process accesses memory that
was wiped on suspend, it will get zeroes. The address ranges are still
valid; they are just empty.
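
Because wiped memory simply reads back as zeroes, a library can detect
the wipe with a flag stored in the same region as the secret. The
following is a minimal sketch of that idea, not taken from any existing
library: struct prng_state and both helpers are hypothetical names, and
the state is assumed to live in a MADV_WIPEONSUSPEND mapping.

```c
#include <stdint.h>
#include <string.h>
#include <sys/random.h>

/*
 * Hypothetical PRNG state, assumed to live in a page marked with
 * MADV_WIPEONSUSPEND. After a wipe every field reads as zero.
 */
struct prng_state {
	uint64_t seeded;	/* nonzero once key[] holds fresh entropy */
	uint8_t key[32];
};

/* A wiped (all-zero) state has seeded == 0. */
static int prng_needs_reseed(const struct prng_state *s)
{
	return s->seeded == 0;
}

/*
 * Reseed from the kernel if the state was wiped or never seeded.
 * Returns 1 if a reseed happened, 0 if the state was already seeded,
 * and -1 if getrandom() failed or returned short.
 */
static int prng_ensure_seeded(struct prng_state *s)
{
	if (!prng_needs_reseed(s))
		return 0;

	if (getrandom(s->key, sizeof(s->key), 0) != (ssize_t)sizeof(s->key))
		return -1;

	s->seeded = 1;
	return 1;
}
```

Note that this check is not atomic with respect to a suspend: a wipe
that lands between prng_ensure_seeded() and the use of key[] would go
unnoticed by this sketch.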

This patch adds logic to the kernel power code to zero out the contents
of all MADV_WIPEONSUSPEND VMAs present in the system during its
transition to any suspend state equal to or deeper than
suspend-to-memory, known as S3.
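
The shape of that pass can be modeled in userspace as below. This is a
simplified model, not the kernel code: struct model_region is a
hypothetical stand-in for a VMA, memset() stands in for
get_user_pages_remote() plus clear_page(), and the state values are
assumed to mirror suspend_state_t (PM_SUSPEND_MEM == 3).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE	4096UL
#define MODEL_BATCH	32UL	/* mirrors the 32-entry pages[] array */

/* ASSUMPTION: values mirror suspend_state_t in linux/suspend.h. */
enum model_suspend_state {
	MODEL_SUSPEND_TO_IDLE = 1,
	MODEL_SUSPEND_STANDBY = 2,
	MODEL_SUSPEND_MEM = 3,	/* S3 */
};

/* Hypothetical stand-in for a VMA: a buffer plus a wipe flag. */
struct model_region {
	unsigned char *start;
	size_t len;		/* must be a multiple of MODEL_PAGE_SIZE */
	int wipe_on_suspend;
	struct model_region *next;
};

/* Zero all flagged regions in batches; returns the pages cleared. */
static size_t model_wipe_on_suspend(struct model_region *head, int state)
{
	struct model_region *r;
	size_t cleared = 0;

	/* Only states equal to or deeper than S3 trigger a wipe. */
	if (state < MODEL_SUSPEND_MEM)
		return 0;

	for (r = head; r; r = r->next) {
		unsigned char *addr = r->start;
		size_t nr_pages = r->len / MODEL_PAGE_SIZE;

		if (!r->wipe_on_suspend)
			continue;
		assert(r->len % MODEL_PAGE_SIZE == 0);

		while (nr_pages) {
			size_t count = nr_pages < MODEL_BATCH ?
				       nr_pages : MODEL_BATCH;

			memset(addr, 0, count * MODEL_PAGE_SIZE);
			addr += count * MODEL_PAGE_SIZE;
			nr_pages -= count;
			cleared += count;
		}
	}
	return cleared;
}
```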

MADV_WIPEONSUSPEND only works on private, anonymous mappings.
The patch also adds MADV_KEEPONSUSPEND, to undo the effects of a
prior MADV_WIPEONSUSPEND for a VMA.
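
The flag rules above can be summarized with a small userspace model.
This is a sketch only: struct model_vma and model_madvise() are
hypothetical, the MADV values and bit 37 are the ones used in this
RFC's diff rather than mainline constants, and the 64-bit flag word
shows why the bit does not fit on 32-bit architectures.

```c
#include <errno.h>
#include <stdint.h>

/* ASSUMPTION: values taken from this RFC's diff, not from mainline. */
#define RFC_MADV_WIPEONSUSPEND	20
#define RFC_MADV_KEEPONSUSPEND	21
#define MODEL_VM_SHARED		(UINT64_C(1) << 3)	/* VM_SHARED, 0x8 */
#define MODEL_VM_WIPEONSUSPEND	(UINT64_C(1) << 37)	/* needs 64 bits */

/* Hypothetical, minimal stand-in for a VMA. */
struct model_vma {
	uint64_t vm_flags;
	int file_backed;	/* nonzero models vma->vm_file != NULL */
};

/* Returns 0 on success or -EINVAL, leaving flags untouched on error. */
static int model_madvise(struct model_vma *vma, int behavior)
{
	switch (behavior) {
	case RFC_MADV_WIPEONSUSPEND:
		/* Only private, anonymous mappings are accepted. */
		if (vma->file_backed || (vma->vm_flags & MODEL_VM_SHARED))
			return -EINVAL;
		vma->vm_flags |= MODEL_VM_WIPEONSUSPEND;
		return 0;
	case RFC_MADV_KEEPONSUSPEND:
		/* Undo a prior MADV_WIPEONSUSPEND. */
		vma->vm_flags &= ~MODEL_VM_WIPEONSUSPEND;
		return 0;
	default:
		return -EINVAL;
	}
}
```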

Hypervisors can issue ACPI S0->S3 and S3->S0 events to leverage this
functionality in a virtualized environment.

Alternative kernel implementation ideas:
 - Move the code that clears MADV_WIPEONSUSPEND pages to a virtual
   device driver that registers itself for ACPI events.
 - Add the prerequisite that MADV_WIPEONSUSPEND pages must be pinned
   (so no faulting happens) and clear them in a custom/roll-your-own
   device driver in an NMI handler. This could work in a virtualized
   environment where the hypervisor pauses all other vCPUs before
   injecting the NMI.

[1] https://lore.kernel.org/lkml/20170811212829.29186-1-riel@redhat.com/

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
---
 include/linux/mm.h                     |  2 +
 include/uapi/asm-generic/mman-common.h |  3 +
 kernel/power/suspend.c                 | 82 ++++++++++++++++++++++++++
 mm/madvise.c                           | 17 ++++++
 4 files changed, 104 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..939eb80fabbb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -298,11 +298,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_WIPEONSUSPEND	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 63b1f506ea67..e527d9090842 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -67,6 +67,9 @@
 #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
 
+#define MADV_WIPEONSUSPEND 20		/* Zero memory on system suspend */
+#define MADV_KEEPONSUSPEND 21		/* Undo MADV_WIPEONSUSPEND */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index c874a7026e24..4282b7f0dd03 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -323,6 +323,78 @@ static bool platform_suspend_again(suspend_state_t state)
 		suspend_ops->suspend_again() : false;
 }
 
+#ifdef VM_WIPEONSUSPEND
+static void memory_cleanup_on_suspend(suspend_state_t state)
+{
+	struct task_struct *p;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct page *pages[32];
+	unsigned long max_pages_per_loop = ARRAY_SIZE(pages);
+
+	/* Only care about states >= S3 */
+	if (state < PM_SUSPEND_MEM)
+		return;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		int gup_flags = FOLL_WRITE;
+
+		mm = p->mm;
+		if (!mm)
+			continue;
+
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			unsigned long addr, nr_pages;
+
+			if (!(vma->vm_flags & VM_WIPEONSUSPEND))
+				continue;
+
+			addr = vma->vm_start;
+			nr_pages = (vma->vm_end - addr - 1) / PAGE_SIZE + 1;
+			while (nr_pages) {
+				int count = min(nr_pages, max_pages_per_loop);
+				void *kaddr;
+
+				count = get_user_pages_remote(p, mm, addr,
+							count, gup_flags,
+							pages, NULL, NULL);
+				if (count <= 0) {
+					/*
+					 * FIXME: In this PoC just break if we
+					 * get an error.
+					 * In the final implementation we need
+					 * to handle this better and not leave
+					 * pages uncleared.
+					 */
+					break;
+				}
+				/* Go through pages buffer and clear them. */
+				while (count) {
+					struct page *page = pages[--count];
+
+					kaddr = kmap(page);
+					clear_page(kaddr);
+					kunmap(page);
+
+					put_page(page);
+					nr_pages--;
+					addr += PAGE_SIZE;
+				}
+			}
+		}
+		up_read(&mm->mmap_sem);
+	}
+	rcu_read_unlock();
+}
+#else
+static void memory_cleanup_on_suspend(suspend_state_t state)
+{
+	/* noop */
+}
+#endif /* VM_WIPEONSUSPEND */
+
 #ifdef CONFIG_PM_DEBUG
 static unsigned int pm_test_delay = 5;
 module_param(pm_test_delay, uint, 0644);
@@ -415,6 +487,16 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
 	if (error)
 		goto Devices_early_resume;
 
+	/*
+	 * FIXME: For this PoC we're calling this early to be able to
+	 * fault in pages. For a correct implementation we have to find a
+	 * way to do it later, eventually _after_ disabling devices and
+	 * secondary CPUs.
+	 * One idea is to add requirement of having these pages pinned
+	 * so that we don't worry about faulting.
+	 */
+	memory_cleanup_on_suspend(state);
+
 	if (state == PM_SUSPEND_TO_IDLE && pm_test_level != TEST_PLATFORM) {
 		s2idle_loop();
 		goto Platform_early_resume;
diff --git a/mm/madvise.c b/mm/madvise.c
index 968df3aa069f..250b65277f11 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -92,6 +92,19 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	case MADV_KEEPONFORK:
 		new_flags &= ~VM_WIPEONFORK;
 		break;
+#ifdef VM_WIPEONSUSPEND
+	case MADV_WIPEONSUSPEND:
+		/* MADV_WIPEONSUSPEND is only supported on anonymous memory. */
+		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_WIPEONSUSPEND;
+		break;
+	case MADV_KEEPONSUSPEND:
+		new_flags &= ~VM_WIPEONSUSPEND;
+		break;
+#endif
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
@@ -731,6 +744,10 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
+#endif
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+	case MADV_WIPEONSUSPEND:
+	case MADV_KEEPONSUSPEND:
 #endif
 		return true;
 
-- 
2.17.1