linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
@ 2020-07-03 10:34 Catangiu, Adrian Costin
  2020-07-03 11:04 ` Jann Horn
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Catangiu, Adrian Costin @ 2020-07-03 10:34 UTC (permalink / raw)
  To: linux-mm, linux-pm, virtualization, linux-api
  Cc: akpm, rjw, len.brown, pavel, mhocko, fweimer, keescook, luto,
	wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

Cryptographic libraries carry pseudo random number generators to
quickly provide randomness when needed. If such a random pool gets
cloned, secrets may get revealed, as the same random number may get
used multiple times. For fork, this was fixed using the WIPEONFORK
madvise flag [1].

Unfortunately, the same problem surfaces when a virtual machine gets
cloned. The existing flag does not help there. This patch introduces a
new flag to automatically clear memory contents on VM suspend/resume,
which will allow random number generators to reseed when virtual
machines get cloned.

Examples of this are:
 - PKCS#11 API reinitialization check (mandated by specification)
 - glibc's upcoming PRNG (reseed after wake)
 - OpenSSL PRNG (reseed after wake)

Benefits exist in two spaces:
 - The security benefits of a cloned virtual machine having a
   re-initialized PRNG in every process are straightforward.
   Without reinitialization, two or more cloned VMs could produce
   identical random numbers, which are often used to generate secure
   keys.
 - Provides a simple mechanism to avoid RAM exfiltration during
   traditional sleep/hibernate on a laptop or desktop when memory,
   and thus secrets, are vulnerable to offline tampering or inspection.

This RFC is foremost aimed at defining a userspace interface to enable
applications and libraries that store or cache sensitive information,
to know that they need to regenerate it after process memory has been
exposed to potential copying.  The proposed userspace interface is
a new MADV_WIPEONSUSPEND 'madvise()' flag used to mark pages which
contain such data. This newly added flag would only be available on
64bit archs, since we've run out of 32bit VMA flags.

The mechanism through which the kernel marks the application sensitive
data as potentially copied, is a secondary objective of this RFC. In
the current PoC proposal, the RFC kernel code combines
MADV_WIPEONSUSPEND semantics with ACPI suspend/wake transitions to zero
out all process pages that fall in VMAs marked as MADV_WIPEONSUSPEND
and thus allow applications and libraries be notified and regenerate
their sensitive data.  Marking VMAs as MADV_WIPEONSUSPEND results in
the VMAs being empty in the process after any suspend/wake cycle.
Similar to MADV_WIPEONFORK, if the process accesses memory that was
wiped on suspend, it will get zeroes.  The address ranges are still
valid, they are just empty.

This patch adds logic to the kernel power code to zero out contents of
all MADV_WIPEONSUSPEND VMAs present in the system during its transition
to any suspend state equal or greater/deeper than Suspend-to-memory,
known as S3.

MADV_WIPEONSUSPEND only works on private, anonymous mappings.
The patch also adds MADV_KEEPONSUSPEND, to undo the effects of a
prior MADV_WIPEONSUSPEND for a VMA.

Hypervisors can issue ACPI S0->S3 and S3->S0 events to leverage this
functionality in a virtualized environment.

Alternative kernel implementation ideas:
 - Move the code that clears MADV_WIPEONFORK pages to a virtual
   device driver that registers itself to ACPI events.
 - Add prerequisite that MADV_WIPEONFORK pages must be pinned (so
   no faulting happens) and clear them in a custom/roll-your-own
   device driver on a NMI handler. This could work in a virtualized
   environment where the hypervisor pauses all other vCPUs before
   injecting the NMI.

[1] https://lore.kernel.org/lkml/20170811212829.29186-1-riel@redhat.com/

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
---
 include/linux/mm.h                     |  2 +
 include/uapi/asm-generic/mman-common.h |  3 +
 kernel/power/suspend.c                 | 82 ++++++++++++++++++++++++++
 mm/madvise.c                           | 17 ++++++
 4 files changed, 104 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..939eb80fabbb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -298,11 +298,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_WIPEONSUSPEND	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 63b1f506ea67..e527d9090842 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -67,6 +67,9 @@
 #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
 
+#define MADV_WIPEONSUSPEND 20		/* Zero memory on system suspend */
+#define MADV_KEEPONSUSPEND 21		/* Undo MADV_WIPEONSUSPEND */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index c874a7026e24..4282b7f0dd03 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -323,6 +323,78 @@ static bool platform_suspend_again(suspend_state_t state)
 		suspend_ops->suspend_again() : false;
 }
 
+#ifdef VM_WIPEONSUSPEND
+static void memory_cleanup_on_suspend(suspend_state_t state)
+{
+	struct task_struct *p;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct page *pages[32];
+	unsigned long max_pages_per_loop = ARRAY_SIZE(pages);
+
+	/* Only care about states >= S3 */
+	if (state < PM_SUSPEND_MEM)
+		return;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		int gup_flags = FOLL_WRITE;
+
+		mm = p->mm;
+		if (!mm)
+			continue;
+
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			unsigned long addr, nr_pages;
+
+			if (!(vma->vm_flags & VM_WIPEONSUSPEND))
+				continue;
+
+			addr = vma->vm_start;
+			nr_pages = (vma->vm_end - addr - 1) / PAGE_SIZE + 1;
+			while (nr_pages) {
+				int count = min(nr_pages, max_pages_per_loop);
+				void *kaddr;
+
+				count = get_user_pages_remote(p, mm, addr,
+							count, gup_flags,
+							pages, NULL, NULL);
+				if (count <= 0) {
+					/*
+					 * FIXME: In this PoC just break if we
+					 * get an error.
+					 * In the final implementation we need
+					 * to handle this better and not leave
+					 * pages uncleared.
+					 */
+					break;
+				}
+				/* Go through pages buffer and clear them. */
+				while (count) {
+					struct page *page = pages[--count];
+
+					kaddr = kmap(page);
+					clear_page(kaddr);
+					kunmap(page);
+
+					put_page(page);
+					nr_pages--;
+					addr += PAGE_SIZE;
+				}
+			}
+		}
+		up_read(&mm->mmap_sem);
+	}
+	rcu_read_unlock();
+}
+#else
+static void memory_cleanup_on_suspend(suspend_state_t state)
+{
+	/* noop */
+}
+#endif /* VM_WIPEONSUSPEND */
+
 #ifdef CONFIG_PM_DEBUG
 static unsigned int pm_test_delay = 5;
 module_param(pm_test_delay, uint, 0644);
@@ -415,6 +487,16 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
 	if (error)
 		goto Devices_early_resume;
 
+	/*
+	 * FIXME: For this PoC we're calling this early to be able to
+	 * fault in pages. For a correct implementation we have to find a
+	 * way to do it later, eventually _after_ disabling devices and
+	 * secondary CPUs.
+	 * One idea is to add requirement of having these pages pinned
+	 * so that we don't worry about faulting.
+	 */
+	memory_cleanup_on_suspend(state);
+
 	if (state == PM_SUSPEND_TO_IDLE && pm_test_level != TEST_PLATFORM) {
 		s2idle_loop();
 		goto Platform_early_resume;
diff --git a/mm/madvise.c b/mm/madvise.c
index 968df3aa069f..250b65277f11 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -92,6 +92,19 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	case MADV_KEEPONFORK:
 		new_flags &= ~VM_WIPEONFORK;
 		break;
+#ifdef VM_WIPEONSUSPEND
+	case MADV_WIPEONSUSPEND:
+		/* MADV_WIPEONSUSPEND is only supported on anonymous memory. */
+		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_WIPEONSUSPEND;
+		break;
+	case MADV_KEEPONSUSPEND:
+		new_flags &= ~VM_WIPEONSUSPEND;
+		break;
+#endif
 	case MADV_DONTDUMP:
 		new_flags |= VM_DONTDUMP;
 		break;
@@ -731,6 +744,10 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
+#endif
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+	case MADV_WIPEONSUSPEND:
+	case MADV_KEEPONSUSPEND:
 #endif
 		return true;
 
-- 
2.17.1





Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 10:34 [RFC]: mm,power: introduce MADV_WIPEONSUSPEND Catangiu, Adrian Costin
@ 2020-07-03 11:04 ` Jann Horn
  2020-07-04  1:33   ` Colm MacCárthaigh
  2020-07-06 12:09   ` Alexander Graf
  2020-07-03 11:30 ` Michal Hocko
  2020-07-03 22:44 ` Pavel Machek
  2 siblings, 2 replies; 28+ messages in thread
From: Jann Horn @ 2020-07-03 11:04 UTC (permalink / raw)
  To: Catangiu, Adrian Costin
  Cc: linux-mm, linux-pm, virtualization, linux-api, akpm, rjw,
	len.brown, pavel, mhocko, fweimer, keescook, luto, wad, mingo,
	bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

On Fri, Jul 3, 2020 at 12:34 PM Catangiu, Adrian Costin
<acatan@amazon.com> wrote:
> Cryptographic libraries carry pseudo random number generators to
> quickly provide randomness when needed. If such a random pool gets
> cloned, secrets may get revealed, as the same random number may get
> used multiple times. For fork, this was fixed using the WIPEONFORK
> madvise flag [1].
>
> Unfortunately, the same problem surfaces when a virtual machine gets
> cloned. The existing flag does not help there. This patch introduces a
> new flag to automatically clear memory contents on VM suspend/resume,
> which will allow random number generators to reseed when virtual
> machines get cloned.
>
> Examples of this are:
>  - PKCS#11 API reinitialization check (mandated by specification)
>  - glibc's upcoming PRNG (reseed after wake)
>  - OpenSSL PRNG (reseed after wake)
>
> Benefits exist in two spaces:
>  - The security benefits of a cloned virtual machine having a
>    re-initialized PRNG in every process are straightforward.
>    Without reinitialization, two or more cloned VMs could produce
>    identical random numbers, which are often used to generate secure
>    keys.
>  - Provides a simple mechanism to avoid RAM exfiltration during
>    traditional sleep/hibernate on a laptop or desktop when memory,
>    and thus secrets, are vulnerable to offline tampering or inspection.

For the first usecase, I wonder which way around this would work
better - do the wiping when a VM is saved, or do it when the VM is
restored? I guess that at least in some scenarios, doing it on restore
would be nicer because that way the hypervisor can always instantly
save a VM without having to wait for the guest to say "alright, I'm
ready" - especially if someone e.g. wants to take a snapshot of a
running VM while keeping it running? Or do hypervisors inject such
ACPI transitions every time they snapshot/save/restore a VM anyway?

> This RFC is foremost aimed at defining a userspace interface to enable
> applications and libraries that store or cache sensitive information,
> to know that they need to regenerate it after process memory has been
> exposed to potential copying.  The proposed userspace interface is
> a new MADV_WIPEONSUSPEND 'madvise()' flag used to mark pages which
> contain such data. This newly added flag would only be available on
> 64bit archs, since we've run out of 32bit VMA flags.
>
> The mechanism through which the kernel marks the application sensitive
> data as potentially copied, is a secondary objective of this RFC. In
> the current PoC proposal, the RFC kernel code combines
> MADV_WIPEONSUSPEND semantics with ACPI suspend/wake transitions to zero
> out all process pages that fall in VMAs marked as MADV_WIPEONSUSPEND
> and thus allow applications and libraries be notified and regenerate
> their sensitive data.  Marking VMAs as MADV_WIPEONSUSPEND results in
> the VMAs being empty in the process after any suspend/wake cycle.
> Similar to MADV_WIPEONFORK, if the process accesses memory that was
> wiped on suspend, it will get zeroes.  The address ranges are still
> valid, they are just empty.
>
> This patch adds logic to the kernel power code to zero out contents of
> all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> to any suspend state equal or greater/deeper than Suspend-to-memory,
> known as S3.
>
> MADV_WIPEONSUSPEND only works on private, anonymous mappings.
> The patch also adds MADV_KEEPONSUSPEND, to undo the effects of a
> prior MADV_WIPEONSUSPEND for a VMA.
>
> Hypervisors can issue ACPI S0->S3 and S3->S0 events to leverage this
> functionality in a virtualized environment.
>
> Alternative kernel implementation ideas:
>  - Move the code that clears MADV_WIPEONFORK pages to a virtual
>    device driver that registers itself to ACPI events.
>  - Add prerequisite that MADV_WIPEONFORK pages must be pinned (so
>    no faulting happens) and clear them in a custom/roll-your-own
>    device driver on a NMI handler. This could work in a virtualized
>    environment where the hypervisor pauses all other vCPUs before
>    injecting the NMI.
>
> [1] https://lore.kernel.org/lkml/20170811212829.29186-1-riel@redhat.com/
[...]
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index c874a7026e24..4282b7f0dd03 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -323,6 +323,78 @@ static bool platform_suspend_again(suspend_state_t state)
>                 suspend_ops->suspend_again() : false;
>  }
>
> +#ifdef VM_WIPEONSUSPEND
> +static void memory_cleanup_on_suspend(suspend_state_t state)
> +{
> +       struct task_struct *p;
> +       struct mm_struct *mm;
> +       struct vm_area_struct *vma;
> +       struct page *pages[32];
> +       unsigned long max_pages_per_loop = ARRAY_SIZE(pages);
> +
> +       /* Only care about states >= S3 */
> +       if (state < PM_SUSPEND_MEM)
> +               return;
> +
> +       rcu_read_lock();
> +       for_each_process(p) {
> +               int gup_flags = FOLL_WRITE;
> +
> +               mm = p->mm;
> +               if (!mm)
> +                       continue;
> +
> +               down_read(&mm->mmap_sem);

Blocking actions, such as locking semaphores, are forbidden in RCU
read-side critical sections. Also, from a more high-level perspective,
do we need to be careful here to avoid deadlocks with frozen tasks or
stuff like that?

> +               for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +                       unsigned long addr, nr_pages;
> +
> +                       if (!(vma->vm_flags & VM_WIPEONSUSPEND))
> +                               continue;
> +
> +                       addr = vma->vm_start;
> +                       nr_pages = (vma->vm_end - addr - 1) / PAGE_SIZE + 1;
> +                       while (nr_pages) {
> +                               int count = min(nr_pages, max_pages_per_loop);
> +                               void *kaddr;
> +
> +                               count = get_user_pages_remote(p, mm, addr,
> +                                                       count, gup_flags,
> +                                                       pages, NULL, NULL);

get_user_pages_remote() can wait for disk I/O (for swapping stuff back
in), which we'd probably like to avoid here. And I think it can also
wait for userfaultfd handling from userspace? zap_page_range() (which
is what e.g. MADV_DONTNEED uses) might be a better fit, since it can
yank entries out of the page table (forcing the next write fault to
allocate a new zeroed page) without faulting them into RAM.

> +                               if (count <= 0) {
> +                                       /*
> +                                        * FIXME: In this PoC just break if we
> +                                        * get an error.
> +                                        * In the final implementation we need
> +                                        * to handle this better and not leave
> +                                        * pages uncleared.
> +                                        */
> +                                       break;
> +                               }
> +                               /* Go through pages buffer and clear them. */
> +                               while (count) {
> +                                       struct page *page = pages[--count];
> +
> +                                       kaddr = kmap(page);
> +                                       clear_page(kaddr);
> +                                       kunmap(page);

(This part should go away, but if it stayed, you'd probably want to
use clear_user_highpage() or so instead of open-coding this.)

> +                                       put_page(page);
> +                                       nr_pages--;
> +                                       addr += PAGE_SIZE;
> +                               }
> +                       }
> +               }
> +               up_read(&mm->mmap_sem);
> +       }
> +       rcu_read_unlock();
> +}


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 10:34 [RFC]: mm,power: introduce MADV_WIPEONSUSPEND Catangiu, Adrian Costin
  2020-07-03 11:04 ` Jann Horn
@ 2020-07-03 11:30 ` Michal Hocko
  2020-07-03 12:17   ` Rafael J. Wysocki
                     ` (2 more replies)
  2020-07-03 22:44 ` Pavel Machek
  2 siblings, 3 replies; 28+ messages in thread
From: Michal Hocko @ 2020-07-03 11:30 UTC (permalink / raw)
  To: Catangiu, Adrian Costin
  Cc: linux-mm, linux-pm, virtualization, linux-api, akpm, rjw,
	len.brown, pavel, fweimer, keescook, luto, wad, mingo, bonzini,
	Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

On Fri 03-07-20 10:34:09, Catangiu, Adrian Costin wrote:
> This patch adds logic to the kernel power code to zero out contents of
> all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> to any suspend state equal or greater/deeper than Suspend-to-memory,
> known as S3.

How does the application learn that its memory got wiped? S2disk is an
async operation and it can happen at any time during the task execution.
So how does the application work to prevent from corrupted state - e.g.
when suspended between two memory loads?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 11:30 ` Michal Hocko
@ 2020-07-03 12:17   ` Rafael J. Wysocki
  2020-07-03 22:39     ` Pavel Machek
  2020-07-03 13:29   ` Jann Horn
  2020-07-04  1:45   ` Colm MacCárthaigh
  2 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2020-07-03 12:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Catangiu, Adrian Costin, linux-mm, linux-pm, virtualization,
	linux-api, akpm, rjw, len.brown, pavel, fweimer, keescook, luto,
	wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

On Fri, Jul 3, 2020 at 1:30 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 03-07-20 10:34:09, Catangiu, Adrian Costin wrote:
> > This patch adds logic to the kernel power code to zero out contents of
> > all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> > to any suspend state equal or greater/deeper than Suspend-to-memory,
> > known as S3.
>
> How does the application learn that its memory got wiped? S2disk is an
> async operation and it can happen at any time during the task execution.
> So how does the application work to prevent from corrupted state - e.g.
> when suspended between two memory loads?

This doesn't affect hibernation AFAICS, but system suspend
(suspend-to-RAM or suspend-to-idle, or standby) is async too.

I guess this calls for an interface to notify user space (that opted
in to receive such notifications) on system-wide suspend start and
finish.

Thanks!


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 11:30 ` Michal Hocko
  2020-07-03 12:17   ` Rafael J. Wysocki
@ 2020-07-03 13:29   ` Jann Horn
  2020-07-03 22:34     ` Pavel Machek
  2020-07-07  7:38     ` Michal Hocko
  2020-07-04  1:45   ` Colm MacCárthaigh
  2 siblings, 2 replies; 28+ messages in thread
From: Jann Horn @ 2020-07-03 13:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Catangiu, Adrian Costin, linux-mm, linux-pm, virtualization,
	linux-api, akpm, rjw, len.brown, pavel, fweimer, keescook, luto,
	wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

On Fri, Jul 3, 2020 at 1:30 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 03-07-20 10:34:09, Catangiu, Adrian Costin wrote:
> > This patch adds logic to the kernel power code to zero out contents of
> > all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> > to any suspend state equal or greater/deeper than Suspend-to-memory,
> > known as S3.
>
> How does the application learn that its memory got wiped? S2disk is an
> async operation and it can happen at any time during the task execution.
> So how does the application work to prevent from corrupted state - e.g.
> when suspended between two memory loads?

You can do it seqlock-style, kind of - you reserve the first byte of
the page or so as a "is this page initialized" marker, and after every
read from the page, you do a compiler barrier and check whether that
byte has been cleared.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 13:29   ` Jann Horn
@ 2020-07-03 22:34     ` Pavel Machek
  2020-07-03 22:53       ` Jann Horn
  2020-07-07  7:38     ` Michal Hocko
  1 sibling, 1 reply; 28+ messages in thread
From: Pavel Machek @ 2020-07-03 22:34 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michal Hocko, Catangiu, Adrian Costin, linux-mm, linux-pm,
	virtualization, linux-api, akpm, rjw, len.brown, fweimer,
	keescook, luto, wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

[-- Attachment #1: Type: text/plain, Size: 1281 bytes --]

On Fri 2020-07-03 15:29:22, Jann Horn wrote:
> On Fri, Jul 3, 2020 at 1:30 PM Michal Hocko <mhocko@kernel.org> wrote:
> > On Fri 03-07-20 10:34:09, Catangiu, Adrian Costin wrote:
> > > This patch adds logic to the kernel power code to zero out contents of
> > > all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> > > to any suspend state equal or greater/deeper than Suspend-to-memory,
> > > known as S3.
> >
> > How does the application learn that its memory got wiped? S2disk is an
> > async operation and it can happen at any time during the task execution.
> > So how does the application work to prevent from corrupted state - e.g.
> > when suspended between two memory loads?
> 
> You can do it seqlock-style, kind of - you reserve the first byte of
> the page or so as a "is this page initialized" marker, and after every
> read from the page, you do a compiler barrier and check whether that
> byte has been

That would also need smp cpu barriers, and guarantee that first byte
is always ... cleared first, and matching barriers in kernel space,
too, no?

Best regards,
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 12:17   ` Rafael J. Wysocki
@ 2020-07-03 22:39     ` Pavel Machek
  0 siblings, 0 replies; 28+ messages in thread
From: Pavel Machek @ 2020-07-03 22:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Michal Hocko, Catangiu, Adrian Costin, linux-mm, linux-pm,
	virtualization, linux-api, akpm, rjw, len.brown, fweimer,
	keescook, luto, wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

[-- Attachment #1: Type: text/plain, Size: 1385 bytes --]

On Fri 2020-07-03 14:17:50, Rafael J. Wysocki wrote:
> On Fri, Jul 3, 2020 at 1:30 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 03-07-20 10:34:09, Catangiu, Adrian Costin wrote:
> > > This patch adds logic to the kernel power code to zero out contents of
> > > all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> > > to any suspend state equal or greater/deeper than Suspend-to-memory,
> > > known as S3.
> >
> > How does the application learn that its memory got wiped? S2disk is an
> > async operation and it can happen at any time during the task execution.
> > So how does the application work to prevent from corrupted state - e.g.
> > when suspended between two memory loads?
> 
> This doesn't affect hibernation AFAICS, but system suspend
> (suspend-to-RAM or suspend-to-idle, or standby) is async too.
> 
> I guess this calls for an interface to notify user space (that opted
> in to receive such notifications) on system-wide suspend start and
> finish.

We could simply provide a file that would produce single byte 'e' when
entering the suspend and different byte 'x' when exiting...

Not sure how useful that would be for the crypto stuff...

Best regards,
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 10:34 [RFC]: mm,power: introduce MADV_WIPEONSUSPEND Catangiu, Adrian Costin
  2020-07-03 11:04 ` Jann Horn
  2020-07-03 11:30 ` Michal Hocko
@ 2020-07-03 22:44 ` Pavel Machek
  2020-07-03 22:56   ` Jann Horn
  2 siblings, 1 reply; 28+ messages in thread
From: Pavel Machek @ 2020-07-03 22:44 UTC (permalink / raw)
  To: Catangiu, Adrian Costin
  Cc: linux-mm, linux-pm, virtualization, linux-api, akpm, rjw,
	len.brown, mhocko, fweimer, keescook, luto, wad, mingo, bonzini,
	Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

[-- Attachment #1: Type: text/plain, Size: 1523 bytes --]

Hi!

> Cryptographic libraries carry pseudo random number generators to
> quickly provide randomness when needed. If such a random pool gets
> cloned, secrets may get revealed, as the same random number may get
> used multiple times. For fork, this was fixed using the WIPEONFORK
> madvise flag [1].

> Unfortunately, the same problem surfaces when a virtual machine gets
> cloned. The existing flag does not help there. This patch introduces a
> new flag to automatically clear memory contents on VM suspend/resume,
> which will allow random number generators to reseed when virtual
> machines get cloned.

Umm. If this is real problem, should kernel provide such rng in the
vsdo page using vsyscalls? Kernel can have special interface to its
vsyscalls, but we may not want to offer this functionality to rest of
userland...

>  - Provides a simple mechanism to avoid RAM exfiltration during
>    traditional sleep/hibernate on a laptop or desktop when memory,
>    and thus secrets, are vulnerable to offline tampering or
>    inspection.

This second use has nothing to do with RNGs, right?

And I don't think we should do this in kernel.

It is userspace that initiates the suspend transition. Userspace
should lock the screen _before_ starting it, for example. Userspace
should also get rid of any secrets, first...

Best regards,
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 22:34     ` Pavel Machek
@ 2020-07-03 22:53       ` Jann Horn
  0 siblings, 0 replies; 28+ messages in thread
From: Jann Horn @ 2020-07-03 22:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Michal Hocko, Catangiu, Adrian Costin, linux-mm, linux-pm,
	virtualization, linux-api, akpm, rjw, len.brown, fweimer,
	keescook, luto, wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

On Sat, Jul 4, 2020 at 12:34 AM Pavel Machek <pavel@ucw.cz> wrote:
> On Fri 2020-07-03 15:29:22, Jann Horn wrote:
> > On Fri, Jul 3, 2020 at 1:30 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > On Fri 03-07-20 10:34:09, Catangiu, Adrian Costin wrote:
> > > > This patch adds logic to the kernel power code to zero out contents of
> > > > all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> > > > to any suspend state equal or greater/deeper than Suspend-to-memory,
> > > > known as S3.
> > >
> > > How does the application learn that its memory got wiped? S2disk is an
> > > async operation and it can happen at any time during the task execution.
> > > So how does the application work to prevent from corrupted state - e.g.
> > > when suspended between two memory loads?
> >
> > You can do it seqlock-style, kind of - you reserve the first byte of
> > the page or so as a "is this page initialized" marker, and after every
> > read from the page, you do a compiler barrier and check whether that
> > byte has been
>
> That would also need smp cpu barriers, and guarantee that first byte
> is always ... cleared first, and matching barriers in kernel space,
> too, no?

Not if it happens in the guts of the suspend stuff, when userspace is
frozen, I think?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 22:44 ` Pavel Machek
@ 2020-07-03 22:56   ` Jann Horn
  2020-07-04 11:48     ` Pavel Machek
  0 siblings, 1 reply; 28+ messages in thread
From: Jann Horn @ 2020-07-03 22:56 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Catangiu, Adrian Costin, linux-mm, linux-pm, virtualization,
	linux-api, akpm, rjw, len.brown, mhocko, fweimer, keescook, luto,
	wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

On Sat, Jul 4, 2020 at 12:44 AM Pavel Machek <pavel@ucw.cz> wrote:
> > Cryptographic libraries carry pseudo random number generators to
> > quickly provide randomness when needed. If such a random pool gets
> > cloned, secrets may get revealed, as the same random number may get
> > used multiple times. For fork, this was fixed using the WIPEONFORK
> > madvise flag [1].
>
> > Unfortunately, the same problem surfaces when a virtual machine gets
> > cloned. The existing flag does not help there. This patch introduces a
> > new flag to automatically clear memory contents on VM suspend/resume,
> > which will allow random number generators to reseed when virtual
> > machines get cloned.
>
> Umm. If this is real problem, should kernel provide such rng in the
> vsdo page using vsyscalls? Kernel can have special interface to its
> vsyscalls, but we may not want to offer this functionality to rest of
> userland...

And then the kernel would just need to maintain a sequence
number in the vDSO data page that gets bumped on suspend, right?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 11:04 ` Jann Horn
@ 2020-07-04  1:33   ` Colm MacCárthaigh
  2020-07-06 12:09   ` Alexander Graf
  1 sibling, 0 replies; 28+ messages in thread
From: Colm MacCárthaigh @ 2020-07-04  1:33 UTC (permalink / raw)
  To: Jann Horn
  Cc: Catangiu, Adrian Costin, linux-mm, linux-pm, virtualization,
	linux-api, akpm, rjw, len.brown, pavel, mhocko, fweimer,
	keescook, luto, wad, mingo, bonzini, Graf (AWS),
	Alexander, Singh, Balbir, Sandu, Andrei, Brooker, Marc, Weiss,
	Radu, Manwaring, Derek



On 3 Jul 2020, at 4:04, Jann Horn wrote:

>>  - Provides a simple mechanism to avoid RAM exfiltration during
>>    traditional sleep/hibernate on a laptop or desktop when memory,
>>    and thus secrets, are vulnerable to offline tampering or 
>> inspection.
>
> For the first usecase, I wonder which way around this would work
> better - do the wiping when a VM is saved, or do it when the VM is
> restored? I guess that at least in some scenarios, doing it on restore
> would be nicer because that way the hypervisor can always instantly
> save a VM without having to wait for the guest to say "alright, I'm
> ready" - especially if someone e.g. wants to take a snapshot of a
> running VM while keeping it running? Or do hypervisors inject such
> ACPI transitions every time they snapshot/save/restore a VM anyway?


Just to answer this - I’d expect wipe-after-save rather than 
wipe-on-restore to be common for some. That provides the most defense 
against secrets ending up on disk or some other durable medium when the 
VM images are being saved.

-
Colm


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 11:30 ` Michal Hocko
  2020-07-03 12:17   ` Rafael J. Wysocki
  2020-07-03 13:29   ` Jann Horn
@ 2020-07-04  1:45   ` Colm MacCárthaigh
  2020-07-07  7:40     ` Michal Hocko
  2 siblings, 1 reply; 28+ messages in thread
From: Colm MacCárthaigh @ 2020-07-04  1:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Catangiu, Adrian Costin, linux-mm, linux-pm, virtualization,
	linux-api, akpm, rjw, len.brown, pavel, fweimer, keescook, luto,
	wad, mingo, bonzini, Graf (AWS),
	Alexander, Singh, Balbir, Sandu, Andrei, Brooker, Marc, Weiss,
	Radu, Manwaring, Derek



On 3 Jul 2020, at 4:30, Michal Hocko wrote:

> On Fri 03-07-20 10:34:09, Catangiu, Adrian Costin wrote:
>> This patch adds logic to the kernel power code to zero out contents 
>> of
>> all MADV_WIPEONSUSPEND VMAs present in the system during its 
>> transition
>> to any suspend state equal or greater/deeper than Suspend-to-memory,
>> known as S3.
>
> How does the application learn that its memory got wiped? S2disk is an
> async operation and it can happen at any time during the task 
> execution.
> So how does the application work to prevent from corrupted state - 
> e.g.
> when suspended between two memory loads?

The usual trick when using MADV_WIPEONFORK, or BSD’s MAP_INHERIT_ZERO, 
is to store a guard variable in the page and to check the variable any 
time that random data is generated.

Here’s an example of Google’s OpenSSL fork BoringSSL:

https://boringssl.googlesource.com/boringssl/+/ad5582985cc6b89d0e7caf0d9cc7e301de61cf66/crypto/fipsmodule/rand/fork_detect.c

Checking a guard variable for non-zero status will always happen 
atomically and monotonically (it won’t suddenly flip back) … which 
is all that’s needed in this case. If userspace applications need to 
build a larger critical section around they can use regular concurrency 
controls, but it really doesn’t come up in this context. With 
WIPEONSUSPEND support in a kernel, I expect to add another madvise() 
call on the existing page. The manyworldsdetector micro-library is an 
example:


https://github.com/colmmacc/manyworldsdetector/blob/master/src/mwd.c

It’d be a new block in the style of lines 43-48.

-
Colm


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 22:56   ` Jann Horn
@ 2020-07-04 11:48     ` Pavel Machek
  2020-07-06 12:26       ` Alexander Graf
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Machek @ 2020-07-04 11:48 UTC (permalink / raw)
  To: Jann Horn
  Cc: Catangiu, Adrian Costin, linux-mm, linux-pm, virtualization,
	linux-api, akpm, rjw, len.brown, mhocko, fweimer, keescook, luto,
	wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

[-- Attachment #1: Type: text/plain, Size: 1297 bytes --]

Hi!

> > > Cryptographic libraries carry pseudo random number generators to
> > > quickly provide randomness when needed. If such a random pool gets
> > > cloned, secrets may get revealed, as the same random number may get
> > > used multiple times. For fork, this was fixed using the WIPEONFORK
> > > madvise flag [1].
> >
> > > Unfortunately, the same problem surfaces when a virtual machine gets
> > > cloned. The existing flag does not help there. This patch introduces a
> > > new flag to automatically clear memory contents on VM suspend/resume,
> > > which will allow random number generators to reseed when virtual
> > > machines get cloned.
> >
> > Umm. If this is real problem, should kernel provide such rng in the
> > vsdo page using vsyscalls? Kernel can have special interface to its
> > vsyscalls, but we may not want to offer this functionality to rest of
> > userland...
> 
> And then the kernel would just need to maintain a sequence
> number in the vDSO data page that gets bumped on suspen

Yes, something like that would work. Plus, we'd be free to change the
mechanism in future.

Best regards,
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 11:04 ` Jann Horn
  2020-07-04  1:33   ` Colm MacCárthaigh
@ 2020-07-06 12:09   ` Alexander Graf
  1 sibling, 0 replies; 28+ messages in thread
From: Alexander Graf @ 2020-07-06 12:09 UTC (permalink / raw)
  To: Jann Horn, Catangiu, Adrian Costin
  Cc: linux-mm, linux-pm, virtualization, linux-api, akpm, rjw,
	len.brown, pavel, mhocko, fweimer, keescook, luto, wad, mingo,
	bonzini, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek



On 03.07.20 13:04, Jann Horn wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Fri, Jul 3, 2020 at 12:34 PM Catangiu, Adrian Costin
> <acatan@amazon.com> wrote:
>> Cryptographic libraries carry pseudo random number generators to
>> quickly provide randomness when needed. If such a random pool gets
>> cloned, secrets may get revealed, as the same random number may get
>> used multiple times. For fork, this was fixed using the WIPEONFORK
>> madvise flag [1].
>>
>> Unfortunately, the same problem surfaces when a virtual machine gets
>> cloned. The existing flag does not help there. This patch introduces a
>> new flag to automatically clear memory contents on VM suspend/resume,
>> which will allow random number generators to reseed when virtual
>> machines get cloned.
>>
>> Examples of this are:
>>   - PKCS#11 API reinitialization check (mandated by specification)
>>   - glibc's upcoming PRNG (reseed after wake)
>>   - OpenSSL PRNG (reseed after wake)
>>
>> Benefits exist in two spaces:
>>   - The security benefits of a cloned virtual machine having a
>>     re-initialized PRNG in every process are straightforward.
>>     Without reinitialization, two or more cloned VMs could produce
>>     identical random numbers, which are often used to generate secure
>>     keys.
>>   - Provides a simple mechanism to avoid RAM exfiltration during
>>     traditional sleep/hibernate on a laptop or desktop when memory,
>>     and thus secrets, are vulnerable to offline tampering or inspection.
> 
> For the first usecase, I wonder which way around this would work
> better - do the wiping when a VM is saved, or do it when the VM is
> restored? I guess that at least in some scenarios, doing it on restore
> would be nicer because that way the hypervisor can always instantly
> save a VM without having to wait for the guest to say "alright, I'm
> ready" - especially if someone e.g. wants to take a snapshot of a
> running VM while keeping it running? Or do hypervisors inject such
> ACPI transitions every time they snapshot/save/restore a VM anyway?

Today a hypervisor snapshot/restore operation is almost invisible from 
the VM's point of view. I'm only aware of 2 places where a normal VM 
would be made aware of such an operation:

   1) Clock adjustment. Kvmclock jumps ahead and tells you that a lot of 
time passed

   2) VmGenID. There is a special PV device invented by MS to indicate 
to a VM after resume that it's been cloned.


I can only stress again though that the main point of this RFC is to get 
concensuous on the user space API. Whether we then clear on VM triggered 
suspend, we clear on hypervisor indicated resume or we clear based on 
propagating guarded pages to the hypervisor is a separate discussion 
(Also worth having! But orthogonal).

> 
>> This RFC is foremost aimed at defining a userspace interface to enable
>> applications and libraries that store or cache sensitive information,
>> to know that they need to regenerate it after process memory has been
>> exposed to potential copying.  The proposed userspace interface is
>> a new MADV_WIPEONSUSPEND 'madvise()' flag used to mark pages which
>> contain such data. This newly added flag would only be available on
>> 64bit archs, since we've run out of 32bit VMA flags.
>>
>> The mechanism through which the kernel marks the application sensitive
>> data as potentially copied, is a secondary objective of this RFC. In
>> the current PoC proposal, the RFC kernel code combines
>> MADV_WIPEONSUSPEND semantics with ACPI suspend/wake transitions to zero
>> out all process pages that fall in VMAs marked as MADV_WIPEONSUSPEND
>> and thus allow applications and libraries be notified and regenerate
>> their sensitive data.  Marking VMAs as MADV_WIPEONSUSPEND results in
>> the VMAs being empty in the process after any suspend/wake cycle.
>> Similar to MADV_WIPEONFORK, if the process accesses memory that was
>> wiped on suspend, it will get zeroes.  The address ranges are still
>> valid, they are just empty.
>>
>> This patch adds logic to the kernel power code to zero out contents of
>> all MADV_WIPEONSUSPEND VMAs present in the system during its transition
>> to any suspend state equal or greater/deeper than Suspend-to-memory,
>> known as S3.
>>
>> MADV_WIPEONSUSPEND only works on private, anonymous mappings.
>> The patch also adds MADV_KEEPONSUSPEND, to undo the effects of a
>> prior MADV_WIPEONSUSPEND for a VMA.
>>
>> Hypervisors can issue ACPI S0->S3 and S3->S0 events to leverage this
>> functionality in a virtualized environment.
>>
>> Alternative kernel implementation ideas:
>>   - Move the code that clears MADV_WIPEONFORK pages to a virtual
>>     device driver that registers itself to ACPI events.
>>   - Add prerequisite that MADV_WIPEONFORK pages must be pinned (so
>>     no faulting happens) and clear them in a custom/roll-your-own
>>     device driver on a NMI handler. This could work in a virtualized
>>     environment where the hypervisor pauses all other vCPUs before
>>     injecting the NMI.
>>
>> [1] https://lore.kernel.org/lkml/20170811212829.29186-1-riel@redhat.com/
> [...]
>> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
>> index c874a7026e24..4282b7f0dd03 100644
>> --- a/kernel/power/suspend.c
>> +++ b/kernel/power/suspend.c
>> @@ -323,6 +323,78 @@ static bool platform_suspend_again(suspend_state_t state)
>>                  suspend_ops->suspend_again() : false;
>>   }
>>
>> +#ifdef VM_WIPEONSUSPEND
>> +static void memory_cleanup_on_suspend(suspend_state_t state)
>> +{
>> +       struct task_struct *p;
>> +       struct mm_struct *mm;
>> +       struct vm_area_struct *vma;
>> +       struct page *pages[32];
>> +       unsigned long max_pages_per_loop = ARRAY_SIZE(pages);
>> +
>> +       /* Only care about states >= S3 */
>> +       if (state < PM_SUSPEND_MEM)
>> +               return;
>> +
>> +       rcu_read_lock();
>> +       for_each_process(p) {
>> +               int gup_flags = FOLL_WRITE;
>> +
>> +               mm = p->mm;
>> +               if (!mm)
>> +                       continue;
>> +
>> +               down_read(&mm->mmap_sem);
> 
> Blocking actions, such as locking semaphores, are forbidden in RCU
> read-side critical sections. Also, from a more high-level perspective,
> do we need to be careful here to avoid deadlocks with frozen tasks or
> stuff like that?

Can you think of a better alternative?

> 
>> +               for (vma = mm->mmap; vma; vma = vma->vm_next) {
>> +                       unsigned long addr, nr_pages;
>> +
>> +                       if (!(vma->vm_flags & VM_WIPEONSUSPEND))
>> +                               continue;
>> +
>> +                       addr = vma->vm_start;
>> +                       nr_pages = (vma->vm_end - addr - 1) / PAGE_SIZE + 1;
>> +                       while (nr_pages) {
>> +                               int count = min(nr_pages, max_pages_per_loop);
>> +                               void *kaddr;
>> +
>> +                               count = get_user_pages_remote(p, mm, addr,
>> +                                                       count, gup_flags,
>> +                                                       pages, NULL, NULL);
> 
> get_user_pages_remote() can wait for disk I/O (for swapping stuff back
> in), which we'd probably like to avoid here. And I think it can also
> wait for userfaultfd handling from userspace? zap_page_range() (which
> is what e.g. MADV_DONTNEED uses) might be a better fit, since it can
> yank entries out of the page table (forcing the next write fault to
> allocate a new zeroed page) without faulting them into RAM.

That sounds like a much better fit indeed, thanks!


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-04 11:48     ` Pavel Machek
@ 2020-07-06 12:26       ` Alexander Graf
  2020-07-06 12:52         ` Jann Horn
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Graf @ 2020-07-06 12:26 UTC (permalink / raw)
  To: Pavel Machek, Jann Horn
  Cc: Catangiu, Adrian Costin, linux-mm, linux-pm, virtualization,
	linux-api, akpm, rjw, len.brown, mhocko, fweimer, keescook, luto,
	wad, mingo, bonzini, MacCarthaigh, Colm, Singh, Balbir, Sandu,
	Andrei, Brooker, Marc, Weiss, Radu, Manwaring, Derek



On 04.07.20 13:48, Pavel Machek wrote:
> Hi!
> 
>>>> Cryptographic libraries carry pseudo random number generators to
>>>> quickly provide randomness when needed. If such a random pool gets
>>>> cloned, secrets may get revealed, as the same random number may get
>>>> used multiple times. For fork, this was fixed using the WIPEONFORK
>>>> madvise flag [1].
>>>
>>>> Unfortunately, the same problem surfaces when a virtual machine gets
>>>> cloned. The existing flag does not help there. This patch introduces a
>>>> new flag to automatically clear memory contents on VM suspend/resume,
>>>> which will allow random number generators to reseed when virtual
>>>> machines get cloned.
>>>
>>> Umm. If this is real problem, should kernel provide such rng in the
>>> vsdo page using vsyscalls? Kernel can have special interface to its
>>> vsyscalls, but we may not want to offer this functionality to rest of
>>> userland...
>>
>> And then the kernel would just need to maintain a sequence
>> number in the vDSO data page that gets bumped on suspen
> 
> Yes, something like that would work. Plus, we'd be free to change the
> mechanism in future.

So if we keep treading along that train of thought, a simple vsyscall 
that returns an epoch (incremented by every [VM] resume) would be good 
enough, as user space could in its own logic determine whether it's 
still living inside the same epoch.

The beauty of the clearing is that the checks on it are almost free and 
that we can avoid to store secrets on disk in the first place.

The latter I think is impossible to model with the epoch, but that might 
be ok.

Performance wise, I don't think we can make the vsyscall as cheap as a 
memory compare. Keep in mind that we need to check for the epoch in a 
pretty hot path. How bad would it really be? I'm not sure. It might be 
good enough.

My main concern however is around fragmentation of mechanisms. We 
already have the WIPEONFORK semantic in place in user space 
applications. Do we really want to introduce yet another check for 
what's almost the same semantic? With WIPEONSUSPEND, the hot path check 
between fork and suspend are identical. With an epoch, we have to check 
for zeros and the epoch in addition.

Unless we create a vsyscall that returns both the PID as well as the 
epoch and thus handles fork *and* suspend. I need to think about this a 
bit more :).


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-06 12:26       ` Alexander Graf
@ 2020-07-06 12:52         ` Jann Horn
  2020-07-06 13:14           ` Alexander Graf
  2020-07-07  7:44           ` Michal Hocko
  0 siblings, 2 replies; 28+ messages in thread
From: Jann Horn @ 2020-07-06 12:52 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Pavel Machek, Catangiu, Adrian Costin, linux-mm, linux-pm,
	virtualization, linux-api, akpm, rjw, len.brown, mhocko, fweimer,
	keescook, luto, wad, mingo, bonzini, MacCarthaigh, Colm, Singh,
	Balbir, Sandu, Andrei, Brooker, Marc, Weiss, Radu, Manwaring,
	Derek

On Mon, Jul 6, 2020 at 2:27 PM Alexander Graf <graf@amazon.com> wrote:
> Unless we create a vsyscall that returns both the PID as well as the
> epoch and thus handles fork *and* suspend. I need to think about this a
> bit more :).

You can't reliably detect forking by checking the PID if it is
possible for multiple forks to be chained before the reuse check runs:

 - pid 1000 remembers its PID
 - pid 1000 forks, creating child pid 1001
 - pid 1000 exits and is waited on by init
 - the pid allocator wraps around
 - pid 1001 forks, creating child pid 1000
 - child with pid 1000 tries to check for forking, determines that its
PID is 1000, and concludes that it is still the original process


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-06 12:52         ` Jann Horn
@ 2020-07-06 13:14           ` Alexander Graf
  2020-07-07  7:44           ` Michal Hocko
  1 sibling, 0 replies; 28+ messages in thread
From: Alexander Graf @ 2020-07-06 13:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Pavel Machek, Catangiu, Adrian Costin, linux-mm, linux-pm,
	virtualization, linux-api, akpm, rjw, len.brown, mhocko, fweimer,
	keescook, luto, wad, mingo, bonzini, MacCarthaigh, Colm, Singh,
	Balbir, Sandu, Andrei, Brooker, Marc, Weiss, Radu, Manwaring,
	Derek



On 06.07.20 14:52, Jann Horn wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Mon, Jul 6, 2020 at 2:27 PM Alexander Graf <graf@amazon.com> wrote:
>> Unless we create a vsyscall that returns both the PID as well as the
>> epoch and thus handles fork *and* suspend. I need to think about this a
>> bit more :).
> 
> You can't reliably detect forking by checking the PID if it is
> possible for multiple forks to be chained before the reuse check runs:
> 
>   - pid 1000 remembers its PID
>   - pid 1000 forks, creating child pid 1001
>   - pid 1000 exits and is waited on by init
>   - the pid allocator wraps around
>   - pid 1001 forks, creating child pid 1000
>   - child with pid 1000 tries to check for forking, determines that its
> PID is 1000, and concludes that it is still the original process
> 

Fair point. However, you could bump an epoch value on fork, no? I don't 
think we map anything in the vdso per-process today though ...


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-03 13:29   ` Jann Horn
  2020-07-03 22:34     ` Pavel Machek
@ 2020-07-07  7:38     ` Michal Hocko
  2020-07-07  8:07       ` Pavel Machek
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2020-07-07  7:38 UTC (permalink / raw)
  To: Jann Horn
  Cc: Catangiu, Adrian Costin, linux-mm, linux-pm, virtualization,
	linux-api, akpm, rjw, len.brown, pavel, fweimer, keescook, luto,
	wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

On Fri 03-07-20 15:29:22, Jann Horn wrote:
> On Fri, Jul 3, 2020 at 1:30 PM Michal Hocko <mhocko@kernel.org> wrote:
> > On Fri 03-07-20 10:34:09, Catangiu, Adrian Costin wrote:
> > > This patch adds logic to the kernel power code to zero out contents of
> > > all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> > > to any suspend state equal or greater/deeper than Suspend-to-memory,
> > > known as S3.
> >
> > How does the application learn that its memory got wiped? S2disk is an
> > async operation and it can happen at any time during the task execution.
> > So how does the application work to prevent from corrupted state - e.g.
> > when suspended between two memory loads?
> 
> You can do it seqlock-style, kind of - you reserve the first byte of
> the page or so as a "is this page initialized" marker, and after every
> read from the page, you do a compiler barrier and check whether that
> byte has been cleared.

This is certainly possible yet wery awkwar interface to use IMHO.
MADV_EXTERNALY_VOLATILE would express the actual semantic much better.
I might not still understand the expected usecase but if the target
application has to be changed anyway then why not simply use a
transparent and proper signaling mechanism like poll on a fd. That would
be certainly a more natural and less error prone programming interface.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-04  1:45   ` Colm MacCárthaigh
@ 2020-07-07  7:40     ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2020-07-07  7:40 UTC (permalink / raw)
  To: Colm MacCárthaigh
  Cc: Catangiu, Adrian Costin, linux-mm, linux-pm, virtualization,
	linux-api, akpm, rjw, len.brown, pavel, fweimer, keescook, luto,
	wad, mingo, bonzini, Graf (AWS),
	Alexander, Singh, Balbir, Sandu, Andrei, Brooker, Marc, Weiss,
	Radu, Manwaring, Derek

On Fri 03-07-20 18:45:06, Colm MacCárthaigh wrote:
> 
> 
> On 3 Jul 2020, at 4:30, Michal Hocko wrote:
> 
> > On Fri 03-07-20 10:34:09, Catangiu, Adrian Costin wrote:
> > > This patch adds logic to the kernel power code to zero out contents
> > > of
> > > all MADV_WIPEONSUSPEND VMAs present in the system during its
> > > transition
> > > to any suspend state equal or greater/deeper than Suspend-to-memory,
> > > known as S3.
> > 
> > How does the application learn that its memory got wiped? S2disk is an
> > async operation and it can happen at any time during the task execution.
> > So how does the application work to prevent from corrupted state - e.g.
> > when suspended between two memory loads?
> 
> The usual trick when using MADV_WIPEONFORK, or BSD’s MAP_INHERIT_ZERO, is to
> store a guard variable in the page and to check the variable any time that
> random data is generated.

Well, MADV_WIPEONFORK is a completely different beast because the
forking is under a full control of the parent process and the
information about the fork can be forwarded to child process. It is
not like the child would reborn into a new world in the middle of the
execution.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-06 12:52         ` Jann Horn
  2020-07-06 13:14           ` Alexander Graf
@ 2020-07-07  7:44           ` Michal Hocko
  2020-07-07  8:01             ` Alexander Graf
  1 sibling, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2020-07-07  7:44 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alexander Graf, Pavel Machek, Catangiu, Adrian Costin, linux-mm,
	linux-pm, virtualization, linux-api, akpm, rjw, len.brown,
	fweimer, keescook, luto, wad, mingo, bonzini, MacCarthaigh, Colm,
	Singh, Balbir, Sandu, Andrei, Brooker, Marc, Weiss, Radu,
	Manwaring, Derek

On Mon 06-07-20 14:52:07, Jann Horn wrote:
> On Mon, Jul 6, 2020 at 2:27 PM Alexander Graf <graf@amazon.com> wrote:
> > Unless we create a vsyscall that returns both the PID as well as the
> > epoch and thus handles fork *and* suspend. I need to think about this a
> > bit more :).
> 
> You can't reliably detect forking by checking the PID if it is
> possible for multiple forks to be chained before the reuse check runs:
> 
>  - pid 1000 remembers its PID
>  - pid 1000 forks, creating child pid 1001
>  - pid 1000 exits and is waited on by init
>  - the pid allocator wraps around
>  - pid 1001 forks, creating child pid 1000
>  - child with pid 1000 tries to check for forking, determines that its
> PID is 1000, and concludes that it is still the original process

I must be really missing something here because I really fail to see why
there has to be something new even invented. Sure, checking for pid is
certainly a suboptimal solution because pids are terrible tokens to work
with. We do have a concept of file descriptors which a much better and
supports signaling. There is a clear source of the signal IIUC
(migration) and there are consumers to act upon that (e.g. crypto
backends). So what does really prevent to use a standard signal delivery
over fd for this usecase?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-07  7:44           ` Michal Hocko
@ 2020-07-07  8:01             ` Alexander Graf
  2020-07-07  9:14               ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Graf @ 2020-07-07  8:01 UTC (permalink / raw)
  To: Michal Hocko, Jann Horn
  Cc: Pavel Machek, Catangiu, Adrian Costin, linux-mm, linux-pm,
	virtualization, linux-api, akpm, rjw, len.brown, fweimer,
	keescook, luto, wad, mingo, bonzini, MacCarthaigh, Colm, Singh,
	Balbir, Sandu, Andrei, Brooker, Marc, Weiss, Radu, Manwaring,
	Derek



On 07.07.20 09:44, Michal Hocko wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Mon 06-07-20 14:52:07, Jann Horn wrote:
>> On Mon, Jul 6, 2020 at 2:27 PM Alexander Graf <graf@amazon.com> wrote:
>>> Unless we create a vsyscall that returns both the PID as well as the
>>> epoch and thus handles fork *and* suspend. I need to think about this a
>>> bit more :).
>>
>> You can't reliably detect forking by checking the PID if it is
>> possible for multiple forks to be chained before the reuse check runs:
>>
>>   - pid 1000 remembers its PID
>>   - pid 1000 forks, creating child pid 1001
>>   - pid 1000 exits and is waited on by init
>>   - the pid allocator wraps around
>>   - pid 1001 forks, creating child pid 1000
>>   - child with pid 1000 tries to check for forking, determines that its
>> PID is 1000, and concludes that it is still the original process
> 
> I must be really missing something here because I really fail to see why
> there has to be something new even invented. Sure, checking for pid is
> certainly a suboptimal solution because pids are terrible tokens to work
> with. We do have a concept of file descriptors which a much better and
> supports signaling. There is a clear source of the signal IIUC
> (migration) and there are consumers to act upon that (e.g. crypto
> backends). So what does really prevent to use a standard signal delivery
> over fd for this usecase?

I wasn't part of the discussions on why things like WIPEONFORK were 
invented instead of just using signalling mechanisms, but the main 
reason I can think of are libraries.

As a library, you are under no control of the main loop usually, which 
means you just don't have a way to poll for an fd. As a library author, 
I would usually try to avoid very hard to create such a dependency, 
because it makes it really hard to glue pieces together.

The same applies to signals btw, which would also be a possible way to 
propagate such events.


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-07  7:38     ` Michal Hocko
@ 2020-07-07  8:07       ` Pavel Machek
  2020-07-07  8:58         ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Machek @ 2020-07-07  8:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Catangiu, Adrian Costin, linux-mm, linux-pm,
	virtualization, linux-api, akpm, rjw, len.brown, fweimer,
	keescook, luto, wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

[-- Attachment #1: Type: text/plain, Size: 1694 bytes --]

Hi!

> > > > This patch adds logic to the kernel power code to zero out contents of
> > > > all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> > > > to any suspend state equal or greater/deeper than Suspend-to-memory,
> > > > known as S3.
> > >
> > > How does the application learn that its memory got wiped? S2disk is an
> > > async operation and it can happen at any time during the task execution.
> > > So how does the application work to prevent from corrupted state - e.g.
> > > when suspended between two memory loads?
> > 
> > You can do it seqlock-style, kind of - you reserve the first byte of
> > the page or so as a "is this page initialized" marker, and after every
> > read from the page, you do a compiler barrier and check whether that
> > byte has been cleared.
> 
> This is certainly possible yet wery awkwar interface to use IMHO.
> MADV_EXTERNALY_VOLATILE would express the actual semantic much better.
> I might not still understand the expected usecase but if the target
> application has to be changed anyway then why not simply use a
> transparent and proper signaling mechanism like poll on a fd. That

The goal is to have cryprographically-safe get_random_number() with 0
syscalls.

You'd need to do:

   if (!poll(did_i_migrate)) {
         use_prng_seed();
	 if (poll(did_i_migrate)) {
	       /* oops_they_migrated_me_in_middle_of_computation,
 	          lets_redo_it() */
 		  goto retry:
	 }
   }

Which means two syscalls..

Best regards,


									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-07  8:07       ` Pavel Machek
@ 2020-07-07  8:58         ` Michal Hocko
  2020-07-07 16:37           ` Pavel Machek
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2020-07-07  8:58 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jann Horn, Catangiu, Adrian Costin, linux-mm, linux-pm,
	virtualization, linux-api, akpm, rjw, len.brown, fweimer,
	keescook, luto, wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

On Tue 07-07-20 10:07:26, Pavel Machek wrote:
> Hi!
> 
> > > > > This patch adds logic to the kernel power code to zero out contents of
> > > > > all MADV_WIPEONSUSPEND VMAs present in the system during its transition
> > > > > to any suspend state equal or greater/deeper than Suspend-to-memory,
> > > > > known as S3.
> > > >
> > > > How does the application learn that its memory got wiped? S2disk is an
> > > > async operation and it can happen at any time during the task execution.
> > > > So how does the application work to prevent from corrupted state - e.g.
> > > > when suspended between two memory loads?
> > > 
> > > You can do it seqlock-style, kind of - you reserve the first byte of
> > > the page or so as a "is this page initialized" marker, and after every
> > > read from the page, you do a compiler barrier and check whether that
> > > byte has been cleared.
> > 
> > This is certainly possible yet wery awkwar interface to use IMHO.
> > MADV_EXTERNALY_VOLATILE would express the actual semantic much better.
> > I might not still understand the expected usecase but if the target
> > application has to be changed anyway then why not simply use a
> > transparent and proper signaling mechanism like poll on a fd. That
> 
> The goal is to have cryprographically-safe get_random_number() with 0
> syscalls.
> 
> You'd need to do:
> 
>    if (!poll(did_i_migrate)) {
>          use_prng_seed();
> 	 if (poll(did_i_migrate)) {
> 	       /* oops_they_migrated_me_in_middle_of_computation,
>  	          lets_redo_it() */
>  		  goto retry:
> 	 }
>    }
> 
> Which means two syscalls..

Is this a real problem though? Do we have any actual numbers? E.g. how
often does the migration happen so that 2 syscalls would be visible in
actual workloads?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-07  8:01             ` Alexander Graf
@ 2020-07-07  9:14               ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2020-07-07  9:14 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Jann Horn, Pavel Machek, Catangiu, Adrian Costin, linux-mm,
	linux-pm, virtualization, linux-api, akpm, rjw, len.brown,
	fweimer, keescook, luto, wad, mingo, bonzini, MacCarthaigh, Colm,
	Singh, Balbir, Sandu, Andrei, Brooker, Marc, Weiss, Radu,
	Manwaring, Derek

On Tue 07-07-20 10:01:23, Alexander Graf wrote:
> On 07.07.20 09:44, Michal Hocko wrote:
> > On Mon 06-07-20 14:52:07, Jann Horn wrote:
> > > On Mon, Jul 6, 2020 at 2:27 PM Alexander Graf <graf@amazon.com> wrote:
> > > > Unless we create a vsyscall that returns both the PID as well as the
> > > > epoch and thus handles fork *and* suspend. I need to think about this a
> > > > bit more :).
> > > 
> > > You can't reliably detect forking by checking the PID if it is
> > > possible for multiple forks to be chained before the reuse check runs:
> > > 
> > >   - pid 1000 remembers its PID
> > >   - pid 1000 forks, creating child pid 1001
> > >   - pid 1000 exits and is waited on by init
> > >   - the pid allocator wraps around
> > >   - pid 1001 forks, creating child pid 1000
> > >   - child with pid 1000 tries to check for forking, determines that its
> > > PID is 1000, and concludes that it is still the original process
> > 
> > I must be really missing something here because I really fail to see why
> > there has to be something new even invented. Sure, checking for pid is
> > certainly a suboptimal solution because pids are terrible tokens to work
> > with. We do have a concept of file descriptors which a much better and
> > supports signaling. There is a clear source of the signal IIUC
> > (migration) and there are consumers to act upon that (e.g. crypto
> > backends). So what does really prevent to use a standard signal delivery
> > over fd for this usecase?
> 
> I wasn't part of the discussions on why things like WIPEONFORK were invented
> instead of just using signalling mechanisms, but the main reason I can think
> of are libraries.

Well, I would argue that WIPEONFORK is conceptually different. It is
one time initialization mechanism with a very clear life time semantic.
So any programming model is really as easy as, the initial state is
always 0 for a new task without any surprises later on because you own
the memory (essentially an extension to initialized .data section on
exec to any new task).

Compare that to a completely async nature of this interface. Any read
would essentially have to be properly synchronized with the external
event otherwise the state could have been corrupted. Such a consistency
model is really cumbersome to work with.

> As a library, you are under no control of the main loop usually, which means
> you just don't have a way to poll for an fd. As a library author, I would
> usually try to avoid very hard to create such a dependency, because it makes
> it really hard to glue pieces together.
> 
> The same applies to signals btw, which would also be a possible way to
> propagate such events.

Just to clarify I didn't really mean posix signals here. Those would be
quite clumsy indeed. But I can imagine that a library registers to a
system wide means to get a notification. There are many examples for
that, including a lot of usage inside libraries. All different *bus
interfaces.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-07  8:58         ` Michal Hocko
@ 2020-07-07 16:37           ` Pavel Machek
  2020-07-07 19:00             ` Colm MacCarthaigh
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Machek @ 2020-07-07 16:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Catangiu, Adrian Costin, linux-mm, linux-pm,
	virtualization, linux-api, akpm, rjw, len.brown, fweimer,
	keescook, luto, wad, mingo, bonzini, Graf (AWS),
	Alexander, MacCarthaigh, Colm, Singh, Balbir, Sandu, Andrei,
	Brooker, Marc, Weiss, Radu, Manwaring, Derek

[-- Attachment #1: Type: text/plain, Size: 1515 bytes --]

Hi!

> > > > You can do it seqlock-style, kind of - you reserve the first byte of
> > > > the page or so as a "is this page initialized" marker, and after every
> > > > read from the page, you do a compiler barrier and check whether that
> > > > byte has been cleared.
> > > 
> > > This is certainly possible yet wery awkwar interface to use IMHO.
> > > MADV_EXTERNALY_VOLATILE would express the actual semantic much better.
> > > I might not still understand the expected usecase but if the target
> > > application has to be changed anyway then why not simply use a
> > > transparent and proper signaling mechanism like poll on a fd. That
> > 
> > The goal is to have cryprographically-safe get_random_number() with 0
> > syscalls.
> > 
> > You'd need to do:
> > 
> >    if (!poll(did_i_migrate)) {
> >          use_prng_seed();
> > 	 if (poll(did_i_migrate)) {
> > 	       /* oops_they_migrated_me_in_middle_of_computation,
> >  	          lets_redo_it() */
> >  		  goto retry:
> > 	 }
> >    }
> > 
> > Which means two syscalls..
> 
> Is this a real problem though? Do we have any actual numbers? E.g. how
> often does the migration happen so that 2 syscalls would be visible in
> actual workloads?

Please go through the thread and try to understand it.

You'd need syscalls per get_randomness(), not per migration.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-07 16:37           ` Pavel Machek
@ 2020-07-07 19:00             ` Colm MacCarthaigh
  2020-07-12  7:22               ` Pavel Machek
  0 siblings, 1 reply; 28+ messages in thread
From: Colm MacCarthaigh @ 2020-07-07 19:00 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Michal Hocko, Jann Horn, Catangiu, Adrian Costin, linux-mm,
	linux-pm, virtualization, linux-api, akpm, rjw, len.brown,
	fweimer, keescook, luto, wad, mingo, bonzini, Graf (AWS),
	Alexander, Singh, Balbir, Sandu, Andrei, Brooker, Marc, Weiss,
	Radu, Manwaring, Derek

[-- Attachment #1: Type: text/plain, Size: 1129 bytes --]



On 7 Jul 2020, at 9:37, Pavel Machek wrote:
> Please go through the thread and try to understand it.
>
> You'd need syscalls per get_randomness(), not per migration.

I think one check per get_randomness() is sufficient, though putting it 
at the end of the critical section rather than the beginning helps.

     get_randomness( int len, void *out )
     {
        retry:
        /* Generate random data */
        *out = rng(len);

        if (vm_was_cloned)
            goto retry;
     }

At that point if there is a VM snapshot event .. it happens in the 
callers context and it’s the callers job to mitigate it, and the 
caller can use the same trick if neccessary.

Note though; the security issues arise when a snapshot is being restored 
more than once. For those cases it seems very reasonable for the 
snapshot takers to make the image quiescent prior to snapshotting, to 
further reduce the risk of things like the snapshot occurring in the 
middle of a different critical section. The mechanism here is about 
communicating the snapshot to libraries which are self-contained.




[-- Attachment #2: Type: text/html, Size: 1852 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-07 19:00             ` Colm MacCarthaigh
@ 2020-07-12  7:22               ` Pavel Machek
  2020-07-13  8:02                 ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Machek @ 2020-07-12  7:22 UTC (permalink / raw)
  To: Colm MacCarthaigh
  Cc: Michal Hocko, Jann Horn, Catangiu, Adrian Costin, linux-mm,
	linux-pm, virtualization, linux-api, akpm, rjw, len.brown,
	fweimer, keescook, luto, wad, mingo, bonzini, Graf (AWS),
	Alexander, Singh, Balbir, Sandu, Andrei, Brooker, Marc, Weiss,
	Radu, Manwaring, Derek

[-- Attachment #1: Type: text/plain, Size: 627 bytes --]

On Tue 2020-07-07 12:00:41, Colm MacCarthaigh wrote:
> 
> 
> On 7 Jul 2020, at 9:37, Pavel Machek wrote:
> > Please go through the thread and try to understand it.
> > 
> > You'd need syscalls per get_randomness(), not per migration.
> 
> I think one check per get_randomness() is sufficient, though putting it at
> the end of the critical section rather than the beginning helps.

Yeah, well, one syscall is still enough to make it useless.

Best regards,
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]: mm,power: introduce MADV_WIPEONSUSPEND
  2020-07-12  7:22               ` Pavel Machek
@ 2020-07-13  8:02                 ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2020-07-13  8:02 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Colm MacCarthaigh, Jann Horn, Catangiu, Adrian Costin, linux-mm,
	linux-pm, virtualization, linux-api, akpm, rjw, len.brown,
	fweimer, keescook, luto, wad, mingo, bonzini, Graf (AWS),
	Alexander, Singh, Balbir, Sandu, Andrei, Brooker, Marc, Weiss,
	Radu, Manwaring, Derek

On Sun 12-07-20 09:22:28, Pavel Machek wrote:
> On Tue 2020-07-07 12:00:41, Colm MacCarthaigh wrote:
> > 
> > 
> > On 7 Jul 2020, at 9:37, Pavel Machek wrote:
> > > Please go through the thread and try to understand it.
> > > 
> > > You'd need syscalls per get_randomness(), not per migration.
> > 
> > I think one check per get_randomness() is sufficient, though putting it at
> > the end of the critical section rather than the beginning helps.
> 
> Yeah, well, one syscall is still enough to make it useless.

I am sorry but I really do not follow. Why would you want to call a
syscall on each get_randomness invocation? Why is it not enough to
simply have a flag that tells that an external event has happened
and reinitialize if the flag is set? Yes this wouldn't be really sync
operation but does that matter? Is using a few random numbers from the
old pool just because the notifier hasn't processed and flag the
situation a major security concern?

Btw. let me just clarify that I am not by any means pushing a solution
like that. All I am saying is that MADV_WIPEONSUSPEND is inherently
subtle interface that we likely want to avoid.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-07-13  8:02 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-03 10:34 [RFC]: mm,power: introduce MADV_WIPEONSUSPEND Catangiu, Adrian Costin
2020-07-03 11:04 ` Jann Horn
2020-07-04  1:33   ` Colm MacCárthaigh
2020-07-06 12:09   ` Alexander Graf
2020-07-03 11:30 ` Michal Hocko
2020-07-03 12:17   ` Rafael J. Wysocki
2020-07-03 22:39     ` Pavel Machek
2020-07-03 13:29   ` Jann Horn
2020-07-03 22:34     ` Pavel Machek
2020-07-03 22:53       ` Jann Horn
2020-07-07  7:38     ` Michal Hocko
2020-07-07  8:07       ` Pavel Machek
2020-07-07  8:58         ` Michal Hocko
2020-07-07 16:37           ` Pavel Machek
2020-07-07 19:00             ` Colm MacCarthaigh
2020-07-12  7:22               ` Pavel Machek
2020-07-13  8:02                 ` Michal Hocko
2020-07-04  1:45   ` Colm MacCárthaigh
2020-07-07  7:40     ` Michal Hocko
2020-07-03 22:44 ` Pavel Machek
2020-07-03 22:56   ` Jann Horn
2020-07-04 11:48     ` Pavel Machek
2020-07-06 12:26       ` Alexander Graf
2020-07-06 12:52         ` Jann Horn
2020-07-06 13:14           ` Alexander Graf
2020-07-07  7:44           ` Michal Hocko
2020-07-07  8:01             ` Alexander Graf
2020-07-07  9:14               ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).