* [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
@ 2014-03-21 21:17 ` John Stultz
  0 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-21 21:17 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

Just wanted to send out an updated patch set that includes changes from
some of the reviews. Hopefully folks will have some time to look them
over prior to the LSF-MM discussion on volatile ranges on Tuesday (see
below for LSF-MM discussion points to think about).

New changes are:
----------------
o Added flags argument to the syscall, which is unused, but per
  https://lwn.net/Articles/585415/ seems like a good idea.
o Minor vma traversing cleanups suggested by Jan
o Return an error when trying to mark unmapped regions
o First pass implementation of marking pages referenced when
  they are marked volatile, so the pages in a range are set to
  the same "age" and will be approximately purged together.
  This behavior is still open for discussion.
o Very naive implementation of anonymous page aging on swapless
  systems. This has clear performance issues, as we burn time
  overly scanning anonymous pages, but provides something
  concrete upon which to discuss what the best way would be to
  solve this.
o Other minor code cleanups

The first three patches are still the core functionality, which
I'd really like further review on. The last two patches in this
series are more discussion starters, and are less serious.


Potential discussion items for LSF-MM to think about:
----------------------------------------------------
o How to increase reviewer interest?
    - Lots of interest from application world
o Page aging semantics when marking volatile.
    - Should marking volatile be the same as accessing pages?
    - Should volatile ranges be put on end of inactive lru?
    - Should we just punt this and have applications combine madvise()
      use with vrange() to specify range age?
o Volatile page & purged page accounting
    - Volatility is stored in per-process vma, not page
    - vmstats are page based, how do we deal w/ COWed pages?
o Aging anonymous memory on swapless systems
    - Any thoughts on improving over naive method?
    - Better volatile page accounting might help?
    - Do we need a separate volatile LRU?
o Shared volatility on tmpfs/shm/memfd (required for ashmem)
    - Johannes' idea for clearing dirty bits?
    - vma-like structure on the address space?

thanks
-john


Volatile ranges provide a method for userland to inform the kernel that
a range of memory is safe to discard (ie: it can be regenerated), but
userspace may want to try to access it in the future.  It can be thought
of as similar to MADV_DONTNEED, except that the actual freeing of the
memory is delayed and only done under memory pressure, and the user can
try to cancel the action and quickly access any unpurged pages. The
idea originated from Android's ashmem, but I've since learned that other
OSes provide similar functionality.

This functionality allows for a number of interesting uses. One such
example is userland caches with kernel-triggered eviction under memory
pressure, which allows the kernel to "rightsize" userspace caches for
the current system-wide workload. Things like image bitmap caches, or
rendered HTML in a hidden browser tab, where the data is not visible
and can be regenerated if needed, are good examples.

Both Chrome and Firefox already make use of volatile ranges via the
ashmem interface:
https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34

https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc


There are two basic ways volatile ranges can be used:

Explicit marking method:
1) Userland marks a range of memory that can be regenerated if necessary
as volatile
2) Before accessing the memory again, userland marks the memory as
nonvolatile, and the kernel will provide notification if any pages in the
range have been purged.
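
For the explicit method, a minimal C sketch (the wrapper and helper
names here are invented for illustration; the syscall number and mode
constants come from patch 1 of this series and apply only to a patched
x86_64 kernel -- on anything else the raw syscall simply fails):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Proposed interface from this series -- these values are NOT in any
 * released kernel headers. 316 is the x86_64 syscall number added in
 * patch 1; the call only exists on a kernel with this series applied. */
#define __NR_vrange        316
#define VRANGE_NONVOLATILE 0
#define VRANGE_VOLATILE    1

static long vrange(void *start, size_t len, unsigned long mode, int *purged)
{
	return syscall(__NR_vrange, (unsigned long)start, len, mode,
		       0UL /* flags */, purged);
}

/* Step 1: the cache goes idle -- tell the kernel it may discard it. */
static void cache_mark_idle(void *cache, size_t len)
{
	vrange(cache, len, VRANGE_VOLATILE, NULL);
}

/* Step 2: before reuse, unmark the range. Returns 1 if any page was
 * purged and must be regenerated; 0 if the data is intact (or if the
 * syscall itself failed, e.g. on an unpatched kernel). */
static int cache_reuse(void *cache, size_t len)
{
	int purged = 0;

	if (vrange(cache, len, VRANGE_NONVOLATILE, &purged) < 0)
		return 0;
	return purged;
}
```

A cache consumer would call cache_mark_idle() when the data goes cold,
and regenerate the contents whenever cache_reuse() reports a purge.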

Optimistic method:
1) Userland marks a large range of data as volatile
2) Userland continues to access the data as it needs.
3) If userland accesses a page that has been purged, the kernel will
send a SIGBUS
4) Userspace can trap the SIGBUS, mark the affected pages as
non-volatile, and refill the data as needed before continuing on
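
Step 4 of the optimistic method hinges on a SIGBUS trap. A hedged
sketch of what that handler could look like (the handler and helper
names are invented for illustration; the syscall number is the x86_64
value from patch 1, and a 4096-byte page size is assumed):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Proposed interface from this series; not in released kernel headers. */
#define __NR_vrange        316		/* x86_64 number from patch 1 */
#define VRANGE_NONVOLATILE 0
#define PAGE_SZ            4096UL	/* assumed page size */

/* Round a faulting address down to its page boundary. */
static uintptr_t page_base(uintptr_t addr)
{
	return addr & ~(PAGE_SZ - 1);
}

/* SIGBUS handler for the optimistic method: unmark the purged page,
 * regenerate its contents, then return so the faulting access retries.
 * (A production handler would restrict itself to async-signal-safe
 * work; this only sketches the control flow.) */
static void purged_page_handler(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t page = page_base((uintptr_t)info->si_addr);

	syscall(__NR_vrange, page, PAGE_SZ,
		(unsigned long)VRANGE_NONVOLATILE, 0UL, (int *)NULL);
	memset((void *)page, 0, PAGE_SZ);	/* regenerate the data */
	(void)sig;
	(void)ctx;
}

/* Install the handler before optimistically touching volatile data. */
static int install_purged_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = purged_page_handler;
	sa.sa_flags = SA_SIGINFO;
	return sigaction(SIGBUS, &sa, NULL);
}
```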


You can read more about the history of volatile ranges here (~reverse
chronological order):
https://lwn.net/Articles/590991/
http://permalink.gmane.org/gmane.linux.kernel.mm/98848
http://permalink.gmane.org/gmane.linux.kernel.mm/98676
https://lwn.net/Articles/522135/
https://lwn.net/Kernel/Index/#Volatile_ranges


Continuing from the last release, this revision is reduced in scope
when compared to earlier attempts. I've only focused on handling
volatility on anonymous memory, and we're storing the volatility in
the VMA.  This may have performance implications compared with the
earlier approach, but it does simplify things. I'm open to expanding
functionality via the flags argument, but for now I'm wanting to keep
focus on what the right default behavior should be and keep the use
cases restricted to help get reviewer interest.

Further, the page discarding happens via normal vmscanning, which due to
anonymous pages not being aged on swapless systems, means we'll only purge
pages when swap is enabled. In this version I've included a naive
implementation of enabling anonymous scanning on swapless systems, which
clearly has performance issues, but hopefully will trigger some discussion
on how to best do this.

Additionally, since we don't handle volatility on tmpfs files with this
version of the patch, it cannot be used to implement semantics similar
to Android's ashmem. But since shared volatility on files is more
complex, my hope is to start small and hopefully grow from there.

Again, much of the logic in this patchset is based on Minchan's earlier
efforts, so I do want to make sure the credit goes to him for his major
contribution!


Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>


John Stultz (5):
  vrange: Add vrange syscall and handle splitting/merging and marking
    vmas
  vrange: Add purged page detection on setting memory non-volatile
  vrange: Add page purging logic & SIGBUS trap
  vrange: Set affected pages referenced when marking volatile
  vmscan: Age anonymous memory even when swap is off.

 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/mm.h               |   1 +
 include/linux/swap.h             |  15 +-
 include/linux/swapops.h          |  10 +
 include/linux/vrange.h           |  13 ++
 mm/Makefile                      |   2 +-
 mm/internal.h                    |   2 -
 mm/memory.c                      |  21 ++
 mm/rmap.c                        |   5 +
 mm/vmscan.c                      |  38 ++--
 mm/vrange.c                      | 433 +++++++++++++++++++++++++++++++++++++++
 11 files changed, 514 insertions(+), 27 deletions(-)
 create mode 100644 include/linux/vrange.h
 create mode 100644 mm/vrange.c

-- 
1.8.3.2


* [PATCH 1/5] vrange: Add vrange syscall and handle splitting/merging and marking vmas
  2014-03-21 21:17 ` John Stultz
@ 2014-03-21 21:17   ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-21 21:17 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

This patch introduces the vrange() syscall, which allows userspace to
specify ranges of memory as volatile, and thus able to be discarded by
the system.

This initial patch simply adds the syscall, and the vma handling,
splitting and merging the vmas as needed, and marking them with
VM_VOLATILE.

No purging or discarding of volatile ranges is done at this point.

Example man page:

NAME
	vrange - Mark or unmark range of memory as volatile

SYNOPSIS
	ssize_t vrange(unsigned long start, size_t length,
			 unsigned long mode, unsigned long flags,
			 int *purged);

DESCRIPTION
	Applications can use vrange(2) to advise the kernel that pages of
	an anonymous mapping in the given VM area can be reclaimed without
	swapping (or can no longer be reclaimed without swapping).
	The idea is that an application can help the kernel with page
	reclaim under memory pressure by specifying data it can easily
	regenerate, so that the kernel can discard the data if needed.

	mode:
	VRANGE_VOLATILE
		Informs the kernel that the VM can discard pages in
		the specified range when under memory pressure.
	VRANGE_NONVOLATILE
		Informs the kernel that the VM can no longer discard pages
		in this range.

	flags: Currently no flags are supported.

	purged: Pointer to an integer which will be set to 1 if
	mode == VRANGE_NONVOLATILE and any page in the affected range
	was purged. If purged returns zero during a mode ==
	VRANGE_NONVOLATILE call, it means all of the pages in the range
	are intact.

	If a process accesses volatile memory which has been purged, and
	was not set as non-volatile via a VRANGE_NONVOLATILE call, it
	will receive a SIGBUS.

RETURN VALUE
	On success vrange returns the number of bytes marked or unmarked.
	Similar to write(), it may return fewer bytes than specified
	if it ran into a problem.

	When using VRANGE_NONVOLATILE, if the return value is smaller
	than the specified length, the integer pointed to by purged
	will still be set to 1 if any of the pages successfully marked
	non-volatile (i.e. counted in the return value) had been purged.

	If an error is returned, no changes were made.

ERRORS
	EINVAL This error can occur for the following reasons:
		* length is negative or not a multiple of the page size.
		* addr is not page-aligned
		* mode not a valid value.
		* flags is not a valid value.

	ENOMEM Not enough memory

	ENOMEM Addresses in the specified range are not currently mapped,
	       or are outside the address space of the process.

	EFAULT Purged pointer is invalid
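
The EINVAL and overflow cases above correspond to the argument
validation sys_vrange() performs in the code below. As a sketch, the
same checks mirrored in userspace (a 4096-byte page is assumed, and
the function name is illustrative, not from the patch):

```c
#include <stddef.h>

#define PAGE_SIZE_4K 4096UL			/* assumed page size */
#define PAGE_MASK_4K (~(PAGE_SIZE_4K - 1))
#define VRANGE_VALID_FLAGS 0UL			/* patch supports no flags yet */

/* Mirrors the EINVAL checks in sys_vrange(): flags must be valid,
 * start page-aligned, the page-truncated length non-zero, and the
 * range must not wrap around the address space. */
static int vrange_args_ok(unsigned long start, size_t len, unsigned long flags)
{
	if (flags & ~VRANGE_VALID_FLAGS)
		return 0;			/* unknown flag bits */
	if (start & ~PAGE_MASK_4K)
		return 0;			/* start not page-aligned */
	len &= PAGE_MASK_4K;			/* truncate to page units */
	if (!len)
		return 0;			/* empty or sub-page range */
	if (start + len < start)
		return 0;			/* address-space wraparound */
	return 1;
}
```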

This is a simplified implementation which reuses some of the logic
from Minchan's earlier efforts, so credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/mm.h               |   1 +
 include/linux/vrange.h           |   8 ++
 mm/Makefile                      |   2 +-
 mm/vrange.c                      | 173 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 184 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/vrange.h
 create mode 100644 mm/vrange.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..7ae3940 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
+316	common	vrange			sys_vrange
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1b7414..a1f11da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
 					/* Used by sys_madvise() */
+#define VM_VOLATILE	0x00001000	/* VMA is volatile */
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
 #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
 
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
new file mode 100644
index 0000000..6e5331e
--- /dev/null
+++ b/include/linux/vrange.h
@@ -0,0 +1,8 @@
+#ifndef _LINUX_VRANGE_H
+#define _LINUX_VRANGE_H
+
+#define VRANGE_NONVOLATILE 0
+#define VRANGE_VOLATILE 1
+#define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
+
+#endif /* _LINUX_VRANGE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..20229e2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o balloon_compaction.o \
+			   compaction.o balloon_compaction.o vrange.o \
 			   interval_tree.o list_lru.o $(mmu-y)
 
 obj-y += init-mm.o
diff --git a/mm/vrange.c b/mm/vrange.c
new file mode 100644
index 0000000..2f8e2ce
--- /dev/null
+++ b/mm/vrange.c
@@ -0,0 +1,173 @@
+#include <linux/syscalls.h>
+#include <linux/vrange.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm_inline.h>
+#include "internal.h"
+
+
+/**
+ * do_vrange - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
+ *
+ * Core logic of sys_vrange(). Iterates over the VMAs in the specified
+ * range, and marks or clears them as VM_VOLATILE, splitting or merging them
+ * as needed.
+ *
+ * Returns the number of bytes successfully modified.
+ *
+ * Returns error only if no bytes were modified.
+ */
+static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
+				unsigned long end, unsigned long mode,
+				unsigned long flags, int *purged)
+{
+	struct vm_area_struct *vma, *prev;
+	unsigned long orig_start = start;
+	ssize_t count = 0, ret = 0;
+
+	down_read(&mm->mmap_sem);
+
+	vma = find_vma_prev(mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		unsigned long new_flags;
+		pgoff_t pgoff;
+		unsigned long tmp;
+
+		if (!vma)
+			goto out;
+
+		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+					VM_HUGETLB))
+			goto out;
+
+		/* We don't support volatility on files for now */
+		if (vma->vm_file) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		/* return ENOMEM if we're trying to mark unmapped pages */
+		if (start < vma->vm_start) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		new_flags = vma->vm_flags;
+
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		switch (mode) {
+		case VRANGE_VOLATILE:
+			new_flags |= VM_VOLATILE;
+			break;
+		case VRANGE_NONVOLATILE:
+			new_flags &= ~VM_VOLATILE;
+		}
+
+		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+		prev = vma_merge(mm, prev, start, tmp, new_flags,
+					vma->anon_vma, vma->vm_file, pgoff,
+					vma_policy(vma));
+		if (prev)
+			goto success;
+
+		if (start != vma->vm_start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				goto out;
+		}
+
+		if (tmp != vma->vm_end) {
+			ret = split_vma(mm, vma, tmp, 0);
+			if (ret)
+				goto out;
+		}
+
+		prev = vma;
+success:
+		vma->vm_flags = new_flags;
+
+		/* update count to distance covered so far*/
+		count = tmp - orig_start;
+
+		start = tmp;
+		if (start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			goto out;
+		vma = prev->vm_next;
+	}
+out:
+	up_read(&mm->mmap_sem);
+
+	/* report bytes successfully marked, even if we're exiting on error */
+	if (count)
+		return count;
+
+	return ret;
+}
+
+
+/**
+ * sys_vrange - Marks specified range as volatile or non-volatile.
+ *
+ * Validates the syscall inputs and calls do_vrange(), then copies the
+ * purged flag back out to userspace.
+ *
+ * Returns the number of bytes successfully modified.
+ * Returns error only if no bytes were modified.
+ */
+SYSCALL_DEFINE5(vrange, unsigned long, start, size_t, len, unsigned long, mode,
+			unsigned long, flags, int __user *, purged)
+{
+	unsigned long end;
+	struct mm_struct *mm = current->mm;
+	ssize_t ret = -EINVAL;
+	int p = 0;
+
+	if (flags & ~VRANGE_VALID_FLAGS)
+		goto out;
+
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	if (start >= TASK_SIZE)
+		goto out;
+
+	if (purged) {
+		/* Test pointer is valid before making any changes */
+		if (put_user(p, purged))
+			return -EFAULT;
+	}
+
+	ret = do_vrange(mm, start, end, mode, flags, &p);
+
+	if (purged) {
+		if (put_user(p, purged)) {
+			/*
+			 * This would be bad, since we've modified volatilty
+			 * and the change in purged state would be lost.
+			 */
+			WARN_ONCE(1, "vrange: purge state possibly lost\n");
+		}
+	}
+
+out:
+	return ret;
+}
-- 
1.8.3.2


+		if (vma->vm_file) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		/* return ENOMEM if we're trying to mark unmapped pages */
+		if (start < vma->vm_start) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		new_flags = vma->vm_flags;
+
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		switch (mode) {
+		case VRANGE_VOLATILE:
+			new_flags |= VM_VOLATILE;
+			break;
+		case VRANGE_NONVOLATILE:
+			new_flags &= ~VM_VOLATILE;
+		}
+
+		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+		prev = vma_merge(mm, prev, start, tmp, new_flags,
+					vma->anon_vma, vma->vm_file, pgoff,
+					vma_policy(vma));
+		if (prev)
+			goto success;
+
+		if (start != vma->vm_start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				goto out;
+		}
+
+		if (tmp != vma->vm_end) {
+			ret = split_vma(mm, vma, tmp, 0);
+			if (ret)
+				goto out;
+		}
+
+		prev = vma;
+success:
+		vma->vm_flags = new_flags;
+
+		/* update count to distance covered so far */
+		count = tmp - orig_start;
+
+		start = tmp;
+		if (start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			goto out;
+		vma = prev->vm_next;
+	}
+out:
+	up_read(&mm->mmap_sem);
+
+	/* report bytes successfully marked, even if we're exiting on error */
+	if (count)
+		return count;
+
+	return ret;
+}
+
+
+/**
+ * sys_vrange - Marks specified range as volatile or non-volatile.
+ *
+ * Validates the syscall inputs and calls do_vrange(), then copies the
+ * purged flag back out to userspace.
+ *
+ * Returns the number of bytes successfully modified.
+ * Returns error only if no bytes were modified.
+ */
+SYSCALL_DEFINE5(vrange, unsigned long, start, size_t, len, unsigned long, mode,
+			unsigned long, flags, int __user *, purged)
+{
+	unsigned long end;
+	struct mm_struct *mm = current->mm;
+	ssize_t ret = -EINVAL;
+	int p = 0;
+
+	if (flags & ~VRANGE_VALID_FLAGS)
+		goto out;
+
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	if (start >= TASK_SIZE)
+		goto out;
+
+	if (purged) {
+		/* Test pointer is valid before making any changes */
+		if (put_user(p, purged))
+			return -EFAULT;
+	}
+
+	ret = do_vrange(mm, start, end, mode, flags, &p);
+
+	if (purged) {
+		if (put_user(p, purged)) {
+			/*
+			 * This would be bad, since we've modified volatility
+			 * and the change in purged state would be lost.
+			 */
+			WARN_ONCE(1, "vrange: purge state possibly lost\n");
+		}
+	}
+
+out:
+	return ret;
+}
-- 
1.8.3.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-03-21 21:17 ` John Stultz
@ 2014-03-21 21:17   ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-21 21:17 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

Users of volatile ranges will need to know if memory was discarded.
This patch adds the purged state tracking required to inform userland
when it marks memory as non-volatile that some memory in that range
was purged and needs to be regenerated.

This is a simplified implementation that uses some of the logic from
Minchan's earlier efforts, so credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/swap.h    | 15 ++++++++--
 include/linux/swapops.h | 10 +++++++
 include/linux/vrange.h  |  3 ++
 mm/vrange.c             | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 101 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6..18c12f9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -70,8 +70,19 @@ static inline int current_is_kswapd(void)
 #define SWP_HWPOISON_NUM 0
 #endif
 
-#define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+
+/*
+ * Purged volatile range pages
+ */
+#define SWP_VRANGE_PURGED_NUM 1
+#define SWP_VRANGE_PURGED (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
+
+
+#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)	\
+				- SWP_MIGRATION_NUM	\
+				- SWP_HWPOISON_NUM	\
+				- SWP_VRANGE_PURGED_NUM	\
+			)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index c0f7526..84f43d9 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -161,6 +161,16 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 
 #endif
 
+static inline swp_entry_t make_vpurged_entry(void)
+{
+	return swp_entry(SWP_VRANGE_PURGED, 0);
+}
+
+static inline int is_vpurged_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_VRANGE_PURGED;
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Support for hardware poisoned pages
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 6e5331e..986fa85 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -1,6 +1,9 @@
 #ifndef _LINUX_VRANGE_H
 #define _LINUX_VRANGE_H
 
+#include <linux/swap.h>
+#include <linux/swapops.h>
+
 #define VRANGE_NONVOLATILE 0
 #define VRANGE_VOLATILE 1
 #define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
diff --git a/mm/vrange.c b/mm/vrange.c
index 2f8e2ce..1ff3cbd 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -8,6 +8,76 @@
 #include <linux/mm_inline.h>
 #include "internal.h"
 
+struct vrange_walker {
+	struct vm_area_struct *vma;
+	int page_was_purged;
+};
+
+
+/**
+ * vrange_check_purged_pte - Checks ptes for purged pages
+ *
+ * Iterates over the ptes in the pmd checking if they have
+ * purged swap entries.
+ *
+ * Sets vrange_walker.page_was_purged to 1 if any were purged.
+ */
+static int vrange_check_purged_pte(pmd_t *pmd, unsigned long addr,
+					unsigned long end, struct mm_walk *walk)
+{
+	struct vrange_walker *vw = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	if (pmd_trans_huge(*pmd))
+		return 0;
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		if (!pte_present(*pte)) {
+			swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
+
+			if (unlikely(is_vpurged_entry(vrange_entry))) {
+				vw->page_was_purged = 1;
+				break;
+			}
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+
+/**
+ * vrange_check_purged - Sets up a mm_walk to check for purged pages
+ *
+ * Sets up and calls walk_page_range() to check for purged pages.
+ *
+ * Returns 1 if pages in the range were purged, 0 otherwise.
+ */
+static int vrange_check_purged(struct mm_struct *mm,
+					 struct vm_area_struct *vma,
+					 unsigned long start,
+					 unsigned long end)
+{
+	struct vrange_walker vw;
+	struct mm_walk vrange_walk = {
+		.pmd_entry = vrange_check_purged_pte,
+		.mm = vma->vm_mm,
+		.private = &vw,
+	};
+	vw.page_was_purged = 0;
+	vw.vma = vma;
+
+	walk_page_range(start, end, &vrange_walk);
+
+	return vw.page_was_purged;
+
+}
 
 /**
  * do_vrange - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
@@ -106,6 +176,11 @@ success:
 		vma = prev->vm_next;
 	}
 out:
+	if (count && (mode == VRANGE_NONVOLATILE))
+		*purged = vrange_check_purged(mm, vma,
+						orig_start,
+						orig_start+count);
+
 	up_read(&mm->mmap_sem);
 
 	/* report bytes successfully marked, even if we're exiting on error */
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH 3/5] vrange: Add page purging logic & SIGBUS trap
  2014-03-21 21:17 ` John Stultz
@ 2014-03-21 21:17   ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-21 21:17 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

This patch adds the hooks in the vmscan logic to discard volatile pages
and mark their ptes as purged. With this, volatile pages will be purged
under pressure, and their ptes' swap entries marked. If the purged pages
are accessed before being marked non-volatile, we catch this and send a
SIGBUS.

This is a simplified implementation that uses logic from Minchan's earlier
efforts, so credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/vrange.h |   2 +
 mm/internal.h          |   2 -
 mm/memory.c            |  21 +++++++++
 mm/rmap.c              |   5 +++
 mm/vmscan.c            |  12 ++++++
 mm/vrange.c            | 114 +++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 154 insertions(+), 2 deletions(-)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 986fa85..d93ad21 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -8,4 +8,6 @@
 #define VRANGE_VOLATILE 1
 #define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
 
+extern int discard_vpage(struct page *page);
+
 #endif /* _LINUX_VRANGE_H */
diff --git a/mm/internal.h b/mm/internal.h
index 29e1e76..ea66bf9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -225,10 +225,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern unsigned long vma_address(struct page *page,
 				 struct vm_area_struct *vma);
-#endif
 #else /* !CONFIG_MMU */
 static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 22dfa61..db5f4da 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -60,6 +60,7 @@
 #include <linux/migrate.h>
 #include <linux/string.h>
 #include <linux/dma-debug.h>
+#include <linux/vrange.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3643,6 +3644,8 @@ static int handle_pte_fault(struct mm_struct *mm,
 
 	entry = *pte;
 	if (!pte_present(entry)) {
+		swp_entry_t vrange_entry;
+retry:
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (likely(vma->vm_ops->fault))
@@ -3652,6 +3655,24 @@ static int handle_pte_fault(struct mm_struct *mm,
 			return do_anonymous_page(mm, vma, address,
 						 pte, pmd, flags);
 		}
+
+		vrange_entry = pte_to_swp_entry(entry);
+		if (unlikely(is_vpurged_entry(vrange_entry))) {
+			if (vma->vm_flags & VM_VOLATILE)
+				return VM_FAULT_SIGBUS;
+
+			/* zap pte */
+			ptl = pte_lockptr(mm, pmd);
+			spin_lock(ptl);
+			if (unlikely(!pte_same(*pte, entry)))
+				goto unlock;
+			flush_cache_page(vma, address, pte_pfn(*pte));
+			ptep_clear_flush(vma, address, pte);
+			pte_unmap_unlock(pte, ptl);
+			goto retry;
+		}
+
+
 		if (pte_file(entry))
 			return do_nonlinear_fault(mm, vma, address,
 					pte, pmd, flags, entry);
diff --git a/mm/rmap.c b/mm/rmap.c
index d9d4231..2b6f079 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 				referenced++;
 		}
 		pte_unmap_unlock(pte, ptl);
+		if (vma->vm_flags & VM_VOLATILE) {
+			pra->mapcount = 0;
+			pra->vm_flags |= VM_VOLATILE;
+			return SWAP_FAIL;
+		}
 	}
 
 	if (referenced) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b4..34f159a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -43,6 +43,7 @@
 #include <linux/sysctl.h>
 #include <linux/oom.h>
 #include <linux/prefetch.h>
+#include <linux/vrange.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -683,6 +684,7 @@ enum page_references {
 	PAGEREF_RECLAIM,
 	PAGEREF_RECLAIM_CLEAN,
 	PAGEREF_KEEP,
+	PAGEREF_DISCARD,
 	PAGEREF_ACTIVATE,
 };
 
@@ -703,6 +705,13 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	/*
+	 * If a volatile page reaches the LRU's tail, we discard the
+	 * page without considering recycling it.
+	 */
+	if (vm_flags & VM_VOLATILE)
+		return PAGEREF_DISCARD;
+
 	if (referenced_ptes) {
 		if (PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
@@ -930,6 +939,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		switch (references) {
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
+		case PAGEREF_DISCARD:
+			if (may_enter_fs && !discard_vpage(page))
+				goto free_it;
 		case PAGEREF_KEEP:
 			goto keep_locked;
 		case PAGEREF_RECLAIM:
diff --git a/mm/vrange.c b/mm/vrange.c
index 1ff3cbd..28ceb6f 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -246,3 +246,117 @@ SYSCALL_DEFINE5(vrange, unsigned long, start, size_t, len, unsigned long, mode,
 out:
 	return ret;
 }
+
+
+/**
+ * try_to_discard_one - Purge a volatile page from a vma
+ *
+ * Finds the pte for a page in a vma, marks the pte as purged
+ * and releases the page.
+ */
+static void try_to_discard_one(struct page *page, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *pte;
+	pte_t pteval;
+	spinlock_t *ptl;
+	unsigned long addr;
+
+	VM_BUG_ON(!PageLocked(page));
+
+	addr = vma_address(page, vma);
+	pte = page_check_address(page, mm, addr, &ptl, 0);
+	if (!pte)
+		return;
+
+	BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
+
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	pteval = ptep_clear_flush(vma, addr, pte);
+
+	update_hiwater_rss(mm);
+	if (PageAnon(page))
+		dec_mm_counter(mm, MM_ANONPAGES);
+	else
+		dec_mm_counter(mm, MM_FILEPAGES);
+
+	page_remove_rmap(page);
+	page_cache_release(page);
+
+	set_pte_at(mm, addr, pte,
+				swp_entry_to_pte(make_vpurged_entry()));
+
+	pte_unmap_unlock(pte, ptl);
+	mmu_notifier_invalidate_page(mm, addr);
+
+}
+
+/**
+ * try_to_discard_vpage - check vma chain and discard from vmas marked volatile
+ *
+ * Goes over all the vmas that hold a page, and where the vmas are volatile,
+ * purge the page from the vma.
+ *
+ * Returns 0 on success, -1 on error.
+ */
+static int try_to_discard_vpage(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff;
+
+	anon_vma = page_lock_anon_vma_read(page);
+	if (!anon_vma)
+		return -1;
+
+	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	/*
+	 * While iterating over this loop, some processes could see the page
+	 * as purged while others see it as not-purged, because there is no
+	 * global lock between parent and child protecting the vrange system
+	 * call during the loop. This is not a problem because the page is
+	 * not a *SHARED* page but a *COW* page, so parent and child may see
+	 * different data at any time. The worst case of this race is that a
+	 * page was purged but could not be discarded, causing an unnecessary
+	 * page fault, which is not severe.
+	 */
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		struct vm_area_struct *vma = avc->vma;
+
+		if (!(vma->vm_flags & VM_VOLATILE))
+			continue;
+		try_to_discard_one(page, vma);
+	}
+	page_unlock_anon_vma_read(anon_vma);
+	return 0;
+}
+
+
+/**
+ * discard_vpage - If possible, discard the specified volatile page
+ *
+ * Attempts to discard a volatile page, and if needed frees the swap page
+ *
+ * Returns 0 on success, -1 on error.
+ */
+int discard_vpage(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageLRU(page));
+
+	/* XXX - for now we only support anonymous volatile pages */
+	if (!PageAnon(page))
+		return -1;
+
+	if (!try_to_discard_vpage(page)) {
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
+
+		if (page_freeze_refs(page, 1)) {
+			unlock_page(page);
+			return 0;
+		}
+	}
+
+	return -1;
+}
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH 3/5] vrange: Add page purging logic & SIGBUS trap
@ 2014-03-21 21:17   ` John Stultz
  0 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-21 21:17 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

This patch adds the hooks in the vmscan logic to discard volatile pages
and mark their pte as purged. With this, volatile pages will be purged
under pressure, and their ptes swap entry's marked. If the purged pages
are accessed before being marked non-volatile, we catch this and send a
SIGBUS.

This is a simplified implementation that uses logic from Minchan's earlier
efforts, so credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/vrange.h |   2 +
 mm/internal.h          |   2 -
 mm/memory.c            |  21 +++++++++
 mm/rmap.c              |   5 +++
 mm/vmscan.c            |  12 ++++++
 mm/vrange.c            | 114 +++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 154 insertions(+), 2 deletions(-)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 986fa85..d93ad21 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -8,4 +8,6 @@
 #define VRANGE_VOLATILE 1
 #define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
 
+extern int discard_vpage(struct page *page);
+
 #endif /* _LINUX_VRANGE_H */
diff --git a/mm/internal.h b/mm/internal.h
index 29e1e76..ea66bf9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -225,10 +225,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern unsigned long vma_address(struct page *page,
 				 struct vm_area_struct *vma);
-#endif
 #else /* !CONFIG_MMU */
 static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 22dfa61..db5f4da 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -60,6 +60,7 @@
 #include <linux/migrate.h>
 #include <linux/string.h>
 #include <linux/dma-debug.h>
+#include <linux/vrange.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3643,6 +3644,8 @@ static int handle_pte_fault(struct mm_struct *mm,
 
 	entry = *pte;
 	if (!pte_present(entry)) {
+		swp_entry_t vrange_entry;
+retry:
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (likely(vma->vm_ops->fault))
@@ -3652,6 +3655,24 @@ static int handle_pte_fault(struct mm_struct *mm,
 			return do_anonymous_page(mm, vma, address,
 						 pte, pmd, flags);
 		}
+
+		vrange_entry = pte_to_swp_entry(entry);
+		if (unlikely(is_vpurged_entry(vrange_entry))) {
+			if (vma->vm_flags & VM_VOLATILE)
+				return VM_FAULT_SIGBUS;
+
+			/* zap pte */
+			ptl = pte_lockptr(mm, pmd);
+			spin_lock(ptl);
+			if (unlikely(!pte_same(*pte, entry)))
+				goto unlock;
+			flush_cache_page(vma, address, pte_pfn(*pte));
+			ptep_clear_flush(vma, address, pte);
+			pte_unmap_unlock(pte, ptl);
+			goto retry;
+		}
+
+
 		if (pte_file(entry))
 			return do_nonlinear_fault(mm, vma, address,
 					pte, pmd, flags, entry);
diff --git a/mm/rmap.c b/mm/rmap.c
index d9d4231..2b6f079 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 				referenced++;
 		}
 		pte_unmap_unlock(pte, ptl);
+		if (vma->vm_flags & VM_VOLATILE) {
+			pra->mapcount = 0;
+			pra->vm_flags |= VM_VOLATILE;
+			return SWAP_FAIL;
+		}
 	}
 
 	if (referenced) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b4..34f159a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -43,6 +43,7 @@
 #include <linux/sysctl.h>
 #include <linux/oom.h>
 #include <linux/prefetch.h>
+#include <linux/vrange.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -683,6 +684,7 @@ enum page_references {
 	PAGEREF_RECLAIM,
 	PAGEREF_RECLAIM_CLEAN,
 	PAGEREF_KEEP,
+	PAGEREF_DISCARD,
 	PAGEREF_ACTIVATE,
 };
 
@@ -703,6 +705,13 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	/*
+	 * If volatile page is reached on LRU's tail, we discard the
+	 * page without considering recycle the page.
+	 */
+	if (vm_flags & VM_VOLATILE)
+		return PAGEREF_DISCARD;
+
 	if (referenced_ptes) {
 		if (PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
@@ -930,6 +939,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		switch (references) {
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
+		case PAGEREF_DISCARD:
+			if (may_enter_fs && !discard_vpage(page))
+				goto free_it;
 		case PAGEREF_KEEP:
 			goto keep_locked;
 		case PAGEREF_RECLAIM:
diff --git a/mm/vrange.c b/mm/vrange.c
index 1ff3cbd..28ceb6f 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -246,3 +246,117 @@ SYSCALL_DEFINE5(vrange, unsigned long, start, size_t, len, unsigned long, mode,
 out:
 	return ret;
 }
+
+
+/**
+ * try_to_discard_one - Purge a volatile page from a vma
+ *
+ * Finds the pte for a page in a vma, marks the pte as purged
+ * and release the page.
+ */
+static void try_to_discard_one(struct page *page, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *pte;
+	pte_t pteval;
+	spinlock_t *ptl;
+	unsigned long addr;
+
+	VM_BUG_ON(!PageLocked(page));
+
+	addr = vma_address(page, vma);
+	pte = page_check_address(page, mm, addr, &ptl, 0);
+	if (!pte)
+		return;
+
+	BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
+
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	pteval = ptep_clear_flush(vma, addr, pte);
+
+	update_hiwater_rss(mm);
+	if (PageAnon(page))
+		dec_mm_counter(mm, MM_ANONPAGES);
+	else
+		dec_mm_counter(mm, MM_FILEPAGES);
+
+	page_remove_rmap(page);
+	page_cache_release(page);
+
+	set_pte_at(mm, addr, pte,
+				swp_entry_to_pte(make_vpurged_entry()));
+
+	pte_unmap_unlock(pte, ptl);
+	mmu_notifier_invalidate_page(mm, addr);
+
+}
+
+/**
+ * try_to_discard_vpage - check vma chain and discard from vmas marked volatile
+ *
+ * Goes over all the vmas that map the page and, for each vma marked
+ * volatile, purges the page from it.
+ *
+ * Returns 0 on success, -1 on error.
+ */
+static int try_to_discard_vpage(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff;
+
+	anon_vma = page_lock_anon_vma_read(page);
+	if (!anon_vma)
+		return -1;
+
+	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	/*
+	 * While iterating over this loop, some processes could see a page
+	 * as purged while others see it as not-purged, because there is no
+	 * global lock between parent and child protecting the vrange system
+	 * call during the loop. This is not a problem because the page is
+	 * not a *SHARED* page but a *COW* page, so parent and child may see
+	 * different data at any time. The worst case of this race is that a
+	 * page was purged but could not be discarded, causing an unnecessary
+	 * page fault, which is not severe.
+	 */
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		struct vm_area_struct *vma = avc->vma;
+
+		if (!(vma->vm_flags & VM_VOLATILE))
+			continue;
+		try_to_discard_one(page, vma);
+	}
+	page_unlock_anon_vma_read(anon_vma);
+	return 0;
+}
+
+
+/**
+ * discard_vpage - If possible, discard the specified volatile page
+ *
+ * Attempts to discard a volatile page, and if needed frees the swap page
+ *
+ * Returns 0 on success, -1 on error.
+ */
+int discard_vpage(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageLRU(page));
+
+	/* XXX - for now we only support anonymous volatile pages */
+	if (!PageAnon(page))
+		return -1;
+
+	if (!try_to_discard_vpage(page)) {
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
+
+		if (page_freeze_refs(page, 1)) {
+			unlock_page(page);
+			return 0;
+		}
+	}
+
+	return -1;
+}
-- 
1.8.3.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH 4/5] vrange: Set affected pages referenced when marking volatile
  2014-03-21 21:17 ` John Stultz
@ 2014-03-21 21:17   ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-21 21:17 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

One issue that some potential users were concerned about was that
they wanted to ensure that all the pages from one volatile range
were purged before we purge pages from a different volatile range.
This would prevent the case where they have 4 large objects, and
the system purges one page from each object, causing all of the
objects to have to be re-created.

The counter-point to this case is when an application is using the
SIGBUS semantics to continue to access pages after they have been
marked volatile. In that case, the desire was that the most recently
touched pages be purged last, and only the "cold" pages be purged
from the specified range.

Instead of adding option flags for the various usage models (at least
initially), one way of getting a solution for both uses would be to
have the act of marking pages as volatile in effect mark the pages
as accessed. Since all of the pages in the range would be marked
together, they would be of the same "age" and would (approximately)
be purged together. Further, if any pages in the range were accessed
after being marked volatile, they would be moved to the end of the
lru and be purged later.

This patch provides this solution by walking the pages in the range
and setting them accessed when set volatile.

This does have a performance impact, as we have to touch each page
when setting them volatile. Additionally, while setting all the
pages to the same age solves the basic problem, there is still an
open question: what age should all the pages be set to?

One could consider them all recently accessed, which would put them
at the end of the active lru. Or one could possibly move them all to
the end of the inactive lru, making them more likely to be purged
sooner.

Another possibility would be to not affect the pages at all when
marking them as volatile, and allow applications to use madvise
prior to marking any pages as volatile to age them together, if
that behavior was needed. In that case this patch would be
unnecessary.

Thoughts on the best approach would be greatly appreciated.


Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 mm/vrange.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/mm/vrange.c b/mm/vrange.c
index 28ceb6f..9be8f45 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -79,6 +79,73 @@ static int vrange_check_purged(struct mm_struct *mm,
 
 }
 
+
+/**
+ * vrange_mark_accessed_pte - Marks pte pages in range accessed
+ *
+ * Iterates over the ptes in the pmd and marks the corresponding page
+ * as accessed. This ensures all the pages in the range are of the
+ * same "age", so that when pages are purged, we will most likely purge
+ * them together.
+ */
+static int vrange_mark_accessed_pte(pmd_t *pmd, unsigned long addr,
+					unsigned long end, struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	if (pmd_trans_huge(*pmd))
+		return 0;
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		if (pte_present(*pte)) {
+			struct page *page;
+
+			page = vm_normal_page(vma, addr, *pte);
+			if (IS_ERR_OR_NULL(page))
+				break;
+			get_page(page);
+			/*
+			 * XXX - Here we may want to do something
+			 * other than marking the page accessed.
+			 * Setting them all to the same "age" ensures
+			 * they are purged together, but it's not clear
+			 * what that "age" should be.
+			 */
+			mark_page_accessed(page);
+			put_page(page);
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+
+/**
+ * vrange_mark_range_accessed - Sets up a mm_walk to mark pages accessed
+ *
+ * Sets up and calls walk_page_range() to mark affected pages as accessed.
+ */
+static void vrange_mark_range_accessed(struct vm_area_struct *vma,
+						unsigned long start,
+						unsigned long end)
+{
+	struct mm_walk vrange_walk = {
+		.pmd_entry = vrange_mark_accessed_pte,
+		.mm = vma->vm_mm,
+		.private = vma,
+	};
+
+	walk_page_range(start, end, &vrange_walk);
+}
+
+
 /**
  * do_vrange - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
  *
@@ -165,6 +232,10 @@ static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
 success:
 		vma->vm_flags = new_flags;
 
+		/* Mark the vma range as accessed */
+		if (mode == VRANGE_VOLATILE)
+			vrange_mark_range_accessed(vma, start, tmp);
+
+		/* update count to distance covered so far */
 		count = tmp - orig_start;
 
-- 
1.8.3.2



* [PATCH 5/5] vmscan: Age anonymous memory even when swap is off.
  2014-03-21 21:17 ` John Stultz
@ 2014-03-21 21:17   ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-21 21:17 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

Currently we don't shrink/scan the anonymous LRUs when swap is off.
This is problematic for volatile range purging on swapless systems.

This patch naively changes the vmscan code to continue scanning
and shrinking the LRUs even when there is no swap.

It obviously has performance issues.

Thoughts on how best to implement this would be appreciated.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 mm/vmscan.c | 26 ++++----------------------
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34f159a..07b0a8c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -155,9 +155,8 @@ static unsigned long zone_reclaimable_pages(struct zone *zone)
 	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
 	     zone_page_state(zone, NR_INACTIVE_FILE);
 
-	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
-		      zone_page_state(zone, NR_INACTIVE_ANON);
+	nr += zone_page_state(zone, NR_ACTIVE_ANON) +
+	      zone_page_state(zone, NR_INACTIVE_ANON);
 
 	return nr;
 }
@@ -1764,13 +1763,6 @@ static int inactive_anon_is_low_global(struct zone *zone)
  */
 static int inactive_anon_is_low(struct lruvec *lruvec)
 {
-	/*
-	 * If we don't have swap space, anonymous page deactivation
-	 * is pointless.
-	 */
-	if (!total_swap_pages)
-		return 0;
-
 	if (!mem_cgroup_disabled())
 		return mem_cgroup_inactive_anon_is_low(lruvec);
 
@@ -1880,12 +1872,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	if (!global_reclaim(sc))
 		force_scan = true;
 
-	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
-		scan_balance = SCAN_FILE;
-		goto out;
-	}
-
 	/*
 	 * Global reclaim will swap to prevent OOM even with no
 	 * swappiness, but memcg users want to use this knob to
@@ -2048,7 +2034,6 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 			if (nr[lru]) {
 				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
 				nr[lru] -= nr_to_scan;
-
 				nr_reclaimed += shrink_list(lru, nr_to_scan,
 							    lruvec, sc);
 			}
@@ -2181,8 +2166,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 */
 	pages_for_compaction = (2UL << sc->order);
 	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
+	inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
+
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
@@ -2726,9 +2711,6 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 
-	if (!total_swap_pages)
-		return;
-
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
 		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
-- 
1.8.3.2



* Re: [PATCH 1/5] vrange: Add vrange syscall and handle splitting/merging and marking vmas
  2014-03-21 21:17   ` John Stultz
@ 2014-03-23 12:20     ` Jan Kara
  -1 siblings, 0 replies; 112+ messages in thread
From: Jan Kara @ 2014-03-23 12:20 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On Fri 21-03-14 14:17:31, John Stultz wrote:
> This patch introduces the vrange() syscall, which allows for specifying
> ranges of memory as volatile, and able to be discarded by the system.
> 
> This initial patch simply adds the syscall, and the vma handling,
> splitting and merging the vmas as needed, and marking them with
> VM_VOLATILE.
> 
> No purging or discarding of volatile ranges is done at this point.
> 
> Example man page:
> 
> NAME
> 	vrange - Mark or unmark range of memory as volatile
> 
> SYNOPSIS
> 	ssize_t vrange(unsigned_long start, size_t length,
> 			 unsigned_long mode, unsigned_long flags,
> 			 int *purged);
> 
> DESCRIPTION
> 	Applications can use vrange(2) to advise kernel that pages of
> 	anonymous mapping in the given VM area can be reclaimed without
> 	swapping (or can no longer be reclaimed without swapping).
> 	The idea is that application can help kernel with page reclaim
> 	under memory pressure by specifying data it can easily regenerate
> 	and thus kernel can discard the data if needed.
> 
> 	mode:
> 	VRANGE_VOLATILE
> 		Informs the kernel that the VM can discard pages in
> 		the specified range when under memory pressure.
> 	VRANGE_NONVOLATILE
> 		Informs the kernel that the VM can no longer discard pages
> 		in this range.
> 
> 	flags: Currently no flags are supported.
> 
> 	purged: Pointer to an integer which will return 1 if
> 	mode == VRANGE_NONVOLATILE and any page in the affected range
> 	was purged. If purged returns zero during a mode ==
> 	VRANGE_NONVOLATILE call, it means all of the pages in the range
> 	are intact.
> 
> 	If a process accesses volatile memory which has been purged, and
> 	was not set as non volatile via a VRANGE_NONVOLATILE call, it
> 	will receive a SIGBUS.
> 
> RETURN VALUE
> 	On success vrange returns the number of bytes marked or unmarked.
> 	Similar to write(), it may return fewer bytes then specified
> 	if it ran into a problem.
> 
> 	When using VRANGE_NON_VOLATILE, if the return value is smaller
> 	then the specified length, then the value specified by the purged
        ^^^ than
Also I'm not sure why *purged is set only if the return value is smaller
than the specified length. Wouldn't the interface be more logical if we
set *purged to an appropriate value in all cases?

...
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index a12bddc..7ae3940 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -322,6 +322,7 @@
>  313	common	finit_module		sys_finit_module
>  314	common	sched_setattr		sys_sched_setattr
>  315	common	sched_getattr		sys_sched_getattr
> +316	common	vrange			sys_vrange
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c1b7414..a1f11da 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
>  
>  					/* Used by sys_madvise() */
> +#define VM_VOLATILE	0x00001000	/* VMA is volatile */
>  #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
>  #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
>  
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> new file mode 100644
> index 0000000..6e5331e
> --- /dev/null
> +++ b/include/linux/vrange.h
> @@ -0,0 +1,8 @@
> +#ifndef _LINUX_VRANGE_H
> +#define _LINUX_VRANGE_H
> +
> +#define VRANGE_NONVOLATILE 0
> +#define VRANGE_VOLATILE 1
> +#define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
> +
> +#endif /* _LINUX_VRANGE_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 310c90a..20229e2 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> -			   compaction.o balloon_compaction.o \
> +			   compaction.o balloon_compaction.o vrange.o \
>  			   interval_tree.o list_lru.o $(mmu-y)
>  
>  obj-y += init-mm.o
> diff --git a/mm/vrange.c b/mm/vrange.c
> new file mode 100644
> index 0000000..2f8e2ce
> --- /dev/null
> +++ b/mm/vrange.c
> @@ -0,0 +1,173 @@
> +#include <linux/syscalls.h>
> +#include <linux/vrange.h>
> +#include <linux/mm_inline.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +
> +/**
> + * do_vrange - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
  If you use docbook style comments (two stars on the first line), you should
also describe arguments like we do for example in mm/memory.c.

> + *
> + * Core logic of sys_volatile. Iterates over the VMAs in the specified
> + * range, and marks or clears them as VM_VOLATILE, splitting or merging them
> + * as needed.
> + *
> + * Returns the number of bytes successfully modified.
> + *
> + * Returns error only if no bytes were modified.
> + */
> +static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
> +				unsigned long end, unsigned long mode,
> +				unsigned long flags, int *purged)
> +{
> +	struct vm_area_struct *vma, *prev;
> +	unsigned long orig_start = start;
> +	ssize_t count = 0, ret = 0;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	vma = find_vma_prev(mm, start, &prev);
> +	if (vma && start > vma->vm_start)
> +		prev = vma;
> +
> +	for (;;) {
> +		unsigned long new_flags;
> +		pgoff_t pgoff;
> +		unsigned long tmp;
> +
> +		if (!vma)
> +			goto out;
> +
> +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
> +					VM_HUGETLB))
> +			goto out;
> +
> +		/* We don't support volatility on files for now */
> +		if (vma->vm_file) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* return ENOMEM if we're trying to mark unmapped pages */
> +		if (start < vma->vm_start) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		new_flags = vma->vm_flags;
> +
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		switch (mode) {
> +		case VRANGE_VOLATILE:
> +			new_flags |= VM_VOLATILE;
> +			break;
> +		case VRANGE_NONVOLATILE:
* Re: [PATCH 1/5] vrange: Add vrange syscall and handle splitting/merging and marking vmas
@ 2014-03-23 12:20     ` Jan Kara
  0 siblings, 0 replies; 112+ messages in thread
From: Jan Kara @ 2014-03-23 12:20 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On Fri 21-03-14 14:17:31, John Stultz wrote:
> This patch introduces the vrange() syscall, which allows for specifying
> ranges of memory as volatile, and able to be discarded by the system.
> 
> This initial patch simply adds the syscall, and the vma handling,
> splitting and merging the vmas as needed, and marking them with
> VM_VOLATILE.
> 
> No purging or discarding of volatile ranges is done at this point.
> 
> Example man page:
> 
> NAME
> 	vrange - Mark or unmark range of memory as volatile
> 
> SYNOPSIS
> 	ssize_t vrange(unsigned_long start, size_t length,
> 			 unsigned_long mode, unsigned_long flags,
> 			 int *purged);
> 
> DESCRIPTION
> 	Applications can use vrange(2) to advise kernel that pages of
> 	anonymous mapping in the given VM area can be reclaimed without
> 	swapping (or can no longer be reclaimed without swapping).
> 	The idea is that application can help kernel with page reclaim
> 	under memory pressure by specifying data it can easily regenerate
> 	and thus kernel can discard the data if needed.
> 
> 	mode:
> 	VRANGE_VOLATILE
> 		Informs the kernel that the VM can discard pages in
> 		the specified range when under memory pressure.
> 	VRANGE_NONVOLATILE
> 		Informs the kernel that the VM can no longer discard pages
> 		in this range.
> 
> 	flags: Currently no flags are supported.
> 
> 	purged: Pointer to an integer which will return 1 if
> 	mode == VRANGE_NONVOLATILE and any page in the affected range
> 	was purged. If purged returns zero during a mode ==
> 	VRANGE_NONVOLATILE call, it means all of the pages in the range
> 	are intact.
> 
> 	If a process accesses volatile memory which has been purged, and
> 	was not set as non volatile via a VRANGE_NONVOLATILE call, it
> 	will receive a SIGBUS.
> 
> RETURN VALUE
> 	On success vrange returns the number of bytes marked or unmarked.
> 	Similar to write(), it may return fewer bytes than specified
> 	if it ran into a problem.
> 
> 	When using VRANGE_NON_VOLATILE, if the return value is smaller
> 	then the specified length, then the value specified by the purged
        ^^^ than
Also I'm not sure why *purged is set only if the return value is smaller
than the specified length. Wouldn't the interface be more logical if we set
*purged to the appropriate value in all cases?

...
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index a12bddc..7ae3940 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -322,6 +322,7 @@
>  313	common	finit_module		sys_finit_module
>  314	common	sched_setattr		sys_sched_setattr
>  315	common	sched_getattr		sys_sched_getattr
> +316	common	vrange			sys_vrange
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c1b7414..a1f11da 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
>  
>  					/* Used by sys_madvise() */
> +#define VM_VOLATILE	0x00001000	/* VMA is volatile */
>  #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
>  #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
>  
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> new file mode 100644
> index 0000000..6e5331e
> --- /dev/null
> +++ b/include/linux/vrange.h
> @@ -0,0 +1,8 @@
> +#ifndef _LINUX_VRANGE_H
> +#define _LINUX_VRANGE_H
> +
> +#define VRANGE_NONVOLATILE 0
> +#define VRANGE_VOLATILE 1
> +#define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
> +
> +#endif /* _LINUX_VRANGE_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 310c90a..20229e2 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> -			   compaction.o balloon_compaction.o \
> +			   compaction.o balloon_compaction.o vrange.o \
>  			   interval_tree.o list_lru.o $(mmu-y)
>  
>  obj-y += init-mm.o
> diff --git a/mm/vrange.c b/mm/vrange.c
> new file mode 100644
> index 0000000..2f8e2ce
> --- /dev/null
> +++ b/mm/vrange.c
> @@ -0,0 +1,173 @@
> +#include <linux/syscalls.h>
> +#include <linux/vrange.h>
> +#include <linux/mm_inline.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +
> +/**
> + * do_vrange - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
  If you use docbook style comments (two stars on the first line), you should
also describe arguments like we do for example in mm/memory.c.
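For reference, a kernel-doc block of the kind being requested would look something like this (the argument descriptions are my guesses at suitable wording, not text from the patch):

```c
/**
 * do_vrange - mark or clear VMAs in the range (start-end) as VM_VOLATILE
 * @mm:     mm_struct of the process whose mappings are affected
 * @start:  page-aligned start address of the range
 * @end:    end address of the range
 * @mode:   VRANGE_VOLATILE or VRANGE_NONVOLATILE
 * @flags:  currently unused, must be 0
 * @purged: set to 1 if any page in the range had been purged
 *
 * Returns the number of bytes successfully modified, or an error
 * only if no bytes were modified.
 */
```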

> + *
> + * Core logic of sys_volatile. Iterates over the VMAs in the specified
> + * range, and marks or clears them as VM_VOLATILE, splitting or merging them
> + * as needed.
> + *
> + * Returns the number of bytes successfully modified.
> + *
> + * Returns error only if no bytes were modified.
> + */
> +static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
> +				unsigned long end, unsigned long mode,
> +				unsigned long flags, int *purged)
> +{
> +	struct vm_area_struct *vma, *prev;
> +	unsigned long orig_start = start;
> +	ssize_t count = 0, ret = 0;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	vma = find_vma_prev(mm, start, &prev);
> +	if (vma && start > vma->vm_start)
> +		prev = vma;
> +
> +	for (;;) {
> +		unsigned long new_flags;
> +		pgoff_t pgoff;
> +		unsigned long tmp;
> +
> +		if (!vma)
> +			goto out;
> +
> +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
> +					VM_HUGETLB))
> +			goto out;
> +
> +		/* We don't support volatility on files for now */
> +		if (vma->vm_file) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* return ENOMEM if we're trying to mark unmapped pages */
> +		if (start < vma->vm_start) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		new_flags = vma->vm_flags;
> +
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		switch (mode) {
> +		case VRANGE_VOLATILE:
> +			new_flags |= VM_VOLATILE;
> +			break;
> +		case VRANGE_NONVOLATILE:
> +			new_flags &= ~VM_VOLATILE;
> +		}
> +
> +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> +		prev = vma_merge(mm, prev, start, tmp, new_flags,
> +					vma->anon_vma, vma->vm_file, pgoff,
> +					vma_policy(vma));
> +		if (prev)
> +			goto success;
> +
> +		if (start != vma->vm_start) {
> +			ret = split_vma(mm, vma, start, 1);
> +			if (ret)
> +				goto out;
> +		}
> +
> +		if (tmp != vma->vm_end) {
> +			ret = split_vma(mm, vma, tmp, 0);
> +			if (ret)
> +				goto out;
> +		}
> +
> +		prev = vma;
> +success:
> +		vma->vm_flags = new_flags;
> +
> +		/* update count to distance covered so far */
> +		count = tmp - orig_start;
> +
> +		start = tmp;
> +		if (start < prev->vm_end)
> +			start = prev->vm_end;
> +		if (start >= end)
> +			goto out;
> +		vma = prev->vm_next;
> +	}
> +out:
> +	up_read(&mm->mmap_sem);
> +
> +	/* report bytes successfully marked, even if we're exiting on error */
> +	if (count)
> +		return count;
> +
> +	return ret;
> +}
> +
> +
> +/**
> + * sys_vrange - Marks specified range as volatile or non-volatile.
> + *
> + * Validates the syscall inputs and calls do_vrange(), then copies the
> + * purged flag back out to userspace.
> + *
> + * Returns the number of bytes successfully modified.
> + * Returns error only if no bytes were modified.
> + */
> +SYSCALL_DEFINE5(vrange, unsigned long, start, size_t, len, unsigned long, mode,
> +			unsigned long, flags, int __user *, purged)
> +{
> +	unsigned long end;
> +	struct mm_struct *mm = current->mm;
> +	ssize_t ret = -EINVAL;
> +	int p = 0;
> +
> +	if (flags & ~VRANGE_VALID_FLAGS)
> +		goto out;
> +
> +	if (start & ~PAGE_MASK)
> +		goto out;
> +
> +	len &= PAGE_MASK;
> +	if (!len)
> +		goto out;
> +
> +	end = start + len;
> +	if (end < start)
> +		goto out;
> +
> +	if (start >= TASK_SIZE)
> +		goto out;
> +
> +	if (purged) {
> +		/* Test pointer is valid before making any changes */
> +		if (put_user(p, purged))
> +			return -EFAULT;
> +	}
> +
> +	ret = do_vrange(mm, start, end, mode, flags, &p);
> +
> +	if (purged) {
> +		if (put_user(p, purged)) {
> +			/*
> +			 * This would be bad, since we've modified volatility
> +			 * and the change in purged state would be lost.
> +			 */
> +			WARN_ONCE(1, "vrange: purge state possibly lost\n");
  I think this can happen when the application has several threads and
vrange() in one thread races with munmap() in another thread. So
WARN_ONCE() doesn't look appropriate (kernel shouldn't spew warnings about
application programming bugs)... I'd just return -EFAULT. I know
information will be lost but userspace is doing something utterly stupid.
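Concretely, the suggested change amounts to something like this untested sketch (reusing the surrounding function's ret variable; the lost byte count is accepted, as noted above):

```c
	if (purged) {
		if (put_user(p, purged)) {
			/*
			 * A racing munmap() from another thread can zap the
			 * pointer under us.  The purged state is lost, but
			 * that is an application bug, so report -EFAULT
			 * instead of warning from the kernel.
			 */
			ret = -EFAULT;
		}
	}
```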

> +		}
> +	}
> +
> +out:
> +	return ret;
> +}

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-03-21 21:17   ` John Stultz
@ 2014-03-23 12:29     ` Jan Kara
  -1 siblings, 0 replies; 112+ messages in thread
From: Jan Kara @ 2014-03-23 12:29 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On Fri 21-03-14 14:17:32, John Stultz wrote:
> Users of volatile ranges will need to know if memory was discarded.
> This patch adds the purged state tracking required to inform userland
> when it marks memory as non-volatile that some memory in that range
> was purged and needs to be regenerated.
> 
> This is a simplified implementation which reuses some of the logic from
> Minchan's earlier efforts, so credit to Minchan for his work.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
  Just one minor nit below. Otherwise the patch looks good to me. So you
can add
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/swap.h    | 15 ++++++++--
>  include/linux/swapops.h | 10 +++++++
>  include/linux/vrange.h  |  3 ++
>  mm/vrange.c             | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 101 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6..18c12f9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -70,8 +70,19 @@ static inline int current_is_kswapd(void)
>  #define SWP_HWPOISON_NUM 0
>  #endif
>  
> -#define MAX_SWAPFILES \
> -	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +
> +/*
> + * Purged volatile range pages
> + */
> +#define SWP_VRANGE_PURGED_NUM 1
> +#define SWP_VRANGE_PURGED (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
> +
> +
> +#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)	\
> +				- SWP_MIGRATION_NUM	\
> +				- SWP_HWPOISON_NUM	\
> +				- SWP_VRANGE_PURGED_NUM	\
> +			)
>  
>  /*
>   * Magic header for a swap area. The first part of the union is
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index c0f7526..84f43d9 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -161,6 +161,16 @@ static inline int is_write_migration_entry(swp_entry_t entry)
>  
>  #endif
>  
> +static inline swp_entry_t make_vpurged_entry(void)
> +{
> +	return swp_entry(SWP_VRANGE_PURGED, 0);
> +}
> +
> +static inline int is_vpurged_entry(swp_entry_t entry)
> +{
> +	return swp_type(entry) == SWP_VRANGE_PURGED;
> +}
> +
>  #ifdef CONFIG_MEMORY_FAILURE
>  /*
>   * Support for hardware poisoned pages
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> index 6e5331e..986fa85 100644
> --- a/include/linux/vrange.h
> +++ b/include/linux/vrange.h
> @@ -1,6 +1,9 @@
>  #ifndef _LINUX_VRANGE_H
>  #define _LINUX_VRANGE_H
>  
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> +
>  #define VRANGE_NONVOLATILE 0
>  #define VRANGE_VOLATILE 1
>  #define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
> diff --git a/mm/vrange.c b/mm/vrange.c
> index 2f8e2ce..1ff3cbd 100644
> --- a/mm/vrange.c
> +++ b/mm/vrange.c
> @@ -8,6 +8,76 @@
>  #include <linux/mm_inline.h>
>  #include "internal.h"
>  
> +struct vrange_walker {
> +	struct vm_area_struct *vma;
> +	int page_was_purged;
> +};
> +
> +
> +/**
> + * vrange_check_purged_pte - Checks ptes for purged pages
> + *
> + * Iterates over the ptes in the pmd checking if they have
> + * purged swap entries.
> + *
> + * Sets the vrange_walker.pages_purged to 1 if any were purged.
                              ^^^ page_was_purged

> + */
> +static int vrange_check_purged_pte(pmd_t *pmd, unsigned long addr,
> +					unsigned long end, struct mm_walk *walk)
> +{
> +	struct vrange_walker *vw = walk->private;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +
> +	if (pmd_trans_huge(*pmd))
> +		return 0;
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		if (!pte_present(*pte)) {
> +			swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
> +
> +			if (unlikely(is_vpurged_entry(vrange_entry))) {
> +				vw->page_was_purged = 1;
> +				break;
> +			}
> +		}
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +
> +/**
> + * vrange_check_purged - Sets up a mm_walk to check for purged pages
> + *
> + * Sets up and calls walk_page_range() to check for purged pages.
> + *
> + * Returns 1 if pages in the range were purged, 0 otherwise.
> + */
> +static int vrange_check_purged(struct mm_struct *mm,
> +					 struct vm_area_struct *vma,
> +					 unsigned long start,
> +					 unsigned long end)
> +{
> +	struct vrange_walker vw;
> +	struct mm_walk vrange_walk = {
> +		.pmd_entry = vrange_check_purged_pte,
> +		.mm = vma->vm_mm,
> +		.private = &vw,
> +	};
> +	vw.page_was_purged = 0;
> +	vw.vma = vma;
> +
> +	walk_page_range(start, end, &vrange_walk);
> +
> +	return vw.page_was_purged;
> +
> +}
>  
>  /**
>   * do_vrange - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
> @@ -106,6 +176,11 @@ success:
>  		vma = prev->vm_next;
>  	}
>  out:
> +	if (count && (mode == VRANGE_NONVOLATILE))
> +		*purged = vrange_check_purged(mm, vma,
> +						orig_start,
> +						orig_start+count);
> +
>  	up_read(&mm->mmap_sem);
>  
>  	/* report bytes successfully marked, even if we're exiting on error */
> -- 
> 1.8.3.2
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH 1/5] vrange: Add vrange syscall and handle splitting/merging and marking vmas
  2014-03-21 21:17   ` John Stultz
@ 2014-03-23 16:50     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 112+ messages in thread
From: KOSAKI Motohiro @ 2014-03-23 16:50 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

Hi

On Fri, Mar 21, 2014 at 2:17 PM, John Stultz <john.stultz@linaro.org> wrote:
> This patch introduces the vrange() syscall, which allows for specifying
> ranges of memory as volatile, and able to be discarded by the system.
>
> This initial patch simply adds the syscall, and the vma handling,
> splitting and merging the vmas as needed, and marking them with
> VM_VOLATILE.
>
> No purging or discarding of volatile ranges is done at this point.
>
> Example man page:
>
> NAME
>         vrange - Mark or unmark range of memory as volatile
>
> SYNOPSIS
>         ssize_t vrange(unsigned_long start, size_t length,
>                          unsigned_long mode, unsigned_long flags,
>                          int *purged);
>
> DESCRIPTION
>         Applications can use vrange(2) to advise kernel that pages of
>         anonymous mapping in the given VM area can be reclaimed without
>         swapping (or can no longer be reclaimed without swapping).
>         The idea is that application can help kernel with page reclaim
>         under memory pressure by specifying data it can easily regenerate
>         and thus kernel can discard the data if needed.
>
>         mode:
>         VRANGE_VOLATILE
>                 Informs the kernel that the VM can discard pages in
>                 the specified range when under memory pressure.
>         VRANGE_NONVOLATILE
>                 Informs the kernel that the VM can no longer discard pages
>                 in this range.
>
>         flags: Currently no flags are supported.
>
>         purged: Pointer to an integer which will return 1 if
>         mode == VRANGE_NONVOLATILE and any page in the affected range
>         was purged. If purged returns zero during a mode ==
>         VRANGE_NONVOLATILE call, it means all of the pages in the range
>         are intact.
>
>         If a process accesses volatile memory which has been purged, and
>         was not set as non volatile via a VRANGE_NONVOLATILE call, it
>         will receive a SIGBUS.
>
> RETURN VALUE
>         On success vrange returns the number of bytes marked or unmarked.
>         Similar to write(), it may return fewer bytes than specified
>         if it ran into a problem.

This explanation doesn't match your implementation: you return the end of
the last VMA processed minus orig_start. So when there is a hole in the
middle of the range, the number of bytes actually marked (or unmarked)
doesn't match the return value.


>
>         When using VRANGE_NON_VOLATILE, if the return value is smaller
>         than the specified length, then the value specified by the purged
>         pointer will be set to 1 if any of the pages specified in the
>         return value as successfully marked non-volatile had been purged.
>
>         If an error is returned, no changes were made.

This explanation doesn't match the implementation either: when you hit a
file mapping partway through the range, the changes already made aren't
rolled back.

>
> ERRORS
>         EINVAL This error can occur for the following reasons:
>                 * The value length is negative or not page size units.
>                 * addr is not page-aligned
>                 * mode not a valid value.
>                 * flags is not a valid value.
>
>         ENOMEM Not enough memory
>
>         ENOMEM Addresses in the specified range are not currently mapped,
>                or are outside the address space of the process.
>
>         EFAULT Purged pointer is invalid
>
> This is a simplified implementation which reuses some of the logic
> from Minchan's earlier efforts. So credit to Minchan for his work.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/mm.h               |   1 +
>  include/linux/vrange.h           |   8 ++
>  mm/Makefile                      |   2 +-
>  mm/vrange.c                      | 173 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 184 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/vrange.h
>  create mode 100644 mm/vrange.c
>
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index a12bddc..7ae3940 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -322,6 +322,7 @@
>  313    common  finit_module            sys_finit_module
>  314    common  sched_setattr           sys_sched_setattr
>  315    common  sched_getattr           sys_sched_getattr
> +316    common  vrange                  sys_vrange
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c1b7414..a1f11da 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_IO           0x00004000     /* Memory mapped I/O or similar */
>
>                                         /* Used by sys_madvise() */
> +#define VM_VOLATILE    0x00001000      /* VMA is volatile */
>  #define VM_SEQ_READ    0x00008000      /* App will access data sequentially */
>  #define VM_RAND_READ   0x00010000      /* App will not benefit from clustered reads */
>
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> new file mode 100644
> index 0000000..6e5331e
> --- /dev/null
> +++ b/include/linux/vrange.h
> @@ -0,0 +1,8 @@
> +#ifndef _LINUX_VRANGE_H
> +#define _LINUX_VRANGE_H
> +
> +#define VRANGE_NONVOLATILE 0
> +#define VRANGE_VOLATILE 1

Maybe moving these definitions to a uapi header would be better?

> +#define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
> +
> +#endif /* _LINUX_VRANGE_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 310c90a..20229e2 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -16,7 +16,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
>                            readahead.o swap.o truncate.o vmscan.o shmem.o \
>                            util.o mmzone.o vmstat.o backing-dev.o \
>                            mm_init.o mmu_context.o percpu.o slab_common.o \
> -                          compaction.o balloon_compaction.o \
> +                          compaction.o balloon_compaction.o vrange.o \
>                            interval_tree.o list_lru.o $(mmu-y)
>
>  obj-y += init-mm.o
> diff --git a/mm/vrange.c b/mm/vrange.c
> new file mode 100644
> index 0000000..2f8e2ce
> --- /dev/null
> +++ b/mm/vrange.c
> @@ -0,0 +1,173 @@
> +#include <linux/syscalls.h>
> +#include <linux/vrange.h>
> +#include <linux/mm_inline.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +
> +/**
> + * do_vrange - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
> + *
> + * Core logic of sys_volatile. Iterates over the VMAs in the specified
> + * range, and marks or clears them as VM_VOLATILE, splitting or merging them
> + * as needed.
> + *
> + * Returns the number of bytes successfully modified.
> + *
> + * Returns error only if no bytes were modified.
> + */
> +static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
> +                               unsigned long end, unsigned long mode,
> +                               unsigned long flags, int *purged)
> +{
> +       struct vm_area_struct *vma, *prev;
> +       unsigned long orig_start = start;
> +       ssize_t count = 0, ret = 0;
> +
> +       down_read(&mm->mmap_sem);

This should be down_write. VMA split and merge require write lock.


> +
> +       vma = find_vma_prev(mm, start, &prev);
> +       if (vma && start > vma->vm_start)
> +               prev = vma;
> +
> +       for (;;) {
> +               unsigned long new_flags;
> +               pgoff_t pgoff;
> +               unsigned long tmp;
> +
> +               if (!vma)
> +                       goto out;
> +
> +               if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
> +                                       VM_HUGETLB))
> +                       goto out;
> +
> +               /* We don't support volatility on files for now */
> +               if (vma->vm_file) {
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +
> +               /* return ENOMEM if we're trying to mark unmapped pages */
> +               if (start < vma->vm_start) {
> +                       ret = -ENOMEM;
> +                       goto out;
> +               }
> +
> +               new_flags = vma->vm_flags;
> +
> +               tmp = vma->vm_end;
> +               if (end < tmp)
> +                       tmp = end;
> +
> +               switch (mode) {
> +               case VRANGE_VOLATILE:
> +                       new_flags |= VM_VOLATILE;
> +                       break;
> +               case VRANGE_NONVOLATILE:
> +                       new_flags &= ~VM_VOLATILE;
> +               }
> +
> +               pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> +               prev = vma_merge(mm, prev, start, tmp, new_flags,
> +                                       vma->anon_vma, vma->vm_file, pgoff,
> +                                       vma_policy(vma));
> +               if (prev)
> +                       goto success;
> +
> +               if (start != vma->vm_start) {
> +                       ret = split_vma(mm, vma, start, 1);
> +                       if (ret)
> +                               goto out;
> +               }
> +
> +               if (tmp != vma->vm_end) {
> +                       ret = split_vma(mm, vma, tmp, 0);
> +                       if (ret)
> +                               goto out;
> +               }
> +
> +               prev = vma;
> +success:
> +               vma->vm_flags = new_flags;
> +
> +               /* update count to distance covered so far*/
> +               count = tmp - orig_start;
> +
> +               start = tmp;
> +               if (start < prev->vm_end)
> +                       start = prev->vm_end;
> +               if (start >= end)
> +                       goto out;
> +               vma = prev->vm_next;
> +       }
> +out:
> +       up_read(&mm->mmap_sem);
> +
> +       /* report bytes successfully marked, even if we're exiting on error */
> +       if (count)
> +               return count;
> +
> +       return ret;
> +}
> +
> +
> +/**
> + * sys_vrange - Marks specified range as volatile or non-volatile.
> + *
> + * Validates the syscall inputs and calls do_vrange(), then copies the
> + * purged flag back out to userspace.
> + *
> + * Returns the number of bytes successfully modified.
> + * Returns error only if no bytes were modified.
> + */
> +SYSCALL_DEFINE5(vrange, unsigned long, start, size_t, len, unsigned long, mode,
> +                       unsigned long, flags, int __user *, purged)
> +{
> +       unsigned long end;
> +       struct mm_struct *mm = current->mm;
> +       ssize_t ret = -EINVAL;
> +       int p = 0;
> +
> +       if (flags & ~VRANGE_VALID_FLAGS)
> +               goto out;
> +
> +       if (start & ~PAGE_MASK)
> +               goto out;
> +
> +       len &= PAGE_MASK;
> +       if (!len)
> +               goto out;

This code doesn't match the explanation of "not page size units."

> +
> +       end = start + len;
> +       if (end < start)
> +               goto out;
> +
> +       if (start >= TASK_SIZE)
> +               goto out;
> +
> +       if (purged) {
> +               /* Test pointer is valid before making any changes */
> +               if (put_user(p, purged))
> +                       return -EFAULT;
> +       }
> +
> +       ret = do_vrange(mm, start, end, mode, flags, &p);
> +
> +       if (purged) {
> +               if (put_user(p, purged)) {
> +                       /*
> +                        * This would be bad, since we've modified volatilty
> +                        * and the change in purged state would be lost.
> +                        */
> +                       WARN_ONCE(1, "vrange: purge state possibly lost\n");

Don't do that.
If userland app unmap the page between do_vrange and here, it's just
their fault, not kernel.
Therefore kernel warning make no sense. Please just move 1st put_user to here.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 1/5] vrange: Add vrange syscall and handle splitting/merging and marking vmas
@ 2014-03-23 16:50     ` KOSAKI Motohiro
  0 siblings, 0 replies; 112+ messages in thread
From: KOSAKI Motohiro @ 2014-03-23 16:50 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

Hi

On Fri, Mar 21, 2014 at 2:17 PM, John Stultz <john.stultz@linaro.org> wrote:
> This patch introduces the vrange() syscall, which allows for specifying
> ranges of memory as volatile, and able to be discarded by the system.
>
> This initial patch simply adds the syscall, and the vma handling,
> splitting and merging the vmas as needed, and marking them with
> VM_VOLATILE.
>
> No purging or discarding of volatile ranges is done at this point.
>
> Example man page:
>
> NAME
>         vrange - Mark or unmark range of memory as volatile
>
> SYNOPSIS
>         ssize_t vrange(unsigned_long start, size_t length,
>                          unsigned_long mode, unsigned_long flags,
>                          int *purged);
>
> DESCRIPTION
>         Applications can use vrange(2) to advise the kernel that pages
>         of an anonymous mapping in the given VM area can be reclaimed
>         without swapping (or can no longer be reclaimed without
>         swapping). The idea is that an application can help the kernel
>         with page reclaim under memory pressure by specifying data it
>         can easily regenerate, so the kernel can discard that data if
>         needed.
>
>         mode:
>         VRANGE_VOLATILE
>                 Informs the kernel that the VM can discard pages in
>                 the specified range when under memory pressure.
>         VRANGE_NONVOLATILE
>                 Informs the kernel that the VM can no longer discard pages
>                 in this range.
>
>         flags: Currently no flags are supported.
>
>         purged: Pointer to an integer which will return 1 if
>         mode == VRANGE_NONVOLATILE and any page in the affected range
>         was purged. If purged returns zero during a mode ==
>         VRANGE_NONVOLATILE call, it means all of the pages in the range
>         are intact.
>
>         If a process accesses volatile memory which has been purged, and
>         was not set non-volatile via a VRANGE_NONVOLATILE call, it
>         will receive a SIGBUS.
>
> RETURN VALUE
>         On success vrange returns the number of bytes marked or unmarked.
>         Similar to write(), it may return fewer bytes than specified
>         if it ran into a problem.

This explanation doesn't match your implementation: you return the end
of the last VMA processed minus orig_start. So when there is a hole in
the middle of the range, the bytes actually marked (or unmarked) don't
match the return value.


>
>         When using VRANGE_NONVOLATILE, if the return value is smaller
>         than the specified length, the value specified by the purged
>         pointer will be set to 1 if any of the pages covered by the
>         returned byte count that were successfully marked non-volatile
>         had been purged.
>
>         If an error is returned, no changes were made.

This explanation doesn't match the implementation either: when you hit a
file mapping partway through, you don't roll back the changes already
made.

>
> ERRORS
>         EINVAL This error can occur for the following reasons:
>                 * The value length is negative or not page size units.
>                 * addr is not page-aligned
>                 * mode not a valid value.
>                 * flags is not a valid value.
>
>         ENOMEM Not enough memory
>
>         ENOMEM Addresses in the specified range are not currently mapped,
>                or are outside the address space of the process.
>
>         EFAULT Purged pointer is invalid
>
> This a simplified implementation which reuses some of the logic
> from Minchan's earlier efforts. So credit to Minchan for his work.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/mm.h               |   1 +
>  include/linux/vrange.h           |   8 ++
>  mm/Makefile                      |   2 +-
>  mm/vrange.c                      | 173 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 184 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/vrange.h
>  create mode 100644 mm/vrange.c
>
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index a12bddc..7ae3940 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -322,6 +322,7 @@
>  313    common  finit_module            sys_finit_module
>  314    common  sched_setattr           sys_sched_setattr
>  315    common  sched_getattr           sys_sched_getattr
> +316    common  vrange                  sys_vrange
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c1b7414..a1f11da 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_IO           0x00004000     /* Memory mapped I/O or similar */
>
>                                         /* Used by sys_madvise() */
> +#define VM_VOLATILE    0x00001000      /* VMA is volatile */
>  #define VM_SEQ_READ    0x00008000      /* App will access data sequentially */
>  #define VM_RAND_READ   0x00010000      /* App will not benefit from clustered reads */
>
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> new file mode 100644
> index 0000000..6e5331e
> --- /dev/null
> +++ b/include/linux/vrange.h
> @@ -0,0 +1,8 @@
> +#ifndef _LINUX_VRANGE_H
> +#define _LINUX_VRANGE_H
> +
> +#define VRANGE_NONVOLATILE 0
> +#define VRANGE_VOLATILE 1

Maybe moving these definitions to a uapi header would be better?

> +#define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
> +
> +#endif /* _LINUX_VRANGE_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 310c90a..20229e2 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -16,7 +16,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
>                            readahead.o swap.o truncate.o vmscan.o shmem.o \
>                            util.o mmzone.o vmstat.o backing-dev.o \
>                            mm_init.o mmu_context.o percpu.o slab_common.o \
> -                          compaction.o balloon_compaction.o \
> +                          compaction.o balloon_compaction.o vrange.o \
>                            interval_tree.o list_lru.o $(mmu-y)
>
>  obj-y += init-mm.o
> diff --git a/mm/vrange.c b/mm/vrange.c
> new file mode 100644
> index 0000000..2f8e2ce
> --- /dev/null
> +++ b/mm/vrange.c
> @@ -0,0 +1,173 @@
> +#include <linux/syscalls.h>
> +#include <linux/vrange.h>
> +#include <linux/mm_inline.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +
> +/**
> + * do_vrange - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
> + *
> + * Core logic of sys_volatile. Iterates over the VMAs in the specified
> + * range, and marks or clears them as VM_VOLATILE, splitting or merging them
> + * as needed.
> + *
> + * Returns the number of bytes successfully modified.
> + *
> + * Returns error only if no bytes were modified.
> + */
> +static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
> +                               unsigned long end, unsigned long mode,
> +                               unsigned long flags, int *purged)
> +{
> +       struct vm_area_struct *vma, *prev;
> +       unsigned long orig_start = start;
> +       ssize_t count = 0, ret = 0;
> +
> +       down_read(&mm->mmap_sem);

This should be down_write(); VMA split and merge require the write lock.


> +
> +       vma = find_vma_prev(mm, start, &prev);
> +       if (vma && start > vma->vm_start)
> +               prev = vma;
> +
> +       for (;;) {
> +               unsigned long new_flags;
> +               pgoff_t pgoff;
> +               unsigned long tmp;
> +
> +               if (!vma)
> +                       goto out;
> +
> +               if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
> +                                       VM_HUGETLB))
> +                       goto out;
> +
> +               /* We don't support volatility on files for now */
> +               if (vma->vm_file) {
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
> +
> +               /* return ENOMEM if we're trying to mark unmapped pages */
> +               if (start < vma->vm_start) {
> +                       ret = -ENOMEM;
> +                       goto out;
> +               }
> +
> +               new_flags = vma->vm_flags;
> +
> +               tmp = vma->vm_end;
> +               if (end < tmp)
> +                       tmp = end;
> +
> +               switch (mode) {
> +               case VRANGE_VOLATILE:
> +                       new_flags |= VM_VOLATILE;
> +                       break;
> +               case VRANGE_NONVOLATILE:
> +                       new_flags &= ~VM_VOLATILE;
> +               }
> +
> +               pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> +               prev = vma_merge(mm, prev, start, tmp, new_flags,
> +                                       vma->anon_vma, vma->vm_file, pgoff,
> +                                       vma_policy(vma));
> +               if (prev)
> +                       goto success;
> +
> +               if (start != vma->vm_start) {
> +                       ret = split_vma(mm, vma, start, 1);
> +                       if (ret)
> +                               goto out;
> +               }
> +
> +               if (tmp != vma->vm_end) {
> +                       ret = split_vma(mm, vma, tmp, 0);
> +                       if (ret)
> +                               goto out;
> +               }
> +
> +               prev = vma;
> +success:
> +               vma->vm_flags = new_flags;
> +
> +               /* update count to distance covered so far*/
> +               count = tmp - orig_start;
> +
> +               start = tmp;
> +               if (start < prev->vm_end)
> +                       start = prev->vm_end;
> +               if (start >= end)
> +                       goto out;
> +               vma = prev->vm_next;
> +       }
> +out:
> +       up_read(&mm->mmap_sem);
> +
> +       /* report bytes successfully marked, even if we're exiting on error */
> +       if (count)
> +               return count;
> +
> +       return ret;
> +}
> +
> +
> +/**
> + * sys_vrange - Marks specified range as volatile or non-volatile.
> + *
> + * Validates the syscall inputs and calls do_vrange(), then copies the
> + * purged flag back out to userspace.
> + *
> + * Returns the number of bytes successfully modified.
> + * Returns error only if no bytes were modified.
> + */
> +SYSCALL_DEFINE5(vrange, unsigned long, start, size_t, len, unsigned long, mode,
> +                       unsigned long, flags, int __user *, purged)
> +{
> +       unsigned long end;
> +       struct mm_struct *mm = current->mm;
> +       ssize_t ret = -EINVAL;
> +       int p = 0;
> +
> +       if (flags & ~VRANGE_VALID_FLAGS)
> +               goto out;
> +
> +       if (start & ~PAGE_MASK)
> +               goto out;
> +
> +       len &= PAGE_MASK;
> +       if (!len)
> +               goto out;

This code doesn't match the man page's "not page size units" EINVAL
case: an unaligned length is silently rounded down here rather than
rejected.

> +
> +       end = start + len;
> +       if (end < start)
> +               goto out;
> +
> +       if (start >= TASK_SIZE)
> +               goto out;
> +
> +       if (purged) {
> +               /* Test pointer is valid before making any changes */
> +               if (put_user(p, purged))
> +                       return -EFAULT;
> +       }
> +
> +       ret = do_vrange(mm, start, end, mode, flags, &p);
> +
> +       if (purged) {
> +               if (put_user(p, purged)) {
> +                       /*
> +                        * This would be bad, since we've modified volatility
> +                        * and the change in purged state would be lost.
> +                        */
> +                       WARN_ONCE(1, "vrange: purge state possibly lost\n");

Don't do that.
If the userland app unmaps the page between do_vrange() and here, that's
the app's fault, not the kernel's, so a kernel warning makes no sense.
Please just move the first put_user() down to here.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-03-21 21:17   ` John Stultz
@ 2014-03-23 17:42     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 112+ messages in thread
From: KOSAKI Motohiro @ 2014-03-23 17:42 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

On Fri, Mar 21, 2014 at 2:17 PM, John Stultz <john.stultz@linaro.org> wrote:
> Users of volatile ranges will need to know if memory was discarded.
> This patch adds the purged state tracking required to inform userland
> when it marks memory as non-volatile that some memory in that range
> was purged and needs to be regenerated.
>
> This simplified implementation which uses some of the logic from
> Minchan's earlier efforts, so credit to Minchan for his work.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  include/linux/swap.h    | 15 ++++++++--
>  include/linux/swapops.h | 10 +++++++
>  include/linux/vrange.h  |  3 ++
>  mm/vrange.c             | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 101 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6..18c12f9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -70,8 +70,19 @@ static inline int current_is_kswapd(void)
>  #define SWP_HWPOISON_NUM 0
>  #endif
>
> -#define MAX_SWAPFILES \
> -       ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +
> +/*
> + * Purged volatile range pages
> + */
> +#define SWP_VRANGE_PURGED_NUM 1
> +#define SWP_VRANGE_PURGED (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
> +
> +
> +#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)      \
> +                               - SWP_MIGRATION_NUM     \
> +                               - SWP_HWPOISON_NUM      \
> +                               - SWP_VRANGE_PURGED_NUM \
> +                       )

This changes the hwpoison and migration tag numbers. Maybe that's OK,
maybe not. I'd suggest using a younger number than hwpoison's.
(That's why hwpoison uses a younger number than migration.)

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-03-21 21:17   ` John Stultz
@ 2014-03-23 17:50     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 112+ messages in thread
From: KOSAKI Motohiro @ 2014-03-23 17:50 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

> +/**
> + * vrange_check_purged_pte - Checks ptes for purged pages
> + *
> + * Iterates over the ptes in the pmd checking if they have
> + * purged swap entries.
> + *
> + * Sets the vrange_walker.pages_purged to 1 if any were purged.
> + */
> +static int vrange_check_purged_pte(pmd_t *pmd, unsigned long addr,
> +                                       unsigned long end, struct mm_walk *walk)
> +{
> +       struct vrange_walker *vw = walk->private;
> +       pte_t *pte;
> +       spinlock_t *ptl;
> +
> +       if (pmd_trans_huge(*pmd))
> +               return 0;
> +       if (pmd_trans_unstable(pmd))
> +               return 0;
> +
> +       pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> +       for (; addr != end; pte++, addr += PAGE_SIZE) {
> +               if (!pte_present(*pte)) {
> +                       swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
> +
> +                       if (unlikely(is_vpurged_entry(vrange_entry))) {
> +                               vw->page_was_purged = 1;
> +                               break;

This function only detects whether a vpurge entry exists. But
VRANGE_NONVOLATILE should remove all vpurge entries; otherwise a range
made non-volatile can still raise SIGBUS.

> +                       }
> +               }
> +       }
> +       pte_unmap_unlock(pte - 1, ptl);
> +       cond_resched();
> +
> +       return 0;
> +}

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-03-23 12:29     ` Jan Kara
@ 2014-03-23 20:21       ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-23 20:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, Michel Lespinasse, Minchan Kim,
	linux-mm

On Sun, Mar 23, 2014 at 5:29 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 21-03-14 14:17:32, John Stultz wrote:
>> + *
>> + * Sets the vrange_walker.pages_purged to 1 if any were purged.
>                               ^^^ page_was_purged

Doh. Thanks for catching this! Fixed in my tree.

Thanks so much for the review!
-john

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-03-23 17:50     ` KOSAKI Motohiro
@ 2014-03-23 20:26       ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-23 20:26 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

On Sun, Mar 23, 2014 at 10:50 AM, KOSAKI Motohiro
<kosaki.motohiro@gmail.com> wrote:
>> +/**
>> + * vrange_check_purged_pte - Checks ptes for purged pages
>> + *
>> + * Iterates over the ptes in the pmd checking if they have
>> + * purged swap entries.
>> + *
>> + * Sets the vrange_walker.pages_purged to 1 if any were purged.
>> + */
>> +static int vrange_check_purged_pte(pmd_t *pmd, unsigned long addr,
>> +                                       unsigned long end, struct mm_walk *walk)
>> +{
>> +       struct vrange_walker *vw = walk->private;
>> +       pte_t *pte;
>> +       spinlock_t *ptl;
>> +
>> +       if (pmd_trans_huge(*pmd))
>> +               return 0;
>> +       if (pmd_trans_unstable(pmd))
>> +               return 0;
>> +
>> +       pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>> +       for (; addr != end; pte++, addr += PAGE_SIZE) {
>> +               if (!pte_present(*pte)) {
>> +                       swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
>> +
>> +                       if (unlikely(is_vpurged_entry(vrange_entry))) {
>> +                               vw->page_was_purged = 1;
>> +                               break;
>
> This function only detects whether a vpurge entry exists or not. But
> VRANGE_NONVOLATILE should remove all vpurge entries.
> Otherwise, a range made non-volatile can still trigger SIGBUS.

So in the following patch (3/5), we only SIGBUS if the swap entry
is_vpurged_entry()  && the vma is still marked volatile, so this
shouldn't be an issue.

thanks
-john

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 1/5] vrange: Add vrange syscall and handle splitting/merging and marking vmas
  2014-03-23 12:20     ` Jan Kara
@ 2014-03-23 20:34       ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-23 20:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, Michel Lespinasse, Minchan Kim,
	linux-mm

On Sun, Mar 23, 2014 at 5:20 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 21-03-14 14:17:31, John Stultz wrote:
>> RETURN VALUE
>>       On success vrange returns the number of bytes marked or unmarked.
>>       Similar to write(), it may return fewer bytes then specified
>>       if it ran into a problem.
>>
>>       When using VRANGE_NON_VOLATILE, if the return value is smaller
>>       then the specified length, then the value specified by the purged
>         ^^^ than

Ah, thanks!

> Also I'm not sure why *purged is set only if the return value is smaller
> than the specified length. Won't the interface be more logical if we set
> *purged to the appropriate value in all cases?

So yes, we do set purged to the appropriate value in all cases. The
confusion here is that I'm trying to clarify that, in the case where the
return value is smaller than the requested length, the value of the
purged variable reflects only the purge state of the pages that were
successfully marked non-volatile. In other words, the purged value
provides no information about the requested pages beyond the returned
byte count. I'm clearly making a bit of a mess with the wording there
(and here probably as well ;). Any suggestions for clearer phrasing
would be appreciated.


>> +     ret = do_vrange(mm, start, end, mode, flags, &p);
>> +
>> +     if (purged) {
>> +             if (put_user(p, purged)) {
>> +                     /*
>> +                      * This would be bad, since we've modified volatilty
>> +                      * and the change in purged state would be lost.
>> +                      */
>> +                     WARN_ONCE(1, "vrange: purge state possibly lost\n");
>   I think this can happen when the application has several threads and
> vrange() in one thread races with munmap() in another thread. So
> WARN_ONCE() doesn't look appropriate (kernel shouldn't spew warnings about
> application programming bugs)... I'd just return -EFAULT. I know
> information will be lost but userspace is doing something utterly stupid.

Ok.. I guess that sounds reasonable.

Thanks for the review! Very much appreciate it!
-john

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-03-23 20:26       ` John Stultz
@ 2014-03-23 21:50         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 112+ messages in thread
From: KOSAKI Motohiro @ 2014-03-23 21:50 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

On Sun, Mar 23, 2014 at 1:26 PM, John Stultz <john.stultz@linaro.org> wrote:
> On Sun, Mar 23, 2014 at 10:50 AM, KOSAKI Motohiro
> <kosaki.motohiro@gmail.com> wrote:
>>> +/**
>>> + * vrange_check_purged_pte - Checks ptes for purged pages
>>> + *
>>> + * Iterates over the ptes in the pmd checking if they have
>>> + * purged swap entries.
>>> + *
>>> + * Sets the vrange_walker.pages_purged to 1 if any were purged.
>>> + */
>>> +static int vrange_check_purged_pte(pmd_t *pmd, unsigned long addr,
>>> +                                       unsigned long end, struct mm_walk *walk)
>>> +{
>>> +       struct vrange_walker *vw = walk->private;
>>> +       pte_t *pte;
>>> +       spinlock_t *ptl;
>>> +
>>> +       if (pmd_trans_huge(*pmd))
>>> +               return 0;
>>> +       if (pmd_trans_unstable(pmd))
>>> +               return 0;
>>> +
>>> +       pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>>> +       for (; addr != end; pte++, addr += PAGE_SIZE) {
>>> +               if (!pte_present(*pte)) {
>>> +                       swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
>>> +
>>> +                       if (unlikely(is_vpurged_entry(vrange_entry))) {
>>> +                               vw->page_was_purged = 1;
>>> +                               break;
>>
>> This function only detects whether a vpurge entry exists or not. But
>> VRANGE_NONVOLATILE should remove all vpurge entries.
>> Otherwise, a range made non-volatile can still trigger SIGBUS.
>
> So in the following patch (3/5), we only SIGBUS if the swap entry
> is_vpurged_entry()  && the vma is still marked volatile, so this
> shouldn't be an issue.

When a VOLATILE -> NON-VOLATILE -> VOLATILE transition happens, is
the page immediately marked "was purged"?

I don't understand how the vma check helps.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] vrange: Add page purging logic & SIGBUS trap
  2014-03-21 21:17   ` John Stultz
@ 2014-03-23 23:44     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 112+ messages in thread
From: KOSAKI Motohiro @ 2014-03-23 23:44 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

On Fri, Mar 21, 2014 at 2:17 PM, John Stultz <john.stultz@linaro.org> wrote:
> This patch adds the hooks in the vmscan logic to discard volatile pages
> and mark their pte as purged. With this, volatile pages will be purged
> under pressure, and their ptes' swap entries marked. If the purged pages
> are accessed before being marked non-volatile, we catch this and send a
> SIGBUS.
>
> This is a simplified implementation that uses logic from Minchan's earlier
> efforts, so credit to Minchan for his work.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  include/linux/vrange.h |   2 +
>  mm/internal.h          |   2 -
>  mm/memory.c            |  21 +++++++++
>  mm/rmap.c              |   5 +++
>  mm/vmscan.c            |  12 ++++++
>  mm/vrange.c            | 114 +++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 154 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> index 986fa85..d93ad21 100644
> --- a/include/linux/vrange.h
> +++ b/include/linux/vrange.h
> @@ -8,4 +8,6 @@
>  #define VRANGE_VOLATILE 1
>  #define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
>
> +extern int discard_vpage(struct page *page);
> +
>  #endif /* _LINUX_VRANGE_H */
> diff --git a/mm/internal.h b/mm/internal.h
> index 29e1e76..ea66bf9 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -225,10 +225,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>
>  extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern unsigned long vma_address(struct page *page,
>                                  struct vm_area_struct *vma);
> -#endif
>  #else /* !CONFIG_MMU */
>  static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
>  {
> diff --git a/mm/memory.c b/mm/memory.c
> index 22dfa61..db5f4da 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -60,6 +60,7 @@
>  #include <linux/migrate.h>
>  #include <linux/string.h>
>  #include <linux/dma-debug.h>
> +#include <linux/vrange.h>
>
>  #include <asm/io.h>
>  #include <asm/pgalloc.h>
> @@ -3643,6 +3644,8 @@ static int handle_pte_fault(struct mm_struct *mm,
>
>         entry = *pte;
>         if (!pte_present(entry)) {
> +               swp_entry_t vrange_entry;
> +retry:
>                 if (pte_none(entry)) {
>                         if (vma->vm_ops) {
>                                 if (likely(vma->vm_ops->fault))
> @@ -3652,6 +3655,24 @@ static int handle_pte_fault(struct mm_struct *mm,
>                         return do_anonymous_page(mm, vma, address,
>                                                  pte, pmd, flags);
>                 }
> +
> +               vrange_entry = pte_to_swp_entry(entry);
> +               if (unlikely(is_vpurged_entry(vrange_entry))) {
> +                       if (vma->vm_flags & VM_VOLATILE)
> +                               return VM_FAULT_SIGBUS;
> +
> +                       /* zap pte */
> +                       ptl = pte_lockptr(mm, pmd);
> +                       spin_lock(ptl);
> +                       if (unlikely(!pte_same(*pte, entry)))
> +                               goto unlock;
> +                       flush_cache_page(vma, address, pte_pfn(*pte));
> +                       ptep_clear_flush(vma, address, pte);
> +                       pte_unmap_unlock(pte, ptl);
> +                       goto retry;

This looks strange; why do we need to zap the pte here?

> +               }
> +
> +
>                 if (pte_file(entry))
>                         return do_nonlinear_fault(mm, vma, address,
>                                         pte, pmd, flags, entry);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d9d4231..2b6f079 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>                                 referenced++;
>                 }
>                 pte_unmap_unlock(pte, ptl);
> +               if (vma->vm_flags & VM_VOLATILE) {
> +                       pra->mapcount = 0;
> +                       pra->vm_flags |= VM_VOLATILE;
> +                       return SWAP_FAIL;
> +               }
>         }
>
>         if (referenced) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a9c74b4..34f159a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -43,6 +43,7 @@
>  #include <linux/sysctl.h>
>  #include <linux/oom.h>
>  #include <linux/prefetch.h>
> +#include <linux/vrange.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -683,6 +684,7 @@ enum page_references {
>         PAGEREF_RECLAIM,
>         PAGEREF_RECLAIM_CLEAN,
>         PAGEREF_KEEP,
> +       PAGEREF_DISCARD,

"discard" is already used in various places with other meanings;
another name would be better.

>         PAGEREF_ACTIVATE,
>  };
>
> @@ -703,6 +705,13 @@ static enum page_references page_check_references(struct page *page,
>         if (vm_flags & VM_LOCKED)
>                 return PAGEREF_RECLAIM;
>
> +       /*
> +        * If volatile page is reached on LRU's tail, we discard the
> +        * page without considering recycle the page.
> +        */
> +       if (vm_flags & VM_VOLATILE)
> +               return PAGEREF_DISCARD;
> +
>         if (referenced_ptes) {
>                 if (PageSwapBacked(page))
>                         return PAGEREF_ACTIVATE;
> @@ -930,6 +939,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                 switch (references) {
>                 case PAGEREF_ACTIVATE:
>                         goto activate_locked;
> +               case PAGEREF_DISCARD:
> +                       if (may_enter_fs && !discard_vpage(page))

Why is may_enter_fs needed? discard_vpage never enters the FS.


> +                               goto free_it;
>                 case PAGEREF_KEEP:
>                         goto keep_locked;
>                 case PAGEREF_RECLAIM:
> diff --git a/mm/vrange.c b/mm/vrange.c
> index 1ff3cbd..28ceb6f 100644
> --- a/mm/vrange.c
> +++ b/mm/vrange.c
> @@ -246,3 +246,117 @@ SYSCALL_DEFINE5(vrange, unsigned long, start, size_t, len, unsigned long, mode,
>  out:
>         return ret;
>  }
> +
> +
> +/**
> + * try_to_discard_one - Purge a volatile page from a vma
> + *
> + * Finds the pte for a page in a vma, marks the pte as purged
> + * and release the page.
> + */
> +static void try_to_discard_one(struct page *page, struct vm_area_struct *vma)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       pte_t *pte;
> +       pte_t pteval;
> +       spinlock_t *ptl;
> +       unsigned long addr;
> +
> +       VM_BUG_ON(!PageLocked(page));
> +
> +       addr = vma_address(page, vma);
> +       pte = page_check_address(page, mm, addr, &ptl, 0);
> +       if (!pte)
> +               return;
> +
> +       BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
> +
> +       flush_cache_page(vma, addr, page_to_pfn(page));
> +       pteval = ptep_clear_flush(vma, addr, pte);
> +
> +       update_hiwater_rss(mm);
> +       if (PageAnon(page))
> +               dec_mm_counter(mm, MM_ANONPAGES);
> +       else
> +               dec_mm_counter(mm, MM_FILEPAGES);
> +
> +       page_remove_rmap(page);
> +       page_cache_release(page);
> +
> +       set_pte_at(mm, addr, pte,
> +                               swp_entry_to_pte(make_vpurged_entry()));
> +
> +       pte_unmap_unlock(pte, ptl);
> +       mmu_notifier_invalidate_page(mm, addr);
> +
> +}
> +
> +/**
> + * try_to_discard_vpage - check vma chain and discard from vmas marked volatile
> + *
> + * Goes over all the vmas that hold a page, and where the vmas are volatile,
> + * purge the page from the vma.
> + *
> + * Returns 0 on success, -1 on error.
> + */
> +static int try_to_discard_vpage(struct page *page)
> +{
> +       struct anon_vma *anon_vma;
> +       struct anon_vma_chain *avc;
> +       pgoff_t pgoff;
> +
> +       anon_vma = page_lock_anon_vma_read(page);
> +       if (!anon_vma)
> +               return -1;
> +
> +       pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +       /*
> +        * During interating the loop, some processes could see a page as
> +        * purged while others could see a page as not-purged because we have
> +        * no global lock between parent and child for protecting vrange system
> +        * call during this loop. But it's not a problem because the page is
> +        * not *SHARED* page but *COW* page so parent and child can see other
> +        * data anytime. The worst case by this race is a page was purged
> +        * but couldn't be discarded so it makes unnecessary page fault but
> +        * it wouldn't be severe.
> +        */
> +       anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
> +               struct vm_area_struct *vma = avc->vma;
> +
> +               if (!(vma->vm_flags & VM_VOLATILE))
> +                       continue;

When you find a !VM_VOLATILE vma, there is no reason to continue pte
zapping, is there?


> +               try_to_discard_one(page, vma);
> +       }
> +       page_unlock_anon_vma_read(anon_vma);
> +       return 0;
> +}
> +
> +
> +/**
> + * discard_vpage - If possible, discard the specified volatile page
> + *
> + * Attempts to discard a volatile page, and if needed frees the swap page
> + *
> + * Returns 0 on success, -1 on error.
> + */
> +int discard_vpage(struct page *page)
> +{
> +       VM_BUG_ON(!PageLocked(page));
> +       VM_BUG_ON(PageLRU(page));
> +
> +       /* XXX - for now we only support anonymous volatile pages */
> +       if (!PageAnon(page))
> +               return -1;
> +
> +       if (!try_to_discard_vpage(page)) {
> +               if (PageSwapCache(page))
> +                       try_to_free_swap(page);

This looks strange; try_to_free_swap can't handle the vpurge pseudo entry.


> +
> +               if (page_freeze_refs(page, 1)) {

Where is the page_unfreeze_refs() that pairs with this?

> +                       unlock_page(page);
> +                       return 0;
> +               }
> +       }
> +
> +       return -1;
> +}
> --
> 1.8.3.2
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] vrange: Add page purging logic & SIGBUS trap
@ 2014-03-23 23:44     ` KOSAKI Motohiro
  0 siblings, 0 replies; 112+ messages in thread
From: KOSAKI Motohiro @ 2014-03-23 23:44 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

On Fri, Mar 21, 2014 at 2:17 PM, John Stultz <john.stultz@linaro.org> wrote:
> This patch adds the hooks in the vmscan logic to discard volatile pages
> and mark their pte as purged. With this, volatile pages will be purged
> under pressure, and their ptes swap entry's marked. If the purged pages
> are accessed before being marked non-volatile, we catch this and send a
> SIGBUS.
>
> This is a simplified implementation that uses logic from Minchan's earlier
> efforts, so credit to Minchan for his work.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  include/linux/vrange.h |   2 +
>  mm/internal.h          |   2 -
>  mm/memory.c            |  21 +++++++++
>  mm/rmap.c              |   5 +++
>  mm/vmscan.c            |  12 ++++++
>  mm/vrange.c            | 114 +++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 154 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> index 986fa85..d93ad21 100644
> --- a/include/linux/vrange.h
> +++ b/include/linux/vrange.h
> @@ -8,4 +8,6 @@
>  #define VRANGE_VOLATILE 1
>  #define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
>
> +extern int discard_vpage(struct page *page);
> +
>  #endif /* _LINUX_VRANGE_H */
> diff --git a/mm/internal.h b/mm/internal.h
> index 29e1e76..ea66bf9 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -225,10 +225,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>
>  extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern unsigned long vma_address(struct page *page,
>                                  struct vm_area_struct *vma);
> -#endif
>  #else /* !CONFIG_MMU */
>  static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
>  {
> diff --git a/mm/memory.c b/mm/memory.c
> index 22dfa61..db5f4da 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -60,6 +60,7 @@
>  #include <linux/migrate.h>
>  #include <linux/string.h>
>  #include <linux/dma-debug.h>
> +#include <linux/vrange.h>
>
>  #include <asm/io.h>
>  #include <asm/pgalloc.h>
> @@ -3643,6 +3644,8 @@ static int handle_pte_fault(struct mm_struct *mm,
>
>         entry = *pte;
>         if (!pte_present(entry)) {
> +               swp_entry_t vrange_entry;
> +retry:
>                 if (pte_none(entry)) {
>                         if (vma->vm_ops) {
>                                 if (likely(vma->vm_ops->fault))
> @@ -3652,6 +3655,24 @@ static int handle_pte_fault(struct mm_struct *mm,
>                         return do_anonymous_page(mm, vma, address,
>                                                  pte, pmd, flags);
>                 }
> +
> +               vrange_entry = pte_to_swp_entry(entry);
> +               if (unlikely(is_vpurged_entry(vrange_entry))) {
> +                       if (vma->vm_flags & VM_VOLATILE)
> +                               return VM_FAULT_SIGBUS;
> +
> +                       /* zap pte */
> +                       ptl = pte_lockptr(mm, pmd);
> +                       spin_lock(ptl);
> +                       if (unlikely(!pte_same(*pte, entry)))
> +                               goto unlock;
> +                       flush_cache_page(vma, address, pte_pfn(*pte));
> +                       ptep_clear_flush(vma, address, pte);
> +                       pte_unmap_unlock(pte, ptl);
> +                       goto retry;

This looks strange why we need zap pte here?

> +               }
> +
> +
>                 if (pte_file(entry))
>                         return do_nonlinear_fault(mm, vma, address,
>                                         pte, pmd, flags, entry);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d9d4231..2b6f079 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>                                 referenced++;
>                 }
>                 pte_unmap_unlock(pte, ptl);
> +               if (vma->vm_flags & VM_VOLATILE) {
> +                       pra->mapcount = 0;
> +                       pra->vm_flags |= VM_VOLATILE;
> +                       return SWAP_FAIL;
> +               }
>         }
>
>         if (referenced) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a9c74b4..34f159a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -43,6 +43,7 @@
>  #include <linux/sysctl.h>
>  #include <linux/oom.h>
>  #include <linux/prefetch.h>
> +#include <linux/vrange.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -683,6 +684,7 @@ enum page_references {
>         PAGEREF_RECLAIM,
>         PAGEREF_RECLAIM_CLEAN,
>         PAGEREF_KEEP,
> +       PAGEREF_DISCARD,

"discard" is already used in various places with other meanings;
a different name would be better.

>         PAGEREF_ACTIVATE,
>  };
>
> @@ -703,6 +705,13 @@ static enum page_references page_check_references(struct page *page,
>         if (vm_flags & VM_LOCKED)
>                 return PAGEREF_RECLAIM;
>
> +       /*
> +        * If a volatile page reaches the LRU's tail, we discard the
> +        * page without considering recycling it.
> +        */
> +       if (vm_flags & VM_VOLATILE)
> +               return PAGEREF_DISCARD;
> +
>         if (referenced_ptes) {
>                 if (PageSwapBacked(page))
>                         return PAGEREF_ACTIVATE;
> @@ -930,6 +939,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>                 switch (references) {
>                 case PAGEREF_ACTIVATE:
>                         goto activate_locked;
> +               case PAGEREF_DISCARD:
> +                       if (may_enter_fs && !discard_vpage(page))

Why is may_enter_fs needed here? discard_vpage() never enters the FS.


> +                               goto free_it;
>                 case PAGEREF_KEEP:
>                         goto keep_locked;
>                 case PAGEREF_RECLAIM:
> diff --git a/mm/vrange.c b/mm/vrange.c
> index 1ff3cbd..28ceb6f 100644
> --- a/mm/vrange.c
> +++ b/mm/vrange.c
> @@ -246,3 +246,117 @@ SYSCALL_DEFINE5(vrange, unsigned long, start, size_t, len, unsigned long, mode,
>  out:
>         return ret;
>  }
> +
> +
> +/**
> + * try_to_discard_one - Purge a volatile page from a vma
> + *
> + * Finds the pte for a page in a vma, marks the pte as purged
> + * and releases the page.
> + */
> +static void try_to_discard_one(struct page *page, struct vm_area_struct *vma)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       pte_t *pte;
> +       pte_t pteval;
> +       spinlock_t *ptl;
> +       unsigned long addr;
> +
> +       VM_BUG_ON(!PageLocked(page));
> +
> +       addr = vma_address(page, vma);
> +       pte = page_check_address(page, mm, addr, &ptl, 0);
> +       if (!pte)
> +               return;
> +
> +       BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
> +
> +       flush_cache_page(vma, addr, page_to_pfn(page));
> +       pteval = ptep_clear_flush(vma, addr, pte);
> +
> +       update_hiwater_rss(mm);
> +       if (PageAnon(page))
> +               dec_mm_counter(mm, MM_ANONPAGES);
> +       else
> +               dec_mm_counter(mm, MM_FILEPAGES);
> +
> +       page_remove_rmap(page);
> +       page_cache_release(page);
> +
> +       set_pte_at(mm, addr, pte,
> +                               swp_entry_to_pte(make_vpurged_entry()));
> +
> +       pte_unmap_unlock(pte, ptl);
> +       mmu_notifier_invalidate_page(mm, addr);
> +
> +}
> +
> +/**
> + * try_to_discard_vpage - check vma chain and discard from vmas marked volatile
> + *
> + * Goes over all the vmas that hold a page, and where the vmas are volatile,
> + * purges the page from the vma.
> + *
> + * Returns 0 on success, -1 on error.
> + */
> +static int try_to_discard_vpage(struct page *page)
> +{
> +       struct anon_vma *anon_vma;
> +       struct anon_vma_chain *avc;
> +       pgoff_t pgoff;
> +
> +       anon_vma = page_lock_anon_vma_read(page);
> +       if (!anon_vma)
> +               return -1;
> +
> +       pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +       /*
> +        * While iterating over the loop, some processes could see a page
> +        * as purged while others see it as not-purged, because we have no
> +        * global lock between parent and child protecting the vrange system
> +        * call during this loop. This is not a problem because the page is
> +        * not a *SHARED* page but a *COW* page, so parent and child may see
> +        * different data at any time. The worst case of this race is that a
> +        * page was marked purged but couldn't be discarded, causing an
> +        * unnecessary page fault, which is not severe.
> +        */
> +       anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
> +               struct vm_area_struct *vma = avc->vma;
> +
> +               if (!(vma->vm_flags & VM_VOLATILE))
> +                       continue;

When you find a !VM_VOLATILE vma, we have no reason to continue pte
zapping, do we?


> +               try_to_discard_one(page, vma);
> +       }
> +       page_unlock_anon_vma_read(anon_vma);
> +       return 0;
> +}
> +
> +
> +/**
> + * discard_vpage - If possible, discard the specified volatile page
> + *
> + * Attempts to discard a volatile page, and if needed frees the swap page
> + *
> + * Returns 0 on success, -1 on error.
> + */
> +int discard_vpage(struct page *page)
> +{
> +       VM_BUG_ON(!PageLocked(page));
> +       VM_BUG_ON(PageLRU(page));
> +
> +       /* XXX - for now we only support anonymous volatile pages */
> +       if (!PageAnon(page))
> +               return -1;
> +
> +       if (!try_to_discard_vpage(page)) {
> +               if (PageSwapCache(page))
> +                       try_to_free_swap(page);

This looks strange: try_to_free_swap() can't handle the vpurge pseudo entry.


> +
> +               if (page_freeze_refs(page, 1)) {

Where is the matching page_unfreeze_refs() for this?

> +                       unlock_page(page);
> +                       return 0;
> +               }
> +       }
> +
> +       return -1;
> +}
> --
> 1.8.3.2
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 4/5] vrange: Set affected pages referenced when marking volatile
  2014-03-21 21:17   ` John Stultz
@ 2014-03-24  0:01     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 112+ messages in thread
From: KOSAKI Motohiro @ 2014-03-24  0:01 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

On Fri, Mar 21, 2014 at 2:17 PM, John Stultz <john.stultz@linaro.org> wrote:
> One issue that some potential users were concerned about, was that
> they wanted to ensure that all the pages from one volatile range
> were purged before we purge pages from a different volatile range.
> This would prevent the case where they have 4 large objects, and
> the system purges one page from each object, causing all of the
> objects to have to be re-created.
>
> The counter-point to this case, is when an application is using the
> SIGBUS semantics to continue to access pages after they have been
> marked volatile. In that case, the desire was that the most recently
> touched pages be purged last, and only the "cold" pages be purged
> from the specified range.
>
> Instead of adding option flags for the various usage model (at least
> initially), one way of getting a solution for both uses would be to
> have the act of marking pages as volatile in effect mark the pages
> as accessed. Since all of the pages in the range would be marked
> together, they would be of the same "age" and would (approximately)
> be purged together. Further, if any pages in the range were accessed
> after being marked volatile, they would be moved to the end of the
> lru and be purged later.

If you run after two hares, you will catch neither. I suspect this patch
won't make any user happy.
I suggest aiming at the former case (object-level caching) first and
addressing the latter with a separate patch kit.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] vmscan: Age anonymous memory even when swap is off.
  2014-03-21 21:17   ` John Stultz
@ 2014-03-24 17:33     ` Rik van Riel
  -1 siblings, 0 replies; 112+ messages in thread
From: Rik van Riel @ 2014-03-24 17:33 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Andrew Morton, Android Kernel Team, Johannes Weiner, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Dmitry Adamushko,
	Neil Brown, Andrea Arcangeli, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, linux-mm

On 03/21/2014 05:17 PM, John Stultz wrote:
> Currently we don't shrink/scan the anonymous lrus when swap is off.
> This is problematic for volatile range purging on swapless systems.
>
> This patch naively changes the vmscan code to continue scanning
> and shrinking the lrus even when there is no swap.
>
> It obviously has performance issues.
>
> Thoughts on how best to implement this would be appreciated.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>   mm/vmscan.c | 26 ++++----------------------
>   1 file changed, 4 insertions(+), 22 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 34f159a..07b0a8c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -155,9 +155,8 @@ static unsigned long zone_reclaimable_pages(struct zone *zone)
>   	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
>   	     zone_page_state(zone, NR_INACTIVE_FILE);
>
> -	if (get_nr_swap_pages() > 0)
> -		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> -		      zone_page_state(zone, NR_INACTIVE_ANON);
> +	nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> +	      zone_page_state(zone, NR_INACTIVE_ANON);
>
>   	return nr;

Not all of the anonymous pages will be reclaimable.

Is there some counter that keeps track of how many
volatile range pages there are in each zone?


> @@ -1764,13 +1763,6 @@ static int inactive_anon_is_low_global(struct zone *zone)
>    */
>   static int inactive_anon_is_low(struct lruvec *lruvec)
>   {
> -	/*
> -	 * If we don't have swap space, anonymous page deactivation
> -	 * is pointless.
> -	 */
> -	if (!total_swap_pages)
> -		return 0;
> -
>   	if (!mem_cgroup_disabled())
>   		return mem_cgroup_inactive_anon_is_low(lruvec);

This part is correct, and needed.

> @@ -1880,12 +1872,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>   	if (!global_reclaim(sc))
>   		force_scan = true;
>
> -	/* If we have no swap space, do not bother scanning anon pages. */
> -	if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
> -		scan_balance = SCAN_FILE;
> -		goto out;
> -	}
> -
>   	/*

This part is too.

> @@ -2181,8 +2166,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
>   	 */
>   	pages_for_compaction = (2UL << sc->order);
>   	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
> -	if (get_nr_swap_pages() > 0)
> -		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +	inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +
>   	if (sc->nr_reclaimed < pages_for_compaction &&
>   			inactive_lru_pages > pages_for_compaction)

Not sure this is a good idea, since the pages may not actually
be reclaimable, and the inactive list will continue to be
refilled indefinitely...

If there was a counter of the number of volatile range pages
in a zone, this would be easier.

Of course, the overhead of keeping such a counter might be
too high for what volatile ranges are designed for...

>   		return true;
> @@ -2726,9 +2711,6 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
>   {
>   	struct mem_cgroup *memcg;
>
> -	if (!total_swap_pages)
> -		return;
> -

This bit is correct and needed.


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] vmscan: Age anonymous memory even when swap is off.
  2014-03-24 17:33     ` Rik van Riel
@ 2014-03-24 18:04       ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-03-24 18:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On Mon, Mar 24, 2014 at 10:33 AM, Rik van Riel <riel@redhat.com> wrote:
> On 03/21/2014 05:17 PM, John Stultz wrote:
>>
>> Currently we don't shrink/scan the anonymous lrus when swap is off.
>> This is problematic for volatile range purging on swapless systems.
>>
>> This patch naively changes the vmscan code to continue scanning
>> and shrinking the lrus even when there is no swap.
>>
>> It obviously has performance issues.
>>
>> Thoughts on how best to implement this would be appreciated.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Android Kernel Team <kernel-team@android.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Robert Love <rlove@google.com>
>> Cc: Mel Gorman <mel@csn.ul.ie>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Dave Hansen <dave@sr71.net>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
>> Cc: Neil Brown <neilb@suse.de>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Mike Hommey <mh@glandium.org>
>> Cc: Taras Glek <tglek@mozilla.com>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
>> Cc: Michel Lespinasse <walken@google.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
>> Signed-off-by: John Stultz <john.stultz@linaro.org>
>> ---
>>   mm/vmscan.c | 26 ++++----------------------
>>   1 file changed, 4 insertions(+), 22 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 34f159a..07b0a8c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -155,9 +155,8 @@ static unsigned long zone_reclaimable_pages(struct
>> zone *zone)
>>         nr = zone_page_state(zone, NR_ACTIVE_FILE) +
>>              zone_page_state(zone, NR_INACTIVE_FILE);
>>
>> -       if (get_nr_swap_pages() > 0)
>> -               nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>> -                     zone_page_state(zone, NR_INACTIVE_ANON);
>> +       nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>> +             zone_page_state(zone, NR_INACTIVE_ANON);
>>
>>         return nr;
>
>
> Not all of the anonymous pages will be reclaimable.
>
> Is there some counter that keeps track of how many
> volatile range pages there are in each zone?

So right, keeping statistics like NR_VOLATILE_PAGES (as well as
possibly NR_PURGED_VOLATILE_PAGES), would likely help here.

>> @@ -2181,8 +2166,8 @@ static inline bool should_continue_reclaim(struct
>> zone *zone,
>>          */
>>         pages_for_compaction = (2UL << sc->order);
>>         inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
>> -       if (get_nr_swap_pages() > 0)
>> -               inactive_lru_pages += zone_page_state(zone,
>> NR_INACTIVE_ANON);
>> +       inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
>> +
>>         if (sc->nr_reclaimed < pages_for_compaction &&
>>                         inactive_lru_pages > pages_for_compaction)
>
>
> Not sure this is a good idea, since the pages may not actually
> be reclaimable, and the inactive list will continue to be
> refilled indefinitely...
>
> If there was a counter of the number of volatile range pages
> in a zone, this would be easier.
>
> Of course, the overhead of keeping such a counter might be
> too high for what volatile ranges are designed for...

I started looking at something like this, but it runs into some
complexity when we're keeping volatility as a flag in the vma rather
than as a page state.

Also, even with a rough attempt at tracking of the number of volatile
pages, it seemed naively plugging that in for NR_INACTIVE_ANON here
was problematic, since we would scan for a shorter time, but
wouldn't necessarily find the volatile pages in that time, causing us
not to always purge the volatile pages.

Part of me starts to wonder if a new LRU for volatile pages would be
needed to really be efficient here, but then I worry the moving of the
pages back and forth might be too expensive.

Thanks so much for the review and comments!
-john

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-03-21 21:17 ` John Stultz
@ 2014-04-01 21:21   ` Johannes Weiner
  -1 siblings, 0 replies; 112+ messages in thread
From: Johannes Weiner @ 2014-04-01 21:21 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, H. Peter Anvin, linux-mm

[ I tried to bring this up during LSFMM but it got drowned out.
  Trying again :) ]

On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote:
> Optimistic method:
> 1) Userland marks a large range of data as volatile
> 2) Userland continues to access the data as it needs.
> 3) If userland accesses a page that has been purged, the kernel will
> send a SIGBUS
> 4) Userspace can trap the SIGBUS, mark the affected pages as
> non-volatile, and refill the data as needed before continuing on
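
The quoted optimistic flow (steps 3 and 4) can be sketched in plain
userspace C. This is only a hedged illustration, not the proposed vrange
API: SIGBUS from touching a file-backed mapping past end-of-file stands in
for an access to a purged volatile page, and the handler "refills" the page
by re-providing backing store before the faulting instruction is retried.
The names `demo()`, `on_sigbus()` and the tmpfile trick are invented for
this sketch.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static int g_fd;
static long g_pg;
static volatile sig_atomic_t g_refills;

/* SIGBUS handler standing in for step 4: "refill" the purged page by
 * extending the backing file, then return so the access is retried.
 * ftruncate() is async-signal-safe per POSIX. */
static void on_sigbus(int sig)
{
	(void)sig;
	g_refills++;
	if (ftruncate(g_fd, g_pg) < 0)
		_exit(1);
}

/* Returns 0 when the trapped access was transparently recovered. */
static int demo(void)
{
	g_pg = sysconf(_SC_PAGESIZE);
	FILE *tf = tmpfile();
	if (!tf)
		return -1;
	g_fd = fileno(tf);
	/* One page mapped over a zero-length file: any access raises
	 * SIGBUS, much like touching a purged volatile page would. */
	char *p = mmap(NULL, g_pg, PROT_READ | PROT_WRITE,
		       MAP_SHARED, g_fd, 0);
	if (p == MAP_FAILED)
		return -1;
	signal(SIGBUS, on_sigbus);
	p[0] = 'x';	/* faults once; handler refills; retry succeeds */
	int ok = (g_refills == 1 && p[0] == 'x');
	munmap(p, g_pg);
	fclose(tf);
	return ok ? 0 : -1;
}
```

Returning from the handler after restoring backing store makes the CPU
retry the faulting access, which is exactly the transparency the optimistic
method relies on for ordinary loads and stores.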

As far as I understand, if a pointer to volatile memory makes it into
a syscall and the fault is trapped in kernel space, there won't be a
SIGBUS, the syscall will just return -EFAULT.

Handling this would mean annotating every syscall invocation to check
for -EFAULT, refill the data, and then restart the syscall.  This is
complicated even before taking external libraries into account, which
may not propagate syscall returns properly or may not be reentrant at
the necessary granularity.

Another option is to never pass volatile memory pointers into the
kernel, but that too means that knowledge of volatility has to travel
alongside the pointers, which will either result in more complexity
throughout the application or severely limited scope of volatile
memory usage.

Either way, optimistic volatile pointers are nowhere near as
transparent to the application as the above description suggests,
which makes this usecase not very interesting, IMO.  If we can support
it at little cost, why not, but I don't think we should complicate the
common usecases to support this one.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-01 21:21   ` Johannes Weiner
@ 2014-04-01 21:34     ` H. Peter Anvin
  -1 siblings, 0 replies; 112+ messages in thread
From: H. Peter Anvin @ 2014-04-01 21:34 UTC (permalink / raw)
  To: Johannes Weiner, John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> [ I tried to bring this up during LSFMM but it got drowned out.
>   Trying again :) ]
> 
> On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote:
>> Optimistic method:
>> 1) Userland marks a large range of data as volatile
>> 2) Userland continues to access the data as it needs.
>> 3) If userland accesses a page that has been purged, the kernel will
>> send a SIGBUS
>> 4) Userspace can trap the SIGBUS, mark the affected pages as
>> non-volatile, and refill the data as needed before continuing on
> 
> As far as I understand, if a pointer to volatile memory makes it into
> a syscall and the fault is trapped in kernel space, there won't be a
> SIGBUS, the syscall will just return -EFAULT.
> 
> Handling this would mean annotating every syscall invocation to check
> for -EFAULT, refill the data, and then restart the syscall.  This is
> complicated even before taking external libraries into account, which
> may not propagate syscall returns properly or may not be reentrant at
> the necessary granularity.
> 
> Another option is to never pass volatile memory pointers into the
> kernel, but that too means that knowledge of volatility has to travel
> alongside the pointers, which will either result in more complexity
> throughout the application or severely limited scope of volatile
> memory usage.
> 
> Either way, optimistic volatile pointers are nowhere near as
> transparent to the application as the above description suggests,
> which makes this usecase not very interesting, IMO.  If we can support
> it at little cost, why not, but I don't think we should complicate the
> common usecases to support this one.
> 

The whole EFAULT thing is a fundamental problem with the kernel
interface.  This is not in any way the only place where this suffers.

The fact that we cannot reliably get SIGSEGV or SIGBUS because something
may have been passed as a system call is an enormous problem.  The
question is if it is in any way fixable.

	-hpa


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-01 21:21   ` Johannes Weiner
@ 2014-04-01 21:35     ` H. Peter Anvin
  -1 siblings, 0 replies; 112+ messages in thread
From: H. Peter Anvin @ 2014-04-01 21:35 UTC (permalink / raw)
  To: Johannes Weiner, John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> 
> Either way, optimistic volatile pointers are nowhere near as
> transparent to the application as the above description suggests,
> which makes this usecase not very interesting, IMO.
> 

... however, I think you're still derating the value way too much.  The
case of user space doing elastic memory management is more and more
common, and for a lot of those applications it is perfectly reasonable
to either not do system calls or to have to devolatilize first.

	-hpa


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-01 21:35     ` H. Peter Anvin
@ 2014-04-01 23:01       ` Dave Hansen
  -1 siblings, 0 replies; 112+ messages in thread
From: Dave Hansen @ 2014-04-01 23:01 UTC (permalink / raw)
  To: H. Peter Anvin, Johannes Weiner, John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Rik van Riel, Dmitry Adamushko,
	Neil Brown, Andrea Arcangeli, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, linux-mm

On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
>> Either way, optimistic volatile pointers are nowhere near as
>> transparent to the application as the above description suggests,
>> which makes this usecase not very interesting, IMO.
> 
> ... however, I think you're still derating the value way too much.  The
> case of user space doing elastic memory management is more and more
> common, and for a lot of those applications it is perfectly reasonable
> to either not do system calls or to have to devolatilize first.

The SIGBUS is only in cases where the memory is set as volatile and
_then_ accessed, right?

John, this was something that the Mozilla guys asked for, right?  Any
idea why this isn't ever a problem for them?

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-01 21:21   ` Johannes Weiner
@ 2014-04-02  4:03     ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-02  4:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, H. Peter Anvin, linux-mm

On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> [ I tried to bring this up during LSFMM but it got drowned out.
>   Trying again :) ]
>
> On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote:
>> Optimistic method:
>> 1) Userland marks a large range of data as volatile
>> 2) Userland continues to access the data as it needs.
>> 3) If userland accesses a page that has been purged, the kernel will
>> send a SIGBUS
>> 4) Userspace can trap the SIGBUS, mark the affected pages as
>> non-volatile, and refill the data as needed before continuing on
> As far as I understand, if a pointer to volatile memory makes it into
> a syscall and the fault is trapped in kernel space, there won't be a
> SIGBUS, the syscall will just return -EFAULT.
>
> Handling this would mean annotating every syscall invocation to check
> for -EFAULT, refill the data, and then restart the syscall.  This is
> complicated even before taking external libraries into account, which
> may not propagate syscall returns properly or may not be reentrant at
> the necessary granularity.
>
> Another option is to never pass volatile memory pointers into the
> kernel, but that too means that knowledge of volatility has to travel
> alongside the pointers, which will either result in more complexity
> throughout the application or severely limited scope of volatile
> memory usage.
>
> Either way, optimistic volatile pointers are nowhere near as
> transparent to the application as the above description suggests,
> which makes this usecase not very interesting, IMO.  If we can support
> it at little cost, why not, but I don't think we should complicate the
> common usecases to support this one.

So yea, thanks again for all the feedback at LSF-MM! I'm trying to get
things integrated for a v13 here shortly (although with visitors in town
this week it may not happen until next week).


So, maybe it's best to start by ignoring the fact that folks want to do
semi-crazy user-space faulting via SIGBUS. Let's look at the semantics
of the "normal" usage: mark volatile, never touch the pages until you
mark them non-volatile - basically where accessing volatile pages is
similar to a use-after-free bug.

So, for the most part, I'd say the proposed SIGBUS semantics don't
complicate things for this basic use-case, at least when compared with
things like zero-fill.  If an application accidentally accesses a
purged volatile page, I think SIGBUS is the right thing to do. It most
likely crashes immediately, but that's better than carrying on with
silent corruption because it's mucking with zero-filled pages.

So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
you have a third option you're thinking of, I'd of course be interested
in hearing it.

Now... once you've chosen SIGBUS semantics, there will be folks who will
try to exploit the fact that we get SIGBUS on purged page access (at
least on the user-space side) and will try to access pages that are
volatile until they are purged and try to then handle the SIGBUS to fix
things up. Those folks exploiting that will have to be particularly
careful not to pass volatile data to the kernel, and if they do they'll
have to be smart enough to handle the EFAULT, etc. That's really all
their problem, because they're being clever. :)

I've maybe made a mistake in talking at length about those use cases.
I wanted to make sure folks didn't have suggestions on how to better
address them (so far I've not heard any), and it sort of helps wrap
folks' heads around at least some of the potential variations on the
desired purging semantics (LRU-based cold-page purging vs. entire-object
purging).

Now, one other potential variant, which Keith brought up at LSF-MM, and
others have mentioned before, is to have *any* volatile page access
(purged or not) return a SIGBUS. This seems "safe" in that it protects
developers from themselves, and makes application behavior more
deterministic (rather than depending on memory pressure). However, it
also has the overhead of setting up the pte swp entries for each page in
order to trip the SIGBUS.  Since folks have explicitly asked for it,
allowing non-purged volatile page access seems more flexible. And it's
cheaper. So that's what I've been leaning towards.

thanks again!
-john



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02  4:03     ` John Stultz
@ 2014-04-02  4:07       ` H. Peter Anvin
  -1 siblings, 0 replies; 112+ messages in thread
From: H. Peter Anvin @ 2014-04-02  4:07 UTC (permalink / raw)
  To: John Stultz, Johannes Weiner
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On 04/01/2014 09:03 PM, John Stultz wrote:
> 
> So, maybe it's best to start by ignoring the fact that folks want to do
> semi-crazy user-space faulting via SIGBUS. Let's look at the semantics
> of the "normal" usage: mark volatile, never touch the pages until you
> mark them non-volatile - basically where accessing volatile pages is
> similar to a use-after-free bug.
> 
> So, for the most part, I'd say the proposed SIGBUS semantics don't
> complicate things for this basic use-case, at least when compared with
> things like zero-fill.  If an application accidentally accesses a
> purged volatile page, I think SIGBUS is the right thing to do. It most
> likely crashes immediately, but that's better than carrying on with
> silent corruption because it's mucking with zero-filled pages.
> 
> So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> you have a third option you're thinking of, I'd of course be interested
> in hearing it.
> 

People already do SIGBUS for mmap, so there is nothing new here.

> Now... once you've chosen SIGBUS semantics, there will be folks who will
> try to exploit the fact that we get SIGBUS on purged page access (at
> least on the user-space side) and will try to access pages that are
> volatile until they are purged and try to then handle the SIGBUS to fix
> things up. Those folks exploiting that will have to be particularly
> careful not to pass volatile data to the kernel, and if they do they'll
> have to be smart enough to handle the EFAULT, etc. That's really all
> their problem, because they're being clever. :)

Yep.

	-hpa


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-01 23:01       ` Dave Hansen
@ 2014-04-02  4:12         ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-02  4:12 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Johannes Weiner
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Rik van Riel, Dmitry Adamushko,
	Neil Brown, Andrea Arcangeli, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, linux-mm

On 04/01/2014 04:01 PM, Dave Hansen wrote:
> On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
>> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
>>> Either way, optimistic volatile pointers are nowhere near as
>>> transparent to the application as the above description suggests,
>>> which makes this usecase not very interesting, IMO.
>> ... however, I think you're still derating the value way too much.  The
>> case of user space doing elastic memory management is more and more
>> common, and for a lot of those applications it is perfectly reasonable
>> to either not do system calls or to have to devolatilize first.
> The SIGBUS is only in cases where the memory is set as volatile and
> _then_ accessed, right?
Not just set volatile and then accessed, but when a volatile page has
been purged and then accessed without being made non-volatile.


> John, this was something that the Mozilla guys asked for, right?  Any
> idea why this isn't ever a problem for them?
So one of their use cases for it is for library text. Basically they
want to decompress a compressed library file into memory. Then they plan
to mark the uncompressed pages volatile, and then be able to call into
it. Ideally for them, the kernel would only purge cold pages, leaving
the hot pages in memory. When they traverse a purged page, they handle
the SIGBUS and patch the page up.

Now... this is not what I'd consider a normal use case, but I was hoping
to illustrate some of the more interesting uses and demonstrate the
interface's flexibility.

Also, it provided a clear example of the benefits of doing LRU-based
cold-page purging rather than full-object purging. Though I think the
same could be demonstrated in a simpler case: a large cache of objects
that the application wants to mark volatile in one pass, unmarking
sub-objects as it needs them.
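
That simpler cache case might look like the sketch below.  vrange()
here is a stub with guessed mode values, standing in for the proposed
syscall (whose final signature and flags were still under discussion);
the point is the one-pass mark, per-object unmark-and-check pattern:

```c
#include <stddef.h>

#define VR_VOLATILE    1	/* assumed mode values, not from the patch */
#define VR_NONVOLATILE 0

/* Stub for the proposed syscall; the real one would report whether
 * any page in the range had been purged via *purged. */
static int vrange(void *start, size_t len, int mode, int *purged)
{
	(void)start; (void)len; (void)mode;
	if (purged)
		*purged = 0;
	return 0;
}

struct cache {
	char  *base;	/* large derived/decompressed buffer */
	size_t size;
	size_t objsz;	/* fixed-size sub-objects for simplicity */
};

/* One pass over the whole cache when it goes idle. */
static int cache_mark_idle(struct cache *c)
{
	return vrange(c->base, c->size, VR_VOLATILE, NULL);
}

/* Pin one sub-object before use; regenerate it if it was purged. */
static char *cache_get(struct cache *c, size_t idx)
{
	char *obj = c->base + idx * c->objsz;
	int purged = 0;

	if (vrange(obj, c->objsz, VR_NONVOLATILE, &purged) < 0)
		return NULL;
	if (purged) {
		/* the object was purged: regenerate it from its
		 * original source before use */
	}
	return obj;
}
```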

thanks
-john


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02  4:03     ` John Stultz
@ 2014-04-02 16:30       ` Johannes Weiner
  -1 siblings, 0 replies; 112+ messages in thread
From: Johannes Weiner @ 2014-04-02 16:30 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, H. Peter Anvin, linux-mm

On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> > [ I tried to bring this up during LSFMM but it got drowned out.
> >   Trying again :) ]
> >
> > On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote:
> >> Optimistic method:
> >> 1) Userland marks a large range of data as volatile
> >> 2) Userland continues to access the data as it needs.
> >> 3) If userland accesses a page that has been purged, the kernel will
> >> send a SIGBUS
> >> 4) Userspace can trap the SIGBUS, mark the affected pages as
> >> non-volatile, and refill the data as needed before continuing on
> > As far as I understand, if a pointer to volatile memory makes it into
> > a syscall and the fault is trapped in kernel space, there won't be a
> > SIGBUS, the syscall will just return -EFAULT.
> >
> > Handling this would mean annotating every syscall invocation to check
> > for -EFAULT, refill the data, and then restart the syscall.  This is
> > complicated even before taking external libraries into account, which
> > may not propagate syscall returns properly or may not be reentrant at
> > the necessary granularity.
> >
> > Another option is to never pass volatile memory pointers into the
> > kernel, but that too means that knowledge of volatility has to travel
> > alongside the pointers, which will either result in more complexity
> > throughout the application or severely limited scope of volatile
> > memory usage.
> >
> > Either way, optimistic volatile pointers are nowhere near as
> > transparent to the application as the above description suggests,
> > which makes this usecase not very interesting, IMO.  If we can support
> > it at little cost, why not, but I don't think we should complicate the
> > common usecases to support this one.
> 
> So yeah, thanks again for all the feedback at LSF-MM! I'm trying to get
> things integrated for a v13 here shortly (although with visitors in town
> this week it may not happen until next week).
> 
> 
> So, maybe it's best to ignore the fact that folks want to do semi-crazy
> user-space faulting via SIGBUS, at least to start with. Let's look at the
> semantics for the "normal" case: mark volatile, never touch the pages
> until you mark them non-volatile - basically where accessing volatile
> pages is similar to a use-after-free bug.
> 
> So, for the most part, I'd say the proposed SIGBUS semantics don't
> complicate things for this basic use case, at least when compared with
> things like zero-fill.  If an application accidentally accesses a
> purged volatile page, I think SIGBUS is the right thing to do. It will
> most likely crash immediately, but that's better than it moving along
> with silent corruption because it's mucking with zero-filled pages.
> 
> So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> you have a third option you're thinking of, I'd of course be interested
> in hearing it.

The reason I'm bringing this up again is that I see very few solid
usecases for a separate vrange() syscall once we have something
like MADV_FREE and MADV_REVIVE, which respectively clear the dirty
bits of a range of anon/tmpfs pages, and set them again while
reporting whether any pages in the given range were purged.
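A toy model of those proposed semantics (MADV_REVIVE was never merged,
so the names and behavior here are assumptions drawn purely from the
description above): free clears the dirty bit, the reclaimer may drop
clean pages, and revive re-dirties the range while reporting whether
anything was lost.

```python
# Toy model of the proposed MADV_FREE / MADV_REVIVE semantics:
# MADV_FREE clears the dirty bit on each page; the reclaimer may drop
# clean pages; MADV_REVIVE re-dirties the range and reports whether
# any page in it was purged (so the caller knows to refill).

class AnonRange:
    def __init__(self, npages):
        self.dirty = [True] * npages
        self.present = [True] * npages

    def madv_free(self, start, end):
        for i in range(start, end):
            self.dirty[i] = False        # page is now reclaimable

    def reclaim(self):                   # kernel, under memory pressure
        for i, d in enumerate(self.dirty):
            if not d:
                self.present[i] = False  # clean pages may be dropped

    def madv_revive(self, start, end):
        purged = any(not self.present[i] for i in range(start, end))
        for i in range(start, end):
            self.dirty[i] = True
            self.present[i] = True
        return purged                    # True -> caller must refill

r = AnonRange(4)
r.madv_free(0, 2)
r.reclaim()
assert r.madv_revive(0, 2) is True   # range lost pages: refill needed
assert r.madv_revive(2, 4) is False  # untouched range survived intact
```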

So between zero-fill and SIGBUS, I'd prefer the one which results in
the simpler user interface / fewer system calls.
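For comparison, the "optimistic method" quoted above can likewise be
modeled in miniature (again purely illustrative; the PurgedPage
exception stands in for SIGBUS delivery, and the names are made up):

```python
# Toy model of the optimistic pattern: access volatile data directly,
# and only devolatilize + refill when a purged page is hit (the
# userspace analogue of trapping SIGBUS and patching the page up).

class PurgedPage(Exception):
    pass

class VolatileCache:
    def __init__(self, data):
        self.data = dict(enumerate(data))
        self.volatile = set(self.data)

    def purge(self, idx):                # kernel, under memory pressure
        if idx in self.volatile:
            del self.data[idx]

    def read(self, idx):                 # raises, like SIGBUS on access
        if idx not in self.data:
            raise PurgedPage(idx)
        return self.data[idx]

    def read_optimistic(self, idx, refill):
        try:
            return self.read(idx)
        except PurgedPage:
            self.volatile.discard(idx)   # mark non-volatile first...
            self.data[idx] = refill(idx) # ...then regenerate the data
            return self.data[idx]

c = VolatileCache(["a", "b", "c"])
c.purge(1)                               # page 1 is reclaimed
assert c.read_optimistic(1, lambda i: "b2") == "b2"
assert 1 not in c.volatile               # page 1 is now non-volatile
```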

^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 16:30       ` Johannes Weiner
@ 2014-04-02 16:32         ` H. Peter Anvin
  -1 siblings, 0 replies; 112+ messages in thread
From: H. Peter Anvin @ 2014-04-02 16:32 UTC (permalink / raw)
  To: Johannes Weiner, John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On 04/02/2014 09:30 AM, Johannes Weiner wrote:
> 
> So between zero-fill and SIGBUS, I'd prefer the one which results in
> the simpler user interface / fewer system calls.
> 

The use cases are different; I believe this should be a user space option.

	-hpa


^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02  4:12         ` John Stultz
@ 2014-04-02 16:36           ` Johannes Weiner
  -1 siblings, 0 replies; 112+ messages in thread
From: Johannes Weiner @ 2014-04-02 16:36 UTC (permalink / raw)
  To: John Stultz
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
> On 04/01/2014 04:01 PM, Dave Hansen wrote:
> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> >>> Either way, optimistic volatile pointers are nowhere near as
> >>> transparent to the application as the above description suggests,
> >>> which makes this usecase not very interesting, IMO.
> >> ... however, I think you're still derating the value way too much.  The
> >> case of user space doing elastic memory management is more and more
> >> common, and for a lot of those applications it is perfectly reasonable
> >> to either not do system calls or to have to devolatilize first.
> > The SIGBUS is only in cases where the memory is set as volatile and
> > _then_ accessed, right?
> Not just set volatile and then accessed, but when a volatile page has
> been purged and then accessed without being made non-volatile.
> 
> 
> > John, this was something that the Mozilla guys asked for, right?  Any
> > idea why this isn't ever a problem for them?
> So one of their use cases for it is for library text. Basically they
> want to decompress a compressed library file into memory. Then they plan
> to mark the uncompressed pages volatile, and then be able to call into
> it. Ideally for them, the kernel would only purge cold pages, leaving
> the hot pages in memory. When they traverse a purged page, they handle
> the SIGBUS and patch the page up.

How big are these libraries compared to overall system size?

> Now... this is not what I'd consider a normal use case, but I was hoping
> to illustrate some of the more interesting uses and demonstrate the
> interface's flexibility.

I'm just dying to hear a "normal" use case then. :)

> Also it provided a clear example of the benefits of doing LRU-based
> cold-page purging rather than full object purging. Though I think the
> same could be demonstrated in the simpler case of a large cache of
> objects that the application wants to mark volatile in one pass,
> unmarking sub-objects as it needs them.

Agreed.

^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 16:32         ` H. Peter Anvin
  (?)
@ 2014-04-02 16:37         ` H. Peter Anvin
  2014-04-02 17:18             ` Johannes Weiner
  -1 siblings, 1 reply; 112+ messages in thread
From: H. Peter Anvin @ 2014-04-02 16:37 UTC (permalink / raw)
  To: Johannes Weiner, John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On 04/02/2014 09:32 AM, H. Peter Anvin wrote:
> On 04/02/2014 09:30 AM, Johannes Weiner wrote:
>>
>> So between zero-fill and SIGBUS, I'd prefer the one which results in
>> the simpler user interface / fewer system calls.
>>
> 
> The use cases are different; I believe this should be a user space option.
> 

Case in point, for example: imagine a JIT.  You *really* don't want to
zero-fill memory behind the back of your JIT, as all-zero memory may not
be a trapping instruction (it isn't on x86, for example, and if you are
unlucky you may be modifying *part* of an instruction).

Thus, SIGBUS is the only safe option.

	-hpa



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 16:37         ` H. Peter Anvin
@ 2014-04-02 17:18             ` Johannes Weiner
  0 siblings, 0 replies; 112+ messages in thread
From: Johannes Weiner @ 2014-04-02 17:18 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On Wed, Apr 02, 2014 at 09:37:49AM -0700, H. Peter Anvin wrote:
> On 04/02/2014 09:32 AM, H. Peter Anvin wrote:
> > On 04/02/2014 09:30 AM, Johannes Weiner wrote:
> >>
> >> So between zero-fill and SIGBUS, I'd prefer the one which results in
> >> the simpler user interface / fewer system calls.
> >>
> > 
> > The use cases are different; I believe this should be a user space option.
> > 
> 
> Case in point, for example: imagine a JIT.  You *really* don't want to
> zero-fill memory behind the back of your JIT, as all zero memory may not
> be a trapping instruction (it isn't on x86, for example, and if you are
> unlucky you may be modifying *part* of an instruction.)

Yes, and I think this would be comparable to the compressed-library
usecase that John mentioned.  What's special about these cases is that
the accesses are no longer under control of the application because
it's literally code that the CPU jumps into.  It is obvious to me that
such a usecase would require SIGBUS handling.  However, it seems that
in any usecase *besides* executable code caches, userspace would have
the ability to mark the pages non-volatile ahead of time, and thus not
require SIGBUS delivery.

Hence my follow-up question in the other mail about how large we
expect such code caches to become in practice in relation to
overall system memory.  Are code caches interesting reclaim candidates
to begin with?  Are they big enough to make the machine thrash/swap
otherwise?


^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 17:18             ` Johannes Weiner
@ 2014-04-02 17:40               ` Dave Hansen
  -1 siblings, 0 replies; 112+ messages in thread
From: Dave Hansen @ 2014-04-02 17:40 UTC (permalink / raw)
  To: Johannes Weiner, H. Peter Anvin
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On 04/02/2014 10:18 AM, Johannes Weiner wrote:
> Hence my follow-up question in the other mail about how large we
> expect such code caches to become in practice in relationship to
> overall system memory.  Are code caches interesting reclaim candidates
> to begin with?  Are they big enough to make the machine thrash/swap
> otherwise?

A big chunk of the use cases here are for swapless systems anyway, so
this is the *only* way for them to reclaim anonymous memory.  Their
choices are either to be constantly throwing away and rebuilding these
objects, or to leave them in memory effectively pinned.

In practice I did see ashmem (the Android thing that we're trying to
replace) get used a lot by the Android web browser when I was playing
with it.  John said that it got used for storing decompressed copies of
images.

^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 16:36           ` Johannes Weiner
@ 2014-04-02 17:40             ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-02 17:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
>> On 04/01/2014 04:01 PM, Dave Hansen wrote:
>> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
>> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
>> > John, this was something that the Mozilla guys asked for, right?  Any
>> > idea why this isn't ever a problem for them?
>> So one of their use cases for it is for library text. Basically they
>> want to decompress a compressed library file into memory. Then they plan
>> to mark the uncompressed pages volatile, and then be able to call into
>> it. Ideally for them, the kernel would only purge cold pages, leaving
>> the hot pages in memory. When they traverse a purged page, they handle
>> the SIGBUS and patch the page up.
>
> How big are these libraries compared to overall system size?

Mike or Taras would have to refresh my memory on this detail. My
recollection is it mostly has to do with keeping the on-disk size of
the library small, so it can load off of slow media very quickly.

>> Now.. this is not what I'd consider a normal use case, but was hoping to
>> illustrate some of the more interesting uses and demonstrate the
>> interfaces flexibility.
>
> I'm just dying to hear a "normal" use case then. :)

So the more "normal" use case would be marking objects volatile and
then non-volatile w/o accessing them in-between. In this case the
zero-fill vs SIGBUS semantics don't really matter; it's really just a
trade-off in how we handle applications deviating (intentionally or
not) from this use case.

So to maybe flesh out the context here for folks who are following
along (but weren't in the hallway at LSF :),  Johannes made a fairly
interesting proposal (Johannes: please correct me where I'm maybe
slightly off here) to use only the dirty bits of the ptes to mark a
page as volatile. Then the kernel could reclaim these clean pages as
it needed, and when we marked the range as non-volatile, the pages
would be re-dirtied and if any of the pages were missing, we could
return a flag with the purged state.  This had some different
semantics than what I've been working with for a while (for example,
any writes to pages would implicitly clear volatility), so I wasn't
completely comfortable with it, but figured I'd think about it to see
if it could be done. Particularly since it would in some ways simplify
tmpfs/shm shared volatility that I'd eventually like to do.

After thinking it over in the hallway, I talked through some of the
details w/ Johannes, and there was one issue: while w/ anonymous memory
we can still add a VM_VOLATILE flag on the vma to get SIGBUS semantics,
for shared volatile ranges we don't have anything to hang a volatile
flag on w/o adding some new vma-like structure to the address_space
structure (much as we did in the past w/ earlier volatile range
implementations). This would negate much of the point of using the
dirty bits to simplify the shared volatility implementation.

Thus Johannes is reasonably questioning the need for SIGBUS semantics,
since if it wasn't needed, the simpler page-cleaning based volatility
could potentially be used.


Now, while for the case I'm personally most interested in (ashmem),
zero-fill would technically be ok, since that's what Android does.
Even so, I don't think it's the best approach for the interface, since
applications may end up quite surprised by the results when they
accidentally don't follow the "don't touch volatile pages" rule.

That point aside, I think the other problem with the page-cleaning
volatility approach is that there are other awkward side effects. For
example: Say an application marks a range as volatile. One page in the
range is then purged. The application, due to a bug or otherwise,
reads the volatile range. This causes the page to be zero-filled in,
and the application silently uses the corrupted data (which isn't
great). More problematic though, is that by faulting the page in,
they've in effect lost the purge state for that page. When the
application then goes to mark the range as non-volatile, all pages are
present, so we'd return that no pages were purged.  From an
application perspective this is pretty ugly.
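To make that sequence concrete, here is a toy model of the zero-fill
semantics (purely illustrative, not code from the patch set): one
stray read of a purged page faults it back in as zeros, and the later
"was anything purged?" check then finds every page present.

```python
# Toy model of page-cleaning volatility with zero-fill: a buggy read
# of a purged page zero-fills it back in, so the purge record is lost
# and the non-volatile transition reports that nothing was purged.

class ZeroFillRange:
    def __init__(self, data):
        self.pages = list(data)          # None == page not resident

    def purge(self, idx):                # kernel drops a clean page
        self.pages[idx] = None

    def read(self, idx):                 # fault zero-fills purged pages
        if self.pages[idx] is None:
            self.pages[idx] = 0          # silent corruption, and...
        return self.pages[idx]           # ...the purge state is gone

    def mark_nonvolatile(self):
        return any(p is None for p in self.pages)  # "was anything purged?"

r = ZeroFillRange([7, 8, 9])
r.purge(1)
corrupt = r.read(1)                      # buggy access: gets 0, not 8
assert corrupt == 0
assert r.mark_nonvolatile() is False     # purge state lost: looks intact
```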

Johannes: Any thoughts on this potential issue with your proposal? Am
I missing something else?

thanks
-john

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
@ 2014-04-02 17:40             ` John Stultz
  0 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-02 17:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
>> On 04/01/2014 04:01 PM, Dave Hansen wrote:
>> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
>> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
>> > John, this was something that the Mozilla guys asked for, right?  Any
>> > idea why this isn't ever a problem for them?
>> So one of their use cases for it is for library text. Basically they
>> want to decompress a compressed library file into memory. Then they plan
>> to mark the uncompressed pages volatile, and then be able to call into
>> it. Ideally for them, the kernel would only purge cold pages, leaving
>> the hot pages in memory. When they traverse a purged page, they handle
>> the SIGBUS and patch the page up.
>
> How big are these libraries compared to overall system size?

Mike or Taras would have to refresh my memory on this detail. My
recollection is it mostly has to do with keeping the on-disk size of
the library small, so it can load off of slow media very quickly.

>> Now.. this is not what I'd consider a normal use case, but was hoping to
>> illustrate some of the more interesting uses and demonstrate the
>> interfaces flexibility.
>
> I'm just dying to hear a "normal" use case then. :)

So the more "normal" use cause would be marking objects volatile and
then non-volatile w/o accessing them in-between. In this case the
zero-fill vs SIGBUS semantics don't really matter, its really just a
trade off in how we handle applications deviating (intentionally or
not) from this use case.

So to maybe flesh out the context here for folks who are following
along (but weren't in the hallway at LSF :),  Johannes made a fairly
interesting proposal (Johannes: please correct me where I'm maybe
slightly off) to use only the dirty bits of the ptes to mark a
page as volatile. The kernel could then reclaim these clean pages as
it needed, and when we marked the range as non-volatile, the pages
would be re-dirtied; if any of the pages were missing, we could
return a flag with the purged state.  This had some different
semantics than what I've been working with for a while (for example,
any writes to pages would implicitly clear volatility), so I wasn't
completely comfortable with it, but figured I'd think about it to see
if it could be done. Particularly since it would in some ways simplify
the tmpfs/shm shared volatility that I'd eventually like to do.

After thinking it over in the hallway, I talked through some of the
details w/ Johannes, and one issue came up: with anonymous memory we
can still add a VM_VOLATILE flag on the vma, so we can get SIGBUS
semantics, but for shared volatile ranges we don't have anything
to hang a volatile flag on w/o adding some new vma-like structure to
the address_space structure (much as we did in the past w/ earlier
volatile range implementations). This would negate much of the point
of using the dirty bits to simplify the shared volatility
implementation.

Thus Johannes is reasonably questioning the need for SIGBUS semantics,
since if it wasn't needed, the simpler page-cleaning based volatility
could potentially be used.


Now, while for the case I'm personally most interested in (ashmem),
zero-fill would technically be ok, since that's what Android does.
Even so, I don't think it's the best approach for the interface, since
applications may end up quite surprised by the results when they
accidentally don't follow the "don't touch volatile pages" rule.

That point aside, I think the other problem with the page-cleaning
volatility approach is that it has other awkward side effects. For
example: say an application marks a range as volatile, and one page in
the range is then purged. The application, due to a bug or otherwise,
reads the volatile range. This causes the page to be zero-filled in,
and the application silently uses the corrupted data (which isn't
great). More problematic, though, is that by faulting the page in,
they've in effect lost the purge state for that page. When the
application then goes to mark the range as non-volatile, all pages are
present, so we'd return that no pages were purged.  From an
application perspective this is pretty ugly.

Johannes: Any thoughts on this potential issue with your proposal? Am
I missing something else?

thanks
-john

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 17:40               ` Dave Hansen
@ 2014-04-02 17:48                 ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-02 17:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Johannes Weiner, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen <dave@sr71.net> wrote:
> On 04/02/2014 10:18 AM, Johannes Weiner wrote:
>> Hence my follow-up question in the other mail about how large we
>> expect such code caches to become in practice in relationship to
>> overall system memory.  Are code caches interesting reclaim candidates
>> to begin with?  Are they big enough to make the machine thrash/swap
>> otherwise?
>
> A big chunk of the use cases here are for swapless systems anyway, so
> this is the *only* way for them to reclaim anonymous memory.  Their
> choices are either to be constantly throwing away and rebuilding these
> objects, or to leave them in memory effectively pinned.
>
> In practice I did see ashmem (the Android thing that we're trying to
> replace) get used a lot by the Android web browser when I was playing
> with it.  John said that it got used for storing decompressed copies of
> images.

Although images are a simpler case, where it's easier to not touch
volatile pages. I think Johannes is mostly concerned about cases where
volatile pages are accessed while they are volatile, and the Mozilla
folks are so far the only viable case (in my mind... folks may have
others) where pages are intentionally accessed while they're volatile,
thus requiring SIGBUS semantics.

I suspect handling the SIGBUS and patching up the purged page you
trapped on is likely much too complicated for most use cases. But I do
think SIGBUS is preferable to zero-fill on purged page access, just
because it's likely to be easier to debug applications.

thanks
-john

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 17:40             ` John Stultz
@ 2014-04-02 17:58               ` Johannes Weiner
  -1 siblings, 0 replies; 112+ messages in thread
From: Johannes Weiner @ 2014-04-02 17:58 UTC (permalink / raw)
  To: John Stultz
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
> >> On 04/01/2014 04:01 PM, Dave Hansen wrote:
> >> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
> >> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> >> > John, this was something that the Mozilla guys asked for, right?  Any
> >> > idea why this isn't ever a problem for them?
> >> So one of their use cases for it is for library text. Basically they
> >> want to decompress a compressed library file into memory. Then they plan
> >> to mark the uncompressed pages volatile, and then be able to call into
> >> it. Ideally for them, the kernel would only purge cold pages, leaving
> >> the hot pages in memory. When they traverse a purged page, they handle
> >> the SIGBUS and patch the page up.
> >
> > How big are these libraries compared to overall system size?
> 
> Mike or Taras would have to refresh my memory on this detail. My
> recollection is it mostly has to do with keeping the on-disk size of
> the library small, so it can load off of slow media very quickly.
> 
> >> Now.. this is not what I'd consider a normal use case, but was hoping to
> >> illustrate some of the more interesting uses and demonstrate the
> >> interfaces flexibility.
> >
> > I'm just dying to hear a "normal" use case then. :)
> 
> So the more "normal" use case would be marking objects volatile and
> then non-volatile w/o accessing them in-between. In this case the
> zero-fill vs SIGBUS semantics don't really matter; it's really just a
> trade-off in how we handle applications deviating (intentionally or
> not) from this use case.
> 
> So to maybe flesh out the context here for folks who are following
> along (but weren't in the hallway at LSF :),  Johannes made a fairly
> interesting proposal (Johannes: please correct me where I'm maybe
> slightly off) to use only the dirty bits of the ptes to mark a
> page as volatile. The kernel could then reclaim these clean pages as
> it needed, and when we marked the range as non-volatile, the pages
> would be re-dirtied; if any of the pages were missing, we could
> return a flag with the purged state.  This had some different
> semantics than what I've been working with for a while (for example,
> any writes to pages would implicitly clear volatility), so I wasn't
> completely comfortable with it, but figured I'd think about it to see
> if it could be done. Particularly since it would in some ways simplify
> the tmpfs/shm shared volatility that I'd eventually like to do.
> 
> After thinking it over in the hallway, I talked through some of the
> details w/ Johannes, and one issue came up: with anonymous memory we
> can still add a VM_VOLATILE flag on the vma, so we can get SIGBUS
> semantics, but for shared volatile ranges we don't have anything
> to hang a volatile flag on w/o adding some new vma-like structure to
> the address_space structure (much as we did in the past w/ earlier
> volatile range implementations). This would negate much of the point
> of using the dirty bits to simplify the shared volatility
> implementation.
> 
> Thus Johannes is reasonably questioning the need for SIGBUS semantics,
> since if it wasn't needed, the simpler page-cleaning based volatility
> could potentially be used.

Thanks for summarizing this again!

> Now, while for the case I'm personally most interested in (ashmem),
> zero-fill would technically be ok, since that's what Android does.
> Even so, I don't think it's the best approach for the interface, since
> applications may end up quite surprised by the results when they
> accidentally don't follow the "don't touch volatile pages" rule.
> 
> That point aside, I think the other problem with the page-cleaning
> volatility approach is that it has other awkward side effects. For
> example: say an application marks a range as volatile, and one page in
> the range is then purged. The application, due to a bug or otherwise,
> reads the volatile range. This causes the page to be zero-filled in,
> and the application silently uses the corrupted data (which isn't
> great). More problematic, though, is that by faulting the page in,
> they've in effect lost the purge state for that page. When the
> application then goes to mark the range as non-volatile, all pages are
> present, so we'd return that no pages were purged.  From an
> application perspective this is pretty ugly.
> 
> Johannes: Any thoughts on this potential issue with your proposal? Am
> I missing something else?

No, this is accurate.  However, I don't really see how this is
different from any other use-after-free bug.  If you access malloc'd
memory after free(), you might receive a SIGSEGV, you might see random
data, you might corrupt somebody else's data.  This certainly isn't
nice, but it's not exactly new behavior, is it?

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 17:48                 ` John Stultz
@ 2014-04-02 18:07                   ` Johannes Weiner
  -1 siblings, 0 replies; 112+ messages in thread
From: Johannes Weiner @ 2014-04-02 18:07 UTC (permalink / raw)
  To: John Stultz
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen <dave@sr71.net> wrote:
> > On 04/02/2014 10:18 AM, Johannes Weiner wrote:
> >> Hence my follow-up question in the other mail about how large we
> >> expect such code caches to become in practice in relationship to
> >> overall system memory.  Are code caches interesting reclaim candidates
> >> to begin with?  Are they big enough to make the machine thrash/swap
> >> otherwise?
> >
> > A big chunk of the use cases here are for swapless systems anyway, so
> > this is the *only* way for them to reclaim anonymous memory.  Their
> > choices are either to be constantly throwing away and rebuilding these
> > objects, or to leave them in memory effectively pinned.
> >
> > In practice I did see ashmem (the Android thing that we're trying to
> > replace) get used a lot by the Android web browser when I was playing
> > with it.  John said that it got used for storing decompressed copies of
> > images.
> 
> Although images are a simpler case, where it's easier to not touch
> volatile pages. I think Johannes is mostly concerned about cases where
> volatile pages are accessed while they are volatile, and the Mozilla
> folks are so far the only viable case (in my mind... folks may have
> others) where pages are intentionally accessed while they're volatile,
> thus requiring SIGBUS semantics.

Yes, absolutely, that is my only concern.  Compressed images as in
Android can easily be marked non-volatile before they are accessed
again.

Code caches are harder because control is handed off to the CPU, but
I'm not entirely sure yet whether these are in fact interesting
reclaim candidates.

> I suspect handling the SIGBUS and patching up the purged page you
> trapped on is likely much too complicated for most use cases. But I do
> think SIGBUS is preferable to zero-fill on purged page access, just
> because it's likely to be easier to debug applications.

Fully agreed, but it seems a bit overkill to add a separate syscall, a
range-tree on top of shmem address_spaces, and an essentially new
programming model based on SIGBUS userspace fault handling (incl. all
the complexities and confusion this inevitably will bring when people
DO end up passing these pointers into kernel space) just to be a bit
nicer about use-after-free bugs in applications.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02  4:03     ` John Stultz
@ 2014-04-02 18:31       ` Andrea Arcangeli
  -1 siblings, 0 replies; 112+ messages in thread
From: Andrea Arcangeli @ 2014-04-02 18:31 UTC (permalink / raw)
  To: John Stultz
  Cc: Johannes Weiner, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, H. Peter Anvin,
	linux-mm

Hi everyone,

On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> you have a third option you're thinking of, I'd of course be interested
> in hearing it.

I actually thought that being notified with a page fault (SIGBUS
or whatever) was the most efficient way of using volatile ranges.

Why have to call a syscall to find out if you can still access the
volatile range, if there was no VM pressure before the access?
Syscalls are expensive; accessing the memory directly is not. Only if
the page was actually missing, and a page fault fired, would you take
the slowpath.

The usages I see for this are plenty, like maintaining caches in
memory that may be big and would be nice to discard if there's VM
pressure; uncompressed jpeg images sound like a candidate too. So the
browser's size would shrink under VM pressure, instead of ending up
swapping out uncompressed image data that can be regenerated more
quickly with the CPU than with swapins.

> Now... once you've chosen SIGBUS semantics, there will be folks who will
> try to exploit the fact that we get SIGBUS on purged page access (at
> least on the user-space side) and will try to access pages that are
> volatile until they are purged and try to then handle the SIGBUS to fix
> things up. Those folks exploiting that will have to be particularly
> careful not to pass volatile data to the kernel, and if they do they'll
> have to be smart enough to handle the EFAULT, etc. That's really all
> their problem, because they're being clever. :)

I'm actually working on a feature that would solve the problem for
syscalls accessing missing volatile pages. You'd never see an
-EFAULT, because syscalls won't return even if they encounter a
missing page in a volatile range dropped by VM pressure.

It's called userfaultfd. You call sys_userfaultfd(flags) and it
connects the current mm to a pseudo filedescriptor. The filedescriptor
works similarly to eventfd but with a different protocol.

You need a thread that will never access the userfault area with the
CPU, and that is responsible for polling the userfaultfd and talking
the userfaultfd protocol to fill in missing pages. After a POLLIN
event, the userfault thread reads the virtual addresses of the fault
that must have happened on some other thread of the same mm, and then
writes back a "handled" virtual range into the fd, after the page (or
pages, if multiple) have been regenerated and mapped in with
sys_remap_anon_pages(), mremap or equivalent atomic pagetable page
swapping. Then, depending on the "solved" range written back into the
fd, the kernel will wake up the thread or threads that were waiting in
kernel mode on the "handled" virtual range, and retry the fault
without ever exiting kernel mode.

We need this in KVM for running the guest on memory that is on other
nodes or other processes (postcopy live migration is the most common
use case but there are others like memory externalization and
cross-node KSM in the cloud, to keep a single copy of memory across
multiple nodes and externalized to the VM and to the host node).

This thread made me wonder if we could mix the two features: you
would then depend on MADV_USERFAULT and userfaultfd to deliver to
userland the "faults" happening on the volatile pages that have been
purged as a result of VM pressure.

I'm just saying this after Johannes mentioned the issue with syscalls
returning -EFAULT. Because that is the very issue that the userfaultfd
is going to solve for the KVM migration thread.

What I'm thinking now would be to mark the volatile range also as
MADV_USERFAULT, then call userfaultfd, and instead of having the
cache-regeneration "slow path" inside the SIGBUS handler, run it in
the userfault thread that polls the userfaultfd. Then you could write
the volatile ranges to disk with a write() syscall (or use any other
syscall on the volatile ranges), without having to worry about -EFAULT
being returned because one page was discarded. And if MADV_USERFAULT
is not used in combination with the vrange syscalls, it'd still
work without the userfault, but with the vrange syscalls only.

In short, the idea would be to let the userfault code solve the fault
delivery to userland for you, and make the vrange syscalls focus only
on the page purging problem, without having to worry about what
happens when something accesses a missing page.

But if you don't intend to solve the syscall -EFAULT problem, well
then probably the overlap is still as thin as I thought it was before
(as also mentioned in the link below).

Thanks,
Andrea

PS. my last email about this from a more KVM centric point of view:

http://www.spinics.net/lists/kvm/msg101449.html

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
@ 2014-04-02 18:31       ` Andrea Arcangeli
  0 siblings, 0 replies; 112+ messages in thread
From: Andrea Arcangeli @ 2014-04-02 18:31 UTC (permalink / raw)
  To: John Stultz
  Cc: Johannes Weiner, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, H. Peter Anvin,
	linux-mm

Hi everyone,

On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> you have a third option you're thinking of, I'd of course be interested
> in hearing it.

I actually thought the way of being notified with a page fault (sigbus
or whatever) was the most efficient way of using volatile ranges.

Why having to call a syscall to know if you can still access the
volatile range, if there was no VM pressure before the access?
syscalls are expensive, accessing the memory direct is not. Only if it
page was actually missing and a page fault would fire, you'd take the
slowpath.

The usages I see for this are plenty, like for maintaining caches in
memory that may be big and would be nice to discard if there's VM
pressure, jpeg uncompressed images sounds like a candidate too. So the
browser size would shrink if there's VM pressure, instead of ending up
swapping out uncompressed image data that can be regenerated more
quickly with the CPU than with swapins.

> Now... once you've chosen SIGBUS semantics, there will be folks who will
> try to exploit the fact that we get SIGBUS on purged page access (at
> least on the user-space side) and will try to access pages that are
> volatile until they are purged and try to then handle the SIGBUS to fix
> things up. Those folks exploiting that will have to be particularly
> careful not to pass volatile data to the kernel, and if they do they'll
> have to be smart enough to handle the EFAULT, etc. That's really all
> their problem, because they're being clever. :)

I'm actually working on a feature that would solve the problem for
syscalls accessing missing volatile pages. You'd never see -EFAULT,
because syscalls would not fail even if they encountered a missing
page in a volatile range dropped by VM pressure.

It's called userfaultfd. You call sys_userfaultfd(flags) and it
connects the current mm to a pseudo filedescriptor. The filedescriptor
works similarly to eventfd but with a different protocol.

You need a thread that will never access the userfault area with the
CPU, and that is responsible for polling the userfaultfd and speaking
the userfaultfd protocol to fill in missing pages. After a POLLIN
event, the userfault thread reads the virtual addresses of the faults
that happened on other threads of the same mm, and then writes a
"handled" virtual range back into the fd, after the page (or pages,
if multiple) have been regenerated and mapped in with
sys_remap_anon_pages(), mremap, or an equivalent atomic pagetable
swap. Then, depending on the "solved" range written back into the fd,
the kernel will wake up the thread or threads that were waiting in
kernel mode on the "handled" virtual range, and retry the fault
without ever exiting kernel mode.

We need this in KVM for running guests on memory that lives on other
nodes or in other processes (postcopy live migration is the most
common use case, but there are others, like memory externalization and
cross-node KSM in the cloud, to keep a single copy of memory across
multiple nodes, externalized to the VM and to the host node).

This thread made me wonder if we could mix the two features: you
would then depend on MADV_USERFAULT and userfaultfd to deliver to
userland the "faults" happening on volatile pages that have been
purged as a result of VM pressure.

I'm just saying this after Johannes mentioned the issue with syscalls
returning -EFAULT. Because that is the very issue that the userfaultfd
is going to solve for the KVM migration thread.

What I'm thinking now is to mark the volatile range also
MADV_USERFAULT, call userfaultfd, and, instead of having the cache
regeneration "slow path" inside the SIGBUS handler, run it in the
userfault thread that polls the userfaultfd. Then you could write the
volatile ranges to disk with a write() syscall (or use any other
syscall on the volatile ranges), without having to worry about -EFAULT
being returned because one page was discarded. And if MADV_USERFAULT
is not used in combination with the vrange syscalls, it would still
work without the userfault, with the vrange syscalls only.

In short, the idea would be to let the userfault code solve fault
delivery to userland for you, and let the vrange syscalls focus only
on the page-purging problem, without having to worry about what
happens when something accesses a missing page.

But if you don't intend to solve the syscall -EFAULT problem, then
the overlap is probably still as thin as I thought it was before (as
also mentioned in the link below).

Thanks,
Andrea

PS. my last email about this from a more KVM centric point of view:

http://www.spinics.net/lists/kvm/msg101449.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 17:58               ` Johannes Weiner
@ 2014-04-02 19:01                 ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-02 19:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
>> That point aside, I think the other problem with the page-cleaning
>> volatility approach is that there are other awkward side effects. For
>> example: Say an application marks a range as volatile. One page in the
>> range is then purged. The application, due to a bug or otherwise,
>> reads the volatile range. This causes the page to be zero-filled in,
>> and the application silently uses the corrupted data (which isn't
>> great). More problematic though, is that by faulting the page in,
>> they've in effect lost the purge state for that page. When the
>> application then goes to mark the range as non-volatile, all pages are
>> present, so we'd return that no pages were purged.  From an
>> application perspective this is pretty ugly.
>>
>> Johannes: Any thoughts on this potential issue with your proposal? Am
>> I missing something else?
>
> No, this is accurate.  However, I don't really see how this is
> different than any other use-after-free bug.  If you access malloc
> memory after free(), you might receive a SIGSEGV, you might see random
> data, you might corrupt somebody else's data.  This certainly isn't
> nice, but it's not exactly new behavior, is it?

The part that troubles me is that I see the purged state as kernel
data being corrupted by userland in this case. The kernel will tell
userspace that no pages were purged, even though they were, only
because userspace made an errant read of a page and got garbage data
back.
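[Editor's sketch: the failure mode described above plays out against the
proposed interface roughly as follows. The vrange() syscall was never merged,
so the names, flags, and out-parameter here follow the patch series under
discussion and are illustrative pseudocode only.]

```
int purged = 0;

/* mark the cache volatile; the kernel may now purge it under pressure */
vrange(start, len, VRANGE_VOLATILE, &purged);

/* ...one page in the range is purged under VM pressure... */

c = ((char *)start)[0];   /* buggy read: zero-fills the purged page */

/* mark non-volatile: every page is present again, so the kernel
 * reports purged == 0 -- the errant read silently destroyed the
 * purge state the application needed to see */
vrange(start, len, VRANGE_NONVOLATILE, &purged);
```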

thanks
-john

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 18:31       ` Andrea Arcangeli
@ 2014-04-02 19:27         ` Johannes Weiner
  -1 siblings, 0 replies; 112+ messages in thread
From: Johannes Weiner @ 2014-04-02 19:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, H. Peter Anvin,
	linux-mm

On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> Hi everyone,
> 
> On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> > you have a third option you're thinking of, I'd of course be interested
> > in hearing it.
> 
> I actually thought that being notified with a page fault (SIGBUS
> or whatever) was the most efficient way of using volatile ranges.
> 
> Why call a syscall to learn whether you can still access the
> volatile range, if there was no VM pressure before the access?
> Syscalls are expensive; accessing the memory directly is not. Only if
> the page was actually missing, and a page fault fired, would you take
> the slowpath.

Not everybody wants to actually come back for the data in the range;
allocators and message-passing applications just want to be able to
reuse the memory mapping.

By tying the volatility to the dirty bit in the page tables, an
allocator could simply clear those bits once on free().  When malloc()
hands out this region again, the user is expected to write, which will
either overwrite the old page, or, if it was purged, fault in a fresh
zero page.  But there is no second syscall needed to clear volatility.

> > Now... once you've chosen SIGBUS semantics, there will be folks who will
> > try to exploit the fact that we get SIGBUS on purged page access (at
> > least on the user-space side) and will try to access pages that are
> > volatile until they are purged and try to then handle the SIGBUS to fix
> > things up. Those folks exploiting that will have to be particularly
> > careful not to pass volatile data to the kernel, and if they do they'll
> > have to be smart enough to handle the EFAULT, etc. That's really all
> > their problem, because they're being clever. :)
> 
> I'm actually working on a feature that would solve the problem for
> syscalls accessing missing volatile pages. You'd never see -EFAULT,
> because syscalls would not fail even if they encountered a missing
> page in a volatile range dropped by VM pressure.
> 
> It's called userfaultfd. You call sys_userfaultfd(flags) and it
> connects the current mm to a pseudo filedescriptor. The filedescriptor
> works similarly to eventfd but with a different protocol.
> 
> You need a thread that will never access the userfault area with the
> CPU, and that is responsible for polling the userfaultfd and speaking
> the userfaultfd protocol to fill in missing pages. After a POLLIN
> event, the userfault thread reads the virtual addresses of the faults
> that happened on other threads of the same mm, and then writes a
> "handled" virtual range back into the fd, after the page (or pages,
> if multiple) have been regenerated and mapped in with
> sys_remap_anon_pages(), mremap, or an equivalent atomic pagetable
> swap. Then, depending on the "solved" range written back into the fd,
> the kernel will wake up the thread or threads that were waiting in
> kernel mode on the "handled" virtual range, and retry the fault
> without ever exiting kernel mode.
> 
> We need this in KVM for running guests on memory that lives on other
> nodes or in other processes (postcopy live migration is the most
> common use case, but there are others, like memory externalization and
> cross-node KSM in the cloud, to keep a single copy of memory across
> multiple nodes, externalized to the VM and to the host node).
> 
> This thread made me wonder if we could mix the two features: you
> would then depend on MADV_USERFAULT and userfaultfd to deliver to
> userland the "faults" happening on volatile pages that have been
> purged as a result of VM pressure.
> 
> I'm just saying this after Johannes mentioned the issue with syscalls
> returning -EFAULT. Because that is the very issue that the userfaultfd
> is going to solve for the KVM migration thread.
> 
> What I'm thinking now is to mark the volatile range also
> MADV_USERFAULT, call userfaultfd, and, instead of having the cache
> regeneration "slow path" inside the SIGBUS handler, run it in the
> userfault thread that polls the userfaultfd. Then you could write the
> volatile ranges to disk with a write() syscall (or use any other
> syscall on the volatile ranges), without having to worry about -EFAULT
> being returned because one page was discarded. And if MADV_USERFAULT
> is not used in combination with the vrange syscalls, it would still
> work without the userfault, with the vrange syscalls only.
> 
> In short, the idea would be to let the userfault code solve fault
> delivery to userland for you, and let the vrange syscalls focus only
> on the page-purging problem, without having to worry about what
> happens when something accesses a missing page.

Yes, the two certainly seem combinable to me.

madvise(MADV_FREE | MADV_USERFAULT) to allow purging and userspace
fault handling.  In the fault slowpath, you can then regenerate any
missing data and do MADV_FREE again if it should remain volatile.  And
again, any actual writes to the region would clear volatility because
now the cache copy changed and discarding it would mean losing state.
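[Editor's sketch: put together, the flow sketched above would look roughly
like the pseudocode below. Purely illustrative: MADV_USERFAULT was never
merged, real madvise() advice values are an enum rather than OR-able flags,
and regenerate(), resolve_fault(), and page_of() are hypothetical helpers.]

```
/* mark the cache purgeable, with faults on purged pages routed to
 * userspace instead of zero-fill or SIGBUS */
madvise(cache, len, MADV_FREE | MADV_USERFAULT);

/* in the userfault thread, on a fault at addr: */
regenerate(addr);                               /* rebuild discarded data */
resolve_fault(uffd, addr);                      /* map it in, resume      */
madvise(page_of(addr), page_size, MADV_FREE);   /* keep it volatile       */

/* any real write by the application clears volatility implicitly,
 * since the page becomes dirty again */
```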

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 18:07                   ` Johannes Weiner
@ 2014-04-02 19:37                     ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-02 19:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Wed, Apr 2, 2014 at 11:07 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote:
>> I suspect handling the SIGBUS and patching up the purged page you
>> trapped on is likely much too complicated for most use cases. But I do
>> think SIGBUS is preferable to zero-fill on purged-page access, just
>> because it's likely to be easier to debug applications.
>
> Fully agreed, but it seems a bit overkill to add a separate syscall, a
> range-tree on top of shmem address_spaces, and an essentially new
> programming model based on SIGBUS userspace fault handling (incl. all
> the complexities and confusion this inevitably will bring when people
> DO end up passing these pointers into kernel space) just to be a bit
> nicer about use-after-free bugs in applications.

It's more about making an interface whose semantics are graspable to
userspace, instead of having the semantics be a side effect of the
implementation.

Tying volatility to the page-clean state, and page-was-purged to
page-present, seems problematic to me, because there are too many ways
to change page-clean or page-present outside of the interface being
proposed.

I feel this causes a cascade of corner cases that have to be explained
to users of the interface.

Also, I disagree that we're adding a new programming model: SIGBUS
can already be caught, it's just that there's not usually much one can
do about it, whereas with volatile pages it's more likely something
could be done. And again, it's really just a side effect of having
semantics (SIGBUS on purged-page access) that are more helpful from an
application's perspective.

As for the separate syscall: again, this is mainly needed to handle
allocation failures that happen midway through modifying the range.
There may still be a way to do the allocation first and only do the
modification after it succeeds. The vma merge/splitting logic doesn't
make this easy, but if we can be sure that on a failed split of one
vma into three (which may fail halfway) we can re-merge without
allocation and error out (without doing any other allocations), this
might be avoidable. I still want to look into this. If so, it would be
easier to re-add this support under madvise, if folks really don't
like the new syscall. For the most part, having the separate syscall
lets us discuss the other details of the semantics, which to me are
more important than the syscall naming.

thanks
-john

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 19:01                 ` John Stultz
@ 2014-04-02 19:47                   ` Johannes Weiner
  -1 siblings, 0 replies; 112+ messages in thread
From: Johannes Weiner @ 2014-04-02 19:47 UTC (permalink / raw)
  To: John Stultz
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> >> That point aside, I think the other problem with the page-cleaning
> >> volatility approach is that there are other awkward side effects. For
> >> example: Say an application marks a range as volatile. One page in the
> >> range is then purged. The application, due to a bug or otherwise,
> >> reads the volatile range. This causes the page to be zero-filled in,
> >> and the application silently uses the corrupted data (which isn't
> >> great). More problematic though, is that by faulting the page in,
> >> they've in effect lost the purge state for that page. When the
> >> application then goes to mark the range as non-volatile, all pages are
> >> present, so we'd return that no pages were purged.  From an
> >> application perspective this is pretty ugly.
> >>
> >> Johannes: Any thoughts on this potential issue with your proposal? Am
> >> I missing something else?
> >
> > No, this is accurate.  However, I don't really see how this is
> > different than any other use-after-free bug.  If you access malloc
> > memory after free(), you might receive a SIGSEGV, you might see random
> > data, you might corrupt somebody else's data.  This certainly isn't
> > nice, but it's not exactly new behavior, is it?
> 
> The part that troubles me is that I see the purged state as kernel
> data being corrupted by userland in this case. The kernel will tell
> userspace that no pages were purged, even though they were, only
> because userspace made an errant read of a page and got garbage data
> back.

That sounds overly dramatic to me.  First of all, this data still
accurately reflects the actions of userspace in this situation.  And
secondly, the kernel does not rely on this data being meaningful from
a userspace perspective in order to function correctly.

It's really nothing but a use-after-free bug, with consequences for
no one but the faulty application.  The thing that IS new is that even
a read is enough to corrupt your data in this case.

MADV_REVIVE could return 0 if all pages in the specified range were
present, and -Esomething otherwise.  That would be semantically sound
even if userspace messes up.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
@ 2014-04-02 19:47                   ` Johannes Weiner
  0 siblings, 0 replies; 112+ messages in thread
From: Johannes Weiner @ 2014-04-02 19:47 UTC (permalink / raw)
  To: John Stultz
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> >> That point beside, I think the other problem with the page-cleaning
> >> volatility approach is that there are other awkward side effects. For
> >> example: Say an application marks a range as volatile. One page in the
> >> range is then purged. The application, due to a bug or otherwise,
> >> reads the volatile range. This causes the page to be zero-filled in,
> >> and the application silently uses the corrupted data (which isn't
> >> great). More problematic though, is that by faulting the page in,
> >> they've in effect lost the purge state for that page. When the
> >> application then goes to mark the range as non-volatile, all pages are
> >> present, so we'd return that no pages were purged.  From an
> >> application perspective this is pretty ugly.
> >>
> >> Johannes: Any thoughts on this potential issue with your proposal? Am
> >> I missing something else?
> >
> > No, this is accurate.  However, I don't really see how this is
> > different than any other use-after-free bug.  If you access malloc
> > memory after free(), you might receive a SIGSEGV, you might see random
> > data, you might corrupt somebody else's data.  This certainly isn't
> > nice, but it's not exactly new behavior, is it?
> 
> The part that troubles me is that I see the purged state as kernel
> data being corrupted by userland in this case. The kernel will tell
> userspace that no pages were purged, even though they were. Only
> because userspace made an errant read of a page, and got garbage data
> back.

That sounds overly dramatic to me.  First of all, this data still
reflects accurately the actions of userspace in this situation.  And
secondly, the kernel does not rely on this data to be meaningful from
a userspace perspective to function correctly.

It's really nothing but a use-after-free bug that has consequences for
no-one but the faulty application.  The thing that IS new is that even
a read is enough to corrupt your data in this case.

MADV_REVIVE could return 0 if all pages in the specified range were
present, -Esomething if otherwise.  That would be semantically sound
even if userspace messes up.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 18:31       ` Andrea Arcangeli
@ 2014-04-02 19:51         ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-02 19:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Johannes Weiner, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, H. Peter Anvin,
	linux-mm

On 04/02/2014 11:31 AM, Andrea Arcangeli wrote:
> On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
>> Now... once you've chosen SIGBUS semantics, there will be folks who will
>> try to exploit the fact that we get SIGBUS on purged page access (at
>> least on the user-space side) and will try to access pages that are
>> volatile until they are purged and try to then handle the SIGBUS to fix
>> things up. Those folks exploiting that will have to be particularly
>> careful not to pass volatile data to the kernel, and if they do they'll
>> have to be smart enough to handle the EFAULT, etc. That's really all
>> their problem, because they're being clever. :)
> I'm actually working on a feature that would solve the problem of
> syscalls accessing missing volatile pages. So you'd never see an
> -EFAULT, because syscalls won't fail even if they encounter a
> missing page in a volatile range dropped by VM pressure.
>
> It's called userfaultfd. You call sys_userfaultfd(flags) and it
> connects the current mm to a pseudo filedescriptor. The filedescriptor
> works similarly to eventfd but with a different protocol.
So yea! I actually think (it's been a while now) I mentioned your work to
Taras (or maybe he mentioned it to me?), but it did seem like
userfaultfd would be a better solution for the style of fault handling
they were thinking about. (Especially as actually handling SIGBUS and
doing something sane in a large threaded application seems very difficult.)

That said, explaining volatile ranges as a concept has been difficult
enough without mixing in other new concepts :), so I'm hesitant to tie
the functionality together until it's clear the userfaultfd approach
is likely to land. But maybe I need to take a closer look at it.

thanks
-john


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 19:47                   ` Johannes Weiner
@ 2014-04-02 20:13                     ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-02 20:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On 04/02/2014 12:47 PM, Johannes Weiner wrote:
> On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote:
>> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
>>>> That point beside, I think the other problem with the page-cleaning
>>>> volatility approach is that there are other awkward side effects. For
>>>> example: Say an application marks a range as volatile. One page in the
>>>> range is then purged. The application, due to a bug or otherwise,
>>>> reads the volatile range. This causes the page to be zero-filled in,
>>>> and the application silently uses the corrupted data (which isn't
>>>> great). More problematic though, is that by faulting the page in,
>>>> they've in effect lost the purge state for that page. When the
>>>> application then goes to mark the range as non-volatile, all pages are
>>>> present, so we'd return that no pages were purged.  From an
>>>> application perspective this is pretty ugly.
>>>>
>>>> Johannes: Any thoughts on this potential issue with your proposal? Am
>>>> I missing something else?
>>> No, this is accurate.  However, I don't really see how this is
>>> different than any other use-after-free bug.  If you access malloc
>>> memory after free(), you might receive a SIGSEGV, you might see random
>>> data, you might corrupt somebody else's data.  This certainly isn't
>>> nice, but it's not exactly new behavior, is it?
>> The part that troubles me is that I see the purged state as kernel
>> data being corrupted by userland in this case. The kernel will tell
>> userspace that no pages were purged, even though they were. Only
>> because userspace made an errant read of a page, and got garbage data
>> back.
> That sounds overly dramatic to me.  First of all, this data still
> reflects accurately the actions of userspace in this situation.  And
> secondly, the kernel does not rely on this data to be meaningful from
> a userspace perspective to function correctly.
<insert dramatic-chipmunk video w/ text overlay "errant read corrupted
volatile page purge state!!!!1">

Maybe you're right, but I feel this is the sort of thing application
developers would be surprised and annoyed by.


> It's really nothing but a use-after-free bug that has consequences for
> no-one but the faulty application.  The thing that IS new is that even
> a read is enough to corrupt your data in this case.
>
> MADV_REVIVE could return 0 if all pages in the specified range were
> present, -Esomething if otherwise.  That would be semantically sound
> even if userspace messes up.

So it's semantically just a combined mincore+dirty operation, and
nothing more?

What are other folks thinking about this? Although I don't particularly
like it, I could probably go along with Johannes' approach, forgoing
SIGBUS for zero-fill and adopting semantics that are, to my mind, a
bit stranger. This would allow for ashmem-like behavior with the
additional write-clears-volatile-state and read-clears-purged-state
constraints (which I don't think would be problematic for Android, but
I'm not totally sure).

But I do worry that these semantics, while easier for kernel mm
developers to grasp, are much harder for application developers to
understand.

Additionally, unless we could really leave access-after-volatile as
totally undefined behavior, this would lock us into O(pages) behavior
and would remove the possibility of the O(log(ranges)) behavior Minchan
and I were able to get (admittedly with more complicated code - but
something I was hoping we'd be able to get back to after the base
semantics and interface behavior were understood and merged). And since
applications will have bugs and will access after marking volatile, we
won't be able to get away with that sort of behavioral flexibility.

thanks
-john


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 20:13                     ` John Stultz
@ 2014-04-02 22:44                       ` Jan Kara
  -1 siblings, 0 replies; 112+ messages in thread
From: Jan Kara @ 2014-04-02 22:44 UTC (permalink / raw)
  To: John Stultz
  Cc: Johannes Weiner, Dave Hansen, H. Peter Anvin, LKML,
	Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Andrea Arcangeli, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, linux-mm

On Wed 02-04-14 13:13:34, John Stultz wrote:
> On 04/02/2014 12:47 PM, Johannes Weiner wrote:
> > On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote:
> >> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> >>>> That point beside, I think the other problem with the page-cleaning
> >>>> volatility approach is that there are other awkward side effects. For
> >>>> example: Say an application marks a range as volatile. One page in the
> >>>> range is then purged. The application, due to a bug or otherwise,
> >>>> reads the volatile range. This causes the page to be zero-filled in,
> >>>> and the application silently uses the corrupted data (which isn't
> >>>> great). More problematic though, is that by faulting the page in,
> >>>> they've in effect lost the purge state for that page. When the
> >>>> application then goes to mark the range as non-volatile, all pages are
> >>>> present, so we'd return that no pages were purged.  From an
> >>>> application perspective this is pretty ugly.
> >>>>
> >>>> Johannes: Any thoughts on this potential issue with your proposal? Am
> >>>> I missing something else?
> >>> No, this is accurate.  However, I don't really see how this is
> >>> different than any other use-after-free bug.  If you access malloc
> >>> memory after free(), you might receive a SIGSEGV, you might see random
> >>> data, you might corrupt somebody else's data.  This certainly isn't
> >>> nice, but it's not exactly new behavior, is it?
> >> The part that troubles me is that I see the purged state as kernel
> >> data being corrupted by userland in this case. The kernel will tell
> >> userspace that no pages were purged, even though they were. Only
> >> because userspace made an errant read of a page, and got garbage data
> >> back.
> > That sounds overly dramatic to me.  First of all, this data still
> > reflects accurately the actions of userspace in this situation.  And
> > secondly, the kernel does not rely on this data to be meaningful from
> > a userspace perspective to function correctly.
> <insert dramatic-chipmunk video w/ text overlay "errant read corrupted
> volatile page purge state!!!!1">
> 
> Maybe you're right, but I feel this is the sort of thing application
> developers would be surprised and annoyed by.
> 
> 
> > It's really nothing but a use-after-free bug that has consequences for
> > no-one but the faulty application.  The thing that IS new is that even
> > a read is enough to corrupt your data in this case.
> >
> > MADV_REVIVE could return 0 if all pages in the specified range were
> > present, -Esomething if otherwise.  That would be semantically sound
> > even if userspace messes up.
> 
> So its semantically more of just a combined mincore+dirty operation..
> and nothing more?
> 
> What are other folks thinking about this? Although I don't particularly
> like it, I probably could go along with Johannes' approach, forgoing
> SIGBUS for zero-fill and adapting the semantics that are in my mind a
> bit stranger. This would allow for ashmem-like style behavior w/ the
> additional  write-clears-volatile-state and read-clears-purged-state
> constraints (which I don't think would be problematic for Android, but
> am not totally sure).
> 
> But I do worry that these semantics are easier for kernel-mm-developers
> to grasp, but are much much harder for application developers to
> understand.
  Yeah, I have to admit that although the simplicity of the implementation
looks compelling, the interface from a userspace POV looks weird.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 16:36           ` Johannes Weiner
@ 2014-04-07  5:24             ` Minchan Kim
  -1 siblings, 0 replies; 112+ messages in thread
From: Minchan Kim @ 2014-04-07  5:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: John Stultz, Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, linux-mm

On Wed, Apr 02, 2014 at 12:36:38PM -0400, Johannes Weiner wrote:
> On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
> > On 04/01/2014 04:01 PM, Dave Hansen wrote:
> > > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
> > >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> > >>> Either way, optimistic volatile pointers are nowhere near as
> > >>> transparent to the application as the above description suggests,
> > >>> which makes this usecase not very interesting, IMO.
> > >> ... however, I think you're still derating the value way too much.  The
> > >> case of user space doing elastic memory management is more and more
> > >> common, and for a lot of those applications it is perfectly reasonable
> > >> to either not do system calls or to have to devolatilize first.
> > > The SIGBUS is only in cases where the memory is set as volatile and
> > > _then_ accessed, right?
> > Not just set volatile and then accessed, but when a volatile page has
> > been purged and then accessed without being made non-volatile.
> > 
> > 
> > > John, this was something that the Mozilla guys asked for, right?  Any
> > > idea why this isn't ever a problem for them?
> > So one of their use cases for it is for library text. Basically they
> > want to decompress a compressed library file into memory. Then they plan
> > to mark the uncompressed pages volatile, and then be able to call into
> > it. Ideally for them, the kernel would only purge cold pages, leaving
> > the hot pages in memory. When they traverse a purged page, they handle
> > the SIGBUS and patch the page up.
> 
> How big are these libraries compared to overall system size?

One example of JIT usage I have is 5M bytes for just a simple node.js
service. Actually, I'm not sure whether it was JIT or something else;
what I saw was rwxp vmas, so I guess they were JIT pages.
Anyway, it's a really simple script, but it consumed 5M bytes. That's
really big for embedded WebOS, because other more complicated services
could be running in parallel on the system.

> 
> > Now.. this is not what I'd consider a normal use case, but was hoping to
> > illustrate some of the more interesting uses and demonstrate the
> > interfaces flexibility.
> 
> I'm just dying to hear a "normal" use case then. :)
> 
> > Also it provided a clear example of benefits to doing LRU based
> > cold-page purging rather than full object purging. Though I think the
> > same could be demonstrated in a simpler case of a large cache of objects
> > that the applications wants to mark volatile in one pass, unmarking
> > sub-objects as it needs.
> 
> Agreed.
> 

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 17:40             ` John Stultz
@ 2014-04-07  5:48               ` Minchan Kim
  -1 siblings, 0 replies; 112+ messages in thread
From: Minchan Kim @ 2014-04-07  5:48 UTC (permalink / raw)
  To: John Stultz
  Cc: Johannes Weiner, Dave Hansen, H. Peter Anvin, LKML,
	Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Andrea Arcangeli, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, linux-mm

On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
> >> On 04/01/2014 04:01 PM, Dave Hansen wrote:
> >> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
> >> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> >> > John, this was something that the Mozilla guys asked for, right?  Any
> >> > idea why this isn't ever a problem for them?
> >> So one of their use cases for it is for library text. Basically they
> >> want to decompress a compressed library file into memory. Then they plan
> >> to mark the uncompressed pages volatile, and then be able to call into
> >> it. Ideally for them, the kernel would only purge cold pages, leaving
> >> the hot pages in memory. When they traverse a purged page, they handle
> >> the SIGBUS and patch the page up.
> >
> > How big are these libraries compared to overall system size?
> 
> Mike or Taras would have to refresh my memory on this detail. My
> recollection is it mostly has to do with keeping the on-disk size of
> the library small, so it can load off of slow media very quickly.
> 
> >> Now... this is not what I'd consider a normal use case, but I was hoping
> >> to illustrate some of the more interesting uses and demonstrate the
> >> interface's flexibility.
> >
> > I'm just dying to hear a "normal" use case then. :)
> 
> So the more "normal" use case would be marking objects volatile and
> then non-volatile w/o accessing them in-between. In this case the
> zero-fill vs SIGBUS semantics don't really matter; it's really just a
> trade-off in how we handle applications deviating (intentionally or
> not) from this use case.
> 
> So to maybe flesh out the context here for folks who are following
> along (but weren't in the hallway at LSF :),  Johannes made a fairly
> interesting proposal (Johannes: please correct me where I'm maybe
> slightly off) to use only the dirty bits of the ptes to mark a
> page as volatile. Then the kernel could reclaim these clean pages as
> it needed, and when we marked the range as non-volatile, the pages
> would be re-dirtied and, if any of the pages were missing, we could

I'd like to understand more clearly what Hannes and you are thinking.
Do you mean that when we unmark the range, we should redirty all of
the pages' ptes, or call SetPageDirty?
If we redirty the ptes, the soft-dirty people (i.e., CRIU) might be
unhappy because it could generate a lot of diff.
If we just do SetPageDirty, it would defeat the writeout-avoidance
logic for pages already on swap. That could be minor, though, and the
SetPageDirty model would suit a shared-vrange implementation. But how
would we know at unmark time that any pages were missing? Where do we
keep that information?
It's no problem for vrange-anon because we can keep the information in
the pte, but what about vrange-file (i.e., vrange-shared)? A shadow
entry in the radix tree? What do you have in mind?

Another major concern is still the syscall's overhead.
A page-based scheme hurts the syscall's speed, so I'm afraid users
might stop using the syscall altogether. :(
Frankly speaking, we don't have a concrete user, so I'm not sure how
severe the overhead is, but we can easily imagine that in future some
user might want to mark huge GBs of memory volatile.

But I can't insist on the range-based option because it has downsides,
too. If we don't use a page-based model, the reclaim path clearly has
a big overhead scanning virtual memory to find victim pages. Worst
case: a single page in a huge multi-GB vma. That page might even be in
another zone. :(
If we try to optimize that path later to keep the CPU from burning, it
could get very complicated, and it's not certain it would work well.
We already have a similar issue with compaction. ;-)

So, it's really a dilemma.

> return a flag with the purged state.  This had some different
> semantics than what I've been working with for a while (for example,
> any writes to pages would implicitly clear volatility), so I wasn't
> completely comfortable with it, but figured I'd think about it to see
> if it could be done. Particularly since it would in some ways simplify
> the tmpfs/shm shared volatility that I'd eventually like to do.
> 
> After thinking it over in the hallway, I talked through some of the
> details w/ Johannes. There was one issue: while w/ anonymous memory we
> can still add a VM_VOLATILE flag on the vma, so we can get SIGBUS
> semantics, on shared volatile ranges we don't have anything to hang a
> volatile flag on w/o adding some new vma-like structure to the
> address_space structure (much as we did in the past w/ earlier
> volatile range implementations). This would negate much of the point
> of using the dirty bits to simplify the shared volatility
> implementation.
> 
> Thus Johannes is reasonably questioning the need for SIGBUS semantics,
> since if it wasn't needed, the simpler page-cleaning based volatility
> could potentially be used.

I think the SIGBUS scenario isn't common, but for JIT it is necessary,
and the amount of RAM consumed would never be small in the embedded world.

> 
> 
> Now, while for the case I'm personally most interested in (ashmem),
> zero-fill would technically be ok, since that's what Android does.
> Even so, I don't think it's the best approach for the interface, since
> applications may end up quite surprised by the results when they
> accidentally don't follow the "don't touch volatile pages" rule.
> 
> That point aside, I think the other problem with the page-cleaning
> volatility approach is that there are other awkward side effects. For
> example: Say an application marks a range as volatile. One page in the
> range is then purged. The application, due to a bug or otherwise,
> reads the volatile range. This causes the page to be zero-filled in,
> and the application silently uses the corrupted data (which isn't
> great). More problematic though, is that by faulting the page in,
> they've in effect lost the purge state for that page. When the
> application then goes to mark the range as non-volatile, all pages are
> present, so we'd return that no pages were purged.  From an
> application perspective this is pretty ugly.
> 
> Johannes: Any thoughts on this potential issue with your proposal? Am
> I missing something else?
> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 18:31       ` Andrea Arcangeli
@ 2014-04-07  6:11         ` Minchan Kim
  -1 siblings, 0 replies; 112+ messages in thread
From: Minchan Kim @ 2014-04-07  6:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: John Stultz, Johannes Weiner, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, H. Peter Anvin, linux-mm

Hello Andrea,

On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> Hi everyone,
> 
> On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> > you have a third option you're thinking of, I'd of course be interested
> > in hearing it.
> 
> I actually thought being notified with a page fault (SIGBUS
> or whatever) was the most efficient way of using volatile ranges.
> 
> Why call a syscall to know whether you can still access the
> volatile range, if there was no VM pressure before the access?
> Syscalls are expensive; accessing the memory directly is not. Only if
> the page was actually missing and a page fault fired would you take
> the slowpath.

True.

> 
> The usages I see for this are plenty, like for maintaining caches in
> memory that may be big and would be nice to discard if there's VM
> pressure, jpeg uncompressed images sounds like a candidate too. So the
> browser size would shrink if there's VM pressure, instead of ending up
> swapping out uncompressed image data that can be regenerated more
> quickly with the CPU than with swapins.

That's really the typical case vrange is targeting.

> 
> > Now... once you've chosen SIGBUS semantics, there will be folks who will
> > try to exploit the fact that we get SIGBUS on purged page access (at
> > least on the user-space side) and will try to access pages that are
> > volatile until they are purged and try to then handle the SIGBUS to fix
> > things up. Those folks exploiting that will have to be particularly
> > careful not to pass volatile data to the kernel, and if they do they'll
> > have to be smart enough to handle the EFAULT, etc. That's really all
> > their problem, because they're being clever. :)
> 
> I'm actually working on a feature that would solve the problem for
> syscalls accessing missing volatile pages. You'd never see an
> -EFAULT, because syscalls won't bail out even if they encounter a
> missing page in a volatile range dropped by VM pressure.
> 
> It's called userfaultfd. You call sys_userfaultfd(flags) and it
> connects the current mm to a pseudo filedescriptor. The filedescriptor
> works similarly to eventfd but with a different protocol.
> 
> You need a thread that will never access the userfault area with the
> CPU; it is responsible for polling the userfaultfd and speaking the
> userfaultfd protocol to fill in missing pages. After a POLLIN event,
> the userfault thread reads the virtual addresses of the fault that
> must have happened on some other thread of the same mm, and then
> writes back a "handled" virtual range into the fd, after the page (or
> pages, if multiple) have been regenerated and mapped in with
> sys_remap_anon_pages(), mremap, or equivalent atomic pagetable page
> swapping. Then, depending on the "solved" range written back into the
> fd, the kernel will wake up the thread or threads that were waiting in
> kernel mode on the "handled" virtual range, and retry the fault
> without ever exiting kernel mode.

Sounds flexible.

> 
> We need this in KVM for running the guest on memory that is on other
> nodes or other processes (postcopy live migration is the most common
> use case but there are others like memory externalization and
> cross-node KSM in the cloud, to keep a single copy of memory across
> multiple nodes and externalized to the VM and to the host node).
> 
> This thread made me wonder if we could mix the two features and you
> would then depend on MADV_USERFAULT and userfaultfd to deliver to
> userland the "faults" happening on the volatile pages that have been
> purged as a result of VM pressure.
> 
> I'm just saying this after Johannes mentioned the issue with syscalls
> returning -EFAULT. Because that is the very issue that the userfaultfd
> is going to solve for the KVM migration thread.
> 
> What I'm thinking now would be to mark the volatile range also
> MADV_USERFAULT and then calling userfaultfd and instead of having the
> cache regeneration "slow path" inside the SIGBUS handler, to run it in
> the userfault thread that polls the userfaultfd. Then you could write
> the volatile ranges to disk with a write() syscall (or use any other
> syscall on the volatile ranges), without having to worry about -EFAULT
> being returned because one page was discarded. And if MADV_USERFAULT
> is not called in combination with vrange syscalls, then it'd still
> work without the userfault, but with the vrange syscalls only.
> 
> In short the idea would be to let the userfault code solve the fault
> delivery to userland for you, and make the vrange syscalls only focus
> on the page purging problem, without having to worry about what
> happens when something accesses a missing page.
> 
> But if you don't intend to solve the syscall -EFAULT problem, well
> then probably the overlap is still as thin as I thought it was before
> (like also mentioned in the below link).

Sounds doable. I will look into your patch.
Thanks for the reminder!

> 
> Thanks,
> Andrea
> 
> PS. my last email about this from a more KVM centric point of view:
> 
> http://www.spinics.net/lists/kvm/msg101449.html
> 

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 19:27         ` Johannes Weiner
@ 2014-04-07  6:19           ` Minchan Kim
  -1 siblings, 0 replies; 112+ messages in thread
From: Minchan Kim @ 2014-04-07  6:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrea Arcangeli, John Stultz, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, H. Peter Anvin, linux-mm

On Wed, Apr 02, 2014 at 03:27:44PM -0400, Johannes Weiner wrote:
> On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> > Hi everyone,
> > 
> > On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> > > you have a third option you're thinking of, I'd of course be interested
> > > in hearing it.
> > 
> > I actually thought the way of being notified with a page fault (sigbus
> > or whatever) was the most efficient way of using volatile ranges.
> > 
> > Why having to call a syscall to know if you can still access the
> > volatile range, if there was no VM pressure before the access?
> > syscalls are expensive, accessing the memory direct is not. Only if it
> > page was actually missing and a page fault would fire, you'd take the
> > slowpath.
> 
> Not everybody wants to actually come back for the data in the range;
> allocators and message-passing applications just want to be able to
> reuse the memory mapping.
> 
> By tying the volatility to the dirty bit in the page tables, an
> allocator could simply clear those bits once on free().  When malloc()
> hands out this region again, the user is expected to write, which will
> either overwrite the old page, or, if it was purged, fault in a fresh
> zero page.  But there is no second syscall needed to clear volatility.
> 
> > > Now... once you've chosen SIGBUS semantics, there will be folks who will
> > > try to exploit the fact that we get SIGBUS on purged page access (at
> > > least on the user-space side) and will try to access pages that are
> > > volatile until they are purged and try to then handle the SIGBUS to fix
> > > things up. Those folks exploiting that will have to be particularly
> > > careful not to pass volatile data to the kernel, and if they do they'll
> > > have to be smart enough to handle the EFAULT, etc. That's really all
> > > their problem, because they're being clever. :)
> > 
> > I'm actually working on feature that would solve the problem for the
> > syscalls accessing missing volatile pages. So you'd never see a
> > -EFAULT because all syscalls won't return even if they encounters a
> > missing page in the volatile range dropped by the VM pressure.
> > 
> > It's called userfaultfd. You call sys_userfaultfd(flags) and it
> > connects the current mm to a pseudo filedescriptor. The filedescriptor
> > works similarly to eventfd but with a different protocol.
> > 
> > You need a thread that will never access the userfault area with the
> > CPU, that is responsible to poll on the userfaultfd and talk the
> > userfaultfd protocol to fill-in missing pages. The userfault thread
> > after a POLLIN event reads the virtual addresses of the fault that
> > must have happened on some other thread of the same mm, and then
> > writes back an "handled" virtual range into the fd, after the page (or
> > pages if multiple) have been regenerated and mapped in with
> > sys_remap_anon_pages(), mremap or equivalent atomic pagetable page
> > swapping. Then depending on the "solved" range written back into the
> > fd, the kernel will wakeup the thread or threads that were waiting in
> > kernel mode on the "handled" virtual range, and retry the fault
> > without ever exiting kernel mode.
> > 
> > We need this in KVM for running the guest on memory that is on other
> > nodes or other processes (postcopy live migration is the most common
> > use case but there are others like memory externalization and
> > cross-node KSM in the cloud, to keep a single copy of memory across
> > multiple nodes and externalized to the VM and to the host node).
> > 
> > This thread made me wonder if we could mix the two features and you
> > would then depend on MADV_USERFAULT and userfaultfd to deliver to
> > userland the "faults" happening on the volatile pages that have been
> > purged as result of VM pressure.
> > 
> > I'm just saying this after Johannes mentioned the issue with syscalls
> > returning -EFAULT. Because that is the very issue that the userfaultfd
> > is going to solve for the KVM migration thread.
> > 
> > What I'm thinking now would be to mark the volatile range also
> > MADV_USERFAULT and then calling userfaultfd and instead of having the
> > cache regeneration "slow path" inside the SIGBUS handler, to run it in
> > the userfault thread that polls the userfaultfd. Then you could write
> > the volatile ranges to disk with a write() syscall (or use any other
> > syscall on the volatile ranges), without having to worry about -EFAULT
> > being returned because one page was discarded. And if MADV_USERFAULT
> > is not called in combination with vrange syscalls, then it'd still
> > work without the userfault, but with the vrange syscalls only.
> > 
> > In short the idea would be to let the userfault code solve the fault
> > delivery to userland for you, and make the vrange syscalls only focus
> > on the page purging problem, without having to worry about what
> > happens when something access a missing page.
> 
> Yes, the two seem certainly combinable to me.
> 
> madvise(MADV_FREE | MADV_USERFAULT) to allow purging and userspace
> fault handling.  In the fault slowpath, you can then regenerate any
> missing data and do MADV_FREE again if it should remain volatile.  And
> again, any actual writes to the region would clear volatility because
> now the cache copy changed and discarding it would mean losing state.

There is another scenario the above can't cover.
Someone might leave a range volatile permanently, until unmarking, so
they can generate cache pages in that range freely without further
syscalls.

I mean the above suggestion covers pages that were already mapped when
the syscall was called, but it can't cover pages faulted in later, so
I think the vrange syscall is still needed.

> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-03-23 17:42     ` KOSAKI Motohiro
@ 2014-04-07 18:37       ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-07 18:37 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

On 03/23/2014 10:42 AM, KOSAKI Motohiro wrote:
> On Fri, Mar 21, 2014 at 2:17 PM, John Stultz <john.stultz@linaro.org> wrote:
>> Users of volatile ranges will need to know if memory was discarded.
>> This patch adds the purged state tracking required to inform userland
>> when it marks memory as non-volatile that some memory in that range
>> was purged and needs to be regenerated.
>>
>> This simplified implementation which uses some of the logic from
>> Minchan's earlier efforts, so credit to Minchan for his work.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Android Kernel Team <kernel-team@android.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Robert Love <rlove@google.com>
>> Cc: Mel Gorman <mel@csn.ul.ie>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Dave Hansen <dave@sr71.net>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
>> Cc: Neil Brown <neilb@suse.de>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Mike Hommey <mh@glandium.org>
>> Cc: Taras Glek <tglek@mozilla.com>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
>> Cc: Michel Lespinasse <walken@google.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
>> Signed-off-by: John Stultz <john.stultz@linaro.org>
>> ---
>>  include/linux/swap.h    | 15 ++++++++--
>>  include/linux/swapops.h | 10 +++++++
>>  include/linux/vrange.h  |  3 ++
>>  mm/vrange.c             | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 101 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 46ba0c6..18c12f9 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -70,8 +70,19 @@ static inline int current_is_kswapd(void)
>>  #define SWP_HWPOISON_NUM 0
>>  #endif
>>
>> -#define MAX_SWAPFILES \
>> -       ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>> +
>> +/*
>> + * Purged volatile range pages
>> + */
>> +#define SWP_VRANGE_PURGED_NUM 1
>> +#define SWP_VRANGE_PURGED (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
>> +
>> +
>> +#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)      \
>> +                               - SWP_MIGRATION_NUM     \
>> +                               - SWP_HWPOISON_NUM      \
>> +                               - SWP_VRANGE_PURGED_NUM \
>> +                       )
> This changes the hwpoison and migration tag numbers. Maybe ok, maybe not.

Though depending on config can't these tag numbers change anyway?


> I'd suggest using a lower number than hwpoison.
> (That's why hwpoison uses a lower number than migration.)

So I can, but the way these are defined makes the results seem pretty
terrible:

#define SWP_MIGRATION_WRITE    (MAX_SWAPFILES + SWP_HWPOISON_NUM \
                    + SWP_MVOLATILE_PURGED_NUM + 1)

Particularly when:
#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)        \
                - SWP_MIGRATION_NUM        \
                - SWP_HWPOISON_NUM        \
                - SWP_MVOLATILE_PURGED_NUM    \
            )

It's a lot of unnecessary mental gymnastics. Yuck.

Would a general cleanup like the following be ok to try to make this
more extensible?

thanks
-john

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3507115..21387df 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -49,29 +49,38 @@ static inline int current_is_kswapd(void)
  * actions on faults.
  */
 
+enum {
+	/*
+	 * NOTE: We use the high bits here (subtracting from
+	 * 1<<MAX_SWPFILES_SHIFT), so to preserve the values insert
+	 * new entries here at the top of the enum, not at the bottom
+	 */
+#ifdef CONFIG_MEMORY_FAILURE
+	SWP_HWPOISON_NR,
+#endif
+#ifdef CONFIG_MIGRATION
+	SWP_MIGRATION_READ_NR,
+	SWP_MIGRATION_WRITE_NR,
+#endif
+	SWP_MAX_NR,
+};
+#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT) - SWP_MAX_NR)
+
 /*
  * NUMA node memory migration support
  */
 #ifdef CONFIG_MIGRATION
-#define SWP_MIGRATION_NUM 2
-#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
-#else
-#define SWP_MIGRATION_NUM 0
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_MIGRATION_READ_NR)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_MIGRATION_WRITE_NR)
 #endif
 
 /*
  * Handling of hardware poisoned pages with memory corruption.
  */
 #ifdef CONFIG_MEMORY_FAILURE
-#define SWP_HWPOISON_NUM 1
-#define SWP_HWPOISON		MAX_SWAPFILES
-#else
-#define SWP_HWPOISON_NUM 0
+#define SWP_HWPOISON		(MAX_SWAPFILES + SWP_HWPOISON_NR)
 #endif
 
-#define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is


^ permalink raw reply related	[flat|nested] 112+ messages in thread


* Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-04-07 18:37       ` John Stultz
@ 2014-04-07 22:14         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 112+ messages in thread
From: KOSAKI Motohiro @ 2014-04-07 22:14 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

>> This changes the hwpoison and migration tag numbers. Maybe ok, maybe not.
>
> Though depending on config can't these tag numbers change anyway?

I don't think distros disable any of these.


>> I'd suggest using a lower number than hwpoison.
>> (That's why hwpoison uses a lower number than migration.)
>
> So I can, but the way these are defined makes the results seem pretty
> terrible:
>
> #define SWP_MIGRATION_WRITE    (MAX_SWAPFILES + SWP_HWPOISON_NUM \
>                     + SWP_MVOLATILE_PURGED_NUM + 1)
>
> Particularly when:
> #define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)        \
>                 - SWP_MIGRATION_NUM        \
>                 - SWP_HWPOISON_NUM        \
>                 - SWP_MVOLATILE_PURGED_NUM    \
>             )
>
> It's a lot of unnecessary mental gymnastics. Yuck.
>
> Would a general cleanup like the following be ok to try to make this
> more extensible?
>
> thanks
> -john
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3507115..21387df 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -49,29 +49,38 @@ static inline int current_is_kswapd(void)
>   * actions on faults.
>   */
>
> +enum {
> +       /*
> +        * NOTE: We use the high bits here (subtracting from
> +        * 1<<MAX_SWPFILES_SHIFT), so to preserve the values insert
> +        * new entries here at the top of the enum, not at the bottom
> +        */
> +#ifdef CONFIG_MEMORY_FAILURE
> +       SWP_HWPOISON_NR,
> +#endif
> +#ifdef CONFIG_MIGRATION
> +       SWP_MIGRATION_READ_NR,
> +       SWP_MIGRATION_WRITE_NR,
> +#endif
> +       SWP_MAX_NR,
> +};
> +#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT) - SWP_MAX_NR)
> +

I don't see any benefit in this code. At least, SWP_MAX_NR sucks:
the name doesn't match its actual meaning.

^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-04-07 22:14         ` KOSAKI Motohiro
@ 2014-04-08  3:09           ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-08  3:09 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

On 04/07/2014 03:14 PM, KOSAKI Motohiro wrote:
>>> This changes the hwpoison and migration tag numbers. Maybe ok, maybe not.
>> Though depending on config can't these tag numbers change anyway?
> I don't think distros disable any of these.

Well, it still shouldn't break if the config options are turned off.
This isn't some subtle userspace-visible ABI, is it?
I'm fine with keeping the values the same, but it just seems worrying if
this logic is so fragile.


>>> I'd suggest using a lower number than hwpoison.
>>> (That's why hwpoison uses a lower number than migration.)
>> So I can, but the way these are defined makes the results seem pretty
>> terrible:
>>
>> #define SWP_MIGRATION_WRITE    (MAX_SWAPFILES + SWP_HWPOISON_NUM \
>>                     + SWP_MVOLATILE_PURGED_NUM + 1)
>>
>> Particularly when:
>> #define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)        \
>>                 - SWP_MIGRATION_NUM        \
>>                 - SWP_HWPOISON_NUM        \
>>                 - SWP_MVOLATILE_PURGED_NUM    \
>>             )
>>
>> It's a lot of unnecessary mental gymnastics. Yuck.
>>
>> Would a general cleanup like the following be ok to try to make this
>> more extensible?
>>
>> thanks
>> -john
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 3507115..21387df 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -49,29 +49,38 @@ static inline int current_is_kswapd(void)
>>   * actions on faults.
>>   */
>>
>> +enum {
>> +       /*
>> +        * NOTE: We use the high bits here (subtracting from
>> +        * 1<<MAX_SWPFILES_SHIFT), so to preserve the values insert
>> +        * new entries here at the top of the enum, not at the bottom
>> +        */
>> +#ifdef CONFIG_MEMORY_FAILURE
>> +       SWP_HWPOISON_NR,
>> +#endif
>> +#ifdef CONFIG_MIGRATION
>> +       SWP_MIGRATION_READ_NR,
>> +       SWP_MIGRATION_WRITE_NR,
>> +#endif
>> +       SWP_MAX_NR,
>> +};
>> +#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT) - SWP_MAX_NR)
>> +
> I don't see any benefit in this code. At least, SWP_MAX_NR sucks.


So it makes adding new special swap types (like SWP_MVOLATILE_PURGED)
much cleaner. If we need to preserve the actual values for SWP_HWPOISON
and SWP_MIGRATION_* as you suggested earlier, the cleanup above makes
doing so when adding a new type much easier.

For example, adding the MVOLATILE_PURGED value (without affecting the
values of HWPOISON or MIGRATION_*) is only:

@@ -55,6 +55,7 @@ enum {
         * 1<<MAX_SWPFILES_SHIFT), so to preserve the values insert
         * new entries here at the top of the enum, not at the bottom
         */
+       SWP_MVOLATILE_PURGED_NR,
 #ifdef CONFIG_MEMORY_FAILURE
        SWP_HWPOISON_NR,
 #endif
@@ -81,6 +82,10 @@ enum {
 #define SWP_HWPOISON           (MAX_SWAPFILES + SWP_HWPOISON_NR)
 #endif
 
+/*
+ * Purged volatile range pages
+ */
+#define SWP_MVOLATILE_PURGED   (MAX_SWAPFILES + SWP_MVOLATILE_PURGED_NR)
 

That's *much* nicer when compared with modifying every value to subtract the extra entry, as it was done before.


> The name doesn't match the actual meanings.
Would SWP_MAX_SPECIAL_TYPE_NR be a better name? Do you have other
suggestions?

thanks
-john


^ permalink raw reply	[flat|nested] 112+ messages in thread


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-08  4:32             ` Kevin Easton
@ 2014-04-08  3:38                 ` John Stultz
  0 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-08  3:38 UTC (permalink / raw)
  To: Kevin Easton
  Cc: Johannes Weiner, Dave Hansen, H. Peter Anvin, LKML,
	Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Andrea Arcangeli, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, linux-mm

On 04/07/2014 09:32 PM, Kevin Easton wrote:
> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
>> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>> I'm just dying to hear a "normal" use case then. :)
>> So the more "normal" use case would be marking objects volatile and
>> then non-volatile w/o accessing them in-between. In this case the
>> zero-fill vs SIGBUS semantics don't really matter; it's really just a
>> trade-off in how we handle applications deviating (intentionally or
>> not) from this use case.
>>
>> So to maybe flesh out the context here for folks who are following
>> along (but weren't in the hallway at LSF :),  Johannes made a fairly
>> interesting proposal (Johannes: please correct me where I'm maybe
>> slightly off) to use only the dirty bits of the ptes to mark a
>> page as volatile. Then the kernel could reclaim these clean pages as
>> it needed, and when we marked the range as non-volatile, the pages
>> would be re-dirtied and if any of the pages were missing, we could
>> return a flag with the purged state.  This had some different
>> semantics than what I've been working with for a while (for example,
>> any writes to pages would implicitly clear volatility), so I wasn't
>> completely comfortable with it, but figured I'd think about it to see
>> if it could be done. Particularly since it would in some ways simplify
>> tmpfs/shm shared volatility that I'd eventually like to do.
> ...
>> Now, while for the case I'm personally most interested in (ashmem),
>> zero-fill would technically be ok, since that's what Android does.
>> Even so, I don't think it's the best approach for the interface, since
>> applications may end up quite surprised by the results when they
>> accidentally don't follow the "don't touch volatile pages" rule.
>>
>> That point aside, I think the other problem with the page-cleaning
>> volatility approach is that there are other awkward side effects. For
>> example: Say an application marks a range as volatile. One page in the
>> range is then purged. The application, due to a bug or otherwise,
>> reads the volatile range. This causes the page to be zero-filled in,
>> and the application silently uses the corrupted data (which isn't
>> great). More problematic though, is that by faulting the page in,
>> they've in effect lost the purge state for that page. When the
>> application then goes to mark the range as non-volatile, all pages are
>> present, so we'd return that no pages were purged.  From an
>> application perspective this is pretty ugly.
> The write-implicitly-clears-volatile semantics would actually be
> an advantage for some use cases.  If you have a volatile cache of
> many sub-page-size objects, the application can just include at
> the start of each page "int present, in_use;".  "present" is set
> to non-zero before marking volatile, and when the application wants to
> unmark it as volatile it writes to "in_use" and tests the value of
> "present".  No need for a syscall at all, although it does take a
> minor fault.
>
> The syscall would be better for the case of large objects, though.
>
> Or is that fatally flawed?

Well, as you note, each object would then have to be page size or
smaller, which limits some of the potential use cases.

However, these semantics are a better match for the MADV_FREE proposal
Minchan is pushing, so this method would work fine there.
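To make Kevin's sub-page-object idea above concrete, here is a minimal userspace sketch of the protocol, assuming the write-implicitly-clears-volatile semantics under discussion. The purge() helper merely simulates the kernel zero-filling a reclaimed clean page; none of these names are a real API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical header an application places at the start of each
 * volatile page holding sub-page-size cached objects. */
struct vpage {
    int present;   /* non-zero while the cached contents are valid */
    int in_use;    /* written to re-dirty the page and clear volatility */
    char data[TOY_PAGE_SIZE - 2 * sizeof(int)];
};

/* Simulates the kernel purging a clean volatile page: a later read
 * faults in a zero-filled page. */
static void purge(struct vpage *p)
{
    memset(p, 0, sizeof(*p));
}

/* "Marking volatile" here is just setting the sentinel; under the
 * proposed semantics no syscall is needed. */
static void mark_volatile(struct vpage *p)
{
    p->present = 1;
}

/* Un-mark: the store to in_use re-dirties the page (costing at most a
 * minor fault); reading present back tells us whether the contents
 * survived. Returns 1 if the cache is still valid, 0 if purged. */
static int unmark_volatile(struct vpage *p)
{
    p->in_use = 1;
    return p->present != 0;
}
```

If the page survives, present is still 1 and the cached data is intact; if the kernel reclaimed it, the zero-filled present field reports the loss, which is exactly the property the syscall's purged flag provides for larger objects.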

thanks
-john



^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 17:40             ` John Stultz
                               ` (2 preceding siblings ...)
  (?)
@ 2014-04-08  4:32             ` Kevin Easton
  2014-04-08  3:38                 ` John Stultz
  -1 siblings, 1 reply; 112+ messages in thread
From: Kevin Easton @ 2014-04-08  4:32 UTC (permalink / raw)
  To: John Stultz
  Cc: Johannes Weiner, Dave Hansen, H. Peter Anvin, LKML,
	Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Andrea Arcangeli, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, linux-mm

On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > I'm just dying to hear a "normal" use case then. :)
> 
> So the more "normal" use case would be marking objects volatile and
> then non-volatile w/o accessing them in-between. In this case the
> zero-fill vs SIGBUS semantics don't really matter; it's really just a
> trade-off in how we handle applications deviating (intentionally or
> not) from this use case.
> 
> So to maybe flesh out the context here for folks who are following
> along (but weren't in the hallway at LSF :),  Johannes made a fairly
> interesting proposal (Johannes: please correct me where I'm maybe
> slightly off) to use only the dirty bits of the ptes to mark a
> page as volatile. Then the kernel could reclaim these clean pages as
> it needed, and when we marked the range as non-volatile, the pages
> would be re-dirtied and if any of the pages were missing, we could
> return a flag with the purged state.  This had some different
> semantics than what I've been working with for a while (for example,
> any writes to pages would implicitly clear volatility), so I wasn't
> completely comfortable with it, but figured I'd think about it to see
> if it could be done. Particularly since it would in some ways simplify
> tmpfs/shm shared volatility that I'd eventually like to do.
...
> Now, while for the case I'm personally most interested in (ashmem),
> zero-fill would technically be ok, since that's what Android does.
> Even so, I don't think it's the best approach for the interface, since
> applications may end up quite surprised by the results when they
> accidentally don't follow the "don't touch volatile pages" rule.
> 
> That point aside, I think the other problem with the page-cleaning
> volatility approach is that there are other awkward side effects. For
> example: Say an application marks a range as volatile. One page in the
> range is then purged. The application, due to a bug or otherwise,
> reads the volatile range. This causes the page to be zero-filled in,
> and the application silently uses the corrupted data (which isn't
> great). More problematic though, is that by faulting the page in,
> they've in effect lost the purge state for that page. When the
> application then goes to mark the range as non-volatile, all pages are
> present, so we'd return that no pages were purged.  From an
> application perspective this is pretty ugly.

The write-implicitly-clears-volatile semantics would actually be
an advantage for some use cases.  If you have a volatile cache of
many sub-page-size objects, the application can just include at
the start of each page "int present, in_use;".  "present" is set
to non-zero before marking volatile, and when the application wants to
unmark it as volatile it writes to "in_use" and tests the value of
"present".  No need for a syscall at all, although it does take a
minor fault.

The syscall would be better for the case of large objects, though.

Or is that fatally flawed?

    - Kevin

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 1/5] vrange: Add vrange syscall and handle splitting/merging and marking vmas
  2014-03-23 16:50     ` KOSAKI Motohiro
@ 2014-04-08 18:52       ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-08 18:52 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

Hey Kosaki-san,
   Sorry to not have replied to this earlier, I really appreciate your
review! I'm now running through your feedback to make sure it's all
integrated into my upcoming v13 patch series, and while most of your
comments have been addressed, there are a few items outstanding, which I
suspect stems from a misunderstanding on my part or yours.

Anyway, thanks again for the comments. A few notes below.

On 03/23/2014 09:50 AM, KOSAKI Motohiro wrote:
> On Fri, Mar 21, 2014 at 2:17 PM, John Stultz <john.stultz@linaro.org> wrote:
>> RETURN VALUE
>>         On success vrange returns the number of bytes marked or unmarked.
>>         Similar to write(), it may return fewer bytes than specified
>>         if it ran into a problem.
> This explanation doesn't match your implementation. You return the
> last VMA - orig_start.
> That said, when there is a hole in the middle of the range, the marked
> (or unmarked) bytes don't match the return value.

As soon as we hit the hole, we will stop making further changes and will
return the number of successfully modified bytes up to that point. Thus
last VMA - orig_start should still match the modified values up to the hole.

I'm not sure how this is inconsistent with the implementation or
documentation, but there may still be bugs so I'd appreciate your
clarification if you think this is still an issue in the v13 release.
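The stop-at-the-first-hole semantics described above can be modeled in plain C over a sorted VMA-like list. The struct and function names are invented for illustration; this is not the kernel implementation:

```c
#include <assert.h>

/* Toy VMA: a half-open [start, end) address range. */
struct toy_vma {
    unsigned long start, end;
};

/* Walk a sorted, non-overlapping VMA list and "mark" bytes until the
 * first hole is hit. Mirrors the documented semantics: return the
 * number of bytes successfully modified, which may be fewer than
 * requested, or -1 if nothing could be modified at all. */
static long mark_range(const struct toy_vma *vmas, int n,
                       unsigned long start, unsigned long len)
{
    unsigned long addr = start, end = start + len;
    for (int i = 0; i < n && addr < end; i++) {
        if (vmas[i].end <= addr)
            continue;               /* entirely before the cursor */
        if (vmas[i].start > addr)
            break;                  /* hole: stop, report progress */
        addr = vmas[i].end < end ? vmas[i].end : end;
    }
    if (addr == start)
        return -1;                  /* error: no changes were made */
    return addr - start;            /* bytes marked up to the hole */
}
```

So for VMAs covering [0, 8192) and [16384, 32768), a request over the whole 32768 bytes reports 8192 bytes modified, and a request starting inside the hole fails outright with no changes, matching the RETURN VALUE text quoted above.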


>
>>         When using VRANGE_NON_VOLATILE, if the return value is smaller
>>         than the specified length, then the value specified by the purged
>>         pointer will be set to 1 if any of the pages specified in the
>>         return value as successfully marked non-volatile had been purged.
>>
>>         If an error is returned, no changes were made.
> At least, this explanation doesn't match the implementation. When you find file
> mappings, you don't roll back prior changes.
No. If we find a file mapping, we simply return the amount of
successfully modified bytes prior to hitting that file mapping. This is
much in the same way as if we hit a hole in the address space. Again,
maybe you mis-read this or I am not understanding the issue you're
pointing out.



>
>> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
>> new file mode 100644
>> index 0000000..6e5331e
>> --- /dev/null
>> +++ b/include/linux/vrange.h
>> @@ -0,0 +1,8 @@
>> +#ifndef _LINUX_VRANGE_H
>> +#define _LINUX_VRANGE_H
>> +
>> +#define VRANGE_NONVOLATILE 0
>> +#define VRANGE_VOLATILE 1
> Maybe moving this to uapi would be better?

Agreed! Fixed in my tree.


>> +
>> +       down_read(&mm->mmap_sem);
> This should be down_write. VMA split and merge require write lock.

Very true. Minchan has already sent a fix that I've folded into my tree.



>> +
>> +       len &= PAGE_MASK;
>> +       if (!len)
>> +               goto out;
> This code doesn't match the explanation of "not page size units."

Again, good eye! Fixed in my tree.



>> +       if (purged) {
>> +               if (put_user(p, purged)) {
>> +                       /*
>> +                        * This would be bad, since we've modified volatility
>> +                        * and the change in purged state would be lost.
>> +                        */
>> +                       WARN_ONCE(1, "vrange: purge state possibly lost\n");
> Don't do that.
> If the userland app unmaps the page between do_vrange and here, it's just
> their fault, not the kernel's.
> Therefore a kernel warning makes no sense. Please just move the 1st put_user here.
Yes, per Jan's suggestion I've changed this to return EFAULT.


Thanks again for your great review here!
-john


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile
  2014-03-23 21:50         ` KOSAKI Motohiro
@ 2014-04-09 18:29           ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-09 18:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

On 03/23/2014 02:50 PM, KOSAKI Motohiro wrote:
> On Sun, Mar 23, 2014 at 1:26 PM, John Stultz <john.stultz@linaro.org> wrote:
>> On Sun, Mar 23, 2014 at 10:50 AM, KOSAKI Motohiro
>> <kosaki.motohiro@gmail.com> wrote:
>>>> +/**
>>>> + * vrange_check_purged_pte - Checks ptes for purged pages
>>>> + *
>>>> + * Iterates over the ptes in the pmd checking if they have
>>>> + * purged swap entries.
>>>> + *
>>>> + * Sets the vrange_walker.pages_purged to 1 if any were purged.
>>>> + */
>>>> +static int vrange_check_purged_pte(pmd_t *pmd, unsigned long addr,
>>>> +                                       unsigned long end, struct mm_walk *walk)
>>>> +{
>>>> +       struct vrange_walker *vw = walk->private;
>>>> +       pte_t *pte;
>>>> +       spinlock_t *ptl;
>>>> +
>>>> +       if (pmd_trans_huge(*pmd))
>>>> +               return 0;
>>>> +       if (pmd_trans_unstable(pmd))
>>>> +               return 0;
>>>> +
>>>> +       pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>>>> +       for (; addr != end; pte++, addr += PAGE_SIZE) {
>>>> +               if (!pte_present(*pte)) {
>>>> +                       swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
>>>> +
>>>> +                       if (unlikely(is_vpurged_entry(vrange_entry))) {
>>>> +                               vw->page_was_purged = 1;
>>>> +                               break;
>>> This function only detects whether a vpurge entry is present. But
>>> VRANGE_NONVOLATILE should remove all vpurge entries.
>>> Otherwise, a non-volatiled range can still raise SIGBUS.
>> So in the following patch (3/5), we only SIGBUS if the swap entry
>> is_vpurged_entry()  && the vma is still marked volatile, so this
>> shouldn't be an issue.
> When a VOLATILE -> NON-VOLATILE -> VOLATILE transition happens,
> is the page immediately marked "was purged"?
>
> I don't understand why vma check help.

Ok, you have a good point here. I've changed the code in my tree to
traverse all the ptes being unmarked and zap any pseudo swp entries. This
is more expensive, but does provide simpler semantics for these corner
cases.
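The zap-on-unmark fix described above can be modeled with a trivial pte array. The sentinel value and function are invented for illustration, not kernel code:

```c
#include <assert.h>

/* Toy pte encoding: 0 = not present, TOY_VPURGE = the purged-page
 * pseudo swap entry, anything else = a mapped page. */
#define TOY_VPURGE 0xdeadUL

/* Un-marking walks every pte in the range, reports whether any page
 * was purged, and zaps the pseudo entries so a later VOLATILE ->
 * NONVOLATILE cycle does not see stale purge state (the corner case
 * KOSAKI pointed out). */
static int unmark_and_zap(unsigned long *pte, int n)
{
    int purged = 0;
    for (int i = 0; i < n; i++) {
        if (pte[i] == TOY_VPURGE) {
            purged = 1;
            pte[i] = 0;   /* next touch zero-fills a fresh page */
        }
    }
    return purged;
}
```

The extra cost is one full pte walk per un-mark, but a second un-mark of the same range correctly reports nothing purged.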

Thanks again for the great review!
-john




^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] vrange: Add page purging logic & SIGBUS trap
  2014-03-23 23:44     ` KOSAKI Motohiro
@ 2014-04-10 18:49       ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-10 18:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

Hey Kosaki-san,
  Just a few follow ups on your comments here in preparation for v13.

On 03/23/2014 04:44 PM, KOSAKI Motohiro wrote:
> On Fri, Mar 21, 2014 at 2:17 PM, John Stultz <john.stultz@linaro.org> wrote:
> @@ -683,6 +684,7 @@ enum page_references {
>         PAGEREF_RECLAIM,
>         PAGEREF_RECLAIM_CLEAN,
>         PAGEREF_KEEP,
> +       PAGEREF_DISCARD,
> "discard" is already used in various places with other meanings;
> another name would be better.

Any suggestions here? Is PAGEREF_PURGE better?


>
>>         PAGEREF_ACTIVATE,
>>  };
>>
>> @@ -703,6 +705,13 @@ static enum page_references page_check_references(struct page *page,
>>         if (vm_flags & VM_LOCKED)
>>                 return PAGEREF_RECLAIM;
>>
>> +       /*
>> +        * If a volatile page reaches the LRU's tail, we discard the
>> +        * page without considering recycling it.
>> +        */
>> +       if (vm_flags & VM_VOLATILE)
>> +               return PAGEREF_DISCARD;
>> +
>>         if (referenced_ptes) {
>>                 if (PageSwapBacked(page))
>>                         return PAGEREF_ACTIVATE;
>> @@ -930,6 +939,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>>                 switch (references) {
>>                 case PAGEREF_ACTIVATE:
>>                         goto activate_locked;
>> +               case PAGEREF_DISCARD:
>> +                       if (may_enter_fs && !discard_vpage(page))
> Why is may_enter_fs needed? discard_vpage never enters the FS.

I think this is a holdover from the file-based/shared volatility.
Thanks for pointing it out, I've dropped the may_enter_fs check.


>> +       /*
>> +        * While iterating the loop, some processes could see a page as
>> +        * purged while others could see a page as not-purged because we have
>> +        * no global lock between parent and child for protecting vrange system
>> +        * call during this loop. But it's not a problem because the page is
>> +        * not *SHARED* page but *COW* page so parent and child can see other
>> +        * data anytime. The worst case by this race is a page was purged
>> +        * but couldn't be discarded so it makes unnecessary page fault but
>> +        * it wouldn't be severe.
>> +        */
>> +       anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
>> +               struct vm_area_struct *vma = avc->vma;
>> +
>> +               if (!(vma->vm_flags & VM_VOLATILE))
>> +                       continue;
> When you find !VM_VOLATILE vma, we have no reason to continue pte zapping.
> Isn't it?

Sounds reasonable. I'll switch to breaking out here and returning an
error if Minchan doesn't object.


>
>> +               try_to_discard_one(page, vma);
>> +       }
>> +       page_unlock_anon_vma_read(anon_vma);
>> +       return 0;
>> +}
>> +
>> +
>> +/**
>> + * discard_vpage - If possible, discard the specified volatile page
>> + *
>> + * Attempts to discard a volatile page, and if needed frees the swap page
>> + *
>> + * Returns 0 on success, -1 on error.
>> + */
>> +int discard_vpage(struct page *page)
>> +{
>> +       VM_BUG_ON(!PageLocked(page));
>> +       VM_BUG_ON(PageLRU(page));
>> +
>> +       /* XXX - for now we only support anonymous volatile pages */
>> +       if (!PageAnon(page))
>> +               return -1;
>> +
>> +       if (!try_to_discard_vpage(page)) {
>> +               if (PageSwapCache(page))
>> +                       try_to_free_swap(page);
> This looks strange. try_to_free_swap can't handle vpurge pseudo entry.

So I may be missing some of the subtleties of the swap code, but the
vpurge pseudo swp entry is on the pte, whereas here we're just trying
to make sure that before we drop the page we disconnect any swap backing
the page may have (if it were swapped out previously before being marked
volatile). Let me know if I'm just not understanding the code or your point.


>> +
>> +               if (page_freeze_refs(page, 1)) {
> Where is page_unfreeze_refs() for the pair of this?

Since we're about to free the page I don't think we need an unfreeze_refs
pair? Or am I just misunderstanding the rules here?

thanks
-john

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] vrange: Add page purging logic & SIGBUS trap
@ 2014-04-10 18:49       ` John Stultz
  0 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-10 18:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, Michel Lespinasse, Minchan Kim, linux-mm

Hey Kosaki-san,
  Just a few follow ups on your comments here in preparation for v13.

On 03/23/2014 04:44 PM, KOSAKI Motohiro wrote:
> On Fri, Mar 21, 2014 at 2:17 PM, John Stultz <john.stultz@linaro.org> wrote:
> @@ -683,6 +684,7 @@ enum page_references {
>         PAGEREF_RECLAIM,
>         PAGEREF_RECLAIM_CLEAN,
>         PAGEREF_KEEP,
> +       PAGEREF_DISCARD,
> "discard" is alread used in various place for another meanings.
> another name is better.

Any suggestions here? Is PAGEREF_PURGE better?


>
>>         PAGEREF_ACTIVATE,
>>  };
>>
>> @@ -703,6 +705,13 @@ static enum page_references page_check_references(struct page *page,
>>         if (vm_flags & VM_LOCKED)
>>                 return PAGEREF_RECLAIM;
>>
>> +       /*
>> +        * If volatile page is reached on LRU's tail, we discard the
>> +        * page without considering recycle the page.
>> +        */
>> +       if (vm_flags & VM_VOLATILE)
>> +               return PAGEREF_DISCARD;
>> +
>>         if (referenced_ptes) {
>>                 if (PageSwapBacked(page))
>>                         return PAGEREF_ACTIVATE;
>> @@ -930,6 +939,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>>                 switch (references) {
>>                 case PAGEREF_ACTIVATE:
>>                         goto activate_locked;
>> +               case PAGEREF_DISCARD:
>> +                       if (may_enter_fs && !discard_vpage(page))
> Wny may-enter-fs is needed? discard_vpage never enter FS.

I think this is a hold over from the file based/shared volatility.
Thanks for pointing it out, I've dropped the may_enter_fs check.


>> +       /*
>> +        * While iterating this loop, some processes could see a page as
>> +        * purged while others see it as not-purged, because there is no
>> +        * global lock between parent and child protecting the vrange
>> +        * system call during the loop. But that's not a problem: the page
>> +        * is not a *SHARED* page but a *COW* page, so parent and child may
>> +        * see different data at any time. The worst case of this race is
>> +        * that a page was purged but couldn't be discarded, causing an
>> +        * unnecessary page fault, which isn't severe.
>> +        */
>> +       anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
>> +               struct vm_area_struct *vma = avc->vma;
>> +
>> +               if (!(vma->vm_flags & VM_VOLATILE))
>> +                       continue;
> When you find a !VM_VOLATILE vma, there is no reason to continue pte
> zapping, is there?

Sounds reasonable. I'll switch to breaking out here and returning an
error if Minchan doesn't object.


>
>> +               try_to_discard_one(page, vma);
>> +       }
>> +       page_unlock_anon_vma_read(anon_vma);
>> +       return 0;
>> +}
>> +
>> +
>> +/**
>> + * discard_vpage - If possible, discard the specified volatile page
>> + *
>> + * Attempts to discard a volatile page, and if needed frees the swap page
>> + *
>> + * Returns 0 on success, -1 on error.
>> + */
>> +int discard_vpage(struct page *page)
>> +{
>> +       VM_BUG_ON(!PageLocked(page));
>> +       VM_BUG_ON(PageLRU(page));
>> +
>> +       /* XXX - for now we only support anonymous volatile pages */
>> +       if (!PageAnon(page))
>> +               return -1;
>> +
>> +       if (!try_to_discard_vpage(page)) {
>> +               if (PageSwapCache(page))
>> +                       try_to_free_swap(page);
> This looks strange. try_to_free_swap() can't handle the vpurge pseudo
> entry.

So I may be missing some of the subtleties of the swap code, but the
vpurge pseudo swp entry is on the pte, whereas here we're just trying
to make sure that before we drop the page we disconnect any swap
backing the page may have (if it was swapped out previously, before
being marked volatile). Let me know if I'm just not understanding the
code or your point.


>> +
>> +               if (page_freeze_refs(page, 1)) {
> Where is page_unfreeze_refs() for the pair of this?

Since we're about to free the page, I don't think we need an
unfreeze_refs pair? Or am I just misunderstanding the rules here?

thanks
-john

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
  2014-04-02 20:13                     ` John Stultz
@ 2014-04-11 19:32                       ` John Stultz
  -1 siblings, 0 replies; 112+ messages in thread
From: John Stultz @ 2014-04-11 19:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Hansen, H. Peter Anvin, LKML, Andrew Morton,
	Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
	Rik van Riel, Dmitry Adamushko, Neil Brown, Andrea Arcangeli,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On 04/02/2014 01:13 PM, John Stultz wrote:
> On 04/02/2014 12:47 PM, Johannes Weiner wrote:
>
>> It's really nothing but a use-after-free bug that has consequences for
>> no-one but the faulty application.  The thing that IS new is that even
>> a read is enough to corrupt your data in this case.
>>
>> MADV_REVIVE could return 0 if all pages in the specified range were
>> present, -Esomething if otherwise.  That would be semantically sound
>> even if userspace messes up.
> So it's semantically more of just a combined mincore+dirty operation,
> and nothing more?
>
> What are other folks thinking about this? Although I don't particularly
> like it, I probably could go along with Johannes' approach, forgoing
> SIGBUS for zero-fill and adopting semantics that are, to my mind, a
> bit stranger. This would allow for ashmem-style behavior with the
> additional write-clears-volatile-state and read-clears-purged-state
> constraints (which I don't think would be problematic for Android, but
> am not totally sure).
>
> But I do worry that while these semantics are easier for kernel mm
> developers to grasp, they are much harder for application developers
> to understand.

So I don't feel like we've gotten enough feedback for consensus here.

Thus, to at least address other issues pointed out at LSF-MM, I'm going
to shortly send out a v13 of the patchset which keeps with the previous
approach instead of adopting Johannes' suggested approach here.

If folks do prefer Johannes' approach, please speak up as I'm willing to
give it a whirl, despite my concerns about the subtle semantics.

thanks
-john



