* [PATCH 0/4] Volatile Ranges (v13)
From: John Stultz @ 2014-04-11 20:15 UTC
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

Just wanted to send out an updated patch set that includes changes
(mostly cleanups and fixes) from some of the reviews and discussion
at LSF-MM.

New changes are:
----------------
o Renamed vrange syscall to mvolatile (Per Hugh's suggestion)
o Dropped any modifications made to page age when marking volatile.
  (from LSF-MM discussion)
o Changed "discard" usage to "purged" for clarity/consistency
  (Kosaki-san's suggestion)
o Moved appropriate header definitions to uapi/ (Kosaki-san)
o Made sure to take write on mmap_sem (Kosaki-san/Minchan)
o Made sure to clean ptes on marking non-volatile rather than
  cleaning them on fault (Kosaki-san's suggestion)
o Introduced a cleanup patch to make adding new pseudo-swap types
  simpler
o Numerous other fixes from Minchan
o Numerous other fixes suggested by Kosaki-san
o For now I've also dropped the naive aging of anonymous memory on
  swapless systems, as I need to sort out accounting details first.


Still on the TODO list
----------------------------------------------------
o Look into the possibility of doing VMA splitting/merging more
  carefully by hand to avoid potential allocation failures.
  This would allow us to go back to using the madvise() syscall.
o Sort out how best to do page accounting when the volatility
  is tracked on a per-mm basis.
o Revisit anonymous page aging on swapless systems
o Draft up re-adding tmpfs/shm file volatility support


There is, of course, the ongoing discussion with Johannes about
his suggestion of clearing the pte entries' dirty bits to
represent volatility. While this method is attractive from a
kernel mm-developer point of view, the resulting semantics are
quite subtle, and I worry they would be confusing and error prone
for application developers. If consensus builds for that approach,
I'm willing to move to that method, but until then I figured I'd
continue with this approach, which has simpler semantics (but will
require more logic when tmpfs/shm file support is re-added).


Many thanks again to Minchan, Kosaki-san, Johannes, Jan, Rik,
Hugh, and others for the great feedback and discussion at
LSF-MM.

thanks
-john


Volatile ranges provide a method for userland to inform the kernel
that a range of memory is safe to discard (ie: can be regenerated),
but that userspace may want to try to access it in the future. It
can be thought of as similar to MADV_DONTNEED, except that the
actual freeing of the memory is delayed and only done under memory
pressure, and the user can try to cancel the action and quickly
access any unpurged pages. The idea originated from Android's
ashmem, but I've since learned that other OSes provide similar
functionality.

This functionality allows for a number of interesting uses. One such
example is userland caches with kernel-triggered eviction under
memory pressure. This lets the kernel "rightsize" userspace caches
for the current system-wide workload. Things like image bitmap
caches, or rendered HTML in a hidden browser tab, where the data is
not visible and can be regenerated if needed, are good examples.

Both Chrome and Firefox already make use of volatile range-like
functionality via the ashmem interface:
https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34

https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc


The basic usage of volatile ranges is as follows:
1) Userland marks a range of memory that can be regenerated if
necessary as volatile.
2) Before accessing the memory again, userland marks the memory as
nonvolatile, and the kernel will provide notification if any pages in
the range have been purged.

If userland accesses memory while it is volatile, it will either get
the value stored at that memory (if there has been no memory pressure)
or receive a SIGBUS (if the page has been purged).

Reads or writes to the memory do not affect the volatility state of the
pages.
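
To make this concrete, here is a minimal userspace sketch of the
mark/access/unmark cycle (illustrative only: there is no libc wrapper,
so the raw syscall is used, and 316 is simply the x86_64 slot this
series adds in syscall_64.tbl; error handling is elided):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_mvolatile 316		/* x86_64 slot added by patch 2/4 */
#define MVOLATILE_NONVOLATILE 0
#define MVOLATILE_VOLATILE 1

int main(void)
{
	size_t len = 16 * 4096;		/* must be in page-size units */
	int purged = 0;
	char *cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(cache, 'x', len);	/* populate the cache */

	/* 1) contents are regenerable, so let the kernel reclaim them */
	syscall(__NR_mvolatile, cache, len, MVOLATILE_VOLATILE, 0, &purged);

	/* ... the kernel may now purge pages under memory pressure;
	 * touching the range here risks SIGBUS if a page is purged ... */

	/* 2) before reuse, unmark and check whether anything was purged */
	syscall(__NR_mvolatile, cache, len, MVOLATILE_NONVOLATILE, 0, &purged);
	if (purged)
		memset(cache, 'x', len);	/* regenerate the data */

	printf("cache %s purged\n", purged ? "was" : "was not");
	return 0;
}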

You can read more about the history of volatile ranges here (~reverse
chronological order):
https://lwn.net/Articles/592042/
https://lwn.net/Articles/590991/
http://permalink.gmane.org/gmane.linux.kernel.mm/98848
http://permalink.gmane.org/gmane.linux.kernel.mm/98676
https://lwn.net/Articles/522135/
https://lwn.net/Kernel/Index/#Volatile_ranges


Continuing from the last few releases, this revision is reduced in
scope when compared to earlier attempts. I've only focused on
handling volatility on anonymous memory, and we're storing the
volatility in the VMA. This may have performance implications
compared with the earlier approach, but it does simplify the
implementation. I'm open to expanding functionality via flags
arguments, but for now I want to keep the focus on what the right
default behavior should be, and keep the use cases restricted to
help get reviewer interest.

Additionally, since we don't handle volatility on tmpfs files with
this version of the patch, it cannot be used to implement semantics
similar to Android's ashmem. But since shared volatility on files is
more complex, my hope is to start small and hopefully grow from there.

Again, much of the logic in this patchset is based on Minchan's earlier
efforts, so I do want to make sure the credit goes to him for his major
contribution!

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>

John Stultz (4):
  swap: Cleanup how special swap file numbers are defined
  mvolatile: Add mvolatile syscall and handle splitting/merging and
    marking vmas
  mvolatile: Add purged page detection on setting memory non-volatile
  mvolatile: Add page purging logic & SIGBUS trap

 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/mm.h               |   1 +
 include/linux/mvolatile.h        |  10 +
 include/linux/swap.h             |  36 ++--
 include/linux/swapops.h          |  10 +
 include/uapi/linux/mvolatile.h   |   7 +
 mm/Makefile                      |   2 +-
 mm/internal.h                    |   2 -
 mm/memory.c                      |   8 +
 mm/mvolatile.c                   | 401 +++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                        |   5 +
 mm/vmscan.c                      |  12 ++
 12 files changed, 481 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/mvolatile.h
 create mode 100644 include/uapi/linux/mvolatile.h
 create mode 100644 mm/mvolatile.c

-- 
1.8.3.2



* [PATCH 1/4] swap: Cleanup how special swap file numbers are defined
From: John Stultz @ 2014-04-11 20:15 UTC
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

The SWP_HWPOISON and SWP_MIGRATION numbers are defined in
a fairly awkward way. Since they are stolen from the top
few values of the 1<<MAX_SWAPFILES_SHIFT bits, the values
themselves are calculated by taking the MAX_SWAPFILES value
(which is defined by subtracting out all the available special
types), and re-adding all the other various special types.

However, in order to preserve the existing values when adding
new entries, one would have to re-add each new entry's value to
all the other type definitions. This gets ugly fast.

This patch tries to clean up how these values are defined so it's
simpler to understand how they are calculated, and makes it easier
to add new special values.

This is done via an enum which tracks the various special types,
making the MAX_SWAPFILES definition much simpler. Then we just
define each special type as (MAX_SWAPFILES + <enum val>).

As long as the enum values are added to the top of the enum
instead of the bottom, the values for the types will be preserved.
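
To illustrate with concrete numbers (assuming MAX_SWAPFILES_SHIFT
is 5, its usual value, and both config options enabled):

/*
 * SWP_HWPOISON_NR         == 0
 * SWP_MIGRATION_READ_NR   == 1
 * SWP_MIGRATION_WRITE_NR  == 2
 * SWP_MAX_SPECIAL_TYPE_NR == 3
 *
 * MAX_SWAPFILES       == (1 << 5) - 3 == 29
 * SWP_HWPOISON        == 29 + 0      == 29
 * SWP_MIGRATION_READ  == 29 + 1      == 30
 * SWP_MIGRATION_WRITE == 29 + 2      == 31
 *
 * Inserting a new *_NR at the top of the enum bumps every existing
 * *_NR by one and shrinks MAX_SWAPFILES by one, so each existing
 * (MAX_SWAPFILES + *_NR) sum, and thus each type value, stays the
 * same.
 */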

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/swap.h | 31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6..a90ea95 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -49,29 +49,38 @@ static inline int current_is_kswapd(void)
  * actions on faults.
  */
 
+enum {
+	/*
+	 * NOTE: We use the high bits here (subtracting from
+	 * 1<<MAX_SWAPFILES_SHIFT), so to preserve the values insert
+	 * new entries here at the top of the enum, not at the bottom
+	 */
+#ifdef CONFIG_MEMORY_FAILURE
+	SWP_HWPOISON_NR,
+#endif
+#ifdef CONFIG_MIGRATION
+	SWP_MIGRATION_READ_NR,
+	SWP_MIGRATION_WRITE_NR,
+#endif
+	SWP_MAX_SPECIAL_TYPE_NR,
+};
+#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT) - SWP_MAX_SPECIAL_TYPE_NR)
+
 /*
  * NUMA node memory migration support
  */
 #ifdef CONFIG_MIGRATION
-#define SWP_MIGRATION_NUM 2
-#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
-#else
-#define SWP_MIGRATION_NUM 0
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_MIGRATION_READ_NR)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_MIGRATION_WRITE_NR)
 #endif
 
 /*
  * Handling of hardware poisoned pages with memory corruption.
  */
 #ifdef CONFIG_MEMORY_FAILURE
-#define SWP_HWPOISON_NUM 1
-#define SWP_HWPOISON		MAX_SWAPFILES
-#else
-#define SWP_HWPOISON_NUM 0
+#define SWP_HWPOISON		(MAX_SWAPFILES + SWP_HWPOISON_NR)
 #endif
 
-#define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
-- 
1.8.3.2



* [PATCH 2/4] mvolatile: Add mvolatile syscall and handle splitting/merging and marking vmas
From: John Stultz @ 2014-04-11 20:15 UTC
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

This patch introduces the mvolatile() syscall, which allows userspace
to specify ranges of memory as volatile, and thus able to be discarded
by the system.

This initial patch simply adds the syscall and the vma handling:
splitting and merging the vmas as needed, and marking them with
VM_VOLATILE.

No purging or discarding of volatile ranges is done at this point.

Example man page:

NAME
	mvolatile - Mark or unmark a range of memory as volatile

SYNOPSIS
	ssize_t mvolatile(unsigned long start, size_t length,
			 unsigned long mode, unsigned long flags,
			 int *purged);

DESCRIPTION
	Applications can use mvolatile(2) to advise the kernel that
	pages of an anonymous mapping in the given VM area can be
	reclaimed without swapping (or can no longer be reclaimed
	without swapping). The idea is that the application can help
	the kernel with page reclaim under memory pressure by
	specifying data it can easily regenerate, so the kernel can
	discard the data if needed.

	mode:
	MVOLATILE_VOLATILE
		Informs the kernel that the VM can discard pages in
		the specified range when under memory pressure.
	MVOLATILE_NONVOLATILE
		Informs the kernel that the VM can no longer discard pages
		in this range.

	flags: Currently no flags are supported.

	purged: Pointer to an integer which will be set to 1 if
	mode == MVOLATILE_NONVOLATILE and any page in the affected range
	was purged. If purged is zero after a mode ==
	MVOLATILE_NONVOLATILE call, it means all of the pages in the range
	are intact.

	If a process accesses volatile memory which has been purged, and
	it was not set as non-volatile via a MVOLATILE_NONVOLATILE call,
	it will receive a SIGBUS.

RETURN VALUE
	On success mvolatile returns the number of bytes marked or unmarked.

	Similar to write(), it may return fewer bytes than specified
	if it ran into a problem.

	When using MVOLATILE_NONVOLATILE, if the return value is smaller
	than the specified length, then the value returned in the purged
	pointer only reflects the purged state of the successfully marked
	non-volatile pages.

	If an error is returned, no changes were made.

ERRORS
	EINVAL This error can occur for the following reasons:
		* length is negative or not in page-size units.
		* start is not page-aligned.
		* mode is not a valid value.
		* flags is not a valid value.

	ENOMEM Not enough memory

	ENOMEM Addresses in the specified range are not currently mapped,
	       or are outside the address space of the process.

	EFAULT Purged pointer is invalid
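
A hedged sketch of how a caller might cope with the write()-style
partial-success return when unmarking (the raw-syscall invocation and
constants mirror the hypothetical usage sketch in the cover letter;
this helper is illustrative and not part of the patch):

/* Make [addr, addr+len) non-volatile, retrying after short returns.
 * A short return means only a prefix was unmarked and *purged only
 * describes that prefix, so accumulate the flag across calls. */
static int make_nonvolatile(void *addr, size_t len, int *purged)
{
	size_t done = 0;

	*purged = 0;
	while (done < len) {
		int p = 0;
		ssize_t ret = syscall(__NR_mvolatile,
				      (char *)addr + done, len - done,
				      MVOLATILE_NONVOLATILE, 0, &p);
		if (ret < 0)
			return -1;	/* errno: EINVAL/ENOMEM/EFAULT */
		*purged |= p;
		done += ret;
	}
	return 0;
}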

This is a simplified implementation which reuses some of the logic
from Minchan's earlier efforts, so credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/mm.h               |   1 +
 include/linux/mvolatile.h        |   8 ++
 include/uapi/linux/mvolatile.h   |   7 ++
 mm/Makefile                      |   2 +-
 mm/mvolatile.c                   | 195 +++++++++++++++++++++++++++++++++++++++
 6 files changed, 213 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/mvolatile.h
 create mode 100644 include/uapi/linux/mvolatile.h
 create mode 100644 mm/mvolatile.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..6fa2087 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
+316	common	mvolatile		sys_mvolatile
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1b7414..a1f11da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
 					/* Used by sys_madvise() */
+#define VM_VOLATILE	0x00001000	/* VMA is volatile */
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
 #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
 
diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
new file mode 100644
index 0000000..973bb3b
--- /dev/null
+++ b/include/linux/mvolatile.h
@@ -0,0 +1,8 @@
+#ifndef _LINUX_MVOLATILE_H
+#define _LINUX_MVOLATILE_H
+
+#include <uapi/linux/mvolatile.h>
+
+#define MVOLATILE_VALID_FLAGS (0) /* Don't yet support any flags */
+
+#endif /* _LINUX_MVOLATILE_H */
diff --git a/include/uapi/linux/mvolatile.h b/include/uapi/linux/mvolatile.h
new file mode 100644
index 0000000..1e92f3f
--- /dev/null
+++ b/include/uapi/linux/mvolatile.h
@@ -0,0 +1,7 @@
+#ifndef _UAPI_LINUX_MVOLATILE_H
+#define _UAPI_LINUX_MVOLATILE_H
+
+#define MVOLATILE_NONVOLATILE 0
+#define MVOLATILE_VOLATILE 1
+
+#endif /* _UAPI_LINUX_MVOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..76a3444 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o balloon_compaction.o \
+			   compaction.o balloon_compaction.o mvolatile.o \
 			   interval_tree.o list_lru.o $(mmu-y)
 
 obj-y += init-mm.o
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
new file mode 100644
index 0000000..d4d2eed
--- /dev/null
+++ b/mm/mvolatile.c
@@ -0,0 +1,195 @@
+/*
+ * mm/mvolatile.c
+ *
+ * Copyright (C) 2014, LG Electronics, Minchan Kim <minchan@kernel.org>
+ * Copyright (C) 2014 Linaro Ltd., John Stultz <john.stultz@linaro.org>
+ */
+#include <linux/syscalls.h>
+#include <linux/mvolatile.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm_inline.h>
+#include "internal.h"
+
+
+/**
+ * do_mvolatile - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
+ * @mm: mm_struct we're working on
+ * @start: starting address of the volatile range
+ * @end: ending address of the volatile range
+ * @mode: the mode of the volatile range (volatile or non-volatile)
+ * @flags: any additional flags (ignored for now, as there are none)
+ * @purged: pointer to integer value that is set to 1 if any pages in a range
+ * being set non-volatile have been purged.
+ *
+ * Core logic of sys_mvolatile. Iterates over the VMAs in the specified
+ * range, and marks or clears them as VM_VOLATILE, splitting or merging them
+ * as needed.
+ *
+ * Returns the number of bytes successfully modified.
+ *
+ * Returns error only if no bytes were modified.
+ */
+static ssize_t do_mvolatile(struct mm_struct *mm, unsigned long start,
+				unsigned long end, unsigned long mode,
+				unsigned long flags, int *purged)
+{
+	struct vm_area_struct *vma, *prev;
+	unsigned long orig_start = start;
+	ssize_t count = 0, ret = 0;
+
+	down_write(&mm->mmap_sem);
+
+	vma = find_vma_prev(mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		unsigned long new_flags;
+		pgoff_t pgoff;
+		unsigned long tmp;
+
+		if (!vma)
+			goto out;
+
+		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+					VM_HUGETLB))
+			goto out;
+
+		/* We don't support volatility on files for now */
+		if (vma->vm_file) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		/* return ENOMEM if we're trying to mark unmapped pages */
+		if (start < vma->vm_start) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		new_flags = vma->vm_flags;
+
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		switch (mode) {
+		case MVOLATILE_VOLATILE:
+			new_flags |= VM_VOLATILE;
+			break;
+		case MVOLATILE_NONVOLATILE:
+			new_flags &= ~VM_VOLATILE;
+		}
+
+		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+		prev = vma_merge(mm, prev, start, tmp, new_flags,
+					vma->anon_vma, vma->vm_file, pgoff,
+					vma_policy(vma));
+		if (prev)
+			goto success;
+
+		if (start != vma->vm_start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				goto out;
+		}
+
+		if (tmp != vma->vm_end) {
+			ret = split_vma(mm, vma, tmp, 0);
+			if (ret)
+				goto out;
+		}
+
+		prev = vma;
+success:
+		vma->vm_flags = new_flags;
+
+		/* update count to distance covered so far */
+		count = tmp - orig_start;
+
+		start = tmp;
+		if (start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			goto out;
+		vma = prev->vm_next;
+	}
+out:
+	up_write(&mm->mmap_sem);
+
+	/* report bytes successfully marked, even if we're exiting on error */
+	if (count)
+		return count;
+
+	return ret;
+}
+
+
+/**
+ * sys_mvolatile - Marks specified range as volatile or non-volatile.
+ * @start: starting address of the range
+ * @len: size of the range being requested
+ * @mode: the mode of the range (volatile or non-volatile)
+ * @flags: any additional flags (ignored for now, as there are none)
+ * @purged: pointer to integer value that is set to 1 if any pages in a range
+ * being set non-volatile have been purged.
+ *
+ * Validates the syscall inputs and calls do_mvolatile(), then copies the
+ * purged flag back out to userspace.
+ *
+ * Returns the number of bytes successfully modified.
+ * Returns error only if no bytes were modified.
+ */
+SYSCALL_DEFINE5(mvolatile, unsigned long, start, size_t, len,
+				unsigned long, mode, unsigned long, flags,
+				int __user *, purged)
+{
+	unsigned long end;
+	struct mm_struct *mm = current->mm;
+	ssize_t ret = -EINVAL;
+	int p = 0;
+
+	if (flags & ~MVOLATILE_VALID_FLAGS)
+		goto out;
+
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	if (len & ~PAGE_MASK)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	if (start >= TASK_SIZE)
+		goto out;
+
+	if (purged) {
+		/* Test pointer is valid before making any changes */
+		if (put_user(p, purged))
+			return -EFAULT;
+	}
+
+	ret = do_mvolatile(mm, start, end, mode, flags, &p);
+
+	if (purged) {
+		if (put_user(p, purged)) {
+			/*
+			 * This would be bad, since we've modified volatility
+			 * and the change in purged state would be lost.
+			 * But the application is doing something dumb here,
+			 * so just return EFAULT and be ok with losing the
+			 * state.
+			 */
+			return -EFAULT;
+		}
+	}
+
+out:
+	return ret;
+}
-- 
1.8.3.2



* [PATCH 3/4] mvolatile: Add purged page detection on setting memory non-volatile
From: John Stultz @ 2014-04-11 20:15 UTC
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

Users of volatile ranges will need to know if memory was discarded.
This patch adds the purged-state tracking required to inform userland,
when it marks memory as non-volatile, that some memory in that range
was purged and needs to be regenerated.

This is a simplified implementation which uses some of the logic from
Minchan's earlier efforts, so credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/swap.h    |  5 +++
 include/linux/swapops.h | 10 ++++++
 mm/mvolatile.c          | 86 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a90ea95..c372ca7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -55,6 +55,7 @@ enum {
 	 * 1<<MAX_SWAPFILES_SHIFT), so to preserve the values insert
 	 * new entries here at the top of the enum, not at the bottom
 	 */
+	SWP_MVOLATILE_PURGED_NR,
 #ifdef CONFIG_MEMORY_FAILURE
 	SWP_HWPOISON_NR,
 #endif
@@ -81,6 +82,10 @@ enum {
 #define SWP_HWPOISON		(MAX_SWAPFILES + SWP_HWPOISON_NR)
 #endif
 
+/*
+ * Purged volatile range pages
+ */
+#define SWP_MVOLATILE_PURGED	(MAX_SWAPFILES + SWP_MVOLATILE_PURGED_NR)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index c0f7526..fe9c026 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -161,6 +161,16 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 
 #endif
 
+static inline swp_entry_t make_purged_entry(void)
+{
+	return swp_entry(SWP_MVOLATILE_PURGED, 0);
+}
+
+static inline int is_purged_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_MVOLATILE_PURGED;
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Support for hardware poisoned pages
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
index d4d2eed..38c8315 100644
--- a/mm/mvolatile.c
+++ b/mm/mvolatile.c
@@ -12,8 +12,91 @@
 #include <linux/hugetlb.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mm_inline.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include "internal.h"
 
+struct mvolatile_walker {
+	struct vm_area_struct *vma;
+	int page_was_purged;
+};
+
+
+/**
+ * mvolatile_check_purged_pte - Checks ptes for purged pages
+ * @pmd: pmd to walk
+ * @addr: starting address
+ * @end: end address
+ * @walk: mm_walk ptr (contains ptr to mvolatile_walker)
+ *
+ * Iterates over the ptes in the pmd checking if they have
+ * purged swap entries.
+ *
+ * Sets the mvolatile_walker.page_was_purged to 1 if any were purged.
+ */
+static int mvolatile_check_purged_pte(pmd_t *pmd, unsigned long addr,
+					unsigned long end, struct mm_walk *walk)
+{
+	struct mvolatile_walker *vw = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	if (pmd_trans_huge(*pmd))
+		return 0;
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		if (!pte_present(*pte)) {
+			swp_entry_t mvolatile_entry = pte_to_swp_entry(*pte);
+
+			if (unlikely(is_purged_entry(mvolatile_entry))) {
+
+				vw->page_was_purged = 1;
+
+				/* clear the pte swp entry */
+				flush_cache_page(vw->vma, addr, pte_pfn(*pte));
+				ptep_clear_flush(vw->vma, addr, pte);
+			}
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	return ret;
+}
+
+
+/**
+ * mvolatile_check_purged - Sets up a mm_walk to check for purged pages
+ * @vma: ptr to vma we're starting with
+ * @start: start address to walk
+ * @end: end address of walk
+ *
+ * Sets up and calls walk_page_range() to check for purged pages.
+ *
+ * Returns 1 if pages in the range were purged, 0 otherwise.
+ */
+static int mvolatile_check_purged(struct vm_area_struct *vma,
+					 unsigned long start,
+					 unsigned long end)
+{
+	struct mvolatile_walker vw;
+	struct mm_walk mvolatile_walk = {
+		.pmd_entry = mvolatile_check_purged_pte,
+		.mm = vma->vm_mm,
+		.private = &vw,
+	};
+	vw.page_was_purged = 0;
+	vw.vma = vma;
+
+	walk_page_range(start, end, &mvolatile_walk);
+
+	return vw.page_was_purged;
+
+}
 
 /**
  * do_mvolatile - Marks or clears VMAs in the range (start-end) as VM_VOLATILE
@@ -119,6 +202,9 @@ success:
 		vma = prev->vm_next;
 	}
 out:
+	if (count && (mode == MVOLATILE_NONVOLATILE))
+		*purged = mvolatile_check_purged(vma, orig_start,
+							orig_start+count);
 	up_write(&mm->mmap_sem);
 
 	/* report bytes successfully marked, even if we're exiting on error */
-- 
1.8.3.2



* [PATCH 4/4] mvolatile: Add page purging logic & SIGBUS trap
From: John Stultz @ 2014-04-11 20:15 UTC
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, Keith Packard, linux-mm

This patch adds the hooks in the vmscan logic to purge volatile pages
and mark their ptes as purged. With this, volatile pages will be purged
under pressure, and their pte swap entries marked. If the purged pages
are accessed before being marked non-volatile, we catch this and send a
SIGBUS.

This is a simplified implementation that uses logic from Minchan's earlier
efforts, so credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Keith Packard <keithp@keithp.com>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/mvolatile.h |   2 +
 mm/internal.h             |   2 -
 mm/memory.c               |   8 ++++
 mm/mvolatile.c            | 120 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                 |   5 ++
 mm/vmscan.c               |  12 +++++
 6 files changed, 147 insertions(+), 2 deletions(-)

diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
index 973bb3b..8cfe6e0 100644
--- a/include/linux/mvolatile.h
+++ b/include/linux/mvolatile.h
@@ -5,4 +5,6 @@
 
 #define MVOLATILE_VALID_FLAGS (0) /* Don't yet support any flags */
 
+extern int purge_volatile_page(struct page *page);
+
 #endif /* _LINUX_MVOLATILE_H */
diff --git a/mm/internal.h b/mm/internal.h
index 29e1e76..ea66bf9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -225,10 +225,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern unsigned long vma_address(struct page *page,
 				 struct vm_area_struct *vma);
-#endif
 #else /* !CONFIG_MMU */
 static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 22dfa61..9043e4c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -60,6 +60,7 @@
 #include <linux/migrate.h>
 #include <linux/string.h>
 #include <linux/dma-debug.h>
+#include <linux/mvolatile.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3643,6 +3644,8 @@ static int handle_pte_fault(struct mm_struct *mm,
 
 	entry = *pte;
 	if (!pte_present(entry)) {
+		swp_entry_t mvolatile_entry;
+
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (likely(vma->vm_ops->fault))
@@ -3652,6 +3655,11 @@ static int handle_pte_fault(struct mm_struct *mm,
 			return do_anonymous_page(mm, vma, address,
 						 pte, pmd, flags);
 		}
+
+		mvolatile_entry = pte_to_swp_entry(entry);
+		if (unlikely(is_purged_entry(mvolatile_entry)))
+			return VM_FAULT_SIGBUS;
+
 		if (pte_file(entry))
 			return do_nonlinear_fault(mm, vma, address,
 					pte, pmd, flags, entry);
diff --git a/mm/mvolatile.c b/mm/mvolatile.c
index 38c8315..16dccee 100644
--- a/mm/mvolatile.c
+++ b/mm/mvolatile.c
@@ -279,3 +279,123 @@ SYSCALL_DEFINE5(mvolatile, unsigned long, start, size_t, len,
 out:
 	return ret;
 }
+
+
+/**
+ * try_to_purge_one - Purge a volatile page from a vma
+ * @page: page to purge
+ * @vma: vma to purge page from
+ *
+ * Finds the pte for a page in a vma, marks the pte as purged
+ * and releases the page.
+ */
+static void try_to_purge_one(struct page *page, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *pte;
+	pte_t pteval;
+	spinlock_t *ptl;
+	unsigned long addr;
+
+	VM_BUG_ON(!PageLocked(page));
+
+	addr = vma_address(page, vma);
+	pte = page_check_address(page, mm, addr, &ptl, 0);
+	if (!pte)
+		return;
+
+	BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
+
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	pteval = ptep_clear_flush(vma, addr, pte);
+
+	update_hiwater_rss(mm);
+	if (PageAnon(page))
+		dec_mm_counter(mm, MM_ANONPAGES);
+	else
+		dec_mm_counter(mm, MM_FILEPAGES);
+
+	page_remove_rmap(page);
+	page_cache_release(page);
+
+	set_pte_at(mm, addr, pte, swp_entry_to_pte(make_purged_entry()));
+
+	pte_unmap_unlock(pte, ptl);
+	mmu_notifier_invalidate_page(mm, addr);
+
+}
+
+/**
+ * try_to_purge_vpage - check vma chain and purge from vmas marked volatile
+ * @page: page to purge
+ *
+ * Goes over all the vmas that hold a page and, where a vma is marked
+ * volatile, purges the page from that vma.
+ *
+ * Returns 0 on success, -1 on error.
+ */
+static int try_to_purge_vpage(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff;
+	int ret = 0;
+
+	anon_vma = page_lock_anon_vma_read(page);
+	if (!anon_vma)
+		return -1;
+
+	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	/*
+	 * While iterating over the loop, some processes could see a page as
+	 * purged while others could see it as not-purged, because we have no
+	 * global lock between parent and child protecting the mvolatile
+	 * system call during this loop. But that's not a problem, because the
+	 * page is not a *SHARED* page but a *COW* page, so parent and child
+	 * can see different data at any time. The worst case from this race
+	 * is that a page was purged but couldn't be discarded, causing an
+	 * unnecessary page fault, but that wouldn't be severe.
+	 */
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		struct vm_area_struct *vma = avc->vma;
+
+		if (!(vma->vm_flags & VM_VOLATILE)) {
+			ret = -1;
+			break;
+		}
+		try_to_purge_one(page, vma);
+	}
+	page_unlock_anon_vma_read(anon_vma);
+	return ret;
+}
+
+
+/**
+ * purge_volatile_page - If possible, purge the specified volatile page
+ * @page: page to purge
+ *
+ * Attempts to purge a volatile page, and if needed frees the swap page
+ *
+ * Returns 0 on success, -1 on error.
+ */
+int purge_volatile_page(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageLRU(page));
+
+	/* XXX - for now we only support anonymous volatile pages */
+	if (!PageAnon(page))
+		return -1;
+
+	if (!try_to_purge_vpage(page)) {
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
+
+		if (page_freeze_refs(page, 1)) {
+			unlock_page(page);
+			return 0;
+		}
+	}
+
+	return -1;
+}
diff --git a/mm/rmap.c b/mm/rmap.c
index 8fc049f..2c2aa7d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 				referenced++;
 		}
 		pte_unmap_unlock(pte, ptl);
+		if (vma->vm_flags & VM_VOLATILE) {
+			pra->mapcount = 0;
+			pra->vm_flags |= VM_VOLATILE;
+			return SWAP_FAIL;
+		}
 	}
 
 	if (referenced) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b4..0cbfbf6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -43,6 +43,7 @@
 #include <linux/sysctl.h>
 #include <linux/oom.h>
 #include <linux/prefetch.h>
+#include <linux/mvolatile.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -683,6 +684,7 @@ enum page_references {
 	PAGEREF_RECLAIM,
 	PAGEREF_RECLAIM_CLEAN,
 	PAGEREF_KEEP,
+	PAGEREF_PURGE,
 	PAGEREF_ACTIVATE,
 };
 
@@ -703,6 +705,13 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	/*
+	 * If a volatile page reaches the LRU's tail, we discard the
+	 * page without considering recycling it.
+	 */
+	if (vm_flags & VM_VOLATILE)
+		return PAGEREF_PURGE;
+
 	if (referenced_ptes) {
 		if (PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
@@ -930,6 +939,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		switch (references) {
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
+		case PAGEREF_PURGE:
+			if (!purge_volatile_page(page))
+				goto free_it;
 		case PAGEREF_KEEP:
 			goto keep_locked;
 		case PAGEREF_RECLAIM:
-- 
1.8.3.2



* Re: [PATCH 3/4] mvolatile: Add purged page detection on setting memory non-volatile
From: Minchan Kim @ 2014-04-14  2:37 UTC
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Fri, Apr 11, 2014 at 01:15:39PM -0700, John Stultz wrote:
> Users of volatile ranges will need to know if memory was discarded.
> This patch adds the purged-state tracking required to inform userland,
> when it marks memory as non-volatile, that some memory in that range
> was purged and needs to be regenerated.
>
> This is a simplified implementation which uses some of the logic from
> Minchan's earlier efforts, so credit to Minchan for his work.
> 
> [...]
> diff --git a/mm/mvolatile.c b/mm/mvolatile.c
> index d4d2eed..38c8315 100644
> --- a/mm/mvolatile.c
> +++ b/mm/mvolatile.c
> @@ -12,8 +12,91 @@
>  #include <linux/hugetlb.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/mm_inline.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  #include "internal.h"
>  
> +struct mvolatile_walker {
> +	struct vm_area_struct *vma;
> +	int page_was_purged;
> +};
> +
> +
> +/**
> + * mvolatile_check_purged_pte - Checks ptes for purged pages
> + * @pmd: pmd to walk
> + * @addr: starting address
> + * @end: end address
> + * @walk: mm_walk ptr (contains ptr to mvolatile_walker)
> + *
> + * Iterates over the ptes in the pmd checking if they have
> + * purged swap entries.
> + *
> + * Sets the mvolatile_walker.page_was_purged to 1 if any were purged.

Just a nitpick:

This function zaps ptes as well as checking for purged entries, so it
would be better to mention that, and why we do it, in the description.


> + */
> +static int mvolatile_check_purged_pte(pmd_t *pmd, unsigned long addr,
> +					unsigned long end, struct mm_walk *walk)
> +{
> +	struct mvolatile_walker *vw = walk->private;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +
> +	if (pmd_trans_huge(*pmd))
> +		return 0;
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		if (!pte_present(*pte)) {
> +			swp_entry_t mvolatile_entry = pte_to_swp_entry(*pte);
> +
> +			if (unlikely(is_purged_entry(mvolatile_entry))) {
> +
> +				vw->page_was_purged = 1;
> +
> +				/* clear the pte swp entry */
> +				flush_cache_page(vw->vma, addr, pte_pfn(*pte));
> +				ptep_clear_flush(vw->vma, addr, pte);
> +			}
> +		}
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +
> +	return ret;
> +}
> +
> +
> +/**
> + * mvolatile_check_purged - Sets up a mm_walk to check for purged pages
> + * @vma: ptr to vma we're starting with
> + * @start: start address to walk
> + * @end: end address of walk
> + *
> + * Sets up and calls walk_page_range() to check for purged pages.
> + *
> + * Returns 1 if pages in the range were purged, 0 otherwise.
> + */
> +static int mvolatile_check_purged(struct vm_area_struct *vma,
> +					 unsigned long start,
> +					 unsigned long end)
> +{
> +	struct mvolatile_walker vw;
> +	struct mm_walk mvolatile_walk = {
> +		.pmd_entry = mvolatile_check_purged_pte,
> +		.mm = vma->vm_mm,
> +		.private = &vw,
> +	};
> +	vw.page_was_purged = 0;
> +	vw.vma = vma;
> +
> +	walk_page_range(start, end, &mvolatile_walk);

mvolatile_check_purged_pte could zap *all* of the ptes, so we need to
invalidate the range for mmu_notifier if a purge happened.

Anyway, it could make the vrange syscall slow, and I really worry about
that. If we assume that most volatile objects spanning two or more
pages are purged altogether, just bailing out as soon as the first
purged pte is found would make the syscall really fast.

Any idea? It's not necessary to implement it right now, but if we don't
have a good idea (ie, if the syscall itself stays very slow), I'm not
sure how effectively useful it could be for userspace folks.
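
For reference, a rough sketch of that early-bail idea (hypothetical
and untested): stop the whole walk at the first purged entry, relying
on walk_page_range() aborting when a callback returns non-zero. This
only detects the purge; zapping the remaining purged ptes would still
have to happen somewhere else:

static int mvolatile_check_purged_pte(pmd_t *pmd, unsigned long addr,
					unsigned long end, struct mm_walk *walk)
{
	struct mvolatile_walker *vw = walk->private;
	pte_t *start_pte, *pte;
	spinlock_t *ptl;
	int ret = 0;

	if (pmd_trans_huge(*pmd) || pmd_trans_unstable(pmd))
		return 0;

	start_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
	for (; addr != end; pte++, addr += PAGE_SIZE) {
		if (!pte_present(*pte) &&
		    unlikely(is_purged_entry(pte_to_swp_entry(*pte)))) {
			vw->page_was_purged = 1;
			ret = 1;	/* non-zero aborts walk_page_range() */
			break;
		}
	}
	pte_unmap_unlock(start_pte, ptl);
	cond_resched();

	return ret;
}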


> +
> +	return vw.page_was_purged;
> +
> +}
>  
> [...]

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 4/4] mvolatile: Add page purging logic & SIGBUS trap
From: Minchan Kim @ 2014-04-14  2:51 UTC
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Fri, Apr 11, 2014 at 01:15:40PM -0700, John Stultz wrote:
> This patch adds the hooks in the vmscan logic to purge volatile pages
> and mark their ptes as purged. With this, volatile pages will be purged
> under pressure, and their pte swap entries marked. If the purged pages
> are accessed before being marked non-volatile, we catch this and send a
> SIGBUS.
> 
> This is a simplified implementation that uses logic from Minchan's earlier
> efforts, so credit to Minchan for his work.
> 
> [...]
> diff --git a/mm/memory.c b/mm/memory.c
> index 22dfa61..9043e4c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -60,6 +60,7 @@
>  #include <linux/migrate.h>
>  #include <linux/string.h>
>  #include <linux/dma-debug.h>
> +#include <linux/mvolatile.h>
>  
>  #include <asm/io.h>
>  #include <asm/pgalloc.h>
> @@ -3643,6 +3644,8 @@ static int handle_pte_fault(struct mm_struct *mm,
>  
>  	entry = *pte;
>  	if (!pte_present(entry)) {
> +		swp_entry_t mvolatile_entry;
> +
>  		if (pte_none(entry)) {
>  			if (vma->vm_ops) {
>  				if (likely(vma->vm_ops->fault))
> @@ -3652,6 +3655,11 @@ static int handle_pte_fault(struct mm_struct *mm,
>  			return do_anonymous_page(mm, vma, address,
>  						 pte, pmd, flags);
>  		}
> +
> +		mvolatile_entry = pte_to_swp_entry(entry);
> +		if (unlikely(is_purged_entry(mvolatile_entry)))
> +			return VM_FAULT_SIGBUS;
> +

There is no pte lock here, so is_purged_entry() isn't safe; if a race
happens, do_swap_page() could have a problem, so it would be better to
handle this in do_swap_page() with the pte lock held, because we used a
swp_pte to indicate a purged pte.

I tried to solve it while we were in Napa (you may remember I sent a
crap patchset to you privately but failed to fix it, and I still
haven't gotten time to fix it :( ), but I'd like to inform you of
this problem.


>  		if (pte_file(entry))
>  			return do_nonlinear_fault(mm, vma, address,
>  					pte, pmd, flags, entry);
> [...]

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 4/4] mvolatile: Add page purging logic & SIGBUS trap
From: John Stultz @ 2014-04-16 18:43 UTC
  To: Minchan Kim
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Keith Packard, linux-mm

On Sun, Apr 13, 2014 at 7:51 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Fri, Apr 11, 2014 at 01:15:40PM -0700, John Stultz wrote:
>> @@ -3643,6 +3644,8 @@ static int handle_pte_fault(struct mm_struct *mm,
>>
>>       entry = *pte;
>>       if (!pte_present(entry)) {
>> +             swp_entry_t mvolatile_entry;
>> +
>>               if (pte_none(entry)) {
>>                       if (vma->vm_ops) {
>>                               if (likely(vma->vm_ops->fault))
>> @@ -3652,6 +3655,11 @@ static int handle_pte_fault(struct mm_struct *mm,
>>                       return do_anonymous_page(mm, vma, address,
>>                                                pte, pmd, flags);
>>               }
>> +
>> +             mvolatile_entry = pte_to_swp_entry(entry);
>> +             if (unlikely(is_purged_entry(mvolatile_entry)))
>> +                     return VM_FAULT_SIGBUS;
>> +
>
> There is no pte lock here, so is_purged_entry() isn't safe; if a race
> happens, do_swap_page() could have a problem, so it would be better to
> handle this in do_swap_page() with the pte lock held, because we used a
> swp_pte to indicate a purged pte.
>
> I tried to solve it while we were in Napa (you may remember I sent a
> crap patchset to you privately but failed to fix it, and I still
> haven't gotten time to fix it :( ), but I'd like to inform you of
> this problem.

Thanks for the review and the reminder! I'll move the check appropriately.

thanks
-john

