linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] Volatile Ranges (v11)
@ 2014-03-14 18:33 John Stultz
  2014-03-14 18:33 ` [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas John Stultz
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: John Stultz @ 2014-03-14 18:33 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

I recently got a chance to try to implement Johannes' suggested approach
so I wanted to send it out for comments. It looks like Minchan has also
done the same, but from a different direction, focusing on the MADV_FREE
use cases. I think both approaches are valid, so I wouldn't consider
these patches to be in conflict. It's just that earlier iterations of the
volatile range patches had tried to handle numerous different use cases,
and the resulting complexity was apparently making it difficult to review
and get interest in the patch set. So basically we're splitting the use
cases up and trying to find simpler solutions for each.

I'd greatly appreciate any feedback or thoughts!

thanks
-john


Volatile ranges provide a method for userland to inform the kernel that
a range of memory is safe to discard (ie: can be regenerated), but
userspace may want to try to access it in the future.  It can be thought of
as similar to MADV_DONTNEED, except that the actual freeing of the memory
is delayed and only done under memory pressure, and the user can try to
cancel the action and be able to quickly access any unpurged pages. The
idea originated from Android's ashmem, but I've since learned that other
OSes provide similar functionality.

This functionality allows for a number of interesting uses:
* Userland caches that have kernel-triggered eviction under memory
pressure. This allows the kernel to "rightsize" userspace caches for the
current system-wide workload. Things like image bitmap caches, or
rendered HTML in a hidden browser tab, where the data is not visible and
can be regenerated if needed, are good examples.

* Opportunistic freeing of memory that may be quickly reused. Minchan
has done a malloc implementation where free() marks the pages as
volatile, allowing the kernel to reclaim under pressure. This avoids the
unmapping and remapping of anonymous pages on free/malloc. So if
userland wants to malloc memory quickly after the free, it just needs to
mark the pages as non-volatile, and only purged pages will have to be
faulted back in.

There are two basic ways this can be used:

Explicit marking method:
1) Userland marks a range of memory that can be regenerated if necessary
as volatile
2) Before accessing the memory again, userland marks the memory as
nonvolatile, and the kernel will provide notification if any pages in the
range have been purged (see the sketch below).
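
A minimal userspace sketch of this explicit method (not part of the
patches themselves): the raw-syscall wrapper, the cache buffer and the
regenerate_cache() helper are illustrative assumptions; the syscall
number (316) and the VRANGE_* values come from patch 1/3 and are
x86_64-only at this point. Short returns and errno handling are omitted
for brevity.

#include <unistd.h>
#include <sys/syscall.h>

#define VRANGE_NONVOLATILE	0	/* values from include/linux/vrange.h in patch 1/3 */
#define VRANGE_VOLATILE		1
#define __NR_vrange		316	/* x86_64 syscall number added in patch 1/3 */

/* No libc wrapper exists yet, so invoke the syscall directly */
static long vrange(void *start, size_t len, int mode, int *purged)
{
	return syscall(__NR_vrange, (unsigned long)start, len, mode, purged);
}

/* Hypothetical helper that rebuilds the cache contents */
extern void regenerate_cache(void *cache, size_t len);

void cache_idle(void *cache, size_t len)
{
	int purged;

	/* Step 1: the contents can be regenerated, so let the kernel
	 * discard them under memory pressure. */
	vrange(cache, len, VRANGE_VOLATILE, &purged);
}

void cache_reuse(void *cache, size_t len)
{
	int purged = 0;

	/* Step 2: make the range safe to touch again and learn whether
	 * anything was discarded while it was volatile. */
	vrange(cache, len, VRANGE_NONVOLATILE, &purged);
	if (purged)
		regenerate_cache(cache, len);
}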

Optimistic method:
1) Userland marks a large range of data as volatile
2) Userland continues to access the data as it needs.
3) If userland accesses a page that has been purged, the kernel will
send a SIGBUS
4) Userspace can trap the SIGBUS, mark the affected pages as
non-volatile, and refill the data as needed before continuing on (see the
sketch below)
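
A rough sketch of the optimistic method (again illustrative, not part of
the patches): it reuses the vrange() wrapper sketched above, and
refill_page() plus the hard-coded page size are assumptions. Returning
from the handler lets the faulting access retry once the page is
non-volatile and refilled.

#include <signal.h>
#include <string.h>
#include <stdint.h>

#define VRANGE_NONVOLATILE	0	/* from include/linux/vrange.h in patch 1/3 */
#define PAGE_SIZE		4096UL	/* assumed; use sysconf(_SC_PAGESIZE) in real code */

extern long vrange(void *start, size_t len, int mode, int *purged);	/* wrapper above */
extern void refill_page(void *page);	/* hypothetical regeneration helper */

static void sigbus_handler(int sig, siginfo_t *si, void *uctx)
{
	void *page = (void *)((uintptr_t)si->si_addr & ~(PAGE_SIZE - 1));
	int purged;

	/* Steps 3/4: a purged volatile page was touched. Make it
	 * non-volatile, regenerate its contents, then return so the
	 * interrupted access is retried. */
	vrange(page, PAGE_SIZE, VRANGE_NONVOLATILE, &purged);
	refill_page(page);
}

void install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);
}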

You can read more about the history of volatile ranges here:
http://permalink.gmane.org/gmane.linux.kernel.mm/98848
http://permalink.gmane.org/gmane.linux.kernel.mm/98676
https://lwn.net/Articles/522135/
https://lwn.net/Kernel/Index/#Volatile_ranges


This version of the patchset, at Johannes Weiner's suggestion, is much
reduced in scope compared to earlier attempts. I've only handled
volatility on anonymous memory, and we're storing the volatility in
the VMA.  This may have performance implications compared with the earlier
approach, but it does simplify the implementation.

Further, the page discarding happens via normal vmscanning. Since
anonymous pages are not aged on swapless systems, this means we'll only
purge pages when swap is enabled. I'll be looking at enabling anonymous aging
when swap is disabled to resolve this, but I wanted to get this out for
initial comment.

Additionally, since we don't handle volatility on tmpfs files with this
version of the patch, it cannot be used to implement semantics
similar to Android's ashmem. But since shared volatility on files is
more complex, my hope is to start small and grow from there.

Also, much of the logic in this patchset is based on Minchan's earlier
efforts. On this iteration, I've not been in close collaboration with him,
so I don't want to mis-attribute my rework of the code as his design,
but I do want to make sure the credit goes to him for his major contribution.


Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Dhaval Giani <dgiani@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>


John Stultz (3):
  vrange: Add vrange syscall and handle splitting/merging and marking
    vmas
  vrange: Add purged page detection on setting memory non-volatile
  vrange: Add page purging logic & SIGBUS trap

 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/mm.h               |   1 +
 include/linux/swap.h             |  15 +-
 include/linux/vrange.h           |  22 +++
 mm/Makefile                      |   2 +-
 mm/internal.h                    |   2 -
 mm/memory.c                      |  21 +++
 mm/rmap.c                        |   5 +
 mm/vmscan.c                      |  12 ++
 mm/vrange.c                      | 306 +++++++++++++++++++++++++++++++++++++++
 10 files changed, 382 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/vrange.h
 create mode 100644 mm/vrange.c

-- 
1.8.3.2



* [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas
  2014-03-14 18:33 [PATCH 0/3] Volatile Ranges (v11) John Stultz
@ 2014-03-14 18:33 ` John Stultz
  2014-03-17  9:21   ` Jan Kara
  2014-03-14 18:33 ` [PATCH 2/3] vrange: Add purged page detection on setting memory non-volatile John Stultz
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: John Stultz @ 2014-03-14 18:33 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

This patch introduces the vrange() syscall, which allows for specifying
ranges of memory as volatile, and able to be discarded by the system.

This initial patch simply adds the syscall, and the vma handling,
splitting and merging the vmas as needed, and marking them with
VM_VOLATILE.

No purging or discarding of volatile ranges is done at this point.

Example man page:

NAME
	vrange - Mark or unmark range of memory as volatile

SYNOPSIS
	int vrange(unsigned long start, size_t length, int mode,
			 int *purged);

DESCRIPTION
	Applications can use vrange(2) to advise the kernel how it should
	handle paging I/O in this VM area.  The idea is to help the kernel
	discard pages of vrange instead of reclaiming when memory pressure
	happens. It means kernel doesn't discard any pages of vrange if
	there is no memory pressure.

	mode:
	VRANGE_VOLATILE
		hint to kernel so VM can discard in vrange pages when
		memory pressure happens.
	VRANGE_NONVOLATILE
		hint to kernel so VM doesn't discard vrange pages
		any more.

	If user try to access purged memory without VRANGE_NONVOLATILE call,
	he can encounter SIGBUS if the page was discarded by kernel.

	purged: Pointer to an integer which will return 1 if
	mode == VRANGE_NONVOLATILE and any page in the affected range
	was purged. If purged returns zero during a mode ==
	VRANGE_NONVOLATILE call, it means all of the pages in the range
	are intact.

RETURN VALUE
	On success vrange returns the number of bytes marked or unmarked.
	Similar to write(), it may return fewer bytes than specified
	if it ran into a problem.

	If an error is returned, no changes were made.

ERRORS
	EINVAL This error can occur for the following reasons:
		* The value length is negative or not page size units.
		* addr is not page-aligned
		* mode not a valid value.

	ENOMEM Not enough memory

	EFAULT purged pointer is invalid
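
For illustration only (not part of this patch): a sketch of how an
application might drive this interface, handling both a short return and
the purged flag. The vrange() prototype below assumes a thin userspace
wrapper around the raw syscall, and the make_nonvolatile() helper is
hypothetical.

#include <sys/types.h>
#include <stddef.h>

#define VRANGE_NONVOLATILE 0

extern ssize_t vrange(unsigned long start, size_t len, int mode, int *purged);

/* Returns 1 if any page made non-volatile had been purged, 0 otherwise */
static int make_nonvolatile(unsigned long start, size_t len)
{
	int purged = 0, p = 0;
	size_t done = 0;

	while (done < len) {
		ssize_t ret = vrange(start + done, len - done,
				     VRANGE_NONVOLATILE, &p);
		if (p)
			purged = 1;	/* remember a purge seen in any call */
		if (ret <= 0)
			break;		/* error: pages past 'done' are unchanged */
		done += (size_t)ret;
	}
	return purged;
}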

This is a simplified implementation which reuses some of the logic
from Minchan's earlier efforts. So credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Dhaval Giani <dgiani@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/mm.h               |   1 +
 include/linux/vrange.h           |   7 ++
 mm/Makefile                      |   2 +-
 mm/vrange.c                      | 150 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 160 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/vrange.h
 create mode 100644 mm/vrange.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..7ae3940 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
+316	common	vrange			sys_vrange
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1b7414..a1f11da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
 					/* Used by sys_madvise() */
+#define VM_VOLATILE	0x00001000	/* VMA is volatile */
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
 #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
 
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
new file mode 100644
index 0000000..652396b
--- /dev/null
+++ b/include/linux/vrange.h
@@ -0,0 +1,7 @@
+#ifndef _LINUX_VRANGE_H
+#define _LINUX_VRANGE_H
+
+#define VRANGE_NONVOLATILE 0
+#define VRANGE_VOLATILE 1
+
+#endif /* _LINUX_VRANGE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..20229e2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o balloon_compaction.o \
+			   compaction.o balloon_compaction.o vrange.o \
 			   interval_tree.o list_lru.o $(mmu-y)
 
 obj-y += init-mm.o
diff --git a/mm/vrange.c b/mm/vrange.c
new file mode 100644
index 0000000..acb4356
--- /dev/null
+++ b/mm/vrange.c
@@ -0,0 +1,150 @@
+#include <linux/syscalls.h>
+#include <linux/vrange.h>
+#include <linux/mm_inline.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mm_inline.h>
+#include "internal.h"
+
+static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
+				unsigned long end, int mode, int *purged)
+{
+	struct vm_area_struct *vma, *prev;
+	unsigned long orig_start = start;
+	ssize_t count = 0, ret = 0;
+	int lpurged = 0;
+
+	down_read(&mm->mmap_sem);
+
+	vma = find_vma_prev(mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		unsigned long new_flags;
+		pgoff_t pgoff;
+		unsigned long tmp;
+
+		if (!vma)
+			goto out;
+
+		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+					VM_HUGETLB))
+			goto out;
+
+		/* We don't support volatility on files for now */
+		if (vma->vm_file) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		new_flags = vma->vm_flags;
+
+		if (start < vma->vm_start) {
+			start = vma->vm_start;
+			if (start >= end)
+				goto out;
+		}
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		switch (mode) {
+		case VRANGE_VOLATILE:
+			new_flags |= VM_VOLATILE;
+			break;
+		case VRANGE_NONVOLATILE:
+			new_flags &= ~VM_VOLATILE;
+		}
+
+		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+		prev = vma_merge(mm, prev, start, tmp, new_flags,
+					vma->anon_vma, vma->vm_file, pgoff,
+					vma_policy(vma));
+		if (prev)
+			goto success;
+
+		if (start != vma->vm_start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				goto out;
+		}
+
+		if (tmp != vma->vm_end) {
+			ret = split_vma(mm, vma, tmp, 0);
+			if (ret)
+				goto out;
+		}
+
+		prev = vma;
+success:
+		vma->vm_flags = new_flags;
+		*purged = lpurged;
+
+		/* update count to distance covered so far*/
+		count = tmp - orig_start;
+
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			goto out;
+		if (prev)
+			vma = prev->vm_next;
+		else	/* madvise_remove dropped mmap_sem */
+			vma = find_vma(mm, start);
+	}
+out:
+	up_read(&mm->mmap_sem);
+
+	/* report bytes successfully marked, even if we're exiting on error */
+	if (count)
+		return count;
+
+	return ret;
+}
+
+SYSCALL_DEFINE4(vrange, unsigned long, start,
+		size_t, len, int, mode, int __user *, purged)
+{
+	unsigned long end;
+	struct mm_struct *mm = current->mm;
+	ssize_t ret = -EINVAL;
+	int p = 0;
+
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	if (start >= TASK_SIZE)
+		goto out;
+
+	if (purged) {
+		/* Test pointer is valid before making any changes */
+		if (put_user(p, purged))
+			return -EFAULT;
+	}
+
+	ret = do_vrange(mm, start, end, mode, &p);
+
+	if (purged) {
+		if (put_user(p, purged)) {
+			/*
+			 * This would be bad, since we've modified volatility
+			 * and the change in purged state would be lost.
+			 */
+			WARN_ONCE(1, "vrange: purge state possibly lost\n");
+		}
+	}
+
+out:
+	return ret;
+}
-- 
1.8.3.2



* [PATCH 2/3] vrange: Add purged page detection on setting memory non-volatile
  2014-03-14 18:33 [PATCH 0/3] Volatile Ranges (v11) John Stultz
  2014-03-14 18:33 ` [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas John Stultz
@ 2014-03-14 18:33 ` John Stultz
  2014-03-17  9:39   ` Jan Kara
  2014-03-14 18:33 ` [PATCH 3/3] vrange: Add page purging logic & SIGBUS trap John Stultz
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: John Stultz @ 2014-03-14 18:33 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

Users of volatile ranges will need to know if memory was discarded.
This patch adds the purged state tracking required to inform userland
when it marks memory as non-volatile that some memory in that range
was purged and needs to be regenerated.

This is a simplified implementation which uses some of the logic from
Minchan's earlier efforts, so credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Dhaval Giani <dgiani@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/swap.h   | 15 +++++++++++--
 include/linux/vrange.h | 13 ++++++++++++
 mm/vrange.c            | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6..18c12f9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -70,8 +70,19 @@ static inline int current_is_kswapd(void)
 #define SWP_HWPOISON_NUM 0
 #endif
 
-#define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+
+/*
+ * Purged volatile range pages
+ */
+#define SWP_VRANGE_PURGED_NUM 1
+#define SWP_VRANGE_PURGED (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
+
+
+#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)	\
+				- SWP_MIGRATION_NUM	\
+				- SWP_HWPOISON_NUM	\
+				- SWP_VRANGE_PURGED_NUM	\
+			)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 652396b..c4a1616 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -1,7 +1,20 @@
 #ifndef _LINUX_VRANGE_H
 #define _LINUX_VRANGE_H
 
+#include <linux/swap.h>
+#include <linux/swapops.h>
+
 #define VRANGE_NONVOLATILE 0
 #define VRANGE_VOLATILE 1
 
+static inline swp_entry_t swp_entry_mk_vrange_purged(void)
+{
+	return swp_entry(SWP_VRANGE_PURGED, 0);
+}
+
+static inline int entry_is_vrange_purged(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_VRANGE_PURGED;
+}
+
 #endif /* _LINUX_VRANGE_H */
diff --git a/mm/vrange.c b/mm/vrange.c
index acb4356..844571b 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -8,6 +8,60 @@
 #include <linux/mm_inline.h>
 #include "internal.h"
 
+struct vrange_walker {
+	struct vm_area_struct *vma;
+	int pages_purged;
+};
+
+static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+				struct mm_walk *walk)
+{
+	struct vrange_walker *vw = walk->private;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	if (pmd_trans_huge(*pmd))
+		return 0;
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		if (!pte_present(*pte)) {
+			swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
+
+			if (unlikely(entry_is_vrange_purged(vrange_entry))) {
+				vw->pages_purged = 1;
+				break;
+			}
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static unsigned long vrange_check_purged(struct mm_struct *mm,
+					 struct vm_area_struct *vma,
+					 unsigned long start,
+					 unsigned long end)
+{
+	struct vrange_walker vw;
+	struct mm_walk vrange_walk = {
+		.pmd_entry = vrange_pte_range,
+		.mm = vma->vm_mm,
+		.private = &vw,
+	};
+	vw.pages_purged = 0;
+	vw.vma = vma;
+
+	walk_page_range(start, end, &vrange_walk);
+
+	return vw.pages_purged;
+
+}
+
 static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
 				unsigned long end, int mode, int *purged)
 {
@@ -57,6 +111,9 @@ static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
 			break;
 		case VRANGE_NONVOLATILE:
 			new_flags &= ~VM_VOLATILE;
+			lpurged |= vrange_check_purged(mm, vma,
+							vma->vm_start,
+							vma->vm_end);
 		}
 
 		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
-- 
1.8.3.2



* [PATCH 3/3] vrange: Add page purging logic & SIGBUS trap
  2014-03-14 18:33 [PATCH 0/3] Volatile Ranges (v11) John Stultz
  2014-03-14 18:33 ` [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas John Stultz
  2014-03-14 18:33 ` [PATCH 2/3] vrange: Add purged page detection on setting memory non-volatile John Stultz
@ 2014-03-14 18:33 ` John Stultz
  2014-03-18 12:24 ` [PATCH 0/3] Volatile Ranges (v11) Michal Hocko
  2014-03-18 15:11 ` Minchan Kim
  4 siblings, 0 replies; 21+ messages in thread
From: John Stultz @ 2014-03-14 18:33 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

Finally, this patch adds the hooks in the vmscan logic to discard volatile
pages and mark their pte as purged. With this, volatile pages will be
purged under pressure, and their ptes marked with the purged swap entry. If the
purged pages are accessed before being marked non-volatile, we catch this
and send a SIGBUS.

This is a simplified implementation that uses logic from Minchan's earlier
efforts, so credit to Minchan for his work.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Dhaval Giani <dgiani@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/vrange.h |  2 ++
 mm/internal.h          |  2 --
 mm/memory.c            | 21 +++++++++++
 mm/rmap.c              |  5 +++
 mm/vmscan.c            | 12 +++++++
 mm/vrange.c            | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 137 insertions(+), 2 deletions(-)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index c4a1616..b18551f 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -7,6 +7,8 @@
 #define VRANGE_NONVOLATILE 0
 #define VRANGE_VOLATILE 1
 
+extern int discard_vpage(struct page *page);
+
 static inline swp_entry_t swp_entry_mk_vrange_purged(void)
 {
 	return swp_entry(SWP_VRANGE_PURGED, 0);
diff --git a/mm/internal.h b/mm/internal.h
index 29e1e76..ea66bf9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -225,10 +225,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern unsigned long vma_address(struct page *page,
 				 struct vm_area_struct *vma);
-#endif
 #else /* !CONFIG_MMU */
 static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 22dfa61..7ea9712 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -60,6 +60,7 @@
 #include <linux/migrate.h>
 #include <linux/string.h>
 #include <linux/dma-debug.h>
+#include <linux/vrange.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3643,6 +3644,8 @@ static int handle_pte_fault(struct mm_struct *mm,
 
 	entry = *pte;
 	if (!pte_present(entry)) {
+		swp_entry_t vrange_entry;
+retry:
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (likely(vma->vm_ops->fault))
@@ -3652,6 +3655,24 @@ static int handle_pte_fault(struct mm_struct *mm,
 			return do_anonymous_page(mm, vma, address,
 						 pte, pmd, flags);
 		}
+
+		vrange_entry = pte_to_swp_entry(entry);
+		if (unlikely(entry_is_vrange_purged(vrange_entry))) {
+			if (vma->vm_flags & VM_VOLATILE)
+				return VM_FAULT_SIGBUS;
+
+			/* zap pte */
+			ptl = pte_lockptr(mm, pmd);
+			spin_lock(ptl);
+			if (unlikely(!pte_same(*pte, entry)))
+				goto unlock;
+			flush_cache_page(vma, address, pte_pfn(*pte));
+			ptep_clear_flush(vma, address, pte);
+			pte_unmap_unlock(pte, ptl);
+			goto retry;
+		}
+
+
 		if (pte_file(entry))
 			return do_nonlinear_fault(mm, vma, address,
 					pte, pmd, flags, entry);
diff --git a/mm/rmap.c b/mm/rmap.c
index d9d4231..2b6f079 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 				referenced++;
 		}
 		pte_unmap_unlock(pte, ptl);
+		if (vma->vm_flags & VM_VOLATILE) {
+			pra->mapcount = 0;
+			pra->vm_flags |= VM_VOLATILE;
+			return SWAP_FAIL;
+		}
 	}
 
 	if (referenced) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b4..c5c0ee0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -43,6 +43,7 @@
 #include <linux/sysctl.h>
 #include <linux/oom.h>
 #include <linux/prefetch.h>
+#include <linux/vrange.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -683,6 +684,7 @@ enum page_references {
 	PAGEREF_RECLAIM,
 	PAGEREF_RECLAIM_CLEAN,
 	PAGEREF_KEEP,
+	PAGEREF_DISCARD,
 	PAGEREF_ACTIVATE,
 };
 
@@ -703,6 +705,13 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	/*
+	 * If a volatile page reaches the LRU's tail, we discard the
+	 * page without considering whether to reactivate it.
+	 */
+	if (vm_flags & VM_VOLATILE)
+		return PAGEREF_DISCARD;
+
 	if (referenced_ptes) {
 		if (PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
@@ -930,6 +939,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		switch (references) {
 		case PAGEREF_ACTIVATE:
 			goto activate_locked;
+		case PAGEREF_DISCARD:
+			if (may_enter_fs && discard_vpage(page) == 0)
+				goto free_it;
 		case PAGEREF_KEEP:
 			goto keep_locked;
 		case PAGEREF_RECLAIM:
diff --git a/mm/vrange.c b/mm/vrange.c
index 844571b..fc9906f 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -205,3 +205,100 @@ SYSCALL_DEFINE4(vrange, unsigned long, start,
 out:
 	return ret;
 }
+
+static void try_to_discard_one(struct page *page, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *pte;
+	pte_t pteval;
+	spinlock_t *ptl;
+	unsigned long addr;
+
+	VM_BUG_ON(!PageLocked(page));
+
+	addr = vma_address(page, vma);
+	pte = page_check_address(page, mm, addr, &ptl, 0);
+	if (!pte)
+		return;
+
+	BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
+
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	pteval = ptep_clear_flush(vma, addr, pte);
+
+	update_hiwater_rss(mm);
+	if (PageAnon(page))
+		dec_mm_counter(mm, MM_ANONPAGES);
+	else
+		dec_mm_counter(mm, MM_FILEPAGES);
+
+	page_remove_rmap(page);
+	page_cache_release(page);
+
+	set_pte_at(mm, addr, pte,
+				swp_entry_to_pte(swp_entry_mk_vrange_purged()));
+
+	pte_unmap_unlock(pte, ptl);
+	mmu_notifier_invalidate_page(mm, addr);
+
+}
+
+
+static int try_to_discard_anon_vpage(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff;
+
+	anon_vma = page_lock_anon_vma_read(page);
+	if (!anon_vma)
+		return -1;
+
+	pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	/*
+	 * While iterating over this loop, some processes could see a page as
+	 * purged while others see it as not-purged, because there is no
+	 * global lock between parent and child protecting the vrange system
+	 * call during the walk. That's not a problem, though: the page is not
+	 * a *SHARED* page but a *COW* page, so parent and child may see
+	 * different data at any time anyway. The worst case of this race is
+	 * that a page was purged but couldn't be discarded, which causes an
+	 * unnecessary page fault but isn't severe.
+	 */
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		struct vm_area_struct *vma = avc->vma;
+
+		if (!(vma->vm_flags & VM_VOLATILE))
+			continue;
+		try_to_discard_one(page, vma);
+	}
+	page_unlock_anon_vma_read(anon_vma);
+	return 0;
+}
+
+
+static int try_to_discard_vpage(struct page *page)
+{
+	if (PageAnon(page))
+		return try_to_discard_anon_vpage(page);
+	return -1;
+}
+
+
+int discard_vpage(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageLRU(page));
+
+	if (!try_to_discard_vpage(page)) {
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
+
+		if (page_freeze_refs(page, 1)) {
+			unlock_page(page);
+			return 0;
+		}
+	}
+
+	return 1;
+}
-- 
1.8.3.2



* Re: [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas
  2014-03-14 18:33 ` [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas John Stultz
@ 2014-03-17  9:21   ` Jan Kara
  2014-03-17  9:43     ` Jan Kara
  2014-03-17 22:19     ` John Stultz
  0 siblings, 2 replies; 21+ messages in thread
From: Jan Kara @ 2014-03-17  9:21 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Fri 14-03-14 11:33:31, John Stultz wrote:
> This patch introduces the vrange() syscall, which allows for specifying
> ranges of memory as volatile, and able to be discarded by the system.
> 
> This initial patch simply adds the syscall, and the vma handling,
> splitting and merging the vmas as needed, and marking them with
> VM_VOLATILE.
> 
> No purging or discarding of volatile ranges is done at this point.
> 
> Example man page:
> 
> NAME
> 	vrange - Mark or unmark range of memory as volatile
> 
> SYNOPSIS
> 	int vrange(unsigned long start, size_t length, int mode,
> 			 int *purged);
> 
> DESCRIPTION
> 	Applications can use vrange(2) to advise the kernel how it should
> 	handle paging I/O in this VM area.  The idea is to help the kernel
> 	discard pages of vrange instead of reclaiming when memory pressure
> 	happens. It means kernel doesn't discard any pages of vrange if
> 	there is no memory pressure.
  I'd say that the advantage is kernel doesn't have to swap volatile pages,
it can just directly discard them on memory pressure. You should also
mention somewhere vrange() is currently supported only for anonymous pages.
So maybe we can have the description like:
Applications can use vrange(2) to advise kernel that pages of anonymous
mapping in the given VM area can be reclaimed without swapping (or can no
longer be reclaimed without swapping). The idea is that application can
help kernel with page reclaim under memory pressure by specifying data
it can easily regenerate and thus kernel can discard the data if needed.

> 	mode:
> 	VRANGE_VOLATILE
> 		hint to kernel so VM can discard in vrange pages when
> 		memory pressure happens.
> 	VRANGE_NONVOLATILE
> 		hint to kernel so VM doesn't discard vrange pages
> 		any more.
> 
> 	If user try to access purged memory without VRANGE_NONVOLATILE call,
                ^^^ tries

> 	he can encounter SIGBUS if the page was discarded by kernel.
> 
> 	purged: Pointer to an integer which will return 1 if
> 	mode == VRANGE_NONVOLATILE and any page in the affected range
> 	was purged. If purged returns zero during a mode ==
> 	VRANGE_NONVOLATILE call, it means all of the pages in the range
> 	are intact.
> 
> RETURN VALUE
> 	On success vrange returns the number of bytes marked or unmarked.
> 	Similar to write(), it may return fewer bytes than specified
> 	if it ran into a problem.
  I believe you may need to better explain what the 'purged' argument is good
for. Because in my naive understanding *purged == 1 iff return value !=
length.  I recall your discussion with Johannes about error conditions and
the need to return error but also the state of the range, is that right?
But that should be really explained somewhere so that poor application
programmer is aware of those corner cases as well.

> 
> 	If an error is returned, no changes were made.
> 
> ERRORS
> 	EINVAL This error can occur for the following reasons:
> 		* The value length is negative or not page size units.
> 		* addr is not page-aligned
> 		* mode not a valid value.
> 
> 	ENOMEM Not enough memory
> 
> 	EFAULT purged pointer is invalid
> 
> This is a simplified implementation which reuses some of the logic
> from Minchan's earlier efforts. So credit to Minchan for his work.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Dhaval Giani <dgiani@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
  Some minor comments in the patch below...

> ---
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/mm.h               |   1 +
>  include/linux/vrange.h           |   7 ++
>  mm/Makefile                      |   2 +-
>  mm/vrange.c                      | 150 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 160 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/vrange.h
>  create mode 100644 mm/vrange.c
> 
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index a12bddc..7ae3940 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -322,6 +322,7 @@
>  313	common	finit_module		sys_finit_module
>  314	common	sched_setattr		sys_sched_setattr
>  315	common	sched_getattr		sys_sched_getattr
> +316	common	vrange			sys_vrange
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c1b7414..a1f11da 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
>  
>  					/* Used by sys_madvise() */
> +#define VM_VOLATILE	0x00001000	/* VMA is volatile */
>  #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
>  #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
>  
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> new file mode 100644
> index 0000000..652396b
> --- /dev/null
> +++ b/include/linux/vrange.h
> @@ -0,0 +1,7 @@
> +#ifndef _LINUX_VRANGE_H
> +#define _LINUX_VRANGE_H
> +
> +#define VRANGE_NONVOLATILE 0
> +#define VRANGE_VOLATILE 1
> +
> +#endif /* _LINUX_VRANGE_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 310c90a..20229e2 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   readahead.o swap.o truncate.o vmscan.o shmem.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
> -			   compaction.o balloon_compaction.o \
> +			   compaction.o balloon_compaction.o vrange.o \
>  			   interval_tree.o list_lru.o $(mmu-y)
>  
>  obj-y += init-mm.o
> diff --git a/mm/vrange.c b/mm/vrange.c
> new file mode 100644
> index 0000000..acb4356
> --- /dev/null
> +++ b/mm/vrange.c
> @@ -0,0 +1,150 @@
> +#include <linux/syscalls.h>
> +#include <linux/vrange.h>
> +#include <linux/mm_inline.h>
> +#include <linux/pagemap.h>
> +#include <linux/rmap.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
> +				unsigned long end, int mode, int *purged)
> +{
> +	struct vm_area_struct *vma, *prev;
> +	unsigned long orig_start = start;
> +	ssize_t count = 0, ret = 0;
> +	int lpurged = 0;
> +
> +	down_read(&mm->mmap_sem);
> +
> +	vma = find_vma_prev(mm, start, &prev);
> +	if (vma && start > vma->vm_start)
> +		prev = vma;
> +
> +	for (;;) {
> +		unsigned long new_flags;
> +		pgoff_t pgoff;
> +		unsigned long tmp;
> +
> +		if (!vma)
> +			goto out;
> +
> +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
> +					VM_HUGETLB))
> +			goto out;
> +
> +		/* We don't support volatility on files for now */
> +		if (vma->vm_file) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		new_flags = vma->vm_flags;
> +
> +		if (start < vma->vm_start) {
> +			start = vma->vm_start;
> +			if (start >= end)
> +				goto out;
> +		}
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		switch (mode) {
> +		case VRANGE_VOLATILE:
> +			new_flags |= VM_VOLATILE;
> +			break;
> +		case VRANGE_NONVOLATILE:
> +			new_flags &= ~VM_VOLATILE;
> +		}
> +
> +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> +		prev = vma_merge(mm, prev, start, tmp, new_flags,
> +					vma->anon_vma, vma->vm_file, pgoff,
> +					vma_policy(vma));
> +		if (prev)
> +			goto success;
> +
> +		if (start != vma->vm_start) {
> +			ret = split_vma(mm, vma, start, 1);
> +			if (ret)
> +				goto out;
> +		}
> +
> +		if (tmp != vma->vm_end) {
> +			ret = split_vma(mm, vma, tmp, 0);
> +			if (ret)
> +				goto out;
> +		}
> +
> +		prev = vma;
> +success:
> +		vma->vm_flags = new_flags;
> +		*purged = lpurged;
> +
> +		/* update count to distance covered so far*/
> +		count = tmp - orig_start;
> +
> +		if (prev && start < prev->vm_end)
  In which case 'prev' can be NULL? And when start >= prev->vm_end? In all
the cases I can come up with this condition seems to be true...

> +			start = prev->vm_end;
> +		if (start >= end)
> +			goto out;
> +		if (prev)
  Ditto regarding 'prev'...

> +			vma = prev->vm_next;
> +		else	/* madvise_remove dropped mmap_sem */
> +			vma = find_vma(mm, start);
  The comment regarding madvise_remove() looks bogus...

> +	}
> +out:
> +	up_read(&mm->mmap_sem);
> +
> +	/* report bytes successfully marked, even if we're exiting on error */
> +	if (count)
> +		return count;
> +
> +	return ret;
> +}
> +
> +SYSCALL_DEFINE4(vrange, unsigned long, start,
> +		size_t, len, int, mode, int __user *, purged)
> +{
> +	unsigned long end;
> +	struct mm_struct *mm = current->mm;
> +	ssize_t ret = -EINVAL;
> +	int p = 0;
> +
> +	if (start & ~PAGE_MASK)
> +		goto out;
> +
> +	len &= PAGE_MASK;
> +	if (!len)
> +		goto out;
> +
> +	end = start + len;
> +	if (end < start)
> +		goto out;
> +
> +	if (start >= TASK_SIZE)
> +		goto out;
> +
> +	if (purged) {
> +		/* Test pointer is valid before making any changes */
> +		if (put_user(p, purged))
> +			return -EFAULT;
> +	}
> +
> +	ret = do_vrange(mm, start, end, mode, &p);
> +
> +	if (purged) {
> +		if (put_user(p, purged)) {
> +			/*
> +			 * This would be bad, since we've modified volatility
> +			 * and the change in purged state would be lost.
> +			 */
> +			WARN_ONCE(1, "vrange: purge state possibly lost\n");
> +		}
> +	}
> +
> +out:
> +	return ret;
> +}
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/3] vrange: Add purged page detection on setting memory non-volatile
  2014-03-14 18:33 ` [PATCH 2/3] vrange: Add purged page detection on setting memory non-volatile John Stultz
@ 2014-03-17  9:39   ` Jan Kara
  2014-03-17 22:22     ` John Stultz
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2014-03-17  9:39 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Fri 14-03-14 11:33:32, John Stultz wrote:
> Users of volatile ranges will need to know if memory was discarded.
> This patch adds the purged state tracking required to inform userland
> when it marks memory as non-volatile that some memory in that range
> was purged and needs to be regenerated.
> 
> This is a simplified implementation which uses some of the logic from
> Minchan's earlier efforts, so credit to Minchan for his work.
  Some minor comments below...

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Dhaval Giani <dgiani@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  include/linux/swap.h   | 15 +++++++++++--
>  include/linux/vrange.h | 13 ++++++++++++
>  mm/vrange.c            | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 83 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6..18c12f9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -70,8 +70,19 @@ static inline int current_is_kswapd(void)
>  #define SWP_HWPOISON_NUM 0
>  #endif
>  
> -#define MAX_SWAPFILES \
> -	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +
> +/*
> + * Purged volatile range pages
> + */
> +#define SWP_VRANGE_PURGED_NUM 1
> +#define SWP_VRANGE_PURGED (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
> +
> +
> +#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)	\
> +				- SWP_MIGRATION_NUM	\
> +				- SWP_HWPOISON_NUM	\
> +				- SWP_VRANGE_PURGED_NUM	\
> +			)
>  
>  /*
>   * Magic header for a swap area. The first part of the union is
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> index 652396b..c4a1616 100644
> --- a/include/linux/vrange.h
> +++ b/include/linux/vrange.h
> @@ -1,7 +1,20 @@
>  #ifndef _LINUX_VRANGE_H
>  #define _LINUX_VRANGE_H
>  
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> +
>  #define VRANGE_NONVOLATILE 0
>  #define VRANGE_VOLATILE 1
>  
> +static inline swp_entry_t swp_entry_mk_vrange_purged(void)
> +{
> +	return swp_entry(SWP_VRANGE_PURGED, 0);
> +}
> +
> +static inline int entry_is_vrange_purged(swp_entry_t entry)
> +{
> +	return swp_type(entry) == SWP_VRANGE_PURGED;
> +}
> +
>  #endif /* _LINUX_VRANGE_H */
> diff --git a/mm/vrange.c b/mm/vrange.c
> index acb4356..844571b 100644
> --- a/mm/vrange.c
> +++ b/mm/vrange.c
> @@ -8,6 +8,60 @@
>  #include <linux/mm_inline.h>
>  #include "internal.h"
>  
> +struct vrange_walker {
> +	struct vm_area_struct *vma;
> +	int pages_purged;
  Maybe call this 'was_page_purged'? To better suggest the value is bool
and not a number of pages... Or make that 'bool' instead of 'int'?

> +};
> +
> +static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> +				struct mm_walk *walk)
> +{
> +	struct vrange_walker *vw = walk->private;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +
> +	if (pmd_trans_huge(*pmd))
> +		return 0;
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		if (!pte_present(*pte)) {
> +			swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
> +
> +			if (unlikely(entry_is_vrange_purged(vrange_entry))) {
> +				vw->pages_purged = 1;
> +				break;
> +			}
> +		}
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static unsigned long vrange_check_purged(struct mm_struct *mm,
  What's the point of this function returning ulong when everything else
expects 'int'?

> +					 struct vm_area_struct *vma,
> +					 unsigned long start,
> +					 unsigned long end)
> +{
> +	struct vrange_walker vw;
> +	struct mm_walk vrange_walk = {
> +		.pmd_entry = vrange_pte_range,
> +		.mm = vma->vm_mm,
> +		.private = &vw,
> +	};
> +	vw.pages_purged = 0;
> +	vw.vma = vma;
> +
> +	walk_page_range(start, end, &vrange_walk);
> +
> +	return vw.pages_purged;
> +
> +}
> +
>  static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
>  				unsigned long end, int mode, int *purged)
>  {
> @@ -57,6 +111,9 @@ static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
>  			break;
>  		case VRANGE_NONVOLATILE:
>  			new_flags &= ~VM_VOLATILE;
> +			lpurged |= vrange_check_purged(mm, vma,
> +							vma->vm_start,
> +							vma->vm_end);
  Hum, why don't you actually just call vrange_check_purged() once for the
whole syscall range? walk_page_range() seems to handle multiple vmas just
fine...

>  		}
>  
>  		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> -- 
> 1.8.3.2
> 
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas
  2014-03-17  9:21   ` Jan Kara
@ 2014-03-17  9:43     ` Jan Kara
  2014-03-18  0:36       ` John Stultz
  2014-03-17 22:19     ` John Stultz
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Kara @ 2014-03-17  9:43 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Mon 17-03-14 10:21:18, Jan Kara wrote:
> On Fri 14-03-14 11:33:31, John Stultz wrote:
> > +	for (;;) {
> > +		unsigned long new_flags;
> > +		pgoff_t pgoff;
> > +		unsigned long tmp;
> > +
> > +		if (!vma)
> > +			goto out;
> > +
> > +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
> > +					VM_HUGETLB))
> > +			goto out;
> > +
> > +		/* We don't support volatility on files for now */
> > +		if (vma->vm_file) {
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> > +
> > +		new_flags = vma->vm_flags;
> > +
> > +		if (start < vma->vm_start) {
> > +			start = vma->vm_start;
> > +			if (start >= end)
> > +				goto out;
> > +		}
  One more question: This seems to silently skip any holes between VMAs. Is
that really intended? I'd expect that marking an unmapped range as volatile /
non-volatile should return an error... In any case, what happens should be
defined in the description.

								Honza

> > +		tmp = vma->vm_end;
> > +		if (end < tmp)
> > +			tmp = end;
> > +
> > +		switch (mode) {
> > +		case VRANGE_VOLATILE:
> > +			new_flags |= VM_VOLATILE;
> > +			break;
> > +		case VRANGE_NONVOLATILE:
> > +			new_flags &= ~VM_VOLATILE;
> > +		}
> > +
> > +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > +		prev = vma_merge(mm, prev, start, tmp, new_flags,
> > +					vma->anon_vma, vma->vm_file, pgoff,
> > +					vma_policy(vma));
> > +		if (prev)
> > +			goto success;
> > +
> > +		if (start != vma->vm_start) {
> > +			ret = split_vma(mm, vma, start, 1);
> > +			if (ret)
> > +				goto out;
> > +		}
> > +
> > +		if (tmp != vma->vm_end) {
> > +			ret = split_vma(mm, vma, tmp, 0);
> > +			if (ret)
> > +				goto out;
> > +		}
> > +
> > +		prev = vma;
> > +success:
> > +		vma->vm_flags = new_flags;
> > +		*purged = lpurged;
> > +
> > +		/* update count to distance covered so far*/
> > +		count = tmp - orig_start;
> > +
> > +		if (prev && start < prev->vm_end)
>   In which case 'prev' can be NULL? And when start >= prev->vm_end? In all
> the cases I can come up with this condition seems to be true...
> 
> > +			start = prev->vm_end;
> > +		if (start >= end)
> > +			goto out;
> > +		if (prev)
>   Ditto regarding 'prev'...
> 
> > +			vma = prev->vm_next;
> > +		else	/* madvise_remove dropped mmap_sem */
> > +			vma = find_vma(mm, start);
>   The comment regarding madvise_remove() looks bogus...
> 
> > +	}
> > +out:
> > +	up_read(&mm->mmap_sem);
> > +
> > +	/* report bytes successfully marked, even if we're exiting on error */
> > +	if (count)
> > +		return count;
> > +
> > +	return ret;
> > +}
> > +
> > +SYSCALL_DEFINE4(vrange, unsigned long, start,
> > +		size_t, len, int, mode, int __user *, purged)
> > +{
> > +	unsigned long end;
> > +	struct mm_struct *mm = current->mm;
> > +	ssize_t ret = -EINVAL;
> > +	int p = 0;
> > +
> > +	if (start & ~PAGE_MASK)
> > +		goto out;
> > +
> > +	len &= PAGE_MASK;
> > +	if (!len)
> > +		goto out;
> > +
> > +	end = start + len;
> > +	if (end < start)
> > +		goto out;
> > +
> > +	if (start >= TASK_SIZE)
> > +		goto out;
> > +
> > +	if (purged) {
> > +		/* Test pointer is valid before making any changes */
> > +		if (put_user(p, purged))
> > +			return -EFAULT;
> > +	}
> > +
> > +	ret = do_vrange(mm, start, end, mode, &p);
> > +
> > +	if (purged) {
> > +		if (put_user(p, purged)) {
> > +			/*
> > +			 * This would be bad, since we've modified volatility
> > +			 * and the change in purged state would be lost.
> > +			 */
> > +			WARN_ONCE(1, "vrange: purge state possibly lost\n");
> > +		}
> > +	}
> > +
> > +out:
> > +	return ret;
> > +}
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas
  2014-03-17  9:21   ` Jan Kara
  2014-03-17  9:43     ` Jan Kara
@ 2014-03-17 22:19     ` John Stultz
  1 sibling, 0 replies; 21+ messages in thread
From: John Stultz @ 2014-03-17 22:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On 03/17/2014 02:21 AM, Jan Kara wrote:
> On Fri 14-03-14 11:33:31, John Stultz wrote:
>> This patch introduces the vrange() syscall, which allows for specifying
>> ranges of memory as volatile, and able to be discarded by the system.
>>
>> This initial patch simply adds the syscall, and the vma handling,
>> splitting and merging the vmas as needed, and marking them with
>> VM_VOLATILE.
>>
>> No purging or discarding of volatile ranges is done at this point.
>>
>> Example man page:
>>
>> NAME
>> 	vrange - Mark or unmark range of memory as volatile
>>
>> SYNOPSIS
>> 	int vrange(unsigned long start, size_t length, int mode,
>> 			 int *purged);
>>
>> DESCRIPTION
>> 	Applications can use vrange(2) to advise the kernel how it should
>> 	handle paging I/O in this VM area.  The idea is to help the kernel
>> 	discard pages of vrange instead of reclaiming when memory pressure
>> 	happens. It means kernel doesn't discard any pages of vrange if
>> 	there is no memory pressure.
>   I'd say that the advantage is kernel doesn't have to swap volatile pages,
> it can just directly discard them on memory pressure. You should also
> mention somewhere vrange() is currently supported only for anonymous pages.
> So maybe we can have the description like:
> Applications can use vrange(2) to advise kernel that pages of anonymous
> mapping in the given VM area can be reclaimed without swapping (or can no
> longer be reclaimed without swapping). The idea is that application can
> help kernel with page reclaim under memory pressure by specifying data
> it can easily regenerate and thus kernel can discard the data if needed.

Good point. This man page description originated from previous patches
where we did handle file paging, so I'll try to update it and make it
clearer that we currently don't (although I very much want to re-add that
functionality eventually).


>
>> 	mode:
>> 	VRANGE_VOLATILE
>> 		hint to kernel so VM can discard in vrange pages when
>> 		memory pressure happens.
>> 	VRANGE_NONVOLATILE
>> 		hint to kernel so VM doesn't discard vrange pages
>> 		any more.
>>
>> 	If user try to access purged memory without VRANGE_NONVOLATILE call,
>                 ^^^ tries
>
>> 	he can encounter SIGBUS if the page was discarded by kernel.
>>
>> 	purged: Pointer to an integer which will return 1 if
>> 	mode == VRANGE_NONVOLATILE and any page in the affected range
>> 	was purged. If purged returns zero during a mode ==
>> 	VRANGE_NONVOLATILE call, it means all of the pages in the range
>> 	are intact.
>>
>> RETURN VALUE
>> 	On success vrange returns the number of bytes marked or unmarked.
>> 	Similar to write(), it may return fewer bytes than specified
>> 	if it ran into a problem.
>   I believe you may need to better explain what the 'purged' argument is good
> for. Because in my naive understanding *purged == 1 iff return value !=
> length.  I recall your discussion with Johannes about error conditions and
> the need to return error but also the state of the range, is that right?
> But that should be really explained somewhere so that poor application
> programmer is aware of those corner cases as well.

Right. So the purged flag is separate/independent from the byte/error
return. Basically we want to describe how much memory has been changed
from volatile to non-volatile state (or vice versa), as well as
reporting whether any of those pages were purged while they were volatile.
One could mark 1meg of previously volatile memory non-volatile, and find
purged == 0 if there was no memory pressure, or if there was pressure,
find purged == 1.

The error case is that should we run out of memory (or hit some other
error condition that prevents us from successfully marking all of the
specified memory as non-volatile) half way through, we need to return to
the user the purged state for the pages that we did change.

Now, it's true that if we ran out of memory, it's likely that the memory
pressure caused pages to be purged before being marked, but one could
imagine a situation where we got halfway through marking non-purged
pages and memory pressure suddenly went up, causing an allocation to
fail. Thus in that case, you would see the return value != length, and
purged == 0.

But thank you for the feedback, I'll try to rework that part to be more
clear. Any suggestions would also be welcome, as I worry my head is a
bit too steeped in this to see what would make it more clear to fresh eyes.


[snip]
>> diff --git a/mm/vrange.c b/mm/vrange.c
>> new file mode 100644
>> index 0000000..acb4356
>> --- /dev/null
>> +++ b/mm/vrange.c
>> @@ -0,0 +1,150 @@
>> +#include <linux/syscalls.h>
>> +#include <linux/vrange.h>
>> +#include <linux/mm_inline.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/rmap.h>
>> +#include <linux/hugetlb.h>
>> +#include <linux/mmu_notifier.h>
>> +#include <linux/mm_inline.h>
>> +#include "internal.h"
>> +
>> +static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
>> +				unsigned long end, int mode, int *purged)
>> +{
>> +	struct vm_area_struct *vma, *prev;
>> +	unsigned long orig_start = start;
>> +	ssize_t count = 0, ret = 0;
>> +	int lpurged = 0;
>> +
>> +	down_read(&mm->mmap_sem);
>> +
>> +	vma = find_vma_prev(mm, start, &prev);
>> +	if (vma && start > vma->vm_start)
>> +		prev = vma;
>> +
>> +	for (;;) {
>> +		unsigned long new_flags;
>> +		pgoff_t pgoff;
>> +		unsigned long tmp;
>> +
>> +		if (!vma)
>> +			goto out;
>> +
>> +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
>> +					VM_HUGETLB))
>> +			goto out;
>> +
>> +		/* We don't support volatility on files for now */
>> +		if (vma->vm_file) {
>> +			ret = -EINVAL;
>> +			goto out;
>> +		}
>> +
>> +		new_flags = vma->vm_flags;
>> +
>> +		if (start < vma->vm_start) {
>> +			start = vma->vm_start;
>> +			if (start >= end)
>> +				goto out;
>> +		}
>> +		tmp = vma->vm_end;
>> +		if (end < tmp)
>> +			tmp = end;
>> +
>> +		switch (mode) {
>> +		case VRANGE_VOLATILE:
>> +			new_flags |= VM_VOLATILE;
>> +			break;
>> +		case VRANGE_NONVOLATILE:
>> +			new_flags &= ~VM_VOLATILE;
>> +		}
>> +
>> +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
>> +		prev = vma_merge(mm, prev, start, tmp, new_flags,
>> +					vma->anon_vma, vma->vm_file, pgoff,
>> +					vma_policy(vma));
>> +		if (prev)
>> +			goto success;
>> +
>> +		if (start != vma->vm_start) {
>> +			ret = split_vma(mm, vma, start, 1);
>> +			if (ret)
>> +				goto out;
>> +		}
>> +
>> +		if (tmp != vma->vm_end) {
>> +			ret = split_vma(mm, vma, tmp, 0);
>> +			if (ret)
>> +				goto out;
>> +		}
>> +
>> +		prev = vma;
>> +success:
>> +		vma->vm_flags = new_flags;
>> +		*purged = lpurged;
>> +
>> +		/* update count to distance covered so far*/
>> +		count = tmp - orig_start;
>> +
>> +		if (prev && start < prev->vm_end)
>   In which case 'prev' can be NULL? And when start >= prev->vm_end? In all
> the cases I can come up with this condition seems to be true...
>
>> +			start = prev->vm_end;
>> +		if (start >= end)
>> +			goto out;
>> +		if (prev)
>   Ditto regarding 'prev'...

I haven't had the chance to look closely here today, but I'll double
check on these two.


>> +			vma = prev->vm_next;
>> +		else	/* madvise_remove dropped mmap_sem */
>> +			vma = find_vma(mm, start);
>   The comment regarding madvise_remove() looks bogus...

Yep. Thanks for pointing it out, that's from my starting with the
madvise logic and reworking it.


Thanks so much for the feedback here! I really appreciate it, and will
rework things appropriately.

thanks again!
-john

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] vrange: Add purged page detection on setting memory non-volatile
  2014-03-17  9:39   ` Jan Kara
@ 2014-03-17 22:22     ` John Stultz
  0 siblings, 0 replies; 21+ messages in thread
From: John Stultz @ 2014-03-17 22:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On 03/17/2014 02:39 AM, Jan Kara wrote:
> On Fri 14-03-14 11:33:32, John Stultz wrote:
>> Users of volatile ranges will need to know if memory was discarded.
>> This patch adds the purged state tracking required to inform userland
>> when it marks memory as non-volatile that some memory in that range
>> was purged and needs to be regenerated.
>>
>> This simplified implementation which uses some of the logic from
>> Minchan's earlier efforts, so credit to Minchan for his work.
>   Some minor comments below...
>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Android Kernel Team <kernel-team@android.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Robert Love <rlove@google.com>
>> Cc: Mel Gorman <mel@csn.ul.ie>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Dave Hansen <dave@sr71.net>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
>> Cc: Neil Brown <neilb@suse.de>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Mike Hommey <mh@glandium.org>
>> Cc: Taras Glek <tglek@mozilla.com>
>> Cc: Dhaval Giani <dgiani@mozilla.com>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
>> Cc: Michel Lespinasse <walken@google.com>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
>> Signed-off-by: John Stultz <john.stultz@linaro.org>
>> ---
>>  include/linux/swap.h   | 15 +++++++++++--
>>  include/linux/vrange.h | 13 ++++++++++++
>>  mm/vrange.c            | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 83 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 46ba0c6..18c12f9 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -70,8 +70,19 @@ static inline int current_is_kswapd(void)
>>  #define SWP_HWPOISON_NUM 0
>>  #endif
>>  
>> -#define MAX_SWAPFILES \
>> -	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>> +
>> +/*
>> + * Purged volatile range pages
>> + */
>> +#define SWP_VRANGE_PURGED_NUM 1
>> +#define SWP_VRANGE_PURGED (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
>> +
>> +
>> +#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)	\
>> +				- SWP_MIGRATION_NUM	\
>> +				- SWP_HWPOISON_NUM	\
>> +				- SWP_VRANGE_PURGED_NUM	\
>> +			)
>>  
>>  /*
>>   * Magic header for a swap area. The first part of the union is
>> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
>> index 652396b..c4a1616 100644
>> --- a/include/linux/vrange.h
>> +++ b/include/linux/vrange.h
>> @@ -1,7 +1,20 @@
>>  #ifndef _LINUX_VRANGE_H
>>  #define _LINUX_VRANGE_H
>>  
>> +#include <linux/swap.h>
>> +#include <linux/swapops.h>
>> +
>>  #define VRANGE_NONVOLATILE 0
>>  #define VRANGE_VOLATILE 1
>>  
>> +static inline swp_entry_t swp_entry_mk_vrange_purged(void)
>> +{
>> +	return swp_entry(SWP_VRANGE_PURGED, 0);
>> +}
>> +
>> +static inline int entry_is_vrange_purged(swp_entry_t entry)
>> +{
>> +	return swp_type(entry) == SWP_VRANGE_PURGED;
>> +}
>> +
>>  #endif /* _LINUX_VRANGE_H */
>> diff --git a/mm/vrange.c b/mm/vrange.c
>> index acb4356..844571b 100644
>> --- a/mm/vrange.c
>> +++ b/mm/vrange.c
>> @@ -8,6 +8,60 @@
>>  #include <linux/mm_inline.h>
>>  #include "internal.h"
>>  
>> +struct vrange_walker {
>> +	struct vm_area_struct *vma;
>> +	int pages_purged;
>   Maybe call this 'was_page_purged'? To better suggest the value is bool
> and not a number of pages... Or make that 'bool' instead of 'int'?

Yea, page_was_purged sounds good to me. Thanks for pointing out the
ambiguity. Similarly, the entry_is_vrange_purged/swp_entry_mk_vrange_purged
functions are a little inconsistently named.



>
>> +};
>> +
>> +static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>> +				struct mm_walk *walk)
>> +{
>> +	struct vrange_walker *vw = walk->private;
>> +	pte_t *pte;
>> +	spinlock_t *ptl;
>> +
>> +	if (pmd_trans_huge(*pmd))
>> +		return 0;
>> +	if (pmd_trans_unstable(pmd))
>> +		return 0;
>> +
>> +	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
>> +		if (!pte_present(*pte)) {
>> +			swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
>> +
>> +			if (unlikely(entry_is_vrange_purged(vrange_entry))) {
>> +				vw->pages_purged = 1;
>> +				break;
>> +			}
>> +		}
>> +	}
>> +	pte_unmap_unlock(pte - 1, ptl);
>> +	cond_resched();
>> +
>> +	return 0;
>> +}
>> +
>> +static unsigned long vrange_check_purged(struct mm_struct *mm,
>   What's the point of this function returning ulong when everything else
> expects 'int'?

Thanks, I'll fix that.

>> +					 struct vm_area_struct *vma,
>> +					 unsigned long start,
>> +					 unsigned long end)
>> +{
>> +	struct vrange_walker vw;
>> +	struct mm_walk vrange_walk = {
>> +		.pmd_entry = vrange_pte_range,
>> +		.mm = vma->vm_mm,
>> +		.private = &vw,
>> +	};
>> +	vw.pages_purged = 0;
>> +	vw.vma = vma;
>> +
>> +	walk_page_range(start, end, &vrange_walk);
>> +
>> +	return vw.pages_purged;
>> +
>> +}
>> +
>>  static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
>>  				unsigned long end, int mode, int *purged)
>>  {
>> @@ -57,6 +111,9 @@ static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
>>  			break;
>>  		case VRANGE_NONVOLATILE:
>>  			new_flags &= ~VM_VOLATILE;
>> +			lpurged |= vrange_check_purged(mm, vma,
>> +							vma->vm_start,
>> +							vma->vm_end);
>   Hum, why don't you actually just call vrange_check_purge() once for the
> whole syscall range? walk_page_range() seems to handle multiple vmas just
> fine...
>

Another good suggestion!

Thanks!
-john


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas
  2014-03-17  9:43     ` Jan Kara
@ 2014-03-18  0:36       ` John Stultz
  0 siblings, 0 replies; 21+ messages in thread
From: John Stultz @ 2014-03-18  0:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, Michel Lespinasse, Minchan Kim,
	linux-mm

On 03/17/2014 02:43 AM, Jan Kara wrote:
> On Mon 17-03-14 10:21:18, Jan Kara wrote:
>> On Fri 14-03-14 11:33:31, John Stultz wrote:
>>> +	for (;;) {
>>> +		unsigned long new_flags;
>>> +		pgoff_t pgoff;
>>> +		unsigned long tmp;
>>> +
>>> +		if (!vma)
>>> +			goto out;
>>> +
>>> +		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
>>> +					VM_HUGETLB))
>>> +			goto out;
>>> +
>>> +		/* We don't support volatility on files for now */
>>> +		if (vma->vm_file) {
>>> +			ret = -EINVAL;
>>> +			goto out;
>>> +		}
>>> +
>>> +		new_flags = vma->vm_flags;
>>> +
>>> +		if (start < vma->vm_start) {
>>> +			start = vma->vm_start;
>>> +			if (start >= end)
>>> +				goto out;
>>> +		}
>   One more question: This seems to silently skip any holes between VMAs. Is
> that really intended? I'd expect that marking unmapped range as volatile /
> non-volatile should return error... In any case what happens should be
> defined in the description.

So.. initially it was by design, but as I look at madvise and think
about it further, it does make more sense to throw errors if memory in
the range is not mapped.

I'll try to rework things to adapt to this.

thanks
-john

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
  2014-03-14 18:33 [PATCH 0/3] Volatile Ranges (v11) John Stultz
                   ` (2 preceding siblings ...)
  2014-03-14 18:33 ` [PATCH 3/3] vrange: Add page purging logic & SIGBUS trap John Stultz
@ 2014-03-18 12:24 ` Michal Hocko
  2014-03-18 17:53   ` John Stultz
  2014-03-20  0:38   ` Dave Hansen
  2014-03-18 15:11 ` Minchan Kim
  4 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2014-03-18 12:24 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Fri 14-03-14 11:33:30, John Stultz wrote:
[...]
> Volatile ranges provides a method for userland to inform the kernel that
> a range of memory is safe to discard (ie: can be regenerated) but
> userspace may want to try access it in the future.  It can be thought of
> as similar to MADV_DONTNEED, but that the actual freeing of the memory
> is delayed and only done under memory pressure, and the user can try to
> cancel the action and be able to quickly access any unpurged pages. The
> idea originated from Android's ashmem, but I've since learned that other
> OSes provide similar functionality.

Maybe I have missed something (I've only glanced through the patches)
but it seems that marking a range volatile alters neither the reference
bits nor the position in the LRU. I thought that a volatile page would be
moved to the end of the inactive LRU with its reference bit dropped. Or is
this expectation wrong, and volatility is not supposed to touch page
aging?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
  2014-03-14 18:33 [PATCH 0/3] Volatile Ranges (v11) John Stultz
                   ` (3 preceding siblings ...)
  2014-03-18 12:24 ` [PATCH 0/3] Volatile Ranges (v11) Michal Hocko
@ 2014-03-18 15:11 ` Minchan Kim
  2014-03-18 18:07   ` John Stultz
  4 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2014-03-18 15:11 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, linux-mm

Hello John,

Sorry for the late reply; our timing never seems to line up.
I'll share my thoughts even though I haven't fully worked everything out
since you sent the patchset. (In any case, we should share ideas before
LSF/MM.)

On Fri, Mar 14, 2014 at 11:33:30AM -0700, John Stultz wrote:
> I recently got a chance to try to implement Johannes' suggested approach
> so I wanted to send it out for comments. It looks like Minchan has also
> done the same, but from a different direction, focusing on the MADV_FREE
> use cases. I think both approaches are valid, so I wouldn't consider

True, and I just wanted to think over vrange-anon after resolving
MADV_FREE first, because MADV_FREE is very clear (ie, other OSes and
general allocators already support it) and vrange-anon might share some
of the implementation with MADV_FREE. Once we give up on the vrange
syscall being fast (ex, no mmap_sem write-side lock, no pte enumeration
in syscall context), we could do better in other areas. I will describe
these below.

> these patches to be in conflict. Its just that earlier iterations of the
> volatile range patches had tried to handle numerous different use cases,
> and the resulting complexity was apparently making it difficult to review
> and get interest in the patch set. So basically we're splitting the use
> cases up and trying to find simpler solutions for each.
> 
> I'd greatly appreciate any feedback or thoughts!

1) SIGBUS

This is one of the arguable issues, because some users want to get a
SIGBUS (ex, Firefox) while others want just a zero page with no signal
(ex, Google's address sanitizer), so it should be an option.
	
	int vrange(start, len, VRANGE_VOLATILE|VRANGE_ZERO, &purged);
	int vrange(start, len, VRANGE_VOLATILE|VRANGE_SIGNAL, &purged);

2) Accounting

One of the problems I have been thinking about is the lack of accounting
of vrange pages. I mean we need some statistics for vrange pages, and they
should be counted in pages rather than vma size. Without that, user space
can't see the current status and so can't control the system's memory
consumption. It's already a known problem for other OSes which support
similar things (ex, MADV_FREE).

For accounting, we should count how many pages actually exist in the range
when the vrange syscall is called. It would increase syscall overhead, but
the user would get accurate statistics. It's just a trade-off.

3) Aging

I think vrange pages should be discarded earlier than other, hot pages,
so I want to move them to the tail of the inactive LRU when the syscall
is called. We could do that by using deactivate_page with some tweak,
while we account the pages in syscall context.

But if the user wants vrange pages treated the same as other hot pages,
he could ask for that, so that we skip the deactivation.

	vrange(start, len, VRANGE_VOLATILE|VRANGE_ZERO|VRANGE_AGING, &purged)
	or
	vrange(start, len, VRANGE_VOLATILE|VRANGE_SIGNAL|VRANGE_AGING, &purged)

It could be convenient for the Mozilla use case if they want to age
vrange pages.
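
To illustrate 2) and 3), below is a rough, untested sketch of what the walk
at vrange(VOLATILE) time could look like. It reuses the pmd-walker pattern
from patch 2/3 and would be wired up via walk_page_range() from do_vrange();
the deactivate_page() call is only a placeholder for the "tweaked" variant,
since the current helper bails out on mapped pages.

/* Sketch only: count the pages actually present in the range (for the
 * accounting/vmstat idea) and push them toward the inactive tail (aging). */
struct vrange_age_walker {
	struct vm_area_struct *vma;
	unsigned long nr_present;
};

static int vrange_age_pte_range(pmd_t *pmd, unsigned long addr,
				unsigned long end, struct mm_walk *walk)
{
	struct vrange_age_walker *vaw = walk->private;
	pte_t *pte;
	spinlock_t *ptl;

	if (pmd_trans_huge(*pmd) || pmd_trans_unstable(pmd))
		return 0;

	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
	for (; addr != end; pte++, addr += PAGE_SIZE) {
		struct page *page;

		if (!pte_present(*pte))
			continue;
		page = vm_normal_page(vaw->vma, addr, *pte);
		if (!page)
			continue;

		vaw->nr_present++;	/* would feed a vmstat counter */
		deactivate_page(page);	/* placeholder for the tweaked helper */
	}
	pte_unmap_unlock(pte - 1, ptl);
	cond_resched();

	return 0;
}

The nr_present count is what I'd account per-mm and expose via vmstat, and
whether we deactivate at all could then depend on a VRANGE_AGING flag as
above.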

4) Permanency

Like mlockall's MCL_FUTURE, it would be better to let the range keep its
volatile property until VRANGE_NOVOLATILE is called on it. I mean that
pages faulted into the range after the syscall is made should become
volatile automatically, so that the user can avoid frequent syscalls to
mark them volatile.

Any thoughts?

> 
> thanks
> -john
> 
> 
> Volatile ranges provides a method for userland to inform the kernel that
> a range of memory is safe to discard (ie: can be regenerated) but
> userspace may want to try access it in the future.  It can be thought of
> as similar to MADV_DONTNEED, but that the actual freeing of the memory
> is delayed and only done under memory pressure, and the user can try to
> cancel the action and be able to quickly access any unpurged pages. The
> idea originated from Android's ashmem, but I've since learned that other
> OSes provide similar functionality.
> 
> This functionality allows for a number of interesting uses:
> * Userland caches that have kernel triggered eviction under memory
> pressure. This allows for the kernel to "rightsize" userspace caches for
> current system-wide workload. Things like image bitmap caches, or
> rendered HTML in a hidden browser tab, where the data is not visible and
> can be regenerated if needed, are good examples.
> 
> * Opportunistic freeing of memory that may be quickly reused. Minchan
> has done a malloc implementation where free() marks the pages as
> volatile, allowing the kernel to reclaim under pressure. This avoids the
> unmapping and remapping of anonymous pages on free/malloc. So if
> userland wants to malloc memory quickly after the free, it just needs to
> mark the pages as non-volatile, and only purged pages will have to be
> faulted back in.
> 
> There are two basic ways this can be used:
> 
> Explicit marking method:
> 1) Userland marks a range of memory that can be regenerated if necessary
> as volatile
> 2) Before accessing the memory again, userland marks the memory as
> nonvolatile, and the kernel will provide notification if any pages in the
> range has been purged.
> 
> Optimistic method:
> 1) Userland marks a large range of data as volatile
> 2) Userland continues to access the data as it needs.
> 3) If userland accesses a page that has been purged, the kernel will
> send a SIGBUS
> 4) Userspace can trap the SIGBUS, mark the affected pages as
> non-volatile, and refill the data as needed before continuing on
> 
> You can read more about the history of volatile ranges here:
> http://permalink.gmane.org/gmane.linux.kernel.mm/98848
> http://permalink.gmane.org/gmane.linux.kernel.mm/98676
> https://lwn.net/Articles/522135/
> https://lwn.net/Kernel/Index/#Volatile_ranges
> 
> 
> This version of the patchset, at Johannes Weiner's suggestion, is much
> reduced in scope compared to earlier attempts. I've only handled
> volatility on anonymous memory, and we're storing the volatility in
> the VMA.  This may have performance implications compared with the earlier
> approach, but it does simplify the approach.
> 
> Further, the page discarding happens via normal vmscanning, which due to
> anonymous pages not being aged on swapless systems, means we'll only purge
> pages when swap is enabled. I'll be looking at enabling anonymous aging
> when swap is disabled to resolve this, but I wanted to get this out for
> initial comment.
> 
> Additionally, since we don't handle volatility on tmpfs files with this
> version of the patch, it is not able to be used to implement semantics
> similar to Android's ashmem. But since shared volatiltiy on files is
> more complex, my hope is to start small and hopefully grow from there.
> 
> Also, much of the logic in this patchset is based on Minchan's earlier
> efforts. On this iteration, I've not been in close collaboration with him,
> so I don't want to mis-attribute my rework of the code as his design,
> but I do want to make sure the credit goes to him for his major contribution.
> 
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@sr71.net>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Dhaval Giani <dgiani@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> 
> 
> John Stultz (3):
>   vrange: Add vrange syscall and handle splitting/merging and marking
>     vmas
>   vrange: Add purged page detection on setting memory non-volatile
>   vrange: Add page purging logic & SIGBUS trap
> 
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/mm.h               |   1 +
>  include/linux/swap.h             |  15 +-
>  include/linux/vrange.h           |  22 +++
>  mm/Makefile                      |   2 +-
>  mm/internal.h                    |   2 -
>  mm/memory.c                      |  21 +++
>  mm/rmap.c                        |   5 +
>  mm/vmscan.c                      |  12 ++
>  mm/vrange.c                      | 306 +++++++++++++++++++++++++++++++++++++++
>  10 files changed, 382 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/vrange.h
>  create mode 100644 mm/vrange.c
> 
> -- 
> 1.8.3.2
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
  2014-03-18 12:24 ` [PATCH 0/3] Volatile Ranges (v11) Michal Hocko
@ 2014-03-18 17:53   ` John Stultz
  2014-03-20  0:38   ` Dave Hansen
  1 sibling, 0 replies; 21+ messages in thread
From: John Stultz @ 2014-03-18 17:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Tue, Mar 18, 2014 at 5:24 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 14-03-14 11:33:30, John Stultz wrote:
> [...]
>> Volatile ranges provides a method for userland to inform the kernel that
>> a range of memory is safe to discard (ie: can be regenerated) but
>> userspace may want to try access it in the future.  It can be thought of
>> as similar to MADV_DONTNEED, but that the actual freeing of the memory
>> is delayed and only done under memory pressure, and the user can try to
>> cancel the action and be able to quickly access any unpurged pages. The
>> idea originated from Android's ashmem, but I've since learned that other
>> OSes provide similar functionality.
>
> Maybe I have missed something (I've only glanced through the patches)
> but it seems that marking a range volatile doesn't alter neither
> reference bits nor position in the LRU. I thought that a volatile page
> would be moved to the end of inactive LRU with the reference bit
> dropped. Or is this expectation wrong and volatility is not supposed to
> touch page aging?

Hrmm. So you're right, I had talked about how we'd end up purging
pages in a range together (as opposed to just randomly) because the
pages would have been marked together. On this pass, I was trying to
avoid touching all the pages on every operation, but I'll try to add
the referencing to keep it consistent with what was discussed (and
we'll get a sense of the performance impact).

Though subtleties like this are still open for discussion. For
instance, Minchan would like to see the volatile pages moved to the
front of the LRU instead of the back.

thanks
-john

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
  2014-03-18 15:11 ` Minchan Kim
@ 2014-03-18 18:07   ` John Stultz
       [not found]     ` <20140319004918.GB13475@bbox>
  0 siblings, 1 reply; 21+ messages in thread
From: John Stultz @ 2014-03-18 18:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	linux-mm

On Tue, Mar 18, 2014 at 8:11 AM, Minchan Kim <minchan@kernel.org> wrote:
> 1) SIGBUS
>
> It's one of the arguable issue because some user want to get a
> SIGBUS(ex, Firefox) while other want a just zero page(ex, Google
> address sanitizer) without signal so it should be option.
>
>         int vrange(start, len, VRANGE_VOLATILE|VRANGE_ZERO, &purged);
>         int vrange(start, len, VRANGE_VOLATILE|VRANGE_SIGNAL, &purged);

So, the zero-fill on volatile access feels like a *very* special case
to me, since a null page could be valid data in many cases. Since
support/interest for volatile ranges has been middling at best, I want
to start culling the stranger use cases. I'm open in the future to
adding a special flag or something if it really makes sense, but at
this point, let's just get the more general volatile range use cases
supported.


> 2) Accouting
>
> The one of problem I have thought is lack of accouting of vrange pages.
> I mean we need some statistics for vrange pages and it should be number
> of pages rather than vma size. Without that, user space couldn't see
> current status and then they couldn't control the system's memory
> consumption. It's alredy known problem for other OS which have support
> similar thing(ex, MADV_FREE).
>
> For accouting, we should account how many of existing pages are the range
> when vrange syscall is called. It could increase syscall overhead
> but user could have accurate statistics information. It's just trade-off.

Agreed. As I've been looking at handling anonymous page aging on
swapless systems, the naive method causes performance issues as we
scan and scan and scan the anonymous list trying to page things out to
nowhere. Providing the number of volatile pages would allow the
scanning to stop at a sensible time.

> 3) Aging
>
> I think vrange pages should be discarded eariler than other hot pages
> so want to move pages to tail of inactive LRU when syscall is called.
> We could do by using deactivate_page with some tweak while we accouts
> pages in syscall context.
>
> But if user want to treat vrange pages with other hot pages equally
> he could ask so that we could skip deactivating.
>
>         vrange(start, len, VRANGE_VOLATILE|VRANGE_ZERO|VRANGE_AGING, &purged)
>         or
>         vrange(start, len, VRANGE_VOLATILE|VRANGE_SIGNAL|VRANGE_AGING, &purged)
>
> It could be convenient for Moz usecase if they want to age vrange
> pages.

Again, I want to keep the scope small for now, so I'd rather not add
more options just yet. I think we should come up with a sensible
default and give that time to be used, and if there need to be more
options later, we can open those up. I think activating on volatile
(so the pages are purged together) is the right default approach, but
I'm open to discussing this further.


> 4) Permanency
>
> Like MCL_FUTURE of mlockall, it would be better to make the range
> have permanent property until called VRANGE_NOVOLATILE.
> I mean pages faulted on the range in future since syscall is called
> should be volatile automatically so that user could avoid frequent
> syscall to make them volatile.

I'm not sure I followed this. Is this with respect to the issue of
unmapped holes in the range?

thanks
-john

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
       [not found]     ` <20140319004918.GB13475@bbox>
@ 2014-03-19 10:12       ` Jan Kara
  2014-03-20  1:09         ` Minchan Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2014-03-19 10:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Johannes Weiner, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Andrea Arcangeli, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, linux-mm

On Wed 19-03-14 09:49:18, Minchan Kim wrote:
> On Tue, Mar 18, 2014 at 11:07:50AM -0700, John Stultz wrote:
> > On Tue, Mar 18, 2014 at 8:11 AM, Minchan Kim <minchan@kernel.org> wrote:
> > > 1) SIGBUS
> > >
> > > It's one of the arguable issue because some user want to get a
> > > SIGBUS(ex, Firefox) while other want a just zero page(ex, Google
> > > address sanitizer) without signal so it should be option.
> > >
> > >         int vrange(start, len, VRANGE_VOLATILE|VRANGE_ZERO, &purged);
> > >         int vrange(start, len, VRANGE_VOLATILE|VRANGE_SIGNAL, &purged);
> > 
> > So, the zero-fill on volatile access feels like a *very* special case
> > to me, since a null page could be valid data in many cases. Since
> > support/interest for volatile ranges has been middling at best, I want
> > to start culling the stranger use cases. I'm open in the future to
> > adding a special flag or something if it really make sense, but at
> > this point, lets just get the more general volatile range use cases
> > supported.
> 
> I'm not sure it's special case. Because some user could reserve
> a big volatile VMA and want to use the range by circle queue for
> caching so overwriting could happen easily.
> We should call vrange(NOVOLATILE) to prevent SIGBUS right before
> overwriting. I feel it's unnecessary overhead and we could avoid
> the cost with VRANGE_ZERO.
> Do you think this usecase would be rare?
  If I understand it correctly the buffer would be volatile all the time
and userspace would like to opportunistically access it. Hum, but then with
your automatic zero-filling it could see half of the page with data and
half of the page zeroed out (the page got evicted in the middle of
userspace reading it). I don't think that's a very comfortable interface to
work with (you would have to very carefully verify the data you've read is
really valid). And frankly in most cases I'm afraid the application would
fail to do proper verification and crash randomly under memory pressure. So
I wouldn't provide VRANGE_ZERO unless I come across real people for which
avoiding marking the range as NONVOLATILE is a big deal and they are OK with
handling all the odd situations that can happen.
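
Just to illustrate what I mean by "carefully verify": with VRANGE_ZERO the
reader of every object would need something along the lines of the sketch
below (the object layout, magic and checksum are all made up for the
example), and even that is not airtight if a purge races with the copy:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <zlib.h>		/* crc32() */

#define OBJ_MAGIC	0x56524e47u	/* never zero, so a zeroed header fails fast */
#define OBJ_PAYLOAD	4000

struct cache_obj {
	uint32_t magic;
	uint32_t csum;			/* crc32 over payload[] */
	unsigned char payload[OBJ_PAYLOAD];
};

static bool read_obj(const struct cache_obj *obj, unsigned char *out)
{
	struct cache_obj copy = *obj;	/* snapshot; may still tear mid-copy */

	if (copy.magic != OBJ_MAGIC)
		return false;		/* the object (or its head) was zeroed */
	if (crc32(0, copy.payload, sizeof(copy.payload)) != copy.csum)
		return false;		/* partially zeroed or torn read */

	memcpy(out, copy.payload, sizeof(copy.payload));
	return true;
}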

That being said I agree with you that it makes sense to extend the syscall
with flags argument so that we have some room for different modifications
of the functionality.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
  2014-03-18 12:24 ` [PATCH 0/3] Volatile Ranges (v11) Michal Hocko
  2014-03-18 17:53   ` John Stultz
@ 2014-03-20  0:38   ` Dave Hansen
  2014-03-20  0:57     ` John Stultz
  2014-03-20  7:45     ` Minchan Kim
  1 sibling, 2 replies; 21+ messages in thread
From: Dave Hansen @ 2014-03-20  0:38 UTC (permalink / raw)
  To: Michal Hocko, John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On 03/18/2014 05:24 AM, Michal Hocko wrote:
> On Fri 14-03-14 11:33:30, John Stultz wrote:
> [...]
>> Volatile ranges provides a method for userland to inform the kernel that
>> a range of memory is safe to discard (ie: can be regenerated) but
>> userspace may want to try access it in the future.  It can be thought of
>> as similar to MADV_DONTNEED, but that the actual freeing of the memory
>> is delayed and only done under memory pressure, and the user can try to
>> cancel the action and be able to quickly access any unpurged pages. The
>> idea originated from Android's ashmem, but I've since learned that other
>> OSes provide similar functionality.
> 
> Maybe I have missed something (I've only glanced through the patches)
> but it seems that marking a range volatile doesn't alter neither
> reference bits nor position in the LRU. I thought that a volatile page
> would be moved to the end of inactive LRU with the reference bit
> dropped. Or is this expectation wrong and volatility is not supposed to
> touch page aging?

I'm not really convinced it should alter the aging.  Things could
potentially go in and out of volatile state frequently, and requiring
aging means we've got to go after them page-by-page or pte-by-pte at
best.  That doesn't seem like something we want to do in a path we want
to be fast.

Why not just let normal page aging deal with them?  It seems to me like
like trying to infer intended lru position from volatility is the wrong
thing.  It's quite possible we'd have two pages in the same range that
we want in completely different parts of the LRU.  Maybe the structure
has a hot page and a cold one, and we would ideally want the cold one
swapped out and not the hot one.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
  2014-03-20  0:38   ` Dave Hansen
@ 2014-03-20  0:57     ` John Stultz
  2014-03-20  7:45     ` Minchan Kim
  1 sibling, 0 replies; 21+ messages in thread
From: John Stultz @ 2014-03-20  0:57 UTC (permalink / raw)
  To: Dave Hansen, Michal Hocko
  Cc: LKML, Andrew Morton, Android Kernel Team, Johannes Weiner,
	Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm

On 03/19/2014 05:38 PM, Dave Hansen wrote:
> On 03/18/2014 05:24 AM, Michal Hocko wrote:
>> On Fri 14-03-14 11:33:30, John Stultz wrote:
>> [...]
>>> Volatile ranges provides a method for userland to inform the kernel that
>>> a range of memory is safe to discard (ie: can be regenerated) but
>>> userspace may want to try access it in the future.  It can be thought of
>>> as similar to MADV_DONTNEED, but that the actual freeing of the memory
>>> is delayed and only done under memory pressure, and the user can try to
>>> cancel the action and be able to quickly access any unpurged pages. The
>>> idea originated from Android's ashmem, but I've since learned that other
>>> OSes provide similar functionality.
>> Maybe I have missed something (I've only glanced through the patches)
>> but it seems that marking a range volatile doesn't alter neither
>> reference bits nor position in the LRU. I thought that a volatile page
>> would be moved to the end of inactive LRU with the reference bit
>> dropped. Or is this expectation wrong and volatility is not supposed to
>> touch page aging?
> I'm not really convinced it should alter the aging.  Things could
> potentially go in and out of volatile state frequently, and requiring
> aging means we've got to go after them page-by-page or pte-by-pte at
> best.  That doesn't seem like something we want to do in a path we want
> to be fast.
>
> Why not just let normal page aging deal with them?  It seems to me like
> like trying to infer intended lru position from volatility is the wrong
> thing.  It's quite possible we'd have two pages in the same range that
> we want in completely different parts of the LRU.  Maybe the structure
> has a hot page and a cold one, and we would ideally want the cold one
> swapped out and not the hot one.
s/swapped/purged

But yea. Part of the request here is that, when talking with potential
users, some folks were particularly concerned that if we purge a page from
a range, we should purge the rest of that range before purging any pages
of other ranges. Minchan has pushed for a VRANGE_FULL flag (vs
VRANGE_PARTIAL) to trigger this sort of full-range purging semantics.

Subtly, the same potential user wanted the partial semantics as well,
since they could continue to access the unpurged volatile data, allowing
only the cold pages to be purged.

I'm not particularly fond of having a option to specify this behavior,
since I really want to leave all purging decisions to the VM and not
have userland expect a particular behavior for volatile purging (since
the right call at a system level may be different from one situation to
the next - much as userspace cannot expect constant memory access times
since some pages may be swapped out).

So one way to approximate full-range purging, while still doing page-based
purging, is to touch the pages being marked volatile as we mark them. They
will then all be of the same "age", and thus likely to be purged together
(assuming they haven't been accessed since being made volatile, in which
case the cold pages rightly are purged first). Now, while setting them all
to the same age, there is still the open question of what that age should
be, and I'm not sure the answer is clear yet. But as long as they age
together, we still get the (approximate) full-range purging behavior that
was desired.

Now.. one could also argue (as you have) that such behavior could be
done separately from the mark-volatile operation, possibly by making a
madvise call on the range prior to calling
vrange(VRANGE_VOLATILE,...).  This is attractive, since it lowers the
performance overhead. But I wanted to at least try to implement the page
referencing, since I had talked about it as a solution to the
FULL/PARTIAL purging issue.

thanks
-john








^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
  2014-03-19 10:12       ` Jan Kara
@ 2014-03-20  1:09         ` Minchan Kim
  2014-03-20  8:13           ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2014-03-20  1:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Johannes Weiner, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Andrea Arcangeli, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	Michel Lespinasse, linux-mm

Hello,

On Wed, Mar 19, 2014 at 11:12:02AM +0100, Jan Kara wrote:
> On Wed 19-03-14 09:49:18, Minchan Kim wrote:
> > On Tue, Mar 18, 2014 at 11:07:50AM -0700, John Stultz wrote:
> > > On Tue, Mar 18, 2014 at 8:11 AM, Minchan Kim <minchan@kernel.org> wrote:
> > > > 1) SIGBUS
> > > >
> > > > It's one of the arguable issue because some user want to get a
> > > > SIGBUS(ex, Firefox) while other want a just zero page(ex, Google
> > > > address sanitizer) without signal so it should be option.
> > > >
> > > >         int vrange(start, len, VRANGE_VOLATILE|VRANGE_ZERO, &purged);
> > > >         int vrange(start, len, VRANGE_VOLATILE|VRANGE_SIGNAL, &purged);
> > > 
> > > So, the zero-fill on volatile access feels like a *very* special case
> > > to me, since a null page could be valid data in many cases. Since
> > > support/interest for volatile ranges has been middling at best, I want
> > > to start culling the stranger use cases. I'm open in the future to
> > > adding a special flag or something if it really make sense, but at
> > > this point, lets just get the more general volatile range use cases
> > > supported.
> > 
> > I'm not sure it's special case. Because some user could reserve
> > a big volatile VMA and want to use the range by circle queue for
> > caching so overwriting could happen easily.
> > We should call vrange(NOVOLATILE) to prevent SIGBUS right before
> > overwriting. I feel it's unnecessary overhead and we could avoid
> > the cost with VRANGE_ZERO.
> > Do you think this usecase would be rare?
>   If I understand it correctly the buffer would be volatile all the time
> and userspace would like to opportunistically access it. Hum, but then with
> your automatic zero-filling it could see half of the page with data and
> half of the page zeroed out (the page got evicted in the middle of
> userspace reading it). I don't think that's a very comfortable interface to
> work with (you would have to very carefully verify the data you've read is
> really valid). And frankly in most cases I'm afraid the application would
> fail to do proper verification and crash randomly under memory pressure. So
> I wouldn't provide VRANGE_ZERO unless I come across real people for which
> avoiding marking the range as NONVOLATILE is a big deal and they are OK with
> handling all the odd situations that can happen.

Please consider the following use case.

Let's assume a big volatile cache.
When there is a request for a cached object, the cache manager finds the
object in the cache, calls vrange(NOVOLATILE) right before passing it to
the user, and checks whether it was purged or not. If it wasn't purged,
the cache manager can hand the object to the user.
But it's a circular cache, so when there are no requests from the user
the cache manager keeps overwriting objects, and that can easily hit
SIGBUS. So with the current semantics the cache manager always has to
call vrange(NOVOLATILE) right before overwriting. Otherwise, it has to
register a SIGBUS handler to unmark volatility page by page. SIGH.

If we supported VRANGE_ZERO, the cache manager could overwrite objects
without any SIGBUS handling or vrange(NOVOLATILE) calls. The only
vrange(NOVOLATILE) call needed would be when the cache manager hands an
object to the user.
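
For illustration, the per-page SIGBUS dance I mean would look roughly like
the sketch below (vrange() is a wrapper around the raw syscall as elsewhere
in this thread, refill_page() is a made-up cache-manager callback, and
page_size is cached at startup because sysconf() isn't safe to call from a
signal handler):

#include <signal.h>
#include <stddef.h>
#include <stdint.h>

#define VRANGE_NONVOLATILE	0

extern long vrange(void *start, size_t len, int mode, int *purged);
extern void refill_page(void *page);	/* made-up regeneration hook */

static long page_size;			/* = sysconf(_SC_PAGESIZE), set in main() */

static void cache_sigbus(int sig, siginfo_t *si, void *uctx)
{
	uintptr_t addr = (uintptr_t)si->si_addr & ~((uintptr_t)page_size - 1);
	int purged = 0;

	/* Unmark just the faulting page, regenerate its contents, and let
	 * the interrupted access restart when the handler returns. */
	vrange((void *)addr, page_size, VRANGE_NONVOLATILE, &purged);
	refill_page((void *)addr);
}

/* Registered once, e.g. from main():
 *	struct sigaction sa = { .sa_sigaction = cache_sigbus,
 *				.sa_flags = SA_SIGINFO };
 *	sigaction(SIGBUS, &sa, NULL);
 */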

> 
> That being said I agree with you that it makes sense to extend the syscall
> with flags argument so that we have some room for different modifications
> of the functionality.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
  2014-03-20  0:38   ` Dave Hansen
  2014-03-20  0:57     ` John Stultz
@ 2014-03-20  7:45     ` Minchan Kim
  1 sibling, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2014-03-20  7:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Michal Hocko, John Stultz, LKML, Andrew Morton,
	Android Kernel Team, Johannes Weiner, Robert Love, Mel Gorman,
	Hugh Dickins, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Andrea Arcangeli, Mike Hommey, Taras Glek, Dhaval Giani,
	Jan Kara, KOSAKI Motohiro, Michel Lespinasse, linux-mm

Hello Dave,

On Wed, Mar 19, 2014 at 05:38:10PM -0700, Dave Hansen wrote:
> On 03/18/2014 05:24 AM, Michal Hocko wrote:
> > On Fri 14-03-14 11:33:30, John Stultz wrote:
> > [...]
> >> Volatile ranges provides a method for userland to inform the kernel that
> >> a range of memory is safe to discard (ie: can be regenerated) but
> >> userspace may want to try access it in the future.  It can be thought of
> >> as similar to MADV_DONTNEED, but that the actual freeing of the memory
> >> is delayed and only done under memory pressure, and the user can try to
> >> cancel the action and be able to quickly access any unpurged pages. The
> >> idea originated from Android's ashmem, but I've since learned that other
> >> OSes provide similar functionality.
> > 
> > Maybe I have missed something (I've only glanced through the patches)
> > but it seems that marking a range volatile doesn't alter neither
> > reference bits nor position in the LRU. I thought that a volatile page
> > would be moved to the end of inactive LRU with the reference bit
> > dropped. Or is this expectation wrong and volatility is not supposed to
> > touch page aging?
> 
> I'm not really convinced it should alter the aging.  Things could
> potentially go in and out of volatile state frequently, and requiring
> aging means we've got to go after them page-by-page or pte-by-pte at
> best.  That doesn't seem like something we want to do in a path we want
> to be fast.

Since the vrange syscall design was changed from range-based to pte-based,
it isn't going to be fast anyway. Sure, vrange(VOLATILE) could be fast by
just setting VM_VOLATILE in vma->vm_flags, but vrange(NOVOLATILE) has to
look at every page in the range, so it could be slow.
Even though the vrange(VOLATILE) call is fast now, I want to account
volatile pages and expose them to the user via vmstat so that the user
can see the current status of system memory, which makes userspace
happier and more predictable. If we add such a stat, vrange(VOLATILE)
would have to look at every page in the range too, so it could also be
slow.

> 
> Why not just let normal page aging deal with them?  It seems to me like
> like trying to infer intended lru position from volatility is the wrong
> thing.  It's quite possible we'd have two pages in the same range that
> we want in completely different parts of the LRU.  Maybe the structure
> has a hot page and a cold one, and we would ideally want the cold one
> swapped out and not the hot one.

Yes, it's really arguable and it depends on the user's use case.
That's why I'd like to add a VRANGE_NORMAL_AGING flag which just leaves
the pages at their current position in the LRU. It would be useful when
used with VRANGE_SIGBUS because then partial pages can be handled.

Otherwise, I'd like to move those pages to the tail of the inactive list
so that we avoid reclaiming hot pages instead.
If there is no memory pressure, we get a chance to reuse the volatile
pages, so they can rotate back to the head of the LRU when the VM reclaim
logic is triggered.

I agree with John's opinion that we should keep the approach as simple as
possible and extend it later, so we should leave room in the syscall
semantics and agree on what the default should be for now.

Thanks.
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
  2014-03-20  1:09         ` Minchan Kim
@ 2014-03-20  8:13           ` Jan Kara
  2014-03-21  5:29             ` Minchan Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2014-03-20  8:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jan Kara, John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Johannes Weiner, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Andrea Arcangeli, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	Michel Lespinasse, linux-mm

On Thu 20-03-14 10:09:54, Minchan Kim wrote:
> Hello,
> 
> On Wed, Mar 19, 2014 at 11:12:02AM +0100, Jan Kara wrote:
> > On Wed 19-03-14 09:49:18, Minchan Kim wrote:
> > > On Tue, Mar 18, 2014 at 11:07:50AM -0700, John Stultz wrote:
> > > > On Tue, Mar 18, 2014 at 8:11 AM, Minchan Kim <minchan@kernel.org> wrote:
> > > > > 1) SIGBUS
> > > > >
> > > > > It's one of the arguable issue because some user want to get a
> > > > > SIGBUS(ex, Firefox) while other want a just zero page(ex, Google
> > > > > address sanitizer) without signal so it should be option.
> > > > >
> > > > >         int vrange(start, len, VRANGE_VOLATILE|VRANGE_ZERO, &purged);
> > > > >         int vrange(start, len, VRANGE_VOLATILE|VRANGE_SIGNAL, &purged);
> > > > 
> > > > So, the zero-fill on volatile access feels like a *very* special case
> > > > to me, since a null page could be valid data in many cases. Since
> > > > support/interest for volatile ranges has been middling at best, I want
> > > > to start culling the stranger use cases. I'm open in the future to
> > > > adding a special flag or something if it really make sense, but at
> > > > this point, lets just get the more general volatile range use cases
> > > > supported.
> > > 
> > > I'm not sure it's special case. Because some user could reserve
> > > a big volatile VMA and want to use the range by circle queue for
> > > caching so overwriting could happen easily.
> > > We should call vrange(NOVOLATILE) to prevent SIGBUS right before
> > > overwriting. I feel it's unnecessary overhead and we could avoid
> > > the cost with VRANGE_ZERO.
> > > Do you think this usecase would be rare?
> >   If I understand it correctly the buffer would be volatile all the time
> > and userspace would like to opportunistically access it. Hum, but then with
> > your automatic zero-filling it could see half of the page with data and
> > half of the page zeroed out (the page got evicted in the middle of
> > userspace reading it). I don't think that's a very comfortable interface to
> > work with (you would have to very carefully verify the data you've read is
> > really valid). And frankly in most cases I'm afraid the application would
> > fail to do proper verification and crash randomly under memory pressure. So
> > I wouldn't provide VRANGE_ZERO unless I come across real people for which
> > avoiding marking the range as NONVOLATILE is a big deal and they are OK with
> > handling all the odd situations that can happen.
> 
> Plaes think following usecase.
> 
> Let's assume big volatile cacne.
> If there is request for cache, it should find a object in a cache
> and if it found, it should call vrange(NOVOLATILE) right before
> passing it to the user and investigate it was purged or not.
> If it wasn't purged, cache manager could pass the object to the user.
> But it's circular cache so if there is no request from user, cache manager
> always overwrites objects so it could encounter SIGBUS easily
> so as current sematic, cache manager always should call vrange(NOVOLATILE)
> right before the overwriting. Otherwise, it should register SIGBUS handler
> to unmark volatile by page unit. SIGH.
> 
> If we support VRANGE_ZERO, cache manager could overwrite object without
> SIGBUS handling or vrange(NOVOLATILE) call. Just need is vrange(NOVOLATILE)
> call while cache manager pass it to the user.
  OK, that makes some sense but I don't think we have to implement this
functionality in the beginning...
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v11)
  2014-03-20  8:13           ` Jan Kara
@ 2014-03-21  5:29             ` Minchan Kim
  0 siblings, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2014-03-21  5:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Johannes Weiner, Robert Love, Mel Gorman, Hugh Dickins,
	Dave Hansen, Rik van Riel, Dmitry Adamushko, Neil Brown,
	Andrea Arcangeli, Mike Hommey, Taras Glek, KOSAKI Motohiro,
	Michel Lespinasse, linux-mm

On Thu, Mar 20, 2014 at 09:13:59AM +0100, Jan Kara wrote:
> On Thu 20-03-14 10:09:54, Minchan Kim wrote:
> > Hello,
> > 
> > On Wed, Mar 19, 2014 at 11:12:02AM +0100, Jan Kara wrote:
> > > On Wed 19-03-14 09:49:18, Minchan Kim wrote:
> > > > On Tue, Mar 18, 2014 at 11:07:50AM -0700, John Stultz wrote:
> > > > > On Tue, Mar 18, 2014 at 8:11 AM, Minchan Kim <minchan@kernel.org> wrote:
> > > > > > 1) SIGBUS
> > > > > >
> > > > > > It's one of the arguable issue because some user want to get a
> > > > > > SIGBUS(ex, Firefox) while other want a just zero page(ex, Google
> > > > > > address sanitizer) without signal so it should be option.
> > > > > >
> > > > > >         int vrange(start, len, VRANGE_VOLATILE|VRANGE_ZERO, &purged);
> > > > > >         int vrange(start, len, VRANGE_VOLATILE|VRANGE_SIGNAL, &purged);
> > > > > 
> > > > > So, the zero-fill on volatile access feels like a *very* special case
> > > > > to me, since a null page could be valid data in many cases. Since
> > > > > support/interest for volatile ranges has been middling at best, I want
> > > > > to start culling the stranger use cases. I'm open in the future to
> > > > > adding a special flag or something if it really make sense, but at
> > > > > this point, lets just get the more general volatile range use cases
> > > > > supported.
> > > > 
> > > > I'm not sure it's special case. Because some user could reserve
> > > > a big volatile VMA and want to use the range by circle queue for
> > > > caching so overwriting could happen easily.
> > > > We should call vrange(NOVOLATILE) to prevent SIGBUS right before
> > > > overwriting. I feel it's unnecessary overhead and we could avoid
> > > > the cost with VRANGE_ZERO.
> > > > Do you think this usecase would be rare?
> > >   If I understand it correctly the buffer would be volatile all the time
> > > and userspace would like to opportunistically access it. Hum, but then with
> > > your automatic zero-filling it could see half of the page with data and
> > > half of the page zeroed out (the page got evicted in the middle of
> > > userspace reading it). I don't think that's a very comfortable interface to
> > > work with (you would have to very carefully verify the data you've read is
> > > really valid). And frankly in most cases I'm afraid the application would
> > > fail to do proper verification and crash randomly under memory pressure. So
> > > I wouldn't provide VRANGE_ZERO unless I come across real people for which
> > > avoiding marking the range as NONVOLATILE is a big deal and they are OK with
> > > handling all the odd situations that can happen.
> > 
> > Plaes think following usecase.
> > 
> > Let's assume big volatile cacne.
> > If there is request for cache, it should find a object in a cache
> > and if it found, it should call vrange(NOVOLATILE) right before
> > passing it to the user and investigate it was purged or not.
> > If it wasn't purged, cache manager could pass the object to the user.
> > But it's circular cache so if there is no request from user, cache manager
> > always overwrites objects so it could encounter SIGBUS easily
> > so as current sematic, cache manager always should call vrange(NOVOLATILE)
> > right before the overwriting. Otherwise, it should register SIGBUS handler
> > to unmark volatile by page unit. SIGH.
> > 
> > If we support VRANGE_ZERO, cache manager could overwrite object without
> > SIGBUS handling or vrange(NOVOLATILE) call. Just need is vrange(NOVOLATILE)
> > call while cache manager pass it to the user.
>   OK, that makes some sense but I don't think we have to implement this
> functionality in the beginning...

Yeah, I'm not strongly against the idea of starting with a simple syscall
and leaving room for the future, but I believe the scenario I mentioned
is a typical use case for a volatile cache, and it could avoid
vrange(NOVOLATILE), which is heavier than vrange(VOLATILE) because
NOVOLATILE has to enumerate all of the ptes in the range in the current
implementation. So reducing NOVOLATILE calls looks important to me.


> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-03-21  5:29 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
2014-03-14 18:33 [PATCH 0/3] Volatile Ranges (v11) John Stultz
2014-03-14 18:33 ` [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas John Stultz
2014-03-17  9:21   ` Jan Kara
2014-03-17  9:43     ` Jan Kara
2014-03-18  0:36       ` John Stultz
2014-03-17 22:19     ` John Stultz
2014-03-14 18:33 ` [PATCH 2/3] vrange: Add purged page detection on setting memory non-volatile John Stultz
2014-03-17  9:39   ` Jan Kara
2014-03-17 22:22     ` John Stultz
2014-03-14 18:33 ` [PATCH 3/3] vrange: Add page purging logic & SIGBUS trap John Stultz
2014-03-18 12:24 ` [PATCH 0/3] Volatile Ranges (v11) Michal Hocko
2014-03-18 17:53   ` John Stultz
2014-03-20  0:38   ` Dave Hansen
2014-03-20  0:57     ` John Stultz
2014-03-20  7:45     ` Minchan Kim
2014-03-18 15:11 ` Minchan Kim
2014-03-18 18:07   ` John Stultz
     [not found]     ` <20140319004918.GB13475@bbox>
2014-03-19 10:12       ` Jan Kara
2014-03-20  1:09         ` Minchan Kim
2014-03-20  8:13           ` Jan Kara
2014-03-21  5:29             ` Minchan Kim
