linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] Support vranges on files
@ 2013-04-03 23:52 John Stultz
  2013-04-03 23:52 ` [RFC PATCH 1/4] vrange: Make various vrange.c local functions static John Stultz
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: John Stultz @ 2013-04-03 23:52 UTC
  To: linux-kernel
  Cc: John Stultz, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton, Minchan Kim

This patchset is against Minchan's vrange work here:
	https://lkml.org/lkml/2013/3/12/105

This series extends that work to support volatile ranges on files,
in effect providing the same functionality as my earlier file-based
volatile range patches on top of Minchan's anonymous volatile
range work.

Volatile ranges on files are different than on anonymous memory,
because the volatility state can be shared between multiple
applications. This makes storing the volatile ranges exclusively
in the mm_struct (or in vmas as in Minchan's earlier work)
inappropriate.

The patchset starts with some minor cleanup.

Then we introduce the idea of a vrange_root, which provides an
interval-tree root and a lock to protect the tree. This structure
can then be stored in the mm_struct or in an address_space, so the
same infrastructure can be used to manage volatile ranges on both
anonymous and file-backed memory.
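
As a rough illustration of that idea, here is a userspace sketch rather
than kernel code. All of the *_sketch names are made up for this example,
a plain linked list stands in for the interval tree, and the lock is left
out. It only shows that one helper can serve both owners once it takes
the root instead of the mm:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* One volatile range; bounds are inclusive, as in the patch. */
struct range_sketch {
	unsigned long start, last;
	bool purged;
	struct range_sketch *next;
};

/*
 * Stand-in for vrange_root: a list replaces the interval tree and the
 * lock is omitted because this sketch is single-threaded.
 */
struct vroot_sketch {
	struct range_sketch *head;
};

/* Two unrelated owners embed the same root type. */
struct mm_sketch      { struct vroot_sketch vroot; };
struct mapping_sketch { struct vroot_sketch vroot; };

/*
 * Add [start, last], absorbing any existing ranges it overlaps.
 * Ranges in a root never overlap each other, so one pass is enough.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int vroot_add(struct vroot_sketch *root,
		     unsigned long start, unsigned long last)
{
	struct range_sketch *new_range = malloc(sizeof(*new_range));
	struct range_sketch **pp = &root->head;
	bool purged = false;

	if (!new_range)
		return -1;

	while (*pp) {
		struct range_sketch *cur = *pp;

		if (cur->start <= last && cur->last >= start) {
			/* Overlap: grow the new range, drop the old one. */
			if (cur->start < start)
				start = cur->start;
			if (cur->last > last)
				last = cur->last;
			purged = purged || cur->purged;
			*pp = cur->next;
			free(cur);
		} else {
			pp = &cur->next;
		}
	}

	new_range->start = start;
	new_range->last = last;
	new_range->purged = purged;
	new_range->next = root->head;
	root->head = new_range;
	return 0;
}

/* Count the ranges currently stored in a root. */
static size_t vroot_count(const struct vroot_sketch *root)
{
	size_t n = 0;

	for (const struct range_sketch *r = root->head; r; r = r->next)
		n++;
	return n;
}
```

Calling vroot_add(&mm.vroot, ...) and vroot_add(&mapping.vroot, ...) goes
through the same code path. The helper never needs to know which kind of
object owns the root.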

Next we introduce a parallel fvrange() syscall for creating
volatile ranges directly against files.

And finally, we change the range purging logic to be able to
handle both anonymous and file volatile ranges.

There are still some quirks to be resolved with the approach
used here. The biggest one is that the vrange() call can't be used to
create volatile ranges against mmapped files. Instead, only
fvrange() can be used to create file-backed volatile ranges.

This could be overcome by iterating across all the process VMAs to
determine if they're anonymous or file based, and if file-based,
create a VMA sized volatile range on the mapping pointed to by the
VMA.

But this would have downsides, as Minchan has been clear that he wants
to optimize the vrange() call so that it is very cheap to create and
destroy volatile ranges. Having simple per-process ranges be created
means we don't have to iterate across the vmas in the range to
determine if they're anonymous or file backed. Instead the current
vrange() code just creates per process ranges (which may or may not
cover mmapped file data), but will only purge anonymous pages in
that range. This keeps the vrange() call cheap.

Additionally, just creating or destroying a single range is very
simple to do, and requires a fixed amount of memory known up front.
Thus we can allocate needed data prior to making any modifications.
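
A minimal userspace sketch of that allocate-first pattern follows. The
names punch_hole, alloc_fn and failing_alloc are hypothetical and exist
only for this example. It splits one range in two, which is the one case
where removing part of a range needs extra memory:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct range_sketch {
	unsigned long start, last;      /* inclusive bounds */
	struct range_sketch *next;
};

/* Allocator signature, so a failing one can be injected for testing. */
typedef void *(*alloc_fn)(size_t);

/* Allocator that always fails, used to exercise the error path. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/*
 * Remove [start, last] from the middle of r, splitting it in two.
 * Assumes r->start < start and last < r->last.
 * Returns 0 on success, or -1 with r left untouched if allocation fails.
 */
static int punch_hole(struct range_sketch *r, unsigned long start,
		      unsigned long last, alloc_fn alloc)
{
	/* Step 1: obtain everything we might need before touching r. */
	struct range_sketch *right = alloc(sizeof(*right));

	if (!right)
		return -1;

	/* Step 2: from here on nothing can fail. */
	right->start = last + 1;
	right->last = r->last;
	right->next = r->next;
	r->last = start - 1;
	r->next = right;
	return 0;
}
```

With failing_alloc the call returns -1 and the range is unchanged; with
malloc the range is split into [r->start, start - 1] and
[last + 1, old r->last].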

But if we were to create a range that crosses anonymous and file
backed pages, it must create or destroy multiple per-process or
per-file ranges. This could require an unknown number of allocations,
opening the possibility of getting an ENOMEM half-way through the
operation, leaving the volatile range partially created or destroyed.
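
To make the "unknown number of allocations" point concrete, here is a
small userspace sketch. The vma_sketch type and ranges_needed() are
invented for this example, and the counting rule is an assumption: one
range object per file-backed VMA touched, plus one per-process range for
each run of anonymous VMAs in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified VMA: [start, end) plus whether it maps a file. */
struct vma_sketch {
	unsigned long start, end;
	bool is_file;
};

/*
 * Count the range objects a single request over [start, end) would
 * need, given VMAs sorted by address. The answer depends entirely on
 * the layout, so it cannot be known before walking the VMAs.
 */
static size_t ranges_needed(const struct vma_sketch *vmas, size_t n,
			    unsigned long start, unsigned long end)
{
	size_t needed = 0;
	bool in_anon_run = false;

	for (size_t i = 0; i < n; i++) {
		const struct vma_sketch *v = &vmas[i];

		if (v->end <= start || v->start >= end)
			continue;	/* VMA does not overlap the request */

		if (v->is_file) {
			needed++;	/* one range on this file's mapping */
			in_anon_run = false;
		} else if (!in_anon_run) {
			needed++;	/* one per-process range per anon run */
			in_anon_run = true;
		}
	}
	return needed;
}
```

For a layout of anon, file, anon, anon, file, a request spanning all five
VMAs needs four range objects, while a request inside the first VMA needs
only one.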

So to keep this simple for this first pass, for now we have two
syscalls for two types of volatile ranges.

Let me know if you have any thoughts or comments. I'm sure there's
plenty of room for improvement here.

In the meantime I'll be playing with some different approaches to
try to handle single volatile ranges that cross file and anonymous
vmas.

The entire queue, both Minchan's changes and mine, can be found here:
git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-minchan

thanks
-john

Cc: linux-mm@kvack.org
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jason Evans <je@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>


John Stultz (4):
  vrange: Make various vrange.c local functions static
  vrange: Introduce vrange_root to make vrange structures more flexible
  vrange: Support fvrange() syscall for file based volatile ranges
  vrange: Enable purging of file backed volatile ranges

 arch/x86/syscalls/syscall_64.tbl |    1 +
 fs/file_table.c                  |    5 +
 fs/inode.c                       |    2 +
 fs/proc/task_mmu.c               |   10 +-
 include/linux/fs.h               |    2 +
 include/linux/mm_types.h         |    4 +-
 include/linux/vrange.h           |   60 ++++---
 include/linux/vrange_types.h     |   22 +++
 kernel/fork.c                    |    2 +-
 mm/vrange.c                      |  334 ++++++++++++++++++++++++++------------
 10 files changed, 308 insertions(+), 134 deletions(-)
 create mode 100644 include/linux/vrange_types.h

-- 
1.7.10.4



* [RFC PATCH 1/4] vrange: Make various vrange.c local functions static
  2013-04-03 23:52 [RFC PATCH 0/4] Support vranges on files John Stultz
@ 2013-04-03 23:52 ` John Stultz
  2013-04-03 23:52 ` [RFC PATCH 2/4] vrange: Introduce vrange_root to make vrange structures more flexible John Stultz
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: John Stultz @ 2013-04-03 23:52 UTC
  To: linux-kernel
  Cc: John Stultz, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton, Minchan Kim

Make a number of local functions in vrange.c static.

Cc: linux-mm@kvack.org
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jason Evans <je@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 mm/vrange.c |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/vrange.c b/mm/vrange.c
index c0c5d50..d07884d 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -45,7 +45,7 @@ static inline void __set_vrange(struct vrange *range,
 	range->node.last = end_idx;
 }
 
-void lru_add_vrange(struct vrange *vrange)
+static void lru_add_vrange(struct vrange *vrange)
 {
 	spin_lock(&lru_lock);
 	WARN_ON(!list_empty(&vrange->lru));
@@ -53,7 +53,7 @@ void lru_add_vrange(struct vrange *vrange)
 	spin_unlock(&lru_lock);
 }
 
-void lru_remove_vrange(struct vrange *vrange)
+static void lru_remove_vrange(struct vrange *vrange)
 {
 	spin_lock(&lru_lock);
 	if (!list_empty(&vrange->lru))
@@ -130,7 +130,7 @@ static inline void range_resize(struct rb_root *root,
 	__add_range(range, root, mm);
 }
 
-int add_vrange(struct mm_struct *mm,
+static int add_vrange(struct mm_struct *mm,
 			unsigned long start, unsigned long end)
 {
 	struct rb_root *root;
@@ -172,7 +172,7 @@ out:
 	return 0;
 }
 
-int remove_vrange(struct mm_struct *mm,
+static int remove_vrange(struct mm_struct *mm,
 		unsigned long start, unsigned long end)
 {
 	struct rb_root *root;
@@ -292,7 +292,7 @@ out:
 	return ret;
 }
 
-bool __vrange_address(struct mm_struct *mm,
+static bool __vrange_address(struct mm_struct *mm,
 			unsigned long start, unsigned long end)
 {
 	struct rb_root *root = &mm->v_rb;
@@ -387,7 +387,7 @@ static void __vrange_purge(struct mm_struct *mm,
 	}
 }
 
-int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
+static int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
 		unsigned long address)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -602,7 +602,7 @@ static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 }
 
-unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
+static unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, unsigned int nr_to_discard)
 {
@@ -669,7 +669,7 @@ out:
  * Get next victim vrange from LRU and hold a vrange refcount
  * and vrange->mm's refcount.
  */
-struct vrange *get_victim_vrange(void)
+static struct vrange *get_victim_vrange(void)
 {
 	struct mm_struct *mm;
 	struct vrange *vrange = NULL;
@@ -711,7 +711,7 @@ struct vrange *get_victim_vrange(void)
 	return vrange;
 }
 
-void put_victim_range(struct vrange *vrange)
+static void put_victim_range(struct vrange *vrange)
 {
 	put_vrange(vrange);
 	mmdrop(vrange->mm);
-- 
1.7.10.4



* [RFC PATCH 2/4] vrange: Introduce vrange_root to make vrange structures more flexible
  2013-04-03 23:52 [RFC PATCH 0/4] Support vranges on files John Stultz
  2013-04-03 23:52 ` [RFC PATCH 1/4] vrange: Make various vrange.c local functions static John Stultz
@ 2013-04-03 23:52 ` John Stultz
  2013-04-03 23:52 ` [RFC PATCH 3/4] vrange: Support fvrange() syscall for file based volatile ranges John Stultz
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: John Stultz @ 2013-04-03 23:52 UTC
  To: linux-kernel
  Cc: John Stultz, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton, Minchan Kim

Instead of having the vrange trees hanging directly off of the
mm_struct, use a vrange_root structure, which will allow us
to have vrange_roots that hang off the mm_struct for anonymous
memory, as well as address_space structures for file backed memory.
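
Because each vrange now records only its owning root, the mm has to be
recovered from the root's address. A userspace sketch of that step is
below. The *_sketch names and the local container_of_sketch macro are
stand-ins for this example, not the kernel definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of(). */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct vroot_sketch { int nr_ranges; };

/* The root is embedded in its owner, not pointed to. */
struct mm_sketch {
	long other_fields;
	struct vroot_sketch vroot;
};

struct range_sketch {
	struct vroot_sketch *owner;
};

/*
 * Recover the owning mm from a range. This is only valid when the
 * root really is embedded in an mm_sketch; applied to a root that
 * lives in some other structure it would yield a bogus pointer.
 */
static struct mm_sketch *range_owner_mm(struct range_sketch *range)
{
	return container_of_sketch(range->owner, struct mm_sketch, vroot);
}
```

That caveat is presumably why the next patch in the series tags each
root with a type (VRANGE_ANON or VRANGE_FILE) when it is initialized.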

Cc: linux-mm@kvack.org
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jason Evans <je@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 fs/proc/task_mmu.c           |   10 +--
 include/linux/mm_types.h     |    4 +-
 include/linux/vrange.h       |   35 +++++-----
 include/linux/vrange_types.h |   21 ++++++
 kernel/fork.c                |    2 +-
 mm/vrange.c                  |  156 ++++++++++++++++++++++--------------------
 6 files changed, 126 insertions(+), 102 deletions(-)
 create mode 100644 include/linux/vrange_types.h

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index df009f0..11f63d4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -391,13 +391,13 @@ static void *v_start(struct seq_file *m, loff_t *pos)
 	if (!mm || IS_ERR(mm))
 		return mm;
 
-	vrange_lock(mm);
-	root = &mm->v_rb;
+	vrange_lock(&mm->vroot);
+	root = &mm->vroot.v_rb;
 
-	if (RB_EMPTY_ROOT(&mm->v_rb))
+	if (RB_EMPTY_ROOT(&mm->vroot.v_rb))
 		goto out;
 
-	next = rb_first(&mm->v_rb);
+	next = rb_first(&mm->vroot.v_rb);
 	range = vrange_entry(next);
 	while(n > 0 && range) {
 		n--;
@@ -432,7 +432,7 @@ static void v_stop(struct seq_file *m, void *v)
 	struct proc_vrange_private *priv = m->private;
 	if (priv->task) {
 		struct mm_struct *mm = priv->task->mm;
-		vrange_unlock(mm);
+		vrange_unlock(&mm->vroot);
 		mmput(mm);
 		put_task_struct(priv->task);
 	}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 080bf74..2e02a6d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/mutex.h>
+#include <linux/vrange_types.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -353,8 +354,7 @@ struct mm_struct {
 
 
 #ifdef CONFIG_MMU
-	struct rb_root v_rb;		/* vrange rb tree */
-	struct mutex v_lock;		/* Protect v_rb */
+	struct vrange_root vroot;
 #endif
 	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
 	unsigned long hiwater_vm;	/* High-water virtual memory usage */
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 4bcec40..b9b219c 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -1,42 +1,39 @@
 #ifndef _LINUX_VRANGE_H
 #define _LINUX_VRANGE_H
 
-#include <linux/mutex.h>
-#include <linux/interval_tree.h>
+#include <linux/vrange_types.h>
 #include <linux/mm.h>
 
-struct vrange {
-	struct interval_tree_node node;
-	bool purged;
-	struct mm_struct *mm;
-	struct list_head lru; /* protected by lru_lock */
-	atomic_t refcount;
-};
-
 #define vrange_entry(ptr) \
 	container_of(ptr, struct vrange, node.rb)
 
 #ifdef CONFIG_MMU
-struct mm_struct;
 
 static inline void mm_init_vrange(struct mm_struct *mm)
 {
-	mm->v_rb = RB_ROOT;
-	mutex_init(&mm->v_lock);
+	mm->vroot.v_rb = RB_ROOT;
+	mutex_init(&mm->vroot.v_lock);
+}
+
+static inline void vrange_lock(struct vrange_root *vroot)
+{
+	mutex_lock(&vroot->v_lock);
 }
 
-static inline void vrange_lock(struct mm_struct *mm)
+static inline void vrange_unlock(struct vrange_root *vroot)
 {
-	mutex_lock(&mm->v_lock);
+	mutex_unlock(&vroot->v_lock);
 }
 
-static inline void vrange_unlock(struct mm_struct *mm)
+static inline struct mm_struct *vrange_get_owner_mm(struct vrange *vrange)
 {
-	mutex_unlock(&mm->v_lock);
+
+	return container_of(vrange->owner, struct mm_struct, vroot);
 }
 
-extern void exit_vrange(struct mm_struct *mm);
+
 void vrange_init(void);
+extern void mm_exit_vrange(struct mm_struct *mm);
 int discard_vpage(struct page *page);
 bool vrange_address(struct mm_struct *mm, unsigned long start,
 			unsigned long end);
@@ -50,7 +47,7 @@ void lru_move_vrange_to_head(struct mm_struct *mm, unsigned long address);
 
 static inline void vrange_init(void) {};
 static inline void mm_init_vrange(struct mm_struct *mm) {};
-static inline void exit_vrange(struct mm_struct *mm);
+static inline void mm_exit_vrange(struct mm_struct *mm);
 
 static inline bool vrange_address(struct mm_struct *mm, unsigned long start,
 		unsigned long end) { return false; };
diff --git a/include/linux/vrange_types.h b/include/linux/vrange_types.h
new file mode 100644
index 0000000..bede336
--- /dev/null
+++ b/include/linux/vrange_types.h
@@ -0,0 +1,21 @@
+#ifndef _LINUX_VRANGE_TYPES_H
+#define _LINUX_VRANGE_TYPES_H
+
+#include <linux/mutex.h>
+#include <linux/interval_tree.h>
+
+struct vrange_root {
+	struct rb_root v_rb;		/* vrange rb tree */
+	struct mutex v_lock;		/* Protect v_rb */
+};
+
+
+struct vrange {
+	struct interval_tree_node node;
+	struct vrange_root *owner;
+	bool purged;
+	struct list_head lru; /* protected by lru_lock */
+	atomic_t refcount;
+};
+#endif
+
diff --git a/kernel/fork.c b/kernel/fork.c
index e3aa120..f2da4a0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -614,7 +614,7 @@ void mmput(struct mm_struct *mm)
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		uprobe_clear_state(mm);
-		exit_vrange(mm);
+		mm_exit_vrange(mm);
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
diff --git a/mm/vrange.c b/mm/vrange.c
index d07884d..9facbbc 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -39,10 +39,12 @@ void __init vrange_init(void)
 }
 
 static inline void __set_vrange(struct vrange *range,
-		unsigned long start_idx, unsigned long end_idx)
+		unsigned long start_idx, unsigned long end_idx,
+		bool purged)
 {
 	range->node.start = start_idx;
 	range->node.last = end_idx;
+	range->purged = purged;
 }
 
 static void lru_add_vrange(struct vrange *vrange)
@@ -63,12 +65,13 @@ static void lru_remove_vrange(struct vrange *vrange)
 
 void lru_move_vrange_to_head(struct mm_struct *mm, unsigned long address)
 {
-	struct rb_root *root = &mm->v_rb;
+	struct vrange_root *vroot = &mm->vroot;
 	struct interval_tree_node *node;
 	struct vrange *vrange;
 
-	vrange_lock(mm);
-	node = interval_tree_iter_first(root, address, address + PAGE_SIZE - 1);
+	vrange_lock(vroot);
+	node = interval_tree_iter_first(&vroot->v_rb, address,
+						address + PAGE_SIZE - 1);
 	if (node) {
 		vrange = container_of(node, struct vrange, node);
 		spin_lock(&lru_lock);
@@ -81,22 +84,21 @@ void lru_move_vrange_to_head(struct mm_struct *mm, unsigned long address)
 			list_move(&vrange->lru, &lru_vrange);
 		spin_unlock(&lru_lock);
 	}
-	vrange_unlock(mm);
+	vrange_unlock(vroot);
 }
 
-static void __add_range(struct vrange *range,
-			struct rb_root *root, struct mm_struct *mm)
+static void __add_range(struct vrange *range, struct vrange_root *vroot)
 {
-	range->mm = mm;
+	range->owner = vroot;
 	lru_add_vrange(range);
-	interval_tree_insert(&range->node, root);
+	interval_tree_insert(&range->node, &vroot->v_rb);
 }
 
 /* remove range from interval tree */
-static void __remove_range(struct vrange *range,
-				struct rb_root *root)
+static void __remove_range(struct vrange *range)
 {
-	interval_tree_remove(&range->node, root);
+	interval_tree_remove(&range->node, &range->owner->v_rb);
+	range->owner = NULL;
 }
 
 static struct vrange *alloc_vrange(void)
@@ -104,11 +106,13 @@ static struct vrange *alloc_vrange(void)
 	struct vrange *vrange = kmem_cache_alloc(vrange_cachep, GFP_KERNEL);
 	if (vrange)
 		atomic_set(&vrange->refcount, 1);
+	vrange->owner = NULL;
 	return vrange;
 }
 
 static void free_vrange(struct vrange *range)
 {
+	WARN_ON(range->owner);
 	lru_remove_vrange(range);
 	kmem_cache_free(vrange_cachep, range);
 }
@@ -120,20 +124,20 @@ static void put_vrange(struct vrange *range)
 		free_vrange(range);
 }
 
-static inline void range_resize(struct rb_root *root,
-		struct vrange *range,
-		unsigned long start, unsigned long end,
-		struct mm_struct *mm)
+static inline void range_resize(struct vrange *range,
+		unsigned long start, unsigned long end)
 {
-	__remove_range(range, root);
-	__set_vrange(range, start, end);
-	__add_range(range, root, mm);
+	struct vrange_root *vroot = range->owner;
+	bool purged = range->purged;
+
+	__remove_range(range);
+	__set_vrange(range, start, end, purged);
+	__add_range(range, vroot);
 }
 
-static int add_vrange(struct mm_struct *mm,
+static int add_vrange(struct vrange_root *vroot,
 			unsigned long start, unsigned long end)
 {
-	struct rb_root *root;
 	struct vrange *new_range, *range;
 	struct interval_tree_node *node, *next;
 	int purged = 0;
@@ -142,9 +146,8 @@ static int add_vrange(struct mm_struct *mm,
 	if (!new_range)
 		return -ENOMEM;
 
-	root = &mm->v_rb;
-	vrange_lock(mm);
-	node = interval_tree_iter_first(root, start, end);
+	vrange_lock(vroot);
+	node = interval_tree_iter_first(&vroot->v_rb, start, end);
 	while (node) {
 		next = interval_tree_iter_next(node, start, end);
 
@@ -158,24 +161,22 @@ static int add_vrange(struct mm_struct *mm,
 		end = max_t(unsigned long, end, node->last);
 
 		purged |= range->purged;
-		__remove_range(range, root);
+		__remove_range(range);
 		put_vrange(range);
 
 		node = next;
 	}
 
-	__set_vrange(new_range, start, end);
-	new_range->purged = purged;
-	__add_range(new_range, root, mm);
+	__set_vrange(new_range, start, end, purged);
+	__add_range(new_range, vroot);
 out:
-	vrange_unlock(mm);
+	vrange_unlock(vroot);
 	return 0;
 }
 
-static int remove_vrange(struct mm_struct *mm,
+static int remove_vrange(struct vrange_root *vroot,
 		unsigned long start, unsigned long end)
 {
-	struct rb_root *root;
 	struct vrange *new_range, *range;
 	struct interval_tree_node *node, *next;
 	int ret	= 0;
@@ -185,10 +186,9 @@ static int remove_vrange(struct mm_struct *mm,
 	if (!new_range)
 		return -ENOMEM;
 
-	root = &mm->v_rb;
-	vrange_lock(mm);
+	vrange_lock(vroot);
 
-	node = interval_tree_iter_first(root, start, end);
+	node = interval_tree_iter_first(&vroot->v_rb, start, end);
 	while (node) {
 		next = interval_tree_iter_next(node, start, end);
 
@@ -196,42 +196,40 @@ static int remove_vrange(struct mm_struct *mm,
 		ret |= range->purged;
 
 		if (start <= node->start && end >= node->last) {
-			__remove_range(range, root);
+			__remove_range(range);
 			put_vrange(range);
 		} else if (node->start >= start) {
-			range_resize(root, range, end, node->last, mm);
+			range_resize(range, end, node->last);
 		} else if (node->last <= end) {
-			range_resize(root, range, node->start, start, mm);
+			range_resize(range, node->start, start);
 		} else {
 			used_new = true;
-			__set_vrange(new_range, end, node->last);
-			new_range->purged = range->purged;
-			new_range->mm = mm;
-			range_resize(root, range, node->start, start, mm);
-			__add_range(new_range, root, mm);
+			__set_vrange(new_range, end, node->last, range->purged);
+			range_resize(range, node->start, start);
+			__add_range(new_range, vroot);
 			break;
 		}
 
 		node = next;
 	}
 
-	vrange_unlock(mm);
+	vrange_unlock(vroot);
 	if (!used_new)
 		put_vrange(new_range);
 
 	return ret;
 }
 
-void exit_vrange(struct mm_struct *mm)
+void mm_exit_vrange(struct mm_struct *mm)
 {
 	struct vrange *range;
 	struct rb_node *next;
 
-	next = rb_first(&mm->v_rb);
+	next = rb_first(&mm->vroot.v_rb);
 	while (next) {
 		range = vrange_entry(next);
 		next = rb_next(next);
-		__remove_range(range, &mm->v_rb);
+		__remove_range(range);
 		put_vrange(range);
 	}
 }
@@ -285,17 +283,18 @@ SYSCALL_DEFINE4(vrange, unsigned long, start,
 		goto out;
 
 	if (mode == VRANGE_VOLATILE)
-		ret = add_vrange(mm, start, end - 1);
+		ret = add_vrange(&mm->vroot, start, end - 1);
 	else if (mode == VRANGE_NOVOLATILE)
-		ret = remove_vrange(mm, start, end - 1);
+		ret = remove_vrange(&mm->vroot, start, end - 1);
 out:
 	return ret;
 }
 
+
 static bool __vrange_address(struct mm_struct *mm,
 			unsigned long start, unsigned long end)
 {
-	struct rb_root *root = &mm->v_rb;
+	struct rb_root *root = &mm->vroot.v_rb;
 	struct interval_tree_node *node;
 
 	node = interval_tree_iter_first(root, start, end);
@@ -306,10 +305,11 @@ bool vrange_address(struct mm_struct *mm,
 			unsigned long start, unsigned long end)
 {
 	bool ret;
+	struct vrange_root *vroot = &mm->vroot;
 
-	vrange_lock(mm);
+	vrange_lock(vroot);
 	ret = __vrange_address(mm, start, end);
-	vrange_unlock(mm);
+	vrange_unlock(vroot);
 	return ret;
 }
 
@@ -372,14 +372,13 @@ static inline pte_t *vpage_check_address(struct page *page,
 	return ptep;
 }
 
-static void __vrange_purge(struct mm_struct *mm,
+static void __vrange_purge(struct vrange_root *vroot,
 		unsigned long start, unsigned long end)
 {
-	struct rb_root *root = &mm->v_rb;
-	struct vrange *range;
 	struct interval_tree_node *node;
+	struct vrange *range;
 
-	node = interval_tree_iter_first(root, start, end);
+	node = interval_tree_iter_first(&vroot->v_rb, start, end);
 	while (node) {
 		range = container_of(node, struct vrange, node);
 		range->purged = true;
@@ -396,20 +395,19 @@ static int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = 0;
 	bool present;
+	struct vrange_root *vroot = &mm->vroot;
 
 	VM_BUG_ON(!PageLocked(page));
 
-	vrange_lock(mm);
+	vrange_lock(vroot);
 	pte = vpage_check_address(page, mm, address, &ptl);
 	if (!pte) {
-		vrange_unlock(mm);
 		goto out;
 	}
 
 	if (vma->vm_flags & VM_LOCKED) {
 		pte_unmap_unlock(pte, ptl);
-		vrange_unlock(mm);
-		return 0;
+		goto out;
 	}
 
 	present = pte_present(*pte);
@@ -431,12 +429,13 @@ static int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	set_pte_at(mm, address, pte, pteval);
-	__vrange_purge(mm, address, address + PAGE_SIZE -1);
+	__vrange_purge(&mm->vroot, address, address + PAGE_SIZE - 1);
 	pte_unmap_unlock(pte, ptl);
 	mmu_notifier_invalidate_page(mm, address);
-	vrange_unlock(mm);
 	ret = 1;
+
 out:
+	vrange_unlock(vroot);
 	return ret;
 }
 
@@ -458,12 +457,14 @@ static int try_to_discard_vpage(struct page *page)
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
 		pte_t *pte;
 		spinlock_t *ptl;
+		struct vrange_root *vroot;
 
 		vma = avc->vma;
 		mm = vma->vm_mm;
+		vroot = &mm->vroot;
 		address = vma_address(page, vma);
 
-		vrange_lock(mm);
+		vrange_lock(vroot);
 		/*
 		 * We can't use page_check_address because it doesn't check
 		 * swap entry of the page table. We need the check because
@@ -473,24 +474,24 @@ static int try_to_discard_vpage(struct page *page)
 		 */
 		pte = vpage_check_address(page, mm, address, &ptl);
 		if (!pte) {
-			vrange_unlock(mm);
+			vrange_unlock(vroot);
 			continue;
 		}
 
 		if (vma->vm_flags & VM_LOCKED) {
 			pte_unmap_unlock(pte, ptl);
-			vrange_unlock(mm);
+			vrange_unlock(vroot);
 			goto out;
 		}
 
 		pte_unmap_unlock(pte, ptl);
 		if (!__vrange_address(mm, address,
 					address + PAGE_SIZE - 1)) {
-			vrange_unlock(mm);
+			vrange_unlock(vroot);
 			goto out;
 		}
 
-		vrange_unlock(mm);
+		vrange_unlock(vroot);
 	}
 
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
@@ -531,19 +532,20 @@ int discard_vpage(struct page *page)
 
 bool is_purged_vrange(struct mm_struct *mm, unsigned long address)
 {
-	struct rb_root *root = &mm->v_rb;
+	struct vrange_root *vroot = &mm->vroot;
 	struct interval_tree_node *node;
 	struct vrange *range;
 	bool ret = false;
 
-	vrange_lock(mm);
-	node = interval_tree_iter_first(root, address, address + PAGE_SIZE - 1);
+	vrange_lock(vroot);
+	node = interval_tree_iter_first(&vroot->v_rb, address,
+						address + PAGE_SIZE - 1);
 	if (node) {
 		range = container_of(node, struct vrange, node);
 		if (range->purged)
 			ret = true;
 	}
-	vrange_unlock(mm);
+	vrange_unlock(vroot);
 	return ret;
 }
 
@@ -631,12 +633,14 @@ static unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
 unsigned int discard_vrange(struct zone *zone, struct vrange *vrange,
 				int nr_to_discard)
 {
-	struct mm_struct *mm = vrange->mm;
+	struct mm_struct *mm;
 	unsigned long start = vrange->node.start;
 	unsigned long end = vrange->node.last;
 	struct vm_area_struct *vma;
 	unsigned int nr_discarded = 0;
 
+	mm = vrange_get_owner_mm(vrange);
+
 	if (!down_read_trylock(&mm->mmap_sem))
 		goto out;
 
@@ -678,7 +682,7 @@ static struct vrange *get_victim_vrange(void)
 	spin_lock(&lru_lock);
 	list_for_each_prev_safe(cur, tmp, &lru_vrange) {
 		vrange = list_entry(cur, struct vrange, lru);
-		mm = vrange->mm;
+		mm = vrange_get_owner_mm(vrange);
 		/* the process is exiting so pass it */
 		if (atomic_read(&mm->mm_users) == 0) {
 			list_del_init(&vrange->lru);
@@ -698,7 +702,7 @@ static struct vrange *get_victim_vrange(void)
 		 * need to get a refcount of mm.
 		 * NOTE: We guarantee mm_count isn't zero in here because
 		 * if we found vrange from LRU list, it means we are
-		 * before exit_vrange or remove_vrange.
+		 * before mm_exit_vrange or remove_vrange.
 		 */
 		atomic_inc(&mm->mm_count);
 
@@ -713,8 +717,10 @@ static struct vrange *get_victim_vrange(void)
 
 static void put_victim_range(struct vrange *vrange)
 {
+	struct mm_struct *mm = vrange_get_owner_mm(vrange);
+
 	put_vrange(vrange);
-	mmdrop(vrange->mm);
+	mmdrop(mm);
 }
 
 unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard)
@@ -724,7 +730,7 @@ unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard)
 
 	start_vrange = vrange = get_victim_vrange();
 	if (start_vrange) {
-		struct mm_struct *mm = start_vrange->mm;
+		struct mm_struct *mm = vrange_get_owner_mm(vrange);
 		atomic_inc(&start_vrange->refcount);
 		atomic_inc(&mm->mm_count);
 	}
-- 
1.7.10.4



* [RFC PATCH 3/4] vrange: Support fvrange() syscall for file based volatile ranges
  2013-04-03 23:52 [RFC PATCH 0/4] Support vranges on files John Stultz
  2013-04-03 23:52 ` [RFC PATCH 1/4] vrange: Make various vrange.c local functions static John Stultz
  2013-04-03 23:52 ` [RFC PATCH 2/4] vrange: Introduce vrange_root to make vrange structures more flexible John Stultz
@ 2013-04-03 23:52 ` John Stultz
  2013-04-03 23:52 ` [RFC PATCH 4/4] vrange: Enable purging of file backed " John Stultz
  2013-04-04  6:55 ` [RFC PATCH 0/4] Support vranges on files Minchan Kim
  4 siblings, 0 replies; 15+ messages in thread
From: John Stultz @ 2013-04-03 23:52 UTC
  To: linux-kernel
  Cc: John Stultz, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton, Minchan Kim

Add vrange support on address_space structures, and add fvrange()
syscall for creating ranges on address_space structures.

Cc: linux-mm@kvack.org
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jason Evans <je@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 arch/x86/syscalls/syscall_64.tbl |    1 +
 fs/file_table.c                  |    5 +++
 fs/inode.c                       |    2 ++
 include/linux/fs.h               |    2 ++
 include/linux/vrange.h           |   19 +++++++++-
 include/linux/vrange_types.h     |    1 +
 mm/vrange.c                      |   72 +++++++++++++++++++++++++++++++++++++-
 7 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index dc332bd..910d9f3 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -321,6 +321,7 @@
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
 314	common	vrange			sys_vrange
+315	common	fvrange			sys_fvrange
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/file_table.c b/fs/file_table.c
index cd4d87a..61c8aaa 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -26,6 +26,7 @@
 #include <linux/hardirq.h>
 #include <linux/task_work.h>
 #include <linux/ima.h>
+#include <linux/vrange.h>
 
 #include <linux/atomic.h>
 
@@ -244,6 +245,10 @@ static void __fput(struct file *file)
 			file->f_op->fasync(-1, file, 0);
 	}
 	ima_file_free(file);
+
+	/* drop all vranges on last close */
+	mapping_exit_vrange(inode->i_mapping);
+
 	if (file->f_op && file->f_op->release)
 		file->f_op->release(inode, file);
 	security_file_free(file);
diff --git a/fs/inode.c b/fs/inode.c
index f5f7c06..4707c95 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -17,6 +17,7 @@
 #include <linux/prefetch.h>
 #include <linux/buffer_head.h> /* for inode_has_buffers */
 #include <linux/ratelimit.h>
+#include <linux/vrange.h>
 #include "internal.h"
 
 /*
@@ -350,6 +351,7 @@ void address_space_init_once(struct address_space *mapping)
 	spin_lock_init(&mapping->private_lock);
 	mapping->i_mmap = RB_ROOT;
 	INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
+	mapping_init_vrange(mapping);
 }
 EXPORT_SYMBOL(address_space_init_once);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..6f86c7c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -27,6 +27,7 @@
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
+#include <linux/vrange_types.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -411,6 +412,7 @@ struct address_space {
 	struct rb_root		i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
+	struct vrange_root	vroot;
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index b9b219c..91960eb 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -3,6 +3,7 @@
 
 #include <linux/vrange_types.h>
 #include <linux/mm.h>
+#include <linux/fs.h>
 
 #define vrange_entry(ptr) \
 	container_of(ptr, struct vrange, node.rb)
@@ -11,10 +12,19 @@
 
 static inline void mm_init_vrange(struct mm_struct *mm)
 {
+	mm->vroot.type = VRANGE_ANON;
 	mm->vroot.v_rb = RB_ROOT;
 	mutex_init(&mm->vroot.v_lock);
 }
 
+static inline void mapping_init_vrange(struct address_space *mapping)
+{
+	mapping->vroot.type = VRANGE_FILE;
+	mapping->vroot.v_rb = RB_ROOT;
+	mutex_init(&mapping->vroot.v_lock);
+}
+
+
 static inline void vrange_lock(struct vrange_root *vroot)
 {
 	mutex_lock(&vroot->v_lock);
@@ -25,15 +35,22 @@ static inline void vrange_unlock(struct vrange_root *vroot)
 	mutex_unlock(&vroot->v_lock);
 }
 
-static inline struct mm_struct *vrange_get_owner_mm(struct vrange *vrange)
+static inline int vrange_type(struct vrange *vrange)
 {
+	return vrange->owner->type;
+}
 
+static inline struct mm_struct *vrange_get_owner_mm(struct vrange *vrange)
+{
+	if (vrange_type(vrange) != VRANGE_ANON)
+		return NULL;
 	return container_of(vrange->owner, struct mm_struct, vroot);
 }
 
 
 void vrange_init(void);
 extern void mm_exit_vrange(struct mm_struct *mm);
+extern void mapping_exit_vrange(struct address_space *mapping);
 int discard_vpage(struct page *page);
 bool vrange_address(struct mm_struct *mm, unsigned long start,
 			unsigned long end);
diff --git a/include/linux/vrange_types.h b/include/linux/vrange_types.h
index bede336..c7154e4 100644
--- a/include/linux/vrange_types.h
+++ b/include/linux/vrange_types.h
@@ -7,6 +7,7 @@
 struct vrange_root {
 	struct rb_root v_rb;		/* vrange rb tree */
 	struct mutex v_lock;		/* Protect v_rb */
+	enum {VRANGE_ANON, VRANGE_FILE} type; /* range root type */
 };
 
 
diff --git a/mm/vrange.c b/mm/vrange.c
index 9facbbc..671909c 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -14,6 +14,7 @@
 #include <linux/swapops.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
+#include <linux/file.h>
 
 struct vrange_walker_private {
 	struct zone *zone;
@@ -234,6 +235,20 @@ void mm_exit_vrange(struct mm_struct *mm)
 	}
 }
 
+void mapping_exit_vrange(struct address_space *mapping)
+{
+	struct vrange *range;
+	struct rb_node *next;
+
+	next = rb_first(&mapping->vroot.v_rb);
+	while (next) {
+		range = vrange_entry(next);
+		next = rb_next(next);
+		__remove_range(range);
+		put_vrange(range);
+	}
+}
+
 /*
  * The vrange(2) system call.
  *
@@ -291,6 +306,51 @@ out:
 }
 
 
+SYSCALL_DEFINE5(fvrange, int, fd, size_t, offset,
+		size_t, len, int, mode, int, behavior)
+{
+	struct fd f = fdget(fd);
+	struct address_space *mapping;
+	u64 start = offset;
+	u64 end;
+	int ret = -EINVAL;
+
+	if (!f.file)
+		return -EBADF;
+
+	if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+		ret = -ESPIPE;
+		goto out;
+	}
+
+	mapping = f.file->f_mapping;
+	if (!mapping || len < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (start & ~PAGE_MASK)
+		goto out;
+
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	if (mode == VRANGE_VOLATILE)
+		ret = add_vrange(&mapping->vroot, start, end - 1);
+	else if (mode == VRANGE_NOVOLATILE)
+		ret = remove_vrange(&mapping->vroot, start, end - 1);
+out:
+	fdput(f);
+	return ret;
+}
+
+
 static bool __vrange_address(struct mm_struct *mm,
 			unsigned long start, unsigned long end)
 {
@@ -641,6 +701,9 @@ unsigned int discard_vrange(struct zone *zone, struct vrange *vrange,
 
 	mm = vrange_get_owner_mm(vrange);
 
+	if (!mm)
+		goto out;
+
 	if (!down_read_trylock(&mm->mmap_sem))
 		goto out;
 
@@ -683,6 +746,12 @@ static struct vrange *get_victim_vrange(void)
 	list_for_each_prev_safe(cur, tmp, &lru_vrange) {
 		vrange = list_entry(cur, struct vrange, lru);
 		mm = vrange_get_owner_mm(vrange);
+
+		if (!mm) {
+			vrange = NULL;
+			continue;
+		}
+
 		/* the process is exiting so pass it */
 		if (atomic_read(&mm->mm_users) == 0) {
 			list_del_init(&vrange->lru);
@@ -720,7 +789,8 @@ static void put_victim_range(struct vrange *vrange)
 	struct mm_struct *mm = vrange_get_owner_mm(vrange);
 
 	put_vrange(vrange);
-	mmdrop(mm);
+	if (mm)
+		mmdrop(mm);
 }
 
 unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard)
-- 
1.7.10.4


* [RFC PATCH 4/4] vrange: Enable purging of file backed volatile ranges
  2013-04-03 23:52 [RFC PATCH 0/4] Support vranges on files John Stultz
                   ` (2 preceding siblings ...)
  2013-04-03 23:52 ` [RFC PATCH 3/4] vrange: Support fvrange() syscall for file based volatile ranges John Stultz
@ 2013-04-03 23:52 ` John Stultz
  2013-04-04  6:55 ` [RFC PATCH 0/4] Support vranges on files Minchan Kim
  4 siblings, 0 replies; 15+ messages in thread
From: John Stultz @ 2013-04-03 23:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Stultz, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton, Minchan Kim

Rework the victim range selection to also support
file backed volatile ranges.

Cc: linux-mm@kvack.org
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jason Evans <je@fb.com>
Cc: sanjay@google.com
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/vrange.h |    8 ++++
 mm/vrange.c            |  118 +++++++++++++++++++++++++++++++++---------------
 2 files changed, 89 insertions(+), 37 deletions(-)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 91960eb..bada2bd 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -47,6 +47,14 @@ static inline struct mm_struct *vrange_get_owner_mm(struct vrange *vrange)
 	return container_of(vrange->owner, struct mm_struct, vroot);
 }
 
+static inline
+struct address_space *vrange_get_owner_mapping(struct vrange *vrange)
+{
+	if (vrange_type(vrange) != VRANGE_FILE)
+		return NULL;
+	return container_of(vrange->owner, struct address_space, vroot);
+}
+
 
 void vrange_init(void);
 extern void mm_exit_vrange(struct mm_struct *mm);
diff --git a/mm/vrange.c b/mm/vrange.c
index 671909c..b652513 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -690,8 +690,9 @@ static unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
 	return ret;
 }
 
-unsigned int discard_vrange(struct zone *zone, struct vrange *vrange,
-				int nr_to_discard)
+static unsigned int discard_anon_vrange(struct zone *zone,
+					struct vrange *vrange,
+					int nr_to_discard)
 {
 	struct mm_struct *mm;
 	unsigned long start = vrange->node.start;
@@ -732,52 +733,91 @@ out:
 	return nr_discarded;
 }
 
+static unsigned int discard_file_vrange(struct zone *zone,
+					struct vrange *vrange,
+					int nr_to_discard)
+{
+	struct address_space *mapping;
+	unsigned long start = vrange->node.start;
+	unsigned long end = vrange->node.last;
+	unsigned long count = ((end-start) >> PAGE_CACHE_SHIFT);
+
+	mapping = vrange_get_owner_mapping(vrange);
+
+	truncate_inode_pages_range(mapping, start, end);
+	vrange->purged = true;
+
+	return count;
+}
+
+unsigned int discard_vrange(struct zone *zone, struct vrange *vrange,
+				int nr_to_discard)
+{
+	if (vrange_type(vrange) == VRANGE_ANON)
+		return discard_anon_vrange(zone, vrange, nr_to_discard);
+	return discard_file_vrange(zone, vrange, nr_to_discard);
+}
+
+
+/* Take a vrange refcount and depending on the type
+ * the vrange->owner's mm refcount or inode refcount
+ */
+static int hold_victim_vrange(struct vrange *vrange)
+{
+	if (vrange_type(vrange) == VRANGE_ANON) {
+		struct mm_struct *mm = vrange_get_owner_mm(vrange);
+
+
+		if (atomic_read(&mm->mm_users) == 0)
+			return -1;
+
+
+		if (!atomic_inc_not_zero(&vrange->refcount))
+			return -1;
+		/*
+		 * we need to access mmap_sem further routine so
+		 * need to get a refcount of mm.
+		 * NOTE: We guarantee mm_count isn't zero in here because
+		 * if we found vrange from LRU list, it means we are
+		 * before exit_vrange or remove_vrange.
+		 */
+		atomic_inc(&mm->mm_count);
+	} else {
+		struct address_space *mapping;
+		mapping = vrange_get_owner_mapping(vrange);
+
+		if (!atomic_inc_not_zero(&vrange->refcount))
+			return -1;
+		__iget(mapping->host);
+	}
+
+	return 0;
+}
+
+
+
 /*
- * Get next victim vrange from LRU and hold a vrange refcount
- * and vrange->mm's refcount.
+ * Get next victim vrange from LRU and hold needed refcounts.
  */
 static struct vrange *get_victim_vrange(void)
 {
-	struct mm_struct *mm;
 	struct vrange *vrange = NULL;
 	struct list_head *cur, *tmp;
 
 	spin_lock(&lru_lock);
 	list_for_each_prev_safe(cur, tmp, &lru_vrange) {
 		vrange = list_entry(cur, struct vrange, lru);
-		mm = vrange_get_owner_mm(vrange);
-
-		if (!mm) {
-			vrange = NULL;
-			continue;
-		}
 
-		/* the process is exiting so pass it */
-		if (atomic_read(&mm->mm_users) == 0) {
+		if (hold_victim_vrange(vrange)) {
 			list_del_init(&vrange->lru);
 			vrange = NULL;
 			continue;
 		}
 
-		/* vrange is freeing so continue to loop */
-		if (!atomic_inc_not_zero(&vrange->refcount)) {
-			list_del_init(&vrange->lru);
-			vrange = NULL;
-			continue;
-		}
-
-		/*
-		 * we need to access mmap_sem further routine so
-		 * need to get a refcount of mm.
-		 * NOTE: We guarantee mm_count isn't zero in here because
-		 * if we found vrange from LRU list, it means we are
-		 * before mm_exit_vrange or remove_vrange.
-		 */
-		atomic_inc(&mm->mm_count);
-
 		/* Isolate vrange */
 		list_del_init(&vrange->lru);
 		break;
+
 	}
 
 	spin_unlock(&lru_lock);
@@ -786,11 +826,18 @@ static struct vrange *get_victim_vrange(void)
 
 static void put_victim_range(struct vrange *vrange)
 {
-	struct mm_struct *mm = vrange_get_owner_mm(vrange);
-
 	put_vrange(vrange);
-	if (mm)
+
+	if (vrange_type(vrange) == VRANGE_ANON) {
+		struct mm_struct *mm = vrange_get_owner_mm(vrange);
+
 		mmdrop(mm);
+	} else {
+		struct address_space *mapping;
+
+		mapping = vrange_get_owner_mapping(vrange);
+		iput(mapping->host);
+	}
 }
 
 unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard)
@@ -799,11 +846,8 @@ unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard)
 	unsigned int nr_discarded = 0;
 
 	start_vrange = vrange = get_victim_vrange();
-	if (start_vrange) {
-		struct mm_struct *mm = vrange_get_owner_mm(vrange);
-		atomic_inc(&start_vrange->refcount);
-		atomic_inc(&mm->mm_count);
-	}
+	if (start_vrange)
+		hold_victim_vrange(start_vrange);
 
 	while (vrange) {
 		nr_discarded += discard_vrange(zone, vrange, nr_to_discard);
-- 
1.7.10.4


* Re: [RFC PATCH 0/4] Support vranges on files
  2013-04-03 23:52 [RFC PATCH 0/4] Support vranges on files John Stultz
                   ` (3 preceding siblings ...)
  2013-04-03 23:52 ` [RFC PATCH 4/4] vrange: Enable purging of file backed " John Stultz
@ 2013-04-04  6:55 ` Minchan Kim
  2013-04-04 17:37   ` John Stultz
  4 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2013-04-04  6:55 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton

Hey John,

First of all, I should confess I just glanced at your code and a few
questions popped up. If I missed something, please slap me.

On Wed, Apr 03, 2013 at 04:52:19PM -0700, John Stultz wrote:
> This patchset is against Minchan's vrange work here:
> 	https://lkml.org/lkml/2013/3/12/105
> 
> Extending it to support volatile ranges on files. In effect
> providing the same functionality of my earlier file based
> volatile range patches on-top of Minchan's anonymous volatile
> range work.
> 
> Volatile ranges on files are different then on anonymous memory,
> because the volatility state can be shared between multiple
> applications. This makes storing the volatile ranges exclusively
> in the mm_struct (or in vmas as in Minchan's earlier work)
> inappropriate.
> 
> The patchset starts with some minor cleanup.
> 
> Then we introduce the idea of a vrange_root, which provides a
> interval-tree root and a lock to protect the tree. This structure
> can then be stored in the mm_struct or in an addres_space. Then the
> same infrastructure can be used to manage volatile ranges on both
> anonymous and file backed memory.

Thanks for the above two patches. It is a nice cleanup.

> 
> Next we introduce a parallel fvrange() syscall for creating
> volatile ranges directly against files.

Okay. It seems you want to replace the ashmem interface with fvrange.
I doubt we need to eat a system call slot for it. Can't we add "int fd"
to the vrange system call instead of reinventing the wheel?

> 
> And finally, we change the range pruging logic to be able to
> handle both anonymous and file volatile ranges.

Okay. Then, what are the semantics of file-vrange?

There is a file F. Process A maps some part of the file into its
address space. Then process B calls fvrange on the same part.
As I read your code, it purges the range even though process B
is using it now. Right? Is that your intention? Maybe it isn't.

Let's define fvrange's semantics to be the same as anon-vrange's.
If any process is using the range as non-volatile, we shouldn't
purge it at all.

So your [4/4] should check all processes that map the page,
atomically. You could do it with i_mmap_mutex and vrange_lock
and push the logic down into try_to_discard_vpage.

> 
> Now there are some quirks still to be resolved with the approach
> used here. The biggest one being the vrange() call can't be used to
> create volatile ranges against mmapped files. Instead only the

Why?

> fvrange() can be used to create file backed volatile ranges.

I couldn't understand your point. It would be better if I explain
my thinking first; then you can point out what I am missing.
Look below.

> 
> This could be overcome by iterating across all the process VMAs to
> determine if they're anonymous or file based, and if file-based,
> create a VMA sized volatile range on the mapping pointed to by the
> VMA.

That is needed only when we start to discard pages. It belongs to
the reclaim path, NOT the system call path, so it's not a problem.

> 
> But this would have downsides, as Minchan has been clear that he wants
> to optmize the vrange() calls so that it is very cheap to create and
> destroy volatile ranges. Having simple per-process ranges be created
> means we don't have to iterate across the vmas in the range to
> determine if they're anonymous or file backed. Instead the current
> vrange() code just creates per process ranges (which may or may not
> cover mmapped file data), but will only purge anonymous pages in
> that range. This keeps the vrange() call cheap.

Right.

> 
> Additionally, just creating or destroying a single range is very
> simple to do, and requires a fixed amount of memory known up front.
> Thus we can allocate needed data prior to making any modifications.
> 
> But If we were to create a range that crosses anonymous and file
> backed pages, it must create or destroy multiple per-process or
> per-file ranges. This could require an unknown number of allocations,

This is the part where I fail to parse your point.

> opening the possibility of getting an ENOMEM half-way through the
> operation, leaving the volatile range partially created or destroyed.
> 
> So to keep this simple for this first pass, for now we have two
> syscalls for two types of volatile ranges.


My idea is as follows:

        vrange(fd, start, len, mode, behavior)

A) fd = 0

1) system call context - the vrange system call registers a new vrange
   in mm_struct.
2) Add the new vrange to the LRU
3) reclaim context - walk the rmap to confirm that all processes have
   marked the range volatile -> discard

B) fd = 1

1) system call context - the vrange system call registers a new vrange
   in address_space
2) Add the new vrange to the LRU
3) reclaim context - walk the rmap to confirm that all processes have
   marked the range volatile -> discard

What's the problem with this logic?

> 
> Let me know if you have any thoughts or comments. I'm sure there's
> plenty of room for improvement here.
> 
> In the meantime I'll be playing with some different approaches to
> try to handle single volatile ranges that cross file and anonymous
> vmas.
> 
> The entire queue, both Minchan's changes and mine can be found here:
> git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-minchan
> 
> thanks
> -john
> 
-- 
Kind regards,
Minchan Kim

* Re: [RFC PATCH 0/4] Support vranges on files
  2013-04-04  6:55 ` [RFC PATCH 0/4] Support vranges on files Minchan Kim
@ 2013-04-04 17:37   ` John Stultz
  2013-04-05  7:55     ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: John Stultz @ 2013-04-04 17:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton

On 04/03/2013 11:55 PM, Minchan Kim wrote:
> On Wed, Apr 03, 2013 at 04:52:19PM -0700, John Stultz wrote:
>> Next we introduce a parallel fvrange() syscall for creating
>> volatile ranges directly against files.
> Okay. It seems you want to replace ashmem interface with fvrange.
> I dobut we have to eat a slot for system call. Can't we add "int fd"
> in vrange systemcall without inventing new wheel?

Sure, that would be doable. I just added the new syscall to make the 
differences in functionality clear.
Once the subtleties are understood, we can condense things down if we 
think it's best.


>> And finally, we change the range pruging logic to be able to
>> handle both anonymous and file volatile ranges.
> Okay. Then, what's the semantic file-vrange?
>
> There is a file F. Process A mapped some part of file into his
> address space. Then, Process B calls fvrange same part.
> As I looked over your code, it purges the range although process B
> is using now. Right? Is it your intention? Maybe isn't.

Not sure if your example has a typo and you meant "process A is 
using it"?  If so, yes. The point is the volatility is shared and 
consistent across all users of the file, in the same way the data in the 
file is shared. If process B punched a hole in the file, process A would 
see the effect immediately. With volatile ranges, the hole punching is 
just delayed and possibly done later by the kernel, in effect on behalf 
of process B, so the behavior is the same.

Consider the case where we could have two processes mmap a tmpfs file in 
order to create a circular buffer shared between them. You could then 
have a producer/consumer relationship with two processes where any data 
not between the head & tail offsets were marked volatile. The producer 
would mark tail+size non-volatile, write the data, and update the tail 
offset. The consumer would read data from the head offset, mark the 
just-read range as volatile, and update the offset.

In this example, the producer would be the only process to mark data 
non-volatile, while the consumer would be the only one marking ranges 
volatile. Thus the state of volatility would need to be an attribute of 
the file, not the process, in the same way the shared data is.
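
(Illustrative sketch, not part of the patchset.) The producer/consumer
pattern above can be modelled in plain userspace C. Everything here is
an assumption made for illustration: struct shared_ring, its
volatile_slot[] array (a stand-in for the per-file vrange tree) and the
ring_* helpers are hypothetical names, and no real fvrange() call or
purging is involved. The point is only that both sides update one
shared volatility record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SLOT_SIZE 4096	/* one page per slot, for simplicity */
#define NSLOTS    8

/* Hypothetical shared state: it lives with the file, not with a process. */
struct shared_ring {
	size_t head;			/* next slot the consumer reads */
	size_t tail;			/* next slot the producer writes */
	bool volatile_slot[NSLOTS];	/* stand-in for the per-file vranges */
	char data[NSLOTS][SLOT_SIZE];
};

static void ring_init(struct shared_ring *r)
{
	memset(r, 0, sizeof(*r));
	/* Nothing lies between head and tail yet, so all slots are volatile. */
	for (size_t i = 0; i < NSLOTS; i++)
		r->volatile_slot[i] = true;
}

/* Producer: mark non-volatile, write, advance tail. False if full. */
static bool ring_produce(struct shared_ring *r, const char *msg)
{
	size_t slot = r->tail % NSLOTS;

	if (r->tail - r->head == NSLOTS)
		return false;
	r->volatile_slot[slot] = false;	/* where VRANGE_NOVOLATILE would go */
	snprintf(r->data[slot], SLOT_SIZE, "%s", msg);
	r->tail++;
	return true;
}

/* Consumer: read, mark volatile, advance head. False if empty. */
static bool ring_consume(struct shared_ring *r, char *out, size_t outlen)
{
	size_t slot = r->head % NSLOTS;

	assert(outlen > 0);
	if (r->head == r->tail)
		return false;
	snprintf(out, outlen, "%s", r->data[slot]);
	r->volatile_slot[slot] = true;	/* where VRANGE_VOLATILE would go */
	r->head++;
	return true;
}

/* Count the slots the kernel would currently be free to purge. */
static size_t ring_volatile_count(const struct shared_ring *r)
{
	size_t n = 0;

	for (size_t i = 0; i < NSLOTS; i++)
		n += r->volatile_slot[i];
	return n;
}
```

Only the producer ever clears a flag and only the consumer ever sets
one, yet both must see the same flags, which is why the state cannot
live in either process's mm.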

Is that clear?



> Let's define fvrange's semantic same with anon-vrange.
> If there is a process using range with non-volatile, at least,
> we shouldn't purge at all.

So this I'm not in agreement with.

Anonymous pages are for the most part not shared, except via COW. And 
for the COW case, yes, I agree, we shouldn't purge those pages.

Similarly (and I have yet to handle this in the code), for private 
mapped files, those pages shouldn't be purged either (or purging them 
shouldn't affect the private mapped pages - not sure which direction to 
go here).

But for shared mapped files, we need to keep the volatility state shared 
as well.


>> Now there are some quirks still to be resolved with the approach
>> used here. The biggest one being the vrange() call can't be used to
>> create volatile ranges against mmapped files. Instead only the
> Why?

As explained above, the volatility is shared like the data. The current 
vrange() code creates per-mm volatile ranges, which aren't shared.


>
>> fvrange() can be used to create file backed volatile ranges.
> I could't understand your point. It would be better to explain
> my thought firstly then, you could point out something I am missing
> now. Look below.
>
>> This could be overcome by iterating across all the process VMAs to
>> determine if they're anonymous or file based, and if file-based,
>> create a VMA sized volatile range on the mapping pointed to by the
>> VMA.
> It needs just when we start to discard pages. Simply, it is related
> to reclaim path, NOT system call path so it's not a problem.

The reason we can't defer this to only the reclaim path is if volatile 
ranges on shared mappings are stored in the mm_struct, if process A sets 
up a volatile range on a shared mapping, but stores the volatility in 
its own mm, then process B wants to clear the volatility on the range, 
process B would have to iterate over all processes that have those file 
vmas mapped and change them.

Additionally if process A sets up a volatile range on a shared mapped 
file, then quits, the volatility state dies with that process.

Either way, it's not just a simple matter of handling data on your own 
mm_struct. That's fine for the process' own anonymous memory, but 
doesn't work for shared file mappings.
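
(Illustrative sketch, not part of the patchset.) The difference between
the two storage choices can be shown with a toy userspace model. The
names fake_mapping, fake_mm, mark_per_mm(), mark_per_file() and
any_mm_volatile() are all hypothetical; the boolean arrays merely stand
in for the vroot kept in mm_struct or address_space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4

/* Hypothetical per-file state, standing in for address_space.vroot. */
struct fake_mapping {
	bool vol[NPAGES];
};

/* Hypothetical per-process state, standing in for mm_struct.vroot. */
struct fake_mm {
	bool vol[NPAGES];
	struct fake_mapping *file;	/* the shared file this process maps */
};

/* Per-mm scheme: each process records volatility only in its own mm. */
static void mark_per_mm(struct fake_mm *mm, size_t page, bool is_volatile)
{
	assert(page < NPAGES);
	mm->vol[page] = is_volatile;
}

/* Per-file scheme: every process updates the one record on the mapping. */
static void mark_per_file(struct fake_mm *mm, size_t page, bool is_volatile)
{
	assert(page < NPAGES);
	mm->file->vol[page] = is_volatile;
}

/*
 * Under the per-mm scheme, finding out whether any record still says
 * "volatile" means scanning every mm that maps the file.
 */
static bool any_mm_volatile(const struct fake_mm *mms, size_t nmms,
			    size_t page)
{
	assert(page < NPAGES);
	for (size_t i = 0; i < nmms; i++)
		if (mms[i].vol[page])
			return true;
	return false;
}
```

With per-mm storage, process B clearing its own record leaves process
A's record untouched, so the page still looks volatile to a scan; with
per-file storage B's update replaces A's directly.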


>
>> But this would have downsides, as Minchan has been clear that he wants
>> to optmize the vrange() calls so that it is very cheap to create and
>> destroy volatile ranges. Having simple per-process ranges be created
>> means we don't have to iterate across the vmas in the range to
>> determine if they're anonymous or file backed. Instead the current
>> vrange() code just creates per process ranges (which may or may not
>> cover mmapped file data), but will only purge anonymous pages in
>> that range. This keeps the vrange() call cheap.
> Right.
>
>> Additionally, just creating or destroying a single range is very
>> simple to do, and requires a fixed amount of memory known up front.
>> Thus we can allocate needed data prior to making any modifications.
>>
>> But If we were to create a range that crosses anonymous and file
>> backed pages, it must create or destroy multiple per-process or
>> per-file ranges. This could require an unknown number of allocations,
> This is a part I can fail to parse your opinion.

So if the vrange() code were to iterate over all the VMAs in the 
range, creating VMA-sized ranges on either the mm_struct or the backing 
address_space where appropriate, it's possible that we could hit an 
ENOMEM halfway through the operation. This would leave the range in 
an inconsistent state: partially marked, and potentially causing us to 
lose the purged state on the subranges.
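
(Illustrative sketch, not part of the patchset.) To see why the number
of allocations is not known up front, here is a small userspace model
of splitting one request along VMA boundaries. struct fake_vma, struct
subrange and split_by_vma() are hypothetical; the VMA array is assumed
to be sorted and non-overlapping, and adjacent VMAs of the same type
are not merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a VMA: [start, end) plus a type. */
struct fake_vma {
	unsigned long start;
	unsigned long end;
	bool file_backed;
};

/* One piece of the request, destined for the mm or for a file mapping. */
struct subrange {
	unsigned long start;
	unsigned long end;
	bool file_backed;
};

/*
 * Split [start, end) along the boundaries of the given VMAs. At most
 * max entries are written to out; the return value is the number of
 * sub-ranges the request needs, which may exceed max.
 */
static size_t split_by_vma(const struct fake_vma *vmas, size_t nvmas,
			   unsigned long start, unsigned long end,
			   struct subrange *out, size_t max)
{
	size_t needed = 0;

	assert(start < end);
	for (size_t i = 0; i < nvmas; i++) {
		unsigned long s = vmas[i].start > start ? vmas[i].start : start;
		unsigned long e = vmas[i].end < end ? vmas[i].end : end;

		if (s >= e)
			continue;	/* this VMA does not intersect the request */
		if (needed < max) {
			out[needed].start = s;
			out[needed].end = e;
			out[needed].file_backed = vmas[i].file_backed;
		}
		needed++;
	}
	return needed;
}
```

The count only falls out of the walk itself, so each sub-range would
need its own allocation made during that walk. Counting first and
allocating everything before modifying any tree is one conceivable
direction, but it is not what the posted patches do.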



>
>> opening the possibility of getting an ENOMEM half-way through the
>> operation, leaving the volatile range partially created or destroyed.
>>
>> So to keep this simple for this first pass, for now we have two
>> syscalls for two types of volatile ranges.
>
> My idea is following as
>
>          vrange(fd, start, len, mode, behavior)
>
> A) fd = 0

Well we'd probably need to use -1 or something that would be an invalid 
fd here.

And really, I think having separate interfaces might be good, just as 
there are separate madvise() and fadvise() calls (and when all this is 
done, we may need to re-visit the new syscall vs new madvise/fadvise 
flags decision).
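
(Illustrative sketch, not part of the patchset.) If the two calls were
merged, the entry point would have to pick a root from the fd argument.
A minimal userspace model of that dispatch follows; FAKE_VRANGE_NO_FD,
struct fake_root and pick_root() are hypothetical names, and the
file_roots array merely stands in for looking up a file's
address_space.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the VRANGE_ANON / VRANGE_FILE distinction in the patch. */
enum fake_root_type { FAKE_ROOT_ANON, FAKE_ROOT_FILE };

struct fake_root {
	enum fake_root_type type;
};

/* Hypothetical sentinel: 0 is a valid descriptor, so it cannot be used. */
#define FAKE_VRANGE_NO_FD (-1)

/*
 * Choose which root a merged call would operate on. Returns NULL for a
 * bad descriptor, where a real syscall would return -EBADF.
 */
static struct fake_root *pick_root(int fd, struct fake_root *mm_root,
				   struct fake_root *file_roots, size_t nfiles)
{
	assert(mm_root != NULL);
	if (fd == FAKE_VRANGE_NO_FD)
		return mm_root;		/* anonymous, per-process ranges */
	if (fd < 0 || (size_t)fd >= nfiles)
		return NULL;
	return &file_roots[fd];		/* file-backed, shared ranges */
}
```

This covers only the choice of root. It does not address the harder
case discussed further down, where one address range spans both
anonymous and file-backed pages.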

>
> 1) system call context - vrange system call registers new vrange
>     in mm_struct.
> 2) Add new vrange into LRU
> 3) reclaim context - walk with rmap to confirm all processes make
>     the range with volatile -> discard
>
> B) fd = 1
The fd would just need to be valid, right? Not 1.

> 1) system call context - vrange system call registers new vrange
>     in address_space
> 2) Add new vrange into LRU
> 3) reclaim context - walk with rmap to confirm all processes make
>     the range with volatile -> discard
>
> What's the problem in this logic?

The problem is only if in the first case, the volatile range being 
created crosses over both anonymous and shared file mmap pages. In that 
case we have to create appropriate sub-ranges on the mm_struct, and 
sub-ranges on the address_space of the mmaped file.

This isn't impossible to do, but again, the handling of errors mid-way 
through creating subranges is problematic (there may yet be a way around 
it, I just haven't thought of it yet).


Thus with my patches, I simplified the problem a bit by partitioning it 
into two separate problems and two separate interfaces: Volatile ranges 
that are created by the vrange() call won't affect mmaped pages, only 
anonymous pages. We may create a range that covers them, but the 
volatility isn't shared with other processes and the purging logic still 
skips file pages. If you want to create a volatile range on file 
pages, you have to use fvrange().

Of course, my patchset has its own inconsistencies too, since if a range 
is marked non-volatile that covers a mmapped file that has been marked 
volatile, that volatility would persist. So I probably should return an 
error if the vrange call covers any mmapped files.


Also, to be clear, I'm not saying that we *have* to partition these 
operations into two separate behaviors, but I think having two separate 
behaviors at first helps make clear the subtleties of the differences 
between them.


Let me know if any of this helps your understanding. :)

thanks
-john

* Re: [RFC PATCH 0/4] Support vranges on files
  2013-04-04 17:37   ` John Stultz
@ 2013-04-05  7:55     ` Minchan Kim
  2013-04-08  0:46       ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2013-04-05  7:55 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton

Hi John,

On Thu, Apr 04, 2013 at 10:37:52AM -0700, John Stultz wrote:
> On 04/03/2013 11:55 PM, Minchan Kim wrote:
> >On Wed, Apr 03, 2013 at 04:52:19PM -0700, John Stultz wrote:
> >>Next we introduce a parallel fvrange() syscall for creating
> >>volatile ranges directly against files.
> >Okay. It seems you want to replace ashmem interface with fvrange.
> >I dobut we have to eat a slot for system call. Can't we add "int fd"
> >in vrange systemcall without inventing new wheel?
> 
> Sure, that would be doable. I just added the new syscall to make the
> differences in functionality clear.
> Once the subtleties are understood, we can condense things down if
> we think its best.

Fair enough.

> 
> 
> >>And finally, we change the range pruging logic to be able to
> >>handle both anonymous and file volatile ranges.
> >Okay. Then, what's the semantic file-vrange?
> >
> >There is a file F. Process A mapped some part of file into his
> >address space. Then, Process B calls fvrange same part.
> >As I looked over your code, it purges the range although process B
> >is using now. Right? Is it your intention? Maybe isn't.
> 
> Not sure if you're example has a type-o and you meant "process A is
> using it"?  If so, yes. The point is the volatility is shared and
> consistent across all users of the file, in the same way the data in
> the file is shared. If process B punched a hole in the file, process
> A would see the effect immediately. With volatile ranges, the hole
> punching is just delayed and possibly done later by the kernel, in
> effect on behalf of process B, so the behavior is the same.
> 
> Consider the case where we could have two processes mmap a tmpfs
> file in order to create a circular buffer shared between them. You
> could then have a producer/consumer relationship with two processes
> where any data not between the head & tail offsets were marked
> volatile. The producer would mark tail+size non-volatile, write the
> data, and update the tail offset. The consumer would read data from
> the head offset, mark the just-read range as volatile, and update
> the offset.
> 
> In this example, the producer would be the only process to mark data
> non-volatile, while the consumer would be the only one marking
> ranges volatile. Thus the state of volatility would need to be an
> attribute of the file, not the process, in the same way the shared
> data is.
> 
> Is that clear?

Yes, I got your point: you meant shared mappings.
Let's enumerate more examples.

1. Process A maps FILE A with MAP_SHARED
   Process B maps FILE A with MAP_SHARED
   Process C calls fvrange
   Discard all pages of processes A and B -> Makes sense to me.

2. Process A maps FILE A with MAP_PRIVATE and uses it read-only
   Process B maps FILE A with MAP_PRIVATE and uses it write-only
   Process C calls fvrange

   What happens? I expect process A loses all its pages while process B
   keeps its COWed pages.

3. Process A maps FILE A with MAP_PRIVATE and uses it read/write
   Process C calls fvrange

   Non-COWed pages in process A are lost while COWed pages are kept.
   A mix.

Are all of the above your intention?
It would be much clearer if you wrote down the semantics you intend
for private mapped files and shared mapped files. ;-)

> 
> 
> 
> >Let's define fvrange's semantic same with anon-vrange.
> >If there is a process using range with non-volatile, at least,
> >we shouldn't purge at all.
> 
> So this I'm not in agreement with.

I got your point.

> 
> Anonymous pages are for the most part not shared, except via COW.
> And for the COW case, yes, I agree, we shouldn't purge those pages.
> 
> Similarly (and I have yet to handle this in the code), for private
> mapped files, those pages shouldn't be purged either (or purging
> them shouldn't affect the private mapped pages - not sure which
> direction to go here).

Yep. It's questionable.
Unless I misread the code, fallocate's hole punching removes non-COWed
pages even when they are mapped privately.
If I'm right, that looks very strange to me: COWed pages remain
in memory while NOT-YET-COWed pages are discarded. :(
Hmm.

> 
> But for shared mapped files, we need to keep the volatility state
> shared as well.
> 
> 
> >>Now there are some quirks still to be resolved with the approach
> >>used here. The biggest one being the vrange() call can't be used to
> >>create volatile ranges against mmapped files. Instead only the
> >Why?
> 
> As explained above, the volatility is shared like the data. The
> current vrange() code creates per-mm volatile ranges, which aren't
> shared.

Strictly speaking, I think we could do it with only per-mm volatile ranges.
But the concern with that approach, as you mention below, is that we would
have to iterate over every process's mm_struct in system call context.
Of course, I don't like that; it would be a bad design.

> 
> 
> >
> >>fvrange() can be used to create file backed volatile ranges.
> >I could't understand your point. It would be better to explain
> >my thought firstly then, you could point out something I am missing
> >now. Look below.
> >
> >>This could be overcome by iterating across all the process VMAs to
> >>determine if they're anonymous or file based, and if file-based,
> >>create a VMA sized volatile range on the mapping pointed to by the
> >>VMA.
> >It needs just when we start to discard pages. Simply, it is related
> >to reclaim path, NOT system call path so it's not a problem.
> 
> The reason we can't defer this to only the reclaim path is if
> volatile ranges on shared mappings are stored in the mm_struct, if
> process A sets up a volatile range on a shared mapping, but stores
> the volatility in its own mm, then process B wants to clear the
> volatility on the range, process B would have to iterate over all
> processes that have those file vmas mapped and change them.

Right. I think iterating over all the relevant vmas isn't a big cost
in the normal case, but it could get much more expensive when memory
pressure is severe, especially for file-backed pages, because the lock
involved isn't even a read/write lock.
I'd like to minimize the system call overhead if possible.

> 
> Additionally if process A sets up a volatile range on a shared
> mapped file, then quits, the volatility state dies with that
> process.

Yes, so you don't want to use the vrange system call for mmapped-file
ranges at the moment?

> 
> Either way, it's not just a simple matter of handling data on your
> own mm_struct. That's fine for the process' own anonymous memory,
> but doesn't work for shared file mappings.

Agreed.

> 
> 
> >
> >>But this would have downsides, as Minchan has been clear that he wants
> >>to optmize the vrange() calls so that it is very cheap to create and
> >>destroy volatile ranges. Having simple per-process ranges be created
> >>means we don't have to iterate across the vmas in the range to
> >>determine if they're anonymous or file backed. Instead the current
> >>vrange() code just creates per process ranges (which may or may not
> >>cover mmapped file data), but will only purge anonymous pages in
> >>that range. This keeps the vrange() call cheap.
> >Right.
> >
> >>Additionally, just creating or destroying a single range is very
> >>simple to do, and requires a fixed amount of memory known up front.
> >>Thus we can allocate needed data prior to making any modifications.
> >>
> >>But If we were to create a range that crosses anonymous and file
> >>backed pages, it must create or destroy multiple per-process or
> >>per-file ranges. This could require an unknown number of allocations,
> >This is a part I can fail to parse your opinion.
> 
> So if we were in the vrange() code to iterate over all the VMAs in
> the range, creating VMA-sized ranges on either the mm_struct or the
> backing address_space where appropriate, it's possible that we could
> hit an ENOMEM halfway through the operation. This would leave the
> range in an inconsistent state: partially marked, and potentially
> causing us to lose the purged state on the subranges.
> 
> 
> 
> >
> >>opening the possibility of getting an ENOMEM half-way through the
> >>operation, leaving the volatile range partially created or destroyed.
> >>
> >>So to keep this simple for this first pass, for now we have two
> >>syscalls for two types of volatile ranges.
> >
> >My idea is following as
> >
> >         vrange(fd, start, len, mode, behavior)
> >
> >A) fd = 0
> 
> Well we'd probably need to use -1 or something that would be an
> invalid fd here.
> 
> And really, I think having separate interfaces might be good, just
> as there are separate madvise() and fadvise() calls (and when all
> this is done, we may need to re-visit the new syscall vs new
> madvise/fadvise flags decision).

That makes sense at this stage, while we are still at RFC.

> 
> >
> >1) system call context - vrange system call registers new vrange
> >    in mm_struct.
> >2) Add new vrange into LRU
> >3) reclaim context - walk with rmap to confirm all processes make
> >    the range with volatile -> discard
> >
> >B) fd = 1
> The fd would just need to be valid right, not 1.
> 
> >1) system call context - vrange system call registers new vrange
> >    in address_space
> >2) Add new vrange into LRU
> >3) reclaim context - walk with rmap to confirm all processes make
> >    the range with volatile -> discard
> >
> >What's the problem in this logic?
> 
> The problem is only if in the first case, the volatile range being
> created crosses over both anonymous and shared file mmap pages. In
> that case we have to create appropriate sub-ranges on the mm_struct,
> and sub-ranges on the address_space of the mmaped file.
> 
> This isn't impossible to do, but again, the handling of errors
> mid-way through creating subranges is problematic (there may be yet
> a way around it, I just haven't thought of it yet).

Fair enough.
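
For illustration only, one common way to avoid failing mid-way is to
size and allocate all the sub-ranges first and modify state only once
nothing can fail any more. The sketch below is a userspace model of
that idea, not what the patches do: the types and build_subranges()
are hypothetical, and it ignores the locking that would be needed to
keep the VMAs stable between the two passes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a VMA and for one per-mm or per-file
 * sub-range. Ranges are [start, end). */
struct vma_model { unsigned long start, end; int file_backed; };
struct subrange { unsigned long start, end; int on_file; };

static int overlaps(const struct vma_model *v,
		    unsigned long start, unsigned long end)
{
	return v->start < end && start < v->end;
}

/* Returns the number of sub-ranges, or -ENOMEM with nothing modified.
 * On success *out is a malloc'ed array the caller must free. */
static long build_subranges(const struct vma_model *vmas, size_t n,
			    unsigned long start, unsigned long end,
			    struct subrange **out)
{
	size_t needed = 0, k = 0;
	struct subrange *sr;

	/* Pass 1: count, so the allocation size is known up front. */
	for (size_t i = 0; i < n; i++)
		if (overlaps(&vmas[i], start, end))
			needed++;

	*out = NULL;
	if (needed == 0)
		return 0;

	sr = malloc(needed * sizeof(*sr));
	if (!sr)
		return -ENOMEM;	/* fails before any state is touched */

	/* Pass 2: fill in the sub-ranges; nothing here can fail. */
	for (size_t i = 0; i < n; i++) {
		if (!overlaps(&vmas[i], start, end))
			continue;
		sr[k].start = vmas[i].start > start ? vmas[i].start : start;
		sr[k].end = vmas[i].end < end ? vmas[i].end : end;
		sr[k].on_file = vmas[i].file_backed;
		k++;
	}
	*out = sr;
	return (long)k;
}
```

The cost is the extra pass over the VMAs on every call, which is the
overhead the cheap per-process vrange() path is trying to avoid.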

> 
> 
> Thus with my patches, I simplified the problem a bit by partitioning
> it into two separate problems and two separate interfaces: Volatile
> ranges that are created by the vrange() call won't affect mmaped
> pages, only anonymous pages. We may create a range that covers them,
> but the volatility isn't shared with other processes and the purging
> logic still skips file pages. If you want to create a volatile
> range on file pages, you have to use fvrange().

Okay, this paragraph makes your intention clear to me.
You don't want to handle file pages with vrange() and want to use
fvrange() for file pages. I don't oppose that, but please write down
the reasoning as you explained it to me above.
It would make reviewers happier.

> 
> Of course, my patchset has its own inconsistencies too, since if a
> range is marked non-volatile that covers a mmapped file that has
> been marked volatile, that volatility would persist. So I probably
> should return an error if the vrange call covers any mmapped files.

Hmm, if you intend to separate anon and file by giving vrange and fvrange
separate data structures, then isn't that a non-issue?

> 
> 
> Also, to be clear, I'm not saying that we *have* to partition these
> operations into two separate behaviors, but I think having two
> separate behaviors at first helps make clear the subtleties of the
> differences between them.

I got your point and I'll think about it more.

> 
> 
> Let me know if any of this helps your understanding. :)

Thank you very much, John!

Looking forward to seeing you in SF.

> 
> thanks
> -john
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/4] Support vranges on files
  2013-04-05  7:55     ` Minchan Kim
@ 2013-04-08  0:46       ` Minchan Kim
  2013-04-09  0:36         ` John Stultz
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2013-04-08  0:46 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton

Hello John,

As you know, userland people want to handle vranges through an mmapped
pointer rather than an fd, and to see the SIGBUS, so I thought more
about the semantics of vrange and want to make them very clear and
simple. So I suggest the semantics below (of course, they're not set
in stone).

        mvrange(start_addr, length, mode, behavior)

It's the same as what I suggested recently but under a different name,
just adding the prefix "m". It's a per-process model (i.e., the vrange
lives in the mm_struct), so once the process has exited, the
"volatility" is no longer valid.
That isn't a problem for anonymous memory but could be for file vranges,
so let's introduce fvrange to cover that case.

        fvrange(int fd, start_offset, length, mode, behavior)

First of all, let's look at mvrange from the anonymous page and file
page points of view.

1) anon-mvrange

A page in a volatile range will be purged only if all of the processes
sharing it marked the range as volatile.

If process A calls mvrange and then forks, the vrange could be copied
from parent to child, so not-yet-COWed pages could be purged
unless either of the two processes explicitly marks the range NO_VOLATILE.

Of course, a COWed page could be purged easily because there is no link
any more.

2) file-mvrange

A page in a volatile range will be purged only if all of the processes
that map the page marked it as volatile AND no process maps the page
as "private". IOW, every process that maps the page must map it
"shared" for it to be purged.

So every process has to mark the address range in its own process
context if they want to collaborate on a shared mapped file, and they
must guarantee that no process maps the range "private".

Of course, the volatility state goes away when the process is gone.

3) fvrange

It's the same as 2), but the volatility state could persist in the
address_space until someone calls fvrange(NO_VOLATILE).
So it could remove the weakness of 2).
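
As a rough illustration of rules 1) and 2), the purge decision could
be modeled as below. This is only a sketch under the semantics
proposed in this mail; struct mapper and both functions are
hypothetical names, not existing kernel interfaces. Case 3) would use
the same check as 2), except that the volatile mark would be read from
the address_space instead of from each process.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process view of one page. */
struct mapper {
	bool is_private;	/* the process mapped the page MAP_PRIVATE */
	bool marked_volatile;	/* the process marked the range volatile */
};

/* 1) anon-mvrange: purge only if every process sharing the page
 * marked it volatile. */
static bool anon_page_purgeable(const struct mapper *m, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!m[i].marked_volatile)
			return false;
	return n > 0;
}

/* 2) file-mvrange: additionally, no mapper may be private. */
static bool file_page_purgeable(const struct mapper *m, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (m[i].is_private || !m[i].marked_volatile)
			return false;
	return n > 0;
}
```

In this model a single private mapper, or a single mapper that has not
marked the range, is enough to keep the page from being purged.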
 
What do you think about the above semantics?

If you don't see any problem, we could implement it. I think 1) and 2) could
be handled by my base code for anon-vrange handling, with some tweaking for
file-vrange, and 3) needs your new address_space patches.

On Fri, Apr 05, 2013 at 04:55:04PM +0900, Minchan Kim wrote:
> Hi John,
> 
> On Thu, Apr 04, 2013 at 10:37:52AM -0700, John Stultz wrote:
> > On 04/03/2013 11:55 PM, Minchan Kim wrote:
> > >On Wed, Apr 03, 2013 at 04:52:19PM -0700, John Stultz wrote:
> > >>Next we introduce a parallel fvrange() syscall for creating
> > >>volatile ranges directly against files.
> > >Okay. It seems you want to replace ashmem interface with fvrange.
> > >I dobut we have to eat a slot for system call. Can't we add "int fd"
> > >in vrange systemcall without inventing new wheel?
> > 
> > Sure, that would be doable. I just added the new syscall to make the
> > differences in functionality clear.
> > Once the subtleties are understood, we can condense things down if
> > we think its best.
> 
> Fair enough.
> 
> > 
> > 
> > >>And finally, we change the range pruging logic to be able to
> > >>handle both anonymous and file volatile ranges.
> > >Okay. Then, what's the semantic file-vrange?
> > >
> > >There is a file F. Process A mapped some part of file into his
> > >address space. Then, Process B calls fvrange same part.
> > >As I looked over your code, it purges the range although process B
> > >is using now. Right? Is it your intention? Maybe isn't.
> > 
> > Not sure if you're example has a type-o and you meant "process A is
> > using it"?  If so, yes. The point is the volatility is shared and
> > consistent across all users of the file, in the same way the data in
> > the file is shared. If process B punched a hole in the file, process
> > A would see the effect immediately. With volatile ranges, the hole
> > punching is just delayed and possibly done later by the kernel, in
> > effect on behalf of process B, so the behavior is the same.
> > 
> > Consider the case where we could have two processes mmap a tmpfs
> > file in order to create a circular buffer shared between them. You
> > could then have a producer/consumer relationship with two processes
> > where any data not between the head & tail offsets were marked
> > volatile. The producer would mark tail+size non-volatile, write the
> > data, and update the tail offset. The consumer would read data from
> > the head offset, mark the just-read range as volatile, and update
> > the offset.
> > 
> > In this example, the producer would be the only process to mark data
> > non-volatile, while the consumer would be the only one marking
> > ranges volatile. Thus the state of volatility would need to be an
> > attribute of the file, not the process, in the same way the shared
> > data is.
> > 
> > Is that clear?
> 
> Yes, I got your point that you meant shared mapping.
> Let's enumerate more examples.
> 
> 1. Process A mapped FILE A with MAP_SHARED
>    Process B mapped FILE A with MAP_SHARED
>    Process C calls fvrange
>    Discard all pages of process A and B -> Make sense to me.
> 
> 2. Process A mapped FILE A with MAP_PRIVATE and is using it with read-only
>    Process B mapped FILE A with MAP_PRIVATE and is using it with write-only
>    Process C calls fvrange
> 
>    What does it happens? I expect process A lost all pages while process B
>    keeps COWed pages.
> 
> 3. Process A mapped FILE A with MAP_PRIVATE and is using it with read/write
>    Process C calls fvrange
> 
>    Some pages non-COWed in process A are lost while some pages COWed are kept.
>    Mixing.
> 
> Above all are your intention?
> It would be very clear if you should have wrote down semantic you intent
> about private mapped file and shared mapped file. ;-)
> 
> > 
> > 
> > 
> > >Let's define fvrange's semantic same with anon-vrange.
> > >If there is a process using range with non-volatile, at least,
> > >we shouldn't purge at all.
> > 
> > So this I'm not in agreement with.
> 
> I got your point.
> 
> > 
> > Anonymous pages are for the most part not shared, except via COW.
> > And for the COW case, yes, I agree, we shouldn't purge those pages.
> > 
> > Similarly (and I have yet to handle this in the code), for private
> > mapped files, those pages shouldn't be purged either (or purging
> > them shouldn't affect the private mapped pages - not sure which
> > direction to go here).
> 
> Yeb. It's questionable.
> It seems fallocate for punch hole removes non-COWed pages although
> they are mapped privately if I didn't miss something to read code.
> If I was right, it looks very strange to me. COWed pages remain
> in memory while NOT-YET-COWed pages are discarded. :(
> Ho, Hmm.
> 
> > 
> > But for shared mapped files, we need to keep the volatility state
> > shared as well.
> > 
> > 
> > >>Now there are some quirks still to be resolved with the approach
> > >>used here. The biggest one being the vrange() call can't be used to
> > >>create volatile ranges against mmapped files. Instead only the
> > >Why?
> > 
> > As explained above, the volatility is shared like the data. The
> > current vrange() code creates per-mm volatile ranges, which aren't
> > shared.
> 
> Strictly speaking, we can do it by only per-mm volatile range, I think.
> But the concern if we choose the approach is that what you mention in
> below is we have to iterate all process's mm_sturct to check in system
> call context. Of course, I don't like it and too bad design.
> 
> > 
> > 
> > >
> > >>fvrange() can be used to create file backed volatile ranges.
> > >I could't understand your point. It would be better to explain
> > >my thought firstly then, you could point out something I am missing
> > >now. Look below.
> > >
> > >>This could be overcome by iterating across all the process VMAs to
> > >>determine if they're anonymous or file based, and if file-based,
> > >>create a VMA sized volatile range on the mapping pointed to by the
> > >>VMA.
> > >It needs just when we start to discard pages. Simply, it is related
> > >to reclaim path, NOT system call path so it's not a problem.
> > 
> > The reason we can't defer this to only the reclaim path is if
> > volatile ranges on shared mappings are stored in the mm_struct, if
> > process A sets up a volatile range on a shared mapping, but stores
> > the volatility in its own mm, then process B wants to clear the
> > volatility on the range, process B would have to iterate over all
> > processes that have those file vmas mapped and change them.
> 
> Right. I think iterating all of relevant vmas isn't big cost
> in normal situation but it could be rather bigger when the memory
> pressure is severe, especially for file-backed pages because it's
> not even read/write lock.
> I'd like to minimize the system call overhead if possible.
> 
> > 
> > Additionally if process A sets up a volatile range on a shared
> > mapped file, then quits, the volatility state dies with that
> > process.
> 
> Yes, so don't you want to use vrange system call for mmaped-file
> range at the moment?
> 
> > 
> > Either way, its not just a simple matter of handling data on your
> > own mm_struct. That's fine for the process' own anonymous memory,
> > but doesn't work for shared file mappings.
> 
> Agreed.
> 
> > 
> > 
> > >
> > >>But this would have downsides, as Minchan has been clear that he wants
> > >>to optmize the vrange() calls so that it is very cheap to create and
> > >>destroy volatile ranges. Having simple per-process ranges be created
> > >>means we don't have to iterate across the vmas in the range to
> > >>determine if they're anonymous or file backed. Instead the current
> > >>vrange() code just creates per process ranges (which may or may not
> > >>cover mmapped file data), but will only purge anonymous pages in
> > >>that range. This keeps the vrange() call cheap.
> > >Right.
> > >
> > >>Additionally, just creating or destroying a single range is very
> > >>simple to do, and requires a fixed amount of memory known up front.
> > >>Thus we can allocate needed data prior to making any modifications.
> > >>
> > >>But If we were to create a range that crosses anonymous and file
> > >>backed pages, it must create or destroy multiple per-process or
> > >>per-file ranges. This could require an unknown number of allocations,
> > >This is a part I can fail to parse your opinion.
> > 
> > So if we were in the vrange() code to iterate over all the VMAs in
> > the range, creating VMA sizes ranges on either the mm_struct or the
> > backing address_space where appropriate, its possible that we could
> > hit an ENOMEM half way through the operation. This would leaving the
> > range in an inconsistent state: partially marked, and potentially
> > causing us to lose the purged state on the subranges.
> > 
> > 
> > 
> > >
> > >>opening the possibility of getting an ENOMEM half-way through the
> > >>operation, leaving the volatile range partially created or destroyed.
> > >>
> > >>So to keep this simple for this first pass, for now we have two
> > >>syscalls for two types of volatile ranges.
> > >
> > >My idea is following as
> > >
> > >         vrange(fd, start, len, mode, behavior)
> > >
> > >A) fd = 0
> > 
> > Well we'd probably need to use -1 or something that would be an
> > invalid fd here.
> > 
> > And really, I think having separate interfaces might be good, just
> > as there are separate madvise() and fadvise() calls (and when all
> > this is done, we may need to re-visit the new syscall vs new
> > madvise/fadvise flags decision).
> 
> It does make sense in this phase where we are still RFC.
> 
> > 
> > >
> > >1) system call context - vrange system call registers new vrange
> > >    in mm_struct.
> > >2) Add new vrange into LRU
> > >3) reclaim context - walk with rmap to confirm all processes make
> > >    the range with volatile -> discard
> > >
> > >B) fd = 1
> > The fd would just need to be valid right, not 1.
> > 
> > >1) system call context - vrange system call registers new vrange
> > >    in address_space
> > >2) Add new vrange into LRU
> > >3) reclaim context - walk with rmap to confirm all processes make
> > >    the range with volatile -> discard
> > >
> > >What's the problem in this logic?
> > 
> > The problem is only if in the first case, the volatile range being
> > created crosses over both anonymous and shared file mmap pages. In
> > that case we have to create appropriate sub-ranges on the mm_struct,
> > and sub-ranges on the address_space of the mmaped file.
> > 
> > This isn't impossible to do, but again, the handling of errors
> > mid-way through creating subranges is problematic (there may be yet
> > a way around it, I just haven't thought of it yet).
> 
> Fair enough.
> 
> > 
> > 
> > Thus with my patches, I simplified the problem a bit by partitioning
> > it into two separate problems and two separate interfaces: Volatile
> > ranges that are created by the vrange() call won't affect mmaped
> > pages, only anonymous pages. We may create a range that covers them,
> > but the volatility isn't shared with other processes and the purging
> > logic still skips file pages. If you want to to create a volatile
> > range on file pages, you have to use fvrange().
> 
> Okay, I got your intention by this paragraph.
> You don't want to handle file pages with vrange() and want to use
> fvrange for file pages. I don't oppose it but please write down
> why we did like you explained to me on above.
> It would make reviewers happier.
> 
> > 
> > Of course, my patchset has its own inconsistencies too, since if a
> > range is marked non-volatile that covers a mmapped file that has
> > been marked volatile, that volatility would persist. So I probably
> > should return an error if the vrange call covers any mmapped files.
> 
> Hmm, if you intend to separate anon and file with vrange and fvrange's
> separate data structure, it's no problem?
> 
> > 
> > 
> > Also, to be clear, I'm not saying that we *have* to partition these
> > operations into two separate behaviors, but I think having two
> > separate behaviors at first helps makes clear the subtleties of the
> > differences between them.
> 
> I got your point and I am thinking about that more.
> 
> > 
> > 
> > Let me know if any of this helps your understanding. :)
> 
> Thank you very much. John!
> 
> Looking forward to seeing you in SF.
> 
> > 
> > thanks
> > -john
> > 
> 
> -- 
> Kind regards,
> Minchan Kim
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/4] Support vranges on files
  2013-04-08  0:46       ` Minchan Kim
@ 2013-04-09  0:36         ` John Stultz
  2013-04-09  2:18           ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: John Stultz @ 2013-04-09  0:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton

On 04/07/2013 05:46 PM, Minchan Kim wrote:
> Hello John,
>
> As you know, userland people wanted to handle vrange with mmaped
> pointer rather than fd-based and see the SIGBUS so I thought more
> about semantic of vrange and want to make it very clear and easy.
> So I suggest below semantic(Of course, it's not rock solid).
>
>          mvrange(start_addr, lengh, mode, behavior)
>
> It's same with that I suggested lately but different name, just
> adding prefix "m". It's per-process model(ie, mm_struct vrange)
> so if process is exited, "volatility" isn't valid any more.
> It isn't a problem in anonymous but could be in file-vrange so let's
> introduce fvrange for covering the problem.
>
>          fvrange(int fd, start_offset, length, mode, behavior)
>
> First of all, let's see mvrange with anonymous and file page POV.
>
> 1) anon-mvrange
>
> The page in volaitle range will be purged only if all of processes
> marked the range as volatile.
>
> If A process calls mvrange and is forked, vrange could be copied
> from parent to child so not-yet-COWed pages could be purged
> unless either one of both processes marks NO_VOLATILE explicitly.
>
> Of course, COWed page could be purged easily because there is no link
> any more.

Ack. This seems reasonable.


> 2) file-mvrange
>
> A page in volatile range will be purged only if all of processes mapped
> the page marked it as volatile AND there is no process mapped the page
> as "private". IOW, all of the process mapped the page should map it
> with "shared" for purging.
>
> So, all of processes should mark each address range in own process
> context if they want to collaborate with shared mapped file and gaurantee
> there is no process mapped the range with "private".
>
> Of course, volatility state will be terminated as the process is gone.

This case doesn't seem ideal to me, but is sort of how the current code 
works to avoid the complexity of dealing with memory volatile ranges 
that cross page types (file/anonymous). Although the current code just 
doesn't purge file pages marked with mvrange().

I'd much prefer file-mvrange calls to behave identically to fvrange calls.

The important point here is that the kernel doesn't *have* to purge
anything ever. It's at the kernel's discretion which volatile pages to
purge and when. So it's easier for now to simply not purge file pages
marked volatile via mvolatile.

There is, however, the inconsistency that file pages marked volatile via
fvrange() and then marked non-volatile via mvrange() might still be
purged. That is broken in my mind, and still needs to be addressed. The
easiest way out is probably just to return an error if any mvrange()
call covers file pages. But I'd really like a better fix.
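
To sketch what that "easiest out" might look like: the check below
walks a list of VMAs and rejects the call if the requested range
touches any file-backed one. This is a userspace model only; struct
vma_model, mvrange_check() and the choice of -EINVAL are assumptions
for illustration, not what the current patches implement.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: [start, end) plus its backing. */
struct vma_model {
	unsigned long start;
	unsigned long end;
	bool file_backed;
};

/* Return 0 if [start, start + len) covers only anonymous memory,
 * or -EINVAL (an assumed error code) if it overlaps a file mapping. */
static int mvrange_check(const struct vma_model *vmas, size_t n,
			 unsigned long start, unsigned long len)
{
	unsigned long end = start + len;

	for (size_t i = 0; i < n; i++) {
		bool hit = vmas[i].start < end && start < vmas[i].end;

		if (hit && vmas[i].file_backed)
			return -EINVAL;
	}
	return 0;
}
```

Note that this adds a walk over the VMAs to every call, which is the
kind of per-call cost the cheap per-process ranges were meant to avoid.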


> 3) fvrange
>
> It's same with 2) but volatility state could be persistent in address_space
> until someone calls fvrange(NO_VOLATILE).
> So it could remove the weakness of 2).
>   
> What do you think about above semantic?


I'd still like mvrange() calls on shared mapped files to be stored on 
the address_space.


> If you don't have any problem, we could implement it. I think 1) and 2) could
> be handled with my base code for anon-vrange handling with tweaking
> file-vrange and need your new patches in address_space for handling 3).

I think we can get it sorted out. It might just take a few iterations.

thanks
-john




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/4] Support vranges on files
  2013-04-09  0:36         ` John Stultz
@ 2013-04-09  2:18           ` Minchan Kim
  2013-04-09  3:27             ` John Stultz
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2013-04-09  2:18 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton

On Mon, Apr 08, 2013 at 05:36:42PM -0700, John Stultz wrote:
> On 04/07/2013 05:46 PM, Minchan Kim wrote:
> >Hello John,
> >
> >As you know, userland people wanted to handle vrange with mmaped
> >pointer rather than fd-based and see the SIGBUS so I thought more
> >about semantic of vrange and want to make it very clear and easy.
> >So I suggest below semantic(Of course, it's not rock solid).
> >
> >         mvrange(start_addr, lengh, mode, behavior)
> >
> >It's same with that I suggested lately but different name, just
> >adding prefix "m". It's per-process model(ie, mm_struct vrange)
> >so if process is exited, "volatility" isn't valid any more.
> >It isn't a problem in anonymous but could be in file-vrange so let's
> >introduce fvrange for covering the problem.
> >
> >         fvrange(int fd, start_offset, length, mode, behavior)
> >
> >First of all, let's see mvrange with anonymous and file page POV.
> >
> >1) anon-mvrange
> >
> >The page in volaitle range will be purged only if all of processes
> >marked the range as volatile.
> >
> >If A process calls mvrange and is forked, vrange could be copied
> >from parent to child so not-yet-COWed pages could be purged
> >unless either one of both processes marks NO_VOLATILE explicitly.
> >
> >Of course, COWed page could be purged easily because there is no link
> >any more.
> 
> Ack. This seems reasonable.
> 
> 
> >2) file-mvrange
> >
> >A page in volatile range will be purged only if all of processes mapped
> >the page marked it as volatile AND there is no process mapped the page
> >as "private". IOW, all of the process mapped the page should map it
> >with "shared" for purging.
> >
> >So, all of processes should mark each address range in own process
> >context if they want to collaborate with shared mapped file and gaurantee
> >there is no process mapped the range with "private".
> >
> >Of course, volatility state will be terminated as the process is gone.
> 
> This case doesn't seem ideal to me, but is sort of how the current
> code works to avoid the complexity of dealing with memory volatile
> ranges that cross page types (file/anonymous). Although the current
> code just doesn't purge file pages marked with mvrange().

Personally, I don't think it's about avoiding implementation complexity.
I thought that explicitly declaring volatility on a range before using it
would be clearer for the userspace programmer.
Otherwise, they can hit SIGBUS and easily get confused.

Frankly speaking, I don't like volatility remaining permanently after the
relevant processes have gone away; it could make processes using the file
much more error-prone and hard to debug.

Anyway, do you agree with my suggestion that "we should not purge any page
that a process is currently using non-shared (i.e., private)"?

> 
> I'd much prefer file-mvrange calls to behave identically to fvrange calls.
> 
> The important point here is that the kernel doesn't *have* to purge
> anything ever. Its the kernel's discretion as to which volatile
> pages to purge when. So its easier for now to simply not purge file

Right.

> pages marked volatile via mvolatile.

NP, but we should at least write down a rough description. Otherwise users
will try it on file-backed pages, get disappointed, and then be reluctant
to use it any more. :)

I'm not saying we should write an implementation-specific description,
but we should at least say, from the beginning, whether the new system
call can affect anonymous pages, file pages, or both. Just a hope.

> 
> There however is the inconsistency that file pages marked volatile
> via fvrange, then are marked non-volatile via mvrange() might still
> be purged. That is broken in my mind, and still needs to be
> addressed. The easiest out is probably just to return an error if
> any of the mvrange calls cover file pages. But I'd really like a

It needs vma enumeration and mmap_sem read-lock.
It could hurt anon-vrange performance severely.

> better fix.

Another idea is that we can move the per-mm vrange element to the
address_space when the process goes away, if the element covers a
file-backed vma. But I'm still not at all sure whether we should keep it
persistent.

> 
> 
> >3) fvrange
> >
> >It's same with 2) but volatility state could be persistent in address_space
> >until someone calls fvrange(NO_VOLATILE).
> >So it could remove the weakness of 2).
> >What do you think about above semantic?
> 
> 
> I'd still like mvrange() calls on shared mapped files to be stored
> on the address_space.
> 
> 
> >If you don't have any problem, we could implement it. I think 1) and 2) could
> >be handled with my base code for anon-vrange handling with tweaking
> >file-vrange and need your new patches in address_space for handling 3).
> 
> I think we can get it sorted out. It might just take a few iterations.

Sure!

-- 
Kind regards,
Minchan Kim


* Re: [RFC PATCH 0/4] Support vranges on files
  2013-04-09  2:18           ` Minchan Kim
@ 2013-04-09  3:27             ` John Stultz
  2013-04-09  5:07               ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: John Stultz @ 2013-04-09  3:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton

On 04/08/2013 07:18 PM, Minchan Kim wrote:
> On Mon, Apr 08, 2013 at 05:36:42PM -0700, John Stultz wrote:
>> On 04/07/2013 05:46 PM, Minchan Kim wrote:
>>> Hello John,
>>>
>>> As you know, userland people wanted to handle vrange with mmaped
>>> pointer rather than fd-based and see the SIGBUS so I thought more
>>> about semantic of vrange and want to make it very clear and easy.
>>> So I suggest below semantic(Of course, it's not rock solid).
>>>
>>>          mvrange(start_addr, lengh, mode, behavior)
>>>
>>> It's same with that I suggested lately but different name, just
>>> adding prefix "m". It's per-process model(ie, mm_struct vrange)
>>> so if process is exited, "volatility" isn't valid any more.
>>> It isn't a problem in anonymous but could be in file-vrange so let's
>>> introduce fvrange for covering the problem.
>>>
>>>          fvrange(int fd, start_offset, length, mode, behavior)
>>>
>>> First of all, let's see mvrange with anonymous and file page POV.
>>>
>>> 1) anon-mvrange
>>>
>>> The page in volaitle range will be purged only if all of processes
>>> marked the range as volatile.
>>>
>>> If A process calls mvrange and is forked, vrange could be copied
>>> from parent to child so not-yet-COWed pages could be purged
>>> unless either one of both processes marks NO_VOLATILE explicitly.
>>>
>>> Of course, COWed page could be purged easily because there is no link
>>> any more.
>> Ack. This seems reasonable.
>>
>>
>>> 2) file-mvrange
>>>
>>> A page in volatile range will be purged only if all of processes mapped
>>> the page marked it as volatile AND there is no process mapped the page
>>> as "private". IOW, all of the process mapped the page should map it
>>> with "shared" for purging.
>>>
>>> So, all of processes should mark each address range in own process
>>> context if they want to collaborate with shared mapped file and gaurantee
>>> there is no process mapped the range with "private".
>>>
>>> Of course, volatility state will be terminated as the process is gone.
>> This case doesn't seem ideal to me, but is sort of how the current
>> code works to avoid the complexity of dealing with memory volatile
>> ranges that cross page types (file/anonymous). Although the current
>> code just doesn't purge file pages marked with mvrange().
> Personally, I don't think it's to avoid the complexity of implemenation.
> I thought explict declaration volatility on range before using would be
> more clear for userspace programmer.
> Otherwise, he can encounter SIGBUS and got confused easily.
>
> Frankly speaking, I don't like to remain volatility permanently although
> relavant processes go away and it could make processs using the file
> much error-prone and hard to debug it.

So this may be a contentious point we'll have to work out.

Could you maybe describe some use cases you envision where someone would
want to mark pages volatile on a file that could be accidentally shared?
Or how you think the per-mm sense of volatility would be beneficial in
those use cases?

The use cases I envision where volatility would be used are when any 
sharing would be coordinated between processes.
Again, that producer/consumer example from before where the empty 
portion of a very large circular buffer could be made volatile, scaling 
the actual memory usage to the actual need.

And really the same concern would likely apply in the common case when 
multiple applications mmap (shared) a file, but use fvrange() to mark 
the data as volatile. This is exactly the use case the Android ashmem 
interface works for. In that case, once the data is marked volatile, it 
should remain volatile until someone who has the file open marks it as 
non-volatile.  The only time we clear the volatility is when the file is 
closed by all users.

I think the concern about surprising an application that isn't expecting 
volatility is odd, since if an application jumped in and punched a hole 
in the data, that could surprise other applications as well.  If you're 
going to use a file that can be shared, applications have to deal with 
potential changes to that file by others.

To me, the value in using volatile ranges on the file data is exactly 
because the file data can be shared. So it makes sense to me to have the 
volatility state be like the data in the file. I guess the only 
exception in my case is that if all the references to a file are closed, 
we can clear the volatility (since we don't have a sane way for the 
volatility to persist past that point).

One question that might help resolve this: Would having some sort of 
volatility checking interface be helpful in easing your concern about 
applications being surprised by volatility?


> Anyway, do you agree my suggestion that "we should not purge any page if
> a process are using now with non-shared(ie, private)"?

Yes, or if we do purge any pages, they should not affect the private 
mapped pages (in other words, the COW link should be broken - as the 
backing page has in-effect been written to by purging).
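
To make the rule being discussed here concrete, the file-mvrange purge
condition from 2) above can be sketched as a small user-space predicate.
This is only an illustrative model: `struct mapper` and
`file_page_purgeable()` are hypothetical names, not kernel code, and
treating a page with no mappers as non-purgeable is an assumption. The
alternative mentioned above (purging while breaking the COW link for
private mappings) is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one process that has mapped the page. */
struct mapper {
	bool shared;           /* true for a shared mapping, false for private */
	bool marked_volatile;  /* true if this process marked the range volatile */
};

/*
 * Returns true only if every mapper uses a shared mapping AND every
 * mapper has marked the range volatile. A single private mapping or a
 * single unmarked mapper blocks purging.
 */
static bool file_page_purgeable(const struct mapper *m, size_t n)
{
	assert(m != NULL || n == 0);

	if (n == 0)
		return false;  /* assumption: nobody marked it, so keep it */

	for (size_t i = 0; i < n; i++) {
		if (!m[i].shared || !m[i].marked_volatile)
			return false;
	}
	return true;
}
```

Under this model the purge decision is a pure function of the current
mappers, which is what makes the state disappear when a process exits.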


>> I'd much prefer file-mvrange calls to behave identically to fvrange calls.
>>
>> The important point here is that the kernel doesn't *have* to purge
>> anything ever. Its the kernel's discretion as to which volatile
>> pages to purge when. So its easier for now to simply not purge file
> Right.
>
>> pages marked volatile via mvolatile.
> NP but we should write down vague description. User try to use it
> in file-backed pages and got disappointed, then is reluctant to use it
> any more. :)
>
> I'm not saying that let's write down description implementation specific
> but want to say them at least new system call can affect anonymous or file
> or both, at least from the beginning. Just hope.

I'd like to make it generic enough that we have some flexibility to
modify the purging rules if we find it's more optimal. But I agree, the
desired semantics of what could occur should be clear.


>> There however is the inconsistency that file pages marked volatile
>> via fvrange, then are marked non-volatile via mvrange() might still
>> be purged. That is broken in my mind, and still needs to be
>> addressed. The easiest out is probably just to return an error if
>> any of the mvrange calls cover file pages. But I'd really like a
> It needs vma enumeration and mmap_sem read-lock.
> It could hurt anon-vrange performance severely.

True. And performance needs to be good if this hinting interface is to 
be used easily. Although I worry about performance trumping sane 
semantics. So let me try to implement the desired behavior and we can 
measure the difference.


>> better fix.
> Another idea is that we can move per-mm vrange element to address_space
> when the process goes away if the element covers file-backd vma.
> But I'm still very not sure whether we should keep it persistent.

I really think the persistence of file-backed volatile ranges (as long 
as someone has the file open or a mapping to it) is important. Again, I 
think of the volatility really being a state of the page, but since a 
page-based approach is too costly, we're optimizing it into mm_struct 
state or address_space state.

thanks
-john



* Re: [RFC PATCH 0/4] Support vranges on files
  2013-04-09  3:27             ` John Stultz
@ 2013-04-09  5:07               ` Minchan Kim
  2013-04-09 22:36                 ` John Stultz
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2013-04-09  5:07 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton

On Mon, Apr 08, 2013 at 08:27:50PM -0700, John Stultz wrote:
> On 04/08/2013 07:18 PM, Minchan Kim wrote:
> >On Mon, Apr 08, 2013 at 05:36:42PM -0700, John Stultz wrote:
> >>On 04/07/2013 05:46 PM, Minchan Kim wrote:
> >>>Hello John,
> >>>
> >>>As you know, userland people wanted to handle vrange with mmaped
> >>>pointer rather than fd-based and see the SIGBUS so I thought more
> >>>about semantic of vrange and want to make it very clear and easy.
> >>>So I suggest below semantic(Of course, it's not rock solid).
> >>>
> >>>         mvrange(start_addr, lengh, mode, behavior)
> >>>
> >>>It's same with that I suggested lately but different name, just
> >>>adding prefix "m". It's per-process model(ie, mm_struct vrange)
> >>>so if process is exited, "volatility" isn't valid any more.
> >>>It isn't a problem in anonymous but could be in file-vrange so let's
> >>>introduce fvrange for covering the problem.
> >>>
> >>>         fvrange(int fd, start_offset, length, mode, behavior)
> >>>
> >>>First of all, let's see mvrange with anonymous and file page POV.
> >>>
> >>>1) anon-mvrange
> >>>
> >>>The page in volaitle range will be purged only if all of processes
> >>>marked the range as volatile.
> >>>
> >>>If A process calls mvrange and is forked, vrange could be copied
> >>>from parent to child so not-yet-COWed pages could be purged
> >>>unless either one of both processes marks NO_VOLATILE explicitly.
> >>>
> >>>Of course, COWed page could be purged easily because there is no link
> >>>any more.
> >>Ack. This seems reasonable.
> >>
> >>
> >>>2) file-mvrange
> >>>
> >>>A page in volatile range will be purged only if all of processes mapped
> >>>the page marked it as volatile AND there is no process mapped the page
> >>>as "private". IOW, all of the process mapped the page should map it
> >>>with "shared" for purging.
> >>>
> >>>So, all of processes should mark each address range in own process
> >>>context if they want to collaborate with shared mapped file and gaurantee
> >>>there is no process mapped the range with "private".
> >>>
> >>>Of course, volatility state will be terminated as the process is gone.
> >>This case doesn't seem ideal to me, but is sort of how the current
> >>code works to avoid the complexity of dealing with memory volatile
> >>ranges that cross page types (file/anonymous). Although the current
> >>code just doesn't purge file pages marked with mvrange().
> >Personally, I don't think it's to avoid the complexity of implemenation.
> >I thought explict declaration volatility on range before using would be
> >more clear for userspace programmer.
> >Otherwise, he can encounter SIGBUS and got confused easily.
> >
> >Frankly speaking, I don't like to remain volatility permanently although
> >relavant processes go away and it could make processs using the file
> >much error-prone and hard to debug it.
> 
> So this is maybe is a contentious point we'll have to work out.
> 
> Maybe could you describe some use cases you envision where someone
> would want to mark pages volatile on a file that could be
> accidentally shared? Or how you think the per-mm sense of volatility
> would be beneficial in those use-cases?

My concern is the following scenario:

1. Process A calls mvrange for file F.
2. Process A is killed by someone or by its own bug.
3. Process B maps F shared into its address space.
4. Memory pressure happens.
5. Process B is killed by SIGBUS, but Process B really can't know why it
   was killed, because it can't know of anyone who opened F except itself.
> 
> The use cases I envision where volatility would be used are when any
> sharing would be coordinated between processes.
> Again, that producer/consumer example from before where the empty
> portion of a very large circular buffer could be made volatile,
> scaling the actual memory usage to the actual need.
> 
> And really the same concern would likely apply in the common case
> when multiple applications mmap (shared) a file, but use fvrange()
> to mark the data as volatile. This is exactly the use case the
> Android ashmem interface works for. In that case, once the data is

I don't know the Android ashmem interface well, but if it works as I
described earlier, I don't think it's a good interface.

> marked volatile, it should remain volatile until someone who has the
> file open marks it as non-volatile.  The only time we clear the
> volatility is when the file is closed by all users.

Yes. We need to clear volatile ranges when the file is closed
by all users. That's what we need, and it resolves my concern.

> 
> I think the concern about surprising an application that isn't
> expecting volatility is odd, since if an application jumped in and
> punched a hole in the data, that could surprise other applications
> as well.  If you're going to use a file that can be shared,
> applications have to deal with potential changes to that file by
> others.

True. My concern is delayed punching when the fd has no remaining clients,
combined with there being no interface to detect whether some range of a
file is in the volatile state. It means anyone who maps a file shared could
encounter SIGBUS even after making a best effort to check it with lsof
before use.

> 
> To me, the value in using volatile ranges on the file data is
> exactly because the file data can be shared. So it makes sense to me
> to have the volatility state be like the data in the file. I guess
> the only exception in my case is that if all the references to a
> file are closed, we can clear the volatility (since we don't have a
> sane way for the volatility to persist past that point).

Agreed, if you clear out the volatility when the file is closed by
all stakeholders.

> 
> One question that might help resolve this: Would having some sort of
> volatility checking interface be helpful in easing your concern
> about applications being surprised by volatility?

If we can provide the above, I don't think we need such an interface
until someone wants it for a good reason.

> 
> 
> >Anyway, do you agree my suggestion that "we should not purge any page if
> >a process are using now with non-shared(ie, private)"?
> 
> Yes, or if we do purge any pages, they should not affect the private
> mapped pages (in other words, the COW link should be broken - as the
> backing page has in-effect been written to by purging).
> 
> 
> >>I'd much prefer file-mvrange calls to behave identically to fvrange calls.
> >>
> >>The important point here is that the kernel doesn't *have* to purge
> >>anything ever. Its the kernel's discretion as to which volatile
> >>pages to purge when. So its easier for now to simply not purge file
> >Right.
> >
> >>pages marked volatile via mvolatile.
> >NP but we should write down vague description. User try to use it
> >in file-backed pages and got disappointed, then is reluctant to use it
> >any more. :)
> >
> >I'm not saying that let's write down description implementation specific
> >but want to say them at least new system call can affect anonymous or file
> >or both, at least from the beginning. Just hope.
> 
> I'd like to make it generic enough that we have some flexibility to
> modify the puring rules if we find its more optimal. But I agree,
> the desired semantics of what could occur should be clear.
> 
> 
> >>There however is the inconsistency that file pages marked volatile
> >>via fvrange, then are marked non-volatile via mvrange() might still
> >>be purged. That is broken in my mind, and still needs to be
> >>addressed. The easiest out is probably just to return an error if
> >>any of the mvrange calls cover file pages. But I'd really like a
> >It needs vma enumeration and mmap_sem read-lock.
> >It could hurt anon-vrange performance severely.
> 
> True. And performance needs to be good if this hinting interface is
> to be used easily. Although I worry about performance trumping sane
> semantics. So let me try to implement the desired behavior and we
> can measure the difference.

NP. But keep in mind that mmap_sem was really terrible for performance
when I ran an experiment (i.e., concurrent page faults by many threads
while one thread calls mmap).
I guess the primary reason is CONFIG_MUTEX_SPIN_ON_OWNER.
So at the least, we should avoid it by introducing a new mode like
VOLATILE_ANON|VOLATILE_FILE|VOLATILE_BOTH if we want to support
mvrange on files. The mvrange interface is what userland people
really want, although ashmem has used an fd-based model.

Thanks.

> 
> 
> >>better fix.
> >Another idea is that we can move per-mm vrange element to address_space
> >when the process goes away if the element covers file-backd vma.
> >But I'm still very not sure whether we should keep it persistent.
> 
> I really think the persistence of file-backed volatile ranges (as
> long as someone has the file open or a mapping to it) is important.
> Again, I think of the volatility really being a state of the page,
> but since a page-based approach is too costly, we're optimizing it
> into mm_struct state or address_space state.
> 
> thanks
> -john
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim


* Re: [RFC PATCH 0/4] Support vranges on files
  2013-04-09  5:07               ` Minchan Kim
@ 2013-04-09 22:36                 ` John Stultz
  2013-04-10  2:48                   ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: John Stultz @ 2013-04-09 22:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton

On 04/08/2013 10:07 PM, Minchan Kim wrote:
> On Mon, Apr 08, 2013 at 08:27:50PM -0700, John Stultz wrote:
>> marked volatile, it should remain volatile until someone who has the
>> file open marks it as non-volatile.  The only time we clear the
>> volatility is when the file is closed by all users.
> Yes. We need it that clear volatile ranges when the file is closed
> by ball users. That's what we need and blow my concern out.

Ok, sorry this wasn't more clear. In all the implementations I've
pushed, the volatility only persists as long as someone holds the file
open. Once it's closed by all users, the volatility is cleared.

Hopefully that calms your worries here. :)
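
As a rough illustration of that lifetime rule, here is a user-space model
of "volatility persists only while someone holds the file open". It is a
sketch under stated assumptions: `struct file_model` and the `model_*`
helpers are hypothetical names, a single flag stands in for the set of
volatile ranges, and none of this is the actual kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; not the kernel's data structures. */
struct file_model {
	int  open_count;    /* number of users holding the file open */
	bool has_volatile;  /* true if some range is marked volatile */
};

static void model_open(struct file_model *f)
{
	f->open_count++;
}

/* Only a user that holds the file open can mark it volatile. */
static void model_mark_volatile(struct file_model *f)
{
	assert(f->open_count > 0);
	f->has_volatile = true;
}

/* The last close clears any volatility left behind. */
static void model_close(struct file_model *f)
{
	assert(f->open_count > 0);
	if (--f->open_count == 0)
		f->has_volatile = false;
}

/* The kernel may (but never has to) purge only while this is true. */
static bool model_may_purge(const struct file_model *f)
{
	return f->has_volatile;
}
```

Replaying the scenario from earlier in the thread against this model: A
opens and marks the file, A dies (its reference is closed), and B then
opens the file. Because A's close was the last one, B sees no volatile
state and cannot be surprised by a purge.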



>> I think the concern about surprising an application that isn't
>> expecting volatility is odd, since if an application jumped in and
>> punched a hole in the data, that could surprise other applications
>> as well.  If you're going to use a file that can be shared,
>> applications have to deal with potential changes to that file by
>> others.
> True. My concern is delayed punching without any client of fd and
> there is no interface to detect some range of file is volatile state or
> not. It means anyone mapped a file with shared could encunter SIGBUS
> although he try to best effort to check it with lsof before using.

I'll grant the SIGBUS semantics create the potential for stranger
behavior than usual, but I think the use cases are still attractive
enough to try to make it work.


>> To me, the value in using volatile ranges on the file data is
>> exactly because the file data can be shared. So it makes sense to me
>> to have the volatility state be like the data in the file. I guess
>> the only exception in my case is that if all the references to a
>> file are closed, we can clear the volatility (since we don't have a
>> sane way for the volatility to persist past that point).
> Agree if you provide to clear out volatility when file are closed by
> all stakeholder.

Agreed.


>> One question that might help resolve this: Would having some sort of
>> volatility checking interface be helpful in easing your concern
>> about applications being surprised by volatility?
> If we can provide above things, I think we don't need such interface
> until someone want it with reasonable logic.

Sure, I just wanted to know if you saw a need right away. For now we can 
leave it be.

>> True. And performance needs to be good if this hinting interface is
>> to be used easily. Although I worry about performance trumping sane
>> semantics. So let me try to implement the desired behavior and we
>> can measure the difference.
> NP. But keep in mind that mmap_sem was really terrible for performance
> when I took a expereiment(ie, concurrent page fault by many threads
> while a thread calls mmap).
> I guess primary reason is CONFIG_MUTEX_SPIN_ON_OWNER.
> So at least, we should avoid it by introducing new mode like
> VOLATILE_ANON|VOLATILE_FILE|VOLATILE_BOTH if we want to
> support mvrange-file and mvragne interface was thing userland people
> really want although ashmem have used fd-based model.

The VOLATILE_ANON|VOLATILE_FILE|VOLATILE_BOTH may be an interesting 
compromise.

Though, if one marks a VOLATILE_ANON range on an address that's an 
mmaped file, how do we detect this and provide a sane error value 
without checking the vmas?


thanks
-john



* Re: [RFC PATCH 0/4] Support vranges on files
  2013-04-09 22:36                 ` John Stultz
@ 2013-04-10  2:48                   ` Minchan Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2013-04-10  2:48 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Arun Sharma, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Neil Brown, Mike Hommey,
	Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Jason Evans,
	sanjay, Paul Turner, Johannes Weiner, Michel Lespinasse,
	Andrew Morton

On Tue, Apr 09, 2013 at 03:36:20PM -0700, John Stultz wrote:
> On 04/08/2013 10:07 PM, Minchan Kim wrote:
> >On Mon, Apr 08, 2013 at 08:27:50PM -0700, John Stultz wrote:
> >>marked volatile, it should remain volatile until someone who has the
> >>file open marks it as non-volatile.  The only time we clear the
> >>volatility is when the file is closed by all users.
> >Yes. We need it that clear volatile ranges when the file is closed
> >by ball users. That's what we need and blow my concern out.
> 
> Ok, sorry this wasn't more clear. In all the implementations I've
> pushed, the volatility only persists as long as someone holds the
> file open. Once its closed by all users, the volatility is cleared.

I've now confirmed it with your implementation.
Sorry for the confusion; I hadn't looked into your code in detail. :(

> 
> Hopefully that calms your worries here. :)

Yep.

> 
> 
> 
> >>I think the concern about surprising an application that isn't
> >>expecting volatility is odd, since if an application jumped in and
> >>punched a hole in the data, that could surprise other applications
> >>as well.  If you're going to use a file that can be shared,
> >>applications have to deal with potential changes to that file by
> >>others.
> >True. My concern is delayed punching without any client of fd and
> >there is no interface to detect some range of file is volatile state or
> >not. It means anyone mapped a file with shared could encunter SIGBUS
> >although he try to best effort to check it with lsof before using.
> 
> I'll grant the SIGBUG semantics create the potential for stranger
> behavior then usual, but I think the use cases are still attractive
> enough to try to make it work.

Indeed.

> 
> 
> >>To me, the value in using volatile ranges on the file data is
> >>exactly because the file data can be shared. So it makes sense to me
> >>to have the volatility state be like the data in the file. I guess
> >>the only exception in my case is that if all the references to a
> >>file are closed, we can clear the volatility (since we don't have a
> >>sane way for the volatility to persist past that point).
> >Agree if you provide to clear out volatility when file are closed by
> >all stakeholder.
> 
> Agreed.
> 
> 
> >>One question that might help resolve this: Would having some sort of
> >>volatility checking interface be helpful in easing your concern
> >>about applications being surprised by volatility?
> >If we can provide above things, I think we don't need such interface
> >until someone want it with reasonable logic.
> 
> Sure, I just wanted to know if you saw a need right away. For now we
> can leave it be.
> 
> >>True. And performance needs to be good if this hinting interface is
> >>to be used easily. Although I worry about performance trumping sane
> >>semantics. So let me try to implement the desired behavior and we
> >>can measure the difference.
> >NP. But keep in mind that mmap_sem was really terrible for performance
> >when I took a expereiment(ie, concurrent page fault by many threads
> >while a thread calls mmap).
> >I guess primary reason is CONFIG_MUTEX_SPIN_ON_OWNER.
> >So at least, we should avoid it by introducing new mode like
> >VOLATILE_ANON|VOLATILE_FILE|VOLATILE_BOTH if we want to
> >support mvrange-file and mvragne interface was thing userland people
> >really want although ashmem have used fd-based model.
> 
> The VOLATILE_ANON|VOLATILE_FILE|VOLATILE_BOTH may be an interesting
> compromise.
> 
> Though, if one marks a VOLATILE_ANON range on an address that's an
> mmaped file, how do we detect this and provide a sane error value
> without checking the vmas?
> 

Should we check the vma?
If there is a conflict with an existing vrange type, just return -EINVAL?
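
One way to picture that check is the sketch below. Everything in it is an
assumption made for illustration: the flag values, the helper name
`vrange_check_mode()`, and the idea of passing in a `backing` bitmask
describing what the range actually covers. It also does not answer the
cost question, since computing `backing` for an address range is exactly
the vma lookup being discussed.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values; the thread does not define them. */
#define VOLATILE_ANON 0x1u
#define VOLATILE_FILE 0x2u
#define VOLATILE_BOTH (VOLATILE_ANON | VOLATILE_FILE)

/*
 * requested: mode passed by the caller (ANON, FILE or BOTH).
 * backing:   bitmask of page types the range really covers, using the
 *            same flags (assumed to be known by the caller).
 *
 * Returns 0 if the request is consistent with the backing, -EINVAL if
 * the mode is malformed or the range covers a type that was not asked for.
 */
static int vrange_check_mode(unsigned int requested, unsigned int backing)
{
	if (requested == 0 || (requested & ~VOLATILE_BOTH))
		return -EINVAL;
	if (backing & ~requested)
		return -EINVAL;
	return 0;
}
```

With this shape, marking VOLATILE_ANON on an address that is an mmapped
file fails with -EINVAL, while VOLATILE_BOTH accepts either kind of
backing.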

> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim


end of thread, other threads:[~2013-04-10  2:48 UTC | newest]

Thread overview: 15+ messages
2013-04-03 23:52 [RFC PATCH 0/4] Support vranges on files John Stultz
2013-04-03 23:52 ` [RFC PATCH 1/4] vrange: Make various vrange.c local functions static John Stultz
2013-04-03 23:52 ` [RFC PATCH 2/4] vrange: Introduce vrange_root to make vrange structures more flexible John Stultz
2013-04-03 23:52 ` [RFC PATCH 3/4] vrange: Support fvrange() syscall for file based volatile ranges John Stultz
2013-04-03 23:52 ` [RFC PATCH 4/4] vrange: Enable purging of file backed " John Stultz
2013-04-04  6:55 ` [RFC PATCH 0/4] Support vranges on files Minchan Kim
2013-04-04 17:37   ` John Stultz
2013-04-05  7:55     ` Minchan Kim
2013-04-08  0:46       ` Minchan Kim
2013-04-09  0:36         ` John Stultz
2013-04-09  2:18           ` Minchan Kim
2013-04-09  3:27             ` John Stultz
2013-04-09  5:07               ` Minchan Kim
2013-04-09 22:36                 ` John Stultz
2013-04-10  2:48                   ` Minchan Kim
