* [PATCH 0/3] Volatile Ranges
@ 2012-04-24 17:49 John Stultz
  2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: John Stultz @ 2012-04-24 17:49 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

The last RFC iteration of this patch set didn't get any feedback,
so I wanted to properly submit the volatile range functionality for
inclusion.  I suspect this will bring some useful critiques out of
the woodwork, which I'd welcome, but maybe not, and this can be
queued. Here's to optimism. :)

In addition to the two patches for volatile ranges, I've also
included an RFC patch converting the ashmem driver to use volatile
ranges.

Thoughts or feedback on it would be appreciated.

thanks
-john

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

John Stultz (3):
  Range tree implementation
  fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  [RFC] ashmem: Convert ashmem to use volatile ranges

 drivers/staging/android/ashmem.c |  327 +--------------------------
 fs/file_table.c                  |    4 +
 include/linux/fadvise.h          |    5 +
 include/linux/rangetree.h        |   56 +++++
 include/linux/volatile.h         |   12 +
 lib/Makefile                     |    2 +-
 lib/rangetree.c                  |  128 +++++++++++
 mm/Makefile                      |    2 +-
 mm/fadvise.c                     |   16 ++-
 mm/volatile.c                    |  457 ++++++++++++++++++++++++++++++++++++++
 10 files changed, 689 insertions(+), 320 deletions(-)
 create mode 100644 include/linux/rangetree.h
 create mode 100644 include/linux/volatile.h
 create mode 100644 lib/rangetree.c
 create mode 100644 mm/volatile.c

-- 
1.7.3.2.146.gca209



* [PATCH 1/3] Range tree implementation
  2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
@ 2012-04-24 17:49 ` John Stultz
  2012-04-24 19:14   ` Peter Zijlstra
  2012-04-25 12:16   ` Dmitry Adamushko
  2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
  2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
  2 siblings, 2 replies; 28+ messages in thread
From: John Stultz @ 2012-04-24 17:49 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

After Andrew suggested something like his mumbletree idea
to better store a list of ranges, I worked on a few different
approaches, and this is what I've finally managed to get working.

I suspect range-tree isn't a totally accurate name, but I
couldn't quite make out the difference between range trees
and interval trees, so I just picked one to call it. Do
let me know if you have a better name.

The idea of storing ranges in a tree is nice, but has a number
of complications. When adding a range, it's possible that a
large range will consume and merge a number of smaller ranges.
When removing a range, it's possible you may end up splitting an
existing range, causing one range to become two. This makes it
very difficult to provide generic list_head like behavior, as
the parent structures would need to be duplicated and removed,
and that has lots of memory ownership issues.

So, this is a much simplified and more list_head-like
implementation. You can add a node to a tree, or remove a node
from a tree, but the generic implementation doesn't do the
merging or splitting for you. It does, however, provide helpers
to find overlapping and adjacent ranges.
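
As a concrete usage sketch (illustrative only; the my_range wrapper
and my_range_add() below are made up for this message, and error
handling is elided), a caller that wants coalescing behavior embeds
the node in its own structure and does the merge itself:

struct my_range {
	struct range_tree_node node;
	/* caller-owned payload would live here */
};

/*
 * Insert [start,end], merging any overlapping or adjacent ranges
 * already in the tree. The generic code only finds the candidates;
 * removing and freeing them stays with the caller.
 */
static void my_range_add(struct range_tree_root *root, u64 start, u64 end)
{
	struct range_tree_node *rnode;
	struct my_range *new = kzalloc(sizeof(*new), GFP_KERNEL);

	if (!new)
		return;
	range_tree_node_init(&new->node);

	while ((rnode = range_tree_in_range_adjacent(root, start, end))) {
		struct my_range *old;

		old = container_of(rnode, struct my_range, node);
		start = min(start, rnode->start);
		end = max(end, rnode->end);
		range_tree_remove(root, rnode);
		kfree(old);
	}

	new->node.start = start;
	new->node.end = end;
	range_tree_add(root, &new->node);
}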

Andrew also really wanted this range-tree implementation to be
reusable so we don't duplicate the file locking logic. I'm not
totally convinced that the requirements between the volatile
ranges and file locking are really equivalent, but this reduced
implementation may make it possible.


Changelog:
v2:
* Reworked code to use an rbtree instead of splaying

v3:
* Added range_tree_next_in_range() to avoid having to start
  lookups from the root every time.
* Fixed some comments and return NULL instead of 0, as suggested
  by Aneesh Kumar K.V

v6:
* Fixed range_tree_in_range() so that it finds the earliest range,
  rather than the first. This allows the next_in_range() function
  to properly cover all the ranges in the tree.
* Minor cleanups to simplify some of the functions

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/rangetree.h |   56 ++++++++++++++++++++
 lib/Makefile              |    2 +-
 lib/rangetree.c           |  128 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 185 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/rangetree.h
 create mode 100644 lib/rangetree.c

diff --git a/include/linux/rangetree.h b/include/linux/rangetree.h
new file mode 100644
index 0000000..c61ce7c
--- /dev/null
+++ b/include/linux/rangetree.h
@@ -0,0 +1,56 @@
+#ifndef _LINUX_RANGETREE_H
+#define _LINUX_RANGETREE_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+
+struct range_tree_node {
+	struct rb_node rb;
+	u64 start;
+	u64 end;
+};
+
+struct range_tree_root {
+	struct rb_root head;
+};
+
+static inline void range_tree_init(struct range_tree_root *root)
+{
+	root->head = RB_ROOT;
+}
+
+static inline void range_tree_node_init(struct range_tree_node *node)
+{
+	rb_init_node(&node->rb);
+	node->start = 0;
+	node->end = 0;
+}
+
+static inline int range_tree_empty(struct range_tree_root *root)
+{
+	return RB_EMPTY_ROOT(&root->head);
+}
+
+static inline
+struct range_tree_node *range_tree_root_node(struct range_tree_root *root)
+{
+	struct range_tree_node *ret;
+	ret = container_of(root->head.rb_node, struct range_tree_node, rb);
+	return ret;
+}
+
+extern struct range_tree_node *range_tree_in_range(struct range_tree_root *root,
+							 u64 start, u64 end);
+extern struct range_tree_node *range_tree_in_range_adjacent(
+						struct range_tree_root *root,
+							 u64 start, u64 end);
+extern struct range_tree_node *range_tree_next_in_range(
+						struct range_tree_node *node,
+						u64 start, u64 end);
+extern void range_tree_add(struct range_tree_root *root,
+						struct range_tree_node *node);
+extern void range_tree_remove(struct range_tree_root *root,
+						struct range_tree_node *node);
+#endif
+
+
diff --git a/lib/Makefile b/lib/Makefile
index 18515f0..f43ef0d 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o prio_tree.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o rangetree.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/rangetree.c b/lib/rangetree.c
new file mode 100644
index 0000000..08185bc
--- /dev/null
+++ b/lib/rangetree.c
@@ -0,0 +1,128 @@
+#include <linux/rangetree.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+
+
+/**
+ * range_tree_in_range - Returns the first node that overlaps with the
+ *                       given range
+ * @root: range_tree root
+ * @start: range start
+ * @end: range end
+ *
+ */
+struct range_tree_node *range_tree_in_range(struct range_tree_root *root,
+						u64 start, u64 end)
+{
+	struct rb_node *p = root->head.rb_node;
+	struct range_tree_node *candidate, *match = NULL;
+
+	while (p) {
+		candidate = rb_entry(p, struct range_tree_node, rb);
+		if (end < candidate->start)
+			p = p->rb_left;
+		else if (start > candidate->end)
+			p = p->rb_right;
+		else {
+			/* We found one, but try to find an earlier match */
+			match = candidate;
+			p = p->rb_left;
+		}
+	}
+
+	return match;
+}
+
+
+/**
+ * range_tree_in_range_adjacent - Returns the first node that overlaps or
+ *                                is adjacent with the given range.
+ * @root: range_tree root
+ * @start: range start
+ * @end: range end
+ *
+ */
+struct range_tree_node *range_tree_in_range_adjacent(
+					struct range_tree_root *root,
+					u64 start, u64 end)
+{
+	struct rb_node *p = root->head.rb_node;
+	struct range_tree_node *candidate;
+
+	while (p) {
+		candidate = rb_entry(p, struct range_tree_node, rb);
+		if (end+1 < candidate->start)
+			p = p->rb_left;
+		else if (start > candidate->end + 1)
+			p = p->rb_right;
+		else
+			return candidate;
+	}
+	return NULL;
+}
+
+struct range_tree_node *range_tree_next_in_range(struct range_tree_node *node,
+							u64 start, u64 end)
+{
+	struct rb_node *next;
+	struct range_tree_node *candidate;
+	if (!node)
+		return NULL;
+	next = rb_next(&node->rb);
+	if (!next)
+		return NULL;
+
+	candidate = container_of(next, struct range_tree_node, rb);
+
+	if ((candidate->start > end) || (candidate->end < start))
+		return NULL;
+
+	return candidate;
+}
+
+/**
+ * range_tree_add - Add a node to a range tree
+ * @root: range tree to be added to
+ * @node: range_tree_node to be added
+ *
+ * Adds a node to the range tree.
+ */
+void range_tree_add(struct range_tree_root *root,
+					struct range_tree_node *node)
+{
+	struct rb_node **p = &root->head.rb_node;
+	struct rb_node *parent = NULL;
+	struct range_tree_node *ptr;
+
+	WARN_ON_ONCE(!RB_EMPTY_NODE(&node->rb));
+
+	while (*p) {
+		parent = *p;
+		ptr = rb_entry(parent, struct range_tree_node, rb);
+		if (node->start < ptr->start)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+	rb_link_node(&node->rb, parent, p);
+	rb_insert_color(&node->rb, &root->head);
+
+}
+
+
+/**
+ * range_tree_remove: Removes a given node from the tree
+ * @root: root of tree
+ * @node: Node to be removed
+ *
+ * Removes a node and splays the tree
+ */
+void range_tree_remove(struct range_tree_root *root,
+						struct range_tree_node *node)
+{
+	WARN_ON_ONCE(RB_EMPTY_NODE(&node->rb));
+
+	rb_erase(&node->rb, &root->head);
+	RB_CLEAR_NODE(&node->rb);
+}
-- 
1.7.3.2.146.gca209



* [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
  2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
@ 2012-04-24 17:49 ` John Stultz
  2012-04-24 19:20   ` Peter Zijlstra
  2012-04-27  0:39   ` Dave Chinner
  2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
  2 siblings, 2 replies; 28+ messages in thread
From: John Stultz @ 2012-04-24 17:49 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

This patch provides new fadvise flags that can be used to mark
file pages as volatile, which allows them to be discarded if the
kernel wants to reclaim memory.

This is useful for userspace to allocate things like caches, and lets
the kernel destructively (but safely) reclaim them when there's memory
pressure.

It's different from FADV_DONTNEED since the pages are not immediately
discarded; they are only discarded under pressure.
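
To make the intended usage concrete, here is a rough userspace sketch
(illustrative, not part of this patch). The flag values match the ones
added below; raw syscall() is used so a positive "was purged" return
value isn't lost in the glibc posix_fadvise() wrapper, /tmp is assumed
to be a tmpfs mount since the current implementation is tmpfs-only,
and SYS_fadvise64 is the x86-64 syscall name:

#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define POSIX_FADV_VOLATILE	8
#define POSIX_FADV_NONVOLATILE	9

int main(void)
{
	int fd = open("/tmp/app-cache", O_RDWR | O_CREAT, 0600);
	long ret;

	/* ... fill the file with reconstructible cache data ... */

	/* Mark the first 16 pages volatile; the kernel may now purge
	 * them under memory pressure instead of swapping them out. */
	syscall(SYS_fadvise64, fd, 0L, 16 * 4096L, POSIX_FADV_VOLATILE);

	/* Before touching the data again, take the range back. */
	ret = syscall(SYS_fadvise64, fd, 0L, 16 * 4096L,
			POSIX_FADV_NONVOLATILE);
	if (ret > 0)
		printf("range was purged, rebuild the cache\n");
	else if (ret == 0)
		printf("contents still intact\n");

	close(fd);
	return 0;
}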

This is very much influenced by the Android Ashmem interface by
Robert Love so credits to him and the Android developers.
In many cases the code & logic come directly from the ashmem patch.
The intent of this patch is to allow for ashmem-like behavior, but
embeds the idea a little deeper into the VM code, instead of isolating
it into a specific driver.

Also many thanks to Dave Hansen who helped design and develop the
initial version of this patch, and has provided continued review and
mentoring for me in the VM code. And thanks to Dmitry Adamushko for
continued review and bug reporting.


v2:
* After the valid critique that just dropping pages would poke holes
in volatile ranges (instead, if we drop any of a range we should zap
the entire range), I changed the code to more closely mimic the ashmem
implementation, which zaps entire ranges via a shrinker using an lru
list that tracks which range has been marked volatile the longest.

v3:
* Reworked to use range tree implementation.

v4:
* Renamed functions to avoid confusion.
* More consistent PAGE_CACHE_SHIFT usage, suggested by Dmitry
  Adamushko
* Fixed an exit-without-unlocking issue found by Dmitry Adamushko
* Migrate to rbtree based rangetree implementation
* Simplified locking to use global lock (we were grabbing the global
  lru lock every time anyway).
* Avoid ENOMEM issues by allocating before we get into complicated
  code.
* Add some documentation from Neil Brown to the volatile.c file

v5:
* More fixes suggested by Dmitry Adamushko
* Improve range coalescing so that we don't coalesce neighboring
  purged ranges.
* Utilize range_tree_next_in_range to avoid doing every lookup
  from the tree's root.

v6:
* Immediately zap range if we coalesce overlapping purged range.
* Use hash table to do mapping->rangetree lookup instead of
  bloating the address_space structure

v7:
* Race fix noted by Dmitry
* Clear volatile ranges on fput, so volatile ranges don't persist
  if no one has the file open
* Made it tmpfs only, using shmem_truncate_range() instead of
  vmtruncate_range(). This avoids the lockdep warnings caused
  by calling vmtruncate_range() from the shrinker. Seems to
  work ok, but I'd not be surprised if this isn't correct.
  Extra eyes would be appreciated here.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 fs/file_table.c          |    4 +
 include/linux/fadvise.h  |    5 +
 include/linux/volatile.h |   12 ++
 mm/Makefile              |    2 +-
 mm/fadvise.c             |   16 ++-
 mm/volatile.c            |  457 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 494 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/volatile.h
 create mode 100644 mm/volatile.c

diff --git a/fs/file_table.c b/fs/file_table.c
index 70f2a0f..ada2c88 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -24,6 +24,7 @@
 #include <linux/percpu_counter.h>
 #include <linux/percpu.h>
 #include <linux/ima.h>
+#include <linux/volatile.h>
 
 #include <linux/atomic.h>
 
@@ -238,6 +239,9 @@ static void __fput(struct file *file)
 	eventpoll_release(file);
 	locks_remove_flock(file);
 
+	/* Volatile ranges should not persist after all fds are closed */
+	mapping_clear_volatile_ranges(&inode->i_data);
+
 	if (unlikely(file->f_flags & FASYNC)) {
 		if (file->f_op && file->f_op->fasync)
 			file->f_op->fasync(-1, file, 0);
diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h
index e8e7471..443951c 100644
--- a/include/linux/fadvise.h
+++ b/include/linux/fadvise.h
@@ -18,4 +18,9 @@
 #define POSIX_FADV_NOREUSE	5 /* Data will be accessed once.  */
 #endif
 
+#define POSIX_FADV_VOLATILE	8  /* _can_ toss, but don't toss now */
+#define POSIX_FADV_NONVOLATILE	9  /* Remove VOLATILE flag */
+
+
+
 #endif	/* FADVISE_H_INCLUDED */
diff --git a/include/linux/volatile.h b/include/linux/volatile.h
new file mode 100644
index 0000000..85a9249
--- /dev/null
+++ b/include/linux/volatile.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_VOLATILE_H
+#define _LINUX_VOLATILE_H
+
+#include <linux/fs.h>
+
+extern long mapping_range_volatile(struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index);
+extern long mapping_range_nonvolatile(struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index);
+extern void mapping_clear_volatile_ranges(struct address_space *mapping);
+
+#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..7b6c7a8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -13,7 +13,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   page_isolation.o mm_init.o mmu_context.o percpu.o \
-			   $(mmu-y)
+			   volatile.o $(mmu-y)
 obj-y += init-mm.o
 
 ifdef CONFIG_NO_BOOTMEM
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e0..3e33845 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -17,6 +17,7 @@
 #include <linux/fadvise.h>
 #include <linux/writeback.h>
 #include <linux/syscalls.h>
+#include <linux/volatile.h>
 
 #include <asm/unistd.h>
 
@@ -106,7 +107,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 		nrpages = end_index - start_index + 1;
 		if (!nrpages)
 			nrpages = ~0UL;
-		
+
 		ret = force_page_cache_readahead(mapping, file,
 				start_index,
 				nrpages);
@@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 			invalidate_mapping_pages(mapping, start_index,
 						end_index);
 		break;
+	case POSIX_FADV_VOLATILE:
+		/* First and last PARTIAL page! */
+		start_index = offset >> PAGE_CACHE_SHIFT;
+		end_index = endbyte >> PAGE_CACHE_SHIFT;
+		ret = mapping_range_volatile(mapping, start_index, end_index);
+		break;
+	case POSIX_FADV_NONVOLATILE:
+		/* First and last PARTIAL page! */
+		start_index = offset >> PAGE_CACHE_SHIFT;
+		end_index = endbyte >> PAGE_CACHE_SHIFT;
+		ret = mapping_range_nonvolatile(mapping, start_index,
+								end_index);
+		break;
 	default:
 		ret = -EINVAL;
 	}
diff --git a/mm/volatile.c b/mm/volatile.c
new file mode 100644
index 0000000..e94e980
--- /dev/null
+++ b/mm/volatile.c
@@ -0,0 +1,457 @@
+/* mm/volatile.c
+ *
+ * Volatile page range management.
+ *      Copyright 2011 Linaro
+ *
+ * Based on mm/ashmem.c
+ *      by Robert Love <rlove@google.com>
+ *      Copyright (C) 2008 Google, Inc.
+ *
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ *
+ * The goal behind volatile ranges is to allow applications to interact
+ * with the kernel's cache management infrastructure.  In particular an
+ * application can say "this memory contains data that might be useful in
+ * the future, but can be reconstructed if necessary, so if the kernel
+ * needs, it can zap and reclaim this memory without having to swap it out."
+ *
+ * The proposed mechanism - at a high level - is for user-space to be able
+ * to say "This memory is volatile" and then later "this memory is no longer
+ * volatile".  If the content of the memory is still available the second
+ * request succeeds.  If not, the memory is marked non-volatile and an
+ * error is returned to denote that the contents have been lost.
+ *
+ * Credits to Neil Brown for the above description.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/volatile.h>
+#include <linux/rangetree.h>
+#include <linux/hash.h>
+#include <linux/shmem_fs.h>
+
+static DEFINE_MUTEX(volatile_mutex);
+
+struct volatile_range {
+	struct list_head	lru;
+	struct range_tree_node	range_node;
+	unsigned int		purged;
+	struct address_space	*mapping;
+};
+
+/* LRU list of volatile page ranges */
+static LIST_HEAD(volatile_lru_list);
+
+/* Count of pages on our LRU list */
+static u64 lru_count;
+
+
+/*
+ * To avoid bloating the address_space structure, we use
+ * a hash structure to map from address_space mappings to
+ * the range_tree root that stores volatile ranges
+ */
+static struct hlist_head *mapping_hash;
+static long mapping_hash_shift = 8;
+
+struct mapping_hash_entry {
+	struct range_tree_root root;
+	struct address_space *mapping;
+	struct hlist_node hnode;
+};
+
+static inline
+struct range_tree_root *mapping_to_root(struct address_space *mapping)
+{
+	struct hlist_node *elem;
+	struct mapping_hash_entry *entry;
+
+	hlist_for_each_entry_rcu(entry, elem,
+			&mapping_hash[hash_ptr(mapping, mapping_hash_shift)],
+				hnode)
+		if (entry->mapping == mapping)
+			return &entry->root;
+
+	return NULL;
+}
+
+static inline
+struct range_tree_root *mapping_allocate_root(struct address_space *mapping)
+{
+	struct mapping_hash_entry *entry;
+	struct range_tree_root *dblchk;
+
+	/* Drop the volatile_mutex to avoid lockdep deadlock warnings */
+	mutex_unlock(&volatile_mutex);
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	mutex_lock(&volatile_mutex);
+
+	/* Since we dropped the lock, double check that no one has
+	 * created the same hash entry.
+	 */
+	dblchk = mapping_to_root(mapping);
+	if (dblchk) {
+		kfree(entry);
+		return dblchk;
+	}
+
+	INIT_HLIST_NODE(&entry->hnode);
+	entry->mapping = mapping;
+	range_tree_init(&entry->root);
+
+	hlist_add_head_rcu(&entry->hnode,
+		&mapping_hash[hash_ptr(mapping, mapping_hash_shift)]);
+
+	return &entry->root;
+}
+
+static inline void mapping_free_root(struct range_tree_root *root)
+{
+	struct mapping_hash_entry *entry;
+
+	entry = container_of(root, struct mapping_hash_entry, root);
+
+	hlist_del_rcu(&entry->hnode);
+	kfree(entry);
+}
+
+
+/* Range tree helpers */
+static inline u64 range_size(struct volatile_range *range)
+{
+	return range->range_node.end - range->range_node.start + 1;
+}
+
+static inline void lru_add(struct volatile_range *range)
+{
+	list_add_tail(&range->lru, &volatile_lru_list);
+	lru_count += range_size(range);
+}
+
+static inline void lru_del(struct volatile_range *range)
+{
+	list_del(&range->lru);
+	lru_count -= range_size(range);
+}
+
+#define range_on_lru(range) (!(range)->purged)
+
+
+static inline void volatile_range_resize(struct volatile_range *range,
+				pgoff_t start_index, pgoff_t end_index)
+{
+	size_t pre = range_size(range);
+
+	range->range_node.start = start_index;
+	range->range_node.end = end_index;
+
+	if (range_on_lru(range))
+		lru_count -= pre - range_size(range);
+}
+
+static struct volatile_range *vrange_alloc(void)
+{
+	struct volatile_range *new;
+
+	new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
+	if (!new)
+		return NULL;
+	range_tree_node_init(&new->range_node);
+	return new;
+}
+
+static void vrange_del(struct range_tree_root *root,
+				struct volatile_range *vrange)
+{
+	if (range_on_lru(vrange))
+		lru_del(vrange);
+	range_tree_remove(root, &vrange->range_node);
+	kfree(vrange);
+}
+
+
+
+/*
+ * Mark a region as volatile, allowing dirty pages to be purged
+ * under memory pressure
+ */
+long mapping_range_volatile(struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index)
+{
+	struct volatile_range *new;
+	struct range_tree_node *node;
+	struct volatile_range *vrange;
+	struct range_tree_root *root;
+	u64 start, end;
+	int purged = 0;
+	start = (u64)start_index;
+	end = (u64)end_index;
+
+	if (strncmp(mapping->host->i_sb->s_type->name, "tmpfs",
+			strlen("tmpfs")))
+		return -EINVAL;
+
+	new = vrange_alloc();
+	if (!new)
+		return -ENOMEM;
+
+	mutex_lock(&volatile_mutex);
+
+
+	root = mapping_to_root(mapping);
+	if (!root)
+		root = mapping_allocate_root(mapping);
+
+	/* Find any existing ranges that overlap */
+	node = range_tree_in_range(root, start, end);
+	while (node) {
+		/* Already entirely marked volatile, so we're done */
+		if (node->start < start && node->end > end) {
+			/* don't need the allocated value */
+			kfree(new);
+			goto out;
+		}
+
+		/* Grab containing volatile range */
+		vrange = container_of(node, struct volatile_range, range_node);
+
+		/* resize range */
+		start = min_t(u64, start, node->start);
+		end = max_t(u64, end, node->end);
+		purged |= vrange->purged;
+
+		node = range_tree_next_in_range(&vrange->range_node,
+								start, end);
+		vrange_del(root, vrange);
+	}
+
+	/* Coalesce left-adjacent ranges */
+	node = range_tree_in_range(root, start-1, start);
+	if (node) {
+		vrange = container_of(node, struct volatile_range, range_node);
+		/* Only coalesce if both are either purged or unpurged */
+		if (vrange->purged == purged) {
+			/* resize range */
+			start = min_t(u64, start, node->start);
+			end = max_t(u64, end, node->end);
+			vrange_del(root, vrange);
+		}
+	}
+
+	/* Coalesce right-adjacent ranges */
+	node = range_tree_in_range(root, end, end+1);
+	if (node) {
+		vrange = container_of(node, struct volatile_range, range_node);
+		/* Only coalesce if both are either purged or unpurged */
+		if (vrange->purged == purged) {
+			/* resize range */
+			start = min_t(u64, start, node->start);
+			end = max_t(u64, end, node->end);
+			vrange_del(root, vrange);
+		}
+	}
+
+	new->mapping = mapping;
+	new->range_node.start = start;
+	new->range_node.end = end;
+	new->purged = purged;
+
+	if (purged) {
+		struct inode *inode;
+		loff_t pstart, pend;
+
+		inode = mapping->host;
+		pstart = start << PAGE_CACHE_SHIFT;
+		pend = ((end + 1) << PAGE_CACHE_SHIFT) - 1;
+		vmtruncate_range(inode, pstart, pend);
+	}
+	range_tree_add(root, &new->range_node);
+	if (range_on_lru(new))
+		lru_add(new);
+
+out:
+	mutex_unlock(&volatile_mutex);
+
+	return 0;
+}
+
+/*
+ * Mark a region as nonvolatile, returns 1 if any pages in the region
+ * were purged.
+ */
+long mapping_range_nonvolatile(struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index)
+{
+	struct volatile_range *new;
+	struct range_tree_node *node;
+	struct range_tree_root *root;
+	int ret  = 0;
+	u64 start, end;
+	int used_new = 0;
+
+	start = (u64)start_index;
+	end = (u64)end_index;
+
+	if (strncmp(mapping->host->i_sb->s_type->name, "tmpfs",
+			strlen("tmpfs")))
+		return -EINVAL;
+
+	/* create new node */
+	new = vrange_alloc();
+	if (!new)
+		return -ENOMEM;
+
+	mutex_lock(&volatile_mutex);
+	root = mapping_to_root(mapping);
+	if (!root)
+		goto out; /* if no range tree root, there's nothing to unmark */
+
+	node = range_tree_in_range(root, start, end);
+	while (node) {
+		struct volatile_range *vrange;
+		vrange = container_of(node, struct volatile_range, range_node);
+
+		ret |= vrange->purged;
+
+		if (start <= node->start && end >= node->end) {
+			/* delete: volatile range is totally within range */
+			node = range_tree_next_in_range(&vrange->range_node,
+								start, end);
+			vrange_del(root, vrange);
+		} else if (node->start >= start) {
+			/* resize: volatile range right-overlaps range */
+			volatile_range_resize(vrange, end+1, node->end);
+			node = range_tree_next_in_range(&vrange->range_node,
+								start, end);
+
+		} else if (node->end <= end) {
+			/* resize: volatile range left-overlaps range */
+			volatile_range_resize(vrange, node->start, start-1);
+			node = range_tree_next_in_range(&vrange->range_node,
+								start, end);
+		} else {
+			/* split: range is totally within a volatile range */
+			used_new = 1; /* we only do this once */
+			new->mapping = mapping;
+			new->range_node.start = end + 1;
+			new->range_node.end = node->end;
+			new->purged = vrange->purged;
+			range_tree_add(root, &new->range_node);
+			if (range_on_lru(new))
+				lru_add(new);
+			volatile_range_resize(vrange, node->start, start-1);
+
+			break;
+		}
+	}
+
+out:
+	mutex_unlock(&volatile_mutex);
+
+	if (!used_new)
+		kfree(new);
+
+	return ret;
+}
+
+
+/*
+ * Cleans up any volatile ranges.
+ */
+void mapping_clear_volatile_ranges(struct address_space *mapping)
+{
+	struct volatile_range *tozap;
+	struct range_tree_root *root;
+
+	mutex_lock(&volatile_mutex);
+
+	root = mapping_to_root(mapping);
+	if (!root)
+		goto out;
+
+	while (!range_tree_empty(root)) {
+		struct range_tree_node *tmp;
+		tmp = range_tree_root_node(root);
+		tozap = container_of(tmp, struct volatile_range, range_node);
+		vrange_del(root, tozap);
+	}
+	mapping_free_root(root);
+out:
+	mutex_unlock(&volatile_mutex);
+}
+
+/*
+ * Purges volatile ranges when under memory pressure
+ */
+static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
+{
+	struct volatile_range *range, *next;
+	s64 nr_to_scan = sc->nr_to_scan;
+	const gfp_t gfp_mask = sc->gfp_mask;
+
+	if (nr_to_scan && !(gfp_mask & __GFP_FS))
+		return -1;
+	if (!nr_to_scan)
+		return lru_count;
+
+	mutex_lock(&volatile_mutex);
+	list_for_each_entry_safe(range, next, &volatile_lru_list, lru) {
+		struct inode *inode;
+		loff_t start, end;
+
+		inode = range->mapping->host;
+
+		start = range->range_node.start << PAGE_CACHE_SHIFT;
+		end = ((range->range_node.end + 1) << PAGE_CACHE_SHIFT) - 1;
+
+		shmem_truncate_range(inode, start, end);
+
+		lru_del(range);
+		range->purged = 1;
+		nr_to_scan -= range_size(range);
+
+		if (nr_to_scan <= 0)
+			break;
+	}
+	mutex_unlock(&volatile_mutex);
+
+	return lru_count;
+}
+
+static struct shrinker volatile_shrinker = {
+	.shrink = volatile_shrink,
+	.seeks = DEFAULT_SEEKS,
+};
+
+static int __init volatile_init(void)
+{
+	int i, size;
+
+	size = 1U << mapping_hash_shift;
+
+	mapping_hash = kzalloc(sizeof(*mapping_hash) * size, GFP_KERNEL);
+
+	for (i = 0; i < size; i++)
+		INIT_HLIST_HEAD(&mapping_hash[i]);
+
+	register_shrinker(&volatile_shrinker);
+
+
+	return 0;
+}
+
+arch_initcall(volatile_init);
-- 
1.7.3.2.146.gca209



* [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges
  2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
  2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
  2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
@ 2012-04-24 17:49 ` John Stultz
  2012-04-24 19:21   ` Peter Zijlstra
  2 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2012-04-24 17:49 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

First pass attempt at getting ashmem to utilize the volatile range
code.

In this implementation GET_PIN_STATUS is unimplemented, because
adding an ISVOLATILE check wasn't considered terribly useful in
earlier reviews. It would be trivial to re-add that functionality,
but I wanted to check with the Android developers to see how often
GET_PIN_STATUS is actually used.

Similarly the ashmem PURGE_ALL_CACHES ioctl does not function,
as the volatile range purging is no longer directly under its
control.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 drivers/staging/android/ashmem.c |  327 ++------------------------------------
 1 files changed, 10 insertions(+), 317 deletions(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 9f1f27e..4b3c9e8 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -28,6 +28,7 @@
 #include <linux/bitops.h>
 #include <linux/mutex.h>
 #include <linux/shmem_fs.h>
+#include <linux/volatile.h>
 #include "ashmem.h"
 
 #define ASHMEM_NAME_PREFIX "dev/ashmem/"
@@ -49,26 +50,6 @@ struct ashmem_area {
 };
 
 /*
- * ashmem_range - represents an interval of unpinned (evictable) pages
- * Lifecycle: From unpin to pin
- * Locking: Protected by `ashmem_mutex'
- */
-struct ashmem_range {
-	struct list_head lru;		/* entry in LRU list */
-	struct list_head unpinned;	/* entry in its area's unpinned list */
-	struct ashmem_area *asma;	/* associated area */
-	size_t pgstart;			/* starting page, inclusive */
-	size_t pgend;			/* ending page, inclusive */
-	unsigned int purged;		/* ASHMEM_NOT or ASHMEM_WAS_PURGED */
-};
-
-/* LRU list of unpinned pages, protected by ashmem_mutex */
-static LIST_HEAD(ashmem_lru_list);
-
-/* Count of pages on our LRU list, protected by ashmem_mutex */
-static unsigned long lru_count;
-
-/*
  * ashmem_mutex - protects the list of and each individual ashmem_area
  *
  * Lock Ordering: ashmex_mutex -> i_mutex -> i_alloc_sem
@@ -76,102 +57,9 @@ static unsigned long lru_count;
 static DEFINE_MUTEX(ashmem_mutex);
 
 static struct kmem_cache *ashmem_area_cachep __read_mostly;
-static struct kmem_cache *ashmem_range_cachep __read_mostly;
-
-#define range_size(range) \
-	((range)->pgend - (range)->pgstart + 1)
-
-#define range_on_lru(range) \
-	((range)->purged == ASHMEM_NOT_PURGED)
-
-#define page_range_subsumes_range(range, start, end) \
-	(((range)->pgstart >= (start)) && ((range)->pgend <= (end)))
-
-#define page_range_subsumed_by_range(range, start, end) \
-	(((range)->pgstart <= (start)) && ((range)->pgend >= (end)))
-
-#define page_in_range(range, page) \
-	(((range)->pgstart <= (page)) && ((range)->pgend >= (page)))
-
-#define page_range_in_range(range, start, end) \
-	(page_in_range(range, start) || page_in_range(range, end) || \
-		page_range_subsumes_range(range, start, end))
-
-#define range_before_page(range, page) \
-	((range)->pgend < (page))
 
 #define PROT_MASK		(PROT_EXEC | PROT_READ | PROT_WRITE)
 
-static inline void lru_add(struct ashmem_range *range)
-{
-	list_add_tail(&range->lru, &ashmem_lru_list);
-	lru_count += range_size(range);
-}
-
-static inline void lru_del(struct ashmem_range *range)
-{
-	list_del(&range->lru);
-	lru_count -= range_size(range);
-}
-
-/*
- * range_alloc - allocate and initialize a new ashmem_range structure
- *
- * 'asma' - associated ashmem_area
- * 'prev_range' - the previous ashmem_range in the sorted asma->unpinned list
- * 'purged' - initial purge value (ASMEM_NOT_PURGED or ASHMEM_WAS_PURGED)
- * 'start' - starting page, inclusive
- * 'end' - ending page, inclusive
- *
- * Caller must hold ashmem_mutex.
- */
-static int range_alloc(struct ashmem_area *asma,
-		       struct ashmem_range *prev_range, unsigned int purged,
-		       size_t start, size_t end)
-{
-	struct ashmem_range *range;
-
-	range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
-	if (unlikely(!range))
-		return -ENOMEM;
-
-	range->asma = asma;
-	range->pgstart = start;
-	range->pgend = end;
-	range->purged = purged;
-
-	list_add_tail(&range->unpinned, &prev_range->unpinned);
-
-	if (range_on_lru(range))
-		lru_add(range);
-
-	return 0;
-}
-
-static void range_del(struct ashmem_range *range)
-{
-	list_del(&range->unpinned);
-	if (range_on_lru(range))
-		lru_del(range);
-	kmem_cache_free(ashmem_range_cachep, range);
-}
-
-/*
- * range_shrink - shrinks a range
- *
- * Caller must hold ashmem_mutex.
- */
-static inline void range_shrink(struct ashmem_range *range,
-				size_t start, size_t end)
-{
-	size_t pre = range_size(range);
-
-	range->pgstart = start;
-	range->pgend = end;
-
-	if (range_on_lru(range))
-		lru_count -= pre - range_size(range);
-}
 
 static int ashmem_open(struct inode *inode, struct file *file)
 {
@@ -197,12 +85,6 @@ static int ashmem_open(struct inode *inode, struct file *file)
 static int ashmem_release(struct inode *ignored, struct file *file)
 {
 	struct ashmem_area *asma = file->private_data;
-	struct ashmem_range *range, *next;
-
-	mutex_lock(&ashmem_mutex);
-	list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned)
-		range_del(range);
-	mutex_unlock(&ashmem_mutex);
 
 	if (asma->file)
 		fput(asma->file);
@@ -336,54 +218,6 @@ out:
 	return ret;
 }
 
-/*
- * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
- *
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
- *
- * 'gfp_mask' is the mask of the allocation that got us into this mess.
- *
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
- * proceed without risk of deadlock (due to gfp_mask).
- *
- * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
- * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
- * pages freed.
- */
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
-{
-	struct ashmem_range *range, *next;
-
-	/* We might recurse into filesystem code, so bail out if necessary */
-	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
-		return -1;
-	if (!sc->nr_to_scan)
-		return lru_count;
-
-	mutex_lock(&ashmem_mutex);
-	list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
-		struct inode *inode = range->asma->file->f_dentry->d_inode;
-		loff_t start = range->pgstart * PAGE_SIZE;
-		loff_t end = (range->pgend + 1) * PAGE_SIZE - 1;
-
-		vmtruncate_range(inode, start, end);
-		range->purged = ASHMEM_WAS_PURGED;
-		lru_del(range);
-
-		sc->nr_to_scan -= range_size(range);
-		if (sc->nr_to_scan <= 0)
-			break;
-	}
-	mutex_unlock(&ashmem_mutex);
-
-	return lru_count;
-}
-
-static struct shrinker ashmem_shrinker = {
-	.shrink = ashmem_shrink,
-	.seeks = DEFAULT_SEEKS * 4,
-};
 
 static int set_prot_mask(struct ashmem_area *asma, unsigned long prot)
 {
@@ -457,131 +291,6 @@ static int get_name(struct ashmem_area *asma, void __user *name)
 	return ret;
 }
 
-/*
- * ashmem_pin - pin the given ashmem region, returning whether it was
- * previously purged (ASHMEM_WAS_PURGED) or not (ASHMEM_NOT_PURGED).
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
-	struct ashmem_range *range, *next;
-	int ret = ASHMEM_NOT_PURGED;
-
-	list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
-		/* moved past last applicable page; we can short circuit */
-		if (range_before_page(range, pgstart))
-			break;
-
-		/*
-		 * The user can ask us to pin pages that span multiple ranges,
-		 * or to pin pages that aren't even unpinned, so this is messy.
-		 *
-		 * Four cases:
-		 * 1. The requested range subsumes an existing range, so we
-		 *    just remove the entire matching range.
-		 * 2. The requested range overlaps the start of an existing
-		 *    range, so we just update that range.
-		 * 3. The requested range overlaps the end of an existing
-		 *    range, so we just update that range.
-		 * 4. The requested range punches a hole in an existing range,
-		 *    so we have to update one side of the range and then
-		 *    create a new range for the other side.
-		 */
-		if (page_range_in_range(range, pgstart, pgend)) {
-			ret |= range->purged;
-
-			/* Case #1: Easy. Just nuke the whole thing. */
-			if (page_range_subsumes_range(range, pgstart, pgend)) {
-				range_del(range);
-				continue;
-			}
-
-			/* Case #2: We overlap from the start, so adjust it */
-			if (range->pgstart >= pgstart) {
-				range_shrink(range, pgend + 1, range->pgend);
-				continue;
-			}
-
-			/* Case #3: We overlap from the rear, so adjust it */
-			if (range->pgend <= pgend) {
-				range_shrink(range, range->pgstart, pgstart-1);
-				continue;
-			}
-
-			/*
-			 * Case #4: We eat a chunk out of the middle. A bit
-			 * more complicated, we allocate a new range for the
-			 * second half and adjust the first chunk's endpoint.
-			 */
-			range_alloc(asma, range, range->purged,
-				    pgend + 1, range->pgend);
-			range_shrink(range, range->pgstart, pgstart - 1);
-			break;
-		}
-	}
-
-	return ret;
-}
-
-/*
- * ashmem_unpin - unpin the given range of pages. Returns zero on success.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
-	struct ashmem_range *range, *next;
-	unsigned int purged = ASHMEM_NOT_PURGED;
-
-restart:
-	list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
-		/* short circuit: this is our insertion point */
-		if (range_before_page(range, pgstart))
-			break;
-
-		/*
-		 * The user can ask us to unpin pages that are already entirely
-		 * or partially pinned. We handle those two cases here.
-		 */
-		if (page_range_subsumed_by_range(range, pgstart, pgend))
-			return 0;
-		if (page_range_in_range(range, pgstart, pgend)) {
-			pgstart = min_t(size_t, range->pgstart, pgstart),
-			pgend = max_t(size_t, range->pgend, pgend);
-			purged |= range->purged;
-			range_del(range);
-			goto restart;
-		}
-	}
-
-	return range_alloc(asma, range, purged, pgstart, pgend);
-}
-
-/*
- * ashmem_get_pin_status - Returns ASHMEM_IS_UNPINNED if _any_ pages in the
- * given interval are unpinned and ASHMEM_IS_PINNED otherwise.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_get_pin_status(struct ashmem_area *asma, size_t pgstart,
-				 size_t pgend)
-{
-	struct ashmem_range *range;
-	int ret = ASHMEM_IS_PINNED;
-
-	list_for_each_entry(range, &asma->unpinned_list, unpinned) {
-		if (range_before_page(range, pgstart))
-			break;
-		if (page_range_in_range(range, pgstart, pgend)) {
-			ret = ASHMEM_IS_UNPINNED;
-			break;
-		}
-	}
-
-	return ret;
-}
-
 static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
 			    void __user *p)
 {
@@ -615,13 +324,19 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
 
 	switch (cmd) {
 	case ASHMEM_PIN:
-		ret = ashmem_pin(asma, pgstart, pgend);
+		ret = mapping_range_nonvolatile(asma->file->f_mapping, pgstart,
+						pgend);
 		break;
 	case ASHMEM_UNPIN:
-		ret = ashmem_unpin(asma, pgstart, pgend);
+		ret = mapping_range_volatile(asma->file->f_mapping, pgstart,
+						pgend);
 		break;
 	case ASHMEM_GET_PIN_STATUS:
-		ret = ashmem_get_pin_status(asma, pgstart, pgend);
+		/*
+		 * XXX - volatile ranges currently don't provide status,
+		 * due to questionable utility
+		 */
+		ret = -EINVAL;
 		break;
 	}
 
@@ -665,15 +380,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		break;
 	case ASHMEM_PURGE_ALL_CACHES:
 		ret = -EPERM;
-		if (capable(CAP_SYS_ADMIN)) {
-			struct shrink_control sc = {
-				.gfp_mask = GFP_KERNEL,
-				.nr_to_scan = 0,
-			};
-			ret = ashmem_shrink(&ashmem_shrinker, &sc);
-			sc.nr_to_scan = ret;
-			ashmem_shrink(&ashmem_shrinker, &sc);
-		}
 		break;
 	}
 
@@ -709,22 +415,12 @@ static int __init ashmem_init(void)
 		return -ENOMEM;
 	}
 
-	ashmem_range_cachep = kmem_cache_create("ashmem_range_cache",
-					  sizeof(struct ashmem_range),
-					  0, 0, NULL);
-	if (unlikely(!ashmem_range_cachep)) {
-		printk(KERN_ERR "ashmem: failed to create slab cache\n");
-		return -ENOMEM;
-	}
-
 	ret = misc_register(&ashmem_misc);
 	if (unlikely(ret)) {
 		printk(KERN_ERR "ashmem: failed to register misc device!\n");
 		return ret;
 	}
 
-	register_shrinker(&ashmem_shrinker);
-
 	printk(KERN_INFO "ashmem: initialized\n");
 
 	return 0;
@@ -734,13 +430,10 @@ static void __exit ashmem_exit(void)
 {
 	int ret;
 
-	unregister_shrinker(&ashmem_shrinker);
-
 	ret = misc_deregister(&ashmem_misc);
 	if (unlikely(ret))
 		printk(KERN_ERR "ashmem: failed to unregister misc device!\n");
 
-	kmem_cache_destroy(ashmem_range_cachep);
 	kmem_cache_destroy(ashmem_area_cachep);
 
 	printk(KERN_INFO "ashmem: unloaded\n");
-- 
1.7.3.2.146.gca209



* Re: [PATCH 1/3] Range tree implementation
  2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
@ 2012-04-24 19:14   ` Peter Zijlstra
  2012-04-24 19:25     ` John Stultz
  2012-04-25 12:16   ` Dmitry Adamushko
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-04-24 19:14 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
> This makes it
> very difficult to provide generic list_head like behavior, as
> the parent structures would need to be duplicated and removed,
> and that has lots of memory ownership issues. 

You can in fact modify the rb-tree to have O(1) iteration by using the
empty leaf pointers to keep pointers to next/prev nodes.

It's a bit of a bother since you'd need to wrap ->rb_left and ->rb_right
in functions.. but now that we have coccinelle that shouldn't actually
be too hard.
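
Very rough sketch of the idea (untested, names made up, not the
current rbtree API): the low bit of a child pointer tags it as a
"thread" to the in-order neighbour rather than a real child, with the
outermost slots holding a tagged NULL:

#define RB_THREAD	1UL

static inline int rb_is_thread(struct rb_node *p)
{
	return (unsigned long)p & RB_THREAD;
}

static inline struct rb_node *rb_thread_target(struct rb_node *p)
{
	return (struct rb_node *)((unsigned long)p & ~RB_THREAD);
}

/* Amortized O(1) in-order successor: follow the right-hand thread if
 * there is one, otherwise run down the left spine of the right child. */
static struct rb_node *rb_next_threaded(struct rb_node *n)
{
	struct rb_node *p = n->rb_right;

	if (rb_is_thread(p))
		return rb_thread_target(p);	/* tagged NULL past the last node */
	while (!rb_is_thread(p->rb_left))
		p = p->rb_left;
	return p;
}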




* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
@ 2012-04-24 19:20   ` Peter Zijlstra
  2012-04-24 19:50     ` John Stultz
  2012-04-27  0:39   ` Dave Chinner
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-04-24 19:20 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
> +/*
> + * Purges volatile ranges when under memory pressure
> + */
> +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
> +{ 

Hmm, I would have expected regular page reclaim to know about this
instead of using a shrinker interface. Is this done specifically to
avoid growing small holes in all ranges and instead dump entire ranges
thereby keeping other ranges usable?


* Re: [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges
  2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
@ 2012-04-24 19:21   ` Peter Zijlstra
  2012-04-24 19:42     ` John Stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-04-24 19:21 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
> First pass attempt at getting ashmem to utilize the volatile range
> code. 

One would think the purpose of the previous patches was to entirely do
away with the ashmem interface.. this patch is a temporary band-aid to
assist in testing without having to modify userspace?


* Re: [PATCH 1/3] Range tree implementation
  2012-04-24 19:14   ` Peter Zijlstra
@ 2012-04-24 19:25     ` John Stultz
  2012-04-24 19:33       ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2012-04-24 19:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

On 04/24/2012 12:14 PM, Peter Zijlstra wrote:
> On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
>> This makes it
>> very difficult to provide generic list_head like behavior, as
>> the parent structures would need to be duplicated and removed,
>> and that has lots of memory ownership issues.
> You can in fact modify the rb-tree to have O(1) iteration by using the
> empty leaf pointers to keep pointers to next/prev nodes.
>
> It's a bit of a bother since you'd need to wrap ->rb_left and ->rb_right
> in functions.. but now that we have coccinelle that shouldn't actually
> be too hard.
>
Sorry, I'm not sure I'm following you.

My point above was that a generic range-tree implementation that manages 
the splitting and coalescing of ranges internally is difficult, due to 
memory ownership issues.  This makes it hard to have a generic list_head 
style structure that you can use in your own structures. Thus in a way 
similar to how the rb_tree leaves the insert and search implementation 
to the users,  there is a range_tree_node structure, and the splitting 
and coalescing logic is left to the range-tree user.

Does your suggestion address the ownership issue differently?  Or is it 
just a general optimization improvement?

thanks
-john



* Re: [PATCH 1/3] Range tree implementation
  2012-04-24 19:25     ` John Stultz
@ 2012-04-24 19:33       ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2012-04-24 19:33 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

On Tue, 2012-04-24 at 12:25 -0700, John Stultz wrote:
> On 04/24/2012 12:14 PM, Peter Zijlstra wrote:
> > On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
> >> This makes it
> >> very difficult to provide generic list_head like behavior, as
> >> the parent structures would need to be duplicated and removed,
> >> and that has lots of memory ownership issues.
> > You can in fact modify the rb-tree to have O(1) iteration by using the
> > empty leaf pointers to keep pointers to next/prev nodes.
> >
> > It's a bit of a bother since you'd need to wrap ->rb_left and ->rb_right
> > in functions.. but now that we have coccinelle that shouldn't actually
> > be too hard.
> >
> Sorry, I'm not sure I'm following you.
> 
> My point above was that a generic range-tree implementation that manages 
> the splitting and coalescing of ranges internally is difficult, due to 
> memory ownership issues.  This makes it hard to have a generic list_head 
> style structure that you can use in your own structures. Thus in a way 
> similar to how the rb_tree leaves the insert and search implementation 
> to the users,  there is a range_tree_node structure, and the splitting
> and coalescing logic is left to the range-tree user.
> 
> Does your suggestion address the ownership issue differently?  Or is it 
> just a general optimization improvement?

Oh, I thought you also wanted a list_head to aid in the traversal
required to find adjacent ranges etc.. 

Brain completely failed to grasp what you were trying to say, sorry for
the noise. 


* Re: [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges
  2012-04-24 19:21   ` Peter Zijlstra
@ 2012-04-24 19:42     ` John Stultz
  0 siblings, 0 replies; 28+ messages in thread
From: John Stultz @ 2012-04-24 19:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

On 04/24/2012 12:21 PM, Peter Zijlstra wrote:
> On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
>> First pass attempt at getting ashmem to utilize the volatile range
>> code.
> One would think the purpose of the previous patches was to entirely do
> away with the ashmem interface.. this patch is a temporary band-aid to
> assist in testing without having to modify userspace?
>
So in a way, yes. I'm trying to iteratively chip away at the staging 
drivers, providing a smooth transition path. This patch is just to show 
(or determine, depending on whether GET_PIN_STATUS is critical) that the 
fadvise volatile functionality is sufficient to replace the ashmem 
unpinning feature.

Part of the trouble with pushing the Android drivers upstream is that 
each driver doesn't just solve one issue, but usually handles a number 
of problems. Thus discussion about alternative solutions gets muddled as 
no one alternative is sufficient.

For the ashmem driver, one big use case is a way to provide fds to 
anonymous memory that can be shared between processes. This is the same 
as an unlinked tmpfs file, but lets Android avoid tmpfs mounts, which 
could potentially be cluttered by poorly written or malicious 
applications.  There is also the feature of being able to unpin memory 
ranges of that fd, which I thought was interesting for wider use.

The fadvise volatile work provides that unpinning feature, but doesn't 
do anything to try to provide atomically unlinked tmpfs fds. To solve 
that problem, we'll have to revisit whether the simplified ashmem 
interface makes sense, or whether there should be something like a 
permissions-based userspace solution with a daemon that does the 
create/unlink and hands the fds out.  But hopefully that discussion 
will be clearer, as it will be only about one feature.

thanks
-john



* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-24 19:20   ` Peter Zijlstra
@ 2012-04-24 19:50     ` John Stultz
  0 siblings, 0 replies; 28+ messages in thread
From: John Stultz @ 2012-04-24 19:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V

On 04/24/2012 12:20 PM, Peter Zijlstra wrote:
> On Tue, 2012-04-24 at 10:49 -0700, John Stultz wrote:
>> +/*
>> + * Purges volatile ranges when under memory pressure
>> + */
>> +static int volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
>> +{
> Hmm, I would have expected regular page reclaim to know about this
> instead of using a shrinker interface. Is this done specifically to
> avoid growing small holes in all ranges and instead dump entire ranges
> thereby keeping other ranges usable?
>
So yes. We don't want to have single pages purged from the ranges.  My 
first attempt used the standard writeout method to do the purging. 
However, since a range should be purged in its entirety (if any 
single page is purged, the entire range must be reconstructed), 
keeping track of the least-recently-unpinned range is more useful 
than the least-recently-used page.

Now, Dave Chinner suggested using tags to mark pages as part of a 
volatile range, as well as tracking their purged status, and then 
iterating over adjacent pages on writeout to also purge the entire 
range. That could be done, but it seemed more costly to iterate over 
each page updating the tags when marking or unmarking a range as 
volatile.  This suggestion was also (as I understood it) mainly meant 
to address the memory overhead of the early implementation adding a 
pointer to each address_space structure. That issue was resolved by 
using the hash table for address_space -> range_tree mappings.

thanks
-john



* Re: [PATCH 1/3] Range tree implementation
  2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
  2012-04-24 19:14   ` Peter Zijlstra
@ 2012-04-25 12:16   ` Dmitry Adamushko
  2012-04-25 16:19     ` John Stultz
  1 sibling, 1 reply; 28+ messages in thread
From: Dmitry Adamushko @ 2012-04-25 12:16 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V

Hi John,

range_tree_in_range_adjacent() is not used in your code, and it
doesn't seem to be very useful in the general case. range_tree_in_range()
can do the same thing (and you use it that way in the 2nd patch) and
is more flexible (can be paired with range_tree_next_in_range()). So I
think it can be dropped altogether.
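
In other words, roughly (a sketch using the names from your patch; the 
exact signatures may differ):

	/* walk every range overlapping [start, end] */
	struct range_tree_node *node;

	for (node = range_tree_in_range(root, start, end);
	     node;
	     node = range_tree_next_in_range(node, start, end)) {
		/* inspect node->start / node->end, coalesce, etc. */
	}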

Now, I'm wondering whether it actually makes sense to make a dedicated
interface out of the remaining bits.

Almost everything is common rb_tree-handling code that can be found in
any place where rb-trees are used (hard-coded for flexibility,
performance or whatever other reasons). So my feeling is that it
should not be different here.

-- Dmitry

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] Range tree implementation
  2012-04-25 12:16   ` Dmitry Adamushko
@ 2012-04-25 16:19     ` John Stultz
  2012-04-26 10:00       ` Dmitry Adamushko
  0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2012-04-25 16:19 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On 04/25/2012 05:16 AM, Dmitry Adamushko wrote:
> Hi John,
>
> range_tree_in_range_adjacent() is not used in your code, and it
> doesn't seem to be very useful in the general case. range_tree_in_range()
> can do the same thing (and you use it that way in the 2nd patch) and
> is more flexible (can be paired with range_tree_next_in_range()). So I
> think it can be dropped altogether.

Agreed. I actually at one point meant to do this and forgot. Thanks for 
pointing it out!

> Now, I'm wondering whether it actually makes sense to make a dedicated
> interface out of the remaining bits.
>
> Almost everything is common rb_tree-handling code that can be found in
> any place where rb-trees are used (hard-coded for flexibility,
> performance or whatever other reasons). So my feeling is that it
> should not be different here.
>
Sorry, not sure I quite understand what you're suggesting. Are you 
saying it doesn't make sense to have a generic range tree 
implementation, since really it's just a small shim over the rbtree 
code?  So instead range-tree users should just implement them 
themselves?  Or something else?

thanks
-john


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] Range tree implementation
  2012-04-25 16:19     ` John Stultz
@ 2012-04-26 10:00       ` Dmitry Adamushko
  2012-04-27 19:34         ` John Stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Dmitry Adamushko @ 2012-04-26 10:00 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V

>> [ ... ]
>>
>> Almost everything is common rb_tree-handling code that can be found in
>> any place where rb-trees are used (hard-coded for flexibility,
>> performance or whatever other reasons). So my feeling is that it
>> should not be different here.
>>
> Sorry, not sure I quite understand what you're suggesting. Are you saying it
> doesn't make sense to have a generic range tree implementation, since really
> it's just a small shim over the rbtree code?  So instead range-tree users
> should just implement them themselves?

Exactly. It's not much different from other rbtree
search-insert-delete implementations out there.

What are the generic use cases that we want to support with this interface?

Is the current notion of the 'overlapping range' as taken by
range_tree_in_range() common enough? What if another use-case requires
_only_ the ranges that are strictly inside the [ start, end ] range?
In this case, we might be better off sticking to the same
'implement-your-own-search' approach taken by the generic rbtree
interface.
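
For example, a caller wanting only fully-contained ranges would write 
its own walk directly on the rb-tree, roughly like this (a sketch; the 
struct and field names are made up, and the tree is assumed to be keyed 
on range start):

	struct my_range {
		struct rb_node	rb;
		u64		start, end;	/* inclusive */
	};

	/* first range lying strictly inside [start, end] */
	static struct my_range *find_inside(struct rb_root *root,
					    u64 start, u64 end)
	{
		struct rb_node *n = root->rb_node, *p = NULL;

		/* find the leftmost node with r->start >= start */
		while (n) {
			struct my_range *r = rb_entry(n, struct my_range, rb);

			if (r->start >= start) {
				p = n;
				n = n->rb_left;
			} else
				n = n->rb_right;
		}
		/* scan forward until ranges begin past 'end' */
		for (; p; p = rb_next(p)) {
			struct my_range *r = rb_entry(p, struct my_range, rb);

			if (r->start > end)
				break;
			if (r->end <= end)
				return r;	/* strictly inside */
		}
		return NULL;
	}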

-- Dmitry

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
  2012-04-24 19:20   ` Peter Zijlstra
@ 2012-04-27  0:39   ` Dave Chinner
  2012-04-27 15:25     ` Dave Hansen
  2012-04-27 19:14     ` John Stultz
  1 sibling, 2 replies; 28+ messages in thread
From: Dave Chinner @ 2012-04-27  0:39 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
> This patch provides new fadvise flags that can be used to mark
> file pages as volatile, which will allow them to be discarded if the
> kernel wants to reclaim memory.

.....

> @@ -18,4 +18,9 @@
>  #define POSIX_FADV_NOREUSE	5 /* Data will be accessed once.  */
>  #endif
>  
> +#define POSIX_FADV_VOLATILE	8  /* _can_ toss, but don't toss now */
> +#define POSIX_FADV_NONVOLATILE	9  /* Remove VOLATILE flag */

These aren't POSIX standards, so I don't think they should have the
POSIX_ prefix. Besides....

....
> @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>  			invalidate_mapping_pages(mapping, start_index,
>  						end_index);
>  		break;
> +	case POSIX_FADV_VOLATILE:
> +		/* First and last PARTIAL page! */
> +		start_index = offset >> PAGE_CACHE_SHIFT;
> +		end_index = endbyte >> PAGE_CACHE_SHIFT;
> +		ret = mapping_range_volatile(mapping, start_index, end_index);
> +		break;
> +	case POSIX_FADV_NONVOLATILE:
> +		/* First and last PARTIAL page! */
> +		start_index = offset >> PAGE_CACHE_SHIFT;
> +		end_index = endbyte >> PAGE_CACHE_SHIFT;
> +		ret = mapping_range_nonvolatile(mapping, start_index,
> +								end_index);

As it is, I'm still not sold on these being an fadvise() interface
because all it really is is a delayed hole punching interface whose
functionality is currently specific to tmpfs. The behaviour cannot
be implemented sanely by anything else at this point.

> + * The goal behind volatile ranges is to allow applications to interact
> + * with the kernel's cache management infrastructure.  In particular an
> + * application can say "this memory contains data that might be useful in
> + * the future, but can be reconstructed if necessary, so if the kernel
> + * needs, it can zap and reclaim this memory without having to swap it out.

This is what I mean - the definition of volatility is specific to a
filesystem implementation - one that doesn't store persistent data.

> + * The proposed mechanism - at a high level - is for user-space to be able
> + * to say "This memory is volatile" and then later "this memory is no longer
> + * volatile".  If the content of the memory is still available the second
> + * request succeeds.  If not, the memory is marked non-volatile and an
> + * error is returned to denote that the contents have been lost.

For a filesystem, it's not "memory" that is volatile - it is the
*data* that we have to consider that these hints apply to, and that
implies both in memory and on stable storage. Because you are
targeting a filesystem without persistent storage, you are using
"memory" interchangeably with "data". That basically results in an
interface that can only be used by non-persistent filesystems.
However, for managing on-disk caches of fixed sizes, being able to
mark regions as volatile or not is just as helpful to them as it is
to memory based caches on tmpfs....

So why can't you implement this as fallocate() flags, and then make
the tmpfs implementation of those fallocate flags do the right
things?  I think fallocate is the right interface, because this is
simply an extension of the existing hole punching implementation.
IOWs, the specification you are describing means that FADV_VOLATILE
could be correctly implemented as an immediate hole punch by every
filesystem that supports hole punching.
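
(A sketch of what that naive userspace version would look like on a 
filesystem with hole punch support; error handling omitted:)

	#define _GNU_SOURCE	/* for fallocate() */
	#include <fcntl.h>
	#include <linux/falloc.h>

	/* "mark volatile" implemented as an immediate hole punch */
	static int make_volatile_now(int fd, off_t offset, off_t len)
	{
		return fallocate(fd,
				 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 offset, len);
	}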

This probably won't perform wonderfully, which is where the range
tracking and delayed punching (and the implied memory freeing)
optimisation comes into play. Sure, for tmpfs this can be implemented
as a shrinker, but for real filesystems that have to punch blocks a
shrinker is really the wrong context to be running such
transactions. However, using the fallocate() interface allows each
filesystem to optimise the delayed hole punching as they see best,
something that cannot be done with this fadvise() interface.

It's all great that this can replace a single function in ashmem,
but focussing purely on ashmem misses the point that this
functionality has wider use, and that using a different interface
allows independently tailored and optimised implementations of that
functionality....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-27  0:39   ` Dave Chinner
@ 2012-04-27 15:25     ` Dave Hansen
  2012-04-28  1:36       ` Dave Chinner
  2012-04-27 19:14     ` John Stultz
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2012-04-27 15:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On 04/26/2012 05:39 PM, Dave Chinner wrote:
> As it is, I'm still not sold on these being an fadvise() interface
> because all it really is is a delayed hole punching interface whose
> functionality is currently specific to tmpfs. The behaviour cannot
> be implemented sanely by anything else at this point.
...
> IOWs, the specification you are describing means that FADV_VOLATILE
> could be correctly implemented as an immediate hole punch by every
> filesystem that supports hole punching.

Ahhh, I think I see where you're going with this.

1. Data written to a file somehow (mmap(), write()) and is on disk
2. mmap() the data, and fault it in
3. Do a small write to it
4. set FADV_VOLATILE on it

Now we've got a dirty page which can theoretically be tossed out.  But,
we've got old data on the disk and no real way to tell that it was old
if it got faulted in again.  It's a much cleaner situation to just drop
that data off the disk (hole punch) than to leave it around.  Is that
the concern?

But, we have other APIs that act this way, tossing out dirty data
without reflecting that on-disk (MADV_DONTNEED at least).  Is it really
a stretch to define the FADV_VOLATILE to behave the same way?  IOW,
Should the behavior _really_ be hole punching?  That'll cost us I/O to
throw away data during memory reclaim since we have to go write the
information about the hole.  Seems like a much more appropriate thing to
just toss the data out since the app can handle it.
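
(For concreteness, the four steps above as code; POSIX_FADV_VOLATILE 
here is the flag proposed in this series, not an existing kernel flag:)

	#include <fcntl.h>
	#include <sys/mman.h>

	static void dirty_then_volatile(size_t len)
	{
		int fd = open("cache.dat", O_RDWR);	/* 1. data on disk */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);	/* 2. map, fault in */
		p[0]++;			/* 3. small write: page now dirty */
		posix_fadvise(fd, 0, len,
			      POSIX_FADV_VOLATILE);	/* 4. mark volatile */
	}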


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-27  0:39   ` Dave Chinner
  2012-04-27 15:25     ` Dave Hansen
@ 2012-04-27 19:14     ` John Stultz
  2012-04-28  2:04       ` Dave Chinner
  1 sibling, 1 reply; 28+ messages in thread
From: John Stultz @ 2012-04-27 19:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On 04/26/2012 05:39 PM, Dave Chinner wrote:
> On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
>> @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>>   			invalidate_mapping_pages(mapping, start_index,
>>   						end_index);
>>   		break;
>> +	case POSIX_FADV_VOLATILE:
>> +		/* First and last PARTIAL page! */
>> +		start_index = offset>>  PAGE_CACHE_SHIFT;
>> +		end_index = endbyte>>  PAGE_CACHE_SHIFT;
>> +		ret = mapping_range_volatile(mapping, start_index, end_index);
>> +		break;
>> +	case POSIX_FADV_NONVOLATILE:
>> +		/* First and last PARTIAL page! */
>> +		start_index = offset>>  PAGE_CACHE_SHIFT;
>> +		end_index = endbyte>>  PAGE_CACHE_SHIFT;
>> +		ret = mapping_range_nonvolatile(mapping, start_index,
>> +								end_index);
> As it is, I'm still not sold on these being an fadvise() interface
> because all it really is is a delayed hole punching interface whose
> functionality is currently specific to tmpfs. The behaviour cannot
> be implemented sanely by anything else at this point.
Yea. So I spent some time looking at the various hole punching 
mechanisms, and they aren't altogether consistent across filesystems. 
For instance, on some filesystems (ext4 and most disk-backed fs) you 
have to use fallocate(fd, FALLOC_FL_PUNCH_HOLE, ...), while on tmpfs 
it's madvise(..., MADV_REMOVE).  So in a way, currently, the 
FADVISE_VOLATILE is closer to a delayed MADVISE_REMOVE.
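
I.e. today an application has to special-case by filesystem, roughly 
(sketch; addr is assumed to be an existing mapping of the file):

	/* punching a hole today, per filesystem type */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  offset, len);				/* ext4, xfs, ... */
	madvise(addr + offset, len, MADV_REMOVE);	/* tmpfs, via a mapping */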


>> + * The goal behind volatile ranges is to allow applications to interact
>> + * with the kernel's cache management infrastructure.  In particular an
>> + * application can say "this memory contains data that might be useful in
>> + * the future, but can be reconstructed if necessary, so if the kernel
>> + * needs, it can zap and reclaim this memory without having to swap it out.
> This is what I mean - the definition of volatility is specific to a
> filesystem implementation - one that doesn't store persistent data.
Well, I'd like to think that it could be extended to do delayed hole 
punching on disk-backed persistent files, but again, currently there's 
no unified way to punch holes across disk-backed and memory-backed 
filesystems.

If other filesystems implemented vmtruncate_range for hole punching, we 
could (modulo the circular mutex lock issue of calling vmtruncate_range 
from a shrinker) support this on other filesystems.

Are there inherent reasons why vmtruncate_range isn't implemented (or 
can't be sanely implemented) by non-tmpfs filesystems?


>> + * The proposed mechanism - at a high level - is for user-space to be able
>> + * to say "This memory is volatile" and then later "this memory is no longer
>> + * volatile".  If the content of the memory is still available the second
>> + * request succeeds.  If not, the memory is marked non-volatile and an
>> + * error is returned to denote that the contents have been lost.
> For a filesystem, it's not "memory" that is volatile - it is the
> *data* that we have to consider that these hints apply to, and that
> implies both in memory and on stable storage. Because you are
> targeting a filesystem without persistent storage, you are using
> "memory" interchangeably with "data". That basically results in an
> interface that can only be used by non-persistent filesystems.
> However, for managing on-disk caches of fixed sizes, being able to
> mark regions as volatile or not is just as helpful to them as it is
> to memory based caches on tmpfs....
>
> So why can't you implement this as fallocate() flags, and then make
> the tmpfs implementation of those fallocate flags do the right
> things?  I think fallocate is the right interface, because this is
> simply an extension of the existing hole punching implementation.
> IOWs, the specification you are describing means that FADV_VOLATILE
> could be correctly implemented as an immediate hole punch by every
> filesystem that supports hole punching.

So yea, I'm fine with changing the interface as long as fallocate is 
where the consensus is.  I'm not sure I fully understand the subtlety 
of the interface differences, and it doesn't necessarily seem more 
intuitive to me (as it seems more advisory than allocation based).  But 
I can give it a shot.

Another way we could go is using madvise, somewhat mimicking the 
MADVISE_REMOVE call, which again, is not implemented everywhere.

Although as DaveH said, doing the hole punch on disk is extra overhead. 
But I agree it makes more sense from a least-surprise approach (no data 
is less surprising than old data after a purge).

As for your immediate hole punch thought, that could work, although 
FADV_VOLATILE would be just as correctly implemented by not purging any 
of the data on disk-backed files.  Either way, a difference might be 
slightly confusing for users (since either way changes the global LRU 
purge behavior).

> This probably won't perform wonderfully, which is where the range
> tracking and delayed punching (and the implied memory freeing)
> optimisation comes into play. Sure, for tmpfs this can be implemented
> as a shrinker, but for real filesystems that have to punch blocks a
> shrinker is really the wrong context to be running such
> transactions. However, using the fallocate() interface allows each
> filesystem to optimise the delayed hole punching as they see best,
> something that cannot be done with this fadvise() interface.

So if a shrinker isn't the right context, what would be a good context 
for delayed hole punching?


> It's all great that this can replace a single function in ashmem,
> but focussing purely on ashmem misses the point that this
> functionality has wider use, and that using a different interface
> allows independently tailored and optimised implementations of that
> functionality....

Very much agreed, I'd like this to be more generically usable as well.

Thanks again for the helpful feedback! Let me know your thoughts on my 
questions above, and I'll start working on seeing what is required to 
switch over to fallocate().

thanks
-john




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] Range tree implementation
  2012-04-26 10:00       ` Dmitry Adamushko
@ 2012-04-27 19:34         ` John Stultz
  0 siblings, 0 replies; 28+ messages in thread
From: John Stultz @ 2012-04-27 19:34 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dave Chinner, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On 04/26/2012 03:00 AM, Dmitry Adamushko wrote:
>>> [ ... ]
>>>
>>> Almost everything is common rb_tree-handling code that can be found in
>>> any place where rb-trees are used (hard-coded for flexibility,
>>> performance or whatever other reasons). So my feeling is that it
>>> should not be different here.
>>>
>> Sorry, not sure I quite understand what you're suggesting. Are you saying it
>> doesn't make sense to have a generic range tree implementation, since really
>> it's just a small shim over the rbtree code?  So instead range-tree users
>> should just implement them themselves?
> Exactly. It's not much different from other rbtree
> search-insert-delete implementations out there.
>
> What are the generic use cases that we want to support with this interface?

Well, Andrew really wants to see posix file locking reuse something 
like this (which is the whole reason I split it out separately). And 
Aneesh has similar range management needs for his hugetlb region tracking.

> Is the current notion of the 'overlapping range' as taken by
> range_tree_in_range() common enough? What if another use-case requires
> _only_ the ranges that are strictly inside the [ start, end ] range?
> In this case, we might be better off sticking to the same
> 'implement-your-own-search' approach taken by the generic rbtree
> interface.
Or we could extend the interface for more specific requests?

So it's true that the range-tree construct is a pretty thin layer over 
the rbtree code, but even so, we've had a few places in the kernel where 
folks basically are duplicating the same logic over and over again, so 
it might make sense to consolidate the obvious use cases, even just to 
avoid some of the simpler bugs that can happen with rbtree logic (for 
instance, I sent a simple fix for an rbtree-related thinko in Rafael's 
recent userspace wakelock api earlier this week).

Another example is the timerqueue structure I added earlier, which again 
is a thin shim over the rbtree, but allows a few different users to 
share code for quick insert ordered list behavior.

So yea, even though it's fairly thin, I think it still has value for reuse.

thanks
-john


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-27 15:25     ` Dave Hansen
@ 2012-04-28  1:36       ` Dave Chinner
  2012-04-30 21:07         ` John Stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2012-04-28  1:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On Fri, Apr 27, 2012 at 08:25:40AM -0700, Dave Hansen wrote:
> On 04/26/2012 05:39 PM, Dave Chinner wrote:
> > As it is, I'm still not sold on these being an fadvise() interface
> > because all it really is is a delayed hole punching interface whose
> > functionality is currently specific to tmpfs. The behaviour cannot
> > be implemented sanely by anything else at this point.
> ...
> > IOWs, the specification you are describing means that FADV_VOLATILE
> > could be correctly implemented as an immediate hole punch by every
> > filesystem that supports hole punching.
> 
> Ahhh, I think I see where you're going with this.
> 
> 1. Data written to a file somehow (mmap(), write()) and is on disk
> 2. mmap() the data, and fault it in
> 3. Do a small write to it
> 4. set FADV_VOLATILE on it
> 
> Now we've got a dirty page which can theoretically be tossed out.  But,
> we've got old data on the disk and no real way to tell that it was old
> if it got faulted in again.  It's a much cleaner situation to just drop
> that data off the disk (hole punch) than to leave it around.  Is that
> the concern?

Right - if you are using a persistent filesystem then just dropping
the pages in memory does not do what the interface expects it to do.
Rather than having a hole in the file, there is stale data. That
sounds like a recipe for security problems to me.

> But, we have other APIs that act this way, tossing out dirty data
> without reflecting that on-disk (MADV_DONTNEED at least). 

But that's a mmap() operation, not a file operation.  mmap()/
madvise() implies you are working with memory pages, fadvise()
implies you are working with *file data*. The mmap() functionality is
designed really for anonymous memory, to be able to toss it out when
finished with, not for manipulating data files. Yes, it works with
data files, but that leaves unknown stale data behind on disk....

> Is it really
> a stretch to define the FADV_VOLATILE to behave the same way?  IOW,
> Should the behavior _really_ be hole punching?  That'll cost us I/O to
> throw away data during memory reclaim since we have to go write the
> information about the hole.

Not for tmpfs, which is what this is initially aimed at. And for
filesystems, we already do IO in shrinkers during reclaim, so at a
stretch hole punching in a shrinker is possible, though undesirable
and possible to avoid.

> Seems like a much more appropriate thing to
> just toss the data out since the app can handle it.

On tmpfs, re-reading the data range that was tossed will return
zeros. Easy to detect and handle, and has no security implications.
On a real filesystem, it will return stale data. Not only is that
much harder to detect - it may not even be possible depending on how
the application tracks data in its files. If the filesystem punches
holes, it will return zeros just like tmpfs will.

That's my concern - that persistent filesystems will have different
behaviour to in-memory filesystems. They *must* be consistent in
behaviour w.r.t. stale data exposure, otherwise we are in a world
of pain when applications start to use this. Quite frankly, I don't
care about performance of VOLATILE ranges, but I care greatly
about ensuring filesystems don't expose stale data to user
applications....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-27 19:14     ` John Stultz
@ 2012-04-28  2:04       ` Dave Chinner
  2012-04-30 19:40         ` John Stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2012-04-28  2:04 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
> On 04/26/2012 05:39 PM, Dave Chinner wrote:
> >On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
> >>@@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> >>  			invalidate_mapping_pages(mapping, start_index,
> >>  						end_index);
> >>  		break;
> >>+	case POSIX_FADV_VOLATILE:
> >>+		/* First and last PARTIAL page! */
> >>+		start_index = offset>>  PAGE_CACHE_SHIFT;
> >>+		end_index = endbyte>>  PAGE_CACHE_SHIFT;
> >>+		ret = mapping_range_volatile(mapping, start_index, end_index);
> >>+		break;
> >>+	case POSIX_FADV_NONVOLATILE:
> >>+		/* First and last PARTIAL page! */
> >>+		start_index = offset>>  PAGE_CACHE_SHIFT;
> >>+		end_index = endbyte>>  PAGE_CACHE_SHIFT;
> >>+		ret = mapping_range_nonvolatile(mapping, start_index,
> >>+								end_index);
> >As it is, I'm still not sold on these being an fadvise() interface
> >because all it really is is a delayed hole punching interface whose
> >functionality is currently specific to tmpfs. The behaviour cannot
> >be implemented sanely by anything else at this point.
> Yea. So I spent some time looking at the various hole punching
> mechanisms, and they aren't altogether consistent across
> filesystems. For instance, on some filesystems (ext4 and most
> disk-backed fs) you have to use fallocate(fd,
> FALLOC_FL_PUNCH_HOLE, ...), while on tmpfs it's
> madvise(..., MADV_REMOVE).  So in a way, currently, the
> FADVISE_VOLATILE is closer to a delayed MADVISE_REMOVE.

The MADVISE_REMOVE functionality for hole punching works *only* for
tmpfs - no other filesystem implements the .truncate_range() method.
In fact, several filesystems *can't* implement .truncate_range()
because there is no callout from the page cache truncation code to
allow filesystems to punch out the underlying blocks. The
vmtruncate() code is deprecated for this reason (and various others
like a lack of error handling), and .truncate_range() is just as
nasty. .truncate_range() needs to die, IMO.

So, rather than building more infrastructure on a nasty, filesystem
specific mmap() hack, implement .fallocate() on tmpfs and use the
same interface that every other filesystem uses for punching holes.

> >>+ * The goal behind volatile ranges is to allow applications to interact
> >>+ * with the kernel's cache management infrastructure.  In particular an
> >>+ * application can say "this memory contains data that might be useful in
> >>+ * the future, but can be reconstructed if necessary, so if the kernel
> >>+ * needs, it can zap and reclaim this memory without having to swap it out.
> >This is what I mean - the definition of volatility is specific to a
> >filesystem implementation - one that doesn't store persistent data.
> Well, I'd like to think that it could be extended to do delayed
> hole punching on disk-backed persistent files, but again, currently
> there's no unified way to punch holes across disk-backed and
> memory-backed filesystems.
> 
> If other filesystems implemented vmtruncate_range for hole punching,
> we could (modulo the circular mutex lock issue of calling
> vmtruncate_range from a shrinker) support this on other filesystems.

See above - vmtruncate() is *deprecated*.

> Are there inherent reasons why vmtruncate_range isn't implemented
> (or can't be sanely implemented) by non-tmpfs filesystems?

See above.

> >>+ * The proposed mechanism - at a high level - is for user-space to be able
> >>+ * to say "This memory is volatile" and then later "this memory is no longer
> >>+ * volatile".  If the content of the memory is still available the second
> >>+ * request succeeds.  If not, the memory is marked non-volatile and an
> >>+ * error is returned to denote that the contents have been lost.
> >For a filesystem, it's not "memory" that is volatile - it is the
> >*data* that we have to consider that these hints apply to, and that
> >implies both in memory and on stable storage. Because you are
> >targeting a filesystem without persistent storage, you are using
> >"memory" interchangeably with "data". That basically results in an
> >interface that can only be used by non-persistent filesystems.
> >However, for managing on-disk caches of fixed sizes, being able to
> >mark regions as volatile or not is just as helpful to them as it is
> >to memory based caches on tmpfs....
> >
> >So why can't you implement this as fallocate() flags, and then make
> >the tmpfs implementation of those fallocate flags do the right
> >things?  I think fallocate is the right interface, because this is
> >simply an extension of the existing hole punching implementation.
> >IOWs, the specification you are describing means that FADV_VOLATILE
> >could be correctly implemented as an immediate hole punch by every
> >filesystem that supports hole punching.
> 
> So yea, I'm fine with changing the interface as long as fallocate is
> where the consensus is.  I'm not sure I fully understand the
> subtlety of the interface differences, and it doesn't necessarily
> seem more intuitive to me (as it seems more advisory than allocation
> based).  But I can give it a shot.
> 
> Another way we could go is using madvise, somewhat mimicking the
> MADVISE_REMOVE call, which again, is not implemented everywhere.

MADVISE_REMOVE is another tmpfs-specific interface, because it falls
down to vmtruncate_range(). fallocate()-based hole punching is the
only way this can be implemented on normal filesystems.

> Although as DaveH said, doing the hole punch on disk is extra
> overhead. But I agree it makes more sense from a least-surprise
> approach (no data is less surprising than old data after a purge).

Exactly my point, though security rather than least-surprise is the angle
I see this from.

> As for your immediate hole punch thought, that could work, although
> FADV_VOLATILE would be just as correctly implemented by not purging
> any of the data on disk-backed files.  Either way, a difference might be
> slightly confusing for users (since either way changes the global
> LRU purge behavior).

Right, it is equally valid to ignore them, which is why an initial
fallocate() based implementation wouldn't need to modify a single
filesystem. i.e. those that don't support fallocate behave as
expected (do nothing, not even purge the page cache), those that
support hole punching behave as expected (zeros for volatile
ranges), and tmpfs behaves as expected (zeros, really fast).

> >This probably won't perform wonderfully, which is where the range
> >tracking and delayed punching (and the implied memory freeing)
> >optimisation comes into play. Sure, for tmpfs this can be implemented
> >as a shrinker, but for real filesystems that have to punch blocks a
> >shrinker is really the wrong context to be running such
> >transactions. However, using the fallocate() interface allows each
> >filesystem to optimise the delayed hole punching as they see best,
> >something that cannot be done with this fadvise() interface.
> 
> So if a shrinker isn't the right context, what would be a good
> context for delayed hole punching?

Like we do in XFS for inode reclaim. We have a background workqueue
that frees aged inodes periodically in the fastest manner possible
(i.e. all async, no blocking on locks, etc), and the shrinker, when
run, kicks that background thread first and then enters into
synchronous reclaim. By the time a single sync reclaim cycle has run
and throttled reclaim sufficiently, the background thread has done a
great deal more work.
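
Schematically, the pattern looks like this (a sketch of the shape only, 
not the actual XFS code; the reclaim helpers named here are 
illustrative):

	static void background_reclaim(struct work_struct *work)
	{
		/* free aged items: all async, no blocking on locks */
	}
	static DECLARE_WORK(reclaim_work, background_reclaim);

	static int my_shrinker(struct shrinker *s, struct shrink_control *sc)
	{
		if (sc->nr_to_scan) {
			/* kick the async background pass first... */
			queue_work(system_long_wq, &reclaim_work);
			/* ...then throttle the caller with one sync pass */
			reclaim_one_batch_sync();
		}
		return reclaimable_count();
	}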

A similar mechanism can be used for this functionality within XFS.
Indeed, we could efficiently track which inodes have volatile ranges
on them via a bit in the radix trees that index the inode cache,
just like we do for reclaimable inodes. If we then used a bit in the
page cache radix tree index to indicate volatile pages, we could
then easily find the ranges we need to punch out without requiring
some new tree and more per-inode memory.

That's a very filesystem specific implementation - it's vastly
different to your tmpfs implementation - but this is exactly what I
mean about using fallocate to allow filesystems to optimise the
implementation in the most suitable manner for them....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-28  2:04       ` Dave Chinner
@ 2012-04-30 19:40         ` John Stultz
  2012-05-01  0:28           ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2012-04-30 19:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On 04/27/2012 07:04 PM, Dave Chinner wrote:
> On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
>> On 04/26/2012 05:39 PM, Dave Chinner wrote:
>>> On Tue, Apr 24, 2012 at 10:49:46AM -0700, John Stultz wrote:
>>>> @@ -128,6 +129,19 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>>>>   			invalidate_mapping_pages(mapping, start_index,
>>>>   						end_index);
>>>>   		break;
>>>> +	case POSIX_FADV_VOLATILE:
>>>> +		/* First and last PARTIAL page! */
>>>> +		start_index = offset>>   PAGE_CACHE_SHIFT;
>>>> +		end_index = endbyte>>   PAGE_CACHE_SHIFT;
>>>> +		ret = mapping_range_volatile(mapping, start_index, end_index);
>>>> +		break;
>>>> +	case POSIX_FADV_NONVOLATILE:
>>>> +		/* First and last PARTIAL page! */
>>>> +		start_index = offset>>   PAGE_CACHE_SHIFT;
>>>> +		end_index = endbyte>>   PAGE_CACHE_SHIFT;
>>>> +		ret = mapping_range_nonvolatile(mapping, start_index,
>>>> +								end_index);
>>> As it is, I'm still not sold on these being an fadvise() interface
>>> because all it really is is a delayed hole punching interface whose
>>> functionality is currently specific to tmpfs. The behaviour cannot
>>> be implemented sanely by anything else at this point.
>> Yea. So I spent some time looking at the various hole punching
>> mechanisms, and they aren't altogether consistent across
>> filesystems. For instance, on some filesystems (ext4 and most
>> disk-backed fs) you have to use fallocate(fd,
>> FALLOC_FL_PUNCH_HOLE, ...), while on tmpfs it's
>> madvise(..., MADV_REMOVE).  So in a way, currently, the
>> FADVISE_VOLATILE is closer to a delayed MADVISE_REMOVE.
> The MADVISE_REMOVE functionality for hole punching works *only* for
> tmpfs - no other filesystem implements the .truncate_range() method.
> In fact, several filesystems *can't* implement .truncate_range()
> because there is no callout from the page cache truncation code to
> allow filesystems to punch out the underlying blocks. The
> vmtruncate() code is deprecated for this reason (and various others
> like a lack of error handling), and .truncate_range() is just as
> nasty. .truncate_range() needs to die, IMO.
>
> So, rather than building more infrastructure on a nasty, filesystem
> specific mmap() hack, implement .fallocate() on tmpfs and use the
> same interface that every other filesystem uses for punching holes.

Ah. Ok.  I wasn't aware that vmtruncate was deprecated.  Thanks for 
cluing me in here!

>>> This probably won't perform wonderfully, which is where the range
>>> tracking and delayed punching (and the implied memory freeing)
>>> optimisation comes into play. Sure, for tmpfs this can be implemented
>>> as a shrinker, but for real filesystems that have to punch blocks a
>>> shrinker is really the wrong context to be running such
>>> transactions. However, using the fallocate() interface allows each
>>> filesystem to optimise the delayed hole punching as they see best,
>>> something that cannot be done with this fadvise() interface.
>> So if a shrinker isn't the right context, what would be a good
>> context for delayed hole punching?
> Like we do in XFS for inode reclaim. We have a background workqueue
> that frees aged inodes periodically in the fastest manner possible
> (i.e. all async, no blocking on locks, etc), and the shrinker, when
> run, kicks that background thread first and then enters into
> synchronous reclaim. By the time a single sync reclaim cycle has run
> and throttled reclaim sufficiently, the background thread has done a
> great deal more work.
>
> A similar mechanism can be used for this functionality within XFS.
> Indeed, we could efficiently track which inodes have volatile ranges
> on them via a bit in the radix trees that index the inode cache,
> just like we do for reclaimable inodes. If we then used a bit in the
> page cache radix tree index to indicate volatile pages, we could
> then easily find the ranges we need to punch out without requiring
> some new tree and more per-inode memory.
>
> That's a very filesystem specific implementation - it's vastly
> different to your tmpfs implementation - but this is exactly what I
> mean about using fallocate to allow filesystems to optimise the
> implementation in the most suitable manner for them....
>

So, just to make sure I'm following you, you're suggesting that there 
would be a filesystem specific implementation at the top level. 
Something like a mark_volatile(struct inode *, bool, loff_t, loff_t) 
inode operation? And the filesystem would then be responsible for 
managing the ranges and appropriately purging them?

Thanks again for the feedback, I'll continue looking into this.

thanks
-john





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-28  1:36       ` Dave Chinner
@ 2012-04-30 21:07         ` John Stultz
  2012-05-01  0:08           ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2012-04-30 21:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dave Hansen, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On 04/27/2012 06:36 PM, Dave Chinner wrote:
> That's my concern - that persistent filesystems will have different
> behaviour to in-memory filesystems. They *must* be consistent in
> behaviour w.r.t. stale data exposure, otherwise we are in a world
> of pain when applications start to use this. Quite frankly, I don't
> care about performance of VOLATILE ranges, but I care greatly
> about ensuring filesystems don't expose stale data to user
> applications....
>
I think we're in agreement with the rest of this email, but I do want to 
stress that the performance of volatile ranges will become more 
critical, as in order for folks to effectively use them, they need to 
be able to mark and unmark ranges any time they're not using the data.

No application likely wants its data to be purged, but volunteering it 
allows applications to help the kernel with low-memory constraints and 
improves overall system behavior.

So if the overhead is too great for marking and unmarking pages, 
applications will be less likely to "help out".  :)

thanks
-john


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-30 21:07         ` John Stultz
@ 2012-05-01  0:08           ` Dave Chinner
  2012-05-01  0:46             ` John Stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2012-05-01  0:08 UTC (permalink / raw)
  To: John Stultz
  Cc: Dave Hansen, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On Mon, Apr 30, 2012 at 02:07:16PM -0700, John Stultz wrote:
> On 04/27/2012 06:36 PM, Dave Chinner wrote:
> >That's my concern - that persistent filesystems will have different
> >behaviour to in-memory filesystems. They *must* be consistent in
> >behaviour w.r.t. stale data exposure, otherwise we are in a world
> >of pain when applications start to use this. Quite frankly, I don't
> >care about performance of VOLATILE ranges, but I care greatly
> >about ensuring filesystems don't expose stale data to user
> >applications....
> >
> I think we're in agreement with the rest of this email, but I do
> want to stress that the performance of volatile ranges will become
> more critical, as in order for folks to effectively use them, they
> need to be able to mark and unmark ranges any time they're not using
> the data.

Performance is far less important than data security. Make it safe
first, then optimise performance. As it is, the initial target of
tmpfs - by its very nature of returning zeros for regions not
backed by pages - is safe w.r.t. stale data exposure, so it will not
be slowed down by using an fallocate "best effort" hole-punching
interface.  The performance of other filesystems is something that
the relevant filesystem developers can worry about....

> So if the overhead is too great for marking and unmarking pages,
> applications will be less likely to "help out".  :)

Devil's Advocate: If the benefit of managing caches in such a manner
is this marginal, then why add the complexity to the kernel?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-04-30 19:40         ` John Stultz
@ 2012-05-01  0:28           ` Dave Chinner
  2012-05-01  1:15             ` John Stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2012-05-01  0:28 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On Mon, Apr 30, 2012 at 12:40:13PM -0700, John Stultz wrote:
> On 04/27/2012 07:04 PM, Dave Chinner wrote:
> >On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
> >>On 04/26/2012 05:39 PM, Dave Chinner wrote:
> >>>This probably won't perform wonderfully, which is where the range
> >>>tracking and delayed punching (and the implied memory freeing)
> >>>optimisation comes into play. Sure, for tmpfs this can be implemented
> >>>as a shrinker, but for real filesystems that have to punch blocks a
> >>>shrinker is really the wrong context to be running such
> >>>transactions. However, using the fallocate() interface allows each
> >>>filesystem to optimise the delayed hole punching as they see best,
> >>>something that cannot be done with this fadvise() interface.
> >>So if a shrinker isn't the right context, what would be a good
> >>context for delayed hole punching?
> >Like we do in XFS for inode reclaim. We have a background workqueue
> >that frees aged inodes periodically in the fastest manner possible
> >(i.e. all async, no blocking on locks, etc), and the shrinker, when
> >run, kicks that background thread first and then enters into
> >synchronous reclaim. By the time a single sync reclaim cycle has run
> >and throttled reclaim sufficiently, the background thread has done a
> >great deal more work.
> >
> >A similar mechanism can be used for this functionality within XFS.
> >Indeed, we could efficiently track which inodes have volatile ranges
> >on them via a bit in the radix trees that index the inode cache,
> >just like we do for reclaimable inodes. If we then used a bit in the
> >page cache radix tree index to indicate volatile pages, we could
> >then easily find the ranges we need to punch out without requiring
> >some new tree and more per-inode memory.
> >
> >That's a very filesystem specific implementation - it's vastly
> >different to your tmpfs implementation - but this is exactly what I
> >mean about using fallocate to allow filesystems to optimise the
> >implementation in the most suitable manner for them....
> >
> 
> So, just to make sure I'm following you, you're suggesting that
> there would be a filesystem specific implementation at the top
> level. Something like a  mark_volatile(struct inode *, bool, loff_t,
> loff_t) inode operation? And the filesystem would then be
> responsible for managing the ranges and appropriately purging them?

Not quite. I'm suggesting that you use the .fallocate() file
operation to call into the filesystem specific code, and from there
the filesystem code either calls a generic helper function to mark
ranges as volatile and provides a callback for implementing the
shrinker functionality, or it implements it all itself.

i.e. userspace would do:

	err = fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
	err = fallocate(fd, FALLOC_FL_CLEAR_VOLATILE, off, len);

and that will get passed to the filesystem implementation of
.fallocate (from do_fallocate()). The filesystem callout for this:

0 btrfs/file.c   1898 .fallocate = btrfs_fallocate,
1 ext4/file.c     247 .fallocate = ext4_fallocate,
2 gfs2/file.c    1015 .fallocate = gfs2_fallocate,
3 gfs2/file.c    1045 .fallocate = gfs2_fallocate,
4 ocfs2/file.c   2727 .fallocate = ocfs2_fallocate,
5 ocfs2/file.c   2774 .fallocate = ocfs2_fallocate,
6 xfs/xfs_file.c 1026 .fallocate = xfs_file_fallocate,

can then call a generic helper like, say:

	filemap_mark_volatile_range(inode, off, len);
	filemap_clear_volatile_range(inode, off, len);

to be able to use the range tree tracking you have written for this
purpose. The filesystem is also free to track ranges however it
pleases.

The filesystem will need to be able to store a tree/list root for
tracking all its inodes that have volatile ranges, and register a
shrinker to walk that list and do the work necessary when memory
becomes low, but that is simple to do for a basic implementation.
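
So a filesystem's .fallocate() might grow something like the following 
(a sketch; FALLOC_FL_MARK_VOLATILE/FALLOC_FL_CLEAR_VOLATILE and the 
filemap_* helpers are the proposed pieces, and "foofs" stands in for 
any filesystem):

	static long foofs_fallocate(struct file *file, int mode,
				    loff_t offset, loff_t len)
	{
		struct inode *inode = file->f_mapping->host;

		if (mode & FALLOC_FL_MARK_VOLATILE)	/* proposed flag */
			return filemap_mark_volatile_range(inode, offset, len);
		if (mode & FALLOC_FL_CLEAR_VOLATILE)	/* proposed flag */
			return filemap_clear_volatile_range(inode, offset, len);

		/* ... existing preallocate/punch-hole handling ... */
		return -EOPNOTSUPP;
	}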

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-05-01  0:08           ` Dave Chinner
@ 2012-05-01  0:46             ` John Stultz
  2012-05-01  1:28               ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2012-05-01  0:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dave Hansen, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On 04/30/2012 05:08 PM, Dave Chinner wrote:
> On Mon, Apr 30, 2012 at 02:07:16PM -0700, John Stultz wrote:
>> On 04/27/2012 06:36 PM, Dave Chinner wrote:
>>> That's my concern - that persistent filesystems will have different
>>> behaviour to in-memory filesystems. They *must* be consistent in
>>> behaviour w.r.t. stale data exposure, otherwise we are in a world
>>> of pain when applications start to use this. Quite frankly, I don't
>>> care about performance of VOLATILE ranges, but I care greatly
>>> about ensuring filesystems don't expose stale data to user
>>> applications....
>>>
>> I think we're in agreement with the rest of this email, but I do
>> want to stress that the performance of volatile ranges will become
>> more critical, as in order for folks to effectively use them, they
>> need to be able to mark and unmark ranges any time they're not using
>> the data.
> Performance is far less important than data security. Make it safe
> first, then optimise performance. As it is, the initial target of
> tmpfs - by its very nature of returning zeros for regions not
> backed by pages - is safe w.r.t. stale data exposure, so it will not
> be slowed down by using an fallocate "best effort" hole-punching
> interface.  The performance of other filesystems is something that
> the relevant filesystem developers can worry about....

Again, I think we're quite in agreement about the issue of stale data.  
I just want to make sure you understand that the marking and unmarking 
paths will need to be fast if they are to attract users.


>> So if the overhead is too great for marking and unmarking pages,
>> applications will be less likely to "help out".  :)
> Devil's Advocate: If the benefit of managing caches in such a manner
> is this marginal, then why add the complexity to the kernel?
>
I'm not saying the benefit is marginal. When we are resource constrained 
(no swap) and we need to free memory, having regions pre-marked by 
applications is a great benefit, as we can immediately take those marked 
volatile ranges (as opposed to memory notifiers, where we request 
applications to free memory themselves).  Being able to free chunks of 
application memory, rather than killing the application, provides a 
better experience/overall system performance.  However, if applications 
feel the marking and unmarking is too costly, they are less likely to 
mark the freeable ranges as volatile.

So only if no consideration at all were given to performance would 
there be no benefit to adding the interface.

thanks
-john






^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-05-01  0:28           ` Dave Chinner
@ 2012-05-01  1:15             ` John Stultz
  2012-05-01  1:51               ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2012-05-01  1:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On 04/30/2012 05:28 PM, Dave Chinner wrote:
> On Mon, Apr 30, 2012 at 12:40:13PM -0700, John Stultz wrote:
>> On 04/27/2012 07:04 PM, Dave Chinner wrote:
>>> On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
>>>> On 04/26/2012 05:39 PM, Dave Chinner wrote:
>>>>> This probably won't perform wonderfully, which is where the range
>>>>> tracking and delayed punching (and the implied memory freeing)
>>>>> optimisation comes into play. Sure, for tmpfs this can be implemented
>>>>> as a shrinker, but for real filesystems that have to punch blocks a
>>>>> shrinker is really the wrong context to be running such
>>>>> transactions. However, using the fallocate() interface allows each
>>>>> filesystem to optimise the delayed hole punching as they see best,
>>>>> something that cannot be done with this fadvise() interface.
>>>> So if a shrinker isn't the right context, what would be a good
>>>> context for delayed hole punching?
>>> Like we do in XFS for inode reclaim. We have a background workqueue
>>> that frees aged inodes periodically in the fastest manner possible
>>> (i.e. all async, no blocking on locks, etc), and the shrinker, when
>>> run, kicks that background thread first and then enters into
>>> synchronous reclaim. By the time a single sync reclaim cycle has run
>>> and throttled reclaim sufficiently, the background thread has done a
>>> great deal more work.
>>>
>>> A similar mechanism can be used for this functionality within XFS.
>>> Indeed, we could efficiently track which inodes have volatile ranges
>>> on them via a bit in the radix trees that index the inode cache,
>>> just like we do for reclaimable inodes. If we then used a bit in the
>>> page cache radix tree index to indicate volatile pages, we could
>>> then easily find the ranges we need to punch out without requiring
>>> some new tree and more per-inode memory.
>>>
>>> That's a very filesystem specific implementation - it's vastly
>>> different to your tmpfs implementation - but this is exactly what I
>>> mean about using fallocate to allow filesystems to optimise the
>>> implementation in the most suitable manner for them....
>>>
>> So, just to make sure I'm following you, you're suggesting that
>> there would be a filesystem specific implementation at the top
>> level. Something like a  mark_volatile(struct inode *, bool, loff_t,
>> loff_t) inode operation? And the filesystem would then be
>> responsible for managing the ranges and appropriately purging them?
> Not quite. I'm suggesting that you use the .fallocate() file
> operation to call into the filesystem specific code, and from there
> the filesystem code either calls a generic helper function to mark
> ranges as volatile and provides a callback for implementing the
> shrinker functionality, or it implements it all itself.
>
> i.e. userspace would do:
>
> 	err = fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
> 	err = fallocate(fd, FALLOC_FL_CLEAR_VOLATILE, off, len);
>
> and that will get passed to the filesystem implementation of
> .fallocate (from do_fallocate()). The filesystem callout for this:
>
> 0 btrfs/file.c   1898 .fallocate = btrfs_fallocate,
> 1 ext4/file.c     247 .fallocate = ext4_fallocate,
> 2 gfs2/file.c    1015 .fallocate = gfs2_fallocate,
> 3 gfs2/file.c    1045 .fallocate = gfs2_fallocate,
> 4 ocfs2/file.c   2727 .fallocate = ocfs2_fallocate,
> 5 ocfs2/file.c   2774 .fallocate = ocfs2_fallocate,
> 6 xfs/xfs_file.c 1026 .fallocate = xfs_file_fallocate,
Ok.

Although noting that tmpfs doesn't implement fallocate, this isn't just 
a ruse to get me to implement fallocate for tmpfs, right? ;)

Out of curiosity: while a tmpfs-exclusive implementation isn't a 
problem for my uses, I *really* appreciate your feedback and help 
focusing this into a more generic fs solution. But to get more context 
for your insights, I'm curious if you have any use cases in mind that 
are directing this work to be more generic?

Again, you've been helpfully skeptical of the work while at the same 
time pushing it in a certain direction, and I want to make sure I better 
understand where you're coming from.  You've clarified the file security 
concern well enough, but is that all?  I just want to make sure that 
going the generic fallocate route is really worth it, as opposed to 
moving to madvise() and just failing on file-data-backed memory.

> can then call a generic helper like, say:
>
> 	filemap_mark_volatile_range(inode, off, len);
> 	filemap_clear_volatile_range(inode, off, len);
>
> to be able to use the range tree tracking you have written for this
> purpose. The filesystem is also free to track ranges however it
> pleases.
>
> The filesystem will need to be able to store a tree/list root for
> tracking all its inodes that have volatile ranges, and register a
> shrinker to walk that list and do the work necessary when memory
> becomes low, but that is simple to do for a basic implementation.
>
Hrmm..  Currently I'm using a per-mapping range-tree along with a global 
LRU list that ties all the ranges together.

We could go for a per-filesystem, filesystem-controlled LRU as you 
suggest, although that, along with the filesystem-managed range-tree 
roots, would make the generic helpers a little complex. But it could 
probably be worked out.
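
For reference, the current shape is roughly this (a simplified sketch; 
field names approximate):

	/* per-mapping range tree; ranges also strung on one global LRU */
	struct volatile_range {
		struct range_tree_node	range_node;	/* in mapping's tree */
		struct list_head	lru;		/* global LRU today;
							   could become the
							   per-fs LRU you
							   suggest */
		unsigned int		purged;
	};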

I'll let you know when I have a first pass implementation or if I hit 
any walls.

Once again, thanks so much for all the feedback and guidance!
-john






^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-05-01  0:46             ` John Stultz
@ 2012-05-01  1:28               ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2012-05-01  1:28 UTC (permalink / raw)
  To: John Stultz
  Cc: Dave Hansen, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On Mon, Apr 30, 2012 at 05:46:12PM -0700, John Stultz wrote:
> On 04/30/2012 05:08 PM, Dave Chinner wrote:
> >On Mon, Apr 30, 2012 at 02:07:16PM -0700, John Stultz wrote:
> >>On 04/27/2012 06:36 PM, Dave Chinner wrote:
> >>So if the overhead is too great for marking and unmarking pages,
> >>applications will be less likely to "help out".  :)
> >Devil's Advocate: If the benefit of managing caches in such a manner
> >is this marginal, then why add the complexity to the kernel?
> >
> I'm not saying the benefit is marginal. When we are resource
> constrained (no swap) and we need to free memory, having regions
> pre-marked by applications is a great benefit, as we can immediately
> take those marked volatile ranges (as opposed to memory notifiers,
> where we request applications to free memory themselves). 

This isn't a performance problem - this is an "avoid a hard fail"
problem. OOM tends to be fatal, and when you get to those situations
performance is already compromised. Once again, resiliency in low
memory situations is more important than absolute performance.

> Being
> able to free chunks of application memory, rather than killing the
> application, provides a better experience/overall system performance.

Exactly. The choice you are making here is better resiliency at an
edge case vs a hard fail. The result is better overall behaviour, but
what you are trying to solve is not a performance problem. Hence
making the argument that performance is critical or no-one will use
it is, IMO, way off the mark. Developers will use it; otherwise it
is their application that will get killed by the OS rather than the
competing app that uses the interface and allows the kernel to
shrink its memory usage.....

> However, if applications feel the marking and unmarking is too
> costly, they are less likely to mark the freeable ranges as
> volatile.

Tracking ranges is not going to be a performance problem. For
real filesystems, it's the hole punching to guarantee data security
that will cause performance issues. In the case you are most
interested in (tmpfs) there is no hole punching overhead when
freeing ranges, so the only overhead is the range tracking. That's
code you are writing, so I don't see how it would end up unfit for
your purpose. :)
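
For the tmpfs case, the range-tracking shrinker could be as small as
the following sketch, written against the current (3.4-era) shrinker
API; the LRU, lock, counter and purge helper names are all
hypothetical:

	/*
	 * Sketch of a shrinker that purges volatile ranges, oldest
	 * first.  volatile_lru_lock, volatile_lru, nr_volatile_pages
	 * and volatile_range_purge() are hypothetical; the struct
	 * shrinker usage matches the current API.
	 */
	static int volatile_shrink(struct shrinker *s,
				   struct shrink_control *sc)
	{
		struct volatile_range *range, *next;
		long nr = sc->nr_to_scan;

		if (!nr)
			return nr_volatile_pages;  /* report count only */

		mutex_lock(&volatile_lru_lock);
		list_for_each_entry_safe(range, next, &volatile_lru, lru) {
			if (nr <= 0)
				break;
			/* zap pages for tmpfs; punch holes elsewhere */
			nr -= volatile_range_purge(range);
		}
		mutex_unlock(&volatile_lru_lock);

		return nr_volatile_pages;
	}

	static struct shrinker volatile_shrinker = {
		.shrink	= volatile_shrink,
		.seeks	= DEFAULT_SEEKS,
	};
	/* registered once with register_shrinker(&volatile_shrinker) */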

> So it's only if no consideration at all is given to performance that
> there'd be no benefit to adding the interface.

I did not say that no consideration should be given to performance,
just that data safety comes first. For your specific application on
tmpfs, data safety comes for free due to the nature of the tmpfs
implementation. However, that data safety needs to be explicit in
the API, not a result of an undocumented side effect of a specific
implementation. For real filesystems there will be overhead as a
result of fulfilling the data safety requirement, but solving those
performance issues is the responsibility of the filesystem
developers, not you....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2012-05-01  1:15             ` John Stultz
@ 2012-05-01  1:51               ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2012-05-01  1:51 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Neil Brown, Andrea Righi, Aneesh Kumar K.V

On Mon, Apr 30, 2012 at 06:15:44PM -0700, John Stultz wrote:
> On 04/30/2012 05:28 PM, Dave Chinner wrote:
> >On Mon, Apr 30, 2012 at 12:40:13PM -0700, John Stultz wrote:
> >>On 04/27/2012 07:04 PM, Dave Chinner wrote:
> >>>On Fri, Apr 27, 2012 at 12:14:18PM -0700, John Stultz wrote:
> >>>>On 04/26/2012 05:39 PM, Dave Chinner wrote:
> >>>>>This probably won't perform wonderfully, which is where the range
> >>>>>tracking and delayed punching (and the implied memory freeing)
> >>>>>optimisation comes into play. Sure, for tmpfs this can be implemented
> >>>>>as a shrinker, but for real filesystems that have to punch blocks a
> >>>>>shrinker is really the wrong context to be running such
> >>>>>transactions. However, using the fallocate() interface allows each
> >>>>>filesystem to optimise the delayed hole punching as they see best,
> >>>>>something that cannot be done with this fadvise() interface.
> >>>>So if a shrinker isn't the right context, what would be a good
> >>>>context for delayed hole punching?
> >>>Like we do in XFS for inode reclaim. We have a background workqueue
> >>>that frees aged inodes periodically in the fastest manner possible
> >>>(i.e. all async, no blocking on locks, etc), and the shrinker, when
> >>>run, kicks that background thread first and then enters into
> >>>synchronous reclaim. By the time a single sync reclaim cycle has run
> >>>and throttled reclaim sufficiently, the background thread has done a
> >>>great deal more work.
> >>>
> >>>A similar mechanism can be used for this functionality within XFS.
> >>>Indeed, we could efficiently track which inodes have volatile ranges
> >>>on them via a bit in the radix trees that index the inode cache,
> >>>just like we do for reclaimable inodes. If we then used a bit in the
> >>>page cache radix tree index to indicate volatile pages, we could
> >>>then easily find the ranges we need to punch out without requiring
> >>>some new tree and more per-inode memory.
> >>>
> >>>That's a very filesystem specific implementation - it's vastly
> >>>different to your tmpfs implementation - but this is exactly what I
> >>>mean about using fallocate to allow filesystems to optimise the
> >>>implementation in the most suitable manner for them....
> >>>
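
To make the tagging idea above concrete, a sketch, assuming a spare
page-cache tag were available. The real radix tree's tag budget is
small and fully used (dirty/writeback/towrite), so
PAGECACHE_TAG_VOLATILE is purely hypothetical; the tag and lookup
calls themselves are the standard API:

	/*
	 * Illustrative only: PAGECACHE_TAG_VOLATILE does not exist
	 * and assumes a spare radix-tree tag.
	 */
	#define PAGECACHE_TAG_VOLATILE	3	/* hypothetical */

	static void tag_page_volatile(struct address_space *mapping,
				      pgoff_t index)
	{
		spin_lock_irq(&mapping->tree_lock);
		radix_tree_tag_set(&mapping->page_tree, index,
				   PAGECACHE_TAG_VOLATILE);
		spin_unlock_irq(&mapping->tree_lock);
	}

	/* The background punch worker then finds its work with a
	 * gang tag lookup instead of a separate range tree: */
	static unsigned find_volatile_pages(struct address_space *mapping,
					    pgoff_t *start, unsigned nr,
					    struct page **pages)
	{
		return find_get_pages_tag(mapping, start,
					  PAGECACHE_TAG_VOLATILE,
					  nr, pages);
	}
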
> >>So, just to make sure I'm following you, you're suggesting that
> >>there would be a filesystem specific implementation at the top
> >>level. Something like a mark_volatile(struct inode *, bool, loff_t,
> >>loff_t) inode operation? And the filesystem would then be
> >>responsible for managing the ranges and appropriately purging them?
> >Not quite. I'm suggesting that you use the .fallocate() file
> >operation to call into the filesystem specific code, and from there
> >the filesystem code either calls a generic helper function to mark
> >ranges as volatile and provides a callback for implementing the
> >shrinker functionality, or it implements it all itself.
> >
> >i.e. userspace would do:
> >
> >	err = fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
> >	err = fallocate(fd, FALLOC_FL_CLEAR_VOLATILE, off, len);
> >
> >and that will get passed to the filesystem implementation of
> >.fallocate (from do_fallocate()). The filesystem callout for this:
> >
> >0 btrfs/file.c   1898 .fallocate = btrfs_fallocate,
> >1 ext4/file.c     247 .fallocate = ext4_fallocate,
> >2 gfs2/file.c    1015 .fallocate = gfs2_fallocate,
> >3 gfs2/file.c    1045 .fallocate = gfs2_fallocate,
> >4 ocfs2/file.c   2727 .fallocate = ocfs2_fallocate,
> >5 ocfs2/file.c   2774 .fallocate = ocfs2_fallocate,
> >6 xfs/xfs_file.c 1026 .fallocate = xfs_file_fallocate,
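
As a sketch of that dispatch, a filesystem's handler might start out
like this. The MARK/CLEAR flags and filemap_* helpers are the
proposed names from this thread, not a mainline API, and "foofs" is a
stand-in filesystem:

	/*
	 * Sketch of the filesystem-side dispatch being described.
	 * FALLOC_FL_MARK_VOLATILE/CLEAR_VOLATILE and the filemap_*
	 * helpers are hypothetical names from this discussion.
	 */
	static long foofs_fallocate(struct file *file, int mode,
				    loff_t offset, loff_t len)
	{
		struct inode *inode = file->f_mapping->host;

		if (mode & FALLOC_FL_MARK_VOLATILE)
			return filemap_mark_volatile_range(inode, offset, len);
		if (mode & FALLOC_FL_CLEAR_VOLATILE)
			return filemap_clear_volatile_range(inode, offset, len);

		/* otherwise fall through to normal preallocation */
		return foofs_do_fallocate(file, mode, offset, len);
	}
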
> Ok.
> 
> Although noting that tmpfs doesn't implement fallocate, this isn't
> just a ruse to get me to implement fallocate for tmpfs, right? ;)

No, I don't care about what tmpfs supports in most cases, and in
reality fallocate has not been implemented because tmpfs is generally
hacked on by MM people, not filesystem people. That's why it has all
the special interfaces that filesystems can't actually use...

> Out of curiosity: while a tmpfs-exclusive implementation isn't a
> problem for my uses, I *really* appreciate your feedback and help
> focusing this into a more generic fs solution. But to get more
> context for your insights, I'm curious whether you have any use
> cases in mind that are directing this work to be more generic?

Managing large scale disk caches has exactly the same problems of
determining what to evict and/or move to secondary storage when
space is low. Being able to mark ranges of files as "remove this
first" would be very advantageous for pro-active management of ENOSPC
conditions in the cache...

And being able to do on-demand hole-punching for stuff like
managing VM images would be really cool. For example, the loopback
device uses hole punching to implement TRIM commands, so turning
those into VOLATILE ranges for background dispatch will speed those
operations up immensely and avoid silly same-block "TRIM - write -
TRIM - write" cyclic hole-punch/allocation in the backing file.  KVM
could do the same for implementing TRIM on image-based block
devices...

There are lots of things that can be done with a generic advisory,
asynchronous hole-punching interface.
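
A simplified sketch of that loop-device idea, assuming the
hypothetical FALLOC_FL_MARK_VOLATILE flag; the fallback is the
punch-hole call loop already issues for discards today:

	/*
	 * Sketch only: defer TRIM by marking the backing-file range
	 * volatile rather than punching synchronously.
	 * FALLOC_FL_MARK_VOLATILE is the hypothetical flag; the
	 * PUNCH_HOLE fallback is the existing mechanism.
	 */
	static int lo_discard(struct loop_device *lo, loff_t pos, loff_t len)
	{
		struct file *file = lo->lo_backing_file;
		int ret;

		if (!file->f_op || !file->f_op->fallocate)
			return -EOPNOTSUPP;

		ret = file->f_op->fallocate(file, FALLOC_FL_MARK_VOLATILE,
					    pos, len);
		if (ret == -EOPNOTSUPP)	/* fall back to immediate punch */
			ret = file->f_op->fallocate(file,
					FALLOC_FL_PUNCH_HOLE |
					FALLOC_FL_KEEP_SIZE,
					pos, len);
		return ret;
	}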

> Again, you've been helpfully skeptical of the work while at the same
> time pushing it in a certain direction, and I want to make sure I
> understand where you're coming from.  You've clarified the file
> security concern well enough, but is that all?  I just want to be
> sure that going the generic fallocate route is really worth it, as
> opposed to moving to madvise() and simply failing on file-data-backed
> memory.
> 
> >can then call a generic helper like, say:
> >
> >	filemap_mark_volatile_range(inode, off, len);
> >	filemap_clear_volatile_range(inode, off, len);
> >
> >to be able to use the range tree tracking you have written for this
> >purpose. The filesystem is also free to track ranges however it
> >pleases.
> >
> >The filesystem will need to be able to store a tree/list root for
> >tracking all its inodes that have volatile ranges, and register a
> >shrinker to walk that list and do the work necessary when memory
> >becomes low, but that is simple to do for a basic implementation.
> >
> Hrmm..  Currently I'm using a per-mapping range-tree along with a
> global LRU list that ties all the ranges together.

/me smells another candidate for his in-progress generic
LRU-with-integrated-shrinker work.
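
A purely speculative sketch of the shape such a generic
LRU-with-shrinker might take; none of these names is an existing
API:

	/*
	 * Speculative sketch of a generic LRU with an integrated
	 * shrinker; every name here is invented.
	 */
	struct generic_lru {
		spinlock_t		lock;
		struct list_head	list;
		long			nr_items;
		/* dispose one item, return pages freed */
		long			(*dispose)(struct generic_lru *,
						   struct list_head *);
		struct shrinker		shrinker;  /* set up by init */
	};

	int generic_lru_init(struct generic_lru *lru);
	void generic_lru_add(struct generic_lru *lru, struct list_head *item);
	void generic_lru_del(struct generic_lru *lru, struct list_head *item);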

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2012-05-01  1:51 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-24 17:49 [PATCH 0/3] Volatile Ranges John Stultz
2012-04-24 17:49 ` [PATCH 1/3] Range tree implementation John Stultz
2012-04-24 19:14   ` Peter Zijlstra
2012-04-24 19:25     ` John Stultz
2012-04-24 19:33       ` Peter Zijlstra
2012-04-25 12:16   ` Dmitry Adamushko
2012-04-25 16:19     ` John Stultz
2012-04-26 10:00       ` Dmitry Adamushko
2012-04-27 19:34         ` John Stultz
2012-04-24 17:49 ` [PATCH 2/3] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
2012-04-24 19:20   ` Peter Zijlstra
2012-04-24 19:50     ` John Stultz
2012-04-27  0:39   ` Dave Chinner
2012-04-27 15:25     ` Dave Hansen
2012-04-28  1:36       ` Dave Chinner
2012-04-30 21:07         ` John Stultz
2012-05-01  0:08           ` Dave Chinner
2012-05-01  0:46             ` John Stultz
2012-05-01  1:28               ` Dave Chinner
2012-04-27 19:14     ` John Stultz
2012-04-28  2:04       ` Dave Chinner
2012-04-30 19:40         ` John Stultz
2012-05-01  0:28           ` Dave Chinner
2012-05-01  1:15             ` John Stultz
2012-05-01  1:51               ` Dave Chinner
2012-04-24 17:49 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
2012-04-24 19:21   ` Peter Zijlstra
2012-04-24 19:42     ` John Stultz
