* [PATCH 0/3] Volatile Ranges (v7) & Lots of words
From: John Stultz @ 2012-09-29  3:16 UTC
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, linux-mm


After Kernel Summit and Plumbers, I wanted to consider all the various
side-discussions and try to summarize my current thoughts here along
with sending out my current implementation for review.

Also: I'm going on four weeks of paternity leave in the very near
(but non-deterministic) future. So while I hope I still have time
for some discussion, I may have to deal with fussier complaints
than yours. :)  In any case, you'll have more time to chew on
the idea and come up with amazing suggestions. :)


General Interface semantics:
----------------------------------------------

The high level interface I've been pushing has so far stayed fairly
consistent:

Application marks a range of data as volatile. Volatile data may
be purged at any time. Accessing volatile data is undefined, so
applications should not do so. If the application wants to access
data in a volatile range, it should mark it as non-volatile. If any
of the pages in the range being marked non-volatile had been purged,
the kernel will return an error, notifying the application that the
data was lost.
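
To make this concrete, here's a rough userspace sketch of that flow.
Note it's only a sketch: the FALLOC_FL_* values are copied from patch
2/3 (they aren't in any released header), and the raw syscall is used
because, as implemented here, "purged" is signaled with a positive
return value rather than -1/errno:

	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <unistd.h>

	#define FALLOC_FL_MARK_VOLATILE		0x04	/* from this patchset */
	#define FALLOC_FL_UNMARK_VOLATILE	0x08	/* from this patchset */

	/* Offer an idle cache region up for reclaim. */
	static int cache_mark_idle(int fd, off_t start, off_t len)
	{
		return syscall(SYS_fallocate, fd, FALLOC_FL_MARK_VOLATILE,
				start, len);
	}

	/* Take the region back before touching it. 0 means the contents
	 * are intact; a positive return means some page in the range
	 * was purged and the data must be regenerated. */
	static int cache_take_back(int fd, off_t start, off_t len)
	{
		return syscall(SYS_fallocate, fd, FALLOC_FL_UNMARK_VOLATILE,
				start, len);
	}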

But one interesting new tweak on this design, suggested by Taras
Glek and others at Mozilla, is as follows:

Instead of leaving volatile data access undefined, an access
either returns the expected data, if it has not been purged, or
delivers a SIGBUS when the application touches volatile data that
has been purged.

Everything else remains the same (error on marking non-volatile
if data was purged, etc). This model allows applications to avoid
having to unmark volatile data when they want to access it, then
immediately re-mark it as volatile when they're done. It is in
effect "lazy" with its marking, allowing the kernel to hit the
application with a signal when it gets unlucky and touches purged
data. From the signal handler, the application can note the
address it faulted on, unmark the range, and regenerate the
needed data before returning to execution.
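
A minimal sketch of that lazy pattern (reusing the hypothetical
cache_take_back() helper from the sketch above; regenerate_page() is a
stand-in for whatever application code rebuilds the data):

	#include <signal.h>
	#include <sys/types.h>

	extern void regenerate_page(char *addr, long len);	/* app-specific */

	static int cache_fd;		/* tmpfs fd backing the cache */
	static char *cache_base;	/* base of the mmap'ed cache */
	static long page_size;		/* sysconf(_SC_PAGESIZE), set at startup */

	static void purged_handler(int sig, siginfo_t *info, void *uctx)
	{
		off_t off = (char *)info->si_addr - cache_base;
		off_t page = off & ~((off_t)page_size - 1);

		/* Unmark the page so the fault won't re-fire, then rebuild
		 * the lost contents; returning retries the access. */
		cache_take_back(cache_fd, page, page_size);
		regenerate_page(cache_base + page, page_size);
	}

	static void install_purged_handler(void)
	{
		struct sigaction sa = {
			.sa_sigaction = purged_handler,
			.sa_flags = SA_SIGINFO,
		};
		sigaction(SIGBUS, &sa, NULL);
	}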

Since this approach avoids the more explicit unmark/access/mark
pattern, it avoids the extra overhead required to ensure data is
non-volatile before being accessed.

However, if applications don't want to deal with handling the
SIGBUS, they can use the more straightforward (but more costly)
unmark/access/mark pattern in the same way as my earlier proposals.

This allows folks to balance the cost vs complexity in their
application appropriately.

So that's a general overview of how the idea I'm proposing could
be used.



Specific Interface semantics:
---------------------------------------------

Here are some of the open question about how the user interface
should look:

fadvise vs fallocate:

	So while originally I used fadvise, currently my
	implementation uses fallocate(fd, FALLOC_FL_MARK_VOLATILE,
	start, len) to mark a range as volatile and fallocate(fd,
	FALLOC_FL_UNMARK_VOLATILE, start, len) to unmark ranges.

	During kernel summit, the question was raised whether
	fallocate was really the right interface to be using, or
	whether fadvise would be better. To me fadvise makes a little
	more sense,
	but earlier it was pointed out that marking data ranges as
	volatile could also be seen as a type of cancellable and lazy
	hole-punching, so from that perspective fallocate might make
	more sense.  This is still an open question and I'd appreciate
	further input here.
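
	For concreteness, the two candidate shapes side by side (the
	fadvise names below are from the earlier fadvise-based
	proposal, not from this patchset):

		/* fallocate flavor: volatility as a cancellable,
		 * lazy hole-punch */
		fallocate(fd, FALLOC_FL_MARK_VOLATILE, start, len);
		fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, start, len);

		/* fadvise flavor: volatility as cache advice */
		posix_fadvise(fd, start, len, POSIX_FADV_VOLATILE);
		posix_fadvise(fd, start, len, POSIX_FADV_NONVOLATILE);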


tmpfs vs non-shmem filesystems:
	Android's ashmem primarily provides a way to get unlinked
	tmpfs fds that can be shared between applications. It's
	just an additional feature that those pages can be "unpinned",
	or marked volatile in my terminology. Thus in implementing
	volatile ranges, I've focused on getting it to work on tmpfs
	file descriptors.  However, there has been some interest in
	using volatile ranges with more traditional filesystems. The
	semantics for how volatile range purging would work on a
	real filesystem are not well established, and I can't say I
	understand the utility quite yet, but there may be a case for
	having data that you know won't be committed to disk until it
	is marked as non-volatile.  In the meantime, returning EINVAL
	on non-tmpfs filesystems until such a use is established
	should be fine.

fd based interfaces vs madvise:
	In talking with Taras Glek, he pointed out that for his
	needs, the fd based interface is a little annoying, as it
	requires getting access to a tmpfs file and mmapping it in;
	then, instead of just referencing a pointer to the data he
	wants to mark volatile, he has to calculate the offset from
	the start of the mmap and pass those file offsets to the
	interface.
	Instead he mentioned that using something like madvise would be
	much nicer, since they could just pass a pointer to the object
	in memory they want to make volatile and avoid the extra work.
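
	Roughly, the difference in bookkeeping looks like this
	(MADV_VOLATILE is hypothetical; no such flag exists in this
	patchset):

		/* fd-based: translate the pointer back into a file
		 * offset before calling the interface */
		char *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);
		off_t off = obj_ptr - base;
		fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, obj_len);

		/* madvise-style: just pass the pointer */
		madvise(obj_ptr, obj_len, MADV_VOLATILE);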

	I'm not opposed to adding an madvise interface for this as
	well, but since we have an existing use case with Android's
	ashmem, I want to make sure we support this existing behavior.
	Specifically, as with ashmem, applications can be sharing
	these tmpfs fds, and so file-relative volatile ranges make
	more sense if you need to coordinate what data is volatile
	between two applications.

	Also, while I agree that having an madvise interface for
	volatile ranges would be nice, it does open up some more
	complex implementation issues, since with files, there is a
	fixed relationship between pages and the files' address_space
	mapping, where you can't have pages shared between different
	mappings. This makes it easy to hang the volatile-range tree
	off of the mapping (well, indirectly via a hash table). With
	general anonymous memory, pages can be shared between multiple
	processes, and as far as I understand, don't have any grouping
	structure we could use to determine if the page is in a
	volatile range or not. We would also need to determine more
	complex questions like: What are the semantics of volatility
	with copy-on-write pages?  I'm hoping to investigate this
	idea more deeply soon, so I can be sure whatever is pushed
	has a clear plan for how to address it. Further thoughts
	here would be appreciated.


It would really be great to get any thoughts on these issues, as they
are higher priority to me than diving into the details of how we
implement this internally, which can shift over time.



Implementation Considerations:
---------------------------------------------

How best to manage volatile ranges internally in the kernel is still
an open question.

With this patch set, I'm really wanting to provide a proof of concept
of the general interface semantics above. This allows applications to
play with the idea and validate that it would work for them, while
discussion continues on how best to implement it, or how best to let
the implementation evolve, in the kernel.

Even so, I'm very interested in any discussion about how to manage
volatile ranges optimally.

Before describing the different management approaches I've tried,
there are some abstract properties and considerations that need to
be kept in mind:

* Range-granular Purging:
	Since volatile ranges can be reclaimed at any time, the
	question of how the kernel should reclaim volatile data
	needs to be addressed.  When a large data range is marked
	as volatile, if any single page in that range is purged,
	the application will get an error when it marks the range
	as non-volatile.  Thus when any single page in a range
	is purged, the "value" of the entire range is destroyed.
	Because of this property, it makes most sense to purge the
	entire range together.


* Coalescing of adjacent volatile ranges:
	With volatile ranges, any overlapping ranges are always
	coalesced. However, there is an open question of what to
	do with neighboring ranges. With Android's approach, any
	neighboring ranges were coalesced into one range.  I've since
	tweaked this so that adjacent ranges are coalesced only if
	both have not yet been purged (or both are already purged).
	This avoids throwing away fine data just because it's next
	to data that has already been tossed.  Not coalescing
	non-overlapping ranges is also an option I've considered,
	as it better follows the application's wishes: since the
	application is providing these non-overlapping ranges
	separately, we should probably also purge them separately.
	The one complication here is that, for userland's sake, we
	manage volatile ranges at a byte level. So if an application
	marks one and a half pages of data as volatile, we only purge
	pages that are entirely volatile (see the sketch below).
	This avoids accidentally purging non-volatile data on the
	rest of the page.  However, if an array of sub-page sized
	data is marked volatile one by one, coalescing the ranges
	allows us to purge a page that consists entirely of multiple
	volatile ranges.  So for now I'm still coalescing assuming
	the neighbors are both unpurged, but this behavior is open
	to being tweaked.
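
	As a sketch of that rounding rule (purge_pages() here is just
	a stand-in for the real purge path):

		/* Only pages entirely inside the volatile byte range
		 * [start, end] (inclusive) may be purged: round the
		 * start up and the exclusive end down. */
		u64 first = DIV_ROUND_UP(start, PAGE_SIZE);
		u64 last = (end + 1) / PAGE_SIZE;	/* exclusive */

		if (first < last)
			purge_pages(mapping, first, last);

		/* e.g. one and a half pages, [0, PAGE_SIZE*3/2 - 1]:
		 * first = 0, last = 1, so only page 0 is purged. */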


* Purging order between volatile ranges:
	Again, since it makes sense to purge all the complete
	pages in a range at the same time, we need to consider the
	subtle difference between the least-recently-used pages vs
	least-recently-used ranges. A single range could contain very
	frequently accessed data, as well as rarely accessed data.
	One must also consider that the act of marking a range as
	volatile may not actually touch the underlying pages. Thus
	purging ranges based on a least-recently-used page may also
	result in purging the most-recently used page.

	Android addressed the purging order question by purging ranges
	in the order they were marked volatile. Thus the oldest
	volatile range is the first range to be purged. This works
	well in the Android model, as applications aren't supposed
	to access volatile data, so the least-recently-marked-volatile
	order maps well to the least-recently-used range.

	However, this assumption doesn't hold with the lazy SIGBUS
	notification method, as pages in a volatile range may continue
	to be accessed after the range is marked volatile.  So the
	question as to what is the best order of purging volatile
	ranges is definitely open.

	Abstractly the ideal solution might be to evaluate the
	most-recently used page in each range, and to purge the range
	with the oldest recently-used-page, but I suspect this is
	not something that could be calculated efficiently.

	Additionally, in my conversations with Taras, he pointed out
	that if we are using a one-application-at-a-time UI model,
	it would be ideal to discourage purging volatile data used by
	the current application, instead prioritizing volatile ranges
	from applications that aren't active. However, I'm not sure
	what mechanism could be used to prioritize range purging in
	this fashion, especially considering volatile ranges can be
	on data that is shared between applications.


* Volatile range purging order relative to non-volatile pages:
	Initially I had proposed that since applications had offered
	data up as unused, volatile ranges should be purged before we
	try to free any other pages in the system.  At Plumbers, Andrea
	pointed out that this doesn't make much sense, as there may be
	inactive file pages from some streaming file data which are not
	going to be used any time soon, and would be a better candidate
	to free than an application's volatile pages. This sounded
	quite reasonable, so it's likely we need to balance volatile
	purging with freeing other pages in the system. However, I do
	think it is advantageous to purge volatile pages before we
	free any dirty pages that must be laundered, as part of the
	goal of volatile pages is to avoid extra I/O. Although from
	my reading of shrink_page_list in vmscan.c, I'm not sure I see
	if/how we prioritize freeing clean pages over dirty ones.


So with that background covered, on to discussing actual
implementations.

Implementation Details:
---------------------------------------------

There are two rough approaches that I have tried so far:

1) Managing volatile range objects, in a tree or list, which are then
purged using a shrinker

2) Page based management, where pages marked volatile are moved to
a new LRU list and are purged from there.



1) This patchset implements the shrinker-based approach. In many ways it
is simpler, but it does have a few drawbacks.  Basically, when marking a
range as volatile, we create a range object and add it to an rbtree.
This allows us to quickly find ranges, given an address in
the file.  We also add each range object to the tail of a filesystem-
global linked list, which acts as an LRU, allowing us to quickly find
the least recently created volatile range. We then use a shrinker
callback to trigger purging, where we'll select the range at the head
of the LRU list, purge the data, mark the range object as purged,
and remove it from the LRU list.

This allows fairly efficient behavior, as marking and unmarking
a range are both O(log n) operations, with respect to the number of
ranges, to insert into and remove from the tree.  Purging is also
O(1) to select a range, and we purge the entire range in
least-recently-marked-volatile order.

The drawback with this approach is that it uses a shrinker, and thus
is NUMA-unaware. We track the virtual address of the pages in the file,
so we don't have a sense of what physical pages we're using, nor of
which node those pages may be on. So it's possible on a multi-node
system that when one node is under pressure, we'd purge volatile
ranges that are all on a different node, in effect throwing data away
without helping anything. This is clearly non-ideal for NUMA systems.

One idea I discussed with Michel Lespinasse is that this might be
something we could improve by providing the shrinker some node context,
then keeping track, in each range, of what node its first page is on.
That way we would be sure to at least free up one page on the node
under pressure when purging that range.


2) The second approach, which was more page based, was also tried. In
this case, when we marked a range as volatile, the pages in that range
were moved to a new LRU list, LRU_VOLATILE, in vmscan.c.  This provided
a page LRU list that could be used to free pages before looking at
the LRU_INACTIVE_FILE/ANONYMOUS lists.

This integrates the feature deeper in the mm code, which is nice,
especially as we have an LRU_VOLATILE list for each NUMA node. Thus,
under pressure, we won't purge ranges that are entirely on a different
node, as is possible with the other approach.

However, this approach is more costly.  When marking a range
as volatile, we have to migrate every page in that range to the
LRU_VOLATILE list, and similarly on unmarking we have to move each
page back. This ends up being O(n) with respect to the number of
pages in the range we're marking or unmarking. Similarly, when purging,
we let the scanning code select a page off the LRU, then we have to
map it back to the volatile range so we can purge the entire range,
making it a more expensive O(log n) operation, with respect to the
number of ranges.

This is a particular concern as applications that want to mark and
unmark data as volatile with fine granularity will likely be calling
these operations frequently, adding quite a bit of overhead. This
makes it less likely that applications will choose to volunteer data
as volatile to the system.

With the new lazy SIGBUS notification, however, applications using
the SIGBUS method would avoid having to mark and unmark data when
accessing it, so this overhead may be less of a concern. But for
cases where applications don't want to deal with the SIGBUS and would
rather have the more deterministic behavior of the unmark/access/mark
pattern, the performance is a concern.

Additionally, there may be ways to defer and batch the page migration
so that applications don't suffer the extra cost, but this solution
may be limited or could cause some strange behavior, since we can't
defer the unmark operation: we don't want pages to be purged after
the application thinks they were unmarked.


Whew, that was long...

Anyway, if you got this far and are still interested, I'd greatly
appreciate hearing any other suggested implementations, or ways
around the drawbacks of the approaches already tried.

thanks
-john


For this v7 patchset revision the changes are as follows:
* Dropped the LRU_VOLATILE approach for now so we can focus on
  getting the general interface semantics agreed upon
* Converted to using byte ranges rather than page ranges to make
  userland's life easier.
* Add SIGBUS on purged page access behavior, allowing for access
  of volatile data without having to unmark it.


Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>


John Stultz (3):
  [RFC] Add volatile range management code
  [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
  [RFC] ashmem: Convert ashmem to use volatile ranges

 drivers/staging/android/ashmem.c |  335 +---------------------
 fs/open.c                        |    3 +-
 include/linux/falloc.h           |    7 +-
 include/linux/volatile.h         |   46 +++
 mm/Makefile                      |    2 +-
 mm/shmem.c                       |  120 ++++++++
 mm/volatile.c                    |  580 ++++++++++++++++++++++++++++++++++++++
 7 files changed, 763 insertions(+), 330 deletions(-)
 create mode 100644 include/linux/volatile.h
 create mode 100644 mm/volatile.c

-- 
1.7.9.5



* [PATCH 1/3] [RFC] Add volatile range management code
From: John Stultz @ 2012-09-29  3:16 UTC
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, linux-mm

This patch provides the volatile range management code
that filesystems can utilize when implementing
FALLOC_FL_MARK_VOLATILE.

It tracks a collection of byte ranges against a mapping,
stored in an rbtree. This code handles coalescing
overlapping and adjacent ranges, as well as splitting
ranges when sub-chunks are removed.

The ranges can be marked purged or unpurged, and there is
a per-fs LRU list that tracks all the unpurged ranges for
that fs.

v2:
* Fix bug in volatile_ranges_get_last_used returning bad
  start,end values
* Rework for intervaltree renaming
* Optimize volatile_range_lru_size to avoid running through
  lru list each time.

v3:
* Improve function name to make it clear what the
  volatile_ranges_pluck_lru() code does.
* Drop volatile_range_lru_size and unpurged_page_count
  management as it's now unused

v4:
* Re-add volatile_range_lru_size and unpurged_page_count
* Fix bug in range_remove when we split ranges: we add
  an overlapping range before resizing the existing range.

v5:
* Drop intervaltree for prio_tree usage per Michel &
  Dmitry's suggestions.
* Cleanups

v6:
* Drop prio_tree usage for rbtree per Michel Lespinasse's
  suggestion.

v7:
* Use byte ranges instead of page ranges to make userspace's
  life easier.
* Add volatile_range_is_purged check for SIGBUS on
  purged page access.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/volatile.h |   46 ++++
 mm/Makefile              |    2 +-
 mm/volatile.c            |  580 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 627 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/volatile.h
 create mode 100644 mm/volatile.c

diff --git a/include/linux/volatile.h b/include/linux/volatile.h
new file mode 100644
index 0000000..c59a0f9
--- /dev/null
+++ b/include/linux/volatile.h
@@ -0,0 +1,46 @@
+#ifndef _LINUX_VOLATILE_H
+#define _LINUX_VOLATILE_H
+
+#include <linux/fs.h>
+
+struct volatile_fs_head {
+	struct mutex lock;
+	struct list_head lru_head;
+	s64 unpurged_page_count;
+};
+
+#define DEFINE_VOLATILE_FS_HEAD(name) struct volatile_fs_head name = {	\
+	.lock = __MUTEX_INITIALIZER(name.lock),				\
+	.lru_head = LIST_HEAD_INIT(name.lru_head),			\
+	.unpurged_page_count = 0,					\
+}
+
+static inline void volatile_range_lock(struct volatile_fs_head *head)
+{
+	mutex_lock(&head->lock);
+}
+
+static inline void volatile_range_unlock(struct volatile_fs_head *head)
+{
+	mutex_unlock(&head->lock);
+}
+
+extern long volatile_range_add(struct volatile_fs_head *head,
+				struct address_space *mapping,
+				u64 *start, u64 *end);
+extern long volatile_range_remove(struct volatile_fs_head *head,
+				struct address_space *mapping,
+				u64 start, u64 end);
+
+extern s64 volatile_range_lru_size(struct volatile_fs_head *head);
+
+extern void volatile_range_clear(struct volatile_fs_head *head,
+					struct address_space *mapping);
+
+extern s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
+				struct address_space **mapping,
+				u64 *start, u64 *end);
+
+extern int volatile_range_is_purged(struct address_space *mapping, u64 addr);
+
+#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 92753e2..4c18cd1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o $(mmu-y)
+			   compaction.o volatile.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/volatile.c b/mm/volatile.c
new file mode 100644
index 0000000..6d3e418
--- /dev/null
+++ b/mm/volatile.c
@@ -0,0 +1,580 @@
+/* mm/volatile.c
+ *
+ * Volatile range management.
+ *      Copyright 2011 Linaro
+ *
+ * Based on mm/ashmem.c
+ *      by Robert Love <rlove@google.com>
+ *      Copyright (C) 2008 Google, Inc.
+ *
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * The volatile range management is a helper layer on top of the range tree
+ * code, which is used to help filesystems manage byte ranges that are volatile.
+ *
+ * These ranges are stored in a per-mapping range tree, holding both purged
+ * and unpurged ranges connected to that address_space. Unpurged ranges are
+ * also linked together in an lru list that is per-volatile-fs-head (basically
+ * per-filesystem).
+ *
+ * The goal behind volatile ranges is to allow applications to interact
+ * with the kernel's cache management infrastructure.  In particular an
+ * application can say "this memory contains data that might be useful in
+ * the future, but can be reconstructed if necessary, so if the kernel
+ * needs, it can zap and reclaim this memory without having to swap it out."
+ *
+ * The proposed mechanism - at a high level - is for user-space to be able
+ * to say "This memory is volatile" and then later "this memory is no longer
+ * volatile".  If the content of the memory is still available the second
+ * request succeeds.  If not, the memory is marked non-volatile and an
+ * error is returned to denote that the contents have been lost.
+ *
+ * Credits to Neil Brown for the above description.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/volatile.h>
+#include <linux/rbtree.h>
+#include <linux/hash.h>
+#include <linux/shmem_fs.h>
+
+struct volatile_range {
+	struct list_head		lru;
+	struct rb_node			node;
+	u64				start;
+	u64				end;
+	unsigned int			purged;
+	struct address_space		*mapping;
+};
+
+/*
+ * To avoid bloating the address_space structure, we use
+ * a hash structure to map from address_space mappings to
+ * the interval_tree root that stores volatile ranges
+ */
+static DEFINE_MUTEX(hash_mutex);
+static struct hlist_head *mapping_hash;
+static long mapping_hash_shift = 8;
+struct mapping_hash_entry {
+	struct rb_root			root;
+	struct address_space		*mapping;
+	struct hlist_node		hnode;
+};
+
+static inline
+struct rb_root *__mapping_to_root(struct address_space *mapping)
+{
+	struct hlist_node *elem;
+	struct mapping_hash_entry *entry;
+	struct rb_root *ret = NULL;
+
+	hlist_for_each_entry_rcu(entry, elem,
+			&mapping_hash[hash_ptr(mapping, mapping_hash_shift)],
+				hnode)
+		if (entry->mapping == mapping)
+			ret =  &entry->root;
+
+	return ret;
+}
+
+static inline
+struct rb_root *mapping_to_root(struct address_space *mapping)
+{
+	struct rb_root *ret;
+
+	mutex_lock(&hash_mutex);
+	ret =  __mapping_to_root(mapping);
+	mutex_unlock(&hash_mutex);
+	return ret;
+}
+
+
+static inline
+struct rb_root *mapping_allocate_root(struct address_space *mapping)
+{
+	struct mapping_hash_entry *entry;
+	struct rb_root *dblchk;
+	struct rb_root *ret = NULL;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return NULL;
+
+	mutex_lock(&hash_mutex);
+	/* Since we dropped the lock, double check that no one has
+	 * created the same hash entry.
+	 */
+	dblchk = __mapping_to_root(mapping);
+	if (dblchk) {
+		kfree(entry);
+		ret = dblchk;
+		goto out;
+	}
+
+	INIT_HLIST_NODE(&entry->hnode);
+	entry->mapping = mapping;
+	entry->root = RB_ROOT;
+
+	hlist_add_head_rcu(&entry->hnode,
+		&mapping_hash[hash_ptr(mapping, mapping_hash_shift)]);
+
+	ret = &entry->root;
+out:
+	mutex_unlock(&hash_mutex);
+	return ret;
+}
+
+static inline void mapping_free_root(struct rb_root *root)
+{
+	struct mapping_hash_entry *entry;
+
+	mutex_lock(&hash_mutex);
+	entry = container_of(root, struct mapping_hash_entry, root);
+
+	hlist_del_rcu(&entry->hnode);
+	kfree(entry);
+	mutex_unlock(&hash_mutex);
+}
+
+
+/* volatile range helpers */
+static struct volatile_range *vrange_alloc(void)
+{
+	struct volatile_range *new;
+
+	new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
+	if (!new)
+		return NULL;
+	rb_init_node(&new->node);
+	return new;
+}
+
+static void vrange_rb_insert(struct rb_root *root,
+				struct volatile_range *vrange)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct volatile_range *ptr;
+
+	WARN_ON_ONCE(!RB_EMPTY_NODE(&vrange->node));
+
+	while (*p) {
+		parent = *p;
+		ptr = rb_entry(parent, struct volatile_range, node);
+		if (vrange->start < ptr->start)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+	rb_link_node(&vrange->node, parent, p);
+	rb_insert_color(&vrange->node, root);
+}
+
+static void vrange_rb_remove(struct rb_root *root,
+				struct volatile_range *vrange)
+{
+	WARN_ON_ONCE(RB_EMPTY_NODE(&vrange->node));
+	rb_erase(&vrange->node, root);
+	RB_CLEAR_NODE(&vrange->node);
+}
+
+struct volatile_range *vrange_rb_search(struct rb_root *root,
+						u64 start, u64 end)
+{
+	struct rb_node *p = root->rb_node;
+	struct volatile_range *candidate, *match = NULL;
+
+	while (p) {
+		candidate = rb_entry(p, struct volatile_range, node);
+		if (end < candidate->start)
+			p = p->rb_left;
+		else if (start > candidate->end)
+			p = p->rb_right;
+		else {
+			/* We found one, but try to find an earlier match */
+			match = candidate;
+			p = p->rb_left;
+		}
+	}
+	return match;
+}
+
+struct volatile_range *vrange_rb_next(struct volatile_range *vrange,
+						u64 start, u64 end)
+{
+	struct rb_node *next;
+	struct volatile_range *candidate;
+
+	if (!vrange)
+		return NULL;
+
+	next = rb_next(&vrange->node);
+	if (!next)
+		return NULL;
+
+	candidate = container_of(next, struct volatile_range, node);
+
+	if ((candidate->start > end) || (candidate->end < start))
+		return NULL;
+
+	return candidate;
+}
+
+static void vrange_add(struct volatile_fs_head *head,
+				struct rb_root *root,
+				struct volatile_range *vrange)
+{
+	vrange_rb_insert(root, vrange);
+	/* Only add unpurged ranges to LRU */
+	if (!vrange->purged) {
+		head->unpurged_page_count += (vrange->end - vrange->start)
+							>> PAGE_CACHE_SHIFT;
+		list_add_tail(&vrange->lru, &head->lru_head);
+	}
+}
+
+static void vrange_del(struct volatile_fs_head *head,
+				struct rb_root *root,
+				struct volatile_range *vrange)
+{
+	if (!vrange->purged) {
+		head->unpurged_page_count -= (vrange->end - vrange->start)
+							>> PAGE_CACHE_SHIFT;
+		list_del(&vrange->lru);
+	}
+
+	vrange_rb_remove(root, vrange);
+	kfree(vrange);
+}
+
+static inline void vrange_resize(struct volatile_fs_head *head,
+				struct rb_root *root,
+				struct volatile_range *vrange,
+				s64 start_index, s64 end_index)
+{
+	s64 old_size, new_size;
+
+	old_size = vrange->end - vrange->start;
+	new_size = end_index - start_index;
+
+	if (!vrange->purged)
+		head->unpurged_page_count += (new_size - old_size)
+							>> PAGE_CACHE_SHIFT;
+
+	vrange_rb_remove(root, vrange);
+	vrange->start = start_index;
+	vrange->end = end_index;
+	vrange_rb_insert(root, vrange);
+}
+
+
+
+/**
+ * volatile_range_add: Marks a byte interval as volatile
+ * @head: per-fs volatile head
+ * @mapping: address space whose range is being marked volatile
+ * @start: Starting byte in range to be marked volatile
+ * @end: Ending byte in range to be marked volatile
+ *
+ * Mark a region as volatile. Coalesces overlapping and neighboring regions.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if the range was coalesced with any purged ranges,
+ * 0 otherwise, or -ENOMEM on allocation failure.
+ * start and end are modified to the full size of the coalesced range.
+ */
+long volatile_range_add(struct volatile_fs_head *head,
+				struct address_space *mapping,
+				u64 *start, u64 *end)
+{
+	struct volatile_range *new, *vrange, *node;
+	struct rb_root *root;
+	int purged = 0;
+
+	/* Make sure we're properly locked */
+	WARN_ON(!mutex_is_locked(&head->lock));
+
+	/*
+	 * Because the lock might be held in a shrinker, release
+	 * it during allocation.
+	 */
+	mutex_unlock(&head->lock);
+	new = vrange_alloc();
+	mutex_lock(&head->lock);
+	if (!new)
+		return -ENOMEM;
+
+	root = mapping_to_root(mapping);
+	if (!root) {
+		mutex_unlock(&head->lock);
+		root = mapping_allocate_root(mapping);
+		mutex_lock(&head->lock);
+		if (!root) {
+			kfree(new);
+			return -ENOMEM;
+		}
+	}
+
+	/* First, find any existing intervals that overlap */
+	node = vrange_rb_search(root, *start, *end);
+	while (node) {
+		vrange = node;
+
+		/* Already entirely marked volatile, so we're done */
+		if (vrange->start < *start && vrange->end > *end) {
+			/* don't need the allocated value */
+			kfree(new);
+			return purged;
+		}
+
+		/* Resize the new range to cover all overlapping ranges */
+		*start = min_t(u64, *start, vrange->start);
+		*end = max_t(u64, *end, vrange->end);
+
+		/* Inherit purged state from overlapping ranges */
+		purged |= vrange->purged;
+
+		/* See if there's a next range that overlaps */
+		node = vrange_rb_next(vrange, *start, *end);
+
+		/* Delete the old range, as we consume it */
+		vrange_del(head, root, vrange);
+	}
+
+
+	/* Coalesce left-adjacent ranges */
+	vrange = vrange_rb_search(root, *start-1, *start);
+	/* Only coalesce if both are either purged or unpurged */
+	if (vrange && (vrange->purged == purged)) {
+		/* resize new range */
+		*start = min_t(u64, *start, vrange->start);
+		*end = max_t(u64, *end, vrange->end);
+		/* delete old range */
+		vrange_del(head, root, vrange);
+	}
+
+	/* Coalesce right-adjacent ranges */
+	vrange = vrange_rb_search(root, *end, *end+1);
+	/* Only coalesce if both are either purged or unpurged */
+	if (vrange && (vrange->purged == purged)) {
+		/* resize new range */
+		*start = min_t(u64, *start, vrange->start);
+		*end = max_t(u64, *end, vrange->end);
+		/* delete old range */
+		vrange_del(head, root, vrange);
+	}
+	/* Assign and store the new range in the range tree */
+	new->mapping = mapping;
+	new->start = *start;
+	new->end = *end;
+	new->purged = purged;
+	vrange_add(head, root, new);
+
+	return purged;
+}
+
+
+/**
+ * volatile_range_remove: Marks a byte interval as nonvolatile
+ * @head: per-fs volatile head
+ * @mapping: address space whose range is being marked nonvolatile
+ * @start: Starting byte in range to be marked nonvolatile
+ * @end: Ending byte in range to be marked nonvolatile
+ *
+ * Mark a region as nonvolatile. And remove it from the volatile
+ * range tree.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if any portion of the range was purged,
+ * 0 otherwise, or -ENOMEM on allocation failure.
+ */
+long volatile_range_remove(struct volatile_fs_head *head,
+				struct address_space *mapping,
+				u64 start, u64 end)
+{
+	struct volatile_range *new, *vrange, *node;
+	struct rb_root *root;
+	int ret		= 0;
+	int used_new	= 0;
+
+	/* Make sure we're properly locked */
+	WARN_ON(!mutex_is_locked(&head->lock));
+
+	/*
+	 * Because the lock might be held in a shrinker, release
+	 * it during allocation.
+	 */
+	mutex_unlock(&head->lock);
+	new = vrange_alloc();
+	mutex_lock(&head->lock);
+	if (!new)
+		return -ENOMEM;
+
+	root = mapping_to_root(mapping);
+	if (!root)
+		goto out;
+
+
+	/* Find any overlapping ranges */
+	node = vrange_rb_search(root, start, end);
+	while (node) {
+		vrange = node;
+		node = vrange_rb_next(vrange, start, end);
+
+		ret |= vrange->purged;
+
+		if (start <= vrange->start && end >= vrange->end) {
+			/* delete: volatile range is totally within range */
+			vrange_del(head, root, vrange);
+		} else if (vrange->start >= start) {
+			/* resize: volatile range right-overlaps range */
+			vrange_resize(head, root, vrange, end+1, vrange->end);
+		} else if (vrange->end <= end) {
+			/* resize: volatile range left-overlaps range */
+			vrange_resize(head, root, vrange, vrange->start,
+								start-1);
+		} else {
+			/* split: range is totally within a volatile range */
+			used_new = 1; /* we only do this once */
+			new->mapping = mapping;
+			new->start = end + 1;
+			new->end = vrange->end;
+			new->purged = vrange->purged;
+			vrange_resize(head, root, vrange, vrange->start,
+								start-1);
+			vrange_add(head, root, new);
+			break;
+		}
+	}
+
+out:
+	if (!used_new)
+		kfree(new);
+
+	return ret;
+}
+
+/**
+ * volatile_range_lru_size: Returns the number of unpurged pages on the lru
+ * @head: per-fs volatile head
+ *
+ * Returns the number of unpurged pages on the LRU
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ */
+s64 volatile_range_lru_size(struct volatile_fs_head *head)
+{
+	WARN_ON(!mutex_is_locked(&head->lock));
+	return head->unpurged_page_count;
+}
+
+
+/**
+ * volatile_ranges_pluck_lru: Returns mapping and size of lru unpurged range
+ * @head: per-fs volatile head
+ * @mapping: dbl pointer to mapping whose range is being purged
+ * @start: Pointer to starting address of range being purged
+ * @end: Pointer to ending address of range being purged
+ *
+ * Returns the mapping, start and end values of the least recently used
+ * range. Marks the range as purged and removes it from the LRU.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 on success, if a range was returned.
+ * Returns 0 if no ranges were found.
+ */
+s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
+				struct address_space **mapping,
+				u64 *start, u64 *end)
+{
+	struct volatile_range *range;
+
+	WARN_ON(!mutex_is_locked(&head->lock));
+
+	if (list_empty(&head->lru_head))
+		return 0;
+
+	range = list_first_entry(&head->lru_head, struct volatile_range, lru);
+
+	*start = range->start;
+	*end = range->end;
+	*mapping = range->mapping;
+
+	head->unpurged_page_count -= (*end - *start)>>PAGE_CACHE_SHIFT;
+	list_del(&range->lru);
+	range->purged = 1;
+
+	return 1;
+}
+
+
+int volatile_range_is_purged(struct address_space *mapping, u64 addr)
+{
+	struct volatile_range *found;
+	struct rb_root *root;
+
+	root = mapping_to_root(mapping);
+	if (!root)
+		return 0;
+
+	found = vrange_rb_search(root, addr, addr);
+	if (!found)
+		return 0;
+
+	return found->purged;
+}
+
+
+/*
+ * Cleans up any volatile ranges.
+ */
+void volatile_range_clear(struct volatile_fs_head *head,
+				struct address_space *mapping)
+{
+	struct volatile_range *tozap;
+	struct rb_root *root;
+
+	WARN_ON(!mutex_is_locked(&head->lock));
+
+	root = mapping_to_root(mapping);
+	if (!root)
+		return;
+
+	while (!RB_EMPTY_ROOT(root)) {
+		tozap = container_of(root->rb_node, struct volatile_range,
+									node);
+		vrange_del(head, root, tozap);
+	}
+	mapping_free_root(root);
+}
+
+
+static int __init volatile_init(void)
+{
+	int i, size;
+
+	size = 1U << mapping_hash_shift;
+	mapping_hash = kzalloc(sizeof(*mapping_hash)*size, GFP_KERNEL);
+	for (i = 0; i < size; i++)
+		INIT_HLIST_HEAD(&mapping_hash[i]);
+
+	return 0;
+}
+arch_initcall(volatile_init);
-- 
1.7.9.5



* [PATCH 2/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
From: John Stultz @ 2012-09-29  3:16 UTC
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, linux-mm

This patch enables FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE
functionality for tmpfs making use of the volatile range
management code.

Conceptually, FALLOC_FL_MARK_VOLATILE is like a delayed
FALLOC_FL_PUNCH_HOLE.  This allows applications that have
data caches that can be re-created to tell the kernel that
some memory contains data that is useful in the future, but
can be recreated if needed, so if the kernel needs, it can
zap the memory without having to swap it out.

In use, applications use FALLOC_FL_MARK_VOLATILE to mark
page ranges as volatile when they are not in use. Then later,
if they want to reuse the data, they use
FALLOC_FL_UNMARK_VOLATILE, which will return an error if the
data has been purged.
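
To illustrate the analogy in call form (a sketch; the MARK/UNMARK
values are the ones defined in this patch):

	/* Immediate: contents of [offset, offset + len) are zapped now */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  offset, len);

	/* Deferred: contents may be zapped later, under memory pressure,
	 * until FALLOC_FL_UNMARK_VOLATILE cancels the offer */
	fallocate(fd, FALLOC_FL_MARK_VOLATILE, offset, len);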

This is very much influenced by the Android Ashmem interface by
Robert Love so credits to him and the Android developers.
In many cases the code & logic come directly from the ashmem patch.
The intent of this patch is to allow for ashmem-like behavior, but
embeds the idea a little deeper into the VM code.

This is a reworked version of the fadvise volatile idea submitted
earlier to the list. Thanks to Dave Chinner for suggesting to
rework the idea in this fashion. Also thanks to Dmitry Adamushko
for continued review and bug reporting, and Dave Hansen for
help with the original design and mentoring me in the VM code.

v3:
* Fix off by one issue when truncating page ranges
* Use Dave Hansen's suggestion to use shmem_writepage to trigger
  range purging instead of using a shrinker.

v4:
* Revert the shrinker removal, since writepage won't get called
  if we don't have swap.

v5:
* Cleanups

v7:
* Convert to byte ranges rather than page ranges to make userland's
  life easier.
* Add volatile_range_is_purged checking in shmem_fault to
  provide SIGBUS on purged page access.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 fs/open.c              |    3 +-
 include/linux/falloc.h |    7 +--
 mm/shmem.c             |  120 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index e1f2cdb..6b8b983 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -225,7 +225,8 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EINVAL;
 
 	/* Return error if mode is not supported */
-	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+			FALLOC_FL_MARK_VOLATILE | FALLOC_FL_UNMARK_VOLATILE))
 		return -EOPNOTSUPP;
 
 	/* Punch hole must have keep size set */
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 73e0b62..3e47ad5 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -1,9 +1,10 @@
 #ifndef _FALLOC_H_
 #define _FALLOC_H_
 
-#define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
-#define FALLOC_FL_PUNCH_HOLE	0x02 /* de-allocates range */
-
+#define FALLOC_FL_KEEP_SIZE		0x01 /* default is extend size */
+#define FALLOC_FL_PUNCH_HOLE		0x02 /* de-allocates range */
+#define FALLOC_FL_MARK_VOLATILE		0x04 /* mark range volatile */
+#define FALLOC_FL_UNMARK_VOLATILE	0x08 /* mark range non-volatile */
 #ifdef __KERNEL__
 
 /*
diff --git a/mm/shmem.c b/mm/shmem.c
index d4e184e..9403ffb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -64,6 +64,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/volatile.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -633,6 +634,100 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 	return error;
 }
 
+static DEFINE_VOLATILE_FS_HEAD(shmem_volatile_head);
+
+static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
+{
+	u64 start, end;
+	int ret;
+
+	start = offset;
+	end = offset + len;
+
+	volatile_range_lock(&shmem_volatile_head);
+	ret = volatile_range_add(&shmem_volatile_head, &inode->i_data,
+								&start, &end);
+	if (ret > 0) { /* immediately purge */
+		shmem_truncate_range(inode, (loff_t)start, (loff_t)end-1);
+		ret = 0;
+	}
+	volatile_range_unlock(&shmem_volatile_head);
+
+	return ret;
+}
+
+static int shmem_unmark_volatile(struct inode *inode, loff_t offset, loff_t len)
+{
+	u64 start, end;
+	int ret;
+
+	start = offset;
+	end = offset + len;
+
+	volatile_range_lock(&shmem_volatile_head);
+	ret = volatile_range_remove(&shmem_volatile_head, &inode->i_data,
+								start, end);
+	volatile_range_unlock(&shmem_volatile_head);
+
+	return ret;
+}
+
+static void shmem_clear_volatile(struct inode *inode)
+{
+	volatile_range_lock(&shmem_volatile_head);
+	volatile_range_clear(&shmem_volatile_head, &inode->i_data);
+	volatile_range_unlock(&shmem_volatile_head);
+}
+
+static
+int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
+{
+	s64 nr_to_scan = sc->nr_to_scan;
+	const gfp_t gfp_mask = sc->gfp_mask;
+	struct address_space *mapping;
+	u64 start, end;
+	int ret;
+	s64 page_count;
+
+	if (nr_to_scan && !(gfp_mask & __GFP_FS))
+		return -1;
+
+	volatile_range_lock(&shmem_volatile_head);
+	page_count = volatile_range_lru_size(&shmem_volatile_head);
+	if (!nr_to_scan)
+		goto out;
+
+	do {
+		ret = volatile_ranges_pluck_lru(&shmem_volatile_head,
+							&mapping, &start, &end);
+		if (ret) {
+			shmem_truncate_range(mapping->host, (loff_t) start,
+								(loff_t) end-1);
+
+			nr_to_scan -= (end-start) >> PAGE_CACHE_SHIFT;
+			page_count -= (end-start) >> PAGE_CACHE_SHIFT;
+		}
+	} while (ret && (nr_to_scan > 0));
+
+out:
+	volatile_range_unlock(&shmem_volatile_head);
+
+	return page_count;
+}
+
+static struct shrinker shmem_volatile_shrinker = {
+	.shrink = shmem_volatile_shrink,
+	.seeks = DEFAULT_SEEKS,
+};
+
+static int __init shmem_shrinker_init(void)
+{
+	register_shrinker(&shmem_volatile_shrinker);
+	return 0;
+}
+arch_initcall(shmem_shrinker_init);
+
+
 static void shmem_evict_inode(struct inode *inode)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
@@ -1292,6 +1387,16 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	int error;
 	int ret = VM_FAULT_LOCKED;
 
+
+	volatile_range_lock(&shmem_volatile_head);
+	if (volatile_range_is_purged(inode->i_mapping,
+					(u64)vmf->pgoff << PAGE_CACHE_SHIFT)) {
+		volatile_range_unlock(&shmem_volatile_head);
+		return VM_FAULT_SIGBUS;
+	}
+	volatile_range_unlock(&shmem_volatile_head);
+
+
 	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
 	if (error)
 		return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
@@ -1732,6 +1837,14 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		/* No need to unmap again: hole-punching leaves COWed pages */
 		error = 0;
 		goto out;
+	} else if (mode & FALLOC_FL_MARK_VOLATILE) {
+		/* Mark pages volatile, sort of delayed hole punching */
+		error = shmem_mark_volatile(inode, offset, len);
+		goto out;
+	} else if (mode & FALLOC_FL_UNMARK_VOLATILE) {
+		/* Mark pages non-volatile, return error if pages were purged */
+		error = shmem_unmark_volatile(inode, offset, len);
+		goto out;
 	}
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -1810,6 +1923,12 @@ out:
 	return error;
 }
 
+static int shmem_release(struct inode *inode, struct file *file)
+{
+	shmem_clear_volatile(inode);
+	return 0;
+}
+
 static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
@@ -2721,6 +2840,7 @@ static const struct file_operations shmem_file_operations = {
 	.splice_read	= shmem_file_splice_read,
 	.splice_write	= generic_file_splice_write,
 	.fallocate	= shmem_fallocate,
+	.release	= shmem_release,
 #endif
 };
 
-- 
1.7.9.5



* [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges
From: John Stultz @ 2012-09-29  3:16 UTC
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, linux-mm

Rework of my first pass attempt at getting ashmem to utilize
the volatile range code, now using the fallocate interface.

In this implementation GET_PIN_STATUS is unimplemented, due to
the fact that adding an ISVOLATILE check wasn't considered
terribly useful in earlier reviews. It would be trivial to
re-add that functionality, but I wanted to check with the
Android developers to see how often GET_PIN_STATUS is actually
used.

Similarly the ashmem PURGE_ALL_CACHES ioctl does not function,
as the volatile range purging is no longer directly under its
control.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Android Kernel Team <kernel-team@android.com>
Cc: Robert Love <rlove@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mike Hommey <mh@glandium.org>
Cc: Taras Glek <tglek@mozilla.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 drivers/staging/android/ashmem.c |  335 ++------------------------------------
 1 file changed, 10 insertions(+), 325 deletions(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 69cf2db..9f9654c 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -52,26 +52,6 @@ struct ashmem_area {
 };
 
 /*
- * ashmem_range - represents an interval of unpinned (evictable) pages
- * Lifecycle: From unpin to pin
- * Locking: Protected by `ashmem_mutex'
- */
-struct ashmem_range {
-	struct list_head lru;		/* entry in LRU list */
-	struct list_head unpinned;	/* entry in its area's unpinned list */
-	struct ashmem_area *asma;	/* associated area */
-	size_t pgstart;			/* starting page, inclusive */
-	size_t pgend;			/* ending page, inclusive */
-	unsigned int purged;		/* ASHMEM_NOT or ASHMEM_WAS_PURGED */
-};
-
-/* LRU list of unpinned pages, protected by ashmem_mutex */
-static LIST_HEAD(ashmem_lru_list);
-
-/* Count of pages on our LRU list, protected by ashmem_mutex */
-static unsigned long lru_count;
-
-/*
  * ashmem_mutex - protects the list of and each individual ashmem_area
  *
  * Lock Ordering: ashmex_mutex -> i_mutex -> i_alloc_sem
@@ -79,102 +59,9 @@ static unsigned long lru_count;
 static DEFINE_MUTEX(ashmem_mutex);
 
 static struct kmem_cache *ashmem_area_cachep __read_mostly;
-static struct kmem_cache *ashmem_range_cachep __read_mostly;
-
-#define range_size(range) \
-	((range)->pgend - (range)->pgstart + 1)
-
-#define range_on_lru(range) \
-	((range)->purged == ASHMEM_NOT_PURGED)
-
-#define page_range_subsumes_range(range, start, end) \
-	(((range)->pgstart >= (start)) && ((range)->pgend <= (end)))
-
-#define page_range_subsumed_by_range(range, start, end) \
-	(((range)->pgstart <= (start)) && ((range)->pgend >= (end)))
-
-#define page_in_range(range, page) \
-	(((range)->pgstart <= (page)) && ((range)->pgend >= (page)))
-
-#define page_range_in_range(range, start, end) \
-	(page_in_range(range, start) || page_in_range(range, end) || \
-		page_range_subsumes_range(range, start, end))
-
-#define range_before_page(range, page) \
-	((range)->pgend < (page))
 
 #define PROT_MASK		(PROT_EXEC | PROT_READ | PROT_WRITE)
 
-static inline void lru_add(struct ashmem_range *range)
-{
-	list_add_tail(&range->lru, &ashmem_lru_list);
-	lru_count += range_size(range);
-}
-
-static inline void lru_del(struct ashmem_range *range)
-{
-	list_del(&range->lru);
-	lru_count -= range_size(range);
-}
-
-/*
- * range_alloc - allocate and initialize a new ashmem_range structure
- *
- * 'asma' - associated ashmem_area
- * 'prev_range' - the previous ashmem_range in the sorted asma->unpinned list
- * 'purged' - initial purge value (ASMEM_NOT_PURGED or ASHMEM_WAS_PURGED)
- * 'start' - starting page, inclusive
- * 'end' - ending page, inclusive
- *
- * Caller must hold ashmem_mutex.
- */
-static int range_alloc(struct ashmem_area *asma,
-		       struct ashmem_range *prev_range, unsigned int purged,
-		       size_t start, size_t end)
-{
-	struct ashmem_range *range;
-
-	range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
-	if (unlikely(!range))
-		return -ENOMEM;
-
-	range->asma = asma;
-	range->pgstart = start;
-	range->pgend = end;
-	range->purged = purged;
-
-	list_add_tail(&range->unpinned, &prev_range->unpinned);
-
-	if (range_on_lru(range))
-		lru_add(range);
-
-	return 0;
-}
-
-static void range_del(struct ashmem_range *range)
-{
-	list_del(&range->unpinned);
-	if (range_on_lru(range))
-		lru_del(range);
-	kmem_cache_free(ashmem_range_cachep, range);
-}
-
-/*
- * range_shrink - shrinks a range
- *
- * Caller must hold ashmem_mutex.
- */
-static inline void range_shrink(struct ashmem_range *range,
-				size_t start, size_t end)
-{
-	size_t pre = range_size(range);
-
-	range->pgstart = start;
-	range->pgend = end;
-
-	if (range_on_lru(range))
-		lru_count -= pre - range_size(range);
-}
 
 static int ashmem_open(struct inode *inode, struct file *file)
 {
@@ -200,12 +87,6 @@ static int ashmem_open(struct inode *inode, struct file *file)
 static int ashmem_release(struct inode *ignored, struct file *file)
 {
 	struct ashmem_area *asma = file->private_data;
-	struct ashmem_range *range, *next;
-
-	mutex_lock(&ashmem_mutex);
-	list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned)
-		range_del(range);
-	mutex_unlock(&ashmem_mutex);
 
 	if (asma->file)
 		fput(asma->file);
@@ -339,56 +220,6 @@ out:
 	return ret;
 }
 
-/*
- * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
- *
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
- *
- * 'gfp_mask' is the mask of the allocation that got us into this mess.
- *
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
- * proceed without risk of deadlock (due to gfp_mask).
- *
- * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
- * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
- * pages freed.
- */
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
-{
-	struct ashmem_range *range, *next;
-
-	/* We might recurse into filesystem code, so bail out if necessary */
-	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
-		return -1;
-	if (!sc->nr_to_scan)
-		return lru_count;
-
-	mutex_lock(&ashmem_mutex);
-	list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
-		loff_t start = range->pgstart * PAGE_SIZE;
-		loff_t end = (range->pgend + 1) * PAGE_SIZE;
-
-		do_fallocate(range->asma->file,
-				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-				start, end - start);
-		range->purged = ASHMEM_WAS_PURGED;
-		lru_del(range);
-
-		sc->nr_to_scan -= range_size(range);
-		if (sc->nr_to_scan <= 0)
-			break;
-	}
-	mutex_unlock(&ashmem_mutex);
-
-	return lru_count;
-}
-
-static struct shrinker ashmem_shrinker = {
-	.shrink = ashmem_shrink,
-	.seeks = DEFAULT_SEEKS * 4,
-};
-
 static int set_prot_mask(struct ashmem_area *asma, unsigned long prot)
 {
 	int ret = 0;
@@ -461,136 +292,10 @@ static int get_name(struct ashmem_area *asma, void __user *name)
 	return ret;
 }
 
-/*
- * ashmem_pin - pin the given ashmem region, returning whether it was
- * previously purged (ASHMEM_WAS_PURGED) or not (ASHMEM_NOT_PURGED).
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
-	struct ashmem_range *range, *next;
-	int ret = ASHMEM_NOT_PURGED;
-
-	list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
-		/* moved past last applicable page; we can short circuit */
-		if (range_before_page(range, pgstart))
-			break;
-
-		/*
-		 * The user can ask us to pin pages that span multiple ranges,
-		 * or to pin pages that aren't even unpinned, so this is messy.
-		 *
-		 * Four cases:
-		 * 1. The requested range subsumes an existing range, so we
-		 *    just remove the entire matching range.
-		 * 2. The requested range overlaps the start of an existing
-		 *    range, so we just update that range.
-		 * 3. The requested range overlaps the end of an existing
-		 *    range, so we just update that range.
-		 * 4. The requested range punches a hole in an existing range,
-		 *    so we have to update one side of the range and then
-		 *    create a new range for the other side.
-		 */
-		if (page_range_in_range(range, pgstart, pgend)) {
-			ret |= range->purged;
-
-			/* Case #1: Easy. Just nuke the whole thing. */
-			if (page_range_subsumes_range(range, pgstart, pgend)) {
-				range_del(range);
-				continue;
-			}
-
-			/* Case #2: We overlap from the start, so adjust it */
-			if (range->pgstart >= pgstart) {
-				range_shrink(range, pgend + 1, range->pgend);
-				continue;
-			}
-
-			/* Case #3: We overlap from the rear, so adjust it */
-			if (range->pgend <= pgend) {
-				range_shrink(range, range->pgstart, pgstart-1);
-				continue;
-			}
-
-			/*
-			 * Case #4: We eat a chunk out of the middle. A bit
-			 * more complicated, we allocate a new range for the
-			 * second half and adjust the first chunk's endpoint.
-			 */
-			range_alloc(asma, range, range->purged,
-				    pgend + 1, range->pgend);
-			range_shrink(range, range->pgstart, pgstart - 1);
-			break;
-		}
-	}
-
-	return ret;
-}
-
-/*
- * ashmem_unpin - unpin the given range of pages. Returns zero on success.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
-	struct ashmem_range *range, *next;
-	unsigned int purged = ASHMEM_NOT_PURGED;
-
-restart:
-	list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
-		/* short circuit: this is our insertion point */
-		if (range_before_page(range, pgstart))
-			break;
-
-		/*
-		 * The user can ask us to unpin pages that are already entirely
-		 * or partially pinned. We handle those two cases here.
-		 */
-		if (page_range_subsumed_by_range(range, pgstart, pgend))
-			return 0;
-		if (page_range_in_range(range, pgstart, pgend)) {
-			pgstart = min_t(size_t, range->pgstart, pgstart),
-			pgend = max_t(size_t, range->pgend, pgend);
-			purged |= range->purged;
-			range_del(range);
-			goto restart;
-		}
-	}
-
-	return range_alloc(asma, range, purged, pgstart, pgend);
-}
-
-/*
- * ashmem_get_pin_status - Returns ASHMEM_IS_UNPINNED if _any_ pages in the
- * given interval are unpinned and ASHMEM_IS_PINNED otherwise.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_get_pin_status(struct ashmem_area *asma, size_t pgstart,
-				 size_t pgend)
-{
-	struct ashmem_range *range;
-	int ret = ASHMEM_IS_PINNED;
-
-	list_for_each_entry(range, &asma->unpinned_list, unpinned) {
-		if (range_before_page(range, pgstart))
-			break;
-		if (page_range_in_range(range, pgstart, pgend)) {
-			ret = ASHMEM_IS_UNPINNED;
-			break;
-		}
-	}
-
-	return ret;
-}
-
 static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
 			    void __user *p)
 {
 	struct ashmem_pin pin;
-	size_t pgstart, pgend;
 	int ret = -EINVAL;
 
 	if (unlikely(!asma->file))
@@ -612,25 +317,25 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
 	if (unlikely(PAGE_ALIGN(asma->size) < pin.offset + pin.len))
 		return -EINVAL;
 
-	pgstart = pin.offset / PAGE_SIZE;
-	pgend = pgstart + (pin.len / PAGE_SIZE) - 1;
-
-	mutex_lock(&ashmem_mutex);
 
 	switch (cmd) {
 	case ASHMEM_PIN:
-		ret = ashmem_pin(asma, pgstart, pgend);
+		ret = do_fallocate(asma->file, FALLOC_FL_UNMARK_VOLATILE,
+					pin.offset, pin.len);
 		break;
 	case ASHMEM_UNPIN:
-		ret = ashmem_unpin(asma, pgstart, pgend);
+		ret = do_fallocate(asma->file, FALLOC_FL_MARK_VOLATILE,
+					pin.offset, pin.len);
 		break;
 	case ASHMEM_GET_PIN_STATUS:
-		ret = ashmem_get_pin_status(asma, pgstart, pgend);
+		/*
+		 * XXX - volatile ranges currently don't provide status,
+		 * due to questionable utility
+		 */
+		ret = -EINVAL;
 		break;
 	}
 
-	mutex_unlock(&ashmem_mutex);
-
 	return ret;
 }
 
@@ -669,15 +374,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		break;
 	case ASHMEM_PURGE_ALL_CACHES:
 		ret = -EPERM;
-		if (capable(CAP_SYS_ADMIN)) {
-			struct shrink_control sc = {
-				.gfp_mask = GFP_KERNEL,
-				.nr_to_scan = 0,
-			};
-			ret = ashmem_shrink(&ashmem_shrinker, &sc);
-			sc.nr_to_scan = ret;
-			ashmem_shrink(&ashmem_shrinker, &sc);
-		}
 		break;
 	}
 
@@ -713,21 +409,13 @@ static int __init ashmem_init(void)
 		return -ENOMEM;
 	}
 
-	ashmem_range_cachep = kmem_cache_create("ashmem_range_cache",
-					  sizeof(struct ashmem_range),
-					  0, 0, NULL);
-	if (unlikely(!ashmem_range_cachep)) {
-		pr_err("failed to create slab cache\n");
-		return -ENOMEM;
-	}
-
 	ret = misc_register(&ashmem_misc);
 	if (unlikely(ret)) {
 		pr_err("failed to register misc device!\n");
 		return ret;
 	}
 
-	register_shrinker(&ashmem_shrinker);
+
 
 	pr_info("initialized\n");
 
@@ -738,13 +426,10 @@ static void __exit ashmem_exit(void)
 {
 	int ret;
 
-	unregister_shrinker(&ashmem_shrinker);
-
 	ret = misc_deregister(&ashmem_misc);
 	if (unlikely(ret))
 		pr_err("failed to unregister misc device!\n");
 
-	kmem_cache_destroy(ashmem_range_cachep);
 	kmem_cache_destroy(ashmem_area_cachep);
 
 	pr_info("unloaded\n");
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-09-29  3:16 [PATCH 0/3] Volatile Ranges (v7) & Lots of words John Stultz
                   ` (2 preceding siblings ...)
  2012-09-29  3:16 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
@ 2012-10-02  7:39 ` NeilBrown
  2012-10-02 22:38   ` John Stultz
  2012-10-02 17:31 ` Taras Glek
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2012-10-02  7:39 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Andrea Righi, Aneesh Kumar K.V,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2994 bytes --]

On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz <john.stultz@linaro.org> wrote:

> 
> After Kernel Summit and Plumbers, I wanted to consider all the various
> side-discussions and try to summarize my current thoughts here along
> with sending out my current implementation for review.
> 
> Also: I'm going on four weeks of paternity leave in the very near
> (but non-deterministic) future. So while I hope I still have time
> for some discussion, I may have to deal with fussier complaints
> then yours. :)  In any case, you'll have more time to chew on
> the idea and come up with amazing suggestions. :)

Hi John,

 I wonder if you are trying to please everyone and risking pleasing no-one?
 Well, maybe not quite that extreme, but you can't please all the people all
 the time.

 For example, allowing sub-page volatile regions seems to be above and beyond
 the call of duty.  You cannot mmap sub-pages, so why should they be volatile?

 Similarly the suggestion of using madvise - while tempting - is probably a
 minority interest and can probably be managed with library code.  I'm glad
 you haven't pursued it.

 I think discarding whole ranges at a time is very sensible, and so merging
 adjacent ranges is best avoided.  If you require page-aligned ranges this
 becomes trivial - is that right?

 I wonder if the oldest page/oldest range issue can be defined away by
 requiring apps to touch the first page in a range when they touch the range.
 Then the age of a range is the age of the first page.  Non-initial pages
 could even be kept off the free list .... though that might confuse NUMA
 page reclaim if a range had pages from different nodes.


 Application to non-tmpfs files seems very unclear and so probably best
 avoided.
 If I understand you correctly, then you have suggested both that a volatile
 range would be a "lazy hole punch" and a "don't let this get written to disk
 yet" flag.  It cannot really be both.  The former sounds like fallocate,
 the latter like fadvise.
 I think the latter sounds more like the general purpose of volatile ranges,
 but I also suspect that some journalling filesystems might be uncomfortable
 providing a guarantee like that.  So I would suggest firmly stating that it
 is a tmpfs-only feature.  If someone wants something vaguely similar for
 other filesystems, let them implement it separately.


 The SIGBUS interface could have some merit if it really reduces overhead.  I
 worry about app bugs that could result from the non-deterministic
 behaviour.   A range could get unmapped while it is in use and testing for
 the case of "get a SIGBUS half way through accessing something" would not
 be straightforward (SIGBUS on the first step of access should be easy).
 I guess that is up to the app writer, but I have never liked anything about
 the signal interface and encouraging further use doesn't feel wise.

 That's my 2c worth for now.  Keep up the good work,

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-09-29  3:16 [PATCH 0/3] Volatile Ranges (v7) & Lots of words John Stultz
                   ` (3 preceding siblings ...)
  2012-10-02  7:39 ` [PATCH 0/3] Volatile Ranges (v7) & Lots of words NeilBrown
@ 2012-10-02 17:31 ` Taras Glek
  2012-10-08  6:25 ` Minchan Kim
  2012-10-09  8:07 ` Mike Hommey
  6 siblings, 0 replies; 16+ messages in thread
From: Taras Glek @ 2012-10-02 17:31 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On 9/28/2012 8:16 PM, John Stultz wrote:
> <snip>
> There are two rough approaches that I have tried so far
>
> 1) Managing volatile range objects, in a tree or list, which are then
> purged using a shrinker
>
> 2) Page based management, where pages marked volatile are moved to
> a new LRU list and are purged from there.
>
>
>
> 1) This patchset is of the shrinker-based approach. In many ways it
> is simpler, but it does have a few drawbacks.  Basically when marking a
> range as volatile, we create a range object, and add it to an rbtree.
> This allows us to be able to quickly find ranges, given an address in
> the file.  We also add each range object to the tail of a  filesystem
> global linked list, which acts as an LRU allowing us to quickly find
> the least recently created volatile range. We then use a shrinker
> callback to trigger purging, where we'll select the range on the head
> of the LRU list, purge the data, mark the range object as purged,
> and remove it from the lru list.
>
> This allows fairly efficient behavior, as marking and unmarking
> a range are both O(logn) operation with respect to the number of
> ranges, to insert and remove from the tree.  Purging the range is
> also O(1) to select the range, and we purge the entire range in
> least-recently-marked-volatile order.
>
> The drawbacks with this approach is that it uses a shrinker, thus it is
> numa un-aware. We track the virtual address of the pages in the file,
> so we don't have a sense of what physical pages we're using, nor on
> which node those pages may be on. So its possible on a multi-node
> system that when one node was under pressure, we'd purge volatile
> ranges that are all on a different node, in effect throwing data away
> without helping anything. This is clearly non-ideal for numa systems.
>
> One idea I discussed with Michel Lespinasse is that this might be
> something we could improve by providing the shrinker some node context,
> then keep track in the range  what node their first page is on. That
> way we would be sure to at least free up one page on the node under
> pressure when purging that range.
>
>
> 2) The second approach, which was more page based, was also tried. In
> this case when we marked a range as volatile, the pages in that range
> were moved to a new lru list, LRU_VOLATILE, in vmscan.c.  This provided
> a page lru list that could be used to free pages before looking at
> the LRU_INACTIVE_FILE/ANONYMOUS lists.
>
> This integrates the feature deeper in the mm code, which is nice,
> especially as we have an LRU_VOLATILE list for each numa node. Thus
> under pressure we won't purge ranges that are entirely on a different
> node, as is possible with the other approach.
>
> However, this approach is more costly.	When marking a range
> as volatile, we have to migrate every page in that range to the
> LRU_VOLATILE list, and similarly on unmarking we have to move each
> page back. This ends up being O(n) with respect to the number of
> pages in the range we're marking or unmarking. Similarly when purging,
> we let the scanning code select a page off the lru, then we have to
> map it back to the volatile range so we can purge the entire range,
> making it a more expensive O(logn),  with respect to the number of
> ranges, operation.
>
> This is a particular concern as applications that want to mark and
> unmark data as volatile with fine granularity will likely be calling
> these operations frequently, adding quite a bit of overhead. This
> makes it less likely that applications will choose to volunteer data
> as volatile to the system.
>
> However, with the new lazy SIGBUS notification, applications using
> the SIGBUS method would avoid having to mark and unmark data when
> accessing it, so this overhead may be less of a concern. However, for
> cases where applications don't want to deal with the SIGBUS and would
> rather have the more deterministic behavior of the unmark/access/mark
> pattern, the performance is a concern.
Unfortunately, approach 1 is not useful for our use case. It'll mean 
that we are continuously re-decompressing frequently used parts of 
libxul.so under memory pressure (which happens pretty often on 
limited-RAM devices).


Taras

ps. John, I really appreciate movement on this. We really need this to 
improve Firefox memory usage + startup speed on low-memory devices. It 
will be great to have Firefox start faster + respond to memory pressure 
better on desktop Linux too.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-10-02  7:39 ` [PATCH 0/3] Volatile Ranges (v7) & Lots of words NeilBrown
@ 2012-10-02 22:38   ` John Stultz
  2012-11-02 20:59     ` Michael Kerrisk
  2012-11-03  7:57     ` Michael Kerrisk
  0 siblings, 2 replies; 16+ messages in thread
From: John Stultz @ 2012-10-02 22:38 UTC (permalink / raw)
  To: NeilBrown
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Andrea Righi, Aneesh Kumar K.V,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm, Christoph Hellwig

On 10/02/2012 12:39 AM, NeilBrown wrote:
> On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz <john.stultz@linaro.org> wrote:
>
>> After Kernel Summit and Plumbers, I wanted to consider all the various
>> side-discussions and try to summarize my current thoughts here along
>> with sending out my current implementation for review.
>>
>> Also: I'm going on four weeks of paternity leave in the very near
>> (but non-deterministic) future. So while I hope I still have time
>> for some discussion, I may have to deal with fussier complaints
>> then yours. :)  In any case, you'll have more time to chew on
>> the idea and come up with amazing suggestions. :)
>   I wonder if you are trying to please everyone and risking pleasing no-one?
>   Well, maybe not quite that extreme, but you can't please all the people all
>   the time.
So while I do agree that I won't be able to please everyone, especially 
when it comes to how this interface is implemented internally, I do want 
to make sure that the userland interface really does make sense and 
isn't limited by my own short-sightedness.  :)

>   For example, allowing sub-page volatile regions seems to be above and beyond
>   the call of duty.  You cannot mmap sub-pages, so why should they be volatile?
Although if someone marked a page and a half as volatile, would it be 
reasonable to throw away the second half of that second page? That seems 
unexpected to me. So we're really only marking the whole pages specified 
as volatlie,  similar to how FALLOC_FL_PUNCH_HOLE behaves.

But if it happens that the adjacent range is also a partial page, we can 
possibly coalesce them into a purgeable whole page. I think it makes 
sense, especially from a userland point of view, and it wasn't really 
complicated to add.

>   Similarly the suggestion of using madvise - while tempting - is probably a
>   minority interest and can probably be managed with library code.  I'm glad
>   you haven't pursued it.
For now I see this as a lower priority, but its something I'd like to 
investigate.  As depending on tmpfs has issues since there's no quota 
support, so having a user-writable tmpfs partition mounted is a DoS 
opening, especially on low-memory systems.

>   I think discarding whole ranges at a time is very sensible, and so merging
>   adjacent ranges is best avoided.  If you require page-aligned ranges this
>   becomes trivial - is that right?
True. If we avoid coalescing non-whole page ranges, keeping 
non-overlapping ranges independent is fairly easy.

But it is also easy to avoid coalescing in all cases except when 
multiple sub-page ranges can be coalesced together.

In other words, we mark whole page portions of the range as volatile, 
and keep the sub-page portions separate. So non-page aligned ranges 
would possibly consist of three independent ranges, with the middle one 
as the only one marked volatile. Should those non-whole-page ranges be 
adjacent to other non-whole-page ranges, they could be coalesced. Since 
the coalesced edge ranges would be marked volatile after the full range, 
we would also avoid purging the edge pages that would invalidate two 
unpurged ranges.

Alternatively, we can never coalesce and only mark whole pages in single 
ranges as volatile. It doesn't really make it more complex.

But again, these are implementation details.

The main point is that, at the user-interface level, I think allowing userland 
to provide non-page aligned ranges is valid. What we do with those 
non-page aligned chunks is up to the kernel/implementation, but I think 
we should be conservative and be sure never to purge non-volatile data.
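
To make the rounding concrete, here's a rough userspace-style sketch
(not code from the patches; whole_page_interior() is just an
illustrative helper) of the page math, where only pages entirely
inside a byte range are purge candidates:

#include <stddef.h>

#define PAGE_SIZE 4096UL	/* illustration only */

/*
 * Compute the pages that lie entirely inside [start, start + len).
 * Returns 0 if the range contains no complete page.
 */
static int whole_page_interior(size_t start, size_t len,
			       size_t *pgstart, size_t *pgend)
{
	size_t first = (start + PAGE_SIZE - 1) / PAGE_SIZE; /* round up */
	size_t last = (start + len) / PAGE_SIZE;            /* round down */

	if (first >= last)
		return 0;
	*pgstart = first;	/* first fully-contained page */
	*pgend = last - 1;	/* last fully-contained page */
	return 1;
}

So for the page-and-a-half example above (start 0, len 6144 with
4096-byte pages), only page 0 qualifies, and the trailing half page is
left alone.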

>   I wonder if the oldest page/oldest range issue can be defined away by
>   requiring apps to touch the first page in a range when they touch the range.
>   Then the age of a range is the age of the first page.  Non-initial pages
>   could even be kept off the free list .... though that might confuse NUMA
>   page reclaim if a range had pages from different nodes.
Not sure I followed this. Are you suggesting keeping non-initial pages 
off the vmscan LRU lists entirely?

Another approach that was suggested, which sounds similar, is touching all 
the pages when we mark them as volatile, so they are all close to each 
other in the active/inactive list. Then when the vmscan 
shrink_lru_list() code runs it would purge the pages together (although 
it might only purge half a range if there wasn't the need for more 
memory).   But again, these page-based solutions have much higher 
algorithmic complexity (O(n) - with respect to pages marked) and overhead.


>   Application to non-tmpfs files seems very unclear and so probably best
>   avoided.
>   If I understand you correctly, then you have suggested both that a volatile
>   range would be a "lazy hole punch" and a "don't let this get written to disk
>   yet" flag.  It cannot really be both.  The former sounds like fallocate,
>   the latter like fadvise.
I don't think I see the exclusivity aspect. If we say "Dear kernel, you 
may punch a hole at this offset in this file whenever you want in the 
future" and then later say "Cancel my earlier hole punching request" 
(to which the kernel can say "Sorry, too late"), it has very close 
semantics to what I'm describing with the abstract interface to volatile 
ranges.  Maybe the only subtlety in the hole-punching oriented 
worldview is that the kernel is smart enough not to bother writing out 
any data that could be punched out in the future.

But maybe this is a sufficient subtlety to still warrant avoiding it.
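
To sketch that reading with the flags from this patchset (regenerate()
being a hypothetical stand-in for whatever the application does to
rebuild lost data):

#define _GNU_SOURCE
#include <fcntl.h>

/* The FALLOC_FL_* flags come from this patchset's falloc.h changes;
 * regenerate() is a hypothetical application callback. */
extern void regenerate(int fd, off_t off, off_t len);

static void mark_then_cancel(int fd, off_t off, off_t len)
{
	/* "Dear kernel, you may punch a hole here whenever you want": */
	fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);

	/* ... time passes; the kernel may or may not purge ... */

	/* "Cancel my earlier hole punching request": */
	if (fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, off, len) != 0) {
		/* "Sorry, too late" -- part of the range was already
		 * punched out, so the data has to be rebuilt. */
		regenerate(fd, off, len);
	}
}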

>   I think the latter sounds more like the general purpose of volatile ranges,
>   but I also suspect that some journalling filesystems might be uncomfortable
>   providing a guarantee like that.  So I would suggest firmly stating that it
>   is a tmpfs-only feature.  If someone wants something vaguely similar for
>   other filesystems, let them implement it separately.
I mostly agree, as I don't have the context to see how this could be 
useful to other filesystems.  So I'm limiting my functionality to tmpfs. 
However DaveC saw some value in allowing it to be extended to other 
filesystems, and I'm not opposed in seeing the same interface be used if 
the semantics are close enough.

 From Dave's earlier mail:

"Managing large scale disk caches have exactly the same problems of
determining what to evict and/or move to secondary storage when
space is low. Being able to mark ranges of files as "remove this
first" woulxp dbe very advantageous for pro-active mangement of ENOSPC
conditions in the cache...

And being able to do space-demand hole-punching for stuff like
managing VM images would be really cool. For example, the loopback
device uses hole punching to implement TRIM commands, so turning
those into VOLATILE ranges for background dispatch will speed those
operations up immensely and avoid silly same block "TRIM - write -
TRIM - write" cyclic hole-punch/allocation in the backing file.  KVM
could do the same for implementing TRIM on image based block
devices...

There's lots of things that can be done with a generic advisory,
asynchronous hole-punching interface."

Christoph also mentioned the concept would have some usefulness for 
persistent caches and I think xfsutils as well?

To me, it seems the dynamic is: fadvise is too wishy-washy for anything 
that deals with persistent data on disk; it's more about how the kernel 
memory management should manage file data.  Whereas fallocate has 
stronger semantics for the behavior of what happens on disk.  So if 
this is really a tmpfs-only feature, fadvise should be ok, but if it 
were ever to be useful for making actual changes to disk, fallocate 
would be better.

So just from that standpoint, fallocate might be a more flexible 
interface to use, since it's really all the same for tmpfs.

But let me know if my read on things here is off.

>   The SIGBUS interface could have some merit if it really reduces overhead.  I
>   worry about app bugs that could result from the non-deterministic
>   behaviour.   A range could get unmapped while it is in use and testing for
>   the case of "get a SIGBUS half way though accessing something" would not
>   be straight forward (SIGBUS on first step of access should be easy).
>   I guess that is up to the app writer, but I have never liked anything about
>   the signal interface and encouraging further use doesn't feel wise.
Initially I didn't like the idea, but I have warmed considerably to it, 
mainly due to the concern that the constant unmark/access/mark pattern 
would be too much overhead; having a lazy method will be much nicer 
for performance.  But yes, it comes at the cost of the additional 
complexity of handling the signal, marking the faulted address range 
as non-volatile, restoring the data and continuing.
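
Roughly, such a handler might look like the following sketch
(cache_fd/cache_base/obj_size and regenerate_object() are hypothetical
application details; the FALLOC flags come from this patchset, and
obj_size is assumed to be a power of two):

#define _GNU_SOURCE
#include <signal.h>
#include <fcntl.h>
#include <stddef.h>

extern int cache_fd;		/* tmpfs fd backing the cache */
extern char *cache_base;	/* mmap of cache_fd */
extern size_t obj_size;	/* power-of-two object size */
extern void regenerate_object(char *obj, size_t len);

static void purged_handler(int sig, siginfo_t *si, void *uctx)
{
	/* Locate the start of the object we faulted inside. */
	off_t off = ((char *)si->si_addr - cache_base)
			& ~((off_t)obj_size - 1);

	/* Make the object non-volatile again... */
	fallocate(cache_fd, FALLOC_FL_UNMARK_VOLATILE, off, obj_size);

	/* ...and restore its contents.  Returning from the handler
	 * retries the faulting access, which now finds valid data. */
	regenerate_object(cache_base + off, obj_size);
}

static void install_purged_handler(void)
{
	struct sigaction sa = {
		.sa_sigaction	= purged_handler,
		.sa_flags	= SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
}

Strictly speaking, fallocate() isn't on the async-signal-safe list, so
a production handler would need more care, but it shows the
unmark-and-regenerate flow.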

The use case for Mozilla is where there are compressed library files on 
disk, which are decompressed into memory to reduce the io. Then the 
entire in-memory library can be marked volatile, and will be re-fetched 
as needed. Basically this allows for filesystem-independent disk 
compression (and, more importantly for them, reduced io).

Hopefully that provides some extra context. Thanks again for the words 
of wisdom here.  I do agree that at a certain point I will have to 
become less flexible, in order to push something upstream, but since 
interest in this work has been somewhat sporadic, I do want to make sure 
folks have at least a chance to "bend the sapling" this one last time. :)

thanks
-john


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-09-29  3:16 [PATCH 0/3] Volatile Ranges (v7) & Lots of words John Stultz
                   ` (4 preceding siblings ...)
  2012-10-02 17:31 ` Taras Glek
@ 2012-10-08  6:25 ` Minchan Kim
  2012-10-09  1:25   ` John Stultz
  2012-10-09  8:07 ` Mike Hommey
  6 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2012-10-08  6:25 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, linux-mm

Hi John,

On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
> 
> After Kernel Summit and Plumbers, I wanted to consider all the various
> side-discussions and try to summarize my current thoughts here along
> with sending out my current implementation for review.
> 
> Also: I'm going on four weeks of paternity leave in the very near
> (but non-deterministic) future. So while I hope I still have time
> for some discussion, I may have to deal with fussier complaints
> then yours. :)  In any case, you'll have more time to chew on
> the idea and come up with amazing suggestions. :)
> 
> 
> General Interface semantics:
> ----------------------------------------------
> 
> The high level interface I've been pushing has so far stayed fairly
> consistent:
> 
> Application marks a range of data as volatile. Volatile data may
> be purged at any time. Accessing volatile data is undefined, so
> applications should not do so. If the application wants to access
> data in a volatile range, it should mark it as non-volatile. If any
> of the pages in the range being marked non-volatile had been purged,
> the kernel will return an error, notifying the application that the
> data was lost.
> 
> But one interesting new tweak on this design, suggested by the Taras
> Glek and others at Mozilla, is as follows:
> 
> Instead of leaving volatile data access as being undefined , when
> accessing volatile data, either the data expected will be returned
> if it has not been purged, or the application will get a SIGBUS when
> it accesses volatile data that has been purged.
> 
> Everything else remains the same (error on marking non-volatile
> if data was purged, etc). This model allows applications to avoid
> having to unmark volatile data when it wants to access it, then
> immediately re-mark it as volatile when its done. It is in effect

Just out of curiosity.
Why should the application re-mark it as volatile again?
It is already a volatile range and the application doesn't receive
any signal while it uses that range. So I think it doesn't need to
re-mark it.


> "lazy" with its marking, allowing the kernel to hit it with a signal
> when it gets unlucky and touches purged data. From the signal handler,
> the application can note the address it faulted on, unmark the range,
> and regenerate the needed data before returning to execution.

I like this model if plumbers really want it.

> 
> Since this approach avoids the more explicit unmark/access/mark
> pattern, it avoids the extra overhead required to ensure data is
> non-volatile before being accessed.

I have an idea to reduce the overhead.
See below.

> 
> However, If applications don't want to deal with handling the
> sigbus, they can use the more straightforward (but more costly)
> unmark/access/mark pattern in the same way as my earlier proposals.
> 
> This allows folks to balance the cost vs complexity in their
> application appropriately.
> 
> So that's a general overview of how the idea I'm proposing could
> be used.

My idea is that we don't need to move all pages in the range to the
tail of the LRU or to a new LRU list. Just move one page in the range
to the tail of the LRU or the new LRU list. Then, when the reclaimer
starts looking for a victim page, it can tell the page is volatile
somehow (e.g. if we use a new LRU list we know it trivially; otherwise
we can use a new VMA flag - VM_VOLATILE - and detect it via a tweak
to page_check_references), isolate all the pages of the range from
the middle of the LRU list, and reclaim them all at once.
So the cost of marking is just the search cost for finding an
in-memory page of the range, plus moving one page between LRUs or
from the middle to the tail.
It means we move the cost from mark/unmark time to the reclaim point.

> 
> 
> 
> Specific Interface semantics:
> ---------------------------------------------
> 
> Here are some of the open question about how the user interface
> should look:
> 
> fadvise vs fallocate:
> 
> 	So while originally I used fadvise, currently my
> 	implementation uses fallocate(fd, FALLOC_FL_MARK_VOLATILE,
> 	start, len) to mark a range as volatile and fallocate(fd,
> 	FALLOC_FL_UNMARK_VOLATILE, start, len) to unmark ranges.
> 
> 	During kernel summit, the question was brought up if fallocate
> 	was really the right interface to be using, and if fadvise
> 	would be better. To me fadvise makes a little more sense,
> 	but earlier it was pointed out that marking data ranges as
> 	volatile could also be seen as a type of cancellable and lazy
> 	hole-punching, so from that perspective fallocate might make
> 	more sense.  This is still an open question and I'd appreciate
> 	further input here.
> 
> tmpfs vs non-shmem filesystems:
> 	Android's ashmem primarily provides a way to get unlinked
> 	tmpfs fds that can be shared between applications. Its
> 	just an additional feature that those pages can "unpinned"
> 	or marked volatile in my terminology. Thus in implementing
> 	volatile ranges, I've focused on getting it to work on tmpfs
> 	file descriptors.  However, there has been some interest in
> 	using volatile ranges with more traditional filesystems. The
> 	semantics for how volatile range purging would work on a
> 	real filesystem are not well established, and I can't say I
> 	understand the utility quite yet, but there may be a case for
> 	having data that you know won't be committed to disk until it
> 	is marked as non-volatile.  However, returning an EINVAL on
> 	non-tmpfs filesystems until such a use is established should
> 	be fine.
> 
> fd based interfaces vs madvise:
> 	In talking with Taras Glek, he pointed out that for his
> 	needs, the fd based interface is a little annoying, as it
> 	requires having to get access to tmpfs file and mmap it in,
> 	then instead of just referencing a pointer to the data he
> 	wants to mark volatile, he has to calculate the offset from
> 	start of the mmap and pass those file offsets to the interface.
> 	Instead he mentioned that using something like madvise would be
> 	much nicer, since they could just pass a pointer to the object
> 	in memory they want to make volatile and avoid the extra work.
> 
> 	I'm not opposed to adding an madvise interface for this as
> 	well, but since we have a existing use case with Android's
> 	ashmem, I want to make sure we support this existing behavior.
> 	Specifically as with ashmem  applications can be sharing
> 	these tmpfs fds, and so file-relative volatile ranges make
> 	more sense if you need to coordinate what data is volatile
> 	between two applications.
> 
> 	Also, while I agree that having an madvise interface for
> 	volatile ranges would be nice, it does open up some more
> 	complex implementation issues, since with files, there is a
> 	fixed relationship between pages and the files' address_space
> 	mapping, where you can't have pages shared between different
> 	mappings. This makes it easy to hang the volatile-range tree
> 	off of the mapping (well, indirectly via a hash table). With
> 	general anonymous memory, pages can be shared between multiple
> 	processes, and as far as I understand, don't have any grouping
> 	structure we could use to determine if the page is in a
> 	volatile range or not. We would also need to determine more
> 	complex questions like: What are the semantics of volatility
> 	with copy-on-write pages?  I'm hoping to investigate this
> 	idea more deeply soon so I can be sure whatever is pushed has
> 	a clear plan of how to address this idea. Further thoughts
> 	here would be appreciated.

I like the madvise interface because an allocator can use it for its
memory pool. If the allocator has free memory that was just returned
by the application, it can mark it volatile so the VM can reclaim those
pages without swapout when memory pressure happens, and it can unmark
them before allocating again. That would be more effective than calling
munmap or madvise(MADV_DONTNEED), since those operations require page
table work and, in the munmap case, even vma unlinking.

For it, we can add a new VMA flag, VM_VOLATILE, and use the reverse
mapping as the grouping structure. For COW semantics, I think we can
discard a volatile page only if all the vmas sharing the page have
VM_VOLATILE.
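
As a rough sketch of the allocator idea, with *hypothetical* flag
names (no such madvise flags exist yet; this is only what the proposed
interface might look like, and freelist_add()/freelist_take() are just
allocator bookkeeping stubs):

#include <sys/mman.h>
#include <stddef.h>

extern void freelist_add(void *ptr, size_t len);
extern void *freelist_take(size_t len);

void pool_free(void *ptr, size_t len)
{
	/* The pages stay mapped and keep their contents, but the VM
	 * may now discard them under pressure instead of swapping. */
	madvise(ptr, len, MADV_MARK_VOLATILE);	/* hypothetical flag */
	freelist_add(ptr, len);
}

void *pool_alloc(size_t len)
{
	void *ptr = freelist_take(len);

	/* The old contents are garbage anyway, so a purge in the
	 * meantime was harmless; we only need the range to stop
	 * being volatile before handing it out. */
	madvise(ptr, len, MADV_UNMARK_VOLATILE);	/* hypothetical flag */
	return ptr;
}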
> 
> 
> It would really be great to get any thoughts on these issues, as they
> are higher-priority to me than diving into the details of how we
> implement this internally, which can shift over time.
> 
> 
> 
> Implementation Considerations:
> ---------------------------------------------
> 
> How best to manage volatile ranges internally in the kernel is still
> an open question.
> 
> With this patch set, I'm really wanting to provide a proof of concept
> of the general interface semantics above. This allows applications to
> play with the idea and validate that it would work for them. Allowing
> further discussion to continue on how to best implement or best allow
> the implementation to evolve in the kernel.
> 
> Even so, I'm very interested in any discussion about how to manage
> volatile ranges optimally.
> 
> Before describing the different management approaches I've tried,
> there are some abstract properties and considerations that need to
> be kept in mind:
> 
> * Range-granular Purging:
> 	Since volatile ranges can be reclaimed at any time, the
> 	question of how the kernel should reclaim volatile data
> 	needs to be addressed.	When a large data range  is marked
> 	as volatile, if any single page in that range is purged,
> 	the application will get an error when it marks the range
> 	as non-volatile.  Thus when any single page in a range
> 	is purged, the "value" of the entire range is destroyed.
> 	Because of this property, it makes most sense to purge the
> 	entire range together.
> 
> 
> * Coalescing of adjacent volatile ranges:
> 	With volatile ranges, any overlapping ranges are always
> 	coalesced. However, there is an open question of what to
> 	do with neighboring ranges. With Android's approach, any
> 	neighboring ranges were coalesced into one range.  I've since
> 	tweaked this so that adjacent ranges are coalesced only if
> 	both have not yet been purged (or both are already purged).
> 	This avoids throwing away fine data just because its next
> 	to data that has already been tossed.  Not coalescing
> 	non-overlapping ranges is also an option I've considered,
> 	as it better follows the applications wishes, since as
> 	the application is providing these non-overlapping ranges
> 	separately, we should probably also purge them separately.
> 	The one complication here is that for userlands-sake, we
> 	manage volatile ranges at a byte level. So if an application
> 	marks one an a half pages of data as volatile, we only purge
> 	pages that are entirely volatile. This avoids accidentally
> 	purging non-volatile data on the rest of the page.  However,
> 	if an array of sub-page sized data is marked volatile one by
> 	one, coalescing the ranges allows us to purge a page that
> 	consists entirely of multiple volatile ranges.	So for now
> 	I'm still coalescing assuming the neighbors are both unpurged,
> 	but this behavior is open to being tweaked.
> 
> 
> * Purging order between volatile ranges:
> 	Again, since it makes sense to purge all the complete
> 	pages in a range at the same time, we need to consider the
> 	subtle difference between the least-recently-used pages vs
> 	least-recently-used ranges. A single range could contain very
> 	frequently accessed data, as well as rarely accessed data.
> 	One must also consider that the act of marking a range as
> 	volatile may not actually touch the underlying pages. Thus
> 	purging ranges based on a least-recently-used page may also
> 	result in purging the most-recently used page.
> 
> 	Android addressed the purging order question by purging ranges
> 	in the order they were marked volatile. Thus the oldest
> 	volatile range is the first range to be purged. This works
> 	well in the Android  model, as applications aren't supposed
> 	to access volatile data, so the least-recently-marked-volatile
> 	order maps well to the least-recently-used-range.
> 
> 	However, this assumption doesn't hold with the lazy SIGBUS
> 	notification method, as pages in a volatile range may continue
> 	to be accessed after the range is marked volatile.  So the
> 	question as to what is the best order of purging volatile
> 	ranges is definitely open.
> 
> 	Abstractly the ideal solution might be to evaluate the
> 	most-recently used page in each range, and to purge the range
> 	with the oldest recently-used-page, but I suspect this is
> 	not something that could be calculated efficiently.
> 
> 	Additionally, in my conversations with Taras, he pointed out
> 	that if we are using a one-application-at-a-time UI model,
> 	it would be ideal to discourage purging volatile data used by
> 	the current application, instead prioritizing volatile ranges
> 	from applications that aren't active. However, I'm not sure
> 	what mechanism could be used to prioritize range purging in
> 	this fashion, especially considering volatile ranges can be
> 	on data that is shared between applications.

My thought is to "let it be", without creating a new LRU list or
deactivating volatile pages to the tail of the LRU for early
reclaiming. It means volatile pages have the same priority as other
normal pages. Volatile doesn't mean "early reclaim" but "we don't
need to swap them out to reclaim them", in my perception.

> 
> 
> * Volatile range purging order relative to non-volatile pages:
> 	Initially I had proposed that since applications had offered
> 	data up as unused, volatile ranges should be purged before we
> 	try to free any other pages in the system.  At Plumbers, Andrea
> 	pointed out that this doesn't make much sense, as there may be
> 	inactive file pages from some streaming file data which are not
> 	going to be used any time soon, and would be a better candidate
> 	to free then an application's volatile pages. This sounded
> 	quite reasonable, so its likely we need to balance volatile
> 	purging with freeing other pages in the system. However, I do
> 	think it is advantageous to purge volatile pages before we
> 	free any dirty pages that must be laundered, as part of the
> 	goal of volatile pages is to avoid extra io. Although from
> 	my reading of shrink_page_list in vmscan.c I'm not sure I see
> 	if/how we prioritize freeing clean pages prior to dirty ones.
> 
> 
> So with that background covered, on to discussing actual
> implementations.
> 
> Implementation Details:
> ---------------------------------------------
> 
> There are two rough approaches that I have tried so far
> 
> 1) Managing volatile range objects, in a tree or list, which are then
> purged using a shrinker
> 
> 2) Page based management, where pages marked volatile are moved to
> a new LRU list and are purged from there.
> 
> 
> 
> 1) This patchset is of the shrinker-based approach. In many ways it
> is simpler, but it does have a few drawbacks.  Basically when marking a
> range as volatile, we create a range object, and add it to an rbtree.
> This allows us to be able to quickly find ranges, given an address in
> the file.  We also add each range object to the tail of a  filesystem
> global linked list, which acts as an LRU allowing us to quickly find
> the least recently created volatile range. We then use a shrinker
> callback to trigger purging, where we'll select the range on the head
> of the LRU list, purge the data, mark the range object as purged,
> and remove it from the lru list.
> 
> This allows fairly efficient behavior, as marking and unmarking
> a range are both O(logn) operation with respect to the number of
> ranges, to insert and remove from the tree.  Purging the range is
> also O(1) to select the range, and we purge the entire range in
> least-recently-marked-volatile order.
> 
> The drawbacks with this approach is that it uses a shrinker, thus it is
> numa un-aware. We track the virtual address of the pages in the file,
> so we don't have a sense of what physical pages we're using, nor on
> which node those pages may be on. So its possible on a multi-node
> system that when one node was under pressure, we'd purge volatile
> ranges that are all on a different node, in effect throwing data away
> without helping anything. This is clearly non-ideal for numa systems.
> 
> One idea I discussed with Michel Lespinasse is that this might be
> something we could improve by providing the shrinker some node context,
> then keep track in the range  what node their first page is on. That
> way we would be sure to at least free up one page on the node under
> pressure when purging that range.
> 
> 
> 2) The second approach, which was more page based, was also tried. In
> this case when we marked a range as volatile, the pages in that range
> were moved to a new lru list, LRU_VOLATILE, in vmscan.c.  This provided
> a page lru list that could be used to free pages before looking at
> the LRU_INACTIVE_FILE/ANONYMOUS lists.
> 
> This integrates the feature deeper in the mm code, which is nice,
> especially as we have an LRU_VOLATILE list for each numa node. Thus
> under pressure we won't purge ranges that are entirely on a different
> node, as is possible with the other approach.
> 
> However, this approach is more costly.	When marking a range
> as volatile, we have to migrate every page in that range to the
> LRU_VOLATILE list, and similarly on unmarking we have to move each
> page back. This ends up being O(n) with respect to the number of
> pages in the range we're marking or unmarking. Similarly when purging,
> we let the scanning code select a page off the lru, then we have to
> map it back to the volatile range so we can purge the entire range,
> making it a more expensive O(logn),  with respect to the number of
> ranges, operation.
> 
> This is a particular concern as applications that want to mark and
> unmark data as volatile with fine granularity will likely be calling
> these operations frequently, adding quite a bit of overhead. This
> makes it less likely that applications will choose to volunteer data
> as volatile to the system.
> 
> However, with the new lazy SIGBUS notification, applications using
> the SIGBUS method would avoid having to mark and unmark data when
> accessing it, so this overhead may be less of a concern. However, for
> cases where applications don't want to deal with the SIGBUS and would
> rather have the more deterministic behavior of the unmark/access/mark
> pattern, the performance is a concern.
> 
> Additionally, there may be ways to defer and batch the page migration
> so that applications don't suffer the extra cost, but this solution
> may be limited or could  cause some strange behavior, as we can't
> defer the unmark method, as we don't want pages to be purged after
> the application thinks they were unmarked.
> 

In summary, the idea I am suggesting now, if we select lazy SIGBUS, is
as follows:

1) use madvise and VMA rmap with a new VM_VOLATILE flag
2) mark
   just set VM_VOLATILE in the VMA
3) treat volatile pages the same as normal pages from an aging POV
   (Of course, on a non-swap system we have to tweak the VM for
    reclaiming volatile pages; for example, we can move all volatile
    pages to the inactive list's tail when swapoff happens, and the VM
    can peek at the tail of the inactive LRU list without aging when
    memory pressure happens)
4) unmark
   just clear VM_VOLATILE in the VMA and return an error if pages
   were purged.

> 
> Whew, that was long...

Thanks for the really good summary, John!

> 
> Anyway, if you got this far and are still interested, I'd be greatly
> appreciate  hearing of any other suggested implementations, or ways
> around the drawbacks of the already tried approaches.
> 
> thanks
> -john
> 
> 
> For this v7 patchset revision the changes are as follows:
> * Dropped the LRU_VOLATILE approach for now so we can focus on
>   getting the general interface semantics agreed upon
> * Converted to using byte ranges rather than page ranges to make
>   userland's life easier.	
> * Add SIGBUS on purged page access behavior, allowing for access
>   of volatile data without having to unmark it.
> 
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Robert Love <rlove@google.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Andrea Righi <andrea@betterlinux.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Cc: Mike Hommey <mh@glandium.org>
> Cc: Taras Glek <tglek@mozilla.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: linux-mm@kvack.org <linux-mm@kvack.org>
> 
> 
> John Stultz (3):
>   [RFC] Add volatile range management code
>   [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
>   [RFC] ashmem: Convert ashmem to use volatile ranges
> 
>  drivers/staging/android/ashmem.c |  335 +---------------------
>  fs/open.c                        |    3 +-
>  include/linux/falloc.h           |    7 +-
>  include/linux/volatile.h         |   46 +++
>  mm/Makefile                      |    2 +-
>  mm/shmem.c                       |  120 ++++++++
>  mm/volatile.c                    |  580 ++++++++++++++++++++++++++++++++++++++
>  7 files changed, 763 insertions(+), 330 deletions(-)
>  create mode 100644 include/linux/volatile.h
>  create mode 100644 mm/volatile.c
> 
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-10-08  6:25 ` Minchan Kim
@ 2012-10-09  1:25   ` John Stultz
  2012-10-09  2:49     ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: John Stultz @ 2012-10-09  1:25 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, linux-mm, Kleen, Andi

On 10/07/2012 11:25 PM, Minchan Kim wrote:
> Hi John,
>
> On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
>> After Kernel Summit and Plumbers, I wanted to consider all the various
>> side-discussions and try to summarize my current thoughts here along
>> with sending out my current implementation for review.
>>
>> Also: I'm going on four weeks of paternity leave in the very near
>> (but non-deterministic) future. So while I hope I still have time
>> for some discussion, I may have to deal with fussier complaints
>> then yours. :)  In any case, you'll have more time to chew on
>> the idea and come up with amazing suggestions. :)
>>
>>
>> General Interface semantics:
>> ----------------------------------------------
>>
>> The high level interface I've been pushing has so far stayed fairly
>> consistent:
>>
>> Application marks a range of data as volatile. Volatile data may
>> be purged at any time. Accessing volatile data is undefined, so
>> applications should not do so. If the application wants to access
>> data in a volatile range, it should mark it as non-volatile. If any
>> of the pages in the range being marked non-volatile had been purged,
>> the kernel will return an error, notifying the application that the
>> data was lost.
>>
>> But one interesting new tweak on this design, suggested by the Taras
>> Glek and others at Mozilla, is as follows:
>>
>> Instead of leaving volatile data access as being undefined , when
>> accessing volatile data, either the data expected will be returned
>> if it has not been purged, or the application will get a SIGBUS when
>> it accesses volatile data that has been purged.
>>
>> Everything else remains the same (error on marking non-volatile
>> if data was purged, etc). This model allows applications to avoid
>> having to unmark volatile data when it wants to access it, then
>> immediately re-mark it as volatile when its done. It is in effect
> Just out of curiosity.
> Why should the application re-mark it as volatile again?
> It is already a volatile range and the application doesn't receive
> any signal while it uses that range. So I think it doesn't need to
> re-mark it.

Not totally sure I understand your question clearly.

So assuming one has a large cache of independently accessed objects, 
this mark-nonvolatile/access/mark-volatile pattern is useful if you 
don't want to have to deal with handling the SIGBUS.

For instance, when accessing the data (say, uncompressed image data), 
you may be passing it to a library (to do something like an image 
filter, in place), where you don't want the library's access of the 
data to cause an unexpected SIGBUS that would be difficult to recover 
from.
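
In code, that explicit pattern looks roughly like this (image_filter()
and regenerate() are hypothetical stand-ins; off/len locate the object
in the tmpfs file and buf is its address in the mapping):

#define _GNU_SOURCE
#include <fcntl.h>

extern void image_filter(char *buf, size_t len);
extern void regenerate(char *buf, size_t len);

static void filter_object(int fd, char *buf, off_t off, off_t len)
{
	/* Pin the object down; a nonzero return means part of it
	 * was already purged and must be rebuilt before use. */
	if (fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, off, len) != 0)
		regenerate(buf, len);

	image_filter(buf, len);	/* safe: no SIGBUS possible here */

	/* Done with it; offer the data back to the kernel. */
	fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
}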


>> "lazy" with its marking, allowing the kernel to hit it with a signal
>> when it gets unlucky and touches purged data. From the signal handler,
>> the application can note the address it faulted on, unmark the range,
>> and regenerate the needed data before returning to execution.
> I like this model if plumbers really want it.

I think it makes sense.  Also it avoids a hole in the earlier semantics: 
If accessing volatile data is undefined, and you might get the data, or 
you might get zeros, there's the problem of writes that occur on purged 
ranges (Credits to Mike Hommey for pointing this out in his critique of 
Android's ashmem).  If an application writes data to a purged range, the 
range continues to be considered volatile, and since neighboring data 
may still be purged, the entire set is considered purged.  Because of 
this, we don't end up purging the data again (at least with the shrinker 
method).

By adding the SIGBUS on access of purged pages, it cleans up the 
semantics nicely.



>
>> However, If applications don't want to deal with handling the
>> sigbus, they can use the more straightforward (but more costly)
>> unmark/access/mark pattern in the same way as my earlier proposals.
>>
>> This allows folks to balance the cost vs complexity in their
>> application appropriately.
>>
>> So that's a general overview of how the idea I'm proposing could
>> be used.
> My idea is that we don't need to move all pages in the range to the
> tail of the LRU or to a new LRU list. Just move one page in the range
> to the tail of the LRU or the new LRU list. Then, when the reclaimer
> starts looking for a victim page, it can tell the page is volatile
> somehow (e.g. if we use a new LRU list we know it trivially; otherwise
> we can use a new VMA flag - VM_VOLATILE - and detect it via a tweak
> to page_check_references), isolate all the pages of the range from
> the middle of the LRU list, and reclaim them all at once.
> So the cost of marking is just the search cost for finding an
> in-memory page of the range, plus moving one page between LRUs or
> from the middle to the tail.
> It means we move the cost from mark/unmark time to the reclaim point.

So this general idea of moving a single page to represent the entire 
range has been mentioned before (I think Neil also suggested something 
similar).

But what happens if we create a large volatile range, most of 
which hasn't been accessed in a long time, while the chosen flag page 
has been accessed recently?  In this case, we might end up swapping 
volatile pages to disk before we try to evict the volatile flag page.

I haven't looked it up, but I worry that trying to purge the rest of the 
range would end up causing those pages to be swapped back in.

And Neil's suggestion of removing the rest of the pages from the LRUs 
(if I understood his suggestion correctly) to avoid swapping out those 
pages unfortunately results in the same extra O(n) cost of touching 
all the pages in the range every time we change the volatility state.


>>
>>
>> Specific Interface semantics:
>> ---------------------------------------------
>>
>> Here are some of the open question about how the user interface
>> should look:
[snip]
>> fd based interfaces vs madvise:
>> 	In talking with Taras Glek, he pointed out that for his
>> 	needs, the fd based interface is a little annoying, as it
>> 	requires having to get access to tmpfs file and mmap it in,
>> 	then instead of just referencing a pointer to the data he
>> 	wants to mark volatile, he has to calculate the offset from
>> 	start of the mmap and pass those file offsets to the interface.
>> 	Instead he mentioned that using something like madvise would be
>> 	much nicer, since they could just pass a pointer to the object
>> 	in memory they want to make volatile and avoid the extra work.
>>
>> 	I'm not opposed to adding an madvise interface for this as
>> 	well, but since we have a existing use case with Android's
>> 	ashmem, I want to make sure we support this existing behavior.
>> 	Specifically as with ashmem  applications can be sharing
>> 	these tmpfs fds, and so file-relative volatile ranges make
>> 	more sense if you need to coordinate what data is volatile
>> 	between two applications.
>>
>> 	Also, while I agree that having an madvise interface for
>> 	volatile ranges would be nice, it does open up some more
>> 	complex implementation issues, since with files, there is a
>> 	fixed relationship between pages and the files' address_space
>> 	mapping, where you can't have pages shared between different
>> 	mappings. This makes it easy to hang the volatile-range tree
>> 	off of the mapping (well, indirectly via a hash table). With
>> 	general anonymous memory, pages can be shared between multiple
>> 	processes, and as far as I understand, don't have any grouping
>> 	structure we could use to determine if the page is in a
>> 	volatile range or not. We would also need to determine more
>> 	complex questions like: What are the semantics of volatility
>> 	with copy-on-write pages?  I'm hoping to investigate this
>> 	idea more deeply soon so I can be sure whatever is pushed has
>> 	a clear plan of how to address this idea. Further thoughts
>> 	here would be appreciated.
> I like madvise interface because allocator can use it for memory pool.
> If allocator has free memory which just return from application
> he can mark it into volatile so VM can reclaim that pages without swapout
> when memory pressure happens and it can unmark it before allocating.
> It would be more effective rather than calling munmap or madvise(DONTNEED)
> which those operations requires all page table operation and even vma
> unlinking in case of munmap.
>
> For it, we can add new VMA flag VM_VOLATILE and we can use reverse mapping
> for grouping structure. For COW semantics, I think we can discard volatile
> page only if all vmas which share the page don't have VM_VOLATILE.

I'll look into this approach some more. I've got some concerns below.


>
>> It would really be great to get any thoughts on these issues, as they
>> are higher-priority to me then diving into the details of how we
>> implement this internally, which can shift over time.
>>
>>
>>
>> Implementation Considerations:
>> ---------------------------------------------
[snip]
>> * Purging order between volatile ranges:
>> 	Again, since it makes sense to purge all the complete
>> 	pages in a range at the same time, we need to consider the
>> 	subtle difference between the least-recently-used pages vs
>> 	least-recently-used ranges. A single range could contain very
>> 	frequently accessed data, as well as rarely accessed data.
>> 	One must also consider that the act of marking a range as
>> 	volatile may not actually touch the underlying pages. Thus
>> 	purging ranges based on a least-recently-used page may also
>> 	result in purging the most-recently used page.
>>
>> 	Android addressed the purging order question by purging ranges
>> 	in the order they were marked volatile. Thus the oldest
>> 	volatile range is the first range to be purged. This works
>> 	well in the Android  model, as applications aren't supposed
>> 	to access volatile data, so the least-recently-marked-volatile
>> 	order maps well to the least-recently-used-range.
>>
>> 	However, this assumption doesn't hold with the lazy SIGBUS
>> 	notification method, as pages in a volatile range may continue
>> 	to be accessed after the range is marked volatile.  So the
>> 	question as to what is the best order of purging volatile
>> 	ranges is definitely open.
>>
>> 	Abstractly the ideal solution might be to evaluate the
>> 	most-recently used page in each range, and to purge the range
>> 	with the oldest recently-used-page, but I suspect this is
>> 	not something that could be calculated efficiently.
>>
>> 	Additionally, in my conversations with Taras, he pointed out
>> 	that if we are using a one-application-at-a-time UI model,
>> 	it would be ideal to discourage purging volatile data used by
>> 	the current application, instead prioritizing volatile ranges
>> 	from applications that aren't active. However, I'm not sure
>> 	what mechanism could be used to prioritize range purging in
>> 	this fashion, especially considering volatile ranges can be
>> 	on data that is shared between applications.
> My thought is that "let it be" without creating new LRU list or
> deactivating volatile page to tail of LRU for early reclaiming.
> It means volatile pages has same priorty with other normal pages.
> Volatile doesn't mean "Early reclaim" but "we don't need to swap out
> them for reclaiming" in my perception.

Right. Using your earlier terms, ez-reclaim instead of early-reclaim.

Ideally I think we would want to prefer purging volatile pages over 
laundering dirty pages under pressure, but maybe that's a separate 
vmscan issue/optimization for now?



> In summary, the idea I am suggesting now if we select lazy SIGBUS is
> following as,
>
> 1) use madvise and VMA rmap with newly VM_VOLATILE page
> 2) mark
>     just mark VM_VOLATILE in the VMA
> 3) treat volatile pages same with normal pages in POV aging
>     (Of course, non swap system, we have to tweak VM for reclaimaing
>      volatile pages, for example, we can move all volatile pages into
>      inactive's tail when swapoff happens and VM peek tail of inactive
>      LRU list without aging when memory pressure happens)
> 4) unmark
>     just unmark VM_VOLATILE in the VMA and return error.

So to better grasp this: you're suggesting using the vma, in effect, as 
the volatile range object? So if we mark a large area as volatile, we'll 
create a new vma with the (start, end), splitting the remaining portions 
of the old area into two non-volatile vmas?

The caveat for #3 seems a bit large.  In no-swap cases (which are 
standard for Android), it sounds like we're still having to migrate 
every page in the range to the end of the list (and move them to the 
head when they are marked non-volatile).  This seems to have the same 
O(n) performance issue.

Also, if I understand correctly, you could have one page in multiple 
vmas, and the semantics for when we consider the page as volatile are 
awkward. When we want to purge/writeout, do we have to scan every vma 
the page is in to determine if it's volatile?

Another thing I worry about with this is that, since the vmas are 
per-mm, I don't see how a volatile range could be shared, like with a 
tmpfs file between two applications, a usage model that Android's ashmem 
supports.

I'm likely missing something, as I'm unfamiliar so far with the rmap 
code, so I'll look into this some more (although I may not get too much 
time for it in the next month).  Andi Kleen tried something similar to 
your suggestion a while back, but I'm just not confident that the per-mm 
vmas are the right place to manage this state (which is why I've used 
address_space mappings).

Thanks again for the feedback!
-john


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-10-09  1:25   ` John Stultz
@ 2012-10-09  2:49     ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2012-10-09  2:49 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, linux-mm, Kleen, Andi

On Mon, Oct 08, 2012 at 06:25:07PM -0700, John Stultz wrote:
> On 10/07/2012 11:25 PM, Minchan Kim wrote:
> >Hi John,
> >
> >On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
> >>After Kernel Summit and Plumbers, I wanted to consider all the various
> >>side-discussions and try to summarize my current thoughts here along
> >>with sending out my current implementation for review.
> >>
> >>Also: I'm going on four weeks of paternity leave in the very near
> >>(but non-deterministic) future. So while I hope I still have time
> >>for some discussion, I may have to deal with fussier complaints
> >>then yours. :)  In any case, you'll have more time to chew on
> >>the idea and come up with amazing suggestions. :)
> >>
> >>
> >>General Interface semantics:
> >>----------------------------------------------
> >>
> >>The high level interface I've been pushing has so far stayed fairly
> >>consistent:
> >>
> >>Application marks a range of data as volatile. Volatile data may
> >>be purged at any time. Accessing volatile data is undefined, so
> >>applications should not do so. If the application wants to access
> >>data in a volatile range, it should mark it as non-volatile. If any
> >>of the pages in the range being marked non-volatile had been purged,
> >>the kernel will return an error, notifying the application that the
> >>data was lost.
> >>
> >>But one interesting new tweak on this design, suggested by the Taras
> >>Glek and others at Mozilla, is as follows:
> >>
> >>Instead of leaving volatile data access as being undefined , when
> >>accessing volatile data, either the data expected will be returned
> >>if it has not been purged, or the application will get a SIGBUS when
> >>it accesses volatile data that has been purged.
> >>
> >>Everything else remains the same (error on marking non-volatile
> >>if data was purged, etc). This model allows applications to avoid
> >>having to unmark volatile data when it wants to access it, then
> >>immediately re-mark it as volatile when its done. It is in effect
> >Just out of curiosity.
> >Why should application remark it as volatile again?
> >It have been already volatile range and application doesn't receive
> >any signal while it uses that range. So I think it doesn't need to
> >remark.
> 
> Not totally sure I understand your question clearly.
> 
> So assuming one has a large cache of independently accessed objects,
> this mark-nonvolatile/access/mark-volatile pattern is useful if you
> don't want to have to deal with handling the SIGBUS.
> 
> For instance, if when accessing the data (say uncompressed image
> data), you are passing it to a library (to do something like an
> image filter, in place), where you don't want the library's access
> of the data to cause an unexpected SIGBUS that would be difficult to
> recover from.

I was just confused by your wording.
AFAIUC, you mean the following:

1) mark volatile
2) access pages in the range until SIGBUS happens
3) when SIGBUS happens, unmark volatile
4) access pages in the range
5) when it's done, remark it as volatile

I agree with this model.
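
For illustration, steps 1), 3) and 5) map onto a couple of fallocate()
calls. A minimal userspace sketch (the FALLOC_FL_*_VOLATILE flags come
from the RFC patches, not mainline, so the values below are just
placeholders):

#define _GNU_SOURCE
#include <fcntl.h>

/* Placeholder values; the real ones come from the RFC patches. */
#ifndef FALLOC_FL_MARK_VOLATILE
#define FALLOC_FL_MARK_VOLATILE		0x100
#define FALLOC_FL_UNMARK_VOLATILE	0x200
#endif

/* Steps 1) and 5): mark a tmpfs file range as purgeable. */
static int mark_volatile(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_MARK_VOLATILE, offset, len);
}

/* Step 3): unmark before use; per the semantics up-thread, an error
 * return means pages in the range were purged and must be regenerated
 * before step 4) touches them. */
static int unmark_volatile(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, offset, len);
}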

> 
> 
> >>"lazy" with its marking, allowing the kernel to hit it with a signal
> >>when it gets unlucky and touches purged data. From the signal handler,
> >>the application can note the address it faulted on, unmark the range,
> >>and regenerate the needed data before returning to execution.
> >I like this model if plumbers really want it.
> 
> I think it makes sense.  Also it avoids a hole in the earlier
> semantics: If accessing volatile data is undefined, and you might
> get the data, or you might get zeros, there's the problem of writes
> that occur on purged ranges (Credits to Mike Hommey for pointing
> this out in his critique of Android's ashmem).  If an application
> writes data to a purged range, the range continues to be considered
> volatile, and since neighboring data may still be purged, the entire
> set is considered purged.  Because of this, we don't end up purging
> the data again (at least with the shrinker method).
> 
> By adding the SIGBUS on access of purged pages, it cleans up the
> semantics nicely.
> 
> 
> 
> >
> >>However, If applications don't want to deal with handling the
> >>sigbus, they can use the more straightforward (but more costly)
> >>unmark/access/mark pattern in the same way as my earlier proposals.
> >>
> >>This allows folks to balance the cost vs complexity in their
> >>application appropriately.
> >>
> >>So that's a general overview of how the idea I'm proposing could
> >>be used.
> >My idea is that we don't need to move all pages in the range
> >to tail of LRU or new LRU list. Just move a page in the range
> >into tail of LRU or new LRU list. And when reclaimer start to find
> >victim page, it can know this page is volatile by something
> >(ex, if we use new LRU list, we can know it easily, Otherwise,
> >we can use VMA's new flag - VM_VOLATILE and we can know it easily
> >by page_check_references's tweak) and isolate all pages of the range
> >in middle of LRU list and reclaim them all at once.
> >So the cost of marking is just a (search cost for finding in-memory
> >page of the range + moving a page between LRU or from middle to tail)
> >It means we can move the cost time from mark/unmark to reclaim point.
> 
> So this general idea of moving a single page to represent the entire
> range has been mentioned before (I think Neil also suggested
> something similar).
> 
> But what happens if we create a large volatile range, most of which
> hasn't been accessed in a long time, while the chosen flag page has
> been accessed recently? In this case, we might end up swapping
> volatile pages to disk before we try to evict the volatile flag page.

Is your concern that the mostly-used flag page is in the middle of the
LRU while the rest of the pages in that range are at the tail of the LRU?

If we use a VMA flag, it shouldn't be a problem because the VM can know
that the non-flag page is in a volatile vma via VM_VOLATILE, so the VM
can discard the page without swapout.

If we use an ez-reclaim LRU list, the VM should peek at the ez-reclaim
LRU list first, before looking at the anon LRU list, so it shouldn't be
a problem either. The only problem with this approach is the space
shortage for a new page flag on 32-bit machines. I had a patch for
making new room but lost it :(.
Anyway, if we really want it, I think I can make that patch again.

> 
> I haven't looked it up, but I worry that trying to purge the rest of
> the range would end up causing those pages to be swapped back in.
> 
> And Neil's suggestion of removing the rest of the pages from the
> LRUs (if I understood his suggestion correctly) to avoid swapping
> out those pages, unfortunately results in the same extra O(n) cost
> of touching all the pages in the range every time we change the
> volatility state.
> 
> 
> >>
> >>
> >>Specific Interface semantics:
> >>---------------------------------------------
> >>
> >>Here are some of the open question about how the user interface
> >>should look:
> [snip]
> >>fd based interfaces vs madvise:
> >>	In talking with Taras Glek, he pointed out that for his
> >>	needs, the fd based interface is a little annoying, as it
> >>	requires having to get access to tmpfs file and mmap it in,
> >>	then instead of just referencing a pointer to the data he
> >>	wants to mark volatile, he has to calculate the offset from
> >>	start of the mmap and pass those file offsets to the interface.
> >>	Instead he mentioned that using something like madvise would be
> >>	much nicer, since they could just pass a pointer to the object
> >>	in memory they want to make volatile and avoid the extra work.
> >>
> >>	I'm not opposed to adding an madvise interface for this as
> >>	well, but since we have a existing use case with Android's
> >>	ashmem, I want to make sure we support this existing behavior.
> >>	Specifically as with ashmem  applications can be sharing
> >>	these tmpfs fds, and so file-relative volatile ranges make
> >>	more sense if you need to coordinate what data is volatile
> >>	between two applications.
> >>
> >>	Also, while I agree that having an madvise interface for
> >>	volatile ranges would be nice, it does open up some more
> >>	complex implementation issues, since with files, there is a
> >>	fixed relationship between pages and the files' address_space
> >>	mapping, where you can't have pages shared between different
> >>	mappings. This makes it easy to hang the volatile-range tree
> >>	off of the mapping (well, indirectly via a hash table). With
> >>	general anonymous memory, pages can be shared between multiple
> >>	processes, and as far as I understand, don't have any grouping
> >>	structure we could use to determine if the page is in a
> >>	volatile range or not. We would also need to determine more
> >>	complex questions like: What are the semantics of volatility
> >>	with copy-on-write pages?  I'm hoping to investigate this
> >>	idea more deeply soon so I can be sure whatever is pushed has
> >>	a clear plan of how to address this idea. Further thoughts
> >>	here would be appreciated.
> >I like madvise interface because allocator can use it for memory pool.
> >If allocator has free memory which just return from application
> >he can mark it into volatile so VM can reclaim that pages without swapout
> >when memory pressure happens and it can unmark it before allocating.
> >It would be more effective rather than calling munmap or madvise(DONTNEED)
> >which those operations requires all page table operation and even vma
> >unlinking in case of munmap.
> >
> >For it, we can add new VMA flag VM_VOLATILE and we can use reverse mapping
> >for grouping structure. For COW semantics, I think we can discard volatile
> >page only if all vmas which share the page don't have VM_VOLATILE.
> 
> I'll look into this approach some more. I've got some concerns below.
> 
> 
> >
> >>It would really be great to get any thoughts on these issues, as they
> >>are higher-priority to me then diving into the details of how we
> >>implement this internally, which can shift over time.
> >>
> >>
> >>
> >>Implementation Considerations:
> >>---------------------------------------------
> [snip]
> >>* Purging order between volatile ranges:
> >>	Again, since it makes sense to purge all the complete
> >>	pages in a range at the same time, we need to consider the
> >>	subtle difference between the least-recently-used pages vs
> >>	least-recently-used ranges. A single range could contain very
> >>	frequently accessed data, as well as rarely accessed data.
> >>	One must also consider that the act of marking a range as
> >>	volatile may not actually touch the underlying pages. Thus
> >>	purging ranges based on a least-recently-used page may also
> >>	result in purging the most-recently used page.
> >>
> >>	Android addressed the purging order question by purging ranges
> >>	in the order they were marked volatile. Thus the oldest
> >>	volatile range is the first range to be purged. This works
> >>	well in the Android  model, as applications aren't supposed
> >>	to access volatile data, so the least-recently-marked-volatile
> >>	order maps well to the least-recently-used-range.
> >>
> >>	However, this assumption doesn't hold with the lazy SIGBUS
> >>	notification method, as pages in a volatile range may continue
> >>	to be accessed after the range is marked volatile.  So the
> >>	question as to what is the best order of purging volatile
> >>	ranges is definitely open.
> >>
> >>	Abstractly the ideal solution might be to evaluate the
> >>	most-recently used page in each range, and to purge the range
> >>	with the oldest recently-used-page, but I suspect this is
> >>	not something that could be calculated efficiently.
> >>
> >>	Additionally, in my conversations with Taras, he pointed out
> >>	that if we are using a one-application-at-a-time UI model,
> >>	it would be ideal to discourage purging volatile data used by
> >>	the current application, instead prioritizing volatile ranges
> >>	from applications that aren't active. However, I'm not sure
> >>	what mechanism could be used to prioritize range purging in
> >>	this fashion, especially considering volatile ranges can be
> >>	on data that is shared between applications.
> >My thought is that "let it be" without creating new LRU list or
> >deactivating volatile page to tail of LRU for early reclaiming.
> >It means volatile pages has same priorty with other normal pages.
> >Volatile doesn't mean "Early reclaim" but "we don't need to swap out
> >them for reclaiming" in my perception.
> 
> Right. Using your earlier terms, ez-reclaim instead of early-reclaim.
> 
> Ideally I think we would want to prefer purging volatile pages over
> laundering dirty pages under pressure, but maybe that's a

Yes, it does make sense. Under memory pressure, writing out dirty pages
makes the system's responsiveness very slow.

> separate vmscan issue/optimization for now?

I think so. After we settle the question of "what is a volatile page?",
we can optimize for early reclaim of volatile pages over dirty page
writeout. Normally it's harder to reclaim anonymous pages than
file-backed pages, so writeout of dirty pages for reclaim can occur
even though there are lots of volatile pages in the anon LRU list.
One idea is to rotate volatile pages from the inactive list's tail to
the inactive list's head, instead of to the active LRU list's head, in
the aging model. That would make volatile pages easily reclaimable.

> 
> 
> 
> >In summary, the idea I am suggesting now if we select lazy SIGBUS is
> >following as,
> >
> >1) use madvise and VMA rmap with newly VM_VOLATILE page
> >2) mark
> >    just mark VM_VOLATILE in the VMA
> >3) treat volatile pages same with normal pages in POV aging
> >    (Of course, non swap system, we have to tweak VM for reclaimaing
> >     volatile pages, for example, we can move all volatile pages into
> >     inactive's tail when swapoff happens and VM peek tail of inactive
> >     LRU list without aging when memory pressure happens)
> >4) unmark
> >    just unmark VM_VOLATILE in the VMA and return error.
> 
> So to better grasp this: you're suggesting using the vma, in effect,
> as the volatile range object? So if we mark a large area as
> volatile, we'll create a new vma with the (start, end), splitting
> the remaining portions of the old area into two non-volatile vmas?

Exactly.

> 
> The caveat for #3 seems a bit large.  In no-swap cases (which are
> standard for Android), it sounds like we're still having to
> migrate every page in the range to the end of the list (and move
> them to the head when they are marked non-volatile).  This seems to
> have the same O(n) performance issue.

Firstly, I'm still not convinced that volatile pages should be reclaimed
with top priority. As I mentioned, the flag just represents "swapout NOT
necessary because it's volatile", so we don't need to migrate those
pages to the tail of the LRU, IMHO.
Instead, we need to age the anon LRU list if the system has volatile pages.

If we don't want to do aging in the anon LRU list, and select a model 
where we reclaim volatile pages with top priority, we can move the
migration cost from mark/unmark time to reclaim time.
I mean, mark/unmark don't need to migrate the pages. Instead, the VM can
scan the anon LRU list to find volatile pages when memory pressure happens.
It's a tradeoff, but the reclaim path is inherently not a fast path, so
it wouldn't be a big problem. But if we really want to select this model,
I would rather make a new LRU list than make the VM more complicated.

Anyway, it's a policy problem: we should define what a "volatile page" is.
Firstly, let's define the reclaim policy for volatile pages.

> 
> Also, if I understand correctly, you could have one page in
> multiple vmas, and the semantics for when we consider the page as
> volatile are awkward. When we want to purge/writeout, do we have to
> scan every vma the page is in to determine if it's volatile?

Yes, and the VM does this already. Look at page_check_references().
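
Hand-waving in kernel-style pseudo-C, the tweak might amount to
something like this inside page_check_references() (VM_VOLATILE and the
rmap-walking helper are hypothetical names from this discussion, not
mainline code):

/* If every vma mapping the page is marked VM_VOLATILE, the page can
 * be discarded without writeout, regardless of recent references. */
if (page_all_vmas_volatile(page))
	return PAGEREF_RECLAIM_CLEAN;	/* discard; never swap out */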
> 
> Another thing I worry about with this is that, since the vmas
> are per-mm, I don't see how a volatile range could be shared, like
> with a tmpfs file between two applications, a usage model that
> Android's ashmem supports.

In your mail, you mentioned COW, which is why I suggested this model.
If an application with a volatile VMA calls fork, pages in that range
can be shared by parent and child, and rmap can manage them.

> 
> I'm likely missing something, as I'm unfamiliar so far with the rmap
> code, so I'll look into this some more (although I may not get too
> much time for it in the next month).  Andi Kleen tried something
> similar to your suggestion a while back, but I'm just not confident
> that the per-mm vmas are the right place to manage this state (which
> is why I've used address_space mappings).
> 
> Thanks again for the feedback!

I will start implementing a prototype of my idea and hope to send it
out before traveling next week.


> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-09-29  3:16 [PATCH 0/3] Volatile Ranges (v7) & Lots of words John Stultz
                   ` (5 preceding siblings ...)
  2012-10-08  6:25 ` Minchan Kim
@ 2012-10-09  8:07 ` Mike Hommey
  2012-10-09 21:30   ` John Stultz
  6 siblings, 1 reply; 16+ messages in thread
From: Mike Hommey @ 2012-10-09  8:07 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On Fri, Sep 28, 2012 at 11:16:30PM -0400, John Stultz wrote:
> fd based interfaces vs madvise:
> 	In talking with Taras Glek, he pointed out that for his
> 	needs, the fd based interface is a little annoying, as it
> 	requires having to get access to tmpfs file and mmap it in,
> 	then instead of just referencing a pointer to the data he
> 	wants to mark volatile, he has to calculate the offset from
> 	start of the mmap and pass those file offsets to the interface.
> 	Instead he mentioned that using something like madvise would be
> 	much nicer, since they could just pass a pointer to the object
> 	in memory they want to make volatile and avoid the extra work.
> 
> 	I'm not opposed to adding an madvise interface for this as
> 	well, but since we have a existing use case with Android's
> 	ashmem, I want to make sure we support this existing behavior.
> 	Specifically as with ashmem  applications can be sharing
> 	these tmpfs fds, and so file-relative volatile ranges make
> 	more sense if you need to coordinate what data is volatile
> 	between two applications.
> 
> 	Also, while I agree that having an madvise interface for
> 	volatile ranges would be nice, it does open up some more
> 	complex implementation issues, since with files, there is a
> 	fixed relationship between pages and the files' address_space
> 	mapping, where you can't have pages shared between different
> 	mappings. This makes it easy to hang the volatile-range tree
> 	off of the mapping (well, indirectly via a hash table). With
> 	general anonymous memory, pages can be shared between multiple
> 	processes, and as far as I understand, don't have any grouping
> 	structure we could use to determine if the page is in a
> 	volatile range or not. We would also need to determine more
> 	complex questions like: What are the semantics of volatility
> 	with copy-on-write pages?  I'm hoping to investigate this
> 	idea more deeply soon so I can be sure whatever is pushed has
> 	a clear plan of how to address this idea. Further thoughts
> 	here would be appreciated.

Note it doesn't have to be a vs. situation. madvise could be an
additional way to interface with volatile ranges on a given fd.

That is, madvise doesn't have to mean anonymous memory. As a matter of
fact, MADV_WILLNEED/MADV_DONTNEED are usually used on mmapped files.
Similarly, there could be a way to use madvise to mark volatile ranges,
without the application having to track what memory ranges are
associated with what part of what file, which the kernel already tracks.

Mike

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-10-09  8:07 ` Mike Hommey
@ 2012-10-09 21:30   ` John Stultz
  2012-10-10  0:15     ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: John Stultz @ 2012-10-09 21:30 UTC (permalink / raw)
  To: Mike Hommey
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

On 10/09/2012 01:07 AM, Mike Hommey wrote:
> Note it doesn't have to be a vs. situation. madvise could be an
> additional way to interface with volatile ranges on a given fd.
>
> That is, madvise doesn't have to mean anonymous memory. As a matter of
> fact, MADV_WILLNEED/MADV_DONTNEED are usually used on mmapped files.
> Similarly, there could be a way to use madvise to mark volatile ranges,
> without the application having to track what memory ranges are
> associated with what part of what file, which the kernel already tracks.

Good point. We could add an madvise() interface, but limit it to 
mmapped tmpfs files, in parallel with the fallocate() interface.

However, I would like to think through how MADV_MARK_VOLATILE could 
work with purely anonymous memory before starting that approach. There 
is also Neil's point that an identical kernel interface restricted to 
tmpfs, serving only as a convenience for userland in translating 
between virtual addresses and mmapped file offsets, may be better left 
to a userland library.
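
(To illustrate that last point: a userland wrapper over the fallocate()
interface might look roughly like the sketch below, where the library
records the fd and base address of each tmpfs mapping at mmap() time;
all names here are hypothetical:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>

/* From the RFC patches; placeholder value for illustration. */
#ifndef FALLOC_FL_MARK_VOLATILE
#define FALLOC_FL_MARK_VOLATILE 0x100
#endif

/* Translate a pointer inside a known tmpfs mapping back to a file
 * offset, so callers can pass plain pointers and the bookkeeping
 * stays out of the application. */
static int mark_ptr_volatile(int fd, void *base, void *addr, size_t len)
{
	off_t offset = (char *)addr - (char *)base;

	return fallocate(fd, FALLOC_FL_MARK_VOLATILE, offset, (off_t)len);
}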

thanks
-john


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-10-09 21:30   ` John Stultz
@ 2012-10-10  0:15     ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2012-10-10  0:15 UTC (permalink / raw)
  To: John Stultz
  Cc: Mike Hommey, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, linux-mm

On Tue, Oct 09, 2012 at 02:30:03PM -0700, John Stultz wrote:
> On 10/09/2012 01:07 AM, Mike Hommey wrote:
> >Note it doesn't have to be a vs. situation. madvise could be an
> >additional way to interface with volatile ranges on a given fd.
> >
> >That is, madvise doesn't have to mean anonymous memory. As a matter of
> >fact, MADV_WILLNEED/MADV_DONTNEED are usually used on mmapped files.
> >Similarly, there could be a way to use madvise to mark volatile ranges,
> >without the application having to track what memory ranges are
> >associated with what part of what file, which the kernel already tracks.
> 
> Good point. We could add an madvise() interface, but limit it to
> mmapped tmpfs files, in parallel with the fallocate() interface.
> 
> However, I would like to think through how MADV_MARK_VOLATILE could
> work with purely anonymous memory before starting that approach.
> There is also Neil's point that an identical kernel interface
> restricted to tmpfs, serving only as a convenience for userland in
> translating between virtual addresses and mmapped file offsets, may
> be better left to a userland library.

How about this?

The scenarios I imagine for the madvise semantics are as follows.

1) Anonymous pages
Assume there is an allocator library which manages an mmapped reserved
pool. If it has lots of free memory which isn't used by anyone, it can
unmap part of the reserved pool, but unmap isn't cheap because the
kernel has to zap all the ptes of the pages in the range. But if we
avoid the unmap, the VM would needlessly swap out that range, which
holds nothing but garbage, when memory pressure happens. If the
allocator marks the range volatile, we can avoid the unnecessary
swapout and even reclaim the pages with no swap. The only thing the
allocator has to do is unmark the range before handing it back to the
user. (A small sketch of this follows the reclaim-preference line below.)

2) File pages (NOT tmpfs)
We can reclaim volatile file pages easily without recycling the LRU,
even if they were accessed recently.
The difference from DONTNEED is that DONTNEED always moves pages to the
tail of the inactive LRU for early reclaim, while the VOLATILE semantic
leaves them where they are and, when they reach the tail of the LRU
through aging, reclaims them without considering recent use, because
they may be unmarked for use sooner or later and we can't estimate the
cost of recreating the object.

So reclaim preference : NORMAL < VOLATILE < DONTNEED
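
For concreteness, scenario 1 in code might look like the sketch below;
MADV_VOLATILE/MADV_NOVOLATILE are only proposed names with placeholder
values, not a mainline API:

#include <stddef.h>
#include <sys/mman.h>

/* Proposed in this thread only; placeholder values. */
#define MADV_VOLATILE	64
#define MADV_NOVOLATILE	65

/* Free path: keep the pages mapped (no pte zapping as munmap or
 * MADV_DONTNEED would require), but let the VM discard them under
 * pressure instead of swapping them out. */
static void pool_free(void *chunk, size_t len)
{
	madvise(chunk, len, MADV_VOLATILE);
}

/* Alloc path: unmark before handing the chunk out; if the pages were
 * purged they now read as garbage, which is fine for an allocator. */
static void *pool_alloc(void *chunk, size_t len)
{
	madvise(chunk, len, MADV_NOVOLATILE);
	return chunk;
}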


> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-10-02 22:38   ` John Stultz
@ 2012-11-02 20:59     ` Michael Kerrisk
  2012-11-29 15:58       ` Mike Hommey
  2012-11-03  7:57     ` Michael Kerrisk
  1 sibling, 1 reply; 16+ messages in thread
From: Michael Kerrisk @ 2012-11-02 20:59 UTC (permalink / raw)
  To: John Stultz
  Cc: NeilBrown, LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Andrea Righi, Aneesh Kumar K.V,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm, Christoph Hellwig

John,

A question on one point:

On Wed, Oct 3, 2012 at 12:38 AM, John Stultz <john.stultz@linaro.org> wrote:
> On 10/02/2012 12:39 AM, NeilBrown wrote:
[...]
>>   The SIGBUS interface could have some merit if it really reduces
>> overhead.  I
>>   worry about app bugs that could result from the non-deterministic
>>   behaviour.   A range could get unmapped while it is in use and testing
>> for
>>   the case of "get a SIGBUS half way though accessing something" would not
>>   be straight forward (SIGBUS on first step of access should be easy).
>>   I guess that is up to the app writer, but I have never liked anything
>> about
>>   the signal interface and encouraging further use doesn't feel wise.
>
> Initially I didn't like the idea, but have warmed considerably to it. Mainly
> due to the concern that the constant unmark/access/mark pattern would be too
> much overhead, and having a lazy method will be much nicer for performance.
> But yes, at the cost of additional complexity of handling the signal,
> marking the faulted address range as non-volatile, restoring the data and
> continuing.

At a finer level of detail, how do you see this happening in the
application? I mean: in the general case, repopulating the purged
volatile page would have to be done outside the signal handler (I
think, because async-signal-safety considerations would preclude too
much complex stuff going on inside the handler). That implies
longjumping out of the handler, repopulating the pages with data, and
then restarting whatever work was being done when the SIGBUS was
generated.
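
(Roughly, the pattern I'm imagining is something like the sketch below,
where unmark_and_repopulate() and do_work_on_volatile_range() stand in
for application-specific code:)

#include <setjmp.h>
#include <signal.h>

extern void unmark_and_repopulate(void *addr);	/* app-specific */
extern void do_work_on_volatile_range(void);	/* app-specific */

static sigjmp_buf access_env;
static void *volatile fault_addr;

/* Async-signal-safe handler: record the faulting address, jump out. */
static void sigbus_handler(int sig, siginfo_t *si, void *uctx)
{
	(void)sig; (void)uctx;
	fault_addr = si->si_addr;
	siglongjmp(access_env, 1);
}

static void access_with_recovery(void)
{
	struct sigaction sa = { .sa_sigaction = sigbus_handler,
				.sa_flags = SA_SIGINFO };

	sigaction(SIGBUS, &sa, NULL);
	if (sigsetjmp(access_env, 1)) {
		/* A purged page was touched: unmark the range around
		 * fault_addr, regenerate its data, then fall through
		 * and restart the work from the top. */
		unmark_and_repopulate(fault_addr);
	}
	do_work_on_volatile_range();	/* a fault lands us back above */
}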

Cheers,

Michael

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-10-02 22:38   ` John Stultz
  2012-11-02 20:59     ` Michael Kerrisk
@ 2012-11-03  7:57     ` Michael Kerrisk
  1 sibling, 0 replies; 16+ messages in thread
From: Michael Kerrisk @ 2012-11-03  7:57 UTC (permalink / raw)
  To: John Stultz
  Cc: NeilBrown, LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Andrea Righi, Aneesh Kumar K.V,
	Mike Hommey, Taras Glek, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm, Christoph Hellwig,
	Linux API, Jake Edge, Michael Kerrisk

[CC += linux-api, since this is an API change.]

Hi John,

A couple of other questions that occurred to me...

What are the expected/planned semantics of volatile ranges for mlocked
pages? I noticed that Minchan's patch series
(https://lwn.net/Articles/522154/) gives an error on attempt to mark
locked pages as volatile (which seems sensible). I didn't see anything
similar in your patches. Perhaps it's not easy to do because of the
non-VMA-based implementation? Something to think about.

On Wed, Oct 3, 2012 at 12:38 AM, John Stultz <john.stultz@linaro.org> wrote:
> On 10/02/2012 12:39 AM, NeilBrown wrote:
>>
>> On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz <john.stultz@linaro.org>
>> wrote:
>>
>>   For example, allowing sub-page volatile region seems to be above and
>> beyond
>>   the call of duty.  You cannot mmap sub-pages, so why should they be
>> volatile?
>
> Although if someone marked a page and a half as volatile, would it be
> reasonable to throw away the second half of that second page? That seems
> unexpected to me. So we're really only marking the whole pages specified as
> volatlie,  similar to how FALLOC_FL_PUNCH_HOLE behaves.
>
> But if it happens that the adjacent range is also a partial page, we can
> coalesce them possibly into an purgable whole page. I think it makes sense,
> especially from a userland point of view and wasn't really complicated to
> add.

I must confess that I'm puzzled by this facility to mark sub-page
ranges as well. What's the use case? What I'm thinking is: the
goal of volatile ranges is to help improve system performance by
freeing up a (sizeable) block of pages. Why then would the user care
too much about marking with sub-page granularity, or that such ranges
might be merged? After all, the system calls to do this marking are
expensive, and so for performance reasons, I suppose that a process
would like to keep those system calls to a minimum.

[...]

>>   I think discarding whole ranges at a time is very sensible, and so
>> merging
>>   adjacent ranges is best avoided.  If you require page-aligned ranges
>> this
>>   becomes trivial - is that right?
>
> True. If we avoid coalescing non-whole page ranges, keeping non-overlapping
> ranges independent is fairly easy.

Regarding coalescing of adjacent ranges. Here's one possible argument
against it (Jake Edge alerted me to this). If an application marked
adjacent ranges using separate system calls, that might be an
indication that the application intends to to have different access
patterns against the two ranges: one frequent, the other rare. In that
case, I suppose it would be better if the ranges were not merged.

Cheers,

Michael

-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] Volatile Ranges (v7) & Lots of words
  2012-11-02 20:59     ` Michael Kerrisk
@ 2012-11-29 15:58       ` Mike Hommey
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Hommey @ 2012-11-29 15:58 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: John Stultz, NeilBrown, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Andrea Righi, Aneesh Kumar K.V,
	Taras Glek, Jan Kara, KOSAKI Motohiro, Michel Lespinasse,
	Minchan Kim, linux-mm, Christoph Hellwig

On Fri, Nov 02, 2012 at 09:59:07PM +0100, Michael Kerrisk wrote:
> John,
> 
> A question on one point:
> 
> On Wed, Oct 3, 2012 at 12:38 AM, John Stultz <john.stultz@linaro.org> wrote:
> > On 10/02/2012 12:39 AM, NeilBrown wrote:
> [...]
> >>   The SIGBUS interface could have some merit if it really reduces
> >> overhead.  I
> >>   worry about app bugs that could result from the non-deterministic
> >>   behaviour.   A range could get unmapped while it is in use and testing
> >> for
> >>   the case of "get a SIGBUS half way though accessing something" would not
> >>   be straight forward (SIGBUS on first step of access should be easy).
> >>   I guess that is up to the app writer, but I have never liked anything
> >> about
> >>   the signal interface and encouraging further use doesn't feel wise.
> >
> > Initially I didn't like the idea, but have warmed considerably to it. Mainly
> > due to the concern that the constant unmark/access/mark pattern would be too
> > much overhead, and having a lazy method will be much nicer for performance.
> > But yes, at the cost of additional complexity of handling the signal,
> > marking the faulted address range as non-volatile, restoring the data and
> > continuing.
> 
> At a finer level of detail, how do you see this happening in the
> application? I mean: in the general case, repopulating the purged
> volatile page would have to be done outside the signal handler (I
> think, because async-signal-safety considerations would preclude too
> much complex stuff going on inside the handler). That implies
> longjumping out of the handler, repopulating the pages with data, and
> then restarting whatever work was being done when the SIGBUS was
> generated.

There are different strategies that can be used to repopulate the pages,
within or outside the signal handler, and I'd say it's not that
important a detail.

That being said, if the kernel could be helpful and avoid people
shooting themselves in the foot, that would be great, too.

I don't know how possible this would be, but being able to get the
notification on a signalfd in a dedicated thread would certainly improve
things (I guess other use cases of SIGSEGV/SIGBUS handlers could
appreciate something like this). The kernel would pause the faulting
thread while sending the notification on the signalfd, and the notified
thread would be allowed to resume the faulting thread when it's done
doing its job.

Mike

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-11-29 16:16 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-29  3:16 [PATCH 0/3] Volatile Ranges (v7) & Lots of words John Stultz
2012-09-29  3:16 ` [PATCH 1/3] [RFC] Add volatile range management code John Stultz
2012-09-29  3:16 ` [PATCH 2/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers John Stultz
2012-09-29  3:16 ` [PATCH 3/3] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
2012-10-02  7:39 ` [PATCH 0/3] Volatile Ranges (v7) & Lots of words NeilBrown
2012-10-02 22:38   ` John Stultz
2012-11-02 20:59     ` Michael Kerrisk
2012-11-29 15:58       ` Mike Hommey
2012-11-03  7:57     ` Michael Kerrisk
2012-10-02 17:31 ` Taras Glek
2012-10-08  6:25 ` Minchan Kim
2012-10-09  1:25   ` John Stultz
2012-10-09  2:49     ` Minchan Kim
2012-10-09  8:07 ` Mike Hommey
2012-10-09 21:30   ` John Stultz
2012-10-10  0:15     ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).