* [PATCH 0/5][RFC] Fallocate Volatile Ranges v6
@ 2012-07-28  3:57 ` John Stultz
  0 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-07-28  3:57 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

After not getting much positive feedback on my last attempt at a
non-shrinker method for managing and purging volatile ranges, I
decided to go ahead and try implementing something along the lines
of Minchan's ERECLAIM LRU list idea.

Again this patchset has two parts:

The first three patches add generic volatile range management code,
along with tmpfs support for FALLOC_FL_MARK_VOLATILE (which uses a
shrinker to purge ranges), and convert ashmem to use
FALLOC_FL_MARK_VOLATILE, cutting the driver almost in half.


Since Kosaki-san objected to using the shrinker, as it's not NUMA
aware and is only called after we shrink the normal LRU lists, the
second half of this patch set provides a different method that is
not shrinker based.

The last two patches introduce a new lru list, LRU_VOLATILE.
When pages are marked volatile, they are moved to this lru list,
which we will shrink first when trying to free up memory.
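
Roughly, the idea looks like this (an illustrative sketch only, not
the actual code from patches 4/5; everything except the LRU_VOLATILE
entry already exists in include/linux/mmzone.h, simplified here):

/*
 * Simplified view of the per-zone LRU lists. LRU_VOLATILE is the
 * addition: pages in volatile ranges are moved onto it, and reclaim
 * drains it before touching the regular anon/file lists, since these
 * pages can be dropped without any writeback.
 */
enum lru_list {
	LRU_INACTIVE_ANON,
	LRU_ACTIVE_ANON,
	LRU_INACTIVE_FILE,
	LRU_ACTIVE_FILE,
	LRU_UNEVICTABLE,
	LRU_VOLATILE,		/* new: purged first under memory pressure */
	NR_LRU_LISTS
};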

The reason I'm keeping this in two parts is that I want to be able
to easily do performance comparisons between the more lightweight
but NUMA-unaware shrinker method and the more correct but slower
(due to the page-by-page management) method of handling it deeper
in the VM.

I know the way this is currently implemented is really bad for
performance, since we add and remove pages on the LRU_VOLATILE list
one page at a time instead of in batches. So this clearly needs more
work, but I wanted to get some initial reactions to this approach
versus the earlier ones.

Also, I know this isn't exactly the same as the ERECLAIM LRU list
that Minchan suggested, since that might also contain inactive clean
file pages, but I wanted to stick to just volatile pages for now as
I learn more about how the core VM works. I'm fine with renaming
LRU_VOLATILE later if that's appropriate.

What's new in this iteration:
* Dropped the writepage-style purging in favor of the LRU_VOLATILE
  approach

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Michel Lespinasse <walken@google.com>
CC: Minchan Kim <minchan@kernel.org>
CC: linux-mm@kvack.org <linux-mm@kvack.org>


John Stultz (5):
  [RFC] Add volatile range management code
  [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
  [RFC] ashmem: Convert ashmem to use volatile ranges
  [RFC][HACK] Add LRU_VOLATILE support to the VM
  [RFC][HACK] Switch volatile/shmem over to LRU_VOLATILE

 drivers/staging/android/ashmem.c |  331 +--------------------------
 fs/open.c                        |    3 +-
 include/linux/falloc.h           |    7 +-
 include/linux/fs.h               |    1 +
 include/linux/mm_inline.h        |    2 +
 include/linux/mmzone.h           |    1 +
 include/linux/page-flags.h       |    3 +
 include/linux/swap.h             |    3 +
 include/linux/volatile.h         |   39 ++++
 mm/Makefile                      |    2 +-
 mm/memcontrol.c                  |    1 +
 mm/page_alloc.c                  |    1 +
 mm/shmem.c                       |  118 ++++++++++
 mm/swap.c                        |   71 ++++++
 mm/vmscan.c                      |   76 ++++++-
 mm/volatile.c                    |  459 ++++++++++++++++++++++++++++++++++++++
 16 files changed, 788 insertions(+), 330 deletions(-)
 create mode 100644 include/linux/volatile.h
 create mode 100644 mm/volatile.c

-- 
1.7.9.5



* [PATCH 1/5] [RFC] Add volatile range management code
  2012-07-28  3:57 ` John Stultz
@ 2012-07-28  3:57   ` John Stultz
  -1 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-07-28  3:57 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

This patch provides the volatile range management code
that filesystems can utilize when implementing
FALLOC_FL_MARK_VOLATILE.

It tracks a collection of page ranges against a mapping
stored in an interval-tree. This code handles coalescing
overlapping and adjacent ranges, as well as splitting
ranges when sub-chunks are removed.

The ranges can be marked purged or unpurged, and there is a per-fs
LRU list that tracks all the unpurged ranges for that fs.
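
For reference, a rough sketch of how a filesystem is expected to
drive this (the volatile_* calls are the ones declared in
include/linux/volatile.h below; the my_fs_* names are made up for
illustration -- see the tmpfs patch for the real user):

static DEFINE_VOLATILE_FS_HEAD(my_fs_volatile_head);

static long my_fs_mark_volatile(struct address_space *mapping,
				pgoff_t start, pgoff_t end)
{
	long purged;

	volatile_range_lock(&my_fs_volatile_head);
	purged = volatile_range_add(&my_fs_volatile_head, mapping,
				    start, end);
	/* > 0 means we coalesced with an already-purged range */
	volatile_range_unlock(&my_fs_volatile_head);

	return purged;
}

static long my_fs_unmark_volatile(struct address_space *mapping,
				  pgoff_t start, pgoff_t end)
{
	long purged;

	volatile_range_lock(&my_fs_volatile_head);
	purged = volatile_range_remove(&my_fs_volatile_head, mapping,
				       start, end);
	volatile_range_unlock(&my_fs_volatile_head);

	/* 1 means part of the range was purged while it was volatile */
	return purged;
}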

v2:
* Fix bug in volatile_ranges_get_last_used returning bad
  start,end values
* Rework for intervaltree renaming
* Optimize volatile_range_lru_size to avoid running through
  lru list each time.

v3:
* Improve function name to make it clear what the
  volatile_ranges_pluck_lru() code does.
* Drop volatile_range_lru_size and unpurged_page_count
  management as it's now unused

v4:
* Re-add volatile_range_lru_size and unpurged_page_count
* Fix bug in range_remove where, when splitting ranges, we added
  an overlapping range before resizing the existing range.

v5:
* Drop the interval tree in favor of prio_tree, per Michel &
  Dmitry's suggestions.
* Cleanups

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Michel Lespinasse <walken@google.com>
CC: Minchan Kim <minchan@kernel.org>
CC: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/volatile.h |   45 ++++
 mm/Makefile              |    2 +-
 mm/volatile.c            |  509 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 555 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/volatile.h
 create mode 100644 mm/volatile.c

diff --git a/include/linux/volatile.h b/include/linux/volatile.h
new file mode 100644
index 0000000..6f41b98
--- /dev/null
+++ b/include/linux/volatile.h
@@ -0,0 +1,45 @@
+#ifndef _LINUX_VOLATILE_H
+#define _LINUX_VOLATILE_H
+
+#include <linux/fs.h>
+
+struct volatile_fs_head {
+	struct mutex lock;
+	struct list_head lru_head;
+	s64 unpurged_page_count;
+};
+
+
+#define DEFINE_VOLATILE_FS_HEAD(name) struct volatile_fs_head name = {	\
+	.lock = __MUTEX_INITIALIZER(name.lock),				\
+	.lru_head = LIST_HEAD_INIT(name.lru_head),			\
+	.unpurged_page_count = 0,					\
+}
+
+
+static inline void volatile_range_lock(struct volatile_fs_head *head)
+{
+	mutex_lock(&head->lock);
+}
+
+static inline void volatile_range_unlock(struct volatile_fs_head *head)
+{
+	mutex_unlock(&head->lock);
+}
+
+extern long volatile_range_add(struct volatile_fs_head *head,
+				struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index);
+extern long volatile_range_remove(struct volatile_fs_head *head,
+				struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index);
+
+extern s64 volatile_range_lru_size(struct volatile_fs_head *head);
+
+extern void volatile_range_clear(struct volatile_fs_head *head,
+					struct address_space *mapping);
+
+extern s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
+				struct address_space **mapping,
+				pgoff_t *start, pgoff_t *end);
+#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 2e2fbbe..3e3cd6f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   page_isolation.o mm_init.o mmu_context.o percpu.o \
-			   compaction.o $(mmu-y)
+			   compaction.o volatile.o $(mmu-y)
 obj-y += init-mm.o
 
 ifdef CONFIG_NO_BOOTMEM
diff --git a/mm/volatile.c b/mm/volatile.c
new file mode 100644
index 0000000..d05a767
--- /dev/null
+++ b/mm/volatile.c
@@ -0,0 +1,509 @@
+/* mm/volatile.c
+ *
+ * Volatile page range management.
+ *      Copyright 2011 Linaro
+ *
+ * Based on mm/ashmem.c
+ *      by Robert Love <rlove@google.com>
+ *      Copyright (C) 2008 Google, Inc.
+ *
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * The volatile range management is a helper layer on top of the range tree
+ * code, which is used to help filesystems manage page ranges that are volatile.
+ *
+ * These ranges are stored in a per-mapping range tree, which holds both
+ * purged and unpurged ranges connected to that address_space. Unpurged
+ * ranges are also linked together in an lru list that is per-volatile-fs-head
+ * (basically per-filesystem).
+ *
+ * The goal behind volatile ranges is to allow applications to interact
+ * with the kernel's cache management infrastructure.  In particular an
+ * application can say "this memory contains data that might be useful in
+ * the future, but can be reconstructed if necessary, so if the kernel
+ * needs to, it can zap and reclaim this memory without having to swap it out."
+ *
+ * The proposed mechanism - at a high level - is for user-space to be able
+ * to say "This memory is volatile" and then later "this memory is no longer
+ * volatile".  If the content of the memory is still available the second
+ * request succeeds.  If not, the memory is marked non-volatile and an
+ * error is returned to denote that the contents have been lost.
+ *
+ * Credits to Neil Brown for the above description.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/volatile.h>
+#include <linux/rbtree.h>
+#include <linux/hash.h>
+#include <linux/shmem_fs.h>
+
+
+struct volatile_range {
+	struct list_head		lru;
+	struct prio_tree_node		node;
+	unsigned int			purged;
+	struct address_space		*mapping;
+};
+
+
+/*
+ * To avoid bloating the address_space structure, we use
+ * a hash structure to map from address_space mappings to
+ * the interval_tree root that stores volatile ranges
+ */
+static DEFINE_MUTEX(hash_mutex);
+static struct hlist_head *mapping_hash;
+static long mapping_hash_shift = 8;
+struct mapping_hash_entry {
+	struct prio_tree_root		root;
+	struct address_space		*mapping;
+	struct hlist_node		hnode;
+};
+
+
+static inline
+struct prio_tree_root *__mapping_to_root(struct address_space *mapping)
+{
+	struct hlist_node *elem;
+	struct mapping_hash_entry *entry;
+	struct prio_tree_root *ret = NULL;
+
+	hlist_for_each_entry_rcu(entry, elem,
+			&mapping_hash[hash_ptr(mapping, mapping_hash_shift)],
+				hnode)
+		if (entry->mapping == mapping)
+			ret =  &entry->root;
+
+	return ret;
+}
+
+
+static inline
+struct prio_tree_root *mapping_to_root(struct address_space *mapping)
+{
+	struct prio_tree_root *ret;
+
+	mutex_lock(&hash_mutex);
+	ret =  __mapping_to_root(mapping);
+	mutex_unlock(&hash_mutex);
+	return ret;
+}
+
+
+static inline
+struct prio_tree_root *mapping_allocate_root(struct address_space *mapping)
+{
+	struct mapping_hash_entry *entry;
+	struct prio_tree_root *dblchk;
+	struct prio_tree_root *ret = NULL;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return NULL;
+
+	mutex_lock(&hash_mutex);
+	/* Since we dropped the lock, double check that no one has
+	 * created the same hash entry.
+	 */
+	dblchk = __mapping_to_root(mapping);
+	if (dblchk) {
+		kfree(entry);
+		ret = dblchk;
+		goto out;
+	}
+
+	INIT_HLIST_NODE(&entry->hnode);
+	entry->mapping = mapping;
+	INIT_PRIO_TREE_ROOT(&entry->root);
+
+	hlist_add_head_rcu(&entry->hnode,
+		&mapping_hash[hash_ptr(mapping, mapping_hash_shift)]);
+
+	ret = &entry->root;
+out:
+	mutex_unlock(&hash_mutex);
+	return ret;
+}
+
+
+static inline void mapping_free_root(struct prio_tree_root *root)
+{
+	struct mapping_hash_entry *entry;
+
+	mutex_lock(&hash_mutex);
+	entry = container_of(root, struct mapping_hash_entry, root);
+
+	hlist_del_rcu(&entry->hnode);
+	kfree(entry);
+	mutex_unlock(&hash_mutex);
+}
+
+
+/* volatile range helpers */
+static inline void vrange_resize(struct volatile_fs_head *head,
+				struct prio_tree_root *root,
+				struct volatile_range *vrange,
+				pgoff_t start_index, pgoff_t end_index)
+{
+	pgoff_t old_size, new_size;
+
+	old_size = vrange->node.last - vrange->node.start;
+	new_size = end_index-start_index;
+
+	if (!vrange->purged)
+		head->unpurged_page_count += new_size - old_size;
+
+	prio_tree_remove(root, &vrange->node);
+	vrange->node.start = start_index;
+	vrange->node.last = end_index;
+	prio_tree_insert(root, &vrange->node);
+}
+
+static struct volatile_range *vrange_alloc(void)
+{
+	struct volatile_range *new;
+
+	new = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
+	if (!new)
+		return 0;
+	INIT_PRIO_TREE_NODE(&new->node);
+	return new;
+}
+
+
+static void vrange_add(struct volatile_fs_head *head,
+				struct prio_tree_root *root,
+				struct volatile_range *vrange)
+{
+
+	prio_tree_insert(root, &vrange->node);
+
+	/* Only add unpurged ranges to LRU */
+	if (!vrange->purged) {
+		head->unpurged_page_count += vrange->node.last - vrange->node.start;
+		list_add_tail(&vrange->lru, &head->lru_head);
+	}
+
+}
+
+
+
+static void vrange_del(struct volatile_fs_head *head,
+				struct prio_tree_root *root,
+				struct volatile_range *vrange)
+{
+	if (!vrange->purged) {
+		head->unpurged_page_count -= vrange->node.last - vrange->node.start;
+		list_del(&vrange->lru);
+	}
+	prio_tree_remove(root, &vrange->node);
+	kfree(vrange);
+}
+
+
+/**
+ * volatile_range_add: Marks a page interval as volatile
+ * @head: per-fs volatile head
+ * @mapping: address space whose range is being marked volatile
+ * @start: Starting page in range to be marked volatile
+ * @end: Ending page in range to be marked volatile
+ *
+ * Mark a region as volatile. Coalesces overlapping and neighboring regions.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if the range was coalesced with any purged ranges,
+ * 0 otherwise, or -ENOMEM if allocation fails.
+ */
+long volatile_range_add(struct volatile_fs_head *head,
+				struct address_space *mapping,
+				pgoff_t start, pgoff_t end)
+{
+	struct prio_tree_node *node;
+	struct prio_tree_iter iter;
+	struct volatile_range *new, *vrange;
+	struct prio_tree_root *root;
+	int purged = 0;
+
+	/* Make sure we're properly locked */
+	WARN_ON(!mutex_is_locked(&head->lock));
+
+	/*
+	 * Because the lock might be held in a shrinker, release
+	 * it during allocation.
+	 */
+	mutex_unlock(&head->lock);
+	new = vrange_alloc();
+	mutex_lock(&head->lock);
+	if (!new)
+		return -ENOMEM;
+
+	root = mapping_to_root(mapping);
+	if (!root) {
+		mutex_unlock(&head->lock);
+		root = mapping_allocate_root(mapping);
+		mutex_lock(&head->lock);
+		if (!root) {
+			kfree(new);
+			return -ENOMEM;
+		}
+	}
+
+
+	/* First, find any existing intervals that overlap */
+	prio_tree_iter_init(&iter, root, start, end);
+	node = prio_tree_next(&iter);
+	while (node) {
+		vrange = container_of(node, struct volatile_range, node);
+
+		/* Already entirely marked volatile, so we're done */
+		if (vrange->node.start < start && vrange->node.last > end) {
+			/* don't need the allocated value */
+			kfree(new);
+			return purged;
+		}
+
+		/* Resize the new range to cover all overlapping ranges */
+		start = min_t(u64, start, vrange->node.start);
+		end = max_t(u64, end, vrange->node.last);
+
+		/* Inherit purged state from overlapping ranges */
+		purged |= vrange->purged;
+
+		/* See if there's a next range that overlaps */
+		node = prio_tree_next(&iter);
+
+		/* Delete the old range, as we consume it */
+		vrange_del(head, root, vrange);
+
+	}
+
+	/* Coalesce left-adjacent ranges */
+	prio_tree_iter_init(&iter, root, start-1, start);
+	node = prio_tree_next(&iter);
+	while (node) {
+		vrange = container_of(node, struct volatile_range, node);
+		node = prio_tree_next(&iter);
+		/* Only coalesce if both are either purged or unpurged */
+		if (vrange->purged == purged) {
+			/* resize new range */
+			start = min_t(u64, start, vrange->node.start);
+			end = max_t(u64, end, vrange->node.last);
+			/* delete old range */
+			vrange_del(head, root, vrange);
+		}
+	}
+
+	/* Coalesce right-adjacent ranges */
+	prio_tree_iter_init(&iter, root, end, end+1);
+	node = prio_tree_next(&iter);
+	while (node) {
+		vrange = container_of(node, struct volatile_range, node);
+		node = prio_tree_next(&iter);
+		/* Only coalesce if both are either purged or unpurged */
+		if (vrange->purged == purged) {
+			/* resize new range */
+			start = min_t(u64, start, vrange->node.start);
+			end = max_t(u64, end, vrange->node.last);
+			/* delete old range */
+			vrange_del(head, root, vrange);
+		}
+	}
+	/* Assign and store the new range in the range tree */
+	new->mapping = mapping;
+	new->node.start = start;
+	new->node.last = end;
+	new->purged = purged;
+	vrange_add(head, root, new);
+
+	return purged;
+}
+
+
+/**
+ * volatile_range_remove: Marks a page interval as nonvolatile
+ * @head: per-fs volatile head
+ * @mapping: address space whose range is being marked nonvolatile
+ * @start: Starting page in range to be marked nonvolatile
+ * @end: Ending page in range to be marked nonvolatile
+ *
+ * Mark a region as nonvolatile, and remove any contained pages
+ * from the volatile range tree.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if any portion of the range had been purged,
+ * 0 otherwise, or -ENOMEM if allocation fails.
+ */
+long volatile_range_remove(struct volatile_fs_head *head,
+				struct address_space *mapping,
+				pgoff_t start, pgoff_t end)
+{
+	struct prio_tree_node *node;
+	struct prio_tree_iter iter;
+	struct volatile_range *new, *vrange;
+	struct prio_tree_root *root;
+	int ret		= 0;
+	int used_new	= 0;
+
+	/* Make sure we're properly locked */
+	WARN_ON(!mutex_is_locked(&head->lock));
+
+	/*
+	 * Because the lock might be held in a shrinker, release
+	 * it during allocation.
+	 */
+	mutex_unlock(&head->lock);
+	new = vrange_alloc();
+	mutex_lock(&head->lock);
+	if (!new)
+		return -ENOMEM;
+
+	root = mapping_to_root(mapping);
+	if (!root)
+		goto out;
+
+
+	/* Find any overlapping ranges */
+	prio_tree_iter_init(&iter, root, start, end);
+	node = prio_tree_next(&iter);
+	while (node) {
+		vrange = container_of(node, struct volatile_range, node);
+		node = prio_tree_next(&iter);
+
+		ret |= vrange->purged;
+
+		if (start <= vrange->node.start && end >= vrange->node.last) {
+			/* delete: volatile range is totally within range */
+			vrange_del(head, root, vrange);
+		} else if (vrange->node.start >= start) {
+			/* resize: volatile range right-overlaps range */
+			vrange_resize(head, root, vrange, end+1, vrange->node.last);
+		} else if (vrange->node.last <= end) {
+			/* resize: volatile range left-overlaps range */
+			vrange_resize(head, root, vrange, vrange->node.start, start-1);
+		} else {
+			/* split: range is totally within a volatile range */
+			used_new = 1; /* we only do this once */
+			new->mapping = mapping;
+			new->node.start = end + 1;
+			new->node.last = vrange->node.last;
+			new->purged = vrange->purged;
+			vrange_resize(head, root, vrange, vrange->node.start, start-1);
+			vrange_add(head, root, new);
+			break;
+		}
+	}
+
+out:
+	if (!used_new)
+		kfree(new);
+
+	return ret;
+}
+
+/**
+ * volatile_range_lru_size: Returns the number of unpurged pages on the lru
+ * @head: per-fs volatile head
+ *
+ * Returns the number of unpurged pages on the LRU
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ */
+s64 volatile_range_lru_size(struct volatile_fs_head *head)
+{
+	WARN_ON(!mutex_is_locked(&head->lock));
+	return head->unpurged_page_count;
+}
+
+
+/**
+ * volatile_ranges_pluck_lru: Returns mapping and size of lru unpurged range
+ * @head: per-fs volatile head
+ * @mapping: double pointer to the mapping whose range is being purged
+ * @start: Pointer to starting address of range being purged
+ * @end: Pointer to ending address of range being purged
+ *
+ * Returns the mapping, start and end values of the least recently used
+ * range. Marks the range as purged and removes it from the LRU.
+ *
+ * Must lock the volatile_fs_head before calling!
+ *
+ * Returns 1 if a range was returned,
+ * 0 if no ranges were found.
+ */
+s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
+				struct address_space **mapping,
+				pgoff_t *start, pgoff_t *end)
+{
+	struct volatile_range *range;
+
+	WARN_ON(!mutex_is_locked(&head->lock));
+
+	if (list_empty(&head->lru_head))
+		return 0;
+
+	range = list_first_entry(&head->lru_head, struct volatile_range, lru);
+
+	*start = range->node.start;
+	*end = range->node.last;
+	*mapping = range->mapping;
+
+	head->unpurged_page_count -= *end - *start;
+	list_del(&range->lru);
+	range->purged = 1;
+
+	return 1;
+}
+
+
+/*
+ * Cleans up any volatile ranges.
+ */
+void volatile_range_clear(struct volatile_fs_head *head,
+				struct address_space *mapping)
+{
+	struct volatile_range *tozap;
+	struct prio_tree_root *root;
+
+	WARN_ON(!mutex_is_locked(&head->lock));
+
+	root = mapping_to_root(mapping);
+	if (!root)
+		return;
+
+	while (!prio_tree_empty(root)) {
+		tozap = container_of(root->prio_tree_node, struct volatile_range, node);
+		vrange_del(head, root, tozap);
+	}
+	mapping_free_root(root);
+}
+
+
+static int __init volatile_init(void)
+{
+	int i, size;
+
+	size = 1U << mapping_hash_shift;
+	mapping_hash = kzalloc(sizeof(mapping_hash)*size, GFP_KERNEL);
+	for (i = 0; i < size; i++)
+		INIT_HLIST_HEAD(&mapping_hash[i]);
+
+	return 0;
+}
+arch_initcall(volatile_init);
-- 
1.7.9.5



* [PATCH 2/5] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
  2012-07-28  3:57 ` John Stultz
@ 2012-07-28  3:57   ` John Stultz
  -1 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-07-28  3:57 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

This patch enables FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE
functionality for tmpfs, making use of the volatile range
management code.

Conceptually, FALLOC_FL_MARK_VOLATILE is like a delayed
FALLOC_FL_PUNCH_HOLE.  This allows applications that have
data caches that can be re-created to tell the kernel that
some memory contains data that is useful in the future, but
can be re-created if needed, so if the kernel needs to, it can
zap the memory without having to swap it out.

In use, applications use FALLOC_FL_MARK_VOLATILE to mark
page ranges as volatile when they are not in use. Later, if
they want to reuse the data, they use FALLOC_FL_UNMARK_VOLATILE,
which will return an error if the data has been purged.
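
For example, the intended userspace flow looks roughly like this
(sketch only: the flag values are the ones added by this patch and
may need to be defined locally until the userspace headers carry
them, and regenerate_cache() is just a stand-in for the
application's own rebuild path):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef FALLOC_FL_MARK_VOLATILE
#define FALLOC_FL_MARK_VOLATILE		0x04	/* values from this patch */
#define FALLOC_FL_UNMARK_VOLATILE	0x08
#endif

extern void regenerate_cache(int fd, off_t len);	/* app-specific */

void cache_example(void)
{
	off_t len = 4 << 20;
	int fd = open("/dev/shm/mycache", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return;

	ftruncate(fd, len);
	/* ... fill [0, len) with cache data that can be rebuilt ... */

	/* Cache idle: the kernel may now reclaim it under pressure */
	fallocate(fd, FALLOC_FL_MARK_VOLATILE, 0, len);

	/* Later, before touching the data again */
	if (fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, 0, len) != 0) {
		/*
		 * Nonzero means the contents were purged while volatile
		 * (or the call failed), so rebuild before reusing them.
		 */
		regenerate_cache(fd, len);
	}
}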

This is very much influenced by the Android Ashmem interface by
Robert Love, so credit to him and the Android developers.
In many cases the code & logic come directly from the ashmem patch.
The intent of this patch is to allow for ashmem-like behavior, but
to embed the idea a little deeper into the VM code.

This is a reworked version of the fadvise volatile idea submitted
earlier to the list. Thanks to Dave Chinner for suggesting to
rework the idea in this fashion. Also thanks to Dmitry Adamushko
for continued review and bug reporting, and Dave Hansen for
help with the original design and mentoring me in the VM code.

v3:
* Fix off by one issue when truncating page ranges
* Use Dave Hansen's suggestion to use shmem_writepage to trigger
  range purging instead of using a shrinker.

v4:
* Revert the shrinker removal, since writepage won't get called
  if we don't have swap.

v5:
* Cleanups

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Michel Lespinasse <walken@google.com>
CC: Minchan Kim <minchan@kernel.org>
CC: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 fs/open.c              |    3 +-
 include/linux/falloc.h |    7 +--
 mm/shmem.c             |  113 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 119 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 1e914b3..421a97c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -223,7 +223,8 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EINVAL;
 
 	/* Return error if mode is not supported */
-	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+			FALLOC_FL_MARK_VOLATILE | FALLOC_FL_UNMARK_VOLATILE))
 		return -EOPNOTSUPP;
 
 	/* Punch hole must have keep size set */
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 73e0b62..3e47ad5 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -1,9 +1,10 @@
 #ifndef _FALLOC_H_
 #define _FALLOC_H_
 
-#define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
-#define FALLOC_FL_PUNCH_HOLE	0x02 /* de-allocates range */
-
+#define FALLOC_FL_KEEP_SIZE		0x01 /* default is extend size */
+#define FALLOC_FL_PUNCH_HOLE		0x02 /* de-allocates range */
+#define FALLOC_FL_MARK_VOLATILE		0x04 /* mark range volatile */
+#define FALLOC_FL_UNMARK_VOLATILE	0x08 /* mark range non-volatile */
 #ifdef __KERNEL__
 
 /*
diff --git a/mm/shmem.c b/mm/shmem.c
index c15b998..e5ce04c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -64,6 +64,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/volatile.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -633,6 +634,103 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 	return error;
 }
 
+static DEFINE_VOLATILE_FS_HEAD(shmem_volatile_head);
+
+static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
+{
+	pgoff_t start, end;
+	int ret;
+
+	start = offset >> PAGE_CACHE_SHIFT;
+	end = (offset+len) >> PAGE_CACHE_SHIFT;
+
+	volatile_range_lock(&shmem_volatile_head);
+	ret = volatile_range_add(&shmem_volatile_head, &inode->i_data,
+								start, end);
+	if (ret > 0) { /* immediately purge */
+		shmem_truncate_range(inode,
+				((loff_t) start << PAGE_CACHE_SHIFT),
+				((loff_t) end << PAGE_CACHE_SHIFT)-1);
+		ret = 0;
+	}
+	volatile_range_unlock(&shmem_volatile_head);
+
+	return ret;
+}
+
+static int shmem_unmark_volatile(struct inode *inode, loff_t offset, loff_t len)
+{
+	pgoff_t start, end;
+	int ret;
+
+	start = offset >> PAGE_CACHE_SHIFT;
+	end = (offset+len) >> PAGE_CACHE_SHIFT;
+
+	volatile_range_lock(&shmem_volatile_head);
+	ret = volatile_range_remove(&shmem_volatile_head, &inode->i_data,
+								start, end);
+	volatile_range_unlock(&shmem_volatile_head);
+
+	return ret;
+}
+
+static void shmem_clear_volatile(struct inode *inode)
+{
+	volatile_range_lock(&shmem_volatile_head);
+	volatile_range_clear(&shmem_volatile_head, &inode->i_data);
+	volatile_range_unlock(&shmem_volatile_head);
+}
+
+static
+int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
+{
+	s64 nr_to_scan = sc->nr_to_scan;
+	const gfp_t gfp_mask = sc->gfp_mask;
+	struct address_space *mapping;
+	pgoff_t start, end;
+	int ret;
+	s64 page_count;
+
+	if (nr_to_scan && !(gfp_mask & __GFP_FS))
+		return -1;
+
+	volatile_range_lock(&shmem_volatile_head);
+	page_count = volatile_range_lru_size(&shmem_volatile_head);
+	if (!nr_to_scan)
+		goto out;
+
+	do {
+		ret = volatile_ranges_pluck_lru(&shmem_volatile_head,
+							&mapping, &start, &end);
+		if (ret) {
+			shmem_truncate_range(mapping->host,
+				((loff_t) start << PAGE_CACHE_SHIFT),
+				((loff_t) end << PAGE_CACHE_SHIFT)-1);
+
+			nr_to_scan -= end-start;
+			page_count -= end-start;
+		};
+	} while (ret && (nr_to_scan > 0));
+
+out:
+	volatile_range_unlock(&shmem_volatile_head);
+
+	return page_count;
+}
+
+static struct shrinker shmem_volatile_shrinker = {
+	.shrink = shmem_volatile_shrink,
+	.seeks = DEFAULT_SEEKS,
+};
+
+static int __init shmem_shrinker_init(void)
+{
+	register_shrinker(&shmem_volatile_shrinker);
+	return 0;
+}
+arch_initcall(shmem_shrinker_init);
+
+
 static void shmem_evict_inode(struct inode *inode)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
@@ -1730,6 +1828,14 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		/* No need to unmap again: hole-punching leaves COWed pages */
 		error = 0;
 		goto out;
+	} else if (mode & FALLOC_FL_MARK_VOLATILE) {
+		/* Mark pages volatile, sort of delayed hole punching */
+		error = shmem_mark_volatile(inode, offset, len);
+		goto out;
+	} else if (mode & FALLOC_FL_UNMARK_VOLATILE) {
+		/* Mark pages non-volatile, return error if pages were purged */
+		error = shmem_unmark_volatile(inode, offset, len);
+		goto out;
 	}
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -1808,6 +1914,12 @@ out:
 	return error;
 }
 
+static int shmem_release(struct inode *inode, struct file *file)
+{
+	shmem_clear_volatile(inode);
+	return 0;
+}
+
 static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
@@ -2719,6 +2831,7 @@ static const struct file_operations shmem_file_operations = {
 	.splice_read	= shmem_file_splice_read,
 	.splice_write	= generic_file_splice_write,
 	.fallocate	= shmem_fallocate,
+	.release	= shmem_release,
 #endif
 };
 
-- 
1.7.9.5



* [PATCH 3/5] [RFC] ashmem: Convert ashmem to use volatile ranges
  2012-07-28  3:57 ` John Stultz
@ 2012-07-28  3:57   ` John Stultz
  -1 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-07-28  3:57 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

This is a rework of my first-pass attempt at getting ashmem to
utilize the volatile range code, now using the fallocate interface.

In this implementation GET_PIN_STATUS is unimplemented, since
adding an ISVOLATILE check wasn't considered terribly useful in
earlier reviews. It would be trivial to re-add that functionality,
but I wanted to check with the Android developers to see how often
GET_PIN_STATUS is actually used.

Similarly, the ashmem PURGE_ALL_CACHES ioctl no longer functions,
as volatile range purging is no longer directly under ashmem's
control.
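
For reference, here is a rough userspace sketch (not part of this
patch) of how an ashmem-style pin/unpin would map onto the new
fallocate flags. cache_pin()/cache_unpin() are made-up names, the
flag values are copied from the tmpfs patch, /dev/shm is assumed to
be tmpfs, and the return convention for a purged range follows the
description in that patch, so treat this as illustrative only:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Values from the FALLOC_FL_*_VOLATILE patch; not in libc headers yet */
#ifndef FALLOC_FL_MARK_VOLATILE
#define FALLOC_FL_MARK_VOLATILE		0x04
#define FALLOC_FL_UNMARK_VOLATILE	0x08
#endif

/* "Unpin": the kernel may purge these pages under memory pressure */
static int cache_unpin(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
}

/* "Pin" again: non-zero may mean the data was purged (or an error) */
static int cache_pin(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, off, len);
}

int main(void)
{
	int fd = open("/dev/shm/example-cache", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return 1;
	ftruncate(fd, 1 << 20);		/* 1MB cache object */
	/* ... generate and write the cached data ... */
	cache_unpin(fd, 0, 1 << 20);	/* done with it for now */
	/* ... later, after possible memory pressure ... */
	if (cache_pin(fd, 0, 1 << 20) != 0)
		printf("cache was purged, regenerate the data\n");
	close(fd);
	return 0;
}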

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Michel Lespinasse <walken@google.com>
CC: Minchan Kim <minchan@kernel.org>
CC: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 drivers/staging/android/ashmem.c |  331 ++------------------------------------
 1 file changed, 10 insertions(+), 321 deletions(-)

diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 69cf2db..6ce73e1 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -52,26 +52,6 @@ struct ashmem_area {
 };
 
 /*
- * ashmem_range - represents an interval of unpinned (evictable) pages
- * Lifecycle: From unpin to pin
- * Locking: Protected by `ashmem_mutex'
- */
-struct ashmem_range {
-	struct list_head lru;		/* entry in LRU list */
-	struct list_head unpinned;	/* entry in its area's unpinned list */
-	struct ashmem_area *asma;	/* associated area */
-	size_t pgstart;			/* starting page, inclusive */
-	size_t pgend;			/* ending page, inclusive */
-	unsigned int purged;		/* ASHMEM_NOT or ASHMEM_WAS_PURGED */
-};
-
-/* LRU list of unpinned pages, protected by ashmem_mutex */
-static LIST_HEAD(ashmem_lru_list);
-
-/* Count of pages on our LRU list, protected by ashmem_mutex */
-static unsigned long lru_count;
-
-/*
  * ashmem_mutex - protects the list of and each individual ashmem_area
  *
  * Lock Ordering: ashmex_mutex -> i_mutex -> i_alloc_sem
@@ -79,102 +59,9 @@ static unsigned long lru_count;
 static DEFINE_MUTEX(ashmem_mutex);
 
 static struct kmem_cache *ashmem_area_cachep __read_mostly;
-static struct kmem_cache *ashmem_range_cachep __read_mostly;
-
-#define range_size(range) \
-	((range)->pgend - (range)->pgstart + 1)
-
-#define range_on_lru(range) \
-	((range)->purged == ASHMEM_NOT_PURGED)
-
-#define page_range_subsumes_range(range, start, end) \
-	(((range)->pgstart >= (start)) && ((range)->pgend <= (end)))
-
-#define page_range_subsumed_by_range(range, start, end) \
-	(((range)->pgstart <= (start)) && ((range)->pgend >= (end)))
-
-#define page_in_range(range, page) \
-	(((range)->pgstart <= (page)) && ((range)->pgend >= (page)))
-
-#define page_range_in_range(range, start, end) \
-	(page_in_range(range, start) || page_in_range(range, end) || \
-		page_range_subsumes_range(range, start, end))
-
-#define range_before_page(range, page) \
-	((range)->pgend < (page))
 
 #define PROT_MASK		(PROT_EXEC | PROT_READ | PROT_WRITE)
 
-static inline void lru_add(struct ashmem_range *range)
-{
-	list_add_tail(&range->lru, &ashmem_lru_list);
-	lru_count += range_size(range);
-}
-
-static inline void lru_del(struct ashmem_range *range)
-{
-	list_del(&range->lru);
-	lru_count -= range_size(range);
-}
-
-/*
- * range_alloc - allocate and initialize a new ashmem_range structure
- *
- * 'asma' - associated ashmem_area
- * 'prev_range' - the previous ashmem_range in the sorted asma->unpinned list
- * 'purged' - initial purge value (ASMEM_NOT_PURGED or ASHMEM_WAS_PURGED)
- * 'start' - starting page, inclusive
- * 'end' - ending page, inclusive
- *
- * Caller must hold ashmem_mutex.
- */
-static int range_alloc(struct ashmem_area *asma,
-		       struct ashmem_range *prev_range, unsigned int purged,
-		       size_t start, size_t end)
-{
-	struct ashmem_range *range;
-
-	range = kmem_cache_zalloc(ashmem_range_cachep, GFP_KERNEL);
-	if (unlikely(!range))
-		return -ENOMEM;
-
-	range->asma = asma;
-	range->pgstart = start;
-	range->pgend = end;
-	range->purged = purged;
-
-	list_add_tail(&range->unpinned, &prev_range->unpinned);
-
-	if (range_on_lru(range))
-		lru_add(range);
-
-	return 0;
-}
-
-static void range_del(struct ashmem_range *range)
-{
-	list_del(&range->unpinned);
-	if (range_on_lru(range))
-		lru_del(range);
-	kmem_cache_free(ashmem_range_cachep, range);
-}
-
-/*
- * range_shrink - shrinks a range
- *
- * Caller must hold ashmem_mutex.
- */
-static inline void range_shrink(struct ashmem_range *range,
-				size_t start, size_t end)
-{
-	size_t pre = range_size(range);
-
-	range->pgstart = start;
-	range->pgend = end;
-
-	if (range_on_lru(range))
-		lru_count -= pre - range_size(range);
-}
 
 static int ashmem_open(struct inode *inode, struct file *file)
 {
@@ -200,12 +87,6 @@ static int ashmem_open(struct inode *inode, struct file *file)
 static int ashmem_release(struct inode *ignored, struct file *file)
 {
 	struct ashmem_area *asma = file->private_data;
-	struct ashmem_range *range, *next;
-
-	mutex_lock(&ashmem_mutex);
-	list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned)
-		range_del(range);
-	mutex_unlock(&ashmem_mutex);
 
 	if (asma->file)
 		fput(asma->file);
@@ -339,56 +220,6 @@ out:
 	return ret;
 }
 
-/*
- * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
- *
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
- *
- * 'gfp_mask' is the mask of the allocation that got us into this mess.
- *
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
- * proceed without risk of deadlock (due to gfp_mask).
- *
- * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
- * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
- * pages freed.
- */
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
-{
-	struct ashmem_range *range, *next;
-
-	/* We might recurse into filesystem code, so bail out if necessary */
-	if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
-		return -1;
-	if (!sc->nr_to_scan)
-		return lru_count;
-
-	mutex_lock(&ashmem_mutex);
-	list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
-		loff_t start = range->pgstart * PAGE_SIZE;
-		loff_t end = (range->pgend + 1) * PAGE_SIZE;
-
-		do_fallocate(range->asma->file,
-				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-				start, end - start);
-		range->purged = ASHMEM_WAS_PURGED;
-		lru_del(range);
-
-		sc->nr_to_scan -= range_size(range);
-		if (sc->nr_to_scan <= 0)
-			break;
-	}
-	mutex_unlock(&ashmem_mutex);
-
-	return lru_count;
-}
-
-static struct shrinker ashmem_shrinker = {
-	.shrink = ashmem_shrink,
-	.seeks = DEFAULT_SEEKS * 4,
-};
-
 static int set_prot_mask(struct ashmem_area *asma, unsigned long prot)
 {
 	int ret = 0;
@@ -461,136 +292,10 @@ static int get_name(struct ashmem_area *asma, void __user *name)
 	return ret;
 }
 
-/*
- * ashmem_pin - pin the given ashmem region, returning whether it was
- * previously purged (ASHMEM_WAS_PURGED) or not (ASHMEM_NOT_PURGED).
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_pin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
-	struct ashmem_range *range, *next;
-	int ret = ASHMEM_NOT_PURGED;
-
-	list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
-		/* moved past last applicable page; we can short circuit */
-		if (range_before_page(range, pgstart))
-			break;
-
-		/*
-		 * The user can ask us to pin pages that span multiple ranges,
-		 * or to pin pages that aren't even unpinned, so this is messy.
-		 *
-		 * Four cases:
-		 * 1. The requested range subsumes an existing range, so we
-		 *    just remove the entire matching range.
-		 * 2. The requested range overlaps the start of an existing
-		 *    range, so we just update that range.
-		 * 3. The requested range overlaps the end of an existing
-		 *    range, so we just update that range.
-		 * 4. The requested range punches a hole in an existing range,
-		 *    so we have to update one side of the range and then
-		 *    create a new range for the other side.
-		 */
-		if (page_range_in_range(range, pgstart, pgend)) {
-			ret |= range->purged;
-
-			/* Case #1: Easy. Just nuke the whole thing. */
-			if (page_range_subsumes_range(range, pgstart, pgend)) {
-				range_del(range);
-				continue;
-			}
-
-			/* Case #2: We overlap from the start, so adjust it */
-			if (range->pgstart >= pgstart) {
-				range_shrink(range, pgend + 1, range->pgend);
-				continue;
-			}
-
-			/* Case #3: We overlap from the rear, so adjust it */
-			if (range->pgend <= pgend) {
-				range_shrink(range, range->pgstart, pgstart-1);
-				continue;
-			}
-
-			/*
-			 * Case #4: We eat a chunk out of the middle. A bit
-			 * more complicated, we allocate a new range for the
-			 * second half and adjust the first chunk's endpoint.
-			 */
-			range_alloc(asma, range, range->purged,
-				    pgend + 1, range->pgend);
-			range_shrink(range, range->pgstart, pgstart - 1);
-			break;
-		}
-	}
-
-	return ret;
-}
-
-/*
- * ashmem_unpin - unpin the given range of pages. Returns zero on success.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_unpin(struct ashmem_area *asma, size_t pgstart, size_t pgend)
-{
-	struct ashmem_range *range, *next;
-	unsigned int purged = ASHMEM_NOT_PURGED;
-
-restart:
-	list_for_each_entry_safe(range, next, &asma->unpinned_list, unpinned) {
-		/* short circuit: this is our insertion point */
-		if (range_before_page(range, pgstart))
-			break;
-
-		/*
-		 * The user can ask us to unpin pages that are already entirely
-		 * or partially pinned. We handle those two cases here.
-		 */
-		if (page_range_subsumed_by_range(range, pgstart, pgend))
-			return 0;
-		if (page_range_in_range(range, pgstart, pgend)) {
-			pgstart = min_t(size_t, range->pgstart, pgstart),
-			pgend = max_t(size_t, range->pgend, pgend);
-			purged |= range->purged;
-			range_del(range);
-			goto restart;
-		}
-	}
-
-	return range_alloc(asma, range, purged, pgstart, pgend);
-}
-
-/*
- * ashmem_get_pin_status - Returns ASHMEM_IS_UNPINNED if _any_ pages in the
- * given interval are unpinned and ASHMEM_IS_PINNED otherwise.
- *
- * Caller must hold ashmem_mutex.
- */
-static int ashmem_get_pin_status(struct ashmem_area *asma, size_t pgstart,
-				 size_t pgend)
-{
-	struct ashmem_range *range;
-	int ret = ASHMEM_IS_PINNED;
-
-	list_for_each_entry(range, &asma->unpinned_list, unpinned) {
-		if (range_before_page(range, pgstart))
-			break;
-		if (page_range_in_range(range, pgstart, pgend)) {
-			ret = ASHMEM_IS_UNPINNED;
-			break;
-		}
-	}
-
-	return ret;
-}
-
 static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
 			    void __user *p)
 {
 	struct ashmem_pin pin;
-	size_t pgstart, pgend;
 	int ret = -EINVAL;
 
 	if (unlikely(!asma->file))
@@ -612,20 +317,24 @@ static int ashmem_pin_unpin(struct ashmem_area *asma, unsigned long cmd,
 	if (unlikely(PAGE_ALIGN(asma->size) < pin.offset + pin.len))
 		return -EINVAL;
 
-	pgstart = pin.offset / PAGE_SIZE;
-	pgend = pgstart + (pin.len / PAGE_SIZE) - 1;
 
 	mutex_lock(&ashmem_mutex);
 
 	switch (cmd) {
 	case ASHMEM_PIN:
-		ret = ashmem_pin(asma, pgstart, pgend);
+		ret = do_fallocate(asma->file, FALLOC_FL_MARK_VOLATILE,
+					pin.offset, pin.len);
 		break;
 	case ASHMEM_UNPIN:
-		ret = ashmem_unpin(asma, pgstart, pgend);
+		ret = do_fallocate(asma->file, FALLOC_FL_UNMARK_VOLATILE,
+					pin.offset, pin.len);
 		break;
 	case ASHMEM_GET_PIN_STATUS:
-		ret = ashmem_get_pin_status(asma, pgstart, pgend);
+		/*
+		 * XXX - volatile ranges currently don't provide status,
+		 * due to questionable utility
+		 */
+		ret = -EINVAL;
 		break;
 	}
 
@@ -669,15 +378,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		break;
 	case ASHMEM_PURGE_ALL_CACHES:
 		ret = -EPERM;
-		if (capable(CAP_SYS_ADMIN)) {
-			struct shrink_control sc = {
-				.gfp_mask = GFP_KERNEL,
-				.nr_to_scan = 0,
-			};
-			ret = ashmem_shrink(&ashmem_shrinker, &sc);
-			sc.nr_to_scan = ret;
-			ashmem_shrink(&ashmem_shrinker, &sc);
-		}
 		break;
 	}
 
@@ -713,21 +413,13 @@ static int __init ashmem_init(void)
 		return -ENOMEM;
 	}
 
-	ashmem_range_cachep = kmem_cache_create("ashmem_range_cache",
-					  sizeof(struct ashmem_range),
-					  0, 0, NULL);
-	if (unlikely(!ashmem_range_cachep)) {
-		pr_err("failed to create slab cache\n");
-		return -ENOMEM;
-	}
-
 	ret = misc_register(&ashmem_misc);
 	if (unlikely(ret)) {
 		pr_err("failed to register misc device!\n");
 		return ret;
 	}
 
-	register_shrinker(&ashmem_shrinker);
+
 
 	pr_info("initialized\n");
 
@@ -738,13 +430,10 @@ static void __exit ashmem_exit(void)
 {
 	int ret;
 
-	unregister_shrinker(&ashmem_shrinker);
-
 	ret = misc_deregister(&ashmem_misc);
 	if (unlikely(ret))
 		pr_err("failed to unregister misc device!\n");
 
-	kmem_cache_destroy(ashmem_range_cachep);
 	kmem_cache_destroy(ashmem_area_cachep);
 
 	pr_info("unloaded\n");
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
  2012-07-28  3:57 ` John Stultz
@ 2012-07-28  3:57   ` John Stultz
  -1 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-07-28  3:57 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

In an attempt to push the volatile range management even
deeper into the VM code, this is my first pass at
implementing Minchan's idea of an LRU_VOLATILE list in
the mm core.

This list sits alongside the LRU_ACTIVE_ANON, _INACTIVE_ANON,
_ACTIVE_FILE, _INACTIVE_FILE and _UNEVICTABLE lru lists.

When a range is marked volatile, the pages in that range
are moved to the LRU_VOLATILE list. Since volatile pages
can be quickly purged, this list is the first list we
shrink when we need to free memory.

When a page is marked non-volatile, it is moved from the
LRU_VOLATILE list to the appropriate LRU_ACTIVE_ list.

This patch introduces the LRU_VOLATILE list, an isvolatile
page flag, functions to mark and unmark a single page
as volatile, and vmscan changes to purge volatile pages
first when reclaiming.

This is a very raw first pass, and it is neither performant
nor likely bug-free. It works in my trivial testing, but
I've not pushed it very hard yet.

I wanted to send it out just to get some initial thoughts
on the approach, and any suggestions in case I'm heading
too far in the wrong direction.
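
To make the intended calling convention a bit more concrete, below
is a hypothetical helper (not part of this patch; shmem's real hooks
come in the next patch and may look different) showing how a
filesystem could walk a page-cache range and move resident pages to
and from the new list with mark_volatile_page() and
mark_nonvolatile_page():

#include <linux/pagemap.h>	/* find_get_page(), page_cache_release() */
#include <linux/swap.h>		/* mark_volatile_page() added by this patch */

/*
 * Hypothetical example caller: move any resident pages in
 * [start, end) onto the LRU_VOLATILE list.  Gang lookups, locking
 * subtleties and error handling are omitted for brevity.
 */
static void example_mark_range_volatile(struct address_space *mapping,
					pgoff_t start, pgoff_t end)
{
	pgoff_t index;

	for (index = start; index < end; index++) {
		struct page *page = find_get_page(mapping, index);

		if (!page)
			continue;		/* not resident */
		mark_volatile_page(page);	/* onto LRU_VOLATILE */
		page_cache_release(page);	/* drop find_get_page() ref */
	}
}

/* And the reverse direction, when the range is unmarked again */
static void example_unmark_range_volatile(struct address_space *mapping,
					  pgoff_t start, pgoff_t end)
{
	pgoff_t index;

	for (index = start; index < end; index++) {
		struct page *page = find_get_page(mapping, index);

		if (!page)
			continue;
		mark_nonvolatile_page(page);	/* back to an active list */
		page_cache_release(page);
	}
}

As the XXX comments in the patch already note, a real caller would
want to batch this via pagevec_lru_move_fn() rather than bouncing
the zone lru_lock once per page.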

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Michel Lespinasse <walken@google.com>
CC: Minchan Kim <minchan@kernel.org>
CC: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/fs.h         |    1 +
 include/linux/mm_inline.h  |    2 ++
 include/linux/mmzone.h     |    1 +
 include/linux/page-flags.h |    3 ++
 include/linux/swap.h       |    3 ++
 mm/memcontrol.c            |    1 +
 mm/page_alloc.c            |    1 +
 mm/swap.c                  |   71 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |   76 +++++++++++++++++++++++++++++++++++++++++---
 9 files changed, 155 insertions(+), 4 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8fabb03..c6f3415 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -636,6 +636,7 @@ struct address_space_operations {
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
+	int (*purgepage)(struct page *page, struct writeback_control *wbc);
 };
 
 extern const struct address_space_operations empty_aops;
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 1397ccf..f78806c 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -91,6 +91,8 @@ static __always_inline enum lru_list page_lru(struct page *page)
 
 	if (PageUnevictable(page))
 		lru = LRU_UNEVICTABLE;
+	else if (PageIsVolatile(page))
+		lru = LRU_VOLATILE;
 	else {
 		lru = page_lru_base_type(page);
 		if (PageActive(page))
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 458988b..4bfa6c4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -162,6 +162,7 @@ enum lru_list {
 	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
 	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
+	LRU_VOLATILE,
 	LRU_UNEVICTABLE,
 	NR_LRU_LISTS
 };
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c88d2a9..57800c8 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -108,6 +108,7 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+	PG_isvolatile,
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -201,6 +202,8 @@ PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
 	TESTCLEARFLAG(Active, active)
+PAGEFLAG(IsVolatile, isvolatile) __CLEARPAGEFLAG(IsVolatile, isvolatile)
+	TESTCLEARFLAG(IsVolatile, isvolatile)
 __PAGEFLAG(Slab, slab)
 PAGEFLAG(Checked, checked)		/* Used by some filesystems */
 PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned)	/* Xen */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c84ec68..eb12d53 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -236,6 +236,9 @@ extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_page(struct page *page);
 extern void swap_setup(void);
 
+extern void mark_volatile_page(struct page *page);
+extern void mark_nonvolatile_page(struct page *page);
+
 extern void add_page_to_unevictable_list(struct page *page);
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f72b5e5..98e1303 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4066,6 +4066,7 @@ static const char * const mem_cgroup_lru_names[] = {
 	"active_anon",
 	"inactive_file",
 	"active_file",
+	"volatile",
 	"unevictable",
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..cffe1b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5975,6 +5975,7 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+	{1UL << PG_isvolatile,		"volatile"	},
 };
 
 static void dump_page_flags(unsigned long flags)
diff --git a/mm/swap.c b/mm/swap.c
index 4e7e2ec..24bf1f8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -574,6 +574,77 @@ void deactivate_page(struct page *page)
 	}
 }
 
+/**
+ * mark_volatile_page - Sets a page as volatile
+ * @page: page to mark volatile
+ *
+ * This function moves a page to the volatile lru.
+ */
+void mark_volatile_page(struct page *page)
+{
+	int lru;
+	bool active;
+	struct zone *zone = page_zone(page);
+	struct lruvec *lruvec;
+
+	if (!PageLRU(page))
+		return;
+
+	if (PageUnevictable(page))
+		return;
+
+	active = PageActive(page);
+	lru = page_lru_base_type(page);
+
+	/*
+	 * XXX - Doing this page by page is terrible for performance.
+	 * Rework w/ pagevec_lru_move_fn.
+	 */
+	spin_lock_irq(&zone->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, zone);
+	del_page_from_lru_list(page, lruvec, lru + active);
+	add_page_to_lru_list(page, lruvec, LRU_VOLATILE);
+	SetPageIsVolatile(page);
+	ClearPageActive(page);
+	spin_unlock_irq(&zone->lru_lock);
+
+
+}
+
+/**
+ * mark_nonvolatile_page - Sets a page as non-volatile
+ * @page: page to mark non-volatile
+ *
+ * This function moves a page from the volatile lru
+ * to the appropriate active list.
+ */
+void mark_nonvolatile_page(struct page *page)
+{
+	int lru;
+	struct zone *zone = page_zone(page);
+	struct lruvec *lruvec;
+
+	if (!PageLRU(page))
+		return;
+
+	if (!PageIsVolatile(page))
+		return;
+
+	lru = page_lru_base_type(page);
+
+	/*
+	 * XXX - Doing this page by page is terrible for performance.
+	 * Rework w/ pagevec_lru_move_fn
+	 */
+	spin_lock_irq(&zone->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, zone);
+	del_page_from_lru_list(page, lruvec, LRU_VOLATILE);
+	ClearPageIsVolatile(page);
+	SetPageActive(page);
+	add_page_to_lru_list(page, lruvec,  lru + LRU_ACTIVE);
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 347b3ff..c15d604 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -409,6 +409,11 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 		}
 		return PAGE_KEEP;
 	}
+
+
+	if (PageIsVolatile(page))
+		return PAGE_CLEAN;
+
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
 	if (!may_write_to_queue(mapping->backing_dev_info, sc))
@@ -483,7 +488,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 	if (!page_freeze_refs(page, 2))
 		goto cannot_free;
 	/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
-	if (unlikely(PageDirty(page))) {
+	if (unlikely(PageDirty(page)) && !PageIsVolatile(page)) {
 		page_unfreeze_refs(page, 2);
 		goto cannot_free;
 	}
@@ -869,6 +874,21 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (!mapping || !__remove_mapping(mapping, page))
 			goto keep_locked;
 
+
+		/* If the page is volatile, call purgepage on it */
+		if (PageIsVolatile(page)) {
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_NONE,
+				.nr_to_write = SWAP_CLUSTER_MAX,
+				.range_start = 0,
+				.range_end = LLONG_MAX,
+				.for_reclaim = 1,
+			};
+
+			if (mapping && mapping->a_ops && mapping->a_ops->purgepage)
+				mapping->a_ops->purgepage(page, &wbc);
+		}
+
 		/*
 		 * At this point, we have no other references and there is
 		 * no way to pick any more up (removed from LRU, removed
@@ -898,9 +918,11 @@ activate_locked:
 		/* Not a candidate for swapping, so reclaim swap space. */
 		if (PageSwapCache(page) && vm_swap_full())
 			try_to_free_swap(page);
-		VM_BUG_ON(PageActive(page));
-		SetPageActive(page);
-		pgactivate++;
+		if (!PageIsVolatile(page)) {
+			VM_BUG_ON(PageActive(page));
+			SetPageActive(page);
+			pgactivate++;
+		}
 keep_locked:
 		unlock_page(page);
 keep:
@@ -1190,6 +1212,45 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 	list_splice(&pages_to_free, page_list);
 }
 
+static noinline_for_stack unsigned long
+shrink_volatile_list(unsigned long nr_to_scan, struct lruvec *lruvec,
+		     struct scan_control *sc)
+{
+	LIST_HEAD(page_list);
+	unsigned long nr_scanned;
+	unsigned long nr_reclaimed = 0;
+	unsigned long nr_taken;
+	unsigned long nr_dirty = 0;
+	unsigned long nr_writeback = 0;
+
+	isolate_mode_t isolate_mode = 0;
+	struct zone *zone = lruvec_zone(lruvec);
+
+
+	lru_add_drain();
+
+	if (!sc->may_unmap)
+		isolate_mode |= ISOLATE_UNMAPPED;
+	if (!sc->may_writepage)
+		isolate_mode |= ISOLATE_CLEAN;
+
+	spin_lock_irq(&zone->lru_lock);
+	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
+				     &nr_scanned, sc, isolate_mode, LRU_VOLATILE);
+	spin_unlock_irq(&zone->lru_lock);
+
+	if (nr_taken == 0)
+		goto done;
+
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc,
+						&nr_dirty, &nr_writeback);
+	spin_lock_irq(&zone->lru_lock);
+	putback_inactive_pages(lruvec, &page_list);
+	spin_unlock_irq(&zone->lru_lock);
+done:
+	return nr_reclaimed;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1777,6 +1838,13 @@ restart:
 	get_scan_count(lruvec, sc, nr);
 
 	blk_start_plug(&plug);
+
+
+	nr_to_scan = min_t(unsigned long, get_lru_size(lruvec, LRU_VOLATILE), SWAP_CLUSTER_MAX);
+	if (nr_to_scan)
+		shrink_volatile_list(nr_to_scan, lruvec, sc);
+
+
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(lru) {
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
@ 2012-07-28  3:57   ` John Stultz
  0 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-07-28  3:57 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

In an attempt to push the volatile range managment even
deeper into the VM code, this is my first attempt at
implementing Minchan's idea of a LRU_VOLATILE list in
the mm core.

This list sits along side the LRU_ACTIVE_ANON, _INACTIVE_ANON,
_ACTIVE_FILE, _INACTIVE_FILE and _UNEVICTABLE lru lists.

When a range is marked volatile, the pages in that range
are moved to the LRU_VOLATILE list. Since volatile pages
can be quickly purged, this list is the first list we
shrink when we need to free memory.

When a page is marked non-volatile, it is moved from the
LRU_VOLATILE list to the appropriate LRU_ACTIVE_ list.

This patch introduces the LRU_VOLATILE list, an isvolatile
page flag, functions to mark and unmark a single page
as volatile, and shrinker functions to purge volatile
pages.

This is a very raw first pass, and is neither performant
or likely bugfree. It works in my trivial testing, but
I've not pushed it very hard yet.

I wanted to send it out just to get some inital thoughts
on the approach and any suggestions should I be going too
far in the wrong direction.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Michel Lespinasse <walken@google.com>
CC: Minchan Kim <minchan@kernel.org>
CC: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/fs.h         |    1 +
 include/linux/mm_inline.h  |    2 ++
 include/linux/mmzone.h     |    1 +
 include/linux/page-flags.h |    3 ++
 include/linux/swap.h       |    3 ++
 mm/memcontrol.c            |    1 +
 mm/page_alloc.c            |    1 +
 mm/swap.c                  |   71 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |   76 +++++++++++++++++++++++++++++++++++++++++---
 9 files changed, 155 insertions(+), 4 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8fabb03..c6f3415 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -636,6 +636,7 @@ struct address_space_operations {
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
+	int (*purgepage)(struct page *page, struct writeback_control *wbc);
 };
 
 extern const struct address_space_operations empty_aops;
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 1397ccf..f78806c 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -91,6 +91,8 @@ static __always_inline enum lru_list page_lru(struct page *page)
 
 	if (PageUnevictable(page))
 		lru = LRU_UNEVICTABLE;
+	else if (PageIsVolatile(page))
+		lru = LRU_VOLATILE;
 	else {
 		lru = page_lru_base_type(page);
 		if (PageActive(page))
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 458988b..4bfa6c4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -162,6 +162,7 @@ enum lru_list {
 	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
 	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
+	LRU_VOLATILE,
 	LRU_UNEVICTABLE,
 	NR_LRU_LISTS
 };
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c88d2a9..57800c8 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -108,6 +108,7 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+	PG_isvolatile,
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -201,6 +202,8 @@ PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
 	TESTCLEARFLAG(Active, active)
+PAGEFLAG(IsVolatile, isvolatile) __CLEARPAGEFLAG(IsVolatile, isvolatile)
+	TESTCLEARFLAG(IsVolatile, isvolatile)
 __PAGEFLAG(Slab, slab)
 PAGEFLAG(Checked, checked)		/* Used by some filesystems */
 PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned)	/* Xen */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c84ec68..eb12d53 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -236,6 +236,9 @@ extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_page(struct page *page);
 extern void swap_setup(void);
 
+extern void mark_volatile_page(struct page *page);
+extern void mark_nonvolatile_page(struct page *page);
+
 extern void add_page_to_unevictable_list(struct page *page);
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f72b5e5..98e1303 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4066,6 +4066,7 @@ static const char * const mem_cgroup_lru_names[] = {
 	"active_anon",
 	"inactive_file",
 	"active_file",
+	"volatile",
 	"unevictable",
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..cffe1b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5975,6 +5975,7 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+	{1UL << PG_isvolatile,		"volatile"	},
 };
 
 static void dump_page_flags(unsigned long flags)
diff --git a/mm/swap.c b/mm/swap.c
index 4e7e2ec..24bf1f8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -574,6 +574,77 @@ void deactivate_page(struct page *page)
 	}
 }
 
+/**
+ * mark_volatile_page - Sets a page as volatile
+ * @page: page to mark volatile
+ *
+ * This function moves a page to the volatile lru.
+ */
+void mark_volatile_page(struct page *page)
+{
+	int lru;
+	bool active;
+	struct zone *zone = page_zone(page);
+	struct lruvec *lruvec;
+
+	if (!PageLRU(page))
+		return;
+
+	if (PageUnevictable(page))
+		return;
+
+	active = PageActive(page);
+	lru = page_lru_base_type(page);
+
+	/*
+	 * XXX - Doing this page by page is terrible for performance.
+	 * Rework w/ pagevec_lru_move_fn.
+	 */
+	spin_lock_irq(&zone->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, zone);
+	del_page_from_lru_list(page, lruvec, lru + active);
+	add_page_to_lru_list(page, lruvec, LRU_VOLATILE);
+	SetPageIsVolatile(page);
+	ClearPageActive(page);
+	spin_unlock_irq(&zone->lru_lock);
+
+
+}
+
+/**
+ * mark_nonvolatile_page - Sets a page as non-volatile
+ * @page: page to mark non-volatile
+ *
+ * This function moves a page from the volatile lru
+ * to the appropriate active list.
+ */
+void mark_nonvolatile_page(struct page *page)
+{
+	int lru;
+	struct zone *zone = page_zone(page);
+	struct lruvec *lruvec;
+
+	if (!PageLRU(page))
+		return;
+
+	if (!PageIsVolatile(page))
+		return;
+
+	lru = page_lru_base_type(page);
+
+	/*
+	 * XXX - Doing this page by page is terrible for performance.
+	 * Rework w/ pagevec_lru_move_fn
+	 */
+	spin_lock_irq(&zone->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, zone);
+	del_page_from_lru_list(page, lruvec, LRU_VOLATILE);
+	ClearPageIsVolatile(page);
+	SetPageActive(page);
+	add_page_to_lru_list(page, lruvec,  lru + LRU_ACTIVE);
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 347b3ff..c15d604 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -409,6 +409,11 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 		}
 		return PAGE_KEEP;
 	}
+
+
+	if (PageIsVolatile(page))
+		return PAGE_CLEAN;
+
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
 	if (!may_write_to_queue(mapping->backing_dev_info, sc))
@@ -483,7 +488,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 	if (!page_freeze_refs(page, 2))
 		goto cannot_free;
 	/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
-	if (unlikely(PageDirty(page))) {
+	if (unlikely(PageDirty(page)) && !PageIsVolatile(page)) {
 		page_unfreeze_refs(page, 2);
 		goto cannot_free;
 	}
@@ -869,6 +874,21 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (!mapping || !__remove_mapping(mapping, page))
 			goto keep_locked;
 
+
+		/* If the page is volatile, call purgepage on it */
+		if (PageIsVolatile(page)) {
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_NONE,
+				.nr_to_write = SWAP_CLUSTER_MAX,
+				.range_start = 0,
+				.range_end = LLONG_MAX,
+				.for_reclaim = 1,
+			};
+
+			if (mapping && mapping->a_ops && mapping->a_ops->purgepage)
+				mapping->a_ops->purgepage(page, &wbc);
+		}
+
 		/*
 		 * At this point, we have no other references and there is
 		 * no way to pick any more up (removed from LRU, removed
@@ -898,9 +918,11 @@ activate_locked:
 		/* Not a candidate for swapping, so reclaim swap space. */
 		if (PageSwapCache(page) && vm_swap_full())
 			try_to_free_swap(page);
-		VM_BUG_ON(PageActive(page));
-		SetPageActive(page);
-		pgactivate++;
+		if (!PageIsVolatile(page)) {
+			VM_BUG_ON(PageActive(page));
+			SetPageActive(page);
+			pgactivate++;
+		}
 keep_locked:
 		unlock_page(page);
 keep:
@@ -1190,6 +1212,45 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 	list_splice(&pages_to_free, page_list);
 }
 
+static noinline_for_stack unsigned long
+shrink_volatile_list(unsigned long nr_to_scan, struct lruvec *lruvec,
+		     struct scan_control *sc)
+{
+	LIST_HEAD(page_list);
+	unsigned long nr_scanned;
+	unsigned long nr_reclaimed = 0;
+	unsigned long nr_taken;
+	unsigned long nr_dirty = 0;
+	unsigned long nr_writeback = 0;
+
+	isolate_mode_t isolate_mode = 0;
+	struct zone *zone = lruvec_zone(lruvec);
+
+
+	lru_add_drain();
+
+	if (!sc->may_unmap)
+		isolate_mode |= ISOLATE_UNMAPPED;
+	if (!sc->may_writepage)
+		isolate_mode |= ISOLATE_CLEAN;
+
+	spin_lock_irq(&zone->lru_lock);
+	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
+				     &nr_scanned, sc, isolate_mode, LRU_VOLATILE);
+	spin_unlock_irq(&zone->lru_lock);
+
+	if (nr_taken == 0)
+		goto done;
+
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc,
+						&nr_dirty, &nr_writeback);
+	spin_lock_irq(&zone->lru_lock);
+	putback_inactive_pages(lruvec, &page_list);
+	spin_unlock_irq(&zone->lru_lock);
+done:
+	return nr_reclaimed;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1777,6 +1838,13 @@ restart:
 	get_scan_count(lruvec, sc, nr);
 
 	blk_start_plug(&plug);
+
+
+	nr_to_scan = min_t(unsigned long, get_lru_size(lruvec, LRU_VOLATILE), SWAP_CLUSTER_MAX);
+	if (nr_to_scan)
+		shrink_volatile_list(nr_to_scan, lruvec, sc);
+
+
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(lru) {
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 5/5] [RFC][HACK] Switch volatile/shmem over to LRU_VOLATILE
  2012-07-28  3:57 ` John Stultz
@ 2012-07-28  3:57   ` John Stultz
  -1 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-07-28  3:57 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, Minchan Kim, linux-mm

This changes the earlier shrinker based volatile range
management over to using the LRU_VOLATILE list in mm core.

Again, this likely has performance issues, as well as
other problems I'm not aware of, so I'd greatly appreciate
any additional feedback or suggestions.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Android Kernel Team <kernel-team@android.com>
CC: Robert Love <rlove@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Hugh Dickins <hughd@google.com>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Neil Brown <neilb@suse.de>
CC: Andrea Righi <andrea@betterlinux.com>
CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Mike Hommey <mh@glandium.org>
CC: Jan Kara <jack@suse.cz>
CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
CC: Michel Lespinasse <walken@google.com>
CC: Minchan Kim <minchan@kernel.org>
CC: linux-mm@kvack.org <linux-mm@kvack.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/volatile.h |   12 ++----
 mm/shmem.c               |  103 ++++++++++++++++++++++++----------------------
 mm/volatile.c            |   86 ++++++++------------------------------
 3 files changed, 75 insertions(+), 126 deletions(-)

diff --git a/include/linux/volatile.h b/include/linux/volatile.h
index 6f41b98..7bd11c1 100644
--- a/include/linux/volatile.h
+++ b/include/linux/volatile.h
@@ -5,15 +5,11 @@
 
 struct volatile_fs_head {
 	struct mutex lock;
-	struct list_head lru_head;
-	s64 unpurged_page_count;
 };
 
 
 #define DEFINE_VOLATILE_FS_HEAD(name) struct volatile_fs_head name = {	\
 	.lock = __MUTEX_INITIALIZER(name.lock),				\
-	.lru_head = LIST_HEAD_INIT(name.lru_head),			\
-	.unpurged_page_count = 0,					\
 }
 
 
@@ -34,12 +30,10 @@ extern long volatile_range_remove(struct volatile_fs_head *head,
 				struct address_space *mapping,
 				pgoff_t start_index, pgoff_t end_index);
 
-extern s64 volatile_range_lru_size(struct volatile_fs_head *head);
-
 extern void volatile_range_clear(struct volatile_fs_head *head,
 					struct address_space *mapping);
 
-extern s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
-				struct address_space **mapping,
-				pgoff_t *start, pgoff_t *end);
+int volatile_page_mark_range_purged(struct volatile_fs_head *head,
+							struct page *page);
+
 #endif /* _LINUX_VOLATILE_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index e5ce04c..79f75af 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -636,6 +636,34 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 
 static DEFINE_VOLATILE_FS_HEAD(shmem_volatile_head);
 
+void modify_range(struct address_space *mapping, pgoff_t start, pgoff_t end,
+			void(*activate_func)(struct page*))
+{
+	struct pagevec pvec;
+	pgoff_t index = start;
+	int i;
+
+	pagevec_init(&pvec, 0);
+	while (index <= end && pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		mem_cgroup_uncharge_start();
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+
+			/* We rely upon deletion not changing page->index */
+			index = page->index;
+			if (index > end)
+				break;
+
+			activate_func(page);
+		}
+		pagevec_release(&pvec);
+		mem_cgroup_uncharge_end();
+		cond_resched();
+		index++;
+	}
+}
+
 static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
 {
 	pgoff_t start, end;
@@ -652,7 +680,11 @@ static int shmem_mark_volatile(struct inode *inode, loff_t offset, loff_t len)
 				((loff_t) start << PAGE_CACHE_SHIFT),
 				((loff_t) end << PAGE_CACHE_SHIFT)-1);
 		ret = 0;
+
 	}
+
+	modify_range(&inode->i_data, start, end-1, &mark_volatile_page);
+
 	volatile_range_unlock(&shmem_volatile_head);
 
 	return ret;
@@ -669,6 +701,9 @@ static int shmem_unmark_volatile(struct inode *inode, loff_t offset, loff_t len)
 	volatile_range_lock(&shmem_volatile_head);
 	ret = volatile_range_remove(&shmem_volatile_head, &inode->i_data,
 								start, end);
+
+	modify_range(&inode->i_data, start, end-1, &mark_nonvolatile_page);
+
 	volatile_range_unlock(&shmem_volatile_head);
 
 	return ret;
@@ -681,55 +716,6 @@ static void shmem_clear_volatile(struct inode *inode)
 	volatile_range_unlock(&shmem_volatile_head);
 }
 
-static
-int shmem_volatile_shrink(struct shrinker *ignored, struct shrink_control *sc)
-{
-	s64 nr_to_scan = sc->nr_to_scan;
-	const gfp_t gfp_mask = sc->gfp_mask;
-	struct address_space *mapping;
-	pgoff_t start, end;
-	int ret;
-	s64 page_count;
-
-	if (nr_to_scan && !(gfp_mask & __GFP_FS))
-		return -1;
-
-	volatile_range_lock(&shmem_volatile_head);
-	page_count = volatile_range_lru_size(&shmem_volatile_head);
-	if (!nr_to_scan)
-		goto out;
-
-	do {
-		ret = volatile_ranges_pluck_lru(&shmem_volatile_head,
-							&mapping, &start, &end);
-		if (ret) {
-			shmem_truncate_range(mapping->host,
-				((loff_t) start << PAGE_CACHE_SHIFT),
-				((loff_t) end << PAGE_CACHE_SHIFT)-1);
-
-			nr_to_scan -= end-start;
-			page_count -= end-start;
-		};
-	} while (ret && (nr_to_scan > 0));
-
-out:
-	volatile_range_unlock(&shmem_volatile_head);
-
-	return page_count;
-}
-
-static struct shrinker shmem_volatile_shrinker = {
-	.shrink = shmem_volatile_shrink,
-	.seeks = DEFAULT_SEEKS,
-};
-
-static int __init shmem_shrinker_init(void)
-{
-	register_shrinker(&shmem_volatile_shrinker);
-	return 0;
-}
-arch_initcall(shmem_shrinker_init);
-
 
 static void shmem_evict_inode(struct inode *inode)
 {
@@ -884,6 +870,24 @@ out:
 	return error;
 }
 
+static int shmem_purgepage(struct page *page, struct writeback_control *wbc)
+{
+	struct address_space *mapping;
+	struct inode *inode;
+	int purge;
+
+	BUG_ON(!PageLocked(page));
+	mapping = page->mapping;
+	inode = mapping->host;
+
+	volatile_range_lock(&shmem_volatile_head);
+	purge = volatile_page_mark_range_purged(&shmem_volatile_head, page);
+	volatile_range_unlock(&shmem_volatile_head);
+
+	return 0;
+}
+
+
 /*
  * Move the page from the page cache to the swap cache.
  */
@@ -2817,6 +2821,7 @@ static const struct address_space_operations shmem_aops = {
 #endif
 	.migratepage	= migrate_page,
 	.error_remove_page = generic_error_remove_page,
+	.purgepage	= shmem_purgepage,
 };
 
 static const struct file_operations shmem_file_operations = {
diff --git a/mm/volatile.c b/mm/volatile.c
index d05a767..b7db12a 100644
--- a/mm/volatile.c
+++ b/mm/volatile.c
@@ -53,7 +53,6 @@
 
 
 struct volatile_range {
-	struct list_head		lru;
 	struct prio_tree_node		node;
 	unsigned int			purged;
 	struct address_space		*mapping;
@@ -159,15 +158,8 @@ static inline void vrange_resize(struct volatile_fs_head *head,
 				struct volatile_range *vrange,
 				pgoff_t start_index, pgoff_t end_index)
 {
-	pgoff_t old_size, new_size;
-
-	old_size = vrange->node.last - vrange->node.start;
-	new_size = end_index-start_index;
-
-	if (!vrange->purged)
-		head->unpurged_page_count += new_size - old_size;
-
 	prio_tree_remove(root, &vrange->node);
+	INIT_PRIO_TREE_NODE(&vrange->node);
 	vrange->node.start = start_index;
 	vrange->node.last = end_index;
 	prio_tree_insert(root, &vrange->node);
@@ -189,15 +181,7 @@ static void vrange_add(struct volatile_fs_head *head,
 				struct prio_tree_root *root,
 				struct volatile_range *vrange)
 {
-
 	prio_tree_insert(root, &vrange->node);
-
-	/* Only add unpurged ranges to LRU */
-	if (!vrange->purged) {
-		head->unpurged_page_count += vrange->node.last - vrange->node.start;
-		list_add_tail(&vrange->lru, &head->lru_head);
-	}
-
 }
 
 
@@ -206,10 +190,6 @@ static void vrange_del(struct volatile_fs_head *head,
 				struct prio_tree_root *root,
 				struct volatile_range *vrange)
 {
-	if (!vrange->purged) {
-		head->unpurged_page_count -= vrange->node.last - vrange->node.start;
-		list_del(&vrange->lru);
-	}
 	prio_tree_remove(root, &vrange->node);
 	kfree(vrange);
 }
@@ -416,62 +396,32 @@ out:
 	return ret;
 }
 
-/**
- * volatile_range_lru_size: Returns the number of unpurged pages on the lru
- * @head: per-fs volatile head
- *
- * Returns the number of unpurged pages on the LRU
- *
- * Must lock the volatile_fs_head before calling!
- *
- */
-s64 volatile_range_lru_size(struct volatile_fs_head *head)
-{
-	WARN_ON(!mutex_is_locked(&head->lock));
-	return head->unpurged_page_count;
-}
-
-
-/**
- * volatile_ranges_pluck_lru: Returns mapping and size of lru unpurged range
- * @head: per-fs volatile head
- * @mapping: dbl pointer to mapping who's range is being purged
- * @start: Pointer to starting address of range being purged
- * @end: Pointer to ending address of range being purged
- *
- * Returns the mapping, start and end values of the least recently used
- * range. Marks the range as purged and removes it from the LRU.
- *
- * Must lock the volatile_fs_head before calling!
- *
- * Returns 1 on success if a range was returned
- * Return 0 if no ranges were found.
- */
-s64 volatile_ranges_pluck_lru(struct volatile_fs_head *head,
-				struct address_space **mapping,
-				pgoff_t *start, pgoff_t *end)
+int volatile_page_mark_range_purged(struct volatile_fs_head *head,
+							struct page *page)
 {
-	struct volatile_range *range;
+	struct prio_tree_root *root;
+	struct prio_tree_node *node;
+	struct prio_tree_iter iter;
+	struct volatile_range *vrange;
+	int ret	= 0;
 
 	WARN_ON(!mutex_is_locked(&head->lock));
 
-	if (list_empty(&head->lru_head))
+	root = mapping_to_root(page->mapping);
+	if (!root)
 		return 0;
 
-	range = list_first_entry(&head->lru_head, struct volatile_range, lru);
-
-	*start = range->node.start;
-	*end = range->node.last;
-	*mapping = range->mapping;
-
-	head->unpurged_page_count -= *end - *start;
-	list_del(&range->lru);
-	range->purged = 1;
+	prio_tree_iter_init(&iter, root, page->index, page->index);
+	node = prio_tree_next(&iter);
+	if (node) {
+		vrange = container_of(node, struct volatile_range, node);
 
-	return 1;
+		vrange->purged = 1;
+		ret = 1;
+	}
+	return ret;
 }
 
-
 /*
  * Cleans up any volatile ranges.
  */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
  2012-07-28  3:57   ` John Stultz
@ 2012-08-06  3:04     ` Minchan Kim
  -1 siblings, 0 replies; 38+ messages in thread
From: Minchan Kim @ 2012-08-06  3:04 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, dan.magenheimer, linux-mm

Hi John,

On Fri, Jul 27, 2012 at 11:57:11PM -0400, John Stultz wrote:
> In an attempt to push the volatile range managment even
> deeper into the VM code, this is my first attempt at
> implementing Minchan's idea of a LRU_VOLATILE list in
> the mm core.
> 
> This list sits along side the LRU_ACTIVE_ANON, _INACTIVE_ANON,
> _ACTIVE_FILE, _INACTIVE_FILE and _UNEVICTABLE lru lists.
> 
> When a range is marked volatile, the pages in that range
> are moved to the LRU_VOLATILE list. Since volatile pages
> can be quickly purged, this list is the first list we
> shrink when we need to free memory.
> 
> When a page is marked non-volatile, it is moved from the
> LRU_VOLATILE list to the appropriate LRU_ACTIVE_ list.

I think active list promotion is not good.
It should go to the inactive list, where it gets a chance to
be activated from inactive to active sooner or later if it is
really touched.
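
A minimal sketch of that change, reusing mark_nonvolatile_page() and the
helpers from patch 4/5 (an illustration only, with the locking still done
page by page as in the original):

void mark_nonvolatile_page(struct page *page)
{
	int lru;
	struct zone *zone = page_zone(page);
	struct lruvec *lruvec;

	if (!PageLRU(page))
		return;

	if (!PageIsVolatile(page))
		return;

	lru = page_lru_base_type(page);

	spin_lock_irq(&zone->lru_lock);
	lruvec = mem_cgroup_page_lruvec(page, zone);
	del_page_from_lru_list(page, lruvec, LRU_VOLATILE);
	ClearPageIsVolatile(page);
	/*
	 * No SetPageActive(): the page lands on the inactive list and has
	 * to earn activation through references.
	 */
	add_page_to_lru_list(page, lruvec, lru);
	spin_unlock_irq(&zone->lru_lock);
}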

> 
> This patch introduces the LRU_VOLATILE list, an isvolatile
> page flag, functions to mark and unmark a single page
> as volatile, and shrinker functions to purge volatile
> pages.
> 
> This is a very raw first pass, and is neither performant
> or likely bugfree. It works in my trivial testing, but
> I've not pushed it very hard yet.
> 
> I wanted to send it out just to get some inital thoughts
> on the approach and any suggestions should I be going too
> far in the wrong direction.

I looked at this series and found several nitpicks about the implementation,
but I think it's not a good stage to be concerned with those yet.

Although the naming is rather different from what I suggested, I think it's a good idea.
So let's talk about that first.
I will call the VOLATILE list the EReclaimable LRU list.

The purpose of it is to prevent unnecessary LRU churning and to
reclaim unneeded pages quickly, so that latency-sensitive systems
don't see a big latency when memory pressure happens.

Targets for the LRU list could be the following in the future:

1. volatile pages in this patchset.
2. ephemeral pages of tmem
3. madvise(DONTNEED)
4. fadvise(NOREUSE)
5. PG_reclaimed pages
6. clean pages if we write CFLRU(clean first LRU)

So if anyone has objections, please raise your hand
before we make further progress.

> 
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Android Kernel Team <kernel-team@android.com>
> CC: Robert Love <rlove@google.com>
> CC: Mel Gorman <mel@csn.ul.ie>
> CC: Hugh Dickins <hughd@google.com>
> CC: Dave Hansen <dave@linux.vnet.ibm.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Neil Brown <neilb@suse.de>
> CC: Andrea Righi <andrea@betterlinux.com>
> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> CC: Mike Hommey <mh@glandium.org>
> CC: Jan Kara <jack@suse.cz>
> CC: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> CC: Michel Lespinasse <walken@google.com>
> CC: Minchan Kim <minchan@kernel.org>
> CC: linux-mm@kvack.org <linux-mm@kvack.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  include/linux/fs.h         |    1 +
>  include/linux/mm_inline.h  |    2 ++
>  include/linux/mmzone.h     |    1 +
>  include/linux/page-flags.h |    3 ++
>  include/linux/swap.h       |    3 ++
>  mm/memcontrol.c            |    1 +
>  mm/page_alloc.c            |    1 +
>  mm/swap.c                  |   71 +++++++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c                |   76 +++++++++++++++++++++++++++++++++++++++++---
>  9 files changed, 155 insertions(+), 4 deletions(-)
> 
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
  2012-08-06  3:04     ` Minchan Kim
@ 2012-08-06 15:46       ` Dan Magenheimer
  -1 siblings, 0 replies; 38+ messages in thread
From: Dan Magenheimer @ 2012-08-06 15:46 UTC (permalink / raw)
  To: Minchan Kim, John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, linux-mm

> From: Minchan Kim [mailto:minchan@kernel.org]
> To: John Stultz
> Subject: Re: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM

Hi Minchan --

Thanks for cc'ing me on this!

> Targets for the LRU list could be following as in future
> 
> 1. volatile pages in this patchset.
> 2. ephemeral pages of tmem
> 3. madivse(DONTNEED)
> 4. fadvise(NOREUSE)
> 5. PG_reclaimed pages
> 6. clean pages if we write CFLRU(clean first LRU)
> 
> So if any guys have objection, please raise your hands
> before further progress.

I agree that the existing shrinker mechanism is too primitive
and the kernel needs to take into account more factors in
deciding how to quickly reclaim pages from a broader set
of sources.  However, I think it is important to ensure
that both the "demand" side and the "supply" side are
studied.  There has to be some kind of prioritization policy
among all the RAM consumers so that a lower-priority
alloc_page doesn't cause a higher-priority "volatile" page
to be consumed.  I suspect this policy will be VERY hard to
define and maintain.

Related, ephemeral pages in tmem are not truly volatile,
as there is always at least one tmem data structure pointing
to them.  I haven't followed this thread previously so my apologies
if it already has this, but the LRU_VOLATILE list might
need to support a per-page "garbage collection" callback.

Dan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
  2012-08-06  3:04     ` Minchan Kim
@ 2012-08-06 20:38       ` John Stultz
  -1 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-08-06 20:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, dan.magenheimer, linux-mm

On 08/05/2012 08:04 PM, Minchan Kim wrote:
> Hi John,
>
> On Fri, Jul 27, 2012 at 11:57:11PM -0400, John Stultz wrote:
>> In an attempt to push the volatile range managment even
>> deeper into the VM code, this is my first attempt at
>> implementing Minchan's idea of a LRU_VOLATILE list in
>> the mm core.
>>
>> This list sits along side the LRU_ACTIVE_ANON, _INACTIVE_ANON,
>> _ACTIVE_FILE, _INACTIVE_FILE and _UNEVICTABLE lru lists.
>>
>> When a range is marked volatile, the pages in that range
>> are moved to the LRU_VOLATILE list. Since volatile pages
>> can be quickly purged, this list is the first list we
>> shrink when we need to free memory.
>>
>> When a page is marked non-volatile, it is moved from the
>> LRU_VOLATILE list to the appropriate LRU_ACTIVE_ list.
> I think active list promotion is not good.
> It should go to the inactive list and they get a chance to
> activate from inactive to active sooner or later if it is
> really touched.

Ok. Thanks, I'll change it so we move to the inactive list then.


>> This patch introduces the LRU_VOLATILE list, an isvolatile
>> page flag, functions to mark and unmark a single page
>> as volatile, and shrinker functions to purge volatile
>> pages.
>>
>> This is a very raw first pass, and is neither performant
>> or likely bugfree. It works in my trivial testing, but
>> I've not pushed it very hard yet.
>>
>> I wanted to send it out just to get some inital thoughts
>> on the approach and any suggestions should I be going too
>> far in the wrong direction.
> I look at this series and found several nitpicks about implemenataion
> but I think it's not a good stage about concerning it.

While I know the design may still need significant change, I'd
still appreciate nitpicks, as they might help me better understand the
mm code and any mistakes I'm making.


> Although naming is rather differet with I suggested, I think it's good idea.
> So let's talk about it firstly.
> I will call VOLATILE list as EReclaimale LRU list.
Yea, I didn't want to call it ERECLAIMABLE since for this iteration I 
was limiting the scope just to volatile pages. I'm totally fine renaming 
it as the scope widens.

thanks
-john


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
  2012-08-06 15:46       ` Dan Magenheimer
@ 2012-08-07  0:56         ` Minchan Kim
  -1 siblings, 0 replies; 38+ messages in thread
From: Minchan Kim @ 2012-08-07  0:56 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, linux-mm

On Mon, Aug 06, 2012 at 08:46:18AM -0700, Dan Magenheimer wrote:
> > From: Minchan Kim [mailto:minchan@kernel.org]
> > To: John Stultz
> > Subject: Re: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
> 
> Hi Minchan --
> 
> Thanks for cc'ing me on this!
> 
> > Targets for the LRU list could be following as in future
> > 
> > 1. volatile pages in this patchset.
> > 2. ephemeral pages of tmem
> > 3. madivse(DONTNEED)
> > 4. fadvise(NOREUSE)
> > 5. PG_reclaimed pages
> > 6. clean pages if we write CFLRU(clean first LRU)
> > 
> > So if any guys have objection, please raise your hands
> > before further progress.
> 
> I agree that the existing shrinker mechanism is too primitive
> and the kernel needs to take into account more factors in
> deciding how to quickly reclaim pages from a broader set
> of sources.  However, I think it is important to ensure
> that both the "demand" side and the "supply" side are
> studied.  There has to be some kind of prioritization policy
> among all the RAM consumers so that a lower-priority
> alloc_page doesn't cause a higher-priority "volatile" page
> to be consumed.  I suspect this policy will be VERY hard to
> define and maintain.

Yes. It's another story.
At the moment, the VM doesn't consider such a priority-inversion problem,
except for giving more memory to privileged processes. It's very simple
but has worked well so far.

> 
> Related, ephemeral pages in tmem are not truly volatile

"volatile" term is used by John for only his special patch so
I like Ereclaim(Easy Reclaim) rather than volatile.

> as there is always at least one tmem data structure pointing
> to it.  I haven't followed this thread previously so my apologies
> if it already has this, but the LRU_VOLATILE list might
> need to support a per-page "garbage collection" callback.

Right. That's why this patch provides purgepage in address_space_operations.
I think zcache could attach its own address_space_operations to the pages
allocated by zbud, for instance a zcache_purgepage which is called by the VM
when a page is reclaimed. So zcache doesn't need a custom LRU policy (though it
still needs a linked list for managing zbud buddies) and can pass the decision to the VM.
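
As a sketch of that wiring, assuming only the purgepage op this series adds
to address_space_operations (example_purge_page and example_aops are
illustrative names, not code from the patches):

#include <linux/fs.h>
#include <linux/writeback.h>

/* illustrative only: a backing store's purge callback */
static int example_purge_page(struct page *page, struct writeback_control *wbc)
{
	/* drop whatever metadata still references this pageframe */
	return 0;
}

static const struct address_space_operations example_aops = {
	.purgepage	= example_purge_page,
};

shrink_page_list() in patch 4/5 then invokes it for pages taken off
LRU_VOLATILE via mapping->a_ops->purgepage(page, &wbc).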


> 
> Dan
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
  2012-08-07  0:56         ` Minchan Kim
@ 2012-08-07  1:26           ` Dan Magenheimer
  -1 siblings, 0 replies; 38+ messages in thread
From: Dan Magenheimer @ 2012-08-07  1:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, linux-mm

> From: Minchan Kim [mailto:minchan@kernel.org]
> Subject: Re: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
> 
> On Mon, Aug 06, 2012 at 08:46:18AM -0700, Dan Magenheimer wrote:
> > > From: Minchan Kim [mailto:minchan@kernel.org]
> > > To: John Stultz
> > > Subject: Re: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
> >
> > Hi Minchan --
> >
> > Thanks for cc'ing me on this!
> >
> > > Targets for the LRU list could be following as in future
> > >
> > > 1. volatile pages in this patchset.
> > > 2. ephemeral pages of tmem
> > > 3. madivse(DONTNEED)
> > > 4. fadvise(NOREUSE)
> > > 5. PG_reclaimed pages
> > > 6. clean pages if we write CFLRU(clean first LRU)
> > >
> > > So if any guys have objection, please raise your hands
> > > before further progress.
> >
> > I agree that the existing shrinker mechanism is too primitive
> > and the kernel needs to take into account more factors in
> > deciding how to quickly reclaim pages from a broader set
> > of sources.  However, I think it is important to ensure
> > that both the "demand" side and the "supply" side are
> > studied.  There has to be some kind of prioritization policy
> > among all the RAM consumers so that a lower-priority
> > alloc_page doesn't cause a higher-priority "volatile" page
> > to be consumed.  I suspect this policy will be VERY hard to
> > define and maintain.
> 
> Yes. It's another story.
> At the moment, VM doesn't consider such priority-inversion problem
> excpet giving the more memory to privileged processes. It's so simple
> but works well till now.

I think it is very important that both stories must be
solved together.  See below...

> > Related, ephemeral pages in tmem are not truly volatile
> 
> "volatile" term is used by John for only his special patch so
> I like Ereclaim(Easy Reclaim) rather than volatile.

If others agree, that's fine.  However, the "E" prefix is
currently used differently in common English (for example,
for e-books).  Maybe "ezreclaim"?

> > as there is always at least one tmem data structure pointing
> > to it.  I haven't followed this thread previously so my apologies
> > if it already has this, but the LRU_VOLATILE list might
> > need to support a per-page "garbage collection" callback.
> 
> Right. That's why this patch provides purgepage in address_space_operations.
> I think zcache could attach own address_space_operations to the page
> which is allocated by zbud for instance, zcache_purgepage which is called by VM
> when the page is reclaimed. So zcache don't need custom LRU policy(but still need
> linked list for managing zbuddy) and pass the decision to the VM.

The simple VM decisions are going to need a lot more intelligence
(and data?) to drive which page to reclaim.  For example, is it better
to reclaim a pageframe that contains two compressed pages of ephemeral data
or a pageframe that has one active (or inactive) file page?  Such
a policy is not "Easy". ;-)

(Also, BTW, zcache pages aren't in any address space so don't have
an address_space_operations... because it is not possible to directly
address the data in a compressed page.)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
  2012-08-07  1:26           ` Dan Magenheimer
@ 2012-08-07  1:45             ` Minchan Kim
  -1 siblings, 0 replies; 38+ messages in thread
From: Minchan Kim @ 2012-08-07  1:45 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: John Stultz, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Michel Lespinasse, linux-mm

On Mon, Aug 06, 2012 at 06:26:03PM -0700, Dan Magenheimer wrote:
> > From: Minchan Kim [mailto:minchan@kernel.org]
> > Subject: Re: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
> > 
> > On Mon, Aug 06, 2012 at 08:46:18AM -0700, Dan Magenheimer wrote:
> > > > From: Minchan Kim [mailto:minchan@kernel.org]
> > > > To: John Stultz
> > > > Subject: Re: [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM
> > >
> > > Hi Minchan --
> > >
> > > Thanks for cc'ing me on this!
> > >
> > > > Targets for the LRU list could be following as in future
> > > >
> > > > 1. volatile pages in this patchset.
> > > > 2. ephemeral pages of tmem
> > > > 3. madivse(DONTNEED)
> > > > 4. fadvise(NOREUSE)
> > > > 5. PG_reclaimed pages
> > > > 6. clean pages if we write CFLRU(clean first LRU)
> > > >
> > > > So if any guys have objection, please raise your hands
> > > > before further progress.
> > >
> > > I agree that the existing shrinker mechanism is too primitive
> > > and the kernel needs to take into account more factors in
> > > deciding how to quickly reclaim pages from a broader set
> > > of sources.  However, I think it is important to ensure
> > > that both the "demand" side and the "supply" side are
> > > studied.  There has to be some kind of prioritization policy
> > > among all the RAM consumers so that a lower-priority
> > > alloc_page doesn't cause a higher-priority "volatile" page
> > > to be consumed.  I suspect this policy will be VERY hard to
> > > define and maintain.
> > 
> > Yes. It's another story.
> > At the moment, VM doesn't consider such priority-inversion problem
> > excpet giving the more memory to privileged processes. It's so simple
> > but works well till now.
> 
> I think it is very important that both stories must be
> solved together.  See below...
> 
> > > Related, ephemeral pages in tmem are not truly volatile
> > 
> > "volatile" term is used by John for only his special patch so
> > I like Ereclaim(Easy Reclaim) rather than volatile.
> 
> If others agree, that's fine.  However, the "E" prefix is
> currently used differently in common English (for example,
> for e-books).  Maybe "ezreclaim"?

Looks better. I will use that term from now on.
Thanks!

> 
> > > as there is always at least one tmem data structure pointing
> > > to it.  I haven't followed this thread previously so my apologies
> > > if it already has this, but the LRU_VOLATILE list might
> > > need to support a per-page "garbage collection" callback.
> > 
> > Right. That's why this patch provides purgepage in address_space_operations.
> > I think zcache could attach own address_space_operations to the page
> > which is allocated by zbud for instance, zcache_purgepage which is called by VM
> > when the page is reclaimed. So zcache don't need custom LRU policy(but still need
> > linked list for managing zbuddy) and pass the decision to the VM.
> 
> The simple VM decisions are going to need a lot more intelligence
> (and data?) to drive which page to reclaim.  For example, is it better
> to reclaim a pageframe that contains two compressed pages of ephemeral data
> or a pageframe that has one active (or inactive) file page?  Such
> a policy is not "Easy". ;-)

I should have said it more clearly.
The VM just picks a page at the tail of the ezreclaim list and reclaims
it.  So rotation of the active page or of the two compressed pages
should be implemented by a smart zcache, which can do anything.
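
Something like the sketch below is what I mean (illustrative only; the
list handling, the locking and the exact purgepage signature are
assumptions here, not patch code):

/*
 * The VM takes the page at the tail of the ezreclaim (LRU_VOLATILE)
 * list and hands the actual purging decision to the owner via the
 * proposed purgepage aop.  Isolation and error handling are omitted.
 */
static void ezreclaim_shrink_one(struct list_head *ezreclaim_list)
{
        struct page *page;

        if (list_empty(ezreclaim_list))
                return;

        page = list_entry(ezreclaim_list->prev, struct page, lru);
        list_del_init(&page->lru);

        /* The owner (shmem, zcache, ...) knows what "purge" means for it. */
        if (page->mapping && page->mapping->a_ops->purgepage)
                page->mapping->a_ops->purgepage(page);
}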

> 
> (Also, BTW, zcache pages aren't in any address space so don't have
> an address_space_operations... because it is not possible to directly
> address the data in a compressed page.)

I mean we can just make a fake address_space_operations, like this:

/* Minimal aops whose only job is to give the VM a purge callback. */
static const struct address_space_operations zcache_aop = {
        .purgepage = zcache_purge_page,
};

static struct address_space zcache_address_space = {
        .a_ops = &zcache_aop,
};

        /* When zbud allocates a pageframe, tag it so reclaim can call back. */
        struct page *page = alloc_page(GFP_KERNEL);

        page->mapping = &zcache_address_space;


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5][RFC] Fallocate Volatile Ranges v6
  2012-07-28  3:57 ` John Stultz
@ 2012-08-09  9:28   ` Michel Lespinasse
  -1 siblings, 0 replies; 38+ messages in thread
From: Michel Lespinasse @ 2012-08-09  9:28 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Minchan Kim, linux-mm

Hi John,

On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <john.stultz@linaro.org> wrote:
> So after not getting too much positive feedback on my last
> attempt at trying to use a non-shrinker method for managing
> & purging volatile ranges, I decided I'd go ahead and try
> to implement something along Minchan's ERECLAIM LRU list
> idea.

Agree that there hasn't been much feedback from MM folks yet - sorry
about that :/

I think one issue might be that most people don't have a good
background on how the feature is intended to be used, and it is very
difficult to comment meaningfully without that.

As for myself, I have been wondering:

- Why the feature needs to be on a per-range basis, rather than
per-file. Is this simply to make it easier to transition the android
use case from whatever they are doing right now, or is it that the
object boundaries within a file can't be known in advance, and thus
one wouldn't know how to split objects across different files ? Or
could it be that some of the objects would be small (less than a page)
so space use would be inefficient if they were placed in different
files ? Or just that there would be too many files for efficient
management ?

- What are the desired semantics for the volatile objects. Can the
objects be accessed while they are marked as volatile, or do they have
to get unmarked first ? Is it really the case that we always want to
reclaim from volatile objects first, before any other kind of caches
we might have ? This sounds like a very strong hint, and I think I
would be more comfortable with something more subtle if that's
possible. Also, if we have several volatile objects to reclaim from,
is it desirable to reclaim from the one that's been marked volatile
the longest or does it make no difference ? When an object is marked
volatile, would it be sufficient to ensure it gets placed on the
inactive list (maybe with the referenced bit cleared) and let the
normal reclaim algorithm get to it, or is that an insufficiently
strong hint somehow ?

Basically, having some background information of how android would be
using the feature would help us better understand the design decision
here, I think.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] [RFC] Add volatile range management code
  2012-07-28  3:57   ` John Stultz
@ 2012-08-09  9:46     ` Michel Lespinasse
  -1 siblings, 0 replies; 38+ messages in thread
From: Michel Lespinasse @ 2012-08-09  9:46 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Minchan Kim, linux-mm

On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <john.stultz@linaro.org> wrote:
> v5:
> * Drop intervaltree for prio_tree usage per Michel &
>   Dmitry's suggestions.

Actually, I believe the ranges you need to track are non-overlapping, correct ?

If that is the case, a simple rbtree, sorted by start-of-range
address, would work best.
(I am trying to remove prio_tree users... :)

> +       /* First, find any existing intervals that overlap */
> +       prio_tree_iter_init(&iter, root, start, end);

Note that prio tree iterations take intervals as [start; last] not [start; end[
So if you want to stick with prio trees, you would have to use end-1 here.

> +       /* Coalesce left-adjacent ranges */
> +       prio_tree_iter_init(&iter, root, start-1, start);

Same here; you probably want to use start-1 on both ends

> +       node = prio_tree_next(&iter);
> +       while (node) {

I'm confused, I don't think you ever expect more than one range to
match, do you ???

> +       /* Coalesce right-adjacent ranges */
> +       prio_tree_iter_init(&iter, root, end, end+1);

Same again, here you probably want end on both ends
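
Putting those three together, the lookups would become something like
this (a sketch only, keeping the prio tree API and the variable names
from the patch):

        /* Find existing intervals overlapping [start, end): prio tree
         * queries are on closed intervals, so pass end - 1. */
        prio_tree_iter_init(&iter, root, start, end - 1);

        /* Coalesce a range ending exactly at 'start' (left-adjacent) */
        prio_tree_iter_init(&iter, root, start - 1, start - 1);

        /* Coalesce a range starting exactly at 'end' (right-adjacent) */
        prio_tree_iter_init(&iter, root, end, end);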

This is far from a complete code review, but I just wanted to point
out a couple of details that jumped out at me first. I am afraid I am
missing some of the background about how the feature is to be used to
really dig into the rest of the changes at this point :/

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] [RFC] Add volatile range management code
  2012-08-09  9:46     ` Michel Lespinasse
@ 2012-08-09 13:35       ` Andrea Righi
  -1 siblings, 0 replies; 38+ messages in thread
From: Andrea Righi @ 2012-08-09 13:35 UTC (permalink / raw)
  To: Michel Lespinasse, John Stultz
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Aneesh Kumar K.V,
	Mike Hommey, Jan Kara, KOSAKI Motohiro, Minchan Kim, linux-mm

On Thu, Aug 09, 2012 at 02:46:37AM -0700, Michel Lespinasse wrote:
> On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <john.stultz@linaro.org> wrote:
> > v5:
> > * Drop intervaltree for prio_tree usage per Michel &
> >   Dmitry's suggestions.
> 
> Actually, I believe the ranges you need to track are non-overlapping, correct ?
> 
> If that is the case, a simple rbtree, sorted by start-of-range
> address, would work best.
> (I am trying to remove prio_tree users... :)
> 

John,

JFYI, if you want to try a possible rbtree-based implementation, as
suggested by Michel you could try this one:
https://github.com/arighi/kinterval

This implementation supports insertion, deletion and transparent merging
of adjacent ranges, as well as splitting ranges when chunks are removed
or different chunk types are added in the middle of an existing range;
so, if I'm not wrong, you should probably be able to use this code as
is, without any modification.

If you decide to go this way and/or need help to use it in your patch
set just let me know.

-Andrea

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5][RFC] Fallocate Volatile Ranges v6
  2012-08-09  9:28   ` Michel Lespinasse
@ 2012-08-09 18:45     ` John Stultz
  -1 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-08-09 18:45 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Minchan Kim, linux-mm

On 08/09/2012 02:28 AM, Michel Lespinasse wrote:
> Hi John,
>
> On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <john.stultz@linaro.org> wrote:
>> So after not getting too much positive feedback on my last
>> attempt at trying to use a non-shrinker method for managing
>> & purging volatile ranges, I decided I'd go ahead and try
>> to implement something along Minchan's ERECLAIM LRU list
>> idea.
> Agree that there hasn't been much feedback from MM folks yet - sorry
> about that :/
>
> I think one issue might be that most people don't have a good
> background on how the feature is intended to be used, and it is very
> difficult to comment meaningfully without that.
>
> As for myself, I have been wondering:
>
> - Why the feature needs to be on a per-range basis, rather than
> per-file. Is this simply to make it easier to transition the android
> use case from whatever they are doing right now, or is it that the
> object boundaries within a file can't be known in advance, and thus
> one wouldn't know how to split objects accross different files ? Or
> could it be that some of the objects would be small (less than a page)
> so space use would be inefficient if they were placed in different
> files ? Or just that there would be too many files for efficient
> management ?
For me, keeping the feature per-range instead of something like per-file 
is in order to be able to support Android's existing use case.

As to why Android uses per-range instead of per-file, Arve or someone
from the Android team could probably answer better, but I can
theorize.  In discussions with the Android guys, they've mentioned that
ashmem's primary goal was basically to provide atomically unlinked tmpfs
fds for sharing memory, and that memory unpinning was a follow-on
feature.  So I suspect that ranges fit better into their existing model
of using an mmapped fd to share data between two processes: instead of
creating a new tmpfs file and sharing the fd for each object, they're
able to create one shared mapping for multiple objects, which can be
effectively resized with unpinning.

Another use case where ranges would be beneficial might be one where
there's a single large object that might not all be in use at one time.
For example, a very large web page where you only have a limited view
onto it: having the rest of the page rendered so you can quickly scroll
without re-rendering would be nice, but under memory pressure the system
could effectively throw out the non-visible portions.

Further, one additional reason for not having the volatile attribute be
per-file is that I have my eye on allowing volatile ranges to be set on
anonymous heap memory via something like madvise() in the future.  That
would be an easier API to use if you're not sharing data via mmapped
tmpfs files, but it wouldn't be possible if this were a file attribute
rather than a range of pages.


> - What are the desired semantics for the volatile objects. Can the
> objects be accessed while they are marked as volatile, or do they have
> to get unmarked first ?
So accessing a volatile page before marking it non-volatile can produce 
undefined behavior. You could get the data that was there, or you could 
get empty pages.  The expectation is that pages are unmarked before 
being accessed, so one can know if the data was lost or not.   I'm open 
to other suggestions here, if folks think we should SIGSEGV on accesses 
to volatile pages. However, I don't know how setting that up and tearing 
it down on each mark_volatile/unmark_volatile might affect performance.
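
For concreteness, the intended usage from userspace looks roughly like
the sketch below (FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE are defined by
this patch set rather than by mainline headers, so the numeric values
here are only placeholders):

#define _GNU_SOURCE
#include <fcntl.h>

#ifndef FALLOC_FL_MARK_VOLATILE
#define FALLOC_FL_MARK_VOLATILE    0x00000010  /* placeholder value */
#define FALLOC_FL_UNMARK_VOLATILE  0x00000020  /* placeholder value */
#endif

/* Done with a cached object: the kernel may purge it under pressure. */
int object_unpin(int tmpfs_fd, off_t off, off_t len)
{
        return fallocate(tmpfs_fd, FALLOC_FL_MARK_VOLATILE, off, len);
}

/*
 * About to use the object again: unmark first, and use the result to
 * learn whether the contents were purged and must be regenerated
 * before touching the pages.
 */
int object_pin(int tmpfs_fd, off_t off, off_t len)
{
        return fallocate(tmpfs_fd, FALLOC_FL_UNMARK_VOLATILE, off, len);
}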

> Is it really the case that we always want to
> reclaim from volatile objects first, before any other kind of caches
> we might have ? This sounds like a very strong hint, and I think I
> would be more comfortable with something more subtle if that's
> possible.
So the current Android ashmem implementation uses a shrinker, which
isn't necessarily called before any other caches are freed.  So I don't
think it's a strong hint; it just seems somewhat intuitive to me that we
should free effectively "user-donated" pages before freeing other system
caches.  But that's not something the interface necessarily defines or
requires.


> Also, if we have several volatile objects to reclaim from,
> is it desirable to reclaim from the one that's been marked volatile
> the longest or does it make no difference ?
While I don't think it's strictly necessary, I do think LRU-order purging
is important from the least-surprise angle.  Since ranges marked
volatile should not be touched until they are marked non-volatile, it
follows normal expectations that recently touched data is more likely to
survive than data that has not been accessed for some time.  Reasonable
exceptions would be situations like NUMA systems, where pressure on one
node forces purging volatile pages in a non-global-LRU order.  So it's
probably not critical, but I think it's useful to try to preserve.


> When an object is marked
> volatile, would it be sufficient to ensure it gets placed on the
> inactive list (maybe with the referenced bit cleared) and let the
> normal reclaim algorithm get to it, or is that an insufficiently
> strong hint somehow ?

The problem is that on machines without swap, there wouldn't be any 
reclaim for inactive anonymous pages.

An earlier iteration of this implementation (v4, I think?) used writepage
to decide whether to purge or write out a page, and made it so that in
the non-swap case the anonymous inactive list was in effect the
"volatile" list.  However, folks didn't seem to like this approach.


> Basically, having some background information of how android would be
> using the feature would help us better understand the design decision
> here, I think.
Hopefully the details above help, and I'll try to get some more concrete 
examples from the Android code base.

thanks
-john


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] [RFC] Add volatile range management code
  2012-08-09  9:46     ` Michel Lespinasse
@ 2012-08-09 19:11       ` John Stultz
  -1 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-08-09 19:11 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: LKML, Andrew Morton, Android Kernel Team, Robert Love,
	Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Andrea Righi,
	Aneesh Kumar K.V, Mike Hommey, Jan Kara, KOSAKI Motohiro,
	Minchan Kim, linux-mm

On 08/09/2012 02:46 AM, Michel Lespinasse wrote:
> On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <john.stultz@linaro.org> wrote:
>> v5:
>> * Drop intervaltree for prio_tree usage per Michel &
>>    Dmitry's suggestions.
> Actually, I believe the ranges you need to track are non-overlapping, correct ?
Correct.  Any overlapping range is coalesced.

> If that is the case, a simple rbtree, sorted by start-of-range
> address, would work best.
> (I am trying to remove prio_tree users... :)

Sigh.  Sure.  I've blown with the wind on a number of different
approaches for storing the ranges; I'm not particularly passionate about
it, but the continual conflicting suggestions are a slight frustration.  :)


>> +       /* First, find any existing intervals that overlap */
>> +       prio_tree_iter_init(&iter, root, start, end);
> Note that prio tree iterations take intervals as [start; last] not [start; end[
> So if you want to stick with prio trees, you would have to use end-1 here.
Thanks!  I think I hit this off-by-one issue in my testing, but fixed it
on the back end with:

     modify_range(&inode->i_data, start, end-1, &mark_nonvolatile_page);

Clearly fixing it at the start instead of papering over it is better.


>> +       node = prio_tree_next(&iter);
>> +       while (node) {
> I'm confused, I don't think you ever expect more than one range to
> match, do you ???

So yeah.  If you already have two ranges (0-5), (10-15) and then add the
range (0-20), we need to coalesce the two existing ranges into the new
one.
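
A rough picture of that case, with purely illustrative helper names
(this is not the patch code):

/*
 *   existing:  [0-5]        [10-15]
 *   new mark:  [0---------------------20]
 *   result:    [0---------------------20]
 */
new_start = start;                              /* 0  */
new_end = end;                                  /* 20 */
while ((vr = find_overlapping_range(root, new_start, new_end))) {
        new_start = min(new_start, vr->start);  /* stays 0  */
        new_end = max(new_end, vr->end);        /* stays 20 */
        remove_range(root, vr);                 /* drops (0-5), then (10-15) */
}
insert_range(root, new_start, new_end);         /* one (0-20) range remains */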


> This is far from a complete code review, but I just wanted to point
> out a couple details that jumped to me first. I am afraid I am missing
> some of the background about how the feature is to be used to really
> dig into the rest of the changes at this point :/

Well, I really appreciate any feedback here.

thanks
-john


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] [RFC] Add volatile range management code
  2012-08-09 13:35       ` Andrea Righi
@ 2012-08-09 19:33         ` John Stultz
  -1 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2012-08-09 19:33 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Michel Lespinasse, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Aneesh Kumar K.V,
	Mike Hommey, Jan Kara, KOSAKI Motohiro, Minchan Kim, linux-mm

On 08/09/2012 06:35 AM, Andrea Righi wrote:
> On Thu, Aug 09, 2012 at 02:46:37AM -0700, Michel Lespinasse wrote:
>> On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <john.stultz@linaro.org> wrote:
>>> v5:
>>> * Drop intervaltree for prio_tree usage per Michel &
>>>    Dmitry's suggestions.
>> Actually, I believe the ranges you need to track are non-overlapping, correct ?
>>
>> If that is the case, a simple rbtree, sorted by start-of-range
>> address, would work best.
>> (I am trying to remove prio_tree users... :)
>>
> John,
>
> JFYI, if you want to try a possible rbtree-based implementation, as
> suggested by Michel you could try this one:
> https://github.com/arighi/kinterval
>
> This implementation supports insertion, deletion and transparent merging
> of adjacent ranges, as well as splitting ranges when chunks removed or
> different chunk types are added in the middle of an existing range; so
> if I'm not wrong probably you should be able to use this code as is,
> without any modification.
I do appreciate the suggestion, and considered this earlier when you 
posted this before.

Unfortunately the transparent merging/splitting/etc is actually not
useful for me, since I manage other data per-range. In the earlier
generic rangetree/intervaltree implementations I tried, I limited the
interface to basically add(), remove(), search(), and search_next(),
since when we coalesce intervals we need to free the data in the
structure referencing the interval being deleted (and similarly create
new structures to reference the new intervals created when we remove an
interval). So the coalescing/splitting logic can't be pushed into the
interval management code cleanly.

So while I might be able to make use of your kinterval in a fairly 
simple manner (only using add/del/lookup), I'm not sure it wins anything 
over just using an rbtree.  Especially since I'd have to do my own 
coalesce/splitting logic anyway, it would actually be more expensive as 
on add() it would still scan to check for overlapping ranges to merge.

I ended up dropping my generic intervaltree implementation because folks
objected that it was so trivial (basically just wrapping an rbtree) and
didn't handle some of the more complex intervaltree use cases (i.e.
allowing for overlapping intervals). The prio_tree seemed to match the
interface I was using fairly closely, but apparently it's on its way out
as well, so unless anyone further objects, I think I'll just fall back
to a simple rbtree implementation.
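
Roughly, that fallback would look something like the sketch below
(illustrative names only, not the actual patch code): non-overlapping
ranges keyed by start offset, with all coalescing/splitting left to the
caller so the per-range data can be managed there.

struct volatile_range {
        struct rb_node node;
        pgoff_t start;
        pgoff_t end;            /* inclusive */
        bool purged;
};

/* search(): return the range containing 'index', or NULL. */
static struct volatile_range *vrange_search(struct rb_root *root,
                                            pgoff_t index)
{
        struct rb_node *n = root->rb_node;

        while (n) {
                struct volatile_range *vr =
                        rb_entry(n, struct volatile_range, node);

                if (index < vr->start)
                        n = n->rb_left;
                else if (index > vr->end)
                        n = n->rb_right;
                else
                        return vr;
        }
        return NULL;
}

/* add(), remove() and search_next() would be equally thin wrappers. */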

thanks
-john


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] [RFC] Add volatile range management code
  2012-08-09 19:33         ` John Stultz
@ 2012-08-09 19:39           ` Andrea Righi
  -1 siblings, 0 replies; 38+ messages in thread
From: Andrea Righi @ 2012-08-09 19:39 UTC (permalink / raw)
  To: John Stultz
  Cc: Michel Lespinasse, LKML, Andrew Morton, Android Kernel Team,
	Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
	Dmitry Adamushko, Dave Chinner, Neil Brown, Aneesh Kumar K.V,
	Mike Hommey, Jan Kara, KOSAKI Motohiro, Minchan Kim, linux-mm

On Thu, Aug 09, 2012 at 12:33:17PM -0700, John Stultz wrote:
> On 08/09/2012 06:35 AM, Andrea Righi wrote:
> >On Thu, Aug 09, 2012 at 02:46:37AM -0700, Michel Lespinasse wrote:
> >>On Fri, Jul 27, 2012 at 8:57 PM, John Stultz <john.stultz@linaro.org> wrote:
> >>>v5:
> >>>* Drop intervaltree for prio_tree usage per Michel &
> >>>   Dmitry's suggestions.
> >>Actually, I believe the ranges you need to track are non-overlapping, correct ?
> >>
> >>If that is the case, a simple rbtree, sorted by start-of-range
> >>address, would work best.
> >>(I am trying to remove prio_tree users... :)
> >>
> >John,
> >
> >JFYI, if you want to try a possible rbtree-based implementation, as
> >suggested by Michel you could try this one:
> >https://github.com/arighi/kinterval
> >
> >This implementation supports insertion, deletion and transparent merging
> >of adjacent ranges, as well as splitting ranges when chunks removed or
> >different chunk types are added in the middle of an existing range; so
> >if I'm not wrong probably you should be able to use this code as is,
> >without any modification.
> I do appreciate the suggestion, and considered this earlier when you
> posted this before.
> 
> Unfotunately the transparent merging/splitting/etc is actually not
> useful for me, since I manage other data per-range. The earlier
> generic rangetree/intervaltree implementations I tried limiting the
> interface to basically add(), remove(), search(), and search_next(),
> since when we coalesce intervals, we need to free the data in the
> structure referencing the interval being deleted (and similarly
> create new structures to reference new intervals created when we
> remove an interval). So the coalescing/splitting logic can't be
> pushed into the interval management code cleanly.
> 
> So while I might be able to make use of your kinterval in a fairly
> simple manner (only using add/del/lookup), I'm not sure it wins
> anything over just using an rbtree.  Especially since I'd have to do
> my own coalesce/splitting logic anyway, it would actually be more
> expensive as on add() it would still scan to check for overlapping
> ranges to merge.
> 
> I ended up dropping my generic intervaltree implementation because
> folks objected that it was so trivial (basically just wrapping an
> rbtree) and didn't handle some of the more complex intervaltree use
> cases (ie: allowing for overlapping intervals). The priotree seemed
> to match fairly closely the interface I was using, but apparently
> its on its way out as well, so unless anyone further objects, I
> think I'll just fall back to a simple rbtree implementation.

OK, everything makes sense now, thanks for the clarifications, and sorry
for suggesting yet another range/interval tree implementation. :)

I'll look at your patch set more in details and try to test/review it
closely.

-Andrea

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2012-08-09 19:39 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-28  3:57 [PATCH 0/5][RFC] Fallocate Volatile Ranges v6 John Stultz
2012-07-28  3:57 ` [PATCH 1/5] [RFC] Add volatile range management code John Stultz
2012-08-09  9:46   ` Michel Lespinasse
2012-08-09 13:35     ` Andrea Righi
2012-08-09 19:33       ` John Stultz
2012-08-09 19:39         ` Andrea Righi
2012-08-09 19:11     ` John Stultz
2012-07-28  3:57 ` [PATCH 2/5] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers John Stultz
2012-07-28  3:57 ` [PATCH 3/5] [RFC] ashmem: Convert ashmem to use volatile ranges John Stultz
2012-07-28  3:57 ` [PATCH 4/5] [RFC][HACK] Add LRU_VOLATILE support to the VM John Stultz
2012-08-06  3:04   ` Minchan Kim
2012-08-06 15:46     ` Dan Magenheimer
2012-08-07  0:56       ` Minchan Kim
2012-08-07  1:26         ` Dan Magenheimer
2012-08-07  1:45           ` Minchan Kim
2012-08-06 20:38     ` John Stultz
2012-07-28  3:57 ` [PATCH 5/5] [RFC][HACK] Switch volatile/shmem over to LRU_VOLATILE John Stultz
2012-08-09  9:28 ` [PATCH 0/5][RFC] Fallocate Volatile Ranges v6 Michel Lespinasse
2012-08-09 18:45   ` John Stultz
