* [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
@ 2011-11-22  3:33 John Stultz
  2011-11-22  9:37 ` Rik van Riel
  2011-11-22 20:52 ` Andrew Morton
  0 siblings, 2 replies; 14+ messages in thread
From: John Stultz @ 2011-11-22  3:33 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Robert Love, Christoph Hellwig, Andrew Morton,
	Hugh Dickins, Mel Gorman, Dave Hansen, Rik van Riel, Eric Anholt,
	Jesse Barnes

This patch provides new fadvise flags that can be used to mark
file pages as volatile, which allows them to be discarded if the
kernel needs to reclaim memory.

This is useful for userspace to allocate things like caches, and lets
the kernel destructively (but safely) reclaim them when there's memory
pressure.

Right now, we can simply throw away pages if they are clean (backed
by a current on-disk copy).  That only happens for anonymous/tmpfs/shmfs
pages when they're swapped out.  This patch lets userspace select
dirty pages which can be simply thrown away instead of writing them
to disk first.  See mm/shmem.c for this bit of code.  It's
different from FADV_DONTNEED since the pages are not immediately
discarded; they are only discarded under pressure.

This is very much influenced by the Android Ashmem interface by
Robert Love, so credit to him and the Android developers.
In many cases the code and logic come directly from the ashmem patch.
The intent of this patch is to allow for ashmem-like behavior, but
to embed the idea a little deeper into the VM code, instead of
isolating it in a specific driver.

I'm very much a newbie at the VM code, so at this point I just want
to get some input on the patch.  If you have another idea for using
something other than fadvise, or other thoughts on how the volatile
ranges are stored, I'd be really interested in hearing them.  So let
me know if you have any comments or feedback!

Also many thanks to Dave Hansen who helped design and develop the
initial version of this patch, and has provided continued review and
mentoring for me in the VM code.

CC: Robert Love <rlove@google.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Hugh Dickins <hughd@google.com>
CC: Mel Gorman <mel@csn.ul.ie>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Eric Anholt <eric@anholt.net>
CC: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 fs/inode.c               |    2 +
 include/linux/fadvise.h  |    6 +
 include/linux/fs.h       |    2 +
 include/linux/volatile.h |   34 ++++++
 mm/Makefile              |    2 +-
 mm/fadvise.c             |   21 ++++-
 mm/shmem.c               |   14 +++
 mm/volatile.c            |  253 ++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 332 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/volatile.h
 create mode 100644 mm/volatile.c

diff --git a/fs/inode.c b/fs/inode.c
index ee4e66b..78a7581 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -247,6 +247,7 @@ void __destroy_inode(struct inode *inode)
 	if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
 		posix_acl_release(inode->i_default_acl);
 #endif
+	mapping_clear_volatile_ranges(&inode->i_data);
 	this_cpu_dec(nr_inodes);
 }
 EXPORT_SYMBOL(__destroy_inode);
@@ -278,6 +279,7 @@ void address_space_init_once(struct address_space *mapping)
 	spin_lock_init(&mapping->private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
 	INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
+	INIT_LIST_HEAD(&mapping->volatile_list);
 }
 EXPORT_SYMBOL(address_space_init_once);
 
diff --git a/include/linux/fadvise.h b/include/linux/fadvise.h
index e8e7471..988fb00 100644
--- a/include/linux/fadvise.h
+++ b/include/linux/fadvise.h
@@ -18,4 +18,10 @@
 #define POSIX_FADV_NOREUSE	5 /* Data will be accessed once.  */
 #endif
 
+#define POSIX_FADV_VOLATILE	8  /* _can_ toss, but don't toss now */
+#define POSIX_FADV_NONVOLATILE	9  /* Remove VOLATILE flag */
+#define POSIX_FADV_ISVOLATILE	10 /* Returns volatile flag for region */
+
+
+
 #endif	/* FADVISE_H_INCLUDED */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e313022..4f15ade 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -10,6 +10,7 @@
 #include <linux/ioctl.h>
 #include <linux/blk_types.h>
 #include <linux/types.h>
+#include <linux/volatile.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -650,6 +651,7 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+	struct list_head	volatile_list;	/* volatile range list */
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/include/linux/volatile.h b/include/linux/volatile.h
new file mode 100644
index 0000000..11e8a3e
--- /dev/null
+++ b/include/linux/volatile.h
@@ -0,0 +1,34 @@
+#ifndef _LINUX_VOLATILE_H
+#define _LINUX_VOLATILE_H
+
+struct address_space;
+
+
+struct volatile_range {
+	/*
+	 * List is sorted, and no two ranges
+	 * on the same list should overlap.
+	 */
+	struct list_head unpinned;
+	pgoff_t start_page;
+	pgoff_t end_page;
+	unsigned int purged;
+};
+
+static inline bool page_in_range(struct volatile_range *range,
+					pgoff_t page_index)
+{
+	return (range->start_page <= page_index) &&
+					(range->end_page >= page_index);
+}
+
+extern long mapping_range_volatile(struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index);
+extern long mapping_range_nonvolatile(struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index);
+extern long mapping_range_isvolatile(struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index);
+extern void mapping_clear_volatile_ranges(struct address_space *mapping);
+
+
+#endif /* _LINUX_VOLATILE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..7b6c7a8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -13,7 +13,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   page_isolation.o mm_init.o mmu_context.o percpu.o \
-			   $(mmu-y)
+			   volatile.o $(mmu-y)
 obj-y += init-mm.o
 
 ifdef CONFIG_NO_BOOTMEM
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 8d723c9..e4530c9 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -106,7 +106,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 		nrpages = end_index - start_index + 1;
 		if (!nrpages)
 			nrpages = ~0UL;
-		
+
 		ret = force_page_cache_readahead(mapping, file,
 				start_index,
 				nrpages);
@@ -127,6 +127,25 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 			invalidate_mapping_pages(mapping, start_index,
 						end_index);
 		break;
+	case POSIX_FADV_VOLATILE:
+		/* First and last PARTIAL page! */
+		start_index = offset >> PAGE_CACHE_SHIFT;
+		end_index = endbyte >> PAGE_CACHE_SHIFT;
+		ret = mapping_range_volatile(mapping, start_index, end_index);
+		break;
+	case POSIX_FADV_NONVOLATILE:
+		/* First and last PARTIAL page! */
+		start_index = offset >> PAGE_CACHE_SHIFT;
+		end_index = endbyte >> PAGE_CACHE_SHIFT;
+		ret = mapping_range_nonvolatile(mapping, start_index,
+								end_index);
+		break;
+	case POSIX_FADV_ISVOLATILE:
+		/* First and last PARTIAL page! */
+		start_index = offset >> PAGE_CACHE_SHIFT;
+		end_index = endbyte >> PAGE_CACHE_SHIFT;
+		ret = mapping_range_isvolatile(mapping, start_index, end_index);
+		break;
 	default:
 		ret = -EINVAL;
 	}
diff --git a/mm/shmem.c b/mm/shmem.c
index d672250..765cef2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -679,6 +679,20 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	index = page->index;
 	inode = mapping->host;
 	info = SHMEM_I(inode);
+
+	/* Check if page is in volatile range */
+	if (!list_empty(&mapping->volatile_list)) {
+		struct volatile_range *range, *next;
+		list_for_each_entry_safe(range, next, &mapping->volatile_list,
+					unpinned) {
+			if (page_in_range(range, index)) {
+				range->purged = 1;
+				unlock_page(page);
+				return 0;
+			}
+		}
+	}
+
 	if (info->flags & VM_LOCKED)
 		goto redirty;
 	if (!total_swap_pages)
diff --git a/mm/volatile.c b/mm/volatile.c
new file mode 100644
index 0000000..c6a9c00
--- /dev/null
+++ b/mm/volatile.c
@@ -0,0 +1,253 @@
+/* mm/volatile.c
+ *
+ * Volatile page range management.
+ *	Copyright 2011 Linaro
+ *
+ * Based on mm/ashmem.c 
+ * 	by Robert Love <rlove@google.com>
+ * 	Copyright (C) 2008 Google, Inc.
+ *
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/volatile.h>
+
+/* range helpers */
+static inline bool range_before_page(struct volatile_range *range,
+					pgoff_t page_index)
+{
+	return range->end_page < page_index;
+}
+
+static inline bool page_range_subsumes_range(struct volatile_range *range,
+					pgoff_t start_index, pgoff_t end_index)
+{
+
+	return (range->start_page >= start_index)
+			&& (range->end_page <= end_index);
+}
+
+static inline bool page_range_subsumed_by_range(
+					struct volatile_range *range,
+					pgoff_t start_index, pgoff_t end_index)
+{
+	return (range->start_page <= start_index)
+			&& (range->end_page >= end_index);
+}
+
+static inline bool page_range_in_range(struct volatile_range *range,
+					pgoff_t start_index, pgoff_t end_index)
+{
+	return page_in_range(range, start_index) ||
+		page_in_range(range, end_index) ||
+		page_range_subsumes_range(range, start_index, end_index);
+}
+
+
+
+/*
+ * Allocates a volatile_range, and adds it to the address_space's
+ * volatile list
+ */
+static int volatile_range_alloc(struct volatile_range *prev_range,
+				unsigned int purged,
+				pgoff_t start_index, pgoff_t end_index)
+{
+	struct volatile_range *range;
+
+	range = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
+	if (!range)
+		return -ENOMEM;
+
+	range->start_page = start_index;
+	range->end_page = end_index;
+	range->purged = purged;
+
+	list_add_tail(&range->unpinned, &prev_range->unpinned);
+
+	return 0;
+}
+
+/*
+ * Deletes a volatile_range, removing it from the address_space's
+ * unpinned list
+ */
+static void volatile_range_del(struct volatile_range *range)
+{
+	list_del(&range->unpinned);
+	kfree(range);
+}
+
+/*
+ * Resizes a volatile_range
+ */
+static inline void volatile_range_shrink(struct volatile_range *range,
+				pgoff_t start_index, pgoff_t end_index)
+{
+	range->start_page = start_index;
+	range->end_page = end_index;
+}
+
+
+/*
+ * Mark a region as volatile, allowing dirty pages to be purged
+ * under memory pressure
+ */
+long mapping_range_volatile(struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index)
+{
+	struct volatile_range *range, *next;
+	unsigned int purged = 0;
+	int ret;
+
+	mutex_lock(&mapping->i_mmap_mutex);
+restart:
+	/* Iterate through the sorted range list */
+	list_for_each_entry_safe(range, next, &mapping->volatile_list,
+					unpinned) {
+		/*
+		 * If the current existing range is before the start
+		 * of the new range, then we're done, since the list is
+		 * sorted
+		 */
+		if (range_before_page(range, start_index))
+			break;
+		/*
+		 * If the new range is already covered by the existing
+		 * range, then there is nothing we need to do.
+		 */
+		if (page_range_subsumed_by_range(range, start_index,
+								end_index)) {
+			ret = 0;
+			goto out;
+		}
+		/*
+		 * Coalesce if the new range overlaps the existing range,
+		 * by growing the new range to cover the existing range,
+		 * deleting the existing range, and start over.
+		 * Starting over is necessary to make sure we also coalesce
+		 * any other ranges we overlap with.
+		 */
+		if (page_range_in_range(range, start_index, end_index)) {
+			start_index = min_t(size_t, range->start_page,
+						start_index);
+			end_index = max_t(size_t, range->end_page, end_index);
+			purged |= range->purged;
+			volatile_range_del(range);
+			goto restart;
+		}
+
+	}
+	/* Allocate the new range and add it to the list */
+	ret = volatile_range_alloc(range, purged, start_index, end_index);
+
+out:
+	mutex_unlock(&mapping->i_mmap_mutex);
+	return ret;
+}
+
+/*
+ * Mark a region as nonvolatile, returns 1 if any pages in the region
+ * were purged.
+ */
+long mapping_range_nonvolatile(struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index)
+{
+	struct volatile_range *range, *next;
+	int ret  = 0;
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	list_for_each_entry_safe(range, next, &mapping->volatile_list,
+				unpinned) {
+		if (range_before_page(range, start_index))
+			break;
+
+		if (page_range_in_range(range, start_index, end_index)) {
+			ret |= range->purged;
+			/* Case #1: Easy. Just nuke the whole thing. */
+			if (page_range_subsumes_range(range, start_index,
+								end_index)) {
+				volatile_range_del(range);
+				continue;
+			}
+
+			/* Case #2: We overlap from the start, so adjust it */
+			if (range->start_page >= start_index) {
+				volatile_range_shrink(range, end_index + 1,
+							range->end_page);
+				continue;
+			}
+
+			/* Case #3: We overlap from the rear, so adjust it */
+			if (range->end_page <= end_index) {
+				volatile_range_shrink(range, range->start_page,
+							start_index - 1);
+				continue;
+			}
+
+			/*
+			 * Case #4: We eat a chunk out of the middle. A bit
+			 * more complicated, we allocate a new range for the
+			 * second half and adjust the first chunk's endpoint.
+			 */
+			volatile_range_alloc(range, range->purged,
+						end_index + 1, range->end_page);
+			volatile_range_shrink(range, range->start_page,
+						start_index - 1);
+		}
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+
+	return ret;
+}
+
+/*
+ * Returns whether a region has been marked volatile.
+ * Does not report whether the region has been purged.
+ */
+long mapping_range_isvolatile(struct address_space *mapping,
+				pgoff_t start_index, pgoff_t end_index)
+{
+	struct volatile_range *range;
+	long ret = 0;
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	list_for_each_entry(range, &mapping->volatile_list, unpinned) {
+		if (range_before_page(range, start_index))
+			break;
+		if (page_range_in_range(range, start_index, end_index)) {
+			ret = 1;
+			break;
+		}
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+	return ret;
+}
+
+
+/*
+ * Cleans up any volatile ranges.
+ */
+void mapping_clear_volatile_ranges(struct address_space *mapping)
+{
+	struct volatile_range *range, *next;
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	list_for_each_entry_safe(range, next, &mapping->volatile_list, unpinned)
+		volatile_range_del(range);
+	mutex_unlock(&mapping->i_mmap_mutex);
+
+}
-- 
1.7.3.2.146.gca209


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22  3:33 [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
@ 2011-11-22  9:37 ` Rik van Riel
  2011-11-22 10:45   ` Rik van Riel
                     ` (4 more replies)
  2011-11-22 20:52 ` Andrew Morton
  1 sibling, 5 replies; 14+ messages in thread
From: Rik van Riel @ 2011-11-22  9:37 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Robert Love, Christoph Hellwig, Andrew Morton,
	Hugh Dickins, Mel Gorman, Dave Hansen, Eric Anholt, Jesse Barnes,
	Johannes Weiner, Jon Masters

On 11/21/2011 10:33 PM, John Stultz wrote:
> This patch provides new fadvise flags that can be used to mark
> file pages as volatile, which will allow it to be discarded if the
> kernel wants to reclaim memory.
>
> This is useful for userspace to allocate things like caches, and lets
> the kernel destructively (but safely) reclaim them when there's memory
> pressure.
>
> Right now, we can simply throw away pages if they are clean (backed
> by a current on-disk copy).  That only happens for anonymous/tmpfs/shmfs
> pages when they're swapped out.  This patch lets userspace select
> dirty pages which can be simply thrown away instead of writing them
> to disk first.  See the mm/shmem.c for this bit of code.  It's
> different from FADV_DONTNEED since the pages are not immediately
> discarded; they are only discarded under pressure.

I've got a few questions:

1) How do you tell userspace some of its data got
    discarded?

2) How do you prevent the situation where every
    volatile object gets a few pages discarded, making
    them all unusable?
    (better to throw away an entire object at once)

3) Isn't it too slow for something like Firefox to
    create a new tmpfs object for every single throw-away
    cache object?

Johannes, Jon and I have looked at an alternative way to
allow the kernel and userspace to cooperate in throwing
out cached data.  This alternative way does not touch
the alloc/free fast path at all, but does require some
cooperation at "shrink cache" time.

The idea is quite simple:

1) Every program that we are interested in already has
    some kind of main loop where it polls on file descriptors.
    It is easy for such programs to add an additional file,
    which would be a device or sysfs file that wakes up the
    program from its poll/select loop when memory is getting
    full to the point that userspace needs to shrink its
    caches.

    The kernel can be smart here and wake up just one process
    at a time, targeting specific NUMA nodes or cgroups. Such
    kernel smarts do not require additional userspace changes.

2) When userspace gets such a "please shrink your caches"
    event, it can do various things.  A program like firefox
    could throw away several cached objects, eg. uncompressed
    images or entire pre-rendered tabs, while a JVM can shrink
    its heap size and a database could shrink its internal
    cache.

3) After doing that, they could all call the same glibc
    function that walks across program-internal free memory
    and calls MADV_FREE on all free regions that span
    multiple pages, which gives the pages back to the kernel,
    without needing to move VMA boundaries.  This is relatively
    light weight and allows for the nuking of pages right in
    the middle of a heap VMA.

4) In some GUI libraries, like gtk/glib, we could open the
    memory pressure device node (or sysfs file) by default,
    hooking it up to the glibc function from (3) by default,
    which would give all gtk/glib programs the ability to
    give free()d memory back to the kernel on request, without
    needing to even modify the program.

    Program modification would only be needed in order to
    free cached objects, etc.  The modification of programs
    running under those libraries would consist of overriding
    the "shrink caches" hook with their own function, which
    first does program-specific stuff and then calls the
    default hook to take care of the glibc side.

We considered the same approach you are proposing as well, but
we did not come up with satisfactory answers to the questions I
asked above, which is why we came up with this scheme.

Unfortunately we have not gotten around to implementing it yet,
but I'd be happy to work on it with you guys if you are
interested.


* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22  9:37 ` Rik van Riel
@ 2011-11-22 10:45   ` Rik van Riel
  2011-11-22 20:39     ` Dave Hansen
  2011-11-22 16:31   ` Robert Love
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2011-11-22 10:45 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Robert Love, Christoph Hellwig, Andrew Morton,
	Hugh Dickins, Mel Gorman, Dave Hansen, Eric Anholt, Jesse Barnes,
	Johannes Weiner, Jon Masters

On 11/22/2011 04:37 AM, Rik van Riel wrote:
> On 11/21/2011 10:33 PM, John Stultz wrote:
>> This patch provides new fadvise flags that can be used to mark
>> file pages as volatile, which will allow it to be discarded if the
>> kernel wants to reclaim memory.
>>
>> This is useful for userspace to allocate things like caches, and lets
>> the kernel destructively (but safely) reclaim them when there's memory
>> pressure.
>>
>> Right now, we can simply throw away pages if they are clean (backed
>> by a current on-disk copy). That only happens for anonymous/tmpfs/shmfs
>> pages when they're swapped out. This patch lets userspace select
>> dirty pages which can be simply thrown away instead of writing them
>> to disk first. See the mm/shmem.c for this bit of code. It's
>> different from FADV_DONTNEED since the pages are not immediately
>> discarded; they are only discarded under pressure.
>
> I've got a few questions:
>
> 1) How do you tell userspace some of its data got
> discarded?
>
> 2) How do you prevent the situation where every
> volatile object gets a few pages discarded, making
> them all unusable?
> (better to throw away an entire object at once)
>
> 3) Isn't it too slow for something like Firefox to
> create a new tmpfs object for every single throw-away
> cache object?

Oh, and a fourth issue with the _VOLATILE approach, which
I forgot to write down before:

4) Virtualization.  Marking an object (and its pages)
    _VOLATILE inside a guest will not be visible on the
    host side, which means a virtual system may continue
    to suffer the performance penalty anyway.

On the other hand, the approach I outlined will simply result
in a virtual machine being asked to reduce its memory, and
possibly later on passing that notification on to the programs
running inside. In other words, the "please shrink your caches"
notification naturally recurses into cgroups and virtual machines.


* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22  9:37 ` Rik van Riel
  2011-11-22 10:45   ` Rik van Riel
@ 2011-11-22 16:31   ` Robert Love
  2011-11-22 19:48   ` John Stultz
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Robert Love @ 2011-11-22 16:31 UTC (permalink / raw)
  To: LKML

First, thanks to John for driving this. I'd love to see something like
ashmem's cache functionality in Linux.

Comments inline:

On Tue, Nov 22, 2011 at 4:37 AM, Rik van Riel <riel@redhat.com> wrote:
> 1) How do you tell userspace some of its data got
>   discarded?

This is my big question, too. This must be addressed to make this
usable by user-space. Unfortunately there isn't a side channel
available in madvise or fadvise.

The ashmem use case is like this:

1. Create an ashmem mapping. All pages start off pinned.
2. Populate pages
3. Use pages.
4. Unpin some or all pages.
5. Re-pin pages. Return value tells you if they were jettisoned. If
so, goto #2. If not, goto #3.

> 2) How do you prevent the situation where every
>   volatile object gets a few pages discarded, making
>   them all unusable?
>   (better to throw away an entire object at once)

In ashmem, I prevent this by jettisoning pages in units of unpinned
chunks, ordered by least-recently-unpinned. Since pinned/unpinned
chunks tend to relate to user-space objects (e.g. you'd unpin a whole
image), you end up throwing away the entire user-space object.

Put another way, the eviction algorithm is least-recently-unpinned
chunk, not least-recently-unpinned page, which prevents thrashing lots
of objects. An alternative would be to toss not whole chunks, but
whole mappings.

> 3) Isn't it too slow for something like Firefox to
>   create a new tmpfs object for every single throw-away
>   cache object?

Nope, although user-space might want to create larger mappings and
sub-divide them if its objects are *really* small.

> Johannes, Jon and I have looked at an alternative way to
> allow the kernel and userspace to cooperate in throwing
> out cached data.  This alternative way does not touch
> the alloc/free fast path at all, but does require some
> cooperation at "shrink cache" time.

I'm open to whatever works, and I need to think about your proposal
more, but it sounds inferior to the model John is proposing.

I think we can sum the two models up as,

- User-space pins, unpins objects, and kernel handles eviction
- Kernel lets user-space register callbacks for notification during
cache shrink, user handles eviction

Right?

Actually, I can see the need for both. But the first is simpler, lets
the kernel evict pages from user-space even in low memory situations
(since you can call the cache shrinker atomically), and relies on the
fact that the "free caches" action is just jettisoning the pages, so
the kernel can do it. The actual work is in repopulating the caches.

That last point, btw, is why user-space developers like ashmem: It
fits their models of how they do work. They pin/unpin objects as part
of their normal access and repopulating is tied into the re-pin path,
which is where it makes sense.

In a callback model, you have two issues: One, user-space needs to do
its own jettisoning. Two, user-space now needs to keep track of
whether an object was jettisoned and repopulate it (when? on next use?
now you have user-space reimplementing the pin/unpin logic, right?).

         Robert


* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22  9:37 ` Rik van Riel
  2011-11-22 10:45   ` Rik van Riel
  2011-11-22 16:31   ` Robert Love
@ 2011-11-22 19:48   ` John Stultz
  2011-11-23  0:27     ` Rik van Riel
       [not found]   ` <CAG6tG3xTkW1J=6xmUmmJoswJyR6ii5RDXvAsYrcH0CkVuUmJrQ@mail.gmail.com>
  2011-11-26  0:05   ` Jan Kara
  4 siblings, 1 reply; 14+ messages in thread
From: John Stultz @ 2011-11-22 19:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, Robert Love, Christoph Hellwig, Andrew Morton,
	Hugh Dickins, Mel Gorman, Dave Hansen, Eric Anholt, Jesse Barnes,
	Johannes Weiner, Jon Masters

On Tue, 2011-11-22 at 04:37 -0500, Rik van Riel wrote:
> On 11/21/2011 10:33 PM, John Stultz wrote:
> > This patch provides new fadvise flags that can be used to mark
> > file pages as volatile, which will allow it to be discarded if the
> > kernel wants to reclaim memory.
> >
> > This is useful for userspace to allocate things like caches, and lets
> > the kernel destructively (but safely) reclaim them when there's memory
> > pressure.
> >
> > Right now, we can simply throw away pages if they are clean (backed
> > by a current on-disk copy).  That only happens for anonymous/tmpfs/shmfs
> > pages when they're swapped out.  This patch lets userspace select
> > dirty pages which can be simply thrown away instead of writing them
> > to disk first.  See the mm/shmem.c for this bit of code.  It's
> > different from FADV_DONTNEED since the pages are not immediately
> > discarded; they are only discarded under pressure.
> 
> I've got a few questions:
> 
> 1) How do you tell userspace some of its data got
>     discarded?

You get a return code when marking the page non-volatile if it has been
discarded. This follows the ashmem style that Robert described in the
other mail.

> 2) How do you prevent the situation where every
>     volatile object gets a few pages discarded, making
>     them all unusable?
>     (better to throw away an entire object at once)

Indeed. One of the issues folks brought up about the ashmem code was
that it manages its own lru.  This attempt simplifies the code by
using the kernel's own lru, but does have the drawback that it is
page-based instead of object- or range-based.

We could try to zap the entire range when a page from the range is
written out, or we could go back to using a range-based lru, like
ashmem does.


> 3) Isn't it too slow for something like Firefox to
>     create a new tmpfs object for every single throw-away
>     cache object?

So, if you mean creating a new file for every cache object, that doesn't
seem necessary, as you could map a number of objects into the same file
and mark the ranges as volatile or not as needed. 

Or are you worried about the allocation of the range structure when we
mark a region as volatile?

Either way, I'd defer to Robert on real-world usage. 


> Johannes, Jon and I have looked at an alternative way to
> allow the kernel and userspace to cooperate in throwing
> out cached data.  This alternative way does not touch
> the alloc/free fast path at all, but does require some
> cooperation at "shrink cache" time.
> 
> The idea is quite simple:
> 
> 1) Every program that we are interested in already has
>     some kind of main loop where it polls on file descriptors.
>     It is easy for such programs to add an additional file,
>     which would be a device or sysfs file that wakes up the
>     program from its poll/select loop when memory is getting
>     full to the point that userspace needs to shrink its
>     caches.
> 
>     The kernel can be smart here and wake up just one process
>     at a time, targeting specific NUMA nodes or cgroups. Such
>     kernel smarts do not require additional userspace changes.
> 
> 2) When userspace gets such a "please shrink your caches"
>     event, it can do various things.  A program like firefox
>     could throw away several cached objects, eg. uncompressed
>     images or entire pre-rendered tabs, while a JVM can shrink
>     its heap size and a database could shrink its internal
>     cache.

So similarly to Robert, I don't see this approach as necessarily
exclusive to the volatile flags. There are just some tradeoffs with the
different approaches.

The upside with your approach is that applications don't have to
remember to re-pin the cache before using it and unpin it after
they're done using it.

The downside is that the "please shrink your caches" message is likely
to arrive when the system is already low on resources. Once applications
have been asked to "be nice and get small!", having to wait for that
action to occur might not be great. Whereas with the volatile regions,
there are just additional easily-freeable pages available when the
kernel needs them.

thanks
-john




* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22 10:45   ` Rik van Riel
@ 2011-11-22 20:39     ` Dave Hansen
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2011-11-22 20:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: John Stultz, LKML, Robert Love, Christoph Hellwig, Andrew Morton,
	Hugh Dickins, Mel Gorman, Eric Anholt, Jesse Barnes,
	Johannes Weiner, Jon Masters

On 11/22/2011 02:45 AM, Rik van Riel wrote:
> 4) Virtualization.  Marking an object (and its pages)
>     _VOLATILE inside a guest will not be visible on the
>     host side, which means a virtual system may continue
>     to suffer the performance penalty anyway.

Yeah, I guess we still have to communicate it _somehow_.

I guess we could theoretically pass the calls up to the
hypervisor and it could even make its own VOLATILE calls
to the host kernel.  We'd also have to pass back down the
"was this evicted" information during a re-pin. That seems
messy to me.

Is it really a different problem from the page cache?
The guest has data sitting in RAM that it probably doesn't
need.  If we passed just the amount of unpinned data up to
the hypervisor, it would have a decent idea of how much
it could balloon the guest, for instance.  That would fit
in well with some of the existing schemes that folks have
and be *much* nicer than what they've got at the moment.



* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22  3:33 [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
  2011-11-22  9:37 ` Rik van Riel
@ 2011-11-22 20:52 ` Andrew Morton
  2011-11-22 21:32   ` John Stultz
  1 sibling, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2011-11-22 20:52 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Robert Love, Christoph Hellwig, Hugh Dickins, Mel Gorman,
	Dave Hansen, Rik van Riel, Eric Anholt, Jesse Barnes


On Mon, 21 Nov 2011 19:33:08 -0800
John Stultz <john.stultz@linaro.org> wrote:

> This patch provides new fadvise flags that can be used to mark
> file pages as volatile, which will allow them to be discarded if the
> kernel wants to reclaim memory.
> 
> This is useful for userspace to allocate things like caches, and lets
> the kernel destructively (but safely) reclaim them when there's memory
> pressure.
> 
> Right now, we can simply throw away pages if they are clean (backed
> by a current on-disk copy).  That only happens for anonymous/tmpfs/shmfs
> pages when they're swapped out.  This patch lets userspace select
> dirty pages which can be simply thrown away instead of writing them
> to disk first.  See mm/shmem.c for this bit of code.  It's
> different from FADV_DONTNEED since the pages are not immediately
> discarded; they are only discarded under pressure.
> 
> This is very much influenced by the Android Ashmem interface by
> Robert Love so credits to him and the Android developers.
> In many cases the code & logic come directly from the ashmem patch.
> The intent of this patch is to allow for ashmem-like behavior, but
> embeds the idea a little deeper into the VM code, instead of isolating
> it into a specific driver.
> 
> I'm very much a newbie at the VM code, so at this point I just want
> to try to get some input on the patch. If you have another idea
> for using something other than fadvise, or other thoughts on how the
> volatile ranges are stored, I'd be really interested in hearing them.
> So let me know if you have any comments or feedback!
> 
> Also many thanks to Dave Hansen who helped design and develop the
> initial version of this patch, and has provided continued review and
> mentoring for me in the VM code.

I'm interestedly watching the design/use-case discussion.  Meanwhile,
some comments on the implementation.

>
> ...
>
> +#define POSIX_FADV_VOLATILE	8  /* _can_ toss, but don't toss now */
> +#define POSIX_FADV_NONVOLATILE	9  /* Remove VOLATILE flag */
> +#define POSIX_FADV_ISVOLATILE	10 /* Returns volatile flag for region */

linux-man@vger.kernel.org will want to be told all about these at some
stage.

> +
> +
>  #endif	/* FADVISE_H_INCLUDED */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e313022..4f15ade 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -10,6 +10,7 @@
>  #include <linux/ioctl.h>
>  #include <linux/blk_types.h>
>  #include <linux/types.h>
> +#include <linux/volatile.h>

volatile is a C keyword.  This is a bit confusing/misleading.

>  /*
>   * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> @@ -650,6 +651,7 @@ struct address_space {
>  	spinlock_t		private_lock;	/* for use by the address_space */
>  	struct list_head	private_list;	/* ditto */
>  	struct address_space	*assoc_mapping;	/* ditto */
> +	struct list_head	volatile_list;	/* volatile range list */

Comment should tell us what lock protects this.

It appears to be i_mmap_lock, which is weird.

>  } __attribute__((aligned(sizeof(long))));
>  	/*
>  	 * On most architectures that alignment is already the case; but
> diff --git a/include/linux/volatile.h b/include/linux/volatile.h
> new file mode 100644
> index 0000000..11e8a3e
> --- /dev/null
> +++ b/include/linux/volatile.h
> @@ -0,0 +1,34 @@
> +#ifndef _LINUX_VOLATILE_H
> +#define _LINUX_VOLATILE_H
> +
> +struct address_space;
> +
> +
> +struct volatile_range {
> +	/*
> +	 * List is sorted, and no two ranges

sorted by pgoff_t, it appears.

> +	 * on the same list should overlap.
> +	 */
> +	struct list_head unpinned;

What's this do?  It appears to be the list anchored in
address_space.volatile_list and protected by i_mmap_lock to maintain
all these things.  "unpinned" is a strange name for such a thing.

> +	pgoff_t start_page;
> +	pgoff_t end_page;
> +	unsigned int purged;

Some description here would be good.  What's it for, when is it set and
cleared.

> +};
> +
> +static inline bool page_in_range(struct volatile_range *range,
> +					pgoff_t page_index)
> +{
> +	return (range->start_page <= page_index) &&
> +					(range->end_page >= page_index);
> +}

"page_in_range" is too vague a name for this...

> +extern long mapping_range_volatile(struct address_space *mapping,
> +				pgoff_t start_index, pgoff_t end_index);
> +extern long mapping_range_nonvolatile(struct address_space *mapping,
> +				pgoff_t start_index, pgoff_t end_index);
> +extern long mapping_range_isvolatile(struct address_space *mapping,
> +				pgoff_t start_index, pgoff_t end_index);
> +extern void mapping_clear_volatile_ranges(struct address_space *mapping);
> +
> +
> +#endif /* _LINUX_VOLATILE_H */
>
> ...
>
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -679,6 +679,20 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  	index = page->index;
>  	inode = mapping->host;
>  	info = SHMEM_I(inode);
> +
> +	/* Check if page is in volatile range */
> +	if (!list_empty(&mapping->volatile_list)) {
> +		struct volatile_range *range, *next;
> +		list_for_each_entry_safe(range, next, &mapping->volatile_list,
> +					unpinned) {
> +			if (page_in_range(range, index)) {
> +				range->purged = 1;
> +				unlock_page(page);
> +				return 0;
> +			}
> +		}
> +	}

That's very optimistic code :(

We've handed users a way in which to consume arbitrarily vast amounts
of kernel CPU time.  Putting a cond_resched() in there won't fix this.

Also, the volatile_range's are kmalloced so we have also given the user
a way of quickly consuming unlimited amounts of kernel memory.  Not
good.

>
> ...
>
> +/*
> + * Allocates a volatile_range, and adds it to the address_space's
> + * volatile list
> + */
> +static int volatile_range_alloc(struct volatile_range *prev_range,
> +                               unsigned int purged,
> +			       pgoff_t start_index, pgoff_t end_index)
> +{
> +       struct volatile_range *range;
> +
> +       range = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
> +       if (!range)
> +               return -ENOMEM;
> +
> +       range->start_page = start_index;
> +       range->end_page = end_index;
> +       range->purged = purged;
> +
> +       list_add_tail(&range->unpinned, &prev_range->unpinned);
> +
> +       return 0;
> +}

Back in, ahem, 1999 I wrote a piece of tree code which I think does
this.  Each node represents a span and the nodes are kept sorted in the
tree and it does merging and splitting of nodes as ranges are added. 
Removal and lookup don't appear to be implemented yet, but they're easy
enough.  Pretty complex but everything is O(log(n)) and I think it
could be made to work here.  

<rummage rummage>

OK, see attached.



Also, do you know what this code is like?  Posix and BSD file locking. 
It has the same requirement to maintain state about arbitrary spans of
a single file.

The locking code is an utter pig and has well known (?) O(N^2) and even
O(N^3) failure modes.  But don't let that deter you ;) We should
strenuously avoid duplicating anything so similar.


[-- Attachment #2: mumbletree.tar.gz --]
[-- Type: application/octet-stream, Size: 5137 bytes --]


* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22 20:52 ` Andrew Morton
@ 2011-11-22 21:32   ` John Stultz
  2011-11-22 21:39     ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: John Stultz @ 2011-11-22 21:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Robert Love, Christoph Hellwig, Hugh Dickins, Mel Gorman,
	Dave Hansen, Rik van Riel, Eric Anholt, Jesse Barnes

On Tue, 2011-11-22 at 12:52 -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 19:33:08 -0800
> John Stultz <john.stultz@linaro.org> wrote:
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e313022..4f15ade 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -10,6 +10,7 @@
> >  #include <linux/ioctl.h>
> >  #include <linux/blk_types.h>
> >  #include <linux/types.h>
> > +#include <linux/volatile.h>
> 
> volatile is a C keyword.  This is a bit confusing/misleading.

Yea. This struck me as well, but I didn't have a better name, so I went
ahead and sent it out. I'd be happy with any suggestions, or ideas for
where such code should live.

> >  /*
> >   * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> > @@ -650,6 +651,7 @@ struct address_space {
> >  	spinlock_t		private_lock;	/* for use by the address_space */
> >  	struct list_head	private_list;	/* ditto */
> >  	struct address_space	*assoc_mapping;	/* ditto */
> > +	struct list_head	volatile_list;	/* volatile range list */
> 
> Comment should tell us what lock protects this.
> 
> It appears to be i_mmap_lock, which is weird.

Yea. I need a mutex connected to the address_space, so i_mmap_lock
looked possibly useful for this. It was really a last minute stab. 

Any suggestions? I can introduce my own, but Dave was already grumbling
about the list_head living on the address_space structure.


> >  } __attribute__((aligned(sizeof(long))));
> >  	/*
> >  	 * On most architectures that alignment is already the case; but
> > diff --git a/include/linux/volatile.h b/include/linux/volatile.h
> > new file mode 100644
> > index 0000000..11e8a3e
> > --- /dev/null
> > +++ b/include/linux/volatile.h
> > @@ -0,0 +1,34 @@
> > +#ifndef _LINUX_VOLATILE_H
> > +#define _LINUX_VOLATILE_H
> > +
> > +struct address_space;
> > +
> > +
> > +struct volatile_range {
> > +	/*
> > +	 * List is sorted, and no two ranges
> 
> sorted by pgoff_t, it appears.
> 
> > +	 * on the same list should overlap.
> > +	 */
> > +	struct list_head unpinned;
> 
> What's this do?  It appears to be the list anchored in
> address_space.volatile_list and protected by i_mmap_lock to maintain
> all these things.  "unpinned" is a strange name for such a thing.

Unpinned comes from the ashmem patch. Assuming folks think this basic
approach is worth continuing, I'll try to come up with a more sane name
for the next version.

> > +	pgoff_t start_page;
> > +	pgoff_t end_page;
> > +	unsigned int purged;
> 
> Some description here would be good.  What's it for, when is it set and
> cleared.
> 
> > +};
> > +
> > +static inline bool page_in_range(struct volatile_range *range,
> > +					pgoff_t page_index)
> > +{
> > +	return (range->start_page <= page_index) &&
> > +					(range->end_page >= page_index);
> > +}
> 
> "page_in_range" is too vague a name for this...

Agreed.

> > +extern long mapping_range_volatile(struct address_space *mapping,
> > +				pgoff_t start_index, pgoff_t end_index);
> > +extern long mapping_range_nonvolatile(struct address_space *mapping,
> > +				pgoff_t start_index, pgoff_t end_index);
> > +extern long mapping_range_isvolatile(struct address_space *mapping,
> > +				pgoff_t start_index, pgoff_t end_index);
> > +extern void mapping_clear_volatile_ranges(struct address_space *mapping);
> > +
> > +
> > +#endif /* _LINUX_VOLATILE_H */
> >
> > ...
> >
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -679,6 +679,20 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> >  	index = page->index;
> >  	inode = mapping->host;
> >  	info = SHMEM_I(inode);
> > +
> > +	/* Check if page is in volatile range */
> > +	if (!list_empty(&mapping->volatile_list)) {
> > +		struct volatile_range *range, *next;
> > +		list_for_each_entry_safe(range, next, &mapping->volatile_list,
> > +					unpinned) {
> > +			if (page_in_range(range, index)) {
> > +				range->purged = 1;
> > +				unlock_page(page);
> > +				return 0;
> > +			}
> > +		}
> > +	}
> 
> That's very optimistic code :(
>
> We've handed users a way in which to consume arbitrarily vast amounts
> of kernel CPU time.  Putting a cond_resched() in there won't fix this.
> 
> Also, the volatile_range's are kmalloced so we have also given the user
> a way of quickly consuming unlimited amounts of kernel memory.  Not
> good.

Indeed. Besides the missing locking, this is a very good critique and
probably a very good argument for not pushing such functionality too
deeply into the VM.  Maybe the shrinker strategy used in the ashmem
patch is the better approach, as it's a less critical path?


> > +/*
> > + * Allocates a volatile_range, and adds it to the address_space's
> > + * volatile list
> > + */
> > +static int volatile_range_alloc(struct volatile_range *prev_range,
> > +                               unsigned int purged,
> > +			       pgoff_t start_index, pgoff_t end_index)
> > +{
> > +       struct volatile_range *range;
> > +
> > +       range = kzalloc(sizeof(struct volatile_range), GFP_KERNEL);
> > +       if (!range)
> > +               return -ENOMEM;
> > +
> > +       range->start_page = start_index;
> > +       range->end_page = end_index;
> > +       range->purged = purged;
> > +
> > +       list_add_tail(&range->unpinned, &prev_range->unpinned);
> > +
> > +       return 0;
> > +}
> 
> Back in, ahem, 1999 I wrote a piece of tree code which I think does
> this.  Each node represents a span and the nodes are kept sorted in the
> tree and it does merging and splitting of nodes as ranges are added. 
> Removal and lookup don't appear to be implemented yet, but they're easy
> enough.  Pretty complex but everything is O(log(n)) and I think it
> could be made to work here.  
> 
> <rummage rummage>
> 
> OK, see attached.

Very cool!  Yea. Dave brought up that the list was fairly inefficient,
but I figured I'd get more feedback on the mechanism before spending too
much time optimizing the structure. I'll look at your mumbletree code
and see how it can be adapted. Other than apparently being a nod to the
"original neo-grunge/indie-rock band from Rochester", is there a story
to the name? :)

Again, I appreciate the feedback! Thanks so much!
-john



* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22 21:32   ` John Stultz
@ 2011-11-22 21:39     ` Andrew Morton
  2011-11-22 22:58       ` John Stultz
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2011-11-22 21:39 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Robert Love, Christoph Hellwig, Hugh Dickins, Mel Gorman,
	Dave Hansen, Rik van Riel, Eric Anholt, Jesse Barnes

On Tue, 22 Nov 2011 13:32:18 -0800
John Stultz <john.stultz@linaro.org> wrote:

> Very cool!  Yea. Dave brought up that the list was fairly inefficient,
> but I figured I'd get more feedback on the mechanism before spending too
> much time optimizing the structure. I'll look at your mumbletree code
> and see how it can be adapted. Other than apparently being a nod to the
> "original neo-grunge/indie-rock band from Rochester", is there a story
> to the name? :)

I was first!

I was sure there was some correct term for such a tree in the
literature, but I didn't know what it was and the compiler didn't like
"*_tree".

> Again, I appreciate the feedback! Thanks so much!

I note you sneakily deleted the bit about posix/bsd file locking.


* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22 21:39     ` Andrew Morton
@ 2011-11-22 22:58       ` John Stultz
  0 siblings, 0 replies; 14+ messages in thread
From: John Stultz @ 2011-11-22 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Robert Love, Christoph Hellwig, Hugh Dickins, Mel Gorman,
	Dave Hansen, Rik van Riel, Eric Anholt, Jesse Barnes

On Tue, 2011-11-22 at 13:39 -0800, Andrew Morton wrote:
> On Tue, 22 Nov 2011 13:32:18 -0800
> John Stultz <john.stultz@linaro.org> wrote:
> 
> > Very cool!  Yea. Dave brought up that the list was fairly inefficient,
> > but I figured I'd get more feedback on the mechanism before spending too
> > much time optimizing the structure. I'll look at your mumbletree code
> > and see how it can be adapted. Other than apparently being a nod to the
> > "original neo-grunge/indie-rock band from Rochester", is there a story
> > to the name? :)
> 
> I was first!

Sorry! Didn't realize it was the other way around. Hopefully reading the
code will inform my listening. :)

> I was sure there was some correct term for such a tree in the
> literature, but I didn't know what it was and the compiler didn't like
> "*_tree".
> 
> > Again, I appreciate the feedback! Thanks so much!
> 
> I note you sneakily deleted the bit about posix/bsd file locking.

Oh, no sneakiness intended. I agree about the parallels, and didn't
really have anything to add. Hopefully the needs here are a little bit
simpler, since we can coalesce regions. Maybe that will let us avoid the
worst of it?

thanks
-john




* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22 19:48   ` John Stultz
@ 2011-11-23  0:27     ` Rik van Riel
  0 siblings, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2011-11-23  0:27 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Robert Love, Christoph Hellwig, Andrew Morton,
	Hugh Dickins, Mel Gorman, Dave Hansen, Eric Anholt, Jesse Barnes,
	Johannes Weiner, Jon Masters

On 11/22/2011 02:48 PM, John Stultz wrote:
> On Tue, 2011-11-22 at 04:37 -0500, Rik van Riel wrote:
>> On 11/21/2011 10:33 PM, John Stultz wrote:

> So similarly to Robert, I don't see this approach as necessarily
> exclusive to the volatile flags. There are just some tradeoffs with the
> different approaches.

Agreed, they might be complementary.

> The upside with your approach is that applications don't have to
> remember to re-pin the cache before using it and unpin it after its done
> using it.

If using these volatile regions is going to become
common, programs will be pinning and unpinning
those cache regions all the time, even when there
is no memory pressure at all.

At that point, I wonder whether userspace programmers won't
end up automatically choosing the method that does not
impact their fast path at all, and only gets invoked
rarely.

> The downside is that the "please shrink your caches" message is likely
> to arrive when the system is low on resources. Once applications have
> been asked to "be nice and get small!", having to wait for that action
> to occur might not be great.

The pageout code in vmscan.c can send these messages
before we actually get around to evicting any anonymous
page from memory.

I believe the code we have there today can get these
signals sent before we get in any serious trouble.

> Where as with the volatile regions, there
> are just additionally easily freeable pages available when the kernel
> needs them.

That is certainly true.  However, it is unclear how
that would translate to virtualized solutions, since
there is no way for a virtual machine to mark pages
as volatile with the host (especially since there is
no way to tell the host what pages belong together
in objects).

I'm not objecting to your idea - in fact I like it.

However, I believe it would be good to answer some of
these questions before adding another interface to the
kernel that needs to be maintained forever.


* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
       [not found]   ` <CAG6tG3xTkW1J=6xmUmmJoswJyR6ii5RDXvAsYrcH0CkVuUmJrQ@mail.gmail.com>
@ 2011-11-23  0:39     ` Rik van Riel
  2011-11-23 15:52       ` Robert Love
  0 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2011-11-23  0:39 UTC (permalink / raw)
  To: Robert Love
  Cc: John Stultz, LKML, Christoph Hellwig, Andrew Morton,
	Hugh Dickins, Mel Gorman, Dave Hansen, Eric Anholt, Jesse Barnes,
	Johannes Weiner, Jon Masters

On 11/22/2011 10:40 AM, Robert Love wrote:

>     3) Isn't it too slow for something like Firefox to
>        create a new tmpfs object for every single throw-away
>        cache object?
>
>
> Nope, although user-space might want to create larger mappings and
> sub-divide them if its objects are *really* small.

At first glance, it sounds like a great idea to have a program
like Firefox create a new object for every tab.

However, this quickly runs into the practical problem that
Firefox does not know in advance how much memory each tab will
require.

Without a good way to size each mapping in advance, it may not
be practical to group related bits of memory together.

Never mind that it would require userspace programs to grow new
memory allocators...

>     Johannes, Jon and I have looked at an alternative way to
>     allow the kernel and userspace to cooperate in throwing
>     out cached data.  This alternative way does not touch
>     the alloc/free fast path at all, but does require some
>     cooperation at "shrink cache" time.
>
>
> I'm open to whatever works, and I need to think about your proposal
> more, but it sounds inferior to the model John is proposing. I think we
> can sum the two models up as,
>
> - User-space pins, unpins objects, and kernel handles eviction
> - Kernel lets user-space register callbacks for notification during
> cache shrink, user handles eviction
>
> Right?

Yes, this characterization is correct.

Note how in the object I proposed, when there is no memory
pressure, userspace can hold on to freed pages and quickly
reuse them for something else.

There is no need for pages to be handed back to the kernel,
and then zeroed again when they get handed back to userspace
later on.

There is also no need to write a new memory allocator for
programs that want to pack related objects together in one
_VOLATILE mapping.

The reason is that, on memory pressure, any arbitrary set of
pages can be nuked from userspace, so any memory allocator will
be compatible.

> Actually, I can see the need for both. But the first is simpler, lets
> the kernel evict pages from user-space even in low memory situations
> (since you can call the cache shrinker atomically), and relies on the
> fact that the "free caches" action is just jettisoning the pages, so the
> kernel can do it. The actual work is in repopulating the caches.
>
> That last point, btw, is why user-space developers like ashmem: It fits
> their models of how they do work. They pin/unpin objects as part of
> their normal access and repopulating is tied into the re-pin path, which
> is where it makes sense.

Which userspace programs work that way already?



* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-23  0:39     ` Rik van Riel
@ 2011-11-23 15:52       ` Robert Love
  0 siblings, 0 replies; 14+ messages in thread
From: Robert Love @ 2011-11-23 15:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: John Stultz, LKML, Christoph Hellwig, Andrew Morton,
	Hugh Dickins, Mel Gorman, Dave Hansen, Eric Anholt, Jesse Barnes,
	Johannes Weiner, Jon Masters

Hi, Rik.

On Tue, Nov 22, 2011 at 7:39 PM, Rik van Riel <riel@redhat.com> wrote:
> At first glance, it sounds like a great idea to have a program
> like Firefox create a new object for every tab.
>
> However, this quickly runs into the practical problem that
> Firefox does not know in advance how much memory each tab will
> require.
>
> Without a good way to size each mapping in advance, it may not
> be practical to group related bits of memory together.

On Android (and Firefox for Android for that matter) we manage just
about every resource (such as images, rendered web pages, etc) through
ashmem. I don't recall the details of how the various subsystems that
utilize ashmem decide whether to use a mapping-per-object versus
packing multiple objects into a single mapping, but it works.

> Never mind that it would require userspace programs to grow new
> memory allocators...

Agreed.

> Yes, this characterization is correct.
>
> Note how in the object I proposed, when there is no memory
> pressure, userspace can hold on to freed pages and quickly
> reuse them for something else.
>
> There is no need for pages to be handed back to the kernel,
> and then zeroed again when they get handed back to userspace
> later on.

Agreed, this is a nice benefit. However, couldn't user-space do the
same thing now with MADV_DONTNEED?

I do think we have two different features here, and we may want both.

I'm concerned your proposal isn't ideal for Android or similar
systems. It is heavy, and will involve too much process involvement.
Remember, these systems are always running at memory capacity and are
constantly LRU-freeing small amounts of pages. Having a way to say to
the kernel, "you can toss these pages if you must, just let me know
later if I try to reuse them" -- in a way that is very simple and very
cheap -- is really valuable.

> There is also no need to write a new memory allocator for
> programs that want to pack related objects together in one
> _VOLATILE mapping.
>
> The reason is that, on memory pressure, any arbitrary set of
> pages can be nuked from userspace, so any memory allocator will
> be compatible.

In ashmem, you can pin and unpin any arbitrary set of pages. You don't
have to pin the whole mapping, allowing user-space to pack arbitrary
objects into a single mapping and have the kernel jettison only the
unused objects.

>> That last point, btw, is why user-space developers like ashmem: It fits
>> their models of how they do work. They pin/unpin objects as part of
>> their normal access and repopulating is tied into the re-pin path, which
>> is where it makes sense.
>
> Which userspace programs work that way already?

What I mean is, user-space implicitly (if not explicitly) has get and
put methods via things like reference counting, locking & unlocking,
or the RAII pattern. And user-space already has a concept of "populate
this object." So adding ashmem pin and unpin calls is often very
simple, because the ashmem use case maps very nicely on top of how
user-space applications are already managing resources.

      Robert


* Re: [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags
  2011-11-22  9:37 ` Rik van Riel
                     ` (3 preceding siblings ...)
       [not found]   ` <CAG6tG3xTkW1J=6xmUmmJoswJyR6ii5RDXvAsYrcH0CkVuUmJrQ@mail.gmail.com>
@ 2011-11-26  0:05   ` Jan Kara
  4 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2011-11-26  0:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: John Stultz, LKML, Robert Love, Christoph Hellwig, Andrew Morton,
	Hugh Dickins, Mel Gorman, Dave Hansen, Eric Anholt, Jesse Barnes,
	Johannes Weiner, Jon Masters

On Tue 22-11-11 04:37:36, Rik van Riel wrote:
> On 11/21/2011 10:33 PM, John Stultz wrote:
> The idea is quite simple:
> 
> 1) Every program that we are interested in already has
>    some kind of main loop where it polls on file descriptors.
>    It is easy for such programs to add an additional file,
>    which would be a device or sysfs file that wakes up the
>    program from its poll/select loop when memory is getting
>    full to the point that userspace needs to shrink its
>    caches.
> 
>    The kernel can be smart here and wake up just one process
>    at a time, targeting specific NUMA nodes or cgroups. Such
>    kernel smarts do not require additional userspace changes.
> 
> 2) When userspace gets such a "please shrink your caches"
>    event, it can do various things.  A program like firefox
>    could throw away several cached objects, eg. uncompressed
>    images or entire pre-rendered tabs, while a JVM can shrink
>    its heap size and a database could shrink its internal
>    cache.
  Hmm, I wonder here: How much should a program free? A single object? Or
one meg of memory? I find this decision rather problematic. How much should
be reclaimed depends on the number of applications listening, how aggressive
they are, and the current memory pressure => it's basically unpredictable
from userspace, so you are almost guaranteed to either reclaim too much or
too little. The advantage of the VOLATILE approach is that the kernel (the
only place with all the necessary information) controls how much memory is
evicted... Just one argument for the VOLATILE approach that I didn't see
mentioned in the discussion yet.

								Honza


Thread overview: 14+ messages
2011-11-22  3:33 [PATCH] [RFC] fadvise: Add _VOLATILE,_ISVOLATILE, and _NONVOLATILE flags John Stultz
2011-11-22  9:37 ` Rik van Riel
2011-11-22 10:45   ` Rik van Riel
2011-11-22 20:39     ` Dave Hansen
2011-11-22 16:31   ` Robert Love
2011-11-22 19:48   ` John Stultz
2011-11-23  0:27     ` Rik van Riel
     [not found]   ` <CAG6tG3xTkW1J=6xmUmmJoswJyR6ii5RDXvAsYrcH0CkVuUmJrQ@mail.gmail.com>
2011-11-23  0:39     ` Rik van Riel
2011-11-23 15:52       ` Robert Love
2011-11-26  0:05   ` Jan Kara
2011-11-22 20:52 ` Andrew Morton
2011-11-22 21:32   ` John Stultz
2011-11-22 21:39     ` Andrew Morton
2011-11-22 22:58       ` John Stultz
