* [patch 0/8] mm: thrash detection-based file cache sizing v5
From: Johannes Weiner @ 2013-10-10 21:46 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

Hi everyone,

here is an update to the cache sizing patches for 3.13.

	Changes in this revision

o Drop frequency synchronization between refaulted and demoted pages
  and just straight up activate refaulting pages whose access
  frequency indicates they could stay in memory.  This was suggested
  by Rik van Riel a looong time ago, but a misinterpretation of test
  results during the early stages of development took me a while to
  overcome.  It's still the same overall concept, but a little simpler
  and with even faster cache adaptation.  Yay!

o More extensively document underlying design concepts like the
  meaning of the refault distance, based on input from Andrew and
  Tejun.

o Document the new page cache lookup API which can return shadow
  entries, based on input from Andrew

o Document and simplify the synchronization between inode teardown and
  reclaim planting shadow entries, based on input from Andrew

o Drop 'time' from names of variables that are not in typical kernel
  time units like jiffies or seconds, based on input from Andrew

	Summary of problem & solution

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past.  We call the
recently used list "inactive list" and the frequently used list
"active list".
    
Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages.  This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching.

This happens on file servers and media streaming servers, where the
popular files and file sections change over time.  Even though the
individual files might be smaller than half of memory, concurrent
access to many of them may still result in their inter-reference
distance being greater than half of memory.  It's also been reported
as a problem on database workloads that switch back and forth between
tables that are bigger than half of memory.  In these cases the VM
never recognizes the new working set and will, for the remainder of
the workload, thrash disk data that could easily live in memory.
    
Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once
streams, but ultimately it was not significantly better than a FIFO
policy and still thrashed the cache based on eviction speed rather
than actual demand for cache.
    
This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to detect frequently
used pages regardless of inactive list size and thus facilitate
working set transitions larger than the inactive list.
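
To make that a bit more concrete, below is a stand-alone toy model of
the refault distance idea.  The structure, helper names and the exact
activation threshold are illustrative only; the real implementation
lives in mm/workingset.c in this series.

	#include <stdbool.h>

	/* Toy model: one counter per zone, bumped on every inactive eviction. */
	struct zone_model {
		unsigned long inactive_age;	/* evictions seen so far */
		unsigned long nr_active_file;	/* pages on the active list */
	};

	/* Eviction: leave the current eviction count behind as a shadow entry. */
	static unsigned long remember_eviction(struct zone_model *zone)
	{
		return ++zone->inactive_age;
	}

	/* Refault: the distance is the number of evictions since this page left. */
	static bool should_activate(struct zone_model *zone, unsigned long shadow)
	{
		unsigned long refault_distance = zone->inactive_age - shadow;

		/*
		 * The page could have stayed resident had the inactive
		 * list been bigger by at most the size of the active
		 * list, so challenge the active pages by activating it.
		 */
		return refault_distance <= zone->nr_active_file;
	}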

	Tests
	
The reported database workload is easily demonstrated on a 16G machine
with two 12G filesets.  This fio workload operates on one set first,
then switches to the other.  The VM should obviously always cache the
set that the workload is currently using.

unpatched:
db1: READ: io=196608MB, aggrb=740405KB/s, minb=740405KB/s, maxb=740405KB/s, mint= 271914msec, maxt= 271914msec
db2: READ: io=196608MB, aggrb= 68558KB/s, minb= 68558KB/s, maxb= 68558KB/s, mint=2936584msec, maxt=2936584msec

real    53m29.192s
user    1m11.544s
sys     3m34.820s

patched:
db1: READ: io=196608MB, aggrb=743750KB/s, minb=743750KB/s, maxb=743750KB/s, mint=270691msec, maxt=270691msec
db2: READ: io=196608MB, aggrb=383503KB/s, minb=383503KB/s, maxb=383503KB/s, mint=524967msec, maxt=524967msec

real    13m16.432s
user    0m56.518s
sys     2m38.284s

As can be seen, the unpatched kernel simply never adapts to the
working set change and db2 is stuck at secondary storage speed.  The
patched kernel, on the other hand, needs 2-3 iterations over db2
before it replaces db1 and reaches full memory speed.  Given the
unbounded negative effect of the existing VM behavior, these patches
should be considered correctness fixes rather than performance
optimizations.

Another test resembles a fileserver or streaming server workload,
where data in excess of memory size is accessed at different
frequencies.  First, there is very hot data accessed at a high
frequency.  Machines should be fitted so that the hot set of such a
workload can be fully cached or all bets are off.  Then there is a
very big (compared to available memory) set of data that is used-once
or at a very low frequency; this is what drives the inactive list and
does not really benefit from caching.  Lastly, there is a big set of
warm data in between that is accessed at medium frequencies and would
benefit from having as many pages as possible cached between the
first and last streamer of each burst.

unpatched:
cold: READ: io= 30720MB, aggrb= 24808KB/s, minb= 24808KB/s, maxb= 24808KB/s, mint=1267987msec, maxt=1267987msec
 hot: READ: io=174080MB, aggrb=158225KB/s, minb=158225KB/s, maxb=158225KB/s, mint=1126605msec, maxt=1126605msec
warm: READ: io=102400MB, aggrb=112403KB/s, minb= 22480KB/s, maxb= 33010KB/s, mint= 635291msec, maxt= 932866msec

real    21m15.924s
user    17m36.036s
sys     3m5.117s

patched:
cold: READ: io= 30720MB, aggrb= 27451KB/s, minb= 27451KB/s, maxb= 27451KB/s, mint=1145932msec, maxt=1145932msec
 hot: READ: io=174080MB, aggrb=158617KB/s, minb=158617KB/s, maxb=158617KB/s, mint=1123822msec, maxt=1123822msec
warm: READ: io=102400MB, aggrb=131964KB/s, minb= 26392KB/s, maxb= 40164KB/s, mint= 522141msec, maxt= 794592msec

real    19m22.671s
user    19m33.838s
sys     2m39.652s

In both kernels, the hot set is propagated to the active list and then
served from cache for the duration of the workload.

In both kernels, the beginning of the warm set is propagated to the
active list as well, but in the unpatched case the active list
eventually takes up half of memory and no longer leaves enough room
for new warm pages to get activated, so they start thrashing while a
significant part of the active list goes stale.  The patched kernel,
on the other hand, constantly challenges the active pages based on
refaults and so manages to keep a cache window rolling through the
warm data set.  This frees up IO bandwidth for the cold set as well.

For reference, this same test was performed with the traditional
demotion mechanism, where deactivation is coupled to inactive list
reclaim.  However, this had the same outcome as the unpatched kernel:
while the warm set does indeed get activated continuously, it is
forced out of the active list by inactive list pressure, which is
dictated primarily by the unrelated cold set.  The warm set is evicted
before subsequent streamers can benefit from it, even though there
would be enough space available to cache the pages of interest.

	Costs

These patches increase struct inode by three words to manage shadow
entries in the page cache radix tree.  However, given that a typical
inode (like the ext4 inode) is already 1k in size, this is not much.
It's a 2% size increase for a reclaimable object.  fs_mark metadata
tests with millions of inodes did not show a measurable difference.
And as soon as there is any file data involved, the page cache pages
dominate the memory cost anyway.

A much harder cost to estimate is the size of the page cache radix
tree.  Page reclaim used to shrink the radix trees but now the tree
nodes are reused for shadow entries, and so the cost depends heavily
on the page cache access patterns.  However, with workloads that
maintain spatial or temporal locality, the shadow entries are either
refaulted quickly or reclaimed along with the inode object itself.
Workloads that will experience a memory cost increase are those that
don't really benefit from caching in the first place.

A more predictable alternative would be a fixed-cost separate pool of
shadow entries, but this would incur a relatively higher memory cost
for well-behaved workloads to the benefit of corner cases.  It would
also make the shadow entry lookup more costly compared to storing
them directly in the cache structure.

The biggest impact on the existing VM fastpaths is an extra branch in
page cache lookups to filter out shadow entries.  shmem already has
this check, though, since it stores swap entries alongside regular
pages inside its page cache radix trees.

	Future

Right now we have a fixed ratio (50:50) between the inactive and
active lists, but we already have complaints about working sets
exceeding half of memory being pushed out of the cache by simple
used-once streaming in the background.  Ultimately, we want to adjust
this ratio and allow for a much smaller inactive list.  These patches
are an essential step in this direction because they decouple the
VM's ability to detect working set changes from the inactive list
size.  This would allow us to base the inactive list size on
something more sensible, such as the combined readahead window size.

Please consider merging.  Thank you!

 fs/block_dev.c                   |   2 +-
 fs/btrfs/compression.c           |   2 +-
 fs/cachefiles/rdwr.c             |  33 +--
 fs/inode.c                       |  11 +-
 fs/nfs/blocklayout/blocklayout.c |   2 +-
 fs/nilfs2/inode.c                |   4 +-
 include/linux/fs.h               |   2 +
 include/linux/mm.h               |   8 +
 include/linux/mmzone.h           |   6 +
 include/linux/pagemap.h          |  33 ++-
 include/linux/pagevec.h          |   3 +
 include/linux/radix-tree.h       |   5 +-
 include/linux/shmem_fs.h         |   1 +
 include/linux/swap.h             |   9 +
 include/linux/writeback.h        |   1 +
 lib/radix-tree.c                 | 106 ++------
 mm/Makefile                      |   2 +-
 mm/filemap.c                     | 335 +++++++++++++++++++++---
 mm/mincore.c                     |  20 +-
 mm/page-writeback.c              |   2 +-
 mm/readahead.c                   |   6 +-
 mm/shmem.c                       | 122 +++------
 mm/swap.c                        |  49 ++++
 mm/truncate.c                    |  78 ++++--
 mm/vmscan.c                      |  24 +-
 mm/vmstat.c                      |   3 +
 mm/workingset.c                  | 506 +++++++++++++++++++++++++++++++++++++


* [patch 1/8] fs: cachefiles: use add_to_page_cache_lru()
From: Johannes Weiner @ 2013-10-10 21:46 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

This code used to have its own lru cache pagevec up until commit
a0b8cab3 ("mm: remove lru parameter from __pagevec_lru_add and remove
parts of pagevec API").  Now it's just add_to_page_cache() followed
by lru_cache_add_file(), so we might as well use
add_to_page_cache_lru() directly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/cachefiles/rdwr.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index ebaff36..4b1fb5c 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -265,24 +265,22 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
 				goto nomem_monitor;
 		}
 
-		ret = add_to_page_cache(newpage, bmapping,
-					netpage->index, cachefiles_gfp);
+		ret = add_to_page_cache_lru(newpage, bmapping,
+					    netpage->index, cachefiles_gfp);
 		if (ret == 0)
 			goto installed_new_backing_page;
 		if (ret != -EEXIST)
 			goto nomem_page;
 	}
 
-	/* we've installed a new backing page, so now we need to add it
-	 * to the LRU list and start it reading */
+	/* we've installed a new backing page, so now we need to start
+	 * it reading */
 installed_new_backing_page:
 	_debug("- new %p", newpage);
 
 	backpage = newpage;
 	newpage = NULL;
 
-	lru_cache_add_file(backpage);
-
 read_backing_page:
 	ret = bmapping->a_ops->readpage(NULL, backpage);
 	if (ret < 0)
@@ -510,24 +508,23 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 					goto nomem;
 			}
 
-			ret = add_to_page_cache(newpage, bmapping,
-						netpage->index, cachefiles_gfp);
+			ret = add_to_page_cache_lru(newpage, bmapping,
+						    netpage->index,
+						    cachefiles_gfp);
 			if (ret == 0)
 				goto installed_new_backing_page;
 			if (ret != -EEXIST)
 				goto nomem;
 		}
 
-		/* we've installed a new backing page, so now we need to add it
-		 * to the LRU list and start it reading */
+		/* we've installed a new backing page, so now we need
+		 * to start it reading */
 	installed_new_backing_page:
 		_debug("- new %p", newpage);
 
 		backpage = newpage;
 		newpage = NULL;
 
-		lru_cache_add_file(backpage);
-
 	reread_backing_page:
 		ret = bmapping->a_ops->readpage(NULL, backpage);
 		if (ret < 0)
@@ -538,8 +535,8 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 	monitor_backing_page:
 		_debug("- monitor add");
 
-		ret = add_to_page_cache(netpage, op->mapping, netpage->index,
-					cachefiles_gfp);
+		ret = add_to_page_cache_lru(netpage, op->mapping,
+					    netpage->index, cachefiles_gfp);
 		if (ret < 0) {
 			if (ret == -EEXIST) {
 				page_cache_release(netpage);
@@ -549,8 +546,6 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 			goto nomem;
 		}
 
-		lru_cache_add_file(netpage);
-
 		/* install a monitor */
 		page_cache_get(netpage);
 		monitor->netfs_page = netpage;
@@ -613,8 +608,8 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 	backing_page_already_uptodate:
 		_debug("- uptodate");
 
-		ret = add_to_page_cache(netpage, op->mapping, netpage->index,
-					cachefiles_gfp);
+		ret = add_to_page_cache_lru(netpage, op->mapping,
+					    netpage->index, cachefiles_gfp);
 		if (ret < 0) {
 			if (ret == -EEXIST) {
 				page_cache_release(netpage);
@@ -631,8 +626,6 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 
 		fscache_mark_page_cached(op, netpage);
 
-		lru_cache_add_file(netpage);
-
 		/* the netpage is unlocked and marked up to date here */
 		fscache_end_io(op, netpage, 0);
 		page_cache_release(netpage);
-- 
1.8.4


* [patch 2/8] lib: radix-tree: radix_tree_delete_item()
From: Johannes Weiner @ 2013-10-10 21:46 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

Provide a function that does not just delete an entry at a given
index, but also allows passing in an expected item.  Delete only if
that item is still located at the specified index.

This is handy when lockless tree traversals want to delete entries as
well because they don't have to do a second, locked lookup to verify
the slot has not changed under them before deleting the entry.
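
As an illustration of the intended use: a lockless traversal remembers
what it saw, then passes that in when deleting.  The helper below is
hypothetical; a later patch in this series converts shmem_free_swap()
to this pattern:

	static int delete_if_unchanged(struct address_space *mapping,
				       pgoff_t index, void *entry)
	{
		void *old;

		spin_lock_irq(&mapping->tree_lock);
		/* Only removes the slot if it still holds @entry. */
		old = radix_tree_delete_item(&mapping->page_tree, index, entry);
		spin_unlock_irq(&mapping->tree_lock);

		return old == entry ? 0 : -ENOENT;
	}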

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h |  1 +
 lib/radix-tree.c           | 31 +++++++++++++++++++++++++++----
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..1bf0a9c 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -219,6 +219,7 @@ static inline void radix_tree_replace_slot(void **pslot, void *item)
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..f442e32 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1335,15 +1335,18 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 }
 
 /**
- *	radix_tree_delete    -    delete an item from a radix tree
+ *	radix_tree_delete_item    -    delete an item from a radix tree
  *	@root:		radix tree root
  *	@index:		index key
+ *	@item:		expected item
  *
- *	Remove the item at @index from the radix tree rooted at @root.
+ *	Remove @item at @index from the radix tree rooted at @root.
  *
- *	Returns the address of the deleted item, or NULL if it was not present.
+ *	Returns the address of the deleted item, or NULL if it was not present
+ *	or the entry at the given @index was not @item.
  */
-void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+void *radix_tree_delete_item(struct radix_tree_root *root,
+			     unsigned long index, void *item)
 {
 	struct radix_tree_node *node = NULL;
 	struct radix_tree_node *slot = NULL;
@@ -1378,6 +1381,11 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 	if (slot == NULL)
 		goto out;
 
+	if (item && slot != item) {
+		slot = NULL;
+		goto out;
+	}
+
 	/*
 	 * Clear all tags associated with the item to be deleted.
 	 * This way of doing it would be inefficient, but seldom is any set.
@@ -1422,6 +1430,21 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 out:
 	return slot;
 }
+EXPORT_SYMBOL(radix_tree_delete_item);
+
+/**
+ *	radix_tree_delete    -    delete an item from a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Remove the item at @index from the radix tree rooted at @root.
+ *
+ *	Returns the address of the deleted item, or NULL if it was not present.
+ */
+void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+{
+	return radix_tree_delete_item(root, index, NULL);
+}
 EXPORT_SYMBOL(radix_tree_delete);
 
 /**
-- 
1.8.4


* [patch 3/8] mm: shmem: save one radix tree lookup when truncating swapped pages
From: Johannes Weiner @ 2013-10-10 21:46 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

Page cache radix tree slots are usually stabilized by the page lock,
but shmem's swap cookies have no such thing.  Because the overall
truncation loop is lockless, the swap entry is currently confirmed by
a tree lookup and then deleted by another tree lookup under the same
tree lock region.

Use radix_tree_delete_item() instead, which does the verification and
deletion with only one lookup.  This also allows removing the
delete-only special case from shmem_radix_tree_replace().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/shmem.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 8297623..7c67249 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -242,19 +242,17 @@ static int shmem_radix_tree_replace(struct address_space *mapping,
 			pgoff_t index, void *expected, void *replacement)
 {
 	void **pslot;
-	void *item = NULL;
+	void *item;
 
 	VM_BUG_ON(!expected);
+	VM_BUG_ON(!replacement);
 	pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
-	if (pslot)
-		item = radix_tree_deref_slot_protected(pslot,
-							&mapping->tree_lock);
+	if (!pslot)
+		return -ENOENT;
+	item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
 	if (item != expected)
 		return -ENOENT;
-	if (replacement)
-		radix_tree_replace_slot(pslot, replacement);
-	else
-		radix_tree_delete(&mapping->page_tree, index);
+	radix_tree_replace_slot(pslot, replacement);
 	return 0;
 }
 
@@ -386,14 +384,15 @@ export:
 static int shmem_free_swap(struct address_space *mapping,
 			   pgoff_t index, void *radswap)
 {
-	int error;
+	void *old;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = shmem_radix_tree_replace(mapping, index, radswap, NULL);
+	old = radix_tree_delete_item(&mapping->page_tree, index, radswap);
 	spin_unlock_irq(&mapping->tree_lock);
-	if (!error)
-		free_swap_and_cache(radix_to_swp_entry(radswap));
-	return error;
+	if (old != radswap)
+		return -ENOENT;
+	free_swap_and_cache(radix_to_swp_entry(radswap));
+	return 0;
 }
 
 /*
-- 
1.8.4


* [patch 4/8] mm: filemap: move radix tree hole searching here
From: Johannes Weiner @ 2013-10-10 21:46 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

The radix tree hole searching code is only used for page cache, for
example by the readahead code trying to get a picture of the area
surrounding a fault.

So far it sufficed to rely on the radix tree definition of holes,
which is "empty tree slot".  But this is about to change, as shadow
page descriptors will be stored in the page cache after the actual
pages get evicted from memory.

Move the functions over to mm/filemap.c and make them native page
cache operations, where they can later be adapted to handle the new
definition of "page cache hole".

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/nfs/blocklayout/blocklayout.c |  2 +-
 include/linux/pagemap.h          |  5 +++
 include/linux/radix-tree.h       |  4 ---
 lib/radix-tree.c                 | 75 ---------------------------------------
 mm/filemap.c                     | 76 ++++++++++++++++++++++++++++++++++++++++
 mm/readahead.c                   |  4 +--
 6 files changed, 84 insertions(+), 82 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index e242bbf..fdb74cb 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1220,7 +1220,7 @@ static u64 pnfs_num_cont_bytes(struct inode *inode, pgoff_t idx)
 	end = DIV_ROUND_UP(i_size_read(inode), PAGE_CACHE_SIZE);
 	if (end != NFS_I(inode)->npages) {
 		rcu_read_lock();
-		end = radix_tree_next_hole(&mapping->page_tree, idx + 1, ULONG_MAX);
+		end = page_cache_next_hole(mapping, idx + 1, ULONG_MAX);
 		rcu_read_unlock();
 	}
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..c73130c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -243,6 +243,11 @@ static inline struct page *page_cache_alloc_readahead(struct address_space *x)
 
 typedef int filler_t(void *, struct page *);
 
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan);
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan);
+
 extern struct page * find_get_page(struct address_space *mapping,
 				pgoff_t index);
 extern struct page * find_lock_page(struct address_space *mapping,
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 1bf0a9c..e8be53e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -227,10 +227,6 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
 			void ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index f442e32..e8adb5d 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -946,81 +946,6 @@ next:
 }
 EXPORT_SYMBOL(radix_tree_range_tag_if_tagged);
 
-
-/**
- *	radix_tree_next_hole    -    find the next hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the lowest
- *	indexed hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'return - index >= max_scan'
- *	will be true). In rare cases of index wrap-around, 0 will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 5, then subsequently a hole is created at index 10,
- *	radix_tree_next_hole covering both indexes may return 10 if called
- *	under rcu_read_lock.
- */
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index++;
-		if (index == 0)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_next_hole);
-
-/**
- *	radix_tree_prev_hole    -    find the prev hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search backwards in the range [max(index-max_scan+1, 0), index]
- *	for the first hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'index - return >= max_scan'
- *	will be true). In rare cases of wrap-around, ULONG_MAX will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 10, then subsequently a hole is created at index 5,
- *	radix_tree_prev_hole covering both indexes may return 5 if called under
- *	rcu_read_lock.
- */
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				   unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index--;
-		if (index == ULONG_MAX)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_prev_hole);
-
 /**
  *	radix_tree_gang_lookup - perform multiple lookup on a radix tree
  *	@root:		radix tree root
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e6aec4..1f7931c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -686,6 +686,82 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 }
 
 /**
+ * page_cache_next_hole - find the next hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the
+ * lowest indexed hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'return - index >=
+ * max_scan' will be true). In rare cases of index wrap-around, 0 will
+ * be returned.
+ *
+ * page_cache_next_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 5, then subsequently a hole is created at
+ * index 10, page_cache_next_hole covering both indexes may return 10
+ * if called under rcu_read_lock.
+ */
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index++;
+		if (index == 0)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(page_cache_next_hole);
+
+/**
+ * page_cache_prev_hole - find the prev hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search backwards in the range [max(index-max_scan+1, 0), index] for
+ * the first hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'index - return >=
+ * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX
+ * will be returned.
+ *
+ * page_cache_prev_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 10, then subsequently a hole is created at
+ * index 5, page_cache_prev_hole covering both indexes may return 5 if
+ * called under rcu_read_lock.
+ */
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index--;
+		if (index == ULONG_MAX)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(page_cache_prev_hole);
+
+/**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
diff --git a/mm/readahead.c b/mm/readahead.c
index e4ed041..9eeeeda 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -351,7 +351,7 @@ static pgoff_t count_history_pages(struct address_space *mapping,
 	pgoff_t head;
 
 	rcu_read_lock();
-	head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
+	head = page_cache_prev_hole(mapping, offset - 1, max);
 	rcu_read_unlock();
 
 	return offset - 1 - head;
@@ -430,7 +430,7 @@ ondemand_readahead(struct address_space *mapping,
 		pgoff_t start;
 
 		rcu_read_lock();
-		start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
+		start = page_cache_next_hole(mapping, offset + 1, max);
 		rcu_read_unlock();
 
 		if (!start || start - offset > max)
-- 
1.8.4


* [patch 5/8] mm + fs: prepare for non-page entries in page cache radix trees
From: Johannes Weiner @ 2013-10-10 21:46 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache,
prepare every site dealing with the radix trees directly to handle
entries other than pages.

The common lookup functions will filter out non-page entries and
return NULL for page cache holes, just as before.  But provide a raw
version of the API which returns non-page entries as well, and switch
shmem over to use it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c   |   2 +-
 include/linux/mm.h       |   8 ++
 include/linux/pagemap.h  |  15 ++--
 include/linux/pagevec.h  |   3 +
 include/linux/shmem_fs.h |   1 +
 mm/filemap.c             | 190 +++++++++++++++++++++++++++++++++++++++++------
 mm/mincore.c             |  20 +++--
 mm/readahead.c           |   2 +-
 mm/shmem.c               |  97 +++++-------------------
 mm/swap.c                |  47 ++++++++++++
 mm/truncate.c            |  75 +++++++++++++++----
 11 files changed, 331 insertions(+), 129 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 6aad98c..c883165 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -474,7 +474,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, pg_index);
 		rcu_read_unlock();
-		if (page) {
+		if (page && !radix_tree_exceptional_entry(page)) {
 			misses++;
 			if (misses > 4)
 				break;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..c09ef3a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -906,6 +906,14 @@ extern void show_free_areas(unsigned int flags);
 extern bool skip_free_areas_node(unsigned int flags, int nid);
 
 int shmem_zero_setup(struct vm_area_struct *);
+#ifdef CONFIG_SHMEM
+bool shmem_mapping(struct address_space *mapping);
+#else
+static inline bool shmem_mapping(struct address_space *mapping)
+{
+	return false;
+}
+#endif
 
 extern int can_do_mlock(void);
 extern int user_shm_lock(size_t, struct user_struct *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c73130c..b6854b7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -248,12 +248,15 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 pgoff_t page_cache_prev_hole(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
-extern struct page * find_get_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_lock_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_or_create_page(struct address_space *mapping,
-				pgoff_t index, gfp_t gfp_mask);
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
+				 gfp_t gfp_mask);
+unsigned __find_get_pages(struct address_space *mapping, pgoff_t start,
+			  unsigned int nr_pages, struct page **pages,
+			  pgoff_t *indices);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index e4dbfab..3c6b8b1 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,6 +22,9 @@ struct pagevec {
 
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned nr_pages, pgoff_t *indices);
+void pagevec_remove_exceptionals(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 30aa0dc..deb4960 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -49,6 +49,7 @@ extern struct file *shmem_file_setup(const char *name,
 					loff_t size, unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
 extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern bool shmem_mapping(struct address_space *mapping);
 extern void shmem_unlock_mapping(struct address_space *mapping);
 extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 					pgoff_t index, gfp_t gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1f7931c..3d216e7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -446,6 +446,24 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
+static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
+			     struct page *page)
+{
+	void **slot;
+
+	slot = radix_tree_lookup_slot(&mapping->page_tree, offset);
+	if (slot) {
+		void *p;
+
+		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+		if (!radix_tree_exceptional_entry(p))
+			return -EEXIST;
+		radix_tree_replace_slot(slot, page);
+		return 0;
+	}
+	return radix_tree_insert(&mapping->page_tree, offset, page);
+}
+
 /**
  * add_to_page_cache_locked - add a locked page to the pagecache
  * @page:	page to add
@@ -480,7 +498,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	page->index = offset;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = radix_tree_insert(&mapping->page_tree, offset, page);
+	error = page_cache_insert(mapping, offset, page);
 	radix_tree_preload_end();
 	if (unlikely(error))
 		goto err_insert;
@@ -712,7 +730,10 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index++;
 		if (index == 0)
@@ -750,7 +771,10 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index--;
 		if (index == ULONG_MAX)
@@ -762,14 +786,19 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 EXPORT_SYMBOL(page_cache_prev_hole);
 
 /**
- * find_get_page - find and get a page reference
+ * __find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
  *
- * Is there a pagecache struct page at the given (mapping, offset) tuple?
- * If yes, increment its refcount and return it; if no, return NULL.
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned with an increased refcount.
+ *
+ * If the slot holds a shadow entry of a previously evicted page, it
+ * is returned.
+ *
+ * Otherwise, %NULL is returned.
  */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
 {
 	void **pagep;
 	struct page *page;
@@ -810,24 +839,49 @@ out:
 
 	return page;
 }
+EXPORT_SYMBOL(__find_get_page);
+
+/**
+ * find_get_page - find and get a page reference
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned with an increased refcount.
+ *
+ * Otherwise, %NULL is returned.
+ */
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page = __find_get_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
 EXPORT_SYMBOL(find_get_page);
 
 /**
- * find_lock_page - locate, pin and lock a pagecache page
+ * __find_lock_page - locate, pin and lock a pagecache page
  * @mapping: the address_space to search
  * @offset: the page index
  *
- * Locates the desired pagecache page, locks it, increments its reference
- * count and returns its address.
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * If the slot holds a shadow entry of a previously evicted page, it
+ * is returned.
+ *
+ * Otherwise, %NULL is returned.
  *
- * Returns zero if the page was not present. find_lock_page() may sleep.
+ * __find_lock_page() may sleep.
  */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
 {
 	struct page *page;
-
 repeat:
-	page = find_get_page(mapping, offset);
+	page = __find_get_page(mapping, offset);
 	if (page && !radix_tree_exception(page)) {
 		lock_page(page);
 		/* Has the page been truncated? */
@@ -840,6 +894,29 @@ repeat:
 	}
 	return page;
 }
+EXPORT_SYMBOL(__find_lock_page);
+
+/**
+ * find_lock_page - locate, pin and lock a pagecache page
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * Otherwise, %NULL is returned.
+ *
+ * find_lock_page() may sleep.
+ */
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page = __find_lock_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
 EXPORT_SYMBOL(find_lock_page);
 
 /**
@@ -848,16 +925,18 @@ EXPORT_SYMBOL(find_lock_page);
  * @index: the page's index into the mapping
  * @gfp_mask: page allocation mode
  *
- * Locates a page in the pagecache.  If the page is not present, a new page
- * is allocated using @gfp_mask and is added to the pagecache and to the VM's
- * LRU list.  The returned page is locked and has its reference count
- * incremented.
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
  *
- * find_or_create_page() may sleep, even if @gfp_flags specifies an atomic
- * allocation!
+ * If the page is not present, a new page is allocated using @gfp_mask
+ * and added to the page cache and the VM's LRU list.  The page is
+ * returned locked and with an increased refcount.
  *
- * find_or_create_page() returns the desired page's address, or zero on
- * memory exhaustion.
+ * On memory exhaustion, %NULL is returned.
+ *
+ * find_or_create_page() may sleep, even if @gfp_flags specifies an
+ * atomic allocation!
  */
 struct page *find_or_create_page(struct address_space *mapping,
 		pgoff_t index, gfp_t gfp_mask)
@@ -890,6 +969,73 @@ repeat:
 EXPORT_SYMBOL(find_or_create_page);
 
 /**
+ * __find_get_pages - gang pagecache lookup
+ * @mapping:	The address_space to search
+ * @start:	The starting page index
+ * @nr_pages:	The maximum number of pages
+ * @pages:	Where the resulting pages are placed
+ *
+ * __find_get_pages() will search for and return a group of up to
+ * @nr_pages pages in the mapping.  The pages are placed at @pages.
+ * __find_get_pages() takes a reference against the returned pages.
+ *
+ * The search returns a group of mapping-contiguous pages with ascending
+ * indexes.  There may be holes in the indices due to not-present pages.
+ *
+ * Any shadow entries of evicted pages are included in the returned
+ * array.
+ *
+ * __find_get_pages() returns the number of pages and shadow entries
+ * which were found.
+ */
+unsigned __find_get_pages(struct address_space *mapping,
+			  pgoff_t start, unsigned int nr_pages,
+			  struct page **pages, pgoff_t *indices)
+{
+	void **slot;
+	unsigned int ret = 0;
+	struct radix_tree_iter iter;
+
+	if (!nr_pages)
+		return 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				goto restart;
+			/*
+			 * Otherwise, we must be storing a swap entry
+			 * here as an exceptional entry: so return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = iter.index;
+		pages[ret] = page;
+		if (++ret == nr_pages)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+/**
  * find_get_pages - gang pagecache lookup
  * @mapping:	The address_space to search
  * @start:	The starting page index
diff --git a/mm/mincore.c b/mm/mincore.c
index da2be56..ad411ec 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -70,13 +70,21 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 	 * any other file mapping (ie. marked !present and faulted in with
 	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
 	 */
-	page = find_get_page(mapping, pgoff);
 #ifdef CONFIG_SWAP
-	/* shmem/tmpfs may return swap: account for swapcache page too. */
-	if (radix_tree_exceptional_entry(page)) {
-		swp_entry_t swap = radix_to_swp_entry(page);
-		page = find_get_page(swap_address_space(swap), swap.val);
-	}
+	if (shmem_mapping(mapping)) {
+		page = __find_get_page(mapping, pgoff);
+		/*
+		 * shmem/tmpfs may return swap: account for swapcache
+		 * page too.
+		 */
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swp = radix_to_swp_entry(page);
+			page = find_get_page(swap_address_space(swp), swp.val);
+		}
+	} else
+		page = find_get_page(mapping, pgoff);
+#else
+	page = find_get_page(mapping, pgoff);
 #endif
 	if (page) {
 		present = PageUptodate(page);
diff --git a/mm/readahead.c b/mm/readahead.c
index 9eeeeda..912c003 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -179,7 +179,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
 		rcu_read_unlock();
-		if (page)
+		if (page && !radix_tree_exceptional_entry(page))
 			continue;
 
 		page = page_cache_alloc_readahead(mapping);
diff --git a/mm/shmem.c b/mm/shmem.c
index 7c67249..1f4b65f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -329,56 +329,6 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 }
 
 /*
- * Like find_get_pages, but collecting swap entries as well as pages.
- */
-static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
-					pgoff_t start, unsigned int nr_pages,
-					struct page **pages, pgoff_t *indices)
-{
-	void **slot;
-	unsigned int ret = 0;
-	struct radix_tree_iter iter;
-
-	if (!nr_pages)
-		return 0;
-
-	rcu_read_lock();
-restart:
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
-		struct page *page;
-repeat:
-		page = radix_tree_deref_slot(slot);
-		if (unlikely(!page))
-			continue;
-		if (radix_tree_exception(page)) {
-			if (radix_tree_deref_retry(page))
-				goto restart;
-			/*
-			 * Otherwise, we must be storing a swap entry
-			 * here as an exceptional entry: so return it
-			 * without attempting to raise page count.
-			 */
-			goto export;
-		}
-		if (!page_cache_get_speculative(page))
-			goto repeat;
-
-		/* Has the page moved? */
-		if (unlikely(page != *slot)) {
-			page_cache_release(page);
-			goto repeat;
-		}
-export:
-		indices[ret] = iter.index;
-		pages[ret] = page;
-		if (++ret == nr_pages)
-			break;
-	}
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
  * Remove swap entry from radix tree, free the swap and its page cache.
  */
 static int shmem_free_swap(struct address_space *mapping,
@@ -396,21 +346,6 @@ static int shmem_free_swap(struct address_space *mapping,
 }
 
 /*
- * Pagevec may contain swap entries, so shuffle up pages before releasing.
- */
-static void shmem_deswap_pagevec(struct pagevec *pvec)
-{
-	int i, j;
-
-	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		if (!radix_tree_exceptional_entry(page))
-			pvec->pages[j++] = page;
-	}
-	pvec->nr = j;
-}
-
-/*
  * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
  */
 void shmem_unlock_mapping(struct address_space *mapping)
@@ -428,12 +363,12 @@ void shmem_unlock_mapping(struct address_space *mapping)
 		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
 		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
 		 */
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					PAGEVEC_SIZE, pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		index = indices[pvec.nr - 1] + 1;
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		check_move_unevictable_pages(pvec.pages, pvec.nr);
 		pagevec_release(&pvec);
 		cond_resched();
@@ -465,9 +400,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	pagevec_init(&pvec, 0);
 	index = start;
 	while (index < end) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
-				min(end - index, (pgoff_t)PAGEVEC_SIZE),
-							pvec.pages, indices);
+		pvec.nr = __find_get_pages(mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		mem_cgroup_uncharge_start();
@@ -496,7 +431,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -534,9 +469,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+
+		pvec.nr = __find_get_pages(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
-							pvec.pages, indices);
+				pvec.pages, indices);
 		if (!pvec.nr) {
 			if (index == start || unfalloc)
 				break;
@@ -544,7 +480,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			continue;
 		}
 		if ((index == start || unfalloc) && indices[0] >= end) {
-			shmem_deswap_pagevec(&pvec);
+			pagevec_remove_exceptionals(&pvec);
 			pagevec_release(&pvec);
 			break;
 		}
@@ -573,7 +509,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -1081,7 +1017,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return -EFBIG;
 repeat:
 	swap.val = 0;
-	page = find_lock_page(mapping, index);
+	page = __find_lock_page(mapping, index);
 	if (radix_tree_exceptional_entry(page)) {
 		swap = radix_to_swp_entry(page);
 		page = NULL;
@@ -1418,6 +1354,11 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	return inode;
 }
 
+bool shmem_mapping(struct address_space *mapping)
+{
+	return mapping->backing_dev_info == &shmem_backing_dev_info;
+}
+
 #ifdef CONFIG_TMPFS
 static const struct inode_operations shmem_symlink_inode_operations;
 static const struct inode_operations shmem_short_symlink_operations;
@@ -1730,7 +1671,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 	pagevec_init(&pvec, 0);
 	pvec.nr = 1;		/* start small: we may be there already */
 	while (!done) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					pvec.nr, pvec.pages, indices);
 		if (!pvec.nr) {
 			if (whence == SEEK_DATA)
@@ -1757,7 +1698,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 				break;
 			}
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		pvec.nr = PAGEVEC_SIZE;
 		cond_resched();
diff --git a/mm/swap.c b/mm/swap.c
index 759c3ca..f624e5b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -894,6 +894,53 @@ EXPORT_SYMBOL(__pagevec_lru_add);
 
 /**
  * pagevec_lookup - gang pagecache lookup
+ * @pvec:	Where the resulting entries are placed
+ * @mapping:	The address_space to search
+ * @start:	The starting entry index
+ * @nr_pages:	The maximum number of entries
+ *
+ * pagevec_lookup() will search for and return a group of up to
+ * @nr_pages pages and shadow entries in the mapping.  All entries are
+ * placed in @pvec.  pagevec_lookup() takes a reference against actual
+ * pages in @pvec.
+ *
+ * The search returns a group of mapping-contiguous entries with
+ * ascending indexes.  There may be holes in the indices due to
+ * not-present entries.
+ *
+ * pagevec_lookup() returns the number of entries which were found.
+ */
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned nr_pages, pgoff_t *indices)
+{
+	pvec->nr = __find_get_pages(mapping, start, nr_pages,
+				    pvec->pages, indices);
+	return pagevec_count(pvec);
+}
+
+/**
+ * pagevec_remove_exceptionals - pagevec exceptionals pruning
+ * @pvec:	The pagevec to prune
+ *
+ * __pagevec_lookup() fills both pages and exceptional radix tree
+ * entries into the pagevec.  This function prunes all exceptionals
+ * from @pvec without leaving holes, so that it can be passed on to
+ * other pagevec operations.
+ */
+void pagevec_remove_exceptionals(struct pagevec *pvec)
+{
+	int i, j;
+
+	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		if (!radix_tree_exceptional_entry(page))
+			pvec->pages[j++] = page;
+	}
+	pvec->nr = j;
+}
+
+/**
+ * pagevec_lookup - gang pagecache lookup
  * @pvec:	Where the resulting pages are placed
  * @mapping:	The address_space to search
  * @start:	The starting page index
diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683..4c36035 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -22,6 +22,22 @@
 #include <linux/cleancache.h>
 #include "internal.h"
 
+static void clear_exceptional_entry(struct address_space *mapping,
+				    pgoff_t index, struct page *page)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	radix_tree_delete_item(&mapping->page_tree, index, page);
+	spin_unlock_irq(&mapping->tree_lock);
+}
 
 /**
  * do_invalidatepage - invalidate part or all of a page
@@ -208,12 +224,11 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	unsigned int	partial_start;	/* inclusive */
 	unsigned int	partial_end;	/* exclusive */
 	struct pagevec	pvec;
+	pgoff_t		indices[PAGEVEC_SIZE];
 	pgoff_t		index;
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0)
-		return;
 
 	/* Offsets within partial pages */
 	partial_start = lstart & (PAGE_CACHE_SIZE - 1);
@@ -238,17 +253,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
 
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index < end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+	while (index < end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index >= end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -259,6 +280,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -307,14 +329,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		if (!pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+		if (!__pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			indices)) {
 			if (index == start)
 				break;
 			index = start;
 			continue;
 		}
-		if (index == start && pvec.pages[0]->index >= end) {
+		if (index == start && indices[0] >= end) {
 			pagevec_release(&pvec);
 			break;
 		}
@@ -323,16 +346,22 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index >= end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -375,6 +404,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index = start;
 	unsigned long ret;
@@ -390,17 +420,23 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	 */
 
 	pagevec_init(&pvec, 0);
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -414,6 +450,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 				deactivate_page(page);
 			count += ret;
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -481,6 +518,7 @@ static int do_launder_page(struct address_space *mapping, struct page *page)
 int invalidate_inode_pages2_range(struct address_space *mapping,
 				  pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index;
 	int i;
@@ -491,17 +529,23 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	cleancache_invalidate_inode(mapping);
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			if (page->mapping != mapping) {
@@ -539,6 +583,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				ret = ret2;
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
-- 
1.8.4

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [patch 6/8] mm + fs: store shadow entries in page cache
  2013-10-10 21:46 [patch 0/8] mm: thrash detection-based file cache sizing v5 Johannes Weiner
                   ` (4 preceding siblings ...)
  2013-10-10 21:46 ` [patch 5/8] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
@ 2013-10-10 21:47 ` Johannes Weiner
  2013-10-10 21:47 ` [patch 7/8] mm: thrash detection-based file cache sizing Johannes Weiner
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2013-10-10 21:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

Reclaim will be leaving shadow entries in the page cache radix tree
upon evicting the real page.  Because those pages are found via the
LRU rather than through the inode, an iput() elsewhere can lead to
the inode being freed concurrently.  At this
point, reclaim must no longer install shadow pages because the inode
freeing code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code
sets under the tree lock before doing the final truncate.  Reclaim
will check for this flag before installing shadow pages.
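
In sketch form, the ordering this establishes is roughly the
following (illustrative pseudo-C only, condensed from the hunks
below; the shadow value itself comes from the workingset code added
later in the series):

	/* inode teardown, in evict(), before the final truncate */
	spin_lock_irq(&inode->i_data.tree_lock);
	mapping_set_exiting(&inode->i_data);	/* sets AS_EXITING */
	spin_unlock_irq(&inode->i_data.tree_lock);
	truncate_inode_pages(&inode->i_data, 0);

	/* page reclaim, with mapping->tree_lock already held */
	if (!mapping_exiting(mapping))
		__delete_from_page_cache(page, shadow);	/* leave a shadow */
	else
		__delete_from_page_cache(page, NULL);	/* plain deletion */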

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/block_dev.c          |  2 +-
 fs/inode.c              | 18 +++++++++++++++++-
 fs/nilfs2/inode.c       |  4 ++--
 include/linux/fs.h      |  1 +
 include/linux/pagemap.h | 13 ++++++++++++-
 mm/filemap.c            | 16 ++++++++++++----
 mm/truncate.c           |  5 +++--
 mm/vmscan.c             |  2 +-
 8 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1e86823..391ffe5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -83,7 +83,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index b33ba8e..56712ac 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -503,6 +503,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
+	BUG_ON(inode->i_data.nrshadows);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
@@ -545,10 +546,25 @@ static void evict(struct inode *inode)
 	 */
 	inode_wait_for_writeback(inode);
 
+	/*
+	 * Page reclaim may happen concurrently against pages in this
+	 * address space (pinned by the page lock).  Make sure that it
+	 * does not plant shadow pages behind the final truncate's
+	 * back, or they will be lost forever.
+	 *
+	 * As truncation uses a lockless tree lookup, acquire the
+	 * spinlock to make sure any ongoing tree modification that
+	 * does not see AS_EXITING is completed before starting the
+	 * final truncate.
+	 */
+	spin_lock_irq(&inode->i_data.tree_lock);
+	mapping_set_exiting(&inode->i_data);
+	spin_unlock_irq(&inode->i_data.tree_lock);
+
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
-		if (inode->i_data.nrpages)
+		if (inode->i_data.nrpages || inode->i_data.nrshadows)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 	}
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 7e350c5..42fcbe3 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -783,7 +783,7 @@ void nilfs_evict_inode(struct inode *inode)
 	int ret;
 
 	if (inode->i_nlink || !ii->i_root || unlikely(is_bad_inode(inode))) {
-		if (inode->i_data.nrpages)
+		if (inode->i_data.nrpages || inode->i_data.nrshadows)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 		nilfs_clear_inode(inode);
@@ -791,7 +791,7 @@ void nilfs_evict_inode(struct inode *inode)
 	}
 	nilfs_transaction_begin(sb, &ti, 0); /* never fails */
 
-	if (inode->i_data.nrpages)
+	if (inode->i_data.nrpages || inode->i_data.nrshadows)
 		truncate_inode_pages(&inode->i_data, 0);
 
 	/* TODO: some of the following operations may fail.  */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3f40547..9bfa5a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -416,6 +416,7 @@ struct address_space {
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
+	unsigned long		nrshadows;	/* number of shadow entries */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index b6854b7..f132fdf 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_EXITING	= __GFP_BITS_SHIFT + 5, /* final truncate in progress */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -69,6 +70,16 @@ static inline int mapping_balloon(struct address_space *mapping)
 	return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
 }
 
+static inline void mapping_set_exiting(struct address_space *mapping)
+{
+	set_bit(AS_EXITING, &mapping->flags);
+}
+
+static inline int mapping_exiting(struct address_space *mapping)
+{
+	return test_bit(AS_EXITING, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
@@ -547,7 +558,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page);
+extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d216e7..aaccc5fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -112,7 +112,7 @@
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the mapping's tree_lock.
  */
-void __delete_from_page_cache(struct page *page)
+void __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
@@ -127,7 +127,14 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	if (shadow) {
+		void **slot;
+
+		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+		radix_tree_replace_slot(slot, shadow);
+		mapping->nrshadows++;
+	} else
+		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
 	mapping->nrpages--;
@@ -166,7 +173,7 @@ void delete_from_page_cache(struct page *page)
 
 	freepage = mapping->a_ops->freepage;
 	spin_lock_irq(&mapping->tree_lock);
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
@@ -426,7 +433,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->index = offset;
 
 		spin_lock_irq(&mapping->tree_lock);
-		__delete_from_page_cache(old);
+		__delete_from_page_cache(old, NULL);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
 		BUG_ON(error);
 		mapping->nrpages++;
@@ -459,6 +466,7 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
+		mapping->nrshadows--;
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 4c36035..86866f1 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -35,7 +35,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * without the tree itself locked.  These unlocked entries
 	 * need verification under the tree lock.
 	 */
-	radix_tree_delete_item(&mapping->page_tree, index, page);
+	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
+		mapping->nrshadows--;
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
@@ -481,7 +482,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 53f2f82..958bb64 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -553,7 +553,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page);
+		__delete_from_page_cache(page, NULL);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
-- 
1.8.4

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [patch 7/8] mm: thrash detection-based file cache sizing
  2013-10-10 21:46 [patch 0/8] mm: thrash detection-based file cache sizing v5 Johannes Weiner
                   ` (5 preceding siblings ...)
  2013-10-10 21:47 ` [patch 6/8] mm + fs: store shadow entries in page cache Johannes Weiner
@ 2013-10-10 21:47 ` Johannes Weiner
  2013-10-10 21:47 ` [patch 8/8] mm: workingset: keep shadow entries in check Johannes Weiner
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2013-10-10 21:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".

Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages.  This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more gracetime in the face of temporary use-once streams,
but ultimately was not significantly better than a FIFO policy and
still thrashed cache based on eviction speed, rather than actual
demand for cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size.  By
maintaining a history of recently evicted file pages it can detect
frequently used pages with an arbitrarily small inactive list size,
and subsequently apply pressure on the active list based on actual
demand for cache, not just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot.  On refault, the minimum access
distance of the page can be assessed to evaluate whether the page
should be part of the active list or not.
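
To illustrate with made-up numbers:

	counter snapshot at eviction:     E = 1000
	counter reading at refault:       R = 1300
	refault distance:                 R - E = 300
	active file pages in the zone:    500

	Since 300 <= 500, the page would have stayed resident had the
	inactive list been 300 slots bigger, so it is activated on
	refault.  With R = 2000 instead, the distance of 1000 would
	exceed the active list and the page would start out on the
	inactive list as usual.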

This fixes the VM's blindness towards working set changes in excess of
the inactive list.  And it's the foundation to further improve the
protection ability by basing the inactive list size on something more
appropriate than "50% of cache", like the combined readahead window
size.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |   5 +
 include/linux/swap.h   |   5 +
 mm/Makefile            |   2 +-
 mm/filemap.c           |  61 ++++++++----
 mm/swap.c              |   2 +
 mm/vmscan.c            |  24 ++++-
 mm/vmstat.c            |   2 +
 mm/workingset.c        | 254 +++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 332 insertions(+), 23 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd791e4..118ba9f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,8 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	WORKINGSET_REFAULT,
+	WORKINGSET_ACTIVATE,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
@@ -392,6 +394,9 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t		inactive_age;
+
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6..b83cf61 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,11 @@ struct swap_list_t {
 	int next;	/* swapfile to be used next */
 };
 
+/* linux/mm/workingset.c */
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+bool workingset_refault(void *shadow);
+void workingset_activation(struct page *page);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/Makefile b/mm/Makefile
index 305d10a..b30aeb8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o \
-			   interval_tree.o list_lru.o $(mmu-y)
+			   interval_tree.o list_lru.o workingset.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/filemap.c b/mm/filemap.c
index aaccc5fe..8ef41b7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -454,7 +454,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
 static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
-			     struct page *page)
+			     struct page *page, void **shadowp)
 {
 	void **slot;
 
@@ -467,23 +467,17 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
 		mapping->nrshadows--;
+		if (shadowp)
+			*shadowp = p;
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
 }
 
-/**
- * add_to_page_cache_locked - add a locked page to the pagecache
- * @page:	page to add
- * @mapping:	the page's address_space
- * @offset:	page index
- * @gfp_mask:	page allocation mode
- *
- * This function is used to add a page to the pagecache. It must be locked.
- * This function does not add the page to the LRU.  The caller must do that.
- */
-int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
-		pgoff_t offset, gfp_t gfp_mask)
+static int __add_to_page_cache_locked(struct page *page,
+				      struct address_space *mapping,
+				      pgoff_t offset, gfp_t gfp_mask,
+				      void **shadowp)
 {
 	int error;
 
@@ -506,7 +500,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	page->index = offset;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = page_cache_insert(mapping, offset, page);
+	error = page_cache_insert(mapping, offset, page, shadowp);
 	radix_tree_preload_end();
 	if (unlikely(error))
 		goto err_insert;
@@ -523,16 +517,49 @@ err_insert:
 	page_cache_release(page);
 	return error;
 }
+
+/**
+ * add_to_page_cache_locked - add a locked page to the pagecache
+ * @page:	page to add
+ * @mapping:	the page's address_space
+ * @offset:	page index
+ * @gfp_mask:	page allocation mode
+ *
+ * This function is used to add a page to the pagecache. It must be locked.
+ * This function does not add the page to the LRU.  The caller must do that.
+ */
+int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
+		pgoff_t offset, gfp_t gfp_mask)
+{
+	return __add_to_page_cache_locked(page, mapping, offset,
+					  gfp_mask, NULL);
+}
 EXPORT_SYMBOL(add_to_page_cache_locked);
 
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t offset, gfp_t gfp_mask)
 {
+	void *shadow = NULL;
 	int ret;
 
-	ret = add_to_page_cache(page, mapping, offset, gfp_mask);
-	if (ret == 0)
-		lru_cache_add_file(page);
+	__set_page_locked(page);
+	ret = __add_to_page_cache_locked(page, mapping, offset,
+					 gfp_mask, &shadow);
+	if (unlikely(ret))
+		__clear_page_locked(page);
+	else {
+		/*
+		 * The page might have been evicted from cache only
+		 * recently, in which case it should be activated like
+		 * any other repeatedly accessed page.
+		 */
+		if (shadow && workingset_refault(shadow)) {
+			SetPageActive(page);
+			workingset_activation(page);
+		} else
+			ClearPageActive(page);
+		lru_cache_add(page);
+	}
 	return ret;
 }
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
diff --git a/mm/swap.c b/mm/swap.c
index f624e5b..ece5c49 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -519,6 +519,8 @@ void mark_page_accessed(struct page *page)
 		else
 			__lru_cache_activate_page(page);
 		ClearPageReferenced(page);
+		if (page_is_file_cache(page))
+			workingset_activation(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 958bb64..4689dec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -504,7 +504,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+			    bool reclaimed)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -550,10 +551,23 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
+		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
-
-		__delete_from_page_cache(page, NULL);
+		/*
+		 * Remember a shadow entry for reclaimed file cache in
+		 * order to detect refaults, thus thrashing, later on.
+		 *
+		 * But don't store shadows in an address space that is
+		 * already exiting.  This is not just an optimization,
+		 * inode reclaim needs to empty out the radix tree or
+		 * the nodes are lost.  Don't plant shadows behind its
+		 * back.
+		 */
+		if (reclaimed && page_is_file_cache(page) &&
+		    !mapping_exiting(mapping))
+			shadow = workingset_eviction(mapping, page);
+		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
@@ -576,7 +590,7 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
+	if (__remove_mapping(mapping, page, false)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -1046,7 +1060,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
 
 		/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9bb3145..3ac830d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -770,6 +770,8 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
+	"workingset_refault",
+	"workingset_activate",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 0000000..1c114cd
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,254 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ *		Double CLOCK lists
+ *
+ * Per zone, two clock lists are maintained for file pages: the
+ * inactive and the active list.  Freshly faulted pages start out at
+ * the head of the inactive list and page reclaim scans pages from the
+ * tail.  Pages that are accessed multiple times on the inactive list
+ * are promoted to the active list, to protect them from reclaim,
+ * whereas active pages are demoted to the inactive list when the
+ * active list grows too big.
+ *
+ *   fault ------------------------+
+ *                                 |
+ *              +--------------+   |            +-------------+
+ *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
+ *              +--------------+                +-------------+    |
+ *                     |                                           |
+ *                     +-------------- promotion ------------------+
+ *
+ *
+ *		Access frequency and refault distance
+ *
+ * A workload is thrashing when its pages are frequently used but they
+ * are evicted from the inactive list every time before another access
+ * would have promoted them to the active list.
+ *
+ * In cases where the average access distance between thrashing pages
+ * is bigger than the size of memory there is nothing that can be
+ * done - the thrashing set could never fit into memory under any
+ * circumstance.
+ *
+ * However, the average access distance could be bigger than the
+ * inactive list, yet smaller than the size of memory.  In this case,
+ * the set could fit into memory if it weren't for the currently
+ * active pages - which may be used more, hopefully less frequently:
+ *
+ *      +-memory available to cache-+
+ *      |                           |
+ *      +-inactive------+-active----+
+ *  a b | c d e f g h i | J K L M N |
+ *      +---------------+-----------+
+ *
+ * It is prohibitively expensive to accurately track access frequency
+ * of pages.  But a reasonable approximation can be made to measure
+ * thrashing on the inactive list, after which refaulting pages can be
+ * activated optimistically to compete with the existing active pages.
+ *
+ * Approximating inactive page access frequency - Observations:
+ *
+ * 1. When a page is accessed for the first time, it is added to the
+ *    head of the inactive list, slides every existing inactive page
+ *    towards the tail by one slot, and pushes the current tail page
+ *    out of memory.
+ *
+ * 2. When a page is accessed for the second time, it is promoted to
+ *    the active list, shrinking the inactive list by one slot.  This
+ *    also slides all inactive pages that were faulted into the cache
+ *    more recently than the activated page towards the tail of the
+ *    inactive list.
+ *
+ * Thus:
+ *
+ * 1. The sum of evictions and activations between any two points in
+ *    time indicates the minimum number of inactive pages accessed in
+ *    between.
+ *
+ * 2. Moving one inactive page N page slots towards the tail of the
+ *    list requires at least N inactive page accesses.
+ *
+ * Combining these:
+ *
+ * 1. When a page is finally evicted from memory, the number of
+ *    inactive pages accessed while the page was in cache is at least
+ *    the number of page slots on the inactive list.
+ *
+ * 2. In addition, measuring the sum of evictions and activations (E)
+ *    at the time of a page's eviction, and comparing it to another
+ *    reading (R) at the time the page faults back into memory tells
+ *    the minimum number of accesses while the page was not cached.
+ *    This is called the refault distance.
+ *
+ * Because the first access of the page was the fault and the second
+ * access the refault, we combine the in-cache distance with the
+ * out-of-cache distance to get the complete minimum access distance
+ * of this page:
+ *
+ *      NR_inactive + (R - E)
+ *
+ * And knowing the minimum access distance of a page, we can easily
+ * tell if the page would be able to stay in cache assuming all page
+ * slots in the cache were available:
+ *
+ *   NR_inactive + (R - E) <= NR_inactive + NR_active
+ *
+ * which can be further simplified to
+ *
+ *   (R - E) <= NR_active
+ *
+ * Put into words, the refault distance (out-of-cache) can be seen as
+ * a deficit in inactive list space (in-cache).  If the inactive list
+ * had (R - E) more page slots, the page would not have been evicted
+ * in between accesses, but activated instead.  And on a full system,
+ * the only thing eating into inactive list space is active pages.
+ *
+ *
+ *		Activating refaulting pages
+ *
+ * All that is known about the active list is that the pages have been
+ * accessed more than once in the past.  This means that at any given
+ * time there is actually a good chance that pages on the active list
+ * are no longer in active use.
+ *
+ * So when a refault distance of (R - E) is observed and there are at
+ * least (R - E) active pages, the refaulting page is activated
+ * optimistically in the hope that (R - E) active pages are actually
+ * used less frequently than the refaulting page - or even not used at
+ * all anymore.
+ *
+ * If this is wrong and demotion kicks in, the pages which are truly
+ * used more frequently will be reactivated while the less frequently
+ * used ones will be evicted from memory.
+ *
+ * But if this is right, the stale pages will be pushed out of memory
+ * and the used pages get to stay in cache.
+ *
+ *
+ *		Implementation
+ *
+ * For each zone's file LRU lists, a counter for inactive evictions
+ * and activations is maintained (zone->inactive_age).
+ *
+ * On eviction, a snapshot of this counter (along with some bits to
+ * identify the zone) is stored in the now empty page cache radix tree
+ * slot of the evicted page.  This is called a shadow entry.
+ *
+ * On cache misses for which there are shadow entries, an eligible
+ * refault distance will immediately activate the refaulting page.
+ */
+
+static void *pack_shadow(unsigned long eviction, struct zone *zone)
+{
+	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
+	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
+	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
+
+	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+static void unpack_shadow(void *shadow,
+			  struct zone **zone,
+			  unsigned long *distance)
+{
+	unsigned long entry = (unsigned long)shadow;
+	unsigned long eviction;
+	unsigned long refault;
+	unsigned long mask;
+	int zid, nid;
+
+	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	zid = entry & ((1UL << ZONES_SHIFT) - 1);
+	entry >>= ZONES_SHIFT;
+	nid = entry & ((1UL << NODES_SHIFT) - 1);
+	entry >>= NODES_SHIFT;
+	eviction = entry;
+
+	*zone = NODE_DATA(nid)->node_zones + zid;
+
+	refault = atomic_long_read(&(*zone)->inactive_age);
+	mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
+			RADIX_TREE_EXCEPTIONAL_SHIFT);
+	/*
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases.
+	 *
+	 * There is a special case: usually, shadow entries have a
+	 * short lifetime and are either refaulted or reclaimed along
+	 * with the inode before they get too old.  But it is not
+	 * impossible for the inactive_age to lap a shadow entry in
+	 * the field, which can then result in a falsely small
+	 * refault distance, leading to a false activation should this
+	 * old entry actually refault again.  However, earlier kernels
+	 * used to deactivate unconditionally with *every* reclaim
+	 * invocation for the longest time, so the occasional
+	 * inappropriate activation leading to pressure on the active
+	 * list is not a problem.
+	 */
+	*distance = (refault - eviction) & mask;
+}
+
+/**
+ * workingset_eviction - note the eviction of a page from memory
+ * @mapping: address space the page was backing
+ * @page: the page being evicted
+ *
+ * Returns a shadow entry to be stored in @mapping->page_tree in place
+ * of the evicted @page so that a later refault can be detected.  Or
+ * %NULL when the eviction should not be remembered.
+ */
+void *workingset_eviction(struct address_space *mapping, struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long eviction;
+
+	eviction = atomic_long_inc_return(&zone->inactive_age);
+	return pack_shadow(eviction, zone);
+}
+
+/**
+ * workingset_refault - evaluate the refault of a previously evicted page
+ * @shadow: shadow entry of the evicted page
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted page in the context of the zone it was allocated in.
+ *
+ * Returns %true if the page should be activated, %false otherwise.
+ */
+bool workingset_refault(void *shadow)
+{
+	unsigned long refault_distance;
+	struct zone *zone;
+
+	unpack_shadow(shadow, &zone, &refault_distance);
+	inc_zone_state(zone, WORKINGSET_REFAULT);
+
+	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+		inc_zone_state(zone, WORKINGSET_ACTIVATE);
+		return true;
+	}
+	return false;
+}
+
+/**
+ * workingset_activation - note a page activation
+ * @page: page that is being activated
+ */
+void workingset_activation(struct page *page)
+{
+	atomic_long_inc(&page_zone(page)->inactive_age);
+}
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [patch 8/8] mm: workingset: keep shadow entries in check
  2013-10-10 21:46 [patch 0/8] mm: thrash detection-based file cache sizing v5 Johannes Weiner
                   ` (6 preceding siblings ...)
  2013-10-10 21:47 ` [patch 7/8] mm: thrash detection-based file cache sizing Johannes Weiner
@ 2013-10-10 21:47 ` Johannes Weiner
  2013-10-11  0:39 ` [patch 0/8] mm: thrash detection-based file cache sizing v5 Dave Chinner
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2013-10-10 21:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, a list of inodes that contain shadow
entries is maintained.  If the global number of shadows exceeds a
certain threshold, a shrinker is activated that reclaims old entries
from the mappings.  This is heavy-handed but it should not be a hot
path and is mainly there to protect from accidentally/maliciously
induced OOM kills.  The global list is also not a problem because the
modifications are very rare: inodes are added once in their lifetime
when the first shadow entry is stored (i.e. the first page reclaimed)
and lazily removed when the inode exits or when the shrinker removes
all of its shadow entries.
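
In condensed form, the lazy list maintenance looks like this
(simplified from the patch below, with the tree_lock/shadow_lock
nesting omitted):

	/* first shadow entry stored in a mapping, under tree_lock */
	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list))
		list_add(&mapping->shadow_list, &shadow_mappings);
	mapping->nrshadows++;

	/* removal only drops the count; the mapping stays linked and
	 * is unlinked lazily by the shrinker or at inode teardown */
	mapping->nrshadows--;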

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c                |  17 +---
 include/linux/fs.h        |   1 +
 include/linux/mmzone.h    |   1 +
 include/linux/swap.h      |   4 +
 include/linux/writeback.h |   1 +
 mm/filemap.c              |   4 +-
 mm/page-writeback.c       |   2 +-
 mm/truncate.c             |   2 +-
 mm/vmstat.c               |   1 +
 mm/workingset.c           | 252 ++++++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 269 insertions(+), 16 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 56712ac..040210f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -169,6 +169,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->private_data = NULL;
 	mapping->backing_dev_info = &default_backing_dev_info;
 	mapping->writeback_index = 0;
+	workingset_init_mapping(mapping);
 
 	/*
 	 * If the block_device provides a backing_dev_info for client
@@ -547,19 +548,11 @@ static void evict(struct inode *inode)
 	inode_wait_for_writeback(inode);
 
 	/*
-	 * Page reclaim may happen concurrently against pages in this
-	 * address space (pinned by the page lock).  Make sure that it
-	 * does not plant shadow pages behind the final truncate's
-	 * back, or they will be lost forever.
-	 *
-	 * As truncation uses a lockless tree lookup, acquire the
-	 * spinlock to make sure any ongoing tree modification that
-	 * does not see AS_EXITING is completed before starting the
-	 * final truncate.
+	 * Tell page reclaim that the address space is going away,
+	 * before starting the final truncate, to prevent it from
+	 * installing shadow entries that might get lost otherwise.
 	 */
-	spin_lock_irq(&inode->i_data.tree_lock);
-	mapping_set_exiting(&inode->i_data);
-	spin_unlock_irq(&inode->i_data.tree_lock);
+	workingset_exit_mapping(&inode->i_data);
 
 	if (op->evict_inode) {
 		op->evict_inode(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9bfa5a5..442aa9a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -417,6 +417,7 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	struct list_head	shadow_list;	/* list of mappings with shadows */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 118ba9f..1424aa1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -144,6 +144,7 @@ enum zone_stat_item {
 #endif
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_SHADOWS_RECLAIMED,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b83cf61..891809b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,9 +261,13 @@ struct swap_list_t {
 };
 
 /* linux/mm/workingset.c */
+void workingset_init_mapping(struct address_space *mapping);
+void workingset_exit_mapping(struct address_space *mapping);
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
+void workingset_shadows_inc(struct address_space *mapping);
+void workingset_shadows_dec(struct address_space *mapping);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 021b8a3..557cc4b 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -152,6 +152,7 @@ struct ctl_table;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int,
 				      void __user *, size_t *, loff_t *);
 
+unsigned long global_dirtyable_memory(void);
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
diff --git a/mm/filemap.c b/mm/filemap.c
index 8ef41b7..3e8d3a1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -132,7 +132,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
 		radix_tree_replace_slot(slot, shadow);
-		mapping->nrshadows++;
+		workingset_shadows_inc(mapping);
 	} else
 		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
@@ -466,7 +466,7 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 		if (shadowp)
 			*shadowp = p;
 		return 0;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f5236f8..05812b8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -234,7 +234,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  * Returns the global number of pages potentially available for dirty
  * page cache.  This is the base value for the global dirty limits.
  */
-static unsigned long global_dirtyable_memory(void)
+unsigned long global_dirtyable_memory(void)
 {
 	unsigned long x;
 
diff --git a/mm/truncate.c b/mm/truncate.c
index 86866f1..847aa16 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -36,7 +36,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * need verification under the tree lock.
 	 */
 	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3ac830d..ea5993f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -772,6 +772,7 @@ const char * const vmstat_text[] = {
 #endif
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_shadows_reclaimed",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 1c114cd..31174bb 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -152,6 +152,62 @@
  * refault distance will immediately activate the refaulting page.
  */
 
+struct percpu_counter nr_shadows;
+
+static DEFINE_SPINLOCK(shadow_lock);
+static LIST_HEAD(shadow_mappings);
+
+/**
+ * workingset_init_mapping - prepare address space for page reclaim
+ * @mapping: address space
+ *
+ * Must be called when the inode is instantiated, before any page
+ * cache is populated.
+ */
+void workingset_init_mapping(struct address_space *mapping)
+{
+	INIT_LIST_HEAD(&mapping->shadow_list);
+}
+
+/**
+ * workingset_exit_mapping - tell page reclaim address space is exiting
+ * @mapping: address space
+ *
+ * Must be called before the final truncate, to prevent page reclaim
+ * from installing shadow entries behind the back of the inode
+ * teardown process.
+ */
+void workingset_exit_mapping(struct address_space *mapping)
+{
+	/*
+	 * Page reclaim may happen concurrently against pages in this
+	 * address space (pinned by the page lock).  Make sure that it
+	 * does not plant shadow pages behind the final truncate's
+	 * back, or they will be lost forever.
+	 *
+	 * As truncation uses a lockless tree lookup, acquire the
+	 * spinlock to make sure any ongoing tree modification that
+	 * does not see AS_EXITING is completed before starting the
+	 * final truncate.
+	 */
+	spin_lock_irq(&mapping->tree_lock);
+	mapping_set_exiting(mapping);
+	/*
+	 * Take the mapping off the shrinker list, the final truncate
+	 * is about to remove potentially remaining shadow entries.
+	 */
+	if (!list_empty(&mapping->shadow_list)) {
+		/*
+		 * shadow_lock is irq-safe but nests inside the
+		 * irq-safe mapping->tree_lock, so don't bother.
+		 */
+		spin_lock(&shadow_lock);
+		list_del(&mapping->shadow_list);
+		spin_unlock(&shadow_lock);
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+}
+
 static void *pack_shadow(unsigned long eviction, struct zone *zone)
 {
 	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
@@ -252,3 +308,199 @@ void workingset_activation(struct page *page)
 {
 	atomic_long_inc(&page_zone(page)->inactive_age);
 }
+
+/*
+ * Explicit shadow shrinking
+ *
+ * In most cases, shadow entries are refaulted or truncated along with
+ * inode reclaim before their radix tree node consumption becomes a
+ * problem.  However, to protect against the odd/malicious workload,
+ * the following code pushes them back should they grow out of bounds.
+ *
+ * A global list of page cache objects (struct address_space) that
+ * host shadow entries is maintained lazily.  This means that the
+ * first shadow entry links the object to the list, but it is only
+ * removed when the inode is destroyed or the shrinker reclaimed the
+ * last shadow entry in the object, making list modifications a very
+ * rare event.
+ *
+ * Should the shadow entries exceed a certain number under memory
+ * pressure, the shrinker will walk the list and scan each object's
+ * radix tree to delete shadows that would no longer have a refault
+ * effect anyway, i.e. whose refault distance is already bigger than
+ * the zone's active list.
+ */
+
+/**
+ * workingset_shadows_inc - count a shadow entry insertion
+ * @mapping: page cache object
+ *
+ * Counts a new shadow entry in @mapping, caller must hold
+ * @mapping->tree_lock.
+ */
+void workingset_shadows_inc(struct address_space *mapping)
+{
+	VM_BUG_ON(!spin_is_locked(&mapping->tree_lock));
+
+	might_lock(&shadow_lock);
+
+	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
+		/*
+		 * shadow_lock is irq-safe but nests inside the
+		 * irq-safe mapping->tree_lock, so don't bother.
+		 */
+		spin_lock(&shadow_lock);
+		list_add(&mapping->shadow_list, &shadow_mappings);
+		spin_unlock(&shadow_lock);
+	}
+
+	mapping->nrshadows++;
+	percpu_counter_add(&nr_shadows, 1);
+}
+
+/**
+ * workingset_shadows_dec - count a shadow entry removal
+ * @mapping: page cache object
+ *
+ * Counts the removal of a shadow entry from @mapping, caller must
+ * hold @mapping->tree_lock.
+ */
+void workingset_shadows_dec(struct address_space *mapping)
+{
+	VM_BUG_ON(!spin_is_locked(&mapping->tree_lock));
+
+	mapping->nrshadows--;
+	percpu_counter_add(&nr_shadows, -1);
+	/*
+	 * shadow_mappings operations are costly, so we keep the
+	 * mapping linked here even without any shadows left and
+	 * unlink it lazily in the shadow shrinker or when the inode
+	 * is destroyed.
+	 */
+}
+
+static unsigned long get_nr_old_shadows(void)
+{
+	unsigned long nr_max;
+	unsigned long nr;
+
+	nr = percpu_counter_read_positive(&nr_shadows);
+	/*
+	 * Every shadow entry with a refault distance bigger than the
+	 * active list is ignored and so NR_ACTIVE_FILE would be a
+	 * reasonable ceiling.  But scanning and shrinking shadow
+	 * entries is quite expensive, so be generous.
+	 */
+	nr_max = global_dirtyable_memory() * 4;
+
+	if (nr <= nr_max)
+		return 0;
+	return nr - nr_max;
+}
+
+static unsigned long scan_mapping(struct address_space *mapping,
+				  unsigned long nr_to_scan)
+{
+	unsigned long nr_scanned = 0;
+	struct radix_tree_iter iter;
+	void **slot;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
+		unsigned long nrshadows;
+		unsigned long distance;
+		struct zone *zone;
+		struct page *page;
+
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (!radix_tree_exception(page))
+			continue;
+		if (radix_tree_deref_retry(page))
+			goto restart;
+
+		unpack_shadow(page, &zone, &distance);
+
+		if (distance <= zone_page_state(zone, NR_ACTIVE_FILE))
+			continue;
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (radix_tree_delete_item(&mapping->page_tree,
+					   iter.index, page)) {
+			inc_zone_state(zone, WORKINGSET_SHADOWS_RECLAIMED);
+			workingset_shadows_dec(mapping);
+			nr_scanned++;
+		}
+		nrshadows = mapping->nrshadows;
+		spin_unlock_irq(&mapping->tree_lock);
+
+		if (nrshadows == 0)
+			break;
+
+		if (--nr_to_scan == 0)
+			break;
+	}
+	rcu_read_unlock();
+
+	return nr_scanned;
+}
+
+static unsigned long count_shadows(struct shrinker *shrink,
+				   struct shrink_control *sc)
+{
+	return get_nr_old_shadows();
+}
+
+static unsigned long scan_shadows(struct shrinker *shrink,
+				  struct shrink_control *sc)
+{
+	unsigned long nr_to_scan = sc->nr_to_scan;
+
+	do {
+		struct address_space *mapping;
+
+		spin_lock_irq(&shadow_lock);
+		if (list_empty(&shadow_mappings)) {
+			spin_unlock_irq(&shadow_lock);
+			break;
+		}
+		mapping = list_entry(shadow_mappings.prev,
+				     struct address_space,
+				     shadow_list);
+		list_move(&mapping->shadow_list, &shadow_mappings);
+		__iget(mapping->host);
+		spin_unlock_irq(&shadow_lock);
+
+		if (mapping->nrshadows)
+			nr_to_scan -= scan_mapping(mapping, nr_to_scan);
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (mapping->nrshadows == 0) {
+			spin_lock(&shadow_lock);
+			list_del_init(&mapping->shadow_list);
+			spin_unlock(&shadow_lock);
+		}
+		spin_unlock_irq(&mapping->tree_lock);
+
+		iput(mapping->host);
+
+	} while (nr_to_scan && get_nr_old_shadows());
+
+	return sc->nr_to_scan - nr_to_scan;
+}
+
+static struct shrinker shadow_shrinker = {
+	.count_objects = count_shadows,
+	.scan_objects = scan_shadows,
+	.seeks = 1,
+};
+
+static int __init workingset_init(void)
+{
+	percpu_counter_init(&nr_shadows, 0);
+	register_shrinker(&shadow_shrinker);
+	return 0;
+}
+core_initcall(workingset_init);
-- 
1.8.4

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-10 21:46 [patch 0/8] mm: thrash detection-based file cache sizing v5 Johannes Weiner
                   ` (7 preceding siblings ...)
  2013-10-10 21:47 ` [patch 8/8] mm: workingset: keep shadow entries in check Johannes Weiner
@ 2013-10-11  0:39 ` Dave Chinner
  2013-10-14 21:42   ` Johannes Weiner
  2013-10-21  9:26 ` Vlastimil Babka
  2013-11-12 10:30 ` Bob Liu
  10 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2013-10-11  0:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Oct 10, 2013 at 05:46:54PM -0400, Johannes Weiner wrote:
> 	Costs
> 
> These patches increase struct inode by three words to manage shadow
> entries in the page cache radix tree.

An additional 24 bytes on a 64 bit system. Filesystem developers
will kill to save 4 bytes in the struct inode, so adding 24 bytes is
a *major* concern.

> However, given that a typical
> inode (like the ext4 inode) is already 1k in size, this is not much.
> It's a 2% size increase for a reclaimable object. 

The struct ext4_inode is one of the larger inodes in the system at
960 bytes (same as the xfs_inode) - most of the filesystem inode
structures are down around the 600-700 byte range.

Really, the struct inode is what you should be comparing the
increase against (i.e. the VFS inode footprint), not the filesystem
specific inode footprint. In that case, it's actually an increase of
closer to a 4% increase in size because we are talking about a 560
byte structure....

> fs_mark metadata
> tests with millions of inodes did not show a measurable difference.
> And as soon as there is any file data involved, the page cache pages
> dominate the memory cost anyway.

We don't need to measure it at runtime to know the difference - a
million inodes will consume an extra 24MB of RAM at minimum. If the
size increase pushes the inode over a slab boundary, it might be
significantly more than that...
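
For reference, the arithmetic behind those figures, assuming 64-bit
(8-byte) words:

	3 words x 8 bytes            =  24 bytes per inode
	24 bytes / ~560 byte inode   ~=  4% VFS inode growth
	1,000,000 inodes x 24 bytes  ~= 24 MB of extra RAM at minimum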

The main cost here is a new list head for a new shrinker. There's
interesting new inode lifecycle issues introduced by this shadow
tree - adding serialisation in evict() because the VM can now modify
the address space without having a reference to the inode
is kinda nasty.

Also, I really don't like the idea of a new inode cache shrinker
that is completely uncoordinated with the existing inode cache
shrinkers. It uses a global lock and list and is not node aware so
all it will do under many workloads is re-introduce a scalability
choke point we just got rid of in 3.12.

I think that you could simply piggy-back on inode_lru_isolate() to
remove shadow mappings in exactly the same manner it removes inode
buffers and page cache pages on inodes that are about to be
reclaimed.  Keeping the size of the inode cache down will have the
side effect of keeping the shadow mappings under control, and so I
don't see a need for a separate shrinker at all here.

And removing the special shrinker will bring the struct inode size
increase back to only 8 bytes, and I think we can live with that
increase given the workload improvements that the rest of the
functionality brings.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-11  0:39 ` [patch 0/8] mm: thrash detection-based file cache sizing v5 Dave Chinner
@ 2013-10-14 21:42   ` Johannes Weiner
  2013-10-15  1:41     ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2013-10-14 21:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel

Hi Dave,

On Fri, Oct 11, 2013 at 11:39:30AM +1100, Dave Chinner wrote:
> On Thu, Oct 10, 2013 at 05:46:54PM -0400, Johannes Weiner wrote:
> > 	Costs
> > 
> > These patches increase struct inode by three words to manage shadow
> > entries in the page cache radix tree.
> 
> An additional 24 bytes on a 64 bit system. Filesystem developers
> will kill to save 4 bytes in the struct inode, so adding 24 bytes is
> a *major* concern.
> 
> > However, given that a typical
> > inode (like the ext4 inode) is already 1k in size, this is not much.
> > It's a 2% size increase for a reclaimable object. 
> 
> The struct ext4_inode is one of the larger inodes in the system at
> 960 bytes (same as the xfs_inode) - most of the filesystem inode
> structures are down around the 600-700 byte range.
> 
> Really, the struct inode is what you should be comparing the
> increase against (i.e. the VFS inode footprint), not the filesystem
> specific inode footprint. In that case, it's actually closer to a 4%
> increase in size because we are talking about a 560
> byte structure....
> 
> > fs_mark metadata
> > tests with millions of inodes did not show a measurable difference.
> > And as soon as there is any file data involved, the page cache pages
> > dominate the memory cost anyway.
> 
> We don't need to measure it at runtime to know the difference - a
> million inodes will consume an extra 24MB of RAM at minimum. If the
> size increase pushes the inode over a slab boundary, it might be
> significantly more than that...
> 
> The main cost here is a new list head for a new shrinker. There are
> interesting new inode lifecycle issues introduced by this shadow
> tree - adding serialisation in evict() because the VM can now modify
> the address space without having a reference to the inode
> is kinda nasty.

This is unlikely to change, though.  Direct reclaim may hold all kinds
of fs locks so we can't reasonably do iput() from reclaim context.

We already serialize inode eviction and reclaim through the page lock
of cached pages.  I just added the equivalent for shadow entries,
which don't have their own per-item serialization.

> Also, I really don't like the idea of a new inode cache shrinker
> that is completely uncoordinated with the existing inode cache
> shrinkers. It uses a global lock and list and is not node aware so
> all it will do under many workloads is re-introduce a scalability
> choke point we just got rid of in 3.12.

Shadow entries are mostly self-regulating and, unlike the inode case,
the shrinker is not the primary means of resource control here.  I
don't think this has the same scalability requirements as inode
shrinking.

> I think that you could simply piggy-back on inode_lru_isolate() to
> remove shadow mappings in exactly the same manner it removes inode
> buffers and page cache pages on inodes that are about to be
> reclaimed.  Keeping the size of the inode cache down will have the
> side effect of keeping the shadow mappings under control, and so I
> don't see a need for a separate shrinker at all here.

Pinned inodes are not on the LRU, so you could send a machine OOM by
simply catting a single large (sparse) file to /dev/null.

Buffers and page cache are kept in check by page reclaim, the inode
shrinker only drops cache as part of inode lifetime management.  Just
like with buffers and page cache, there is no relationship between the
amount of memory occupied and the number of inodes; or between the
node of said memory and the node that holds the inode object.  The
inode shrinker does not really seem appropriate for managing excessive
shadow entry memory.

> And removing the special shrinker will bring the struct inode size
> increase back to only 8 bytes, and I think we can live with that
> increase given the workload improvements that the rest of the
> functionality brings.

That would be very desirable indeed.

What we would really want is a means of per-zone tracking of
radix_tree_nodes occupied by shadow entries but I can't see a way to
do this without blowing up the radix tree structure at a much bigger
cost than an extra list_head in struct address_space.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-14 21:42   ` Johannes Weiner
@ 2013-10-15  1:41     ` Dave Chinner
  2013-10-15 10:27       ` Jan Kara
  2013-10-15 17:41       ` Johannes Weiner
  0 siblings, 2 replies; 24+ messages in thread
From: Dave Chinner @ 2013-10-15  1:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel

On Mon, Oct 14, 2013 at 05:42:50PM -0400, Johannes Weiner wrote:
> Hi Dave,
> 
> On Fri, Oct 11, 2013 at 11:39:30AM +1100, Dave Chinner wrote:
> > On Thu, Oct 10, 2013 at 05:46:54PM -0400, Johannes Weiner wrote:
> > > 	Costs
> > > 
> > > These patches increase struct inode by three words to manage shadow
> > > entries in the page cache radix tree.
> > 
> > An additional 24 bytes on a 64 bit system. Filesystem developers
> > will kill to save 4 bytes in the struct inode, so adding 24 bytes is
> > a *major* concern.
> > 
> > > However, given that a typical
> > > inode (like the ext4 inode) is already 1k in size, this is not much.
> > > It's a 2% size increase for a reclaimable object. 
> > 
> > The struct ext4_inode is one of the larger inodes in the system at
> > 960 bytes (same as the xfs_inode) - most of the filesystem inode
> > structures are down around the 600-700 byte range.
> > 
> > Really, the struct inode is what you should be comparing the
> > increase against (i.e. the VFS inode footprint), not the filesystem
> > specific inode footprint. In that case, it's actually closer to a 4%
> > increase in size because we are talking about a 560
> > byte structure....
> > 
> > > fs_mark metadata
> > > tests with millions of inodes did not show a measurable difference.
> > > And as soon as there is any file data involved, the page cache pages
> > > dominate the memory cost anyway.
> > 
> > We don't need to measure it at runtime to know the difference - a
> > million inodes will consume an extra 24MB of RAM at minimum. If the
> > size increase pushes the inode over a slab boundary, it might be
> > significantly more than that...
> > 
> > The main cost here is a new list head for a new shrinker. There are
> > interesting new inode lifecycle issues introduced by this shadow
> > tree - adding serialisation in evict() because the VM can now modify
> > the address space without having a reference to the inode
> > is kinda nasty.
> 
> This is unlikely to change, though.  Direct reclaim may hold all kinds
> of fs locks so we can't reasonably do iput() from reclaim context.

Right, but you do exactly that from the new shrinker because it
doesn't have protection against being called in GFP_NOFS
contexts....

> We already serialize inode eviction and reclaim through the page lock
> of cached pages.

Sure, we do that via truncate_inode_pages() but that is not playing
games with the inode life cycle - we know the inode is already dead
and all users of the pages are gone at the point we remove the pages
from the page cache.

That's kind of my point: the VM already has an address space
serialisation point in the evict() path via truncate_inode_pages() and
so I don't see any reason for you needing to introduce a new one
that disables/enables interrupts in the hot path regardless of
whether the flag needs to be set or not. Why can't you put this in
truncate_inode_pages() or some new wrapper and keep the
synchronisation wholly within the VM subsystem like we do now?
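
For example, a wrapper along these lines would keep it inside the VM;
the function name and the flag helper are invented for the sketch and
are not existing interfaces:

void truncate_inode_pages_on_evict(struct address_space *mapping)
{
        /*
         * Tell page reclaim that this mapping is going away so it
         * stops planting shadow entries, then do the normal final
         * truncate.  Filesystems keep calling a single function from
         * evict() and never see the flag or the locking.
         */
        spin_lock_irq(&mapping->tree_lock);
        mapping_set_exiting(mapping);           /* hypothetical flag helper */
        spin_unlock_irq(&mapping->tree_lock);

        truncate_inode_pages(mapping, 0);
}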

> I just added the equivalent for shadow entries,
> which don't have their own per-item serialization.

They don't need serialisation at the inode level - only between
page reclaim and the final address space truncate...

> > Also, I really don't like the idea of a new inode cache shrinker
> > that is completely uncoordinated with the existing inode cache
> > shrinkers. It uses a global lock and list and is not node aware so
> > all it will do under many workloads is re-introduce a scalability
> > choke point we just got rid of in 3.12.
> 
> Shadow entries are mostly self-regulating and, unlike the inode case,
> the shrinker is not the primary means of resource control here.  I
> don't think this has the same scalability requirements as inode
> shrinking.

Anything that introduces a global lock that needs to be taken in the
inode evict() path is a scalability limitation. I've been working to
remove all global locks and lists from the evict() path precisely
because they severely limit VFS scalability. Hence new code
that introduces a global lock and list into hot VFS paths is simply
not acceptable any more.

> > I think that you could simply piggy-back on inode_lru_isolate() to
> > remove shadow mappings in exactly the same manner it removes inode
> > buffers and page cache pages on inodes that are about to be
> > reclaimed.  Keeping the size of the inode cache down will have the
> > side effect of keeping the shadow mappings under control, and so I
> > don't see a need for a separate shrinker at all here.
> 
> Pinned inodes are not on the LRU, so you could send a machine OOM by
> simply catting a single large (sparse) file to /dev/null.

Then you have a serious design flaw if you are relying on a shrinker
to control memory consumed by page cache radix trees as a result of
page cache reclaim inserting exceptional entries into the radix
tree and then forgetting about them.

To work around this, you keep a global count of exceptional entries
and a global list of inodes with such exceptional radix tree
entries. The count doesn't really tell you how much memory is used
by the radix trees - the same count can mean an order of
magnitude difference in actual memory consumption (one shadow entry
per radix tree node vs 64) so it's not a very good measure to base
memory reclaim behaviour on but it is an inferred (rather than
actual) object count.
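
To put rough numbers on that: with 560-byte radix tree nodes and 64
slots per node, a count of N shadow entries can pin anywhere from about
(N / 64) * 560 bytes, when the nodes are full of shadow entries, up to
N * 560 bytes when each node holds a single one.  That is a 64x spread
in memory consumption for the same count.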

And even if you do free some entries, there is no guarantee that any
memory will be freed because only empty radix tree nodes will get
freed, and then memory will only get freed when the entire slab of
radix tree nodes are freed.

This reclaim behaviour has potential to cause internal
fragmentation of the radix tree node slab, which means that we'll
simply spend time scanning and freeing entries but not free any
memory.

You walk the inode list by a shrinker and scan radix trees for
shadow entries that can be removed. It's expensive to scan radix
trees, especially for inodes with large amounts of cached data, so
this could do a lot of work to find very little in way of entries to
free.

The shrinker doesn't rotate inodes on the list, so it will always
scan the same inodes on the list in the same order and so if memory
reclaim removes a few pages from an inode with a large amount of
cached pages between each shrinker call, then those radix trees will
be repeatedly scanned in their entirety on each call to the shrinker.

Also, the shrinker only decrements nr_to_scan when it finds an entry
to reclaim. nr_to_scan is the number of objects to scan for reclaim,
not the number of objects to reclaim. Hence the shrinker will be
doing a lot of scanning if there's inodes at the head of the list
with large radix trees....
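
For reference, the post-3.12 shrinker API expects the scan callback to
treat nr_to_scan as a budget of objects to look at, not a number of
objects to free.  This is just the expected shape of such a callback,
not the code from the patchset:

static unsigned long shadow_shrink_scan(struct shrinker *shrink,
                                        struct shrink_control *sc)
{
        unsigned long scanned = 0, freed = 0;

        while (scanned < sc->nr_to_scan) {
                /*
                 * Examine one object here; every object looked at is
                 * charged against the budget, whether or not it can
                 * actually be freed (bump freed only when it is).
                 */
                scanned++;
        }
        return freed;   /* objects reclaimed, not objects scanned */
}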

Do I need to go on pointing out how unscalable this approach is?

> Buffers and page cache are kept in check by page reclaim, the inode
> shrinker only drops cache as part of inode lifetime management.  Just
> like with buffers and page cache, there is no relationship between the
> amount of memory occupied and the number of inodes; or between the
> node of said memory and the node that holds the inode object.

Yes, but buffers and page cache pages are kept in check directly by the
VM, not another shrinker. That's the big difference here - you're
introducing something that is the equivalent of pages or buffers
(i.e. allocated and controlled by the VM) and then saying that it
can't be controlled by the VM and we need to link inodes together so
a shrinker can do a new type of per-inode scan to find the VM
controlled objects.

Besides, doesn't the fact that you're requiring VFS cache lifecycle
awareness in the core VM functionality ring alarm bells about
layering violations in your head? They are loud enough to hurt my
head...

> The
> inode shrinker does not really seem appropriate for managing excessive
> shadow entry memory.

It may not be - it was simply a suggestion on how we might be able to
get rid of the nasty shrinker code in your patchset....

> > And removing the special shrinker will bring the struct inode size
> > increase back to only 8 bytes, and I think we can live with that
> > increase given the workload improvements that the rest of the
> > functionality brings.
> 
> That would be very desirable indeed.
> 
> What we would really want is a means of per-zone tracking of
> radix_tree_nodes occupied by shadow entries but I can't see a way to
> do this without blowing up the radix tree structure at a much bigger
> cost than an extra list_head in struct address_space.

Putting a list_head in the radix tree node is likely to have a lower
cost than putting one in every inode. Most cached inodes don't have
any page cache associated with them. Indeed, my workstation right
now shows:

$ sudo grep "radix\|xfs_inode" /proc/slabinfo 
xfs_inode         277773 278432   1024    4    1 : tunables   54   27    8 : slabdata  69608  69608      0
radix_tree_node    74137  74956    560    7    1 : tunables   54   27    8 : slabdata  10708  10708      0

4x as many inodes as there are radix tree nodes in memory(*).
That's with 55% of memory being used by the page cache. So it's
pretty clear that tracking radix tree nodes directly might well be
lower cost than tracking address spaces and/or inodes....

(*) Note that the inode size is 1024 bytes in the config I'm using, so
increasing it is going to push it to only 3 inodes per page rather
than 4, so that extra 16-byte list head means a 25% increase in memory
consumption for the inode cache, not 2%. The radix tree node can be
increased by 24 bytes and still fit 7 per page and so not actually
increase memory consumption at all....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-15  1:41     ` Dave Chinner
@ 2013-10-15 10:27       ` Jan Kara
  2013-10-15 17:41       ` Johannes Weiner
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Kara @ 2013-10-15 10:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, Andrew Morton, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Rik van Riel, Michel Lespinasse, Seth Jennings, Roman Gushchin,
	Ozgun Erdogan, Metin Doslu, Vlastimil Babka, Tejun Heo, linux-mm,
	linux-fsdevel, linux-kernel

On Tue 15-10-13 12:41:23, Dave Chinner wrote:
> Then you have a serious design flaw if you are relying on a shrinker
> to control memory consumed by page cache radix trees as a result of
> page cache reclaim inserting exceptional entries into the radix
> tree and then forgetting about them.
> 
> To work around this, you keep a global count of exceptional entries
> and a global list of inodes with such exceptional radix tree
> entries. The count doesn't really tell you how much memory is used
> by the radix trees - the same count can mean an order of
> magnitude difference in actual memory consumption (one shadow entry
> per radix tree node vs 64) so it's not a very good measure to base
> memory reclaim behaviour on but it is an inferred (rather than
> actual) object count.
> 
> And even if you do free some entries, there is no guarantee that any
> memory will be freed because only empty radix tree nodes will get
> freed, and then memory will only get freed when the entire slab of
> radix tree nodes are freed.
> 
> This reclaim behaviour has potential to cause internal
> fragmentation of the radix tree node slab, which means that we'll
> simply spend time scanning and freeing entries but not free any
> memory.
> 
> You walk the inode list by a shrinker and scan radix trees for
> shadow entries that can be removed. It's expensive to scan radix
> trees, especially for inodes with large amounts of cached data, so
> this could do a lot of work to find very little in way of entries to
> free.
> 
> The shrinker doesn't rotate inodes on the list, so it will always
> scan the same inodes on the list in the same order and so if memory
> reclaim removes a few pages from an inode with a large amount of
> cached pages between each shrinker call, then those radix trees will
> be repeatedly scanned in their entirety on each call to the shrinker.
> 
> Also, the shrinker only decrements nr_to_scan when it finds an entry
> to reclaim. nr_to_scan is the number of objects to scan for reclaim,
> not the number of objects to reclaim. Hence the shrinker will be
> doing a lot of scanning if there's inodes at the head of the list
> with large radix trees....
> 
> Do I need to go on pointing out how unscalable this approach is?
  Just to add some real world experience to what Dave points out - ext4 has
a thing called the extent cache. It essentially caches the logical->physical
mapping of blocks, together with some state flags, per inode. And
currently the cache is maintained similarly to how you handle shadow
entries - we have an LRU list of inodes and a shrinker that scans the extents
in an inode to find extents to free (we cannot reclaim an arbitrary cached
extent because some state cannot simply be lost). And it is a pain. We burn
lots of CPU scanning for extents to free under some loads, and sometimes we
get RCU lockups and similar stuff. So we will have to rewrite the code to
use something more clever sooner rather than later. I don't think you
should repeat our mistake ;)

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-15  1:41     ` Dave Chinner
  2013-10-15 10:27       ` Jan Kara
@ 2013-10-15 17:41       ` Johannes Weiner
  2013-10-15 23:41         ` Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2013-10-15 17:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel

On Tue, Oct 15, 2013 at 12:41:23PM +1100, Dave Chinner wrote:
> On Mon, Oct 14, 2013 at 05:42:50PM -0400, Johannes Weiner wrote:
> > Hi Dave,
> > 
> > On Fri, Oct 11, 2013 at 11:39:30AM +1100, Dave Chinner wrote:
> > > On Thu, Oct 10, 2013 at 05:46:54PM -0400, Johannes Weiner wrote:
> > > > 	Costs
> > > > 
> > > > These patches increase struct inode by three words to manage shadow
> > > > entries in the page cache radix tree.
> > > 
> > > An additional 24 bytes on a 64 bit system. Filesystem developers
> > > will kill to save 4 bytes in the struct inode, so adding 24 bytes is
> > > a *major* concern.
> > > 
> > > > However, given that a typical
> > > > inode (like the ext4 inode) is already 1k in size, this is not much.
> > > > It's a 2% size increase for a reclaimable object. 
> > > 
> > > The struct ext4_inode is one of the larger inodes in the system at
> > > 960 bytes (same as the xfs_inode) - most of the filesystem inode
> > > structures are down around the 600-700 byte range.
> > > 
> > > Really, the struct inode is what you should be comparing the
> > > increase against (i.e. the VFS inode footprint), not the filesystem
> > > specific inode footprint. In that case, it's actually closer to a 4%
> > > increase in size because we are talking about a 560
> > > byte structure....
> > > 
> > > > fs_mark metadata
> > > > tests with millions of inodes did not show a measurable difference.
> > > > And as soon as there is any file data involved, the page cache pages
> > > > dominate the memory cost anyway.
> > > 
> > > We don't need to measure it at runtime to know the difference - a
> > > million inodes will consume an extra 24MB of RAM at minimum. If the
> > > size increase pushes the inode over a slab boundary, it might be
> > > significantly more than that...
> > > 
> > > The main cost here is a new list head for a new shrinker. There are
> > > interesting new inode lifecycle issues introduced by this shadow
> > > tree - adding serialisation in evict() because the VM can now modify
> > > the address space without having a reference to the inode
> > > is kinda nasty.
> > 
> > This is unlikely to change, though.  Direct reclaim may hold all kinds
> > of fs locks so we can't reasonably do iput() from reclaim context.
> 
> Right, but you do exactly that from the new shrinker because it
> doesn't have protection against being called in GFP_NOFS
> contexts....

Oops, yes, that's a bug, but easy to fix.

> > We already serialize inode eviction and reclaim through the page lock
> > of cached pages.
> 
> Sure, we do that via truncate_inode_pages() but that is not playing
> games with the inode life cycle - we know the inode is already dead
> and all users of the pages are gone at the point we remove the pages
> from the page cache.
> 
> That's kind of my point: the VM already has an address space
> serialisation point in the evict() path via truncate_inode_pages() and
> so I don't see any reason for you needing to introduce a new one
> that disables/enables interrupts in the hot path regardless of
> whether the flag needs to be set or not. Why can't you put this in
> truncate_inode_pages() or some new wrapper and keep the
> synchronisation wholly within the VM subsystem like we do now?

Fair enough, that should be easily fixable as well.

> > > Also, I really don't like the idea of a new inode cache shrinker
> > > that is completely uncoordinated with the existing inode cache
> > > shrinkers. It uses a global lock and list and is not node aware so
> > > all it will do under many workloads is re-introduce a scalability
> > > choke point we just got rid of in 3.12.
> > 
> > Shadow entries are mostly self-regulating and, unlike the inode case,
> > the shrinker is not the primary means of resource control here.  I
> > don't think this has the same scalability requirements as inode
> > shrinking.
> 
> Anything that introduces a global lock that needs to be taken in the
> inode evict() path is a scalability limitation. I've been working to
> remove all global locks and lists from the evict() path precisely
> because they severely limit VFS scalability. Hence new code
> that introduces a global lock and list into hot VFS paths is simply
> not acceptable any more.

Fair enough as well.  But do keep in mind that the lock and list are
only involved when the address space actually had pages evicted from
it in the past.  As you said, most inodes don't even have pages...

> > > I think that you could simply piggy-back on inode_lru_isolate() to
> > > remove shadow mappings in exactly the same manner it removes inode
> > > buffers and page cache pages on inodes that are about to be
> > > reclaimed.  Keeping the size of the inode cache down will have the
> > > side effect of keeping the shadow mappings under control, and so I
> > > don't see a need for a separate shrinker at all here.
> > 
> > Pinned inodes are not on the LRU, so you could send a machine OOM by
> > simply catting a single large (sparse) file to /dev/null.
> 
> Then you have a serious design flaw if you are relying on a shrinker
> to control memory consumed by page cache radix trees as a result of
> page cache reclaim inserting exceptional entries into the radix
> tree and then forgetting about them.

I'm not forgetting about them, I just track them very coarsely by
linking up address spaces and then lazily enforce their upper limit
when memory is tight by using the shrinker callback.  The assumption
was that actually scanning them is such a rare event that we trade the
rare computational costs for smaller memory consumption most of the
time.

> To work around this, you keep a global count of exceptional entries
> and a global list of inodes with such exceptional radix tree
> entries. The count doesn't really tell you how much memory is used
> by the radix trees - the same count can mean an order of
> magnitude difference in actual memory consumption (one shadow entry
> per radix tree node vs 64) so it's not a very good measure to base
> memory reclaim behaviour on but it is an inferred (rather than
> actual) object count.

Capping shadow entries instead of memory consumption was intentional.
They should be trimmed based on whether old shadow entries are still
meaningful and have an effect if refaulted, not based on memory
pressure.  These entries have an influence on future memory pressure
so we shouldn't kick them out based on how tight resources are but
based on whether there are too many expired entries.

Previous implementations of non-resident history from Peter & Rik
maintained a big system-wide hash table with a constant cost instead
of using radix tree memory like this.  My idea was that this is cache
friendlier and memory consumption should be lower in most cases and
the shrinker is only there to cap the extreme / malicious cases.

> You walk the inode list by a shrinker and scan radix trees for
> shadow entries that can be removed. It's expensive to scan radix
> trees, especially for inodes with large amounts of cached data, so
> this could do a lot of work to find very little in way of entries to
> free.
> 
> The shrinker doesn't rotate inodes on the list, so it will always
> scan the same inodes on the list in the same order and so if memory
> reclaim removes a few pages from an inode with a large amount of
> cached pages between each shrinker call, then those radix trees will
> be repeatedly scanned in their entirety on each call to the shrinker.
>
> Also, the shrinker only decrements nr_to_scan when it finds an entry
> to reclaim. nr_to_scan is the number of objects to scan for reclaim,
> not the number of objects to reclaim. Hence the shrinker will be
> doing a lot of scanning if there's inodes at the head of the list
> with large radix trees....

I realize all of this.  The scanner is absolutely expensive, I just
didn't care because it's not supposed to run in the first place but
rather act like an emergency brake.

Again, the shrinker isn't even called until shadow entries are in
excess, regardless of how bad memory pressure is.  On the other hand,
the fact that this code is unneeded most of the time makes the struct
inode size increase even worse.

Don't get me wrong, I am very much for improving this, but I think
it's important that we agree on how performance critical that whole
thing is.

> Do I need to go on pointing out how unscalable this approach is?

No, I think I got your point ;-)

> > Buffers and page cache are kept in check by page reclaim, the inode
> > shrinker only drops cache as part of inode lifetime management.  Just
> > like with buffers and page cache, there is no relationship between the
> > amount of memory occupied and the number of inodes; or between the
> > node of said memory and the node that holds the inode object.
> 
> Yes, but buffers and page cache pages are kept in check directly by the
> VM, not another shrinker. That's the big difference here - you're
> introducing something that is the equivalent of pages or buffers
> (i.e. allocated and controlled by the VM) and then saying that it
> can't be controlled by the VM and we need to link inodes together so
> a shrinker can do a new type of per-inode scan to find the VM
> controlled objects.
>
> Besides, doesn't the fact that you're requiring VFS cache lifecycle
> awareness in the core VM functionality ring alarm bells about
> layering violations in your head? They are loud enough to hurt my
> head...

Page reclaim is one big-ass layering violation.  The whole source of
this is that we work on objects tied to inode lifetime without keeping
a reference to the inode.  The VM already assumes that the inode will
stay alive as long as it holds a locked, untruncated page.  As I said,
I don't think I added a new class of cross-subsystem synchronization.

> > The
> > inode shrinker does not really seem appropriate for managing excessive
> > shadow entry memory.
> 
> It may not be - it was simply a suggestion on how we might be able to
> get rid of the nasty shrinker code in your patchset....
> 
> > > And removing the special shrinker will bring the struct inode size
> > > increase back to only 8 bytes, and I think we can live with that
> > > increase given the workload improvements that the rest of the
> > > functionality brings.
> > 
> > That would be very desirable indeed.
> > 
> > What we would really want is a means of per-zone tracking of
> > radix_tree_nodes occupied by shadow entries but I can't see a way to
> > do this without blowing up the radix tree structure at a much bigger
> > cost than an extra list_head in struct address_space.
> 
> Putting a list_head in the radix tree node is likely to have a lower
> cost than putting one in every inode. Most cached inodes don't have
> any page cache associated with them. Indeed, my workstation right
> now shows:
> 
> $ sudo grep "radix\|xfs_inode" /proc/slabinfo 
> xfs_inode         277773 278432   1024    4    1 : tunables   54   27    8 : slabdata  69608  69608      0
> radix_tree_node    74137  74956    560    7    1 : tunables   54   27    8 : slabdata  10708  10708      0

Is that a slab configuration?  On my slub config, this actually shows
568 even though the structure definition really adds up to 560 bytes.

> 4x as many inodes as there are radix tree nodes in memory(*).
> That's with 55% of memory being used by the page cache. So it's
> pretty clear that tracking radix tree nodes directly might well be
> lower cost than tracking address spaces and/or inodes....
> 
> (*) Note that the inode size is 1024 bytes in the config I'm using, so
> increasing it is going to push it to only 3 inodes per page rather
> than 4, so that extra 16-byte list head means a 25% increase in memory
> consumption for the inode cache, not 2%. The radix tree node can be
> increased by 24 bytes and still fit 7 per page and so not actually
> increase memory consumption at all....

Yes, I really don't like the extra inode cost and the computational
overhead in corner cases.

What I do like is that the shadow entries are in-line and not in an
auxiliary array and that memory consumption of shadow entries is
mostly low, so I'm not eager to change the data structure.  But it
looks like tracking radix tree nodes with a list and backpointers to
the mapping object for the lock etc. will be a major pain in the ass.

I'll see what I can come up with.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-15 17:41       ` Johannes Weiner
@ 2013-10-15 23:41         ` Dave Chinner
  2013-10-16  2:05           ` Rik van Riel
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2013-10-15 23:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel

On Tue, Oct 15, 2013 at 01:41:28PM -0400, Johannes Weiner wrote:
> On Tue, Oct 15, 2013 at 12:41:23PM +1100, Dave Chinner wrote:
> > On Mon, Oct 14, 2013 at 05:42:50PM -0400, Johannes Weiner wrote:
> > > Hi Dave,
> > > 
> > > On Fri, Oct 11, 2013 at 11:39:30AM +1100, Dave Chinner wrote:
> > > > On Thu, Oct 10, 2013 at 05:46:54PM -0400, Johannes Weiner wrote:
> > > > Also, I really don't like the idea of a new inode cache shrinker
> > > > that is completely uncoordinated with the existing inode cache
> > > > shrinkers. It uses a global lock and list and is not node aware so
> > > > all it will do under many workloads is re-introduce a scalability
> > > > choke point we just got rid of in 3.12.
> > > 
> > > Shadow entries are mostly self-regulating and, unlike the inode case,
> > > the shrinker is not the primary means of resource control here.  I
> > > don't think this has the same scalability requirements as inode
> > > shrinking.
> > 
> > Anything that introduces a global lock that needs to be taken in the
> > inode evict() path is a scalability limitation. I've been working to
> > remove all global locks and lists from the evict() path precisely
> > because they severely limit VFS scalability. Hence new code
> > that introduces a global lock and list into hot VFS paths is simply
> > not acceptable any more.
> 
> Fair enough as well.  But do keep in mind that the lock and list are
> only involved when the address space actually had pages evicted from
> it in the past.  As you said, most inodes don't even have pages...

.... because page reclaim typically removes them long before the
inode is evicted from the inode cache.

> > > > I think that you could simply piggy-back on inode_lru_isolate() to
> > > > remove shadow mappings in exactly the same manner it removes inode
> > > > buffers and page cache pages on inodes that are about to be
> > > > reclaimed.  Keeping the size of the inode cache down will have the
> > > > side effect of keeping the shadow mappings under control, and so I
> > > > don't see a need for a separate shrinker at all here.
> > > 
> > > Pinned inodes are not on the LRU, so you could send a machine OOM by
> > > simply catting a single large (sparse) file to /dev/null.
> > 
> > Then you have a serious design flaw if you are relying on a shrinker
> > to control memory consumed by page cache radix trees as a result of
> > page cache reclaim inserting exceptional entries into the radix
> > tree and then forgetting about them.
> 
> I'm not forgetting about them, I just track them very coarsely by
> linking up address spaces and then lazily enforce their upper limit
> when memory is tight by using the shrinker callback.  The assumption
> was that actually scanning them is such a rare event that we trade the
> rare computational costs for smaller memory consumption most of the
> time.

Sure, I understand the tradeoff that you made. But there's nothing
worse than a system that slows down unpredictably because some
magic threshold in some subsystem has been crossed and
computationally expensive operations kick in.

Keep in mind that shrinkers are called in parallel, too, so once the
threshold is crossed you have the possibility of every single CPU in
the system running that shrinker at the same time....

> > To work around this, you keep a global count of exceptional entries
> > and a global list of inodes with such exceptional radix tree
> > entries. The count doesn't really tell you how much memory is used
> > by the radix trees - the same count can mean an order of
> > magnitude difference in actual memory consumption (one shadow entry
> > per radix tree node vs 64) so it's not a very good measure to base
> > memory reclaim behaviour on but it is an inferred (rather than
> > actual) object count.
> 
> Capping shadow entries instead of memory consumption was intentional.
> They should be trimmed based on whether old shadow entries are still
> meaningful and have an effect if refaulted, not based on memory
> pressure.  These entries have an influence on future memory pressure
> so we shouldn't kick them out based on how tight resources are but
> based on whether there are too many expired entries.

Then I suspect that a shrinker is the wrong interface to use, as they
are designed to trim caches when resources are tight. What your
current design will lead to is windup, where it does nothing for
many calls and then when it passes the threshold the delta is so
large that it will ask the shrinker to scan the entire cache.

So, while your intention is that it reacts to expired entry count,
the reality is that it will result in a shadow entry count that
looks like a sawtooth over time instead of a smooth, slowly varying
line that changes value only as workloads change....

The architecture of shrinkers is that they react little by little to
memory pressure to keep all the caches in the system balanced
dynamically, so when memory pressure occurs we don't have the
balance of the system change. Doing nothing until a magic threshold
is reached and then doing lots of work at that point results in
non-deterministic behaviour because the balance and behaviour of the
system will change drastically at that threshold point.  IOWs,
creating a shrinker that only does really expensive operations when
it passes a high threshold is not a good idea from a behavioural
POV.

> Previous implementations of non-resident history from Peter & Rik
> maintained a big system-wide hash table with a constant cost instead
> of using radix tree memory like this.  My idea was that this is cache
> friendlier and memory consumption should be lower in most cases and
> the shrinker is only there to cap the extreme / malicious cases.

Yes, it's an improvement on the hash table in some ways, but in
other ways it is much worse.

> > You walk the inode list by a shrinker and scan radix trees for
> > shadow entries that can be removed. It's expensive to scan radix
> > trees, especially for inodes with large amounts of cached data, so
> > this could do a lot of work to find very little in way of entries to
> > free.
> > 
> > The shrinker doesn't rotate inodes on the list, so it will always
> > scan the same inodes on the list in the same order and so if memory
> > reclaim removes a few pages from an inode with a large amount of
> > cached pages between each shrinker call, then those radix trees will
> > be repeatedly scanned in their entirety on each call to the shrinker.
> >
> > Also, the shrinker only decrements nr_to_scan when it finds an entry
> > to reclaim. nr_to_scan is the number of objects to scan for reclaim,
> > not the number of objects to reclaim. Hence the shrinker will be
> > doing a lot of scanning if there's inodes at the head of the list
> > with large radix trees....
> 
> I realize all of this.  The scanner is absolutely expensive, I just
> didn't care because it's not supposed to run in the first place but
> rather act like an emergency brake.
> 
> Again, the shrinker isn't even called until shadow entries are in
> excess, regardless of how bad memory pressure is.  On the other hand,
> the fact that this code is unneeded most of the time makes the struct
> inode size increase even worse.

Yup, and that's one of the big problems I have with the design.

> > > > And removing the special shrinker will bring the struct inode size
> > > > increase back to only 8 bytes, and I think we can live with that
> > > > increase given the workload improvements that the rest of the
> > > > functionality brings.
> > > 
> > > That would be very desirable indeed.
> > > 
> > > What we would really want is a means of per-zone tracking of
> > > radix_tree_nodes occupied by shadow entries but I can't see a way to
> > > do this without blowing up the radix tree structure at a much bigger
> > > cost than an extra list_head in struct address_space.
> > 
> > Putting a list_head in the radix tree node is likely to have a lower
> > cost than putting one in every inode. Most cached inodes don't have
> > any page cache associated with them. Indeed, my workstation right
> > now shows:
> > 
> > $ sudo grep "radix\|xfs_inode" /proc/slabinfo 
> > xfs_inode         277773 278432   1024    4    1 : tunables   54   27    8 : slabdata  69608  69608      0
> > radix_tree_node    74137  74956    560    7    1 : tunables   54   27    8 : slabdata  10708  10708      0
> 
> Is that a slab configuration?  On my slub config, this actually shows
> 568 even though the structure definition really adds up to 560 bytes.

I'm assuming that it is SLAB - it's the 3.11 kernel that debian
shipped out of experimental. yup:

$ grep SLAB /boot/config-3.11-trunk-amd64 
CONFIG_SLAB=y
CONFIG_SLABINFO=y
# CONFIG_DEBUG_SLAB is not set
$


> Yes, I really don't like the extra inode cost and the computational
> overhead in corner cases.

I think we are agreed on that :)

> What I do like is that the shadow entries are in-line and not in an
> auxiliary array and that memory consumption of shadow entries is
> mostly low, so I'm not eager to change the data structure.

And i'm not disagreeing with you there, either.

> But it
> looks like tracking radix tree nodes with a list and backpointers to
> the mapping object for the lock etc. will be a major pain in the ass.

Perhaps so - it may not work out when we get down to the fine
details...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-15 23:41         ` Dave Chinner
@ 2013-10-16  2:05           ` Rik van Riel
  2013-10-16  2:26             ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Rik van Riel @ 2013-10-16  2:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Andrew Morton, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel

On 10/15/2013 07:41 PM, Dave Chinner wrote:
> On Tue, Oct 15, 2013 at 01:41:28PM -0400, Johannes Weiner wrote:

>> I'm not forgetting about them, I just track them very coarsely by
>> linking up address spaces and then lazily enforce their upper limit
>> when memory is tight by using the shrinker callback.  The assumption
>> was that actually scanning them is such a rare event that we trade the
>> rare computational costs for smaller memory consumption most of the
>> time.
> 
> Sure, I understand the tradeoff that you made. But there's nothing
> worse than a system that slows down unpredictably because some
> magic threshold in some subsystem has been crossed and
> computationally expensive operations kick in.

The shadow shrinker should remove the radix nodes with
the oldest shadow entries first, so true LRU should actually
work for the radix tree nodes.

Actually, since we only care about the age of the youngest
shadow entry in each radix tree node, FIFO will be the same
as LRU for that list.

That means the shrinker can always just take the radix tree
nodes off the end.
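
As a sketch (all names invented for illustration), the reclaim side
then never has to search anything:

        spin_lock(&shadow_lru_lock);
        while (nr_to_scan-- && !list_empty(&shadow_lru)) {
                /* the oldest node sits at the tail of the list */
                struct radix_tree_node *node =
                        list_entry(shadow_lru.prev, struct radix_tree_node, lru);
                /* ... unlink the node and free its shadow entries ... */
        }
        spin_unlock(&shadow_lru_lock);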

>> But it
>> looks like tracking radix tree nodes with a list and backpointers to
>> the mapping object for the lock etc. will be a major pain in the ass.
> 
> Perhaps so - it may not work out when we get down to the fine
> details...

I suspect that a combination of lifetime rules (the inode cannot
disappear until all of its radix tree nodes are gone) and RCU freeing
for the radix tree nodes and the inodes might do the trick.

That would mean that, while holding the rcu read lock, the
back pointer from a radix tree node to the inode will always
point to valid memory.

That allows the shrinker to lock the inode, and verify that
the inode is still valid, before it attempts to rcu free the
radix tree node with shadow entries.

It also means that locking only needs to be in the inode,
and on the LRU list for shadow radix tree nodes.

Does that sound sane?

Am I overlooking something?

-- 
All rights reversed


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-16  2:05           ` Rik van Riel
@ 2013-10-16  2:26             ` Dave Chinner
  2013-10-16 22:31               ` Johannes Weiner
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2013-10-16  2:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, Andrew Morton, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel

On Tue, Oct 15, 2013 at 10:05:26PM -0400, Rik van Riel wrote:
> On 10/15/2013 07:41 PM, Dave Chinner wrote:
> > On Tue, Oct 15, 2013 at 01:41:28PM -0400, Johannes Weiner wrote:
> 
> >> I'm not forgetting about them, I just track them very coarsely by
> >> linking up address spaces and then lazily enforce their upper limit
> >> when memory is tight by using the shrinker callback.  The assumption
> >> was that actually scanning them is such a rare event that we trade the
> >> rare computational costs for smaller memory consumption most of the
> >> time.
> > 
> > Sure, I understand the tradeoff that you made. But there's nothing
> > worse than a system that slows down unpredictably because some
> > magic threshold in some subsystem has been crossed and
> > computationally expensive operations kick in.
> 
> The shadow shrinker should remove the radix nodes with
> the oldest shadow entries first, so true LRU should actually
> work for the radix tree nodes.
> 
> Actually, since we only care about the age of the youngest
> shadow entry in each radix tree node, FIFO will be the same
> as LRU for that list.
> 
> That means the shrinker can always just take the radix tree
> nodes off the end.

Right, but it can't necessarily free the node as it may still have
pointers to pages in it. In that case, it would have to simply
rotate the node to the end of the LRU again.

Unless, of course, we kept track of the number of exceptional
entries in a node and didn't add it to the reclaim list until there
were no non-exceptional entries in the node....

> >> But it
> >> looks like tracking radix tree nodes with a list and backpointers to
> >> the mapping object for the lock etc. will be a major pain in the ass.
> > 
> > Perhaps so - it may not work out when we get down to the fine
> > details...
> 
> I suspect that a combination of lifetime rules (the inode cannot
> disappear until all of its radix tree nodes are gone) and RCU freeing
> for the radix tree nodes and the inodes might do the trick.
> 
> That would mean that, while holding the rcu read lock, the
> back pointer from a radix tree node to the inode will always
> point to valid memory.

Yes, that is what I was thinking...

> That allows the shrinker to lock the inode, and verify that
> the inode is still valid, before it attempts to rcu free the
> radix tree node with shadow entries.

Lock the mapping, not the inode. The radix tree is protected by the
mapping_lock, not an inode lock. i.e. I'd hope that this can all be
contained within the struct address_space and not require any
knowledge of inodes or inode lifecycles at all.
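
In sketch form, with the backpointer field and the liveness check
invented for the example:

        struct address_space *mapping;

        rcu_read_lock();
        mapping = ACCESS_ONCE(node->mapping);   /* hypothetical backpointer */
        if (mapping) {
                spin_lock_irq(&mapping->tree_lock);
                if (mapping_still_live(mapping)) {      /* hypothetical check */
                        /* ... prune the shadow entries in this node ... */
                }
                spin_unlock_irq(&mapping->tree_lock);
        }
        rcu_read_unlock();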

> It also means that locking only needs to be in the inode,
> and on the LRU list for shadow radix tree nodes.
> 
> Does that sound sane?
> 
> Am I overlooking something?

It's pretty much along the same lines of what I was thinking, but
lets see what Johannes thinks.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-16  2:26             ` Dave Chinner
@ 2013-10-16 22:31               ` Johannes Weiner
  2013-10-21 12:10                 ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2013-10-16 22:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rik van Riel, Andrew Morton, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel

On Wed, Oct 16, 2013 at 01:26:06PM +1100, Dave Chinner wrote:
> On Tue, Oct 15, 2013 at 10:05:26PM -0400, Rik van Riel wrote:
> > On 10/15/2013 07:41 PM, Dave Chinner wrote:
> > > On Tue, Oct 15, 2013 at 01:41:28PM -0400, Johannes Weiner wrote:
> > 
> > >> I'm not forgetting about them, I just track them very coarsely by
> > >> linking up address spaces and then lazily enforce their upper limit
> > >> when memory is tight by using the shrinker callback.  The assumption
> > >> was that actually scanning them is such a rare event that we trade the
> > >> rare computational costs for smaller memory consumption most of the
> > >> time.
> > > 
> > > Sure, I understand the tradeoff that you made. But there's nothing
> > > worse than a system that slows down unpredictably because some
> > > magic threshold in some subsystem has been crossed and
> > > computationally expensive operations kick in.
> > 
> > The shadow shrinker should remove the radix nodes with
> > the oldest shadow entries first, so true LRU should actually
> > work for the radix tree nodes.
> > 
> > Actually, since we only care about the age of the youngest
> > shadow entry in each radix tree node, FIFO will be the same
> > as LRU for that list.
> > 
> > That means the shrinker can always just take the radix tree
> > nodes off the end.
> 
> Right, but it can't necessarily free the node as it may still have
> pointers to pages in it. In that case, it would have to simply
> rotate the node to the end of the LRU again.
> 
> Unless, of course, we kept track of the number of exceptional
> entries in a node and didn't add it to the reclaim list until there
> were no non-exceptional entries in the node....

I think that's doable.  As long as there is an actual page in the
node, no memory is going to be freed anyway.  And since we have a full int
per node with only 64 slots, we can easily split the counter.
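
For example, and purely as an illustration of the split (helper names
invented, not the actual patch):

        /*
         * 64 slots need at most 7 bits, so one unsigned int count can
         * hold two counters: pages in the low half, shadow entries in
         * the high half.
         */
        static inline unsigned int node_pages(struct radix_tree_node *node)
        {
                return node->count & 0xffff;
        }

        static inline unsigned int node_shadows(struct radix_tree_node *node)
        {
                return node->count >> 16;
        }

A node would then only go onto the reclaim list once node_pages() drops
to zero while node_shadows() is still non-zero, which matches the
condition you describe above.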

> > >> But it
> > >> looks like tracking radix tree nodes with a list and backpointers to
> > >> the mapping object for the lock etc. will be a major pain in the ass.
> > > 
> > > Perhaps so - it may not work out when we get down to the fine
> > > details...
> > 
> > I suspect that a combination of lifetime rules (the inode cannot
> > disappear until all of its radix tree nodes are gone) and RCU freeing
> > for the radix tree nodes and the inodes might do the trick.
> > 
> > That would mean that, while holding the rcu read lock, the
> > back pointer from a radix tree node to the inode will always
> > point to valid memory.
> 
> Yes, that is what I was thinking...
> 
> > That allows the shrinker to lock the inode, and verify that
> > the inode is still valid, before it attempts to rcu free the
> > radix tree node with shadow entries.
> 
> Lock the mapping, not the inode. The radix tree is protected by the
> mapping_lock, not an inode lock. i.e. I'd hope that this can all be
> contained within the struct address_space and not require any
> knowledge of inodes or inode lifecycles at all.

Agreed, we can point to struct address_space and invalidate it by
setting mapping->host to NULL or so during the RCU grace period.

Also, the parent pointer is in a union with the two-word rcu_head, so
we get the address space backpointer for free.

The struct list_head for the FIFO, as you said, we can get for free as
well (at least on slab).

The FIFO lists themselves can be trimmed by a shrinker, I think.  You
were concerned about wind-up but if the nodes are not in excess and
->count just returns 0 until we are actually prepared to shrink
objects, then there shouldn't be any windup, right?

I don't see a natural threshold for "excess" yet, though, because
there is no relationship between where radix_tree_node is allocated
and which zones the contained shadow entries point to, so it's hard to
estimate how worthwhile any given radix_tree_node is to a node.  Maybe
tying it to an event might be better, like reclaim pressure, swapping
etc.

> > It also means that locking only needs to be in the inode,
> > and on the LRU list for shadow radix tree nodes.
> > 
> > Does that sound sane?
> > 
> > Am I overlooking something?
> 
> It's pretty much along the same lines of what I was thinking, but
> lets see what Johannes thinks.

That sounds great to me.  It's just that I've looked at this code for so long
that I'm getting blind towards it, so I really appreciate your input.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-10 21:46 [patch 0/8] mm: thrash detection-based file cache sizing v5 Johannes Weiner
                   ` (8 preceding siblings ...)
  2013-10-11  0:39 ` [patch 0/8] mm: thrash detection-based file cache sizing v5 Dave Chinner
@ 2013-10-21  9:26 ` Vlastimil Babka
  2013-10-22  9:35   ` Johannes Weiner
  2013-11-14 15:56   ` Rik van Riel
  2013-11-12 10:30 ` Bob Liu
  10 siblings, 2 replies; 24+ messages in thread
From: Vlastimil Babka @ 2013-10-21  9:26 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Tejun Heo, linux-mm,
	linux-fsdevel, linux-kernel

On 10/10/2013 11:46 PM, Johannes Weiner wrote:
> Hi everyone,
> 
> here is an update to the cache sizing patches for 3.13.
> 
> 	Changes in this revision
> 
> o Drop frequency synchronization between refaulted and demoted pages
>   and just straight up activate refaulting pages whose access
>   frequency indicates they could stay in memory.  This was suggested
>   by Rik van Riel a looong time ago but misinterpretation of test
>   results during early stages of development took me a while to
>   overcome.  It's still the same overall concept, but a little simpler
>   and with even faster cache adaptation.  Yay!

Oh, I liked the previous approach with direct competition between the
refaulted and demoted page :) Doesn't the new approach favor the
refaulted page too much? No wonder it leads to faster cache adaptation,
but could it also cause degradations for workloads that don't benefit
from it? Were there any tests for performance regressions on workloads
that were not the target of the patchset?

Vlastimil

> o More extensively document underlying design concepts like the
>   meaning of the refault distance, based on input from Andrew and
>   Tejun.
> 
> o Document the new page cache lookup API which can return shadow
>   entries, based on input from Andrew
> 
> o Document and simplify the synchronization between inode teardown and
>   reclaim planting shadow entries, based on input from Andrew
> 
> o Drop 'time' from names of variables that are not in typical kernel
>   time units like jiffies or seconds, based on input from Andrew
> 
> 	Summary of problem & solution
> 
> The VM maintains cached filesystem pages on two types of lists.  One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past.  We call the recently used
> list "inactive list" and the frequently used list "active list".
>     
> Currently, the VM aims for a 1:1 ratio between the lists, which is the
> "perfect" trade-off between the ability to *protect* frequently used
> pages and the ability to *detect* frequently used pages.  This means
> that working set changes bigger than half of cache memory go
> undetected and thrash indefinitely, whereas working sets bigger than
> half of cache memory are unprotected against used-once streams that
> don't even need caching.
> 
> This happens on file servers and media streaming servers, where the
> popular files and file sections change over time.  Even though the
> individual files might be smaller than half of memory, concurrent
> access to many of them may still result in their inter-reference
> distance being greater than half of memory.  It's also been reported
> as a problem on database workloads that switch back and forth between
> tables that are bigger than half of memory.  In these cases the VM
> never recognizes the new working set and will for the remainder of the
> workload thrash disk data which could easily live in memory.
>     
> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list.  This model gave established
> working sets more gracetime in the face of temporary use-once streams,
> but ultimately was not significantly better than a FIFO policy and
> still thrashed cache based on eviction speed, rather than actual
> demand for cache.
>     
> This series solves the problem by maintaining a history of pages
> evicted from the inactive list, enabling the VM to detect frequently
> used pages regardless of inactive list size and so facilitate working
> set transitions in excess of the inactive list.
> 
> 	Tests
> 	
> The reported database workload is easily demonstrated on a 16G machine
> with two filesets of 12G each.  This fio workload operates on one set first,
> then switches to the other.  The VM should obviously always cache the
> set that the workload is currently using.
> 
> unpatched:
> db1: READ: io=196608MB, aggrb=740405KB/s, minb=740405KB/s, maxb=740405KB/s, mint= 271914msec, maxt= 271914msec
> db2: READ: io=196608MB, aggrb= 68558KB/s, minb= 68558KB/s, maxb= 68558KB/s, mint=2936584msec, maxt=2936584msec
> 
> real    53m29.192s
> user    1m11.544s
> sys     3m34.820s
> 
> patched:
> db1: READ: io=196608MB, aggrb=743750KB/s, minb=743750KB/s, maxb=743750KB/s, mint=270691msec, maxt=270691msec
> db2: READ: io=196608MB, aggrb=383503KB/s, minb=383503KB/s, maxb=383503KB/s, mint=524967msec, maxt=524967msec
> 
> real    13m16.432s
> user    0m56.518s
> sys     2m38.284s
> 
> As can be seen, the unpatched kernel simply never adapts to the
> workingset change and db2 is stuck with secondary storage speed.  The
> patched kernel on the other hand needs 2-3 iterations over db2 before
> it replaces db1 and reaches full memory speed.  Given the unbounded
> negative effect of the existing VM behavior, these patches should be
> considered correctness fixes rather than performance optimizations.
> 
> Another test resembles a fileserver or streaming server workload,
> where data in excess of memory size is accessed at different
> frequencies.  First, there is very hot data accessed at a high
> frequency.  Machines should be fitted so that the hot set of such a
> workload can be fully cached or all bets are off.  Then there is a
> very big (compared to available memory) set of data that is used-once
> or at a very low frequency; this is what drives the inactive list and
> does not really benefit from caching.  Lastly, there is a big set of
> warm data in between that is accessed at medium frequencies and would
> benefit from caching the pages between the first and last streamer of
> each burst as much as possible.
> 
> unpatched:
> cold: READ: io= 30720MB, aggrb= 24808KB/s, minb= 24808KB/s, maxb= 24808KB/s, mint=1267987msec, maxt=1267987msec
>  hot: READ: io=174080MB, aggrb=158225KB/s, minb=158225KB/s, maxb=158225KB/s, mint=1126605msec, maxt=1126605msec
> warm: READ: io=102400MB, aggrb=112403KB/s, minb= 22480KB/s, maxb= 33010KB/s, mint= 635291msec, maxt= 932866msec
> 
> real    21m15.924s
> user    17m36.036s
> sys     3m5.117s
> 
> patched:
> cold: READ: io= 30720MB, aggrb= 27451KB/s, minb= 27451KB/s, maxb= 27451KB/s, mint=1145932msec, maxt=1145932msec
>  hot: READ: io=174080MB, aggrb=158617KB/s, minb=158617KB/s, maxb=158617KB/s, mint=1123822msec, maxt=1123822msec
> warm: READ: io=102400MB, aggrb=131964KB/s, minb= 26392KB/s, maxb= 40164KB/s, mint= 522141msec, maxt= 794592msec
> 
> real    19m22.671s
> user    19m33.838s
> sys     2m39.652s
> 
> In both kernels, the hot set is propagated to the active list and then
> served from cache for the duration of the workload.
> 
> In both kernels, the beginning of the warm set is propagated to the
> active list as well, but in the unpatched case the active list
> eventually takes up half of memory and does not leave enough space
> anymore for many new warm pages to get activated and so they start
> thrashing while a significant part of the active list is now stale.
> The patched kernel on the other hand constantly challenges the active
> pages based on refaults and so manages to keep a cache window rolling
> through the warm data set.  This frees up IO bandwidth for the cold
> set as well.
> 
> For reference, this same test was performed with the traditional
> demotion mechanism, where deactivation is coupled to inactive list
> reclaim.  However, this had the same outcome as the unpatched kernel:
> while the warm set does indeed get activated continuously, it is
> forced out of the active list by inactive list pressure, which is
> dictated primarily by the unrelated cold set.  The warm set is evicted
> before subsequent streamers can benefit from it, even though there
> would be enough space available to cache the pages of interest.
> 
> 	Costs
> 
> These patches increase struct inode by three words to manage shadow
> entries in the page cache radix tree.  However, given that a typical
> inode (like the ext4 inode) is already 1k in size, this is not much.
> It's a 2% size increase for a reclaimable object.  fs_mark metadata
> tests with millions of inodes did not show a measurable difference.
> And as soon as there is any file data involved, the page cache pages
> dominate the memory cost anyway.
> 
> A much harder cost to estimate is the size of the page cache radix
> tree.  Page reclaim used to shrink the radix trees but now the tree
> nodes are reused for shadow entries, and so the cost depends heavily
> on the page cache access patterns.  However, with workloads that
> maintain spatial or temporal locality, the shadow entries are either
> refaulted quickly or reclaimed along with the inode object itself.
> Workloads that will experience a memory cost increase are those that
> don't really benefit from caching in the first place.
> 
> A more predictable alternative would be a fixed-cost separate pool of
> shadow entries, but this would incur relatively higher memory cost for
> well-behaved workloads for the benefit of corner cases.  It would also
> make the shadow entry lookup more costly compared to storing them
> directly in the cache structure.
> 
> The biggest impact on the existing VM fastpaths is an extra branch in
> page cache lookups to filter out shadow entries.  shmem already has
> this check, though, since it stores swap entries alongside regular
> pages inside its page cache radix trees.
> 
> 	Future
> 
> Right now we have a fixed ratio (50:50) between inactive and active
> list but we already have complaints about working sets exceeding half
> of memory being pushed out of the cache by simple used-once streaming
> in the background.  Ultimately, we want to adjust this ratio and allow
> for a much smaller inactive list.  These patches are an essential step
> in this direction because they decouple the VM's ability to detect
> working set changes from the inactive list size.  This would allow us
> to base the inactive list size on something more sensible, like the
> combined readahead window size for example.
> 
> Please consider merging.  Thank you!
> 
>  fs/block_dev.c                   |   2 +-
>  fs/btrfs/compression.c           |   2 +-
>  fs/cachefiles/rdwr.c             |  33 +--
>  fs/inode.c                       |  11 +-
>  fs/nfs/blocklayout/blocklayout.c |   2 +-
>  fs/nilfs2/inode.c                |   4 +-
>  include/linux/fs.h               |   2 +
>  include/linux/mm.h               |   8 +
>  include/linux/mmzone.h           |   6 +
>  include/linux/pagemap.h          |  33 ++-
>  include/linux/pagevec.h          |   3 +
>  include/linux/radix-tree.h       |   5 +-
>  include/linux/shmem_fs.h         |   1 +
>  include/linux/swap.h             |   9 +
>  include/linux/writeback.h        |   1 +
>  lib/radix-tree.c                 | 106 ++------
>  mm/Makefile                      |   2 +-
>  mm/filemap.c                     | 335 +++++++++++++++++++++---
>  mm/mincore.c                     |  20 +-
>  mm/page-writeback.c              |   2 +-
>  mm/readahead.c                   |   6 +-
>  mm/shmem.c                       | 122 +++------
>  mm/swap.c                        |  49 ++++
>  mm/truncate.c                    |  78 ++++--
>  mm/vmscan.c                      |  24 +-
>  mm/vmstat.c                      |   3 +
>  mm/workingset.c                  | 506 +++++++++++++++++++++++++++++++++++++
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-16 22:31               ` Johannes Weiner
@ 2013-10-21 12:10                 ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2013-10-21 12:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Rik van Riel, Andrew Morton, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel

On Wed, Oct 16, 2013 at 06:31:04PM -0400, Johannes Weiner wrote:
> On Wed, Oct 16, 2013 at 01:26:06PM +1100, Dave Chinner wrote:
> > On Tue, Oct 15, 2013 at 10:05:26PM -0400, Rik van Riel wrote:
> > > On 10/15/2013 07:41 PM, Dave Chinner wrote:
> > > > On Tue, Oct 15, 2013 at 01:41:28PM -0400, Johannes Weiner wrote:
> > > >> But it
> > > >> looks like tracking radix tree nodes with a list and backpointers to
> > > >> the mapping object for the lock etc. will be a major pain in the ass.
> > > > 
> > > > Perhaps so - it may not work out when we get down to the fine
> > > > details...
> > > 
> > > I suspect that a combination of lifetime rules (inode cannot
> > > disappear until all the radix tree nodes) and using RCU free
> > > for the radix tree nodes, and the inodes might do the trick.
> > > 
> > > That would mean that, while holding the rcu read lock, the
> > > back pointer from a radix tree node to the inode will always
> > > point to valid memory.
> > 
> > Yes, that is what I was thinking...
> > 
> > > That allows the shrinker to lock the inode, and verify that
> > > the inode is still valid, before it attempts to rcu free the
> > > radix tree node with shadow entries.
> > 
> > Lock the mapping, not the inode. The radix tree is protected by the
> > mapping_lock, not an inode lock. i.e. I'd hope that this can all be
> > contained within the struct address_space and not require any
> > knowledge of inodes or inode lifecycles at all.
> 
> Agreed, we can point to struct address_space and invalidate it by
> setting mapping->host to NULL or so during the RCU grace period.
> 
> Also, the parent pointer is in a union with the two-word rcu_head, so
> we get the address space backpointer for free.
> 
> The struct list_head for the FIFO, as you said, we can get for free as
> well (at least on slab).
> 
> The FIFO lists themselves can be trimmed by a shrinker, I think.  You
> were concerned about wind-up but if the nodes are not in excess and
> ->count just returns 0 until we are actually prepared to shrink
> objects, then there shouldn't be any windup, right?

It's not windup that will be the problem, it's the step change of
going from zero cache items to the global counter value when the
magic threshold is crossed.  That will generate a massive delta, and
hence a huge amount of work to be done from a single shrinker call,
executed by the next context that can run it.

i.e. the problem is that the cache size goes from 0 to something
huge in an instant, and will drop from something huge to zero just
as quickly. There is no way the shrinker can prevent overshoot and
oscillation around that magic threshold because the nature of the
step change overwhelms the damping algorithms in the shrinker used
to control the amount of work being done and hence reach stability.

Normally, the shrinker sees the size of the cache change gradually,
and so the delta tends to be relatively small and so it does a
little bit of work every time it is called. This keeps the caches
balanced with all the other caches that are trimmed a little at a
time. IOWs, there is a natural damping algorithm built into the
shrinkers that biases them towards a stable, balanced condition,
even under changing workloads.
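
To put that in concrete terms, the scan target handed to
->scan_objects is derived from the reported count roughly like this
(paraphrasing the shrink_slab() proportioning from memory, not quoting
it):

static unsigned long scan_delta(unsigned long freeable,
                                unsigned long nr_pages_scanned,
                                unsigned long lru_pages, int seeks)
{
        unsigned long delta;

        delta = (4 * nr_pages_scanned) / seeks; /* page reclaim pressure */
        delta *= freeable;                      /* scaled by the reported count */
        return delta / (lru_pages + 1);         /* relative to the page LRUs */
}

A count that creeps up produces a steady stream of small deltas; a
count that steps from zero to something huge hands all of that work to
whichever context calls the shrinker next.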

Windup occurs when that little bit of work keeps getting delayed and
aggregated (e.g. repeated GFP_NOFS reclaim contexts), which then gets
dumped on a single scan that can make progress (e.g. kswapd).  So
windup is an iterative process that triggers "catchup work".  The
macro-level behaviour might end up looking the same, but the two have
very different underlying algorithmic causes.

> I don't see a natural threshold for "excess" yet, though, because
> there is no relationship between where radix_tree_node is allocated
> and which zones the contained shadow entries point to, so it's hard to
> estimate how worthwhile any given radix_tree_node is to a node.  Maybe
> tying it to an event might be better, like reclaim pressure, swapping
> etc.

The shrinker is provided a measure of reclaim pressure already,
which it uses to balance the cache sizes. The shadow entries are no
different from that perspective. You can't let them overrun the
system, but you want to keep a respectable number of them around to
keep (some metric of) performance within respectable bounds.  IOWs,
those are the same constraints as most other caches (e.g. inode and
dentry caches) with a shrinker, and so I don't see any
reason why it needs some magic threshold to avoid being shrunk
proportionally like all other caches...

Indeed, as memory pressure gets higher, the value of keeping lots of
shadow entries around goes down because if there is lots of
refaulting occurring then the number of shadow entries will be
shrinking as a natural side effect of replacing shadow entries with
real pages.

If the memory pressure is not causing refaults to occur, then the
shadow entries are using memory that could otherwise be put to
better use, and so we should reclaim them in proportion to the
memory pressure.

If you use lazy lists, in the first case the scanner will expend
most of its work removing radix tree nodes from the list because they
have pages in them again, and so the shrinker does cleanup work
rather than reclaim work. If the second case occurs, then the
shrinker does reclaim work to free the radix tree nodes so the
memory can be put to better use. No magic thresholds are needed at
all...
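
i.e. the scan side of a lazy list ends up looking something like this
(the shadow_* and node_has_pages()/evict_shadow_node() names are
invented, it's just to show the cleanup-vs-reclaim split):

static unsigned long shadow_lru_scan(struct shrinker *shrink,
                                     struct shrink_control *sc)
{
        struct radix_tree_node *node, *next;
        unsigned long nr = sc->nr_to_scan;
        unsigned long freed = 0;

        spin_lock(&shadow_lru_lock);
        list_for_each_entry_safe(node, next, &shadow_lru, lru) {
                if (!nr--)
                        break;
                if (node_has_pages(node)) {
                        /* repopulated since eviction: cleanup work */
                        list_del_init(&node->lru);
                        continue;
                }
                /* still shadow-only: reclaim work, frees the node */
                if (evict_shadow_node(node))
                        freed++;
        }
        spin_unlock(&shadow_lru_lock);
        return freed;
}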

> > > It also means that locking only needs to be in the inode,
> > > and on the LRU list for shadow radix tree nodes.
> > > 
> > > Does that sound sane?
> > > 
> > > Am I overlooking something?
> > 
> > It's pretty much along the same lines as what I was thinking, but
> > let's see what Johannes thinks.
> 
> That sounds great to me.  I've just been looking at this code for so
> long that I'm going blind to it, so I really appreciate your input.

I think we all suffer from that problem from time to time :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-21  9:26 ` Vlastimil Babka
@ 2013-10-22  9:35   ` Johannes Weiner
  2013-11-14 15:56   ` Rik van Riel
  1 sibling, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2013-10-22  9:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

On Mon, Oct 21, 2013 at 11:26:43AM +0200, Vlastimil Babka wrote:
> On 10/10/2013 11:46 PM, Johannes Weiner wrote:
> > Hi everyone,
> > 
> > here is an update to the cache sizing patches for 3.13.
> > 
> > 	Changes in this revision
> > 
> > o Drop frequency synchronization between refaulted and demoted pages
> >   and just straight up activate refaulting pages whose access
> >   frequency indicates they could stay in memory.  This was suggested
> >   by Rik van Riel a looong time ago but misinterpretation of test
> >   results during early stages of development took me a while to
> >   overcome.  It's still the same overall concept, but a little simpler
> >   and with even faster cache adaptation.  Yay!
> 
> Oh, I liked the previous approach with direct competition between the
> refaulted and demoted page :) Doesn't the new approach favor the
> refaulted page too much? No wonder it leads to faster cache adaptation,
> but could it also cause degradations for workloads that don't benefit
> from it? Were there any tests for performance regressions on workloads
> that were not the target of the patchset?

If anything, it's unfair to refaulting pages because they require 3
references before they are activated, instead of the regular 2.

We don't do the direct competition for regular in-core activation,
either, which has the same theoretical problem but was never an issue
in the real world.  Not that I know of anyway.
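
To spell out what "straight up activate" means, the refault test boils
down to something like this; the name is made up and the details of
how the distance is maintained are hand-waved away, so treat it as a
sketch rather than an excerpt from the patches:

static bool refault_should_activate(unsigned long refault_distance,
                                    unsigned long nr_active_file)
{
        /*
         * The page would have stayed resident if the space occupied
         * by the active list had been available to the inactive list,
         * so activate it right away instead of sending it around the
         * inactive list again.
         */
        return refault_distance <= nr_active_file;
}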

I ran a standard battery of mmtests (kernbench, dbench, postmark,
micro, fsmark, what have you) and did not notice any regressions.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-10 21:46 [patch 0/8] mm: thrash detection-based file cache sizing v5 Johannes Weiner
                   ` (9 preceding siblings ...)
  2013-10-21  9:26 ` Vlastimil Babka
@ 2013-11-12 10:30 ` Bob Liu
  2013-11-14 15:53   ` Rik van Riel
  10 siblings, 1 reply; 24+ messages in thread
From: Bob Liu @ 2013-11-12 10:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, Linux-MM, linux-fsdevel,
	Linux-Kernel

Hi Johannes,

On Fri, Oct 11, 2013 at 5:46 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>         Future
>
> Right now we have a fixed ratio (50:50) between inactive and active
> list but we already have complaints about working sets exceeding half
> of memory being pushed out of the cache by simple used-once streaming
> in the background.  Ultimately, we want to adjust this ratio and allow
> for a much smaller inactive list.  These patches are an essential step
> in this direction because they decouple the VM's ability to detect
> working set changes from the inactive list size.  This would allow us
> to base the inactive list size on something more sensible, like the
> combined readahead window size for example.
>

I found that this patchset has a similar purpose to
Zcache (http://lwn.net/Articles/562254/) in some way.

Zcache uses the cleancache API to compress clean file pages so as to
keep them in memory, and only pages that used to be on the active file
list are considered. Through Zcache we can hold more active pages in
memory without being affected by streaming workloads.
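
Roughly, the hook points look like this; this is only a simplified
sketch of how a read path and clean-page eviction interact with
cleancache, see include/linux/cleancache.h and
Documentation/vm/cleancache.txt for the real hook placement:

static int read_page_through_cleancache(struct file *file,
                                        struct address_space *mapping,
                                        struct page *page)
{
        if (cleancache_get_page(page) == 0) {
                /* the backend still had a compressed copy: no disk I/O */
                SetPageUptodate(page);
                unlock_page(page);
                return 0;
        }
        return mapping->a_ops->readpage(file, page);
}

static void drop_clean_page(struct page *page)
{
        /* offer the page to the backend before it is dropped */
        if (PageUptodate(page))
                cleancache_put_page(page);
}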

Perhaps you can take a look at the zcache approach; your workloads are
very likely to benefit from it.
And zcache doesn't need so many changes to the core VM/VFS subsystems
because it's based on the cleancache API, which I think makes it more
acceptable.

-- 
Regards,
--Bob

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-11-12 10:30 ` Bob Liu
@ 2013-11-14 15:53   ` Rik van Riel
  0 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2013-11-14 15:53 UTC (permalink / raw)
  To: Bob Liu
  Cc: Johannes Weiner, Andrew Morton, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Vlastimil Babka, Tejun Heo, Linux-MM, linux-fsdevel,
	Linux-Kernel

On 11/12/2013 05:30 AM, Bob Liu wrote:
> Hi Johannes,
>
> On Fri, Oct 11, 2013 at 5:46 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>          Future
>>
>> Right now we have a fixed ratio (50:50) between inactive and active
>> list but we already have complaints about working sets exceeding half
>> of memory being pushed out of the cache by simple used-once streaming
>> in the background.  Ultimately, we want to adjust this ratio and allow
>> for a much smaller inactive list.  These patches are an essential step
>> in this direction because they decouple the VM's ability to detect
>> working set changes from the inactive list size.  This would allow us
>> to base the inactive list size on something more sensible, like the
>> combined readahead window size for example.
>>
>
> I found that this patchset have the similar purpose as
> Zcache(http://lwn.net/Articles/562254/) in some way.

Sorry, but that is unrelated.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [patch 0/8] mm: thrash detection-based file cache sizing v5
  2013-10-21  9:26 ` Vlastimil Babka
  2013-10-22  9:35   ` Johannes Weiner
@ 2013-11-14 15:56   ` Rik van Riel
  1 sibling, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2013-11-14 15:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Johannes Weiner, Andrew Morton, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, Tejun Heo, linux-mm, linux-fsdevel, linux-kernel

On 10/21/2013 05:26 AM, Vlastimil Babka wrote:
> On 10/10/2013 11:46 PM, Johannes Weiner wrote:
>> Hi everyone,
>>
>> here is an update to the cache sizing patches for 3.13.
>>
>> 	Changes in this revision
>>
>> o Drop frequency synchronization between refaulted and demoted pages
>>    and just straight up activate refaulting pages whose access
>>    frequency indicates they could stay in memory.  This was suggested
>>    by Rik van Riel a looong time ago but misinterpretation of test
>>    results during early stages of development took me a while to
>>    overcome.  It's still the same overall concept, but a little simpler
>>    and with even faster cache adaptation.  Yay!
>
> Oh, I liked the previous approach with direct competition between the
> refaulted and demoted page :) Doesn't the new approach favor the
> refaulted page too much? No wonder it leads to faster cache adaptation,
> but could it also cause degradations for workloads that don't benefit
> from it? Were there any tests for performance regressions on workloads
> that were not the target of the patchset?

This is a good question, and one that is probably
best settled through experimentation.

Even with the first scheme (faulting the refaulted page back
onto the inactive list), those pages only need 2 accesses
to be promoted to the active list.

That is because a refault tends to be immediately
followed by an access (after all, it is the attempted
access that causes the page to get loaded back into memory).
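
For reference, the promotion step is essentially the two-stage check
in mark_page_accessed(), reproduced here from memory and simplified,
so treat it as a sketch rather than the literal source:

static void page_accessed(struct page *page)
{
        if (!PageActive(page) && !PageUnevictable(page) &&
            PageReferenced(page) && PageLRU(page)) {
                /* second access while on the inactive list: promote */
                activate_page(page);
                ClearPageReferenced(page);
        } else if (!PageReferenced(page)) {
                /* first access: only remember it in PG_referenced */
                SetPageReferenced(page);
        }
}

The access that triggers the refault supplies the first of those
references, so one more touch is enough to reach the active list.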

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-11-14 15:57 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-10 21:46 [patch 0/8] mm: thrash detection-based file cache sizing v5 Johannes Weiner
2013-10-10 21:46 ` [patch 1/8] fs: cachefiles: use add_to_page_cache_lru() Johannes Weiner
2013-10-10 21:46 ` [patch 2/8] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
2013-10-10 21:46 ` [patch 3/8] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
2013-10-10 21:46 ` [patch 4/8] mm: filemap: move radix tree hole searching here Johannes Weiner
2013-10-10 21:46 ` [patch 5/8] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
2013-10-10 21:47 ` [patch 6/8] mm + fs: store shadow entries in page cache Johannes Weiner
2013-10-10 21:47 ` [patch 7/8] mm: thrash detection-based file cache sizing Johannes Weiner
2013-10-10 21:47 ` [patch 8/8] mm: workingset: keep shadow entries in check Johannes Weiner
2013-10-11  0:39 ` [patch 0/8] mm: thrash detection-based file cache sizing v5 Dave Chinner
2013-10-14 21:42   ` Johannes Weiner
2013-10-15  1:41     ` Dave Chinner
2013-10-15 10:27       ` Jan Kara
2013-10-15 17:41       ` Johannes Weiner
2013-10-15 23:41         ` Dave Chinner
2013-10-16  2:05           ` Rik van Riel
2013-10-16  2:26             ` Dave Chinner
2013-10-16 22:31               ` Johannes Weiner
2013-10-21 12:10                 ` Dave Chinner
2013-10-21  9:26 ` Vlastimil Babka
2013-10-22  9:35   ` Johannes Weiner
2013-11-14 15:56   ` Rik van Riel
2013-11-12 10:30 ` Bob Liu
2013-11-14 15:53   ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).