linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/9] mm: thrash detection-based file cache sizing v8
@ 2014-01-10 18:10 Johannes Weiner
  2014-01-10 18:10 ` [patch 1/9] fs: cachefiles: use add_to_page_cache_lru() Johannes Weiner
                   ` (8 more replies)
  0 siblings, 9 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Hi,

version 8 of this series contains a small cleanup in the shadow
shrinker's list_lru usage, as suggested by Dave Chinner.  Other than
that the series has now stabilized, and is, in Andrew's opinion, "a
very readable patchset."  But see for yourself!

Thanks.

	Changes in this revision

o Changed the list_lru interface to provide LRU_REMOVED_RETRY for
  reclaimers that have to drop the lru lock even in the event of a
  successful reclaim.  Suggested by Dave Chinner.

	Summary

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".
    
Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages.  This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching.

This happens on file servers and media streaming servers, where the
popular files and file sections change over time.  Even though the
individual files might be smaller than half of memory, concurrent
access to many of them may still result in their inter-reference
distance being greater than half of memory.  It's also been reported
as a problem on database workloads that switch back and forth between
tables that are bigger than half of memory.  In these cases the VM
never recognizes the new working set and will for the remainder of the
workload thrash disk data which could easily live in memory.
    
Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more gracetime in the face of temporary use-once streams,
but ultimately was not significantly better than a FIFO policy and
still thrashed cache based on eviction speed, rather than actual
demand for cache.
    
This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to detect frequently
used pages regardless of inactive list size and facilitate working set
transitions.

	Tests

The reported database workload is easily demonstrated on a 8G machine
with two filesets a 6G.  This fio workload operates on one set first,
then switches to the other.  The VM should obviously always cache the
set that the workload is currently using.

This test is based on a problem encountered by Citus Data customers:
http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data

unpatched:
db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec
db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
sdb: ios=835750/4, merge=2/1, ticks=4659739/60016, in_queue=4719203, util=98.92%

real    27m15.541s
user    0m19.059s
sys     0m51.459s

patched:
db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec
db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec
sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=1015923, util=90.40%

real    6m8.630s
user    0m14.714s
sys     0m31.233s

As can be seen, the unpatched kernel simply never adapts to the
workingset change and db2 is stuck indefinitely with secondary storage
speed.  The patched kernel needs 2-3 iterations over db2 before it
replaces db1 and reaches full memory speed.  Given the unbounded
negative affect of the existing VM behavior, these patches should be
considered correctness fixes rather than performance optimizations.

Another test resembles a fileserver or streaming server workload,
where data in excess of memory size is accessed at different
frequencies.  There is very hot data accessed at a high frequency.
Machines should be fitted so that the hot set of such a workload can
be fully cached or all bets are off.  Then there is a very big
(compared to available memory) set of data that is used-once or at a
very low frequency; this is what drives the inactive list and does not
really benefit from caching.  Lastly, there is a big set of warm data
in between that is accessed at medium frequencies and benefits from
caching the pages between the first and last streamer of each burst.

unpatched:
 hot: READ: io=128000MB, aggrb=160693KB/s, minb=160693KB/s, maxb=160693KB/s, mint=815665msec, maxt=815665msec
warm: READ: io= 81920MB, aggrb=109853KB/s, minb= 27463KB/s, maxb= 29244KB/s, mint=717110msec, maxt=763617msec
cold: READ: io= 30720MB, aggrb= 35245KB/s, minb= 35245KB/s, maxb= 35245KB/s, mint=892530msec, maxt=892530msec
 sdb: ios=797960/4, merge=11763/1, ticks=4307910/796, in_queue=4308380, util=100.00%

patched:
 hot: READ: io=128000MB, aggrb=160678KB/s, minb=160678KB/s, maxb=160678KB/s, mint=815740msec, maxt=815740msec
warm: READ: io= 81920MB, aggrb=147747KB/s, minb= 36936KB/s, maxb= 40960KB/s, mint=512000msec, maxt=567767msec
cold: READ: io= 30720MB, aggrb= 40960KB/s, minb= 40960KB/s, maxb= 40960KB/s, mint=768000msec, maxt=768000msec
 sdb: ios=596514/4, merge=9341/1, ticks=2395362/997, in_queue=2396484, util=79.18%

In both kernels, the hot set is propagated to the active list and then
served from cache.

In both kernels, the beginning of the warm set is propagated to the
active list as well, but in the unpatched case the active list
eventually takes up half of memory and no new pages from the warm set
get activated, despite repeated access, and despite most of the active
list soon being stale.  The patched kernel on the other hand detects
the thrashing and manages to keep this cache window rolling through
the data set.  This frees up enough IO bandwidth that the cold set is
served at full speed as well and disk utilization even drops by 20%.

For reference, this same test was performed with the traditional
demotion mechanism, where deactivation is coupled to inactive list
reclaim.  However, this had the same outcome as the unpatched kernel:
while the warm set does indeed get activated continuously, it is
forced out of the active list by inactive list pressure, which is
dictated primarily by the unrelated cold set.  The warm set is evicted
before subsequent streamers can benefit from it, even though there
would be enough space available to cache the pages of interest.

	Costs

Page reclaim used to shrink the radix trees but now the tree nodes are
reused for shadow entries, where the cost depends heavily on the page
cache access patterns.  However, with workloads that maintain spatial
or temporal locality, the shadow entries are either refaulted quickly
or reclaimed along with the inode object itself.  Workloads that will
experience a memory cost increase are those that don't really benefit
from caching in the first place.

A more predictable alternative would be a fixed-cost separate pool of
shadow entries, but this would incur relatively higher memory cost for
well-behaved workloads at the benefit of cornercases.  It would also
make the shadow entry lookup more costly compared to storing them
directly in the cache structure.

	Future

Right now we have a fixed ratio (50:50) between inactive and active
list but we already have complaints about working sets exceeding half
of memory being pushed out of the cache by simple streaming in the
background.  Ultimately, we want to adjust this ratio and allow for a
much smaller inactive list.  These patches are an essential step in
this direction because they decouple the VMs ability to detect working
set changes from the inactive list size.  This would allow us to base
the inactive list size on the combined readahead window size for
example and potentially protect a much bigger working set.

Another possibility of having thrashing information would be to
revisit the idea of local reclaim in the form of zero-config memory
control groups.  Instead of having allocating tasks go straight to
global reclaim, they could try to reclaim the pages in the memcg they
are part of first as long as the group is not thrashing.  This would
allow a user to drop e.g. a back-up job in an otherwise unconfigured
memcg and it would only inflate (and possibly do global reclaim) until
it has enough memory to do proper readahead.  But once it reaches that
point and stops thrashing it would just recycle its own used-once
pages without kicking out the cache of any other tasks in the system
more than necessary.

To simplify the merging process, this patch set is implementing thrash
detection on a global per-zone level only for now, but the design is
such that it can be extended to memory cgroups as well.  All we need
to do is store the unique cgroup ID along the node and zone identifier
inside the eviction cookie to identify the lruvec.

 Documentation/filesystems/porting               |   6 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c |   2 +-
 fs/9p/vfs_inode.c                               |   2 +-
 fs/affs/inode.c                                 |   2 +-
 fs/afs/inode.c                                  |   2 +-
 fs/bfs/inode.c                                  |   2 +-
 fs/block_dev.c                                  |   4 +-
 fs/btrfs/compression.c                          |   2 +-
 fs/btrfs/inode.c                                |   2 +-
 fs/cachefiles/rdwr.c                            |  33 +-
 fs/cifs/cifsfs.c                                |   2 +-
 fs/coda/inode.c                                 |   2 +-
 fs/ecryptfs/super.c                             |   2 +-
 fs/exofs/inode.c                                |   2 +-
 fs/ext2/inode.c                                 |   2 +-
 fs/ext3/inode.c                                 |   2 +-
 fs/ext4/inode.c                                 |   4 +-
 fs/f2fs/inode.c                                 |   2 +-
 fs/fat/inode.c                                  |   2 +-
 fs/freevxfs/vxfs_inode.c                        |   2 +-
 fs/fuse/inode.c                                 |   2 +-
 fs/gfs2/super.c                                 |   2 +-
 fs/hfs/inode.c                                  |   2 +-
 fs/hfsplus/super.c                              |   2 +-
 fs/hostfs/hostfs_kern.c                         |   2 +-
 fs/hpfs/inode.c                                 |   2 +-
 fs/inode.c                                      |   4 +-
 fs/jffs2/fs.c                                   |   2 +-
 fs/jfs/inode.c                                  |   4 +-
 fs/logfs/readwrite.c                            |   2 +-
 fs/minix/inode.c                                |   2 +-
 fs/ncpfs/inode.c                                |   2 +-
 fs/nfs/blocklayout/blocklayout.c                |   2 +-
 fs/nfs/inode.c                                  |   2 +-
 fs/nfs/nfs4super.c                              |   2 +-
 fs/nilfs2/inode.c                               |   6 +-
 fs/ntfs/inode.c                                 |   2 +-
 fs/ocfs2/inode.c                                |   4 +-
 fs/omfs/inode.c                                 |   2 +-
 fs/proc/inode.c                                 |   2 +-
 fs/reiserfs/inode.c                             |   2 +-
 fs/sysfs/inode.c                                |   2 +-
 fs/sysv/inode.c                                 |   2 +-
 fs/ubifs/super.c                                |   2 +-
 fs/udf/inode.c                                  |   4 +-
 fs/ufs/inode.c                                  |   2 +-
 fs/xfs/xfs_super.c                              |   2 +-
 include/linux/fs.h                              |   1 +
 include/linux/list_lru.h                        |   2 +
 include/linux/mm.h                              |   9 +
 include/linux/mmzone.h                          |   6 +
 include/linux/pagemap.h                         |  33 +-
 include/linux/pagevec.h                         |   3 +
 include/linux/radix-tree.h                      |  55 ++-
 include/linux/shmem_fs.h                        |   1 +
 include/linux/swap.h                            |   6 +
 lib/radix-tree.c                                | 383 ++++++++++----------
 mm/Makefile                                     |   2 +-
 mm/filemap.c                                    | 403 +++++++++++++++++++---
 mm/list_lru.c                                   |   8 +
 mm/mincore.c                                    |  20 +-
 mm/readahead.c                                  |   6 +-
 mm/shmem.c                                      | 122 ++-----
 mm/swap.c                                       |  49 +++
 mm/truncate.c                                   | 141 +++++++-
 mm/vmscan.c                                     |  24 +-
 mm/vmstat.c                                     |   3 +
 mm/workingset.c                                 | 374 ++++++++++++++++++++
 68 files changed, 1348 insertions(+), 448 deletions(-)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 1/9] fs: cachefiles: use add_to_page_cache_lru()
  2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
@ 2014-01-10 18:10 ` Johannes Weiner
  2014-01-13  1:17   ` Minchan Kim
  2014-01-10 18:10 ` [patch 2/9] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2014-01-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

This code used to have its own lru cache pagevec up until a0b8cab3
("mm: remove lru parameter from __pagevec_lru_add and remove parts of
pagevec API").  Now it's just add_to_page_cache() followed by
lru_cache_add(), might as well use add_to_page_cache_lru() directly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 fs/cachefiles/rdwr.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index ebaff368120d..4b1fb5ca65b8 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -265,24 +265,22 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
 				goto nomem_monitor;
 		}
 
-		ret = add_to_page_cache(newpage, bmapping,
-					netpage->index, cachefiles_gfp);
+		ret = add_to_page_cache_lru(newpage, bmapping,
+					    netpage->index, cachefiles_gfp);
 		if (ret == 0)
 			goto installed_new_backing_page;
 		if (ret != -EEXIST)
 			goto nomem_page;
 	}
 
-	/* we've installed a new backing page, so now we need to add it
-	 * to the LRU list and start it reading */
+	/* we've installed a new backing page, so now we need to start
+	 * it reading */
 installed_new_backing_page:
 	_debug("- new %p", newpage);
 
 	backpage = newpage;
 	newpage = NULL;
 
-	lru_cache_add_file(backpage);
-
 read_backing_page:
 	ret = bmapping->a_ops->readpage(NULL, backpage);
 	if (ret < 0)
@@ -510,24 +508,23 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 					goto nomem;
 			}
 
-			ret = add_to_page_cache(newpage, bmapping,
-						netpage->index, cachefiles_gfp);
+			ret = add_to_page_cache_lru(newpage, bmapping,
+						    netpage->index,
+						    cachefiles_gfp);
 			if (ret == 0)
 				goto installed_new_backing_page;
 			if (ret != -EEXIST)
 				goto nomem;
 		}
 
-		/* we've installed a new backing page, so now we need to add it
-		 * to the LRU list and start it reading */
+		/* we've installed a new backing page, so now we need
+		 * to start it reading */
 	installed_new_backing_page:
 		_debug("- new %p", newpage);
 
 		backpage = newpage;
 		newpage = NULL;
 
-		lru_cache_add_file(backpage);
-
 	reread_backing_page:
 		ret = bmapping->a_ops->readpage(NULL, backpage);
 		if (ret < 0)
@@ -538,8 +535,8 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 	monitor_backing_page:
 		_debug("- monitor add");
 
-		ret = add_to_page_cache(netpage, op->mapping, netpage->index,
-					cachefiles_gfp);
+		ret = add_to_page_cache_lru(netpage, op->mapping,
+					    netpage->index, cachefiles_gfp);
 		if (ret < 0) {
 			if (ret == -EEXIST) {
 				page_cache_release(netpage);
@@ -549,8 +546,6 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 			goto nomem;
 		}
 
-		lru_cache_add_file(netpage);
-
 		/* install a monitor */
 		page_cache_get(netpage);
 		monitor->netfs_page = netpage;
@@ -613,8 +608,8 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 	backing_page_already_uptodate:
 		_debug("- uptodate");
 
-		ret = add_to_page_cache(netpage, op->mapping, netpage->index,
-					cachefiles_gfp);
+		ret = add_to_page_cache_lru(netpage, op->mapping,
+					    netpage->index, cachefiles_gfp);
 		if (ret < 0) {
 			if (ret == -EEXIST) {
 				page_cache_release(netpage);
@@ -631,8 +626,6 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 
 		fscache_mark_page_cached(op, netpage);
 
-		lru_cache_add_file(netpage);
-
 		/* the netpage is unlocked and marked up to date here */
 		fscache_end_io(op, netpage, 0);
 		page_cache_release(netpage);
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [patch 2/9] lib: radix-tree: radix_tree_delete_item()
  2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
  2014-01-10 18:10 ` [patch 1/9] fs: cachefiles: use add_to_page_cache_lru() Johannes Weiner
@ 2014-01-10 18:10 ` Johannes Weiner
  2014-01-10 18:10 ` [patch 3/9] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Provide a function that does not just delete an entry at a given
index, but also allows passing in an expected item.  Delete only if
that item is still located at the specified index.

This is handy when lockless tree traversals want to delete entries as
well because they don't have to do an second, locked lookup to verify
the slot has not changed under them before deleting the entry.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/radix-tree.h |  1 +
 lib/radix-tree.c           | 31 +++++++++++++++++++++++++++----
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 403940787be1..1bf0a9c388d9 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -219,6 +219,7 @@ static inline void radix_tree_replace_slot(void **pslot, void *item)
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3b4e70..f442e3243607 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1335,15 +1335,18 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 }
 
 /**
- *	radix_tree_delete    -    delete an item from a radix tree
+ *	radix_tree_delete_item    -    delete an item from a radix tree
  *	@root:		radix tree root
  *	@index:		index key
+ *	@item:		expected item
  *
- *	Remove the item at @index from the radix tree rooted at @root.
+ *	Remove @item at @index from the radix tree rooted at @root.
  *
- *	Returns the address of the deleted item, or NULL if it was not present.
+ *	Returns the address of the deleted item, or NULL if it was not present
+ *	or the entry at the given @index was not @item.
  */
-void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+void *radix_tree_delete_item(struct radix_tree_root *root,
+			     unsigned long index, void *item)
 {
 	struct radix_tree_node *node = NULL;
 	struct radix_tree_node *slot = NULL;
@@ -1378,6 +1381,11 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 	if (slot == NULL)
 		goto out;
 
+	if (item && slot != item) {
+		slot = NULL;
+		goto out;
+	}
+
 	/*
 	 * Clear all tags associated with the item to be deleted.
 	 * This way of doing it would be inefficient, but seldom is any set.
@@ -1422,6 +1430,21 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 out:
 	return slot;
 }
+EXPORT_SYMBOL(radix_tree_delete_item);
+
+/**
+ *	radix_tree_delete    -    delete an item from a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Remove the item at @index from the radix tree rooted at @root.
+ *
+ *	Returns the address of the deleted item, or NULL if it was not present.
+ */
+void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+{
+	return radix_tree_delete_item(root, index, NULL);
+}
 EXPORT_SYMBOL(radix_tree_delete);
 
 /**
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [patch 3/9] mm: shmem: save one radix tree lookup when truncating swapped pages
  2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
  2014-01-10 18:10 ` [patch 1/9] fs: cachefiles: use add_to_page_cache_lru() Johannes Weiner
  2014-01-10 18:10 ` [patch 2/9] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
@ 2014-01-10 18:10 ` Johannes Weiner
  2014-01-10 18:25   ` Rik van Riel
  2014-01-10 18:10 ` [patch 4/9] mm: filemap: move radix tree hole searching here Johannes Weiner
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2014-01-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Page cache radix tree slots are usually stabilized by the page lock,
but shmem's swap cookies have no such thing.  Because the overall
truncation loop is lockless, the swap entry is currently confirmed by
a tree lookup and then deleted by another tree lookup under the same
tree lock region.

Use radix_tree_delete_item() instead, which does the verification and
deletion with only one lookup.  This also allows removing the
delete-only special case from shmem_radix_tree_replace().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---
 mm/shmem.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 8297623fcaed..7c67249d6f28 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -242,19 +242,17 @@ static int shmem_radix_tree_replace(struct address_space *mapping,
 			pgoff_t index, void *expected, void *replacement)
 {
 	void **pslot;
-	void *item = NULL;
+	void *item;
 
 	VM_BUG_ON(!expected);
+	VM_BUG_ON(!replacement);
 	pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
-	if (pslot)
-		item = radix_tree_deref_slot_protected(pslot,
-							&mapping->tree_lock);
+	if (!pslot)
+		return -ENOENT;
+	item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
 	if (item != expected)
 		return -ENOENT;
-	if (replacement)
-		radix_tree_replace_slot(pslot, replacement);
-	else
-		radix_tree_delete(&mapping->page_tree, index);
+	radix_tree_replace_slot(pslot, replacement);
 	return 0;
 }
 
@@ -386,14 +384,15 @@ export:
 static int shmem_free_swap(struct address_space *mapping,
 			   pgoff_t index, void *radswap)
 {
-	int error;
+	void *old;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = shmem_radix_tree_replace(mapping, index, radswap, NULL);
+	old = radix_tree_delete_item(&mapping->page_tree, index, radswap);
 	spin_unlock_irq(&mapping->tree_lock);
-	if (!error)
-		free_swap_and_cache(radix_to_swp_entry(radswap));
-	return error;
+	if (old != radswap)
+		return -ENOENT;
+	free_swap_and_cache(radix_to_swp_entry(radswap));
+	return 0;
 }
 
 /*
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [patch 4/9] mm: filemap: move radix tree hole searching here
  2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
                   ` (2 preceding siblings ...)
  2014-01-10 18:10 ` [patch 3/9] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
@ 2014-01-10 18:10 ` Johannes Weiner
  2014-01-10 19:22   ` Rik van Riel
  2014-01-13  1:25   ` Minchan Kim
  2014-01-10 18:10 ` [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

The radix tree hole searching code is only used for page cache, for
example the readahead code trying to get a a picture of the area
surrounding a fault.

It sufficed to rely on the radix tree definition of holes, which is
"empty tree slot".  But this is about to change, though, as shadow
page descriptors will be stored in the page cache after the actual
pages get evicted from memory.

Move the functions over to mm/filemap.c and make them native page
cache operations, where they can later be adapted to handle the new
definition of "page cache hole".

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/nfs/blocklayout/blocklayout.c |  2 +-
 include/linux/pagemap.h          |  5 +++
 include/linux/radix-tree.h       |  4 ---
 lib/radix-tree.c                 | 75 ---------------------------------------
 mm/filemap.c                     | 76 ++++++++++++++++++++++++++++++++++++++++
 mm/readahead.c                   |  4 +--
 6 files changed, 84 insertions(+), 82 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index e242bbf72972..fdb74cbb9e0c 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1220,7 +1220,7 @@ static u64 pnfs_num_cont_bytes(struct inode *inode, pgoff_t idx)
 	end = DIV_ROUND_UP(i_size_read(inode), PAGE_CACHE_SIZE);
 	if (end != NFS_I(inode)->npages) {
 		rcu_read_lock();
-		end = radix_tree_next_hole(&mapping->page_tree, idx + 1, ULONG_MAX);
+		end = page_cache_next_hole(mapping, idx + 1, ULONG_MAX);
 		rcu_read_unlock();
 	}
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75a078b..c73130c607c4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -243,6 +243,11 @@ static inline struct page *page_cache_alloc_readahead(struct address_space *x)
 
 typedef int filler_t(void *, struct page *);
 
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan);
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan);
+
 extern struct page * find_get_page(struct address_space *mapping,
 				pgoff_t index);
 extern struct page * find_lock_page(struct address_space *mapping,
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 1bf0a9c388d9..e8be53ecfc45 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -227,10 +227,6 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
 			void ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index f442e3243607..e8adb5d8a184 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -946,81 +946,6 @@ next:
 }
 EXPORT_SYMBOL(radix_tree_range_tag_if_tagged);
 
-
-/**
- *	radix_tree_next_hole    -    find the next hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the lowest
- *	indexed hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'return - index >= max_scan'
- *	will be true). In rare cases of index wrap-around, 0 will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 5, then subsequently a hole is created at index 10,
- *	radix_tree_next_hole covering both indexes may return 10 if called
- *	under rcu_read_lock.
- */
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index++;
-		if (index == 0)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_next_hole);
-
-/**
- *	radix_tree_prev_hole    -    find the prev hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search backwards in the range [max(index-max_scan+1, 0), index]
- *	for the first hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'index - return >= max_scan'
- *	will be true). In rare cases of wrap-around, ULONG_MAX will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 10, then subsequently a hole is created at index 5,
- *	radix_tree_prev_hole covering both indexes may return 5 if called under
- *	rcu_read_lock.
- */
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				   unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index--;
-		if (index == ULONG_MAX)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_prev_hole);
-
 /**
  *	radix_tree_gang_lookup - perform multiple lookup on a radix tree
  *	@root:		radix tree root
diff --git a/mm/filemap.c b/mm/filemap.c
index ae4846ff4849..0746b7a4658f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -686,6 +686,82 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 }
 
 /**
+ * page_cache_next_hole - find the next hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the
+ * lowest indexed hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'return - index >=
+ * max_scan' will be true). In rare cases of index wrap-around, 0 will
+ * be returned.
+ *
+ * page_cache_next_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 5, then subsequently a hole is created at
+ * index 10, page_cache_next_hole covering both indexes may return 10
+ * if called under rcu_read_lock.
+ */
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index++;
+		if (index == 0)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(page_cache_next_hole);
+
+/**
+ * page_cache_prev_hole - find the prev hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search backwards in the range [max(index-max_scan+1, 0), index] for
+ * the first hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'index - return >=
+ * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX
+ * will be returned.
+ *
+ * page_cache_prev_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 10, then subsequently a hole is created at
+ * index 5, page_cache_prev_hole covering both indexes may return 5 if
+ * called under rcu_read_lock.
+ */
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index--;
+		if (index == ULONG_MAX)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(page_cache_prev_hole);
+
+/**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
diff --git a/mm/readahead.c b/mm/readahead.c
index e4ed04149785..9eeeeda4ac0e 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -351,7 +351,7 @@ static pgoff_t count_history_pages(struct address_space *mapping,
 	pgoff_t head;
 
 	rcu_read_lock();
-	head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
+	head = page_cache_prev_hole(mapping, offset - 1, max);
 	rcu_read_unlock();
 
 	return offset - 1 - head;
@@ -430,7 +430,7 @@ ondemand_readahead(struct address_space *mapping,
 		pgoff_t start;
 
 		rcu_read_lock();
-		start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
+		start = page_cache_next_hole(mapping, offset + 1, max);
 		rcu_read_unlock();
 
 		if (!start || start - offset > max)
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees
  2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
                   ` (3 preceding siblings ...)
  2014-01-10 18:10 ` [patch 4/9] mm: filemap: move radix tree hole searching here Johannes Weiner
@ 2014-01-10 18:10 ` Johannes Weiner
  2014-01-10 19:39   ` Rik van Riel
                     ` (2 more replies)
  2014-01-10 18:10 ` [patch 6/9] mm + fs: store shadow entries in page cache Johannes Weiner
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache,
prepare every site dealing with the radix trees directly to handle
entries other than pages.

The common lookup functions will filter out non-page entries and
return NULL for page cache holes, just as before.  But provide a raw
version of the API which returns non-page entries as well, and switch
shmem over to use it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c   |   2 +-
 include/linux/mm.h       |   8 ++
 include/linux/pagemap.h  |  15 ++--
 include/linux/pagevec.h  |   3 +
 include/linux/shmem_fs.h |   1 +
 mm/filemap.c             | 196 +++++++++++++++++++++++++++++++++++++++++------
 mm/mincore.c             |  20 +++--
 mm/readahead.c           |   2 +-
 mm/shmem.c               |  97 +++++------------------
 mm/swap.c                |  47 ++++++++++++
 mm/truncate.c            |  73 ++++++++++++++----
 11 files changed, 336 insertions(+), 128 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 6aad98cb343f..c88316587900 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -474,7 +474,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, pg_index);
 		rcu_read_unlock();
-		if (page) {
+		if (page && !radix_tree_exceptional_entry(page)) {
 			misses++;
 			if (misses > 4)
 				break;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55ee8855..c09ef3ae55bc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -906,6 +906,14 @@ extern void show_free_areas(unsigned int flags);
 extern bool skip_free_areas_node(unsigned int flags, int nid);
 
 int shmem_zero_setup(struct vm_area_struct *);
+#ifdef CONFIG_SHMEM
+bool shmem_mapping(struct address_space *mapping);
+#else
+static inline bool shmem_mapping(struct address_space *mapping)
+{
+	return false;
+}
+#endif
 
 extern int can_do_mlock(void);
 extern int user_shm_lock(size_t, struct user_struct *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c73130c607c4..b6854b7c58cb 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -248,12 +248,15 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 pgoff_t page_cache_prev_hole(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
-extern struct page * find_get_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_lock_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_or_create_page(struct address_space *mapping,
-				pgoff_t index, gfp_t gfp_mask);
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
+				 gfp_t gfp_mask);
+unsigned __find_get_pages(struct address_space *mapping, pgoff_t start,
+			  unsigned int nr_pages, struct page **pages,
+			  pgoff_t *indices);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index e4dbfab37729..3c6b8b1e945b 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,6 +22,9 @@ struct pagevec {
 
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned nr_pages, pgoff_t *indices);
+void pagevec_remove_exceptionals(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 30aa0dc60d75..deb49609cd36 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -49,6 +49,7 @@ extern struct file *shmem_file_setup(const char *name,
 					loff_t size, unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
 extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern bool shmem_mapping(struct address_space *mapping);
 extern void shmem_unlock_mapping(struct address_space *mapping);
 extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 					pgoff_t index, gfp_t gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index 0746b7a4658f..23eb3be27205 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -446,6 +446,29 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
+static int page_cache_tree_insert(struct address_space *mapping,
+				  struct page *page)
+{
+	void **slot;
+	int error;
+
+	slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+	if (slot) {
+		void *p;
+
+		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+		if (!radix_tree_exceptional_entry(p))
+			return -EEXIST;
+		radix_tree_replace_slot(slot, page);
+		mapping->nrpages++;
+		return 0;
+	}
+	error = radix_tree_insert(&mapping->page_tree, page->index, page);
+	if (!error)
+		mapping->nrpages++;
+	return error;
+}
+
 /**
  * add_to_page_cache_locked - add a locked page to the pagecache
  * @page:	page to add
@@ -480,11 +503,10 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	page->index = offset;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = radix_tree_insert(&mapping->page_tree, offset, page);
+	error = page_cache_tree_insert(mapping, page);
 	radix_tree_preload_end();
 	if (unlikely(error))
 		goto err_insert;
-	mapping->nrpages++;
 	__inc_zone_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
 	trace_mm_filemap_add_to_page_cache(page);
@@ -712,7 +734,10 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index++;
 		if (index == 0)
@@ -750,7 +775,10 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index--;
 		if (index == ULONG_MAX)
@@ -762,14 +790,19 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 EXPORT_SYMBOL(page_cache_prev_hole);
 
 /**
- * find_get_page - find and get a page reference
+ * __find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
  *
- * Is there a pagecache struct page at the given (mapping, offset) tuple?
- * If yes, increment its refcount and return it; if no, return NULL.
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned with an increased refcount.
+ *
+ * If the slot holds a shadow entry of a previously evicted page, it
+ * is returned.
+ *
+ * Otherwise, %NULL is returned.
  */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
 {
 	void **pagep;
 	struct page *page;
@@ -810,24 +843,49 @@ out:
 
 	return page;
 }
+EXPORT_SYMBOL(__find_get_page);
+
+/**
+ * find_get_page - find and get a page reference
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned with an increased refcount.
+ *
+ * Otherwise, %NULL is returned.
+ */
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page = __find_get_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
 EXPORT_SYMBOL(find_get_page);
 
 /**
- * find_lock_page - locate, pin and lock a pagecache page
+ * __find_lock_page - locate, pin and lock a pagecache page
  * @mapping: the address_space to search
  * @offset: the page index
  *
- * Locates the desired pagecache page, locks it, increments its reference
- * count and returns its address.
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * If the slot holds a shadow entry of a previously evicted page, it
+ * is returned.
+ *
+ * Otherwise, %NULL is returned.
  *
- * Returns zero if the page was not present. find_lock_page() may sleep.
+ * __find_lock_page() may sleep.
  */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
 {
 	struct page *page;
-
 repeat:
-	page = find_get_page(mapping, offset);
+	page = __find_get_page(mapping, offset);
 	if (page && !radix_tree_exception(page)) {
 		lock_page(page);
 		/* Has the page been truncated? */
@@ -840,6 +898,29 @@ repeat:
 	}
 	return page;
 }
+EXPORT_SYMBOL(__find_lock_page);
+
+/**
+ * find_lock_page - locate, pin and lock a pagecache page
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * Otherwise, %NULL is returned.
+ *
+ * find_lock_page() may sleep.
+ */
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page = __find_lock_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
 EXPORT_SYMBOL(find_lock_page);
 
 /**
@@ -848,16 +929,18 @@ EXPORT_SYMBOL(find_lock_page);
  * @index: the page's index into the mapping
  * @gfp_mask: page allocation mode
  *
- * Locates a page in the pagecache.  If the page is not present, a new page
- * is allocated using @gfp_mask and is added to the pagecache and to the VM's
- * LRU list.  The returned page is locked and has its reference count
- * incremented.
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
  *
- * find_or_create_page() may sleep, even if @gfp_flags specifies an atomic
- * allocation!
+ * If the page is not present, a new page is allocated using @gfp_mask
+ * and added to the page cache and the VM's LRU list.  The page is
+ * returned locked and with an increased refcount.
  *
- * find_or_create_page() returns the desired page's address, or zero on
- * memory exhaustion.
+ * On memory exhaustion, %NULL is returned.
+ *
+ * find_or_create_page() may sleep, even if @gfp_flags specifies an
+ * atomic allocation!
  */
 struct page *find_or_create_page(struct address_space *mapping,
 		pgoff_t index, gfp_t gfp_mask)
@@ -890,6 +973,73 @@ repeat:
 EXPORT_SYMBOL(find_or_create_page);
 
 /**
+ * __find_get_pages - gang pagecache lookup
+ * @mapping:	The address_space to search
+ * @start:	The starting page index
+ * @nr_pages:	The maximum number of pages
+ * @pages:	Where the resulting pages are placed
+ *
+ * __find_get_pages() will search for and return a group of up to
+ * @nr_pages pages in the mapping.  The pages are placed at @pages.
+ * __find_get_pages() takes a reference against the returned pages.
+ *
+ * The search returns a group of mapping-contiguous pages with ascending
+ * indexes.  There may be holes in the indices due to not-present pages.
+ *
+ * Any shadow entries of evicted pages are included in the returned
+ * array.
+ *
+ * __find_get_pages() returns the number of pages and shadow entries
+ * which were found.
+ */
+unsigned __find_get_pages(struct address_space *mapping,
+			  pgoff_t start, unsigned int nr_pages,
+			  struct page **pages, pgoff_t *indices)
+{
+	void **slot;
+	unsigned int ret = 0;
+	struct radix_tree_iter iter;
+
+	if (!nr_pages)
+		return 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				goto restart;
+			/*
+			 * Otherwise, we must be storing a swap entry
+			 * here as an exceptional entry: so return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = iter.index;
+		pages[ret] = page;
+		if (++ret == nr_pages)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+/**
  * find_get_pages - gang pagecache lookup
  * @mapping:	The address_space to search
  * @start:	The starting page index
diff --git a/mm/mincore.c b/mm/mincore.c
index da2be56a7b8f..ad411ec86a55 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -70,13 +70,21 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 	 * any other file mapping (ie. marked !present and faulted in with
 	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
 	 */
-	page = find_get_page(mapping, pgoff);
 #ifdef CONFIG_SWAP
-	/* shmem/tmpfs may return swap: account for swapcache page too. */
-	if (radix_tree_exceptional_entry(page)) {
-		swp_entry_t swap = radix_to_swp_entry(page);
-		page = find_get_page(swap_address_space(swap), swap.val);
-	}
+	if (shmem_mapping(mapping)) {
+		page = __find_get_page(mapping, pgoff);
+		/*
+		 * shmem/tmpfs may return swap: account for swapcache
+		 * page too.
+		 */
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swp = radix_to_swp_entry(page);
+			page = find_get_page(swap_address_space(swp), swp.val);
+		}
+	} else
+		page = find_get_page(mapping, pgoff);
+#else
+	page = find_get_page(mapping, pgoff);
 #endif
 	if (page) {
 		present = PageUptodate(page);
diff --git a/mm/readahead.c b/mm/readahead.c
index 9eeeeda4ac0e..912c00358112 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -179,7 +179,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
 		rcu_read_unlock();
-		if (page)
+		if (page && !radix_tree_exceptional_entry(page))
 			continue;
 
 		page = page_cache_alloc_readahead(mapping);
diff --git a/mm/shmem.c b/mm/shmem.c
index 7c67249d6f28..1f4b65f7b831 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -329,56 +329,6 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 }
 
 /*
- * Like find_get_pages, but collecting swap entries as well as pages.
- */
-static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
-					pgoff_t start, unsigned int nr_pages,
-					struct page **pages, pgoff_t *indices)
-{
-	void **slot;
-	unsigned int ret = 0;
-	struct radix_tree_iter iter;
-
-	if (!nr_pages)
-		return 0;
-
-	rcu_read_lock();
-restart:
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
-		struct page *page;
-repeat:
-		page = radix_tree_deref_slot(slot);
-		if (unlikely(!page))
-			continue;
-		if (radix_tree_exception(page)) {
-			if (radix_tree_deref_retry(page))
-				goto restart;
-			/*
-			 * Otherwise, we must be storing a swap entry
-			 * here as an exceptional entry: so return it
-			 * without attempting to raise page count.
-			 */
-			goto export;
-		}
-		if (!page_cache_get_speculative(page))
-			goto repeat;
-
-		/* Has the page moved? */
-		if (unlikely(page != *slot)) {
-			page_cache_release(page);
-			goto repeat;
-		}
-export:
-		indices[ret] = iter.index;
-		pages[ret] = page;
-		if (++ret == nr_pages)
-			break;
-	}
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
  * Remove swap entry from radix tree, free the swap and its page cache.
  */
 static int shmem_free_swap(struct address_space *mapping,
@@ -396,21 +346,6 @@ static int shmem_free_swap(struct address_space *mapping,
 }
 
 /*
- * Pagevec may contain swap entries, so shuffle up pages before releasing.
- */
-static void shmem_deswap_pagevec(struct pagevec *pvec)
-{
-	int i, j;
-
-	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		if (!radix_tree_exceptional_entry(page))
-			pvec->pages[j++] = page;
-	}
-	pvec->nr = j;
-}
-
-/*
  * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
  */
 void shmem_unlock_mapping(struct address_space *mapping)
@@ -428,12 +363,12 @@ void shmem_unlock_mapping(struct address_space *mapping)
 		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
 		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
 		 */
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					PAGEVEC_SIZE, pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		index = indices[pvec.nr - 1] + 1;
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		check_move_unevictable_pages(pvec.pages, pvec.nr);
 		pagevec_release(&pvec);
 		cond_resched();
@@ -465,9 +400,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	pagevec_init(&pvec, 0);
 	index = start;
 	while (index < end) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
-				min(end - index, (pgoff_t)PAGEVEC_SIZE),
-							pvec.pages, indices);
+		pvec.nr = __find_get_pages(mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		mem_cgroup_uncharge_start();
@@ -496,7 +431,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -534,9 +469,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+
+		pvec.nr = __find_get_pages(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
-							pvec.pages, indices);
+				pvec.pages, indices);
 		if (!pvec.nr) {
 			if (index == start || unfalloc)
 				break;
@@ -544,7 +480,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			continue;
 		}
 		if ((index == start || unfalloc) && indices[0] >= end) {
-			shmem_deswap_pagevec(&pvec);
+			pagevec_remove_exceptionals(&pvec);
 			pagevec_release(&pvec);
 			break;
 		}
@@ -573,7 +509,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -1081,7 +1017,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return -EFBIG;
 repeat:
 	swap.val = 0;
-	page = find_lock_page(mapping, index);
+	page = __find_lock_page(mapping, index);
 	if (radix_tree_exceptional_entry(page)) {
 		swap = radix_to_swp_entry(page);
 		page = NULL;
@@ -1418,6 +1354,11 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	return inode;
 }
 
+bool shmem_mapping(struct address_space *mapping)
+{
+	return mapping->backing_dev_info == &shmem_backing_dev_info;
+}
+
 #ifdef CONFIG_TMPFS
 static const struct inode_operations shmem_symlink_inode_operations;
 static const struct inode_operations shmem_short_symlink_operations;
@@ -1730,7 +1671,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 	pagevec_init(&pvec, 0);
 	pvec.nr = 1;		/* start small: we may be there already */
 	while (!done) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					pvec.nr, pvec.pages, indices);
 		if (!pvec.nr) {
 			if (whence == SEEK_DATA)
@@ -1757,7 +1698,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 				break;
 			}
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		pvec.nr = PAGEVEC_SIZE;
 		cond_resched();
diff --git a/mm/swap.c b/mm/swap.c
index 759c3caf44bd..f624e5b4b724 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -894,6 +894,53 @@ EXPORT_SYMBOL(__pagevec_lru_add);
 
 /**
  * pagevec_lookup - gang pagecache lookup
+ * @pvec:	Where the resulting entries are placed
+ * @mapping:	The address_space to search
+ * @start:	The starting entry index
+ * @nr_pages:	The maximum number of entries
+ *
+ * pagevec_lookup() will search for and return a group of up to
+ * @nr_pages pages and shadow entries in the mapping.  All entries are
+ * placed in @pvec.  pagevec_lookup() takes a reference against actual
+ * pages in @pvec.
+ *
+ * The search returns a group of mapping-contiguous entries with
+ * ascending indexes.  There may be holes in the indices due to
+ * not-present entries.
+ *
+ * pagevec_lookup() returns the number of entries which were found.
+ */
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned nr_pages, pgoff_t *indices)
+{
+	pvec->nr = __find_get_pages(mapping, start, nr_pages,
+				    pvec->pages, indices);
+	return pagevec_count(pvec);
+}
+
+/**
+ * pagevec_remove_exceptionals - pagevec exceptionals pruning
+ * @pvec:	The pagevec to prune
+ *
+ * __pagevec_lookup() fills both pages and exceptional radix tree
+ * entries into the pagevec.  This function prunes all exceptionals
+ * from @pvec without leaving holes, so that it can be passed on to
+ * other pagevec operations.
+ */
+void pagevec_remove_exceptionals(struct pagevec *pvec)
+{
+	int i, j;
+
+	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		if (!radix_tree_exceptional_entry(page))
+			pvec->pages[j++] = page;
+	}
+	pvec->nr = j;
+}
+
+/**
+ * pagevec_lookup - gang pagecache lookup
  * @pvec:	Where the resulting pages are placed
  * @mapping:	The address_space to search
  * @start:	The starting page index
diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683afd6e..b0f4d4bee8ab 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -22,6 +22,22 @@
 #include <linux/cleancache.h>
 #include "internal.h"
 
+static void clear_exceptional_entry(struct address_space *mapping,
+				    pgoff_t index, void *entry)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	radix_tree_delete_item(&mapping->page_tree, index, entry);
+	spin_unlock_irq(&mapping->tree_lock);
+}
 
 /**
  * do_invalidatepage - invalidate part or all of a page
@@ -208,6 +224,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	unsigned int	partial_start;	/* inclusive */
 	unsigned int	partial_end;	/* exclusive */
 	struct pagevec	pvec;
+	pgoff_t		indices[PAGEVEC_SIZE];
 	pgoff_t		index;
 	int		i;
 
@@ -238,17 +255,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
 
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index < end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+	while (index < end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index >= end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -259,6 +282,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -307,14 +331,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		if (!pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+		if (!__pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			indices)) {
 			if (index == start)
 				break;
 			index = start;
 			continue;
 		}
-		if (index == start && pvec.pages[0]->index >= end) {
+		if (index == start && indices[0] >= end) {
 			pagevec_release(&pvec);
 			break;
 		}
@@ -323,16 +348,22 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index >= end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -375,6 +406,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index = start;
 	unsigned long ret;
@@ -390,17 +422,23 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	 */
 
 	pagevec_init(&pvec, 0);
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -414,6 +452,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 				deactivate_page(page);
 			count += ret;
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -481,6 +520,7 @@ static int do_launder_page(struct address_space *mapping, struct page *page)
 int invalidate_inode_pages2_range(struct address_space *mapping,
 				  pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index;
 	int i;
@@ -491,17 +531,23 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	cleancache_invalidate_inode(mapping);
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			if (page->mapping != mapping) {
@@ -539,6 +585,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				ret = ret2;
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [patch 6/9] mm + fs: store shadow entries in page cache
  2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
                   ` (4 preceding siblings ...)
  2014-01-10 18:10 ` [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
@ 2014-01-10 18:10 ` Johannes Weiner
  2014-01-10 22:30   ` Rik van Riel
  2014-01-13  2:18   ` Minchan Kim
  2014-01-10 18:10 ` [patch 7/9] mm: thrash detection-based file cache sizing Johannes Weiner
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Reclaim will be leaving shadow entries in the page cache radix tree
upon evicting the real page.  As those pages are found from the LRU,
an iput() can lead to the inode being freed concurrently.  At this
point, reclaim must no longer install shadow pages because the inode
freeing code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code
sets under the tree lock before doing the final truncate.  Reclaim
will check for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/filesystems/porting               |  6 +--
 drivers/staging/lustre/lustre/llite/llite_lib.c |  2 +-
 fs/9p/vfs_inode.c                               |  2 +-
 fs/affs/inode.c                                 |  2 +-
 fs/afs/inode.c                                  |  2 +-
 fs/bfs/inode.c                                  |  2 +-
 fs/block_dev.c                                  |  4 +-
 fs/btrfs/inode.c                                |  2 +-
 fs/cifs/cifsfs.c                                |  2 +-
 fs/coda/inode.c                                 |  2 +-
 fs/ecryptfs/super.c                             |  2 +-
 fs/exofs/inode.c                                |  2 +-
 fs/ext2/inode.c                                 |  2 +-
 fs/ext3/inode.c                                 |  2 +-
 fs/ext4/inode.c                                 |  4 +-
 fs/f2fs/inode.c                                 |  2 +-
 fs/fat/inode.c                                  |  2 +-
 fs/freevxfs/vxfs_inode.c                        |  2 +-
 fs/fuse/inode.c                                 |  2 +-
 fs/gfs2/super.c                                 |  2 +-
 fs/hfs/inode.c                                  |  2 +-
 fs/hfsplus/super.c                              |  2 +-
 fs/hostfs/hostfs_kern.c                         |  2 +-
 fs/hpfs/inode.c                                 |  2 +-
 fs/inode.c                                      |  4 +-
 fs/jffs2/fs.c                                   |  2 +-
 fs/jfs/inode.c                                  |  4 +-
 fs/logfs/readwrite.c                            |  2 +-
 fs/minix/inode.c                                |  2 +-
 fs/ncpfs/inode.c                                |  2 +-
 fs/nfs/inode.c                                  |  2 +-
 fs/nfs/nfs4super.c                              |  2 +-
 fs/nilfs2/inode.c                               |  6 +--
 fs/ntfs/inode.c                                 |  2 +-
 fs/ocfs2/inode.c                                |  4 +-
 fs/omfs/inode.c                                 |  2 +-
 fs/proc/inode.c                                 |  2 +-
 fs/reiserfs/inode.c                             |  2 +-
 fs/sysfs/inode.c                                |  2 +-
 fs/sysv/inode.c                                 |  2 +-
 fs/ubifs/super.c                                |  2 +-
 fs/udf/inode.c                                  |  4 +-
 fs/ufs/inode.c                                  |  2 +-
 fs/xfs/xfs_super.c                              |  2 +-
 include/linux/fs.h                              |  1 +
 include/linux/mm.h                              |  1 +
 include/linux/pagemap.h                         | 13 +++++-
 mm/filemap.c                                    | 33 ++++++++++++---
 mm/truncate.c                                   | 54 +++++++++++++++++++++++--
 mm/vmscan.c                                     |  2 +-
 50 files changed, 147 insertions(+), 65 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index f0890581f7f6..fc0de703066b 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -295,9 +295,9 @@ in the beginning of ->setattr unconditionally.
 	->clear_inode() and ->delete_inode() are gone; ->evict_inode() should
 be used instead.  It gets called whenever the inode is evicted, whether it has
 remaining links or not.  Caller does *not* evict the pagecache or inode-associated
-metadata buffers; getting rid of those is responsibility of method, as it had
-been for ->delete_inode(). Caller makes sure async writeback cannot be running
-for the inode while (or after) ->evict_inode() is called.
+metadata buffers; the method has to use truncate_inode_pages_final() to get rid
+of those. Caller makes sure async writeback cannot be running for the inode while
+(or after) ->evict_inode() is called.
 
 	->drop_inode() returns int now; it's called on final iput() with
 inode->i_lock held and it returns true if filesystems wants the inode to be
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index b868c2bd58d2..79cbc9c5b744 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -1817,7 +1817,7 @@ void ll_delete_inode(struct inode *inode)
 		cl_sync_file_range(inode, 0, OBD_OBJECT_EOF,
 				   CL_FSYNC_DISCARD, 1);
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 
 	/* Workaround for LU-118 */
 	if (inode->i_data.nrpages) {
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 94de6d1482e2..e6716c295a99 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -444,7 +444,7 @@ void v9fs_evict_inode(struct inode *inode)
 {
 	struct v9fs_inode *v9inode = V9FS_I(inode);
 
-	truncate_inode_pages(inode->i_mapping, 0);
+	truncate_inode_pages_final(inode->i_mapping);
 	clear_inode(inode);
 	filemap_fdatawrite(inode->i_mapping);
 
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 0e092d08680e..96df91e8c334 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -259,7 +259,7 @@ affs_evict_inode(struct inode *inode)
 {
 	unsigned long cache_page;
 	pr_debug("AFFS: evict_inode(ino=%lu, nlink=%u)\n", inode->i_ino, inode->i_nlink);
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 
 	if (!inode->i_nlink) {
 		inode->i_size = 0;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 789bc253b5f6..2bbe60e3f0e3 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -422,7 +422,7 @@ void afs_evict_inode(struct inode *inode)
 
 	ASSERTCMP(inode->i_ino, ==, vnode->fid.vnode);
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 
 	afs_give_up_callback(vnode);
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index 8defc6b3f9a2..29aa5cf6639b 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -172,7 +172,7 @@ static void bfs_evict_inode(struct inode *inode)
 
 	dprintf("ino=%08lx\n", ino);
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	invalidate_inode_buffers(inode);
 	clear_inode(inode);
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1e86823a9cbd..c7a7def27b07 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -83,7 +83,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
 		return;
 
 	invalidate_bh_lrus();
@@ -419,7 +419,7 @@ static void bdev_evict_inode(struct inode *inode)
 {
 	struct block_device *bdev = &BDEV_I(inode)->bdev;
 	struct list_head *p;
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	invalidate_inode_buffers(inode); /* is it needed here? */
 	clear_inode(inode);
 	spin_lock(&bdev_lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 51e3afa78354..d3e498390189 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4471,7 +4471,7 @@ void btrfs_evict_inode(struct inode *inode)
 
 	trace_btrfs_inode_evict(inode);
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	if (inode->i_nlink && (btrfs_root_refs(&root->root_item) != 0 ||
 			       btrfs_is_free_space_inode(inode)))
 		goto no_delete;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 77fc5e181077..d795c50e67cb 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -286,7 +286,7 @@ cifs_destroy_inode(struct inode *inode)
 static void
 cifs_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	cifs_fscache_release_inode_cookie(inode);
 }
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 4dcc0d81a7aa..43a5b38fc8d3 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -250,7 +250,7 @@ static void coda_put_super(struct super_block *sb)
 
 static void coda_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	coda_cache_clear_inode(inode);
 }
diff --git a/fs/ecryptfs/super.c b/fs/ecryptfs/super.c
index e879cf8ff0b1..afa1b81c3418 100644
--- a/fs/ecryptfs/super.c
+++ b/fs/ecryptfs/super.c
@@ -132,7 +132,7 @@ static int ecryptfs_statfs(struct dentry *dentry, struct kstatfs *buf)
  */
 static void ecryptfs_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	iput(ecryptfs_inode_to_lower(inode));
 }
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index a52a5d23c30b..d9ff4d304b41 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1479,7 +1479,7 @@ void exofs_evict_inode(struct inode *inode)
 	struct ore_io_state *ios;
 	int ret;
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 
 	/* TODO: should do better here */
 	if (inode->i_nlink || is_bad_inode(inode))
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c260de6d7b6d..115fa58bb9ae 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -78,7 +78,7 @@ void ext2_evict_inode(struct inode * inode)
 		dquot_drop(inode);
 	}
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 
 	if (want_delete) {
 		sb_start_intwrite(inode->i_sb);
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 2bd85486b879..153f4bec69ef 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -228,7 +228,7 @@ void ext3_evict_inode (struct inode *inode)
 		log_wait_commit(journal, commit_tid);
 		filemap_write_and_wait(&inode->i_data);
 	}
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 
 	ext3_discard_reservation(inode);
 	rsv = ei->i_block_alloc_info;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e274e9c1171f..3b75e70ae2eb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -214,7 +214,7 @@ void ext4_evict_inode(struct inode *inode)
 			jbd2_complete_transaction(journal, commit_tid);
 			filemap_write_and_wait(&inode->i_data);
 		}
-		truncate_inode_pages(&inode->i_data, 0);
+		truncate_inode_pages_final(&inode->i_data);
 
 		WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
 		goto no_delete;
@@ -225,7 +225,7 @@ void ext4_evict_inode(struct inode *inode)
 
 	if (ext4_should_order_data(inode))
 		ext4_begin_ordered_truncate(inode, 0);
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 
 	WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
 	if (is_bad_inode(inode))
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 9339cd292047..0bd44f84e79b 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -246,7 +246,7 @@ void f2fs_evict_inode(struct inode *inode)
 	int ilock;
 
 	trace_f2fs_evict_inode(inode);
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 
 	if (inode->i_ino == F2FS_NODE_INO(sbi) ||
 			inode->i_ino == F2FS_META_INO(sbi))
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 0062da21dd8b..fe802d83abdb 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -490,7 +490,7 @@ EXPORT_SYMBOL_GPL(fat_build_inode);
 
 static void fat_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	if (!inode->i_nlink) {
 		inode->i_size = 0;
 		fat_truncate_blocks(inode, 0);
diff --git a/fs/freevxfs/vxfs_inode.c b/fs/freevxfs/vxfs_inode.c
index f47df72cef17..363e3ae25f6b 100644
--- a/fs/freevxfs/vxfs_inode.c
+++ b/fs/freevxfs/vxfs_inode.c
@@ -354,7 +354,7 @@ static void vxfs_i_callback(struct rcu_head *head)
 void
 vxfs_evict_inode(struct inode *ip)
 {
-	truncate_inode_pages(&ip->i_data, 0);
+	truncate_inode_pages_final(&ip->i_data);
 	clear_inode(ip);
 	call_rcu(&ip->i_rcu, vxfs_i_callback);
 }
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index a8ce6dab60a0..09d7fa05f136 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -123,7 +123,7 @@ static void fuse_destroy_inode(struct inode *inode)
 
 static void fuse_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	if (inode->i_sb->s_flags & MS_ACTIVE) {
 		struct fuse_conn *fc = get_fuse_conn(inode);
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index e5639dec66c4..ac96a99c0e5d 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1525,7 +1525,7 @@ out_unlock:
 		fs_warn(sdp, "gfs2_evict_inode: %d\n", error);
 out:
 	/* Case 3 starts here */
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	gfs2_rs_delete(ip);
 	gfs2_ordered_del_inode(ip);
 	clear_inode(inode);
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 380ab31b5e0f..9e2fecd62f62 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -547,7 +547,7 @@ out:
 
 void hfs_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	if (HFS_IS_RSRC(inode) && HFS_I(inode)->rsrc_inode) {
 		HFS_I(HFS_I(inode)->rsrc_inode)->rsrc_inode = NULL;
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 4c4d142cf890..b9436d923585 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -161,7 +161,7 @@ static int hfsplus_write_inode(struct inode *inode,
 static void hfsplus_evict_inode(struct inode *inode)
 {
 	hfs_dbg(INODE, "hfsplus_evict_inode: %lu\n", inode->i_ino);
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	if (HFSPLUS_IS_RSRC(inode)) {
 		HFSPLUS_I(HFSPLUS_I(inode)->rsrc_inode)->rsrc_inode = NULL;
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 25437280a207..0c9f64070e0f 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -239,7 +239,7 @@ static struct inode *hostfs_alloc_inode(struct super_block *sb)
 
 static void hostfs_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	if (HOSTFS_I(inode)->fd != -1) {
 		close_file(&HOSTFS_I(inode)->fd);
diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
index 9edeeb0ea97e..50a427313835 100644
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -304,7 +304,7 @@ void hpfs_write_if_changed(struct inode *inode)
 
 void hpfs_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	if (!inode->i_nlink) {
 		hpfs_lock(inode->i_sb);
diff --git a/fs/inode.c b/fs/inode.c
index b33ba8e021cc..093864ea2358 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -503,6 +503,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
+	BUG_ON(inode->i_data.nrshadows);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
@@ -548,8 +549,7 @@ static void evict(struct inode *inode)
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
-		if (inode->i_data.nrpages)
-			truncate_inode_pages(&inode->i_data, 0);
+		truncate_inode_pages_final(&inode->i_data);
 		clear_inode(inode);
 	}
 	if (S_ISBLK(inode->i_mode) && inode->i_bdev)
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index fe3c0527545f..00ed6c64a579 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -241,7 +241,7 @@ void jffs2_evict_inode (struct inode *inode)
 
 	jffs2_dbg(1, "%s(): ino #%lu mode %o\n",
 		  __func__, inode->i_ino, inode->i_mode);
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	jffs2_do_clear_inode(c, f);
 }
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index f4aab719add5..6f8fe72c2a7a 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -154,7 +154,7 @@ void jfs_evict_inode(struct inode *inode)
 		dquot_initialize(inode);
 
 		if (JFS_IP(inode)->fileset == FILESYSTEM_I) {
-			truncate_inode_pages(&inode->i_data, 0);
+			truncate_inode_pages_final(&inode->i_data);
 
 			if (test_cflag(COMMIT_Freewmap, inode))
 				jfs_free_zero_link(inode);
@@ -168,7 +168,7 @@ void jfs_evict_inode(struct inode *inode)
 			dquot_free_inode(inode);
 		}
 	} else {
-		truncate_inode_pages(&inode->i_data, 0);
+		truncate_inode_pages_final(&inode->i_data);
 	}
 	clear_inode(inode);
 	dquot_drop(inode);
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 9a59cbade2fb..48140315f627 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -2180,7 +2180,7 @@ void logfs_evict_inode(struct inode *inode)
 			do_delete_inode(inode);
 		}
 	}
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 
 	/* Cheaper version of write_inode.  All changes are concealed in
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index 0332109162a5..03aaeb1a694a 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -26,7 +26,7 @@ static int minix_remount (struct super_block * sb, int * flags, char * data);
 
 static void minix_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	if (!inode->i_nlink) {
 		inode->i_size = 0;
 		minix_truncate(inode);
diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index 4659da67e7f6..e728061edb13 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -296,7 +296,7 @@ ncp_iget(struct super_block *sb, struct ncp_entry_info *info)
 static void
 ncp_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 
 	if (S_ISDIR(inode->i_mode)) {
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index eda8879171c4..fbc38a62cbc9 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -128,7 +128,7 @@ EXPORT_SYMBOL_GPL(nfs_clear_inode);
 
 void nfs_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	nfs_clear_inode(inode);
 }
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index e26acdd1a645..f2a5c44106b6 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -98,7 +98,7 @@ static int nfs4_write_inode(struct inode *inode, struct writeback_control *wbc)
  */
 static void nfs4_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	pnfs_return_layout(inode);
 	pnfs_destroy_layout(NFS_I(inode));
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 7e350c562e0e..b9c5726120e3 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -783,16 +783,14 @@ void nilfs_evict_inode(struct inode *inode)
 	int ret;
 
 	if (inode->i_nlink || !ii->i_root || unlikely(is_bad_inode(inode))) {
-		if (inode->i_data.nrpages)
-			truncate_inode_pages(&inode->i_data, 0);
+		truncate_inode_pages_final(&inode->i_data);
 		clear_inode(inode);
 		nilfs_clear_inode(inode);
 		return;
 	}
 	nilfs_transaction_begin(sb, &ti, 0); /* never fails */
 
-	if (inode->i_data.nrpages)
-		truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 
 	/* TODO: some of the following operations may fail.  */
 	nilfs_truncate_bmap(ii, 0);
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 2778b0255dc6..bd50adc1e6a7 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -2259,7 +2259,7 @@ void ntfs_evict_big_inode(struct inode *vi)
 {
 	ntfs_inode *ni = NTFS_I(vi);
 
-	truncate_inode_pages(&vi->i_data, 0);
+	truncate_inode_pages_final(&vi->i_data);
 	clear_inode(vi);
 
 #ifdef NTFS_RW
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index f87f9bd1edff..f1c46a7f9bc5 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -951,7 +951,7 @@ static void ocfs2_cleanup_delete_inode(struct inode *inode,
 		(unsigned long long)OCFS2_I(inode)->ip_blkno, sync_data);
 	if (sync_data)
 		filemap_write_and_wait(inode->i_mapping);
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 }
 
 static void ocfs2_delete_inode(struct inode *inode)
@@ -1167,7 +1167,7 @@ void ocfs2_evict_inode(struct inode *inode)
 	    (OCFS2_I(inode)->ip_flags & OCFS2_INODE_MAYBE_ORPHANED)) {
 		ocfs2_delete_inode(inode);
 	} else {
-		truncate_inode_pages(&inode->i_data, 0);
+		truncate_inode_pages_final(&inode->i_data);
 	}
 	ocfs2_clear_inode(inode);
 }
diff --git a/fs/omfs/inode.c b/fs/omfs/inode.c
index d8b0afde2179..ec58c7659183 100644
--- a/fs/omfs/inode.c
+++ b/fs/omfs/inode.c
@@ -183,7 +183,7 @@ int omfs_sync_inode(struct inode *inode)
  */
 static void omfs_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 
 	if (inode->i_nlink)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 8eaa1ba793fc..9ca0f085dada 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -35,7 +35,7 @@ static void proc_evict_inode(struct inode *inode)
 	const struct proc_ns_operations *ns_ops;
 	void *ns;
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 
 	/* Stop tracking associated processes */
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index ad62bdbb451e..bc8b8009897d 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -35,7 +35,7 @@ void reiserfs_evict_inode(struct inode *inode)
 	if (!inode->i_nlink && !is_bad_inode(inode))
 		dquot_initialize(inode);
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	if (inode->i_nlink)
 		goto no_delete;
 
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index 963f910c8034..bd0dd8d88b50 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -309,7 +309,7 @@ void sysfs_evict_inode(struct inode *inode)
 {
 	struct sysfs_dirent *sd  = inode->i_private;
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	sysfs_put(sd);
 }
diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c
index c327d4ee1235..5625ca920f5e 100644
--- a/fs/sysv/inode.c
+++ b/fs/sysv/inode.c
@@ -295,7 +295,7 @@ int sysv_sync_inode(struct inode *inode)
 
 static void sysv_evict_inode(struct inode *inode)
 {
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	if (!inode->i_nlink) {
 		inode->i_size = 0;
 		sysv_truncate(inode);
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 3e4aa7281e04..b9ac1f350920 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -351,7 +351,7 @@ static void ubifs_evict_inode(struct inode *inode)
 	dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
 	ubifs_assert(!atomic_read(&inode->i_count));
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 
 	if (inode->i_nlink)
 		goto done;
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 062b7925bca0..af6f4c38d91a 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -146,8 +146,8 @@ void udf_evict_inode(struct inode *inode)
 		want_delete = 1;
 		udf_setsize(inode, 0);
 		udf_update_inode(inode, IS_SYNC(inode));
-	} else
-		truncate_inode_pages(&inode->i_data, 0);
+	}
+	truncate_inode_pages_final(&inode->i_data);
 	invalidate_inode_buffers(inode);
 	clear_inode(inode);
 	if (iinfo->i_alloc_type != ICBTAG_FLAG_AD_IN_ICB &&
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index c8ca96086784..61e8a9b021dd 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -885,7 +885,7 @@ void ufs_evict_inode(struct inode * inode)
 	if (!inode->i_nlink && !is_bad_inode(inode))
 		want_delete = 1;
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	if (want_delete) {
 		loff_t old_i_size;
 		/*UFS_I(inode)->i_dtime = CURRENT_TIME;*/
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 15188cc99449..47ce25dc412d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1006,7 +1006,7 @@ xfs_fs_evict_inode(
 
 	trace_xfs_evict_inode(ip);
 
-	truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 	XFS_STATS_INC(vn_rele);
 	XFS_STATS_INC(vn_remove);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3f40547ba191..9bfa5a57b4ed 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -416,6 +416,7 @@ struct address_space {
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
+	unsigned long		nrshadows;	/* number of shadow entries */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c09ef3ae55bc..5449e7a96adf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1588,6 +1588,7 @@ vm_unmapped_area(struct vm_unmapped_area_info *info)
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
+extern void truncate_inode_pages_final(struct address_space *);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index b6854b7c58cb..f132fdf5ce0f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_EXITING	= __GFP_BITS_SHIFT + 5, /* final truncate in progress */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -69,6 +70,16 @@ static inline int mapping_balloon(struct address_space *mapping)
 	return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
 }
 
+static inline void mapping_set_exiting(struct address_space *mapping)
+{
+	set_bit(AS_EXITING, &mapping->flags);
+}
+
+static inline int mapping_exiting(struct address_space *mapping)
+{
+	return test_bit(AS_EXITING, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
@@ -547,7 +558,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page);
+extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 23eb3be27205..d02db5801dda 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -107,12 +107,33 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
+static void page_cache_tree_delete(struct address_space *mapping,
+				   struct page *page, void *shadow)
+{
+	if (shadow) {
+		void **slot;
+
+		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+		radix_tree_replace_slot(slot, shadow);
+		mapping->nrshadows++;
+		/*
+		 * Make sure the nrshadows update is committed before
+		 * the nrpages update so that final truncate racing
+		 * with reclaim does not see both counters 0 at the
+		 * same time and miss a shadow entry.
+		 */
+		smp_wmb();
+	} else
+		radix_tree_delete(&mapping->page_tree, page->index);
+	mapping->nrpages--;
+}
+
 /*
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the mapping's tree_lock.
  */
-void __delete_from_page_cache(struct page *page)
+void __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
@@ -127,10 +148,11 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	page_cache_tree_delete(mapping, page, shadow);
+
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
-	mapping->nrpages--;
+
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	if (PageSwapBacked(page))
 		__dec_zone_page_state(page, NR_SHMEM);
@@ -166,7 +188,7 @@ void delete_from_page_cache(struct page *page)
 
 	freepage = mapping->a_ops->freepage;
 	spin_lock_irq(&mapping->tree_lock);
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
@@ -426,7 +448,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->index = offset;
 
 		spin_lock_irq(&mapping->tree_lock);
-		__delete_from_page_cache(old);
+		__delete_from_page_cache(old, NULL);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
 		BUG_ON(error);
 		mapping->nrpages++;
@@ -460,6 +482,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
+		mapping->nrshadows--;
 		mapping->nrpages++;
 		return 0;
 	}
diff --git a/mm/truncate.c b/mm/truncate.c
index b0f4d4bee8ab..97606fa4c458 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -35,7 +35,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * without the tree itself locked.  These unlocked entries
 	 * need verification under the tree lock.
 	 */
-	radix_tree_delete_item(&mapping->page_tree, index, entry);
+	if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
+		mapping->nrshadows--;
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
@@ -229,7 +230,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
 		return;
 
 	/* Offsets within partial pages */
@@ -391,6 +392,53 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
 EXPORT_SYMBOL(truncate_inode_pages);
 
 /**
+ * truncate_inode_pages_final - truncate *all* pages before inode dies
+ * @mapping: mapping to truncate
+ *
+ * Called under (and serialized by) inode->i_mutex.
+ *
+ * Filesystems have to use this in the .evict_inode path to inform the
+ * VM that this is the final truncate and the inode is going away.
+ */
+void truncate_inode_pages_final(struct address_space *mapping)
+{
+	unsigned long nrshadows;
+	unsigned long nrpages;
+
+	/*
+	 * Page reclaim can not participate in regular inode lifetime
+	 * management (can't call iput()) and thus can race with the
+	 * inode teardown.  Tell it when the address space is exiting,
+	 * so that it does not install eviction information after the
+	 * final truncate has begun.
+	 */
+	mapping_set_exiting(mapping);
+
+	/*
+	 * When reclaim installs eviction entries, it increases
+	 * nrshadows first, then decreases nrpages.  Make sure we see
+	 * this in the right order or we might miss an entry.
+	 */
+	nrpages = mapping->nrpages;
+	smp_rmb();
+	nrshadows = mapping->nrshadows;
+
+	if (nrpages || nrshadows) {
+		/*
+		 * As truncation uses a lockless tree lookup, acquire
+		 * the spinlock to make sure any ongoing tree
+		 * modification that does not see AS_EXITING is
+		 * completed before starting the final truncate.
+		 */
+		spin_lock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
+
+		truncate_inode_pages(mapping, 0);
+	}
+}
+EXPORT_SYMBOL(truncate_inode_pages_final);
+
+/**
  * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
  * @mapping: the address_space which holds the pages to invalidate
  * @start: the offset 'from' which to invalidate
@@ -483,7 +531,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eea668d9cff6..b954b31602cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -554,7 +554,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page);
+		__delete_from_page_cache(page, NULL);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [patch 7/9] mm: thrash detection-based file cache sizing
  2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
                   ` (5 preceding siblings ...)
  2014-01-10 18:10 ` [patch 6/9] mm + fs: store shadow entries in page cache Johannes Weiner
@ 2014-01-10 18:10 ` Johannes Weiner
  2014-01-10 22:51   ` Rik van Riel
                     ` (2 more replies)
  2014-01-10 18:10 ` [patch 8/9] lib: radix_tree: tree node interface Johannes Weiner
  2014-01-10 18:10 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
  8 siblings, 3 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".

Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages.  This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more gracetime in the face of temporary use-once streams,
but ultimately was not significantly better than a FIFO policy and
still thrashed cache based on eviction speed, rather than actual
demand for cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size.  By
maintaining a history of recently evicted file pages it can detect
frequently used pages with an arbitrarily small inactive list size,
and subsequently apply pressure on the active list based on actual
demand for cache, not just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot.  On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.

This fixes the VM's blindness towards working set changes in excess of
the inactive list.  And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |   5 +
 include/linux/swap.h   |   5 +
 mm/Makefile            |   2 +-
 mm/filemap.c           |  61 ++++++++----
 mm/swap.c              |   2 +
 mm/vmscan.c            |  24 ++++-
 mm/vmstat.c            |   2 +
 mm/workingset.c        | 253 +++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 331 insertions(+), 23 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd791e452ad7..118ba9f51e86 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,8 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	WORKINGSET_REFAULT,
+	WORKINGSET_ACTIVATE,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
@@ -392,6 +394,9 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
+	/* Evictions & activations on the inactive file list */
+	atomic_long_t		inactive_age;
+
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6c219f..b83cf61403ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,11 @@ struct swap_list_t {
 	int next;	/* swapfile to be used next */
 };
 
+/* linux/mm/workingset.c */
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+bool workingset_refault(void *shadow);
+void workingset_activation(struct page *page);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/Makefile b/mm/Makefile
index 305d10acd081..b30aeb86abd6 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o \
-			   interval_tree.o list_lru.o $(mmu-y)
+			   interval_tree.o list_lru.o workingset.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/filemap.c b/mm/filemap.c
index d02db5801dda..65a374c0df4f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
 static int page_cache_tree_insert(struct address_space *mapping,
-				  struct page *page)
+				  struct page *page, void **shadowp)
 {
 	void **slot;
 	int error;
@@ -484,6 +484,8 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		radix_tree_replace_slot(slot, page);
 		mapping->nrshadows--;
 		mapping->nrpages++;
+		if (shadowp)
+			*shadowp = p;
 		return 0;
 	}
 	error = radix_tree_insert(&mapping->page_tree, page->index, page);
@@ -492,18 +494,10 @@ static int page_cache_tree_insert(struct address_space *mapping,
 	return error;
 }
 
-/**
- * add_to_page_cache_locked - add a locked page to the pagecache
- * @page:	page to add
- * @mapping:	the page's address_space
- * @offset:	page index
- * @gfp_mask:	page allocation mode
- *
- * This function is used to add a page to the pagecache. It must be locked.
- * This function does not add the page to the LRU.  The caller must do that.
- */
-int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
-		pgoff_t offset, gfp_t gfp_mask)
+static int __add_to_page_cache_locked(struct page *page,
+				      struct address_space *mapping,
+				      pgoff_t offset, gfp_t gfp_mask,
+				      void **shadowp)
 {
 	int error;
 
@@ -526,7 +520,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	page->index = offset;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = page_cache_tree_insert(mapping, page);
+	error = page_cache_tree_insert(mapping, page, shadowp);
 	radix_tree_preload_end();
 	if (unlikely(error))
 		goto err_insert;
@@ -542,16 +536,49 @@ err_insert:
 	page_cache_release(page);
 	return error;
 }
+
+/**
+ * add_to_page_cache_locked - add a locked page to the pagecache
+ * @page:	page to add
+ * @mapping:	the page's address_space
+ * @offset:	page index
+ * @gfp_mask:	page allocation mode
+ *
+ * This function is used to add a page to the pagecache. It must be locked.
+ * This function does not add the page to the LRU.  The caller must do that.
+ */
+int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
+		pgoff_t offset, gfp_t gfp_mask)
+{
+	return __add_to_page_cache_locked(page, mapping, offset,
+					  gfp_mask, NULL);
+}
 EXPORT_SYMBOL(add_to_page_cache_locked);
 
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t offset, gfp_t gfp_mask)
 {
+	void *shadow = NULL;
 	int ret;
 
-	ret = add_to_page_cache(page, mapping, offset, gfp_mask);
-	if (ret == 0)
-		lru_cache_add_file(page);
+	__set_page_locked(page);
+	ret = __add_to_page_cache_locked(page, mapping, offset,
+					 gfp_mask, &shadow);
+	if (unlikely(ret))
+		__clear_page_locked(page);
+	else {
+		/*
+		 * The page might have been evicted from cache only
+		 * recently, in which case it should be activated like
+		 * any other repeatedly accessed page.
+		 */
+		if (shadow && workingset_refault(shadow)) {
+			SetPageActive(page);
+			workingset_activation(page);
+		} else
+			ClearPageActive(page);
+		lru_cache_add(page);
+	}
 	return ret;
 }
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
diff --git a/mm/swap.c b/mm/swap.c
index f624e5b4b724..ece5c49d6364 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -519,6 +519,8 @@ void mark_page_accessed(struct page *page)
 		else
 			__lru_cache_activate_page(page);
 		ClearPageReferenced(page);
+		if (page_is_file_cache(page))
+			workingset_activation(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b954b31602cf..0d3c3d7f8c1b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -505,7 +505,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+			    bool reclaimed)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -551,10 +552,23 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
+		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
-
-		__delete_from_page_cache(page, NULL);
+		/*
+		 * Remember a shadow entry for reclaimed file cache in
+		 * order to detect refaults, thus thrashing, later on.
+		 *
+		 * But don't store shadows in an address space that is
+		 * already exiting.  This is not just an optizimation,
+		 * inode reclaim needs to empty out the radix tree or
+		 * the nodes are lost.  Don't plant shadows behind its
+		 * back.
+		 */
+		if (reclaimed && page_is_file_cache(page) &&
+		    !mapping_exiting(mapping))
+			shadow = workingset_eviction(mapping, page);
+		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
@@ -577,7 +591,7 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
+	if (__remove_mapping(mapping, page, false)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -1047,7 +1061,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
 
 		/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9bb314577911..3ac830d1b533 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -770,6 +770,8 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
+	"workingset_refault",
+	"workingset_activate",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 000000000000..8a6c7cff4923
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,253 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ *		Double CLOCK lists
+ *
+ * Per zone, two clock lists are maintained for file pages: the
+ * inactive and the active list.  Freshly faulted pages start out at
+ * the head of the inactive list and page reclaim scans pages from the
+ * tail.  Pages that are accessed multiple times on the inactive list
+ * are promoted to the active list, to protect them from reclaim,
+ * whereas active pages are demoted to the inactive list when the
+ * active list grows too big.
+ *
+ *   fault ------------------------+
+ *                                 |
+ *              +--------------+   |            +-------------+
+ *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
+ *              +--------------+                +-------------+    |
+ *                     |                                           |
+ *                     +-------------- promotion ------------------+
+ *
+ *
+ *		Access frequency and refault distance
+ *
+ * A workload is thrashing when its pages are frequently used but they
+ * are evicted from the inactive list every time before another access
+ * would have promoted them to the active list.
+ *
+ * In cases where the average access distance between thrashing pages
+ * is bigger than the size of memory there is nothing that can be
+ * done - the thrashing set could never fit into memory under any
+ * circumstance.
+ *
+ * However, the average access distance could be bigger than the
+ * inactive list, yet smaller than the size of memory.  In this case,
+ * the set could fit into memory if it weren't for the currently
+ * active pages - which may be used more, hopefully less frequently:
+ *
+ *      +-memory available to cache-+
+ *      |                           |
+ *      +-inactive------+-active----+
+ *  a b | c d e f g h i | J K L M N |
+ *      +---------------+-----------+
+ *
+ * It is prohibitively expensive to accurately track access frequency
+ * of pages.  But a reasonable approximation can be made to measure
+ * thrashing on the inactive list, after which refaulting pages can be
+ * activated optimistically to compete with the existing active pages.
+ *
+ * Approximating inactive page access frequency - Observations:
+ *
+ * 1. When a page is accessed for the first time, it is added to the
+ *    head of the inactive list, slides every existing inactive page
+ *    towards the tail by one slot, and pushes the current tail page
+ *    out of memory.
+ *
+ * 2. When a page is accessed for the second time, it is promoted to
+ *    the active list, shrinking the inactive list by one slot.  This
+ *    also slides all inactive pages that were faulted into the cache
+ *    more recently than the activated page towards the tail of the
+ *    inactive list.
+ *
+ * Thus:
+ *
+ * 1. The sum of evictions and activations between any two points in
+ *    time indicate the minimum number of inactive pages accessed in
+ *    between.
+ *
+ * 2. Moving one inactive page N page slots towards the tail of the
+ *    list requires at least N inactive page accesses.
+ *
+ * Combining these:
+ *
+ * 1. When a page is finally evicted from memory, the number of
+ *    inactive pages accessed while the page was in cache is at least
+ *    the number of page slots on the inactive list.
+ *
+ * 2. In addition, measuring the sum of evictions and activations (E)
+ *    at the time of a page's eviction, and comparing it to another
+ *    reading (R) at the time the page faults back into memory tells
+ *    the minimum number of accesses while the page was not cached.
+ *    This is called the refault distance.
+ *
+ * Because the first access of the page was the fault and the second
+ * access the refault, we combine the in-cache distance with the
+ * out-of-cache distance to get the complete minimum access distance
+ * of this page:
+ *
+ *      NR_inactive + (R - E)
+ *
+ * And knowing the minimum access distance of a page, we can easily
+ * tell if the page would be able to stay in cache assuming all page
+ * slots in the cache were available:
+ *
+ *   NR_inactive + (R - E) <= NR_inactive + NR_active
+ *
+ * which can be further simplified to
+ *
+ *   (R - E) <= NR_active
+ *
+ * Put into words, the refault distance (out-of-cache) can be seen as
+ * a deficit in inactive list space (in-cache).  If the inactive list
+ * had (R - E) more page slots, the page would not have been evicted
+ * in between accesses, but activated instead.  And on a full system,
+ * the only thing eating into inactive list space is active pages.
+ *
+ *
+ *		Activating refaulting pages
+ *
+ * All that is known about the active list is that the pages have been
+ * accessed more than once in the past.  This means that at any given
+ * time there is actually a good chance that pages on the active list
+ * are no longer in active use.
+ *
+ * So when a refault distance of (R - E) is observed and there are at
+ * least (R - E) active pages, the refaulting page is activated
+ * optimistically in the hope that (R - E) active pages are actually
+ * used less frequently than the refaulting page - or even not used at
+ * all anymore.
+ *
+ * If this is wrong and demotion kicks in, the pages which are truly
+ * used more frequently will be reactivated while the less frequently
+ * used once will be evicted from memory.
+ *
+ * But if this is right, the stale pages will be pushed out of memory
+ * and the used pages get to stay in cache.
+ *
+ *
+ *		Implementation
+ *
+ * For each zone's file LRU lists, a counter for inactive evictions
+ * and activations is maintained (zone->inactive_age).
+ *
+ * On eviction, a snapshot of this counter (along with some bits to
+ * identify the zone) is stored in the now empty page cache radix tree
+ * slot of the evicted page.  This is called a shadow entry.
+ *
+ * On cache misses for which there are shadow entries, an eligible
+ * refault distance will immediately activate the refaulting page.
+ */
+
+static void *pack_shadow(unsigned long eviction, struct zone *zone)
+{
+	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
+	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
+	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
+
+	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+static void unpack_shadow(void *shadow,
+			  struct zone **zone,
+			  unsigned long *distance)
+{
+	unsigned long entry = (unsigned long)shadow;
+	unsigned long eviction;
+	unsigned long refault;
+	unsigned long mask;
+	int zid, nid;
+
+	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	zid = entry & ((1UL << ZONES_SHIFT) - 1);
+	entry >>= ZONES_SHIFT;
+	nid = entry & ((1UL << NODES_SHIFT) - 1);
+	entry >>= NODES_SHIFT;
+	eviction = entry;
+
+	*zone = NODE_DATA(nid)->node_zones + zid;
+
+	refault = atomic_long_read(&(*zone)->inactive_age);
+	mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
+			RADIX_TREE_EXCEPTIONAL_SHIFT);
+	/*
+	 * The unsigned subtraction here gives an accurate distance
+	 * across inactive_age overflows in most cases.
+	 *
+	 * There is a special case: usually, shadow entries have a
+	 * short lifetime and are either refaulted or reclaimed along
+	 * with the inode before they get too old.  But it is not
+	 * impossible for the inactive_age to lap a shadow entry in
+	 * the field, which can then can result in a false small
+	 * refault distance, leading to a false activation should this
+	 * old entry actually refault again.  However, earlier kernels
+	 * used to deactivate unconditionally with *every* reclaim
+	 * invocation for the longest time, so the occasional
+	 * inappropriate activation leading to pressure on the active
+	 * list is not a problem.
+	 */
+	*distance = (refault - eviction) & mask;
+}
+
+/**
+ * workingset_eviction - note the eviction of a page from memory
+ * @mapping: address space the page was backing
+ * @page: the page being evicted
+ *
+ * Returns a shadow entry to be stored in @mapping->page_tree in place
+ * of the evicted @page so that a later refault can be detected.
+ */
+void *workingset_eviction(struct address_space *mapping, struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long eviction;
+
+	eviction = atomic_long_inc_return(&zone->inactive_age);
+	return pack_shadow(eviction, zone);
+}
+
+/**
+ * workingset_refault - evaluate the refault of a previously evicted page
+ * @shadow: shadow entry of the evicted page
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted page in the context of the zone it was allocated in.
+ *
+ * Returns %true if the page should be activated, %false otherwise.
+ */
+bool workingset_refault(void *shadow)
+{
+	unsigned long refault_distance;
+	struct zone *zone;
+
+	unpack_shadow(shadow, &zone, &refault_distance);
+	inc_zone_state(zone, WORKINGSET_REFAULT);
+
+	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+		inc_zone_state(zone, WORKINGSET_ACTIVATE);
+		return true;
+	}
+	return false;
+}
+
+/**
+ * workingset_activation - note a page activation
+ * @page: page that is being activated
+ */
+void workingset_activation(struct page *page)
+{
+	atomic_long_inc(&page_zone(page)->inactive_age);
+}
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [patch 8/9] lib: radix_tree: tree node interface
  2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
                   ` (6 preceding siblings ...)
  2014-01-10 18:10 ` [patch 7/9] mm: thrash detection-based file cache sizing Johannes Weiner
@ 2014-01-10 18:10 ` Johannes Weiner
  2014-01-10 22:57   ` Rik van Riel
  2014-01-10 18:10 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
  8 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2014-01-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Make struct radix_tree_node part of the public interface and provide
API functions to create, look up, and delete whole nodes.  Refactor
the existing insert, look up, delete functions on top of these new
node primitives.

This will allow the VM to track and garbage collect page cache radix
tree nodes.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h |  34 ++++++
 lib/radix-tree.c           | 261 +++++++++++++++++++++++++--------------------
 2 files changed, 180 insertions(+), 115 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index e8be53ecfc45..13636c40bc42 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -60,6 +60,33 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 
 #define RADIX_TREE_MAX_TAGS 3
 
+#ifdef __KERNEL__
+#define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
+#else
+#define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
+#endif
+
+#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)
+#define RADIX_TREE_MAP_MASK	(RADIX_TREE_MAP_SIZE-1)
+
+#define RADIX_TREE_TAG_LONGS	\
+	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
+
+struct radix_tree_node {
+	unsigned int	height;		/* Height from the bottom */
+	unsigned int	count;
+	union {
+		struct radix_tree_node *parent;	/* Used when ascending tree */
+		struct rcu_head	rcu_head;	/* Used when freeing node */
+	};
+	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
+	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+};
+
+#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
+#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
+					  RADIX_TREE_MAP_SHIFT))
+
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
 	unsigned int		height;
@@ -101,6 +128,7 @@ do {									\
  *   concurrently with other readers.
  *
  * The notable exceptions to this rule are the following functions:
+ * __radix_tree_lookup
  * radix_tree_lookup
  * radix_tree_lookup_slot
  * radix_tree_tag_get
@@ -216,9 +244,15 @@ static inline void radix_tree_replace_slot(void **pslot, void *item)
 	rcu_assign_pointer(*pslot, item);
 }
 
+int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
+			struct radix_tree_node **nodep, void ***slotp);
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
+void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
+			  struct radix_tree_node **nodep, void ***slotp);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+			      struct radix_tree_node *node);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e8adb5d8a184..e601c56a43d0 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -35,33 +35,6 @@
 #include <linux/hardirq.h>		/* in_interrupt() */
 
 
-#ifdef __KERNEL__
-#define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
-#else
-#define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
-#endif
-
-#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)
-#define RADIX_TREE_MAP_MASK	(RADIX_TREE_MAP_SIZE-1)
-
-#define RADIX_TREE_TAG_LONGS	\
-	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
-
-struct radix_tree_node {
-	unsigned int	height;		/* Height from the bottom */
-	unsigned int	count;
-	union {
-		struct radix_tree_node *parent;	/* Used when ascending tree */
-		struct rcu_head	rcu_head;	/* Used when freeing node */
-	};
-	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
-	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
-};
-
-#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
-#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
-					  RADIX_TREE_MAP_SHIFT))
-
 /*
  * The height_to_maxindex array needs to be one deeper than the maximum
  * path as height 0 holds only 1 entry.
@@ -387,23 +360,28 @@ out:
 }
 
 /**
- *	radix_tree_insert    -    insert into a radix tree
+ *	__radix_tree_create	-	create a slot in a radix tree
  *	@root:		radix tree root
  *	@index:		index key
- *	@item:		item to insert
+ *	@nodep:		returns node
+ *	@slotp:		returns slot
  *
- *	Insert an item into the radix tree at position @index.
+ *	Create, if necessary, and return the node and slot for an item
+ *	at position @index in the radix tree @root.
+ *
+ *	Until there is more than one item in the tree, no nodes are
+ *	allocated and @root->rnode is used as a direct slot instead of
+ *	pointing to a node, in which case *@nodep will be NULL.
+ *
+ *	Returns -ENOMEM, or 0 for success.
  */
-int radix_tree_insert(struct radix_tree_root *root,
-			unsigned long index, void *item)
+int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
+			struct radix_tree_node **nodep, void ***slotp)
 {
 	struct radix_tree_node *node = NULL, *slot;
-	unsigned int height, shift;
-	int offset;
+	unsigned int height, shift, offset;
 	int error;
 
-	BUG_ON(radix_tree_is_indirect_ptr(item));
-
 	/* Make sure the tree is high enough.  */
 	if (index > radix_tree_maxindex(root->height)) {
 		error = radix_tree_extend(root, index);
@@ -439,16 +417,40 @@ int radix_tree_insert(struct radix_tree_root *root,
 		height--;
 	}
 
-	if (slot != NULL)
+	if (nodep)
+		*nodep = node;
+	if (slotp)
+		*slotp = node ? node->slots + offset : (void **)&root->rnode;
+	return 0;
+}
+
+/**
+ *	radix_tree_insert    -    insert into a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@item:		item to insert
+ *
+ *	Insert an item into the radix tree at position @index.
+ */
+int radix_tree_insert(struct radix_tree_root *root,
+			unsigned long index, void *item)
+{
+	struct radix_tree_node *node;
+	void **slot;
+	int error;
+
+	BUG_ON(radix_tree_is_indirect_ptr(item));
+
+	error = __radix_tree_create(root, index, &node, &slot);
+	if (*slot != NULL)
 		return -EEXIST;
+	rcu_assign_pointer(*slot, item);
 
 	if (node) {
 		node->count++;
-		rcu_assign_pointer(node->slots[offset], item);
-		BUG_ON(tag_get(node, 0, offset));
-		BUG_ON(tag_get(node, 1, offset));
+		BUG_ON(tag_get(node, 0, index & RADIX_TREE_MAP_MASK));
+		BUG_ON(tag_get(node, 1, index & RADIX_TREE_MAP_MASK));
 	} else {
-		rcu_assign_pointer(root->rnode, item);
 		BUG_ON(root_tag_get(root, 0));
 		BUG_ON(root_tag_get(root, 1));
 	}
@@ -457,15 +459,26 @@ int radix_tree_insert(struct radix_tree_root *root,
 }
 EXPORT_SYMBOL(radix_tree_insert);
 
-/*
- * is_slot == 1 : search for the slot.
- * is_slot == 0 : search for the node.
+/**
+ *	__radix_tree_lookup	-	lookup an item in a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@nodep:		returns node
+ *	@slotp:		returns slot
+ *
+ *	Lookup and return the item at position @index in the radix
+ *	tree @root.
+ *
+ *	Until there is more than one item in the tree, no nodes are
+ *	allocated and @root->rnode is used as a direct slot instead of
+ *	pointing to a node, in which case *@nodep will be NULL.
  */
-static void *radix_tree_lookup_element(struct radix_tree_root *root,
-				unsigned long index, int is_slot)
+void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
+			  struct radix_tree_node **nodep, void ***slotp)
 {
+	struct radix_tree_node *node, *parent;
 	unsigned int height, shift;
-	struct radix_tree_node *node, **slot;
+	void **slot;
 
 	node = rcu_dereference_raw(root->rnode);
 	if (node == NULL)
@@ -474,7 +487,12 @@ static void *radix_tree_lookup_element(struct radix_tree_root *root,
 	if (!radix_tree_is_indirect_ptr(node)) {
 		if (index > 0)
 			return NULL;
-		return is_slot ? (void *)&root->rnode : node;
+
+		if (nodep)
+			*nodep = NULL;
+		if (slotp)
+			*slotp = (void **)&root->rnode;
+		return node;
 	}
 	node = indirect_to_ptr(node);
 
@@ -485,8 +503,8 @@ static void *radix_tree_lookup_element(struct radix_tree_root *root,
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
 	do {
-		slot = (struct radix_tree_node **)
-			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
+		parent = node;
+		slot = node->slots + ((index >> shift) & RADIX_TREE_MAP_MASK);
 		node = rcu_dereference_raw(*slot);
 		if (node == NULL)
 			return NULL;
@@ -495,7 +513,11 @@ static void *radix_tree_lookup_element(struct radix_tree_root *root,
 		height--;
 	} while (height > 0);
 
-	return is_slot ? (void *)slot : indirect_to_ptr(node);
+	if (nodep)
+		*nodep = parent;
+	if (slotp)
+		*slotp = slot;
+	return node;
 }
 
 /**
@@ -513,7 +535,11 @@ static void *radix_tree_lookup_element(struct radix_tree_root *root,
  */
 void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
 {
-	return (void **)radix_tree_lookup_element(root, index, 1);
+	void **slot;
+
+	if (!__radix_tree_lookup(root, index, NULL, &slot))
+		return NULL;
+	return slot;
 }
 EXPORT_SYMBOL(radix_tree_lookup_slot);
 
@@ -531,7 +557,7 @@ EXPORT_SYMBOL(radix_tree_lookup_slot);
  */
 void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
 {
-	return radix_tree_lookup_element(root, index, 0);
+	return __radix_tree_lookup(root, index, NULL, NULL);
 }
 EXPORT_SYMBOL(radix_tree_lookup);
 
@@ -1260,6 +1286,56 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 }
 
 /**
+ *	__radix_tree_delete_node    -    try to free node after clearing a slot
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@node:		node containing @index
+ *
+ *	After clearing the slot at @index in @node from radix tree
+ *	rooted at @root, call this function to attempt freeing the
+ *	node and shrinking the tree.
+ *
+ *	Returns %true if @node was freed, %false otherwise.
+ */
+bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+			      struct radix_tree_node *node)
+{
+	bool deleted = false;
+
+	do {
+		struct radix_tree_node *parent;
+
+		if (node->count) {
+			if (node == indirect_to_ptr(root->rnode)) {
+				radix_tree_shrink(root);
+				if (root->height == 0)
+					deleted = true;
+			}
+			return deleted;
+		}
+
+		parent = node->parent;
+		if (parent) {
+			index >>= RADIX_TREE_MAP_SHIFT;
+
+			parent->slots[index & RADIX_TREE_MAP_MASK] = NULL;
+			parent->count--;
+		} else {
+			root_tag_clear_all(root);
+			root->height = 0;
+			root->rnode = NULL;
+		}
+
+		radix_tree_node_free(node);
+		deleted = true;
+
+		node = parent;
+	} while (node);
+
+	return deleted;
+}
+
+/**
  *	radix_tree_delete_item    -    delete an item from a radix tree
  *	@root:		radix tree root
  *	@index:		index key
@@ -1273,43 +1349,26 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 void *radix_tree_delete_item(struct radix_tree_root *root,
 			     unsigned long index, void *item)
 {
-	struct radix_tree_node *node = NULL;
-	struct radix_tree_node *slot = NULL;
-	struct radix_tree_node *to_free;
-	unsigned int height, shift;
+	struct radix_tree_node *node;
+	unsigned int offset;
+	void **slot;
+	void *entry;
 	int tag;
-	int uninitialized_var(offset);
 
-	height = root->height;
-	if (index > radix_tree_maxindex(height))
-		goto out;
+	entry = __radix_tree_lookup(root, index, &node, &slot);
+	if (!entry)
+		return NULL;
 
-	slot = root->rnode;
-	if (height == 0) {
+	if (item && entry != item)
+		return NULL;
+
+	if (!node) {
 		root_tag_clear_all(root);
 		root->rnode = NULL;
-		goto out;
+		return entry;
 	}
-	slot = indirect_to_ptr(slot);
-	shift = height * RADIX_TREE_MAP_SHIFT;
-
-	do {
-		if (slot == NULL)
-			goto out;
-
-		shift -= RADIX_TREE_MAP_SHIFT;
-		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
-		node = slot;
-		slot = slot->slots[offset];
-	} while (shift);
-
-	if (slot == NULL)
-		goto out;
 
-	if (item && slot != item) {
-		slot = NULL;
-		goto out;
-	}
+	offset = index & RADIX_TREE_MAP_MASK;
 
 	/*
 	 * Clear all tags associated with the item to be deleted.
@@ -1320,40 +1379,12 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
 			radix_tree_tag_clear(root, index, tag);
 	}
 
-	to_free = NULL;
-	/* Now free the nodes we do not need anymore */
-	while (node) {
-		node->slots[offset] = NULL;
-		node->count--;
-		/*
-		 * Queue the node for deferred freeing after the
-		 * last reference to it disappears (set NULL, above).
-		 */
-		if (to_free)
-			radix_tree_node_free(to_free);
-
-		if (node->count) {
-			if (node == indirect_to_ptr(root->rnode))
-				radix_tree_shrink(root);
-			goto out;
-		}
-
-		/* Node with zero slots in use so free it */
-		to_free = node;
-
-		index >>= RADIX_TREE_MAP_SHIFT;
-		offset = index & RADIX_TREE_MAP_MASK;
-		node = node->parent;
-	}
+	node->slots[offset] = NULL;
+	node->count--;
 
-	root_tag_clear_all(root);
-	root->height = 0;
-	root->rnode = NULL;
-	if (to_free)
-		radix_tree_node_free(to_free);
+	__radix_tree_delete_node(root, index, node);
 
-out:
-	return slot;
+	return entry;
 }
 EXPORT_SYMBOL(radix_tree_delete_item);
 
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
                   ` (7 preceding siblings ...)
  2014-01-10 18:10 ` [patch 8/9] lib: radix_tree: tree node interface Johannes Weiner
@ 2014-01-10 18:10 ` Johannes Weiner
  2014-01-10 23:09   ` Rik van Riel
                     ` (3 more replies)
  8 siblings, 4 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-10 18:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.
Per-NUMA rather than global because we expect the radix tree nodes
themselves to be allocated node-locally and we want to reduce
cross-node references of otherwise independent cache workloads.  A
simple shrinker will then reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
   from the node up to the tree root, which is needed to perform a
   deletion.  To solve this, encode in each node its offset inside the
   parent.  This can be stored in the unused upper bits of the same
   member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
   regular entries, to quickly detect when the node is ready to go to
   the shadow node LRU list.  The current entry count is an unsigned
   int but the maximum number of entries is 64, so a shadow counter
   can easily be stored in the unused upper bits.

3. Tree modification needs tree lock and tree root, which are located
   in the address space, so store an address_space backpointer in the
   node.  The parent pointer of the node is in a union with the 2-word
   rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
   head inside the node.  This does increase the size of the node, but
   it does not change the number of objects that fit into a slab page.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/list_lru.h   |   2 +
 include/linux/mmzone.h     |   1 +
 include/linux/radix-tree.h |  32 +++++++++---
 include/linux/swap.h       |   1 +
 lib/radix-tree.c           |  36 ++++++++------
 mm/filemap.c               |  77 +++++++++++++++++++++++------
 mm/list_lru.c              |   8 +++
 mm/truncate.c              |  20 +++++++-
 mm/vmstat.c                |   1 +
 mm/workingset.c            | 121 +++++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 259 insertions(+), 40 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce541753c88..b02fc233eadd 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -13,6 +13,8 @@
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
 	LRU_REMOVED,		/* item removed from list */
+	LRU_REMOVED_RETRY,	/* item removed, but lock has been
+				   dropped and reacquired */
 	LRU_ROTATE,		/* item referenced, give another pass */
 	LRU_SKIP,		/* item cannot be locked, skip */
 	LRU_RETRY,		/* item not freeable. May drop the lock
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 118ba9f51e86..8cac5a7ef7a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -144,6 +144,7 @@ enum zone_stat_item {
 #endif
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_NODERECLAIM,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 13636c40bc42..33170dbd9db4 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 #define RADIX_TREE_TAG_LONGS	\
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
+#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
+#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
+					  RADIX_TREE_MAP_SHIFT))
+
+/* Height component in node->path */
+#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
+#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
+
+/* Internally used bits of node->count */
+#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
+#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
+
 struct radix_tree_node {
-	unsigned int	height;		/* Height from the bottom */
+	unsigned int	path;	/* Offset in parent & height from the bottom */
 	unsigned int	count;
 	union {
-		struct radix_tree_node *parent;	/* Used when ascending tree */
-		struct rcu_head	rcu_head;	/* Used when freeing node */
+		struct {
+			/* Used when ascending tree */
+			struct radix_tree_node *parent;
+			/* For tree user */
+			void *private_data;
+		};
+		/* Used when freeing node */
+		struct rcu_head	rcu_head;
 	};
+	/* For tree user */
+	struct list_head private_list;
 	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
 };
 
-#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
-#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
-					  RADIX_TREE_MAP_SHIFT))
-
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
 	unsigned int		height;
@@ -251,7 +267,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 			  struct radix_tree_node **nodep, void ***slotp);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
-bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+bool __radix_tree_delete_node(struct radix_tree_root *root,
 			      struct radix_tree_node *node);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b83cf61403ed..102e37bc82d5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -264,6 +264,7 @@ struct swap_list_t {
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
+extern struct list_lru workingset_shadow_nodes;
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e601c56a43d0..0a0895371447 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -342,7 +342,8 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
 
 		/* Increase the height.  */
 		newheight = root->height+1;
-		node->height = newheight;
+		BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
+		node->path = newheight;
 		node->count = 1;
 		node->parent = NULL;
 		slot = root->rnode;
@@ -400,11 +401,12 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 			/* Have to add a child node.  */
 			if (!(slot = radix_tree_node_alloc(root)))
 				return -ENOMEM;
-			slot->height = height;
+			slot->path = height;
 			slot->parent = node;
 			if (node) {
 				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
+				slot->path |= offset << RADIX_TREE_HEIGHT_SHIFT;
 			} else
 				rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
 		}
@@ -496,7 +498,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 	}
 	node = indirect_to_ptr(node);
 
-	height = node->height;
+	height = node->path & RADIX_TREE_HEIGHT_MASK;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
@@ -702,7 +704,7 @@ int radix_tree_tag_get(struct radix_tree_root *root,
 		return (index == 0);
 	node = indirect_to_ptr(node);
 
-	height = node->height;
+	height = node->path & RADIX_TREE_HEIGHT_MASK;
 	if (index > radix_tree_maxindex(height))
 		return 0;
 
@@ -739,7 +741,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
 {
 	unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK;
 	struct radix_tree_node *rnode, *node;
-	unsigned long index, offset;
+	unsigned long index, offset, height;
 
 	if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag))
 		return NULL;
@@ -770,7 +772,8 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
 		return NULL;
 
 restart:
-	shift = (rnode->height - 1) * RADIX_TREE_MAP_SHIFT;
+	height = rnode->path & RADIX_TREE_HEIGHT_MASK;
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	offset = index >> shift;
 
 	/* Index outside of the tree */
@@ -1140,7 +1143,7 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item,
 	unsigned int shift, height;
 	unsigned long i;
 
-	height = slot->height;
+	height = slot->path & RADIX_TREE_HEIGHT_MASK;
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
 	for ( ; height > 1; height--) {
@@ -1203,7 +1206,8 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item)
 		}
 
 		node = indirect_to_ptr(node);
-		max_index = radix_tree_maxindex(node->height);
+		max_index = radix_tree_maxindex(node->path &
+						RADIX_TREE_HEIGHT_MASK);
 		if (cur_index > max_index)
 			break;
 
@@ -1297,7 +1301,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
  *
  *	Returns %true if @node was freed, %false otherwise.
  */
-bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+bool __radix_tree_delete_node(struct radix_tree_root *root,
 			      struct radix_tree_node *node)
 {
 	bool deleted = false;
@@ -1316,9 +1320,10 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
 
 		parent = node->parent;
 		if (parent) {
-			index >>= RADIX_TREE_MAP_SHIFT;
+			unsigned int offset;
 
-			parent->slots[index & RADIX_TREE_MAP_MASK] = NULL;
+			offset = node->path >> RADIX_TREE_HEIGHT_SHIFT;
+			parent->slots[offset] = NULL;
 			parent->count--;
 		} else {
 			root_tag_clear_all(root);
@@ -1382,7 +1387,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
 	node->slots[offset] = NULL;
 	node->count--;
 
-	__radix_tree_delete_node(root, index, node);
+	__radix_tree_delete_node(root, node);
 
 	return entry;
 }
@@ -1415,9 +1420,12 @@ int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag)
 EXPORT_SYMBOL(radix_tree_tagged);
 
 static void
-radix_tree_node_ctor(void *node)
+radix_tree_node_ctor(void *arg)
 {
-	memset(node, 0, sizeof(struct radix_tree_node));
+	struct radix_tree_node *node = arg;
+
+	memset(node, 0, sizeof(*node));
+	INIT_LIST_HEAD(&node->private_list);
 }
 
 static __init unsigned long __maxindex(unsigned int height)
diff --git a/mm/filemap.c b/mm/filemap.c
index 65a374c0df4f..b93e223b59a9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,11 +110,17 @@
 static void page_cache_tree_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
-	if (shadow) {
-		void **slot;
+	struct radix_tree_node *node;
+	unsigned long index;
+	unsigned int offset;
+	unsigned int tag;
+	void **slot;
 
-		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
-		radix_tree_replace_slot(slot, shadow);
+	VM_BUG_ON(!PageLocked(page));
+
+	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+
+	if (shadow) {
 		mapping->nrshadows++;
 		/*
 		 * Make sure the nrshadows update is committed before
@@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
 		 * same time and miss a shadow entry.
 		 */
 		smp_wmb();
-	} else
-		radix_tree_delete(&mapping->page_tree, page->index);
+	}
 	mapping->nrpages--;
+
+	if (!node) {
+		/* Clear direct pointer tags in root node */
+		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
+		radix_tree_replace_slot(slot, shadow);
+		return;
+	}
+
+	/* Clear tree tags for the removed page */
+	index = page->index;
+	offset = index & RADIX_TREE_MAP_MASK;
+	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+		if (test_bit(offset, node->tags[tag]))
+			radix_tree_tag_clear(&mapping->page_tree, index, tag);
+	}
+
+	/* Delete page, swap shadow entry */
+	radix_tree_replace_slot(slot, shadow);
+	node->count--;
+	if (shadow)
+		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
+	else
+		if (__radix_tree_delete_node(&mapping->page_tree, node))
+			return;
+
+	/* Only shadow entries in there, keep track of this node */
+	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
+	    list_empty(&node->private_list)) {
+		node->private_data = mapping;
+		list_lru_add(&workingset_shadow_nodes, &node->private_list);
+	}
 }
 
 /*
@@ -471,27 +507,36 @@ EXPORT_SYMBOL_GPL(replace_page_cache_page);
 static int page_cache_tree_insert(struct address_space *mapping,
 				  struct page *page, void **shadowp)
 {
+	struct radix_tree_node *node;
 	void **slot;
 	int error;
 
-	slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
-	if (slot) {
+	error = __radix_tree_create(&mapping->page_tree, page->index,
+				    &node, &slot);
+	if (error)
+		return error;
+	if (*slot) {
 		void *p;
 
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
-		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
-		mapping->nrpages++;
 		if (shadowp)
 			*shadowp = p;
-		return 0;
+		mapping->nrshadows--;
+		if (node)
+			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
 	}
-	error = radix_tree_insert(&mapping->page_tree, page->index, page);
-	if (!error)
-		mapping->nrpages++;
-	return error;
+	radix_tree_replace_slot(slot, page);
+	mapping->nrpages++;
+	if (node) {
+		node->count++;
+		/* Installed page, can't be shadow-only anymore */
+		if (!list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+				     &node->private_list);
+	}
+	return 0;
 }
 
 static int __add_to_page_cache_locked(struct page *page,
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 72f9decb0104..47a9faf4070b 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -88,10 +88,18 @@ restart:
 		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case LRU_REMOVED:
+		case LRU_REMOVED_RETRY:
 			if (--nlru->nr_items == 0)
 				node_clear(nid, lru->active_nodes);
 			WARN_ON_ONCE(nlru->nr_items < 0);
 			isolated++;
+			/*
+			 * If the lru lock has been dropped, our list
+			 * traversal is now invalid and so we have to
+			 * restart from scratch.
+			 */
+			if (ret == LRU_REMOVED_RETRY)
+				goto restart;
 			break;
 		case LRU_ROTATE:
 			list_move_tail(item, &nlru->list);
diff --git a/mm/truncate.c b/mm/truncate.c
index 97606fa4c458..5c2615d7f4da 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -25,6 +25,9 @@
 static void clear_exceptional_entry(struct address_space *mapping,
 				    pgoff_t index, void *entry)
 {
+	struct radix_tree_node *node;
+	void **slot;
+
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
 		return;
@@ -35,8 +38,21 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * without the tree itself locked.  These unlocked entries
 	 * need verification under the tree lock.
 	 */
-	if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
-		mapping->nrshadows--;
+	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+	radix_tree_replace_slot(slot, NULL);
+	mapping->nrshadows--;
+	if (!node)
+		goto unlock;
+	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+	/* No more shadow entries, stop tracking the node */
+	if (!(node->count >> RADIX_TREE_COUNT_SHIFT) &&
+	    !list_empty(&node->private_list))
+		list_lru_del(&workingset_shadow_nodes, &node->private_list);
+	__radix_tree_delete_node(&mapping->page_tree, node);
+unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3ac830d1b533..baa3ba586685 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -772,6 +772,7 @@ const char * const vmstat_text[] = {
 #endif
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_nodereclaim",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 8a6c7cff4923..7bb1a432c137 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -251,3 +251,124 @@ void workingset_activation(struct page *page)
 {
 	atomic_long_inc(&page_zone(page)->inactive_age);
 }
+
+/*
+ * Page cache radix tree nodes containing only shadow entries can grow
+ * excessively on certain workloads.  That's why they are tracked on
+ * per-(NUMA)node lists and pushed back by a shrinker, but with a
+ * slightly higher threshold than regular shrinkers so we don't
+ * discard the entries too eagerly - after all, during light memory
+ * pressure is exactly when we need them.
+ */
+
+struct list_lru workingset_shadow_nodes;
+
+static unsigned long count_shadow_nodes(struct shrinker *shrinker,
+					struct shrink_control *sc)
+{
+	return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
+}
+
+static enum lru_status shadow_lru_isolate(struct list_head *item,
+					  spinlock_t *lru_lock,
+					  void *arg)
+{
+	unsigned long *nr_reclaimed = arg;
+	struct address_space *mapping;
+	struct radix_tree_node *node;
+	unsigned int i;
+	int ret;
+
+	/*
+	 * Page cache insertions and deletions synchroneously maintain
+	 * the shadow node LRU under the mapping->tree_lock and the
+	 * lru_lock.  Because the page cache tree is emptied before
+	 * the inode can be destroyed, holding the lru_lock pins any
+	 * address_space that has radix tree nodes on the LRU.
+	 *
+	 * We can then safely transition to the mapping->tree_lock to
+	 * pin only the address_space of the particular node we want
+	 * to reclaim, take the node off-LRU, and drop the lru_lock.
+	 */
+
+	node = container_of(item, struct radix_tree_node, private_list);
+	mapping = node->private_data;
+
+	/* Coming from the list, invert the lock order */
+	if (!spin_trylock_irq(&mapping->tree_lock)) {
+		spin_unlock(lru_lock);
+		ret = LRU_RETRY;
+		goto out;
+	}
+
+	list_del_init(item);
+	spin_unlock(lru_lock);
+
+	/*
+	 * The nodes should only contain one or more shadow entries,
+	 * no pages, so we expect to be able to remove them all and
+	 * delete and free the empty node afterwards.
+	 */
+
+	BUG_ON(!node->count);
+	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
+
+	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
+		if (node->slots[i]) {
+			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
+			node->slots[i] = NULL;
+			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
+			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+			BUG_ON(!mapping->nrshadows);
+			mapping->nrshadows--;
+		}
+	}
+	BUG_ON(node->count);
+	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
+	if (!__radix_tree_delete_node(&mapping->page_tree, node))
+		BUG();
+	(*nr_reclaimed)++;
+
+	spin_unlock_irq(&mapping->tree_lock);
+	ret = LRU_REMOVED_RETRY;
+out:
+	cond_resched();
+	spin_lock(lru_lock);
+	return ret;
+}
+
+static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
+				       struct shrink_control *sc)
+{
+	unsigned long nr_reclaimed = 0;
+
+	list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
+			   shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
+
+	return nr_reclaimed;
+}
+
+static struct shrinker workingset_shadow_shrinker = {
+	.count_objects = count_shadow_nodes,
+	.scan_objects = scan_shadow_nodes,
+	.seeks = DEFAULT_SEEKS * 4,
+	.flags = SHRINKER_NUMA_AWARE,
+};
+
+static int __init workingset_init(void)
+{
+	int ret;
+
+	ret = list_lru_init(&workingset_shadow_nodes);
+	if (ret)
+		goto err;
+	ret = register_shrinker(&workingset_shadow_shrinker);
+	if (ret)
+		goto err_list_lru;
+	return 0;
+err_list_lru:
+	list_lru_destroy(&workingset_shadow_nodes);
+err:
+	return ret;
+}
+module_init(workingset_init);
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [patch 3/9] mm: shmem: save one radix tree lookup when truncating swapped pages
  2014-01-10 18:10 ` [patch 3/9] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
@ 2014-01-10 18:25   ` Rik van Riel
  0 siblings, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2014-01-10 18:25 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Roman Gushchin, Ryan Mallon, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On 01/10/2014 01:10 PM, Johannes Weiner wrote:
> Page cache radix tree slots are usually stabilized by the page lock,
> but shmem's swap cookies have no such thing.  Because the overall
> truncation loop is lockless, the swap entry is currently confirmed by
> a tree lookup and then deleted by another tree lookup under the same
> tree lock region.
> 
> Use radix_tree_delete_item() instead, which does the verification and
> deletion with only one lookup.  This also allows removing the
> delete-only special case from shmem_radix_tree_replace().
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 4/9] mm: filemap: move radix tree hole searching here
  2014-01-10 18:10 ` [patch 4/9] mm: filemap: move radix tree hole searching here Johannes Weiner
@ 2014-01-10 19:22   ` Rik van Riel
  2014-01-13  1:25   ` Minchan Kim
  1 sibling, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2014-01-10 19:22 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Roman Gushchin, Ryan Mallon, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On 01/10/2014 01:10 PM, Johannes Weiner wrote:
> The radix tree hole searching code is only used for page cache, for
> example the readahead code trying to get a a picture of the area
> surrounding a fault.
> 
> It sufficed to rely on the radix tree definition of holes, which is
> "empty tree slot".  But this is about to change, though, as shadow
> page descriptors will be stored in the page cache after the actual
> pages get evicted from memory.
> 
> Move the functions over to mm/filemap.c and make them native page
> cache operations, where they can later be adapted to handle the new
> definition of "page cache hole".
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees
  2014-01-10 18:10 ` [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
@ 2014-01-10 19:39   ` Rik van Riel
  2014-01-13  2:01   ` Minchan Kim
  2014-02-12 14:00   ` Mel Gorman
  2 siblings, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2014-01-10 19:39 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Roman Gushchin, Ryan Mallon, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On 01/10/2014 01:10 PM, Johannes Weiner wrote:
> shmem mappings already contain exceptional entries where swap slot
> information is remembered.
> 
> To be able to store eviction information for regular page cache,
> prepare every site dealing with the radix trees directly to handle
> entries other than pages.
> 
> The common lookup functions will filter out non-page entries and
> return NULL for page cache holes, just as before.  But provide a raw
> version of the API which returns non-page entries as well, and switch
> shmem over to use it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 6/9] mm + fs: store shadow entries in page cache
  2014-01-10 18:10 ` [patch 6/9] mm + fs: store shadow entries in page cache Johannes Weiner
@ 2014-01-10 22:30   ` Rik van Riel
  2014-01-13  2:18   ` Minchan Kim
  1 sibling, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2014-01-10 22:30 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Roman Gushchin, Ryan Mallon, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On 01/10/2014 01:10 PM, Johannes Weiner wrote:
> Reclaim will be leaving shadow entries in the page cache radix tree
> upon evicting the real page.  As those pages are found from the LRU,
> an iput() can lead to the inode being freed concurrently.  At this
> point, reclaim must no longer install shadow pages because the inode
> freeing code needs to ensure the page tree is really empty.
> 
> Add an address_space flag, AS_EXITING, that the inode freeing code
> sets under the tree lock before doing the final truncate.  Reclaim
> will check for this flag before installing shadow pages.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 7/9] mm: thrash detection-based file cache sizing
  2014-01-10 18:10 ` [patch 7/9] mm: thrash detection-based file cache sizing Johannes Weiner
@ 2014-01-10 22:51   ` Rik van Riel
  2014-01-13  2:42   ` Minchan Kim
  2014-01-14  1:01   ` Bob Liu
  2 siblings, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2014-01-10 22:51 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Roman Gushchin, Ryan Mallon, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On 01/10/2014 01:10 PM, Johannes Weiner wrote:

> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size.  By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
> 
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot.  On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
> 
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list.  And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 8/9] lib: radix_tree: tree node interface
  2014-01-10 18:10 ` [patch 8/9] lib: radix_tree: tree node interface Johannes Weiner
@ 2014-01-10 22:57   ` Rik van Riel
  0 siblings, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2014-01-10 22:57 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Roman Gushchin, Ryan Mallon, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On 01/10/2014 01:10 PM, Johannes Weiner wrote:
> Make struct radix_tree_node part of the public interface and provide
> API functions to create, look up, and delete whole nodes.  Refactor
> the existing insert, look up, delete functions on top of these new
> node primitives.
> 
> This will allow the VM to track and garbage collect page cache radix
> tree nodes.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-10 18:10 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
@ 2014-01-10 23:09   ` Rik van Riel
  2014-01-13  7:39   ` Minchan Kim
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2014-01-10 23:09 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Roman Gushchin, Ryan Mallon, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On 01/10/2014 01:10 PM, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.
> Per-NUMA rather than global because we expect the radix tree nodes
> themselves to be allocated node-locally and we want to reduce
> cross-node references of otherwise independent cache workloads.  A
> simple shrinker will then reclaim these nodes on memory pressure.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 1/9] fs: cachefiles: use add_to_page_cache_lru()
  2014-01-10 18:10 ` [patch 1/9] fs: cachefiles: use add_to_page_cache_lru() Johannes Weiner
@ 2014-01-13  1:17   ` Minchan Kim
  0 siblings, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2014-01-13  1:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Fri, Jan 10, 2014 at 01:10:35PM -0500, Johannes Weiner wrote:
> This code used to have its own lru cache pagevec up until a0b8cab3
> ("mm: remove lru parameter from __pagevec_lru_add and remove parts of
> pagevec API").  Now it's just add_to_page_cache() followed by
> lru_cache_add(), might as well use add_to_page_cache_lru() directly.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 4/9] mm: filemap: move radix tree hole searching here
  2014-01-10 18:10 ` [patch 4/9] mm: filemap: move radix tree hole searching here Johannes Weiner
  2014-01-10 19:22   ` Rik van Riel
@ 2014-01-13  1:25   ` Minchan Kim
  1 sibling, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2014-01-13  1:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Fri, Jan 10, 2014 at 01:10:38PM -0500, Johannes Weiner wrote:
> The radix tree hole searching code is only used for page cache, for
> example the readahead code trying to get a a picture of the area
> surrounding a fault.
> 
> It sufficed to rely on the radix tree definition of holes, which is
> "empty tree slot".  But this is about to change, though, as shadow
> page descriptors will be stored in the page cache after the actual
> pages get evicted from memory.
> 
> Move the functions over to mm/filemap.c and make them native page
> cache operations, where they can later be adapted to handle the new
> definition of "page cache hole".
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees
  2014-01-10 18:10 ` [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
  2014-01-10 19:39   ` Rik van Riel
@ 2014-01-13  2:01   ` Minchan Kim
  2014-01-22 17:47     ` Johannes Weiner
  2014-02-12 14:00   ` Mel Gorman
  2 siblings, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2014-01-13  2:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Fri, Jan 10, 2014 at 01:10:39PM -0500, Johannes Weiner wrote:
> shmem mappings already contain exceptional entries where swap slot
> information is remembered.
> 
> To be able to store eviction information for regular page cache,
> prepare every site dealing with the radix trees directly to handle
> entries other than pages.
> 
> The common lookup functions will filter out non-page entries and
> return NULL for page cache holes, just as before.  But provide a raw
> version of the API which returns non-page entries as well, and switch
> shmem over to use it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>

Below are just nitpicks.

> ---
>  fs/btrfs/compression.c   |   2 +-
>  include/linux/mm.h       |   8 ++
>  include/linux/pagemap.h  |  15 ++--
>  include/linux/pagevec.h  |   3 +
>  include/linux/shmem_fs.h |   1 +
>  mm/filemap.c             | 196 +++++++++++++++++++++++++++++++++++++++++------
>  mm/mincore.c             |  20 +++--
>  mm/readahead.c           |   2 +-
>  mm/shmem.c               |  97 +++++------------------
>  mm/swap.c                |  47 ++++++++++++
>  mm/truncate.c            |  73 ++++++++++++++----
>  11 files changed, 336 insertions(+), 128 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 6aad98cb343f..c88316587900 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -474,7 +474,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  		rcu_read_lock();
>  		page = radix_tree_lookup(&mapping->page_tree, pg_index);
>  		rcu_read_unlock();
> -		if (page) {
> +		if (page && !radix_tree_exceptional_entry(page)) {
>  			misses++;
>  			if (misses > 4)
>  				break;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8b6e55ee8855..c09ef3ae55bc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -906,6 +906,14 @@ extern void show_free_areas(unsigned int flags);
>  extern bool skip_free_areas_node(unsigned int flags, int nid);
>  
>  int shmem_zero_setup(struct vm_area_struct *);
> +#ifdef CONFIG_SHMEM
> +bool shmem_mapping(struct address_space *mapping);
> +#else
> +static inline bool shmem_mapping(struct address_space *mapping)
> +{
> +	return false;
> +}
> +#endif
>  
>  extern int can_do_mlock(void);
>  extern int user_shm_lock(size_t, struct user_struct *);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index c73130c607c4..b6854b7c58cb 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -248,12 +248,15 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
>  pgoff_t page_cache_prev_hole(struct address_space *mapping,
>  			     pgoff_t index, unsigned long max_scan);
>  
> -extern struct page * find_get_page(struct address_space *mapping,
> -				pgoff_t index);
> -extern struct page * find_lock_page(struct address_space *mapping,
> -				pgoff_t index);
> -extern struct page * find_or_create_page(struct address_space *mapping,
> -				pgoff_t index, gfp_t gfp_mask);
> +struct page *__find_get_page(struct address_space *mapping, pgoff_t offset);
> +struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
> +struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset);
> +struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
> +struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
> +				 gfp_t gfp_mask);
> +unsigned __find_get_pages(struct address_space *mapping, pgoff_t start,
> +			  unsigned int nr_pages, struct page **pages,
> +			  pgoff_t *indices);
>  unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
>  			unsigned int nr_pages, struct page **pages);
>  unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
> diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
> index e4dbfab37729..3c6b8b1e945b 100644
> --- a/include/linux/pagevec.h
> +++ b/include/linux/pagevec.h
> @@ -22,6 +22,9 @@ struct pagevec {
>  
>  void __pagevec_release(struct pagevec *pvec);
>  void __pagevec_lru_add(struct pagevec *pvec);
> +unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
> +			  pgoff_t start, unsigned nr_pages, pgoff_t *indices);
> +void pagevec_remove_exceptionals(struct pagevec *pvec);
>  unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
>  		pgoff_t start, unsigned nr_pages);
>  unsigned pagevec_lookup_tag(struct pagevec *pvec,
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 30aa0dc60d75..deb49609cd36 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -49,6 +49,7 @@ extern struct file *shmem_file_setup(const char *name,
>  					loff_t size, unsigned long flags);
>  extern int shmem_zero_setup(struct vm_area_struct *);
>  extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
> +extern bool shmem_mapping(struct address_space *mapping);
>  extern void shmem_unlock_mapping(struct address_space *mapping);
>  extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>  					pgoff_t index, gfp_t gfp_mask);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0746b7a4658f..23eb3be27205 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -446,6 +446,29 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>  
> +static int page_cache_tree_insert(struct address_space *mapping,
> +				  struct page *page)
> +{
> +	void **slot;
> +	int error;
> +
> +	slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
> +	if (slot) {
> +		void *p;
> +
> +		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
> +		if (!radix_tree_exceptional_entry(p))
> +		    return -EEXIST;
> +		radix_tree_replace_slot(slot, page);
> +		mapping->nrpages++;
> +		return 0;
> +	}
> +	error = radix_tree_insert(&mapping->page_tree, page->index, page);
> +	if (!error)
> +		mapping->nrpages++;
> +	return error;
> +}
> +
>  /**
>   * add_to_page_cache_locked - add a locked page to the pagecache
>   * @page:	page to add
> @@ -480,11 +503,10 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  	page->index = offset;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	error = radix_tree_insert(&mapping->page_tree, offset, page);
> +	error = page_cache_tree_insert(mapping, page);
>  	radix_tree_preload_end();
>  	if (unlikely(error))
>  		goto err_insert;
> -	mapping->nrpages++;
>  	__inc_zone_page_state(page, NR_FILE_PAGES);
>  	spin_unlock_irq(&mapping->tree_lock);
>  	trace_mm_filemap_add_to_page_cache(page);
> @@ -712,7 +734,10 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
>  	unsigned long i;
>  
>  	for (i = 0; i < max_scan; i++) {
> -		if (!radix_tree_lookup(&mapping->page_tree, index))
> +		struct page *page;
> +
> +		page = radix_tree_lookup(&mapping->page_tree, index);
> +		if (!page || radix_tree_exceptional_entry(page))
>  			break;
>  		index++;
>  		if (index == 0)
> @@ -750,7 +775,10 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
>  	unsigned long i;
>  
>  	for (i = 0; i < max_scan; i++) {
> -		if (!radix_tree_lookup(&mapping->page_tree, index))
> +		struct page *page;
> +
> +		page = radix_tree_lookup(&mapping->page_tree, index);
> +		if (!page || radix_tree_exceptional_entry(page))
>  			break;
>  		index--;
>  		if (index == ULONG_MAX)
> @@ -762,14 +790,19 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
>  EXPORT_SYMBOL(page_cache_prev_hole);
>  
>  /**
> - * find_get_page - find and get a page reference
> + * __find_get_page - find and get a page reference
>   * @mapping: the address_space to search
>   * @offset: the page index
>   *
> - * Is there a pagecache struct page at the given (mapping, offset) tuple?
> - * If yes, increment its refcount and return it; if no, return NULL.
> + * Looks up the page cache slot at @mapping & @offset.  If there is a
> + * page cache page, it is returned with an increased refcount.
> + *
> + * If the slot holds a shadow entry of a previously evicted page, it
> + * is returned.
> + *
> + * Otherwise, %NULL is returned.
>   */
> -struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
> +struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
>  {
>  	void **pagep;
>  	struct page *page;
> @@ -810,24 +843,49 @@ out:
>  
>  	return page;
>  }
> +EXPORT_SYMBOL(__find_get_page);
> +
> +/**
> + * find_get_page - find and get a page reference
> + * @mapping: the address_space to search
> + * @offset: the page index
> + *
> + * Looks up the page cache slot at @mapping & @offset.  If there is a
> + * page cache page, it is returned with an increased refcount.
> + *
> + * Otherwise, %NULL is returned.
> + */
> +struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
> +{
> +	struct page *page = __find_get_page(mapping, offset);
> +
> +	if (radix_tree_exceptional_entry(page))
> +		page = NULL;
> +	return page;
> +}
>  EXPORT_SYMBOL(find_get_page);
>  
>  /**
> - * find_lock_page - locate, pin and lock a pagecache page
> + * __find_lock_page - locate, pin and lock a pagecache page
>   * @mapping: the address_space to search
>   * @offset: the page index
>   *
> - * Locates the desired pagecache page, locks it, increments its reference
> - * count and returns its address.
> + * Looks up the page cache slot at @mapping & @offset.  If there is a
> + * page cache page, it is returned locked and with an increased
> + * refcount.
> + *
> + * If the slot holds a shadow entry of a previously evicted page, it
> + * is returned.
> + *
> + * Otherwise, %NULL is returned.
>   *
> - * Returns zero if the page was not present. find_lock_page() may sleep.
> + * __find_lock_page() may sleep.
>   */
> -struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
> +struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
>  {
>  	struct page *page;
> -
>  repeat:
> -	page = find_get_page(mapping, offset);
> +	page = __find_get_page(mapping, offset);
>  	if (page && !radix_tree_exception(page)) {
>  		lock_page(page);
>  		/* Has the page been truncated? */
> @@ -840,6 +898,29 @@ repeat:
>  	}
>  	return page;
>  }
> +EXPORT_SYMBOL(__find_lock_page);
> +
> +/**
> + * find_lock_page - locate, pin and lock a pagecache page
> + * @mapping: the address_space to search
> + * @offset: the page index
> + *
> + * Looks up the page cache slot at @mapping & @offset.  If there is a
> + * page cache page, it is returned locked and with an increased
> + * refcount.
> + *
> + * Otherwise, %NULL is returned.
> + *
> + * find_lock_page() may sleep.
> + */
> +struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
> +{
> +	struct page *page = __find_lock_page(mapping, offset);
> +
> +	if (radix_tree_exceptional_entry(page))
> +		page = NULL;
> +	return page;
> +}
>  EXPORT_SYMBOL(find_lock_page);
>  
>  /**
> @@ -848,16 +929,18 @@ EXPORT_SYMBOL(find_lock_page);
>   * @index: the page's index into the mapping
>   * @gfp_mask: page allocation mode
>   *
> - * Locates a page in the pagecache.  If the page is not present, a new page
> - * is allocated using @gfp_mask and is added to the pagecache and to the VM's
> - * LRU list.  The returned page is locked and has its reference count
> - * incremented.
> + * Looks up the page cache slot at @mapping & @offset.  If there is a
> + * page cache page, it is returned locked and with an increased
> + * refcount.
>   *
> - * find_or_create_page() may sleep, even if @gfp_flags specifies an atomic
> - * allocation!
> + * If the page is not present, a new page is allocated using @gfp_mask
> + * and added to the page cache and the VM's LRU list.  The page is
> + * returned locked and with an increased refcount.
>   *
> - * find_or_create_page() returns the desired page's address, or zero on
> - * memory exhaustion.
> + * On memory exhaustion, %NULL is returned.
> + *
> + * find_or_create_page() may sleep, even if @gfp_flags specifies an
> + * atomic allocation!
>   */
>  struct page *find_or_create_page(struct address_space *mapping,
>  		pgoff_t index, gfp_t gfp_mask)
> @@ -890,6 +973,73 @@ repeat:
>  EXPORT_SYMBOL(find_or_create_page);
>  
>  /**
> + * __find_get_pages - gang pagecache lookup
> + * @mapping:	The address_space to search
> + * @start:	The starting page index
> + * @nr_pages:	The maximum number of pages
> + * @pages:	Where the resulting pages are placed

where is @indices?

> + *
> + * __find_get_pages() will search for and return a group of up to
> + * @nr_pages pages in the mapping.  The pages are placed at @pages.
> + * __find_get_pages() takes a reference against the returned pages.
> + *
> + * The search returns a group of mapping-contiguous pages with ascending
> + * indexes.  There may be holes in the indices due to not-present pages.
> + *
> + * Any shadow entries of evicted pages are included in the returned
> + * array.
> + *
> + * __find_get_pages() returns the number of pages and shadow entries
> + * which were found.
> + */
> +unsigned __find_get_pages(struct address_space *mapping,
> +			  pgoff_t start, unsigned int nr_pages,
> +			  struct page **pages, pgoff_t *indices)
> +{
> +	void **slot;
> +	unsigned int ret = 0;
> +	struct radix_tree_iter iter;
> +
> +	if (!nr_pages)
> +		return 0;
> +
> +	rcu_read_lock();
> +restart:
> +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> +		struct page *page;
> +repeat:
> +		page = radix_tree_deref_slot(slot);
> +		if (unlikely(!page))
> +			continue;
> +		if (radix_tree_exception(page)) {
> +			if (radix_tree_deref_retry(page))
> +				goto restart;
> +			/*
> +			 * Otherwise, we must be storing a swap entry
> +			 * here as an exceptional entry: so return it
> +			 * without attempting to raise page count.
> +			 */
> +			goto export;
> +		}
> +		if (!page_cache_get_speculative(page))
> +			goto repeat;
> +
> +		/* Has the page moved? */
> +		if (unlikely(page != *slot)) {
> +			page_cache_release(page);
> +			goto repeat;
> +		}
> +export:
> +		indices[ret] = iter.index;
> +		pages[ret] = page;
> +		if (++ret == nr_pages)
> +			break;
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +/**
>   * find_get_pages - gang pagecache lookup
>   * @mapping:	The address_space to search
>   * @start:	The starting page index
> diff --git a/mm/mincore.c b/mm/mincore.c
> index da2be56a7b8f..ad411ec86a55 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -70,13 +70,21 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
>  	 * any other file mapping (ie. marked !present and faulted in with
>  	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
>  	 */
> -	page = find_get_page(mapping, pgoff);
>  #ifdef CONFIG_SWAP
> -	/* shmem/tmpfs may return swap: account for swapcache page too. */
> -	if (radix_tree_exceptional_entry(page)) {
> -		swp_entry_t swap = radix_to_swp_entry(page);
> -		page = find_get_page(swap_address_space(swap), swap.val);
> -	}
> +	if (shmem_mapping(mapping)) {
> +		page = __find_get_page(mapping, pgoff);
> +		/*
> +		 * shmem/tmpfs may return swap: account for swapcache
> +		 * page too.
> +		 */
> +		if (radix_tree_exceptional_entry(page)) {
> +			swp_entry_t swp = radix_to_swp_entry(page);
> +			page = find_get_page(swap_address_space(swp), swp.val);
> +		}
> +	} else
> +		page = find_get_page(mapping, pgoff);
> +#else
> +	page = find_get_page(mapping, pgoff);
>  #endif
>  	if (page) {
>  		present = PageUptodate(page);
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9eeeeda4ac0e..912c00358112 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -179,7 +179,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  		rcu_read_lock();
>  		page = radix_tree_lookup(&mapping->page_tree, page_offset);
>  		rcu_read_unlock();
> -		if (page)
> +		if (page && !radix_tree_exceptional_entry(page))
>  			continue;
>  
>  		page = page_cache_alloc_readahead(mapping);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 7c67249d6f28..1f4b65f7b831 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -329,56 +329,6 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
>  }
>  
>  /*
> - * Like find_get_pages, but collecting swap entries as well as pages.
> - */
> -static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
> -					pgoff_t start, unsigned int nr_pages,
> -					struct page **pages, pgoff_t *indices)
> -{
> -	void **slot;
> -	unsigned int ret = 0;
> -	struct radix_tree_iter iter;
> -
> -	if (!nr_pages)
> -		return 0;
> -
> -	rcu_read_lock();
> -restart:
> -	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> -		struct page *page;
> -repeat:
> -		page = radix_tree_deref_slot(slot);
> -		if (unlikely(!page))
> -			continue;
> -		if (radix_tree_exception(page)) {
> -			if (radix_tree_deref_retry(page))
> -				goto restart;
> -			/*
> -			 * Otherwise, we must be storing a swap entry
> -			 * here as an exceptional entry: so return it
> -			 * without attempting to raise page count.
> -			 */
> -			goto export;
> -		}
> -		if (!page_cache_get_speculative(page))
> -			goto repeat;
> -
> -		/* Has the page moved? */
> -		if (unlikely(page != *slot)) {
> -			page_cache_release(page);
> -			goto repeat;
> -		}
> -export:
> -		indices[ret] = iter.index;
> -		pages[ret] = page;
> -		if (++ret == nr_pages)
> -			break;
> -	}
> -	rcu_read_unlock();
> -	return ret;
> -}
> -
> -/*
>   * Remove swap entry from radix tree, free the swap and its page cache.
>   */
>  static int shmem_free_swap(struct address_space *mapping,
> @@ -396,21 +346,6 @@ static int shmem_free_swap(struct address_space *mapping,
>  }
>  
>  /*
> - * Pagevec may contain swap entries, so shuffle up pages before releasing.
> - */
> -static void shmem_deswap_pagevec(struct pagevec *pvec)
> -{
> -	int i, j;
> -
> -	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
> -		struct page *page = pvec->pages[i];
> -		if (!radix_tree_exceptional_entry(page))
> -			pvec->pages[j++] = page;
> -	}
> -	pvec->nr = j;
> -}
> -
> -/*
>   * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
>   */
>  void shmem_unlock_mapping(struct address_space *mapping)
> @@ -428,12 +363,12 @@ void shmem_unlock_mapping(struct address_space *mapping)
>  		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
>  		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
>  		 */
> -		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
> +		pvec.nr = __find_get_pages(mapping, index,
>  					PAGEVEC_SIZE, pvec.pages, indices);
>  		if (!pvec.nr)
>  			break;
>  		index = indices[pvec.nr - 1] + 1;
> -		shmem_deswap_pagevec(&pvec);
> +		pagevec_remove_exceptionals(&pvec);
>  		check_move_unevictable_pages(pvec.pages, pvec.nr);
>  		pagevec_release(&pvec);
>  		cond_resched();
> @@ -465,9 +400,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  	pagevec_init(&pvec, 0);
>  	index = start;
>  	while (index < end) {
> -		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
> -				min(end - index, (pgoff_t)PAGEVEC_SIZE),
> -							pvec.pages, indices);
> +		pvec.nr = __find_get_pages(mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> +			pvec.pages, indices);
>  		if (!pvec.nr)
>  			break;
>  		mem_cgroup_uncharge_start();
> @@ -496,7 +431,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  			}
>  			unlock_page(page);
>  		}
> -		shmem_deswap_pagevec(&pvec);
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		cond_resched();
> @@ -534,9 +469,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  	index = start;
>  	for ( ; ; ) {
>  		cond_resched();
> -		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
> +
> +		pvec.nr = __find_get_pages(mapping, index,
>  				min(end - index, (pgoff_t)PAGEVEC_SIZE),
> -							pvec.pages, indices);
> +				pvec.pages, indices);
>  		if (!pvec.nr) {
>  			if (index == start || unfalloc)
>  				break;
> @@ -544,7 +480,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  			continue;
>  		}
>  		if ((index == start || unfalloc) && indices[0] >= end) {
> -			shmem_deswap_pagevec(&pvec);
> +			pagevec_remove_exceptionals(&pvec);
>  			pagevec_release(&pvec);
>  			break;
>  		}
> @@ -573,7 +509,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  			}
>  			unlock_page(page);
>  		}
> -		shmem_deswap_pagevec(&pvec);
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		index++;
> @@ -1081,7 +1017,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  		return -EFBIG;
>  repeat:
>  	swap.val = 0;
> -	page = find_lock_page(mapping, index);
> +	page = __find_lock_page(mapping, index);
>  	if (radix_tree_exceptional_entry(page)) {
>  		swap = radix_to_swp_entry(page);
>  		page = NULL;
> @@ -1418,6 +1354,11 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
>  	return inode;
>  }
>  
> +bool shmem_mapping(struct address_space *mapping)
> +{
> +	return mapping->backing_dev_info == &shmem_backing_dev_info;
> +}
> +
>  #ifdef CONFIG_TMPFS
>  static const struct inode_operations shmem_symlink_inode_operations;
>  static const struct inode_operations shmem_short_symlink_operations;
> @@ -1730,7 +1671,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
>  	pagevec_init(&pvec, 0);
>  	pvec.nr = 1;		/* start small: we may be there already */
>  	while (!done) {
> -		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
> +		pvec.nr = __find_get_pages(mapping, index,
>  					pvec.nr, pvec.pages, indices);
>  		if (!pvec.nr) {
>  			if (whence == SEEK_DATA)
> @@ -1757,7 +1698,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
>  				break;
>  			}
>  		}
> -		shmem_deswap_pagevec(&pvec);
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		pvec.nr = PAGEVEC_SIZE;
>  		cond_resched();
> diff --git a/mm/swap.c b/mm/swap.c
> index 759c3caf44bd..f624e5b4b724 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -894,6 +894,53 @@ EXPORT_SYMBOL(__pagevec_lru_add);
>  
>  /**
>   * pagevec_lookup - gang pagecache lookup

      __pagevec_lookup?

> + * @pvec:	Where the resulting entries are placed
> + * @mapping:	The address_space to search
> + * @start:	The starting entry index
> + * @nr_pages:	The maximum number of entries

      missing @indices?

> + *
> + * pagevec_lookup() will search for and return a group of up to
> + * @nr_pages pages and shadow entries in the mapping.  All entries are
> + * placed in @pvec.  pagevec_lookup() takes a reference against actual
> + * pages in @pvec.
> + *
> + * The search returns a group of mapping-contiguous entries with
> + * ascending indexes.  There may be holes in the indices due to
> + * not-present entries.
> + *
> + * pagevec_lookup() returns the number of entries which were found.

      __pagevec_lookup

> + */
> +unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
> +			  pgoff_t start, unsigned nr_pages, pgoff_t *indices)
> +{
> +	pvec->nr = __find_get_pages(mapping, start, nr_pages,
> +				    pvec->pages, indices);
> +	return pagevec_count(pvec);
> +}
> +
> +/**
> + * pagevec_remove_exceptionals - pagevec exceptionals pruning
> + * @pvec:	The pagevec to prune
> + *
> + * __pagevec_lookup() fills both pages and exceptional radix tree
> + * entries into the pagevec.  This function prunes all exceptionals
> + * from @pvec without leaving holes, so that it can be passed on to
> + * other pagevec operations.
> + */
> +void pagevec_remove_exceptionals(struct pagevec *pvec)
> +{
> +	int i, j;
> +
> +	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
> +		struct page *page = pvec->pages[i];
> +		if (!radix_tree_exceptional_entry(page))
> +			pvec->pages[j++] = page;
> +	}
> +	pvec->nr = j;
> +}
> +
> +/**
> + * pagevec_lookup - gang pagecache lookup
>   * @pvec:	Where the resulting pages are placed
>   * @mapping:	The address_space to search
>   * @start:	The starting page index
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 353b683afd6e..b0f4d4bee8ab 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -22,6 +22,22 @@
>  #include <linux/cleancache.h>
>  #include "internal.h"
>  
> +static void clear_exceptional_entry(struct address_space *mapping,
> +				    pgoff_t index, void *entry)
> +{
> +	/* Handled by shmem itself */
> +	if (shmem_mapping(mapping))
> +		return;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * Regular page slots are stabilized by the page lock even
> +	 * without the tree itself locked.  These unlocked entries
> +	 * need verification under the tree lock.
> +	 */

Could you explain why repeated spin_lock with irq disabled isn't problem
in truncation path?

> +	radix_tree_delete_item(&mapping->page_tree, index, entry);
> +	spin_unlock_irq(&mapping->tree_lock);
> +}
>  
>  /**
>   * do_invalidatepage - invalidate part or all of a page
> @@ -208,6 +224,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  	unsigned int	partial_start;	/* inclusive */
>  	unsigned int	partial_end;	/* exclusive */
>  	struct pagevec	pvec;
> +	pgoff_t		indices[PAGEVEC_SIZE];
>  	pgoff_t		index;
>  	int		i;
>  
> @@ -238,17 +255,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  
>  	pagevec_init(&pvec, 0);
>  	index = start;
> -	while (index < end && pagevec_lookup(&pvec, mapping, index,
> -			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
> +	while (index < end && __pagevec_lookup(&pvec, mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> +			indices)) {
>  		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
>  			/* We rely upon deletion not changing page->index */
> -			index = page->index;
> +			index = indices[i];
>  			if (index >= end)
>  				break;
>  
> +			if (radix_tree_exceptional_entry(page)) {
> +				clear_exceptional_entry(mapping, index, page);
> +				continue;
> +			}
> +
>  			if (!trylock_page(page))
>  				continue;
>  			WARN_ON(page->index != index);
> @@ -259,6 +282,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			truncate_inode_page(mapping, page);
>  			unlock_page(page);
>  		}
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		cond_resched();
> @@ -307,14 +331,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  	index = start;
>  	for ( ; ; ) {
>  		cond_resched();
> -		if (!pagevec_lookup(&pvec, mapping, index,
> -			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
> +		if (!__pagevec_lookup(&pvec, mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> +			indices)) {
>  			if (index == start)
>  				break;
>  			index = start;
>  			continue;
>  		}
> -		if (index == start && pvec.pages[0]->index >= end) {
> +		if (index == start && indices[0] >= end) {
>  			pagevec_release(&pvec);
>  			break;
>  		}
> @@ -323,16 +348,22 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			struct page *page = pvec.pages[i];
>  
>  			/* We rely upon deletion not changing page->index */
> -			index = page->index;
> +			index = indices[i];
>  			if (index >= end)
>  				break;
>  
> +			if (radix_tree_exceptional_entry(page)) {
> +				clear_exceptional_entry(mapping, index, page);
> +				continue;
> +			}
> +
>  			lock_page(page);
>  			WARN_ON(page->index != index);
>  			wait_on_page_writeback(page);
>  			truncate_inode_page(mapping, page);
>  			unlock_page(page);
>  		}
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		index++;
> @@ -375,6 +406,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
>  unsigned long invalidate_mapping_pages(struct address_space *mapping,
>  		pgoff_t start, pgoff_t end)
>  {
> +	pgoff_t indices[PAGEVEC_SIZE];
>  	struct pagevec pvec;
>  	pgoff_t index = start;
>  	unsigned long ret;
> @@ -390,17 +422,23 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>  	 */
>  
>  	pagevec_init(&pvec, 0);
> -	while (index <= end && pagevec_lookup(&pvec, mapping, index,
> -			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> +	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
> +			indices)) {
>  		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
>  			/* We rely upon deletion not changing page->index */
> -			index = page->index;
> +			index = indices[i];
>  			if (index > end)
>  				break;
>  
> +			if (radix_tree_exceptional_entry(page)) {
> +				clear_exceptional_entry(mapping, index, page);
> +				continue;
> +			}
> +
>  			if (!trylock_page(page))
>  				continue;
>  			WARN_ON(page->index != index);
> @@ -414,6 +452,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>  				deactivate_page(page);
>  			count += ret;
>  		}
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		cond_resched();
> @@ -481,6 +520,7 @@ static int do_launder_page(struct address_space *mapping, struct page *page)
>  int invalidate_inode_pages2_range(struct address_space *mapping,
>  				  pgoff_t start, pgoff_t end)
>  {
> +	pgoff_t indices[PAGEVEC_SIZE];
>  	struct pagevec pvec;
>  	pgoff_t index;
>  	int i;
> @@ -491,17 +531,23 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>  	cleancache_invalidate_inode(mapping);
>  	pagevec_init(&pvec, 0);
>  	index = start;
> -	while (index <= end && pagevec_lookup(&pvec, mapping, index,
> -			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> +	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
> +			indices)) {
>  		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
>  			/* We rely upon deletion not changing page->index */
> -			index = page->index;
> +			index = indices[i];
>  			if (index > end)
>  				break;
>  
> +			if (radix_tree_exceptional_entry(page)) {
> +				clear_exceptional_entry(mapping, index, page);
> +				continue;
> +			}
> +
>  			lock_page(page);
>  			WARN_ON(page->index != index);
>  			if (page->mapping != mapping) {
> @@ -539,6 +585,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>  				ret = ret2;
>  			unlock_page(page);
>  		}
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		cond_resched();
> -- 
> 1.8.4.2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 6/9] mm + fs: store shadow entries in page cache
  2014-01-10 18:10 ` [patch 6/9] mm + fs: store shadow entries in page cache Johannes Weiner
  2014-01-10 22:30   ` Rik van Riel
@ 2014-01-13  2:18   ` Minchan Kim
  1 sibling, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2014-01-13  2:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Fri, Jan 10, 2014 at 01:10:40PM -0500, Johannes Weiner wrote:
> Reclaim will be leaving shadow entries in the page cache radix tree
> upon evicting the real page.  As those pages are found from the LRU,
> an iput() can lead to the inode being freed concurrently.  At this
> point, reclaim must no longer install shadow pages because the inode
> freeing code needs to ensure the page tree is really empty.
> 
> Add an address_space flag, AS_EXITING, that the inode freeing code
> sets under the tree lock before doing the final truncate.  Reclaim
> will check for this flag before installing shadow pages.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed only vm part, NOT fs part.
Reviewed-by: Minchan Kim <minchan@kernel.org>

> ---
>  Documentation/filesystems/porting               |  6 +--
>  drivers/staging/lustre/lustre/llite/llite_lib.c |  2 +-
>  fs/9p/vfs_inode.c                               |  2 +-
>  fs/affs/inode.c                                 |  2 +-
>  fs/afs/inode.c                                  |  2 +-
>  fs/bfs/inode.c                                  |  2 +-
>  fs/block_dev.c                                  |  4 +-
>  fs/btrfs/inode.c                                |  2 +-
>  fs/cifs/cifsfs.c                                |  2 +-
>  fs/coda/inode.c                                 |  2 +-
>  fs/ecryptfs/super.c                             |  2 +-
>  fs/exofs/inode.c                                |  2 +-
>  fs/ext2/inode.c                                 |  2 +-
>  fs/ext3/inode.c                                 |  2 +-
>  fs/ext4/inode.c                                 |  4 +-
>  fs/f2fs/inode.c                                 |  2 +-
>  fs/fat/inode.c                                  |  2 +-
>  fs/freevxfs/vxfs_inode.c                        |  2 +-
>  fs/fuse/inode.c                                 |  2 +-
>  fs/gfs2/super.c                                 |  2 +-
>  fs/hfs/inode.c                                  |  2 +-
>  fs/hfsplus/super.c                              |  2 +-
>  fs/hostfs/hostfs_kern.c                         |  2 +-
>  fs/hpfs/inode.c                                 |  2 +-
>  fs/inode.c                                      |  4 +-
>  fs/jffs2/fs.c                                   |  2 +-
>  fs/jfs/inode.c                                  |  4 +-
>  fs/logfs/readwrite.c                            |  2 +-
>  fs/minix/inode.c                                |  2 +-
>  fs/ncpfs/inode.c                                |  2 +-
>  fs/nfs/inode.c                                  |  2 +-
>  fs/nfs/nfs4super.c                              |  2 +-
>  fs/nilfs2/inode.c                               |  6 +--
>  fs/ntfs/inode.c                                 |  2 +-
>  fs/ocfs2/inode.c                                |  4 +-
>  fs/omfs/inode.c                                 |  2 +-
>  fs/proc/inode.c                                 |  2 +-
>  fs/reiserfs/inode.c                             |  2 +-
>  fs/sysfs/inode.c                                |  2 +-
>  fs/sysv/inode.c                                 |  2 +-
>  fs/ubifs/super.c                                |  2 +-
>  fs/udf/inode.c                                  |  4 +-
>  fs/ufs/inode.c                                  |  2 +-
>  fs/xfs/xfs_super.c                              |  2 +-
>  include/linux/fs.h                              |  1 +
>  include/linux/mm.h                              |  1 +
>  include/linux/pagemap.h                         | 13 +++++-
>  mm/filemap.c                                    | 33 ++++++++++++---
>  mm/truncate.c                                   | 54 +++++++++++++++++++++++--
>  mm/vmscan.c                                     |  2 +-
>  50 files changed, 147 insertions(+), 65 deletions(-)
> 
> diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
> index f0890581f7f6..fc0de703066b 100644
> --- a/Documentation/filesystems/porting
> +++ b/Documentation/filesystems/porting
> @@ -295,9 +295,9 @@ in the beginning of ->setattr unconditionally.
>  	->clear_inode() and ->delete_inode() are gone; ->evict_inode() should
>  be used instead.  It gets called whenever the inode is evicted, whether it has
>  remaining links or not.  Caller does *not* evict the pagecache or inode-associated
> -metadata buffers; getting rid of those is responsibility of method, as it had
> -been for ->delete_inode(). Caller makes sure async writeback cannot be running
> -for the inode while (or after) ->evict_inode() is called.
> +metadata buffers; the method has to use truncate_inode_pages_final() to get rid
> +of those. Caller makes sure async writeback cannot be running for the inode while
> +(or after) ->evict_inode() is called.
>  
>  	->drop_inode() returns int now; it's called on final iput() with
>  inode->i_lock held and it returns true if filesystems wants the inode to be
> diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
> index b868c2bd58d2..79cbc9c5b744 100644
> --- a/drivers/staging/lustre/lustre/llite/llite_lib.c
> +++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
> @@ -1817,7 +1817,7 @@ void ll_delete_inode(struct inode *inode)
>  		cl_sync_file_range(inode, 0, OBD_OBJECT_EOF,
>  				   CL_FSYNC_DISCARD, 1);
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  
>  	/* Workaround for LU-118 */
>  	if (inode->i_data.nrpages) {
> diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
> index 94de6d1482e2..e6716c295a99 100644
> --- a/fs/9p/vfs_inode.c
> +++ b/fs/9p/vfs_inode.c
> @@ -444,7 +444,7 @@ void v9fs_evict_inode(struct inode *inode)
>  {
>  	struct v9fs_inode *v9inode = V9FS_I(inode);
>  
> -	truncate_inode_pages(inode->i_mapping, 0);
> +	truncate_inode_pages_final(inode->i_mapping);
>  	clear_inode(inode);
>  	filemap_fdatawrite(inode->i_mapping);
>  
> diff --git a/fs/affs/inode.c b/fs/affs/inode.c
> index 0e092d08680e..96df91e8c334 100644
> --- a/fs/affs/inode.c
> +++ b/fs/affs/inode.c
> @@ -259,7 +259,7 @@ affs_evict_inode(struct inode *inode)
>  {
>  	unsigned long cache_page;
>  	pr_debug("AFFS: evict_inode(ino=%lu, nlink=%u)\n", inode->i_ino, inode->i_nlink);
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  
>  	if (!inode->i_nlink) {
>  		inode->i_size = 0;
> diff --git a/fs/afs/inode.c b/fs/afs/inode.c
> index 789bc253b5f6..2bbe60e3f0e3 100644
> --- a/fs/afs/inode.c
> +++ b/fs/afs/inode.c
> @@ -422,7 +422,7 @@ void afs_evict_inode(struct inode *inode)
>  
>  	ASSERTCMP(inode->i_ino, ==, vnode->fid.vnode);
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  
>  	afs_give_up_callback(vnode);
> diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
> index 8defc6b3f9a2..29aa5cf6639b 100644
> --- a/fs/bfs/inode.c
> +++ b/fs/bfs/inode.c
> @@ -172,7 +172,7 @@ static void bfs_evict_inode(struct inode *inode)
>  
>  	dprintf("ino=%08lx\n", ino);
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	invalidate_inode_buffers(inode);
>  	clear_inode(inode);
>  
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 1e86823a9cbd..c7a7def27b07 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -83,7 +83,7 @@ void kill_bdev(struct block_device *bdev)
>  {
>  	struct address_space *mapping = bdev->bd_inode->i_mapping;
>  
> -	if (mapping->nrpages == 0)
> +	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
>  		return;
>  
>  	invalidate_bh_lrus();
> @@ -419,7 +419,7 @@ static void bdev_evict_inode(struct inode *inode)
>  {
>  	struct block_device *bdev = &BDEV_I(inode)->bdev;
>  	struct list_head *p;
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	invalidate_inode_buffers(inode); /* is it needed here? */
>  	clear_inode(inode);
>  	spin_lock(&bdev_lock);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 51e3afa78354..d3e498390189 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4471,7 +4471,7 @@ void btrfs_evict_inode(struct inode *inode)
>  
>  	trace_btrfs_inode_evict(inode);
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	if (inode->i_nlink && (btrfs_root_refs(&root->root_item) != 0 ||
>  			       btrfs_is_free_space_inode(inode)))
>  		goto no_delete;
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index 77fc5e181077..d795c50e67cb 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -286,7 +286,7 @@ cifs_destroy_inode(struct inode *inode)
>  static void
>  cifs_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	cifs_fscache_release_inode_cookie(inode);
>  }
> diff --git a/fs/coda/inode.c b/fs/coda/inode.c
> index 4dcc0d81a7aa..43a5b38fc8d3 100644
> --- a/fs/coda/inode.c
> +++ b/fs/coda/inode.c
> @@ -250,7 +250,7 @@ static void coda_put_super(struct super_block *sb)
>  
>  static void coda_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	coda_cache_clear_inode(inode);
>  }
> diff --git a/fs/ecryptfs/super.c b/fs/ecryptfs/super.c
> index e879cf8ff0b1..afa1b81c3418 100644
> --- a/fs/ecryptfs/super.c
> +++ b/fs/ecryptfs/super.c
> @@ -132,7 +132,7 @@ static int ecryptfs_statfs(struct dentry *dentry, struct kstatfs *buf)
>   */
>  static void ecryptfs_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	iput(ecryptfs_inode_to_lower(inode));
>  }
> diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
> index a52a5d23c30b..d9ff4d304b41 100644
> --- a/fs/exofs/inode.c
> +++ b/fs/exofs/inode.c
> @@ -1479,7 +1479,7 @@ void exofs_evict_inode(struct inode *inode)
>  	struct ore_io_state *ios;
>  	int ret;
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  
>  	/* TODO: should do better here */
>  	if (inode->i_nlink || is_bad_inode(inode))
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c260de6d7b6d..115fa58bb9ae 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -78,7 +78,7 @@ void ext2_evict_inode(struct inode * inode)
>  		dquot_drop(inode);
>  	}
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  
>  	if (want_delete) {
>  		sb_start_intwrite(inode->i_sb);
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 2bd85486b879..153f4bec69ef 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -228,7 +228,7 @@ void ext3_evict_inode (struct inode *inode)
>  		log_wait_commit(journal, commit_tid);
>  		filemap_write_and_wait(&inode->i_data);
>  	}
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  
>  	ext3_discard_reservation(inode);
>  	rsv = ei->i_block_alloc_info;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e274e9c1171f..3b75e70ae2eb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -214,7 +214,7 @@ void ext4_evict_inode(struct inode *inode)
>  			jbd2_complete_transaction(journal, commit_tid);
>  			filemap_write_and_wait(&inode->i_data);
>  		}
> -		truncate_inode_pages(&inode->i_data, 0);
> +		truncate_inode_pages_final(&inode->i_data);
>  
>  		WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
>  		goto no_delete;
> @@ -225,7 +225,7 @@ void ext4_evict_inode(struct inode *inode)
>  
>  	if (ext4_should_order_data(inode))
>  		ext4_begin_ordered_truncate(inode, 0);
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  
>  	WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
>  	if (is_bad_inode(inode))
> diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> index 9339cd292047..0bd44f84e79b 100644
> --- a/fs/f2fs/inode.c
> +++ b/fs/f2fs/inode.c
> @@ -246,7 +246,7 @@ void f2fs_evict_inode(struct inode *inode)
>  	int ilock;
>  
>  	trace_f2fs_evict_inode(inode);
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  
>  	if (inode->i_ino == F2FS_NODE_INO(sbi) ||
>  			inode->i_ino == F2FS_META_INO(sbi))
> diff --git a/fs/fat/inode.c b/fs/fat/inode.c
> index 0062da21dd8b..fe802d83abdb 100644
> --- a/fs/fat/inode.c
> +++ b/fs/fat/inode.c
> @@ -490,7 +490,7 @@ EXPORT_SYMBOL_GPL(fat_build_inode);
>  
>  static void fat_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	if (!inode->i_nlink) {
>  		inode->i_size = 0;
>  		fat_truncate_blocks(inode, 0);
> diff --git a/fs/freevxfs/vxfs_inode.c b/fs/freevxfs/vxfs_inode.c
> index f47df72cef17..363e3ae25f6b 100644
> --- a/fs/freevxfs/vxfs_inode.c
> +++ b/fs/freevxfs/vxfs_inode.c
> @@ -354,7 +354,7 @@ static void vxfs_i_callback(struct rcu_head *head)
>  void
>  vxfs_evict_inode(struct inode *ip)
>  {
> -	truncate_inode_pages(&ip->i_data, 0);
> +	truncate_inode_pages_final(&ip->i_data);
>  	clear_inode(ip);
>  	call_rcu(&ip->i_rcu, vxfs_i_callback);
>  }
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index a8ce6dab60a0..09d7fa05f136 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -123,7 +123,7 @@ static void fuse_destroy_inode(struct inode *inode)
>  
>  static void fuse_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	if (inode->i_sb->s_flags & MS_ACTIVE) {
>  		struct fuse_conn *fc = get_fuse_conn(inode);
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index e5639dec66c4..ac96a99c0e5d 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -1525,7 +1525,7 @@ out_unlock:
>  		fs_warn(sdp, "gfs2_evict_inode: %d\n", error);
>  out:
>  	/* Case 3 starts here */
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	gfs2_rs_delete(ip);
>  	gfs2_ordered_del_inode(ip);
>  	clear_inode(inode);
> diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
> index 380ab31b5e0f..9e2fecd62f62 100644
> --- a/fs/hfs/inode.c
> +++ b/fs/hfs/inode.c
> @@ -547,7 +547,7 @@ out:
>  
>  void hfs_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	if (HFS_IS_RSRC(inode) && HFS_I(inode)->rsrc_inode) {
>  		HFS_I(HFS_I(inode)->rsrc_inode)->rsrc_inode = NULL;
> diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
> index 4c4d142cf890..b9436d923585 100644
> --- a/fs/hfsplus/super.c
> +++ b/fs/hfsplus/super.c
> @@ -161,7 +161,7 @@ static int hfsplus_write_inode(struct inode *inode,
>  static void hfsplus_evict_inode(struct inode *inode)
>  {
>  	hfs_dbg(INODE, "hfsplus_evict_inode: %lu\n", inode->i_ino);
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	if (HFSPLUS_IS_RSRC(inode)) {
>  		HFSPLUS_I(HFSPLUS_I(inode)->rsrc_inode)->rsrc_inode = NULL;
> diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
> index 25437280a207..0c9f64070e0f 100644
> --- a/fs/hostfs/hostfs_kern.c
> +++ b/fs/hostfs/hostfs_kern.c
> @@ -239,7 +239,7 @@ static struct inode *hostfs_alloc_inode(struct super_block *sb)
>  
>  static void hostfs_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	if (HOSTFS_I(inode)->fd != -1) {
>  		close_file(&HOSTFS_I(inode)->fd);
> diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
> index 9edeeb0ea97e..50a427313835 100644
> --- a/fs/hpfs/inode.c
> +++ b/fs/hpfs/inode.c
> @@ -304,7 +304,7 @@ void hpfs_write_if_changed(struct inode *inode)
>  
>  void hpfs_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	if (!inode->i_nlink) {
>  		hpfs_lock(inode->i_sb);
> diff --git a/fs/inode.c b/fs/inode.c
> index b33ba8e021cc..093864ea2358 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -503,6 +503,7 @@ void clear_inode(struct inode *inode)
>  	 */
>  	spin_lock_irq(&inode->i_data.tree_lock);
>  	BUG_ON(inode->i_data.nrpages);
> +	BUG_ON(inode->i_data.nrshadows);
>  	spin_unlock_irq(&inode->i_data.tree_lock);
>  	BUG_ON(!list_empty(&inode->i_data.private_list));
>  	BUG_ON(!(inode->i_state & I_FREEING));
> @@ -548,8 +549,7 @@ static void evict(struct inode *inode)
>  	if (op->evict_inode) {
>  		op->evict_inode(inode);
>  	} else {
> -		if (inode->i_data.nrpages)
> -			truncate_inode_pages(&inode->i_data, 0);
> +		truncate_inode_pages_final(&inode->i_data);
>  		clear_inode(inode);
>  	}
>  	if (S_ISBLK(inode->i_mode) && inode->i_bdev)
> diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
> index fe3c0527545f..00ed6c64a579 100644
> --- a/fs/jffs2/fs.c
> +++ b/fs/jffs2/fs.c
> @@ -241,7 +241,7 @@ void jffs2_evict_inode (struct inode *inode)
>  
>  	jffs2_dbg(1, "%s(): ino #%lu mode %o\n",
>  		  __func__, inode->i_ino, inode->i_mode);
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	jffs2_do_clear_inode(c, f);
>  }
> diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
> index f4aab719add5..6f8fe72c2a7a 100644
> --- a/fs/jfs/inode.c
> +++ b/fs/jfs/inode.c
> @@ -154,7 +154,7 @@ void jfs_evict_inode(struct inode *inode)
>  		dquot_initialize(inode);
>  
>  		if (JFS_IP(inode)->fileset == FILESYSTEM_I) {
> -			truncate_inode_pages(&inode->i_data, 0);
> +			truncate_inode_pages_final(&inode->i_data);
>  
>  			if (test_cflag(COMMIT_Freewmap, inode))
>  				jfs_free_zero_link(inode);
> @@ -168,7 +168,7 @@ void jfs_evict_inode(struct inode *inode)
>  			dquot_free_inode(inode);
>  		}
>  	} else {
> -		truncate_inode_pages(&inode->i_data, 0);
> +		truncate_inode_pages_final(&inode->i_data);
>  	}
>  	clear_inode(inode);
>  	dquot_drop(inode);
> diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
> index 9a59cbade2fb..48140315f627 100644
> --- a/fs/logfs/readwrite.c
> +++ b/fs/logfs/readwrite.c
> @@ -2180,7 +2180,7 @@ void logfs_evict_inode(struct inode *inode)
>  			do_delete_inode(inode);
>  		}
>  	}
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  
>  	/* Cheaper version of write_inode.  All changes are concealed in
> diff --git a/fs/minix/inode.c b/fs/minix/inode.c
> index 0332109162a5..03aaeb1a694a 100644
> --- a/fs/minix/inode.c
> +++ b/fs/minix/inode.c
> @@ -26,7 +26,7 @@ static int minix_remount (struct super_block * sb, int * flags, char * data);
>  
>  static void minix_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	if (!inode->i_nlink) {
>  		inode->i_size = 0;
>  		minix_truncate(inode);
> diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
> index 4659da67e7f6..e728061edb13 100644
> --- a/fs/ncpfs/inode.c
> +++ b/fs/ncpfs/inode.c
> @@ -296,7 +296,7 @@ ncp_iget(struct super_block *sb, struct ncp_entry_info *info)
>  static void
>  ncp_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  
>  	if (S_ISDIR(inode->i_mode)) {
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index eda8879171c4..fbc38a62cbc9 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -128,7 +128,7 @@ EXPORT_SYMBOL_GPL(nfs_clear_inode);
>  
>  void nfs_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	nfs_clear_inode(inode);
>  }
> diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
> index e26acdd1a645..f2a5c44106b6 100644
> --- a/fs/nfs/nfs4super.c
> +++ b/fs/nfs/nfs4super.c
> @@ -98,7 +98,7 @@ static int nfs4_write_inode(struct inode *inode, struct writeback_control *wbc)
>   */
>  static void nfs4_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	pnfs_return_layout(inode);
>  	pnfs_destroy_layout(NFS_I(inode));
> diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
> index 7e350c562e0e..b9c5726120e3 100644
> --- a/fs/nilfs2/inode.c
> +++ b/fs/nilfs2/inode.c
> @@ -783,16 +783,14 @@ void nilfs_evict_inode(struct inode *inode)
>  	int ret;
>  
>  	if (inode->i_nlink || !ii->i_root || unlikely(is_bad_inode(inode))) {
> -		if (inode->i_data.nrpages)
> -			truncate_inode_pages(&inode->i_data, 0);
> +		truncate_inode_pages_final(&inode->i_data);
>  		clear_inode(inode);
>  		nilfs_clear_inode(inode);
>  		return;
>  	}
>  	nilfs_transaction_begin(sb, &ti, 0); /* never fails */
>  
> -	if (inode->i_data.nrpages)
> -		truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  
>  	/* TODO: some of the following operations may fail.  */
>  	nilfs_truncate_bmap(ii, 0);
> diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
> index 2778b0255dc6..bd50adc1e6a7 100644
> --- a/fs/ntfs/inode.c
> +++ b/fs/ntfs/inode.c
> @@ -2259,7 +2259,7 @@ void ntfs_evict_big_inode(struct inode *vi)
>  {
>  	ntfs_inode *ni = NTFS_I(vi);
>  
> -	truncate_inode_pages(&vi->i_data, 0);
> +	truncate_inode_pages_final(&vi->i_data);
>  	clear_inode(vi);
>  
>  #ifdef NTFS_RW
> diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
> index f87f9bd1edff..f1c46a7f9bc5 100644
> --- a/fs/ocfs2/inode.c
> +++ b/fs/ocfs2/inode.c
> @@ -951,7 +951,7 @@ static void ocfs2_cleanup_delete_inode(struct inode *inode,
>  		(unsigned long long)OCFS2_I(inode)->ip_blkno, sync_data);
>  	if (sync_data)
>  		filemap_write_and_wait(inode->i_mapping);
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  }
>  
>  static void ocfs2_delete_inode(struct inode *inode)
> @@ -1167,7 +1167,7 @@ void ocfs2_evict_inode(struct inode *inode)
>  	    (OCFS2_I(inode)->ip_flags & OCFS2_INODE_MAYBE_ORPHANED)) {
>  		ocfs2_delete_inode(inode);
>  	} else {
> -		truncate_inode_pages(&inode->i_data, 0);
> +		truncate_inode_pages_final(&inode->i_data);
>  	}
>  	ocfs2_clear_inode(inode);
>  }
> diff --git a/fs/omfs/inode.c b/fs/omfs/inode.c
> index d8b0afde2179..ec58c7659183 100644
> --- a/fs/omfs/inode.c
> +++ b/fs/omfs/inode.c
> @@ -183,7 +183,7 @@ int omfs_sync_inode(struct inode *inode)
>   */
>  static void omfs_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  
>  	if (inode->i_nlink)
> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> index 8eaa1ba793fc..9ca0f085dada 100644
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -35,7 +35,7 @@ static void proc_evict_inode(struct inode *inode)
>  	const struct proc_ns_operations *ns_ops;
>  	void *ns;
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  
>  	/* Stop tracking associated processes */
> diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
> index ad62bdbb451e..bc8b8009897d 100644
> --- a/fs/reiserfs/inode.c
> +++ b/fs/reiserfs/inode.c
> @@ -35,7 +35,7 @@ void reiserfs_evict_inode(struct inode *inode)
>  	if (!inode->i_nlink && !is_bad_inode(inode))
>  		dquot_initialize(inode);
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	if (inode->i_nlink)
>  		goto no_delete;
>  
> diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
> index 963f910c8034..bd0dd8d88b50 100644
> --- a/fs/sysfs/inode.c
> +++ b/fs/sysfs/inode.c
> @@ -309,7 +309,7 @@ void sysfs_evict_inode(struct inode *inode)
>  {
>  	struct sysfs_dirent *sd  = inode->i_private;
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	sysfs_put(sd);
>  }
> diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c
> index c327d4ee1235..5625ca920f5e 100644
> --- a/fs/sysv/inode.c
> +++ b/fs/sysv/inode.c
> @@ -295,7 +295,7 @@ int sysv_sync_inode(struct inode *inode)
>  
>  static void sysv_evict_inode(struct inode *inode)
>  {
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	if (!inode->i_nlink) {
>  		inode->i_size = 0;
>  		sysv_truncate(inode);
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index 3e4aa7281e04..b9ac1f350920 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -351,7 +351,7 @@ static void ubifs_evict_inode(struct inode *inode)
>  	dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
>  	ubifs_assert(!atomic_read(&inode->i_count));
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  
>  	if (inode->i_nlink)
>  		goto done;
> diff --git a/fs/udf/inode.c b/fs/udf/inode.c
> index 062b7925bca0..af6f4c38d91a 100644
> --- a/fs/udf/inode.c
> +++ b/fs/udf/inode.c
> @@ -146,8 +146,8 @@ void udf_evict_inode(struct inode *inode)
>  		want_delete = 1;
>  		udf_setsize(inode, 0);
>  		udf_update_inode(inode, IS_SYNC(inode));
> -	} else
> -		truncate_inode_pages(&inode->i_data, 0);
> +	}
> +	truncate_inode_pages_final(&inode->i_data);
>  	invalidate_inode_buffers(inode);
>  	clear_inode(inode);
>  	if (iinfo->i_alloc_type != ICBTAG_FLAG_AD_IN_ICB &&
> diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
> index c8ca96086784..61e8a9b021dd 100644
> --- a/fs/ufs/inode.c
> +++ b/fs/ufs/inode.c
> @@ -885,7 +885,7 @@ void ufs_evict_inode(struct inode * inode)
>  	if (!inode->i_nlink && !is_bad_inode(inode))
>  		want_delete = 1;
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	if (want_delete) {
>  		loff_t old_i_size;
>  		/*UFS_I(inode)->i_dtime = CURRENT_TIME;*/
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 15188cc99449..47ce25dc412d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1006,7 +1006,7 @@ xfs_fs_evict_inode(
>  
>  	trace_xfs_evict_inode(ip);
>  
> -	truncate_inode_pages(&inode->i_data, 0);
> +	truncate_inode_pages_final(&inode->i_data);
>  	clear_inode(inode);
>  	XFS_STATS_INC(vn_rele);
>  	XFS_STATS_INC(vn_remove);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3f40547ba191..9bfa5a57b4ed 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -416,6 +416,7 @@ struct address_space {
>  	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
>  	/* Protected by tree_lock together with the radix tree */
>  	unsigned long		nrpages;	/* number of total pages */
> +	unsigned long		nrshadows;	/* number of shadow entries */
>  	pgoff_t			writeback_index;/* writeback starts here */
>  	const struct address_space_operations *a_ops;	/* methods */
>  	unsigned long		flags;		/* error bits/gfp mask */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c09ef3ae55bc..5449e7a96adf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1588,6 +1588,7 @@ vm_unmapped_area(struct vm_unmapped_area_info *info)
>  extern void truncate_inode_pages(struct address_space *, loff_t);
>  extern void truncate_inode_pages_range(struct address_space *,
>  				       loff_t lstart, loff_t lend);
> +extern void truncate_inode_pages_final(struct address_space *);
>  
>  /* generic vm_area_ops exported for stackable file systems */
>  extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index b6854b7c58cb..f132fdf5ce0f 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -25,6 +25,7 @@ enum mapping_flags {
>  	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
>  	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
>  	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
> +	AS_EXITING	= __GFP_BITS_SHIFT + 5, /* final truncate in progress */
>  };
>  
>  static inline void mapping_set_error(struct address_space *mapping, int error)
> @@ -69,6 +70,16 @@ static inline int mapping_balloon(struct address_space *mapping)
>  	return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
>  }
>  
> +static inline void mapping_set_exiting(struct address_space *mapping)
> +{
> +	set_bit(AS_EXITING, &mapping->flags);
> +}
> +
> +static inline int mapping_exiting(struct address_space *mapping)
> +{
> +	return test_bit(AS_EXITING, &mapping->flags);
> +}
> +
>  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
>  {
>  	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
> @@ -547,7 +558,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  				pgoff_t index, gfp_t gfp_mask);
>  extern void delete_from_page_cache(struct page *page);
> -extern void __delete_from_page_cache(struct page *page);
> +extern void __delete_from_page_cache(struct page *page, void *shadow);
>  int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
>  
>  /*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 23eb3be27205..d02db5801dda 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -107,12 +107,33 @@
>   *   ->tasklist_lock            (memory_failure, collect_procs_ao)
>   */
>  
> +static void page_cache_tree_delete(struct address_space *mapping,
> +				   struct page *page, void *shadow)
> +{
> +	if (shadow) {
> +		void **slot;
> +
> +		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
> +		radix_tree_replace_slot(slot, shadow);
> +		mapping->nrshadows++;
> +		/*
> +		 * Make sure the nrshadows update is committed before
> +		 * the nrpages update so that final truncate racing
> +		 * with reclaim does not see both counters 0 at the
> +		 * same time and miss a shadow entry.
> +		 */
> +		smp_wmb();
> +	} else
> +		radix_tree_delete(&mapping->page_tree, page->index);
> +	mapping->nrpages--;
> +}
> +
>  /*
>   * Delete a page from the page cache and free it. Caller has to make
>   * sure the page is locked and that nobody else uses it - or that usage
>   * is safe.  The caller must hold the mapping's tree_lock.
>   */
> -void __delete_from_page_cache(struct page *page)
> +void __delete_from_page_cache(struct page *page, void *shadow)
>  {
>  	struct address_space *mapping = page->mapping;
>  
> @@ -127,10 +148,11 @@ void __delete_from_page_cache(struct page *page)
>  	else
>  		cleancache_invalidate_page(mapping, page);
>  
> -	radix_tree_delete(&mapping->page_tree, page->index);
> +	page_cache_tree_delete(mapping, page, shadow);
> +
>  	page->mapping = NULL;
>  	/* Leave page->index set: truncation lookup relies upon it */
> -	mapping->nrpages--;
> +
>  	__dec_zone_page_state(page, NR_FILE_PAGES);
>  	if (PageSwapBacked(page))
>  		__dec_zone_page_state(page, NR_SHMEM);
> @@ -166,7 +188,7 @@ void delete_from_page_cache(struct page *page)
>  
>  	freepage = mapping->a_ops->freepage;
>  	spin_lock_irq(&mapping->tree_lock);
> -	__delete_from_page_cache(page);
> +	__delete_from_page_cache(page, NULL);
>  	spin_unlock_irq(&mapping->tree_lock);
>  	mem_cgroup_uncharge_cache_page(page);
>  
> @@ -426,7 +448,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>  		new->index = offset;
>  
>  		spin_lock_irq(&mapping->tree_lock);
> -		__delete_from_page_cache(old);
> +		__delete_from_page_cache(old, NULL);
>  		error = radix_tree_insert(&mapping->page_tree, offset, new);
>  		BUG_ON(error);
>  		mapping->nrpages++;
> @@ -460,6 +482,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  		if (!radix_tree_exceptional_entry(p))
>  			return -EEXIST;
>  		radix_tree_replace_slot(slot, page);
> +		mapping->nrshadows--;
>  		mapping->nrpages++;
>  		return 0;
>  	}
> diff --git a/mm/truncate.c b/mm/truncate.c
> index b0f4d4bee8ab..97606fa4c458 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -35,7 +35,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
>  	 * without the tree itself locked.  These unlocked entries
>  	 * need verification under the tree lock.
>  	 */
> -	radix_tree_delete_item(&mapping->page_tree, index, entry);
> +	if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
> +		mapping->nrshadows--;
>  	spin_unlock_irq(&mapping->tree_lock);
>  }
>  
> @@ -229,7 +230,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  	int		i;
>  
>  	cleancache_invalidate_inode(mapping);
> -	if (mapping->nrpages == 0)
> +	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
>  		return;
>  
>  	/* Offsets within partial pages */
> @@ -391,6 +392,53 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
>  EXPORT_SYMBOL(truncate_inode_pages);
>  
>  /**
> + * truncate_inode_pages_final - truncate *all* pages before inode dies
> + * @mapping: mapping to truncate
> + *
> + * Called under (and serialized by) inode->i_mutex.
> + *
> + * Filesystems have to use this in the .evict_inode path to inform the
> + * VM that this is the final truncate and the inode is going away.
> + */
> +void truncate_inode_pages_final(struct address_space *mapping)
> +{
> +	unsigned long nrshadows;
> +	unsigned long nrpages;
> +
> +	/*
> +	 * Page reclaim can not participate in regular inode lifetime
> +	 * management (can't call iput()) and thus can race with the
> +	 * inode teardown.  Tell it when the address space is exiting,
> +	 * so that it does not install eviction information after the
> +	 * final truncate has begun.
> +	 */
> +	mapping_set_exiting(mapping);
> +
> +	/*
> +	 * When reclaim installs eviction entries, it increases
> +	 * nrshadows first, then decreases nrpages.  Make sure we see
> +	 * this in the right order or we might miss an entry.
> +	 */
> +	nrpages = mapping->nrpages;
> +	smp_rmb();
> +	nrshadows = mapping->nrshadows;
> +
> +	if (nrpages || nrshadows) {
> +		/*
> +		 * As truncation uses a lockless tree lookup, acquire
> +		 * the spinlock to make sure any ongoing tree
> +		 * modification that does not see AS_EXITING is
> +		 * completed before starting the final truncate.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		truncate_inode_pages(mapping, 0);
> +	}
> +}
> +EXPORT_SYMBOL(truncate_inode_pages_final);
> +
> +/**
>   * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
>   * @mapping: the address_space which holds the pages to invalidate
>   * @start: the offset 'from' which to invalidate
> @@ -483,7 +531,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
>  		goto failed;
>  
>  	BUG_ON(page_has_private(page));
> -	__delete_from_page_cache(page);
> +	__delete_from_page_cache(page, NULL);
>  	spin_unlock_irq(&mapping->tree_lock);
>  	mem_cgroup_uncharge_cache_page(page);
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eea668d9cff6..b954b31602cf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -554,7 +554,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
>  
>  		freepage = mapping->a_ops->freepage;
>  
> -		__delete_from_page_cache(page);
> +		__delete_from_page_cache(page, NULL);
>  		spin_unlock_irq(&mapping->tree_lock);
>  		mem_cgroup_uncharge_cache_page(page);
>  
> -- 
> 1.8.4.2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 7/9] mm: thrash detection-based file cache sizing
  2014-01-10 18:10 ` [patch 7/9] mm: thrash detection-based file cache sizing Johannes Weiner
  2014-01-10 22:51   ` Rik van Riel
@ 2014-01-13  2:42   ` Minchan Kim
  2014-01-14  1:01   ` Bob Liu
  2 siblings, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2014-01-13  2:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Fri, Jan 10, 2014 at 01:10:41PM -0500, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists.  One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past.  We call the recently used
> list "inactive list" and the frequently used list "active list".
> 
> Currently, the VM aims for a 1:1 ratio between the lists, which is the
> "perfect" trade-off between the ability to *protect* frequently used
> pages and the ability to *detect* frequently used pages.  This means
> that working set changes bigger than half of cache memory go
> undetected and thrash indefinitely, whereas working sets bigger than
> half of cache memory are unprotected against used-once streams that
> don't even need caching.
> 
> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list.  This model gave established
> working sets more gracetime in the face of temporary use-once streams,
> but ultimately was not significantly better than a FIFO policy and
> still thrashed cache based on eviction speed, rather than actual
> demand for cache.
> 
> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size.  By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
> 
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot.  On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
> 
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list.  And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>

Really nice description and code to understand.
I believe we should really merge/maintain such advanced algorithm,
which will end up putting more advanced concept easily.

Johannes, thanks for the your effort!

> ---
>  include/linux/mmzone.h |   5 +
>  include/linux/swap.h   |   5 +
>  mm/Makefile            |   2 +-
>  mm/filemap.c           |  61 ++++++++----
>  mm/swap.c              |   2 +
>  mm/vmscan.c            |  24 ++++-
>  mm/vmstat.c            |   2 +
>  mm/workingset.c        | 253 +++++++++++++++++++++++++++++++++++++++++++++++++
>  8 files changed, 331 insertions(+), 23 deletions(-)
>  create mode 100644 mm/workingset.c
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bd791e452ad7..118ba9f51e86 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -142,6 +142,8 @@ enum zone_stat_item {
>  	NUMA_LOCAL,		/* allocation from local node */
>  	NUMA_OTHER,		/* allocation from other node */
>  #endif
> +	WORKINGSET_REFAULT,
> +	WORKINGSET_ACTIVATE,
>  	NR_ANON_TRANSPARENT_HUGEPAGES,
>  	NR_FREE_CMA_PAGES,
>  	NR_VM_ZONE_STAT_ITEMS };
> @@ -392,6 +394,9 @@ struct zone {
>  	spinlock_t		lru_lock;
>  	struct lruvec		lruvec;
>  
> +	/* Evictions & activations on the inactive file list */
> +	atomic_long_t		inactive_age;
> +
>  	unsigned long		pages_scanned;	   /* since last reclaim */
>  	unsigned long		flags;		   /* zone flags, see below */
>  
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..b83cf61403ed 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -260,6 +260,11 @@ struct swap_list_t {
>  	int next;	/* swapfile to be used next */
>  };
>  
> +/* linux/mm/workingset.c */
> +void *workingset_eviction(struct address_space *mapping, struct page *page);
> +bool workingset_refault(void *shadow);
> +void workingset_activation(struct page *page);
> +
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
>  extern unsigned long totalreserve_pages;
> diff --git a/mm/Makefile b/mm/Makefile
> index 305d10acd081..b30aeb86abd6 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
>  			   compaction.o balloon_compaction.o \
> -			   interval_tree.o list_lru.o $(mmu-y)
> +			   interval_tree.o list_lru.o workingset.o $(mmu-y)
>  
>  obj-y += init-mm.o
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d02db5801dda..65a374c0df4f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>  
>  static int page_cache_tree_insert(struct address_space *mapping,
> -				  struct page *page)
> +				  struct page *page, void **shadowp)
>  {
>  	void **slot;
>  	int error;
> @@ -484,6 +484,8 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  		radix_tree_replace_slot(slot, page);
>  		mapping->nrshadows--;
>  		mapping->nrpages++;
> +		if (shadowp)
> +			*shadowp = p;
>  		return 0;
>  	}
>  	error = radix_tree_insert(&mapping->page_tree, page->index, page);
> @@ -492,18 +494,10 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  	return error;
>  }
>  
> -/**
> - * add_to_page_cache_locked - add a locked page to the pagecache
> - * @page:	page to add
> - * @mapping:	the page's address_space
> - * @offset:	page index
> - * @gfp_mask:	page allocation mode
> - *
> - * This function is used to add a page to the pagecache. It must be locked.
> - * This function does not add the page to the LRU.  The caller must do that.
> - */
> -int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> -		pgoff_t offset, gfp_t gfp_mask)
> +static int __add_to_page_cache_locked(struct page *page,
> +				      struct address_space *mapping,
> +				      pgoff_t offset, gfp_t gfp_mask,
> +				      void **shadowp)
>  {
>  	int error;
>  
> @@ -526,7 +520,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  	page->index = offset;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	error = page_cache_tree_insert(mapping, page);
> +	error = page_cache_tree_insert(mapping, page, shadowp);
>  	radix_tree_preload_end();
>  	if (unlikely(error))
>  		goto err_insert;
> @@ -542,16 +536,49 @@ err_insert:
>  	page_cache_release(page);
>  	return error;
>  }
> +
> +/**
> + * add_to_page_cache_locked - add a locked page to the pagecache
> + * @page:	page to add
> + * @mapping:	the page's address_space
> + * @offset:	page index
> + * @gfp_mask:	page allocation mode
> + *
> + * This function is used to add a page to the pagecache. It must be locked.
> + * This function does not add the page to the LRU.  The caller must do that.
> + */
> +int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> +		pgoff_t offset, gfp_t gfp_mask)
> +{
> +	return __add_to_page_cache_locked(page, mapping, offset,
> +					  gfp_mask, NULL);
> +}
>  EXPORT_SYMBOL(add_to_page_cache_locked);
>  
>  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  				pgoff_t offset, gfp_t gfp_mask)
>  {
> +	void *shadow = NULL;
>  	int ret;
>  
> -	ret = add_to_page_cache(page, mapping, offset, gfp_mask);
> -	if (ret == 0)
> -		lru_cache_add_file(page);
> +	__set_page_locked(page);
> +	ret = __add_to_page_cache_locked(page, mapping, offset,
> +					 gfp_mask, &shadow);
> +	if (unlikely(ret))
> +		__clear_page_locked(page);
> +	else {
> +		/*
> +		 * The page might have been evicted from cache only
> +		 * recently, in which case it should be activated like
> +		 * any other repeatedly accessed page.
> +		 */
> +		if (shadow && workingset_refault(shadow)) {
> +			SetPageActive(page);
> +			workingset_activation(page);
> +		} else
> +			ClearPageActive(page);
> +		lru_cache_add(page);
> +	}
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
> diff --git a/mm/swap.c b/mm/swap.c
> index f624e5b4b724..ece5c49d6364 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -519,6 +519,8 @@ void mark_page_accessed(struct page *page)
>  		else
>  			__lru_cache_activate_page(page);
>  		ClearPageReferenced(page);
> +		if (page_is_file_cache(page))
> +			workingset_activation(page);
>  	} else if (!PageReferenced(page)) {
>  		SetPageReferenced(page);
>  	}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b954b31602cf..0d3c3d7f8c1b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -505,7 +505,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
>   * Same as remove_mapping, but if the page is removed from the mapping, it
>   * gets returned with a refcount of 0.
>   */
> -static int __remove_mapping(struct address_space *mapping, struct page *page)
> +static int __remove_mapping(struct address_space *mapping, struct page *page,
> +			    bool reclaimed)
>  {
>  	BUG_ON(!PageLocked(page));
>  	BUG_ON(mapping != page_mapping(page));
> @@ -551,10 +552,23 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
>  		swapcache_free(swap, page);
>  	} else {
>  		void (*freepage)(struct page *);
> +		void *shadow = NULL;
>  
>  		freepage = mapping->a_ops->freepage;
> -
> -		__delete_from_page_cache(page, NULL);
> +		/*
> +		 * Remember a shadow entry for reclaimed file cache in
> +		 * order to detect refaults, thus thrashing, later on.
> +		 *
> +		 * But don't store shadows in an address space that is
> +		 * already exiting.  This is not just an optizimation,
> +		 * inode reclaim needs to empty out the radix tree or
> +		 * the nodes are lost.  Don't plant shadows behind its
> +		 * back.
> +		 */
> +		if (reclaimed && page_is_file_cache(page) &&
> +		    !mapping_exiting(mapping))
> +			shadow = workingset_eviction(mapping, page);
> +		__delete_from_page_cache(page, shadow);
>  		spin_unlock_irq(&mapping->tree_lock);
>  		mem_cgroup_uncharge_cache_page(page);
>  
> @@ -577,7 +591,7 @@ cannot_free:
>   */
>  int remove_mapping(struct address_space *mapping, struct page *page)
>  {
> -	if (__remove_mapping(mapping, page)) {
> +	if (__remove_mapping(mapping, page, false)) {
>  		/*
>  		 * Unfreezing the refcount with 1 rather than 2 effectively
>  		 * drops the pagecache ref for us without requiring another
> @@ -1047,7 +1061,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -		if (!mapping || !__remove_mapping(mapping, page))
> +		if (!mapping || !__remove_mapping(mapping, page, true))
>  			goto keep_locked;
>  
>  		/*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 9bb314577911..3ac830d1b533 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -770,6 +770,8 @@ const char * const vmstat_text[] = {
>  	"numa_local",
>  	"numa_other",
>  #endif
> +	"workingset_refault",
> +	"workingset_activate",
>  	"nr_anon_transparent_hugepages",
>  	"nr_free_cma",
>  	"nr_dirty_threshold",
> diff --git a/mm/workingset.c b/mm/workingset.c
> new file mode 100644
> index 000000000000..8a6c7cff4923
> --- /dev/null
> +++ b/mm/workingset.c
> @@ -0,0 +1,253 @@
> +/*
> + * Workingset detection
> + *
> + * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
> + */
> +
> +#include <linux/memcontrol.h>
> +#include <linux/writeback.h>
> +#include <linux/pagemap.h>
> +#include <linux/atomic.h>
> +#include <linux/module.h>
> +#include <linux/swap.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +
> +/*
> + *		Double CLOCK lists
> + *
> + * Per zone, two clock lists are maintained for file pages: the
> + * inactive and the active list.  Freshly faulted pages start out at
> + * the head of the inactive list and page reclaim scans pages from the
> + * tail.  Pages that are accessed multiple times on the inactive list
> + * are promoted to the active list, to protect them from reclaim,
> + * whereas active pages are demoted to the inactive list when the
> + * active list grows too big.
> + *
> + *   fault ------------------------+
> + *                                 |
> + *              +--------------+   |            +-------------+
> + *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
> + *              +--------------+                +-------------+    |
> + *                     |                                           |
> + *                     +-------------- promotion ------------------+
> + *
> + *
> + *		Access frequency and refault distance
> + *
> + * A workload is thrashing when its pages are frequently used but they
> + * are evicted from the inactive list every time before another access
> + * would have promoted them to the active list.
> + *
> + * In cases where the average access distance between thrashing pages
> + * is bigger than the size of memory there is nothing that can be
> + * done - the thrashing set could never fit into memory under any
> + * circumstance.
> + *
> + * However, the average access distance could be bigger than the
> + * inactive list, yet smaller than the size of memory.  In this case,
> + * the set could fit into memory if it weren't for the currently
> + * active pages - which may be used more, hopefully less frequently:
> + *
> + *      +-memory available to cache-+
> + *      |                           |
> + *      +-inactive------+-active----+
> + *  a b | c d e f g h i | J K L M N |
> + *      +---------------+-----------+
> + *
> + * It is prohibitively expensive to accurately track access frequency
> + * of pages.  But a reasonable approximation can be made to measure
> + * thrashing on the inactive list, after which refaulting pages can be
> + * activated optimistically to compete with the existing active pages.
> + *
> + * Approximating inactive page access frequency - Observations:
> + *
> + * 1. When a page is accessed for the first time, it is added to the
> + *    head of the inactive list, slides every existing inactive page
> + *    towards the tail by one slot, and pushes the current tail page
> + *    out of memory.
> + *
> + * 2. When a page is accessed for the second time, it is promoted to
> + *    the active list, shrinking the inactive list by one slot.  This
> + *    also slides all inactive pages that were faulted into the cache
> + *    more recently than the activated page towards the tail of the
> + *    inactive list.
> + *
> + * Thus:
> + *
> + * 1. The sum of evictions and activations between any two points in
> + *    time indicate the minimum number of inactive pages accessed in
> + *    between.
> + *
> + * 2. Moving one inactive page N page slots towards the tail of the
> + *    list requires at least N inactive page accesses.
> + *
> + * Combining these:
> + *
> + * 1. When a page is finally evicted from memory, the number of
> + *    inactive pages accessed while the page was in cache is at least
> + *    the number of page slots on the inactive list.
> + *
> + * 2. In addition, measuring the sum of evictions and activations (E)
> + *    at the time of a page's eviction, and comparing it to another
> + *    reading (R) at the time the page faults back into memory tells
> + *    the minimum number of accesses while the page was not cached.
> + *    This is called the refault distance.
> + *
> + * Because the first access of the page was the fault and the second
> + * access the refault, we combine the in-cache distance with the
> + * out-of-cache distance to get the complete minimum access distance
> + * of this page:
> + *
> + *      NR_inactive + (R - E)
> + *
> + * And knowing the minimum access distance of a page, we can easily
> + * tell if the page would be able to stay in cache assuming all page
> + * slots in the cache were available:
> + *
> + *   NR_inactive + (R - E) <= NR_inactive + NR_active
> + *
> + * which can be further simplified to
> + *
> + *   (R - E) <= NR_active
> + *
> + * Put into words, the refault distance (out-of-cache) can be seen as
> + * a deficit in inactive list space (in-cache).  If the inactive list
> + * had (R - E) more page slots, the page would not have been evicted
> + * in between accesses, but activated instead.  And on a full system,
> + * the only thing eating into inactive list space is active pages.
> + *
> + *
> + *		Activating refaulting pages
> + *
> + * All that is known about the active list is that the pages have been
> + * accessed more than once in the past.  This means that at any given
> + * time there is actually a good chance that pages on the active list
> + * are no longer in active use.
> + *
> + * So when a refault distance of (R - E) is observed and there are at
> + * least (R - E) active pages, the refaulting page is activated
> + * optimistically in the hope that (R - E) active pages are actually
> + * used less frequently than the refaulting page - or even not used at
> + * all anymore.
> + *
> + * If this is wrong and demotion kicks in, the pages which are truly
> + * used more frequently will be reactivated while the less frequently
> + * used once will be evicted from memory.
> + *
> + * But if this is right, the stale pages will be pushed out of memory
> + * and the used pages get to stay in cache.
> + *
> + *
> + *		Implementation
> + *
> + * For each zone's file LRU lists, a counter for inactive evictions
> + * and activations is maintained (zone->inactive_age).
> + *
> + * On eviction, a snapshot of this counter (along with some bits to
> + * identify the zone) is stored in the now empty page cache radix tree
> + * slot of the evicted page.  This is called a shadow entry.
> + *
> + * On cache misses for which there are shadow entries, an eligible
> + * refault distance will immediately activate the refaulting page.
> + */
> +
> +static void *pack_shadow(unsigned long eviction, struct zone *zone)
> +{
> +	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
> +	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> +	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
> +
> +	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
> +}
> +
> +static void unpack_shadow(void *shadow,
> +			  struct zone **zone,
> +			  unsigned long *distance)
> +{
> +	unsigned long entry = (unsigned long)shadow;
> +	unsigned long eviction;
> +	unsigned long refault;
> +	unsigned long mask;
> +	int zid, nid;
> +
> +	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> +	zid = entry & ((1UL << ZONES_SHIFT) - 1);
> +	entry >>= ZONES_SHIFT;
> +	nid = entry & ((1UL << NODES_SHIFT) - 1);
> +	entry >>= NODES_SHIFT;
> +	eviction = entry;
> +
> +	*zone = NODE_DATA(nid)->node_zones + zid;
> +
> +	refault = atomic_long_read(&(*zone)->inactive_age);
> +	mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
> +			RADIX_TREE_EXCEPTIONAL_SHIFT);
> +	/*
> +	 * The unsigned subtraction here gives an accurate distance
> +	 * across inactive_age overflows in most cases.
> +	 *
> +	 * There is a special case: usually, shadow entries have a
> +	 * short lifetime and are either refaulted or reclaimed along
> +	 * with the inode before they get too old.  But it is not
> +	 * impossible for the inactive_age to lap a shadow entry in
> +	 * the field, which can then can result in a false small
> +	 * refault distance, leading to a false activation should this
> +	 * old entry actually refault again.  However, earlier kernels
> +	 * used to deactivate unconditionally with *every* reclaim
> +	 * invocation for the longest time, so the occasional
> +	 * inappropriate activation leading to pressure on the active
> +	 * list is not a problem.
> +	 */
> +	*distance = (refault - eviction) & mask;
> +}
> +
> +/**
> + * workingset_eviction - note the eviction of a page from memory
> + * @mapping: address space the page was backing
> + * @page: the page being evicted
> + *
> + * Returns a shadow entry to be stored in @mapping->page_tree in place
> + * of the evicted @page so that a later refault can be detected.
> + */
> +void *workingset_eviction(struct address_space *mapping, struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +	unsigned long eviction;
> +
> +	eviction = atomic_long_inc_return(&zone->inactive_age);
> +	return pack_shadow(eviction, zone);
> +}
> +
> +/**
> + * workingset_refault - evaluate the refault of a previously evicted page
> + * @shadow: shadow entry of the evicted page
> + *
> + * Calculates and evaluates the refault distance of the previously
> + * evicted page in the context of the zone it was allocated in.
> + *
> + * Returns %true if the page should be activated, %false otherwise.
> + */
> +bool workingset_refault(void *shadow)
> +{
> +	unsigned long refault_distance;
> +	struct zone *zone;
> +
> +	unpack_shadow(shadow, &zone, &refault_distance);
> +	inc_zone_state(zone, WORKINGSET_REFAULT);
> +
> +	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> +		inc_zone_state(zone, WORKINGSET_ACTIVATE);
> +		return true;
> +	}
> +	return false;
> +}
> +
> +/**
> + * workingset_activation - note a page activation
> + * @page: page that is being activated
> + */
> +void workingset_activation(struct page *page)
> +{
> +	atomic_long_inc(&page_zone(page)->inactive_age);
> +}
> -- 
> 1.8.4.2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-10 18:10 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
  2014-01-10 23:09   ` Rik van Riel
@ 2014-01-13  7:39   ` Minchan Kim
  2014-01-14  5:40     ` Minchan Kim
  2014-01-22 18:42     ` Johannes Weiner
  2014-01-15  5:55   ` Bob Liu
  2014-01-17  0:05   ` Dave Chinner
  3 siblings, 2 replies; 58+ messages in thread
From: Minchan Kim @ 2014-01-13  7:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.
> Per-NUMA rather than global because we expect the radix tree nodes
> themselves to be allocated node-locally and we want to reduce
> cross-node references of otherwise independent cache workloads.  A
> simple shrinker will then reclaim these nodes on memory pressure.
> 
> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:
> 
> 1. There is no index available that would describe the reverse path
>    from the node up to the tree root, which is needed to perform a
>    deletion.  To solve this, encode in each node its offset inside the
>    parent.  This can be stored in the unused upper bits of the same
>    member that stores the node's height at no extra space cost.
> 
> 2. The number of shadow entries needs to be counted in addition to the
>    regular entries, to quickly detect when the node is ready to go to
>    the shadow node LRU list.  The current entry count is an unsigned
>    int but the maximum number of entries is 64, so a shadow counter
>    can easily be stored in the unused upper bits.
> 
> 3. Tree modification needs tree lock and tree root, which are located
>    in the address space, so store an address_space backpointer in the
>    node.  The parent pointer of the node is in a union with the 2-word
>    rcu_head, so the backpointer comes at no extra cost as well.
> 
> 4. The node needs to be linked to an LRU list, which requires a list
>    head inside the node.  This does increase the size of the node, but
>    it does not change the number of objects that fit into a slab page.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/list_lru.h   |   2 +
>  include/linux/mmzone.h     |   1 +
>  include/linux/radix-tree.h |  32 +++++++++---
>  include/linux/swap.h       |   1 +
>  lib/radix-tree.c           |  36 ++++++++------
>  mm/filemap.c               |  77 +++++++++++++++++++++++------
>  mm/list_lru.c              |   8 +++
>  mm/truncate.c              |  20 +++++++-
>  mm/vmstat.c                |   1 +
>  mm/workingset.c            | 121 +++++++++++++++++++++++++++++++++++++++++++++
>  10 files changed, 259 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 3ce541753c88..b02fc233eadd 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -13,6 +13,8 @@
>  /* list_lru_walk_cb has to always return one of those */
>  enum lru_status {
>  	LRU_REMOVED,		/* item removed from list */
> +	LRU_REMOVED_RETRY,	/* item removed, but lock has been
> +				   dropped and reacquired */
>  	LRU_ROTATE,		/* item referenced, give another pass */
>  	LRU_SKIP,		/* item cannot be locked, skip */
>  	LRU_RETRY,		/* item not freeable. May drop the lock
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 118ba9f51e86..8cac5a7ef7a7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -144,6 +144,7 @@ enum zone_stat_item {
>  #endif
>  	WORKINGSET_REFAULT,
>  	WORKINGSET_ACTIVATE,
> +	WORKINGSET_NODERECLAIM,
>  	NR_ANON_TRANSPARENT_HUGEPAGES,
>  	NR_FREE_CMA_PAGES,
>  	NR_VM_ZONE_STAT_ITEMS };
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 13636c40bc42..33170dbd9db4 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
>  #define RADIX_TREE_TAG_LONGS	\
>  	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
>  
> +#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> +#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> +					  RADIX_TREE_MAP_SHIFT))
> +
> +/* Height component in node->path */
> +#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
> +#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
> +
> +/* Internally used bits of node->count */
> +#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
> +#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
> +
>  struct radix_tree_node {
> -	unsigned int	height;		/* Height from the bottom */
> +	unsigned int	path;	/* Offset in parent & height from the bottom */
>  	unsigned int	count;
>  	union {
> -		struct radix_tree_node *parent;	/* Used when ascending tree */
> -		struct rcu_head	rcu_head;	/* Used when freeing node */
> +		struct {
> +			/* Used when ascending tree */
> +			struct radix_tree_node *parent;
> +			/* For tree user */
> +			void *private_data;
> +		};
> +		/* Used when freeing node */
> +		struct rcu_head	rcu_head;
>  	};
> +	/* For tree user */
> +	struct list_head private_list;
>  	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
>  	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
>  };
>  
> -#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> -#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> -					  RADIX_TREE_MAP_SHIFT))
> -
>  /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
>  struct radix_tree_root {
>  	unsigned int		height;
> @@ -251,7 +267,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
>  			  struct radix_tree_node **nodep, void ***slotp);
>  void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
>  void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
> -bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
> +bool __radix_tree_delete_node(struct radix_tree_root *root,
>  			      struct radix_tree_node *node);
>  void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
>  void *radix_tree_delete(struct radix_tree_root *, unsigned long);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b83cf61403ed..102e37bc82d5 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -264,6 +264,7 @@ struct swap_list_t {
>  void *workingset_eviction(struct address_space *mapping, struct page *page);
>  bool workingset_refault(void *shadow);
>  void workingset_activation(struct page *page);
> +extern struct list_lru workingset_shadow_nodes;
>  
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index e601c56a43d0..0a0895371447 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -342,7 +342,8 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
>  
>  		/* Increase the height.  */
>  		newheight = root->height+1;
> -		node->height = newheight;
> +		BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
> +		node->path = newheight;

Nitpick:
It would be better to add some accessor for path and offset for
readability and future enhance?

>  		node->count = 1;
>  		node->parent = NULL;
>  		slot = root->rnode;
> @@ -400,11 +401,12 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
>  			/* Have to add a child node.  */
>  			if (!(slot = radix_tree_node_alloc(root)))
>  				return -ENOMEM;
> -			slot->height = height;
> +			slot->path = height;
>  			slot->parent = node;
>  			if (node) {
>  				rcu_assign_pointer(node->slots[offset], slot);
>  				node->count++;
> +				slot->path |= offset << RADIX_TREE_HEIGHT_SHIFT;
>  			} else
>  				rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
>  		}
> @@ -496,7 +498,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
>  	}
>  	node = indirect_to_ptr(node);
>  
> -	height = node->height;
> +	height = node->path & RADIX_TREE_HEIGHT_MASK;
>  	if (index > radix_tree_maxindex(height))
>  		return NULL;
>  
> @@ -702,7 +704,7 @@ int radix_tree_tag_get(struct radix_tree_root *root,
>  		return (index == 0);
>  	node = indirect_to_ptr(node);
>  
> -	height = node->height;
> +	height = node->path & RADIX_TREE_HEIGHT_MASK;
>  	if (index > radix_tree_maxindex(height))
>  		return 0;
>  
> @@ -739,7 +741,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
>  {
>  	unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK;
>  	struct radix_tree_node *rnode, *node;
> -	unsigned long index, offset;
> +	unsigned long index, offset, height;
>  
>  	if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag))
>  		return NULL;
> @@ -770,7 +772,8 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
>  		return NULL;
>  
>  restart:
> -	shift = (rnode->height - 1) * RADIX_TREE_MAP_SHIFT;
> +	height = rnode->path & RADIX_TREE_HEIGHT_MASK;
> +	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
>  	offset = index >> shift;
>  
>  	/* Index outside of the tree */
> @@ -1140,7 +1143,7 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item,
>  	unsigned int shift, height;
>  	unsigned long i;
>  
> -	height = slot->height;
> +	height = slot->path & RADIX_TREE_HEIGHT_MASK;
>  	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
>  
>  	for ( ; height > 1; height--) {
> @@ -1203,7 +1206,8 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item)
>  		}
>  
>  		node = indirect_to_ptr(node);
> -		max_index = radix_tree_maxindex(node->height);
> +		max_index = radix_tree_maxindex(node->path &
> +						RADIX_TREE_HEIGHT_MASK);
>  		if (cur_index > max_index)
>  			break;
>  
> @@ -1297,7 +1301,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
>   *
>   *	Returns %true if @node was freed, %false otherwise.
>   */
> -bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
> +bool __radix_tree_delete_node(struct radix_tree_root *root,
>  			      struct radix_tree_node *node)
>  {
>  	bool deleted = false;
> @@ -1316,9 +1320,10 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
>  
>  		parent = node->parent;
>  		if (parent) {
> -			index >>= RADIX_TREE_MAP_SHIFT;
> +			unsigned int offset;
>  
> -			parent->slots[index & RADIX_TREE_MAP_MASK] = NULL;
> +			offset = node->path >> RADIX_TREE_HEIGHT_SHIFT;
> +			parent->slots[offset] = NULL;
>  			parent->count--;
>  		} else {
>  			root_tag_clear_all(root);
> @@ -1382,7 +1387,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
>  	node->slots[offset] = NULL;
>  	node->count--;
>  
> -	__radix_tree_delete_node(root, index, node);
> +	__radix_tree_delete_node(root, node);
>  
>  	return entry;
>  }
> @@ -1415,9 +1420,12 @@ int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag)
>  EXPORT_SYMBOL(radix_tree_tagged);
>  
>  static void
> -radix_tree_node_ctor(void *node)
> +radix_tree_node_ctor(void *arg)
>  {
> -	memset(node, 0, sizeof(struct radix_tree_node));
> +	struct radix_tree_node *node = arg;
> +
> +	memset(node, 0, sizeof(*node));
> +	INIT_LIST_HEAD(&node->private_list);
>  }
>  
>  static __init unsigned long __maxindex(unsigned int height)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 65a374c0df4f..b93e223b59a9 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -110,11 +110,17 @@
>  static void page_cache_tree_delete(struct address_space *mapping,
>  				   struct page *page, void *shadow)
>  {
> -	if (shadow) {
> -		void **slot;
> +	struct radix_tree_node *node;
> +	unsigned long index;
> +	unsigned int offset;
> +	unsigned int tag;
> +	void **slot;
>  
> -		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
> -		radix_tree_replace_slot(slot, shadow);
> +	VM_BUG_ON(!PageLocked(page));
> +
> +	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
> +
> +	if (shadow) {
>  		mapping->nrshadows++;
>  		/*
>  		 * Make sure the nrshadows update is committed before
> @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
>  		 * same time and miss a shadow entry.
>  		 */
>  		smp_wmb();
> -	} else
> -		radix_tree_delete(&mapping->page_tree, page->index);
> +	}
>  	mapping->nrpages--;
> +
> +	if (!node) {
> +		/* Clear direct pointer tags in root node */
> +		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> +		radix_tree_replace_slot(slot, shadow);
> +		return;
> +	}
> +
> +	/* Clear tree tags for the removed page */
> +	index = page->index;
> +	offset = index & RADIX_TREE_MAP_MASK;
> +	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> +		if (test_bit(offset, node->tags[tag]))
> +			radix_tree_tag_clear(&mapping->page_tree, index, tag);
> +	}
> +
> +	/* Delete page, swap shadow entry */
> +	radix_tree_replace_slot(slot, shadow);
> +	node->count--;
> +	if (shadow)
> +		node->count += 1U << RADIX_TREE_COUNT_SHIFT;

Nitpick2:
It should be a function of workingset.c rather than exposing
RADIX_TREE_COUNT_SHIFT?

IMO, It would be better to provide some accessor functions here, too.

I didn't review locking part yet and will review it tomorrow with
fresh brain. :)

> +	else
> +		if (__radix_tree_delete_node(&mapping->page_tree, node))
> +			return;
> +
> +	/* Only shadow entries in there, keep track of this node */
> +	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> +	    list_empty(&node->private_list)) {
> +		node->private_data = mapping;
> +		list_lru_add(&workingset_shadow_nodes, &node->private_list);
> +	}
>  }
>  
>  /*
> @@ -471,27 +507,36 @@ EXPORT_SYMBOL_GPL(replace_page_cache_page);
>  static int page_cache_tree_insert(struct address_space *mapping,
>  				  struct page *page, void **shadowp)
>  {
> +	struct radix_tree_node *node;
>  	void **slot;
>  	int error;
>  
> -	slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
> -	if (slot) {
> +	error = __radix_tree_create(&mapping->page_tree, page->index,
> +				    &node, &slot);
> +	if (error)
> +		return error;
> +	if (*slot) {
>  		void *p;
>  
>  		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
>  		if (!radix_tree_exceptional_entry(p))
>  			return -EEXIST;
> -		radix_tree_replace_slot(slot, page);
> -		mapping->nrshadows--;
> -		mapping->nrpages++;
>  		if (shadowp)
>  			*shadowp = p;
> -		return 0;
> +		mapping->nrshadows--;
> +		if (node)
> +			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
>  	}
> -	error = radix_tree_insert(&mapping->page_tree, page->index, page);
> -	if (!error)
> -		mapping->nrpages++;
> -	return error;
> +	radix_tree_replace_slot(slot, page);
> +	mapping->nrpages++;
> +	if (node) {
> +		node->count++;
> +		/* Installed page, can't be shadow-only anymore */
> +		if (!list_empty(&node->private_list))
> +			list_lru_del(&workingset_shadow_nodes,
> +				     &node->private_list);
> +	}
> +	return 0;
>  }
>  
>  static int __add_to_page_cache_locked(struct page *page,
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 72f9decb0104..47a9faf4070b 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -88,10 +88,18 @@ restart:
>  		ret = isolate(item, &nlru->lock, cb_arg);
>  		switch (ret) {
>  		case LRU_REMOVED:
> +		case LRU_REMOVED_RETRY:
>  			if (--nlru->nr_items == 0)
>  				node_clear(nid, lru->active_nodes);
>  			WARN_ON_ONCE(nlru->nr_items < 0);
>  			isolated++;
> +			/*
> +			 * If the lru lock has been dropped, our list
> +			 * traversal is now invalid and so we have to
> +			 * restart from scratch.
> +			 */
> +			if (ret == LRU_REMOVED_RETRY)
> +				goto restart;
>  			break;
>  		case LRU_ROTATE:
>  			list_move_tail(item, &nlru->list);
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 97606fa4c458..5c2615d7f4da 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -25,6 +25,9 @@
>  static void clear_exceptional_entry(struct address_space *mapping,
>  				    pgoff_t index, void *entry)
>  {
> +	struct radix_tree_node *node;
> +	void **slot;
> +
>  	/* Handled by shmem itself */
>  	if (shmem_mapping(mapping))
>  		return;
> @@ -35,8 +38,21 @@ static void clear_exceptional_entry(struct address_space *mapping,
>  	 * without the tree itself locked.  These unlocked entries
>  	 * need verification under the tree lock.
>  	 */
> -	if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
> -		mapping->nrshadows--;
> +	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> +		goto unlock;
> +	if (*slot != entry)
> +		goto unlock;
> +	radix_tree_replace_slot(slot, NULL);
> +	mapping->nrshadows--;
> +	if (!node)
> +		goto unlock;
> +	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> +	/* No more shadow entries, stop tracking the node */
> +	if (!(node->count >> RADIX_TREE_COUNT_SHIFT) &&
> +	    !list_empty(&node->private_list))
> +		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> +	__radix_tree_delete_node(&mapping->page_tree, node);
> +unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
>  }
>  
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3ac830d1b533..baa3ba586685 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -772,6 +772,7 @@ const char * const vmstat_text[] = {
>  #endif
>  	"workingset_refault",
>  	"workingset_activate",
> +	"workingset_nodereclaim",
>  	"nr_anon_transparent_hugepages",
>  	"nr_free_cma",
>  	"nr_dirty_threshold",
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 8a6c7cff4923..7bb1a432c137 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -251,3 +251,124 @@ void workingset_activation(struct page *page)
>  {
>  	atomic_long_inc(&page_zone(page)->inactive_age);
>  }
> +
> +/*
> + * Page cache radix tree nodes containing only shadow entries can grow
> + * excessively on certain workloads.  That's why they are tracked on
> + * per-(NUMA)node lists and pushed back by a shrinker, but with a
> + * slightly higher threshold than regular shrinkers so we don't
> + * discard the entries too eagerly - after all, during light memory
> + * pressure is exactly when we need them.
> + */
> +
> +struct list_lru workingset_shadow_nodes;
> +
> +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> +					struct shrink_control *sc)
> +{
> +	return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> +}
> +
> +static enum lru_status shadow_lru_isolate(struct list_head *item,
> +					  spinlock_t *lru_lock,
> +					  void *arg)
> +{
> +	unsigned long *nr_reclaimed = arg;
> +	struct address_space *mapping;
> +	struct radix_tree_node *node;
> +	unsigned int i;
> +	int ret;
> +
> +	/*
> +	 * Page cache insertions and deletions synchroneously maintain
> +	 * the shadow node LRU under the mapping->tree_lock and the
> +	 * lru_lock.  Because the page cache tree is emptied before
> +	 * the inode can be destroyed, holding the lru_lock pins any
> +	 * address_space that has radix tree nodes on the LRU.
> +	 *
> +	 * We can then safely transition to the mapping->tree_lock to
> +	 * pin only the address_space of the particular node we want
> +	 * to reclaim, take the node off-LRU, and drop the lru_lock.
> +	 */
> +
> +	node = container_of(item, struct radix_tree_node, private_list);
> +	mapping = node->private_data;
> +
> +	/* Coming from the list, invert the lock order */
> +	if (!spin_trylock_irq(&mapping->tree_lock)) {
> +		spin_unlock(lru_lock);
> +		ret = LRU_RETRY;
> +		goto out;
> +	}
> +
> +	list_del_init(item);
> +	spin_unlock(lru_lock);
> +
> +	/*
> +	 * The nodes should only contain one or more shadow entries,
> +	 * no pages, so we expect to be able to remove them all and
> +	 * delete and free the empty node afterwards.
> +	 */
> +
> +	BUG_ON(!node->count);
> +	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> +
> +	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> +		if (node->slots[i]) {
> +			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
> +			node->slots[i] = NULL;
> +			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
> +			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> +			BUG_ON(!mapping->nrshadows);
> +			mapping->nrshadows--;
> +		}
> +	}
> +	BUG_ON(node->count);
> +	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
> +	if (!__radix_tree_delete_node(&mapping->page_tree, node))
> +		BUG();
> +	(*nr_reclaimed)++;
> +
> +	spin_unlock_irq(&mapping->tree_lock);
> +	ret = LRU_REMOVED_RETRY;
> +out:
> +	cond_resched();
> +	spin_lock(lru_lock);
> +	return ret;
> +}
> +
> +static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
> +				       struct shrink_control *sc)
> +{
> +	unsigned long nr_reclaimed = 0;
> +
> +	list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
> +			   shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
> +
> +	return nr_reclaimed;
> +}
> +
> +static struct shrinker workingset_shadow_shrinker = {
> +	.count_objects = count_shadow_nodes,
> +	.scan_objects = scan_shadow_nodes,
> +	.seeks = DEFAULT_SEEKS * 4,
> +	.flags = SHRINKER_NUMA_AWARE,
> +};
> +
> +static int __init workingset_init(void)
> +{
> +	int ret;
> +
> +	ret = list_lru_init(&workingset_shadow_nodes);
> +	if (ret)
> +		goto err;
> +	ret = register_shrinker(&workingset_shadow_shrinker);
> +	if (ret)
> +		goto err_list_lru;
> +	return 0;
> +err_list_lru:
> +	list_lru_destroy(&workingset_shadow_nodes);
> +err:
> +	return ret;
> +}
> +module_init(workingset_init);
> -- 
> 1.8.4.2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 7/9] mm: thrash detection-based file cache sizing
  2014-01-10 18:10 ` [patch 7/9] mm: thrash detection-based file cache sizing Johannes Weiner
  2014-01-10 22:51   ` Rik van Riel
  2014-01-13  2:42   ` Minchan Kim
@ 2014-01-14  1:01   ` Bob Liu
  2014-01-14 19:16     ` Johannes Weiner
  2 siblings, 1 reply; 58+ messages in thread
From: Bob Liu @ 2014-01-14  1:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Hi Johannes,

On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists.  One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past.  We call the recently used
> list "inactive list" and the frequently used list "active list".
> 
> Currently, the VM aims for a 1:1 ratio between the lists, which is the
> "perfect" trade-off between the ability to *protect* frequently used
> pages and the ability to *detect* frequently used pages.  This means
> that working set changes bigger than half of cache memory go
> undetected and thrash indefinitely, whereas working sets bigger than
> half of cache memory are unprotected against used-once streams that
> don't even need caching.
> 

Good job! This patch looks good to me and with nice descriptions.
But it seems that this patch only fix the issue "working set changes
bigger than half of cache memory go undetected and thrash indefinitely".
My concern is could it be extended easily to address all other issues
based on this patch set?

The other possible way is something like Peter has implemented the CART
and Clock-Pro which I think may be better because of using advanced
algorithms and consider the problem as a whole from the beginning.(Sorry
I haven't get enough time to read the source code, so I'm not 100% sure.)
http://linux-mm.org/PeterZClockPro2

> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list.  This model gave established
> working sets more gracetime in the face of temporary use-once streams,
> but ultimately was not significantly better than a FIFO policy and
> still thrashed cache based on eviction speed, rather than actual
> demand for cache.
> 
> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size.  By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
> 
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot.  On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
> 
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list.  And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/mmzone.h |   5 +
>  include/linux/swap.h   |   5 +
>  mm/Makefile            |   2 +-
>  mm/filemap.c           |  61 ++++++++----
>  mm/swap.c              |   2 +
>  mm/vmscan.c            |  24 ++++-
>  mm/vmstat.c            |   2 +
>  mm/workingset.c        | 253 +++++++++++++++++++++++++++++++++++++++++++++++++
>  8 files changed, 331 insertions(+), 23 deletions(-)
>  create mode 100644 mm/workingset.c
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bd791e452ad7..118ba9f51e86 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -142,6 +142,8 @@ enum zone_stat_item {
>  	NUMA_LOCAL,		/* allocation from local node */
>  	NUMA_OTHER,		/* allocation from other node */
>  #endif
> +	WORKINGSET_REFAULT,
> +	WORKINGSET_ACTIVATE,
>  	NR_ANON_TRANSPARENT_HUGEPAGES,
>  	NR_FREE_CMA_PAGES,
>  	NR_VM_ZONE_STAT_ITEMS };
> @@ -392,6 +394,9 @@ struct zone {
>  	spinlock_t		lru_lock;
>  	struct lruvec		lruvec;
>  
> +	/* Evictions & activations on the inactive file list */
> +	atomic_long_t		inactive_age;
> +
>  	unsigned long		pages_scanned;	   /* since last reclaim */
>  	unsigned long		flags;		   /* zone flags, see below */
>  
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..b83cf61403ed 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -260,6 +260,11 @@ struct swap_list_t {
>  	int next;	/* swapfile to be used next */
>  };
>  
> +/* linux/mm/workingset.c */
> +void *workingset_eviction(struct address_space *mapping, struct page *page);
> +bool workingset_refault(void *shadow);
> +void workingset_activation(struct page *page);
> +
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
>  extern unsigned long totalreserve_pages;
> diff --git a/mm/Makefile b/mm/Makefile
> index 305d10acd081..b30aeb86abd6 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   util.o mmzone.o vmstat.o backing-dev.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
>  			   compaction.o balloon_compaction.o \
> -			   interval_tree.o list_lru.o $(mmu-y)
> +			   interval_tree.o list_lru.o workingset.o $(mmu-y)
>  
>  obj-y += init-mm.o
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d02db5801dda..65a374c0df4f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>  
>  static int page_cache_tree_insert(struct address_space *mapping,
> -				  struct page *page)
> +				  struct page *page, void **shadowp)
>  {
>  	void **slot;
>  	int error;
> @@ -484,6 +484,8 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  		radix_tree_replace_slot(slot, page);
>  		mapping->nrshadows--;
>  		mapping->nrpages++;
> +		if (shadowp)
> +			*shadowp = p;
>  		return 0;
>  	}
>  	error = radix_tree_insert(&mapping->page_tree, page->index, page);
> @@ -492,18 +494,10 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  	return error;
>  }
>  
> -/**
> - * add_to_page_cache_locked - add a locked page to the pagecache
> - * @page:	page to add
> - * @mapping:	the page's address_space
> - * @offset:	page index
> - * @gfp_mask:	page allocation mode
> - *
> - * This function is used to add a page to the pagecache. It must be locked.
> - * This function does not add the page to the LRU.  The caller must do that.
> - */
> -int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> -		pgoff_t offset, gfp_t gfp_mask)
> +static int __add_to_page_cache_locked(struct page *page,
> +				      struct address_space *mapping,
> +				      pgoff_t offset, gfp_t gfp_mask,
> +				      void **shadowp)
>  {
>  	int error;
>  
> @@ -526,7 +520,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  	page->index = offset;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	error = page_cache_tree_insert(mapping, page);
> +	error = page_cache_tree_insert(mapping, page, shadowp);
>  	radix_tree_preload_end();
>  	if (unlikely(error))
>  		goto err_insert;
> @@ -542,16 +536,49 @@ err_insert:
>  	page_cache_release(page);
>  	return error;
>  }
> +
> +/**
> + * add_to_page_cache_locked - add a locked page to the pagecache
> + * @page:	page to add
> + * @mapping:	the page's address_space
> + * @offset:	page index
> + * @gfp_mask:	page allocation mode
> + *
> + * This function is used to add a page to the pagecache. It must be locked.
> + * This function does not add the page to the LRU.  The caller must do that.
> + */
> +int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> +		pgoff_t offset, gfp_t gfp_mask)
> +{
> +	return __add_to_page_cache_locked(page, mapping, offset,
> +					  gfp_mask, NULL);
> +}
>  EXPORT_SYMBOL(add_to_page_cache_locked);
>  
>  int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
>  				pgoff_t offset, gfp_t gfp_mask)
>  {
> +	void *shadow = NULL;
>  	int ret;
>  
> -	ret = add_to_page_cache(page, mapping, offset, gfp_mask);
> -	if (ret == 0)
> -		lru_cache_add_file(page);
> +	__set_page_locked(page);
> +	ret = __add_to_page_cache_locked(page, mapping, offset,
> +					 gfp_mask, &shadow);
> +	if (unlikely(ret))
> +		__clear_page_locked(page);
> +	else {
> +		/*
> +		 * The page might have been evicted from cache only
> +		 * recently, in which case it should be activated like
> +		 * any other repeatedly accessed page.
> +		 */
> +		if (shadow && workingset_refault(shadow)) {
> +			SetPageActive(page);
> +			workingset_activation(page);
> +		} else
> +			ClearPageActive(page);
> +		lru_cache_add(page);
> +	}
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
> diff --git a/mm/swap.c b/mm/swap.c
> index f624e5b4b724..ece5c49d6364 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -519,6 +519,8 @@ void mark_page_accessed(struct page *page)
>  		else
>  			__lru_cache_activate_page(page);
>  		ClearPageReferenced(page);
> +		if (page_is_file_cache(page))
> +			workingset_activation(page);
>  	} else if (!PageReferenced(page)) {
>  		SetPageReferenced(page);
>  	}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b954b31602cf..0d3c3d7f8c1b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -505,7 +505,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
>   * Same as remove_mapping, but if the page is removed from the mapping, it
>   * gets returned with a refcount of 0.
>   */
> -static int __remove_mapping(struct address_space *mapping, struct page *page)
> +static int __remove_mapping(struct address_space *mapping, struct page *page,
> +			    bool reclaimed)
>  {
>  	BUG_ON(!PageLocked(page));
>  	BUG_ON(mapping != page_mapping(page));
> @@ -551,10 +552,23 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
>  		swapcache_free(swap, page);
>  	} else {
>  		void (*freepage)(struct page *);
> +		void *shadow = NULL;
>  
>  		freepage = mapping->a_ops->freepage;
> -
> -		__delete_from_page_cache(page, NULL);
> +		/*
> +		 * Remember a shadow entry for reclaimed file cache in
> +		 * order to detect refaults, thus thrashing, later on.
> +		 *
> +		 * But don't store shadows in an address space that is
> +		 * already exiting.  This is not just an optizimation,
> +		 * inode reclaim needs to empty out the radix tree or
> +		 * the nodes are lost.  Don't plant shadows behind its
> +		 * back.
> +		 */
> +		if (reclaimed && page_is_file_cache(page) &&
> +		    !mapping_exiting(mapping))
> +			shadow = workingset_eviction(mapping, page);
> +		__delete_from_page_cache(page, shadow);
>  		spin_unlock_irq(&mapping->tree_lock);
>  		mem_cgroup_uncharge_cache_page(page);
>  
> @@ -577,7 +591,7 @@ cannot_free:
>   */
>  int remove_mapping(struct address_space *mapping, struct page *page)
>  {
> -	if (__remove_mapping(mapping, page)) {
> +	if (__remove_mapping(mapping, page, false)) {
>  		/*
>  		 * Unfreezing the refcount with 1 rather than 2 effectively
>  		 * drops the pagecache ref for us without requiring another
> @@ -1047,7 +1061,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -		if (!mapping || !__remove_mapping(mapping, page))
> +		if (!mapping || !__remove_mapping(mapping, page, true))
>  			goto keep_locked;
>  
>  		/*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 9bb314577911..3ac830d1b533 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -770,6 +770,8 @@ const char * const vmstat_text[] = {
>  	"numa_local",
>  	"numa_other",
>  #endif
> +	"workingset_refault",
> +	"workingset_activate",
>  	"nr_anon_transparent_hugepages",
>  	"nr_free_cma",
>  	"nr_dirty_threshold",
> diff --git a/mm/workingset.c b/mm/workingset.c
> new file mode 100644
> index 000000000000..8a6c7cff4923
> --- /dev/null
> +++ b/mm/workingset.c
> @@ -0,0 +1,253 @@
> +/*
> + * Workingset detection
> + *
> + * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
> + */
> +
> +#include <linux/memcontrol.h>
> +#include <linux/writeback.h>
> +#include <linux/pagemap.h>
> +#include <linux/atomic.h>
> +#include <linux/module.h>
> +#include <linux/swap.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +
> +/*
> + *		Double CLOCK lists
> + *
> + * Per zone, two clock lists are maintained for file pages: the
> + * inactive and the active list.  Freshly faulted pages start out at
> + * the head of the inactive list and page reclaim scans pages from the
> + * tail.  Pages that are accessed multiple times on the inactive list
> + * are promoted to the active list, to protect them from reclaim,
> + * whereas active pages are demoted to the inactive list when the
> + * active list grows too big.
> + *
> + *   fault ------------------------+
> + *                                 |
> + *              +--------------+   |            +-------------+
> + *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
> + *              +--------------+                +-------------+    |
> + *                     |                                           |
> + *                     +-------------- promotion ------------------+
> + *
> + *
> + *		Access frequency and refault distance
> + *
> + * A workload is thrashing when its pages are frequently used but they
> + * are evicted from the inactive list every time before another access
> + * would have promoted them to the active list.
> + *
> + * In cases where the average access distance between thrashing pages
> + * is bigger than the size of memory there is nothing that can be
> + * done - the thrashing set could never fit into memory under any
> + * circumstance.
> + *
> + * However, the average access distance could be bigger than the
> + * inactive list, yet smaller than the size of memory.  In this case,
> + * the set could fit into memory if it weren't for the currently
> + * active pages - which may be used more, hopefully less frequently:
> + *
> + *      +-memory available to cache-+
> + *      |                           |
> + *      +-inactive------+-active----+
> + *  a b | c d e f g h i | J K L M N |
> + *      +---------------+-----------+
> + *
> + * It is prohibitively expensive to accurately track access frequency
> + * of pages.  But a reasonable approximation can be made to measure
> + * thrashing on the inactive list, after which refaulting pages can be
> + * activated optimistically to compete with the existing active pages.
> + *
> + * Approximating inactive page access frequency - Observations:
> + *
> + * 1. When a page is accessed for the first time, it is added to the
> + *    head of the inactive list, slides every existing inactive page
> + *    towards the tail by one slot, and pushes the current tail page
> + *    out of memory.
> + *
> + * 2. When a page is accessed for the second time, it is promoted to
> + *    the active list, shrinking the inactive list by one slot.  This
> + *    also slides all inactive pages that were faulted into the cache
> + *    more recently than the activated page towards the tail of the
> + *    inactive list.
> + *

Nitpick, how about the reference bit?

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-13  7:39   ` Minchan Kim
@ 2014-01-14  5:40     ` Minchan Kim
  2014-01-22 18:42     ` Johannes Weiner
  1 sibling, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2014-01-14  5:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.
> > 
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >    from the node up to the tree root, which is needed to perform a
> >    deletion.  To solve this, encode in each node its offset inside the
> >    parent.  This can be stored in the unused upper bits of the same
> >    member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >    regular entries, to quickly detect when the node is ready to go to
> >    the shadow node LRU list.  The current entry count is an unsigned
> >    int but the maximum number of entries is 64, so a shadow counter
> >    can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs tree lock and tree root, which are located
> >    in the address space, so store an address_space backpointer in the
> >    node.  The parent pointer of the node is in a union with the 2-word
> >    rcu_head, so the backpointer comes at no extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >    head inside the node.  This does increase the size of the node, but
> >    it does not change the number of objects that fit into a slab page.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  include/linux/list_lru.h   |   2 +
> >  include/linux/mmzone.h     |   1 +
> >  include/linux/radix-tree.h |  32 +++++++++---
> >  include/linux/swap.h       |   1 +
> >  lib/radix-tree.c           |  36 ++++++++------
> >  mm/filemap.c               |  77 +++++++++++++++++++++++------
> >  mm/list_lru.c              |   8 +++
> >  mm/truncate.c              |  20 +++++++-
> >  mm/vmstat.c                |   1 +
> >  mm/workingset.c            | 121 +++++++++++++++++++++++++++++++++++++++++++++
> >  10 files changed, 259 insertions(+), 40 deletions(-)
> > 
> > diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> > index 3ce541753c88..b02fc233eadd 100644
> > --- a/include/linux/list_lru.h
> > +++ b/include/linux/list_lru.h
> > @@ -13,6 +13,8 @@
> >  /* list_lru_walk_cb has to always return one of those */
> >  enum lru_status {
> >  	LRU_REMOVED,		/* item removed from list */
> > +	LRU_REMOVED_RETRY,	/* item removed, but lock has been
> > +				   dropped and reacquired */
> >  	LRU_ROTATE,		/* item referenced, give another pass */
> >  	LRU_SKIP,		/* item cannot be locked, skip */
> >  	LRU_RETRY,		/* item not freeable. May drop the lock
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 118ba9f51e86..8cac5a7ef7a7 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -144,6 +144,7 @@ enum zone_stat_item {
> >  #endif
> >  	WORKINGSET_REFAULT,
> >  	WORKINGSET_ACTIVATE,
> > +	WORKINGSET_NODERECLAIM,
> >  	NR_ANON_TRANSPARENT_HUGEPAGES,
> >  	NR_FREE_CMA_PAGES,
> >  	NR_VM_ZONE_STAT_ITEMS };
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index 13636c40bc42..33170dbd9db4 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
> >  #define RADIX_TREE_TAG_LONGS	\
> >  	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
> >  
> > +#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > +#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> > +					  RADIX_TREE_MAP_SHIFT))
> > +
> > +/* Height component in node->path */
> > +#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
> > +#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
> > +
> > +/* Internally used bits of node->count */
> > +#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
> > +#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
> > +
> >  struct radix_tree_node {
> > -	unsigned int	height;		/* Height from the bottom */
> > +	unsigned int	path;	/* Offset in parent & height from the bottom */
> >  	unsigned int	count;
> >  	union {
> > -		struct radix_tree_node *parent;	/* Used when ascending tree */
> > -		struct rcu_head	rcu_head;	/* Used when freeing node */
> > +		struct {
> > +			/* Used when ascending tree */
> > +			struct radix_tree_node *parent;
> > +			/* For tree user */
> > +			void *private_data;
> > +		};
> > +		/* Used when freeing node */
> > +		struct rcu_head	rcu_head;
> >  	};
> > +	/* For tree user */
> > +	struct list_head private_list;
> >  	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
> >  	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
> >  };
> >  
> > -#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > -#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> > -					  RADIX_TREE_MAP_SHIFT))
> > -
> >  /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
> >  struct radix_tree_root {
> >  	unsigned int		height;
> > @@ -251,7 +267,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
> >  			  struct radix_tree_node **nodep, void ***slotp);
> >  void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
> >  void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
> > -bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
> > +bool __radix_tree_delete_node(struct radix_tree_root *root,
> >  			      struct radix_tree_node *node);
> >  void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
> >  void *radix_tree_delete(struct radix_tree_root *, unsigned long);
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index b83cf61403ed..102e37bc82d5 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -264,6 +264,7 @@ struct swap_list_t {
> >  void *workingset_eviction(struct address_space *mapping, struct page *page);
> >  bool workingset_refault(void *shadow);
> >  void workingset_activation(struct page *page);
> > +extern struct list_lru workingset_shadow_nodes;
> >  
> >  /* linux/mm/page_alloc.c */
> >  extern unsigned long totalram_pages;
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index e601c56a43d0..0a0895371447 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -342,7 +342,8 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
> >  
> >  		/* Increase the height.  */
> >  		newheight = root->height+1;
> > -		node->height = newheight;
> > +		BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
> > +		node->path = newheight;
> 
> Nitpick:
> It would be better to add some accessor for path and offset for
> readability and future enhance?
> 
> >  		node->count = 1;
> >  		node->parent = NULL;
> >  		slot = root->rnode;
> > @@ -400,11 +401,12 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
> >  			/* Have to add a child node.  */
> >  			if (!(slot = radix_tree_node_alloc(root)))
> >  				return -ENOMEM;
> > -			slot->height = height;
> > +			slot->path = height;
> >  			slot->parent = node;
> >  			if (node) {
> >  				rcu_assign_pointer(node->slots[offset], slot);
> >  				node->count++;
> > +				slot->path |= offset << RADIX_TREE_HEIGHT_SHIFT;
> >  			} else
> >  				rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
> >  		}
> > @@ -496,7 +498,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
> >  	}
> >  	node = indirect_to_ptr(node);
> >  
> > -	height = node->height;
> > +	height = node->path & RADIX_TREE_HEIGHT_MASK;
> >  	if (index > radix_tree_maxindex(height))
> >  		return NULL;
> >  
> > @@ -702,7 +704,7 @@ int radix_tree_tag_get(struct radix_tree_root *root,
> >  		return (index == 0);
> >  	node = indirect_to_ptr(node);
> >  
> > -	height = node->height;
> > +	height = node->path & RADIX_TREE_HEIGHT_MASK;
> >  	if (index > radix_tree_maxindex(height))
> >  		return 0;
> >  
> > @@ -739,7 +741,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
> >  {
> >  	unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK;
> >  	struct radix_tree_node *rnode, *node;
> > -	unsigned long index, offset;
> > +	unsigned long index, offset, height;
> >  
> >  	if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag))
> >  		return NULL;
> > @@ -770,7 +772,8 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
> >  		return NULL;
> >  
> >  restart:
> > -	shift = (rnode->height - 1) * RADIX_TREE_MAP_SHIFT;
> > +	height = rnode->path & RADIX_TREE_HEIGHT_MASK;
> > +	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
> >  	offset = index >> shift;
> >  
> >  	/* Index outside of the tree */
> > @@ -1140,7 +1143,7 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item,
> >  	unsigned int shift, height;
> >  	unsigned long i;
> >  
> > -	height = slot->height;
> > +	height = slot->path & RADIX_TREE_HEIGHT_MASK;
> >  	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
> >  
> >  	for ( ; height > 1; height--) {
> > @@ -1203,7 +1206,8 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item)
> >  		}
> >  
> >  		node = indirect_to_ptr(node);
> > -		max_index = radix_tree_maxindex(node->height);
> > +		max_index = radix_tree_maxindex(node->path &
> > +						RADIX_TREE_HEIGHT_MASK);
> >  		if (cur_index > max_index)
> >  			break;
> >  
> > @@ -1297,7 +1301,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
> >   *
> >   *	Returns %true if @node was freed, %false otherwise.
> >   */
> > -bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
> > +bool __radix_tree_delete_node(struct radix_tree_root *root,
> >  			      struct radix_tree_node *node)
> >  {
> >  	bool deleted = false;
> > @@ -1316,9 +1320,10 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
> >  
> >  		parent = node->parent;
> >  		if (parent) {
> > -			index >>= RADIX_TREE_MAP_SHIFT;
> > +			unsigned int offset;
> >  
> > -			parent->slots[index & RADIX_TREE_MAP_MASK] = NULL;
> > +			offset = node->path >> RADIX_TREE_HEIGHT_SHIFT;
> > +			parent->slots[offset] = NULL;
> >  			parent->count--;
> >  		} else {
> >  			root_tag_clear_all(root);
> > @@ -1382,7 +1387,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
> >  	node->slots[offset] = NULL;
> >  	node->count--;
> >  
> > -	__radix_tree_delete_node(root, index, node);
> > +	__radix_tree_delete_node(root, node);
> >  
> >  	return entry;
> >  }
> > @@ -1415,9 +1420,12 @@ int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag)
> >  EXPORT_SYMBOL(radix_tree_tagged);
> >  
> >  static void
> > -radix_tree_node_ctor(void *node)
> > +radix_tree_node_ctor(void *arg)
> >  {
> > -	memset(node, 0, sizeof(struct radix_tree_node));
> > +	struct radix_tree_node *node = arg;
> > +
> > +	memset(node, 0, sizeof(*node));
> > +	INIT_LIST_HEAD(&node->private_list);
> >  }
> >  
> >  static __init unsigned long __maxindex(unsigned int height)
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 65a374c0df4f..b93e223b59a9 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -110,11 +110,17 @@
> >  static void page_cache_tree_delete(struct address_space *mapping,
> >  				   struct page *page, void *shadow)
> >  {
> > -	if (shadow) {
> > -		void **slot;
> > +	struct radix_tree_node *node;
> > +	unsigned long index;
> > +	unsigned int offset;
> > +	unsigned int tag;
> > +	void **slot;
> >  
> > -		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
> > -		radix_tree_replace_slot(slot, shadow);
> > +	VM_BUG_ON(!PageLocked(page));
> > +
> > +	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
> > +
> > +	if (shadow) {
> >  		mapping->nrshadows++;
> >  		/*
> >  		 * Make sure the nrshadows update is committed before
> > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
> >  		 * same time and miss a shadow entry.
> >  		 */
> >  		smp_wmb();
> > -	} else
> > -		radix_tree_delete(&mapping->page_tree, page->index);
> > +	}
> >  	mapping->nrpages--;
> > +
> > +	if (!node) {
> > +		/* Clear direct pointer tags in root node */
> > +		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > +		radix_tree_replace_slot(slot, shadow);
> > +		return;
> > +	}
> > +
> > +	/* Clear tree tags for the removed page */
> > +	index = page->index;
> > +	offset = index & RADIX_TREE_MAP_MASK;
> > +	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > +		if (test_bit(offset, node->tags[tag]))
> > +			radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > +	}
> > +
> > +	/* Delete page, swap shadow entry */
> > +	radix_tree_replace_slot(slot, shadow);
> > +	node->count--;
> > +	if (shadow)
> > +		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> 
> Nitpick2:
> It should be a function of workingset.c rather than exposing
> RADIX_TREE_COUNT_SHIFT?
> 
> IMO, It would be better to provide some accessor functions here, too.
> 
> I didn't review locking part yet and will review it tomorrow with
> fresh brain. :)

Complete to review.
I couldn't spot any mistake any more so, Other than that,

Reviewed-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 7/9] mm: thrash detection-based file cache sizing
  2014-01-14  1:01   ` Bob Liu
@ 2014-01-14 19:16     ` Johannes Weiner
  2014-01-15  2:57       ` Bob Liu
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2014-01-14 19:16 UTC (permalink / raw)
  To: Bob Liu
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
> Hi Johannes,
> 
> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> > The VM maintains cached filesystem pages on two types of lists.  One
> > list holds the pages recently faulted into the cache, the other list
> > holds pages that have been referenced repeatedly on that first list.
> > The idea is to prefer reclaiming young pages over those that have
> > shown to benefit from caching in the past.  We call the recently used
> > list "inactive list" and the frequently used list "active list".
> > 
> > Currently, the VM aims for a 1:1 ratio between the lists, which is the
> > "perfect" trade-off between the ability to *protect* frequently used
> > pages and the ability to *detect* frequently used pages.  This means
> > that working set changes bigger than half of cache memory go
> > undetected and thrash indefinitely, whereas working sets bigger than
> > half of cache memory are unprotected against used-once streams that
> > don't even need caching.
> > 
> 
> Good job! This patch looks good to me and with nice descriptions.
> But it seems that this patch only fix the issue "working set changes
> bigger than half of cache memory go undetected and thrash indefinitely".
> My concern is could it be extended easily to address all other issues
> based on this patch set?
> 
> The other possible way is something like Peter has implemented the CART
> and Clock-Pro which I think may be better because of using advanced
> algorithms and consider the problem as a whole from the beginning.(Sorry
> I haven't get enough time to read the source code, so I'm not 100% sure.)
> http://linux-mm.org/PeterZClockPro2

My patches are moving the VM towards something that is comparable to
how Peter implemented Clock-Pro.  However, the current VM has evolved
over time in small increments based on real life performance
observations.  Rewriting everything in one go would be incredibly
disruptive and I doubt very much we would merge any such proposal in
the first place.  So it's not like I don't see the big picture, it's
just divide and conquer:

Peter's Clock-Pro implementation was basically a double clock with an
intricate system to classify hotness, augmented by eviction
information to work with reuse distances independent of memory size.

What we have right now is a double clock with a very rudimentary
system to classify whether a page is hot: it has been accessed twice
while on the inactive clock.  My patches now add eviction information
to this, and improve the classification so that it can work with reuse
distances up to memory size and is no longer dependent on the inactive
clock size.

This is the smallest imaginable step that is still useful, and even
then we had a lot of discussions about scalability of the data
structures and confusion about how the new data point should be
interpreted.  It also took a long time until somebody read the series
and went, "Ok, this actually makes sense to me."  Now, maybe I suck at
documenting, but maybe this is just complicated stuff.  Either way, we
have to get there collectively, so that the code is maintainable in
the long term.

Once we have these new concepts established, we can further improve
the hotness detector so that it can classify and order pages with
reuse distances beyond memory size.  But this will come with its own
set of problems.  For example, some time ago we stopped regularly
scanning and rotating active pages because of scalability issues, but
we'll most likely need an uptodate estimate of the reuse distances on
the active list in order to classify refaults properly.

> > + * Approximating inactive page access frequency - Observations:
> > + *
> > + * 1. When a page is accessed for the first time, it is added to the
> > + *    head of the inactive list, slides every existing inactive page
> > + *    towards the tail by one slot, and pushes the current tail page
> > + *    out of memory.
> > + *
> > + * 2. When a page is accessed for the second time, it is promoted to
> > + *    the active list, shrinking the inactive list by one slot.  This
> > + *    also slides all inactive pages that were faulted into the cache
> > + *    more recently than the activated page towards the tail of the
> > + *    inactive list.
> > + *
> 
> Nitpick, how about the reference bit?

What do you mean?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 7/9] mm: thrash detection-based file cache sizing
  2014-01-14 19:16     ` Johannes Weiner
@ 2014-01-15  2:57       ` Bob Liu
  2014-01-15  3:52         ` Zhang Yanfei
  2014-01-16 21:17         ` Johannes Weiner
  0 siblings, 2 replies; 58+ messages in thread
From: Bob Liu @ 2014-01-15  2:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel


On 01/15/2014 03:16 AM, Johannes Weiner wrote:
> On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
>> Hi Johannes,
>>
>> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
>>> The VM maintains cached filesystem pages on two types of lists.  One
>>> list holds the pages recently faulted into the cache, the other list
>>> holds pages that have been referenced repeatedly on that first list.
>>> The idea is to prefer reclaiming young pages over those that have
>>> shown to benefit from caching in the past.  We call the recently used
>>> list "inactive list" and the frequently used list "active list".
>>>
>>> Currently, the VM aims for a 1:1 ratio between the lists, which is the
>>> "perfect" trade-off between the ability to *protect* frequently used
>>> pages and the ability to *detect* frequently used pages.  This means
>>> that working set changes bigger than half of cache memory go
>>> undetected and thrash indefinitely, whereas working sets bigger than
>>> half of cache memory are unprotected against used-once streams that
>>> don't even need caching.
>>>
>>
>> Good job! This patch looks good to me and with nice descriptions.
>> But it seems that this patch only fix the issue "working set changes
>> bigger than half of cache memory go undetected and thrash indefinitely".
>> My concern is could it be extended easily to address all other issues
>> based on this patch set?
>>
>> The other possible way is something like Peter has implemented the CART
>> and Clock-Pro which I think may be better because of using advanced
>> algorithms and consider the problem as a whole from the beginning.(Sorry
>> I haven't get enough time to read the source code, so I'm not 100% sure.)
>> http://linux-mm.org/PeterZClockPro2
> 
> My patches are moving the VM towards something that is comparable to
> how Peter implemented Clock-Pro.  However, the current VM has evolved
> over time in small increments based on real life performance
> observations.  Rewriting everything in one go would be incredibly
> disruptive and I doubt very much we would merge any such proposal in
> the first place.  So it's not like I don't see the big picture, it's
> just divide and conquer:
> 
> Peter's Clock-Pro implementation was basically a double clock with an
> intricate system to classify hotness, augmented by eviction
> information to work with reuse distances independent of memory size.
> 
> What we have right now is a double clock with a very rudimentary
> system to classify whether a page is hot: it has been accessed twice
> while on the inactive clock.  My patches now add eviction information
> to this, and improve the classification so that it can work with reuse
> distances up to memory size and is no longer dependent on the inactive
> clock size.
> 
> This is the smallest imaginable step that is still useful, and even
> then we had a lot of discussions about scalability of the data
> structures and confusion about how the new data point should be
> interpreted.  It also took a long time until somebody read the series
> and went, "Ok, this actually makes sense to me."  Now, maybe I suck at
> documenting, but maybe this is just complicated stuff.  Either way, we
> have to get there collectively, so that the code is maintainable in
> the long term.
> 
> Once we have these new concepts established, we can further improve
> the hotness detector so that it can classify and order pages with
> reuse distances beyond memory size.  But this will come with its own
> set of problems.  For example, some time ago we stopped regularly
> scanning and rotating active pages because of scalability issues, but
> we'll most likely need an uptodate estimate of the reuse distances on
> the active list in order to classify refaults properly.
> 

Thank you for your kindly explanation. It make sense to me please feel
free to add my review.

>>> + * Approximating inactive page access frequency - Observations:
>>> + *
>>> + * 1. When a page is accessed for the first time, it is added to the
>>> + *    head of the inactive list, slides every existing inactive page
>>> + *    towards the tail by one slot, and pushes the current tail page
>>> + *    out of memory.
>>> + *
>>> + * 2. When a page is accessed for the second time, it is promoted to
>>> + *    the active list, shrinking the inactive list by one slot.  This
>>> + *    also slides all inactive pages that were faulted into the cache
>>> + *    more recently than the activated page towards the tail of the
>>> + *    inactive list.
>>> + *
>>
>> Nitpick, how about the reference bit?
> 
> What do you mean?
> 

Sorry, I mean the PG_referenced flag. I thought when a page is accessed
for the second time only PG_referenced flag  will be set instead of be
promoted to active list.

-- 
Regards,
-Bob

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 7/9] mm: thrash detection-based file cache sizing
  2014-01-15  2:57       ` Bob Liu
@ 2014-01-15  3:52         ` Zhang Yanfei
  2014-01-16 21:17         ` Johannes Weiner
  1 sibling, 0 replies; 58+ messages in thread
From: Zhang Yanfei @ 2014-01-15  3:52 UTC (permalink / raw)
  To: Bob Liu, Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Hello

On 01/15/2014 10:57 AM, Bob Liu wrote:
> 
> On 01/15/2014 03:16 AM, Johannes Weiner wrote:
>> On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
>>> Hi Johannes,
>>>
>>> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
>>>> The VM maintains cached filesystem pages on two types of lists.  One
>>>> list holds the pages recently faulted into the cache, the other list
>>>> holds pages that have been referenced repeatedly on that first list.
>>>> The idea is to prefer reclaiming young pages over those that have
>>>> shown to benefit from caching in the past.  We call the recently used
>>>> list "inactive list" and the frequently used list "active list".
>>>>
>>>> Currently, the VM aims for a 1:1 ratio between the lists, which is the
>>>> "perfect" trade-off between the ability to *protect* frequently used
>>>> pages and the ability to *detect* frequently used pages.  This means
>>>> that working set changes bigger than half of cache memory go
>>>> undetected and thrash indefinitely, whereas working sets bigger than
>>>> half of cache memory are unprotected against used-once streams that
>>>> don't even need caching.
>>>>
>>>
>>> Good job! This patch looks good to me and with nice descriptions.
>>> But it seems that this patch only fix the issue "working set changes
>>> bigger than half of cache memory go undetected and thrash indefinitely".
>>> My concern is could it be extended easily to address all other issues
>>> based on this patch set?
>>>
>>> The other possible way is something like Peter has implemented the CART
>>> and Clock-Pro which I think may be better because of using advanced
>>> algorithms and consider the problem as a whole from the beginning.(Sorry
>>> I haven't get enough time to read the source code, so I'm not 100% sure.)
>>> http://linux-mm.org/PeterZClockPro2
>>
>> My patches are moving the VM towards something that is comparable to
>> how Peter implemented Clock-Pro.  However, the current VM has evolved
>> over time in small increments based on real life performance
>> observations.  Rewriting everything in one go would be incredibly
>> disruptive and I doubt very much we would merge any such proposal in
>> the first place.  So it's not like I don't see the big picture, it's
>> just divide and conquer:
>>
>> Peter's Clock-Pro implementation was basically a double clock with an
>> intricate system to classify hotness, augmented by eviction
>> information to work with reuse distances independent of memory size.
>>
>> What we have right now is a double clock with a very rudimentary
>> system to classify whether a page is hot: it has been accessed twice
>> while on the inactive clock.  My patches now add eviction information
>> to this, and improve the classification so that it can work with reuse
>> distances up to memory size and is no longer dependent on the inactive
>> clock size.
>>
>> This is the smallest imaginable step that is still useful, and even
>> then we had a lot of discussions about scalability of the data
>> structures and confusion about how the new data point should be
>> interpreted.  It also took a long time until somebody read the series
>> and went, "Ok, this actually makes sense to me."  Now, maybe I suck at
>> documenting, but maybe this is just complicated stuff.  Either way, we
>> have to get there collectively, so that the code is maintainable in
>> the long term.
>>
>> Once we have these new concepts established, we can further improve
>> the hotness detector so that it can classify and order pages with
>> reuse distances beyond memory size.  But this will come with its own
>> set of problems.  For example, some time ago we stopped regularly
>> scanning and rotating active pages because of scalability issues, but
>> we'll most likely need an uptodate estimate of the reuse distances on
>> the active list in order to classify refaults properly.
>>
> 
> Thank you for your kindly explanation. It make sense to me please feel
> free to add my review.
> 
>>>> + * Approximating inactive page access frequency - Observations:
>>>> + *
>>>> + * 1. When a page is accessed for the first time, it is added to the
>>>> + *    head of the inactive list, slides every existing inactive page
>>>> + *    towards the tail by one slot, and pushes the current tail page
>>>> + *    out of memory.
>>>> + *
>>>> + * 2. When a page is accessed for the second time, it is promoted to
>>>> + *    the active list, shrinking the inactive list by one slot.  This
>>>> + *    also slides all inactive pages that were faulted into the cache
>>>> + *    more recently than the activated page towards the tail of the
>>>> + *    inactive list.
>>>> + *
>>>
>>> Nitpick, how about the reference bit?
>>
>> What do you mean?
>>
> 
> Sorry, I mean the PG_referenced flag. I thought when a page is accessed
> for the second time only PG_referenced flag  will be set instead of be
> promoted to active list.
> 

No. I try to explain a bit. For mapped file pages, if the second access
occurs on a different page table entry, the page is surely promoted to active
list. But if the paged is always accessed from the same page table entry, it
was mistakenly evicted. This was fixed by Johannes already by reusing the
PG_referenced flag, for details, please refer to commit 64574746
("vmscan: detect mapped file pages used only once").

Correct me if I am wrong.

-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-10 18:10 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
  2014-01-10 23:09   ` Rik van Riel
  2014-01-13  7:39   ` Minchan Kim
@ 2014-01-15  5:55   ` Bob Liu
  2014-01-16 22:09     ` Johannes Weiner
  2014-01-17  0:05   ` Dave Chinner
  3 siblings, 1 reply; 58+ messages in thread
From: Bob Liu @ 2014-01-15  5:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Hi Johannes,

On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 

I have one more question. It seems that other algorithm only remember
history information of a limit number of evicted pages where the number
is usually the same as the total cache or memory size.
But in your patch, I didn't see a preferred value that how many evicted
pages' history information should be recorded. It all depends on the
workingset_shadow_shrinker?

Thanks,
-Bob

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 7/9] mm: thrash detection-based file cache sizing
  2014-01-15  2:57       ` Bob Liu
  2014-01-15  3:52         ` Zhang Yanfei
@ 2014-01-16 21:17         ` Johannes Weiner
  1 sibling, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-16 21:17 UTC (permalink / raw)
  To: Bob Liu
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Wed, Jan 15, 2014 at 10:57:21AM +0800, Bob Liu wrote:
> On 01/15/2014 03:16 AM, Johannes Weiner wrote:
> > On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
> >> Good job! This patch looks good to me and with nice descriptions.
> >> But it seems that this patch only fix the issue "working set changes
> >> bigger than half of cache memory go undetected and thrash indefinitely".
> >> My concern is could it be extended easily to address all other issues
> >> based on this patch set?
> >>
> >> The other possible way is something like Peter has implemented the CART
> >> and Clock-Pro which I think may be better because of using advanced
> >> algorithms and consider the problem as a whole from the beginning.(Sorry
> >> I haven't get enough time to read the source code, so I'm not 100% sure.)
> >> http://linux-mm.org/PeterZClockPro2
> > 
> > My patches are moving the VM towards something that is comparable to
> > how Peter implemented Clock-Pro.  However, the current VM has evolved
> > over time in small increments based on real life performance
> > observations.  Rewriting everything in one go would be incredibly
> > disruptive and I doubt very much we would merge any such proposal in
> > the first place.  So it's not like I don't see the big picture, it's
> > just divide and conquer:
> > 
> > Peter's Clock-Pro implementation was basically a double clock with an
> > intricate system to classify hotness, augmented by eviction
> > information to work with reuse distances independent of memory size.
> > 
> > What we have right now is a double clock with a very rudimentary
> > system to classify whether a page is hot: it has been accessed twice
> > while on the inactive clock.  My patches now add eviction information
> > to this, and improve the classification so that it can work with reuse
> > distances up to memory size and is no longer dependent on the inactive
> > clock size.
> > 
> > This is the smallest imaginable step that is still useful, and even
> > then we had a lot of discussions about scalability of the data
> > structures and confusion about how the new data point should be
> > interpreted.  It also took a long time until somebody read the series
> > and went, "Ok, this actually makes sense to me."  Now, maybe I suck at
> > documenting, but maybe this is just complicated stuff.  Either way, we
> > have to get there collectively, so that the code is maintainable in
> > the long term.
> > 
> > Once we have these new concepts established, we can further improve
> > the hotness detector so that it can classify and order pages with
> > reuse distances beyond memory size.  But this will come with its own
> > set of problems.  For example, some time ago we stopped regularly
> > scanning and rotating active pages because of scalability issues, but
> > we'll most likely need an uptodate estimate of the reuse distances on
> > the active list in order to classify refaults properly.
> > 
> 
> Thank you for your kindly explanation. It make sense to me please feel
> free to add my review.

Thank you!

> >>> + * Approximating inactive page access frequency - Observations:
> >>> + *
> >>> + * 1. When a page is accessed for the first time, it is added to the
> >>> + *    head of the inactive list, slides every existing inactive page
> >>> + *    towards the tail by one slot, and pushes the current tail page
> >>> + *    out of memory.
> >>> + *
> >>> + * 2. When a page is accessed for the second time, it is promoted to
> >>> + *    the active list, shrinking the inactive list by one slot.  This
> >>> + *    also slides all inactive pages that were faulted into the cache
> >>> + *    more recently than the activated page towards the tail of the
> >>> + *    inactive list.
> >>> + *
> >>
> >> Nitpick, how about the reference bit?
> > 
> > What do you mean?
> > 
> 
> Sorry, I mean the PG_referenced flag. I thought when a page is accessed
> for the second time only PG_referenced flag  will be set instead of be
> promoted to active list.

It's cleared during rotation or not set on pages that came in through
readahead, but the first access sets the bit and the second access
activates it.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-15  5:55   ` Bob Liu
@ 2014-01-16 22:09     ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-16 22:09 UTC (permalink / raw)
  To: Bob Liu
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Wed, Jan 15, 2014 at 01:55:01PM +0800, Bob Liu wrote:
> Hi Johannes,
> 
> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> 
> I have one more question. It seems that other algorithm only remember
> history information of a limit number of evicted pages where the number
> is usually the same as the total cache or memory size.
> But in your patch, I didn't see a preferred value that how many evicted
> pages' history information should be recorded. It all depends on the
> workingset_shadow_shrinker?

That "same as total cache" number is a fairly arbitrary cut-off that
defines how far we record eviction history.  For this patch set, we
technically do not need more shadow entries than active pages, but
strict enforcement would be very expensive.  So we leave it mostly to
refaults and inode reclaim to keep the number of shadow entries low,
with the shadow shrinker as an emergency backup.  Keep in mind that
the shadow entries represent that part of the working set that exceeds
available memory.  So the only way the number of shadow entries
exceeds the number of RAM pages in the system is if your workingset is
more than twice that of memory, otherwise the shadow entries refault
before they can accumulate.  And because of inode reclaim, that huge
working set would have to be backed by a very small number of files,
otherwise the shadow entries are reclaimed along with the inodes.  But
this theoretical workload would be entirely IO bound and a few extra
MB wasted on shadow entries should make no difference.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-10 18:10 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
                     ` (2 preceding siblings ...)
  2014-01-15  5:55   ` Bob Liu
@ 2014-01-17  0:05   ` Dave Chinner
  2014-01-20 23:17     ` Johannes Weiner
  3 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2014-01-17  0:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.
> Per-NUMA rather than global because we expect the radix tree nodes
> themselves to be allocated node-locally and we want to reduce
> cross-node references of otherwise independent cache workloads.  A
> simple shrinker will then reclaim these nodes on memory pressure.
> 
> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:

Just a couple of things with the list_lru interfaces.

....
> @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
>  		 * same time and miss a shadow entry.
>  		 */
>  		smp_wmb();
> -	} else
> -		radix_tree_delete(&mapping->page_tree, page->index);
> +	}
>  	mapping->nrpages--;
> +
> +	if (!node) {
> +		/* Clear direct pointer tags in root node */
> +		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> +		radix_tree_replace_slot(slot, shadow);
> +		return;
> +	}
> +
> +	/* Clear tree tags for the removed page */
> +	index = page->index;
> +	offset = index & RADIX_TREE_MAP_MASK;
> +	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> +		if (test_bit(offset, node->tags[tag]))
> +			radix_tree_tag_clear(&mapping->page_tree, index, tag);
> +	}
> +
> +	/* Delete page, swap shadow entry */
> +	radix_tree_replace_slot(slot, shadow);
> +	node->count--;
> +	if (shadow)
> +		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> +	else
> +		if (__radix_tree_delete_node(&mapping->page_tree, node))
> +			return;
> +
> +	/* Only shadow entries in there, keep track of this node */
> +	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> +	    list_empty(&node->private_list)) {
> +		node->private_data = mapping;
> +		list_lru_add(&workingset_shadow_nodes, &node->private_list);
> +	}

You can't do this list_empty(&node->private_list) check safely
externally to the list_lru code - only time that entry can be
checked safely is under the LRU list locks. This is the reason that
list_lru_add/list_lru_del return a boolean to indicate is the object
was added/removed from the list - they do this list_empty() check
internally. i.e. the correct, safe way to do conditionally update
state iff the object was added to the LRU is:

	if (!(node->count & RADIX_TREE_COUNT_MASK)) {
		if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
			node->private_data = mapping;
	}

> +	radix_tree_replace_slot(slot, page);
> +	mapping->nrpages++;
> +	if (node) {
> +		node->count++;
> +		/* Installed page, can't be shadow-only anymore */
> +		if (!list_empty(&node->private_list))
> +			list_lru_del(&workingset_shadow_nodes,
> +				     &node->private_list);
> +	}

Same issue here:

	if (node) {
		node->count++;
		list_lru_del(&workingset_shadow_nodes, &node->private_list);
	}


> +	return 0;
>  }
>  
>  static int __add_to_page_cache_locked(struct page *page,
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 72f9decb0104..47a9faf4070b 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -88,10 +88,18 @@ restart:
>  		ret = isolate(item, &nlru->lock, cb_arg);
>  		switch (ret) {
>  		case LRU_REMOVED:
> +		case LRU_REMOVED_RETRY:
>  			if (--nlru->nr_items == 0)
>  				node_clear(nid, lru->active_nodes);
>  			WARN_ON_ONCE(nlru->nr_items < 0);
>  			isolated++;
> +			/*
> +			 * If the lru lock has been dropped, our list
> +			 * traversal is now invalid and so we have to
> +			 * restart from scratch.
> +			 */
> +			if (ret == LRU_REMOVED_RETRY)
> +				goto restart;
>  			break;
>  		case LRU_ROTATE:
>  			list_move_tail(item, &nlru->list);

I think that we need to assert that the list lru lock is correctly
held here on return with LRU_REMOVED_RETRY. i.e.

		case LRU_REMOVED_RETRY:
			assert_spin_locked(&nlru->lock);
		case LRU_REMOVED:
		.....

> @@ -35,8 +38,21 @@ static void clear_exceptional_entry(struct address_space *mapping,
>  	 * without the tree itself locked.  These unlocked entries
>  	 * need verification under the tree lock.
>  	 */
> -	if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
> -		mapping->nrshadows--;
> +	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> +		goto unlock;
> +	if (*slot != entry)
> +		goto unlock;
> +	radix_tree_replace_slot(slot, NULL);
> +	mapping->nrshadows--;
> +	if (!node)
> +		goto unlock;
> +	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> +	/* No more shadow entries, stop tracking the node */
> +	if (!(node->count >> RADIX_TREE_COUNT_SHIFT) &&
> +	    !list_empty(&node->private_list))
> +		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> +	__radix_tree_delete_node(&mapping->page_tree, node);

Same issue with list_empty() check.

> +
> +/*
> + * Page cache radix tree nodes containing only shadow entries can grow
> + * excessively on certain workloads.  That's why they are tracked on
> + * per-(NUMA)node lists and pushed back by a shrinker, but with a
> + * slightly higher threshold than regular shrinkers so we don't
> + * discard the entries too eagerly - after all, during light memory
> + * pressure is exactly when we need them.
> + */
> +
> +struct list_lru workingset_shadow_nodes;
> +
> +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> +					struct shrink_control *sc)
> +{
> +	return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> +}
> +
> +static enum lru_status shadow_lru_isolate(struct list_head *item,
> +					  spinlock_t *lru_lock,
> +					  void *arg)
> +{
> +	unsigned long *nr_reclaimed = arg;
> +	struct address_space *mapping;
> +	struct radix_tree_node *node;
> +	unsigned int i;
> +	int ret;
> +
> +	/*
> +	 * Page cache insertions and deletions synchroneously maintain
> +	 * the shadow node LRU under the mapping->tree_lock and the
> +	 * lru_lock.  Because the page cache tree is emptied before
> +	 * the inode can be destroyed, holding the lru_lock pins any
> +	 * address_space that has radix tree nodes on the LRU.
> +	 *
> +	 * We can then safely transition to the mapping->tree_lock to
> +	 * pin only the address_space of the particular node we want
> +	 * to reclaim, take the node off-LRU, and drop the lru_lock.
> +	 */
> +
> +	node = container_of(item, struct radix_tree_node, private_list);
> +	mapping = node->private_data;
> +
> +	/* Coming from the list, invert the lock order */
> +	if (!spin_trylock_irq(&mapping->tree_lock)) {
> +		spin_unlock(lru_lock);
> +		ret = LRU_RETRY;
> +		goto out;
> +	}
> +
> +	list_del_init(item);
> +	spin_unlock(lru_lock);
> +
> +	/*
> +	 * The nodes should only contain one or more shadow entries,
> +	 * no pages, so we expect to be able to remove them all and
> +	 * delete and free the empty node afterwards.
> +	 */
> +
> +	BUG_ON(!node->count);
> +	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> +
> +	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> +		if (node->slots[i]) {
> +			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
> +			node->slots[i] = NULL;
> +			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
> +			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> +			BUG_ON(!mapping->nrshadows);
> +			mapping->nrshadows--;
> +		}
> +	}
> +	BUG_ON(node->count);
> +	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
> +	if (!__radix_tree_delete_node(&mapping->page_tree, node))
> +		BUG();
> +	(*nr_reclaimed)++;
> +
> +	spin_unlock_irq(&mapping->tree_lock);
> +	ret = LRU_REMOVED_RETRY;
> +out:
> +	cond_resched();
> +	spin_lock(lru_lock);
> +	return ret;
> +}
> +
> +static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
> +				       struct shrink_control *sc)
> +{
> +	unsigned long nr_reclaimed = 0;
> +
> +	list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
> +			   shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
> +
> +	return nr_reclaimed;

list_lru_walk_node() returns the number of reclaimed objects (i.e.
the number of objects that returned LRU_REMOVED/LRU_REMOVED_RETRY
from the ->isolate callback). You don't need to count nr_reclaimed
yourself.

> +}
> +
> +static struct shrinker workingset_shadow_shrinker = {
> +	.count_objects = count_shadow_nodes,
> +	.scan_objects = scan_shadow_nodes,
> +	.seeks = DEFAULT_SEEKS * 4,
> +	.flags = SHRINKER_NUMA_AWARE,
> +};

Can you add a comment explaining how you calculated the .seeks
value? It's important to document the weighings/importance
we give to slab reclaim so we can determine if it's actually
acheiving the desired balance under different loads...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-17  0:05   ` Dave Chinner
@ 2014-01-20 23:17     ` Johannes Weiner
  2014-01-21  3:03       ` Dave Chinner
  2014-01-23  5:57       ` Minchan Kim
  0 siblings, 2 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-20 23:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.
> > 
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> 
> Just a couple of things with the list_lru interfaces.
> 
> ....
> > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
> >  		 * same time and miss a shadow entry.
> >  		 */
> >  		smp_wmb();
> > -	} else
> > -		radix_tree_delete(&mapping->page_tree, page->index);
> > +	}
> >  	mapping->nrpages--;
> > +
> > +	if (!node) {
> > +		/* Clear direct pointer tags in root node */
> > +		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > +		radix_tree_replace_slot(slot, shadow);
> > +		return;
> > +	}
> > +
> > +	/* Clear tree tags for the removed page */
> > +	index = page->index;
> > +	offset = index & RADIX_TREE_MAP_MASK;
> > +	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > +		if (test_bit(offset, node->tags[tag]))
> > +			radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > +	}
> > +
> > +	/* Delete page, swap shadow entry */
> > +	radix_tree_replace_slot(slot, shadow);
> > +	node->count--;
> > +	if (shadow)
> > +		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> > +	else
> > +		if (__radix_tree_delete_node(&mapping->page_tree, node))
> > +			return;
> > +
> > +	/* Only shadow entries in there, keep track of this node */
> > +	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > +	    list_empty(&node->private_list)) {
> > +		node->private_data = mapping;
> > +		list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > +	}
> 
> You can't do this list_empty(&node->private_list) check safely
> externally to the list_lru code - only time that entry can be
> checked safely is under the LRU list locks. This is the reason that
> list_lru_add/list_lru_del return a boolean to indicate is the object
> was added/removed from the list - they do this list_empty() check
> internally. i.e. the correct, safe way to do conditionally update
> state iff the object was added to the LRU is:
> 
> 	if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> 		if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> 			node->private_data = mapping;
> 	}
> 
> > +	radix_tree_replace_slot(slot, page);
> > +	mapping->nrpages++;
> > +	if (node) {
> > +		node->count++;
> > +		/* Installed page, can't be shadow-only anymore */
> > +		if (!list_empty(&node->private_list))
> > +			list_lru_del(&workingset_shadow_nodes,
> > +				     &node->private_list);
> > +	}
> 
> Same issue here:
> 
> 	if (node) {
> 		node->count++;
> 		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> 	}

All modifications to node->private_list happen under
mapping->tree_lock, and modifications of a neighboring link should not
affect the outcome of the list_empty(), so I don't think the lru lock
is necessary.

It would be cleaner to take it of course, but that would mean adding
an unconditional NUMAnode-wide lock to every page cache population.

> >  static int __add_to_page_cache_locked(struct page *page,
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index 72f9decb0104..47a9faf4070b 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -88,10 +88,18 @@ restart:
> >  		ret = isolate(item, &nlru->lock, cb_arg);
> >  		switch (ret) {
> >  		case LRU_REMOVED:
> > +		case LRU_REMOVED_RETRY:
> >  			if (--nlru->nr_items == 0)
> >  				node_clear(nid, lru->active_nodes);
> >  			WARN_ON_ONCE(nlru->nr_items < 0);
> >  			isolated++;
> > +			/*
> > +			 * If the lru lock has been dropped, our list
> > +			 * traversal is now invalid and so we have to
> > +			 * restart from scratch.
> > +			 */
> > +			if (ret == LRU_REMOVED_RETRY)
> > +				goto restart;
> >  			break;
> >  		case LRU_ROTATE:
> >  			list_move_tail(item, &nlru->list);
> 
> I think that we need to assert that the list lru lock is correctly
> held here on return with LRU_REMOVED_RETRY. i.e.
> 
> 		case LRU_REMOVED_RETRY:
> 			assert_spin_locked(&nlru->lock);
> 		case LRU_REMOVED:

Ah, good idea.  How about adding it to LRU_RETRY as well?

> > +/*
> > + * Page cache radix tree nodes containing only shadow entries can grow
> > + * excessively on certain workloads.  That's why they are tracked on
> > + * per-(NUMA)node lists and pushed back by a shrinker, but with a
> > + * slightly higher threshold than regular shrinkers so we don't
> > + * discard the entries too eagerly - after all, during light memory
> > + * pressure is exactly when we need them.
> > + */
> > +
> > +struct list_lru workingset_shadow_nodes;
> > +
> > +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> > +					struct shrink_control *sc)
> > +{
> > +	return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> > +}
> > +
> > +static enum lru_status shadow_lru_isolate(struct list_head *item,
> > +					  spinlock_t *lru_lock,
> > +					  void *arg)
> > +{
> > +	unsigned long *nr_reclaimed = arg;
> > +	struct address_space *mapping;
> > +	struct radix_tree_node *node;
> > +	unsigned int i;
> > +	int ret;
> > +
> > +	/*
> > +	 * Page cache insertions and deletions synchroneously maintain
> > +	 * the shadow node LRU under the mapping->tree_lock and the
> > +	 * lru_lock.  Because the page cache tree is emptied before
> > +	 * the inode can be destroyed, holding the lru_lock pins any
> > +	 * address_space that has radix tree nodes on the LRU.
> > +	 *
> > +	 * We can then safely transition to the mapping->tree_lock to
> > +	 * pin only the address_space of the particular node we want
> > +	 * to reclaim, take the node off-LRU, and drop the lru_lock.
> > +	 */
> > +
> > +	node = container_of(item, struct radix_tree_node, private_list);
> > +	mapping = node->private_data;
> > +
> > +	/* Coming from the list, invert the lock order */
> > +	if (!spin_trylock_irq(&mapping->tree_lock)) {
> > +		spin_unlock(lru_lock);
> > +		ret = LRU_RETRY;
> > +		goto out;
> > +	}
> > +
> > +	list_del_init(item);
> > +	spin_unlock(lru_lock);
> > +
> > +	/*
> > +	 * The nodes should only contain one or more shadow entries,
> > +	 * no pages, so we expect to be able to remove them all and
> > +	 * delete and free the empty node afterwards.
> > +	 */
> > +
> > +	BUG_ON(!node->count);
> > +	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> > +
> > +	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> > +		if (node->slots[i]) {
> > +			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
> > +			node->slots[i] = NULL;
> > +			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
> > +			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> > +			BUG_ON(!mapping->nrshadows);
> > +			mapping->nrshadows--;
> > +		}
> > +	}
> > +	BUG_ON(node->count);
> > +	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
> > +	if (!__radix_tree_delete_node(&mapping->page_tree, node))
> > +		BUG();
> > +	(*nr_reclaimed)++;
> > +
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	ret = LRU_REMOVED_RETRY;
> > +out:
> > +	cond_resched();
> > +	spin_lock(lru_lock);
> > +	return ret;
> > +}
> > +
> > +static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
> > +				       struct shrink_control *sc)
> > +{
> > +	unsigned long nr_reclaimed = 0;
> > +
> > +	list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
> > +			   shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
> > +
> > +	return nr_reclaimed;
> 
> list_lru_walk_node() returns the number of reclaimed objects (i.e.
> the number of objects that returned LRU_REMOVED/LRU_REMOVED_RETRY
> from the ->isolate callback). You don't need to count nr_reclaimed
> yourself.

Good catch, this is a leftover from before LRU_REMOVED_RETRY.  Removed
the ad-hoc counter altogether.

> > +static struct shrinker workingset_shadow_shrinker = {
> > +	.count_objects = count_shadow_nodes,
> > +	.scan_objects = scan_shadow_nodes,
> > +	.seeks = DEFAULT_SEEKS * 4,
> > +	.flags = SHRINKER_NUMA_AWARE,
> > +};
> 
> Can you add a comment explaining how you calculated the .seeks
> value? It's important to document the weighings/importance
> we give to slab reclaim so we can determine if it's actually
> acheiving the desired balance under different loads...

This is not an exact science, to say the least.

The shadow entries are mostly self-regulated, so I don't want the
shrinker to interfere while the machine is just regularly trimming
caches during normal operation.

It should only kick in when either a) reclaim is picking up and the
scan-to-reclaim ratio increases due to mapped pages, dirty cache,
swapping etc. or b) the number of objects compared to LRU pages
becomes excessive.

I think that is what most shrinkers with an elevated seeks value want,
but this translates very awkwardly (and not completely) to the current
cost model, and we should probably rework that interface.

"Seeks" currently encodes 3 ratios:

  1. the cost of creating an object vs. a page

  2. the expected number of objects vs. pages

  3. the cost of reclaiming an object vs. a page

but they are not necessarily correlated.  How I would like to
configure the shadow shrinker instead is:

  o scan objects when reclaim efficiency is down to 75%, because they
    are more valuable than use-once cache but less than workingset

  o scan objects when the ratio between them and the number of pages
    exceeds 1/32 (one shadow entry for each resident page, up to 64
    entries per shrinkable object, assume 50% packing for robustness)

  o as the expected balance between objects and lru pages is 1:32,
    reclaim one object for every 32 reclaimed LRU pages, instead of
    assuming that number of scanned pages corresponds meaningfully to
    number of objects to scan.

"4" just doesn't have the same ring to it.

It would be great if we could eliminate the reclaim cost assumption by
turning the nr_to_scan into a nr_to_reclaim, and then set the other
two ratios independently.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-20 23:17     ` Johannes Weiner
@ 2014-01-21  3:03       ` Dave Chinner
  2014-01-21  5:50         ` Johannes Weiner
  2014-01-23  5:57       ` Minchan Kim
  1 sibling, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2014-01-21  3:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > +	/* Only shadow entries in there, keep track of this node */
> > > +	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > > +	    list_empty(&node->private_list)) {
> > > +		node->private_data = mapping;
> > > +		list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > > +	}
> > 
> > You can't do this list_empty(&node->private_list) check safely
> > externally to the list_lru code - only time that entry can be
> > checked safely is under the LRU list locks. This is the reason that
> > list_lru_add/list_lru_del return a boolean to indicate is the object
> > was added/removed from the list - they do this list_empty() check
> > internally. i.e. the correct, safe way to do conditionally update
> > state iff the object was added to the LRU is:
> > 
> > 	if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> > 		if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> > 			node->private_data = mapping;
> > 	}
> > 
> > > +	radix_tree_replace_slot(slot, page);
> > > +	mapping->nrpages++;
> > > +	if (node) {
> > > +		node->count++;
> > > +		/* Installed page, can't be shadow-only anymore */
> > > +		if (!list_empty(&node->private_list))
> > > +			list_lru_del(&workingset_shadow_nodes,
> > > +				     &node->private_list);
> > > +	}
> > 
> > Same issue here:
> > 
> > 	if (node) {
> > 		node->count++;
> > 		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > 	}
> 
> All modifications to node->private_list happen under
> mapping->tree_lock, and modifications of a neighboring link should not
> affect the outcome of the list_empty(), so I don't think the lru lock
> is necessary.

Can you please add that as a comment somewhere explaining why it is
safe to do this?

> > > +		case LRU_REMOVED_RETRY:
> > >  			if (--nlru->nr_items == 0)
> > >  				node_clear(nid, lru->active_nodes);
> > >  			WARN_ON_ONCE(nlru->nr_items < 0);
> > >  			isolated++;
> > > +			/*
> > > +			 * If the lru lock has been dropped, our list
> > > +			 * traversal is now invalid and so we have to
> > > +			 * restart from scratch.
> > > +			 */
> > > +			if (ret == LRU_REMOVED_RETRY)
> > > +				goto restart;
> > >  			break;
> > >  		case LRU_ROTATE:
> > >  			list_move_tail(item, &nlru->list);
> > 
> > I think that we need to assert that the list lru lock is correctly
> > held here on return with LRU_REMOVED_RETRY. i.e.
> > 
> > 		case LRU_REMOVED_RETRY:
> > 			assert_spin_locked(&nlru->lock);
> > 		case LRU_REMOVED:
> 
> Ah, good idea.  How about adding it to LRU_RETRY as well?

Yup, good idea.

> > > +static struct shrinker workingset_shadow_shrinker = {
> > > +	.count_objects = count_shadow_nodes,
> > > +	.scan_objects = scan_shadow_nodes,
> > > +	.seeks = DEFAULT_SEEKS * 4,
> > > +	.flags = SHRINKER_NUMA_AWARE,
> > > +};
> > 
> > Can you add a comment explaining how you calculated the .seeks
> > value? It's important to document the weighings/importance
> > we give to slab reclaim so we can determine if it's actually
> > acheiving the desired balance under different loads...
> 
> This is not an exact science, to say the least.

I know, that's why I asked it be documented rather than be something
kept in your head.

> The shadow entries are mostly self-regulated, so I don't want the
> shrinker to interfere while the machine is just regularly trimming
> caches during normal operation.
> 
> It should only kick in when either a) reclaim is picking up and the
> scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> swapping etc. or b) the number of objects compared to LRU pages
> becomes excessive.
> 
> I think that is what most shrinkers with an elevated seeks value want,
> but this translates very awkwardly (and not completely) to the current
> cost model, and we should probably rework that interface.
> 
> "Seeks" currently encodes 3 ratios:
> 
>   1. the cost of creating an object vs. a page
> 
>   2. the expected number of objects vs. pages

It doesn't encode that at all. If it did, then the default value
wouldn't be "2".

>   3. the cost of reclaiming an object vs. a page

Which, when you consider #3 in conjunction with #1, the actual
intended meaning of .seeks is "the cost of replacing this object in
the cache compared to the cost of replacing a page cache page."

> but they are not necessarily correlated.  How I would like to
> configure the shadow shrinker instead is:
> 
>   o scan objects when reclaim efficiency is down to 75%, because they
>     are more valuable than use-once cache but less than workingset
> 
>   o scan objects when the ratio between them and the number of pages
>     exceeds 1/32 (one shadow entry for each resident page, up to 64
>     entries per shrinkable object, assume 50% packing for robustness)
> 
>   o as the expected balance between objects and lru pages is 1:32,
>     reclaim one object for every 32 reclaimed LRU pages, instead of
>     assuming that number of scanned pages corresponds meaningfully to
>     number of objects to scan.

You're assuming that every radix tree node has a full population of
pages. This only occurs on sequential read and write workloads, and
so isn't going tobe true for things like mapped executables or any
semi-randomly accessed data set...

> "4" just doesn't have the same ring to it.

Right, but you still haven't explained how you came to the value of
"4"....

> It would be great if we could eliminate the reclaim cost assumption by
> turning the nr_to_scan into a nr_to_reclaim, and then set the other
> two ratios independently.

That doesn't work for caches that are full of objects that can't (or
won't) be reclaimed immediately. The CPU cost of repeatedly scanning
to find N reclaimable objects when you have millions of objects in
the cache is prohibitive.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-21  3:03       ` Dave Chinner
@ 2014-01-21  5:50         ` Johannes Weiner
  2014-01-22  3:06           ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2014-01-21  5:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Tue, Jan 21, 2014 at 02:03:58PM +1100, Dave Chinner wrote:
> On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> > On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > +	/* Only shadow entries in there, keep track of this node */
> > > > +	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > > > +	    list_empty(&node->private_list)) {
> > > > +		node->private_data = mapping;
> > > > +		list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > > > +	}
> > > 
> > > You can't do this list_empty(&node->private_list) check safely
> > > externally to the list_lru code - only time that entry can be
> > > checked safely is under the LRU list locks. This is the reason that
> > > list_lru_add/list_lru_del return a boolean to indicate is the object
> > > was added/removed from the list - they do this list_empty() check
> > > internally. i.e. the correct, safe way to do conditionally update
> > > state iff the object was added to the LRU is:
> > > 
> > > 	if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> > > 		if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> > > 			node->private_data = mapping;
> > > 	}
> > > 
> > > > +	radix_tree_replace_slot(slot, page);
> > > > +	mapping->nrpages++;
> > > > +	if (node) {
> > > > +		node->count++;
> > > > +		/* Installed page, can't be shadow-only anymore */
> > > > +		if (!list_empty(&node->private_list))
> > > > +			list_lru_del(&workingset_shadow_nodes,
> > > > +				     &node->private_list);
> > > > +	}
> > > 
> > > Same issue here:
> > > 
> > > 	if (node) {
> > > 		node->count++;
> > > 		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > > 	}
> > 
> > All modifications to node->private_list happen under
> > mapping->tree_lock, and modifications of a neighboring link should not
> > affect the outcome of the list_empty(), so I don't think the lru lock
> > is necessary.
> 
> Can you please add that as a comment somewhere explaining why it is
> safe to do this?

Absolutely.

> > > > +		case LRU_REMOVED_RETRY:
> > > >  			if (--nlru->nr_items == 0)
> > > >  				node_clear(nid, lru->active_nodes);
> > > >  			WARN_ON_ONCE(nlru->nr_items < 0);
> > > >  			isolated++;
> > > > +			/*
> > > > +			 * If the lru lock has been dropped, our list
> > > > +			 * traversal is now invalid and so we have to
> > > > +			 * restart from scratch.
> > > > +			 */
> > > > +			if (ret == LRU_REMOVED_RETRY)
> > > > +				goto restart;
> > > >  			break;
> > > >  		case LRU_ROTATE:
> > > >  			list_move_tail(item, &nlru->list);
> > > 
> > > I think that we need to assert that the list lru lock is correctly
> > > held here on return with LRU_REMOVED_RETRY. i.e.
> > > 
> > > 		case LRU_REMOVED_RETRY:
> > > 			assert_spin_locked(&nlru->lock);
> > > 		case LRU_REMOVED:
> > 
> > Ah, good idea.  How about adding it to LRU_RETRY as well?
> 
> Yup, good idea.

Ok, will do.

> > > > +static struct shrinker workingset_shadow_shrinker = {
> > > > +	.count_objects = count_shadow_nodes,
> > > > +	.scan_objects = scan_shadow_nodes,
> > > > +	.seeks = DEFAULT_SEEKS * 4,
> > > > +	.flags = SHRINKER_NUMA_AWARE,
> > > > +};
> > > 
> > > Can you add a comment explaining how you calculated the .seeks
> > > value? It's important to document the weighings/importance
> > > we give to slab reclaim so we can determine if it's actually
> > > acheiving the desired balance under different loads...
> > 
> > This is not an exact science, to say the least.
> 
> I know, that's why I asked it be documented rather than be something
> kept in your head.
> 
> > The shadow entries are mostly self-regulated, so I don't want the
> > shrinker to interfere while the machine is just regularly trimming
> > caches during normal operation.
> > 
> > It should only kick in when either a) reclaim is picking up and the
> > scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> > swapping etc. or b) the number of objects compared to LRU pages
> > becomes excessive.
> > 
> > I think that is what most shrinkers with an elevated seeks value want,
> > but this translates very awkwardly (and not completely) to the current
> > cost model, and we should probably rework that interface.
> > 
> > "Seeks" currently encodes 3 ratios:
> > 
> >   1. the cost of creating an object vs. a page
> > 
> >   2. the expected number of objects vs. pages
> 
> It doesn't encode that at all. If it did, then the default value
> wouldn't be "2".
>
> >   3. the cost of reclaiming an object vs. a page
> 
> Which, when you consider #3 in conjunction with #1, the actual
> intended meaning of .seeks is "the cost of replacing this object in
> the cache compared to the cost of replacing a page cache page."

But what it actually seems to do is translate scan rate from LRU pages
to scan rate in another object pool.  The actual replacement cost
varies based on hotness of each set, an in-use object is more
expensive to replace than a cold page and vice versa, the dentry and
inode shrinkers reflect this by rotating hot objects and refusing to
actually reclaim items while they are in active use.

So I am having a hard time deriving a meaningful value out of this
definition for my usecase because I want to push back objects based on
reclaim efficiency (scan rate vs. reclaim rate).  The other shrinkers
with non-standard seek settings reek of magic number as well, which
suggests I am not alone with this.

I wonder if we can come up with a better interface that allows both
traditional cache shrinkers with their own aging, as well as object
pools that want to push back based on reclaim efficiency.

> > but they are not necessarily correlated.  How I would like to
> > configure the shadow shrinker instead is:
> > 
> >   o scan objects when reclaim efficiency is down to 75%, because they
> >     are more valuable than use-once cache but less than workingset
> > 
> >   o scan objects when the ratio between them and the number of pages
> >     exceeds 1/32 (one shadow entry for each resident page, up to 64
> >     entries per shrinkable object, assume 50% packing for robustness)
> > 
> >   o as the expected balance between objects and lru pages is 1:32,
> >     reclaim one object for every 32 reclaimed LRU pages, instead of
> >     assuming that number of scanned pages corresponds meaningfully to
> >     number of objects to scan.
> 
> You're assuming that every radix tree node has a full population of
> pages. This only occurs on sequential read and write workloads, and
> so isn't going tobe true for things like mapped executables or any
> semi-randomly accessed data set...

No, I'm assuming 50% population on average for that reason.  I don't
know how else I could assign a fixed value to a variable object.

> > "4" just doesn't have the same ring to it.
> 
> Right, but you still haven't explained how you came to the value of
> "4"....

It's a complete magic number.  The tests I ran suggested lower numbers
throw out shadow entries prematurely, whereas higher numbers thrash
the working set while there are plenty radix tree nodes present.

> > It would be great if we could eliminate the reclaim cost assumption by
> > turning the nr_to_scan into a nr_to_reclaim, and then set the other
> > two ratios independently.
> 
> That doesn't work for caches that are full of objects that can't (or
> won't) be reclaimed immediately. The CPU cost of repeatedly scanning
> to find N reclaimable objects when you have millions of objects in
> the cache is prohibitive.

That is true.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-21  5:50         ` Johannes Weiner
@ 2014-01-22  3:06           ` Dave Chinner
  2014-01-22  6:57             ` Johannes Weiner
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2014-01-22  3:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Tue, Jan 21, 2014 at 12:50:17AM -0500, Johannes Weiner wrote:
> On Tue, Jan 21, 2014 at 02:03:58PM +1100, Dave Chinner wrote:
> > On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> > > On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > > +static struct shrinker workingset_shadow_shrinker = {
> > > > > +	.count_objects = count_shadow_nodes,
> > > > > +	.scan_objects = scan_shadow_nodes,
> > > > > +	.seeks = DEFAULT_SEEKS * 4,
> > > > > +	.flags = SHRINKER_NUMA_AWARE,
> > > > > +};
> > > > 
> > > > Can you add a comment explaining how you calculated the .seeks
> > > > value? It's important to document the weighings/importance
> > > > we give to slab reclaim so we can determine if it's actually
> > > > acheiving the desired balance under different loads...
> > > 
> > > This is not an exact science, to say the least.
> > 
> > I know, that's why I asked it be documented rather than be something
> > kept in your head.
> > 
> > > The shadow entries are mostly self-regulated, so I don't want the
> > > shrinker to interfere while the machine is just regularly trimming
> > > caches during normal operation.
> > > 
> > > It should only kick in when either a) reclaim is picking up and the
> > > scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> > > swapping etc. or b) the number of objects compared to LRU pages
> > > becomes excessive.
> > > 
> > > I think that is what most shrinkers with an elevated seeks value want,
> > > but this translates very awkwardly (and not completely) to the current
> > > cost model, and we should probably rework that interface.
> > > 
> > > "Seeks" currently encodes 3 ratios:
> > > 
> > >   1. the cost of creating an object vs. a page
> > > 
> > >   2. the expected number of objects vs. pages
> > 
> > It doesn't encode that at all. If it did, then the default value
> > wouldn't be "2".
> >
> > >   3. the cost of reclaiming an object vs. a page
> > 
> > Which, when you consider #3 in conjunction with #1, the actual
> > intended meaning of .seeks is "the cost of replacing this object in
> > the cache compared to the cost of replacing a page cache page."
> 
> But what it actually seems to do is translate scan rate from LRU pages
> to scan rate in another object pool.  The actual replacement cost
> varies based on hotness of each set, an in-use object is more
> expensive to replace than a cold page and vice versa, the dentry and
> inode shrinkers reflect this by rotating hot objects and refusing to
> actually reclaim items while they are in active use.

Right, but so does the page cache when the page referenced bit is
seen by the LRU scanner. That's a scanned page, so what is passed to
shrink_slab is a ratio of pages scanned vs pages eligible for
reclaim. IOWs, the fact that the slab caches rotate rather than
reclaim is irrelevant - what matters is the same proportional
pressure is applied to the slab cache that was applied to the page
cache....

> So I am having a hard time deriving a meaningful value out of this
> definition for my usecase because I want to push back objects based on
> reclaim efficiency (scan rate vs. reclaim rate).  The other shrinkers
> with non-standard seek settings reek of magic number as well, which
> suggests I am not alone with this.

Right, which is exactly why I'm asking you to document it. I've got
no idea how other subsystems have come up with their magic numbers
because they are not documented, and so it's just about impossible
to determine what the author of the code really needed and hence the
best way to improve the interface is difficult to determine.

> I wonder if we can come up with a better interface that allows both
> traditional cache shrinkers with their own aging, as well as object
> pools that want to push back based on reclaim efficiency.

We probably can, though I'd prefer we don't end up with some
alternative algorithm that is specific to a single shrinker.

So, how do we measure page cache reclaim efficiency? How can that be
communicated to a shrinker? how can we tell a shrinker what measure
to use? How do we tell shrinker authors what measure to use?  How do
we translate that new method useful scan count information?

> > > but they are not necessarily correlated.  How I would like to
> > > configure the shadow shrinker instead is:
> > > 
> > >   o scan objects when reclaim efficiency is down to 75%, because they
> > >     are more valuable than use-once cache but less than workingset
> > > 
> > >   o scan objects when the ratio between them and the number of pages
> > >     exceeds 1/32 (one shadow entry for each resident page, up to 64
> > >     entries per shrinkable object, assume 50% packing for robustness)
> > > 
> > >   o as the expected balance between objects and lru pages is 1:32,
> > >     reclaim one object for every 32 reclaimed LRU pages, instead of
> > >     assuming that number of scanned pages corresponds meaningfully to
> > >     number of objects to scan.
> > 
> > You're assuming that every radix tree node has a full population of
> > pages. This only occurs on sequential read and write workloads, and
> > so isn't going tobe true for things like mapped executables or any
> > semi-randomly accessed data set...
> 
> No, I'm assuming 50% population on average for that reason.  I don't
> know how else I could assign a fixed value to a variable object.

Ok, I should have say "fixed population", not "full population". Do
you have any stats on the typical mapping tree radix node population
on running systems?

> > > "4" just doesn't have the same ring to it.
> > 
> > Right, but you still haven't explained how you came to the value of
> > "4"....
> 
> It's a complete magic number.  The tests I ran suggested lower numbers
> throw out shadow entries prematurely, whereas higher numbers thrash
> the working set while there are plenty radix tree nodes present.

That, at minimum, needs to be in a comment so that people have some
idea of how the magic number biases behaviour. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-22  3:06           ` Dave Chinner
@ 2014-01-22  6:57             ` Johannes Weiner
  2014-01-22 18:48               ` Johannes Weiner
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2014-01-22  6:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Wed, Jan 22, 2014 at 02:06:07PM +1100, Dave Chinner wrote:
> On Tue, Jan 21, 2014 at 12:50:17AM -0500, Johannes Weiner wrote:
> > On Tue, Jan 21, 2014 at 02:03:58PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> > > > On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > > > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > > > +static struct shrinker workingset_shadow_shrinker = {
> > > > > > +	.count_objects = count_shadow_nodes,
> > > > > > +	.scan_objects = scan_shadow_nodes,
> > > > > > +	.seeks = DEFAULT_SEEKS * 4,
> > > > > > +	.flags = SHRINKER_NUMA_AWARE,
> > > > > > +};
> > > > > 
> > > > > Can you add a comment explaining how you calculated the .seeks
> > > > > value? It's important to document the weighings/importance
> > > > > we give to slab reclaim so we can determine if it's actually
> > > > > acheiving the desired balance under different loads...
> > > > 
> > > > This is not an exact science, to say the least.
> > > 
> > > I know, that's why I asked it be documented rather than be something
> > > kept in your head.
> > > 
> > > > The shadow entries are mostly self-regulated, so I don't want the
> > > > shrinker to interfere while the machine is just regularly trimming
> > > > caches during normal operation.
> > > > 
> > > > It should only kick in when either a) reclaim is picking up and the
> > > > scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> > > > swapping etc. or b) the number of objects compared to LRU pages
> > > > becomes excessive.
> > > > 
> > > > I think that is what most shrinkers with an elevated seeks value want,
> > > > but this translates very awkwardly (and not completely) to the current
> > > > cost model, and we should probably rework that interface.
> > > > 
> > > > "Seeks" currently encodes 3 ratios:
> > > > 
> > > >   1. the cost of creating an object vs. a page
> > > > 
> > > >   2. the expected number of objects vs. pages
> > > 
> > > It doesn't encode that at all. If it did, then the default value
> > > wouldn't be "2".
> > >
> > > >   3. the cost of reclaiming an object vs. a page
> > > 
> > > Which, when you consider #3 in conjunction with #1, the actual
> > > intended meaning of .seeks is "the cost of replacing this object in
> > > the cache compared to the cost of replacing a page cache page."
> > 
> > But what it actually seems to do is translate scan rate from LRU pages
> > to scan rate in another object pool.  The actual replacement cost
> > varies based on hotness of each set, an in-use object is more
> > expensive to replace than a cold page and vice versa, the dentry and
> > inode shrinkers reflect this by rotating hot objects and refusing to
> > actually reclaim items while they are in active use.
> 
> Right, but so does the page cache when the page referenced bit is
> seen by the LRU scanner. That's a scanned page, so what is passed to
> shrink_slab is a ratio of pages scanned vs pages eligible for
> reclaim. IOWs, the fact that the slab caches rotate rather than
> reclaim is irrelevant - what matters is the same proportional
> pressure is applied to the slab cache that was applied to the page
> cache....

Oh, but it does.  You apply the same pressure to both, but the actual
reclaim outcome depends on object valuation measures specific to each
pool (e.g. recently referenced or not), whereas my shrinker takes
sc->nr_to_scan objects and reclaims them without looking at their
individual value, which varies just like the value of slab objects
varies.

I thought I could compensate for the lack of object valuation in the
shadow shrinker by tweaking that fixed pressure factor between page
cache and shadow entries, but I'm no longer convinced this can work.

One thing that does affect the value of shadow entries is the overall
health of the system, memory-wise, so reclaim efficiency would be one
factor that affects individual object value, albeit a secondary one.

The most obvious value factor is whether the shadow entries in a node
are expired or not, but there are potentially 64 of them, potentially
from different zones with different "inactive ages" atomic_t's, so
that is fairly expensive to assess.

> > So I am having a hard time deriving a meaningful value out of this
> > definition for my usecase because I want to push back objects based on
> > reclaim efficiency (scan rate vs. reclaim rate).  The other shrinkers
> > with non-standard seek settings reek of magic number as well, which
> > suggests I am not alone with this.
> 
> Right, which is exactly why I'm asking you to document it. I've got
> no idea how other subsystems have come up with their magic numbers
> because they are not documented, and so it's just about impossible
> to determine what the author of the code really needed and hence the
> best way to improve the interface is difficult to determine.
> 
> > I wonder if we can come up with a better interface that allows both
> > traditional cache shrinkers with their own aging, as well as object
> > pools that want to push back based on reclaim efficiency.
> 
> We probably can, though I'd prefer we don't end up with some
> alternative algorithm that is specific to a single shrinker.
> 
> So, how do we measure page cache reclaim efficiency? How can that be
> communicated to a shrinker? how can we tell a shrinker what measure
> to use? How do we tell shrinker authors what measure to use?  How do
> we translate that new method useful scan count information?

We usually define it as the scanned / reclaim ratio.  I have to think
about the rest and what exactly I need from the shrinker.  Unless I
can come up with a better object valuation model that can be a private
part of the shadow shrinker, of course.

> > > > but they are not necessarily correlated.  How I would like to
> > > > configure the shadow shrinker instead is:
> > > > 
> > > >   o scan objects when reclaim efficiency is down to 75%, because they
> > > >     are more valuable than use-once cache but less than workingset
> > > > 
> > > >   o scan objects when the ratio between them and the number of pages
> > > >     exceeds 1/32 (one shadow entry for each resident page, up to 64
> > > >     entries per shrinkable object, assume 50% packing for robustness)
> > > > 
> > > >   o as the expected balance between objects and lru pages is 1:32,
> > > >     reclaim one object for every 32 reclaimed LRU pages, instead of
> > > >     assuming that number of scanned pages corresponds meaningfully to
> > > >     number of objects to scan.
> > > 
> > > You're assuming that every radix tree node has a full population of
> > > pages. This only occurs on sequential read and write workloads, and
> > > so isn't going tobe true for things like mapped executables or any
> > > semi-randomly accessed data set...
> > 
> > No, I'm assuming 50% population on average for that reason.  I don't
> > know how else I could assign a fixed value to a variable object.
> 
> Ok, I should have say "fixed population", not "full population". Do
> you have any stats on the typical mapping tree radix node population
> on running systems?

Not at this time, I'll try to look into that.  For now, I am updating
the patch to revert the shrinker back to DEFAULT_SEEKS and change the
object count to only include objects above a certain threshold, which
assumes a worst-case population of 4 in 64 slots.  It's not perfect,
but neither was the seeks magic, and it's easier to reason about what
it's actually doing.

---
 mm/filemap.c    | 17 +++++++++++++++--
 mm/list_lru.c   |  4 +++-
 mm/truncate.c   |  8 +++++++-
 mm/workingset.c | 54 ++++++++++++++++++++++++++++++++++++++----------------
 4 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b93e223b59a9..45a52fd28938 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -156,7 +156,13 @@ static void page_cache_tree_delete(struct address_space *mapping,
 		if (__radix_tree_delete_node(&mapping->page_tree, node))
 			return;
 
-	/* Only shadow entries in there, keep track of this node */
+	/*
+	 * Track node that only contains shadow entries.
+	 *
+	 * Avoid acquiring the list_lru lock if already tracked.  The
+	 * list_empty() test is safe as node->private_list is
+	 * protected by mapping->tree_lock.
+	 */
 	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
 	    list_empty(&node->private_list)) {
 		node->private_data = mapping;
@@ -531,7 +537,14 @@ static int page_cache_tree_insert(struct address_space *mapping,
 	mapping->nrpages++;
 	if (node) {
 		node->count++;
-		/* Installed page, can't be shadow-only anymore */
+		/*
+		 * Don't track node that contains actual pages.
+		 *
+		 * Avoid acquiring the list_lru lock if already
+		 * untracked.  The list_empty() test is safe as
+		 * node->private_list is protected by
+		 * mapping->tree_lock.
+		 */
 		if (!list_empty(&node->private_list))
 			list_lru_del(&workingset_shadow_nodes,
 				     &node->private_list);
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 47a9faf4070b..7f5b73e2513b 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -87,8 +87,9 @@ restart:
 
 		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
-		case LRU_REMOVED:
 		case LRU_REMOVED_RETRY:
+			assert_spin_locked(&nlru->lock);
+		case LRU_REMOVED:
 			if (--nlru->nr_items == 0)
 				node_clear(nid, lru->active_nodes);
 			WARN_ON_ONCE(nlru->nr_items < 0);
@@ -111,6 +112,7 @@ restart:
 			 * The lru lock has been dropped, our list traversal is
 			 * now invalid and so we have to restart from scratch.
 			 */
+			assert_spin_locked(&nlru->lock);
 			goto restart;
 		default:
 			BUG();
diff --git a/mm/truncate.c b/mm/truncate.c
index 5c2615d7f4da..5f7599b49126 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -47,7 +47,13 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	if (!node)
 		goto unlock;
 	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-	/* No more shadow entries, stop tracking the node */
+	/*
+	 * Don't track node without shadow entries.
+	 *
+	 * Avoid acquiring the list_lru lock if already untracked.
+	 * The list_empty() test is safe as node->private_list is
+	 * protected by mapping->tree_lock.
+	 */
 	if (!(node->count >> RADIX_TREE_COUNT_SHIFT) &&
 	    !list_empty(&node->private_list))
 		list_lru_del(&workingset_shadow_nodes, &node->private_list);
diff --git a/mm/workingset.c b/mm/workingset.c
index 7bb1a432c137..8ac2a26951ef 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -253,12 +253,15 @@ void workingset_activation(struct page *page)
 }
 
 /*
- * Page cache radix tree nodes containing only shadow entries can grow
- * excessively on certain workloads.  That's why they are tracked on
- * per-(NUMA)node lists and pushed back by a shrinker, but with a
- * slightly higher threshold than regular shrinkers so we don't
- * discard the entries too eagerly - after all, during light memory
- * pressure is exactly when we need them.
+ * Shadow entries reflect the share of the working set that does not
+ * fit into memory, so their number depends on the access pattern of
+ * the workload.  In most cases, they will refault or get reclaimed
+ * along with the inode, but a (malicious) workload that streams
+ * through files with a total size several times that of available
+ * memory, while preventing the inodes from being reclaimed, can
+ * create excessive amounts of shadow nodes.  To keep a lid on this,
+ * track shadow nodes and reclaim them when they grow way past the
+ * point where they would still be useful.
  */
 
 struct list_lru workingset_shadow_nodes;
@@ -266,14 +269,38 @@ struct list_lru workingset_shadow_nodes;
 static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 					struct shrink_control *sc)
 {
-	return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
+	unsigned long shadow_nodes;
+	unsigned long max_nodes;
+	unsigned long pages;
+
+	shadow_nodes = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
+	pages = node_present_pages(sc->nid);
+	/*
+	 * Active cache pages are limited to 50% of memory, and shadow
+	 * entries that represent a refault distance bigger than that
+	 * do not have any effect.  Limit the number of shadow nodes
+	 * such that shadow entries do not exceed the number of active
+	 * cache pages, assuming a worst-case node population density
+	 * of 1/16th on average.
+	 *
+	 * On 64-bit with 7 radix_tree_nodes per page and 64 slots
+	 * each, this will reclaim shadow entries when they consume
+	 * ~2% of available memory:
+	 *
+	 * PAGE_SIZE / radix_tree_nodes / node_entries / PAGE_SIZE
+	 */
+	max_nodes = pages >> (1 + RADIX_TREE_MAP_SHIFT - 3);
+
+	if (shadow_nodes <= max_nodes)
+		return 0;
+
+	return shadow_nodes - max_nodes;
 }
 
 static enum lru_status shadow_lru_isolate(struct list_head *item,
 					  spinlock_t *lru_lock,
 					  void *arg)
 {
-	unsigned long *nr_reclaimed = arg;
 	struct address_space *mapping;
 	struct radix_tree_node *node;
 	unsigned int i;
@@ -327,7 +354,6 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
 	if (!__radix_tree_delete_node(&mapping->page_tree, node))
 		BUG();
-	(*nr_reclaimed)++;
 
 	spin_unlock_irq(&mapping->tree_lock);
 	ret = LRU_REMOVED_RETRY;
@@ -340,18 +366,14 @@ out:
 static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
 				       struct shrink_control *sc)
 {
-	unsigned long nr_reclaimed = 0;
-
-	list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
-			   shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
-
-	return nr_reclaimed;
+	return list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
+				  shadow_lru_isolate, NULL, &sc->nr_to_scan);
 }
 
 static struct shrinker workingset_shadow_shrinker = {
 	.count_objects = count_shadow_nodes,
 	.scan_objects = scan_shadow_nodes,
-	.seeks = DEFAULT_SEEKS * 4,
+	.seeks = DEFAULT_SEEKS,
 	.flags = SHRINKER_NUMA_AWARE,
 };
 
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees
  2014-01-13  2:01   ` Minchan Kim
@ 2014-01-22 17:47     ` Johannes Weiner
  2014-01-23  5:07       ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2014-01-22 17:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Mon, Jan 13, 2014 at 11:01:32AM +0900, Minchan Kim wrote:
> On Fri, Jan 10, 2014 at 01:10:39PM -0500, Johannes Weiner wrote:
> > shmem mappings already contain exceptional entries where swap slot
> > information is remembered.
> > 
> > To be able to store eviction information for regular page cache,
> > prepare every site dealing with the radix trees directly to handle
> > entries other than pages.
> > 
> > The common lookup functions will filter out non-page entries and
> > return NULL for page cache holes, just as before.  But provide a raw
> > version of the API which returns non-page entries as well, and switch
> > shmem over to use it.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Minchan Kim <minchan@kernel.org>

Thanks, Minchan!

> > @@ -890,6 +973,73 @@ repeat:
> >  EXPORT_SYMBOL(find_or_create_page);
> >  
> >  /**
> > + * __find_get_pages - gang pagecache lookup
> > + * @mapping:	The address_space to search
> > + * @start:	The starting page index
> > + * @nr_pages:	The maximum number of pages
> > + * @pages:	Where the resulting pages are placed
> 
> where is @indices?

Fixed :)

> > @@ -894,6 +894,53 @@ EXPORT_SYMBOL(__pagevec_lru_add);
> >  
> >  /**
> >   * pagevec_lookup - gang pagecache lookup
> 
>       __pagevec_lookup?
> 
> > + * @pvec:	Where the resulting entries are placed
> > + * @mapping:	The address_space to search
> > + * @start:	The starting entry index
> > + * @nr_pages:	The maximum number of entries
> 
>       missing @indices?
> 
> > + *
> > + * pagevec_lookup() will search for and return a group of up to
> > + * @nr_pages pages and shadow entries in the mapping.  All entries are
> > + * placed in @pvec.  pagevec_lookup() takes a reference against actual
> > + * pages in @pvec.
> > + *
> > + * The search returns a group of mapping-contiguous entries with
> > + * ascending indexes.  There may be holes in the indices due to
> > + * not-present entries.
> > + *
> > + * pagevec_lookup() returns the number of entries which were found.
> 
>       __pagevec_lookup

Yikes, all three fixed.

> > @@ -22,6 +22,22 @@
> >  #include <linux/cleancache.h>
> >  #include "internal.h"
> >  
> > +static void clear_exceptional_entry(struct address_space *mapping,
> > +				    pgoff_t index, void *entry)
> > +{
> > +	/* Handled by shmem itself */
> > +	if (shmem_mapping(mapping))
> > +		return;
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	/*
> > +	 * Regular page slots are stabilized by the page lock even
> > +	 * without the tree itself locked.  These unlocked entries
> > +	 * need verification under the tree lock.
> > +	 */
> 
> Could you explain why repeated spin_lock with irq disabled isn't problem
> in truncation path?

To modify the cache tree, we have to take the IRQ-safe tree_lock, this
is no different than removing a page (see truncate_complete_page).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-13  7:39   ` Minchan Kim
  2014-01-14  5:40     ` Minchan Kim
@ 2014-01-22 18:42     ` Johannes Weiner
  2014-01-23  5:20       ` Minchan Kim
  1 sibling, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2014-01-22 18:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.
> > 
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >    from the node up to the tree root, which is needed to perform a
> >    deletion.  To solve this, encode in each node its offset inside the
> >    parent.  This can be stored in the unused upper bits of the same
> >    member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >    regular entries, to quickly detect when the node is ready to go to
> >    the shadow node LRU list.  The current entry count is an unsigned
> >    int but the maximum number of entries is 64, so a shadow counter
> >    can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs tree lock and tree root, which are located
> >    in the address space, so store an address_space backpointer in the
> >    node.  The parent pointer of the node is in a union with the 2-word
> >    rcu_head, so the backpointer comes at no extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >    head inside the node.  This does increase the size of the node, but
> >    it does not change the number of objects that fit into a slab page.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  include/linux/list_lru.h   |   2 +
> >  include/linux/mmzone.h     |   1 +
> >  include/linux/radix-tree.h |  32 +++++++++---
> >  include/linux/swap.h       |   1 +
> >  lib/radix-tree.c           |  36 ++++++++------
> >  mm/filemap.c               |  77 +++++++++++++++++++++++------
> >  mm/list_lru.c              |   8 +++
> >  mm/truncate.c              |  20 +++++++-
> >  mm/vmstat.c                |   1 +
> >  mm/workingset.c            | 121 +++++++++++++++++++++++++++++++++++++++++++++
> >  10 files changed, 259 insertions(+), 40 deletions(-)
> > 
> > diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> > index 3ce541753c88..b02fc233eadd 100644
> > --- a/include/linux/list_lru.h
> > +++ b/include/linux/list_lru.h
> > @@ -13,6 +13,8 @@
> >  /* list_lru_walk_cb has to always return one of those */
> >  enum lru_status {
> >  	LRU_REMOVED,		/* item removed from list */
> > +	LRU_REMOVED_RETRY,	/* item removed, but lock has been
> > +				   dropped and reacquired */
> >  	LRU_ROTATE,		/* item referenced, give another pass */
> >  	LRU_SKIP,		/* item cannot be locked, skip */
> >  	LRU_RETRY,		/* item not freeable. May drop the lock
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 118ba9f51e86..8cac5a7ef7a7 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -144,6 +144,7 @@ enum zone_stat_item {
> >  #endif
> >  	WORKINGSET_REFAULT,
> >  	WORKINGSET_ACTIVATE,
> > +	WORKINGSET_NODERECLAIM,
> >  	NR_ANON_TRANSPARENT_HUGEPAGES,
> >  	NR_FREE_CMA_PAGES,
> >  	NR_VM_ZONE_STAT_ITEMS };
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index 13636c40bc42..33170dbd9db4 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
> >  #define RADIX_TREE_TAG_LONGS	\
> >  	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
> >  
> > +#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > +#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> > +					  RADIX_TREE_MAP_SHIFT))
> > +
> > +/* Height component in node->path */
> > +#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
> > +#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
> > +
> > +/* Internally used bits of node->count */
> > +#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
> > +#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
> > +
> >  struct radix_tree_node {
> > -	unsigned int	height;		/* Height from the bottom */
> > +	unsigned int	path;	/* Offset in parent & height from the bottom */
> >  	unsigned int	count;
> >  	union {
> > -		struct radix_tree_node *parent;	/* Used when ascending tree */
> > -		struct rcu_head	rcu_head;	/* Used when freeing node */
> > +		struct {
> > +			/* Used when ascending tree */
> > +			struct radix_tree_node *parent;
> > +			/* For tree user */
> > +			void *private_data;
> > +		};
> > +		/* Used when freeing node */
> > +		struct rcu_head	rcu_head;
> >  	};
> > +	/* For tree user */
> > +	struct list_head private_list;
> >  	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
> >  	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
> >  };
> >  
> > -#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > -#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> > -					  RADIX_TREE_MAP_SHIFT))
> > -
> >  /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
> >  struct radix_tree_root {
> >  	unsigned int		height;
> > @@ -251,7 +267,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
> >  			  struct radix_tree_node **nodep, void ***slotp);
> >  void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
> >  void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
> > -bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
> > +bool __radix_tree_delete_node(struct radix_tree_root *root,
> >  			      struct radix_tree_node *node);
> >  void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
> >  void *radix_tree_delete(struct radix_tree_root *, unsigned long);
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index b83cf61403ed..102e37bc82d5 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -264,6 +264,7 @@ struct swap_list_t {
> >  void *workingset_eviction(struct address_space *mapping, struct page *page);
> >  bool workingset_refault(void *shadow);
> >  void workingset_activation(struct page *page);
> > +extern struct list_lru workingset_shadow_nodes;
> >  
> >  /* linux/mm/page_alloc.c */
> >  extern unsigned long totalram_pages;
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index e601c56a43d0..0a0895371447 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -342,7 +342,8 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
> >  
> >  		/* Increase the height.  */
> >  		newheight = root->height+1;
> > -		node->height = newheight;
> > +		BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
> > +		node->path = newheight;
> 
> Nitpick:
> It would be better to add some accessor for path and offset for
> readability and future enhance?

Nodes are instantiated in one central place, I can't see the value in
obscuring a straight-forward bitop with a radix_tree_node_set_offset()
call.

And height = node->path & RADIX_TREE_HEIGHT_MASK should be fairly
descriptive, I think.

> > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
> >  		 * same time and miss a shadow entry.
> >  		 */
> >  		smp_wmb();
> > -	} else
> > -		radix_tree_delete(&mapping->page_tree, page->index);
> > +	}
> >  	mapping->nrpages--;
> > +
> > +	if (!node) {
> > +		/* Clear direct pointer tags in root node */
> > +		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > +		radix_tree_replace_slot(slot, shadow);
> > +		return;
> > +	}
> > +
> > +	/* Clear tree tags for the removed page */
> > +	index = page->index;
> > +	offset = index & RADIX_TREE_MAP_MASK;
> > +	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > +		if (test_bit(offset, node->tags[tag]))
> > +			radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > +	}
> > +
> > +	/* Delete page, swap shadow entry */
> > +	radix_tree_replace_slot(slot, shadow);
> > +	node->count--;
> > +	if (shadow)
> > +		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> 
> Nitpick2:
> It should be a function of workingset.c rather than exposing
> RADIX_TREE_COUNT_SHIFT?
> 
> IMO, It would be better to provide some accessor functions here, too.

The shadow maintenance and node lifetime management are pretty
interwoven to share branches and reduce instructions as these are
common paths.  I don't see how this could result in cleaner code while
keeping these advantages.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-22  6:57             ` Johannes Weiner
@ 2014-01-22 18:48               ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-01-22 18:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Wed, Jan 22, 2014 at 01:57:14AM -0500, Johannes Weiner wrote:
> Not at this time, I'll try to look into that.  For now, I am updating
> the patch to revert the shrinker back to DEFAULT_SEEKS and change the
> object count to only include objects above a certain threshold, which
> assumes a worst-case population of 4 in 64 slots.  It's not perfect,
> but neither was the seeks magic, and it's easier to reason about what
> it's actually doing.

Ah, the quality of 2am submissions...  8 out of 64 of course.

> @@ -266,14 +269,38 @@ struct list_lru workingset_shadow_nodes;
>  static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  					struct shrink_control *sc)
>  {
> -	return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> +	unsigned long shadow_nodes;
> +	unsigned long max_nodes;
> +	unsigned long pages;
> +
> +	shadow_nodes = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> +	pages = node_present_pages(sc->nid);
> +	/*
> +	 * Active cache pages are limited to 50% of memory, and shadow
> +	 * entries that represent a refault distance bigger than that
> +	 * do not have any effect.  Limit the number of shadow nodes
> +	 * such that shadow entries do not exceed the number of active
> +	 * cache pages, assuming a worst-case node population density
> +	 * of 1/16th on average.

1/8th.  The actual code is consistent:

> +	 * On 64-bit with 7 radix_tree_nodes per page and 64 slots
> +	 * each, this will reclaim shadow entries when they consume
> +	 * ~2% of available memory:
> +	 *
> +	 * PAGE_SIZE / radix_tree_nodes / node_entries / PAGE_SIZE
> +	 */
> +	max_nodes = pages >> (1 + RADIX_TREE_MAP_SHIFT - 3);
> +
> +	if (shadow_nodes <= max_nodes)
> +		return 0;
> +
> +	return shadow_nodes - max_nodes;
>  }
>  
>  static enum lru_status shadow_lru_isolate(struct list_head *item,
>  					  spinlock_t *lru_lock,
>  					  void *arg)
>  {
> -	unsigned long *nr_reclaimed = arg;
>  	struct address_space *mapping;
>  	struct radix_tree_node *node;
>  	unsigned int i;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees
  2014-01-22 17:47     ` Johannes Weiner
@ 2014-01-23  5:07       ` Minchan Kim
  0 siblings, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2014-01-23  5:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Hi Hannes,

On Wed, Jan 22, 2014 at 12:47:44PM -0500, Johannes Weiner wrote:
> On Mon, Jan 13, 2014 at 11:01:32AM +0900, Minchan Kim wrote:
> > On Fri, Jan 10, 2014 at 01:10:39PM -0500, Johannes Weiner wrote:
> > > shmem mappings already contain exceptional entries where swap slot
> > > information is remembered.
> > > 
> > > To be able to store eviction information for regular page cache,
> > > prepare every site dealing with the radix trees directly to handle
> > > entries other than pages.
> > > 
> > > The common lookup functions will filter out non-page entries and
> > > return NULL for page cache holes, just as before.  But provide a raw
> > > version of the API which returns non-page entries as well, and switch
> > > shmem over to use it.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Reviewed-by: Minchan Kim <minchan@kernel.org>
> 
> Thanks, Minchan!
> 
> > > @@ -890,6 +973,73 @@ repeat:
> > >  EXPORT_SYMBOL(find_or_create_page);
> > >  
> > >  /**
> > > + * __find_get_pages - gang pagecache lookup
> > > + * @mapping:	The address_space to search
> > > + * @start:	The starting page index
> > > + * @nr_pages:	The maximum number of pages
> > > + * @pages:	Where the resulting pages are placed
> > 
> > where is @indices?
> 
> Fixed :)
> 
> > > @@ -894,6 +894,53 @@ EXPORT_SYMBOL(__pagevec_lru_add);
> > >  
> > >  /**
> > >   * pagevec_lookup - gang pagecache lookup
> > 
> >       __pagevec_lookup?
> > 
> > > + * @pvec:	Where the resulting entries are placed
> > > + * @mapping:	The address_space to search
> > > + * @start:	The starting entry index
> > > + * @nr_pages:	The maximum number of entries
> > 
> >       missing @indices?
> > 
> > > + *
> > > + * pagevec_lookup() will search for and return a group of up to
> > > + * @nr_pages pages and shadow entries in the mapping.  All entries are
> > > + * placed in @pvec.  pagevec_lookup() takes a reference against actual
> > > + * pages in @pvec.
> > > + *
> > > + * The search returns a group of mapping-contiguous entries with
> > > + * ascending indexes.  There may be holes in the indices due to
> > > + * not-present entries.
> > > + *
> > > + * pagevec_lookup() returns the number of entries which were found.
> > 
> >       __pagevec_lookup
> 
> Yikes, all three fixed.
> 
> > > @@ -22,6 +22,22 @@
> > >  #include <linux/cleancache.h>
> > >  #include "internal.h"
> > >  
> > > +static void clear_exceptional_entry(struct address_space *mapping,
> > > +				    pgoff_t index, void *entry)
> > > +{
> > > +	/* Handled by shmem itself */
> > > +	if (shmem_mapping(mapping))
> > > +		return;
> > > +
> > > +	spin_lock_irq(&mapping->tree_lock);
> > > +	/*
> > > +	 * Regular page slots are stabilized by the page lock even
> > > +	 * without the tree itself locked.  These unlocked entries
> > > +	 * need verification under the tree lock.
> > > +	 */
> > 
> > Could you explain why repeated spin_lock with irq disabled isn't problem
> > in truncation path?
> 
> To modify the cache tree, we have to take the IRQ-safe tree_lock, this
> is no different than removing a page (see truncate_complete_page).

I meant we can do batch irq_[lock|unlock] part with periodic irq release
because clear_exceptional_entry is always called with gang pagecache
lookup.

Just a comment about optimiztation so it shouldn't be critical for merging
and we could do in future if it's really problem for scalability.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-22 18:42     ` Johannes Weiner
@ 2014-01-23  5:20       ` Minchan Kim
  2014-01-23 19:22         ` Johannes Weiner
  0 siblings, 1 reply; 58+ messages in thread
From: Minchan Kim @ 2014-01-23  5:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Wed, Jan 22, 2014 at 01:42:17PM -0500, Johannes Weiner wrote:
> On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > Previously, page cache radix tree nodes were freed after reclaim
> > > emptied out their page pointers.  But now reclaim stores shadow
> > > entries in their place, which are only reclaimed when the inodes
> > > themselves are reclaimed.  This is problematic for bigger files that
> > > are still in use after they have a significant amount of their cache
> > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > entries will just sit there and waste memory.  In the worst case, the
> > > shadow entries will accumulate until the machine runs out of memory.
> > > 
> > > To get this under control, the VM will track radix tree nodes
> > > exclusively containing shadow entries on a per-NUMA node list.
> > > Per-NUMA rather than global because we expect the radix tree nodes
> > > themselves to be allocated node-locally and we want to reduce
> > > cross-node references of otherwise independent cache workloads.  A
> > > simple shrinker will then reclaim these nodes on memory pressure.
> > > 
> > > A few things need to be stored in the radix tree node to implement the
> > > shadow node LRU and allow tree deletions coming from the list:
> > > 
> > > 1. There is no index available that would describe the reverse path
> > >    from the node up to the tree root, which is needed to perform a
> > >    deletion.  To solve this, encode in each node its offset inside the
> > >    parent.  This can be stored in the unused upper bits of the same
> > >    member that stores the node's height at no extra space cost.
> > > 
> > > 2. The number of shadow entries needs to be counted in addition to the
> > >    regular entries, to quickly detect when the node is ready to go to
> > >    the shadow node LRU list.  The current entry count is an unsigned
> > >    int but the maximum number of entries is 64, so a shadow counter
> > >    can easily be stored in the unused upper bits.
> > > 
> > > 3. Tree modification needs tree lock and tree root, which are located
> > >    in the address space, so store an address_space backpointer in the
> > >    node.  The parent pointer of the node is in a union with the 2-word
> > >    rcu_head, so the backpointer comes at no extra cost as well.
> > > 
> > > 4. The node needs to be linked to an LRU list, which requires a list
> > >    head inside the node.  This does increase the size of the node, but
> > >    it does not change the number of objects that fit into a slab page.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  include/linux/list_lru.h   |   2 +
> > >  include/linux/mmzone.h     |   1 +
> > >  include/linux/radix-tree.h |  32 +++++++++---
> > >  include/linux/swap.h       |   1 +
> > >  lib/radix-tree.c           |  36 ++++++++------
> > >  mm/filemap.c               |  77 +++++++++++++++++++++++------
> > >  mm/list_lru.c              |   8 +++
> > >  mm/truncate.c              |  20 +++++++-
> > >  mm/vmstat.c                |   1 +
> > >  mm/workingset.c            | 121 +++++++++++++++++++++++++++++++++++++++++++++
> > >  10 files changed, 259 insertions(+), 40 deletions(-)
> > > 
> > > diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> > > index 3ce541753c88..b02fc233eadd 100644
> > > --- a/include/linux/list_lru.h
> > > +++ b/include/linux/list_lru.h
> > > @@ -13,6 +13,8 @@
> > >  /* list_lru_walk_cb has to always return one of those */
> > >  enum lru_status {
> > >  	LRU_REMOVED,		/* item removed from list */
> > > +	LRU_REMOVED_RETRY,	/* item removed, but lock has been
> > > +				   dropped and reacquired */
> > >  	LRU_ROTATE,		/* item referenced, give another pass */
> > >  	LRU_SKIP,		/* item cannot be locked, skip */
> > >  	LRU_RETRY,		/* item not freeable. May drop the lock
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 118ba9f51e86..8cac5a7ef7a7 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -144,6 +144,7 @@ enum zone_stat_item {
> > >  #endif
> > >  	WORKINGSET_REFAULT,
> > >  	WORKINGSET_ACTIVATE,
> > > +	WORKINGSET_NODERECLAIM,
> > >  	NR_ANON_TRANSPARENT_HUGEPAGES,
> > >  	NR_FREE_CMA_PAGES,
> > >  	NR_VM_ZONE_STAT_ITEMS };
> > > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > > index 13636c40bc42..33170dbd9db4 100644
> > > --- a/include/linux/radix-tree.h
> > > +++ b/include/linux/radix-tree.h
> > > @@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
> > >  #define RADIX_TREE_TAG_LONGS	\
> > >  	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
> > >  
> > > +#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > > +#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> > > +					  RADIX_TREE_MAP_SHIFT))
> > > +
> > > +/* Height component in node->path */
> > > +#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
> > > +#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
> > > +
> > > +/* Internally used bits of node->count */
> > > +#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
> > > +#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
> > > +
> > >  struct radix_tree_node {
> > > -	unsigned int	height;		/* Height from the bottom */
> > > +	unsigned int	path;	/* Offset in parent & height from the bottom */
> > >  	unsigned int	count;
> > >  	union {
> > > -		struct radix_tree_node *parent;	/* Used when ascending tree */
> > > -		struct rcu_head	rcu_head;	/* Used when freeing node */
> > > +		struct {
> > > +			/* Used when ascending tree */
> > > +			struct radix_tree_node *parent;
> > > +			/* For tree user */
> > > +			void *private_data;
> > > +		};
> > > +		/* Used when freeing node */
> > > +		struct rcu_head	rcu_head;
> > >  	};
> > > +	/* For tree user */
> > > +	struct list_head private_list;
> > >  	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
> > >  	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
> > >  };
> > >  
> > > -#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > > -#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> > > -					  RADIX_TREE_MAP_SHIFT))
> > > -
> > >  /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
> > >  struct radix_tree_root {
> > >  	unsigned int		height;
> > > @@ -251,7 +267,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
> > >  			  struct radix_tree_node **nodep, void ***slotp);
> > >  void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
> > >  void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
> > > -bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
> > > +bool __radix_tree_delete_node(struct radix_tree_root *root,
> > >  			      struct radix_tree_node *node);
> > >  void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
> > >  void *radix_tree_delete(struct radix_tree_root *, unsigned long);
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index b83cf61403ed..102e37bc82d5 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -264,6 +264,7 @@ struct swap_list_t {
> > >  void *workingset_eviction(struct address_space *mapping, struct page *page);
> > >  bool workingset_refault(void *shadow);
> > >  void workingset_activation(struct page *page);
> > > +extern struct list_lru workingset_shadow_nodes;
> > >  
> > >  /* linux/mm/page_alloc.c */
> > >  extern unsigned long totalram_pages;
> > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > index e601c56a43d0..0a0895371447 100644
> > > --- a/lib/radix-tree.c
> > > +++ b/lib/radix-tree.c
> > > @@ -342,7 +342,8 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
> > >  
> > >  		/* Increase the height.  */
> > >  		newheight = root->height+1;
> > > -		node->height = newheight;
> > > +		BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
> > > +		node->path = newheight;
> > 
> > Nitpick:
> > It would be better to add some accessor for path and offset for
> > readability and future enhance?
> 
> Nodes are instantiated in one central place, I can't see the value in
> obscuring a straight-forward bitop with a radix_tree_node_set_offset()
> call.
> 
> And height = node->path & RADIX_TREE_HEIGHT_MASK should be fairly
> descriptive, I think.
> 
> > > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
> > >  		 * same time and miss a shadow entry.
> > >  		 */
> > >  		smp_wmb();
> > > -	} else
> > > -		radix_tree_delete(&mapping->page_tree, page->index);
> > > +	}
> > >  	mapping->nrpages--;
> > > +
> > > +	if (!node) {
> > > +		/* Clear direct pointer tags in root node */
> > > +		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > > +		radix_tree_replace_slot(slot, shadow);
> > > +		return;
> > > +	}
> > > +
> > > +	/* Clear tree tags for the removed page */
> > > +	index = page->index;
> > > +	offset = index & RADIX_TREE_MAP_MASK;
> > > +	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > > +		if (test_bit(offset, node->tags[tag]))
> > > +			radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > > +	}
> > > +
> > > +	/* Delete page, swap shadow entry */
> > > +	radix_tree_replace_slot(slot, shadow);
> > > +	node->count--;
> > > +	if (shadow)
> > > +		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> > 
> > Nitpick2:
> > It should be a function of workingset.c rather than exposing
> > RADIX_TREE_COUNT_SHIFT?
> > 
> > IMO, It would be better to provide some accessor functions here, too.
> 
> The shadow maintenance and node lifetime management are pretty
> interwoven to share branches and reduce instructions as these are
> common paths.  I don't see how this could result in cleaner code while
> keeping these advantages.

What I want is just put a inline accessor in somewhere like workingset.h

static inline void inc_shadow_entry(struct radix_tree_node *node)
{
    node->count += 1U << RADIX_TREE_COUNT_MASK;
}

So, anyone don't need to know that node->count upper bits present
count of shadow entry.

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-20 23:17     ` Johannes Weiner
  2014-01-21  3:03       ` Dave Chinner
@ 2014-01-23  5:57       ` Minchan Kim
  1 sibling, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2014-01-23  5:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, Andrew Morton, Andi Kleen, Andrea Arcangeli,
	Bob Liu, Christoph Hellwig, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra, Rik van Riel,
	Roman Gushchin, Ryan Mallon, Tejun Heo, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > Previously, page cache radix tree nodes were freed after reclaim
> > > emptied out their page pointers.  But now reclaim stores shadow
> > > entries in their place, which are only reclaimed when the inodes
> > > themselves are reclaimed.  This is problematic for bigger files that
> > > are still in use after they have a significant amount of their cache
> > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > entries will just sit there and waste memory.  In the worst case, the
> > > shadow entries will accumulate until the machine runs out of memory.
> > > 
> > > To get this under control, the VM will track radix tree nodes
> > > exclusively containing shadow entries on a per-NUMA node list.
> > > Per-NUMA rather than global because we expect the radix tree nodes
> > > themselves to be allocated node-locally and we want to reduce
> > > cross-node references of otherwise independent cache workloads.  A
> > > simple shrinker will then reclaim these nodes on memory pressure.
> > > 
> > > A few things need to be stored in the radix tree node to implement the
> > > shadow node LRU and allow tree deletions coming from the list:
> > 
> > Just a couple of things with the list_lru interfaces.
> > 
> > ....
> > > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
> > >  		 * same time and miss a shadow entry.
> > >  		 */
> > >  		smp_wmb();
> > > -	} else
> > > -		radix_tree_delete(&mapping->page_tree, page->index);
> > > +	}
> > >  	mapping->nrpages--;
> > > +
> > > +	if (!node) {
> > > +		/* Clear direct pointer tags in root node */
> > > +		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > > +		radix_tree_replace_slot(slot, shadow);
> > > +		return;
> > > +	}
> > > +
> > > +	/* Clear tree tags for the removed page */
> > > +	index = page->index;
> > > +	offset = index & RADIX_TREE_MAP_MASK;
> > > +	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > > +		if (test_bit(offset, node->tags[tag]))
> > > +			radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > > +	}
> > > +
> > > +	/* Delete page, swap shadow entry */
> > > +	radix_tree_replace_slot(slot, shadow);
> > > +	node->count--;
> > > +	if (shadow)
> > > +		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> > > +	else
> > > +		if (__radix_tree_delete_node(&mapping->page_tree, node))
> > > +			return;
> > > +
> > > +	/* Only shadow entries in there, keep track of this node */
> > > +	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > > +	    list_empty(&node->private_list)) {
> > > +		node->private_data = mapping;
> > > +		list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > > +	}
> > 
> > You can't do this list_empty(&node->private_list) check safely
> > externally to the list_lru code - only time that entry can be
> > checked safely is under the LRU list locks. This is the reason that
> > list_lru_add/list_lru_del return a boolean to indicate is the object
> > was added/removed from the list - they do this list_empty() check
> > internally. i.e. the correct, safe way to do conditionally update
> > state iff the object was added to the LRU is:
> > 
> > 	if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> > 		if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> > 			node->private_data = mapping;
> > 	}
> > 
> > > +	radix_tree_replace_slot(slot, page);
> > > +	mapping->nrpages++;
> > > +	if (node) {
> > > +		node->count++;
> > > +		/* Installed page, can't be shadow-only anymore */
> > > +		if (!list_empty(&node->private_list))
> > > +			list_lru_del(&workingset_shadow_nodes,
> > > +				     &node->private_list);
> > > +	}
> > 
> > Same issue here:
> > 
> > 	if (node) {
> > 		node->count++;
> > 		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > 	}
> 
> All modifications to node->private_list happen under
> mapping->tree_lock, and modifications of a neighboring link should not
> affect the outcome of the list_empty(), so I don't think the lru lock
> is necessary.
> 
> It would be cleaner to take it of course, but that would mean adding
> an unconditional NUMAnode-wide lock to every page cache population.
> 
> > >  static int __add_to_page_cache_locked(struct page *page,
> > > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > > index 72f9decb0104..47a9faf4070b 100644
> > > --- a/mm/list_lru.c
> > > +++ b/mm/list_lru.c
> > > @@ -88,10 +88,18 @@ restart:
> > >  		ret = isolate(item, &nlru->lock, cb_arg);
> > >  		switch (ret) {
> > >  		case LRU_REMOVED:
> > > +		case LRU_REMOVED_RETRY:
> > >  			if (--nlru->nr_items == 0)
> > >  				node_clear(nid, lru->active_nodes);
> > >  			WARN_ON_ONCE(nlru->nr_items < 0);
> > >  			isolated++;
> > > +			/*
> > > +			 * If the lru lock has been dropped, our list
> > > +			 * traversal is now invalid and so we have to
> > > +			 * restart from scratch.
> > > +			 */
> > > +			if (ret == LRU_REMOVED_RETRY)
> > > +				goto restart;
> > >  			break;
> > >  		case LRU_ROTATE:
> > >  			list_move_tail(item, &nlru->list);
> > 
> > I think that we need to assert that the list lru lock is correctly
> > held here on return with LRU_REMOVED_RETRY. i.e.
> > 
> > 		case LRU_REMOVED_RETRY:
> > 			assert_spin_locked(&nlru->lock);
> > 		case LRU_REMOVED:
> 
> Ah, good idea.  How about adding it to LRU_RETRY as well?
> 
> > > +/*
> > > + * Page cache radix tree nodes containing only shadow entries can grow
> > > + * excessively on certain workloads.  That's why they are tracked on
> > > + * per-(NUMA)node lists and pushed back by a shrinker, but with a
> > > + * slightly higher threshold than regular shrinkers so we don't
> > > + * discard the entries too eagerly - after all, during light memory
> > > + * pressure is exactly when we need them.
> > > + */
> > > +
> > > +struct list_lru workingset_shadow_nodes;
> > > +
> > > +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> > > +					struct shrink_control *sc)
> > > +{
> > > +	return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> > > +}
> > > +
> > > +static enum lru_status shadow_lru_isolate(struct list_head *item,
> > > +					  spinlock_t *lru_lock,
> > > +					  void *arg)
> > > +{
> > > +	unsigned long *nr_reclaimed = arg;
> > > +	struct address_space *mapping;
> > > +	struct radix_tree_node *node;
> > > +	unsigned int i;
> > > +	int ret;
> > > +
> > > +	/*
> > > +	 * Page cache insertions and deletions synchroneously maintain
> > > +	 * the shadow node LRU under the mapping->tree_lock and the
> > > +	 * lru_lock.  Because the page cache tree is emptied before
> > > +	 * the inode can be destroyed, holding the lru_lock pins any
> > > +	 * address_space that has radix tree nodes on the LRU.
> > > +	 *
> > > +	 * We can then safely transition to the mapping->tree_lock to
> > > +	 * pin only the address_space of the particular node we want
> > > +	 * to reclaim, take the node off-LRU, and drop the lru_lock.
> > > +	 */
> > > +
> > > +	node = container_of(item, struct radix_tree_node, private_list);
> > > +	mapping = node->private_data;
> > > +
> > > +	/* Coming from the list, invert the lock order */
> > > +	if (!spin_trylock_irq(&mapping->tree_lock)) {
> > > +		spin_unlock(lru_lock);
> > > +		ret = LRU_RETRY;
> > > +		goto out;
> > > +	}
> > > +
> > > +	list_del_init(item);
> > > +	spin_unlock(lru_lock);
> > > +
> > > +	/*
> > > +	 * The nodes should only contain one or more shadow entries,
> > > +	 * no pages, so we expect to be able to remove them all and
> > > +	 * delete and free the empty node afterwards.
> > > +	 */
> > > +
> > > +	BUG_ON(!node->count);
> > > +	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> > > +
> > > +	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> > > +		if (node->slots[i]) {
> > > +			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
> > > +			node->slots[i] = NULL;
> > > +			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
> > > +			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> > > +			BUG_ON(!mapping->nrshadows);
> > > +			mapping->nrshadows--;
> > > +		}
> > > +	}
> > > +	BUG_ON(node->count);
> > > +	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
> > > +	if (!__radix_tree_delete_node(&mapping->page_tree, node))
> > > +		BUG();
> > > +	(*nr_reclaimed)++;
> > > +
> > > +	spin_unlock_irq(&mapping->tree_lock);
> > > +	ret = LRU_REMOVED_RETRY;
> > > +out:
> > > +	cond_resched();
> > > +	spin_lock(lru_lock);
> > > +	return ret;
> > > +}
> > > +
> > > +static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
> > > +				       struct shrink_control *sc)
> > > +{
> > > +	unsigned long nr_reclaimed = 0;
> > > +
> > > +	list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
> > > +			   shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
> > > +
> > > +	return nr_reclaimed;
> > 
> > list_lru_walk_node() returns the number of reclaimed objects (i.e.
> > the number of objects that returned LRU_REMOVED/LRU_REMOVED_RETRY
> > from the ->isolate callback). You don't need to count nr_reclaimed
> > yourself.
> 
> Good catch, this is a leftover from before LRU_REMOVED_RETRY.  Removed
> the ad-hoc counter altogether.
> 
> > > +static struct shrinker workingset_shadow_shrinker = {
> > > +	.count_objects = count_shadow_nodes,
> > > +	.scan_objects = scan_shadow_nodes,
> > > +	.seeks = DEFAULT_SEEKS * 4,
> > > +	.flags = SHRINKER_NUMA_AWARE,
> > > +};
> > 
> > Can you add a comment explaining how you calculated the .seeks
> > value? It's important to document the weighings/importance
> > we give to slab reclaim so we can determine if it's actually
> > acheiving the desired balance under different loads...
> 
> This is not an exact science, to say the least.
> 
> The shadow entries are mostly self-regulated, so I don't want the
> shrinker to interfere while the machine is just regularly trimming
> caches during normal operation.
> 
> It should only kick in when either a) reclaim is picking up and the
> scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> swapping etc. or b) the number of objects compared to LRU pages
> becomes excessive.
> 
> I think that is what most shrinkers with an elevated seeks value want,
> but this translates very awkwardly (and not completely) to the current
> cost model, and we should probably rework that interface.
> 
> "Seeks" currently encodes 3 ratios:
> 
>   1. the cost of creating an object vs. a page
> 
>   2. the expected number of objects vs. pages
> 
>   3. the cost of reclaiming an object vs. a page
> 
> but they are not necessarily correlated.  How I would like to
> configure the shadow shrinker instead is:
> 
>   o scan objects when reclaim efficiency is down to 75%, because they
>     are more valuable than use-once cache but less than workingset
> 

Sorry if it is another topic.
Just out of curiosity. Why do you set it to 75%?
The why I ask is it's really needed thing for volatile range, which
want to discard volatile pages more than use-once but less than
working set.

In recent version, I just used (priority < DEF_PRIORITY - 2) to catch up
reclaim efficiency because we already have used it for noticing reclaim
trouble several places but not hard tested so I'm not sure it's better
than #reclaimed_pages/#scanned_page.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-23  5:20       ` Minchan Kim
@ 2014-01-23 19:22         ` Johannes Weiner
  2014-01-27  2:31           ` Minchan Kim
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2014-01-23 19:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Thu, Jan 23, 2014 at 02:20:14PM +0900, Minchan Kim wrote:
> On Wed, Jan 22, 2014 at 01:42:17PM -0500, Johannes Weiner wrote:
> > On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
> > > >  		 * same time and miss a shadow entry.
> > > >  		 */
> > > >  		smp_wmb();
> > > > -	} else
> > > > -		radix_tree_delete(&mapping->page_tree, page->index);
> > > > +	}
> > > >  	mapping->nrpages--;
> > > > +
> > > > +	if (!node) {
> > > > +		/* Clear direct pointer tags in root node */
> > > > +		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > > > +		radix_tree_replace_slot(slot, shadow);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	/* Clear tree tags for the removed page */
> > > > +	index = page->index;
> > > > +	offset = index & RADIX_TREE_MAP_MASK;
> > > > +	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > > > +		if (test_bit(offset, node->tags[tag]))
> > > > +			radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > > > +	}
> > > > +
> > > > +	/* Delete page, swap shadow entry */
> > > > +	radix_tree_replace_slot(slot, shadow);
> > > > +	node->count--;
> > > > +	if (shadow)
> > > > +		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> > > 
> > > Nitpick2:
> > > It should be a function of workingset.c rather than exposing
> > > RADIX_TREE_COUNT_SHIFT?
> > > 
> > > IMO, It would be better to provide some accessor functions here, too.
> > 
> > The shadow maintenance and node lifetime management are pretty
> > interwoven to share branches and reduce instructions as these are
> > common paths.  I don't see how this could result in cleaner code while
> > keeping these advantages.
> 
> What I want is just put a inline accessor in somewhere like workingset.h
> 
> static inline void inc_shadow_entry(struct radix_tree_node *node)
> {
>     node->count += 1U << RADIX_TREE_COUNT_MASK;
> }
> 
> So, anyone don't need to know that node->count upper bits present
> count of shadow entry.

Okay, but then you have to cover lower bits as well, without explicit
higher bit access it would be confusing to use the mask for lower
bits.

Something like the following?

---

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 102e37bc82d5..b33171a3673c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -266,6 +266,36 @@ bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
 extern struct list_lru workingset_shadow_nodes;
 
+static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
+{
+	return node->count & RADIX_TREE_COUNT_MASK;
+}
+
+static inline void workingset_node_pages_inc(struct radix_tree_node *node)
+{
+	return node->count++;
+}
+
+static inline void workingset_node_pages_dec(struct radix_tree_node *node)
+{
+	return node->count--;
+}
+
+static inline unsigned int workingset_node_shadows(struct radix_tree_node *node)
+{
+	return node->count >> RADIX_TREE_COUNT_SHIFT;
+}
+
+static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
+{
+	node->count += 1U << RADIX_TREE_COUNT_SHIFT;
+}
+
+static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
+{
+	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+}
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index a63e89484d18..ac7f62db9ccd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -149,9 +149,9 @@ static void page_cache_tree_delete(struct address_space *mapping,
 
 	/* Delete page, swap shadow entry */
 	radix_tree_replace_slot(slot, shadow);
-	node->count--;
+	workingset_node_pages_dec(node);
 	if (shadow)
-		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
+		workingset_node_shadow_inc(node);
 	else
 		if (__radix_tree_delete_node(&mapping->page_tree, node))
 			return;
@@ -163,7 +163,7 @@ static void page_cache_tree_delete(struct address_space *mapping,
 	 * list_empty() test is safe as node->private_list is
 	 * protected by mapping->tree_lock.
 	 */
-	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
+	if (!workingset_node_pages(node) &&
 	    list_empty(&node->private_list)) {
 		node->private_data = mapping;
 		list_lru_add(&workingset_shadow_nodes, &node->private_list);
@@ -531,12 +531,12 @@ static int page_cache_tree_insert(struct address_space *mapping,
 			*shadowp = p;
 		mapping->nrshadows--;
 		if (node)
-			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+			workingset_node_shadows_dec(node);
 	}
 	radix_tree_replace_slot(slot, page);
 	mapping->nrpages++;
 	if (node) {
-		node->count++;
+		workingset_node_pages_inc(node);
 		/*
 		 * Don't track node that contains actual pages.
 		 *
diff --git a/mm/truncate.c b/mm/truncate.c
index c7a0d02a03eb..9cb54b7525dc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -46,7 +46,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	mapping->nrshadows--;
 	if (!node)
 		goto unlock;
-	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+	workingset_node_shadows_dec(node);
 	/*
 	 * Don't track node without shadow entries.
 	 *
@@ -54,7 +54,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * The list_empty() test is safe as node->private_list is
 	 * protected by mapping->tree_lock.
 	 */
-	if (!(node->count >> RADIX_TREE_COUNT_SHIFT) &&
+	if (!workingset_node_shadows(node) &&
 	    !list_empty(&node->private_list))
 		list_lru_del(&workingset_shadow_nodes, &node->private_list);
 	__radix_tree_delete_node(&mapping->page_tree, node);

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2014-01-23 19:22         ` Johannes Weiner
@ 2014-01-27  2:31           ` Minchan Kim
  0 siblings, 0 replies; 58+ messages in thread
From: Minchan Kim @ 2014-01-27  2:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Thu, Jan 23, 2014 at 02:22:12PM -0500, Johannes Weiner wrote:
> On Thu, Jan 23, 2014 at 02:20:14PM +0900, Minchan Kim wrote:
> > On Wed, Jan 22, 2014 at 01:42:17PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> > > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
> > > > >  		 * same time and miss a shadow entry.
> > > > >  		 */
> > > > >  		smp_wmb();
> > > > > -	} else
> > > > > -		radix_tree_delete(&mapping->page_tree, page->index);
> > > > > +	}
> > > > >  	mapping->nrpages--;
> > > > > +
> > > > > +	if (!node) {
> > > > > +		/* Clear direct pointer tags in root node */
> > > > > +		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > > > > +		radix_tree_replace_slot(slot, shadow);
> > > > > +		return;
> > > > > +	}
> > > > > +
> > > > > +	/* Clear tree tags for the removed page */
> > > > > +	index = page->index;
> > > > > +	offset = index & RADIX_TREE_MAP_MASK;
> > > > > +	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > > > > +		if (test_bit(offset, node->tags[tag]))
> > > > > +			radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > > > > +	}
> > > > > +
> > > > > +	/* Delete page, swap shadow entry */
> > > > > +	radix_tree_replace_slot(slot, shadow);
> > > > > +	node->count--;
> > > > > +	if (shadow)
> > > > > +		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> > > > 
> > > > Nitpick2:
> > > > It should be a function of workingset.c rather than exposing
> > > > RADIX_TREE_COUNT_SHIFT?
> > > > 
> > > > IMO, It would be better to provide some accessor functions here, too.
> > > 
> > > The shadow maintenance and node lifetime management are pretty
> > > interwoven to share branches and reduce instructions as these are
> > > common paths.  I don't see how this could result in cleaner code while
> > > keeping these advantages.
> > 
> > What I want is just put a inline accessor in somewhere like workingset.h
> > 
> > static inline void inc_shadow_entry(struct radix_tree_node *node)
> > {
> >     node->count += 1U << RADIX_TREE_COUNT_MASK;
> > }
> > 
> > So, anyone don't need to know that node->count upper bits present
> > count of shadow entry.
> 
> Okay, but then you have to cover lower bits as well, without explicit
> higher bit access it would be confusing to use the mask for lower
> bits.
> 
> Something like the following?

LGTM.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees
  2014-01-10 18:10 ` [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
  2014-01-10 19:39   ` Rik van Riel
  2014-01-13  2:01   ` Minchan Kim
@ 2014-02-12 14:00   ` Mel Gorman
  2014-03-12  1:15     ` Johannes Weiner
  2 siblings, 1 reply; 58+ messages in thread
From: Mel Gorman @ 2014-02-12 14:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

On Fri, Jan 10, 2014 at 01:10:39PM -0500, Johannes Weiner wrote:
> shmem mappings already contain exceptional entries where swap slot
> information is remembered.
> 
> To be able to store eviction information for regular page cache,
> prepare every site dealing with the radix trees directly to handle
> entries other than pages.
> 
> The common lookup functions will filter out non-page entries and
> return NULL for page cache holes, just as before.  But provide a raw
> version of the API which returns non-page entries as well, and switch
> shmem over to use it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I know I'm really late in the game here and this isn't even a proper
review. It's just a glancing through my inbox trying to reduce the mess
before going offline.

> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 6aad98cb343f..c88316587900 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -474,7 +474,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  		rcu_read_lock();
>  		page = radix_tree_lookup(&mapping->page_tree, pg_index);
>  		rcu_read_unlock();
> -		if (page) {
> +		if (page && !radix_tree_exceptional_entry(page)) {
>  			misses++;
>  			if (misses > 4)
>  				break;

For unrelated reasons this raises a red flag for me. The block of code
looks very similar to find_or_create_page except it uses different locking
for reasons that are not very obvious at first glance.  It's also a little
similar to normal readahead but different enough that they cannot easily
share code. The similarities are close enough to wonder why btrfs had
to be special here and it now looks worse when they have to deal with
exceptional entries.

It's something for the btrfs people I guess.

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8b6e55ee8855..c09ef3ae55bc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -906,6 +906,14 @@ extern void show_free_areas(unsigned int flags);
>  extern bool skip_free_areas_node(unsigned int flags, int nid);
>  
>  int shmem_zero_setup(struct vm_area_struct *);
> +#ifdef CONFIG_SHMEM
> +bool shmem_mapping(struct address_space *mapping);
> +#else
> +static inline bool shmem_mapping(struct address_space *mapping)
> +{
> +	return false;
> +}
> +#endif
>  
>  extern int can_do_mlock(void);
>  extern int user_shm_lock(size_t, struct user_struct *);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index c73130c607c4..b6854b7c58cb 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -248,12 +248,15 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
>  pgoff_t page_cache_prev_hole(struct address_space *mapping,
>  			     pgoff_t index, unsigned long max_scan);
>  
> -extern struct page * find_get_page(struct address_space *mapping,
> -				pgoff_t index);
> -extern struct page * find_lock_page(struct address_space *mapping,
> -				pgoff_t index);
> -extern struct page * find_or_create_page(struct address_space *mapping,
> -				pgoff_t index, gfp_t gfp_mask);
> +struct page *__find_get_page(struct address_space *mapping, pgoff_t offset);
> +struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
> +struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset);
> +struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
> +struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
> +				 gfp_t gfp_mask);
> +unsigned __find_get_pages(struct address_space *mapping, pgoff_t start,
> +			  unsigned int nr_pages, struct page **pages,
> +			  pgoff_t *indices);

When I see foo() and __foo() in a header, my first assumption is that
__foo() is a version of foo() that assumes the necesssary locks are
already held. If I see it within a C file, my second assumption will be
that it's an internal helper. Here __find_get_page is not returning just
a page. It's returning a page or a shadow entry if they exist and that may
cause some confusion. Consider renaming __find_get_page to find_get_entry()
to give a hint to the reader they should be looking out for either pages
or shadow entries. It still makes sense for find_lock_entry -- if it's a
page, then it'll return locked etc

>  unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
>  			unsigned int nr_pages, struct page **pages);
>  unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
> diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
> index e4dbfab37729..3c6b8b1e945b 100644
> --- a/include/linux/pagevec.h
> +++ b/include/linux/pagevec.h
> @@ -22,6 +22,9 @@ struct pagevec {
>  
>  void __pagevec_release(struct pagevec *pvec);
>  void __pagevec_lru_add(struct pagevec *pvec);
> +unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
> +			  pgoff_t start, unsigned nr_pages, pgoff_t *indices);
> +void pagevec_remove_exceptionals(struct pagevec *pvec);
>  unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
>  		pgoff_t start, unsigned nr_pages);
>  unsigned pagevec_lookup_tag(struct pagevec *pvec,
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 30aa0dc60d75..deb49609cd36 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -49,6 +49,7 @@ extern struct file *shmem_file_setup(const char *name,
>  					loff_t size, unsigned long flags);
>  extern int shmem_zero_setup(struct vm_area_struct *);
>  extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
> +extern bool shmem_mapping(struct address_space *mapping);
>  extern void shmem_unlock_mapping(struct address_space *mapping);
>  extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>  					pgoff_t index, gfp_t gfp_mask);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0746b7a4658f..23eb3be27205 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -446,6 +446,29 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>  
> +static int page_cache_tree_insert(struct address_space *mapping,
> +				  struct page *page)
> +{

Nothing here on the locking rules for the function although the existing
docs here are poor. Everyone knows you need the mapping lock and page lock
here, right?

> +	void **slot;
> +	int error;
> +
> +	slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
> +	if (slot) {
> +		void *p;
> +
> +		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
> +		if (!radix_tree_exceptional_entry(p))
> +			return -EEXIST;
> +		radix_tree_replace_slot(slot, page);
> +		mapping->nrpages++;
> +		return 0;
> +	}
> +	error = radix_tree_insert(&mapping->page_tree, page->index, page);
> +	if (!error)
> +		mapping->nrpages++;
> +	return error;
> +}
> +
>  /**
>   * add_to_page_cache_locked - add a locked page to the pagecache
>   * @page:	page to add
> @@ -480,11 +503,10 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  	page->index = offset;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	error = radix_tree_insert(&mapping->page_tree, offset, page);
> +	error = page_cache_tree_insert(mapping, page);
>  	radix_tree_preload_end();
>  	if (unlikely(error))
>  		goto err_insert;
> -	mapping->nrpages++;
>  	__inc_zone_page_state(page, NR_FILE_PAGES);
>  	spin_unlock_irq(&mapping->tree_lock);
>  	trace_mm_filemap_add_to_page_cache(page);
> @@ -712,7 +734,10 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
>  	unsigned long i;
>  
>  	for (i = 0; i < max_scan; i++) {
> -		if (!radix_tree_lookup(&mapping->page_tree, index))
> +		struct page *page;
> +
> +		page = radix_tree_lookup(&mapping->page_tree, index);
> +		if (!page || radix_tree_exceptional_entry(page))
>  			break;
>  		index++;
>  		if (index == 0)
> @@ -750,7 +775,10 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
>  	unsigned long i;
>  
>  	for (i = 0; i < max_scan; i++) {
> -		if (!radix_tree_lookup(&mapping->page_tree, index))
> +		struct page *page;
> +
> +		page = radix_tree_lookup(&mapping->page_tree, index);
> +		if (!page || radix_tree_exceptional_entry(page))
>  			break;
>  		index--;
>  		if (index == ULONG_MAX)
> @@ -762,14 +790,19 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
>  EXPORT_SYMBOL(page_cache_prev_hole);
>  
>  /**
> - * find_get_page - find and get a page reference
> + * __find_get_page - find and get a page reference

This comment will be out of date when it could be returning shadow
entries

>   * @mapping: the address_space to search
>   * @offset: the page index
>   *
> - * Is there a pagecache struct page at the given (mapping, offset) tuple?
> - * If yes, increment its refcount and return it; if no, return NULL.
> + * Looks up the page cache slot at @mapping & @offset.  If there is a
> + * page cache page, it is returned with an increased refcount.
> + *
> + * If the slot holds a shadow entry of a previously evicted page, it
> + * is returned.
> + *

That's not true yet but who cares. Anyone doing a git blame of the history
will need to search around the area anyway.

> + * Otherwise, %NULL is returned.
>   */
> -struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
> +struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
>  {
>  	void **pagep;
>  	struct page *page;
> @@ -810,24 +843,49 @@ out:
>  
>  	return page;
>  }
> +EXPORT_SYMBOL(__find_get_page);
> +
> +/**
> + * find_get_page - find and get a page reference
> + * @mapping: the address_space to search
> + * @offset: the page index
> + *
> + * Looks up the page cache slot at @mapping & @offset.  If there is a
> + * page cache page, it is returned with an increased refcount.
> + *
> + * Otherwise, %NULL is returned.
> + */
> +struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
> +{
> +	struct page *page = __find_get_page(mapping, offset);
> +
> +	if (radix_tree_exceptional_entry(page))
> +		page = NULL;
> +	return page;
> +}
>  EXPORT_SYMBOL(find_get_page);
>  
>  /**
> - * find_lock_page - locate, pin and lock a pagecache page
> + * __find_lock_page - locate, pin and lock a pagecache page
>   * @mapping: the address_space to search
>   * @offset: the page index
>   *
> - * Locates the desired pagecache page, locks it, increments its reference
> - * count and returns its address.
> + * Looks up the page cache slot at @mapping & @offset.  If there is a
> + * page cache page, it is returned locked and with an increased
> + * refcount.
> + *
> + * If the slot holds a shadow entry of a previously evicted page, it
> + * is returned.
> + *
> + * Otherwise, %NULL is returned.
>   *
> - * Returns zero if the page was not present. find_lock_page() may sleep.
> + * __find_lock_page() may sleep.
>   */
> -struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
> +struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
>  {
>  	struct page *page;
> -
>  repeat:

Unnecessary whitespace change.

> -	page = find_get_page(mapping, offset);
> +	page = __find_get_page(mapping, offset);
>  	if (page && !radix_tree_exception(page)) {
>  		lock_page(page);
>  		/* Has the page been truncated? */

Just as an example, if this was find_get_entry() it would be a lot
clearer that the return value may or may not be a page.

> @@ -840,6 +898,29 @@ repeat:
>  	}
>  	return page;
>  }
> +EXPORT_SYMBOL(__find_lock_page);
> +
> +/**
> + * find_lock_page - locate, pin and lock a pagecache page
> + * @mapping: the address_space to search
> + * @offset: the page index
> + *
> + * Looks up the page cache slot at @mapping & @offset.  If there is a
> + * page cache page, it is returned locked and with an increased
> + * refcount.
> + *
> + * Otherwise, %NULL is returned.
> + *
> + * find_lock_page() may sleep.
> + */
> +struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
> +{
> +	struct page *page = __find_lock_page(mapping, offset);
> +
> +	if (radix_tree_exceptional_entry(page))
> +		page = NULL;
> +	return page;
> +}
>  EXPORT_SYMBOL(find_lock_page);
>  
>  /**
> @@ -848,16 +929,18 @@ EXPORT_SYMBOL(find_lock_page);
>   * @index: the page's index into the mapping
>   * @gfp_mask: page allocation mode
>   *
> - * Locates a page in the pagecache.  If the page is not present, a new page
> - * is allocated using @gfp_mask and is added to the pagecache and to the VM's
> - * LRU list.  The returned page is locked and has its reference count
> - * incremented.
> + * Looks up the page cache slot at @mapping & @offset.  If there is a
> + * page cache page, it is returned locked and with an increased
> + * refcount.
>   *
> - * find_or_create_page() may sleep, even if @gfp_flags specifies an atomic
> - * allocation!
> + * If the page is not present, a new page is allocated using @gfp_mask
> + * and added to the page cache and the VM's LRU list.  The page is
> + * returned locked and with an increased refcount.
>   *
> - * find_or_create_page() returns the desired page's address, or zero on
> - * memory exhaustion.
> + * On memory exhaustion, %NULL is returned.
> + *
> + * find_or_create_page() may sleep, even if @gfp_flags specifies an
> + * atomic allocation!
>   */
>  struct page *find_or_create_page(struct address_space *mapping,
>  		pgoff_t index, gfp_t gfp_mask)
> @@ -890,6 +973,73 @@ repeat:
>  EXPORT_SYMBOL(find_or_create_page);
>  
>  /**
> + * __find_get_pages - gang pagecache lookup
> + * @mapping:	The address_space to search
> + * @start:	The starting page index
> + * @nr_pages:	The maximum number of pages
> + * @pages:	Where the resulting pages are placed
> + *
> + * __find_get_pages() will search for and return a group of up to
> + * @nr_pages pages in the mapping.  The pages are placed at @pages.
> + * __find_get_pages() takes a reference against the returned pages.
> + *
> + * The search returns a group of mapping-contiguous pages with ascending
> + * indexes.  There may be holes in the indices due to not-present pages.
> + *
> + * Any shadow entries of evicted pages are included in the returned
> + * array.
> + *
> + * __find_get_pages() returns the number of pages and shadow entries
> + * which were found.
> + */
> +unsigned __find_get_pages(struct address_space *mapping,
> +			  pgoff_t start, unsigned int nr_pages,
> +			  struct page **pages, pgoff_t *indices)
> +{
> +	void **slot;
> +	unsigned int ret = 0;
> +	struct radix_tree_iter iter;
> +
> +	if (!nr_pages)
> +		return 0;
> +
> +	rcu_read_lock();
> +restart:
> +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> +		struct page *page;
> +repeat:
> +		page = radix_tree_deref_slot(slot);
> +		if (unlikely(!page))
> +			continue;
> +		if (radix_tree_exception(page)) {
> +			if (radix_tree_deref_retry(page))
> +				goto restart;
> +			/*
> +			 * Otherwise, we must be storing a swap entry
> +			 * here as an exceptional entry: so return it
> +			 * without attempting to raise page count.
> +			 */
> +			goto export;
> +		}

There is a non-obvious API hazard here that should be called out in
the function description. shmem was the previous gang lookup user and
it knew that there would be swap entries and removed them if necessary
with shmem_deswap_pagevec. It was internal to shmem.c so it could deal
with the complexity. Now that you are making it a generic function it
should clearly explain that exceptional entries can be returned and
pagevec_remove_exceptionals should be used to remove them if necessary
or else split the helper in two to return just pages or both pages and
exceptional entries.

> +		if (!page_cache_get_speculative(page))
> +			goto repeat;
> +
> +		/* Has the page moved? */
> +		if (unlikely(page != *slot)) {
> +			page_cache_release(page);
> +			goto repeat;
> +		}
> +export:
> +		indices[ret] = iter.index;
> +		pages[ret] = page;
> +		if (++ret == nr_pages)
> +			break;
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +/**
>   * find_get_pages - gang pagecache lookup
>   * @mapping:	The address_space to search
>   * @start:	The starting page index
> diff --git a/mm/mincore.c b/mm/mincore.c
> index da2be56a7b8f..ad411ec86a55 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -70,13 +70,21 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
>  	 * any other file mapping (ie. marked !present and faulted in with
>  	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
>  	 */
> -	page = find_get_page(mapping, pgoff);
>  #ifdef CONFIG_SWAP
> -	/* shmem/tmpfs may return swap: account for swapcache page too. */
> -	if (radix_tree_exceptional_entry(page)) {
> -		swp_entry_t swap = radix_to_swp_entry(page);
> -		page = find_get_page(swap_address_space(swap), swap.val);
> -	}
> +	if (shmem_mapping(mapping)) {
> +		page = __find_get_page(mapping, pgoff);
> +		/*
> +		 * shmem/tmpfs may return swap: account for swapcache
> +		 * page too.
> +		 */
> +		if (radix_tree_exceptional_entry(page)) {
> +			swp_entry_t swp = radix_to_swp_entry(page);
> +			page = find_get_page(swap_address_space(swp), swp.val);
> +		}
> +	} else
> +		page = find_get_page(mapping, pgoff);
> +#else
> +	page = find_get_page(mapping, pgoff);
>  #endif
>  	if (page) {
>  		present = PageUptodate(page);
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9eeeeda4ac0e..912c00358112 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -179,7 +179,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  		rcu_read_lock();
>  		page = radix_tree_lookup(&mapping->page_tree, page_offset);
>  		rcu_read_unlock();
> -		if (page)
> +		if (page && !radix_tree_exceptional_entry(page))
>  			continue;
>  
>  		page = page_cache_alloc_readahead(mapping);

Maybe just think hunk can be split out and shared with btrfs to avoid it
dealing with exceptional entries although I've no good suggestions on what
you'd call it.

> diff --git a/mm/shmem.c b/mm/shmem.c
> index 7c67249d6f28..1f4b65f7b831 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -329,56 +329,6 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
>  }
>  
>  /*
> - * Like find_get_pages, but collecting swap entries as well as pages.
> - */
> -static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
> -					pgoff_t start, unsigned int nr_pages,
> -					struct page **pages, pgoff_t *indices)
> -{
> -	void **slot;
> -	unsigned int ret = 0;
> -	struct radix_tree_iter iter;
> -
> -	if (!nr_pages)
> -		return 0;
> -
> -	rcu_read_lock();
> -restart:
> -	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> -		struct page *page;
> -repeat:
> -		page = radix_tree_deref_slot(slot);
> -		if (unlikely(!page))
> -			continue;
> -		if (radix_tree_exception(page)) {
> -			if (radix_tree_deref_retry(page))
> -				goto restart;
> -			/*
> -			 * Otherwise, we must be storing a swap entry
> -			 * here as an exceptional entry: so return it
> -			 * without attempting to raise page count.
> -			 */
> -			goto export;
> -		}
> -		if (!page_cache_get_speculative(page))
> -			goto repeat;
> -
> -		/* Has the page moved? */
> -		if (unlikely(page != *slot)) {
> -			page_cache_release(page);
> -			goto repeat;
> -		}
> -export:
> -		indices[ret] = iter.index;
> -		pages[ret] = page;
> -		if (++ret == nr_pages)
> -			break;
> -	}
> -	rcu_read_unlock();
> -	return ret;
> -}
> -
> -/*
>   * Remove swap entry from radix tree, free the swap and its page cache.
>   */
>  static int shmem_free_swap(struct address_space *mapping,
> @@ -396,21 +346,6 @@ static int shmem_free_swap(struct address_space *mapping,
>  }
>  
>  /*
> - * Pagevec may contain swap entries, so shuffle up pages before releasing.
> - */
> -static void shmem_deswap_pagevec(struct pagevec *pvec)
> -{
> -	int i, j;
> -
> -	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
> -		struct page *page = pvec->pages[i];
> -		if (!radix_tree_exceptional_entry(page))
> -			pvec->pages[j++] = page;
> -	}
> -	pvec->nr = j;
> -}
> -
> -/*
>   * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
>   */
>  void shmem_unlock_mapping(struct address_space *mapping)
> @@ -428,12 +363,12 @@ void shmem_unlock_mapping(struct address_space *mapping)
>  		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
>  		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
>  		 */
> -		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
> +		pvec.nr = __find_get_pages(mapping, index,
>  					PAGEVEC_SIZE, pvec.pages, indices);
>  		if (!pvec.nr)
>  			break;
>  		index = indices[pvec.nr - 1] + 1;
> -		shmem_deswap_pagevec(&pvec);
> +		pagevec_remove_exceptionals(&pvec);
>  		check_move_unevictable_pages(pvec.pages, pvec.nr);
>  		pagevec_release(&pvec);
>  		cond_resched();
> @@ -465,9 +400,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  	pagevec_init(&pvec, 0);
>  	index = start;
>  	while (index < end) {
> -		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
> -				min(end - index, (pgoff_t)PAGEVEC_SIZE),
> -							pvec.pages, indices);
> +		pvec.nr = __find_get_pages(mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> +			pvec.pages, indices);
>  		if (!pvec.nr)
>  			break;
>  		mem_cgroup_uncharge_start();
> @@ -496,7 +431,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  			}
>  			unlock_page(page);
>  		}
> -		shmem_deswap_pagevec(&pvec);
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		cond_resched();
> @@ -534,9 +469,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  	index = start;
>  	for ( ; ; ) {
>  		cond_resched();
> -		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
> +
> +		pvec.nr = __find_get_pages(mapping, index,
>  				min(end - index, (pgoff_t)PAGEVEC_SIZE),
> -							pvec.pages, indices);
> +				pvec.pages, indices);
>  		if (!pvec.nr) {
>  			if (index == start || unfalloc)
>  				break;
> @@ -544,7 +480,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  			continue;
>  		}
>  		if ((index == start || unfalloc) && indices[0] >= end) {
> -			shmem_deswap_pagevec(&pvec);
> +			pagevec_remove_exceptionals(&pvec);
>  			pagevec_release(&pvec);
>  			break;
>  		}
> @@ -573,7 +509,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>  			}
>  			unlock_page(page);
>  		}
> -		shmem_deswap_pagevec(&pvec);
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		index++;
> @@ -1081,7 +1017,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  		return -EFBIG;
>  repeat:
>  	swap.val = 0;
> -	page = find_lock_page(mapping, index);
> +	page = __find_lock_page(mapping, index);
>  	if (radix_tree_exceptional_entry(page)) {
>  		swap = radix_to_swp_entry(page);
>  		page = NULL;
> @@ -1418,6 +1354,11 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
>  	return inode;
>  }
>  
> +bool shmem_mapping(struct address_space *mapping)
> +{
> +	return mapping->backing_dev_info == &shmem_backing_dev_info;
> +}
> +
>  #ifdef CONFIG_TMPFS
>  static const struct inode_operations shmem_symlink_inode_operations;
>  static const struct inode_operations shmem_short_symlink_operations;
> @@ -1730,7 +1671,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
>  	pagevec_init(&pvec, 0);
>  	pvec.nr = 1;		/* start small: we may be there already */
>  	while (!done) {
> -		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
> +		pvec.nr = __find_get_pages(mapping, index,
>  					pvec.nr, pvec.pages, indices);
>  		if (!pvec.nr) {
>  			if (whence == SEEK_DATA)
> @@ -1757,7 +1698,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
>  				break;
>  			}
>  		}
> -		shmem_deswap_pagevec(&pvec);
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		pvec.nr = PAGEVEC_SIZE;
>  		cond_resched();
> diff --git a/mm/swap.c b/mm/swap.c
> index 759c3caf44bd..f624e5b4b724 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -894,6 +894,53 @@ EXPORT_SYMBOL(__pagevec_lru_add);
>  
>  /**
>   * pagevec_lookup - gang pagecache lookup
> + * @pvec:	Where the resulting entries are placed
> + * @mapping:	The address_space to search
> + * @start:	The starting entry index
> + * @nr_pages:	The maximum number of entries
> + *
> + * pagevec_lookup() will search for and return a group of up to
> + * @nr_pages pages and shadow entries in the mapping.  All entries are
> + * placed in @pvec.  pagevec_lookup() takes a reference against actual
> + * pages in @pvec.
> + *
> + * The search returns a group of mapping-contiguous entries with
> + * ascending indexes.  There may be holes in the indices due to
> + * not-present entries.
> + *
> + * pagevec_lookup() returns the number of entries which were found.
> + */
> +unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
> +			  pgoff_t start, unsigned nr_pages, pgoff_t *indices)
> +{
> +	pvec->nr = __find_get_pages(mapping, start, nr_pages,
> +				    pvec->pages, indices);
> +	return pagevec_count(pvec);
> +}
> +
> +/**
> + * pagevec_remove_exceptionals - pagevec exceptionals pruning
> + * @pvec:	The pagevec to prune
> + *
> + * __pagevec_lookup() fills both pages and exceptional radix tree
> + * entries into the pagevec.  This function prunes all exceptionals
> + * from @pvec without leaving holes, so that it can be passed on to
> + * other pagevec operations.
> + */
> +void pagevec_remove_exceptionals(struct pagevec *pvec)
> +{
> +	int i, j;
> +
> +	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
> +		struct page *page = pvec->pages[i];
> +		if (!radix_tree_exceptional_entry(page))
> +			pvec->pages[j++] = page;
> +	}
> +	pvec->nr = j;
> +}
> +
> +/**
> + * pagevec_lookup - gang pagecache lookup
>   * @pvec:	Where the resulting pages are placed
>   * @mapping:	The address_space to search
>   * @start:	The starting page index
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 353b683afd6e..b0f4d4bee8ab 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -22,6 +22,22 @@
>  #include <linux/cleancache.h>
>  #include "internal.h"
>  
> +static void clear_exceptional_entry(struct address_space *mapping,
> +				    pgoff_t index, void *entry)
> +{
> +	/* Handled by shmem itself */
> +	if (shmem_mapping(mapping))
> +		return;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * Regular page slots are stabilized by the page lock even
> +	 * without the tree itself locked.  These unlocked entries
> +	 * need verification under the tree lock.
> +	 */
> +	radix_tree_delete_item(&mapping->page_tree, index, entry);
> +	spin_unlock_irq(&mapping->tree_lock);
> +}
>  
>  /**
>   * do_invalidatepage - invalidate part or all of a page
> @@ -208,6 +224,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  	unsigned int	partial_start;	/* inclusive */
>  	unsigned int	partial_end;	/* exclusive */
>  	struct pagevec	pvec;
> +	pgoff_t		indices[PAGEVEC_SIZE];
>  	pgoff_t		index;
>  	int		i;
>  
> @@ -238,17 +255,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  
>  	pagevec_init(&pvec, 0);
>  	index = start;
> -	while (index < end && pagevec_lookup(&pvec, mapping, index,
> -			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
> +	while (index < end && __pagevec_lookup(&pvec, mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> +			indices)) {
>  		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
>  			/* We rely upon deletion not changing page->index */
> -			index = page->index;
> +			index = indices[i];
>  			if (index >= end)
>  				break;
>  
> +			if (radix_tree_exceptional_entry(page)) {
> +				clear_exceptional_entry(mapping, index, page);
> +				continue;
> +			}
> +
>  			if (!trylock_page(page))
>  				continue;
>  			WARN_ON(page->index != index);
> @@ -259,6 +282,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			truncate_inode_page(mapping, page);
>  			unlock_page(page);
>  		}
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		cond_resched();
> @@ -307,14 +331,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  	index = start;
>  	for ( ; ; ) {
>  		cond_resched();
> -		if (!pagevec_lookup(&pvec, mapping, index,
> -			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
> +		if (!__pagevec_lookup(&pvec, mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> +			indices)) {
>  			if (index == start)
>  				break;
>  			index = start;
>  			continue;
>  		}
> -		if (index == start && pvec.pages[0]->index >= end) {
> +		if (index == start && indices[0] >= end) {
>  			pagevec_release(&pvec);
>  			break;
>  		}
> @@ -323,16 +348,22 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			struct page *page = pvec.pages[i];
>  
>  			/* We rely upon deletion not changing page->index */
> -			index = page->index;
> +			index = indices[i];
>  			if (index >= end)
>  				break;
>  
> +			if (radix_tree_exceptional_entry(page)) {
> +				clear_exceptional_entry(mapping, index, page);
> +				continue;
> +			}
> +
>  			lock_page(page);
>  			WARN_ON(page->index != index);
>  			wait_on_page_writeback(page);
>  			truncate_inode_page(mapping, page);
>  			unlock_page(page);
>  		}
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		index++;
> @@ -375,6 +406,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
>  unsigned long invalidate_mapping_pages(struct address_space *mapping,
>  		pgoff_t start, pgoff_t end)
>  {
> +	pgoff_t indices[PAGEVEC_SIZE];
>  	struct pagevec pvec;
>  	pgoff_t index = start;
>  	unsigned long ret;
> @@ -390,17 +422,23 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>  	 */
>  
>  	pagevec_init(&pvec, 0);
> -	while (index <= end && pagevec_lookup(&pvec, mapping, index,
> -			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> +	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
> +			indices)) {
>  		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
>  			/* We rely upon deletion not changing page->index */
> -			index = page->index;
> +			index = indices[i];
>  			if (index > end)
>  				break;
>  
> +			if (radix_tree_exceptional_entry(page)) {
> +				clear_exceptional_entry(mapping, index, page);
> +				continue;
> +			}
> +
>  			if (!trylock_page(page))
>  				continue;
>  			WARN_ON(page->index != index);
> @@ -414,6 +452,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>  				deactivate_page(page);
>  			count += ret;
>  		}
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		cond_resched();
> @@ -481,6 +520,7 @@ static int do_launder_page(struct address_space *mapping, struct page *page)
>  int invalidate_inode_pages2_range(struct address_space *mapping,
>  				  pgoff_t start, pgoff_t end)
>  {
> +	pgoff_t indices[PAGEVEC_SIZE];
>  	struct pagevec pvec;
>  	pgoff_t index;
>  	int i;
> @@ -491,17 +531,23 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>  	cleancache_invalidate_inode(mapping);
>  	pagevec_init(&pvec, 0);
>  	index = start;
> -	while (index <= end && pagevec_lookup(&pvec, mapping, index,
> -			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> +	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
> +			indices)) {
>  		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
>  			/* We rely upon deletion not changing page->index */
> -			index = page->index;
> +			index = indices[i];
>  			if (index > end)
>  				break;
>  
> +			if (radix_tree_exceptional_entry(page)) {
> +				clear_exceptional_entry(mapping, index, page);
> +				continue;
> +			}
> +
>  			lock_page(page);
>  			WARN_ON(page->index != index);
>  			if (page->mapping != mapping) {
> @@ -539,6 +585,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>  				ret = ret2;
>  			unlock_page(page);
>  		}
> +		pagevec_remove_exceptionals(&pvec);
>  		pagevec_release(&pvec);
>  		mem_cgroup_uncharge_end();
>  		cond_resched();

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees
  2014-02-12 14:00   ` Mel Gorman
@ 2014-03-12  1:15     ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2014-03-12  1:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Bob Liu,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Hugh Dickins,
	Jan Kara, KOSAKI Motohiro, Luigi Semenzato, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

Hello Mel,

On Wed, Feb 12, 2014 at 02:00:52PM +0000, Mel Gorman wrote:
> On Fri, Jan 10, 2014 at 01:10:39PM -0500, Johannes Weiner wrote:
> > @@ -248,12 +248,15 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
> >  pgoff_t page_cache_prev_hole(struct address_space *mapping,
> >  			     pgoff_t index, unsigned long max_scan);
> >  
> > -extern struct page * find_get_page(struct address_space *mapping,
> > -				pgoff_t index);
> > -extern struct page * find_lock_page(struct address_space *mapping,
> > -				pgoff_t index);
> > -extern struct page * find_or_create_page(struct address_space *mapping,
> > -				pgoff_t index, gfp_t gfp_mask);
> > +struct page *__find_get_page(struct address_space *mapping, pgoff_t offset);
> > +struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
> > +struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset);
> > +struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
> > +struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
> > +				 gfp_t gfp_mask);
> > +unsigned __find_get_pages(struct address_space *mapping, pgoff_t start,
> > +			  unsigned int nr_pages, struct page **pages,
> > +			  pgoff_t *indices);
> 
> When I see foo() and __foo() in a header, my first assumption is that
> __foo() is a version of foo() that assumes the necesssary locks are
> already held. If I see it within a C file, my second assumption will be
> that it's an internal helper. Here __find_get_page is not returning just
> a page. It's returning a page or a shadow entry if they exist and that may
> cause some confusion. Consider renaming __find_get_page to find_get_entry()
> to give a hint to the reader they should be looking out for either pages
> or shadow entries. It still makes sense for find_lock_entry -- if it's a
> page, then it'll return locked etc

Oh, this is so much better.  I renamed the whole thing to
find_get_entry, find_get_entries, find_lock_entry,
pagevec_lookup_entries() and so forth.

Thanks for the suggestion.

> > @@ -446,6 +446,29 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> >  }
> >  EXPORT_SYMBOL_GPL(replace_page_cache_page);
> >  
> > +static int page_cache_tree_insert(struct address_space *mapping,
> > +				  struct page *page)
> > +{
> 
> Nothing here on the locking rules for the function although the existing
> docs here are poor. Everyone knows you need the mapping lock and page lock
> here, right?

Yes, I would think so.  The function was only split out for
readability and not really as an interface...

> > @@ -762,14 +790,19 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
> >  EXPORT_SYMBOL(page_cache_prev_hole);
> >  
> >  /**
> > - * find_get_page - find and get a page reference
> > + * __find_get_page - find and get a page reference
> 
> This comment will be out of date when it could be returning shadow
> entries

Oops, fixed.  And renamed to find_get_entry().  Thanks.

> >   * @mapping: the address_space to search
> >   * @offset: the page index
> >   *
> > - * Is there a pagecache struct page at the given (mapping, offset) tuple?
> > - * If yes, increment its refcount and return it; if no, return NULL.
> > + * Looks up the page cache slot at @mapping & @offset.  If there is a
> > + * page cache page, it is returned with an increased refcount.
> > + *
> > + * If the slot holds a shadow entry of a previously evicted page, it
> > + * is returned.
> > + *
> 
> That's not true yet but who cares. Anyone doing a git blame of the history
> will need to search around the area anyway.

Technically the documentation is true, just nobody stores shadow
entries yet :)

> > @@ -810,24 +843,49 @@ out:
> >  
> >  	return page;
> >  }
> > +EXPORT_SYMBOL(__find_get_page);
> > +
> > +/**
> > + * find_get_page - find and get a page reference
> > + * @mapping: the address_space to search
> > + * @offset: the page index
> > + *
> > + * Looks up the page cache slot at @mapping & @offset.  If there is a
> > + * page cache page, it is returned with an increased refcount.
> > + *
> > + * Otherwise, %NULL is returned.
> > + */
> > +struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
> > +{
> > +	struct page *page = __find_get_page(mapping, offset);
> > +
> > +	if (radix_tree_exceptional_entry(page))
> > +		page = NULL;
> > +	return page;
> > +}
> >  EXPORT_SYMBOL(find_get_page);
> >  
> >  /**
> > - * find_lock_page - locate, pin and lock a pagecache page
> > + * __find_lock_page - locate, pin and lock a pagecache page
> >   * @mapping: the address_space to search
> >   * @offset: the page index
> >   *
> > - * Locates the desired pagecache page, locks it, increments its reference
> > - * count and returns its address.
> > + * Looks up the page cache slot at @mapping & @offset.  If there is a
> > + * page cache page, it is returned locked and with an increased
> > + * refcount.
> > + *
> > + * If the slot holds a shadow entry of a previously evicted page, it
> > + * is returned.
> > + *
> > + * Otherwise, %NULL is returned.
> >   *
> > - * Returns zero if the page was not present. find_lock_page() may sleep.
> > + * __find_lock_page() may sleep.
> >   */
> > -struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
> > +struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
> >  {
> >  	struct page *page;
> > -
> >  repeat:
> 
> Unnecessary whitespace change.

I reverted that.

> > -	page = find_get_page(mapping, offset);
> > +	page = __find_get_page(mapping, offset);
> >  	if (page && !radix_tree_exception(page)) {
> >  		lock_page(page);
> >  		/* Has the page been truncated? */
> 
> Just as an example, if this was find_get_entry() it would be a lot
> clearer that the return value may or may not be a page.

Fully agreed, this site has been updated.

> > @@ -890,6 +973,73 @@ repeat:
> >  EXPORT_SYMBOL(find_or_create_page);
> >  
> >  /**
> > + * __find_get_pages - gang pagecache lookup
> > + * @mapping:	The address_space to search
> > + * @start:	The starting page index
> > + * @nr_pages:	The maximum number of pages
> > + * @pages:	Where the resulting pages are placed
> > + *
> > + * __find_get_pages() will search for and return a group of up to
> > + * @nr_pages pages in the mapping.  The pages are placed at @pages.
> > + * __find_get_pages() takes a reference against the returned pages.
> > + *
> > + * The search returns a group of mapping-contiguous pages with ascending
> > + * indexes.  There may be holes in the indices due to not-present pages.
> > + *
> > + * Any shadow entries of evicted pages are included in the returned
> > + * array.
> > + *
> > + * __find_get_pages() returns the number of pages and shadow entries
> > + * which were found.
> > + */
> > +unsigned __find_get_pages(struct address_space *mapping,
> > +			  pgoff_t start, unsigned int nr_pages,
> > +			  struct page **pages, pgoff_t *indices)
> > +{
> > +	void **slot;
> > +	unsigned int ret = 0;
> > +	struct radix_tree_iter iter;
> > +
> > +	if (!nr_pages)
> > +		return 0;
> > +
> > +	rcu_read_lock();
> > +restart:
> > +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> > +		struct page *page;
> > +repeat:
> > +		page = radix_tree_deref_slot(slot);
> > +		if (unlikely(!page))
> > +			continue;
> > +		if (radix_tree_exception(page)) {
> > +			if (radix_tree_deref_retry(page))
> > +				goto restart;
> > +			/*
> > +			 * Otherwise, we must be storing a swap entry
> > +			 * here as an exceptional entry: so return it
> > +			 * without attempting to raise page count.
> > +			 */
> > +			goto export;
> > +		}
> 
> There is a non-obvious API hazard here that should be called out in
> the function description. shmem was the previous gang lookup user and
> it knew that there would be swap entries and removed them if necessary
> with shmem_deswap_pagevec. It was internal to shmem.c so it could deal
> with the complexity. Now that you are making it a generic function it
> should clearly explain that exceptional entries can be returned and
> pagevec_remove_exceptionals should be used to remove them if necessary
> or else split the helper in two to return just pages or both pages and
> exceptional entries.

I'm confused.  That is not the pagevec API, so
pagevec_remove_exceptionals() does not apply.

Also, this API does in fact provide two functions, one of which
returns all entries, and one which returns only pages.  They are
called __find_get_pages() (now find_get_entries()) and
find_get_pages().

> > @@ -179,7 +179,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
> >  		rcu_read_lock();
> >  		page = radix_tree_lookup(&mapping->page_tree, page_offset);
> >  		rcu_read_unlock();
> > -		if (page)
> > +		if (page && !radix_tree_exceptional_entry(page))
> >  			continue;
> >  
> >  		page = page_cache_alloc_readahead(mapping);
> 
> Maybe just think hunk can be split out and shared with btrfs to avoid it
> dealing with exceptional entries although I've no good suggestions on what
> you'd call it.

I'd rather btrfs wouldn't poke around in page cache internals like
that, so I'm reluctant to provide an interface to facilitate it.  Or
put lipstick on the pig... :)

Here is a delta patch based on your feedback.  Thanks for the review!

Andrew, short of any objections, could you please include the
following patch as
mm-fs-prepare-for-non-page-entries-in-page-cache-radix-trees-fix-fix.patch?

Thanks!

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: __find_get_page() -> find_get_entry()

__find_get_page() -> find_get_entry()
__find_lock_page() -> find_lock_entry()
__find_get_pages() -> find_get_entries()
__pagevec_lookup() -> pagevec_lookup_entries()

Also update and fix stale kerneldocs and revert gratuitous whitespace
changes.

Based on feedback from Mel Gorman.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/pagemap.h |  8 +++----
 include/linux/pagevec.h |  6 +++--
 mm/filemap.c            | 61 ++++++++++++++++++++++++++-----------------------
 mm/mincore.c            |  2 +-
 mm/shmem.c              | 12 +++++-----
 mm/swap.c               | 31 +++++++++++++------------
 mm/truncate.c           |  8 +++----
 7 files changed, 68 insertions(+), 60 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2eeca3c83b0f..493bfd85214e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -248,14 +248,14 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 pgoff_t page_cache_prev_hole(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
-struct page *__find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
 struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
-struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
 struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
 struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
 				 gfp_t gfp_mask);
-unsigned __find_get_pages(struct address_space *mapping, pgoff_t start,
-			  unsigned int nr_pages, struct page **pages,
+unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
+			  unsigned int nr_entries, struct page **entries,
 			  pgoff_t *indices);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 3c6b8b1e945b..b45d391b4540 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,8 +22,10 @@ struct pagevec {
 
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
-unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
-			  pgoff_t start, unsigned nr_pages, pgoff_t *indices);
+unsigned pagevec_lookup_entries(struct pagevec *pvec,
+				struct address_space *mapping,
+				pgoff_t start, unsigned nr_entries,
+				pgoff_t *indices);
 void pagevec_remove_exceptionals(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
diff --git a/mm/filemap.c b/mm/filemap.c
index a194179303e5..8ed29b71c972 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -790,9 +790,9 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 EXPORT_SYMBOL(page_cache_prev_hole);
 
 /**
- * __find_get_page - find and get a page reference
+ * find_get_entry - find and get a page cache entry
  * @mapping: the address_space to search
- * @offset: the page index
+ * @offset: the page cache index
  *
  * Looks up the page cache slot at @mapping & @offset.  If there is a
  * page cache page, it is returned with an increased refcount.
@@ -802,7 +802,7 @@ EXPORT_SYMBOL(page_cache_prev_hole);
  *
  * Otherwise, %NULL is returned.
  */
-struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
 {
 	void **pagep;
 	struct page *page;
@@ -843,7 +843,7 @@ out:
 
 	return page;
 }
-EXPORT_SYMBOL(__find_get_page);
+EXPORT_SYMBOL(find_get_entry);
 
 /**
  * find_get_page - find and get a page reference
@@ -857,7 +857,7 @@ EXPORT_SYMBOL(__find_get_page);
  */
 struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
 {
-	struct page *page = __find_get_page(mapping, offset);
+	struct page *page = find_get_entry(mapping, offset);
 
 	if (radix_tree_exceptional_entry(page))
 		page = NULL;
@@ -866,9 +866,9 @@ struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
 EXPORT_SYMBOL(find_get_page);
 
 /**
- * __find_lock_page - locate, pin and lock a pagecache page
+ * find_lock_entry - locate, pin and lock a page cache entry
  * @mapping: the address_space to search
- * @offset: the page index
+ * @offset: the page cache index
  *
  * Looks up the page cache slot at @mapping & @offset.  If there is a
  * page cache page, it is returned locked and with an increased
@@ -879,13 +879,14 @@ EXPORT_SYMBOL(find_get_page);
  *
  * Otherwise, %NULL is returned.
  *
- * __find_lock_page() may sleep.
+ * find_lock_entry() may sleep.
  */
-struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
+struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
 {
 	struct page *page;
+
 repeat:
-	page = __find_get_page(mapping, offset);
+	page = find_get_entry(mapping, offset);
 	if (page && !radix_tree_exception(page)) {
 		lock_page(page);
 		/* Has the page been truncated? */
@@ -898,7 +899,7 @@ repeat:
 	}
 	return page;
 }
-EXPORT_SYMBOL(__find_lock_page);
+EXPORT_SYMBOL(find_lock_entry);
 
 /**
  * find_lock_page - locate, pin and lock a pagecache page
@@ -915,7 +916,7 @@ EXPORT_SYMBOL(__find_lock_page);
  */
 struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
 {
-	struct page *page = __find_lock_page(mapping, offset);
+	struct page *page = find_lock_entry(mapping, offset);
 
 	if (radix_tree_exceptional_entry(page))
 		page = NULL;
@@ -973,35 +974,37 @@ repeat:
 EXPORT_SYMBOL(find_or_create_page);
 
 /**
- * __find_get_pages - gang pagecache lookup
+ * find_get_entries - gang pagecache lookup
  * @mapping:	The address_space to search
- * @start:	The starting page index
- * @nr_pages:	The maximum number of pages
- * @pages:	Where the resulting entries are placed
- * @indices:	The cache indices corresponding to the entries in @pages
+ * @start:	The starting page cache index
+ * @nr_entries:	The maximum number of entries
+ * @entries:	Where the resulting entries are placed
+ * @indices:	The cache indices corresponding to the entries in @entries
  *
- * __find_get_pages() will search for and return a group of up to
- * @nr_pages pages in the mapping.  The pages are placed at @pages.
- * __find_get_pages() takes a reference against the returned pages.
+ * find_get_entries() will search for and return a group of up to
+ * @nr_entries entries in the mapping.  The entries are placed at
+ * @entries.  find_get_entries() takes a reference against any actual
+ * pages it returns.
  *
- * The search returns a group of mapping-contiguous pages with ascending
- * indexes.  There may be holes in the indices due to not-present pages.
+ * The search returns a group of mapping-contiguous page cache entries
+ * with ascending indexes.  There may be holes in the indices due to
+ * not-present pages.
  *
  * Any shadow entries of evicted pages are included in the returned
  * array.
  *
- * __find_get_pages() returns the number of pages and shadow entries
+ * find_get_entries() returns the number of pages and shadow entries
  * which were found.
  */
-unsigned __find_get_pages(struct address_space *mapping,
-			  pgoff_t start, unsigned int nr_pages,
-			  struct page **pages, pgoff_t *indices)
+unsigned find_get_entries(struct address_space *mapping,
+			  pgoff_t start, unsigned int nr_entries,
+			  struct page **entries, pgoff_t *indices)
 {
 	void **slot;
 	unsigned int ret = 0;
 	struct radix_tree_iter iter;
 
-	if (!nr_pages)
+	if (!nr_entries)
 		return 0;
 
 	rcu_read_lock();
@@ -1032,8 +1035,8 @@ repeat:
 		}
 export:
 		indices[ret] = iter.index;
-		pages[ret] = page;
-		if (++ret == nr_pages)
+		entries[ret] = page;
+		if (++ret == nr_entries)
 			break;
 	}
 	rcu_read_unlock();
diff --git a/mm/mincore.c b/mm/mincore.c
index df52b572e8b4..725c80961048 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -72,7 +72,7 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 	 */
 #ifdef CONFIG_SWAP
 	if (shmem_mapping(mapping)) {
-		page = __find_get_page(mapping, pgoff);
+		page = find_get_entry(mapping, pgoff);
 		/*
 		 * shmem/tmpfs may return swap: account for swapcache
 		 * page too.
diff --git a/mm/shmem.c b/mm/shmem.c
index e5fe262bb834..a3ba988ec946 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -363,8 +363,8 @@ void shmem_unlock_mapping(struct address_space *mapping)
 		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
 		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
 		 */
-		pvec.nr = __find_get_pages(mapping, index,
-					PAGEVEC_SIZE, pvec.pages, indices);
+		pvec.nr = find_get_entries(mapping, index,
+					   PAGEVEC_SIZE, pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		index = indices[pvec.nr - 1] + 1;
@@ -400,7 +400,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	pagevec_init(&pvec, 0);
 	index = start;
 	while (index < end) {
-		pvec.nr = __find_get_pages(mapping, index,
+		pvec.nr = find_get_entries(mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			pvec.pages, indices);
 		if (!pvec.nr)
@@ -470,7 +470,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	for ( ; ; ) {
 		cond_resched();
 
-		pvec.nr = __find_get_pages(mapping, index,
+		pvec.nr = find_get_entries(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
 				pvec.pages, indices);
 		if (!pvec.nr) {
@@ -1015,7 +1015,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return -EFBIG;
 repeat:
 	swap.val = 0;
-	page = __find_lock_page(mapping, index);
+	page = find_lock_entry(mapping, index);
 	if (radix_tree_exceptional_entry(page)) {
 		swap = radix_to_swp_entry(page);
 		page = NULL;
@@ -1669,7 +1669,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 	pagevec_init(&pvec, 0);
 	pvec.nr = 1;		/* start small: we may be there already */
 	while (!done) {
-		pvec.nr = __find_get_pages(mapping, index,
+		pvec.nr = find_get_entries(mapping, index,
 					pvec.nr, pvec.pages, indices);
 		if (!pvec.nr) {
 			if (whence == SEEK_DATA)
diff --git a/mm/swap.c b/mm/swap.c
index 20c267b52914..0c1715036a1f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -948,28 +948,31 @@ void __pagevec_lru_add(struct pagevec *pvec)
 EXPORT_SYMBOL(__pagevec_lru_add);
 
 /**
- * __pagevec_lookup - gang pagecache lookup
+ * pagevec_lookup_entries - gang pagecache lookup
  * @pvec:	Where the resulting entries are placed
  * @mapping:	The address_space to search
  * @start:	The starting entry index
- * @nr_pages:	The maximum number of entries
+ * @nr_entries:	The maximum number of entries
  * @indices:	The cache indices corresponding to the entries in @pvec
  *
- * __pagevec_lookup() will search for and return a group of up to
- * @nr_pages pages and shadow entries in the mapping.  All entries are
- * placed in @pvec.  __pagevec_lookup() takes a reference against
- * actual pages in @pvec.
+ * pagevec_lookup_entries() will search for and return a group of up
+ * to @nr_entries pages and shadow entries in the mapping.  All
+ * entries are placed in @pvec.  pagevec_lookup_entries() takes a
+ * reference against actual pages in @pvec.
  *
  * The search returns a group of mapping-contiguous entries with
  * ascending indexes.  There may be holes in the indices due to
  * not-present entries.
  *
- * __pagevec_lookup() returns the number of entries which were found.
+ * pagevec_lookup_entries() returns the number of entries which were
+ * found.
  */
-unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
-			  pgoff_t start, unsigned nr_pages, pgoff_t *indices)
+unsigned pagevec_lookup_entries(struct pagevec *pvec,
+				struct address_space *mapping,
+				pgoff_t start, unsigned nr_pages,
+				pgoff_t *indices)
 {
-	pvec->nr = __find_get_pages(mapping, start, nr_pages,
+	pvec->nr = find_get_entries(mapping, start, nr_pages,
 				    pvec->pages, indices);
 	return pagevec_count(pvec);
 }
@@ -978,10 +981,10 @@ unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
  * pagevec_remove_exceptionals - pagevec exceptionals pruning
  * @pvec:	The pagevec to prune
  *
- * __pagevec_lookup() fills both pages and exceptional radix tree
- * entries into the pagevec.  This function prunes all exceptionals
- * from @pvec without leaving holes, so that it can be passed on to
- * page-only pagevec operations.
+ * pagevec_lookup_entries() fills both pages and exceptional radix
+ * tree entries into the pagevec.  This function prunes all
+ * exceptionals from @pvec without leaving holes, so that it can be
+ * passed on to page-only pagevec operations.
  */
 void pagevec_remove_exceptionals(struct pagevec *pvec)
 {
diff --git a/mm/truncate.c b/mm/truncate.c
index b0f4d4bee8ab..60c9817c5365 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -255,7 +255,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index < end && __pagevec_lookup(&pvec, mapping, index,
+	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
 		mem_cgroup_uncharge_start();
@@ -331,7 +331,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		if (!__pagevec_lookup(&pvec, mapping, index,
+		if (!pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			indices)) {
 			if (index == start)
@@ -422,7 +422,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	 */
 
 	pagevec_init(&pvec, 0);
-	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
 		mem_cgroup_uncharge_start();
@@ -531,7 +531,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	cleancache_invalidate_inode(mapping);
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 			indices)) {
 		mem_cgroup_uncharge_start();
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-12-02 22:10   ` Dave Chinner
@ 2013-12-02 22:46     ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2013-12-02 22:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Christoph Hellwig,
	Greg Thelen, Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Minchan Kim, Ozgun Erdogan,
	Peter Zijlstra, Rik van Riel, Roman Gushchin, Ryan Mallon,
	Tejun Heo, Vlastimil Babka, linux-mm, linux-fsdevel,
	linux-kernel

On Tue, Dec 03, 2013 at 09:10:52AM +1100, Dave Chinner wrote:
> On Mon, Dec 02, 2013 at 02:21:48PM -0500, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.
> > 
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >    from the node up to the tree root, which is needed to perform a
> >    deletion.  To solve this, encode in each node its offset inside the
> >    parent.  This can be stored in the unused upper bits of the same
> >    member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >    regular entries, to quickly detect when the node is ready to go to
> >    the shadow node LRU list.  The current entry count is an unsigned
> >    int but the maximum number of entries is 64, so a shadow counter
> >    can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs tree lock and tree root, which are located
> >    in the address space, so store an address_space backpointer in the
> >    node.  The parent pointer of the node is in a union with the 2-word
> >    rcu_head, so the backpointer comes at no extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >    head inside the node.  This does increase the size of the node, but
> >    it does not change the number of objects that fit into a slab page.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Mostly looks ok, though there is no need to expose the internals of
> list_lru_add/del. The reason for the different return values was so
> that the isolate callback could simply use list_del_init() and not
> have to worry about all the internal accounting stuff. We can drop
> the lock and then do the accounting after regaining it because it
> won't result in the count of objects going negative and triggering
> warnings.
> 
> Hence I think that all we need to do is add a new isolate return
> value "LRU_REMOVED_RETRY" and add it to list_lru_walk_node() like
> so:
> 
>  		switch (ret) {
> +		case LRU_REMOVED_RETRY:
> +			/*
> +			 * object was removed from the list so we need to
> +			 * account for it just like LRU_REMOVED hence the
> +			 * fallthrough.  However, the list lock was also
> +			 * dropped so we need to restart the list walk.
> +			 */
>  		case LRU_REMOVED:
>  			if (--nlru->nr_items == 0)
>  				node_clear(nid, lru->active_nodes);
>  			WARN_ON_ONCE(nlru->nr_items < 0);
>  			isolated++;
> +			if (ret == LRU_REMOVED_RETRY)
> +				goto restart;
>  			break;

Ha, that is actually exactly what I did when I first implemented it
but decided to change it towards giving the walker callback a bit more
flexibility rather than hardcoding a certain behavior like this (they
might want to put it back if the item can't be shrunk, would that mean
LRU_ROTATE_RETRY?).  That and I wasn't too thrilled with taking the
item off the list without accounting for it before dropping the lock;
just didn't seem right.

But I don't feel strongly about it and we might not want to make the
interface more flexible than we have to at this point.  I'll change
this in the next revision or send a delta to akpm, depending on how
things go from here.

> > +static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
> > +				       struct shrink_control *sc)
> > +{
> > +	unsigned long nr_reclaimed = 0;
> > +
> > +	list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
> > +			   shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
> > +
> > +	return nr_reclaimed;
> > +}
> 
> Do we need check against GFP_NOFS here? I don't think so, but I just
> wanted to check...

No, that should be okay.  We do the same radix tree modifications that
GFP_NOFS reclaim does and don't call into the filesystem.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-12-02 19:21 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
@ 2013-12-02 22:10   ` Dave Chinner
  2013-12-02 22:46     ` Johannes Weiner
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2013-12-02 22:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Andi Kleen, Andrea Arcangeli, Christoph Hellwig,
	Greg Thelen, Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Minchan Kim, Ozgun Erdogan,
	Peter Zijlstra, Rik van Riel, Roman Gushchin, Ryan Mallon,
	Tejun Heo, Vlastimil Babka, linux-mm, linux-fsdevel,
	linux-kernel

On Mon, Dec 02, 2013 at 02:21:48PM -0500, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.
> Per-NUMA rather than global because we expect the radix tree nodes
> themselves to be allocated node-locally and we want to reduce
> cross-node references of otherwise independent cache workloads.  A
> simple shrinker will then reclaim these nodes on memory pressure.
> 
> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:
> 
> 1. There is no index available that would describe the reverse path
>    from the node up to the tree root, which is needed to perform a
>    deletion.  To solve this, encode in each node its offset inside the
>    parent.  This can be stored in the unused upper bits of the same
>    member that stores the node's height at no extra space cost.
> 
> 2. The number of shadow entries needs to be counted in addition to the
>    regular entries, to quickly detect when the node is ready to go to
>    the shadow node LRU list.  The current entry count is an unsigned
>    int but the maximum number of entries is 64, so a shadow counter
>    can easily be stored in the unused upper bits.
> 
> 3. Tree modification needs tree lock and tree root, which are located
>    in the address space, so store an address_space backpointer in the
>    node.  The parent pointer of the node is in a union with the 2-word
>    rcu_head, so the backpointer comes at no extra cost as well.
> 
> 4. The node needs to be linked to an LRU list, which requires a list
>    head inside the node.  This does increase the size of the node, but
>    it does not change the number of objects that fit into a slab page.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Mostly looks ok, though there is no need to expose the internals of
list_lru_add/del. The reason for the different return values was so
that the isolate callback could simply use list_del_init() and not
have to worry about all the internal accounting stuff. We can drop
the lock and then do the accounting after regaining it because it
won't result in the count of objects going negative and triggering
warnings.

Hence I think that all we need to do is add a new isolate return
value "LRU_REMOVED_RETRY" and add it to list_lru_walk_node() like
so:

 		switch (ret) {
+		case LRU_REMOVED_RETRY:
+			/*
+			 * object was removed from the list so we need to
+			 * account for it just like LRU_REMOVED hence the
+			 * fallthrough.  However, the list lock was also
+			 * dropped so we need to restart the list walk.
+			 */
 		case LRU_REMOVED:
 			if (--nlru->nr_items == 0)
 				node_clear(nid, lru->active_nodes);
 			WARN_ON_ONCE(nlru->nr_items < 0);
 			isolated++;
+			if (ret == LRU_REMOVED_RETRY)
+				goto restart;
 			break;

> +static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
> +				       struct shrink_control *sc)
> +{
> +	unsigned long nr_reclaimed = 0;
> +
> +	list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
> +			   shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
> +
> +	return nr_reclaimed;
> +}

Do we need check against GFP_NOFS here? I don't think so, but I just
wanted to check...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-12-02 19:21 [patch 0/9] mm: thrash detection-based file cache sizing v7 Johannes Weiner
@ 2013-12-02 19:21 ` Johannes Weiner
  2013-12-02 22:10   ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2013-12-02 19:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Christoph Hellwig, Dave Chinner,
	Greg Thelen, Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman,
	Metin Doslu, Michel Lespinasse, Minchan Kim, Ozgun Erdogan,
	Peter Zijlstra, Rik van Riel, Roman Gushchin, Ryan Mallon,
	Tejun Heo, Vlastimil Babka, linux-mm, linux-fsdevel,
	linux-kernel

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.
Per-NUMA rather than global because we expect the radix tree nodes
themselves to be allocated node-locally and we want to reduce
cross-node references of otherwise independent cache workloads.  A
simple shrinker will then reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
   from the node up to the tree root, which is needed to perform a
   deletion.  To solve this, encode in each node its offset inside the
   parent.  This can be stored in the unused upper bits of the same
   member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
   regular entries, to quickly detect when the node is ready to go to
   the shadow node LRU list.  The current entry count is an unsigned
   int but the maximum number of entries is 64, so a shadow counter
   can easily be stored in the unused upper bits.

3. Tree modification needs tree lock and tree root, which are located
   in the address space, so store an address_space backpointer in the
   node.  The parent pointer of the node is in a union with the 2-word
   rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
   head inside the node.  This does increase the size of the node, but
   it does not change the number of objects that fit into a slab page.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/list_lru.h      |  21 ++++++++
 include/linux/mmzone.h        |   1 +
 include/linux/radix-tree.h    |  32 +++++++++---
 include/linux/swap.h          |   1 +
 include/linux/vm_event_item.h |   1 +
 lib/radix-tree.c              |  36 ++++++++-----
 mm/filemap.c                  |  77 +++++++++++++++++++++------
 mm/list_lru.c                 |  26 ++++++++++
 mm/truncate.c                 |  20 ++++++-
 mm/vmstat.c                   |   1 +
 mm/workingset.c               | 118 ++++++++++++++++++++++++++++++++++++++++++
 11 files changed, 294 insertions(+), 40 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce5417..9a7ad61 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -128,4 +128,25 @@ list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 	}
 	return isolated;
 }
+
+/**
+ * __list_lru_add: add an element to the lru list
+ * @list_lru: the lru pointer
+ * @item: the item to be added
+ *
+ * This is a version of list_lru_add() for use from within the list
+ * walker.  The lock must be held and the item can't be on the list.
+ */
+void __list_lru_add(struct list_lru *lru, struct list_head *item);
+
+/**
+ * __list_lru_del: delete an element from the lru list
+ * @list_lru: the lru pointer
+ * @item: the item to be deleted
+ *
+ * This is a version of list_lru_del() for use from within the list
+ * walker.  The lock must be held and the item must be on the list.
+ */
+void __list_lru_del(struct list_lru *lru, struct list_head *item);
+
 #endif /* _LRU_LIST_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 118ba9f..8cac5a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -144,6 +144,7 @@ enum zone_stat_item {
 #endif
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_NODERECLAIM,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 13636c4..33170db 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 #define RADIX_TREE_TAG_LONGS	\
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
+#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
+#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
+					  RADIX_TREE_MAP_SHIFT))
+
+/* Height component in node->path */
+#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
+#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
+
+/* Internally used bits of node->count */
+#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
+#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
+
 struct radix_tree_node {
-	unsigned int	height;		/* Height from the bottom */
+	unsigned int	path;	/* Offset in parent & height from the bottom */
 	unsigned int	count;
 	union {
-		struct radix_tree_node *parent;	/* Used when ascending tree */
-		struct rcu_head	rcu_head;	/* Used when freeing node */
+		struct {
+			/* Used when ascending tree */
+			struct radix_tree_node *parent;
+			/* For tree user */
+			void *private_data;
+		};
+		/* Used when freeing node */
+		struct rcu_head	rcu_head;
 	};
+	/* For tree user */
+	struct list_head private_list;
 	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
 };
 
-#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
-#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
-					  RADIX_TREE_MAP_SHIFT))
-
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
 	unsigned int		height;
@@ -251,7 +267,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 			  struct radix_tree_node **nodep, void ***slotp);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
-bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+bool __radix_tree_delete_node(struct radix_tree_root *root,
 			      struct radix_tree_node *node);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b83cf61..102e37b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -264,6 +264,7 @@ struct swap_list_t {
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
+extern struct list_lru workingset_shadow_nodes;
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1855f0a..0b15c59 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -76,6 +76,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
 		NR_TLB_LOCAL_FLUSH_ALL,
 		NR_TLB_LOCAL_FLUSH_ONE,
+		WORKINGSET_NODES_RECLAIMED,
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e601c56..0a08953 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -342,7 +342,8 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
 
 		/* Increase the height.  */
 		newheight = root->height+1;
-		node->height = newheight;
+		BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
+		node->path = newheight;
 		node->count = 1;
 		node->parent = NULL;
 		slot = root->rnode;
@@ -400,11 +401,12 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 			/* Have to add a child node.  */
 			if (!(slot = radix_tree_node_alloc(root)))
 				return -ENOMEM;
-			slot->height = height;
+			slot->path = height;
 			slot->parent = node;
 			if (node) {
 				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
+				slot->path |= offset << RADIX_TREE_HEIGHT_SHIFT;
 			} else
 				rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
 		}
@@ -496,7 +498,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 	}
 	node = indirect_to_ptr(node);
 
-	height = node->height;
+	height = node->path & RADIX_TREE_HEIGHT_MASK;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
@@ -702,7 +704,7 @@ int radix_tree_tag_get(struct radix_tree_root *root,
 		return (index == 0);
 	node = indirect_to_ptr(node);
 
-	height = node->height;
+	height = node->path & RADIX_TREE_HEIGHT_MASK;
 	if (index > radix_tree_maxindex(height))
 		return 0;
 
@@ -739,7 +741,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
 {
 	unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK;
 	struct radix_tree_node *rnode, *node;
-	unsigned long index, offset;
+	unsigned long index, offset, height;
 
 	if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag))
 		return NULL;
@@ -770,7 +772,8 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
 		return NULL;
 
 restart:
-	shift = (rnode->height - 1) * RADIX_TREE_MAP_SHIFT;
+	height = rnode->path & RADIX_TREE_HEIGHT_MASK;
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	offset = index >> shift;
 
 	/* Index outside of the tree */
@@ -1140,7 +1143,7 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item,
 	unsigned int shift, height;
 	unsigned long i;
 
-	height = slot->height;
+	height = slot->path & RADIX_TREE_HEIGHT_MASK;
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
 	for ( ; height > 1; height--) {
@@ -1203,7 +1206,8 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item)
 		}
 
 		node = indirect_to_ptr(node);
-		max_index = radix_tree_maxindex(node->height);
+		max_index = radix_tree_maxindex(node->path &
+						RADIX_TREE_HEIGHT_MASK);
 		if (cur_index > max_index)
 			break;
 
@@ -1297,7 +1301,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
  *
  *	Returns %true if @node was freed, %false otherwise.
  */
-bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+bool __radix_tree_delete_node(struct radix_tree_root *root,
 			      struct radix_tree_node *node)
 {
 	bool deleted = false;
@@ -1316,9 +1320,10 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
 
 		parent = node->parent;
 		if (parent) {
-			index >>= RADIX_TREE_MAP_SHIFT;
+			unsigned int offset;
 
-			parent->slots[index & RADIX_TREE_MAP_MASK] = NULL;
+			offset = node->path >> RADIX_TREE_HEIGHT_SHIFT;
+			parent->slots[offset] = NULL;
 			parent->count--;
 		} else {
 			root_tag_clear_all(root);
@@ -1382,7 +1387,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
 	node->slots[offset] = NULL;
 	node->count--;
 
-	__radix_tree_delete_node(root, index, node);
+	__radix_tree_delete_node(root, node);
 
 	return entry;
 }
@@ -1415,9 +1420,12 @@ int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag)
 EXPORT_SYMBOL(radix_tree_tagged);
 
 static void
-radix_tree_node_ctor(void *node)
+radix_tree_node_ctor(void *arg)
 {
-	memset(node, 0, sizeof(struct radix_tree_node));
+	struct radix_tree_node *node = arg;
+
+	memset(node, 0, sizeof(*node));
+	INIT_LIST_HEAD(&node->private_list);
 }
 
 static __init unsigned long __maxindex(unsigned int height)
diff --git a/mm/filemap.c b/mm/filemap.c
index 65a374c..b93e223 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,11 +110,17 @@
 static void page_cache_tree_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
-	if (shadow) {
-		void **slot;
+	struct radix_tree_node *node;
+	unsigned long index;
+	unsigned int offset;
+	unsigned int tag;
+	void **slot;
 
-		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
-		radix_tree_replace_slot(slot, shadow);
+	VM_BUG_ON(!PageLocked(page));
+
+	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+
+	if (shadow) {
 		mapping->nrshadows++;
 		/*
 		 * Make sure the nrshadows update is committed before
@@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space *mapping,
 		 * same time and miss a shadow entry.
 		 */
 		smp_wmb();
-	} else
-		radix_tree_delete(&mapping->page_tree, page->index);
+	}
 	mapping->nrpages--;
+
+	if (!node) {
+		/* Clear direct pointer tags in root node */
+		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
+		radix_tree_replace_slot(slot, shadow);
+		return;
+	}
+
+	/* Clear tree tags for the removed page */
+	index = page->index;
+	offset = index & RADIX_TREE_MAP_MASK;
+	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+		if (test_bit(offset, node->tags[tag]))
+			radix_tree_tag_clear(&mapping->page_tree, index, tag);
+	}
+
+	/* Delete page, swap shadow entry */
+	radix_tree_replace_slot(slot, shadow);
+	node->count--;
+	if (shadow)
+		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
+	else
+		if (__radix_tree_delete_node(&mapping->page_tree, node))
+			return;
+
+	/* Only shadow entries in there, keep track of this node */
+	if (!(node->count & RADIX_TREE_COUNT_MASK) &&
+	    list_empty(&node->private_list)) {
+		node->private_data = mapping;
+		list_lru_add(&workingset_shadow_nodes, &node->private_list);
+	}
 }
 
 /*
@@ -471,27 +507,36 @@ EXPORT_SYMBOL_GPL(replace_page_cache_page);
 static int page_cache_tree_insert(struct address_space *mapping,
 				  struct page *page, void **shadowp)
 {
+	struct radix_tree_node *node;
 	void **slot;
 	int error;
 
-	slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
-	if (slot) {
+	error = __radix_tree_create(&mapping->page_tree, page->index,
+				    &node, &slot);
+	if (error)
+		return error;
+	if (*slot) {
 		void *p;
 
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
-		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
-		mapping->nrpages++;
 		if (shadowp)
 			*shadowp = p;
-		return 0;
+		mapping->nrshadows--;
+		if (node)
+			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
 	}
-	error = radix_tree_insert(&mapping->page_tree, page->index, page);
-	if (!error)
-		mapping->nrpages++;
-	return error;
+	radix_tree_replace_slot(slot, page);
+	mapping->nrpages++;
+	if (node) {
+		node->count++;
+		/* Installed page, can't be shadow-only anymore */
+		if (!list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+				     &node->private_list);
+	}
+	return 0;
 }
 
 static int __add_to_page_cache_locked(struct page *page,
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 72f9dec..e3aa481 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -10,6 +10,19 @@
 #include <linux/list_lru.h>
 #include <linux/slab.h>
 
+void __list_lru_add(struct list_lru *lru, struct list_head *item)
+{
+	int nid = page_to_nid(virt_to_page(item));
+	struct list_lru_node *nlru = &lru->node[nid];
+
+	WARN_ON_ONCE(!list_empty(item));
+	list_add_tail(item, &nlru->list);
+	WARN_ON_ONCE(nlru->nr_items < 0);
+	if (nlru->nr_items++ == 0)
+		node_set(nid, lru->active_nodes);
+}
+EXPORT_SYMBOL_GPL(__list_lru_add);
+
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
 	int nid = page_to_nid(virt_to_page(item));
@@ -29,6 +42,19 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_add);
 
+void __list_lru_del(struct list_lru *lru, struct list_head *item)
+{
+	int nid = page_to_nid(virt_to_page(item));
+	struct list_lru_node *nlru = &lru->node[nid];
+
+	WARN_ON_ONCE(list_empty(item));
+	list_del_init(item);
+	if (--nlru->nr_items == 0)
+		node_clear(nid, lru->active_nodes);
+	WARN_ON_ONCE(nlru->nr_items < 0);
+}
+EXPORT_SYMBOL_GPL(__list_lru_del);
+
 bool list_lru_del(struct list_lru *lru, struct list_head *item)
 {
 	int nid = page_to_nid(virt_to_page(item));
diff --git a/mm/truncate.c b/mm/truncate.c
index 97606fa..5c2615d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -25,6 +25,9 @@
 static void clear_exceptional_entry(struct address_space *mapping,
 				    pgoff_t index, void *entry)
 {
+	struct radix_tree_node *node;
+	void **slot;
+
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
 		return;
@@ -35,8 +38,21 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * without the tree itself locked.  These unlocked entries
 	 * need verification under the tree lock.
 	 */
-	if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
-		mapping->nrshadows--;
+	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+	radix_tree_replace_slot(slot, NULL);
+	mapping->nrshadows--;
+	if (!node)
+		goto unlock;
+	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+	/* No more shadow entries, stop tracking the node */
+	if (!(node->count >> RADIX_TREE_COUNT_SHIFT) &&
+	    !list_empty(&node->private_list))
+		list_lru_del(&workingset_shadow_nodes, &node->private_list);
+	__radix_tree_delete_node(&mapping->page_tree, node);
+unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3ac830d..baa3ba5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -772,6 +772,7 @@ const char * const vmstat_text[] = {
 #endif
 	"workingset_refault",
 	"workingset_activate",
+	"workingset_nodereclaim",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 8a6c7cf..86aeb83 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -251,3 +251,121 @@ void workingset_activation(struct page *page)
 {
 	atomic_long_inc(&page_zone(page)->inactive_age);
 }
+
+/*
+ * Page cache radix tree nodes containing only shadow entries can grow
+ * excessively on certain workloads.  That's why they are tracked on
+ * per-(NUMA)node lists and pushed back by a shrinker, but with a
+ * slightly higher threshold than regular shrinkers so we don't
+ * discard the entries too eagerly - after all, during light memory
+ * pressure is exactly when we need them.
+ */
+
+struct list_lru workingset_shadow_nodes;
+
+static unsigned long count_shadow_nodes(struct shrinker *shrinker,
+					struct shrink_control *sc)
+{
+	return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
+}
+
+static enum lru_status shadow_lru_isolate(struct list_head *item,
+					  spinlock_t *lru_lock,
+					  void *arg)
+{
+	unsigned long *nr_reclaimed = arg;
+	struct address_space *mapping;
+	struct radix_tree_node *node;
+	unsigned int i;
+
+	/*
+	 * Page cache insertions and deletions synchroneously maintain
+	 * the shadow node LRU under the mapping->tree_lock and the
+	 * lru_lock.  Because the page cache tree is emptied before
+	 * the inode can be destroyed, holding the lru_lock pins any
+	 * address_space that has radix tree nodes on the LRU.
+	 *
+	 * We can then safely transition to the mapping->tree_lock to
+	 * pin only the address_space of the particular node we want
+	 * to reclaim, take the node off-LRU, and drop the lru_lock.
+	 */
+
+	node = container_of(item, struct radix_tree_node, private_list);
+	mapping = node->private_data;
+
+	/* Coming from the list, invert the lock order */
+	if (!spin_trylock_irq(&mapping->tree_lock)) {
+		spin_unlock(lru_lock);
+		goto out_retry;
+	}
+
+	__list_lru_del(&workingset_shadow_nodes, item);
+	spin_unlock(lru_lock);
+
+	/*
+	 * The nodes should only contain one or more shadow entries,
+	 * no pages, so we expect to be able to remove them all and
+	 * delete and free the empty node afterwards.
+	 */
+
+	BUG_ON(!node->count);
+	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
+
+	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
+		if (node->slots[i]) {
+			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
+			node->slots[i] = NULL;
+			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
+			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+			BUG_ON(!mapping->nrshadows);
+			mapping->nrshadows--;
+		}
+	}
+	BUG_ON(node->count);
+	inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
+	if (!__radix_tree_delete_node(&mapping->page_tree, node))
+		BUG();
+	(*nr_reclaimed)++;
+
+	spin_unlock_irq(&mapping->tree_lock);
+out_retry:
+	cond_resched();
+	spin_lock(lru_lock);
+	return LRU_RETRY;
+}
+
+static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
+				       struct shrink_control *sc)
+{
+	unsigned long nr_reclaimed = 0;
+
+	list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
+			   shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
+
+	return nr_reclaimed;
+}
+
+static struct shrinker workingset_shadow_shrinker = {
+	.count_objects = count_shadow_nodes,
+	.scan_objects = scan_shadow_nodes,
+	.seeks = DEFAULT_SEEKS * 4,
+	.flags = SHRINKER_NUMA_AWARE,
+};
+
+static int __init workingset_init(void)
+{
+	int ret;
+
+	ret = list_lru_init(&workingset_shadow_nodes);
+	if (ret)
+		goto err;
+	ret = register_shrinker(&workingset_shadow_shrinker);
+	if (ret)
+		goto err_list_lru;
+	return 0;
+err_list_lru:
+	list_lru_destroy(&workingset_shadow_nodes);
+err:
+	return ret;
+}
+module_init(workingset_init);
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-11-26 23:00         ` Johannes Weiner
@ 2013-11-27  0:59           ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2013-11-27  0:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Jan Kara, Vlastimil Babka,
	Peter Zijlstra, Tejun Heo, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, linux-mm,
	linux-fsdevel, linux-kernel

On Tue, Nov 26, 2013 at 06:00:10PM -0500, Johannes Weiner wrote:
> On Wed, Nov 27, 2013 at 09:29:37AM +1100, Dave Chinner wrote:
> > On Tue, Nov 26, 2013 at 04:27:25PM -0500, Johannes Weiner wrote:
> > > On Tue, Nov 26, 2013 at 10:49:21AM +1100, Dave Chinner wrote:
> > > > On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> > > > > Previously, page cache radix tree nodes were freed after reclaim
> > > > > emptied out their page pointers.  But now reclaim stores shadow
> > > > > entries in their place, which are only reclaimed when the inodes
> > > > > themselves are reclaimed.  This is problematic for bigger files that
> > > > > are still in use after they have a significant amount of their cache
> > > > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > > > entries will just sit there and waste memory.  In the worst case, the
> > > > > shadow entries will accumulate until the machine runs out of memory.
> > ....
> > > > ....
> > > > > +	radix_tree_replace_slot(slot, page);
> > > > > +	if (node) {
> > > > > +		node->count++;
> > > > > +		/* Installed page, can't be shadow-only anymore */
> > > > > +		if (!list_empty(&node->lru))
> > > > > +			list_lru_del(&workingset_shadow_nodes, &node->lru);
> > > > > +	}
> > > > > +	return 0;
> > > > 
> > > > Hmmmmm - what's the overhead of direct management of LRU removal
> > > > here? Most list_lru code uses lazy removal (i.e. via the shrinker)
> > > > to avoid having to touch the LRU when adding new references to an
> > > > object.....
> > > 
> > > It's measurable in microbenchmarks, but not when any real IO is
> > > involved.  The difference was in the noise even on SSD drives.
> > 
> > Well, it's not an SSD or two I'm worried about - it's devices that
> > can do millions of IOPS where this is likely to be noticable...
> > 
> > > The other list_lru users see items only once they become unused and
> > > subsequent references are expected to be few and temporary, right?
> > 
> > They go onto the list when the refcount falls to zero, but reuse can
> > be frequent when being referenced repeatedly by a single user. That
> > avoids every reuse from removing the object from the LRU then
> > putting it back on the LRU for every reference cycle...
> 
> That's true, but it's less of a concern in the radix_tree_node case
> because it takes a full inactive list cycle after a refault before the
> node is put back on the LRU.  Or a really unlikely placed partial node
> truncation/invalidation (full truncation would just delete the whole
> node anyway).

OK, fair enough. We can deal with the problem if we see it being a
limitation.

> > > We expect pages to refault in spades on certain loads, at which point
> > > we may have thousands of those nodes on the list that are no longer
> > > reclaimable (10k nodes for about 2.5G of cache).
> > 
> > Sure, look at the way the inode and dentry caches work - entire
> > caches of millions of inodes and dentries often sit on the LRUs. A
> > quick look at my workstations dentry cache shows:
> > 
> > $ at /proc/sys/fs/dentry-state 
> > 180108  170596  45      0       0       0
> > 
> > 180k allocated dentries, 170k sitting on the LRU...
> 
> Hm, and a significant amount of those 170k could rotate on the next
> shrinker scan due to recent references or do you generally have
> smaller spikes?

I see very little dentry/inode reclaim because the shrinker tends to
skip most inodes and dentries because they have the referenced bit
set on them whenever the shrinker runs. i.e. that's the working set,
and it gets maintained pretty well...

> But as per above I think the case for lazily removing shadow nodes is
> less convincing than for inodes and dentries.

Agreed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-11-26 22:29       ` Dave Chinner
@ 2013-11-26 23:00         ` Johannes Weiner
  2013-11-27  0:59           ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2013-11-26 23:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Rik van Riel, Jan Kara, Vlastimil Babka,
	Peter Zijlstra, Tejun Heo, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Nov 27, 2013 at 09:29:37AM +1100, Dave Chinner wrote:
> On Tue, Nov 26, 2013 at 04:27:25PM -0500, Johannes Weiner wrote:
> > On Tue, Nov 26, 2013 at 10:49:21AM +1100, Dave Chinner wrote:
> > > On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> > > > Previously, page cache radix tree nodes were freed after reclaim
> > > > emptied out their page pointers.  But now reclaim stores shadow
> > > > entries in their place, which are only reclaimed when the inodes
> > > > themselves are reclaimed.  This is problematic for bigger files that
> > > > are still in use after they have a significant amount of their cache
> > > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > > entries will just sit there and waste memory.  In the worst case, the
> > > > shadow entries will accumulate until the machine runs out of memory.
> ....
> > > ....
> > > > +	radix_tree_replace_slot(slot, page);
> > > > +	if (node) {
> > > > +		node->count++;
> > > > +		/* Installed page, can't be shadow-only anymore */
> > > > +		if (!list_empty(&node->lru))
> > > > +			list_lru_del(&workingset_shadow_nodes, &node->lru);
> > > > +	}
> > > > +	return 0;
> > > 
> > > Hmmmmm - what's the overhead of direct management of LRU removal
> > > here? Most list_lru code uses lazy removal (i.e. via the shrinker)
> > > to avoid having to touch the LRU when adding new references to an
> > > object.....
> > 
> > It's measurable in microbenchmarks, but not when any real IO is
> > involved.  The difference was in the noise even on SSD drives.
> 
> Well, it's not an SSD or two I'm worried about - it's devices that
> can do millions of IOPS where this is likely to be noticable...
> 
> > The other list_lru users see items only once they become unused and
> > subsequent references are expected to be few and temporary, right?
> 
> They go onto the list when the refcount falls to zero, but reuse can
> be frequent when being referenced repeatedly by a single user. That
> avoids every reuse from removing the object from the LRU then
> putting it back on the LRU for every reference cycle...

That's true, but it's less of a concern in the radix_tree_node case
because it takes a full inactive list cycle after a refault before the
node is put back on the LRU.  Or a really unlikely placed partial node
truncation/invalidation (full truncation would just delete the whole
node anyway).

> > We expect pages to refault in spades on certain loads, at which point
> > we may have thousands of those nodes on the list that are no longer
> > reclaimable (10k nodes for about 2.5G of cache).
> 
> Sure, look at the way the inode and dentry caches work - entire
> caches of millions of inodes and dentries often sit on the LRUs. A
> quick look at my workstations dentry cache shows:
> 
> $ at /proc/sys/fs/dentry-state 
> 180108  170596  45      0       0       0
> 
> 180k allocated dentries, 170k sitting on the LRU...

Hm, and a significant amount of those 170k could rotate on the next
shrinker scan due to recent references or do you generally have
smaller spikes?

But as per above I think the case for lazily removing shadow nodes is
less convincing than for inodes and dentries.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-11-26 21:27     ` Johannes Weiner
@ 2013-11-26 22:29       ` Dave Chinner
  2013-11-26 23:00         ` Johannes Weiner
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2013-11-26 22:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Jan Kara, Vlastimil Babka,
	Peter Zijlstra, Tejun Heo, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, linux-mm,
	linux-fsdevel, linux-kernel

On Tue, Nov 26, 2013 at 04:27:25PM -0500, Johannes Weiner wrote:
> On Tue, Nov 26, 2013 at 10:49:21AM +1100, Dave Chinner wrote:
> > On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> > > Previously, page cache radix tree nodes were freed after reclaim
> > > emptied out their page pointers.  But now reclaim stores shadow
> > > entries in their place, which are only reclaimed when the inodes
> > > themselves are reclaimed.  This is problematic for bigger files that
> > > are still in use after they have a significant amount of their cache
> > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > entries will just sit there and waste memory.  In the worst case, the
> > > shadow entries will accumulate until the machine runs out of memory.
....
> > ....
> > > +	radix_tree_replace_slot(slot, page);
> > > +	if (node) {
> > > +		node->count++;
> > > +		/* Installed page, can't be shadow-only anymore */
> > > +		if (!list_empty(&node->lru))
> > > +			list_lru_del(&workingset_shadow_nodes, &node->lru);
> > > +	}
> > > +	return 0;
> > 
> > Hmmmmm - what's the overhead of direct management of LRU removal
> > here? Most list_lru code uses lazy removal (i.e. via the shrinker)
> > to avoid having to touch the LRU when adding new references to an
> > object.....
> 
> It's measurable in microbenchmarks, but not when any real IO is
> involved.  The difference was in the noise even on SSD drives.

Well, it's not an SSD or two I'm worried about - it's devices that
can do millions of IOPS where this is likely to be noticable...

> The other list_lru users see items only once they become unused and
> subsequent references are expected to be few and temporary, right?

They go onto the list when the refcount falls to zero, but reuse can
be frequent when being referenced repeatedly by a single user. That
avoids every reuse from removing the object from the LRU then
putting it back on the LRU for every reference cycle...

> We expect pages to refault in spades on certain loads, at which point
> we may have thousands of those nodes on the list that are no longer
> reclaimable (10k nodes for about 2.5G of cache).

Sure, look at the way the inode and dentry caches work - entire
caches of millions of inodes and dentries often sit on the LRUs. A
quick look at my workstations dentry cache shows:

$ at /proc/sys/fs/dentry-state 
180108  170596  45      0       0       0

180k allocated dentries, 170k sitting on the LRU...

> > > + * Page cache radix tree nodes containing only shadow entries can grow
> > > + * excessively on certain workloads.  That's why they are tracked on
> > > + * per-(NUMA)node lists and pushed back by a shrinker, but with a
> > > + * slightly higher threshold than regular shrinkers so we don't
> > > + * discard the entries too eagerly - after all, during light memory
> > > + * pressure is exactly when we need them.
> > > + *
> > > + * The list_lru lock nests inside the IRQ-safe mapping->tree_lock, so
> > > + * we have to disable IRQs for any list_lru operation as well.
> > > + */
> > > +
> > > +struct list_lru workingset_shadow_nodes;
> > > +
> > > +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> > > +					struct shrink_control *sc)
> > > +{
> > > +	unsigned long count;
> > > +
> > > +	local_irq_disable();
> > > +	count = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> > > +	local_irq_enable();
> > 
> > The count returned is not perfectly accurate, and the use of it in
> > the shrinker will be concurrent with other modifications, so
> > disabling IRQs here doesn't add any anything but unnecessary
> > overhead.
> 
> Lockdep complains when taking an IRQ-unsafe lock (lru_lock) inside an
> IRQ-safe lock (mapping->tree_lock).

Bah - sometimes I hate lockdep because it makes people do silly
things just to shut it up. IMO, the right fix is this patch:

https://lkml.org/lkml/2013/7/31/7

> > > +#define NOIRQ_BATCH 32
> > > +
> > > +static enum lru_status shadow_lru_isolate(struct list_head *item,
> > > +					  spinlock_t *lru_lock,
> > > +					  void *arg)
> > > +{
> > > +	struct address_space *mapping;
> > > +	struct radix_tree_node *node;
> > > +	unsigned long *batch = arg;
> > > +	unsigned int i;
> > > +
> > > +	node = container_of(item, struct radix_tree_node, lru);
> > > +	mapping = node->private;
> > > +
> > > +	/* Don't disable IRQs for too long */
> > > +	if (--(*batch) == 0) {
> > > +		spin_unlock_irq(lru_lock);
> > > +		*batch = NOIRQ_BATCH;
> > > +		spin_lock_irq(lru_lock);
> > > +		return LRU_RETRY;
> > > +	}
> > 
> > Ugh.
> > 
> > > +	/* Coming from the list, inverse the lock order */
> > > +	if (!spin_trylock(&mapping->tree_lock))
> > > +		return LRU_SKIP;
> > 
> > Why not spin_trylock_irq(&mapping->tree_lock) and get rid of the
> > nasty irq batching stuff? The LRU list is internally consistent,
> > so I don't see why irqs need to be disabled to walk across the
> > objects in the list - we only need that to avoid taking an interrupt
> > while holding the mapping->tree_lock() and the interrupt running
> > I/O completion which may try to take the mapping->tree_lock....
> 
> Same reason, IRQ-unsafe nesting inside IRQ-safe lock...

Seems to me like you're designing the code to workaround lockdep
deficiencies rather than thinking about the most efficient way to
solve the problem. lockdep can always be fixed to work with
whatever code we come up with, so don't let lockdep stifle your
creativity. ;)

> > Given that we should always be removing the item from the head of
> > the LRU list (except when we can't get the mapping lock), I'd
> > suggest that it would be better to do something like this:
> > 
> > 	/*
> > 	 * Coming from the list, inverse the lock order. Drop the
> > 	 * list lock, too, so that if a caller is spinning on it we
> > 	 * don't get stuck here.
> > 	 */
> > 	if (!spin_trylock(&mapping->tree_lock)) {

That should be spin_trylock_irq()....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-11-26  0:13   ` Andrew Morton
@ 2013-11-26 22:05     ` Johannes Weiner
  0 siblings, 0 replies; 58+ messages in thread
From: Johannes Weiner @ 2013-11-26 22:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Rik van Riel, Jan Kara, Vlastimil Babka,
	Peter Zijlstra, Tejun Heo, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Nov 25, 2013 at 04:13:32PM -0800, Andrew Morton wrote:
> On Sun, 24 Nov 2013 18:38:28 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> 
> Why per-node rather than a global list?

The radix tree nodes should always be allocated node-local, so it made
sense to string them up locally as well and prevent otherwise
independent workloads on separate nodes to contend for the same global
lock and list.

> >  A simple shrinker will reclaim these nodes on memory pressure.
> 
> Truncate needs to go off and massacre these things as well - some
> description of how that happens would be useful.

What do you mean?  Truncate operates on a range in the tree and
deletes these items the same way page entries are deleted.  It's a
regular tree deletion.  Only the shrinker is special because it
reaches into the tree coming from a tree node, not from the root and
an index.

> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >    from the node up to the tree root, which is needed to perform a
> >    deletion.  To solve this, encode in each node its offset inside the
> >    parent.  This can be stored in the unused upper bits of the same
> >    member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >    regular entries, to quickly detect when the node is ready to go to
> >    the shadow node LRU list.  The current entry count is an unsigned
> >    int but the maximum number of entries is 64, so a shadow counter
> >    can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs the lock, which is located in the address
> >    space,
> 
> Presumably "the lock" == tree_lock.

Yes, will clarify.

> >    so store a backpointer to it.
> 
> <looks at the code>
> 
> "it" is the address_space, not tree_lock, yes?

Yes, the address space so we can get to both the lock and the tree
root.  Will clarify.

> >   The parent pointer is in a
> >    union with the 2-word rcu_head, so the backpointer comes at no
> >    extra cost as well.
> 
> So we have a shrinker walking backwards from radix-tree nodes and
> reaching up into address_spaces.  We need to take steps to prevent
> those address_spaces from getting shot down (reclaim, umount, truncate,
> etc) while we're doing this.  What's happening here?

Ah, now the question about truncate above makes more sense as well.
Teardown makes sure the node is unlinked from the LRU, so holding the
lru_lock while the node is on the LRU pins the whole address space and
keeps teardown from finishing.

Dave already said that the lru_lock time is too long, though, so I'll
have to change the reclaimer to use RCU.  radix_tree_node is already
RCU-freed.  Teardown can mark the node dead under the tree lock, while
the shrinker can optimistically can take the tree lock of a node under
RCU, then verify the node is still alive.

I'll rework this and document the lifetime management properly.

> > 4. The node needs to be linked to an LRU list, which requires a list
> >    head inside the node.  This does increase the size of the node, but
> >    it does not change the number of objects that fit into a slab page.
> >
> > ...
> >
> > --- a/include/linux/list_lru.h
> > +++ b/include/linux/list_lru.h
> > @@ -32,7 +32,7 @@ struct list_lru {
> >  };
> >  
> >  void list_lru_destroy(struct list_lru *lru);
> > -int list_lru_init(struct list_lru *lru);
> > +int list_lru_init(struct list_lru *lru, struct lock_class_key *key);
> 
> It's a bit of a shame to be adding overhead to non-lockdep kernels.  A
> few ifdefs could fix this.
> 
> Presumably this is being done to squish some lockdep warning you hit. 
> A comment at the list_lru_init() implementation site would be useful. 
> One which describes the warning and why it's OK to squish it.

Yes, the other users of list_lru have an IRQ-unsafe lru_lock, so I
added a separate class for the IRQ-safe version.

> >  struct radix_tree_node {
> > -	unsigned int	height;		/* Height from the bottom */
> > +	unsigned int	path;	/* Offset in parent & height from the bottom */
> >  	unsigned int	count;
> >  	union {
> > -		struct radix_tree_node *parent;	/* Used when ascending tree */
> > -		struct rcu_head	rcu_head;	/* Used when freeing node */
> > +		/* Used when ascending tree */
> > +		struct {
> > +			struct radix_tree_node *parent;
> > +			void *private;
> 
> Private to whom?  The radix-tree implementation?  The radix-tree caller?

The caller.  Isn't that a standard name?  page->private,
mapping->private*, etc.?  Anyway, will add a comment.

> 
> > +		};
> > +		/* Used when freeing node */
> > +		struct rcu_head	rcu_head;
> >  	};
> > +	struct list_head lru;
> 
> Locking for this list?

The list_lru lock.  I'll document this.

> >  	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
> >  	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
> >  };
> >  
> >
> > ...
> >
> > +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> > +					struct shrink_control *sc)
> > +{
> > +	unsigned long count;
> > +
> > +	local_irq_disable();
> > +	count = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> > +	local_irq_enable();
> 
> I'm struggling with the local_irq_disable() here.  Presumably it's
> there to quash a lockdep warning, but page_cache_tree_delete() and
> friends can get away without the local_irq_disable().  Some more
> clarity here would be nice.

Yes, we are nesting the lru_lock inside the IRQ-safe
mapping->tree_lock, lockdep complains about that.

page_cache_tree_delete() and friends also disable IRQs by using
spin_lock_irq().

As I said in the email to Dave, it would be great to teach lockdep to
not complain because the deadlock scenario (irq tries to acquire lock
held in process context) is not possible in our case: the irq context
does not actually acquire the lru_lock.

> > +	return count;
> > +}
> > +
> > +#define NOIRQ_BATCH 32
> > +
> > +static enum lru_status shadow_lru_isolate(struct list_head *item,
> > +					  spinlock_t *lru_lock,
> > +					  void *arg)
> > +{
> > +	struct address_space *mapping;
> > +	struct radix_tree_node *node;
> > +	unsigned long *batch = arg;
> > +	unsigned int i;
> > +
> > +	node = container_of(item, struct radix_tree_node, lru);
> > +	mapping = node->private;
> > +
> > +	/* Don't disable IRQs for too long */
> > +	if (--(*batch) == 0) {
> > +		spin_unlock_irq(lru_lock);
> > +		*batch = NOIRQ_BATCH;
> > +		spin_lock_irq(lru_lock);
> > +		return LRU_RETRY;
> > +	}
> > +
> > +	/* Coming from the list, inverse the lock order */
> 
> "invert" ;)

Thanks :)

> > +	if (!spin_trylock(&mapping->tree_lock))
> > +		return LRU_SKIP;
> > +
> > +	/*
> > +	 * The nodes should only contain one or more shadow entries,
> > +	 * no pages, so we expect to be able to remove them all and
> > +	 * delete and free the empty node afterwards.
> > +	 */
> > +
> > +	BUG_ON(!node->count);
> > +	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> > +
> > +	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> > +		if (node->slots[i]) {
> > +			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
> > +			node->slots[i] = NULL;
> > +			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
> > +			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> > +			BUG_ON(!mapping->nrshadows);
> > +			mapping->nrshadows--;
> > +		}
> > +	}
> > +	list_del_init(&node->lru);
> > +	BUG_ON(node->count);
> > +	if (!__radix_tree_delete_node(&mapping->page_tree, node))
> > +		BUG();
> > +
> > +	spin_unlock(&mapping->tree_lock);
> > +
> > +	count_vm_event(WORKINGSET_NODES_RECLAIMED);
> > +
> > +	return LRU_REMOVED;
> > +}
> > +
> >
> > ...

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-11-25 23:49   ` Dave Chinner
@ 2013-11-26 21:27     ` Johannes Weiner
  2013-11-26 22:29       ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Johannes Weiner @ 2013-11-26 21:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Rik van Riel, Jan Kara, Vlastimil Babka,
	Peter Zijlstra, Tejun Heo, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, linux-mm,
	linux-fsdevel, linux-kernel

On Tue, Nov 26, 2013 at 10:49:21AM +1100, Dave Chinner wrote:
> On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.  A
> > simple shrinker will reclaim these nodes on memory pressure.
> > 
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >    from the node up to the tree root, which is needed to perform a
> >    deletion.  To solve this, encode in each node its offset inside the
> >    parent.  This can be stored in the unused upper bits of the same
> >    member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >    regular entries, to quickly detect when the node is ready to go to
> >    the shadow node LRU list.  The current entry count is an unsigned
> >    int but the maximum number of entries is 64, so a shadow counter
> >    can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs the lock, which is located in the address
> >    space, so store a backpointer to it.  The parent pointer is in a
> >    union with the 2-word rcu_head, so the backpointer comes at no
> >    extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >    head inside the node.  This does increase the size of the node, but
> >    it does not change the number of objects that fit into a slab page.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  fs/super.c                    |   4 +-
> >  fs/xfs/xfs_buf.c              |   2 +-
> >  fs/xfs/xfs_qm.c               |   2 +-
> >  include/linux/list_lru.h      |   2 +-
> >  include/linux/radix-tree.h    |  30 +++++++---
> >  include/linux/swap.h          |   1 +
> >  include/linux/vm_event_item.h |   1 +
> >  lib/radix-tree.c              |  36 +++++++-----
> >  mm/filemap.c                  |  70 ++++++++++++++++++++----
> >  mm/list_lru.c                 |   4 +-
> >  mm/truncate.c                 |  19 ++++++-
> >  mm/vmstat.c                   |   2 +
> >  mm/workingset.c               | 124 ++++++++++++++++++++++++++++++++++++++++++
> >  13 files changed, 255 insertions(+), 42 deletions(-)
> > 
> > diff --git a/fs/super.c b/fs/super.c
> > index 0225c20..a958d52 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -196,9 +196,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
> >  		INIT_HLIST_BL_HEAD(&s->s_anon);
> >  		INIT_LIST_HEAD(&s->s_inodes);
> >  
> > -		if (list_lru_init(&s->s_dentry_lru))
> > +		if (list_lru_init(&s->s_dentry_lru, NULL))
> >  			goto err_out;
> > -		if (list_lru_init(&s->s_inode_lru))
> > +		if (list_lru_init(&s->s_inode_lru, NULL))
> >  			goto err_out_dentry_lru;
> 
> rather than modifying all the callers of list_lru_init(), can you
> add a new function list_lru_init_key() and implement list_lru_init()
> as a wrapper around it?

Ok, sure.

> >  static int page_cache_tree_insert(struct address_space *mapping,
> >  				  struct page *page, void **shadowp)
> >  {
> ....
> > +	radix_tree_replace_slot(slot, page);
> > +	if (node) {
> > +		node->count++;
> > +		/* Installed page, can't be shadow-only anymore */
> > +		if (!list_empty(&node->lru))
> > +			list_lru_del(&workingset_shadow_nodes, &node->lru);
> > +	}
> > +	return 0;
> 
> Hmmmmm - what's the overhead of direct management of LRU removal
> here? Most list_lru code uses lazy removal (i.e. via the shrinker)
> to avoid having to touch the LRU when adding new references to an
> object.....

It's measurable in microbenchmarks, but not when any real IO is
involved.  The difference was in the noise even on SSD drives.

The other list_lru users see items only once they become unused and
subsequent references are expected to be few and temporary, right?

We expect pages to refault in spades on certain loads, at which point
we may have thousands of those nodes on the list that are no longer
reclaimable (10k nodes for about 2.5G of cache).

> > + * Page cache radix tree nodes containing only shadow entries can grow
> > + * excessively on certain workloads.  That's why they are tracked on
> > + * per-(NUMA)node lists and pushed back by a shrinker, but with a
> > + * slightly higher threshold than regular shrinkers so we don't
> > + * discard the entries too eagerly - after all, during light memory
> > + * pressure is exactly when we need them.
> > + *
> > + * The list_lru lock nests inside the IRQ-safe mapping->tree_lock, so
> > + * we have to disable IRQs for any list_lru operation as well.
> > + */
> > +
> > +struct list_lru workingset_shadow_nodes;
> > +
> > +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> > +					struct shrink_control *sc)
> > +{
> > +	unsigned long count;
> > +
> > +	local_irq_disable();
> > +	count = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> > +	local_irq_enable();
> 
> The count returned is not perfectly accurate, and the use of it in
> the shrinker will be concurrent with other modifications, so
> disabling IRQs here doesn't add any anything but unnecessary
> overhead.

Lockdep complains when taking an IRQ-unsafe lock (lru_lock) inside an
IRQ-safe lock (mapping->tree_lock).

end_page_writeback should not modify the radix tree beyond tags and so
should never try to acquire the lru_lock.  It would be good if we
could annotate lockdep accordingly and get rid of the IRQ-disabling.

> > +#define NOIRQ_BATCH 32
> > +
> > +static enum lru_status shadow_lru_isolate(struct list_head *item,
> > +					  spinlock_t *lru_lock,
> > +					  void *arg)
> > +{
> > +	struct address_space *mapping;
> > +	struct radix_tree_node *node;
> > +	unsigned long *batch = arg;
> > +	unsigned int i;
> > +
> > +	node = container_of(item, struct radix_tree_node, lru);
> > +	mapping = node->private;
> > +
> > +	/* Don't disable IRQs for too long */
> > +	if (--(*batch) == 0) {
> > +		spin_unlock_irq(lru_lock);
> > +		*batch = NOIRQ_BATCH;
> > +		spin_lock_irq(lru_lock);
> > +		return LRU_RETRY;
> > +	}
> 
> Ugh.
> 
> > +	/* Coming from the list, inverse the lock order */
> > +	if (!spin_trylock(&mapping->tree_lock))
> > +		return LRU_SKIP;
> 
> Why not spin_trylock_irq(&mapping->tree_lock) and get rid of the
> nasty irq batching stuff? The LRU list is internally consistent,
> so I don't see why irqs need to be disabled to walk across the
> objects in the list - we only need that to avoid taking an interrupt
> while holding the mapping->tree_lock() and the interrupt running
> I/O completion which may try to take the mapping->tree_lock....

Same reason, IRQ-unsafe nesting inside IRQ-safe lock...

> > +	/*
> > +	 * The nodes should only contain one or more shadow entries,
> > +	 * no pages, so we expect to be able to remove them all and
> > +	 * delete and free the empty node afterwards.
> > +	 */
> > +
> > +	BUG_ON(!node->count);
> > +	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> > +
> > +	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> > +		if (node->slots[i]) {
> > +			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
> > +			node->slots[i] = NULL;
> > +			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
> > +			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> > +			BUG_ON(!mapping->nrshadows);
> > +			mapping->nrshadows--;
> > +		}
> > +	}
> > +	list_del_init(&node->lru);
> > +	BUG_ON(node->count);
> > +	if (!__radix_tree_delete_node(&mapping->page_tree, node))
> > +		BUG();
> 
> That's a lot of work to be doing under the LRU spinlock and with
> irqs disabled. That's going to cause hold-off issues for other LRU
> operations on the node, and other operations on the CPU....
> 
> Given that we should always be removing the item from the head of
> the LRU list (except when we can't get the mapping lock), I'd
> suggest that it would be better to do something like this:
> 
> 	/*
> 	 * Coming from the list, inverse the lock order. Drop the
> 	 * list lock, too, so that if a caller is spinning on it we
> 	 * don't get stuck here.
> 	 */
> 	if (!spin_trylock(&mapping->tree_lock)) {
> 		spin_unlock(lru_lock);
> 		goto out_retry;
> 	}
> 
> 	/*
> 	 * The nodes should only contain one or more shadow entries,
> 	 * no pages, so we expect to be able to remove them all and
> 	 * delete and free the empty node afterwards.
> 	 */
> 	list_del_init(&node->lru);
> 	spin_unlock(lru_lock);
> 
> 	BUG_ON(!node->count);
> 	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> .....
> 	if (!__radix_tree_delete_node(&mapping->page_tree, node))
> 		BUG();
> 
> 	spin_unlock_irq(&mapping->tree_lock);
> 	count_vm_event(WORKINGSET_NODES_RECLAIMED);
> 
> out_retry:
> 	cond_resched();
> 	spin_lock(lru_lock);
> 	return LRU_RETRY;
> }

Yes, that should work.  I'll update the patch, thanks.

> So that we don't hold off other LRU operations, we don't hold IRQs
> disabled for too long, and we don't cause too much scheduler latency
> when doing long scans...

Yep.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-11-24 23:38 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
  2013-11-25 23:49   ` Dave Chinner
@ 2013-11-26  0:13   ` Andrew Morton
  2013-11-26 22:05     ` Johannes Weiner
  1 sibling, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2013-11-26  0:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, Rik van Riel, Jan Kara, Vlastimil Babka,
	Peter Zijlstra, Tejun Heo, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, linux-mm,
	linux-fsdevel, linux-kernel

On Sun, 24 Nov 2013 18:38:28 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.

Why per-node rather than a global list?

>  A simple shrinker will reclaim these nodes on memory pressure.

Truncate needs to go off and massacre these things as well - some
description of how that happens would be useful.

> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:
> 
> 1. There is no index available that would describe the reverse path
>    from the node up to the tree root, which is needed to perform a
>    deletion.  To solve this, encode in each node its offset inside the
>    parent.  This can be stored in the unused upper bits of the same
>    member that stores the node's height at no extra space cost.
> 
> 2. The number of shadow entries needs to be counted in addition to the
>    regular entries, to quickly detect when the node is ready to go to
>    the shadow node LRU list.  The current entry count is an unsigned
>    int but the maximum number of entries is 64, so a shadow counter
>    can easily be stored in the unused upper bits.
> 
> 3. Tree modification needs the lock, which is located in the address
>    space,

Presumably "the lock" == tree_lock.

>    so store a backpointer to it.

<looks at the code>

"it" is the address_space, not tree_lock, yes?

>   The parent pointer is in a
>    union with the 2-word rcu_head, so the backpointer comes at no
>    extra cost as well.

So we have a shrinker walking backwards from radix-tree nodes and
reaching up into address_spaces.  We need to take steps to prevent
those address_spaces from getting shot down (reclaim, umount, truncate,
etc) while we're doing this.  What's happening here?

> 4. The node needs to be linked to an LRU list, which requires a list
>    head inside the node.  This does increase the size of the node, but
>    it does not change the number of objects that fit into a slab page.
>
> ...
>
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -32,7 +32,7 @@ struct list_lru {
>  };
>  
>  void list_lru_destroy(struct list_lru *lru);
> -int list_lru_init(struct list_lru *lru);
> +int list_lru_init(struct list_lru *lru, struct lock_class_key *key);

It's a bit of a shame to be adding overhead to non-lockdep kernels.  A
few ifdefs could fix this.

Presumably this is being done to squish some lockdep warning you hit. 
A comment at the list_lru_init() implementation site would be useful. 
One which describes the warning and why it's OK to squish it.

>
> ...
>
>  struct radix_tree_node {
> -	unsigned int	height;		/* Height from the bottom */
> +	unsigned int	path;	/* Offset in parent & height from the bottom */
>  	unsigned int	count;
>  	union {
> -		struct radix_tree_node *parent;	/* Used when ascending tree */
> -		struct rcu_head	rcu_head;	/* Used when freeing node */
> +		/* Used when ascending tree */
> +		struct {
> +			struct radix_tree_node *parent;
> +			void *private;

Private to whom?  The radix-tree implementation?  The radix-tree caller?

> +		};
> +		/* Used when freeing node */
> +		struct rcu_head	rcu_head;
>  	};
> +	struct list_head lru;

Locking for this list?

>  	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
>  	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
>  };
>  
>
> ...
>
> +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> +					struct shrink_control *sc)
> +{
> +	unsigned long count;
> +
> +	local_irq_disable();
> +	count = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> +	local_irq_enable();

I'm struggling with the local_irq_disable() here.  Presumably it's
there to quash a lockdep warning, but page_cache_tree_delete() and
friends can get away without the local_irq_disable().  Some more
clarity here would be nice.

> +	return count;
> +}
> +
> +#define NOIRQ_BATCH 32
> +
> +static enum lru_status shadow_lru_isolate(struct list_head *item,
> +					  spinlock_t *lru_lock,
> +					  void *arg)
> +{
> +	struct address_space *mapping;
> +	struct radix_tree_node *node;
> +	unsigned long *batch = arg;
> +	unsigned int i;
> +
> +	node = container_of(item, struct radix_tree_node, lru);
> +	mapping = node->private;
> +
> +	/* Don't disable IRQs for too long */
> +	if (--(*batch) == 0) {
> +		spin_unlock_irq(lru_lock);
> +		*batch = NOIRQ_BATCH;
> +		spin_lock_irq(lru_lock);
> +		return LRU_RETRY;
> +	}
> +
> +	/* Coming from the list, inverse the lock order */

"invert" ;)

> +	if (!spin_trylock(&mapping->tree_lock))
> +		return LRU_SKIP;
> +
> +	/*
> +	 * The nodes should only contain one or more shadow entries,
> +	 * no pages, so we expect to be able to remove them all and
> +	 * delete and free the empty node afterwards.
> +	 */
> +
> +	BUG_ON(!node->count);
> +	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> +
> +	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> +		if (node->slots[i]) {
> +			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
> +			node->slots[i] = NULL;
> +			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
> +			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> +			BUG_ON(!mapping->nrshadows);
> +			mapping->nrshadows--;
> +		}
> +	}
> +	list_del_init(&node->lru);
> +	BUG_ON(node->count);
> +	if (!__radix_tree_delete_node(&mapping->page_tree, node))
> +		BUG();
> +
> +	spin_unlock(&mapping->tree_lock);
> +
> +	count_vm_event(WORKINGSET_NODES_RECLAIMED);
> +
> +	return LRU_REMOVED;
> +}
> +
>
> ...
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-11-24 23:38 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
@ 2013-11-25 23:49   ` Dave Chinner
  2013-11-26 21:27     ` Johannes Weiner
  2013-11-26  0:13   ` Andrew Morton
  1 sibling, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2013-11-25 23:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Jan Kara, Vlastimil Babka,
	Peter Zijlstra, Tejun Heo, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, linux-mm,
	linux-fsdevel, linux-kernel

On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.  A
> simple shrinker will reclaim these nodes on memory pressure.
> 
> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:
> 
> 1. There is no index available that would describe the reverse path
>    from the node up to the tree root, which is needed to perform a
>    deletion.  To solve this, encode in each node its offset inside the
>    parent.  This can be stored in the unused upper bits of the same
>    member that stores the node's height at no extra space cost.
> 
> 2. The number of shadow entries needs to be counted in addition to the
>    regular entries, to quickly detect when the node is ready to go to
>    the shadow node LRU list.  The current entry count is an unsigned
>    int but the maximum number of entries is 64, so a shadow counter
>    can easily be stored in the unused upper bits.
> 
> 3. Tree modification needs the lock, which is located in the address
>    space, so store a backpointer to it.  The parent pointer is in a
>    union with the 2-word rcu_head, so the backpointer comes at no
>    extra cost as well.
> 
> 4. The node needs to be linked to an LRU list, which requires a list
>    head inside the node.  This does increase the size of the node, but
>    it does not change the number of objects that fit into a slab page.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/super.c                    |   4 +-
>  fs/xfs/xfs_buf.c              |   2 +-
>  fs/xfs/xfs_qm.c               |   2 +-
>  include/linux/list_lru.h      |   2 +-
>  include/linux/radix-tree.h    |  30 +++++++---
>  include/linux/swap.h          |   1 +
>  include/linux/vm_event_item.h |   1 +
>  lib/radix-tree.c              |  36 +++++++-----
>  mm/filemap.c                  |  70 ++++++++++++++++++++----
>  mm/list_lru.c                 |   4 +-
>  mm/truncate.c                 |  19 ++++++-
>  mm/vmstat.c                   |   2 +
>  mm/workingset.c               | 124 ++++++++++++++++++++++++++++++++++++++++++
>  13 files changed, 255 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 0225c20..a958d52 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -196,9 +196,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
>  		INIT_HLIST_BL_HEAD(&s->s_anon);
>  		INIT_LIST_HEAD(&s->s_inodes);
>  
> -		if (list_lru_init(&s->s_dentry_lru))
> +		if (list_lru_init(&s->s_dentry_lru, NULL))
>  			goto err_out;
> -		if (list_lru_init(&s->s_inode_lru))
> +		if (list_lru_init(&s->s_inode_lru, NULL))
>  			goto err_out_dentry_lru;

rather than modifying all the callers of list_lru_init(), can you
add a new function list_lru_init_key() and implement list_lru_init()
as a wrapper around it?

[snip radix tree modifications I didn't look at]

>  static int page_cache_tree_insert(struct address_space *mapping,
>  				  struct page *page, void **shadowp)
>  {
....
> +	radix_tree_replace_slot(slot, page);
> +	if (node) {
> +		node->count++;
> +		/* Installed page, can't be shadow-only anymore */
> +		if (!list_empty(&node->lru))
> +			list_lru_del(&workingset_shadow_nodes, &node->lru);
> +	}
> +	return 0;

Hmmmmm - what's the overhead of direct management of LRU removal
here? Most list_lru code uses lazy removal (i.e. via the shrinker)
to avoid having to touch the LRU when adding new references to an
object.....

> +
> +/*
> + * Page cache radix tree nodes containing only shadow entries can grow
> + * excessively on certain workloads.  That's why they are tracked on
> + * per-(NUMA)node lists and pushed back by a shrinker, but with a
> + * slightly higher threshold than regular shrinkers so we don't
> + * discard the entries too eagerly - after all, during light memory
> + * pressure is exactly when we need them.
> + *
> + * The list_lru lock nests inside the IRQ-safe mapping->tree_lock, so
> + * we have to disable IRQs for any list_lru operation as well.
> + */
> +
> +struct list_lru workingset_shadow_nodes;
> +
> +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> +					struct shrink_control *sc)
> +{
> +	unsigned long count;
> +
> +	local_irq_disable();
> +	count = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> +	local_irq_enable();

The count returned is not perfectly accurate, and the use of it in
the shrinker will be concurrent with other modifications, so
disabling IRQs here doesn't add any anything but unnecessary
overhead.

> +#define NOIRQ_BATCH 32
> +
> +static enum lru_status shadow_lru_isolate(struct list_head *item,
> +					  spinlock_t *lru_lock,
> +					  void *arg)
> +{
> +	struct address_space *mapping;
> +	struct radix_tree_node *node;
> +	unsigned long *batch = arg;
> +	unsigned int i;
> +
> +	node = container_of(item, struct radix_tree_node, lru);
> +	mapping = node->private;
> +
> +	/* Don't disable IRQs for too long */
> +	if (--(*batch) == 0) {
> +		spin_unlock_irq(lru_lock);
> +		*batch = NOIRQ_BATCH;
> +		spin_lock_irq(lru_lock);
> +		return LRU_RETRY;
> +	}

Ugh.

> +	/* Coming from the list, inverse the lock order */
> +	if (!spin_trylock(&mapping->tree_lock))
> +		return LRU_SKIP;

Why not spin_trylock_irq(&mapping->tree_lock) and get rid of the
nasty irq batching stuff? The LRU list is internally consistent,
so I don't see why irqs need to be disabled to walk across the
objects in the list - we only need that to avoid taking an interrupt
while holding the mapping->tree_lock() and the interrupt running
I/O completion which may try to take the mapping->tree_lock....

> +	/*
> +	 * The nodes should only contain one or more shadow entries,
> +	 * no pages, so we expect to be able to remove them all and
> +	 * delete and free the empty node afterwards.
> +	 */
> +
> +	BUG_ON(!node->count);
> +	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> +
> +	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> +		if (node->slots[i]) {
> +			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
> +			node->slots[i] = NULL;
> +			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
> +			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> +			BUG_ON(!mapping->nrshadows);
> +			mapping->nrshadows--;
> +		}
> +	}
> +	list_del_init(&node->lru);
> +	BUG_ON(node->count);
> +	if (!__radix_tree_delete_node(&mapping->page_tree, node))
> +		BUG();

That's a lot of work to be doing under the LRU spinlock and with
irqs disabled. That's going to cause hold-off issues for other LRU
operations on the node, and other operations on the CPU....

Given that we should always be removing the item from the head of
the LRU list (except when we can't get the mapping lock), I'd
suggest that it would be better to do something like this:

	/*
	 * Coming from the list, inverse the lock order. Drop the
	 * list lock, too, so that if a caller is spinning on it we
	 * don't get stuck here.
	 */
	if (!spin_trylock(&mapping->tree_lock)) {
		spin_unlock(lru_lock);
		goto out_retry;
	}

	/*
	 * The nodes should only contain one or more shadow entries,
	 * no pages, so we expect to be able to remove them all and
	 * delete and free the empty node afterwards.
	 */
	list_del_init(&node->lru);
	spin_unlock(lru_lock);

	BUG_ON(!node->count);
	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
.....
	if (!__radix_tree_delete_node(&mapping->page_tree, node))
		BUG();

	spin_unlock_irq(&mapping->tree_lock);
	count_vm_event(WORKINGSET_NODES_RECLAIMED);

out_retry:
	cond_resched();
	spin_lock(lru_lock);
	return LRU_RETRY;
}

So that we don't hold off other LRU operations, we don't hold IRQs
disabled for too long, and we don't cause too much scheduler latency
when doing long scans...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [patch 9/9] mm: keep page cache radix tree nodes in check
  2013-11-24 23:38 [patch 0/9] mm: thrash detection-based file cache sizing v6 Johannes Weiner
@ 2013-11-24 23:38 ` Johannes Weiner
  2013-11-25 23:49   ` Dave Chinner
  2013-11-26  0:13   ` Andrew Morton
  0 siblings, 2 replies; 58+ messages in thread
From: Johannes Weiner @ 2013-11-24 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Rik van Riel, Jan Kara, Vlastimil Babka,
	Peter Zijlstra, Tejun Heo, Andi Kleen, Andrea Arcangeli,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, linux-mm,
	linux-fsdevel, linux-kernel

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.  A
simple shrinker will reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
   from the node up to the tree root, which is needed to perform a
   deletion.  To solve this, encode in each node its offset inside the
   parent.  This can be stored in the unused upper bits of the same
   member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
   regular entries, to quickly detect when the node is ready to go to
   the shadow node LRU list.  The current entry count is an unsigned
   int but the maximum number of entries is 64, so a shadow counter
   can easily be stored in the unused upper bits.

3. Tree modification needs the lock, which is located in the address
   space, so store a backpointer to it.  The parent pointer is in a
   union with the 2-word rcu_head, so the backpointer comes at no
   extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
   head inside the node.  This does increase the size of the node, but
   it does not change the number of objects that fit into a slab page.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/super.c                    |   4 +-
 fs/xfs/xfs_buf.c              |   2 +-
 fs/xfs/xfs_qm.c               |   2 +-
 include/linux/list_lru.h      |   2 +-
 include/linux/radix-tree.h    |  30 +++++++---
 include/linux/swap.h          |   1 +
 include/linux/vm_event_item.h |   1 +
 lib/radix-tree.c              |  36 +++++++-----
 mm/filemap.c                  |  70 ++++++++++++++++++++----
 mm/list_lru.c                 |   4 +-
 mm/truncate.c                 |  19 ++++++-
 mm/vmstat.c                   |   2 +
 mm/workingset.c               | 124 ++++++++++++++++++++++++++++++++++++++++++
 13 files changed, 255 insertions(+), 42 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 0225c20..a958d52 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -196,9 +196,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
 
-		if (list_lru_init(&s->s_dentry_lru))
+		if (list_lru_init(&s->s_dentry_lru, NULL))
 			goto err_out;
-		if (list_lru_init(&s->s_inode_lru))
+		if (list_lru_init(&s->s_inode_lru, NULL))
 			goto err_out_dentry_lru;
 
 		INIT_LIST_HEAD(&s->s_mounts);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 2634700..c49cbce 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1670,7 +1670,7 @@ xfs_alloc_buftarg(
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
 
-	if (list_lru_init(&btp->bt_lru))
+	if (list_lru_init(&btp->bt_lru, NULL))
 		goto error;
 
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 3e6c2e6..57d6aa9 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -831,7 +831,7 @@ xfs_qm_init_quotainfo(
 
 	qinf = mp->m_quotainfo = kmem_zalloc(sizeof(xfs_quotainfo_t), KM_SLEEP);
 
-	if ((error = list_lru_init(&qinf->qi_lru))) {
+	if ((error = list_lru_init(&qinf->qi_lru, NULL))) {
 		kmem_free(qinf);
 		mp->m_quotainfo = NULL;
 		return error;
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce5417..b970a45 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -32,7 +32,7 @@ struct list_lru {
 };
 
 void list_lru_destroy(struct list_lru *lru);
-int list_lru_init(struct list_lru *lru);
+int list_lru_init(struct list_lru *lru, struct lock_class_key *key);
 
 /**
  * list_lru_add: add an element to the lru list's tail
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 13636c4..29df11f 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -72,21 +72,35 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 #define RADIX_TREE_TAG_LONGS	\
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
+#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
+#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
+					  RADIX_TREE_MAP_SHIFT))
+
+/* Height component in node->path */
+#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
+#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
+
+/* Internally used bits of node->count */
+#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
+#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
+
 struct radix_tree_node {
-	unsigned int	height;		/* Height from the bottom */
+	unsigned int	path;	/* Offset in parent & height from the bottom */
 	unsigned int	count;
 	union {
-		struct radix_tree_node *parent;	/* Used when ascending tree */
-		struct rcu_head	rcu_head;	/* Used when freeing node */
+		/* Used when ascending tree */
+		struct {
+			struct radix_tree_node *parent;
+			void *private;
+		};
+		/* Used when freeing node */
+		struct rcu_head	rcu_head;
 	};
+	struct list_head lru;
 	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
 };
 
-#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
-#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
-					  RADIX_TREE_MAP_SHIFT))
-
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
 	unsigned int		height;
@@ -251,7 +265,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 			  struct radix_tree_node **nodep, void ***slotp);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
-bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+bool __radix_tree_delete_node(struct radix_tree_root *root,
 			      struct radix_tree_node *node);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b83cf61..102e37b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -264,6 +264,7 @@ struct swap_list_t {
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
+extern struct list_lru workingset_shadow_nodes;
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1855f0a..0b15c59 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -76,6 +76,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
 		NR_TLB_LOCAL_FLUSH_ALL,
 		NR_TLB_LOCAL_FLUSH_ONE,
+		WORKINGSET_NODES_RECLAIMED,
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e601c56..1865cd2 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -342,7 +342,8 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)
 
 		/* Increase the height.  */
 		newheight = root->height+1;
-		node->height = newheight;
+		BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
+		node->path = newheight;
 		node->count = 1;
 		node->parent = NULL;
 		slot = root->rnode;
@@ -400,11 +401,12 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
 			/* Have to add a child node.  */
 			if (!(slot = radix_tree_node_alloc(root)))
 				return -ENOMEM;
-			slot->height = height;
+			slot->path = height;
 			slot->parent = node;
 			if (node) {
 				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
+				slot->path |= offset << RADIX_TREE_HEIGHT_SHIFT;
 			} else
 				rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
 		}
@@ -496,7 +498,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
 	}
 	node = indirect_to_ptr(node);
 
-	height = node->height;
+	height = node->path & RADIX_TREE_HEIGHT_MASK;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
@@ -702,7 +704,7 @@ int radix_tree_tag_get(struct radix_tree_root *root,
 		return (index == 0);
 	node = indirect_to_ptr(node);
 
-	height = node->height;
+	height = node->path & RADIX_TREE_HEIGHT_MASK;
 	if (index > radix_tree_maxindex(height))
 		return 0;
 
@@ -739,7 +741,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
 {
 	unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK;
 	struct radix_tree_node *rnode, *node;
-	unsigned long index, offset;
+	unsigned long index, offset, height;
 
 	if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag))
 		return NULL;
@@ -770,7 +772,8 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
 		return NULL;
 
 restart:
-	shift = (rnode->height - 1) * RADIX_TREE_MAP_SHIFT;
+	height = rnode->path & RADIX_TREE_HEIGHT_MASK;
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	offset = index >> shift;
 
 	/* Index outside of the tree */
@@ -1140,7 +1143,7 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item,
 	unsigned int shift, height;
 	unsigned long i;
 
-	height = slot->height;
+	height = slot->path & RADIX_TREE_HEIGHT_MASK;
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
 	for ( ; height > 1; height--) {
@@ -1203,7 +1206,8 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item)
 		}
 
 		node = indirect_to_ptr(node);
-		max_index = radix_tree_maxindex(node->height);
+		max_index = radix_tree_maxindex(node->path &
+						RADIX_TREE_HEIGHT_MASK);
 		if (cur_index > max_index)
 			break;
 
@@ -1297,7 +1301,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
  *
  *	Returns %true if @node was freed, %false otherwise.
  */
-bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+bool __radix_tree_delete_node(struct radix_tree_root *root,
 			      struct radix_tree_node *node)
 {
 	bool deleted = false;
@@ -1316,9 +1320,10 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
 
 		parent = node->parent;
 		if (parent) {
-			index >>= RADIX_TREE_MAP_SHIFT;
+			unsigned int offset;
 
-			parent->slots[index & RADIX_TREE_MAP_MASK] = NULL;
+			offset = node->path >> RADIX_TREE_HEIGHT_SHIFT;
+			parent->slots[offset] = NULL;
 			parent->count--;
 		} else {
 			root_tag_clear_all(root);
@@ -1382,7 +1387,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
 	node->slots[offset] = NULL;
 	node->count--;
 
-	__radix_tree_delete_node(root, index, node);
+	__radix_tree_delete_node(root, node);
 
 	return entry;
 }
@@ -1415,9 +1420,12 @@ int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag)
 EXPORT_SYMBOL(radix_tree_tagged);
 
 static void
-radix_tree_node_ctor(void *node)
+radix_tree_node_ctor(void *arg)
 {
-	memset(node, 0, sizeof(struct radix_tree_node));
+	struct radix_tree_node *node = arg;
+
+	memset(node, 0, sizeof(*node));
+	INIT_LIST_HEAD(&node->lru);
 }
 
 static __init unsigned long __maxindex(unsigned int height)
diff --git a/mm/filemap.c b/mm/filemap.c
index 30a74be..79a7546 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,14 +110,48 @@
 static void page_cache_tree_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
-	if (shadow) {
-		void **slot;
+	struct radix_tree_node *node;
+	unsigned long index;
+	unsigned int offset;
+	unsigned int tag;
+	void **slot;
 
-		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
-		radix_tree_replace_slot(slot, shadow);
+	VM_BUG_ON(!PageLocked(page));
+
+	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+
+	if (shadow)
 		mapping->nrshadows++;
-	} else
-		radix_tree_delete(&mapping->page_tree, page->index);
+
+	if (!node) {
+		/* Clear direct pointer tags in root node */
+		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
+		radix_tree_replace_slot(slot, shadow);
+		return;
+	}
+
+	/* Clear tree tags for the removed page */
+	index = page->index;
+	offset = index & RADIX_TREE_MAP_MASK;
+	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+		if (test_bit(offset, node->tags[tag]))
+			radix_tree_tag_clear(&mapping->page_tree, index, tag);
+	}
+
+	/* Delete page, swap shadow entry */
+	radix_tree_replace_slot(slot, shadow);
+	node->count--;
+	if (shadow)
+		node->count += 1U << RADIX_TREE_COUNT_SHIFT;
+	else
+		if (__radix_tree_delete_node(&mapping->page_tree, node))
+			return;
+
+	/* Only shadow entries in there, keep track of this node */
+	if (!(node->count & RADIX_TREE_COUNT_MASK) && list_empty(&node->lru)) {
+		node->private = mapping;
+		list_lru_add(&workingset_shadow_nodes, &node->lru);
+	}
 }
 
 /*
@@ -463,22 +497,34 @@ EXPORT_SYMBOL_GPL(replace_page_cache_page);
 static int page_cache_tree_insert(struct address_space *mapping,
 				  struct page *page, void **shadowp)
 {
+	struct radix_tree_node *node;
 	void **slot;
+	int error;
 
-	slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
-	if (slot) {
+	error = __radix_tree_create(&mapping->page_tree, page->index,
+				    &node, &slot);
+	if (error)
+		return error;
+	if (*slot) {
 		void *p;
 
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
-		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
 		if (shadowp)
 			*shadowp = p;
-		return 0;
+		mapping->nrshadows--;
+		if (node)
+			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
 	}
-	return radix_tree_insert(&mapping->page_tree, page->index, page);
+	radix_tree_replace_slot(slot, page);
+	if (node) {
+		node->count++;
+		/* Installed page, can't be shadow-only anymore */
+		if (!list_empty(&node->lru))
+			list_lru_del(&workingset_shadow_nodes, &node->lru);
+	}
+	return 0;
 }
 
 static int __add_to_page_cache_locked(struct page *page,
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 72f9dec..c357e8f 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -114,7 +114,7 @@ restart:
 }
 EXPORT_SYMBOL_GPL(list_lru_walk_node);
 
-int list_lru_init(struct list_lru *lru)
+int list_lru_init(struct list_lru *lru, struct lock_class_key *key)
 {
 	int i;
 	size_t size = sizeof(*lru->node) * nr_node_ids;
@@ -126,6 +126,8 @@ int list_lru_init(struct list_lru *lru)
 	nodes_clear(lru->active_nodes);
 	for (i = 0; i < nr_node_ids; i++) {
 		spin_lock_init(&lru->node[i].lock);
+		if (key)
+			lockdep_set_class(&lru->node[i].lock, key);
 		INIT_LIST_HEAD(&lru->node[i].list);
 		lru->node[i].nr_items = 0;
 	}
diff --git a/mm/truncate.c b/mm/truncate.c
index cbd0167..9cf5f88 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -25,6 +25,9 @@
 static void clear_exceptional_entry(struct address_space *mapping,
 				    pgoff_t index, void *entry)
 {
+	struct radix_tree_node *node;
+	void **slot;
+
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
 		return;
@@ -35,8 +38,20 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * without the tree itself locked.  These unlocked entries
 	 * need verification under the tree lock.
 	 */
-	if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
-		mapping->nrshadows--;
+	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+	radix_tree_replace_slot(slot, NULL);
+	mapping->nrshadows--;
+	if (!node)
+		goto unlock;
+	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+	/* No more shadow entries, stop tracking the node */
+	if (!(node->count >> RADIX_TREE_COUNT_SHIFT) && !list_empty(&node->lru))
+		list_lru_del(&workingset_shadow_nodes, &node->lru);
+	__radix_tree_delete_node(&mapping->page_tree, node);
+unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3ac830d..c5f33d2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -859,6 +859,8 @@ const char * const vmstat_text[] = {
 	"nr_tlb_local_flush_all",
 	"nr_tlb_local_flush_one",
 
+	"workingset_nodes_reclaimed",
+
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
diff --git a/mm/workingset.c b/mm/workingset.c
index 478060f..ba8f0dd 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -251,3 +251,127 @@ void workingset_activation(struct page *page)
 {
 	atomic_long_inc(&page_zone(page)->inactive_age);
 }
+
+/*
+ * Page cache radix tree nodes containing only shadow entries can grow
+ * excessively on certain workloads.  That's why they are tracked on
+ * per-(NUMA)node lists and pushed back by a shrinker, but with a
+ * slightly higher threshold than regular shrinkers so we don't
+ * discard the entries too eagerly - after all, during light memory
+ * pressure is exactly when we need them.
+ *
+ * The list_lru lock nests inside the IRQ-safe mapping->tree_lock, so
+ * we have to disable IRQs for any list_lru operation as well.
+ */
+
+struct list_lru workingset_shadow_nodes;
+
+static unsigned long count_shadow_nodes(struct shrinker *shrinker,
+					struct shrink_control *sc)
+{
+	unsigned long count;
+
+	local_irq_disable();
+	count = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
+	local_irq_enable();
+
+	return count;
+}
+
+#define NOIRQ_BATCH 32
+
+static enum lru_status shadow_lru_isolate(struct list_head *item,
+					  spinlock_t *lru_lock,
+					  void *arg)
+{
+	struct address_space *mapping;
+	struct radix_tree_node *node;
+	unsigned long *batch = arg;
+	unsigned int i;
+
+	node = container_of(item, struct radix_tree_node, lru);
+	mapping = node->private;
+
+	/* Don't disable IRQs for too long */
+	if (--(*batch) == 0) {
+		spin_unlock_irq(lru_lock);
+		*batch = NOIRQ_BATCH;
+		spin_lock_irq(lru_lock);
+		return LRU_RETRY;
+	}
+
+	/* Coming from the list, inverse the lock order */
+	if (!spin_trylock(&mapping->tree_lock))
+		return LRU_SKIP;
+
+	/*
+	 * The nodes should only contain one or more shadow entries,
+	 * no pages, so we expect to be able to remove them all and
+	 * delete and free the empty node afterwards.
+	 */
+
+	BUG_ON(!node->count);
+	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
+
+	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
+		if (node->slots[i]) {
+			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
+			node->slots[i] = NULL;
+			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
+			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+			BUG_ON(!mapping->nrshadows);
+			mapping->nrshadows--;
+		}
+	}
+	list_del_init(&node->lru);
+	BUG_ON(node->count);
+	if (!__radix_tree_delete_node(&mapping->page_tree, node))
+		BUG();
+
+	spin_unlock(&mapping->tree_lock);
+
+	count_vm_event(WORKINGSET_NODES_RECLAIMED);
+
+	return LRU_REMOVED;
+}
+
+static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
+				       struct shrink_control *sc)
+{
+	unsigned long batch = NOIRQ_BATCH;
+	unsigned long freed;
+
+	local_irq_disable();
+	freed = list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
+				   shadow_lru_isolate, &batch, &sc->nr_to_scan);
+	local_irq_enable();
+
+	return freed;
+}
+
+static struct shrinker workingset_shadow_shrinker = {
+	.count_objects = count_shadow_nodes,
+	.scan_objects = scan_shadow_nodes,
+	.seeks = DEFAULT_SEEKS * 4,
+	.flags = SHRINKER_NUMA_AWARE,
+};
+
+static struct lock_class_key shadow_nodes_key;
+
+static int __init workingset_init(void)
+{
+	int ret;
+
+	ret = list_lru_init(&workingset_shadow_nodes, &shadow_nodes_key);
+	if (ret)
+		goto err;
+	ret = register_shrinker(&workingset_shadow_shrinker);
+	if (ret)
+		goto err_list_lru;
+	return 0;
+err_list_lru:
+	list_lru_destroy(&workingset_shadow_nodes);
+err:
+	return ret;
+}
+module_init(workingset_init);
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2014-03-12  1:16 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-10 18:10 [patch 0/9] mm: thrash detection-based file cache sizing v8 Johannes Weiner
2014-01-10 18:10 ` [patch 1/9] fs: cachefiles: use add_to_page_cache_lru() Johannes Weiner
2014-01-13  1:17   ` Minchan Kim
2014-01-10 18:10 ` [patch 2/9] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
2014-01-10 18:10 ` [patch 3/9] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
2014-01-10 18:25   ` Rik van Riel
2014-01-10 18:10 ` [patch 4/9] mm: filemap: move radix tree hole searching here Johannes Weiner
2014-01-10 19:22   ` Rik van Riel
2014-01-13  1:25   ` Minchan Kim
2014-01-10 18:10 ` [patch 5/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
2014-01-10 19:39   ` Rik van Riel
2014-01-13  2:01   ` Minchan Kim
2014-01-22 17:47     ` Johannes Weiner
2014-01-23  5:07       ` Minchan Kim
2014-02-12 14:00   ` Mel Gorman
2014-03-12  1:15     ` Johannes Weiner
2014-01-10 18:10 ` [patch 6/9] mm + fs: store shadow entries in page cache Johannes Weiner
2014-01-10 22:30   ` Rik van Riel
2014-01-13  2:18   ` Minchan Kim
2014-01-10 18:10 ` [patch 7/9] mm: thrash detection-based file cache sizing Johannes Weiner
2014-01-10 22:51   ` Rik van Riel
2014-01-13  2:42   ` Minchan Kim
2014-01-14  1:01   ` Bob Liu
2014-01-14 19:16     ` Johannes Weiner
2014-01-15  2:57       ` Bob Liu
2014-01-15  3:52         ` Zhang Yanfei
2014-01-16 21:17         ` Johannes Weiner
2014-01-10 18:10 ` [patch 8/9] lib: radix_tree: tree node interface Johannes Weiner
2014-01-10 22:57   ` Rik van Riel
2014-01-10 18:10 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
2014-01-10 23:09   ` Rik van Riel
2014-01-13  7:39   ` Minchan Kim
2014-01-14  5:40     ` Minchan Kim
2014-01-22 18:42     ` Johannes Weiner
2014-01-23  5:20       ` Minchan Kim
2014-01-23 19:22         ` Johannes Weiner
2014-01-27  2:31           ` Minchan Kim
2014-01-15  5:55   ` Bob Liu
2014-01-16 22:09     ` Johannes Weiner
2014-01-17  0:05   ` Dave Chinner
2014-01-20 23:17     ` Johannes Weiner
2014-01-21  3:03       ` Dave Chinner
2014-01-21  5:50         ` Johannes Weiner
2014-01-22  3:06           ` Dave Chinner
2014-01-22  6:57             ` Johannes Weiner
2014-01-22 18:48               ` Johannes Weiner
2014-01-23  5:57       ` Minchan Kim
  -- strict thread matches above, loose matches on Subject: below --
2013-12-02 19:21 [patch 0/9] mm: thrash detection-based file cache sizing v7 Johannes Weiner
2013-12-02 19:21 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
2013-12-02 22:10   ` Dave Chinner
2013-12-02 22:46     ` Johannes Weiner
2013-11-24 23:38 [patch 0/9] mm: thrash detection-based file cache sizing v6 Johannes Weiner
2013-11-24 23:38 ` [patch 9/9] mm: keep page cache radix tree nodes in check Johannes Weiner
2013-11-25 23:49   ` Dave Chinner
2013-11-26 21:27     ` Johannes Weiner
2013-11-26 22:29       ` Dave Chinner
2013-11-26 23:00         ` Johannes Weiner
2013-11-27  0:59           ` Dave Chinner
2013-11-26  0:13   ` Andrew Morton
2013-11-26 22:05     ` Johannes Weiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).