* [patch 0/9] mm: thrash detection-based file cache sizing v3
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

[ My apologies for the double send, I screwed up one of the recipient
  addresses the first time around and it got dropped by some MTAs. ]

Changes in version 3:

o Lazily remove inodes without shadow entries from the global list to
  reduce modifications of said list to an absolute minimum.  Global
  list operations now happen only when an inode has its first cache
  page reclaimed (rare), when a linked inode is destroyed (rare), or
  when an inode's shadow count drops to zero (rare).  These events
  should be even rarer than the per-sb inode list operations, which
  take a global lock.  Based on feedback from Peter Zijlstra.

o Drop the global working set time; instead, store the zone ID along
  with a zone-specific timestamp in the radix tree.  Balance zones
  based on their own refaults only.  This allows the refault-detecting
  side to be much sleeker too and removes a lot of changes to the page
  allocator interface (a rough sketch of such an encoding follows this
  list).  Based on feedback from Peter Zijlstra.

o Document all interfaces properly

o Split out fair allocator patches (in -mmotm)
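
For illustration, one way to encode a zone ID and a zone-local
timestamp in a single shadow entry could look like the sketch below.
The names and bit layout are made up for this cover letter and are not
necessarily what the patches use; the only hard requirement is that
the result is tagged as a radix tree exceptional entry so it can never
be confused with a struct page pointer:

#define SHADOW_ZONE_BITS 4	/* assumption: wide enough for the zone ID */

static void *pack_shadow(int zoneid, unsigned long time)
{
        unsigned long v = (time << SHADOW_ZONE_BITS) | zoneid;

        return (void *)((v << RADIX_TREE_EXCEPTIONAL_SHIFT) |
                        RADIX_TREE_EXCEPTIONAL_ENTRY);
}

static void unpack_shadow(void *shadow, int *zoneid, unsigned long *time)
{
        unsigned long v = (unsigned long)shadow;

        v >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
        *zoneid = v & ((1UL << SHADOW_ZONE_BITS) - 1);
        *time = v >> SHADOW_ZONE_BITS;
}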

---

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache; the other holds
pages that have been referenced repeatedly on that first list.  The
idea is to prefer reclaiming young pages over those that have proven
to benefit from caching in the past.  We call the recently used list
"inactive list" and the frequently used list "active list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave the inactive list enough room for a new set of frequently
used pages (a new "working set") to establish itself, because the
young pages get pushed out of memory before they have a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once
streams, but was not satisfactory when use-once streaming persisted
over longer periods of time and the established working set was
temporarily suspended, like a nightly backup evicting all the
interactive user program data.
    
Subsequently, the rules were changed to only age active pages when
they exceeded the amount of inactive pages, i.e. to leave the working
set alone as long as the other half of memory is easy-to-reclaim
use-once pages.  This works well until working set transitions exceed
the size of half of memory and the average access distance between
the pages of the new working set is bigger than the inactive list.
The VM will mistake the thrashing new working set for use-once
streaming, while the unused old working set pages are stuck on the
active list.

This happens on file servers and media streaming servers, where the
popular set of files changes over time.  Even though the individual
files might be smaller than half of memory, concurrent access to many
of them may still result in their inter-reference distance being
greater than half of memory.  It's also been reported as a problem on
database workloads that switch back and forth between tables that are
bigger than half of memory.  In these cases the VM never recognizes
the new working set and will, for the remainder of the workload,
thrash disk data that could easily live in memory.

This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to tell streaming IO
from thrashing and rebalance the page cache lists when appropriate.
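
To illustrate the idea (this is a conceptual sketch with made-up
names, not the code from this series; see mm/workingset.c later in
the series for the real thing): remember a reading of a per-zone
eviction counter in the slot a page vacates, and compare it against
the counter on refault to estimate how much more memory the page
would have needed to stay cached:

/* Sketch only: locking and the exceptional-entry tagging are omitted. */
static unsigned long inactive_age;	/* bumped on each eviction */

static void *remember_eviction(void)
{
        return (void *)++inactive_age;
}

static bool refault_was_thrashing(void *shadow, unsigned long threshold)
{
        unsigned long distance = inactive_age - (unsigned long)shadow;

        /*
         * The page was re-requested 'distance' evictions after it was
         * reclaimed.  Had the inactive list been roughly that much
         * bigger, it would not have been evicted at all.  What exactly
         * to compare the distance against is the policy question the
         * series answers; 'threshold' is just a placeholder here.
         */
        return distance <= threshold;
}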

 drivers/staging/lustre/lustre/llite/dir.c |   2 +-
 fs/block_dev.c                            |   2 +-
 fs/btrfs/compression.c                    |   4 +-
 fs/cachefiles/rdwr.c                      |  13 +-
 fs/ceph/xattr.c                           |   2 +-
 fs/inode.c                                |   6 +-
 fs/logfs/readwrite.c                      |   6 +-
 fs/nfs/blocklayout/blocklayout.c          |   2 +-
 fs/nilfs2/inode.c                         |   4 +-
 fs/ntfs/file.c                            |   7 +-
 fs/splice.c                               |   6 +-
 include/linux/fs.h                        |   3 +
 include/linux/mm.h                        |   8 +
 include/linux/mmzone.h                    |   7 +
 include/linux/pagemap.h                   |  55 +++-
 include/linux/pagevec.h                   |   3 +
 include/linux/radix-tree.h                |   5 +-
 include/linux/shmem_fs.h                  |   1 +
 include/linux/swap.h                      |  10 +
 include/linux/writeback.h                 |   1 +
 lib/radix-tree.c                          | 105 ++-----
 mm/Makefile                               |   2 +-
 mm/filemap.c                              | 265 +++++++++++++---
 mm/mincore.c                              |  20 +-
 mm/page-writeback.c                       |   2 +-
 mm/readahead.c                            |   8 +-
 mm/shmem.c                                | 122 ++------
 mm/swap.c                                 |  22 ++
 mm/truncate.c                             |  78 ++++-
 mm/vmscan.c                               |  62 +++-
 mm/vmstat.c                               |   4 +
 mm/workingset.c                           | 461 ++++++++++++++++++++++++++++
 net/ceph/pagelist.c                       |   4 +-
 net/ceph/pagevec.c                        |   2 +-
 34 files changed, 1005 insertions(+), 299 deletions(-)

Based on the latest -mmotm, which includes the required page allocator
fairness patches.  All of it is available at:
http://git.cmpxchg.org/cgit/linux-jw.git/

Thanks!



* [patch 1/9] lib: radix-tree: radix_tree_delete_item()
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

Provide a function that does not just delete an entry at a given
index, but also allows passing in an expected item.  Delete only if
that item is still located at the specified index.

This is handy when lockless tree traversals want to delete entries as
well, because they don't have to do a second, locked lookup to verify
that the slot has not changed under them before deleting the entry.
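
A minimal usage sketch (assuming a caller that found 'entry' at
'index' during a lockless walk and now wants to remove it; the
accounting helper is hypothetical):

spin_lock_irq(&mapping->tree_lock);
/*
 * The slot is only cleared if it still contains 'entry'; if it was
 * repopulated since the lockless lookup, nothing happens and NULL is
 * returned.
 */
if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
        account_deleted_entry(mapping);		/* hypothetical helper */
spin_unlock_irq(&mapping->tree_lock);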

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h |  1 +
 lib/radix-tree.c           | 30 ++++++++++++++++++++++++++----
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..1bf0a9c 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -219,6 +219,7 @@ static inline void radix_tree_replace_slot(void **pslot, void *item)
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..60b202b 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1335,15 +1335,18 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 }
 
 /**
- *	radix_tree_delete    -    delete an item from a radix tree
+ *	radix_tree_delete_item    -    delete an item from a radix tree
  *	@root:		radix tree root
  *	@index:		index key
+ *	@item:		expected item
  *
- *	Remove the item at @index from the radix tree rooted at @root.
+ *	Remove @item at @index from the radix tree rooted at @root.
  *
- *	Returns the address of the deleted item, or NULL if it was not present.
+ *	Returns the address of the deleted item, or NULL if it was not present
+ *	or the entry at the given @index was not @item.
  */
-void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+void *radix_tree_delete_item(struct radix_tree_root *root,
+			     unsigned long index, void *item)
 {
 	struct radix_tree_node *node = NULL;
 	struct radix_tree_node *slot = NULL;
@@ -1378,6 +1381,11 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 	if (slot == NULL)
 		goto out;
 
+	if (item && slot != item) {
+		slot = NULL;
+		goto out;
+	}
+
 	/*
 	 * Clear all tags associated with the item to be deleted.
 	 * This way of doing it would be inefficient, but seldom is any set.
@@ -1422,6 +1430,20 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 out:
 	return slot;
 }
+
+/**
+ *	radix_tree_delete    -    delete an item from a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Remove the item at @index from the radix tree rooted at @root.
+ *
+ *	Returns the address of the deleted item, or NULL if it was not present.
+ */
+void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+{
+	return radix_tree_delete_item(root, index, NULL);
+}
 EXPORT_SYMBOL(radix_tree_delete);
 
 /**
-- 
1.8.3.2



* [patch 2/9] mm: shmem: save one radix tree lookup when truncating swapped pages
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

Page cache radix tree slots are usually stabilized by the page lock,
but shmem's swap cookies have no such thing.  Because the overall
truncation loop is lockless, the swap entry is currently confirmed by
a tree lookup and then deleted by another tree lookup under the same
tree lock region.

Use radix_tree_delete_item() instead, which does the verification and
deletion with only one lookup.  This also allows removing the
delete-only special case from shmem_radix_tree_replace().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/shmem.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index c93dcd6..8510534 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -242,19 +242,17 @@ static int shmem_radix_tree_replace(struct address_space *mapping,
 			pgoff_t index, void *expected, void *replacement)
 {
 	void **pslot;
-	void *item = NULL;
+	void *item;
 
 	VM_BUG_ON(!expected);
+	VM_BUG_ON(!replacement);
 	pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
-	if (pslot)
-		item = radix_tree_deref_slot_protected(pslot,
-							&mapping->tree_lock);
+	if (!pslot)
+		return -ENOENT;
+	item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
 	if (item != expected)
 		return -ENOENT;
-	if (replacement)
-		radix_tree_replace_slot(pslot, replacement);
-	else
-		radix_tree_delete(&mapping->page_tree, index);
+	radix_tree_replace_slot(pslot, replacement);
 	return 0;
 }
 
@@ -386,14 +384,15 @@ export:
 static int shmem_free_swap(struct address_space *mapping,
 			   pgoff_t index, void *radswap)
 {
-	int error;
+	void *old;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = shmem_radix_tree_replace(mapping, index, radswap, NULL);
+	old = radix_tree_delete_item(&mapping->page_tree, index, radswap);
 	spin_unlock_irq(&mapping->tree_lock);
-	if (!error)
-		free_swap_and_cache(radix_to_swp_entry(radswap));
-	return error;
+	if (old != radswap)
+		return -ENOENT;
+	free_swap_and_cache(radix_to_swp_entry(radswap));
+	return 0;
 }
 
 /*
-- 
1.8.3.2



* [patch 3/9] mm: filemap: move radix tree hole searching here
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

The radix tree hole searching code is only used for the page cache,
for example by the readahead code trying to get a picture of the area
surrounding a fault.

It sufficed to rely on the radix tree definition of holes, which is
"empty tree slot".  But this is about to change, as shadow page
descriptors will be stored in the page cache after the actual pages
get evicted from memory.

Move the functions over to mm/filemap.c and make them native page
cache operations, where they can later be adapted to handle the new
definition of "page cache hole".

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/nfs/blocklayout/blocklayout.c |  2 +-
 include/linux/pagemap.h          |  5 +++
 include/linux/radix-tree.h       |  4 ---
 lib/radix-tree.c                 | 75 ---------------------------------------
 mm/filemap.c                     | 76 ++++++++++++++++++++++++++++++++++++++++
 mm/readahead.c                   |  4 +--
 6 files changed, 84 insertions(+), 82 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index e242bbf..fdb74cb 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1220,7 +1220,7 @@ static u64 pnfs_num_cont_bytes(struct inode *inode, pgoff_t idx)
 	end = DIV_ROUND_UP(i_size_read(inode), PAGE_CACHE_SIZE);
 	if (end != NFS_I(inode)->npages) {
 		rcu_read_lock();
-		end = radix_tree_next_hole(&mapping->page_tree, idx + 1, ULONG_MAX);
+		end = page_cache_next_hole(mapping, idx + 1, ULONG_MAX);
 		rcu_read_unlock();
 	}
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..c73130c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -243,6 +243,11 @@ static inline struct page *page_cache_alloc_readahead(struct address_space *x)
 
 typedef int filler_t(void *, struct page *);
 
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan);
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan);
+
 extern struct page * find_get_page(struct address_space *mapping,
 				pgoff_t index);
 extern struct page * find_lock_page(struct address_space *mapping,
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 1bf0a9c..e8be53e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -227,10 +227,6 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
 			void ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 60b202b..912e67b 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -946,81 +946,6 @@ next:
 }
 EXPORT_SYMBOL(radix_tree_range_tag_if_tagged);
 
-
-/**
- *	radix_tree_next_hole    -    find the next hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the lowest
- *	indexed hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'return - index >= max_scan'
- *	will be true). In rare cases of index wrap-around, 0 will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 5, then subsequently a hole is created at index 10,
- *	radix_tree_next_hole covering both indexes may return 10 if called
- *	under rcu_read_lock.
- */
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index++;
-		if (index == 0)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_next_hole);
-
-/**
- *	radix_tree_prev_hole    -    find the prev hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search backwards in the range [max(index-max_scan+1, 0), index]
- *	for the first hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'index - return >= max_scan'
- *	will be true). In rare cases of wrap-around, ULONG_MAX will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 10, then subsequently a hole is created at index 5,
- *	radix_tree_prev_hole covering both indexes may return 5 if called under
- *	rcu_read_lock.
- */
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				   unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index--;
-		if (index == ULONG_MAX)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_prev_hole);
-
 /**
  *	radix_tree_gang_lookup - perform multiple lookup on a radix tree
  *	@root:		radix tree root
diff --git a/mm/filemap.c b/mm/filemap.c
index ae5cc01..e7833d2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -688,6 +688,82 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 }
 
 /**
+ * page_cache_next_hole - find the next hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the
+ * lowest indexed hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'return - index >=
+ * max_scan' will be true). In rare cases of index wrap-around, 0 will
+ * be returned.
+ *
+ * page_cache_next_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 5, then subsequently a hole is created at
+ * index 10, page_cache_next_hole covering both indexes may return 10
+ * if called under rcu_read_lock.
+ */
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index++;
+		if (index == 0)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(page_cache_next_hole);
+
+/**
+ * page_cache_prev_hole - find the prev hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search backwards in the range [max(index-max_scan+1, 0), index] for
+ * the first hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'index - return >=
+ * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX
+ * will be returned.
+ *
+ * page_cache_prev_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 10, then subsequently a hole is created at
+ * index 5, page_cache_prev_hole covering both indexes may return 5 if
+ * called under rcu_read_lock.
+ */
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index--;
+		if (index == ULONG_MAX)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(page_cache_prev_hole);
+
+/**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
diff --git a/mm/readahead.c b/mm/readahead.c
index 829a77c..01f4cae 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -351,7 +351,7 @@ static pgoff_t count_history_pages(struct address_space *mapping,
 	pgoff_t head;
 
 	rcu_read_lock();
-	head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
+	head = page_cache_prev_hole(mapping, offset - 1, max);
 	rcu_read_unlock();
 
 	return offset - 1 - head;
@@ -430,7 +430,7 @@ ondemand_readahead(struct address_space *mapping,
 		pgoff_t start;
 
 		rcu_read_lock();
-		start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
+		start = page_cache_next_hole(mapping, offset + 1, max);
 		rcu_read_unlock();
 
 		if (!start || start - offset > max)
-- 
1.8.3.2



* [patch 4/9] mm + fs: prepare for non-page entries in page cache radix trees
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache,
prepare every site dealing with the radix trees directly to handle
entries other than pages.

The common lookup functions will filter out non-page entries and
return NULL for page cache holes, just as before.  But provide a raw
version of the API which returns non-page entries as well, and switch
shmem over to use it.
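
A caller of the raw lookup then has to be prepared for exceptional
entries.  A sketch of the pattern (how the entry is handled is of
course caller-specific):

struct page *page;

page = __find_get_page(mapping, index);
if (radix_tree_exceptional_entry(page)) {
        /*
         * Not a struct page: a shmem swap entry today, a shadow entry
         * after later patches in this series.  No page reference was
         * taken, so no page_cache_release() here either.
         */
        page = NULL;
} else if (page) {
        /* A real page with an elevated refcount. */
        page_cache_release(page);
}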

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c   |   2 +-
 include/linux/mm.h       |   8 +++
 include/linux/pagemap.h  |  15 +++---
 include/linux/pagevec.h  |   3 ++
 include/linux/shmem_fs.h |   1 +
 mm/filemap.c             | 131 +++++++++++++++++++++++++++++++++++++++--------
 mm/mincore.c             |  20 +++++---
 mm/readahead.c           |   2 +-
 mm/shmem.c               |  97 +++++++----------------------------
 mm/swap.c                |  20 ++++++++
 mm/truncate.c            |  75 +++++++++++++++++++++------
 11 files changed, 246 insertions(+), 128 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b189bd1..5ce2c0f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -475,7 +475,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, pg_index);
 		rcu_read_unlock();
-		if (page) {
+		if (page && !radix_tree_exceptional_entry(page)) {
 			misses++;
 			if (misses > 4)
 				break;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e20da1b..7f4d1f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -913,6 +913,14 @@ extern void show_free_areas(unsigned int flags);
 extern bool skip_free_areas_node(unsigned int flags, int nid);
 
 int shmem_zero_setup(struct vm_area_struct *);
+#ifdef CONFIG_SHMEM
+bool shmem_mapping(struct address_space *mapping);
+#else
+static inline bool shmem_mapping(struct address_space *mapping)
+{
+	return false;
+}
+#endif
 
 extern int can_do_mlock(void);
 extern int user_shm_lock(size_t, struct user_struct *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c73130c..b6854b7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -248,12 +248,15 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 pgoff_t page_cache_prev_hole(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
-extern struct page * find_get_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_lock_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_or_create_page(struct address_space *mapping,
-				pgoff_t index, gfp_t gfp_mask);
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
+				 gfp_t gfp_mask);
+unsigned __find_get_pages(struct address_space *mapping, pgoff_t start,
+			  unsigned int nr_pages, struct page **pages,
+			  pgoff_t *indices);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index e4dbfab..3c6b8b1 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,6 +22,9 @@ struct pagevec {
 
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned nr_pages, pgoff_t *indices);
+void pagevec_remove_exceptionals(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 30aa0dc..deb4960 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -49,6 +49,7 @@ extern struct file *shmem_file_setup(const char *name,
 					loff_t size, unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
 extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern bool shmem_mapping(struct address_space *mapping);
 extern void shmem_unlock_mapping(struct address_space *mapping);
 extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 					pgoff_t index, gfp_t gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index e7833d2..254eb16 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -446,6 +446,24 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
+static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
+			     struct page *page)
+{
+	void **slot;
+
+	slot = radix_tree_lookup_slot(&mapping->page_tree, offset);
+	if (slot) {
+		void *p;
+
+		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+		if (!radix_tree_exceptional_entry(p))
+			return -EEXIST;
+		radix_tree_replace_slot(slot, page);
+		return 0;
+	}
+	return radix_tree_insert(&mapping->page_tree, offset, page);
+}
+
 /**
  * add_to_page_cache_locked - add a locked page to the pagecache
  * @page:	page to add
@@ -480,7 +498,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	page->index = offset;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = radix_tree_insert(&mapping->page_tree, offset, page);
+	error = page_cache_insert(mapping, offset, page);
 	radix_tree_preload_end();
 	if (unlikely(error))
 		goto err_insert;
@@ -714,7 +732,10 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index++;
 		if (index == 0)
@@ -752,7 +773,10 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index--;
 		if (index == ULONG_MAX)
@@ -763,15 +787,7 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 }
 EXPORT_SYMBOL(page_cache_prev_hole);
 
-/**
- * find_get_page - find and get a page reference
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Is there a pagecache struct page at the given (mapping, offset) tuple?
- * If yes, increment its refcount and return it; if no, return NULL.
- */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
 {
 	void **pagep;
 	struct page *page;
@@ -812,24 +828,31 @@ out:
 
 	return page;
 }
-EXPORT_SYMBOL(find_get_page);
+EXPORT_SYMBOL(__find_get_page);
 
 /**
- * find_lock_page - locate, pin and lock a pagecache page
+ * find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
  *
- * Locates the desired pagecache page, locks it, increments its reference
- * count and returns its address.
- *
- * Returns zero if the page was not present. find_lock_page() may sleep.
+ * Is there a pagecache struct page at the given (mapping, offset) tuple?
+ * If yes, increment its refcount and return it; if no, return NULL.
  */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
 {
-	struct page *page;
+	struct page *page = __find_get_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
+EXPORT_SYMBOL(find_get_page);
 
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page;
 repeat:
-	page = find_get_page(mapping, offset);
+	page = __find_get_page(mapping, offset);
 	if (page && !radix_tree_exception(page)) {
 		lock_page(page);
 		/* Has the page been truncated? */
@@ -842,6 +865,25 @@ repeat:
 	}
 	return page;
 }
+
+/**
+ * find_lock_page - locate, pin and lock a pagecache page
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Locates the desired pagecache page, locks it, increments its reference
+ * count and returns its address.
+ *
+ * Returns zero if the page was not present. find_lock_page() may sleep.
+ */
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page = __find_lock_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
 EXPORT_SYMBOL(find_lock_page);
 
 /**
@@ -891,6 +933,53 @@ repeat:
 }
 EXPORT_SYMBOL(find_or_create_page);
 
+unsigned __find_get_pages(struct address_space *mapping,
+			  pgoff_t start, unsigned int nr_pages,
+			  struct page **pages, pgoff_t *indices)
+{
+	void **slot;
+	unsigned int ret = 0;
+	struct radix_tree_iter iter;
+
+	if (!nr_pages)
+		return 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				goto restart;
+			/*
+			 * Otherwise, we must be storing a swap entry
+			 * here as an exceptional entry: so return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = iter.index;
+		pages[ret] = page;
+		if (++ret == nr_pages)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 /**
  * find_get_pages - gang pagecache lookup
  * @mapping:	The address_space to search
diff --git a/mm/mincore.c b/mm/mincore.c
index da2be56..ad411ec 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -70,13 +70,21 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 	 * any other file mapping (ie. marked !present and faulted in with
 	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
 	 */
-	page = find_get_page(mapping, pgoff);
 #ifdef CONFIG_SWAP
-	/* shmem/tmpfs may return swap: account for swapcache page too. */
-	if (radix_tree_exceptional_entry(page)) {
-		swp_entry_t swap = radix_to_swp_entry(page);
-		page = find_get_page(swap_address_space(swap), swap.val);
-	}
+	if (shmem_mapping(mapping)) {
+		page = __find_get_page(mapping, pgoff);
+		/*
+		 * shmem/tmpfs may return swap: account for swapcache
+		 * page too.
+		 */
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swp = radix_to_swp_entry(page);
+			page = find_get_page(swap_address_space(swp), swp.val);
+		}
+	} else
+		page = find_get_page(mapping, pgoff);
+#else
+	page = find_get_page(mapping, pgoff);
 #endif
 	if (page) {
 		present = PageUptodate(page);
diff --git a/mm/readahead.c b/mm/readahead.c
index 01f4cae..0f85996 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -179,7 +179,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
 		rcu_read_unlock();
-		if (page)
+		if (page && !radix_tree_exceptional_entry(page))
 			continue;
 
 		page = page_cache_alloc_readahead(mapping);
diff --git a/mm/shmem.c b/mm/shmem.c
index 8510534..2d18981 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -329,56 +329,6 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 }
 
 /*
- * Like find_get_pages, but collecting swap entries as well as pages.
- */
-static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
-					pgoff_t start, unsigned int nr_pages,
-					struct page **pages, pgoff_t *indices)
-{
-	void **slot;
-	unsigned int ret = 0;
-	struct radix_tree_iter iter;
-
-	if (!nr_pages)
-		return 0;
-
-	rcu_read_lock();
-restart:
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
-		struct page *page;
-repeat:
-		page = radix_tree_deref_slot(slot);
-		if (unlikely(!page))
-			continue;
-		if (radix_tree_exception(page)) {
-			if (radix_tree_deref_retry(page))
-				goto restart;
-			/*
-			 * Otherwise, we must be storing a swap entry
-			 * here as an exceptional entry: so return it
-			 * without attempting to raise page count.
-			 */
-			goto export;
-		}
-		if (!page_cache_get_speculative(page))
-			goto repeat;
-
-		/* Has the page moved? */
-		if (unlikely(page != *slot)) {
-			page_cache_release(page);
-			goto repeat;
-		}
-export:
-		indices[ret] = iter.index;
-		pages[ret] = page;
-		if (++ret == nr_pages)
-			break;
-	}
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
  * Remove swap entry from radix tree, free the swap and its page cache.
  */
 static int shmem_free_swap(struct address_space *mapping,
@@ -396,21 +346,6 @@ static int shmem_free_swap(struct address_space *mapping,
 }
 
 /*
- * Pagevec may contain swap entries, so shuffle up pages before releasing.
- */
-static void shmem_deswap_pagevec(struct pagevec *pvec)
-{
-	int i, j;
-
-	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		if (!radix_tree_exceptional_entry(page))
-			pvec->pages[j++] = page;
-	}
-	pvec->nr = j;
-}
-
-/*
  * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
  */
 void shmem_unlock_mapping(struct address_space *mapping)
@@ -428,12 +363,12 @@ void shmem_unlock_mapping(struct address_space *mapping)
 		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
 		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
 		 */
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					PAGEVEC_SIZE, pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		index = indices[pvec.nr - 1] + 1;
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		check_move_unevictable_pages(pvec.pages, pvec.nr);
 		pagevec_release(&pvec);
 		cond_resched();
@@ -465,9 +400,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	pagevec_init(&pvec, 0);
 	index = start;
 	while (index < end) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
-				min(end - index, (pgoff_t)PAGEVEC_SIZE),
-							pvec.pages, indices);
+		pvec.nr = __find_get_pages(mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		mem_cgroup_uncharge_start();
@@ -496,7 +431,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -534,9 +469,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+
+		pvec.nr = __find_get_pages(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
-							pvec.pages, indices);
+				pvec.pages, indices);
 		if (!pvec.nr) {
 			if (index == start || unfalloc)
 				break;
@@ -544,7 +480,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			continue;
 		}
 		if ((index == start || unfalloc) && indices[0] >= end) {
-			shmem_deswap_pagevec(&pvec);
+			pagevec_remove_exceptionals(&pvec);
 			pagevec_release(&pvec);
 			break;
 		}
@@ -573,7 +509,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -1081,7 +1017,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return -EFBIG;
 repeat:
 	swap.val = 0;
-	page = find_lock_page(mapping, index);
+	page = __find_lock_page(mapping, index);
 	if (radix_tree_exceptional_entry(page)) {
 		swap = radix_to_swp_entry(page);
 		page = NULL;
@@ -1418,6 +1354,11 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	return inode;
 }
 
+bool shmem_mapping(struct address_space *mapping)
+{
+	return mapping->backing_dev_info == &shmem_backing_dev_info;
+}
+
 #ifdef CONFIG_TMPFS
 static const struct inode_operations shmem_symlink_inode_operations;
 static const struct inode_operations shmem_short_symlink_operations;
@@ -1731,7 +1672,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 	pagevec_init(&pvec, 0);
 	pvec.nr = 1;		/* start small: we may be there already */
 	while (!done) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					pvec.nr, pvec.pages, indices);
 		if (!pvec.nr) {
 			if (whence == SEEK_DATA)
@@ -1758,7 +1699,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 				break;
 			}
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		pvec.nr = PAGEVEC_SIZE;
 		cond_resched();
diff --git a/mm/swap.c b/mm/swap.c
index 62b78a6..bf448cf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,6 +831,26 @@ void __pagevec_lru_add(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_lru_add);
 
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned nr_pages, pgoff_t *indices)
+{
+	pvec->nr = __find_get_pages(mapping, start, nr_pages,
+				    pvec->pages, indices);
+	return pagevec_count(pvec);
+}
+
+void pagevec_remove_exceptionals(struct pagevec *pvec)
+{
+	int i, j;
+
+	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		if (!radix_tree_exceptional_entry(page))
+			pvec->pages[j++] = page;
+	}
+	pvec->nr = j;
+}
+
 /**
  * pagevec_lookup - gang pagecache lookup
  * @pvec:	Where the resulting pages are placed
diff --git a/mm/truncate.c b/mm/truncate.c
index e2e8a8a..21e4851 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -22,6 +22,22 @@
 #include <linux/cleancache.h>
 #include "internal.h"
 
+static void clear_exceptional_entry(struct address_space *mapping,
+				    pgoff_t index, struct page *page)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	radix_tree_delete_item(&mapping->page_tree, index, page);
+	spin_unlock_irq(&mapping->tree_lock);
+}
 
 /**
  * do_invalidatepage - invalidate part or all of a page
@@ -208,12 +224,11 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	unsigned int	partial_start;	/* inclusive */
 	unsigned int	partial_end;	/* exclusive */
 	struct pagevec	pvec;
+	pgoff_t		indices[PAGEVEC_SIZE];
 	pgoff_t		index;
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0)
-		return;
 
 	/* Offsets within partial pages */
 	partial_start = lstart & (PAGE_CACHE_SIZE - 1);
@@ -238,17 +253,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
 
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index < end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+	while (index < end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index >= end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -259,6 +280,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -307,14 +329,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		if (!pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+		if (!__pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			indices)) {
 			if (index == start)
 				break;
 			index = start;
 			continue;
 		}
-		if (index == start && pvec.pages[0]->index >= end) {
+		if (index == start && indices[0] >= end) {
 			pagevec_release(&pvec);
 			break;
 		}
@@ -323,16 +346,22 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index >= end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -375,6 +404,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index = start;
 	unsigned long ret;
@@ -390,17 +420,23 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	 */
 
 	pagevec_init(&pvec, 0);
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -414,6 +450,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 				deactivate_page(page);
 			count += ret;
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -481,6 +518,7 @@ static int do_launder_page(struct address_space *mapping, struct page *page)
 int invalidate_inode_pages2_range(struct address_space *mapping,
 				  pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index;
 	int i;
@@ -491,17 +529,23 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	cleancache_invalidate_inode(mapping);
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			if (page->mapping != mapping) {
@@ -539,6 +583,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				ret = ret2;
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
-- 
1.8.3.2



* [patch 5/9] mm + fs: store shadow entries in page cache
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

Reclaim will be leaving shadow entries in the page cache radix tree
upon evicting the real page.  As those pages are found from the LRU,
an iput() can lead to the inode being freed concurrently.  At this
point, reclaim must no longer install shadow pages because the inode
freeing code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code
sets under the tree lock before doing the final truncate.  Reclaim
will check for this flag before installing shadow pages.
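
On the reclaim side, the check then looks roughly like this (sketch
only; the actual shadow installation is added by a later patch, and
make_shadow_entry() is just a placeholder name):

/* under mapping->tree_lock, about to remove 'page' from the tree */
void *shadow = NULL;

if (!mapping_exiting(mapping))
        shadow = make_shadow_entry(page);	/* placeholder */
__delete_from_page_cache(page, shadow);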

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/block_dev.c          |  2 +-
 fs/inode.c              |  7 ++++++-
 fs/nilfs2/inode.c       |  4 ++--
 include/linux/fs.h      |  1 +
 include/linux/pagemap.h | 13 ++++++++++++-
 mm/filemap.c            | 16 ++++++++++++----
 mm/truncate.c           |  5 +++--
 mm/vmscan.c             |  2 +-
 8 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7bda5c..26b992d14 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -83,7 +83,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index e315c0a..8862b1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -503,6 +503,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
+	BUG_ON(inode->i_data.nrshadows);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
@@ -545,10 +546,14 @@ static void evict(struct inode *inode)
 	 */
 	inode_wait_for_writeback(inode);
 
+	spin_lock_irq(&inode->i_data.tree_lock);
+	mapping_set_exiting(&inode->i_data);
+	spin_unlock_irq(&inode->i_data.tree_lock);
+
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
-		if (inode->i_data.nrpages)
+		if (inode->i_data.nrpages || inode->i_data.nrshadows)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 	}
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index b1a5277..047bafb 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -783,7 +783,7 @@ void nilfs_evict_inode(struct inode *inode)
 	int ret;
 
 	if (inode->i_nlink || !ii->i_root || unlikely(is_bad_inode(inode))) {
-		if (inode->i_data.nrpages)
+		if (inode->i_data.nrpages || inode->i_data.nrshadows)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 		nilfs_clear_inode(inode);
@@ -791,7 +791,7 @@ void nilfs_evict_inode(struct inode *inode)
 	}
 	nilfs_transaction_begin(sb, &ti, 0); /* never fails */
 
-	if (inode->i_data.nrpages)
+	if (inode->i_data.nrpages || inode->i_data.nrshadows)
 		truncate_inode_pages(&inode->i_data, 0);
 
 	/* TODO: some of the following operations may fail.  */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b09ddc0..ac5d84e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -416,6 +416,7 @@ struct address_space {
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
+	unsigned long		nrshadows;	/* number of shadow entries */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index b6854b7..db3a78b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_EXITING	= __GFP_BITS_SHIFT + 5, /* inode is being evicted */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -69,6 +70,16 @@ static inline int mapping_balloon(struct address_space *mapping)
 	return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
 }
 
+static inline void mapping_set_exiting(struct address_space *mapping)
+{
+	set_bit(AS_EXITING, &mapping->flags);
+}
+
+static inline int mapping_exiting(struct address_space *mapping)
+{
+	return test_bit(AS_EXITING, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
@@ -547,7 +558,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page);
+extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 254eb16..34b2f0b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -112,7 +112,7 @@
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the mapping's tree_lock.
  */
-void __delete_from_page_cache(struct page *page)
+void __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
@@ -127,7 +127,14 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	if (shadow) {
+		void **slot;
+
+		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+		radix_tree_replace_slot(slot, shadow);
+		mapping->nrshadows++;
+	} else
+		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
 	mapping->nrpages--;
@@ -166,7 +173,7 @@ void delete_from_page_cache(struct page *page)
 
 	freepage = mapping->a_ops->freepage;
 	spin_lock_irq(&mapping->tree_lock);
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
@@ -426,7 +433,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->index = offset;
 
 		spin_lock_irq(&mapping->tree_lock);
-		__delete_from_page_cache(old);
+		__delete_from_page_cache(old, NULL);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
 		BUG_ON(error);
 		mapping->nrpages++;
@@ -459,6 +466,7 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
+		mapping->nrshadows--;
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 21e4851..5c85dd4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -35,7 +35,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * without the tree itself locked.  These unlocked entries
 	 * need verification under the tree lock.
 	 */
-	radix_tree_delete_item(&mapping->page_tree, index, page);
+	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
+		mapping->nrshadows--;
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
@@ -481,7 +482,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a3bf7fd..dd5f67c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -545,7 +545,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page);
+		__delete_from_page_cache(page, NULL);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [patch 6/9] mm + fs: provide shadow pages to page cache allocations
  2013-08-06 22:44 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
                   ` (4 preceding siblings ...)
  2013-08-06 22:44 ` [patch 5/9] mm + fs: store shadow entries in page cache Johannes Weiner
@ 2013-08-06 22:44 ` Johannes Weiner
  2013-08-06 22:44 ` [patch 7/9] mm: make global_dirtyable_memory() available to other mm code Johannes Weiner
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

In order to make informed placement and reclaim decisions, the page
cache allocation code needs the shadow information of refaulting pages.

Every site that does a find_or_create()-style page cache allocation is
converted to pass the shadow page found in the faulting slot of the
radix tree to page_cache_alloc(), where it can be used in subsequent
patches to influence reclaim behavior.
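
At each call site the conversion follows the same pattern; the
following is a condensed sketch of the hunks below, not an additional
change:

	page = __find_get_page(mapping, index);
	if (!page || radix_tree_exceptional_entry(page)) {
		/* page is either NULL or a shadow entry at this point */
		page = __page_cache_alloc(gfp_mask, page);
		if (!page)
			return NULL;
	}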

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 drivers/staging/lustre/lustre/llite/dir.c |  2 +-
 fs/btrfs/compression.c                    |  2 +-
 fs/cachefiles/rdwr.c                      | 13 +++++----
 fs/ceph/xattr.c                           |  2 +-
 fs/logfs/readwrite.c                      |  6 ++--
 fs/ntfs/file.c                            |  7 +++--
 fs/splice.c                               |  6 ++--
 include/linux/pagemap.h                   | 20 ++++++++------
 mm/filemap.c                              | 46 +++++++++++++++++--------------
 mm/readahead.c                            |  2 +-
 net/ceph/pagelist.c                       |  4 +--
 net/ceph/pagevec.c                        |  2 +-
 12 files changed, 61 insertions(+), 51 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
index 2ca8c45..ac63e4d 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -172,7 +172,7 @@ static int ll_dir_filler(void *_hash, struct page *page0)
 		max_pages = 1;
 	}
 	for (npages = 1; npages < max_pages; npages++) {
-		page = page_cache_alloc_cold(inode->i_mapping);
+		page = page_cache_alloc_cold(inode->i_mapping, NULL);
 		if (!page)
 			break;
 		page_pool[npages] = page;
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 5ce2c0f..f23bb17 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -483,7 +483,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		}
 
 		page = __page_cache_alloc(mapping_gfp_mask(mapping) &
-								~__GFP_FS);
+					  ~__GFP_FS, page);
 		if (!page)
 			break;
 
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index ebaff36..1b34a42 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -254,13 +254,13 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
 	newpage = NULL;
 
 	for (;;) {
-		backpage = find_get_page(bmapping, netpage->index);
-		if (backpage)
+		backpage = __find_get_page(bmapping, netpage->index);
+		if (backpage && !radix_tree_exceptional_entry(backpage))
 			goto backing_page_already_present;
 
 		if (!newpage) {
 			newpage = __page_cache_alloc(cachefiles_gfp |
-						     __GFP_COLD);
+						     __GFP_COLD, backpage);
 			if (!newpage)
 				goto nomem_monitor;
 		}
@@ -499,13 +499,14 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 		}
 
 		for (;;) {
-			backpage = find_get_page(bmapping, netpage->index);
-			if (backpage)
+			backpage = __find_get_page(bmapping, netpage->index);
+			if (backpage && !radix_tree_exceptional_entry(backpage))
 				goto backing_page_already_present;
 
 			if (!newpage) {
 				newpage = __page_cache_alloc(cachefiles_gfp |
-							     __GFP_COLD);
+							     __GFP_COLD,
+							     backpage);
 				if (!newpage)
 					goto nomem;
 			}
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index be661d8..a5d2b86 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -816,7 +816,7 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 			return -ENOMEM;
 		err = -ENOMEM;
 		for (i = 0; i < nr_pages; i++) {
-			pages[i] = __page_cache_alloc(GFP_NOFS);
+			pages[i] = __page_cache_alloc(GFP_NOFS, NULL);
 			if (!pages[i]) {
 				nr_pages = i;
 				goto out;
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 9a59cba..67c669a 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -316,9 +316,9 @@ static struct page *logfs_get_write_page(struct inode *inode, u64 bix,
 	int err;
 
 repeat:
-	page = find_get_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(GFP_NOFS);
+	page = __find_get_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		page = __page_cache_alloc(GFP_NOFS, page);
 		if (!page)
 			return NULL;
 		err = add_to_page_cache_lru(page, mapping, index, GFP_NOFS);
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index c5670b8..7aee2d1 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -413,10 +413,11 @@ static inline int __ntfs_grab_cache_pages(struct address_space *mapping,
 	BUG_ON(!nr_pages);
 	err = nr = 0;
 	do {
-		pages[nr] = find_lock_page(mapping, index);
-		if (!pages[nr]) {
+		pages[nr] = __find_lock_page(mapping, index);
+		if (!pages[nr] || radix_tree_exceptional_entry(pages[nr])) {
 			if (!*cached_page) {
-				*cached_page = page_cache_alloc(mapping);
+				*cached_page = page_cache_alloc(mapping,
+								pages[nr]);
 				if (unlikely(!*cached_page)) {
 					err = -ENOMEM;
 					goto err_out;
diff --git a/fs/splice.c b/fs/splice.c
index 3b7ee65..edc54ae 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -353,12 +353,12 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		 * Page could be there, find_get_pages_contig() breaks on
 		 * the first hole.
 		 */
-		page = find_get_page(mapping, index);
-		if (!page) {
+		page = __find_get_page(mapping, index);
+		if (!page || radix_tree_exceptional_entry(page)) {
 			/*
 			 * page didn't exist, allocate one.
 			 */
-			page = page_cache_alloc_cold(mapping);
+			page = page_cache_alloc_cold(mapping, page);
 			if (!page)
 				break;
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index db3a78b..4b24236 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -228,28 +228,32 @@ static inline void page_unfreeze_refs(struct page *page, int count)
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 {
 	return alloc_pages(gfp, 0);
 }
 #endif
 
-static inline struct page *page_cache_alloc(struct address_space *x)
+static inline struct page *page_cache_alloc(struct address_space *x,
+					    struct page *shadow)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x), shadow);
 }
 
-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *x,
+						 struct page *shadow)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD, shadow);
 }
 
-static inline struct page *page_cache_alloc_readahead(struct address_space *x)
+static inline struct page *page_cache_alloc_readahead(struct address_space *x,
+						      struct page *shadow)
 {
 	return __page_cache_alloc(mapping_gfp_mask(x) |
-				  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
+				  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN,
+				  shadow);
 }
 
 typedef int filler_t(void *, struct page *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 34b2f0b..d3e5578 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -538,7 +538,7 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 {
 	int n;
 	struct page *page;
@@ -917,9 +917,9 @@ struct page *find_or_create_page(struct address_space *mapping,
 	struct page *page;
 	int err;
 repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
+	page = __find_lock_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		page = __page_cache_alloc(gfp_mask, page);
 		if (!page)
 			return NULL;
 		/*
@@ -1222,15 +1222,16 @@ EXPORT_SYMBOL(find_get_pages_tag);
 struct page *
 grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 {
-	struct page *page = find_get_page(mapping, index);
+	struct page *page = __find_get_page(mapping, index);
 
-	if (page) {
+	if (page && !radix_tree_exceptional_entry(page)) {
 		if (trylock_page(page))
 			return page;
 		page_cache_release(page);
 		return NULL;
 	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
+				  page);
 	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 		page_cache_release(page);
 		page = NULL;
@@ -1304,8 +1305,9 @@ find_page:
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
-			page = find_get_page(mapping, index);
-			if (unlikely(page == NULL))
+			page = __find_get_page(mapping, index);
+			if (unlikely(page == NULL ||
+				     radix_tree_exceptional_entry(page)))
 				goto no_cached_page;
 		}
 		if (PageReadahead(page)) {
@@ -1464,7 +1466,7 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, page);
 		if (!page) {
 			desc->error = -ENOMEM;
 			goto out;
@@ -1673,18 +1675,20 @@ EXPORT_SYMBOL(generic_file_aio_read);
  * page_cache_read - adds requested page to the page cache if not already there
  * @file:	file to read
  * @offset:	page index
+ * @shadow:	shadow page of the page to be added
  *
  * This adds the requested page to the page cache if it isn't already there,
  * and schedules an I/O to read in its contents from disk.
  */
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset,
+			   struct page *shadow)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *page; 
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, shadow);
 		if (!page)
 			return -ENOMEM;
 
@@ -1815,8 +1819,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 retry_find:
-		page = find_get_page(mapping, offset);
-		if (!page)
+		page = __find_get_page(mapping, offset);
+		if (!page || radix_tree_exceptional_entry(page))
 			goto no_cached_page;
 	}
 
@@ -1859,7 +1863,7 @@ no_cached_page:
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, offset);
+	error = page_cache_read(file, offset, page);
 
 	/*
 	 * The page we want has now been added to the page cache.
@@ -1981,9 +1985,9 @@ static struct page *__read_cache_page(struct address_space *mapping,
 	struct page *page;
 	int err;
 repeat:
-	page = find_get_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp | __GFP_COLD);
+	page = __find_get_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		page = __page_cache_alloc(gfp | __GFP_COLD, page);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
 		err = add_to_page_cache_lru(page, mapping, index, gfp);
@@ -2454,11 +2458,11 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
 repeat:
-	page = find_lock_page(mapping, index);
-	if (page)
+	page = __find_lock_page(mapping, index);
+	if (page && !radix_tree_exceptional_entry(page))
 		goto found;
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	page = __page_cache_alloc(gfp_mask & ~gfp_notmask, page);
 	if (!page)
 		return NULL;
 	status = add_to_page_cache_lru(page, mapping, index,
diff --git a/mm/readahead.c b/mm/readahead.c
index 0f85996..58142ef 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -182,7 +182,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		if (page && !radix_tree_exceptional_entry(page))
 			continue;
 
-		page = page_cache_alloc_readahead(mapping);
+		page = page_cache_alloc_readahead(mapping, page);
 		if (!page)
 			break;
 		page->index = page_offset;
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 92866be..83fb56e 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -32,7 +32,7 @@ static int ceph_pagelist_addpage(struct ceph_pagelist *pl)
 	struct page *page;
 
 	if (!pl->num_pages_free) {
-		page = __page_cache_alloc(GFP_NOFS);
+		page = __page_cache_alloc(GFP_NOFS, NULL);
 	} else {
 		page = list_first_entry(&pl->free_list, struct page, lru);
 		list_del(&page->lru);
@@ -83,7 +83,7 @@ int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space)
 	space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT;   /* conv to num pages */
 
 	while (space > pl->num_pages_free) {
-		struct page *page = __page_cache_alloc(GFP_NOFS);
+		struct page *page = __page_cache_alloc(GFP_NOFS, NULL);
 		if (!page)
 			return -ENOMEM;
 		list_add_tail(&page->lru, &pl->free_list);
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 815a224..ff76422 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -79,7 +79,7 @@ struct page **ceph_alloc_page_vector(int num_pages, gfp_t flags)
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 	for (i = 0; i < num_pages; i++) {
-		pages[i] = __page_cache_alloc(flags);
+		pages[i] = __page_cache_alloc(flags, NULL);
 		if (pages[i] == NULL) {
 			ceph_release_page_vector(pages, i);
 			return ERR_PTR(-ENOMEM);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [patch 7/9] mm: make global_dirtyable_memory() available to other mm code
  2013-08-06 22:44 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
                   ` (5 preceding siblings ...)
  2013-08-06 22:44 ` [patch 6/9] mm + fs: provide shadow pages to page cache allocations Johannes Weiner
@ 2013-08-06 22:44 ` Johannes Weiner
  2013-08-06 22:44 ` [patch 8/9] mm: thrash detection-based file cache sizing Johannes Weiner
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

Subsequent patches need a rough estimate of memory available for page
cache.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/writeback.h | 1 +
 mm/page-writeback.c       | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4e198ca..4c26df7 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -154,6 +154,7 @@ struct ctl_table;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int,
 				      void __user *, size_t *, loff_t *);
 
+unsigned long global_dirtyable_memory(void);
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d374b29..60376ad 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -231,7 +231,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  * Returns the global number of pages potentially available for dirty
  * page cache.  This is the base value for the global dirty limits.
  */
-static unsigned long global_dirtyable_memory(void)
+unsigned long global_dirtyable_memory(void)
 {
 	unsigned long x;
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-06 22:44 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
                   ` (6 preceding siblings ...)
  2013-08-06 22:44 ` [patch 7/9] mm: make global_dirtyable_memory() available to other mm code Johannes Weiner
@ 2013-08-06 22:44 ` Johannes Weiner
  2013-08-09 22:49   ` Andrew Morton
  2013-08-11 21:57   ` Vlastimil Babka
  2013-08-06 22:44 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner
  2013-08-09 22:53 ` [patch 0/9] mm: thrash detection-based file cache sizing v3 Andrew Morton
  9 siblings, 2 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past.  We call the recently
used list "inactive list" and the frequently used list "active list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave enough room on the inactive list for a new set of
frequently used pages, a "working set", to establish itself because
the young pages get pushed out of memory before having a chance to
get promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once
streams, but was not satisfactory when use-once streaming persisted
over longer periods of time and the established working set was
temporarily suspended, like a nightly backup evicting all the
interactive user program data.

Subsequently, the rules were changed to only age active pages when
they exceed the number of inactive pages, i.e. leave the working set
alone as long as the other half of memory is easy-to-reclaim use-once
pages.  This works well until working set transitions exceed the size
of half of memory and the average access distance between the pages of
the new working set is bigger than the inactive list.  The VM will
mistake the thrashing new working set for use-once streaming, while
the unused old working set pages are stuck on the active list.

This patch solves this problem by maintaining a history of recently
evicted file pages, which in turn allows the VM to tell used-once page
streams from thrashing file cache.

To accomplish this, a per-zone counter is increased every time a page
is evicted and a snapshot of that counter is stored as shadow entry in
the page's now empty page cache radix tree slot.  Upon refault of that
page, the difference between the current value of that counter and the
shadow entry value is called the refault distance.  It tells how many
pages have been evicted from the zone since that page's eviction,
which is how many page slots are missing from the zone's inactive list
for this page to get accessed twice while in memory.  If the number of
missing slots is less than or equal to the number of active pages,
increasing the inactive list at the cost of the active list would give
this thrashing set a chance to establish itself:

eviction counter = 4
                        evicted      inactive           active
 Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
  Shadow entries:         0 1 2 3
Refault distance:         4 3 2 1

When c is faulted back into memory, it is noted that two more page
slots on the inactive list could have prevented the refault.  Thus,
the active list needs to be challenged for those two page slots, as it
is possible that c is used more frequently than l, m, n.  However, c
might also be used much less frequently than the active pages, so 1)
pages cannot be directly reclaimed from the tail of the active list
and 2) refaulting pages cannot be directly activated.  Instead,
active pages are moved from the tail of the active list to the head of
the inactive list and placed directly next to the refaulting pages.
This way, both have the same time on the inactive list to prove
which page is actually used more frequently, without incurring
unnecessary major faults or diluting the active page set in case the
previously active page is in fact the more frequently used one.
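
Condensed into pseudo-C, the refault-time decision comes down to the
following illustrative sketch of the workingset_refault() added below
(variable names are made up, and the check that skips rebalancing
while the zone still has plenty of free pages is omitted):

	/* shadow entry found in the faulting radix tree slot */
	refault_distance = (eviction_counter - shadow_entry) & counter_mask;
	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE))
		zone->shrink_active++;	/* challenge the active list */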

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h  |   6 ++
 include/linux/pagemap.h |   2 +
 include/linux/swap.h    |   5 ++
 mm/Makefile             |   2 +-
 mm/filemap.c            |   3 +
 mm/swap.c               |   2 +
 mm/vmscan.c             |  62 +++++++++++---
 mm/vmstat.c             |   3 +
 mm/workingset.c         | 213 ++++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 284 insertions(+), 14 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0c41d59..e75fc92 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -141,6 +141,9 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	WORKINGSET_STALE,
+	WORKINGSET_BALANCE,
+	WORKINGSET_BALANCE_FORCE,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
@@ -393,6 +396,9 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
+	atomic_long_t		workingset_time;
+	long			shrink_active;
+
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b24236..c6beed2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -232,6 +232,8 @@ extern struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow);
 #else
 static inline struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 {
+	if (shadow)
+		workingset_refault(shadow);
 	return alloc_pages(gfp, 0);
 }
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 24db914..441845d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,11 @@ struct swap_list_t {
 	int next;	/* swapfile to be used next */
 };
 
+/* linux/mm/workingset.c */
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+void workingset_refault(void *shadow);
+void workingset_activation(struct page *page);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/Makefile b/mm/Makefile
index cd0abd8..f740427 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o \
-			   interval_tree.o list_lru.o $(mmu-y)
+			   interval_tree.o list_lru.o workingset.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/filemap.c b/mm/filemap.c
index d3e5578..ab4351e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -543,6 +543,9 @@ struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 	int n;
 	struct page *page;
 
+	if (shadow)
+		workingset_refault(shadow);
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
diff --git a/mm/swap.c b/mm/swap.c
index bf448cf..f90a331 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,6 +482,8 @@ void mark_page_accessed(struct page *page)
 		else
 			__lru_cache_activate_page(page);
 		ClearPageReferenced(page);
+		if (page_is_file_cache(page))
+			workingset_activation(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dd5f67c..27a36f6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -496,7 +496,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+			    bool reclaimed)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -542,10 +543,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
+		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page, NULL);
+		if (reclaimed && page_is_file_cache(page))
+			shadow = workingset_eviction(mapping, page);
+		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
@@ -568,7 +572,7 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
+	if (__remove_mapping(mapping, page, false)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -1038,7 +1042,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
 
 		/*
@@ -1746,32 +1750,64 @@ static inline int inactive_anon_is_low(struct lruvec *lruvec)
 /**
  * inactive_file_is_low - check if file pages need to be deactivated
  * @lruvec: LRU vector to check
+ * @nr_to_scan: number of active pages to scan
+ * @sc: scan parameters
  *
  * When the system is doing streaming IO, memory pressure here
  * ensures that active file pages get deactivated, until more
  * than half of the file pages are on the inactive list.
  *
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
+ * Once we get to that situation, protect the system's working set
+ * from being evicted by disabling active file page aging, until
+ * thrashing on the inactive list suggests that a new working set is
+ * trying to form.
  *
  * This uses a different ratio than the anonymous pages, because
  * the page cache uses a use-once replacement algorithm.
  */
-static int inactive_file_is_low(struct lruvec *lruvec)
+static int inactive_file_is_low(struct lruvec *lruvec,
+				unsigned long nr_to_scan,
+				struct scan_control *sc)
 {
+	struct zone *zone = lruvec_zone(lruvec);
 	unsigned long inactive;
 	unsigned long active;
 
+	if (global_reclaim(sc)) {
+		if (zone->shrink_active > 0) {
+			if (nr_to_scan) {
+				inc_zone_state(zone, WORKINGSET_BALANCE);
+				zone->shrink_active -= nr_to_scan;
+			}
+			return true;
+		}
+	}
+	/*
+	 * Make sure there is always a reasonable amount of inactive
+	 * file pages around to keep the zone reclaimable.
+	 *
+	 * We could do better than requiring half of memory, but we
+	 * need a big safety buffer until we are smarter about
+	 * dirty/writeback pages and file readahead windows.
+	 * Otherwise, we can end up with all pages on the inactive
+	 * list being dirty, or trash readahead pages before use.
+	 */
 	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
-
-	return active > inactive;
+	if (active > inactive) {
+		if (global_reclaim(sc) && nr_to_scan)
+			inc_zone_state(zone, WORKINGSET_BALANCE_FORCE);
+		return true;
+	}
+	return false;
 }
 
-static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
+static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru,
+				unsigned long nr_to_scan,
+				struct scan_control *sc)
 {
 	if (is_file_lru(lru))
-		return inactive_file_is_low(lruvec);
+		return inactive_file_is_low(lruvec, nr_to_scan, sc);
 	else
 		return inactive_anon_is_low(lruvec);
 }
@@ -1780,7 +1816,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, lru))
+		if (inactive_list_is_low(lruvec, lru, nr_to_scan, sc))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -1891,7 +1927,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * There is enough inactive page cache, do not reclaim
 	 * anything from the anonymous working set right now.
 	 */
-	if (!inactive_file_is_low(lruvec)) {
+	if (!inactive_file_is_low(lruvec, 0, sc)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 87228c5..2b14f7a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -738,6 +738,9 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
+	"workingset_stale",
+	"workingset_balance",
+	"workingset_balance_force",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 0000000..65714d2
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,213 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ *		Double CLOCK lists
+ *
+ * Per zone, two clock lists are maintained for file pages: the
+ * inactive and the active list.  Freshly faulted pages start out at
+ * the head of the inactive list and page reclaim scans pages from the
+ * tail.  Pages that are accessed multiple times on the inactive list
+ * are promoted to the active list, to protect them from reclaim,
+ * whereas active pages are demoted to the inactive list when the
+ * inactive list requires more space to detect repeatedly accessed
+ * pages in the current workload and prevent them from thrashing:
+ *
+ *   fault -----------------------+
+ *                                |
+ *              +-------------+   |            +-------------+
+ *   reclaim <- | inactive    | <-+-- demotion | active      | <--+
+ *              +-------------+                +-------------+    |
+ *                       |                                        |
+ *                       +----------- promotion ------------------+
+ *
+ *
+ *		Access frequency and refault distance
+ *
+ * A workload is thrashing when the distances between the first and
+ * second access of pages that are frequently used is bigger than the
+ * current inactive clock list size, as the pages get reclaimed before
+ * the second access would have promoted them instead:
+ *
+ *    Access #: 1 2 3 4 5 6 7 8 9
+ *     Page ID: x y b c d e f x y
+ *                  | inactive  |
+ *
+ * To prevent this workload from thrashing, a bigger inactive list is
+ * required.  And the only way the inactive list can grow on a full
+ * zone is by taking away space from the corresponding active list.
+ *
+ *      +-inactive--+-active------+
+ *  x y | b c d e f | G H I J K L |
+ *      +-----------+-------------+
+ *
+ * Not every refault should lead to growing the inactive list at the
+ * cost of the active list, however: if the access distances are
+ * bigger than available memory overall, there is little point in
+ * challenging the protected pages on the active list, as those
+ * refaulting pages will not fit completely into memory.
+ *
+ * It is prohibitively expensive to track the access frequency of
+ * in-core pages, but it is possible to track their refault distance,
+ * which is the number of page slots shrunk from the inactive list
+ * between a page's eviction and subsequent refault.  This indicates
+ * how many page slots are missing on the inactive list in order to
+ * prevent future thrashing of that page.  Thus, instead of comparing
+ * access frequency to total available memory, one can compare the
+ * refault distance to the inactive list's potential for growth: the
+ * size of the active list.
+ *
+ *
+ *		Rebalancing the lists
+ *
+ * Shrinking the active list has to be done carefully because the
+ * pages on it may have vastly different access frequencies compared
+ * to the pages on the inactive list.  Thus, pages are not reclaimed
+ * directly from the tail of the active list, but instead moved to the
+ * head of the inactive list.  This way, they are competing directly
+ * with the pages that challenged their protected status.  If they are
+ * unused, they will eventually be reclaimed, but if they are indeed
+ * used more frequently than the challenging inactive pages, they will
+ * be reactivated.  This allows the existing protected set to be
+ * challenged without incurring major faults in case of a mistake.
+ */
+
+static void *pack_shadow(unsigned long time, struct zone *zone)
+{
+	time = (time << NODES_SHIFT) | zone_to_nid(zone);
+	time = (time << ZONES_SHIFT) | zone_idx(zone);
+	time = (time << RADIX_TREE_EXCEPTIONAL_SHIFT);
+
+	return (void *)(time | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+static void unpack_shadow(void *shadow,
+			  struct zone **zone,
+			  unsigned long *distance)
+{
+	unsigned long entry = (unsigned long)shadow;
+	unsigned long time_of_eviction;
+	unsigned long mask;
+	unsigned long now;
+	int zid, nid;
+
+	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	zid = entry & ((1UL << ZONES_SHIFT) - 1);
+	entry >>= ZONES_SHIFT;
+	nid = entry & ((1UL << NODES_SHIFT) - 1);
+	entry >>= NODES_SHIFT;
+	time_of_eviction = entry;
+
+	*zone = NODE_DATA(nid)->node_zones + zid;
+
+	now = atomic_long_read(&(*zone)->workingset_time);
+	mask = ~0UL >> (RADIX_TREE_EXCEPTIONAL_SHIFT +
+			ZONES_SHIFT + NODES_SHIFT);
+
+	*distance = (now - time_of_eviction) & mask;
+}
+
+/**
+ * workingset_eviction - note the eviction of a page from memory
+ * @mapping: address space the page was backing
+ * @page: the page being evicted
+ *
+ * Returns a shadow entry to be stored in @mapping->page_tree in place
+ * of the evicted @page so that a later refault can be detected.  Or
+ * %NULL when the eviction should not be remembered.
+ */
+void *workingset_eviction(struct address_space *mapping, struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long time;
+
+	time = atomic_long_inc_return(&zone->workingset_time);
+
+	/*
+	 * Don't store shadows in an inode that is being reclaimed.
+	 * This is not just an optimization, inode reclaim needs to
+	 * empty out the radix tree or the nodes are lost, so don't
+	 * plant shadows behind its back.
+	 */
+	if (mapping_exiting(mapping))
+		return NULL;
+
+	return pack_shadow(time, zone);
+}
+
+/**
+ * workingset_refault - note the refault of a previously evicted page
+ * @shadow: shadow entry of the evicted page
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted page in the context of the zone it was allocated in.
+ *
+ * This primes page reclaim to rebalance the zone's file lists if
+ * necessary, so it must be called before a page frame for the
+ * refaulting page is allocated.
+ */
+void workingset_refault(void *shadow)
+{
+	unsigned long refault_distance;
+	unsigned long zone_active;
+	unsigned long zone_free;
+	struct zone *zone;
+
+	unpack_shadow(shadow, &zone, &refault_distance);
+
+	/*
+	 * The dirty balance reserve is a generous estimation of the
+	 * zone's memory reserve that is not available to page cache.
+	 * If the zone has more free pages than that, it means that
+	 * there are pages ready to allocate without reclaiming from
+	 * the zone at all, let alone putting pressure on its active
+	 * pages.
+	 */
+	zone_free = zone_page_state(zone, NR_FREE_PAGES);
+	if (zone_free > zone->dirty_balance_reserve)
+		return;
+
+	/*
+	 * Protected pages should be challenged when the refault
+	 * distance indicates that thrashing could be stopped by
+	 * increasing the inactive list at the cost of the active
+	 * list.
+	 */
+	zone_active = zone_page_state(zone, NR_ACTIVE_FILE);
+	if (refault_distance > zone_active)
+		return;
+
+	inc_zone_state(zone, WORKINGSET_STALE);
+	zone->shrink_active++;
+}
+EXPORT_SYMBOL(workingset_refault);
+
+/**
+ * workingset_activation - note a page activation
+ * @page: page that is being activated
+ */
+void workingset_activation(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	/*
+	 * The lists are rebalanced when the inactive list is observed
+	 * to be too small for activations.  An activation means that
+	 * the inactive list is now big enough again for at least one
+	 * page, so back off further deactivation.
+	 */
+	atomic_long_inc(&zone->workingset_time);
+	if (zone->shrink_active > 0)
+		zone->shrink_active--;
+}
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [patch 9/9] mm: workingset: keep shadow entries in check
  2013-08-06 22:44 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
                   ` (7 preceding siblings ...)
  2013-08-06 22:44 ` [patch 8/9] mm: thrash detection-based file cache sizing Johannes Weiner
@ 2013-08-06 22:44 ` Johannes Weiner
  2013-08-11 23:56   ` Andi Kleen
  2013-08-09 22:53 ` [patch 0/9] mm: thrash detection-based file cache sizing v3 Andrew Morton
  9 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after a significant amount of their cache has been
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, two mechanisms are used:

1. a refault balance counter is maintained per inode that grows with
   each shadow entry planted and shrinks with each refault.  Once the
   counter grows beyond a certain threshold, planting new shadows in
   that file is throttled (the arithmetic is sketched after this
   list).  It's per file so that a single file cannot disable
   thrashing detection globally.  However, this still allows shadow
   entries to grow excessively when many files show this usage
   pattern, and so:

2. a list of inodes that contain shadow entries is maintained.  If the
   global number of shadows exceeds a certain threshold, a shrinker is
   activated that reclaims old entries from the mappings.  This is
   heavy-handed, but it should not be a common case and is only there
   to protect from accidentally/maliciously induced OOM kills.  The
   global list is also not a problem because the modifications are
   very rare: inodes are added once in their lifetime when the first
   shadow entry is stored (i.e. the first page reclaimed) and are
   lazily removed when the inode exits or when the shrinker removes
   all of their shadow entries.
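
To get a rough feel for the throttling arithmetic in 1., consider a
hypothetical machine with about 4G of dirtyable memory, i.e. 2^20
pages, and assume SWAP_CLUSTER_MAX is 32:

   excess level = shadow_debt / 2^20 (rounded down)

   level 0:  every eviction plants a shadow and grows the debt by one
   level 1:  1 in  32 evictions plants a shadow, each refault repays 32
   level 2:  1 in  64 evictions plants a shadow, each refault repays 64
   level 3:  1 in 128 evictions plants a shadow, each refault repays 128

   From level 4 on the planting ratio keeps halving, but the debt
   itself stops growing, which provides the safety buffer against
   oscillation.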

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c             |   5 +-
 include/linux/fs.h     |   2 +
 include/linux/mmzone.h |   1 +
 include/linux/swap.h   |   5 +
 mm/filemap.c           |   5 +-
 mm/truncate.c          |   2 +-
 mm/vmstat.c            |   1 +
 mm/workingset.c        | 248 +++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 263 insertions(+), 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 8862b1b..b23b141 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -169,6 +169,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->private_data = NULL;
 	mapping->backing_dev_info = &default_backing_dev_info;
 	mapping->writeback_index = 0;
+	workingset_init_mapping(mapping);
 
 	/*
 	 * If the block_device provides a backing_dev_info for client
@@ -546,9 +547,7 @@ static void evict(struct inode *inode)
 	 */
 	inode_wait_for_writeback(inode);
 
-	spin_lock_irq(&inode->i_data.tree_lock);
-	mapping_set_exiting(&inode->i_data);
-	spin_unlock_irq(&inode->i_data.tree_lock);
+	workingset_exit_mapping(&inode->i_data);
 
 	if (op->evict_inode) {
 		op->evict_inode(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ac5d84e..ea3c25b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -417,6 +417,8 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	struct list_head	shadow_list;	/* list of mappings with shadows */
+	unsigned long		shadow_debt;	/* shadow entries with unmatched refaults */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e75fc92..6e74ac5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -144,6 +144,7 @@ enum zone_stat_item {
 	WORKINGSET_STALE,
 	WORKINGSET_BALANCE,
 	WORKINGSET_BALANCE_FORCE,
+	WORKINGSET_SHADOWS_RECLAIMED,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 441845d..4816c50 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,9 +261,14 @@ struct swap_list_t {
 };
 
 /* linux/mm/workingset.c */
+void workingset_init_mapping(struct address_space *mapping);
+void workingset_exit_mapping(struct address_space *mapping);
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 void workingset_refault(void *shadow);
+void workingset_count_refault(struct address_space *mapping);
 void workingset_activation(struct page *page);
+void workingset_shadows_inc(struct address_space *mapping);
+void workingset_shadows_dec(struct address_space *mapping);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index ab4351e..bd4121b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -132,7 +132,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
 		radix_tree_replace_slot(slot, shadow);
-		mapping->nrshadows++;
+		workingset_shadows_inc(mapping);
 	} else
 		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
@@ -466,7 +466,8 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
+		workingset_count_refault(mapping);
+		workingset_shadows_dec(mapping);
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 5c85dd4..76064a4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -36,7 +36,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * need verification under the tree lock.
 	 */
 	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2b14f7a..2c5bf80 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -741,6 +741,7 @@ const char * const vmstat_text[] = {
 	"workingset_stale",
 	"workingset_balance",
 	"workingset_balance_force",
+	"workingset_shadows_reclaimed",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 65714d2..5c5cf74 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -84,6 +84,55 @@
  * challenged without incurring major faults in case of a mistake.
  */
 
+static DEFINE_PER_CPU(unsigned long, nr_shadows);
+static DEFINE_SPINLOCK(shadow_lock);
+static LIST_HEAD(shadow_mappings);
+static int memory_shift;
+
+/**
+ * workingset_init_mapping - prepare address space for page reclaim
+ * @mapping: address space
+ *
+ * Must be called when the inode is instantiated, before any page
+ * cache is populated.
+ */
+void workingset_init_mapping(struct address_space *mapping)
+{
+	INIT_LIST_HEAD(&mapping->shadow_list);
+	/*
+	 * Throttle installation of shadow entries in new inodes from
+	 * the beginning.  Subsequent refaults will decrease this to
+	 * make the inode a more trusted source when evaluating
+	 * workingset changes.  Or not, in which case we put less
+	 * pressure on the shadow shrinker.
+	 */
+	mapping->shadow_debt = global_dirtyable_memory();
+}
+
+/**
+ * workingset_exit_mapping - tell page reclaim address space is exiting
+ * @mapping: address space
+ *
+ * Must be called before the final truncate, to prevent page reclaim
+ * from installing shadow entries behind the back of the inode
+ * teardown process.
+ */
+void workingset_exit_mapping(struct address_space *mapping)
+{
+	spin_lock_irq(&mapping->tree_lock);
+	mapping_set_exiting(mapping);
+	/*
+	 * Take it off the shrinker list, the final truncate is about
+	 * to remove potentially remaining shadow entries.
+	 */
+	if (!list_empty(&mapping->shadow_list)) {
+		spin_lock(&shadow_lock);
+		list_del(&mapping->shadow_list);
+		spin_unlock(&shadow_lock);
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+}
+
 static void *pack_shadow(unsigned long time, struct zone *zone)
 {
 	time = (time << NODES_SHIFT) | zone_to_nid(zone);
@@ -131,6 +180,7 @@ static void unpack_shadow(void *shadow,
 void *workingset_eviction(struct address_space *mapping, struct page *page)
 {
 	struct zone *zone = page_zone(page);
+	unsigned long excess_order;
 	unsigned long time;
 
 	time = atomic_long_inc_return(&zone->workingset_time);
@@ -144,6 +194,25 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
 	if (mapping_exiting(mapping))
 		return NULL;
 
+	/*
+	 * If the planted shadows exceed the refaults, throttle new
+	 * planting to relieve the shadow shrinker.
+	 */
+	excess_order = mapping->shadow_debt >> memory_shift;
+	if (excess_order &&
+	    (time & ((SWAP_CLUSTER_MAX << (excess_order - 1)) - 1)))
+		return NULL;
+
+	/*
+	 * The counter needs a safety buffer so that we don't
+	 * oscillate, but don't plant shadows too sparsely, either.
+	 * This is a trade-off between shrinker activity during
+	 * streaming IO and speed of adapting when the workload
+	 * actually does start to use this file's pages frequently.
+	 */
+	if (excess_order < 4)
+		mapping->shadow_debt++;
+
 	return pack_shadow(time, zone);
 }
 
@@ -195,6 +264,24 @@ void workingset_refault(void *shadow)
 EXPORT_SYMBOL(workingset_refault);
 
 /**
+ * workingset_count_refault - account for finished refault
+ * @mapping: address space that was repopulated
+ *
+ * Account for a refault after the page has been fully reinstated in
+ * @mapping.
+ */
+void workingset_count_refault(struct address_space *mapping)
+{
+	unsigned int excess_order;
+	unsigned long delta = 1;
+
+	excess_order = mapping->shadow_debt >> memory_shift;
+	if (excess_order)
+		delta = SWAP_CLUSTER_MAX << (excess_order - 1);
+	mapping->shadow_debt -= min(delta, mapping->shadow_debt);
+}
+
+/**
  * workingset_activation - note a page activation
  * @page: page that is being activated
  */
@@ -211,3 +298,164 @@ void workingset_activation(struct page *page)
 	if (zone->shrink_active > 0)
 		zone->shrink_active--;
 }
+
+void workingset_shadows_inc(struct address_space *mapping)
+{
+	might_lock(&shadow_lock);
+	if (mapping->nrshadows++ == 0 && list_empty(&mapping->shadow_list)) {
+		spin_lock(&shadow_lock);
+		list_add(&mapping->shadow_list, &shadow_mappings);
+		spin_unlock(&shadow_lock);
+	}
+	this_cpu_inc(nr_shadows);
+}
+
+void workingset_shadows_dec(struct address_space *mapping)
+{
+	mapping->nrshadows--;
+	this_cpu_dec(nr_shadows);
+	/*
+	 * shadow_mappings operations are costly, so we keep the
+	 * mapping linked here even without any shadows left and
+	 * unlink it lazily in the shadow shrinker or when the inode
+	 * is destroyed.
+	 */
+}
+
+static unsigned long get_nr_shadows(void)
+{
+	long sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += per_cpu(nr_shadows, cpu);
+	return max(sum, 0L);
+}
+
+static unsigned long nr_old_shadows(unsigned long nr_shadows,
+				    unsigned long cutoff)
+{
+	if (nr_shadows <= cutoff)
+		return 0;
+	return nr_shadows - cutoff;
+}
+
+static unsigned long scan_mapping(struct address_space *mapping,
+				  unsigned long nr_to_scan)
+{
+	unsigned long nr_scanned = 0;
+	struct radix_tree_iter iter;
+	void **slot;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
+		unsigned long nrshadows;
+		unsigned long distance;
+		unsigned long nractive;
+		struct zone *zone;
+		struct page *page;
+
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (!radix_tree_exception(page))
+			continue;
+		if (radix_tree_deref_retry(page))
+			goto restart;
+
+		unpack_shadow(page, &zone, &distance);
+
+		nractive = zone_page_state(zone, NR_ACTIVE_FILE);
+		if (distance <= nractive)
+			continue;
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (radix_tree_delete_item(&mapping->page_tree,
+					   iter.index, page)) {
+			inc_zone_state(zone, WORKINGSET_SHADOWS_RECLAIMED);
+			workingset_shadows_dec(mapping);
+			nr_scanned++;
+		}
+		nrshadows = mapping->nrshadows;
+		spin_unlock_irq(&mapping->tree_lock);
+
+		if (nrshadows == 0)
+			break;
+
+		if (--nr_to_scan == 0)
+			break;
+	}
+	rcu_read_unlock();
+	return nr_scanned;
+}
+
+static unsigned long count_shadows(struct shrinker *shrink,
+				   struct shrink_control *sc)
+{
+	return nr_old_shadows(get_nr_shadows(), global_dirtyable_memory());
+}
+
+static unsigned long scan_shadows(struct shrinker *shrink,
+				  struct shrink_control *sc)
+{
+	unsigned long nr_scanned = 0;
+	unsigned long nr_to_scan;
+	unsigned long nr_max;
+	unsigned long nr_old;
+
+	nr_to_scan = sc->nr_to_scan;
+	nr_max = global_dirtyable_memory() * 2;
+	nr_old = nr_old_shadows(get_nr_shadows(), nr_max);
+
+	while (nr_to_scan && nr_old) {
+		struct address_space *mapping;
+
+		spin_lock_irq(&shadow_lock);
+		if (list_empty(&shadow_mappings)) {
+			spin_unlock_irq(&shadow_lock);
+			break;
+		}
+		mapping = list_entry(shadow_mappings.prev,
+				     struct address_space,
+				     shadow_list);
+		list_move(&mapping->shadow_list, &shadow_mappings);
+		__iget(mapping->host);
+		spin_unlock_irq(&shadow_lock);
+
+		if (mapping->nrshadows) {
+			unsigned long nr;
+
+			nr = scan_mapping(mapping, nr_to_scan);
+			nr_to_scan -= nr;
+			nr_scanned += nr;
+		}
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (mapping->nrshadows == 0) {
+			spin_lock(&shadow_lock);
+			list_del_init(&mapping->shadow_list);
+			spin_unlock(&shadow_lock);
+		}
+		spin_unlock_irq(&mapping->tree_lock);
+
+		iput(mapping->host);
+
+		nr_old = nr_old_shadows(get_nr_shadows(), nr_max);
+	}
+	return nr_scanned;
+}
+
+static struct shrinker shadow_shrinker = {
+	.count_objects = count_shadows,
+	.scan_objects = scan_shadows,
+	.seeks = 1,
+};
+
+static int __init workingset_init(void)
+{
+	memory_shift = ilog2(global_dirtyable_memory());
+	register_shrinker(&shadow_shrinker);
+	return 0;
+}
+core_initcall(workingset_init);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-06 22:44 ` [patch 8/9] mm: thrash detection-based file cache sizing Johannes Weiner
@ 2013-08-09 22:49   ` Andrew Morton
  2013-08-12 16:00     ` Johannes Weiner
  2013-08-11 21:57   ` Vlastimil Babka
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2013-08-09 22:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

On Tue,  6 Aug 2013 18:44:09 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> To accomplish this, a per-zone counter is increased every time a page
> is evicted and a snapshot of that counter is stored as shadow entry in
> the page's now empty page cache radix tree slot.

How do you handle wraparound of that counter on 32-bit machines?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 0/9] mm: thrash detection-based file cache sizing v3
  2013-08-06 22:44 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
                   ` (8 preceding siblings ...)
  2013-08-06 22:44 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner
@ 2013-08-09 22:53 ` Andrew Morton
  2013-08-12 22:15   ` Johannes Weiner
  9 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2013-08-09 22:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

On Tue,  6 Aug 2013 18:44:01 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> This series solves the problem by maintaining a history of pages
> evicted from the inactive list, enabling the VM to tell streaming IO
> from thrashing and rebalance the page cache lists when appropriate.

Looks nice. The lack of testing results is conspicuous ;)

It only really solves the problem in the case where

	size-of-inactive-list < size-of-working-set < size-of-total-memory

yes?  In fact less than that, because the active list presumably
doesn't get shrunk to zero (how far *can* it go?).  I wonder how many
workloads fit into those constraints in the real world.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-06 22:44 ` [patch 8/9] mm: thrash detection-based file cache sizing Johannes Weiner
  2013-08-09 22:49   ` Andrew Morton
@ 2013-08-11 21:57   ` Vlastimil Babka
  2013-08-12 16:27     ` Johannes Weiner
  1 sibling, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2013-08-11 21:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Rik van Riel, Michel Lespinasse, Seth Jennings, Roman Gushchin,
	Ozgun Erdogan, Metin Doslu, linux-fsdevel, linux-kernel

On 08/07/2013 12:44 AM, Johannes Weiner wrote:
> To accomplish this, a per-zone counter is increased every time a page
> is evicted and a snapshot of that counter is stored as shadow entry in
> the page's now empty page cache radix tree slot.  Upon refault of that
> page, the difference between the current value of that counter and the
> shadow entry value is called the refault distance.  It tells how many
> pages have been evicted from the zone since that page's eviction,
This explanation of refault distance seems correct...
> which is how many page slots are missing from the zone's inactive list
> for this page to get accessed twice while in memory.
But this part seems slightly incorrect. IMHO the correct formulation 
would be "...how many page slots are AT MOST missing...". See below.
> If the number of
> missing slots is less than or equal to the number of active pages,
> increasing the inactive list at the cost of the active list would give
> this thrashing set a chance to establish itself:
>
> eviction counter = 4
>                          evicted      inactive           active
>   Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
>    Shadow entries:         0 1 2 3
> Refault distance:         4 3 2 1
Consider here that if 'd' was now accessed before 'c', I think 'e' would 
be evicted and eviction counter would be incremented to 5. So for 'c' 
you would now say that three slots would prevent the refault, but in 
fact two would still be sufficient. This potential imprecision could 
make the algorithm challenge more active pages than it should, but I am 
not sure how bad the consequences could be... so just pointing it out.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 9/9] mm: workingset: keep shadow entries in check
  2013-08-06 22:44 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner
@ 2013-08-11 23:56   ` Andi Kleen
  2013-08-14 14:41     ` Johannes Weiner
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2013-08-11 23:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Rik van Riel, Michel Lespinasse, Seth Jennings, Roman Gushchin,
	Ozgun Erdogan, Metin Doslu, linux-fsdevel, linux-kernel


I really like the idea of using the spare slots in the radix tree
for something useful. It's amazing we haven't used that before.

I wonder if with some clever encoding even more information could be fit?

e.g. I assume you don't really need all bits of the refault distance,
just a good enough approximation.
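
Something as crude as storing only the order of the distance would
probably do, just to illustrate the idea (made-up helper names):

	/* a handful of bits instead of the full distance */
	static unsigned int encode_distance(unsigned long dist)
	{
		return dist ? ilog2(dist) + 1 : 0;
	}

	static unsigned long decode_distance(unsigned int enc)
	{
		return enc ? 1UL << (enc - 1) : 0;
	}

which rounds the distance down to a power of two but leaves the other
bits of the slot free for whatever else you want to store.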

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-09 22:49   ` Andrew Morton
@ 2013-08-12 16:00     ` Johannes Weiner
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-08-12 16:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

On Fri, Aug 09, 2013 at 03:49:43PM -0700, Andrew Morton wrote:
> On Tue,  6 Aug 2013 18:44:09 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > To accomplish this, a per-zone counter is increased every time a page
> > is evicted and a snapshot of that counter is stored as shadow entry in
> > the page's now empty page cache radix tree slot.
> 
> How do you handle wraparound of that counter on 32-bit machines?

The distance between two time stamps is an unsigned subtraction, so
it's accurate even when the counter has wrapped between them.
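
A minimal sketch of that property (illustrative values, not the patch
code; 16-bit numbers just to keep the example small):

	u16 evict_stamp = 0xfffe;	/* counter at eviction time */
	u16 now = 0x0003;		/* counter has wrapped since */
	u16 distance = now - evict_stamp;	/* == 5 */

The subtraction is modulo 2^16, so the five evictions that happened in
between are counted correctly despite the wrap.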

The per-zone counter lapping shadow entries is possible but not very
likely, because shadow entries are reclaimed once more than
2*global_dirtyable_memory() of them exist.  And usually they are
refaulted or reclaimed along with the inode before that happens.

There is an unlikely case where some shadow entries make it into an
inode and then that same inode keeps evicting and refaulting pages in
another area, which increases the counter without producing an excess
of shadow entries.  Should the counter lap these inactive shadow
entries, the worst case is that a refault will incorrectly interpret
them as recently evicted and deactivate a page for every such entry.
That would at worst be a "regression" to how the code behaved for a
long time, when every reclaim run also deactivated some pages.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-11 21:57   ` Vlastimil Babka
@ 2013-08-12 16:27     ` Johannes Weiner
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-08-12 16:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Rik van Riel, Michel Lespinasse, Seth Jennings, Roman Gushchin,
	Ozgun Erdogan, Metin Doslu, linux-fsdevel, linux-kernel

Hello Vlastimil!

On Sun, Aug 11, 2013 at 11:57:37PM +0200, Vlastimil Babka wrote:
> On 08/07/2013 12:44 AM, Johannes Weiner wrote:
> >To accomplish this, a per-zone counter is increased every time a page
> >is evicted and a snapshot of that counter is stored as shadow entry in
> >the page's now empty page cache radix tree slot.  Upon refault of that
> >page, the difference between the current value of that counter and the
> >shadow entry value is called the refault distance.  It tells how many
> >pages have been evicted from the zone since that page's eviction,
> This explanation of refault distance seems correct...
> >which is how many page slots are missing from the zone's inactive list
> >for this page to get accessed twice while in memory.
> But this part seems slightly incorrect. IMHO the correct formulation
> would be "...how many page slots are AT MOST missing...". See below.

Yes, I think this would be better phrasing.

> >If the number of
> >missing slots is less than or equal to the number of active pages,
> >increasing the inactive list at the cost of the active list would give
> >this thrashing set a chance to establish itself:
> >
> >eviction counter = 4
> >                         evicted      inactive           active
> >  Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
> >   Shadow entries:         0 1 2 3
> >Refault distance:         4 3 2 1
> Consider here that if 'd' was now accessed before 'c', I think 'e'
> would be evicted and eviction counter would be incremented to 5. So
> for 'c' you would now say that three slots would prevent the
> refault, but in fact two would still be sufficient. This potential
> imprecision could make the algorithm challenge more active pages
> than it should, but I am not sure how bad the consequences could
> be... so just pointing it out.

Yes, pages are not always accessed in strict order, and there is
occasionally an outlier that gets accessed before its peers.

In fact, the code already treats it as an AT MOST.  Every refault is
evaluated based on its refault distance, but a refault distance of N
does not mean that N pages are deactivated.  Only one page is.

An outlier is thus harmless, but we do catch it when a whole group of
pages is thrashing.
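
Paraphrased (not the exact code in patch 8/9), the per-refault
decision is just:

	unpack_shadow(shadow, &zone, &refault_distance);

	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE))
		zone->shrink_active++;	/* challenge at most one page */

so even when the distance overestimates what a particular page would
have needed, at most one active page gets challenged for it.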

The changelog and documentation should be updated, thanks for pointing
this out.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 0/9] mm: thrash detection-based file cache sizing v3
  2013-08-09 22:53 ` [patch 0/9] mm: thrash detection-based file cache sizing v3 Andrew Morton
@ 2013-08-12 22:15   ` Johannes Weiner
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-08-12 22:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

On Fri, Aug 09, 2013 at 03:53:09PM -0700, Andrew Morton wrote:
> On Tue,  6 Aug 2013 18:44:01 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > This series solves the problem by maintaining a history of pages
> > evicted from the inactive list, enabling the VM to tell streaming IO
> > from thrashing and rebalance the page cache lists when appropriate.
> 
> Looks nice. The lack of testing results is conspicuous ;)
> 
> It only really solves the problem in the case where
> 
> 	size-of-inactive-list < size-of-working-set < size-of-total-memory
> 
> yes?  In fact less than that, because the active list presumably
> doesn't get shrunk to zero (how far *can* it go?).

It can theoretically shrink to 0 if the replacing working set needs
exactly 100% of the available memory and is perfectly sequential and
the page allocator is 100% fair.  So in practice, it probably won't.

It's more likely that, after some active pages have been deactivated
and pushed out of memory, new pages get a chance to be activated, so
there will always be some pages on the active list.

> I wonder how many workloads fit into those constraints in the real
> world.

If the working set exceeds memory and the reference frequency is the
same for each page in the set, there is nothing caching can
reasonably do.

If the working set exceeds memory and all reference distances are
bigger than memory but not all equal to each other, it would be great
to be able to detect the more frequently used pages and prefer
reclaiming the others over them.  But I don't think that's actually
possible without a true LRU algorithm (as opposed to our
approximation) because we would need to know about reference distances
in the active page list and compare them to the refault distances.

So yes, this algorithm is limited to interpreting reference distances
up to memory size.

The development of this was kicked off by actual bug reports and I'm
working with the reporters to get these patches tested in the
production environments that exhibited the problem.  The reporters
always had use cases where the working set should have fit into
memory but was not cached even after being referenced repeatedly;
that's why they complained in the first place.  So it's hard to tell
how many environments fall into this category, but they certainly do
exist, they are not unreasonable setups, and the behavior is pretty
abysmal (most accesses are major faults even though everything should
fit in memory).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 9/9] mm: workingset: keep shadow entries in check
  2013-08-11 23:56   ` Andi Kleen
@ 2013-08-14 14:41     ` Johannes Weiner
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-08-14 14:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

Hey Andi!

On Mon, Aug 12, 2013 at 01:56:31AM +0200, Andi Kleen wrote:
> 
> I really like the idea of using the spare slots in the radix tree
> for something useful. It's amazing we haven't used that before.
> 
> I wonder if with some clever encoding even more information could be fit?

What do you have in mind?

> e.g. I assume you don't really need all bits of the refault distance,
> just a good enough approximation.
> 
> -Andi
> -- 
> ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 9/9] mm: workingset: keep shadow entries in check
  2013-08-20 20:59   ` Andrew Morton
@ 2013-08-22  9:48     ` Johannes Weiner
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Weiner @ 2013-08-22  9:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Tue, Aug 20, 2013 at 01:59:24PM -0700, Andrew Morton wrote:
> On Sat, 17 Aug 2013 15:31:23 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> 
> erk.  This whole patch is overhead :(
> 
> > To get this under control, a list of inodes that contain shadow
> > entries is maintained.  If the global number of shadows exceeds a
> > certain threshold, a shrinker is activated that reclaims old entries
> > from the mappings.  This is heavy-handed but it should not be a hot
> > path and is mainly there to protect from accidentally/maliciously
> > induced OOM kills.  The global list is also not a problem because the
> > modifications are very rare: inodes are added once in their lifetime
> > when the first shadow entry is stored (i.e. the first page reclaimed)
> > and lazily removed when the inode exits.  Or if the shrinker removes
> > all shadow entries.
> > 
> > ...
> >
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -417,6 +417,7 @@ struct address_space {
> >  	/* Protected by tree_lock together with the radix tree */
> >  	unsigned long		nrpages;	/* number of total pages */
> >  	unsigned long		nrshadows;	/* number of shadow entries */
> > +	struct list_head	shadow_list;	/* list of mappings with shadows */
> >  	pgoff_t			writeback_index;/* writeback starts here */
> >  	const struct address_space_operations *a_ops;	/* methods */
> >  	unsigned long		flags;		/* error bits/gfp mask */
> 
> There's another 16 bytes into the inode.  Bad.

Yeah :(

An obvious alternative to storing page eviction information in the
page cache radix tree would be to have a separate data structure that
scales with the number of physical pages available.

Which one is cheaper, in terms of both memory and CPU, really depends
on the workload.
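
Back-of-the-envelope, assuming the separate structure would keep one
unsigned long of eviction data per page frame:

	separate structure:  8 bytes per 4096-byte page, roughly 0.2%
	                     of RAM, paid unconditionally
	in-tree shadows:     16 bytes per inode for the list_head, plus
	                     whatever radix tree nodes the remaining
	                     shadow entries keep alive, so entirely
	                     workload dependent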

Workloads where the ratio between number of inodes and inode size is
high suffer from the increased inode size, but when they have decent
access locality within the inodes, the inodes should be evicted along
with their pages.  So in this case there is little to no memory
overhead from the eviction information compared to the fixed size
separate data structure.

And refault information lookup is cheaper of course when storing
eviction information inside the cache slots.

Workloads for which this model sucks are those with inode locality but
no data locality.  The inodes stay alive and the lack of data locality
produces many shadow entries that only the costly shrinker can get rid
of.  Numbers aside, it was a judgement call to improve workloads with
high data locality at the cost of those without.

But I'll include a more concrete cost analysis in the next
submission, along with more details on the benefits of the series ;)

> > +void workingset_shadows_inc(struct address_space *mapping)
> > +{
> > +	might_lock(&shadow_lock);
> > +
> > +	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
> > +		spin_lock(&shadow_lock);
> 
> I can't work out whether or not shadow_lock is supposed to be irq-save.
> Some places it is, others are unobvious.

It is.  The caller holds the irq-safe tree_lock, though, so no need to
disable IRQs a second time.  I'll add documentation.
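
Something along these lines is what I have in mind for that comment
(just a sketch):

	/*
	 * Lock ordering:
	 *
	 *   mapping->tree_lock  (irq-safe, held by the page cache paths)
	 *     shadow_lock       (nested: plain spin_lock is enough, as
	 *                        interrupts are already disabled)
	 *
	 * Paths that take shadow_lock on its own, like the shadow
	 * shrinker, must use spin_lock_irq().
	 */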

> > +static unsigned long get_nr_old_shadows(void)
> > +{
> > +	unsigned long nr_max;
> > +	unsigned long nr;
> > +	long sum = 0;
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu)
> > +		sum += per_cpu(nr_shadows, cpu);
> 
> Ouch, slow.  shrink_slab() will call this repeatedly and scan_shadows()
> calls it from a loop.  Can we use something non-deathly-slow here? 
> Like percpu_counter_read_positive()?

Finally, a usecase for percpu_counter!!  Sounds reasonable, I'll
convert this stuff over.
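
Roughly what the conversion would look like (untested sketch):

	static struct percpu_counter nr_shadows;

	/* the inc/dec hot paths switch from this_cpu_inc()/dec()... */
	static inline void shadow_count_inc(void)
	{
		percpu_counter_inc(&nr_shadows);
	}

	static inline void shadow_count_dec(void)
	{
		percpu_counter_dec(&nr_shadows);
	}

	/* ...and the shrinker gets an approximate O(1) read */
	static unsigned long get_nr_shadows(void)
	{
		return percpu_counter_read_positive(&nr_shadows);
	}

plus a percpu_counter_init() for nr_shadows in workingset_init().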

> > +	nr = max(sum, 0L);
> > +
> > +	/*
> > +	 * Every shadow entry with a refault distance bigger than the
> > +	 * active list is ignored and so NR_ACTIVE_FILE would be a
> > +	 * reasonable ceiling.  But scanning and shrinking shadow
> > +	 * entries is quite expensive, so be generous.
> > +	 */
> > +	nr_max = global_dirtyable_memory() * 4;
> > +
> > +	if (nr <= nr_max)
> > +		return 0;
> > +	return nr - nr_max;
> > +}
> > +
> > +static unsigned long scan_mapping(struct address_space *mapping,
> > +				  unsigned long nr_to_scan)
> 
> Some methodological description would be useful.

Fair enough, I'll write something.

Thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch 9/9] mm: workingset: keep shadow entries in check
  2013-08-17 19:31 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner
@ 2013-08-20 20:59   ` Andrew Morton
  2013-08-22  9:48     ` Johannes Weiner
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2013-08-20 20:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Sat, 17 Aug 2013 15:31:23 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.

erk.  This whole patch is overhead :(

> To get this under control, a list of inodes that contain shadow
> entries is maintained.  If the global number of shadows exceeds a
> certain threshold, a shrinker is activated that reclaims old entries
> from the mappings.  This is heavy-handed but it should not be a hot
> path and is mainly there to protect from accidentally/maliciously
> induced OOM kills.  The global list is also not a problem because the
> modifications are very rare: inodes are added once in their lifetime
> when the first shadow entry is stored (i.e. the first page reclaimed)
> and lazily removed when the inode exits.  Or if the shrinker removes
> all shadow entries.
> 
> ...
>
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -417,6 +417,7 @@ struct address_space {
>  	/* Protected by tree_lock together with the radix tree */
>  	unsigned long		nrpages;	/* number of total pages */
>  	unsigned long		nrshadows;	/* number of shadow entries */
> +	struct list_head	shadow_list;	/* list of mappings with shadows */
>  	pgoff_t			writeback_index;/* writeback starts here */
>  	const struct address_space_operations *a_ops;	/* methods */
>  	unsigned long		flags;		/* error bits/gfp mask */

There's another 16 bytes into the inode.  Bad.

>
> ...
>
> +void workingset_shadows_inc(struct address_space *mapping)
> +{
> +	might_lock(&shadow_lock);
> +
> +	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
> +		spin_lock(&shadow_lock);

I can't work out whether or not shadow_lock is supposed to be irq-save.
Some places it is, others are unobvious.

> +		list_add(&mapping->shadow_list, &shadow_mappings);
> +		spin_unlock(&shadow_lock);
> +	}
> +
> +	mapping->nrshadows++;
> +	this_cpu_inc(nr_shadows);
> +}
> +
>
> ...
>
> +static unsigned long get_nr_old_shadows(void)
> +{
> +	unsigned long nr_max;
> +	unsigned long nr;
> +	long sum = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		sum += per_cpu(nr_shadows, cpu);

Ouch, slow.  shrink_slab() will call this repeatedly and scan_shadows()
calls it from a loop.  Can we use something non-deathly-slow here? 
Like percpu_counter_read_positive()?

> +	nr = max(sum, 0L);
> +
> +	/*
> +	 * Every shadow entry with a refault distance bigger than the
> +	 * active list is ignored and so NR_ACTIVE_FILE would be a
> +	 * reasonable ceiling.  But scanning and shrinking shadow
> +	 * entries is quite expensive, so be generous.
> +	 */
> +	nr_max = global_dirtyable_memory() * 4;
> +
> +	if (nr <= nr_max)
> +		return 0;
> +	return nr - nr_max;
> +}
> +
> +static unsigned long scan_mapping(struct address_space *mapping,
> +				  unsigned long nr_to_scan)

Some methodological description would be useful.

> +{
> +	unsigned long nr_scanned = 0;
> +	struct radix_tree_iter iter;
> +	void **slot;
> +
> +	rcu_read_lock();
> +restart:
> +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
> +		unsigned long nrshadows;
> +		unsigned long distance;
> +		struct zone *zone;
> +		struct page *page;
> +
> +		page = radix_tree_deref_slot(slot);
> +		if (unlikely(!page))
> +			continue;
> +		if (!radix_tree_exception(page))
> +			continue;
> +		if (radix_tree_deref_retry(page))
> +			goto restart;
> +
> +		unpack_shadow(page, &zone, &distance);
> +
> +		if (distance <= zone_page_state(zone, NR_ACTIVE_FILE))
> +			continue;
> +
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (radix_tree_delete_item(&mapping->page_tree,
> +					   iter.index, page)) {
> +			inc_zone_state(zone, WORKINGSET_SHADOWS_RECLAIMED);
> +			workingset_shadows_dec(mapping);
> +			nr_scanned++;
> +		}
> +		nrshadows = mapping->nrshadows;
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		if (nrshadows == 0)
> +			break;
> +
> +		if (--nr_to_scan == 0)
> +			break;
> +	}
> +	rcu_read_unlock();
> +
> +	return nr_scanned;
> +}
> +
>
> ...
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch 9/9] mm: workingset: keep shadow entries in check
  2013-08-17 19:31 [patch 9/9] mm: thrash detection-based file cache sizing v4 Johannes Weiner
@ 2013-08-17 19:31 ` Johannes Weiner
  2013-08-20 20:59   ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, a list of inodes that contain shadow
entries is maintained.  If the global number of shadows exceeds a
certain threshold, a shrinker is activated that reclaims old entries
from the mappings.  This is heavy-handed but it should not be a hot
path and is mainly there to protect from accidentally/maliciously
induced OOM kills.  The global list is also not a problem because the
modifications are very rare: inodes are added once in their lifetime
when the first shadow entry is stored (i.e. the first page reclaimed)
and lazily removed when the inode exits.  Or if the shrinker removes
all shadow entries.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c             |   5 +-
 include/linux/fs.h     |   1 +
 include/linux/mmzone.h |   1 +
 include/linux/swap.h   |   4 +
 mm/filemap.c           |   4 +-
 mm/truncate.c          |   2 +-
 mm/vmstat.c            |   1 +
 mm/workingset.c        | 196 +++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 208 insertions(+), 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 8862b1b..b23b141 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -169,6 +169,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->private_data = NULL;
 	mapping->backing_dev_info = &default_backing_dev_info;
 	mapping->writeback_index = 0;
+	workingset_init_mapping(mapping);
 
 	/*
 	 * If the block_device provides a backing_dev_info for client
@@ -546,9 +547,7 @@ static void evict(struct inode *inode)
 	 */
 	inode_wait_for_writeback(inode);
 
-	spin_lock_irq(&inode->i_data.tree_lock);
-	mapping_set_exiting(&inode->i_data);
-	spin_unlock_irq(&inode->i_data.tree_lock);
+	workingset_exit_mapping(&inode->i_data);
 
 	if (op->evict_inode) {
 		op->evict_inode(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ac5d84e..140ae2d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -417,6 +417,7 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	struct list_head	shadow_list;	/* list of mappings with shadows */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 56f540e..747adae 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -146,6 +146,7 @@ enum zone_stat_item {
 	WORKINGSET_STALE,
 	WORKINGSET_BALANCE,
 	WORKINGSET_BALANCE_FORCE,
+	WORKINGSET_SHADOWS_RECLAIMED,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 441845d..f18dd33eb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,9 +261,13 @@ struct swap_list_t {
 };
 
 /* linux/mm/workingset.c */
+void workingset_init_mapping(struct address_space *mapping);
+void workingset_exit_mapping(struct address_space *mapping);
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 void workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
+void workingset_shadows_inc(struct address_space *mapping);
+void workingset_shadows_dec(struct address_space *mapping);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index ab4351e..d47d83e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -132,7 +132,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
 		radix_tree_replace_slot(slot, shadow);
-		mapping->nrshadows++;
+		workingset_shadows_inc(mapping);
 	} else
 		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
@@ -466,7 +466,7 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 5c85dd4..76064a4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -36,7 +36,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * need verification under the tree lock.
 	 */
 	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4fabaed..b0b8e3d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -743,6 +743,7 @@ const char * const vmstat_text[] = {
 	"workingset_stale",
 	"workingset_balance",
 	"workingset_balance_force",
+	"workingset_shadows_reclaimed",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 88ad7ea..00b2280e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -84,6 +84,46 @@
  * challenged without incurring major faults in case of a mistake.
  */
 
+static DEFINE_PER_CPU(unsigned long, nr_shadows);
+static DEFINE_SPINLOCK(shadow_lock);
+static LIST_HEAD(shadow_mappings);
+
+/**
+ * workingset_init_mapping - prepare address space for page reclaim
+ * @mapping: address space
+ *
+ * Must be called when the inode is instantiated, before any page
+ * cache is populated.
+ */
+void workingset_init_mapping(struct address_space *mapping)
+{
+	INIT_LIST_HEAD(&mapping->shadow_list);
+}
+
+/**
+ * workingset_exit_mapping - tell page reclaim address space is exiting
+ * @mapping: address space
+ *
+ * Must be called before the final truncate, to prevent page reclaim
+ * from installing shadow entries behind the back of the inode
+ * teardown process.
+ */
+void workingset_exit_mapping(struct address_space *mapping)
+{
+	spin_lock_irq(&mapping->tree_lock);
+	mapping_set_exiting(mapping);
+	/*
+	 * Take the mapping off the shrinker list, the final truncate
+	 * is about to remove potentially remaining shadow entries.
+	 */
+	if (!list_empty(&mapping->shadow_list)) {
+		spin_lock(&shadow_lock);
+		list_del(&mapping->shadow_list);
+		spin_unlock(&shadow_lock);
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+}
+
 static void *pack_shadow(unsigned long time, struct zone *zone)
 {
 	time = (time << NODES_SHIFT) | zone_to_nid(zone);
@@ -198,3 +238,159 @@ void workingset_activation(struct page *page)
 	if (zone->shrink_active > 0)
 		zone->shrink_active--;
 }
+
+void workingset_shadows_inc(struct address_space *mapping)
+{
+	might_lock(&shadow_lock);
+
+	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
+		spin_lock(&shadow_lock);
+		list_add(&mapping->shadow_list, &shadow_mappings);
+		spin_unlock(&shadow_lock);
+	}
+
+	mapping->nrshadows++;
+	this_cpu_inc(nr_shadows);
+}
+
+void workingset_shadows_dec(struct address_space *mapping)
+{
+	mapping->nrshadows--;
+	this_cpu_dec(nr_shadows);
+	/*
+	 * shadow_mappings operations are costly, so we keep the
+	 * mapping linked here even without any shadows left and
+	 * unlink it lazily in the shadow shrinker or when the inode
+	 * is destroyed.
+	 */
+}
+
+static unsigned long get_nr_old_shadows(void)
+{
+	unsigned long nr_max;
+	unsigned long nr;
+	long sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += per_cpu(nr_shadows, cpu);
+	nr = max(sum, 0L);
+
+	/*
+	 * Every shadow entry with a refault distance bigger than the
+	 * active list is ignored and so NR_ACTIVE_FILE would be a
+	 * reasonable ceiling.  But scanning and shrinking shadow
+	 * entries is quite expensive, so be generous.
+	 */
+	nr_max = global_dirtyable_memory() * 4;
+
+	if (nr <= nr_max)
+		return 0;
+	return nr - nr_max;
+}
+
+static unsigned long scan_mapping(struct address_space *mapping,
+				  unsigned long nr_to_scan)
+{
+	unsigned long nr_scanned = 0;
+	struct radix_tree_iter iter;
+	void **slot;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
+		unsigned long nrshadows;
+		unsigned long distance;
+		struct zone *zone;
+		struct page *page;
+
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (!radix_tree_exception(page))
+			continue;
+		if (radix_tree_deref_retry(page))
+			goto restart;
+
+		unpack_shadow(page, &zone, &distance);
+
+		if (distance <= zone_page_state(zone, NR_ACTIVE_FILE))
+			continue;
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (radix_tree_delete_item(&mapping->page_tree,
+					   iter.index, page)) {
+			inc_zone_state(zone, WORKINGSET_SHADOWS_RECLAIMED);
+			workingset_shadows_dec(mapping);
+			nr_scanned++;
+		}
+		nrshadows = mapping->nrshadows;
+		spin_unlock_irq(&mapping->tree_lock);
+
+		if (nrshadows == 0)
+			break;
+
+		if (--nr_to_scan == 0)
+			break;
+	}
+	rcu_read_unlock();
+
+	return nr_scanned;
+}
+
+static unsigned long count_shadows(struct shrinker *shrink,
+				   struct shrink_control *sc)
+{
+	return get_nr_old_shadows();
+}
+
+static unsigned long scan_shadows(struct shrinker *shrink,
+				  struct shrink_control *sc)
+{
+	unsigned long nr_to_scan = sc->nr_to_scan;
+
+	do {
+		struct address_space *mapping;
+
+		spin_lock_irq(&shadow_lock);
+		if (list_empty(&shadow_mappings)) {
+			spin_unlock_irq(&shadow_lock);
+			break;
+		}
+		mapping = list_entry(shadow_mappings.prev,
+				     struct address_space,
+				     shadow_list);
+		list_move(&mapping->shadow_list, &shadow_mappings);
+		__iget(mapping->host);
+		spin_unlock_irq(&shadow_lock);
+
+		if (mapping->nrshadows)
+			nr_to_scan -= scan_mapping(mapping, nr_to_scan);
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (mapping->nrshadows == 0) {
+			spin_lock(&shadow_lock);
+			list_del_init(&mapping->shadow_list);
+			spin_unlock(&shadow_lock);
+		}
+		spin_unlock_irq(&mapping->tree_lock);
+
+		iput(mapping->host);
+
+	} while (nr_to_scan && get_nr_old_shadows());
+
+	return sc->nr_to_scan - nr_to_scan;
+}
+
+static struct shrinker shadow_shrinker = {
+	.count_objects = count_shadows,
+	.scan_objects = scan_shadows,
+	.seeks = 1,
+};
+
+static int __init workingset_init(void)
+{
+	register_shrinker(&shadow_shrinker);
+	return 0;
+}
+core_initcall(workingset_init);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2013-08-22  9:51 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-06 22:44 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
2013-08-06 22:44 ` [patch 1/9] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
2013-08-06 22:44 ` [patch 2/9] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
2013-08-06 22:44 ` [patch 3/9] mm: filemap: move radix tree hole searching here Johannes Weiner
2013-08-06 22:44 ` [patch 4/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
2013-08-06 22:44 ` [patch 5/9] mm + fs: store shadow entries in page cache Johannes Weiner
2013-08-06 22:44 ` [patch 6/9] mm + fs: provide shadow pages to page cache allocations Johannes Weiner
2013-08-06 22:44 ` [patch 7/9] mm: make global_dirtyable_memory() available to other mm code Johannes Weiner
2013-08-06 22:44 ` [patch 8/9] mm: thrash detection-based file cache sizing Johannes Weiner
2013-08-09 22:49   ` Andrew Morton
2013-08-12 16:00     ` Johannes Weiner
2013-08-11 21:57   ` Vlastimil Babka
2013-08-12 16:27     ` Johannes Weiner
2013-08-06 22:44 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner
2013-08-11 23:56   ` Andi Kleen
2013-08-14 14:41     ` Johannes Weiner
2013-08-09 22:53 ` [patch 0/9] mm: thrash detection-based file cache sizing v3 Andrew Morton
2013-08-12 22:15   ` Johannes Weiner
2013-08-17 19:31 [patch 9/9] mm: thrash detection-based file cache sizing v4 Johannes Weiner
2013-08-17 19:31 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner
2013-08-20 20:59   ` Andrew Morton
2013-08-22  9:48     ` Johannes Weiner
