* [patch 0/9] mm: thrash detection-based file cache sizing v4
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

Changes in version 4:

o Rework shrinker and shadow planting throttle.  The per-file
  throttling created problems in production tests.  And the shrinker
  code changed so much over the development of the series that the
  throttling policy is no longer applicable, so just remove it, and
  with it the extra unsigned long to track refault ratios in struct
  inode (yay!).

o Remove the 'enough free pages' filter from refault detection.  This
  was never quite right for all zone sizes (varying watermarks and
  lowmem reserves) and filtered out too many valid refault hits.  It
  was put in place to detect when reclaim had already freed enough
  pages, to stop deactivating more than necessary.  But reclaim
  advances the working set time, so progress is reflected in the
  refault distances that we check either way.  Just remove the
  redundant test.

o Update the changelog to explain what the refault distance means and
  how the code protects against spurious refaults that happen out of
  order.  Suggested by Vlastimil Babka.

Changes in version 3:

o Drop global working set time, store zone ID in addition to
  zone-specific timestamp in radix tree instead.  Balance zones based
  on their own refaults only.  Based on feedback from Peter Zijlstra.

o Lazily remove inodes without shadow entries from the global list to
  reduce modifications of said list to an absolute minimum.  Global
  list operations are now reduced to when an inode has its first cache
  page reclaimed (rare), when a linked inode is destroyed (rare), or
  when the inode's shadows are shrunk to zero (rare).  Based on
  feedback from Peter Zijlstra.

o Document all interfaces properly

o Split out fair allocator patches (in -mmotm)

---

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache; the other holds
pages that have been referenced repeatedly on that first list.  The
idea is to prefer reclaiming young pages over those that have been
shown to benefit from caching in the past.  We call the recently used
list the "inactive list" and the frequently used list the "active
list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave the inactive list enough room for a new set of frequently
used pages, a "working set", to establish itself, because the young
pages get pushed out of memory before they have a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once
streams, but was not satisfactory when use-once streaming persisted
over longer periods of time and the established working set was
temporarily suspended, like a nightly backup evicting all the
interactive user program data.
    
Subsequently, the rules were changed to only age active pages when
they exceed the number of inactive pages, i.e. leave the working set
alone as long as the other half of memory consists of easy-to-reclaim
use-once pages.  This works well until working set transitions exceed
the size of half of memory and the average access distance between the
pages of the new working set is bigger than the inactive list.  The VM
will mistake the thrashing new working set for use-once streaming,
while the unused old working set pages are stuck on the active list.
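
In code, that rule amounts to a simple size comparison.  The sketch
below is illustrative only (the helper name is invented here); the
kernel's actual check at this point is the inactive_file_is_low() test
in mm/vmscan.c:

/* Only age (deactivate) active pages while they outnumber inactive ones. */
static bool active_list_needs_aging(unsigned long nr_active,
				    unsigned long nr_inactive)
{
	return nr_active > nr_inactive;
}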

This happens on file servers and media streaming servers, where the
popular set of files changes over time.  Even though the individual
files might be smaller than half of memory, concurrent access to many
of them may still result in their inter-reference distance being
greater than half of memory.  It's also been reported as a problem on
database workloads that switch back and forth between tables that are
bigger than half of memory.  In these cases the VM never recognizes
the new working set and will, for the remainder of the workload,
thrash disk data that could easily live in memory.

This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to tell streaming IO
from thrashing and rebalance the page cache lists when appropriate.
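
To give a rough idea of the mechanism, here is a deliberately
simplified sketch of the refault distance logic.  The names and the
single global counter are inventions of this summary; the real code
lives in mm/workingset.c and, as of v3, uses per-zone timestamps
packed into radix tree shadow entries:

/*
 * Toy model: one counter that ticks for every eviction from the
 * inactive list.  Its value at eviction time is stored as the
 * page's shadow entry.
 */
static unsigned long workingset_time;

static unsigned long note_eviction(void)
{
	return workingset_time++;
}

/*
 * On refault, the number of evictions since the page was reclaimed
 * is the minimum amount of extra memory the page would have needed
 * to stay resident.  If that distance fits within the currently
 * protected active list, the cache is thrashing and the inactive
 * list should grow at the active list's expense.
 */
static bool refault_means_thrashing(unsigned long shadow,
				    unsigned long nr_active_pages)
{
	unsigned long refault_distance = workingset_time - shadow;

	return refault_distance <= nr_active_pages;
}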

 drivers/staging/lustre/lustre/llite/dir.c |   2 +-
 fs/block_dev.c                            |   2 +-
 fs/btrfs/compression.c                    |   4 +-
 fs/cachefiles/rdwr.c                      |  13 +-
 fs/ceph/xattr.c                           |   2 +-
 fs/inode.c                                |   6 +-
 fs/logfs/readwrite.c                      |   6 +-
 fs/nfs/blocklayout/blocklayout.c          |   2 +-
 fs/nilfs2/inode.c                         |   4 +-
 fs/ntfs/file.c                            |   7 +-
 fs/splice.c                               |   6 +-
 include/linux/fs.h                        |   2 +
 include/linux/mm.h                        |   8 +
 include/linux/mmzone.h                    |   8 +
 include/linux/pagemap.h                   |  55 ++--
 include/linux/pagevec.h                   |   3 +
 include/linux/radix-tree.h                |   5 +-
 include/linux/shmem_fs.h                  |   1 +
 include/linux/swap.h                      |   9 +
 include/linux/writeback.h                 |   1 +
 lib/radix-tree.c                          | 105 ++------
 mm/Makefile                               |   2 +-
 mm/filemap.c                              | 264 ++++++++++++++++---
 mm/mincore.c                              |  20 +-
 mm/page-writeback.c                       |   2 +-
 mm/readahead.c                            |   8 +-
 mm/shmem.c                                | 122 +++------
 mm/swap.c                                 |  22 ++
 mm/truncate.c                             |  78 ++++--
 mm/vmscan.c                               |  62 ++++-
 mm/vmstat.c                               |   5 +
 mm/workingset.c                           | 396 ++++++++++++++++++++++++++++
 net/ceph/pagelist.c                       |   4 +-
 net/ceph/pagevec.c                        |   2 +-
 34 files changed, 939 insertions(+), 299 deletions(-)

Based on -mmotm, which includes the required page allocator fairness
patches.  All that: http://git.cmpxchg.org/cgit/linux-jw.git/

Thanks!



* [patch 1/9] lib: radix-tree: radix_tree_delete_item()
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

Provide a function that does not just delete an entry at a given
index, but also allows passing in an expected item.  Delete only if
that item is still located at the specified index.

This is handy when lockless tree traversals want to delete entries as
well, because they don't have to do a second, locked lookup to verify
that the slot has not changed under them before deleting the entry.
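
For illustration, a caller that found a page with a lockless lookup
can later remove it with a single verifying pass under the tree lock.
The helper below is a made-up example, but it mirrors how the shmem
conversion in the next patch uses the new function:

static bool try_remove_from_cache(struct address_space *mapping,
				  pgoff_t index, struct page *page)
{
	void *old;

	spin_lock_irq(&mapping->tree_lock);
	/* Deletes the slot only if it still holds @page, in one lookup. */
	old = radix_tree_delete_item(&mapping->page_tree, index, page);
	spin_unlock_irq(&mapping->tree_lock);

	return old == page;
}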

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h |  1 +
 lib/radix-tree.c           | 30 ++++++++++++++++++++++++++----
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..1bf0a9c 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -219,6 +219,7 @@ static inline void radix_tree_replace_slot(void **pslot, void *item)
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..60b202b 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1335,15 +1335,18 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 }
 
 /**
- *	radix_tree_delete    -    delete an item from a radix tree
+ *	radix_tree_delete_item    -    delete an item from a radix tree
  *	@root:		radix tree root
  *	@index:		index key
+ *	@item:		expected item
  *
- *	Remove the item at @index from the radix tree rooted at @root.
+ *	Remove @item at @index from the radix tree rooted at @root.
  *
- *	Returns the address of the deleted item, or NULL if it was not present.
+ *	Returns the address of the deleted item, or NULL if it was not present
+ *	or the entry at the given @index was not @item.
  */
-void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+void *radix_tree_delete_item(struct radix_tree_root *root,
+			     unsigned long index, void *item)
 {
 	struct radix_tree_node *node = NULL;
 	struct radix_tree_node *slot = NULL;
@@ -1378,6 +1381,11 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 	if (slot == NULL)
 		goto out;
 
+	if (item && slot != item) {
+		slot = NULL;
+		goto out;
+	}
+
 	/*
 	 * Clear all tags associated with the item to be deleted.
 	 * This way of doing it would be inefficient, but seldom is any set.
@@ -1422,6 +1430,20 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 out:
 	return slot;
 }
+
+/**
+ *	radix_tree_delete    -    delete an item from a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Remove the item at @index from the radix tree rooted at @root.
+ *
+ *	Returns the address of the deleted item, or NULL if it was not present.
+ */
+void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+{
+	return radix_tree_delete_item(root, index, NULL);
+}
 EXPORT_SYMBOL(radix_tree_delete);
 
 /**
-- 
1.8.3.2



* [patch 2/9] mm: shmem: save one radix tree lookup when truncating swapped pages
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

Page cache radix tree slots are usually stabilized by the page lock,
but shmem's swap cookies have no such thing.  Because the overall
truncation loop is lockless, the swap entry is currently confirmed by
a tree lookup and then deleted by another tree lookup under the same
tree lock region.

Use radix_tree_delete_item() instead, which does the verification and
deletion with only one lookup.  This also allows removing the
delete-only special case from shmem_radix_tree_replace().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/shmem.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index c93dcd6..8510534 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -242,19 +242,17 @@ static int shmem_radix_tree_replace(struct address_space *mapping,
 			pgoff_t index, void *expected, void *replacement)
 {
 	void **pslot;
-	void *item = NULL;
+	void *item;
 
 	VM_BUG_ON(!expected);
+	VM_BUG_ON(!replacement);
 	pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
-	if (pslot)
-		item = radix_tree_deref_slot_protected(pslot,
-							&mapping->tree_lock);
+	if (!pslot)
+		return -ENOENT;
+	item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
 	if (item != expected)
 		return -ENOENT;
-	if (replacement)
-		radix_tree_replace_slot(pslot, replacement);
-	else
-		radix_tree_delete(&mapping->page_tree, index);
+	radix_tree_replace_slot(pslot, replacement);
 	return 0;
 }
 
@@ -386,14 +384,15 @@ export:
 static int shmem_free_swap(struct address_space *mapping,
 			   pgoff_t index, void *radswap)
 {
-	int error;
+	void *old;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = shmem_radix_tree_replace(mapping, index, radswap, NULL);
+	old = radix_tree_delete_item(&mapping->page_tree, index, radswap);
 	spin_unlock_irq(&mapping->tree_lock);
-	if (!error)
-		free_swap_and_cache(radix_to_swp_entry(radswap));
-	return error;
+	if (old != radswap)
+		return -ENOENT;
+	free_swap_and_cache(radix_to_swp_entry(radswap));
+	return 0;
 }
 
 /*
-- 
1.8.3.2



* [patch 3/9] mm: filemap: move radix tree hole searching here
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

The radix tree hole searching code is only used for the page cache,
for example by the readahead code trying to get a picture of the area
surrounding a fault.

It sufficed to rely on the radix tree definition of holes, which is
"empty tree slot".  But this is about to change, as shadow page
descriptors will be stored in the page cache after the actual pages
get evicted from memory.

Move the functions over to mm/filemap.c and make them native page
cache operations, where they can later be adapted to handle the new
definition of "page cache hole".
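
As an example of the new interface, the readahead hunk below uses
page_cache_prev_hole() to measure how much of the history leading up
to an offset is already cached; trimmed down, the pattern is:

	pgoff_t head, history;

	rcu_read_lock();
	/* Index of the closest hole at or below offset - 1. */
	head = page_cache_prev_hole(mapping, offset - 1, max);
	rcu_read_unlock();

	/* Contiguously cached pages immediately preceding offset. */
	history = offset - 1 - head;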

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/nfs/blocklayout/blocklayout.c |  2 +-
 include/linux/pagemap.h          |  5 +++
 include/linux/radix-tree.h       |  4 ---
 lib/radix-tree.c                 | 75 ---------------------------------------
 mm/filemap.c                     | 76 ++++++++++++++++++++++++++++++++++++++++
 mm/readahead.c                   |  4 +--
 6 files changed, 84 insertions(+), 82 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index e242bbf..fdb74cb 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1220,7 +1220,7 @@ static u64 pnfs_num_cont_bytes(struct inode *inode, pgoff_t idx)
 	end = DIV_ROUND_UP(i_size_read(inode), PAGE_CACHE_SIZE);
 	if (end != NFS_I(inode)->npages) {
 		rcu_read_lock();
-		end = radix_tree_next_hole(&mapping->page_tree, idx + 1, ULONG_MAX);
+		end = page_cache_next_hole(mapping, idx + 1, ULONG_MAX);
 		rcu_read_unlock();
 	}
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..c73130c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -243,6 +243,11 @@ static inline struct page *page_cache_alloc_readahead(struct address_space *x)
 
 typedef int filler_t(void *, struct page *);
 
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan);
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan);
+
 extern struct page * find_get_page(struct address_space *mapping,
 				pgoff_t index);
 extern struct page * find_lock_page(struct address_space *mapping,
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 1bf0a9c..e8be53e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -227,10 +227,6 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
 			void ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 60b202b..912e67b 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -946,81 +946,6 @@ next:
 }
 EXPORT_SYMBOL(radix_tree_range_tag_if_tagged);
 
-
-/**
- *	radix_tree_next_hole    -    find the next hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the lowest
- *	indexed hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'return - index >= max_scan'
- *	will be true). In rare cases of index wrap-around, 0 will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 5, then subsequently a hole is created at index 10,
- *	radix_tree_next_hole covering both indexes may return 10 if called
- *	under rcu_read_lock.
- */
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index++;
-		if (index == 0)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_next_hole);
-
-/**
- *	radix_tree_prev_hole    -    find the prev hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search backwards in the range [max(index-max_scan+1, 0), index]
- *	for the first hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'index - return >= max_scan'
- *	will be true). In rare cases of wrap-around, ULONG_MAX will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 10, then subsequently a hole is created at index 5,
- *	radix_tree_prev_hole covering both indexes may return 5 if called under
- *	rcu_read_lock.
- */
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				   unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index--;
-		if (index == ULONG_MAX)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_prev_hole);
-
 /**
  *	radix_tree_gang_lookup - perform multiple lookup on a radix tree
  *	@root:		radix tree root
diff --git a/mm/filemap.c b/mm/filemap.c
index ae5cc01..e7833d2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -688,6 +688,82 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 }
 
 /**
+ * page_cache_next_hole - find the next hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the
+ * lowest indexed hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'return - index >=
+ * max_scan' will be true). In rare cases of index wrap-around, 0 will
+ * be returned.
+ *
+ * page_cache_next_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 5, then subsequently a hole is created at
+ * index 10, page_cache_next_hole covering both indexes may return 10
+ * if called under rcu_read_lock.
+ */
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index++;
+		if (index == 0)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(page_cache_next_hole);
+
+/**
+ * page_cache_prev_hole - find the prev hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search backwards in the range [max(index-max_scan+1, 0), index] for
+ * the first hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'index - return >=
+ * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX
+ * will be returned.
+ *
+ * page_cache_prev_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 10, then subsequently a hole is created at
+ * index 5, page_cache_prev_hole covering both indexes may return 5 if
+ * called under rcu_read_lock.
+ */
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+			     pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index--;
+		if (index == ULONG_MAX)
+			break;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(page_cache_prev_hole);
+
+/**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
diff --git a/mm/readahead.c b/mm/readahead.c
index 829a77c..01f4cae 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -351,7 +351,7 @@ static pgoff_t count_history_pages(struct address_space *mapping,
 	pgoff_t head;
 
 	rcu_read_lock();
-	head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
+	head = page_cache_prev_hole(mapping, offset - 1, max);
 	rcu_read_unlock();
 
 	return offset - 1 - head;
@@ -430,7 +430,7 @@ ondemand_readahead(struct address_space *mapping,
 		pgoff_t start;
 
 		rcu_read_lock();
-		start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
+		start = page_cache_next_hole(mapping, offset + 1, max);
 		rcu_read_unlock();
 
 		if (!start || start - offset > max)
-- 
1.8.3.2



* [patch 4/9] mm + fs: prepare for non-page entries in page cache radix trees
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache,
prepare every site that deals with the radix trees directly to handle
entries other than pages.

The common lookup functions will filter out non-page entries and
return NULL for page cache holes, just as before.  But provide a raw
version of the API which returns non-page entries as well, and switch
shmem over to use it.
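
The split is easiest to see in the reworked find_get_page() from the
hunk below: the raw __find_get_page() may return an exceptional
(non-page) entry, and the conventional wrapper simply filters that out
so existing callers keep seeing a plain cache miss:

struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
{
	struct page *page = __find_get_page(mapping, offset);

	/* Shadow/swap entries are not pages; report a cache miss. */
	if (radix_tree_exceptional_entry(page))
		page = NULL;
	return page;
}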

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c   |   2 +-
 include/linux/mm.h       |   8 +++
 include/linux/pagemap.h  |  15 +++---
 include/linux/pagevec.h  |   3 ++
 include/linux/shmem_fs.h |   1 +
 mm/filemap.c             | 131 +++++++++++++++++++++++++++++++++++++++--------
 mm/mincore.c             |  20 +++++---
 mm/readahead.c           |   2 +-
 mm/shmem.c               |  97 +++++++----------------------------
 mm/swap.c                |  20 ++++++++
 mm/truncate.c            |  75 +++++++++++++++++++++------
 11 files changed, 246 insertions(+), 128 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b189bd1..5ce2c0f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -475,7 +475,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, pg_index);
 		rcu_read_unlock();
-		if (page) {
+		if (page && !radix_tree_exceptional_entry(page)) {
 			misses++;
 			if (misses > 4)
 				break;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e20da1b..7f4d1f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -913,6 +913,14 @@ extern void show_free_areas(unsigned int flags);
 extern bool skip_free_areas_node(unsigned int flags, int nid);
 
 int shmem_zero_setup(struct vm_area_struct *);
+#ifdef CONFIG_SHMEM
+bool shmem_mapping(struct address_space *mapping);
+#else
+static inline bool shmem_mapping(struct address_space *mapping)
+{
+	return false;
+}
+#endif
 
 extern int can_do_mlock(void);
 extern int user_shm_lock(size_t, struct user_struct *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c73130c..b6854b7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -248,12 +248,15 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 pgoff_t page_cache_prev_hole(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
-extern struct page * find_get_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_lock_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_or_create_page(struct address_space *mapping,
-				pgoff_t index, gfp_t gfp_mask);
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
+				 gfp_t gfp_mask);
+unsigned __find_get_pages(struct address_space *mapping, pgoff_t start,
+			  unsigned int nr_pages, struct page **pages,
+			  pgoff_t *indices);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index e4dbfab..3c6b8b1 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,6 +22,9 @@ struct pagevec {
 
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned nr_pages, pgoff_t *indices);
+void pagevec_remove_exceptionals(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 30aa0dc..deb4960 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -49,6 +49,7 @@ extern struct file *shmem_file_setup(const char *name,
 					loff_t size, unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
 extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern bool shmem_mapping(struct address_space *mapping);
 extern void shmem_unlock_mapping(struct address_space *mapping);
 extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 					pgoff_t index, gfp_t gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index e7833d2..254eb16 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -446,6 +446,24 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
+static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
+			     struct page *page)
+{
+	void **slot;
+
+	slot = radix_tree_lookup_slot(&mapping->page_tree, offset);
+	if (slot) {
+		void *p;
+
+		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+		if (!radix_tree_exceptional_entry(p))
+			return -EEXIST;
+		radix_tree_replace_slot(slot, page);
+		return 0;
+	}
+	return radix_tree_insert(&mapping->page_tree, offset, page);
+}
+
 /**
  * add_to_page_cache_locked - add a locked page to the pagecache
  * @page:	page to add
@@ -480,7 +498,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 	page->index = offset;
 
 	spin_lock_irq(&mapping->tree_lock);
-	error = radix_tree_insert(&mapping->page_tree, offset, page);
+	error = page_cache_insert(mapping, offset, page);
 	radix_tree_preload_end();
 	if (unlikely(error))
 		goto err_insert;
@@ -714,7 +732,10 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index++;
 		if (index == 0)
@@ -752,7 +773,10 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index--;
 		if (index == ULONG_MAX)
@@ -763,15 +787,7 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
 }
 EXPORT_SYMBOL(page_cache_prev_hole);
 
-/**
- * find_get_page - find and get a page reference
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Is there a pagecache struct page at the given (mapping, offset) tuple?
- * If yes, increment its refcount and return it; if no, return NULL.
- */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
 {
 	void **pagep;
 	struct page *page;
@@ -812,24 +828,31 @@ out:
 
 	return page;
 }
-EXPORT_SYMBOL(find_get_page);
+EXPORT_SYMBOL(__find_get_page);
 
 /**
- * find_lock_page - locate, pin and lock a pagecache page
+ * find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
  *
- * Locates the desired pagecache page, locks it, increments its reference
- * count and returns its address.
- *
- * Returns zero if the page was not present. find_lock_page() may sleep.
+ * Is there a pagecache struct page at the given (mapping, offset) tuple?
+ * If yes, increment its refcount and return it; if no, return NULL.
  */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
 {
-	struct page *page;
+	struct page *page = __find_get_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
+EXPORT_SYMBOL(find_get_page);
 
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page;
 repeat:
-	page = find_get_page(mapping, offset);
+	page = __find_get_page(mapping, offset);
 	if (page && !radix_tree_exception(page)) {
 		lock_page(page);
 		/* Has the page been truncated? */
@@ -842,6 +865,25 @@ repeat:
 	}
 	return page;
 }
+
+/**
+ * find_lock_page - locate, pin and lock a pagecache page
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Locates the desired pagecache page, locks it, increments its reference
+ * count and returns its address.
+ *
+ * Returns zero if the page was not present. find_lock_page() may sleep.
+ */
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page = __find_lock_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
 EXPORT_SYMBOL(find_lock_page);
 
 /**
@@ -891,6 +933,53 @@ repeat:
 }
 EXPORT_SYMBOL(find_or_create_page);
 
+unsigned __find_get_pages(struct address_space *mapping,
+			  pgoff_t start, unsigned int nr_pages,
+			  struct page **pages, pgoff_t *indices)
+{
+	void **slot;
+	unsigned int ret = 0;
+	struct radix_tree_iter iter;
+
+	if (!nr_pages)
+		return 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				goto restart;
+			/*
+			 * Otherwise, we must be storing a swap entry
+			 * here as an exceptional entry: so return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = iter.index;
+		pages[ret] = page;
+		if (++ret == nr_pages)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 /**
  * find_get_pages - gang pagecache lookup
  * @mapping:	The address_space to search
diff --git a/mm/mincore.c b/mm/mincore.c
index da2be56..ad411ec 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -70,13 +70,21 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 	 * any other file mapping (ie. marked !present and faulted in with
 	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
 	 */
-	page = find_get_page(mapping, pgoff);
 #ifdef CONFIG_SWAP
-	/* shmem/tmpfs may return swap: account for swapcache page too. */
-	if (radix_tree_exceptional_entry(page)) {
-		swp_entry_t swap = radix_to_swp_entry(page);
-		page = find_get_page(swap_address_space(swap), swap.val);
-	}
+	if (shmem_mapping(mapping)) {
+		page = __find_get_page(mapping, pgoff);
+		/*
+		 * shmem/tmpfs may return swap: account for swapcache
+		 * page too.
+		 */
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swp = radix_to_swp_entry(page);
+			page = find_get_page(swap_address_space(swp), swp.val);
+		}
+	} else
+		page = find_get_page(mapping, pgoff);
+#else
+	page = find_get_page(mapping, pgoff);
 #endif
 	if (page) {
 		present = PageUptodate(page);
diff --git a/mm/readahead.c b/mm/readahead.c
index 01f4cae..0f85996 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -179,7 +179,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
 		rcu_read_unlock();
-		if (page)
+		if (page && !radix_tree_exceptional_entry(page))
 			continue;
 
 		page = page_cache_alloc_readahead(mapping);
diff --git a/mm/shmem.c b/mm/shmem.c
index 8510534..2d18981 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -329,56 +329,6 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 }
 
 /*
- * Like find_get_pages, but collecting swap entries as well as pages.
- */
-static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
-					pgoff_t start, unsigned int nr_pages,
-					struct page **pages, pgoff_t *indices)
-{
-	void **slot;
-	unsigned int ret = 0;
-	struct radix_tree_iter iter;
-
-	if (!nr_pages)
-		return 0;
-
-	rcu_read_lock();
-restart:
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
-		struct page *page;
-repeat:
-		page = radix_tree_deref_slot(slot);
-		if (unlikely(!page))
-			continue;
-		if (radix_tree_exception(page)) {
-			if (radix_tree_deref_retry(page))
-				goto restart;
-			/*
-			 * Otherwise, we must be storing a swap entry
-			 * here as an exceptional entry: so return it
-			 * without attempting to raise page count.
-			 */
-			goto export;
-		}
-		if (!page_cache_get_speculative(page))
-			goto repeat;
-
-		/* Has the page moved? */
-		if (unlikely(page != *slot)) {
-			page_cache_release(page);
-			goto repeat;
-		}
-export:
-		indices[ret] = iter.index;
-		pages[ret] = page;
-		if (++ret == nr_pages)
-			break;
-	}
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
  * Remove swap entry from radix tree, free the swap and its page cache.
  */
 static int shmem_free_swap(struct address_space *mapping,
@@ -396,21 +346,6 @@ static int shmem_free_swap(struct address_space *mapping,
 }
 
 /*
- * Pagevec may contain swap entries, so shuffle up pages before releasing.
- */
-static void shmem_deswap_pagevec(struct pagevec *pvec)
-{
-	int i, j;
-
-	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		if (!radix_tree_exceptional_entry(page))
-			pvec->pages[j++] = page;
-	}
-	pvec->nr = j;
-}
-
-/*
  * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
  */
 void shmem_unlock_mapping(struct address_space *mapping)
@@ -428,12 +363,12 @@ void shmem_unlock_mapping(struct address_space *mapping)
 		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
 		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
 		 */
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					PAGEVEC_SIZE, pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		index = indices[pvec.nr - 1] + 1;
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		check_move_unevictable_pages(pvec.pages, pvec.nr);
 		pagevec_release(&pvec);
 		cond_resched();
@@ -465,9 +400,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	pagevec_init(&pvec, 0);
 	index = start;
 	while (index < end) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
-				min(end - index, (pgoff_t)PAGEVEC_SIZE),
-							pvec.pages, indices);
+		pvec.nr = __find_get_pages(mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		mem_cgroup_uncharge_start();
@@ -496,7 +431,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -534,9 +469,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+
+		pvec.nr = __find_get_pages(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
-							pvec.pages, indices);
+				pvec.pages, indices);
 		if (!pvec.nr) {
 			if (index == start || unfalloc)
 				break;
@@ -544,7 +480,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			continue;
 		}
 		if ((index == start || unfalloc) && indices[0] >= end) {
-			shmem_deswap_pagevec(&pvec);
+			pagevec_remove_exceptionals(&pvec);
 			pagevec_release(&pvec);
 			break;
 		}
@@ -573,7 +509,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -1081,7 +1017,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return -EFBIG;
 repeat:
 	swap.val = 0;
-	page = find_lock_page(mapping, index);
+	page = __find_lock_page(mapping, index);
 	if (radix_tree_exceptional_entry(page)) {
 		swap = radix_to_swp_entry(page);
 		page = NULL;
@@ -1418,6 +1354,11 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	return inode;
 }
 
+bool shmem_mapping(struct address_space *mapping)
+{
+	return mapping->backing_dev_info == &shmem_backing_dev_info;
+}
+
 #ifdef CONFIG_TMPFS
 static const struct inode_operations shmem_symlink_inode_operations;
 static const struct inode_operations shmem_short_symlink_operations;
@@ -1731,7 +1672,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 	pagevec_init(&pvec, 0);
 	pvec.nr = 1;		/* start small: we may be there already */
 	while (!done) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					pvec.nr, pvec.pages, indices);
 		if (!pvec.nr) {
 			if (whence == SEEK_DATA)
@@ -1758,7 +1699,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
 				break;
 			}
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		pvec.nr = PAGEVEC_SIZE;
 		cond_resched();
diff --git a/mm/swap.c b/mm/swap.c
index 62b78a6..bf448cf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,6 +831,26 @@ void __pagevec_lru_add(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_lru_add);
 
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned nr_pages, pgoff_t *indices)
+{
+	pvec->nr = __find_get_pages(mapping, start, nr_pages,
+				    pvec->pages, indices);
+	return pagevec_count(pvec);
+}
+
+void pagevec_remove_exceptionals(struct pagevec *pvec)
+{
+	int i, j;
+
+	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		if (!radix_tree_exceptional_entry(page))
+			pvec->pages[j++] = page;
+	}
+	pvec->nr = j;
+}
+
 /**
  * pagevec_lookup - gang pagecache lookup
  * @pvec:	Where the resulting pages are placed
diff --git a/mm/truncate.c b/mm/truncate.c
index e2e8a8a..21e4851 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -22,6 +22,22 @@
 #include <linux/cleancache.h>
 #include "internal.h"
 
+static void clear_exceptional_entry(struct address_space *mapping,
+				    pgoff_t index, struct page *page)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	radix_tree_delete_item(&mapping->page_tree, index, page);
+	spin_unlock_irq(&mapping->tree_lock);
+}
 
 /**
  * do_invalidatepage - invalidate part or all of a page
@@ -208,12 +224,11 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	unsigned int	partial_start;	/* inclusive */
 	unsigned int	partial_end;	/* exclusive */
 	struct pagevec	pvec;
+	pgoff_t		indices[PAGEVEC_SIZE];
 	pgoff_t		index;
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0)
-		return;
 
 	/* Offsets within partial pages */
 	partial_start = lstart & (PAGE_CACHE_SIZE - 1);
@@ -238,17 +253,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
 
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index < end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+	while (index < end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index >= end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -259,6 +280,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -307,14 +329,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		if (!pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+		if (!__pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE),
+			indices)) {
 			if (index == start)
 				break;
 			index = start;
 			continue;
 		}
-		if (index == start && pvec.pages[0]->index >= end) {
+		if (index == start && indices[0] >= end) {
 			pagevec_release(&pvec);
 			break;
 		}
@@ -323,16 +346,22 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index >= end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -375,6 +404,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index = start;
 	unsigned long ret;
@@ -390,17 +420,23 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	 */
 
 	pagevec_init(&pvec, 0);
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -414,6 +450,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 				deactivate_page(page);
 			count += ret;
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -481,6 +518,7 @@ static int do_launder_page(struct address_space *mapping, struct page *page)
 int invalidate_inode_pages2_range(struct address_space *mapping,
 				  pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index;
 	int i;
@@ -491,17 +529,23 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	cleancache_invalidate_inode(mapping);
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			if (page->mapping != mapping) {
@@ -539,6 +583,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				ret = ret2;
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread
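
To make the split between the raw and the filtering page cache lookups
in the patch above concrete, here is a small usage sketch.  It is not
part of the patch; "mapping" and "index" stand in for a caller's own
variables:

	struct page *page = __find_get_page(mapping, index);

	if (!page) {
		/* true hole: nothing is cached at this index */
	} else if (radix_tree_exceptional_entry(page)) {
		/*
		 * Non-page entry, e.g. a shmem swap slot; returned
		 * without an elevated page count.
		 */
	} else {
		/* regular pagecache page, refcount already raised */
		page_cache_release(page);
	}

	/*
	 * find_get_page() hides the middle case and returns NULL for
	 * any non-page entry, so existing callers keep seeing either
	 * a page or NULL.
	 */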

* [patch 5/9] mm + fs: store shadow entries in page cache
  2013-08-17 19:31 ` Johannes Weiner
@ 2013-08-17 19:31   ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

Reclaim will be leaving shadow entries in the page cache radix tree
upon evicting the real page.  Because reclaim finds those pages on the
LRU and holds no reference on the owning inode, an iput() can free the
inode concurrently.  Once inode freeing has begun, reclaim must no
longer install shadow entries, because the freeing code needs to end
up with a page tree that is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code
sets under the tree lock before doing the final truncate.  Reclaim
will check for this flag before installing shadow entries.
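
A minimal sketch of how reclaim could consult the new flag once a later
patch in this series starts producing shadow entries on eviction.  The
helper name workingset_eviction() and the exact call site inside
__remove_mapping() are assumptions for illustration only:

	void *shadow = NULL;

	/*
	 * Leave a shadow entry behind only if the inode is not being
	 * evicted right now, so the final truncate still ends up with
	 * an empty radix tree.  mapping_exiting() is the accessor
	 * added by this patch; workingset_eviction() is a placeholder
	 * for whatever computes the shadow value.
	 */
	if (!mapping_exiting(mapping))
		shadow = workingset_eviction(mapping, page);
	__delete_from_page_cache(page, shadow);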

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/block_dev.c          |  2 +-
 fs/inode.c              |  7 ++++++-
 fs/nilfs2/inode.c       |  4 ++--
 include/linux/fs.h      |  1 +
 include/linux/pagemap.h | 13 ++++++++++++-
 mm/filemap.c            | 16 ++++++++++++----
 mm/truncate.c           |  5 +++--
 mm/vmscan.c             |  2 +-
 8 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7bda5c..26b992d14 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -83,7 +83,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index e315c0a..8862b1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -503,6 +503,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
+	BUG_ON(inode->i_data.nrshadows);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
@@ -545,10 +546,14 @@ static void evict(struct inode *inode)
 	 */
 	inode_wait_for_writeback(inode);
 
+	spin_lock_irq(&inode->i_data.tree_lock);
+	mapping_set_exiting(&inode->i_data);
+	spin_unlock_irq(&inode->i_data.tree_lock);
+
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
-		if (inode->i_data.nrpages)
+		if (inode->i_data.nrpages || inode->i_data.nrshadows)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 	}
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index b1a5277..047bafb 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -783,7 +783,7 @@ void nilfs_evict_inode(struct inode *inode)
 	int ret;
 
 	if (inode->i_nlink || !ii->i_root || unlikely(is_bad_inode(inode))) {
-		if (inode->i_data.nrpages)
+		if (inode->i_data.nrpages || inode->i_data.nrshadows)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 		nilfs_clear_inode(inode);
@@ -791,7 +791,7 @@ void nilfs_evict_inode(struct inode *inode)
 	}
 	nilfs_transaction_begin(sb, &ti, 0); /* never fails */
 
-	if (inode->i_data.nrpages)
+	if (inode->i_data.nrpages || inode->i_data.nrshadows)
 		truncate_inode_pages(&inode->i_data, 0);
 
 	/* TODO: some of the following operations may fail.  */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b09ddc0..ac5d84e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -416,6 +416,7 @@ struct address_space {
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
+	unsigned long		nrshadows;	/* number of shadow entries */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index b6854b7..db3a78b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_EXITING	= __GFP_BITS_SHIFT + 5, /* inode is being evicted */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -69,6 +70,16 @@ static inline int mapping_balloon(struct address_space *mapping)
 	return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
 }
 
+static inline void mapping_set_exiting(struct address_space *mapping)
+{
+	set_bit(AS_EXITING, &mapping->flags);
+}
+
+static inline int mapping_exiting(struct address_space *mapping)
+{
+	return test_bit(AS_EXITING, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
@@ -547,7 +558,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page);
+extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 254eb16..34b2f0b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -112,7 +112,7 @@
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the mapping's tree_lock.
  */
-void __delete_from_page_cache(struct page *page)
+void __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
@@ -127,7 +127,14 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	if (shadow) {
+		void **slot;
+
+		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+		radix_tree_replace_slot(slot, shadow);
+		mapping->nrshadows++;
+	} else
+		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
 	mapping->nrpages--;
@@ -166,7 +173,7 @@ void delete_from_page_cache(struct page *page)
 
 	freepage = mapping->a_ops->freepage;
 	spin_lock_irq(&mapping->tree_lock);
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
@@ -426,7 +433,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->index = offset;
 
 		spin_lock_irq(&mapping->tree_lock);
-		__delete_from_page_cache(old);
+		__delete_from_page_cache(old, NULL);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
 		BUG_ON(error);
 		mapping->nrpages++;
@@ -459,6 +466,7 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
+		mapping->nrshadows--;
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 21e4851..5c85dd4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -35,7 +35,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * without the tree itself locked.  These unlocked entries
 	 * need verification under the tree lock.
 	 */
-	radix_tree_delete_item(&mapping->page_tree, index, page);
+	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
+		mapping->nrshadows--;
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
@@ -481,7 +482,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a3bf7fd..dd5f67c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -545,7 +545,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page);
+		__delete_from_page_cache(page, NULL);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [patch 5/9] mm + fs: store shadow entries in page cache
@ 2013-08-17 19:31   ` Johannes Weiner
  0 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

Reclaim will be leaving shadow entries in the page cache radix tree
upon evicting the real page.  As those pages are found from the LRU,
an iput() can lead to the inode being freed concurrently.  At this
point, reclaim must no longer install shadow pages because the inode
freeing code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code
sets under the tree lock before doing the final truncate.  Reclaim
will check for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/block_dev.c          |  2 +-
 fs/inode.c              |  7 ++++++-
 fs/nilfs2/inode.c       |  4 ++--
 include/linux/fs.h      |  1 +
 include/linux/pagemap.h | 13 ++++++++++++-
 mm/filemap.c            | 16 ++++++++++++----
 mm/truncate.c           |  5 +++--
 mm/vmscan.c             |  2 +-
 8 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7bda5c..26b992d14 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -83,7 +83,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index e315c0a..8862b1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -503,6 +503,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
+	BUG_ON(inode->i_data.nrshadows);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
@@ -545,10 +546,14 @@ static void evict(struct inode *inode)
 	 */
 	inode_wait_for_writeback(inode);
 
+	spin_lock_irq(&inode->i_data.tree_lock);
+	mapping_set_exiting(&inode->i_data);
+	spin_unlock_irq(&inode->i_data.tree_lock);
+
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
-		if (inode->i_data.nrpages)
+		if (inode->i_data.nrpages || inode->i_data.nrshadows)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 	}
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index b1a5277..047bafb 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -783,7 +783,7 @@ void nilfs_evict_inode(struct inode *inode)
 	int ret;
 
 	if (inode->i_nlink || !ii->i_root || unlikely(is_bad_inode(inode))) {
-		if (inode->i_data.nrpages)
+		if (inode->i_data.nrpages || inode->i_data.nrshadows)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 		nilfs_clear_inode(inode);
@@ -791,7 +791,7 @@ void nilfs_evict_inode(struct inode *inode)
 	}
 	nilfs_transaction_begin(sb, &ti, 0); /* never fails */
 
-	if (inode->i_data.nrpages)
+	if (inode->i_data.nrpages || inode->i_data.nrshadows)
 		truncate_inode_pages(&inode->i_data, 0);
 
 	/* TODO: some of the following operations may fail.  */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b09ddc0..ac5d84e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -416,6 +416,7 @@ struct address_space {
 	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
+	unsigned long		nrshadows;	/* number of shadow entries */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index b6854b7..db3a78b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_EXITING	= __GFP_BITS_SHIFT + 5, /* inode is being evicted */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -69,6 +70,16 @@ static inline int mapping_balloon(struct address_space *mapping)
 	return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
 }
 
+static inline void mapping_set_exiting(struct address_space *mapping)
+{
+	set_bit(AS_EXITING, &mapping->flags);
+}
+
+static inline int mapping_exiting(struct address_space *mapping)
+{
+	return test_bit(AS_EXITING, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
@@ -547,7 +558,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page);
+extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 254eb16..34b2f0b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -112,7 +112,7 @@
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the mapping's tree_lock.
  */
-void __delete_from_page_cache(struct page *page)
+void __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
@@ -127,7 +127,14 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	if (shadow) {
+		void **slot;
+
+		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+		radix_tree_replace_slot(slot, shadow);
+		mapping->nrshadows++;
+	} else
+		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
 	mapping->nrpages--;
@@ -166,7 +173,7 @@ void delete_from_page_cache(struct page *page)
 
 	freepage = mapping->a_ops->freepage;
 	spin_lock_irq(&mapping->tree_lock);
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
@@ -426,7 +433,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->index = offset;
 
 		spin_lock_irq(&mapping->tree_lock);
-		__delete_from_page_cache(old);
+		__delete_from_page_cache(old, NULL);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
 		BUG_ON(error);
 		mapping->nrpages++;
@@ -459,6 +466,7 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
+		mapping->nrshadows--;
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 21e4851..5c85dd4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -35,7 +35,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * without the tree itself locked.  These unlocked entries
 	 * need verification under the tree lock.
 	 */
-	radix_tree_delete_item(&mapping->page_tree, index, page);
+	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
+		mapping->nrshadows--;
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
@@ -481,7 +482,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a3bf7fd..dd5f67c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -545,7 +545,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page);
+		__delete_from_page_cache(page, NULL);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [patch 6/9] mm + fs: provide shadow pages to page cache allocations
  2013-08-17 19:31 ` Johannes Weiner
@ 2013-08-17 19:31   ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

In order to make informed placement and reclaim decisions, the page
cache allocator requires the shadow information of refaulting pages.

Every site that does a find_or_create()-style page cache allocation is
converted to pass the shadow page found in the faulting slot of the
radix tree to page_cache_alloc(), where it can be used in subsequent
patches to influence reclaim behavior.
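
The find-then-allocate sites below all follow one pattern (callers
without a prior radix tree lookup simply pass NULL); a minimal sketch,
with a hypothetical helper name and the subsequent
add_to_page_cache_lru() step elided:

#include <linux/pagemap.h>
#include <linux/radix-tree.h>

/* hypothetical illustration, not part of the patch */
static struct page *grab_page_sketch(struct address_space *mapping,
				     pgoff_t index)
{
	struct page *page;

	page = __find_get_page(mapping, index);
	if (!page || radix_tree_exceptional_entry(page)) {
		/* cache miss, or a shadow entry left behind by reclaim */
		page = page_cache_alloc_cold(mapping, page);
	}
	return page;
}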

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 drivers/staging/lustre/lustre/llite/dir.c |  2 +-
 fs/btrfs/compression.c                    |  2 +-
 fs/cachefiles/rdwr.c                      | 13 +++++----
 fs/ceph/xattr.c                           |  2 +-
 fs/logfs/readwrite.c                      |  6 ++--
 fs/ntfs/file.c                            |  7 +++--
 fs/splice.c                               |  6 ++--
 include/linux/pagemap.h                   | 20 ++++++++------
 mm/filemap.c                              | 46 +++++++++++++++++--------------
 mm/readahead.c                            |  2 +-
 net/ceph/pagelist.c                       |  4 +--
 net/ceph/pagevec.c                        |  2 +-
 12 files changed, 61 insertions(+), 51 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
index 2ca8c45..ac63e4d 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -172,7 +172,7 @@ static int ll_dir_filler(void *_hash, struct page *page0)
 		max_pages = 1;
 	}
 	for (npages = 1; npages < max_pages; npages++) {
-		page = page_cache_alloc_cold(inode->i_mapping);
+		page = page_cache_alloc_cold(inode->i_mapping, NULL);
 		if (!page)
 			break;
 		page_pool[npages] = page;
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 5ce2c0f..f23bb17 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -483,7 +483,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		}
 
 		page = __page_cache_alloc(mapping_gfp_mask(mapping) &
-								~__GFP_FS);
+					  ~__GFP_FS, page);
 		if (!page)
 			break;
 
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index ebaff36..1b34a42 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -254,13 +254,13 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
 	newpage = NULL;
 
 	for (;;) {
-		backpage = find_get_page(bmapping, netpage->index);
-		if (backpage)
+		backpage = __find_get_page(bmapping, netpage->index);
+		if (backpage && !radix_tree_exceptional_entry(backpage))
 			goto backing_page_already_present;
 
 		if (!newpage) {
 			newpage = __page_cache_alloc(cachefiles_gfp |
-						     __GFP_COLD);
+						     __GFP_COLD, backpage);
 			if (!newpage)
 				goto nomem_monitor;
 		}
@@ -499,13 +499,14 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 		}
 
 		for (;;) {
-			backpage = find_get_page(bmapping, netpage->index);
-			if (backpage)
+			backpage = __find_get_page(bmapping, netpage->index);
+			if (backpage && !radix_tree_exceptional_entry(backpage))
 				goto backing_page_already_present;
 
 			if (!newpage) {
 				newpage = __page_cache_alloc(cachefiles_gfp |
-							     __GFP_COLD);
+							     __GFP_COLD,
+							     backpage);
 				if (!newpage)
 					goto nomem;
 			}
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index be661d8..a5d2b86 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -816,7 +816,7 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 			return -ENOMEM;
 		err = -ENOMEM;
 		for (i = 0; i < nr_pages; i++) {
-			pages[i] = __page_cache_alloc(GFP_NOFS);
+			pages[i] = __page_cache_alloc(GFP_NOFS, NULL);
 			if (!pages[i]) {
 				nr_pages = i;
 				goto out;
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 9a59cba..67c669a 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -316,9 +316,9 @@ static struct page *logfs_get_write_page(struct inode *inode, u64 bix,
 	int err;
 
 repeat:
-	page = find_get_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(GFP_NOFS);
+	page = __find_get_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		page = __page_cache_alloc(GFP_NOFS, page);
 		if (!page)
 			return NULL;
 		err = add_to_page_cache_lru(page, mapping, index, GFP_NOFS);
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index c5670b8..7aee2d1 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -413,10 +413,11 @@ static inline int __ntfs_grab_cache_pages(struct address_space *mapping,
 	BUG_ON(!nr_pages);
 	err = nr = 0;
 	do {
-		pages[nr] = find_lock_page(mapping, index);
-		if (!pages[nr]) {
+		pages[nr] = __find_lock_page(mapping, index);
+		if (!pages[nr] || radix_tree_exceptional_entry(pages[nr])) {
 			if (!*cached_page) {
-				*cached_page = page_cache_alloc(mapping);
+				*cached_page = page_cache_alloc(mapping,
+								pages[nr]);
 				if (unlikely(!*cached_page)) {
 					err = -ENOMEM;
 					goto err_out;
diff --git a/fs/splice.c b/fs/splice.c
index 3b7ee65..edc54ae 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -353,12 +353,12 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		 * Page could be there, find_get_pages_contig() breaks on
 		 * the first hole.
 		 */
-		page = find_get_page(mapping, index);
-		if (!page) {
+		page = __find_get_page(mapping, index);
+		if (!page || radix_tree_exceptional_entry(page)) {
 			/*
 			 * page didn't exist, allocate one.
 			 */
-			page = page_cache_alloc_cold(mapping);
+			page = page_cache_alloc_cold(mapping, page);
 			if (!page)
 				break;
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index db3a78b..4b24236 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -228,28 +228,32 @@ static inline void page_unfreeze_refs(struct page *page, int count)
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 {
 	return alloc_pages(gfp, 0);
 }
 #endif
 
-static inline struct page *page_cache_alloc(struct address_space *x)
+static inline struct page *page_cache_alloc(struct address_space *x,
+					    struct page *shadow)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x), shadow);
 }
 
-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *x,
+						 struct page *shadow)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD, shadow);
 }
 
-static inline struct page *page_cache_alloc_readahead(struct address_space *x)
+static inline struct page *page_cache_alloc_readahead(struct address_space *x,
+						      struct page *shadow)
 {
 	return __page_cache_alloc(mapping_gfp_mask(x) |
-				  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
+				  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN,
+				  shadow);
 }
 
 typedef int filler_t(void *, struct page *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 34b2f0b..d3e5578 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -538,7 +538,7 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 {
 	int n;
 	struct page *page;
@@ -917,9 +917,9 @@ struct page *find_or_create_page(struct address_space *mapping,
 	struct page *page;
 	int err;
 repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
+	page = __find_lock_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		page = __page_cache_alloc(gfp_mask, page);
 		if (!page)
 			return NULL;
 		/*
@@ -1222,15 +1222,16 @@ EXPORT_SYMBOL(find_get_pages_tag);
 struct page *
 grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 {
-	struct page *page = find_get_page(mapping, index);
+	struct page *page = __find_get_page(mapping, index);
 
-	if (page) {
+	if (page && !radix_tree_exceptional_entry(page)) {
 		if (trylock_page(page))
 			return page;
 		page_cache_release(page);
 		return NULL;
 	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
+				  page);
 	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 		page_cache_release(page);
 		page = NULL;
@@ -1304,8 +1305,9 @@ find_page:
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
-			page = find_get_page(mapping, index);
-			if (unlikely(page == NULL))
+			page = __find_get_page(mapping, index);
+			if (unlikely(page == NULL ||
+				     radix_tree_exceptional_entry(page)))
 				goto no_cached_page;
 		}
 		if (PageReadahead(page)) {
@@ -1464,7 +1466,7 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, page);
 		if (!page) {
 			desc->error = -ENOMEM;
 			goto out;
@@ -1673,18 +1675,20 @@ EXPORT_SYMBOL(generic_file_aio_read);
  * page_cache_read - adds requested page to the page cache if not already there
  * @file:	file to read
  * @offset:	page index
+ * @shadow:	shadow page of the page to be added
  *
  * This adds the requested page to the page cache if it isn't already there,
  * and schedules an I/O to read in its contents from disk.
  */
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset,
+			   struct page *shadow)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *page; 
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, shadow);
 		if (!page)
 			return -ENOMEM;
 
@@ -1815,8 +1819,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 retry_find:
-		page = find_get_page(mapping, offset);
-		if (!page)
+		page = __find_get_page(mapping, offset);
+		if (!page || radix_tree_exceptional_entry(page))
 			goto no_cached_page;
 	}
 
@@ -1859,7 +1863,7 @@ no_cached_page:
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, offset);
+	error = page_cache_read(file, offset, page);
 
 	/*
 	 * The page we want has now been added to the page cache.
@@ -1981,9 +1985,9 @@ static struct page *__read_cache_page(struct address_space *mapping,
 	struct page *page;
 	int err;
 repeat:
-	page = find_get_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp | __GFP_COLD);
+	page = __find_get_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		page = __page_cache_alloc(gfp | __GFP_COLD, page);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
 		err = add_to_page_cache_lru(page, mapping, index, gfp);
@@ -2454,11 +2458,11 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
 repeat:
-	page = find_lock_page(mapping, index);
-	if (page)
+	page = __find_lock_page(mapping, index);
+	if (page && !radix_tree_exceptional_entry(page))
 		goto found;
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	page = __page_cache_alloc(gfp_mask & ~gfp_notmask, page);
 	if (!page)
 		return NULL;
 	status = add_to_page_cache_lru(page, mapping, index,
diff --git a/mm/readahead.c b/mm/readahead.c
index 0f85996..58142ef 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -182,7 +182,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		if (page && !radix_tree_exceptional_entry(page))
 			continue;
 
-		page = page_cache_alloc_readahead(mapping);
+		page = page_cache_alloc_readahead(mapping, page);
 		if (!page)
 			break;
 		page->index = page_offset;
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 92866be..83fb56e 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -32,7 +32,7 @@ static int ceph_pagelist_addpage(struct ceph_pagelist *pl)
 	struct page *page;
 
 	if (!pl->num_pages_free) {
-		page = __page_cache_alloc(GFP_NOFS);
+		page = __page_cache_alloc(GFP_NOFS, NULL);
 	} else {
 		page = list_first_entry(&pl->free_list, struct page, lru);
 		list_del(&page->lru);
@@ -83,7 +83,7 @@ int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space)
 	space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT;   /* conv to num pages */
 
 	while (space > pl->num_pages_free) {
-		struct page *page = __page_cache_alloc(GFP_NOFS);
+		struct page *page = __page_cache_alloc(GFP_NOFS, NULL);
 		if (!page)
 			return -ENOMEM;
 		list_add_tail(&page->lru, &pl->free_list);
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 815a224..ff76422 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -79,7 +79,7 @@ struct page **ceph_alloc_page_vector(int num_pages, gfp_t flags)
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 	for (i = 0; i < num_pages; i++) {
-		pages[i] = __page_cache_alloc(flags);
+		pages[i] = __page_cache_alloc(flags, NULL);
 		if (pages[i] == NULL) {
 			ceph_release_page_vector(pages, i);
 			return ERR_PTR(-ENOMEM);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [patch 7/9] mm: make global_dirtyable_memory() available to other mm code
  2013-08-17 19:31 ` Johannes Weiner
@ 2013-08-17 19:31   ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

Subsequent patches need a rough estimate of memory available for page
cache.
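
For illustration only, a hypothetical consumer of the newly visible
symbol might look like this (not part of the patch; the real user
arrives later in the series):

#include <linux/writeback.h>

/* hypothetical check, shown only to illustrate the new interface */
static bool too_many_shadows(unsigned long nr_shadows)
{
	/* more shadow entries than pages that could ever hold cache */
	return nr_shadows > global_dirtyable_memory();
}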

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/writeback.h | 1 +
 mm/page-writeback.c       | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4e198ca..4c26df7 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -154,6 +154,7 @@ struct ctl_table;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int,
 				      void __user *, size_t *, loff_t *);
 
+unsigned long global_dirtyable_memory(void);
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d374b29..60376ad 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -231,7 +231,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  * Returns the global number of pages potentially available for dirty
  * page cache.  This is the base value for the global dirty limits.
  */
-static unsigned long global_dirtyable_memory(void)
+unsigned long global_dirtyable_memory(void)
 {
 	unsigned long x;
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-17 19:31 ` Johannes Weiner
@ 2013-08-17 19:31   ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have been
shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave enough room for the inactive list for a new set of
frequently used pages, "working set", to establish itself because the
young pages get pushed out of memory before having a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once streams,
but was not satisfactory when use-once streaming persisted over longer
periods of time and the established working set was temporarily
suspended, like a nightly backup evicting all the interactive user
program data.

Subsequently, the rules were changed to only age active pages when
they exceeded the number of inactive pages, i.e. leave the working set
alone as long as the other half of memory holds easy-to-reclaim use-once
pages.  This works well until working set transitions exceed the size
of half of memory and the average access distance between the pages of
the new working set is bigger than the inactive list.  The VM will
mistake the thrashing new working set for use once streaming, while
the unused old working set pages are stuck on the active list.

This patch solves this problem by maintaining a history of recently
evicted file pages, which in turn allows the VM to tell used-once page
streams from thrashing file cache.

To accomplish this, a per-zone counter is increased every time a page
is evicted, and a snapshot of that counter is stored as a shadow entry in
the page's now empty page cache radix tree slot.  Upon refault of that
page, the difference between the current value of that counter and the
shadow entry value is called the refault distance.  It tells how many
pages have been evicted from the zone since that page's eviction,
which is how many page slots at most are missing from the zone's
inactive list for this page to get accessed twice while in memory.  If
the number of missing slots is less than or equal to the number of
active pages, increasing the inactive list at the cost of the active
list would give this thrashing set a chance to establish itself:

eviction counter = 4
                        evicted      inactive           active
 Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
  Shadow entries:         0 1 2 3
Refault distance:         4 3 2 1
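
Concretely, when c refaults, its shadow entry reads 2 while the zone's
eviction counter reads 4, so the refault distance is 4 - 2 = 2.  The
active list holds three pages (l, m, n); since 2 <= 3, growing the
inactive list at the expense of the active list could have kept c in
memory long enough for its second access.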

When c is faulted back into memory, it is noted that at most two more
page slots on the inactive list could have prevented the refault (it
could be fewer if c is used out of order).  Thus, the active list needs
to be challenged as it is possible that c is used more frequently than
l, m, n.  However, there is no access frequency information available
on active pages so the pages have to be put in direct competition with
each other before deciding which one to keep.  Thus, a) pages cannot
be directly reclaimed from the tail of the active list, and b)
refaulting pages cannot be directly activated.  Instead, active pages
are moved from the tail of the active list to the head of the inactive
list and placed directly next to the refaulting pages.  This way, they
both have the same time on the inactive list to prove which page is
actually used more frequently without incurring unnecessary major
faults or diluting the active page set in case the previously active
page is in fact the more frequently used one.

Also, since the refault of c could have been due to a spurious access,
only one active page per qualifying refault is challenged.  This will
keep the impact of outliers low but still detect if bigger groups of
pages are refaulting.
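
In code terms, the feedback loop wired up below condenses to the
following (a simplified sketch of the mm/workingset.c and mm/vmscan.c
hunks; vmstat accounting, the anon side, and several checks are
elided):

/* refault path: note how far the page missed the inactive list by */
void workingset_refault(void *shadow)
{
	unsigned long refault_distance;
	struct zone *zone;

	unpack_shadow(shadow, &zone, &refault_distance);
	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE))
		zone->shrink_active++;	/* request active list aging */
}

/* reclaim path: consume the request and deactivate active pages */
static int inactive_file_is_low(struct lruvec *lruvec,
				unsigned long nr_to_scan,
				struct scan_control *sc)
{
	struct zone *zone = lruvec_zone(lruvec);

	if (global_reclaim(sc) && zone->shrink_active > 0) {
		zone->shrink_active -= nr_to_scan;
		return true;
	}
	return get_lru_size(lruvec, LRU_ACTIVE_FILE) >
	       get_lru_size(lruvec, LRU_INACTIVE_FILE);
}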

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h  |   7 ++
 include/linux/pagemap.h |   2 +
 include/linux/swap.h    |   5 ++
 mm/Makefile             |   2 +-
 mm/filemap.c            |   3 +
 mm/swap.c               |   2 +
 mm/vmscan.c             |  62 +++++++++++----
 mm/vmstat.c             |   4 +
 mm/workingset.c         | 200 ++++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 273 insertions(+), 14 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ac1ea79..56f540e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,10 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	WORKINGSET_REFAULT,
+	WORKINGSET_STALE,
+	WORKINGSET_BALANCE,
+	WORKINGSET_BALANCE_FORCE,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
@@ -393,6 +397,9 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
+	atomic_long_t		workingset_time;
+	long			shrink_active;
+
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b24236..c6beed2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -232,6 +232,8 @@ extern struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow);
 #else
 static inline struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 {
+	if (shadow)
+		workingset_refault(shadow);
 	return alloc_pages(gfp, 0);
 }
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 24db914..441845d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,11 @@ struct swap_list_t {
 	int next;	/* swapfile to be used next */
 };
 
+/* linux/mm/workingset.c */
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+void workingset_refault(void *shadow);
+void workingset_activation(struct page *page);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/Makefile b/mm/Makefile
index cd0abd8..f740427 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o \
-			   interval_tree.o list_lru.o $(mmu-y)
+			   interval_tree.o list_lru.o workingset.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/filemap.c b/mm/filemap.c
index d3e5578..ab4351e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -543,6 +543,9 @@ struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 	int n;
 	struct page *page;
 
+	if (shadow)
+		workingset_refault(shadow);
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
diff --git a/mm/swap.c b/mm/swap.c
index bf448cf..f90a331 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,6 +482,8 @@ void mark_page_accessed(struct page *page)
 		else
 			__lru_cache_activate_page(page);
 		ClearPageReferenced(page);
+		if (page_is_file_cache(page))
+			workingset_activation(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dd5f67c..27a36f6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -496,7 +496,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+			    bool reclaimed)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -542,10 +543,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
+		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page, NULL);
+		if (reclaimed && page_is_file_cache(page))
+			shadow = workingset_eviction(mapping, page);
+		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
@@ -568,7 +572,7 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
+	if (__remove_mapping(mapping, page, false)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -1038,7 +1042,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
 
 		/*
@@ -1746,32 +1750,64 @@ static inline int inactive_anon_is_low(struct lruvec *lruvec)
 /**
  * inactive_file_is_low - check if file pages need to be deactivated
  * @lruvec: LRU vector to check
+ * @nr_to_scan: number of active pages to scan
+ * @sc: scan parameters
  *
  * When the system is doing streaming IO, memory pressure here
  * ensures that active file pages get deactivated, until more
  * than half of the file pages are on the inactive list.
  *
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
+ * Once we get to that situation, protect the system's working set
+ * from being evicted by disabling active file page aging, until
+ * thrashing on the inactive list suggests that a new working set is
+ * trying to form.
  *
  * This uses a different ratio than the anonymous pages, because
  * the page cache uses a use-once replacement algorithm.
  */
-static int inactive_file_is_low(struct lruvec *lruvec)
+static int inactive_file_is_low(struct lruvec *lruvec,
+				unsigned long nr_to_scan,
+				struct scan_control *sc)
 {
+	struct zone *zone = lruvec_zone(lruvec);
 	unsigned long inactive;
 	unsigned long active;
 
+	if (global_reclaim(sc)) {
+		if (zone->shrink_active > 0) {
+			if (nr_to_scan) {
+				inc_zone_state(zone, WORKINGSET_BALANCE);
+				zone->shrink_active -= nr_to_scan;
+			}
+			return true;
+		}
+	}
+	/*
+	 * Make sure there is always a reasonable amount of inactive
+	 * file pages around to keep the zone reclaimable.
+	 *
+	 * We could do better than requiring half of memory, but we
+	 * need a big safety buffer until we are smarter about
+	 * dirty/writeback pages and file readahead windows.
+	 * Otherwise, we can end up with all pages on the inactive
+	 * list being dirty, or trash readahead pages before use.
+	 */
 	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
-
-	return active > inactive;
+	if (active > inactive) {
+		if (global_reclaim(sc) && nr_to_scan)
+			inc_zone_state(zone, WORKINGSET_BALANCE_FORCE);
+		return true;
+	}
+	return false;
 }
 
-static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
+static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru,
+				unsigned long nr_to_scan,
+				struct scan_control *sc)
 {
 	if (is_file_lru(lru))
-		return inactive_file_is_low(lruvec);
+		return inactive_file_is_low(lruvec, nr_to_scan, sc);
 	else
 		return inactive_anon_is_low(lruvec);
 }
@@ -1780,7 +1816,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, lru))
+		if (inactive_list_is_low(lruvec, lru, nr_to_scan, sc))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -1891,7 +1927,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * There is enough inactive page cache, do not reclaim
 	 * anything from the anonymous working set right now.
 	 */
-	if (!inactive_file_is_low(lruvec)) {
+	if (!inactive_file_is_low(lruvec, 0, sc)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ba9e46b..4fabaed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -739,6 +739,10 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
+	"workingset_refault",
+	"workingset_stale",
+	"workingset_balance",
+	"workingset_balance_force",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 0000000..88ad7ea
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,200 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ *		Double CLOCK lists
+ *
+ * Per zone, two clock lists are maintained for file pages: the
+ * inactive and the active list.  Freshly faulted pages start out at
+ * the head of the inactive list and page reclaim scans pages from the
+ * tail.  Pages that are accessed multiple times on the inactive list
+ * are promoted to the active list, to protect them from reclaim,
+ * whereas active pages are demoted to the inactive list when the
+ * inactive list requires more space to detect repeatedly accessed
+ * pages in the current workload and prevent them from thrashing:
+ *
+ *   fault -----------------------+
+ *                                |
+ *              +-------------+   |            +-------------+
+ *   reclaim <- | inactive    | <-+-- demotion | active      | <--+
+ *              +-------------+                +-------------+    |
+ *                       |                                        |
+ *                       +----------- promotion ------------------+
+ *
+ *
+ *		Access frequency and refault distance
+ *
+ * A workload is thrashing when the distances between the first and
+ * second access of pages that are frequently used are bigger than the
+ * current inactive clock list size, as the pages get reclaimed before
+ * the second access would have promoted them instead:
+ *
+ *    Access #: 1 2 3 4 5 6 7 8 9
+ *     Page ID: x y b c d e f x y
+ *                  | inactive  |
+ *
+ * To prevent this workload from thrashing, a bigger inactive list is
+ * required.  And the only way the inactive list can grow on a full
+ * zone is by taking away space from the corresponding active list.
+ *
+ *      +-inactive--+-active------+
+ *  x y | b c d e f | G H I J K L |
+ *      +-----------+-------------+
+ *
+ * Not every refault should lead to growing the inactive list at the
+ * cost of the active list, however: if the access distances are
+ * bigger than available memory overall, there is little point in
+ * challenging the protected pages on the active list, as those
+ * refaulting pages will not fit completely into memory.
+ *
+ * It is prohibitively expensive to track the access frequency of
+ * in-core pages, but it is possible to track their refault distance,
+ * which is the number of page slots shrunk from the inactive list
+ * between a page's eviction and subsequent refault.  This indicates
+ * how many page slots are missing on the inactive list in order to
+ * prevent future thrashing of that page.  Thus, instead of comparing
+ * access frequency to total available memory, one can compare the
+ * refault distance to the inactive list's potential for growth: the
+ * size of the active list.
+ *
+ *
+ *		Rebalancing the lists
+ *
+ * Shrinking the active list has to be done carefully because the
+ * pages on it may have vastly different access frequencies compared
+ * to the pages on the inactive list.  Thus, pages are not reclaimed
+ * directly from the tail of the active list, but instead moved to the
+ * head of the inactive list.  This way, they are competing directly
+ * with the pages that challenged their protected status.  If they are
+ * unused, they will eventually be reclaimed, but if they are indeed
+ * used more frequently than the challenging inactive pages, they will
+ * be reactivated.  This allows the existing protected set to be
+ * challenged without incurring major faults in case of a mistake.
+ */
+
+static void *pack_shadow(unsigned long time, struct zone *zone)
+{
+	time = (time << NODES_SHIFT) | zone_to_nid(zone);
+	time = (time << ZONES_SHIFT) | zone_idx(zone);
+	time = (time << RADIX_TREE_EXCEPTIONAL_SHIFT);
+
+	return (void *)(time | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+static void unpack_shadow(void *shadow,
+			  struct zone **zone,
+			  unsigned long *distance)
+{
+	unsigned long entry = (unsigned long)shadow;
+	unsigned long time_of_eviction;
+	unsigned long mask;
+	unsigned long now;
+	int zid, nid;
+
+	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	zid = entry & ((1UL << ZONES_SHIFT) - 1);
+	entry >>= ZONES_SHIFT;
+	nid = entry & ((1UL << NODES_SHIFT) - 1);
+	entry >>= NODES_SHIFT;
+	time_of_eviction = entry;
+
+	*zone = NODE_DATA(nid)->node_zones + zid;
+
+	now = atomic_long_read(&(*zone)->workingset_time);
+	mask = ~0UL >> (RADIX_TREE_EXCEPTIONAL_SHIFT +
+			ZONES_SHIFT + NODES_SHIFT);
+
+	*distance = (now - time_of_eviction) & mask;
+}
+
+/**
+ * workingset_eviction - note the eviction of a page from memory
+ * @mapping: address space the page was backing
+ * @page: the page being evicted
+ *
+ * Returns a shadow entry to be stored in @mapping->page_tree in place
+ * of the evicted @page so that a later refault can be detected.  Or
+ * %NULL when the eviction should not be remembered.
+ */
+void *workingset_eviction(struct address_space *mapping, struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long time;
+
+	time = atomic_long_inc_return(&zone->workingset_time);
+
+	/*
+	 * Don't store shadows in an inode that is being reclaimed.
+	 * This is not just an optimization: inode reclaim needs to
+	 * empty out the radix tree or the nodes are lost, so don't
+	 * plant shadows behind its back.
+	 */
+	if (mapping_exiting(mapping))
+		return NULL;
+
+	return pack_shadow(time, zone);
+}
+
+/**
+ * workingset_refault - note the refault of a previously evicted page
+ * @shadow: shadow entry of the evicted page
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted page in the context of the zone it was allocated in.
+ *
+ * This primes page reclaim to rebalance the zone's file lists if
+ * necessary, so it must be called before a page frame for the
+ * refaulting page is allocated.
+ */
+void workingset_refault(void *shadow)
+{
+	unsigned long refault_distance;
+	struct zone *zone;
+
+	unpack_shadow(shadow, &zone, &refault_distance);
+
+	inc_zone_state(zone, WORKINGSET_REFAULT);
+
+	/*
+	 * Protected pages should be challenged when the refault
+	 * distance indicates that thrashing could be stopped by
+	 * increasing the inactive list at the cost of the active
+	 * list.
+	 */
+	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+		inc_zone_state(zone, WORKINGSET_STALE);
+		zone->shrink_active++;
+	}
+}
+EXPORT_SYMBOL(workingset_refault);
+
+/**
+ * workingset_activation - note a page activation
+ * @page: page that is being activated
+ */
+void workingset_activation(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	/*
+	 * The lists are rebalanced when the inactive list is observed
+	 * to be too small for activations.  An activation means that
+	 * the inactive list is now big enough again for at least one
+	 * page, so back off further deactivation.
+	 */
+	atomic_long_inc(&zone->workingset_time);
+	if (zone->shrink_active > 0)
+		zone->shrink_active--;
+}
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [patch 8/9] mm: thrash detection-based file cache sizing
@ 2013-08-17 19:31   ` Johannes Weiner
  0 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past.  We call the recently
used list "inactive list" and the frequently used list "active list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave enough room for the inactive list for a new set of
frequently used pages, "working set", to establish itself because the
young pages get pushed out of memory before having a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once
streams, but was not satisfactory when use-once streaming persisted
over longer periods of time and the established working set was
temporarily suspended, like a nightly backup evicting all the
interactive user program data.

Subsequently, the rules were changed to only age active pages when
they exceeded the amount of inactive pages, i.e. to leave the working
set alone as long as the other half of memory consists of
easy-to-reclaim use-once pages.  This works well until working set
transitions exceed the size of half of memory and the average access
distance between the pages of the new working set is bigger than the
inactive list.  The VM will then mistake the thrashing new working set
for use-once streaming, while the unused old working set pages are
stuck on the active list.
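
As a reference point, the aging rule described above boils down to a
single comparison; here is a minimal standalone sketch of it in C with
illustrative names (the in-tree check is inactive_file_is_low() in
mm/vmscan.c, visible in the diff further down):

static int inactive_file_is_low_sketch(unsigned long nr_inactive_file,
				       unsigned long nr_active_file)
{
	/* keep deactivating only while active pages outnumber inactive ones */
	return nr_active_file > nr_inactive_file;
}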

This patch solves this problem by maintaining a history of recently
evicted file pages, which in turn allows the VM to tell used-once page
streams from thrashing file cache.

To accomplish this, a per-zone counter is increased every time a page
is evicted and a snapshot of that counter is stored as a shadow entry
in the page's now empty page cache radix tree slot.  Upon refault of
that
page, the difference between the current value of that counter and the
shadow entry value is called the refault distance.  It tells how many
pages have been evicted from the zone since that page's eviction,
which is how many page slots at most are missing from the zone's
inactive list for this page to get accessed twice while in memory.  If
the number of missing slots is less than or equal to the number of
active pages, increasing the inactive list at the cost of the active
list would give this thrashing set a chance to establish itself:

eviction counter = 4
                        evicted      inactive           active
 Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
  Shadow entries:         0 1 2 3
Refault distance:         4 3 2 1

When c is faulted back into memory, it is noted that at most two more
page slots on the inactive list could have prevented the refault (it
could be less if c is used out of order).  Thus, the active list needs
to be challenged as it is possible that c is used more frequently than
l, m, n.  However, there is no access frequency information available
on active pages so the pages have to be put in direct competition with
each other before deciding which one to keep.  Thus, a) pages cannot
be directly reclaimed from the tail of the active list, and b)
refaulting pages cannot be directly activated.  Instead, active pages
are moved from the tail of the active list to the head of the inactive
list and placed directly next to the refaulting pages.  This way, they
both have the same time on the inactive list to prove which page is
actually used more frequently without incurring unnecessary major
faults or diluting the active page set in case the previously active
page is in fact the more frequently used one.

Also, since the refault of c could have been due to a spurious access,
only one active page per qualifying refault is challenged.  This will
keep the impact of outliers low but still detect if bigger groups of
pages are refaulting.
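
Expressed as a minimal standalone C helper, the comparison above looks
roughly like this (illustrative names only, not the patch code itself;
the real logic is workingset_refault() in mm/workingset.c below):

#include <stdbool.h>

static bool should_challenge_active_list(unsigned long eviction_time,
					 unsigned long now,
					 unsigned long nr_active_file)
{
	/* unsigned arithmetic keeps the distance well-defined on wraparound */
	unsigned long refault_distance = now - eviction_time;

	/*
	 * At most 'refault_distance' additional inactive slots would
	 * have prevented this refault, and that room can only come
	 * from the active list.
	 */
	return refault_distance <= nr_active_file;
}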

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h  |   7 ++
 include/linux/pagemap.h |   2 +
 include/linux/swap.h    |   5 ++
 mm/Makefile             |   2 +-
 mm/filemap.c            |   3 +
 mm/swap.c               |   2 +
 mm/vmscan.c             |  62 +++++++++++----
 mm/vmstat.c             |   4 +
 mm/workingset.c         | 200 ++++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 273 insertions(+), 14 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ac1ea79..56f540e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,10 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	WORKINGSET_REFAULT,
+	WORKINGSET_STALE,
+	WORKINGSET_BALANCE,
+	WORKINGSET_BALANCE_FORCE,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
@@ -393,6 +397,9 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
+	atomic_long_t		workingset_time;
+	long			shrink_active;
+
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b24236..c6beed2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -232,6 +232,8 @@ extern struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow);
 #else
 static inline struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 {
+	if (shadow)
+		workingset_refault(shadow);
 	return alloc_pages(gfp, 0);
 }
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 24db914..441845d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,11 @@ struct swap_list_t {
 	int next;	/* swapfile to be used next */
 };
 
+/* linux/mm/workingset.c */
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+void workingset_refault(void *shadow);
+void workingset_activation(struct page *page);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/Makefile b/mm/Makefile
index cd0abd8..f740427 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o \
-			   interval_tree.o list_lru.o $(mmu-y)
+			   interval_tree.o list_lru.o workingset.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/filemap.c b/mm/filemap.c
index d3e5578..ab4351e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -543,6 +543,9 @@ struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 	int n;
 	struct page *page;
 
+	if (shadow)
+		workingset_refault(shadow);
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
diff --git a/mm/swap.c b/mm/swap.c
index bf448cf..f90a331 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,6 +482,8 @@ void mark_page_accessed(struct page *page)
 		else
 			__lru_cache_activate_page(page);
 		ClearPageReferenced(page);
+		if (page_is_file_cache(page))
+			workingset_activation(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dd5f67c..27a36f6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -496,7 +496,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+			    bool reclaimed)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -542,10 +543,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
+		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page, NULL);
+		if (reclaimed && page_is_file_cache(page))
+			shadow = workingset_eviction(mapping, page);
+		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
@@ -568,7 +572,7 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
+	if (__remove_mapping(mapping, page, false)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -1038,7 +1042,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
 
 		/*
@@ -1746,32 +1750,64 @@ static inline int inactive_anon_is_low(struct lruvec *lruvec)
 /**
  * inactive_file_is_low - check if file pages need to be deactivated
  * @lruvec: LRU vector to check
+ * @nr_to_scan: number of active pages to scan
+ * @sc: scan parameters
  *
  * When the system is doing streaming IO, memory pressure here
  * ensures that active file pages get deactivated, until more
  * than half of the file pages are on the inactive list.
  *
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
+ * Once we get to that situation, protect the system's working set
+ * from being evicted by disabling active file page aging, until
+ * thrashing on the inactive list suggests that a new working set is
+ * trying to form.
  *
  * This uses a different ratio than the anonymous pages, because
  * the page cache uses a use-once replacement algorithm.
  */
-static int inactive_file_is_low(struct lruvec *lruvec)
+static int inactive_file_is_low(struct lruvec *lruvec,
+				unsigned long nr_to_scan,
+				struct scan_control *sc)
 {
+	struct zone *zone = lruvec_zone(lruvec);
 	unsigned long inactive;
 	unsigned long active;
 
+	if (global_reclaim(sc)) {
+		if (zone->shrink_active > 0) {
+			if (nr_to_scan) {
+				inc_zone_state(zone, WORKINGSET_BALANCE);
+				zone->shrink_active -= nr_to_scan;
+			}
+			return true;
+		}
+	}
+	/*
+	 * Make sure there is always a reasonable amount of inactive
+	 * file pages around to keep the zone reclaimable.
+	 *
+	 * We could do better than requiring half of memory, but we
+	 * need a big safety buffer until we are smarter about
+	 * dirty/writeback pages and file readahead windows.
+	 * Otherwise, we can end up with all pages on the inactive
+	 * list being dirty, or trash readahead pages before use.
+	 */
 	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
-
-	return active > inactive;
+	if (active > inactive) {
+		if (global_reclaim(sc) && nr_to_scan)
+			inc_zone_state(zone, WORKINGSET_BALANCE_FORCE);
+		return true;
+	}
+	return false;
 }
 
-static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
+static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru,
+				unsigned long nr_to_scan,
+				struct scan_control *sc)
 {
 	if (is_file_lru(lru))
-		return inactive_file_is_low(lruvec);
+		return inactive_file_is_low(lruvec, nr_to_scan, sc);
 	else
 		return inactive_anon_is_low(lruvec);
 }
@@ -1780,7 +1816,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, lru))
+		if (inactive_list_is_low(lruvec, lru, nr_to_scan, sc))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -1891,7 +1927,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * There is enough inactive page cache, do not reclaim
 	 * anything from the anonymous working set right now.
 	 */
-	if (!inactive_file_is_low(lruvec)) {
+	if (!inactive_file_is_low(lruvec, 0, sc)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ba9e46b..4fabaed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -739,6 +739,10 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
+	"workingset_refault",
+	"workingset_stale",
+	"workingset_balance",
+	"workingset_balance_force",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 0000000..88ad7ea
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,200 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ *		Double CLOCK lists
+ *
+ * Per zone, two clock lists are maintained for file pages: the
+ * inactive and the active list.  Freshly faulted pages start out at
+ * the head of the inactive list and page reclaim scans pages from the
+ * tail.  Pages that are accessed multiple times on the inactive list
+ * are promoted to the active list, to protect them from reclaim,
+ * whereas active pages are demoted to the inactive list when the
+ * inactive list requires more space to detect repeatedly accessed
+ * pages in the current workload and prevent them from thrashing:
+ *
+ *   fault -----------------------+
+ *                                |
+ *              +-------------+   |            +-------------+
+ *   reclaim <- | inactive    | <-+-- demotion | active      | <--+
+ *              +-------------+                +-------------+    |
+ *                       |                                        |
+ *                       +----------- promotion ------------------+
+ *
+ *
+ *		Access frequency and refault distance
+ *
+ * A workload is thrashing when the distances between the first and
+ * second access of pages that are frequently used are bigger than the
+ * current inactive clock list size, as the pages get reclaimed before
+ * the second access would have promoted them instead:
+ *
+ *    Access #: 1 2 3 4 5 6 7 8 9
+ *     Page ID: x y b c d e f x y
+ *                  | inactive  |
+ *
+ * To prevent this workload from thrashing, a bigger inactive list is
+ * required.  And the only way the inactive list can grow on a full
+ * zone is by taking away space from the corresponding active list.
+ *
+ *      +-inactive--+-active------+
+ *  x y | b c d e f | G H I J K L |
+ *      +-----------+-------------+
+ *
+ * Not every refault should lead to growing the inactive list at the
+ * cost of the active list, however: if the access distances are
+ * bigger than available memory overall, there is little point in
+ * challenging the protected pages on the active list, as those
+ * refaulting pages will not fit completely into memory.
+ *
+ * It is prohibitively expensive to track the access frequency of
+ * in-core pages, but it is possible to track their refault distance,
+ * which is the number of page slots shrunk from the inactive list
+ * between a page's eviction and subsequent refault.  This indicates
+ * how many page slots are missing on the inactive list in order to
+ * prevent future thrashing of that page.  Thus, instead of comparing
+ * access frequency to total available memory, one can compare the
+ * refault distance to the inactive list's potential for growth: the
+ * size of the active list.
+ *
+ *
+ *		Rebalancing the lists
+ *
+ * Shrinking the active list has to be done carefully because the
+ * pages on it may have vastly different access frequencies compared
+ * to the pages on the inactive list.  Thus, pages are not reclaimed
+ * directly from the tail of the active list, but instead moved to the
+ * head of the inactive list.  This way, they are competing directly
+ * with the pages that challenged their protected status.  If they are
+ * unused, they will eventually be reclaimed, but if they are indeed
+ * used more frequently than the challenging inactive pages, they will
+ * be reactivated.  This allows the existing protected set to be
+ * challenged without incurring major faults in case of a mistake.
+ */
+
+static void *pack_shadow(unsigned long time, struct zone *zone)
+{
+	time = (time << NODES_SHIFT) | zone_to_nid(zone);
+	time = (time << ZONES_SHIFT) | zone_idx(zone);
+	time = (time << RADIX_TREE_EXCEPTIONAL_SHIFT);
+
+	return (void *)(time | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+static void unpack_shadow(void *shadow,
+			  struct zone **zone,
+			  unsigned long *distance)
+{
+	unsigned long entry = (unsigned long)shadow;
+	unsigned long time_of_eviction;
+	unsigned long mask;
+	unsigned long now;
+	int zid, nid;
+
+	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	zid = entry & ((1UL << ZONES_SHIFT) - 1);
+	entry >>= ZONES_SHIFT;
+	nid = entry & ((1UL << NODES_SHIFT) - 1);
+	entry >>= NODES_SHIFT;
+	time_of_eviction = entry;
+
+	*zone = NODE_DATA(nid)->node_zones + zid;
+
+	now = atomic_long_read(&(*zone)->workingset_time);
+	mask = ~0UL >> (RADIX_TREE_EXCEPTIONAL_SHIFT +
+			ZONES_SHIFT + NODES_SHIFT);
+
+	*distance = (now - time_of_eviction) & mask;
+}
+
+/**
+ * workingset_eviction - note the eviction of a page from memory
+ * @mapping: address space the page was backing
+ * @page: the page being evicted
+ *
+ * Returns a shadow entry to be stored in @mapping->page_tree in place
+ * of the evicted @page so that a later refault can be detected.  Or
+ * %NULL when the eviction should not be remembered.
+ */
+void *workingset_eviction(struct address_space *mapping, struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long time;
+
+	time = atomic_long_inc_return(&zone->workingset_time);
+
+	/*
+	 * Don't store shadows in an inode that is being reclaimed.
+	 * This is not just an optimization, inode reclaim needs to
+	 * empty out the radix tree or the nodes are lost, so don't
+	 * plant shadows behind its back.
+	 */
+	if (mapping_exiting(mapping))
+		return NULL;
+
+	return pack_shadow(time, zone);
+}
+
+/**
+ * workingset_refault - note the refault of a previously evicted page
+ * @shadow: shadow entry of the evicted page
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted page in the context of the zone it was allocated in.
+ *
+ * This primes page reclaim to rebalance the zone's file lists if
+ * necessary, so it must be called before a page frame for the
+ * refaulting page is allocated.
+ */
+void workingset_refault(void *shadow)
+{
+	unsigned long refault_distance;
+	struct zone *zone;
+
+	unpack_shadow(shadow, &zone, &refault_distance);
+
+	inc_zone_state(zone, WORKINGSET_REFAULT);
+
+	/*
+	 * Protected pages should be challenged when the refault
+	 * distance indicates that thrashing could be stopped by
+	 * increasing the inactive list at the cost of the active
+	 * list.
+	 */
+	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+		inc_zone_state(zone, WORKINGSET_STALE);
+		zone->shrink_active++;
+	}
+}
+EXPORT_SYMBOL(workingset_refault);
+
+/**
+ * workingset_activation - note a page activation
+ * @page: page that is being activated
+ */
+void workingset_activation(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	/*
+	 * The lists are rebalanced when the inactive list is observed
+	 * to be too small for activations.  An activation means that
+	 * the inactive list is now big enough again for at least one
+	 * page, so back off further deactivation.
+	 */
+	atomic_long_inc(&zone->workingset_time);
+	if (zone->shrink_active > 0)
+		zone->shrink_active--;
+}
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [patch 9/9] mm: workingset: keep shadow entries in check
  2013-08-17 19:31 ` Johannes Weiner
@ 2013-08-17 19:31   ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-17 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, a list of inodes that contain shadow
entries is maintained.  If the global number of shadows exceeds a
certain threshold, a shrinker is activated that reclaims old entries
from the mappings.  This is heavy-handed but it should not be a hot
path and is mainly there to protect from accidentally/maliciously
induced OOM kills.  The global list is also not a problem because the
modifications are very rare: inodes are added once in their lifetime
when the first shadow entry is stored (i.e. when the first page is
reclaimed) and lazily removed when the inode exits or when the
shrinker removes all of its shadow entries.
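
For illustration, the "too many shadows" trigger described above can
be sketched as a pure function with made-up names (the actual
bookkeeping is get_nr_old_shadows() and the shadow shrinker in
mm/workingset.c below, where the ceiling is a multiple of the
dirtyable memory):

static unsigned long nr_old_shadows_sketch(long nr_shadows_total,
					   unsigned long ceiling)
{
	unsigned long nr = nr_shadows_total > 0 ? nr_shadows_total : 0;

	/* below the ceiling, the shrinker has nothing to do */
	return nr > ceiling ? nr - ceiling : 0;
}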

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c             |   5 +-
 include/linux/fs.h     |   1 +
 include/linux/mmzone.h |   1 +
 include/linux/swap.h   |   4 +
 mm/filemap.c           |   4 +-
 mm/truncate.c          |   2 +-
 mm/vmstat.c            |   1 +
 mm/workingset.c        | 196 +++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 208 insertions(+), 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 8862b1b..b23b141 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -169,6 +169,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->private_data = NULL;
 	mapping->backing_dev_info = &default_backing_dev_info;
 	mapping->writeback_index = 0;
+	workingset_init_mapping(mapping);
 
 	/*
 	 * If the block_device provides a backing_dev_info for client
@@ -546,9 +547,7 @@ static void evict(struct inode *inode)
 	 */
 	inode_wait_for_writeback(inode);
 
-	spin_lock_irq(&inode->i_data.tree_lock);
-	mapping_set_exiting(&inode->i_data);
-	spin_unlock_irq(&inode->i_data.tree_lock);
+	workingset_exit_mapping(&inode->i_data);
 
 	if (op->evict_inode) {
 		op->evict_inode(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ac5d84e..140ae2d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -417,6 +417,7 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	struct list_head	shadow_list;	/* list of mappings with shadows */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 56f540e..747adae 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -146,6 +146,7 @@ enum zone_stat_item {
 	WORKINGSET_STALE,
 	WORKINGSET_BALANCE,
 	WORKINGSET_BALANCE_FORCE,
+	WORKINGSET_SHADOWS_RECLAIMED,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 441845d..f18dd33eb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,9 +261,13 @@ struct swap_list_t {
 };
 
 /* linux/mm/workingset.c */
+void workingset_init_mapping(struct address_space *mapping);
+void workingset_exit_mapping(struct address_space *mapping);
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 void workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
+void workingset_shadows_inc(struct address_space *mapping);
+void workingset_shadows_dec(struct address_space *mapping);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index ab4351e..d47d83e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -132,7 +132,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
 		radix_tree_replace_slot(slot, shadow);
-		mapping->nrshadows++;
+		workingset_shadows_inc(mapping);
 	} else
 		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
@@ -466,7 +466,7 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 		radix_tree_replace_slot(slot, page);
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 		return 0;
 	}
 	return radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 5c85dd4..76064a4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -36,7 +36,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * need verification under the tree lock.
 	 */
 	if (radix_tree_delete_item(&mapping->page_tree, index, page) == page)
-		mapping->nrshadows--;
+		workingset_shadows_dec(mapping);
 	spin_unlock_irq(&mapping->tree_lock);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4fabaed..b0b8e3d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -743,6 +743,7 @@ const char * const vmstat_text[] = {
 	"workingset_stale",
 	"workingset_balance",
 	"workingset_balance_force",
+	"workingset_shadows_reclaimed",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 88ad7ea..00b2280e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -84,6 +84,46 @@
  * challenged without incurring major faults in case of a mistake.
  */
 
+static DEFINE_PER_CPU(unsigned long, nr_shadows);
+static DEFINE_SPINLOCK(shadow_lock);
+static LIST_HEAD(shadow_mappings);
+
+/**
+ * workingset_init_mapping - prepare address space for page reclaim
+ * @mapping: address space
+ *
+ * Must be called when the inode is instantiated, before any page
+ * cache is populated.
+ */
+void workingset_init_mapping(struct address_space *mapping)
+{
+	INIT_LIST_HEAD(&mapping->shadow_list);
+}
+
+/**
+ * workingset_exit_mapping - tell page reclaim address space is exiting
+ * @mapping: address space
+ *
+ * Must be called before the final truncate, to prevent page reclaim
+ * from installing shadow entries behind the back of the inode
+ * teardown process.
+ */
+void workingset_exit_mapping(struct address_space *mapping)
+{
+	spin_lock_irq(&mapping->tree_lock);
+	mapping_set_exiting(mapping);
+	/*
+	 * Take the mapping off the shrinker list, the final truncate
+	 * is about to remove potentially remaining shadow entries.
+	 */
+	if (!list_empty(&mapping->shadow_list)) {
+		spin_lock(&shadow_lock);
+		list_del(&mapping->shadow_list);
+		spin_unlock(&shadow_lock);
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+}
+
 static void *pack_shadow(unsigned long time, struct zone *zone)
 {
 	time = (time << NODES_SHIFT) | zone_to_nid(zone);
@@ -198,3 +238,159 @@ void workingset_activation(struct page *page)
 	if (zone->shrink_active > 0)
 		zone->shrink_active--;
 }
+
+void workingset_shadows_inc(struct address_space *mapping)
+{
+	might_lock(&shadow_lock);
+
+	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
+		spin_lock(&shadow_lock);
+		list_add(&mapping->shadow_list, &shadow_mappings);
+		spin_unlock(&shadow_lock);
+	}
+
+	mapping->nrshadows++;
+	this_cpu_inc(nr_shadows);
+}
+
+void workingset_shadows_dec(struct address_space *mapping)
+{
+	mapping->nrshadows--;
+	this_cpu_dec(nr_shadows);
+	/*
+	 * shadow_mappings operations are costly, so we keep the
+	 * mapping linked here even without any shadows left and
+	 * unlink it lazily in the shadow shrinker or when the inode
+	 * is destroyed.
+	 */
+}
+
+static unsigned long get_nr_old_shadows(void)
+{
+	unsigned long nr_max;
+	unsigned long nr;
+	long sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += per_cpu(nr_shadows, cpu);
+	nr = max(sum, 0L);
+
+	/*
+	 * Every shadow entry with a refault distance bigger than the
+	 * active list is ignored and so NR_ACTIVE_FILE would be a
+	 * reasonable ceiling.  But scanning and shrinking shadow
+	 * entries is quite expensive, so be generous.
+	 */
+	nr_max = global_dirtyable_memory() * 4;
+
+	if (nr <= nr_max)
+		return 0;
+	return nr - nr_max;
+}
+
+static unsigned long scan_mapping(struct address_space *mapping,
+				  unsigned long nr_to_scan)
+{
+	unsigned long nr_scanned = 0;
+	struct radix_tree_iter iter;
+	void **slot;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
+		unsigned long nrshadows;
+		unsigned long distance;
+		struct zone *zone;
+		struct page *page;
+
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (!radix_tree_exception(page))
+			continue;
+		if (radix_tree_deref_retry(page))
+			goto restart;
+
+		unpack_shadow(page, &zone, &distance);
+
+		if (distance <= zone_page_state(zone, NR_ACTIVE_FILE))
+			continue;
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (radix_tree_delete_item(&mapping->page_tree,
+					   iter.index, page)) {
+			inc_zone_state(zone, WORKINGSET_SHADOWS_RECLAIMED);
+			workingset_shadows_dec(mapping);
+			nr_scanned++;
+		}
+		nrshadows = mapping->nrshadows;
+		spin_unlock_irq(&mapping->tree_lock);
+
+		if (nrshadows == 0)
+			break;
+
+		if (--nr_to_scan == 0)
+			break;
+	}
+	rcu_read_unlock();
+
+	return nr_scanned;
+}
+
+static unsigned long count_shadows(struct shrinker *shrink,
+				   struct shrink_control *sc)
+{
+	return get_nr_old_shadows();
+}
+
+static unsigned long scan_shadows(struct shrinker *shrink,
+				  struct shrink_control *sc)
+{
+	unsigned long nr_to_scan = sc->nr_to_scan;
+
+	do {
+		struct address_space *mapping;
+
+		spin_lock_irq(&shadow_lock);
+		if (list_empty(&shadow_mappings)) {
+			spin_unlock_irq(&shadow_lock);
+			break;
+		}
+		mapping = list_entry(shadow_mappings.prev,
+				     struct address_space,
+				     shadow_list);
+		list_move(&mapping->shadow_list, &shadow_mappings);
+		__iget(mapping->host);
+		spin_unlock_irq(&shadow_lock);
+
+		if (mapping->nrshadows)
+			nr_to_scan -= scan_mapping(mapping, nr_to_scan);
+
+		spin_lock_irq(&mapping->tree_lock);
+		if (mapping->nrshadows == 0) {
+			spin_lock(&shadow_lock);
+			list_del_init(&mapping->shadow_list);
+			spin_unlock(&shadow_lock);
+		}
+		spin_unlock_irq(&mapping->tree_lock);
+
+		iput(mapping->host);
+
+	} while (nr_to_scan && get_nr_old_shadows());
+
+	return sc->nr_to_scan - nr_to_scan;
+}
+
+static struct shrinker shadow_shrinker = {
+	.count_objects = count_shadows,
+	.scan_objects = scan_shadows,
+	.seeks = 1,
+};
+
+static int __init workingset_init(void)
+{
+	register_shrinker(&shadow_shrinker);
+	return 0;
+}
+core_initcall(workingset_init);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [patch 1/9] lib: radix-tree: radix_tree_delete_item()
  2013-08-17 19:31   ` Johannes Weiner
@ 2013-08-20 20:59     ` Andrew Morton
  -1 siblings, 0 replies; 51+ messages in thread
From: Andrew Morton @ 2013-08-20 20:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Sat, 17 Aug 2013 15:31:15 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Provide a function that does not just delete an entry at a given
> index, but also allows passing in an expected item.  Delete only if
> that item is still located at the specified index.
> 
> This is handy when lockless tree traversals want to delete entries as
> well because they don't have to do a second, locked lookup to verify
> the slot has not changed under them before deleting the entry.
> 
> ...
>
> -void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
> +void *radix_tree_delete_item(struct radix_tree_root *root,
> +			     unsigned long index, void *item)

radix-tree is an exported-to-modules API, so I guess we should do this
so others don't need to..

--- a/lib/radix-tree.c~lib-radix-tree-radix_tree_delete_item-fix
+++ a/lib/radix-tree.c
@@ -1393,6 +1393,7 @@ void *radix_tree_delete_item(struct radi
 out:
 	return slot;
 }
+EXPORT_SYMBOL(radix_tree_delete_item);
 
 /**
  *	radix_tree_delete    -    delete an item from a radix tree
_
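
For context, the verify-and-delete pattern the quoted changelog
describes ends up looking roughly like this at a call site (a sketch
with made-up variable names, modelled on the scan_mapping() caller in
patch 9/9):

	/* delete 'expected' only if it is still the entry at 'index' */
	if (radix_tree_delete_item(&mapping->page_tree, index, expected))
		nr_deleted++;	/* it was unchanged and is now gone */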


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 4/9] mm + fs: prepare for non-page entries in page cache radix trees
  2013-08-17 19:31   ` Johannes Weiner
@ 2013-08-20 20:59     ` Andrew Morton
  -1 siblings, 0 replies; 51+ messages in thread
From: Andrew Morton @ 2013-08-20 20:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Sat, 17 Aug 2013 15:31:18 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> shmem mappings already contain exceptional entries where swap slot
> information is remembered.
> 
> To be able to store eviction information for regular page cache,
> prepare every site dealing with the radix trees directly to handle
> entries other than pages.
> 
> The common lookup functions will filter out non-page entries and
> return NULL for page cache holes, just as before.  But provide a raw
> version of the API which returns non-page entries as well, and switch
> shmem over to use it.
> 
>
> ...
>
> -/**
> - * find_get_page - find and get a page reference
> - * @mapping: the address_space to search
> - * @offset: the page index
> - *
> - * Is there a pagecache struct page at the given (mapping, offset) tuple?
> - * If yes, increment its refcount and return it; if no, return NULL.
> - */
> -struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
> +struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
>  {
>  	void **pagep;
>  	struct page *page;
> @@ -812,24 +828,31 @@ out:
>  
>  	return page;
>  }
> -EXPORT_SYMBOL(find_get_page);
> +EXPORT_SYMBOL(__find_get_page);

Deleting the interface documentation for a global, exported-to-modules
function was a bit rude.

And it does need documentation, to tell people that it can return the
non-pages.

Does it have the same handling of non-pages as __find_get_pages()?  It
had better, given the naming!

>
> ...
>
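
For what it's worth, the filtering wrapper the changelog describes
would presumably look something like this sketch (illustrative names
and structure, not taken from the patch):

struct page *find_get_page_sketch(struct address_space *mapping,
				  pgoff_t offset)
{
	struct page *page = __find_get_page(mapping, offset);

	/* shadow entries are exceptional; keep the old NULL-for-hole API */
	if (radix_tree_exceptional_entry(page))
		return NULL;
	return page;
}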

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 5/9] mm + fs: store shadow entries in page cache
  2013-08-17 19:31   ` Johannes Weiner
@ 2013-08-20 20:59     ` Andrew Morton
  -1 siblings, 0 replies; 51+ messages in thread
From: Andrew Morton @ 2013-08-20 20:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Sat, 17 Aug 2013 15:31:19 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Reclaim will be leaving shadow entries in the page cache radix tree
> upon evicting the real page.  As those pages are found from the LRU,
> an iput() can lead to the inode being freed concurrently.  At this
> point, reclaim must no longer install shadow pages because the inode
> freeing code needs to ensure the page tree is really empty.
> 
> Add an address_space flag, AS_EXITING, that the inode freeing code
> sets under the tree lock before doing the final truncate.  Reclaim
> will check for this flag before installing shadow pages.
> 
> ...
>
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -503,6 +503,7 @@ void clear_inode(struct inode *inode)
>  	 */
>  	spin_lock_irq(&inode->i_data.tree_lock);
>  	BUG_ON(inode->i_data.nrpages);
> +	BUG_ON(inode->i_data.nrshadows);
>  	spin_unlock_irq(&inode->i_data.tree_lock);
>  	BUG_ON(!list_empty(&inode->i_data.private_list));
>  	BUG_ON(!(inode->i_state & I_FREEING));
> @@ -545,10 +546,14 @@ static void evict(struct inode *inode)
>  	 */
>  	inode_wait_for_writeback(inode);
>  
> +	spin_lock_irq(&inode->i_data.tree_lock);
> +	mapping_set_exiting(&inode->i_data);
> +	spin_unlock_irq(&inode->i_data.tree_lock);

mapping_set_exiting() is atomic and the locking here probably doesn't do
anything useful.
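
(For reference, the helper in question is presumably just an atomic
bit operation on mapping->flags, in the style of the other
mapping_set_* helpers -- an illustrative sketch, not necessarily the
patch's exact definition:

	static inline void mapping_set_exiting(struct address_space *mapping)
	{
		set_bit(AS_EXITING, &mapping->flags);
	}

which is what makes the surrounding tree_lock look redundant at first
glance.)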


>  	if (op->evict_inode) {
>  		op->evict_inode(inode);
>  	} else {
> -		if (inode->i_data.nrpages)
> +		if (inode->i_data.nrpages || inode->i_data.nrshadows)
>  			truncate_inode_pages(&inode->i_data, 0);
>  		clear_inode(inode);
>  	}
> 
> ...
>
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -416,6 +416,7 @@ struct address_space {
>  	struct mutex		i_mmap_mutex;	/* protect tree, count, list */
>  	/* Protected by tree_lock together with the radix tree */
>  	unsigned long		nrpages;	/* number of total pages */
> +	unsigned long		nrshadows;	/* number of shadow entries */
>  	pgoff_t			writeback_index;/* writeback starts here */
>  	const struct address_space_operations *a_ops;	/* methods */
>  	unsigned long		flags;		/* error bits/gfp mask */

This grows the size of the in-core inode.  There can be tremendous
numbers of those so this was a significantly costly change.

And this whole patchset contains no data which permits us to agree that
this cost was worthwhile.

> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index b6854b7..db3a78b 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -25,6 +25,7 @@ enum mapping_flags {
>  	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
>  	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
>  	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
> +	AS_EXITING	= __GFP_BITS_SHIFT + 5, /* inode is being evicted */

This is far too little documentation.  I suggest adding the complete
rationale at the mapping_set_exiting() definition site.  Or perhaps at
the mapping_set_exiting callsite in evict().

>  };
>  
> 
> ...
>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-17 19:31   ` Johannes Weiner
@ 2013-08-20 20:59     ` Andrew Morton
  -1 siblings, 0 replies; 51+ messages in thread
From: Andrew Morton @ 2013-08-20 20:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Sat, 17 Aug 2013 15:31:22 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> The VM maintains cached filesystem pages on two types of lists.  One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past.  We call the recently used
> list "inactive list" and the frequently used list "active list".
> 
> The tricky part of this model is finding the right balance between
> them.  A big inactive list may not leave enough room for the active
> list to protect all the frequently used pages.  A big active list may
> not leave enough room for the inactive list for a new set of
> frequently used pages, "working set", to establish itself because the
> young pages get pushed out of memory before having a chance to get
> promoted.
> 
> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list.  This model gave established
> working sets more gracetime in the face of temporary use once streams,
> but was not satisfactory when use once streaming persisted over longer
> periods of time and the established working set was temporarily
> suspended, like a nightly backup evicting all the interactive user
> program data.
> 
> Subsequently, the rules were changed to only age active pages when
> they exceeded the amount of inactive pages, i.e. leave the working set
> alone as long as the other half of memory is easy to reclaim use once
> pages.  This works well until working set transitions exceed the size
> of half of memory and the average access distance between the pages of
> the new working set is bigger than the inactive list.  The VM will
> mistake the thrashing new working set for use once streaming, while
> the unused old working set pages are stuck on the active list.
> 
> This patch solves this problem by maintaining a history of recently
> evicted file pages, which in turn allows the VM to tell used-once page
> streams from thrashing file cache.
> 
> To accomplish this, a per-zone counter is increased every time a page
> is evicted and a snapshot of that counter is stored as shadow entry in
> the page's now empty page cache radix tree slot.  Upon refault of that
> page, the difference between the current value of that counter and the
> shadow entry value is called the refault distance.  It tells how many
> pages have been evicted from the zone since that page's eviction,
> which is how many page slots at most are missing from the zone's
> inactive list for this page to get accessed twice while in memory.  If
> the number of missing slots is less than or equal to the number of
> active pages, increasing the inactive list at the cost of the active
> list would give this thrashing set a chance to establish itself:
> 
> eviction counter = 4
>                         evicted      inactive           active
>  Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
>   Shadow entries:         0 1 2 3
> Refault distance:         4 3 2 1
> 
> When c is faulted back into memory, it is noted that at most two more
> page slots on the inactive list could have prevented the refault (it
> could be less if c is used out of order).  Thus, the active list needs
> to be challenged as it is possible that c is used more frequently than
> l, m, n.  However, there is no access frequency information available
> on active pages so the pages have to be put in direct competition with
> each other before deciding which one to keep.  Thus, 1) pages can not
> be directly reclaimed from the tail of the active list and 2)
> refaulting pages can not be directly activated.  Instead, active pages
> are moved from the tail of the active list to the head of the inactive
> list and placed directly next to the refaulting pages.  This way, they
> both have the same time on the inactive list to prove which page is
> actually used more frequently without incurring unnecessary major
> faults or diluting the active page set in case the previously active
> page is in fact the more frequently used one.
> 
> Also, since the refault of c could have been due to a spurious access,
> only one active page per qualifying refault is challenged.  This will
> keep the impact of outliers low but still detect if bigger groups of
> pages are refaulting.
> 
> ...
>
> +/*
> + *		Double CLOCK lists
> + *
> + * Per zone, two clock lists are maintained for file pages: the
> + * inactive and the active list.  Freshly faulted pages start out at
> + * the head of the inactive list and page reclaim scans pages from the
> + * tail.  Pages that are accessed multiple times on the inactive list
> + * are promoted to the active list, to protect them from reclaim,
> + * whereas active pages are demoted to the inactive list when the
> + * inactive list requires more space to detect repeatedly accessed
> + * pages in the current workload and prevent them from thrashing:
> + *
> + *   fault -----------------------+
> + *                                |
> + *              +-------------+   |            +-------------+
> + *   reclaim <- | inactive    | <-+-- demotion | active      | <--+
> + *              +-------------+                +-------------+    |
> + *                       |                                        |
> + *                       +----------- promotion ------------------+
> + *
> + *
> + *		Access frequency and refault distance
> + *
> + * A workload is thrashing when the distances between the first and
> + * second access of pages that are frequently used is bigger than the
> + * current inactive clock list size, as the pages get reclaimed before
> + * the second access would have promoted them instead:
> + *
> + *    Access #: 1 2 3 4 5 6 7 8 9
> + *     Page ID: x y b c d e f x y
> + *                  | inactive  |
> + *
> + * To prevent this workload from thrashing, a bigger inactive list is
> + * required.  And the only way the inactive list can grow on a full
> + * zone is by taking away space from the corresponding active list.
> + *
> + *      +-inactive--+-active------+
> + *  x y | b c d e f | G H I J K L |
> + *      +-----------+-------------+
> + *
> + * Not every refault should lead to growing the inactive list at the
> + * cost of the active list, however: if the access distances are
> + * bigger than available memory overall, there is little point in
> + * challenging the protected pages on the active list, as those
> + * refaulting pages will not fit completely into memory.
> + *
> + * It is prohibitively expensive to track the access frequency of
> + * in-core pages, but it is possible to track their refault distance,
> + * which is the number of page slots shrunk from the inactive list
> + * between a page's eviction and subsequent refault.  This indicates
> + * how many page slots are missing on the inactive list in order to
> + * prevent future thrashing of that page.  Thus, instead of comparing
> + * access frequency to total available memory, one can compare the
> + * refault distance to the inactive list's potential for growth: the
> + * size of the active list.
> + *
> + *
> + *		Rebalancing the lists
> + *
> + * Shrinking the active list has to be done carefully because the
> + * pages on it may have vastly different access frequencies compared
> + * to the pages on the inactive list.  Thus, pages are not reclaimed
> + * directly from the tail of the active list, but instead moved to the
> + * head of the inactive list.  This way, they are competing directly
> + * with the pages that challenged their protected status.  If they are
> + * unused, they will eventually be reclaimed, but if they are indeed
> + * used more frequently than the challenging inactive pages, they will
> + * be reactivated.  This allows the existing protected set to be
> + * challenged without incurring major faults in case of a mistake.
> + */

The consequences of a 32-bit wraparound of the refault distance still
concern me.  It's a rare occurrence and it is difficult to determine
what will happen.  An explicit design-level description here would be
useful.
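
For reference, the design-level note being asked for might read roughly
like this; a sketch distilled from the explanation Johannes gives later
in this thread, not text from the patch:

/*
 *		Counter wraparound
 *
 * workingset_time is unsigned and the refault distance is an unsigned
 * subtraction, so a wrap of the counter between eviction and refault
 * still yields the correct distance.
 *
 * A shadow entry is only misjudged if the counter laps its stored
 * timestamp completely while the entry sits unused.  Shadow entries
 * are shrunk once they exceed a threshold tied to
 * global_dirtyable_memory(), so this is unlikely; the worst case is
 * one spurious deactivation per lapped entry, no worse than the VM's
 * historical behaviour of deactivating some active pages on every
 * reclaim run.
 */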

> +static void *pack_shadow(unsigned long time, struct zone *zone)
> +{
> +	time = (time << NODES_SHIFT) | zone_to_nid(zone);
> +	time = (time << ZONES_SHIFT) | zone_idx(zone);
> +	time = (time << RADIX_TREE_EXCEPTIONAL_SHIFT);
> +
> +	return (void *)(time | RADIX_TREE_EXCEPTIONAL_ENTRY);
> +}

"time" is normally in jiffies ;)

Some description of the underlying units of workingset_time would be
helpful.
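
The matching unpack_shadow() is not quoted in this thread.  Purely to
illustrate how the packing above would be reversed, and what the
resulting distance is measured in, a hypothetical reconstruction (not
the patch's code) could look like:

/* Hypothetical reconstruction of unpack_shadow(), for illustration */
static void unpack_shadow(void *shadow, struct zone **zone,
			  unsigned long *distance)
{
	unsigned long entry = (unsigned long)shadow;
	unsigned long eviction, refault, mask;
	int zid, nid;

	/* strip the exceptional-entry tag bits added by pack_shadow() */
	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
	/* peel off zone index and node id in reverse packing order */
	zid = entry & ((1UL << ZONES_SHIFT) - 1);
	entry >>= ZONES_SHIFT;
	nid = entry & ((1UL << NODES_SHIFT) - 1);
	entry >>= NODES_SHIFT;
	eviction = entry;

	*zone = NODE_DATA(nid)->node_zones + zid;

	/*
	 * Both timestamps are truncated to the bits that survive the
	 * packing; the unsigned, masked subtraction keeps the distance
	 * correct across counter wraparound.
	 */
	refault = atomic_long_read(&(*zone)->workingset_time);
	mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
			RADIX_TREE_EXCEPTIONAL_SHIFT);
	*distance = (refault - eviction) & mask;
}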

> 
> ...
>
> +/**
> + * workingset_eviction - note the eviction of a page from memory
> + * @mapping: address space the page was backing
> + * @page: the page being evicted
> + *
> + * Returns a shadow entry to be stored in @mapping->page_tree in place
> + * of the evicted @page so that a later refault can be detected.  Or
> + * %NULL when the eviction should not be remembered.
> + */

Describe the locking requirements here.  It's part of the interface. 
And it's the part the compiler can't check, so extra care is needed.
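
One possible shape for that is a Context: line in the kernel-doc.  The
locking assumption below (caller holds mapping->tree_lock while it
replaces the page slot) is inferred from the AS_EXITING discussion
later in this thread, not stated by the patch; sketch only:

/**
 * workingset_eviction - note the eviction of a page from memory
 * @mapping: address space the page was backing
 * @page: the page being evicted
 *
 * Returns a shadow entry to be stored in @mapping->page_tree in place
 * of the evicted @page so that a later refault can be detected.  Or
 * %NULL when the eviction should not be remembered.
 *
 * Context: caller must hold @mapping->tree_lock, so that the
 * mapping_exiting() check and the planting of the returned shadow
 * entry are atomic with respect to the final truncate in inode
 * eviction.
 */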

> +void *workingset_eviction(struct address_space *mapping, struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +	unsigned long time;
> +
> +	time = atomic_long_inc_return(&zone->workingset_time);
> +
> +	/*
> +	 * Don't store shadows in an inode that is being reclaimed.
> +	 * This is not just an optimization, inode reclaim needs to
> +	 * empty out the radix tree or the nodes are lost, so don't
> +	 * plant shadows behind its back.
> +	 */
> +	if (mapping_exiting(mapping))
> +		return NULL;
> +
> +	return pack_shadow(time, zone);
> +}
> +
> +/**
> + * workingset_refault - note the refault of a previously evicted page
> + * @shadow: shadow entry of the evicted page
> + *
> + * Calculates and evaluates the refault distance of the previously
> + * evicted page in the context of the zone it was allocated in.
> + *
> + * This primes page reclaim to rebalance the zone's file lists if
> + * necessary, so it must be called before a page frame for the
> + * refaulting page is allocated.
> + */
> +void workingset_refault(void *shadow)
> +{
> +	unsigned long refault_distance;

Is the "refault distance" described somewhere?  What relationship (if
any) does it have with workingset_time?

> +	struct zone *zone;
> +
> +	unpack_shadow(shadow, &zone, &refault_distance);
> +
> +	inc_zone_state(zone, WORKINGSET_REFAULT);
> +
> +	/*
> +	 * Protected pages should be challenged when the refault
> +	 * distance indicates that thrashing could be stopped by
> +	 * increasing the inactive list at the cost of the active
> +	 * list.
> +	 */
> +	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> +		inc_zone_state(zone, WORKINGSET_STALE);
> +		zone->shrink_active++;
> +	}
> +}
> +EXPORT_SYMBOL(workingset_refault);
> +
> +/**
> + * workingset_activation - note a page activation
> + * @page: page that is being activated
> + */
> +void workingset_activation(struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	/*
> +	 * The lists are rebalanced when the inactive list is observed
> +	 * to be too small for activations.  An activation means that
> +	 * the inactive list is now big enough again for at least one
> +	 * page, so back off further deactivation.
> +	 */
> +	atomic_long_inc(&zone->workingset_time);
> +	if (zone->shrink_active > 0)
> +		zone->shrink_active--;
> +}

Strange mixture of exports and non-exports.  I assume you went with
"enough to make it build".


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 9/9] mm: workingset: keep shadow entries in check
  2013-08-17 19:31   ` Johannes Weiner
@ 2013-08-20 20:59     ` Andrew Morton
  -1 siblings, 0 replies; 51+ messages in thread
From: Andrew Morton @ 2013-08-20 20:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Sat, 17 Aug 2013 15:31:23 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.

erk.  This whole patch is overhead :(

> To get this under control, a list of inodes that contain shadow
> entries is maintained.  If the global number of shadows exceeds a
> certain threshold, a shrinker is activated that reclaims old entries
> from the mappings.  This is heavy-handed but it should not be a hot
> path and is mainly there to protect from accidentally/maliciously
> induced OOM kills.  The global list is also not a problem because the
> modifications are very rare: inodes are added once in their lifetime
> when the first shadow entry is stored (i.e. the first page reclaimed)
> and lazily removed when the inode exits.  Or if the shrinker removes
> all shadow entries.
> 
> ...
>
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -417,6 +417,7 @@ struct address_space {
>  	/* Protected by tree_lock together with the radix tree */
>  	unsigned long		nrpages;	/* number of total pages */
>  	unsigned long		nrshadows;	/* number of shadow entries */
> +	struct list_head	shadow_list;	/* list of mappings with shadows */
>  	pgoff_t			writeback_index;/* writeback starts here */
>  	const struct address_space_operations *a_ops;	/* methods */
>  	unsigned long		flags;		/* error bits/gfp mask */

There's another 16 bytes into the inode.  Bad.

>
> ...
>
> +void workingset_shadows_inc(struct address_space *mapping)
> +{
> +	might_lock(&shadow_lock);
> +
> +	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
> +		spin_lock(&shadow_lock);

I can't work out whether or not shadow_lock is supposed to be irq-safe.
Some places it is, others are unobvious.
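
If the answer is that shadow_lock does have to be irq-safe, for example
because workingset_shadows_inc() can run with mapping->tree_lock held
and interrupts disabled, the function above would presumably grow the
irqsave variants.  A sketch of that option only, not a statement about
the patch's intent:

void workingset_shadows_inc(struct address_space *mapping)
{
	unsigned long flags;

	might_lock(&shadow_lock);

	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
		/* save/restore irq state so any calling context is safe */
		spin_lock_irqsave(&shadow_lock, flags);
		list_add(&mapping->shadow_list, &shadow_mappings);
		spin_unlock_irqrestore(&shadow_lock, flags);
	}

	mapping->nrshadows++;
	this_cpu_inc(nr_shadows);
}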

> +		list_add(&mapping->shadow_list, &shadow_mappings);
> +		spin_unlock(&shadow_lock);
> +	}
> +
> +	mapping->nrshadows++;
> +	this_cpu_inc(nr_shadows);
> +}
> +
>
> ...
>
> +static unsigned long get_nr_old_shadows(void)
> +{
> +	unsigned long nr_max;
> +	unsigned long nr;
> +	long sum = 0;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		sum += per_cpu(nr_shadows, cpu);

Ouch, slow.  shrink_slab() will call this repeatedly and scan_shadows()
calls it from a loop.  Can we use something non-deathly-slow here? 
Like percpu_counter_read_positive()?
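
For illustration, the conversion being hinted at could look roughly
like the sketch below.  It reuses the patch's names but is not part of
the series; the counter would be set up once with percpu_counter_init()
and bumped with percpu_counter_inc()/percpu_counter_dec() where the
per-cpu variable is updated today, and global_dirtyable_memory() is
whatever the patch already uses:

#include <linux/percpu_counter.h>

static struct percpu_counter nr_shadows;

static unsigned long get_nr_old_shadows(void)
{
	unsigned long nr_max;
	unsigned long nr;

	/* approximate global count, no walk over all possible CPUs */
	nr = percpu_counter_read_positive(&nr_shadows);

	/* generous ceiling, as in the patch: shrinking shadows is expensive */
	nr_max = global_dirtyable_memory() * 4;

	if (nr <= nr_max)
		return 0;
	return nr - nr_max;
}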

> +	nr = max(sum, 0L);
> +
> +	/*
> +	 * Every shadow entry with a refault distance bigger than the
> +	 * active list is ignored and so NR_ACTIVE_FILE would be a
> +	 * reasonable ceiling.  But scanning and shrinking shadow
> +	 * entries is quite expensive, so be generous.
> +	 */
> +	nr_max = global_dirtyable_memory() * 4;
> +
> +	if (nr <= nr_max)
> +		return 0;
> +	return nr - nr_max;
> +}
> +
> +static unsigned long scan_mapping(struct address_space *mapping,
> +				  unsigned long nr_to_scan)

Some methodological description would be useful.
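
Derived only from the code that follows, a header comment along these
lines might be the kind of description being asked for (a sketch):

/*
 * scan_mapping - shrink old shadow entries in one address space
 *
 * Walk the mapping's radix tree under RCU, looking only at exceptional
 * (shadow) entries.  Entries whose refault distance already exceeds
 * the zone's active list can never trigger a rebalance, so they are
 * deleted, taking the tree lock briefly for each deletion.  The walk
 * stops after nr_to_scan deletion attempts, or as soon as the mapping
 * has no shadow entries left, and returns the number of entries
 * actually removed.
 */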

> +{
> +	unsigned long nr_scanned = 0;
> +	struct radix_tree_iter iter;
> +	void **slot;
> +
> +	rcu_read_lock();
> +restart:
> +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
> +		unsigned long nrshadows;
> +		unsigned long distance;
> +		struct zone *zone;
> +		struct page *page;
> +
> +		page = radix_tree_deref_slot(slot);
> +		if (unlikely(!page))
> +			continue;
> +		if (!radix_tree_exception(page))
> +			continue;
> +		if (radix_tree_deref_retry(page))
> +			goto restart;
> +
> +		unpack_shadow(page, &zone, &distance);
> +
> +		if (distance <= zone_page_state(zone, NR_ACTIVE_FILE))
> +			continue;
> +
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (radix_tree_delete_item(&mapping->page_tree,
> +					   iter.index, page)) {
> +			inc_zone_state(zone, WORKINGSET_SHADOWS_RECLAIMED);
> +			workingset_shadows_dec(mapping);
> +			nr_scanned++;
> +		}
> +		nrshadows = mapping->nrshadows;
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		if (nrshadows == 0)
> +			break;
> +
> +		if (--nr_to_scan == 0)
> +			break;
> +	}
> +	rcu_read_unlock();
> +
> +	return nr_scanned;
> +}
> +
>
> ...
>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 9/9] mm: thrash detection-based file cache sizing v4
  2013-08-17 19:31 ` Johannes Weiner
@ 2013-08-20 21:04   ` Andrew Morton
  -1 siblings, 0 replies; 51+ messages in thread
From: Andrew Morton @ 2013-08-20 21:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Sat, 17 Aug 2013 15:31:14 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> This series solves the problem by maintaining a history of pages
> evicted from the inactive list, enabling the VM to tell streaming IO
> from thrashing and rebalance the page cache lists when appropriate.

I can't say I'm loving the patchset.  It adds significant bloat to the
inode (of all things!), seems to add some runtime overhead and
certainly adds boatloads of complexity.

In return for which we get...  well, I don't know what we get - no data
was included.  It had better be good!

To aid in this decision, please go through the patchset and calculate
and itemize the overhead: increased inode size, increased radix-tree
consumption, lengthier code paths, anything else I missed.  Others can
make their own judgements regarding complexity increase.

Then please carefully describe the benefits, then see if you can
convince us that one is worth the other!


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 4/9] mm + fs: prepare for non-page entries in page cache radix trees
  2013-08-20 20:59     ` Andrew Morton
@ 2013-08-22  7:20       ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-22  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Tue, Aug 20, 2013 at 01:59:10PM -0700, Andrew Morton wrote:
> On Sat, 17 Aug 2013 15:31:18 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > shmem mappings already contain exceptional entries where swap slot
> > information is remembered.
> > 
> > To be able to store eviction information for regular page cache,
> > prepare every site dealing with the radix trees directly to handle
> > entries other than pages.
> > 
> > The common lookup functions will filter out non-page entries and
> > return NULL for page cache holes, just as before.  But provide a raw
> > version of the API which returns non-page entries as well, and switch
> > shmem over to use it.
> > 
> >
> > ...
> >
> > -/**
> > - * find_get_page - find and get a page reference
> > - * @mapping: the address_space to search
> > - * @offset: the page index
> > - *
> > - * Is there a pagecache struct page at the given (mapping, offset) tuple?
> > - * If yes, increment its refcount and return it; if no, return NULL.
> > - */
> > -struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
> > +struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
> >  {
> >  	void **pagep;
> >  	struct page *page;
> > @@ -812,24 +828,31 @@ out:
> >  
> >  	return page;
> >  }
> > -EXPORT_SYMBOL(find_get_page);
> > +EXPORT_SYMBOL(__find_get_page);
> 
> Deleting the interface documentation for a global, exported-to-modules
> function was a bit rude.
> 
> And it does need documentation, to tell people that it can return the
> non-pages.

I didn't really delete the documentation so much as move it to the new
find_get_page() definition (the above is a rename).  But yeah, I
should probably add some documentation to the new function as well.

> Does it have the same handling of non-pages as __find_get_pages()?  It
> had better, given the naming!

Yes, the only difference is single vs. multi lookup.  The underscore
versions may return non-pages, the traditional interface filters them.
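
In other words, the traditional name presumably becomes a thin
filtering wrapper over the raw lookup, roughly like the sketch below;
the exact helper used to detect exceptional entries here is an
assumption, not taken from the patch:

/**
 * find_get_page - find and get a page reference
 * @mapping: the address_space to search
 * @offset: the page index
 *
 * Is there a pagecache struct page at the given (mapping, offset)
 * tuple?  If yes, increment its refcount and return it; if no, or if
 * the slot holds only a shadow/swap entry, return NULL.
 */
struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
{
	struct page *page = __find_get_page(mapping, offset);

	/* hide exceptional entries from traditional callers */
	if (radix_tree_exceptional_entry(page))
		return NULL;
	return page;
}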

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-20 20:59     ` Andrew Morton
@ 2013-08-22  8:45       ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-22  8:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Tue, Aug 20, 2013 at 01:59:20PM -0700, Andrew Morton wrote:
> On Sat, 17 Aug 2013 15:31:22 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > The VM maintains cached filesystem pages on two types of lists.  One
> > list holds the pages recently faulted into the cache, the other list
> > holds pages that have been referenced repeatedly on that first list.
> > The idea is to prefer reclaiming young pages over those that have
> > shown to benefit from caching in the past.  We call the recently used
> > list "inactive list" and the frequently used list "active list".
> > 
> > The tricky part of this model is finding the right balance between
> > them.  A big inactive list may not leave enough room for the active
> > list to protect all the frequently used pages.  A big active list may
> > not leave enough room for the inactive list for a new set of
> > frequently used pages, "working set", to establish itself because the
> > young pages get pushed out of memory before having a chance to get
> > promoted.
> > 
> > Historically, every reclaim scan of the inactive list also took a
> > smaller number of pages from the tail of the active list and moved
> > them to the head of the inactive list.  This model gave established
> > working sets more gracetime in the face of temporary use once streams,
> > but was not satisfactory when use once streaming persisted over longer
> > periods of time and the established working set was temporarily
> > suspended, like a nightly backup evicting all the interactive user
> > program data.
> > 
> > Subsequently, the rules were changed to only age active pages when
> > they exceeded the amount of inactive pages, i.e. leave the working set
> > alone as long as the other half of memory is easy to reclaim use once
> > pages.  This works well until working set transitions exceed the size
> > of half of memory and the average access distance between the pages of
> > the new working set is bigger than the inactive list.  The VM will
> > mistake the thrashing new working set for use once streaming, while
> > the unused old working set pages are stuck on the active list.
> > 
> > This patch solves this problem by maintaining a history of recently
> > evicted file pages, which in turn allows the VM to tell used-once page
> > streams from thrashing file cache.
> > 
> > To accomplish this, a per-zone counter is increased every time a page
> > is evicted and a snapshot of that counter is stored as shadow entry in
> > the page's now empty page cache radix tree slot.  Upon refault of that
> > page, the difference between the current value of that counter and the
> > shadow entry value is called the refault distance.  It tells how many
> > pages have been evicted from the zone since that page's eviction,
> > which is how many page slots at most are missing from the zone's
> > inactive list for this page to get accessed twice while in memory.  If
> > the number of missing slots is less than or equal to the number of
> > active pages, increasing the inactive list at the cost of the active
> > list would give this thrashing set a chance to establish itself:
> > 
> > eviction counter = 4
> >                         evicted      inactive           active
> >  Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
> >   Shadow entries:         0 1 2 3
> > Refault distance:         4 3 2 1
> > 
> > When c is faulted back into memory, it is noted that at most two more
> > page slots on the inactive list could have prevented the refault (it
> > could be less if c is used out of order).  Thus, the active list needs
> > to be challenged as it is possible that c is used more frequently than
> > l, m, n.  However, there is no access frequency information available
> > on active pages so the pages have to be put in direct competition with
> > each other before deciding which one to keep.  Thus, 1) pages can not
> > be directly reclaimed from the tail of the active list and 2)
> > refaulting pages can not be directly activated.  Instead, active pages
> > are moved from the tail of the active list to the head of the inactive
> > list and placed directly next to the refaulting pages.  This way, they
> > both have the same time on the inactive list to prove which page is
> > actually used more frequently without incurring unnecessary major
> > faults or diluting the active page set in case the previously active
> > page is in fact the more frequently used one.
> > 
> > Also, since the refault of c could have been due to a spurious access,
> > only one active page per qualifying refault is challenged.  This will
> > keep the impact of outliers low but still detect if bigger groups of
> > pages are refaulting.
> > 
> > ...
> >
> > +/*
> > + *		Double CLOCK lists
> > + *
> > + * Per zone, two clock lists are maintained for file pages: the
> > + * inactive and the active list.  Freshly faulted pages start out at
> > + * the head of the inactive list and page reclaim scans pages from the
> > + * tail.  Pages that are accessed multiple times on the inactive list
> > + * are promoted to the active list, to protect them from reclaim,
> > + * whereas active pages are demoted to the inactive list when the
> > + * inactive list requires more space to detect repeatedly accessed
> > + * pages in the current workload and prevent them from thrashing:
> > + *
> > + *   fault -----------------------+
> > + *                                |
> > + *              +-------------+   |            +-------------+
> > + *   reclaim <- | inactive    | <-+-- demotion | active      | <--+
> > + *              +-------------+                +-------------+    |
> > + *                       |                                        |
> > + *                       +----------- promotion ------------------+
> > + *
> > + *
> > + *		Access frequency and refault distance
> > + *
> > + * A workload is thrashing when the distances between the first and
> > + * second access of pages that are frequently used is bigger than the
> > + * current inactive clock list size, as the pages get reclaimed before
> > + * the second access would have promoted them instead:
> > + *
> > + *    Access #: 1 2 3 4 5 6 7 8 9
> > + *     Page ID: x y b c d e f x y
> > + *                  | inactive  |
> > + *
> > + * To prevent this workload from thrashing, a bigger inactive list is
> > + * required.  And the only way the inactive list can grow on a full
> > + * zone is by taking away space from the corresponding active list.
> > + *
> > + *      +-inactive--+-active------+
> > + *  x y | b c d e f | G H I J K L |
> > + *      +-----------+-------------+
> > + *
> > + * Not every refault should lead to growing the inactive list at the
> > + * cost of the active list, however: if the access distances are
> > + * bigger than available memory overall, there is little point in
> > + * challenging the protected pages on the active list, as those
> > + * refaulting pages will not fit completely into memory.
> > + *
> > + * It is prohibitively expensive to track the access frequency of
> > + * in-core pages, but it is possible to track their refault distance,
> > + * which is the number of page slots shrunk from the inactive list
> > + * between a page's eviction and subsequent refault.  This indicates
> > + * how many page slots are missing on the inactive list in order to
> > + * prevent future thrashing of that page.  Thus, instead of comparing
> > + * access frequency to total available memory, one can compare the
> > + * refault distance to the inactive list's potential for growth: the
> > + * size of the active list.
> > + *
> > + *
> > + *		Rebalancing the lists
> > + *
> > + * Shrinking the active list has to be done carefully because the
> > + * pages on it may have vastly different access frequencies compared
> > + * to the pages on the inactive list.  Thus, pages are not reclaimed
> > + * directly from the tail of the active list, but instead moved to the
> > + * head of the inactive list.  This way, they are competing directly
> > + * with the pages that challenged their protected status.  If they are
> > + * unused, they will eventually be reclaimed, but if they are indeed
> > + * used more frequently than the challenging inactive pages, they will
> > + * be reactivated.  This allows the existing protected set to be
> > + * challenged without incurring major faults in case of a mistake.
> > + */
> 
> The consequences of a 32-bit wraparound of the refault distance still
> concern me.  It's a rare occurrence and it is difficult to determine
> what will happen.  An explicit design-level description here would be
> useful.

I replied to a previous question about the wraparound issue; was that
useful at all?  Quote:

"The distance between two time stamps is an unsigned subtraction, so
 it's accurate even when the counter has wrapped between them.

 The per-zone counter lapping shadow entries is possible but not very
 likely because the shadow pages are reclaimed when more than
 2*global_dirtyable_memory() of them exist.  And usually they are
 refaulted or reclaimed along with the inode before that happens.

 There is an unlikely case where some shadow entries make it into an
 inode and then that same inode is evicting and refaulting pages in
 another area, which increases the counter while not producing an
 excess of shadow entries.  Should the counter lap these inactive
 shadow entries, the worst case is that a refault will incorrectly
 interpret them as recently evicted and deactivate a page for every
 such entry.  Which would at worst be a "regression" to how the code
 was for a long time, where every reclaim run also always deactivated
 some pages."

I'll expand the documentation on this either way; I just want to gauge
if this addresses your concerns.
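
To make the unsigned subtraction concrete, here is a minimal standalone
demo with made-up numbers; uint32_t models the 32-bit case in question,
while the kernel counter itself is an unsigned long:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Snapshot stored in the shadow entry shortly before the wrap: */
	uint32_t eviction = UINT32_MAX - 5;
	/* Counter value at refault time, after wrapping past zero: */
	uint32_t now = 10;

	/* Unsigned subtraction is modulo 2^32, so the delta is intact: */
	uint32_t refault_distance = now - eviction;

	printf("refault distance = %u\n", (unsigned)refault_distance); /* 16 */
	return 0;
}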

> > +static void *pack_shadow(unsigned long time, struct zone *zone)
> > +{
> > +	time = (time << NODES_SHIFT) | zone_to_nid(zone);
> > +	time = (time << ZONES_SHIFT) | zone_idx(zone);
> > +	time = (time << RADIX_TREE_EXCEPTIONAL_SHIFT);
> > +
> > +	return (void *)(time | RADIX_TREE_EXCEPTIONAL_ENTRY);
> > +}
> 
> "time" is normally in jiffies ;)
> 
> Some description of the underlying units of workingset_time would be
> helpful.

Humm, yes, "time" might be misleading.

The unit is pages and the clock is advanced every time the inactive
list shrinks (eviction or activation).

The refault distance (a delta of workingset time) describes the
inactive list space deficit, which increases as the inactive list
shrinks.

I'll try to expand on this better in the documentation.
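
As an illustration of the packing scheme, here is a standalone sketch
with an inferred inverse; the DEMO_* constants are made-up stand-ins
for the kernel's NODES_SHIFT, ZONES_SHIFT and RADIX_TREE_EXCEPTIONAL_*
macros, and the real unpack_shadow() may differ in detail:

#include <stdio.h>

#define DEMO_NODES_SHIFT	6	/* stand-in, config dependent */
#define DEMO_ZONES_SHIFT	2	/* stand-in, config dependent */
#define DEMO_EXCEPTIONAL_SHIFT	2
#define DEMO_EXCEPTIONAL_ENTRY	2UL

static void *demo_pack_shadow(unsigned long time, int nid, int zid)
{
	time = (time << DEMO_NODES_SHIFT) | nid;
	time = (time << DEMO_ZONES_SHIFT) | zid;
	time = (time << DEMO_EXCEPTIONAL_SHIFT);
	return (void *)(time | DEMO_EXCEPTIONAL_ENTRY);
}

static void demo_unpack_shadow(void *shadow, unsigned long *time,
			       int *nid, int *zid)
{
	unsigned long val = (unsigned long)shadow >> DEMO_EXCEPTIONAL_SHIFT;

	*zid = val & ((1UL << DEMO_ZONES_SHIFT) - 1);
	val >>= DEMO_ZONES_SHIFT;
	*nid = val & ((1UL << DEMO_NODES_SHIFT) - 1);
	*time = val >> DEMO_NODES_SHIFT;
}

int main(void)
{
	unsigned long eviction, now = 1000;
	int nid, zid;
	void *shadow = demo_pack_shadow(900, 1, 2);

	demo_unpack_shadow(shadow, &eviction, &nid, &zid);
	printf("nid=%d zid=%d refault distance=%lu\n",
	       nid, zid, now - eviction);	/* nid=1 zid=2 distance=100 */
	return 0;
}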

> > +/**
> > + * workingset_eviction - note the eviction of a page from memory
> > + * @mapping: address space the page was backing
> > + * @page: the page being evicted
> > + *
> > + * Returns a shadow entry to be stored in @mapping->page_tree in place
> > + * of the evicted @page so that a later refault can be detected.  Or
> > + * %NULL when the eviction should not be remembered.
> > + */
> 
> Describe the locking requirements here.  It's part of the interface. 
> And it's the part the compiler can't check, so extra care is needed.
> 
> > +void *workingset_eviction(struct address_space *mapping, struct page *page)
> > +{
> > +	struct zone *zone = page_zone(page);
> > +	unsigned long time;
> > +
> > +	time = atomic_long_inc_return(&zone->workingset_time);
> > +
> > +	/*
> > +	 * Don't store shadows in an inode that is being reclaimed.
> > +	 * This is not just an optimization, inode reclaim needs to
> > +	 * empty out the radix tree or the nodes are lost, so don't
> > +	 * plant shadows behind its back.
> > +	 */
> > +	if (mapping_exiting(mapping))
> > +		return NULL;

This needs to be ordered against inode eviction such that reclaim
never sneaks a shadow entry into the tree after eviction does the
final truncate.

  mapping_set_exiting(mapping)
  truncate_inode_pages(mapping)

  vs.

  mapping_exiting(mapping)
  page_cache_insert(mapping, shadow)

You already mentioned that the tree lock around mapping_set_exiting()
might be pointless.

I'll give this another pass when I'm less jetlagged and then either
document the locking requirements (tree lock) or remove it entirely.
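
For the record, a rough sketch of the ordering the tree lock would
provide, reusing the pseudo-calls from above (not the final code; it
also assumes the truncate path clears any shadow entries it finds):

/* inode eviction side: */
spin_lock_irq(&mapping->tree_lock);
mapping_set_exiting(mapping);
spin_unlock_irq(&mapping->tree_lock);
truncate_inode_pages(mapping, 0);	/* empties the tree under tree_lock */

/* page reclaim side: */
spin_lock_irq(&mapping->tree_lock);
if (!mapping_exiting(mapping))
	page_cache_insert(mapping, shadow);
spin_unlock_irq(&mapping->tree_lock);

Either the shadow goes in before the flag is set and the subsequent
truncate sweeps it out, or the reclaim side observes the flag and bails.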

> > +
> > +	return pack_shadow(time, zone);
> > +}
> > +
> > +/**
> > + * workingset_refault - note the refault of a previously evicted page
> > + * @shadow: shadow entry of the evicted page
> > + *
> > + * Calculates and evaluates the refault distance of the previously
> > + * evicted page in the context of the zone it was allocated in.
> > + *
> > + * This primes page reclaim to rebalance the zone's file lists if
> > + * necessary, so it must be called before a page frame for the
> > + * refaulting page is allocated.
> > + */
> > +void workingset_refault(void *shadow)
> > +{
> > +	unsigned long refault_distance;
> 
> Is the "refault distance" described somewhere?  What relationship (if
> any) does it have with workingset_time?

Ok, I definitely need to extend the documentation on this ;)

It says

"[...] it is possible to track their refault distance, which is the
 number of page slots shrunk from the inactive list between a page's
 eviction and subsequent refault."

So we take a snapshot of the workingset time on eviction.  The clock
advances as the inactive list shrinks.  On refault of a page, the
delta between its eviction timestamp and the current time is the
refault distance.  It describes the deficit in page slots available to
the inactive list that led to the page being evicted between
accesses.

If the deficit is smaller than the active list, it can be closed by
stealing slots from there and giving them to the inactive list.

> > +	struct zone *zone;
> > +
> > +	unpack_shadow(shadow, &zone, &refault_distance);
> > +
> > +	inc_zone_state(zone, WORKINGSET_REFAULT);
> > +
> > +	/*
> > +	 * Protected pages should be challenged when the refault
> > +	 * distance indicates that thrashing could be stopped by
> > +	 * increasing the inactive list at the cost of the active
> > +	 * list.
> > +	 */
> > +	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> > +		inc_zone_state(zone, WORKINGSET_STALE);
> > +		zone->shrink_active++;
> > +	}
> > +}
> > +EXPORT_SYMBOL(workingset_refault);
> > +
> > +/**
> > + * workingset_activation - note a page activation
> > + * @page: page that is being activated
> > + */
> > +void workingset_activation(struct page *page)
> > +{
> > +	struct zone *zone = page_zone(page);
> > +
> > +	/*
> > +	 * The lists are rebalanced when the inactive list is observed
> > +	 * to be too small for activations.  An activation means that
> > +	 * the inactive list is now big enough again for at least one
> > +	 * page, so back off further deactivation.
> > +	 */
> > +	atomic_long_inc(&zone->workingset_time);
> > +	if (zone->shrink_active > 0)
> > +		zone->shrink_active--;
> > +}
> 
> Strange mixture of exports and non-exports.  I assume you went with
> "enough to make it build".

All these callbacks should be internal to the core VM but
page_cache_alloc(), which calls workingset_refault(), can be an inline
function used by modular filesystem code.

I should probably just make it a real function and remove the export.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 9/9] mm: thrash detection-based file cache sizing v4
  2013-08-20 21:04   ` Andrew Morton
  (?)
@ 2013-08-22  9:08   ` Metin Doslu
  -1 siblings, 0 replies; 51+ messages in thread
From: Metin Doslu @ 2013-08-22  9:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Vlastimil Babka, linux-mm, linux-fsdevel, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2439 bytes --]

Hey everybody,

I ran the following tests, and they show in which cases this patch is
beneficial for us.

Test Environment:

* Ubuntu Server 12.04.2 LTS Linux 3.2.0-40-virtual #64-Ubuntu on EC2.
* 15 GB memory (DMA32 4GB + Normal 11GB).

Test Settings:

We have two PostgreSQL tables with the same size of 9.75GB (65% of total
memory), where these tables contain clickstream events for March and April.
We call these two tables "events_march" and "events_april" respectively.

Problem (Before Patch is Applied):

I pass over events_march data twice with an example query, such as "select
count(*) from events_march". This activates all of events_march's pages.

I then pass over events_april dozens of times with a similar query. No
matter how many times I query events_april, I can't completely get this
table's pages into memory. This happens even when events_march isn't
touched at all, events_april easily fits into memory, and events_april has
been referenced dozens of times.

After Patch is Applied:

This time, after three passes over events_april, all the pages are cached
in memory. (4th access is completely served from memory.)

I also repeated this test with a bigger dataset of size 12GB (80% of total
memory) for both events_march and events_april, and observed the same
results. (After the 3rd pass, all of the pages in events_april are cached.)

Thank you,
Metin


On Wed, Aug 21, 2013 at 12:04 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:

> On Sat, 17 Aug 2013 15:31:14 -0400 Johannes Weiner <hannes@cmpxchg.org>
> wrote:
>
> > This series solves the problem by maintaining a history of pages
> > evicted from the inactive list, enabling the VM to tell streaming IO
> > from thrashing and rebalance the page cache lists when appropriate.
>
> I can't say I'm loving the patchset.  It adds significant bloat to the
> inode (of all things!), seems to add some runtime overhead and
> certainly adds boatloads of complexity.
>
> In return for which we get...  well, I don't know what we get - no data
> was included.  It had better be good!
>
> To aid in this decision, please go through the patchset and calculate
> and itemize the overhead: increased inode size, increased radix-tree
> consumption, lengthier code paths, anything else I missed.  Others can
> make their own judgements regarding complexity increase.
>
> Then please carefully describe the benefits, then see if you can
> convince us that one is worth the other!
>
>

[-- Attachment #2: Type: text/html, Size: 3278 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 9/9] mm: workingset: keep shadow entries in check
  2013-08-20 20:59     ` Andrew Morton
@ 2013-08-22  9:48       ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-22  9:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Andrea Arcangeli, Greg Thelen, Christoph Hellwig,
	Hugh Dickins, Jan Kara, KOSAKI Motohiro, Mel Gorman, Minchan Kim,
	Peter Zijlstra, Rik van Riel, Michel Lespinasse, Seth Jennings,
	Roman Gushchin, Ozgun Erdogan, Metin Doslu, Vlastimil Babka,
	linux-mm, linux-fsdevel, linux-kernel

On Tue, Aug 20, 2013 at 01:59:24PM -0700, Andrew Morton wrote:
> On Sat, 17 Aug 2013 15:31:23 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> 
> erk.  This whole patch is overhead :(
> 
> > To get this under control, a list of inodes that contain shadow
> > entries is maintained.  If the global number of shadows exceeds a
> > certain threshold, a shrinker is activated that reclaims old entries
> > from the mappings.  This is heavy-handed but it should not be a hot
> > path and is mainly there to protect from accidentally/maliciously
> > induced OOM kills.  The global list is also not a problem because the
> > modifications are very rare: inodes are added once in their lifetime
> > when the first shadow entry is stored (i.e. the first page reclaimed)
> > and lazily removed when the inode exits.  Or if the shrinker removes
> > all shadow entries.
> > 
> > ...
> >
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -417,6 +417,7 @@ struct address_space {
> >  	/* Protected by tree_lock together with the radix tree */
> >  	unsigned long		nrpages;	/* number of total pages */
> >  	unsigned long		nrshadows;	/* number of shadow entries */
> > +	struct list_head	shadow_list;	/* list of mappings with shadows */
> >  	pgoff_t			writeback_index;/* writeback starts here */
> >  	const struct address_space_operations *a_ops;	/* methods */
> >  	unsigned long		flags;		/* error bits/gfp mask */
> 
> There's another 16 bytes into the inode.  Bad.

Yeah :(

An obvious alternative to storing page eviction information in the
page cache radix tree would be to have a separate data structure that
scales with the number of physical pages available.

Which one is cheaper, in terms of both memory and cpu, really depends
on the workload.

Workloads where the ratio between number of inodes and inode size is
high suffer from the increased inode size, but when they have decent
access locality within the inodes, the inodes should be evicted along
with their pages.  So in this case there is little to no memory
overhead from the eviction information compared to the fixed size
separate data structure.

And refault information lookup is cheaper of course when storing
eviction information inside the cache slots.

Workloads for which this model sucks are those with inode locality but
no data locality.  The inodes stay alive and the lack of data locality
produces many shadow entries that only the costly shrinker can get rid
of.  Numbers aside, it was a judgement call to improve workloads with
high data locality at the cost of those without.

But I'll include a more concrete cost analysis in the submission that
also includes more concrete details on the benefits of the series ;)

> > +void workingset_shadows_inc(struct address_space *mapping)
> > +{
> > +	might_lock(&shadow_lock);
> > +
> > +	if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
> > +		spin_lock(&shadow_lock);
> 
> I can't work out whether or not shadow_lock is supposed to be irq-save.
> Some places it is, others are unobvious.

It is.  The caller holds the irq-safe tree_lock, though, so no need to
disable IRQs a second time.  I'll add documentation.
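
Roughly, the nesting looks like this; shadow_mappings is a made-up name
for the global list, the rest mirrors the code quoted above:

spin_lock_irq(&mapping->tree_lock);		/* IRQs already disabled */
/* ... */
if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
	spin_lock(&shadow_lock);		/* plain spin_lock is enough */
	list_add(&mapping->shadow_list, &shadow_mappings);
	spin_unlock(&shadow_lock);
}
/* ... nrshadows accounting ... */
spin_unlock_irq(&mapping->tree_lock);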

> > +static unsigned long get_nr_old_shadows(void)
> > +{
> > +	unsigned long nr_max;
> > +	unsigned long nr;
> > +	long sum = 0;
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu)
> > +		sum += per_cpu(nr_shadows, cpu);
> 
> Ouch, slow.  shrink_slab() will call this repeatedly and scan_shadows()
> calls it from a loop.  Can we use something non-deathly-slow here? 
> Like percpu_counter_read_positive()?

Finally, a usecase for percpu_counter!!  Sounds reasonable, I'll
convert this stuff over.
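
A rough sketch of what the conversion might look like, not the actual
follow-up patch (the counter would be set up with percpu_counter_init()
and bumped with percpu_counter_add() on the write side):

static struct percpu_counter nr_shadows;

static unsigned long get_nr_old_shadows(void)
{
	unsigned long nr_max, nr;

	nr = percpu_counter_read_positive(&nr_shadows);

	/* Same generous ceiling as before: */
	nr_max = global_dirtyable_memory() * 4;

	if (nr <= nr_max)
		return 0;
	return nr - nr_max;
}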

> > +	nr = max(sum, 0L);
> > +
> > +	/*
> > +	 * Every shadow entry with a refault distance bigger than the
> > +	 * active list is ignored and so NR_ACTIVE_FILE would be a
> > +	 * reasonable ceiling.  But scanning and shrinking shadow
> > +	 * entries is quite expensive, so be generous.
> > +	 */
> > +	nr_max = global_dirtyable_memory() * 4;
> > +
> > +	if (nr <= nr_max)
> > +		return 0;
> > +	return nr - nr_max;
> > +}
> > +
> > +static unsigned long scan_mapping(struct address_space *mapping,
> > +				  unsigned long nr_to_scan)
> 
> Some methodological description would be useful.

Fair enough, I'll write something.

Thanks!

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 9/9] mm: thrash detection-based file cache sizing v4
  2013-08-20 21:04   ` Andrew Morton
  (?)
  (?)
@ 2013-08-22 11:15   ` Metin Doslu
  -1 siblings, 0 replies; 51+ messages in thread
From: Metin Doslu @ 2013-08-22 11:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

Hey everybody,

I ran the following tests, and they show in which cases this patch is
beneficial for us.

Test Environment:

* Ubuntu Server 12.04.2 LTS Linux 3.2.0-40-virtual #64-Ubuntu on EC2.
* 15 GB memory (DMA32 4GB + Normal 11GB).

Test Settings:

We have two PostgreSQL tables with the same size of 9.75GB (65% of total
memory), where these tables contain clickstream events for March and
April. We call these two tables "events_march" and "events_april"
respectively.

Problem (Before Patch is Applied):

I pass over events_march data twice with an example query, such as
"select count(*) from events_march". This activates all of
events_march's pages.

I then pass over events_april dozens of times with a similar query. No
matter how many times I query events_april, I can't completely get
this table's pages into memory. This happens even when events_march
isn't touched at all, events_april easily fits into memory, and
events_april has been referenced dozens of times.

After Patch is Applied:

This time, after three passes over events_april, all the pages are
cached in memory. (4th access is completely served from memory.)

I also repeated this test with a bigger dataset of size 12GB (80% of
total memory) for both events_march and events_april, and observed the
same results. (After the 3rd pass, all of the pages in events_april are
cached.)

Thank you,
Metin

On Wed, Aug 21, 2013 at 12:04 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Sat, 17 Aug 2013 15:31:14 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
>
>> This series solves the problem by maintaining a history of pages
>> evicted from the inactive list, enabling the VM to tell streaming IO
>> from thrashing and rebalance the page cache lists when appropriate.
>
> I can't say I'm loving the patchset.  It adds significant bloat to the
> inode (of all things!), seems to add some runtime overhead and
> certainly adds boatloads of complexity.
>
> In return for which we get...  well, I don't know what we get - no data
> was included.  It had better be good!
>
> To aid in this decision, please go through the patchset and calculate
> and itemize the overhead: increased inode size, increased radix-tree
> consumption, lengthier code paths, anything else I missed.  Others can
> make their own judgements regarding complexity increase.
>
> Then please carefully describe the benefits, then see if you can
> convince us that one is worth the other!
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-11 21:57     ` Vlastimil Babka
@ 2013-08-12 16:27       ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-12 16:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Rik van Riel, Michel Lespinasse, Seth Jennings, Roman Gushchin,
	Ozgun Erdogan, Metin Doslu, linux-fsdevel, linux-kernel

Hello Vlastimil!

On Sun, Aug 11, 2013 at 11:57:37PM +0200, Vlastimil Babka wrote:
> On 08/07/2013 12:44 AM, Johannes Weiner wrote:
> >To accomplish this, a per-zone counter is increased every time a page
> >is evicted and a snapshot of that counter is stored as shadow entry in
> >the page's now empty page cache radix tree slot.  Upon refault of that
> >page, the difference between the current value of that counter and the
> >shadow entry value is called the refault distance.  It tells how many
> >pages have been evicted from the zone since that page's eviction,
> This explanation of refault distance seems correct...
> >which is how many page slots are missing from the zone's inactive list
> >for this page to get accessed twice while in memory.
> But this part seems slightly incorrect. IMHO the correct formulation
> would be "...how many page slots are AT MOST missing...". See below.

Yes, I think this would be better phrasing.

> >If the number of
> >missing slots is less than or equal to the number of active pages,
> >increasing the inactive list at the cost of the active list would give
> >this thrashing set a chance to establish itself:
> >
> >eviction counter = 4
> >                         evicted      inactive           active
> >  Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
> >   Shadow entries:         0 1 2 3
> >Refault distance:         4 3 2 1
> Consider here that if 'd' was now accessed before 'c', I think 'e'
> would be evicted and eviction counter would be incremented to 5. So
> for 'c' you would now say that three slots would prevent the
> refault, but in fact two would still be sufficient. This potential
> imprecision could make the algorithm challenge more active pages
> than it should, but I am not sure how bad the consequences could
> be... so just pointing it out.

Yes, pages are not always accessed in strict order, and there is
occasionally an outlier that gets accessed before its peers.

In fact, the code treats it as an AT MOST already.  Every page fault
is evaluated based on its refault distance, but a refault distance of
N does not mean that N pages are deactivated.  Only one page is.

An outlier is thus harmless, but we still catch it when a whole group
of pages is thrashing.

The changelog and documentation should be updated, thanks for pointing
this out.
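
For reference, a small standalone illustration of the scenario you
describe, using the numbers from the changelog example:

#include <stdio.h>

int main(void)
{
	/* Shadow entries stored when c and d were evicted: */
	unsigned long shadow_c = 2, shadow_d = 3;
	unsigned long counter = 4;	/* evictions so far */

	/* 'd' refaults first; its refault distance is exact: */
	printf("d: distance %lu\n", counter - shadow_d);	/* 1 */

	/* Refaulting 'd' evicts another inactive page ('e'), which
	 * advances the clock: */
	counter++;

	/*
	 * Now 'c' refaults with distance 3, although two extra inactive
	 * slots would already have kept it resident: the distance is an
	 * upper bound when pages come back out of order.
	 */
	printf("c: distance %lu\n", counter - shadow_c);	/* 3 */
	return 0;
}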

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-09 22:49     ` Andrew Morton
@ 2013-08-12 16:00       ` Johannes Weiner
  -1 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-12 16:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

On Fri, Aug 09, 2013 at 03:49:43PM -0700, Andrew Morton wrote:
> On Tue,  6 Aug 2013 18:44:09 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > To accomplish this, a per-zone counter is increased every time a page
> > is evicted and a snapshot of that counter is stored as shadow entry in
> > the page's now empty page cache radix tree slot.
> 
> How do you handle wraparound of that counter on 32-bit machines?

The distance between two time stamps is an unsigned subtraction, so
it's accurate even when the counter has wrapped between them.

The per-zone counter lapping shadow entries is possible but not very
likely because the shadow pages are reclaimed when more than
2*global_dirtyable_memory() of them exist.  And usually they are
refaulted or reclaimed along with the inode before that happens.

There is an unlikely case where some shadow entries make it into an
inode and then that same inode is evicting and refaulting pages in
another area, which increases the counter while not producing an
excess of shadow entries.  Should the counter lap these inactive
shadow entries, the worst case is that a refault will incorrectly
interpret them as recently evicted and deactivate a page for every
such entry.  Which would at worst be a "regression" to how the code
was for a long time, where every reclaim run also always deactivated
some pages.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-06 22:44   ` Johannes Weiner
@ 2013-08-11 21:57     ` Vlastimil Babka
  -1 siblings, 0 replies; 51+ messages in thread
From: Vlastimil Babka @ 2013-08-11 21:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Andrew Morton,
	Greg Thelen, Christoph Hellwig, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Mel Gorman, Minchan Kim, Peter Zijlstra,
	Rik van Riel, Michel Lespinasse, Seth Jennings, Roman Gushchin,
	Ozgun Erdogan, Metin Doslu, linux-fsdevel, linux-kernel

On 08/07/2013 12:44 AM, Johannes Weiner wrote:
> To accomplish this, a per-zone counter is increased every time a page
> is evicted and a snapshot of that counter is stored as shadow entry in
> the page's now empty page cache radix tree slot.  Upon refault of that
> page, the difference between the current value of that counter and the
> shadow entry value is called the refault distance.  It tells how many
> pages have been evicted from the zone since that page's eviction,
This explanation of refault distance seems correct...
> which is how many page slots are missing from the zone's inactive list
> for this page to get accessed twice while in memory.
But this part seems slightly incorrect. IMHO the correct formulation 
would be "...how many page slots are AT MOST missing...". See below.
> If the number of
> missing slots is less than or equal to the number of active pages,
> increasing the inactive list at the cost of the active list would give
> this thrashing set a chance to establish itself:
>
> eviction counter = 4
>                          evicted      inactive           active
>   Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
>    Shadow entries:         0 1 2 3
> Refault distance:         4 3 2 1
Consider here that if 'd' was now accessed before 'c', I think 'e' would 
be evicted and the eviction counter would be incremented to 5. So for 'c'
you would now say that three slots would prevent the refault, but in 
fact two would still be sufficient. This potential imprecision could 
make the algorithm challenge more active pages than it should, but I am 
not sure how bad the consequences could be... so just pointing it out.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-06 22:44   ` Johannes Weiner
@ 2013-08-09 22:49     ` Andrew Morton
  -1 siblings, 0 replies; 51+ messages in thread
From: Andrew Morton @ 2013-08-09 22:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andi Kleen, Andrea Arcangeli, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

On Tue,  6 Aug 2013 18:44:09 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> To accomplish this, a per-zone counter is increased every time a page
> is evicted and a snapshot of that counter is stored as shadow entry in
> the page's now empty page cache radix tree slot.

How do you handle wraparound of that counter on 32-bit machines?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-06 22:44 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
@ 2013-08-06 22:44   ` Johannes Weiner
  0 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-06 22:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave enough room for the inactive list for a new set of
frequently used pages, "working set", to establish itself because the
young pages get pushed out of memory before having a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once streams,
but was not satisfactory when use-once streaming persisted over longer
periods of time and the established working set was temporarily
suspended, like a nightly backup evicting all the interactive user
program data.

Subsequently, the rules were changed to only age active pages when
they exceeded the amount of inactive pages, i.e. leave the working set
alone as long as the other half of memory is easy-to-reclaim use-once
pages.  This works well until working set transitions exceed the size
of half of memory and the average access distance between the pages of
the new working set is bigger than the inactive list.  The VM will
mistake the thrashing new working set for use-once streaming, while
the unused old working set pages are stuck on the active list.

This patch solves this problem by maintaining a history of recently
evicted file pages, which in turn allows the VM to tell used-once page
streams from thrashing file cache.

To accomplish this, a per-zone counter is increased every time a page
is evicted and a snapshot of that counter is stored as shadow entry in
the page's now empty page cache radix tree slot.  Upon refault of that
page, the difference between the current value of that counter and the
shadow entry value is called the refault distance.  It tells how many
pages have been evicted from the zone since that page's eviction,
which is how many page slots are missing from the zone's inactive list
for this page to get accessed twice while in memory.  If the number of
missing slots is less than or equal to the number of active pages,
increasing the inactive list at the cost of the active list would give
this thrashing set a chance to establish itself:

eviction counter = 4
                        evicted      inactive           active
 Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
  Shadow entries:         0 1 2 3
Refault distance:         4 3 2 1

When c is faulted back into memory, it is noted that two more page
slots on the inactive list could have prevented the refault.  Thus,
the active list needs to be challenged for those two page slots as it
is possible that c is used more frequently than l, m, n.  However, c
might also be used much less frequently than the active pages, and so
a) pages cannot be directly reclaimed from the tail of the active list
and b) refaulting pages cannot be directly activated.  Instead,
active pages are moved from the tail of the active list to the head of
the inactive list and placed directly next to the refaulting pages.
This way, they both have the same time on the inactive list to prove
which page is actually used more frequently without incurring
unnecessary major faults or diluting the active page set in case the
previously active page is in fact the more frequently used one.
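
For illustration only (this is not part of the patch), a stripped-down
user-space model of the shadow packing and distance check that
mm/workingset.c implements below; the shift width and the simplified
function signatures are placeholders:

#include <stdio.h>

#define ZONE_BITS	4			/* placeholder shift, not the kernel's */
#define COUNTER_MASK	(~0UL >> ZONE_BITS)

/* store the eviction counter next to a zone identifier in one word */
static unsigned long pack_shadow(unsigned long counter, unsigned int zone)
{
	return (counter << ZONE_BITS) | zone;
}

/* recover the eviction counter and compute the distance to "now" */
static unsigned long refault_distance(unsigned long shadow, unsigned long now)
{
	unsigned long evicted_at = shadow >> ZONE_BITS;

	return (now - evicted_at) & COUNTER_MASK;
}

int main(void)
{
	unsigned long shadow = pack_shadow(2, 0);	/* c evicted at counter 2 */
	unsigned long now = 4;				/* counter when c refaults */
	unsigned long nr_active = 3;			/* l, m, n */
	unsigned long distance = refault_distance(shadow, now);

	if (distance <= nr_active)
		printf("challenge %lu active page slots\n", distance);
	return 0;
}

In this model, as in the patch, the unsigned subtraction and mask keep
the distance meaningful across counter wraparound as long as fewer
evictions than the masked counter can represent happen between eviction
and refault.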

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h  |   6 ++
 include/linux/pagemap.h |   2 +
 include/linux/swap.h    |   5 ++
 mm/Makefile             |   2 +-
 mm/filemap.c            |   3 +
 mm/swap.c               |   2 +
 mm/vmscan.c             |  62 +++++++++++---
 mm/vmstat.c             |   3 +
 mm/workingset.c         | 213 ++++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 284 insertions(+), 14 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0c41d59..e75fc92 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -141,6 +141,9 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	WORKINGSET_STALE,
+	WORKINGSET_BALANCE,
+	WORKINGSET_BALANCE_FORCE,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
@@ -393,6 +396,9 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
+	atomic_long_t		workingset_time;
+	long			shrink_active;
+
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b24236..c6beed2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -232,6 +232,8 @@ extern struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow);
 #else
 static inline struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 {
+	if (shadow)
+		workingset_refault(shadow);
 	return alloc_pages(gfp, 0);
 }
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 24db914..441845d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,11 @@ struct swap_list_t {
 	int next;	/* swapfile to be used next */
 };
 
+/* linux/mm/workingset.c */
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+void workingset_refault(void *shadow);
+void workingset_activation(struct page *page);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/Makefile b/mm/Makefile
index cd0abd8..f740427 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o \
-			   interval_tree.o list_lru.o $(mmu-y)
+			   interval_tree.o list_lru.o workingset.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/filemap.c b/mm/filemap.c
index d3e5578..ab4351e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -543,6 +543,9 @@ struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 	int n;
 	struct page *page;
 
+	if (shadow)
+		workingset_refault(shadow);
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
diff --git a/mm/swap.c b/mm/swap.c
index bf448cf..f90a331 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,6 +482,8 @@ void mark_page_accessed(struct page *page)
 		else
 			__lru_cache_activate_page(page);
 		ClearPageReferenced(page);
+		if (page_is_file_cache(page))
+			workingset_activation(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dd5f67c..27a36f6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -496,7 +496,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+			    bool reclaimed)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -542,10 +543,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
+		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page, NULL);
+		if (reclaimed && page_is_file_cache(page))
+			shadow = workingset_eviction(mapping, page);
+		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
@@ -568,7 +572,7 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
+	if (__remove_mapping(mapping, page, false)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -1038,7 +1042,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
 
 		/*
@@ -1746,32 +1750,64 @@ static inline int inactive_anon_is_low(struct lruvec *lruvec)
 /**
  * inactive_file_is_low - check if file pages need to be deactivated
  * @lruvec: LRU vector to check
+ * @nr_to_scan: number of active pages to scan
+ * @sc: scan parameters
  *
  * When the system is doing streaming IO, memory pressure here
  * ensures that active file pages get deactivated, until more
  * than half of the file pages are on the inactive list.
  *
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
+ * Once we get to that situation, protect the system's working set
+ * from being evicted by disabling active file page aging, until
+ * thrashing on the inactive list suggests that a new working set is
+ * trying to form.
  *
  * This uses a different ratio than the anonymous pages, because
  * the page cache uses a use-once replacement algorithm.
  */
-static int inactive_file_is_low(struct lruvec *lruvec)
+static int inactive_file_is_low(struct lruvec *lruvec,
+				unsigned long nr_to_scan,
+				struct scan_control *sc)
 {
+	struct zone *zone = lruvec_zone(lruvec);
 	unsigned long inactive;
 	unsigned long active;
 
+	if (global_reclaim(sc)) {
+		if (zone->shrink_active > 0) {
+			if (nr_to_scan) {
+				inc_zone_state(zone, WORKINGSET_BALANCE);
+				zone->shrink_active -= nr_to_scan;
+			}
+			return true;
+		}
+	}
+	/*
+	 * Make sure there is always a reasonable amount of inactive
+	 * file pages around to keep the zone reclaimable.
+	 *
+	 * We could do better than requiring half of memory, but we
+	 * need a big safety buffer until we are smarter about
+	 * dirty/writeback pages and file readahead windows.
+	 * Otherwise, we can end up with all pages on the inactive
+	 * list being dirty, or trash readahead pages before use.
+	 */
 	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
-
-	return active > inactive;
+	if (active > inactive) {
+		if (global_reclaim(sc) && nr_to_scan)
+			inc_zone_state(zone, WORKINGSET_BALANCE_FORCE);
+		return true;
+	}
+	return false;
 }
 
-static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
+static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru,
+				unsigned long nr_to_scan,
+				struct scan_control *sc)
 {
 	if (is_file_lru(lru))
-		return inactive_file_is_low(lruvec);
+		return inactive_file_is_low(lruvec, nr_to_scan, sc);
 	else
 		return inactive_anon_is_low(lruvec);
 }
@@ -1780,7 +1816,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, lru))
+		if (inactive_list_is_low(lruvec, lru, nr_to_scan, sc))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -1891,7 +1927,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * There is enough inactive page cache, do not reclaim
 	 * anything from the anonymous working set right now.
 	 */
-	if (!inactive_file_is_low(lruvec)) {
+	if (!inactive_file_is_low(lruvec, 0, sc)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 87228c5..2b14f7a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -738,6 +738,9 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
+	"workingset_stale",
+	"workingset_balance",
+	"workingset_balance_force",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 0000000..65714d2
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,213 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ *		Double CLOCK lists
+ *
+ * Per zone, two clock lists are maintained for file pages: the
+ * inactive and the active list.  Freshly faulted pages start out at
+ * the head of the inactive list and page reclaim scans pages from the
+ * tail.  Pages that are accessed multiple times on the inactive list
+ * are promoted to the active list, to protect them from reclaim,
+ * whereas active pages are demoted to the inactive list when the
+ * inactive list requires more space to detect repeatedly accessed
+ * pages in the current workload and prevent them from thrashing:
+ *
+ *   fault -----------------------+
+ *                                |
+ *              +-------------+   |            +-------------+
+ *   reclaim <- | inactive    | <-+-- demotion | active      | <--+
+ *              +-------------+                +-------------+    |
+ *                       |                                        |
+ *                       +----------- promotion ------------------+
+ *
+ *
+ *		Access frequency and refault distance
+ *
+ * A workload is thrashing when the distances between the first and
+ * second access of pages that are frequently used are bigger than the
+ * current inactive clock list size, as the pages get reclaimed before
+ * the second access would have promoted them instead:
+ *
+ *    Access #: 1 2 3 4 5 6 7 8 9
+ *     Page ID: x y b c d e f x y
+ *                  | inactive  |
+ *
+ * To prevent this workload from thrashing, a bigger inactive list is
+ * required.  And the only way the inactive list can grow on a full
+ * zone is by taking away space from the corresponding active list.
+ *
+ *      +-inactive--+-active------+
+ *  x y | b c d e f | G H I J K L |
+ *      +-----------+-------------+
+ *
+ * Not every refault should lead to growing the inactive list at the
+ * cost of the active list, however: if the access distances are
+ * bigger than available memory overall, there is little point in
+ * challenging the protected pages on the active list, as those
+ * refaulting pages will not fit completely into memory.
+ *
+ * It is prohibitively expensive to track the access frequency of
+ * in-core pages, but it is possible to track their refault distance,
+ * which is the number of page slots shrunk from the inactive list
+ * between a page's eviction and subsequent refault.  This indicates
+ * how many page slots are missing on the inactive list in order to
+ * prevent future thrashing of that page.  Thus, instead of comparing
+ * access frequency to total available memory, one can compare the
+ * refault distance to the inactive list's potential for growth: the
+ * size of the active list.
+ *
+ *
+ *		Rebalancing the lists
+ *
+ * Shrinking the active list has to be done carefully because the
+ * pages on it may have vastly different access frequencies compared
+ * to the pages on the inactive list.  Thus, pages are not reclaimed
+ * directly from the tail of the active list, but instead moved to the
+ * head of the inactive list.  This way, they are competing directly
+ * with the pages that challenged their protected status.  If they are
+ * unused, they will eventually be reclaimed, but if they are indeed
+ * used more frequently than the challenging inactive pages, they will
+ * be reactivated.  This allows the existing protected set to be
+ * challenged without incurring major faults in case of a mistake.
+ */
+
+static void *pack_shadow(unsigned long time, struct zone *zone)
+{
+	time = (time << NODES_SHIFT) | zone_to_nid(zone);
+	time = (time << ZONES_SHIFT) | zone_idx(zone);
+	time = (time << RADIX_TREE_EXCEPTIONAL_SHIFT);
+
+	return (void *)(time | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+static void unpack_shadow(void *shadow,
+			  struct zone **zone,
+			  unsigned long *distance)
+{
+	unsigned long entry = (unsigned long)shadow;
+	unsigned long time_of_eviction;
+	unsigned long mask;
+	unsigned long now;
+	int zid, nid;
+
+	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	zid = entry & ((1UL << ZONES_SHIFT) - 1);
+	entry >>= ZONES_SHIFT;
+	nid = entry & ((1UL << NODES_SHIFT) - 1);
+	entry >>= NODES_SHIFT;
+	time_of_eviction = entry;
+
+	*zone = NODE_DATA(nid)->node_zones + zid;
+
+	now = atomic_long_read(&(*zone)->workingset_time);
+	mask = ~0UL >> (RADIX_TREE_EXCEPTIONAL_SHIFT +
+			ZONES_SHIFT + NODES_SHIFT);
+
+	*distance = (now - time_of_eviction) & mask;
+}
+
+/**
+ * workingset_eviction - note the eviction of a page from memory
+ * @mapping: address space the page was backing
+ * @page: the page being evicted
+ *
+ * Returns a shadow entry to be stored in @mapping->page_tree in place
+ * of the evicted @page so that a later refault can be detected.  Or
+ * %NULL when the eviction should not be remembered.
+ */
+void *workingset_eviction(struct address_space *mapping, struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long time;
+
+	time = atomic_long_inc_return(&zone->workingset_time);
+
+	/*
+	 * Don't store shadows in an inode that is being reclaimed.
+	 * This is not just an optimization, inode reclaim needs to
+	 * empty out the radix tree or the nodes are lost, so don't
+	 * plant shadows behind its back.
+	 */
+	if (mapping_exiting(mapping))
+		return NULL;
+
+	return pack_shadow(time, zone);
+}
+
+/**
+ * workingset_refault - note the refault of a previously evicted page
+ * @shadow: shadow entry of the evicted page
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted page in the context of the zone it was allocated in.
+ *
+ * This primes page reclaim to rebalance the zone's file lists if
+ * necessary, so it must be called before a page frame for the
+ * refaulting page is allocated.
+ */
+void workingset_refault(void *shadow)
+{
+	unsigned long refault_distance;
+	unsigned long zone_active;
+	unsigned long zone_free;
+	struct zone *zone;
+
+	unpack_shadow(shadow, &zone, &refault_distance);
+
+	/*
+	 * The dirty balance reserve is a generous estimation of the
+	 * zone's memory reserve that is not available to page cache.
+	 * If the zone has more free pages than that, it means that
+	 * there are pages ready to allocate without reclaiming from
+	 * the zone at all, let alone putting pressure on its active
+	 * pages.
+	 */
+	zone_free = zone_page_state(zone, NR_FREE_PAGES);
+	if (zone_free > zone->dirty_balance_reserve)
+		return;
+
+	/*
+	 * Protected pages should be challenged when the refault
+	 * distance indicates that thrashing could be stopped by
+	 * increasing the inactive list at the cost of the active
+	 * list.
+	 */
+	zone_active = zone_page_state(zone, NR_ACTIVE_FILE);
+	if (refault_distance > zone_active)
+		return;
+
+	inc_zone_state(zone, WORKINGSET_STALE);
+	zone->shrink_active++;
+}
+EXPORT_SYMBOL(workingset_refault);
+
+/**
+ * workingset_activation - note a page activation
+ * @page: page that is being activated
+ */
+void workingset_activation(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	/*
+	 * The lists are rebalanced when the inactive list is observed
+	 * to be too small for activations.  An activation means that
+	 * the inactive list is now big enough again for at least one
+	 * page, so back off further deactivation.
+	 */
+	atomic_long_inc(&zone->workingset_time);
+	if (zone->shrink_active > 0)
+		zone->shrink_active--;
+}
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [patch 8/9] mm: thrash detection-based file cache sizing
  2013-08-06 22:22 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
@ 2013-08-06 22:22 ` Johannes Weiner
  0 siblings, 0 replies; 51+ messages in thread
From: Johannes Weiner @ 2013-08-06 22:22 UTC (permalink / raw)
  To: linux-mm
  Cc: Andi Kleen, Andrea Arcangeli, Andrew Morton, Greg Thelen,
	Christoph Hellwig, Hugh Dickins, Jan Kara, KOSAKI Motohiro,
	Mel Gorman, Minchan Kim, Peter Zijlstra, Rik van Riel,
	Michel Lespinasse, Seth Jennings, Roman Gushchin, Ozgun Erdogan,
	Metin Doslu, linux-fsdevel, linux-kernel

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave enough room for the inactive list for a new set of
frequently used pages, "working set", to establish itself because the
young pages get pushed out of memory before having a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once streams,
but was not satisfactory when use-once streaming persisted over longer
periods of time and the established working set was temporarily
suspended, like a nightly backup evicting all the interactive user
program data.

Subsequently, the rules were changed to only age active pages when
they exceeded the amount of inactive pages, i.e. leave the working set
alone as long as the other half of memory is easy-to-reclaim use-once
pages.  This works well until working set transitions exceed the size
of half of memory and the average access distance between the pages of
the new working set is bigger than the inactive list.  The VM will
mistake the thrashing new working set for use-once streaming, while
the unused old working set pages are stuck on the active list.

This patch solves this problem by maintaining a history of recently
evicted file pages, which in turn allows the VM to tell used-once page
streams from thrashing file cache.

To accomplish this, a per-zone counter is increased every time a page
is evicted and a snapshot of that counter is stored as shadow entry in
the page's now empty page cache radix tree slot.  Upon refault of that
page, the difference between the current value of that counter and the
shadow entry value is called the refault distance.  It tells how many
pages have been evicted from the zone since that page's eviction,
which is how many page slots are missing from the zone's inactive list
for this page to get accessed twice while in memory.  If the number of
missing slots is less than or equal to the number of active pages,
increasing the inactive list at the cost of the active list would give
this thrashing set a chance to establish itself:

eviction counter = 4
                        evicted      inactive           active
 Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
  Shadow entries:         0 1 2 3
Refault distance:         4 3 2 1

When c is faulted back into memory, it is noted that two more page
slots on the inactive list could have prevented the refault.  Thus,
the active list needs to be challenged for those two page slots as it
is possible that c is used more frequently than l, m, n.  However, c
might also be used much less frequently than the active pages, and so
a) pages cannot be directly reclaimed from the tail of the active list
and b) refaulting pages cannot be directly activated.  Instead,
active pages are moved from the tail of the active list to the head of
the inactive list and placed directly next to the refaulting pages.
This way, they both have the same time on the inactive list to prove
which page is actually used more frequently without incurring
unnecessary major faults or diluting the active page set in case the
previously active page is in fact the more frequently used one.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h  |   6 ++
 include/linux/pagemap.h |   2 +
 include/linux/swap.h    |   5 ++
 mm/Makefile             |   2 +-
 mm/filemap.c            |   3 +
 mm/swap.c               |   2 +
 mm/vmscan.c             |  62 +++++++++++---
 mm/vmstat.c             |   3 +
 mm/workingset.c         | 213 ++++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 284 insertions(+), 14 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0c41d59..e75fc92 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -141,6 +141,9 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	WORKINGSET_STALE,
+	WORKINGSET_BALANCE,
+	WORKINGSET_BALANCE_FORCE,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
@@ -393,6 +396,9 @@ struct zone {
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
 
+	atomic_long_t		workingset_time;
+	long			shrink_active;
+
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b24236..c6beed2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -232,6 +232,8 @@ extern struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow);
 #else
 static inline struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 {
+	if (shadow)
+		workingset_refault(shadow);
 	return alloc_pages(gfp, 0);
 }
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 24db914..441845d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,11 @@ struct swap_list_t {
 	int next;	/* swapfile to be used next */
 };
 
+/* linux/mm/workingset.c */
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+void workingset_refault(void *shadow);
+void workingset_activation(struct page *page);
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/Makefile b/mm/Makefile
index cd0abd8..f740427 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o balloon_compaction.o \
-			   interval_tree.o list_lru.o $(mmu-y)
+			   interval_tree.o list_lru.o workingset.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/filemap.c b/mm/filemap.c
index d3e5578..ab4351e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -543,6 +543,9 @@ struct page *__page_cache_alloc(gfp_t gfp, struct page *shadow)
 	int n;
 	struct page *page;
 
+	if (shadow)
+		workingset_refault(shadow);
+
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
 		do {
diff --git a/mm/swap.c b/mm/swap.c
index bf448cf..f90a331 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,6 +482,8 @@ void mark_page_accessed(struct page *page)
 		else
 			__lru_cache_activate_page(page);
 		ClearPageReferenced(page);
+		if (page_is_file_cache(page))
+			workingset_activation(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dd5f67c..27a36f6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -496,7 +496,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+			    bool reclaimed)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -542,10 +543,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
+		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page, NULL);
+		if (reclaimed && page_is_file_cache(page))
+			shadow = workingset_eviction(mapping, page);
+		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
@@ -568,7 +572,7 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
+	if (__remove_mapping(mapping, page, false)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -1038,7 +1042,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
 
 		/*
@@ -1746,32 +1750,64 @@ static inline int inactive_anon_is_low(struct lruvec *lruvec)
 /**
  * inactive_file_is_low - check if file pages need to be deactivated
  * @lruvec: LRU vector to check
+ * @nr_to_scan: number of active pages to scan
+ * @sc: scan parameters
  *
  * When the system is doing streaming IO, memory pressure here
  * ensures that active file pages get deactivated, until more
  * than half of the file pages are on the inactive list.
  *
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
+ * Once we get to that situation, protect the system's working set
+ * from being evicted by disabling active file page aging, until
+ * thrashing on the inactive list suggests that a new working set is
+ * trying to form.
  *
  * This uses a different ratio than the anonymous pages, because
  * the page cache uses a use-once replacement algorithm.
  */
-static int inactive_file_is_low(struct lruvec *lruvec)
+static int inactive_file_is_low(struct lruvec *lruvec,
+				unsigned long nr_to_scan,
+				struct scan_control *sc)
 {
+	struct zone *zone = lruvec_zone(lruvec);
 	unsigned long inactive;
 	unsigned long active;
 
+	if (global_reclaim(sc)) {
+		if (zone->shrink_active > 0) {
+			if (nr_to_scan) {
+				inc_zone_state(zone, WORKINGSET_BALANCE);
+				zone->shrink_active -= nr_to_scan;
+			}
+			return true;
+		}
+	}
+	/*
+	 * Make sure there is always a reasonable amount of inactive
+	 * file pages around to keep the zone reclaimable.
+	 *
+	 * We could do better than requiring half of memory, but we
+	 * need a big safety buffer until we are smarter about
+	 * dirty/writeback pages and file readahead windows.
+	 * Otherwise, we can end up with all pages on the inactive
+	 * list being dirty, or trash readahead pages before use.
+	 */
 	inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	active = get_lru_size(lruvec, LRU_ACTIVE_FILE);
-
-	return active > inactive;
+	if (active > inactive) {
+		if (global_reclaim(sc) && nr_to_scan)
+			inc_zone_state(zone, WORKINGSET_BALANCE_FORCE);
+		return true;
+	}
+	return false;
 }
 
-static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
+static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru,
+				unsigned long nr_to_scan,
+				struct scan_control *sc)
 {
 	if (is_file_lru(lru))
-		return inactive_file_is_low(lruvec);
+		return inactive_file_is_low(lruvec, nr_to_scan, sc);
 	else
 		return inactive_anon_is_low(lruvec);
 }
@@ -1780,7 +1816,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(lruvec, lru))
+		if (inactive_list_is_low(lruvec, lru, nr_to_scan, sc))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
 		return 0;
 	}
@@ -1891,7 +1927,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	 * There is enough inactive page cache, do not reclaim
 	 * anything from the anonymous working set right now.
 	 */
-	if (!inactive_file_is_low(lruvec)) {
+	if (!inactive_file_is_low(lruvec, 0, sc)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 87228c5..2b14f7a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -738,6 +738,9 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
+	"workingset_stale",
+	"workingset_balance",
+	"workingset_balance_force",
 	"nr_anon_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 0000000..65714d2
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,213 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ *		Double CLOCK lists
+ *
+ * Per zone, two clock lists are maintained for file pages: the
+ * inactive and the active list.  Freshly faulted pages start out at
+ * the head of the inactive list and page reclaim scans pages from the
+ * tail.  Pages that are accessed multiple times on the inactive list
+ * are promoted to the active list, to protect them from reclaim,
+ * whereas active pages are demoted to the inactive list when the
+ * inactive list requires more space to detect repeatedly accessed
+ * pages in the current workload and prevent them from thrashing:
+ *
+ *   fault -----------------------+
+ *                                |
+ *              +-------------+   |            +-------------+
+ *   reclaim <- | inactive    | <-+-- demotion | active      | <--+
+ *              +-------------+                +-------------+    |
+ *                       |                                        |
+ *                       +----------- promotion ------------------+
+ *
+ *
+ *		Access frequency and refault distance
+ *
+ * A workload is thrashing when the distances between the first and
+ * second access of pages that are frequently used are bigger than the
+ * current inactive clock list size, as the pages get reclaimed before
+ * the second access would have promoted them instead:
+ *
+ *    Access #: 1 2 3 4 5 6 7 8 9
+ *     Page ID: x y b c d e f x y
+ *                  | inactive  |
+ *
+ * To prevent this workload from thrashing, a bigger inactive list is
+ * required.  And the only way the inactive list can grow on a full
+ * zone is by taking away space from the corresponding active list.
+ *
+ *      +-inactive--+-active------+
+ *  x y | b c d e f | G H I J K L |
+ *      +-----------+-------------+
+ *
+ * Not every refault should lead to growing the inactive list at the
+ * cost of the active list, however: if the access distances are
+ * bigger than available memory overall, there is little point in
+ * challenging the protected pages on the active list, as those
+ * refaulting pages will not fit completely into memory.
+ *
+ * It is prohibitively expensive to track the access frequency of
+ * in-core pages, but it is possible to track their refault distance,
+ * which is the number of page slots shrunk from the inactive list
+ * between a page's eviction and subsequent refault.  This indicates
+ * how many page slots are missing on the inactive list in order to
+ * prevent future thrashing of that page.  Thus, instead of comparing
+ * access frequency to total available memory, one can compare the
+ * refault distance to the inactive list's potential for growth: the
+ * size of the active list.
+ *
+ *
+ *		Rebalancing the lists
+ *
+ * Shrinking the active list has to be done carefully because the
+ * pages on it may have vastly different access frequencies compared
+ * to the pages on the inactive list.  Thus, pages are not reclaimed
+ * directly from the tail of the active list, but instead moved to the
+ * head of the inactive list.  This way, they are competing directly
+ * with the pages that challenged their protected status.  If they are
+ * unused, they will eventually be reclaimed, but if they are indeed
+ * used more frequently than the challenging inactive pages, they will
+ * be reactivated.  This allows the existing protected set to be
+ * challenged without incurring major faults in case of a mistake.
+ */
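
For illustration only (this is not part of the patch, and all names and
numbers below are made up), the refault distance test described above can
be modelled in a few lines of user-space C: a counter stands in for the
zone's workingset time, the shadow entry is just the counter value at
eviction, and a refault is "within reach" when the elapsed distance does
not exceed the active list size.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the per-zone workingset clock; ticks per eviction. */
static unsigned long workingset_time;

/* Stand-in for a shadow entry: just the eviction timestamp. */
struct shadow {
	unsigned long eviction_time;
};

/*
 * A refault is worth acting on only if growing the inactive list by
 * at most the active list's size could have kept the page in memory,
 * i.e. distance <= active list size.
 */
static bool refault_within_reach(const struct shadow *s,
				 unsigned long active_size)
{
	unsigned long distance = workingset_time - s->eviction_time;

	return distance <= active_size;
}

int main(void)
{
	struct shadow s = { .eviction_time = 100 };

	workingset_time = 150;	/* 50 slots reclaimed since eviction */
	printf("active=80: %d\n", refault_within_reach(&s, 80)); /* 1 */
	printf("active=30: %d\n", refault_within_reach(&s, 30)); /* 0 */
	return 0;
}
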
+
+static void *pack_shadow(unsigned long time, struct zone *zone)
+{
+	time = (time << NODES_SHIFT) | zone_to_nid(zone);
+	time = (time << ZONES_SHIFT) | zone_idx(zone);
+	time = (time << RADIX_TREE_EXCEPTIONAL_SHIFT);
+
+	return (void *)(time | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+static void unpack_shadow(void *shadow,
+			  struct zone **zone,
+			  unsigned long *distance)
+{
+	unsigned long entry = (unsigned long)shadow;
+	unsigned long time_of_eviction;
+	unsigned long mask;
+	unsigned long now;
+	int zid, nid;
+
+	entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+	zid = entry & ((1UL << ZONES_SHIFT) - 1);
+	entry >>= ZONES_SHIFT;
+	nid = entry & ((1UL << NODES_SHIFT) - 1);
+	entry >>= NODES_SHIFT;
+	time_of_eviction = entry;
+
+	*zone = NODE_DATA(nid)->node_zones + zid;
+
+	now = atomic_long_read(&(*zone)->workingset_time);
+	mask = ~0UL >> (RADIX_TREE_EXCEPTIONAL_SHIFT +
+			ZONES_SHIFT + NODES_SHIFT);
+
+	*distance = (now - time_of_eviction) & mask;
+}
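
The packing scheme above can be exercised outside the kernel; the sketch
below uses assumed field widths (the real RADIX_TREE_EXCEPTIONAL_SHIFT,
ZONES_SHIFT and NODES_SHIFT depend on the kernel configuration) to
round-trip a timestamp and to show that the distance is computed modulo
the timestamp bits that survive the node and zone fields.

#include <assert.h>
#include <stdio.h>

/* Assumed widths for illustration; not the kernel's actual values. */
#define EX_SHIFT	2	/* radix tree exceptional entry bits */
#define ZONES_SHIFT	2
#define NODES_SHIFT	6
#define EX_ENTRY	1UL

static unsigned long pack(unsigned long time, int nid, int zid)
{
	time = (time << NODES_SHIFT) | nid;
	time = (time << ZONES_SHIFT) | zid;
	return (time << EX_SHIFT) | EX_ENTRY;
}

static unsigned long unpack(unsigned long entry, int *nid, int *zid)
{
	entry >>= EX_SHIFT;
	*zid = entry & ((1UL << ZONES_SHIFT) - 1);
	entry >>= ZONES_SHIFT;
	*nid = entry & ((1UL << NODES_SHIFT) - 1);
	return entry >> NODES_SHIFT;
}

int main(void)
{
	unsigned long mask = ~0UL >> (EX_SHIFT + ZONES_SHIFT + NODES_SHIFT);
	unsigned long entry = pack(12345, 1, 2);
	int nid, zid;
	unsigned long evicted = unpack(entry, &nid, &zid);

	assert(nid == 1 && zid == 2 && evicted == 12345);
	/* timestamps are truncated, so mask the distance accordingly */
	printf("distance: %lu\n", (12395 - evicted) & mask);
	return 0;
}
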
+
+/**
+ * workingset_eviction - note the eviction of a page from memory
+ * @mapping: address space the page was backing
+ * @page: the page being evicted
+ *
+ * Returns a shadow entry to be stored in @mapping->page_tree in place
+ * of the evicted @page so that a later refault can be detected.  Or
+ * %NULL when the eviction should not be remembered.
+ */
+void *workingset_eviction(struct address_space *mapping, struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long time;
+
+	time = atomic_long_inc_return(&zone->workingset_time);
+
+	/*
+	 * Don't store shadows in an inode that is being reclaimed.
+	 * This is not just an optimization: inode reclaim needs to
+	 * empty out the radix tree, or the nodes are lost, so don't
+	 * plant shadows behind its back.
+	 */
+	if (mapping_exiting(mapping))
+		return NULL;
+
+	return pack_shadow(time, zone);
+}
+
+/**
+ * workingset_refault - note the refault of a previously evicted page
+ * @shadow: shadow entry of the evicted page
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted page in the context of the zone it was allocated in.
+ *
+ * This primes page reclaim to rebalance the zone's file lists if
+ * necessary, so it must be called before a page frame for the
+ * refaulting page is allocated.
+ */
+void workingset_refault(void *shadow)
+{
+	unsigned long refault_distance;
+	unsigned long zone_active;
+	unsigned long zone_free;
+	struct zone *zone;
+
+	unpack_shadow(shadow, &zone, &refault_distance);
+
+	/*
+	 * The dirty balance reserve is a generous estimation of the
+	 * zone's memory reserve that is not available to page cache.
+	 * If the zone has more free pages than that, it means that
+	 * there are pages ready to allocate without reclaiming from
+	 * the zone at all, let alone putting pressure on its active
+	 * pages.
+	 */
+	zone_free = zone_page_state(zone, NR_FREE_PAGES);
+	if (zone_free > zone->dirty_balance_reserve)
+		return;
+
+	/*
+	 * Protected pages should be challenged when the refault
+	 * distance indicates that thrashing could be stopped by
+	 * increasing the inactive list at the cost of the active
+	 * list.
+	 */
+	zone_active = zone_page_state(zone, NR_ACTIVE_FILE);
+	if (refault_distance > zone_active)
+		return;
+
+	inc_zone_state(zone, WORKINGSET_STALE);
+	zone->shrink_active++;
+}
+EXPORT_SYMBOL(workingset_refault);
+
+/**
+ * workingset_activation - note a page activation
+ * @page: page that is being activated
+ */
+void workingset_activation(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	/*
+	 * The lists are rebalanced when refaults indicate that the
+	 * inactive list is too small for pages to get activated
+	 * before they are reclaimed.  An activation means that the
+	 * inactive list is now big enough again for at least one
+	 * page, so back off further deactivation.
+	 */
+	atomic_long_inc(&zone->workingset_time);
+	if (zone->shrink_active > 0)
+		zone->shrink_active--;
+}
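
Tying the three hooks together, a toy user-space model of the expected
sequence (evict and remember a shadow, check the distance on refault,
back off on activation) might look as follows; none of this is kernel
code, and the simplified shadow is just a timestamp.

#include <stdio.h>

/* Toy stand-ins; the real counters live in struct zone. */
static unsigned long workingset_time;
static unsigned long nr_active = 4;	/* pretend active list size */
static unsigned long shrink_active;	/* pressure on the active list */

static unsigned long toy_eviction(void)
{
	return ++workingset_time;	/* shadow = eviction timestamp */
}

static void toy_refault(unsigned long shadow)
{
	unsigned long distance = workingset_time - shadow;

	if (distance <= nr_active)
		shrink_active++;	/* challenging the active list could help */
}

static void toy_activation(void)
{
	workingset_time++;
	if (shrink_active > 0)
		shrink_active--;	/* inactive list held a page long enough */
}

int main(void)
{
	unsigned long shadow = toy_eviction();

	toy_eviction();			/* two more pages get evicted */
	toy_eviction();
	toy_refault(shadow);		/* distance 2 <= 4: note pressure */
	printf("shrink_active after refault: %lu\n", shrink_active);
	toy_activation();		/* a promotion: back off */
	printf("shrink_active after activation: %lu\n", shrink_active);
	return 0;
}
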
-- 
1.8.3.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
