* [patch 0/5] refault distance-based file cache sizing
@ 2012-05-01  8:41 Johannes Weiner
  2012-05-01  8:41 ` [patch 1/5] mm: readahead: move radix tree hole searching here Johannes Weiner
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Johannes Weiner @ 2012-05-01  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Andrea Arcangeli, Peter Zijlstra, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

Hi,

Our file data caching is implemented with two LRU lists: an inactive
list of pages that have yet to prove they are worth keeping around, and
an active list of pages that already have.  The question is how to
balance those two lists against each other.

On the one hand, the inactive list needs to be big enough that pages
get the time in memory required to gather the references needed for
activation.  On the other hand, we want an active list big enough to
hold all frequently used data, if possible, so that it is protected
from streams of less frequently used and used-once pages.

Our current balancing rule ("active can't grow larger than inactive")
does not work well in either direction.  Some people complain that the
working set is not protected well enough from used-once file cache,
while others complain that we fail to adapt to changes in the working
set and keep protecting stale pages instead.
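
Roughly, that rule amounts to a predicate like the following sketch
(illustration only, with an invented name; not the actual vmscan.c
code):

	/*
	 * Sketch of the quoted balancing rule: the active file list is
	 * only shrunk while it is larger than the inactive file list.
	 * Not the actual kernel implementation.
	 */
	static inline int active_file_list_is_too_big(unsigned long nr_inactive,
						      unsigned long nr_active)
	{
		return nr_active > nr_inactive;
	}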

This series stores file cache eviction information in the vacated page
cache radix tree slots and uses it on refault to see if the pages
currently on the active list need to have their status challenged.
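
As a rough sketch of that mechanism, here is a self-contained toy model
(userspace C, invented names, not the code from this series):

	/*
	 * Toy model of the refault distance idea: a global "eviction
	 * clock" ticks once per eviction, and the clock value is left
	 * behind in the vacated slot.  On refault, the difference
	 * between the current clock and the stored value says how many
	 * evictions happened in the meantime, i.e. how much more room
	 * the inactive list would have needed to keep the page
	 * resident.  If that distance is smaller than the active list,
	 * the active pages deserve to be challenged.
	 */
	#include <stdio.h>

	#define NR_SLOTS	16

	static unsigned long eviction_clock;
	static unsigned long shadow[NR_SLOTS];	/* eviction time, 0 == none */

	static void evict(unsigned int slot)
	{
		shadow[slot] = ++eviction_clock;
	}

	static unsigned long refault_distance(unsigned int slot)
	{
		if (!shadow[slot])
			return 0;
		return eviction_clock - shadow[slot];
	}

	int main(void)
	{
		unsigned int i;

		for (i = 0; i < 8; i++)	/* evict slots 0..7 in order */
			evict(i);

		/* Slot 1 was evicted second; six evictions followed it. */
		printf("refault distance of slot 1: %lu\n",
		       refault_distance(1));
		return 0;
	}

In the series itself, the shadow entry lives in the radix tree slot
vacated by the evicted page (patch 3), and the resulting distance is
handed to the allocator on refault (patch 4).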

A fully activated file set that occupies 85% of memory is successfully
detected as stale when another file set of equal size is accessed a few
times (4-5).  The old kernel would never adapt to the second set.  If
the new set is bigger than memory, the old set is left untouched,
whereas the old kernel would shrink the old set to half of memory and
leave it at that.  Tested on a multi-zone, single-node machine.

More testing is obviously required, but I first wanted some opinions
at this point.  Is there fundamental disagreement with the concept?
With the implementation?

Memcg hard limit reclaim is not converted (it was ripped out again to
focus on the global case first) and still does the 50/50 balancing
between the lists; the conversion will be re-added in the next version.

Patches are based on 3.3.

 fs/btrfs/compression.c     |   10 +-
 fs/btrfs/extent_io.c       |    3 +-
 fs/cachefiles/rdwr.c       |   26 +++--
 fs/ceph/xattr.c            |    2 +-
 fs/inode.c                 |    7 +-
 fs/logfs/readwrite.c       |    9 +-
 fs/nilfs2/inode.c          |    6 +-
 fs/ntfs/file.c             |   11 ++-
 fs/splice.c                |   10 +-
 include/linux/mm.h         |    8 ++
 include/linux/mmzone.h     |    7 ++
 include/linux/pagemap.h    |   54 ++++++++---
 include/linux/pagevec.h    |    3 +
 include/linux/radix-tree.h |    4 -
 include/linux/sched.h      |    1 +
 include/linux/shmem_fs.h   |    1 +
 include/linux/swap.h       |    7 ++
 lib/radix-tree.c           |   75 ---------------
 mm/Makefile                |    1 +
 mm/filemap.c               |  222 ++++++++++++++++++++++++++++++++++----------
 mm/memcontrol.c            |    3 +
 mm/mincore.c               |   20 +++-
 mm/page_alloc.c            |    7 ++
 mm/readahead.c             |   51 +++++++++-
 mm/shmem.c                 |   89 +++---------------
 mm/swap.c                  |   23 +++++
 mm/truncate.c              |   73 +++++++++++---
 mm/vmscan.c                |   80 +++++++++-------
 mm/vmstat.c                |    4 +
 mm/workingset.c            |  174 ++++++++++++++++++++++++++++++++++
 net/ceph/messenger.c       |    2 +-
 net/ceph/pagelist.c        |    4 +-
 net/ceph/pagevec.c         |    2 +-
 33 files changed, 682 insertions(+), 317 deletions(-)

Thanks,
Johannes


* [patch 1/5] mm: readahead: move radix tree hole searching here
  2012-05-01  8:41 [patch 0/5] refault distance-based file cache sizing Johannes Weiner
@ 2012-05-01  8:41 ` Johannes Weiner
  2012-05-01 21:06   ` Rik van Riel
  2012-05-01  8:41 ` [patch 2/5] mm + fs: prepare for non-page entries in page cache Johannes Weiner
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2012-05-01  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Andrea Arcangeli, Peter Zijlstra, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

The readahead code searches the page cache for non-present pages, or
holes, to get a picture of the area surrounding a fault, for example.

For this it sufficed to rely on the radix tree definition of holes,
which is "empty tree slot".  This is about to change, though, because
shadow page descriptors will be stored in the page cache when the real
pages get evicted from memory.

Fortunately, nobody outside the readahead code uses these functions
and they have no internal knowledge of the radix tree structures, so
just move them over to mm/readahead.c where they can later be adapted
to handle the new definition of "page cache hole".

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h |    4 --
 lib/radix-tree.c           |   75 --------------------------------------------
 mm/readahead.c             |   36 ++++++++++++++++++++-
 3 files changed, 34 insertions(+), 81 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 07e360b..73e49c4 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -224,10 +224,6 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
 			void ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index dc63d08..89b5f6a 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -742,81 +742,6 @@ next:
 }
 EXPORT_SYMBOL(radix_tree_range_tag_if_tagged);
 
-
-/**
- *	radix_tree_next_hole    -    find the next hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the lowest
- *	indexed hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'return - index >= max_scan'
- *	will be true). In rare cases of index wrap-around, 0 will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 5, then subsequently a hole is created at index 10,
- *	radix_tree_next_hole covering both indexes may return 10 if called
- *	under rcu_read_lock.
- */
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index++;
-		if (index == 0)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_next_hole);
-
-/**
- *	radix_tree_prev_hole    -    find the prev hole (not-present entry)
- *	@root:		tree root
- *	@index:		index key
- *	@max_scan:	maximum range to search
- *
- *	Search backwards in the range [max(index-max_scan+1, 0), index]
- *	for the first hole.
- *
- *	Returns: the index of the hole if found, otherwise returns an index
- *	outside of the set specified (in which case 'index - return >= max_scan'
- *	will be true). In rare cases of wrap-around, ULONG_MAX will be returned.
- *
- *	radix_tree_next_hole may be called under rcu_read_lock. However, like
- *	radix_tree_gang_lookup, this will not atomically search a snapshot of
- *	the tree at a single point in time. For example, if a hole is created
- *	at index 10, then subsequently a hole is created at index 5,
- *	radix_tree_prev_hole covering both indexes may return 5 if called under
- *	rcu_read_lock.
- */
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
-				   unsigned long index, unsigned long max_scan)
-{
-	unsigned long i;
-
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
-			break;
-		index--;
-		if (index == ULONG_MAX)
-			break;
-	}
-
-	return index;
-}
-EXPORT_SYMBOL(radix_tree_prev_hole);
-
 static unsigned int
 __lookup(struct radix_tree_node *slot, void ***results, unsigned long *indices,
 	unsigned long index, unsigned int max_items, unsigned long *next_index)
diff --git a/mm/readahead.c b/mm/readahead.c
index cbcbb02..0d1f1aa 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -336,6 +336,38 @@ static unsigned long get_next_ra_size(struct file_ra_state *ra,
  * it approaches max_readhead.
  */
 
+static unsigned long page_cache_next_hole(struct address_space *mapping,
+					  pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index++;
+		if (index == 0)
+			break;
+	}
+
+	return index;
+}
+
+static unsigned long page_cache_prev_hole(struct address_space *mapping,
+					  pgoff_t index, unsigned long max_scan)
+{
+	unsigned long i;
+
+	for (i = 0; i < max_scan; i++) {
+		if (!radix_tree_lookup(&mapping->page_tree, index))
+			break;
+		index--;
+		if (index == ULONG_MAX)
+			break;
+	}
+
+	return index;
+}
+
 /*
  * Count contiguously cached pages from @offset-1 to @offset-@max,
  * this count is a conservative estimation of
@@ -349,7 +381,7 @@ static pgoff_t count_history_pages(struct address_space *mapping,
 	pgoff_t head;
 
 	rcu_read_lock();
-	head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
+	head = page_cache_prev_hole(mapping, offset - 1, max);
 	rcu_read_unlock();
 
 	return offset - 1 - head;
@@ -428,7 +460,7 @@ ondemand_readahead(struct address_space *mapping,
 		pgoff_t start;
 
 		rcu_read_lock();
-		start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
+		start = page_cache_next_hole(mapping, offset + 1, max);
 		rcu_read_unlock();
 
 		if (!start || start - offset > max)
-- 
1.7.7.6



* [patch 2/5] mm + fs: prepare for non-page entries in page cache
  2012-05-01  8:41 [patch 0/5] refault distance-based file cache sizing Johannes Weiner
  2012-05-01  8:41 ` [patch 1/5] mm: readahead: move radix tree hole searching here Johannes Weiner
@ 2012-05-01  8:41 ` Johannes Weiner
  2012-05-01 19:02   ` Andrew Morton
  2012-05-01  8:41 ` [patch 3/5] mm + fs: store shadow pages " Johannes Weiner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2012-05-01  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Andrea Arcangeli, Peter Zijlstra, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache,
prepare every site dealing with the radix trees directly to handle
non-page entries.

The common lookup functions will filter out non-page entries and
return NULL for page cache holes, just as before.  But provide a raw
version of the API which returns non-page entries as well, and switch
shmem over to use it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c   |    2 +-
 fs/btrfs/extent_io.c     |    3 +-
 fs/inode.c               |    3 +-
 fs/nilfs2/inode.c        |    6 +--
 include/linux/mm.h       |    8 +++
 include/linux/pagemap.h  |   13 +++--
 include/linux/pagevec.h  |    3 +
 include/linux/shmem_fs.h |    1 +
 mm/filemap.c             |  124 +++++++++++++++++++++++++++++++++++++++-------
 mm/mincore.c             |   20 +++++--
 mm/readahead.c           |   12 +++-
 mm/shmem.c               |   89 +++++----------------------------
 mm/swap.c                |   21 ++++++++
 mm/truncate.c            |   71 +++++++++++++++++++++------
 14 files changed, 244 insertions(+), 132 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index d02c27c..a0594c9 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -472,7 +472,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, pg_index);
 		rcu_read_unlock();
-		if (page) {
+		if (page && !radix_tree_exceptional_entry(page)) {
 			misses++;
 			if (misses > 4)
 				break;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a55fbe6..f78352d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3566,7 +3566,8 @@ inline struct page *extent_buffer_page(struct extent_buffer *eb,
 	rcu_read_lock();
 	p = radix_tree_lookup(&mapping->page_tree, i);
 	rcu_read_unlock();
-
+	if (radix_tree_exceptional_entry(p))
+		p = NULL;
 	return p;
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 83ab215..645731f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -544,8 +544,7 @@ static void evict(struct inode *inode)
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
-		if (inode->i_data.nrpages)
-			truncate_inode_pages(&inode->i_data, 0);
+		truncate_inode_pages(&inode->i_data, 0);
 		end_writeback(inode);
 	}
 	if (S_ISBLK(inode->i_mode) && inode->i_bdev)
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 8f7b95a..35b7e18 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -732,16 +732,14 @@ void nilfs_evict_inode(struct inode *inode)
 	int ret;
 
 	if (inode->i_nlink || !ii->i_root || unlikely(is_bad_inode(inode))) {
-		if (inode->i_data.nrpages)
-			truncate_inode_pages(&inode->i_data, 0);
+		truncate_inode_pages(&inode->i_data, 0);
 		end_writeback(inode);
 		nilfs_clear_inode(inode);
 		return;
 	}
 	nilfs_transaction_begin(sb, &ti, 0); /* never fails */
 
-	if (inode->i_data.nrpages)
-		truncate_inode_pages(&inode->i_data, 0);
+	truncate_inode_pages(&inode->i_data, 0);
 
 	/* TODO: some of the following operations may fail.  */
 	nilfs_truncate_bmap(ii, 0);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 17b27cd..0d5948b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -873,6 +873,14 @@ extern bool skip_free_areas_node(unsigned int flags, int nid);
 int shmem_lock(struct file *file, int lock, struct user_struct *user);
 struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags);
 int shmem_zero_setup(struct vm_area_struct *);
+#ifdef CONFIG_SHMEM
+bool shmem_mapping(struct address_space *);
+#else
+static inline bool shmem_mapping(struct address_space *mapping)
+{
+	return false;
+}
+#endif
 
 extern int can_do_mlock(void);
 extern int user_shm_lock(size_t, struct user_struct *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cfaaa69..aba5b91 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -227,12 +227,13 @@ static inline struct page *page_cache_alloc_readahead(struct address_space *x)
 
 typedef int filler_t(void *, struct page *);
 
-extern struct page * find_get_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_lock_page(struct address_space *mapping,
-				pgoff_t index);
-extern struct page * find_or_create_page(struct address_space *mapping,
-				pgoff_t index, gfp_t gfp_mask);
+struct page *__find_get_page(struct address_space *, pgoff_t);
+struct page *find_get_page(struct address_space *, pgoff_t);
+struct page *__find_lock_page(struct address_space *, pgoff_t);
+struct page *find_lock_page(struct address_space *, pgoff_t);
+struct page *find_or_create_page(struct address_space *, pgoff_t, gfp_t);
+unsigned __find_get_pages(struct address_space *, pgoff_t,  unsigned int,
+			  struct page **, pgoff_t *);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 2aa12b8..7520532 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,6 +22,9 @@ struct pagevec {
 
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
+unsigned __pagevec_lookup(struct pagevec *, struct address_space *,
+			  pgoff_t, unsigned, pgoff_t *);
+void pagevec_remove_exceptionals(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 79ab255..69444f1 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -48,6 +48,7 @@ extern struct file *shmem_file_setup(const char *name,
 					loff_t size, unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
 extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern bool shmem_mapping(struct address_space *);
 extern void shmem_unlock_mapping(struct address_space *mapping);
 extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 					pgoff_t index, gfp_t gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index b662757..b8af34a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -431,6 +431,24 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
+static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
+			     struct page *page)
+{
+	void **slot;
+
+	slot = radix_tree_lookup_slot(&mapping->page_tree, offset);
+	if (slot) {
+		void *p;
+
+		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+		if (!radix_tree_exceptional_entry(p))
+			return -EEXIST;
+		radix_tree_replace_slot(slot, page);
+		return 0;
+	}
+	return radix_tree_insert(&mapping->page_tree, offset, page);
+}
+
 /**
  * add_to_page_cache_locked - add a locked page to the pagecache
  * @page:	page to add
@@ -461,7 +479,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		page->index = offset;
 
 		spin_lock_irq(&mapping->tree_lock);
-		error = radix_tree_insert(&mapping->page_tree, offset, page);
+		error = page_cache_insert(mapping, offset, page);
 		if (likely(!error)) {
 			mapping->nrpages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
@@ -664,15 +682,7 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 	}
 }
 
-/**
- * find_get_page - find and get a page reference
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Is there a pagecache struct page at the given (mapping, offset) tuple?
- * If yes, increment its refcount and return it; if no, return NULL.
- */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
 {
 	void **pagep;
 	struct page *page;
@@ -713,22 +723,29 @@ out:
 
 	return page;
 }
-EXPORT_SYMBOL(find_get_page);
+EXPORT_SYMBOL(__find_get_page);
 
 /**
- * find_lock_page - locate, pin and lock a pagecache page
+ * find_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
  *
- * Locates the desired pagecache page, locks it, increments its reference
- * count and returns its address.
- *
- * Returns zero if the page was not present. find_lock_page() may sleep.
+ * Is there a pagecache struct page at the given (mapping, offset) tuple?
+ * If yes, increment its refcount and return it; if no, return NULL.
  */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
 {
-	struct page *page;
+	struct page *page = __find_get_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
+EXPORT_SYMBOL(find_get_page);
 
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page;
 repeat:
 	page = find_get_page(mapping, offset);
 	if (page && !radix_tree_exception(page)) {
@@ -743,6 +760,25 @@ repeat:
 	}
 	return page;
 }
+
+/**
+ * find_lock_page - locate, pin and lock a pagecache page
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Locates the desired pagecache page, locks it, increments its reference
+ * count and returns its address.
+ *
+ * Returns zero if the page was not present. find_lock_page() may sleep.
+ */
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+	struct page *page = __find_lock_page(mapping, offset);
+
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	return page;
+}
 EXPORT_SYMBOL(find_lock_page);
 
 /**
@@ -792,6 +828,54 @@ repeat:
 }
 EXPORT_SYMBOL(find_or_create_page);
 
+unsigned __find_get_pages(struct address_space *mapping,
+			  pgoff_t start, unsigned int nr_pages,
+			  struct page **pages, pgoff_t *indices)
+{
+	unsigned int i;
+	unsigned int ret;
+	unsigned int nr_found;
+
+	rcu_read_lock();
+restart:
+	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+				(void ***)pages, indices, start, nr_pages);
+	ret = 0;
+	for (i = 0; i < nr_found; i++) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot((void **)pages[i]);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				goto restart;
+			/*
+			 * Otherwise, we must be storing a swap entry
+			 * here as an exceptional entry: so return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *((void **)pages[i]))) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = indices[i];
+		pages[ret] = page;
+		ret++;
+	}
+	if (unlikely(!ret && nr_found))
+		goto restart;
+	rcu_read_unlock();
+	return ret;
+}
+
 /**
  * find_get_pages - gang pagecache lookup
  * @mapping:	The address_space to search
@@ -864,8 +948,10 @@ repeat:
 	 * If all entries were removed before we could secure them,
 	 * try again, because callers stop trying once 0 is returned.
 	 */
-	if (unlikely(!ret && nr_found > nr_skip))
+	if (unlikely(!ret && nr_found)) {
+		start += nr_skip;
 		goto restart;
+	}
 	rcu_read_unlock();
 	return ret;
 }
diff --git a/mm/mincore.c b/mm/mincore.c
index 636a868..0c7f093 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -70,13 +70,21 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 	 * any other file mapping (ie. marked !present and faulted in with
 	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
 	 */
-	page = find_get_page(mapping, pgoff);
 #ifdef CONFIG_SWAP
-	/* shmem/tmpfs may return swap: account for swapcache page too. */
-	if (radix_tree_exceptional_entry(page)) {
-		swp_entry_t swap = radix_to_swp_entry(page);
-		page = find_get_page(&swapper_space, swap.val);
-	}
+	if (shmem_mapping(mapping)) {
+		page = __find_get_page(mapping, pgoff);
+		/*
+		 * shmem/tmpfs may return swap: account for swapcache
+		 * page too.
+		 */
+		if (radix_tree_exceptional_entry(page)) {
+			swp_entry_t swap = radix_to_swp_entry(page);
+			page = find_get_page(&swapper_space, swap.val);
+		}
+	} else
+		page = find_get_page(mapping, pgoff);
+#else
+	page = find_get_page(mapping, pgoff);
 #endif
 	if (page) {
 		present = PageUptodate(page);
diff --git a/mm/readahead.c b/mm/readahead.c
index 0d1f1aa..43f9dd2 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -177,7 +177,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
 		rcu_read_unlock();
-		if (page)
+		if (page && !radix_tree_exceptional_entry(page))
 			continue;
 
 		page = page_cache_alloc_readahead(mapping);
@@ -342,7 +342,10 @@ static unsigned long page_cache_next_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index++;
 		if (index == 0)
@@ -358,7 +361,10 @@ static unsigned long page_cache_prev_hole(struct address_space *mapping,
 	unsigned long i;
 
 	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(&mapping->page_tree, index))
+		struct page *page;
+
+		page = radix_tree_lookup(&mapping->page_tree, index);
+		if (!page || radix_tree_exceptional_entry(page))
 			break;
 		index--;
 		if (index == ULONG_MAX)
diff --git a/mm/shmem.c b/mm/shmem.c
index 269d049..6d063eb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -310,57 +310,6 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 }
 
 /*
- * Like find_get_pages, but collecting swap entries as well as pages.
- */
-static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
-					pgoff_t start, unsigned int nr_pages,
-					struct page **pages, pgoff_t *indices)
-{
-	unsigned int i;
-	unsigned int ret;
-	unsigned int nr_found;
-
-	rcu_read_lock();
-restart:
-	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
-				(void ***)pages, indices, start, nr_pages);
-	ret = 0;
-	for (i = 0; i < nr_found; i++) {
-		struct page *page;
-repeat:
-		page = radix_tree_deref_slot((void **)pages[i]);
-		if (unlikely(!page))
-			continue;
-		if (radix_tree_exception(page)) {
-			if (radix_tree_deref_retry(page))
-				goto restart;
-			/*
-			 * Otherwise, we must be storing a swap entry
-			 * here as an exceptional entry: so return it
-			 * without attempting to raise page count.
-			 */
-			goto export;
-		}
-		if (!page_cache_get_speculative(page))
-			goto repeat;
-
-		/* Has the page moved? */
-		if (unlikely(page != *((void **)pages[i]))) {
-			page_cache_release(page);
-			goto repeat;
-		}
-export:
-		indices[ret] = indices[i];
-		pages[ret] = page;
-		ret++;
-	}
-	if (unlikely(!ret && nr_found))
-		goto restart;
-	rcu_read_unlock();
-	return ret;
-}
-
-/*
  * Remove swap entry from radix tree, free the swap and its page cache.
  */
 static int shmem_free_swap(struct address_space *mapping,
@@ -377,21 +326,6 @@ static int shmem_free_swap(struct address_space *mapping,
 }
 
 /*
- * Pagevec may contain swap entries, so shuffle up pages before releasing.
- */
-static void shmem_deswap_pagevec(struct pagevec *pvec)
-{
-	int i, j;
-
-	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
-		if (!radix_tree_exceptional_entry(page))
-			pvec->pages[j++] = page;
-	}
-	pvec->nr = j;
-}
-
-/*
  * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
  */
 void shmem_unlock_mapping(struct address_space *mapping)
@@ -409,12 +343,12 @@ void shmem_unlock_mapping(struct address_space *mapping)
 		 * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
 		 * has finished, if it hits a row of PAGEVEC_SIZE swap entries.
 		 */
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 					PAGEVEC_SIZE, pvec.pages, indices);
 		if (!pvec.nr)
 			break;
 		index = indices[pvec.nr - 1] + 1;
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		check_move_unevictable_pages(pvec.pages, pvec.nr);
 		pagevec_release(&pvec);
 		cond_resched();
@@ -442,7 +376,7 @@ void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 	pagevec_init(&pvec, 0);
 	index = start;
 	while (index <= end) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 							pvec.pages, indices);
 		if (!pvec.nr)
@@ -469,7 +403,7 @@ void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -490,7 +424,7 @@ void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+		pvec.nr = __find_get_pages(mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
 							pvec.pages, indices);
 		if (!pvec.nr) {
@@ -500,7 +434,7 @@ void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 			continue;
 		}
 		if (index == start && indices[0] > end) {
-			shmem_deswap_pagevec(&pvec);
+			pagevec_remove_exceptionals(&pvec);
 			pagevec_release(&pvec);
 			break;
 		}
@@ -525,7 +459,7 @@ void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 			}
 			unlock_page(page);
 		}
-		shmem_deswap_pagevec(&pvec);
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -877,7 +811,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return -EFBIG;
 repeat:
 	swap.val = 0;
-	page = find_lock_page(mapping, index);
+	page = __find_lock_page(mapping, index);
 	if (radix_tree_exceptional_entry(page)) {
 		swap = radix_to_swp_entry(page);
 		page = NULL;
@@ -1025,7 +959,7 @@ unacct:
 	shmem_unacct_blocks(info->flags, 1);
 failed:
 	if (swap.val && error != -EINVAL) {
-		struct page *test = find_get_page(mapping, index);
+		struct page *test = __find_get_page(mapping, index);
 		if (test && !radix_tree_exceptional_entry(test))
 			page_cache_release(test);
 		/* Have another try if the entry has changed */
@@ -1174,6 +1108,11 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	return inode;
 }
 
+bool shmem_mapping(struct address_space *mapping)
+{
+	return mapping->backing_dev_info == &shmem_backing_dev_info;
+}
+
 #ifdef CONFIG_TMPFS
 static const struct inode_operations shmem_symlink_inode_operations;
 static const struct inode_operations shmem_short_symlink_operations;
diff --git a/mm/swap.c b/mm/swap.c
index 14380e9..cc5ce81 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -728,6 +728,27 @@ void __pagevec_lru_add(struct pagevec *pvec, enum lru_list lru)
 }
 EXPORT_SYMBOL(__pagevec_lru_add);
 
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+			  pgoff_t start, unsigned int nr_pages,
+			  pgoff_t *indices)
+{
+	pvec->nr = __find_get_pages(mapping, start, nr_pages,
+				    pvec->pages, indices);
+	return pagevec_count(pvec);
+}
+
+void pagevec_remove_exceptionals(struct pagevec *pvec)
+{
+	int i, j;
+
+	for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		if (!radix_tree_exceptional_entry(page))
+			pvec->pages[j++] = page;
+	}
+	pvec->nr = j;
+}
+
 /**
  * pagevec_lookup - gang pagecache lookup
  * @pvec:	Where the resulting pages are placed
diff --git a/mm/truncate.c b/mm/truncate.c
index 632b15e..d8c8964 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -22,6 +22,18 @@
 #include <linux/cleancache.h>
 #include "internal.h"
 
+static void clear_exceptional_entry(struct address_space *mapping,
+				    pgoff_t index, struct page *page)
+{
+	/* Handled by shmem itself */
+	if (shmem_mapping(mapping))
+		return;
+
+	spin_lock_irq(&mapping->tree_lock);
+	if (radix_tree_lookup(&mapping->page_tree, index) == page)
+		radix_tree_delete(&mapping->page_tree, index);
+	spin_unlock_irq(&mapping->tree_lock);
+}
 
 /**
  * do_invalidatepage - invalidate part or all of a page
@@ -208,31 +220,36 @@ void truncate_inode_pages_range(struct address_space *mapping,
 {
 	const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
 	const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index;
 	pgoff_t end;
 	int i;
 
 	cleancache_flush_inode(mapping);
-	if (mapping->nrpages == 0)
-		return;
 
 	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
 	end = (lend >> PAGE_CACHE_SHIFT);
 
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -243,6 +260,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -262,14 +280,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	index = start;
 	for ( ; ; ) {
 		cond_resched();
-		if (!pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		if (!__pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 			if (index == start)
 				break;
 			index = start;
 			continue;
 		}
-		if (index == start && pvec.pages[0]->index > end) {
+		if (index == start && indices[0] > end) {
 			pagevec_release(&pvec);
 			break;
 		}
@@ -278,16 +297,22 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
 			truncate_inode_page(mapping, page);
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		index++;
@@ -330,6 +355,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index = start;
 	unsigned long ret;
@@ -345,17 +371,23 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 	 */
 
 	pagevec_init(&pvec, 0);
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -369,6 +401,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 				deactivate_page(page);
 			count += ret;
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
@@ -437,6 +470,7 @@ static int do_launder_page(struct address_space *mapping, struct page *page)
 int invalidate_inode_pages2_range(struct address_space *mapping,
 				  pgoff_t start, pgoff_t end)
 {
+	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
 	pgoff_t index;
 	int i;
@@ -447,17 +481,23 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	cleancache_flush_inode(mapping);
 	pagevec_init(&pvec, 0);
 	index = start;
-	while (index <= end && pagevec_lookup(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+	while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+			indices)) {
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
 			/* We rely upon deletion not changing page->index */
-			index = page->index;
+			index = indices[i];
 			if (index > end)
 				break;
 
+			if (radix_tree_exceptional_entry(page)) {
+				clear_exceptional_entry(mapping, index, page);
+				continue;
+			}
+
 			lock_page(page);
 			WARN_ON(page->index != index);
 			if (page->mapping != mapping) {
@@ -495,6 +535,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				ret = ret2;
 			unlock_page(page);
 		}
+		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		mem_cgroup_uncharge_end();
 		cond_resched();
-- 
1.7.7.6



* [patch 3/5] mm + fs: store shadow pages in page cache
  2012-05-01  8:41 [patch 0/5] refault distance-based file cache sizing Johannes Weiner
  2012-05-01  8:41 ` [patch 1/5] mm: readahead: move radix tree hole searching here Johannes Weiner
  2012-05-01  8:41 ` [patch 2/5] mm + fs: prepare for non-page entries in page cache Johannes Weiner
@ 2012-05-01  8:41 ` Johannes Weiner
  2012-05-01  8:41 ` [patch 4/5] mm + fs: provide refault distance to page cache instantiations Johannes Weiner
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2012-05-01  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Andrea Arcangeli, Peter Zijlstra, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

Reclaim will be leaving shadow entries in the page cache radix tree
upon evicting the real page.  As those pages are found through the LRU
rather than through the inode, an iput() can lead to the inode being
freed concurrently.  At that point, reclaim must no longer install
shadow pages, because the inode freeing code needs to ensure that the
page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code
sets under the tree lock before doing the final truncate.  Reclaim
will check for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/inode.c              |    4 ++++
 include/linux/pagemap.h |   13 ++++++++++++-
 mm/filemap.c            |   14 ++++++++++----
 mm/truncate.c           |    2 +-
 mm/vmscan.c             |    2 +-
 5 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 645731f..9be6bac 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -541,6 +541,10 @@ static void evict(struct inode *inode)
 
 	inode_sb_list_del(inode);
 
+	spin_lock_irq(&inode->i_data.tree_lock);
+	mapping_set_exiting(&inode->i_data);
+	spin_unlock_irq(&inode->i_data.tree_lock);
+
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index aba5b91..c1abb88 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -24,6 +24,7 @@ enum mapping_flags {
 	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
+	AS_EXITING	= __GFP_BITS_SHIFT + 4, /* inode is being evicted */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -53,6 +54,16 @@ static inline int mapping_unevictable(struct address_space *mapping)
 	return !!mapping;
 }
 
+static inline void mapping_set_exiting(struct address_space *mapping)
+{
+	set_bit(AS_EXITING, &mapping->flags);
+}
+
+static inline int mapping_exiting(struct address_space *mapping)
+{
+	return test_bit(AS_EXITING, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
@@ -458,7 +469,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page);
+extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index b8af34a..4ca12a3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -111,7 +111,7 @@
  * sure the page is locked and that nobody else uses it - or that usage
  * is safe.  The caller must hold the mapping's tree_lock.
  */
-void __delete_from_page_cache(struct page *page)
+void __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
 
@@ -125,7 +125,13 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_flush_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	if (shadow && !mapping_exiting(mapping)) {
+		void **slot;
+
+		slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+		radix_tree_replace_slot(slot, shadow);
+	} else
+		radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
 	mapping->nrpages--;
@@ -164,7 +170,7 @@ void delete_from_page_cache(struct page *page)
 
 	freepage = mapping->a_ops->freepage;
 	spin_lock_irq(&mapping->tree_lock);
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
@@ -411,7 +417,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->index = offset;
 
 		spin_lock_irq(&mapping->tree_lock);
-		__delete_from_page_cache(old);
+		__delete_from_page_cache(old, NULL);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
 		BUG_ON(error);
 		mapping->nrpages++;
diff --git a/mm/truncate.c b/mm/truncate.c
index d8c8964..0f6f700 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -433,7 +433,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 
 	clear_page_mlock(page);
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page);
+	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
 	mem_cgroup_uncharge_cache_page(page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c52b235..44d81f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -585,7 +585,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 
 		freepage = mapping->a_ops->freepage;
 
-		__delete_from_page_cache(page);
+		__delete_from_page_cache(page, NULL);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
-- 
1.7.7.6



* [patch 4/5] mm + fs: provide refault distance to page cache instantiations
  2012-05-01  8:41 [patch 0/5] refault distance-based file cache sizing Johannes Weiner
                   ` (2 preceding siblings ...)
  2012-05-01  8:41 ` [patch 3/5] mm + fs: store shadow pages " Johannes Weiner
@ 2012-05-01  8:41 ` Johannes Weiner
  2012-05-01  9:30   ` Peter Zijlstra
  2012-05-01  8:41 ` [patch 5/5] mm: refault distance-based file cache sizing Johannes Weiner
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2012-05-01  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Andrea Arcangeli, Peter Zijlstra, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

The page allocator needs to be given the non-residency information
stored in the page cache at the time the page is faulted back in.

Every site that does a find_or_create()-style allocation is converted
to pass this refault information to the page_cache_alloc() family of
functions, which in turn passes it down to the page allocator via
current->refault_distance.

XXX: Pages are charged to memory cgroups only when being added to the
page cache, and, in case of multi-page reads, allocation and addition
happen in separate batches.  To communicate the individual refault
distance to the memory controller, the refault distance is stored in
page->private between allocation and add_to_page_cache().  memcg does
not do anything with it yet, but it will use it when charging requires
reclaiming (hard limit reclaim).

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/btrfs/compression.c  |    8 +++--
 fs/cachefiles/rdwr.c    |   26 +++++++++-----
 fs/ceph/xattr.c         |    2 +-
 fs/logfs/readwrite.c    |    9 +++--
 fs/ntfs/file.c          |   11 ++++--
 fs/splice.c             |   10 +++--
 include/linux/pagemap.h |   28 ++++++++++-----
 include/linux/sched.h   |    1 +
 include/linux/swap.h    |    6 +++
 mm/filemap.c            |   84 ++++++++++++++++++++++++++++++++---------------
 mm/readahead.c          |    7 +++-
 net/ceph/messenger.c    |    2 +-
 net/ceph/pagelist.c     |    4 +-
 net/ceph/pagevec.c      |    2 +-
 14 files changed, 134 insertions(+), 66 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index a0594c9..dc57e09 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -464,6 +464,8 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 	end_index = (i_size_read(inode) - 1) >> PAGE_CACHE_SHIFT;
 
 	while (last_offset < compressed_end) {
+		unsigned long distance;
+
 		pg_index = last_offset >> PAGE_CACHE_SHIFT;
 
 		if (pg_index > end_index)
@@ -478,12 +480,12 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 				break;
 			goto next;
 		}
-
+		distance = workingset_refault_distance(page);
 		page = __page_cache_alloc(mapping_gfp_mask(mapping) &
-								~__GFP_FS);
+					  ~__GFP_FS, distance);
 		if (!page)
 			break;
-
+		set_page_private(page, distance);
 		if (add_to_page_cache_lru(page, mapping, pg_index,
 								GFP_NOFS)) {
 			page_cache_release(page);
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 0e3c092..359be1f 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -12,6 +12,7 @@
 #include <linux/mount.h>
 #include <linux/slab.h>
 #include <linux/file.h>
+#include <linux/swap.h>
 #include "internal.h"
 
 /*
@@ -253,16 +254,18 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
 	newpage = NULL;
 
 	for (;;) {
-		backpage = find_get_page(bmapping, netpage->index);
-		if (backpage)
-			goto backing_page_already_present;
+		unsigned long distance;
 
+		backpage = __find_get_page(bmapping, netpage->index);
+		if (backpage && !radix_tree_exceptional_entry(backpage))
+			goto backing_page_already_present;
+		distance = workingset_refault_distance(backpage);
 		if (!newpage) {
-			newpage = page_cache_alloc_cold(bmapping);
+			newpage = page_cache_alloc_cold(bmapping, distance);
 			if (!newpage)
 				goto nomem_monitor;
 		}
-
+		set_page_private(newpage, distance);
 		ret = add_to_page_cache(newpage, bmapping,
 					netpage->index, GFP_KERNEL);
 		if (ret == 0)
@@ -495,16 +498,19 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 		}
 
 		for (;;) {
-			backpage = find_get_page(bmapping, netpage->index);
-			if (backpage)
-				goto backing_page_already_present;
+			unsigned long distance;
 
+			backpage = __find_get_page(bmapping, netpage->index);
+			if (backpage && !radix_tree_exceptional_entry(backpage))
+				goto backing_page_already_present;
+			distance = workingset_refault_distance(backpage);
 			if (!newpage) {
-				newpage = page_cache_alloc_cold(bmapping);
+				newpage = page_cache_alloc_cold(bmapping,
+								distance);
 				if (!newpage)
 					goto nomem;
 			}
-
+			set_page_private(newpage, distance);
 			ret = add_to_page_cache(newpage, bmapping,
 						netpage->index, GFP_KERNEL);
 			if (ret == 0)
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index a76f697..26e85d1 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -647,7 +647,7 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 			return -ENOMEM;
 		err = -ENOMEM;
 		for (i = 0; i < nr_pages; i++) {
-			pages[i] = __page_cache_alloc(GFP_NOFS);
+			pages[i] = __page_cache_alloc(GFP_NOFS, 0);
 			if (!pages[i]) {
 				nr_pages = i;
 				goto out;
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 4153e65..b70fcfe 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -316,11 +316,14 @@ static struct page *logfs_get_write_page(struct inode *inode, u64 bix,
 	int err;
 
 repeat:
-	page = find_get_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(GFP_NOFS);
+	page = __find_get_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		unsigned long distance = workingset_refault_distance(page);
+
+		page = __page_cache_alloc(GFP_NOFS, distance);
 		if (!page)
 			return NULL;
+		set_page_private(page, distance);
 		err = add_to_page_cache_lru(page, mapping, index, GFP_NOFS);
 		if (unlikely(err)) {
 			page_cache_release(page);
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index c587e2d..ccca902 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -412,15 +412,20 @@ static inline int __ntfs_grab_cache_pages(struct address_space *mapping,
 	BUG_ON(!nr_pages);
 	err = nr = 0;
 	do {
-		pages[nr] = find_lock_page(mapping, index);
-		if (!pages[nr]) {
+		pages[nr] = __find_lock_page(mapping, index);
+		if (!pages[nr] || radix_tree_exceptional_entry(pages[nr])) {
+			unsigned long distance;
+
+			distance = workingset_refault_distance(pages[nr]);
 			if (!*cached_page) {
-				*cached_page = page_cache_alloc(mapping);
+				*cached_page = page_cache_alloc(mapping,
+								distance);
 				if (unlikely(!*cached_page)) {
 					err = -ENOMEM;
 					goto err_out;
 				}
 			}
+			set_page_private(*cached_page, distance);
 			err = add_to_page_cache_lru(*cached_page, mapping, index,
 					GFP_KERNEL);
 			if (unlikely(err)) {
diff --git a/fs/splice.c b/fs/splice.c
index 1ec0493..96cbad0 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -347,15 +347,17 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		 * Page could be there, find_get_pages_contig() breaks on
 		 * the first hole.
 		 */
-		page = find_get_page(mapping, index);
-		if (!page) {
+		page = __find_get_page(mapping, index);
+		if (!page || radix_tree_exceptional_entry(page)) {
+			unsigned long distance;
 			/*
 			 * page didn't exist, allocate one.
 			 */
-			page = page_cache_alloc_cold(mapping);
+			distance = workingset_refault_distance(page);
+			page = page_cache_alloc_cold(mapping, distance);
 			if (!page)
 				break;
-
+			set_page_private(page, distance);
 			error = add_to_page_cache_lru(page, mapping, index,
 						GFP_KERNEL);
 			if (unlikely(error)) {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c1abb88..7ddfc69 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -212,28 +212,38 @@ static inline void page_unfreeze_refs(struct page *page, int count)
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t, unsigned long);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp,
+					      unsigned long refault_distance)
 {
-	return alloc_pages(gfp, 0);
+	struct page *page;
+
+	current->refault_distance = refault_distance;
+	page = alloc_pages(gfp, 0);
+	current->refault_distance = 0;
 }
 #endif
 
-static inline struct page *page_cache_alloc(struct address_space *x)
+static inline struct page *page_cache_alloc(struct address_space *x,
+					    unsigned long refault_distance)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x), refault_distance);
 }
 
-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *x,
+						 unsigned long refault_distance)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD,
+				  refault_distance);
 }
 
-static inline struct page *page_cache_alloc_readahead(struct address_space *x)
+static inline struct page *page_cache_alloc_readahead(struct address_space *x,
+						      unsigned long refault_distance)
 {
 	return __page_cache_alloc(mapping_gfp_mask(x) |
-				  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
+				  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN,
+				  refault_distance);
 }
 
 typedef int filler_t(void *, struct page *);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0657368..f1a984b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1296,6 +1296,7 @@ struct task_struct {
 #endif
 
 	struct mm_struct *mm, *active_mm;
+	unsigned long refault_distance;
 #ifdef CONFIG_COMPAT_BRK
 	unsigned brk_randomized:1;
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3e60228..03d327f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -204,6 +204,12 @@ struct swap_list_t {
 /* Swap 50% full? Release swapcache more aggressively.. */
 #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
 
+/* linux/mm/workingset.c */
+static inline unsigned long workingset_refault_distance(struct page *page)
+{
+	return 0;
+}
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index 4ca12a3..288346a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -468,13 +468,19 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
+	unsigned long distance;
 	int error;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
+	distance = page_private(page);
+	set_page_private(page, 0);
+
+	current->refault_distance = distance;
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
+	current->refault_distance = 0;
 	if (error)
 		goto out;
 
@@ -518,19 +524,21 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, unsigned long refault_distance)
 {
 	int n;
 	struct page *page;
 
+	current->refault_distance = refault_distance;
 	if (cpuset_do_page_mem_spread()) {
 		get_mems_allowed();
 		n = cpuset_mem_spread_node();
 		page = alloc_pages_exact_node(n, gfp, 0);
 		put_mems_allowed();
-		return page;
-	}
-	return alloc_pages(gfp, 0);
+	} else
+		page = alloc_pages(gfp, 0);
+	current->refault_distance = 0;
+	return page;
 }
 EXPORT_SYMBOL(__page_cache_alloc);
 #endif
@@ -810,11 +818,14 @@ struct page *find_or_create_page(struct address_space *mapping,
 	struct page *page;
 	int err;
 repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
+	page = __find_lock_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		unsigned long distance = workingset_refault_distance(page);
+
+		page = __page_cache_alloc(gfp_mask, distance);
 		if (!page)
 			return NULL;
+		set_page_private(page, distance);
 		/*
 		 * We want a regular kernel memory (not highmem or DMA etc)
 		 * allocation for the radix tree nodes, but we need to honour
@@ -1128,16 +1139,22 @@ EXPORT_SYMBOL(find_get_pages_tag);
 struct page *
 grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 {
-	struct page *page = find_get_page(mapping, index);
+	struct page *page = __find_get_page(mapping, index);
+	unsigned long distance;
 
-	if (page) {
+	if (page && !radix_tree_exceptional_entry(page)) {
 		if (trylock_page(page))
 			return page;
 		page_cache_release(page);
 		return NULL;
 	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
-	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
+	distance = workingset_refault_distance(page);
+	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
+				  distance);
+	if (!page)
+		return NULL;
+	set_page_private(page, distance);
+	if (add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 		page_cache_release(page);
 		page = NULL;
 	}
@@ -1199,6 +1216,7 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
 	offset = *ppos & ~PAGE_CACHE_MASK;
 
 	for (;;) {
+		unsigned long distance;
 		struct page *page;
 		pgoff_t end_index;
 		loff_t isize;
@@ -1211,8 +1229,9 @@ find_page:
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
-			page = find_get_page(mapping, index);
-			if (unlikely(page == NULL))
+			page = __find_get_page(mapping, index);
+			if (unlikely(!page ||
+				     radix_tree_exceptional_entry(page)))
 				goto no_cached_page;
 		}
 		if (PageReadahead(page)) {
@@ -1370,11 +1389,13 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		page = page_cache_alloc_cold(mapping);
+		distance = workingset_refault_distance(page);
+		page = page_cache_alloc_cold(mapping, distance);
 		if (!page) {
 			desc->error = -ENOMEM;
 			goto out;
 		}
+		set_page_private(page, distance);
 		error = add_to_page_cache_lru(page, mapping,
 						index, GFP_KERNEL);
 		if (error) {
@@ -1621,21 +1642,23 @@ SYSCALL_ALIAS(sys_readahead, SyS_readahead);
  * page_cache_read - adds requested page to the page cache if not already there
  * @file:	file to read
  * @offset:	page index
+ * @distance:	refault distance
  *
  * This adds the requested page to the page cache if it isn't already there,
  * and schedules an I/O to read in its contents from disk.
  */
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset,
+			   unsigned long distance)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *page; 
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, distance);
 		if (!page)
 			return -ENOMEM;
-
+		set_page_private(page, distance);
 		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
 		if (ret == 0)
 			ret = mapping->a_ops->readpage(file, page);
@@ -1738,6 +1761,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
 	pgoff_t offset = vmf->pgoff;
+	unsigned long distance;
 	struct page *page;
 	pgoff_t size;
 	int ret = 0;
@@ -1763,8 +1787,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 retry_find:
-		page = find_get_page(mapping, offset);
-		if (!page)
+		page = __find_get_page(mapping, offset);
+		if (!page || radix_tree_exceptional_entry(page))
 			goto no_cached_page;
 	}
 
@@ -1807,7 +1831,8 @@ no_cached_page:
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, offset);
+	distance = workingset_refault_distance(page);
+	error = page_cache_read(file, offset, distance);
 
 	/*
 	 * The page we want has now been added to the page cache.
@@ -1901,11 +1926,14 @@ static struct page *__read_cache_page(struct address_space *mapping,
 	struct page *page;
 	int err;
 repeat:
-	page = find_get_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp | __GFP_COLD);
+	page = __find_get_page(mapping, index);
+	if (!page || radix_tree_exceptional_entry(page)) {
+		unsigned long distance = workingset_refault_distance(page);
+
+		page = __page_cache_alloc(gfp | __GFP_COLD, distance);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
+		set_page_private(page, distance);
 		err = add_to_page_cache_lru(page, mapping, index, gfp);
 		if (unlikely(err)) {
 			page_cache_release(page);
@@ -2432,18 +2460,20 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	gfp_t gfp_mask;
 	struct page *page;
 	gfp_t gfp_notmask = 0;
+	unsigned long distance;
 
 	gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
 repeat:
-	page = find_lock_page(mapping, index);
-	if (page)
+	page = __find_lock_page(mapping, index);
+	if (page && !radix_tree_exceptional_entry(page))
 		goto found;
-
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	distance = workingset_refault_distance(page);
+	page = __page_cache_alloc(gfp_mask & ~gfp_notmask, distance);
 	if (!page)
 		return NULL;
+	set_page_private(page, distance);
 	status = add_to_page_cache_lru(page, mapping, index,
 						GFP_KERNEL & ~gfp_notmask);
 	if (unlikely(status)) {
diff --git a/mm/readahead.c b/mm/readahead.c
index 43f9dd2..dc071cc 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -11,6 +11,7 @@
 #include <linux/fs.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
 #include <linux/export.h>
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
@@ -170,6 +171,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	 */
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		pgoff_t page_offset = offset + page_idx;
+		unsigned long distance;
 
 		if (page_offset > end_index)
 			break;
@@ -179,10 +181,11 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		rcu_read_unlock();
 		if (page && !radix_tree_exceptional_entry(page))
 			continue;
-
-		page = page_cache_alloc_readahead(mapping);
+		distance = workingset_refault_distance(page);
+		page = page_cache_alloc_readahead(mapping, distance);
 		if (!page)
 			break;
+		set_page_private(page, distance);
 		page->index = page_offset;
 		list_add(&page->lru, &page_pool);
 		if (page_idx == nr_to_read - lookahead_size)
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index ad5b708..b57933a 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2218,7 +2218,7 @@ struct ceph_messenger *ceph_messenger_create(struct ceph_entity_addr *myaddr,
 
 	/* the zero page is needed if a request is "canceled" while the message
 	 * is being written over the socket */
-	msgr->zero_page = __page_cache_alloc(GFP_KERNEL | __GFP_ZERO);
+	msgr->zero_page = __page_cache_alloc(GFP_KERNEL | __GFP_ZERO, 0);
 	if (!msgr->zero_page) {
 		kfree(msgr);
 		return ERR_PTR(-ENOMEM);
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 13cb409..ffc6289 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -33,7 +33,7 @@ static int ceph_pagelist_addpage(struct ceph_pagelist *pl)
 	struct page *page;
 
 	if (!pl->num_pages_free) {
-		page = __page_cache_alloc(GFP_NOFS);
+		page = __page_cache_alloc(GFP_NOFS, 0);
 	} else {
 		page = list_first_entry(&pl->free_list, struct page, lru);
 		list_del(&page->lru);
@@ -85,7 +85,7 @@ int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space)
 	space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT;   /* conv to num pages */
 
 	while (space > pl->num_pages_free) {
-		struct page *page = __page_cache_alloc(GFP_NOFS);
+		struct page *page = __page_cache_alloc(GFP_NOFS, 0);
 		if (!page)
 			return -ENOMEM;
 		list_add_tail(&page->lru, &pl->free_list);
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index cd9c21d..4bc4ffd 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -79,7 +79,7 @@ struct page **ceph_alloc_page_vector(int num_pages, gfp_t flags)
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 	for (i = 0; i < num_pages; i++) {
-		pages[i] = __page_cache_alloc(flags);
+		pages[i] = __page_cache_alloc(flags, 0);
 		if (pages[i] == NULL) {
 			ceph_release_page_vector(pages, i);
 			return ERR_PTR(-ENOMEM);
-- 
1.7.7.6



* [patch 5/5] mm: refault distance-based file cache sizing
  2012-05-01  8:41 [patch 0/5] refault distance-based file cache sizing Johannes Weiner
                   ` (3 preceding siblings ...)
  2012-05-01  8:41 ` [patch 4/5] mm + fs: provide refault distance to page cache instantiations Johannes Weiner
@ 2012-05-01  8:41 ` Johannes Weiner
  2012-05-01 14:13   ` Minchan Kim
  2012-05-02  1:57   ` Andrea Arcangeli
  2012-05-01 19:08 ` [patch 0/5] " Andrew Morton
  2012-05-16  5:25 ` nai.xia
  6 siblings, 2 replies; 35+ messages in thread
From: Johannes Weiner @ 2012-05-01  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Andrea Arcangeli, Peter Zijlstra, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

To protect frequently used page cache (workingset) from bursts of less
frequently used or one-shot cache, page cache pages are managed on two
linked lists.  The inactive list is where all cache starts out on
fault and ends on reclaim.  Pages that get accessed another time while
on the inactive list get promoted to the active list to protect them
from reclaim.

Right now we have two main problems.

One stems from numa allocation decisions and how the page allocator
and kswapd interact.  The two of them can enter into a perfect loop
where kswapd reclaims from the preferred zone of a task, allowing the
task to continuously allocate from that zone.  Or, the node distance
can lead the allocator to do direct zone reclaim to stay in the
preferred zone.  This may be good for locality, but the task has only
the inactive space of that one zone to get its memory activated.
Forcing the allocator to spread out to lower zones in the right
situation makes the difference between continuous IO to serve the
workingset and taking the numa cost but serving fully from memory.

The other issue is that with the two lists alone, we can never detect
when a new set of data with equal access frequency should be cached if
the size of it is bigger than total/allowed memory minus the active
set.  Currently we have the perfect compromise given those
constraints: the active list is not allowed to grow bigger than the
inactive list.  This means that we can protect cache from reclaim only
up to half of memory, and don't recognize workingset changes that are
bigger than half of memory.
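As an illustration (numbers purely made up): on a machine with 8G of
memory the active list can never grow beyond 4G, and a switch to a new
5G workingset will go unrecognized.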

This patch tries to solve both problems by adding and making use of a
new metric, the refault distance.

Whenever a file page leaves the inactive list, be it through reclaim
or activation, a global counter, the "workingset time", is increased.
When a page is evicted from memory, a snapshot of the current
workingset time is remembered, so that when the page is refaulted
later, we can tell for how long the page has been out of memory.
This is called the refault distance.

The observation then is this: if a page is refaulted after N ticks of
workingset time, the eviction could have been avoided if the active
list had been N pages smaller and that space had been available to
the inactive list instead.

We don't have recent usage information for pages on the active list,
so we cannot explicitly compare the refaulting page to the least
frequently used active page.  Instead, for each refault with a
distance smaller than the size of the active list, we deactivate an
active page.  This way, the refaulted page and the freshly deactivated
page are placed next to each other at the head of the inactive list
and both have an equal chance of getting activated.  Whichever wins is
probably the more frequently used page.
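
To make the comparison concrete, here is a minimal, self-contained
userspace model of the clock, the shadow snapshot and the refault
check (all names and numbers are illustrative; this is not the kernel
code from this series):

/* refault-model.c: illustrative sketch only, not kernel code */
#include <stdio.h>

static unsigned long workingset_time;   /* ticks once per eviction or activation */

/* page leaves memory: remember the current workingset time */
static unsigned long record_eviction(void)
{
        return ++workingset_time;
}

/* page comes back: how much workingset time passed while it was gone? */
static unsigned long refault_distance(unsigned long shadow)
{
        return workingset_time - shadow;
}

int main(void)
{
        unsigned long nr_active = 1000; /* pages on the active list */
        unsigned long shadow = record_eviction();
        unsigned long i;

        /* 600 more pages leave the inactive list before the refault */
        for (i = 0; i < 600; i++)
                record_eviction();

        if (refault_distance(shadow) < nr_active)
                printf("distance %lu < %lu: deactivate one active page\n",
                       refault_distance(shadow), nr_active);
        else
                printf("distance %lu: plain capacity miss\n",
                       refault_distance(shadow));
        return 0;
}

With these numbers the distance is 600, smaller than the 1000 active
pages, so one active page gets deactivated and has to compete with the
refaulted page at the head of the inactive list.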

To ensure that pages are spread across available/allowed zones when
necessary, a per-zone floating proportion of evictions in the system
is maintained, which allows translating the global refault distance of
a page into a distance proportional to the zone's own eviction speed.
When a refaulting page is allocated, for each zone considered in the
first zonelist walk of the allocator, the per-zone distance is
compared to the zone's number of active and free pages.  If the
distance is bigger, the allocator moves on to the next zone to see if
it is less utilized (fewer evictions -> smaller distance, potentially
stale active pages, or even free pages) and thus, unlike the preferred
zone, has the potential to hold the page in memory.  This way,
non-refault allocations and those that would fit into the preferred
zone stay local, but if we see a chance to keep these pages in memory
long-term by spreading them out, we try to use all the space we can
get and sacrifice locality to save disk IO.
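
As a rough example (numbers purely illustrative): if a zone accounts
for 25% of recent evictions, a global refault distance of 400,000
pages translates into a zone-local distance of about 100,000 pages.
If that zone has 60,000 active file pages and 10,000 free pages above
the dirty reserve, 100,000 exceeds 70,000 and the zone is skipped
(WORKINGSET_SKIP); a lower zone with fewer recent evictions or more
free pages ends up with a local distance below its active+free sum and
takes the allocation (WORKINGSET_ALLOC).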

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |    7 ++
 include/linux/swap.h   |    9 ++-
 mm/Makefile            |    1 +
 mm/memcontrol.c        |    3 +
 mm/page_alloc.c        |    7 ++
 mm/swap.c              |    2 +
 mm/vmscan.c            |   80 +++++++++++++---------
 mm/vmstat.c            |    4 +
 mm/workingset.c        |  174 ++++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 249 insertions(+), 38 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 650ba2f..a4da472 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -15,6 +15,7 @@
 #include <linux/seqlock.h>
 #include <linux/nodemask.h>
 #include <linux/pageblock-flags.h>
+#include <linux/proportions.h>
 #include <generated/bounds.h>
 #include <linux/atomic.h>
 #include <asm/page.h>
@@ -115,6 +116,10 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	WORKINGSET_SKIP,
+	WORKINGSET_ALLOC,
+	WORKINGSET_STALE,
+	WORKINGSET_STALE_FORCE,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
@@ -161,6 +166,7 @@ static inline int is_unevictable_lru(enum lru_list lru)
 
 struct lruvec {
 	struct list_head lists[NR_LRU_LISTS];
+	long shrink_active;
 };
 
 /* Mask used at gathering information at once (see memcontrol.c) */
@@ -372,6 +378,7 @@ struct zone {
 	/* Fields commonly accessed by the page reclaim scanner */
 	spinlock_t		lru_lock;
 	struct lruvec		lruvec;
+	struct prop_local_percpu evictions;
 
 	struct zone_reclaim_stat reclaim_stat;
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 03d327f..cf304ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -205,10 +205,11 @@ struct swap_list_t {
 #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
 
 /* linux/mm/workingset.c */
-static inline unsigned long workingset_refault_distance(struct page *page)
-{
-	return 0;
-}
+void *workingset_eviction(struct page *);
+void workingset_activation(struct page *);
+unsigned long workingset_refault_distance(struct page *);
+bool workingset_zone_alloc(struct zone *, unsigned long,
+			   unsigned long *, unsigned long *);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..bd09137 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -13,6 +13,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   page_isolation.o mm_init.o mmu_context.o percpu.o \
+			   workingset.o \
 			   $(mmu-y)
 obj-y += init-mm.o
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 58a08fc..10dc07c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1020,6 +1020,9 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
 	if (mem_cgroup_disabled())
 		return &zone->lruvec;
 
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
 	mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
 	return &mz->lruvec;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a13ded1..a6544c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1711,9 +1711,11 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	unsigned long distance, active;
 
 	classzone_idx = zone_idx(preferred_zone);
 zonelist_scan:
+	distance = active = 0;
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
@@ -1726,6 +1728,11 @@ zonelist_scan:
 		if ((alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    current->refault_distance &&
+		    !workingset_zone_alloc(zone, current->refault_distance,
+					   &distance, &active))
+			continue;
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
diff --git a/mm/swap.c b/mm/swap.c
index cc5ce81..3029b40 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -365,6 +365,8 @@ void mark_page_accessed(struct page *page)
 			PageReferenced(page) && PageLRU(page)) {
 		activate_page(page);
 		ClearPageReferenced(page);
+		if (page_is_file_cache(page))
+			workingset_activation(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44d81f5..a01d123 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -536,7 +536,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+			    bool reclaimed)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -582,10 +583,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
 		swapcache_free(swap, page);
 	} else {
 		void (*freepage)(struct page *);
+		void *shadow = NULL;
 
 		freepage = mapping->a_ops->freepage;
-
-		__delete_from_page_cache(page, NULL);
+		
+		if (reclaimed && page_is_file_cache(page))
+			shadow = workingset_eviction(page);
+		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
 
@@ -608,7 +612,7 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
+	if (__remove_mapping(mapping, page, false)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
@@ -968,7 +972,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
 
 		/*
@@ -1824,43 +1828,51 @@ static inline int inactive_anon_is_low(struct mem_cgroup_zone *mz)
 }
 #endif
 
-static int inactive_file_is_low_global(struct zone *zone)
+static int inactive_file_is_low(unsigned long nr_to_scan,
+				struct mem_cgroup_zone *mz,
+				struct scan_control *sc)
 {
-	unsigned long active, inactive;
-
-	active = zone_page_state(zone, NR_ACTIVE_FILE);
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-
-	return (active > inactive);
-}
+	unsigned long inactive_ratio;
+	unsigned long inactive;
+	struct lruvec *lruvec;
+	unsigned long active;
+	unsigned long gb;
 
-/**
- * inactive_file_is_low - check if file pages need to be deactivated
- * @mz: memory cgroup and zone to check
- *
- * When the system is doing streaming IO, memory pressure here
- * ensures that active file pages get deactivated, until more
- * than half of the file pages are on the inactive list.
- *
- * Once we get to that situation, protect the system's working
- * set from being evicted by disabling active file page aging.
- *
- * This uses a different ratio than the anonymous pages, because
- * the page cache uses a use-once replacement algorithm.
- */
-static int inactive_file_is_low(struct mem_cgroup_zone *mz)
-{
-	if (!scanning_global_lru(mz))
+	if (!global_reclaim(sc)) /* XXX: integrate hard limit reclaim */
 		return mem_cgroup_inactive_file_is_low(mz->mem_cgroup,
 						       mz->zone);
 
-	return inactive_file_is_low_global(mz->zone);
+	lruvec = mem_cgroup_zone_lruvec(mz->zone, sc->target_mem_cgroup);
+	if (lruvec->shrink_active > 0) {
+		inc_zone_state(mz->zone, WORKINGSET_STALE);
+		lruvec->shrink_active -= nr_to_scan;
+		return true;
+	}
+	/*
+	 * Make sure there is always a reasonable amount of inactive
+	 * file pages around to keep the zone reclaimable.
+	 */
+	inactive = zone_nr_lru_pages(mz, LRU_INACTIVE_FILE);
+	active = zone_nr_lru_pages(mz, LRU_ACTIVE_FILE);
+	gb = (inactive + active) >> (30 - PAGE_SHIFT);
+	if (gb)
+		inactive_ratio = int_sqrt(10 * gb);
+	else
+		inactive_ratio = 1;
+	if (inactive * inactive_ratio < active) {
+		inc_zone_state(mz->zone, WORKINGSET_STALE_FORCE);
+		return true;
+	}
+	return false;
 }
 
-static int inactive_list_is_low(struct mem_cgroup_zone *mz, int file)
+static int inactive_list_is_low(unsigned long nr_to_scan,
+				struct mem_cgroup_zone *mz,
+				struct scan_control *sc,
+				int file)
 {
 	if (file)
-		return inactive_file_is_low(mz);
+		return inactive_file_is_low(nr_to_scan, mz, sc);
 	else
 		return inactive_anon_is_low(mz);
 }
@@ -1872,7 +1884,7 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	int file = is_file_lru(lru);
 
 	if (is_active_lru(lru)) {
-		if (inactive_list_is_low(mz, file))
+		if (inactive_list_is_low(nr_to_scan, mz, sc, file))
 			shrink_active_list(nr_to_scan, mz, sc, priority, file);
 		return 0;
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f600557..28f4b90 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -718,6 +718,10 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
+	"workingset_skip",
+	"workingset_alloc",
+	"workingset_stale",
+	"workingset_stale_force",
 	"nr_anon_transparent_hugepages",
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 0000000..2fc9ac6
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,174 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2012 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ * Monotonic workingset clock for non-resident pages.  Each page
+ * leaving the inactive list (eviction or activation) is one tick.
+ *
+ * The refault distance of a page is the number of ticks that
+ * occurred between eviction and refault.
+ *
+ * If the inactive list had been bigger by the refault distance in
+ * pages, the refault would not have happened.  Or put differently, if
+ * the distance is smaller than the number of active file pages, an
+ * active page needs to be deactivated so that both pages get an equal
+ * chance for activation when there is not enough memory for both.
+ */
+static atomic_t workingset_time;
+
+/*
+ * Per-zone proportional eviction counter to keep track of recent zone
+ * eviction speed and be able to calculate per-zone refault distances.
+ */
+static struct prop_descriptor global_evictions;
+
+/*
+ * Workingset time snapshots are stored in the page cache radix tree
+ * as exceptional entries.
+ */
+#define EV_SHIFT	RADIX_TREE_EXCEPTIONAL_SHIFT
+#define EV_MASK		(~0UL >> EV_SHIFT)
+
+void *workingset_eviction(struct page *page)
+{
+	unsigned long time;
+
+	prop_inc_percpu(&global_evictions, &page_zone(page)->evictions);
+	time = (unsigned int)atomic_inc_return(&workingset_time);
+
+	return (void *)((time << EV_SHIFT) | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+void workingset_activation(struct page *page)
+{
+	struct lruvec *lruvec;
+	/*
+	 * Refault distance is compared to the number of active pages,
+	 * but pages activated after the eviction were hardly the
+	 * reason for memory shortness back then.  Advancing the clock
+	 * on activation compensates for them, so that we compare to
+	 * the number of active pages at time of eviction.
+	 */
+	atomic_inc(&workingset_time);
+	/*
+	 * Furthermore, activations mean that the inactive list is big
+	 * enough and that a new workingset is being adapted already.
+	 * Deactivation is no longer necessary; even harmful.
+	 */
+	lruvec = mem_cgroup_zone_lruvec(page_zone(page), NULL);
+	if (lruvec->shrink_active > 0)
+		lruvec->shrink_active--;
+}
+
+unsigned long workingset_refault_distance(struct page *page)
+{
+	unsigned long time_of_eviction;
+	unsigned long now;
+
+	if (!page)
+		return 0;
+
+	BUG_ON(!radix_tree_exceptional_entry(page));
+
+	time_of_eviction = (unsigned long)page >> EV_SHIFT;
+	now = (unsigned int)atomic_read(&workingset_time) & EV_MASK;
+
+	return (now - time_of_eviction) & EV_MASK;
+}
+EXPORT_SYMBOL(workingset_refault_distance);
+
+bool workingset_zone_alloc(struct zone *zone, unsigned long refault_distance,
+			   unsigned long *pdistance, unsigned long *pactive)
+{
+	unsigned long zone_active;
+	unsigned long zone_free;
+	unsigned long missing;
+	long denominator;
+	long numerator;
+
+	/*
+	 * Don't put refaulting pages into zones that are already
+	 * heavily reclaimed and don't have the potential to hold all
+	 * the workingset.  Instead go for zones where the zone-local
+	 * distance is smaller than the potential inactive list space.
+	 * This can be either because there has not been much reclaim
+	 * recently (small distance), because the zone is not actually
+	 * full (free pages), or because there are just genuinely a
+	 * lot of active pages that may be used less frequently than
+	 * the refaulting page.  Either way, use this potential to
+	 * hold the refaulting page long-term instead of beating on
+	 * already thrashing higher zones.
+	 */
+	prop_fraction_percpu(&global_evictions, &zone->evictions,
+			     &numerator, &denominator);
+	missing = refault_distance * numerator;
+	do_div(missing, denominator);
+	*pdistance += missing;
+
+	zone_active = zone_page_state(zone, NR_ACTIVE_FILE);
+	*pactive += zone_active;
+
+	/*
+	 * Lower zones may not even be full, and free pages are
+	 * potential inactive space, too.  But the dirty reserve is
+	 * not available to page cache due to lowmem reserves and the
+	 * kswapd watermark.  Don't include it.
+	 */
+	zone_free = zone_page_state(zone, NR_FREE_PAGES);
+	if (zone_free > zone->dirty_balance_reserve)
+		zone_free -= zone->dirty_balance_reserve;
+	else
+		zone_free = 0;
+
+	if (missing >= zone_active + zone_free) {
+		inc_zone_state(zone, WORKINGSET_SKIP);
+		return false;
+	}
+
+	inc_zone_state(zone, WORKINGSET_ALLOC);
+
+	/*
+	 * Okay, placement in this zone makes sense, but don't start
+	 * actually deactivating pages until all allowed zones are
+	 * under equalized pressure, or risk throwing out active pages
+	 * from a barely used zone even when the refaulting data set
+	 * is bigger than the available memory.  To prevent that, look
+	 * at the cumulative distance and active pages of all zones
+	 * already visited, which normalizes the distance for the case
+	 * when higher zones are thrashing and we just started putting
+	 * pages in the lower ones.
+	 */
+	if (*pdistance < *pactive) {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_zone_lruvec(zone, NULL);
+		lruvec->shrink_active++;
+	}
+	return true;
+}
+
+static int __init workingset_init(void)
+{
+	extern unsigned long global_dirtyable_memory(void);
+	struct zone *zone;
+	int shift;
+
+	shift = ilog2(global_dirtyable_memory() - 1);
+	prop_descriptor_init(&global_evictions, shift);
+	for_each_zone(zone)
+		prop_local_init_percpu(&zone->evictions);
+	return 0;
+}
+
+module_init(workingset_init);
-- 
1.7.7.6



* Re: [patch 4/5] mm + fs: provide refault distance to page cache instantiations
  2012-05-01  8:41 ` [patch 4/5] mm + fs: provide refault distance to page cache instantiations Johannes Weiner
@ 2012-05-01  9:30   ` Peter Zijlstra
  2012-05-01  9:55     ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2012-05-01  9:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue, 2012-05-01 at 10:41 +0200, Johannes Weiner wrote:
> Every site that does a find_or_create()-style allocation is converted
> to pass this refault information to the page_cache_alloc() family of
> functions, which in turn passes it down to the page allocator via
> current->refault_distance. 

That is rather icky..


* Re: [patch 4/5] mm + fs: provide refault distance to page cache instantiations
  2012-05-01  9:30   ` Peter Zijlstra
@ 2012-05-01  9:55     ` Johannes Weiner
  2012-05-01  9:58       ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2012-05-01  9:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue, May 01, 2012 at 11:30:40AM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-01 at 10:41 +0200, Johannes Weiner wrote:
> > Every site that does a find_or_create()-style allocation is converted
> > to pass this refault information to the page_cache_alloc() family of
> > functions, which in turn passes it down to the page allocator via
> > current->refault_distance. 
> 
> That is rather icky..

Fair enough, I just went with the easier solution to get things off
the ground instead of dealing with adding an extra parameter to layers
of config-dependent gfp API.  I'll revisit this for v2.


* Re: [patch 4/5] mm + fs: provide refault distance to page cache instantiations
  2012-05-01  9:55     ` Johannes Weiner
@ 2012-05-01  9:58       ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2012-05-01  9:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue, 2012-05-01 at 11:55 +0200, Johannes Weiner wrote:
> On Tue, May 01, 2012 at 11:30:40AM +0200, Peter Zijlstra wrote:
> > On Tue, 2012-05-01 at 10:41 +0200, Johannes Weiner wrote:
> > > Every site that does a find_or_create()-style allocation is converted
> > > to pass this refault information to the page_cache_alloc() family of
> > > functions, which in turn passes it down to the page allocator via
> > > current->refault_distance. 
> > 
> > That is rather icky..
> 
> Fair enough, I just went with the easier solution to get things off
> the ground instead of dealing with adding an extra parameter to layers
> of config-dependent gfp API.  I'll revisit this for v2.

OK, thanks!


* Re: [patch 5/5] mm: refault distance-based file cache sizing
  2012-05-01  8:41 ` [patch 5/5] mm: refault distance-based file cache sizing Johannes Weiner
@ 2012-05-01 14:13   ` Minchan Kim
  2012-05-01 15:38     ` Johannes Weiner
  2012-05-02  1:57   ` Andrea Arcangeli
  1 sibling, 1 reply; 35+ messages in thread
From: Minchan Kim @ 2012-05-01 14:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel

Hi Hannes,

On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> To protect frequently used page cache (workingset) from bursts of less
> frequently used or one-shot cache, page cache pages are managed on two
> linked lists.  The inactive list is where all cache starts out on
> fault and ends on reclaim.  Pages that get accessed another time while
> on the inactive list get promoted to the active list to protect them
> from reclaim.
> 
> Right now we have two main problems.
> 
> One stems from numa allocation decisions and how the page allocator
> and kswapd interact.  The two of them can enter into a perfect loop
> where kswapd reclaims from the preferred zone of a task, allowing the
> task to continuously allocate from that zone.  Or, the node distance
> can lead the allocator to do direct zone reclaim to stay in the
> preferred zone.  This may be good for locality, but the task has only

Understood.

> the inactive space of that one zone to get its memory activated.
> Forcing the allocator to spread out to lower zones in the right
> situation makes the difference between continuous IO to serve the
> workingset and taking the numa cost but serving fully from memory.

It's hard for me to parse your wording.
Could you elaborate on it?
It would be good if you explained it with an example.

> 
> The other issue is that with the two lists alone, we can never detect
> when a new set of data with equal access frequency should be cached if
> the size of it is bigger than total/allowed memory minus the active
> set.  Currently we have the perfect compromise given those
> constraints: the active list is not allowed to grow bigger than the
> inactive list.  This means that we can protect cache from reclaim only

Okay.

> up to half of memory, and don't recognize workingset changes that are
> bigger than half of memory.

Workingset change?
You mean that if the new workingset is bigger than half of memory and
behaves like a stream before being re-touched, we could cache only part
of the working set because its head pages would be discarded by its
tail pages on the inactive list?

I'm not sure I completely parsed your point.
Could you explain it in detail?  Before reading your approach and diving
into the code, I would like to see the problem clearly.

Thanks.
 


* Re: [patch 5/5] mm: refault distance-based file cache sizing
  2012-05-01 14:13   ` Minchan Kim
@ 2012-05-01 15:38     ` Johannes Weiner
  2012-05-02  5:21       ` Minchan Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2012-05-01 15:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel

On Tue, May 01, 2012 at 11:13:30PM +0900, Minchan Kim wrote:
> Hi Hannes,
> 
> On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> > To protect frequently used page cache (workingset) from bursts of less
> > frequently used or one-shot cache, page cache pages are managed on two
> > linked lists.  The inactive list is where all cache starts out on
> > fault and ends on reclaim.  Pages that get accessed another time while
> > on the inactive list get promoted to the active list to protect them
> > from reclaim.
> > 
> > Right now we have two main problems.
> > 
> > One stems from numa allocation decisions and how the page allocator
> > and kswapd interact.  The two of them can enter into a perfect loop
> > where kswapd reclaims from the preferred zone of a task, allowing the
> > task to continuously allocate from that zone.  Or, the node distance
> > can lead the allocator to do direct zone reclaim to stay in the
> > preferred zone.  This may be good for locality, but the task has only
> 
> Understood.
> 
> > the inactive space of that one zone to get its memory activated.
> > Forcing the allocator to spread out to lower zones in the right
> > situation makes the difference between continuous IO to serve the
> > workingset and taking the numa cost but serving fully from memory.
> 
> It's hard for me to parse your wording.
> Could you elaborate on it?
> It would be good if you explained it with an example.

Say your Normal zone is 4G (DMA32 also 4G) and you have 2G of active
file pages in Normal and DMA32 is full of other stuff.  Now you access
a new 6G file repeatedly.  First it allocates from Normal (preferred),
then tries DMA32 (full), wakes up kswapd and retries all zones.  If
kswapd then frees pages at roughly the same pace as the allocator
allocates from Normal, kswapd never goes to sleep and evicts pages
from the 6G file before they can get accessed a second time.  Even
though the 6G file could fit in memory (4G Normal + 4G DMA32), the
allocator only uses the 4G Normal zone.

The same applies if you have a load that would fit in the memory of
two nodes, but the node distance leads the allocator to do
zone_reclaim(), forcing the pages to stay on one node.  Again, this
prevents the load from being fully cached in memory, which is much
more expensive than the foreign node cost.

> > up to half of memory, and don't recognize workingset changes that are
> > bigger than half of memory.
> 
> Workingset change?
> You mean that if the new workingset is bigger than half of memory and
> behaves like a stream before being re-touched, we could cache only part
> of the working set because its head pages would be discarded by its
> tail pages on the inactive list?

Spot-on.  I called that 'tail-chasing' in my notes :-) You are in a
perpetual loop of evicting pages that you will need again a couple
hundred page faults later.  Those couple hundred page faults are the
refault distance, and my code is able to detect these loops and
increase the space available to the inactive list to end them, if
possible.

This is the whole principle of the series.

If such a loop is recognized in a single zone, the allocator goes for
lower zones to increase the inactive space.  If such a loop is
recognized over all allowed zones in the zonelist, the active lists
are shrunk to increase the inactive space.
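
As a back-of-the-envelope example (illustrative numbers, assuming a
pure streaming loop with no activations): with a 1G active list, a 1G
inactive list and a 1.5G file accessed in a loop, a page evicted from
the inactive list is refaulted after roughly 1.5G - 1G = 0.5G worth of
further evictions, so its refault distance corresponds to about 0.5G.
That is smaller than the 1G active list, so each such refault
deactivates an active page and the inactive space keeps growing until
the loop either fits or has fully challenged the stale active pages.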


* Re: [patch 2/5] mm + fs: prepare for non-page entries in page cache
  2012-05-01  8:41 ` [patch 2/5] mm + fs: prepare for non-page entries in page cache Johannes Weiner
@ 2012-05-01 19:02   ` Andrew Morton
  2012-05-01 20:15     ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2012-05-01 19:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue,  1 May 2012 10:41:50 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -544,8 +544,7 @@ static void evict(struct inode *inode)
>  	if (op->evict_inode) {
>  		op->evict_inode(inode);
>  	} else {
> -		if (inode->i_data.nrpages)
> -			truncate_inode_pages(&inode->i_data, 0);
> +		truncate_inode_pages(&inode->i_data, 0);

Why did we lose this optimisation?


* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-01  8:41 [patch 0/5] refault distance-based file cache sizing Johannes Weiner
                   ` (4 preceding siblings ...)
  2012-05-01  8:41 ` [patch 5/5] mm: refault distance-based file cache sizing Johannes Weiner
@ 2012-05-01 19:08 ` Andrew Morton
  2012-05-01 21:19   ` Rik van Riel
  2012-05-16  5:25 ` nai.xia
  6 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2012-05-01 19:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue,  1 May 2012 10:41:48 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> This series stores file cache eviction information in the vacated page
> cache radix tree slots and uses it on refault to see if the pages
> currently on the active list need to have their status challenged.

So we no longer free the radix-tree node when everything under it has
been reclaimed?  One could create workloads which would result in a
tremendous amount of memory used by radix_tree_node_cachep objects.

So I assume these things get thrown away at some point.  Some
discussion about the life-cycle here would be useful.



* Re: [patch 2/5] mm + fs: prepare for non-page entries in page cache
  2012-05-01 19:02   ` Andrew Morton
@ 2012-05-01 20:15     ` Johannes Weiner
  2012-05-01 20:24       ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2012-05-01 20:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue, May 01, 2012 at 12:02:46PM -0700, Andrew Morton wrote:
> On Tue,  1 May 2012 10:41:50 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -544,8 +544,7 @@ static void evict(struct inode *inode)
> >  	if (op->evict_inode) {
> >  		op->evict_inode(inode);
> >  	} else {
> > -		if (inode->i_data.nrpages)
> > -			truncate_inode_pages(&inode->i_data, 0);
> > +		truncate_inode_pages(&inode->i_data, 0);
> 
> Why did we lose this optimisation?

For inodes with only shadow pages remaining in the tree, because there
is no separate counter for them.  Otherwise, we'd leak the tree nodes.

I had mapping->nrshadows at first to keep truncation conditional, but
thought that using an extra word per cached inode would be worse than
removing this optimization.  There is not too much being done when the
tree is empty.

Another solution would be to include the shadows count in ->nrpages,
but filesystems use this counter for various other purposes.

Do you think it's worth reconsidering?


* Re: [patch 2/5] mm + fs: prepare for non-page entries in page cache
  2012-05-01 20:15     ` Johannes Weiner
@ 2012-05-01 20:24       ` Andrew Morton
  2012-05-01 21:14         ` Rik van Riel
  2012-05-01 21:29         ` Johannes Weiner
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2012-05-01 20:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue, 1 May 2012 22:15:04 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, May 01, 2012 at 12:02:46PM -0700, Andrew Morton wrote:
> > On Tue,  1 May 2012 10:41:50 +0200
> > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -544,8 +544,7 @@ static void evict(struct inode *inode)
> > >  	if (op->evict_inode) {
> > >  		op->evict_inode(inode);
> > >  	} else {
> > > -		if (inode->i_data.nrpages)
> > > -			truncate_inode_pages(&inode->i_data, 0);
> > > +		truncate_inode_pages(&inode->i_data, 0);
> > 
> > Why did we lose this optimisation?
> 
> For inodes with only shadow pages remaining in the tree, because there
> is no separate counter for them.  Otherwise, we'd leak the tree nodes.
> 
> I had mapping->nrshadows at first to keep truncation conditional, but
> thought that using an extra word per cached inode would be worse than
> removing this optimization.  There is not too much being done when the
> tree is empty.
> 
> Another solution would be to include the shadows count in ->nrpages,
> but filesystems use this counter for various other purposes.
> 
> Do you think it's worth reconsidering?

It doesn't sound like it's worth adding ->nrshadows for only that
reason.

That's a pretty significant alteration in the meaning of ->nrpages. 
Did this not have any other effects?

What does truncate do?  I assume it invalidates shadow page entries in
the radix tree?  And frees the radix-tree nodes?

The patchset will make lookups slower in some (probably obscure)
circumstances, due to the additional radix-tree nodes.

I assume that if a pagecache lookup encounters a radix-tree node which
contains no real pages, the search will terminate at that point?  We
don't pointlessly go all the way down to the leaf nodes?



* Re: [patch 1/5] mm: readahead: move radix tree hole searching here
  2012-05-01  8:41 ` [patch 1/5] mm: readahead: move radix tree hole searching here Johannes Weiner
@ 2012-05-01 21:06   ` Rik van Riel
  0 siblings, 0 replies; 35+ messages in thread
From: Rik van Riel @ 2012-05-01 21:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrea Arcangeli, Peter Zijlstra, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On 05/01/2012 04:41 AM, Johannes Weiner wrote:
> The readahead code searches the page cache for non-present pages, or
> holes, to get a picture of the area surrounding a fault e.g.
>
> For this it sufficed to rely on the radix tree definition of holes,
> which is "empty tree slot".  This is about to change, though, because
> shadow page descriptors will be stored in the page cache when the real
> pages get evicted from memory.
>
> Fortunately, nobody outside the readahead code uses these functions
> and they have no internal knowledge of the radix tree structures, so
> just move them over to mm/readahead.c where they can later be adapted
> to handle the new definition of "page cache hole".
>
> Signed-off-by: Johannes Weiner<hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


* Re: [patch 2/5] mm + fs: prepare for non-page entries in page cache
  2012-05-01 20:24       ` Andrew Morton
@ 2012-05-01 21:14         ` Rik van Riel
  2012-05-01 21:29         ` Johannes Weiner
  1 sibling, 0 replies; 35+ messages in thread
From: Rik van Riel @ 2012-05-01 21:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, linux-mm, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On 05/01/2012 04:24 PM, Andrew Morton wrote:

> That's a pretty significant alteration in the meaning of ->nrpages.
> Did this not have any other effects?

From what I see (though it's been a long day), ->nrpages stays the
same as it is now.

The non-page entries are simply not counted in ->nrpages.

> What does truncate do?  I assume it invalidates shadow page entries in
> the radix tree?  And frees the radix-tree nodes?

Indeed, truncate will get rid of the non-page entries
in the radix tree.  That is why it needs to be called
even if ->nrpages==0.

-- 
All rights reversed


* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-01 19:08 ` [patch 0/5] " Andrew Morton
@ 2012-05-01 21:19   ` Rik van Riel
  2012-05-01 21:26     ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2012-05-01 21:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, linux-mm, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On 05/01/2012 03:08 PM, Andrew Morton wrote:
> On Tue,  1 May 2012 10:41:48 +0200
> Johannes Weiner<hannes@cmpxchg.org>  wrote:
>
>> This series stores file cache eviction information in the vacated page
>> cache radix tree slots and uses it on refault to see if the pages
>> currently on the active list need to have their status challenged.
>
> So we no longer free the radix-tree node when everything under it has
> been reclaimed?  One could create workloads which would result in a
> tremendous amount of memory used by radix_tree_node_cachep objects.
>
> So I assume these things get thrown away at some point.  Some
> discussion about the life-cycle here would be useful.

I assume that in the current codebase Johannes has, we would
have to rely on the inode cache shrinker to reclaim the inode
and throw out the radix tree nodes.

Having a better way to get radix tree nodes that contain only stale
entries (where the evicted pages would no longer receive special
treatment on re-fault, because it has been so long) reclaimed would be
nice for a future version.

Probably not too urgent, though...

-- 
All rights reversed


* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-01 21:19   ` Rik van Riel
@ 2012-05-01 21:26     ` Andrew Morton
  2012-05-02  1:10       ` Andrea Arcangeli
  2012-05-03 13:15       ` Johannes Weiner
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2012-05-01 21:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, linux-mm, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue, 01 May 2012 17:19:16 -0400
Rik van Riel <riel@redhat.com> wrote:

> On 05/01/2012 03:08 PM, Andrew Morton wrote:
> > On Tue,  1 May 2012 10:41:48 +0200
> > Johannes Weiner<hannes@cmpxchg.org>  wrote:
> >
> >> This series stores file cache eviction information in the vacated page
> >> cache radix tree slots and uses it on refault to see if the pages
> >> currently on the active list need to have their status challenged.
> >
> > So we no longer free the radix-tree node when everything under it has
> > been reclaimed?  One could create workloads which would result in a
> > tremendous amount of memory used by radix_tree_node_cachep objects.
> >
> > So I assume these things get thrown away at some point.  Some
> > discussion about the life-cycle here would be useful.
> 
> I assume that in the current codebase Johannes has, we would
> have to rely on the inode cache shrinker to reclaim the inode
> and throw out the radix tree nodes.
> 
> Having a better way to deal with radix tree nodes that contain
> stale entries (where the evicted pages would no longer receive
> special treatment on re-fault, because it has been so long) get
> reclaimed would be nice for a future version.
> 

Well, think of a stupid workload which creates a large number of very
large but sparse files (populated with one page in each 64, for
example).  Get them all in cache, then sit there touching the inodes to
keep them fresh.  What's the worst case here?




* Re: [patch 2/5] mm + fs: prepare for non-page entries in page cache
  2012-05-01 20:24       ` Andrew Morton
  2012-05-01 21:14         ` Rik van Riel
@ 2012-05-01 21:29         ` Johannes Weiner
  1 sibling, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2012-05-01 21:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue, May 01, 2012 at 01:24:49PM -0700, Andrew Morton wrote:
> On Tue, 1 May 2012 22:15:04 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Tue, May 01, 2012 at 12:02:46PM -0700, Andrew Morton wrote:
> > > On Tue,  1 May 2012 10:41:50 +0200
> > > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > 
> > > > --- a/fs/inode.c
> > > > +++ b/fs/inode.c
> > > > @@ -544,8 +544,7 @@ static void evict(struct inode *inode)
> > > >  	if (op->evict_inode) {
> > > >  		op->evict_inode(inode);
> > > >  	} else {
> > > > -		if (inode->i_data.nrpages)
> > > > -			truncate_inode_pages(&inode->i_data, 0);
> > > > +		truncate_inode_pages(&inode->i_data, 0);
> > > 
> > > Why did we lose this optimisation?
> > 
> > For inodes with only shadow pages remaining in the tree, because there
> > is no separate counter for them.  Otherwise, we'd leak the tree nodes.
> > 
> > I had mapping->nrshadows at first to keep truncation conditional, but
> > thought that using an extra word per cached inode would be worse than
> > removing this optimization.  There is not too much being done when the
> > tree is empty.
> > 
> > Another solution would be to include the shadows count in ->nrpages,
> > but filesystems use this counter for various other purposes.
> > 
> > Do you think it's worth reconsidering?
> 
> It doesn't sound like it's worth adding ->nrshadows for only that
> reason.
> 
> That's a pretty significant alteration in the meaning of ->nrpages. 
> Did this not have any other effects?

It still means "number of page entries in radix tree", just that the
radix tree can be non-empty when this count drops to zero.  AFAICS,
it's used when writing/syncing the inode or when gathering statistics,
like nr_blockdev_pages().  They only care about actual pages.  It's
just the final truncate that has to make sure to remove the non-pages
as well.

> What does truncate do?  I assume it invalidates shadow page entries in
> the radix tree?  And frees the radix-tree nodes?

Yes, it just does a radix_tree_delete() on these shadow page entries,
see clear_exceptional_entry() in mm/truncate.c.  This garbage-collects
empty nodes.

> The patchset will make lookups slower in some (probably obscure)
> circumstances, due to the additional radix-tree nodes.
> 
> I assume that if a pagecache lookup encounters a radix-tree node which
> contains no real pages, the search will terminate at that point?  We
> don't pointlessly go all the way down to the leaf nodes?

When reading/instantiating it's not pointless.  Empty slots or shadow
slots are faults and we have to retrieve the shadow entries to
calculate the refault distance.

When writing: dirtied pages are tagged, and tags are propagated
upwards in the tree, so we don't check any more nodes than before.
For leaf nodes (64 slots) where shadow entries and dirty pages mix,
the cost of skipping shadow entries is a bit bigger than that of empty
slots (see the radix_tree_exception branch in find_get_pages(), which
Hugh added to handle shmem's swap entries).

Then there are filesystems that do page cache lookups for population
analysis/heuristics; they could indeed pointlessly descend to leaf
nodes that only contain non-page entries.  I haven't investigated yet
how hot these paths actually are.  If this turns out to be a problem,
we could add another tag and trade tree size for performance.


* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-01 21:26     ` Andrew Morton
@ 2012-05-02  1:10       ` Andrea Arcangeli
  2012-05-03 13:15       ` Johannes Weiner
  1 sibling, 0 replies; 35+ messages in thread
From: Andrea Arcangeli @ 2012-05-02  1:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, linux-mm, Peter Zijlstra,
	Mel Gorman, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

Hi,

On Tue, May 01, 2012 at 02:26:56PM -0700, Andrew Morton wrote:
> Well, think of a stupid workload which creates a large number of very
> large but sparse files (populated with one page in each 64, for
> example).  Get them all in cache, then sit there touching the inodes to
> keep then fresh.  What's the worst case here?

I suspect that in that scenario we may drop more inodes than before,
and a ton of their cache with them, and actually worsen the LRU
behavior instead of improving it.

I don't think it's a reliability issue, or we would probably have been
bitten by it already, especially with a ton of inodes that each have
just one page at a very large file offset accessed in a loop.  This
only makes a badness we already have more sticky.  Testing it, for
sure, wouldn't be a bad idea though.

At first glance it sounds like a good tradeoff: normally the
"worsening" effect (too many large radix trees leading to more inodes
being dropped than before) shouldn't materialize, and we'd just make
better use of the memory we already allocated to make more accurate
decisions on the active/inactive LRU balancing.


* Re: [patch 5/5] mm: refault distance-based file cache sizing
  2012-05-01  8:41 ` [patch 5/5] mm: refault distance-based file cache sizing Johannes Weiner
  2012-05-01 14:13   ` Minchan Kim
@ 2012-05-02  1:57   ` Andrea Arcangeli
  2012-05-02  6:23     ` Johannes Weiner
  1 sibling, 1 reply; 35+ messages in thread
From: Andrea Arcangeli @ 2012-05-02  1:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> frequently used active page.  Instead, for each refault with a
> distance smaller than the size of the active list, we deactivate an

Shouldn't this be the size of active list + size of inactive list?

If the active list is 500M, inactive 500M and the new working set is
600M, the refault distance will be 600M, it won't be smaller than the
size of the active list, and it won't deactivate the active list as it
should and it won't be detected as working set.

Only the refault distance bigger than inactive+active should not
deactivate the active list if I understand how this works correctly.

> @@ -1726,6 +1728,11 @@ zonelist_scan:
>  		if ((alloc_flags & ALLOC_CPUSET) &&
>  			!cpuset_zone_allowed_softwall(zone, gfp_mask))
>  				continue;
> +		if ((alloc_flags & ALLOC_WMARK_LOW) &&
> +		    current->refault_distance &&
> +		    !workingset_zone_alloc(zone, current->refault_distance,
> +					   &distance, &active))
> +			continue;
>  		/*
>  		 * When allocating a page cache page for writing, we
>  		 * want to get it from a zone that is within its dirty

It's a bit hard to see how this may not run oom prematurely if the
distance is always bigger, this is just an implementation question and
maybe I'm missing a fallback somewhere where we actually allocate
memory from whatever place in case no place is ideal.

> +	/*
> +	 * Lower zones may not even be full, and free pages are
> +	 * potential inactive space, too.  But the dirty reserve is
> +	 * not available to page cache due to lowmem reserves and the
> +	 * kswapd watermark.  Don't include it.
> +	 */
> +	zone_free = zone_page_state(zone, NR_FREE_PAGES);
> +	if (zone_free > zone->dirty_balance_reserve)
> +		zone_free -= zone->dirty_balance_reserve;
> +	else
> +		zone_free = 0;

Maybe also remove the high wmark from the sum? It can be some hundred
meg so it's better to take it into account, to have a more accurate
math and locate the best zone that surely fits.

For the same reason it looks like the lowmem reserve should also be
taken into account, on the full sum.

> +	if (missing >= zone_active + zone_free) {

This seems a place where to add the zone_inactive too according to my
comment on top.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 5/5] mm: refault distance-based file cache sizing
  2012-05-01 15:38     ` Johannes Weiner
@ 2012-05-02  5:21       ` Minchan Kim
  0 siblings, 0 replies; 35+ messages in thread
From: Minchan Kim @ 2012-05-02  5:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel

On 05/02/2012 12:38 AM, Johannes Weiner wrote:

> On Tue, May 01, 2012 at 11:13:30PM +0900, Minchan Kim wrote:
>> Hi Hannes,
>>
>> On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
>>> To protect frequently used page cache (workingset) from bursts of less
>>> frequently used or one-shot cache, page cache pages are managed on two
>>> linked lists.  The inactive list is where all cache starts out on
>>> fault and ends on reclaim.  Pages that get accessed another time while
>>> on the inactive list get promoted to the active list to protect them
>>> from reclaim.
>>>
>>> Right now we have two main problems.
>>>
>>> One stems from numa allocation decisions and how the page allocator
>>> and kswapd interact.  The both of them can enter into a perfect loop
>>> where kswapd reclaims from the preferred zone of a task, allowing the
>>> task to continuously allocate from that zone.  Or, the node distance
>>> can lead to the allocator to do direct zone reclaim to stay in the
>>> preferred zone.  This may be good for locality, but the task has only
>>
>> Understood.
>>
>>> the inactive space of that one zone to get its memory activated.
>>> Forcing the allocator to spread out to lower zones in the right
>>> situation makes the difference between continuous IO to serve the
>>> workingset, or taking the numa cost but serving fully from memory.
>>
>> It's hard to parse your words due to my dumb brain.
>> Could you elaborate on it?
>> It would be good if you gave an example.
> 
> Say your Normal zone is 4G (DMA32 also 4G) and you have 2G of active
> file pages in Normal and DMA32 is full of other stuff.  Now you access
> a new 6G file repeatedly.  First it allocates from Normal (preferred),
> then tries DMA32 (full), wakes up kswapd and retries all zones.  If
> kswapd then frees pages at roughly the same pace as the allocator
> allocates from Normal, kswapd never goes to sleep and evicts pages
> from the 6G file before they can get accessed a second time.  Even
> though the 6G file could fit in memory (4G Normal + 4G DMA32), the
> allocator only uses the 4G Normal zone.
> 
> Same applies if you have a load that would fit in the memory of two
> nodes but the node distance leads the allocator to do zone_reclaim()
> and forcing the pages to stay in one node, again preventing the load
> from being fully cached in memory, which is much more expensive than
> the foreign node cost.
> 
>>> up to half of memory, and don't recognize workingset changes that are
>>> bigger than half of memory.
>>
>> Workingset change?
>> You mean if the new workingset is bigger than half of memory and it's like
>> a stream before retouch, we could cache only part of the working set because
>> the head pages of the working set would be discarded by the tail pages of
>> the working set in the inactive list?
> 
> Spot-on.  I called that 'tail-chasing' in my notes :-) It's when you
> are in a perpetual loop of evicting pages that you will need again in
> a couple hundred page faults.  Those couple hundred page faults are
> the refault distance and my code is able to detect these loops and
> increases the space available to the inactive list to end them, if
> possible.
> 


Thanks!  It would be better to add the above explanation to the cover letter.


> This is the whole principle of the series.
> 
> If such a loop is recognized in a single zone, the allocator goes for
> lower zones to increase the inactive space.  If such a loop is
> recognized over all allowed zones in the zonelist, the active lists
> are shrunk to increase the inactive space.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 5/5] mm: refault distance-based file cache sizing
  2012-05-02  1:57   ` Andrea Arcangeli
@ 2012-05-02  6:23     ` Johannes Weiner
  2012-05-02 15:11       ` Andrea Arcangeli
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2012-05-02  6:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Wed, May 02, 2012 at 03:57:41AM +0200, Andrea Arcangeli wrote:
> On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> > frequently used active page.  Instead, for each refault with a
> > distance smaller than the size of the active list, we deactivate an
> 
> Shouldn't this be the size of active list + size of inactive list?
> 
> If the active list is 500M, inactive 500M and the new working set is
> 600M, the refault distance will be 600M, it won't be smaller than the
> size of the active list, and it won't deactivate the active list as it
> should and it won't be detected as working set.
> 
> Only the refault distance bigger than inactive+active should not
> deactivate the active list if I understand how this works correctly.

The refault distance is what's missing, not the full reuse frequency.
You ignore the 500M worth of inactive LRU time the page had in memory.
The distance in that scenario would be 100M, the time between eviction
and refault:

        +-----------------------------++-----------------------------+
        |                             ||                             |
        | inactive                    || active                      |
        +-----------------------------++-----------------------------+
+~~~~~~~------------------------------+
|                                     |
| new set                             |
+~~~~~~~------------------------------+
^       ^
|       |
|       eviction
refault

The ~~~'d part could fit into memory if the active list was 100M
smaller.
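
To make the comparison concrete, here is a rough sketch of the check
(not the patch code; the eviction counters are assumed to come from
some per-zone counter, sampled at eviction time into the shadow entry
and again at refault time):

#include <linux/mmzone.h>
#include <linux/vmstat.h>

/*
 * Sketch: "distance" is how many pages were evicted between this
 * page's eviction and its refault, i.e. by how much the page missed
 * staying resident.  Had the active list been smaller by that many
 * pages, the refaulting page could have stayed in memory, so the
 * active list gets challenged.
 */
static bool refault_challenges_active(struct zone *zone,
				      unsigned long evictions_at_refault,
				      unsigned long evictions_at_eviction)
{
	unsigned long distance;

	distance = evictions_at_refault - evictions_at_eviction;

	return distance < zone_page_state(zone, NR_ACTIVE_FILE);
}

In the 500M/500M/600M example the distance comes out to 100M worth of
pages, smaller than the 500M active list, so the active list gets
challenged.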

> > @@ -1726,6 +1728,11 @@ zonelist_scan:
> >  		if ((alloc_flags & ALLOC_CPUSET) &&
> >  			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> >  				continue;
> > +		if ((alloc_flags & ALLOC_WMARK_LOW) &&
> > +		    current->refault_distance &&
> > +		    !workingset_zone_alloc(zone, current->refault_distance,
> > +					   &distance, &active))
> > +			continue;
> >  		/*
> >  		 * When allocating a page cache page for writing, we
> >  		 * want to get it from a zone that is within its dirty
> 
> It's a bit hard to see how this may not run oom prematurely if the
> distance is always bigger, this is just an implementation question and
> maybe I'm missing a fallback somewhere where we actually allocate
> memory from whatever place in case no place is ideal.

Sorry, this should be documented better.

The ALLOC_WMARK_LOW check makes sure this only applies in the
fastpath.  It will prepare reclaim with lruvec->shrink_active, then
wake up kswapd and retry the zonelist without this constraint.

> > +	/*
> > +	 * Lower zones may not even be full, and free pages are
> > +	 * potential inactive space, too.  But the dirty reserve is
> > +	 * not available to page cache due to lowmem reserves and the
> > +	 * kswapd watermark.  Don't include it.
> > +	 */
> > +	zone_free = zone_page_state(zone, NR_FREE_PAGES);
> > +	if (zone_free > zone->dirty_balance_reserve)
> > +		zone_free -= zone->dirty_balance_reserve;
> > +	else
> > +		zone_free = 0;
> 
> Maybe also remove the high wmark from the sum? It can be some hundred
> meg so it's better to take it into account, to have a more accurate
> math and locate the best zone that surely fits.
> 
> For the same reason it looks like the lowmem reserve should also be
> taken into account, on the full sum.

dirty_balance_reserve IS the sum of the high watermark and the biggest
lowmem reserve for a particular zone, see how it's calculated in
mm/page_alloc.c::calculate_totalreserve_pages().
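
Roughly, the per-zone part of that calculation looks like this
(sketch from memory, not a verbatim copy of mm/page_alloc.c; i is the
zone's index in the enclosing loop over the node's zones):

	unsigned long max = 0;
	int j;

	/* largest lowmem reserve this zone holds for higher zones */
	for (j = i; j < MAX_NR_ZONES; j++)
		if (zone->lowmem_reserve[j] > max)
			max = zone->lowmem_reserve[j];

	/* the high watermark is treated as reserved, too */
	max += high_wmark_pages(zone);

	if (max > zone->present_pages)
		max = zone->present_pages;

	zone->dirty_balance_reserve = max;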

nr_free - dirty_balance_reserve is the number of pages available to
page cache allocations without keeping kswapd alive or having to dip
into lowmem reserves.

Or did I misunderstand you?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 5/5] mm: refault distance-based file cache sizing
  2012-05-02  6:23     ` Johannes Weiner
@ 2012-05-02 15:11       ` Andrea Arcangeli
  0 siblings, 0 replies; 35+ messages in thread
From: Andrea Arcangeli @ 2012-05-02 15:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Peter Zijlstra, Mel Gorman,
	Andrew Morton, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Wed, May 02, 2012 at 08:23:09AM +0200, Johannes Weiner wrote:
> On Wed, May 02, 2012 at 03:57:41AM +0200, Andrea Arcangeli wrote:
> > On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> > > frequently used active page.  Instead, for each refault with a
> > > distance smaller than the size of the active list, we deactivate an
> > 
> > Shouldn't this be the size of active list + size of inactive list?
> > 
> > If the active list is 500M, inactive 500M and the new working set is
> > 600M, the refault distance will be 600M, it won't be smaller than the
> > size of the active list, and it won't deactivate the active list as it
> > should and it won't be detected as working set.
> > 
> > Only the refault distance bigger than inactive+active should not
> > deactivate the active list if I understand how this works correctly.
> 
> The refault distance is what's missing, not the full reuse frequency.
> You ignore the 500M worth of inactive LRU time the page had in memory.
> The distance in that scenario would be 100M, the time between eviction
> and refault:
> 
>         +-----------------------------++-----------------------------+
>         |                             ||                             |
>         | inactive                    || active                      |
>         +-----------------------------++-----------------------------+
> +~~~~~~~------------------------------+
> |                                     |
> | new set                             |
> +~~~~~~~------------------------------+
> ^       ^
> |       |
> |       eviction
> refault
> 
> The ~~~'d part could fit into memory if the active list was 100M
> smaller.

Never mind, I see that the refault distance is only going to measure
the amount of the new working set that spilled over the inactive
list, so it would only be set to 100M in the example.

> > > @@ -1726,6 +1728,11 @@ zonelist_scan:
> > >  		if ((alloc_flags & ALLOC_CPUSET) &&
> > >  			!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > >  				continue;
> > > +		if ((alloc_flags & ALLOC_WMARK_LOW) &&
> > > +		    current->refault_distance &&
> > > +		    !workingset_zone_alloc(zone, current->refault_distance,
> > > +					   &distance, &active))
> > > +			continue;
> > >  		/*
> > >  		 * When allocating a page cache page for writing, we
> > >  		 * want to get it from a zone that is within its dirty
> > 
> > It's a bit hard to see how this may not run oom prematurely if the
> > distance is always bigger, this is just an implementation question and
> > maybe I'm missing a fallback somewhere where we actually allocate
> > memory from whatever place in case no place is ideal.
> 
> Sorry, this should be documented better.
> 
> The ALLOC_WMARK_LOW check makes sure this only applies in the
> fastpath.  It will prepare reclaim with lruvec->shrink_active, then
> wake up kswapd and retry the zonelist without this constraint.

My point is this is going to change the semantics of ALLOC_WMARK_LOW
to "return OOM randomly even if there's plenty of free memory" instead
of "use only up to the low wmark". I see you want to wake kswapd and
retry with the min wmark after that, but maybe it would be cleaner to
have a new ALLOC_REFAULT_DISTANCE to avoid altering the meaning of
ALLOC_WMARK_LOW. Then add a "|ALLOC_REFAULT_DISTANCE" to the
parameter. It sounds simpler to keep controlling the wmark level
checked with ALLOC_WMARK_LOW|MIN|HIGH without introducing a new special
meaning to the LOW bitflag.

This is only a cleanup though; I believe it works well at runtime.
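
Something along these lines, reusing the check from the hunk above
(just a sketch; ALLOC_REFAULT_DISTANCE is the name proposed here and
the bit value is arbitrary, it would have to fit the existing ALLOC_*
bits):

	#define ALLOC_REFAULT_DISTANCE	0x200 /* honor refault distance in the fastpath */

		if ((alloc_flags & ALLOC_REFAULT_DISTANCE) &&
		    current->refault_distance &&
		    !workingset_zone_alloc(zone, current->refault_distance,
					   &distance, &active))
			continue;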

> > > +	/*
> > > +	 * Lower zones may not even be full, and free pages are
> > > +	 * potential inactive space, too.  But the dirty reserve is
> > > +	 * not available to page cache due to lowmem reserves and the
> > > +	 * kswapd watermark.  Don't include it.
> > > +	 */
> > > +	zone_free = zone_page_state(zone, NR_FREE_PAGES);
> > > +	if (zone_free > zone->dirty_balance_reserve)
> > > +		zone_free -= zone->dirty_balance_reserve;
> > > +	else
> > > +		zone_free = 0;
> > 
> > Maybe also remove the high wmark from the sum? It can be some hundred
> > meg so it's better to take it into account, to have a more accurate
> > math and locate the best zone that surely fits.
> > 
> > For the same reason it looks like the lowmem reserve should also be
> > taken into account, on the full sum.
> 
> dirty_balance_reserve IS the sum of the high watermark and the biggest
> lowmem reserve for a particular zone, see how it's calculated in
> mm/page_alloc.c::calculate_totalreserve_pages().
> 
> nr_free - dirty_balance_reserve is the number of pages available to
> page cache allocations without keeping kswapd alive or having to dip
> into lowmem reserves.
> 
> Or did I misunderstand you?

No, that's all right then!  I didn't realize dirty_balance_reserve
accounts exactly for what I wrote above (high wmark and lowmem
reserve).  I've seen it used by page-writeback and naively assumed it
had to do with dirty page levels, while it has absolutely nothing to
do with writeback or any dirty memory level, despite its quite
misleading dirty_ prefix :)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-01 21:26     ` Andrew Morton
  2012-05-02  1:10       ` Andrea Arcangeli
@ 2012-05-03 13:15       ` Johannes Weiner
  1 sibling, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2012-05-03 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, linux-mm, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Minchan Kim, Hugh Dickins, KOSAKI Motohiro,
	linux-fsdevel, linux-kernel

On Tue, May 01, 2012 at 02:26:56PM -0700, Andrew Morton wrote:
> On Tue, 01 May 2012 17:19:16 -0400
> Rik van Riel <riel@redhat.com> wrote:
> 
> > On 05/01/2012 03:08 PM, Andrew Morton wrote:
> > > On Tue,  1 May 2012 10:41:48 +0200
> > > Johannes Weiner<hannes@cmpxchg.org>  wrote:
> > >
> > >> This series stores file cache eviction information in the vacated page
> > >> cache radix tree slots and uses it on refault to see if the pages
> > >> currently on the active list need to have their status challenged.
> > >
> > > So we no longer free the radix-tree node when everything under it has
> > > been reclaimed?  One could create workloads which would result in a
> > > tremendous amount of memory used by radix_tree_node_cachep objects.
> > >
> > > So I assume these things get thrown away at some point.  Some
> > > discussion about the life-cycle here would be useful.
> > 
> > I assume that in the current codebase Johannes has, we would
> > have to rely on the inode cache shrinker to reclaim the inode
> > and throw out the radix tree nodes.
> > 
> > Having a better way to deal with radix tree nodes that contain
> > stale entries (where the evicted pages would no longer receive
> > special treatment on re-fault, because it has been so long) get
> > reclaimed would be nice for a future version.
> > 
> 
> Well, think of a stupid workload which creates a large number of very
> large but sparse files (populated with one page in each 64, for
> example).  Get them all in cache, then sit there touching the inodes to
> keep then fresh.  What's the worst case here?

With 8G of RAM, it takes a minimally populated file (one page per leaf
node) of 3.5TB to consume all memory for radix tree nodes.
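
Back-of-envelope check of that figure (userspace sketch; the ~560
byte node size is an assumption, so the result is only ballpark):

#include <stdio.h>

int main(void)
{
	unsigned long long ram       = 8ULL << 30;   /* 8G of RAM */
	unsigned long long node_size = 560;          /* assumed ~sizeof(struct radix_tree_node) */
	unsigned long long node_span = 64ULL * 4096; /* file range covered by one 64-slot leaf */
	unsigned long long nodes     = ram / node_size;

	/* one resident page per leaf node is enough to pin the node */
	printf("%llu leaf nodes covering ~%llu GB of sparse file\n",
	       nodes, nodes * node_span >> 30);
	return 0;
}

With these assumptions it prints roughly 15M nodes covering ~3.7TB,
the same ballpark as the 3.5TB above once slab overhead is factored
in.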

The worst case is going OOM without someone to blame as the objects
are owned by the kernel.

Is this a use case we should worry about?  A realistic one, I mean;
it wouldn't be the first way to take down a machine maliciously, and
it could be prevented by rlimiting the maximum file size.

That aside, entries that are past the point where they would mean
anything, as Rik described above, are a waste of memory, the severity
of which depends on how much of its previously faulted data an inode
has evicted while still being in active use.

For me it's not a question of whether we want a mechanism to reclaim
old shadow pages of inodes that are still in use, but how critical
this is, and then how accurate it needs to be etc.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-01  8:41 [patch 0/5] refault distance-based file cache sizing Johannes Weiner
                   ` (5 preceding siblings ...)
  2012-05-01 19:08 ` [patch 0/5] " Andrew Morton
@ 2012-05-16  5:25 ` nai.xia
  2012-05-16  6:51   ` Johannes Weiner
  2012-05-17 13:11   ` Rik van Riel
  6 siblings, 2 replies; 35+ messages in thread
From: nai.xia @ 2012-05-16  5:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel

Hi Johannes,

Just out of curiosity (since I haven't studied the reclaiming
algorithms in depth), I can recall that around 2005
there was an (or some?) implementation of the "Clock-pro" algorithm
which also had the idea of "reuse distance", but it seems that algo
did not work well enough to get merged?  Does this patch series finally
solve the problem(s) with "Clock-pro", or does it not have to worry
about similar problems at all?


Thanks,

Nai

On 2012/05/01 16:41, Johannes Weiner wrote:
> Hi,
>
> our file data caching implementation is done by having an inactive
> list of pages that have yet to prove worth keeping around and an
> active list of pages that already did.  The question is how to balance
> those two lists against each other.
>
> On one hand, the space for inactive pages needs to be big enough so
> that they have the necessary time in memory to gather the references
> required for an activation.  On the other hand, we want an active list
> big enough to hold all data that is frequently used, if possible, to
> protect it from streams of less frequently used/once used pages.
>
> Our current balancing ("active can't grow larger than inactive") does
> not really work too well.  We have people complaining that the working
> set is not well protected from used-once file cache, and other people
> complaining that we don't adapt to changes in the workingset and
> protect stale pages in other cases.
>
> This series stores file cache eviction information in the vacated page
> cache radix tree slots and uses it on refault to see if the pages
> currently on the active list need to have their status challenged.
>
> A fully activated file set that occupies 85% of memory is successfully
> detected as stale when another file set of equal size is accessed for
> a few times (4-5).  The old kernel would never adapt to the second
> one.  If the new set is bigger than memory, the old set is left
> untouched, where the old kernel would shrink the old set to half of
> memory and leave it at that.  Tested on a multi-zone single-node
> machine.
>
> More testing is obviously required, but I first wanted some opinions
> at this point.  Is there fundamental disagreement with the concept?
> With the implementation?
>
> Memcg hard limit reclaim is not converted (anymore, ripped it out to
> focus on the global case first) and it still does the 50/50 balancing
> between lists, but this will be re-added in the next version.
>
> Patches are based on 3.3.
>
>   fs/btrfs/compression.c     |   10 +-
>   fs/btrfs/extent_io.c       |    3 +-
>   fs/cachefiles/rdwr.c       |   26 +++--
>   fs/ceph/xattr.c            |    2 +-
>   fs/inode.c                 |    7 +-
>   fs/logfs/readwrite.c       |    9 +-
>   fs/nilfs2/inode.c          |    6 +-
>   fs/ntfs/file.c             |   11 ++-
>   fs/splice.c                |   10 +-
>   include/linux/mm.h         |    8 ++
>   include/linux/mmzone.h     |    7 ++
>   include/linux/pagemap.h    |   54 ++++++++---
>   include/linux/pagevec.h    |    3 +
>   include/linux/radix-tree.h |    4 -
>   include/linux/sched.h      |    1 +
>   include/linux/shmem_fs.h   |    1 +
>   include/linux/swap.h       |    7 ++
>   lib/radix-tree.c           |   75 ---------------
>   mm/Makefile                |    1 +
>   mm/filemap.c               |  222 ++++++++++++++++++++++++++++++++++----------
>   mm/memcontrol.c            |    3 +
>   mm/mincore.c               |   20 +++-
>   mm/page_alloc.c            |    7 ++
>   mm/readahead.c             |   51 +++++++++-
>   mm/shmem.c                 |   89 +++---------------
>   mm/swap.c                  |   23 +++++
>   mm/truncate.c              |   73 +++++++++++---
>   mm/vmscan.c                |   80 +++++++++-------
>   mm/vmstat.c                |    4 +
>   mm/workingset.c            |  174 ++++++++++++++++++++++++++++++++++
>   net/ceph/messenger.c       |    2 +-
>   net/ceph/pagelist.c        |    4 +-
>   net/ceph/pagevec.c         |    2 +-
>   33 files changed, 682 insertions(+), 317 deletions(-)
>
> Thanks,
> Johannes

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-16  5:25 ` nai.xia
@ 2012-05-16  6:51   ` Johannes Weiner
  2012-05-16 12:56     ` nai.xia
  2012-05-17 13:11   ` Rik van Riel
  1 sibling, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2012-05-16  6:51 UTC (permalink / raw)
  To: nai.xia
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel

Hi Nai,

On Wed, May 16, 2012 at 01:25:34PM +0800, nai.xia wrote:
> Hi Johannes,
> 
> Just out of curiosity (since I haven't studied the reclaiming
> algorithms in depth), I can recall that around 2005
> there was an (or some?) implementation of the "Clock-pro" algorithm
> which also had the idea of "reuse distance", but it seems that algo
> did not work well enough to get merged?  Does this patch series finally
> solve the problem(s) with "Clock-pro", or does it not have to worry
> about similar problems at all?

As far as I understood, clock-pro set out to solve more problems than
my patch set and it failed to satisfy everybody.

The main error case was that it could not partially cache data of a
set that was bigger than memory.  Instead, looping over the file
repeatedly always has to read every single page because the most
recent page allocations would push out the pages needed in the nearest
future.  I never promised to solve this problem in the first place.
But giving more memory to the big looping load is not useful in our
current situation, and at least my code protects smaller sets of
active cache from these loops.  So it's not optimal, but it sucks only
half as much :)

There may have been improvements from clock-pro, but it's hard to get
code merged that does not behave as expected in theory with nobody
understanding what's going on.

My code is fairly simple, works for the tests I've done and the
behaviour observed so far is understood (at least by me).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-16  6:51   ` Johannes Weiner
@ 2012-05-16 12:56     ` nai.xia
  2012-05-17 21:08       ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: nai.xia @ 2012-05-16 12:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel

Hi,

On 2012/05/16 14:51, Johannes Weiner wrote:
> Hi Nai,
>
> On Wed, May 16, 2012 at 01:25:34PM +0800, nai.xia wrote:
>> Hi Johannes,
>>
>> Just out of curiosity (since I haven't studied the reclaiming
>> algorithms in depth), I can recall that around 2005
>> there was an (or some?) implementation of the "Clock-pro" algorithm
>> which also had the idea of "reuse distance", but it seems that algo
>> did not work well enough to get merged?  Does this patch series finally
>> solve the problem(s) with "Clock-pro", or does it not have to worry
>> about similar problems at all?
>
> As far as I understood, clock-pro set out to solve more problems than
> my patch set and it failed to satisfy everybody.
>
> The main error case was that it could not partially cache data of a
> set that was bigger than memory.  Instead, looping over the file
> repeatedly always has to read every single page because the most
> recent page allocations would push out the pages needed in the nearest
> future.  I never promised to solve this problem in the first place.
> But giving more memory to the big looping load is not useful in our
> current situation, and at least my code protects smaller sets of
> active cache from these loops.  So it's not optimal, but it sucks only
> half as much :)

Yep, I see ;)

>
> There may have been improvements from clock-pro, but it's hard to get
> code merged that does not behave as expected in theory with nobody
> understanding what's going on.
>
> My code is fairly simple, works for the tests I've done and the
> behaviour observed so far is understood (at least by me).

OK, I assume that you are aware that the system you constructed with
this simple and understandable idea looks like a so-called "feedback
system"?  Or in other words, I think theoretically the refault distance
of a page before and after your algorithm is applied is not the same.
And this changed refault-distance pattern is then fed as input into
your algorithm.  A feedback system may be hard (and may be simple) to
analyze but may also work well magically.

Well, again I confess I've not done enough coursework in this area.  I
just hope that my words can help you think more comprehensively. :)


Thanks,

Nai

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-16  5:25 ` nai.xia
  2012-05-16  6:51   ` Johannes Weiner
@ 2012-05-17 13:11   ` Rik van Riel
  2012-05-18  5:03     ` Nai Xia
  1 sibling, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2012-05-17 13:11 UTC (permalink / raw)
  To: nai.xia
  Cc: Johannes Weiner, linux-mm, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel

On 05/16/2012 01:25 AM, nai.xia wrote:
> Hi Johannes,
>
> Just out of curiosity (since I haven't studied the reclaiming
> algorithms in depth), I can recall that around 2005
> there was an (or some?) implementation of the "Clock-pro" algorithm
> which also had the idea of "reuse distance", but it seems that algo
> did not work well enough to get merged?

The main issue with clock-pro was scalability.

Johannes has managed to take the good parts of clock-pro,
and add it on top of our split lru VM, which lets us keep
the scalability, while still being able to deal with file
faults from beyond the inactive list.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-16 12:56     ` nai.xia
@ 2012-05-17 21:08       ` Johannes Weiner
  2012-05-18  3:44         ` Nai Xia
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2012-05-17 21:08 UTC (permalink / raw)
  To: nai.xia
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel

On Wed, May 16, 2012 at 08:56:54PM +0800, nai.xia wrote:
> On 2012/05/16 14:51, Johannes Weiner wrote:
> >There may have been improvements from clock-pro, but it's hard to get
> >code merged that does not behave as expected in theory with nobody
> >understanding what's going on.

Damn, that sounded way harsher and more arrogant than I wanted it to sound.
And it's only based on what I gathered from the discussions on the
list archives.  Sorry :(

> OK, I assume that you are aware that the system you constructed with
> this simple and understandable idea looks like a so-called "feedback
> system"?  Or in other words, I think theoretically the refault distance
> of a page before and after your algorithm is applied is not the same.
> And this changed refault-distance pattern is then fed as input into
> your algorithm.  A feedback system may be hard (and may be simple) to
> analyze but may also work well magically.

I'm with you on that, but I can't see an alternative in this case.  We
can't predict future page accesses very well, so we have to take
speculative shots and be considerate about the consequences.

And BECAUSE we may get it wrong, the algorithm does not rely on the
decisions it makes to be correct.  For example, it does not activate
pages based on refault distance, but requires the refaulted page to
win the race against an actual active page.  Likewise, pages are not
evicted from the active list directly, instead they get a chance at
re-activation when challenged.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-17 21:08       ` Johannes Weiner
@ 2012-05-18  3:44         ` Nai Xia
  2012-05-18 15:07           ` Rik van Riel
  0 siblings, 1 reply; 35+ messages in thread
From: Nai Xia @ 2012-05-18  3:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Rik van Riel, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel



On 2012/05/18 05:08, Johannes Weiner wrote:
> On Wed, May 16, 2012 at 08:56:54PM +0800, nai.xia wrote:
>> On 2012/05/16 14:51, Johannes Weiner wrote:
>>> There may have been improvements from clock-pro, but it's hard to get
>>> code merged that does not behave as expected in theory with nobody
>>> understanding what's going on.
>
> Damn, that sounded way harsher and arrogant than I wanted it to sound.
> And it's only based on what I gathered from the discussions on the
> list archives.  Sorry :(

No harm done, man. I just understood your words in this way. :)

But I do think that Clock-pro deserves its credit, since after all
it was that research work that first brought the idea of "refault/reuse
distance" to the kernel community.  Furthermore, it's also good
to let the researchers and the community brainstorm this problem
together if it's really hard to deal with in reality.

>
>> OK, I assume that you are aware that the system you constructed with
>> this simple and understandable idea looks like a so-called "feedback
>> system"?  Or in other words, I think theoretically the refault distance
>> of a page before and after your algorithm is applied is not the same.
>> And this changed refault-distance pattern is then fed as input into
>> your algorithm.  A feedback system may be hard (and may be simple) to
>> analyze but may also work well magically.
>
> I'm with you on that, but I can't see an alternative in this case.  We

I tend to agree.  I once tried to deal with an anti-LRU pattern (e.g. the
big loop like you said) of an app from kernel space and failed.  It seems
it's hard to gather very accurate information about a program's real memory
footprint in mixed workloads with only the help of pte bits... (but that may
also be due to my lack of skill in tweaking the reclaiming code...)

> can't predict future page accesses very well, so we have to take
> speculative shots and be considerate about the consequences.
>
> And BECAUSE we may get it wrong, the algorithm does not rely on the
> decisions it makes to be correct.  For example, it does not activate
> pages based on refault distance, but requires the refaulted page to
> win the race against an actual active page.  Likewise, pages are not
> evicted from the active list directly, instead they get a chance at
> re-activation when challenged.

Yes.  That sounds like smart handling.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-17 13:11   ` Rik van Riel
@ 2012-05-18  5:03     ` Nai Xia
  0 siblings, 0 replies; 35+ messages in thread
From: Nai Xia @ 2012-05-18  5:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, linux-mm, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel



On 2012/05/17 21:11, Rik van Riel wrote:
> On 05/16/2012 01:25 AM, nai.xia wrote:
>> Hi Johannes,
>>
>> Just out of curiosity(since I didn't study deep into the
>> reclaiming algorithms), I can recall from here that around 2005,
>> there was an(or some?) implementation of the "Clock-pro" algorithm
>> which also have the idea of "reuse distance", but it seems that algo
>> did not work well enough to get merged?
>
> The main issue with clock-pro was scalability.
>
> Johannes has managed to take the good parts of clock-pro,
> and add it on top of our split lru VM, which lets us keep
> the scalability, while still being able to deal with file
> faults from beyond the inactive list.
>

Hmm, I see. Thanks for the reply.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-18  3:44         ` Nai Xia
@ 2012-05-18 15:07           ` Rik van Riel
  2012-05-18 15:30             ` Nai Xia
  0 siblings, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2012-05-18 15:07 UTC (permalink / raw)
  To: nai.xia
  Cc: Johannes Weiner, linux-mm, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel

On 05/17/2012 11:44 PM, Nai Xia wrote:

> But I do think that Clock-pro deserves its credit, since after all
> it was that research work that first brought the idea of "refault/reuse
> distance" to the kernel community.

The ARC people did that, too.

> Furthermore, it's also good
> to let the researchers and the community brainstorm this problem
> together if it's really hard to deal with in reality.

How much are researchers interested in the real world
constraints that OS developers have to deal with?

Often scalability is as much of a goal as being good
at selecting the right page to replace...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 0/5] refault distance-based file cache sizing
  2012-05-18 15:07           ` Rik van Riel
@ 2012-05-18 15:30             ` Nai Xia
  0 siblings, 0 replies; 35+ messages in thread
From: Nai Xia @ 2012-05-18 15:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Johannes Weiner, linux-mm, Andrea Arcangeli, Peter Zijlstra,
	Mel Gorman, Andrew Morton, Minchan Kim, Hugh Dickins,
	KOSAKI Motohiro, linux-fsdevel, linux-kernel



On 2012/05/18 23:07, Rik van Riel wrote:
> On 05/17/2012 11:44 PM, Nai Xia wrote:
>
>> But I do think that Clock-pro deserves its credit, since after all
>> it was that research work that first brought the idea of "refault/reuse
>> distance" to the kernel community.
>
> The ARC people did that, too.

Well, I think you said "take the good parts of clock-pro"...
Anyway, then I think you should credit either of the previous
works... :D

>
>> Furthermore, it's also good
>> to let the researchers and the community brainstorm this problem
>> together if it's really hard to deal with in reality.
>
> How much are researchers interested in the real world
> constraints that OS developers have to deal with?

I think there will be nobody, if we don't try to let them
know about the constraints.  Honestly, LKML is hard for
researchers to follow.  They really need an abstract view of
a problem.  Surely there is a gap... between researchers and
developers.

>
> Often scalability is as much of a goal as being good
> at selecting the right page to replace...
>
Then scalability might be a good research topic as long
as they have the chance to understand the details.

OK, all I wanted to point out is another way that may help
the kernel world.  I am actually quite positive
about the patch itself.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2012-05-18 15:31 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-01  8:41 [patch 0/5] refault distance-based file cache sizing Johannes Weiner
2012-05-01  8:41 ` [patch 1/5] mm: readahead: move radix tree hole searching here Johannes Weiner
2012-05-01 21:06   ` Rik van Riel
2012-05-01  8:41 ` [patch 2/5] mm + fs: prepare for non-page entries in page cache Johannes Weiner
2012-05-01 19:02   ` Andrew Morton
2012-05-01 20:15     ` Johannes Weiner
2012-05-01 20:24       ` Andrew Morton
2012-05-01 21:14         ` Rik van Riel
2012-05-01 21:29         ` Johannes Weiner
2012-05-01  8:41 ` [patch 3/5] mm + fs: store shadow pages " Johannes Weiner
2012-05-01  8:41 ` [patch 4/5] mm + fs: provide refault distance to page cache instantiations Johannes Weiner
2012-05-01  9:30   ` Peter Zijlstra
2012-05-01  9:55     ` Johannes Weiner
2012-05-01  9:58       ` Peter Zijlstra
2012-05-01  8:41 ` [patch 5/5] mm: refault distance-based file cache sizing Johannes Weiner
2012-05-01 14:13   ` Minchan Kim
2012-05-01 15:38     ` Johannes Weiner
2012-05-02  5:21       ` Minchan Kim
2012-05-02  1:57   ` Andrea Arcangeli
2012-05-02  6:23     ` Johannes Weiner
2012-05-02 15:11       ` Andrea Arcangeli
2012-05-01 19:08 ` [patch 0/5] " Andrew Morton
2012-05-01 21:19   ` Rik van Riel
2012-05-01 21:26     ` Andrew Morton
2012-05-02  1:10       ` Andrea Arcangeli
2012-05-03 13:15       ` Johannes Weiner
2012-05-16  5:25 ` nai.xia
2012-05-16  6:51   ` Johannes Weiner
2012-05-16 12:56     ` nai.xia
2012-05-17 21:08       ` Johannes Weiner
2012-05-18  3:44         ` Nai Xia
2012-05-18 15:07           ` Rik van Riel
2012-05-18 15:30             ` Nai Xia
2012-05-17 13:11   ` Rik van Riel
2012-05-18  5:03     ` Nai Xia
