linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH 0/7] mm: Batch page reclamation under shrink_page_list
@ 2013-05-07 21:19 Dave Hansen
  2013-05-07 21:19 ` [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages Dave Hansen
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Dave Hansen @ 2013-05-07 21:19 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, Dave Hansen

These are an update of Tim Chen's earlier work:

	http://lkml.kernel.org/r/1347293960.9977.70.camel@schen9-DESK

I broke the patches up a bit more, and tried to incorporate some
changes based on some feedback from Mel and Andrew.

--

Page reclamation in the shrink_page_list() function takes two
locks on a page-by-page basis.  One is the tree_lock protecting
the radix tree of the page's mapping, and the other is the
mapping->i_mmap_mutex protecting the mapped pages.  This set
deals only with mapping->tree_lock.
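
As a rough sketch of the idea (using the helper names the later patches
introduce, so this is illustration rather than actual code), reclaim
currently does a lock round-trip for every single page:

	spin_lock_irq(&mapping->tree_lock);
	__remove_mapping_nolock(mapping, page);	/* radix tree removal etc. */
	spin_unlock_irq(&mapping->tree_lock);

and the series changes that into one acquisition per batch of pages
that share a mapping:

	spin_lock_irq(&mapping->tree_lock);
	list_for_each_entry(page, &batch, lru)
		__remove_mapping_nolock(mapping, page);
	spin_unlock_irq(&mapping->tree_lock);

(The real code in patches 3-6 is a bit more involved since it also has
to hand back any page whose refcount could not be frozen.)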

Tim measured a 14% throughput improvement with a workload that puts
heavy pressure on the page cache by reading many large mmapped files
simultaneously on an 8-socket Westmere server.

I've been testing these by running large parallel kernel compiles
on systems that are under memory pressure.  During development,
I caught quite a few races on smaller setups, and the series has
been quite stable on a large (160 logical CPU / 1TB) system.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages
  2013-05-07 21:19 [RFC][PATCH 0/7] mm: Batch page reclamation under shrink_page_list Dave Hansen
@ 2013-05-07 21:19 ` Dave Hansen
  2013-05-09 22:07   ` Seth Jennings
  2013-05-14 14:55   ` Mel Gorman
  2013-05-07 21:19 ` [RFC][PATCH 2/7] make 'struct page' and swp_entry_t variants of swapcache_free() Dave Hansen
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 25+ messages in thread
From: Dave Hansen @ 2013-05-07 21:19 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

There are only two callers of swapcache_free() which actually
pass in a non-NULL 'struct page'.  Both of them
(__remove_mapping and delete_from_swap_cache())  create a
temporary on-stack 'swp_entry_t' and set entry.val to
page_private().

They need to do this since __delete_from_swap_cache() does
set_page_private(page, 0) and destroys the information.

However, I'd like to batch a few of these operations on several
pages together in a new version of __remove_mapping(), and I
would prefer not to have to allocate temporary storage for
each page.  The code is pretty ugly, and it's a bit silly
to create these on-stack 'swp_entry_t's when it is so easy to
just keep the information around in 'struct page'.

There should not be any danger in doing this since we are
absolutely on the path of freeing these pages.  There is no
turning back, and no other references can be obtained
after a page comes out of the radix tree.
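
For reference, the on-stack pattern in the current callers looks
roughly like this (simplified; a later patch gets rid of the temporary):

	swp_entry_t swap = { .val = page_private(page) };

	__delete_from_swap_cache(page);	/* clears page_private() today */
	spin_unlock_irq(&mapping->tree_lock);
	swapcache_free(swap, page);	/* needs the saved entry */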

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/mm/swap_state.c |    4 ++--
 linux.git-davehans/mm/vmscan.c     |    2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff -puN mm/swap_state.c~__delete_from_swap_cache-dont-clear-page-private mm/swap_state.c
--- linux.git/mm/swap_state.c~__delete_from_swap_cache-dont-clear-page-private	2013-05-07 13:48:13.698044473 -0700
+++ linux.git-davehans/mm/swap_state.c	2013-05-07 13:48:13.703044693 -0700
@@ -146,8 +146,6 @@ void __delete_from_swap_cache(struct pag
 	entry.val = page_private(page);
 	address_space = swap_address_space(entry);
 	radix_tree_delete(&address_space->page_tree, page_private(page));
-	set_page_private(page, 0);
-	ClearPageSwapCache(page);
 	address_space->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
@@ -224,6 +222,8 @@ void delete_from_swap_cache(struct page
 	spin_unlock_irq(&address_space->tree_lock);
 
 	swapcache_free(entry, page);
+	set_page_private(page, 0);
+	ClearPageSwapCache(page);
 	page_cache_release(page);
 }
 
diff -puN mm/vmscan.c~__delete_from_swap_cache-dont-clear-page-private mm/vmscan.c
--- linux.git/mm/vmscan.c~__delete_from_swap_cache-dont-clear-page-private	2013-05-07 13:48:13.700044561 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-05-07 13:48:13.705044783 -0700
@@ -494,6 +494,8 @@ static int __remove_mapping(struct addre
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
 		swapcache_free(swap, page);
+		set_page_private(page, 0);
+		ClearPageSwapCache(page);
 	} else {
 		void (*freepage)(struct page *);
 
_

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC][PATCH 2/7] make 'struct page' and swp_entry_t variants of swapcache_free().
  2013-05-07 21:19 [RFC][PATCH 0/7] mm: Batch page reclamation under shrink_page_list Dave Hansen
  2013-05-07 21:19 ` [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages Dave Hansen
@ 2013-05-07 21:19 ` Dave Hansen
  2013-05-14 15:00   ` Mel Gorman
  2013-05-07 21:19 ` [RFC][PATCH 3/7] break up __remove_mapping() Dave Hansen
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2013-05-07 21:19 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

swapcache_free() takes two arguments:

	void swapcache_free(swp_entry_t entry, struct page *page)

Most of its callers (5 of 7) are on error-handling paths that have
not even instantiated a page, so they pass page=NULL.  Both of the
callers that do pass in a 'struct page' create and pass in a
temporary swp_entry_t.

Now that we are deferring clearing page_private() until after
swapcache_free() has been called, we can just create a variant
that takes a 'struct page' and does the temporary variable in
the helper.

That leaves all the other callers doing

	swapcache_free(entry, NULL)

so create another helper for them that makes it clear that they
need only pass in a swp_entry_t.

One downside here is that delete_from_swap_cache() now does
an extra swap_address_space() call.  But, those are pretty
cheap (just some array index arithmetic).
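
(For context, and quoting the definition from memory, so treat the
exact form as an assumption: swap_address_space() is just an index
into the per-swap-device array of address_spaces, roughly

	#define swap_address_space(entry) (&swapper_spaces[swp_type(entry)])

so the extra call amounts to a little arithmetic and a pointer offset.)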

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/drivers/staging/zcache/zcache-main.c |    2 +-
 linux.git-davehans/include/linux/swap.h                 |    3 ++-
 linux.git-davehans/mm/shmem.c                           |    2 +-
 linux.git-davehans/mm/swap_state.c                      |   13 +++++--------
 linux.git-davehans/mm/swapfile.c                        |   13 ++++++++++++-
 linux.git-davehans/mm/vmscan.c                          |    3 +--
 6 files changed, 22 insertions(+), 14 deletions(-)

diff -puN drivers/staging/zcache/zcache-main.c~make-page-and-swp_entry_t-variants drivers/staging/zcache/zcache-main.c
--- linux.git/drivers/staging/zcache/zcache-main.c~make-page-and-swp_entry_t-variants	2013-05-07 13:48:13.963056205 -0700
+++ linux.git-davehans/drivers/staging/zcache/zcache-main.c	2013-05-07 13:48:13.975056737 -0700
@@ -961,7 +961,7 @@ static int zcache_get_swap_cache_page(in
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free_entry(entry);
 		/* FIXME: is it possible to get here without err==-ENOMEM?
 		 * If not, we can dispense with the do loop, use goto retry */
 	} while (err != -ENOMEM);
diff -puN include/linux/swap.h~make-page-and-swp_entry_t-variants include/linux/swap.h
--- linux.git/include/linux/swap.h~make-page-and-swp_entry_t-variants	2013-05-07 13:48:13.964056249 -0700
+++ linux.git-davehans/include/linux/swap.h	2013-05-07 13:48:13.975056737 -0700
@@ -382,7 +382,8 @@ extern void swap_shmem_alloc(swp_entry_t
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free_entry(swp_entry_t entry);
+extern void swapcache_free_page_entry(struct page *page);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
diff -puN mm/shmem.c~make-page-and-swp_entry_t-variants mm/shmem.c
--- linux.git/mm/shmem.c~make-page-and-swp_entry_t-variants	2013-05-07 13:48:13.966056339 -0700
+++ linux.git-davehans/mm/shmem.c	2013-05-07 13:48:13.976056781 -0700
@@ -871,7 +871,7 @@ static int shmem_writepage(struct page *
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
-	swapcache_free(swap, NULL);
+	swapcache_free_entry(swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
diff -puN mm/swapfile.c~make-page-and-swp_entry_t-variants mm/swapfile.c
--- linux.git/mm/swapfile.c~make-page-and-swp_entry_t-variants	2013-05-07 13:48:13.968056427 -0700
+++ linux.git-davehans/mm/swapfile.c	2013-05-07 13:48:13.977056825 -0700
@@ -637,7 +637,7 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry, struct page *page)
+static void __swapcache_free(swp_entry_t entry, struct page *page)
 {
 	struct swap_info_struct *p;
 	unsigned char count;
@@ -651,6 +651,17 @@ void swapcache_free(swp_entry_t entry, s
 	}
 }
 
+void swapcache_free_entry(swp_entry_t entry)
+{
+	__swapcache_free(entry, NULL);
+}
+
+void swapcache_free_page_entry(struct page *page)
+{
+	swp_entry_t entry = { .val = page_private(page) };
+	__swapcache_free(entry, page);
+}
+
 /*
  * How many references to page are currently swapped out?
  * This does not give an exact answer when swap count is continued,
diff -puN mm/swap_state.c~make-page-and-swp_entry_t-variants mm/swap_state.c
--- linux.git/mm/swap_state.c~make-page-and-swp_entry_t-variants	2013-05-07 13:48:13.969056471 -0700
+++ linux.git-davehans/mm/swap_state.c	2013-05-07 13:48:13.978056869 -0700
@@ -172,7 +172,7 @@ int add_to_swap(struct page *page, struc
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry, NULL);
+			swapcache_free_entry(entry);
 			return 0;
 		}
 
@@ -198,7 +198,7 @@ int add_to_swap(struct page *page, struc
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free_entry(entry);
 		return 0;
 	}
 }
@@ -211,17 +211,14 @@ int add_to_swap(struct page *page, struc
  */
 void delete_from_swap_cache(struct page *page)
 {
-	swp_entry_t entry;
 	struct address_space *address_space;
 
-	entry.val = page_private(page);
-
-	address_space = swap_address_space(entry);
+	address_space = page_mapping(page);
 	spin_lock_irq(&address_space->tree_lock);
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry, page);
+	swapcache_free_page_entry(page);
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
 	page_cache_release(page);
@@ -365,7 +362,7 @@ struct page *read_swap_cache_async(swp_e
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free_entry(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff -puN mm/vmscan.c~make-page-and-swp_entry_t-variants mm/vmscan.c
--- linux.git/mm/vmscan.c~make-page-and-swp_entry_t-variants	2013-05-07 13:48:13.971056559 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-05-07 13:48:13.979056913 -0700
@@ -490,10 +490,9 @@ static int __remove_mapping(struct addre
 	}
 
 	if (PageSwapCache(page)) {
-		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		swapcache_free(swap, page);
+		swapcache_free_page_entry(page);
 		set_page_private(page, 0);
 		ClearPageSwapCache(page);
 	} else {
_

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC][PATCH 3/7] break up __remove_mapping()
  2013-05-07 21:19 [RFC][PATCH 0/7] mm: Batch page reclamation under shrink_page_list Dave Hansen
  2013-05-07 21:19 ` [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages Dave Hansen
  2013-05-07 21:19 ` [RFC][PATCH 2/7] make 'struct page' and swp_entry_t variants of swapcache_free() Dave Hansen
@ 2013-05-07 21:19 ` Dave Hansen
  2013-05-14 15:22   ` Mel Gorman
  2013-05-07 21:20 ` [RFC][PATCH 4/7] break out mapping "freepage" code Dave Hansen
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2013-05-07 21:19 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Our goal here is to eventually reduce the number of repetitive
acquire/release operations on mapping->tree_lock.

To start out, we make a version of __remove_mapping() called
__remove_mapping_nolock().  This actually makes the locking
_much_ more straightforward.

One non-obvious part of this patch: the

	freepage = mapping->a_ops->freepage;

used to happen under the mapping->tree_lock, but this patch
moves it outside the lock.  All of the other a_ops->freepage
users read it outside the lock, and it is only assigned when
we create inodes, so that makes it safe.
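
For comparison, the existing page cache removal path already reads and
calls ->freepage outside the lock.  From memory (so the details may be
slightly off), delete_from_page_cache() is shaped like this:

	void delete_from_page_cache(struct page *page)
	{
		struct address_space *mapping = page->mapping;
		void (*freepage)(struct page *);

		BUG_ON(!PageLocked(page));

		freepage = mapping->a_ops->freepage;	/* read outside the lock */
		spin_lock_irq(&mapping->tree_lock);
		__delete_from_page_cache(page);
		spin_unlock_irq(&mapping->tree_lock);
		mem_cgroup_uncharge_cache_page(page);

		if (freepage)
			freepage(page);			/* called outside the lock */
		page_cache_release(page);
	}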

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/mm/vmscan.c |   41 +++++++++++++++++++++++++----------------
 1 file changed, 25 insertions(+), 16 deletions(-)

diff -puN mm/vmscan.c~make-remove-mapping-without-locks mm/vmscan.c
--- linux.git/mm/vmscan.c~make-remove-mapping-without-locks	2013-05-07 13:48:14.271069843 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-05-07 13:48:14.275070019 -0700
@@ -450,12 +450,12 @@ static pageout_t pageout(struct page *pa
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping_nolock(struct address_space *mapping,
+				   struct page *page)
 {
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	spin_lock_irq(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -482,37 +482,46 @@ static int __remove_mapping(struct addre
 	 * and thus under tree_lock, then this ordering is not required.
 	 */
 	if (!page_freeze_refs(page, 2))
-		goto cannot_free;
+		return 0;
 	/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
 	if (unlikely(PageDirty(page))) {
 		page_unfreeze_refs(page, 2);
-		goto cannot_free;
+		return 0;
 	}
 
 	if (PageSwapCache(page)) {
 		__delete_from_swap_cache(page);
-		spin_unlock_irq(&mapping->tree_lock);
+	} else {
+		__delete_from_page_cache(page);
+	}
+	return 1;
+}
+
+static int __remove_mapping(struct address_space *mapping, struct page *page)
+{
+	int ret;
+	BUG_ON(!PageLocked(page));
+
+	spin_lock_irq(&mapping->tree_lock);
+	ret = __remove_mapping_nolock(mapping, page);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/* unable to free */
+	if (!ret)
+		return 0;
+
+	if (PageSwapCache(page)) {
 		swapcache_free_page_entry(page);
 		set_page_private(page, 0);
 		ClearPageSwapCache(page);
 	} else {
 		void (*freepage)(struct page *);
-
 		freepage = mapping->a_ops->freepage;
-
-		__delete_from_page_cache(page);
-		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
-
 		if (freepage != NULL)
 			freepage(page);
 	}
-
-	return 1;
-
-cannot_free:
-	spin_unlock_irq(&mapping->tree_lock);
-	return 0;
+	return ret;
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC][PATCH 4/7] break out mapping "freepage" code
  2013-05-07 21:19 [RFC][PATCH 0/7] mm: Batch page reclamation under shrink_page_list Dave Hansen
                   ` (2 preceding siblings ...)
  2013-05-07 21:19 ` [RFC][PATCH 3/7] break up __remove_mapping() Dave Hansen
@ 2013-05-07 21:20 ` Dave Hansen
  2013-05-14 15:26   ` Mel Gorman
  2013-05-07 21:20 ` [RFC][PATCH 5/7] create __remove_mapping_batch() Dave Hansen
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2013-05-07 21:20 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

__remove_mapping() only deals with pages with mappings, meaning
page cache and swap cache.

At this point, the page has been removed from the mapping's radix
tree, and we need to ensure that any fs-specific (or swap-
specific) resources are freed up.

We will be using this function from a second location in a
following patch.
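
For background: ->freepage is the optional address_space_operations
hook that gets called once a page is gone from the mapping's radix
tree.  A purely hypothetical filesystem (the names below are made up
for illustration) might wire it up like this to account for per-page
state it tracks on the side:

	static atomic_long_t myfs_cached_pages;

	/* the page has already been removed from the radix tree when
	 * this runs, and the callback cannot fail */
	static void myfs_freepage(struct page *page)
	{
		atomic_long_dec(&myfs_cached_pages);
	}

	static const struct address_space_operations myfs_aops = {
		.freepage	= myfs_freepage,
	};

The free_mapping_page() helper below is what ends up invoking that
hook for page cache pages.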

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/mm/vmscan.c |   33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff -puN mm/vmscan.c~free_mapping_page mm/vmscan.c
--- linux.git/mm/vmscan.c~free_mapping_page	2013-05-07 13:48:14.520080867 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-05-07 13:48:14.524081045 -0700
@@ -497,6 +497,27 @@ static int __remove_mapping_nolock(struc
 	return 1;
 }
 
+/*
+ * maybe this isnt named the best... it just releases
+ * the mapping's reference to the page.  It frees the
+ * page from *the* *mapping* but not necessarily back
+ * in to the allocator
+ */
+static void free_mapping_page(struct address_space *mapping, struct page *page)
+{
+	if (PageSwapCache(page)) {
+		swapcache_free_page_entry(page);
+		set_page_private(page, 0);
+		ClearPageSwapCache(page);
+	} else {
+		void (*freepage)(struct page *);
+		freepage = mapping->a_ops->freepage;
+		mem_cgroup_uncharge_cache_page(page);
+		if (freepage != NULL)
+			freepage(page);
+	}
+}
+
 static int __remove_mapping(struct address_space *mapping, struct page *page)
 {
 	int ret;
@@ -510,17 +531,7 @@ static int __remove_mapping(struct addre
 	if (!ret)
 		return 0;
 
-	if (PageSwapCache(page)) {
-		swapcache_free_page_entry(page);
-		set_page_private(page, 0);
-		ClearPageSwapCache(page);
-	} else {
-		void (*freepage)(struct page *);
-		freepage = mapping->a_ops->freepage;
-		mem_cgroup_uncharge_cache_page(page);
-		if (freepage != NULL)
-			freepage(page);
-	}
+	free_mapping_page(mapping, page);
 	return ret;
 }
 
_

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC][PATCH 5/7] create __remove_mapping_batch()
  2013-05-07 21:19 [RFC][PATCH 0/7] mm: Batch page reclamation under shrink_page_list Dave Hansen
                   ` (3 preceding siblings ...)
  2013-05-07 21:20 ` [RFC][PATCH 4/7] break out mapping "freepage" code Dave Hansen
@ 2013-05-07 21:20 ` Dave Hansen
  2013-05-09 22:13   ` Seth Jennings
  2013-05-14 15:51   ` Mel Gorman
  2013-05-07 21:20 ` [RFC][PATCH 6/7] use __remove_mapping_batch() in shrink_page_list() Dave Hansen
  2013-05-07 21:20 ` [RFC][PATCH 7/7] drain batch list during long operations Dave Hansen
  6 siblings, 2 replies; 25+ messages in thread
From: Dave Hansen @ 2013-05-07 21:20 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

__remove_mapping_batch() does logically the same thing as
__remove_mapping().

We batch like this so that several pages can be freed with a
single mapping->tree_lock acquisition/release pair.  This reduces
the number of atomic operations and ensures that we do not bounce
cachelines around.

It has shown some substantial performance benefits on
microbenchmarks.
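
One note for readers of the code below: it walks the list with
lru_to_page(), which (quoting from memory, so treat it as approximate)
is just the list_entry() of the last page on the list:

	#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))

so pages are pulled off the tail of 'remove_list' one at a time.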

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/mm/vmscan.c |   50 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff -puN mm/vmscan.c~create-remove_mapping_batch mm/vmscan.c
--- linux.git/mm/vmscan.c~create-remove_mapping_batch	2013-05-07 14:00:01.432361260 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-05-07 14:19:32.341148892 -0700
@@ -555,6 +555,56 @@ int remove_mapping(struct address_space
 	return 0;
 }
 
+/*
+ * pages come in here (via remove_list) locked and leave unlocked
+ * (on either ret_pages or free_pages)
+ *
+ * We do this batching so that we free batches of pages with a
+ * single mapping->tree_lock acquisition/release.  This optimization
+ * only makes sense when the pages on remove_list all share a
+ * page->mapping.  If this is violated you will BUG_ON().
+ */
+static int __remove_mapping_batch(struct list_head *remove_list,
+				  struct list_head *ret_pages,
+				  struct list_head *free_pages)
+{
+	int nr_reclaimed = 0;
+	struct address_space *mapping;
+	struct page *page;
+	LIST_HEAD(need_free_mapping);
+
+	if (list_empty(remove_list))
+		return 0;
+
+	mapping = lru_to_page(remove_list)->mapping;
+	spin_lock_irq(&mapping->tree_lock);
+	while (!list_empty(remove_list)) {
+		int freed;
+		page = lru_to_page(remove_list);
+		BUG_ON(!PageLocked(page));
+		BUG_ON(page->mapping != mapping);
+		list_del(&page->lru);
+
+		freed = __remove_mapping_nolock(mapping, page);
+		if (freed) {
+			list_add(&page->lru, &need_free_mapping);
+		} else {
+			unlock_page(page);
+			list_add(&page->lru, ret_pages);
+		}
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	while (!list_empty(&need_free_mapping)) {
+		page = lru_to_page(&need_free_mapping);
+		list_move(&page->list, free_pages);
+		free_mapping_page(mapping, page);
+		unlock_page(page);
+		nr_reclaimed++;
+	}
+	return nr_reclaimed;
+}
+
 /**
  * putback_lru_page - put previously isolated page onto appropriate LRU list
  * @page: page to be put back to appropriate lru list
_

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC][PATCH 6/7] use __remove_mapping_batch() in shrink_page_list()
  2013-05-07 21:19 [RFC][PATCH 0/7] mm: Batch page reclamation under shrink_page_list Dave Hansen
                   ` (4 preceding siblings ...)
  2013-05-07 21:20 ` [RFC][PATCH 5/7] create __remove_mapping_batch() Dave Hansen
@ 2013-05-07 21:20 ` Dave Hansen
  2013-05-14 16:05   ` Mel Gorman
  2013-05-07 21:20 ` [RFC][PATCH 7/7] drain batch list during long operations Dave Hansen
  6 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2013-05-07 21:20 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Tim Chen's earlier version of these patches just unconditionally
created large batches of pages, even if they did not share a
page->mapping.  This is a bit suboptimal for a few reasons:
1. If we cannot consolidate lock acquisitions, it makes little
   sense to batch.
2. The page locks are held for long periods of time, so we only
   want to do this when we are sure that we will gain a
   substantial throughput improvement, because we pay a latency
   cost by holding the locks.

This patch makes sure to only batch when all the pages on
'batch_for_mapping_removal' continue to share a page->mapping.
This only happens in practice in cases where pages in the same
file are close to each other on the LRU.  That seems like a
reasonable assumption.

In a 128MB virtual machine doing kernel compiles, the average
batch size when calling __remove_mapping_batch() is around 5,
so this does seem to do some good in practice.

On a 160-CPU system doing kernel compiles, I still saw an
average batch length of about 2.8.  One promising sign: as
memory pressure went up, the average batch size seemed to
get larger.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/mm/vmscan.c |   52 +++++++++++++++++++++++++++++++++--------
 1 file changed, 42 insertions(+), 10 deletions(-)

diff -puN mm/vmscan.c~use-remove_mapping_batch mm/vmscan.c
--- linux.git/mm/vmscan.c~use-remove_mapping_batch	2013-05-07 13:48:15.016102828 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-05-07 13:48:15.020103005 -0700
@@ -599,7 +599,14 @@ static int __remove_mapping_batch(struct
 		page = lru_to_page(&need_free_mapping);
 		list_move(&page->list, free_pages);
 		free_mapping_page(mapping, page);
-		unlock_page(page);
+		/*
+		 * At this point, we have no other references and there is
+		 * no way to pick any more up (removed from LRU, removed
+		 * from pagecache). Can use non-atomic bitops now (and
+		 * we obviously don't have to worry about waking up a process
+		 * waiting on the page lock, because there are no references.
+		 */
+		__clear_page_locked(page);
 		nr_reclaimed++;
 	}
 	return nr_reclaimed;
@@ -740,6 +747,15 @@ static enum page_references page_check_r
 	return PAGEREF_RECLAIM;
 }
 
+static bool batch_has_same_mapping(struct page *page, struct list_head *batch)
+{
+	struct page *first_in_batch;
+	first_in_batch = lru_to_page(batch);
+	if (first_in_batch->mapping == page->mapping)
+		return true;
+	return false;
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -753,6 +769,7 @@ static unsigned long shrink_page_list(st
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(batch_for_mapping_removal);
 	int pgactivate = 0;
 	unsigned long nr_dirty = 0;
 	unsigned long nr_congested = 0;
@@ -771,6 +788,19 @@ static unsigned long shrink_page_list(st
 		cond_resched();
 
 		page = lru_to_page(page_list);
+		/*
+		 * batching only makes sense when we can save lock
+		 * acquisitions, so drain the batched pages when
+		 * we move over to a different mapping
+		 */
+		if (!list_empty(&batch_for_mapping_removal) &&
+		    !batch_has_same_mapping(page, &batch_for_mapping_removal)) {
+			nr_reclaimed +=
+				__remove_mapping_batch(&batch_for_mapping_removal,
+							&ret_pages,
+							&free_pages);
+		}
+
 		list_del(&page->lru);
 
 		if (!trylock_page(page))
@@ -975,17 +1005,17 @@ static unsigned long shrink_page_list(st
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping)
 			goto keep_locked;
-
 		/*
-		 * At this point, we have no other references and there is
-		 * no way to pick any more up (removed from LRU, removed
-		 * from pagecache). Can use non-atomic bitops now (and
-		 * we obviously don't have to worry about waking up a process
-		 * waiting on the page lock, because there are no references.
+		 * This list contains pages all in the same mapping, but
+		 * in effectively random order and we hold lock_page()
+		 * on *all* of them.  This can potentially cause lock
+		 * ordering issues, but the reclaim code only trylocks
+		 * them which saves us.
 		 */
-		__clear_page_locked(page);
+		list_add(&page->lru, &batch_for_mapping_removal);
+		continue;
 free_it:
 		nr_reclaimed++;
 
@@ -1016,7 +1046,9 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
-
+	nr_reclaimed += __remove_mapping_batch(&batch_for_mapping_removal,
+						&ret_pages,
+						&free_pages);
 	/*
 	 * Tag a zone as congested if all the dirty pages encountered were
 	 * backed by a congested BDI. In this case, reclaimers should just
_

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC][PATCH 7/7] drain batch list during long operations
  2013-05-07 21:19 [RFC][PATCH 0/7] mm: Batch page reclamation under shrink_page_list Dave Hansen
                   ` (5 preceding siblings ...)
  2013-05-07 21:20 ` [RFC][PATCH 6/7] use __remove_mapping_batch() in shrink_page_list() Dave Hansen
@ 2013-05-07 21:20 ` Dave Hansen
  2013-05-07 23:56   ` Dave Hansen
                     ` (2 more replies)
  6 siblings, 3 replies; 25+ messages in thread
From: Dave Hansen @ 2013-05-07 21:20 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This was a suggestion from Mel:

	http://lkml.kernel.org/r/20120914085634.GM11157@csn.ul.ie

Any pages we collect on 'batch_for_mapping_removal' will have
their page lock held for the duration of their stay on the
list.  If some other user is trying to get at them during this
time, they might end up having to wait for a while, especially if
we go off and do pageout() on some other page.

This ensures that we drain the batch if we are about to perform a
writeout.

I added some statistics to the __remove_mapping_batch() code to
track how large the lists are that we pass in to it.  With this
patch, the average list length drops about 10% (from about 4.1 to
3.8).  The workload here was a make -j4 kernel compile on a VM
with 200MB of RAM.

I've still got the statistics patch around if anyone is
interested.
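
The statistics patch itself is not included here; the instrumentation
described above amounts to something like the following sketch (counter
names made up, not the actual patch):

	static atomic_long_t remove_batch_calls;
	static atomic_long_t remove_batch_pages;

	/* called once per __remove_mapping_batch() invocation with the
	 * number of pages that were on remove_list */
	static void account_remove_batch(int nr_pages)
	{
		atomic_long_inc(&remove_batch_calls);
		atomic_long_add(nr_pages, &remove_batch_pages);
	}

The average batch size quoted above is then just
remove_batch_pages / remove_batch_calls.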

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/kernel/sched/fair.c |    2 ++
 linux.git-davehans/mm/vmscan.c         |   10 ++++++++++
 2 files changed, 12 insertions(+)

diff -puN kernel/sched/fair.c~drain-batch-list-during-long-operations kernel/sched/fair.c
--- linux.git/kernel/sched/fair.c~drain-batch-list-during-long-operations	2013-05-07 13:48:15.267113941 -0700
+++ linux.git-davehans/kernel/sched/fair.c	2013-05-07 13:48:15.275114295 -0700
@@ -5211,6 +5211,8 @@ more_balance:
 		if (sd->balance_interval < sd->max_interval)
 			sd->balance_interval *= 2;
 	}
+	//if (printk_ratelimit())
+	//	printk("sd->balance_interval: %d\n", sd->balance_interval);
 
 	goto out;
 
diff -puN mm/vmscan.c~drain-batch-list-during-long-operations mm/vmscan.c
--- linux.git/mm/vmscan.c~drain-batch-list-during-long-operations	2013-05-07 13:48:15.268113985 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-05-07 13:48:15.272114163 -0700
@@ -936,6 +936,16 @@ static unsigned long shrink_page_list(st
 			if (!sc->may_writepage)
 				goto keep_locked;
 
+			/*
+			 * We hold a bunch of page locks on the batch.
+			 * pageout() can take a while, so drain the
+			 * batch before we perform pageout.
+			 */
+			nr_reclaimed +=
+		               __remove_mapping_batch(&batch_for_mapping_removal,
+		                                      &ret_pages,
+		                                      &free_pages);
+
 			/* Page is dirty, try to write it out here */
 			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
_

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 7/7] drain batch list during long operations
  2013-05-07 21:20 ` [RFC][PATCH 7/7] drain batch list during long operations Dave Hansen
@ 2013-05-07 23:56   ` Dave Hansen
  2013-05-08  0:42   ` Tim Chen
  2013-05-14 16:08   ` Mel Gorman
  2 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-05-07 23:56 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On 05/07/2013 02:20 PM, Dave Hansen wrote:
> +++ linux.git-davehans/kernel/sched/fair.c	2013-05-07 13:48:15.275114295 -0700
> @@ -5211,6 +5211,8 @@ more_balance:
>  		if (sd->balance_interval < sd->max_interval)
>  			sd->balance_interval *= 2;
>  	}
> +	//if (printk_ratelimit())
> +	//	printk("sd->balance_interval: %d\n", sd->balance_interval);
>  
>  	goto out;

Urg, this is obviously a garbage hunk that snuck in here.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 7/7] drain batch list during long operations
  2013-05-07 21:20 ` [RFC][PATCH 7/7] drain batch list during long operations Dave Hansen
  2013-05-07 23:56   ` Dave Hansen
@ 2013-05-08  0:42   ` Tim Chen
  2013-05-14 16:08   ` Mel Gorman
  2 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2013-05-08  0:42 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman

On Tue, 2013-05-07 at 14:20 -0700, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This was a suggestion from Mel:
> 
> 	http://lkml.kernel.org/r/20120914085634.GM11157@csn.ul.ie
> 
> Any pages we collect on 'batch_for_mapping_removal' will have
> their page lock held for the duration of their stay on the
> list.  If some other user is trying to get at them during this
> time, they might end up having to wait for a while, especially if
> we go off and do pageout() on some other page.
> 
> This ensures that we drain the batch if we are about to perform a
> writeout.
> 
> I added some statistics to the __remove_mapping_batch() code to
> track how large the lists are that we pass in to it.  With this
> patch, the average list length drops about 10% (from about 4.1 to
> 3.8).  The workload here was a make -j4 kernel compile on a VM
> with 200MB of RAM.
> 
> I've still got the statistics patch around if anyone is
> interested.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>


I like this new patch series. Logic is cleaner than my previous attempt.

Acked.

Tim


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages
  2013-05-07 21:19 ` [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages Dave Hansen
@ 2013-05-09 22:07   ` Seth Jennings
  2013-05-09 22:19     ` Dave Hansen
  2013-05-10  9:26     ` Michal Hocko
  2013-05-14 14:55   ` Mel Gorman
  1 sibling, 2 replies; 25+ messages in thread
From: Seth Jennings @ 2013-05-09 22:07 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Tue, May 07, 2013 at 02:19:55PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> There are only two callers of swapcache_free() which actually
> pass in a non-NULL 'struct page'.  Both of them
> (__remove_mapping and delete_from_swap_cache())  create a
> temporary on-stack 'swp_entry_t' and set entry.val to
> page_private().
> 
> They need to do this since __delete_from_swap_cache() does
> set_page_private(page, 0) and destroys the information.
> 
> However, I'd like to batch a few of these operations on several
> pages together in a new version of __remove_mapping(), and I
> would prefer not to have to allocate temporary storage for
> each page.  The code is pretty ugly, and it's a bit silly
> to create these on-stack 'swp_entry_t's when it is so easy to
> just keep the information around in 'struct page'.
> 
> There should not be any danger in doing this since we are
> absolutely on the path of freeing these pages.  There is no
> turning back, and no other references can be obtained
> after a page comes out of the radix tree.

I get a BUG on this one:

[   26.114818] ------------[ cut here ]------------
[   26.115282] kernel BUG at mm/memcontrol.c:4111!
[   26.115282] invalid opcode: 0000 [#1] PREEMPT SMP 
[   26.115282] Modules linked in:
[   26.115282] CPU: 3 PID: 5026 Comm: cc1 Not tainted 3.9.0+ #8
[   26.115282] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[   26.115282] task: ffff88007c1cdca0 ti: ffff88001b442000 task.ti: ffff88001b442000
[   26.115282] RIP: 0010:[<ffffffff810ed425>]  [<ffffffff810ed425>] __mem_cgroup_uncharge_common+0x255/0x2e0
[   26.115282] RSP: 0000:ffff88001b443708  EFLAGS: 00010206
[   26.115282] RAX: 4000000000090009 RBX: 0000000000000000 RCX: ffffc90000014001
[   26.115282] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffea00006e5b40
[   26.115282] RBP: ffff88001b443738 R08: 0000000000000000 R09: 0000000000000000
[   26.115282] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00006e5b40
[   26.115282] R13: 0000000000000000 R14: ffffea00006e5b40 R15: 0000000000000002
[   26.115282] FS:  00007fabd08ee700(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
[   26.115282] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.115282] CR2: 00007fabce27a000 CR3: 000000001985f000 CR4: 00000000000006a0
[   26.115282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   26.115282] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   26.115282] Stack:
[   26.115282]  ffffffff810dcbae ffff880064a0a500 0000000000000001 ffffea00006e5b40
[   26.115282]  ffffea00006e5b40 0000000000000001 ffff88001b443748 ffffffff810f0d05
[   26.115282]  ffff88001b443778 ffffffff810ddb3e ffff88001b443778 ffffea00006e5b40
[   26.115282] Call Trace:
[   26.115282]  [<ffffffff810dcbae>] ? swap_info_get+0x5e/0xe0
[   26.115282]  [<ffffffff810f0d05>] mem_cgroup_uncharge_swapcache+0x15/0x20
[   26.115282]  [<ffffffff810ddb3e>] swapcache_free+0x4e/0x70
[   26.115282]  [<ffffffff810b6e67>] __remove_mapping+0xd7/0x120
[   26.115282]  [<ffffffff810b8682>] shrink_page_list+0x5c2/0x920
[   26.115282]  [<ffffffff810b780e>] ? isolate_lru_pages.isra.37+0xae/0x120
[   26.115282]  [<ffffffff810b8ecf>] shrink_inactive_list+0x13f/0x380
[   26.115282]  [<ffffffff810b9350>] shrink_lruvec+0x240/0x4e0
[   26.115282]  [<ffffffff810b9656>] shrink_zone+0x66/0x1a0
[   26.115282]  [<ffffffff810ba1fb>] do_try_to_free_pages+0xeb/0x570
[   26.115282]  [<ffffffff810eb7d9>] ? lookup_page_cgroup_used+0x9/0x20
[   26.115282]  [<ffffffff810ba7af>] try_to_free_pages+0x9f/0xc0
[   26.115282]  [<ffffffff810b1357>] __alloc_pages_nodemask+0x5a7/0x970
[   26.115282]  [<ffffffff810cb2be>] handle_pte_fault+0x65e/0x880
[   26.115282]  [<ffffffff810cc7d9>] handle_mm_fault+0x139/0x1e0
[   26.115282]  [<ffffffff81027920>] __do_page_fault+0x160/0x460
[   26.115282]  [<ffffffff810d176c>] ? do_brk+0x1fc/0x360
[   26.115282]  [<ffffffff81212979>] ? lockdep_sys_exit_thunk+0x35/0x67
[   26.115282]  [<ffffffff81027c49>] do_page_fault+0x9/0x10
[   26.115282]  [<ffffffff813b4a72>] page_fault+0x22/0x30
[   26.115282] Code: a9 00 00 08 00 0f 85 43 fe ff ff e9 b8 fe ff ff 66 0f 1f 44 00 00 41 8b 44 24 18 85 c0 0f 89 2b fe ff ff 0f 1f 00 e9 9d fe ff ff <0f> 0b 66 0f 1f 84 00 00 00 00 00 49 89 9c 24 48 0f 00 00 e9 0a 
[   26.115282] RIP  [<ffffffff810ed425>] __mem_cgroup_uncharge_common+0x255/0x2e0
[   26.115282]  RSP <ffff88001b443708>
[   26.171597] ---[ end trace 5e49a21e51452c24 ]---


mm/memcontrol:4111
VM_BUG_ON(PageSwapCache(page));

Seems that mem_cgroup_uncharge_swapcache, somewhat ironically expects the
SwapCache flag to be unset already.

Fix might be as simple as removing that VM_BUG_ON(), but there might be more to
it.  There usually is :)

Seth

> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  linux.git-davehans/mm/swap_state.c |    4 ++--
>  linux.git-davehans/mm/vmscan.c     |    2 ++
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff -puN mm/swap_state.c~__delete_from_swap_cache-dont-clear-page-private mm/swap_state.c
> --- linux.git/mm/swap_state.c~__delete_from_swap_cache-dont-clear-page-private	2013-05-07 13:48:13.698044473 -0700
> +++ linux.git-davehans/mm/swap_state.c	2013-05-07 13:48:13.703044693 -0700
> @@ -146,8 +146,6 @@ void __delete_from_swap_cache(struct pag
>  	entry.val = page_private(page);
>  	address_space = swap_address_space(entry);
>  	radix_tree_delete(&address_space->page_tree, page_private(page));
> -	set_page_private(page, 0);
> -	ClearPageSwapCache(page);
>  	address_space->nrpages--;
>  	__dec_zone_page_state(page, NR_FILE_PAGES);
>  	INC_CACHE_INFO(del_total);
> @@ -224,6 +222,8 @@ void delete_from_swap_cache(struct page
>  	spin_unlock_irq(&address_space->tree_lock);
> 
>  	swapcache_free(entry, page);
> +	set_page_private(page, 0);
> +	ClearPageSwapCache(page);
>  	page_cache_release(page);
>  }
> 
> diff -puN mm/vmscan.c~__delete_from_swap_cache-dont-clear-page-private mm/vmscan.c
> --- linux.git/mm/vmscan.c~__delete_from_swap_cache-dont-clear-page-private	2013-05-07 13:48:13.700044561 -0700
> +++ linux.git-davehans/mm/vmscan.c	2013-05-07 13:48:13.705044783 -0700
> @@ -494,6 +494,8 @@ static int __remove_mapping(struct addre
>  		__delete_from_swap_cache(page);
>  		spin_unlock_irq(&mapping->tree_lock);
>  		swapcache_free(swap, page);
> +		set_page_private(page, 0);
> +		ClearPageSwapCache(page);
>  	} else {
>  		void (*freepage)(struct page *);
> 
> _
> 
> --
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 5/7] create __remove_mapping_batch()
  2013-05-07 21:20 ` [RFC][PATCH 5/7] create __remove_mapping_batch() Dave Hansen
@ 2013-05-09 22:13   ` Seth Jennings
  2013-05-09 22:18     ` Dave Hansen
  2013-05-14 15:51   ` Mel Gorman
  1 sibling, 1 reply; 25+ messages in thread
From: Seth Jennings @ 2013-05-09 22:13 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Tue, May 07, 2013 at 02:20:01PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> __remove_mapping_batch() does logically the same thing as
> __remove_mapping().
> 
> We batch like this so that several pages can be freed with a
> single mapping->tree_lock acquisition/release pair.  This reduces
> the number of atomic operations and ensures that we do not bounce
> cachelines around.
> 
> It has shown some substantial performance benefits on
> microbenchmarks.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  linux.git-davehans/mm/vmscan.c |   50 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff -puN mm/vmscan.c~create-remove_mapping_batch mm/vmscan.c
> --- linux.git/mm/vmscan.c~create-remove_mapping_batch	2013-05-07 14:00:01.432361260 -0700
> +++ linux.git-davehans/mm/vmscan.c	2013-05-07 14:19:32.341148892 -0700
> @@ -555,6 +555,56 @@ int remove_mapping(struct address_space
>  	return 0;
>  }
> 
> +/*
> + * pages come in here (via remove_list) locked and leave unlocked
> + * (on either ret_pages or free_pages)
> + *
> + * We do this batching so that we free batches of pages with a
> + * single mapping->tree_lock acquisition/release.  This optimization
> + * only makes sense when the pages on remove_list all share a
> + * page->mapping.  If this is violated you will BUG_ON().
> + */
> +static int __remove_mapping_batch(struct list_head *remove_list,
> +				  struct list_head *ret_pages,
> +				  struct list_head *free_pages)
> +{
> +	int nr_reclaimed = 0;
> +	struct address_space *mapping;
> +	struct page *page;
> +	LIST_HEAD(need_free_mapping);
> +
> +	if (list_empty(remove_list))
> +		return 0;
> +
> +	mapping = lru_to_page(remove_list)->mapping;

This doesn't work for pages in the swap cache as mapping is overloaded to
hold... something else that I can't remember off the top of my head.  Anyway,
this happens:

[   70.027984] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[   70.028010] IP: [<ffffffff81077de8>] __lock_acquire.isra.24+0x188/0xd10
[   70.028010] PGD 1ab99067 PUD 671e5067 PMD 0 
[   70.028010] Oops: 0000 [#1] PREEMPT SMP 
[   70.028010] Modules linked in:
[   70.028010] CPU: 1 PID: 11494 Comm: cc1 Not tainted 3.9.0+ #6
[   70.028010] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[   70.028010] task: ffff88007c708f70 ti: ffff88001aa28000 task.ti: ffff88001aa28000
[   70.028010] RIP: 0010:[<ffffffff81077de8>]  [<ffffffff81077de8>] __lock_acquire.isra.24+0x188/0xd10
[   70.028010] RSP: 0000:ffff88001aa29658  EFLAGS: 00010097
[   70.028010] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   70.028010] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   70.028010] RBP: ffff88001aa296c8 R08: 0000000000000001 R09: 0000000000000000
[   70.028010] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007c708f70
[   70.028010] R13: 0000000000000001 R14: 0000000000000030 R15: ffff88001aa29758
[   70.028010] FS:  00007fe676f32700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
[   70.028010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.028010] CR2: 0000000000000038 CR3: 000000001b7ba000 CR4: 00000000000006a0
[   70.028010] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   70.028010] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   70.028010] Stack:
[   70.028010]  0000000000000001 ffffffff8165d910 ffff88001aa296e8 ffffffff8107e925
[   70.028010]  ffff88001aa296a8 0000000000000000 0000000000000002 0000000000000046
[   70.028010]  dead000000200200 ffff88007c708f70 0000000000000046 0000000000000000
[   70.028010] Call Trace:
[   70.028010]  [<ffffffff8107e925>] ? smp_call_function_single+0xd5/0x180
[   70.028010]  [<ffffffff81078ea2>] lock_acquire+0x52/0x70
[   70.028010]  [<ffffffff810b7468>] ? __remove_mapping_batch+0x48/0x140
[   70.028010]  [<ffffffff813b3cf7>] _raw_spin_lock_irq+0x37/0x50
[   70.028010]  [<ffffffff810b7468>] ? __remove_mapping_batch+0x48/0x140
[   70.028010]  [<ffffffff810b7468>] __remove_mapping_batch+0x48/0x140
[   70.028010]  [<ffffffff810b88b0>] shrink_page_list+0x680/0x9f0
[   70.028010]  [<ffffffff810b910f>] shrink_inactive_list+0x13f/0x380
[   70.028010]  [<ffffffff810b9590>] shrink_lruvec+0x240/0x4e0
[   70.028010]  [<ffffffff810b9896>] shrink_zone+0x66/0x1a0
[   70.028010]  [<ffffffff810ba43b>] do_try_to_free_pages+0xeb/0x570
[   70.028010]  [<ffffffff810eba29>] ? lookup_page_cgroup_used+0x9/0x20
[   70.028010]  [<ffffffff810ba9ef>] try_to_free_pages+0x9f/0xc0
[   70.028010]  [<ffffffff810b1357>] __alloc_pages_nodemask+0x5a7/0x970
[   70.028010]  [<ffffffff810cb4fe>] handle_pte_fault+0x65e/0x880
[   70.028010]  [<ffffffff810cca19>] handle_mm_fault+0x139/0x1e0
[   70.028010]  [<ffffffff81027920>] __do_page_fault+0x160/0x460
[   70.028010]  [<ffffffff811121f1>] ? mntput+0x21/0x30
[   70.028010]  [<ffffffff810f56c1>] ? __fput+0x191/0x250
[   70.028010]  [<ffffffff81212bc9>] ? lockdep_sys_exit_thunk+0x35/0x67
[   70.028010]  [<ffffffff81027c49>] do_page_fault+0x9/0x10
[   70.028010]  [<ffffffff813b4cb2>] page_fault+0x22/0x30
[   70.028010] Code: 48 c7 c1 65 1b 50 81 48 c7 c2 3d 07 50 81 31 c0 be fb 0b 00 00 48 c7 c7 a8 58 50 81 e8 62 89 fb ff e9 d7 01 00 00 0f 1f 44 00 00 <4d> 8b 6c c6 08 4d 85 ed 0f 84 cc fe ff ff f0 41 ff 85 98 01 00 
[   70.028010] RIP  [<ffffffff81077de8>] __lock_acquire.isra.24+0x188/0xd10
[   70.028010]  RSP <ffff88001aa29658>
[   70.028010] CR2: 0000000000000038
[   70.028010] ---[ end trace 94be6276375f6199 ]---

The solution is to use page_mapping(), which has the logic to handle the
swap cache page mapping.
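
For the record (and from memory of the current code, so treat the
details as approximate): for swap cache pages, ->mapping holds anon_vma
information rather than an address_space, and page_mapping()
special-cases that roughly like so:

	struct address_space *page_mapping(struct page *page)
	{
		struct address_space *mapping = page->mapping;

		if (unlikely(PageSwapCache(page))) {
			swp_entry_t entry = { .val = page_private(page) };
			mapping = swap_address_space(entry);
		} else if ((unsigned long)mapping & PAGE_MAPPING_ANON) {
			mapping = NULL;	/* anon: ->mapping is really an anon_vma */
		}
		return mapping;
	}

which is exactly the special-casing the batch code needs.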

I got by it with:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 43b4da8..897eb5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -576,13 +576,15 @@ static int __remove_mapping_batch(struct list_head *remove_list,
        if (list_empty(remove_list))
                return 0;
 
-       mapping = lru_to_page(remove_list)->mapping;
+       page = lru_to_page(remove_list);
+       mapping = page_mapping(page);
+       BUG_ON(mapping == NULL);
        spin_lock_irq(&mapping->tree_lock);
        while (!list_empty(remove_list)) {
                int freed;
                page = lru_to_page(remove_list);
                BUG_ON(!PageLocked(page));
-               BUG_ON(page->mapping != mapping);
+               BUG_ON(page_mapping(page) != mapping);
                list_del(&page->lru);
 
                freed = __remove_mapping_nolock(mapping, page);

Seth

> +	spin_lock_irq(&mapping->tree_lock);
> +	while (!list_empty(remove_list)) {
> +		int freed;
> +		page = lru_to_page(remove_list);
> +		BUG_ON(!PageLocked(page));
> +		BUG_ON(page->mapping != mapping);
> +		list_del(&page->lru);
> +
> +		freed = __remove_mapping_nolock(mapping, page);
> +		if (freed) {
> +			list_add(&page->lru, &need_free_mapping);
> +		} else {
> +			unlock_page(page);
> +			list_add(&page->lru, ret_pages);
> +		}
> +	}
> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	while (!list_empty(&need_free_mapping)) {
> +		page = lru_to_page(&need_free_mapping);
> +		list_move(&page->list, free_pages);
> +		free_mapping_page(mapping, page);
> +		unlock_page(page);
> +		nr_reclaimed++;
> +	}
> +	return nr_reclaimed;
> +}
> +
>  /**
>   * putback_lru_page - put previously isolated page onto appropriate LRU list
>   * @page: page to be put back to appropriate lru list
> _
> 
> --
> 


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 5/7] create __remove_mapping_batch()
  2013-05-09 22:13   ` Seth Jennings
@ 2013-05-09 22:18     ` Dave Hansen
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-05-09 22:18 UTC (permalink / raw)
  To: Seth Jennings; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On 05/09/2013 03:13 PM, Seth Jennings wrote:
>> > +	mapping = lru_to_page(remove_list)->mapping;
> This doesn't work for pages in the swap cache as mapping is overloaded to
> hold... something else that I can't remember off the top of my head.  Anyway,
> this happens:

Yup, that's true.  I'm supposed to be using page_mapping() here.  I know
I ran into that along the way and fixed a few of the sites in my patch,
but missed that one.  I'll fix it up.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages
  2013-05-09 22:07   ` Seth Jennings
@ 2013-05-09 22:19     ` Dave Hansen
  2013-05-10  9:26     ` Michal Hocko
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-05-09 22:19 UTC (permalink / raw)
  To: Seth Jennings; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On 05/09/2013 03:07 PM, Seth Jennings wrote:
> mm/memcontrol:4111
> VM_BUG_ON(PageSwapCache(page));
> 
> Seems that mem_cgroup_uncharge_swapcache, somewhat ironically expects the
> SwapCache flag to be unset already.
> 
> Fix might be as simple as removing that VM_BUG_ON(), but there might be more to
> it.  There usually is :)

Yeah, I wasn't testing with cgroups around.  I'll go do that and see
what I come up with.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages
  2013-05-09 22:07   ` Seth Jennings
  2013-05-09 22:19     ` Dave Hansen
@ 2013-05-10  9:26     ` Michal Hocko
  2013-05-10 14:01       ` Seth Jennings
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2013-05-10  9:26 UTC (permalink / raw)
  To: Seth Jennings
  Cc: Dave Hansen, linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Thu 09-05-13 17:07:39, Seth Jennings wrote:
> On Tue, May 07, 2013 at 02:19:55PM -0700, Dave Hansen wrote:
> > 
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> > 
> > There are only two callers of swapcache_free() which actually
> > pass in a non-NULL 'struct page'.  Both of them
> > (__remove_mapping and delete_from_swap_cache())  create a
> > temporary on-stack 'swp_entry_t' and set entry.val to
> > page_private().
> > 
> > They need to do this since __delete_from_swap_cache() does
> > set_page_private(page, 0) and destroys the information.
> > 
> > However, I'd like to batch a few of these operations on several
> > pages together in a new version of __remove_mapping(), and I
> > would prefer not to have to allocate temporary storage for
> > each page.  The code is pretty ugly, and it's a bit silly
> > to create these on-stack 'swp_entry_t's when it is so easy to
> > just keep the information around in 'struct page'.
> > 
> > There should not be any danger in doing this since we are
> > absolutely on the path of freeing these pages.  There is no
> > turning back, and no other references can be obtained
> > after a page comes out of the radix tree.
> 
> I get a BUG on this one:
> 
> [   26.114818] ------------[ cut here ]------------
> [   26.115282] kernel BUG at mm/memcontrol.c:4111!
> [   26.115282] invalid opcode: 0000 [#1] PREEMPT SMP 
> [   26.115282] Modules linked in:
> [   26.115282] CPU: 3 PID: 5026 Comm: cc1 Not tainted 3.9.0+ #8
> [   26.115282] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> [   26.115282] task: ffff88007c1cdca0 ti: ffff88001b442000 task.ti: ffff88001b442000
> [   26.115282] RIP: 0010:[<ffffffff810ed425>]  [<ffffffff810ed425>] __mem_cgroup_uncharge_common+0x255/0x2e0
> [   26.115282] RSP: 0000:ffff88001b443708  EFLAGS: 00010206
> [   26.115282] RAX: 4000000000090009 RBX: 0000000000000000 RCX: ffffc90000014001
> [   26.115282] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffea00006e5b40
> [   26.115282] RBP: ffff88001b443738 R08: 0000000000000000 R09: 0000000000000000
> [   26.115282] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00006e5b40
> [   26.115282] R13: 0000000000000000 R14: ffffea00006e5b40 R15: 0000000000000002
> [   26.115282] FS:  00007fabd08ee700(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
> [   26.115282] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   26.115282] CR2: 00007fabce27a000 CR3: 000000001985f000 CR4: 00000000000006a0
> [   26.115282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   26.115282] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [   26.115282] Stack:
> [   26.115282]  ffffffff810dcbae ffff880064a0a500 0000000000000001 ffffea00006e5b40
> [   26.115282]  ffffea00006e5b40 0000000000000001 ffff88001b443748 ffffffff810f0d05
> [   26.115282]  ffff88001b443778 ffffffff810ddb3e ffff88001b443778 ffffea00006e5b40
> [   26.115282] Call Trace:
> [   26.115282]  [<ffffffff810dcbae>] ? swap_info_get+0x5e/0xe0
> [   26.115282]  [<ffffffff810f0d05>] mem_cgroup_uncharge_swapcache+0x15/0x20
> [   26.115282]  [<ffffffff810ddb3e>] swapcache_free+0x4e/0x70
> [   26.115282]  [<ffffffff810b6e67>] __remove_mapping+0xd7/0x120
> [   26.115282]  [<ffffffff810b8682>] shrink_page_list+0x5c2/0x920
> [   26.115282]  [<ffffffff810b780e>] ? isolate_lru_pages.isra.37+0xae/0x120
> [   26.115282]  [<ffffffff810b8ecf>] shrink_inactive_list+0x13f/0x380
> [   26.115282]  [<ffffffff810b9350>] shrink_lruvec+0x240/0x4e0
> [   26.115282]  [<ffffffff810b9656>] shrink_zone+0x66/0x1a0
> [   26.115282]  [<ffffffff810ba1fb>] do_try_to_free_pages+0xeb/0x570
> [   26.115282]  [<ffffffff810eb7d9>] ? lookup_page_cgroup_used+0x9/0x20
> [   26.115282]  [<ffffffff810ba7af>] try_to_free_pages+0x9f/0xc0
> [   26.115282]  [<ffffffff810b1357>] __alloc_pages_nodemask+0x5a7/0x970
> [   26.115282]  [<ffffffff810cb2be>] handle_pte_fault+0x65e/0x880
> [   26.115282]  [<ffffffff810cc7d9>] handle_mm_fault+0x139/0x1e0
> [   26.115282]  [<ffffffff81027920>] __do_page_fault+0x160/0x460
> [   26.115282]  [<ffffffff810d176c>] ? do_brk+0x1fc/0x360
> [   26.115282]  [<ffffffff81212979>] ? lockdep_sys_exit_thunk+0x35/0x67
> [   26.115282]  [<ffffffff81027c49>] do_page_fault+0x9/0x10
> [   26.115282]  [<ffffffff813b4a72>] page_fault+0x22/0x30
> [   26.115282] Code: a9 00 00 08 00 0f 85 43 fe ff ff e9 b8 fe ff ff 66 0f 1f 44 00 00 41 8b 44 24 18 85 c0 0f 89 2b fe ff ff 0f 1f 00 e9 9d fe ff ff <0f> 0b 66 0f 1f 84 00 00 00 00 00 49 89 9c 24 48 0f 00 00 e9 0a 
> [   26.115282] RIP  [<ffffffff810ed425>] __mem_cgroup_uncharge_common+0x255/0x2e0
> [   26.115282]  RSP <ffff88001b443708>
> [   26.171597] ---[ end trace 5e49a21e51452c24 ]---
> 
> 
> mm/memcontrol:4111
> VM_BUG_ON(PageSwapCache(page));
> 
> Seems that mem_cgroup_uncharge_swapcache, somewhat ironically expects the
> SwapCache flag to be unset already.
> 
> Fix might be as simple as removing that VM_BUG_ON(), but there might be more to
> it.  There usually is :)

This has been already fixed in the -mm tree
(http://git.kernel.org/cgit/linux/kernel/git/mhocko/mm.git/commit/?h=since-3.9&id=b341f7ffa5fe6ae11afa87e2fecc32c6093541f8)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages
  2013-05-10  9:26     ` Michal Hocko
@ 2013-05-10 14:01       ` Seth Jennings
  0 siblings, 0 replies; 25+ messages in thread
From: Seth Jennings @ 2013-05-10 14:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Fri, May 10, 2013 at 11:26:49AM +0200, Michal Hocko wrote:
> On Thu 09-05-13 17:07:39, Seth Jennings wrote:
> > On Tue, May 07, 2013 at 02:19:55PM -0700, Dave Hansen wrote:
> > > 
> > > From: Dave Hansen <dave.hansen@linux.intel.com>
> > > 
> > > There are only two callers of swapcache_free() which actually
> > > pass in a non-NULL 'struct page'.  Both of them
> > > (__remove_mapping and delete_from_swap_cache())  create a
> > > temporary on-stack 'swp_entry_t' and set entry.val to
> > > page_private().
> > > 
> > > They need to do this since __delete_from_swap_cache() does
> > > set_page_private(page, 0) and destroys the information.
> > > 
> > > However, I'd like to batch a few of these operations on several
> > > pages together in a new version of __remove_mapping(), and I
> > > would prefer not to have to allocate temporary storage for
> > > each page.  The code is pretty ugly, and it's a bit silly
> > > to create these on-stack 'swp_entry_t's when it is so easy to
> > > just keep the information around in 'struct page'.
> > > 
> > > There should not be any danger in doing this since we are
> > > absolutely on the path of freeing these pages.  There is no
> > > turning back, and no other references can be obtained
> > > after it comes out of the radix tree.
> > 
> > I get a BUG on this one:
> > 
> > [   26.114818] ------------[ cut here ]------------
> > [   26.115282] kernel BUG at mm/memcontrol.c:4111!
> > [   26.115282] invalid opcode: 0000 [#1] PREEMPT SMP 
> > [   26.115282] Modules linked in:
> > [   26.115282] CPU: 3 PID: 5026 Comm: cc1 Not tainted 3.9.0+ #8
> > [   26.115282] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> > [   26.115282] task: ffff88007c1cdca0 ti: ffff88001b442000 task.ti: ffff88001b442000
> > [   26.115282] RIP: 0010:[<ffffffff810ed425>]  [<ffffffff810ed425>] __mem_cgroup_uncharge_common+0x255/0x2e0
> > [   26.115282] RSP: 0000:ffff88001b443708  EFLAGS: 00010206
> > [   26.115282] RAX: 4000000000090009 RBX: 0000000000000000 RCX: ffffc90000014001
> > [   26.115282] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffffea00006e5b40
> > [   26.115282] RBP: ffff88001b443738 R08: 0000000000000000 R09: 0000000000000000
> > [   26.115282] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00006e5b40
> > [   26.115282] R13: 0000000000000000 R14: ffffea00006e5b40 R15: 0000000000000002
> > [   26.115282] FS:  00007fabd08ee700(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
> > [   26.115282] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   26.115282] CR2: 00007fabce27a000 CR3: 000000001985f000 CR4: 00000000000006a0
> > [   26.115282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [   26.115282] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [   26.115282] Stack:
> > [   26.115282]  ffffffff810dcbae ffff880064a0a500 0000000000000001 ffffea00006e5b40
> > [   26.115282]  ffffea00006e5b40 0000000000000001 ffff88001b443748 ffffffff810f0d05
> > [   26.115282]  ffff88001b443778 ffffffff810ddb3e ffff88001b443778 ffffea00006e5b40
> > [   26.115282] Call Trace:
> > [   26.115282]  [<ffffffff810dcbae>] ? swap_info_get+0x5e/0xe0
> > [   26.115282]  [<ffffffff810f0d05>] mem_cgroup_uncharge_swapcache+0x15/0x20
> > [   26.115282]  [<ffffffff810ddb3e>] swapcache_free+0x4e/0x70
> > [   26.115282]  [<ffffffff810b6e67>] __remove_mapping+0xd7/0x120
> > [   26.115282]  [<ffffffff810b8682>] shrink_page_list+0x5c2/0x920
> > [   26.115282]  [<ffffffff810b780e>] ? isolate_lru_pages.isra.37+0xae/0x120
> > [   26.115282]  [<ffffffff810b8ecf>] shrink_inactive_list+0x13f/0x380
> > [   26.115282]  [<ffffffff810b9350>] shrink_lruvec+0x240/0x4e0
> > [   26.115282]  [<ffffffff810b9656>] shrink_zone+0x66/0x1a0
> > [   26.115282]  [<ffffffff810ba1fb>] do_try_to_free_pages+0xeb/0x570
> > [   26.115282]  [<ffffffff810eb7d9>] ? lookup_page_cgroup_used+0x9/0x20
> > [   26.115282]  [<ffffffff810ba7af>] try_to_free_pages+0x9f/0xc0
> > [   26.115282]  [<ffffffff810b1357>] __alloc_pages_nodemask+0x5a7/0x970
> > [   26.115282]  [<ffffffff810cb2be>] handle_pte_fault+0x65e/0x880
> > [   26.115282]  [<ffffffff810cc7d9>] handle_mm_fault+0x139/0x1e0
> > [   26.115282]  [<ffffffff81027920>] __do_page_fault+0x160/0x460
> > [   26.115282]  [<ffffffff810d176c>] ? do_brk+0x1fc/0x360
> > [   26.115282]  [<ffffffff81212979>] ? lockdep_sys_exit_thunk+0x35/0x67
> > [   26.115282]  [<ffffffff81027c49>] do_page_fault+0x9/0x10
> > [   26.115282]  [<ffffffff813b4a72>] page_fault+0x22/0x30
> > [   26.115282] Code: a9 00 00 08 00 0f 85 43 fe ff ff e9 b8 fe ff ff 66 0f 1f 44 00 00 41 8b 44 24 18 85 c0 0f 89 2b fe ff ff 0f 1f 00 e9 9d fe ff ff <0f> 0b 66 0f 1f 84 00 00 00 00 00 49 89 9c 24 48 0f 00 00 e9 0a 
> > [   26.115282] RIP  [<ffffffff810ed425>] __mem_cgroup_uncharge_common+0x255/0x2e0
> > [   26.115282]  RSP <ffff88001b443708>
> > [   26.171597] ---[ end trace 5e49a21e51452c24 ]---
> > 
> > 
> > mm/memcontrol:4111
> > VM_BUG_ON(PageSwapCache(page));
> > 
> > Seems that mem_cgroup_uncharge_swapcache, somewhat ironically, expects the
> > SwapCache flag to be unset already.
> > 
> > Fix might be as simple as removing that VM_BUG_ON(), but there might be more to
> > it.  There usually is :)
> 
> This has already been fixed in the -mm tree
> (http://git.kernel.org/cgit/linux/kernel/git/mhocko/mm.git/commit/?h=since-3.9&id=b341f7ffa5fe6ae11afa87e2fecc32c6093541f8)

Ah yes, I even saw this patch come in; now that I see it again, I should have known!

Seth


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages
  2013-05-07 21:19 ` [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages Dave Hansen
  2013-05-09 22:07   ` Seth Jennings
@ 2013-05-14 14:55   ` Mel Gorman
  1 sibling, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2013-05-14 14:55 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, tim.c.chen

On Tue, May 07, 2013 at 02:19:55PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> There are only two callers of swapcache_free() which actually
> pass in a non-NULL 'struct page'.  Both of them
> (__remove_mapping and delete_from_swap_cache())  create a
> temporary on-stack 'swp_entry_t' and set entry.val to
> page_private().
> 
> They need to do this since __delete_from_swap_cache() does
> set_page_private(page, 0) and destroys the information.
> 
> However, I'd like to batch a few of these operations on several
> pages together in a new version of __remove_mapping(), and I
> would prefer not to have to allocate temporary storage for
> each page.  The code is pretty ugly, and it's a bit silly
> to create these on-stack 'swp_entry_t's when it is so easy to
> just keep the information around in 'struct page'.
> 
> There should not be any danger in doing this since we are
> absolutely on the path of freeing these pages.  There is no
> turning back, and no other references can be obtained
> after it comes out of the radix tree.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

On its own, this patch looks like it has a lot missing, but when
combined with patch 2 it starts making sense, so

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 2/7] make 'struct page' and swp_entry_t variants of swapcache_free().
  2013-05-07 21:19 ` [RFC][PATCH 2/7] make 'struct page' and swp_entry_t variants of swapcache_free() Dave Hansen
@ 2013-05-14 15:00   ` Mel Gorman
  0 siblings, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2013-05-14 15:00 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, tim.c.chen

On Tue, May 07, 2013 at 02:19:57PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> swapcache_free() takes two arguments:
> 
> 	void swapcache_free(swp_entry_t entry, struct page *page)
> 
> Most of its callers (5/7) are from error handling paths that haven't even
> instantiated a page, so they pass page=NULL.  Both of the callers
> that call in with a 'struct page' create and pass in a temporary
> swp_entry_t.
> 
> Now that we are deferring clearing page_private() until after
> swapcache_free() has been called, we can just create a variant
> that takes a 'struct page' and does the temporary variable in
> the helper.
> 
> That leaves all the other callers doing
> 
> 	swapcache_free(entry, NULL)
> 
> so create another helper for them that makes it clear that they
> need only pass in a swp_entry_t.
> 
> One downside here is that delete_from_swap_cache() now does
> an extra swap_address_space() call.  But, those are pretty
> cheap (just some array index arithmetic).
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  linux.git-davehans/drivers/staging/zcache/zcache-main.c |    2 +-
>  linux.git-davehans/include/linux/swap.h                 |    3 ++-
>  linux.git-davehans/mm/shmem.c                           |    2 +-
>  linux.git-davehans/mm/swap_state.c                      |   13 +++++--------
>  linux.git-davehans/mm/swapfile.c                        |   13 ++++++++++++-
>  linux.git-davehans/mm/vmscan.c                          |    3 +--
>  6 files changed, 22 insertions(+), 14 deletions(-)
> 
> diff -puN drivers/staging/zcache/zcache-main.c~make-page-and-swp_entry_t-variants drivers/staging/zcache/zcache-main.c
> --- linux.git/drivers/staging/zcache/zcache-main.c~make-page-and-swp_entry_t-variants	2013-05-07 13:48:13.963056205 -0700
> +++ linux.git-davehans/drivers/staging/zcache/zcache-main.c	2013-05-07 13:48:13.975056737 -0700
> @@ -961,7 +961,7 @@ static int zcache_get_swap_cache_page(in
>  		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
>  		 * clear SWAP_HAS_CACHE flag.
>  		 */
> -		swapcache_free(entry, NULL);
> +		swapcache_free_entry(entry);
>  		/* FIXME: is it possible to get here without err==-ENOMEM?
>  		 * If not, we can dispense with the do loop, use goto retry */
>  	} while (err != -ENOMEM);
> diff -puN include/linux/swap.h~make-page-and-swp_entry_t-variants include/linux/swap.h
> --- linux.git/include/linux/swap.h~make-page-and-swp_entry_t-variants	2013-05-07 13:48:13.964056249 -0700
> +++ linux.git-davehans/include/linux/swap.h	2013-05-07 13:48:13.975056737 -0700
> @@ -382,7 +382,8 @@ extern void swap_shmem_alloc(swp_entry_t
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> -extern void swapcache_free(swp_entry_t, struct page *page);
> +extern void swapcache_free_entry(swp_entry_t entry);
> +extern void swapcache_free_page_entry(struct page *page);
>  extern int free_swap_and_cache(swp_entry_t);
>  extern int swap_type_of(dev_t, sector_t, struct block_device **);
>  extern unsigned int count_swap_pages(int, int);
> diff -puN mm/shmem.c~make-page-and-swp_entry_t-variants mm/shmem.c
> --- linux.git/mm/shmem.c~make-page-and-swp_entry_t-variants	2013-05-07 13:48:13.966056339 -0700
> +++ linux.git-davehans/mm/shmem.c	2013-05-07 13:48:13.976056781 -0700
> @@ -871,7 +871,7 @@ static int shmem_writepage(struct page *
>  	}
>  
>  	mutex_unlock(&shmem_swaplist_mutex);
> -	swapcache_free(swap, NULL);
> +	swapcache_free_entry(swap);
>  redirty:
>  	set_page_dirty(page);
>  	if (wbc->for_reclaim)
> diff -puN mm/swapfile.c~make-page-and-swp_entry_t-variants mm/swapfile.c
> --- linux.git/mm/swapfile.c~make-page-and-swp_entry_t-variants	2013-05-07 13:48:13.968056427 -0700
> +++ linux.git-davehans/mm/swapfile.c	2013-05-07 13:48:13.977056825 -0700
> @@ -637,7 +637,7 @@ void swap_free(swp_entry_t entry)
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */
> -void swapcache_free(swp_entry_t entry, struct page *page)
> +static void __swapcache_free(swp_entry_t entry, struct page *page)
>  {
>  	struct swap_info_struct *p;
>  	unsigned char count;
> @@ -651,6 +651,17 @@ void swapcache_free(swp_entry_t entry, s
>  	}
>  }
>  
> +void swapcache_free_entry(swp_entry_t entry)
> +{
> +	__swapcache_free(entry, NULL);
> +}
> +
> +void swapcache_free_page_entry(struct page *page)
> +{
> +	swp_entry_t entry = { .val = page_private(page) };
> +	__swapcache_free(entry, page);
> +}

Patch one moved the clearing of page_private() and ClearPageSwapCache
from __delete_from_swap_cache to two callers. Now that you have split
the function, it would be a lot tidier if this helper looked like

void swapcache_free_page_entry(struct page *page)
{
	swp_entry_t entry = { .val = page_private(page) };
	__swapcache_free(entry, page);
	set_page_private(page, 0);
	ClearPageSwapCache(page);
}

and the callers were no longer responsible for it, as before. I suspect this would
have been more obvious if patches 1 & 2 were collapsed together. Otherwise,
independent of the rest of the series, this looks like a reasonable cleanup.
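
With the clearing pulled into the helper like that, the swap cache
branch of __remove_mapping() could go back to roughly the following
(a sketch only, ignoring how the locking moves around later in the
series):

	if (PageSwapCache(page)) {
		__delete_from_swap_cache(page);
		spin_unlock_irq(&mapping->tree_lock);
		/* clears page_private() and PG_swapcache as well now */
		swapcache_free_page_entry(page);
	}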

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 3/7] break up __remove_mapping()
  2013-05-07 21:19 ` [RFC][PATCH 3/7] break up __remove_mapping() Dave Hansen
@ 2013-05-14 15:22   ` Mel Gorman
  0 siblings, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2013-05-14 15:22 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, tim.c.chen

On Tue, May 07, 2013 at 02:19:58PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Our goal here is to eventually reduce the number of repetitive
> acquire/release operations on mapping->tree_lock.
> 
> To start out, we make a version of __remove_mapping() called
> __remove_mapping_nolock().  This actually makes the locking
> _much_ more straightforward.
> 
> One non-obvious part of this patch: the
> 
> 	freepage = mapping->a_ops->freepage;
> 
> used to happen under the mapping->tree_lock, but this patch
> moves it to outside of the lock.  All of the other
> a_ops->freepage users do it outside the lock, and we only
> assign it when we create inodes, so that makes it safe.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

It's a stupid nit, but more often than not, foo and __foo refer to the
locked and unlocked versions of a function. Other times it refers to
functions with internal helpers. In this patch, it looks like there are
two helpers "locked" and "really, I mean it, it's locked this time".
The term "nolock" is ambiguous because it could mean either "no lock is
acquired" or "no lock needs to be acquired". It's all in one file so
it's hardly a major problem but I would suggest different names. Maybe

remove_mapping
lock_remove_mapping
__remove_mapping

instead of

remove_mapping
__remove_mapping
__remove_mapping_nolock
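
A rough, untested sketch of that split, with the names above being
purely illustrative:

/* does the actual work; caller holds mapping->tree_lock */
static int __remove_mapping(struct address_space *mapping,
			    struct page *page);

static int lock_remove_mapping(struct address_space *mapping,
			       struct page *page)
{
	int ret;

	spin_lock_irq(&mapping->tree_lock);
	ret = __remove_mapping(mapping, page);
	spin_unlock_irq(&mapping->tree_lock);

	return ret;
}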

Whether you do that or not,

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 4/7] break out mapping "freepage" code
  2013-05-07 21:20 ` [RFC][PATCH 4/7] break out mapping "freepage" code Dave Hansen
@ 2013-05-14 15:26   ` Mel Gorman
  0 siblings, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2013-05-14 15:26 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, tim.c.chen

On Tue, May 07, 2013 at 02:20:00PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> __remove_mapping() only deals with pages with mappings, meaning
> page cache and swap cache.
> 
> At this point, the page has been removed from the mapping's radix
> tree, and we need to ensure that any fs-specific (or swap-
> specific) resources are freed up.
> 
> We will be using this function from a second location in a
> following patch.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

I used up all my complaining-about-naming beans in the last patch and do
not have a better suggestion for the new helper, so ...

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 5/7] create __remove_mapping_batch()
  2013-05-07 21:20 ` [RFC][PATCH 5/7] create __remove_mapping_batch() Dave Hansen
  2013-05-09 22:13   ` Seth Jennings
@ 2013-05-14 15:51   ` Mel Gorman
  2013-05-16 17:14     ` Dave Hansen
  1 sibling, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2013-05-14 15:51 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, tim.c.chen

On Tue, May 07, 2013 at 02:20:01PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> __remove_mapping_batch() does logically the same thing as
> __remove_mapping().
> 
> We batch like this so that several pages can be freed with a
> single mapping->tree_lock acquisition/release pair.  This reduces
> the number of atomic operations and ensures that we do not bounce
> cachelines around.
> 
> It has shown some substantial performance benefits on
> microbenchmarks.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  linux.git-davehans/mm/vmscan.c |   50 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff -puN mm/vmscan.c~create-remove_mapping_batch mm/vmscan.c
> --- linux.git/mm/vmscan.c~create-remove_mapping_batch	2013-05-07 14:00:01.432361260 -0700
> +++ linux.git-davehans/mm/vmscan.c	2013-05-07 14:19:32.341148892 -0700
> @@ -555,6 +555,56 @@ int remove_mapping(struct address_space
>  	return 0;
>  }
>  
> +/*
> + * pages come in here (via remove_list) locked and leave unlocked
> + * (on either ret_pages or free_pages)
> + *
> + * We do this batching so that we free batches of pages with a
> + * single mapping->tree_lock acquisition/release.  This optimization
> + * only makes sense when the pages on remove_list all share a
> + * page->mapping.  If this is violated you will BUG_ON().
> + */
> +static int __remove_mapping_batch(struct list_head *remove_list,
> +				  struct list_head *ret_pages,
> +				  struct list_head *free_pages)
> +{
> +	int nr_reclaimed = 0;
> +	struct address_space *mapping;
> +	struct page *page;
> +	LIST_HEAD(need_free_mapping);
> +
> +	if (list_empty(remove_list))
> +		return 0;
> +
> +	mapping = lru_to_page(remove_list)->mapping;
> +	spin_lock_irq(&mapping->tree_lock);
> +	while (!list_empty(remove_list)) {
> +		int freed;
> +		page = lru_to_page(remove_list);
> +		BUG_ON(!PageLocked(page));
> +		BUG_ON(page->mapping != mapping);
> +		list_del(&page->lru);
> +
> +		freed = __remove_mapping_nolock(mapping, page);

Nit: it's not freed, it's detached, but rather than complaining, the
ambiguity can be removed with

if (!__remove_mapping_nolock(mapping, page)) {
	unlock_page(page);
	list_add(&page->lru, ret_pages);
	continue;
}

list_add(&page->lru, &need_free_mapping);

The same comments I had before about potentially long page lock hold
times still apply at this point. Andrew's concerns about the worst-case
scenario where no adjacent page on the LRU has the same mapping also
still apply. Is there any noticeable overhead with his suggested
workload of a single-threaded process that opens files touching one page
in each file until reclaim starts?

This would be easier to review if it were merged with the next patch that
actually uses this function.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 6/7] use __remove_mapping_batch() in shrink_page_list()
  2013-05-07 21:20 ` [RFC][PATCH 6/7] use __remove_mapping_batch() in shrink_page_list() Dave Hansen
@ 2013-05-14 16:05   ` Mel Gorman
  2013-05-14 16:50     ` Dave Hansen
  0 siblings, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2013-05-14 16:05 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, tim.c.chen

On Tue, May 07, 2013 at 02:20:02PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Tim Chen's earlier version of these patches just unconditionally
> created large batches of pages, even if they did not share a
> page->mapping.  This is a bit suboptimal for a few reasons:
> 1. if we can not consolidate lock acquisitions, it makes little
>    sense to batch
> 2. The page locks are held for long periods of time, so we only
>    want to do this when we are sure that we will gain a
>    substantial throughput improvement because we pay a latency
>    cost by holding the locks.
> 
> This patch makes sure to only batch when all the pages on
> 'batch_for_mapping_removal' continue to share a page->mapping.
> This only happens in practice in cases where pages in the same
> file are close to each other on the LRU.  That seems like a
> reasonable assumption.
> 
> In a 128MB virtual machine doing kernel compiles, the average
> batch size when calling __remove_mapping_batch() is around 5,
> so this does seem to do some good in practice.
> 
> On a 160-cpu system doing kernel compiles, I still saw an
> average batch length of about 2.8.  One promising feature:
> as the memory pressure went up, the average batches seem to
> have gotten larger.
> 

That's curious to me. I would expect with 160 CPUs reading files that it
would become less likely that they would insert pages backed by the same
mapping adjacent to each other in the LRU list. Maybe readahead is adding
the pages in batch so they are still adjacent.  I expect you would see
the best batching for kernel compiles with make -j1.

> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  linux.git-davehans/mm/vmscan.c |   52 +++++++++++++++++++++++++++++++++--------
>  1 file changed, 42 insertions(+), 10 deletions(-)
> 
> diff -puN mm/vmscan.c~use-remove_mapping_batch mm/vmscan.c
> --- linux.git/mm/vmscan.c~use-remove_mapping_batch	2013-05-07 13:48:15.016102828 -0700
> +++ linux.git-davehans/mm/vmscan.c	2013-05-07 13:48:15.020103005 -0700
> @@ -599,7 +599,14 @@ static int __remove_mapping_batch(struct
>  		page = lru_to_page(&need_free_mapping);
>  		list_move(&page->list, free_pages);
>  		free_mapping_page(mapping, page);
> -		unlock_page(page);
> +		/*
> +		 * At this point, we have no other references and there is
> +		 * no way to pick any more up (removed from LRU, removed
> +		 * from pagecache). Can use non-atomic bitops now (and
> +		 * we obviously don't have to worry about waking up a process
> +		 * waiting on the page lock, because there are no references.
> +		 */
> +		__clear_page_locked(page);
>  		nr_reclaimed++;
>  	}
>  	return nr_reclaimed;
> @@ -740,6 +747,15 @@ static enum page_references page_check_r
>  	return PAGEREF_RECLAIM;
>  }
>  
> +static bool batch_has_same_mapping(struct page *page, struct list_head *batch)
> +{
> +	struct page *first_in_batch;
> +	first_in_batch = lru_to_page(batch);
> +	if (first_in_batch->mapping == page->mapping)
> +		return true;

If you are batching the removal of PageSwapCache pages, will this check
still work, as you used page->mapping instead of page_mapping()?

> +	return false;
> +}
> +
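
(For reference, page_mapping() in this tree special-cases swap cache
pages.  Roughly, paraphrased rather than quoted verbatim:

struct address_space *page_mapping(struct page *page)
{
	struct address_space *mapping = page->mapping;

	if (unlikely(PageSwapCache(page))) {
		swp_entry_t entry = { .val = page_private(page) };

		mapping = swap_address_space(entry);
	} else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
		mapping = NULL;

	return mapping;
}

so for an anonymous page in swap cache the two are not the same thing.)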

This helper seems like overkill. Why not just have a batch_mapping in
shrink_page_list() that is set when the first page is added to
batch_for_mapping_removal, and defer the decision to drain until after the
page mapping has been looked up?

struct address_space *batch_mapping = NULL;

.....

mapping = page_mapping(page);
if (!batch_mapping)
	batch_mapping = mapping;

if (!list_empty(&batch_for_mapping_removal) && mapping != batch_mapping) {
	nr_reclaimed += __remove_mapping_batch(....);
	batch_mapping = mapping;
}

Locks will still be held across waiting on page writeback or pageout(),
which could be for long periods of time and block flushers.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 7/7] drain batch list during long operations
  2013-05-07 21:20 ` [RFC][PATCH 7/7] drain batch list during long operations Dave Hansen
  2013-05-07 23:56   ` Dave Hansen
  2013-05-08  0:42   ` Tim Chen
@ 2013-05-14 16:08   ` Mel Gorman
  2 siblings, 0 replies; 25+ messages in thread
From: Mel Gorman @ 2013-05-14 16:08 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, tim.c.chen

On Tue, May 07, 2013 at 02:20:03PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This was a suggestion from Mel:
> 
> 	http://lkml.kernel.org/r/20120914085634.GM11157@csn.ul.ie
> 
> Any pages we collect on 'batch_for_mapping_removal' will have
> their lock_page() held during the duration of their stay on the
> list.  If some other user is trying to get at them during this
> time, they might end up having to wait for a while, especially if
> we go off and do pageout() on some other page.
> 
> This ensures that we drain the batch if we are about to perform a
> writeout.
> 
> I added some statistics to the __remove_mapping_batch() code to
> track how large the lists are that we pass in to it.  With this
> patch, the average list length drops about 10% (from about 4.1 to
> 3.8).  The workload here was a make -j4 kernel compile on a VM
> with 200MB of RAM.
> 
> I've still got the statistics patch around if anyone is
> interested.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  linux.git-davehans/kernel/sched/fair.c |    2 ++
>  linux.git-davehans/mm/vmscan.c         |   10 ++++++++++
>  2 files changed, 12 insertions(+)
> 
> diff -puN kernel/sched/fair.c~drain-batch-list-during-long-operations kernel/sched/fair.c
> --- linux.git/kernel/sched/fair.c~drain-batch-list-during-long-operations	2013-05-07 13:48:15.267113941 -0700
> +++ linux.git-davehans/kernel/sched/fair.c	2013-05-07 13:48:15.275114295 -0700
> @@ -5211,6 +5211,8 @@ more_balance:
>  		if (sd->balance_interval < sd->max_interval)
>  			sd->balance_interval *= 2;
>  	}
> +	//if (printk_ratelimit())
> +	//	printk("sd->balance_interval: %d\n", sd->balance_interval);
>  
>  	goto out;
>  

heh

> diff -puN mm/vmscan.c~drain-batch-list-during-long-operations mm/vmscan.c
> --- linux.git/mm/vmscan.c~drain-batch-list-during-long-operations	2013-05-07 13:48:15.268113985 -0700
> +++ linux.git-davehans/mm/vmscan.c	2013-05-07 13:48:15.272114163 -0700
> @@ -936,6 +936,16 @@ static unsigned long shrink_page_list(st
>  			if (!sc->may_writepage)
>  				goto keep_locked;
>  
> +			/*
> +			 * We hold a bunch of page locks on the batch.
> +			 * pageout() can take a while, so drain the
> +			 * batch before we perform pageout.
> +			 */
> +			nr_reclaimed +=
> +		               __remove_mapping_batch(&batch_for_mapping_removal,
> +		                                      &ret_pages,
> +		                                      &free_pages);
> +

There is also a wait_on_page_writeback() above that would affect
memcgs, but this alone alleviates a lot of my concerns about lock hold
times.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 6/7] use __remove_mapping_batch() in shrink_page_list()
  2013-05-14 16:05   ` Mel Gorman
@ 2013-05-14 16:50     ` Dave Hansen
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-05-14 16:50 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, akpm, tim.c.chen

On 05/14/2013 09:05 AM, Mel Gorman wrote:
> This helper seems overkill. Why not just have batch_mapping in
> shrink_page_list() that is set when the first page is added to the
> batch_for_mapping_removal and defer the decision to drain until after the
> page mapping has been looked up?
> 
> struct address_space *batch_mapping = NULL;
> 
> .....
> 
> mapping = page_mapping(page);
> if (!batch_mapping)
> 	batch_mapping = mapping;
> 
> if (!list_empty(&batch_for_mapping_removal) && mapping != batch_mapping) {
> 	nr_reclaimed += __remove_mapping_batch(....);
> 	batch_mapping = mapping;
> }

I was trying to avoid doing the batch drain while holding lock_page() on
an unrelated page.  But, now that I think about it, that was probably
unsafe anyway.  The page could have been truncated out of the mapping
since it *was* before lock_page().

I think I was also trying to save adding another local variable, but
you're right that it's overkill.  I'll fix it up.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 5/7] create __remove_mapping_batch()
  2013-05-14 15:51   ` Mel Gorman
@ 2013-05-16 17:14     ` Dave Hansen
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-05-16 17:14 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, akpm, tim.c.chen

On 05/14/2013 08:51 AM, Mel Gorman wrote:
> The same comments I had before about potentially long page lock hold
> times still apply at this point. Andrew's concerns about the worst-case
> scenario where no adjacent page on the LRU has the same mapping also
> still applies. Is there any noticable overhead with his suggested
> workload of a single threaded process that opens files touching one page
> in each file until reclaim starts?

This is an attempt to address some of Andrew's concerns from here:

 	http://lkml.kernel.org/r/20120912122758.ad15e10f.akpm@linux-foundation.org

The executive summary: This does cause a small amount of increased CPU
time in __remove_mapping_batch().  But, it *is* small and it comes with
a throughput increase.

Test #1:

1. My goal here was to create an LRU with as few adjacent pages in the
   same file as possible.
2. Using lots of small files turned out to be a pain in the butt just
   because I need to create tens of thousands of them.
3. I ended up writing a program that does (a rough sketch follows this list):
	for (offset = 0; offset < somenumber; offset += PAGE_SIZE)
		for_each_file(f)
			read(f, offset)...
4. This was sitting in a loop where the working set of my file reads was
   slightly larger than the total amount of memory, so we were
   effectively evicting page cache with streaming reads.
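
A rough userspace sketch of that access pattern (file names, counts and
sizes here are made up; this is not the actual test program):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define NR_FILES	2048		/* "~2k files at once" */
#define PG_SIZE		4096
#define FILE_SIZE	(64UL << 20)	/* assumed per-file size */

int main(void)
{
	static int fds[NR_FILES];	/* may need 'ulimit -n' raised */
	char buf[PG_SIZE], name[64];
	off_t offset;
	int i;

	for (i = 0; i < NR_FILES; i++) {
		snprintf(name, sizeof(name), "testfile.%d", i);
		fds[i] = open(name, O_RDONLY);
		if (fds[i] < 0) {
			perror("open");
			return 1;
		}
	}

	/*
	 * Read one page from each file before advancing the offset, so
	 * adjacent pages on the LRU rarely share a mapping.
	 */
	for (offset = 0; offset < (off_t)FILE_SIZE; offset += PG_SIZE)
		for (i = 0; i < NR_FILES; i++)
			if (pread(fds[i], buf, sizeof(buf), offset) < 0)
				perror("pread");

	return 0;
}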

Even doing the above loop across ~2k files at once, __remove_mapping()
itself isn't CPU intensive in the single-threaded case.  In my testing,
it only shows up at 0.021% of CPU usage.  That went up to 0.036% (and
shifted to __remove_mapping_batch()) with these patches applied.

In any case, there are no showstoppers here.  We're way down looking at
the 0.01% of CPU time scale.

    sample    %
     delta   change
    ------   ------
       462     2.7% ata_scsi_queuecmd
       194     0.1% default_idle
        59   999.9% __remove_mapping_batch
        54   490.9% prepare_to_wait
        41   585.7% rcu_process_callbacks
       -32   -49.2% blk_queue_bio
       -35  -100.0% __remove_mapping
       -38   -33.6% generic_file_aio_read
       -41   -68.3% mix_pool_bytes.constprop.0
       -48   -11.9% __wake_up
       -53   -66.2% copy_user_generic_string
       -75    -8.4% finish_task_switch
       -79   -53.4% cpu_startup_entry
       -87   -15.9% blk_end_bidi_request
      -109   -14.3% scsi_request_fn
      -172    -3.6% __do_softirq

Test #2:

The second test I did was a single-threaded dd.  I did a 4GB dd over and
over with just barely less than 4GB of memory available.  This was the
test that we would expect to hurt us in the single-threaded case since
we spread out accesses to 'struct page' over time and have less cache
warmth.  The total disk throughput (as reported by vmstat) actually went
_up_ 6% in this case with these patches.

Here are the relevant bits grepped out of 'perf report' during the dd:

> -------- perf.vanilla.data ----------
>      3.75%         swapper  [kernel.kallsyms]     [k] intel_idle                                
>      2.83%              dd  [kernel.kallsyms]     [k] put_page                                  
>      1.30%         kswapd0  [kernel.kallsyms]     [k] __ticket_spin_lock                        
>      1.05%              dd  [kernel.kallsyms]     [k] __ticket_spin_lock                        
>      1.04%         kswapd0  [kernel.kallsyms]     [k] shrink_page_list                          
>      0.38%         kswapd0  [kernel.kallsyms]     [k] __remove_mapping                          
>      0.34%         kswapd0  [kernel.kallsyms]     [k] put_page                                  
> -------- perf.patched.data ----------
>      4.47%          swapper  [kernel.kallsyms]           [k] intel_idle                                               
>      2.02%               dd  [kernel.kallsyms]           [k] put_page                                                 
>      1.55%               dd  [kernel.kallsyms]           [k] __ticket_spin_lock                                       
>      1.21%          kswapd0  [kernel.kallsyms]           [k] shrink_page_list                                         
>      0.97%          kswapd0  [kernel.kallsyms]           [k] __ticket_spin_lock                                       
>      0.43%          kswapd0  [kernel.kallsyms]           [k] put_page                                                 
>      0.36%          kswapd0  [kernel.kallsyms]           [k] __remove_mapping                                         
>      0.28%          kswapd0  [kernel.kallsyms]           [k] __remove_mapping_batch                 

And the same functions from 'perf diff':

>              +4.47%  [kernel.kallsyms]           [k] intel_idle                                               
>      3.22%   -0.77%  [kernel.kallsyms]           [k] put_page                                                 
>              +1.21%  [kernel.kallsyms]           [k] shrink_page_list                                         
>              +0.36%  [kernel.kallsyms]           [k] __remove_mapping                                         
>              +0.28%  [kernel.kallsyms]           [k] __remove_mapping_batch                                   
>      0.39%   -0.39%  [kernel.kallsyms]           [k] __remove_mapping                                         
>      1.04%   -1.04%  [kernel.kallsyms]           [k] shrink_page_list                                         
>      3.68%   -3.68%  [kernel.kallsyms]           [k] intel_idle                                    

1. Idle time goes up by quite a bit, probably since we hold the page
   locks for longer amounts of time, and cause more sleeping on them
2. put_page() got substantially cheaper, probably since we are now doing
   all the put_page()s closer to each other.
3. __remove_mapping_batch() is definitely costing us CPU, and not
   directly saving it anywhere else (like shrink_page_list() which also
   gets a bit worse)

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2013-05-16 17:14 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-07 21:19 [RFC][PATCH 0/7] mm: Batch page reclamation under shink_page_list Dave Hansen
2013-05-07 21:19 ` [RFC][PATCH 1/7] defer clearing of page_private() for swap cache pages Dave Hansen
2013-05-09 22:07   ` Seth Jennings
2013-05-09 22:19     ` Dave Hansen
2013-05-10  9:26     ` Michal Hocko
2013-05-10 14:01       ` Seth Jennings
2013-05-14 14:55   ` Mel Gorman
2013-05-07 21:19 ` [RFC][PATCH 2/7] make 'struct page' and swp_entry_t variants of swapcache_free() Dave Hansen
2013-05-14 15:00   ` Mel Gorman
2013-05-07 21:19 ` [RFC][PATCH 3/7] break up __remove_mapping() Dave Hansen
2013-05-14 15:22   ` Mel Gorman
2013-05-07 21:20 ` [RFC][PATCH 4/7] break out mapping "freepage" code Dave Hansen
2013-05-14 15:26   ` Mel Gorman
2013-05-07 21:20 ` [RFC][PATCH 5/7] create __remove_mapping_batch() Dave Hansen
2013-05-09 22:13   ` Seth Jennings
2013-05-09 22:18     ` Dave Hansen
2013-05-14 15:51   ` Mel Gorman
2013-05-16 17:14     ` Dave Hansen
2013-05-07 21:20 ` [RFC][PATCH 6/7] use __remove_mapping_batch() in shrink_page_list() Dave Hansen
2013-05-14 16:05   ` Mel Gorman
2013-05-14 16:50     ` Dave Hansen
2013-05-07 21:20 ` [RFC][PATCH 7/7] drain batch list during long operations Dave Hansen
2013-05-07 23:56   ` Dave Hansen
2013-05-08  0:42   ` Tim Chen
2013-05-14 16:08   ` Mel Gorman
