linux-kernel.vger.kernel.org archive mirror
* [v5][PATCH 0/6] mm: vmscan: Batch page reclamation under shrink_page_list
@ 2013-06-03 20:02 Dave Hansen
  2013-06-03 20:02 ` [v5][PATCH 1/6] mm: swap: defer clearing of page_private() for swap cache pages Dave Hansen
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Dave Hansen @ 2013-06-03 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, minchan, Dave Hansen

These are an update of Tim Chen's earlier work:

	http://lkml.kernel.org/r/1347293960.9977.70.camel@schen9-DESK

I broke the patches up a bit more, and tried to incorporate some
changes based on some feedback from Mel and Andrew.

Changes for v5:
 * fix description about the costs of moving around the code
   under delete_from_swap_cache() in patch 2.
 * Minor formatting (remove unnecessary newlines), thanks
   Minchan!

Changes for v4:
 * generated on top of linux-next-20130530, plus Mel's vmscan
   fixes:
	http://lkml.kernel.org/r/1369659778-6772-2-git-send-email-mgorman@suse.de
 * added some proper vmscan/swap: prefixes to the subjects

Changes for v3:
 * Add batch draining before congestion_wait()
 * minor merge conflicts with Mel's vmscan work

Changes for v2:
 * use the page_mapping() accessor instead of direct access
   to page->mapping (direct access could cause crashes when
   running into swap cache pages)
 * group the batch function's introduction patch with
   its first use
 * rename a few functions as suggested by Mel
 * Ran some single-threaded tests to look for regressions
   caused by the batching.  If there is overhead, it is only
   in the worst-case scenarios, and then only in hundredths of
   a percent of CPU time.

If you're curious how effective the batching is, I have a quick
and dirty patch to keep some stats:

	https://www.sr71.net/~dave/intel/rmb-stats-only.patch

--

Page reclamation in shrink_page_list() takes two locks on a
page-by-page basis: the tree_lock protecting the radix tree of
the page's mapping, and mapping->i_mmap_mutex protecting the
mapped pages.  This set deals only with mapping->tree_lock.

Tim measured a 14% throughput improvement with a workload that
puts heavy pressure on the page cache by reading many large
mmapped files simultaneously on an 8-socket Westmere server.

I've been testing these by running large parallel kernel compiles
on systems that are under memory pressure.  During development,
I caught quite a few races on smaller setups, and it has been
quite stable on a large (160 logical CPU / 1TB) system.


* [v5][PATCH 1/6] mm: swap: defer clearing of page_private() for swap cache pages
  2013-06-03 20:02 [v5][PATCH 0/6] mm: vmscan: Batch page reclamation under shrink_page_list Dave Hansen
@ 2013-06-03 20:02 ` Dave Hansen
  2013-06-03 20:02 ` [v5][PATCH 2/6] mm: swap: make 'struct page' and swp_entry_t variants of swapcache_free() Dave Hansen
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-06-03 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, minchan, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This patch defers the destruction of swapcache-specific data in
'struct page'.  This simplifies the code because we do not have
to keep extra copies of the data during the removal of a page
from the swap cache.

There are only two callers of swapcache_free() which actually
pass in a non-NULL 'struct page'.  Both of them (__remove_mapping()
and delete_from_swap_cache()) create a temporary on-stack
'swp_entry_t' and set entry.val to page_private().

They need to do this since __delete_from_swap_cache() does
set_page_private(page, 0) and destroys the information.

However, I'd like to batch a few of these operations on several
pages together in a new version of __remove_mapping(), and I
would prefer not to have to allocate temporary storage for each
page.  The code is pretty ugly, and it's a bit silly to create
these on-stack 'swp_entry_t's when it is so easy to just keep the
information around in 'struct page'.
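
Condensed, both callers currently follow this pattern (visible in
the hunks below and in the next patch):

	swp_entry_t entry = { .val = page_private(page) }; /* temp copy */
	__delete_from_swap_cache(page);	/* clears page_private() today */
	...
	swapcache_free(entry, page);	/* needs the saved value */

Once the clearing is deferred until after swapcache_free(), the
temporary copy can go away entirely (the next patch does that).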

There should not be any danger in doing this since we are
absolutely on the path of freeing these pages.  There is no
turning back, and no other references can be obtained after the
page comes out of the radix tree.

Note: This patch is separate from the next one since it
introduces the behavior change.  I've seen issues with this patch
by itself in various forms and I think having it separate like
this aids bisection.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---

 linux.git-davehans/mm/swap_state.c |    4 ++--
 linux.git-davehans/mm/vmscan.c     |    2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff -puN mm/swap_state.c~__delete_from_swap_cache-dont-clear-page-private mm/swap_state.c
--- linux.git/mm/swap_state.c~__delete_from_swap_cache-dont-clear-page-private	2013-06-03 12:41:30.321703206 -0700
+++ linux.git-davehans/mm/swap_state.c	2013-06-03 12:41:30.326703428 -0700
@@ -148,8 +148,6 @@ void __delete_from_swap_cache(struct pag
 	entry.val = page_private(page);
 	address_space = swap_address_space(entry);
 	radix_tree_delete(&address_space->page_tree, page_private(page));
-	set_page_private(page, 0);
-	ClearPageSwapCache(page);
 	address_space->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
@@ -226,6 +224,8 @@ void delete_from_swap_cache(struct page
 	spin_unlock_irq(&address_space->tree_lock);
 
 	swapcache_free(entry, page);
+	set_page_private(page, 0);
+	ClearPageSwapCache(page);
 	page_cache_release(page);
 }
 
diff -puN mm/vmscan.c~__delete_from_swap_cache-dont-clear-page-private mm/vmscan.c
--- linux.git/mm/vmscan.c~__delete_from_swap_cache-dont-clear-page-private	2013-06-03 12:41:30.323703296 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-06-03 12:41:30.328703516 -0700
@@ -494,6 +494,8 @@ static int __remove_mapping(struct addre
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
 		swapcache_free(swap, page);
+		set_page_private(page, 0);
+		ClearPageSwapCache(page);
 	} else {
 		void (*freepage)(struct page *);
 
_


* [v5][PATCH 2/6] mm: swap: make 'struct page' and swp_entry_t variants of swapcache_free().
  2013-06-03 20:02 [v5][PATCH 0/6] mm: vmscan: Batch page reclamation under shrink_page_list Dave Hansen
  2013-06-03 20:02 ` [v5][PATCH 1/6] mm: swap: defer clearing of page_private() for swap cache pages Dave Hansen
@ 2013-06-03 20:02 ` Dave Hansen
  2013-06-03 20:02 ` [v5][PATCH 3/6] mm: vmscan: break up __remove_mapping() Dave Hansen
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-06-03 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, minchan, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

swapcache_free() takes two arguments:

	void swapcache_free(swp_entry_t entry, struct page *page)

Most of its callers (5/7) are error-handling paths that haven't
even instantiated a page, so they pass page=NULL.  Both of the
callers that call in with a 'struct page' create and pass in a
temporary swp_entry_t.

Now that we are deferring clearing page_private() until after
swapcache_free() has been called, we can just create a variant
that takes a 'struct page' and does the temporary variable in the
helper.

That leaves all the other callers doing

	swapcache_free(entry, NULL)

so create another helper for them that makes it clear that they
need only pass in a swp_entry_t.
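
The call-site conversion (condensed from the hunks below) is just:

	/* error paths that never had a page: */
	-	swapcache_free(entry, NULL);
	+	swapcache_free_entry(entry);

	/* the two callers that do have a page: */
	-	swp_entry_t swap = { .val = page_private(page) };
	-	swapcache_free(swap, page);
	+	swapcache_free_page_entry(page);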

One downside here is that delete_from_swap_cache() now calls
swap_address_space() via page_mapping() instead of calling
swap_address_space() directly.  In doing so, it removes one more
case of the swap cache code being special-cased, which is a good
thing in my book.  But it does cost us a function call.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---

 linux.git-davehans/drivers/staging/zcache/zcache-main.c |    2 +-
 linux.git-davehans/include/linux/swap.h                 |    3 ++-
 linux.git-davehans/mm/shmem.c                           |    2 +-
 linux.git-davehans/mm/swap_state.c                      |   15 +++++----------
 linux.git-davehans/mm/swapfile.c                        |   15 ++++++++++++++-
 linux.git-davehans/mm/vmscan.c                          |    5 +----
 6 files changed, 24 insertions(+), 18 deletions(-)

diff -puN drivers/staging/zcache/zcache-main.c~make-page-and-swp_entry_t-variants drivers/staging/zcache/zcache-main.c
--- linux.git/drivers/staging/zcache/zcache-main.c~make-page-and-swp_entry_t-variants	2013-06-03 12:41:30.590715114 -0700
+++ linux.git-davehans/drivers/staging/zcache/zcache-main.c	2013-06-03 12:41:30.602715646 -0700
@@ -961,7 +961,7 @@ static int zcache_get_swap_cache_page(in
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free_entry(entry);
 		/* FIXME: is it possible to get here without err==-ENOMEM?
 		 * If not, we can dispense with the do loop, use goto retry */
 	} while (err != -ENOMEM);
diff -puN include/linux/swap.h~make-page-and-swp_entry_t-variants include/linux/swap.h
--- linux.git/include/linux/swap.h~make-page-and-swp_entry_t-variants	2013-06-03 12:41:30.591715158 -0700
+++ linux.git-davehans/include/linux/swap.h	2013-06-03 12:41:30.602715646 -0700
@@ -385,7 +385,8 @@ extern void swap_shmem_alloc(swp_entry_t
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free_entry(swp_entry_t entry);
+extern void swapcache_free_page_entry(struct page *page);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
diff -puN mm/shmem.c~make-page-and-swp_entry_t-variants mm/shmem.c
--- linux.git/mm/shmem.c~make-page-and-swp_entry_t-variants	2013-06-03 12:41:30.593715247 -0700
+++ linux.git-davehans/mm/shmem.c	2013-06-03 12:41:30.603715690 -0700
@@ -872,7 +872,7 @@ static int shmem_writepage(struct page *
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
-	swapcache_free(swap, NULL);
+	swapcache_free_entry(swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
diff -puN mm/swapfile.c~make-page-and-swp_entry_t-variants mm/swapfile.c
--- linux.git/mm/swapfile.c~make-page-and-swp_entry_t-variants	2013-06-03 12:41:30.595715336 -0700
+++ linux.git-davehans/mm/swapfile.c	2013-06-03 12:41:30.604715734 -0700
@@ -637,7 +637,7 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry, struct page *page)
+static void __swapcache_free(swp_entry_t entry, struct page *page)
 {
 	struct swap_info_struct *p;
 	unsigned char count;
@@ -651,6 +651,19 @@ void swapcache_free(swp_entry_t entry, s
 	}
 }
 
+void swapcache_free_entry(swp_entry_t entry)
+{
+	__swapcache_free(entry, NULL);
+}
+
+void swapcache_free_page_entry(struct page *page)
+{
+	swp_entry_t entry = { .val = page_private(page) };
+	__swapcache_free(entry, page);
+	set_page_private(page, 0);
+	ClearPageSwapCache(page);
+}
+
 /*
  * How many references to page are currently swapped out?
  * This does not give an exact answer when swap count is continued,
diff -puN mm/swap_state.c~make-page-and-swp_entry_t-variants mm/swap_state.c
--- linux.git/mm/swap_state.c~make-page-and-swp_entry_t-variants	2013-06-03 12:41:30.596715380 -0700
+++ linux.git-davehans/mm/swap_state.c	2013-06-03 12:41:30.605715778 -0700
@@ -174,7 +174,7 @@ int add_to_swap(struct page *page, struc
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry, NULL);
+			swapcache_free_entry(entry);
 			return 0;
 		}
 
@@ -200,7 +200,7 @@ int add_to_swap(struct page *page, struc
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free_entry(entry);
 		return 0;
 	}
 }
@@ -213,19 +213,14 @@ int add_to_swap(struct page *page, struc
  */
 void delete_from_swap_cache(struct page *page)
 {
-	swp_entry_t entry;
 	struct address_space *address_space;
 
-	entry.val = page_private(page);
-
-	address_space = swap_address_space(entry);
+	address_space = page_mapping(page);
 	spin_lock_irq(&address_space->tree_lock);
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry, page);
-	set_page_private(page, 0);
-	ClearPageSwapCache(page);
+	swapcache_free_page_entry(page);
 	page_cache_release(page);
 }
 
@@ -370,7 +365,7 @@ struct page *read_swap_cache_async(swp_e
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free_entry(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff -puN mm/vmscan.c~make-page-and-swp_entry_t-variants mm/vmscan.c
--- linux.git/mm/vmscan.c~make-page-and-swp_entry_t-variants	2013-06-03 12:41:30.598715468 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-06-03 12:41:30.606715822 -0700
@@ -490,12 +490,9 @@ static int __remove_mapping(struct addre
 	}
 
 	if (PageSwapCache(page)) {
-		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		swapcache_free(swap, page);
-		set_page_private(page, 0);
-		ClearPageSwapCache(page);
+		swapcache_free_page_entry(page);
 	} else {
 		void (*freepage)(struct page *);
 
_


* [v5][PATCH 3/6] mm: vmscan: break up __remove_mapping()
  2013-06-03 20:02 [v5][PATCH 0/6] mm: vmscan: Batch page reclamation under shrink_page_list Dave Hansen
  2013-06-03 20:02 ` [v5][PATCH 1/6] mm: swap: defer clearing of page_private() for swap cache pages Dave Hansen
  2013-06-03 20:02 ` [v5][PATCH 2/6] mm: swap: make 'struct page' and swp_entry_t variants of swapcache_free() Dave Hansen
@ 2013-06-03 20:02 ` Dave Hansen
  2013-06-03 20:02 ` [v5][PATCH 4/6] mm: vmscan: break out mapping "freepage" code Dave Hansen
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-06-03 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, minchan, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

Our goal here is to eventually reduce the number of repetitive
acquire/release operations on mapping->tree_lock.

Logically, this patch has two steps:
1. Rename __remove_mapping() to lock_remove_mapping(), since
   "__" usually means "this is the unlocked version".
2. Recreate __remove_mapping() to _be_ lock_remove_mapping(),
   but without taking the locks.

I think this actually makes the code flow around the locking
_much_ more straightforward since the locking just becomes:

	spin_lock_irq(&mapping->tree_lock);
	ret = __remove_mapping(mapping, page);
	spin_unlock_irq(&mapping->tree_lock);

One non-obvious part of this patch: the

	freepage = mapping->a_ops->freepage;

used to happen under the mapping->tree_lock, but this patch
moves it outside the lock.  All of the other a_ops->freepage
users do this outside the lock, and we only assign it when we
create inodes, so that makes it safe.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan@kernel.org>

---

 linux.git-davehans/mm/vmscan.c |   40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff -puN mm/vmscan.c~make-remove-mapping-without-locks mm/vmscan.c
--- linux.git/mm/vmscan.c~make-remove-mapping-without-locks	2013-06-03 12:41:30.903728970 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-06-03 12:41:30.907729146 -0700
@@ -455,7 +455,6 @@ static int __remove_mapping(struct addre
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	spin_lock_irq(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -482,35 +481,44 @@ static int __remove_mapping(struct addre
 	 * and thus under tree_lock, then this ordering is not required.
 	 */
 	if (!page_freeze_refs(page, 2))
-		goto cannot_free;
+		return 0;
 	/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
 	if (unlikely(PageDirty(page))) {
 		page_unfreeze_refs(page, 2);
-		goto cannot_free;
+		return 0;
 	}
 
 	if (PageSwapCache(page)) {
 		__delete_from_swap_cache(page);
-		spin_unlock_irq(&mapping->tree_lock);
+	} else {
+		__delete_from_page_cache(page);
+	}
+	return 1;
+}
+
+static int lock_remove_mapping(struct address_space *mapping, struct page *page)
+{
+	int ret;
+	BUG_ON(!PageLocked(page));
+
+	spin_lock_irq(&mapping->tree_lock);
+	ret = __remove_mapping(mapping, page);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/* unable to free */
+	if (!ret)
+		return 0;
+
+	if (PageSwapCache(page)) {
 		swapcache_free_page_entry(page);
 	} else {
 		void (*freepage)(struct page *);
-
 		freepage = mapping->a_ops->freepage;
-
-		__delete_from_page_cache(page);
-		spin_unlock_irq(&mapping->tree_lock);
 		mem_cgroup_uncharge_cache_page(page);
-
 		if (freepage != NULL)
 			freepage(page);
 	}
-
-	return 1;
-
-cannot_free:
-	spin_unlock_irq(&mapping->tree_lock);
-	return 0;
+	return ret;
 }
 
 /*
@@ -521,7 +529,7 @@ cannot_free:
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page)) {
+	if (lock_remove_mapping(mapping, page)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
_


* [v5][PATCH 4/6] mm: vmscan: break out mapping "freepage" code
  2013-06-03 20:02 [v5][PATCH 0/6] mm: vmscan: Batch page reclamation under shrink_page_list Dave Hansen
                   ` (2 preceding siblings ...)
  2013-06-03 20:02 ` [v5][PATCH 3/6] mm: vmscan: break up __remove_mapping() Dave Hansen
@ 2013-06-03 20:02 ` Dave Hansen
  2013-06-03 20:02 ` [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations Dave Hansen
  2013-06-03 20:02 ` [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations Dave Hansen
  5 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-06-03 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, minchan, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

__remove_mapping() only deals with pages with mappings, meaning
page cache and swap cache.

At this point, the page has been removed from the mapping's radix
tree, and we need to ensure that any fs-specific (or swap-
specific) resources are freed up.

We will be using this function from a second location in a
following patch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---

 linux.git-davehans/mm/vmscan.c |   27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff -puN mm/vmscan.c~free_mapping_page mm/vmscan.c
--- linux.git/mm/vmscan.c~free_mapping_page	2013-06-03 12:41:31.155740124 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-06-03 12:41:31.159740301 -0700
@@ -496,6 +496,23 @@ static int __remove_mapping(struct addre
 	return 1;
 }
 
+/*
+ * Release any resources the mapping had tied up in the page.
+ */
+static void mapping_release_page(struct address_space *mapping,
+				 struct page *page)
+{
+	if (PageSwapCache(page)) {
+		swapcache_free_page_entry(page);
+	} else {
+		void (*freepage)(struct page *);
+		freepage = mapping->a_ops->freepage;
+		mem_cgroup_uncharge_cache_page(page);
+		if (freepage != NULL)
+			freepage(page);
+	}
+}
+
 static int lock_remove_mapping(struct address_space *mapping, struct page *page)
 {
 	int ret;
@@ -509,15 +526,7 @@ static int lock_remove_mapping(struct ad
 	if (!ret)
 		return 0;
 
-	if (PageSwapCache(page)) {
-		swapcache_free_page_entry(page);
-	} else {
-		void (*freepage)(struct page *);
-		freepage = mapping->a_ops->freepage;
-		mem_cgroup_uncharge_cache_page(page);
-		if (freepage != NULL)
-			freepage(page);
-	}
+	mapping_release_page(mapping, page);
 	return ret;
 }
 
_


* [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-03 20:02 [v5][PATCH 0/6] mm: vmscan: Batch page reclamation under shrink_page_list Dave Hansen
                   ` (3 preceding siblings ...)
  2013-06-03 20:02 ` [v5][PATCH 4/6] mm: vmscan: break out mapping "freepage" code Dave Hansen
@ 2013-06-03 20:02 ` Dave Hansen
  2013-06-04  1:17   ` Hillf Danton
  2013-06-04  5:01   ` Minchan Kim
  2013-06-03 20:02 ` [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations Dave Hansen
  5 siblings, 2 replies; 25+ messages in thread
From: Dave Hansen @ 2013-06-03 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, minchan, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>
changes for v2:
 * remove the batch_has_same_mapping() helper.  A local variable makes
   the check cheaper and cleaner
 * Move batch draining later to where we already know
   page_mapping().  This probably fixes a truncation race anyway
 * rename batch_for_mapping_removal -> batch_for_mapping_rm.  It
   caused a line over 80 chars and needed shortening anyway.
 * Note: we only set 'batch_mapping' when there are pages in the
   batch_for_mapping_rm list

--

We batch like this so that several pages can be freed with a
single mapping->tree_lock acquisition/release pair.  This reduces
the number of atomic operations and ensures that we do not bounce
cachelines around.

Tim Chen's earlier version of these patches just unconditionally
created large batches of pages, even if they did not share a
page_mapping().  This is a bit suboptimal for a few reasons:
1. If we cannot consolidate lock acquisitions, it makes little
   sense to batch
2. The page locks are held for long periods of time, so we only
   want to do this when we are sure that we will gain a
   substantial throughput improvement because we pay a latency
   cost by holding the locks.

This patch makes sure to only batch when all the pages on
'batch_for_mapping_rm' continue to share a page_mapping().
This only happens in practice in cases where pages in the same
file are close to each other on the LRU.  That seems like a
reasonable assumption.
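
Condensed from the shrink_page_list() hunk below, the flow is:

	mapping = page_mapping(page);
	/* drain when we cross into a different mapping */
	if (batch_mapping && (batch_mapping != mapping)) {
		nr_reclaimed +=
			__remove_mapping_batch(&batch_for_mapping_rm,
					       &ret_pages, &free_pages);
		batch_mapping = NULL;
	}
	...
	/* instead of removing the mapping here, defer it to the batch */
	list_add(&page->lru, &batch_for_mapping_rm);
	batch_mapping = mapping;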

In a 128MB virtual machine doing kernel compiles, the average
batch size when calling __remove_mapping_batch() is around 5,
so this does seem to do some good in practice.

On a 160-cpu system doing kernel compiles, I still saw an
average batch length of about 2.8.  One promising feature:
as the memory pressure went up, the average batches seem to
have gotten larger.

It has shown some substantial performance benefits on
microbenchmarks.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
---

 linux.git-davehans/mm/vmscan.c |   95 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 86 insertions(+), 9 deletions(-)

diff -puN mm/vmscan.c~create-remove_mapping_batch mm/vmscan.c
--- linux.git/mm/vmscan.c~create-remove_mapping_batch	2013-06-03 12:41:31.408751324 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-06-03 12:41:31.412751500 -0700
@@ -550,6 +550,61 @@ int remove_mapping(struct address_space
 	return 0;
 }
 
+/*
+ * pages come in here (via remove_list) locked and leave unlocked
+ * (on either ret_pages or free_pages)
+ *
+ * We do this batching so that we free batches of pages with a
+ * single mapping->tree_lock acquisition/release.  This optimization
+ * only makes sense when the pages on remove_list all share a
+ * page_mapping().  If this is violated you will BUG_ON().
+ */
+static int __remove_mapping_batch(struct list_head *remove_list,
+				  struct list_head *ret_pages,
+				  struct list_head *free_pages)
+{
+	int nr_reclaimed = 0;
+	struct address_space *mapping;
+	struct page *page;
+	LIST_HEAD(need_free_mapping);
+
+	if (list_empty(remove_list))
+		return 0;
+
+	mapping = page_mapping(lru_to_page(remove_list));
+	spin_lock_irq(&mapping->tree_lock);
+	while (!list_empty(remove_list)) {
+		page = lru_to_page(remove_list);
+		BUG_ON(!PageLocked(page));
+		BUG_ON(page_mapping(page) != mapping);
+		list_del(&page->lru);
+
+		if (!__remove_mapping(mapping, page)) {
+			unlock_page(page);
+			list_add(&page->lru, ret_pages);
+			continue;
+		}
+		list_add(&page->lru, &need_free_mapping);
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+
+	while (!list_empty(&need_free_mapping)) {
+		page = lru_to_page(&need_free_mapping);
+		list_move(&page->lru, free_pages);
+		mapping_release_page(mapping, page);
+		/*
+		 * At this point, we have no other references and there is
+		 * no way to pick any more up (removed from LRU, removed
+		 * from pagecache). Can use non-atomic bitops now (and
+		 * we obviously don't have to worry about waking up a process
+		 * waiting on the page lock, because there are no references.
+		 */
+		__clear_page_locked(page);
+		nr_reclaimed++;
+	}
+	return nr_reclaimed;
+}
+
 /**
  * putback_lru_page - put previously isolated page onto appropriate LRU list
  * @page: page to be put back to appropriate lru list
@@ -728,6 +783,8 @@ static unsigned long shrink_page_list(st
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(batch_for_mapping_rm);
+	struct address_space *batch_mapping = NULL;
 	int pgactivate = 0;
 	unsigned long nr_unqueued_dirty = 0;
 	unsigned long nr_dirty = 0;
@@ -749,6 +806,7 @@ static unsigned long shrink_page_list(st
 		cond_resched();
 
 		page = lru_to_page(page_list);
+
 		list_del(&page->lru);
 
 		if (!trylock_page(page))
@@ -851,6 +909,10 @@ static unsigned long shrink_page_list(st
 
 			/* Case 3 above */
 			} else {
+				/*
+				 * batch_for_mapping_rm could be drained here
+				 * if its lock_page()s hurt latency elsewhere.
+				 */
 				wait_on_page_writeback(page);
 			}
 		}
@@ -881,6 +943,18 @@ static unsigned long shrink_page_list(st
 		}
 
 		mapping = page_mapping(page);
+		/*
+		 * batching only makes sense when we can save lock
+		 * acquisitions, so drain the previously-batched
+		 * pages when we move over to a different mapping
+		 */
+		if (batch_mapping && (batch_mapping != mapping)) {
+			nr_reclaimed +=
+				__remove_mapping_batch(&batch_for_mapping_rm,
+							&ret_pages,
+							&free_pages);
+			batch_mapping = NULL;
+		}
 
 		/*
 		 * The page is mapped into the page tables of one or more
@@ -996,17 +1070,18 @@ static unsigned long shrink_page_list(st
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping)
 			goto keep_locked;
-
 		/*
-		 * At this point, we have no other references and there is
-		 * no way to pick any more up (removed from LRU, removed
-		 * from pagecache). Can use non-atomic bitops now (and
-		 * we obviously don't have to worry about waking up a process
-		 * waiting on the page lock, because there are no references.
+		 * This list contains pages all in the same mapping, but
+		 * in effectively random order and we hold lock_page()
+		 * on *all* of them.  This can potentially cause lock
+		 * ordering issues, but the reclaim code only trylocks
+		 * them which saves us.
 		 */
-		__clear_page_locked(page);
+		list_add(&page->lru, &batch_for_mapping_rm);
+		batch_mapping = mapping;
+		continue;
 free_it:
 		nr_reclaimed++;
 
@@ -1037,7 +1112,9 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
-
+	nr_reclaimed += __remove_mapping_batch(&batch_for_mapping_rm,
+						&ret_pages,
+						&free_pages);
 	/*
 	 * Tag a zone as congested if all the dirty pages encountered were
 	 * backed by a congested BDI. In this case, reclaimers should just
_


* [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations
  2013-06-03 20:02 [v5][PATCH 0/6] mm: vmscan: Batch page reclamation under shrink_page_list Dave Hansen
                   ` (4 preceding siblings ...)
  2013-06-03 20:02 ` [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations Dave Hansen
@ 2013-06-03 20:02 ` Dave Hansen
  2013-06-04  6:05   ` Minchan Kim
  2013-06-04 23:37   ` Minchan Kim
  5 siblings, 2 replies; 25+ messages in thread
From: Dave Hansen @ 2013-06-03 20:02 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mgorman, tim.c.chen, minchan, Dave Hansen


From: Dave Hansen <dave.hansen@linux.intel.com>

This was a suggestion from Mel:

	http://lkml.kernel.org/r/20120914085634.GM11157@csn.ul.ie

Any pages we collect on 'batch_for_mapping_rm' will have their
lock_page() held for the duration of their stay on the list.  If
some other user is trying to get at them during this time, they
might end up having to wait.

This ensures that we drain the batch if we are about to perform a
pageout() or congestion_wait(), either of which will take some
time.  We expect this to help mitigate the worst of the latency
increase that the batching could cause.

I added some statistics to the __remove_mapping_batch() code to
track how large the lists are that we pass in to it.  With this
patch, the average list length drops about 10% (from about 4.1 to
3.8).  The workload here was a make -j4 kernel compile on a VM
with 200MB of RAM.

I've still got the statistics patch around if anyone is
interested.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 linux.git-davehans/mm/vmscan.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff -puN mm/vmscan.c~drain-batch-list-during-long-operations mm/vmscan.c
--- linux.git/mm/vmscan.c~drain-batch-list-during-long-operations	2013-06-03 12:41:31.661762522 -0700
+++ linux.git-davehans/mm/vmscan.c	2013-06-03 12:41:31.665762700 -0700
@@ -1001,6 +1001,16 @@ static unsigned long shrink_page_list(st
 			if (!sc->may_writepage)
 				goto keep_locked;
 
+			/*
+			 * We hold a bunch of page locks on the batch.
+			 * pageout() can take a while, so drain the
+			 * batch before we perform pageout.
+			 */
+			nr_reclaimed +=
+		               __remove_mapping_batch(&batch_for_mapping_rm,
+		                                      &ret_pages,
+		                                      &free_pages);
+
 			/* Page is dirty, try to write it out here */
 			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
_


* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-03 20:02 ` [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations Dave Hansen
@ 2013-06-04  1:17   ` Hillf Danton
  2013-06-04  5:07     ` Minchan Kim
  2013-06-04  5:01   ` Minchan Kim
  1 sibling, 1 reply; 25+ messages in thread
From: Hillf Danton @ 2013-06-04  1:17 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen, minchan

On Tue, Jun 4, 2013 at 4:02 AM, Dave Hansen <dave@sr71.net> wrote:
> +/*
> + * pages come in here (via remove_list) locked and leave unlocked
> + * (on either ret_pages or free_pages)
> + *
> + * We do this batching so that we free batches of pages with a
> + * single mapping->tree_lock acquisition/release.  This optimization
> + * only makes sense when the pages on remove_list all share a
> + * page_mapping().  If this is violated you will BUG_ON().
> + */
> +static int __remove_mapping_batch(struct list_head *remove_list,
> +                                 struct list_head *ret_pages,
> +                                 struct list_head *free_pages)
> +{
> +       int nr_reclaimed = 0;
> +       struct address_space *mapping;
> +       struct page *page;
> +       LIST_HEAD(need_free_mapping);
> +
> +       if (list_empty(remove_list))
> +               return 0;
> +
> +       mapping = page_mapping(lru_to_page(remove_list));
> +       spin_lock_irq(&mapping->tree_lock);
> +       while (!list_empty(remove_list)) {
> +               page = lru_to_page(remove_list);
> +               BUG_ON(!PageLocked(page));
> +               BUG_ON(page_mapping(page) != mapping);
> +               list_del(&page->lru);
> +
> +               if (!__remove_mapping(mapping, page)) {
> +                       unlock_page(page);
> +                       list_add(&page->lru, ret_pages);
> +                       continue;
> +               }
> +               list_add(&page->lru, &need_free_mapping);
> +       }
> +       spin_unlock_irq(&mapping->tree_lock);
> +
While reclaiming pages, can we give IRQs a chance (re-enable them)
if the page list length is over 10, or even 20?


* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-03 20:02 ` [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations Dave Hansen
  2013-06-04  1:17   ` Hillf Danton
@ 2013-06-04  5:01   ` Minchan Kim
  2013-06-04  6:02     ` Minchan Kim
  2013-06-04  6:10     ` Dave Hansen
  1 sibling, 2 replies; 25+ messages in thread
From: Minchan Kim @ 2013-06-04  5:01 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Mon, Jun 03, 2013 at 01:02:08PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> changes for v2:
>  * remove batch_has_same_mapping() helper.  A local varible makes
>    the check cheaper and cleaner
>  * Move batch draining later to where we already know
>    page_mapping().  This probably fixes a truncation race anyway
>  * rename batch_for_mapping_removal -> batch_for_mapping_rm.  It
>    caused a line over 80 chars and needed shortening anyway.
>  * Note: we only set 'batch_mapping' when there are pages in the
>    batch_for_mapping_rm list
> 
> --
> 
> We batch like this so that several pages can be freed with a
> single mapping->tree_lock acquisition/release pair.  This reduces
> the number of atomic operations and ensures that we do not bounce
> cachelines around.
> 
> Tim Chen's earlier version of these patches just unconditionally
> created large batches of pages, even if they did not share a
> page_mapping().  This is a bit suboptimal for a few reasons:
> 1. if we can not consolidate lock acquisitions, it makes little
>    sense to batch
> 2. The page locks are held for long periods of time, so we only
>    want to do this when we are sure that we will gain a
>    substantial throughput improvement because we pay a latency
>    cost by holding the locks.
> 
> This patch makes sure to only batch when all the pages on
> 'batch_for_mapping_rm' continue to share a page_mapping().
> This only happens in practice in cases where pages in the same
> file are close to each other on the LRU.  That seems like a
> reasonable assumption.
> 
> In a 128MB virtual machine doing kernel compiles, the average
> batch size when calling __remove_mapping_batch() is around 5,
> so this does seem to do some good in practice.
> 
> On a 160-cpu system doing kernel compiles, I still saw an
> average batch length of about 2.8.  One promising feature:
> as the memory pressure went up, the average batches seem to
> have gotten larger.
> 
> It has shown some substantial performance benefits on
> microbenchmarks.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Acked-by: Mel Gorman <mgorman@suse.de>

Look at the comment below; otherwise, looks good to me.

Reviewed-by: Minchan Kim <minchan@kernel.org>

> ---
> 
>  linux.git-davehans/mm/vmscan.c |   95 +++++++++++++++++++++++++++++++++++++----
>  1 file changed, 86 insertions(+), 9 deletions(-)
> 
> diff -puN mm/vmscan.c~create-remove_mapping_batch mm/vmscan.c
> --- linux.git/mm/vmscan.c~create-remove_mapping_batch	2013-06-03 12:41:31.408751324 -0700
> +++ linux.git-davehans/mm/vmscan.c	2013-06-03 12:41:31.412751500 -0700
> @@ -550,6 +550,61 @@ int remove_mapping(struct address_space
>  	return 0;
>  }
>  
> +/*
> + * pages come in here (via remove_list) locked and leave unlocked
> + * (on either ret_pages or free_pages)
> + *
> + * We do this batching so that we free batches of pages with a
> + * single mapping->tree_lock acquisition/release.  This optimization
> + * only makes sense when the pages on remove_list all share a
> + * page_mapping().  If this is violated you will BUG_ON().
> + */
> +static int __remove_mapping_batch(struct list_head *remove_list,
> +				  struct list_head *ret_pages,
> +				  struct list_head *free_pages)
> +{
> +	int nr_reclaimed = 0;
> +	struct address_space *mapping;
> +	struct page *page;
> +	LIST_HEAD(need_free_mapping);
> +
> +	if (list_empty(remove_list))
> +		return 0;
> +
> +	mapping = page_mapping(lru_to_page(remove_list));
> +	spin_lock_irq(&mapping->tree_lock);
> +	while (!list_empty(remove_list)) {
> +		page = lru_to_page(remove_list);
> +		BUG_ON(!PageLocked(page));
> +		BUG_ON(page_mapping(page) != mapping);
> +		list_del(&page->lru);
> +
> +		if (!__remove_mapping(mapping, page)) {
> +			unlock_page(page);
> +			list_add(&page->lru, ret_pages);
> +			continue;
> +		}
> +		list_add(&page->lru, &need_free_mapping);

Why do we need a new lru list instead of using @free_pages?

> +	}
> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	while (!list_empty(&need_free_mapping)) {
> +		page = lru_to_page(&need_free_mapping);
> +		list_move(&page->list, free_pages);
> +		mapping_release_page(mapping, page);
> +		/*
> +		 * At this point, we have no other references and there is
> +		 * no way to pick any more up (removed from LRU, removed
> +		 * from pagecache). Can use non-atomic bitops now (and
> +		 * we obviously don't have to worry about waking up a process
> +		 * waiting on the page lock, because there are no references.
> +		 */
> +		__clear_page_locked(page);
> +		nr_reclaimed++;
> +	}
> +	return nr_reclaimed;
> +}
> +

-- 
Kind regards,
Minchan Kim


* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-04  1:17   ` Hillf Danton
@ 2013-06-04  5:07     ` Minchan Kim
  2013-06-04 15:22       ` Dave Hansen
  2013-06-05  7:28       ` Hillf Danton
  0 siblings, 2 replies; 25+ messages in thread
From: Minchan Kim @ 2013-06-04  5:07 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Dave Hansen, linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Tue, Jun 04, 2013 at 09:17:26AM +0800, Hillf Danton wrote:
> On Tue, Jun 4, 2013 at 4:02 AM, Dave Hansen <dave@sr71.net> wrote:
> > +/*
> > + * pages come in here (via remove_list) locked and leave unlocked
> > + * (on either ret_pages or free_pages)
> > + *
> > + * We do this batching so that we free batches of pages with a
> > + * single mapping->tree_lock acquisition/release.  This optimization
> > + * only makes sense when the pages on remove_list all share a
> > + * page_mapping().  If this is violated you will BUG_ON().
> > + */
> > +static int __remove_mapping_batch(struct list_head *remove_list,
> > +                                 struct list_head *ret_pages,
> > +                                 struct list_head *free_pages)
> > +{
> > +       int nr_reclaimed = 0;
> > +       struct address_space *mapping;
> > +       struct page *page;
> > +       LIST_HEAD(need_free_mapping);
> > +
> > +       if (list_empty(remove_list))
> > +               return 0;
> > +
> > +       mapping = page_mapping(lru_to_page(remove_list));
> > +       spin_lock_irq(&mapping->tree_lock);
> > +       while (!list_empty(remove_list)) {
> > +               page = lru_to_page(remove_list);
> > +               BUG_ON(!PageLocked(page));
> > +               BUG_ON(page_mapping(page) != mapping);
> > +               list_del(&page->lru);
> > +
> > +               if (!__remove_mapping(mapping, page)) {
> > +                       unlock_page(page);
> > +                       list_add(&page->lru, ret_pages);
> > +                       continue;
> > +               }
> > +               list_add(&page->lru, &need_free_mapping);
> > +       }
> > +       spin_unlock_irq(&mapping->tree_lock);
> > +
> While reclaiming pages, can we open ears upon IRQ controller,
> if the page list length is over 10, or even 20?

At the moment, it is implicitly bounded by SWAP_CLUSTER_MAX, which
is the value used by isolate_migratepages_range() to re-enable IRQs.
I have no idea whether that is the proper value for giving IRQs a
chance, but at least Dave's code doesn't break the rule.
If we need to tune that, it could be another patch that investigates
all of the places in vmscan.c in the near future.


> 

-- 
Kind regards,
Minchan Kim


* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-04  5:01   ` Minchan Kim
@ 2013-06-04  6:02     ` Minchan Kim
  2013-06-04 15:29       ` Dave Hansen
  2013-06-04  6:10     ` Dave Hansen
  1 sibling, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2013-06-04  6:02 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Tue, Jun 04, 2013 at 02:01:03PM +0900, Minchan Kim wrote:
> On Mon, Jun 03, 2013 at 01:02:08PM -0700, Dave Hansen wrote:
> > 
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> > changes for v2:
> >  * remove batch_has_same_mapping() helper.  A local varible makes
> >    the check cheaper and cleaner
> >  * Move batch draining later to where we already know
> >    page_mapping().  This probably fixes a truncation race anyway
> >  * rename batch_for_mapping_removal -> batch_for_mapping_rm.  It
> >    caused a line over 80 chars and needed shortening anyway.
> >  * Note: we only set 'batch_mapping' when there are pages in the
> >    batch_for_mapping_rm list
> > 
> > --
> > 
> > We batch like this so that several pages can be freed with a
> > single mapping->tree_lock acquisition/release pair.  This reduces
> > the number of atomic operations and ensures that we do not bounce
> > cachelines around.
> > 
> > Tim Chen's earlier version of these patches just unconditionally
> > created large batches of pages, even if they did not share a
> > page_mapping().  This is a bit suboptimal for a few reasons:
> > 1. if we can not consolidate lock acquisitions, it makes little
> >    sense to batch
> > 2. The page locks are held for long periods of time, so we only
> >    want to do this when we are sure that we will gain a
> >    substantial throughput improvement because we pay a latency
> >    cost by holding the locks.
> > 
> > This patch makes sure to only batch when all the pages on
> > 'batch_for_mapping_rm' continue to share a page_mapping().
> > This only happens in practice in cases where pages in the same
> > file are close to each other on the LRU.  That seems like a
> > reasonable assumption.
> > 
> > In a 128MB virtual machine doing kernel compiles, the average
> > batch size when calling __remove_mapping_batch() is around 5,
> > so this does seem to do some good in practice.
> > 
> > On a 160-cpu system doing kernel compiles, I still saw an
> > average batch length of about 2.8.  One promising feature:
> > as the memory pressure went up, the average batches seem to
> > have gotten larger.
> > 
> > It has shown some substantial performance benefits on
> > microbenchmarks.
> > 
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Acked-by: Mel Gorman <mgorman@suse.de>
> 
> Look at below comment, otherwise, looks good to me.
> 
> Reviewed-by: Minchan Kim <minchan@kernel.org>
> 
> > ---
> > 
> >  linux.git-davehans/mm/vmscan.c |   95 +++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 86 insertions(+), 9 deletions(-)
> > 
> > diff -puN mm/vmscan.c~create-remove_mapping_batch mm/vmscan.c
> > --- linux.git/mm/vmscan.c~create-remove_mapping_batch	2013-06-03 12:41:31.408751324 -0700
> > +++ linux.git-davehans/mm/vmscan.c	2013-06-03 12:41:31.412751500 -0700
> > @@ -550,6 +550,61 @@ int remove_mapping(struct address_space
> >  	return 0;
> >  }
> >  
> > +/*
> > + * pages come in here (via remove_list) locked and leave unlocked
> > + * (on either ret_pages or free_pages)
> > + *
> > + * We do this batching so that we free batches of pages with a
> > + * single mapping->tree_lock acquisition/release.  This optimization
> > + * only makes sense when the pages on remove_list all share a
> > + * page_mapping().  If this is violated you will BUG_ON().
> > + */
> > +static int __remove_mapping_batch(struct list_head *remove_list,
> > +				  struct list_head *ret_pages,
> > +				  struct list_head *free_pages)
> > +{
> > +	int nr_reclaimed = 0;
> > +	struct address_space *mapping;
> > +	struct page *page;
> > +	LIST_HEAD(need_free_mapping);
> > +
> > +	if (list_empty(remove_list))
> > +		return 0;
> > +
> > +	mapping = page_mapping(lru_to_page(remove_list));
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	while (!list_empty(remove_list)) {
> > +		page = lru_to_page(remove_list);
> > +		BUG_ON(!PageLocked(page));
> > +		BUG_ON(page_mapping(page) != mapping);
> > +		list_del(&page->lru);
> > +
> > +		if (!__remove_mapping(mapping, page)) {
> > +			unlock_page(page);
> > +			list_add(&page->lru, ret_pages);
> > +			continue;
> > +		}
> > +		list_add(&page->lru, &need_free_mapping);
> 
> Why do we need new lru list instead of using @free_pages?

I got your point: @free_pages could contain pages already freed by
put_page_testzero() in shrink_page_list(), and those don't have a
valid mapping, so __remove_mapping_batch()'s mapping_release_page()
would access a NULL pointer.

I think it would be better to mention that in a comment. :(
Otherwise, I suggest we declare another new LIST_HEAD to
accumulate the pages freed by put_page_testzero() in
shrink_page_list(), so __remove_mapping_batch() doesn't have to
declare a temporary LRU list and can drop the unnecessary
list_move() operation.

-- 
Kind regards,
Minchan Kim


* Re: [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations
  2013-06-03 20:02 ` [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations Dave Hansen
@ 2013-06-04  6:05   ` Minchan Kim
  2013-06-04 15:24     ` Dave Hansen
  2013-06-04 23:37   ` Minchan Kim
  1 sibling, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2013-06-04  6:05 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Mon, Jun 03, 2013 at 01:02:10PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This was a suggestion from Mel:
> 
> 	http://lkml.kernel.org/r/20120914085634.GM11157@csn.ul.ie
> 
> Any pages we collect on 'batch_for_mapping_removal' will have
> their lock_page() held during the duration of their stay on the
> list.  If some other user is trying to get at them during this
> time, they might end up having to wait.
> 
> This ensures that we drain the batch if we are about to perform a
> pageout() or congestion_wait(), either of which will take some
> time.  We expect this to help mitigate the worst of the latency
> increase that the batching could cause.

Nice idea, but I can see the drain before pageout(); where is the one before congestion_wait()?

-- 
Kind regards,
Minchan Kim


* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-04  5:01   ` Minchan Kim
  2013-06-04  6:02     ` Minchan Kim
@ 2013-06-04  6:10     ` Dave Hansen
  2013-06-04  6:59       ` Minchan Kim
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2013-06-04  6:10 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On 06/03/2013 10:01 PM, Minchan Kim wrote:
>> > +static int __remove_mapping_batch(struct list_head *remove_list,
>> > +				  struct list_head *ret_pages,
>> > +				  struct list_head *free_pages)
>> > +{
>> > +	int nr_reclaimed = 0;
>> > +	struct address_space *mapping;
>> > +	struct page *page;
>> > +	LIST_HEAD(need_free_mapping);
>> > +
>> > +	while (!list_empty(remove_list)) {
...
>> > +		if (!__remove_mapping(mapping, page)) {
>> > +			unlock_page(page);
>> > +			list_add(&page->lru, ret_pages);
>> > +			continue;
>> > +		}
>> > +		list_add(&page->lru, &need_free_mapping);
...
> +	spin_unlock_irq(&mapping->tree_lock);
> +	while (!list_empty(&need_free_mapping)) {...
> +		list_move(&page->list, free_pages);
> +		mapping_release_page(mapping, page);
> +	}
> Why do we need new lru list instead of using @free_pages?

I actually tried using @free_pages at first.  The problem is that we
need to call mapping_release_page() without the radix tree lock held so
we can not do it in the first while() loop.

'free_pages' is a list created up in shrink_page_list().  There can be
several calls to __remove_mapping_batch() for each call to
shrink_page_list().

'need_free_mapping' lets us temporarily differentiate the pages that we
need to call mapping_release_page()/unlock_page() on versus the ones on
'free_pages' which have already had that done.

We could theoretically delay _all_ of the
release_mapping_page()/unlock_page() operations until the _entire_
shrink_page_list() operation is done, but doing this really helps with
lock_page() latency.
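
For reference, the two-phase structure we're talking about
(condensed from the patch) is:

	spin_lock_irq(&mapping->tree_lock);
	while (!list_empty(remove_list)) {
		...
		if (!__remove_mapping(mapping, page))
			continue;	/* unlocked, back on ret_pages */
		list_add(&page->lru, &need_free_mapping);
	}
	spin_unlock_irq(&mapping->tree_lock);

	/* now safe to do the work that must not run under tree_lock */
	while (!list_empty(&need_free_mapping)) {
		page = lru_to_page(&need_free_mapping);
		list_move(&page->lru, free_pages);
		mapping_release_page(mapping, page);
		__clear_page_locked(page);
	}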

Does that make sense?


* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-04  6:10     ` Dave Hansen
@ 2013-06-04  6:59       ` Minchan Kim
  0 siblings, 0 replies; 25+ messages in thread
From: Minchan Kim @ 2013-06-04  6:59 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Mon, Jun 03, 2013 at 11:10:02PM -0700, Dave Hansen wrote:
> On 06/03/2013 10:01 PM, Minchan Kim wrote:
> >> > +static int __remove_mapping_batch(struct list_head *remove_list,
> >> > +				  struct list_head *ret_pages,
> >> > +				  struct list_head *free_pages)
> >> > +{
> >> > +	int nr_reclaimed = 0;
> >> > +	struct address_space *mapping;
> >> > +	struct page *page;
> >> > +	LIST_HEAD(need_free_mapping);
> >> > +
> >> > +	while (!list_empty(remove_list)) {
> ...
> >> > +		if (!__remove_mapping(mapping, page)) {
> >> > +			unlock_page(page);
> >> > +			list_add(&page->lru, ret_pages);
> >> > +			continue;
> >> > +		}
> >> > +		list_add(&page->lru, &need_free_mapping);
> ...
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	while (!list_empty(&need_free_mapping)) {...
> > +		list_move(&page->list, free_pages);
> > +		mapping_release_page(mapping, page);
> > +	}
> > Why do we need new lru list instead of using @free_pages?
> 
> I actually tried using @free_pages at first.  The problem is that we
> need to call mapping_release_page() without the radix tree lock held so
> we can not do it in the first while() loop.
> 
> 'free_pages' is a list created up in shrink_page_list().  There can be
> several calls to __remove_mapping_batch() for each call to
> shrink_page_list().

I missed that point.

> 
> 'need_free_mapping' lets us temporarily differentiate the pages that we
> need to call mapping_release_page()/unlock_page() on versus the ones on
> 'free_pages' which have already had that done.
> 

Right.

> We could theoretically delay _all_ of the
> release_mapping_page()/unlock_page() operations until the _entire_
> shrink_page_list() operation is done, but doing this really helps with

                                        maybe you mean
                                        but doing this doesn't really help
> lock_page() latency.
> 
> Does that make sense?

If so, it does make sense.
Thanks for pointing that out.

> 

-- 
Kind regards,
Minchan Kim


* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-04  5:07     ` Minchan Kim
@ 2013-06-04 15:22       ` Dave Hansen
  2013-06-05  7:28       ` Hillf Danton
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-06-04 15:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hillf Danton, linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On 06/03/2013 10:07 PM, Minchan Kim wrote:
>>> > > +       while (!list_empty(remove_list)) {
>>> > > +               page = lru_to_page(remove_list);
>>> > > +               BUG_ON(!PageLocked(page));
>>> > > +               BUG_ON(page_mapping(page) != mapping);
>>> > > +               list_del(&page->lru);
>>> > > +
>>> > > +               if (!__remove_mapping(mapping, page)) {
>>> > > +                       unlock_page(page);
>>> > > +                       list_add(&page->lru, ret_pages);
>>> > > +                       continue;
>>> > > +               }
>>> > > +               list_add(&page->lru, &need_free_mapping);
>>> > > +       }
>>> > > +       spin_unlock_irq(&mapping->tree_lock);
>>> > > +
>> > While reclaiming pages, can we open ears upon IRQ controller,
>> > if the page list length is over 10, or even 20?
> At the moment, it implicitly could be bounded by SWAP_CLUSTER_MAX and
> it's the value used by isolate_migratepages_ranges to enable IRQ.
> I have no idea it's proper value to give a chace to IRQ but at least,
> Dave's code doesn't break the rule.
> If we need a tune for that, it could be a another patch to investigate

I also wouldn't exactly call this "reclaiming pages".   As Minchan
mentions, this is already bounded and it's a relatively cheap set of
operations.  *WAY* cheaper than actually reclaiming a page.

Honestly, this whole patch series is about trading latency for increased
bandwidth reclaiming pages.


* Re: [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations
  2013-06-04  6:05   ` Minchan Kim
@ 2013-06-04 15:24     ` Dave Hansen
  2013-06-04 23:23       ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2013-06-04 15:24 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On 06/03/2013 11:05 PM, Minchan Kim wrote:
>> > This ensures that we drain the batch if we are about to perform a
>> > pageout() or congestion_wait(), either of which will take some
>> > time.  We expect this to help mitigate the worst of the latency
>> > increase that the batching could cause.
> Nice idea but I could see drain before pageout but congestion_wait?

That comment managed to bitrot a bit :(

The first version of these had the drain before pageout() only.  Then,
Mel added a congestion_wait() call, and I modified the series to also
drain there.  But, some other patches took the congestion_wait() back
out, so I took that drain back out.

I _believe_ the only congestion_wait() left in there is a cgroup-related
one that we didn't think would cause very much harm.


* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-04  6:02     ` Minchan Kim
@ 2013-06-04 15:29       ` Dave Hansen
  2013-06-04 23:32         ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2013-06-04 15:29 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On 06/03/2013 11:02 PM, Minchan Kim wrote:
>> > Why do we need new lru list instead of using @free_pages?
> I got your point that @free_pages could have freed page by
> put_page_testzero of shrink_page_list and they don't have
> valid mapping so __remove_mapping_batch's mapping_release_page
> would access NULL pointer.
> 
> I think it would be better to mention it in comment. :(
> Otherwise, I suggest we can declare another new LIST_HEAD to
> accumulate pages freed by put_page_testzero in shrink_page_list
> so __remove_mapping_batch don't have to declare temporal LRU list
> and can remove unnecessary list_move operation.

If I respin them again, I'll add a comment.

I guess we could splice the whole list over at once instead of moving
the pages individually.  But, what are we trying to optimize here?
Saving a list_head worth of space on the stack?
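
For completeness, the splice variant would look roughly like this (a
sketch only; it reuses the names from the patch, and the exact signature
of the mapping_release_page() helper mentioned above is assumed):

        /*
         * Do the mapping-specific work in place, then move the whole
         * temporary list onto free_pages with a single splice instead
         * of one list_move() per page.
         */
        list_for_each_entry(page, &need_free_mapping, lru) {
                mapping_release_page(mapping, page);    /* signature assumed */
                unlock_page(page);
                nr_reclaimed++;
        }
        list_splice_init(&need_free_mapping, free_pages);
        return nr_reclaimed;

Whether that buys anything over the per-page list_move() is, as noted,
marginal either way.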


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations
  2013-06-04 15:24     ` Dave Hansen
@ 2013-06-04 23:23       ` Minchan Kim
  2013-06-04 23:31         ` Dave Hansen
  0 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2013-06-04 23:23 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

Hello Dave,

On Tue, Jun 04, 2013 at 08:24:38AM -0700, Dave Hansen wrote:
> On 06/03/2013 11:05 PM, Minchan Kim wrote:
> >> > This ensures that we drain the batch if we are about to perform a
> >> > pageout() or congestion_wait(), either of which will take some
> >> > time.  We expect this to help mitigate the worst of the latency
> >> > increase that the batching could cause.
> > Nice idea, but I see the drain before pageout(); what about congestion_wait()?
> 
> That comment managed to bitrot a bit :(
> 
> The first version of these had the drain before pageout() only.  Then,
> Mel added a congestion_wait() call, and I modified the series to also
> drain there.  But, some other patches took the congestion_wait() back
> out, so I took that drain back out.

I am looking at next-20130530 and it still has a congestion_wait().
I'm confused. :(


                if (PageWriteback(page)) {
			/* Case 1 above */
			if (current_is_kswapd() &&
			    PageReclaim(page) &&
			    zone_is_reclaim_writeback(zone)) {
				congestion_wait(BLK_RW_ASYNC, HZ/10);
				zone_clear_flag(zone, ZONE_WRITEBACK);
> 
> I _believe_ the only congestion_wait() left in there is a cgroup-related
> one that we didn't think would cause very much harm.

The congestion_wait() I am seeing is not the cgroup-related one.

I'd like to clear up this confusion.
Thanks.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations
  2013-06-04 23:23       ` Minchan Kim
@ 2013-06-04 23:31         ` Dave Hansen
  2013-06-04 23:36           ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2013-06-04 23:31 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On 06/04/2013 04:23 PM, Minchan Kim wrote:
> On Tue, Jun 04, 2013 at 08:24:38AM -0700, Dave Hansen wrote:
>> On 06/03/2013 11:05 PM, Minchan Kim wrote:
>>>>> This ensures that we drain the batch if we are about to perform a
>>>>> pageout() or congestion_wait(), either of which will take some
>>>>> time.  We expect this to help mitigate the worst of the latency
>>>>> increase that the batching could cause.
>>> Nice idea, but I see the drain before pageout(); what about congestion_wait()?
>>
>> That comment managed to bitrot a bit :(
>>
>> The first version of these had the drain before pageout() only.  Then,
>> Mel added a congestion_wait() call, and I modified the series to also
>> drain there.  But, some other patches took the congestion_wait() back
>> out, so I took that drain back out.
> 
> I am looking at next-20130530 and it still has a congestion_wait().
> I'm confused. :(
> 
> 
>                 if (PageWriteback(page)) {
> 			/* Case 1 above */
> 			if (current_is_kswapd() &&
> 			    PageReclaim(page) &&
> 			    zone_is_reclaim_writeback(zone)) {
> 				congestion_wait(BLK_RW_ASYNC, HZ/10);
> 				zone_clear_flag(zone, ZONE_WRITEBACK);
>>
>> I _believe_ the only congestion_wait() left in there is a cgroup-related
>> one that we didn't think would cause very much harm.
> 
> The congestion_wait() I am seeing is not the cgroup-related one.

Yeah, sorry for the confusion.  There's been a whole lot of activity in
there.  My set is also done on top of a couple of fixes that Mel posted
later on, including this one:

	https://patchwork.kernel.org/patch/2619901/

*That* one removes the congestion_wait() you noticed.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-04 15:29       ` Dave Hansen
@ 2013-06-04 23:32         ` Minchan Kim
  0 siblings, 0 replies; 25+ messages in thread
From: Minchan Kim @ 2013-06-04 23:32 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Tue, Jun 04, 2013 at 08:29:18AM -0700, Dave Hansen wrote:
> On 06/03/2013 11:02 PM, Minchan Kim wrote:
> >> > Why do we need a new LRU list instead of using @free_pages?
> > I got your point: @free_pages could contain pages already freed by
> > put_page_testzero() in shrink_page_list(), and those no longer have a
> > valid mapping, so __remove_mapping_batch()'s mapping_release_page()
> > would access a NULL pointer.
> > 
> > I think it would be better to mention that in a comment. :(
> > Otherwise, I suggest declaring another new LIST_HEAD to accumulate
> > pages freed by put_page_testzero() in shrink_page_list(), so
> > __remove_mapping_batch() doesn't have to declare a temporary LRU list
> > and can drop the unnecessary list_move() operations.
> 
> If I respin them again, I'll add a comment.

Thanks. That's enough for me.

> 
> I guess we could splice the whole list over at once instead of moving
> the pages individually.  But, what are we trying to optimize here?
> Saving a list_head worth of space on the stack?

Never mind. At first, I thought we could simply use @free_pages instead
of a redundant new LRU list, and it's *minor*. That's why I already gave
my Reviewed-by. But your question made me think it through again, and I
no longer have any concern about your patch.

Thanks.
-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations
  2013-06-04 23:31         ` Dave Hansen
@ 2013-06-04 23:36           ` Minchan Kim
  0 siblings, 0 replies; 25+ messages in thread
From: Minchan Kim @ 2013-06-04 23:36 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Tue, Jun 04, 2013 at 04:31:15PM -0700, Dave Hansen wrote:
> On 06/04/2013 04:23 PM, Minchan Kim wrote:
> > On Tue, Jun 04, 2013 at 08:24:38AM -0700, Dave Hansen wrote:
> >> On 06/03/2013 11:05 PM, Minchan Kim wrote:
> >>>>> This ensures that we drain the batch if we are about to perform a
> >>>>> pageout() or congestion_wait(), either of which will take some
> >>>>> time.  We expect this to help mitigate the worst of the latency
> >>>>> increase that the batching could cause.
> >>> Nice idea, but I see the drain before pageout(); what about congestion_wait()?
> >>
> >> That comment managed to bitrot a bit :(
> >>
> >> The first version of these had the drain before pageout() only.  Then,
> >> Mel added a congestion_wait() call, and I modified the series to also
> >> drain there.  But, some other patches took the congestion_wait() back
> >> out, so I took that drain back out.
> > 
> > I am looking at next-20130530 and it still has a congestion_wait().
> > I'm confused. :(
> > 
> > 
> >                 if (PageWriteback(page)) {
> > 			/* Case 1 above */
> > 			if (current_is_kswapd() &&
> > 			    PageReclaim(page) &&
> > 			    zone_is_reclaim_writeback(zone)) {
> > 				congestion_wait(BLK_RW_ASYNC, HZ/10);
> > 				zone_clear_flag(zone, ZONE_WRITEBACK);
> >>
> >> I _believe_ the only congestion_wait() left in there is a cgroup-related
> >> one that we didn't think would cause very much harm.
> > 
> > The congestion_wait() I am seeing is not the cgroup-related one.
> 
> Yeah, sorry for the confusion.  There's been a whole lot of activity in
> there.  My set is also done on top of a couple of fixes that Mel posted
> later on, including this one:
> 
> 	https://patchwork.kernel.org/patch/2619901/
> 
> *That* one removes the congestion_wait() you noticed.

Thanks for the information.
My nitpick just disappeared.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations
  2013-06-03 20:02 ` [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations Dave Hansen
  2013-06-04  6:05   ` Minchan Kim
@ 2013-06-04 23:37   ` Minchan Kim
  1 sibling, 0 replies; 25+ messages in thread
From: Minchan Kim @ 2013-06-04 23:37 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Mon, Jun 03, 2013 at 01:02:10PM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This was a suggestion from Mel:
> 
> 	http://lkml.kernel.org/r/20120914085634.GM11157@csn.ul.ie
> 
> Any pages we collect on 'batch_for_mapping_removal' will have
> their lock_page() held during the duration of their stay on the
> list.  If some other user is trying to get at them during this
> time, they might end up having to wait.
> 
> This ensures that we drain the batch if we are about to perform a
> pageout() or congestion_wait(), either of which will take some
> time.  We expect this to help mitigate the worst of the latency
> increase that the batching could cause.
> 
> I added some statistics to the __remove_mapping_batch() code to
> track how large the lists are that we pass in to it.  With this
> patch, the average list length drops about 10% (from about 4.1 to
> 3.8).  The workload here was a make -j4 kernel compile on a VM
> with 200MB of RAM.
> 
> I've still got the statistics patch around if anyone is
> interested.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-04  5:07     ` Minchan Kim
  2013-06-04 15:22       ` Dave Hansen
@ 2013-06-05  7:28       ` Hillf Danton
  2013-06-05  7:57         ` Minchan Kim
  2013-06-05 14:24         ` Dave Hansen
  1 sibling, 2 replies; 25+ messages in thread
From: Hillf Danton @ 2013-06-05  7:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Dave Hansen, linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Tue, Jun 4, 2013 at 1:07 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Tue, Jun 04, 2013 at 09:17:26AM +0800, Hillf Danton wrote:
>> On Tue, Jun 4, 2013 at 4:02 AM, Dave Hansen <dave@sr71.net> wrote:
>> > +/*
>> > + * pages come in here (via remove_list) locked and leave unlocked
>> > + * (on either ret_pages or free_pages)
>> > + *
>> > + * We do this batching so that we free batches of pages with a
>> > + * single mapping->tree_lock acquisition/release.  This optimization
>> > + * only makes sense when the pages on remove_list all share a
>> > + * page_mapping().  If this is violated you will BUG_ON().
>> > + */
>> > +static int __remove_mapping_batch(struct list_head *remove_list,
>> > +                                 struct list_head *ret_pages,
>> > +                                 struct list_head *free_pages)
>> > +{
>> > +       int nr_reclaimed = 0;
>> > +       struct address_space *mapping;
>> > +       struct page *page;
>> > +       LIST_HEAD(need_free_mapping);
>> > +
>> > +       if (list_empty(remove_list))
>> > +               return 0;
>> > +
>> > +       mapping = page_mapping(lru_to_page(remove_list));
>> > +       spin_lock_irq(&mapping->tree_lock);
>> > +       while (!list_empty(remove_list)) {
>> > +               page = lru_to_page(remove_list);
>> > +               BUG_ON(!PageLocked(page));
>> > +               BUG_ON(page_mapping(page) != mapping);
>> > +               list_del(&page->lru);
>> > +
>> > +               if (!__remove_mapping(mapping, page)) {
>> > +                       unlock_page(page);
>> > +                       list_add(&page->lru, ret_pages);
>> > +                       continue;
>> > +               }
>> > +               list_add(&page->lru, &need_free_mapping);
>> > +       }
>> > +       spin_unlock_irq(&mapping->tree_lock);
>> > +
>> While reclaiming pages, can we give the IRQ controller a chance
>> (re-enable IRQs) if the page list length is over 10, or even 20?
>
> At the moment, it is implicitly bounded by SWAP_CLUSTER_MAX, which is

Could we reclaim a THP currently?

> also the value used by isolate_migratepages_range() to re-enable IRQs.
> I have no idea whether it's the proper value to give IRQs a chance, but
> at least Dave's code doesn't break the rule.
> If we need to tune that, it could be another patch to investigate
> all of the places in vmscan.c in the near future.
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-05  7:28       ` Hillf Danton
@ 2013-06-05  7:57         ` Minchan Kim
  2013-06-05 14:24         ` Dave Hansen
  1 sibling, 0 replies; 25+ messages in thread
From: Minchan Kim @ 2013-06-05  7:57 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Dave Hansen, linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On Wed, Jun 05, 2013 at 03:28:27PM +0800, Hillf Danton wrote:
> On Tue, Jun 4, 2013 at 1:07 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Tue, Jun 04, 2013 at 09:17:26AM +0800, Hillf Danton wrote:
> >> On Tue, Jun 4, 2013 at 4:02 AM, Dave Hansen <dave@sr71.net> wrote:
> >> > +/*
> >> > + * pages come in here (via remove_list) locked and leave unlocked
> >> > + * (on either ret_pages or free_pages)
> >> > + *
> >> > + * We do this batching so that we free batches of pages with a
> >> > + * single mapping->tree_lock acquisition/release.  This optimization
> >> > + * only makes sense when the pages on remove_list all share a
> >> > + * page_mapping().  If this is violated you will BUG_ON().
> >> > + */
> >> > +static int __remove_mapping_batch(struct list_head *remove_list,
> >> > +                                 struct list_head *ret_pages,
> >> > +                                 struct list_head *free_pages)
> >> > +{
> >> > +       int nr_reclaimed = 0;
> >> > +       struct address_space *mapping;
> >> > +       struct page *page;
> >> > +       LIST_HEAD(need_free_mapping);
> >> > +
> >> > +       if (list_empty(remove_list))
> >> > +               return 0;
> >> > +
> >> > +       mapping = page_mapping(lru_to_page(remove_list));
> >> > +       spin_lock_irq(&mapping->tree_lock);
> >> > +       while (!list_empty(remove_list)) {
> >> > +               page = lru_to_page(remove_list);
> >> > +               BUG_ON(!PageLocked(page));
> >> > +               BUG_ON(page_mapping(page) != mapping);
> >> > +               list_del(&page->lru);
> >> > +
> >> > +               if (!__remove_mapping(mapping, page)) {
> >> > +                       unlock_page(page);
> >> > +                       list_add(&page->lru, ret_pages);
> >> > +                       continue;
> >> > +               }
> >> > +               list_add(&page->lru, &need_free_mapping);
> >> > +       }
> >> > +       spin_unlock_irq(&mapping->tree_lock);
> >> > +
> >> While reclaiming pages, can we give the IRQ controller a chance
> >> (re-enable IRQs) if the page list length is over 10, or even 20?
> >
> > At the moment, it is implicitly bounded by SWAP_CLUSTER_MAX, which is
> 
> Could we reclaim a THP currently?

You mean that we could have (512 * SWAP_CLUSTER_MAX) pages in
page_list in the worst case?
Yes, but in that case we drain batch_for_mapping_rm via [6/6], so a
THP page shouldn't be a problem, IMO.
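
For scale, assuming the usual x86-64 values (512 4KB pages per THP and
SWAP_CLUSTER_MAX of 32), that theoretical worst case is 512 * 32 =
16,384 pages, i.e. roughly 64MB worth of pages, before a drain kicks in.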

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations
  2013-06-05  7:28       ` Hillf Danton
  2013-06-05  7:57         ` Minchan Kim
@ 2013-06-05 14:24         ` Dave Hansen
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2013-06-05 14:24 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Minchan Kim, linux-mm, linux-kernel, akpm, mgorman, tim.c.chen

On 06/05/2013 12:28 AM, Hillf Danton wrote:
>>>> >> > +       mapping = page_mapping(lru_to_page(remove_list));
>>>> >> > +       spin_lock_irq(&mapping->tree_lock);
>>>> >> > +       while (!list_empty(remove_list)) {
>>>> >> > +               page = lru_to_page(remove_list);
>>>> >> > +               BUG_ON(!PageLocked(page));
>>>> >> > +               BUG_ON(page_mapping(page) != mapping);
>>>> >> > +               list_del(&page->lru);
>>>> >> > +
>>>> >> > +               if (!__remove_mapping(mapping, page)) {
>>>> >> > +                       unlock_page(page);
>>>> >> > +                       list_add(&page->lru, ret_pages);
>>>> >> > +                       continue;
>>>> >> > +               }
>>>> >> > +               list_add(&page->lru, &need_free_mapping);
>>>> >> > +       }
>>>> >> > +       spin_unlock_irq(&mapping->tree_lock);
>>>> >> > +
>>> >> While reclaiming pages, can we give the IRQ controller a chance
>>> >> (re-enable IRQs) if the page list length is over 10, or even 20?
>> >
>> > At the moment, it is implicitly bounded by SWAP_CLUSTER_MAX, which is
> Could we reclaim a THP currently?

No, it would be split first.


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2013-06-05 14:24 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-03 20:02 [v5][PATCH 0/6] mm: vmscan: Batch page reclamation under shink_page_list Dave Hansen
2013-06-03 20:02 ` [v5][PATCH 1/6] mm: swap: defer clearing of page_private() for swap cache pages Dave Hansen
2013-06-03 20:02 ` [v5][PATCH 2/6] mm: swap: make 'struct page' and swp_entry_t variants of swapcache_free() Dave Hansen
2013-06-03 20:02 ` [v5][PATCH 3/6] mm: vmscan: break up __remove_mapping() Dave Hansen
2013-06-03 20:02 ` [v5][PATCH 4/6] mm: vmscan: break out mapping "freepage" code Dave Hansen
2013-06-03 20:02 ` [v5][PATCH 5/6] mm: vmscan: batch shrink_page_list() locking operations Dave Hansen
2013-06-04  1:17   ` Hillf Danton
2013-06-04  5:07     ` Minchan Kim
2013-06-04 15:22       ` Dave Hansen
2013-06-05  7:28       ` Hillf Danton
2013-06-05  7:57         ` Minchan Kim
2013-06-05 14:24         ` Dave Hansen
2013-06-04  5:01   ` Minchan Kim
2013-06-04  6:02     ` Minchan Kim
2013-06-04 15:29       ` Dave Hansen
2013-06-04 23:32         ` Minchan Kim
2013-06-04  6:10     ` Dave Hansen
2013-06-04  6:59       ` Minchan Kim
2013-06-03 20:02 ` [v5][PATCH 6/6] mm: vmscan: drain batch list during long operations Dave Hansen
2013-06-04  6:05   ` Minchan Kim
2013-06-04 15:24     ` Dave Hansen
2013-06-04 23:23       ` Minchan Kim
2013-06-04 23:31         ` Dave Hansen
2013-06-04 23:36           ` Minchan Kim
2013-06-04 23:37   ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).