linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/7] mm: Improve swap path scalability with batched operations
       [not found] <cover.1462306228.git.tim.c.chen@linux.intel.com>
@ 2016-05-03 21:00 ` Tim Chen
  2016-05-04 12:45   ` Michal Hocko
  2016-05-03 21:01 ` [PATCH 1/7] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions Tim Chen
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Tim Chen @ 2016-05-03 21:00 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Minchan Kim, Hugh Dickins
  Cc: Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel

The page swap out path is not scalable due to the numerous locks
acquired and released along the way, all of which are taken on a
page by page basis, e.g.:

1. The acquisition of the mapping tree lock in the swap cache when
adding a page to the swap cache, and then again when deleting a page
from the swap cache after it has been swapped out.
2. The acquisition of the lock on the swap device to allocate a swap
slot for a page to be swapped out.

With the advent of high speed block devices that are several orders
of magnitude faster than the old spinning disks, these bottlenecks
become fairly significant, especially on server class machines with
many threads running.  To reduce these locking costs, this patch
series attempts to batch the pages for the following swap operations
(see the sketch after the list):
1. Allocate swap slots in large batches, so locks on the swap device
don't need to be acquired as often.
2. Add anonymous pages to the swap cache for the same swap device in
batches, so the mapping tree lock can be acquired less often.
3. Delete pages from the swap cache in batches as well.
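
To make the flow concrete, here is a minimal sketch of how the batched
interfaces introduced later in the series (get_swap_pages from patch 3,
add_to_swap_batch from patch 5) are meant to fit together.
swap_out_batch() is a hypothetical helper used only for illustration;
error handling and the actual list manipulation are omitted:

	/*
	 * Sketch only: batch swap slot allocation and swap cache
	 * insertion for locked anonymous pages already isolated
	 * from the LRU.
	 */
	static void swap_out_batch(struct page *pages[], int npages,
				   struct list_head *page_list)
	{
		swp_entry_t entries[SWAP_BATCH];
		int ret[SWAP_BATCH];
		int i, got;

		if (npages > SWAP_BATCH)
			npages = SWAP_BATCH;

		/* one swap_info lock acquisition covers all the slots */
		got = get_swap_pages(npages, entries);

		/* one tree lock acquisition per swap device for the batch */
		add_to_swap_batch(pages, page_list, entries, ret, got);

		for (i = 0; i < got; i++) {
			if (ret[i]) {
				/* in the swap cache: unmap and page it out */
			} else {
				/* add_to_swap_batch() already released the
				 * slot; the page is reactivated instead */
			}
		}
		/* pages beyond 'got' simply stay on the reclaim list */
	}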

We experimented with the effect of these patches.  We set up N threads
to access memory in excess of the memory capacity, causing swapping.
In experiments using a single pmem based fast block device on a
2 socket machine, we saw a ~25% increase in swap throughput for
1 thread and a ~85% increase for 16 threads, compared with the vanilla
kernel.  Batching helps even for 1 thread because of contention with
kswapd when doing direct memory reclaim.

Feedback and reviews of this patch series are much appreciated.

Thanks.

Tim


Tim Chen (7):
  mm: Cleanup - Reorganize the shrink_page_list code into smaller
    functions
  mm: Group the processing of anonymous pages to be swapped in
    shrink_page_list
  mm: Add new functions to allocate swap slots in batches
  mm: Shrink page list batch allocates swap slots for page swapping
  mm: Batch addition of pages to swap cache
  mm: Cleanup - Reorganize code to group handling of page
  mm: Batch unmapping of pages that are in swap cache

 include/linux/swap.h |  29 ++-
 mm/swap_state.c      | 253 +++++++++++++-----
 mm/swapfile.c        | 215 +++++++++++++--
 mm/vmscan.c          | 725 ++++++++++++++++++++++++++++++++++++++-------------
 4 files changed, 945 insertions(+), 277 deletions(-)

-- 
2.5.5


* [PATCH 1/7] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
       [not found] <cover.1462306228.git.tim.c.chen@linux.intel.com>
  2016-05-03 21:00 ` [PATCH 0/7] mm: Improve swap path scalability with batched operations Tim Chen
@ 2016-05-03 21:01 ` Tim Chen
  2016-05-27 16:40   ` Tim Chen
  2016-05-03 21:01 ` [PATCH 2/7] mm: Group the processing of anonymous pages to be swapped in shrink_page_list Tim Chen
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Tim Chen @ 2016-05-03 21:01 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Minchan Kim, Hugh Dickins
  Cc: Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel

This patch prepares the code for being able to batch the anonymous
pages to be swapped out.  It reorganizes the shrink_page_list function
by introducing 2 new functions: handle_pgout and pg_finish.

The paging operation in shrink_page_list is consolidated into the
handle_pgout function.

After we have scanned a page in shrink_page_list and completed any
paging, the final disposition and clean up of the page is consolidated
into pg_finish.  The disposition chosen during page scanning in
shrink_page_list is marked with one of the designations in the
pg_result enum.

This is a clean up patch and there is no change in the functionality
or logic of the code.
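
For orientation, the per-page flow after this patch looks roughly as
follows (condensed from the diff below; most of the scanning checks
are elided):

	/* inside the main loop of shrink_page_list() */
	enum pg_result pg_dispose = PG_UNKNOWN;
	int swap_ret = SWAP_SUCCESS;

	if (!trylock_page(page)) {
		pg_dispose = PG_KEEP;
		goto finish;
	}
	/* ... further checks pick PG_KEEP_LOCKED, PG_ACTIVATE_LOCKED, ... */

	/* the paging work itself is consolidated here */
	pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags,
				  references, may_enter_fs, lazyfree,
				  &swap_ret, page);
finish:
	/* the final disposition of the page is consolidated here */
	pg_finish(page, pg_dispose, swap_ret, &nr_reclaimed,
		  &pgactivate, &ret_pages, &free_pages);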

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/vmscan.c | 429 ++++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 246 insertions(+), 183 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b934223e..5542005 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -873,6 +873,216 @@ static void page_check_dirty_writeback(struct page *page,
 		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
+enum pg_result {
+	PG_SPECULATIVE_REF,
+	PG_FREE,
+	PG_MLOCKED,
+	PG_ACTIVATE_LOCKED,
+	PG_KEEP_LOCKED,
+	PG_KEEP,
+	PG_NEXT,
+	PG_UNKNOWN,
+};
+
+static enum pg_result handle_pgout(struct list_head *page_list,
+	struct zone *zone,
+	struct scan_control *sc,
+	enum ttu_flags ttu_flags,
+	enum page_references references,
+	bool may_enter_fs,
+	bool lazyfree,
+	int  *swap_ret,
+	struct page *page)
+{
+	struct address_space *mapping;
+
+	mapping =  page_mapping(page);
+
+	/*
+	 * The page is mapped into the page tables of one or more
+	 * processes. Try to unmap it here.
+	 */
+	if (page_mapped(page) && mapping) {
+		switch (*swap_ret = try_to_unmap(page, lazyfree ?
+			(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
+			(ttu_flags | TTU_BATCH_FLUSH))) {
+		case SWAP_FAIL:
+			return PG_ACTIVATE_LOCKED;
+		case SWAP_AGAIN:
+			return PG_KEEP_LOCKED;
+		case SWAP_MLOCK:
+			return PG_MLOCKED;
+		case SWAP_LZFREE:
+			goto lazyfree;
+		case SWAP_SUCCESS:
+			; /* try to free the page below */
+		}
+	}
+
+	if (PageDirty(page)) {
+		/*
+		 * Only kswapd can writeback filesystem pages to
+		 * avoid risk of stack overflow but only writeback
+		 * if many dirty pages have been encountered.
+		 */
+		if (page_is_file_cache(page) &&
+				(!current_is_kswapd() ||
+				 !test_bit(ZONE_DIRTY, &zone->flags))) {
+			/*
+			 * Immediately reclaim when written back.
+			 * Similar in principal to deactivate_page()
+			 * except we already have the page isolated
+			 * and know it's dirty
+			 */
+			inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
+			SetPageReclaim(page);
+
+			return PG_KEEP_LOCKED;
+		}
+
+		if (references == PAGEREF_RECLAIM_CLEAN)
+			return PG_KEEP_LOCKED;
+		if (!may_enter_fs)
+			return PG_KEEP_LOCKED;
+		if (!sc->may_writepage)
+			return PG_KEEP_LOCKED;
+
+		/*
+		 * Page is dirty. Flush the TLB if a writable entry
+		 * potentially exists to avoid CPU writes after IO
+		 * starts and then write it out here.
+		 */
+		try_to_unmap_flush_dirty();
+		switch (pageout(page, mapping, sc)) {
+		case PAGE_KEEP:
+			return PG_KEEP_LOCKED;
+		case PAGE_ACTIVATE:
+			return PG_ACTIVATE_LOCKED;
+		case PAGE_SUCCESS:
+			if (PageWriteback(page))
+				return PG_KEEP;
+			if (PageDirty(page))
+				return PG_KEEP;
+
+			/*
+			 * A synchronous write - probably a ramdisk.  Go
+			 * ahead and try to reclaim the page.
+			 */
+			if (!trylock_page(page))
+				return PG_KEEP;
+			if (PageDirty(page) || PageWriteback(page))
+				return PG_KEEP_LOCKED;
+			mapping = page_mapping(page);
+		case PAGE_CLEAN:
+			; /* try to free the page below */
+		}
+	}
+
+	/*
+	 * If the page has buffers, try to free the buffer mappings
+	 * associated with this page. If we succeed we try to free
+	 * the page as well.
+	 *
+	 * We do this even if the page is PageDirty().
+	 * try_to_release_page() does not perform I/O, but it is
+	 * possible for a page to have PageDirty set, but it is actually
+	 * clean (all its buffers are clean).  This happens if the
+	 * buffers were written out directly, with submit_bh(). ext3
+	 * will do this, as well as the blockdev mapping.
+	 * try_to_release_page() will discover that cleanness and will
+	 * drop the buffers and mark the page clean - it can be freed.
+	 *
+	 * Rarely, pages can have buffers and no ->mapping.  These are
+	 * the pages which were not successfully invalidated in
+	 * truncate_complete_page().  We try to drop those buffers here
+	 * and if that worked, and the page is no longer mapped into
+	 * process address space (page_count == 1) it can be freed.
+	 * Otherwise, leave the page on the LRU so it is swappable.
+	 */
+	if (page_has_private(page)) {
+		if (!try_to_release_page(page, sc->gfp_mask))
+			return PG_ACTIVATE_LOCKED;
+		if (!mapping && page_count(page) == 1) {
+			unlock_page(page);
+			if (put_page_testzero(page))
+				return PG_FREE;
+			else {
+				/*
+				 * rare race with speculative reference.
+				 * the speculative reference will free
+				 * this page shortly, so we may
+				 * increment nr_reclaimed (and
+				 * leave it off the LRU).
+				 */
+				return PG_SPECULATIVE_REF;
+			}
+		}
+	}
+
+lazyfree:
+	if (!mapping || !__remove_mapping(mapping, page, true))
+		return PG_KEEP_LOCKED;
+
+	/*
+	 * At this point, we have no other references and there is
+	 * no way to pick any more up (removed from LRU, removed
+	 * from pagecache). Can use non-atomic bitops now (and
+	 * we obviously don't have to worry about waking up a process
+	 * waiting on the page lock, because there are no references.
+	 */
+	__ClearPageLocked(page);
+	return PG_FREE;
+}
+
+static void pg_finish(struct page *page,
+	enum pg_result pg_dispose,
+	int swap_ret,
+	unsigned long *nr_reclaimed,
+	int *pgactivate,
+	struct list_head *ret_pages,
+	struct list_head *free_pages)
+{
+	switch (pg_dispose) {
+	case PG_SPECULATIVE_REF:
+		++*nr_reclaimed;
+		return;
+	case PG_FREE:
+		if (swap_ret == SWAP_LZFREE)
+			count_vm_event(PGLAZYFREED);
+
+		++*nr_reclaimed;
+		/*
+		 * Is there need to periodically free_page_list? It would
+		 * appear not as the counts should be low
+		 */
+		list_add(&page->lru, free_pages);
+		return;
+	case PG_MLOCKED:
+		if (PageSwapCache(page))
+			try_to_free_swap(page);
+		unlock_page(page);
+		list_add(&page->lru, ret_pages);
+		return;
+	case PG_ACTIVATE_LOCKED:
+		/* Not a candidate for swapping, so reclaim swap space. */
+		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
+			try_to_free_swap(page);
+		VM_BUG_ON_PAGE(PageActive(page), page);
+		SetPageActive(page);
+		++*pgactivate;
+	case PG_KEEP_LOCKED:
+		unlock_page(page);
+	case PG_KEEP:
+		list_add(&page->lru, ret_pages);
+	case PG_NEXT:
+		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
+		break;
+	case PG_UNKNOWN:
+		VM_BUG_ON_PAGE((pg_dispose == PG_UNKNOWN), page);
+		break;
+	}
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -904,28 +1114,35 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		struct page *page;
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
+		enum pg_result pg_dispose = PG_UNKNOWN;
 		bool dirty, writeback;
 		bool lazyfree = false;
-		int ret = SWAP_SUCCESS;
+		int swap_ret = SWAP_SUCCESS;
 
 		cond_resched();
 
 		page = lru_to_page(page_list);
 		list_del(&page->lru);
 
-		if (!trylock_page(page))
-			goto keep;
+		if (!trylock_page(page)) {
+			pg_dispose = PG_KEEP;
+			goto finish;
+		}
 
 		VM_BUG_ON_PAGE(PageActive(page), page);
 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		sc->nr_scanned++;
 
-		if (unlikely(!page_evictable(page)))
-			goto cull_mlocked;
+		if (unlikely(!page_evictable(page))) {
+			pg_dispose = PG_MLOCKED;
+			goto finish;
+		}
 
-		if (!sc->may_unmap && page_mapped(page))
-			goto keep_locked;
+		if (!sc->may_unmap && page_mapped(page)) {
+			pg_dispose = PG_KEEP_LOCKED;
+			goto finish;
+		}
 
 		/* Double the slab pressure for mapped and swapcache pages */
 		if (page_mapped(page) || PageSwapCache(page))
@@ -998,7 +1215,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			    PageReclaim(page) &&
 			    test_bit(ZONE_WRITEBACK, &zone->flags)) {
 				nr_immediate++;
-				goto keep_locked;
+				pg_dispose = PG_KEEP_LOCKED;
+				goto finish;
 
 			/* Case 2 above */
 			} else if (sane_reclaim(sc) ||
@@ -1016,7 +1234,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 */
 				SetPageReclaim(page);
 				nr_writeback++;
-				goto keep_locked;
+				pg_dispose = PG_KEEP_LOCKED;
+				goto finish;
 
 			/* Case 3 above */
 			} else {
@@ -1033,9 +1252,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 		switch (references) {
 		case PAGEREF_ACTIVATE:
-			goto activate_locked;
+			pg_dispose = PG_ACTIVATE_LOCKED;
+			goto finish;
 		case PAGEREF_KEEP:
-			goto keep_locked;
+			pg_dispose = PG_KEEP_LOCKED;
+			goto finish;
 		case PAGEREF_RECLAIM:
 		case PAGEREF_RECLAIM_CLEAN:
 			; /* try to reclaim the page below */
@@ -1046,183 +1267,25 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * Try to allocate it some swap space here.
 		 */
 		if (PageAnon(page) && !PageSwapCache(page)) {
-			if (!(sc->gfp_mask & __GFP_IO))
-				goto keep_locked;
-			if (!add_to_swap(page, page_list))
-				goto activate_locked;
-			lazyfree = true;
-			may_enter_fs = 1;
-
-			/* Adding to swap updated mapping */
-			mapping = page_mapping(page);
-		}
-
-		/*
-		 * The page is mapped into the page tables of one or more
-		 * processes. Try to unmap it here.
-		 */
-		if (page_mapped(page) && mapping) {
-			switch (ret = try_to_unmap(page, lazyfree ?
-				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
-				(ttu_flags | TTU_BATCH_FLUSH))) {
-			case SWAP_FAIL:
-				goto activate_locked;
-			case SWAP_AGAIN:
-				goto keep_locked;
-			case SWAP_MLOCK:
-				goto cull_mlocked;
-			case SWAP_LZFREE:
-				goto lazyfree;
-			case SWAP_SUCCESS:
-				; /* try to free the page below */
+			if (!(sc->gfp_mask & __GFP_IO)) {
+				pg_dispose = PG_KEEP_LOCKED;
+				goto finish;
 			}
-		}
-
-		if (PageDirty(page)) {
-			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but only writeback
-			 * if many dirty pages have been encountered.
-			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() ||
-					 !test_bit(ZONE_DIRTY, &zone->flags))) {
-				/*
-				 * Immediately reclaim when written back.
-				 * Similar in principal to deactivate_page()
-				 * except we already have the page isolated
-				 * and know it's dirty
-				 */
-				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
-				SetPageReclaim(page);
-
-				goto keep_locked;
-			}
-
-			if (references == PAGEREF_RECLAIM_CLEAN)
-				goto keep_locked;
-			if (!may_enter_fs)
-				goto keep_locked;
-			if (!sc->may_writepage)
-				goto keep_locked;
-
-			/*
-			 * Page is dirty. Flush the TLB if a writable entry
-			 * potentially exists to avoid CPU writes after IO
-			 * starts and then write it out here.
-			 */
-			try_to_unmap_flush_dirty();
-			switch (pageout(page, mapping, sc)) {
-			case PAGE_KEEP:
-				goto keep_locked;
-			case PAGE_ACTIVATE:
-				goto activate_locked;
-			case PAGE_SUCCESS:
-				if (PageWriteback(page))
-					goto keep;
-				if (PageDirty(page))
-					goto keep;
-
-				/*
-				 * A synchronous write - probably a ramdisk.  Go
-				 * ahead and try to reclaim the page.
-				 */
-				if (!trylock_page(page))
-					goto keep;
-				if (PageDirty(page) || PageWriteback(page))
-					goto keep_locked;
-				mapping = page_mapping(page);
-			case PAGE_CLEAN:
-				; /* try to free the page below */
-			}
-		}
-
-		/*
-		 * If the page has buffers, try to free the buffer mappings
-		 * associated with this page. If we succeed we try to free
-		 * the page as well.
-		 *
-		 * We do this even if the page is PageDirty().
-		 * try_to_release_page() does not perform I/O, but it is
-		 * possible for a page to have PageDirty set, but it is actually
-		 * clean (all its buffers are clean).  This happens if the
-		 * buffers were written out directly, with submit_bh(). ext3
-		 * will do this, as well as the blockdev mapping.
-		 * try_to_release_page() will discover that cleanness and will
-		 * drop the buffers and mark the page clean - it can be freed.
-		 *
-		 * Rarely, pages can have buffers and no ->mapping.  These are
-		 * the pages which were not successfully invalidated in
-		 * truncate_complete_page().  We try to drop those buffers here
-		 * and if that worked, and the page is no longer mapped into
-		 * process address space (page_count == 1) it can be freed.
-		 * Otherwise, leave the page on the LRU so it is swappable.
-		 */
-		if (page_has_private(page)) {
-			if (!try_to_release_page(page, sc->gfp_mask))
-				goto activate_locked;
-			if (!mapping && page_count(page) == 1) {
-				unlock_page(page);
-				if (put_page_testzero(page))
-					goto free_it;
-				else {
-					/*
-					 * rare race with speculative reference.
-					 * the speculative reference will free
-					 * this page shortly, so we may
-					 * increment nr_reclaimed here (and
-					 * leave it off the LRU).
-					 */
-					nr_reclaimed++;
-					continue;
-				}
+			if (!add_to_swap(page, page_list)) {
+				pg_dispose = PG_ACTIVATE_LOCKED;
+				goto finish;
 			}
+			lazyfree = true;
+			may_enter_fs = 1;
 		}
 
-lazyfree:
-		if (!mapping || !__remove_mapping(mapping, page, true))
-			goto keep_locked;
-
-		/*
-		 * At this point, we have no other references and there is
-		 * no way to pick any more up (removed from LRU, removed
-		 * from pagecache). Can use non-atomic bitops now (and
-		 * we obviously don't have to worry about waking up a process
-		 * waiting on the page lock, because there are no references.
-		 */
-		__ClearPageLocked(page);
-free_it:
-		if (ret == SWAP_LZFREE)
-			count_vm_event(PGLAZYFREED);
-
-		nr_reclaimed++;
+		pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags,
+				references, may_enter_fs, lazyfree,
+				&swap_ret, page);
+finish:
+		pg_finish(page, pg_dispose, swap_ret, &nr_reclaimed,
+				&pgactivate, &ret_pages, &free_pages);
 
-		/*
-		 * Is there need to periodically free_page_list? It would
-		 * appear not as the counts should be low
-		 */
-		list_add(&page->lru, &free_pages);
-		continue;
-
-cull_mlocked:
-		if (PageSwapCache(page))
-			try_to_free_swap(page);
-		unlock_page(page);
-		list_add(&page->lru, &ret_pages);
-		continue;
-
-activate_locked:
-		/* Not a candidate for swapping, so reclaim swap space. */
-		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
-			try_to_free_swap(page);
-		VM_BUG_ON_PAGE(PageActive(page), page);
-		SetPageActive(page);
-		pgactivate++;
-keep_locked:
-		unlock_page(page);
-keep:
-		list_add(&page->lru, &ret_pages);
-		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
 	mem_cgroup_uncharge_list(&free_pages);
-- 
2.5.5


* [PATCH 2/7] mm: Group the processing of anonymous pages to be swapped in shrink_page_list
       [not found] <cover.1462306228.git.tim.c.chen@linux.intel.com>
  2016-05-03 21:00 ` [PATCH 0/7] mm: Improve swap path scalability with batched operations Tim Chen
  2016-05-03 21:01 ` [PATCH 1/7] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions Tim Chen
@ 2016-05-03 21:01 ` Tim Chen
  2016-05-03 21:02 ` [PATCH 3/7] mm: Add new functions to allocate swap slots in batches Tim Chen
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2016-05-03 21:01 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Minchan Kim, Hugh Dickins
  Cc: Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel

This is a clean up patch to reorganize the processing of anonymous
pages in shrink_page_list.

We delay the swapping of anonymous pages in shrink_page_list and
gather them on a separate list.  This prepares for batching of the
pages to be swapped.  The processing of this list of anonymous pages
to be swapped is consolidated in the new function
shrink_anon_page_list.

Functionally, there is no change in the logic of how pages are
processed, only in the order in which the anonymous pages and file
mapped pages are processed in shrink_page_list.
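
Condensed, the resulting flow in shrink_page_list looks like this
(a sketch assembled from the diff below; the __GFP_IO check and other
details are omitted):

	/* in the scan loop: defer anonymous pages instead of swapping now */
	if (PageAnon(page) && !PageSwapCache(page)) {
		if (references == PAGEREF_RECLAIM_CLEAN) {
			list_add(&page->lru, &swap_pages_clean);
			++nr_swap_clean;
		} else {
			list_add(&page->lru, &swap_pages);
			++nr_swap;
		}
		pg_dispose = PG_NEXT;	/* move on to the next page */
		goto finish;
	}

	/* after the scan loop: process each deferred list as a group */
	nr_reclaimed += shrink_anon_page_list(page_list, zone, sc,
					&swap_pages_clean, &ret_pages,
					&free_pages, ttu_flags,
					&pgactivate, nr_swap_clean, true);
	nr_reclaimed += shrink_anon_page_list(page_list, zone, sc,
					&swap_pages, &ret_pages,
					&free_pages, ttu_flags,
					&pgactivate, nr_swap, false);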

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/vmscan.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 77 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5542005..132ba02 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1083,6 +1083,58 @@ static void pg_finish(struct page *page,
 	}
 }
 
+static unsigned long shrink_anon_page_list(struct list_head *page_list,
+	struct zone *zone,
+	struct scan_control *sc,
+	struct list_head *swap_pages,
+	struct list_head *ret_pages,
+	struct list_head *free_pages,
+	enum ttu_flags ttu_flags,
+	int *pgactivate,
+	int n,
+	bool clean)
+{
+	unsigned long nr_reclaimed = 0;
+	enum pg_result pg_dispose;
+
+	while (n > 0) {
+		struct page *page;
+		int swap_ret = SWAP_SUCCESS;
+
+		--n;
+		if (list_empty(swap_pages))
+		       return nr_reclaimed;
+
+		page = lru_to_page(swap_pages);
+
+		list_del(&page->lru);
+
+		/*
+		* Anonymous process memory has backing store?
+		* Try to allocate it some swap space here.
+		*/
+
+		if (!add_to_swap(page, page_list)) {
+			pg_finish(page, PG_ACTIVATE_LOCKED, swap_ret, &nr_reclaimed,
+					pgactivate, ret_pages, free_pages);
+			continue;
+		}
+
+		if (clean)
+			pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags,
+				PAGEREF_RECLAIM_CLEAN, true, true, &swap_ret, page);
+		else
+			pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags,
+				PAGEREF_RECLAIM, true, true, &swap_ret, page);
+
+		pg_finish(page, pg_dispose, swap_ret, &nr_reclaimed,
+				pgactivate, ret_pages, free_pages);
+	}
+	return nr_reclaimed;
+}
+
+
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1099,6 +1151,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(swap_pages);
+	LIST_HEAD(swap_pages_clean);
 	int pgactivate = 0;
 	unsigned long nr_unqueued_dirty = 0;
 	unsigned long nr_dirty = 0;
@@ -1106,6 +1160,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_writeback = 0;
 	unsigned long nr_immediate = 0;
+	unsigned long nr_swap = 0;
+	unsigned long nr_swap_clean = 0;
 
 	cond_resched();
 
@@ -1271,12 +1327,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				pg_dispose = PG_KEEP_LOCKED;
 				goto finish;
 			}
-			if (!add_to_swap(page, page_list)) {
-				pg_dispose = PG_ACTIVATE_LOCKED;
-				goto finish;
+			if (references == PAGEREF_RECLAIM_CLEAN) {
+				list_add(&page->lru, &swap_pages_clean);
+				++nr_swap_clean;
+			} else {
+				list_add(&page->lru, &swap_pages);
+				++nr_swap;
 			}
-			lazyfree = true;
-			may_enter_fs = 1;
+
+			pg_dispose = PG_NEXT;
+			goto finish;
+
 		}
 
 		pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags,
@@ -1288,6 +1349,17 @@ finish:
 
 	}
 
+	nr_reclaimed += shrink_anon_page_list(page_list, zone, sc,
+						&swap_pages_clean, &ret_pages,
+						&free_pages, ttu_flags,
+						&pgactivate, nr_swap_clean,
+						true);
+	nr_reclaimed += shrink_anon_page_list(page_list, zone, sc,
+						&swap_pages, &ret_pages,
+						&free_pages, ttu_flags,
+						&pgactivate, nr_swap,
+						false);
+
 	mem_cgroup_uncharge_list(&free_pages);
 	try_to_unmap_flush();
 	free_hot_cold_page_list(&free_pages, true);
-- 
2.5.5


* [PATCH 3/7] mm: Add new functions to allocate swap slots in batches
       [not found] <cover.1462306228.git.tim.c.chen@linux.intel.com>
                   ` (2 preceding siblings ...)
  2016-05-03 21:01 ` [PATCH 2/7] mm: Group the processing of anonymous pages to be swapped in shrink_page_list Tim Chen
@ 2016-05-03 21:02 ` Tim Chen
  2016-05-03 21:02 ` [PATCH 4/7] mm: Shrink page list batch allocates swap slots for page swapping Tim Chen
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2016-05-03 21:02 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Minchan Kim, Hugh Dickins
  Cc: Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel

Currently, swap slots have to be allocated one page at a time, causing
contention on the swap_info lock protecting the swap partition for
every page being swapped.

This patch adds the new functions get_swap_pages and
scan_swap_map_slots to request multiple swap slots at once.  This
reduces contention on the swap_info lock, as we only need to acquire
the lock once to get multiple slots.  scan_swap_map_slots can also
operate more efficiently, as swap slots often occur in clusters close
to each other on a swap device, and it is quicker to allocate them
together.

Multiple swap slots can also be freed in one shot with the new function
swapcache_free_entries, which further reduces contention on the
swap_info lock.
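
Roughly, the caller-side contract of the new allocation interface looks
like this (a usage sketch, not code from the series;
assign_slots_to_pages() is a hypothetical stand-in for the caller's
per-page work and returns how many slots it consumed):

	swp_entry_t entries[SWAP_BATCH];
	int got, used;

	/*
	 * A single swap_info lock acquisition covers the whole request.
	 * The return value may be smaller than asked for (or 0) when the
	 * device is fragmented or nearly full; requests larger than
	 * SWAP_BATCH are capped internally.
	 */
	got = get_swap_pages(SWAP_BATCH, entries);

	used = assign_slots_to_pages(entries, got);

	/* hand back whatever was not consumed, again under one lock */
	swapcache_free_entries(entries + used, got - used);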

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap.h |  27 +++++--
 mm/swap_state.c      |  23 +++---
 mm/swapfile.c        | 215 +++++++++++++++++++++++++++++++++++++++++++++------
 mm/vmscan.c          |   2 +-
 4 files changed, 228 insertions(+), 39 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2b83359..da6d994 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -23,6 +23,7 @@ struct bio;
 #define SWAP_FLAG_DISCARD	0x10000 /* enable discard for swap */
 #define SWAP_FLAG_DISCARD_ONCE	0x20000 /* discard swap area at swapon-time */
 #define SWAP_FLAG_DISCARD_PAGES 0x40000 /* discard page-clusters after use */
+#define SWAP_BATCH 64
 
 #define SWAP_FLAGS_VALID	(SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
 				 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
@@ -370,7 +371,8 @@ extern struct address_space swapper_spaces[];
 #define swap_address_space(entry) (&swapper_spaces[swp_type(entry)])
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *, struct list_head *list);
+extern int add_to_swap(struct page *, struct list_head *list,
+			swp_entry_t *entry);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
 extern void __delete_from_swap_cache(struct page *);
@@ -403,6 +405,7 @@ static inline long get_nr_swap_pages(void)
 
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
+extern int get_swap_pages(int n, swp_entry_t swp_entries[]);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
@@ -410,6 +413,7 @@ extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free(swp_entry_t);
+extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -429,7 +433,6 @@ struct backing_dev_info;
 #define total_swap_pages			0L
 #define total_swapcache_pages()			0UL
 #define vm_swap_full()				0
-
 #define si_swapinfo(val) \
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
 /* only sparc can not include linux/pagemap.h in this file
@@ -451,6 +454,21 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 	return 0;
 }
 
+static inline int add_to_swap(struct page *page, struct list_head *list,
+				swp_entry_t *entry)
+{
+	return 0;
+}
+
+static inline int get_swap_pages(int n, swp_entry_t swp_entries[])
+{
+	return 0;
+}
+
+static inline void swapcache_free_entries(swp_entry_t *entries, int n)
+{
+}
+
 static inline void swap_shmem_alloc(swp_entry_t swp)
 {
 }
@@ -484,11 +502,6 @@ static inline struct page *lookup_swap_cache(swp_entry_t swp)
 	return NULL;
 }
 
-static inline int add_to_swap(struct page *page, struct list_head *list)
-{
-	return 0;
-}
-
 static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
 							gfp_t gfp_mask)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 366ce35..bad02c1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -154,30 +154,35 @@ void __delete_from_swap_cache(struct page *page)
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
+ * @entry: swap entry that we have pre-allocated
  *
  * Allocate swap space for the page and add the page to the
  * swap cache.  Caller needs to hold the page lock. 
  */
-int add_to_swap(struct page *page, struct list_head *list)
+int add_to_swap(struct page *page, struct list_head *list, swp_entry_t *entry)
 {
-	swp_entry_t entry;
 	int err;
+	swp_entry_t ent;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
-	entry = get_swap_page();
-	if (!entry.val)
+	if (!entry) {
+		ent = get_swap_page();
+		entry = &ent;
+	}
+
+	if (entry && !entry->val)
 		return 0;
 
-	if (mem_cgroup_try_charge_swap(page, entry)) {
-		swapcache_free(entry);
+	if (mem_cgroup_try_charge_swap(page, *entry)) {
+		swapcache_free(*entry);
 		return 0;
 	}
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry);
+			swapcache_free(*entry);
 			return 0;
 		}
 
@@ -192,7 +197,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 	/*
 	 * Add it to the swap cache.
 	 */
-	err = add_to_swap_cache(page, entry,
+	err = add_to_swap_cache(page, *entry,
 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
 
 	if (!err) {
@@ -202,7 +207,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry);
+		swapcache_free(*entry);
 		return 0;
 	}
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 83874ec..2c294a6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -437,7 +437,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
  * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
  * might involve allocating a new cluster for current CPU too.
  */
-static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
+static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned long *offset, unsigned long *scan_base)
 {
 	struct percpu_cluster *cluster;
@@ -460,7 +460,7 @@ new_cluster:
 			*scan_base = *offset = si->cluster_next;
 			goto new_cluster;
 		} else
-			return;
+			return false;
 	}
 
 	found_free = false;
@@ -485,15 +485,21 @@ new_cluster:
 	cluster->next = tmp + 1;
 	*offset = tmp;
 	*scan_base = tmp;
+	return found_free;
 }
 
-static unsigned long scan_swap_map(struct swap_info_struct *si,
-				   unsigned char usage)
+static int scan_swap_map_slots(struct swap_info_struct *si,
+				   unsigned char usage, int nr,
+				   unsigned long slots[])
 {
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
 	int latency_ration = LATENCY_LIMIT;
+	int n_ret = 0;
+
+	if (nr > SWAP_BATCH)
+		nr = SWAP_BATCH;
 
 	/*
 	 * We try to cluster swap pages by allocating them sequentially
@@ -511,8 +517,10 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 	/* SSD algorithm */
 	if (si->cluster_info) {
-		scan_swap_map_try_ssd_cluster(si, &offset, &scan_base);
-		goto checks;
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+			goto checks;
+		else
+			goto done;
 	}
 
 	if (unlikely(!si->cluster_nr--)) {
@@ -556,8 +564,14 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 checks:
 	if (si->cluster_info) {
-		while (scan_swap_map_ssd_cluster_conflict(si, offset))
-			scan_swap_map_try_ssd_cluster(si, &offset, &scan_base);
+		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
+		/* take a break if we already got some slots */
+			if (n_ret)
+				goto done;
+			if (!scan_swap_map_try_ssd_cluster(si, &offset,
+							&scan_base))
+				goto done;
+		}
 	}
 	if (!(si->flags & SWP_WRITEOK))
 		goto no_page;
@@ -578,8 +592,12 @@ checks:
 		goto scan; /* check next one */
 	}
 
-	if (si->swap_map[offset])
-		goto scan;
+	if (si->swap_map[offset]) {
+		if (!n_ret)
+			goto scan;
+		else
+			goto done;
+	}
 
 	if (offset == si->lowest_bit)
 		si->lowest_bit++;
@@ -596,9 +614,42 @@ checks:
 	si->swap_map[offset] = usage;
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	si->cluster_next = offset + 1;
-	si->flags -= SWP_SCANNING;
+	slots[n_ret] = offset;
+	++n_ret;
 
-	return offset;
+	/* got enough slots or reach max slots? */
+	if ((n_ret == nr) || (offset >= si->highest_bit))
+		goto done;
+
+	/* search for next available slot */
+
+	/* time to take a break? */
+	if (unlikely(--latency_ration < 0)) {
+		spin_unlock(&si->lock);
+		cond_resched();
+		spin_lock(&si->lock);
+		latency_ration = LATENCY_LIMIT;
+	}
+
+	/* try to get more slots in cluster */
+	if (si->cluster_info) {
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+			goto checks;
+		else
+			goto done;
+	}
+	/* non-ssd case */
+	++offset;
+
+	/* non-ssd case, still more slots in cluster? */
+	if (si->cluster_nr && !si->swap_map[offset]) {
+		--si->cluster_nr;
+		goto checks;
+	}
+
+done:
+	si->flags -= SWP_SCANNING;
+	return n_ret;
 
 scan:
 	spin_unlock(&si->lock);
@@ -636,17 +687,44 @@ scan:
 
 no_page:
 	si->flags -= SWP_SCANNING;
-	return 0;
+	return n_ret;
 }
 
-swp_entry_t get_swap_page(void)
+static unsigned long scan_swap_map(struct swap_info_struct *si,
+				   unsigned char usage)
+{
+	unsigned long slots[1];
+	int n_ret;
+
+	n_ret = scan_swap_map_slots(si, usage, 1, slots);
+
+	if (n_ret)
+		return slots[0];
+	else
+		return 0;
+
+}
+
+int get_swap_pages(int n, swp_entry_t swp_entries[])
 {
 	struct swap_info_struct *si, *next;
-	pgoff_t offset;
+	long avail_pgs, n_ret, n_goal;
 
-	if (atomic_long_read(&nr_swap_pages) <= 0)
+	n_ret = 0;
+	avail_pgs = atomic_long_read(&nr_swap_pages);
+	if (avail_pgs <= 0)
 		goto noswap;
-	atomic_long_dec(&nr_swap_pages);
+
+	n_goal = n;
+	swp_entries[0] = (swp_entry_t) {0};
+
+	if (n_goal > SWAP_BATCH)
+		n_goal = SWAP_BATCH;
+
+	if (n_goal > avail_pgs)
+		n_goal = avail_pgs;
+
+	atomic_long_sub(n_goal, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
 
@@ -674,10 +752,26 @@ start_over:
 		}
 
 		/* This is called for allocating swap entry for cache */
-		offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		while (n_ret < n_goal) {
+			unsigned long slots[SWAP_BATCH];
+			int ret, i;
+
+			ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+							n_goal-n_ret, slots);
+			if (!ret)
+				break;
+
+			for (i = 0; i < ret; ++i)
+				swp_entries[n_ret+i] = swp_entry(si->type,
+								slots[i]);
+
+			n_ret += ret;
+		}
+
 		spin_unlock(&si->lock);
-		if (offset)
-			return swp_entry(si->type, offset);
+		if (n_ret == n_goal)
+			return n_ret;
+
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
 		       si->type);
 		spin_lock(&swap_avail_lock);
@@ -698,9 +792,23 @@ nextsi:
 
 	spin_unlock(&swap_avail_lock);
 
-	atomic_long_inc(&nr_swap_pages);
+	if (n_ret < n_goal)
+		atomic_long_add((long) (n_goal-n_ret), &nr_swap_pages);
 noswap:
-	return (swp_entry_t) {0};
+	return n_ret;
+}
+
+swp_entry_t get_swap_page(void)
+{
+	swp_entry_t swp_entries[1];
+	long n_ret;
+
+	n_ret = get_swap_pages(1, swp_entries);
+
+	if (n_ret)
+		return swp_entries[0];
+	else
+		return (swp_entry_t) {0};
 }
 
 /* The only caller of this function is now suspend routine */
@@ -761,6 +869,47 @@ out:
 	return NULL;
 }
 
+static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
+					struct swap_info_struct *q)
+{
+	struct swap_info_struct *p;
+	unsigned long offset, type;
+
+	if (!entry.val)
+		goto out;
+	type = swp_type(entry);
+	if (type >= nr_swapfiles)
+		goto bad_nofile;
+	p = swap_info[type];
+	if (!(p->flags & SWP_USED))
+		goto bad_device;
+	offset = swp_offset(entry);
+	if (offset >= p->max)
+		goto bad_offset;
+	if (!p->swap_map[offset])
+		goto bad_free;
+	if (p != q) {
+		if (q != NULL)
+			spin_unlock(&q->lock);
+		spin_lock(&p->lock);
+	}
+	return p;
+
+bad_free:
+	pr_err("swap_free: %s%08lx\n", Unused_offset, entry.val);
+	goto out;
+bad_offset:
+	pr_err("swap_free: %s%08lx\n", Bad_offset, entry.val);
+	goto out;
+bad_device:
+	pr_err("swap_free: %s%08lx\n", Unused_file, entry.val);
+	goto out;
+bad_nofile:
+	pr_err("swap_free: %s%08lx\n", Bad_file, entry.val);
+out:
+	return NULL;
+}
+
 static unsigned char swap_entry_free(struct swap_info_struct *p,
 				     swp_entry_t entry, unsigned char usage)
 {
@@ -855,6 +1004,28 @@ void swapcache_free(swp_entry_t entry)
 	}
 }
 
+void swapcache_free_entries(swp_entry_t *entries, int n)
+{
+	struct swap_info_struct *p, *prev;
+	int i;
+
+	if (n <= 0)
+		return;
+
+	prev = NULL;
+	p = NULL;
+	for (i = 0; i < n; ++i) {
+		p = swap_info_get_cont(entries[i], prev);
+		if (p)
+			swap_entry_free(p, entries[i], SWAP_HAS_CACHE);
+		else
+			break;
+		prev = p;
+	}
+	if (p)
+		spin_unlock(&p->lock);
+}
+
 /*
  * How many references to page are currently swapped out?
  * This does not give an exact answer when swap count is continued,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 132ba02..e36d8a7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1114,7 +1114,7 @@ static unsigned long shrink_anon_page_list(struct list_head *page_list,
 		* Try to allocate it some swap space here.
 		*/
 
-		if (!add_to_swap(page, page_list)) {
+		if (!add_to_swap(page, page_list, NULL)) {
 			pg_finish(page, PG_ACTIVATE_LOCKED, swap_ret, &nr_reclaimed,
 					pgactivate, ret_pages, free_pages);
 			continue;
-- 
2.5.5


* [PATCH 4/7] mm: Shrink page list batch allocates swap slots for page swapping
       [not found] <cover.1462306228.git.tim.c.chen@linux.intel.com>
                   ` (3 preceding siblings ...)
  2016-05-03 21:02 ` [PATCH 3/7] mm: Add new functions to allocate swap slots in batches Tim Chen
@ 2016-05-03 21:02 ` Tim Chen
  2016-05-03 21:02 ` [PATCH 5/7] mm: Batch addition of pages to swap cache Tim Chen
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2016-05-03 21:02 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Minchan Kim, Hugh Dickins
  Cc: Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel

In shrink_page_list, we take advantage of the bulk allocation of swap
entries with the new get_swap_pages function.  This reduces contention
on a swap device's swap_info lock.  When memory is low and the system
is actively trying to reclaim memory, both the direct reclaim path and
kswapd contend on this lock when they access the same swap partition.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/vmscan.c | 63 ++++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e36d8a7..310e2b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1096,38 +1096,59 @@ static unsigned long shrink_anon_page_list(struct list_head *page_list,
 {
 	unsigned long nr_reclaimed = 0;
 	enum pg_result pg_dispose;
+	swp_entry_t swp_entries[SWAP_BATCH];
+	struct page *page;
+	int m, i, k;
 
 	while (n > 0) {
-		struct page *page;
 		int swap_ret = SWAP_SUCCESS;
 
-		--n;
-		if (list_empty(swap_pages))
-		       return nr_reclaimed;
+		m = get_swap_pages(n, swp_entries);
+		if (!m)
+			goto no_swap_slots;
+		n -= m;
+		for (i = 0; i < m; ++i) {
+			if (list_empty(swap_pages)) {
+				/* free any leftover swap slots */
+				for (k = i; k < m; ++k)
+					swapcache_free(swp_entries[k]);
+				return nr_reclaimed;
+			}
+			page = lru_to_page(swap_pages);
 
-		page = lru_to_page(swap_pages);
+			list_del(&page->lru);
 
-		list_del(&page->lru);
+			/*
+			* Anonymous process memory has backing store?
+			* Try to allocate it some swap space here.
+			*/
+
+			if (!add_to_swap(page, page_list, NULL)) {
+				pg_finish(page, PG_ACTIVATE_LOCKED, swap_ret,
+						&nr_reclaimed, pgactivate,
+						ret_pages, free_pages);
+				continue;
+			}
 
-		/*
-		* Anonymous process memory has backing store?
-		* Try to allocate it some swap space here.
-		*/
+			if (clean)
+				pg_dispose = handle_pgout(page_list, zone, sc,
+						ttu_flags, PAGEREF_RECLAIM_CLEAN,
+						true, true, &swap_ret, page);
+			else
+				pg_dispose = handle_pgout(page_list, zone, sc,
+						ttu_flags, PAGEREF_RECLAIM,
+						true, true, &swap_ret, page);
 
-		if (!add_to_swap(page, page_list, NULL)) {
-			pg_finish(page, PG_ACTIVATE_LOCKED, swap_ret, &nr_reclaimed,
+			pg_finish(page, pg_dispose, swap_ret, &nr_reclaimed,
 					pgactivate, ret_pages, free_pages);
-			continue;
 		}
+	}
+	return nr_reclaimed;
 
-		if (clean)
-			pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags,
-				PAGEREF_RECLAIM_CLEAN, true, true, &swap_ret, page);
-		else
-			pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags,
-				PAGEREF_RECLAIM, true, true, &swap_ret, page);
-
-		pg_finish(page, pg_dispose, swap_ret, &nr_reclaimed,
+no_swap_slots:
+	while (!list_empty(swap_pages)) {
+		page = lru_to_page(swap_pages);
+		pg_finish(page, PG_ACTIVATE_LOCKED, 0, &nr_reclaimed,
 				pgactivate, ret_pages, free_pages);
 	}
 	return nr_reclaimed;
-- 
2.5.5


* [PATCH 5/7] mm: Batch addition of pages to swap cache
       [not found] <cover.1462306228.git.tim.c.chen@linux.intel.com>
                   ` (4 preceding siblings ...)
  2016-05-03 21:02 ` [PATCH 4/7] mm: Shrink page list batch allocates swap slots for page swapping Tim Chen
@ 2016-05-03 21:02 ` Tim Chen
  2016-05-03 21:03 ` [PATCH 6/7] mm: Cleanup - Reorganize code to group handling of page Tim Chen
  2016-05-03 21:03 ` [PATCH 7/7] mm: Batch unmapping of pages that are in swap cache Tim Chen
  7 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2016-05-03 21:02 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Minchan Kim, Hugh Dickins
  Cc: Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel

When a page is to be swapped, it needs to be added to the swap cache
and then removed after the paging has been completed.  The swap
partition's mapping tree lock is acquired for each anonymous page
added to the swap cache.

This patch creates the new functions add_to_swap_batch and
__add_to_swap_cache_batch, which allow multiple pages destined for the
same swap partition to be added to that partition's swap cache in one
acquisition of the mapping tree lock.  These functions extend the
original add_to_swap and __add_to_swap_cache.  This reduces contention
on the swap partition's mapping tree lock when we are actively
reclaiming memory and swapping pages.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap.h |   2 +
 mm/swap_state.c      | 248 +++++++++++++++++++++++++++++++++++++--------------
 mm/vmscan.c          |  19 ++--
 3 files changed, 196 insertions(+), 73 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index da6d994..cd06f2a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -373,6 +373,8 @@ extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *, struct list_head *list,
 			swp_entry_t *entry);
+extern void add_to_swap_batch(struct page *pages[], struct list_head *list,
+			swp_entry_t entries[], int ret_codes[], int nr);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
 extern void __delete_from_swap_cache(struct page *);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index bad02c1..ce02024 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -72,49 +72,94 @@ void show_swap_cache_info(void)
 	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
 }
 
-/*
- * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
- * but sets SwapCache flag and private instead of mapping and index.
- */
-int __add_to_swap_cache(struct page *page, swp_entry_t entry)
+void __add_to_swap_cache_batch(struct page *pages[], swp_entry_t entries[],
+				int ret[], int nr)
 {
-	int error;
+	int error, i;
 	struct address_space *address_space;
+	struct address_space *prev;
+	struct page *page;
+	swp_entry_t entry;
 
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(PageSwapCache(page), page);
-	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+	prev = NULL;
+	address_space = NULL;
+	for (i = 0; i < nr; ++i) {
+		/* error at pre-processing stage, swap entry already released */
+		if (ret[i] == -ENOENT)
+			continue;
 
-	get_page(page);
-	SetPageSwapCache(page);
-	set_page_private(page, entry.val);
+		page = pages[i];
+		entry = entries[i];
 
-	address_space = swap_address_space(entry);
-	spin_lock_irq(&address_space->tree_lock);
-	error = radix_tree_insert(&address_space->page_tree,
-					entry.val, page);
-	if (likely(!error)) {
-		address_space->nrpages++;
-		__inc_zone_page_state(page, NR_FILE_PAGES);
-		INC_CACHE_INFO(add_total);
-	}
-	spin_unlock_irq(&address_space->tree_lock);
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+		VM_BUG_ON_PAGE(PageSwapCache(page), page);
+		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
-	if (unlikely(error)) {
-		/*
-		 * Only the context which have set SWAP_HAS_CACHE flag
-		 * would call add_to_swap_cache().
-		 * So add_to_swap_cache() doesn't returns -EEXIST.
-		 */
-		VM_BUG_ON(error == -EEXIST);
-		set_page_private(page, 0UL);
-		ClearPageSwapCache(page);
-		put_page(page);
+		get_page(page);
+		SetPageSwapCache(page);
+		set_page_private(page, entry.val);
+
+		address_space = swap_address_space(entry);
+		if (prev != address_space) {
+			if (prev)
+				spin_unlock_irq(&prev->tree_lock);
+			spin_lock_irq(&address_space->tree_lock);
+		}
+		error = radix_tree_insert(&address_space->page_tree,
+				entry.val, page);
+		if (likely(!error)) {
+			address_space->nrpages++;
+			__inc_zone_page_state(page, NR_FILE_PAGES);
+			INC_CACHE_INFO(add_total);
+		}
+
+		if (unlikely(error)) {
+			spin_unlock_irq(&address_space->tree_lock);
+			address_space = NULL;
+			/*
+			 * Only the context which have set SWAP_HAS_CACHE flag
+			 * would call add_to_swap_cache().
+			 * So add_to_swap_cache() doesn't returns -EEXIST.
+			 */
+			VM_BUG_ON(error == -EEXIST);
+			set_page_private(page, 0UL);
+			ClearPageSwapCache(page);
+			put_page(page);
+		}
+		prev = address_space;
+		ret[i] = error;
 	}
+	if (address_space)
+		spin_unlock_irq(&address_space->tree_lock);
+}
 
-	return error;
+/*
+ * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
+ * but sets SwapCache flag and private instead of mapping and index.
+ */
+int __add_to_swap_cache(struct page *page, swp_entry_t entry)
+{
+	swp_entry_t	entries[1];
+	struct page	*pages[1];
+	int	ret[1];
+
+	pages[0] = page;
+	entries[0] = entry;
+	__add_to_swap_cache_batch(pages, entries, ret, 1);
+	return ret[0];
 }
 
+void add_to_swap_cache_batch(struct page *pages[], swp_entry_t entries[],
+				gfp_t gfp_mask, int ret[], int nr)
+{
+	int error;
+
+	error = radix_tree_maybe_preload(gfp_mask);
+	if (!error) {
+		__add_to_swap_cache_batch(pages, entries, ret, nr);
+		radix_tree_preload_end();
+	}
+}
 
 int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
 {
@@ -151,6 +196,73 @@ void __delete_from_swap_cache(struct page *page)
 	INC_CACHE_INFO(del_total);
 }
 
+void add_to_swap_batch(struct page *pages[], struct list_head *list,
+			swp_entry_t entries[], int ret_codes[], int nr)
+{
+	swp_entry_t *entry;
+	struct page *page;
+	int i;
+
+	for (i = 0; i < nr; ++i) {
+		entry = &entries[i];
+		page = pages[i];
+
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+		VM_BUG_ON_PAGE(!PageUptodate(page), page);
+
+		ret_codes[i] = 1;
+
+		if (!entry->val)
+			ret_codes[i] = -ENOENT;
+
+		if (mem_cgroup_try_charge_swap(page, *entry)) {
+			swapcache_free(*entry);
+			ret_codes[i] = 0;
+		}
+
+		if (unlikely(PageTransHuge(page)))
+			if (unlikely(split_huge_page_to_list(page, list))) {
+				swapcache_free(*entry);
+				ret_codes[i] = -ENOENT;
+				continue;
+			}
+	}
+
+	/*
+	 * Radix-tree node allocations from PF_MEMALLOC contexts could
+	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
+	 * stops emergency reserves from being allocated.
+	 *
+	 * TODO: this could cause a theoretical memory reclaim
+	 * deadlock in the swap out path.
+	 */
+	/*
+	 * Add it to the swap cache
+	 */
+	add_to_swap_cache_batch(pages, entries,
+			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN,
+				ret_codes, nr);
+
+	for (i = 0; i < nr; ++i) {
+		entry = &entries[i];
+		page = pages[i];
+
+		if (!ret_codes[i]) {    /* Success */
+			ret_codes[i] = 1;
+			continue;
+		} else {        /* -ENOMEM radix-tree allocation failure */
+			/*
+			 * add_to_swap_cache() doesn't return -EEXIST,
+			 * so we can safely clear SWAP_HAS_CACHE flag.
+			 */
+			if (ret_codes[i] != -ENOENT)
+				swapcache_free(*entry);
+			ret_codes[i] = 0;
+			continue;
+		}
+	}
+}
+
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
@@ -161,54 +273,56 @@ void __delete_from_swap_cache(struct page *page)
  */
 int add_to_swap(struct page *page, struct list_head *list, swp_entry_t *entry)
 {
-	int err;
-	swp_entry_t ent;
+	int ret[1];
+	swp_entry_t ent[1];
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
 	if (!entry) {
-		ent = get_swap_page();
-		entry = &ent;
+		ent[0] = get_swap_page();
+		entry = &ent[0];
 	}
 
 	if (entry && !entry->val)
 		return 0;
 
-	if (mem_cgroup_try_charge_swap(page, *entry)) {
-		swapcache_free(*entry);
-		return 0;
-	}
+	add_to_swap_batch(&page, list, entry, ret, 1);
+	return ret[0];
+}
 
-	if (unlikely(PageTransHuge(page)))
-		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(*entry);
-			return 0;
+void delete_from_swap_cache_batch(struct page pages[], int nr)
+{
+	struct page *page;
+	swp_entry_t entry;
+	struct address_space *address_space, *prev;
+	int i;
+
+	prev = NULL;
+	address_space = NULL;
+	for (i = 0; i < nr; ++i) {
+		page = &pages[i];
+		entry.val = page_private(page);
+
+		address_space = swap_address_space(entry);
+		if (address_space != prev) {
+			if (prev)
+				spin_unlock_irq(&prev->tree_lock);
+			spin_lock_irq(&address_space->tree_lock);
 		}
+		__delete_from_swap_cache(page);
+		prev = address_space;
+	}
+	if (address_space)
+		spin_unlock_irq(&address_space->tree_lock);
 
-	/*
-	 * Radix-tree node allocations from PF_MEMALLOC contexts could
-	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
-	 * stops emergency reserves from being allocated.
-	 *
-	 * TODO: this could cause a theoretical memory reclaim
-	 * deadlock in the swap out path.
-	 */
-	/*
-	 * Add it to the swap cache.
-	 */
-	err = add_to_swap_cache(page, *entry,
-			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
+	for (i = 0; i < nr; ++i) {
+		page = &pages[i];
+		entry.val = page_private(page);
 
-	if (!err) {
-		return 1;
-	} else {	/* -ENOMEM radix-tree allocation failure */
-		/*
-		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
-		 * clear SWAP_HAS_CACHE flag.
-		 */
-		swapcache_free(*entry);
-		return 0;
+		/* can batch this */
+		swapcache_free(entry);
+		put_page(page);
 	}
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 310e2b2..fab61f1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1097,8 +1097,9 @@ static unsigned long shrink_anon_page_list(struct list_head *page_list,
 	unsigned long nr_reclaimed = 0;
 	enum pg_result pg_dispose;
 	swp_entry_t swp_entries[SWAP_BATCH];
+	struct page *pages[SWAP_BATCH];
+	int m, i, k, ret[SWAP_BATCH];
 	struct page *page;
-	int m, i, k;
 
 	while (n > 0) {
 		int swap_ret = SWAP_SUCCESS;
@@ -1117,13 +1118,19 @@ static unsigned long shrink_anon_page_list(struct list_head *page_list,
 			page = lru_to_page(swap_pages);
 
 			list_del(&page->lru);
+			pages[i] = page;
+		}
 
-			/*
-			* Anonymous process memory has backing store?
-			* Try to allocate it some swap space here.
-			*/
+		/*
+		* Anonymous process memory has backing store?
+		* Try to allocate it some swap space here.
+		*/
+		add_to_swap_batch(pages, page_list, swp_entries, ret, m);
+
+		for (i = 0; i < m; ++i) {
+			page = pages[i];
 
-			if (!add_to_swap(page, page_list, NULL)) {
+			if (!ret[i]) {
 				pg_finish(page, PG_ACTIVATE_LOCKED, swap_ret,
 						&nr_reclaimed, pgactivate,
 						ret_pages, free_pages);
-- 
2.5.5


* [PATCH 6/7] mm: Cleanup - Reorganize code to group handling of page
       [not found] <cover.1462306228.git.tim.c.chen@linux.intel.com>
                   ` (5 preceding siblings ...)
  2016-05-03 21:02 ` [PATCH 5/7] mm: Batch addtion of pages to swap cache Tim Chen
@ 2016-05-03 21:03 ` Tim Chen
  2016-05-03 21:03 ` [PATCH 7/7] mm: Batch unmapping of pages that are in swap cache Tim Chen
  7 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2016-05-03 21:03 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Minchan Kim, Hugh Dickins
  Cc: Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel

In this patch, we reorganize the paging operations so that pages
destined for the same swap device can be paged out together.  This
prepares for the next patch, which removes multiple pages from the
same swap cache together once they have been paged out.

The patch creates a new function handle_pgout_batch that takes the
code of handle_pgout and puts a loop around it to process multiple
pages.
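
The callers are not converted in this patch, but the intended shape of
a batched call is roughly the following (a sketch based only on the
signatures below, with variables as in shrink_anon_page_list; not the
actual follow-on code):

	int swap_ret[SWAP_BATCH], ret[SWAP_BATCH];

	/*
	 * ret[] carries the swap cache addition result in and the
	 * enum pg_result disposition out, one slot per page.
	 */
	handle_pgout_batch(page_list, zone, sc, ttu_flags, PAGEREF_RECLAIM,
			   true, true, pages, swap_ret, ret, m);

	for (i = 0; i < m; i++)
		pg_finish(pages[i], ret[i], swap_ret[i], &nr_reclaimed,
			  pgactivate, ret_pages, free_pages);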

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/vmscan.c | 338 +++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 196 insertions(+), 142 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fab61f1..9fc04e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -884,154 +884,218 @@ enum pg_result {
 	PG_UNKNOWN,
 };
 
-static enum pg_result handle_pgout(struct list_head *page_list,
+static void handle_pgout_batch(struct list_head *page_list,
 	struct zone *zone,
 	struct scan_control *sc,
 	enum ttu_flags ttu_flags,
 	enum page_references references,
 	bool may_enter_fs,
 	bool lazyfree,
-	int  *swap_ret,
-	struct page *page)
+	struct page *pages[],
+	int  swap_ret[],
+	int ret[],
+	int nr)
 {
 	struct address_space *mapping;
+	struct page *page;
+	int i;
 
-	mapping =  page_mapping(page);
+	for (i = 0; i < nr; ++i) {
+		page = pages[i];
+		mapping =  page_mapping(page);
 
-	/*
-	 * The page is mapped into the page tables of one or more
-	 * processes. Try to unmap it here.
-	 */
-	if (page_mapped(page) && mapping) {
-		switch (*swap_ret = try_to_unmap(page, lazyfree ?
-			(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
-			(ttu_flags | TTU_BATCH_FLUSH))) {
-		case SWAP_FAIL:
-			return PG_ACTIVATE_LOCKED;
-		case SWAP_AGAIN:
-			return PG_KEEP_LOCKED;
-		case SWAP_MLOCK:
-			return PG_MLOCKED;
-		case SWAP_LZFREE:
-			goto lazyfree;
-		case SWAP_SUCCESS:
-			; /* try to free the page below */
+		/* check outcome of cache addition */
+		if (!ret[i]) {
+			ret[i] = PG_ACTIVATE_LOCKED;
+			continue;
 		}
-	}
-
-	if (PageDirty(page)) {
 		/*
-		 * Only kswapd can writeback filesystem pages to
-		 * avoid risk of stack overflow but only writeback
-		 * if many dirty pages have been encountered.
+		 * The page is mapped into the page tables of one or more
+		 * processes. Try to unmap it here.
 		 */
-		if (page_is_file_cache(page) &&
-				(!current_is_kswapd() ||
-				 !test_bit(ZONE_DIRTY, &zone->flags))) {
+		if (page_mapped(page) && mapping) {
+			switch (swap_ret[i] = try_to_unmap(page, lazyfree ?
+				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
+				(ttu_flags | TTU_BATCH_FLUSH))) {
+			case SWAP_FAIL:
+				ret[i] = PG_ACTIVATE_LOCKED;
+				continue;
+			case SWAP_AGAIN:
+				ret[i] = PG_KEEP_LOCKED;
+				continue;
+			case SWAP_MLOCK:
+				ret[i] = PG_MLOCKED;
+				continue;
+			case SWAP_LZFREE:
+				goto lazyfree;
+			case SWAP_SUCCESS:
+				; /* try to free the page below */
+			}
+		}
+
+		if (PageDirty(page)) {
 			/*
-			 * Immediately reclaim when written back.
-			 * Similar in principal to deactivate_page()
-			 * except we already have the page isolated
-			 * and know it's dirty
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow but only writeback
+			 * if many dirty pages have been encountered.
 			 */
-			inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
-			SetPageReclaim(page);
-
-			return PG_KEEP_LOCKED;
-		}
+			if (page_is_file_cache(page) &&
+					(!current_is_kswapd() ||
+					 !test_bit(ZONE_DIRTY, &zone->flags))) {
+				/*
+				 * Immediately reclaim when written back.
+				 * Similar in principal to deactivate_page()
+				 * except we already have the page isolated
+				 * and know it's dirty
+				 */
+				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
+				SetPageReclaim(page);
 
-		if (references == PAGEREF_RECLAIM_CLEAN)
-			return PG_KEEP_LOCKED;
-		if (!may_enter_fs)
-			return PG_KEEP_LOCKED;
-		if (!sc->may_writepage)
-			return PG_KEEP_LOCKED;
+				ret[i] = PG_KEEP_LOCKED;
+				continue;
+			}
 
-		/*
-		 * Page is dirty. Flush the TLB if a writable entry
-		 * potentially exists to avoid CPU writes after IO
-		 * starts and then write it out here.
-		 */
-		try_to_unmap_flush_dirty();
-		switch (pageout(page, mapping, sc)) {
-		case PAGE_KEEP:
-			return PG_KEEP_LOCKED;
-		case PAGE_ACTIVATE:
-			return PG_ACTIVATE_LOCKED;
-		case PAGE_SUCCESS:
-			if (PageWriteback(page))
-				return PG_KEEP;
-			if (PageDirty(page))
-				return PG_KEEP;
+			if (references == PAGEREF_RECLAIM_CLEAN) {
+				ret[i] = PG_KEEP_LOCKED;
+				continue;
+			}
+			if (!may_enter_fs) {
+				ret[i] = PG_KEEP_LOCKED;
+				continue;
+			}
+			if (!sc->may_writepage) {
+				ret[i] = PG_KEEP_LOCKED;
+				continue;
+			}
 
 			/*
-			 * A synchronous write - probably a ramdisk.  Go
-			 * ahead and try to reclaim the page.
+			 * Page is dirty. Flush the TLB if a writable entry
+			 * potentially exists to avoid CPU writes after IO
+			 * starts and then write it out here.
 			 */
-			if (!trylock_page(page))
-				return PG_KEEP;
-			if (PageDirty(page) || PageWriteback(page))
-				return PG_KEEP_LOCKED;
-			mapping = page_mapping(page);
-		case PAGE_CLEAN:
-			; /* try to free the page below */
-		}
-	}
+			try_to_unmap_flush_dirty();
+			switch (pageout(page, mapping, sc)) {
+			case PAGE_KEEP:
+				ret[i] = PG_KEEP_LOCKED;
+				continue;
+			case PAGE_ACTIVATE:
+				ret[i] = PG_ACTIVATE_LOCKED;
+				continue;
+			case PAGE_SUCCESS:
+				if (PageWriteback(page)) {
+					ret[i] = PG_KEEP;
+					continue;
+				}
+				if (PageDirty(page)) {
+					ret[i] = PG_KEEP;
+					continue;
+				}
 
-	/*
-	 * If the page has buffers, try to free the buffer mappings
-	 * associated with this page. If we succeed we try to free
-	 * the page as well.
-	 *
-	 * We do this even if the page is PageDirty().
-	 * try_to_release_page() does not perform I/O, but it is
-	 * possible for a page to have PageDirty set, but it is actually
-	 * clean (all its buffers are clean).  This happens if the
-	 * buffers were written out directly, with submit_bh(). ext3
-	 * will do this, as well as the blockdev mapping.
-	 * try_to_release_page() will discover that cleanness and will
-	 * drop the buffers and mark the page clean - it can be freed.
-	 *
-	 * Rarely, pages can have buffers and no ->mapping.  These are
-	 * the pages which were not successfully invalidated in
-	 * truncate_complete_page().  We try to drop those buffers here
-	 * and if that worked, and the page is no longer mapped into
-	 * process address space (page_count == 1) it can be freed.
-	 * Otherwise, leave the page on the LRU so it is swappable.
-	 */
-	if (page_has_private(page)) {
-		if (!try_to_release_page(page, sc->gfp_mask))
-			return PG_ACTIVATE_LOCKED;
-		if (!mapping && page_count(page) == 1) {
-			unlock_page(page);
-			if (put_page_testzero(page))
-				return PG_FREE;
-			else {
 				/*
-				 * rare race with speculative reference.
-				 * the speculative reference will free
-				 * this page shortly, so we may
-				 * increment nr_reclaimed (and
-				 * leave it off the LRU).
+				 * A synchronous write - probably a ramdisk.  Go
+				 * ahead and try to reclaim the page.
 				 */
-				return PG_SPECULATIVE_REF;
+				if (!trylock_page(page)) {
+					ret[i] = PG_KEEP;
+					continue;
+				}
+				if (PageDirty(page) || PageWriteback(page)) {
+					ret[i] = PG_KEEP_LOCKED;
+					continue;
+				}
+				mapping = page_mapping(page);
+			case PAGE_CLEAN:
+				; /* try to free the page below */
 			}
 		}
-	}
 
+		/*
+		 * If the page has buffers, try to free the buffer mappings
+		 * associated with this page. If we succeed we try to free
+		 * the page as well.
+		 *
+		 * We do this even if the page is PageDirty().
+		 * try_to_release_page() does not perform I/O, but it is
+		 * possible for a page to have PageDirty set, but it is actually
+		 * clean (all its buffers are clean).  This happens if the
+		 * buffers were written out directly, with submit_bh(). ext3
+		 * will do this, as well as the blockdev mapping.
+		 * try_to_release_page() will discover that cleanness and will
+		 * drop the buffers and mark the page clean - it can be freed.
+		 *
+		 * Rarely, pages can have buffers and no ->mapping.  These are
+		 * the pages which were not successfully invalidated in
+		 * truncate_complete_page().  We try to drop those buffers here
+		 * and if that worked, and the page is no longer mapped into
+		 * process address space (page_count == 1) it can be freed.
+		 * Otherwise, leave the page on the LRU so it is swappable.
+		 */
+		if (page_has_private(page)) {
+			if (!try_to_release_page(page, sc->gfp_mask)) {
+				ret[i] = PG_ACTIVATE_LOCKED;
+				continue;
+			}
+			if (!mapping && page_count(page) == 1) {
+				unlock_page(page);
+				if (put_page_testzero(page)) {
+					ret[i] = PG_FREE;
+					continue;
+				} else {
+					/*
+					 * rare race with speculative reference.
+					 * the speculative reference will free
+					 * this page shortly, so we may
+					 * increment nr_reclaimed (and
+					 * leave it off the LRU).
+					 */
+					ret[i] = PG_SPECULATIVE_REF;
+					continue;
+				}
+			}
+		}
 lazyfree:
-	if (!mapping || !__remove_mapping(mapping, page, true))
-		return PG_KEEP_LOCKED;
+		if (!mapping || !__remove_mapping(mapping, page, true)) {
+			ret[i] = PG_KEEP_LOCKED;
+			continue;
+		}
+
+		/*
+		 * At this point, we have no other references and there is
+		 * no way to pick any more up (removed from LRU, removed
+		 * from pagecache). Can use non-atomic bitops now (and
+		 * we obviously don't have to worry about waking up a process
+		 * waiting on the page lock, because there are no references.
+		 */
+		__ClearPageLocked(page);
+		ret[i] = PG_FREE;
+	}
+}
+
+static enum pg_result handle_pgout(struct list_head *page_list,
+	struct zone *zone,
+	struct scan_control *sc,
+	enum ttu_flags ttu_flags,
+	enum page_references references,
+	bool may_enter_fs,
+	bool lazyfree,
+	int  *swap_ret,
+	struct page *page)
+{
+	struct page *pages[1];
+	int ret[1];
+	int sret[1];
+
+	pages[0] = page;
 
 	/*
-	 * At this point, we have no other references and there is
-	 * no way to pick any more up (removed from LRU, removed
-	 * from pagecache). Can use non-atomic bitops now (and
-	 * we obviously don't have to worry about waking up a process
-	 * waiting on the page lock, because there are no references.
+	 * page is in swap cache or page cache, indicate that
+	 * by setting ret[0] to 1
 	 */
-	__ClearPageLocked(page);
-	return PG_FREE;
+	ret[0] = 1;
+	handle_pgout_batch(page_list, zone, sc, ttu_flags, references,
+		may_enter_fs, lazyfree, pages, sret, ret, 1);
+	*swap_ret = sret[0];
+	return ret[0];
 }
 
 static void pg_finish(struct page *page,
@@ -1095,14 +1159,13 @@ static unsigned long shrink_anon_page_list(struct list_head *page_list,
 	bool clean)
 {
 	unsigned long nr_reclaimed = 0;
-	enum pg_result pg_dispose;
 	swp_entry_t swp_entries[SWAP_BATCH];
 	struct page *pages[SWAP_BATCH];
 	int m, i, k, ret[SWAP_BATCH];
 	struct page *page;
 
 	while (n > 0) {
-		int swap_ret = SWAP_SUCCESS;
+		int swap_ret[SWAP_BATCH];
 
 		m = get_swap_pages(n, swp_entries);
 		if (!m)
@@ -1127,28 +1190,19 @@ static unsigned long shrink_anon_page_list(struct list_head *page_list,
 		*/
 		add_to_swap_batch(pages, page_list, swp_entries, ret, m);
 
-		for (i = 0; i < m; ++i) {
-			page = pages[i];
-
-			if (!ret[i]) {
-				pg_finish(page, PG_ACTIVATE_LOCKED, swap_ret,
-						&nr_reclaimed, pgactivate,
-						ret_pages, free_pages);
-				continue;
-			}
-
-			if (clean)
-				pg_dispose = handle_pgout(page_list, zone, sc,
-						ttu_flags, PAGEREF_RECLAIM_CLEAN,
-						true, true, &swap_ret, page);
-			else
-				pg_dispose = handle_pgout(page_list, zone, sc,
-						ttu_flags, PAGEREF_RECLAIM,
-						true, true, &swap_ret, page);
-
-			pg_finish(page, pg_dispose, swap_ret, &nr_reclaimed,
-					pgactivate, ret_pages, free_pages);
-		}
+		if (clean)
+			handle_pgout_batch(page_list, zone, sc, ttu_flags,
+					PAGEREF_RECLAIM_CLEAN, true, true,
+					pages, swap_ret, ret, m);
+		else
+			handle_pgout_batch(page_list, zone, sc, ttu_flags,
+					PAGEREF_RECLAIM, true, true,
+					pages, swap_ret, ret, m);
+
+		for (i = 0; i < m; ++i)
+			pg_finish(pages[i], ret[i], swap_ret[i],
+					&nr_reclaimed, pgactivate,
+					ret_pages, free_pages);
 	}
 	return nr_reclaimed;
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 7/7] mm: Batch unmapping of pages that are in swap cache
       [not found] <cover.1462306228.git.tim.c.chen@linux.intel.com>
                   ` (6 preceding siblings ...)
  2016-05-03 21:03 ` [PATCH 6/7] mm: Cleanup - Reorganize code to group handling of page Tim Chen
@ 2016-05-03 21:03 ` Tim Chen
  7 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2016-05-03 21:03 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Minchan Kim, Hugh Dickins
  Cc: Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel

We create a new function, __remove_swap_mapping_batch, that allows
all pages under the same swap partition to be removed from the swap
cache's mapping in a single acquisition of the mapping's tree lock.
This reduces contention on the lock when multiple threads are
reclaiming memory by swapping to the same swap partition.

The handle_pgout_batch function is updated so that all pages under
the same swap partition are unmapped together once they have been
paged out.
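
A minimal sketch of the locking pattern, assuming all of pages[]
belong to the same swap-backed mapping (reference-count freezing,
dirty checks and memcg accounting are left out here; see the patch
below for the real handling):

static void remove_swap_batch_sketch(struct address_space *mapping,
				     struct page *pages[],
				     swp_entry_t swap[], int nr)
{
	unsigned long flags;
	int i;

	/* one tree_lock round trip covers the whole batch */
	spin_lock_irqsave(&mapping->tree_lock, flags);
	for (i = 0; i < nr; i++) {
		swap[i].val = page_private(pages[i]);
		__delete_from_swap_cache(pages[i]);
	}
	spin_unlock_irqrestore(&mapping->tree_lock, flags);

	/* the swap slots are released outside the lock */
	for (i = 0; i < nr; i++)
		swapcache_free(swap[i]);
}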

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/vmscan.c | 426 ++++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 286 insertions(+), 140 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9fc04e1..5e4b8ce 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -690,6 +690,103 @@ cannot_free:
 	return 0;
 }
 
+/* use this only for swap mapped pages */
+static void __remove_swap_mapping_batch(struct page *pages[],
+			    bool reclaimed, short ret[], int nr)
+{
+	unsigned long flags;
+	struct page *page;
+	swp_entry_t swap[SWAP_BATCH];
+	struct address_space *mapping;
+
+	int i, batch_size;
+
+	if (nr <= 0)
+		return;
+
+	while (nr) {
+		mapping = page_mapping(pages[0]);
+		BUG_ON(!mapping);
+
+		batch_size = min(nr, SWAP_BATCH);
+
+		spin_lock_irqsave(&mapping->tree_lock, flags);
+		for (i = 0; i < batch_size; ++i) {
+			page = pages[i];
+
+			BUG_ON(!PageLocked(page));
+			BUG_ON(!PageSwapCache(page));
+			BUG_ON(mapping != page_mapping(page));
+
+			/* stop batching if mapping changes */
+			if (mapping != page_mapping(page)) {
+				batch_size = i;
+				break;
+			}
+			/*
+			 * The non racy check for a busy page.
+			 *
+			 * Must be careful with the order of the tests. When someone has
+			 * a ref to the page, it may be possible that they dirty it then
+			 * drop the reference. So if PageDirty is tested before page_count
+			 * here, then the following race may occur:
+			 *
+			 * get_user_pages(&page);
+			 * [user mapping goes away]
+			 * write_to(page);
+			 *				!PageDirty(page)    [good]
+			 * SetPageDirty(page);
+			 * put_page(page);
+			 *				!page_count(page)   [good, discard it]
+			 *
+			 * [oops, our write_to data is lost]
+			 *
+			 * Reversing the order of the tests ensures such a situation cannot
+			 * escape unnoticed. The smp_rmb is needed to ensure the page->flags
+			 * load is not satisfied before that of page->_count.
+			 *
+			 * Note that if SetPageDirty is always performed via set_page_dirty,
+			 * and thus under tree_lock, then this ordering is not required.
+			 */
+			if (!page_ref_freeze(page, 2))
+				goto cannot_free;
+			/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
+			if (unlikely(PageDirty(page))) {
+				page_ref_unfreeze(page, 2);
+				goto cannot_free;
+			}
+
+			swap[i].val = page_private(page);
+			__delete_from_swap_cache(page);
+
+			ret[i] = 1;
+			continue;
+
+cannot_free:
+			ret[i] = 0;
+		}
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+
+		/* need to keep irq off for mem_cgroup accounting, don't restore flags yet  */
+		local_irq_disable();
+		for (i = 0; i < batch_size; ++i) {
+			if (ret[i]) {
+				page = pages[i];
+				mem_cgroup_swapout(page, swap[i]);
+			}
+		}
+		local_irq_enable();
+
+		for (i = 0; i < batch_size; ++i) {
+			if (ret[i])
+				swapcache_free(swap[i]);
+		}
+		/* advance to next batch */
+		pages += batch_size;
+		ret += batch_size;
+		nr -= batch_size;
+	}
+}
 /*
  * Attempt to detach a locked page from its ->mapping.  If it is dirty or if
  * someone else has a ref on the page, abort and return 0.  If it was
@@ -897,177 +994,226 @@ static void handle_pgout_batch(struct list_head *page_list,
 	int nr)
 {
 	struct address_space *mapping;
+	struct page *umap_pages[SWAP_BATCH];
 	struct page *page;
-	int i;
-
-	for (i = 0; i < nr; ++i) {
-		page = pages[i];
-		mapping =  page_mapping(page);
+	int i, j, batch_size;
+	short umap_ret[SWAP_BATCH], idx[SWAP_BATCH];
+
+	while (nr) {
+		j = 0;
+		batch_size = min(nr, SWAP_BATCH);
+		mapping = NULL;
+
+		for (i = 0; i < batch_size; ++i) {
+			page = pages[i];
+
+			if (mapping) {
+				if (mapping != page_mapping(page)) {
+					/* mapping change, stop batch here */
+					batch_size = i;
+					break;
+				}
+			} else
+				mapping =  page_mapping(page);
 
-		/* check outcome of cache addition */
-		if (!ret[i]) {
-			ret[i] = PG_ACTIVATE_LOCKED;
-			continue;
-		}
-		/*
-		 * The page is mapped into the page tables of one or more
-		 * processes. Try to unmap it here.
-		 */
-		if (page_mapped(page) && mapping) {
-			switch (swap_ret[i] = try_to_unmap(page, lazyfree ?
-				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
-				(ttu_flags | TTU_BATCH_FLUSH))) {
-			case SWAP_FAIL:
+			/* check outcome of cache addition */
+			if (!ret[i]) {
 				ret[i] = PG_ACTIVATE_LOCKED;
 				continue;
-			case SWAP_AGAIN:
-				ret[i] = PG_KEEP_LOCKED;
-				continue;
-			case SWAP_MLOCK:
-				ret[i] = PG_MLOCKED;
-				continue;
-			case SWAP_LZFREE:
-				goto lazyfree;
-			case SWAP_SUCCESS:
-				; /* try to free the page below */
 			}
-		}
-
-		if (PageDirty(page)) {
 			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but only writeback
-			 * if many dirty pages have been encountered.
+			 * The page is mapped into the page tables of one or more
+			 * processes. Try to unmap it here.
 			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() ||
-					 !test_bit(ZONE_DIRTY, &zone->flags))) {
+			if (page_mapped(page) && mapping) {
+				switch (swap_ret[i] = try_to_unmap(page, lazyfree ?
+					(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
+					(ttu_flags | TTU_BATCH_FLUSH))) {
+				case SWAP_FAIL:
+					ret[i] = PG_ACTIVATE_LOCKED;
+					continue;
+				case SWAP_AGAIN:
+					ret[i] = PG_KEEP_LOCKED;
+					continue;
+				case SWAP_MLOCK:
+					ret[i] = PG_MLOCKED;
+					continue;
+				case SWAP_LZFREE:
+					goto lazyfree;
+				case SWAP_SUCCESS:
+					; /* try to free the page below */
+				}
+			}
+
+			if (PageDirty(page)) {
 				/*
-				 * Immediately reclaim when written back.
-				 * Similar in principal to deactivate_page()
-				 * except we already have the page isolated
-				 * and know it's dirty
+				 * Only kswapd can writeback filesystem pages to
+				 * avoid risk of stack overflow but only writeback
+				 * if many dirty pages have been encountered.
 				 */
-				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
-				SetPageReclaim(page);
-
-				ret[i] = PG_KEEP_LOCKED;
-				continue;
-			}
+				if (page_is_file_cache(page) &&
+						(!current_is_kswapd() ||
+						 !test_bit(ZONE_DIRTY, &zone->flags))) {
+					/*
+					 * Immediately reclaim when written back.
+					 * Similar in principal to deactivate_page()
+					 * except we already have the page isolated
+					 * and know it's dirty
+					 */
+					inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
+					SetPageReclaim(page);
 
-			if (references == PAGEREF_RECLAIM_CLEAN) {
-				ret[i] = PG_KEEP_LOCKED;
-				continue;
-			}
-			if (!may_enter_fs) {
-				ret[i] = PG_KEEP_LOCKED;
-				continue;
-			}
-			if (!sc->may_writepage) {
-				ret[i] = PG_KEEP_LOCKED;
-				continue;
-			}
+					ret[i] = PG_KEEP_LOCKED;
+					continue;
+				}
 
-			/*
-			 * Page is dirty. Flush the TLB if a writable entry
-			 * potentially exists to avoid CPU writes after IO
-			 * starts and then write it out here.
-			 */
-			try_to_unmap_flush_dirty();
-			switch (pageout(page, mapping, sc)) {
-			case PAGE_KEEP:
-				ret[i] = PG_KEEP_LOCKED;
-				continue;
-			case PAGE_ACTIVATE:
-				ret[i] = PG_ACTIVATE_LOCKED;
-				continue;
-			case PAGE_SUCCESS:
-				if (PageWriteback(page)) {
-					ret[i] = PG_KEEP;
+				if (references == PAGEREF_RECLAIM_CLEAN) {
+					ret[i] = PG_KEEP_LOCKED;
+					continue;
+				}
+				if (!may_enter_fs) {
+					ret[i] = PG_KEEP_LOCKED;
 					continue;
 				}
-				if (PageDirty(page)) {
-					ret[i] = PG_KEEP;
+				if (!sc->may_writepage) {
+					ret[i] = PG_KEEP_LOCKED;
 					continue;
 				}
 
 				/*
-				 * A synchronous write - probably a ramdisk.  Go
-				 * ahead and try to reclaim the page.
+				 * Page is dirty. Flush the TLB if a writable entry
+				 * potentially exists to avoid CPU writes after IO
+				 * starts and then write it out here.
 				 */
-				if (!trylock_page(page)) {
-					ret[i] = PG_KEEP;
-					continue;
-				}
-				if (PageDirty(page) || PageWriteback(page)) {
+				try_to_unmap_flush_dirty();
+				switch (pageout(page, mapping, sc)) {
+				case PAGE_KEEP:
 					ret[i] = PG_KEEP_LOCKED;
 					continue;
+				case PAGE_ACTIVATE:
+					ret[i] = PG_ACTIVATE_LOCKED;
+					continue;
+				case PAGE_SUCCESS:
+					if (PageWriteback(page)) {
+						ret[i] = PG_KEEP;
+						continue;
+					}
+					if (PageDirty(page)) {
+						ret[i] = PG_KEEP;
+						continue;
+					}
+
+					/*
+					 * A synchronous write - probably a ramdisk.  Go
+					 * ahead and try to reclaim the page.
+					 */
+					if (!trylock_page(page)) {
+						ret[i] = PG_KEEP;
+						continue;
+					}
+					if (PageDirty(page) || PageWriteback(page)) {
+						ret[i] = PG_KEEP_LOCKED;
+						continue;
+					}
+					mapping = page_mapping(page);
+				case PAGE_CLEAN:
+					; /* try to free the page below */
 				}
-				mapping = page_mapping(page);
-			case PAGE_CLEAN:
-				; /* try to free the page below */
 			}
-		}
 
-		/*
-		 * If the page has buffers, try to free the buffer mappings
-		 * associated with this page. If we succeed we try to free
-		 * the page as well.
-		 *
-		 * We do this even if the page is PageDirty().
-		 * try_to_release_page() does not perform I/O, but it is
-		 * possible for a page to have PageDirty set, but it is actually
-		 * clean (all its buffers are clean).  This happens if the
-		 * buffers were written out directly, with submit_bh(). ext3
-		 * will do this, as well as the blockdev mapping.
-		 * try_to_release_page() will discover that cleanness and will
-		 * drop the buffers and mark the page clean - it can be freed.
-		 *
-		 * Rarely, pages can have buffers and no ->mapping.  These are
-		 * the pages which were not successfully invalidated in
-		 * truncate_complete_page().  We try to drop those buffers here
-		 * and if that worked, and the page is no longer mapped into
-		 * process address space (page_count == 1) it can be freed.
-		 * Otherwise, leave the page on the LRU so it is swappable.
-		 */
-		if (page_has_private(page)) {
-			if (!try_to_release_page(page, sc->gfp_mask)) {
-				ret[i] = PG_ACTIVATE_LOCKED;
+			/*
+			 * If the page has buffers, try to free the buffer mappings
+			 * associated with this page. If we succeed we try to free
+			 * the page as well.
+			 *
+			 * We do this even if the page is PageDirty().
+			 * try_to_release_page() does not perform I/O, but it is
+			 * possible for a page to have PageDirty set, but it is actually
+			 * clean (all its buffers are clean).  This happens if the
+			 * buffers were written out directly, with submit_bh(). ext3
+			 * will do this, as well as the blockdev mapping.
+			 * try_to_release_page() will discover that cleanness and will
+			 * drop the buffers and mark the page clean - it can be freed.
+			 *
+			 * Rarely, pages can have buffers and no ->mapping.  These are
+			 * the pages which were not successfully invalidated in
+			 * truncate_complete_page().  We try to drop those buffers here
+			 * and if that worked, and the page is no longer mapped into
+			 * process address space (page_count == 1) it can be freed.
+			 * Otherwise, leave the page on the LRU so it is swappable.
+			 */
+			if (page_has_private(page)) {
+				if (!try_to_release_page(page, sc->gfp_mask)) {
+					ret[i] = PG_ACTIVATE_LOCKED;
+					continue;
+				}
+				if (!mapping && page_count(page) == 1) {
+					unlock_page(page);
+					if (put_page_testzero(page)) {
+						ret[i] = PG_FREE;
+						continue;
+					} else {
+						/*
+						 * rare race with speculative reference.
+						 * the speculative reference will free
+						 * this page shortly, so we may
+						 * increment nr_reclaimed (and
+						 * leave it off the LRU).
+						 */
+						ret[i] = PG_SPECULATIVE_REF;
+						continue;
+					}
+				}
+			}
+lazyfree:
+			if (!mapping) {
+				ret[i] = PG_KEEP_LOCKED;
 				continue;
 			}
-			if (!mapping && page_count(page) == 1) {
-				unlock_page(page);
-				if (put_page_testzero(page)) {
-					ret[i] = PG_FREE;
-					continue;
-				} else {
-					/*
-					 * rare race with speculative reference.
-					 * the speculative reference will free
-					 * this page shortly, so we may
-					 * increment nr_reclaimed (and
-					 * leave it off the LRU).
-					 */
-					ret[i] = PG_SPECULATIVE_REF;
+			if (!PageSwapCache(page)) {
+				if (!__remove_mapping(mapping, page, true)) {
+					ret[i] = PG_KEEP_LOCKED;
 					continue;
 				}
+				__ClearPageLocked(page);
+				ret[i] = PG_FREE;
+				continue;
 			}
+
+			/* note pages to be unmapped */
+			ret[i] = PG_UNKNOWN;
+			idx[j] = i;
+			umap_pages[j] = page;
+			++j;
 		}
-lazyfree:
-		if (!mapping || !__remove_mapping(mapping, page, true)) {
-			ret[i] = PG_KEEP_LOCKED;
-			continue;
+
+		/* handle remaining pages that need to be unmapped */
+		__remove_swap_mapping_batch(umap_pages, true, umap_ret, j);
+
+		for (i = 0; i < j; ++i) {
+			if (!umap_ret[i]) {
+				/* unmap failed */
+				ret[idx[i]] = PG_KEEP_LOCKED;
+				continue;
+			}
+
+			page = umap_pages[i];
+			/*
+			 * At this point, we have no other references and there is
+			 * no way to pick any more up (removed from LRU, removed
+			 * from pagecache). Can use non-atomic bitops now (and
+			 * we obviously don't have to worry about waking up a process
+			 * waiting on the page lock, because there are no references.
+			 */
+			__ClearPageLocked(page);
+			ret[idx[i]] = PG_FREE;
 		}
 
-		/*
-		 * At this point, we have no other references and there is
-		 * no way to pick any more up (removed from LRU, removed
-		 * from pagecache). Can use non-atomic bitops now (and
-		 * we obviously don't have to worry about waking up a process
-		 * waiting on the page lock, because there are no references.
-		 */
-		__ClearPageLocked(page);
-		ret[i] = PG_FREE;
+		/* advance pointers to next batch and remaining page count */
+		nr = nr - batch_size;
+		pages += batch_size;
+		ret += batch_size;
+		swap_ret += batch_size;
 	}
 }
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] mm: Improve swap path scalability with batched operations
  2016-05-03 21:00 ` [PATCH 0/7] mm: Improve swap path scalability with batched operations Tim Chen
@ 2016-05-04 12:45   ` Michal Hocko
  2016-05-04 17:13     ` Tim Chen
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2016-05-04 12:45 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Vladimir Davydov, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Kirill A.Shutemov, Andi Kleen, Aaron Lu,
	Huang Ying, linux-mm, linux-kernel

On Tue 03-05-16 14:00:39, Tim Chen wrote:
[...]
>  include/linux/swap.h |  29 ++-
>  mm/swap_state.c      | 253 +++++++++++++-----
>  mm/swapfile.c        | 215 +++++++++++++--
>  mm/vmscan.c          | 725 ++++++++++++++++++++++++++++++++++++++-------------
>  4 files changed, 945 insertions(+), 277 deletions(-)

This is a rather large change for a normally rare path. We have been
trying to preserve anonymous memory as much as possible and rather
push the page cache out. In fact swappiness is ignored most of the
time for the vast majority of workloads.

So this would help anonymous-mostly workloads, and I am really
wondering whether this is something worth bothering with before a
further and deeper rethinking of our current reclaim strategy. I fully
realize that swap out sucks and that the new storage technologies
might change the way we think about anonymous memory being so
"special" wrt. disk based caches, but I would like to see a stronger
use case than "we have been playing with some artificial use case and
it scales better".
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] mm: Improve swap path scalability with batched operations
  2016-05-04 12:45   ` Michal Hocko
@ 2016-05-04 17:13     ` Tim Chen
  2016-05-04 19:49       ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tim Chen @ 2016-05-04 17:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vladimir Davydov, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Kirill A.Shutemov, Andi Kleen, Aaron Lu,
	Huang Ying, linux-mm, linux-kernel

On Wed, 2016-05-04 at 14:45 +0200, Michal Hocko wrote:
> On Tue 03-05-16 14:00:39, Tim Chen wrote:
> [...]
> > 
> >  include/linux/swap.h |  29 ++-
> >  mm/swap_state.c      | 253 +++++++++++++-----
> >  mm/swapfile.c        | 215 +++++++++++++--
> >  mm/vmscan.c          | 725 ++++++++++++++++++++++++++++++++++++++-
> > ------------
> >  4 files changed, 945 insertions(+), 277 deletions(-)
> This is rather large change for a normally rare path. We have been
> trying to preserve the anonymous memory as much as possible and
> rather
> push the page cache out. In fact swappiness is ignored most of the
> time for the vast majority of workloads.
> 
> So this would help anonymous mostly workloads and I am really
> wondering
> whether this is something worth bothering without further and deeper
> rethinking of our current reclaim strategy. I fully realize that the
> swap out sucks and that the new storage technologies might change the
> way how we think about anonymous memory being so "special" wrt. disk
> based caches but I would like to see a stronger use case than "we
> have
> been playing with some artificial use case and it scales better"

With non-volatile RAM based block devices, a swap device can be very
fast, approaching RAM speed, and can potentially be used as secondary
memory. Just configuring such NVRAM as swap is an easy way for apps to
make use of it without doing any heavy lifting to change the apps.
But the swap path is so un-scalable today that such a use case is
unfeasible, even more so on multi-threaded server machines.

I understand that the patch set is a little large. Any better ideas
for achieving similar ends will be appreciated.  I put out these
patches in the hope that they will spur solutions to improve swap.

Perhaps the first two patches, which break shrink_page_list into
smaller components, can be considered first, as a first step to make
any changes to the reclaim code easier.

Thanks.

Tim

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] mm: Improve swap path scalability with batched operations
  2016-05-04 17:13     ` Tim Chen
@ 2016-05-04 19:49       ` Michal Hocko
  2016-05-04 21:05         ` Andi Kleen
  2016-05-04 21:25         ` Johannes Weiner
  0 siblings, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2016-05-04 19:49 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Vladimir Davydov, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Kirill A.Shutemov, Andi Kleen, Aaron Lu,
	Huang Ying, linux-mm, linux-kernel

On Wed 04-05-16 10:13:06, Tim Chen wrote:
> On Wed, 2016-05-04 at 14:45 +0200, Michal Hocko wrote:
> > On Tue 03-05-16 14:00:39, Tim Chen wrote:
> > [...]
> > > 
> > >  include/linux/swap.h |  29 ++-
> > >  mm/swap_state.c      | 253 +++++++++++++-----
> > >  mm/swapfile.c        | 215 +++++++++++++--
> > >  mm/vmscan.c          | 725 ++++++++++++++++++++++++++++++++++++++-
> > > ------------
> > >  4 files changed, 945 insertions(+), 277 deletions(-)
> > This is rather large change for a normally rare path. We have been
> > trying to preserve the anonymous memory as much as possible and
> > rather
> > push the page cache out. In fact swappiness is ignored most of the
> > time for the vast majority of workloads.
> > 
> > So this would help anonymous mostly workloads and I am really
> > wondering
> > whether this is something worth bothering without further and deeper
> > rethinking of our current reclaim strategy. I fully realize that the
> > swap out sucks and that the new storage technologies might change the
> > way how we think about anonymous memory being so "special" wrt. disk
> > based caches but I would like to see a stronger use case than "we
> > have
> > been playing with some artificial use case and it scales better"
> 
> With non-volatile ram based block devices, swap device could be very
> fast, approaching RAM speed and can potentially be used as a secondary
> memory. Just configuring these NVRAM as swap will be
> an easy way for apps to make use of them without doing any heavy
> lifting to change the apps.  But the swap path is so 
> un-scalable today that such use case
> is unfeasible, even more so for multi-threaded server machines.

In order for this to work, other quite intrusive changes to the
current reclaim decisions would have to be made, though. This is what
I tried to say. Look at get_scan_count() and how many steps we take to
ignore swappiness or prefer the page cache. Even when we make swapout
scale, it won't help much if we do not swap out that often. That's why
I claim that we really should think more long term and maybe
reconsider these decisions, which were based on rotating rust as the
swap device.
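
To illustrate the kind of policy I mean (a hypothetical helper, not
the actual get_scan_count() code): reclaim keeps picking file pages
unless swapping is both possible and explicitly preferred, so a faster
swapout path alone does not change how often we swap.

enum scan_balance_sketch { SCAN_FILE_ONLY, SCAN_BALANCED };

static enum scan_balance_sketch pick_scan_balance(bool may_swap,
						  long free_swap_pages,
						  int swappiness,
						  bool plenty_of_file_cache)
{
	if (!may_swap || free_swap_pages <= 0)
		return SCAN_FILE_ONLY;	/* swapping is not possible */
	if (!swappiness)
		return SCAN_FILE_ONLY;	/* policy says avoid swap */
	if (plenty_of_file_cache)
		return SCAN_FILE_ONLY;	/* cheaper cache to reclaim first */
	return SCAN_BALANCED;		/* only now is anon scanned too */
}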

> I understand that the patch set is a little large. Any better
> ideas for achieving similar ends will be appreciated.  I put
> out these patches in the hope that it will spur solutions
> to improve swap.
> 
> Perhaps the first two patches to make shrink_page_list into
> smaller components can be considered first, as a first step 
> to make any changes to the reclaim code easier.

I didn't get to review those yet and probably will not get to them
shortly (sorry about that). shrink_page_list is surely one giant
function that is calling for a better layout/split out. I wouldn't be
opposed but there are some subtle details lurking there which make
clean ups non-trivial. I will not discourage you from trying to get it
into shape of course.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] mm: Improve swap path scalability with batched operations
  2016-05-04 19:49       ` Michal Hocko
@ 2016-05-04 21:05         ` Andi Kleen
  2016-05-04 21:25         ` Johannes Weiner
  1 sibling, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2016-05-04 21:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Andrew Morton, Vladimir Davydov, Johannes Weiner,
	Minchan Kim, Hugh Dickins, Kirill A.Shutemov, Andi Kleen,
	Aaron Lu, Huang Ying, linux-mm, linux-kernel

> In order this to work other quite intrusive changes to the current
> reclaim decisions would have to be made though. This is what I tried to
> say. Look at get_scan_count() on how we are making many steps to ignore
> swappiness or prefer the page cache. Even when we make swapout scale it
> won't help much if we do not swap out that often. That's why I claim

But if you made swapout scale you would need some equivalent
of Tim's patches for the swap path... So you need them in any case.

> that we really should think more long term and maybe reconsider these
> decisions which were based on the rotating rust for the swap devices.

Sure that makes sense, but why not start with low hanging fruit
in basic performance, like Tim did? Usually that is how Linux
changes work, steady evolution, not revolution.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] mm: Improve swap path scalability with batched operations
  2016-05-04 19:49       ` Michal Hocko
  2016-05-04 21:05         ` Andi Kleen
@ 2016-05-04 21:25         ` Johannes Weiner
  2016-05-05  0:08           ` Minchan Kim
  2016-05-05  7:49           ` Michal Hocko
  1 sibling, 2 replies; 18+ messages in thread
From: Johannes Weiner @ 2016-05-04 21:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Andrew Morton, Vladimir Davydov, Minchan Kim,
	Hugh Dickins, Kirill A.Shutemov, Andi Kleen, Aaron Lu,
	Huang Ying, linux-mm, linux-kernel

On Wed, May 04, 2016 at 09:49:02PM +0200, Michal Hocko wrote:
> On Wed 04-05-16 10:13:06, Tim Chen wrote:
> In order this to work other quite intrusive changes to the current
> reclaim decisions would have to be made though. This is what I tried to
> say. Look at get_scan_count() on how we are making many steps to ignore
> swappiness or prefer the page cache. Even when we make swapout scale it
> won't help much if we do not swap out that often. That's why I claim
> that we really should think more long term and maybe reconsider these
> decisions which were based on the rotating rust for the swap devices.

While I agree that such balancing rework is necessary to make swap
perform optimally, I don't see why this would be a dependency for
making the mechanical swapout paths a lot leaner.

I'm actually working on improving the LRU balancing decisions for fast
random IO swap devices, and hope to have something to submit soon.

> > I understand that the patch set is a little large. Any better
> > ideas for achieving similar ends will be appreciated.  I put
> > out these patches in the hope that it will spur solutions
> > to improve swap.
> > 
> > Perhaps the first two patches to make shrink_page_list into
> > smaller components can be considered first, as a first step 
> > to make any changes to the reclaim code easier.

It makes sense that we need to batch swap allocation and swap cache
operations. Unfortunately, the patches as they stand turn
shrink_page_list() into an unreadable mess. This would need better
refactoring before considering them for upstream merging. The swap
allocation batching should not obfuscate the main sequence of events
that is happening for both file-backed and anonymous pages.

It'd also be great if the remove_mapping() batching could be done
universally for all pages, given that in many cases file pages from
the same inode also cluster together on the LRU.
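
One possible shape of that, as a purely hypothetical sketch
(remove_batch_same_mapping() is an invented stand-in for a generalized
__remove_mapping batch helper): walk the candidates and hand off runs
of pages that share a mapping, so file pages from one inode get the
same single-lock treatment.

static void remove_mapping_runs(struct page *pages[], int nr)
{
	int start = 0, i;

	for (i = 1; i <= nr; i++) {
		/* a run ends when the mapping changes or the list ends */
		if (i == nr ||
		    page_mapping(pages[i]) != page_mapping(pages[start])) {
			remove_batch_same_mapping(pages + start, i - start);
			start = i;
		}
	}
}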

I realize this is fairly vague feedback; I'll try to take a closer
look at the patches. But I do think this work is going in the right
direction and there is plenty of justification for making these paths
more efficient.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] mm: Improve swap path scalability with batched operations
  2016-05-04 21:25         ` Johannes Weiner
@ 2016-05-05  0:08           ` Minchan Kim
  2016-05-05  7:49           ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Minchan Kim @ 2016-05-05  0:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Tim Chen, Andrew Morton, Vladimir Davydov,
	Minchan Kim, Hugh Dickins, Kirill A.Shutemov, Andi Kleen,
	Aaron Lu, Huang Ying, linux-mm, linux-kernel

On Wed, May 04, 2016 at 05:25:06PM -0400, Johannes Weiner wrote:
> On Wed, May 04, 2016 at 09:49:02PM +0200, Michal Hocko wrote:
> > On Wed 04-05-16 10:13:06, Tim Chen wrote:
> > In order this to work other quite intrusive changes to the current
> > reclaim decisions would have to be made though. This is what I tried to
> > say. Look at get_scan_count() on how we are making many steps to ignore
> > swappiness or prefer the page cache. Even when we make swapout scale it
> > won't help much if we do not swap out that often. That's why I claim
> > that we really should think more long term and maybe reconsider these
> > decisions which were based on the rotating rust for the swap devices.
> 
> While I agree that such balancing rework is necessary to make swap
> perform optimally, I don't see why this would be a dependency for
> making the mechanical swapout paths a lot leaner.

I agree.

> 
> I'm actually working on improving the LRU balancing decisions for fast
> random IO swap devices, and hope to have something to submit soon.

Good to hear! I'm really interested in that because we already use
such a fast random IO swap device, zram, although I'm not sure it is
as fast as NVRAM. Anyway, it would be a real benefit for zram.

> 
> > > I understand that the patch set is a little large. Any better
> > > ideas for achieving similar ends will be appreciated.  I put
> > > out these patches in the hope that it will spur solutions
> > > to improve swap.
> > > 
> > > Perhaps the first two patches to make shrink_page_list into
> > > smaller components can be considered first, as a first step 
> > > to make any changes to the reclaim code easier.
> 
> It makes sense that we need to batch swap allocation and swap cache
> operations. Unfortunately, the patches as they stand turn
> shrink_page_list() into an unreadable mess. This would need better
> refactoring before considering them for upstream merging. The swap
> allocation batching should not obfuscate the main sequence of events
> that is happening for both file-backed and anonymous pages.
> 
> It'd also be great if the remove_mapping() batching could be done
> universally for all pages, given that in many cases file pages from
> the same inode also cluster together on the LRU.
> 
> I realize this is fairly vague feedback; I'll try to take a closer
> look at the patches. But I do think this work is going in the right
> direction and there is plenty of justification for making these paths
> more efficient.

+1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] mm: Improve swap path scalability with batched operations
  2016-05-04 21:25         ` Johannes Weiner
  2016-05-05  0:08           ` Minchan Kim
@ 2016-05-05  7:49           ` Michal Hocko
  2016-05-05 15:56             ` Tim Chen
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2016-05-05  7:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tim Chen, Andrew Morton, Vladimir Davydov, Minchan Kim,
	Hugh Dickins, Kirill A.Shutemov, Andi Kleen, Aaron Lu,
	Huang Ying, linux-mm, linux-kernel

On Wed 04-05-16 17:25:06, Johannes Weiner wrote:
> On Wed, May 04, 2016 at 09:49:02PM +0200, Michal Hocko wrote:
> > On Wed 04-05-16 10:13:06, Tim Chen wrote:
> > In order this to work other quite intrusive changes to the current
> > reclaim decisions would have to be made though. This is what I tried to
> > say. Look at get_scan_count() on how we are making many steps to ignore
> > swappiness or prefer the page cache. Even when we make swapout scale it
> > won't help much if we do not swap out that often. That's why I claim
> > that we really should think more long term and maybe reconsider these
> > decisions which were based on the rotating rust for the swap devices.
> 
> While I agree that such balancing rework is necessary to make swap
> perform optimally, I don't see why this would be a dependency for
> making the mechanical swapout paths a lot leaner.

Ohh, I didn't say this would be a dependency. I am all for preparing
the code for better scaling. I just felt that the patch is quite large
with a small benefit at this moment, that the initial description was
not very clear about the motivation, and that the changes seemed to be
shaped by an artificial test case.

> I'm actually working on improving the LRU balancing decisions for fast
> random IO swap devices, and hope to have something to submit soon.

That is really good to hear!

> > > I understand that the patch set is a little large. Any better
> > > ideas for achieving similar ends will be appreciated.  I put
> > > out these patches in the hope that it will spur solutions
> > > to improve swap.
> > > 
> > > Perhaps the first two patches to make shrink_page_list into
> > > smaller components can be considered first, as a first step 
> > > to make any changes to the reclaim code easier.
> 
> It makes sense that we need to batch swap allocation and swap cache
> operations. Unfortunately, the patches as they stand turn
> shrink_page_list() into an unreadable mess. This would need better
> refactoring before considering them for upstream merging. The swap
> allocation batching should not obfuscate the main sequence of events
> that is happening for both file-backed and anonymous pages.

That was my first impression as well but to be fair I only skimmed
through the patch so I might be just biased by the size.

> It'd also be great if the remove_mapping() batching could be done
> universally for all pages, given that in many cases file pages from
> the same inode also cluster together on the LRU.

Agreed!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] mm: Improve swap path scalability with batched operations
  2016-05-05  7:49           ` Michal Hocko
@ 2016-05-05 15:56             ` Tim Chen
  0 siblings, 0 replies; 18+ messages in thread
From: Tim Chen @ 2016-05-05 15:56 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Minchan Kim, Hugh Dickins,
	Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel

On Thu, 2016-05-05 at 09:49 +0200, Michal Hocko wrote:
> On Wed 04-05-16 17:25:06, Johannes Weiner wrote:
> > 
> > 
> 
> > 
> > > 
> > > > 
> > > > I understand that the patch set is a little large. Any better
> > > > ideas for achieving similar ends will be appreciated.  I put
> > > > out these patches in the hope that it will spur solutions
> > > > to improve swap.
> > > > 
> > > > Perhaps the first two patches to make shrink_page_list into
> > > > smaller components can be considered first, as a first step 
> > > > to make any changes to the reclaim code easier.
> > It makes sense that we need to batch swap allocation and swap cache
> > operations. Unfortunately, the patches as they stand turn
> > shrink_page_list() into an unreadable mess. This would need better
> > refactoring before considering them for upstream merging. The swap
> > allocation batching should not obfuscate the main sequence of
> > events
> > that is happening for both file-backed and anonymous pages.
> That was my first impression as well but to be fair I only skimmed
> through the patch so I might be just biased by the size.
> 
> > 
> > It'd also be great if the remove_mapping() batching could be done
> > universally for all pages, given that in many cases file pages from
> > the same inode also cluster together on the LRU.
> 

Agreed.  I haven't tried to do anything for file-mapped pages yet, as
the changes in this patch set are already quite substantial.
But once we have some agreement on the batching of anonymous
pages, the file-backed pages could be grouped similarly.

Tim

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/7] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
  2016-05-03 21:01 ` [PATCH 1/7] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions Tim Chen
@ 2016-05-27 16:40   ` Tim Chen
  2016-05-30  8:48     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tim Chen @ 2016-05-27 16:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel, Andrew Morton, Vladimir Davydov, Johannes Weiner,
	Minchan Kim, Hugh Dickins

On Tue, 2016-05-03 at 14:01 -0700, Tim Chen wrote:
> This patch prepares the code for being able to batch the anonymous
> pages
> to be swapped out.  It reorganizes shrink_page_list function with
> 2 new functions: handle_pgout and pg_finish.
> 
> The paging operation in shrink_page_list is consolidated into
> handle_pgout function.
> 
> After we have scanned a page shrink_page_list and completed any
> paging,
> the final disposition and clean up of the page is conslidated into
> pg_finish.  The designated disposition of the page from page scanning
> in shrink_page_list is marked with one of the designation in
> pg_result.
> 
> This is a clean up patch and there is no change in functionality or
> logic of the code.

Hi Michal,

We've talked about doing the cleanup of the shrink_page_list code
before attempting to do batching on the swap out path, as the set
of patches I've previously posted is quite intrusive.  I wonder
if you have had a chance to look at this patch and have any comments?

Thanks.

Tim

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/7] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
  2016-05-27 16:40   ` Tim Chen
@ 2016-05-30  8:48     ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2016-05-30  8:48 UTC (permalink / raw)
  To: Tim Chen
  Cc: Kirill A.Shutemov, Andi Kleen, Aaron Lu, Huang Ying, linux-mm,
	linux-kernel, Andrew Morton, Vladimir Davydov, Johannes Weiner,
	Minchan Kim, Hugh Dickins

On Fri 27-05-16 09:40:27, Tim Chen wrote:
> On Tue, 2016-05-03 at 14:01 -0700, Tim Chen wrote:
> > This patch prepares the code for being able to batch the anonymous
> > pages
> > to be swapped out.  It reorganizes shrink_page_list function with
> > 2 new functions: handle_pgout and pg_finish.
> > 
> > The paging operation in shrink_page_list is consolidated into
> > handle_pgout function.
> > 
> > After we have scanned a page shrink_page_list and completed any
> > paging,
> > the final disposition and clean up of the page is conslidated into
> > pg_finish.  The designated disposition of the page from page scanning
> > in shrink_page_list is marked with one of the designation in
> > pg_result.
> > 
> > This is a clean up patch and there is no change in functionality or
> > logic of the code.
> 
> Hi Michal,
> 
> We've talked about doing the clean up of shrink_page_list code
> before attempting to do batching on the swap out path as those
> set of patches I've previously posted are quit intrusive.  Wonder
> if you have a chance to look at this patch and has any comments?

I have noticed your
http://lkml.kernel.org/r/1463779979.22178.142.camel@linux.intel.com but
still haven't found time to look at it. Sorry about that. There is
rather a lot on my pile...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-05-30  8:48 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1462306228.git.tim.c.chen@linux.intel.com>
2016-05-03 21:00 ` [PATCH 0/7] mm: Improve swap path scalability with batched operations Tim Chen
2016-05-04 12:45   ` Michal Hocko
2016-05-04 17:13     ` Tim Chen
2016-05-04 19:49       ` Michal Hocko
2016-05-04 21:05         ` Andi Kleen
2016-05-04 21:25         ` Johannes Weiner
2016-05-05  0:08           ` Minchan Kim
2016-05-05  7:49           ` Michal Hocko
2016-05-05 15:56             ` Tim Chen
2016-05-03 21:01 ` [PATCH 1/7] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions Tim Chen
2016-05-27 16:40   ` Tim Chen
2016-05-30  8:48     ` Michal Hocko
2016-05-03 21:01 ` [PATCH 2/7] mm: Group the processing of anonymous pages to be swapped in shrink_page_list Tim Chen
2016-05-03 21:02 ` [PATCH 3/7] mm: Add new functions to allocate swap slots in batches Tim Chen
2016-05-03 21:02 ` [PATCH 4/7] mm: Shrink page list batch allocates swap slots for page swapping Tim Chen
2016-05-03 21:02 ` [PATCH 5/7] mm: Batch addtion of pages to swap cache Tim Chen
2016-05-03 21:03 ` [PATCH 6/7] mm: Cleanup - Reorganize code to group handling of page Tim Chen
2016-05-03 21:03 ` [PATCH 7/7] mm: Batch unmapping of pages that are in swap cache Tim Chen
