* [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out
@ 2017-05-25  6:46 Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 01/13] mm, THP, swap: Support to clear swap cache flag for THP " Huang, Ying
                   ` (12 more replies)
  0 siblings, 13 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel, Andrea Arcangeli,
	Kirill A . Shutemov, Jens Axboe, Michal Hocko, Huang Ying

From: Huang Ying <ying.huang@intel.com>

Hi, Andrew, could you help me check whether the overall design is
reasonable?

Hi, Johannes and Minchan, thanks a lot for your review of the first
step of the THP swap optimization!  Could you help me review the
second step in this patchset?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me review the
swap part of the patchset?  Especially [01/13], [02/13], [03/13],
[04/13], [07/13], [12/13], and [13/13].

Hi, Andrea and Kirill, could you help me review the THP part of the
patchset?  Especially [01/13], [03/13], [08/13], [09/13], [10/13],
and [12/13].

Hi, Jens and Shaohua, could you help me review the block part of
the patchset?  Especially [05/13], [06/13], and [07/13].

Hi, Johannes and Michal, could you help me review the cgroup part of
the patchset?  Especially [09/13], [10/13], and [11/13].

And for everyone: any comments are welcome!

This is the second step of the THP (Transparent Huge Page) swap
optimization.  In the first step, splitting the huge page was delayed
from nearly the beginning of the swap-out path to after allocating
the swap space for the THP and adding the THP into the swap cache.
In this second step, the splitting is delayed further, to after the
swapping out has finished.  The plan is to delay splitting the THP
step by step and finally to avoid splitting it altogether, swapping
the THP out and in as a whole.

With this patchset, more of the operations for reclaiming an
anonymous THP, such as TLB flushing, writing the THP to the swap
device, and removing the THP from the swap cache, are batched, which
improves the performance of swapping out anonymous THPs.
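
For scale: with 4KB base pages and 2MB THPs (HPAGE_PMD_NR = 512, as
on x86-64), each batched operation, e.g. one TLB flush, one write, or
one swap cache deletion for the whole THP, replaces up to 512
per-page operations.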

This patchset is based on the 5/18 head of mmotm/master.

During development, the following scenarios/code paths have been
checked:

- swap out/in
- swap off
- write protect page fault
- madvise_free
- process exit
- split huge page

Please let me know if I missed something.

With the patchset, the swap out throughput improves by 42% (from
about 5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq
test case with 16 processes.  At the same time, the IPI count
(reflecting TLB flushing) is reduced by about 78.9%.  The test was
done on a Xeon E5 v3 system, with a RAM simulated PMEM (persistent
memory) device as the swap device.  To test sequential swapping out,
the test case creates 8 processes, which sequentially allocate and
write to anonymous pages until the RAM and part of the swap device
are used up.
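
For reference, a workload in the same spirit as the test case
described above can be sketched as below.  This is not the
vm-scalability source, just an illustration: each process maps a
large anonymous region, asks for THPs with MADV_HUGEPAGE, and writes
it sequentially until the system starts swapping.

/*
 * Sketch of a sequential anonymous-write workload, similar in spirit
 * to vm-scalability's swap-w-seq test case; not the actual test code.
 * The region size is arbitrary; pick one larger than RAM.
 */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 64UL << 30;	/* 64GB; choose > RAM to force swap out */
	size_t off;
	char *buf;

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	madvise(buf, size, MADV_HUGEPAGE);	/* prefer THPs for this VMA */
	for (off = 0; off < size; off += 4096)	/* sequential write */
		buf[off] = 1;
	return 0;
}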

Below is the part of the cover letter for the first-step patchset of
the THP swap optimization that applies to all steps.

----------------------------------------------------------------->

Recently, the performance of storage devices has improved so fast
that we cannot saturate the disk bandwidth with a single logical CPU
when swapping out pages, even on a high-end server machine, because
the performance of storage devices has improved faster than that of a
single logical CPU.  And it seems that this trend will not change in
the near future.  On the other hand, THPs become more and more
popular because of increased memory sizes.  So it becomes necessary
to optimize THP swap performance.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce TLB flushing and
  lock acquiring/releasing, including allocating/freeing the swap
  space, adding/deleting to/from the swap cache, and writing/reading
  the swap space, etc.  This will help improve the performance of THP
  swap.

- THP swap space reads/writes will be 2MB sequential IO.  This is
  particularly helpful for swap reads, which are usually 4KB random
  IO.  This will improve the performance of THP swap too.

- It will help reduce memory fragmentation, especially when the THP
  is heavily used by applications.  The 2MB of contiguous pages will
  be freed up after the THP is swapped out.

- It will improve THP utilization on systems with swap turned on,
  because the speed at which khugepaged collapses normal pages into a
  THP is quite slow.  If the THP is split during swap out, it will
  take quite a long time for the normal pages to be collapsed back
  into a THP after being swapped in.  High THP utilization also helps
  the efficiency of page based memory management.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO sizes (for swap in/out) may put more
overhead on the storage device.  To deal with that, THP swap in
should be turned on only when necessary.  For example, it could be
selected via "always/never/madvise" logic, to be turned on globally,
turned off globally, or turned on only for VMAs with MADV_HUGEPAGE,
etc., as sketched below.

Best Regards,
Huang, Ying


* [PATCH -mm 01/13] mm, THP, swap: Support to clear swap cache flag for THP swapped out
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 02/13] mm, THP, swap: Support to reclaim swap space " Huang, Ying
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

Previously, swapcache_free_cluster() was used only in the error path
of shrink_page_list(), to free the swap cluster just allocated if the
THP (Transparent Huge Page) fails to be split.  In this patch, it is
enhanced to also clear the swap cache flag (SWAP_HAS_CACHE) for a
swap cluster that holds the contents of a THP swapped out.

This will be used to support delaying the splitting of a THP until
after it has been swapped out.  Because there is no support for
swapping in a THP as a whole yet, after the swap cache flag is
cleared, the swap cluster backing the swapped-out THP will be split,
so that the swap slots in the swap cluster can be swapped in as
normal pages later.
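
For readers less familiar with the swap_map encoding: each swap_map
byte holds the swap count in its low bits plus the SWAP_HAS_CACHE
flag, so a value of exactly SWAP_HAS_CACHE means the slot is pinned
only by the swap cache and nothing else references it.  A minimal
sketch of that test (the helper name is invented for illustration):

/*
 * Illustration only; this helper does not exist in the kernel.  A
 * swap_map value of exactly SWAP_HAS_CACHE means the slot has no
 * remaining swap references and is held only by the swap cache, so
 * it becomes free as soon as the cache reference is dropped.
 */
static inline bool swap_slot_only_cached(unsigned char val)
{
	return val == SWAP_HAS_CACHE;
}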

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/swapfile.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8a6cdf9e55f9..4cd02dec6894 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1167,22 +1167,40 @@ static void swapcache_free_cluster(swp_entry_t entry)
 	struct swap_cluster_info *ci;
 	struct swap_info_struct *si;
 	unsigned char *map;
-	unsigned int i;
+	unsigned int i, free_entries = 0;
+	unsigned char val;
 
-	si = swap_info_get(entry);
+	si = _swap_info_get(entry);
 	if (!si)
 		return;
 
 	ci = lock_cluster(si, offset);
 	map = si->swap_map + offset;
 	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
-		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
-		map[i] = 0;
+		val = map[i];
+		VM_BUG_ON(!(val & SWAP_HAS_CACHE));
+		if (val == SWAP_HAS_CACHE)
+			free_entries++;
+	}
+	if (!free_entries) {
+		for (i = 0; i < SWAPFILE_CLUSTER; i++)
+			map[i] &= ~SWAP_HAS_CACHE;
 	}
 	unlock_cluster(ci);
-	mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
-	swap_free_cluster(si, idx);
-	spin_unlock(&si->lock);
+	if (free_entries == SWAPFILE_CLUSTER) {
+		spin_lock(&si->lock);
+		ci = lock_cluster(si, offset);
+		memset(map, 0, SWAPFILE_CLUSTER);
+		unlock_cluster(ci);
+		mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
+		swap_free_cluster(si, idx);
+		spin_unlock(&si->lock);
+	} else if (free_entries) {
+		for (i = 0; i < SWAPFILE_CLUSTER; i++, entry.val++) {
+			if (!__swap_entry_free(si, entry, SWAP_HAS_CACHE))
+				free_swap_slot(entry);
+		}
+	}
 }
 #else
 static inline void swapcache_free_cluster(swp_entry_t entry)
-- 
2.11.0


* [PATCH -mm 02/13] mm, THP, swap: Support to reclaim swap space for THP swapped out
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 01/13] mm, THP, swap: Support to clear swap cache flag for THP " Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 03/13] mm, THP, swap: Make reuse_swap_page() work " Huang, Ying
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

A normal swap slot can be reclaimed when its swap count reaches
SWAP_HAS_CACHE.  But the swap slots backing a THP must all be
reclaimed together, because any of them may be used again when the
THP is swapped out again later.  So the swap slots backing one THP
can be reclaimed together only when the swap count of every one of
them has reached SWAP_HAS_CACHE.  In this patch, functions to check
whether the swap count of all swap slots backing one THP has reached
SWAP_HAS_CACHE are implemented and used when checking whether a swap
slot can be reclaimed.

To make it easier to determine whether a swap slot is backing a THP,
a new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a
swap cluster which is backing a THP (Transparent Huge Page).  Because
swapping in a THP as a whole isn't supported yet, the
CLUSTER_FLAG_HUGE flag is cleared after the THP is deleted from the
swap cache (for example, when swapping out finishes), so that the
normal pages inside the THP can be swapped in individually.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
---
 include/linux/swap.h |  1 +
 mm/swapfile.c        | 78 +++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5ab1c98c7d27..c563c45b30b4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -188,6 +188,7 @@ struct swap_cluster_info {
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4cd02dec6894..675afc235de1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -264,6 +264,16 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
+static inline bool cluster_is_huge(struct swap_cluster_info *info)
+{
+	return info->flags & CLUSTER_FLAG_HUGE;
+}
+
+static inline void cluster_clear_huge(struct swap_cluster_info *info)
+{
+	info->flags &= ~CLUSTER_FLAG_HUGE;
+}
+
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -845,7 +855,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 	offset = idx * SWAPFILE_CLUSTER;
 	ci = lock_cluster(si, offset);
 	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
 
 	map = si->swap_map + offset;
 	for (i = 0; i < SWAPFILE_CLUSTER; i++)
@@ -1175,6 +1185,7 @@ static void swapcache_free_cluster(swp_entry_t entry)
 		return;
 
 	ci = lock_cluster(si, offset);
+	VM_BUG_ON(!cluster_is_huge(ci));
 	map = si->swap_map + offset;
 	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
 		val = map[i];
@@ -1186,6 +1197,7 @@ static void swapcache_free_cluster(swp_entry_t entry)
 		for (i = 0; i < SWAPFILE_CLUSTER; i++)
 			map[i] &= ~SWAP_HAS_CACHE;
 	}
+	cluster_clear_huge(ci);
 	unlock_cluster(ci);
 	if (free_entries == SWAPFILE_CLUSTER) {
 		spin_lock(&si->lock);
@@ -1334,6 +1346,54 @@ int swp_swapcount(swp_entry_t entry)
 	return count;
 }
 
+#ifdef CONFIG_THP_SWAP
+static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
+					 swp_entry_t entry)
+{
+	struct swap_cluster_info *ci;
+	unsigned char *map = si->swap_map;
+	unsigned long roffset = swp_offset(entry);
+	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	int i;
+	bool ret = false;
+
+	ci = lock_cluster_or_swap_info(si, offset);
+	if (!cluster_is_huge(ci)) {
+		if (map[roffset] != SWAP_HAS_CACHE)
+			ret = true;
+		goto unlock_out;
+	}
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		if (map[offset + i] != SWAP_HAS_CACHE) {
+			ret = true;
+			break;
+		}
+	}
+unlock_out:
+	unlock_cluster_or_swap_info(si, ci);
+	return ret;
+}
+
+static bool page_swapped(struct page *page)
+{
+	swp_entry_t entry;
+	struct swap_info_struct *si;
+
+	if (likely(!PageTransCompound(page)))
+		return page_swapcount(page) != 0;
+
+	page = compound_head(page);
+	entry.val = page_private(page);
+	si = _swap_info_get(entry);
+	if (si)
+		return swap_page_trans_huge_swapped(si, entry);
+	return false;
+}
+#else
+#define swap_page_trans_huge_swapped(si, entry)	swap_swapcount(si, entry)
+#define page_swapped(page)			(page_swapcount(page) != 0)
+#endif
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content
@@ -1388,7 +1448,7 @@ int try_to_free_swap(struct page *page)
 		return 0;
 	if (PageWriteback(page))
 		return 0;
-	if (page_swapcount(page))
+	if (page_swapped(page))
 		return 0;
 
 	/*
@@ -1409,6 +1469,7 @@ int try_to_free_swap(struct page *page)
 	if (pm_suspended_storage())
 		return 0;
 
+	page = compound_head(page);
 	delete_from_swap_cache(page);
 	SetPageDirty(page);
 	return 1;
@@ -1430,7 +1491,8 @@ int free_swap_and_cache(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry, 1);
-		if (count == SWAP_HAS_CACHE) {
+		if (count == SWAP_HAS_CACHE &&
+		    !swap_page_trans_huge_swapped(p, entry)) {
 			page = find_get_page(swap_address_space(entry),
 					     swp_offset(entry));
 			if (page && !trylock_page(page)) {
@@ -1447,7 +1509,8 @@ int free_swap_and_cache(swp_entry_t entry)
 		 */
 		if (PageSwapCache(page) && !PageWriteback(page) &&
 		    (!page_mapped(page) || mem_cgroup_swap_full(page)) &&
-		    !swap_swapcount(p, entry)) {
+		    !swap_page_trans_huge_swapped(p, entry)) {
+			page = compound_head(page);
 			delete_from_swap_cache(page);
 			SetPageDirty(page);
 		}
@@ -2001,7 +2064,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
 				.sync_mode = WB_SYNC_NONE,
 			};
 
-			swap_writepage(page, &wbc);
+			swap_writepage(compound_head(page), &wbc);
 			lock_page(page);
 			wait_on_page_writeback(page);
 		}
@@ -2014,8 +2077,9 @@ int try_to_unuse(unsigned int type, bool frontswap,
 		 * delete, since it may not have been written out to swap yet.
 		 */
 		if (PageSwapCache(page) &&
-		    likely(page_private(page) == entry.val))
-			delete_from_swap_cache(page);
+		    likely(page_private(page) == entry.val) &&
+		    !page_swapped(page))
+			delete_from_swap_cache(compound_head(page));
 
 		/*
 		 * So we could skip searching mms once swap count went
-- 
2.11.0


* [PATCH -mm 03/13] mm, THP, swap: Make reuse_swap_page() work for THP swapped out
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 01/13] mm, THP, swap: Support to clear swap cache flag for THP " Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 02/13] mm, THP, swap: Support to reclaim swap space " Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 04/13] mm, THP, swap: Don't allocate huge cluster for file backed swap device Huang, Ying
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel, Andrea Arcangeli,
	Kirill A . Shutemov

From: Huang Ying <ying.huang@intel.com>

After the splitting of a THP (Transparent Huge Page) is delayed until
after it has been swapped out, it is possible that some page table
mappings of the THP have been turned into swap entries.  So
reuse_swap_page() needs to check the swap count in addition to the
map count, as before.  This patch does that.

In the huge PMD write-protect fault handler, the swap count needs to
be checked in addition to the page map count, so the page lock needs
to be acquired when calling reuse_swap_page(), in addition to the
page table lock.  Because the page table lock is dropped while
waiting for the page lock, the handler re-validates the PMD with
pmd_same() afterwards and bails out if it has changed.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
---
 include/linux/swap.h |   4 +-
 mm/huge_memory.c     |  16 +++++++-
 mm/memory.c          |   6 +--
 mm/swapfile.c        | 102 ++++++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 113 insertions(+), 15 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c563c45b30b4..ed51d5e699e0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -508,8 +508,8 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-#define reuse_swap_page(page, total_mapcount) \
-	(page_trans_huge_mapcount(page, total_mapcount) == 1)
+#define reuse_swap_page(page, total_map_swapcount) \
+	(page_trans_huge_mapcount(page, total_map_swapcount) == 1)
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3a14c77fcce7..0eb1251f924a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1226,15 +1226,29 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	 * We can only reuse the page if nobody else maps the huge page or it's
 	 * part.
 	 */
-	if (page_trans_huge_mapcount(page, NULL) == 1) {
+	if (!trylock_page(page)) {
+		get_page(page);
+		spin_unlock(vmf->ptl);
+		lock_page(page);
+		spin_lock(vmf->ptl);
+		if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) {
+			unlock_page(page);
+			put_page(page);
+			goto out_unlock;
+		}
+		put_page(page);
+	}
+	if (reuse_swap_page(page, NULL)) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry,  1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		ret |= VM_FAULT_WRITE;
+		unlock_page(page);
 		goto out_unlock;
 	}
+	unlock_page(page);
 	get_page(page);
 	spin_unlock(vmf->ptl);
 alloc:
diff --git a/mm/memory.c b/mm/memory.c
index d320b4e16826..ac780fc619cd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2541,7 +2541,7 @@ static int do_wp_page(struct vm_fault *vmf)
 	 * not dirty accountable.
 	 */
 	if (PageAnon(vmf->page) && !PageKsm(vmf->page)) {
-		int total_mapcount;
+		int total_map_swapcount;
 		if (!trylock_page(vmf->page)) {
 			get_page(vmf->page);
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2556,8 +2556,8 @@ static int do_wp_page(struct vm_fault *vmf)
 			}
 			put_page(vmf->page);
 		}
-		if (reuse_swap_page(vmf->page, &total_mapcount)) {
-			if (total_mapcount == 1) {
+		if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
+			if (total_map_swapcount == 1) {
 				/*
 				 * The page is all ours. Move it to
 				 * our anon_vma so the rmap code will
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 675afc235de1..bd0f38f31d3d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1389,9 +1389,89 @@ static bool page_swapped(struct page *page)
 		return swap_page_trans_huge_swapped(si, entry);
 	return false;
 }
+
+static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
+					 int *total_swapcount)
+{
+	int i, map_swapcount, _total_mapcount, _total_swapcount;
+	unsigned long offset;
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci = NULL;
+	unsigned char *map = NULL;
+	int mapcount, swapcount = 0;
+
+	/* hugetlbfs shouldn't call it */
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+
+	if (likely(!PageTransCompound(page))) {
+		mapcount = atomic_read(&page->_mapcount) + 1;
+		if (total_mapcount)
+			*total_mapcount = mapcount;
+		if (PageSwapCache(page))
+			swapcount = page_swapcount(page);
+		if (total_swapcount)
+			*total_swapcount = swapcount;
+		return mapcount + swapcount;
+	}
+
+	page = compound_head(page);
+
+	_total_mapcount = _total_swapcount = map_swapcount = 0;
+	if (PageSwapCache(page)) {
+		swp_entry_t entry;
+
+		entry.val = page_private(page);
+		si = _swap_info_get(entry);
+		if (si) {
+			map = si->swap_map;
+			offset = swp_offset(entry);
+		}
+	}
+	if (map)
+		ci = lock_cluster(si, offset);
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		mapcount = atomic_read(&page[i]._mapcount) + 1;
+		_total_mapcount += mapcount;
+		if (map) {
+			swapcount = swap_count(map[offset + i]);
+			_total_swapcount += swapcount;
+		}
+		map_swapcount = max(map_swapcount, mapcount + swapcount);
+	}
+	unlock_cluster(ci);
+	if (PageDoubleMap(page)) {
+		map_swapcount -= 1;
+		_total_mapcount -= HPAGE_PMD_NR;
+	}
+	mapcount = compound_mapcount(page);
+	map_swapcount += mapcount;
+	_total_mapcount += mapcount;
+	if (total_mapcount)
+		*total_mapcount = _total_mapcount;
+	if (total_swapcount)
+		*total_swapcount = _total_swapcount;
+
+	return map_swapcount;
+}
 #else
 #define swap_page_trans_huge_swapped(si, entry)	swap_swapcount(si, entry)
 #define page_swapped(page)			(page_swapcount(page) != 0)
+
+static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
+					 int *total_swapcount)
+{
+	int mapcount, swapcount = 0;
+
+	/* hugetlbfs shouldn't call it */
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+
+	mapcount = page_trans_huge_mapcount(page, total_mapcount);
+	if (PageSwapCache(page))
+		swapcount = page_swapcount(page);
+	if (total_swapcount)
+		*total_swapcount = swapcount;
+	return mapcount + swapcount;
+}
 #endif
 
 /*
@@ -1400,23 +1480,27 @@ static bool page_swapped(struct page *page)
  * on disk will never be read, and seeking back there to write new content
  * later would only waste time away from clustering.
  *
- * NOTE: total_mapcount should not be relied upon by the caller if
+ * NOTE: total_map_swapcount should not be relied upon by the caller if
  * reuse_swap_page() returns false, but it may be always overwritten
  * (see the other implementation for CONFIG_SWAP=n).
  */
-bool reuse_swap_page(struct page *page, int *total_mapcount)
+bool reuse_swap_page(struct page *page, int *total_map_swapcount)
 {
-	int count;
+	int count, total_mapcount, total_swapcount;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
 		return false;
-	count = page_trans_huge_mapcount(page, total_mapcount);
-	if (count <= 1 && PageSwapCache(page)) {
-		count += page_swapcount(page);
-		if (count != 1)
-			goto out;
+	count = page_trans_huge_map_swapcount(page, &total_mapcount,
+					      &total_swapcount);
+	if (total_map_swapcount)
+		*total_map_swapcount = total_mapcount + total_swapcount;
+	if (count == 1 && PageSwapCache(page) &&
+	    (likely(!PageTransCompound(page)) ||
+	     /* The remaining swap count will be freed soon */
+	     total_swapcount == page_swapcount(page))) {
 		if (!PageWriteback(page)) {
+			page = compound_head(page);
 			delete_from_swap_cache(page);
 			SetPageDirty(page);
 		} else {
@@ -1432,7 +1516,7 @@ bool reuse_swap_page(struct page *page, int *total_mapcount)
 			spin_unlock(&p->lock);
 		}
 	}
-out:
+
 	return count <= 1;
 }
 
-- 
2.11.0


* [PATCH -mm 04/13] mm, THP, swap: Don't allocate huge cluster for file backed swap device
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
                   ` (2 preceding siblings ...)
  2017-05-25  6:46 ` [PATCH -mm 03/13] mm, THP, swap: Make reuse_swap_page() work " Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 05/13] block, THP: Make block_device_operations.rw_page support THP Huang, Ying
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

It's hard to write a whole transparent huge page (THP) to a
file-backed swap device during swapping out, and file-backed swap
devices aren't very popular.  So huge cluster allocation is disabled
for file-backed swap devices.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/swapfile.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index bd0f38f31d3d..2a2f5d08f0a9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -947,9 +947,10 @@ int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (cluster)
-			n_ret = swap_alloc_cluster(si, swp_entries);
-		else
+		if (cluster) {
+			if (!(si->flags & SWP_FILE))
+				n_ret = swap_alloc_cluster(si, swp_entries);
+		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-- 
2.11.0


* [PATCH -mm 05/13] block, THP: Make block_device_operations.rw_page support THP
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
                   ` (3 preceding siblings ...)
  2017-05-25  6:46 ` [PATCH -mm 04/13] mm, THP, swap: Don't allocate huge cluster for file backed swap device Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-06-02  5:57   ` Ross Zwisler
  2017-05-25  6:46 ` [PATCH -mm 06/13] block: Increase BIO_MAX_PAGES to PMD size if THP_SWAP enabled Huang, Ying
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Dan Williams, Ross Zwisler, Vishal L Verma, Jens Axboe,
	linux-nvdimm

From: Huang Ying <ying.huang@intel.com>

The .rw_page in struct block_device_operations is used by the swap
subsystem to read/write the page contents from/into the corresponding
swap slot in the swap device.  To support the THP (Transparent Huge
Page) swap optimization, .rw_page is enhanced to read/write a THP if
possible.
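
Note that drivers which cannot handle a THP in one request (brd and
zram below) simply return -ENOTSUPP from .rw_page for a huge page;
the swap code then falls back to the regular bio path, which is
taught to handle a THP by a later patch in this series.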

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@intel.com>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-nvdimm@lists.01.org
---
 drivers/block/brd.c           |  6 +++++-
 drivers/block/zram/zram_drv.c |  2 ++
 drivers/nvdimm/btt.c          |  4 +++-
 drivers/nvdimm/pmem.c         | 42 +++++++++++++++++++++++++++++++-----------
 4 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 57b574f2f66a..4240d2a9dcf9 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -324,7 +324,11 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 		       struct page *page, bool is_write)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
-	int err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
+	int err;
+
+	if (PageTransHuge(page))
+		return -ENOTSUPP;
+	err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
 	page_endio(page, is_write, err);
 	return err;
 }
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 5f2a862d8e31..09b11286c927 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1049,6 +1049,8 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	struct zram *zram;
 	struct bio_vec bv;
 
+	if (PageTransHuge(page))
+		return -ENOTSUPP;
 	zram = bdev->bd_disk->private_data;
 
 	if (!valid_io_request(zram, sector, PAGE_SIZE)) {
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 983718b8fd9b..46d4a0bd2ae6 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1248,8 +1248,10 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
 		struct page *page, bool is_write)
 {
 	struct btt *btt = bdev->bd_disk->private_data;
+	unsigned int len;
 
-	btt_do_bvec(btt, NULL, page, PAGE_SIZE, 0, is_write, sector);
+	len = hpage_nr_pages(page) * PAGE_SIZE;
+	btt_do_bvec(btt, NULL, page, len, 0, is_write, sector);
 	page_endio(page, is_write, 0);
 	return 0;
 }
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c544d466ea51..e644115d56a7 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -78,22 +78,40 @@ static int pmem_clear_poison(struct pmem_device *pmem, phys_addr_t offset,
 static void write_pmem(void *pmem_addr, struct page *page,
 		unsigned int off, unsigned int len)
 {
-	void *mem = kmap_atomic(page);
-
-	memcpy_to_pmem(pmem_addr, mem + off, len);
-	kunmap_atomic(mem);
+	unsigned int chunk;
+	void *mem;
+
+	while (len) {
+		mem = kmap_atomic(page);
+		chunk = min_t(unsigned int, len, PAGE_SIZE);
+		memcpy_to_pmem(pmem_addr, mem + off, chunk);
+		kunmap_atomic(mem);
+		len -= chunk;
+		off = 0;
+		page++;
+		pmem_addr += PAGE_SIZE;
+	}
 }
 
 static int read_pmem(struct page *page, unsigned int off,
 		void *pmem_addr, unsigned int len)
 {
+	unsigned int chunk;
 	int rc;
-	void *mem = kmap_atomic(page);
-
-	rc = memcpy_mcsafe(mem + off, pmem_addr, len);
-	kunmap_atomic(mem);
-	if (rc)
-		return -EIO;
+	void *mem;
+
+	while (len) {
+		mem = kmap_atomic(page);
+		chunk = min_t(unsigned int, len, PAGE_SIZE);
+		rc = memcpy_mcsafe(mem + off, pmem_addr, chunk);
+		kunmap_atomic(mem);
+		if (rc)
+			return -EIO;
+		len -= chunk;
+		off = 0;
+		page++;
+		pmem_addr += PAGE_SIZE;
+	}
 	return 0;
 }
 
@@ -184,9 +202,11 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 		       struct page *page, bool is_write)
 {
 	struct pmem_device *pmem = bdev->bd_queue->queuedata;
+	unsigned int len;
 	int rc;
 
-	rc = pmem_do_bvec(pmem, page, PAGE_SIZE, 0, is_write, sector);
+	len = hpage_nr_pages(page) * PAGE_SIZE;
+	rc = pmem_do_bvec(pmem, page, len, 0, is_write, sector);
 
 	/*
 	 * The ->rw_page interface is subtle and tricky.  The core
-- 
2.11.0


* [PATCH -mm 06/13] block: Increase BIO_MAX_PAGES to PMD size if THP_SWAP enabled
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
                   ` (4 preceding siblings ...)
  2017-05-25  6:46 ` [PATCH -mm 05/13] block, THP: Make block_device_operations.rw_page support THP Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  8:42   ` Ming Lei
  2017-05-25  6:46 ` [PATCH -mm 07/13] mm, THP, swap: Support to write THP to swap device as a whole Huang, Ying
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Jens Axboe, Ming Lei, Shaohua Li, linux-block

From: Huang Ying <ying.huang@intel.com>

In this patch, BIO_MAX_PAGES is changed from 256 to HPAGE_PMD_NR if
CONFIG_THP_SWAP is enabled and HPAGE_PMD_NR > 256.  This is to
support the THP (Transparent Huge Page) swap optimization, where the
THP will be written to disk as a whole instead of as HPAGE_PMD_NR
normal pages, to batch the various operations during swap.  And
because a page is most likely written to disk to free memory when
system memory runs really low, the bio memory pool needs to be used
to avoid deadlock.
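
For example, on x86-64 with 4KB base pages, HPAGE_PMD_NR = 2MB / 4KB
= 512 > 256, so BIO_MAX_PAGES becomes 512 there; configurations whose
PMD covers 256 base pages or fewer keep the old value of 256.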

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Shaohua Li <shli@fb.com>
Cc: linux-block@vger.kernel.org
---
 include/linux/bio.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index d1b04b0e99cf..314796486507 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -38,7 +38,15 @@
 #define BIO_BUG_ON
 #endif
 
+#ifdef CONFIG_THP_SWAP
+#if HPAGE_PMD_NR > 256
+#define BIO_MAX_PAGES		HPAGE_PMD_NR
+#else
 #define BIO_MAX_PAGES		256
+#endif
+#else
+#define BIO_MAX_PAGES		256
+#endif
 
 #define bio_prio(bio)			(bio)->bi_ioprio
 #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
-- 
2.11.0


* [PATCH -mm 07/13] mm, THP, swap: Support to write THP to swap device as a whole
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
                   ` (5 preceding siblings ...)
  2017-05-25  6:46 ` [PATCH -mm 06/13] block: Increase BIO_MAX_PAGES to PMD size if THP_SWAP enabled Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 08/13] mm, THP, swap: Support to split THP for THP swapped out Huang, Ying
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel, Jens Axboe

From: Huang Ying <ying.huang@intel.com>

In this patch, swap writing is enhanced to support writing a THP
(Transparent Huge Page) as a whole.  This is part of the THP swap
optimization and will improve swap write IO performance thanks to the
larger sequential IOs.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jens Axboe <axboe@fb.com>
---
 include/linux/page-flags.h    |  4 ++--
 include/linux/vm_event_item.h |  1 +
 mm/page_io.c                  | 21 ++++++++++++++++-----
 mm/vmstat.c                   |  1 +
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d33e3280c8ad..ba2d470d2d0a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -303,8 +303,8 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
  * Only test-and-set exist for PG_writeback.  The unconditional operators are
  * risky: they bypass page accounting.
  */
-TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
-	TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
+TESTPAGEFLAG(Writeback, writeback, PF_NO_TAIL)
+	TESTSCFLAG(Writeback, writeback, PF_NO_TAIL)
 PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
 
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d84ae90ccd5c..5b5b0f094060 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -84,6 +84,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
+		THP_SWPOUT,
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
 		BALLOON_INFLATE,
diff --git a/mm/page_io.c b/mm/page_io.c
index 23f6d0d3470f..ec5229fb3607 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -27,16 +27,18 @@
 static struct bio *get_swap_bio(gfp_t gfp_flags,
 				struct page *page, bio_end_io_t end_io)
 {
+	int i, nr = hpage_nr_pages(page);
 	struct bio *bio;
 
-	bio = bio_alloc(gfp_flags, 1);
+	bio = bio_alloc(gfp_flags, nr);
 	if (bio) {
 		bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev);
 		bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
 		bio->bi_end_io = end_io;
 
-		bio_add_page(bio, page, PAGE_SIZE, 0);
-		BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE);
+		for (i = 0; i < nr; i++)
+			bio_add_page(bio, page + i, PAGE_SIZE, 0);
+		VM_BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE * nr);
 	}
 	return bio;
 }
@@ -257,6 +259,15 @@ static sector_t swap_page_sector(struct page *page)
 	return (sector_t)__page_file_index(page) << (PAGE_SHIFT - 9);
 }
 
+static inline void count_swpout_vm_event(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (unlikely(PageTransHuge(page)))
+		count_vm_event(THP_SWPOUT);
+#endif
+	count_vm_events(PSWPOUT, hpage_nr_pages(page));
+}
+
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		bio_end_io_t end_write_func)
 {
@@ -308,7 +319,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 
 	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
 	if (!ret) {
-		count_vm_event(PSWPOUT);
+		count_swpout_vm_event(page);
 		return 0;
 	}
 
@@ -321,7 +332,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		goto out;
 	}
 	bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
-	count_vm_event(PSWPOUT);
+	count_swpout_vm_event(page);
 	set_page_writeback(page);
 	unlock_page(page);
 	submit_bio(bio);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c432e581f9a9..ebfd79df1008 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1070,6 +1070,7 @@ const char * const vmstat_text[] = {
 #endif
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
+	"thp_swpout",
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
 	"balloon_inflate",
-- 
2.11.0


* [PATCH -mm 08/13] mm, THP, swap: Support to split THP for THP swapped out
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
                   ` (6 preceding siblings ...)
  2017-05-25  6:46 ` [PATCH -mm 07/13] mm, THP, swap: Support to write THP to swap device as a whole Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 09/13] memcg, THP, swap: Support moving mem cgroup charge " Huang, Ying
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel, Andrea Arcangeli,
	Kirill A . Shutemov

From: Huang Ying <ying.huang@intel.com>

After swapping out a THP (Transparent Huge Page) is supported, it is
possible that a THP in the swap cache (partly swapped out) needs to
be split.  To split such a THP, the swap cluster backing the THP
needs to be split too; that is, the CLUSTER_FLAG_HUGE flag needs to
be cleared for the swap cluster.  This patch implements that.

Because writing a THP to the swap device requires it to remain a huge
page for the duration of the write, the PageWriteback flag is checked
before splitting, and the split fails with -EBUSY while writeback is
in flight.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
---
 include/linux/swap.h |  9 +++++++++
 mm/huge_memory.c     | 10 +++++++++-
 mm/swapfile.c        | 15 +++++++++++++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed51d5e699e0..fbe75245971e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -525,6 +525,15 @@ static inline swp_entry_t get_swap_page(struct page *page)
 
 #endif /* CONFIG_SWAP */
 
+#ifdef CONFIG_THP_SWAP
+extern int split_swap_cluster(swp_entry_t entry);
+#else
+static inline int split_swap_cluster(swp_entry_t entry)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0eb1251f924a..0aefc90c6573 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2446,6 +2446,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
+	if (PageWriteback(page))
+		return -EBUSY;
+
 	if (PageAnon(head)) {
 		/*
 		 * The caller does not necessarily hold an mmap_sem that would
@@ -2523,7 +2526,12 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			__dec_node_page_state(page, NR_SHMEM_THPS);
 		spin_unlock(&pgdata->split_queue_lock);
 		__split_huge_page(page, list, flags);
-		ret = 0;
+		if (PageSwapCache(head)) {
+			swp_entry_t entry = { .val = page_private(head) };
+
+			ret = split_swap_cluster(entry);
+		} else
+			ret = 0;
 	} else {
 		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
 			pr_alert("total_mapcount: %u, page_count(): %u\n",
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2a2f5d08f0a9..d4fd80be2e2d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1215,6 +1215,21 @@ static void swapcache_free_cluster(swp_entry_t entry)
 		}
 	}
 }
+
+int split_swap_cluster(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+
+	si = _swap_info_get(entry);
+	if (!si)
+		return -EBUSY;
+	ci = lock_cluster(si, offset);
+	cluster_clear_huge(ci);
+	unlock_cluster(ci);
+	return 0;
+}
 #else
 static inline void swapcache_free_cluster(swp_entry_t entry)
 {
-- 
2.11.0


* [PATCH -mm 09/13] memcg, THP, swap: Support moving mem cgroup charge for THP swapped out
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
                   ` (7 preceding siblings ...)
  2017-05-25  6:46 ` [PATCH -mm 08/13] mm, THP, swap: Support to split THP for THP swapped out Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 10/13] memcg, THP, swap: Avoid duplicated charge of THP in swap cache Huang, Ying
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Michal Hocko, Andrea Arcangeli, Kirill A . Shutemov

From: Huang Ying <ying.huang@intel.com>

A PTE mapped THP (Transparent Huge Page) is ignored when moving a
memory cgroup charge.  But for a THP which is in the swap cache, the
memory cgroup charge for the swap of a tail page may be moved in the
current implementation.  That isn't correct, because the swap charges
for all sub-pages of a THP should be moved together.  Following the
handling of PTE mapped THPs, the mem cgroup charge moving for the
swap entry of a tail page of a THP is now ignored too.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
---
 mm/memcontrol.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c131f7e5ecd1..1f36bb61a6de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4606,8 +4606,11 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		if (!ret || !target)
 			put_page(page);
 	}
-	/* There is a swap entry and a page doesn't exist or isn't charged */
-	if (ent.val && !ret &&
+	/*
+	 * There is a swap entry and a page doesn't exist or isn't charged.
+	 * But we cannot move a tail-page in a THP.
+	 */
+	if (ent.val && !ret && (!page || !PageTransCompound(page)) &&
 	    mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
 		ret = MC_TARGET_SWAP;
 		if (target)
-- 
2.11.0


* [PATCH -mm 10/13] memcg, THP, swap: Avoid duplicated charge of THP in swap cache
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
                   ` (8 preceding siblings ...)
  2017-05-25  6:46 ` [PATCH -mm 09/13] memcg, THP, swap: Support moving mem cgroup charge " Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 11/13] memcg, THP, swap: Make mem_cgroup_swapout() support THP Huang, Ying
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Michal Hocko, Andrea Arcangeli, Kirill A . Shutemov

From: Huang Ying <ying.huang@intel.com>

For a THP (Transparent Huge Page), tail_page->mem_cgroup is NULL, so
to check whether the page is charged already, we need to check the
head page.  This was not an issue before, because it was impossible
for a THP to be in the swap cache.  But after adding support for
delaying the splitting of a THP until after it is swapped out, it is
possible now.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f36bb61a6de..7de1fa07f77d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5372,7 +5372,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		 * in turn serializes uncharging.
 		 */
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
-		if (page->mem_cgroup)
+		if (compound_head(page)->mem_cgroup)
 			goto out;
 
 		if (do_swap_account) {
-- 
2.11.0


* [PATCH -mm 11/13] memcg, THP, swap: Make mem_cgroup_swapout() support THP
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
                   ` (9 preceding siblings ...)
  2017-05-25  6:46 ` [PATCH -mm 10/13] memcg, THP, swap: Avoid duplicated charge of THP in swap cache Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 12/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 13/13] mm, THP, swap: Add THP swapping out fallback counting Huang, Ying
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Michal Hocko, Andrea Arcangeli, Kirill A . Shutemov

From: Huang Ying <ying.huang@intel.com>

This patch makes mem_cgroup_swapout() work for a transparent huge
page (THP): it will move the memory cgroup charge from memory to swap
for the whole THP.

This will be used for the THP swap support, where a THP may be
swapped out as a whole to a set of HPAGE_PMD_NR continuous swap slots
on the swap device.
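
With 4KB base pages this means HPAGE_PMD_NR = 512 contiguous swap
slots per THP: a single swap_cgroup_record() call covers the whole
range, and nr_entries - 1 extra memory cgroup ID references are taken
for the tail pages.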

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
---
 mm/memcontrol.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7de1fa07f77d..f520dcadabb5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4621,8 +4621,8 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
- * We don't consider swapping or file mapped pages because THP does not
- * support them for now.
+ * We don't consider PMD mapped swapping or file mapped pages because THP does
+ * not support them for now.
  * Caller should make sure that pmd_trans_huge(pmd) is true.
  */
 static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
@@ -5855,6 +5855,7 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
 void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg, *swap_memcg;
+	unsigned int nr_entries;
 	unsigned short oldid;
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -5875,19 +5876,24 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * ancestor for the swap instead and transfer the memory+swap charge.
 	 */
 	swap_memcg = mem_cgroup_id_get_online(memcg);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 1);
+	nr_entries = hpage_nr_pages(page);
+	/* Get references for the tail pages, too */
+	if (nr_entries > 1)
+		mem_cgroup_id_get_many(swap_memcg, nr_entries - 1);
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg),
+				   nr_entries);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(swap_memcg, 1);
+	mem_cgroup_swap_statistics(swap_memcg, nr_entries);
 
 	page->mem_cgroup = NULL;
 
 	if (!mem_cgroup_is_root(memcg))
-		page_counter_uncharge(&memcg->memory, 1);
+		page_counter_uncharge(&memcg->memory, nr_entries);
 
 	if (memcg != swap_memcg) {
 		if (!mem_cgroup_is_root(swap_memcg))
-			page_counter_charge(&swap_memcg->memsw, 1);
-		page_counter_uncharge(&memcg->memsw, 1);
+			page_counter_charge(&swap_memcg->memsw, nr_entries);
+		page_counter_uncharge(&memcg->memsw, nr_entries);
 	}
 
 	/*
@@ -5897,7 +5903,8 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * only synchronisation we have for udpating the per-CPU variables.
 	 */
 	VM_BUG_ON(!irqs_disabled());
-	mem_cgroup_charge_statistics(memcg, page, false, -1);
+	mem_cgroup_charge_statistics(memcg, page, PageTransHuge(page),
+				     -nr_entries);
 	memcg_check_events(memcg, page);
 
 	if (!mem_cgroup_is_root(memcg))
-- 
2.11.0


* [PATCH -mm 12/13] mm, THP, swap: Delay splitting THP after swapped out
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
                   ` (10 preceding siblings ...)
  2017-05-25  6:46 ` [PATCH -mm 11/13] memcg, THP, swap: Make mem_cgroup_swapout() support THP Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  2017-05-25  6:46 ` [PATCH -mm 13/13] mm, THP, swap: Add THP swapping out fallback counting Huang, Ying
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko

From: Huang Ying <ying.huang@intel.com>

In this patch, the splitting of a transparent huge page (THP) during
swapping out is delayed from after adding the THP into the swap cache
to after the swapping out finishes.  After the patch, more operations
for reclaiming an anonymous THP, such as writing the THP to the swap
device and removing the THP from the swap cache, can be batched, so
that the performance of swapping out anonymous THPs is improved.

This is the second step of the THP swap support.  The plan is to
delay splitting the THP step by step and finally avoid splitting it
at all.

With the patchset, the swap out throughput improves by 42% (from
about 5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq
test case with 16 processes.  At the same time, the IPI count
(reflecting TLB flushing) is reduced by about 78.9%.  The test was
done on a Xeon E5 v3 system, with a RAM simulated PMEM (persistent
memory) device as the swap device.  To test sequential swapping out,
the test case creates 8 processes, which sequentially allocate and
write to anonymous pages until the RAM and part of the swap device
are used up.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
 mm/vmscan.c | 95 +++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 52 insertions(+), 43 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7e949ac9756..510e709aecd4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -535,7 +535,9 @@ static inline int is_page_cache_freeable(struct page *page)
 	 * that isolated the page, the page cache radix tree and
 	 * optional buffer heads at page->private.
 	 */
-	return page_count(page) - page_has_private(page) == 2;
+	int radix_pins = PageTransHuge(page) && PageSwapCache(page) ?
+		HPAGE_PMD_NR : 1;
+	return page_count(page) - page_has_private(page) == 1 + radix_pins;
 }
 
 static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
@@ -665,6 +667,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 			    bool reclaimed)
 {
 	unsigned long flags;
+	int refcount;
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
@@ -695,11 +698,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	 * Note that if SetPageDirty is always performed via set_page_dirty,
 	 * and thus under tree_lock, then this ordering is not required.
 	 */
-	if (!page_ref_freeze(page, 2))
+	if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
+		refcount = 1 + HPAGE_PMD_NR;
+	else
+		refcount = 2;
+	if (!page_ref_freeze(page, refcount))
 		goto cannot_free;
 	/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
 	if (unlikely(PageDirty(page))) {
-		page_ref_unfreeze(page, 2);
+		page_ref_unfreeze(page, refcount);
 		goto cannot_free;
 	}
 
@@ -1121,58 +1128,56 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * Try to allocate it some swap space here.
 		 * Lazyfree page could be freed directly
 		 */
-		if (PageAnon(page) && PageSwapBacked(page) &&
-		    !PageSwapCache(page)) {
-			if (!(sc->gfp_mask & __GFP_IO))
-				goto keep_locked;
-			if (PageTransHuge(page)) {
-				/* cannot split THP, skip it */
-				if (!can_split_huge_page(page, NULL))
-					goto activate_locked;
-				/*
-				 * Split pages without a PMD map right
-				 * away. Chances are some or all of the
-				 * tail pages can be freed without IO.
-				 */
-				if (!compound_mapcount(page) &&
-				    split_huge_page_to_list(page, page_list))
-					goto activate_locked;
-			}
-			if (!add_to_swap(page)) {
-				if (!PageTransHuge(page))
-					goto activate_locked;
-				/* Split THP and swap individual base pages */
-				if (split_huge_page_to_list(page, page_list))
-					goto activate_locked;
-				if (!add_to_swap(page))
-					goto activate_locked;
-			}
-
-			/* XXX: We don't support THP writes */
-			if (PageTransHuge(page) &&
-				  split_huge_page_to_list(page, page_list)) {
-				delete_from_swap_cache(page);
-				goto activate_locked;
-			}
+		if (PageAnon(page) && PageSwapBacked(page)) {
+			if (!PageSwapCache(page)) {
+				if (!(sc->gfp_mask & __GFP_IO))
+					goto keep_locked;
+				if (PageTransHuge(page)) {
+					/* cannot split THP, skip it */
+					if (!can_split_huge_page(page, NULL))
+						goto activate_locked;
+					/*
+					 * Split pages without a PMD map right
+					 * away. Chances are some or all of the
+					 * tail pages can be freed without IO.
+					 */
+					if (!compound_mapcount(page) &&
+					    split_huge_page_to_list(page,
+								    page_list))
+						goto activate_locked;
+				}
+				if (!add_to_swap(page)) {
+					if (!PageTransHuge(page))
+						goto activate_locked;
+					/* Fallback to swap normal pages */
+					if (split_huge_page_to_list(page,
+								    page_list))
+						goto activate_locked;
+					if (!add_to_swap(page))
+						goto activate_locked;
+				}
 
-			may_enter_fs = 1;
+				may_enter_fs = 1;
 
-			/* Adding to swap updated mapping */
-			mapping = page_mapping(page);
+				/* Adding to swap updated mapping */
+				mapping = page_mapping(page);
+			}
 		} else if (unlikely(PageTransHuge(page))) {
 			/* Split file THP */
 			if (split_huge_page_to_list(page, page_list))
 				goto keep_locked;
 		}
 
-		VM_BUG_ON_PAGE(PageTransHuge(page), page);
-
 		/*
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page)) {
-			if (!try_to_unmap(page, ttu_flags | TTU_BATCH_FLUSH)) {
+			enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
+
+			if (unlikely(PageTransHuge(page)))
+				flags |= TTU_SPLIT_HUGE_PMD;
+			if (!try_to_unmap(page, flags)) {
 				nr_unmap_fail++;
 				goto activate_locked;
 			}
@@ -1311,7 +1316,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * Is there need to periodically free_page_list? It would
 		 * appear not as the counts should be low
 		 */
-		list_add(&page->lru, &free_pages);
+		if (unlikely(PageTransHuge(page))) {
+			mem_cgroup_uncharge(page);
+			(*get_compound_page_dtor(page))(page);
+		} else
+			list_add(&page->lru, &free_pages);
 		continue;
 
 activate_locked:
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH -mm 13/13] mm, THP, swap: Add THP swapping out fallback counting
  2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
                   ` (11 preceding siblings ...)
  2017-05-25  6:46 ` [PATCH -mm 12/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
@ 2017-05-25  6:46 ` Huang, Ying
  12 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel, Andrea Arcangeli,
	Kirill A . Shutemov, Michal Hocko

From: Huang Ying <ying.huang@intel.com>

When swapping out a THP (Transparent Huge Page), instead of swapping
out the THP as a whole, sometimes we have to fall back to splitting
the THP into normal pages before swapping, because no free swap
cluster is available, the cgroup limit is exceeded, etc.  To count
these fallbacks, a new VM event, THP_SWPOUT_FALLBACK, is added, and
it is counted whenever we fall back to splitting the THP.
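
The counter is exported via /proc/vmstat as "thp_swpout_fallback".
A minimal userspace sketch (an illustration, not part of the patch)
to read it:

	/* Read the thp_swpout_fallback counter from /proc/vmstat. */
	#include <stdio.h>
	#include <string.h>

	static long read_thp_swpout_fallback(void)
	{
		char name[64];
		long value;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return -1;
		while (fscanf(f, "%63s %ld", name, &value) == 2) {
			if (!strcmp(name, "thp_swpout_fallback")) {
				fclose(f);
				return value;
			}
		}
		fclose(f);
		return -1;
	}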

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
---
 include/linux/vm_event_item.h | 1 +
 mm/vmscan.c                   | 3 +++
 mm/vmstat.c                   | 1 +
 3 files changed, 5 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 5b5b0f094060..66effbadc9b8 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -85,6 +85,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
+		THP_SWPOUT_FALLBACK,
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
 		BALLOON_INFLATE,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 510e709aecd4..0f5a6bfc5e65 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1153,6 +1153,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					if (split_huge_page_to_list(page,
 								    page_list))
 						goto activate_locked;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+					count_vm_event(THP_SWPOUT_FALLBACK);
+#endif
 					if (!add_to_swap(page))
 						goto activate_locked;
 				}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ebfd79df1008..9400c915e9a2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1071,6 +1071,7 @@ const char * const vmstat_text[] = {
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 	"thp_swpout",
+	"thp_swpout_fallback",
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
 	"balloon_inflate",
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH -mm 06/13] block: Increase BIO_MAX_PAGES to PMD size if THP_SWAP enabled
  2017-05-25  6:46 ` [PATCH -mm 06/13] block: Increase BIO_MAX_PAGES to PMD size if THP_SWAP enabled Huang, Ying
@ 2017-05-25  8:42   ` Ming Lei
  2017-05-26  0:56     ` Huang, Ying
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2017-05-25  8:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner,
	Minchan Kim, Jens Axboe, Ming Lei, Shaohua Li, linux-block

On Thu, May 25, 2017 at 02:46:28PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> In this patch, BIO_MAX_PAGES is changed from 256 to HPAGE_PMD_NR if
> CONFIG_THP_SWAP is enabled and HPAGE_PMD_NR > 256.  This is to support
> the THP (Transparent Huge Page) swap optimization, where the THP will
> be written to disk as a whole instead of as HPAGE_PMD_NR normal pages,
> to batch the various operations during swap.  And because the page is
> likely to be written to disk to free memory when system memory runs
> really low, a memory pool needs to be used to avoid deadlock.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ming Lei <tom.leiming@gmail.com>
> Cc: Shaohua Li <shli@fb.com>
> Cc: linux-block@vger.kernel.org
> ---
>  include/linux/bio.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index d1b04b0e99cf..314796486507 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -38,7 +38,15 @@
>  #define BIO_BUG_ON
>  #endif
>  
> +#ifdef CONFIG_THP_SWAP
> +#if HPAGE_PMD_NR > 256
> +#define BIO_MAX_PAGES		HPAGE_PMD_NR
> +#else
>  #define BIO_MAX_PAGES		256
> +#endif
> +#else
> +#define BIO_MAX_PAGES		256
> +#endif
>  
>  #define bio_prio(bio)			(bio)->bi_ioprio
>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)

Last time we discussed this, we agreed that we should use multipage
bvecs for this usage.

I will rebase the last post on v4.12-rc and kick it off again, since
the RAID cleanup was just done in v4.11.

	http://marc.info/?t=148453679000002&r=1&w=2

Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH -mm 06/13] block: Increase BIO_MAX_PAGES to PMD size if THP_SWAP enabled
  2017-05-25  8:42   ` Ming Lei
@ 2017-05-26  0:56     ` Huang, Ying
  0 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-05-26  0:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Johannes Weiner, Minchan Kim, Jens Axboe, Ming Lei, Shaohua Li,
	linux-block

Ming Lei <ming.lei@redhat.com> writes:

> On Thu, May 25, 2017 at 02:46:28PM +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> In this patch, BIO_MAX_PAGES is changed from 256 to HPAGE_PMD_NR if
>> CONFIG_THP_SWAP is enabled and HPAGE_PMD_NR > 256.  This is to support
>> the THP (Transparent Huge Page) swap optimization, where the THP will
>> be written to disk as a whole instead of as HPAGE_PMD_NR normal pages,
>> to batch the various operations during swap.  And because the page is
>> likely to be written to disk to free memory when system memory runs
>> really low, a memory pool needs to be used to avoid deadlock.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: Ming Lei <tom.leiming@gmail.com>
>> Cc: Shaohua Li <shli@fb.com>
>> Cc: linux-block@vger.kernel.org
>> ---
>>  include/linux/bio.h | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>> 
>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>> index d1b04b0e99cf..314796486507 100644
>> --- a/include/linux/bio.h
>> +++ b/include/linux/bio.h
>> @@ -38,7 +38,15 @@
>>  #define BIO_BUG_ON
>>  #endif
>>  
>> +#ifdef CONFIG_THP_SWAP
>> +#if HPAGE_PMD_NR > 256
>> +#define BIO_MAX_PAGES		HPAGE_PMD_NR
>> +#else
>>  #define BIO_MAX_PAGES		256
>> +#endif
>> +#else
>> +#define BIO_MAX_PAGES		256
>> +#endif
>>  
>>  #define bio_prio(bio)			(bio)->bi_ioprio
>>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
>
> Last time we discussed this, we agreed that we should use multipage
> bvecs for this usage.
>
> I will rebase the last post on v4.12-rc and kick it off again, since
> the RAID cleanup was just done in v4.11.
>
> 	http://marc.info/?t=148453679000002&r=1&w=2

Thanks for the information!  I will rebase my patchset on that work
after it is merged.  For now, this patch and the next one [07/13] are
only a temporary workaround for testing.
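
For context, the reason BIO_MAX_PAGES must reach HPAGE_PMD_NR is that
writing the THP as a whole needs all of its subpages to fit into a
single bio.  A minimal sketch (a hypothetical helper, not the code in
[07/13]):

	/*
	 * Sketch: build one bio covering every subpage of a THP.  This
	 * only works if BIO_MAX_PAGES >= hpage_nr_pages(page).
	 */
	static struct bio *example_swap_bio(struct page *page, gfp_t gfp)
	{
		int i, nr = hpage_nr_pages(page);
		struct bio *bio = bio_alloc(gfp, nr);

		if (!bio)
			return NULL;
		for (i = 0; i < nr; i++)
			bio_add_page(bio, page + i, PAGE_SIZE, 0);
		return bio;
	}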

Best Regards,
Huang, Ying

> Thanks,
> Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH -mm 05/13] block, THP: Make block_device_operations.rw_page support THP
  2017-05-25  6:46 ` [PATCH -mm 05/13] block, THP: Make block_device_operations.rw_page support THP Huang, Ying
@ 2017-06-02  5:57   ` Ross Zwisler
  2017-06-05  1:00     ` Huang, Ying
  0 siblings, 1 reply; 18+ messages in thread
From: Ross Zwisler @ 2017-06-02  5:57 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Jens Axboe, Minchan Kim, Ross Zwisler,
	linux-kernel, linux-mm, Johannes Weiner, linux-nvdimm

On Thu, May 25, 2017 at 02:46:27PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> The .rw_page in struct block_device_operations is used by the swap
> subsystem to read/write the page contents from/into the corresponding
> swap slot in the swap device.  To support the THP (Transparent Huge
> Page) swap optimization, .rw_page is enhanced to support reading and
> writing a THP if possible.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@intel.com>
> Cc: Vishal L Verma <vishal.l.verma@intel.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: linux-nvdimm@lists.01.org
> ---
>  drivers/block/brd.c           |  6 +++++-
>  drivers/block/zram/zram_drv.c |  2 ++
>  drivers/nvdimm/btt.c          |  4 +++-
>  drivers/nvdimm/pmem.c         | 42 +++++++++++++++++++++++++++++++-----------
>  4 files changed, 41 insertions(+), 13 deletions(-)

The changes in brd.c, zram_drv.c and pmem.c look good to me.  For those bits
you can add: 

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

I think we still want Vishal to make sure that the BTT changes are okay.  I
don't know that code well enough to know whether it's safe to throw 512 pages
at btt_[read|write]_pg().
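
For reference, the shape of a THP-aware rw_page implementation (an
illustrative sketch with a hypothetical driver callback, not code
from the patch): the only structural change is that the transfer size
is derived from the compound page instead of being assumed to be
PAGE_SIZE.

	/*
	 * Sketch: hpage_nr_pages() is 1 for a normal page and
	 * HPAGE_PMD_NR for a THP, so 'bytes' covers the whole page.
	 */
	static int example_rw_page(struct block_device *bdev, sector_t sector,
				   struct page *page, bool is_write)
	{
		size_t bytes = hpage_nr_pages(page) * PAGE_SIZE;

		/* ... transfer 'bytes' between 'page' and 'sector' ... */
		return 0;
	}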

Also, Ying, next time can you please CC me (and probably the linux-nvdimm
list) on the whole series?  It would give us more context on what the larger
change is, allow us to see the cover letter, allow us to test with all the
patches in the series, etc.  It's pretty easy for reviewers to skip over the
patches we don't care about or aren't in our area.

Thanks,
- Ross


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH -mm 05/13] block, THP: Make block_device_operations.rw_page support THP
  2017-06-02  5:57   ` Ross Zwisler
@ 2017-06-05  1:00     ` Huang, Ying
  0 siblings, 0 replies; 18+ messages in thread
From: Huang, Ying @ 2017-06-05  1:00 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Huang, Ying, Andrew Morton, Jens Axboe, Minchan Kim,
	Ross Zwisler, linux-kernel, linux-mm, Johannes Weiner,
	linux-nvdimm

Ross Zwisler <ross.zwisler@linux.intel.com> writes:

> On Thu, May 25, 2017 at 02:46:27PM +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> The .rw_page in struct block_device_operations is used by the swap
>> subsystem to read/write the page contents from/into the corresponding
>> swap slot in the swap device.  To support the THP (Transparent Huge
>> Page) swap optimization, .rw_page is enhanced to support reading and
>> writing a THP if possible.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Ross Zwisler <ross.zwisler@intel.com>
>> Cc: Vishal L Verma <vishal.l.verma@intel.com>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: linux-nvdimm@lists.01.org
>> ---
>>  drivers/block/brd.c           |  6 +++++-
>>  drivers/block/zram/zram_drv.c |  2 ++
>>  drivers/nvdimm/btt.c          |  4 +++-
>>  drivers/nvdimm/pmem.c         | 42 +++++++++++++++++++++++++++++++-----------
>>  4 files changed, 41 insertions(+), 13 deletions(-)
>
> The changes in brd.c, zram_drv.c and pmem.c look good to me.  For those bits
> you can add: 
>
> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Thanks!

> I think we still want Vishal to make sure that the BTT changes are okay.  I
> don't know that code well enough to know whether it's safe to throw 512 pages
> at btt_[read|write]_pg().
>
> Also, Ying, next time can you please CC me (and probably the linux-nvdimm
> list) on the whole series?  It would give us more context on what the larger
> change is, allow us to see the cover letter, allow us to test with all the
> patches in the series, etc.  It's pretty easy for reviewers to skip over the
> patches we don't care about or aren't in our area.

Sure.

Best Regards,
Huang, Ying

> Thanks,
> - Ross


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-06-05  1:00 UTC | newest]

Thread overview: 18+ messages
2017-05-25  6:46 [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 01/13] mm, THP, swap: Support to clear swap cache flag for THP " Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 02/13] mm, THP, swap: Support to reclaim swap space " Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 03/13] mm, THP, swap: Make reuse_swap_page() works " Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 04/13] mm, THP, swap: Don't allocate huge cluster for file backed swap device Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 05/13] block, THP: Make block_device_operations.rw_page support THP Huang, Ying
2017-06-02  5:57   ` Ross Zwisler
2017-06-05  1:00     ` Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 06/13] block: Increase BIO_MAX_PAGES to PMD size if THP_SWAP enabled Huang, Ying
2017-05-25  8:42   ` Ming Lei
2017-05-26  0:56     ` Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 07/13] mm, THP, swap: Support to write THP to swap device as a whole Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 08/13] mm, THP, swap: Support to split THP for THP swapped out Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 09/13] memcg, THP, swap: Support move mem cgroup charge " Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 10/13] memcg, THP, swap: Avoid to duplicated charge THP in swap cache Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 11/13] memcg, THP, swap: Make mem_cgroup_swapout() support THP Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 12/13] mm, THP, swap: Delay splitting THP after swapped out Huang, Ying
2017-05-25  6:46 ` [PATCH -mm 13/13] mm, THP, swap: Add THP swapping out fallback counting Huang, Ying
