linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -mm -v10 0/3] THP swap: Delay splitting THP during swapping out
@ 2017-04-25 12:56 Huang, Ying
  2017-04-25 12:56 ` [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Huang, Ying @ 2017-04-25 12:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Johannes Weiner, Michal Hocko

From: Huang Ying <ying.huang@intel.com>

This patchset is to optimize the performance of Transparent Huge Page
(THP) swap.

Recently, the performance of the storage devices improved so fast that
we cannot saturate the disk bandwidth with single logical CPU when do
page swap out even on a high-end server machine.  Because the
performance of the storage device improved faster than that of single
logical CPU.  And it seems that the trend will not change in the near
future.  On the other hand, the THP becomes more and more popular
because of increased memory size.  So it becomes necessary to optimize
THP swap performance.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help improve the performance of the THP swap.

- The THP swap space read/write will be 2M sequential IO.  It is
  particularly helpful for the swap read, which are usually 4k random
  IO.  This will improve the performance of the THP swap too.

- It will help the memory fragmentation, especially when the THP is
  heavily used by the applications.  The 2M continuous pages will be
  free up after THP swapping out.

- It will improve the THP utilization on the system with the swap
  turned on.  Because the speed for khugepaged to collapse the normal
  pages into the THP is quite slow.  After the THP is split during the
  swapping out, it will take quite long time for the normal pages to
  collapse back into the THP after being swapped in.  The high THP
  utilization helps the efficiency of the page based memory management
  too.

There are some concerns regarding THP swap in, mainly because possible
enlarged read/write IO size (for swap in/out) may put more overhead on
the storage device.  To deal with that, the THP swap in should be
turned on only when necessary.  For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

This patchset is based on 04/13 head of mmotm/master.

This patchset is the first step for the THP swap support.  The plan is
to delay splitting THP step by step, finally avoid splitting THP
during the THP swapping out and swap out/in the THP as a whole.

As the first step, in this patchset, the splitting huge page is
delayed from almost the first step of swapping out to after allocating
the swap space for the THP and adding the THP into the swap cache.
This will reduce lock acquiring/releasing for the locks used for the
swap cache management.

With the patchset, the swap out throughput improves 15.5% (from about
3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
device used is a RAM simulated PMEM (persistent memory) device.  To
test the sequential swapping out, the test case creates 8 processes,
which sequentially allocate and write to the anonymous pages until the
RAM and part of the swap device is used up.

Changelog:

v10:

- Remove unlikely() in add_to_swap() per Johannes' comments
- Improve code comments in add_to_swap() and mem_cgroup_try_charge_swap()

v9:

- Rebased on latest -mm tree
- Johannes done extensive cleanups and simplifications
- Johannes resolved the code size increases
- Update the test results

v8:

- Rebased on latest -mm tree
- Reorganize the patchset per Johannes' comments
- Merge add_to_swap_trans_huge() and add_to_swap() per Johannes' comments

v7:

- Rebased on latest -mm tree
- Revise get_swap_pages() THP support per Tim's comments

v6:

- Rebased on latest -mm tree (cluster lock, etc).
- Fix a potential uninitialized variable bug in __swap_entry_free()
- Revise the swap read-ahead changes to avoid a potential race
  condition between swap off and swap out in theory.

v5:

- Per Hillf's comments, fix a locking bug in error path of
  __add_to_swap_cache().  And merge the code to calculate extra_pins
  into can_split_huge_page().

v4:

- Per Johannes' comments, simplified swap cgroup array accessing code.
- Per Kirill and Dave Hansen's comments, used HPAGE_PMD_NR instead of
  HPAGE_SIZE/PAGE_SIZE.
- Per Anshuman's comments, used HPAGE_PMD_NR instead of 512 in patch
  description.

v3:

- Per Andrew's suggestion, used a more systematical way to determine
  whether to enable THP swap optimization
- Per Andrew's comments, moved as much as possible code into
  #ifdef CONFIG_TRANSPARENT_HUGE_PAGE/#endif or "if (PageTransHuge())"
- Fixed some coding style warning.

v2:

- Original [1/11] sent separately and merged
- Use switch in 10/10 per Hiff's suggestion

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-04-25 12:56 [PATCH -mm -v10 0/3] THP swap: Delay splitting THP during swapping out Huang, Ying
@ 2017-04-25 12:56 ` Huang, Ying
  2017-04-27  5:31   ` Minchan Kim
  2017-04-25 12:56 ` [PATCH -mm -v10 2/3] mm, THP, swap: Check whether THP can be split firstly Huang, Ying
  2017-04-25 12:56 ` [PATCH -mm -v10 3/3] mm, THP, swap: Enable THP swap optimization only if has compound map Huang, Ying
  2 siblings, 1 reply; 26+ messages in thread
From: Huang, Ying @ 2017-04-25 12:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Ebru Akagunduz, Johannes Weiner, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel, cgroups

From: Huang Ying <ying.huang@intel.com>

In this patch, splitting huge page is delayed from almost the first
step of swapping out to after allocating the swap space for the
THP (Transparent Huge Page) and adding the THP into the swap cache.
This will batch the corresponding operation, thus improve THP swap out
throughput.

This is the first step for the THP swap optimization.  The plan is to
delay splitting the THP step by step and avoid splitting the THP
finally.

The advantages of the THP swap support include:

- Batch the swap operations for the THP and reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help to improve the THP swap performance.

- The THP swap space read/write will be 2M sequential IO.  It is
  particularly helpful for the swap read, which usually are 4k random
  IO.  This will help to improve the THP swap performance.

- It will help the memory fragmentation, especially when the THP is
  heavily used by the applications.  The 2M continuous pages will be
  free up after the THP swapping out.

- It will improve the THP utilization on the system with the swap
  turned on.  Because the speed for khugepaged to collapse the normal
  pages into the THP is quite slow.  After the THP is split during the
  swapping out, it will take quite long time for the normal pages to
  collapse back into the THP after being swapped in.  The high THP
  utilization helps the efficiency of the page based memory management
  too.

There are some concerns regarding THP swap in, mainly because possible
enlarged read/write IO size (for swap in/out) may put more overhead on
the storage device.  To deal with that, the THP swap in should be
turned on only when necessary.  For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

In this patch, one swap cluster is used to hold the contents of each
THP swapped out.  So, the size of the swap cluster is changed to that
of the THP (Transparent Huge Page) on x86_64 architecture (512).  For
other architectures which want such THP swap optimization,
ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file
for the architecture.  In effect, this will enlarge swap cluster size
by 2 times on x86_64.  Which may make it harder to find a free cluster
when the swap space becomes fragmented.  So that, this may reduce the
continuous swap space allocation and sequential write in theory.  The
performance test in 0day shows no regressions caused by this.

In the future of THP swap optimization, some information of the
swapped out THP (such as compound map count) will be recorded in the
swap_cluster_info data structure.

The mem cgroup swap accounting functions are enhanced to support
charge or uncharge a swap cluster backing a THP as a whole.

The swap cluster allocate/free functions are added to allocate/free a
swap cluster for a THP.  A fair simple algorithm is used for swap
cluster allocation, that is, only the first swap device in priority
list will be tried to allocate the swap cluster.  The function will
fail if the trying is not successful, and the caller will fallback to
allocate a single swap slot instead.  This works good enough for
normal cases.  If the difference of the number of the free swap
clusters among multiple swap devices is significant, it is possible
that some THPs are split earlier than necessary.  For example, this
could be caused by big size difference among multiple swap devices.

The swap cache functions is enhanced to support add/delete THP to/from
the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
enhanced in the future with multi-order radix tree.  But because we
will split the THP soon during swapping out, that optimization doesn't
make much sense for this first step.

The THP splitting functions are enhanced to support to split THP in
swap cache during swapping out.  The page lock will be held during
allocating the swap cluster, adding the THP into the swap cache and
splitting the THP.  So in the code path other than swapping out, if
the THP need to be split, the PageSwapCache(THP) will be always false.

The swap cluster is only available for SSD, so the THP swap
optimization in this patchset has no effect for HDD.

With the patch, the swap out throughput improves 11.5% (from about
3.73GB/s to about 4.16GB/s) in the vm-scalability swap-w-seq test case
with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
device used is a RAM simulated PMEM (persistent memory) device.  To
test the sequential swapping out, the test case creates 8 processes,
which sequentially allocate and write to the anonymous pages until the
RAM and part of the swap device is used up.

[hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: cgroups@vger.kernel.org
Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/x86/Kconfig            |   1 +
 include/linux/page-flags.h  |   7 +-
 include/linux/swap.h        |  25 ++++-
 include/linux/swap_cgroup.h |   6 +-
 mm/Kconfig                  |  12 +++
 mm/huge_memory.c            |  11 +-
 mm/memcontrol.c             |  50 ++++-----
 mm/shmem.c                  |   2 +-
 mm/swap_cgroup.c            |  40 +++++--
 mm/swap_slots.c             |  16 ++-
 mm/swap_state.c             | 114 ++++++++++++--------
 mm/swapfile.c               | 256 ++++++++++++++++++++++++++++++++------------
 12 files changed, 375 insertions(+), 165 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c43f47622440..aaa7cdd601fe 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -72,6 +72,7 @@ config X86
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
+	select ARCH_WANTS_THP_SWAP		if X86_64
 	select BUILDTIME_EXTABLE_SORT
 	select CLKEVT_I8253
 	select CLOCKSOURCE_VALIDATE_LAST_CYCLE
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d6de32..d33e3280c8ad 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -326,11 +326,14 @@ PAGEFLAG_FALSE(HighMem)
 #ifdef CONFIG_SWAP
 static __always_inline int PageSwapCache(struct page *page)
 {
+#ifdef CONFIG_THP_SWAP
+	page = compound_head(page);
+#endif
 	return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);
 
 }
-SETPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
-CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
+CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ba5882419a7d..b60fea3748f8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -29,6 +29,12 @@ struct bio;
 				 SWAP_FLAG_DISCARD_PAGES)
 #define SWAP_BATCH 64
 
+#ifdef CONFIG_THP_SWAP
+#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
+#else
+#define SWAPFILE_CLUSTER	256
+#endif
+
 static inline int current_is_kswapd(void)
 {
 	return current->flags & PF_KSWAPD;
@@ -386,9 +392,9 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t get_swap_page(struct page *page);
 extern swp_entry_t get_swap_page_of_type(int);
-extern int get_swap_pages(int n, swp_entry_t swp_entries[]);
+extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
@@ -515,7 +521,7 @@ static inline int try_to_free_swap(struct page *page)
 	return 0;
 }
 
-static inline swp_entry_t get_swap_page(void)
+static inline swp_entry_t get_swap_page(struct page *page)
 {
 	swp_entry_t entry;
 	entry.val = 0;
@@ -548,7 +554,7 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 #ifdef CONFIG_MEMCG_SWAP
 extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
 extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
-extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages);
 extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg);
 extern bool mem_cgroup_swap_full(struct page *page);
 #else
@@ -562,7 +568,8 @@ static inline int mem_cgroup_try_charge_swap(struct page *page,
 	return 0;
 }
 
-static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
+					    unsigned int nr_pages)
 {
 }
 
@@ -577,5 +584,13 @@ static inline bool mem_cgroup_swap_full(struct page *page)
 }
 #endif
 
+#ifdef CONFIG_THP_SWAP
+extern void swapcache_free_cluster(swp_entry_t entry);
+#else
+static inline void swapcache_free_cluster(swp_entry_t entry)
+{
+}
+#endif
+
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index 145306bdc92f..b2b8ec7bda3f 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -7,7 +7,8 @@
 
 extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new);
-extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
+extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+					 unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
@@ -15,7 +16,8 @@ extern void swap_cgroup_swapoff(int type);
 #else
 
 static inline
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
+unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+				  unsigned int nr_ents)
 {
 	return 0;
 }
diff --git a/mm/Kconfig b/mm/Kconfig
index c89f472b658c..660fb765bf7d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -447,6 +447,18 @@ choice
 	  benefit.
 endchoice
 
+config ARCH_WANTS_THP_SWAP
+       def_bool n
+
+config THP_SWAP
+	def_bool y
+	depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP
+	help
+	  Swap transparent huge pages in one piece, without splitting.
+	  XXX: For now this only does clustered swap space allocation.
+
+	  For selection by architectures with reasonable THP sizes.
+
 config	TRANSPARENT_HUGE_PAGECACHE
 	def_bool y
 	depends on TRANSPARENT_HUGEPAGE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a84909cf20d3..b7c06476590e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2197,7 +2197,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * atomic_set() here would be safe on all archs (and not only on x86),
 	 * it's safer to use atomic_inc()/atomic_add().
 	 */
-	if (PageAnon(head)) {
+	if (PageAnon(head) && !PageSwapCache(head)) {
 		page_ref_inc(page_tail);
 	} else {
 		/* Additional pin to radix tree */
@@ -2208,6 +2208,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	page_tail->flags |= (head->flags &
 			((1L << PG_referenced) |
 			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
@@ -2270,7 +2271,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	ClearPageCompound(head);
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
-		page_ref_inc(head);
+		/* Additional pin to radix tree of swap cache */
+		if (PageSwapCache(head))
+			page_ref_add(head, 2);
+		else
+			page_ref_inc(head);
 	} else {
 		/* Additional pin to radix tree */
 		page_ref_add(head, 2);
@@ -2426,7 +2431,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ret = -EBUSY;
 			goto out;
 		}
-		extra_pins = 0;
+		extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff73899af61a..13441099e0dd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2376,10 +2376,9 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 
 #ifdef CONFIG_MEMCG_SWAP
 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
+				       int nr_entries)
 {
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEMCG_SWAP], val);
+	this_cpu_add(memcg->stat->count[MEMCG_SWAP], nr_entries);
 }
 
 /**
@@ -2405,8 +2404,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 	new_id = mem_cgroup_id(to);
 
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_swap_statistics(to, true);
+		mem_cgroup_swap_statistics(from, -1);
+		mem_cgroup_swap_statistics(to, 1);
 		return 0;
 	}
 	return -EINVAL;
@@ -5445,7 +5444,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		 * let's not wait for it.  The page already received a
 		 * memory+swap charge, drop the swap entry duplicate.
 		 */
-		mem_cgroup_uncharge_swap(entry);
+		mem_cgroup_uncharge_swap(entry, nr_pages);
 	}
 }
 
@@ -5873,9 +5872,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * ancestor for the swap instead and transfer the memory+swap charge.
 	 */
 	swap_memcg = mem_cgroup_id_get_online(memcg);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg));
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 1);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(swap_memcg, true);
+	mem_cgroup_swap_statistics(swap_memcg, 1);
 
 	page->mem_cgroup = NULL;
 
@@ -5902,19 +5901,20 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 		css_put(&memcg->css);
 }
 
-/*
- * mem_cgroup_try_charge_swap - try charging a swap entry
+/**
+ * mem_cgroup_try_charge_swap - try charging swap space for a page
  * @page: page being added to swap
  * @entry: swap entry to charge
  *
- * Try to charge @entry to the memcg that @page belongs to.
+ * Try to charge @page's memcg for the swap space at @entry.
  *
  * Returns 0 on success, -ENOMEM on failure.
  */
 int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 {
-	struct mem_cgroup *memcg;
+	unsigned int nr_pages = hpage_nr_pages(page);
 	struct page_counter *counter;
+	struct mem_cgroup *memcg;
 	unsigned short oldid;
 
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) || !do_swap_account)
@@ -5929,25 +5929,27 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	memcg = mem_cgroup_id_get_online(memcg);
 
 	if (!mem_cgroup_is_root(memcg) &&
-	    !page_counter_try_charge(&memcg->swap, 1, &counter)) {
+	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
 		mem_cgroup_id_put(memcg);
 		return -ENOMEM;
 	}
 
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
+	/* Get references for the tail pages, too */
+	if (nr_pages > 1)
+		mem_cgroup_id_get_many(memcg, nr_pages - 1);
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_pages);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(memcg, true);
+	mem_cgroup_swap_statistics(memcg, nr_pages);
 
 	return 0;
 }
 
 /**
- * mem_cgroup_uncharge_swap - uncharge a swap entry
+ * mem_cgroup_uncharge_swap - uncharge swap space
  * @entry: swap entry to uncharge
- *
- * Drop the swap charge associated with @entry.
+ * @nr_pages: the amount of swap space to uncharge
  */
-void mem_cgroup_uncharge_swap(swp_entry_t entry)
+void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 {
 	struct mem_cgroup *memcg;
 	unsigned short id;
@@ -5955,18 +5957,18 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
 	if (!do_swap_account)
 		return;
 
-	id = swap_cgroup_record(entry, 0);
+	id = swap_cgroup_record(entry, 0, nr_pages);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
 		if (!mem_cgroup_is_root(memcg)) {
 			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
-				page_counter_uncharge(&memcg->swap, 1);
+				page_counter_uncharge(&memcg->swap, nr_pages);
 			else
-				page_counter_uncharge(&memcg->memsw, 1);
+				page_counter_uncharge(&memcg->memsw, nr_pages);
 		}
-		mem_cgroup_swap_statistics(memcg, false);
-		mem_cgroup_id_put(memcg);
+		mem_cgroup_swap_statistics(memcg, -nr_pages);
+		mem_cgroup_id_put_many(memcg, nr_pages);
 	}
 	rcu_read_unlock();
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index e67d6ba4e98e..29948d7da172 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1290,7 +1290,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		SetPageUptodate(page);
 	}
 
-	swap = get_swap_page();
+	swap = get_swap_page(page);
 	if (!swap.val)
 		goto redirty;
 
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index ac6318a064d3..aee4b36d5207 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -58,21 +58,27 @@ static int swap_cgroup_prepare(int type)
 	return -ENOMEM;
 }
 
+static struct swap_cgroup *__lookup_swap_cgroup(struct swap_cgroup_ctrl *ctrl,
+						pgoff_t offset)
+{
+	struct page *mappage;
+	struct swap_cgroup *sc;
+
+	mappage = ctrl->map[offset / SC_PER_PAGE];
+	sc = page_address(mappage);
+	return sc + offset % SC_PER_PAGE;
+}
+
 static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
 					struct swap_cgroup_ctrl **ctrlp)
 {
 	pgoff_t offset = swp_offset(ent);
 	struct swap_cgroup_ctrl *ctrl;
-	struct page *mappage;
-	struct swap_cgroup *sc;
 
 	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
 	if (ctrlp)
 		*ctrlp = ctrl;
-
-	mappage = ctrl->map[offset / SC_PER_PAGE];
-	sc = page_address(mappage);
-	return sc + offset % SC_PER_PAGE;
+	return __lookup_swap_cgroup(ctrl, offset);
 }
 
 /**
@@ -105,25 +111,39 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 }
 
 /**
- * swap_cgroup_record - record mem_cgroup for this swp_entry.
- * @ent: swap entry to be recorded into
+ * swap_cgroup_record - record mem_cgroup for a set of swap entries
+ * @ent: the first swap entry to be recorded into
  * @id: mem_cgroup to be recorded
+ * @nr_ents: number of swap entries to be recorded
  *
  * Returns old value at success, 0 at failure.
  * (Of course, old value can be 0.)
  */
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
+unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+				  unsigned int nr_ents)
 {
 	struct swap_cgroup_ctrl *ctrl;
 	struct swap_cgroup *sc;
 	unsigned short old;
 	unsigned long flags;
+	pgoff_t offset = swp_offset(ent);
+	pgoff_t end = offset + nr_ents;
 
 	sc = lookup_swap_cgroup(ent, &ctrl);
 
 	spin_lock_irqsave(&ctrl->lock, flags);
 	old = sc->id;
-	sc->id = id;
+	for (;;) {
+		VM_BUG_ON(sc->id != old);
+		sc->id = id;
+		offset++;
+		if (offset == end)
+			break;
+		if (offset % SC_PER_PAGE)
+			sc++;
+		else
+			sc = __lookup_swap_cgroup(ctrl, offset);
+	}
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	return old;
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 58f6c78f1dad..eb7524f8296d 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -263,7 +263,8 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
 
 	cache->cur = 0;
 	if (swap_slot_cache_active)
-		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots);
+		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, false,
+					   cache->slots);
 
 	return cache->nr;
 }
@@ -301,11 +302,19 @@ int free_swap_slot(swp_entry_t entry)
 	return 0;
 }
 
-swp_entry_t get_swap_page(void)
+swp_entry_t get_swap_page(struct page *page)
 {
 	swp_entry_t entry, *pentry;
 	struct swap_slots_cache *cache;
 
+	entry.val = 0;
+
+	if (PageTransHuge(page)) {
+		if (hpage_nr_pages(page) == SWAPFILE_CLUSTER)
+			get_swap_pages(1, true, &entry);
+		return entry;
+	}
+
 	/*
 	 * Preemption is allowed here, because we may sleep
 	 * in refill_swap_slots_cache().  But it is safe, because
@@ -317,7 +326,6 @@ swp_entry_t get_swap_page(void)
 	 */
 	cache = raw_cpu_ptr(&swp_slots);
 
-	entry.val = 0;
 	if (check_cache_active()) {
 		mutex_lock(&cache->alloc_lock);
 		if (cache->slots) {
@@ -337,7 +345,7 @@ swp_entry_t get_swap_page(void)
 			return entry;
 	}
 
-	get_swap_pages(1, &entry);
+	get_swap_pages(1, false, &entry);
 
 	return entry;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 539b8885e3d1..16ff89d058f4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -19,6 +19,7 @@
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
 #include <linux/swap_slots.h>
+#include <linux/huge_mm.h>
 
 #include <asm/pgtable.h>
 
@@ -38,6 +39,7 @@ struct address_space *swapper_spaces[MAX_SWAPFILES];
 static unsigned int nr_swapper_spaces[MAX_SWAPFILES];
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
+#define ADD_CACHE_INFO(x, nr)	do { swap_cache_info.x += (nr); } while (0)
 
 static struct {
 	unsigned long add_total;
@@ -90,39 +92,46 @@ void show_swap_cache_info(void)
  */
 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
-	int error;
+	int error, i, nr = hpage_nr_pages(page);
 	struct address_space *address_space;
+	pgoff_t idx = swp_offset(entry);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
 	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
-	get_page(page);
+	page_ref_add(page, nr);
 	SetPageSwapCache(page);
-	set_page_private(page, entry.val);
 
 	address_space = swap_address_space(entry);
 	spin_lock_irq(&address_space->tree_lock);
-	error = radix_tree_insert(&address_space->page_tree,
-				  swp_offset(entry), page);
-	if (likely(!error)) {
-		address_space->nrpages++;
-		__inc_node_page_state(page, NR_FILE_PAGES);
-		INC_CACHE_INFO(add_total);
+	for (i = 0; i < nr; i++) {
+		set_page_private(page + i, entry.val + i);
+		error = radix_tree_insert(&address_space->page_tree,
+					  idx + i, page + i);
+		if (unlikely(error))
+			break;
 	}
-	spin_unlock_irq(&address_space->tree_lock);
-
-	if (unlikely(error)) {
+	if (likely(!error)) {
+		address_space->nrpages += nr;
+		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
+		ADD_CACHE_INFO(add_total, nr);
+	} else {
 		/*
 		 * Only the context which have set SWAP_HAS_CACHE flag
 		 * would call add_to_swap_cache().
 		 * So add_to_swap_cache() doesn't returns -EEXIST.
 		 */
 		VM_BUG_ON(error == -EEXIST);
-		set_page_private(page, 0UL);
+		set_page_private(page + i, 0UL);
+		while (i--) {
+			radix_tree_delete(&address_space->page_tree, idx + i);
+			set_page_private(page + i, 0UL);
+		}
 		ClearPageSwapCache(page);
-		put_page(page);
+		page_ref_sub(page, nr);
 	}
+	spin_unlock_irq(&address_space->tree_lock);
 
 	return error;
 }
@@ -132,7 +141,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
 {
 	int error;
 
-	error = radix_tree_maybe_preload(gfp_mask);
+	error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page));
 	if (!error) {
 		error = __add_to_swap_cache(page, entry);
 		radix_tree_preload_end();
@@ -146,8 +155,10 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
  */
 void __delete_from_swap_cache(struct page *page)
 {
-	swp_entry_t entry;
 	struct address_space *address_space;
+	int i, nr = hpage_nr_pages(page);
+	swp_entry_t entry;
+	pgoff_t idx;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
@@ -155,12 +166,15 @@ void __delete_from_swap_cache(struct page *page)
 
 	entry.val = page_private(page);
 	address_space = swap_address_space(entry);
-	radix_tree_delete(&address_space->page_tree, swp_offset(entry));
-	set_page_private(page, 0);
+	idx = swp_offset(entry);
+	for (i = 0; i < nr; i++) {
+		radix_tree_delete(&address_space->page_tree, idx + i);
+		set_page_private(page + i, 0);
+	}
 	ClearPageSwapCache(page);
-	address_space->nrpages--;
-	__dec_node_page_state(page, NR_FILE_PAGES);
-	INC_CACHE_INFO(del_total);
+	address_space->nrpages -= nr;
+	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
+	ADD_CACHE_INFO(del_total, nr);
 }
 
 /**
@@ -178,20 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
-	entry = get_swap_page();
+retry:
+	entry = get_swap_page(page);
 	if (!entry.val)
-		return 0;
-
-	if (mem_cgroup_try_charge_swap(page, entry)) {
-		swapcache_free(entry);
-		return 0;
-	}
-
-	if (unlikely(PageTransHuge(page)))
-		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry);
-			return 0;
-		}
+		goto fail;
+	if (mem_cgroup_try_charge_swap(page, entry))
+		goto fail_free;
 
 	/*
 	 * Radix-tree node allocations from PF_MEMALLOC contexts could
@@ -206,17 +212,33 @@ int add_to_swap(struct page *page, struct list_head *list)
 	 */
 	err = add_to_swap_cache(page, entry,
 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
-
-	if (!err) {
-		return 1;
-	} else {	/* -ENOMEM radix-tree allocation failure */
+	/* -ENOMEM radix-tree allocation failure */
+	if (err)
 		/*
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry);
-		return 0;
+		goto fail_free;
+
+	if (PageTransHuge(page)) {
+		err = split_huge_page_to_list(page, list);
+		if (err) {
+			delete_from_swap_cache(page);
+			return 0;
+		}
 	}
+
+	return 1;
+
+fail_free:
+	if (PageTransHuge(page))
+		swapcache_free_cluster(entry);
+	else
+		swapcache_free(entry);
+fail:
+	if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
+		goto retry;
+	return 0;
 }
 
 /*
@@ -237,8 +259,12 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry);
-	put_page(page);
+	if (PageTransHuge(page))
+		swapcache_free_cluster(entry);
+	else
+		swapcache_free(entry);
+
+	page_ref_sub(page, hpage_nr_pages(page));
 }
 
 /* 
@@ -295,7 +321,7 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 
 	page = find_get_page(swap_address_space(entry), swp_offset(entry));
 
-	if (page) {
+	if (page && likely(!PageTransCompound(page))) {
 		INC_CACHE_INFO(find_success);
 		if (TestClearPageReadahead(page))
 			atomic_inc(&swapin_readahead_hits);
@@ -506,7 +532,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
-		if (offset != entry_offset)
+		if (offset != entry_offset && likely(!PageTransCompound(page)))
 			SetPageReadahead(page);
 		put_page(page);
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f23c56e9be39..596306272059 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -200,7 +200,6 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 	}
 }
 
-#define SWAPFILE_CLUSTER	256
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
@@ -375,6 +374,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	schedule_work(&si->discard_work);
 }
 
+static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
+	cluster_list_add_tail(&si->free_clusters, ci, idx);
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
  * will be added to free cluster list. caller should hold si->lock.
@@ -395,10 +402,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 
 		spin_lock(&si->lock);
 		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
-		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
-		unlock_cluster(ci);
-		cluster_list_add_tail(&si->free_clusters, info, idx);
-		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
+		__free_cluster(si, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 		unlock_cluster(ci);
@@ -416,6 +420,34 @@ static void swap_discard_work(struct work_struct *work)
 	spin_unlock(&si->lock);
 }
 
+static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
+	cluster_list_del_first(&si->free_clusters, ci);
+	cluster_set_count_flag(ci + idx, 0, 0);
+}
+
+static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+
+	VM_BUG_ON(cluster_count(ci) != 0);
+	/*
+	 * If the swap is discardable, prepare discard the cluster
+	 * instead of free it immediately. The cluster will be freed
+	 * after discard.
+	 */
+	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
+	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
+		swap_cluster_schedule_discard(si, idx);
+		return;
+	}
+
+	__free_cluster(si, idx);
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
  * removed from free cluster list and its usage counter will be increased.
@@ -427,11 +459,8 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx])) {
-		VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx);
-		cluster_list_del_first(&p->free_clusters, cluster_info);
-		cluster_set_count_flag(&cluster_info[idx], 0, 0);
-	}
+	if (cluster_is_free(&cluster_info[idx]))
+		alloc_cluster(p, idx);
 
 	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
@@ -455,21 +484,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 	cluster_set_count(&cluster_info[idx],
 		cluster_count(&cluster_info[idx]) - 1);
 
-	if (cluster_count(&cluster_info[idx]) == 0) {
-		/*
-		 * If the swap is discardable, prepare discard the cluster
-		 * instead of free it immediately. The cluster will be freed
-		 * after discard.
-		 */
-		if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
-				 (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-			swap_cluster_schedule_discard(p, idx);
-			return;
-		}
-
-		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
-	}
+	if (cluster_count(&cluster_info[idx]) == 0)
+		free_cluster(p, idx);
 }
 
 /*
@@ -559,6 +575,60 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	return found_free;
 }
 
+static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+			     unsigned int nr_entries)
+{
+	unsigned int end = offset + nr_entries - 1;
+
+	if (offset == si->lowest_bit)
+		si->lowest_bit += nr_entries;
+	if (end == si->highest_bit)
+		si->highest_bit -= nr_entries;
+	si->inuse_pages += nr_entries;
+	if (si->inuse_pages == si->pages) {
+		si->lowest_bit = si->max;
+		si->highest_bit = 0;
+		spin_lock(&swap_avail_lock);
+		plist_del(&si->avail_list, &swap_avail_head);
+		spin_unlock(&swap_avail_lock);
+	}
+}
+
+static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
+			    unsigned int nr_entries)
+{
+	unsigned long end = offset + nr_entries - 1;
+	void (*swap_slot_free_notify)(struct block_device *, unsigned long);
+
+	if (offset < si->lowest_bit)
+		si->lowest_bit = offset;
+	if (end > si->highest_bit) {
+		bool was_full = !si->highest_bit;
+
+		si->highest_bit = end;
+		if (was_full && (si->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&si->avail_list));
+			if (plist_node_empty(&si->avail_list))
+				plist_add(&si->avail_list, &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
+		}
+	}
+	atomic_long_add(nr_entries, &nr_swap_pages);
+	si->inuse_pages -= nr_entries;
+	if (si->flags & SWP_BLKDEV)
+		swap_slot_free_notify =
+			si->bdev->bd_disk->fops->swap_slot_free_notify;
+	else
+		swap_slot_free_notify = NULL;
+	while (offset <= end) {
+		frontswap_invalidate_page(si->type, offset);
+		if (swap_slot_free_notify)
+			swap_slot_free_notify(si->bdev, offset);
+		offset++;
+	}
+}
+
 static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
 			       swp_entry_t slots[])
@@ -677,18 +747,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	unlock_cluster(ci);
 
-	if (offset == si->lowest_bit)
-		si->lowest_bit++;
-	if (offset == si->highest_bit)
-		si->highest_bit--;
-	si->inuse_pages++;
-	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
-		spin_lock(&swap_avail_lock);
-		plist_del(&si->avail_list, &swap_avail_head);
-		spin_unlock(&swap_avail_lock);
-	}
+	swap_range_alloc(si, offset, 1);
 	si->cluster_next = offset + 1;
 	slots[n_ret++] = swp_entry(si->type, offset);
 
@@ -767,6 +826,52 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }
 
+#ifdef CONFIG_THP_SWAP
+static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+{
+	unsigned long idx;
+	struct swap_cluster_info *ci;
+	unsigned long offset, i;
+	unsigned char *map;
+
+	if (cluster_list_empty(&si->free_clusters))
+		return 0;
+
+	idx = cluster_list_first(&si->free_clusters);
+	offset = idx * SWAPFILE_CLUSTER;
+	ci = lock_cluster(si, offset);
+	alloc_cluster(si, idx);
+	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++)
+		map[i] = SWAP_HAS_CACHE;
+	unlock_cluster(ci);
+	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
+	*slot = swp_entry(si->type, offset);
+
+	return 1;
+}
+
+static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	unsigned long offset = idx * SWAPFILE_CLUSTER;
+	struct swap_cluster_info *ci;
+
+	ci = lock_cluster(si, offset);
+	cluster_set_count_flag(ci, 0, 0);
+	free_cluster(si, idx);
+	unlock_cluster(ci);
+	swap_range_free(si, offset, SWAPFILE_CLUSTER);
+}
+#else
+static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+{
+	VM_WARN_ON_ONCE(1);
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
@@ -782,13 +887,17 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 }
 
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
+int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[])
 {
+	unsigned long nr_pages = cluster ? SWAPFILE_CLUSTER : 1;
 	struct swap_info_struct *si, *next;
 	long avail_pgs;
 	int n_ret = 0;
 
-	avail_pgs = atomic_long_read(&nr_swap_pages);
+	/* Only single cluster request supported */
+	WARN_ON_ONCE(n_goal > 1 && cluster);
+
+	avail_pgs = atomic_long_read(&nr_swap_pages) / nr_pages;
 	if (avail_pgs <= 0)
 		goto noswap;
 
@@ -798,7 +907,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 	if (n_goal > avail_pgs)
 		n_goal = avail_pgs;
 
-	atomic_long_sub(n_goal, &nr_swap_pages);
+	atomic_long_sub(n_goal * nr_pages, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
 
@@ -824,10 +933,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					    n_goal, swp_entries);
+		if (likely(cluster))
+			n_ret = swap_alloc_cluster(si, swp_entries);
+		else
+			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret)
+		if (n_ret || unlikely(cluster))
 			goto check_out;
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
 			si->type);
@@ -853,7 +965,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 
 check_out:
 	if (n_ret < n_goal)
-		atomic_long_add((long) (n_goal-n_ret), &nr_swap_pages);
+		atomic_long_add((long)(n_goal - n_ret) * nr_pages,
+				&nr_swap_pages);
 noswap:
 	return n_ret;
 }
@@ -1009,32 +1122,8 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
 	dec_cluster_info_page(p, p->cluster_info, offset);
 	unlock_cluster(ci);
 
-	mem_cgroup_uncharge_swap(entry);
-	if (offset < p->lowest_bit)
-		p->lowest_bit = offset;
-	if (offset > p->highest_bit) {
-		bool was_full = !p->highest_bit;
-
-		p->highest_bit = offset;
-		if (was_full && (p->flags & SWP_WRITEOK)) {
-			spin_lock(&swap_avail_lock);
-			WARN_ON(!plist_node_empty(&p->avail_list));
-			if (plist_node_empty(&p->avail_list))
-				plist_add(&p->avail_list,
-					  &swap_avail_head);
-			spin_unlock(&swap_avail_lock);
-		}
-	}
-	atomic_long_inc(&nr_swap_pages);
-	p->inuse_pages--;
-	frontswap_invalidate_page(p->type, offset);
-	if (p->flags & SWP_BLKDEV) {
-		struct gendisk *disk = p->bdev->bd_disk;
-
-		if (disk->fops->swap_slot_free_notify)
-			disk->fops->swap_slot_free_notify(p->bdev,
-							  offset);
-	}
+	mem_cgroup_uncharge_swap(entry, 1);
+	swap_range_free(p, offset, 1);
 }
 
 /*
@@ -1066,6 +1155,33 @@ void swapcache_free(swp_entry_t entry)
 	}
 }
 
+#ifdef CONFIG_THP_SWAP
+void swapcache_free_cluster(swp_entry_t entry)
+{
+	unsigned long offset = swp_offset(entry);
+	unsigned long idx = offset / SWAPFILE_CLUSTER;
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *si;
+	unsigned char *map;
+	unsigned int i;
+
+	si = swap_info_get(entry);
+	if (!si)
+		return;
+
+	ci = lock_cluster(si, offset);
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
+		map[i] = 0;
+	}
+	unlock_cluster(ci);
+	mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
+	swap_free_cluster(si, idx);
+	spin_unlock(&si->lock);
+}
+#endif /* CONFIG_THP_SWAP */
+
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH -mm -v10 2/3] mm, THP, swap: Check whether THP can be split firstly
  2017-04-25 12:56 [PATCH -mm -v10 0/3] THP swap: Delay splitting THP during swapping out Huang, Ying
  2017-04-25 12:56 ` [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
@ 2017-04-25 12:56 ` Huang, Ying
  2017-04-25 21:43   ` Johannes Weiner
  2017-04-25 12:56 ` [PATCH -mm -v10 3/3] mm, THP, swap: Enable THP swap optimization only if has compound map Huang, Ying
  2 siblings, 1 reply; 26+ messages in thread
From: Huang, Ying @ 2017-04-25 12:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner

From: Huang Ying <ying.huang@intel.com>

To swap out THP (Transparent Huage Page), before splitting the THP,
the swap cluster will be allocated and the THP will be added into the
swap cache.  But it is possible that the THP cannot be split, so that
we must delete the THP from the swap cache and free the swap cluster.
To avoid that, in this patch, whether the THP can be split is checked
firstly.  The check can only be done racy, but it is good enough for
most cases.

With the patch, the swap out throughput improves 3.6% (from about
4.16GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
device used is a RAM simulated PMEM (persistent memory) device.  To
test the sequential swapping out, the test case creates 8 processes,
which sequentially allocate and write to the anonymous pages until the
RAM and part of the swap device is used up.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for can_split_huge_page()]
---
 include/linux/huge_mm.h |  7 +++++++
 mm/huge_memory.c        | 20 ++++++++++++++++----
 mm/swap_state.c         |  4 ++++
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a3762d49ba39..d3b3e8fcc717 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -113,6 +113,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 extern void prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 
+bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
@@ -231,6 +232,12 @@ static inline void prep_transhuge_page(struct page *page) {}
 
 #define thp_get_unmapped_area	NULL
 
+static inline bool
+can_split_huge_page(struct page *page, int *pextra_pins)
+{
+	BUILD_BUG();
+	return false;
+}
 static inline int
 split_huge_page_to_list(struct page *page, struct list_head *list)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7c06476590e..3a14c77fcce7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2384,6 +2384,21 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 	return ret;
 }
 
+/* Racy check whether the huge page can be split */
+bool can_split_huge_page(struct page *page, int *pextra_pins)
+{
+	int extra_pins;
+
+	/* Additional pins from radix tree */
+	if (PageAnon(page))
+		extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
+	else
+		extra_pins = HPAGE_PMD_NR;
+	if (pextra_pins)
+		*pextra_pins = extra_pins;
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
 /*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
@@ -2431,7 +2446,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ret = -EBUSY;
 			goto out;
 		}
-		extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
@@ -2443,8 +2457,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			goto out;
 		}
 
-		/* Addidional pins from radix tree */
-		extra_pins = HPAGE_PMD_NR;
 		anon_vma = NULL;
 		i_mmap_lock_read(mapping);
 	}
@@ -2453,7 +2465,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * Racy check if we can split the page, before freeze_page() will
 	 * split PMDs
 	 */
-	if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
+	if (!can_split_huge_page(head, &extra_pins)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 16ff89d058f4..006d91d8fc53 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -192,6 +192,10 @@ int add_to_swap(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
+	/* cannot split, skip it */
+	if (PageTransHuge(page) && !can_split_huge_page(page, NULL))
+		return 0;
+
 retry:
 	entry = get_swap_page(page);
 	if (!entry.val)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH -mm -v10 3/3] mm, THP, swap: Enable THP swap optimization only if has compound map
  2017-04-25 12:56 [PATCH -mm -v10 0/3] THP swap: Delay splitting THP during swapping out Huang, Ying
  2017-04-25 12:56 ` [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
  2017-04-25 12:56 ` [PATCH -mm -v10 2/3] mm, THP, swap: Check whether THP can be split firstly Huang, Ying
@ 2017-04-25 12:56 ` Huang, Ying
  2017-04-25 21:46   ` Johannes Weiner
  2 siblings, 1 reply; 26+ messages in thread
From: Huang, Ying @ 2017-04-25 12:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner

From: Huang Ying <ying.huang@intel.com>

If there is no compound map for a THP (Transparent Huge Page), it is
possible that the map count of some sub-pages of the THP is 0.  So it
is better to split the THP before swapping out. In this way, the
sub-pages not mapped will be freed, and we can avoid the unnecessary
swap out operations for these sub-pages.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swap_state.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 006d91d8fc53..13f83c6bb1b4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -192,9 +192,19 @@ int add_to_swap(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
-	/* cannot split, skip it */
-	if (PageTransHuge(page) && !can_split_huge_page(page, NULL))
-		return 0;
+	if (PageTransHuge(page)) {
+		/* cannot split, skip it */
+		if (!can_split_huge_page(page, NULL))
+			return 0;
+		/*
+		 * Split pages without a PMD map right away. Chances
+		 * are some or all of the tail pages can be freed
+		 * without IO.
+		 */
+		if (!compound_mapcount(page) &&
+		    split_huge_page_to_list(page, list))
+			return 0;
+	}
 
 retry:
 	entry = get_swap_page(page);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 2/3] mm, THP, swap: Check whether THP can be split firstly
  2017-04-25 12:56 ` [PATCH -mm -v10 2/3] mm, THP, swap: Check whether THP can be split firstly Huang, Ying
@ 2017-04-25 21:43   ` Johannes Weiner
  0 siblings, 0 replies; 26+ messages in thread
From: Johannes Weiner @ 2017-04-25 21:43 UTC (permalink / raw)
  To: Huang, Ying; +Cc: Andrew Morton, linux-mm, linux-kernel

On Tue, Apr 25, 2017 at 08:56:57PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> To swap out THP (Transparent Huage Page), before splitting the THP,
> the swap cluster will be allocated and the THP will be added into the
> swap cache.  But it is possible that the THP cannot be split, so that
> we must delete the THP from the swap cache and free the swap cluster.
> To avoid that, in this patch, whether the THP can be split is checked
> firstly.  The check can only be done racy, but it is good enough for
> most cases.
> 
> With the patch, the swap out throughput improves 3.6% (from about
> 4.16GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
> with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
> device used is a RAM simulated PMEM (persistent memory) device.  To
> test the sequential swapping out, the test case creates 8 processes,
> which sequentially allocate and write to the anonymous pages until the
> RAM and part of the swap device is used up.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for can_split_huge_page()]

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 3/3] mm, THP, swap: Enable THP swap optimization only if has compound map
  2017-04-25 12:56 ` [PATCH -mm -v10 3/3] mm, THP, swap: Enable THP swap optimization only if has compound map Huang, Ying
@ 2017-04-25 21:46   ` Johannes Weiner
  2017-04-28 13:16     ` Kirill A. Shutemov
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Weiner @ 2017-04-25 21:46 UTC (permalink / raw)
  To: Huang, Ying; +Cc: Andrew Morton, Kirill A. Shutemov, linux-mm, linux-kernel

On Tue, Apr 25, 2017 at 08:56:58PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> If there is no compound map for a THP (Transparent Huge Page), it is
> possible that the map count of some sub-pages of the THP is 0.  So it
> is better to split the THP before swapping out. In this way, the
> sub-pages not mapped will be freed, and we can avoid the unnecessary
> swap out operations for these sub-pages.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

CC Kirill to double check the reasoning here

> ---
>  mm/swap_state.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 006d91d8fc53..13f83c6bb1b4 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -192,9 +192,19 @@ int add_to_swap(struct page *page, struct list_head *list)
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	VM_BUG_ON_PAGE(!PageUptodate(page), page);
>  
> -	/* cannot split, skip it */
> -	if (PageTransHuge(page) && !can_split_huge_page(page, NULL))
> -		return 0;
> +	if (PageTransHuge(page)) {
> +		/* cannot split, skip it */
> +		if (!can_split_huge_page(page, NULL))
> +			return 0;
> +		/*
> +		 * Split pages without a PMD map right away. Chances
> +		 * are some or all of the tail pages can be freed
> +		 * without IO.
> +		 */
> +		if (!compound_mapcount(page) &&
> +		    split_huge_page_to_list(page, list))
> +			return 0;
> +	}
>  
>  retry:
>  	entry = get_swap_page(page);
> -- 
> 2.11.0

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-04-25 12:56 ` [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
@ 2017-04-27  5:31   ` Minchan Kim
  2017-04-27  7:12     ` Huang, Ying
  0 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2017-04-27  5:31 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Ebru Akagunduz, Johannes Weiner, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

On Tue, Apr 25, 2017 at 08:56:56PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> In this patch, splitting huge page is delayed from almost the first
> step of swapping out to after allocating the swap space for the
> THP (Transparent Huge Page) and adding the THP into the swap cache.
> This will batch the corresponding operation, thus improve THP swap out
> throughput.
> 
> This is the first step for the THP swap optimization.  The plan is to
> delay splitting the THP step by step and avoid splitting the THP
> finally.
> 
> The advantages of the THP swap support include:
> 
> - Batch the swap operations for the THP and reduce lock
>   acquiring/releasing, including allocating/freeing the swap space,
>   adding/deleting to/from the swap cache, and writing/reading the swap
>   space, etc.  This will help to improve the THP swap performance.
> 
> - The THP swap space read/write will be 2M sequential IO.  It is
>   particularly helpful for the swap read, which usually are 4k random
>   IO.  This will help to improve the THP swap performance.
> 
> - It will help the memory fragmentation, especially when the THP is
>   heavily used by the applications.  The 2M continuous pages will be
>   free up after the THP swapping out.
> 
> - It will improve the THP utilization on the system with the swap
>   turned on.  Because the speed for khugepaged to collapse the normal
>   pages into the THP is quite slow.  After the THP is split during the
>   swapping out, it will take quite long time for the normal pages to
>   collapse back into the THP after being swapped in.  The high THP
>   utilization helps the efficiency of the page based memory management
>   too.
> 
> There are some concerns regarding THP swap in, mainly because possible
> enlarged read/write IO size (for swap in/out) may put more overhead on
> the storage device.  To deal with that, the THP swap in should be
> turned on only when necessary.  For example, it can be selected via
> "always/never/madvise" logic, to be turned on globally, turned off
> globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
> 
> In this patch, one swap cluster is used to hold the contents of each
> THP swapped out.  So, the size of the swap cluster is changed to that
> of the THP (Transparent Huge Page) on x86_64 architecture (512).  For
> other architectures which want such THP swap optimization,
> ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file
> for the architecture.  In effect, this will enlarge swap cluster size
> by 2 times on x86_64.  Which may make it harder to find a free cluster
> when the swap space becomes fragmented.  So that, this may reduce the
> continuous swap space allocation and sequential write in theory.  The
> performance test in 0day shows no regressions caused by this.

What about other architecures?

I mean THP page size on every architectures would be various.
If THP page size is much bigger than 2M, the architecture should
have big swap cluster size for supporting THP swap-out feature.
It means fast empty-swap cluster consumption so that it can suffer
from fragmentation easily which causes THP swap void and swap slot
allocations slow due to not being able to use per-cpu.

What I suggested was contiguous multiple swap cluster allocations
to meet THP page size. If some of architecure's THP size is 64M
and SWAP_CLUSTER_SIZE is 2M, it should allocate 32 contiguos
swap clusters. For that, swap layer need to manage clusters sort
in order which would be more overhead in CONFIG_THP_SWAP case
but I think it's tradeoff. With that, every architectures can
support THP swap easily without arch-specific something.

If (PAGE_SIZE * 512) swap cluster size were okay for most of
architecture, just increase it. It's orthogonal work regardless of
THP swapout. Then, we don't need to manage swap clusters sort
in order in x86_64 which SWAP_CLUSTER_SIZE is equal to
THP_PAGE_SIZE. It's just a bonus by side-effect.

AFAIR, I suggested it but cannot remember why we cannot go with
this way.

> 
> In the future of THP swap optimization, some information of the
> swapped out THP (such as compound map count) will be recorded in the
> swap_cluster_info data structure.
> 
> The mem cgroup swap accounting functions are enhanced to support
> charge or uncharge a swap cluster backing a THP as a whole.
> 
> The swap cluster allocate/free functions are added to allocate/free a
> swap cluster for a THP.  A fair simple algorithm is used for swap
> cluster allocation, that is, only the first swap device in priority
> list will be tried to allocate the swap cluster.  The function will
> fail if the trying is not successful, and the caller will fallback to
> allocate a single swap slot instead.  This works good enough for
> normal cases.  If the difference of the number of the free swap
> clusters among multiple swap devices is significant, it is possible
> that some THPs are split earlier than necessary.  For example, this
> could be caused by big size difference among multiple swap devices.
> 
> The swap cache functions is enhanced to support add/delete THP to/from
> the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
> enhanced in the future with multi-order radix tree.  But because we
> will split the THP soon during swapping out, that optimization doesn't
> make much sense for this first step.
> 
> The THP splitting functions are enhanced to support to split THP in
> swap cache during swapping out.  The page lock will be held during
> allocating the swap cluster, adding the THP into the swap cache and
> splitting the THP.  So in the code path other than swapping out, if
> the THP need to be split, the PageSwapCache(THP) will be always false.
> 
> The swap cluster is only available for SSD, so the THP swap
> optimization in this patchset has no effect for HDD.
> 
> With the patch, the swap out throughput improves 11.5% (from about
> 3.73GB/s to about 4.16GB/s) in the vm-scalability swap-w-seq test case
> with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
> device used is a RAM simulated PMEM (persistent memory) device.  To
> test the sequential swapping out, the test case creates 8 processes,
> which sequentially allocate and write to the anonymous pages until the
> RAM and part of the swap device is used up.
> 
> [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: cgroups@vger.kernel.org
> Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  arch/x86/Kconfig            |   1 +
>  include/linux/page-flags.h  |   7 +-
>  include/linux/swap.h        |  25 ++++-
>  include/linux/swap_cgroup.h |   6 +-
>  mm/Kconfig                  |  12 +++
>  mm/huge_memory.c            |  11 +-
>  mm/memcontrol.c             |  50 ++++-----
>  mm/shmem.c                  |   2 +-
>  mm/swap_cgroup.c            |  40 +++++--
>  mm/swap_slots.c             |  16 ++-
>  mm/swap_state.c             | 114 ++++++++++++--------
>  mm/swapfile.c               | 256 ++++++++++++++++++++++++++++++++------------
>  12 files changed, 375 insertions(+), 165 deletions(-)

< snip >

> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1290,7 +1290,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  		SetPageUptodate(page);
>  	}
>  
> -	swap = get_swap_page();
> +	swap = get_swap_page(page);
>  	if (!swap.val)
>  		goto redirty;
>  

If swap is non-ssd, swap.val could be zero. Right?
If so, could we retry like anonymous page swapout?

>  
> -swp_entry_t get_swap_page(void)
> +swp_entry_t get_swap_page(struct page *page)
>  {
>  	swp_entry_t entry, *pentry;
>  	struct swap_slots_cache *cache;
>  
> +	entry.val = 0;
> +
> +	if (PageTransHuge(page)) {
> +		if (hpage_nr_pages(page) == SWAPFILE_CLUSTER)
> +			get_swap_pages(1, true, &entry);
> +		return entry;
> +	}
> +


< snip >

>  /**
> @@ -178,20 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	VM_BUG_ON_PAGE(!PageUptodate(page), page);
>  
> -	entry = get_swap_page();
> +retry:
> +	entry = get_swap_page(page);
>  	if (!entry.val)
> -		return 0;
> -
> -	if (mem_cgroup_try_charge_swap(page, entry)) {
> -		swapcache_free(entry);
> -		return 0;
> -	}
> -
> -	if (unlikely(PageTransHuge(page)))
> -		if (unlikely(split_huge_page_to_list(page, list))) {
> -			swapcache_free(entry);
> -			return 0;
> -		}
> +		goto fail;

So, with non-SSD swap, THP page *always* get the fail to get swp_entry_t
and retry after split the page. However, it makes unncessary get_swap_pages
call which is not trivial. If there is no SSD swap, thp-swap out should
be void without adding any performance overhead.
Hmm, but I have no good idea to do it simple. :(

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-04-27  5:31   ` Minchan Kim
@ 2017-04-27  7:12     ` Huang, Ying
  2017-04-27 13:37       ` Johannes Weiner
  2017-04-28  8:40       ` Minchan Kim
  0 siblings, 2 replies; 26+ messages in thread
From: Huang, Ying @ 2017-04-27  7:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Johannes Weiner, Michal Hocko,
	Tejun Heo, Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

Minchan Kim <minchan@kernel.org> writes:

> On Tue, Apr 25, 2017 at 08:56:56PM +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> In this patch, splitting huge page is delayed from almost the first
>> step of swapping out to after allocating the swap space for the
>> THP (Transparent Huge Page) and adding the THP into the swap cache.
>> This will batch the corresponding operation, thus improve THP swap out
>> throughput.
>> 
>> This is the first step for the THP swap optimization.  The plan is to
>> delay splitting the THP step by step and avoid splitting the THP
>> finally.
>> 
>> The advantages of the THP swap support include:
>> 
>> - Batch the swap operations for the THP and reduce lock
>>   acquiring/releasing, including allocating/freeing the swap space,
>>   adding/deleting to/from the swap cache, and writing/reading the swap
>>   space, etc.  This will help to improve the THP swap performance.
>> 
>> - The THP swap space read/write will be 2M sequential IO.  It is
>>   particularly helpful for the swap read, which usually are 4k random
>>   IO.  This will help to improve the THP swap performance.
>> 
>> - It will help the memory fragmentation, especially when the THP is
>>   heavily used by the applications.  The 2M continuous pages will be
>>   free up after the THP swapping out.
>> 
>> - It will improve the THP utilization on the system with the swap
>>   turned on.  Because the speed for khugepaged to collapse the normal
>>   pages into the THP is quite slow.  After the THP is split during the
>>   swapping out, it will take quite long time for the normal pages to
>>   collapse back into the THP after being swapped in.  The high THP
>>   utilization helps the efficiency of the page based memory management
>>   too.
>> 
>> There are some concerns regarding THP swap in, mainly because possible
>> enlarged read/write IO size (for swap in/out) may put more overhead on
>> the storage device.  To deal with that, the THP swap in should be
>> turned on only when necessary.  For example, it can be selected via
>> "always/never/madvise" logic, to be turned on globally, turned off
>> globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
>> 
>> In this patch, one swap cluster is used to hold the contents of each
>> THP swapped out.  So, the size of the swap cluster is changed to that
>> of the THP (Transparent Huge Page) on x86_64 architecture (512).  For
>> other architectures which want such THP swap optimization,
>> ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file
>> for the architecture.  In effect, this will enlarge swap cluster size
>> by 2 times on x86_64.  Which may make it harder to find a free cluster
>> when the swap space becomes fragmented.  So that, this may reduce the
>> continuous swap space allocation and sequential write in theory.  The
>> performance test in 0day shows no regressions caused by this.
>
> What about other architecures?
>
> I mean THP page size on every architectures would be various.
> If THP page size is much bigger than 2M, the architecture should
> have big swap cluster size for supporting THP swap-out feature.
> It means fast empty-swap cluster consumption so that it can suffer
> from fragmentation easily which causes THP swap void and swap slot
> allocations slow due to not being able to use per-cpu.
>
> What I suggested was contiguous multiple swap cluster allocations
> to meet THP page size. If some of architecure's THP size is 64M
> and SWAP_CLUSTER_SIZE is 2M, it should allocate 32 contiguos
> swap clusters. For that, swap layer need to manage clusters sort
> in order which would be more overhead in CONFIG_THP_SWAP case
> but I think it's tradeoff. With that, every architectures can
> support THP swap easily without arch-specific something.

That may be a good solution for other architectures.  But I am afraid I
am not the right person to work on that.  Because I don't know the
requirement of other architectures, and I have no other architectures
machines to work on and measure the performance.

And the swap clusters aren't sorted in order now intentionally to avoid
cache line false sharing between the spinlock of struct
swap_cluster_info.  If we want to sort clusters in order, we need a
solution for that.

> If (PAGE_SIZE * 512) swap cluster size were okay for most of
> architecture, just increase it. It's orthogonal work regardless of
> THP swapout. Then, we don't need to manage swap clusters sort
> in order in x86_64 which SWAP_CLUSTER_SIZE is equal to
> THP_PAGE_SIZE. It's just a bonus by side-effect.

Andrew suggested to make swap cluster size = huge page size (or turn on
THP swap optimization) only if we enabled CONFIG_THP_SWAP.  So that, THP
swap optimization will not be turned on unintentionally.

We may adjust default swap cluster size, but I don't think it need to be
in this patchset.

> AFAIR, I suggested it but cannot remember why we cannot go with
> this way.
>
>> 
>> In the future of THP swap optimization, some information of the
>> swapped out THP (such as compound map count) will be recorded in the
>> swap_cluster_info data structure.
>> 
>> The mem cgroup swap accounting functions are enhanced to support
>> charge or uncharge a swap cluster backing a THP as a whole.
>> 
>> The swap cluster allocate/free functions are added to allocate/free a
>> swap cluster for a THP.  A fair simple algorithm is used for swap
>> cluster allocation, that is, only the first swap device in priority
>> list will be tried to allocate the swap cluster.  The function will
>> fail if the trying is not successful, and the caller will fallback to
>> allocate a single swap slot instead.  This works good enough for
>> normal cases.  If the difference of the number of the free swap
>> clusters among multiple swap devices is significant, it is possible
>> that some THPs are split earlier than necessary.  For example, this
>> could be caused by big size difference among multiple swap devices.
>> 
>> The swap cache functions is enhanced to support add/delete THP to/from
>> the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
>> enhanced in the future with multi-order radix tree.  But because we
>> will split the THP soon during swapping out, that optimization doesn't
>> make much sense for this first step.
>> 
>> The THP splitting functions are enhanced to support to split THP in
>> swap cache during swapping out.  The page lock will be held during
>> allocating the swap cluster, adding the THP into the swap cache and
>> splitting the THP.  So in the code path other than swapping out, if
>> the THP need to be split, the PageSwapCache(THP) will be always false.
>> 
>> The swap cluster is only available for SSD, so the THP swap
>> optimization in this patchset has no effect for HDD.
>> 
>> With the patch, the swap out throughput improves 11.5% (from about
>> 3.73GB/s to about 4.16GB/s) in the vm-scalability swap-w-seq test case
>> with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
>> device used is a RAM simulated PMEM (persistent memory) device.  To
>> test the sequential swapping out, the test case creates 8 processes,
>> which sequentially allocate and write to the anonymous pages until the
>> RAM and part of the swap device is used up.
>> 
>> [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Shaohua Li <shli@kernel.org>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: cgroups@vger.kernel.org
>> Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
>> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>> ---
>>  arch/x86/Kconfig            |   1 +
>>  include/linux/page-flags.h  |   7 +-
>>  include/linux/swap.h        |  25 ++++-
>>  include/linux/swap_cgroup.h |   6 +-
>>  mm/Kconfig                  |  12 +++
>>  mm/huge_memory.c            |  11 +-
>>  mm/memcontrol.c             |  50 ++++-----
>>  mm/shmem.c                  |   2 +-
>>  mm/swap_cgroup.c            |  40 +++++--
>>  mm/swap_slots.c             |  16 ++-
>>  mm/swap_state.c             | 114 ++++++++++++--------
>>  mm/swapfile.c               | 256 ++++++++++++++++++++++++++++++++------------
>>  12 files changed, 375 insertions(+), 165 deletions(-)
>
> < snip >
>
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1290,7 +1290,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>>  		SetPageUptodate(page);
>>  	}
>>  
>> -	swap = get_swap_page();
>> +	swap = get_swap_page(page);
>>  	if (!swap.val)
>>  		goto redirty;
>>  
>
> If swap is non-ssd, swap.val could be zero. Right?
> If so, could we retry like anonymous page swapout?

This is for shmem, where the THP will be split before goes here.  That
is, "page" here is always normal page.

>>  
>> -swp_entry_t get_swap_page(void)
>> +swp_entry_t get_swap_page(struct page *page)
>>  {
>>  	swp_entry_t entry, *pentry;
>>  	struct swap_slots_cache *cache;
>>  
>> +	entry.val = 0;
>> +
>> +	if (PageTransHuge(page)) {
>> +		if (hpage_nr_pages(page) == SWAPFILE_CLUSTER)
>> +			get_swap_pages(1, true, &entry);
>> +		return entry;
>> +	}
>> +
>
>
> < snip >
>
>>  /**
>> @@ -178,20 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
>>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>>  	VM_BUG_ON_PAGE(!PageUptodate(page), page);
>>  
>> -	entry = get_swap_page();
>> +retry:
>> +	entry = get_swap_page(page);
>>  	if (!entry.val)
>> -		return 0;
>> -
>> -	if (mem_cgroup_try_charge_swap(page, entry)) {
>> -		swapcache_free(entry);
>> -		return 0;
>> -	}
>> -
>> -	if (unlikely(PageTransHuge(page)))
>> -		if (unlikely(split_huge_page_to_list(page, list))) {
>> -			swapcache_free(entry);
>> -			return 0;
>> -		}
>> +		goto fail;
>
> So, with non-SSD swap, THP page *always* get the fail to get swp_entry_t
> and retry after split the page. However, it makes unncessary get_swap_pages
> call which is not trivial. If there is no SSD swap, thp-swap out should
> be void without adding any performance overhead.
> Hmm, but I have no good idea to do it simple. :(

For HDD swap, the device raw throughput is so low (< 100M Bps
typically), that the added overhead here will not be a big issue.  Do
you agree?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-04-27  7:12     ` Huang, Ying
@ 2017-04-27 13:37       ` Johannes Weiner
  2017-04-28  8:40       ` Minchan Kim
  1 sibling, 0 replies; 26+ messages in thread
From: Johannes Weiner @ 2017-04-27 13:37 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Minchan Kim, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

On Thu, Apr 27, 2017 at 03:12:34PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> > On Tue, Apr 25, 2017 at 08:56:56PM +0800, Huang, Ying wrote:
> >> @@ -178,20 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
> >>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >>  	VM_BUG_ON_PAGE(!PageUptodate(page), page);
> >>  
> >> -	entry = get_swap_page();
> >> +retry:
> >> +	entry = get_swap_page(page);
> >>  	if (!entry.val)
> >> -		return 0;
> >> -
> >> -	if (mem_cgroup_try_charge_swap(page, entry)) {
> >> -		swapcache_free(entry);
> >> -		return 0;
> >> -	}
> >> -
> >> -	if (unlikely(PageTransHuge(page)))
> >> -		if (unlikely(split_huge_page_to_list(page, list))) {
> >> -			swapcache_free(entry);
> >> -			return 0;
> >> -		}
> >> +		goto fail;
> >
> > So, with non-SSD swap, THP page *always* get the fail to get swp_entry_t
> > and retry after split the page. However, it makes unncessary get_swap_pages
> > call which is not trivial. If there is no SSD swap, thp-swap out should
> > be void without adding any performance overhead.
> > Hmm, but I have no good idea to do it simple. :(
> 
> For HDD swap, the device raw throughput is so low (< 100M Bps
> typically), that the added overhead here will not be a big issue.  Do
> you agree?

I fully agree. If you swap to spinning rust, an extra function call
here is the least of your concern.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-04-27  7:12     ` Huang, Ying
  2017-04-27 13:37       ` Johannes Weiner
@ 2017-04-28  8:40       ` Minchan Kim
  2017-04-28 12:21         ` Huang, Ying
  2017-05-01 10:44         ` Johannes Weiner
  1 sibling, 2 replies; 26+ messages in thread
From: Minchan Kim @ 2017-04-28  8:40 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Ebru Akagunduz, Johannes Weiner, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

On Thu, Apr 27, 2017 at 03:12:34PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Tue, Apr 25, 2017 at 08:56:56PM +0800, Huang, Ying wrote:
> >> From: Huang Ying <ying.huang@intel.com>
> >> 
> >> In this patch, splitting huge page is delayed from almost the first
> >> step of swapping out to after allocating the swap space for the
> >> THP (Transparent Huge Page) and adding the THP into the swap cache.
> >> This will batch the corresponding operation, thus improve THP swap out
> >> throughput.
> >> 
> >> This is the first step for the THP swap optimization.  The plan is to
> >> delay splitting the THP step by step and avoid splitting the THP
> >> finally.
> >> 
> >> The advantages of the THP swap support include:
> >> 
> >> - Batch the swap operations for the THP and reduce lock
> >>   acquiring/releasing, including allocating/freeing the swap space,
> >>   adding/deleting to/from the swap cache, and writing/reading the swap
> >>   space, etc.  This will help to improve the THP swap performance.
> >> 
> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >>   particularly helpful for the swap read, which usually are 4k random
> >>   IO.  This will help to improve the THP swap performance.
> >> 
> >> - It will help the memory fragmentation, especially when the THP is
> >>   heavily used by the applications.  The 2M continuous pages will be
> >>   free up after the THP swapping out.
> >> 
> >> - It will improve the THP utilization on the system with the swap
> >>   turned on.  Because the speed for khugepaged to collapse the normal
> >>   pages into the THP is quite slow.  After the THP is split during the
> >>   swapping out, it will take quite long time for the normal pages to
> >>   collapse back into the THP after being swapped in.  The high THP
> >>   utilization helps the efficiency of the page based memory management
> >>   too.
> >> 
> >> There are some concerns regarding THP swap in, mainly because possible
> >> enlarged read/write IO size (for swap in/out) may put more overhead on
> >> the storage device.  To deal with that, the THP swap in should be
> >> turned on only when necessary.  For example, it can be selected via
> >> "always/never/madvise" logic, to be turned on globally, turned off
> >> globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
> >> 
> >> In this patch, one swap cluster is used to hold the contents of each
> >> THP swapped out.  So, the size of the swap cluster is changed to that
> >> of the THP (Transparent Huge Page) on x86_64 architecture (512).  For
> >> other architectures which want such THP swap optimization,
> >> ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file
> >> for the architecture.  In effect, this will enlarge swap cluster size
> >> by 2 times on x86_64.  Which may make it harder to find a free cluster
> >> when the swap space becomes fragmented.  So that, this may reduce the
> >> continuous swap space allocation and sequential write in theory.  The
> >> performance test in 0day shows no regressions caused by this.
> >
> > What about other architecures?
> >
> > I mean THP page size on every architectures would be various.
> > If THP page size is much bigger than 2M, the architecture should
> > have big swap cluster size for supporting THP swap-out feature.
> > It means fast empty-swap cluster consumption so that it can suffer
> > from fragmentation easily which causes THP swap void and swap slot
> > allocations slow due to not being able to use per-cpu.
> >
> > What I suggested was contiguous multiple swap cluster allocations
> > to meet THP page size. If some of architecure's THP size is 64M
> > and SWAP_CLUSTER_SIZE is 2M, it should allocate 32 contiguos
> > swap clusters. For that, swap layer need to manage clusters sort
> > in order which would be more overhead in CONFIG_THP_SWAP case
> > but I think it's tradeoff. With that, every architectures can
> > support THP swap easily without arch-specific something.
> 
> That may be a good solution for other architectures.  But I am afraid I
> am not the right person to work on that.  Because I don't know the
> requirement of other architectures, and I have no other architectures
> machines to work on and measure the performance.

IMO, THP swapout is good thing for every architecture so I dobut
you need to know other architecture's requirement.

> 
> And the swap clusters aren't sorted in order now intentionally to avoid
> cache line false sharing between the spinlock of struct
> swap_cluster_info.  If we want to sort clusters in order, we need a
> solution for that.

Does it really matter for this work? IOW, if we couldn't solve it,
cannot we support THP swapout? I don't think so. That's the least
of your worries.
Also, if we have sorted cluster data structure, we need to change
current single linked list of swap cluster to other one so we would
need to revisit to see whether it's really problem.

> 
> > If (PAGE_SIZE * 512) swap cluster size were okay for most of
> > architecture, just increase it. It's orthogonal work regardless of
> > THP swapout. Then, we don't need to manage swap clusters sort
> > in order in x86_64 which SWAP_CLUSTER_SIZE is equal to
> > THP_PAGE_SIZE. It's just a bonus by side-effect.
> 
> Andrew suggested to make swap cluster size = huge page size (or turn on
> THP swap optimization) only if we enabled CONFIG_THP_SWAP.  So that, THP
> swap optimization will not be turned on unintentionally.
> 
> We may adjust default swap cluster size, but I don't think it need to be
> in this patchset.

That's it. This feature shouldn't be aware of swap cluster size. IOW,
it would be better to work with every swap cluster size if the align
between THP and swap cluster size is matched at least.

> >> --- a/mm/shmem.c
> >> +++ b/mm/shmem.c
> >> @@ -1290,7 +1290,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> >>  		SetPageUptodate(page);
> >>  	}
> >>  
> >> -	swap = get_swap_page();
> >> +	swap = get_swap_page(page);
> >>  	if (!swap.val)
> >>  		goto redirty;
> >>  
> >
> > If swap is non-ssd, swap.val could be zero. Right?
> > If so, could we retry like anonymous page swapout?
> 
> This is for shmem, where the THP will be split before goes here.  That
> is, "page" here is always normal page.

Thanks. I missed it.

However, get_swap_page is ugly now. The caller should take care of
failure and should retry after split. I hope get_swap_page includes
split and retry logic in itself without reling on the caller.

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b60fea3748f8..96d41fade8d9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -392,7 +392,7 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(struct page *page);
+extern swp_entry_t get_swap_page(struct page *page, struct list_head *list);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
@@ -400,7 +400,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
+extern void swapcache_free(struct page *page, swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
@@ -459,7 +459,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void swapcache_free(struct page *page, swp_entry_t swp)
 {
 }
 
@@ -521,7 +521,8 @@ static inline int try_to_free_swap(struct page *page)
 	return 0;
 }
 
-static inline swp_entry_t get_swap_page(struct page *page)
+static inline swp_entry_t get_swap_page(struct page *page,
+					struct list_head *list)
 {
 	swp_entry_t entry;
 	entry.val = 0;
diff --git a/mm/shmem.c b/mm/shmem.c
index 29948d7da172..59afa7fc4313 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1290,7 +1290,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		SetPageUptodate(page);
 	}
 
-	swap = get_swap_page(page);
+	swap = get_swap_page(page, NULL);
 	if (!swap.val)
 		goto redirty;
 
@@ -1326,7 +1326,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 
 	mutex_unlock(&shmem_swaplist_mutex);
 free_swap:
-	swapcache_free(swap);
+	swapcache_free(page, swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index eb7524f8296d..ed5170f0bb7e 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -302,7 +302,7 @@ int free_swap_slot(swp_entry_t entry)
 	return 0;
 }
 
-swp_entry_t get_swap_page(struct page *page)
+swp_entry_t get_swap_page(struct page *page, struct list_head *list)
 {
 	swp_entry_t entry, *pentry;
 	struct swap_slots_cache *cache;
@@ -312,7 +312,15 @@ swp_entry_t get_swap_page(struct page *page)
 	if (PageTransHuge(page)) {
 		if (hpage_nr_pages(page) == SWAPFILE_CLUSTER)
 			get_swap_pages(1, true, &entry);
-		return entry;
+		if (entry.val != 0)
+			return entry;
+		/*
+		 * If swap device is not a SSD or cannot find
+		 * a empty cluster, split the page and fall back
+		 * to swap slot allocation.
+		 */
+		if (split_huge_page_to_list(page, list))
+			return entry;
 	}
 
 	/*
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 16ff89d058f4..d218c8513ff1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -192,12 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
-retry:
-	entry = get_swap_page(page);
+	entry = get_swap_page(page, list);
 	if (!entry.val)
-		goto fail;
+		return 0;
+
 	if (mem_cgroup_try_charge_swap(page, entry))
-		goto fail_free;
+		goto fail;
 
 	/*
 	 * Radix-tree node allocations from PF_MEMALLOC contexts could
@@ -218,7 +218,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		goto fail_free;
+		goto fail;
 
 	if (PageTransHuge(page)) {
 		err = split_huge_page_to_list(page, list);
@@ -230,14 +230,8 @@ int add_to_swap(struct page *page, struct list_head *list)
 
 	return 1;
 
-fail_free:
-	if (PageTransHuge(page))
-		swapcache_free_cluster(entry);
-	else
-		swapcache_free(entry);
 fail:
-	if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
-		goto retry;
+	swapcache_free(page, entry);
 	return 0;
 }
 
@@ -259,11 +253,7 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	if (PageTransHuge(page))
-		swapcache_free_cluster(entry);
-	else
-		swapcache_free(entry);
-
+	swapcache_free(page, entry);
 	page_ref_sub(page, hpage_nr_pages(page));
 }
 
@@ -415,7 +405,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry);
+		swapcache_free(new_page, entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 596306272059..9496cc3e955a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry)
+void __swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 
@@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
 }
 
 #ifdef CONFIG_THP_SWAP
-void swapcache_free_cluster(swp_entry_t entry)
+void __swapcache_free_cluster(swp_entry_t entry)
 {
 	unsigned long offset = swp_offset(entry);
 	unsigned long idx = offset / SWAPFILE_CLUSTER;
@@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
 }
 #endif /* CONFIG_THP_SWAP */
 
+void swapcache_free(struct page *page, swp_entry_t entry)
+{
+	if (!PageTransHuge(page))
+		__swapcache_free(entry);
+	else
+		__swapcache_free_cluster(entry);
+}
+
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ebf468c5429..0f8ca3d1761d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -708,7 +708,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		swapcache_free(swap);
+		swapcache_free(page, swap);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
> 
> >>  
> >> -swp_entry_t get_swap_page(void)
> >> +swp_entry_t get_swap_page(struct page *page)
> >>  {
> >>  	swp_entry_t entry, *pentry;
> >>  	struct swap_slots_cache *cache;
> >>  
> >> +	entry.val = 0;
> >> +
> >> +	if (PageTransHuge(page)) {
> >> +		if (hpage_nr_pages(page) == SWAPFILE_CLUSTER)
> >> +			get_swap_pages(1, true, &entry);
> >> +		return entry;
> >> +	}
> >> +
> >
> >
> > < snip >
> >
> >>  /**
> >> @@ -178,20 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
> >>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >>  	VM_BUG_ON_PAGE(!PageUptodate(page), page);
> >>  
> >> -	entry = get_swap_page();
> >> +retry:
> >> +	entry = get_swap_page(page);
> >>  	if (!entry.val)
> >> -		return 0;
> >> -
> >> -	if (mem_cgroup_try_charge_swap(page, entry)) {
> >> -		swapcache_free(entry);
> >> -		return 0;
> >> -	}
> >> -
> >> -	if (unlikely(PageTransHuge(page)))
> >> -		if (unlikely(split_huge_page_to_list(page, list))) {
> >> -			swapcache_free(entry);
> >> -			return 0;
> >> -		}
> >> +		goto fail;
> >
> > So, with non-SSD swap, THP page *always* get the fail to get swp_entry_t
> > and retry after split the page. However, it makes unncessary get_swap_pages
> > call which is not trivial. If there is no SSD swap, thp-swap out should
> > be void without adding any performance overhead.
> > Hmm, but I have no good idea to do it simple. :(
> 
> For HDD swap, the device raw throughput is so low (< 100M Bps
> typically), that the added overhead here will not be a big issue.  Do
> you agree?

I agree. Actually, I wanted to remove the pointless overhead
if we have a enough *simple* solution. However, as I said,
I have no idea so just raised an issue wit hope that someone might
have an idea.

Frankly speaking, I think we should support THP swap with hdd
as well as ssd but it's limited to just *implementation* direction
which is really unfortunate. :(

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-04-28  8:40       ` Minchan Kim
@ 2017-04-28 12:21         ` Huang, Ying
  2017-05-10  1:03           ` Minchan Kim
  2017-05-01 10:44         ` Johannes Weiner
  1 sibling, 1 reply; 26+ messages in thread
From: Huang, Ying @ 2017-04-28 12:21 UTC (permalink / raw)
  To: Minchan Kim, Johannes Weiner
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

Minchan Kim <minchan@kernel.org> writes:

> On Thu, Apr 27, 2017 at 03:12:34PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Tue, Apr 25, 2017 at 08:56:56PM +0800, Huang, Ying wrote:
>> >> From: Huang Ying <ying.huang@intel.com>
>> >> 
>> >> In this patch, splitting huge page is delayed from almost the first
>> >> step of swapping out to after allocating the swap space for the
>> >> THP (Transparent Huge Page) and adding the THP into the swap cache.
>> >> This will batch the corresponding operation, thus improve THP swap out
>> >> throughput.
>> >> 
>> >> This is the first step for the THP swap optimization.  The plan is to
>> >> delay splitting the THP step by step and avoid splitting the THP
>> >> finally.
>> >> 
>> >> The advantages of the THP swap support include:
>> >> 
>> >> - Batch the swap operations for the THP and reduce lock
>> >>   acquiring/releasing, including allocating/freeing the swap space,
>> >>   adding/deleting to/from the swap cache, and writing/reading the swap
>> >>   space, etc.  This will help to improve the THP swap performance.
>> >> 
>> >> - The THP swap space read/write will be 2M sequential IO.  It is
>> >>   particularly helpful for the swap read, which usually are 4k random
>> >>   IO.  This will help to improve the THP swap performance.
>> >> 
>> >> - It will help the memory fragmentation, especially when the THP is
>> >>   heavily used by the applications.  The 2M continuous pages will be
>> >>   free up after the THP swapping out.
>> >> 
>> >> - It will improve the THP utilization on the system with the swap
>> >>   turned on.  Because the speed for khugepaged to collapse the normal
>> >>   pages into the THP is quite slow.  After the THP is split during the
>> >>   swapping out, it will take quite long time for the normal pages to
>> >>   collapse back into the THP after being swapped in.  The high THP
>> >>   utilization helps the efficiency of the page based memory management
>> >>   too.
>> >> 
>> >> There are some concerns regarding THP swap in, mainly because possible
>> >> enlarged read/write IO size (for swap in/out) may put more overhead on
>> >> the storage device.  To deal with that, the THP swap in should be
>> >> turned on only when necessary.  For example, it can be selected via
>> >> "always/never/madvise" logic, to be turned on globally, turned off
>> >> globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
>> >> 
>> >> In this patch, one swap cluster is used to hold the contents of each
>> >> THP swapped out.  So, the size of the swap cluster is changed to that
>> >> of the THP (Transparent Huge Page) on x86_64 architecture (512).  For
>> >> other architectures which want such THP swap optimization,
>> >> ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file
>> >> for the architecture.  In effect, this will enlarge swap cluster size
>> >> by 2 times on x86_64.  Which may make it harder to find a free cluster
>> >> when the swap space becomes fragmented.  So that, this may reduce the
>> >> continuous swap space allocation and sequential write in theory.  The
>> >> performance test in 0day shows no regressions caused by this.
>> >
>> > What about other architecures?
>> >
>> > I mean THP page size on every architectures would be various.
>> > If THP page size is much bigger than 2M, the architecture should
>> > have big swap cluster size for supporting THP swap-out feature.
>> > It means fast empty-swap cluster consumption so that it can suffer
>> > from fragmentation easily which causes THP swap void and swap slot
>> > allocations slow due to not being able to use per-cpu.
>> >
>> > What I suggested was contiguous multiple swap cluster allocations
>> > to meet THP page size. If some of architecure's THP size is 64M
>> > and SWAP_CLUSTER_SIZE is 2M, it should allocate 32 contiguos
>> > swap clusters. For that, swap layer need to manage clusters sort
>> > in order which would be more overhead in CONFIG_THP_SWAP case
>> > but I think it's tradeoff. With that, every architectures can
>> > support THP swap easily without arch-specific something.
>> 
>> That may be a good solution for other architectures.  But I am afraid I
>> am not the right person to work on that.  Because I don't know the
>> requirement of other architectures, and I have no other architectures
>> machines to work on and measure the performance.
>
> IMO, THP swapout is good thing for every architecture so I dobut
> you need to know other architecture's requirement.
>
>> 
>> And the swap clusters aren't sorted in order now intentionally to avoid
>> cache line false sharing between the spinlock of struct
>> swap_cluster_info.  If we want to sort clusters in order, we need a
>> solution for that.
>
> Does it really matter for this work? IOW, if we couldn't solve it,
> cannot we support THP swapout? I don't think so. That's the least
> of your worries.
> Also, if we have sorted cluster data structure, we need to change
> current single linked list of swap cluster to other one so we would
> need to revisit to see whether it's really problem.
>
>> 
>> > If (PAGE_SIZE * 512) swap cluster size were okay for most of
>> > architecture, just increase it. It's orthogonal work regardless of
>> > THP swapout. Then, we don't need to manage swap clusters sort
>> > in order in x86_64 which SWAP_CLUSTER_SIZE is equal to
>> > THP_PAGE_SIZE. It's just a bonus by side-effect.
>> 
>> Andrew suggested to make swap cluster size = huge page size (or turn on
>> THP swap optimization) only if we enabled CONFIG_THP_SWAP.  So that, THP
>> swap optimization will not be turned on unintentionally.
>> 
>> We may adjust default swap cluster size, but I don't think it need to be
>> in this patchset.
>
> That's it. This feature shouldn't be aware of swap cluster size. IOW,
> it would be better to work with every swap cluster size if the align
> between THP and swap cluster size is matched at least.

Using one swap cluster for each THP is simpler, so why not start from
the simple design?  Complex design may be necessary in the future, but
we can work on that at that time.

>> >> --- a/mm/shmem.c
>> >> +++ b/mm/shmem.c
>> >> @@ -1290,7 +1290,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>> >>  		SetPageUptodate(page);
>> >>  	}
>> >>  
>> >> -	swap = get_swap_page();
>> >> +	swap = get_swap_page(page);
>> >>  	if (!swap.val)
>> >>  		goto redirty;
>> >>  
>> >
>> > If swap is non-ssd, swap.val could be zero. Right?
>> > If so, could we retry like anonymous page swapout?
>> 
>> This is for shmem, where the THP will be split before goes here.  That
>> is, "page" here is always normal page.
>
> Thanks. I missed it.
>
> However, get_swap_page is ugly now. The caller should take care of
> failure and should retry after split. I hope get_swap_page includes
> split and retry logic in itself without reling on the caller.

The current interface of get_swap_page() and swapcache_free() is
proposed by Johannes.

Hi, Johannes, what do you think about Minchan's interface proposal?

Best Regards,
Huang, Ying

> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b60fea3748f8..96d41fade8d9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -392,7 +392,7 @@ static inline long get_nr_swap_pages(void)
>  }
>  
>  extern void si_swapinfo(struct sysinfo *);
> -extern swp_entry_t get_swap_page(struct page *page);
> +extern swp_entry_t get_swap_page(struct page *page, struct list_head *list);
>  extern swp_entry_t get_swap_page_of_type(int);
>  extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> @@ -400,7 +400,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> -extern void swapcache_free(swp_entry_t);
> +extern void swapcache_free(struct page *page, swp_entry_t);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern int free_swap_and_cache(swp_entry_t);
>  extern int swap_type_of(dev_t, sector_t, struct block_device **);
> @@ -459,7 +459,7 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>  
> -static inline void swapcache_free(swp_entry_t swp)
> +static inline void swapcache_free(struct page *page, swp_entry_t swp)
>  {
>  }
>  
> @@ -521,7 +521,8 @@ static inline int try_to_free_swap(struct page *page)
>  	return 0;
>  }
>  
> -static inline swp_entry_t get_swap_page(struct page *page)
> +static inline swp_entry_t get_swap_page(struct page *page,
> +					struct list_head *list)
>  {
>  	swp_entry_t entry;
>  	entry.val = 0;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 29948d7da172..59afa7fc4313 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1290,7 +1290,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  		SetPageUptodate(page);
>  	}
>  
> -	swap = get_swap_page(page);
> +	swap = get_swap_page(page, NULL);
>  	if (!swap.val)
>  		goto redirty;
>  
> @@ -1326,7 +1326,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>  
>  	mutex_unlock(&shmem_swaplist_mutex);
>  free_swap:
> -	swapcache_free(swap);
> +	swapcache_free(page, swap);
>  redirty:
>  	set_page_dirty(page);
>  	if (wbc->for_reclaim)
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index eb7524f8296d..ed5170f0bb7e 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -302,7 +302,7 @@ int free_swap_slot(swp_entry_t entry)
>  	return 0;
>  }
>  
> -swp_entry_t get_swap_page(struct page *page)
> +swp_entry_t get_swap_page(struct page *page, struct list_head *list)
>  {
>  	swp_entry_t entry, *pentry;
>  	struct swap_slots_cache *cache;
> @@ -312,7 +312,15 @@ swp_entry_t get_swap_page(struct page *page)
>  	if (PageTransHuge(page)) {
>  		if (hpage_nr_pages(page) == SWAPFILE_CLUSTER)
>  			get_swap_pages(1, true, &entry);
> -		return entry;
> +		if (entry.val != 0)
> +			return entry;
> +		/*
> +		 * If swap device is not a SSD or cannot find
> +		 * a empty cluster, split the page and fall back
> +		 * to swap slot allocation.
> +		 */
> +		if (split_huge_page_to_list(page, list))
> +			return entry;
>  	}
>  
>  	/*
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 16ff89d058f4..d218c8513ff1 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -192,12 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	VM_BUG_ON_PAGE(!PageUptodate(page), page);
>  
> -retry:
> -	entry = get_swap_page(page);
> +	entry = get_swap_page(page, list);
>  	if (!entry.val)
> -		goto fail;
> +		return 0;
> +
>  	if (mem_cgroup_try_charge_swap(page, entry))
> -		goto fail_free;
> +		goto fail;
>  
>  	/*
>  	 * Radix-tree node allocations from PF_MEMALLOC contexts could
> @@ -218,7 +218,7 @@ int add_to_swap(struct page *page, struct list_head *list)
>  		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
>  		 * clear SWAP_HAS_CACHE flag.
>  		 */
> -		goto fail_free;
> +		goto fail;
>  
>  	if (PageTransHuge(page)) {
>  		err = split_huge_page_to_list(page, list);
> @@ -230,14 +230,8 @@ int add_to_swap(struct page *page, struct list_head *list)
>  
>  	return 1;
>  
> -fail_free:
> -	if (PageTransHuge(page))
> -		swapcache_free_cluster(entry);
> -	else
> -		swapcache_free(entry);
>  fail:
> -	if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
> -		goto retry;
> +	swapcache_free(page, entry);
>  	return 0;
>  }
>  
> @@ -259,11 +253,7 @@ void delete_from_swap_cache(struct page *page)
>  	__delete_from_swap_cache(page);
>  	spin_unlock_irq(&address_space->tree_lock);
>  
> -	if (PageTransHuge(page))
> -		swapcache_free_cluster(entry);
> -	else
> -		swapcache_free(entry);
> -
> +	swapcache_free(page, entry);
>  	page_ref_sub(page, hpage_nr_pages(page));
>  }
>  
> @@ -415,7 +405,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
>  		 * clear SWAP_HAS_CACHE flag.
>  		 */
> -		swapcache_free(entry);
> +		swapcache_free(new_page, entry);
>  	} while (err != -ENOMEM);
>  
>  	if (new_page)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 596306272059..9496cc3e955a 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */
> -void swapcache_free(swp_entry_t entry)
> +void __swapcache_free(swp_entry_t entry)
>  {
>  	struct swap_info_struct *p;
>  
> @@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
>  }
>  
>  #ifdef CONFIG_THP_SWAP
> -void swapcache_free_cluster(swp_entry_t entry)
> +void __swapcache_free_cluster(swp_entry_t entry)
>  {
>  	unsigned long offset = swp_offset(entry);
>  	unsigned long idx = offset / SWAPFILE_CLUSTER;
> @@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
>  }
>  #endif /* CONFIG_THP_SWAP */
>  
> +void swapcache_free(struct page *page, swp_entry_t entry)
> +{
> +	if (!PageTransHuge(page))
> +		__swapcache_free(entry);
> +	else
> +		__swapcache_free_cluster(entry);
> +}
> +
>  static int swp_entry_cmp(const void *ent1, const void *ent2)
>  {
>  	const swp_entry_t *e1 = ent1, *e2 = ent2;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5ebf468c5429..0f8ca3d1761d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -708,7 +708,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
>  		mem_cgroup_swapout(page, swap);
>  		__delete_from_swap_cache(page);
>  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> -		swapcache_free(swap);
> +		swapcache_free(page, swap);
>  	} else {
>  		void (*freepage)(struct page *);
>  		void *shadow = NULL;
>> 
>> >>  
>> >> -swp_entry_t get_swap_page(void)
>> >> +swp_entry_t get_swap_page(struct page *page)
>> >>  {
>> >>  	swp_entry_t entry, *pentry;
>> >>  	struct swap_slots_cache *cache;
>> >>  
>> >> +	entry.val = 0;
>> >> +
>> >> +	if (PageTransHuge(page)) {
>> >> +		if (hpage_nr_pages(page) == SWAPFILE_CLUSTER)
>> >> +			get_swap_pages(1, true, &entry);
>> >> +		return entry;
>> >> +	}
>> >> +
>> >
>> >
>> > < snip >
>> >
>> >>  /**
>> >> @@ -178,20 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
>> >>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>> >>  	VM_BUG_ON_PAGE(!PageUptodate(page), page);
>> >>  
>> >> -	entry = get_swap_page();
>> >> +retry:
>> >> +	entry = get_swap_page(page);
>> >>  	if (!entry.val)
>> >> -		return 0;
>> >> -
>> >> -	if (mem_cgroup_try_charge_swap(page, entry)) {
>> >> -		swapcache_free(entry);
>> >> -		return 0;
>> >> -	}
>> >> -
>> >> -	if (unlikely(PageTransHuge(page)))
>> >> -		if (unlikely(split_huge_page_to_list(page, list))) {
>> >> -			swapcache_free(entry);
>> >> -			return 0;
>> >> -		}
>> >> +		goto fail;
>> >
>> > So, with non-SSD swap, THP page *always* get the fail to get swp_entry_t
>> > and retry after split the page. However, it makes unncessary get_swap_pages
>> > call which is not trivial. If there is no SSD swap, thp-swap out should
>> > be void without adding any performance overhead.
>> > Hmm, but I have no good idea to do it simple. :(
>> 
>> For HDD swap, the device raw throughput is so low (< 100M Bps
>> typically), that the added overhead here will not be a big issue.  Do
>> you agree?
>
> I agree. Actually, I wanted to remove the pointless overhead
> if we have a enough *simple* solution. However, as I said,
> I have no idea so just raised an issue wit hope that someone might
> have an idea.
>
> Frankly speaking, I think we should support THP swap with hdd
> as well as ssd but it's limited to just *implementation* direction
> which is really unfortunate. :(

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 3/3] mm, THP, swap: Enable THP swap optimization only if has compound map
  2017-04-25 21:46   ` Johannes Weiner
@ 2017-04-28 13:16     ` Kirill A. Shutemov
  0 siblings, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2017-04-28 13:16 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel

On Tue, Apr 25, 2017 at 05:46:18PM -0400, Johannes Weiner wrote:
> On Tue, Apr 25, 2017 at 08:56:58PM +0800, Huang, Ying wrote:
> > From: Huang Ying <ying.huang@intel.com>
> > 
> > If there is no compound map for a THP (Transparent Huge Page), it is
> > possible that the map count of some sub-pages of the THP is 0.  So it
> > is better to split the THP before swapping out. In this way, the
> > sub-pages not mapped will be freed, and we can avoid the unnecessary
> > swap out operations for these sub-pages.
> > 
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> CC Kirill to double check the reasoning here

Looks good to me:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-04-28  8:40       ` Minchan Kim
  2017-04-28 12:21         ` Huang, Ying
@ 2017-05-01 10:44         ` Johannes Weiner
  2017-05-01 23:53           ` Minchan Kim
  1 sibling, 1 reply; 26+ messages in thread
From: Johannes Weiner @ 2017-05-01 10:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

On Fri, Apr 28, 2017 at 05:40:44PM +0900, Minchan Kim wrote:
> However, get_swap_page is ugly now. The caller should take care of
> failure and should retry after split. I hope get_swap_page includes
> split and retry logic in itself without reling on the caller.

I think this makes the interface terrible. It's an allocation function
to which you pass a reference object for size - and if the allocation
doesn't succeed it'll split your reference object to make it fit?

That's a nasty side effect for this function to have.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-05-01 10:44         ` Johannes Weiner
@ 2017-05-01 23:53           ` Minchan Kim
  2017-05-10 13:56             ` Johannes Weiner
  0 siblings, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2017-05-01 23:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

Hi Johannes,

The patch I sent has two clean-up.

First part was as follows:

>From 5400dceb3a7739d4e7ff340fc0831e0e1830ec0b Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 28 Apr 2017 15:04:14 +0900
Subject: [PATCH 1/2] swap: make swapcache_free aware of page size

Now, get_swap_page takes struct page and allocates swap space according
to page size(ie, normal or THP) so it would be more clear to take
struct page in swapcache_free which is a counter function of
get_swap_page without needing if-else statement of caller side.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  4 ++--
 mm/shmem.c           |  2 +-
 mm/swap_state.c      | 13 +++----------
 mm/swapfile.c        | 12 ++++++++++--
 mm/vmscan.c          |  2 +-
 5 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b60fea3748f8..16c8d2392ddd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -400,7 +400,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
+extern void swapcache_free(struct page *page, swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
@@ -459,7 +459,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void swapcache_free(struct page *page, swp_entry_t swp)
 {
 }
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 29948d7da172..ab1802664e97 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1326,7 +1326,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 
 	mutex_unlock(&shmem_swaplist_mutex);
 free_swap:
-	swapcache_free(swap);
+	swapcache_free(page, swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 16ff89d058f4..4af44fd4142e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -231,10 +231,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 	return 1;
 
 fail_free:
-	if (PageTransHuge(page))
-		swapcache_free_cluster(entry);
-	else
-		swapcache_free(entry);
+	swapcache_free(page, entry);
 fail:
 	if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
 		goto retry;
@@ -259,11 +256,7 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	if (PageTransHuge(page))
-		swapcache_free_cluster(entry);
-	else
-		swapcache_free(entry);
-
+	swapcache_free(page, entry);
 	page_ref_sub(page, hpage_nr_pages(page));
 }
 
@@ -415,7 +408,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry);
+		swapcache_free(new_page, entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 596306272059..9496cc3e955a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry)
+void __swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 
@@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
 }
 
 #ifdef CONFIG_THP_SWAP
-void swapcache_free_cluster(swp_entry_t entry)
+void __swapcache_free_cluster(swp_entry_t entry)
 {
 	unsigned long offset = swp_offset(entry);
 	unsigned long idx = offset / SWAPFILE_CLUSTER;
@@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
 }
 #endif /* CONFIG_THP_SWAP */
 
+void swapcache_free(struct page *page, swp_entry_t entry)
+{
+	if (!PageTransHuge(page))
+		__swapcache_free(entry);
+	else
+		__swapcache_free_cluster(entry);
+}
+
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ebf468c5429..0f8ca3d1761d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -708,7 +708,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		swapcache_free(swap);
+		swapcache_free(page, swap);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
-- 
2.7.4


On Mon, May 01, 2017 at 12:44:30PM +0200, Johannes Weiner wrote:
> On Fri, Apr 28, 2017 at 05:40:44PM +0900, Minchan Kim wrote:
> > However, get_swap_page is ugly now. The caller should take care of
> > failure and should retry after split. I hope get_swap_page includes
> > split and retry logic in itself without reling on the caller.
> 
> I think this makes the interface terrible. It's an allocation function
> to which you pass a reference object for size - and if the allocation
> doesn't succeed it'll split your reference object to make it fit?
> 
> That's a nasty side effect for this function to have.

It does make sense to me.

With that point of view, retry logic in add_to_swap is awkward to me, too.
The add_to_swap is a kind of allocate function(swap slot and swap cache)
so I think it would be better to move retry logic to vmscan.c.

What do you think about it?

>From e228d67e9677f8eeea1e424cab61b67884d534b4 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Tue, 2 May 2017 08:30:41 +0900
Subject: [PATCH 2/2] swap: move anonymous THP split logic to vmscan

The add_to_swap aims to allocate swap_space(ie, swap slot and
swapcache) so if it fails due to lack of space in case of THP,
the caller should split the THP page and retry it with a page.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  4 ++--
 mm/swap_state.c      | 23 ++++++-----------------
 mm/vmscan.c          | 22 +++++++++++++++++++++-
 3 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 16c8d2392ddd..a305d27b12f8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -359,7 +359,7 @@ extern struct address_space *swapper_spaces[];
 		>> SWAP_ADDRESS_SPACE_SHIFT])
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *, struct list_head *list);
+extern int add_to_swap(struct page *);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
 extern void __delete_from_swap_cache(struct page *);
@@ -479,7 +479,7 @@ static inline struct page *lookup_swap_cache(swp_entry_t swp)
 	return NULL;
 }
 
-static inline int add_to_swap(struct page *page, struct list_head *list)
+static inline int add_to_swap(struct page *page)
 {
 	return 0;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4af44fd4142e..1b6ef1660b7e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -184,7 +184,7 @@ void __delete_from_swap_cache(struct page *page)
  * Allocate swap space for the page and add the page to the
  * swap cache.  Caller needs to hold the page lock. 
  */
-int add_to_swap(struct page *page, struct list_head *list)
+int add_to_swap(struct page *page)
 {
 	swp_entry_t entry;
 	int err;
@@ -192,12 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
-retry:
 	entry = get_swap_page(page);
 	if (!entry.val)
-		goto fail;
+		return 0;
+
 	if (mem_cgroup_try_charge_swap(page, entry))
-		goto fail_free;
+		goto fail;
 
 	/*
 	 * Radix-tree node allocations from PF_MEMALLOC contexts could
@@ -218,23 +218,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		goto fail_free;
-
-	if (PageTransHuge(page)) {
-		err = split_huge_page_to_list(page, list);
-		if (err) {
-			delete_from_swap_cache(page);
-			return 0;
-		}
-	}
+		goto fail;
 
 	return 1;
 
-fail_free:
-	swapcache_free(page, entry);
 fail:
-	if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
-		goto retry;
+	swapcache_free(page, entry);
 	return 0;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f8ca3d1761d..2314aca47d12 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1125,8 +1125,28 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		    !PageSwapCache(page)) {
 			if (!(sc->gfp_mask & __GFP_IO))
 				goto keep_locked;
-			if (!add_to_swap(page, page_list))
+swap_retry:
+			/*
+			 * Retry after split if we fail to allocate
+			 * swap space of a THP.
+			 */
+			if (!add_to_swap(page)) {
+				if (!PageTransHuge(page) ||
+				    split_huge_page_to_list(page, page_list))
+					goto activate_locked;
+				goto swap_retry;
+			}
+
+			/*
+			 * Got swap space successfully. But unfortunately,
+			 * we don't support a THP page writeout so split it.
+			 */
+			if (PageTransHuge(page) &&
+				  split_huge_page_to_list(page, page_list)) {
+				delete_from_swap_cache(page);
 				goto activate_locked;
+			}
+
 			may_enter_fs = 1;
 
 			/* Adding to swap updated mapping */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-04-28 12:21         ` Huang, Ying
@ 2017-05-10  1:03           ` Minchan Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2017-05-10  1:03 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

Hi Huang,

On Fri, Apr 28, 2017 at 08:21:37PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Thu, Apr 27, 2017 at 03:12:34PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > On Tue, Apr 25, 2017 at 08:56:56PM +0800, Huang, Ying wrote:
> >> >> From: Huang Ying <ying.huang@intel.com>
> >> >> 
> >> >> In this patch, splitting huge page is delayed from almost the first
> >> >> step of swapping out to after allocating the swap space for the
> >> >> THP (Transparent Huge Page) and adding the THP into the swap cache.
> >> >> This will batch the corresponding operation, thus improve THP swap out
> >> >> throughput.
> >> >> 
> >> >> This is the first step for the THP swap optimization.  The plan is to
> >> >> delay splitting the THP step by step and avoid splitting the THP
> >> >> finally.
> >> >> 
> >> >> The advantages of the THP swap support include:
> >> >> 
> >> >> - Batch the swap operations for the THP and reduce lock
> >> >>   acquiring/releasing, including allocating/freeing the swap space,
> >> >>   adding/deleting to/from the swap cache, and writing/reading the swap
> >> >>   space, etc.  This will help to improve the THP swap performance.
> >> >> 
> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >> >>   particularly helpful for the swap read, which usually are 4k random
> >> >>   IO.  This will help to improve the THP swap performance.
> >> >> 
> >> >> - It will help the memory fragmentation, especially when the THP is
> >> >>   heavily used by the applications.  The 2M continuous pages will be
> >> >>   free up after the THP swapping out.
> >> >> 
> >> >> - It will improve the THP utilization on the system with the swap
> >> >>   turned on.  Because the speed for khugepaged to collapse the normal
> >> >>   pages into the THP is quite slow.  After the THP is split during the
> >> >>   swapping out, it will take quite long time for the normal pages to
> >> >>   collapse back into the THP after being swapped in.  The high THP
> >> >>   utilization helps the efficiency of the page based memory management
> >> >>   too.
> >> >> 
> >> >> There are some concerns regarding THP swap in, mainly because possible
> >> >> enlarged read/write IO size (for swap in/out) may put more overhead on
> >> >> the storage device.  To deal with that, the THP swap in should be
> >> >> turned on only when necessary.  For example, it can be selected via
> >> >> "always/never/madvise" logic, to be turned on globally, turned off
> >> >> globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
> >> >> 
> >> >> In this patch, one swap cluster is used to hold the contents of each
> >> >> THP swapped out.  So, the size of the swap cluster is changed to that
> >> >> of the THP (Transparent Huge Page) on x86_64 architecture (512).  For
> >> >> other architectures which want such THP swap optimization,
> >> >> ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file
> >> >> for the architecture.  In effect, this will enlarge swap cluster size
> >> >> by 2 times on x86_64.  Which may make it harder to find a free cluster
> >> >> when the swap space becomes fragmented.  So that, this may reduce the
> >> >> continuous swap space allocation and sequential write in theory.  The
> >> >> performance test in 0day shows no regressions caused by this.
> >> >
> >> > What about other architecures?
> >> >
> >> > I mean THP page size on every architectures would be various.
> >> > If THP page size is much bigger than 2M, the architecture should
> >> > have big swap cluster size for supporting THP swap-out feature.
> >> > It means fast empty-swap cluster consumption so that it can suffer
> >> > from fragmentation easily which causes THP swap void and swap slot
> >> > allocations slow due to not being able to use per-cpu.
> >> >
> >> > What I suggested was contiguous multiple swap cluster allocations
> >> > to meet THP page size. If some of architecure's THP size is 64M
> >> > and SWAP_CLUSTER_SIZE is 2M, it should allocate 32 contiguos
> >> > swap clusters. For that, swap layer need to manage clusters sort
> >> > in order which would be more overhead in CONFIG_THP_SWAP case
> >> > but I think it's tradeoff. With that, every architectures can
> >> > support THP swap easily without arch-specific something.
> >> 
> >> That may be a good solution for other architectures.  But I am afraid I
> >> am not the right person to work on that.  Because I don't know the
> >> requirement of other architectures, and I have no other architectures
> >> machines to work on and measure the performance.
> >
> > IMO, THP swapout is good thing for every architecture so I dobut
> > you need to know other architecture's requirement.
> >
> >> 
> >> And the swap clusters aren't sorted in order now intentionally to avoid
> >> cache line false sharing between the spinlock of struct
> >> swap_cluster_info.  If we want to sort clusters in order, we need a
> >> solution for that.
> >
> > Does it really matter for this work? IOW, if we couldn't solve it,
> > cannot we support THP swapout? I don't think so. That's the least
> > of your worries.
> > Also, if we have sorted cluster data structure, we need to change
> > current single linked list of swap cluster to other one so we would
> > need to revisit to see whether it's really problem.
> >
> >> 
> >> > If (PAGE_SIZE * 512) swap cluster size were okay for most of
> >> > architecture, just increase it. It's orthogonal work regardless of
> >> > THP swapout. Then, we don't need to manage swap clusters sort
> >> > in order in x86_64 which SWAP_CLUSTER_SIZE is equal to
> >> > THP_PAGE_SIZE. It's just a bonus by side-effect.
> >> 
> >> Andrew suggested to make swap cluster size = huge page size (or turn on
> >> THP swap optimization) only if we enabled CONFIG_THP_SWAP.  So that, THP
> >> swap optimization will not be turned on unintentionally.
> >> 
> >> We may adjust default swap cluster size, but I don't think it need to be
> >> in this patchset.
> >
> > That's it. This feature shouldn't be aware of swap cluster size. IOW,
> > it would be better to work with every swap cluster size if the align
> > between THP and swap cluster size is matched at least.
> 
> Using one swap cluster for each THP is simpler, so why not start from
> the simple design?  Complex design may be necessary in the future, but
> we can work on that at that time.

If it were really architecture specific issue, I'm okay with such simple
start with major architecture first. However, I don't think it's the case.

THP swap: I don't think it's a architecure issue. It's generally good thing
once system supports THP. It is also good thing for HDD swap as well as SSD.
(In fact, I don't understand why we should have CONFIG_THP_SWAP. I think
 it should work with CONFIG_TRANSPARENT_HUGEPAGE automatically).

Current design is selected for just *simple implementation" and put the
heavy drift to make it generalize to others in the future.
I don't think it's good thing. But others might have different opinions
so I'm not insisting.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-05-01 23:53           ` Minchan Kim
@ 2017-05-10 13:56             ` Johannes Weiner
  2017-05-10 23:25               ` Minchan Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Weiner @ 2017-05-10 13:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

Hi Michan,

On Tue, May 02, 2017 at 08:53:32AM +0900, Minchan Kim wrote:
> @@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */
> -void swapcache_free(swp_entry_t entry)
> +void __swapcache_free(swp_entry_t entry)
>  {
>  	struct swap_info_struct *p;
>  
> @@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
>  }
>  
>  #ifdef CONFIG_THP_SWAP
> -void swapcache_free_cluster(swp_entry_t entry)
> +void __swapcache_free_cluster(swp_entry_t entry)
>  {
>  	unsigned long offset = swp_offset(entry);
>  	unsigned long idx = offset / SWAPFILE_CLUSTER;
> @@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
>  }
>  #endif /* CONFIG_THP_SWAP */
>  
> +void swapcache_free(struct page *page, swp_entry_t entry)
> +{
> +	if (!PageTransHuge(page))
> +		__swapcache_free(entry);
> +	else
> +		__swapcache_free_cluster(entry);
> +}

I don't think this is cleaner :/

On your second patch:

> @@ -1125,8 +1125,28 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		    !PageSwapCache(page)) {
>  			if (!(sc->gfp_mask & __GFP_IO))
>  				goto keep_locked;
> -			if (!add_to_swap(page, page_list))
> +swap_retry:
> +			/*
> +			 * Retry after split if we fail to allocate
> +			 * swap space of a THP.
> +			 */
> +			if (!add_to_swap(page)) {
> +				if (!PageTransHuge(page) ||
> +				    split_huge_page_to_list(page, page_list))
> +					goto activate_locked;
> +				goto swap_retry;
> +			}

This is definitely better.

However, I think it'd be cleaner without the label here:

			if (!add_to_swap(page)) {
				if (!PageTransHuge(page))
					goto activate_locked;
				/* Split THP and swap individual base pages */
				if (split_huge_page_to_list(page, page_list))
					goto activate_locked;
				if (!add_to_swap(page))
					goto activate_locked;
			}

> +			/*
> +			 * Got swap space successfully. But unfortunately,
> +			 * we don't support a THP page writeout so split it.
> +			 */
> +			if (PageTransHuge(page) &&
> +				  split_huge_page_to_list(page, page_list)) {
> +				delete_from_swap_cache(page);
>  				goto activate_locked;
> +			}

Pulling this out of add_to_swap() is an improvement for sure. Add an
XXX: before that "we don't support THP writes" comment for good
measure :)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-05-10 13:56             ` Johannes Weiner
@ 2017-05-10 23:25               ` Minchan Kim
  2017-05-11  0:50                 ` Huang, Ying
  2017-05-11  1:22                 ` [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out Minchan Kim
  0 siblings, 2 replies; 26+ messages in thread
From: Minchan Kim @ 2017-05-10 23:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

On Wed, May 10, 2017 at 09:56:54AM -0400, Johannes Weiner wrote:
> Hi Michan,
> 
> On Tue, May 02, 2017 at 08:53:32AM +0900, Minchan Kim wrote:
> > @@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
> >  /*
> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >   */
> > -void swapcache_free(swp_entry_t entry)
> > +void __swapcache_free(swp_entry_t entry)
> >  {
> >  	struct swap_info_struct *p;
> >  
> > @@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
> >  }
> >  
> >  #ifdef CONFIG_THP_SWAP
> > -void swapcache_free_cluster(swp_entry_t entry)
> > +void __swapcache_free_cluster(swp_entry_t entry)
> >  {
> >  	unsigned long offset = swp_offset(entry);
> >  	unsigned long idx = offset / SWAPFILE_CLUSTER;
> > @@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
> >  }
> >  #endif /* CONFIG_THP_SWAP */
> >  
> > +void swapcache_free(struct page *page, swp_entry_t entry)
> > +{
> > +	if (!PageTransHuge(page))
> > +		__swapcache_free(entry);
> > +	else
> > +		__swapcache_free_cluster(entry);
> > +}
> 
> I don't think this is cleaner :/
> 
> On your second patch:
> 
> > @@ -1125,8 +1125,28 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		    !PageSwapCache(page)) {
> >  			if (!(sc->gfp_mask & __GFP_IO))
> >  				goto keep_locked;
> > -			if (!add_to_swap(page, page_list))
> > +swap_retry:
> > +			/*
> > +			 * Retry after split if we fail to allocate
> > +			 * swap space of a THP.
> > +			 */
> > +			if (!add_to_swap(page)) {
> > +				if (!PageTransHuge(page) ||
> > +				    split_huge_page_to_list(page, page_list))
> > +					goto activate_locked;
> > +				goto swap_retry;
> > +			}
> 
> This is definitely better.

Thanks.

> 
> However, I think it'd be cleaner without the label here:
> 
> 			if (!add_to_swap(page)) {
> 				if (!PageTransHuge(page))
> 					goto activate_locked;
> 				/* Split THP and swap individual base pages */
> 				if (split_huge_page_to_list(page, page_list))
> 					goto activate_locked;
> 				if (!add_to_swap(page))
> 					goto activate_locked;

Yes.

> 			}
> 
> > +			/*
> > +			 * Got swap space successfully. But unfortunately,
> > +			 * we don't support a THP page writeout so split it.
> > +			 */
> > +			if (PageTransHuge(page) &&
> > +				  split_huge_page_to_list(page, page_list)) {
> > +				delete_from_swap_cache(page);
> >  				goto activate_locked;
> > +			}
> 
> Pulling this out of add_to_swap() is an improvement for sure. Add an
> XXX: before that "we don't support THP writes" comment for good
> measure :)

Sure.

It could be a separate patch which makes add_to_swap clean via
removing page_list argument but I hope Huang take/fold it when he
resend it because it would be more important with THP swap.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-05-10 23:25               ` Minchan Kim
@ 2017-05-11  0:50                 ` Huang, Ying
  2017-05-11  4:31                   ` Minchan Kim
  2017-05-12  2:21                   ` [PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page Minchan Kim
  2017-05-11  1:22                 ` [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out Minchan Kim
  1 sibling, 2 replies; 26+ messages in thread
From: Huang, Ying @ 2017-05-11  0:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Huang, Ying, Andrew Morton, linux-mm,
	linux-kernel, Andrea Arcangeli, Ebru Akagunduz, Michal Hocko,
	Tejun Heo, Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

Minchan Kim <minchan@kernel.org> writes:

> On Wed, May 10, 2017 at 09:56:54AM -0400, Johannes Weiner wrote:
>> Hi Michan,
>> 
>> On Tue, May 02, 2017 at 08:53:32AM +0900, Minchan Kim wrote:
>> > @@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
>> >  /*
>> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >   */
>> > -void swapcache_free(swp_entry_t entry)
>> > +void __swapcache_free(swp_entry_t entry)
>> >  {
>> >  	struct swap_info_struct *p;
>> >  
>> > @@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
>> >  }
>> >  
>> >  #ifdef CONFIG_THP_SWAP
>> > -void swapcache_free_cluster(swp_entry_t entry)
>> > +void __swapcache_free_cluster(swp_entry_t entry)
>> >  {
>> >  	unsigned long offset = swp_offset(entry);
>> >  	unsigned long idx = offset / SWAPFILE_CLUSTER;
>> > @@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
>> >  }
>> >  #endif /* CONFIG_THP_SWAP */
>> >  
>> > +void swapcache_free(struct page *page, swp_entry_t entry)
>> > +{
>> > +	if (!PageTransHuge(page))
>> > +		__swapcache_free(entry);
>> > +	else
>> > +		__swapcache_free_cluster(entry);
>> > +}
>> 
>> I don't think this is cleaner :/
>> 
>> On your second patch:
>> 
>> > @@ -1125,8 +1125,28 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> >  		    !PageSwapCache(page)) {
>> >  			if (!(sc->gfp_mask & __GFP_IO))
>> >  				goto keep_locked;
>> > -			if (!add_to_swap(page, page_list))
>> > +swap_retry:
>> > +			/*
>> > +			 * Retry after split if we fail to allocate
>> > +			 * swap space of a THP.
>> > +			 */
>> > +			if (!add_to_swap(page)) {
>> > +				if (!PageTransHuge(page) ||
>> > +				    split_huge_page_to_list(page, page_list))
>> > +					goto activate_locked;
>> > +				goto swap_retry;
>> > +			}
>> 
>> This is definitely better.
>
> Thanks.
>
>> 
>> However, I think it'd be cleaner without the label here:
>> 
>> 			if (!add_to_swap(page)) {
>> 				if (!PageTransHuge(page))
>> 					goto activate_locked;
>> 				/* Split THP and swap individual base pages */
>> 				if (split_huge_page_to_list(page, page_list))
>> 					goto activate_locked;
>> 				if (!add_to_swap(page))
>> 					goto activate_locked;
>
> Yes.
>
>> 			}
>> 
>> > +			/*
>> > +			 * Got swap space successfully. But unfortunately,
>> > +			 * we don't support a THP page writeout so split it.
>> > +			 */
>> > +			if (PageTransHuge(page) &&
>> > +				  split_huge_page_to_list(page, page_list)) {
>> > +				delete_from_swap_cache(page);
>> >  				goto activate_locked;
>> > +			}
>> 
>> Pulling this out of add_to_swap() is an improvement for sure. Add an
>> XXX: before that "we don't support THP writes" comment for good
>> measure :)
>
> Sure.
>
> It could be a separate patch which makes add_to_swap clean via
> removing page_list argument but I hope Huang take/fold it when he
> resend it because it would be more important with THP swap.

Sure.  I will take this patch as one patch of the THP swap series.
Because the first patch of the THP swap series is a little big, I don't
think it is a good idea to fold this patch into it.  Could you update
the patch according to Johannes' comments and resend it?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-05-10 23:25               ` Minchan Kim
  2017-05-11  0:50                 ` Huang, Ying
@ 2017-05-11  1:22                 ` Minchan Kim
  2017-05-11 10:40                   ` Johannes Weiner
  1 sibling, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2017-05-11  1:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

On Thu, May 11, 2017 at 08:25:56AM +0900, Minchan Kim wrote:
> On Wed, May 10, 2017 at 09:56:54AM -0400, Johannes Weiner wrote:
> > Hi Michan,
> > 
> > On Tue, May 02, 2017 at 08:53:32AM +0900, Minchan Kim wrote:
> > > @@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
> > >  /*
> > >   * Called after dropping swapcache to decrease refcnt to swap entries.
> > >   */
> > > -void swapcache_free(swp_entry_t entry)
> > > +void __swapcache_free(swp_entry_t entry)
> > >  {
> > >  	struct swap_info_struct *p;
> > >  
> > > @@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
> > >  }
> > >  
> > >  #ifdef CONFIG_THP_SWAP
> > > -void swapcache_free_cluster(swp_entry_t entry)
> > > +void __swapcache_free_cluster(swp_entry_t entry)
> > >  {
> > >  	unsigned long offset = swp_offset(entry);
> > >  	unsigned long idx = offset / SWAPFILE_CLUSTER;
> > > @@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
> > >  }
> > >  #endif /* CONFIG_THP_SWAP */
> > >  
> > > +void swapcache_free(struct page *page, swp_entry_t entry)
> > > +{
> > > +	if (!PageTransHuge(page))
> > > +		__swapcache_free(entry);
> > > +	else
> > > +		__swapcache_free_cluster(entry);
> > > +}
> > 
> > I don't think this is cleaner :/

Let's see a example add_to_swap. Without it, it looks like that.

int add_to_swap(struct page *page)
{
        entry = get_swap_page(page);
        ..
        ..
fail:
        if (PageTransHuge(page))
                swapcache_free_cluster(entry);
        else
                swapcache_free(entry);
}

It doesn't looks good to me because get_swap_page hides
where entry allocation is from cluster or slot but when
we free the entry allocated, we should be aware of the
internal and call right function. :(

Do you think it's better still?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-05-11  0:50                 ` Huang, Ying
@ 2017-05-11  4:31                   ` Minchan Kim
  2017-05-12  2:21                   ` [PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page Minchan Kim
  1 sibling, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2017-05-11  4:31 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

On Thu, May 11, 2017 at 08:50:01AM +0800, Huang, Ying wrote:
< snip >

> >> > @@ -1125,8 +1125,28 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >> >  		    !PageSwapCache(page)) {
> >> >  			if (!(sc->gfp_mask & __GFP_IO))
> >> >  				goto keep_locked;
> >> > -			if (!add_to_swap(page, page_list))
> >> > +swap_retry:
> >> > +			/*
> >> > +			 * Retry after split if we fail to allocate
> >> > +			 * swap space of a THP.
> >> > +			 */
> >> > +			if (!add_to_swap(page)) {
> >> > +				if (!PageTransHuge(page) ||
> >> > +				    split_huge_page_to_list(page, page_list))
> >> > +					goto activate_locked;
> >> > +				goto swap_retry;
> >> > +			}
> >> 
> >> This is definitely better.
> >
> > Thanks.
> >
> >> 
> >> However, I think it'd be cleaner without the label here:
> >> 
> >> 			if (!add_to_swap(page)) {
> >> 				if (!PageTransHuge(page))
> >> 					goto activate_locked;
> >> 				/* Split THP and swap individual base pages */
> >> 				if (split_huge_page_to_list(page, page_list))
> >> 					goto activate_locked;
> >> 				if (!add_to_swap(page))
> >> 					goto activate_locked;
> >
> > Yes.
> >
> >> 			}
> >> 
> >> > +			/*
> >> > +			 * Got swap space successfully. But unfortunately,
> >> > +			 * we don't support a THP page writeout so split it.
> >> > +			 */
> >> > +			if (PageTransHuge(page) &&
> >> > +				  split_huge_page_to_list(page, page_list)) {
> >> > +				delete_from_swap_cache(page);
> >> >  				goto activate_locked;
> >> > +			}
> >> 
> >> Pulling this out of add_to_swap() is an improvement for sure. Add an
> >> XXX: before that "we don't support THP writes" comment for good
> >> measure :)
> >
> > Sure.
> >
> > It could be a separate patch which makes add_to_swap clean via
> > removing page_list argument but I hope Huang take/fold it when he
> > resend it because it would be more important with THP swap.
> 
> Sure.  I will take this patch as one patch of the THP swap series.
> Because the first patch of the THP swap series is a little big, I don't
> think it is a good idea to fold this patch into it.  Could you update
> the patch according to Johannes' comments and resend it?

Okay, I will resend this clean-up patch against on yours patch
after finishing this discussion.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-05-11  1:22                 ` [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out Minchan Kim
@ 2017-05-11 10:40                   ` Johannes Weiner
  2017-05-12  1:34                     ` Minchan Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Weiner @ 2017-05-11 10:40 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

On Thu, May 11, 2017 at 10:22:13AM +0900, Minchan Kim wrote:
> On Thu, May 11, 2017 at 08:25:56AM +0900, Minchan Kim wrote:
> > On Wed, May 10, 2017 at 09:56:54AM -0400, Johannes Weiner wrote:
> > > Hi Michan,
> > > 
> > > On Tue, May 02, 2017 at 08:53:32AM +0900, Minchan Kim wrote:
> > > > @@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
> > > >  /*
> > > >   * Called after dropping swapcache to decrease refcnt to swap entries.
> > > >   */
> > > > -void swapcache_free(swp_entry_t entry)
> > > > +void __swapcache_free(swp_entry_t entry)
> > > >  {
> > > >  	struct swap_info_struct *p;
> > > >  
> > > > @@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
> > > >  }
> > > >  
> > > >  #ifdef CONFIG_THP_SWAP
> > > > -void swapcache_free_cluster(swp_entry_t entry)
> > > > +void __swapcache_free_cluster(swp_entry_t entry)
> > > >  {
> > > >  	unsigned long offset = swp_offset(entry);
> > > >  	unsigned long idx = offset / SWAPFILE_CLUSTER;
> > > > @@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
> > > >  }
> > > >  #endif /* CONFIG_THP_SWAP */
> > > >  
> > > > +void swapcache_free(struct page *page, swp_entry_t entry)
> > > > +{
> > > > +	if (!PageTransHuge(page))
> > > > +		__swapcache_free(entry);
> > > > +	else
> > > > +		__swapcache_free_cluster(entry);
> > > > +}
> > > 
> > > I don't think this is cleaner :/
> 
> Let's see a example add_to_swap. Without it, it looks like that.
> 
> int add_to_swap(struct page *page)
> {
>         entry = get_swap_page(page);
>         ..
>         ..
> fail:
>         if (PageTransHuge(page))
>                 swapcache_free_cluster(entry);
>         else
>                 swapcache_free(entry);
> }
> 
> It doesn't looks good to me because get_swap_page hides
> where entry allocation is from cluster or slot but when
> we free the entry allocated, we should be aware of the
> internal and call right function. :(

This could be nicer indeed. I just don't like the underscore versions
much, but symmetry with get_swap_page() would be nice.

How about put_swap_page()? :)

That can call the appropriate swapcache_free function then.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out
  2017-05-11 10:40                   ` Johannes Weiner
@ 2017-05-12  1:34                     ` Minchan Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Minchan Kim @ 2017-05-12  1:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Ebru Akagunduz, Michal Hocko, Tejun Heo,
	Hugh Dickins, Shaohua Li, Rik van Riel, cgroups

Hi Hannes,

On Thu, May 11, 2017 at 06:40:58AM -0400, Johannes Weiner wrote:
> On Thu, May 11, 2017 at 10:22:13AM +0900, Minchan Kim wrote:
> > On Thu, May 11, 2017 at 08:25:56AM +0900, Minchan Kim wrote:
> > > On Wed, May 10, 2017 at 09:56:54AM -0400, Johannes Weiner wrote:
> > > > Hi Michan,
> > > > 
> > > > On Tue, May 02, 2017 at 08:53:32AM +0900, Minchan Kim wrote:
> > > > > @@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
> > > > >  /*
> > > > >   * Called after dropping swapcache to decrease refcnt to swap entries.
> > > > >   */
> > > > > -void swapcache_free(swp_entry_t entry)
> > > > > +void __swapcache_free(swp_entry_t entry)
> > > > >  {
> > > > >  	struct swap_info_struct *p;
> > > > >  
> > > > > @@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
> > > > >  }
> > > > >  
> > > > >  #ifdef CONFIG_THP_SWAP
> > > > > -void swapcache_free_cluster(swp_entry_t entry)
> > > > > +void __swapcache_free_cluster(swp_entry_t entry)
> > > > >  {
> > > > >  	unsigned long offset = swp_offset(entry);
> > > > >  	unsigned long idx = offset / SWAPFILE_CLUSTER;
> > > > > @@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
> > > > >  }
> > > > >  #endif /* CONFIG_THP_SWAP */
> > > > >  
> > > > > +void swapcache_free(struct page *page, swp_entry_t entry)
> > > > > +{
> > > > > +	if (!PageTransHuge(page))
> > > > > +		__swapcache_free(entry);
> > > > > +	else
> > > > > +		__swapcache_free_cluster(entry);
> > > > > +}
> > > > 
> > > > I don't think this is cleaner :/
> > 
> > Let's see a example add_to_swap. Without it, it looks like that.
> > 
> > int add_to_swap(struct page *page)
> > {
> >         entry = get_swap_page(page);
> >         ..
> >         ..
> > fail:
> >         if (PageTransHuge(page))
> >                 swapcache_free_cluster(entry);
> >         else
> >                 swapcache_free(entry);
> > }
> > 
> > It doesn't looks good to me because get_swap_page hides
> > where entry allocation is from cluster or slot but when
> > we free the entry allocated, we should be aware of the
> > internal and call right function. :(
> 
> This could be nicer indeed. I just don't like the underscore versions
> much, but symmetry with get_swap_page() would be nice.
> 
> How about put_swap_page()? :)

Good idea. It's the best one I can do now.
Actually, get_swap_page is awkward to me. Maybe it would be nicer to
rename it with get_swap_[slot|entry] but, I will postpone it if someone
would be on same page with me in future.

> 
> That can call the appropriate swapcache_free function then.

Yub.
Thanks for the review!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page
  2017-05-11  0:50                 ` Huang, Ying
  2017-05-11  4:31                   ` Minchan Kim
@ 2017-05-12  2:21                   ` Minchan Kim
  2017-05-12  2:21                     ` [PATCH 2/2] mm: swap: move anonymous THP split logic to vmscan Minchan Kim
  2017-05-12 16:47                     ` [PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page Johannes Weiner
  1 sibling, 2 replies; 26+ messages in thread
From: Minchan Kim @ 2017-05-12  2:21 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, Johannes Weiner, linux-kernel, linux-mm, Minchan Kim

Now, get_swap_page takes struct page and allocates swap space
according to page size(ie, normal or THP) so it would be more
cleaner to introduce put_swap_page which is a counter function
of get_swap_page. Then, it calls right swap slot free function
depending on page's size.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  4 ++--
 mm/shmem.c           |  2 +-
 mm/swap_state.c      | 13 +++----------
 mm/swapfile.c        |  8 ++++++++
 mm/vmscan.c          |  2 +-
 5 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b60fea3748f8..8f12f67e869f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -393,6 +393,7 @@ static inline long get_nr_swap_pages(void)
 
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(struct page *page);
+extern void put_swap_page(struct page *page, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
@@ -400,7 +401,6 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
@@ -459,7 +459,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void put_swap_page(struct page *page, swp_entry_t swp)
 {
 }
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 29948d7da172..82158edaefdb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1326,7 +1326,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 
 	mutex_unlock(&shmem_swaplist_mutex);
 free_swap:
-	swapcache_free(swap);
+	put_swap_page(page, swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 16ff89d058f4..0ad214d7a7ad 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -231,10 +231,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 	return 1;
 
 fail_free:
-	if (PageTransHuge(page))
-		swapcache_free_cluster(entry);
-	else
-		swapcache_free(entry);
+	put_swap_page(page, entry);
 fail:
 	if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
 		goto retry;
@@ -259,11 +256,7 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	if (PageTransHuge(page))
-		swapcache_free_cluster(entry);
-	else
-		swapcache_free(entry);
-
+	put_swap_page(page, entry);
 	page_ref_sub(page, hpage_nr_pages(page));
 }
 
@@ -415,7 +408,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry);
+		put_swap_page(new_page, entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 596306272059..b65e49428090 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
 }
 #endif /* CONFIG_THP_SWAP */
 
+void put_swap_page(struct page *page, swp_entry_t entry)
+{
+	if (!PageTransHuge(page))
+		swapcache_free(entry);
+	else
+		swapcache_free_cluster(entry);
+}
+
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ebf468c5429..57268c0c8fcf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -708,7 +708,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		swapcache_free(swap);
+		put_swap_page(page, swap);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/2] mm: swap: move anonymous THP split logic to vmscan
  2017-05-12  2:21                   ` [PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page Minchan Kim
@ 2017-05-12  2:21                     ` Minchan Kim
  2017-05-12 16:48                       ` Johannes Weiner
  2017-05-12 16:47                     ` [PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page Johannes Weiner
  1 sibling, 1 reply; 26+ messages in thread
From: Minchan Kim @ 2017-05-12  2:21 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, Johannes Weiner, linux-kernel, linux-mm, Minchan Kim

The add_to_swap aims to allocate swap_space(ie, swap slot and
swapcache) so if it fails due to lack of space in case of THP
or something(hdd swap but tries THP swapout) *caller* rather
than add_to_swap itself should split the THP page and retry it
with base page which is more natural.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  4 ++--
 mm/swap_state.c      | 23 ++++++-----------------
 mm/vmscan.c          | 17 ++++++++++++++++-
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f12f67e869f..87cca2169d44 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -359,7 +359,7 @@ extern struct address_space *swapper_spaces[];
 		>> SWAP_ADDRESS_SPACE_SHIFT])
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *, struct list_head *list);
+extern int add_to_swap(struct page *);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
 extern void __delete_from_swap_cache(struct page *);
@@ -479,7 +479,7 @@ static inline struct page *lookup_swap_cache(swp_entry_t swp)
 	return NULL;
 }
 
-static inline int add_to_swap(struct page *page, struct list_head *list)
+static inline int add_to_swap(struct page *page)
 {
 	return 0;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0ad214d7a7ad..9c71b6b2562f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -184,7 +184,7 @@ void __delete_from_swap_cache(struct page *page)
  * Allocate swap space for the page and add the page to the
  * swap cache.  Caller needs to hold the page lock. 
  */
-int add_to_swap(struct page *page, struct list_head *list)
+int add_to_swap(struct page *page)
 {
 	swp_entry_t entry;
 	int err;
@@ -192,12 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
-retry:
 	entry = get_swap_page(page);
 	if (!entry.val)
-		goto fail;
+		return 0;
+
 	if (mem_cgroup_try_charge_swap(page, entry))
-		goto fail_free;
+		goto fail;
 
 	/*
 	 * Radix-tree node allocations from PF_MEMALLOC contexts could
@@ -218,23 +218,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		goto fail_free;
-
-	if (PageTransHuge(page)) {
-		err = split_huge_page_to_list(page, list);
-		if (err) {
-			delete_from_swap_cache(page);
-			return 0;
-		}
-	}
+		goto fail;
 
 	return 1;
 
-fail_free:
-	put_swap_page(page, entry);
 fail:
-	if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
-		goto retry;
+	put_swap_page(page, entry);
 	return 0;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 57268c0c8fcf..767e20856080 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1125,8 +1125,23 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		    !PageSwapCache(page)) {
 			if (!(sc->gfp_mask & __GFP_IO))
 				goto keep_locked;
-			if (!add_to_swap(page, page_list))
+			if (!add_to_swap(page)) {
+				if (!PageTransHuge(page))
+					goto activate_locked;
+				/* Split THP and swap individual base pages */
+				if (split_huge_page_to_list(page, page_list))
+					goto activate_locked;
+				if (!add_to_swap(page))
+					goto activate_locked;
+			}
+
+			/* XXX: We don't support THP writes */
+			if (PageTransHuge(page) &&
+				  split_huge_page_to_list(page, page_list)) {
+				delete_from_swap_cache(page);
 				goto activate_locked;
+			}
+
 			may_enter_fs = 1;
 
 			/* Adding to swap updated mapping */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page
  2017-05-12  2:21                   ` [PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page Minchan Kim
  2017-05-12  2:21                     ` [PATCH 2/2] mm: swap: move anonymous THP split logic to vmscan Minchan Kim
@ 2017-05-12 16:47                     ` Johannes Weiner
  1 sibling, 0 replies; 26+ messages in thread
From: Johannes Weiner @ 2017-05-12 16:47 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Huang Ying, Andrew Morton, linux-kernel, linux-mm

On Fri, May 12, 2017 at 11:21:23AM +0900, Minchan Kim wrote:
> Now, get_swap_page takes struct page and allocates swap space
> according to page size(ie, normal or THP) so it would be more
> cleaner to introduce put_swap_page which is a counter function
> of get_swap_page. Then, it calls right swap slot free function
> depending on page's size.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] mm: swap: move anonymous THP split logic to vmscan
  2017-05-12  2:21                     ` [PATCH 2/2] mm: swap: move anonymous THP split logic to vmscan Minchan Kim
@ 2017-05-12 16:48                       ` Johannes Weiner
  0 siblings, 0 replies; 26+ messages in thread
From: Johannes Weiner @ 2017-05-12 16:48 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Huang Ying, Andrew Morton, linux-kernel, linux-mm

On Fri, May 12, 2017 at 11:21:24AM +0900, Minchan Kim wrote:
> The add_to_swap aims to allocate swap_space(ie, swap slot and
> swapcache) so if it fails due to lack of space in case of THP
> or something(hdd swap but tries THP swapout) *caller* rather
> than add_to_swap itself should split the THP page and retry it
> with base page which is more natural.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-05-12 16:48 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-25 12:56 [PATCH -mm -v10 0/3] THP swap: Delay splitting THP during swapping out Huang, Ying
2017-04-25 12:56 ` [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
2017-04-27  5:31   ` Minchan Kim
2017-04-27  7:12     ` Huang, Ying
2017-04-27 13:37       ` Johannes Weiner
2017-04-28  8:40       ` Minchan Kim
2017-04-28 12:21         ` Huang, Ying
2017-05-10  1:03           ` Minchan Kim
2017-05-01 10:44         ` Johannes Weiner
2017-05-01 23:53           ` Minchan Kim
2017-05-10 13:56             ` Johannes Weiner
2017-05-10 23:25               ` Minchan Kim
2017-05-11  0:50                 ` Huang, Ying
2017-05-11  4:31                   ` Minchan Kim
2017-05-12  2:21                   ` [PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page Minchan Kim
2017-05-12  2:21                     ` [PATCH 2/2] mm: swap: move anonymous THP split logic to vmscan Minchan Kim
2017-05-12 16:48                       ` Johannes Weiner
2017-05-12 16:47                     ` [PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page Johannes Weiner
2017-05-11  1:22                 ` [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out Minchan Kim
2017-05-11 10:40                   ` Johannes Weiner
2017-05-12  1:34                     ` Minchan Kim
2017-04-25 12:56 ` [PATCH -mm -v10 2/3] mm, THP, swap: Check whether THP can be split firstly Huang, Ying
2017-04-25 21:43   ` Johannes Weiner
2017-04-25 12:56 ` [PATCH -mm -v10 3/3] mm, THP, swap: Enable THP swap optimization only if has compound map Huang, Ying
2017-04-25 21:46   ` Johannes Weiner
2017-04-28 13:16     ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).