* [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out
@ 2017-03-08  7:26 Huang, Ying
  2017-03-08  7:26 ` [PATCH -mm -v6 1/9] mm, swap: Make swap cluster size same as THP size on x86_64 Huang, Ying
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Huang, Ying @ 2017-03-08  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko

From: Huang Ying <ying.huang@intel.com>

Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [1/9], [3/9], [4/9], [5/9],
[6/9], [9/9].

Hi, Andrea, could you help me to review the THP part of the patchset?
Especially [2/9], [7/9] and [8/9].

Hi, Johannes, Michal and Vladimir, I am not very confident about the
memory cgroup part, especially [2/9].  Could you help me to review it?

And for all, any comment is welcome!


Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
swapping out pages, even on a high-end server machine, because storage
device performance has improved faster than that of a single logical
CPU.  And it seems that this trend will not change in the near future.
On the other hand, THP is becoming more and more popular because of
increased memory sizes.  So it becomes necessary to optimize THP swap
performance.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help improve the performance of THP swap.

- The THP swap space read/write will be 2M sequential IO.  This is
  particularly helpful for swap read, which is usually 4k random IO.
  This will improve the performance of THP swap too.

- It will help memory fragmentation, especially when the THP is
  heavily used by applications.  The 2M contiguous pages will be
  freed up after the THP is swapped out.

- It will improve THP utilization on systems with swap turned on,
  because the speed at which khugepaged collapses normal pages into
  THPs is quite slow.  After a THP is split during swap-out, it will
  take quite a long time for the normal pages to collapse back into a
  THP after being swapped in.  High THP utilization also helps the
  efficiency of page-based memory management.

There are some concerns regarding THP swap-in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device.  To deal with that, THP swap-in should
be turned on only when necessary.  For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.
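
To make the shape of such a control concrete, here is a minimal
sketch, assuming a hypothetical thp_swapin_mode knob and THP_SWAPIN_*
constants (nothing like this is part of the patchset):

	/*
	 * Purely illustrative policy check; thp_swapin_mode and the
	 * THP_SWAPIN_* constants are hypothetical, not from this
	 * patchset.
	 */
	static inline bool thp_swapin_allowed(struct vm_area_struct *vma)
	{
		switch (thp_swapin_mode) {
		case THP_SWAPIN_ALWAYS:
			return true;
		case THP_SWAPIN_MADVISE:
			/* on only for VMAs marked with MADV_HUGEPAGE */
			return !!(vma->vm_flags & VM_HUGEPAGE);
		case THP_SWAPIN_NEVER:
		default:
			return false;
		}
	}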

This patchset is based on 03/06 head of mmotm/master.

This patchset is the first step of THP swap support.  The plan is to
delay splitting the THP step by step, finally avoiding splitting the
THP during swap-out altogether and swapping the THP out/in as a whole.

As the first step, in this patchset, the splitting of the huge page is
delayed from almost the first step of swapping out to after the swap
space has been allocated for the THP and the THP has been added into
the swap cache.  This reduces the lock acquiring/releasing for the
locks used for swap cache management.
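
For reference, here is a minimal sketch of how the pieces of the
series fit together in the reworked swap-out path (my reading, with
error handling and the non-THP path elided; the real code is in
[9/9], and the gfp flags are placeholders):

	/* Sketch only, error handling elided: */
	if (PageTransHuge(page)) {
		/* [4/9]: try to allocate a whole swap cluster */
		swp_entry_t entry = get_huge_swap_page();

		if (!entry.val)
			goto fallback;		/* split, then 4k slots */
		/* [2/9]: charge all HPAGE_PMD_NR entries at once */
		if (mem_cgroup_try_charge_swap(page, entry, HPAGE_PMD_NR))
			goto free_cluster;	/* [5/9]: __swapcache_free(entry, true) */
		/* [6/9]: insert all HPAGE_PMD_NR sub-pages into the swap cache */
		if (add_to_swap_cache(page, entry, gfp))
			goto free_cluster;
		/* [7/9], [8/9]: split the THP while it is in the swap cache */
		if (split_huge_page_to_list(page, list))
			goto delete_from_cache;
	}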

With the patchset, the swap-out throughput improves by 14.9% (from
about 3.77GB/s to about 4.34GB/s) in the vm-scalability swap-w-seq test
case with 8 processes.  The test is done on a Xeon E5 v3 system.  The
swap device used is a RAM simulated PMEM (persistent memory) device.
To test sequential swapping out, the test case creates 8 processes,
which sequentially allocate and write to the anonymous pages until the
RAM and part of the swap device are used up.

The detailed comparison result is as follows:

base             base+patchset
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
   7043990 ±  0%     +21.2%    8536807 ±  0%  vm-scalability.throughput
    109.94 ±  1%     -16.2%      92.09 ±  0%  vm-scalability.time.elapsed_time
   3957091 ±  0%     +14.9%    4547173 ±  0%  vmstat.swap.so
     31.46 ±  1%     -38.3%      19.42 ±  0%  perf-stat.cache-miss-rate%
      1.04 ±  1%     +22.2%       1.27 ±  0%  perf-stat.ipc
      9.33 ±  2%     -60.7%       3.67 ±  1%  perf-profile.calltrace.cycles-pp.add_to_swap.shrink_page_list.shrink_inactive_list.shrink_node_memcg.shrink_node

Changelog:

v6:

- Rebased on latest -mm tree (cluster lock, etc.)
- Fixed a potential uninitialized variable bug in __swap_entry_free()
- Revised the swap read-ahead changes to avoid a theoretical race
  condition between swapoff and swap-out

v5:

- Per Hillf's comments, fixed a locking bug in the error path of
  __add_to_swap_cache(), and merged the code that calculates
  extra_pins into can_split_huge_page().

v4:

- Per Johannes' comments, simplified the swap cgroup array access code.
- Per Kirill and Dave Hansen's comments, used HPAGE_PMD_NR instead of
  HPAGE_SIZE/PAGE_SIZE.
- Per Anshuman's comments, used HPAGE_PMD_NR instead of 512 in patch
  description.

v3:

- Per Andrew's suggestion, used a more systematic way to determine
  whether to enable the THP swap optimization
- Per Andrew's comments, moved as much code as possible into
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE/#endif or "if (PageTransHuge())"
- Fixed some coding style warnings.

v2:

- Original [1/11] sent separately and merged
- Used switch in 10/10 per Hillf's suggestion

Best Regards,
Huang, Ying


* [PATCH -mm -v6 1/9] mm, swap: Make swap cluster size same as THP size on x86_64
  2017-03-08  7:26 [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out Huang, Ying
@ 2017-03-08  7:26 ` Huang, Ying
  2017-03-08 12:56   ` Matthew Wilcox
  2017-03-08  7:26 ` [PATCH -mm -v6 2/9] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2017-03-08  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

In this patch, the size of the swap cluster is changed to match that
of the THP (Transparent Huge Page) on the x86_64 architecture (512
pages).  This is for THP swap support on x86_64, where one swap cluster
will be used to hold the contents of each THP swapped out, and some
information about the swapped-out THP (such as the compound map count)
will be recorded in the swap_cluster_info data structure.

For other architectures that want THP swap support,
ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
the architecture.

In effect, this doubles the swap cluster size on x86_64, which may make
it harder to find a free cluster when the swap space becomes
fragmented.  So, in theory, this may reduce continuous swap space
allocation and sequential writes.  The performance tests in 0day show
no regressions caused by this.
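
For reference, the arithmetic behind the new value on x86_64 (4KB base
pages, 2MB PMD-sized huge pages; my annotation, not from the patch):

	/*
	 * Illustrative arithmetic (not from the patch):
	 *   PAGE_SHIFT   = 12                             -> 4KB base page
	 *   PMD_SHIFT    = 21                             -> 2MB huge page
	 *   HPAGE_PMD_NR = 1 << (PMD_SHIFT - PAGE_SHIFT)  =  512
	 * so SWAPFILE_CLUSTER grows from 256 pages (1MB) to 512 pages
	 * (2MB), exactly one THP per cluster.
	 */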

Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 arch/x86/Kconfig |  1 +
 mm/Kconfig       | 13 +++++++++++++
 mm/swapfile.c    |  4 ++++
 3 files changed, 18 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 005df7c825f5..94908127c83e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -176,6 +176,7 @@ config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_USES_THP_SWAP_CLUSTER	if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb969dc..7b708e200c29 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -499,6 +499,19 @@ config FRONTSWAP
 
 	  If unsure, say Y to enable frontswap.
 
+config ARCH_USES_THP_SWAP_CLUSTER
+	bool
+	default n
+
+config THP_SWAP_CLUSTER
+	bool
+	depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
+	default y
+	help
+	  Use one swap cluster to hold the contents of the THP
+	  (Transparent Huge Page) swapped out.  The size of the swap
+	  cluster will be same as that of THP.
+
 config CMA
 	bool "Contiguous Memory Allocator"
 	depends on HAVE_MEMBLOCK && MMU
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5e1e9bffa3e2..2396e1a350c8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -199,7 +199,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 	}
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
+#else
 #define SWAPFILE_CLUSTER	256
+#endif
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
-- 
2.11.0


* [PATCH -mm -v6 2/9] mm, memcg: Support to charge/uncharge multiple swap entries
  2017-03-08  7:26 [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out Huang, Ying
  2017-03-08  7:26 ` [PATCH -mm -v6 1/9] mm, swap: Make swap cluster size same as THP size on x86_64 Huang, Ying
@ 2017-03-08  7:26 ` Huang, Ying
  2017-03-08 11:37   ` Balbir Singh
  2017-03-08  7:26 ` [PATCH -mm -v6 3/9] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2017-03-08  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Kirill A . Shutemov, Vladimir Davydov, Johannes Weiner,
	Michal Hocko, Tejun Heo, cgroups

From: Huang Ying <ying.huang@intel.com>

This patch makes it possible to charge or uncharge a set of contiguous
swap entries in the swap cgroup.  The number of swap entries is
specified via an added parameter.

This will be used for THP (Transparent Huge Page) swap support, where
the swap cluster backing a THP may be allocated and freed as a whole,
so the set of (HPAGE_PMD_NR) contiguous swap entries backing one THP
needs to be charged or uncharged together.  This batches the cgroup
operations for THP swap too.
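
As a usage sketch, a caller swapping out a THP could charge all of the
entries backing it in one call (illustrative only; the real caller
arrives later in the series, and the huge variant of the free path is
added in [5/9]):

	/* Illustrative caller, not part of this patch: */
	unsigned int nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;

	if (mem_cgroup_try_charge_swap(page, entry, nr)) {
		/*
		 * Charge failed: release the just-allocated swap
		 * space ([5/9] adds the huge variant of this path).
		 */
		swapcache_free(entry);
		return 0;
	}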

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h        | 12 ++++++----
 include/linux/swap_cgroup.h |  6 +++--
 mm/memcontrol.c             | 57 +++++++++++++++++++++++++--------------------
 mm/shmem.c                  |  2 +-
 mm/swap_cgroup.c            | 40 +++++++++++++++++++++++--------
 mm/swap_state.c             |  2 +-
 mm/swapfile.c               |  2 +-
 7 files changed, 77 insertions(+), 44 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 486494e6b2fc..278e1349a424 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -550,8 +550,10 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 
 #ifdef CONFIG_MEMCG_SWAP
 extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
-extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
-extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
+extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry,
+				      unsigned int nr_entries);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry,
+				     unsigned int nr_entries);
 extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg);
 extern bool mem_cgroup_swap_full(struct page *page);
 #else
@@ -560,12 +562,14 @@ static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 }
 
 static inline int mem_cgroup_try_charge_swap(struct page *page,
-					     swp_entry_t entry)
+					     swp_entry_t entry,
+					     unsigned int nr_entries)
 {
 	return 0;
 }
 
-static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
+					    unsigned int nr_entries)
 {
 }
 
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index 145306bdc92f..b2b8ec7bda3f 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -7,7 +7,8 @@
 
 extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new);
-extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
+extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+					 unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
@@ -15,7 +16,8 @@ extern void swap_cgroup_swapoff(int type);
 #else
 
 static inline
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
+unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+				  unsigned int nr_ents)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 537362e4108e..2b0b6012ad22 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2393,10 +2393,9 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 
 #ifdef CONFIG_MEMCG_SWAP
 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
+				       int nr_entries)
 {
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_entries);
 }
 
 /**
@@ -2422,8 +2421,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 	new_id = mem_cgroup_id(to);
 
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_swap_statistics(to, true);
+		mem_cgroup_swap_statistics(from, -1);
+		mem_cgroup_swap_statistics(to, 1);
 		return 0;
 	}
 	return -EINVAL;
@@ -5446,7 +5445,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		 * let's not wait for it.  The page already received a
 		 * memory+swap charge, drop the swap entry duplicate.
 		 */
-		mem_cgroup_uncharge_swap(entry);
+		mem_cgroup_uncharge_swap(entry, nr_pages);
 	}
 }
 
@@ -5874,9 +5873,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * ancestor for the swap instead and transfer the memory+swap charge.
 	 */
 	swap_memcg = mem_cgroup_id_get_online(memcg);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg));
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 1);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(swap_memcg, true);
+	mem_cgroup_swap_statistics(swap_memcg, 1);
 
 	page->mem_cgroup = NULL;
 
@@ -5903,16 +5902,19 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 		css_put(&memcg->css);
 }
 
-/*
- * mem_cgroup_try_charge_swap - try charging a swap entry
+/**
+ * mem_cgroup_try_charge_swap - try charging a set of swap entries
  * @page: page being added to swap
- * @entry: swap entry to charge
+ * @entry: the first swap entry to charge
+ * @nr_entries: the number of swap entries to charge
  *
- * Try to charge @entry to the memcg that @page belongs to.
+ * Try to charge @nr_entries swap entries starting from @entry to the
+ * memcg that @page belongs to.
  *
  * Returns 0 on success, -ENOMEM on failure.
  */
-int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
+int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry,
+			       unsigned int nr_entries)
 {
 	struct mem_cgroup *memcg;
 	struct page_counter *counter;
@@ -5930,25 +5932,29 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	memcg = mem_cgroup_id_get_online(memcg);
 
 	if (!mem_cgroup_is_root(memcg) &&
-	    !page_counter_try_charge(&memcg->swap, 1, &counter)) {
+	    !page_counter_try_charge(&memcg->swap, nr_entries, &counter)) {
 		mem_cgroup_id_put(memcg);
 		return -ENOMEM;
 	}
 
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
+	if (nr_entries > 1)
+		mem_cgroup_id_get_many(memcg, nr_entries - 1);
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_entries);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(memcg, true);
+	mem_cgroup_swap_statistics(memcg, nr_entries);
 
 	return 0;
 }
 
 /**
- * mem_cgroup_uncharge_swap - uncharge a swap entry
- * @entry: swap entry to uncharge
+ * mem_cgroup_uncharge_swap - uncharge a set of swap entries
+ * @entry: the first swap entry to uncharge
+ * @nr_entries: the number of swap entries to uncharge
  *
- * Drop the swap charge associated with @entry.
+ * Drop the swap charge associated with @nr_entries swap entries
+ * starting from @entry.
  */
-void mem_cgroup_uncharge_swap(swp_entry_t entry)
+void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_entries)
 {
 	struct mem_cgroup *memcg;
 	unsigned short id;
@@ -5956,18 +5962,19 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
 	if (!do_swap_account)
 		return;
 
-	id = swap_cgroup_record(entry, 0);
+	id = swap_cgroup_record(entry, 0, nr_entries);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
 		if (!mem_cgroup_is_root(memcg)) {
 			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
-				page_counter_uncharge(&memcg->swap, 1);
+				page_counter_uncharge(&memcg->swap, nr_entries);
 			else
-				page_counter_uncharge(&memcg->memsw, 1);
+				page_counter_uncharge(&memcg->memsw,
+						      nr_entries);
 		}
-		mem_cgroup_swap_statistics(memcg, false);
-		mem_cgroup_id_put(memcg);
+		mem_cgroup_swap_statistics(memcg, -nr_entries);
+		mem_cgroup_id_put_many(memcg, nr_entries);
 	}
 	rcu_read_unlock();
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index e67d6ba4e98e..effe07ef5c26 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1294,7 +1294,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (!swap.val)
 		goto redirty;
 
-	if (mem_cgroup_try_charge_swap(page, swap))
+	if (mem_cgroup_try_charge_swap(page, swap, 1))
 		goto free_swap;
 
 	/*
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 310ac0b8f974..8cee2d125815 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -58,21 +58,27 @@ static int swap_cgroup_prepare(int type)
 	return -ENOMEM;
 }
 
+static struct swap_cgroup *__lookup_swap_cgroup(struct swap_cgroup_ctrl *ctrl,
+						pgoff_t offset)
+{
+	struct page *mappage;
+	struct swap_cgroup *sc;
+
+	mappage = ctrl->map[offset / SC_PER_PAGE];
+	sc = page_address(mappage);
+	return sc + offset % SC_PER_PAGE;
+}
+
 static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
 					struct swap_cgroup_ctrl **ctrlp)
 {
 	pgoff_t offset = swp_offset(ent);
 	struct swap_cgroup_ctrl *ctrl;
-	struct page *mappage;
-	struct swap_cgroup *sc;
 
 	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
 	if (ctrlp)
 		*ctrlp = ctrl;
-
-	mappage = ctrl->map[offset / SC_PER_PAGE];
-	sc = page_address(mappage);
-	return sc + offset % SC_PER_PAGE;
+	return __lookup_swap_cgroup(ctrl, offset);
 }
 
 /**
@@ -105,25 +111,39 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 }
 
 /**
- * swap_cgroup_record - record mem_cgroup for this swp_entry.
- * @ent: swap entry to be recorded into
+ * swap_cgroup_record - record mem_cgroup for a set of swap entries
+ * @ent: the first swap entry to be recorded into
  * @id: mem_cgroup to be recorded
+ * @nr_ents: number of swap entries to be recorded
  *
  * Returns old value at success, 0 at failure.
  * (Of course, old value can be 0.)
  */
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
+unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+				  unsigned int nr_ents)
 {
 	struct swap_cgroup_ctrl *ctrl;
 	struct swap_cgroup *sc;
 	unsigned short old;
 	unsigned long flags;
+	pgoff_t offset = swp_offset(ent);
+	pgoff_t end = offset + nr_ents;
 
 	sc = lookup_swap_cgroup(ent, &ctrl);
 
 	spin_lock_irqsave(&ctrl->lock, flags);
 	old = sc->id;
-	sc->id = id;
+	for (;;) {
+		VM_BUG_ON(sc->id != old);
+		sc->id = id;
+		offset++;
+		if (offset == end)
+			break;
+		if (offset % SC_PER_PAGE)
+			sc++;
+		else
+			sc = __lookup_swap_cgroup(ctrl, offset);
+	}
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	return old;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 473b71e052a8..3c248f0a0abc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -182,7 +182,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 	if (!entry.val)
 		return 0;
 
-	if (mem_cgroup_try_charge_swap(page, entry)) {
+	if (mem_cgroup_try_charge_swap(page, entry, 1)) {
 		swapcache_free(entry);
 		return 0;
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2396e1a350c8..a744604384ff 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1012,7 +1012,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
 	dec_cluster_info_page(p, p->cluster_info, offset);
 	unlock_cluster(ci);
 
-	mem_cgroup_uncharge_swap(entry);
+	mem_cgroup_uncharge_swap(entry, 1);
 	if (offset < p->lowest_bit)
 		p->lowest_bit = offset;
 	if (offset > p->highest_bit) {
-- 
2.11.0


* [PATCH -mm -v6 3/9] mm, THP, swap: Add swap cluster allocate/free functions
  2017-03-08  7:26 [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out Huang, Ying
  2017-03-08  7:26 ` [PATCH -mm -v6 1/9] mm, swap: Make swap cluster size same as THP size on x86_64 Huang, Ying
  2017-03-08  7:26 ` [PATCH -mm -v6 2/9] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
@ 2017-03-08  7:26 ` Huang, Ying
  2017-03-14 23:13   ` Tim Chen
  2017-03-08  7:26 ` [PATCH -mm -v6 4/9] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2017-03-08  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Kirill A . Shutemov, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel

From: Huang Ying <ying.huang@intel.com>

The swap cluster allocation/free functions are added based on the
existing swap cluster management mechanism for SSDs.  These functions
don't work for rotating hard disks, because the existing swap cluster
management mechanism doesn't work for them.  Hard disk support may be
added later if someone really needs it, but it needn't be included in
this patchset.

This will be used for THP (Transparent Huge Page) swap support, where
one swap cluster will hold the contents of each THP swapped out.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swapfile.c | 217 +++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 156 insertions(+), 61 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index a744604384ff..91876c33114b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -378,6 +378,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	schedule_work(&si->discard_work);
 }
 
+static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
+	cluster_list_add_tail(&si->free_clusters, ci, idx);
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
  * will be added to free cluster list. caller should hold si->lock.
@@ -398,10 +406,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 
 		spin_lock(&si->lock);
 		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
-		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
-		unlock_cluster(ci);
-		cluster_list_add_tail(&si->free_clusters, info, idx);
-		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
+		__free_cluster(si, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 		unlock_cluster(ci);
@@ -419,6 +424,34 @@ static void swap_discard_work(struct work_struct *work)
 	spin_unlock(&si->lock);
 }
 
+static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
+	cluster_list_del_first(&si->free_clusters, ci);
+	cluster_set_count_flag(ci + idx, 0, 0);
+}
+
+static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+
+	VM_BUG_ON(cluster_count(ci) != 0);
+	/*
+	 * If the swap is discardable, prepare discard the cluster
+	 * instead of free it immediately. The cluster will be freed
+	 * after discard.
+	 */
+	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
+	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
+		swap_cluster_schedule_discard(si, idx);
+		return;
+	}
+
+	__free_cluster(si, idx);
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
  * removed from free cluster list and its usage counter will be increased.
@@ -430,11 +463,8 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx])) {
-		VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx);
-		cluster_list_del_first(&p->free_clusters, cluster_info);
-		cluster_set_count_flag(&cluster_info[idx], 0, 0);
-	}
+	if (cluster_is_free(&cluster_info[idx]))
+		alloc_cluster(p, idx);
 
 	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
@@ -458,21 +488,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 	cluster_set_count(&cluster_info[idx],
 		cluster_count(&cluster_info[idx]) - 1);
 
-	if (cluster_count(&cluster_info[idx]) == 0) {
-		/*
-		 * If the swap is discardable, prepare discard the cluster
-		 * instead of free it immediately. The cluster will be freed
-		 * after discard.
-		 */
-		if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
-				 (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-			swap_cluster_schedule_discard(p, idx);
-			return;
-		}
-
-		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
-	}
+	if (cluster_count(&cluster_info[idx]) == 0)
+		free_cluster(p, idx);
 }
 
 /*
@@ -562,6 +579,71 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	return found_free;
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static inline unsigned int huge_cluster_nr_entries(bool huge)
+{
+	return huge ? SWAPFILE_CLUSTER : 1;
+}
+#else
+#define huge_cluster_nr_entries(huge)	1
+#endif
+
+static void _swap_entry_alloc(struct swap_info_struct *si,
+			      unsigned long offset, bool huge)
+{
+	unsigned int nr_entries = huge_cluster_nr_entries(huge);
+	unsigned int end = offset + nr_entries - 1;
+
+	if (offset == si->lowest_bit)
+		si->lowest_bit += nr_entries;
+	if (end == si->highest_bit)
+		si->highest_bit -= nr_entries;
+	si->inuse_pages += nr_entries;
+	if (si->inuse_pages == si->pages) {
+		si->lowest_bit = si->max;
+		si->highest_bit = 0;
+		spin_lock(&swap_avail_lock);
+		plist_del(&si->avail_list, &swap_avail_head);
+		spin_unlock(&swap_avail_lock);
+	}
+}
+
+static void _swap_entry_free(struct swap_info_struct *si, unsigned long offset,
+			     bool huge)
+{
+	unsigned int nr_entries = huge_cluster_nr_entries(huge);
+	unsigned long end = offset + nr_entries - 1;
+	void (*swap_slot_free_notify)(struct block_device *, unsigned long);
+
+	if (offset < si->lowest_bit)
+		si->lowest_bit = offset;
+	if (end > si->highest_bit) {
+		bool was_full = !si->highest_bit;
+
+		si->highest_bit = end;
+		if (was_full && (si->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&si->avail_list));
+			if (plist_node_empty(&si->avail_list))
+				plist_add(&si->avail_list, &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
+		}
+	}
+	atomic_long_add(nr_entries, &nr_swap_pages);
+	si->inuse_pages -= nr_entries;
+	if (si->flags & SWP_BLKDEV)
+		swap_slot_free_notify =
+			si->bdev->bd_disk->fops->swap_slot_free_notify;
+	else
+		swap_slot_free_notify = NULL;
+	while (offset <= end) {
+		frontswap_invalidate_page(si->type, offset);
+		if (swap_slot_free_notify)
+			swap_slot_free_notify(si->bdev, offset);
+		offset++;
+	}
+}
+
 static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
 			       swp_entry_t slots[])
@@ -677,18 +759,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 			goto done;
 	}
 
-	if (offset == si->lowest_bit)
-		si->lowest_bit++;
-	if (offset == si->highest_bit)
-		si->highest_bit--;
-	si->inuse_pages++;
-	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
-		spin_lock(&swap_avail_lock);
-		plist_del(&si->avail_list, &swap_avail_head);
-		spin_unlock(&swap_avail_lock);
-	}
+	_swap_entry_alloc(si, offset, false);
 	si->swap_map[offset] = usage;
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	unlock_cluster(ci);
@@ -770,6 +841,54 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static void swap_free_huge_cluster(struct swap_info_struct *si,
+				   unsigned long idx)
+{
+	struct swap_cluster_info *ci;
+	unsigned long offset = idx * SWAPFILE_CLUSTER;
+
+	ci = lock_cluster(si, offset);
+	cluster_set_count_flag(ci, 0, 0);
+	free_cluster(si, idx);
+	unlock_cluster(ci);
+	_swap_entry_free(si, offset, true);
+}
+
+static int swap_alloc_huge_cluster(struct swap_info_struct *si,
+				   swp_entry_t *slot)
+{
+	unsigned long idx;
+	struct swap_cluster_info *ci;
+	unsigned long offset, i;
+	unsigned char *map;
+
+	if (cluster_list_empty(&si->free_clusters))
+		return 0;
+
+	idx = cluster_list_first(&si->free_clusters);
+	offset = idx * SWAPFILE_CLUSTER;
+	ci = lock_cluster(si, offset);
+	alloc_cluster(si, idx);
+	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+	_swap_entry_alloc(si, offset, true);
+
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++)
+		map[i] = SWAP_HAS_CACHE;
+	unlock_cluster(ci);
+	*slot = swp_entry(si->type, offset);
+
+	return SWAPFILE_CLUSTER;
+}
+#else
+static inline int swap_alloc_huge_cluster(struct swap_info_struct *si,
+					  swp_entry_t *slot)
+{
+	return 0;
+}
+#endif
+
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
@@ -1013,31 +1132,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
 	unlock_cluster(ci);
 
 	mem_cgroup_uncharge_swap(entry, 1);
-	if (offset < p->lowest_bit)
-		p->lowest_bit = offset;
-	if (offset > p->highest_bit) {
-		bool was_full = !p->highest_bit;
-
-		p->highest_bit = offset;
-		if (was_full && (p->flags & SWP_WRITEOK)) {
-			spin_lock(&swap_avail_lock);
-			WARN_ON(!plist_node_empty(&p->avail_list));
-			if (plist_node_empty(&p->avail_list))
-				plist_add(&p->avail_list,
-					  &swap_avail_head);
-			spin_unlock(&swap_avail_lock);
-		}
-	}
-	atomic_long_inc(&nr_swap_pages);
-	p->inuse_pages--;
-	frontswap_invalidate_page(p->type, offset);
-	if (p->flags & SWP_BLKDEV) {
-		struct gendisk *disk = p->bdev->bd_disk;
-
-		if (disk->fops->swap_slot_free_notify)
-			disk->fops->swap_slot_free_notify(p->bdev,
-							  offset);
-	}
+	_swap_entry_free(p, offset, false);
 }
 
 /*
-- 
2.11.0


* [PATCH -mm -v6 4/9] mm, THP, swap: Add get_huge_swap_page()
  2017-03-08  7:26 [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (2 preceding siblings ...)
  2017-03-08  7:26 ` [PATCH -mm -v6 3/9] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
@ 2017-03-08  7:26 ` Huang, Ying
  2017-03-14 23:40   ` Tim Chen
  2017-03-08  7:26 ` [PATCH -mm -v6 5/9] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page Huang, Ying
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2017-03-08  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Kirill A . Shutemov, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel

From: Huang Ying <ying.huang@intel.com>

A variation of get_swap_page(), get_huge_swap_page(), is added to
allocate a swap cluster (HPAGE_PMD_NR swap slots) based on the swap
cluster allocation function.  A fairly simple algorithm is used: only
the first swap device in the priority list is tried when allocating the
swap cluster.  The function fails if that attempt is unsuccessful, and
the caller will fall back to allocating a single swap slot instead.
This works well enough for normal cases.

This will be used for THP (Transparent Huge Page) swap support, where
get_huge_swap_page() will be used to allocate one swap cluster for each
THP swapped out.

Because of the algorithm adopted, if the number of free swap clusters
differs significantly among multiple swap devices, it is possible that
some THPs are split earlier than necessary.  For example, this could be
caused by a big size difference among multiple swap devices.
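
A sketch of the intended calling convention (my reading; the real
caller lands in [9/9]):

	/* Sketch only: */
	swp_entry_t entry = get_huge_swap_page();

	if (!entry.val) {
		/* no free cluster: split, then use ordinary 4k slots */
		if (split_huge_page_to_list(page, NULL))
			return 0;		/* cannot split either */
		entry = get_swap_page();
	}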

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h | 19 ++++++++++++++++++-
 mm/swap_slots.c      |  5 +++--
 mm/swapfile.c        | 16 ++++++++++++----
 3 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 278e1349a424..e3a7609a8989 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -388,7 +388,7 @@ static inline long get_nr_swap_pages(void)
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
 extern swp_entry_t get_swap_page_of_type(int);
-extern int get_swap_pages(int n, swp_entry_t swp_entries[]);
+extern int get_swap_pages(int n, swp_entry_t swp_entries[], bool huge);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
@@ -527,6 +527,23 @@ static inline swp_entry_t get_swap_page(void)
 
 #endif /* CONFIG_SWAP */
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	swp_entry_t entry;
+
+	if (get_swap_pages(1, &entry, true))
+		return entry;
+	else
+		return (swp_entry_t) {0};
+}
+#else
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return (swp_entry_t) {0};
+}
+#endif
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 9b5bc86f96ad..075bb39e03c5 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -258,7 +258,8 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
 
 	cache->cur = 0;
 	if (swap_slot_cache_active)
-		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots);
+		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots,
+					   false);
 
 	return cache->nr;
 }
@@ -334,7 +335,7 @@ swp_entry_t get_swap_page(void)
 			return entry;
 	}
 
-	get_swap_pages(1, &entry);
+	get_swap_pages(1, &entry, false);
 
 	return entry;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 91876c33114b..7241c937e52b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -904,11 +904,12 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 }
 
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
+int get_swap_pages(int n_goal, swp_entry_t swp_entries[], bool huge)
 {
 	struct swap_info_struct *si, *next;
 	long avail_pgs;
 	int n_ret = 0;
+	int nr_pages = huge_cluster_nr_entries(huge);
 
 	avail_pgs = atomic_long_read(&nr_swap_pages);
 	if (avail_pgs <= 0)
@@ -920,6 +921,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 	if (n_goal > avail_pgs)
 		n_goal = avail_pgs;
 
+	n_goal *= nr_pages;
+	if (avail_pgs < n_goal)
+		goto noswap;
+
 	atomic_long_sub(n_goal, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
@@ -946,10 +951,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					    n_goal, swp_entries);
+		if (likely(nr_pages == 1))
+			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+						    n_goal, swp_entries);
+		else
+			n_ret = swap_alloc_huge_cluster(si, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret)
+		if (n_ret || unlikely(nr_pages != 1))
 			goto check_out;
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
 			si->type);
-- 
2.11.0


* [PATCH -mm -v6 5/9] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page
  2017-03-08  7:26 [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (3 preceding siblings ...)
  2017-03-08  7:26 ` [PATCH -mm -v6 4/9] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
@ 2017-03-08  7:26 ` Huang, Ying
  2017-03-15  0:14   ` Tim Chen
  2017-03-08  7:26 ` [PATCH -mm -v6 6/9] mm, THP, swap: Support to add/delete THP to/from swap cache Huang, Ying
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2017-03-08  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Kirill A . Shutemov, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel

From: Huang Ying <ying.huang@intel.com>

__swapcache_free() is added to support clearing the SWAP_HAS_CACHE
flag for a huge page.  This frees the specified swap cluster
immediately, because for now the function is called only in the error
path, to free a swap cluster that has just been allocated.  So each
corresponding swap_map[i] == SWAP_HAS_CACHE, that is, the swap count is
0.  This makes the implementation simpler than that for ordinary swap
entries.

This will be used for delaying splitting the THP (Transparent Huge
Page) during swap-out.  There, for one THP to be swapped out, we will
allocate a swap cluster, add the THP into the swap cache, then split
the THP.  If anything fails after allocating the swap cluster and
before splitting the THP successfully, swapcache_free_trans_huge() will
be used to free the swap space allocated.
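
The invariant this relies on, written out (my annotation, not from the
patch):

	/*
	 * The cluster was only just allocated, so for each of its
	 * slots
	 *
	 *   swap_map[i] == SWAP_HAS_CACHE   (cache bit set, count 0)
	 *
	 * Clearing SWAP_HAS_CACHE drops every slot straight to 0, so
	 * the whole cluster can be freed at once without the usual
	 * per-entry reference count handling.
	 */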

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |  9 +++++++--
 mm/swapfile.c        | 34 ++++++++++++++++++++++++++++++++--
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index e3a7609a8989..2f2a6c0363aa 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -394,7 +394,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
+extern void __swapcache_free(swp_entry_t entry, bool huge);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
@@ -456,7 +456,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void __swapcache_free(swp_entry_t swp, bool huge)
 {
 }
 
@@ -544,6 +544,11 @@ static inline swp_entry_t get_huge_swap_page(void)
 }
 #endif
 
+static inline void swapcache_free(swp_entry_t entry)
+{
+	__swapcache_free(entry, false);
+}
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7241c937e52b..6019f94afbaf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -855,6 +855,29 @@ static void swap_free_huge_cluster(struct swap_info_struct *si,
 	_swap_entry_free(si, offset, true);
 }
 
+static void swapcache_free_trans_huge(struct swap_info_struct *si,
+				      swp_entry_t entry)
+{
+	unsigned long offset = swp_offset(entry);
+	unsigned long idx = offset / SWAPFILE_CLUSTER;
+	struct swap_cluster_info *ci;
+	unsigned char *map;
+	unsigned int i;
+
+	spin_lock(&si->lock);
+	ci = lock_cluster(si, offset);
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
+		map[i] &= ~SWAP_HAS_CACHE;
+	}
+	unlock_cluster(ci);
+	/* Cluster size is same as huge pmd size */
+	mem_cgroup_uncharge_swap(entry, HPAGE_PMD_NR);
+	swap_free_huge_cluster(si, idx);
+	spin_unlock(&si->lock);
+}
+
 static int swap_alloc_huge_cluster(struct swap_info_struct *si,
 				   swp_entry_t *slot)
 {
@@ -887,6 +910,11 @@ static inline int swap_alloc_huge_cluster(struct swap_info_struct *si,
 {
 	return 0;
 }
+
+static inline void swapcache_free_trans_huge(struct swap_info_struct *si,
+					     swp_entry_t entry)
+{
+}
 #endif
 
 static unsigned long scan_swap_map(struct swap_info_struct *si,
@@ -1161,13 +1189,15 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry)
+void __swapcache_free(swp_entry_t entry, bool huge)
 {
 	struct swap_info_struct *p;
 
 	p = _swap_info_get(entry);
 	if (p) {
-		if (!__swap_entry_free(p, entry, SWAP_HAS_CACHE))
+		if (unlikely(huge))
+			swapcache_free_trans_huge(p, entry);
+		else if (!__swap_entry_free(p, entry, SWAP_HAS_CACHE))
 			free_swap_slot(entry);
 	}
 }
-- 
2.11.0


* [PATCH -mm -v6 6/9] mm, THP, swap: Support to add/delete THP to/from swap cache
  2017-03-08  7:26 [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (4 preceding siblings ...)
  2017-03-08  7:26 ` [PATCH -mm -v6 5/9] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page Huang, Ying
@ 2017-03-08  7:26 ` Huang, Ying
  2017-03-08  7:26 ` [PATCH -mm -v6 7/9] mm, THP: Add can_split_huge_page() Huang, Ying
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-03-08  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov

From: Huang Ying <ying.huang@intel.com>

With this patch, a THP (Transparent Huge Page) can be added to or
deleted from the swap cache as a set of (HPAGE_PMD_NR) sub-pages.

This will be used for THP (Transparent Huge Page) swap support, where
one THP may be added to or deleted from the swap cache as a whole.
This batches the swap cache operations to reduce the number of lock
acquisitions/releases for THP swap too.
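
As an illustrative view of the batching (my annotation, not code from
the patch):

	/*
	 * For a THP, __add_to_swap_cache() now performs, under a
	 * single tree_lock acquisition:
	 *
	 *   for (i = 0; i < HPAGE_PMD_NR; i++) {
	 *       radix tree slot swp_offset(entry) + i  <-  page + i
	 *       page_private(page + i) = entry.val + i
	 *   }
	 *
	 * instead of one lock acquire/release per sub-page.
	 */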

Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/page-flags.h |  5 ++--
 mm/swap_state.c            | 64 ++++++++++++++++++++++++++++++----------------
 2 files changed, 45 insertions(+), 24 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d6de32..f4acd6c4f808 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -326,11 +326,12 @@ PAGEFLAG_FALSE(HighMem)
 #ifdef CONFIG_SWAP
 static __always_inline int PageSwapCache(struct page *page)
 {
+	page = compound_head(page);
 	return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);
 
 }
-SETPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
-CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
+CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3c248f0a0abc..387466fd114b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -38,6 +38,7 @@ struct address_space *swapper_spaces[MAX_SWAPFILES];
 static unsigned int nr_swapper_spaces[MAX_SWAPFILES];
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
+#define ADD_CACHE_INFO(x, nr)	do { swap_cache_info.x += (nr); } while (0)
 
 static struct {
 	unsigned long add_total;
@@ -90,39 +91,52 @@ void show_swap_cache_info(void)
  */
 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
-	int error;
+	int error, i, nr = hpage_nr_pages(page);
 	struct address_space *address_space;
+	struct page *cur_page;
+	swp_entry_t cur_entry;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
 	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
-	get_page(page);
+	page_ref_add(page, nr);
 	SetPageSwapCache(page);
-	set_page_private(page, entry.val);
 
 	address_space = swap_address_space(entry);
+	cur_page = page;
+	cur_entry.val = entry.val;
 	spin_lock_irq(&address_space->tree_lock);
-	error = radix_tree_insert(&address_space->page_tree,
-				  swp_offset(entry), page);
-	if (likely(!error)) {
-		address_space->nrpages++;
-		__inc_node_page_state(page, NR_FILE_PAGES);
-		INC_CACHE_INFO(add_total);
+	for (i = 0; i < nr; i++, cur_page++, cur_entry.val++) {
+		set_page_private(cur_page, cur_entry.val);
+		error = radix_tree_insert(&address_space->page_tree,
+					  swp_offset(cur_entry), cur_page);
+		if (unlikely(error))
+			break;
 	}
-	spin_unlock_irq(&address_space->tree_lock);
-
-	if (unlikely(error)) {
+	if (likely(!error)) {
+		address_space->nrpages += nr;
+		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
+		ADD_CACHE_INFO(add_total, nr);
+	} else {
 		/*
 		 * Only the context which have set SWAP_HAS_CACHE flag
 		 * would call add_to_swap_cache().
 		 * So add_to_swap_cache() doesn't returns -EEXIST.
 		 */
 		VM_BUG_ON(error == -EEXIST);
-		set_page_private(page, 0UL);
+		set_page_private(cur_page, 0UL);
+		while (i--) {
+			cur_page--;
+			cur_entry.val--;
+			radix_tree_delete(&address_space->page_tree,
+					  swp_offset(cur_entry));
+			set_page_private(cur_page, 0UL);
+		}
 		ClearPageSwapCache(page);
-		put_page(page);
+		page_ref_sub(page, nr);
 	}
+	spin_unlock_irq(&address_space->tree_lock);
 
 	return error;
 }
@@ -132,7 +146,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
 {
 	int error;
 
-	error = radix_tree_maybe_preload(gfp_mask);
+	error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page));
 	if (!error) {
 		error = __add_to_swap_cache(page, entry);
 		radix_tree_preload_end();
@@ -148,6 +162,7 @@ void __delete_from_swap_cache(struct page *page)
 {
 	swp_entry_t entry;
 	struct address_space *address_space;
+	int i, nr = hpage_nr_pages(page);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
@@ -155,12 +170,17 @@ void __delete_from_swap_cache(struct page *page)
 
 	entry.val = page_private(page);
 	address_space = swap_address_space(entry);
-	radix_tree_delete(&address_space->page_tree, swp_offset(entry));
-	set_page_private(page, 0);
+	for (i = 0; i < nr; i++, entry.val++) {
+		struct page *cur_page = page + i;
+
+		radix_tree_delete(&address_space->page_tree,
+				  swp_offset(entry));
+		set_page_private(cur_page, 0);
+	}
 	ClearPageSwapCache(page);
-	address_space->nrpages--;
-	__dec_node_page_state(page, NR_FILE_PAGES);
-	INC_CACHE_INFO(del_total);
+	address_space->nrpages -= nr;
+	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
+	ADD_CACHE_INFO(del_total, nr);
 }
 
 /**
@@ -237,8 +257,8 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry);
-	put_page(page);
+	__swapcache_free(entry, PageTransHuge(page));
+	page_ref_sub(page, hpage_nr_pages(page));
 }
 
 /* 
-- 
2.11.0


* [PATCH -mm -v6 7/9] mm, THP: Add can_split_huge_page()
  2017-03-08  7:26 [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (5 preceding siblings ...)
  2017-03-08  7:26 ` [PATCH -mm -v6 6/9] mm, THP, swap: Support to add/delete THP to/from swap cache Huang, Ying
@ 2017-03-08  7:26 ` Huang, Ying
  2017-03-08  7:26 ` [PATCH -mm -v6 8/9] mm, THP, swap: Support to split THP in swap cache Huang, Ying
  2017-03-08  7:26 ` [PATCH -mm -v6 9/9] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
  8 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-03-08  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli, Ebru Akagunduz

From: Huang Ying <ying.huang@intel.com>

Separate the check for whether we can split a huge page out of
split_huge_page_to_list() into its own function.  This makes it
possible to perform the check before really splitting the
THP (Transparent Huge Page).

This will be used for delaying splitting the THP during swap-out.
There, for a THP, we will allocate a swap cluster, add the THP into the
swap cache, then split the THP.  To avoid these unnecessary operations
for an unsplittable THP, we will check that first.

There is no functionality change in this patch.
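
For reference, the pin accounting the check relies on (my annotation,
not from the patch):

	/*
	 * A huge page is deemed splittable when
	 *
	 *   page_count(head) == total_mapcount(head)  page table maps
	 *                     + extra_pins            radix tree refs
	 *                     + 1                     caller's reference
	 *
	 * Any extra reference (e.g. from get_user_pages()) breaks the
	 * equality and the split fails with -EBUSY.
	 */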

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |  7 +++++++
 mm/huge_memory.c        | 17 ++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a3762d49ba39..d3b3e8fcc717 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -113,6 +113,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 extern void prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 
+bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
@@ -231,6 +232,12 @@ static inline void prep_transhuge_page(struct page *page) {}
 
 #define thp_get_unmapped_area	NULL
 
+static inline bool
+can_split_huge_page(struct page *page, int *pextra_pins)
+{
+	BUILD_BUG();
+	return false;
+}
 static inline int
 split_huge_page_to_list(struct page *page, struct list_head *list)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b4120f6930c..45f944db43b0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2362,6 +2362,19 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 	return ret;
 }
 
+/* Racy check whether the huge page can be split */
+bool can_split_huge_page(struct page *page, int *pextra_pins)
+{
+	int extra_pins = 0;
+
+	/* Additional pins from radix tree */
+	if (!PageAnon(page))
+		extra_pins = HPAGE_PMD_NR;
+	if (pextra_pins)
+		*pextra_pins = extra_pins;
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
 /*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
@@ -2421,8 +2434,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			goto out;
 		}
 
-		/* Addidional pins from radix tree */
-		extra_pins = HPAGE_PMD_NR;
 		anon_vma = NULL;
 		i_mmap_lock_read(mapping);
 	}
@@ -2431,7 +2442,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * Racy check if we can split the page, before freeze_page() will
 	 * split PMDs
 	 */
-	if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
+	if (!can_split_huge_page(head, &extra_pins)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
-- 
2.11.0


* [PATCH -mm -v6 8/9] mm, THP, swap: Support to split THP in swap cache
  2017-03-08  7:26 [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (6 preceding siblings ...)
  2017-03-08  7:26 ` [PATCH -mm -v6 7/9] mm, THP: Add can_split_huge_page() Huang, Ying
@ 2017-03-08  7:26 ` Huang, Ying
  2017-03-08  7:26 ` [PATCH -mm -v6 9/9] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
  8 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-03-08  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli, Ebru Akagunduz

From: Huang Ying <ying.huang@intel.com>

This patch enhances split_huge_page_to_list() to work properly for a
THP (Transparent Huge Page) in the swap cache during swap-out.

This is used for delaying splitting the THP during swap-out.  There,
for a THP to be swapped out, we will allocate a swap cluster, add the
THP into the swap cache, then split the THP.  The page lock is held
during this process, so in code paths other than swap-out, if a THP
needs to be split, PageSwapCache(THP) will always be false.
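
With the THP in the swap cache, the pin accounting from the previous
patch changes as follows (my annotation, not from the patch):

	/*
	 * For an anonymous THP in the swap cache, each of the
	 * HPAGE_PMD_NR sub-pages owns one radix tree reference, so
	 * can_split_huge_page() now expects
	 *
	 *   extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0
	 *
	 * for anonymous pages, and __split_huge_page_tail() gives
	 * each tail page an additional reference for its radix tree
	 * entry in the swap cache.
	 */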

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 45f944db43b0..ffb7da440fb8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2180,7 +2180,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * atomic_set() here would be safe on all archs (and not only on x86),
 	 * it's safer to use atomic_inc()/atomic_add().
 	 */
-	if (PageAnon(head)) {
+	if (PageAnon(head) && !PageSwapCache(head)) {
 		page_ref_inc(page_tail);
 	} else {
 		/* Additional pin to radix tree */
@@ -2191,6 +2191,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	page_tail->flags |= (head->flags &
 			((1L << PG_referenced) |
 			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
@@ -2253,7 +2254,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	ClearPageCompound(head);
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
-		page_ref_inc(head);
+		/* Additional pin to radix tree of swap cache */
+		if (PageSwapCache(head))
+			page_ref_add(head, 2);
+		else
+			page_ref_inc(head);
 	} else {
 		/* Additional pin to radix tree */
 		page_ref_add(head, 2);
@@ -2365,10 +2370,12 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 /* Racy check whether the huge page can be split */
 bool can_split_huge_page(struct page *page, int *pextra_pins)
 {
-	int extra_pins = 0;
+	int extra_pins;
 
 	/* Additional pins from radix tree */
-	if (!PageAnon(page))
+	if (PageAnon(page))
+		extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
+	else
 		extra_pins = HPAGE_PMD_NR;
 	if (pextra_pins)
 		*pextra_pins = extra_pins;
@@ -2422,7 +2429,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ret = -EBUSY;
 			goto out;
 		}
-		extra_pins = 0;
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
-- 
2.11.0


* [PATCH -mm -v6 9/9] mm, THP, swap: Delay splitting THP during swap out
  2017-03-08  7:26 [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (7 preceding siblings ...)
  2017-03-08  7:26 ` [PATCH -mm -v6 8/9] mm, THP, swap: Support to split THP in swap cache Huang, Ying
@ 2017-03-08  7:26 ` Huang, Ying
  8 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-03-08  7:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Huang Ying

From: Huang Ying <ying.huang@intel.com>

In this patch, splitting the huge page is delayed from almost the first
step of swapping out to after allocating the swap space for the
THP (Transparent Huge Page) and adding the THP to the swap cache.  This
will reduce lock acquiring/releasing for the locks used for swap cache
management.

This is the first step of the THP swap support.  The plan is to delay
splitting the THP step by step and finally avoid splitting the THP
altogether.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help to improve the THP swap performance.

- The THP swap space read/write will be 2M sequential IO.  It is
  particularly helpful for swap reads, which are usually 4k random
  IO.  This will help to improve the THP swap performance too.

- It will help with memory fragmentation, especially when the THP is
  heavily used by the applications.  The 2M continuous pages will be
  freed up after the THP is swapped out.

- It will improve the THP utilization on systems with swap turned on,
  because the speed at which khugepaged collapses normal pages into a
  THP is quite slow.  After the THP is split during swapping out, it
  will take quite a long time for the normal pages to collapse back
  into a THP after being swapped in.  High THP utilization also helps
  the efficiency of the page based memory management.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device.  To deal with that, THP swap in should
be turned on only when necessary.  For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMAs with MADV_HUGEPAGE (see the
sketch below).
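
For reference, the MADV_HUGEPAGE opt-in mentioned above already exists
for THP allocation; a minimal userspace sketch of marking a VMA (the
proposed per-VMA swap-in policy would presumably key off this same
flag, which is an assumption, not something this patchset implements):

        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 4UL << 20;         /* two 2M huge pages worth */
                void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (buf == MAP_FAILED)
                        return 1;
                /* Opt this VMA in to transparent huge pages; under the
                 * proposed "madvise" policy, THP swap in would be
                 * enabled only for ranges marked like this. */
                madvise(buf, len, MADV_HUGEPAGE);
                return 0;
        }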

With the patchset, the swap out throughput improves by 14.9% (from
about 3.77GB/s to about 4.34GB/s) in the vm-scalability swap-w-seq test
case with 8 processes.  The test is done on a Xeon E5 v3 system.  The
swap device used is a RAM simulated PMEM (persistent memory) device.
To test sequential swapping out, the test case creates 8 processes,
each of which sequentially allocates and writes to anonymous pages
until the RAM and part of the swap device are used up, roughly as in
the sketch below.
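
An illustrative per-process inner loop (not the actual vm-scalability
source; the CHUNK and NR_CHUNKS sizes are made up for the sketch, and
running it will deliberately exhaust memory to force swap out):

        #include <string.h>
        #include <sys/mman.h>

        #define CHUNK           (2UL << 20)     /* one THP-sized step */
        #define NR_CHUNKS       (16UL << 10)    /* ~32GB per process */

        int main(void)
        {
                for (unsigned long i = 0; i < NR_CHUNKS; i++) {
                        char *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                        if (p == MAP_FAILED)
                                break;
                        /* sequential writes dirty the whole 2M region,
                         * eventually forcing THPs to be swapped out */
                        memset(p, i & 0xff, CHUNK);
                }
                return 0;
        }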

The detailed comparison result is as follows:

base             base+patchset
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   7043990 ±  0%     +21.2%    8536807 ±  0%  vm-scalability.throughput
    109.94 ±  1%     -16.2%      92.09 ±  0%  vm-scalability.time.elapsed_time
   3957091 ±  0%     +14.9%    4547173 ±  0%  vmstat.swap.so
     31.46 ±  1%     -38.3%      19.42 ±  0%  perf-stat.cache-miss-rate%
      1.04 ±  1%     +22.2%       1.27 ±  0%  perf-stat.ipc
      9.33 ±  2%     -60.7%       3.67 ±  1%  perf-profile.calltrace.cycles-pp.add_to_swap.shrink_page_list.shrink_inactive_list.shrink_node_memcg.shrink_node

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swap_state.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 57 insertions(+), 3 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 387466fd114b..12e7a461cf4c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -19,6 +19,7 @@
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
 #include <linux/swap_slots.h>
+#include <linux/huge_mm.h>
 
 #include <asm/pgtable.h>
 
@@ -183,12 +184,53 @@ void __delete_from_swap_cache(struct page *page)
 	ADD_CACHE_INFO(del_total, nr);
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+int add_to_swap_trans_huge(struct page *page, struct list_head *list)
+{
+	swp_entry_t entry;
+	int ret = 0;
+
+	/* cannot split, which may be needed during swap in, skip it */
+	if (!can_split_huge_page(page, NULL))
+		return -EBUSY;
+	/* fallback to split huge page firstly if no PMD map */
+	if (!compound_mapcount(page))
+		return 0;
+	entry = get_huge_swap_page();
+	if (!entry.val)
+		return 0;
+	if (mem_cgroup_try_charge_swap(page, entry, HPAGE_PMD_NR)) {
+		__swapcache_free(entry, true);
+		return -EOVERFLOW;
+	}
+	ret = add_to_swap_cache(page, entry,
+				__GFP_HIGH | __GFP_NOMEMALLOC|__GFP_NOWARN);
+	/* -ENOMEM radix-tree allocation failure */
+	if (ret) {
+		__swapcache_free(entry, true);
+		return 0;
+	}
+	ret = split_huge_page_to_list(page, list);
+	if (ret) {
+		delete_from_swap_cache(page);
+		return -EBUSY;
+	}
+	return 1;
+}
+#else
+static inline int add_to_swap_trans_huge(struct page *page,
+					 struct list_head *list)
+{
+	return 0;
+}
+#endif
+
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
  *
  * Allocate swap space for the page and add the page to the
- * swap cache.  Caller needs to hold the page lock. 
+ * swap cache.  Caller needs to hold the page lock.
  */
 int add_to_swap(struct page *page, struct list_head *list)
 {
@@ -198,6 +240,18 @@ int add_to_swap(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
+	if (unlikely(PageTransHuge(page))) {
+		err = add_to_swap_trans_huge(page, list);
+		switch (err) {
+		case 1:
+			return 1;
+		case 0:
+			/* fallback to split firstly if return 0 */
+			break;
+		default:
+			return 0;
+		}
+	}
 	entry = get_swap_page();
 	if (!entry.val)
 		return 0;
@@ -315,7 +369,7 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 
 	page = find_get_page(swap_address_space(entry), swp_offset(entry));
 
-	if (page) {
+	if (page && likely(!PageTransCompound(page))) {
 		INC_CACHE_INFO(find_success);
 		if (TestClearPageReadahead(page))
 			atomic_inc(&swapin_readahead_hits);
@@ -536,7 +590,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 						gfp_mask, vma, addr);
 		if (!page)
 			continue;
-		if (offset != entry_offset)
+		if (offset != entry_offset && likely(!PageTransCompound(page)))
 			SetPageReadahead(page);
 		put_page(page);
 	}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 2/9] mm, memcg: Support to charge/uncharge multiple swap entries
  2017-03-08  7:26 ` [PATCH -mm -v6 2/9] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
@ 2017-03-08 11:37   ` Balbir Singh
  2017-03-09  1:28     ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2017-03-08 11:37 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Tejun Heo,
	cgroups

On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> This patch makes it possible to charge or uncharge a set of continuous
> swap entries in the swap cgroup.  The number of swap entries is
> specified via an added parameter.
> 
> This will be used for the THP (Transparent Huge Page) swap support,
> where a swap cluster backing a THP may be allocated and freed as a
> whole, so a set of (HPAGE_PMD_NR) continuous swap entries backing one
> THP needs to be charged or uncharged together.  This will batch the
> cgroup operations for the THP swap too.

A quick look at the patches makes them look sane. I wonder if it would
make sense to track THP swapout separately as well
(from a memory.stat perspective)

Balbir Singh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 1/9] mm, swap: Make swap cluster size same of THP size on x86_64
  2017-03-08  7:26 ` [PATCH -mm -v6 1/9] mm, swap: Make swap cluster size same of THP size on x86_64 Huang, Ying
@ 2017-03-08 12:56   ` Matthew Wilcox
  2017-03-09  1:45     ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2017-03-08 12:56 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel

On Wed, Mar 08, 2017 at 03:26:05PM +0800, Huang, Ying wrote:
> In this patch, the size of the swap cluster is changed to that of the
> THP (Transparent Huge Page) on the x86_64 architecture (512).  This is
> for the THP swap support on x86_64, where one swap cluster will be used
> to hold the contents of each THP swapped out, and some information
> about the swapped-out THP (such as the compound map count) will be
> recorded in the swap_cluster_info data structure.
> 
> For other architectures which want THP swap support,
> ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file
> for the architecture.
> 
> In effect, this will enlarge the swap cluster size by 2 times on
> x86_64, which may make it harder to find a free cluster when the swap
> space becomes fragmented.  As a result, this may reduce the continuous
> swap space allocation and sequential write in theory.  The performance
> test in 0day shows no regressions caused by this.

Well ... if there are no regressions found, why not change it
unconditionally?  The value '256' seems relatively arbitrary (I bet it
was tuned by some doofus with a 486, 8MB RAM and ST506 hard drive ...
it certainly hasn't changed since git started in 2005)

Might be worth checking with the PowerPC people to see if their larger
pages cause this smaller patch to perform badly:

diff --git a/mm/swapfile.c b/mm/swapfile.c
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -199,7 +199,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 	}
 }
 
-#define SWAPFILE_CLUSTER	256
+#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 2/9] mm, memcg: Support to charge/uncharge multiple swap entries
  2017-03-08 11:37   ` Balbir Singh
@ 2017-03-09  1:28     ` Huang, Ying
  2017-03-09 21:22       ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2017-03-09  1:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Tejun Heo, cgroups

Balbir Singh <bsingharora@gmail.com> writes:

> On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> This patch makes it possible to charge or uncharge a set of continuous
>> swap entries in the swap cgroup.  The number of swap entries is
>> specified via an added parameter.
>> 
>> This will be used for the THP (Transparent Huge Page) swap support,
>> where a swap cluster backing a THP may be allocated and freed as a
>> whole, so a set of (HPAGE_PMD_NR) continuous swap entries backing one
>> THP needs to be charged or uncharged together.  This will batch the
>> cgroup operations for the THP swap too.
>
> A quick look at the patches makes them look sane. I wonder if it would
> make sense to track THP swapout separately as well
> (from a memory.stat perspective)

The patchset is just the first step of the THP swap optimization, so
the THP will still be split after putting the THP into the swap cache.
This makes it unnecessary to change mem_cgroup_swapout().  I am working
on a follow-up patchset to further delay THP splitting until after the
THP has been swapped out to disk.  In that patchset, I will change
mem_cgroup_swapout() too.

Best Regards,
Huang, Ying

> Balbir Singh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 1/9] mm, swap: Make swap cluster size same of THP size on x86_64
  2017-03-08 12:56   ` Matthew Wilcox
@ 2017-03-09  1:45     ` Huang, Ying
  0 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-03-09  1:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Shaohua Li, Minchan Kim, Rik van Riel

Hi, Matthew,

Matthew Wilcox <willy@infradead.org> writes:

> On Wed, Mar 08, 2017 at 03:26:05PM +0800, Huang, Ying wrote:
>> In this patch, the size of the swap cluster is changed to that of the
>> THP (Transparent Huge Page) on the x86_64 architecture (512).  This is
>> for the THP swap support on x86_64, where one swap cluster will be used
>> to hold the contents of each THP swapped out, and some information
>> about the swapped-out THP (such as the compound map count) will be
>> recorded in the swap_cluster_info data structure.
>> 
>> For other architectures which want THP swap support,
>> ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file
>> for the architecture.
>> 
>> In effect, this will enlarge the swap cluster size by 2 times on
>> x86_64, which may make it harder to find a free cluster when the swap
>> space becomes fragmented.  As a result, this may reduce the continuous
>> swap space allocation and sequential write in theory.  The performance
>> test in 0day shows no regressions caused by this.
>
> Well ... if there are no regressions found, why not change it
> unconditionally?  The value '256' seems relatively arbitrary (I bet it
> was tuned by some doofus with a 486, 8MB RAM and ST506 hard drive ...
> it certainly hasn't changed since git started in 2005)
>
> Might be worth checking with the PowerPC people to see if their larger
> pages cause this smaller patch to perform badly:

I found that the huge page size is large not only on PowerPC; for
example, on MIPS the PMD_SHIFT can range from 21 to 29, depending on
the configuration.  I don't know the situation for the other
architectures.  So I thought it may be better to let the architecture
developers determine whether to make the change, and under which
configuration.
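
To make the size concern concrete: a swap cluster sized to one PMD huge
page covers 1 << PMD_SHIFT bytes, i.e. 1 << (PMD_SHIFT - PAGE_SHIFT)
swap entries.  A quick standalone check (the shift pairs below are
illustrative, not taken from any particular defconfig):

        #include <stdio.h>

        int main(void)
        {
                /* illustrative (PMD_SHIFT, PAGE_SHIFT) pairs; x86_64 is
                 * (21, 12), while some MIPS configs reach PMD_SHIFT 29 */
                int shifts[][2] = { {21, 12}, {25, 14}, {29, 16} };

                for (int i = 0; i < 3; i++) {
                        int pmd = shifts[i][0], page = shifts[i][1];

                        printf("PMD_SHIFT=%d PAGE_SHIFT=%d -> %lu entries, %lu MB per cluster\n",
                               pmd, page, 1UL << (pmd - page), (1UL << pmd) >> 20);
                }
                return 0;
        }

A 512MB cluster per THP is a very different proposition from x86_64's
2MB one, which supports leaving the choice to each architecture.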

> diff --git a/mm/swapfile.c b/mm/swapfile.c
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -199,7 +199,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>  	}
>  }
>  
> -#define SWAPFILE_CLUSTER	256
> +#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
>  #define LATENCY_LIMIT		256
>  
>  static inline void cluster_set_flag(struct swap_cluster_info *info,

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 2/9] mm, memcg: Support to charge/uncharge multiple swap entries
  2017-03-09  1:28     ` Huang, Ying
@ 2017-03-09 21:22       ` Balbir Singh
  0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2017-03-09 21:22 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Vladimir Davydov, Johannes Weiner,
	Michal Hocko, Tejun Heo, cgroups

On Thu, Mar 9, 2017 at 12:28 PM, Huang, Ying <ying.huang@intel.com> wrote:
> Balbir Singh <bsingharora@gmail.com> writes:
>
>> On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
>>> From: Huang Ying <ying.huang@intel.com>
>>>
>>> This patch makes it possible to charge or uncharge a set of continuous
>>> swap entries in the swap cgroup.  The number of swap entries is
>>> specified via an added parameter.
>>>
>>> This will be used for the THP (Transparent Huge Page) swap support,
>>> where a swap cluster backing a THP may be allocated and freed as a
>>> whole, so a set of (HPAGE_PMD_NR) continuous swap entries backing one
>>> THP needs to be charged or uncharged together.  This will batch the
>>> cgroup operations for the THP swap too.
>>
>> A quick look at the patches makes them look sane. I wonder if it would
>> make sense to track THP swapout separately as well
>> (from a memory.stat perspective)
>
> The patchset is just the first step of the THP swap optimization, so
> the THP will still be split after putting the THP into the swap cache.
> This makes it unnecessary to change mem_cgroup_swapout().  I am working
> on a follow-up patchset to further delay THP splitting until after the
> THP has been swapped out to disk.  In that patchset, I will change
> mem_cgroup_swapout() too.


Fair enough

Thanks,
Balbir

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 3/9] mm, THP, swap: Add swap cluster allocate/free functions
  2017-03-08  7:26 ` [PATCH -mm -v6 3/9] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
@ 2017-03-14 23:13   ` Tim Chen
  2017-03-15  1:19     ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Tim Chen @ 2017-03-14 23:13 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel

On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> The swap cluster allocation/free functions are added based on the
> existing swap cluster management mechanism for SSD.  These functions
> don't work for the rotating hard disks because the existing swap cluster
> management mechanism doesn't work for them.  Hard disk support may
> be added if someone really needs it, but that needn't be included in
> this patchset.
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where one swap cluster will hold the contents of each THP swapped out.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  mm/swapfile.c | 217 +++++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 156 insertions(+), 61 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a744604384ff..91876c33114b 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -378,6 +378,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>  	schedule_work(&si->discard_work);
>  }
>  
> +static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> +{
> +	struct swap_cluster_info *ci = si->cluster_info;
> +
> +	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> +	cluster_list_add_tail(&si->free_clusters, ci, idx);
> +}
> +
>  /*
>   * Doing discard actually. After a cluster discard is finished, the cluster
>   * will be added to free cluster list. caller should hold si->lock.
> @@ -398,10 +406,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>  
>  		spin_lock(&si->lock);
>  		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> -		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
> -		unlock_cluster(ci);
> -		cluster_list_add_tail(&si->free_clusters, info, idx);
> -		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> +		__free_cluster(si, idx);
>  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>  				0, SWAPFILE_CLUSTER);
>  		unlock_cluster(ci);

The __free_cluster definition and the above change to eliminate
the extra unlock_cluster and lock_cluster can perhaps be broken up
as a separate patch.  It can be independent of THP changes.


Tim

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 4/9] mm, THP, swap: Add get_huge_swap_page()
  2017-03-08  7:26 ` [PATCH -mm -v6 4/9] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
@ 2017-03-14 23:40   ` Tim Chen
  2017-03-15  1:08     ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Tim Chen @ 2017-03-14 23:40 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel

On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> A variation of get_swap_page(), get_huge_swap_page(), is added to
> allocate a swap cluster (HPAGE_PMD_NR swap slots) based on the swap
> cluster allocation function.  A fairly simple algorithm is used, that
> is, only the first swap device in the priority list will be tried when
> allocating the swap cluster.  The function will fail if that attempt is
> not successful, and the caller will fall back to allocating a single
> swap slot instead.  This works well enough for normal cases.
> 
> This will be used for the THP (Transparent Huge Page) swap support.
> Where get_huge_swap_page() will be used to allocate one swap cluster for
> each THP swapped out.
> 
> Because of the algorithm adopted, if the difference in the number of
> free swap clusters among multiple swap devices is significant, it is
> possible that some THPs will be split earlier than necessary; for
> example, this could be caused by a big size difference among multiple
> swap devices.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  include/linux/swap.h | 19 ++++++++++++++++++-
>  mm/swap_slots.c      |  5 +++--
>  mm/swapfile.c        | 16 ++++++++++++----
>  3 files changed, 33 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 278e1349a424..e3a7609a8989 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -388,7 +388,7 @@ static inline long get_nr_swap_pages(void)
>  extern void si_swapinfo(struct sysinfo *);
>  extern swp_entry_t get_swap_page(void);
>  extern swp_entry_t get_swap_page_of_type(int);
> -extern int get_swap_pages(int n, swp_entry_t swp_entries[]);
> +extern int get_swap_pages(int n, swp_entry_t swp_entries[], bool huge);
>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>  extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
> @@ -527,6 +527,23 @@ static inline swp_entry_t get_swap_page(void)
>  
>  #endif /* CONFIG_SWAP */
>  
> +#ifdef CONFIG_THP_SWAP_CLUSTER
> +static inline swp_entry_t get_huge_swap_page(void)
> +{
> +	swp_entry_t entry;
> +
> +	if (get_swap_pages(1, &entry, true))
> +		return entry;
> +	else
> +		return (swp_entry_t) {0};
> +}
> +#else
> +static inline swp_entry_t get_huge_swap_page(void)
> +{
> +	return (swp_entry_t) {0};
> +}
> +#endif
> +
>  #ifdef CONFIG_MEMCG
>  static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 9b5bc86f96ad..075bb39e03c5 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -258,7 +258,8 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
>  
>  	cache->cur = 0;
>  	if (swap_slot_cache_active)
> -		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots);
> +		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots,
> +					   false);
>  
>  	return cache->nr;
>  }
> @@ -334,7 +335,7 @@ swp_entry_t get_swap_page(void)
>  			return entry;
>  	}
>  
> -	get_swap_pages(1, &entry);
> +	get_swap_pages(1, &entry, false);
>  
>  	return entry;
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 91876c33114b..7241c937e52b 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -904,11 +904,12 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
>  
>  }
>  
> -int get_swap_pages(int n_goal, swp_entry_t swp_entries[])


> +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], bool huge)
>  {
>  	struct swap_info_struct *si, *next;
>  	long avail_pgs;
>  	int n_ret = 0;
> +	int nr_pages = huge_cluster_nr_entries(huge);
>  
>  	avail_pgs = atomic_long_read(&nr_swap_pages);
>  	if (avail_pgs <= 0)
> @@ -920,6 +921,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
>  	if (n_goal > avail_pgs)
>  		n_goal = avail_pgs;
>  
> +	n_goal *= nr_pages;

I think if (n_goal > 1) when huge is true, 
n_goal should be set to huge_cluster_nr_entries(huge) here
or we could have an invalid check below. We probably
should add a comment to get_swap_pages on how we treat
n_goal when huge is true.  Maybe say we will always treat
n_goal as SWAPFILE_CLUSTER when huge is true. 
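
A standalone model of the accounting hazard (userspace, illustrative
only; the names mirror the patch, but the clamp shown is just one
possible shape of a fix, not the patch's actual code): if a caller
passed n_goal > 1 together with huge == true, "n_goal *= nr_pages"
would debit more entries from nr_swap_pages than the single cluster
actually allocated, so the huge case needs n_goal clamped:

        #include <stdio.h>

        #define SWAPFILE_CLUSTER 512UL

        static long nr_swap_pages = 4096;       /* toy pool */

        static long reserve(int n_goal, int huge)
        {
                long nr_pages = huge ? SWAPFILE_CLUSTER : 1;

                if (huge)
                        n_goal = 1;     /* at most one cluster per call */
                n_goal *= nr_pages;
                if (nr_swap_pages < n_goal)
                        return 0;
                nr_swap_pages -= n_goal;
                return n_goal;          /* entries reserved */
        }

        int main(void)
        {
                long got = reserve(2, 1);       /* huge caller asking for 2 */

                printf("reserved %ld entries, %ld left\n", got, nr_swap_pages);
                return 0;
        }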

> +	if (avail_pgs < n_goal)
> +		goto noswap;
> +
>  	atomic_long_sub(n_goal, &nr_swap_pages);
>  
>  	spin_lock(&swap_avail_lock);
> @@ -946,10 +951,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
>  			spin_unlock(&si->lock);
>  			goto nextsi;
>  		}
> -		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> -					    n_goal, swp_entries);
> +		if (likely(nr_pages == 1))

if (likely(!huge)) is probably more readable

> +			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> +						    n_goal, swp_entries);
> +		else
> +			n_ret = swap_alloc_huge_cluster(si, swp_entries);
>  		spin_unlock(&si->lock);

Thanks.

Tim

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 5/9] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page
  2017-03-08  7:26 ` [PATCH -mm -v6 5/9] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page Huang, Ying
@ 2017-03-15  0:14   ` Tim Chen
  2017-03-15  0:54     ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Tim Chen @ 2017-03-15  0:14 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel

On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> __swapcache_free() is added to support clearing the SWAP_HAS_CACHE
> flag for the huge page.  This will free the specified swap cluster
> immediately, because for now this function will be called only in the
> error path, to free the swap cluster just allocated.  So the
> corresponding swap_map[i] == SWAP_HAS_CACHE, that is, the swap count
> is 0.  This makes the implementation simpler than that of the ordinary
> swap entry.
> 
> This will be used for delaying splitting the THP (Transparent Huge
> Page) during swapping out, where for one THP to be swapped out, we
> will allocate a swap cluster, add the THP to the swap cache, then
> split the THP.  If anything fails after allocating the swap cluster
> and before splitting the THP successfully,
> swapcache_free_trans_huge() will be used to free the swap space
> allocated.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  include/linux/swap.h |  9 +++++++--
>  mm/swapfile.c        | 34 ++++++++++++++++++++++++++++++++--
>  2 files changed, 39 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e3a7609a8989..2f2a6c0363aa 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -394,7 +394,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> -extern void swapcache_free(swp_entry_t);
> +extern void __swapcache_free(swp_entry_t entry, bool huge);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern int free_swap_and_cache(swp_entry_t);
>  extern int swap_type_of(dev_t, sector_t, struct block_device **);
> @@ -456,7 +456,7 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>  
> -static inline void swapcache_free(swp_entry_t swp)
> +static inline void __swapcache_free(swp_entry_t swp, bool huge)
>  {
>  }
>  
> @@ -544,6 +544,11 @@ static inline swp_entry_t get_huge_swap_page(void)
>  }
>  #endif
>  
> +static inline void swapcache_free(swp_entry_t entry)
> +{
> +	__swapcache_free(entry, false);
> +}
> +
>  #ifdef CONFIG_MEMCG
>  static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 7241c937e52b..6019f94afbaf 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -855,6 +855,29 @@ static void swap_free_huge_cluster(struct swap_info_struct *si,
>  	_swap_entry_free(si, offset, true);
>  }
>  
> +static void swapcache_free_trans_huge(struct swap_info_struct *si,
> +				      swp_entry_t entry)
> +{
> +	unsigned long offset = swp_offset(entry);
> +	unsigned long idx = offset / SWAPFILE_CLUSTER;
> +	struct swap_cluster_info *ci;
> +	unsigned char *map;
> +	unsigned int i;
> +
> +	spin_lock(&si->lock);
> +	ci = lock_cluster(si, offset);
> +	map = si->swap_map + offset;
> +	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> +		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
> +		map[i] &= ~SWAP_HAS_CACHE;

Nitpicking a bit:
map[i] = 0  is more readable if map[i] == SWAP_HAS_CACHE here.


Thanks.

Tim

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 5/9] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page
  2017-03-15  0:14   ` Tim Chen
@ 2017-03-15  0:54     ` Huang, Ying
  0 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-03-15  0:54 UTC (permalink / raw)
  To: Tim Chen
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel

Tim Chen <tim.c.chen@linux.intel.com> writes:

> On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> __swapcache_free() is added to support clearing the SWAP_HAS_CACHE
>> flag for the huge page.  This will free the specified swap cluster
>> immediately, because for now this function will be called only in the
>> error path, to free the swap cluster just allocated.  So the
>> corresponding swap_map[i] == SWAP_HAS_CACHE, that is, the swap count
>> is 0.  This makes the implementation simpler than that of the ordinary
>> swap entry.
>> 
>> This will be used for delaying splitting the THP (Transparent Huge
>> Page) during swapping out, where for one THP to be swapped out, we
>> will allocate a swap cluster, add the THP to the swap cache, then
>> split the THP.  If anything fails after allocating the swap cluster
>> and before splitting the THP successfully,
>> swapcache_free_trans_huge() will be used to free the swap space
>> allocated.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Shaohua Li <shli@kernel.org>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Rik van Riel <riel@redhat.com>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  include/linux/swap.h |  9 +++++++--
>>  mm/swapfile.c        | 34 ++++++++++++++++++++++++++++++++--
>>  2 files changed, 39 insertions(+), 4 deletions(-)
>> 
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index e3a7609a8989..2f2a6c0363aa 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -394,7 +394,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>>  extern int swap_duplicate(swp_entry_t);
>>  extern int swapcache_prepare(swp_entry_t);
>>  extern void swap_free(swp_entry_t);
>> -extern void swapcache_free(swp_entry_t);
>> +extern void __swapcache_free(swp_entry_t entry, bool huge);
>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>  extern int free_swap_and_cache(swp_entry_t);
>>  extern int swap_type_of(dev_t, sector_t, struct block_device **);
>> @@ -456,7 +456,7 @@ static inline void swap_free(swp_entry_t swp)
>>  {
>>  }
>>  
>> -static inline void swapcache_free(swp_entry_t swp)
>> +static inline void __swapcache_free(swp_entry_t swp, bool huge)
>>  {
>>  }
>>  
>> @@ -544,6 +544,11 @@ static inline swp_entry_t get_huge_swap_page(void)
>>  }
>>  #endif
>>  
>> +static inline void swapcache_free(swp_entry_t entry)
>> +{
>> +	__swapcache_free(entry, false);
>> +}
>> +
>>  #ifdef CONFIG_MEMCG
>>  static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>>  {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 7241c937e52b..6019f94afbaf 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -855,6 +855,29 @@ static void swap_free_huge_cluster(struct swap_info_struct *si,
>>  	_swap_entry_free(si, offset, true);
>>  }
>>  
>> +static void swapcache_free_trans_huge(struct swap_info_struct *si,
>> +				      swp_entry_t entry)
>> +{
>> +	unsigned long offset = swp_offset(entry);
>> +	unsigned long idx = offset / SWAPFILE_CLUSTER;
>> +	struct swap_cluster_info *ci;
>> +	unsigned char *map;
>> +	unsigned int i;
>> +
>> +	spin_lock(&si->lock);
>> +	ci = lock_cluster(si, offset);
>> +	map = si->swap_map + offset;
>> +	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>> +		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
>> +		map[i] &= ~SWAP_HAS_CACHE;
>
> Nitpicking a bit:
> map[i] = 0  is more readable if map[i] == SWAP_HAS_CACHE here.

OK.  I will change this.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 4/9] mm, THP, swap: Add get_huge_swap_page()
  2017-03-14 23:40   ` Tim Chen
@ 2017-03-15  1:08     ` Huang, Ying
  0 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-03-15  1:08 UTC (permalink / raw)
  To: Tim Chen
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel

Tim Chen <tim.c.chen@linux.intel.com> writes:

> On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> A variation of get_swap_page(), get_huge_swap_page(), is added to
>> allocate a swap cluster (HPAGE_PMD_NR swap slots) based on the swap
>> cluster allocation function.  A fairly simple algorithm is used, that
>> is, only the first swap device in the priority list will be tried when
>> allocating the swap cluster.  The function will fail if that attempt is
>> not successful, and the caller will fall back to allocating a single
>> swap slot instead.  This works well enough for normal cases.
>> 
>> This will be used for the THP (Transparent Huge Page) swap support.
>> Where get_huge_swap_page() will be used to allocate one swap cluster for
>> each THP swapped out.
>> 
>> Because of the algorithm adopted, if the difference in the number of
>> free swap clusters among multiple swap devices is significant, it is
>> possible that some THPs will be split earlier than necessary; for
>> example, this could be caused by a big size difference among multiple
>> swap devices.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Shaohua Li <shli@kernel.org>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Rik van Riel <riel@redhat.com>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  include/linux/swap.h | 19 ++++++++++++++++++-
>>  mm/swap_slots.c      |  5 +++--
>>  mm/swapfile.c        | 16 ++++++++++++----
>>  3 files changed, 33 insertions(+), 7 deletions(-)
>> 
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 278e1349a424..e3a7609a8989 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -388,7 +388,7 @@ static inline long get_nr_swap_pages(void)
>>  extern void si_swapinfo(struct sysinfo *);
>>  extern swp_entry_t get_swap_page(void);
>>  extern swp_entry_t get_swap_page_of_type(int);
>> -extern int get_swap_pages(int n, swp_entry_t swp_entries[]);
>> +extern int get_swap_pages(int n, swp_entry_t swp_entries[], bool huge);
>>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>>  extern void swap_shmem_alloc(swp_entry_t);
>>  extern int swap_duplicate(swp_entry_t);
>> @@ -527,6 +527,23 @@ static inline swp_entry_t get_swap_page(void)
>>  
>>  #endif /* CONFIG_SWAP */
>>  
>> +#ifdef CONFIG_THP_SWAP_CLUSTER
>> +static inline swp_entry_t get_huge_swap_page(void)
>> +{
>> +	swp_entry_t entry;
>> +
>> +	if (get_swap_pages(1, &entry, true))
>> +		return entry;
>> +	else
>> +		return (swp_entry_t) {0};
>> +}
>> +#else
>> +static inline swp_entry_t get_huge_swap_page(void)
>> +{
>> +	return (swp_entry_t) {0};
>> +}
>> +#endif
>> +
>>  #ifdef CONFIG_MEMCG
>>  static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>>  {
>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
>> index 9b5bc86f96ad..075bb39e03c5 100644
>> --- a/mm/swap_slots.c
>> +++ b/mm/swap_slots.c
>> @@ -258,7 +258,8 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
>>  
>>  	cache->cur = 0;
>>  	if (swap_slot_cache_active)
>> -		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots);
>> +		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots,
>> +					   false);
>>  
>>  	return cache->nr;
>>  }
>> @@ -334,7 +335,7 @@ swp_entry_t get_swap_page(void)
>>  			return entry;
>>  	}
>>  
>> -	get_swap_pages(1, &entry);
>> +	get_swap_pages(1, &entry, false);
>>  
>>  	return entry;
>>  }
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 91876c33114b..7241c937e52b 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -904,11 +904,12 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
>>  
>>  }
>>  
>> -int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
>
>
>> +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], bool huge)
>>  {
>>  	struct swap_info_struct *si, *next;
>>  	long avail_pgs;
>>  	int n_ret = 0;
>> +	int nr_pages = huge_cluster_nr_entries(huge);
>>  
>>  	avail_pgs = atomic_long_read(&nr_swap_pages);
>>  	if (avail_pgs <= 0)
>> @@ -920,6 +921,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
>>  	if (n_goal > avail_pgs)
>>  		n_goal = avail_pgs;
>>  
>> +	n_goal *= nr_pages;
>
> I think that if (n_goal > 1) when huge is true,
> n_goal should be set to huge_cluster_nr_entries(huge) here,
> or we could have an invalid check below. We should probably
> add a comment to get_swap_pages() on how we treat
> n_goal when huge is true.  Maybe say we will always treat
> n_goal as SWAPFILE_CLUSTER when huge is true.

Yes.  The meaning of n_goal and n_ret isn't consistent between huge and
normal swap entry allocation.  I will revise the logic in the function
to make them consistent.

>> +	if (avail_pgs < n_goal)
>> +		goto noswap;
>> +
>>  	atomic_long_sub(n_goal, &nr_swap_pages);
>>  
>>  	spin_lock(&swap_avail_lock);
>> @@ -946,10 +951,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
>>  			spin_unlock(&si->lock);
>>  			goto nextsi;
>>  		}
>> -		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>> -					    n_goal, swp_entries);
>> +		if (likely(nr_pages == 1))
>
> if (likely(!huge)) is probably more readable

Sure.

Best Regards,
Huang, Ying

>> +			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>> +						    n_goal, swp_entries);
>> +		else
>> +			n_ret = swap_alloc_huge_cluster(si, swp_entries);
>>  		spin_unlock(&si->lock);
>
> Thanks.
>
> Tim

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 3/9] mm, THP, swap: Add swap cluster allocate/free functions
  2017-03-14 23:13   ` Tim Chen
@ 2017-03-15  1:19     ` Huang, Ying
  2017-03-15 17:15       ` Tim Chen
  0 siblings, 1 reply; 23+ messages in thread
From: Huang, Ying @ 2017-03-15  1:19 UTC (permalink / raw)
  To: Tim Chen
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel

Tim Chen <tim.c.chen@linux.intel.com> writes:

> On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> The swap cluster allocation/free functions are added based on the
>> existing swap cluster management mechanism for SSD.  These functions
>> don't work for the rotating hard disks because the existing swap cluster
>> management mechanism doesn't work for them.  Hard disk support may
>> be added if someone really needs it, but that needn't be included in
>> this patchset.
>> 
>> This will be used for the THP (Transparent Huge Page) swap support.
>> Where one swap cluster will hold the contents of each THP swapped out.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Shaohua Li <shli@kernel.org>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Rik van Riel <riel@redhat.com>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  mm/swapfile.c | 217 +++++++++++++++++++++++++++++++++++++++++-----------------
>>  1 file changed, 156 insertions(+), 61 deletions(-)
>> 
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index a744604384ff..91876c33114b 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -378,6 +378,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>  	schedule_work(&si->discard_work);
>>  }
>>  
>> +static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>> +{
>> +	struct swap_cluster_info *ci = si->cluster_info;
>> +
>> +	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
>> +	cluster_list_add_tail(&si->free_clusters, ci, idx);
>> +}
>> +
>>  /*
>>   * Doing discard actually. After a cluster discard is finished, the cluster
>>   * will be added to free cluster list. caller should hold si->lock.
>> @@ -398,10 +406,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>>  
>>  		spin_lock(&si->lock);
>>  		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
>> -		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
>> -		unlock_cluster(ci);
>> -		cluster_list_add_tail(&si->free_clusters, info, idx);
>> -		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
>> +		__free_cluster(si, idx);
>>  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>  				0, SWAPFILE_CLUSTER);
>>  		unlock_cluster(ci);
>
> The __free_cluster definition and the above change to eliminate
> the extra unlock_cluster and lock_cluster can perhaps be broken up
> as a separate patch.  It can be independent of THP changes.

I think the change may have no value by itself without the THP changes.
There will be only one user of __free_cluster() and the lock change is
trivial too.  So I think it may be better just to keep it as it is?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 3/9] mm, THP, swap: Add swap cluster allocate/free functions
  2017-03-15  1:19     ` Huang, Ying
@ 2017-03-15 17:15       ` Tim Chen
  2017-03-16  6:31         ` Huang, Ying
  0 siblings, 1 reply; 23+ messages in thread
From: Tim Chen @ 2017-03-15 17:15 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel

On Wed, 2017-03-15 at 09:19 +0800, Huang, Ying wrote:
> Tim Chen <tim.c.chen@linux.intel.com> writes:
> 
> > 
> > On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
> > > 
> > > From: Huang Ying <ying.huang@intel.com>
> > > 
> > > The swap cluster allocation/free functions are added based on the
> > > existing swap cluster management mechanism for SSD.  These functions
> > > don't work for the rotating hard disks because the existing swap cluster
> > > management mechanism doesn't work for them.  Hard disk support may
> > > be added if someone really needs it, but that needn't be included in
> > > this patchset.
> > > 
> > > This will be used for the THP (Transparent Huge Page) swap support.
> > > Where one swap cluster will hold the contents of each THP swapped out.
> > > 
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Shaohua Li <shli@kernel.org>
> > > Cc: Minchan Kim <minchan@kernel.org>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> > > ---
> > >  mm/swapfile.c | 217 +++++++++++++++++++++++++++++++++++++++++-----------------
> > >  1 file changed, 156 insertions(+), 61 deletions(-)
> > > 
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index a744604384ff..91876c33114b 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -378,6 +378,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > >  	schedule_work(&si->discard_work);
> > >  }
> > >  
> > > +static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > > +{
> > > +	struct swap_cluster_info *ci = si->cluster_info;
> > > +
> > > +	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> > > +	cluster_list_add_tail(&si->free_clusters, ci, idx);
> > > +}
> > > +
> > >  /*
> > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > >   * will be added to free cluster list. caller should hold si->lock.
> > > @@ -398,10 +406,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > >  
> > >  		spin_lock(&si->lock);
> > >  		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> > > -		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
> > > -		unlock_cluster(ci);
> > > -		cluster_list_add_tail(&si->free_clusters, info, idx);
> > > -		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> > > +		__free_cluster(si, idx);
> > >  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > >  				0, SWAPFILE_CLUSTER);
> > >  		unlock_cluster(ci);
> > The __free_cluster definition and the above change to eliminate
> > the extra unlock_cluster and lock_cluster can perhaps be broken up
> > as a separate patch.  It can be independent of THP changes.
> I think the change may have no value by itself without THP changes.
> There will be only 1 user of __free_cluster() and the lock change is
> trivial too.  So I think it may be better just to keep it as that?
> 

Seems like the extra unlock and lock of cluster in existing code should be taken out
irrespective of the THP changes:
 
		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
-		unlock_cluster(ci);
		cluster_list_add_tail(&si->free_clusters, info, idx);
-		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
  				0, SWAPFILE_CLUSTER);

Tim

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH -mm -v6 3/9] mm, THP, swap: Add swap cluster allocate/free functions
  2017-03-15 17:15       ` Tim Chen
@ 2017-03-16  6:31         ` Huang, Ying
  0 siblings, 0 replies; 23+ messages in thread
From: Huang, Ying @ 2017-03-16  6:31 UTC (permalink / raw)
  To: Tim Chen
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Andrea Arcangeli, Kirill A . Shutemov, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel

Tim Chen <tim.c.chen@linux.intel.com> writes:

> On Wed, 2017-03-15 at 09:19 +0800, Huang, Ying wrote:
>> Tim Chen <tim.c.chen@linux.intel.com> writes:
>> 
>> > 
>> > On Wed, 2017-03-08 at 15:26 +0800, Huang, Ying wrote:
>> > > 
>> > > From: Huang Ying <ying.huang@intel.com>
>> > > 
>> > > The swap cluster allocation/free functions are added based on the
>> > > existing swap cluster management mechanism for SSD.  These functions
>> > > don't work for the rotating hard disks because the existing swap cluster
>> > > management mechanism doesn't work for them.  Hard disk support may
>> > > be added if someone really needs it, but that needn't be included in
>> > > this patchset.
>> > > 
>> > > This will be used for the THP (Transparent Huge Page) swap support.
>> > > Where one swap cluster will hold the contents of each THP swapped out.
>> > > 
>> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
>> > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > > Cc: Hugh Dickins <hughd@google.com>
>> > > Cc: Shaohua Li <shli@kernel.org>
>> > > Cc: Minchan Kim <minchan@kernel.org>
>> > > Cc: Rik van Riel <riel@redhat.com>
>> > > Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> > > ---
>> > >  mm/swapfile.c | 217 +++++++++++++++++++++++++++++++++++++++++-----------------
>> > >  1 file changed, 156 insertions(+), 61 deletions(-)
>> > > 
>> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > > index a744604384ff..91876c33114b 100644
>> > > --- a/mm/swapfile.c
>> > > +++ b/mm/swapfile.c
>> > > @@ -378,6 +378,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>> > >  	schedule_work(&si->discard_work);
>> > >  }
>> > >  
>> > > +static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>> > > +{
>> > > +	struct swap_cluster_info *ci = si->cluster_info;
>> > > +
>> > > +	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
>> > > +	cluster_list_add_tail(&si->free_clusters, ci, idx);
>> > > +}
>> > > +
>> > >  /*
>> > >   * Doing discard actually. After a cluster discard is finished, the cluster
>> > >   * will be added to free cluster list. caller should hold si->lock.
>> > > @@ -398,10 +406,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>> > >  
>> > >  		spin_lock(&si->lock);
>> > >  		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
>> > > -		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
>> > > -		unlock_cluster(ci);
>> > > -		cluster_list_add_tail(&si->free_clusters, info, idx);
>> > > -		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
>> > > +		__free_cluster(si, idx);
>> > >  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> > >  				0, SWAPFILE_CLUSTER);
>> > >  		unlock_cluster(ci);
>> > The __free_cluster definition and the above change to eliminate
>> > the extra unlock_cluster and lock_cluster can perhaps be broken up
>> > as a separate patch.  It can be independent of THP changes.
>> I think the change may have no value by itself without the THP changes.
>> There will be only one user of __free_cluster() and the lock change is
>> trivial too.  So I think it may be better just to keep it as it is?
>> 
>
> Seems like the extra unlock and lock of cluster in existing code should be taken out
> irrespective of the THP changes:
>  
> 		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
> -		unlock_cluster(ci);
> 		cluster_list_add_tail(&si->free_clusters, info, idx);
> -		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>   				0, SWAPFILE_CLUSTER);
>

This is not a functionality fix or a performance optimization, because
the lock on the swap_info_struct is held during the operation and there
are no operations on the cluster with index "idx" in
cluster_list_add_tail().  The change here just makes the resulting code
a little simpler.  Does this deserve a separate patch?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-03-16  6:31 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-08  7:26 [PATCH -v6 0/9] THP swap: Delay splitting THP during swapping out Huang, Ying
2017-03-08  7:26 ` [PATCH -mm -v6 1/9] mm, swap: Make swap cluster size same of THP size on x86_64 Huang, Ying
2017-03-08 12:56   ` Matthew Wilcox
2017-03-09  1:45     ` Huang, Ying
2017-03-08  7:26 ` [PATCH -mm -v6 2/9] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
2017-03-08 11:37   ` Balbir Singh
2017-03-09  1:28     ` Huang, Ying
2017-03-09 21:22       ` Balbir Singh
2017-03-08  7:26 ` [PATCH -mm -v6 3/9] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
2017-03-14 23:13   ` Tim Chen
2017-03-15  1:19     ` Huang, Ying
2017-03-15 17:15       ` Tim Chen
2017-03-16  6:31         ` Huang, Ying
2017-03-08  7:26 ` [PATCH -mm -v6 4/9] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
2017-03-14 23:40   ` Tim Chen
2017-03-15  1:08     ` Huang, Ying
2017-03-08  7:26 ` [PATCH -mm -v6 5/9] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page Huang, Ying
2017-03-15  0:14   ` Tim Chen
2017-03-15  0:54     ` Huang, Ying
2017-03-08  7:26 ` [PATCH -mm -v6 6/9] mm, THP, swap: Support to add/delete THP to/from swap cache Huang, Ying
2017-03-08  7:26 ` [PATCH -mm -v6 7/9] mm, THP: Add can_split_huge_page() Huang, Ying
2017-03-08  7:26 ` [PATCH -mm -v6 8/9] mm, THP, swap: Support to split THP in swap cache Huang, Ying
2017-03-08  7:26 ` [PATCH -mm -v6 9/9] mm, THP, swap: Delay splitting THP during swap out Huang, Ying

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).