* [RFC 00/11] THP swap: Delay splitting THP during swapping out
@ 2016-08-09 16:37 Huang, Ying
  2016-08-09 16:37 ` [RFC 01/11] swap: Add swap_cluster_list Huang, Ying
                   ` (12 more replies)
  0 siblings, 13 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying

From: Huang Ying <ying.huang@intel.com>

This patchset is based on the 8/4 head of mmotm/master.

This is the first step for Transparent Huge Page (THP) swap support.
The plan is to delay splitting the THP step by step, and finally to avoid
splitting the THP during swapping out and swapping in altogether.

The advantages of THP swap support are:

- Batch the swap operations for a THP to reduce lock acquisition/release,
  including allocating/freeing swap space, adding to/deleting from the swap
  cache, and writing/reading swap space, etc.

- THP swap space read/write will be 2M sequential IO.  It is particularly
  helpful for swap read, which is usually 4k random IO.

- It will help reduce memory fragmentation, especially when THP is heavily
  used by the applications.  2M of continuous pages will be freed up after
  the THP is swapped out.

As the first step, in this patchset, splitting the huge page is delayed
from almost the first step of swapping out to after the swap space has
been allocated for the THP and the THP has been added into the swap
cache.  This reduces lock acquisition/release for the locks used for
swap space and swap cache management.
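
To illustrate why the batching matters, here is a small standalone
userspace sketch (toy code, not from the kernel; the mutex and the names
are made up for the example) contrasting one lock round trip per
sub-page with one round trip per 512-page cluster:

/*
 * Toy illustration (userspace, not kernel code) of why batching helps:
 * take one lock for a whole 512-entry cluster vs. once per entry.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_SUBPAGES 512

static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char swap_map[NR_SUBPAGES];

/* One lock/unlock per sub-page: 512 acquisitions for a THP. */
static void alloc_entries_one_by_one(void)
{
        for (int i = 0; i < NR_SUBPAGES; i++) {
                pthread_mutex_lock(&swap_lock);
                swap_map[i] = 1;
                pthread_mutex_unlock(&swap_lock);
        }
}

/* One lock/unlock for the whole cluster: what the THP path aims for. */
static void alloc_entries_batched(void)
{
        pthread_mutex_lock(&swap_lock);
        for (int i = 0; i < NR_SUBPAGES; i++)
                swap_map[i] = 1;
        pthread_mutex_unlock(&swap_lock);
}

int main(void)
{
        alloc_entries_one_by_one();
        alloc_entries_batched();
        printf("done\n");
        return 0;
}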

With the patchset, the swap-out bandwidth improved by 12.1% in the
vm-scalability swap-w-seq test case with 16 processes on a Xeon E5 v3
system.  To test sequential swap-out, the test case uses 16 processes that
sequentially allocate and write to anonymous pages until the RAM and part
of the swap device are used up.

The detailed comparison result is as follows:

base             base+patchset
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
   1118821 ±  0%     +12.1%    1254241 ±  1%  vmstat.swap.so
   2460636 ±  1%     +10.6%    2720983 ±  1%  vm-scalability.throughput
    308.79 ±  1%      -7.9%     284.53 ±  1%  vm-scalability.time.elapsed_time
      1639 ±  4%    +232.3%       5446 ±  1%  meminfo.SwapCached
      0.70 ±  3%      +8.7%       0.77 ±  5%  perf-stat.ipc
      9.82 ±  8%     -31.6%       6.72 ±  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC 01/11] swap: Add swap_cluster_list
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 16:37 ` [RFC 02/11] swap: Change SWAPFILE_CLUSTER to 512 Huang, Ying
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

This is a code cleanup patch without functionality changes.  The
swap_cluster_list data structure and its operations are introduced to
provide better encapsulation for the free cluster and discard cluster
list operations.  This avoids some code duplication, improves the code
readability, and reduces the total line count.
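
For illustration, the list discipline the new helpers implement (a singly
linked list threaded through the cluster_info array by index, with cached
head and tail) can be modeled by the standalone toy program below; the
kernel packs the index into swap_cluster_info's bit fields instead of
using plain integers, and the names here are simplified for the example:

#include <stdio.h>

#define CLUSTER_NULL ((unsigned int)-1)

struct cluster {
        unsigned int next;      /* index of the next cluster in the list */
};

struct cluster_list {
        unsigned int head;
        unsigned int tail;
};

static void list_init(struct cluster_list *l)
{
        l->head = l->tail = CLUSTER_NULL;
}

static int list_empty(struct cluster_list *l)
{
        return l->head == CLUSTER_NULL;
}

static void list_add_tail(struct cluster_list *l, struct cluster *ci,
                          unsigned int idx)
{
        if (list_empty(l)) {
                l->head = l->tail = idx;
        } else {
                ci[l->tail].next = idx;
                l->tail = idx;
        }
        ci[idx].next = CLUSTER_NULL;
}

static unsigned int list_del_first(struct cluster_list *l, struct cluster *ci)
{
        unsigned int idx = l->head;

        if (l->tail == idx)
                list_init(l);           /* last element: list becomes empty */
        else
                l->head = ci[idx].next;
        return idx;
}

int main(void)
{
        struct cluster clusters[4];
        struct cluster_list free_list;

        list_init(&free_list);
        for (unsigned int i = 0; i < 4; i++)
                list_add_tail(&free_list, clusters, i);
        while (!list_empty(&free_list))
                printf("freed cluster %u\n", list_del_first(&free_list, clusters));
        return 0;
}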

Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |  11 +++--
 mm/swapfile.c        | 132 ++++++++++++++++++++++++---------------------------
 2 files changed, 69 insertions(+), 74 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc48..ed41bec 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -191,6 +191,11 @@ struct percpu_cluster {
 	unsigned int next; /* Likely next allocation offset */
 };
 
+struct swap_cluster_list {
+	struct swap_cluster_info head;
+	struct swap_cluster_info tail;
+};
+
 /*
  * The in-memory structure used to track swap areas.
  */
@@ -203,8 +208,7 @@ struct swap_info_struct {
 	unsigned int	max;		/* extent of the swap_map */
 	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
 	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
-	struct swap_cluster_info free_cluster_head; /* free cluster list head */
-	struct swap_cluster_info free_cluster_tail; /* free cluster list tail */
+	struct swap_cluster_list free_clusters; /* free clusters list */
 	unsigned int lowest_bit;	/* index of first free in swap_map */
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
@@ -235,8 +239,7 @@ struct swap_info_struct {
 					 * first.
 					 */
 	struct work_struct discard_work; /* discard worker */
-	struct swap_cluster_info discard_cluster_head; /* list head of discard clusters */
-	struct swap_cluster_info discard_cluster_tail; /* list tail of discard clusters */
+	struct swap_cluster_list discard_clusters; /* discard clusters list */
 };
 
 /* linux/mm/workingset.c */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 78cfa29..09e3877 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -257,6 +257,53 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
+static inline bool cluster_list_empty(struct swap_cluster_list *list)
+{
+	return cluster_is_null(&list->head);
+}
+
+static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
+{
+	return cluster_next(&list->head);
+}
+
+static void cluster_list_init(struct swap_cluster_list *list)
+{
+	cluster_set_null(&list->head);
+	cluster_set_null(&list->tail);
+}
+
+static void cluster_list_add_tail(struct swap_cluster_list *list,
+				  struct swap_cluster_info *ci,
+				  unsigned int idx)
+{
+	if (cluster_list_empty(list)) {
+		cluster_set_next_flag(&list->head, idx, 0);
+		cluster_set_next_flag(&list->tail, idx, 0);
+	} else {
+		unsigned int tail = cluster_next(&list->tail);
+
+		cluster_set_next(&ci[tail], idx);
+		cluster_set_next_flag(&list->tail, idx, 0);
+	}
+}
+
+static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
+					   struct swap_cluster_info *ci)
+{
+	unsigned int idx;
+
+	idx = cluster_next(&list->head);
+	if (cluster_next(&list->tail) == idx) {
+		cluster_set_null(&list->head);
+		cluster_set_null(&list->tail);
+	} else
+		cluster_set_next_flag(&list->head,
+				      cluster_next(&ci[idx]), 0);
+
+	return idx;
+}
+
 /* Add a cluster to discard list and schedule it to do discard */
 static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 		unsigned int idx)
@@ -270,17 +317,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
 
-	if (cluster_is_null(&si->discard_cluster_head)) {
-		cluster_set_next_flag(&si->discard_cluster_head,
-						idx, 0);
-		cluster_set_next_flag(&si->discard_cluster_tail,
-						idx, 0);
-	} else {
-		unsigned int tail = cluster_next(&si->discard_cluster_tail);
-		cluster_set_next(&si->cluster_info[tail], idx);
-		cluster_set_next_flag(&si->discard_cluster_tail,
-						idx, 0);
-	}
+	cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
 
 	schedule_work(&si->discard_work);
 }
@@ -296,15 +333,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 
 	info = si->cluster_info;
 
-	while (!cluster_is_null(&si->discard_cluster_head)) {
-		idx = cluster_next(&si->discard_cluster_head);
-
-		cluster_set_next_flag(&si->discard_cluster_head,
-						cluster_next(&info[idx]), 0);
-		if (cluster_next(&si->discard_cluster_tail) == idx) {
-			cluster_set_null(&si->discard_cluster_head);
-			cluster_set_null(&si->discard_cluster_tail);
-		}
+	while (!cluster_list_empty(&si->discard_clusters)) {
+		idx = cluster_list_del_first(&si->discard_clusters, info);
 		spin_unlock(&si->lock);
 
 		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
@@ -312,19 +342,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 
 		spin_lock(&si->lock);
 		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
-		if (cluster_is_null(&si->free_cluster_head)) {
-			cluster_set_next_flag(&si->free_cluster_head,
-						idx, 0);
-			cluster_set_next_flag(&si->free_cluster_tail,
-						idx, 0);
-		} else {
-			unsigned int tail;
-
-			tail = cluster_next(&si->free_cluster_tail);
-			cluster_set_next(&info[tail], idx);
-			cluster_set_next_flag(&si->free_cluster_tail,
-						idx, 0);
-		}
+		cluster_list_add_tail(&si->free_clusters, info, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 	}
@@ -353,13 +371,8 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 	if (!cluster_info)
 		return;
 	if (cluster_is_free(&cluster_info[idx])) {
-		VM_BUG_ON(cluster_next(&p->free_cluster_head) != idx);
-		cluster_set_next_flag(&p->free_cluster_head,
-			cluster_next(&cluster_info[idx]), 0);
-		if (cluster_next(&p->free_cluster_tail) == idx) {
-			cluster_set_null(&p->free_cluster_tail);
-			cluster_set_null(&p->free_cluster_head);
-		}
+		VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx);
+		cluster_list_del_first(&p->free_clusters, cluster_info);
 		cluster_set_count_flag(&cluster_info[idx], 0, 0);
 	}
 
@@ -398,14 +411,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 		}
 
 		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-		if (cluster_is_null(&p->free_cluster_head)) {
-			cluster_set_next_flag(&p->free_cluster_head, idx, 0);
-			cluster_set_next_flag(&p->free_cluster_tail, idx, 0);
-		} else {
-			unsigned int tail = cluster_next(&p->free_cluster_tail);
-			cluster_set_next(&cluster_info[tail], idx);
-			cluster_set_next_flag(&p->free_cluster_tail, idx, 0);
-		}
+		cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
 	}
 }
 
@@ -421,8 +427,8 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 	bool conflict;
 
 	offset /= SWAPFILE_CLUSTER;
-	conflict = !cluster_is_null(&si->free_cluster_head) &&
-		offset != cluster_next(&si->free_cluster_head) &&
+	conflict = !cluster_list_empty(&si->free_clusters) &&
+		offset != cluster_list_first(&si->free_clusters) &&
 		cluster_is_free(&si->cluster_info[offset]);
 
 	if (!conflict)
@@ -447,11 +453,11 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
 	if (cluster_is_null(&cluster->index)) {
-		if (!cluster_is_null(&si->free_cluster_head)) {
-			cluster->index = si->free_cluster_head;
+		if (!cluster_list_empty(&si->free_clusters)) {
+			cluster->index = si->free_clusters.head;
 			cluster->next = cluster_next(&cluster->index) *
 					SWAPFILE_CLUSTER;
-		} else if (!cluster_is_null(&si->discard_cluster_head)) {
+		} else if (!cluster_list_empty(&si->discard_clusters)) {
 			/*
 			 * we don't have free cluster but have some clusters in
 			 * discarding, do discard now and reclaim them
@@ -2292,10 +2298,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
-	cluster_set_null(&p->free_cluster_head);
-	cluster_set_null(&p->free_cluster_tail);
-	cluster_set_null(&p->discard_cluster_head);
-	cluster_set_null(&p->discard_cluster_tail);
+	cluster_list_init(&p->free_clusters);
+	cluster_list_init(&p->discard_clusters);
 
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
@@ -2341,19 +2345,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	for (i = 0; i < nr_clusters; i++) {
 		if (!cluster_count(&cluster_info[idx])) {
 			cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-			if (cluster_is_null(&p->free_cluster_head)) {
-				cluster_set_next_flag(&p->free_cluster_head,
-								idx, 0);
-				cluster_set_next_flag(&p->free_cluster_tail,
-								idx, 0);
-			} else {
-				unsigned int tail;
-
-				tail = cluster_next(&p->free_cluster_tail);
-				cluster_set_next(&cluster_info[tail], idx);
-				cluster_set_next_flag(&p->free_cluster_tail,
-								idx, 0);
-			}
+			cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
 		}
 		idx++;
 		if (idx == nr_clusters)
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 02/11] swap: Change SWAPFILE_CLUSTER to 512
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
  2016-08-09 16:37 ` [RFC 01/11] swap: Add swap_cluster_list Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 16:37 ` [RFC 03/11] mm, memcg: Add swap_cgroup_iter iterator Huang, Ying
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

In this patch, the size of the swap cluster is changed to that of the THP
(Transparent Huge Page) on x86_64 (512).  This is for THP swap support on
x86_64, where one swap cluster will be used to hold the contents of each
THP swapped out, and some information about the swapped-out THP (such as
the compound map count) will be recorded in the swap_cluster_info data
structure.

In effect, this doubles the swap cluster size, which may make it harder
to find a free cluster when the swap space becomes fragmented.  In
theory, that could reduce continuous swap space allocation and sequential
writes if it happens.  The performance tests in 0day show no regressions
caused by this.

Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swapfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 09e3877..18f9292 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -196,7 +196,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 	}
 }
 
-#define SWAPFILE_CLUSTER	256
+#define SWAPFILE_CLUSTER	512
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 03/11] mm, memcg: Add swap_cgroup_iter iterator
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
  2016-08-09 16:37 ` [RFC 01/11] swap: Add swap_cluster_list Huang, Ying
  2016-08-09 16:37 ` [RFC 02/11] swap: Change SWAPFILE_CLUSTER to 512 Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 16:37 ` [RFC 04/11] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Tejun Heo

From: Huang Ying <ying.huang@intel.com>

The swap cgroup uses a discontinuous array to store the information for
the swap entries.  lookup_swap_cgroup() provides good encapsulation for
accessing one element of the discontinuous array.  To make it easier to
access multiple elements of the discontinuous array, an iterator for the
swap cgroup, named swap_cgroup_iter, is added in this patch.

This will be used for transparent huge page (THP) swap support, where
the swap_cgroup for multiple swap entries will be changed together.
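
For illustration, the iterator can be modeled in userspace as walking an
array split into fixed-size pages, each protected by its own lock, where
the lock is dropped and re-acquired only when crossing a page boundary.
This is a simplified standalone sketch, not the kernel implementation
(the kernel detects the page boundary via PAGE_MASK on the element
pointer):

#include <pthread.h>
#include <stdio.h>

#define PER_PAGE 8              /* elements per "page" (SC_PER_PAGE in the kernel) */
#define NR_PAGES 4

static unsigned short ids[NR_PAGES][PER_PAGE];
static pthread_mutex_t page_lock[NR_PAGES] = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

struct iter {
        unsigned long pos;      /* global index, like a swap entry offset */
        unsigned short *sc;     /* current element */
};

static void iter_init(struct iter *it, unsigned long pos)
{
        it->pos = pos;
        it->sc = &ids[pos / PER_PAGE][pos % PER_PAGE];
        pthread_mutex_lock(&page_lock[pos / PER_PAGE]);
}

static void iter_exit(struct iter *it)
{
        pthread_mutex_unlock(&page_lock[it->pos / PER_PAGE]);
}

/* Advance one element; switch locks when crossing into the next page. */
static void iter_advance(struct iter *it)
{
        unsigned long old_page = it->pos / PER_PAGE;

        it->pos++;
        if (it->pos / PER_PAGE != old_page) {
                pthread_mutex_unlock(&page_lock[old_page]);
                pthread_mutex_lock(&page_lock[it->pos / PER_PAGE]);
        }
        it->sc = &ids[it->pos / PER_PAGE][it->pos % PER_PAGE];
}

int main(void)
{
        struct iter it;

        iter_init(&it, 6);              /* record id 42 for entries 6..9 */
        for (int i = 0; i < 4; i++) {
                *it.sc = 42;
                if (i < 3)
                        iter_advance(&it);
        }
        iter_exit(&it);
        printf("ids[1][1] = %u\n", ids[1][1]);  /* entry 9 */
        return 0;
}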

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swap_cgroup.c | 62 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 46 insertions(+), 16 deletions(-)

diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 310ac0b..3563b8b 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -18,6 +18,13 @@ struct swap_cgroup {
 };
 #define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
 
+struct swap_cgroup_iter {
+	struct swap_cgroup_ctrl *ctrl;
+	struct swap_cgroup *sc;
+	swp_entry_t entry;
+	unsigned long flags;
+};
+
 /*
  * SwapCgroup implements "lookup" and "exchange" operations.
  * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
@@ -75,6 +82,34 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
 	return sc + offset % SC_PER_PAGE;
 }
 
+static void swap_cgroup_iter_init(struct swap_cgroup_iter *iter, swp_entry_t ent)
+{
+	iter->entry = ent;
+	iter->sc = lookup_swap_cgroup(ent, &iter->ctrl);
+	spin_lock_irqsave(&iter->ctrl->lock, iter->flags);
+}
+
+static void swap_cgroup_iter_exit(struct swap_cgroup_iter *iter)
+{
+	spin_unlock_irqrestore(&iter->ctrl->lock, iter->flags);
+}
+
+/*
+ * swap_cgroup is stored in a kind of discontinuous array.  That is,
+ * they are continuous in one page, but not across page boundary.  And
+ * there is one lock for each page.
+ */
+static void swap_cgroup_iter_advance(struct swap_cgroup_iter *iter)
+{
+	iter->sc++;
+	iter->entry.val++;
+	if (!(((unsigned long)iter->sc) & PAGE_MASK)) {
+		spin_unlock_irqrestore(&iter->ctrl->lock, iter->flags);
+		iter->sc = lookup_swap_cgroup(iter->entry, &iter->ctrl);
+		spin_lock_irqsave(&iter->ctrl->lock, iter->flags);
+	}
+}
+
 /**
  * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
  * @ent: swap entry to be cmpxchged
@@ -87,20 +122,18 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
 unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new)
 {
-	struct swap_cgroup_ctrl *ctrl;
-	struct swap_cgroup *sc;
-	unsigned long flags;
+	struct swap_cgroup_iter iter;
 	unsigned short retval;
 
-	sc = lookup_swap_cgroup(ent, &ctrl);
+	swap_cgroup_iter_init(&iter, ent);
 
-	spin_lock_irqsave(&ctrl->lock, flags);
-	retval = sc->id;
+	retval = iter.sc->id;
 	if (retval == old)
-		sc->id = new;
+		iter.sc->id = new;
 	else
 		retval = 0;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
+
+	swap_cgroup_iter_exit(&iter);
 	return retval;
 }
 
@@ -114,18 +147,15 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
  */
 unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
 {
-	struct swap_cgroup_ctrl *ctrl;
-	struct swap_cgroup *sc;
+	struct swap_cgroup_iter iter;
 	unsigned short old;
-	unsigned long flags;
 
-	sc = lookup_swap_cgroup(ent, &ctrl);
+	swap_cgroup_iter_init(&iter, ent);
 
-	spin_lock_irqsave(&ctrl->lock, flags);
-	old = sc->id;
-	sc->id = id;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
+	old = iter.sc->id;
+	iter.sc->id = id;
 
+	swap_cgroup_iter_exit(&iter);
 	return old;
 }
 
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 04/11] mm, memcg: Support to charge/uncharge multiple swap entries
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (2 preceding siblings ...)
  2016-08-09 16:37 ` [RFC 03/11] mm, memcg: Add swap_cgroup_iter iterator Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 16:37 ` [RFC 05/11] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Vladimir Davydov, Johannes Weiner, Michal Hocko,
	Tejun Heo

From: Huang Ying <ying.huang@intel.com>

This patch makes it possible to charge or uncharge a set of continuous
swap entries in the swap cgroup.  The number of swap entries is specified
via an added parameter.

This will be used for THP (Transparent Huge Page) swap support, where a
whole swap cluster backing a THP may be allocated and freed as a whole,
so a set of continuous swap entries (512 on x86_64) backing one THP needs
to be charged or uncharged together.  This also batches the cgroup
operations for THP swap.
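
For illustration, the batched charge boils down to trying to charge nr
entries against a limit in one step, and uncharging nr entries in one
step.  The following standalone sketch uses invented names (it is not the
kernel's page_counter API) just to show the idea:

#include <stdio.h>
#include <stdbool.h>

struct counter {
        long usage;
        long limit;
};

/* Charge nr entries at once, or nothing at all if the limit would be hit. */
static bool counter_try_charge(struct counter *c, long nr)
{
        if (c->usage + nr > c->limit)
                return false;
        c->usage += nr;
        return true;
}

static void counter_uncharge(struct counter *c, long nr)
{
        c->usage -= nr;
}

int main(void)
{
        struct counter swap = { .usage = 0, .limit = 1024 };

        /* One 512-entry charge for a THP instead of 512 single charges. */
        if (counter_try_charge(&swap, 512))
                printf("charged a whole cluster, usage now %ld\n", swap.usage);
        counter_uncharge(&swap, 512);
        return 0;
}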

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h        | 11 +++++----
 include/linux/swap_cgroup.h |  6 +++--
 mm/memcontrol.c             | 54 +++++++++++++++++++++++++--------------------
 mm/shmem.c                  |  2 +-
 mm/swap_cgroup.c            | 17 ++++++++++----
 mm/swap_state.c             |  2 +-
 mm/swapfile.c               |  2 +-
 7 files changed, 57 insertions(+), 37 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed41bec..6988bce 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -550,8 +550,9 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 
 #ifdef CONFIG_MEMCG_SWAP
 extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
-extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry);
-extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
+extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry,
+				      unsigned int nr_entries);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_entries);
 extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg);
 extern bool mem_cgroup_swap_full(struct page *page);
 #else
@@ -560,12 +561,14 @@ static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 }
 
 static inline int mem_cgroup_try_charge_swap(struct page *page,
-					     swp_entry_t entry)
+					     swp_entry_t entry,
+					     unsigned int nr_entries)
 {
 	return 0;
 }
 
-static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry,
+					    unsigned int nr_entries)
 {
 }
 
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index 145306b..b2b8ec7 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -7,7 +7,8 @@
 
 extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 					unsigned short old, unsigned short new);
-extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
+extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+					 unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
@@ -15,7 +16,8 @@ extern void swap_cgroup_swapoff(int type);
 #else
 
 static inline
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
+unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+				  unsigned int nr_ents)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f507f0..d29b368 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2362,10 +2362,9 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 
 #ifdef CONFIG_MEMCG_SWAP
 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
+				       int nr_entries)
 {
-	int val = (charge) ? 1 : -1;
-	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
+	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_entries);
 }
 
 /**
@@ -2391,8 +2390,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry,
 	new_id = mem_cgroup_id(to);
 
 	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_swap_statistics(to, true);
+		mem_cgroup_swap_statistics(from, -1);
+		mem_cgroup_swap_statistics(to, 1);
 		return 0;
 	}
 	return -EINVAL;
@@ -5416,7 +5415,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		 * let's not wait for it.  The page already received a
 		 * memory+swap charge, drop the swap entry duplicate.
 		 */
-		mem_cgroup_uncharge_swap(entry);
+		mem_cgroup_uncharge_swap(entry, nr_pages);
 	}
 }
 
@@ -5799,9 +5798,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 		return;
 
 	swap_memcg = mem_cgroup_id_get_active(memcg);
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg));
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 1);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(swap_memcg, true);
+	mem_cgroup_swap_statistics(swap_memcg, 1);
 
 	page->mem_cgroup = NULL;
 
@@ -5827,16 +5826,19 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 		css_put(&memcg->css);
 }
 
-/*
- * mem_cgroup_try_charge_swap - try charging a swap entry
+/**
+ * mem_cgroup_try_charge_swap - try charging a set of swap entries
  * @page: page being added to swap
- * @entry: swap entry to charge
+ * @entry: the first swap entry to charge
+ * @nr_entries: the number of swap entries to charge
  *
- * Try to charge @entry to the memcg that @page belongs to.
+ * Try to charge @nr_entries swap entries starting from @entry to the
+ * memcg that @page belongs to.
  *
  * Returns 0 on success, -ENOMEM on failure.
  */
-int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
+int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry,
+			       unsigned int nr_entries)
 {
 	struct mem_cgroup *memcg;
 	struct page_counter *counter;
@@ -5854,25 +5856,29 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	memcg = mem_cgroup_id_get_active(memcg);
 
 	if (!mem_cgroup_is_root(memcg) &&
-	    !page_counter_try_charge(&memcg->swap, 1, &counter)) {
+	    !page_counter_try_charge(&memcg->swap, nr_entries, &counter)) {
 		mem_cgroup_id_put(memcg);
 		return -ENOMEM;
 	}
 
-	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
+	if (nr_entries > 1)
+		mem_cgroup_id_get_many(memcg, nr_entries - 1);
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_entries);
 	VM_BUG_ON_PAGE(oldid, page);
-	mem_cgroup_swap_statistics(memcg, true);
+	mem_cgroup_swap_statistics(memcg, nr_entries);
 
 	return 0;
 }
 
 /**
- * mem_cgroup_uncharge_swap - uncharge a swap entry
- * @entry: swap entry to uncharge
+ * mem_cgroup_uncharge_swap - uncharge a set of swap entries
+ * @entry: the first swap entry to uncharge
+ * @nr_entries: the number of swap entries to uncharge
  *
- * Drop the swap charge associated with @entry.
+ * Drop the swap charge associated with @nr_entries swap entries
+ * starting from @entry.
  */
-void mem_cgroup_uncharge_swap(swp_entry_t entry)
+void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_entries)
 {
 	struct mem_cgroup *memcg;
 	unsigned short id;
@@ -5880,17 +5886,17 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
 	if (!do_swap_account)
 		return;
 
-	id = swap_cgroup_record(entry, 0);
+	id = swap_cgroup_record(entry, 0, nr_entries);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
 	if (memcg) {
 		if (!mem_cgroup_is_root(memcg)) {
 			if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
-				page_counter_uncharge(&memcg->swap, 1);
+				page_counter_uncharge(&memcg->swap, nr_entries);
 			else
-				page_counter_uncharge(&memcg->memsw, 1);
+				page_counter_uncharge(&memcg->memsw, nr_entries);
 		}
-		mem_cgroup_swap_statistics(memcg, false);
+		mem_cgroup_swap_statistics(memcg, -nr_entries);
 		mem_cgroup_id_put(memcg);
 	}
 	rcu_read_unlock();
diff --git a/mm/shmem.c b/mm/shmem.c
index 7f7748a..fa4067e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1248,7 +1248,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	if (!swap.val)
 		goto redirty;
 
-	if (mem_cgroup_try_charge_swap(page, swap))
+	if (mem_cgroup_try_charge_swap(page, swap, 1))
 		goto free_swap;
 
 	/*
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 3563b8b..a2cafbd 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -138,14 +138,16 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 }
 
 /**
- * swap_cgroup_record - record mem_cgroup for this swp_entry.
- * @ent: swap entry to be recorded into
+ * swap_cgroup_record - record mem_cgroup for a set of swap entries
+ * @ent: the first swap entry to be recorded into
  * @id: mem_cgroup to be recorded
+ * @nr_ents: number of swap entries to be recorded
  *
  * Returns old value at success, 0 at failure.
  * (Of course, old value can be 0.)
  */
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
+unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
+				  unsigned int nr_ents)
 {
 	struct swap_cgroup_iter iter;
 	unsigned short old;
@@ -153,7 +155,14 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
 	swap_cgroup_iter_init(&iter, ent);
 
 	old = iter.sc->id;
-	iter.sc->id = id;
+	for (;;) {
+		VM_BUG_ON(iter.sc->id != old);
+		iter.sc->id = id;
+		nr_ents--;
+		if (!nr_ents)
+			break;
+		swap_cgroup_iter_advance(&iter);
+	}
 
 	swap_cgroup_iter_exit(&iter);
 	return old;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c8310a3..2013793 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -170,7 +170,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 	if (!entry.val)
 		return 0;
 
-	if (mem_cgroup_try_charge_swap(page, entry)) {
+	if (mem_cgroup_try_charge_swap(page, entry, 1)) {
 		swapcache_free(entry);
 		return 0;
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 18f9292..25363c2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -802,7 +802,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 
 	/* free if no reference */
 	if (!usage) {
-		mem_cgroup_uncharge_swap(entry);
+		mem_cgroup_uncharge_swap(entry, 1);
 		dec_cluster_info_page(p, p->cluster_info, offset);
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 05/11] mm, THP, swap: Add swap cluster allocate/free functions
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (3 preceding siblings ...)
  2016-08-09 16:37 ` [RFC 04/11] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 16:37 ` [RFC 06/11] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

The swap cluster allocation/free functions are added based on the
existing swap cluster management mechanism for SSDs.  These functions
don't work for traditional hard disks because the existing swap cluster
management mechanism doesn't work for them.  Hard disk support may be
added later if someone really needs it, but it needn't be included in
this patchset.

This will be used for THP (Transparent Huge Page) swap support, where
one swap cluster will hold the contents of each THP swapped out.
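
For illustration, whole-cluster allocation amounts to taking the first
free cluster off the free list and marking all of its 512 swap_map slots
as SWAP_HAS_CACHE; freeing reverses that.  Below is a simplified
standalone model (it uses a plain stack of free cluster indices instead
of the list from patch 1, and the names are made up for the example):

#include <stdio.h>
#include <string.h>

#define CLUSTER_SIZE    512             /* SWAPFILE_CLUSTER after patch 2 */
#define NR_CLUSTERS     4
#define SWAP_HAS_CACHE  0x40

static unsigned char swap_map[NR_CLUSTERS * CLUSTER_SIZE];
static unsigned int free_stack[NR_CLUSTERS] = { 0, 1, 2, 3 };
static int nr_free = NR_CLUSTERS;

/* Returns the first swap offset of the cluster, or -1 if none is free. */
static long alloc_huge_cluster(void)
{
        unsigned long offset;

        if (!nr_free)
                return -1;
        offset = (unsigned long)free_stack[--nr_free] * CLUSTER_SIZE;
        /* Every sub-slot starts life as "in swap cache, count 0". */
        memset(swap_map + offset, SWAP_HAS_CACHE, CLUSTER_SIZE);
        return (long)offset;
}

static void free_huge_cluster(unsigned long offset)
{
        memset(swap_map + offset, 0, CLUSTER_SIZE);
        free_stack[nr_free++] = offset / CLUSTER_SIZE;
}

int main(void)
{
        long offset = alloc_huge_cluster();

        printf("allocated cluster at offset %ld\n", offset);
        if (offset >= 0)
                free_huge_cluster((unsigned long)offset);
        return 0;
}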

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swapfile.c | 194 +++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 137 insertions(+), 57 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 25363c2..d710e0e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -322,6 +322,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	schedule_work(&si->discard_work);
 }
 
+static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
+	cluster_list_add_tail(&si->free_clusters, ci, idx);
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
  * will be added to free cluster list. caller should hold si->lock.
@@ -341,8 +349,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&si->free_clusters, info, idx);
+		__free_cluster(si, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 	}
@@ -359,6 +366,34 @@ static void swap_discard_work(struct work_struct *work)
 	spin_unlock(&si->lock);
 }
 
+static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
+	cluster_list_del_first(&si->free_clusters, ci);
+	cluster_set_count_flag(ci + idx, 0, 0);
+}
+
+static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+
+	VM_BUG_ON(cluster_count(ci) != 0);
+	/*
+	 * If the swap is discardable, prepare discard the cluster
+	 * instead of free it immediately. The cluster will be freed
+	 * after discard.
+	 */
+	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
+	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
+		swap_cluster_schedule_discard(si, idx);
+		return;
+	}
+
+	__free_cluster(si, idx);
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
  * removed from free cluster list and its usage counter will be increased.
@@ -370,11 +405,8 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx])) {
-		VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx);
-		cluster_list_del_first(&p->free_clusters, cluster_info);
-		cluster_set_count_flag(&cluster_info[idx], 0, 0);
-	}
+	if (cluster_is_free(&cluster_info[idx]))
+		alloc_cluster(p, idx);
 
 	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
@@ -398,21 +430,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 	cluster_set_count(&cluster_info[idx],
 		cluster_count(&cluster_info[idx]) - 1);
 
-	if (cluster_count(&cluster_info[idx]) == 0) {
-		/*
-		 * If the swap is discardable, prepare discard the cluster
-		 * instead of free it immediately. The cluster will be freed
-		 * after discard.
-		 */
-		if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
-				 (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-			swap_cluster_schedule_discard(p, idx);
-			return;
-		}
-
-		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
-	}
+	if (cluster_count(&cluster_info[idx]) == 0)
+		free_cluster(p, idx);
 }
 
 /*
@@ -493,6 +512,68 @@ new_cluster:
 	*scan_base = tmp;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline unsigned int huge_cluster_nr_entries(bool huge)
+{
+	return huge ? SWAPFILE_CLUSTER : 1;
+}
+#else
+#define huge_cluster_nr_entries(huge)	1
+#endif
+
+static void __swap_entry_alloc(struct swap_info_struct *si, unsigned long offset,
+			       bool huge)
+{
+	unsigned int nr_entries = huge_cluster_nr_entries(huge);
+	unsigned int end = offset + nr_entries - 1;
+
+	if (offset == si->lowest_bit)
+		si->lowest_bit += nr_entries;
+	if (end == si->highest_bit)
+		si->highest_bit -= nr_entries;
+	si->inuse_pages += nr_entries;
+	if (si->inuse_pages == si->pages) {
+		si->lowest_bit = si->max;
+		si->highest_bit = 0;
+		spin_lock(&swap_avail_lock);
+		plist_del(&si->avail_list, &swap_avail_head);
+		spin_unlock(&swap_avail_lock);
+	}
+}
+
+static void __swap_entry_free(struct swap_info_struct *si, unsigned long offset,
+			      bool huge)
+{
+	unsigned int nr_entries = huge_cluster_nr_entries(huge);
+	unsigned long end = offset + nr_entries - 1;
+	void (*swap_slot_free_notify)(struct block_device *, unsigned long);
+
+	if (offset < si->lowest_bit)
+		si->lowest_bit = offset;
+	if (end > si->highest_bit) {
+		bool was_full = !si->highest_bit;
+
+		si->highest_bit = end;
+		if (was_full && (si->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&si->avail_list));
+			if (plist_node_empty(&si->avail_list))
+				plist_add(&si->avail_list, &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
+		}
+	}
+	atomic_long_add(nr_entries, &nr_swap_pages);
+	si->inuse_pages -= nr_entries;
+	if (si->flags & SWP_BLKDEV)
+		swap_slot_free_notify = si->bdev->bd_disk->fops->swap_slot_free_notify;
+	while (offset <= end) {
+		frontswap_invalidate_page(si->type, offset);
+		if (swap_slot_free_notify)
+			swap_slot_free_notify(si->bdev, offset);
+		offset++;
+	}
+}
+
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
@@ -587,18 +668,7 @@ checks:
 	if (si->swap_map[offset])
 		goto scan;
 
-	if (offset == si->lowest_bit)
-		si->lowest_bit++;
-	if (offset == si->highest_bit)
-		si->highest_bit--;
-	si->inuse_pages++;
-	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
-		spin_lock(&swap_avail_lock);
-		plist_del(&si->avail_list, &swap_avail_head);
-		spin_unlock(&swap_avail_lock);
-	}
+	__swap_entry_alloc(si, offset, false);
 	si->swap_map[offset] = usage;
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	si->cluster_next = offset + 1;
@@ -645,6 +715,38 @@ no_page:
 	return 0;
 }
 
+static void swap_free_huge_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+	unsigned long offset = idx * SWAPFILE_CLUSTER;
+
+	cluster_set_count_flag(ci, 0, 0);
+	free_cluster(si, idx);
+	__swap_entry_free(si, offset, true);
+}
+
+static unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
+{
+	unsigned long idx;
+	struct swap_cluster_info *ci;
+	unsigned long offset, i;
+	unsigned char *map;
+
+	if (cluster_list_empty(&si->free_clusters))
+		return 0;
+	idx = cluster_list_first(&si->free_clusters);
+	alloc_cluster(si, idx);
+	ci = si->cluster_info + idx;
+	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+
+	offset = idx * SWAPFILE_CLUSTER;
+	__swap_entry_alloc(si, offset, true);
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++)
+		map[i] = SWAP_HAS_CACHE;
+	return offset;
+}
+
 swp_entry_t get_swap_page(void)
 {
 	struct swap_info_struct *si, *next;
@@ -804,29 +906,7 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 	if (!usage) {
 		mem_cgroup_uncharge_swap(entry, 1);
 		dec_cluster_info_page(p, p->cluster_info, offset);
-		if (offset < p->lowest_bit)
-			p->lowest_bit = offset;
-		if (offset > p->highest_bit) {
-			bool was_full = !p->highest_bit;
-			p->highest_bit = offset;
-			if (was_full && (p->flags & SWP_WRITEOK)) {
-				spin_lock(&swap_avail_lock);
-				WARN_ON(!plist_node_empty(&p->avail_list));
-				if (plist_node_empty(&p->avail_list))
-					plist_add(&p->avail_list,
-						  &swap_avail_head);
-				spin_unlock(&swap_avail_lock);
-			}
-		}
-		atomic_long_inc(&nr_swap_pages);
-		p->inuse_pages--;
-		frontswap_invalidate_page(p->type, offset);
-		if (p->flags & SWP_BLKDEV) {
-			struct gendisk *disk = p->bdev->bd_disk;
-			if (disk->fops->swap_slot_free_notify)
-				disk->fops->swap_slot_free_notify(p->bdev,
-								  offset);
-		}
+		__swap_entry_free(p, offset, false);
 	}
 
 	return usage;
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 06/11] mm, THP, swap: Add get_huge_swap_page()
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (4 preceding siblings ...)
  2016-08-09 16:37 ` [RFC 05/11] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 16:37 ` [RFC 07/11] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page Huang, Ying
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

A variation of get_swap_page(), get_huge_swap_page(), is added to
allocate a swap cluster (512 swap slots) based on the swap cluster
allocation function.  A fairly simple algorithm is used: only the first
swap device in the priority list will be tried for allocating the swap
cluster.  The function will fail if that attempt is not successful, and
the caller will fall back to allocating single swap slots instead.  This
works well enough for normal cases.

This will be used for THP (Transparent Huge Page) swap support, where
get_huge_swap_page() will be used to allocate one swap cluster for each
THP swapped out.
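
For illustration, the intended calling pattern is: try to get a whole
cluster first, and if that fails fall back to single swap slots (after
splitting the THP).  The real caller only appears in a later patch; the
standalone sketch below uses stub functions just so the example is
self-contained:

#include <stdio.h>

typedef struct { unsigned long val; } swp_entry_t;

static swp_entry_t get_huge_swap_page(void)
{
        /* Stub: pretend no whole cluster is available. */
        return (swp_entry_t){ 0 };
}

static swp_entry_t get_swap_page(void)
{
        /* Stub: hand out fake single-slot entries. */
        static unsigned long next = 100;
        return (swp_entry_t){ next++ };
}

int main(void)
{
        swp_entry_t entry = get_huge_swap_page();

        if (!entry.val) {
                /* Cluster allocation failed: fall back to single slots. */
                printf("fall back: would split the THP and swap sub-pages\n");
                entry = get_swap_page();
        }
        printf("got entry %lu\n", entry.val);
        return 0;
}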

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h | 21 ++++++++++++++++++++-
 mm/swapfile.c        | 29 +++++++++++++++++++++++------
 2 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6988bce..95a526e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -399,7 +399,7 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t __get_swap_page(bool huge);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
@@ -419,6 +419,20 @@ extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
+static inline swp_entry_t get_swap_page(void)
+{
+	return __get_swap_page(false);
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern swp_entry_t get_huge_swap_page(void);
+#else
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return (swp_entry_t) {0};
+}
+#endif
+
 #else /* CONFIG_SWAP */
 
 #define swap_address_space(entry)		(NULL)
@@ -525,6 +539,11 @@ static inline swp_entry_t get_swap_page(void)
 	return entry;
 }
 
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return (swp_entry_t) {0};
+}
+
 #endif /* CONFIG_SWAP */
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d710e0e..5cd78c7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -747,14 +747,15 @@ static unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
 	return offset;
 }
 
-swp_entry_t get_swap_page(void)
+swp_entry_t __get_swap_page(bool huge)
 {
 	struct swap_info_struct *si, *next;
 	pgoff_t offset;
+	int nr_pages = huge_cluster_nr_entries(huge);
 
-	if (atomic_long_read(&nr_swap_pages) <= 0)
+	if (atomic_long_read(&nr_swap_pages) < nr_pages)
 		goto noswap;
-	atomic_long_dec(&nr_swap_pages);
+	atomic_long_sub(nr_pages, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
 
@@ -782,10 +783,15 @@ start_over:
 		}
 
 		/* This is called for allocating swap entry for cache */
-		offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		if (likely(nr_pages == 1))
+			offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		else
+			offset = swap_alloc_huge_cluster(si);
 		spin_unlock(&si->lock);
 		if (offset)
 			return swp_entry(si->type, offset);
+		else if (unlikely(nr_pages != 1))
+			goto fail_alloc;
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
 		       si->type);
 		spin_lock(&swap_avail_lock);
@@ -805,12 +811,23 @@ nextsi:
 	}
 
 	spin_unlock(&swap_avail_lock);
-
-	atomic_long_inc(&nr_swap_pages);
+fail_alloc:
+	atomic_long_add(nr_pages, &nr_swap_pages);
 noswap:
 	return (swp_entry_t) {0};
 }
 
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+swp_entry_t get_huge_swap_page(void)
+{
+	if (SWAPFILE_CLUSTER != HPAGE_PMD_NR)
+		return (swp_entry_t) {0};
+
+	return __get_swap_page(true);
+}
+#endif
+
 /* The only caller of this function is now suspend routine */
 swp_entry_t get_swap_page_of_type(int type)
 {
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 07/11] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (5 preceding siblings ...)
  2016-08-09 16:37 ` [RFC 06/11] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 16:37 ` [RFC 08/11] mm, THP, swap: Support to add/delete THP to/from swap cache Huang, Ying
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

__swapcache_free() is added to support clearing SWAP_HAS_CACHE for a huge
page.  For now this simply frees the specified swap cluster, because for
now the function will be called only in the error path to free the swap
cluster just allocated.  So the corresponding swap_map[i] ==
SWAP_HAS_CACHE, that is, the swap count is 0.  This makes the
implementation simpler than that for an ordinary swap entry.

This will be used for delaying splitting the THP (Transparent Huge Page)
during swapping out.  There, to swap out one THP, we will allocate a swap
cluster, add the THP into the swap cache, then split the THP.  If
anything fails after allocating the swap cluster and before splitting the
THP successfully, swapcache_free_trans_huge() will be used to free the
allocated swap space.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |  9 +++++++--
 mm/swapfile.c        | 27 +++++++++++++++++++++++++--
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 95a526e..04d963f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -406,7 +406,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
+extern void __swapcache_free(swp_entry_t, bool);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -475,7 +475,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void __swapcache_free(swp_entry_t swp, bool huge)
 {
 }
 
@@ -546,6 +546,11 @@ static inline swp_entry_t get_huge_swap_page(void)
 
 #endif /* CONFIG_SWAP */
 
+static inline void swapcache_free(swp_entry_t entry)
+{
+	__swapcache_free(entry, false);
+}
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5cd78c7..be89a2f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -945,15 +945,38 @@ void swap_free(swp_entry_t entry)
 }
 
 /*
+ * Caller should hold si->lock.
+ */
+static void swapcache_free_trans_huge(struct swap_info_struct *si,
+				      swp_entry_t entry)
+{
+	unsigned long offset = swp_offset(entry);
+	unsigned long idx = offset / SWAPFILE_CLUSTER;
+	unsigned char *map;
+	unsigned int i;
+
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
+		map[i] &= ~SWAP_HAS_CACHE;
+	}
+	mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
+	swap_free_huge_cluster(si, idx);
+}
+
+/*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry)
+void __swapcache_free(swp_entry_t entry, bool huge)
 {
 	struct swap_info_struct *p;
 
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, entry, SWAP_HAS_CACHE);
+		if (unlikely(huge))
+			swapcache_free_trans_huge(p, entry);
+		else
+			swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 08/11] mm, THP, swap: Support to add/delete THP to/from swap cache
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (6 preceding siblings ...)
  2016-08-09 16:37 ` [RFC 07/11] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 16:37 ` [RFC 09/11] mm, THP: Add can_split_huge_page() Huang, Ying
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

With this patch, a THP (Transparent Huge Page) can be added to/deleted
from the swap cache as a set of sub-pages (512 on x86_64).

This will be used for THP swap support, where one whole THP may be added
to/deleted from the swap cache.  This also batches the swap cache
operations to reduce the number of lock acquisitions/releases for THP
swap.
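
For illustration, the add path becomes an insert loop over nr consecutive
indices with a rollback of the already-inserted entries on failure.  The
following standalone sketch models that with a plain array standing in
for the radix tree (names invented for the example):

#include <stdio.h>

#define NR_SLOTS 16

static void *slots[NR_SLOTS];

static int insert(unsigned long index, void *page)
{
        if (index >= NR_SLOTS || slots[index])
                return -1;      /* -EEXIST/-ENOMEM in the kernel */
        slots[index] = page;
        return 0;
}

/* Insert nr consecutive entries; on failure, delete what was inserted. */
static int add_pages(unsigned long first, int nr, void *pages[])
{
        int i, error = 0;

        for (i = 0; i < nr; i++) {
                error = insert(first + i, pages[i]);
                if (error)
                        break;
        }
        if (error) {
                while (i--)             /* roll back partial insertion */
                        slots[first + i] = NULL;
        }
        return error;
}

int main(void)
{
        void *pages[4] = { (void *)1, (void *)2, (void *)3, (void *)4 };

        printf("add at 2: %d\n", add_pages(2, 4, pages));
        printf("add at 14 (fails, rolled back): %d\n", add_pages(14, 4, pages));
        printf("slot 14 after rollback: %p\n", slots[14]);
        return 0;
}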

Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/page-flags.h |  2 +-
 mm/swap_state.c            | 57 +++++++++++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 19 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda..f5bcbea 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -314,7 +314,7 @@ PAGEFLAG_FALSE(HighMem)
 #endif
 
 #ifdef CONFIG_SWAP
-PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+PAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2013793..a41fd10 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -41,6 +41,7 @@ struct address_space swapper_spaces[MAX_SWAPFILES] = {
 };
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
+#define ADD_CACHE_INFO(x, nr)	do { swap_cache_info.x += (nr); } while (0)
 
 static struct {
 	unsigned long add_total;
@@ -78,25 +79,32 @@ void show_swap_cache_info(void)
  */
 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
-	int error;
+	int error, i, nr = hpage_nr_pages(page);
 	struct address_space *address_space;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
 	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
-	get_page(page);
+	page_ref_add(page, nr);
 	SetPageSwapCache(page);
-	set_page_private(page, entry.val);
 
 	address_space = swap_address_space(entry);
 	spin_lock_irq(&address_space->tree_lock);
-	error = radix_tree_insert(&address_space->page_tree,
-					entry.val, page);
+	for (i = 0; i < nr; i++) {
+		struct page *cur_page = page + i;
+		unsigned long index = entry.val + i;
+
+		set_page_private(cur_page, index);
+		error = radix_tree_insert(&address_space->page_tree,
+					  index, cur_page);
+		if (unlikely(error))
+			break;
+	}
 	if (likely(!error)) {
-		address_space->nrpages++;
-		__inc_node_page_state(page, NR_FILE_PAGES);
-		INC_CACHE_INFO(add_total);
+		address_space->nrpages += nr;
+		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
+		ADD_CACHE_INFO(add_total, nr);
 	}
 	spin_unlock_irq(&address_space->tree_lock);
 
@@ -107,9 +115,16 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 		 * So add_to_swap_cache() doesn't returns -EEXIST.
 		 */
 		VM_BUG_ON(error == -EEXIST);
-		set_page_private(page, 0UL);
 		ClearPageSwapCache(page);
-		put_page(page);
+		set_page_private(page + i, 0UL);
+		while (i--) {
+			struct page *cur_page = page + i;
+			unsigned long index = entry.val + i;
+
+			set_page_private(cur_page, 0UL);
+			radix_tree_delete(&address_space->page_tree, index);
+		}
+		page_ref_sub(page, nr);
 	}
 
 	return error;
@@ -120,7 +135,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
 {
 	int error;
 
-	error = radix_tree_maybe_preload(gfp_mask);
+	error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page));
 	if (!error) {
 		error = __add_to_swap_cache(page, entry);
 		radix_tree_preload_end();
@@ -136,6 +151,7 @@ void __delete_from_swap_cache(struct page *page)
 {
 	swp_entry_t entry;
 	struct address_space *address_space;
+	int i, nr = hpage_nr_pages(page);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
@@ -143,12 +159,17 @@ void __delete_from_swap_cache(struct page *page)
 
 	entry.val = page_private(page);
 	address_space = swap_address_space(entry);
-	radix_tree_delete(&address_space->page_tree, page_private(page));
-	set_page_private(page, 0);
 	ClearPageSwapCache(page);
-	address_space->nrpages--;
-	__dec_node_page_state(page, NR_FILE_PAGES);
-	INC_CACHE_INFO(del_total);
+	for (i = 0; i < nr; i++) {
+		struct page *cur_page = page + i;
+
+		radix_tree_delete(&address_space->page_tree,
+				  page_private(cur_page));
+		set_page_private(cur_page, 0);
+	}
+	address_space->nrpages -= nr;
+	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
+	ADD_CACHE_INFO(del_total, nr);
 }
 
 /**
@@ -225,8 +246,8 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry);
-	put_page(page);
+	__swapcache_free(entry, PageTransHuge(page));
+	page_ref_sub(page, hpage_nr_pages(page));
 }
 
 /* 
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 09/11] mm, THP: Add can_split_huge_page()
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (7 preceding siblings ...)
  2016-08-09 16:37 ` [RFC 08/11] mm, THP, swap: Support to add/delete THP to/from swap cache Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 16:37 ` [RFC 10/11] mm, THP, swap: Support to split THP in swap cache Huang, Ying
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Ebru Akagunduz

From: Huang Ying <ying.huang@intel.com>

Separate the check of whether we can split the huge page from
split_huge_page_to_list() into its own function.  This makes it possible
to perform the check before actually splitting the THP (Transparent Huge
Page).

This will be used to delay splitting the THP during swapping out: for a
THP, we will allocate a swap cluster, add the THP into the swap cache,
and then split the THP.  Checking first avoids unnecessary work for an
un-splittable THP.

There is no functionality change in this patch.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/huge_mm.h |  6 ++++++
 mm/huge_memory.c        | 13 ++++++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6f14de4..95ccbb4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -90,6 +90,7 @@ extern unsigned long transparent_hugepage_flags;
 extern void prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 
+bool can_split_huge_page(struct page *page);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
@@ -169,6 +170,11 @@ void put_huge_zero_page(void);
 static inline void prep_transhuge_page(struct page *page) {}
 
 #define transparent_hugepage_flags 0UL
+static inline bool
+can_split_huge_page(struct page *page)
+{
+	return false;
+}
 static inline int
 split_huge_page_to_list(struct page *page, struct list_head *list)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2373f0a..af65413 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1954,6 +1954,17 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 	return ret;
 }
 
+/* Racy check whether the huge page can be split */
+bool can_split_huge_page(struct page *page)
+{
+	int extra_pins = 0;
+
+	/* Additional pins from radix tree */
+	if (!PageAnon(page))
+		extra_pins = HPAGE_PMD_NR;
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
 /*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
@@ -2024,7 +2035,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * Racy check if we can split the page, before freeze_page() will
 	 * split PMDs
 	 */
-	if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
+	if (!can_split_huge_page(head)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 10/11] mm, THP, swap: Support to split THP in swap cache
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (8 preceding siblings ...)
  2016-08-09 16:37 ` [RFC 09/11] mm, THP: Add can_split_huge_page() Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 16:37 ` [RFC 11/11] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Ebru Akagunduz

From: Huang Ying <ying.huang@intel.com>

This patch enhances split_huge_page_to_list() to work properly for a
THP (Transparent Huge Page) in the swap cache during swapping out.

This is used to delay splitting the THP during swapping out: for a THP
to be swapped out, we will allocate a swap cluster, add the THP into the
swap cache, and then split the THP.  The page lock is held during this
process, so in code paths other than swapping out, PageSwapCache(THP)
will always be false when a THP needs to be split.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/huge_memory.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index af65413..f738a7e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1772,7 +1772,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * atomic_set() here would be safe on all archs (and not only on x86),
 	 * it's safer to use atomic_inc()/atomic_add().
 	 */
-	if (PageAnon(head)) {
+	if (PageAnon(head) && !PageSwapCache(head)) {
 		page_ref_inc(page_tail);
 	} else {
 		/* Additional pin to radix tree */
@@ -1783,6 +1783,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	page_tail->flags |= (head->flags &
 			((1L << PG_referenced) |
 			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
@@ -1845,7 +1846,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	ClearPageCompound(head);
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
-		page_ref_inc(head);
+		/* Additional pin to radix tree of swap cache */
+		if (PageSwapCache(head))
+			page_ref_add(head, 2);
+		else
+			page_ref_inc(head);
 	} else {
 		/* Additional pin to radix tree */
 		page_ref_add(head, 2);
@@ -1957,10 +1962,12 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 /* Racy check whether the huge page can be split */
 bool can_split_huge_page(struct page *page)
 {
-	int extra_pins = 0;
+	int extra_pins;
 
 	/* Additional pins from radix tree */
-	if (!PageAnon(page))
+	if (PageAnon(page))
+		extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
+	else
 		extra_pins = HPAGE_PMD_NR;
 	return total_mapcount(page) == page_count(page) - extra_pins - 1;
 }
@@ -2013,7 +2020,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ret = -EBUSY;
 			goto out;
 		}
-		extra_pins = 0;
+		extra_pins = PageSwapCache(head) ? HPAGE_PMD_NR : 0;
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 11/11] mm, THP, swap: Delay splitting THP during swap out
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (9 preceding siblings ...)
  2016-08-09 16:37 ` [RFC 10/11] mm, THP, swap: Support to split THP in swap cache Huang, Ying
@ 2016-08-09 16:37 ` Huang, Ying
  2016-08-09 17:25 ` [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
  2016-08-17  0:59 ` Minchan Kim
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel,
	Huang Ying, Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

In this patch, splitting the huge page is delayed from almost the first
step of swapping out to after allocating the swap space for the THP and
adding the THP into the swap cache.  This will reduce lock
acquiring/releasing for the locks used for swap space and swap cache
management.

This is also the first step for THP (Transparent Huge Page) swap
support.  The plan is to delay splitting the THP step by step and
finally avoid splitting it at all.

The advantages of THP swap support are:

- Batch swap operations for THP to reduce lock acquiring/releasing,
  including allocating/freeing swap space, adding/deleting to/from swap
  cache, and writing/reading swap space, etc.

- THP swap space read/write will be 2M sequential IO.  It is
  particularly helpful for swap reads, which are usually 4k random IO.

- It will help with memory fragmentation, especially when THP is heavily
  used by the applications.  2M of contiguous pages will be freed up
  after the THP is swapped out.

With the patchset, the swap out bandwidth improved 12.1% in the
vm-scalability swap-w-seq test case with 16 processes on a Xeon E5 v3
system.  To test sequential swap out, the test case uses 16 processes to
sequentially allocate and write to anonymous pages until RAM and part of
the swap device are used up.

The detailed comparison result is as follows:

base             base+patchset
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   1118821 +-  0%     +12.1%    1254241 +-  1%  vmstat.swap.so
   2460636 +-  1%     +10.6%    2720983 +-  1%  vm-scalability.throughput
    308.79 +-  1%      -7.9%     284.53 +-  1%  vm-scalability.time.elapsed_time
      1639 +-  4%    +232.3%       5446 +-  1%  meminfo.SwapCached
      0.70 +-  3%      +8.7%       0.77 +-  5%  perf-stat.ipc
      9.82 +-  8%     -31.6%       6.72 +-  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list

Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swap_state.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 50 insertions(+), 3 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index a41fd10..5316fbc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
+#include <linux/huge_mm.h>
 
 #include <asm/pgtable.h>
 
@@ -172,12 +173,45 @@ void __delete_from_swap_cache(struct page *page)
 	ADD_CACHE_INFO(del_total, nr);
 }
 
+int add_to_swap_trans_huge(struct page *page, struct list_head *list)
+{
+	swp_entry_t entry;
+	int ret = 0;
+
+	/* cannot split, which may be needed during swap in, skip it */
+	if (!can_split_huge_page(page))
+		return -EBUSY;
+	/* fallback to split huge page firstly if no PMD map */
+	if (!compound_mapcount(page))
+		return 0;
+	entry = get_huge_swap_page();
+	if (!entry.val)
+		return 0;
+	if (mem_cgroup_try_charge_swap(page, entry, HPAGE_PMD_NR)) {
+		__swapcache_free(entry, true);
+		return -EOVERFLOW;
+	}
+	ret = add_to_swap_cache(page, entry,
+				__GFP_HIGH | __GFP_NOMEMALLOC|__GFP_NOWARN);
+	/* -ENOMEM radix-tree allocation failure */
+	if (ret) {
+		__swapcache_free(entry, true);
+		return 0;
+	}
+	ret = split_huge_page_to_list(page, list);
+	if (ret) {
+		delete_from_swap_cache(page);
+		return -EBUSY;
+	}
+	return 1;
+}
+
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
  *
  * Allocate swap space for the page and add the page to the
- * swap cache.  Caller needs to hold the page lock. 
+ * swap cache.  Caller needs to hold the page lock.
  */
 int add_to_swap(struct page *page, struct list_head *list)
 {
@@ -187,6 +221,14 @@ int add_to_swap(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
+	if (unlikely(PageTransHuge(page))) {
+		err = add_to_swap_trans_huge(page, list);
+		if (err < 0)
+			return 0;
+		else if (err > 0)
+			return err;
+		/* fallback to split firstly if return 0 */
+	}
 	entry = get_swap_page();
 	if (!entry.val)
 		return 0;
@@ -306,7 +348,7 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 
 	page = find_get_page(swap_address_space(entry), entry.val);
 
-	if (page) {
+	if (page && likely(!PageCompound(page))) {
 		INC_CACHE_INFO(find_success);
 		if (TestClearPageReadahead(page))
 			atomic_inc(&swapin_readahead_hits);
@@ -332,8 +374,13 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * that would confuse statistics.
 		 */
 		found_page = find_get_page(swapper_space, entry.val);
-		if (found_page)
+		if (found_page) {
+			if (unlikely(PageCompound(found_page))) {
+				put_page(found_page);
+				found_page = NULL;
+			}
 			break;
+		}
 
 		/*
 		 * Get a new page to read into from swap.
-- 
2.8.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (10 preceding siblings ...)
  2016-08-09 16:37 ` [RFC 11/11] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
@ 2016-08-09 17:25 ` Huang, Ying
  2016-08-17  0:59 ` Minchan Kim
  12 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-09 17:25 UTC (permalink / raw)
  To: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli
  Cc: Huang, Ying, linux-mm, linux-kernel

Hi, All,

"Huang, Ying" <ying.huang@intel.com> writes:

> From: Huang Ying <ying.huang@intel.com>
>
> This patchset is based on 8/4 head of mmotm/master.
>
> This is the first step for Transparent Huge Page (THP) swap support.
> The plan is to delaying splitting THP step by step and avoid splitting
> THP finally during THP swapping out and swapping in.
>
> The advantages of THP swap support are:
>
> - Batch swap operations for THP to reduce lock acquiring/releasing,
>   including allocating/freeing swap space, adding/deleting to/from swap
>   cache, and writing/reading swap space, etc.
>
> - THP swap space read/write will be 2M sequence IO.  It is particularly
>   helpful for swap read, which usually are 4k random IO.
>
> - It will help memory fragmentation, especially when THP is heavily used
>   by the applications.  2M continuous pages will be free up after THP
>   swapping out.
>
> As the first step, in this patchset, the splitting huge page is
> delayed from almost the first step of swapping out to after allocating
> the swap space for THP and adding the THP into swap cache.  This will
> reduce lock acquiring/releasing for locks used for swap space and swap
> cache management.

For this patchset posting, in general, I want to check the basic design
with the memory management subsystem maintainers and developers.

[RFC 01/11] swap: Add swap_cluster_list is a cleanup patch, and I think
it should be useful independently.

I am not very confident about the memcg part, that is,

[RFC 03/11] mm, memcg: Add swap_cgroup_iter iterator
[RFC 04/11] mm, memcg: Support to charge/uncharge multiple swap entries

Please help me check them.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
                   ` (11 preceding siblings ...)
  2016-08-09 17:25 ` [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
@ 2016-08-17  0:59 ` Minchan Kim
  2016-08-17  2:06   ` Huang, Ying
  2016-08-22 21:33   ` Huang, Ying
  12 siblings, 2 replies; 27+ messages in thread
From: Minchan Kim @ 2016-08-17  0:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel

Hello Huang,

On Tue, Aug 09, 2016 at 09:37:42AM -0700, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> This patchset is based on 8/4 head of mmotm/master.
> 
> This is the first step for Transparent Huge Page (THP) swap support.
> The plan is to delaying splitting THP step by step and avoid splitting
> THP finally during THP swapping out and swapping in.

What does "delay splitting THP on swapping-in" mean?

> 
> The advantages of THP swap support are:
> 
> - Batch swap operations for THP to reduce lock acquiring/releasing,
>   including allocating/freeing swap space, adding/deleting to/from swap
>   cache, and writing/reading swap space, etc.
> 
> - THP swap space read/write will be 2M sequence IO.  It is particularly
>   helpful for swap read, which usually are 4k random IO.
> 
> - It will help memory fragmentation, especially when THP is heavily used
>   by the applications.  2M continuous pages will be free up after THP
>   swapping out.

Could we get the benefit for normal pages as well as THP pages?
I think Tim and I discussed that a few weeks ago.

Please search for the topics below.

[1] mm: Batch page reclamation under shink_page_list
[2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions

It's different from yours, which focuses on THP swapping, while the
suggestion is more general; if we can do it, it's worth trying, I think.

Anyway, I hope [1/11] gets merged regardless of the patchset, because
I believe nobody feels comfortable with the cluster_info functions. ;-)

Thanks.

> 
> As the first step, in this patchset, the splitting huge page is
> delayed from almost the first step of swapping out to after allocating
> the swap space for THP and adding the THP into swap cache.  This will
> reduce lock acquiring/releasing for locks used for swap space and swap
> cache management.
> 
> With the patchset, the swap out bandwidth improved 12.1% in
> vm-scalability swap-w-seq test case with 16 processes on a Xeon E5 v3
> system.  To test sequence swap out, the test case uses 16 processes
> sequentially allocate and write to anonymous pages until RAM and part of
> the swap device is used up.
> 
> The detailed compare result is as follow,
> 
> base             base+patchset
> ---------------- -------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>    1118821 +-  0%     +12.1%    1254241 +-  1%  vmstat.swap.so
>    2460636 +-  1%     +10.6%    2720983 +-  1%  vm-scalability.throughput
>     308.79 +-  1%      -7.9%     284.53 +-  1%  vm-scalability.time.elapsed_time
>       1639 +-  4%    +232.3%       5446 +-  1%  meminfo.SwapCached
>       0.70 +-  3%      +8.7%       0.77 +-  5%  perf-stat.ipc
>       9.82 +-  8%     -31.6%       6.72 +-  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-17  0:59 ` Minchan Kim
@ 2016-08-17  2:06   ` Huang, Ying
  2016-08-17  5:07     ` Minchan Kim
  2016-08-22 21:33   ` Huang, Ying
  1 sibling, 1 reply; 27+ messages in thread
From: Huang, Ying @ 2016-08-17  2:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, Kirill A . Shutemov, Andrea Arcangeli, linux-mm,
	linux-kernel

Hi, Kim,

Minchan Kim <minchan@kernel.org> writes:

> Hello Huang,
>
> On Tue, Aug 09, 2016 at 09:37:42AM -0700, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> This patchset is based on 8/4 head of mmotm/master.
>> 
>> This is the first step for Transparent Huge Page (THP) swap support.
>> The plan is to delaying splitting THP step by step and avoid splitting
>> THP finally during THP swapping out and swapping in.
>
> What does it mean "delay splitting THP on swapping-in"?

Sorry for my poor English.  We will only delay splitting the THP during
swapping out.  The final target is to avoid splitting the THP during
swapping out, and to swap out/in the THP directly.  Thanks for pointing
that out.  I will revise the patch description in the next version.

>> 
>> The advantages of THP swap support are:
>> 
>> - Batch swap operations for THP to reduce lock acquiring/releasing,
>>   including allocating/freeing swap space, adding/deleting to/from swap
>>   cache, and writing/reading swap space, etc.
>> 
>> - THP swap space read/write will be 2M sequence IO.  It is particularly
>>   helpful for swap read, which usually are 4k random IO.
>> 
>> - It will help memory fragmentation, especially when THP is heavily used
>>   by the applications.  2M continuous pages will be free up after THP
>>   swapping out.
>
> Could we take the benefit for normal pages as well as THP page?

This patchset benefits THP swap only.  It has no effect on normal pages.

> I think Tim and me discussed about that a few weeks ago.

I work closely with Tim on swap optimization.  This patchset is part
of our swap optimization plan.

> Please search below topics.
>
> [1] mm: Batch page reclamation under shink_page_list
> [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
>
> It's different with yours which focused on THP swapping while the suggestion
> would be more general if we can do so it's worth to try it, I think.

I think the general optimization above will benefit both normal pages
and THP, at least for now.  And I think there is no hard conflict
between those two patchsets.

THP swap has more opportunities to be optimized, because we can batch
512 operations together more easily.  For full THP swap support,
unmapping a THP could be more efficient with only one swap count
operation instead of 512, and so could many other operations, such as
adding/removing it to/from the swap cache with a multi-order radix tree,
etc.  And it will help with memory fragmentation: the THP can be kept
across swapping out/in, so there is no need to rebuild the THP via
khugepaged.

But not all pages are huge, so normal page swap optimization is
necessary and good anyway.

> Anyway, I hope [1/11] should be merged regardless of the patchset because
> I believe anyone doesn't feel comfortable with cluser_info functions. ;-)

Thanks,

Best Regards,
Huang, Ying

[snip]


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-17  2:06   ` Huang, Ying
@ 2016-08-17  5:07     ` Minchan Kim
  2016-08-17 17:24       ` Tim Chen
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2016-08-17  5:07 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel

On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
> Hi, Kim,
> 
> Minchan Kim <minchan@kernel.org> writes:
> 
> > Hello Huang,
> >
> > On Tue, Aug 09, 2016 at 09:37:42AM -0700, Huang, Ying wrote:
> >> From: Huang Ying <ying.huang@intel.com>
> >> 
> >> This patchset is based on 8/4 head of mmotm/master.
> >> 
> >> This is the first step for Transparent Huge Page (THP) swap support.
> >> The plan is to delaying splitting THP step by step and avoid splitting
> >> THP finally during THP swapping out and swapping in.
> >
> > What does it mean "delay splitting THP on swapping-in"?
> 
> Sorry for my poor English.  We will only delay splitting the THP during
> swapping out.  The final target is to avoid splitting the THP during
> swapping out, and swap out/in the THP directly.  Thanks for pointing out
> that.  I will revise the patch description in the next version.

Thanks.

> 
> >> 
> >> The advantages of THP swap support are:
> >> 
> >> - Batch swap operations for THP to reduce lock acquiring/releasing,
> >>   including allocating/freeing swap space, adding/deleting to/from swap
> >>   cache, and writing/reading swap space, etc.
> >> 
> >> - THP swap space read/write will be 2M sequence IO.  It is particularly
> >>   helpful for swap read, which usually are 4k random IO.
> >> 
> >> - It will help memory fragmentation, especially when THP is heavily used
> >>   by the applications.  2M continuous pages will be free up after THP
> >>   swapping out.
> >
> > Could we take the benefit for normal pages as well as THP page?
> 
> This patchset benefits the THP swap only.  It has no effect for normal pages.
> 
> > I think Tim and me discussed about that a few weeks ago.
> 
> I work closely with Tim on swap optimization.  This patchset is the part
> of our swap optimization plan.
> 
> > Please search below topics.
> >
> > [1] mm: Batch page reclamation under shink_page_list
> > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
> >
> > It's different with yours which focused on THP swapping while the suggestion
> > would be more general if we can do so it's worth to try it, I think.
> 
> I think the general optimization above will benefit both normal pages
> and THP at least for now.  And I think there are no hard conflict
> between those two patchsets.

If we could do the general optimization, I guess THP swap without
splitting would be more straightforward.

If we can batch-reclaim a certain number of pages all at once, it helps:
we can do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages).  The nr_pages
could be greater or less than 512 pages.  With that, scan_swap_map
effectively searches for empty swap slots in swap_map or the free
cluster list.  Then the part needed from your patchset is just to delay
the splitting of the THP.

> 
> The THP swap has more opportunity to be optimized, because we can batch
> 512 operations together more easily.  For full THP swap support, unmap a
> THP could be more efficient with only one swap count operation instead
> of 512, so do many other operations, such as add/remove from swap cache
> with multi-order radix tree etc.  And it will help memory fragmentation.
> THP can be kept after swapping out/in, need not to rebuild THP via
> khugepaged.

It seems you increased the cluster size to 512 and search for an empty
cluster for a THP swap.  With that approach, I have a concern that once
clusters become fragmented, THP swap support doesn't bring any benefit
at all.

Why do we need an empty cluster for swapping out 512 pages?
IOW, the case below could work for the goal.

A : Allocated slot
F : Free slot

cluster A   cluster B
AAAAFFFF  -  FFFFAAAA

That's one of the reasons I suggested doing the batch reclaim work first
and supporting THP swap based on it.  With that, scan_swap_map can be
aware of nr_pages and select the right clusters.

With that approach, justification of THP swap support would be easier,
too.  IOW, I'm not sure how valuable THP swap support alone is in a real
workload.

Anyways, that's just my two cents.

> 
> But not all pages are huge, so normal pages swap optimization is
> necessary and good anyway.
> 
> > Anyway, I hope [1/11] should be merged regardless of the patchset because
> > I believe anyone doesn't feel comfortable with cluser_info functions. ;-)
> 
> Thanks,
> 
> Best Regards,
> Huang, Ying
> 
> [snip]


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-17  5:07     ` Minchan Kim
@ 2016-08-17 17:24       ` Tim Chen
  2016-08-18  8:39         ` Minchan Kim
  0 siblings, 1 reply; 27+ messages in thread
From: Tim Chen @ 2016-08-17 17:24 UTC (permalink / raw)
  To: Minchan Kim, Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel

On Wed, 2016-08-17 at 14:07 +0900, Minchan Kim wrote:
> On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
> > 
> >
> > > 
> > > I think Tim and me discussed about that a few weeks ago.
> > I work closely with Tim on swap optimization.  This patchset is the part
> > of our swap optimization plan.
> > 
> > > 
> > > Please search below topics.
> > > 
> > > [1] mm: Batch page reclamation under shink_page_list
> > > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
> > > 
> > > It's different with yours which focused on THP swapping while the suggestion
> > > would be more general if we can do so it's worth to try it, I think.
> > I think the general optimization above will benefit both normal pages
> > and THP at least for now.  And I think there are no hard conflict
> > between those two patchsets.
> If we could do general optimzation, I guess THP swap without splitting
> would be more straight forward.
> 
> If we can reclaim batch a certain of pages all at once, it helps we can
> do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages). The nr_pages could be
> greater or less than 512 pages. With that, scan_swap_map effectively
> search empty swap slots from scan_map or free cluser list.
> Then, needed part from your patchset is to just delay splitting of THP.
> 
> > 
> > 
> > The THP swap has more opportunity to be optimized, because we can batch
> > 512 operations together more easily.A A For full THP swap support, unmap a
> > THP could be more efficient with only one swap count operation instead
> > of 512, so do many other operations, such as add/remove from swap cache
> > with multi-order radix tree etc.A A And it will help memory fragmentation.
> > THP can be kept after swapping out/in, need not to rebuild THP via
> > khugepaged.
> It seems you increased cluster size to 512 and search a empty cluster
> for a THP swap. With that approach, I have a concern that once clusters
> will be fragmented, THP swap support doesn't take benefit at all.
> 
> Why do we need a empty cluster for swapping out 512 pages?
> IOW, below case could work for the goal.
> 
> A : Allocated slot
> F : Free slot
> 
> cluster A   cluster B
> AAAAFFFF  -  FFFFAAAA
> 
> That's one of the reason I suggested batch reclaim work first and
> support THP swap based on it. With that, scan_swap_map can be aware of nr_pages
> and selects right clusters.
> 
> With the approach, justfication of THP swap support would be easier, too.
> IOW, I'm not sure how only THP swap support is valuable in real workload.
> 
> Anyways, that's just my two cents.

Minchan,

Scanning for contiguous slots that span clusters may take quite a
long time under fragmentation, and may eventually fail.  In that case
the additional scan time overhead may go to waste and defeat the purpose
of fast swapping of large pages.

The empty cluster lookup, on the other hand, is very fast.
We treat the empty-cluster-available case as an opportunity for a fast
path swap out of the large page.  Otherwise, we'll revert to the current
slow path behavior of breaking it into normal pages, so there's no
regression, and we may get a speed up.  We can be considerably faster
when a lot of large pages are used.


> 
> > 
> > 
> > But not all pages are huge, so normal pages swap optimization is
> > necessary and good anyway.
> > 

Yes, optimizing the normal swap pages is still an important goal
for us.  THP swap optimization is a complementary component.

We have seen systems with THP spend significant CPU cycles breaking up
the pages on swap out and then compacting the pages into THP again after
swap in.  So if we can avoid this, that will be helpful.

Thanks for your valuable comments.

Tim


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-17 17:24       ` Tim Chen
@ 2016-08-18  8:39         ` Minchan Kim
  2016-08-18 17:19           ` Huang, Ying
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2016-08-18  8:39 UTC (permalink / raw)
  To: Tim Chen
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, Kirill A . Shutemov, Andrea Arcangeli, linux-mm,
	linux-kernel

Hi Tim,

On Wed, Aug 17, 2016 at 10:24:56AM -0700, Tim Chen wrote:
> On Wed, 2016-08-17 at 14:07 +0900, Minchan Kim wrote:
> > On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
> > > 
> > > 
> > > > 
> > > > I think Tim and me discussed about that a few weeks ago.
> > > I work closely with Tim on swap optimization.  This patchset is the part
> > > of our swap optimization plan.
> > > 
> > > > 
> > > > Please search below topics.
> > > > 
> > > > [1] mm: Batch page reclamation under shink_page_list
> > > > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
> > > > 
> > > > It's different with yours which focused on THP swapping while the suggestion
> > > > would be more general if we can do so it's worth to try it, I think.
> > > I think the general optimization above will benefit both normal pages
> > > and THP at least for now.  And I think there are no hard conflict
> > > between those two patchsets.
> > If we could do general optimzation, I guess THP swap without splitting
> > would be more straight forward.
> > 
> > If we can reclaim batch a certain of pages all at once, it helps we can
> > do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages). The nr_pages could be
> > greater or less than 512 pages. With that, scan_swap_map effectively
> > search empty swap slots from scan_map or free cluser list.
> > Then, needed part from your patchset is to just delay splitting of THP.
> > 
> > > 
> > > 
> > > The THP swap has more opportunity to be optimized, because we can batch
> > > 512 operations together more easily.  For full THP swap support, unmap a
> > > THP could be more efficient with only one swap count operation instead
> > > of 512, so do many other operations, such as add/remove from swap cache
> > > with multi-order radix tree etc.  And it will help memory fragmentation.
> > > THP can be kept after swapping out/in, need not to rebuild THP via
> > > khugepaged.
> > It seems you increased cluster size to 512 and search a empty cluster
> > for a THP swap. With that approach, I have a concern that once clusters
> > will be fragmented, THP swap support doesn't take benefit at all.
> > 
> > Why do we need a empty cluster for swapping out 512 pages?
> > IOW, below case could work for the goal.
> > 
> > A : Allocated slot
> > F : Free slot
> > 
> > cluster A   cluster B
> > AAAAFFFF  -  FFFFAAAA
> > 
> > That's one of the reason I suggested batch reclaim work first and
> > support THP swap based on it. With that, scan_swap_map can be aware of nr_pages
> > and selects right clusters.
> > 
> > With the approach, justfication of THP swap support would be easier, too.
> > IOW, I'm not sure how only THP swap support is valuable in real workload.
> > 
> > Anyways, that's just my two cents.
> 
> Minchan,
> 
> Scanning for contiguous slots that span clusters may take quite a
> long time under fragmentation, and may eventually fail.  In that case the addition scan
> time overhead may go to waste and defeat the purpose of fast swapping of large page.
> 
> The empty cluster lookup on the other hand is very fast.
> We treat the empty cluster available case as an opportunity for fast path
> swap out of large page.  Otherwise, we'll revert to the current
> slow path behavior of breaking into normal pages so there's no
> regression, and we may get speed up.  We can be considerably faster when a lot of large
> pages are used.  

I didn't mean we should search scan_swap_map first without peeking at a
free cluster; what I wanted was that we might abstract it into
scan_swap_map.

For example, if nr_pages is greater than the size of a cluster, we can
get an empty cluster first and take the remaining nr_pages -
sizeof(cluster) slots from another free cluster or by scanning the
current CPU's per-cpu cluster.  If we run into a used slot during
scanning, we can simply bail out.  Then, although we fail to get all
contiguous slots, we still get a certain number of contiguous slots, so
it would be a benefit from the sequential write and lock batching point
of view, at the cost of a little scanning.  And it's not specific to the
THP algorithm.
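
A rough user-space sketch of that bail-out idea, assuming a toy swap_map
where 0 means a free slot and nonzero means a used one (illustration
only, not the kernel code):

#include <stdio.h>

/* Toy swap_map: 0 = free slot, nonzero = used slot. */
static unsigned char swap_map[16] = {
	0, 0, 0, 0, 1, 0, 0, 0,
	0, 0, 0, 0, 0, 0, 0, 0,
};

/*
 * Try to grab up to 'want' contiguous free slots starting at 'start'.
 * Instead of searching the whole map, bail out at the first used slot,
 * so the caller still gets a (possibly shorter) contiguous batch.
 */
static int grab_contig(int start, int want, int *first)
{
	int n = 0;

	*first = start;
	while (n < want && start + n < (int)sizeof(swap_map) &&
	       swap_map[start + n] == 0)
		swap_map[start + n++] = 1;	/* mark the slot as used */
	return n;
}

int main(void)
{
	int first;
	int got = grab_contig(0, 8, &first);

	/* Only 4 free slots before the used slot at index 4, so got == 4. */
	printf("wanted 8 slots, got %d contiguous slots at %d\n", got, first);
	return 0;
}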

My point is that once we optimize normal page batching for swap, THP
swap support would be more straightforward.  But I should admit I didn't
look into the code in detail, so there might be a clear hurdle to
implementing it; I will rely on you guys' decision on which one is more
urgent/beneficial/better for code quality for the goal.

Thanks.

> 
> 
> > 
> > > 
> > > 
> > > But not all pages are huge, so normal pages swap optimization is
> > > necessary and good anyway.
> > > 
> 
> Yes, optimizing the normal swap pages is still an important goal
> for us.  THP swap optimization is complementary component.  
> 
> We have seen system with THP spend significant cpu cycles breaking up the
> pages on swap out and then compacting the pages for THP again after
> swap in.  So if we can avoid this, that will be helpful.
> 
> Thanks for your valuable comments.

Thanks for the good work.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-18  8:39         ` Minchan Kim
@ 2016-08-18 17:19           ` Huang, Ying
  2016-08-19  0:49             ` Minchan Kim
  0 siblings, 1 reply; 27+ messages in thread
From: Huang, Ying @ 2016-08-18 17:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tim Chen, Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen,
	andi.kleen, aaron.lu, Kirill A . Shutemov, Andrea Arcangeli,
	linux-mm, linux-kernel


Minchan Kim <minchan@kernel.org> writes:

> Hi Tim,
>
> On Wed, Aug 17, 2016 at 10:24:56AM -0700, Tim Chen wrote:
>> On Wed, 2016-08-17 at 14:07 +0900, Minchan Kim wrote:
>> > On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
>> > > 
>> > >
>> > > > 
>> > > > I think Tim and me discussed about that a few weeks ago.
>> > > I work closely with Tim on swap optimization. This patchset is the part
>> > > of our swap optimization plan.
>> > > 
>> > > > 
>> > > > Please search below topics.
>> > > > 
>> > > > [1] mm: Batch page reclamation under shink_page_list
>> > > > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
>> > > > 
>> > > > It's different with yours which focused on THP swapping while the suggestion
>> > > > would be more general if we can do so it's worth to try it, I think.
>> > > I think the general optimization above will benefit both normal pages
>> > > and THP at least for now. And I think there are no hard conflict
>> > > between those two patchsets.
>> > If we could do general optimzation, I guess THP swap without splitting
>> > would be more straight forward.
>> > 
>> > If we can reclaim batch a certain of pages all at once, it helps we can
>> > do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages). The nr_pages could be
>> > greater or less than 512 pages. With that, scan_swap_map effectively
>> > search empty swap slots from scan_map or free cluser list.
>> > Then, needed part from your patchset is to just delay splitting of THP.
>> > 
>> > > 
>> > > 
>> > > The THP swap has more opportunity to be optimized, because we can batch
>> > > 512 operations together more easily. For full THP swap support, unmap a
>> > > THP could be more efficient with only one swap count operation instead
>> > > of 512, so do many other operations, such as add/remove from swap cache
>> > > with multi-order radix tree etc. And it will help memory fragmentation.
>> > > THP can be kept after swapping out/in, need not to rebuild THP via
>> > > khugepaged.
>> > It seems you increased cluster size to 512 and search a empty cluster
>> > for a THP swap. With that approach, I have a concern that once clusters
>> > will be fragmented, THP swap support doesn't take benefit at all.
>> > 
>> > Why do we need a empty cluster for swapping out 512 pages?
>> > IOW, below case could work for the goal.
>> > 
>> > A : Allocated slot
>> > F : Free slot
>> > 
>> > cluster A cluster B
>> > AAAAFFFF - FFFFAAAA
>> > 
>> > That's one of the reason I suggested batch reclaim work first and
>> > support THP swap based on it. With that, scan_swap_map can be aware of nr_pages
>> > and selects right clusters.
>> > 
>> > With the approach, justfication of THP swap support would be easier, too.
>> > IOW, I'm not sure how only THP swap support is valuable in real workload.
>> > 
>> > Anyways, that's just my two cents.
>> 
>> Minchan,
>> 
>> Scanning for contiguous slots that span clusters may take quite a
>> long time under fragmentation, and may eventually fail. In that case the addition scan
>> time overhead may go to waste and defeat the purpose of fast swapping of large page.
>> 
>> The empty cluster lookup on the other hand is very fast.
>> We treat the empty cluster available case as an opportunity for fast path
>> swap out of large page. Otherwise, we'll revert to the current
>> slow path behavior of breaking into normal pages so there's no
>> regression, and we may get speed up. We can be considerably faster when a lot of large
>> pages are used. 
>
> I didn't mean we should search scan_swap_map firstly without peeking
> free cluster but what I wanted was we might abstract it into
> scan_swap_map.
>
> For example, if nr_pages is greather than the size of cluster, we can
> get empty cluster first and nr_pages - sizeof(cluster) for other free
> cluster or scanning of current CPU per-cpu cluster. If we cannot find
> used slot during scanning, we can bail out simply. Then, although we
> fail to get all* contiguous slots, we get a certain of contiguous slots
> so it would be benefit for seq write and lock batching point of view
> at the cost of a little scanning. And it's not specific to THP algorighm.

First, if my understanding is correct, to batch normal pages swapping
out, the swap slots need not be contiguous.  But for THP swap support,
we need contiguous swap slots.  So I think the requirements are quite
different between them.

And with the current design of the swap space management, it is quite
hard to implement allocating nr_pages contiguous free swap slots.  To
reduce the contention on sis->lock, even to scan one free swap slot, the
sis->lock is unlocked during scanning.  When we scan for nr_pages free
swap slots and there are no nr_pages contiguous free swap slots, we need
to scan from sis->lowest_bit to sis->highest_bit and record the largest
run of contiguous free swap slots.  But when we lock sis->lock again to
check, some swap slot inside the largest run we found may have been
allocated by another process.  So we may end up with a much smaller
number of swap slots, or we need to start over again.  So I think the
simpler solution is to do the following (see the sketch after this
list):

- When a whole cluster is requested (for the THP), try to allocate a
  free cluster.  Give up if there are no free clusters.

- When a small number of swap slots are requested (for normal swap
  batching), check only sis->percpu_cluster and return the next N free
  swap slots in it.  Because we only scan a very small number of swap
  slots, we can do that with sis->lock held.
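
A minimal user-space sketch of that two-path policy, assuming a toy swap
map (one free/used flag per slot), a much smaller cluster size than the
512 used in the patchset, and made-up helper names (alloc_thp_cluster(),
alloc_small_batch()) that are not kernel APIs:

#include <stdbool.h>
#include <stdio.h>

#define CLUSTER_SIZE	8	/* 512 in the patchset; small for the demo */
#define NR_CLUSTERS	4
#define NR_SLOTS	(CLUSTER_SIZE * NR_CLUSTERS)

static bool used[NR_SLOTS];	/* toy swap map: true = slot allocated */
static int next_slot;		/* stands in for the per-cpu cluster cursor */

/* THP case: take a whole free cluster, or give up (the THP gets split). */
static int alloc_thp_cluster(void)
{
	for (int c = 0; c < NR_CLUSTERS; c++) {
		int base = c * CLUSTER_SIZE, i;

		for (i = 0; i < CLUSTER_SIZE; i++)
			if (used[base + i])
				break;
		if (i < CLUSTER_SIZE)
			continue;	/* cluster not empty, try the next one */
		for (i = 0; i < CLUSTER_SIZE; i++)
			used[base + i] = true;
		return base;		/* first slot of the whole free cluster */
	}
	return -1;			/* no free cluster: caller falls back */
}

/* Normal-page batch: hand out up to n free slots from the current cursor. */
static int alloc_small_batch(int n, int *out)
{
	int got = 0;

	while (got < n && next_slot < NR_SLOTS) {
		if (!used[next_slot]) {
			used[next_slot] = true;
			out[got++] = next_slot;
		}
		next_slot++;
	}
	return got;			/* may be fewer than requested */
}

int main(void)
{
	int batch[4];
	int n = alloc_small_batch(4, batch);
	int thp = alloc_thp_cluster();

	if (n)
		printf("small batch: %d slots starting at slot %d\n", n, batch[0]);
	printf("THP request: %s\n", thp >= 0 ?
	       "got a whole free cluster" : "no free cluster, would split");
	return 0;
}

The point of the sketch is only the split of responsibilities: the huge
request either gets a whole cluster or fails fast, while small batches
never trigger a long scan.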

BTW: the sis->lock is under heavy contention after the contention on the
swap cache radix tree lock is reduced via batching, in an 8-process
sequential swap-out test.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-18 17:19           ` Huang, Ying
@ 2016-08-19  0:49             ` Minchan Kim
  2016-08-19  3:44               ` Huang, Ying
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2016-08-19  0:49 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Tim Chen, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, Kirill A . Shutemov, Andrea Arcangeli, linux-mm,
	linux-kernel

Hi Huang,

On Thu, Aug 18, 2016 at 10:19:32AM -0700, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > Hi Tim,
> >
> > On Wed, Aug 17, 2016 at 10:24:56AM -0700, Tim Chen wrote:
> >> On Wed, 2016-08-17 at 14:07 +0900, Minchan Kim wrote:
> >> > On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
> >> > > 
> >> > >
> >> > > > 
> >> > > > I think Tim and me discussed about that a few weeks ago.
> >> > > I work closely with Tim on swap optimization.  This patchset is the part
> >> > > of our swap optimization plan.
> >> > > 
> >> > > > 
> >> > > > Please search below topics.
> >> > > > 
> >> > > > [1] mm: Batch page reclamation under shink_page_list
> >> > > > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
> >> > > > 
> >> > > > It's different with yours which focused on THP swapping while the suggestion
> >> > > > would be more general if we can do so it's worth to try it, I think.
> >> > > I think the general optimization above will benefit both normal pages
> >> > > and THP at least for now.  And I think there are no hard conflict
> >> > > between those two patchsets.
> >> > If we could do general optimzation, I guess THP swap without splitting
> >> > would be more straight forward.
> >> > 
> >> > If we can reclaim batch a certain of pages all at once, it helps we can
> >> > do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages). The nr_pages could be
> >> > greater or less than 512 pages. With that, scan_swap_map effectively
> >> > search empty swap slots from scan_map or free cluser list.
> >> > Then, needed part from your patchset is to just delay splitting of THP.
> >> > 
> >> > > 
> >> > > 
> >> > > The THP swap has more opportunity to be optimized, because we can batch
> >> > > 512 operations together more easily.  For full THP swap support, unmap a
> >> > > THP could be more efficient with only one swap count operation instead
> >> > > of 512, so do many other operations, such as add/remove from swap cache
> >> > > with multi-order radix tree etc.  And it will help memory fragmentation.
> >> > > THP can be kept after swapping out/in, need not to rebuild THP via
> >> > > khugepaged.
> >> > It seems you increased cluster size to 512 and search a empty cluster
> >> > for a THP swap. With that approach, I have a concern that once clusters
> >> > will be fragmented, THP swap support doesn't take benefit at all.
> >> > 
> >> > Why do we need a empty cluster for swapping out 512 pages?
> >> > IOW, below case could work for the goal.
> >> > 
> >> > A : Allocated slot
> >> > F : Free slot
> >> > 
> >> > cluster A   cluster B
> >> > AAAAFFFF  -  FFFFAAAA
> >> > 
> >> > That's one of the reason I suggested batch reclaim work first and
> >> > support THP swap based on it. With that, scan_swap_map can be aware of nr_pages
> >> > and selects right clusters.
> >> > 
> >> > With the approach, justfication of THP swap support would be easier, too.
> >> > IOW, I'm not sure how only THP swap support is valuable in real workload.
> >> > 
> >> > Anyways, that's just my two cents.
> >> 
> >> Minchan,
> >> 
> >> Scanning for contiguous slots that span clusters may take quite a
> >> long time under fragmentation, and may eventually fail. In that case the addition scan
> >> time overhead may go to waste and defeat the purpose of fast swapping of large page.
> >> 
> >> The empty cluster lookup on the other hand is very fast.
> >> We treat the empty cluster available case as an opportunity for fast path
> >> swap out of large page. Otherwise, we'll revert to the current
> >> slow path behavior of breaking into normal pages so there's no
> >> regression, and we may get speed up. We can be considerably faster when a lot of large
> >> pages are used. 
> >
> > I didn't mean we should search scan_swap_map firstly without peeking
> > free cluster but what I wanted was we might abstract it into
> > scan_swap_map.
> >
> > For example, if nr_pages is greather than the size of cluster, we can
> > get empty cluster first and nr_pages - sizeof(cluster) for other free
> > cluster or scanning of current CPU per-cpu cluster. If we cannot find
> > used slot during scanning, we can bail out simply. Then, although we
> > fail to get all* contiguous slots, we get a certain of contiguous slots
> > so it would be benefit for seq write and lock batching point of view
> > at the cost of a little scanning. And it's not specific to THP algorighm.
> 
> Firstly, if my understanding were correct, to batch the normal pages
> swapping out, the swap slots need not to be continuous.  But for the THP
> swap support, we need the continuous swap slots.  So I think the
> requirements are quite different between them.

Hmm, I don't understand.

Let's think about it from the swap slot management layer's point of
view.  It doesn't need to care whether a batch request is caused by a
THP page or by multiple normal pages.

The matter is just that the VM now asks for multiple swap slots for
several LRU-order pages, so swap slot management tries to allocate
several slots under one lock.  Sure, it would be great if the slots were
fully consecutive, because that ideally means a fast, big sequential
write as well as readahead.  However, it would still be better even if
we didn't get consecutive slots, because we get multiple slots all at
once by batching.

It's not a THP-specific requirement, I think.  Currently,
SWAP_CLUSTER_MAX might be too small to get a benefit from normal page
batching, but it could be changed later once we implement the batching
logic nicely.

> 
> And with the current design of the swap space management, it is quite
> hard to implement allocating nr_pages continuous free swap slots.  To
> reduce the contention of sis->lock, even to scan one free swap slot, the
> sis->lock is unlocked during scanning.  When we scan nr_pages free swap
> slots, and there are no nr_pages continuous free swap slots, we need to
> scan from sis->lowest_bit to sis->highest_bit, and record the largest
> continuous free swap slots.  But when we lock sis->lock again to check,
> some swap slot inside the largest continuous free swap slots we found
> may be allocated by other processes.  So we may end up with a much
> smaller number of swap slots or we need to startover again.  So I think
> the simpler solution is to
> 
> - When a whole cluster is requested (for the THP), try to allocate a
>   free cluster.  Give up if there are no free clusters.

One thing I'm afraid of is that it would consume free clusters very fast
if the adjacent pages around a faulted one don't have the same hotness/
lifetime.  Once that happens, we can't get the benefit any more.
IMO, it's too conservative and might be worse from the fragmentation
point of view.

We can go further by scanning the per-cpu and/or global swap_map.
If that causes a lock contention problem, we can release the lock
periodically (e.g., every SWAP_CLUSTER_MAX).  I don't mean we must
search 512 consecutive slots in swap_map, but just bail out once we meet
a hole during scanning.

> 
> - When a small number of swap slots are requested (for normal swap
>   batching), check only sis->percpu_cluster and return next N free swap
>   slots in it.  Because we only scan very small number of swap slots, we
>   can do that with sis->lock held.

Agreed.

If the batch request size is greater than the cluster size, we can use a
free cluster first and take the remaining (batch size - cluster size)
slots elsewhere.  If the batch request size is less than a cluster, we
can use the per-cpu cluster.  If both fail, we can try to get
consecutive slots via scanning of swap_map, and bail out easily if we
encounter a hole.  As for the lock contention, we might need to release
the lock periodically.
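
Put together, the decision order being suggested here might look roughly
like the sketch below (illustration only; in the kernel all of this
would happen under sis->lock, released periodically):

#include <stdio.h>

enum batch_strategy {
	FREE_CLUSTER_PLUS_REST,	/* batch >= cluster size, free cluster exists */
	PERCPU_CLUSTER,		/* small batch served from the per-cpu cluster */
	SCAN_WITH_BAILOUT,	/* fall back: scan swap_map, stop at first hole */
	GIVE_UP,
};

static enum batch_strategy pick_strategy(int nr, int cluster_size,
					 int have_free_cluster,
					 int percpu_has_room)
{
	if (nr >= cluster_size && have_free_cluster)
		return FREE_CLUSTER_PLUS_REST;
	if (percpu_has_room)
		return PERCPU_CLUSTER;
	if (nr > 0)
		return SCAN_WITH_BAILOUT;
	return GIVE_UP;
}

int main(void)
{
	/* A 512-page (THP-sized) request while a free cluster is available. */
	printf("THP-sized request -> %d\n", pick_strategy(512, 512, 1, 1));
	/* A 32-page normal batch with no free cluster left. */
	printf("small batch       -> %d\n", pick_strategy(32, 512, 0, 1));
	return 0;
}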

Thanks.

> 
> BTW: The sis->lock is under heavy contention after the lock contention of
> swap cache radix tree lock is reduced via batching in 8 processes
> sequential swapping out test.
> 
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-19  0:49             ` Minchan Kim
@ 2016-08-19  3:44               ` Huang, Ying
  2016-08-19  6:44                 ` Minchan Kim
  0 siblings, 1 reply; 27+ messages in thread
From: Huang, Ying @ 2016-08-19  3:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Tim Chen, Andrew Morton, tim.c.chen, dave.hansen,
	andi.kleen, aaron.lu, Kirill A . Shutemov, Andrea Arcangeli,
	linux-mm, linux-kernel

Minchan Kim <minchan@kernel.org> writes:

> Hi Huang,
>
> On Thu, Aug 18, 2016 at 10:19:32AM -0700, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > Hi Tim,
>> >
>> > On Wed, Aug 17, 2016 at 10:24:56AM -0700, Tim Chen wrote:
>> >> On Wed, 2016-08-17 at 14:07 +0900, Minchan Kim wrote:
>> >> > On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
>> >> > > 
>> >> > >
>> >> > > > 
>> >> > > > I think Tim and me discussed about that a few weeks ago.
> >> >> > > I work closely with Tim on swap optimization.  This patchset is the part
>> >> > > of our swap optimization plan.
>> >> > > 
>> >> > > > 
>> >> > > > Please search below topics.
>> >> > > > 
>> >> > > > [1] mm: Batch page reclamation under shink_page_list
>> >> > > > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
>> >> > > > 
>> >> > > > It's different with yours which focused on THP swapping while the suggestion
>> >> > > > would be more general if we can do so it's worth to try it, I think.
>> >> > > I think the general optimization above will benefit both normal pages
> >> >> > > and THP at least for now.  And I think there are no hard conflict
>> >> > > between those two patchsets.
>> >> > If we could do general optimzation, I guess THP swap without splitting
>> >> > would be more straight forward.
>> >> > 
>> >> > If we can reclaim batch a certain of pages all at once, it helps we can
>> >> > do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages). The nr_pages could be
>> >> > greater or less than 512 pages. With that, scan_swap_map effectively
>> >> > search empty swap slots from scan_map or free cluser list.
>> >> > Then, needed part from your patchset is to just delay splitting of THP.
>> >> > 
>> >> > > 
>> >> > > 
>> >> > > The THP swap has more opportunity to be optimized, because we can batch
>> >> > > 512 operations together more easily.?For full THP swap support, unmap a
>> >> > > THP could be more efficient with only one swap count operation instead
>> >> > > of 512, so do many other operations, such as add/remove from swap cache
>> >> > > with multi-order radix tree etc.?And it will help memory fragmentation.
>> >> > > THP can be kept after swapping out/in, need not to rebuild THP via
>> >> > > khugepaged.
>> >> > It seems you increased cluster size to 512 and search a empty cluster
>> >> > for a THP swap. With that approach, I have a concern that once clusters
>> >> > will be fragmented, THP swap support doesn't take benefit at all.
>> >> > 
>> >> > Why do we need a empty cluster for swapping out 512 pages?
>> >> > IOW, below case could work for the goal.
>> >> > 
>> >> > A : Allocated slot
>> >> > F : Free slot
>> >> > 
>> >> > cluster A?cluster B
>> >> > AAAAFFFF?-?FFFFAAAA
>> >> > 
>> >> > That's one of the reason I suggested batch reclaim work first and
>> >> > support THP swap based on it. With that, scan_swap_map can be aware of nr_pages
>> >> > and selects right clusters.
>> >> > 
>> >> > With the approach, justfication of THP swap support would be easier, too.
>> >> > IOW, I'm not sure how only THP swap support is valuable in real workload.
>> >> > 
>> >> > Anyways, that's just my two cents.
>> >> 
>> >> Minchan,
>> >> 
>> >> Scanning for contiguous slots that span clusters may take quite a
>> >> long time under fragmentation, and may eventually fail. In that case the addition scan
>> >> time overhead may go to waste and defeat the purpose of fast swapping of large page.
>> >> 
>> >> The empty cluster lookup on the other hand is very fast.
>> >> We treat the empty cluster available case as an opportunity for fast path
>> >> swap out of large page. Otherwise, we'll revert to the current
>> >> slow path behavior of breaking into normal pages so there's no
>> >> regression, and we may get speed up. We can be considerably faster when a lot of large
>> >> pages are used. 
>> >
>> > I didn't mean we should search scan_swap_map firstly without peeking
>> > free cluster but what I wanted was we might abstract it into
>> > scan_swap_map.
>> >
>> > For example, if nr_pages is greather than the size of cluster, we can
>> > get empty cluster first and nr_pages - sizeof(cluster) for other free
>> > cluster or scanning of current CPU per-cpu cluster. If we cannot find
>> > used slot during scanning, we can bail out simply. Then, although we
>> > fail to get all* contiguous slots, we get a certain of contiguous slots
>> > so it would be benefit for seq write and lock batching point of view
>> > at the cost of a little scanning. And it's not specific to THP algorighm.
>> 
>> Firstly, if my understanding were correct, to batch the normal pages
>> swapping out, the swap slots need not to be continuous.  But for the THP
>> swap support, we need the continuous swap slots.  So I think the
>> requirements are quite different between them.
>
> Hmm, I don't understand.
>
> Let's think about swap slot management layer point of view.
> It doesn't need to take care of that a amount of batch request is caused
> by a thp page or multiple normal pages.
>
> A matter is just that VM now asks multiple swap slots for seveal LRU-order
> pages so swap slot management tries to allocate several slots in a lock.
> Sure, it would be great if slots are consecutive fully because it means
> it's fast big sequential write as well as readahead together ideally.
> However, it would be better even if we didn't get consecutive slots because
> we get muliple slots all at once by batch.
>
> It's not a THP specific requirement, I think.
> Currenlty, SWAP_CLUSTER_MAX might be too small to get a benefit by
> normal page batch but it could be changed later once we implement batching
> logic nicely.

Whether the slots must be consecutive or not may influence the performance
of the swap slot allocation function greatly.  For example, suppose there
are some non-consecutive free swap slots at the beginning of the swap space
and some consecutive free swap slots at the end.  If consecutive swap slots
are needed, the function may have to scan from the beginning to the end; if
non-consecutive swap slots are acceptable, it can simply return the slots
at the beginning of the swap space.
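
To make the cost difference concrete, here is an illustrative-only pair of
helpers; they are not kernel code and simplify swap_map so that 0 means a
free slot.

#include <stddef.h>

/* First free slot in [lowest, highest] -- cheap when free slots sit early. */
static long toy_first_free(const unsigned char *map, size_t lowest, size_t highest)
{
	for (size_t i = lowest; i <= highest; i++)
		if (map[i] == 0)
			return (long)i;
	return -1;
}

/* Start of the first run of `want` consecutive free slots -- may scan it all. */
static long toy_consecutive_run(const unsigned char *map, size_t lowest,
				size_t highest, size_t want)
{
	size_t run = 0;

	for (size_t i = lowest; i <= highest; i++) {
		run = (map[i] == 0) ? run + 1 : 0;
		if (run == want)
			return (long)(i - want + 1);
	}
	return -1;
}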

>> And with the current design of the swap space management, it is quite
>> hard to implement allocating nr_pages continuous free swap slots.  To
>> reduce the contention of sis->lock, even to scan one free swap slot, the
>> sis->lock is unlocked during scanning.  When we scan nr_pages free swap
>> slots, and there are no nr_pages continuous free swap slots, we need to
>> scan from sis->lowest_bit to sis->highest_bit, and record the largest
>> continuous free swap slots.  But when we lock sis->lock again to check,
>> some swap slot inside the largest continuous free swap slots we found
>> may be allocated by other processes.  So we may end up with a much
>> smaller number of swap slots or we need to startover again.  So I think
>> the simpler solution is to
>> 
>> - When a whole cluster is requested (for the THP), try to allocate a
>>   free cluster.  Give up if there are no free clusters.
>
> One thing I'm afraid that it would consume free clusters very fast
> if adjacent pages around a faulted one doesn't have same hottness/
> lifetime. Once it happens, we can't get benefit any more.
> IMO, it's too conservative and might be worse for the fragment point
> of view.

It is possible.  But I think we should start from the simple solution
first instead of jumping to the perfect solution directly, especially
when the simple solution is a subset of the perfect solution.
Do you agree?

There are some other difficulties with not using a whole swap cluster to
hold a swapped-out THP for full THP swap support (without splitting).

A THP can be mapped by both PMDs and PTEs.  After the THP is swapped out,
there may be swap entries in both PMDs and PTEs too.  If a non-head PTE is
accessed, how do we know where the first swap slot of the THP is, so that
we can swap in the whole THP?

We can have a flag in cluster_info->flags to mark whether the swap
cluster backs a THP.  Then swap-in readahead can either avoid reading
ahead into the THP, or read ahead the whole THP instead of just several
of its sub-pages.

And if we use one swap cluster for each THP, we can use cluster_info->data
to hold the compound map count.  That is very convenient.
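
A rough sketch of that layout, using invented stand-in types rather than the
real struct swap_cluster_info (the CLUSTER_FLAG_HUGE name and the field layout
are assumptions made for the example), could be the following.  It also shows
how the head slot of the THP can be recovered from any sub-page's swap offset,
assuming a THP always occupies one aligned 512-slot cluster.

#include <stdbool.h>

#define SWAPFILE_CLUSTER   512		/* slots per cluster, as in patch 02/11 */
#define CLUSTER_FLAG_HUGE  0x4		/* hypothetical "backs a THP" flag */

struct toy_cluster_info {
	unsigned int data;		/* e.g. compound map count for a THP */
	unsigned int flags;
};

/* Does the cluster holding this swap offset back a THP? */
static bool toy_cluster_is_huge(const struct toy_cluster_info *ci,
				unsigned long offset)
{
	return ci[offset / SWAPFILE_CLUSTER].flags & CLUSTER_FLAG_HUGE;
}

/* First swap slot of the THP that a non-head slot belongs to. */
static unsigned long toy_huge_head_offset(unsigned long offset)
{
	return offset & ~((unsigned long)SWAPFILE_CLUSTER - 1);
}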

Best Regards,
Huang, Ying

> We can do further through scanning of per_cpu and/or global swap_map.
> If it makes lock contention problem, we can release the lock peridically
> (e.g., every SWAP_CLUSTER_MAX). I don't mean we must search 512 consecutive
> slots in swap_map but just bail out once we meet a hole during scanning.
>
>> 
>> - When a small number of swap slots are requested (for normal swap
>>   batching), check only sis->percpu_cluster and return next N free swap
>>   slots in it.  Because we only scan very small number of swap slots, we
>>   can do that with sis->lock held.
>
> Agreed.
>
> If the batch request size is greater than cluster size, we can use
> a free cluster for (batch size - free cluster slots).
> If the batch request size is less than cluster, we can use pcp
> cluster.
> If we fails both, we can try to consecutive slots via scanning
> of swap_map and let's bail out easily if we encounter the hole.
> About the lock contention, we might need release periodically.
>
> Thanks.
>
>> 
>> BTW: The sis->lock is under heavy contention after the lock contention of
>> swap cache radix tree lock is reduced via batching in 8 processes
>> sequential swapping out test.
>> 
>> Best Regards,
>> Huang, Ying


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-19  3:44               ` Huang, Ying
@ 2016-08-19  6:44                 ` Minchan Kim
  2016-08-19  6:47                   ` Minchan Kim
  2016-08-19 23:43                   ` Huang, Ying
  0 siblings, 2 replies; 27+ messages in thread
From: Minchan Kim @ 2016-08-19  6:44 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Tim Chen, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, Kirill A . Shutemov, Andrea Arcangeli, linux-mm,
	linux-kernel

On Thu, Aug 18, 2016 at 08:44:13PM -0700, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > Hi Huang,
> >
> > On Thu, Aug 18, 2016 at 10:19:32AM -0700, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > Hi Tim,
> >> >
> >> > On Wed, Aug 17, 2016 at 10:24:56AM -0700, Tim Chen wrote:
> >> >> On Wed, 2016-08-17 at 14:07 +0900, Minchan Kim wrote:
> >> >> > On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
> >> >> > > 
> >> >> > >
> >> >> > > > 
> >> >> > > > I think Tim and me discussed about that a few weeks ago.
> >> >> > > I work closely with Tim on swap optimization.?This patchset is the part
> >> >> > > of our swap optimization plan.
> >> >> > > 
> >> >> > > > 
> >> >> > > > Please search below topics.
> >> >> > > > 
> >> >> > > > [1] mm: Batch page reclamation under shink_page_list
> >> >> > > > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
> >> >> > > > 
> >> >> > > > It's different with yours which focused on THP swapping while the suggestion
> >> >> > > > would be more general if we can do so it's worth to try it, I think.
> >> >> > > I think the general optimization above will benefit both normal pages
> >> >> > > and THP at least for now.?And I think there are no hard conflict
> >> >> > > between those two patchsets.
> >> >> > If we could do general optimzation, I guess THP swap without splitting
> >> >> > would be more straight forward.
> >> >> > 
> >> >> > If we can reclaim batch a certain of pages all at once, it helps we can
> >> >> > do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages). The nr_pages could be
> >> >> > greater or less than 512 pages. With that, scan_swap_map effectively
> >> >> > search empty swap slots from scan_map or free cluser list.
> >> >> > Then, needed part from your patchset is to just delay splitting of THP.
> >> >> > 
> >> >> > > 
> >> >> > > 
> >> >> > > The THP swap has more opportunity to be optimized, because we can batch
> >> >> > > 512 operations together more easily.?For full THP swap support, unmap a
> >> >> > > THP could be more efficient with only one swap count operation instead
> >> >> > > of 512, so do many other operations, such as add/remove from swap cache
> >> >> > > with multi-order radix tree etc.?And it will help memory fragmentation.
> >> >> > > THP can be kept after swapping out/in, need not to rebuild THP via
> >> >> > > khugepaged.
> >> >> > It seems you increased cluster size to 512 and search a empty cluster
> >> >> > for a THP swap. With that approach, I have a concern that once clusters
> >> >> > will be fragmented, THP swap support doesn't take benefit at all.
> >> >> > 
> >> >> > Why do we need a empty cluster for swapping out 512 pages?
> >> >> > IOW, below case could work for the goal.
> >> >> > 
> >> >> > A : Allocated slot
> >> >> > F : Free slot
> >> >> > 
> >> >> > cluster A?cluster B
> >> >> > AAAAFFFF?-?FFFFAAAA
> >> >> > 
> >> >> > That's one of the reason I suggested batch reclaim work first and
> >> >> > support THP swap based on it. With that, scan_swap_map can be aware of nr_pages
> >> >> > and selects right clusters.
> >> >> > 
> >> >> > With the approach, justfication of THP swap support would be easier, too.
> >> >> > IOW, I'm not sure how only THP swap support is valuable in real workload.
> >> >> > 
> >> >> > Anyways, that's just my two cents.
> >> >> 
> >> >> Minchan,
> >> >> 
> >> >> Scanning for contiguous slots that span clusters may take quite a
> >> >> long time under fragmentation, and may eventually fail. In that case the addition scan
> >> >> time overhead may go to waste and defeat the purpose of fast swapping of large page.
> >> >> 
> >> >> The empty cluster lookup on the other hand is very fast.
> >> >> We treat the empty cluster available case as an opportunity for fast path
> >> >> swap out of large page. Otherwise, we'll revert to the current
> >> >> slow path behavior of breaking into normal pages so there's no
> >> >> regression, and we may get speed up. We can be considerably faster when a lot of large
> >> >> pages are used. 
> >> >
> >> > I didn't mean we should search scan_swap_map firstly without peeking
> >> > free cluster but what I wanted was we might abstract it into
> >> > scan_swap_map.
> >> >
> >> > For example, if nr_pages is greather than the size of cluster, we can
> >> > get empty cluster first and nr_pages - sizeof(cluster) for other free
> >> > cluster or scanning of current CPU per-cpu cluster. If we cannot find
> >> > used slot during scanning, we can bail out simply. Then, although we
> >> > fail to get all* contiguous slots, we get a certain of contiguous slots
> >> > so it would be benefit for seq write and lock batching point of view
> >> > at the cost of a little scanning. And it's not specific to THP algorighm.
> >> 
> >> Firstly, if my understanding were correct, to batch the normal pages
> >> swapping out, the swap slots need not to be continuous.  But for the THP
> >> swap support, we need the continuous swap slots.  So I think the
> >> requirements are quite different between them.
> >
> > Hmm, I don't understand.
> >
> > Let's think about swap slot management layer point of view.
> > It doesn't need to take care of that a amount of batch request is caused
> > by a thp page or multiple normal pages.
> >
> > A matter is just that VM now asks multiple swap slots for seveal LRU-order
> > pages so swap slot management tries to allocate several slots in a lock.
> > Sure, it would be great if slots are consecutive fully because it means
> > it's fast big sequential write as well as readahead together ideally.
> > However, it would be better even if we didn't get consecutive slots because
> > we get muliple slots all at once by batch.
> >
> > It's not a THP specific requirement, I think.
> > Currenlty, SWAP_CLUSTER_MAX might be too small to get a benefit by
> > normal page batch but it could be changed later once we implement batching
> > logic nicely.
> 
> Consecutive or not may influence the performance of the swap slots
> allocation function greatly.  For example, there is some non-consecutive
> swap slots at the begin of the swap space, and some consecutive swap
> slots at the end of the swap space.  If the consecutive swap slots are
> needed, the function may need to scan from the begin to the end.  If
> non-consecutive swap slots are required, just return the swap slots at
> the begin of the swap space.

Don't get me wrong. I never said that consecutive swap slot allocation is
unimportant, or that we should scan swap_map fully to search for
consecutive swap slots.

For both multiple-normal-page swap and THP swap, consecutive swap slot
allocation is important, so it is the same requirement; that's why I want
to abstract it independently of THP swap.

> 
> >> And with the current design of the swap space management, it is quite
> >> hard to implement allocating nr_pages continuous free swap slots.  To
> >> reduce the contention of sis->lock, even to scan one free swap slot, the
> >> sis->lock is unlocked during scanning.  When we scan nr_pages free swap
> >> slots, and there are no nr_pages continuous free swap slots, we need to
> >> scan from sis->lowest_bit to sis->highest_bit, and record the largest
> >> continuous free swap slots.  But when we lock sis->lock again to check,
> >> some swap slot inside the largest continuous free swap slots we found
> >> may be allocated by other processes.  So we may end up with a much
> >> smaller number of swap slots or we need to startover again.  So I think
> >> the simpler solution is to
> >> 
> >> - When a whole cluster is requested (for the THP), try to allocate a
> >>   free cluster.  Give up if there are no free clusters.
> >
> > One thing I'm afraid that it would consume free clusters very fast
> > if adjacent pages around a faulted one doesn't have same hottness/
> > lifetime. Once it happens, we can't get benefit any more.
> > IMO, it's too conservative and might be worse for the fragment point
> > of view.
> 
> It is possible.  But I think we should start from the simple solution
> firstly.  Instead of jumping to the perfect solution directly.
> Especially when the simple solution is a subset of the perfect solution.
> Do you agree?

If the simple solution works well and can be shown to be no worse than the
current behavior, I agree. But my concern is that it would consume free
clusters so fast that it could badly affect other workloads.

> 
> There are some other difficulties not to use the swap cluster to hold
> the THP swapped out for the full THP swap support (without splitting).
> 
> The THP could be mapped in both PMD and PTE.  After the THP is swapped
> out.  There may be swap entry in PMD and PTE too.  If a non-head PTE is
> accessed, how do we know where is the first swap slot for the THP, so
> that we can swap in the whole THP?

You mean you want to swap in the 2M page all at once? Hmm, I'm not sure
that's a good idea. We don't have any evidence that the 512 pages share
time locality. They were just in LRU order due to the split implementation,
not because of time locality. The one thing we can bet on is that none of
the processes sharing the THP has touched any sub-page of the 512 pages, so
it's really *cold*. For such a cold 512-page swap-in, I am really not sure.

> 
> We can have a flag in cluster_info->flag to mark whether the swap
> cluster backing a THP.  So swap in readahead can avoid to read ahead the
> THP, or it can read ahead the whole THP instead of just several
> sub-pages of the THP.
> 
> And if we use one swap cluster for each THP, we can use cluster_info->data
> to hold compound map number.  That is very convenient.

Huang,

If you don't think my points are valid enough, just continue your work
regardless of my comments. I don't want to waste your time if it really
helps your workload. And I will defer the decision to other MM people.

What I just wanted was to make swap batching work for normal pages first
and then support THP swap on top of it, because normal-page batching would
be a more general optimization for us and I thought it would make your
work simpler.

Thanks.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-19  6:44                 ` Minchan Kim
@ 2016-08-19  6:47                   ` Minchan Kim
  2016-08-19 23:43                   ` Huang, Ying
  1 sibling, 0 replies; 27+ messages in thread
From: Minchan Kim @ 2016-08-19  6:47 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Tim Chen, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, Kirill A . Shutemov, Andrea Arcangeli, linux-mm,
	linux-kernel

On Fri, Aug 19, 2016 at 03:44:26PM +0900, Minchan Kim wrote:
> On Thu, Aug 18, 2016 at 08:44:13PM -0700, Huang, Ying wrote:
> > Minchan Kim <minchan@kernel.org> writes:
> > 
> > > Hi Huang,
> > >
> > > On Thu, Aug 18, 2016 at 10:19:32AM -0700, Huang, Ying wrote:
> > >> Minchan Kim <minchan@kernel.org> writes:
> > >> 
> > >> > Hi Tim,
> > >> >
> > >> > On Wed, Aug 17, 2016 at 10:24:56AM -0700, Tim Chen wrote:
> > >> >> On Wed, 2016-08-17 at 14:07 +0900, Minchan Kim wrote:
> > >> >> > On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
> > >> >> > > 
> > >> >> > >
> > >> >> > > > 
> > >> >> > > > I think Tim and me discussed about that a few weeks ago.
> > >> >> > > I work closely with Tim on swap optimization.?This patchset is the part
> > >> >> > > of our swap optimization plan.
> > >> >> > > 
> > >> >> > > > 
> > >> >> > > > Please search below topics.
> > >> >> > > > 
> > >> >> > > > [1] mm: Batch page reclamation under shink_page_list
> > >> >> > > > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
> > >> >> > > > 
> > >> >> > > > It's different with yours which focused on THP swapping while the suggestion
> > >> >> > > > would be more general if we can do so it's worth to try it, I think.
> > >> >> > > I think the general optimization above will benefit both normal pages
> > >> >> > > and THP at least for now.?And I think there are no hard conflict
> > >> >> > > between those two patchsets.
> > >> >> > If we could do general optimzation, I guess THP swap without splitting
> > >> >> > would be more straight forward.
> > >> >> > 
> > >> >> > If we can reclaim batch a certain of pages all at once, it helps we can
> > >> >> > do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages). The nr_pages could be
> > >> >> > greater or less than 512 pages. With that, scan_swap_map effectively
> > >> >> > search empty swap slots from scan_map or free cluser list.
> > >> >> > Then, needed part from your patchset is to just delay splitting of THP.
> > >> >> > 
> > >> >> > > 
> > >> >> > > 
> > >> >> > > The THP swap has more opportunity to be optimized, because we can batch
> > >> >> > > 512 operations together more easily.?For full THP swap support, unmap a
> > >> >> > > THP could be more efficient with only one swap count operation instead
> > >> >> > > of 512, so do many other operations, such as add/remove from swap cache
> > >> >> > > with multi-order radix tree etc.?And it will help memory fragmentation.
> > >> >> > > THP can be kept after swapping out/in, need not to rebuild THP via
> > >> >> > > khugepaged.
> > >> >> > It seems you increased cluster size to 512 and search a empty cluster
> > >> >> > for a THP swap. With that approach, I have a concern that once clusters
> > >> >> > will be fragmented, THP swap support doesn't take benefit at all.
> > >> >> > 
> > >> >> > Why do we need a empty cluster for swapping out 512 pages?
> > >> >> > IOW, below case could work for the goal.
> > >> >> > 
> > >> >> > A : Allocated slot
> > >> >> > F : Free slot
> > >> >> > 
> > >> >> > cluster A?cluster B
> > >> >> > AAAAFFFF?-?FFFFAAAA
> > >> >> > 
> > >> >> > That's one of the reason I suggested batch reclaim work first and
> > >> >> > support THP swap based on it. With that, scan_swap_map can be aware of nr_pages
> > >> >> > and selects right clusters.
> > >> >> > 
> > >> >> > With the approach, justfication of THP swap support would be easier, too.
> > >> >> > IOW, I'm not sure how only THP swap support is valuable in real workload.
> > >> >> > 
> > >> >> > Anyways, that's just my two cents.
> > >> >> 
> > >> >> Minchan,
> > >> >> 
> > >> >> Scanning for contiguous slots that span clusters may take quite a
> > >> >> long time under fragmentation, and may eventually fail. In that case the addition scan
> > >> >> time overhead may go to waste and defeat the purpose of fast swapping of large page.
> > >> >> 
> > >> >> The empty cluster lookup on the other hand is very fast.
> > >> >> We treat the empty cluster available case as an opportunity for fast path
> > >> >> swap out of large page. Otherwise, we'll revert to the current
> > >> >> slow path behavior of breaking into normal pages so there's no
> > >> >> regression, and we may get speed up. We can be considerably faster when a lot of large
> > >> >> pages are used. 
> > >> >
> > >> > I didn't mean we should search scan_swap_map firstly without peeking
> > >> > free cluster but what I wanted was we might abstract it into
> > >> > scan_swap_map.
> > >> >
> > >> > For example, if nr_pages is greather than the size of cluster, we can
> > >> > get empty cluster first and nr_pages - sizeof(cluster) for other free
> > >> > cluster or scanning of current CPU per-cpu cluster. If we cannot find
> > >> > used slot during scanning, we can bail out simply. Then, although we
> > >> > fail to get all* contiguous slots, we get a certain of contiguous slots
> > >> > so it would be benefit for seq write and lock batching point of view
> > >> > at the cost of a little scanning. And it's not specific to THP algorighm.
> > >> 
> > >> Firstly, if my understanding were correct, to batch the normal pages
> > >> swapping out, the swap slots need not to be continuous.  But for the THP
> > >> swap support, we need the continuous swap slots.  So I think the
> > >> requirements are quite different between them.
> > >
> > > Hmm, I don't understand.
> > >
> > > Let's think about swap slot management layer point of view.
> > > It doesn't need to take care of that a amount of batch request is caused
> > > by a thp page or multiple normal pages.
> > >
> > > A matter is just that VM now asks multiple swap slots for seveal LRU-order
> > > pages so swap slot management tries to allocate several slots in a lock.
> > > Sure, it would be great if slots are consecutive fully because it means
> > > it's fast big sequential write as well as readahead together ideally.
> > > However, it would be better even if we didn't get consecutive slots because
> > > we get muliple slots all at once by batch.
> > >
> > > It's not a THP specific requirement, I think.
> > > Currenlty, SWAP_CLUSTER_MAX might be too small to get a benefit by
> > > normal page batch but it could be changed later once we implement batching
> > > logic nicely.
> > 
> > Consecutive or not may influence the performance of the swap slots
> > allocation function greatly.  For example, there is some non-consecutive
> > swap slots at the begin of the swap space, and some consecutive swap
> > slots at the end of the swap space.  If the consecutive swap slots are
> > needed, the function may need to scan from the begin to the end.  If
> > non-consecutive swap slots are required, just return the swap slots at
> > the begin of the swap space.
> 
> Don't get me wrong. I never said consecutive swap slot allocation is
> not important and should scan swap_map fully for searching consecutive
> swap slot.
> 
> Both multiple normal page swap and a THP swap, consecutive swap slot
> allocation is important so that it's a same requirement so I want to
> abstract it regardless of THP swap.
> 
> > 
> > >> And with the current design of the swap space management, it is quite
> > >> hard to implement allocating nr_pages continuous free swap slots.  To
> > >> reduce the contention of sis->lock, even to scan one free swap slot, the
> > >> sis->lock is unlocked during scanning.  When we scan nr_pages free swap
> > >> slots, and there are no nr_pages continuous free swap slots, we need to
> > >> scan from sis->lowest_bit to sis->highest_bit, and record the largest
> > >> continuous free swap slots.  But when we lock sis->lock again to check,
> > >> some swap slot inside the largest continuous free swap slots we found
> > >> may be allocated by other processes.  So we may end up with a much
> > >> smaller number of swap slots or we need to startover again.  So I think
> > >> the simpler solution is to
> > >> 
> > >> - When a whole cluster is requested (for the THP), try to allocate a
> > >>   free cluster.  Give up if there are no free clusters.
> > >
> > > One thing I'm afraid that it would consume free clusters very fast
> > > if adjacent pages around a faulted one doesn't have same hottness/
> > > lifetime. Once it happens, we can't get benefit any more.
> > > IMO, it's too conservative and might be worse for the fragment point
> > > of view.
> > 
> > It is possible.  But I think we should start from the simple solution
> > firstly.  Instead of jumping to the perfect solution directly.
> > Especially when the simple solution is a subset of the perfect solution.
> > Do you agree?
> 
> If simple solution works well and is hard to prove it's not bad than as-is,
> I agree. But my concern is about that it would consume free clusters so fast
> that it can affect badly for other workload.
> 
> > 
> > There are some other difficulties not to use the swap cluster to hold
> > the THP swapped out for the full THP swap support (without splitting).
> > 
> > The THP could be mapped in both PMD and PTE.  After the THP is swapped
> > out.  There may be swap entry in PMD and PTE too.  If a non-head PTE is
> > accessed, how do we know where is the first swap slot for the THP, so
> > that we can swap in the whole THP?
> 
> You mean you want to swapin 2M pages all at once? Hmm, I'm not sure
> it's a good idea. We don't have any evidence 512 pages have same time
> locality. They were just LRU in-order due to split implementation,
> not time locality. A thing we can bet is any processes sharing the THP
> doesn't touch a subpage in 512 pages so it's really *cold*.
> For such cold 512 page swap-in, I am really not sure.
> 
> > 
> > We can have a flag in cluster_info->flag to mark whether the swap
> > cluster backing a THP.  So swap in readahead can avoid to read ahead the
> > THP, or it can read ahead the whole THP instead of just several
> > sub-pages of the THP.
> > 
> > And if we use one swap cluster for each THP, we can use cluster_info->data
> > to hold compound map number.  That is very convenient.
> 
> Huang,
> 
> If you think my points are enough valid, just continue your work

  If you don't think

Oops, typo.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-19  6:44                 ` Minchan Kim
  2016-08-19  6:47                   ` Minchan Kim
@ 2016-08-19 23:43                   ` Huang, Ying
  1 sibling, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-19 23:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Tim Chen, Andrew Morton, tim.c.chen, dave.hansen,
	andi.kleen, aaron.lu, Kirill A . Shutemov, Andrea Arcangeli,
	linux-mm, linux-kernel

Minchan Kim <minchan@kernel.org> writes:

> On Thu, Aug 18, 2016 at 08:44:13PM -0700, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > Hi Huang,
>> >
>> > On Thu, Aug 18, 2016 at 10:19:32AM -0700, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >> 
>> >> > Hi Tim,
>> >> >
>> >> > On Wed, Aug 17, 2016 at 10:24:56AM -0700, Tim Chen wrote:
>> >> >> On Wed, 2016-08-17 at 14:07 +0900, Minchan Kim wrote:
>> >> >> > On Tue, Aug 16, 2016 at 07:06:00PM -0700, Huang, Ying wrote:
>> >> >> > > 
>> >> >> > >
>> >> >> > > > 
>> >> >> > > > I think Tim and me discussed about that a few weeks ago.
>> >> >> > > I work closely with Tim on swap optimization.?This patchset is the part
>> >> >> > > of our swap optimization plan.
>> >> >> > > 
>> >> >> > > > 
>> >> >> > > > Please search below topics.
>> >> >> > > > 
>> >> >> > > > [1] mm: Batch page reclamation under shink_page_list
>> >> >> > > > [2] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
>> >> >> > > > 
>> >> >> > > > It's different with yours which focused on THP swapping while the suggestion
>> >> >> > > > would be more general if we can do so it's worth to try it, I think.
>> >> >> > > I think the general optimization above will benefit both normal pages
>> >> >> > > and THP at least for now.?And I think there are no hard conflict
>> >> >> > > between those two patchsets.
>> >> >> > If we could do general optimzation, I guess THP swap without splitting
>> >> >> > would be more straight forward.
>> >> >> > 
>> >> >> > If we can reclaim batch a certain of pages all at once, it helps we can
>> >> >> > do scan_swap_map(si, SWAP_HAS_CACHE, nr_pages). The nr_pages could be
>> >> >> > greater or less than 512 pages. With that, scan_swap_map effectively
>> >> >> > search empty swap slots from scan_map or free cluser list.
>> >> >> > Then, needed part from your patchset is to just delay splitting of THP.
>> >> >> > 
>> >> >> > > 
>> >> >> > > 
>> >> >> > > The THP swap has more opportunity to be optimized, because we can batch
>> >> >> > > 512 operations together more easily.?For full THP swap support, unmap a
>> >> >> > > THP could be more efficient with only one swap count operation instead
>> >> >> > > of 512, so do many other operations, such as add/remove from swap cache
>> >> >> > > with multi-order radix tree etc.?And it will help memory fragmentation.
>> >> >> > > THP can be kept after swapping out/in, need not to rebuild THP via
>> >> >> > > khugepaged.
>> >> >> > It seems you increased cluster size to 512 and search a empty cluster
>> >> >> > for a THP swap. With that approach, I have a concern that once clusters
>> >> >> > will be fragmented, THP swap support doesn't take benefit at all.
>> >> >> > 
>> >> >> > Why do we need a empty cluster for swapping out 512 pages?
>> >> >> > IOW, below case could work for the goal.
>> >> >> > 
>> >> >> > A : Allocated slot
>> >> >> > F : Free slot
>> >> >> > 
>> >> >> > cluster A?cluster B
>> >> >> > AAAAFFFF?-?FFFFAAAA
>> >> >> > 
>> >> >> > That's one of the reason I suggested batch reclaim work first and
>> >> >> > support THP swap based on it. With that, scan_swap_map can be aware of nr_pages
>> >> >> > and selects right clusters.
>> >> >> > 
>> >> >> > With the approach, justfication of THP swap support would be easier, too.
>> >> >> > IOW, I'm not sure how only THP swap support is valuable in real workload.
>> >> >> > 
>> >> >> > Anyways, that's just my two cents.
>> >> >> 
>> >> >> Minchan,
>> >> >> 
>> >> >> Scanning for contiguous slots that span clusters may take quite a
>> >> >> long time under fragmentation, and may eventually fail. In that case the addition scan
>> >> >> time overhead may go to waste and defeat the purpose of fast swapping of large page.
>> >> >> 
>> >> >> The empty cluster lookup on the other hand is very fast.
>> >> >> We treat the empty cluster available case as an opportunity for fast path
>> >> >> swap out of large page. Otherwise, we'll revert to the current
>> >> >> slow path behavior of breaking into normal pages so there's no
>> >> >> regression, and we may get speed up. We can be considerably faster when a lot of large
>> >> >> pages are used. 
>> >> >
>> >> > I didn't mean we should search scan_swap_map firstly without peeking
>> >> > free cluster but what I wanted was we might abstract it into
>> >> > scan_swap_map.
>> >> >
>> >> > For example, if nr_pages is greather than the size of cluster, we can
>> >> > get empty cluster first and nr_pages - sizeof(cluster) for other free
>> >> > cluster or scanning of current CPU per-cpu cluster. If we cannot find
>> >> > used slot during scanning, we can bail out simply. Then, although we
>> >> > fail to get all* contiguous slots, we get a certain of contiguous slots
>> >> > so it would be benefit for seq write and lock batching point of view
>> >> > at the cost of a little scanning. And it's not specific to THP algorighm.
>> >> 
>> >> Firstly, if my understanding were correct, to batch the normal pages
>> >> swapping out, the swap slots need not to be continuous.  But for the THP
>> >> swap support, we need the continuous swap slots.  So I think the
>> >> requirements are quite different between them.
>> >
>> > Hmm, I don't understand.
>> >
>> > Let's think about swap slot management layer point of view.
>> > It doesn't need to take care of that a amount of batch request is caused
>> > by a thp page or multiple normal pages.
>> >
>> > A matter is just that VM now asks multiple swap slots for seveal LRU-order
>> > pages so swap slot management tries to allocate several slots in a lock.
>> > Sure, it would be great if slots are consecutive fully because it means
>> > it's fast big sequential write as well as readahead together ideally.
>> > However, it would be better even if we didn't get consecutive slots because
>> > we get muliple slots all at once by batch.
>> >
>> > It's not a THP specific requirement, I think.
>> > Currenlty, SWAP_CLUSTER_MAX might be too small to get a benefit by
>> > normal page batch but it could be changed later once we implement batching
>> > logic nicely.
>> 
>> Consecutive or not may influence the performance of the swap slots
>> allocation function greatly.  For example, there is some non-consecutive
>> swap slots at the begin of the swap space, and some consecutive swap
>> slots at the end of the swap space.  If the consecutive swap slots are
>> needed, the function may need to scan from the begin to the end.  If
>> non-consecutive swap slots are required, just return the swap slots at
>> the begin of the swap space.
>
> Don't get me wrong. I never said consecutive swap slot allocation is
> not important and should scan swap_map fully for searching consecutive
> swap slot.

Sorry, I am confused.  For swapping multiple normal pages, is
non-consecutive allocation important or not?  If both consecutive and
non-consecutive allocation matter, how do we balance between them?  By
restricting the number of slots scanned?

For THP swap, consecutive slots are mandatory.  At the least, we need to
add a parameter to specify that, so that the allocator can try harder and
use a free cluster directly?

For swapping multiple normal pages, exactly consecutive swap slot
allocation isn't so important.  Slots that are merely nearby should be OK
for them, for example within the same swap cluster, so that the lower-level
disk hardware can process them efficiently (same disk segment, etc.).
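
To make the distinction concrete, an interface along these lines might look
like the toy sketch below.  The names, the fields of toy_si, and the 4-entry
free-cluster array are all invented and greatly simplified compared with the
real swap_info_struct; the point is only the huge-vs-batch split in one entry
point.

#include <stdbool.h>
#include <stddef.h>

#define TOY_CLUSTER 512				/* assumed SWAPFILE_CLUSTER */

struct toy_si {
	unsigned long free_cluster_head[4];	/* toy free-cluster list */
	size_t        nr_free_clusters;
	unsigned long pcp_next;			/* next slot in the per-cpu cluster */
	unsigned long pcp_left;			/* free slots left in it */
};

/*
 * One entry point; the caller says whether the slots must form one whole
 * cluster (THP) or are merely a normal-page batch.  Returns how many slot
 * numbers were written to slots[].
 */
static size_t toy_get_swap_slots(struct toy_si *si, unsigned long *slots,
				 size_t nr, bool huge)
{
	if (huge) {
		unsigned long head;

		if (nr != TOY_CLUSTER || !si->nr_free_clusters)
			return 0;		/* no free cluster: caller splits the THP */
		head = si->free_cluster_head[--si->nr_free_clusters];
		for (size_t i = 0; i < nr; i++)
			slots[i] = head + i;	/* consecutive by construction */
		return nr;
	}
	/* Normal batch: look only at the per-cpu cluster, under sis->lock. */
	if (nr > si->pcp_left)
		nr = si->pcp_left;
	for (size_t i = 0; i < nr; i++)
		slots[i] = si->pcp_next++;
	si->pcp_left -= nr;
	return nr;
}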

> Both multiple normal page swap and a THP swap, consecutive swap slot
> allocation is important so that it's a same requirement so I want to
> abstract it regardless of THP swap.
>
>> 
>> >> And with the current design of the swap space management, it is quite
>> >> hard to implement allocating nr_pages continuous free swap slots.  To
>> >> reduce the contention of sis->lock, even to scan one free swap slot, the
>> >> sis->lock is unlocked during scanning.  When we scan nr_pages free swap
>> >> slots, and there are no nr_pages continuous free swap slots, we need to
>> >> scan from sis->lowest_bit to sis->highest_bit, and record the largest
>> >> continuous free swap slots.  But when we lock sis->lock again to check,
>> >> some swap slot inside the largest continuous free swap slots we found
>> >> may be allocated by other processes.  So we may end up with a much
>> >> smaller number of swap slots or we need to startover again.  So I think
>> >> the simpler solution is to
>> >> 
>> >> - When a whole cluster is requested (for the THP), try to allocate a
>> >>   free cluster.  Give up if there are no free clusters.
>> >
>> > One thing I'm afraid that it would consume free clusters very fast
>> > if adjacent pages around a faulted one doesn't have same hottness/
>> > lifetime. Once it happens, we can't get benefit any more.
>> > IMO, it's too conservative and might be worse for the fragment point
>> > of view.
>> 
>> It is possible.  But I think we should start from the simple solution
>> firstly.  Instead of jumping to the perfect solution directly.
>> Especially when the simple solution is a subset of the perfect solution.
>> Do you agree?
>
> If simple solution works well and is hard to prove it's not bad than as-is,
> I agree. But my concern is about that it would consume free clusters so fast
> that it can affect badly for other workload.

If we allocate consecutive swap slots for the THP from something other
than free clusters, the free clusters will be consumed quickly too, because
not all sub-pages may be freed together.  The swap space will become
fragmented anyway.  The situation may be a little better, but I don't think
there will be a huge difference.  There may be situations where 2
consecutive non-free clusters hold 512 consecutive free swap slots, but I
don't think there will be many of them.

I think we need some other ways to deal with the swap fragmentation
problem.  For example, one way is to keep a list of not-fully-used clusters
and use it for the per-cpu cluster too (a toy sketch of this idea is shown
below).  Another way is to start reclaiming swap space during swap-in if
the swap space becomes fragmented.  And swapping in the 2M page as a whole
and freeing the cluster backing it could be a good way to help with
fragmentation.
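
A toy sketch of the first idea, with invented names and none of the locking
or per-cpu plumbing the real code would need; it only shows the data
structure: clusters that still have free slots stay on a list so allocators
can find them without scanning swap_map.

struct toy_cluster {
	unsigned int        count;	/* allocated slots in this cluster */
	struct toy_cluster *next;	/* link on the partial-cluster list */
};

struct toy_partial_list {
	struct toy_cluster *head;
};

/* Put a cluster on the partial list once one of its slots is freed. */
static void toy_partial_add(struct toy_partial_list *l, struct toy_cluster *c)
{
	c->next = l->head;
	l->head = c;
}

/* Pick a partially used cluster, e.g. to refill the per-cpu cluster. */
static struct toy_cluster *toy_partial_pop(struct toy_partial_list *l)
{
	struct toy_cluster *c = l->head;

	if (c)
		l->head = c->next;
	return c;
}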

And I don't think my change here will trigger a regression.  For swapping
out a THP, the current code behaves almost the same with respect to free
clusters as my code does: the per-cpu cluster is used up, then the next
free cluster is used.  So one free cluster is consumed per THP either way.

>> There are some other difficulties not to use the swap cluster to hold
>> the THP swapped out for the full THP swap support (without splitting).
>> 
>> The THP could be mapped in both PMD and PTE.  After the THP is swapped
>> out.  There may be swap entry in PMD and PTE too.  If a non-head PTE is
>> accessed, how do we know where is the first swap slot for the THP, so
>> that we can swap in the whole THP?
>
> You mean you want to swapin 2M pages all at once? Hmm, I'm not sure
> it's a good idea. We don't have any evidence 512 pages have same time
> locality. They were just LRU in-order due to split implementation,
> not time locality. A thing we can bet is any processes sharing the THP
> doesn't touch a subpage in 512 pages so it's really *cold*.
> For such cold 512 page swap-in, I am really not sure.

On a system with /sys/kernel/mm/transparent_hugepage/enabled set to
always, most anonymous pages could be THPs.  It could be helpful to swap
the THP out and in as a whole.  Don't you think so?  And because it is a 2M
sequential read, the performance is good.  Only more memory may be needed,
but you use more memory if you use THP anyway.

My point is, this depends on the workload.  Swapping THPs out and in as a
whole could benefit quite a few workloads.  We could provide a knob for
users to turn it off when necessary, but I don't think we should make it
impossible altogether.  Do you agree?
 
>> We can have a flag in cluster_info->flag to mark whether the swap
>> cluster backing a THP.  So swap in readahead can avoid to read ahead the
>> THP, or it can read ahead the whole THP instead of just several
>> sub-pages of the THP.
>> 
>> And if we use one swap cluster for each THP, we can use cluster_info->data
>> to hold compound map number.  That is very convenient.
>
> Huang,
>
> If you think my points are enough valid, just continue your work
> regardless of my comment. I don't want to waste your time if it helps
> your workload really. And I will defer the decision to other MM people.

I think your comments are helpful to me.  Thanks a lot for them.  We may
have somewhat different ideas about the requirements, but I think the
discussion is good.  If I give up swapping in the 2M THP as a whole, your
solution looks good to me now.  We can use cluster_info->data to accelerate
scanning, and we can give up scanning after some number of tries to avoid
too much lock contention.

> What I just wanted is to make swap batch for normal pages first and
> then support THP swap based upon it because normal page batching would
> more general optimization for us and I thought it will make your work
> more simple.

The problem is that the requirement to swap in the 2M THP as a whole makes
it hard for the THP swap support to reuse the swap allocation mechanism for
normal-page swap batching.  But there may be other places where the THP
swap can take advantage of it.  So I think it may be more reasonable for
the normal-page swapping optimization to go first.  But we can still
discuss the basic design of THP swapping.  Do you agree?  For example,
should we support swapping in the 2M THP as a whole?  If so, how should we
do it?

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-17  0:59 ` Minchan Kim
  2016-08-17  2:06   ` Huang, Ying
@ 2016-08-22 21:33   ` Huang, Ying
  2016-08-24  1:00     ` Minchan Kim
  1 sibling, 1 reply; 27+ messages in thread
From: Huang, Ying @ 2016-08-22 21:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, Kirill A . Shutemov, Andrea Arcangeli, linux-mm,
	linux-kernel

Hi, Minchan,

Minchan Kim <minchan@kernel.org> writes:
> Anyway, I hope [1/11] should be merged regardless of the patchset because
> I believe anyone doesn't feel comfortable with cluser_info functions. ;-)

I want to send out 1/11 separately.  Can I add your "Acked-by:" for it?

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-22 21:33   ` Huang, Ying
@ 2016-08-24  1:00     ` Minchan Kim
  2016-08-24  2:18       ` Huang, Ying
  0 siblings, 1 reply; 27+ messages in thread
From: Minchan Kim @ 2016-08-24  1:00 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, tim.c.chen, dave.hansen, andi.kleen, aaron.lu,
	Kirill A . Shutemov, Andrea Arcangeli, linux-mm, linux-kernel

Hi Huang,

On my side, there is more urgent work now, so I haven't had time to get
back to our ongoing discussion. I will continue after things settle down,
maybe next week. Sorry.

On Mon, Aug 22, 2016 at 02:33:08PM -0700, Huang, Ying wrote:
> Hi, Minchan,
> 
> Minchan Kim <minchan@kernel.org> writes:
> > Anyway, I hope [1/11] should be merged regardless of the patchset because
> > I believe anyone doesn't feel comfortable with cluser_info functions. ;-)
> 
> I want to send out 1/11 separately.  Can I add your "Acked-by:" for it?

Sure.

Thanks.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] THP swap: Delay splitting THP during swapping out
  2016-08-24  1:00     ` Minchan Kim
@ 2016-08-24  2:18       ` Huang, Ying
  0 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2016-08-24  2:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, tim.c.chen, dave.hansen, andi.kleen,
	aaron.lu, Kirill A . Shutemov, Andrea Arcangeli, linux-mm,
	linux-kernel

Minchan Kim <minchan@kernel.org> writes:

> Hi Huang,
>
> On my side, there are more urgent works now so I didn't have a time to
> see our ongoing discussion. I will continue after settle down works,
> maybe next week. Sorry.

No problem.  Thanks for your review so far!

> On Mon, Aug 22, 2016 at 02:33:08PM -0700, Huang, Ying wrote:
>> Hi, Minchan,
>> 
>> Minchan Kim <minchan@kernel.org> writes:
>> > Anyway, I hope [1/11] should be merged regardless of the patchset because
>> > I believe anyone doesn't feel comfortable with cluser_info functions. ;-)
>> 
>> I want to send out 1/11 separately.  Can I add your "Acked-by:" for it?
>
> Sure.

Thanks!

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-08-24  2:18 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-09 16:37 [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
2016-08-09 16:37 ` [RFC 01/11] swap: Add swap_cluster_list Huang, Ying
2016-08-09 16:37 ` [RFC 02/11] swap: Change SWAPFILE_CLUSTER to 512 Huang, Ying
2016-08-09 16:37 ` [RFC 03/11] mm, memcg: Add swap_cgroup_iter iterator Huang, Ying
2016-08-09 16:37 ` [RFC 04/11] mm, memcg: Support to charge/uncharge multiple swap entries Huang, Ying
2016-08-09 16:37 ` [RFC 05/11] mm, THP, swap: Add swap cluster allocate/free functions Huang, Ying
2016-08-09 16:37 ` [RFC 06/11] mm, THP, swap: Add get_huge_swap_page() Huang, Ying
2016-08-09 16:37 ` [RFC 07/11] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page Huang, Ying
2016-08-09 16:37 ` [RFC 08/11] mm, THP, swap: Support to add/delete THP to/from swap cache Huang, Ying
2016-08-09 16:37 ` [RFC 09/11] mm, THP: Add can_split_huge_page() Huang, Ying
2016-08-09 16:37 ` [RFC 10/11] mm, THP, swap: Support to split THP in swap cache Huang, Ying
2016-08-09 16:37 ` [RFC 11/11] mm, THP, swap: Delay splitting THP during swap out Huang, Ying
2016-08-09 17:25 ` [RFC 00/11] THP swap: Delay splitting THP during swapping out Huang, Ying
2016-08-17  0:59 ` Minchan Kim
2016-08-17  2:06   ` Huang, Ying
2016-08-17  5:07     ` Minchan Kim
2016-08-17 17:24       ` Tim Chen
2016-08-18  8:39         ` Minchan Kim
2016-08-18 17:19           ` Huang, Ying
2016-08-19  0:49             ` Minchan Kim
2016-08-19  3:44               ` Huang, Ying
2016-08-19  6:44                 ` Minchan Kim
2016-08-19  6:47                   ` Minchan Kim
2016-08-19 23:43                   ` Huang, Ying
2016-08-22 21:33   ` Huang, Ying
2016-08-24  1:00     ` Minchan Kim
2016-08-24  2:18       ` Huang, Ying
