* [PATCH v4 0/9] mm/swap: Regular page swap optimizations
@ 2016-12-09 21:09 Tim Chen
  2016-12-09 21:09 ` [PATCH v4 1/9] mm/swap: Fix kernel message in swap_info_get() Tim Chen
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Tim Chen @ 2016-12-09 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

Change Log:
v4:
1. Fix a bug when unlocking a cluster in add_swap_count_continuation(). We
should use unlock_cluster() instead of unlock_cluster_or_swap_info().
2. During swap off, handle the race when a swap slot is marked unused but
allocated and not yet placed in the swap cache.  Wait for the swap slot to be
placed in the swap cache instead of aborting swap off.
3. Initialize n_ret in get_swap_pages().

v3:
1. Fix a bug where we did not check whether a page was already in the swap
cache before skipping read ahead and returning a NULL page.
2. Fix a bug where we did not try to allocate from the global pool if
allocation from the swap slot cache did not succeed.
3. Fix a memory allocation bug for the spaces used to store the split up
64MB radix tree.
4. Fix problems caused by races between get_swap_page, cpu online/offline
and swapon/swapoff.

v2: 
1. Fix a bug in the index limit used in scan_swap_map_try_ssd_cluster
when searching for empty slots in a cluster.
2. Fix a bug in swap off that incorrectly determined whether we still had
swap devices left.
3. Port patches to mmotm-2016-10-11-15-46 branch

Andrew,

We're updating this patch series with some minor fixes.
Please consider it for inclusion in 4.10.
 
Times have changed.  The coming generation of solid-state block devices has
latencies down to sub-100 usec, which is within an order of magnitude of
DRAM, and their performance is orders of magnitude higher than the
single-spindle rotational media we've swapped to historically.

This could benefit many usage scenarios.  For example, cloud providers
overcommit their memory (as VMs don't use all the memory provisioned).
Having fast swap will allow them to be more aggressive in memory
overcommit and fit more VMs on a platform.

In our testing [see footnote], the median latency that the
kernel adds to a page fault is 15 usec, which comes quite close
to the amount that will be contributed by the underlying I/O
devices.

The software latency comes mostly from contention on the locks
protecting the radix tree of the swap cache and the locks protecting
the individual swap devices.  Lock contention already consumed
35% of CPU cycles in our test.  In the very near future,
software latency will become the bottleneck to swap performance as
block device I/O latency gets within shouting distance of DRAM speed.

This patch set, plus a previous patch Ying already posted
(commit f6498b3f), reduced the median page fault latency
from 15 usec to 4 usec (a 3.75x reduction) for a DRAM based pmem
block device.

Patch 1 is a clean up patch.
Patch 2 creates a lock per cluster.  This gives us a more fine grained
        lock that can be used to access swap_map without locking the
        whole swap device.
Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
        the rate at which we have to contend for the radix tree.
Patch 4 eliminates unnecessary page allocation for read ahead.
Patches 5-9 create a per-cpu cache of swap slots, so we don't have
        to contend on the swap device to get or release a swap slot.
        Swap slots are allocated and released in batches for better
        efficiency (see the sketch after this list).
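
For reference, below is a minimal userspace sketch of the batching idea
behind patches 5-9.  It is illustrative only: slot_cache, refill() and
get_slot() are made-up names, not kernel symbols; the real implementation
is the per-cpu cache added in mm/swap_slots.c.

#include <pthread.h>
#include <stdio.h>

#define BATCH 64

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long next_slot = 1;	/* stand-in for the shared device state */

struct slot_cache {			/* one instance per thread (per cpu) */
	unsigned long slots[BATCH];
	int nr;
};

/* Refill the private cache with one lock acquisition instead of BATCH. */
static void refill(struct slot_cache *c)
{
	pthread_mutex_lock(&pool_lock);
	for (c->nr = 0; c->nr < BATCH; c->nr++)
		c->slots[c->nr] = next_slot++;
	pthread_mutex_unlock(&pool_lock);
}

static unsigned long get_slot(struct slot_cache *c)
{
	if (!c->nr)
		refill(c);
	return c->slots[--c->nr];
}

int main(void)
{
	struct slot_cache cache = { .nr = 0 };

	for (int i = 0; i < 5; i++)
		printf("got slot %lu\n", get_slot(&cache));
	return 0;
}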

Ying Huang & Tim Chen

Footnote:
We tested the mmotm-2016-10-11-15-46 kernel with and without the
optimizations from this patch series, plus one additional patch Ying posted
earlier on removing the radix tree writeback tag in the swap cache.  Eight
threads performed random memory access on a 2-socket Haswell system using
swap mounted on a RAM based pmem block device.  This emulated a moderate
load and a swap device unbounded by I/O speed.  The aggregate working set
is twice the RAM size.  We instrumented the kernel to measure the page
fault latency.


Huang Ying (1):
  mm/swap: Skip readahead only when swap slot cache is enabled

Huang, Ying (3):
  mm/swap: Fix kernel message in swap_info_get()
  mm/swap: Add cluster lock
  mm/swap: Split swap cache into 64MB trunks

Tim Chen (5):
  mm/swap: skip read ahead for unreferenced swap slots
  mm/swap: Allocate swap slots in batches
  mm/swap: Free swap slots in batch
  mm/swap: Add cache for swap slots allocation
  mm/swap: Enable swap slots cache usage

 include/linux/swap.h       |  36 ++-
 include/linux/swap_slots.h |  30 +++
 mm/Makefile                |   2 +-
 mm/swap.c                  |   6 -
 mm/swap_slots.c            | 364 ++++++++++++++++++++++++++++++
 mm/swap_state.c            |  80 ++++++-
 mm/swapfile.c              | 542 +++++++++++++++++++++++++++++++++++----------
 7 files changed, 919 insertions(+), 141 deletions(-)
 create mode 100644 include/linux/swap_slots.h
 create mode 100644 mm/swap_slots.c

-- 
2.5.5


* [PATCH v4 1/9] mm/swap: Fix kernel message in swap_info_get()
  2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
@ 2016-12-09 21:09 ` Tim Chen
  2016-12-09 21:09 ` [PATCH v4 2/9] mm/swap: Add cluster lock Tim Chen
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2016-12-09 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim Chen

From: "Huang, Ying" <ying.huang@intel.com>

swap_info_get() is used not only in the swap free code path but also in
page_swapcount(), etc.  So the original kernel message in
swap_info_get() is not correct now.  Fix it by replacing "swap_free" with
"swap_info_get" in the message.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/swapfile.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f304389..4510c3d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -753,16 +753,16 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 	return p;
 
 bad_free:
-	pr_err("swap_free: %s%08lx\n", Unused_offset, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val);
 	goto out;
 bad_offset:
-	pr_err("swap_free: %s%08lx\n", Bad_offset, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Bad_offset, entry.val);
 	goto out;
 bad_device:
-	pr_err("swap_free: %s%08lx\n", Unused_file, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Unused_file, entry.val);
 	goto out;
 bad_nofile:
-	pr_err("swap_free: %s%08lx\n", Bad_file, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Bad_file, entry.val);
 out:
 	return NULL;
 }
-- 
2.5.5


* [PATCH v4 2/9] mm/swap: Add cluster lock
  2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
  2016-12-09 21:09 ` [PATCH v4 1/9] mm/swap: Fix kernel message in swap_info_get() Tim Chen
@ 2016-12-09 21:09 ` Tim Chen
  2016-12-09 21:09 ` [PATCH v4 3/9] mm/swap: Split swap cache into 64MB trunks Tim Chen
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2016-12-09 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim Chen

From: "Huang, Ying" <ying.huang@intel.com>

This patch reduces the lock contention on swap_info_struct->lock by
using a more fine grained lock in swap_cluster_info for some swap
operations.  swap_info_struct->lock is heavily contended if multiple
processes reclaim pages simultaneously, because there is only one lock
per swap device, a common configuration has only one or a few swap
devices in the system, and the lock protects almost all swap related
operations.

In fact, many swap operations only access one element of the
swap_info_struct->swap_map array, and there is no dependency between
different elements of swap_info_struct->swap_map.  So a fine grained
lock can be used to allow parallel access to the different elements of
swap_info_struct->swap_map.

In this patch, one bit of swap_cluster_info is used as a bit spinlock
to protect the elements of swap_info_struct->swap_map in the swap
cluster and the fields of swap_cluster_info.  This greatly reduces
lock contention for swap_info_struct->swap_map access.

To make room for the bit spinlock, the size of swap_cluster_info needs to
increase from 4 bytes to 8 bytes on 64-bit systems.  This uses 4KB more
memory for every 1GB of swap space (1GB of swap is 262144 4KB pages, i.e.
1024 clusters of 256 pages, at 4 extra bytes per cluster).

Because the size of swap_cluster_info is much smaller than the size of
a cache line (8 vs 64 bytes on the x86_64 architecture), there may be false
cache line sharing between swap_cluster_info bit spinlocks.  To avoid
false sharing in the first round of swap cluster allocation, the
order of the swap clusters in the free clusters list is changed, so
that swap_cluster_info structures sharing the same cache line are placed
as far apart as possible.  After the first round of allocation, the order
of the clusters in the free clusters list is expected to be random, so the
false sharing should not be noticeable.
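
For illustration, here is a minimal userspace analogue of the bit spinlock
idea using C11 atomics.  The kernel patch uses bit_spin_lock() and
bit_spin_unlock() on cluster_info->data; the constants below merely mirror
CLUSTER_FLAG_LOCK_BIT and CLUSTER_COUNT_SHIFT and the rest is made up.

#include <stdatomic.h>
#include <stdio.h>

#define LOCK_BIT	(1UL << 2)	/* mirrors CLUSTER_FLAG_LOCK */
#define COUNT_SHIFT	8		/* mirrors CLUSTER_COUNT_SHIFT */

static _Atomic unsigned long cluster;	/* count, flags and lock share one word */

static void lock_cluster(void)
{
	/* spin until we are the one who set the lock bit */
	while (atomic_fetch_or(&cluster, LOCK_BIT) & LOCK_BIT)
		;
}

static void unlock_cluster(void)
{
	atomic_fetch_and(&cluster, ~LOCK_BIT);
}

int main(void)
{
	lock_cluster();
	cluster += 1UL << COUNT_SHIFT;	/* count lives above the flag bits */
	unlock_cluster();
	printf("count = %lu\n", (unsigned long)cluster >> COUNT_SHIFT);
	return 0;
}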

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap.h |  13 ++-
 mm/swapfile.c        | 239 +++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 194 insertions(+), 58 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a56523c..c2d6c25 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -175,11 +175,16 @@ enum {
  * protected by swap_info_struct.lock.
  */
 struct swap_cluster_info {
-	unsigned int data:24;
-	unsigned int flags:8;
+	unsigned long data;
 };
-#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+#define CLUSTER_COUNT_SHIFT		8
+#define CLUSTER_FLAG_MASK		((1UL << CLUSTER_COUNT_SHIFT) - 1)
+#define CLUSTER_COUNT_MASK		(~CLUSTER_FLAG_MASK)
+#define CLUSTER_FLAG_FREE		1 /* This cluster is free */
+#define CLUSTER_FLAG_NEXT_NULL		2 /* This cluster has no next cluster */
+/* cluster lock, protect cluster_info contents and sis->swap_map */
+#define CLUSTER_FLAG_LOCK_BIT		2
+#define CLUSTER_FLAG_LOCK		(1 << CLUSTER_FLAG_LOCK_BIT)
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4510c3d..94828c86 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -200,61 +200,107 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
-	unsigned int flag)
+				    unsigned int flag)
 {
-	info->flags = flag;
+	info->data = (info->data & (CLUSTER_COUNT_MASK | CLUSTER_FLAG_LOCK)) |
+		(flag & ~CLUSTER_FLAG_LOCK);
 }
 
 static inline unsigned int cluster_count(struct swap_cluster_info *info)
 {
-	return info->data;
+	return info->data >> CLUSTER_COUNT_SHIFT;
 }
 
 static inline void cluster_set_count(struct swap_cluster_info *info,
 				     unsigned int c)
 {
-	info->data = c;
+	info->data = (c << CLUSTER_COUNT_SHIFT) | (info->data & CLUSTER_FLAG_MASK);
 }
 
 static inline void cluster_set_count_flag(struct swap_cluster_info *info,
 					 unsigned int c, unsigned int f)
 {
-	info->flags = f;
-	info->data = c;
+	info->data = (info->data & CLUSTER_FLAG_LOCK) |
+		(c << CLUSTER_COUNT_SHIFT) | (f & ~CLUSTER_FLAG_LOCK);
 }
 
 static inline unsigned int cluster_next(struct swap_cluster_info *info)
 {
-	return info->data;
+	return cluster_count(info);
 }
 
 static inline void cluster_set_next(struct swap_cluster_info *info,
 				    unsigned int n)
 {
-	info->data = n;
+	cluster_set_count(info, n);
 }
 
 static inline void cluster_set_next_flag(struct swap_cluster_info *info,
 					 unsigned int n, unsigned int f)
 {
-	info->flags = f;
-	info->data = n;
+	cluster_set_count_flag(info, n, f);
 }
 
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
-	return info->flags & CLUSTER_FLAG_FREE;
+	return info->data & CLUSTER_FLAG_FREE;
 }
 
 static inline bool cluster_is_null(struct swap_cluster_info *info)
 {
-	return info->flags & CLUSTER_FLAG_NEXT_NULL;
+	return info->data & CLUSTER_FLAG_NEXT_NULL;
 }
 
 static inline void cluster_set_null(struct swap_cluster_info *info)
 {
-	info->flags = CLUSTER_FLAG_NEXT_NULL;
-	info->data = 0;
+	cluster_set_next_flag(info, 0, CLUSTER_FLAG_NEXT_NULL);
+}
+
+/* Protect swap_cluster_info fields and si->swap_map */
+static inline void __lock_cluster(struct swap_cluster_info *ci)
+{
+	bit_spin_lock(CLUSTER_FLAG_LOCK_BIT, &ci->data);
+}
+
+static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
+						     unsigned long offset)
+{
+	struct swap_cluster_info *ci;
+
+	ci = si->cluster_info;
+	if (ci) {
+		ci += offset / SWAPFILE_CLUSTER;
+		__lock_cluster(ci);
+	}
+	return ci;
+}
+
+static inline void unlock_cluster(struct swap_cluster_info *ci)
+{
+	if (ci)
+		bit_spin_unlock(CLUSTER_FLAG_LOCK_BIT, &ci->data);
+}
+
+static inline struct swap_cluster_info *lock_cluster_or_swap_info(
+	struct swap_info_struct *si,
+	unsigned long offset)
+{
+	struct swap_cluster_info *ci;
+
+	ci = lock_cluster(si, offset);
+	if (!ci)
+		spin_lock(&si->lock);
+
+	return ci;
+}
+
+static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
+					       struct swap_cluster_info *ci)
+{
+	if (ci)
+		unlock_cluster(ci);
+	else
+		spin_unlock(&si->lock);
 }
 
 static inline bool cluster_list_empty(struct swap_cluster_list *list)
@@ -281,9 +327,17 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
 		cluster_set_next_flag(&list->head, idx, 0);
 		cluster_set_next_flag(&list->tail, idx, 0);
 	} else {
+		struct swap_cluster_info *ci_tail;
 		unsigned int tail = cluster_next(&list->tail);
 
-		cluster_set_next(&ci[tail], idx);
+		/*
+		 * Nested cluster lock, but both cluster locks are
+		 * only acquired when we held swap_info_struct->lock
+		 */
+		ci_tail = ci + tail;
+		__lock_cluster(ci_tail);
+		cluster_set_next(ci_tail, idx);
+		unlock_cluster(ci_tail);
 		cluster_set_next_flag(&list->tail, idx, 0);
 	}
 }
@@ -328,7 +382,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 */
 static void swap_do_scheduled_discard(struct swap_info_struct *si)
 {
-	struct swap_cluster_info *info;
+	struct swap_cluster_info *info, *ci;
 	unsigned int idx;
 
 	info = si->cluster_info;
@@ -341,10 +395,14 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
+		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
+		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
+		unlock_cluster(ci);
 		cluster_list_add_tail(&si->free_clusters, info, idx);
+		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
+		unlock_cluster(ci);
 	}
 }
 
@@ -447,8 +505,9 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned long *offset, unsigned long *scan_base)
 {
 	struct percpu_cluster *cluster;
+	struct swap_cluster_info *ci;
 	bool found_free;
-	unsigned long tmp;
+	unsigned long tmp, max;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
@@ -476,14 +535,21 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	 * check if there is still free entry in the cluster
 	 */
 	tmp = cluster->next;
-	while (tmp < si->max && tmp < (cluster_next(&cluster->index) + 1) *
-	       SWAPFILE_CLUSTER) {
+	max = min_t(unsigned long, si->max,
+		    (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+	if (tmp >= max) {
+		cluster_set_null(&cluster->index);
+		goto new_cluster;
+	}
+	ci = lock_cluster(si, tmp);
+	while (tmp < max) {
 		if (!si->swap_map[tmp]) {
 			found_free = true;
 			break;
 		}
 		tmp++;
 	}
+	unlock_cluster(ci);
 	if (!found_free) {
 		cluster_set_null(&cluster->index);
 		goto new_cluster;
@@ -496,6 +562,7 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
+	struct swap_cluster_info *ci;
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
@@ -572,9 +639,11 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 	if (offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
 
+	ci = lock_cluster(si, offset);
 	/* reuse swap entry of cache-only swap if not busy. */
 	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
 		int swap_was_freed;
+		unlock_cluster(ci);
 		spin_unlock(&si->lock);
 		swap_was_freed = __try_to_reclaim_swap(si, offset);
 		spin_lock(&si->lock);
@@ -584,8 +653,10 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 		goto scan; /* check next one */
 	}
 
-	if (si->swap_map[offset])
+	if (si->swap_map[offset]) {
+		unlock_cluster(ci);
 		goto scan;
+	}
 
 	if (offset == si->lowest_bit)
 		si->lowest_bit++;
@@ -601,6 +672,7 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 	}
 	si->swap_map[offset] = usage;
 	inc_cluster_info_page(si, si->cluster_info, offset);
+	unlock_cluster(ci);
 	si->cluster_next = offset + 1;
 	si->flags -= SWP_SCANNING;
 
@@ -731,7 +803,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	return (swp_entry_t) {0};
 }
 
-static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	unsigned long offset, type;
@@ -749,7 +821,6 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 		goto bad_offset;
 	if (!p->swap_map[offset])
 		goto bad_free;
-	spin_lock(&p->lock);
 	return p;
 
 bad_free:
@@ -767,14 +838,45 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
+static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+{
+	struct swap_info_struct *p;
+
+	p = _swap_info_get(entry);
+	if (p)
+		spin_lock(&p->lock);
+	return p;
+}
+
 static unsigned char swap_entry_free(struct swap_info_struct *p,
-				     swp_entry_t entry, unsigned char usage)
+				     swp_entry_t entry, unsigned char usage,
+				     bool swap_info_locked)
 {
+	struct swap_cluster_info *ci;
 	unsigned long offset = swp_offset(entry);
 	unsigned char count;
 	unsigned char has_cache;
+	bool lock_swap_info = false;
+
+	if (!swap_info_locked) {
+		count = p->swap_map[offset];
+		if (!p->cluster_info || count == usage || count == SWAP_MAP_SHMEM) {
+lock_swap_info:
+			swap_info_locked = true;
+			lock_swap_info = true;
+			spin_lock(&p->lock);
+		}
+	}
+
+	ci = lock_cluster(p, offset);
 
 	count = p->swap_map[offset];
+
+	if (!swap_info_locked && (count == usage || count == SWAP_MAP_SHMEM)) {
+		unlock_cluster(ci);
+		goto lock_swap_info;
+	}
+
 	has_cache = count & SWAP_HAS_CACHE;
 	count &= ~SWAP_HAS_CACHE;
 
@@ -800,10 +902,15 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 	usage = count | has_cache;
 	p->swap_map[offset] = usage;
 
+	unlock_cluster(ci);
+
 	/* free if no reference */
 	if (!usage) {
+		VM_BUG_ON(!swap_info_locked);
 		mem_cgroup_uncharge_swap(entry);
+		ci = lock_cluster(p, offset);
 		dec_cluster_info_page(p, p->cluster_info, offset);
+		unlock_cluster(ci);
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
 		if (offset > p->highest_bit) {
@@ -829,6 +936,9 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 		}
 	}
 
+	if (lock_swap_info)
+		spin_unlock(&p->lock);
+
 	return usage;
 }
 
@@ -840,11 +950,9 @@ void swap_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 
-	p = swap_info_get(entry);
-	if (p) {
-		swap_entry_free(p, entry, 1);
-		spin_unlock(&p->lock);
-	}
+	p = _swap_info_get(entry);
+	if (p)
+		swap_entry_free(p, entry, 1, false);
 }
 
 /*
@@ -854,11 +962,9 @@ void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 
-	p = swap_info_get(entry);
-	if (p) {
-		swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		spin_unlock(&p->lock);
-	}
+	p = _swap_info_get(entry);
+	if (p)
+		swap_entry_free(p, entry, SWAP_HAS_CACHE, false);
 }
 
 /*
@@ -870,13 +976,17 @@ int page_swapcount(struct page *page)
 {
 	int count = 0;
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	swp_entry_t entry;
+	unsigned long offset;
 
 	entry.val = page_private(page);
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (p) {
-		count = swap_count(p->swap_map[swp_offset(entry)]);
-		spin_unlock(&p->lock);
+		offset = swp_offset(entry);
+		ci = lock_cluster_or_swap_info(p, offset);
+		count = swap_count(p->swap_map[offset]);
+		unlock_cluster_or_swap_info(p, ci);
 	}
 	return count;
 }
@@ -889,22 +999,26 @@ int swp_swapcount(swp_entry_t entry)
 {
 	int count, tmp_count, n;
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	struct page *page;
 	pgoff_t offset;
 	unsigned char *map;
 
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (!p)
 		return 0;
 
-	count = swap_count(p->swap_map[swp_offset(entry)]);
+	offset = swp_offset(entry);
+
+	ci = lock_cluster_or_swap_info(p, offset);
+
+	count = swap_count(p->swap_map[offset]);
 	if (!(count & COUNT_CONTINUED))
 		goto out;
 
 	count &= ~COUNT_CONTINUED;
 	n = SWAP_MAP_MAX + 1;
 
-	offset = swp_offset(entry);
 	page = vmalloc_to_page(p->swap_map + offset);
 	offset &= ~PAGE_MASK;
 	VM_BUG_ON(page_private(page) != SWP_CONTINUED);
@@ -919,7 +1033,7 @@ int swp_swapcount(swp_entry_t entry)
 		n *= (SWAP_CONT_MAX + 1);
 	} while (tmp_count & COUNT_CONTINUED);
 out:
-	spin_unlock(&p->lock);
+	unlock_cluster_or_swap_info(p, ci);
 	return count;
 }
 
@@ -1003,7 +1117,7 @@ int free_swap_and_cache(swp_entry_t entry)
 
 	p = swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
+		if (swap_entry_free(p, entry, 1, true) == SWAP_HAS_CACHE) {
 			page = find_get_page(swap_address_space(entry),
 					     swp_offset(entry));
 			if (page && !trylock_page(page)) {
@@ -2285,6 +2399,9 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 	return maxpages;
 }
 
+#define SWAP_CLUSTER_COLS						\
+	DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
+
 static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					union swap_header *swap_header,
 					unsigned char *swap_map,
@@ -2292,11 +2409,12 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					unsigned long maxpages,
 					sector_t *span)
 {
-	int i;
+	unsigned int j, k;
 	unsigned int nr_good_pages;
 	int nr_extents;
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	unsigned long idx = p->cluster_next / SWAPFILE_CLUSTER;
+	unsigned long col = p->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
+	unsigned long i, idx;
 
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
@@ -2344,15 +2462,20 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	if (!cluster_info)
 		return nr_extents;
 
-	for (i = 0; i < nr_clusters; i++) {
-		if (!cluster_count(&cluster_info[idx])) {
+
+	/* Reduce false cache line sharing between cluster_info */
+	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
+		j = (k + col) % SWAP_CLUSTER_COLS;
+		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
+			idx = i * SWAP_CLUSTER_COLS + j;
+			if (idx >= nr_clusters)
+				continue;
+			if (cluster_count(&cluster_info[idx]))
+				continue;
 			cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
 			cluster_list_add_tail(&p->free_clusters, cluster_info,
 					      idx);
 		}
-		idx++;
-		if (idx == nr_clusters)
-			idx = 0;
 	}
 	return nr_extents;
 }
@@ -2610,6 +2733,7 @@ void si_swapinfo(struct sysinfo *val)
 static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 {
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	unsigned long offset, type;
 	unsigned char count;
 	unsigned char has_cache;
@@ -2623,10 +2747,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 		goto bad_file;
 	p = swap_info[type];
 	offset = swp_offset(entry);
-
-	spin_lock(&p->lock);
 	if (unlikely(offset >= p->max))
-		goto unlock_out;
+		goto out;
+
+	ci = lock_cluster_or_swap_info(p, offset);
 
 	count = p->swap_map[offset];
 
@@ -2669,7 +2793,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	p->swap_map[offset] = count | has_cache;
 
 unlock_out:
-	spin_unlock(&p->lock);
+	unlock_cluster_or_swap_info(p, ci);
 out:
 	return err;
 
@@ -2758,6 +2882,7 @@ EXPORT_SYMBOL_GPL(__page_file_index);
 int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 {
 	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
 	struct page *head;
 	struct page *page;
 	struct page *list_page;
@@ -2781,6 +2906,9 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	offset = swp_offset(entry);
+
+	ci = lock_cluster(si, offset);
+
 	count = si->swap_map[offset] & ~SWAP_HAS_CACHE;
 
 	if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
@@ -2793,6 +2921,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	if (!page) {
+		unlock_cluster(ci);
 		spin_unlock(&si->lock);
 		return -ENOMEM;
 	}
@@ -2841,6 +2970,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	list_add_tail(&page->lru, &head->lru);
 	page = NULL;			/* now it's attached, don't free it */
 out:
+	unlock_cluster(ci);
 	spin_unlock(&si->lock);
 outer:
 	if (page)
@@ -2854,7 +2984,8 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
  * into, carry if so, or else fail until a new continuation page is allocated;
  * when the original swap_map count is decremented from 0 with continuation,
  * borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or swap_entry_free() holds swap_lock.
+ * Called while __swap_duplicate() or swap_entry_free() holds swap or cluster
+ * lock.
  */
 static bool swap_count_continued(struct swap_info_struct *si,
 				 pgoff_t offset, unsigned char count)
-- 
2.5.5


* [PATCH v4 3/9] mm/swap: Split swap cache into 64MB trunks
  2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
  2016-12-09 21:09 ` [PATCH v4 1/9] mm/swap: Fix kernel message in swap_info_get() Tim Chen
  2016-12-09 21:09 ` [PATCH v4 2/9] mm/swap: Add cluster lock Tim Chen
@ 2016-12-09 21:09 ` Tim Chen
  2016-12-09 21:09 ` [PATCH v4 4/9] mm/swap: skip read ahead for unreferenced swap slots Tim Chen
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2016-12-09 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim Chen

From: "Huang, Ying" <ying.huang@intel.com>

This patch improves the scalability of swap out/in by using fine
grained locks for the swap cache.  In the current kernel, one address
space is used for each swap device, and in the common configuration
the number of swap devices is very small (one is typical).  This
causes heavy lock contention on the radix tree of the address space if
multiple tasks swap out/in concurrently.  But in fact, there is no
dependency between pages in the swap cache, so we can split the one
shared address space for each swap device into several address spaces
to reduce the lock contention.  In this patch, the shared address
space is split into 64MB trunks.  64MB is chosen to balance the memory
space usage against the effect of the lock contention reduction.

The size of struct address_space on the x86_64 architecture is 408 bytes,
so with this patch 6528 bytes more memory will be used for every 1GB of
swap space on x86_64 (16 address spaces of 408 bytes each per 1GB).

One address space is still shared by the swap entries in the same 64MB
trunk.  To avoid lock contention during the first round of swap space
allocation, the order of the swap clusters in the initial free clusters
list is changed so that the swap space distance between consecutive swap
clusters in the free cluster list is at least 64MB.  After the first
round of allocation, the swap clusters are expected to be freed
randomly, so the lock contention should be reduced effectively.
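
A minimal sketch of how a swap offset selects one of the split address
spaces, assuming 4KB pages: 64MB / 4KB = 16384 pages, so the offset is
shifted by 14, matching SWAP_ADDRESS_SPACE_SHIFT in this patch.  The
offsets below are arbitrary examples.

#include <stdio.h>

#define SWAP_ADDRESS_SPACE_SHIFT 14

int main(void)
{
	unsigned long offsets[] = { 0, 16383, 16384, 100000 };

	for (int i = 0; i < 4; i++)
		printf("offset %lu -> address space %lu\n",
		       offsets[i], offsets[i] >> SWAP_ADDRESS_SPACE_SHIFT);
	return 0;
}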

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap.h | 10 ++++++--
 mm/swap.c            |  6 -----
 mm/swap_state.c      | 68 ++++++++++++++++++++++++++++++++++++++++++----------
 mm/swapfile.c        | 16 +++++++++++--
 4 files changed, 78 insertions(+), 22 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c2d6c25..5f66c84 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -370,8 +370,12 @@ int generic_swapfile_activate(struct swap_info_struct *, struct file *,
 		sector_t *);
 
 /* linux/mm/swap_state.c */
-extern struct address_space swapper_spaces[];
-#define swap_address_space(entry) (&swapper_spaces[swp_type(entry)])
+/* One swap address space for each 64M swap space */
+#define SWAP_ADDRESS_SPACE_SHIFT	14
+#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
+extern struct address_space* swapper_spaces[];
+#define swap_address_space(entry)					\
+	(&swapper_spaces[swp_type(entry)][swp_offset(entry) >> SWAP_ADDRESS_SPACE_SHIFT])
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *, struct list_head *list);
@@ -425,6 +429,8 @@ extern struct swap_info_struct *page_swap_info(struct page *);
 extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
+extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
+extern void exit_swap_address_space(unsigned int type);
 
 #else /* CONFIG_SWAP */
 
diff --git a/mm/swap.c b/mm/swap.c
index 4dcf852..5c2ea71 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -969,12 +969,6 @@ EXPORT_SYMBOL(pagevec_lookup_tag);
 void __init swap_setup(void)
 {
 	unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);
-#ifdef CONFIG_SWAP
-	int i;
-
-	for (i = 0; i < MAX_SWAPFILES; i++)
-		spin_lock_init(&swapper_spaces[i].tree_lock);
-#endif
 
 	/* Use a smaller cluster for small-memory machines */
 	if (megs < 16)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 35d7e0e..3863acd 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
+#include <linux/vmalloc.h>
 
 #include <asm/pgtable.h>
 
@@ -32,15 +33,8 @@ static const struct address_space_operations swap_aops = {
 #endif
 };
 
-struct address_space swapper_spaces[MAX_SWAPFILES] = {
-	[0 ... MAX_SWAPFILES - 1] = {
-		.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
-		.i_mmap_writable = ATOMIC_INIT(0),
-		.a_ops		= &swap_aops,
-		/* swap cache doesn't use writeback related tags */
-		.flags		= 1 << AS_NO_WRITEBACK_TAGS,
-	}
-};
+struct address_space *swapper_spaces[MAX_SWAPFILES];
+static unsigned int nr_swapper_spaces[MAX_SWAPFILES];
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
 
@@ -53,11 +47,26 @@ static struct {
 
 unsigned long total_swapcache_pages(void)
 {
-	int i;
+	unsigned int i, j, nr;
 	unsigned long ret = 0;
+	struct address_space *spaces;
 
-	for (i = 0; i < MAX_SWAPFILES; i++)
-		ret += swapper_spaces[i].nrpages;
+	rcu_read_lock();
+	for (i = 0; i < MAX_SWAPFILES; i++) {
+		/*
+		 * The corresponding entries in nr_swapper_spaces and
+		 * swapper_spaces will be reused only after at least
+		 * one grace period.  So it is impossible for them
+		 * belongs to different usage.
+		 */
+		nr = nr_swapper_spaces[i];
+		spaces = rcu_dereference(swapper_spaces[i]);
+		if (!nr || !spaces)
+			continue;
+		for (j = 0; j < nr; j++)
+			ret += spaces[j].nrpages;
+	}
+	rcu_read_unlock();
 	return ret;
 }
 
@@ -505,3 +514,38 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
+
+int init_swap_address_space(unsigned int type, unsigned long nr_pages)
+{
+	struct address_space *spaces, *space;
+	unsigned int i, nr;
+
+	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
+	spaces = vzalloc(sizeof(struct address_space) * nr);
+	if (!spaces)
+		return -ENOMEM;
+	for (i = 0; i < nr; i++) {
+		space = spaces + i;
+		INIT_RADIX_TREE(&space->page_tree, GFP_ATOMIC|__GFP_NOWARN);
+		atomic_set(&space->i_mmap_writable, 0);
+		space->a_ops = &swap_aops;
+		/* swap cache doesn't use writeback related tags */
+		mapping_set_no_writeback_tags(space);
+		spin_lock_init(&space->tree_lock);
+	}
+	nr_swapper_spaces[type] = nr;
+	rcu_assign_pointer(swapper_spaces[type], spaces);
+
+	return 0;
+}
+
+void exit_swap_address_space(unsigned int type)
+{
+	struct address_space *spaces;
+
+	spaces = swapper_spaces[type];
+	nr_swapper_spaces[type] = 0;
+	rcu_assign_pointer(swapper_spaces[type], NULL);
+	synchronize_rcu();
+	kvfree(spaces);
+}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 94828c86..eb6cba7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2076,6 +2076,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	vfree(frontswap_map);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
+	exit_swap_address_space(p->type);
 
 	inode = mapping->host;
 	if (S_ISBLK(inode->i_mode)) {
@@ -2399,8 +2400,12 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 	return maxpages;
 }
 
-#define SWAP_CLUSTER_COLS						\
+#define SWAP_CLUSTER_INFO_COLS						\
 	DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
+#define SWAP_CLUSTER_SPACE_COLS						\
+	DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
+#define SWAP_CLUSTER_COLS						\
+	max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
 
 static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					union swap_header *swap_header,
@@ -2463,7 +2468,10 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 		return nr_extents;
 
 
-	/* Reduce false cache line sharing between cluster_info */
+	/*
+	 * Reduce false cache line sharing between cluster_info and
+	 * sharing same address space.
+	 */
 	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
 		j = (k + col) % SWAP_CLUSTER_COLS;
 		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
@@ -2644,6 +2652,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		}
 	}
 
+	error = init_swap_address_space(p->type, maxpages);
+	if (error)
+		goto bad_swap;
+
 	mutex_lock(&swapon_mutex);
 	prio = -1;
 	if (swap_flags & SWAP_FLAG_PREFER)
-- 
2.5.5


* [PATCH v4 4/9] mm/swap: skip read ahead for unreferenced swap slots
  2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (2 preceding siblings ...)
  2016-12-09 21:09 ` [PATCH v4 3/9] mm/swap: Split swap cache into 64MB trunks Tim Chen
@ 2016-12-09 21:09 ` Tim Chen
  2016-12-09 21:09 ` [PATCH v4 5/9] mm/swap: Allocate swap slots in batches Tim Chen
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2016-12-09 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

We can avoid needlessly allocating a page for swap slots that
are not used by anyone.  No pages have to be read in for
these slots.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |  6 ++++++
 mm/swap_state.c      |  4 ++++
 mm/swapfile.c        | 47 +++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5f66c84..6b7a48b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -424,6 +424,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern bool reuse_swap_page(struct page *, int *);
@@ -518,6 +519,11 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
+static inline int __swp_swapcount(swp_entry_t entry)
+{
+	return 0;
+}
+
 static inline int swp_swapcount(swp_entry_t entry)
 {
 	return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3863acd..3d76d80 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -323,6 +323,10 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		if (found_page)
 			break;
 
+		/* Just skip read ahead for unused swap slot */
+		if (!__swp_swapcount(entry))
+			return NULL;
+
 		/*
 		 * Get a new page to read into from swap.
 		 */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index eb6cba7..136c9dd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -803,7 +803,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	return (swp_entry_t) {0};
 }
 
-static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
+static struct swap_info_struct *__swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	unsigned long offset, type;
@@ -819,13 +819,8 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	offset = swp_offset(entry);
 	if (offset >= p->max)
 		goto bad_offset;
-	if (!p->swap_map[offset])
-		goto bad_free;
 	return p;
 
-bad_free:
-	pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val);
-	goto out;
 bad_offset:
 	pr_err("swap_info_get: %s%08lx\n", Bad_offset, entry.val);
 	goto out;
@@ -838,6 +833,24 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
+static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
+{
+	struct swap_info_struct *p;
+
+	p = __swap_info_get(entry);
+	if (!p)
+		goto out;
+	if (!p->swap_map[swp_offset(entry)])
+		goto bad_free;
+	return p;
+
+bad_free:
+	pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val);
+	goto out;
+out:
+	return NULL;
+}
+
 static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
@@ -993,6 +1006,28 @@ int page_swapcount(struct page *page)
 
 /*
  * How many references to @entry are currently swapped out?
+ * This does not give an exact answer when swap count is continued,
+ * but does include the high COUNT_CONTINUED flag to allow for that.
+ */
+int __swp_swapcount(swp_entry_t entry)
+{
+	int count = 0;
+	pgoff_t offset;
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+
+	si = __swap_info_get(entry);
+	if (si) {
+		offset = swp_offset(entry);
+		ci = lock_cluster_or_swap_info(si, offset);
+		count = swap_count(si->swap_map[offset]);
+		unlock_cluster_or_swap_info(si, ci);
+	}
+	return count;
+}
+
+/*
+ * How many references to @entry are currently swapped out?
  * This considers COUNT_CONTINUED so it returns exact answer.
  */
 int swp_swapcount(swp_entry_t entry)
-- 
2.5.5


* [PATCH v4 5/9] mm/swap: Allocate swap slots in batches
  2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (3 preceding siblings ...)
  2016-12-09 21:09 ` [PATCH v4 4/9] mm/swap: skip read ahead for unreferenced swap slots Tim Chen
@ 2016-12-09 21:09 ` Tim Chen
  2016-12-09 21:09 ` [PATCH v4 6/9] mm/swap: Free swap slots in batch Tim Chen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2016-12-09 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

Currently, swap slots are allocated one page at a time,
causing contention on the swap_info lock protecting the swap partition
for every page being swapped.

This patch adds new functions get_swap_pages and scan_swap_map_slots
to request multiple swap slots at once.  This reduces the lock
contention on the swap_info lock.  scan_swap_map_slots can also operate
more efficiently, as swap slots often occur in clusters close to each
other on a swap device and it is quicker to allocate them together.
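
For illustration, a minimal userspace analogue of the idea behind
scan_swap_map_slots(): collect up to n free slots in a single pass over
the map rather than rescanning once per slot.  slot_map, its size and
the example contents are made up for this sketch.

#include <stdio.h>

#define MAP_SIZE 16

static unsigned char slot_map[MAP_SIZE] = {
	1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0
};

/* Return how many free slots were found; store their offsets in slots[]. */
static int scan_map_slots(int n, unsigned long slots[])
{
	int n_ret = 0;

	for (unsigned long off = 0; off < MAP_SIZE && n_ret < n; off++)
		if (!slot_map[off])
			slots[n_ret++] = off;
	return n_ret;
}

int main(void)
{
	unsigned long slots[4];
	int i, n = scan_map_slots(4, slots);

	for (i = 0; i < n; i++)
		printf("slot %lu\n", slots[i]);
	return 0;
}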

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |   2 +
 mm/swapfile.c        | 138 +++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 115 insertions(+), 25 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6b7a48b..68d2ea5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -27,6 +27,7 @@ struct bio;
 #define SWAP_FLAGS_VALID	(SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
 				 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
 				 SWAP_FLAG_DISCARD_PAGES)
+#define SWAP_BATCH 64
 
 static inline int current_is_kswapd(void)
 {
@@ -412,6 +413,7 @@ static inline long get_nr_swap_pages(void)
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
 extern swp_entry_t get_swap_page_of_type(int);
+extern int get_swap_pages(int n, swp_entry_t swp_entries[]);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 136c9dd..8b2a97e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -501,7 +501,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
  * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
  * might involve allocating a new cluster for current CPU too.
  */
-static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
+static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned long *offset, unsigned long *scan_base)
 {
 	struct percpu_cluster *cluster;
@@ -525,7 +525,7 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 			*scan_base = *offset = si->cluster_next;
 			goto new_cluster;
 		} else
-			return;
+			return false;
 	}
 
 	found_free = false;
@@ -557,16 +557,22 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	cluster->next = tmp + 1;
 	*offset = tmp;
 	*scan_base = tmp;
+	return found_free;
 }
 
-static unsigned long scan_swap_map(struct swap_info_struct *si,
-				   unsigned char usage)
+static int scan_swap_map_slots(struct swap_info_struct *si,
+			       unsigned char usage, int nr,
+			       swp_entry_t slots[])
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
 	int latency_ration = LATENCY_LIMIT;
+	int n_ret = 0;
+
+	if (nr > SWAP_BATCH)
+		nr = SWAP_BATCH;
 
 	/*
 	 * We try to cluster swap pages by allocating them sequentially
@@ -584,8 +590,10 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 	/* SSD algorithm */
 	if (si->cluster_info) {
-		scan_swap_map_try_ssd_cluster(si, &offset, &scan_base);
-		goto checks;
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+			goto checks;
+		else
+			goto scan;
 	}
 
 	if (unlikely(!si->cluster_nr--)) {
@@ -629,8 +637,14 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 checks:
 	if (si->cluster_info) {
-		while (scan_swap_map_ssd_cluster_conflict(si, offset))
-			scan_swap_map_try_ssd_cluster(si, &offset, &scan_base);
+		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
+		/* take a break if we already got some slots */
+			if (n_ret)
+				goto done;
+			if (!scan_swap_map_try_ssd_cluster(si, &offset,
+							&scan_base))
+				goto scan;
+		}
 	}
 	if (!(si->flags & SWP_WRITEOK))
 		goto no_page;
@@ -655,7 +669,10 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 	if (si->swap_map[offset]) {
 		unlock_cluster(ci);
-		goto scan;
+		if (!n_ret)
+			goto scan;
+		else
+			goto done;
 	}
 
 	if (offset == si->lowest_bit)
@@ -674,9 +691,43 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	unlock_cluster(ci);
 	si->cluster_next = offset + 1;
-	si->flags -= SWP_SCANNING;
+	slots[n_ret++] = swp_entry(si->type, offset);
+
+	/* got enough slots or reach max slots? */
+	if ((n_ret == nr) || (offset >= si->highest_bit))
+		goto done;
 
-	return offset;
+	/* search for next available slot */
+
+	/* time to take a break? */
+	if (unlikely(--latency_ration < 0)) {
+		if (n_ret)
+			goto done;
+		spin_unlock(&si->lock);
+		cond_resched();
+		spin_lock(&si->lock);
+		latency_ration = LATENCY_LIMIT;
+	}
+
+	/* try to get more slots in cluster */
+	if (si->cluster_info) {
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+			goto checks;
+		else
+			goto done;
+	}
+	/* non-ssd case */
+	++offset;
+
+	/* non-ssd case, still more slots in cluster? */
+	if (si->cluster_nr && !si->swap_map[offset]) {
+		--si->cluster_nr;
+		goto checks;
+	}
+
+done:
+	si->flags -= SWP_SCANNING;
+	return n_ret;
 
 scan:
 	spin_unlock(&si->lock);
@@ -714,17 +765,43 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 no_page:
 	si->flags -= SWP_SCANNING;
-	return 0;
+	return n_ret;
 }
 
-swp_entry_t get_swap_page(void)
+static unsigned long scan_swap_map(struct swap_info_struct *si,
+				   unsigned char usage)
+{
+	swp_entry_t entry;
+	int n_ret;
+
+	n_ret = scan_swap_map_slots(si, usage, 1, &entry);
+
+	if (n_ret)
+		return swp_offset(entry);
+	else
+		return 0;
+
+}
+
+int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 {
 	struct swap_info_struct *si, *next;
-	pgoff_t offset;
+	long avail_pgs;
+	int n_ret = 0;
 
-	if (atomic_long_read(&nr_swap_pages) <= 0)
+	avail_pgs = atomic_long_read(&nr_swap_pages);
+	if (avail_pgs <= 0)
 		goto noswap;
-	atomic_long_dec(&nr_swap_pages);
+
+	swp_entries[0] = (swp_entry_t) {0};
+
+	if (n_goal > SWAP_BATCH)
+		n_goal = SWAP_BATCH;
+
+	if (n_goal > avail_pgs)
+		n_goal = avail_pgs;
+
+	atomic_long_sub(n_goal, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
 
@@ -750,14 +827,14 @@ swp_entry_t get_swap_page(void)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-
-		/* This is called for allocating swap entry for cache */
-		offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+					    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (offset)
-			return swp_entry(si->type, offset);
+		if (n_ret)
+			goto check_out;
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
-		       si->type);
+			si->type);
+
 		spin_lock(&swap_avail_lock);
 nextsi:
 		/*
@@ -768,7 +845,8 @@ swp_entry_t get_swap_page(void)
 		 * up between us dropping swap_avail_lock and taking si->lock.
 		 * Since we dropped the swap_avail_lock, the swap_avail_head
 		 * list may have been modified; so if next is still in the
-		 * swap_avail_head list then try it, otherwise start over.
+		 * swap_avail_head list then try it, otherwise start over
+		 * if we have not gotten any slots.
 		 */
 		if (plist_node_empty(&next->avail_list))
 			goto start_over;
@@ -776,9 +854,19 @@ swp_entry_t get_swap_page(void)
 
 	spin_unlock(&swap_avail_lock);
 
-	atomic_long_inc(&nr_swap_pages);
+check_out:
+	if (n_ret < n_goal)
+		atomic_long_add((long) (n_goal-n_ret), &nr_swap_pages);
 noswap:
-	return (swp_entry_t) {0};
+	return n_ret;
+}
+
+swp_entry_t get_swap_page(void)
+{
+	swp_entry_t entry;
+
+	get_swap_pages(1, &entry);
+	return entry;
 }
 
 /* The only caller of this function is now suspend routine */
-- 
2.5.5


* [PATCH v4 6/9] mm/swap: Free swap slots in batch
  2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (4 preceding siblings ...)
  2016-12-09 21:09 ` [PATCH v4 5/9] mm/swap: Allocate swap slots in batches Tim Chen
@ 2016-12-09 21:09 ` Tim Chen
  2016-12-09 21:09 ` [PATCH v4 7/9] mm/swap: Add cache for swap slots allocation Tim Chen
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2016-12-09 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

Add new functions that free unused swap slots in batches without
the need to reacquire the swap info lock for each slot.  This improves
scalability and reduces lock contention.
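
For illustration, a minimal userspace analogue of the batched free: the
per-device lock is only dropped and retaken when consecutive entries
belong to a different device, which is what swap_info_get_cont() does in
this patch.  struct entry and the mutex array below are made up for the
sketch.

#include <pthread.h>
#include <stdio.h>

#define NDEV 2

static pthread_mutex_t dev_lock[NDEV] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

struct entry { int dev; long off; };

static void free_entries(struct entry *e, int n)
{
	int i, locked = -1;

	for (i = 0; i < n; i++) {
		if (e[i].dev != locked) {
			if (locked >= 0)
				pthread_mutex_unlock(&dev_lock[locked]);
			locked = e[i].dev;
			pthread_mutex_lock(&dev_lock[locked]);
		}
		printf("freeing dev %d offset %ld\n", e[i].dev, e[i].off);
	}
	if (locked >= 0)
		pthread_mutex_unlock(&dev_lock[locked]);
}

int main(void)
{
	struct entry batch[] = {
		{ 0, 10 }, { 0, 11 }, { 0, 12 }, { 1, 3 }, { 1, 4 },
	};

	free_entries(batch, 5);
	return 0;
}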

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |   1 +
 mm/swapfile.c        | 155 +++++++++++++++++++++++++++++++--------------------
 2 files changed, 95 insertions(+), 61 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 68d2ea5..f2bb6ac 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -420,6 +420,7 @@ extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free(swp_entry_t);
+extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8b2a97e..172fd36 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -949,35 +949,34 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 	return p;
 }
 
-static unsigned char swap_entry_free(struct swap_info_struct *p,
-				     swp_entry_t entry, unsigned char usage,
-				     bool swap_info_locked)
+static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
+					struct swap_info_struct *q)
+{
+	struct swap_info_struct *p;
+
+	p = _swap_info_get(entry);
+
+	if (p != q) {
+		if (q != NULL)
+			spin_unlock(&q->lock);
+		if (p != NULL)
+			spin_lock(&p->lock);
+	}
+	return p;
+}
+
+static unsigned char __swap_entry_free(struct swap_info_struct *p,
+				       swp_entry_t entry, unsigned char usage)
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset = swp_offset(entry);
 	unsigned char count;
 	unsigned char has_cache;
-	bool lock_swap_info = false;
-
-	if (!swap_info_locked) {
-		count = p->swap_map[offset];
-		if (!p->cluster_info || count == usage || count == SWAP_MAP_SHMEM) {
-lock_swap_info:
-			swap_info_locked = true;
-			lock_swap_info = true;
-			spin_lock(&p->lock);
-		}
-	}
 
-	ci = lock_cluster(p, offset);
+	ci = lock_cluster_or_swap_info(p, offset);
 
 	count = p->swap_map[offset];
 
-	if (!swap_info_locked && (count == usage || count == SWAP_MAP_SHMEM)) {
-		unlock_cluster(ci);
-		goto lock_swap_info;
-	}
-
 	has_cache = count & SWAP_HAS_CACHE;
 	count &= ~SWAP_HAS_CACHE;
 
@@ -1001,46 +1000,52 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 	}
 
 	usage = count | has_cache;
-	p->swap_map[offset] = usage;
+	p->swap_map[offset] = usage ? : SWAP_HAS_CACHE;
+
+	unlock_cluster_or_swap_info(p, ci);
+
+	return usage;
+}
 
+static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
+{
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+	unsigned char count;
+
+	ci = lock_cluster(p, offset);
+	count = p->swap_map[offset];
+	VM_BUG_ON(count != SWAP_HAS_CACHE);
+	p->swap_map[offset] = 0;
+	dec_cluster_info_page(p, p->cluster_info, offset);
 	unlock_cluster(ci);
 
-	/* free if no reference */
-	if (!usage) {
-		VM_BUG_ON(!swap_info_locked);
-		mem_cgroup_uncharge_swap(entry);
-		ci = lock_cluster(p, offset);
-		dec_cluster_info_page(p, p->cluster_info, offset);
-		unlock_cluster(ci);
-		if (offset < p->lowest_bit)
-			p->lowest_bit = offset;
-		if (offset > p->highest_bit) {
-			bool was_full = !p->highest_bit;
-			p->highest_bit = offset;
-			if (was_full && (p->flags & SWP_WRITEOK)) {
-				spin_lock(&swap_avail_lock);
-				WARN_ON(!plist_node_empty(&p->avail_list));
-				if (plist_node_empty(&p->avail_list))
-					plist_add(&p->avail_list,
-						  &swap_avail_head);
-				spin_unlock(&swap_avail_lock);
-			}
-		}
-		atomic_long_inc(&nr_swap_pages);
-		p->inuse_pages--;
-		frontswap_invalidate_page(p->type, offset);
-		if (p->flags & SWP_BLKDEV) {
-			struct gendisk *disk = p->bdev->bd_disk;
-			if (disk->fops->swap_slot_free_notify)
-				disk->fops->swap_slot_free_notify(p->bdev,
-								  offset);
+	mem_cgroup_uncharge_swap(entry);
+	if (offset < p->lowest_bit)
+		p->lowest_bit = offset;
+	if (offset > p->highest_bit) {
+		bool was_full = !p->highest_bit;
+
+		p->highest_bit = offset;
+		if (was_full && (p->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&p->avail_list));
+			if (plist_node_empty(&p->avail_list))
+				plist_add(&p->avail_list,
+					  &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
 		}
 	}
+	atomic_long_inc(&nr_swap_pages);
+	p->inuse_pages--;
+	frontswap_invalidate_page(p->type, offset);
+	if (p->flags & SWP_BLKDEV) {
+		struct gendisk *disk = p->bdev->bd_disk;
 
-	if (lock_swap_info)
-		spin_unlock(&p->lock);
-
-	return usage;
+		if (disk->fops->swap_slot_free_notify)
+			disk->fops->swap_slot_free_notify(p->bdev,
+							  offset);
+	}
 }
 
 /*
@@ -1052,8 +1057,10 @@ void swap_free(swp_entry_t entry)
 	struct swap_info_struct *p;
 
 	p = _swap_info_get(entry);
-	if (p)
-		swap_entry_free(p, entry, 1, false);
+	if (p) {
+		if (!__swap_entry_free(p, entry, 1))
+			swapcache_free_entries(&entry, 1);
+	}
 }
 
 /*
@@ -1064,8 +1071,32 @@ void swapcache_free(swp_entry_t entry)
 	struct swap_info_struct *p;
 
 	p = _swap_info_get(entry);
+	if (p) {
+		if (!__swap_entry_free(p, entry, SWAP_HAS_CACHE))
+			swapcache_free_entries(&entry, 1);
+	}
+}
+
+void swapcache_free_entries(swp_entry_t *entries, int n)
+{
+	struct swap_info_struct *p, *prev;
+	int i;
+
+	if (n <= 0)
+		return;
+
+	prev = NULL;
+	p = NULL;
+	for (i = 0; i < n; ++i) {
+		p = swap_info_get_cont(entries[i], prev);
+		if (p)
+			swap_entry_free(p, entries[i]);
+		else
+			break;
+		prev = p;
+	}
 	if (p)
-		swap_entry_free(p, entry, SWAP_HAS_CACHE, false);
+		spin_unlock(&p->lock);
 }
 
 /*
@@ -1234,21 +1265,23 @@ int free_swap_and_cache(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	struct page *page = NULL;
+	unsigned char count;
 
 	if (non_swap_entry(entry))
 		return 1;
 
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, entry, 1, true) == SWAP_HAS_CACHE) {
+		count = __swap_entry_free(p, entry, 1);
+		if (count == SWAP_HAS_CACHE) {
 			page = find_get_page(swap_address_space(entry),
 					     swp_offset(entry));
 			if (page && !trylock_page(page)) {
 				put_page(page);
 				page = NULL;
 			}
-		}
-		spin_unlock(&p->lock);
+		} else if (!count)
+			swapcache_free_entries(&entry, 1);
 	}
 	if (page) {
 		/*
-- 
2.5.5


* [PATCH v4 7/9] mm/swap: Add cache for swap slots allocation
  2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (5 preceding siblings ...)
  2016-12-09 21:09 ` [PATCH v4 6/9] mm/swap: Free swap slots in batch Tim Chen
@ 2016-12-09 21:09 ` Tim Chen
  2016-12-09 21:09 ` [PATCH v4 8/9] mm/swap: Enable swap slots cache usage Tim Chen
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2016-12-09 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

We add per-cpu caches for swap slots that can be allocated and freed
quickly without the need to touch the swap info lock.

Two separate caches are maintained, one for swap slots allocated and
one for swap slots returned.  This allows the returned swap slots to be
handed back to the global pool in a batch so they have a chance to be
coalesced with other slots in a cluster.  We do not reuse the returned
slots right away, as doing so may increase fragmentation of the slots.

The swap allocation cache is protected by a mutex as we may sleep
when searching for empty slots in the cache.  The swap free cache
is protected by a spin lock as we cannot sleep in the free path.

We refill the swap slots cache when we run out of slots, and we
disable the swap slots cache and drain the slots if the global
number of slots falls below a low watermark threshold.  We re-enable
the cache again when the number of available slots rises above a high
watermark.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h       |   4 +
 include/linux/swap_slots.h |  28 ++++
 mm/Makefile                |   2 +-
 mm/swap_slots.c            | 364 +++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c            |   1 +
 mm/swapfile.c              |  26 ++--
 6 files changed, 413 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/swap_slots.h
 create mode 100644 mm/swap_slots.c

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f2bb6ac..e82593a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -398,6 +398,7 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
+extern bool has_usable_swap(void);
 
 /* Swap 50% full? Release swapcache more aggressively.. */
 static inline bool vm_swap_full(void)
@@ -436,6 +437,9 @@ struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
 
+extern int get_swap_slots(int n, swp_entry_t *slots);
+extern void swapcache_free_batch(swp_entry_t *entries, int n);
+
 #else /* CONFIG_SWAP */
 
 #define swap_address_space(entry)		(NULL)
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
new file mode 100644
index 0000000..a59e6e2
--- /dev/null
+++ b/include/linux/swap_slots.h
@@ -0,0 +1,28 @@
+#ifndef _LINUX_SWAP_SLOTS_H
+#define _LINUX_SWAP_SLOTS_H
+
+#include <linux/swap.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+
+#define SWAP_SLOTS_CACHE_SIZE			SWAP_BATCH
+#define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE	(5*SWAP_SLOTS_CACHE_SIZE)
+#define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE	(2*SWAP_SLOTS_CACHE_SIZE)
+
+struct swap_slots_cache {
+	bool		lock_initialized;
+	struct mutex	alloc_lock;
+	swp_entry_t	*slots;
+	int		nr;
+	int		cur;
+	spinlock_t	free_lock;
+	swp_entry_t	*slots_ret;
+	int		n_ret;
+};
+
+void disable_swap_slots_cache_lock(void);
+void reenable_swap_slots_cache_unlock(void);
+int enable_swap_slots_cache(void);
+int free_swap_slot(swp_entry_t entry);
+
+#endif /* _LINUX_SWAP_SLOTS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a..433eaf9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -35,7 +35,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o vmacache.o \
+			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
new file mode 100644
index 0000000..8da25df
--- /dev/null
+++ b/mm/swap_slots.c
@@ -0,0 +1,364 @@
+/*
+ * Manage cache of swap slots to be used for and returned from
+ * swap.
+ *
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * Author: Tim Chen <tim.c.chen@linux.intel.com>
+ *
+ * We allocate the swap slots from the global pool and put
+ * them into local per cpu caches.  This has the advantage
+ * of not needing to acquire the swap_info lock every time
+ * we need a new slot.
+ *
+ * There is also the opportunity to simply return a slot
+ * to the local caches without needing to acquire the swap_info
+ * lock.  We do not reuse the returned slots directly but
+ * move them back to the global pool in a batch.  This
+ * allows the slots to coalesce and reduces fragmentation.
+ *
+ * The allocated swap entry is marked with the SWAP_HAS_CACHE
+ * flag in its swap_map count, which prevents it from being
+ * allocated again from the global pool.
+ *
+ * The swap slots cache is protected by a mutex instead of
+ * a spin lock as when we search for slots with scan_swap_map,
+ * we can possibly sleep.
+ */
+
+#include <linux/swap_slots.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/vmalloc.h>
+#include <linux/mutex.h>
+
+#ifdef CONFIG_SWAP
+
+static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);
+static bool	swap_slot_cache_active;
+static bool	swap_slot_cache_enabled;
+static bool	swap_slot_cache_initialized;
+DEFINE_MUTEX(swap_slots_cache_mutex);
+/* Serialize swap slots cache enable/disable operations */
+DEFINE_MUTEX(swap_slots_cache_enable_mutex);
+
+static void __drain_swap_slots_cache(unsigned int type);
+static void deactivate_swap_slots_cache(void);
+static void reactivate_swap_slots_cache(void);
+
+#define use_swap_slot_cache (swap_slot_cache_active && \
+		swap_slot_cache_enabled && swap_slot_cache_initialized)
+#define SLOTS_CACHE 0x1
+#define SLOTS_CACHE_RET 0x2
+
+static void deactivate_swap_slots_cache(void)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	swap_slot_cache_active = false;
+	__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+static void reactivate_swap_slots_cache(void)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	swap_slot_cache_active = true;
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+/* Must not be called with cpu hot plug lock */
+void disable_swap_slots_cache_lock(void)
+{
+	mutex_lock(&swap_slots_cache_enable_mutex);
+	swap_slot_cache_enabled = false;
+	if (swap_slot_cache_initialized) {
+		/* serialize with cpu hotplug operations */
+		get_online_cpus();
+		__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+		put_online_cpus();
+	}
+}
+
+static void __reenable_swap_slots_cache(void)
+{
+	swap_slot_cache_enabled = has_usable_swap();
+}
+
+void reenable_swap_slots_cache_unlock(void)
+{
+	__reenable_swap_slots_cache();
+	mutex_unlock(&swap_slots_cache_enable_mutex);
+}
+
+static bool check_cache_active(void)
+{
+	long pages;
+
+	if (!swap_slot_cache_enabled || !swap_slot_cache_initialized)
+		return false;
+
+	pages = get_nr_swap_pages();
+	if (!swap_slot_cache_active) {
+		if (pages > num_online_cpus() *
+		    THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE)
+			reactivate_swap_slots_cache();
+		goto out;
+	}
+
+	/* if global pool of slot caches too low, deactivate cache */
+	if (pages < num_online_cpus() * THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE)
+		deactivate_swap_slots_cache();
+out:
+	return swap_slot_cache_active;
+}
+
+static int alloc_swap_slot_cache(int cpu)
+{
+	struct swap_slots_cache *cache;
+	swp_entry_t *slots, *slots_ret;
+
+	/*
+	 * Do allocation outside swap_slots_cache_mutex
+	 * as vzalloc could trigger reclaim and get_swap_page,
+	 * which can lock swap_slots_cache_mutex.
+	 */
+	slots = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+	if (!slots) {
+		return -ENOMEM;
+	}
+	slots_ret = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+	if (!slots_ret) {
+		vfree(slots);
+		return -ENOMEM;
+	}
+
+	mutex_lock(&swap_slots_cache_mutex);
+	cache = &per_cpu(swp_slots, cpu);
+	if (cache->slots || cache->slots_ret)
+		/* cache already allocated */
+		goto out;
+	if (!cache->lock_initialized) {
+		mutex_init(&cache->alloc_lock);
+		spin_lock_init(&cache->free_lock);
+		cache->lock_initialized = true;
+	}
+	cache->nr = 0;
+	cache->cur = 0;
+	cache->n_ret = 0;
+	cache->slots = slots;
+	slots = NULL;
+	cache->slots_ret = slots_ret;
+	slots_ret = NULL;
+out:
+	mutex_unlock(&swap_slots_cache_mutex);
+	if (slots)
+		vfree(slots);
+	if (slots_ret)
+		vfree(slots_ret);
+	return 0;
+}
+
+static void drain_slots_cache_cpu(int cpu, unsigned int type, bool free_slots)
+{
+	struct swap_slots_cache *cache;
+	swp_entry_t *slots = NULL;
+
+	cache = &per_cpu(swp_slots, cpu);
+	if ((type & SLOTS_CACHE) && cache->slots) {
+		mutex_lock(&cache->alloc_lock);
+		swapcache_free_entries(cache->slots + cache->cur, cache->nr);
+		cache->cur = 0;
+		cache->nr = 0;
+		if (free_slots && cache->slots) {
+			vfree(cache->slots);
+			cache->slots = NULL;
+		}
+		mutex_unlock(&cache->alloc_lock);
+	}
+	if ((type & SLOTS_CACHE_RET) && cache->slots_ret) {
+		spin_lock_irq(&cache->free_lock);
+		swapcache_free_entries(cache->slots_ret, cache->n_ret);
+		cache->n_ret = 0;
+		if (free_slots && cache->slots_ret) {
+			slots = cache->slots_ret;
+			cache->slots_ret = NULL;
+		}
+		spin_unlock_irq(&cache->free_lock);
+		if (slots)
+			vfree(slots);
+	}
+}
+
+static void __drain_swap_slots_cache(unsigned int type)
+{
+	int cpu;
+
+	/*
+	 * This function is called during
+	 * 	1) swapoff, when we have to make sure no
+	 * 	   left over slots are in cache when we remove
+	 * 	   a swap device;
+	 *      2) disabling of swap slot cache, when we run low
+	 *         on swap slots when allocating memory and need
+	 *         to return swap slots to global pool.
+	 *
+	 * We cannot acquire cpu hot plug lock here as
+	 * this function can be invoked in the cpu
+	 * hot plug path:
+	 * cpu_up -> lock cpu_hotplug -> smpboot_create_threads
+	 *   -> kmem_cache_alloc -> direct reclaim -> get_swap_page
+	 *   -> drain_swap_slots_cache
+	 *
+	 * Hence the loop over current online cpu below could miss cpu that
+	 * is being brought online but not yet marked as online.
+	 * That is okay as we do not schedule and run anything on a
+	 * cpu before it has been marked online. Hence, we will not
+	 * fill any swap slots in slots cache of such cpu.
+	 * There are no slots on such cpu that need to be drained.
+	 */
+	for_each_online_cpu(cpu)
+		drain_slots_cache_cpu(cpu, type, false);
+}
+
+static void free_slot_cache(int cpu)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	drain_slots_cache_cpu(cpu, SLOTS_CACHE | SLOTS_CACHE_RET, true);
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+static int swap_cache_callback(struct notifier_block *nfb,
+			unsigned long action, void *hcpu)
+{
+	int cpu = (long)hcpu;
+
+	switch (action) {
+	case CPU_DOWN_PREPARE:
+			free_slot_cache(cpu);
+			break;
+	case CPU_DOWN_FAILED:
+	case CPU_ONLINE:
+			alloc_swap_slot_cache(cpu);
+			break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block swap_cache_notifier = {
+	.notifier_call = swap_cache_callback,
+};
+
+int enable_swap_slots_cache(void)
+{
+	int	i, j;
+
+	mutex_lock(&swap_slots_cache_enable_mutex);
+	if (swap_slot_cache_initialized) {
+		__reenable_swap_slots_cache();
+		goto out_unlock;
+	}
+
+	cpu_notifier_register_begin();
+	for_each_online_cpu(i) {
+		if (alloc_swap_slot_cache(i))
+			goto fail;
+	}
+	swap_slot_cache_initialized = true;
+	__reenable_swap_slots_cache();
+	__register_hotcpu_notifier(&swap_cache_notifier);
+	cpu_notifier_register_done();
+out_unlock:
+	mutex_unlock(&swap_slots_cache_enable_mutex);
+	return 0;
+fail:
+	for_each_online_cpu(j) {
+		if (j == i)
+			break;
+		free_slot_cache(j);
+	}
+	cpu_notifier_register_done();
+	swap_slot_cache_initialized = false;
+	mutex_unlock(&swap_slots_cache_enable_mutex);
+	return -ENOMEM;
+}
+
+/* called with swap slot cache's alloc lock held */
+static int refill_swap_slots_cache(struct swap_slots_cache *cache)
+{
+	if (!use_swap_slot_cache || cache->nr)
+		return 0;
+
+	cache->cur = 0;
+	if (swap_slot_cache_active)
+		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots);
+
+	return cache->nr;
+}
+
+int free_swap_slot(swp_entry_t entry)
+{
+	struct swap_slots_cache *cache;
+
+	BUG_ON(!swap_slot_cache_initialized);
+
+	cache = &get_cpu_var(swp_slots);
+	if (use_swap_slot_cache && cache->slots_ret) {
+		spin_lock_irq(&cache->free_lock);
+		/* Swap slots cache may be deactivated before acquiring lock */
+		if (!use_swap_slot_cache) {
+			spin_unlock_irq(&cache->free_lock);
+			goto direct_free;
+		}
+		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
+			/*
+			 * Return slots to global pool.
+			 * The current swap_map value is SWAP_HAS_CACHE.
+			 * Set it to 0 to indicate it is available for
+			 * allocation in global pool
+			 */
+			swapcache_free_entries(cache->slots_ret, cache->n_ret);
+			cache->n_ret = 0;
+		}
+		cache->slots_ret[cache->n_ret++] = entry;
+		spin_unlock_irq(&cache->free_lock);
+	} else
+direct_free:
+		swapcache_free_entries(&entry, 1);
+	put_cpu_var(swp_slots);
+
+	return 0;
+}
+
+swp_entry_t get_swap_page(void)
+{
+	swp_entry_t entry, *pentry;
+	struct swap_slots_cache *cache;
+
+	cache = this_cpu_ptr(&swp_slots);
+
+	if (check_cache_active()) {
+		entry.val = 0;
+		mutex_lock(&cache->alloc_lock);
+		if (cache->slots) {
+repeat:
+			if (cache->nr) {
+				pentry = &cache->slots[cache->cur++];
+				entry = *pentry;
+				pentry->val = 0;
+				cache->nr--;
+			} else {
+				if (refill_swap_slots_cache(cache))
+					goto repeat;
+			}
+		}
+		mutex_unlock(&cache->alloc_lock);
+		if (entry.val)
+			return entry;
+	}
+
+	get_swap_pages(1, &entry);
+
+	return entry;
+}
+
+#endif /* CONFIG_SWAP */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3d76d80..e1f07ca 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
+#include <linux/swap_slots.h>
 
 #include <asm/pgtable.h>
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 172fd36..3a6cad1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -34,6 +34,7 @@
 #include <linux/frontswap.h>
 #include <linux/swapfile.h>
 #include <linux/export.h>
+#include <linux/swap_slots.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -861,14 +862,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 	return n_ret;
 }
 
-swp_entry_t get_swap_page(void)
-{
-	swp_entry_t entry;
-
-	get_swap_pages(1, &entry);
-	return entry;
-}
-
 /* The only caller of this function is now suspend routine */
 swp_entry_t get_swap_page_of_type(int type)
 {
@@ -1059,7 +1052,7 @@ void swap_free(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		if (!__swap_entry_free(p, entry, 1))
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot(entry);
 	}
 }
 
@@ -1073,7 +1066,7 @@ void swapcache_free(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		if (!__swap_entry_free(p, entry, SWAP_HAS_CACHE))
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot(entry);
 	}
 }
 
@@ -1281,7 +1274,7 @@ int free_swap_and_cache(swp_entry_t entry)
 				page = NULL;
 			}
 		} else if (!count)
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot(entry);
 	}
 	if (page) {
 		/*
@@ -2110,6 +2103,17 @@ static void reinsert_swap_info(struct swap_info_struct *p)
 	spin_unlock(&swap_lock);
 }
 
+bool has_usable_swap(void)
+{
+	bool ret = true;
+
+	spin_lock(&swap_lock);
+	if (plist_head_empty(&swap_active_head))
+		ret = false;
+	spin_unlock(&swap_lock);
+	return ret;
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 8/9] mm/swap: Enable swap slots cache usage
  2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (6 preceding siblings ...)
  2016-12-09 21:09 ` [PATCH v4 7/9] mm/swap: Add cache for swap slots allocation Tim Chen
@ 2016-12-09 21:09 ` Tim Chen
  2016-12-09 21:09 ` [PATCH v4 9/9] mm/swap: Skip readahead only when swap slot cache is enabled Tim Chen
  2016-12-27  7:45 ` [PATCH v4 0/9] mm/swap: Regular page swap optimizations Minchan Kim
  9 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2016-12-09 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

Initialize the swap slots cache and enable it during swapon.
Drain all swap slots during swapoff.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/swapfile.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3a6cad1..2440cec 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2184,6 +2184,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 
+	disable_swap_slots_cache_lock();
+
 	set_current_oom_origin();
 	err = try_to_unuse(p->type, false, 0); /* force unuse all pages */
 	clear_current_oom_origin();
@@ -2191,9 +2193,12 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (err) {
 		/* re-insert swap space back into swap_list */
 		reinsert_swap_info(p);
+		reenable_swap_slots_cache_unlock();
 		goto out_dput;
 	}
 
+	reenable_swap_slots_cache_unlock();
+
 	flush_work(&p->discard_work);
 
 	destroy_swap_extents(p);
@@ -2871,6 +2876,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		putname(name);
 	if (inode && S_ISREG(inode->i_mode))
 		inode_unlock(inode);
+	if (!error)
+		enable_swap_slots_cache();
 	return error;
 }
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 9/9] mm/swap: Skip readahead only when swap slot cache is enabled
  2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (7 preceding siblings ...)
  2016-12-09 21:09 ` [PATCH v4 8/9] mm/swap: Enable swap slots cache usage Tim Chen
@ 2016-12-09 21:09 ` Tim Chen
  2016-12-27  7:45 ` [PATCH v4 0/9] mm/swap: Regular page swap optimizations Minchan Kim
  9 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2016-12-09 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim Chen

From: Huang Ying <ying.huang@intel.com>

During swap off, a swap entry may have swap_map[] == SWAP_HAS_CACHE
(for example, when it has just been allocated) while the corresponding
page is not yet in the swap cache.  If __read_swap_cache_async() returns
NULL in that window, swap off will abort.  So when the swap slot cache
is disabled (as it is during swap off), we wait for the page to be put
into the swap cache in such a race condition.  This should not be a
problem when the swap slot cache is in use, because the cache is drained
after swap_slot_cache_enabled is cleared.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap_slots.h |  2 ++
 mm/swap_slots.c            |  2 +-
 mm/swap_state.c            | 11 +++++++++--
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index a59e6e2..fb90734 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -25,4 +25,6 @@ void reenable_swap_slots_cache_unlock(void);
 int enable_swap_slots_cache(void);
 int free_swap_slot(swp_entry_t entry);
 
+extern bool swap_slot_cache_enabled;
+
 #endif /* _LINUX_SWAP_SLOTS_H */
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 8da25df..b53c149 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -36,7 +36,7 @@
 
 static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);
 static bool	swap_slot_cache_active;
-static bool	swap_slot_cache_enabled;
+bool	swap_slot_cache_enabled;
 static bool	swap_slot_cache_initialized;
 DEFINE_MUTEX(swap_slots_cache_mutex);
 /* Serialize swap slots cache enable/disable operations */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e1f07ca..ef14f42 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -324,8 +324,15 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		if (found_page)
 			break;
 
-		/* Just skip read ahead for unused swap slot */
-		if (!__swp_swapcount(entry))
+		/*
+		 * Just skip read ahead for unused swap slot.
+		 * During swap_off, when swap_slot_cache is disabled,
+		 * we have to handle the race between putting the swap
+		 * entry into the swap cache and marking the swap slot
+		 * as SWAP_HAS_CACHE.  That's done later in the code;
+		 * otherwise swap_off would be aborted if we return NULL.
+		 */
+		if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
 			return NULL;
 
 		/*
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (8 preceding siblings ...)
  2016-12-09 21:09 ` [PATCH v4 9/9] mm/swap: Skip readahead only when swap slot cache is enabled Tim Chen
@ 2016-12-27  7:45 ` Minchan Kim
  2016-12-28  1:54   ` Huang, Ying
                     ` (2 more replies)
  9 siblings, 3 replies; 25+ messages in thread
From: Minchan Kim @ 2016-12-27  7:45 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

Hi,

On Fri, Dec 09, 2016 at 01:09:13PM -0800, Tim Chen wrote:
> Change Log:
> v4:
> 1. Fix a bug in unlock cluster in add_swap_count_continuation(). We
> should use unlock_cluster() instead of unlock_cluser_or_swap_info().
> 2. During swap off, handle race when swap slot is marked unused but allocated,
> and not yet placed in swap cache.  Wait for swap slot to be placed in swap cache
> and not abort swap off.
> 3. Initialize n_ret in get_swap_pages().
> 
> v3:
> 1. Fix bug that didn't check for page already in swap cache before skipping
> read ahead and return null page.
> 2. Fix bug that didn't try to allocate from global pool if allocation
> from swap slot cache did not succeed.
> 3. Fix memory allocation bug for spaces to store split up 64MB radix tree
> 4. Fix problems caused by races between get_swap_page, cpu online/offline and
> swap_on/off
> 
> v2: 
> 1. Fix bug in the index limit used in scan_swap_map_try_ssd_cluster
> when searching for empty slots in cluster.
> 2. Fix bug in swap off that incorrectly determines if we still have
> swap devices left.
> 3. Port patches to mmotm-2016-10-11-15-46 branch
> 
> Andrew,
> 
> We're updating this patch series with some minor fixes.
> Please consider this patch series for inclusion to 4.10.
>  
> Times have changed.  Coming generation of Solid state Block device
> latencies are getting down to sub 100 usec, which is within an order of
> magnitude of DRAM, and their performance is orders of magnitude higher
> than the single- spindle rotational media we've swapped to historically.
> 
> This could benefit many usage scenearios.  For example cloud providers who
> overcommit their memory (as VM don't use all the memory provisioned).
> Having a fast swap will allow them to be more aggressive in memory
> overcommit and fit more VMs to a platform.
> 
> In our testing [see footnote], the median latency that the
> kernel adds to a page fault is 15 usec, which comes quite close
> to the amount that will be contributed by the underlying I/O
> devices.
> 
> The software latency comes mostly from contentions on the locks
> protecting the radix tree of the swap cache and also the locks protecting
> the individual swap devices.  The lock contentions already consumed
> 35% of cpu cycles in our test.  In the very near future,
> software latency will become the bottleneck to swap performnace as
> block device I/O latency gets within the shouting distance of DRAM speed.
> 
> This patch set, plus a previous patch Ying already posted
> (commit: f6498b3f) reduced the median page fault latency
> from 15 usec to 4 usec (375% reduction) for DRAM based pmem
> block device.

The patchset has used several techniques to reduce lock contention, for example,
batching alloc/free, fine-grained locks and cluster distribution to avoid cache
false-sharing. Each item has different complexity and benefits, so could you
show the numbers for each step of the patchset? It would be better to include
the numbers in each description. It helps show how important a patch is when we
consider the complexity of the patch.

> 
> Patch 1 is a clean up patch.

Could it be a separate patch?

> Patch 2 creates a lock per cluster, this gives us a more fine graind lock
>         that can be used for accessing swap_map, and not lock the whole
>         swap device

I hope you split this into three steps to make review easier. You can create some functions like
swap_map_lock and cluster_lock, wrapper functions that just hold swap_lock.
That doesn't change anything performance-wise, but it clearly shows what kind of lock
we should use in each specific context.

Then, you can introduce the more fine-grained lock in the next patch and apply it inside
those wrapper functions.
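
A minimal sketch of that wrapper idea, assuming the existing
struct swap_info_struct and its per-device lock; this is an illustration
of the suggested refactoring, not code from the series:

/*
 * Step 1: wrappers that still take today's coarse per-device lock.
 * (swap_info_struct and si->lock are the existing mm/swapfile.c fields.)
 */
static inline void swap_map_lock(struct swap_info_struct *si,
				 unsigned long offset)
{
	spin_lock(&si->lock);		/* same lock as before, now hidden */
}

static inline void swap_map_unlock(struct swap_info_struct *si,
				   unsigned long offset)
{
	spin_unlock(&si->lock);
}

/*
 * Step 2 (a later patch) would change only the wrapper bodies to take
 * the lock of the cluster covering @offset, leaving callers untouched.
 */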

And in the last patch, you can adjust the cluster distribution to avoid false-sharing.
The description should include how bad the false-sharing is in testing, so the change is shown to be worthwhile.

Frankly speaking, although I'm a huge user of bit_spin_lock (zram/zsmalloc
use it heavily), I don't like the swap subsystem using it.
During zram development, it really hurt debugging due to losing lockdep.
The reason zram uses it is the size concerns of the embedded world, but for servers
that would not be critical, so please consider the trade-off of spinlock vs. bit_spin_lock.

> Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
>         the rate that we have to contende for the radix tree.

To me, it's rather hacky. I think it might be a common problem for the page cache,
so can we think of a more generalized way like range_lock? Ccing Jan.

> Patch 4 eliminates unnecessary page allocation for read ahead.

Could it be a separate patch?

> Patch 5-9 create a per cpu cache of the swap slots, so we don't have
>         to contend on the swap device to get a swap slot or to release
>         a swap slot.  And we allocate and release the swap slots
>         in batches for better efficiency.


To me, the idea is good, although I feel the amount of code is rather large and
messy, so it should include numbers showing the benefit, at least.

And putting this batching ahead of the other patches might make some of the
patches in this patchset redundant.

Sorry for the vague comments. At this stage, it's really hard to review.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2016-12-27  7:45 ` [PATCH v4 0/9] mm/swap: Regular page swap optimizations Minchan Kim
@ 2016-12-28  1:54   ` Huang, Ying
  2016-12-28  2:37     ` Minchan Kim
  2017-01-02 15:48   ` Jan Kara
  2017-01-05  1:33   ` Huang, Ying
  2 siblings, 1 reply; 25+ messages in thread
From: Huang, Ying @ 2016-12-28  1:54 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tim Chen, Andrew Morton, Ying Huang, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

Hi, Minchan,

Minchan Kim <minchan@kernel.org> writes:

> Hi,
>
> On Fri, Dec 09, 2016 at 01:09:13PM -0800, Tim Chen wrote:
>> Change Log:
>> v4:
>> 1. Fix a bug in unlock cluster in add_swap_count_continuation(). We
>> should use unlock_cluster() instead of unlock_cluser_or_swap_info().
>> 2. During swap off, handle race when swap slot is marked unused but allocated,
>> and not yet placed in swap cache.  Wait for swap slot to be placed in swap cache
>> and not abort swap off.
>> 3. Initialize n_ret in get_swap_pages().
>> 
>> v3:
>> 1. Fix bug that didn't check for page already in swap cache before skipping
>> read ahead and return null page.
>> 2. Fix bug that didn't try to allocate from global pool if allocation
>> from swap slot cache did not succeed.
>> 3. Fix memory allocation bug for spaces to store split up 64MB radix tree
>> 4. Fix problems caused by races between get_swap_page, cpu online/offline and
>> swap_on/off
>> 
>> v2: 
>> 1. Fix bug in the index limit used in scan_swap_map_try_ssd_cluster
>> when searching for empty slots in cluster.
>> 2. Fix bug in swap off that incorrectly determines if we still have
>> swap devices left.
>> 3. Port patches to mmotm-2016-10-11-15-46 branch
>> 
>> Andrew,
>> 
>> We're updating this patch series with some minor fixes.
>> Please consider this patch series for inclusion to 4.10.
>>  
>> Times have changed.  Coming generation of Solid state Block device
>> latencies are getting down to sub 100 usec, which is within an order of
>> magnitude of DRAM, and their performance is orders of magnitude higher
>> than the single- spindle rotational media we've swapped to historically.
>> 
>> This could benefit many usage scenearios.  For example cloud providers who
>> overcommit their memory (as VM don't use all the memory provisioned).
>> Having a fast swap will allow them to be more aggressive in memory
>> overcommit and fit more VMs to a platform.
>> 
>> In our testing [see footnote], the median latency that the
>> kernel adds to a page fault is 15 usec, which comes quite close
>> to the amount that will be contributed by the underlying I/O
>> devices.
>> 
>> The software latency comes mostly from contentions on the locks
>> protecting the radix tree of the swap cache and also the locks protecting
>> the individual swap devices.  The lock contentions already consumed
>> 35% of cpu cycles in our test.  In the very near future,
>> software latency will become the bottleneck to swap performnace as
>> block device I/O latency gets within the shouting distance of DRAM speed.
>> 
>> This patch set, plus a previous patch Ying already posted
>> (commit: f6498b3f) reduced the median page fault latency
>> from 15 usec to 4 usec (375% reduction) for DRAM based pmem
>> block device.
>
> The patchset has used several techniqueus to reduce lock contention, for example,
> batching alloc/free, fine-grained lock and cluster distribution to avoid cache
> false-sharing. Each items has different complexity and benefits so could you
> show the number for each step of pathchset? It would be better to include the
> nubmer in each description. It helps how the patch is important when we consider
> complexitiy of the patch.

One common problem of scalability optimization is that, after you have
optimized one lock, the end result may not be very good, because another
lock becomes heavily contended.  A similar problem occurs here: there are
mainly two locks during swap out/in, one protecting the swap cache and the
other protecting the swap device.  We can achieve good scalability only
after having optimized both locks.

You cannot say that one patch is not important just because the test
result for that single patch is not very good, because without it
the end result of the whole series will not be very good.

>> 
>> Patch 1 is a clean up patch.
>
> Could it be separated patch?
>
>> Patch 2 creates a lock per cluster, this gives us a more fine graind lock
>>         that can be used for accessing swap_map, and not lock the whole
>>         swap device
>
> I hope you make three steps to review easier. You can create some functions like
> swap_map_lock and cluster_lock which are wrapper functions just hold swap_lock.
> It doesn't change anything performance pov but it clearly shows what kinds of lock
> we should use in specific context.
>
> Then, you can introduce more fine-graind lock in next patch and apply it into
> those wrapper functions.
>
> And last patch, you can adjust cluster distribution to avoid false-sharing.
> And the description should include how it's bad in testing so it's worth.
>
> Frankly speaking, although I'm huge user of bit_spin_lock(zram/zsmalloc
> have used it heavily), I don't like swap subsystem uses it.
> During zram development, it really hurts debugging due to losing lockdep.
> The reason zram have used it is by size concern of embedded world but server
> would be not critical so please consider trade-off of spinlock vs. bit_spin_lock.

There will be one struct swap_cluster_info for every 1MB of swap space.
So, for example, for 1TB of swap space, the number of struct
swap_cluster_info will be one million.  To reduce the RAM usage, we
chose to use bit_spin_lock; otherwise, spinlock would be better.  The code
will be used on embedded, PC and server systems, so the RAM usage is important.
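
To make the trade-off concrete, a rough sketch of the two layouts (the
field names and layout below are an assumption for illustration, not the
exact definition used in this series):

/* bit_spin_lock variant: the lock is one bit inside the existing flags */
struct swap_cluster_info {
	unsigned int data:24;	/* free count or next-cluster index */
	unsigned int flags:8;	/* CLUSTER_FLAG_*, including a lock bit */
};

/* spinlock variant: embeds a spinlock_t in every cluster */
struct swap_cluster_info_spin {
	spinlock_t lock;	/* grows further with lockdep/debug options */
	unsigned int data:24;
	unsigned int flags:8;
};

/*
 * With one cluster per 1MB of swap, 1TB of swap means ~1 million of
 * these, so every extra word per cluster costs megabytes of RAM.
 */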

Best Regards,
Huang, Ying

>> Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
>>         the rate that we have to contende for the radix tree.
>
> To me, it's rather hacky. I think it might be common problem for page cache
> so can we think another generalized way like range_lock? Ccing Jan.
>
>> Patch 4 eliminates unnecessary page allocation for read ahead.
>
> Could it be separated patch?
>
>> Patch 5-9 create a per cpu cache of the swap slots, so we don't have
>>         to contend on the swap device to get a swap slot or to release
>>         a swap slot.  And we allocate and release the swap slots
>>         in batches for better efficiency.
>
>
> To me, idea is good although I feel the amount of code is rather huge and
> messy so it should include the number about the benefit, at least.
>
> And it might make some of patches in this patchset if we put this batching
> ahead before other patches redundant.
>
> Sorry for vague commenting. In this phase, it's really hard to review.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2016-12-28  1:54   ` Huang, Ying
@ 2016-12-28  2:37     ` Minchan Kim
  2016-12-28  3:15       ` Huang, Ying
  0 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2016-12-28  2:37 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Tim Chen, Andrew Morton, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

Hi Huang,

On Wed, Dec 28, 2016 at 09:54:27AM +0800, Huang, Ying wrote:

< snip >

> > The patchset has used several techniqueus to reduce lock contention, for example,
> > batching alloc/free, fine-grained lock and cluster distribution to avoid cache
> > false-sharing. Each items has different complexity and benefits so could you
> > show the number for each step of pathchset? It would be better to include the
> > nubmer in each description. It helps how the patch is important when we consider
> > complexitiy of the patch.
> 
> One common problem of scalability optimization is that, after you have
> optimized one lock, the end result may be not very good, because another
> lock becomes heavily contended.  Similar problem occurs here, there are
> mainly two locks during swap out/in, one protects swap cache, the other
> protects swap device.  We can achieve good scalability only after having
> optimized the two locks.

Yes. You can describe that situation in the description. For example,
"with this patch, we see less swap_lock contention with perf, but
overall performance is still not good because the swap cache lock is still
heavily contended, as the data below shows, so the next patch will solve that problem".

It will make the patch's justification clear.

> 
> You cannot say that one patch is not important just because the test
> result for that single patch is not very good.  Because without that,
> the end result of the whole series will be not very good.

I know that, but this patchset lacks far too many numbers to justify
each piece of work. You can show just the raw numbers for a technique
even if the benefit is not huge, or is even negative. You can explain the
reason why it was not good, which would be enough motivation for the next patch.

The numbers themselves aren't what matters, but justification is really crucial
for reviewing/merging a patchset, and numbers help a lot with that, especially
in the MM community.

> 
> >> 
> >> Patch 1 is a clean up patch.
> >
> > Could it be separated patch?
> >
> >> Patch 2 creates a lock per cluster, this gives us a more fine graind lock
> >>         that can be used for accessing swap_map, and not lock the whole
> >>         swap device
> >
> > I hope you make three steps to review easier. You can create some functions like
> > swap_map_lock and cluster_lock which are wrapper functions just hold swap_lock.
> > It doesn't change anything performance pov but it clearly shows what kinds of lock
> > we should use in specific context.
> >
> > Then, you can introduce more fine-graind lock in next patch and apply it into
> > those wrapper functions.
> >
> > And last patch, you can adjust cluster distribution to avoid false-sharing.
> > And the description should include how it's bad in testing so it's worth.
> >
> > Frankly speaking, although I'm huge user of bit_spin_lock(zram/zsmalloc
> > have used it heavily), I don't like swap subsystem uses it.
> > During zram development, it really hurts debugging due to losing lockdep.
> > The reason zram have used it is by size concern of embedded world but server
> > would be not critical so please consider trade-off of spinlock vs. bit_spin_lock.
> 
> There will be one struct swap_cluster_info for every 1MB swap space.
> So, for example, for 1TB swap space, the number of struct
> swap_cluster_info will be one million.  To reduce the RAM usage, we
> choose to use bit_spin_lock, otherwise, spinlock is better.  The code
> will be used by embedded, PC and server, so the RAM usage is important.

It seems you have already increased swap_cluster_info by 4 bytes to support
bit_spin_lock.
Compared to that, how much memory would spin_lock add?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2016-12-28  2:37     ` Minchan Kim
@ 2016-12-28  3:15       ` Huang, Ying
  2016-12-28  3:31         ` Huang, Ying
  0 siblings, 1 reply; 25+ messages in thread
From: Huang, Ying @ 2016-12-28  3:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Tim Chen, Andrew Morton, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

Minchan Kim <minchan@kernel.org> writes:

> Hi Huang,
>
> On Wed, Dec 28, 2016 at 09:54:27AM +0800, Huang, Ying wrote:
>
> < snip >
>
>> > The patchset has used several techniqueus to reduce lock contention, for example,
>> > batching alloc/free, fine-grained lock and cluster distribution to avoid cache
>> > false-sharing. Each items has different complexity and benefits so could you
>> > show the number for each step of pathchset? It would be better to include the
>> > nubmer in each description. It helps how the patch is important when we consider
>> > complexitiy of the patch.
>> 
>> One common problem of scalability optimization is that, after you have
>> optimized one lock, the end result may be not very good, because another
>> lock becomes heavily contended.  Similar problem occurs here, there are
>> mainly two locks during swap out/in, one protects swap cache, the other
>> protects swap device.  We can achieve good scalability only after having
>> optimized the two locks.
>
> Yes. You can describe that situation into the description. For example,
> "with this patch, we can watch less swap_lock contention with perf but
> overall performance is not good because swap cache lock still is still
> contended heavily like below data so next patch will solve the problem".
>
> It will make patch's justficiation clear.
>
>> 
>> You cannot say that one patch is not important just because the test
>> result for that single patch is not very good.  Because without that,
>> the end result of the whole series will be not very good.
>
> I know that but this patchset are lack of number too much to justify
> each works. You can show just raw number itself of a techniqueue
> although it is not huge benefit or even worse. You can explain the reason
> why it was not good, which would be enough motivation for next patch.
>
> Number itself wouldn't be important but justfication is really crucial
> to review/merge patchset and number will help it a lot in especially
> MM community.
>
>> 
>> >> 
>> >> Patch 1 is a clean up patch.
>> >
>> > Could it be separated patch?
>> >
>> >> Patch 2 creates a lock per cluster, this gives us a more fine graind lock
>> >>         that can be used for accessing swap_map, and not lock the whole
>> >>         swap device
>> >
>> > I hope you make three steps to review easier. You can create some functions like
>> > swap_map_lock and cluster_lock which are wrapper functions just hold swap_lock.
>> > It doesn't change anything performance pov but it clearly shows what kinds of lock
>> > we should use in specific context.
>> >
>> > Then, you can introduce more fine-graind lock in next patch and apply it into
>> > those wrapper functions.
>> >
>> > And last patch, you can adjust cluster distribution to avoid false-sharing.
>> > And the description should include how it's bad in testing so it's worth.
>> >
>> > Frankly speaking, although I'm huge user of bit_spin_lock(zram/zsmalloc
>> > have used it heavily), I don't like swap subsystem uses it.
>> > During zram development, it really hurts debugging due to losing lockdep.
>> > The reason zram have used it is by size concern of embedded world but server
>> > would be not critical so please consider trade-off of spinlock vs. bit_spin_lock.
>> 
>> There will be one struct swap_cluster_info for every 1MB swap space.
>> So, for example, for 1TB swap space, the number of struct
>> swap_cluster_info will be one million.  To reduce the RAM usage, we
>> choose to use bit_spin_lock, otherwise, spinlock is better.  The code
>> will be used by embedded, PC and server, so the RAM usage is important.
>
> It seems you already increase swap_cluster_info 4 byte to support
> bit_spin_lock.

The increase only occurs on 64-bit platforms.  On 32-bit platforms, the
size is the same as before.

> Compared to that, how much memory does spin_lock increase?

The size of struct swap_cluster_info will increase from 4 bytes to 16
bytes on 64-bit platforms.  I guess it will increase from 4 bytes to 8
bytes on 32-bit platforms at least, but I have not tested that.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2016-12-28  3:15       ` Huang, Ying
@ 2016-12-28  3:31         ` Huang, Ying
  2016-12-28  3:53           ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Huang, Ying @ 2016-12-28  3:31 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Minchan Kim, Tim Chen, Andrew Morton, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

"Huang, Ying" <ying.huang@intel.com> writes:

> Minchan Kim <minchan@kernel.org> writes:
>
>> Hi Huang,
>>
>> On Wed, Dec 28, 2016 at 09:54:27AM +0800, Huang, Ying wrote:
>>
>> < snip >
>>
>>> > The patchset has used several techniqueus to reduce lock contention, for example,
>>> > batching alloc/free, fine-grained lock and cluster distribution to avoid cache
>>> > false-sharing. Each items has different complexity and benefits so could you
>>> > show the number for each step of pathchset? It would be better to include the
>>> > nubmer in each description. It helps how the patch is important when we consider
>>> > complexitiy of the patch.
>>> 
>>> One common problem of scalability optimization is that, after you have
>>> optimized one lock, the end result may be not very good, because another
>>> lock becomes heavily contended.  Similar problem occurs here, there are
>>> mainly two locks during swap out/in, one protects swap cache, the other
>>> protects swap device.  We can achieve good scalability only after having
>>> optimized the two locks.
>>
>> Yes. You can describe that situation into the description. For example,
>> "with this patch, we can watch less swap_lock contention with perf but
>> overall performance is not good because swap cache lock still is still
>> contended heavily like below data so next patch will solve the problem".
>>
>> It will make patch's justficiation clear.
>>
>>> 
>>> You cannot say that one patch is not important just because the test
>>> result for that single patch is not very good.  Because without that,
>>> the end result of the whole series will be not very good.
>>
>> I know that but this patchset are lack of number too much to justify
>> each works. You can show just raw number itself of a techniqueue
>> although it is not huge benefit or even worse. You can explain the reason
>> why it was not good, which would be enough motivation for next patch.
>>
>> Number itself wouldn't be important but justfication is really crucial
>> to review/merge patchset and number will help it a lot in especially
>> MM community.
>>
>>> 
>>> >> 
>>> >> Patch 1 is a clean up patch.
>>> >
>>> > Could it be separated patch?
>>> >
>>> >> Patch 2 creates a lock per cluster, this gives us a more fine graind lock
>>> >>         that can be used for accessing swap_map, and not lock the whole
>>> >>         swap device
>>> >
>>> > I hope you make three steps to review easier. You can create some functions like
>>> > swap_map_lock and cluster_lock which are wrapper functions just hold swap_lock.
>>> > It doesn't change anything performance pov but it clearly shows what kinds of lock
>>> > we should use in specific context.
>>> >
>>> > Then, you can introduce more fine-graind lock in next patch and apply it into
>>> > those wrapper functions.
>>> >
>>> > And last patch, you can adjust cluster distribution to avoid false-sharing.
>>> > And the description should include how it's bad in testing so it's worth.
>>> >
>>> > Frankly speaking, although I'm huge user of bit_spin_lock(zram/zsmalloc
>>> > have used it heavily), I don't like swap subsystem uses it.
>>> > During zram development, it really hurts debugging due to losing lockdep.
>>> > The reason zram have used it is by size concern of embedded world but server
>>> > would be not critical so please consider trade-off of spinlock vs. bit_spin_lock.
>>> 
>>> There will be one struct swap_cluster_info for every 1MB swap space.
>>> So, for example, for 1TB swap space, the number of struct
>>> swap_cluster_info will be one million.  To reduce the RAM usage, we
>>> choose to use bit_spin_lock, otherwise, spinlock is better.  The code
>>> will be used by embedded, PC and server, so the RAM usage is important.
>>
>> It seems you already increase swap_cluster_info 4 byte to support
>> bit_spin_lock.
>
> The increment only occurs on 64bit platform.  On 32bit platform, the
> size is the same as before.
>
>> Compared to that, how much memory does spin_lock increase?
>
> The size of struct swap_cluster_info will increase from 4 bytes to 16
> bytes on 64bit platform.  I guess it will increase from 4 bytes to 8
> bytes on 32bit platform at least, but I did not test that.

Sorry, I made a mistake during testing.  The size of struct
swap_cluster_info will increase from 4 bytes to 8 bytes on 64-bit
platforms.  I think it will increase from 4 bytes to 8 bytes on 32-bit
platforms too (not tested).

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2016-12-28  3:31         ` Huang, Ying
@ 2016-12-28  3:53           ` Minchan Kim
  2016-12-28  4:56             ` Huang, Ying
  0 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2016-12-28  3:53 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Tim Chen, Andrew Morton, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

On Wed, Dec 28, 2016 at 11:31:06AM +0800, Huang, Ying wrote:

< snip >

> >>> > Frankly speaking, although I'm huge user of bit_spin_lock(zram/zsmalloc
> >>> > have used it heavily), I don't like swap subsystem uses it.
> >>> > During zram development, it really hurts debugging due to losing lockdep.
> >>> > The reason zram have used it is by size concern of embedded world but server
> >>> > would be not critical so please consider trade-off of spinlock vs. bit_spin_lock.
> >>> 
> >>> There will be one struct swap_cluster_info for every 1MB swap space.
> >>> So, for example, for 1TB swap space, the number of struct
> >>> swap_cluster_info will be one million.  To reduce the RAM usage, we
> >>> choose to use bit_spin_lock, otherwise, spinlock is better.  The code
> >>> will be used by embedded, PC and server, so the RAM usage is important.
> >>
> >> It seems you already increase swap_cluster_info 4 byte to support
> >> bit_spin_lock.
> >
> > The increment only occurs on 64bit platform.  On 32bit platform, the
> > size is the same as before.
> >
> >> Compared to that, how much memory does spin_lock increase?
> >
> > The size of struct swap_cluster_info will increase from 4 bytes to 16
> > bytes on 64bit platform.  I guess it will increase from 4 bytes to 8
> > bytes on 32bit platform at least, but I did not test that.
> 
> Sorry, I make a mistake during test.  The size of struct
> swap_cluster_info will increase from 4 bytes to 8 bytes on 64 bit
> platform.  I think it will increase from 4 bytes to 8 bytes on 32 bit
> platform too (not tested).

Thanks for the information.
To me, that cost is not big when we consider spinlock's usefulness: it helps with
cache-line bouncing and lockdep, and keeps the RT people happy.
So I vote for spin_lock, but I'm not in charge of deciding that, and your
opinion might still be different. If so, let's pass the decision to the
maintainer.
Either way, please write down the above in the description so the maintainer can
judge it fairly.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2016-12-28  3:53           ` Minchan Kim
@ 2016-12-28  4:56             ` Huang, Ying
  0 siblings, 0 replies; 25+ messages in thread
From: Huang, Ying @ 2016-12-28  4:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Tim Chen, Andrew Morton, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

Minchan Kim <minchan@kernel.org> writes:

> On Wed, Dec 28, 2016 at 11:31:06AM +0800, Huang, Ying wrote:
>
> < snip >
>
>> >>> > Frankly speaking, although I'm huge user of bit_spin_lock(zram/zsmalloc
>> >>> > have used it heavily), I don't like swap subsystem uses it.
>> >>> > During zram development, it really hurts debugging due to losing lockdep.
>> >>> > The reason zram have used it is by size concern of embedded world but server
>> >>> > would be not critical so please consider trade-off of spinlock vs. bit_spin_lock.
>> >>> 
>> >>> There will be one struct swap_cluster_info for every 1MB swap space.
>> >>> So, for example, for 1TB swap space, the number of struct
>> >>> swap_cluster_info will be one million.  To reduce the RAM usage, we
>> >>> choose to use bit_spin_lock, otherwise, spinlock is better.  The code
>> >>> will be used by embedded, PC and server, so the RAM usage is important.
>> >>
>> >> It seems you already increase swap_cluster_info 4 byte to support
>> >> bit_spin_lock.
>> >
>> > The increment only occurs on 64bit platform.  On 32bit platform, the
>> > size is the same as before.
>> >
>> >> Compared to that, how much memory does spin_lock increase?
>> >
>> > The size of struct swap_cluster_info will increase from 4 bytes to 16
>> > bytes on 64bit platform.  I guess it will increase from 4 bytes to 8
>> > bytes on 32bit platform at least, but I did not test that.
>> 
>> Sorry, I make a mistake during test.  The size of struct
>> swap_cluster_info will increase from 4 bytes to 8 bytes on 64 bit
>> platform.  I think it will increase from 4 bytes to 8 bytes on 32 bit
>> platform too (not tested).
>
> Thanks for the information.
> To me, it's not big when we consider spinlock's usefullness which helps
> cache-line bouncing, lockdep and happy with RT people.

Yes, spinlock helps with lockdep and RT, but I don't think it helps with
cache-line bouncing.

> So, I vote spin_lock but I'm not in charge of deciding on that and your
> opinion might be different still. If so, let's pass the decision to
> maintainer.

I have no strong opinion on the size change for 32-bit platforms.  But I want
to know other people's opinions, especially the maintainer's.

> Instead, please write down above content in description for maintainer to
> judge it fairly.

Sure.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2016-12-27  7:45 ` [PATCH v4 0/9] mm/swap: Regular page swap optimizations Minchan Kim
  2016-12-28  1:54   ` Huang, Ying
@ 2017-01-02 15:48   ` Jan Kara
  2017-01-03  4:34     ` Minchan Kim
  2017-01-05  1:33   ` Huang, Ying
  2 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2017-01-02 15:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tim Chen, Andrew Morton, Ying Huang, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

Hi,

On Tue 27-12-16 16:45:03, Minchan Kim wrote:
> > Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
> >         the rate that we have to contende for the radix tree.
> 
> To me, it's rather hacky. I think it might be common problem for page cache
> so can we think another generalized way like range_lock? Ccing Jan.

I agree on the hackiness of the patch and that the page cache would suffer from
the same contention (although files are usually smaller than swap, so it
would not be that visible, I guess). But I don't see how a range lock would
help here - we need to serialize modifications of the tree structure itself
and that is difficult to achieve with a range lock. So what you would
need is either a different data structure for tracking swap cache entries
or finer-grained locking of the radix tree.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2017-01-02 15:48   ` Jan Kara
@ 2017-01-03  4:34     ` Minchan Kim
  2017-01-03  5:43       ` Huang, Ying
  2017-01-03 17:47       ` Tim Chen
  0 siblings, 2 replies; 25+ messages in thread
From: Minchan Kim @ 2017-01-03  4:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tim Chen, Andrew Morton, Ying Huang, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Peter Zijlstra,
	Nicholas Piggin

Hi Jan,

On Mon, Jan 02, 2017 at 04:48:41PM +0100, Jan Kara wrote:
> Hi,
> 
> On Tue 27-12-16 16:45:03, Minchan Kim wrote:
> > > Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
> > >         the rate that we have to contende for the radix tree.
> > 
> > To me, it's rather hacky. I think it might be common problem for page cache
> > so can we think another generalized way like range_lock? Ccing Jan.
> 
> I agree on the hackyness of the patch and that page cache would suffer with
> the same contention (although the files are usually smaller than swap so it
> would not be that visible I guess). But I don't see how range lock would
> help here - we need to serialize modifications of the tree structure itself
> and that is difficult to achieve with the range lock. So what you would
> need is either a different data structure for tracking swap cache entries
> or a finer grained locking of the radix tree.

Thanks for the comment, Jan.

I think there are more general options. One is the page batching approach
that Mel and Tim have pursued:

https://patchwork.kernel.org/patch/9008421/
https://patchwork.kernel.org/patch/9322793/

Another is the concurrent page cache work by Peter:

https://www.kernel.org/doc/ols/2007/ols2007v2-pages-311-318.pdf

Ccing Nick, who might have an interest in a lockless page cache.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2017-01-03  4:34     ` Minchan Kim
@ 2017-01-03  5:43       ` Huang, Ying
  2017-01-05  6:15         ` Minchan Kim
  2017-01-03 17:47       ` Tim Chen
  1 sibling, 1 reply; 25+ messages in thread
From: Huang, Ying @ 2017-01-03  5:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jan Kara, Tim Chen, Andrew Morton, Ying Huang, dave.hansen, ak,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Peter Zijlstra,
	Nicholas Piggin

Hi, Minchan,

Minchan Kim <minchan@kernel.org> writes:

> Hi Jan,
>
> On Mon, Jan 02, 2017 at 04:48:41PM +0100, Jan Kara wrote:
>> Hi,
>> 
>> On Tue 27-12-16 16:45:03, Minchan Kim wrote:
>> > > Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
>> > >         the rate that we have to contende for the radix tree.
>> > 
>> > To me, it's rather hacky. I think it might be common problem for page cache
>> > so can we think another generalized way like range_lock? Ccing Jan.
>> 
>> I agree on the hackyness of the patch and that page cache would suffer with
>> the same contention (although the files are usually smaller than swap so it
>> would not be that visible I guess). But I don't see how range lock would
>> help here - we need to serialize modifications of the tree structure itself
>> and that is difficult to achieve with the range lock. So what you would
>> need is either a different data structure for tracking swap cache entries
>> or a finer grained locking of the radix tree.
>
> Thanks for the comment, Jan.
>
> I think there are more general options. One is to shrink batching pages like
> Mel and Tim had approached.
>
> https://patchwork.kernel.org/patch/9008421/
> https://patchwork.kernel.org/patch/9322793/

This helps to reduce the lock contention on the swap cache radix tree,
but splitting the swap cache gives much better performance, so we
switched from that solution to the current one.

> Or concurrent page cache by peter.
>
> https://www.kernel.org/doc/ols/2007/ols2007v2-pages-311-318.pdf

I think this is good; it helps both the swap cache and the file cache.
But I don't know whether other people want to go this way, or how much
effort would be needed.

In contrast, splitting the swap cache is quite simple to implement and
review, and the effect is good.

Best Regards,
Huang, Ying

> Ccing Nick who might have an interest on lockless page cache.
>
> Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2017-01-03  4:34     ` Minchan Kim
  2017-01-03  5:43       ` Huang, Ying
@ 2017-01-03 17:47       ` Tim Chen
  1 sibling, 0 replies; 25+ messages in thread
From: Tim Chen @ 2017-01-03 17:47 UTC (permalink / raw)
  To: Minchan Kim, Jan Kara
  Cc: Andrew Morton, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Peter Zijlstra,
	Nicholas Piggin

On Tue, 2017-01-03 at 13:34 +0900, Minchan Kim wrote:
> Hi Jan,
> 
> On Mon, Jan 02, 2017 at 04:48:41PM +0100, Jan Kara wrote:
> > 
> > Hi,
> > 
> > On Tue 27-12-16 16:45:03, Minchan Kim wrote:
> > > 
> > > > 
> > > > Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
> > > >         the rate that we have to contende for the radix tree.
> > > To me, it's rather hacky. I think it might be common problem for page cache
> > > so can we think another generalized way like range_lock? Ccing Jan.
> > I agree on the hackyness of the patch and that page cache would suffer with
> > the same contention (although the files are usually smaller than swap so it
> > would not be that visible I guess). But I don't see how range lock would
> > help here - we need to serialize modifications of the tree structure itself
> > and that is difficult to achieve with the range lock. So what you would
> > need is either a different data structure for tracking swap cache entries
> > or a finer grained locking of the radix tree.
> Thanks for the comment, Jan.
> 
> I think there are more general options. One is to shrink batching pages like
> Mel and Tim had approached.
> 
> https://patchwork.kernel.org/patch/9008421/
> https://patchwork.kernel.org/patch/9322793/

The batching of pages is done in this patch series with a page allocation
cache and a page release cache.  It is done a bit differently than in my
original patch proposal.

This reduces the contention on the swap_info lock.  We use the splitting
of the radix tree to reduce the radix tree lock contention.
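
As a rough illustration of the batching idea (names, the batch size and
the locking details here are assumptions made for the sketch, not the
posted code), a per-CPU cache refilled in one batch under the device lock
might look like:

#include <linux/percpu.h>
#include <linux/swap.h>

#define SWAP_SLOTS_CACHE_SIZE	64	/* illustrative batch size */

struct swap_slots_cache {
	int		cur;	/* next cached slot to hand out */
	int		nr;	/* number of slots currently cached */
	swp_entry_t	slots[SWAP_SLOTS_CACHE_SIZE];
};

static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);

/*
 * get_swap_pages(n, slots) stands in for a batched allocator that takes
 * the swap device lock once and returns up to n free slots.
 */
static swp_entry_t get_swap_page_batched(void)
{
	swp_entry_t entry = (swp_entry_t){ 0 };
	struct swap_slots_cache *cache;

	cache = &get_cpu_var(swp_slots);	/* disables preemption */
	if (cache->cur >= cache->nr) {		/* cache empty: refill in one batch */
		cache->cur = 0;
		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots);
	}
	if (cache->nr)
		entry = cache->slots[cache->cur++];
	put_cpu_var(swp_slots);
	return entry;
}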

In our tests, these two approaches combined are quite effective in
reducing the latency on actual fast solid state drives.  So we hope that
the patch series can be merged to facilitate the use of these drives as
secondary memory.

Tim

> 
> Or concurrent page cache by peter.
> 
> https://www.kernel.org/doc/ols/2007/ols2007v2-pages-311-318.pdf
> 
> Ccing Nick who might have an interest on lockless page cache.
> 
> Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2016-12-27  7:45 ` [PATCH v4 0/9] mm/swap: Regular page swap optimizations Minchan Kim
  2016-12-28  1:54   ` Huang, Ying
  2017-01-02 15:48   ` Jan Kara
@ 2017-01-05  1:33   ` Huang, Ying
  2017-01-05  6:32     ` Minchan Kim
  2 siblings, 1 reply; 25+ messages in thread
From: Huang, Ying @ 2017-01-05  1:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tim Chen, Andrew Morton, Ying Huang, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

Hi, Minchan,

Minchan Kim <minchan@kernel.org> writes:
[snip]
>
> The patchset has used several techniqueus to reduce lock contention, for example,
> batching alloc/free, fine-grained lock and cluster distribution to avoid cache
> false-sharing. Each items has different complexity and benefits so could you
> show the number for each step of pathchset? It would be better to include the
> nubmer in each description. It helps how the patch is important when we consider
> complexitiy of the patch.

Here is the test data.

We tested the vm-scalability swap-w-seq test case with 32 processes on a
Xeon E5 v3 system.  The swap device used is a RAM-simulated PMEM
(persistent memory) device.  To test sequential swap-out, the test case
creates 32 processes, which sequentially allocate and write to anonymous
pages until the RAM and part of the swap device are used up.

The patchset is rebased on v4.9-rc8, so the baseline performance is as
follows:

  "vmstat.swap.so": 1428002,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": 13.94,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg": 13.75,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.swapcache_free.__remove_mapping.shrink_page_list": 7.05,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.page_swapcount.try_to_free_swap.swap_writepage": 7.03,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.__swap_duplicate.swap_duplicate.try_to_unmap_one.rmap_walk_anon": 7.02,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": 6.83,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.page_check_address_transhuge.page_referenced_one.rmap_walk_anon.rmap_walk": 0.81,

>> Patch 1 is a clean up patch.
>
> Could it be separated patch?
>
>> Patch 2 creates a lock per cluster, this gives us a more fine graind lock
>>         that can be used for accessing swap_map, and not lock the whole
>>         swap device

After patch 2, the result is as follows:

  "vmstat.swap.so": 1481704,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": 27.53,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg": 27.01,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages.drain_local_pages": 1.03,

The swap-out throughput is at the same level, but the lock contention on
swap_info_struct->lock is eliminated.

>> Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
>>         the rate that we have to contende for the radix tree.
>

After patch 3,

  "vmstat.swap.so": 2050097,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": 43.27,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": 4.84,

The swap-out throughput is improved by about 43% compared with the
baseline.  The contention on the swap cache radix tree lock is
eliminated.  swap_info_struct->lock in get_swap_page() becomes the most
heavily contended lock.

>
>> Patch 4 eliminates unnecessary page allocation for read ahead.
>
> Could it be separated patch?
>
>> Patch 5-9 create a per cpu cache of the swap slots, so we don't have
>>         to contend on the swap device to get a swap slot or to release
>>         a swap slot.  And we allocate and release the swap slots
>>         in batches for better efficiency.

After patch 9,

  "vmstat.swap.so": 4170746,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.swapcache_free_entries.free_swap_slot.free_swap_and_cache.unmap_page_range": 13.91,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": 8.56,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_slowpath.__alloc_pages_nodemask.alloc_pages_vma": 2.56,
  "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap.shrink_page_list": 2.47,

The swap-out throughput is improved by about 192% compared with the
baseline.  There is still some lock contention on
swap_info_struct->lock, but the pressure begins to shift to the buddy
system now.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2017-01-03  5:43       ` Huang, Ying
@ 2017-01-05  6:15         ` Minchan Kim
  0 siblings, 0 replies; 25+ messages in thread
From: Minchan Kim @ 2017-01-05  6:15 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Jan Kara, Tim Chen, Andrew Morton, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Peter Zijlstra,
	Nicholas Piggin

Hi Huang,

On Tue, Jan 03, 2017 at 01:43:43PM +0800, Huang, Ying wrote:
> Hi, Minchan,
> 
> Minchan Kim <minchan@kernel.org> writes:
> 
> > Hi Jan,
> >
> > On Mon, Jan 02, 2017 at 04:48:41PM +0100, Jan Kara wrote:
> >> Hi,
> >> 
> >> On Tue 27-12-16 16:45:03, Minchan Kim wrote:
> >> > > Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
> >> > >         the rate that we have to contende for the radix tree.
> >> > 
> >> > To me, it's rather hacky. I think it might be common problem for page cache
> >> > so can we think another generalized way like range_lock? Ccing Jan.
> >> 
> >> I agree on the hackyness of the patch and that page cache would suffer with
> >> the same contention (although the files are usually smaller than swap so it
> >> would not be that visible I guess). But I don't see how range lock would
> >> help here - we need to serialize modifications of the tree structure itself
> >> and that is difficult to achieve with the range lock. So what you would
> >> need is either a different data structure for tracking swap cache entries
> >> or a finer grained locking of the radix tree.
> >
> > Thanks for the comment, Jan.
> >
> > I think there are more general options. One is to shrink batching pages like
> > Mel and Tim had approached.
> >
> > https://patchwork.kernel.org/patch/9008421/
> > https://patchwork.kernel.org/patch/9322793/
> 
> This helps to reduce the lock contention on radix tree of swap cache.
> But splitting swap cache has much better performance.  So we switched
> from that solution to current solution.
> 
> > Or concurrent page cache by peter.
> >
> > https://www.kernel.org/doc/ols/2007/ols2007v2-pages-311-318.pdf
> 
> I think this is good, it helps swap and file cache.  But I don't know
> whether other people want to go this way and how much effort will be
> needed.
> 
> In contrast, splitting swap cache is quite simple, for implementation
> and review.  And the effect is good.

I think a general approach is better, but I don't want to be a party
pooper if everyone is okay with this. I just wanted to point out that we
need to consider a more general approach, and I did my best.

The decision is up to you guys.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2017-01-05  1:33   ` Huang, Ying
@ 2017-01-05  6:32     ` Minchan Kim
  2017-01-05  6:44       ` Huang, Ying
  0 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2017-01-05  6:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Tim Chen, Andrew Morton, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

Hi,

On Thu, Jan 05, 2017 at 09:33:55AM +0800, Huang, Ying wrote:
> Hi, Minchan,
> 
> Minchan Kim <minchan@kernel.org> writes:
> [snip]
> >
> > The patchset has used several techniqueus to reduce lock contention, for example,
> > batching alloc/free, fine-grained lock and cluster distribution to avoid cache
> > false-sharing. Each items has different complexity and benefits so could you
> > show the number for each step of pathchset? It would be better to include the
> > nubmer in each description. It helps how the patch is important when we consider
> > complexitiy of the patch.
> 
> Here is the test data.

Thanks!

> 
> We test the vm-scalability swap-w-seq test case with 32 processes on a
> Xeon E5 v3 system.  The swap device used is a RAM simulated PMEM
> (persistent memory) device.  To test the sequential swapping out, the
> test case created 32 processes, which sequentially allocate and write to
> the anonymous pages until the RAM and part of the swap device is used
> up.
> 
> The patchset is rebased on v4.9-rc8.  So the baseline performance is as
> follow,
> 
>   "vmstat.swap.so": 1428002,

What does this mean?  Is it vmstat.pswpout?

>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": 13.94,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg": 13.75,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.swapcache_free.__remove_mapping.shrink_page_list": 7.05,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.page_swapcount.try_to_free_swap.swap_writepage": 7.03,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.__swap_duplicate.swap_duplicate.try_to_unmap_one.rmap_walk_anon": 7.02,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": 6.83,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.page_check_address_transhuge.page_referenced_one.rmap_walk_anon.rmap_walk": 0.81,

Do the numbers mean the overhead percentage reported by perf?

> 
> >> Patch 1 is a clean up patch.
> >
> > Could it be separated patch?
> >
> >> Patch 2 creates a lock per cluster, this gives us a more fine graind lock
> >>         that can be used for accessing swap_map, and not lock the whole
> >>         swap device
> 
> After patch 2, the result is as follow,
> 
>   "vmstat.swap.so": 1481704,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": 27.53,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg": 27.01,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages.drain_local_pages": 1.03,
> 
> The swap out throughput is at the same level, but the lock contention on
> swap_info_struct->lock is eliminated.
> 
> >> Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
> >>         the rate that we have to contende for the radix tree.
> >
> 
> After patch 3,
> 
>   "vmstat.swap.so": 2050097,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": 43.27,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": 4.84,
> 
> The swap out throughput is improved about ~43% compared with baseline.
> The lock contention on swap cache radix tree lock is eliminated.
> swap_info_struct->lock in get_swap_page() becomes the most heavy
> contended lock.

The numbers are great! Please include them in each patch of the patchset.
And let me ask once more about one thing I said earlier regarding patch 2:

""
I hope you make three steps to review easier. You can create some functions like
swap_map_lock and cluster_lock which are wrapper functions just hold swap_lock.
It doesn't change anything performance pov but it clearly shows what kinds of lock
we should use in specific context.

Then, you can introduce more fine-graind lock in next patch and apply it into
those wrapper functions.
 
And last patch, you can adjust cluster distribution to avoid false-sharing.
And the description should include how it's bad in testing so it's worth.
""

It would make review easier, I believe.
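
A minimal sketch of the staging being asked for above (the function names
follow the suggestion; the bodies are placeholders for illustration, not
code from the series):

#include <linux/spinlock.h>
#include <linux/swap.h>

/*
 * Step 1 (illustrative): wrappers that still take the whole-device lock,
 * so each call site names the object it logically protects.  The offset
 * argument is unused until step 2.
 */
static inline void cluster_lock(struct swap_info_struct *si,
				unsigned long offset)
{
	spin_lock(&si->lock);
}

static inline void cluster_unlock(struct swap_info_struct *si,
				  unsigned long offset)
{
	spin_unlock(&si->lock);
}

/*
 * Step 2 would redirect these to a per-cluster lock (e.g. the one for
 * cluster offset / SWAPFILE_CLUSTER) without touching any caller, and
 * step 3 would adjust cluster distribution to avoid false sharing.
 */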

> 
> >
> >> Patch 4 eliminates unnecessary page allocation for read ahead.
> >
> > Could it be separated patch?
> >
> >> Patch 5-9 create a per cpu cache of the swap slots, so we don't have
> >>         to contend on the swap device to get a swap slot or to release
> >>         a swap slot.  And we allocate and release the swap slots
> >>         in batches for better efficiency.
> 
> After patch 9,
> 
>   "vmstat.swap.so": 4170746,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.swapcache_free_entries.free_swap_slot.free_swap_and_cache.unmap_page_range": 13.91,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": 8.56,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_slowpath.__alloc_pages_nodemask.alloc_pages_vma": 2.56,
>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap.shrink_page_list": 2.47,
> 
> The swap out throughput is improved about 192% compared with the
> baseline.  There are still some lock contention for
> swap_info_struct->lock, but the pressure begins to shift to buddy system
> now.
> 
> Best Regards,
> Huang, Ying
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/9] mm/swap: Regular page swap optimizations
  2017-01-05  6:32     ` Minchan Kim
@ 2017-01-05  6:44       ` Huang, Ying
  0 siblings, 0 replies; 25+ messages in thread
From: Huang, Ying @ 2017-01-05  6:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Tim Chen, Andrew Morton, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, jack

Minchan Kim <minchan@kernel.org> writes:

> Hi,
>
> On Thu, Jan 05, 2017 at 09:33:55AM +0800, Huang, Ying wrote:
>> Hi, Minchan,
>> 
>> Minchan Kim <minchan@kernel.org> writes:
>> [snip]
>> >
>> > The patchset has used several techniqueus to reduce lock contention, for example,
>> > batching alloc/free, fine-grained lock and cluster distribution to avoid cache
>> > false-sharing. Each items has different complexity and benefits so could you
>> > show the number for each step of pathchset? It would be better to include the
>> > nubmer in each description. It helps how the patch is important when we consider
>> > complexitiy of the patch.
>> 
>> Here is the test data.
>
> Thanks!
>
>> 
>> We test the vm-scalability swap-w-seq test case with 32 processes on a
>> Xeon E5 v3 system.  The swap device used is a RAM simulated PMEM
>> (persistent memory) device.  To test the sequential swapping out, the
>> test case created 32 processes, which sequentially allocate and write to
>> the anonymous pages until the RAM and part of the swap device is used
>> up.
>> 
>> The patchset is rebased on v4.9-rc8.  So the baseline performance is as
>> follow,
>> 
>>   "vmstat.swap.so": 1428002,
>
> What does it mean? vmstat.pswpout?

This is the average of the swap.so column of the /usr/bin/vmstat output.
We run /usr/bin/vmstat with:

/usr/bin/vmstat -n 1

>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": 13.94,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg":
>> 13.75,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.swapcache_free.__remove_mapping.shrink_page_list": 7.05,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.page_swapcount.try_to_free_swap.swap_writepage": 7.03,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.__swap_duplicate.swap_duplicate.try_to_unmap_one.rmap_walk_anon": 7.02,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": 6.83,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.page_check_address_transhuge.page_referenced_one.rmap_walk_anon.rmap_walk": 0.81,
>
> Numbers mean overhead percentage reported by perf?

Yes.
 
>> >> Patch 1 is a clean up patch.
>> >
>> > Could it be separated patch?
>> >
>> >> Patch 2 creates a lock per cluster, this gives us a more fine graind lock
>> >>         that can be used for accessing swap_map, and not lock the whole
>> >>         swap device
>> 
>> After patch 2, the result is as follow,
>> 
>>   "vmstat.swap.so": 1481704,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list": 27.53,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg":
>> 27.01,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages.drain_local_pages": 1.03,
>> 
>> The swap out throughput is at the same level, but the lock contention on
>> swap_info_struct->lock is eliminated.
>> 
>> >> Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
>> >>         the rate that we have to contende for the radix tree.
>> >
>> 
>> After patch 3,
>> 
>>   "vmstat.swap.so": 2050097,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list": 43.27,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": 4.84,
>> 
>> The swap out throughput is improved about ~43% compared with baseline.
>> The lock contention on swap cache radix tree lock is eliminated.
>> swap_info_struct->lock in get_swap_page() becomes the most heavy
>> contended lock.
>
> The numbers are great! Please include those into each patchset.
> And I ask one more thing I said earlier about patch 2.
>
> ""
> I hope you make three steps to review easier. You can create some functions like
> swap_map_lock and cluster_lock which are wrapper functions just hold swap_lock.
> It doesn't change anything performance pov but it clearly shows what kinds of lock
> we should use in specific context.
>
> Then, you can introduce more fine-graind lock in next patch and apply it into
> those wrapper functions.
>  
> And last patch, you can adjust cluster distribution to avoid false-sharing.
> And the description should include how it's bad in testing so it's worth.
> ""
>
> It makes review more easier, I believe.

Sorry, personally I don't like organizing the patchset this way.  So
unless more people ask for it, I would still like to keep the current
organization.

Best Regards,
Huang, Ying

>> 
>> >
>> >> Patch 4 eliminates unnecessary page allocation for read ahead.
>> >
>> > Could it be separated patch?
>> >
>> >> Patch 5-9 create a per cpu cache of the swap slots, so we don't have
>> >>         to contend on the swap device to get a swap slot or to release
>> >>         a swap slot.  And we allocate and release the swap slots
>> >>         in batches for better efficiency.
>> 
>> After patch 9,
>> 
>>   "vmstat.swap.so": 4170746,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.swapcache_free_entries.free_swap_slot.free_swap_and_cache.unmap_page_range": 13.91,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault": 8.56,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_slowpath.__alloc_pages_nodemask.alloc_pages_vma":
>> 2.56,
>>   "perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap.shrink_page_list": 2.47,
>> 
>> The swap out throughput is improved about 192% compared with the
>> baseline.  There are still some lock contention for
>> swap_info_struct->lock, but the pressure begins to shift to buddy system
>> now.
>> 
>> Best Regards,
>> Huang, Ying
>> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-01-05  6:48 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
2016-12-09 21:09 [PATCH v4 0/9] mm/swap: Regular page swap optimizations Tim Chen
2016-12-09 21:09 ` [PATCH v4 1/9] mm/swap: Fix kernel message in swap_info_get() Tim Chen
2016-12-09 21:09 ` [PATCH v4 2/9] mm/swap: Add cluster lock Tim Chen
2016-12-09 21:09 ` [PATCH v4 3/9] mm/swap: Split swap cache into 64MB trunks Tim Chen
2016-12-09 21:09 ` [PATCH v4 4/9] mm/swap: skip read ahead for unreferenced swap slots Tim Chen
2016-12-09 21:09 ` [PATCH v4 5/9] mm/swap: Allocate swap slots in batches Tim Chen
2016-12-09 21:09 ` [PATCH v4 6/9] mm/swap: Free swap slots in batch Tim Chen
2016-12-09 21:09 ` [PATCH v4 7/9] mm/swap: Add cache for swap slots allocation Tim Chen
2016-12-09 21:09 ` [PATCH v4 8/9] mm/swap: Enable swap slots cache usage Tim Chen
2016-12-09 21:09 ` [PATCH v4 9/9] mm/swap: Skip readahead only when swap slot cache is enabled Tim Chen
2016-12-27  7:45 ` [PATCH v4 0/9] mm/swap: Regular page swap optimizations Minchan Kim
2016-12-28  1:54   ` Huang, Ying
2016-12-28  2:37     ` Minchan Kim
2016-12-28  3:15       ` Huang, Ying
2016-12-28  3:31         ` Huang, Ying
2016-12-28  3:53           ` Minchan Kim
2016-12-28  4:56             ` Huang, Ying
2017-01-02 15:48   ` Jan Kara
2017-01-03  4:34     ` Minchan Kim
2017-01-03  5:43       ` Huang, Ying
2017-01-05  6:15         ` Minchan Kim
2017-01-03 17:47       ` Tim Chen
2017-01-05  1:33   ` Huang, Ying
2017-01-05  6:32     ` Minchan Kim
2017-01-05  6:44       ` Huang, Ying
