* [PATCH v2 0/8] mm/swap: Regular page swap optimizations
@ 2016-10-20 23:31 Tim Chen
  2016-10-20 23:31 ` [PATCH v2 1/8] mm/swap: Fix kernel message in swap_info_get() Tim Chen
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Tim Chen @ 2016-10-20 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton

We appreciate feedback about this patch series from the
community.  Historically, neither the performance nor latency of the swap
path mattered.  The underlying I/O was slow enough to hide any latency
coming from software and the low IOPS kept the overall CPU impact low.

Times have changed.  The coming generation of solid-state block devices
has latencies dropping below 100 usec, which is within an order of
magnitude of DRAM, and their performance is orders of magnitude higher
than the single-spindle rotational media we've swapped to historically.

This could benefit many usage scenarios.  For example, cloud providers
overcommit their memory (as VMs don't use all the memory provisioned).
Having fast swap will allow them to be more aggressive in memory
overcommit and fit more VMs onto a platform.

In our testing [see footnote], the median latency that the
kernel adds to a page fault is 15 usec, which comes quite close
to the amount that will be contributed by the underlying I/O
devices.

The software latency comes mostly from contention on the locks
protecting the radix tree of the swap cache and on the locks protecting
the individual swap devices.  This lock contention already consumed
35% of CPU cycles in our test.  In the very near future, software
latency will become the bottleneck to swap performance as block device
I/O latency gets within shouting distance of DRAM speed.

This patch set, plus a previous patch Ying already posted
(commit: f6498b3f), reduced the median page fault latency
from 15 usec to 4 usec (a 73% reduction, i.e. 3.75x lower) for a
DRAM-based pmem block device.

Patch 1 is a clean up patch.
Patch 2 creates a lock per cluster, giving us a more fine-grained lock
        that can be used for accessing swap_map without locking the
        whole swap device.
Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
        the rate at which we have to contend for the radix tree.
Patch 4 eliminates unnecessary page allocation for read ahead.
Patches 5-8 create a per-cpu cache of swap slots, so we don't have
        to contend on the swap device to get or release a swap slot.
        We also allocate and release the swap slots in batches for
        better efficiency (a rough usage sketch of these interfaces
        follows below).
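
For reference, the batched interfaces added in patches 5 and 6 pair up
roughly as follows.  This is an illustrative sketch only, not an actual
kernel call site (the real users are the per-cpu slot cache in patches
7-8); use_swap_entry() is a placeholder for whatever the caller does
with each slot, and <linux/swap.h> is assumed:

	static void swap_batch_example(void)
	{
		swp_entry_t entries[SWAP_BATCH];	/* SWAP_BATCH is 64, added in patch 5 */
		int got, i;

		/* one call can return up to SWAP_BATCH slots from the swap devices */
		got = get_swap_pages(SWAP_BATCH, entries);
		for (i = 0; i < got; i++)
			use_swap_entry(entries[i]);	/* placeholder for real work */
		/* hand the slots back to the swap device in one batch (patch 6) */
		swapcache_free_entries(entries, got);
	}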

Ying Huang & Tim Chen

Footnote:
Testing was done on a 4.8-rc3-mm1 kernel with/without the optimizations
from this patch series, plus one additional patch Ying posted earlier
that removes the radix tree writeback tag in the swap cache.  Eight
threads performed random memory access on a 2-socket Haswell system
using swap mounted on a RAM-based PMEM block device.  This emulated a
moderate load and a swap device unbounded by I/O speed.  The aggregate
working set is twice the RAM size.  We instrumented the kernel to
measure the page fault latency.

Change Log:
v2: 
1. Fix a bug in the index limit used in scan_swap_map_try_ssd_cluster()
when searching for empty slots in a cluster.
2. Fix a bug in swapoff that incorrectly determined whether we still
have swap devices left.
3. Port patches to the mmotm-2016-10-11-15-46 branch


Huang, Ying (3):
  mm/swap: Fix kernel message in swap_info_get()
  mm/swap: Add cluster lock
  mm/swap: Split swap cache into 64MB trunks

Tim Chen (5):
  mm/swap: skip read ahead for unreferenced swap slots
  mm/swap: Allocate swap slots in batches
  mm/swap: Free swap slots in batch
  mm/swap: Add cache for swap slots allocation
  mm/swap: Enable swap slots cache usage

 include/linux/swap.h       |  35 ++-
 include/linux/swap_slots.h |  37 +++
 mm/Makefile                |   2 +-
 mm/swap.c                  |   6 -
 mm/swap_slots.c            | 306 +++++++++++++++++++++++++
 mm/swap_state.c            |  76 +++++-
 mm/swapfile.c              | 558 +++++++++++++++++++++++++++++++++++----------
 7 files changed, 875 insertions(+), 145 deletions(-)
 create mode 100644 include/linux/swap_slots.h
 create mode 100644 mm/swap_slots.c

-- 
2.5.5


* [PATCH v2 1/8] mm/swap: Fix kernel message in swap_info_get()
  2016-10-20 23:31 [PATCH v2 0/8] mm/swap: Regular page swap optimizations Tim Chen
@ 2016-10-20 23:31 ` Tim Chen
  2016-10-20 23:31 ` [PATCH v2 2/8] mm/swap: Add cluster lock Tim Chen
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2016-10-20 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton, Tim Chen

From: "Huang, Ying" <ying.huang@intel.com>

swap_info_get() is used not only in the swap free code path but also in
page_swapcount(), etc.  So the original kernel message in
swap_info_get() is no longer correct.  Fix it by replacing "swap_free"
with "swap_info_get" in the message.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/swapfile.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2210de2..b745d3d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -753,16 +753,16 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 	return p;
 
 bad_free:
-	pr_err("swap_free: %s%08lx\n", Unused_offset, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val);
 	goto out;
 bad_offset:
-	pr_err("swap_free: %s%08lx\n", Bad_offset, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Bad_offset, entry.val);
 	goto out;
 bad_device:
-	pr_err("swap_free: %s%08lx\n", Unused_file, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Unused_file, entry.val);
 	goto out;
 bad_nofile:
-	pr_err("swap_free: %s%08lx\n", Bad_file, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Bad_file, entry.val);
 out:
 	return NULL;
 }
-- 
2.5.5


* [PATCH v2 2/8] mm/swap: Add cluster lock
  2016-10-20 23:31 [PATCH v2 0/8] mm/swap: Regular page swap optimizations Tim Chen
  2016-10-20 23:31 ` [PATCH v2 1/8] mm/swap: Fix kernel message in swap_info_get() Tim Chen
@ 2016-10-20 23:31 ` Tim Chen
  2016-10-24 16:31   ` Jonathan Corbet
  2016-10-20 23:31 ` [PATCH v2 3/8] mm/swap: Split swap cache into 64MB trunks Tim Chen
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Tim Chen @ 2016-10-20 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton, Tim Chen

From: "Huang, Ying" <ying.huang@intel.com>

This patch reduces the lock contention on swap_info_struct->lock by
using a more fine-grained lock in swap_cluster_info for some swap
operations.  swap_info_struct->lock is heavily contended if multiple
processes reclaim pages simultaneously, because there is only one such
lock for each swap device, a common configuration has only one or a few
swap devices in the system, and the lock protects almost all swap
related operations.

In fact, many swap operations only access one element of the
swap_info_struct->swap_map array, and there is no dependency between
different elements of swap_info_struct->swap_map.  So a fine-grained
lock can be used to allow parallel access to the different elements of
swap_info_struct->swap_map.

In this patch, one bit of swap_cluster_info is used as a bit spinlock
to protect the elements of swap_info_struct->swap_map in the swap
cluster and the fields of swap_cluster_info.  This greatly reduces lock
contention for swap_info_struct->swap_map accesses.

To make room for the bit spinlock, the size of swap_cluster_info needs
to increase from 4 bytes to 8 bytes on 64-bit systems.  This uses 4KB
more memory for every 1GB of swap space.
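
(As a sanity check, assuming 4KB pages and the kernel's 256-page swap
clusters: 1GB of swap is 262144 pages, i.e. 1024 clusters, and 4 extra
bytes per swap_cluster_info comes to 4096 bytes, i.e. 4KB, per 1GB of
swap.)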

Because the size of swap_cluster_info is much smaller than the size of
a cache line (8 vs 64 bytes on the x86_64 architecture), there may be
false cache line sharing between swap_cluster_info bit spinlocks.  To
avoid such false sharing in the first round of swap cluster allocation,
the order of the swap clusters in the free clusters list is changed, so
that swap_cluster_info entries sharing the same cache line are placed
as far apart as possible.  After the first round of allocation, the
order of the clusters in the free clusters list is expected to be
random, so the false sharing should not be noticeable.
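
(With 8-byte swap_cluster_info entries and 64-byte cache lines, 8
adjacent entries share one cache line, so the initial free list is
built with a stride of 64 / 8 = 8 clusters -- see SWAP_CLUSTER_COLS in
the diff below -- and consecutively allocated clusters start out on
different cache lines.)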

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap.h |  13 ++-
 mm/swapfile.c        | 239 +++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 194 insertions(+), 58 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7e553e1..2e0c9ae 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -175,11 +175,16 @@ enum {
  * protected by swap_info_struct.lock.
  */
 struct swap_cluster_info {
-	unsigned int data:24;
-	unsigned int flags:8;
+	unsigned long data;
 };
-#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+#define CLUSTER_COUNT_SHIFT		8
+#define CLUSTER_FLAG_MASK		((1UL << CLUSTER_COUNT_SHIFT) - 1)
+#define CLUSTER_COUNT_MASK		(~CLUSTER_FLAG_MASK)
+#define CLUSTER_FLAG_FREE		1 /* This cluster is free */
+#define CLUSTER_FLAG_NEXT_NULL		2 /* This cluster has no next cluster */
+/* cluster lock, protect cluster_info contents and sis->swap_map */
+#define CLUSTER_FLAG_LOCK_BIT		2
+#define CLUSTER_FLAG_LOCK		(1 << CLUSTER_FLAG_LOCK_BIT)
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b745d3d..3806bce 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -200,61 +200,107 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
-	unsigned int flag)
+				    unsigned int flag)
 {
-	info->flags = flag;
+	info->data = (info->data & (CLUSTER_COUNT_MASK | CLUSTER_FLAG_LOCK)) |
+		(flag & ~CLUSTER_FLAG_LOCK);
 }
 
 static inline unsigned int cluster_count(struct swap_cluster_info *info)
 {
-	return info->data;
+	return info->data >> CLUSTER_COUNT_SHIFT;
 }
 
 static inline void cluster_set_count(struct swap_cluster_info *info,
 				     unsigned int c)
 {
-	info->data = c;
+	info->data = (c << CLUSTER_COUNT_SHIFT) | (info->data & CLUSTER_FLAG_MASK);
 }
 
 static inline void cluster_set_count_flag(struct swap_cluster_info *info,
 					 unsigned int c, unsigned int f)
 {
-	info->flags = f;
-	info->data = c;
+	info->data = (info->data & CLUSTER_FLAG_LOCK) |
+		(c << CLUSTER_COUNT_SHIFT) | (f & ~CLUSTER_FLAG_LOCK);
 }
 
 static inline unsigned int cluster_next(struct swap_cluster_info *info)
 {
-	return info->data;
+	return cluster_count(info);
 }
 
 static inline void cluster_set_next(struct swap_cluster_info *info,
 				    unsigned int n)
 {
-	info->data = n;
+	cluster_set_count(info, n);
 }
 
 static inline void cluster_set_next_flag(struct swap_cluster_info *info,
 					 unsigned int n, unsigned int f)
 {
-	info->flags = f;
-	info->data = n;
+	cluster_set_count_flag(info, n, f);
 }
 
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
-	return info->flags & CLUSTER_FLAG_FREE;
+	return info->data & CLUSTER_FLAG_FREE;
 }
 
 static inline bool cluster_is_null(struct swap_cluster_info *info)
 {
-	return info->flags & CLUSTER_FLAG_NEXT_NULL;
+	return info->data & CLUSTER_FLAG_NEXT_NULL;
 }
 
 static inline void cluster_set_null(struct swap_cluster_info *info)
 {
-	info->flags = CLUSTER_FLAG_NEXT_NULL;
-	info->data = 0;
+	cluster_set_next_flag(info, 0, CLUSTER_FLAG_NEXT_NULL);
+}
+
+/* Protect swap_cluster_info fields and si->swap_map */
+static inline void __lock_cluster(struct swap_cluster_info *ci)
+{
+	bit_spin_lock(CLUSTER_FLAG_LOCK_BIT, &ci->data);
+}
+
+static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
+						     unsigned long offset)
+{
+	struct swap_cluster_info *ci;
+
+	ci = si->cluster_info;
+	if (ci) {
+		ci += offset / SWAPFILE_CLUSTER;
+		__lock_cluster(ci);
+	}
+	return ci;
+}
+
+static inline void unlock_cluster(struct swap_cluster_info *ci)
+{
+	if (ci)
+		bit_spin_unlock(CLUSTER_FLAG_LOCK_BIT, &ci->data);
+}
+
+static inline struct swap_cluster_info *lock_cluster_or_swap_info(
+	struct swap_info_struct *si,
+	unsigned long offset)
+{
+	struct swap_cluster_info *ci;
+
+	ci = lock_cluster(si, offset);
+	if (!ci)
+		spin_lock(&si->lock);
+
+	return ci;
+}
+
+static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
+					       struct swap_cluster_info *ci)
+{
+	if (ci)
+		unlock_cluster(ci);
+	else
+		spin_unlock(&si->lock);
 }
 
 static inline bool cluster_list_empty(struct swap_cluster_list *list)
@@ -281,9 +327,17 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
 		cluster_set_next_flag(&list->head, idx, 0);
 		cluster_set_next_flag(&list->tail, idx, 0);
 	} else {
+		struct swap_cluster_info *ci_tail;
 		unsigned int tail = cluster_next(&list->tail);
 
-		cluster_set_next(&ci[tail], idx);
+		/*
+		 * Nested cluster lock, but both cluster locks are
+		 * only acquired when we held swap_info_struct->lock
+		 */
+		ci_tail = ci + tail;
+		__lock_cluster(ci_tail);
+		cluster_set_next(ci_tail, idx);
+		unlock_cluster(ci_tail);
 		cluster_set_next_flag(&list->tail, idx, 0);
 	}
 }
@@ -328,7 +382,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 */
 static void swap_do_scheduled_discard(struct swap_info_struct *si)
 {
-	struct swap_cluster_info *info;
+	struct swap_cluster_info *info, *ci;
 	unsigned int idx;
 
 	info = si->cluster_info;
@@ -341,10 +395,14 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
+		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
+		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
+		unlock_cluster(ci);
 		cluster_list_add_tail(&si->free_clusters, info, idx);
+		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
+		unlock_cluster(ci);
 	}
 }
 
@@ -447,8 +505,9 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned long *offset, unsigned long *scan_base)
 {
 	struct percpu_cluster *cluster;
+	struct swap_cluster_info *ci;
 	bool found_free;
-	unsigned long tmp;
+	unsigned long tmp, max;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
@@ -476,14 +535,21 @@ new_cluster:
 	 * check if there is still free entry in the cluster
 	 */
 	tmp = cluster->next;
-	while (tmp < si->max && tmp < (cluster_next(&cluster->index) + 1) *
-	       SWAPFILE_CLUSTER) {
+	max = min_t(unsigned long, si->max,
+		    (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+	if (tmp >= max) {
+		cluster_set_null(&cluster->index);
+		goto new_cluster;
+	}
+	ci = lock_cluster(si, tmp);
+	while (tmp < max) {
 		if (!si->swap_map[tmp]) {
 			found_free = true;
 			break;
 		}
 		tmp++;
 	}
+	unlock_cluster(ci);
 	if (!found_free) {
 		cluster_set_null(&cluster->index);
 		goto new_cluster;
@@ -496,6 +562,7 @@ new_cluster:
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
+	struct swap_cluster_info *ci;
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
@@ -572,9 +639,11 @@ checks:
 	if (offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
 
+	ci = lock_cluster(si, offset);
 	/* reuse swap entry of cache-only swap if not busy. */
 	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
 		int swap_was_freed;
+		unlock_cluster(ci);
 		spin_unlock(&si->lock);
 		swap_was_freed = __try_to_reclaim_swap(si, offset);
 		spin_lock(&si->lock);
@@ -584,8 +653,10 @@ checks:
 		goto scan; /* check next one */
 	}
 
-	if (si->swap_map[offset])
+	if (si->swap_map[offset]) {
+		unlock_cluster(ci);
 		goto scan;
+	}
 
 	if (offset == si->lowest_bit)
 		si->lowest_bit++;
@@ -601,6 +672,7 @@ checks:
 	}
 	si->swap_map[offset] = usage;
 	inc_cluster_info_page(si, si->cluster_info, offset);
+	unlock_cluster(ci);
 	si->cluster_next = offset + 1;
 	si->flags -= SWP_SCANNING;
 
@@ -731,7 +803,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	return (swp_entry_t) {0};
 }
 
-static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	unsigned long offset, type;
@@ -749,7 +821,6 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 		goto bad_offset;
 	if (!p->swap_map[offset])
 		goto bad_free;
-	spin_lock(&p->lock);
 	return p;
 
 bad_free:
@@ -767,14 +838,45 @@ out:
 	return NULL;
 }
 
+static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+{
+	struct swap_info_struct *p;
+
+	p = _swap_info_get(entry);
+	if (p)
+		spin_lock(&p->lock);
+	return p;
+}
+
 static unsigned char swap_entry_free(struct swap_info_struct *p,
-				     swp_entry_t entry, unsigned char usage)
+				     swp_entry_t entry, unsigned char usage,
+				     bool swap_info_locked)
 {
+	struct swap_cluster_info *ci;
 	unsigned long offset = swp_offset(entry);
 	unsigned char count;
 	unsigned char has_cache;
+	bool lock_swap_info = false;
+
+	if (!swap_info_locked) {
+		count = p->swap_map[offset];
+		if (!p->cluster_info || count == usage || count == SWAP_MAP_SHMEM) {
+lock_swap_info:
+			swap_info_locked = true;
+			lock_swap_info = true;
+			spin_lock(&p->lock);
+		}
+	}
+
+	ci = lock_cluster(p, offset);
 
 	count = p->swap_map[offset];
+
+	if (!swap_info_locked && (count == usage || count == SWAP_MAP_SHMEM)) {
+		unlock_cluster(ci);
+		goto lock_swap_info;
+	}
+
 	has_cache = count & SWAP_HAS_CACHE;
 	count &= ~SWAP_HAS_CACHE;
 
@@ -800,10 +902,15 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 	usage = count | has_cache;
 	p->swap_map[offset] = usage;
 
+	unlock_cluster(ci);
+
 	/* free if no reference */
 	if (!usage) {
+		VM_BUG_ON(!swap_info_locked);
 		mem_cgroup_uncharge_swap(entry);
+		ci = lock_cluster(p, offset);
 		dec_cluster_info_page(p, p->cluster_info, offset);
+		unlock_cluster(ci);
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
 		if (offset > p->highest_bit) {
@@ -829,6 +936,9 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 		}
 	}
 
+	if (lock_swap_info)
+		spin_unlock(&p->lock);
+
 	return usage;
 }
 
@@ -840,11 +950,9 @@ void swap_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 
-	p = swap_info_get(entry);
-	if (p) {
-		swap_entry_free(p, entry, 1);
-		spin_unlock(&p->lock);
-	}
+	p = _swap_info_get(entry);
+	if (p)
+		swap_entry_free(p, entry, 1, false);
 }
 
 /*
@@ -854,11 +962,9 @@ void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 
-	p = swap_info_get(entry);
-	if (p) {
-		swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		spin_unlock(&p->lock);
-	}
+	p = _swap_info_get(entry);
+	if (p)
+		swap_entry_free(p, entry, SWAP_HAS_CACHE, false);
 }
 
 /*
@@ -870,13 +976,17 @@ int page_swapcount(struct page *page)
 {
 	int count = 0;
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	swp_entry_t entry;
+	unsigned long offset;
 
 	entry.val = page_private(page);
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (p) {
-		count = swap_count(p->swap_map[swp_offset(entry)]);
-		spin_unlock(&p->lock);
+		offset = swp_offset(entry);
+		ci = lock_cluster_or_swap_info(p, offset);
+		count = swap_count(p->swap_map[offset]);
+		unlock_cluster_or_swap_info(p, ci);
 	}
 	return count;
 }
@@ -889,22 +999,26 @@ int swp_swapcount(swp_entry_t entry)
 {
 	int count, tmp_count, n;
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	struct page *page;
 	pgoff_t offset;
 	unsigned char *map;
 
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (!p)
 		return 0;
 
-	count = swap_count(p->swap_map[swp_offset(entry)]);
+	offset = swp_offset(entry);
+
+	ci = lock_cluster_or_swap_info(p, offset);
+
+	count = swap_count(p->swap_map[offset]);
 	if (!(count & COUNT_CONTINUED))
 		goto out;
 
 	count &= ~COUNT_CONTINUED;
 	n = SWAP_MAP_MAX + 1;
 
-	offset = swp_offset(entry);
 	page = vmalloc_to_page(p->swap_map + offset);
 	offset &= ~PAGE_MASK;
 	VM_BUG_ON(page_private(page) != SWP_CONTINUED);
@@ -919,7 +1033,7 @@ int swp_swapcount(swp_entry_t entry)
 		n *= (SWAP_CONT_MAX + 1);
 	} while (tmp_count & COUNT_CONTINUED);
 out:
-	spin_unlock(&p->lock);
+	unlock_cluster_or_swap_info(p, ci);
 	return count;
 }
 
@@ -1003,7 +1117,7 @@ int free_swap_and_cache(swp_entry_t entry)
 
 	p = swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
+		if (swap_entry_free(p, entry, 1, true) == SWAP_HAS_CACHE) {
 			page = find_get_page(swap_address_space(entry),
 					     swp_offset(entry));
 			if (page && !trylock_page(page)) {
@@ -2283,6 +2397,9 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 	return maxpages;
 }
 
+#define SWAP_CLUSTER_COLS						\
+	DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
+
 static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					union swap_header *swap_header,
 					unsigned char *swap_map,
@@ -2290,11 +2407,12 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					unsigned long maxpages,
 					sector_t *span)
 {
-	int i;
+	unsigned int j, k;
 	unsigned int nr_good_pages;
 	int nr_extents;
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	unsigned long idx = p->cluster_next / SWAPFILE_CLUSTER;
+	unsigned long col = p->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
+	unsigned long i, idx;
 
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
@@ -2342,15 +2460,20 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	if (!cluster_info)
 		return nr_extents;
 
-	for (i = 0; i < nr_clusters; i++) {
-		if (!cluster_count(&cluster_info[idx])) {
+
+	/* Reduce false cache line sharing between cluster_info */
+	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
+		j = (k + col) % SWAP_CLUSTER_COLS;
+		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
+			idx = i * SWAP_CLUSTER_COLS + j;
+			if (idx >= nr_clusters)
+				continue;
+			if (cluster_count(&cluster_info[idx]))
+				continue;
 			cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
 			cluster_list_add_tail(&p->free_clusters, cluster_info,
 					      idx);
 		}
-		idx++;
-		if (idx == nr_clusters)
-			idx = 0;
 	}
 	return nr_extents;
 }
@@ -2608,6 +2731,7 @@ void si_swapinfo(struct sysinfo *val)
 static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 {
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	unsigned long offset, type;
 	unsigned char count;
 	unsigned char has_cache;
@@ -2621,10 +2745,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 		goto bad_file;
 	p = swap_info[type];
 	offset = swp_offset(entry);
-
-	spin_lock(&p->lock);
 	if (unlikely(offset >= p->max))
-		goto unlock_out;
+		goto out;
+
+	ci = lock_cluster_or_swap_info(p, offset);
 
 	count = p->swap_map[offset];
 
@@ -2667,7 +2791,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	p->swap_map[offset] = count | has_cache;
 
 unlock_out:
-	spin_unlock(&p->lock);
+	unlock_cluster_or_swap_info(p, ci);
 out:
 	return err;
 
@@ -2756,6 +2880,7 @@ EXPORT_SYMBOL_GPL(__page_file_index);
 int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 {
 	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
 	struct page *head;
 	struct page *page;
 	struct page *list_page;
@@ -2779,6 +2904,9 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	offset = swp_offset(entry);
+
+	ci = lock_cluster(si, offset);
+
 	count = si->swap_map[offset] & ~SWAP_HAS_CACHE;
 
 	if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
@@ -2791,6 +2919,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	if (!page) {
+		unlock_cluster_or_swap_info(si, ci);
 		spin_unlock(&si->lock);
 		return -ENOMEM;
 	}
@@ -2839,6 +2968,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	list_add_tail(&page->lru, &head->lru);
 	page = NULL;			/* now it's attached, don't free it */
 out:
+	unlock_cluster(ci);
 	spin_unlock(&si->lock);
 outer:
 	if (page)
@@ -2852,7 +2982,8 @@ outer:
  * into, carry if so, or else fail until a new continuation page is allocated;
  * when the original swap_map count is decremented from 0 with continuation,
  * borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or swap_entry_free() holds swap_lock.
+ * Called while __swap_duplicate() or swap_entry_free() holds swap or cluster
+ * lock.
  */
 static bool swap_count_continued(struct swap_info_struct *si,
 				 pgoff_t offset, unsigned char count)
-- 
2.5.5


* [PATCH v2 3/8] mm/swap: Split swap cache into 64MB trunks
  2016-10-20 23:31 [PATCH v2 0/8] mm/swap: Regular page swap optimizations Tim Chen
  2016-10-20 23:31 ` [PATCH v2 1/8] mm/swap: Fix kernel message in swap_info_get() Tim Chen
  2016-10-20 23:31 ` [PATCH v2 2/8] mm/swap: Add cluster lock Tim Chen
@ 2016-10-20 23:31 ` Tim Chen
  2016-10-20 23:31 ` [PATCH v2 4/8] mm/swap: skip read ahead for unreferenced swap slots Tim Chen
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2016-10-20 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton, Tim Chen

From: "Huang, Ying" <ying.huang@intel.com>

This patch improves the scalability of swap out/in by using fine
grained locks for the swap cache.  In the current kernel, one address
space is used for each swap device, and in the common configuration the
number of swap devices is very small (one is typical).  This causes
heavy lock contention on the radix tree of that address space if
multiple tasks swap out/in concurrently.  But in fact, there is no
dependency between pages in the swap cache, so we can split the one
shared address space per swap device into several address spaces to
reduce the lock contention.  In this patch, the shared address space is
split into 64MB trunks.  64MB is chosen to balance the memory space
usage against the effect of the lock contention reduction.

The size of struct address_space on the x86_64 architecture is 408
bytes, so with this patch, 6528 bytes more memory will be used for
every 1GB of swap space on x86_64.
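
(The arithmetic, assuming 4KB pages: SWAP_ADDRESS_SPACE_SHIFT = 14 in
the diff below gives 2^14 pages = 64MB per address space, so 1GB of
swap needs 1GB / 64MB = 16 address spaces, and 16 * 408B = 6528B.)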

One address space is still shared by the swap entries in the same 64MB
trunk.  To avoid lock contention during the first round of swap space
allocation, the order of the swap clusters in the initial free clusters
list is changed so that the swap space distance between consecutive
swap clusters in the free cluster list is at least 64MB.  After the
first round of allocation, the swap clusters are expected to be freed
randomly, so the lock contention should be reduced effectively.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap.h | 10 ++++++--
 mm/swap.c            |  6 -----
 mm/swap_state.c      | 71 +++++++++++++++++++++++++++++++++++++++++++---------
 mm/swapfile.c        | 16 ++++++++++--
 4 files changed, 81 insertions(+), 22 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2e0c9ae..6bda950 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -370,8 +370,12 @@ int generic_swapfile_activate(struct swap_info_struct *, struct file *,
 		sector_t *);
 
 /* linux/mm/swap_state.c */
-extern struct address_space swapper_spaces[];
-#define swap_address_space(entry) (&swapper_spaces[swp_type(entry)])
+/* One swap address space for each 64M swap space */
+#define SWAP_ADDRESS_SPACE_SHIFT	14
+#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
+extern struct address_space* swapper_spaces[];
+#define swap_address_space(entry)					\
+	(&swapper_spaces[swp_type(entry)][swp_offset(entry) >> SWAP_ADDRESS_SPACE_SHIFT])
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *, struct list_head *list);
@@ -425,6 +429,8 @@ extern struct swap_info_struct *page_swap_info(struct page *);
 extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
+extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
+extern void exit_swap_address_space(unsigned int type);
 
 #else /* CONFIG_SWAP */
 
diff --git a/mm/swap.c b/mm/swap.c
index 4dcf852..5c2ea71 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -969,12 +969,6 @@ EXPORT_SYMBOL(pagevec_lookup_tag);
 void __init swap_setup(void)
 {
 	unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);
-#ifdef CONFIG_SWAP
-	int i;
-
-	for (i = 0; i < MAX_SWAPFILES; i++)
-		spin_lock_init(&swapper_spaces[i].tree_lock);
-#endif
 
 	/* Use a smaller cluster for small-memory machines */
 	if (megs < 16)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 35d7e0e..0e377f5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
+#include <linux/vmalloc.h>
 
 #include <asm/pgtable.h>
 
@@ -32,15 +33,8 @@ static const struct address_space_operations swap_aops = {
 #endif
 };
 
-struct address_space swapper_spaces[MAX_SWAPFILES] = {
-	[0 ... MAX_SWAPFILES - 1] = {
-		.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
-		.i_mmap_writable = ATOMIC_INIT(0),
-		.a_ops		= &swap_aops,
-		/* swap cache doesn't use writeback related tags */
-		.flags		= 1 << AS_NO_WRITEBACK_TAGS,
-	}
-};
+struct address_space *swapper_spaces[MAX_SWAPFILES];
+static unsigned int nr_swapper_spaces[MAX_SWAPFILES];
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
 
@@ -53,11 +47,26 @@ static struct {
 
 unsigned long total_swapcache_pages(void)
 {
-	int i;
+	unsigned int i, j, nr;
 	unsigned long ret = 0;
+	struct address_space *spaces;
 
-	for (i = 0; i < MAX_SWAPFILES; i++)
-		ret += swapper_spaces[i].nrpages;
+	rcu_read_lock();
+	for (i = 0; i < MAX_SWAPFILES; i++) {
+		/*
+		 * The corresponding entries in nr_swapper_spaces and
+		 * swapper_spaces will be reused only after at least
+		 * one grace period.  So the two values read here
+		 * cannot belong to different usages.
+		 */
+		nr = nr_swapper_spaces[i];
+		spaces = rcu_dereference(swapper_spaces[i]);
+		if (!nr || !spaces)
+			continue;
+		for (j = 0; j < nr; j++)
+			ret += spaces[j].nrpages;
+	}
+	rcu_read_unlock();
 	return ret;
 }
 
@@ -505,3 +514,41 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
+
+int init_swap_address_space(unsigned int type, unsigned long nr_pages)
+{
+	struct address_space *spaces, *space;
+	unsigned int i, nr;
+
+	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
+	spaces = kzalloc(sizeof(struct address_space) * nr, GFP_KERNEL);
+	if (!spaces) {
+		spaces = vzalloc(sizeof(struct address_space) * nr);
+		if (!spaces)
+			return -ENOMEM;
+	}
+	for (i = 0; i < nr; i++) {
+		space = spaces + i;
+		INIT_RADIX_TREE(&space->page_tree, GFP_ATOMIC|__GFP_NOWARN);
+		atomic_set(&space->i_mmap_writable, 0);
+		space->a_ops = &swap_aops;
+		/* swap cache doesn't use writeback related tags */
+		mapping_set_no_writeback_tags(space);
+		spin_lock_init(&space->tree_lock);
+	}
+	nr_swapper_spaces[type] = nr;
+	rcu_assign_pointer(swapper_spaces[type], spaces);
+
+	return 0;
+}
+
+void exit_swap_address_space(unsigned int type)
+{
+	struct address_space *spaces;
+
+	spaces = swapper_spaces[type];
+	nr_swapper_spaces[type] = 0;
+	rcu_assign_pointer(swapper_spaces[type], NULL);
+	synchronize_rcu();
+	kvfree(spaces);
+}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3806bce..4f1b721 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2076,6 +2076,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	vfree(frontswap_map);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
+	exit_swap_address_space(p->type);
 
 	inode = mapping->host;
 	if (S_ISBLK(inode->i_mode)) {
@@ -2397,8 +2398,12 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 	return maxpages;
 }
 
-#define SWAP_CLUSTER_COLS						\
+#define SWAP_CLUSTER_INFO_COLS						\
 	DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
+#define SWAP_CLUSTER_SPACE_COLS						\
+	DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
+#define SWAP_CLUSTER_COLS						\
+	max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
 
 static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					union swap_header *swap_header,
@@ -2461,7 +2466,10 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 		return nr_extents;
 
 
-	/* Reduce false cache line sharing between cluster_info */
+	/*
+	 * Reduce false cache line sharing between cluster_info entries
+	 * and spread consecutive clusters across swap cache address spaces.
+	 */
 	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
 		j = (k + col) % SWAP_CLUSTER_COLS;
 		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
@@ -2642,6 +2650,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		}
 	}
 
+	error = init_swap_address_space(p->type, maxpages);
+	if (error)
+		goto bad_swap;
+
 	mutex_lock(&swapon_mutex);
 	prio = -1;
 	if (swap_flags & SWAP_FLAG_PREFER)
-- 
2.5.5


* [PATCH v2 4/8] mm/swap: skip read ahead for unreferenced swap slots
  2016-10-20 23:31 [PATCH v2 0/8] mm/swap: Regular page swap optimizations Tim Chen
                   ` (2 preceding siblings ...)
  2016-10-20 23:31 ` [PATCH v2 3/8] mm/swap: Split swap cache into 64MB trunks Tim Chen
@ 2016-10-20 23:31 ` Tim Chen
  2016-10-20 23:31 ` [PATCH v2 5/8] mm/swap: Allocate swap slots in batches Tim Chen
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2016-10-20 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton

We can avoid needlessly allocating a page for swap slots that are not
used by anyone.  No pages have to be read in for these slots.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |  6 ++++++
 mm/swap_state.c      |  4 ++++
 mm/swapfile.c        | 47 +++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6bda950..13bbc5f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -424,6 +424,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern bool reuse_swap_page(struct page *, int *);
@@ -518,6 +519,11 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
+static inline int __swp_swapcount(swp_entry_t entry)
+{
+	return 0;
+}
+
 static inline int swp_swapcount(swp_entry_t entry)
 {
 	return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0e377f5..1f52ff6 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -314,6 +314,10 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	*new_page_allocated = false;
 
 	do {
+		/* Just skip read ahead for unused swap slot */
+		if (!__swp_swapcount(entry))
+			return NULL;
+
 		/*
 		 * First check the swap cache.  Since this is normally
 		 * called after lookup_swap_cache() failed, re-calling
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4f1b721..19e3ea9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -803,7 +803,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	return (swp_entry_t) {0};
 }
 
-static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
+static struct swap_info_struct *__swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	unsigned long offset, type;
@@ -819,13 +819,8 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	offset = swp_offset(entry);
 	if (offset >= p->max)
 		goto bad_offset;
-	if (!p->swap_map[offset])
-		goto bad_free;
 	return p;
 
-bad_free:
-	pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val);
-	goto out;
 bad_offset:
 	pr_err("swap_info_get: %s%08lx\n", Bad_offset, entry.val);
 	goto out;
@@ -838,6 +833,24 @@ out:
 	return NULL;
 }
 
+static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
+{
+	struct swap_info_struct *p;
+
+	p = __swap_info_get(entry);
+	if (!p)
+		goto out;
+	if (!p->swap_map[swp_offset(entry)])
+		goto bad_free;
+	return p;
+
+bad_free:
+	pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val);
+	goto out;
+out:
+	return NULL;
+}
+
 static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
@@ -993,6 +1006,28 @@ int page_swapcount(struct page *page)
 
 /*
  * How many references to @entry are currently swapped out?
+ * This does not give an exact answer when swap count is continued,
+ * but does include the high COUNT_CONTINUED flag to allow for that.
+ */
+int __swp_swapcount(swp_entry_t entry)
+{
+	int count = 0;
+	pgoff_t offset;
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+
+	si = __swap_info_get(entry);
+	if (si) {
+		offset = swp_offset(entry);
+		ci = lock_cluster_or_swap_info(si, offset);
+		count = swap_count(si->swap_map[offset]);
+		unlock_cluster_or_swap_info(si, ci);
+	}
+	return count;
+}
+
+/*
+ * How many references to @entry are currently swapped out?
  * This considers COUNT_CONTINUED so it returns exact answer.
  */
 int swp_swapcount(swp_entry_t entry)
-- 
2.5.5


* [PATCH v2 5/8] mm/swap: Allocate swap slots in batches
  2016-10-20 23:31 [PATCH v2 0/8] mm/swap: Regular page swap optimizations Tim Chen
                   ` (3 preceding siblings ...)
  2016-10-20 23:31 ` [PATCH v2 4/8] mm/swap: skip read ahead for unreferenced swap slots Tim Chen
@ 2016-10-20 23:31 ` Tim Chen
  2016-10-20 23:31 ` [PATCH v2 6/8] mm/swap: Free swap slots in batch Tim Chen
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2016-10-20 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton

Currently, swap slots are allocated one page at a time, causing
contention on the swap_info lock protecting the swap partition for
every page being swapped.

This patch adds new functions get_swap_pages and scan_swap_map_slots
to request multiple swap slots at once.  This reduces the lock
contention on the swap_info lock.  scan_swap_map_slots can also operate
more efficiently, as free swap slots often occur in clusters close to
each other on a swap device, and it is quicker to allocate them
together.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |   2 +
 mm/swapfile.c        | 161 ++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 136 insertions(+), 27 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 13bbc5f..d2b69bd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -27,6 +27,7 @@ struct bio;
 #define SWAP_FLAGS_VALID	(SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
 				 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
 				 SWAP_FLAG_DISCARD_PAGES)
+#define SWAP_BATCH 64
 
 static inline int current_is_kswapd(void)
 {
@@ -412,6 +413,7 @@ static inline long get_nr_swap_pages(void)
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
 extern swp_entry_t get_swap_page_of_type(int);
+extern int get_swap_pages(int n, swp_entry_t swp_entries[]);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 19e3ea9..00c56c9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -501,7 +501,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
  * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
  * might involve allocating a new cluster for current CPU too.
  */
-static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
+static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned long *offset, unsigned long *scan_base)
 {
 	struct percpu_cluster *cluster;
@@ -525,7 +525,7 @@ new_cluster:
 			*scan_base = *offset = si->cluster_next;
 			goto new_cluster;
 		} else
-			return;
+			return false;
 	}
 
 	found_free = false;
@@ -557,16 +557,22 @@ new_cluster:
 	cluster->next = tmp + 1;
 	*offset = tmp;
 	*scan_base = tmp;
+	return found_free;
 }
 
-static unsigned long scan_swap_map(struct swap_info_struct *si,
-				   unsigned char usage)
+static int scan_swap_map_slots(struct swap_info_struct *si,
+				   unsigned char usage, int nr,
+				   unsigned long slots[])
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
 	int latency_ration = LATENCY_LIMIT;
+	int n_ret = 0;
+
+	if (nr > SWAP_BATCH)
+		nr = SWAP_BATCH;
 
 	/*
 	 * We try to cluster swap pages by allocating them sequentially
@@ -584,8 +590,10 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 	/* SSD algorithm */
 	if (si->cluster_info) {
-		scan_swap_map_try_ssd_cluster(si, &offset, &scan_base);
-		goto checks;
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+			goto checks;
+		else
+			goto scan;
 	}
 
 	if (unlikely(!si->cluster_nr--)) {
@@ -629,8 +637,14 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 checks:
 	if (si->cluster_info) {
-		while (scan_swap_map_ssd_cluster_conflict(si, offset))
-			scan_swap_map_try_ssd_cluster(si, &offset, &scan_base);
+		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
+		/* take a break if we already got some slots */
+			if (n_ret)
+				goto done;
+			if (!scan_swap_map_try_ssd_cluster(si, &offset,
+							&scan_base))
+				goto scan;
+		}
 	}
 	if (!(si->flags & SWP_WRITEOK))
 		goto no_page;
@@ -654,8 +668,13 @@ checks:
 	}
 
 	if (si->swap_map[offset]) {
-		unlock_cluster(ci);
-		goto scan;
+		if (!n_ret) {
+			unlock_cluster(ci);
+			goto scan;
+		} else {
+			unlock_cluster(ci);
+			goto done;
+		}
 	}
 
 	if (offset == si->lowest_bit)
@@ -674,9 +693,44 @@ checks:
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	unlock_cluster(ci);
 	si->cluster_next = offset + 1;
-	si->flags -= SWP_SCANNING;
+	slots[n_ret] = offset;
+	++n_ret;
+
+	/* got enough slots or reach max slots? */
+	if ((n_ret == nr) || (offset >= si->highest_bit))
+		goto done;
+
+	/* search for next available slot */
+
+	/* time to take a break? */
+	if (unlikely(--latency_ration < 0)) {
+		if (n_ret)
+			goto done;
+		spin_unlock(&si->lock);
+		cond_resched();
+		spin_lock(&si->lock);
+		latency_ration = LATENCY_LIMIT;
+	}
+
+	/* try to get more slots in cluster */
+	if (si->cluster_info) {
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+			goto checks;
+		else
+			goto done;
+	}
+	/* non-ssd case */
+	++offset;
 
-	return offset;
+	/* non-ssd case, still more slots in cluster? */
+	if (si->cluster_nr && !si->swap_map[offset]) {
+		--si->cluster_nr;
+		goto checks;
+	}
+
+done:
+	si->flags -= SWP_SCANNING;
+	return n_ret;
 
 scan:
 	spin_unlock(&si->lock);
@@ -714,17 +768,46 @@ scan:
 
 no_page:
 	si->flags -= SWP_SCANNING;
-	return 0;
+	return n_ret;
 }
 
-swp_entry_t get_swap_page(void)
+static unsigned long scan_swap_map(struct swap_info_struct *si,
+				   unsigned char usage)
+{
+	unsigned long slots[1];
+	int n_ret;
+
+	n_ret = scan_swap_map_slots(si, usage, 1, slots);
+
+	if (n_ret)
+		return slots[0];
+	else
+		return 0;
+
+}
+
+int get_swap_pages(int n, swp_entry_t swp_entries[])
 {
 	struct swap_info_struct *si, *next;
-	pgoff_t offset;
+	long avail_pgs;
+	unsigned long *slots;
+	int ret, i, n_ret, n_goal;
 
-	if (atomic_long_read(&nr_swap_pages) <= 0)
+	n_ret = 0;
+	avail_pgs = atomic_long_read(&nr_swap_pages);
+	if (avail_pgs <= 0)
 		goto noswap;
-	atomic_long_dec(&nr_swap_pages);
+
+	n_goal = n;
+	swp_entries[0] = (swp_entry_t) {0};
+
+	if (n_goal > SWAP_BATCH)
+		n_goal = SWAP_BATCH;
+
+	if (n_goal > avail_pgs)
+		n_goal = avail_pgs;
+
+	atomic_long_sub(n_goal, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
 
@@ -751,13 +834,27 @@ start_over:
 			goto nextsi;
 		}
 
-		/* This is called for allocating swap entry for cache */
-		offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		/* use swp_entries array to store slots returned, both same size */
+		BUILD_BUG_ON(sizeof(swp_entry_t) != sizeof(unsigned long));
+		slots = (unsigned long *) &swp_entries[n_ret];
+		ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+						n_goal-n_ret, slots);
+		if (!ret) {
+			pr_debug("scan_swap_map of si %d failed to find offset\n",
+			       si->type);
+			goto next;
+		}
+
+		for (i = 0; i < ret; ++i)
+			swp_entries[n_ret+i] = swp_entry(si->type,
+							slots[i]);
+
+		n_ret += ret;
+next:
 		spin_unlock(&si->lock);
-		if (offset)
-			return swp_entry(si->type, offset);
-		pr_debug("scan_swap_map of si %d failed to find offset\n",
-		       si->type);
+		if (n_ret == n_goal)
+			return n_ret;
+
 		spin_lock(&swap_avail_lock);
 nextsi:
 		/*
@@ -768,17 +865,27 @@ nextsi:
 		 * up between us dropping swap_avail_lock and taking si->lock.
 		 * Since we dropped the swap_avail_lock, the swap_avail_head
 		 * list may have been modified; so if next is still in the
-		 * swap_avail_head list then try it, otherwise start over.
+		 * swap_avail_head list then try it, otherwise start over
+		 * if we have not gotten any slots.
 		 */
-		if (plist_node_empty(&next->avail_list))
+		if (plist_node_empty(&next->avail_list) && !n_ret)
 			goto start_over;
 	}
 
 	spin_unlock(&swap_avail_lock);
 
-	atomic_long_inc(&nr_swap_pages);
+	if (n_ret < n_goal)
+		atomic_long_add((long) (n_goal-n_ret), &nr_swap_pages);
 noswap:
-	return (swp_entry_t) {0};
+	return n_ret;
+}
+
+swp_entry_t get_swap_page(void)
+{
+	swp_entry_t entry;
+
+	get_swap_pages(1, &entry);
+	return entry;
 }
 
 /* The only caller of this function is now suspend routine */
-- 
2.5.5


* [PATCH v2 6/8] mm/swap: Free swap slots in batch
  2016-10-20 23:31 [PATCH v2 0/8] mm/swap: Regular page swap optimizations Tim Chen
                   ` (4 preceding siblings ...)
  2016-10-20 23:31 ` [PATCH v2 5/8] mm/swap: Allocate swap slots in batches Tim Chen
@ 2016-10-20 23:31 ` Tim Chen
  2016-10-20 23:31 ` [PATCH v2 7/8] mm/swap: Add cache for swap slots allocation Tim Chen
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2016-10-20 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton

Add new functions that free unused swap slots in batches without
the need to reacquire the swap info lock for each slot.  This improves
scalability and reduces lock contention.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |   1 +
 mm/swapfile.c        | 155 +++++++++++++++++++++++++++++++--------------------
 2 files changed, 95 insertions(+), 61 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d2b69bd..7492fcc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -420,6 +420,7 @@ extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free(swp_entry_t);
+extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 00c56c9..4125a1d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -968,35 +968,34 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 	return p;
 }
 
-static unsigned char swap_entry_free(struct swap_info_struct *p,
-				     swp_entry_t entry, unsigned char usage,
-				     bool swap_info_locked)
+static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
+					struct swap_info_struct *q)
+{
+	struct swap_info_struct *p;
+
+	p = _swap_info_get(entry);
+
+	if (p != q) {
+		if (q != NULL)
+			spin_unlock(&q->lock);
+		if (p != NULL)
+			spin_lock(&p->lock);
+	}
+	return p;
+}
+
+static unsigned char __swap_entry_free(struct swap_info_struct *p,
+				       swp_entry_t entry, unsigned char usage)
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset = swp_offset(entry);
 	unsigned char count;
 	unsigned char has_cache;
-	bool lock_swap_info = false;
-
-	if (!swap_info_locked) {
-		count = p->swap_map[offset];
-		if (!p->cluster_info || count == usage || count == SWAP_MAP_SHMEM) {
-lock_swap_info:
-			swap_info_locked = true;
-			lock_swap_info = true;
-			spin_lock(&p->lock);
-		}
-	}
 
-	ci = lock_cluster(p, offset);
+	ci = lock_cluster_or_swap_info(p, offset);
 
 	count = p->swap_map[offset];
 
-	if (!swap_info_locked && (count == usage || count == SWAP_MAP_SHMEM)) {
-		unlock_cluster(ci);
-		goto lock_swap_info;
-	}
-
 	has_cache = count & SWAP_HAS_CACHE;
 	count &= ~SWAP_HAS_CACHE;
 
@@ -1020,46 +1019,52 @@ lock_swap_info:
 	}
 
 	usage = count | has_cache;
-	p->swap_map[offset] = usage;
+	p->swap_map[offset] = usage ? : SWAP_HAS_CACHE;
+
+	unlock_cluster_or_swap_info(p, ci);
+
+	return usage;
+}
 
+static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
+{
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+	unsigned char count;
+
+	ci = lock_cluster(p, offset);
+	count = p->swap_map[offset];
+	VM_BUG_ON(count != SWAP_HAS_CACHE);
+	p->swap_map[offset] = 0;
+	dec_cluster_info_page(p, p->cluster_info, offset);
 	unlock_cluster(ci);
 
-	/* free if no reference */
-	if (!usage) {
-		VM_BUG_ON(!swap_info_locked);
-		mem_cgroup_uncharge_swap(entry);
-		ci = lock_cluster(p, offset);
-		dec_cluster_info_page(p, p->cluster_info, offset);
-		unlock_cluster(ci);
-		if (offset < p->lowest_bit)
-			p->lowest_bit = offset;
-		if (offset > p->highest_bit) {
-			bool was_full = !p->highest_bit;
-			p->highest_bit = offset;
-			if (was_full && (p->flags & SWP_WRITEOK)) {
-				spin_lock(&swap_avail_lock);
-				WARN_ON(!plist_node_empty(&p->avail_list));
-				if (plist_node_empty(&p->avail_list))
-					plist_add(&p->avail_list,
-						  &swap_avail_head);
-				spin_unlock(&swap_avail_lock);
-			}
-		}
-		atomic_long_inc(&nr_swap_pages);
-		p->inuse_pages--;
-		frontswap_invalidate_page(p->type, offset);
-		if (p->flags & SWP_BLKDEV) {
-			struct gendisk *disk = p->bdev->bd_disk;
-			if (disk->fops->swap_slot_free_notify)
-				disk->fops->swap_slot_free_notify(p->bdev,
-								  offset);
+	mem_cgroup_uncharge_swap(entry);
+	if (offset < p->lowest_bit)
+		p->lowest_bit = offset;
+	if (offset > p->highest_bit) {
+		bool was_full = !p->highest_bit;
+
+		p->highest_bit = offset;
+		if (was_full && (p->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&p->avail_list));
+			if (plist_node_empty(&p->avail_list))
+				plist_add(&p->avail_list,
+					  &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
 		}
 	}
+	atomic_long_inc(&nr_swap_pages);
+	p->inuse_pages--;
+	frontswap_invalidate_page(p->type, offset);
+	if (p->flags & SWP_BLKDEV) {
+		struct gendisk *disk = p->bdev->bd_disk;
 
-	if (lock_swap_info)
-		spin_unlock(&p->lock);
-
-	return usage;
+		if (disk->fops->swap_slot_free_notify)
+			disk->fops->swap_slot_free_notify(p->bdev,
+							  offset);
+	}
 }
 
 /*
@@ -1071,8 +1076,10 @@ void swap_free(swp_entry_t entry)
 	struct swap_info_struct *p;
 
 	p = _swap_info_get(entry);
-	if (p)
-		swap_entry_free(p, entry, 1, false);
+	if (p) {
+		if (!__swap_entry_free(p, entry, 1))
+			swapcache_free_entries(&entry, 1);
+	}
 }
 
 /*
@@ -1083,8 +1090,32 @@ void swapcache_free(swp_entry_t entry)
 	struct swap_info_struct *p;
 
 	p = _swap_info_get(entry);
+	if (p) {
+		if (!__swap_entry_free(p, entry, SWAP_HAS_CACHE))
+			swapcache_free_entries(&entry, 1);
+	}
+}
+
+void swapcache_free_entries(swp_entry_t *entries, int n)
+{
+	struct swap_info_struct *p, *prev;
+	int i;
+
+	if (n <= 0)
+		return;
+
+	prev = NULL;
+	p = NULL;
+	for (i = 0; i < n; ++i) {
+		p = swap_info_get_cont(entries[i], prev);
+		if (p)
+			swap_entry_free(p, entries[i]);
+		else
+			break;
+		prev = p;
+	}
 	if (p)
-		swap_entry_free(p, entry, SWAP_HAS_CACHE, false);
+		spin_unlock(&p->lock);
 }
 
 /*
@@ -1253,21 +1284,23 @@ int free_swap_and_cache(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	struct page *page = NULL;
+	unsigned char count;
 
 	if (non_swap_entry(entry))
 		return 1;
 
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, entry, 1, true) == SWAP_HAS_CACHE) {
+		count = __swap_entry_free(p, entry, 1);
+		if (count == SWAP_HAS_CACHE) {
 			page = find_get_page(swap_address_space(entry),
 					     swp_offset(entry));
 			if (page && !trylock_page(page)) {
 				put_page(page);
 				page = NULL;
 			}
-		}
-		spin_unlock(&p->lock);
+		} else if (!count)
+			swapcache_free_entries(&entry, 1);
 	}
 	if (page) {
 		/*
-- 
2.5.5


* [PATCH v2 7/8] mm/swap: Add cache for swap slots allocation
  2016-10-20 23:31 [PATCH v2 0/8] mm/swap: Regular page swap optimizations Tim Chen
                   ` (5 preceding siblings ...)
  2016-10-20 23:31 ` [PATCH v2 6/8] mm/swap: Free swap slots in batch Tim Chen
@ 2016-10-20 23:31 ` Tim Chen
  2016-10-20 23:31 ` [PATCH v2 8/8] mm/swap: Enable swap slots cache usage Tim Chen
  2016-10-21  8:16 ` [PATCH v2 0/8] mm/swap: Regular page swap optimizations Christian Borntraeger
  8 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2016-10-20 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton

We add per cpu caches for swap slots that can be allocated and freed
quickly without the need to touch the swap info lock.

Two separate caches are maintained, one for swap slots allocated and
one for swap slots returned.  This allows the swap slots to be returned
to the global pool in a batch, so they have a chance to be coalesced
with other slots in a cluster.  We do not reuse the returned slots
right away, as doing so may increase fragmentation of the slots.

The swap allocation cache is protected by a mutex, as we may sleep
when searching for empty slots to refill the cache.  The swap free
cache is protected by a spin lock, as we cannot sleep in the free path.

We refill the swap slots cache when we run out of slots.  We disable
the swap slots cache and drain the slots if the global number of
slots falls below a low watermark threshold, and re-enable the cache
again when the number of available slots rises above a high watermark.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
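For reviewers skimming the diff, here is a rough standalone model of the
activate/deactivate hysteresis described above.  This is an illustrative
sketch only, not the kernel code: the cache size constant stands in for
SWAP_BATCH used elsewhere in this series, and locking, draining and the
enable/initialize checks are left out.

	#include <stdbool.h>
	#include <stdio.h>

	#define SLOTS_CACHE_SIZE	64	/* stand-in for SWAP_BATCH */
	#define THRESHOLD_ACTIVATE	(5 * SLOTS_CACHE_SIZE)
	#define THRESHOLD_DEACTIVATE	(2 * SLOTS_CACHE_SIZE)

	static bool cache_active;

	/* mirrors the policy of check_cache_active() in mm/swap_slots.c below */
	static bool model_check_cache_active(long free_slots, int ncpus)
	{
		if (!cache_active && free_slots > ncpus * THRESHOLD_ACTIVATE)
			cache_active = true;	/* plenty of free swap: use the caches */
		else if (cache_active && free_slots < ncpus * THRESHOLD_DEACTIVATE)
			cache_active = false;	/* swap running low: bypass the caches */
		return cache_active;
	}

	int main(void)
	{
		printf("%d\n", model_check_cache_active(10000, 8));	/* 1: activates */
		printf("%d\n", model_check_cache_active(500, 8));	/* 0: deactivates */
		return 0;
	}
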
 include/linux/swap.h       |   3 +
 include/linux/swap_slots.h |  37 ++++++
 mm/Makefile                |   2 +-
 mm/swap_slots.c            | 306 +++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c            |   1 +
 mm/swapfile.c              |  15 +--
 6 files changed, 352 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/swap_slots.h
 create mode 100644 mm/swap_slots.c

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7492fcc..5f88ac20 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -436,6 +436,9 @@ struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
 
+extern int get_swap_slots(int n, swp_entry_t *slots);
+extern void swapcache_free_batch(swp_entry_t *entries, int n);
+
 #else /* CONFIG_SWAP */
 
 #define swap_address_space(entry)		(NULL)
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
new file mode 100644
index 0000000..cd7bc2d
--- /dev/null
+++ b/include/linux/swap_slots.h
@@ -0,0 +1,37 @@
+#ifndef _LINUX_SWAP_SLOTS_H
+#define _LINUX_SWAP_SLOTS_H
+
+#include <linux/swap.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+
+#define SWAP_SLOTS_CACHE_SIZE			SWAP_BATCH
+#define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE	(5*SWAP_SLOTS_CACHE_SIZE)
+#define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE	(2*SWAP_SLOTS_CACHE_SIZE)
+
+struct swap_slots_cache {
+	struct mutex	alloc_lock;
+	swp_entry_t	*slots;
+	int		nr;
+	int		cur;
+	spinlock_t	free_lock;
+	swp_entry_t	*slots_ret;
+	int		n_ret;
+};
+
+DECLARE_PER_CPU(struct swap_slots_cache, swp_slots);
+extern bool    swap_slot_cache_enabled;
+
+void drain_swap_slots_cache(unsigned int type);
+void deactivate_swap_slots_cache(void);
+void disable_swap_slots_cache(void);
+void reenable_swap_slots_cache(void);
+void reactivate_swap_slots_cache(void);
+
+int free_swap_slot_to_cache(swp_entry_t entry);
+swp_entry_t get_swap_slot(void);
+int get_swap_slots(int n, swp_entry_t *entries);
+void free_swap_slot_cache(int cpu);
+int enable_swap_slot_caches(void);
+
+#endif /* _LINUX_SWAP_SLOTS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 2ca1faf..9d51267 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -38,7 +38,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o vmacache.o \
+			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
new file mode 100644
index 0000000..2821de1
--- /dev/null
+++ b/mm/swap_slots.c
@@ -0,0 +1,306 @@
+/*
+ * Manage cache of swap slots to be used for and returned from
+ * swap.
+ *
+ * Copyright(c) 2014 Intel Corporation.
+ *
+ * Author: Tim Chen <tim.c.chen@linux.intel.com>
+ *
+ * We allocate the swap slots from the global pool and put
+ * them into local per cpu caches.  This has the advantage
+ * of not needing to acquire the swap_info lock every time
+ * we need a new slot.
+ *
+ * There is also the opportunity to simply return a slot to
+ * the local caches without needing to acquire the swap_info
+ * lock.  We do not reuse the returned slots directly but
+ * move them back to the global pool in a batch.  This
+ * allows the slots to coalesce, reducing fragmentation.
+ *
+ * The allocated swap entry is marked with the SWAP_HAS_CACHE
+ * flag in its swap map count, which prevents it from being
+ * allocated again from the global pool.
+ *
+ * The swap slots cache is protected by a mutex instead of
+ * a spin lock as when we search for slots with scan_swap_map,
+ * we can possibly sleep.
+ */
+
+#include <linux/swap_slots.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/vmalloc.h>
+#include <linux/mutex.h>
+
+#ifdef CONFIG_SWAP
+
+DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);
+static bool	swap_slot_cache_active;
+bool	swap_slot_cache_enabled;
+static bool	swap_slot_cache_initialized;
+DEFINE_MUTEX(swap_slots_cache_mutex);
+
+#define use_swap_slot_cache (swap_slot_cache_active && \
+		swap_slot_cache_enabled && swap_slot_cache_initialized)
+#define SLOTS_CACHE 0x1
+#define SLOTS_CACHE_RET 0x2
+
+void deactivate_swap_slots_cache(void)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	swap_slot_cache_active = false;
+	drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+void disable_swap_slots_cache(void)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	swap_slot_cache_active = false;
+	swap_slot_cache_enabled = false;
+	drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+void reenable_swap_slots_cache(void)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	swap_slot_cache_enabled = true;
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+void reactivate_swap_slots_cache(void)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	swap_slot_cache_active = true;
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+static bool check_cache_active(void)
+{
+	long pages;
+
+	if (!swap_slot_cache_enabled || !swap_slot_cache_initialized)
+		return false;
+
+	pages = get_nr_swap_pages();
+	if (!swap_slot_cache_active) {
+		if (pages > num_online_cpus() *
+				THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE) {
+			drain_swap_slots_cache(SLOTS_CACHE_RET);
+			reactivate_swap_slots_cache();
+		}
+		goto out;
+	}
+
+	/* if global pool of slot caches too low, deactivate cache */
+	if (pages < num_online_cpus() * THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE)
+		deactivate_swap_slots_cache();
+out:
+	return swap_slot_cache_active;
+}
+
+static int alloc_swap_slot_cache(int cpu)
+{
+	struct swap_slots_cache *cache;
+
+	cache = &per_cpu(swp_slots, cpu);
+	mutex_init(&cache->alloc_lock);
+	spin_lock_init(&cache->free_lock);
+	cache->nr = 0;
+	cache->cur = 0;
+	cache->n_ret = 0;
+	cache->slots = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+	if (!cache->slots) {
+		swap_slot_cache_enabled = false;
+		return -ENOMEM;
+	}
+	cache->slots_ret = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+	if (!cache->slots_ret) {
+		vfree(cache->slots);
+		swap_slot_cache_enabled = false;
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+static void drain_slots_cache_cpu(int cpu, unsigned int type)
+{
+	int i, n;
+	swp_entry_t entry;
+	struct swap_slots_cache *cache;
+
+	cache = &per_cpu(swp_slots, cpu);
+	if (type & SLOTS_CACHE) {
+		mutex_lock(&cache->alloc_lock);
+		for (n = 0; n < cache->nr; ++n) {
+			i = (cache->cur + n) % SWAP_SLOTS_CACHE_SIZE;
+			/*
+			 * locking swap info is unnecessary,
+			 * nobody else will claim this map slot
+			 * and use it if its value is SWAP_HAS_CACHE
+			 */
+			entry = cache->slots[i];
+			swapcache_free_entries(&entry, 1);
+		}
+		cache->cur = 0;
+		cache->nr = 0;
+		mutex_unlock(&cache->alloc_lock);
+	}
+	if (type & SLOTS_CACHE_RET) {
+		spin_lock_irq(&cache->free_lock);
+		swapcache_free_entries(cache->slots_ret, cache->n_ret);
+		cache->n_ret = 0;
+		spin_unlock_irq(&cache->free_lock);
+	}
+}
+
+void drain_swap_slots_cache(unsigned int type)
+{
+	int cpu;
+
+	get_online_cpus();
+	for_each_online_cpu(cpu)
+		drain_slots_cache_cpu(cpu, type);
+	put_online_cpus();
+}
+
+static void free_slot_cache(int cpu)
+{
+	struct swap_slots_cache *cache;
+
+	mutex_lock(&swap_slots_cache_mutex);
+	drain_slots_cache_cpu(cpu, SLOTS_CACHE | SLOTS_CACHE_RET);
+	cache = &per_cpu(swp_slots, cpu);
+	cache->nr = 0;
+	cache->cur = 0;
+	cache->n_ret = 0;
+	vfree(cache->slots);
+	vfree(cache->slots_ret);
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+static int swap_cache_callback(struct notifier_block *nfb,
+			unsigned long action, void *hcpu)
+{
+	int cpu = (long)hcpu;
+
+	switch (action) {
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+			free_slot_cache(cpu);
+			break;
+	case CPU_ONLINE:
+			alloc_swap_slot_cache(cpu);
+			break;
+	}
+	return NOTIFY_OK;
+}
+
+int enable_swap_slot_caches(void)
+{
+	int	i, j;
+
+	if (swap_slot_cache_initialized) {
+		if (!swap_slot_cache_enabled)
+			reenable_swap_slots_cache();
+		return 0;
+	}
+
+	get_online_cpus();
+	for_each_online_cpu(i) {
+		if (alloc_swap_slot_cache(i))
+			goto fail;
+	}
+	swap_slot_cache_initialized = true;
+	swap_slot_cache_enabled = true;
+	put_online_cpus();
+	hotcpu_notifier(swap_cache_callback, 0);
+	mutex_init(&swap_slots_cache_mutex);
+	return 0;
+fail:
+	for_each_online_cpu(j) {
+		if (j == i)
+			break;
+		free_slot_cache(j);
+	}
+	put_online_cpus();
+	swap_slot_cache_initialized = false;
+	return -ENOMEM;
+}
+
+/* called with swap slot cache's alloc lock held */
+static int refill_swap_slots_cache(struct swap_slots_cache *cache)
+{
+	if (!use_swap_slot_cache || cache->nr)
+		return 0;
+
+	cache->cur = 0;
+	if (swap_slot_cache_active)
+		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
+					&cache->slots[cache->cur]);
+
+	return cache->nr;
+}
+
+int free_swap_slot_to_cache(swp_entry_t entry)
+{
+	struct swap_slots_cache *cache;
+	int idx;
+
+	BUG_ON(!swap_slot_cache_initialized);
+
+	cache = this_cpu_ptr(&swp_slots);
+
+	if (use_swap_slot_cache) {
+		spin_lock_irq(&cache->free_lock);
+		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
+			/*
+			 * return slots to global pool
+			 * note swap_map value should be SWAP_HAS_CACHE
+			 * set it to 0 to indicate it is available for
+			 * allocation in global pool
+			 */
+			swapcache_free_entries(cache->slots_ret, cache->n_ret);
+			cache->n_ret = 0;
+		}
+		/* return to local cache at tail */
+		idx = cache->n_ret % SWAP_SLOTS_CACHE_SIZE;
+		cache->slots_ret[idx] = entry;
+		cache->n_ret++;
+		spin_unlock_irq(&cache->free_lock);
+	} else
+		swapcache_free_entries(&entry, 1);
+
+	return 0;
+}
+
+swp_entry_t get_swap_page(void)
+{
+	swp_entry_t	entry = { 0 };
+	struct swap_slots_cache *cache;
+
+	cache = this_cpu_ptr(&swp_slots);
+
+	if (check_cache_active()) {
+		mutex_lock(&cache->alloc_lock);
+repeat:
+		if (cache->nr) {
+			entry = cache->slots[cache->cur];
+			cache->slots[cache->cur].val = 0;
+			cache->cur = (cache->cur + 1) % SWAP_SLOTS_CACHE_SIZE;
+			--cache->nr;
+		} else {
+			if (refill_swap_slots_cache(cache))
+				goto repeat;
+		}
+		mutex_unlock(&cache->alloc_lock);
+		return entry;
+	}
+
+	get_swap_pages(1, &entry);
+
+	return entry;
+}
+
+#endif /* CONFIG_SWAP */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1f52ff6..af4ed5f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
+#include <linux/swap_slots.h>
 
 #include <asm/pgtable.h>
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4125a1d..caf9fe6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -34,6 +34,7 @@
 #include <linux/frontswap.h>
 #include <linux/swapfile.h>
 #include <linux/export.h>
+#include <linux/swap_slots.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -880,14 +881,6 @@ noswap:
 	return n_ret;
 }
 
-swp_entry_t get_swap_page(void)
-{
-	swp_entry_t entry;
-
-	get_swap_pages(1, &entry);
-	return entry;
-}
-
 /* The only caller of this function is now suspend routine */
 swp_entry_t get_swap_page_of_type(int type)
 {
@@ -1078,7 +1071,7 @@ void swap_free(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		if (!__swap_entry_free(p, entry, 1))
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot_to_cache(entry);
 	}
 }
 
@@ -1092,7 +1085,7 @@ void swapcache_free(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		if (!__swap_entry_free(p, entry, SWAP_HAS_CACHE))
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot_to_cache(entry);
 	}
 }
 
@@ -1300,7 +1293,7 @@ int free_swap_and_cache(swp_entry_t entry)
 				page = NULL;
 			}
 		} else if (!count)
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot_to_cache(entry);
 	}
 	if (page) {
 		/*
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 8/8] mm/swap: Enable swap slots cache usage
  2016-10-20 23:31 [PATCH v2 0/8] mm/swap: Regular page swap optimizations Tim Chen
                   ` (6 preceding siblings ...)
  2016-10-20 23:31 ` [PATCH v2 7/8] mm/swap: Add cache for swap slots allocation Tim Chen
@ 2016-10-20 23:31 ` Tim Chen
  2016-10-21  8:16 ` [PATCH v2 0/8] mm/swap: Regular page swap optimizations Christian Borntraeger
  8 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2016-10-20 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton

Initialize the swap slots cache and enable it on swapon.
Drain the swap slots cache on swapoff, and re-enable it afterwards
only if another swap device is still available.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
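As a side note for reviewers, the swapoff hunk below re-enables the
per-cpu slot caches only when some other writable swap device remains.
A simplified standalone model of that decision (sketch only, not the
kernel code):

	#include <stdbool.h>
	#include <stdio.h>

	/* models the has_swap accounting in the swapoff hunk below */
	static bool should_reenable_cache(int nr_writeok_devices, bool victim_writeok)
	{
		/* the device being swapped off no longer counts */
		return nr_writeok_devices - (victim_writeok ? 1 : 0) > 0;
	}

	int main(void)
	{
		printf("%d\n", should_reenable_cache(2, true));	/* 1: another device left */
		printf("%d\n", should_reenable_cache(1, true));	/* 0: last device going away */
		return 0;
	}
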
 mm/swapfile.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index caf9fe6..e98c725 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2132,7 +2132,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	struct address_space *mapping;
 	struct inode *inode;
 	struct filename *pathname;
-	int err, found = 0;
+	int err, found = 0, has_swap = 0;
 	unsigned int old_block_size;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -2144,6 +2144,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (IS_ERR(pathname))
 		return PTR_ERR(pathname);
 
+	disable_swap_slots_cache();
 	victim = file_open_name(pathname, O_RDWR|O_LARGEFILE, 0);
 	err = PTR_ERR(victim);
 	if (IS_ERR(victim))
@@ -2153,10 +2154,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_lock(&swap_lock);
 	plist_for_each_entry(p, &swap_active_head, list) {
 		if (p->flags & SWP_WRITEOK) {
-			if (p->swap_file->f_mapping == mapping) {
+			if (p->swap_file->f_mapping == mapping)
 				found = 1;
+			else
+				++has_swap;
+			/* is there another swap device left? */
+			if (found && has_swap)
 				break;
-			}
 		}
 	}
 	if (!found) {
@@ -2275,6 +2279,8 @@ out_dput:
 	filp_close(victim, NULL);
 out:
 	putname(pathname);
+	if (has_swap)
+		reenable_swap_slots_cache();
 	return err;
 }
 
@@ -2692,6 +2698,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	enable_swap_slot_caches();
 	p = alloc_swap_info();
 	if (IS_ERR(p))
 		return PTR_ERR(p);
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/8] mm/swap: Regular page swap optimizations
  2016-10-20 23:31 [PATCH v2 0/8] mm/swap: Regular page swap optimizations Tim Chen
                   ` (7 preceding siblings ...)
  2016-10-20 23:31 ` [PATCH v2 8/8] mm/swap: Enable swap slots cache usage Tim Chen
@ 2016-10-21  8:16 ` Christian Borntraeger
  2016-10-21 10:05   ` Christian Borntraeger
  8 siblings, 1 reply; 15+ messages in thread
From: Christian Borntraeger @ 2016-10-21  8:16 UTC (permalink / raw)
  To: Tim Chen, Andrew Morton
  Cc: Ying Huang, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton

On s390 4.9-rc1 + this patch set
I get the following on swapon

[  308.206195] ------------[ cut here ]------------
[  308.206203] WARNING: CPU: 5 PID: 20745 at mm/page_alloc.c:3511 __alloc_pages_nodemask+0x884/0xdf8
[  308.206205] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter btrfs raid6_pq xor dm_service_time ghash_s390 prng aes_s390 des_s390 des_generic qeth_l2 sha512_s390 sha256_s390 sha1_s390 sha_common nfsd eadm_sch auth_rpcgss qeth ccwgroup oid_registry nfs_acl lockd grace vhost_net vhost sunrpc macvtap macvlan kvm sch_fq_codel dm_multipath ip_tables
[  308.206240] CPU: 5 PID: 20745 Comm: swapon Tainted: G        W       4.9.0-rc1+ #23
[  308.206241] Hardware name: IBM              2964 NC9              704              (LPAR)
[  308.206243] task: 00000000e3bb8000 task.stack: 00000000d4270000
[  308.206244] Krnl PSW : 0704c00180000000 000000000025db3c
[  308.206246]  (__alloc_pages_nodemask+0x884/0xdf8)

[  308.206248]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0
[  308.206250]  RI:0 EA:3
[  308.206251] 
               Krnl GPRS: 000000000000000a 0000000000b2b0ec 0000000100000312 0000000000000001
[  308.206267]            000000000025d3ec 0000000000000008 0000000000000075 000000000240c0c0
[  308.206269]            0000000000000001 0000000000000000 000000000000000a 0000000000000000
[  308.206270]            000000000000000a 000000000079d8c8 000000000025d3ec 00000000d4273ba0
[  308.206280] Krnl Code: 000000000025db30: a774fd27		brc	7,25d57e
[  308.206282] 
                          000000000025db34: 92011000		mvi	0(%r1),1
[  308.206285] 
                         #000000000025db38: a7f40001		brc	15,25db3a
[  308.206286] 
                         >000000000025db3c: a7f4fd21		brc	15,25d57e
[  308.206289] 
                          000000000025db40: 4130f150		la	%r3,336(%r15)
[  308.206291] 
                          000000000025db44: b904002c		lgr	%r2,%r12
[  308.206293] 
                          000000000025db48: c0e5ffffe11c	brasl	%r14,259d80
[  308.206294] 
                          000000000025db4e: a7f4fda3		brc	15,25d694
[  308.206297] Call Trace:
[  308.206299] ([<000000000025d3ec>] __alloc_pages_nodemask+0x134/0xdf8)
[  308.206303] ([<0000000000280d6a>] kmalloc_order+0x42/0x70)
[  308.206305] ([<0000000000280dd8>] kmalloc_order_trace+0x40/0xf0)
[  308.206310] ([<00000000002a7090>] init_swap_address_space+0x68/0x138)
[  308.206312] ([<00000000002ac858>] SyS_swapon+0xbd0/0xf80)
[  308.206317] ([<0000000000785476>] system_call+0xd6/0x264)
[  308.206318] Last Breaking-Event-Address:
[  308.206319]  [<000000000025db38>] __alloc_pages_nodemask+0x880/0xdf8
[  308.206320] ---[ end trace aaeca736f47ac05b ]---

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/8] mm/swap: Regular page swap optimizations
  2016-10-21  8:16 ` [PATCH v2 0/8] mm/swap: Regular page swap optimizations Christian Borntraeger
@ 2016-10-21 10:05   ` Christian Borntraeger
  2016-10-21 17:40     ` Tim Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Borntraeger @ 2016-10-21 10:05 UTC (permalink / raw)
  To: Tim Chen, Andrew Morton
  Cc: Ying Huang, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton

On 10/21/2016 10:16 AM, Christian Borntraeger wrote:
> On s390 4.9-rc1 + this patch set
> I get the following on swapon
> 
> [  308.206195] ------------[ cut here ]------------
> [  308.206203] WARNING: CPU: 5 PID: 20745 at mm/page_alloc.c:3511 __alloc_pages_nodemask+0x884/0xdf8
> [  308.206205] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter btrfs raid6_pq xor dm_service_time ghash_s390 prng aes_s390 des_s390 des_generic qeth_l2 sha512_s390 sha256_s390 sha1_s390 sha_common nfsd eadm_sch auth_rpcgss qeth ccwgroup oid_registry nfs_acl lockd grace vhost_net vhost sunrpc macvtap macvlan kvm sch_fq_codel dm_multipath ip_tables
> [  308.206240] CPU: 5 PID: 20745 Comm: swapon Tainted: G        W       4.9.0-rc1+ #23
> [  308.206241] Hardware name: IBM              2964 NC9              704              (LPAR)
> [  308.206243] task: 00000000e3bb8000 task.stack: 00000000d4270000
> [  308.206244] Krnl PSW : 0704c00180000000 000000000025db3c
> [  308.206246]  (__alloc_pages_nodemask+0x884/0xdf8)
> 
> [  308.206248]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0
> [  308.206250]  RI:0 EA:3
> [  308.206251] 
>                Krnl GPRS: 000000000000000a 0000000000b2b0ec 0000000100000312 0000000000000001
> [  308.206267]            000000000025d3ec 0000000000000008 0000000000000075 000000000240c0c0
> [  308.206269]            0000000000000001 0000000000000000 000000000000000a 0000000000000000
> [  308.206270]            000000000000000a 000000000079d8c8 000000000025d3ec 00000000d4273ba0
> [  308.206280] Krnl Code: 000000000025db30: a774fd27		brc	7,25d57e
> [  308.206282] 
>                           000000000025db34: 92011000		mvi	0(%r1),1
> [  308.206285] 
>                          #000000000025db38: a7f40001		brc	15,25db3a
> [  308.206286] 
>                          >000000000025db3c: a7f4fd21		brc	15,25d57e
> [  308.206289] 
>                           000000000025db40: 4130f150		la	%r3,336(%r15)
> [  308.206291] 
>                           000000000025db44: b904002c		lgr	%r2,%r12
> [  308.206293] 
>                           000000000025db48: c0e5ffffe11c	brasl	%r14,259d80
> [  308.206294] 
>                           000000000025db4e: a7f4fda3		brc	15,25d694
> [  308.206297] Call Trace:
> [  308.206299] ([<000000000025d3ec>] __alloc_pages_nodemask+0x134/0xdf8)
> [  308.206303] ([<0000000000280d6a>] kmalloc_order+0x42/0x70)
> [  308.206305] ([<0000000000280dd8>] kmalloc_order_trace+0x40/0xf0)
> [  308.206310] ([<00000000002a7090>] init_swap_address_space+0x68/0x138)
> [  308.206312] ([<00000000002ac858>] SyS_swapon+0xbd0/0xf80)
> [  308.206317] ([<0000000000785476>] system_call+0xd6/0x264)
> [  308.206318] Last Breaking-Event-Address:
> [  308.206319]  [<000000000025db38>] __alloc_pages_nodemask+0x880/0xdf8
> [  308.206320] ---[ end trace aaeca736f47ac05b ]---
> 

Looks like that 1TB of swap is just too big for your logic (you try kmalloc without checking the size).
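
For scale, a rough illustration of what init_swap_address_space() ends up
asking for here (this assumes the 64MB-per-address-space split used by
this series; sizeof(struct address_space) is deliberately not filled in):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long swap_bytes = 1ULL << 40;	/* 1TB of swap */
		unsigned long long per_space  = 64ULL << 20;	/* 64MB chunks */

		/* 16384 struct address_space in one physically contiguous
		 * kmalloc(); multiplied by the struct size that is a
		 * multi-megabyte allocation, large enough to trip the
		 * order check seen in the trace above */
		printf("%llu address spaces\n", swap_bytes / per_space);
		return 0;
	}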

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/8] mm/swap: Regular page swap optimizations
  2016-10-21 10:05   ` Christian Borntraeger
@ 2016-10-21 17:40     ` Tim Chen
  0 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2016-10-21 17:40 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Andrew Morton, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton

On Fri, Oct 21, 2016 at 12:05:19PM +0200, Christian Borntraeger wrote:
> On 10/21/2016 10:16 AM, Christian Borntraeger wrote:
> > [  308.206297] Call Trace:
> > [  308.206299] ([<000000000025d3ec>] __alloc_pages_nodemask+0x134/0xdf8)
> > [  308.206303] ([<0000000000280d6a>] kmalloc_order+0x42/0x70)
> > [  308.206305] ([<0000000000280dd8>] kmalloc_order_trace+0x40/0xf0)
> > [  308.206310] ([<00000000002a7090>] init_swap_address_space+0x68/0x138)
> > [  308.206312] ([<00000000002ac858>] SyS_swapon+0xbd0/0xf80)
> > [  308.206317] ([<0000000000785476>] system_call+0xd6/0x264)
> > [  308.206318] Last Breaking-Event-Address:
> > [  308.206319]  [<000000000025db38>] __alloc_pages_nodemask+0x880/0xdf8
> > [  308.206320] ---[ end trace aaeca736f47ac05b ]---
> > 
> 
> Looks like that 1TB of swap is just too big for your logic (you try kmalloc without checking the size).
> 

Thanks for giving this patch series a spin.
Let's use vzalloc instead.  Can you try the following change?

Thanks.

Tim

--->8---
diff --git a/mm/swap_state.c b/mm/swap_state.c
index af4ed5f..0f84526 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -526,11 +526,9 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
 	unsigned int i, nr;
 
 	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
-	spaces = kzalloc(sizeof(struct address_space) * nr, GFP_KERNEL);
+	spaces = vzalloc(sizeof(struct address_space) * nr);
 	if (!spaces) {
-		spaces = vzalloc(sizeof(struct address_space) * nr);
-		if (!spaces)
-			return -ENOMEM;
+		return -ENOMEM;
 	}
 	for (i = 0; i < nr; i++) {
 		space = spaces + i;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/8] mm/swap: Add cluster lock
  2016-10-20 23:31 ` [PATCH v2 2/8] mm/swap: Add cluster lock Tim Chen
@ 2016-10-24 16:31   ` Jonathan Corbet
  2016-10-25  2:05     ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Corbet @ 2016-10-24 16:31 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton

On Thu, 20 Oct 2016 16:31:41 -0700
Tim Chen <tim.c.chen@linux.intel.com> wrote:

> From: "Huang, Ying" <ying.huang@intel.com>
> 
> This patch is to reduce the lock contention of swap_info_struct->lock
> via using a more fine grained lock in swap_cluster_info for some swap
> operations.  swap_info_struct->lock is heavily contended if multiple
> processes reclaim pages simultaneously.  Because there is only one lock
> for each swap device.  While in common configuration, there is only one
> or several swap devices in the system.  The lock protects almost all
> swap related operations.

So I'm looking at this a bit.  Overall it seems like a good thing to do
(from my limited understanding of this area) but I have a probably silly
question... 

>  struct swap_cluster_info {
> -	unsigned int data:24;
> -	unsigned int flags:8;
> +	unsigned long data;
>  };
> -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> +#define CLUSTER_COUNT_SHIFT		8
> +#define CLUSTER_FLAG_MASK		((1UL << CLUSTER_COUNT_SHIFT) - 1)
> +#define CLUSTER_COUNT_MASK		(~CLUSTER_FLAG_MASK)
> +#define CLUSTER_FLAG_FREE		1 /* This cluster is free */
> +#define CLUSTER_FLAG_NEXT_NULL		2 /* This cluster has no next cluster */
> +/* cluster lock, protect cluster_info contents and sis->swap_map */
> +#define CLUSTER_FLAG_LOCK_BIT		2
> +#define CLUSTER_FLAG_LOCK		(1 << CLUSTER_FLAG_LOCK_BIT)

Why the roll-your-own locking and data structures here?  To my naive
understanding, it seems like you could do something like:

  struct swap_cluster_info {
  	spinlock_t lock;
	atomic_t count;
	unsigned int flags;
  };

Then you could use proper spinlock operations which, among other things,
would make the realtime folks happier.  That might well help with the
cache-line sharing issues as well.  Some of the count manipulations could
perhaps be done without the lock entirely; similarly, atomic bitops might
save you the locking for some of the flag tweaks - though I'd have to look
more closely to be really sure of that.

The cost, of course, is the growth of this structure, but you've already
noted that the overhead isn't all that high; seems like it could be worth
it.

I assume that I'm missing something obvious here?

Thanks,

jon

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/8] mm/swap: Add cluster lock
  2016-10-24 16:31   ` Jonathan Corbet
@ 2016-10-25  2:05     ` Huang, Ying
  2016-12-28  3:34       ` huang ying
  0 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2016-10-25  2:05 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Tim Chen, Andrew Morton, Huang, Ying, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton

Hi, Jonathan,

Thanks for review.

Jonathan Corbet <corbet@lwn.net> writes:

> On Thu, 20 Oct 2016 16:31:41 -0700
> Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>> From: "Huang, Ying" <ying.huang@intel.com>
>> 
>> This patch is to reduce the lock contention of swap_info_struct->lock
>> via using a more fine grained lock in swap_cluster_info for some swap
>> operations.  swap_info_struct->lock is heavily contended if multiple
>> processes reclaim pages simultaneously.  Because there is only one lock
>> for each swap device.  While in common configuration, there is only one
>> or several swap devices in the system.  The lock protects almost all
>> swap related operations.
>
> So I'm looking at this a bit.  Overall it seems like a good thing to do
> (from my limited understanding of this area) but I have a probably silly
> question... 
>
>>  struct swap_cluster_info {
>> -	unsigned int data:24;
>> -	unsigned int flags:8;
>> +	unsigned long data;
>>  };
>> -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
>> +#define CLUSTER_COUNT_SHIFT		8
>> +#define CLUSTER_FLAG_MASK		((1UL << CLUSTER_COUNT_SHIFT) - 1)
>> +#define CLUSTER_COUNT_MASK		(~CLUSTER_FLAG_MASK)
>> +#define CLUSTER_FLAG_FREE		1 /* This cluster is free */
>> +#define CLUSTER_FLAG_NEXT_NULL		2 /* This cluster has no next cluster */
>> +/* cluster lock, protect cluster_info contents and sis->swap_map */
>> +#define CLUSTER_FLAG_LOCK_BIT		2
>> +#define CLUSTER_FLAG_LOCK		(1 << CLUSTER_FLAG_LOCK_BIT)
>
> Why the roll-your-own locking and data structures here?  To my naive
> understanding, it seems like you could do something like:
>
>   struct swap_cluster_info {
>   	spinlock_t lock;
> 	atomic_t count;
> 	unsigned int flags;
>   };
>
> Then you could use proper spinlock operations which, among other things,
> would make the realtime folks happier.  That might well help with the
> cache-line sharing issues as well.  Some of the count manipulations could
> perhaps be done without the lock entirely; similarly, atomic bitops might
> save you the locking for some of the flag tweaks - though I'd have to look
> more closely to be really sure of that.
>
> The cost, of course, is the growth of this structure, but you've already
> noted that the overhead isn't all that high; seems like it could be worth
> it.

Yes.  The data structure you proposed is much easier to use than the
current one.  The main concern is the RAM usage.  The size of the data
structure you proposed is about 80 bytes, while that of the current one
is about 8 bytes.  There will be one struct swap_cluster_info for every
1MB of swap space, so for 1TB of swap space, the total size will be 80M
compared with 8M for the current implementation.

On the other hand, the return on the increased size is not
overwhelming.  The bit spinlock on a cluster will not be heavily
contended because it is quite a fine-grained lock, so there would be
little benefit in using lockless operations.  I guess the realtime
issue isn't serious either, given that the lock is not heavily
contended and the operations protected by the lock are lightweight.

Best Regards,
Huang, Ying

> I assume that I'm missing something obvious here?
>
> Thanks,
>
> jon

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/8] mm/swap: Add cluster lock
  2016-10-25  2:05     ` Huang, Ying
@ 2016-12-28  3:34       ` huang ying
  0 siblings, 0 replies; 15+ messages in thread
From: huang ying @ 2016-12-28  3:34 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Jonathan Corbet, Tim Chen, Andrew Morton, dave.hansen,
	Andi Kleen, Aaron Lu, linux-mm, LKML, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton

Hi, Jonathan,

On Tue, Oct 25, 2016 at 10:05 AM, Huang, Ying <ying.huang@intel.com> wrote:
> Hi, Jonathan,
>
> Thanks for review.
>
> Jonathan Corbet <corbet@lwn.net> writes:
>
>> On Thu, 20 Oct 2016 16:31:41 -0700
>> Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>
>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>
>>> This patch is to reduce the lock contention of swap_info_struct->lock
>>> via using a more fine grained lock in swap_cluster_info for some swap
>>> operations.  swap_info_struct->lock is heavily contended if multiple
>>> processes reclaim pages simultaneously.  Because there is only one lock
>>> for each swap device.  While in common configuration, there is only one
>>> or several swap devices in the system.  The lock protects almost all
>>> swap related operations.
>>
>> So I'm looking at this a bit.  Overall it seems like a good thing to do
>> (from my limited understanding of this area) but I have a probably silly
>> question...
>>
>>>  struct swap_cluster_info {
>>> -    unsigned int data:24;
>>> -    unsigned int flags:8;
>>> +    unsigned long data;
>>>  };
>>> -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>>> -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
>>> +#define CLUSTER_COUNT_SHIFT         8
>>> +#define CLUSTER_FLAG_MASK           ((1UL << CLUSTER_COUNT_SHIFT) - 1)
>>> +#define CLUSTER_COUNT_MASK          (~CLUSTER_FLAG_MASK)
>>> +#define CLUSTER_FLAG_FREE           1 /* This cluster is free */
>>> +#define CLUSTER_FLAG_NEXT_NULL              2 /* This cluster has no next cluster */
>>> +/* cluster lock, protect cluster_info contents and sis->swap_map */
>>> +#define CLUSTER_FLAG_LOCK_BIT               2
>>> +#define CLUSTER_FLAG_LOCK           (1 << CLUSTER_FLAG_LOCK_BIT)
>>
>> Why the roll-your-own locking and data structures here?  To my naive
>> understanding, it seems like you could do something like:
>>
>>   struct swap_cluster_info {
>>       spinlock_t lock;
>>       atomic_t count;
>>       unsigned int flags;
>>   };
>>
>> Then you could use proper spinlock operations which, among other things,
>> would make the realtime folks happier.  That might well help with the
>> cache-line sharing issues as well.  Some of the count manipulations could
>> perhaps be done without the lock entirely; similarly, atomic bitops might
>> save you the locking for some of the flag tweaks - though I'd have to look
>> more closely to be really sure of that.
>>
>> The cost, of course, is the growth of this structure, but you've already
>> noted that the overhead isn't all that high; seems like it could be worth
>> it.
>
> Yes.  The data structure you proposed is much easier to use than the
> current one.  The main concern is the RAM usage.  The size of the data
> structure you proposed is about 80 bytes, while that of the current one
> is about 8 bytes.  There will be one struct swap_cluster_info for every
> 1MB of swap space, so for 1TB of swap space, the total size will be 80M
> compared with 8M for the current implementation.

Sorry, I had lockdep turned on when measuring the size change, so the
previous size data is wrong.  The size of the data structure you
proposed is 12 bytes, while that of the current one is 8 bytes on
64-bit platforms and 4 bytes on 32-bit platforms.
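
With the corrected sizes, the earlier 1TB example works out as below
(back-of-the-envelope only, still assuming one struct swap_cluster_info
per 1MB of swap):

	#include <stdio.h>

	int main(void)
	{
		/* 1TB of swap / 1MB per cluster = 1048576 clusters */
		unsigned long long nr_clusters = (1ULL << 40) / (1ULL << 20);

		/* ~12MB with the proposed structure vs ~8MB today (64 bit) */
		printf("proposed: %llu MB\n", nr_clusters * 12 >> 20);
		printf("current:  %llu MB\n", nr_clusters * 8 >> 20);
		return 0;
	}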

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-12-28  3:34 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-20 23:31 [PATCH v2 0/8] mm/swap: Regular page swap optimizations Tim Chen
2016-10-20 23:31 ` [PATCH v2 1/8] mm/swap: Fix kernel message in swap_info_get() Tim Chen
2016-10-20 23:31 ` [PATCH v2 2/8] mm/swap: Add cluster lock Tim Chen
2016-10-24 16:31   ` Jonathan Corbet
2016-10-25  2:05     ` Huang, Ying
2016-12-28  3:34       ` huang ying
2016-10-20 23:31 ` [PATCH v2 3/8] mm/swap: Split swap cache into 64MB trunks Tim Chen
2016-10-20 23:31 ` [PATCH v2 4/8] mm/swap: skip read ahead for unreferenced swap slots Tim Chen
2016-10-20 23:31 ` [PATCH v2 5/8] mm/swap: Allocate swap slots in batches Tim Chen
2016-10-20 23:31 ` [PATCH v2 6/8] mm/swap: Free swap slots in batch Tim Chen
2016-10-20 23:31 ` [PATCH v2 7/8] mm/swap: Add cache for swap slots allocation Tim Chen
2016-10-20 23:31 ` [PATCH v2 8/8] mm/swap: Enable swap slots cache usage Tim Chen
2016-10-21  8:16 ` [PATCH v2 0/8] mm/swap: Regular page swap optimizations Christian Borntraeger
2016-10-21 10:05   ` Christian Borntraeger
2016-10-21 17:40     ` Tim Chen
