linux-mm.kvack.org archive mirror
* [PATCH v5 0/9] mm/swap: Regular page swap optimizations
@ 2017-01-11 17:55 Tim Chen
  2017-01-11 17:55 ` [PATCH v5 1/9] mm/swap: Fix kernel message in swap_info_get() Tim Chen
                   ` (9 more replies)
  0 siblings, 10 replies; 33+ messages in thread
From: Tim Chen @ 2017-01-11 17:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

Change Log:
v5:
1. Rebase the patch series on the 4.10-rc3 kernel. Update the series to remove
usage of the obsolete CPU hotplug functions cpu_notifier_register_begin(),
cpu_notifier_register_done(), and __register_hotcpu_notifier().
2. Fix a bug that returned an uninitialized swap slot when we run
out of swap slots on all swap devices.
3. Minor code style clean ups.

v4:
1. Fix a bug in cluster unlocking in add_swap_count_continuation(). We
should use unlock_cluster() instead of unlock_cluster_or_swap_info().
2. During swap off, handle the race where a swap slot is marked unused but
is actually allocated and not yet placed in the swap cache: wait for the slot
to be placed in the swap cache instead of aborting swap off.
3. Initialize n_ret in get_swap_pages().

v3:
1. Fix a bug where we skipped read ahead and returned a NULL page without
first checking whether the page was already in the swap cache.
2. Fix a bug where we did not try to allocate from the global pool when
allocation from the swap slot cache did not succeed.
3. Fix a memory allocation bug for the spaces that store the split-up 64MB
radix trees.
4. Fix problems caused by races between get_swap_page, CPU online/offline
and swapon/swapoff.

v2:
1. Fix a bug in the index limit used in scan_swap_map_try_ssd_cluster
when searching for empty slots in a cluster.
2. Fix a bug in swap off that incorrectly determined whether we still have
swap devices left.
3. Port the patches to the mmotm-2016-10-11-15-46 branch.

Andrew,

We've updated this patch series with some minor fixes and rebased it to 4.10-rc3.
Please consider this patch series for inclusion in the mm tree.
 
Times have changed.  The coming generation of solid state block devices
has latencies down to sub-100 usec, which is within an order of
magnitude of DRAM, and their performance is orders of magnitude higher
than the single-spindle rotational media we've swapped to historically.

This could benefit many usage scenarios, for example cloud providers who
overcommit their memory (as VMs don't use all of the memory provisioned).
Having fast swap will allow them to be more aggressive in memory
overcommit and fit more VMs onto a platform.

In our testing [see footnote], the median latency that the
kernel adds to a page fault is 15 usec, which comes quite close
to the amount that will be contributed by the underlying I/O
devices.

The software latency comes mostly from contention on the locks
protecting the radix tree of the swap cache and the locks protecting
the individual swap devices.  Lock contention already consumed
35% of CPU cycles in our test.  In the very near future,
software latency will become the bottleneck to swap performance as
block device I/O latency gets within shouting distance of DRAM speed.

This patch set reduces the median page fault latency
from 15 usec to 4 usec (roughly a 3.75x improvement) for a DRAM-based
pmem block device.

Patch 1 is a cleanup patch.
Patch 2 creates a lock per cluster; this gives us a more fine-grained lock
        that can be used for accessing swap_map without locking the whole
        swap device.
Patch 3 splits the swap cache radix tree into 64MB chunks, reducing
        the rate at which we have to contend for the radix tree.
Patch 4 eliminates unnecessary page allocation for read ahead.
Patches 5-9 create a per-cpu cache of the swap slots, so we don't have
        to contend on the swap device to get or release a swap slot,
        and allocate and release the swap slots in batches for better
        efficiency (see the sketch below).
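
For illustration, here is a minimal sketch of the per-cpu slot cache idea
in patches 5-9.  The names are hypothetical (this is not the interface of
the mm/swap_slots.c added by the series); only SWAP_BATCH and
get_swap_pages() come from patch 5, and preemption/locking details are
omitted.

#include <linux/percpu.h>
#include <linux/swap.h>

struct slot_cache_sketch {
	swp_entry_t slots[SWAP_BATCH];		/* SWAP_BATCH == 64, from patch 5 */
	int nr;					/* cached slots still unused */
};

static DEFINE_PER_CPU(struct slot_cache_sketch, slot_cache_sketch);

static swp_entry_t sketch_get_swap_page(void)
{
	struct slot_cache_sketch *cache = this_cpu_ptr(&slot_cache_sketch);

	if (!cache->nr)				/* refill with one batched call */
		cache->nr = get_swap_pages(SWAP_BATCH, cache->slots);
	if (!cache->nr)
		return (swp_entry_t) {0};	/* all swap devices are full */
	return cache->slots[--cache->nr];	/* no swap device lock taken here */
}

Most allocations are then served from the per-cpu array, and the swap
device locks are only touched once per batch of slots.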

We describe below the changes in swap throughput and
lock contention.  The test was done with a PMEM block swap device
for 32 processes on a Xeon E5 v3 system.  The swap device used is a RAM
simulated PMEM (persistent memory) device.  To test sequential
swap out, the test case created 32 processes, which sequentially
allocate and write to anonymous pages until the RAM and part of the
swap device are used.  This gives an indication of the effect of each
successive patch.  The test was done on patch version 4, which is
functionally identical to version 5.

Vanilla kernel 4.9-rc8:
 Throughput:
  vmstat.swap.so: 1428002 kB/sec,
 Top lock contentions in %cpu.
  perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 13.94%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 13.75%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.swapcache_free.__remove_mapping.shrink_page_list: 7.05%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock.swap_info_get.page_swapcount.try_to_free_swap.swap_writepage: 7.03%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock.__swap_duplicate.swap_duplicate.try_to_unmap_one.rmap_walk_anon: 7.02%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list: 6.83%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock.page_check_address_transhuge.page_referenced_one.rmap_walk_anon.rmap_walk: 0.81%,

Patch 1-2:
Swap throughput improved slightly by 4%; swap_info_get and __swap_duplicate contention on the swap_info lock is eliminated.
  Throughput:
  vmstat.swap.so: 1481704 kB/sec,  (4% increase over vanilla)
  Top lock contentions in %cpu:
  perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 27.53%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 27.01%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages.drain_local_pages: 1.03%,

Patch 1-3:
Swap throughput improved by 44%; add_to_swap_cache contention on the radix tree lock is eliminated.
  Throughput:
  vmstat.swap.so: 2050097 kB/sec,  (44% increase over vanilla)
  Top lock contentions in %cpu:
  perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_page.add_to_swap.shrink_page_list.shrink_inactive_list: 43.27%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault: 4.84%,

Patch 1-9:
Swap throughput improved by 192%; get_swap_page contention on the swap_info lock is eliminated.
  Throughput:
  vmstat.swap.so: 4170746 kB/sec, (192% increase over vanilla)
  Top lock contentions in %cpu:
  perf-profile.calltrace.cycles-pp._raw_spin_lock.swapcache_free_entries.free_swap_slot.free_swap_and_cache.unmap_page_range: 13.91%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_mm_fault: 8.56%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock.get_page_from_freelist.__alloc_pages_slowpath.__alloc_pages_nodemask.alloc_pages_vma: 2.56%,
  perf-profile.calltrace.cycles-pp._raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap.shrink_page_list: 2.47%,

Ying Huang & Tim Chen

Footnote:
We tested page fault latency with and without the optimizations from
this patch series, plus one additional patch Ying posted earlier that
removes the radix tree writeback tag in the swap cache.  Eight threads
performed random memory accesses on a 2-socket Haswell system using swap
mounted on a RAM based PMEM block device.  This emulated a moderate load
and a swap device unbounded by I/O speed.  The aggregate working set is
twice the RAM size.  We instrumented the kernel to measure the page fault latency.


Huang Ying (1):
  mm/swap: Skip readahead only when swap slot cache is enabled

Huang, Ying (3):
  mm/swap: Fix kernel message in swap_info_get()
  mm/swap: Add cluster lock
  mm/swap: Split swap cache into 64MB trunks

Tim Chen (5):
  mm/swap: skip read ahead for unreferenced swap slots
  mm/swap: Allocate swap slots in batches
  mm/swap: Free swap slots in batch
  mm/swap: Add cache for swap slots allocation
  mm/swap: Enable swap slots cache usage

 include/linux/swap.h       |  37 +++-
 include/linux/swap_slots.h |  30 +++
 mm/Makefile                |   2 +-
 mm/swap.c                  |   6 -
 mm/swap_slots.c            | 333 ++++++++++++++++++++++++++++
 mm/swap_state.c            |  80 ++++++-
 mm/swapfile.c              | 540 +++++++++++++++++++++++++++++++++++----------
 7 files changed, 887 insertions(+), 141 deletions(-)
 create mode 100644 include/linux/swap_slots.h
 create mode 100644 mm/swap_slots.c

-- 
2.5.5


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v5 1/9] mm/swap: Fix kernel message in swap_info_get()
  2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
@ 2017-01-11 17:55 ` Tim Chen
  2017-01-11 17:55 ` [PATCH v5 2/9] mm/swap: Add cluster lock Tim Chen
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Tim Chen @ 2017-01-11 17:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim Chen

From: "Huang, Ying" <ying.huang@intel.com>

swap_info_get() is used not only in the swap free code path but also in
page_swapcount(), etc., so the original kernel message in
swap_info_get() is no longer correct.  Fix it by replacing "swap_free"
with "swap_info_get" in the message.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/swapfile.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1c6e032..19a7c1d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -753,16 +753,16 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 	return p;
 
 bad_free:
-	pr_err("swap_free: %s%08lx\n", Unused_offset, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val);
 	goto out;
 bad_offset:
-	pr_err("swap_free: %s%08lx\n", Bad_offset, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Bad_offset, entry.val);
 	goto out;
 bad_device:
-	pr_err("swap_free: %s%08lx\n", Unused_file, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Unused_file, entry.val);
 	goto out;
 bad_nofile:
-	pr_err("swap_free: %s%08lx\n", Bad_file, entry.val);
+	pr_err("swap_info_get: %s%08lx\n", Bad_file, entry.val);
 out:
 	return NULL;
 }
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 2/9] mm/swap: Add cluster lock
  2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
  2017-01-11 17:55 ` [PATCH v5 1/9] mm/swap: Fix kernel message in swap_info_get() Tim Chen
@ 2017-01-11 17:55 ` Tim Chen
  2017-01-11 23:00   ` Andrew Morton
  2017-01-11 17:55 ` [PATCH v5 3/9] mm/swap: Split swap cache into 64MB trunks Tim Chen
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Tim Chen @ 2017-01-11 17:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim Chen

From: "Huang, Ying" <ying.huang@intel.com>

This patch reduces the contention on swap_info_struct->lock by using a
more fine-grained lock in swap_cluster_info for some swap operations.
swap_info_struct->lock is heavily contended if multiple processes
reclaim pages simultaneously, because there is only one lock for each
swap device while, in the common configuration, there are only one or a
few swap devices in the system, and the lock protects almost all
swap related operations.

In fact, many swap operations only access one element of the
swap_info_struct->swap_map array, and there is no dependency between
different elements of swap_info_struct->swap_map.  So a fine-grained
lock can be used to allow parallel access to the different elements of
swap_info_struct->swap_map.

In this patch, one bit of swap_cluster_info is used as a bit spinlock
to protect the elements of swap_info_struct->swap_map in the swap
cluster and the fields of swap_cluster_info.  This greatly reduces lock
contention for swap_info_struct->swap_map access.

To accommodate the bit spinlock, the size of swap_cluster_info needs to
increase from 4 bytes to 8 bytes on 64-bit systems.  This uses 4KB more
memory for every 1GB of swap space.
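(Assuming 4KB pages and 256-page swap clusters, 1GB of swap is 262144
pages, i.e. 1024 clusters, so the extra 4 bytes per cluster add up to 4KB.)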

Because the size of swap_cluster_info is much smaller than a cache line
(8 vs 64 bytes on the x86_64 architecture), there may be false cache
line sharing between swap_cluster_info bit spinlocks.  To avoid this
false sharing in the first round of swap cluster allocation, the order
of the swap clusters in the initial free clusters list is changed so
that the swap_cluster_info entries sharing the same cache line are
placed as far apart as possible.  After the first round of allocation,
the order of the clusters in the free clusters list is expected to be
random, so the false sharing should not be noticeable.
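
As a standalone illustration of the packed-word scheme described above
(a simplified sketch, not the patch code below; the real values are
CLUSTER_COUNT_SHIFT and CLUSTER_FLAG_LOCK_BIT from this patch):

#include <linux/bit_spinlock.h>

/* Stand-in for swap_cluster_info: the usage count lives above the low
 * flag byte, and one flag bit doubles as a bit spinlock. */
struct packed_cluster {
	unsigned long data;
};

#define PACKED_COUNT_SHIFT	8	/* mirrors CLUSTER_COUNT_SHIFT */
#define PACKED_LOCK_BIT		2	/* mirrors CLUSTER_FLAG_LOCK_BIT */

static void packed_cluster_inc_count(struct packed_cluster *cluster)
{
	bit_spin_lock(PACKED_LOCK_BIT, &cluster->data);	/* serializes this cluster only */
	cluster->data += 1UL << PACKED_COUNT_SHIFT;	/* bump the count field */
	bit_spin_unlock(PACKED_LOCK_BIT, &cluster->data);
}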

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap.h |  13 ++-
 mm/swapfile.c        | 239 +++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 194 insertions(+), 58 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 09f4be1..40716b9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -175,11 +175,16 @@ enum {
  * protected by swap_info_struct.lock.
  */
 struct swap_cluster_info {
-	unsigned int data:24;
-	unsigned int flags:8;
+	unsigned long data;
 };
-#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+#define CLUSTER_COUNT_SHIFT		8
+#define CLUSTER_FLAG_MASK		((1UL << CLUSTER_COUNT_SHIFT) - 1)
+#define CLUSTER_COUNT_MASK		(~CLUSTER_FLAG_MASK)
+#define CLUSTER_FLAG_FREE		1 /* This cluster is free */
+#define CLUSTER_FLAG_NEXT_NULL		2 /* This cluster has no next cluster */
+/* cluster lock, protect cluster_info contents and sis->swap_map */
+#define CLUSTER_FLAG_LOCK_BIT		2
+#define CLUSTER_FLAG_LOCK		(1 << CLUSTER_FLAG_LOCK_BIT)
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 19a7c1d..c3ed37e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -200,61 +200,107 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
-	unsigned int flag)
+				    unsigned int flag)
 {
-	info->flags = flag;
+	info->data = (info->data & (CLUSTER_COUNT_MASK | CLUSTER_FLAG_LOCK)) |
+		(flag & ~CLUSTER_FLAG_LOCK);
 }
 
 static inline unsigned int cluster_count(struct swap_cluster_info *info)
 {
-	return info->data;
+	return info->data >> CLUSTER_COUNT_SHIFT;
 }
 
 static inline void cluster_set_count(struct swap_cluster_info *info,
 				     unsigned int c)
 {
-	info->data = c;
+	info->data = (c << CLUSTER_COUNT_SHIFT) | (info->data & CLUSTER_FLAG_MASK);
 }
 
 static inline void cluster_set_count_flag(struct swap_cluster_info *info,
 					 unsigned int c, unsigned int f)
 {
-	info->flags = f;
-	info->data = c;
+	info->data = (info->data & CLUSTER_FLAG_LOCK) |
+		(c << CLUSTER_COUNT_SHIFT) | (f & ~CLUSTER_FLAG_LOCK);
 }
 
 static inline unsigned int cluster_next(struct swap_cluster_info *info)
 {
-	return info->data;
+	return cluster_count(info);
 }
 
 static inline void cluster_set_next(struct swap_cluster_info *info,
 				    unsigned int n)
 {
-	info->data = n;
+	cluster_set_count(info, n);
 }
 
 static inline void cluster_set_next_flag(struct swap_cluster_info *info,
 					 unsigned int n, unsigned int f)
 {
-	info->flags = f;
-	info->data = n;
+	cluster_set_count_flag(info, n, f);
 }
 
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
-	return info->flags & CLUSTER_FLAG_FREE;
+	return info->data & CLUSTER_FLAG_FREE;
 }
 
 static inline bool cluster_is_null(struct swap_cluster_info *info)
 {
-	return info->flags & CLUSTER_FLAG_NEXT_NULL;
+	return info->data & CLUSTER_FLAG_NEXT_NULL;
 }
 
 static inline void cluster_set_null(struct swap_cluster_info *info)
 {
-	info->flags = CLUSTER_FLAG_NEXT_NULL;
-	info->data = 0;
+	cluster_set_next_flag(info, 0, CLUSTER_FLAG_NEXT_NULL);
+}
+
+/* Protect swap_cluster_info fields and si->swap_map */
+static inline void __lock_cluster(struct swap_cluster_info *ci)
+{
+	bit_spin_lock(CLUSTER_FLAG_LOCK_BIT, &ci->data);
+}
+
+static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
+						     unsigned long offset)
+{
+	struct swap_cluster_info *ci;
+
+	ci = si->cluster_info;
+	if (ci) {
+		ci += offset / SWAPFILE_CLUSTER;
+		__lock_cluster(ci);
+	}
+	return ci;
+}
+
+static inline void unlock_cluster(struct swap_cluster_info *ci)
+{
+	if (ci)
+		bit_spin_unlock(CLUSTER_FLAG_LOCK_BIT, &ci->data);
+}
+
+static inline struct swap_cluster_info *lock_cluster_or_swap_info(
+	struct swap_info_struct *si,
+	unsigned long offset)
+{
+	struct swap_cluster_info *ci;
+
+	ci = lock_cluster(si, offset);
+	if (!ci)
+		spin_lock(&si->lock);
+
+	return ci;
+}
+
+static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
+					       struct swap_cluster_info *ci)
+{
+	if (ci)
+		unlock_cluster(ci);
+	else
+		spin_unlock(&si->lock);
 }
 
 static inline bool cluster_list_empty(struct swap_cluster_list *list)
@@ -281,9 +327,17 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
 		cluster_set_next_flag(&list->head, idx, 0);
 		cluster_set_next_flag(&list->tail, idx, 0);
 	} else {
+		struct swap_cluster_info *ci_tail;
 		unsigned int tail = cluster_next(&list->tail);
 
-		cluster_set_next(&ci[tail], idx);
+		/*
+		 * Nested cluster lock, but both cluster locks are
+		 * only acquired when we held swap_info_struct->lock
+		 */
+		ci_tail = ci + tail;
+		__lock_cluster(ci_tail);
+		cluster_set_next(ci_tail, idx);
+		unlock_cluster(ci_tail);
 		cluster_set_next_flag(&list->tail, idx, 0);
 	}
 }
@@ -328,7 +382,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 */
 static void swap_do_scheduled_discard(struct swap_info_struct *si)
 {
-	struct swap_cluster_info *info;
+	struct swap_cluster_info *info, *ci;
 	unsigned int idx;
 
 	info = si->cluster_info;
@@ -341,10 +395,14 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
+		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
+		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
+		unlock_cluster(ci);
 		cluster_list_add_tail(&si->free_clusters, info, idx);
+		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
+		unlock_cluster(ci);
 	}
 }
 
@@ -447,8 +505,9 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned long *offset, unsigned long *scan_base)
 {
 	struct percpu_cluster *cluster;
+	struct swap_cluster_info *ci;
 	bool found_free;
-	unsigned long tmp;
+	unsigned long tmp, max;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
@@ -476,14 +535,21 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	 * check if there is still free entry in the cluster
 	 */
 	tmp = cluster->next;
-	while (tmp < si->max && tmp < (cluster_next(&cluster->index) + 1) *
-	       SWAPFILE_CLUSTER) {
+	max = min_t(unsigned long, si->max,
+		    (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+	if (tmp >= max) {
+		cluster_set_null(&cluster->index);
+		goto new_cluster;
+	}
+	ci = lock_cluster(si, tmp);
+	while (tmp < max) {
 		if (!si->swap_map[tmp]) {
 			found_free = true;
 			break;
 		}
 		tmp++;
 	}
+	unlock_cluster(ci);
 	if (!found_free) {
 		cluster_set_null(&cluster->index);
 		goto new_cluster;
@@ -496,6 +562,7 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
+	struct swap_cluster_info *ci;
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
@@ -572,9 +639,11 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 	if (offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
 
+	ci = lock_cluster(si, offset);
 	/* reuse swap entry of cache-only swap if not busy. */
 	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
 		int swap_was_freed;
+		unlock_cluster(ci);
 		spin_unlock(&si->lock);
 		swap_was_freed = __try_to_reclaim_swap(si, offset);
 		spin_lock(&si->lock);
@@ -584,8 +653,10 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 		goto scan; /* check next one */
 	}
 
-	if (si->swap_map[offset])
+	if (si->swap_map[offset]) {
+		unlock_cluster(ci);
 		goto scan;
+	}
 
 	if (offset == si->lowest_bit)
 		si->lowest_bit++;
@@ -601,6 +672,7 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 	}
 	si->swap_map[offset] = usage;
 	inc_cluster_info_page(si, si->cluster_info, offset);
+	unlock_cluster(ci);
 	si->cluster_next = offset + 1;
 	si->flags -= SWP_SCANNING;
 
@@ -731,7 +803,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	return (swp_entry_t) {0};
 }
 
-static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	unsigned long offset, type;
@@ -749,7 +821,6 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 		goto bad_offset;
 	if (!p->swap_map[offset])
 		goto bad_free;
-	spin_lock(&p->lock);
 	return p;
 
 bad_free:
@@ -767,14 +838,45 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
+static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+{
+	struct swap_info_struct *p;
+
+	p = _swap_info_get(entry);
+	if (p)
+		spin_lock(&p->lock);
+	return p;
+}
+
 static unsigned char swap_entry_free(struct swap_info_struct *p,
-				     swp_entry_t entry, unsigned char usage)
+				     swp_entry_t entry, unsigned char usage,
+				     bool swap_info_locked)
 {
+	struct swap_cluster_info *ci;
 	unsigned long offset = swp_offset(entry);
 	unsigned char count;
 	unsigned char has_cache;
+	bool lock_swap_info = false;
+
+	if (!swap_info_locked) {
+		count = p->swap_map[offset];
+		if (!p->cluster_info || count == usage || count == SWAP_MAP_SHMEM) {
+lock_swap_info:
+			swap_info_locked = true;
+			lock_swap_info = true;
+			spin_lock(&p->lock);
+		}
+	}
+
+	ci = lock_cluster(p, offset);
 
 	count = p->swap_map[offset];
+
+	if (!swap_info_locked && (count == usage || count == SWAP_MAP_SHMEM)) {
+		unlock_cluster(ci);
+		goto lock_swap_info;
+	}
+
 	has_cache = count & SWAP_HAS_CACHE;
 	count &= ~SWAP_HAS_CACHE;
 
@@ -800,10 +902,15 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 	usage = count | has_cache;
 	p->swap_map[offset] = usage;
 
+	unlock_cluster(ci);
+
 	/* free if no reference */
 	if (!usage) {
+		VM_BUG_ON(!swap_info_locked);
 		mem_cgroup_uncharge_swap(entry);
+		ci = lock_cluster(p, offset);
 		dec_cluster_info_page(p, p->cluster_info, offset);
+		unlock_cluster(ci);
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
 		if (offset > p->highest_bit) {
@@ -829,6 +936,9 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 		}
 	}
 
+	if (lock_swap_info)
+		spin_unlock(&p->lock);
+
 	return usage;
 }
 
@@ -840,11 +950,9 @@ void swap_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 
-	p = swap_info_get(entry);
-	if (p) {
-		swap_entry_free(p, entry, 1);
-		spin_unlock(&p->lock);
-	}
+	p = _swap_info_get(entry);
+	if (p)
+		swap_entry_free(p, entry, 1, false);
 }
 
 /*
@@ -854,11 +962,9 @@ void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 
-	p = swap_info_get(entry);
-	if (p) {
-		swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		spin_unlock(&p->lock);
-	}
+	p = _swap_info_get(entry);
+	if (p)
+		swap_entry_free(p, entry, SWAP_HAS_CACHE, false);
 }
 
 /*
@@ -870,13 +976,17 @@ int page_swapcount(struct page *page)
 {
 	int count = 0;
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	swp_entry_t entry;
+	unsigned long offset;
 
 	entry.val = page_private(page);
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (p) {
-		count = swap_count(p->swap_map[swp_offset(entry)]);
-		spin_unlock(&p->lock);
+		offset = swp_offset(entry);
+		ci = lock_cluster_or_swap_info(p, offset);
+		count = swap_count(p->swap_map[offset]);
+		unlock_cluster_or_swap_info(p, ci);
 	}
 	return count;
 }
@@ -889,22 +999,26 @@ int swp_swapcount(swp_entry_t entry)
 {
 	int count, tmp_count, n;
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	struct page *page;
 	pgoff_t offset;
 	unsigned char *map;
 
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (!p)
 		return 0;
 
-	count = swap_count(p->swap_map[swp_offset(entry)]);
+	offset = swp_offset(entry);
+
+	ci = lock_cluster_or_swap_info(p, offset);
+
+	count = swap_count(p->swap_map[offset]);
 	if (!(count & COUNT_CONTINUED))
 		goto out;
 
 	count &= ~COUNT_CONTINUED;
 	n = SWAP_MAP_MAX + 1;
 
-	offset = swp_offset(entry);
 	page = vmalloc_to_page(p->swap_map + offset);
 	offset &= ~PAGE_MASK;
 	VM_BUG_ON(page_private(page) != SWP_CONTINUED);
@@ -919,7 +1033,7 @@ int swp_swapcount(swp_entry_t entry)
 		n *= (SWAP_CONT_MAX + 1);
 	} while (tmp_count & COUNT_CONTINUED);
 out:
-	spin_unlock(&p->lock);
+	unlock_cluster_or_swap_info(p, ci);
 	return count;
 }
 
@@ -1003,7 +1117,7 @@ int free_swap_and_cache(swp_entry_t entry)
 
 	p = swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
+		if (swap_entry_free(p, entry, 1, true) == SWAP_HAS_CACHE) {
 			page = find_get_page(swap_address_space(entry),
 					     swp_offset(entry));
 			if (page && !trylock_page(page)) {
@@ -2284,6 +2398,9 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 	return maxpages;
 }
 
+#define SWAP_CLUSTER_COLS						\
+	DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
+
 static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					union swap_header *swap_header,
 					unsigned char *swap_map,
@@ -2291,11 +2408,12 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					unsigned long maxpages,
 					sector_t *span)
 {
-	int i;
+	unsigned int j, k;
 	unsigned int nr_good_pages;
 	int nr_extents;
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	unsigned long idx = p->cluster_next / SWAPFILE_CLUSTER;
+	unsigned long col = p->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
+	unsigned long i, idx;
 
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
@@ -2343,15 +2461,20 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	if (!cluster_info)
 		return nr_extents;
 
-	for (i = 0; i < nr_clusters; i++) {
-		if (!cluster_count(&cluster_info[idx])) {
+
+	/* Reduce false cache line sharing between cluster_info */
+	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
+		j = (k + col) % SWAP_CLUSTER_COLS;
+		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
+			idx = i * SWAP_CLUSTER_COLS + j;
+			if (idx >= nr_clusters)
+				continue;
+			if (cluster_count(&cluster_info[idx]))
+				continue;
 			cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
 			cluster_list_add_tail(&p->free_clusters, cluster_info,
 					      idx);
 		}
-		idx++;
-		if (idx == nr_clusters)
-			idx = 0;
 	}
 	return nr_extents;
 }
@@ -2609,6 +2732,7 @@ void si_swapinfo(struct sysinfo *val)
 static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 {
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	unsigned long offset, type;
 	unsigned char count;
 	unsigned char has_cache;
@@ -2622,10 +2746,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 		goto bad_file;
 	p = swap_info[type];
 	offset = swp_offset(entry);
-
-	spin_lock(&p->lock);
 	if (unlikely(offset >= p->max))
-		goto unlock_out;
+		goto out;
+
+	ci = lock_cluster_or_swap_info(p, offset);
 
 	count = p->swap_map[offset];
 
@@ -2668,7 +2792,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	p->swap_map[offset] = count | has_cache;
 
 unlock_out:
-	spin_unlock(&p->lock);
+	unlock_cluster_or_swap_info(p, ci);
 out:
 	return err;
 
@@ -2757,6 +2881,7 @@ EXPORT_SYMBOL_GPL(__page_file_index);
 int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 {
 	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
 	struct page *head;
 	struct page *page;
 	struct page *list_page;
@@ -2780,6 +2905,9 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	offset = swp_offset(entry);
+
+	ci = lock_cluster(si, offset);
+
 	count = si->swap_map[offset] & ~SWAP_HAS_CACHE;
 
 	if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
@@ -2792,6 +2920,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	if (!page) {
+		unlock_cluster(ci);
 		spin_unlock(&si->lock);
 		return -ENOMEM;
 	}
@@ -2840,6 +2969,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	list_add_tail(&page->lru, &head->lru);
 	page = NULL;			/* now it's attached, don't free it */
 out:
+	unlock_cluster(ci);
 	spin_unlock(&si->lock);
 outer:
 	if (page)
@@ -2853,7 +2983,8 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
  * into, carry if so, or else fail until a new continuation page is allocated;
  * when the original swap_map count is decremented from 0 with continuation,
  * borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or swap_entry_free() holds swap_lock.
+ * Called while __swap_duplicate() or swap_entry_free() holds swap or cluster
+ * lock.
  */
 static bool swap_count_continued(struct swap_info_struct *si,
 				 pgoff_t offset, unsigned char count)
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 3/9] mm/swap: Split swap cache into 64MB trunks
  2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
  2017-01-11 17:55 ` [PATCH v5 1/9] mm/swap: Fix kernel message in swap_info_get() Tim Chen
  2017-01-11 17:55 ` [PATCH v5 2/9] mm/swap: Add cluster lock Tim Chen
@ 2017-01-11 17:55 ` Tim Chen
  2017-01-11 23:09   ` Andrew Morton
  2017-01-11 17:55 ` [PATCH v5 4/9] mm/swap: skip read ahead for unreferenced swap slots Tim Chen
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Tim Chen @ 2017-01-11 17:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim Chen

From: "Huang, Ying" <ying.huang@intel.com>

This patch improves the scalability of swap out/in by using fine-grained
locks for the swap cache.  In the current kernel, one address space is
used for each swap device, and in the common configuration the number of
swap devices is very small (one is typical).  This causes heavy lock
contention on the radix tree of the address space if multiple tasks swap
out/in concurrently.  But in fact, there is no dependency between pages
in the swap cache, so we can split the one shared address space for each
swap device into several address spaces to reduce the lock contention.
In this patch, the shared address space is split into 64MB trunks.  64MB
is chosen to balance the memory space usage against the effect of the
lock contention reduction.

The size of struct address_space on the x86_64 architecture is 408B, so
with the patch, 6528B more memory will be used for every 1GB of swap
space.
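(1GB of swap spans 1GB / 64MB = 16 trunks, and 16 * 408B = 6528B.)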

One address space is still shared by the swap entries in the same 64MB
trunk.  To avoid lock contention for the first round of swap space
allocation, the order of the swap clusters in the initial free clusters
list is changed so that the swap space distance between consecutive swap
clusters in the free cluster list is at least 64MB.  After the first
round of allocation, the swap clusters are expected to be freed
randomly, so the lock contention should be reduced effectively.
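
As a quick worked example of the lookup introduced below, assuming 4KB
pages: one 64MB trunk covers 64MB / 4KB = 16384 = 1 << 14 swap slots,
hence SWAP_ADDRESS_SPACE_SHIFT is 14, and a swap entry's address space
is found with:

	struct address_space *space =
		&swapper_spaces[swp_type(entry)][swp_offset(entry) >>
						 SWAP_ADDRESS_SPACE_SHIFT];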

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap.h | 11 +++++++--
 mm/swap.c            |  6 -----
 mm/swap_state.c      | 68 ++++++++++++++++++++++++++++++++++++++++++----------
 mm/swapfile.c        | 16 +++++++++++--
 4 files changed, 79 insertions(+), 22 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 40716b9..31d89c4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -341,8 +341,13 @@ int generic_swapfile_activate(struct swap_info_struct *, struct file *,
 		sector_t *);
 
 /* linux/mm/swap_state.c */
-extern struct address_space swapper_spaces[];
-#define swap_address_space(entry) (&swapper_spaces[swp_type(entry)])
+/* One swap address space for each 64M swap space */
+#define SWAP_ADDRESS_SPACE_SHIFT	14
+#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
+extern struct address_space *swapper_spaces[];
+#define swap_address_space(entry)			    \
+	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
+		>> SWAP_ADDRESS_SPACE_SHIFT])
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *, struct list_head *list);
@@ -396,6 +401,8 @@ extern struct swap_info_struct *page_swap_info(struct page *);
 extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
+extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
+extern void exit_swap_address_space(unsigned int type);
 
 #else /* CONFIG_SWAP */
 
diff --git a/mm/swap.c b/mm/swap.c
index 844baed..aabf2e9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -971,12 +971,6 @@ EXPORT_SYMBOL(pagevec_lookup_tag);
 void __init swap_setup(void)
 {
 	unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);
-#ifdef CONFIG_SWAP
-	int i;
-
-	for (i = 0; i < MAX_SWAPFILES; i++)
-		spin_lock_init(&swapper_spaces[i].tree_lock);
-#endif
 
 	/* Use a smaller cluster for small-memory machines */
 	if (megs < 16)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 35d7e0e..3863acd 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include <linux/blkdev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
+#include <linux/vmalloc.h>
 
 #include <asm/pgtable.h>
 
@@ -32,15 +33,8 @@ static const struct address_space_operations swap_aops = {
 #endif
 };
 
-struct address_space swapper_spaces[MAX_SWAPFILES] = {
-	[0 ... MAX_SWAPFILES - 1] = {
-		.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
-		.i_mmap_writable = ATOMIC_INIT(0),
-		.a_ops		= &swap_aops,
-		/* swap cache doesn't use writeback related tags */
-		.flags		= 1 << AS_NO_WRITEBACK_TAGS,
-	}
-};
+struct address_space *swapper_spaces[MAX_SWAPFILES];
+static unsigned int nr_swapper_spaces[MAX_SWAPFILES];
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
 
@@ -53,11 +47,26 @@ static struct {
 
 unsigned long total_swapcache_pages(void)
 {
-	int i;
+	unsigned int i, j, nr;
 	unsigned long ret = 0;
+	struct address_space *spaces;
 
-	for (i = 0; i < MAX_SWAPFILES; i++)
-		ret += swapper_spaces[i].nrpages;
+	rcu_read_lock();
+	for (i = 0; i < MAX_SWAPFILES; i++) {
+		/*
+		 * The corresponding entries in nr_swapper_spaces and
+		 * swapper_spaces will be reused only after at least
+		 * one grace period.  So it is impossible for them
+		 * belongs to different usage.
+		 */
+		nr = nr_swapper_spaces[i];
+		spaces = rcu_dereference(swapper_spaces[i]);
+		if (!nr || !spaces)
+			continue;
+		for (j = 0; j < nr; j++)
+			ret += spaces[j].nrpages;
+	}
+	rcu_read_unlock();
 	return ret;
 }
 
@@ -505,3 +514,38 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 skip:
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
+
+int init_swap_address_space(unsigned int type, unsigned long nr_pages)
+{
+	struct address_space *spaces, *space;
+	unsigned int i, nr;
+
+	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
+	spaces = vzalloc(sizeof(struct address_space) * nr);
+	if (!spaces)
+		return -ENOMEM;
+	for (i = 0; i < nr; i++) {
+		space = spaces + i;
+		INIT_RADIX_TREE(&space->page_tree, GFP_ATOMIC|__GFP_NOWARN);
+		atomic_set(&space->i_mmap_writable, 0);
+		space->a_ops = &swap_aops;
+		/* swap cache doesn't use writeback related tags */
+		mapping_set_no_writeback_tags(space);
+		spin_lock_init(&space->tree_lock);
+	}
+	nr_swapper_spaces[type] = nr;
+	rcu_assign_pointer(swapper_spaces[type], spaces);
+
+	return 0;
+}
+
+void exit_swap_address_space(unsigned int type)
+{
+	struct address_space *spaces;
+
+	spaces = swapper_spaces[type];
+	nr_swapper_spaces[type] = 0;
+	rcu_assign_pointer(swapper_spaces[type], NULL);
+	synchronize_rcu();
+	kvfree(spaces);
+}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c3ed37e..6f61fb3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2075,6 +2075,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	vfree(frontswap_map);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
+	exit_swap_address_space(p->type);
 
 	inode = mapping->host;
 	if (S_ISBLK(inode->i_mode)) {
@@ -2398,8 +2399,12 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 	return maxpages;
 }
 
-#define SWAP_CLUSTER_COLS						\
+#define SWAP_CLUSTER_INFO_COLS						\
 	DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
+#define SWAP_CLUSTER_SPACE_COLS						\
+	DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
+#define SWAP_CLUSTER_COLS						\
+	max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
 
 static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					union swap_header *swap_header,
@@ -2462,7 +2467,10 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 		return nr_extents;
 
 
-	/* Reduce false cache line sharing between cluster_info */
+	/*
+	 * Reduce false cache line sharing between cluster_info and
+	 * sharing same address space.
+	 */
 	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
 		j = (k + col) % SWAP_CLUSTER_COLS;
 		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
@@ -2643,6 +2651,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		}
 	}
 
+	error = init_swap_address_space(p->type, maxpages);
+	if (error)
+		goto bad_swap;
+
 	mutex_lock(&swapon_mutex);
 	prio = -1;
 	if (swap_flags & SWAP_FLAG_PREFER)
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 4/9] mm/swap: skip read ahead for unreferenced swap slots
  2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (2 preceding siblings ...)
  2017-01-11 17:55 ` [PATCH v5 3/9] mm/swap: Split swap cache into 64MB trunks Tim Chen
@ 2017-01-11 17:55 ` Tim Chen
  2017-01-11 17:55 ` [PATCH v5 5/9] mm/swap: Allocate swap slots in batches Tim Chen
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Tim Chen @ 2017-01-11 17:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

We can avoid needlessly allocating a page for swap slots that
are not used by anyone.  No pages have to be read in for
these slots.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |  6 ++++++
 mm/swap_state.c      |  4 ++++
 mm/swapfile.c        | 47 +++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 31d89c4..ca66efd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -396,6 +396,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern bool reuse_swap_page(struct page *, int *);
@@ -490,6 +491,11 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
+static inline int __swp_swapcount(swp_entry_t entry)
+{
+	return 0;
+}
+
 static inline int swp_swapcount(swp_entry_t entry)
 {
 	return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3863acd..3d76d80 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -323,6 +323,10 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		if (found_page)
 			break;
 
+		/* Just skip read ahead for unused swap slot */
+		if (!__swp_swapcount(entry))
+			return NULL;
+
 		/*
 		 * Get a new page to read into from swap.
 		 */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6f61fb3..21c6cae 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -803,7 +803,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	return (swp_entry_t) {0};
 }
 
-static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
+static struct swap_info_struct *__swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	unsigned long offset, type;
@@ -819,13 +819,8 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	offset = swp_offset(entry);
 	if (offset >= p->max)
 		goto bad_offset;
-	if (!p->swap_map[offset])
-		goto bad_free;
 	return p;
 
-bad_free:
-	pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val);
-	goto out;
 bad_offset:
 	pr_err("swap_info_get: %s%08lx\n", Bad_offset, entry.val);
 	goto out;
@@ -838,6 +833,24 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
+static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
+{
+	struct swap_info_struct *p;
+
+	p = __swap_info_get(entry);
+	if (!p)
+		goto out;
+	if (!p->swap_map[swp_offset(entry)])
+		goto bad_free;
+	return p;
+
+bad_free:
+	pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val);
+	goto out;
+out:
+	return NULL;
+}
+
 static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
@@ -993,6 +1006,28 @@ int page_swapcount(struct page *page)
 
 /*
  * How many references to @entry are currently swapped out?
+ * This does not give an exact answer when swap count is continued,
+ * but does include the high COUNT_CONTINUED flag to allow for that.
+ */
+int __swp_swapcount(swp_entry_t entry)
+{
+	int count = 0;
+	pgoff_t offset;
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+
+	si = __swap_info_get(entry);
+	if (si) {
+		offset = swp_offset(entry);
+		ci = lock_cluster_or_swap_info(si, offset);
+		count = swap_count(si->swap_map[offset]);
+		unlock_cluster_or_swap_info(si, ci);
+	}
+	return count;
+}
+
+/*
+ * How many references to @entry are currently swapped out?
  * This considers COUNT_CONTINUED so it returns exact answer.
  */
 int swp_swapcount(swp_entry_t entry)
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 5/9] mm/swap: Allocate swap slots in batches
  2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (3 preceding siblings ...)
  2017-01-11 17:55 ` [PATCH v5 4/9] mm/swap: skip read ahead for unreferenced swap slots Tim Chen
@ 2017-01-11 17:55 ` Tim Chen
  2017-01-11 17:55 ` [PATCH v5 6/9] mm/swap: Free swap slots in batch Tim Chen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Tim Chen @ 2017-01-11 17:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

Currently, swap slots are allocated one page at a time, causing
contention on the swap_info lock protecting the swap partition
for every page being swapped.

This patch adds new functions get_swap_pages and scan_swap_map_slots
to request multiple swap slots at once.  This reduces the lock
contention on the swap_info lock.  Also, scan_swap_map_slots can operate
more efficiently, as swap slots often occur in clusters close to each
other on a swap device, so it is quicker to allocate them together.
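
A minimal usage sketch of the batched interface (use_swap_slot() is a
hypothetical placeholder consumer; the real in-tree user is the swap
slot cache added later in this series):

#include <linux/swap.h>

extern void use_swap_slot(swp_entry_t entry);	/* placeholder consumer */

static void sketch_batch_alloc(void)
{
	swp_entry_t entries[SWAP_BATCH];
	int i, got;

	/* One batched request instead of 16 trips through get_swap_page();
	 * the return value is how many slots were actually allocated. */
	got = get_swap_pages(16, entries);
	for (i = 0; i < got; i++)
		use_swap_slot(entries[i]);
}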

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |   2 +
 mm/swapfile.c        | 136 +++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 113 insertions(+), 25 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ca66efd..980f159 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -27,6 +27,7 @@ struct bio;
 #define SWAP_FLAGS_VALID	(SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
 				 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
 				 SWAP_FLAG_DISCARD_PAGES)
+#define SWAP_BATCH 64
 
 static inline int current_is_kswapd(void)
 {
@@ -384,6 +385,7 @@ static inline long get_nr_swap_pages(void)
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
 extern swp_entry_t get_swap_page_of_type(int);
+extern int get_swap_pages(int n, swp_entry_t swp_entries[]);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 21c6cae..54fe8dd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -501,7 +501,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
  * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
  * might involve allocating a new cluster for current CPU too.
  */
-static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
+static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned long *offset, unsigned long *scan_base)
 {
 	struct percpu_cluster *cluster;
@@ -525,7 +525,7 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 			*scan_base = *offset = si->cluster_next;
 			goto new_cluster;
 		} else
-			return;
+			return false;
 	}
 
 	found_free = false;
@@ -557,16 +557,22 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	cluster->next = tmp + 1;
 	*offset = tmp;
 	*scan_base = tmp;
+	return found_free;
 }
 
-static unsigned long scan_swap_map(struct swap_info_struct *si,
-				   unsigned char usage)
+static int scan_swap_map_slots(struct swap_info_struct *si,
+			       unsigned char usage, int nr,
+			       swp_entry_t slots[])
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
 	int latency_ration = LATENCY_LIMIT;
+	int n_ret = 0;
+
+	if (nr > SWAP_BATCH)
+		nr = SWAP_BATCH;
 
 	/*
 	 * We try to cluster swap pages by allocating them sequentially
@@ -584,8 +590,10 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 	/* SSD algorithm */
 	if (si->cluster_info) {
-		scan_swap_map_try_ssd_cluster(si, &offset, &scan_base);
-		goto checks;
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+			goto checks;
+		else
+			goto scan;
 	}
 
 	if (unlikely(!si->cluster_nr--)) {
@@ -629,8 +637,14 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 checks:
 	if (si->cluster_info) {
-		while (scan_swap_map_ssd_cluster_conflict(si, offset))
-			scan_swap_map_try_ssd_cluster(si, &offset, &scan_base);
+		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
+		/* take a break if we already got some slots */
+			if (n_ret)
+				goto done;
+			if (!scan_swap_map_try_ssd_cluster(si, &offset,
+							&scan_base))
+				goto scan;
+		}
 	}
 	if (!(si->flags & SWP_WRITEOK))
 		goto no_page;
@@ -655,7 +669,10 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 	if (si->swap_map[offset]) {
 		unlock_cluster(ci);
-		goto scan;
+		if (!n_ret)
+			goto scan;
+		else
+			goto done;
 	}
 
 	if (offset == si->lowest_bit)
@@ -674,9 +691,43 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	unlock_cluster(ci);
 	si->cluster_next = offset + 1;
-	si->flags -= SWP_SCANNING;
+	slots[n_ret++] = swp_entry(si->type, offset);
+
+	/* got enough slots or reach max slots? */
+	if ((n_ret == nr) || (offset >= si->highest_bit))
+		goto done;
+
+	/* search for next available slot */
+
+	/* time to take a break? */
+	if (unlikely(--latency_ration < 0)) {
+		if (n_ret)
+			goto done;
+		spin_unlock(&si->lock);
+		cond_resched();
+		spin_lock(&si->lock);
+		latency_ration = LATENCY_LIMIT;
+	}
 
-	return offset;
+	/* try to get more slots in cluster */
+	if (si->cluster_info) {
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+			goto checks;
+		else
+			goto done;
+	}
+	/* non-ssd case */
+	++offset;
+
+	/* non-ssd case, still more slots in cluster? */
+	if (si->cluster_nr && !si->swap_map[offset]) {
+		--si->cluster_nr;
+		goto checks;
+	}
+
+done:
+	si->flags -= SWP_SCANNING;
+	return n_ret;
 
 scan:
 	spin_unlock(&si->lock);
@@ -714,17 +765,41 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 
 no_page:
 	si->flags -= SWP_SCANNING;
-	return 0;
+	return n_ret;
 }
 
-swp_entry_t get_swap_page(void)
+static unsigned long scan_swap_map(struct swap_info_struct *si,
+				   unsigned char usage)
+{
+	swp_entry_t entry;
+	int n_ret;
+
+	n_ret = scan_swap_map_slots(si, usage, 1, &entry);
+
+	if (n_ret)
+		return swp_offset(entry);
+	else
+		return 0;
+
+}
+
+int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 {
 	struct swap_info_struct *si, *next;
-	pgoff_t offset;
+	long avail_pgs;
+	int n_ret = 0;
 
-	if (atomic_long_read(&nr_swap_pages) <= 0)
+	avail_pgs = atomic_long_read(&nr_swap_pages);
+	if (avail_pgs <= 0)
 		goto noswap;
-	atomic_long_dec(&nr_swap_pages);
+
+	if (n_goal > SWAP_BATCH)
+		n_goal = SWAP_BATCH;
+
+	if (n_goal > avail_pgs)
+		n_goal = avail_pgs;
+
+	atomic_long_sub(n_goal, &nr_swap_pages);
 
 	spin_lock(&swap_avail_lock);
 
@@ -750,14 +825,14 @@ swp_entry_t get_swap_page(void)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-
-		/* This is called for allocating swap entry for cache */
-		offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+					    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (offset)
-			return swp_entry(si->type, offset);
+		if (n_ret)
+			goto check_out;
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
-		       si->type);
+			si->type);
+
 		spin_lock(&swap_avail_lock);
 nextsi:
 		/*
@@ -768,7 +843,8 @@ swp_entry_t get_swap_page(void)
 		 * up between us dropping swap_avail_lock and taking si->lock.
 		 * Since we dropped the swap_avail_lock, the swap_avail_head
 		 * list may have been modified; so if next is still in the
-		 * swap_avail_head list then try it, otherwise start over.
+		 * swap_avail_head list then try it, otherwise start over
+		 * if we have not gotten any slots.
 		 */
 		if (plist_node_empty(&next->avail_list))
 			goto start_over;
@@ -776,9 +852,19 @@ swp_entry_t get_swap_page(void)
 
 	spin_unlock(&swap_avail_lock);
 
-	atomic_long_inc(&nr_swap_pages);
+check_out:
+	if (n_ret < n_goal)
+		atomic_long_add((long) (n_goal-n_ret), &nr_swap_pages);
 noswap:
-	return (swp_entry_t) {0};
+	return n_ret;
+}
+
+swp_entry_t get_swap_page(void)
+{
+	swp_entry_t entry;
+
+	get_swap_pages(1, &entry);
+	return entry;
 }
 
 /* The only caller of this function is now suspend routine */
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 6/9] mm/swap: Free swap slots in batch
  2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (4 preceding siblings ...)
  2017-01-11 17:55 ` [PATCH v5 5/9] mm/swap: Allocate swap slots in batches Tim Chen
@ 2017-01-11 17:55 ` Tim Chen
  2017-01-11 17:55 ` [PATCH v5 7/9] mm/swap: Add cache for swap slots allocation Tim Chen
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Tim Chen @ 2017-01-11 17:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

Add new functions that free unused swap slots in batches without
the need to reacquire the swap_info lock for each slot.  This improves
scalability and reduces lock contention.
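
A minimal usage sketch (hypothetical caller; the real in-tree user is
the swap slot cache added later in this series), handing back a whole
batch of slots whose last reference is gone:

#include <linux/swap.h>

static void sketch_flush_free_slots(swp_entry_t *entries, int nr)
{
	/* Free the whole batch in one call instead of taking the
	 * swap_info lock once per slot. */
	if (nr)
		swapcache_free_entries(entries, nr);
}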

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |   1 +
 mm/swapfile.c        | 155 +++++++++++++++++++++++++++++++--------------------
 2 files changed, 95 insertions(+), 61 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 980f159..f0480c3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -392,6 +392,7 @@ extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free(swp_entry_t);
+extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 54fe8dd..50f2688 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -947,35 +947,34 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 	return p;
 }
 
-static unsigned char swap_entry_free(struct swap_info_struct *p,
-				     swp_entry_t entry, unsigned char usage,
-				     bool swap_info_locked)
+static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
+					struct swap_info_struct *q)
+{
+	struct swap_info_struct *p;
+
+	p = _swap_info_get(entry);
+
+	if (p != q) {
+		if (q != NULL)
+			spin_unlock(&q->lock);
+		if (p != NULL)
+			spin_lock(&p->lock);
+	}
+	return p;
+}
+
+static unsigned char __swap_entry_free(struct swap_info_struct *p,
+				       swp_entry_t entry, unsigned char usage)
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset = swp_offset(entry);
 	unsigned char count;
 	unsigned char has_cache;
-	bool lock_swap_info = false;
-
-	if (!swap_info_locked) {
-		count = p->swap_map[offset];
-		if (!p->cluster_info || count == usage || count == SWAP_MAP_SHMEM) {
-lock_swap_info:
-			swap_info_locked = true;
-			lock_swap_info = true;
-			spin_lock(&p->lock);
-		}
-	}
 
-	ci = lock_cluster(p, offset);
+	ci = lock_cluster_or_swap_info(p, offset);
 
 	count = p->swap_map[offset];
 
-	if (!swap_info_locked && (count == usage || count == SWAP_MAP_SHMEM)) {
-		unlock_cluster(ci);
-		goto lock_swap_info;
-	}
-
 	has_cache = count & SWAP_HAS_CACHE;
 	count &= ~SWAP_HAS_CACHE;
 
@@ -999,46 +998,52 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 	}
 
 	usage = count | has_cache;
-	p->swap_map[offset] = usage;
+	p->swap_map[offset] = usage ? : SWAP_HAS_CACHE;
+
+	unlock_cluster_or_swap_info(p, ci);
+
+	return usage;
+}
 
+static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
+{
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+	unsigned char count;
+
+	ci = lock_cluster(p, offset);
+	count = p->swap_map[offset];
+	VM_BUG_ON(count != SWAP_HAS_CACHE);
+	p->swap_map[offset] = 0;
+	dec_cluster_info_page(p, p->cluster_info, offset);
 	unlock_cluster(ci);
 
-	/* free if no reference */
-	if (!usage) {
-		VM_BUG_ON(!swap_info_locked);
-		mem_cgroup_uncharge_swap(entry);
-		ci = lock_cluster(p, offset);
-		dec_cluster_info_page(p, p->cluster_info, offset);
-		unlock_cluster(ci);
-		if (offset < p->lowest_bit)
-			p->lowest_bit = offset;
-		if (offset > p->highest_bit) {
-			bool was_full = !p->highest_bit;
-			p->highest_bit = offset;
-			if (was_full && (p->flags & SWP_WRITEOK)) {
-				spin_lock(&swap_avail_lock);
-				WARN_ON(!plist_node_empty(&p->avail_list));
-				if (plist_node_empty(&p->avail_list))
-					plist_add(&p->avail_list,
-						  &swap_avail_head);
-				spin_unlock(&swap_avail_lock);
-			}
-		}
-		atomic_long_inc(&nr_swap_pages);
-		p->inuse_pages--;
-		frontswap_invalidate_page(p->type, offset);
-		if (p->flags & SWP_BLKDEV) {
-			struct gendisk *disk = p->bdev->bd_disk;
-			if (disk->fops->swap_slot_free_notify)
-				disk->fops->swap_slot_free_notify(p->bdev,
-								  offset);
+	mem_cgroup_uncharge_swap(entry);
+	if (offset < p->lowest_bit)
+		p->lowest_bit = offset;
+	if (offset > p->highest_bit) {
+		bool was_full = !p->highest_bit;
+
+		p->highest_bit = offset;
+		if (was_full && (p->flags & SWP_WRITEOK)) {
+			spin_lock(&swap_avail_lock);
+			WARN_ON(!plist_node_empty(&p->avail_list));
+			if (plist_node_empty(&p->avail_list))
+				plist_add(&p->avail_list,
+					  &swap_avail_head);
+			spin_unlock(&swap_avail_lock);
 		}
 	}
+	atomic_long_inc(&nr_swap_pages);
+	p->inuse_pages--;
+	frontswap_invalidate_page(p->type, offset);
+	if (p->flags & SWP_BLKDEV) {
+		struct gendisk *disk = p->bdev->bd_disk;
 
-	if (lock_swap_info)
-		spin_unlock(&p->lock);
-
-	return usage;
+		if (disk->fops->swap_slot_free_notify)
+			disk->fops->swap_slot_free_notify(p->bdev,
+							  offset);
+	}
 }
 
 /*
@@ -1050,8 +1055,10 @@ void swap_free(swp_entry_t entry)
 	struct swap_info_struct *p;
 
 	p = _swap_info_get(entry);
-	if (p)
-		swap_entry_free(p, entry, 1, false);
+	if (p) {
+		if (!__swap_entry_free(p, entry, 1))
+			swapcache_free_entries(&entry, 1);
+	}
 }
 
 /*
@@ -1062,8 +1069,32 @@ void swapcache_free(swp_entry_t entry)
 	struct swap_info_struct *p;
 
 	p = _swap_info_get(entry);
+	if (p) {
+		if (!__swap_entry_free(p, entry, SWAP_HAS_CACHE))
+			swapcache_free_entries(&entry, 1);
+	}
+}
+
+void swapcache_free_entries(swp_entry_t *entries, int n)
+{
+	struct swap_info_struct *p, *prev;
+	int i;
+
+	if (n <= 0)
+		return;
+
+	prev = NULL;
+	p = NULL;
+	for (i = 0; i < n; ++i) {
+		p = swap_info_get_cont(entries[i], prev);
+		if (p)
+			swap_entry_free(p, entries[i]);
+		else
+			break;
+		prev = p;
+	}
 	if (p)
-		swap_entry_free(p, entry, SWAP_HAS_CACHE, false);
+		spin_unlock(&p->lock);
 }
 
 /*
@@ -1232,21 +1263,23 @@ int free_swap_and_cache(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	struct page *page = NULL;
+	unsigned char count;
 
 	if (non_swap_entry(entry))
 		return 1;
 
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, entry, 1, true) == SWAP_HAS_CACHE) {
+		count = __swap_entry_free(p, entry, 1);
+		if (count == SWAP_HAS_CACHE) {
 			page = find_get_page(swap_address_space(entry),
 					     swp_offset(entry));
 			if (page && !trylock_page(page)) {
 				put_page(page);
 				page = NULL;
 			}
-		}
-		spin_unlock(&p->lock);
+		} else if (!count)
+			swapcache_free_entries(&entry, 1);
 	}
 	if (page) {
 		/*
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 7/9] mm/swap: Add cache for swap slots allocation
  2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (5 preceding siblings ...)
  2017-01-11 17:55 ` [PATCH v5 6/9] mm/swap: Free swap slots in batch Tim Chen
@ 2017-01-11 17:55 ` Tim Chen
  2017-01-17  2:55   ` [Update][PATCH " Huang, Ying
  2017-01-11 17:55 ` [PATCH v5 8/9] mm/swap: Enable swap slots cache usage Tim Chen
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Tim Chen @ 2017-01-11 17:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

We add per-cpu caches of swap slots that can be allocated and freed
quickly without touching the swap info lock.

Two separate caches are maintained: one for swap slots allocated and
one for swap slots returned.  This allows the swap slots to be returned
to the global pool in a batch so they have a chance to be coalesced
with other slots in a cluster.  We do not reuse returned slots right
away, as doing so may increase fragmentation of the slots.

The swap allocation cache is protected by a mutex, as we may sleep when
searching for empty slots in the cache.  The swap free cache is
protected by a spin lock, as we cannot sleep in the free path.

We refill the swap slots cache when we run out of slots, and we disable
the cache and drain the slots if the global number of slots falls below
a low watermark.  We re-enable the cache again when the number of
available slots rises above a high watermark.
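
Roughly, the watermark logic described above looks like the following
sketch (simplified from the check_cache_active() path added below,
using the thresholds defined in swap_slots.h):

	long pages = get_nr_swap_pages();

	if (!swap_slot_cache_active &&
	    pages > num_online_cpus() * THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE)
		reactivate_swap_slots_cache();	/* plenty of slots again */
	else if (swap_slot_cache_active &&
		 pages < num_online_cpus() * THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE)
		deactivate_swap_slots_cache();	/* low on slots: drain cache */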

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h       |   4 +
 include/linux/swap_slots.h |  28 ++++
 mm/Makefile                |   2 +-
 mm/swap_slots.c            | 333 +++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c            |   1 +
 mm/swapfile.c              |  26 ++--
 6 files changed, 382 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/swap_slots.h
 create mode 100644 mm/swap_slots.c

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f0480c3..b90b17c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -370,6 +370,7 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
+extern bool has_usable_swap(void);
 
 /* Swap 50% full? Release swapcache more aggressively.. */
 static inline bool vm_swap_full(void)
@@ -408,6 +409,9 @@ struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
 
+extern int get_swap_slots(int n, swp_entry_t *slots);
+extern void swapcache_free_batch(swp_entry_t *entries, int n);
+
 #else /* CONFIG_SWAP */
 
 #define swap_address_space(entry)		(NULL)
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
new file mode 100644
index 0000000..a59e6e2
--- /dev/null
+++ b/include/linux/swap_slots.h
@@ -0,0 +1,28 @@
+#ifndef _LINUX_SWAP_SLOTS_H
+#define _LINUX_SWAP_SLOTS_H
+
+#include <linux/swap.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+
+#define SWAP_SLOTS_CACHE_SIZE			SWAP_BATCH
+#define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE	(5*SWAP_SLOTS_CACHE_SIZE)
+#define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE	(2*SWAP_SLOTS_CACHE_SIZE)
+
+struct swap_slots_cache {
+	bool		lock_initialized;
+	struct mutex	alloc_lock;
+	swp_entry_t	*slots;
+	int		nr;
+	int		cur;
+	spinlock_t	free_lock;
+	swp_entry_t	*slots_ret;
+	int		n_ret;
+};
+
+void disable_swap_slots_cache_lock(void);
+void reenable_swap_slots_cache_unlock(void);
+int enable_swap_slots_cache(void);
+int free_swap_slot(swp_entry_t entry);
+
+#endif /* _LINUX_SWAP_SLOTS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a..433eaf9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -35,7 +35,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o vmacache.o \
+			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
new file mode 100644
index 0000000..b816839
--- /dev/null
+++ b/mm/swap_slots.c
@@ -0,0 +1,333 @@
+/*
+ * Manage cache of swap slots to be used for and returned from
+ * swap.
+ *
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * Author: Tim Chen <tim.c.chen@linux.intel.com>
+ *
+ * We allocate the swap slots from the global pool and put
+ * them into local per-cpu caches.  This has the advantage
+ * of not needing to acquire the swap_info lock every time
+ * we need a new slot.
+ *
+ * There is also opportunity to simply return the slot
+ * to local caches without needing to acquire swap_info
+ * lock.  We do not reuse the returned slots directly but
+ * move them back to the global pool in a batch.  This
+ * allows the slots to coalesce and reduce fragmentation.
+ *
+ * The swap entry allocated is marked with SWAP_HAS_CACHE
+ * flag in map_count that prevents it from being allocated
+ * again from the global pool.
+ *
+ * The swap slots cache is protected by a mutex instead of
+ * a spin lock as when we search for slots with scan_swap_map,
+ * we can possibly sleep.
+ */
+
+#include <linux/swap_slots.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/vmalloc.h>
+#include <linux/mutex.h>
+
+#ifdef CONFIG_SWAP
+
+static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);
+static bool	swap_slot_cache_active;
+static bool	swap_slot_cache_enabled;
+static bool	swap_slot_cache_initialized;
+DEFINE_MUTEX(swap_slots_cache_mutex);
+/* Serialize swap slots cache enable/disable operations */
+DEFINE_MUTEX(swap_slots_cache_enable_mutex);
+
+static void __drain_swap_slots_cache(unsigned int type);
+static void deactivate_swap_slots_cache(void);
+static void reactivate_swap_slots_cache(void);
+
+#define use_swap_slot_cache (swap_slot_cache_active && \
+		swap_slot_cache_enabled && swap_slot_cache_initialized)
+#define SLOTS_CACHE 0x1
+#define SLOTS_CACHE_RET 0x2
+
+static void deactivate_swap_slots_cache(void)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	swap_slot_cache_active = false;
+	__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+static void reactivate_swap_slots_cache(void)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	swap_slot_cache_active = true;
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+/* Must not be called with cpu hot plug lock */
+void disable_swap_slots_cache_lock(void)
+{
+	mutex_lock(&swap_slots_cache_enable_mutex);
+	swap_slot_cache_enabled = false;
+	if (swap_slot_cache_initialized) {
+		/* serialize with cpu hotplug operations */
+		get_online_cpus();
+		__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+		put_online_cpus();
+	}
+}
+
+static void __reenable_swap_slots_cache(void)
+{
+	swap_slot_cache_enabled = has_usable_swap();
+}
+
+void reenable_swap_slots_cache_unlock(void)
+{
+	__reenable_swap_slots_cache();
+	mutex_unlock(&swap_slots_cache_enable_mutex);
+}
+
+static bool check_cache_active(void)
+{
+	long pages;
+
+	if (!swap_slot_cache_enabled || !swap_slot_cache_initialized)
+		return false;
+
+	pages = get_nr_swap_pages();
+	if (!swap_slot_cache_active) {
+		if (pages > num_online_cpus() *
+		    THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE)
+			reactivate_swap_slots_cache();
+		goto out;
+	}
+
+	/* if global pool of slot caches too low, deactivate cache */
+	if (pages < num_online_cpus() * THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE)
+		deactivate_swap_slots_cache();
+out:
+	return swap_slot_cache_active;
+}
+
+static int alloc_swap_slot_cache(unsigned int cpu)
+{
+	struct swap_slots_cache *cache;
+	swp_entry_t *slots, *slots_ret;
+
+	/*
+	 * Do allocation outside swap_slots_cache_mutex
+	 * as vzalloc could trigger reclaim and get_swap_page,
+	 * which can lock swap_slots_cache_mutex.
+	 */
+	slots = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+	if (!slots)
+		return -ENOMEM;
+
+	slots_ret = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+	if (!slots_ret) {
+		vfree(slots);
+		return -ENOMEM;
+	}
+
+	mutex_lock(&swap_slots_cache_mutex);
+	cache = &per_cpu(swp_slots, cpu);
+	if (cache->slots || cache->slots_ret)
+		/* cache already allocated */
+		goto out;
+	if (!cache->lock_initialized) {
+		mutex_init(&cache->alloc_lock);
+		spin_lock_init(&cache->free_lock);
+		cache->lock_initialized = true;
+	}
+	cache->nr = 0;
+	cache->cur = 0;
+	cache->n_ret = 0;
+	cache->slots = slots;
+	slots = NULL;
+	cache->slots_ret = slots_ret;
+	slots_ret = NULL;
+out:
+	mutex_unlock(&swap_slots_cache_mutex);
+	if (slots)
+		vfree(slots);
+	if (slots_ret)
+		vfree(slots_ret);
+	return 0;
+}
+
+static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
+				  bool free_slots)
+{
+	struct swap_slots_cache *cache;
+	swp_entry_t *slots = NULL;
+
+	cache = &per_cpu(swp_slots, cpu);
+	if ((type & SLOTS_CACHE) && cache->slots) {
+		mutex_lock(&cache->alloc_lock);
+		swapcache_free_entries(cache->slots + cache->cur, cache->nr);
+		cache->cur = 0;
+		cache->nr = 0;
+		if (free_slots && cache->slots) {
+			vfree(cache->slots);
+			cache->slots = NULL;
+		}
+		mutex_unlock(&cache->alloc_lock);
+	}
+	if ((type & SLOTS_CACHE_RET) && cache->slots_ret) {
+		spin_lock_irq(&cache->free_lock);
+		swapcache_free_entries(cache->slots_ret, cache->n_ret);
+		cache->n_ret = 0;
+		if (free_slots && cache->slots_ret) {
+			slots = cache->slots_ret;
+			cache->slots_ret = NULL;
+		}
+		spin_unlock_irq(&cache->free_lock);
+		if (slots)
+			vfree(slots);
+	}
+}
+
+static void __drain_swap_slots_cache(unsigned int type)
+{
+	unsigned int cpu;
+
+	/*
+	 * This function is called during
+	 *	1) swapoff, when we have to make sure no
+	 *	   left over slots are in cache when we remove
+	 *	   a swap device;
+	 *      2) disabling of swap slot cache, when we run low
+	 *	   on swap slots when allocating memory and need
+	 *	   to return swap slots to global pool.
+	 *
+	 * We cannot acquire cpu hot plug lock here as
+	 * this function can be invoked in the cpu
+	 * hot plug path:
+	 * cpu_up -> lock cpu_hotplug -> cpu hotplug state callback
+	 *   -> memory allocation -> direct reclaim -> get_swap_page
+	 *   -> drain_swap_slots_cache
+	 *
+	 * Hence the loop over current online cpu below could miss cpu that
+	 * is being brought online but not yet marked as online.
+	 * That is okay as we do not schedule and run anything on a
+	 * cpu before it has been marked online. Hence, we will not
+	 * fill any swap slots in slots cache of such cpu.
+	 * There are no slots on such cpu that need to be drained.
+	 */
+	for_each_online_cpu(cpu)
+		drain_slots_cache_cpu(cpu, type, false);
+}
+
+static int free_slot_cache(unsigned int cpu)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	drain_slots_cache_cpu(cpu, SLOTS_CACHE | SLOTS_CACHE_RET, true);
+	mutex_unlock(&swap_slots_cache_mutex);
+	return 0;
+}
+
+int enable_swap_slots_cache(void)
+{
+	int ret = 0;
+
+	mutex_lock(&swap_slots_cache_enable_mutex);
+	if (swap_slot_cache_initialized) {
+		__reenable_swap_slots_cache();
+		goto out_unlock;
+	}
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "swap_slots_cache",
+				alloc_swap_slot_cache, free_slot_cache);
+	if (ret < 0)
+		goto out_unlock;
+	swap_slot_cache_initialized = true;
+	__reenable_swap_slots_cache();
+out_unlock:
+	mutex_unlock(&swap_slots_cache_enable_mutex);
+	return 0;
+}
+
+/* called with swap slot cache's alloc lock held */
+static int refill_swap_slots_cache(struct swap_slots_cache *cache)
+{
+	if (!use_swap_slot_cache || cache->nr)
+		return 0;
+
+	cache->cur = 0;
+	if (swap_slot_cache_active)
+		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots);
+
+	return cache->nr;
+}
+
+int free_swap_slot(swp_entry_t entry)
+{
+	struct swap_slots_cache *cache;
+
+	BUG_ON(!swap_slot_cache_initialized);
+
+	cache = &get_cpu_var(swp_slots);
+	if (use_swap_slot_cache && cache->slots_ret) {
+		spin_lock_irq(&cache->free_lock);
+		/* Swap slots cache may be deactivated before acquiring lock */
+		if (!use_swap_slot_cache) {
+			spin_unlock_irq(&cache->free_lock);
+			goto direct_free;
+		}
+		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
+			/*
+			 * Return slots to global pool.
+			 * The current swap_map value is SWAP_HAS_CACHE.
+			 * Set it to 0 to indicate it is available for
+			 * allocation in global pool
+			 */
+			swapcache_free_entries(cache->slots_ret, cache->n_ret);
+			cache->n_ret = 0;
+		}
+		cache->slots_ret[cache->n_ret++] = entry;
+		spin_unlock_irq(&cache->free_lock);
+	} else {
+direct_free:
+		swapcache_free_entries(&entry, 1);
+	}
+	put_cpu_var(swp_slots);
+
+	return 0;
+}
+
+swp_entry_t get_swap_page(void)
+{
+	swp_entry_t entry, *pentry;
+	struct swap_slots_cache *cache;
+
+	cache = this_cpu_ptr(&swp_slots);
+
+	entry.val = 0;
+	if (check_cache_active()) {
+		mutex_lock(&cache->alloc_lock);
+		if (cache->slots) {
+repeat:
+			if (cache->nr) {
+				pentry = &cache->slots[cache->cur++];
+				entry = *pentry;
+				pentry->val = 0;
+				cache->nr--;
+			} else {
+				if (refill_swap_slots_cache(cache))
+					goto repeat;
+			}
+		}
+		mutex_unlock(&cache->alloc_lock);
+		if (entry.val)
+			return entry;
+	}
+
+	get_swap_pages(1, &entry);
+
+	return entry;
+}
+
+#endif /* CONFIG_SWAP */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3d76d80..e1f07ca 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
+#include <linux/swap_slots.h>
 
 #include <asm/pgtable.h>
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 50f2688..e6c30ed 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -34,6 +34,7 @@
 #include <linux/frontswap.h>
 #include <linux/swapfile.h>
 #include <linux/export.h>
+#include <linux/swap_slots.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -859,14 +860,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 	return n_ret;
 }
 
-swp_entry_t get_swap_page(void)
-{
-	swp_entry_t entry;
-
-	get_swap_pages(1, &entry);
-	return entry;
-}
-
 /* The only caller of this function is now suspend routine */
 swp_entry_t get_swap_page_of_type(int type)
 {
@@ -1057,7 +1050,7 @@ void swap_free(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		if (!__swap_entry_free(p, entry, 1))
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot(entry);
 	}
 }
 
@@ -1071,7 +1064,7 @@ void swapcache_free(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		if (!__swap_entry_free(p, entry, SWAP_HAS_CACHE))
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot(entry);
 	}
 }
 
@@ -1279,7 +1272,7 @@ int free_swap_and_cache(swp_entry_t entry)
 				page = NULL;
 			}
 		} else if (!count)
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot(entry);
 	}
 	if (page) {
 		/*
@@ -2107,6 +2100,17 @@ static void reinsert_swap_info(struct swap_info_struct *p)
 	spin_unlock(&swap_lock);
 }
 
+bool has_usable_swap(void)
+{
+	bool ret = true;
+
+	spin_lock(&swap_lock);
+	if (plist_head_empty(&swap_active_head))
+		ret = false;
+	spin_unlock(&swap_lock);
+	return ret;
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 8/9] mm/swap: Enable swap slots cache usage
  2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (6 preceding siblings ...)
  2017-01-11 17:55 ` [PATCH v5 7/9] mm/swap: Add cache for swap slots allocation Tim Chen
@ 2017-01-11 17:55 ` Tim Chen
  2017-01-11 17:55 ` [PATCH v5 9/9] mm/swap: Skip readahead only when swap slot cache is enabled Tim Chen
  2017-01-16 12:02 ` [PATCH v5 0/9] mm/swap: Regular page swap optimizations Michal Hocko
  9 siblings, 0 replies; 33+ messages in thread
From: Tim Chen @ 2017-01-11 17:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

Initialize the swap slots cache and enable it on swapon.
Drain all cached swap slots on swapoff.
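
In outline, swapoff now brackets try_to_unuse() with the cache
disable/re-enable calls from the previous patch (a sketch of the flow
in the hunks below, with error handling omitted):

	disable_swap_slots_cache_lock();	/* stop caching, drain per-cpu slots */

	err = try_to_unuse(p->type, false, 0);	/* force unuse all pages */

	reenable_swap_slots_cache_unlock();	/* re-enable if usable swap remains */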

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/swapfile.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index e6c30ed..14d9ea2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2181,6 +2181,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 
+	disable_swap_slots_cache_lock();
+
 	set_current_oom_origin();
 	err = try_to_unuse(p->type, false, 0); /* force unuse all pages */
 	clear_current_oom_origin();
@@ -2188,9 +2190,12 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (err) {
 		/* re-insert swap space back into swap_list */
 		reinsert_swap_info(p);
+		reenable_swap_slots_cache_unlock();
 		goto out_dput;
 	}
 
+	reenable_swap_slots_cache_unlock();
+
 	flush_work(&p->discard_work);
 
 	destroy_swap_extents(p);
@@ -2868,6 +2873,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		putname(name);
 	if (inode && S_ISREG(inode->i_mode))
 		inode_unlock(inode);
+	if (!error)
+		enable_swap_slots_cache();
 	return error;
 }
 
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v5 9/9] mm/swap: Skip readahead only when swap slot cache is enabled
  2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (7 preceding siblings ...)
  2017-01-11 17:55 ` [PATCH v5 8/9] mm/swap: Enable swap slots cache usage Tim Chen
@ 2017-01-11 17:55 ` Tim Chen
  2017-01-16 12:02 ` [PATCH v5 0/9] mm/swap: Regular page swap optimizations Michal Hocko
  9 siblings, 0 replies; 33+ messages in thread
From: Tim Chen @ 2017-01-11 17:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim Chen

From: Huang Ying <ying.huang@intel.com>

During swapoff, a swap entry may have swap_map[] == SWAP_HAS_CACHE
(for example, if it has just been allocated).  If we return NULL for
such an entry in __read_swap_cache_async(), swapoff will abort.  So
when the swap slot cache is disabled (as it is during swapoff), we
wait for the page to be put into the swap cache in this race
condition.  This is not a problem for the swap slot cache itself,
because the cache is drained after swap_slot_cache_enabled is cleared.
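
Concretely, the readahead check becomes (a sketch of the condition in
the mm/swap_state.c hunk below):

	/*
	 * Treat a zero swap count as "unused" only while the slot cache is
	 * enabled.  During swapoff the cache is disabled and drained, so
	 * such an entry may be in the middle of being added to the swap
	 * cache and must be waited for instead of skipped.
	 */
	if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
		return NULL;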

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap_slots.h |  2 ++
 mm/swap_slots.c            |  2 +-
 mm/swap_state.c            | 11 +++++++++--
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index a59e6e2..fb90734 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -25,4 +25,6 @@ void reenable_swap_slots_cache_unlock(void);
 int enable_swap_slots_cache(void);
 int free_swap_slot(swp_entry_t entry);
 
+extern bool swap_slot_cache_enabled;
+
 #endif /* _LINUX_SWAP_SLOTS_H */
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index b816839..8cf941e 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -36,7 +36,7 @@
 
 static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);
 static bool	swap_slot_cache_active;
-static bool	swap_slot_cache_enabled;
+bool	swap_slot_cache_enabled;
 static bool	swap_slot_cache_initialized;
 DEFINE_MUTEX(swap_slots_cache_mutex);
 /* Serialize swap slots cache enable/disable operations */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e1f07ca..2126e9b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -324,8 +324,15 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		if (found_page)
 			break;
 
-		/* Just skip read ahead for unused swap slot */
-		if (!__swp_swapcount(entry))
+		/*
+		 * Just skip read ahead for unused swap slot.
+		 * During swap_off when swap_slot_cache is disabled,
+		 * we have to handle the race between putting
+		 * swap entry in swap cache and marking swap slot
+		 * as SWAP_HAS_CACHE.  That's done in later part of code or
+		 * else swap_off will be aborted if we return NULL.
+		 */
+		if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
 			return NULL;
 
 		/*
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/9] mm/swap: Add cluster lock
  2017-01-11 17:55 ` [PATCH v5 2/9] mm/swap: Add cluster lock Tim Chen
@ 2017-01-11 23:00   ` Andrew Morton
  2017-01-11 23:07     ` Jonathan Corbet
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2017-01-11 23:00 UTC (permalink / raw)
  To: Tim Chen
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

On Wed, 11 Jan 2017 09:55:12 -0800 Tim Chen <tim.c.chen@linux.intel.com> wrote:

> From: "Huang, Ying" <ying.huang@intel.com>
> 
> This patch is to reduce the lock contention of swap_info_struct->lock
> via using a more fine grained lock in swap_cluster_info for some swap
> operations.  swap_info_struct->lock is heavily contended if multiple
> processes reclaim pages simultaneously.  Because there is only one lock
> for each swap device.  While in common configuration, there is only one
> or several swap devices in the system.  The lock protects almost all
> swap related operations.
> 
> In fact, many swap operations only access one element of
> swap_info_struct->swap_map array.  And there is no dependency between
> different elements of swap_info_struct->swap_map.  So a fine grained
> lock can be used to allow parallel access to the different elements of
> swap_info_struct->swap_map.
> 
> In this patch, one bit of swap_cluster_info is used as the bin spinlock
> to protect the elements of swap_info_struct->swap_map in the swap
> cluster and the fields of swap_cluster_info.  This reduced locking
> contention for swap_info_struct->swap_map access greatly.
> 
> To use the bin spinlock, the size of swap_cluster_info needs to increase
> from 4 bytes to 8 bytes on the 64bit system.  This will use 4k more
> memory for every 1G swap space.
> 
> Because the size of swap_cluster_info is much smaller than the size of
> the cache line (8 vs 64 on x86_64 architecture), there may be false
> cache line sharing between swap_cluster_info bit spinlocks.  To avoid
> the false sharing in the first round of the swap cluster allocation, the
> order of the swap clusters in the free clusters list is changed.  So
> that, the swap_cluster_info sharing the same cache line will be placed
> as far as possible.  After the first round of allocation, the order of
> the clusters in free clusters list is expected to be random.  So the
> false sharing should be not noticeable.
> 
> ...
>
> @@ -175,11 +175,16 @@ enum {
>   * protected by swap_info_struct.lock.
>   */
>  struct swap_cluster_info {
> -	unsigned int data:24;
> -	unsigned int flags:8;
> +	unsigned long data;
>  };
>
> ...
>
> +static inline void __lock_cluster(struct swap_cluster_info *ci)
> +{
> +	bit_spin_lock(CLUSTER_FLAG_LOCK_BIT, &ci->data);
> +}

hm, bit_spin_lock() is a nasty thing.  It is slow and it doesn't have
all the lockdep support.

Would the world end if we added a spinlock to swap_cluster_info?  Check
my math: for each 1G of swap space we have 256k pages, hence 1k of
swap_cluster_infos, hence 4k of memory.  i.e., one page of memory for
each 256,000 pages of swap.  Is increasing that 1/256000 to 2/256000 a
big deal?
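
(For concreteness, a sketch of one possible layout and the arithmetic;
the 4-byte spinlock_t assumes lockdep and other debug options are off:)

	struct swap_cluster_info {
		spinlock_t lock;	/* 4 bytes without debug options */
		unsigned int data;	/* 24-bit count + 8 bits of flags */
	};				/* 8 bytes total */

	/*
	 * 1G swap / 4K page = 256k pages; at 256 pages per cluster that is
	 * 1k swap_cluster_infos.  1k * 4 bytes = 4k per 1G of swap today,
	 * 1k * 8 bytes = 8k with an embedded spinlock, i.e. 2/256000.
	 */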



Also, I note that struct swap_cluster_info is only used in swapfile.c,
so as a cleanup we could move its definition into that .c file.
Perhaps other things could be moved as well.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/9] mm/swap: Add cluster lock
  2017-01-11 23:00   ` Andrew Morton
@ 2017-01-11 23:07     ` Jonathan Corbet
  2017-01-11 23:15       ` Andrew Morton
  2017-01-12  1:23       ` [PATCH " Huang, Ying
  0 siblings, 2 replies; 33+ messages in thread
From: Jonathan Corbet @ 2017-01-11 23:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger

On Wed, 11 Jan 2017 15:00:29 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> hm, bit_spin_lock() is a nasty thing.  It is slow and it doesn't have
> all the lockdep support.
> 
> Would the world end if we added a spinlock to swap_cluster_info?

FWIW, I asked the same question in December, this is what I got:

jon

> From: "Huang\, Ying" <ying.huang@intel.com>
> To: Jonathan Corbet <corbet@lwn.net>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>,  Andrew Morton <akpm@linux-foundation.org>,  "Huang\, Ying" <ying.huang@intel.com>,  <dave.hansen@intel.com>,  <ak@linux.intel.com>,  <aaron.lu@intel.com>,  <linux-mm@kvack.org>,  <linux-kernel@vger.kernel.org>,  Hugh Dickins <hughd@google.com>,  Shaohua Li <shli@kernel.org>,  Minchan Kim <minchan@kernel.org>,  Rik van Riel <riel@redhat.com>,  Andrea Arcangeli <aarcange@redhat.com>,  "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,  Vladimir Davydov <vdavydov.dev@gmail.com>,  Johannes Weiner <hannes@cmpxchg.org>,  Michal Hocko <mhocko@kernel.org>,  "Hillf Danton" <hillf.zj@alibaba-inc.com>
> Subject: Re: [PATCH v2 2/8] mm/swap: Add cluster lock
> Date: Tue, 25 Oct 2016 10:05:39 +0800
> 
> Hi, Jonathan,
> 
> Thanks for review.
> 
> Jonathan Corbet <corbet@lwn.net> writes:
> 
> > On Thu, 20 Oct 2016 16:31:41 -0700
> > Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >  
> >> From: "Huang, Ying" <ying.huang@intel.com>
> >> 
> >> This patch is to reduce the lock contention of swap_info_struct->lock
> >> via using a more fine grained lock in swap_cluster_info for some swap
> >> operations.  swap_info_struct->lock is heavily contended if multiple
> >> processes reclaim pages simultaneously.  Because there is only one lock
> >> for each swap device.  While in common configuration, there is only one
> >> or several swap devices in the system.  The lock protects almost all
> >> swap related operations.  
> >
> > So I'm looking at this a bit.  Overall it seems like a good thing to do
> > (from my limited understanding of this area) but I have a probably silly
> > question... 
> >  
> >>  struct swap_cluster_info {
> >> -	unsigned int data:24;
> >> -	unsigned int flags:8;
> >> +	unsigned long data;
> >>  };
> >> -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> >> -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> >> +#define CLUSTER_COUNT_SHIFT		8
> >> +#define CLUSTER_FLAG_MASK		((1UL << CLUSTER_COUNT_SHIFT) - 1)
> >> +#define CLUSTER_COUNT_MASK		(~CLUSTER_FLAG_MASK)
> >> +#define CLUSTER_FLAG_FREE		1 /* This cluster is free */
> >> +#define CLUSTER_FLAG_NEXT_NULL		2 /* This cluster has no next cluster */
> >> +/* cluster lock, protect cluster_info contents and sis->swap_map */
> >> +#define CLUSTER_FLAG_LOCK_BIT		2
> >> +#define CLUSTER_FLAG_LOCK		(1 << CLUSTER_FLAG_LOCK_BIT)  
> >
> > Why the roll-your-own locking and data structures here?  To my naive
> > understanding, it seems like you could do something like:
> >
> >   struct swap_cluster_info {
> >   	spinlock_t lock;
> > 	atomic_t count;
> > 	unsigned int flags;
> >   };
> >
> > Then you could use proper spinlock operations which, among other things,
> > would make the realtime folks happier.  That might well help with the
> > cache-line sharing issues as well.  Some of the count manipulations could
> > perhaps be done without the lock entirely; similarly, atomic bitops might
> > save you the locking for some of the flag tweaks - though I'd have to look
> > more closely to be really sure of that.
> >
> > The cost, of course, is the growth of this structure, but you've already
> > noted that the overhead isn't all that high; seems like it could be worth
> > it.  
> 
> Yes.  The data structure you proposed is much easier to be used than the
> current one.  The main concern is the RAM usage.  The size of the data
> structure you proposed is about 80 bytes, while that of the current one
> is about 8 bytes.  There will be one struct swap_cluster_info for every
> 1MB swap space, so for 1TB swap space, the total size will be 80M
> compared with 8M of current implementation.
> 
> In the other hand, the return of the increased size is not overwhelming.
> The bit spinlock on cluster will not be heavy contended because it is a
> quite fine-grained lock.  So the benefit will be little to use lockless
> operations.  I guess the realtime issue isn't serious given the lock is
> not heavy contended and the operations protected by the lock is
> light-weight too.
> 
> Best Regards,
> Huang, Ying
> 
> > I assume that I'm missing something obvious here?
> >
> > Thanks,
> >
> > jon  
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 3/9] mm/swap: Split swap cache into 64MB trunks
  2017-01-11 17:55 ` [PATCH v5 3/9] mm/swap: Split swap cache into 64MB trunks Tim Chen
@ 2017-01-11 23:09   ` Andrew Morton
  2017-01-11 23:19     ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2017-01-11 23:09 UTC (permalink / raw)
  To: Tim Chen
  Cc: Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

On Wed, 11 Jan 2017 09:55:13 -0800 Tim Chen <tim.c.chen@linux.intel.com> wrote:

> The patch is to improve the scalability of the swap out/in via using
> fine grained locks for the swap cache.  In current kernel, one address
> space will be used for each swap device.  And in the common
> configuration, the number of the swap device is very small (one is
> typical).  This causes the heavy lock contention on the radix tree of
> the address space if multiple tasks swap out/in concurrently.  But in
> fact, there is no dependency between pages in the swap cache.  So that,
> we can split the one shared address space for each swap device into
> several address spaces to reduce the lock contention.  In the patch, the
> shared address space is split into 64MB trunks.  64MB is chosen to
> balance the memory space usage and effect of lock contention reduction.
> 
> The size of struct address_space on x86_64 architecture is 408B, so with
> the patch, 6528B more memory will be used for every 1GB swap space on
> x86_64 architecture.
> 
> One address space is still shared for the swap entries in the same 64M
> trunks.  To avoid lock contention for the first round of swap space
> allocation, the order of the swap clusters in the initial free clusters
> list is changed.  The swap space distance between the consecutive swap
> clusters in the free cluster list is at least 64M.  After the first
> round of allocation, the swap clusters are expected to be freed
> randomly, so the lock contention should be reduced effectively.

Switching from a single radix-tree to an array of radix-trees to reduce
contention seems a bit hacky.  That we can do this and have everything
continue to work tells me that we're simply using an inappropriate data
structure to hold this info.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/9] mm/swap: Add cluster lock
  2017-01-11 23:07     ` Jonathan Corbet
@ 2017-01-11 23:15       ` Andrew Morton
  2017-01-12  1:47         ` Huang, Ying
  2017-01-12  1:23       ` [PATCH " Huang, Ying
  1 sibling, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2017-01-11 23:15 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Tim Chen, Huang, Ying, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger

On Wed, 11 Jan 2017 16:07:29 -0700 Jonathan Corbet <corbet@lwn.net> wrote:

> On Wed, 11 Jan 2017 15:00:29 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > hm, bit_spin_lock() is a nasty thing.  It is slow and it doesn't have
> > all the lockdep support.
> > 
> > Would the world end if we added a spinlock to swap_cluster_info?
> 
> FWIW, I asked the same question in December, this is what I got:
> 
> ...
>
> > > Why the roll-your-own locking and data structures here?  To my naive
> > > understanding, it seems like you could do something like:
> > >
> > >   struct swap_cluster_info {
> > >   	spinlock_t lock;
> > > 	atomic_t count;
> > > 	unsigned int flags;
> > >   };
> > >
> > > Then you could use proper spinlock operations which, among other things,
> > > would make the realtime folks happier.  That might well help with the
> > > cache-line sharing issues as well.  Some of the count manipulations could
> > > perhaps be done without the lock entirely; similarly, atomic bitops might
> > > save you the locking for some of the flag tweaks - though I'd have to look
> > > more closely to be really sure of that.
> > >
> > > The cost, of course, is the growth of this structure, but you've already
> > > noted that the overhead isn't all that high; seems like it could be worth
> > > it.  
> > 
> > Yes.  The data structure you proposed is much easier to be used than the
> > current one.  The main concern is the RAM usage.  The size of the data
> > structure you proposed is about 80 bytes, while that of the current one
> > is about 8 bytes.  There will be one struct swap_cluster_info for every
> > 1MB swap space, so for 1TB swap space, the total size will be 80M
> > compared with 8M of current implementation.

Where did this 80 bytes come from?  That swap_cluster_info is 12 bytes
and could perhaps be squeezed into 8 bytes if we can get away with a
24-bit "count".


> > In the other hand, the return of the increased size is not overwhelming.
> > The bit spinlock on cluster will not be heavy contended because it is a
> > quite fine-grained lock.  So the benefit will be little to use lockless
> > operations.  I guess the realtime issue isn't serious given the lock is
> > not heavy contended and the operations protected by the lock is
> > light-weight too.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 3/9] mm/swap: Split swap cache into 64MB trunks
  2017-01-11 23:09   ` Andrew Morton
@ 2017-01-11 23:19     ` Andi Kleen
  2017-01-12 16:47       ` Tim Chen
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2017-01-11 23:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tim Chen, Huang, Ying, dave.hansen, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

> Switching from a single radix-tree to an array of radix-trees to reduce
> contention seems a bit hacky.  That we can do this and have everything
> continue to work tells me that we're simply using an inappropriate data
> structure to hold this info.

What would you use instead?

A tree with fine grained locking?

FWIW, too fine-grained locking (e.g. on every node) is usually a bad idea:

it slows down single-thread performance and causes much more overhead
when there is actual contention, because too much time is spent bouncing
cache lines around.

So I actually like the "a little bit more fine grained, but not too much"
approach.

Or a hash table? 

Not sure if this would work here.

-Andi


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/9] mm/swap: Add cluster lock
  2017-01-11 23:07     ` Jonathan Corbet
  2017-01-11 23:15       ` Andrew Morton
@ 2017-01-12  1:23       ` Huang, Ying
  1 sibling, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2017-01-12  1:23 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Morton, Tim Chen, Huang, Ying, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger

Hi, Jonathan,

Jonathan Corbet <corbet@lwn.net> writes:

> On Wed, 11 Jan 2017 15:00:29 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> hm, bit_spin_lock() is a nasty thing.  It is slow and it doesn't have
>> all the lockdep support.
>> 
>> Would the world end if we added a spinlock to swap_cluster_info?
>
> FWIW, I asked the same question in December, this is what I got:

Sorry, I made a mistake in the following email.  I have since sent
another email correcting it from my other email address,
huang.ying.caritas@gmail.com.  Have you received it?  It is copied below:

From: huang ying <huang.ying.caritas@gmail.com>
Subject: Re: [PATCH v2 2/8] mm/swap: Add cluster lock
To: "Huang, Ying" <ying.huang@intel.com>
CC: Jonathan Corbet <corbet@lwn.net>, Tim Chen <tim.c.chen@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>, <dave.hansen@intel.com>, "Andi
 Kleen" <ak@linux.intel.com>, Aaron Lu <aaron.lu@intel.com>,
	<linux-mm@kvack.org>, LKML <linux-kernel@vger.kernel.org>, Hugh Dickins
	<hughd@google.com>, Shaohua Li <shli@kernel.org>, Minchan Kim
	<minchan@kernel.org>, Rik van Riel <riel@redhat.com>, Andrea Arcangeli
	<aarcange@redhat.com>, "Kirill A . Shutemov"
	<kirill.shutemov@linux.intel.com>, Vladimir Davydov <vdavydov.dev@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Hillf
 Danton <hillf.zj@alibaba-inc.com>
Date: Wed, 28 Dec 2016 11:34:01 +0800 (2 weeks, 21 hours, 45 minutes ago)

Hi, Jonathan,

On Tue, Oct 25, 2016 at 10:05 AM, Huang, Ying <ying.huang@intel.com> wrote:
> Hi, Jonathan,
>
> Thanks for review.
>
> Jonathan Corbet <corbet@lwn.net> writes:
>
>> On Thu, 20 Oct 2016 16:31:41 -0700
>> Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>
>>> From: "Huang, Ying" <ying.huang@intel.com>
>>>
>>> This patch is to reduce the lock contention of swap_info_struct->lock
>>> via using a more fine grained lock in swap_cluster_info for some swap
>>> operations.  swap_info_struct->lock is heavily contended if multiple

[...]

>> The cost, of course, is the growth of this structure, but you've already
>> noted that the overhead isn't all that high; seems like it could be worth
>> it.
>
> Yes.  The data structure you proposed is much easier to be used than the
> current one.  The main concern is the RAM usage.  The size of the data
> structure you proposed is about 80 bytes, while that of the current one
> is about 8 bytes.  There will be one struct swap_cluster_info for every
> 1MB swap space, so for 1TB swap space, the total size will be 80M
> compared with 8M of current implementation.

Sorry, I had lockdep turned on when measuring the size change, so the
previous size data is wrong.  The size of the data structure you
proposed is 12 bytes, while that of the current one is 8 bytes on
64-bit platforms and 4 bytes on 32-bit platforms.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/9] mm/swap: Add cluster lock
  2017-01-11 23:15       ` Andrew Morton
@ 2017-01-12  1:47         ` Huang, Ying
  2017-01-12  1:58           ` Andrew Morton
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2017-01-12  1:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Tim Chen, Huang, Ying, dave.hansen, ak,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger

Hi, Andrew,

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 11 Jan 2017 16:07:29 -0700 Jonathan Corbet <corbet@lwn.net> wrote:
>
>> On Wed, 11 Jan 2017 15:00:29 -0800
>> Andrew Morton <akpm@linux-foundation.org> wrote:
>> 
>> > hm, bit_spin_lock() is a nasty thing.  It is slow and it doesn't have
>> > all the lockdep support.
>> > 
>> > Would the world end if we added a spinlock to swap_cluster_info?
>> 
>> FWIW, I asked the same question in December, this is what I got:
>> 
>> ...
>>
>> > > Why the roll-your-own locking and data structures here?  To my naive
>> > > understanding, it seems like you could do something like:
>> > >
>> > >   struct swap_cluster_info {
>> > >   	spinlock_t lock;
>> > > 	atomic_t count;
>> > > 	unsigned int flags;
>> > >   };
>> > >
>> > > Then you could use proper spinlock operations which, among other things,
>> > > would make the realtime folks happier.  That might well help with the
>> > > cache-line sharing issues as well.  Some of the count manipulations could
>> > > perhaps be done without the lock entirely; similarly, atomic bitops might
>> > > save you the locking for some of the flag tweaks - though I'd have to look
>> > > more closely to be really sure of that.
>> > >
>> > > The cost, of course, is the growth of this structure, but you've already
>> > > noted that the overhead isn't all that high; seems like it could be worth
>> > > it.  
>> > 
>> > Yes.  The data structure you proposed is much easier to be used than the
>> > current one.  The main concern is the RAM usage.  The size of the data
>> > structure you proposed is about 80 bytes, while that of the current one
>> > is about 8 bytes.  There will be one struct swap_cluster_info for every
>> > 1MB swap space, so for 1TB swap space, the total size will be 80M
>> > compared with 8M of current implementation.
>
> Where did this 80 bytes come from?  That swap_cluster_info is 12 bytes
> and could perhaps be squeezed into 8 bytes if we can get away with a
> 24-bit "count".

Sorry, I made a mistake when measuring the size of swap_cluster_info
for that email, because I had lockdep turned on while measuring.
I sent out a correction email to Jonathan when I realized this later.

So the latest size measurements are:

If we use bit_spin_lock, the size of swap_cluster_info will

- increase from 4 bytes to 8 bytes on 64-bit platforms
- stay at 4 bytes on 32-bit platforms

If we use a normal spinlock (queued spinlock), the size of swap_cluster_info will

- increase from 4 bytes to 8 bytes on 64-bit platforms
- increase from 4 bytes to 8 bytes on 32-bit platforms

So the difference occurs on 32-bit platforms.  If the size increase on
32-bit platforms is OK, then I think it would be good to use a normal
spinlock instead of bit_spin_lock.  Personally, I am OK with that.  But
I don't know whether some people in the embedded world will object.
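
(To make that concrete, a sketch of the two layouts behind those
numbers; the struct names are purely illustrative and debug options
are assumed off:)

	struct swap_cluster_info_bitlock {	/* this series: lock bit inside data */
		unsigned long data;		/* 8 bytes on 64-bit, 4 bytes on 32-bit */
	};

	struct swap_cluster_info_spinlock {	/* proposed alternative */
		spinlock_t lock;		/* 4 bytes without lockdep */
		unsigned int data;		/* count + flags */
	};					/* 8 bytes on both 64-bit and 32-bit */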

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/9] mm/swap: Add cluster lock
  2017-01-12  1:47         ` Huang, Ying
@ 2017-01-12  1:58           ` Andrew Morton
  2017-01-12  2:51             ` Huang, Ying
  2017-01-14  4:37             ` [Update][PATCH " Huang, Ying
  0 siblings, 2 replies; 33+ messages in thread
From: Andrew Morton @ 2017-01-12  1:58 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Jonathan Corbet, Tim Chen, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger

On Thu, 12 Jan 2017 09:47:51 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:

> >> > 1MB swap space, so for 1TB swap space, the total size will be 80M
> >> > compared with 8M of current implementation.
> >
> > Where did this 80 bytes come from?  That swap_cluster_info is 12 bytes
> > and could perhaps be squeezed into 8 bytes if we can get away with a
> > 24-bit "count".
> 
> Sorry, I made a mistake when measuring the size of swap_cluster_info
> when I sent that email, because I turned on the lockdep when measuring.
> I have sent out a correction email to Jonathan when I realized that
> later.
> 
> So the latest size measuring result is:
> 
> If we use bit_spin_lock, the size of cluster_swap_info will,
> 
> - increased from 4 bytes to 8 bytes on 64 bit platform
> - keep as 4 bytes on 32 bit platform
> 
> If we use normal spinlock (queue spinlock), the size of cluster_swap_info will,
> 
> - increased from 4 bytes to 8 bytes on 64 bit platform
> - increased from 4 bytes to 8 bytes on 32 bit platform
> 
> So the difference occurs on 32 bit platform.  If the size increment on
> 32 bit platform is OK, then I think it should be good to use normal
> spinlock instead of bit_spin_lock.  Personally, I am OK for that.  But I
> don't know whether there will be some embedded world people don't like
> it.

I think that'll be OK - the difference is small and many small systems
disable swap anyway.  So can we please try that?  Please do describe
the additional overhead (with numbers) in the changelog: "additional
bytes of RAM per GB of swap", for example.  And please also rerun the
performance tests, see if we can notice the alleged speed improvements
from switching to a spinlock.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 2/9] mm/swap: Add cluster lock
  2017-01-12  1:58           ` Andrew Morton
@ 2017-01-12  2:51             ` Huang, Ying
  2017-01-14  4:37             ` [Update][PATCH " Huang, Ying
  1 sibling, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2017-01-12  2:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, Jonathan Corbet, Tim Chen, dave.hansen, ak,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 12 Jan 2017 09:47:51 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
>
>> >> > 1MB swap space, so for 1TB swap space, the total size will be 80M
>> >> > compared with 8M of current implementation.
>> >
>> > Where did this 80 bytes come from?  That swap_cluster_info is 12 bytes
>> > and could perhaps be squeezed into 8 bytes if we can get away with a
>> > 24-bit "count".
>> 
>> Sorry, I made a mistake when measuring the size of swap_cluster_info
>> when I sent that email, because I turned on the lockdep when measuring.
>> I have sent out a correction email to Jonathan when I realized that
>> later.
>> 
>> So the latest size measuring result is:
>> 
>> If we use bit_spin_lock, the size of cluster_swap_info will,
>> 
>> - increased from 4 bytes to 8 bytes on 64 bit platform
>> - keep as 4 bytes on 32 bit platform
>> 
>> If we use normal spinlock (queue spinlock), the size of cluster_swap_info will,
>> 
>> - increased from 4 bytes to 8 bytes on 64 bit platform
>> - increased from 4 bytes to 8 bytes on 32 bit platform
>> 
>> So the difference occurs on 32 bit platform.  If the size increment on
>> 32 bit platform is OK, then I think it should be good to use normal
>> spinlock instead of bit_spin_lock.  Personally, I am OK for that.  But I
>> don't know whether there will be some embedded world people don't like
>> it.
>
> I think that'll be OK - the difference is small and many small systems
> disable swap anyway.  So can we please try that?  Please do describe
> the additional overhead (with numbers) in the changelog: "additional
> bytes of RAM per GB of swap", for example.  And please also rerun the
> performance tests, see if we can notice the alleged speed improvements
> from switching to a spinlock.

Sure.  I will change it and redo the test.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 3/9] mm/swap: Split swap cache into 64MB trunks
  2017-01-11 23:19     ` Andi Kleen
@ 2017-01-12 16:47       ` Tim Chen
  0 siblings, 0 replies; 33+ messages in thread
From: Tim Chen @ 2017-01-12 16:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Huang, Ying, dave.hansen, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

On Wed, Jan 11, 2017 at 03:19:37PM -0800, Andi Kleen wrote:
> > Switching from a single radix-tree to an array of radix-trees to reduce
> > contention seems a bit hacky.  That we can do this and have everything
> > continue to work tells me that we're simply using an inappropriate data
> > structure to hold this info.
> 
> What would you use instead?

I agree that this approach is a bit hacky.  However, it is pretty
effective and simple.  If we later come up with a better solution to
scale modification of the radix tree, we can collapse the radix trees.

I think developing a radix tree that scales for write modifications will
take quite a while and is a non-trivial effort.  With almost-memory-speed
SSDs coming on the market soon, I think having a workable solution now
and optimizing it for the long term is reasonable.
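
For reference, a tiny user-space illustration of the trunk mapping (a
sketch only, assuming 4KB pages, so each 64MB trunk covers 2^14 swap
offsets; names are illustrative, not taken from the patch):

#include <stdio.h>

#define TRUNK_SHIFT 14	/* 64MB / 4KB pages = 2^14 offsets per trunk */

int main(void)
{
	unsigned long offsets[] = { 0, 16383, 16384, 1000000 };
	int i;

	for (i = 0; i < 4; i++)
		printf("swap offset %lu -> radix tree %lu\n",
		       offsets[i], offsets[i] >> TRUNK_SHIFT);
	/*
	 * Offsets 0 and 16383 share tree 0, 16384 starts tree 1, and
	 * 1000000 lands in tree 61.  Faults on offsets in different
	 * trunks then contend on different radix tree locks.
	 */
	return 0;
}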

Tim


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [Update][PATCH v5 2/9] mm/swap: Add cluster lock
  2017-01-12  1:58           ` Andrew Morton
  2017-01-12  2:51             ` Huang, Ying
@ 2017-01-14  4:37             ` Huang, Ying
  1 sibling, 0 replies; 33+ messages in thread
From: Huang, Ying @ 2017-01-14  4:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, Jonathan Corbet, Tim Chen, dave.hansen, ak,
	aaron.lu, linux-mm, linux-kernel, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger

This patch reduces the lock contention on swap_info_struct->lock by
using a more fine-grained lock in swap_cluster_info for some swap
operations.  swap_info_struct->lock is heavily contended when multiple
processes reclaim pages simultaneously, because there is only one lock
for each swap device, and in common configurations there is only one or
a few swap devices in the system.  The lock protects almost all swap
related operations.

In fact, many swap operations only access one element of the
swap_info_struct->swap_map array, and there is no dependency between
different elements of swap_info_struct->swap_map.  So a fine-grained
lock can be used to allow parallel access to different elements of
swap_info_struct->swap_map.
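
For example (assuming the usual 256-page SWAPFILE_CLUSTER):

	swap offset 1000 -> cluster 1000 / 256 = 3  -> take cluster_info[3].lock
	swap offset 5000 -> cluster 5000 / 256 = 19 -> take cluster_info[19].lock

Operations on entries in different clusters can then run in parallel
instead of serializing on swap_info_struct->lock.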

In this patch, a spinlock is added to swap_cluster_info to protect the
elements of swap_info_struct->swap_map in the swap cluster and the
fields of swap_cluster_info.  This greatly reduces lock contention for
swap_info_struct->swap_map access.

Because of the added spinlock, the size of swap_cluster_info increases
from 4 bytes to 8 bytes on both 64-bit and 32-bit systems.  This uses an
additional 4KB of RAM for every 1GB of swap space.
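
As a back-of-the-envelope check (assuming 4KB pages and 256-page swap
clusters, so each cluster covers 1MB of swap space):

	1GB swap / 1MB per cluster    = 1024 clusters
	1024 clusters * 4 extra bytes = 4KB additional RAM per GB of swap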

Because swap_cluster_info is much smaller than a cache line (8 vs 64
bytes on the x86_64 architecture), there may be false cache line sharing
between the spinlocks in adjacent swap_cluster_info structures.  To
avoid this false sharing in the first round of swap cluster allocation,
the order of the swap clusters in the free clusters list is changed, so
that swap_cluster_info structures sharing the same cache line are placed
as far apart in the list as possible.  After the first round of
allocation, the order of the clusters in the free clusters list is
expected to be random, so the false sharing should not be serious.
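
A minimal user-space sketch of the interleaved ordering (illustration
only, assuming 64-byte cache lines and the 8-byte swap_cluster_info
above, i.e. 8 cluster entries per cache line):

#include <stdio.h>

#define COLS 8	/* cache line size / sizeof(struct swap_cluster_info) */

int main(void)
{
	unsigned long nr_clusters = 32, i, j, idx;

	for (j = 0; j < COLS; j++) {
		for (i = 0; i < (nr_clusters + COLS - 1) / COLS; i++) {
			idx = i * COLS + j;
			if (idx >= nr_clusters)
				continue;
			printf("%lu ", idx);	/* order of free list adds */
		}
	}
	printf("\n");
	/*
	 * Prints: 0 8 16 24 1 9 17 25 2 10 18 26 ...
	 * Consecutive entries in the free list are COLS apart, so clusters
	 * allocated back-to-back in the first round do not share a cache
	 * line.
	 */
	return 0;
}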

Compared with a previous implementation using bit_spin_lock, the
sequential swap-out throughput improved by about 3.2%.  The test was
done on a Xeon E5 v3 system, with a RAM-simulated PMEM (persistent
memory) device as the swap device.  To test sequential swap-out, the
test case created 32 processes which sequentially allocate and write to
anonymous pages until the RAM and part of the swap device are used up.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h |   6 ++
 mm/swapfile.c        | 211 +++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 177 insertions(+), 40 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 09f4be179ff3..48e0ed4dc3c8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -175,6 +175,12 @@ enum {
  * protected by swap_info_struct.lock.
  */
 struct swap_cluster_info {
+	spinlock_t lock;	/*
+				 * Protect swap_cluster_info fields
+				 * and swap_info_struct->swap_map
+				 * elements corresponding to the swap
+				 * cluster
+				 */
 	unsigned int data:24;
 	unsigned int flags:8;
 };
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 19a7c1ddd6c2..f5515c1552fe 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -257,6 +257,52 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
+static inline void __lock_cluster(struct swap_cluster_info *ci)
+{
+	spin_lock(&ci->lock);
+}
+
+static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
+						     unsigned long offset)
+{
+	struct swap_cluster_info *ci;
+
+	ci = si->cluster_info;
+	if (ci) {
+		ci += offset / SWAPFILE_CLUSTER;
+		__lock_cluster(ci);
+	}
+	return ci;
+}
+
+static inline void unlock_cluster(struct swap_cluster_info *ci)
+{
+	if (ci)
+		spin_unlock(&ci->lock);
+}
+
+static inline struct swap_cluster_info *lock_cluster_or_swap_info(
+	struct swap_info_struct *si,
+	unsigned long offset)
+{
+	struct swap_cluster_info *ci;
+
+	ci = lock_cluster(si, offset);
+	if (!ci)
+		spin_lock(&si->lock);
+
+	return ci;
+}
+
+static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
+					       struct swap_cluster_info *ci)
+{
+	if (ci)
+		unlock_cluster(ci);
+	else
+		spin_unlock(&si->lock);
+}
+
 static inline bool cluster_list_empty(struct swap_cluster_list *list)
 {
 	return cluster_is_null(&list->head);
@@ -281,9 +327,17 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
 		cluster_set_next_flag(&list->head, idx, 0);
 		cluster_set_next_flag(&list->tail, idx, 0);
 	} else {
+		struct swap_cluster_info *ci_tail;
 		unsigned int tail = cluster_next(&list->tail);
 
-		cluster_set_next(&ci[tail], idx);
+		/*
+		 * Nested cluster lock, but both cluster locks are
+		 * only acquired when we hold swap_info_struct->lock
+		 */
+		ci_tail = ci + tail;
+		__lock_cluster(ci_tail);
+		cluster_set_next(ci_tail, idx);
+		unlock_cluster(ci_tail);
 		cluster_set_next_flag(&list->tail, idx, 0);
 	}
 }
@@ -328,7 +382,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 */
 static void swap_do_scheduled_discard(struct swap_info_struct *si)
 {
-	struct swap_cluster_info *info;
+	struct swap_cluster_info *info, *ci;
 	unsigned int idx;
 
 	info = si->cluster_info;
@@ -341,10 +395,14 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
+		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
+		cluster_set_flag(ci, CLUSTER_FLAG_FREE);
+		unlock_cluster(ci);
 		cluster_list_add_tail(&si->free_clusters, info, idx);
+		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
+		unlock_cluster(ci);
 	}
 }
 
@@ -447,8 +505,9 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned long *offset, unsigned long *scan_base)
 {
 	struct percpu_cluster *cluster;
+	struct swap_cluster_info *ci;
 	bool found_free;
-	unsigned long tmp;
+	unsigned long tmp, max;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
@@ -476,14 +535,21 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	 * check if there is still free entry in the cluster
 	 */
 	tmp = cluster->next;
-	while (tmp < si->max && tmp < (cluster_next(&cluster->index) + 1) *
-	       SWAPFILE_CLUSTER) {
+	max = min_t(unsigned long, si->max,
+		    (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+	if (tmp >= max) {
+		cluster_set_null(&cluster->index);
+		goto new_cluster;
+	}
+	ci = lock_cluster(si, tmp);
+	while (tmp < max) {
 		if (!si->swap_map[tmp]) {
 			found_free = true;
 			break;
 		}
 		tmp++;
 	}
+	unlock_cluster(ci);
 	if (!found_free) {
 		cluster_set_null(&cluster->index);
 		goto new_cluster;
@@ -496,6 +562,7 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 static unsigned long scan_swap_map(struct swap_info_struct *si,
 				   unsigned char usage)
 {
+	struct swap_cluster_info *ci;
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
@@ -572,9 +639,11 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 	if (offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
 
+	ci = lock_cluster(si, offset);
 	/* reuse swap entry of cache-only swap if not busy. */
 	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
 		int swap_was_freed;
+		unlock_cluster(ci);
 		spin_unlock(&si->lock);
 		swap_was_freed = __try_to_reclaim_swap(si, offset);
 		spin_lock(&si->lock);
@@ -584,8 +653,10 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 		goto scan; /* check next one */
 	}
 
-	if (si->swap_map[offset])
+	if (si->swap_map[offset]) {
+		unlock_cluster(ci);
 		goto scan;
+	}
 
 	if (offset == si->lowest_bit)
 		si->lowest_bit++;
@@ -601,6 +672,7 @@ static unsigned long scan_swap_map(struct swap_info_struct *si,
 	}
 	si->swap_map[offset] = usage;
 	inc_cluster_info_page(si, si->cluster_info, offset);
+	unlock_cluster(ci);
 	si->cluster_next = offset + 1;
 	si->flags -= SWP_SCANNING;
 
@@ -731,7 +803,7 @@ swp_entry_t get_swap_page_of_type(int type)
 	return (swp_entry_t) {0};
 }
 
-static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 	unsigned long offset, type;
@@ -749,7 +821,6 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 		goto bad_offset;
 	if (!p->swap_map[offset])
 		goto bad_free;
-	spin_lock(&p->lock);
 	return p;
 
 bad_free:
@@ -767,14 +838,45 @@ static struct swap_info_struct *swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
+static struct swap_info_struct *swap_info_get(swp_entry_t entry)
+{
+	struct swap_info_struct *p;
+
+	p = _swap_info_get(entry);
+	if (p)
+		spin_lock(&p->lock);
+	return p;
+}
+
 static unsigned char swap_entry_free(struct swap_info_struct *p,
-				     swp_entry_t entry, unsigned char usage)
+				     swp_entry_t entry, unsigned char usage,
+				     bool swap_info_locked)
 {
+	struct swap_cluster_info *ci;
 	unsigned long offset = swp_offset(entry);
 	unsigned char count;
 	unsigned char has_cache;
+	bool lock_swap_info = false;
+
+	if (!swap_info_locked) {
+		count = p->swap_map[offset];
+		if (!p->cluster_info || count == usage || count == SWAP_MAP_SHMEM) {
+lock_swap_info:
+			swap_info_locked = true;
+			lock_swap_info = true;
+			spin_lock(&p->lock);
+		}
+	}
+
+	ci = lock_cluster(p, offset);
 
 	count = p->swap_map[offset];
+
+	if (!swap_info_locked && (count == usage || count == SWAP_MAP_SHMEM)) {
+		unlock_cluster(ci);
+		goto lock_swap_info;
+	}
+
 	has_cache = count & SWAP_HAS_CACHE;
 	count &= ~SWAP_HAS_CACHE;
 
@@ -800,10 +902,15 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 	usage = count | has_cache;
 	p->swap_map[offset] = usage;
 
+	unlock_cluster(ci);
+
 	/* free if no reference */
 	if (!usage) {
+		VM_BUG_ON(!swap_info_locked);
 		mem_cgroup_uncharge_swap(entry);
+		ci = lock_cluster(p, offset);
 		dec_cluster_info_page(p, p->cluster_info, offset);
+		unlock_cluster(ci);
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
 		if (offset > p->highest_bit) {
@@ -829,6 +936,9 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 		}
 	}
 
+	if (lock_swap_info)
+		spin_unlock(&p->lock);
+
 	return usage;
 }
 
@@ -840,11 +950,9 @@ void swap_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 
-	p = swap_info_get(entry);
-	if (p) {
-		swap_entry_free(p, entry, 1);
-		spin_unlock(&p->lock);
-	}
+	p = _swap_info_get(entry);
+	if (p)
+		swap_entry_free(p, entry, 1, false);
 }
 
 /*
@@ -854,11 +962,9 @@ void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
 
-	p = swap_info_get(entry);
-	if (p) {
-		swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		spin_unlock(&p->lock);
-	}
+	p = _swap_info_get(entry);
+	if (p)
+		swap_entry_free(p, entry, SWAP_HAS_CACHE, false);
 }
 
 /*
@@ -870,13 +976,17 @@ int page_swapcount(struct page *page)
 {
 	int count = 0;
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	swp_entry_t entry;
+	unsigned long offset;
 
 	entry.val = page_private(page);
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (p) {
-		count = swap_count(p->swap_map[swp_offset(entry)]);
-		spin_unlock(&p->lock);
+		offset = swp_offset(entry);
+		ci = lock_cluster_or_swap_info(p, offset);
+		count = swap_count(p->swap_map[offset]);
+		unlock_cluster_or_swap_info(p, ci);
 	}
 	return count;
 }
@@ -889,22 +999,26 @@ int swp_swapcount(swp_entry_t entry)
 {
 	int count, tmp_count, n;
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	struct page *page;
 	pgoff_t offset;
 	unsigned char *map;
 
-	p = swap_info_get(entry);
+	p = _swap_info_get(entry);
 	if (!p)
 		return 0;
 
-	count = swap_count(p->swap_map[swp_offset(entry)]);
+	offset = swp_offset(entry);
+
+	ci = lock_cluster_or_swap_info(p, offset);
+
+	count = swap_count(p->swap_map[offset]);
 	if (!(count & COUNT_CONTINUED))
 		goto out;
 
 	count &= ~COUNT_CONTINUED;
 	n = SWAP_MAP_MAX + 1;
 
-	offset = swp_offset(entry);
 	page = vmalloc_to_page(p->swap_map + offset);
 	offset &= ~PAGE_MASK;
 	VM_BUG_ON(page_private(page) != SWP_CONTINUED);
@@ -919,7 +1033,7 @@ int swp_swapcount(swp_entry_t entry)
 		n *= (SWAP_CONT_MAX + 1);
 	} while (tmp_count & COUNT_CONTINUED);
 out:
-	spin_unlock(&p->lock);
+	unlock_cluster_or_swap_info(p, ci);
 	return count;
 }
 
@@ -1003,7 +1117,7 @@ int free_swap_and_cache(swp_entry_t entry)
 
 	p = swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
+		if (swap_entry_free(p, entry, 1, true) == SWAP_HAS_CACHE) {
 			page = find_get_page(swap_address_space(entry),
 					     swp_offset(entry));
 			if (page && !trylock_page(page)) {
@@ -2284,6 +2398,9 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 	return maxpages;
 }
 
+#define SWAP_CLUSTER_COLS						\
+	DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
+
 static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					union swap_header *swap_header,
 					unsigned char *swap_map,
@@ -2291,11 +2408,12 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 					unsigned long maxpages,
 					sector_t *span)
 {
-	int i;
+	unsigned int j, k;
 	unsigned int nr_good_pages;
 	int nr_extents;
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	unsigned long idx = p->cluster_next / SWAPFILE_CLUSTER;
+	unsigned long col = p->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
+	unsigned long i, idx;
 
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
@@ -2343,15 +2461,20 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	if (!cluster_info)
 		return nr_extents;
 
-	for (i = 0; i < nr_clusters; i++) {
-		if (!cluster_count(&cluster_info[idx])) {
+
+	/* Reduce false cache line sharing between cluster_info */
+	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
+		j = (k + col) % SWAP_CLUSTER_COLS;
+		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
+			idx = i * SWAP_CLUSTER_COLS + j;
+			if (idx >= nr_clusters)
+				continue;
+			if (cluster_count(&cluster_info[idx]))
+				continue;
 			cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
 			cluster_list_add_tail(&p->free_clusters, cluster_info,
 					      idx);
 		}
-		idx++;
-		if (idx == nr_clusters)
-			idx = 0;
 	}
 	return nr_extents;
 }
@@ -2609,6 +2732,7 @@ void si_swapinfo(struct sysinfo *val)
 static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 {
 	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
 	unsigned long offset, type;
 	unsigned char count;
 	unsigned char has_cache;
@@ -2622,10 +2746,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 		goto bad_file;
 	p = swap_info[type];
 	offset = swp_offset(entry);
-
-	spin_lock(&p->lock);
 	if (unlikely(offset >= p->max))
-		goto unlock_out;
+		goto out;
+
+	ci = lock_cluster_or_swap_info(p, offset);
 
 	count = p->swap_map[offset];
 
@@ -2668,7 +2792,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	p->swap_map[offset] = count | has_cache;
 
 unlock_out:
-	spin_unlock(&p->lock);
+	unlock_cluster_or_swap_info(p, ci);
 out:
 	return err;
 
@@ -2757,6 +2881,7 @@ EXPORT_SYMBOL_GPL(__page_file_index);
 int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 {
 	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
 	struct page *head;
 	struct page *page;
 	struct page *list_page;
@@ -2780,6 +2905,9 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	offset = swp_offset(entry);
+
+	ci = lock_cluster(si, offset);
+
 	count = si->swap_map[offset] & ~SWAP_HAS_CACHE;
 
 	if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
@@ -2792,6 +2920,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	if (!page) {
+		unlock_cluster(ci);
 		spin_unlock(&si->lock);
 		return -ENOMEM;
 	}
@@ -2840,6 +2969,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	list_add_tail(&page->lru, &head->lru);
 	page = NULL;			/* now it's attached, don't free it */
 out:
+	unlock_cluster(ci);
 	spin_unlock(&si->lock);
 outer:
 	if (page)
@@ -2853,7 +2983,8 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
  * into, carry if so, or else fail until a new continuation page is allocated;
  * when the original swap_map count is decremented from 0 with continuation,
  * borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or swap_entry_free() holds swap_lock.
+ * Called while __swap_duplicate() or swap_entry_free() holds swap or cluster
+ * lock.
  */
 static bool swap_count_continued(struct swap_info_struct *si,
 				 pgoff_t offset, unsigned char count)
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 0/9] mm/swap: Regular page swap optimizations
  2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
                   ` (8 preceding siblings ...)
  2017-01-11 17:55 ` [PATCH v5 9/9] mm/swap: Skip readahead only when swap slot cache is enabled Tim Chen
@ 2017-01-16 12:02 ` Michal Hocko
  2017-01-17  1:06   ` Huang, Ying
  9 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2017-01-16 12:02 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Ying Huang, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

Hi,
I am seeing a lot of preempt unsafe warnings with the current mmotm and
I assume that this patchset has introduced the issue. I haven't checked
more closely but get_swap_page didn't use this_cpu_ptr before "mm/swap:
add cache for swap slots allocation"

[   57.812314] BUG: using smp_processor_id() in preemptible [00000000] code: kswapd0/527
[   57.814360] caller is debug_smp_processor_id+0x17/0x19
[   57.815237] CPU: 1 PID: 527 Comm: kswapd0 Tainted: G        W 4.9.0-mmotm-00135-g4e9a9895ebef #1042
[   57.816019] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[   57.816019]  ffffc900001939c0 ffffffff81329c60 0000000000000001 ffffffff81a0ce06
[   57.816019]  ffffc900001939f0 ffffffff81343c2a 00000000000137a0 ffffea0000dfd2a0
[   57.816019]  ffff88003c49a700 ffffc90000193b10 ffffc90000193a00 ffffffff81343c53
[   57.816019] Call Trace:
[   57.816019]  [<ffffffff81329c60>] dump_stack+0x68/0x92
[   57.816019]  [<ffffffff81343c2a>] check_preemption_disabled+0xce/0xe0
[   57.816019]  [<ffffffff81343c53>] debug_smp_processor_id+0x17/0x19
[   57.816019]  [<ffffffff8115f06f>] get_swap_page+0x19/0x183
[   57.816019]  [<ffffffff8114e01d>] shmem_writepage+0xce/0x38c
[   57.816019]  [<ffffffff81148916>] shrink_page_list+0x81f/0xdbf
[   57.816019]  [<ffffffff81149652>] shrink_inactive_list+0x2ab/0x594
[   57.816019]  [<ffffffff8114a22f>] shrink_node_memcg+0x4c7/0x673
[   57.816019]  [<ffffffff8114a49f>] shrink_node+0xc4/0x282
[   57.816019]  [<ffffffff8114a49f>] ? shrink_node+0xc4/0x282
[   57.816019]  [<ffffffff8114b8cb>] kswapd+0x656/0x834
[   57.816019]  [<ffffffff8114b275>] ? mem_cgroup_shrink_node+0x2e1/0x2e1
[   57.816019]  [<ffffffff81069fb4>] ? call_usermodehelper_exec_async+0x124/0x12d
[   57.816019]  [<ffffffff81073621>] kthread+0xf9/0x101
[   57.816019]  [<ffffffff81660198>] ? _raw_spin_unlock_irq+0x2c/0x4a
[   57.816019]  [<ffffffff81073528>] ? kthread_park+0x5a/0x5a
[   57.816019]  [<ffffffff81069e90>] ? umh_complete+0x25/0x25
[   57.816019]  [<ffffffff81660b07>] ret_from_fork+0x27/0x40

I thought a simple 
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 8cf941e09941..732194de58a4 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -303,7 +303,7 @@ swp_entry_t get_swap_page(void)
 	swp_entry_t entry, *pentry;
 	struct swap_slots_cache *cache;
 
-	cache = this_cpu_ptr(&swp_slots);
+	cache = &get_cpu_var(swp_slots);
 
 	entry.val = 0;
 	if (check_cache_active()) {
@@ -322,11 +322,13 @@ swp_entry_t get_swap_page(void)
 		}
 		mutex_unlock(&cache->alloc_lock);
 		if (entry.val)
-			return entry;
+			goto out;
 	}
 
 	get_swap_pages(1, &entry);
 
+out:
+	put_cpu_var(swp_slots);
 	return entry;
 }
 

would be a way to go but the function takes a sleeping lock so disabling
the preemption is not a way forward. So this is either preempt safe
for some reason - which should be IMHO documented in a comment - and
raw_cpu_ptr can be used or this needs a deeper thought.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 0/9] mm/swap: Regular page swap optimizations
  2017-01-16 12:02 ` [PATCH v5 0/9] mm/swap: Regular page swap optimizations Michal Hocko
@ 2017-01-17  1:06   ` Huang, Ying
  2017-01-17  7:49     ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Huang, Ying @ 2017-01-17  1:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, Andrew Morton, Ying Huang, dave.hansen, ak, aaron.lu,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

Michal Hocko <mhocko@kernel.org> writes:

> Hi,
> I am seeing a lot of preempt unsafe warnings with the current mmotm and
> I assume that this patchset has introduced the issue. I haven't checked
> more closely but get_swap_page didn't use this_cpu_ptr before "mm/swap:
> add cache for swap slots allocation"
>
> [   57.812314] BUG: using smp_processor_id() in preemptible [00000000] code: kswapd0/527
> [   57.814360] caller is debug_smp_processor_id+0x17/0x19
> [   57.815237] CPU: 1 PID: 527 Comm: kswapd0 Tainted: G        W 4.9.0-mmotm-00135-g4e9a9895ebef #1042
> [   57.816019] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
> [   57.816019]  ffffc900001939c0 ffffffff81329c60 0000000000000001 ffffffff81a0ce06
> [   57.816019]  ffffc900001939f0 ffffffff81343c2a 00000000000137a0 ffffea0000dfd2a0
> [   57.816019]  ffff88003c49a700 ffffc90000193b10 ffffc90000193a00 ffffffff81343c53
> [   57.816019] Call Trace:
> [   57.816019]  [<ffffffff81329c60>] dump_stack+0x68/0x92
> [   57.816019]  [<ffffffff81343c2a>] check_preemption_disabled+0xce/0xe0
> [   57.816019]  [<ffffffff81343c53>] debug_smp_processor_id+0x17/0x19
> [   57.816019]  [<ffffffff8115f06f>] get_swap_page+0x19/0x183
> [   57.816019]  [<ffffffff8114e01d>] shmem_writepage+0xce/0x38c
> [   57.816019]  [<ffffffff81148916>] shrink_page_list+0x81f/0xdbf
> [   57.816019]  [<ffffffff81149652>] shrink_inactive_list+0x2ab/0x594
> [   57.816019]  [<ffffffff8114a22f>] shrink_node_memcg+0x4c7/0x673
> [   57.816019]  [<ffffffff8114a49f>] shrink_node+0xc4/0x282
> [   57.816019]  [<ffffffff8114a49f>] ? shrink_node+0xc4/0x282
> [   57.816019]  [<ffffffff8114b8cb>] kswapd+0x656/0x834
> [   57.816019]  [<ffffffff8114b275>] ? mem_cgroup_shrink_node+0x2e1/0x2e1
> [   57.816019]  [<ffffffff81069fb4>] ? call_usermodehelper_exec_async+0x124/0x12d
> [   57.816019]  [<ffffffff81073621>] kthread+0xf9/0x101
> [   57.816019]  [<ffffffff81660198>] ? _raw_spin_unlock_irq+0x2c/0x4a
> [   57.816019]  [<ffffffff81073528>] ? kthread_park+0x5a/0x5a
> [   57.816019]  [<ffffffff81069e90>] ? umh_complete+0x25/0x25
> [   57.816019]  [<ffffffff81660b07>] ret_from_fork+0x27/0x40

Sorry for bothering, we should have tested this before.

> I thought a simple 
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 8cf941e09941..732194de58a4 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -303,7 +303,7 @@ swp_entry_t get_swap_page(void)
>  	swp_entry_t entry, *pentry;
>  	struct swap_slots_cache *cache;
>  
> -	cache = this_cpu_ptr(&swp_slots);
> +	cache = &get_cpu_var(swp_slots);
>  
>  	entry.val = 0;
>  	if (check_cache_active()) {
> @@ -322,11 +322,13 @@ swp_entry_t get_swap_page(void)
>  		}
>  		mutex_unlock(&cache->alloc_lock);
>  		if (entry.val)
> -			return entry;
> +			goto out;
>  	}
>  
>  	get_swap_pages(1, &entry);
>  
> +out:
> +	put_cpu_var(swp_slots);
>  	return entry;
>  }
>  
>
> would be a way to go but the function takes a sleeping lock so disabling
> the preemption is not a way forward. So this is either preempt safe
> for some reason - which should be IMHO documented in a comment - and
> raw_cpu_ptr can be used or this needs a deeper thought.

Thanks for pointing out this.

We think this is preempt safe.  During the development, we have
considered the possible preemption between getting the per-CPU pointer
and its usage, and implemented the code to make it work in that
situation.  We will change the code to use raw_cpu_ptr() and add a
comment for it.
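
For reference, the rough semantics of the three per-CPU accessors
involved (a sketch, not kernel source):

/*
 *   this_cpu_ptr(&swp_slots)   - pointer to this CPU's cache; with
 *                                CONFIG_DEBUG_PREEMPT it warns when
 *                                used with preemption enabled, which
 *                                is the trace above
 *   &get_cpu_var(swp_slots)    - same pointer, but disables preemption
 *                                until put_cpu_var(); not usable here
 *                                since we may sleep on a mutex
 *   raw_cpu_ptr(&swp_slots)    - same pointer, no debug check; the task
 *                                may be preempted and migrated
 *                                afterwards, which is fine as long as
 *                                the cache itself is protected by a lock
 */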

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [Update][PATCH v5 7/9] mm/swap: Add cache for swap slots allocation
  2017-01-11 17:55 ` [PATCH v5 7/9] mm/swap: Add cache for swap slots allocation Tim Chen
@ 2017-01-17  2:55   ` Huang, Ying
  2017-01-17 10:16     ` Michal Hocko
  2017-01-17 21:42     ` Tim Chen
  0 siblings, 2 replies; 33+ messages in thread
From: Huang, Ying @ 2017-01-17  2:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ying Huang, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim C Chen

Hi, Andrew,

This update patch is to fix the preemption warning raised by Michal
Hocko.  raw_cpu_ptr() is used to replace this_cpu_ptr() and comments are
added for why it is used.

Best Regards,
Huang, Ying

---------------------------------------------->
From: Tim Chen <tim.c.chen@linux.intel.com>

We add per cpu caches for swap slots that can be allocated and freed
quickly without the need to touch the swap info lock.

Two separate caches are maintained, one for swap slots allocated and one
for swap slots returned.  This allows the swap slots to be returned to
the global pool in a batch so they will have a chance to be coalesced
with other slots in a cluster.  We do not reuse the returned slots right
away, as that may increase fragmentation of the slots.

The swap allocation cache is protected by a mutex, as we may sleep when
searching for empty slots in the cache.  The swap free cache is protected
by a spin lock, as we cannot sleep in the free path.

We refill the swap slots cache when we run out of slots, and we disable
the swap slots cache and drain the slots if the global number of slots
falls below a low watermark threshold.  We re-enable the cache again
when the number of available slots rises above a high watermark.
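
To give the thresholds a concrete feel (not part of the patch; assuming
a cache size of SWAP_BATCH = 64 slots per CPU, as used in this series),
on a machine with 32 online CPUs:

	deactivate below:  2 * 64 * 32 = 4096 free slots  (~16MB at 4KB/page)
	reactivate above:  5 * 64 * 32 = 10240 free slots (~40MB at 4KB/page)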

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h       |   4 +
 include/linux/swap_slots.h |  28 ++++
 mm/Makefile                |   2 +-
 mm/swap_slots.c            | 339 +++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c            |   1 +
 mm/swapfile.c              |  26 ++--
 6 files changed, 388 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/swap_slots.h
 create mode 100644 mm/swap_slots.c

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b021478a772d..d101d0e780da 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -371,6 +371,7 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
+extern bool has_usable_swap(void);
 
 /* Swap 50% full? Release swapcache more aggressively.. */
 static inline bool vm_swap_full(void)
@@ -409,6 +410,9 @@ struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
 
+extern int get_swap_slots(int n, swp_entry_t *slots);
+extern void swapcache_free_batch(swp_entry_t *entries, int n);
+
 #else /* CONFIG_SWAP */
 
 #define swap_address_space(entry)		(NULL)
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
new file mode 100644
index 000000000000..a59e6e2f2c47
--- /dev/null
+++ b/include/linux/swap_slots.h
@@ -0,0 +1,28 @@
+#ifndef _LINUX_SWAP_SLOTS_H
+#define _LINUX_SWAP_SLOTS_H
+
+#include <linux/swap.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+
+#define SWAP_SLOTS_CACHE_SIZE			SWAP_BATCH
+#define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE	(5*SWAP_SLOTS_CACHE_SIZE)
+#define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE	(2*SWAP_SLOTS_CACHE_SIZE)
+
+struct swap_slots_cache {
+	bool		lock_initialized;
+	struct mutex	alloc_lock;
+	swp_entry_t	*slots;
+	int		nr;
+	int		cur;
+	spinlock_t	free_lock;
+	swp_entry_t	*slots_ret;
+	int		n_ret;
+};
+
+void disable_swap_slots_cache_lock(void);
+void reenable_swap_slots_cache_unlock(void);
+int enable_swap_slots_cache(void);
+int free_swap_slot(swp_entry_t entry);
+
+#endif /* _LINUX_SWAP_SLOTS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a9f76b..433eaf9a876e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -35,7 +35,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o vmacache.o \
+			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
new file mode 100644
index 000000000000..1b7a52e87aac
--- /dev/null
+++ b/mm/swap_slots.c
@@ -0,0 +1,339 @@
+/*
+ * Manage cache of swap slots to be used for and returned from
+ * swap.
+ *
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * Author: Tim Chen <tim.c.chen@linux.intel.com>
+ *
+ * We allocate the swap slots from the global pool and put
+ * it into local per cpu caches.  This has the advantage
+ * of not needing to acquire the swap_info lock every time
+ * we need a new slot.
+ *
+ * There is also opportunity to simply return the slot
+ * to local caches without needing to acquire swap_info
+ * lock.  We do not reuse the returned slots directly but
+ * move them back to the global pool in a batch.  This
+ * allows the slots to coalesce and reduce fragmentation.
+ *
+ * The swap entry allocated is marked with SWAP_HAS_CACHE
+ * flag in map_count that prevents it from being allocated
+ * again from the global pool.
+ *
+ * The swap slots cache is protected by a mutex instead of
+ * a spin lock as when we search for slots with scan_swap_map,
+ * we can possibly sleep.
+ */
+
+#include <linux/swap_slots.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/vmalloc.h>
+#include <linux/mutex.h>
+
+#ifdef CONFIG_SWAP
+
+static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);
+static bool	swap_slot_cache_active;
+static bool	swap_slot_cache_enabled;
+static bool	swap_slot_cache_initialized;
+DEFINE_MUTEX(swap_slots_cache_mutex);
+/* Serialize swap slots cache enable/disable operations */
+DEFINE_MUTEX(swap_slots_cache_enable_mutex);
+
+static void __drain_swap_slots_cache(unsigned int type);
+static void deactivate_swap_slots_cache(void);
+static void reactivate_swap_slots_cache(void);
+
+#define use_swap_slot_cache (swap_slot_cache_active && \
+		swap_slot_cache_enabled && swap_slot_cache_initialized)
+#define SLOTS_CACHE 0x1
+#define SLOTS_CACHE_RET 0x2
+
+static void deactivate_swap_slots_cache(void)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	swap_slot_cache_active = false;
+	__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+static void reactivate_swap_slots_cache(void)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	swap_slot_cache_active = true;
+	mutex_unlock(&swap_slots_cache_mutex);
+}
+
+/* Must not be called with cpu hot plug lock */
+void disable_swap_slots_cache_lock(void)
+{
+	mutex_lock(&swap_slots_cache_enable_mutex);
+	swap_slot_cache_enabled = false;
+	if (swap_slot_cache_initialized) {
+		/* serialize with cpu hotplug operations */
+		get_online_cpus();
+		__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+		put_online_cpus();
+	}
+}
+
+static void __reenable_swap_slots_cache(void)
+{
+	swap_slot_cache_enabled = has_usable_swap();
+}
+
+void reenable_swap_slots_cache_unlock(void)
+{
+	__reenable_swap_slots_cache();
+	mutex_unlock(&swap_slots_cache_enable_mutex);
+}
+
+static bool check_cache_active(void)
+{
+	long pages;
+
+	if (!swap_slot_cache_enabled || !swap_slot_cache_initialized)
+		return false;
+
+	pages = get_nr_swap_pages();
+	if (!swap_slot_cache_active) {
+		if (pages > num_online_cpus() *
+		    THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE)
+			reactivate_swap_slots_cache();
+		goto out;
+	}
+
+	/* if global pool of slot caches too low, deactivate cache */
+	if (pages < num_online_cpus() * THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE)
+		deactivate_swap_slots_cache();
+out:
+	return swap_slot_cache_active;
+}
+
+static int alloc_swap_slot_cache(unsigned int cpu)
+{
+	struct swap_slots_cache *cache;
+	swp_entry_t *slots, *slots_ret;
+
+	/*
+	 * Do allocation outside swap_slots_cache_mutex
+	 * as vzalloc could trigger reclaim and get_swap_page,
+	 * which can lock swap_slots_cache_mutex.
+	 */
+	slots = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+	if (!slots)
+		return -ENOMEM;
+
+	slots_ret = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+	if (!slots_ret) {
+		vfree(slots);
+		return -ENOMEM;
+	}
+
+	mutex_lock(&swap_slots_cache_mutex);
+	cache = &per_cpu(swp_slots, cpu);
+	if (cache->slots || cache->slots_ret)
+		/* cache already allocated */
+		goto out;
+	if (!cache->lock_initialized) {
+		mutex_init(&cache->alloc_lock);
+		spin_lock_init(&cache->free_lock);
+		cache->lock_initialized = true;
+	}
+	cache->nr = 0;
+	cache->cur = 0;
+	cache->n_ret = 0;
+	cache->slots = slots;
+	slots = NULL;
+	cache->slots_ret = slots_ret;
+	slots_ret = NULL;
+out:
+	mutex_unlock(&swap_slots_cache_mutex);
+	if (slots)
+		vfree(slots);
+	if (slots_ret)
+		vfree(slots_ret);
+	return 0;
+}
+
+static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
+				  bool free_slots)
+{
+	struct swap_slots_cache *cache;
+	swp_entry_t *slots = NULL;
+
+	cache = &per_cpu(swp_slots, cpu);
+	if ((type & SLOTS_CACHE) && cache->slots) {
+		mutex_lock(&cache->alloc_lock);
+		swapcache_free_entries(cache->slots + cache->cur, cache->nr);
+		cache->cur = 0;
+		cache->nr = 0;
+		if (free_slots && cache->slots) {
+			vfree(cache->slots);
+			cache->slots = NULL;
+		}
+		mutex_unlock(&cache->alloc_lock);
+	}
+	if ((type & SLOTS_CACHE_RET) && cache->slots_ret) {
+		spin_lock_irq(&cache->free_lock);
+		swapcache_free_entries(cache->slots_ret, cache->n_ret);
+		cache->n_ret = 0;
+		if (free_slots && cache->slots_ret) {
+			slots = cache->slots_ret;
+			cache->slots_ret = NULL;
+		}
+		spin_unlock_irq(&cache->free_lock);
+		if (slots)
+			vfree(slots);
+	}
+}
+
+static void __drain_swap_slots_cache(unsigned int type)
+{
+	unsigned int cpu;
+
+	/*
+	 * This function is called during
+	 *	1) swapoff, when we have to make sure no
+	 *	   left over slots are in cache when we remove
+	 *	   a swap device;
+	 *      2) disabling of swap slot cache, when we run low
+	 *	   on swap slots when allocating memory and need
+	 *	   to return swap slots to global pool.
+	 *
+	 * We cannot acquire cpu hot plug lock here as
+	 * this function can be invoked in the cpu
+	 * hot plug path:
+	 * cpu_up -> lock cpu_hotplug -> cpu hotplug state callback
+	 *   -> memory allocation -> direct reclaim -> get_swap_page
+	 *   -> drain_swap_slots_cache
+	 *
+	 * Hence the loop over current online cpu below could miss cpu that
+	 * is being brought online but not yet marked as online.
+	 * That is okay as we do not schedule and run anything on a
+	 * cpu before it has been marked online. Hence, we will not
+	 * fill any swap slots in slots cache of such cpu.
+	 * There are no slots on such cpu that need to be drained.
+	 */
+	for_each_online_cpu(cpu)
+		drain_slots_cache_cpu(cpu, type, false);
+}
+
+static int free_slot_cache(unsigned int cpu)
+{
+	mutex_lock(&swap_slots_cache_mutex);
+	drain_slots_cache_cpu(cpu, SLOTS_CACHE | SLOTS_CACHE_RET, true);
+	mutex_unlock(&swap_slots_cache_mutex);
+	return 0;
+}
+
+int enable_swap_slots_cache(void)
+{
+	int ret = 0;
+
+	mutex_lock(&swap_slots_cache_enable_mutex);
+	if (swap_slot_cache_initialized) {
+		__reenable_swap_slots_cache();
+		goto out_unlock;
+	}
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "swap_slots_cache",
+				alloc_swap_slot_cache, free_slot_cache);
+	if (ret < 0)
+		goto out_unlock;
+	swap_slot_cache_initialized = true;
+	__reenable_swap_slots_cache();
+out_unlock:
+	mutex_unlock(&swap_slots_cache_enable_mutex);
+	return 0;
+}
+
+/* called with swap slot cache's alloc lock held */
+static int refill_swap_slots_cache(struct swap_slots_cache *cache)
+{
+	if (!use_swap_slot_cache || cache->nr)
+		return 0;
+
+	cache->cur = 0;
+	if (swap_slot_cache_active)
+		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots);
+
+	return cache->nr;
+}
+
+int free_swap_slot(swp_entry_t entry)
+{
+	struct swap_slots_cache *cache;
+
+	BUG_ON(!swap_slot_cache_initialized);
+
+	cache = &get_cpu_var(swp_slots);
+	if (use_swap_slot_cache && cache->slots_ret) {
+		spin_lock_irq(&cache->free_lock);
+		/* Swap slots cache may be deactivated before acquiring lock */
+		if (!use_swap_slot_cache) {
+			spin_unlock_irq(&cache->free_lock);
+			goto direct_free;
+		}
+		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
+			/*
+			 * Return slots to global pool.
+			 * The current swap_map value is SWAP_HAS_CACHE.
+			 * Set it to 0 to indicate it is available for
+			 * allocation in global pool
+			 */
+			swapcache_free_entries(cache->slots_ret, cache->n_ret);
+			cache->n_ret = 0;
+		}
+		cache->slots_ret[cache->n_ret++] = entry;
+		spin_unlock_irq(&cache->free_lock);
+	} else {
+direct_free:
+		swapcache_free_entries(&entry, 1);
+	}
+	put_cpu_var(swp_slots);
+
+	return 0;
+}
+
+swp_entry_t get_swap_page(void)
+{
+	swp_entry_t entry, *pentry;
+	struct swap_slots_cache *cache;
+
+	/*
+	 * Preemption needs to be turned on here, because we may sleep
+	 * in refill_swap_slots_cache().  But it is safe, because
+	 * accesses to the per-CPU data structure are protected by a
+	 * mutex.
+	 */
+	cache = raw_cpu_ptr(&swp_slots);
+
+	entry.val = 0;
+	if (check_cache_active()) {
+		mutex_lock(&cache->alloc_lock);
+		if (cache->slots) {
+repeat:
+			if (cache->nr) {
+				pentry = &cache->slots[cache->cur++];
+				entry = *pentry;
+				pentry->val = 0;
+				cache->nr--;
+			} else {
+				if (refill_swap_slots_cache(cache))
+					goto repeat;
+			}
+		}
+		mutex_unlock(&cache->alloc_lock);
+		if (entry.val)
+			return entry;
+	}
+
+	get_swap_pages(1, &entry);
+
+	return entry;
+}
+
+#endif /* CONFIG_SWAP */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3d76d80c07d6..e1f07cafecaa 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/vmalloc.h>
+#include <linux/swap_slots.h>
 
 #include <asm/pgtable.h>
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 72dfec7b8035..d4d7baf3c21a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -34,6 +34,7 @@
 #include <linux/frontswap.h>
 #include <linux/swapfile.h>
 #include <linux/export.h>
+#include <linux/swap_slots.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -859,14 +860,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[])
 	return n_ret;
 }
 
-swp_entry_t get_swap_page(void)
-{
-	swp_entry_t entry;
-
-	get_swap_pages(1, &entry);
-	return entry;
-}
-
 /* The only caller of this function is now suspend routine */
 swp_entry_t get_swap_page_of_type(int type)
 {
@@ -1057,7 +1050,7 @@ void swap_free(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		if (!__swap_entry_free(p, entry, 1))
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot(entry);
 	}
 }
 
@@ -1071,7 +1064,7 @@ void swapcache_free(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		if (!__swap_entry_free(p, entry, SWAP_HAS_CACHE))
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot(entry);
 	}
 }
 
@@ -1279,7 +1272,7 @@ int free_swap_and_cache(swp_entry_t entry)
 				page = NULL;
 			}
 		} else if (!count)
-			swapcache_free_entries(&entry, 1);
+			free_swap_slot(entry);
 	}
 	if (page) {
 		/*
@@ -2107,6 +2100,17 @@ static void reinsert_swap_info(struct swap_info_struct *p)
 	spin_unlock(&swap_lock);
 }
 
+bool has_usable_swap(void)
+{
+	bool ret = true;
+
+	spin_lock(&swap_lock);
+	if (plist_head_empty(&swap_active_head))
+		ret = false;
+	spin_unlock(&swap_lock);
+	return ret;
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v5 0/9] mm/swap: Regular page swap optimizations
  2017-01-17  1:06   ` Huang, Ying
@ 2017-01-17  7:49     ` Michal Hocko
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2017-01-17  7:49 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Tim Chen, Andrew Morton, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

On Tue 17-01-17 09:06:04, Huang, Ying wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > Hi,
> > I am seeing a lot of preempt unsafe warnings with the current mmotm and
> > I assume that this patchset has introduced the issue. I haven't checked
> > more closely but get_swap_page didn't use this_cpu_ptr before "mm/swap:
> > add cache for swap slots allocation"
> >
> > [   57.812314] BUG: using smp_processor_id() in preemptible [00000000] code: kswapd0/527
> > [   57.814360] caller is debug_smp_processor_id+0x17/0x19
> > [   57.815237] CPU: 1 PID: 527 Comm: kswapd0 Tainted: G        W 4.9.0-mmotm-00135-g4e9a9895ebef #1042
> > [   57.816019] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
> > [   57.816019]  ffffc900001939c0 ffffffff81329c60 0000000000000001 ffffffff81a0ce06
> > [   57.816019]  ffffc900001939f0 ffffffff81343c2a 00000000000137a0 ffffea0000dfd2a0
> > [   57.816019]  ffff88003c49a700 ffffc90000193b10 ffffc90000193a00 ffffffff81343c53
> > [   57.816019] Call Trace:
> > [   57.816019]  [<ffffffff81329c60>] dump_stack+0x68/0x92
> > [   57.816019]  [<ffffffff81343c2a>] check_preemption_disabled+0xce/0xe0
> > [   57.816019]  [<ffffffff81343c53>] debug_smp_processor_id+0x17/0x19
> > [   57.816019]  [<ffffffff8115f06f>] get_swap_page+0x19/0x183
> > [   57.816019]  [<ffffffff8114e01d>] shmem_writepage+0xce/0x38c
> > [   57.816019]  [<ffffffff81148916>] shrink_page_list+0x81f/0xdbf
> > [   57.816019]  [<ffffffff81149652>] shrink_inactive_list+0x2ab/0x594
> > [   57.816019]  [<ffffffff8114a22f>] shrink_node_memcg+0x4c7/0x673
> > [   57.816019]  [<ffffffff8114a49f>] shrink_node+0xc4/0x282
> > [   57.816019]  [<ffffffff8114a49f>] ? shrink_node+0xc4/0x282
> > [   57.816019]  [<ffffffff8114b8cb>] kswapd+0x656/0x834
> > [   57.816019]  [<ffffffff8114b275>] ? mem_cgroup_shrink_node+0x2e1/0x2e1
> > [   57.816019]  [<ffffffff81069fb4>] ? call_usermodehelper_exec_async+0x124/0x12d
> > [   57.816019]  [<ffffffff81073621>] kthread+0xf9/0x101
> > [   57.816019]  [<ffffffff81660198>] ? _raw_spin_unlock_irq+0x2c/0x4a
> > [   57.816019]  [<ffffffff81073528>] ? kthread_park+0x5a/0x5a
> > [   57.816019]  [<ffffffff81069e90>] ? umh_complete+0x25/0x25
> > [   57.816019]  [<ffffffff81660b07>] ret_from_fork+0x27/0x40
> 
> Sorry for bothering, we should have tested this before.

I am always running my tests with CONFIG_DEBUG_PREEMPT=y which is what
has caught this one.

[...]
> > would be a way to go but the function takes a sleeping lock so disabling
> > the preemption is not a way forward. So this is either preempt safe
> > for some reason - which should be IMHO documented in a comment - and
> > raw_cpu_ptr can be used or this needs a deeper thought.
> 
> Thanks for pointing out this.
> 
> We think this is preempt safe.  During the development, we have
> considered the possible preemption between getting the per-CPU pointer
> and its usage, and implemented the code to make it work in that
> situation.  We will change the code to use raw_cpu_ptr() and add a
> comment for it.

FWIW s@this_cpu_ptr@raw_cpu_ptr@ which I am using as a workaround now
hasn't seemed to cause any issue. At least nothing observable like a
crash.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Update][PATCH v5 7/9] mm/swap: Add cache for swap slots allocation
  2017-01-17  2:55   ` [Update][PATCH " Huang, Ying
@ 2017-01-17 10:16     ` Michal Hocko
  2017-01-17 17:24       ` Chen, Tim C
  2017-01-17 21:42     ` Tim Chen
  1 sibling, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2017-01-17 10:16 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Hillf Danton, Christian Borntraeger,
	Jonathan Corbet, Tim C Chen

On Tue 17-01-17 10:55:47, Huang, Ying wrote:
[...]
> +int free_swap_slot(swp_entry_t entry)
> +{
> +	struct swap_slots_cache *cache;
> +
> +	BUG_ON(!swap_slot_cache_initialized);
> +
> +	cache = &get_cpu_var(swp_slots);
> +	if (use_swap_slot_cache && cache->slots_ret) {
> +		spin_lock_irq(&cache->free_lock);
> +		/* Swap slots cache may be deactivated before acquiring lock */
> +		if (!use_swap_slot_cache) {
> +			spin_unlock_irq(&cache->free_lock);
> +			goto direct_free;
> +		}
> +		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
> +			/*
> +			 * Return slots to global pool.
> +			 * The current swap_map value is SWAP_HAS_CACHE.
> +			 * Set it to 0 to indicate it is available for
> +			 * allocation in global pool
> +			 */
> +			swapcache_free_entries(cache->slots_ret, cache->n_ret);
> +			cache->n_ret = 0;
> +		}
> +		cache->slots_ret[cache->n_ret++] = entry;
> +		spin_unlock_irq(&cache->free_lock);
> +	} else {
> +direct_free:
> +		swapcache_free_entries(&entry, 1);
> +	}
> +	put_cpu_var(swp_slots);
> +
> +	return 0;
> +}
> +
> +swp_entry_t get_swap_page(void)
> +{
> +	swp_entry_t entry, *pentry;
> +	struct swap_slots_cache *cache;
> +
> +	/*
> +	 * Preemption need to be turned on here, because we may sleep
> +	 * in refill_swap_slots_cache().  But it is safe, because
> +	 * accesses to the per-CPU data structure are protected by a
> +	 * mutex.
> +	 */

the comment doesn't really explain why it is safe. There are other users
which are not using the lock. E.g. just look at free_swap_slot above. 
How can
	cache->slots_ret[cache->n_ret++] = entry;
be safe wrt.
	pentry = &cache->slots[cache->cur++];
	entry = *pentry;

Both of them might touch the same slot, no? Btw. I would rather prefer
this would be a follow up fix with the trace and the detailed
explanation.

> +	cache = raw_cpu_ptr(&swp_slots);
> +
> +	entry.val = 0;
> +	if (check_cache_active()) {
> +		mutex_lock(&cache->alloc_lock);
> +		if (cache->slots) {
> +repeat:
> +			if (cache->nr) {
> +				pentry = &cache->slots[cache->cur++];
> +				entry = *pentry;
> +				pentry->val = 0;
> +				cache->nr--;
> +			} else {
> +				if (refill_swap_slots_cache(cache))
> +					goto repeat;
> +			}
> +		}
> +		mutex_unlock(&cache->alloc_lock);
> +		if (entry.val)
> +			return entry;
> +	}
> +
> +	get_swap_pages(1, &entry);
> +
> +	return entry;
> +}
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [Update][PATCH v5 7/9] mm/swap: Add cache for swap slots allocation
  2017-01-17 10:16     ` Michal Hocko
@ 2017-01-17 17:24       ` Chen, Tim C
  2017-01-17 20:03         ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Chen, Tim C @ 2017-01-17 17:24 UTC (permalink / raw)
  To: Michal Hocko, Huang, Ying
  Cc: Andrew Morton, Hansen, Dave, ak, Lu, Aaron, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

> > +	/*
> > +	 * Preemption needs to be turned on here, because we may sleep
> > +	 * in refill_swap_slots_cache().  But it is safe, because
> > +	 * accesses to the per-CPU data structure are protected by a
> > +	 * mutex.
> > +	 */
> 
> the comment doesn't really explain why it is safe. There are other users
> which are not using the lock. E.g. just look at free_swap_slot above.
> How can
> 	cache->slots_ret[cache->n_ret++] = entry; be safe wrt.
> 	pentry = &cache->slots[cache->cur++];
> 	entry = *pentry;
> 
> Both of them might touch the same slot, no? Btw. I would rather prefer this
> would be a follow up fix with the trace and the detailed explanation.
> 

The cache->slots_ret  is protected by cache->free_lock and cache->slots is
protected by cache->free_lock.  They are two separate structures, one for
caching the slots returned and one for caching the slots allocated.  So
they do not touch the same slots.  We'll update the comments so it is clearer.

Sure. We can issue a follow up fix on top of the current patchset.

Thanks.

Tim


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Update][PATCH v5 7/9] mm/swap: Add cache for swap slots allocation
  2017-01-17 17:24       ` Chen, Tim C
@ 2017-01-17 20:03         ` Michal Hocko
  2017-01-17 20:31           ` Chen, Tim C
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2017-01-17 20:03 UTC (permalink / raw)
  To: Chen, Tim C
  Cc: Huang, Ying, Andrew Morton, Hansen, Dave, ak, Lu, Aaron,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

On Tue 17-01-17 17:24:15, Chen, Tim C wrote:
> > > +	/*
> > > +	 * Preemption needs to be turned on here, because we may sleep
> > > +	 * in refill_swap_slots_cache().  But it is safe, because
> > > +	 * accesses to the per-CPU data structure are protected by a
> > > +	 * mutex.
> > > +	 */
> > 
> > the comment doesn't really explain why it is safe. There are other users
> > which are not using the lock. E.g. just look at free_swap_slot above.
> > How can
> > 	cache->slots_ret[cache->n_ret++] = entry; be safe wrt.
> > 	pentry = &cache->slots[cache->cur++];
> > 	entry = *pentry;
> > 
> > Both of them might touch the same slot, no? Btw. I would rather prefer this
> > would be a follow up fix with the trace and the detailed explanation.
> > 
> 
> The cache->slots_ret  is protected by cache->free_lock and cache->slots is
> protected by cache->free_lock.

Ohh, I have misread those names and considered them the same thing.
Sorry about the confusion. I will look at the code more deeply tomorrow.

> They are two separate structures, one for
> caching the slots returned and one for caching the slots allocated.  So
> they do not touch the same slots.  We'll update the comments so it is clearer.

That would be really appreciated.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [Update][PATCH v5 7/9] mm/swap: Add cache for swap slots allocation
  2017-01-17 20:03         ` Michal Hocko
@ 2017-01-17 20:31           ` Chen, Tim C
  0 siblings, 0 replies; 33+ messages in thread
From: Chen, Tim C @ 2017-01-17 20:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Huang, Ying, Andrew Morton, Hansen, Dave, ak, Lu, Aaron,
	linux-mm, linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet

> > >
> >
> > The cache->slots_ret  is protected by cache->free_lock and
> > cache->slots is protected by cache->free_lock.

Typo.  cache->slots is protected by cache->alloc_lock.
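
To recap the split for the thread (a trimmed sketch of the struct from
the patch):

	struct swap_slots_cache {
		struct mutex	alloc_lock;	/* protects slots[], cur, nr   */
		swp_entry_t	*slots;		/* slots handed out            */
		spinlock_t	free_lock;	/* protects slots_ret[], n_ret */
		swp_entry_t	*slots_ret;	/* slots being returned        */
	};

get_swap_page() only touches the allocation side under alloc_lock, and
free_swap_slot() only touches the return side under free_lock, so the
two paths never operate on the same array.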

Tim


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Update][PATCH v5 7/9] mm/swap: Add cache for swap slots allocation
  2017-01-17  2:55   ` [Update][PATCH " Huang, Ying
  2017-01-17 10:16     ` Michal Hocko
@ 2017-01-17 21:42     ` Tim Chen
  2017-01-18 12:45       ` Michal Hocko
  1 sibling, 1 reply; 33+ messages in thread
From: Tim Chen @ 2017-01-17 21:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, dave.hansen, ak, aaron.lu, linux-mm, linux-kernel,
	Hugh Dickins, Shaohua Li, Minchan Kim, Rik van Riel,
	Andrea Arcangeli, Kirill A . Shutemov, Vladimir Davydov,
	Johannes Weiner, Michal Hocko, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim C Chen

On Tue, Jan 17, 2017 at 10:55:47AM +0800, Huang, Ying wrote:
> Hi, Andrew,
> 
> This updated patch fixes the preemption warning raised by Michal
> Hocko.  raw_cpu_ptr() is used to replace this_cpu_ptr(), and comments are
> added explaining why it is used.
> 

Andrew & Michal,

Here's a fix as a follow-on patch instead of an updated patch,
as Michal suggested.  I've updated the comments a bit to make them
clearer.

Thanks.

Tim

--->8---
Date: Tue, 17 Jan 2017 12:57:00 -0800
Subject: [PATCH] mm/swap: Use raw_cpu_ptr over this_cpu_ptr for swap slots
 access
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Ying Huang <ying.huang@intel.com>, dave.hansen@intel.com, ak@linux.intel.com, aaron.lu@intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins <hughd@google.com>, Shaohua Li <shli@kernel.org>, Minchan Kim <minchan@kernel.org>, Rik van Riel <riel@redhat.com>, Andrea Arcangeli <aarcange@redhat.com>, Kirill A . Shutemov <kirill.shutemov@linux.intel.com>, Vladimir Davydov <vdavydov.dev@gmail.com>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Hillf Danton <hillf.zj@alibaba-inc.com>, Christian Borntraeger <borntraeger@de.ibm.com>, Jonathan Corbet <corbet@lwn.net>

From: "Huang, Ying" <ying.huang@intel.com>

The usage of this_cpu_ptr() in get_swap_page() triggers a BUG warning,
as it is used in preemptible code.

[   57.812314] BUG: using smp_processor_id() in preemptible [00000000] code: kswapd0/527
[   57.814360] caller is debug_smp_processor_id+0x17/0x19
[   57.815237] CPU: 1 PID: 527 Comm: kswapd0 Tainted: G        W 4.9.0-mmotm-00135-g4e9a9895ebef #1042
[   57.816019] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[   57.816019]  ffffc900001939c0 ffffffff81329c60 0000000000000001 ffffffff81a0ce06
[   57.816019]  ffffc900001939f0 ffffffff81343c2a 00000000000137a0 ffffea0000dfd2a0
[   57.816019]  ffff88003c49a700 ffffc90000193b10 ffffc90000193a00 ffffffff81343c53
[   57.816019] Call Trace:
[   57.816019]  [<ffffffff81329c60>] dump_stack+0x68/0x92
[   57.816019]  [<ffffffff81343c2a>] check_preemption_disabled+0xce/0xe0
[   57.816019]  [<ffffffff81343c53>] debug_smp_processor_id+0x17/0x19
[   57.816019]  [<ffffffff8115f06f>] get_swap_page+0x19/0x183
[   57.816019]  [<ffffffff8114e01d>] shmem_writepage+0xce/0x38c
[   57.816019]  [<ffffffff81148916>] shrink_page_list+0x81f/0xdbf
[   57.816019]  [<ffffffff81149652>] shrink_inactive_list+0x2ab/0x594
[   57.816019]  [<ffffffff8114a22f>] shrink_node_memcg+0x4c7/0x673
[   57.816019]  [<ffffffff8114a49f>] shrink_node+0xc4/0x282
[   57.816019]  [<ffffffff8114a49f>] ? shrink_node+0xc4/0x282
[   57.816019]  [<ffffffff8114b8cb>] kswapd+0x656/0x834

Logic-wise, we do allow preemption, as the per-CPU pointer cache->slots
is protected by the mutex cache->alloc_lock.  We switch the
inappropriately used this_cpu_ptr() to raw_cpu_ptr() for the per-CPU
access of cache->slots.

Reported-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/swap_slots.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 8cf941e..9b5bc86 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -303,7 +303,16 @@ swp_entry_t get_swap_page(void)
 	swp_entry_t entry, *pentry;
 	struct swap_slots_cache *cache;
 
-	cache = this_cpu_ptr(&swp_slots);
+	/*
+	 * Preemption is allowed here, because we may sleep
+	 * in refill_swap_slots_cache().  But it is safe, because
+	 * accesses to the per-CPU data structure are protected by the
+	 * mutex cache->alloc_lock.
+	 *
+	 * The alloc path here does not touch cache->slots_ret
+	 * so cache->free_lock is not taken.
+	 */
+	cache = raw_cpu_ptr(&swp_slots);
 
 	entry.val = 0;
 	if (check_cache_active()) {
-- 
2.5.5
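
To illustrate why raw_cpu_ptr() is sufficient here, a minimal sketch of
the pattern follows.  This is not the exact get_swap_page() code; the
refill call and the field handling are simplified for illustration:

	cache = raw_cpu_ptr(&swp_slots);	/* no preemption check; the task
						 * may migrate to another CPU later
						 */
	mutex_lock(&cache->alloc_lock);		/* may sleep; serializes access to
						 * this cache's slots, nr and cur
						 */
	if (!cache->nr)
		refill_swap_slots_cache(cache);	/* may also sleep */
	if (cache->nr) {
		entry = cache->slots[cache->cur++];
		cache->nr--;
	}
	mutex_unlock(&cache->alloc_lock);

this_cpu_ptr() would additionally run the check_preemption_disabled()
debug check seen in the trace above, but that guarantee is not needed:
whichever CPU's cache we end up with, every access to its fields is
serialized by alloc_lock.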


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [Update][PATCH v5 7/9] mm/swap: Add cache for swap slots allocation
  2017-01-17 21:42     ` Tim Chen
@ 2017-01-18 12:45       ` Michal Hocko
  2017-01-18 18:03         ` Tim Chen
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2017-01-18 12:45 UTC (permalink / raw)
  To: Tim Chen
  Cc: Huang, Ying, Andrew Morton, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim C Chen

On Tue 17-01-17 13:42:35, Tim Chen wrote:
[...]
> Date: Tue, 17 Jan 2017 12:57:00 -0800
> Subject: [PATCH] mm/swap: Use raw_cpu_ptr over this_cpu_ptr for swap slots
>  access
> To: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ying Huang <ying.huang@intel.com>, dave.hansen@intel.com, ak@linux.intel.com, aaron.lu@intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins <hughd@google.com>, Shaohua Li <shli@kernel.org>, Minchan Kim <minchan@kernel.org>, Rik van Riel <riel@redhat.com>, Andrea Arcangeli <aarcange@redhat.com>, Kirill A . Shutemov <kirill.shutemov@linux.intel.com>, Vladimir Davydov <vdavydov.dev@gmail.com>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Hillf Danton <hillf.zj@alibaba-inc.com>, Christian Borntraeger <borntraeger@de.ibm.com>, Jonathan Corbet <corbet@lwn.net>
> 
> From: "Huang, Ying" <ying.huang@intel.com>
> 
> The usage of this_cpu_ptr() in get_swap_page() triggers a BUG warning,
> as it is used in preemptible code.
> 
> [   57.812314] BUG: using smp_processor_id() in preemptible [00000000] code: kswapd0/527
> [   57.814360] caller is debug_smp_processor_id+0x17/0x19
> [   57.815237] CPU: 1 PID: 527 Comm: kswapd0 Tainted: G        W 4.9.0-mmotm-00135-g4e9a9895ebef #1042
> [   57.816019] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
> [   57.816019]  ffffc900001939c0 ffffffff81329c60 0000000000000001 ffffffff81a0ce06
> [   57.816019]  ffffc900001939f0 ffffffff81343c2a 00000000000137a0 ffffea0000dfd2a0
> [   57.816019]  ffff88003c49a700 ffffc90000193b10 ffffc90000193a00 ffffffff81343c53
> [   57.816019] Call Trace:
> [   57.816019]  [<ffffffff81329c60>] dump_stack+0x68/0x92
> [   57.816019]  [<ffffffff81343c2a>] check_preemption_disabled+0xce/0xe0
> [   57.816019]  [<ffffffff81343c53>] debug_smp_processor_id+0x17/0x19
> [   57.816019]  [<ffffffff8115f06f>] get_swap_page+0x19/0x183
> [   57.816019]  [<ffffffff8114e01d>] shmem_writepage+0xce/0x38c
> [   57.816019]  [<ffffffff81148916>] shrink_page_list+0x81f/0xdbf
> [   57.816019]  [<ffffffff81149652>] shrink_inactive_list+0x2ab/0x594
> [   57.816019]  [<ffffffff8114a22f>] shrink_node_memcg+0x4c7/0x673
> [   57.816019]  [<ffffffff8114a49f>] shrink_node+0xc4/0x282
> [   57.816019]  [<ffffffff8114a49f>] ? shrink_node+0xc4/0x282
> [   57.816019]  [<ffffffff8114b8cb>] kswapd+0x656/0x834
> 
> Logic-wise, we do allow preemption, as the per-CPU pointer cache->slots
> is protected by the mutex cache->alloc_lock.  We switch the
> inappropriately used this_cpu_ptr() to raw_cpu_ptr() for the per-CPU
> access of cache->slots.

OK, that looks better. I would still appreciate something like the
following folded in
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index fb907346c5c6..0afe748453a7 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -11,6 +11,7 @@
 
 struct swap_slots_cache {
 	bool		lock_initialized;
+	/* protects slots, nr, cur */
 	struct mutex	alloc_lock;
 	swp_entry_t	*slots;
 	int		nr;

> 
> Reported-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

Reviewed-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/swap_slots.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 8cf941e..9b5bc86 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -303,7 +303,16 @@ swp_entry_t get_swap_page(void)
>  	swp_entry_t entry, *pentry;
>  	struct swap_slots_cache *cache;
>  
> -	cache = this_cpu_ptr(&swp_slots);
> +	/*
> +	 * Preemption is allowed here, because we may sleep
> +	 * in refill_swap_slots_cache().  But it is safe, because
> +	 * accesses to the per-CPU data structure are protected by the
> +	 * mutex cache->alloc_lock.
> +	 *
> +	 * The alloc path here does not touch cache->slots_ret
> +	 * so cache->free_lock is not taken.
> +	 */
> +	cache = raw_cpu_ptr(&swp_slots);
>  
>  	entry.val = 0;
>  	if (check_cache_active()) {
> -- 
> 2.5.5
> 



-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [Update][PATCH v5 7/9] mm/swap: Add cache for swap slots allocation
  2017-01-18 12:45       ` Michal Hocko
@ 2017-01-18 18:03         ` Tim Chen
  2017-01-18 18:15           ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Tim Chen @ 2017-01-18 18:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Huang, Ying, Andrew Morton, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim C Chen

[-- Attachment #1: Type: text/plain, Size: 2459 bytes --]

On Wed, Jan 18, 2017 at 01:45:55PM +0100, Michal Hocko wrote:
> On Tue 17-01-17 13:42:35, Tim Chen wrote:
> [...]
> > > Logic-wise, we do allow preemption, as the per-CPU pointer cache->slots
> > > is protected by the mutex cache->alloc_lock.  We switch the
> > > inappropriately used this_cpu_ptr() to raw_cpu_ptr() for the per-CPU
> > > access of cache->slots.
> 
> OK, that looks better. I would still appreciate something like the
> following folded in
> diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
> index fb907346c5c6..0afe748453a7 100644
> --- a/include/linux/swap_slots.h
> +++ b/include/linux/swap_slots.h
> @@ -11,6 +11,7 @@
>  
>  struct swap_slots_cache {
>  	bool		lock_initialized;
> +	/* protects slots, nr, cur */
>  	struct mutex	alloc_lock;
>  	swp_entry_t	*slots;
>  	int		nr;
> 

I've included a patch here adding those comments.

Thanks.

Tim

--->8---
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Wed, 18 Jan 2017 09:52:28 -0800
Subject: [PATCH] mm/swap: Add comments on locks in swap_slots.h
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Ying Huang <ying.huang@intel.com>, dave.hansen@intel.com, ak@linux.intel.com, aaron.lu@intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins <hughd@google.com>, Shaohua Li <shli@kernel.org>, Minchan Kim <minchan@kernel.org>, Rik van Riel <riel@redhat.com>, Andrea Arcangeli <aarcange@redhat.com>, Kirill A . Shutemov <kirill.shutemov@linux.intel.com>, Vladimir Davydov <vdavydov.dev@gmail.com>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Hillf Danton <hillf.zj@alibaba-inc.com>, Christian Borntraeger <borntraeger@de.ibm.com>, Jonathan Corbet <corbet@lwn.net>

Explain what each lock protects in the swap_slots_cache structure.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/swap_slots.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index fb90734..6ef92d1 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -11,11 +11,11 @@
 
 struct swap_slots_cache {
 	bool		lock_initialized;
-	struct mutex	alloc_lock;
+	struct mutex	alloc_lock; /* protects slots, nr, cur */
 	swp_entry_t	*slots;
 	int		nr;
 	int		cur;
-	spinlock_t	free_lock;
+	spinlock_t	free_lock;  /* protects slots_ret, n_ret */
 	swp_entry_t	*slots_ret;
 	int		n_ret;
 };
-- 
2.5.5


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [Update][PATCH v5 7/9] mm/swap: Add cache for swap slots allocation
  2017-01-18 18:03         ` Tim Chen
@ 2017-01-18 18:15           ` Michal Hocko
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2017-01-18 18:15 UTC (permalink / raw)
  To: Tim Chen
  Cc: Huang, Ying, Andrew Morton, dave.hansen, ak, aaron.lu, linux-mm,
	linux-kernel, Hugh Dickins, Shaohua Li, Minchan Kim,
	Rik van Riel, Andrea Arcangeli, Kirill A . Shutemov,
	Vladimir Davydov, Johannes Weiner, Hillf Danton,
	Christian Borntraeger, Jonathan Corbet, Tim C Chen

On Wed 18-01-17 10:03:27, Tim Chen wrote:
> On Wed, Jan 18, 2017 at 01:45:55PM +0100, Michal Hocko wrote:
> > On Tue 17-01-17 13:42:35, Tim Chen wrote:
> > [...]
> > > Logic-wise, we do allow preemption, as the per-CPU pointer cache->slots
> > > is protected by the mutex cache->alloc_lock.  We switch the
> > > inappropriately used this_cpu_ptr() to raw_cpu_ptr() for the per-CPU
> > > access of cache->slots.
> > 
> > OK, that looks better. I would still appreciate something like the
> > following folded in
> > diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
> > index fb907346c5c6..0afe748453a7 100644
> > --- a/include/linux/swap_slots.h
> > +++ b/include/linux/swap_slots.h
> > @@ -11,6 +11,7 @@
> >  
> >  struct swap_slots_cache {
> >  	bool		lock_initialized;
> > +	/* protects slots, nr, cur */
> >  	struct mutex	alloc_lock;
> >  	swp_entry_t	*slots;
> >  	int		nr;
> > 
> 
> I've included here a patch for the comments.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2017-01-18 18:15 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-11 17:55 [PATCH v5 0/9] mm/swap: Regular page swap optimizations Tim Chen
2017-01-11 17:55 ` [PATCH v5 1/9] mm/swap: Fix kernel message in swap_info_get() Tim Chen
2017-01-11 17:55 ` [PATCH v5 2/9] mm/swap: Add cluster lock Tim Chen
2017-01-11 23:00   ` Andrew Morton
2017-01-11 23:07     ` Jonathan Corbet
2017-01-11 23:15       ` Andrew Morton
2017-01-12  1:47         ` Huang, Ying
2017-01-12  1:58           ` Andrew Morton
2017-01-12  2:51             ` Huang, Ying
2017-01-14  4:37             ` [Update][PATCH " Huang, Ying
2017-01-12  1:23       ` [PATCH " Huang, Ying
2017-01-11 17:55 ` [PATCH v5 3/9] mm/swap: Split swap cache into 64MB trunks Tim Chen
2017-01-11 23:09   ` Andrew Morton
2017-01-11 23:19     ` Andi Kleen
2017-01-12 16:47       ` Tim Chen
2017-01-11 17:55 ` [PATCH v5 4/9] mm/swap: skip read ahead for unreferenced swap slots Tim Chen
2017-01-11 17:55 ` [PATCH v5 5/9] mm/swap: Allocate swap slots in batches Tim Chen
2017-01-11 17:55 ` [PATCH v5 6/9] mm/swap: Free swap slots in batch Tim Chen
2017-01-11 17:55 ` [PATCH v5 7/9] mm/swap: Add cache for swap slots allocation Tim Chen
2017-01-17  2:55   ` [Update][PATCH " Huang, Ying
2017-01-17 10:16     ` Michal Hocko
2017-01-17 17:24       ` Chen, Tim C
2017-01-17 20:03         ` Michal Hocko
2017-01-17 20:31           ` Chen, Tim C
2017-01-17 21:42     ` Tim Chen
2017-01-18 12:45       ` Michal Hocko
2017-01-18 18:03         ` Tim Chen
2017-01-18 18:15           ` Michal Hocko
2017-01-11 17:55 ` [PATCH v5 8/9] mm/swap: Enable swap slots cache usage Tim Chen
2017-01-11 17:55 ` [PATCH v5 9/9] mm/swap: Skip readahead only when swap slot cache is enabled Tim Chen
2017-01-16 12:02 ` [PATCH v5 0/9] mm/swap: Regular page swap optimizations Michal Hocko
2017-01-17  1:06   ` Huang, Ying
2017-01-17  7:49     ` Michal Hocko
