From: Huang Ying <ying.huang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Huang Ying <ying.huang@intel.com>, Michal Hocko <mhocko@suse.com>,
	Minchan Kim <minchan@kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Hugh Dickins <hughd@google.com>
Subject: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache
Date: Thu, 14 May 2020 15:04:24 +0800
Message-ID: <20200514070424.16017-1-ying.huang@intel.com>

In some swap scalability tests, heavy lock contention on the swap
cache is observed even though the swap cache has been split from one
radix tree per swap device into one radix tree per 64 MB trunk in
commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

The reason is as follows.  After the swap device becomes so
fragmented that there is no free swap cluster, the swap device is
scanned linearly for free swap slots.  swap_info_struct->cluster_next
is the next scanning base and is shared by all CPUs, so nearby free
swap slots get allocated to different CPUs.  This makes it very
likely that multiple CPUs operate on the same 64 MB trunk at the same
time, which causes the lock contention on the swap cache.

To solve the issue, this patch adds a percpu next scanning base
(cluster_next_cpu) for SSD swap devices.  Every CPU uses its own
scanning base, so the probability that multiple CPUs operate on the
same 64 MB trunk is greatly reduced, and with it the lock contention.
For HDD, where sequential access is more important for I/O
performance, the original shared scanning base is kept.
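
For orientation, the core of the change can be condensed as below.
This is only a sketch assembled from the hunks that follow; the real
code sits inline in scan_swap_map_slots() and does not add separate
helpers.

	/* Pick the scan base: per-CPU for SSD, shared for HDD. */
	if (si->flags & SWP_SOLIDSTATE)
		scan_base = this_cpu_read(*si->cluster_next_cpu);
	else
		scan_base = si->cluster_next;

	/* ... scan from scan_base and allocate the slot at offset ... */

	/* Remember where to continue next time, again per-CPU for SSD. */
	if (si->flags & SWP_SOLIDSTATE)
		this_cpu_write(*si->cluster_next_cpu, offset + 1);
	else
		si->cluster_next = offset + 1;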

To test the patch, we ran a 16-process pmbench memory benchmark on a
2-socket server machine with 48 cores, with one RAM disk configured
as the swap device per socket.  The pmbench working-set size is much
larger than the available memory, so swapping is triggered.  The
memory read/write ratio is 80/20 and the access pattern is random.
With the original implementation, the lock contention on the swap
cache is heavy.  The perf profiling data of the lock contention code
paths is as follows,

_raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      7.93
_raw_spin_lock_irqsave.__remove_mapping.shrink_page_list: 		7.03
_raw_spin_lock_irq.mem_cgroup_commit_charge.do_swap_page: 		3.7
_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.9
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:	1.32
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 	1.01
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 	0.87

After applying this patch, it becomes,

_raw_spin_lock_irq.mem_cgroup_commit_charge.do_swap_page:		3.99
_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.0
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      1.47
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 	1.31
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 	0.88
_raw_spin_lock.scan_swap_map_slots.get_swap_pages.get_swap_page: 	0.76
_raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      0.53

The lock contention on the swap cache is almost eliminated.

The pmbench score increases by 15.9%.  The swapin throughput
increases by 16.2%, from 2.84 GB/s to 3.3 GB/s, while the swapout
throughput increases by 16.1%, from 2.87 GB/s to 3.33 GB/s.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
---
 include/linux/swap.h |  1 +
 mm/swapfile.c        | 27 +++++++++++++++++++++++++--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b42fb47d8cbe..e96820fb7472 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,6 +252,7 @@ struct swap_info_struct {
 	unsigned int inuse_pages;	/* number of those currently in use */
 	unsigned int cluster_next;	/* likely index for next allocation */
 	unsigned int cluster_nr;	/* countdown to next cluster search */
+	unsigned int __percpu *cluster_next_cpu; /* percpu index for next allocation */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 35be7a7271f4..9f1343b066c1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	 */
 
 	si->flags += SWP_SCANNING;
-	scan_base = offset = si->cluster_next;
+	/*
+	 * Use percpu scan base for SSD to reduce lock contention on
+	 * cluster and swap cache.  For HDD, sequential access is more
+	 * important.
+	 */
+	if (si->flags & SWP_SOLIDSTATE)
+		scan_base = this_cpu_read(*si->cluster_next_cpu);
+	else
+		scan_base = si->cluster_next;
+	offset = scan_base;
 
 	/* SSD algorithm */
 	if (si->cluster_info) {
@@ -835,7 +844,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	unlock_cluster(ci);
 
 	swap_range_alloc(si, offset, 1);
-	si->cluster_next = offset + 1;
+	if (si->flags & SWP_SOLIDSTATE)
+		this_cpu_write(*si->cluster_next_cpu, offset + 1);
+	else
+		si->cluster_next = offset + 1;
 	slots[n_ret++] = swp_entry(si->type, offset);
 
 	/* got enough slots or reach max slots? */
@@ -2828,6 +2840,11 @@ static struct swap_info_struct *alloc_swap_info(void)
 	p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
 	if (!p)
 		return ERR_PTR(-ENOMEM);
+	p->cluster_next_cpu = alloc_percpu(unsigned int);
+	if (!p->cluster_next_cpu) {
+		kvfree(p);
+		return ERR_PTR(-ENOMEM);
+	}
 
 	spin_lock(&swap_lock);
 	for (type = 0; type < nr_swapfiles; type++) {
@@ -2962,6 +2979,8 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 
 	p->lowest_bit  = 1;
 	p->cluster_next = 1;
+	for_each_possible_cpu(i)
+		per_cpu(*p->cluster_next_cpu, i) = 1;
 	p->cluster_nr = 0;
 
 	maxpages = max_swapfile_size();
@@ -3204,6 +3223,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		 * SSD
 		 */
 		p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
+		for_each_possible_cpu(cpu) {
+			per_cpu(*p->cluster_next_cpu, cpu) =
+				1 + prandom_u32_max(p->highest_bit);
+		}
 		nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 
 		cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
-- 
2.26.2

