From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 01 Jun 2020 21:49:10 -0700
From: Andrew Morton
To: akpm@linux-foundation.org, dave.hansen@intel.com, hughd@google.com,
 linux-mm@kvack.org, mhocko@suse.com, minchan@kernel.org,
 mm-commits@vger.kernel.org, tim.c.chen@linux.intel.com,
 torvalds@linux-foundation.org, ying.huang@intel.com
Subject: [patch 069/128] swap: try to scan more free slots even when fragmented
Message-ID: <20200602044910.c5GIKRoKk%akpm@linux-foundation.org>
In-Reply-To: <20200601214457.919c35648e96a2b46b573fe1@linux-foundation.org>
User-Agent: s-nail v14.8.16

From: Huang Ying
Subject: swap: try to scan more free slots even when fragmented

Currently, the scalability of the swap code drops sharply when the swap
device becomes fragmented, because swap slot allocation batching stops
working.  To solve the problem, this patch tries to scan a few more swap
slots, with strictly restricted effort, so that slot allocation can still
be batched even when the swap device is fragmented.  Tests show that the
benchmark score can increase by up to 37.1% with the patch.  Details are
as follows.

The swap code has a per-CPU cache of swap slots.  These batch swap space
allocations to improve swap subsystem scalability.  In the following code
path,

  add_to_swap()
    get_swap_page()
      refill_swap_slots_cache()
        get_swap_pages()
          scan_swap_map_slots()

scan_swap_map_slots() and get_swap_pages() can return multiple swap slots
per call.  These slots are cached in the per-CPU swap slots cache, so that
several subsequent swap slot requests can be fulfilled there, avoiding
lock contention in the lower-level swap space allocation/freeing code
path.

But this only works when there are free swap clusters.  If the swap
device becomes so fragmented that no free swap clusters remain,
scan_swap_map_slots() and get_swap_pages() return only one swap slot per
call in the above code path.  Effectively, this falls back to the
situation before the swap slots cache was introduced: heavy contention on
the swap-related locks kills the scalability.

Why does it work this way?  Because the swap device can be large and the
free swap slot scan can be quite time consuming, a conservative method
was chosen to avoid spending too much time scanning for free swap slots.
In fact, this can be improved by scanning a few more free slots with
strictly restricted effort, which is what this patch implements.  In
scan_swap_map_slots(), after the first free swap slot is found, we try to
scan a little further, but only if we have not already scanned too many
slots (< LATENCY_LIMIT).  That is, the added scanning latency is strictly
bounded.

To test the patch, we ran the pmbench memory benchmark with 16 processes
on a 2-socket server machine with 48 cores.  Multiple ram disks were
configured as the swap devices.  The pmbench working-set size is much
larger than the available memory, so swapping is triggered.  The memory
read/write ratio is 80/20 and the access pattern is random, so the swap
space becomes highly fragmented during the test.  With the original
implementation, the contention on the swap-related locks is very heavy.
The perf profile of the lock contention code paths is as follows,

  _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap:              21.03
  _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:     1.92
  _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:       1.72
  _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:        0.69

After applying this patch, it becomes,

  _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:     4.89
  _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:       3.85
  _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:        1.1
  _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88

That is, the contention on the swap locks is eliminated.
And the pmbench score increases by 37.1%.  The swapin throughput
increases by 45.7%, from 2.02 GB/s to 2.94 GB/s, while the swapout
throughput increases by 45.3%, from 2.04 GB/s to 2.97 GB/s.

Link: http://lkml.kernel.org/r/20200427030023.264780-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying"
Acked-by: Tim Chen
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
---

 mm/swapfile.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

--- a/mm/swapfile.c~swap-try-to-scan-more-free-slots-even-when-fragmented
+++ a/mm/swapfile.c
@@ -732,6 +732,7 @@ static int scan_swap_map_slots(struct sw
 	unsigned long last_in_cluster = 0;
 	int latency_ration = LATENCY_LIMIT;
 	int n_ret = 0;
+	bool scanned_many = false;
 
 	/*
 	 * We try to cluster swap pages by allocating them sequentially
@@ -863,6 +864,25 @@ checks:
 		goto checks;
 	}
 
+	/*
+	 * Even if there's no free clusters available (fragmented),
+	 * try to scan a little more quickly with lock held unless we
+	 * have scanned too many slots already.
+	 */
+	if (!scanned_many) {
+		unsigned long scan_limit;
+
+		if (offset < scan_base)
+			scan_limit = scan_base;
+		else
+			scan_limit = si->highest_bit;
+		for (; offset <= scan_limit && --latency_ration > 0;
+		     offset++) {
+			if (!si->swap_map[offset])
+				goto checks;
+		}
+	}
+
 done:
 	si->flags -= SWP_SCANNING;
 	return n_ret;
@@ -881,6 +901,7 @@ scan:
 		if (unlikely(--latency_ration < 0)) {
 			cond_resched();
 			latency_ration = LATENCY_LIMIT;
+			scanned_many = true;
 		}
 	}
 	offset = si->lowest_bit;
@@ -896,6 +917,7 @@ scan:
 		if (unlikely(--latency_ration < 0)) {
 			cond_resched();
 			latency_ration = LATENCY_LIMIT;
+			scanned_many = true;
 		}
 		offset++;
 	}
_
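
[Editor's note: below is a minimal, self-contained user-space C sketch of
the bounded extra scan described above.  It is not kernel code: the
function names, the toy swap map, and the hard stop when the budget runs
out are illustrative assumptions only, and the kernel's slower rescan and
cond_resched() handling are omitted.  It shows the core idea: after the
first free slot is found, keep collecting slots for batching, but only
while a fixed per-call budget (modelled on LATENCY_LIMIT) remains, so the
added latency stays bounded even on a fragmented device.]

#include <stdio.h>
#include <stddef.h>

#define LATENCY_LIMIT 256	/* illustrative per-call scan budget */

/*
 * Collect up to n_goal free slots from map[] (0 == free in this toy
 * model), probing at most LATENCY_LIMIT entries.  Returning fewer than
 * n_goal slots models the case where the budget runs out before the
 * batch is full.
 */
static int scan_free_slots(const unsigned char *map, size_t nr_slots,
			   size_t *out, int n_goal)
{
	int latency_ration = LATENCY_LIMIT;
	int n_ret = 0;

	for (size_t off = 0; off < nr_slots && n_ret < n_goal; off++) {
		if (--latency_ration <= 0)
			break;			/* budget spent: stop scanning */
		if (!map[off])
			out[n_ret++] = off;	/* batch another free slot */
	}
	return n_ret;
}

int main(void)
{
	unsigned char map[1024] = { 0 };
	size_t slots[8];
	int got;

	/* Fragment the map: mark every other slot as in use. */
	for (size_t i = 0; i < sizeof(map); i += 2)
		map[i] = 1;

	got = scan_free_slots(map, sizeof(map), slots, 8);
	printf("batched %d free slots within the latency budget\n", got);
	return 0;
}

[With the toy map above, every other slot is free, so all eight requested
slots are found well within the budget; making the map denser with used
slots shows the batch shrinking as the budget is consumed.]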