From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, "Huang, Ying" <ying.huang@intel.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Shaohua Li <shli@kernel.org>, Tim Chen <tim.c.chen@intel.com>,
	Michal Hocko <mhocko@suse.com>, Aaron Lu <aaron.lu@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Andi Kleen <ak@linux.intel.com>, Minchan Kim <minchan@kernel.org>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.13 22/36] mm, swap: fix race between swap count continuation operations
Date: Mon,  6 Nov 2017 10:12:35 +0100
Message-ID: <20171106085048.051121879@linuxfoundation.org>
In-Reply-To: <20171106085047.005824077@linuxfoundation.org>

4.13-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Huang Ying <ying.huang@intel.com>

commit 2628bd6fc052bd85e9864dae4de494d8a6313391 upstream.

A single page may hold entries of sis->swap_map
(swap_info_struct->swap_map) that belong to multiple swap clusters.

If some of those entries have sis->swap_map[offset] > SWAP_MAP_MAX,
additional pages are used to store the rest of those counts, and the
pages are linked together via page->lru.  This is called swap count
continuation.  Previously, sis->lock was used to serialize concurrent
access to the continuation pages.  But to improve the scalability of
__swap_duplicate(), swap_count_continued() may now take only the swap
cluster lock.  This can race with add_swap_count_continuation() operating
on a nearby swap cluster whose sis->swap_map entries are stored in the
same page.

In practice the race can produce a wrong swap count, leading to
unfreeable swap entries, software lockups, and similar problems.
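
For illustration only, here is a minimal user-space toy model of the
problem (not kernel code; every name in it is made up).  Two "clusters"
each have their own lock, yet both update one shared structure standing
in for the continuation page list, so per-cluster locking alone loses
updates:

/*
 * Toy model, build with: cc -O2 -pthread race_toy.c
 * The shared counter stands in for the continuation page list that
 * hangs off a single page shared by both clusters.
 */
#include <pthread.h>
#include <stdio.h>

#define ITERS 1000000

static pthread_mutex_t cluster_lock[2] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static long shared_cont_state;

static void *worker(void *arg)
{
	int cluster = (int)(long)arg;
	int i;

	for (i = 0; i < ITERS; i++) {
		pthread_mutex_lock(&cluster_lock[cluster]);
		shared_cont_state++;	/* racy: the other thread holds a different lock */
		pthread_mutex_unlock(&cluster_lock[cluster]);
	}
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, worker, (void *)0L);
	pthread_create(&t1, NULL, worker, (void *)1L);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);

	/* Typically prints a value below 2000000: updates were lost. */
	printf("expected %d, got %ld\n", 2 * ITERS, shared_cont_state);
	return 0;
}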

To fix the race, a new spinlock called cont_lock is added to struct
swap_info_struct to protect the swap count continuation page list.  It
is a per-swap-device lock, so its scalability is not great, but it is
still much better than the original sis->lock because it is only
acquired/released when swap count continuation is actually used, which
is rare in practice.  If scalability turns out to be a problem for some
workloads, the lock can be split into finer-grained locks later.
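
Continuing the toy model sketched above (again purely illustrative, not
the kernel's code), the analogue of the new cont_lock is a single lock
shared by both workers, taken inside the per-cluster lock; replacing the
worker with the version below makes the program print the expected
total:

/* One device-wide lock, analogous to the new cont_lock. */
static pthread_mutex_t cont_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
	int cluster = (int)(long)arg;
	int i;

	for (i = 0; i < ITERS; i++) {
		pthread_mutex_lock(&cluster_lock[cluster]);
		pthread_mutex_lock(&cont_lock);	/* nested inside the per-cluster lock */
		shared_cont_state++;		/* now serialized across clusters */
		pthread_mutex_unlock(&cont_lock);
		pthread_mutex_unlock(&cluster_lock[cluster]);
	}
	return NULL;
}

In the real patch below, cont_lock is likewise nested inside the
existing locks and released before them.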

Link: http://lkml.kernel.org/r/20171017081320.28133-1-ying.huang@intel.com
Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 include/linux/swap.h |    4 ++++
 mm/swapfile.c        |   23 +++++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -246,6 +246,10 @@ struct swap_info_struct {
 					 * both locks need hold, hold swap_lock
 					 * first.
 					 */
+	spinlock_t cont_lock;		/*
+					 * protect swap count continuation page
+					 * list.
+					 */
 	struct work_struct discard_work; /* discard worker */
 	struct swap_cluster_list discard_clusters; /* discard clusters list */
 };
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2635,6 +2635,7 @@ static struct swap_info_struct *alloc_sw
 	p->flags = SWP_USED;
 	spin_unlock(&swap_lock);
 	spin_lock_init(&p->lock);
+	spin_lock_init(&p->cont_lock);
 
 	return p;
 }
@@ -3307,6 +3308,7 @@ int add_swap_count_continuation(swp_entr
 	head = vmalloc_to_page(si->swap_map + offset);
 	offset &= ~PAGE_MASK;
 
+	spin_lock(&si->cont_lock);
 	/*
 	 * Page allocation does not initialize the page's lru field,
 	 * but it does always reset its private field.
@@ -3326,7 +3328,7 @@ int add_swap_count_continuation(swp_entr
 		 * a continuation page, free our allocation and use this one.
 		 */
 		if (!(count & COUNT_CONTINUED))
-			goto out;
+			goto out_unlock_cont;
 
 		map = kmap_atomic(list_page) + offset;
 		count = *map;
@@ -3337,11 +3339,13 @@ int add_swap_count_continuation(swp_entr
 		 * free our allocation and use this one.
 		 */
 		if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
-			goto out;
+			goto out_unlock_cont;
 	}
 
 	list_add_tail(&page->lru, &head->lru);
 	page = NULL;			/* now it's attached, don't free it */
+out_unlock_cont:
+	spin_unlock(&si->cont_lock);
 out:
 	unlock_cluster(ci);
 	spin_unlock(&si->lock);
@@ -3366,6 +3370,7 @@ static bool swap_count_continued(struct
 	struct page *head;
 	struct page *page;
 	unsigned char *map;
+	bool ret;
 
 	head = vmalloc_to_page(si->swap_map + offset);
 	if (page_private(head) != SWP_CONTINUED) {
@@ -3373,6 +3378,7 @@ static bool swap_count_continued(struct
 		return false;		/* need to add count continuation */
 	}
 
+	spin_lock(&si->cont_lock);
 	offset &= ~PAGE_MASK;
 	page = list_entry(head->lru.next, struct page, lru);
 	map = kmap_atomic(page) + offset;
@@ -3393,8 +3399,10 @@ static bool swap_count_continued(struct
 		if (*map == SWAP_CONT_MAX) {
 			kunmap_atomic(map);
 			page = list_entry(page->lru.next, struct page, lru);
-			if (page == head)
-				return false;	/* add count continuation */
+			if (page == head) {
+				ret = false;	/* add count continuation */
+				goto out;
+			}
 			map = kmap_atomic(page) + offset;
 init_map:		*map = 0;		/* we didn't zero the page */
 		}
@@ -3407,7 +3415,7 @@ init_map:		*map = 0;		/* we didn't zero
 			kunmap_atomic(map);
 			page = list_entry(page->lru.prev, struct page, lru);
 		}
-		return true;			/* incremented */
+		ret = true;			/* incremented */
 
 	} else {				/* decrementing */
 		/*
@@ -3433,8 +3441,11 @@ init_map:		*map = 0;		/* we didn't zero
 			kunmap_atomic(map);
 			page = list_entry(page->lru.prev, struct page, lru);
 		}
-		return count == COUNT_CONTINUED;
+		ret = count == COUNT_CONTINUED;
 	}
+out:
+	spin_unlock(&si->cont_lock);
+	return ret;
 }
 
 /*
