* [PATCH -mm -v5 RESEND] mm, swap: Fix race between swapoff and some swap operations
From: Huang, Ying @ 2018-02-13  1:42 UTC
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins,
	Paul E . McKenney, Minchan Kim, Johannes Weiner, Tim Chen,
	Shaohua Li, Mel Gorman, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara,
	Dave Jiang, Aaron Lu

From: Huang Ying <ying.huang@intel.com>

When swapin is performed, after getting the swap entry information
from the page table, the system will swap in the swap entry without
any lock held to prevent the swap device from being swapped off.  This
may cause a race like the one below:

CPU 1				CPU 2
-----				-----
				do_swap_page
				  swapin_readahead
				    __read_swap_cache_async
swapoff				      swapcache_prepare
  p->swap_map = NULL		        __swap_duplicate
					  p->swap_map[?] /* !!! NULL pointer access */

Because swapoff is usually done only at system shutdown, the race may
not hit many people in practice.  But it is still a race that needs to
be fixed.

To fix the race, get_swap_device() is added to check whether the
specified swap entry is valid in its swap device.  If so, it will keep
the swap entry valid by preventing the swap device from being swapped
off, until put_swap_device() is called.

Because swapoff() is a very rare code path, to make the normal paths
run as fast as possible, disabling preemption + stop_machine() is used
instead of a reference count to implement get/put_swap_device().  From
get_swap_device() to put_swap_device(), preemption is disabled, so
stop_machine() in swapoff() will wait until put_swap_device() is
called.

In addition to the swap_map, cluster_info, etc. data structures in
struct swap_info_struct, the swap cache radix tree is freed after
swapoff, so this patch also fixes the race between swap cache lookup
and swapoff.

Races between swapoff and some other swap cache usages that are
protected by disabling preemption are fixed too, by calling
stop_machine() between clearing PageSwapCache() and freeing the swap
cache data structures.
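
For illustration, below is a minimal caller-side sketch of the
get/put_swap_device() pattern described above.  The helpers are the
ones added by this patch; the wrapper function itself is made up for
illustration and is not part of the patch:

struct page *swap_cache_lookup_sketch(swp_entry_t entry)
{
        struct swap_info_struct *si;
        struct page *page;

        si = get_swap_device(entry);    /* disables preemption */
        if (!si)
                return NULL;            /* device already swapped off */
        page = find_get_page(swap_address_space(entry),
                             swp_offset(entry));
        put_swap_device(si);            /* re-enables preemption */
        return page;
}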

Cc: Hugh Dickins <hughd@google.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

Changelog:

v5:

- Replace RCU with stop_machine()

v4:

- Use synchronize_rcu() in enable_swap_info() to reduce overhead of
  normal paths further.

v3:

- Re-implemented with RCU to reduce the overhead of normal paths

v2:

- Re-implemented with SRCU to reduce the overhead of normal paths.

- Avoid checking whether the swap device has been swapped off in
  get_swap_device(), because we can check the origin of the swap
  entry to make sure the swap device hasn't been swapped off.
---
 include/linux/swap.h |  13 ++++--
 mm/memory.c          |   2 +-
 mm/swap_state.c      |  16 +++++--
 mm/swapfile.c        | 129 +++++++++++++++++++++++++++++++++++++++------------
 4 files changed, 123 insertions(+), 37 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a1a3f4ed94ce..b4462789b970 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -172,8 +172,9 @@ enum {
 	SWP_PAGE_DISCARD = (1 << 9),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 11),	/* synchronous IO is efficient */
+	SWP_VALID	= (1 << 12),	/* swap is valid to be operated on? */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 12),	/* refcount in scan_swap_map */
+	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
@@ -470,7 +471,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
-extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
+extern int __swap_count(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -480,6 +481,12 @@ extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
+extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
+
+static inline void put_swap_device(struct swap_info_struct *si)
+{
+	preempt_enable();
+}
 
 #else /* CONFIG_SWAP */
 
@@ -597,7 +604,7 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
-static inline int __swap_count(struct swap_info_struct *si, swp_entry_t entry)
+static inline int __swap_count(swp_entry_t entry)
 {
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 5ec6433d6a5c..9d84b4da6a35 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2938,7 +2938,7 @@ int do_swap_page(struct vm_fault *vmf)
 		struct swap_info_struct *si = swp_swap_info(entry);
 
 		if (si->flags & SWP_SYNCHRONOUS_IO &&
-				__swap_count(si, entry) == 1) {
+		    __swap_count(entry) == 1) {
 			/* skip swapcache */
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
 			if (page) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 39ae7cfad90f..9fd14ef5a922 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,8 +334,13 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
 	struct page *page;
 	unsigned long ra_info;
 	int win, hits, readahead;
+	struct swap_info_struct *si;
 
+	si = get_swap_device(entry);
+	if (!si)
+		return NULL;
 	page = find_get_page(swap_address_space(entry), swp_offset(entry));
+	put_swap_device(si);
 
 	INC_CACHE_INFO(find_total);
 	if (page) {
@@ -365,8 +370,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr,
 			bool *new_page_allocated)
 {
-	struct page *found_page, *new_page = NULL;
-	struct address_space *swapper_space = swap_address_space(entry);
+	struct page *found_page = NULL, *new_page = NULL;
+	struct swap_info_struct *si;
 	int err;
 	*new_page_allocated = false;
 
@@ -376,7 +381,12 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * called after lookup_swap_cache() failed, re-calling
 		 * that would confuse statistics.
 		 */
-		found_page = find_get_page(swapper_space, swp_offset(entry));
+		si = get_swap_device(entry);
+		if (!si)
+			break;
+		found_page = find_get_page(swap_address_space(entry),
+					   swp_offset(entry));
+		put_swap_device(si);
 		if (found_page)
 			break;
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0d00471af98b..01c28d244447 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -38,6 +38,7 @@
 #include <linux/export.h>
 #include <linux/swap_slots.h>
 #include <linux/sort.h>
+#include <linux/stop_machine.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -1107,6 +1108,41 @@ static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
 	return p;
 }
 
+/*
+ * Check whether swap entry is valid in the swap device.  If so,
+ * return pointer to swap_info_struct, and keep the swap entry valid
+ * via preventing the swap device from being swapoff, until
+ * put_swap_device() is called.  Otherwise return NULL.
+ */
+struct swap_info_struct *get_swap_device(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	unsigned long type, offset;
+
+	if (!entry.val)
+		goto out;
+	type = swp_type(entry);
+	if (type >= nr_swapfiles)
+		goto bad_nofile;
+	si = swap_info[type];
+
+	preempt_disable();
+	if (!(si->flags & SWP_VALID))
+		goto unlock_out;
+	offset = swp_offset(entry);
+	if (offset >= si->max)
+		goto unlock_out;
+
+	return si;
+bad_nofile:
+	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
+out:
+	return NULL;
+unlock_out:
+	preempt_enable();
+	return NULL;
+}
+
 static unsigned char __swap_entry_free(struct swap_info_struct *p,
 				       swp_entry_t entry, unsigned char usage)
 {
@@ -1328,11 +1364,18 @@ int page_swapcount(struct page *page)
 	return count;
 }
 
-int __swap_count(struct swap_info_struct *si, swp_entry_t entry)
+int __swap_count(swp_entry_t entry)
 {
+	struct swap_info_struct *si;
 	pgoff_t offset = swp_offset(entry);
+	int count = 0;
 
-	return swap_count(si->swap_map[offset]);
+	si = get_swap_device(entry);
+	if (si) {
+		count = swap_count(si->swap_map[offset]);
+		put_swap_device(si);
+	}
+	return count;
 }
 
 static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
@@ -1357,9 +1400,11 @@ int __swp_swapcount(swp_entry_t entry)
 	int count = 0;
 	struct swap_info_struct *si;
 
-	si = __swap_info_get(entry);
-	if (si)
+	si = get_swap_device(entry);
+	if (si) {
 		count = swap_swapcount(si, entry);
+		put_swap_device(si);
+	}
 	return count;
 }
 
@@ -2451,9 +2496,9 @@ static int swap_node(struct swap_info_struct *p)
 	return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
 }
 
-static void _enable_swap_info(struct swap_info_struct *p, int prio,
-				unsigned char *swap_map,
-				struct swap_cluster_info *cluster_info)
+static void setup_swap_info(struct swap_info_struct *p, int prio,
+			    unsigned char *swap_map,
+			    struct swap_cluster_info *cluster_info)
 {
 	int i;
 
@@ -2478,7 +2523,11 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
 	}
 	p->swap_map = swap_map;
 	p->cluster_info = cluster_info;
-	p->flags |= SWP_WRITEOK;
+}
+
+static void _enable_swap_info(struct swap_info_struct *p)
+{
+	p->flags |= SWP_WRITEOK | SWP_VALID;
 	atomic_long_add(p->pages, &nr_swap_pages);
 	total_swap_pages += p->pages;
 
@@ -2497,6 +2546,11 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
 	add_to_avail_list(p);
 }
 
+static int swap_onoff_stop(void *arg)
+{
+	return 0;
+}
+
 static void enable_swap_info(struct swap_info_struct *p, int prio,
 				unsigned char *swap_map,
 				struct swap_cluster_info *cluster_info,
@@ -2505,7 +2559,17 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
 	frontswap_init(p->type, frontswap_map);
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
-	 _enable_swap_info(p, prio, swap_map, cluster_info);
+	setup_swap_info(p, prio, swap_map, cluster_info);
+	spin_unlock(&p->lock);
+	spin_unlock(&swap_lock);
+	/*
+	 * Guarantee swap_map, cluster_info, etc. fields are used
+	 * between get/put_swap_device() only if SWP_VALID bit is set
+	 */
+	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
+	spin_lock(&swap_lock);
+	spin_lock(&p->lock);
+	_enable_swap_info(p);
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 }
@@ -2514,7 +2578,8 @@ static void reinsert_swap_info(struct swap_info_struct *p)
 {
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
-	_enable_swap_info(p, p->prio, p->swap_map, p->cluster_info);
+	setup_swap_info(p, p->prio, p->swap_map, p->cluster_info);
+	_enable_swap_info(p);
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 }
@@ -2617,6 +2682,17 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	reenable_swap_slots_cache_unlock();
 
+	spin_lock(&swap_lock);
+	spin_lock(&p->lock);
+	p->flags &= ~SWP_VALID;		/* mark swap device as invalid */
+	spin_unlock(&p->lock);
+	spin_unlock(&swap_lock);
+	/*
+	 * wait for swap operations protected by get/put_swap_device()
+	 * to complete
+	 */
+	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
+
 	flush_work(&p->discard_work);
 
 	destroy_swap_extents(p);
@@ -3356,22 +3432,16 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 {
 	struct swap_info_struct *p;
 	struct swap_cluster_info *ci;
-	unsigned long offset, type;
+	unsigned long offset;
 	unsigned char count;
 	unsigned char has_cache;
 	int err = -EINVAL;
 
-	if (non_swap_entry(entry))
+	p = get_swap_device(entry);
+	if (!p)
 		goto out;
 
-	type = swp_type(entry);
-	if (type >= nr_swapfiles)
-		goto bad_file;
-	p = swap_info[type];
 	offset = swp_offset(entry);
-	if (unlikely(offset >= p->max))
-		goto out;
-
 	ci = lock_cluster_or_swap_info(p, offset);
 
 	count = p->swap_map[offset];
@@ -3417,11 +3487,9 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 unlock_out:
 	unlock_cluster_or_swap_info(p, ci);
 out:
+	if (p)
+		put_swap_device(p);
 	return err;
-
-bad_file:
-	pr_err("swap_dup: %s%08lx\n", Bad_file, entry.val);
-	goto out;
 }
 
 /*
@@ -3513,6 +3581,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	struct page *list_page;
 	pgoff_t offset;
 	unsigned char count;
+	int ret = 0;
 
 	/*
 	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
@@ -3520,15 +3589,15 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	 */
 	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
 
-	si = swap_info_get(entry);
+	si = get_swap_device(entry);
 	if (!si) {
 		/*
 		 * An acceptable race has occurred since the failing
-		 * __swap_duplicate(): the swap entry has been freed,
-		 * perhaps even the whole swap_map cleared for swapoff.
+		 * __swap_duplicate(): the swap device may be swapoff
 		 */
 		goto outer;
 	}
+	spin_lock(&si->lock);
 
 	offset = swp_offset(entry);
 
@@ -3546,9 +3615,8 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	if (!page) {
-		unlock_cluster(ci);
-		spin_unlock(&si->lock);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	/*
@@ -3600,10 +3668,11 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 out:
 	unlock_cluster(ci);
 	spin_unlock(&si->lock);
+	put_swap_device(si);
 outer:
 	if (page)
 		__free_page(page);
-	return 0;
+	return ret;
 }
 
 /*
-- 
2.15.1

* Re: [PATCH -mm -v5 RESEND] mm, swap: Fix race between swapoff and some swap operations
From: Andrew Morton @ 2018-02-13 23:41 UTC
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Hugh Dickins, Paul E . McKenney,
	Minchan Kim, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Tue, 13 Feb 2018 09:42:20 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> From: Huang Ying <ying.huang@intel.com>
> 
> When the swapin is performed, after getting the swap entry information
> from the page table, system will swap in the swap entry, without any
> lock held to prevent the swap device from being swapoff.  This may
> cause the race like below,

Sigh.  In terms of putting all the work into the swapoff path and
avoiding overheads in the hot paths, I guess this is about as good as
it will get.

It's a very low-priority fix so I'd prefer to keep the patch in -mm
until Hugh has had an opportunity to think about it.

> ...
>  
> +/*
> + * Check whether swap entry is valid in the swap device.  If so,
> + * return pointer to swap_info_struct, and keep the swap entry valid
> + * via preventing the swap device from being swapoff, until
> + * put_swap_device() is called.  Otherwise return NULL.
> + */
> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> +	struct swap_info_struct *si;
> +	unsigned long type, offset;
> +
> +	if (!entry.val)
> +		goto out;
> +	type = swp_type(entry);
> +	if (type >= nr_swapfiles)
> +		goto bad_nofile;
> +	si = swap_info[type];
> +
> +	preempt_disable();

This preempt_disable() is later than I'd expect.  If a well-timed race
occurs, `si' could now be pointing at a defunct entry.  If that
well-timed race include a swapoff AND a swapon, `si' could be pointing
at the info for a new device?

> +	if (!(si->flags & SWP_VALID))
> +		goto unlock_out;
> +	offset = swp_offset(entry);
> +	if (offset >= si->max)
> +		goto unlock_out;
> +
> +	return si;
> +bad_nofile:
> +	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
> +out:
> +	return NULL;
> +unlock_out:
> +	preempt_enable();
> +	return NULL;
> +}

* Re: [PATCH -mm -v5 RESEND] mm, swap: Fix race between swapoff and some swap operations
From: Huang, Ying @ 2018-02-14  0:38 UTC
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Hugh Dickins, Paul E . McKenney,
	Minchan Kim, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	jglisse, Michal Hocko, Andrea Arcangeli, David Rientjes,
	Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

Andrew Morton <akpm@linux-foundation.org> writes:

> On Tue, 13 Feb 2018 09:42:20 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> When the swapin is performed, after getting the swap entry information
>> from the page table, system will swap in the swap entry, without any
>> lock held to prevent the swap device from being swapoff.  This may
>> cause the race like below,
>
> Sigh.  In terms of putting all the work into the swapoff path and
> avoiding overheads in the hot paths, I guess this is about as good as
> it will get.
>
> It's a very low-priority fix so I'd prefer to keep the patch in -mm
> until Hugh has had an opportunity to think about it.
>
>> ...
>>  
>> +/*
>> + * Check whether swap entry is valid in the swap device.  If so,
>> + * return pointer to swap_info_struct, and keep the swap entry valid
>> + * via preventing the swap device from being swapoff, until
>> + * put_swap_device() is called.  Otherwise return NULL.
>> + */
>> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> +{
>> +	struct swap_info_struct *si;
>> +	unsigned long type, offset;
>> +
>> +	if (!entry.val)
>> +		goto out;
>> +	type = swp_type(entry);
>> +	if (type >= nr_swapfiles)
>> +		goto bad_nofile;
>> +	si = swap_info[type];
>> +
>> +	preempt_disable();
>
> This preempt_disable() is later than I'd expect.  If a well-timed race
> occurs, `si' could now be pointing at a defunct entry.  If that
> well-timed race include a swapoff AND a swapon, `si' could be pointing
> at the info for a new device?

struct swap_info_struct pointed to by swap_info[] will never be freed.
During swapoff, we only free the memory pointed to by the fields of
struct swap_info_struct.  And when swapon, we will always reuse
swap_info[type] if it's not NULL.  So it should be safe to dereference
swap_info[type] with preemption enabled.
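
For illustration, a simplified sketch of that lifetime rule is below
(the helper name is made up and this is not the real alloc_swap_info()
code, just the behavior described above):

static struct swap_info_struct *swap_info[MAX_SWAPFILES];
static unsigned int nr_swapfiles;       /* only ever grows */

static struct swap_info_struct *swap_slot_get_or_alloc(void)
{
        unsigned int type;

        for (type = 0; type < nr_swapfiles; type++) {
                /* a swapped-off slot is reused, never freed */
                if (!(swap_info[type]->flags & SWP_USED))
                        return swap_info[type];
        }
        if (type >= MAX_SWAPFILES)
                return NULL;
        swap_info[type] = kvzalloc(sizeof(struct swap_info_struct),
                                   GFP_KERNEL);
        if (swap_info[type])
                nr_swapfiles++;
        return swap_info[type];
}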

Best Regards,
Huang, Ying

>> +	if (!(si->flags & SWP_VALID))
>> +		goto unlock_out;
>> +	offset = swp_offset(entry);
>> +	if (offset >= si->max)
>> +		goto unlock_out;
>> +
>> +	return si;
>> +bad_nofile:
>> +	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
>> +out:
>> +	return NULL;
>> +unlock_out:
>> +	preempt_enable();
>> +	return NULL;
>> +}

* Re: [PATCH -mm -v5 RESEND] mm, swap: Fix race between swapoff and some swap operations
From: Andrew Morton @ 2018-02-16 23:38 UTC
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Hugh Dickins, Paul E . McKenney,
	Minchan Kim, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	jglisse, Michal Hocko, Andrea Arcangeli, David Rientjes,
	Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Wed, 14 Feb 2018 08:38:00 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> Andrew Morton <akpm@linux-foundation.org> writes:
> 
> > On Tue, 13 Feb 2018 09:42:20 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
> >
> >> From: Huang Ying <ying.huang@intel.com>
> >> 
> >> When the swapin is performed, after getting the swap entry information
> >> from the page table, system will swap in the swap entry, without any
> >> lock held to prevent the swap device from being swapoff.  This may
> >> cause the race like below,
> >
> > Sigh.  In terms of putting all the work into the swapoff path and
> > avoiding overheads in the hot paths, I guess this is about as good as
> > it will get.
> >
> > It's a very low-priority fix so I'd prefer to keep the patch in -mm
> > until Hugh has had an opportunity to think about it.
> >
> >> ...
> >>  
> >> +/*
> >> + * Check whether swap entry is valid in the swap device.  If so,
> >> + * return pointer to swap_info_struct, and keep the swap entry valid
> >> + * via preventing the swap device from being swapoff, until
> >> + * put_swap_device() is called.  Otherwise return NULL.
> >> + */
> >> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
> >> +{
> >> +	struct swap_info_struct *si;
> >> +	unsigned long type, offset;
> >> +
> >> +	if (!entry.val)
> >> +		goto out;
> >> +	type = swp_type(entry);
> >> +	if (type >= nr_swapfiles)
> >> +		goto bad_nofile;
> >> +	si = swap_info[type];
> >> +
> >> +	preempt_disable();
> >
> > This preempt_disable() is later than I'd expect.  If a well-timed race
> > occurs, `si' could now be pointing at a defunct entry.  If that
> > well-timed race include a swapoff AND a swapon, `si' could be pointing
> > at the info for a new device?
> 
> struct swap_info_struct pointed to by swap_info[] will never be freed.
> During swapoff, we only free the memory pointed to by the fields of
> struct swap_info_struct.  And when swapon, we will always reuse
> swap_info[type] if it's not NULL.  So it should be safe to dereference
> swap_info[type] with preemption enabled.

That's my point.  If there's a race window during which there is a
parallel swapoff+swapon, this swap_info_struct may now be in use for a
different device?

* Re: [PATCH -mm -v5 RESEND] mm, swap: Fix race between swapoff and some swap operations
From: huang ying @ 2018-02-18  1:06 UTC
  To: Andrew Morton
  Cc: Huang, Ying, linux-mm, LKML, Hugh Dickins, Paul E . McKenney,
	Minchan Kim, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	jglisse, Michal Hocko, Andrea Arcangeli, David Rientjes,
	Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Sat, Feb 17, 2018 at 7:38 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 14 Feb 2018 08:38:00 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> Andrew Morton <akpm@linux-foundation.org> writes:
>>
>> > On Tue, 13 Feb 2018 09:42:20 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>> >
>> >> From: Huang Ying <ying.huang@intel.com>
>> >>
>> >> When the swapin is performed, after getting the swap entry information
>> >> from the page table, system will swap in the swap entry, without any
>> >> lock held to prevent the swap device from being swapoff.  This may
>> >> cause the race like below,
>> >
>> > Sigh.  In terms of putting all the work into the swapoff path and
>> > avoiding overheads in the hot paths, I guess this is about as good as
>> > it will get.
>> >
>> > It's a very low-priority fix so I'd prefer to keep the patch in -mm
>> > until Hugh has had an opportunity to think about it.
>> >
>> >> ...
>> >>
>> >> +/*
>> >> + * Check whether swap entry is valid in the swap device.  If so,
>> >> + * return pointer to swap_info_struct, and keep the swap entry valid
>> >> + * via preventing the swap device from being swapoff, until
>> >> + * put_swap_device() is called.  Otherwise return NULL.
>> >> + */
>> >> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> >> +{
>> >> +  struct swap_info_struct *si;
>> >> +  unsigned long type, offset;
>> >> +
>> >> +  if (!entry.val)
>> >> +          goto out;
>> >> +  type = swp_type(entry);
>> >> +  if (type >= nr_swapfiles)
>> >> +          goto bad_nofile;
>> >> +  si = swap_info[type];
>> >> +
>> >> +  preempt_disable();
>> >
>> > This preempt_disable() is later than I'd expect.  If a well-timed race
>> > occurs, `si' could now be pointing at a defunct entry.  If that
>> > well-timed race include a swapoff AND a swapon, `si' could be pointing
>> > at the info for a new device?
>>
>> struct swap_info_struct pointed to by swap_info[] will never be freed.
>> During swapoff, we only free the memory pointed to by the fields of
>> struct swap_info_struct.  And when swapon, we will always reuse
>> swap_info[type] if it's not NULL.  So it should be safe to dereference
>> swap_info[type] with preemption enabled.
>
> That's my point.  If there's a race window during which there is a
> parallel swapoff+swapon, this swap_info_struct may now be in use for a
> different device?

Yes.  It's possible.  And the caller of get_swap_device() can live
with it if the swap_info_struct has been fully initialized.  For
example, for the race in the patch description,

do_swap_page
  swapin_readahead
    __read_swap_cache_async
      swapcache_prepare
        __swap_duplicate

in __swap_duplicate(), it's possible that the swap device returned by
get_swap_device() is different from the swap device the entry referred
to when __swap_duplicate() called get_swap_device().  But the
swap_info_struct has been fully initialized, so __swap_duplicate() can
reference si->swap_map[] safely.  And we will check si->swap_map[]
before any further operation.  Even if the swap entry has been swapped
out again to the new swap device, we will check the page table again
in do_swap_page().  So there is no functionality problem.

Best Regards,
Huang, Ying

* Re: [PATCH -mm -v5 RESEND] mm, swap: Fix race between swapoff and some swap operations
From: Andrew Morton @ 2018-02-20 23:38 UTC
  To: huang ying
  Cc: Huang, Ying, linux-mm, LKML, Hugh Dickins, Paul E . McKenney,
	Minchan Kim, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	jglisse, Michal Hocko, Andrea Arcangeli, David Rientjes,
	Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Sun, 18 Feb 2018 09:06:47 +0800 huang ying <huang.ying.caritas@gmail.com> wrote:

> >> >> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
> >> >> +{
> >> >> +  struct swap_info_struct *si;
> >> >> +  unsigned long type, offset;
> >> >> +
> >> >> +  if (!entry.val)
> >> >> +          goto out;
> >> >> +  type = swp_type(entry);
> >> >> +  if (type >= nr_swapfiles)
> >> >> +          goto bad_nofile;
> >> >> +  si = swap_info[type];
> >> >> +
> >> >> +  preempt_disable();
> >> >
> >> > This preempt_disable() is later than I'd expect.  If a well-timed race
> >> > occurs, `si' could now be pointing at a defunct entry.  If that
> >> > well-timed race include a swapoff AND a swapon, `si' could be pointing
> >> > at the info for a new device?
> >>
> >> struct swap_info_struct pointed to by swap_info[] will never be freed.
> >> During swapoff, we only free the memory pointed to by the fields of
> >> struct swap_info_struct.  And when swapon, we will always reuse
> >> swap_info[type] if it's not NULL.  So it should be safe to dereference
> >> swap_info[type] with preemption enabled.
> >
> > That's my point.  If there's a race window during which there is a
> > parallel swapoff+swapon, this swap_info_struct may now be in use for a
> > different device?
> 
> Yes.  It's possible.  And the caller of get_swap_device() can live
> with it if the swap_info_struct has been fully initialized.  For
> example, for the race in the patch description,
> 
> do_swap_page
>   swapin_readahead
>     __read_swap_cache_async
>       swapcache_prepare
>         __swap_duplicate
> 
> in __swap_duplicate(), it's possible that the swap device returned by
> get_swap_device() is different from the swap device when
> __swap_duplicate() call get_swap_device().  But the struct_info_struct
> has been fully initialized, so __swap_duplicate() can reference
> si->swap_map[] safely.  And we will check si->swap_map[] before any
> further operation.  Even if the swap entry is swapped out again for
> the new swap device, we will check the page table again in
> do_swap_page().  So there is no functionality problem.

That's rather revolting.  Can we tighten this up?  Or at least very
loudly document it?

* Re: [PATCH -mm -v5 RESEND] mm, swap: Fix race between swapoff and some swap operations
From: huang ying @ 2018-02-21  6:28 UTC
  To: Andrew Morton
  Cc: Huang, Ying, linux-mm, LKML, Hugh Dickins, Paul E . McKenney,
	Minchan Kim, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	jglisse, Michal Hocko, Andrea Arcangeli, David Rientjes,
	Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Wed, Feb 21, 2018 at 7:38 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Sun, 18 Feb 2018 09:06:47 +0800 huang ying <huang.ying.caritas@gmail.com> wrote:
>
>> >> >> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> >> >> +{
>> >> >> +  struct swap_info_struct *si;
>> >> >> +  unsigned long type, offset;
>> >> >> +
>> >> >> +  if (!entry.val)
>> >> >> +          goto out;
>> >> >> +  type = swp_type(entry);
>> >> >> +  if (type >= nr_swapfiles)
>> >> >> +          goto bad_nofile;
>> >> >> +  si = swap_info[type];
>> >> >> +
>> >> >> +  preempt_disable();
>> >> >
>> >> > This preempt_disable() is later than I'd expect.  If a well-timed race
>> >> > occurs, `si' could now be pointing at a defunct entry.  If that
>> >> > well-timed race include a swapoff AND a swapon, `si' could be pointing
>> >> > at the info for a new device?
>> >>
>> >> struct swap_info_struct pointed to by swap_info[] will never be freed.
>> >> During swapoff, we only free the memory pointed to by the fields of
>> >> struct swap_info_struct.  And when swapon, we will always reuse
>> >> swap_info[type] if it's not NULL.  So it should be safe to dereference
>> >> swap_info[type] with preemption enabled.
>> >
>> > That's my point.  If there's a race window during which there is a
>> > parallel swapoff+swapon, this swap_info_struct may now be in use for a
>> > different device?
>>
>> Yes.  It's possible.  And the caller of get_swap_device() can live
>> with it if the swap_info_struct has been fully initialized.  For
>> example, for the race in the patch description,
>>
>> do_swap_page
>>   swapin_readahead
>>     __read_swap_cache_async
>>       swapcache_prepare
>>         __swap_duplicate
>>
>> in __swap_duplicate(), it's possible that the swap device returned by
>> get_swap_device() is different from the swap device when
>> __swap_duplicate() call get_swap_device().  But the struct_info_struct
>> has been fully initialized, so __swap_duplicate() can reference
>> si->swap_map[] safely.  And we will check si->swap_map[] before any
>> further operation.  Even if the swap entry is swapped out again for
>> the new swap device, we will check the page table again in
>> do_swap_page().  So there is no functionality problem.
>
> That's rather revolting.  Can we tighten this up?  Or at least very
> loudly document it?

TBH, I think my original fix patch, which uses a reference count in
swap_info_struct, is easier to understand.  But I understand it has
its own drawbacks too.  Anyway, unless there are better ideas to
resolve this, I will send out a new version with more documentation.
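
Roughly, that reference count approach would look something like the
sketch below (the "users" field and the wait queue are hypothetical,
only to show the idea and the extra atomic operations it would put on
the hot path):

/*
 * Hypothetical fields added to struct swap_info_struct for this sketch:
 *      atomic_t users;
 *      wait_queue_head_t swapoff_wait;
 */
struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
        struct swap_info_struct *si = swp_swap_info(entry);

        /* one atomic op per swap operation: the cost stop_machine() avoids */
        if (!si || !atomic_inc_not_zero(&si->users))
                return NULL;            /* device is being swapped off */
        return si;
}

void put_swap_device(struct swap_info_struct *si)
{
        if (atomic_dec_and_test(&si->users))
                wake_up(&si->swapoff_wait);     /* let swapoff() finish */
}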

Best Regards,
Huang, Ying
