linux-kernel.vger.kernel.org archive mirror
* [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
@ 2017-12-20  1:26 Huang, Ying
  2017-12-21  2:16 ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2017-12-20  1:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins,
	Paul E . McKenney, Minchan Kim, Johannes Weiner, Tim Chen,
	Shaohua Li, Mel Gorman, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara,
	Dave Jiang, Aaron Lu

From: Huang Ying <ying.huang@intel.com>

When swapin is performed, after getting the swap entry information
from the page table, the system will swap in the swap entry without
any lock held to prevent the swap device from being swapped off.  This
may cause a race like the one below:

CPU 1				CPU 2
-----				-----
				do_swap_page
				  swapin_readahead
				    __read_swap_cache_async
swapoff				      swapcache_prepare
  p->swap_map = NULL		        __swap_duplicate
					  p->swap_map[?] /* !!! NULL pointer access */

Because swapoff is usually done only at system shutdown, the race may
not hit many people in practice.  But it is still a race that needs to
be fixed.

To fix the race, get_swap_device() is added to check whether the
specified swap entry is valid in its swap device.  If so, it keeps the
swap entry valid by preventing the swap device from being swapped off,
until put_swap_device() is called.

Because swapoff() is a very rare code path, to make the normal paths
run as fast as possible, RCU instead of a reference count is used to
implement get/put_swap_device().  The RCU read lock is held from
get_swap_device() to put_swap_device(), so synchronize_rcu() in
swapoff() will wait until put_swap_device() is called.
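
The caller side then looks like below (a minimal sketch only;
swap_entry_in_use() is a made-up example, __swap_count() in this patch
shows the same pattern in real code),

	/* Sketch only: check whether a swap entry is still in use. */
	static bool swap_entry_in_use(swp_entry_t entry)
	{
		struct swap_info_struct *si;
		bool in_use = false;

		si = get_swap_device(entry);	/* rcu_read_lock() + validity checks */
		if (si) {
			/* swap_map, cluster_info, etc. cannot be freed by swapoff here */
			in_use = si->swap_map[swp_offset(entry)] != 0;
			put_swap_device(si);	/* rcu_read_unlock() */
		}
		return in_use;
	}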

In addition to the swap_map, cluster_info, etc. data structures in
struct swap_info_struct, the swap cache radix tree will be freed after
swapoff, so this patch fixes the race between swap cache lookup and
swapoff too.

Cc: Hugh Dickins <hughd@google.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

Changelog:

v4:

- Use synchronize_rcu() in enable_swap_info() to reduce overhead of
  normal paths further.

v3:

- Re-implemented with RCU to reduce the overhead of normal paths

v2:

- Re-implemented with SRCU to reduce the overhead of normal paths.

- Avoid checking whether the swap device has been swapped off in
  get_swap_device(), because we can check the origin of the swap entry
  to make sure the swap device hasn't been swapped off.
---
 include/linux/swap.h |  11 ++++-
 mm/memory.c          |   2 +-
 mm/swap_state.c      |  16 +++++--
 mm/swapfile.c        | 123 ++++++++++++++++++++++++++++++++++++++-------------
 4 files changed, 116 insertions(+), 36 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..f7e8f26cf07f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -172,8 +172,9 @@ enum {
 	SWP_PAGE_DISCARD = (1 << 9),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 11),	/* synchronous IO is efficient */
+	SWP_VALID	= (1 << 12),	/* swap is valid to be operated on? */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 12),	/* refcount in scan_swap_map */
+	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
@@ -460,7 +461,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
-extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
+extern int __swap_count(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -470,6 +471,12 @@ extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
+extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
+
+static inline void put_swap_device(struct swap_info_struct *si)
+{
+	rcu_read_unlock();
+}
 
 #else /* CONFIG_SWAP */
 
diff --git a/mm/memory.c b/mm/memory.c
index 1a969992f76b..77a7d6191218 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2909,7 +2909,7 @@ int do_swap_page(struct vm_fault *vmf)
 		struct swap_info_struct *si = swp_swap_info(entry);
 
 		if (si->flags & SWP_SYNCHRONOUS_IO &&
-				__swap_count(si, entry) == 1) {
+		    __swap_count(entry) == 1) {
 			/* skip swapcache */
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 							vmf->address);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0b8ae361981f..8dde719e973c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -337,8 +337,13 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
 			       unsigned long addr)
 {
 	struct page *page;
+	struct swap_info_struct *si;
 
+	si = get_swap_device(entry);
+	if (!si)
+		return NULL;
 	page = find_get_page(swap_address_space(entry), swp_offset(entry));
+	put_swap_device(si);
 
 	INC_CACHE_INFO(find_total);
 	if (page) {
@@ -376,8 +381,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr,
 			bool *new_page_allocated)
 {
-	struct page *found_page, *new_page = NULL;
-	struct address_space *swapper_space = swap_address_space(entry);
+	struct page *found_page = NULL, *new_page = NULL;
+	struct swap_info_struct *si;
 	int err;
 	*new_page_allocated = false;
 
@@ -387,7 +392,12 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * called after lookup_swap_cache() failed, re-calling
 		 * that would confuse statistics.
 		 */
-		found_page = find_get_page(swapper_space, swp_offset(entry));
+		si = get_swap_device(entry);
+		if (!si)
+			break;
+		found_page = find_get_page(swap_address_space(entry),
+					   swp_offset(entry));
+		put_swap_device(si);
 		if (found_page)
 			break;
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 42fe5653814a..881515a59f95 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1107,6 +1107,41 @@ static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
 	return p;
 }
 
+/*
+ * Check whether swap entry is valid in the swap device.  If so,
+ * return pointer to swap_info_struct, and keep the swap entry valid
+ * via preventing the swap device from being swapoff, until
+ * put_swap_device() is called.  Otherwise return NULL.
+ */
+struct swap_info_struct *get_swap_device(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	unsigned long type, offset;
+
+	if (!entry.val)
+		goto out;
+	type = swp_type(entry);
+	if (type >= nr_swapfiles)
+		goto bad_nofile;
+	si = swap_info[type];
+
+	rcu_read_lock();
+	if (!(si->flags & SWP_VALID))
+		goto unlock_out;
+	offset = swp_offset(entry);
+	if (offset >= si->max)
+		goto unlock_out;
+
+	return si;
+bad_nofile:
+	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
+out:
+	return NULL;
+unlock_out:
+	rcu_read_unlock();
+	return NULL;
+}
+
 static unsigned char __swap_entry_free(struct swap_info_struct *p,
 				       swp_entry_t entry, unsigned char usage)
 {
@@ -1328,11 +1363,18 @@ int page_swapcount(struct page *page)
 	return count;
 }
 
-int __swap_count(struct swap_info_struct *si, swp_entry_t entry)
+int __swap_count(swp_entry_t entry)
 {
+	struct swap_info_struct *si;
 	pgoff_t offset = swp_offset(entry);
+	int count = 0;
 
-	return swap_count(si->swap_map[offset]);
+	si = get_swap_device(entry);
+	if (si) {
+		count = swap_count(si->swap_map[offset]);
+		put_swap_device(si);
+	}
+	return count;
 }
 
 static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
@@ -1357,9 +1399,11 @@ int __swp_swapcount(swp_entry_t entry)
 	int count = 0;
 	struct swap_info_struct *si;
 
-	si = __swap_info_get(entry);
-	if (si)
+	si = get_swap_device(entry);
+	if (si) {
 		count = swap_swapcount(si, entry);
+		put_swap_device(si);
+	}
 	return count;
 }
 
@@ -2451,9 +2495,9 @@ static int swap_node(struct swap_info_struct *p)
 	return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
 }
 
-static void _enable_swap_info(struct swap_info_struct *p, int prio,
-				unsigned char *swap_map,
-				struct swap_cluster_info *cluster_info)
+static void setup_swap_info(struct swap_info_struct *p, int prio,
+			    unsigned char *swap_map,
+			    struct swap_cluster_info *cluster_info)
 {
 	int i;
 
@@ -2478,7 +2522,11 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
 	}
 	p->swap_map = swap_map;
 	p->cluster_info = cluster_info;
-	p->flags |= SWP_WRITEOK;
+}
+
+static void _enable_swap_info(struct swap_info_struct *p)
+{
+	p->flags |= SWP_WRITEOK | SWP_VALID;
 	atomic_long_add(p->pages, &nr_swap_pages);
 	total_swap_pages += p->pages;
 
@@ -2505,7 +2553,17 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
 	frontswap_init(p->type, frontswap_map);
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
-	 _enable_swap_info(p, prio, swap_map, cluster_info);
+	setup_swap_info(p, prio, swap_map, cluster_info);
+	spin_unlock(&p->lock);
+	spin_unlock(&swap_lock);
+	/*
+	 * Guarantee swap_map, cluster_info, etc. fields are used
+	 * between get/put_swap_device() only if SWP_VALID bit is set
+	 */
+	synchronize_rcu();
+	spin_lock(&swap_lock);
+	spin_lock(&p->lock);
+	_enable_swap_info(p);
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 }
@@ -2514,7 +2572,8 @@ static void reinsert_swap_info(struct swap_info_struct *p)
 {
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
-	_enable_swap_info(p, p->prio, p->swap_map, p->cluster_info);
+	setup_swap_info(p, p->prio, p->swap_map, p->cluster_info);
+	_enable_swap_info(p);
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 }
@@ -2617,6 +2676,17 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	reenable_swap_slots_cache_unlock();
 
+	spin_lock(&swap_lock);
+	spin_lock(&p->lock);
+	p->flags &= ~SWP_VALID;		/* mark swap device as invalid */
+	spin_unlock(&p->lock);
+	spin_unlock(&swap_lock);
+	/*
+	 * wait for swap operations protected by get/put_swap_device()
+	 * to complete
+	 */
+	synchronize_rcu();
+
 	flush_work(&p->discard_work);
 
 	destroy_swap_extents(p);
@@ -3356,22 +3426,16 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 {
 	struct swap_info_struct *p;
 	struct swap_cluster_info *ci;
-	unsigned long offset, type;
+	unsigned long offset;
 	unsigned char count;
 	unsigned char has_cache;
 	int err = -EINVAL;
 
-	if (non_swap_entry(entry))
+	p = get_swap_device(entry);
+	if (!p)
 		goto out;
 
-	type = swp_type(entry);
-	if (type >= nr_swapfiles)
-		goto bad_file;
-	p = swap_info[type];
 	offset = swp_offset(entry);
-	if (unlikely(offset >= p->max))
-		goto out;
-
 	ci = lock_cluster_or_swap_info(p, offset);
 
 	count = p->swap_map[offset];
@@ -3417,11 +3481,9 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 unlock_out:
 	unlock_cluster_or_swap_info(p, ci);
 out:
+	if (p)
+		put_swap_device(p);
 	return err;
-
-bad_file:
-	pr_err("swap_dup: %s%08lx\n", Bad_file, entry.val);
-	goto out;
 }
 
 /*
@@ -3513,6 +3575,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	struct page *list_page;
 	pgoff_t offset;
 	unsigned char count;
+	int ret = 0;
 
 	/*
 	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
@@ -3520,15 +3583,15 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	 */
 	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
 
-	si = swap_info_get(entry);
+	si = get_swap_device(entry);
 	if (!si) {
 		/*
 		 * An acceptable race has occurred since the failing
-		 * __swap_duplicate(): the swap entry has been freed,
-		 * perhaps even the whole swap_map cleared for swapoff.
+		 * __swap_duplicate(): the swap device may be swapoff
 		 */
 		goto outer;
 	}
+	spin_lock(&si->lock);
 
 	offset = swp_offset(entry);
 
@@ -3546,9 +3609,8 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	}
 
 	if (!page) {
-		unlock_cluster(ci);
-		spin_unlock(&si->lock);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	/*
@@ -3600,10 +3662,11 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 out:
 	unlock_cluster(ci);
 	spin_unlock(&si->lock);
+	put_swap_device(si);
 outer:
 	if (page)
 		__free_page(page);
-	return 0;
+	return ret;
 }
 
 /*
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-20  1:26 [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations Huang, Ying
@ 2017-12-21  2:16 ` Minchan Kim
       [not found]   ` <871sjopllj.fsf@yhuang-dev.intel.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2017-12-21  2:16 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Paul E . McKenney, Johannes Weiner, Tim Chen, Shaohua Li,
	Mel Gorman, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara,
	Dave Jiang, Aaron Lu

On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> When the swapin is performed, after getting the swap entry information
> from the page table, system will swap in the swap entry, without any
> lock held to prevent the swap device from being swapoff.  This may
> cause the race like below,
> 
> CPU 1				CPU 2
> -----				-----
> 				do_swap_page
> 				  swapin_readahead
> 				    __read_swap_cache_async
> swapoff				      swapcache_prepare
>   p->swap_map = NULL		        __swap_duplicate
> 					  p->swap_map[?] /* !!! NULL pointer access */
> 
> Because swapoff is usually done when system shutdown only, the race
> may not hit many people in practice.  But it is still a race need to
> be fixed.
> 
> To fix the race, get_swap_device() is added to check whether the
> specified swap entry is valid in its swap device.  If so, it will keep
> the swap entry valid via preventing the swap device from being
> swapoff, until put_swap_device() is called.
> 
> Because swapoff() is very race code path, to make the normal path runs
> as fast as possible, RCU instead of reference count is used to
> implement get/put_swap_device().  From get_swap_device() to
> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> swapoff() will wait until put_swap_device() is called.
> 
> In addition to swap_map, cluster_info, etc. data structure in the
> struct swap_info_struct, the swap cache radix tree will be freed after
> swapoff, so this patch fixes the race between swap cache looking up
> and swapoff too.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Shaohua Li <shli@fb.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Aaron Lu <aaron.lu@intel.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> 
> Changelog:
> 
> v4:
> 
> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
>   normal paths further.

Hi Huang,

This version is much better than the old one.  To me, that is not
because of the RCU/SRCU/refcount choice, but because it adds the swap
device dependency (i.e., get/put) into every swap related function, so
users who aren't interested in swap don't need to care about it. Good.

The problem is caused by freeing the swap related data structures
*dynamically*, while the old swap logic was based on static data
structures (i.e., never freed; we only verify whether an entry is stale).
So, I reviewed some places that use PageSwapCache and swp_entry_t and
could therefore access swap related data structures.

An example is __isolate_lru_page().

It calls page_mapping() to get an address_space.
What happens if the page is in the swap cache and races with swapoff?
The mapping it gets could disappear due to the race. Right?
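
Roughly, the pattern in question is like below (a hand-written sketch
for illustration, not the exact kernel code; the helper name is made up),

	/*
	 * Sketch only: caller holds the lru_lock.  For a swap cache page,
	 * 'mapping' points into the swap address space, which swapoff can
	 * free while this runs.
	 */
	static bool can_migrate_without_blocking(struct page *page)
	{
		struct address_space *mapping = page_mapping(page);

		return !mapping || mapping->a_ops->migratepage;
	}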

Thanks.

> 
> v3:
> 
> - Re-implemented with RCU to reduce the overhead of normal paths
> 
> v2:
> 
> - Re-implemented with SRCU to reduce the overhead of normal paths.
> 
> - Avoid to check whether the swap device has been swapoff in
>   get_swap_device().  Because we can check the origin of the swap
>   entry to make sure the swap device hasn't bee swapoff.
> ---
>  include/linux/swap.h |  11 ++++-
>  mm/memory.c          |   2 +-
>  mm/swap_state.c      |  16 +++++--
>  mm/swapfile.c        | 123 ++++++++++++++++++++++++++++++++++++++-------------
>  4 files changed, 116 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2417d288e016..f7e8f26cf07f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -172,8 +172,9 @@ enum {
>  	SWP_PAGE_DISCARD = (1 << 9),	/* freed swap page-cluster discards */
>  	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
>  	SWP_SYNCHRONOUS_IO = (1 << 11),	/* synchronous IO is efficient */
> +	SWP_VALID	= (1 << 12),	/* swap is valid to be operated on? */
>  					/* add others here before... */
> -	SWP_SCANNING	= (1 << 12),	/* refcount in scan_swap_map */
> +	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
>  };
>  
>  #define SWAP_CLUSTER_MAX 32UL
> @@ -460,7 +461,7 @@ extern unsigned int count_swap_pages(int, int);
>  extern sector_t map_swap_page(struct page *, struct block_device **);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern int page_swapcount(struct page *);
> -extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
> +extern int __swap_count(swp_entry_t entry);
>  extern int __swp_swapcount(swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -470,6 +471,12 @@ extern int try_to_free_swap(struct page *);
>  struct backing_dev_info;
>  extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
>  extern void exit_swap_address_space(unsigned int type);
> +extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
> +
> +static inline void put_swap_device(struct swap_info_struct *si)
> +{
> +	rcu_read_unlock();
> +}
>  
>  #else /* CONFIG_SWAP */
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 1a969992f76b..77a7d6191218 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2909,7 +2909,7 @@ int do_swap_page(struct vm_fault *vmf)
>  		struct swap_info_struct *si = swp_swap_info(entry);
>  
>  		if (si->flags & SWP_SYNCHRONOUS_IO &&
> -				__swap_count(si, entry) == 1) {
> +		    __swap_count(entry) == 1) {
>  			/* skip swapcache */
>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>  							vmf->address);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 0b8ae361981f..8dde719e973c 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -337,8 +337,13 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
>  			       unsigned long addr)
>  {
>  	struct page *page;
> +	struct swap_info_struct *si;
>  
> +	si = get_swap_device(entry);
> +	if (!si)
> +		return NULL;
>  	page = find_get_page(swap_address_space(entry), swp_offset(entry));
> +	put_swap_device(si);
>  
>  	INC_CACHE_INFO(find_total);
>  	if (page) {
> @@ -376,8 +381,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  			struct vm_area_struct *vma, unsigned long addr,
>  			bool *new_page_allocated)
>  {
> -	struct page *found_page, *new_page = NULL;
> -	struct address_space *swapper_space = swap_address_space(entry);
> +	struct page *found_page = NULL, *new_page = NULL;
> +	struct swap_info_struct *si;
>  	int err;
>  	*new_page_allocated = false;
>  
> @@ -387,7 +392,12 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		 * called after lookup_swap_cache() failed, re-calling
>  		 * that would confuse statistics.
>  		 */
> -		found_page = find_get_page(swapper_space, swp_offset(entry));
> +		si = get_swap_device(entry);
> +		if (!si)
> +			break;
> +		found_page = find_get_page(swap_address_space(entry),
> +					   swp_offset(entry));
> +		put_swap_device(si);
>  		if (found_page)
>  			break;
>  
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 42fe5653814a..881515a59f95 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1107,6 +1107,41 @@ static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
>  	return p;
>  }
>  
> +/*
> + * Check whether swap entry is valid in the swap device.  If so,
> + * return pointer to swap_info_struct, and keep the swap entry valid
> + * via preventing the swap device from being swapoff, until
> + * put_swap_device() is called.  Otherwise return NULL.
> + */
> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> +	struct swap_info_struct *si;
> +	unsigned long type, offset;
> +
> +	if (!entry.val)
> +		goto out;
> +	type = swp_type(entry);
> +	if (type >= nr_swapfiles)
> +		goto bad_nofile;
> +	si = swap_info[type];
> +
> +	rcu_read_lock();
> +	if (!(si->flags & SWP_VALID))
> +		goto unlock_out;
> +	offset = swp_offset(entry);
> +	if (offset >= si->max)
> +		goto unlock_out;
> +
> +	return si;
> +bad_nofile:
> +	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
> +out:
> +	return NULL;
> +unlock_out:
> +	rcu_read_unlock();
> +	return NULL;
> +}
> +
>  static unsigned char __swap_entry_free(struct swap_info_struct *p,
>  				       swp_entry_t entry, unsigned char usage)
>  {
> @@ -1328,11 +1363,18 @@ int page_swapcount(struct page *page)
>  	return count;
>  }
>  
> -int __swap_count(struct swap_info_struct *si, swp_entry_t entry)
> +int __swap_count(swp_entry_t entry)
>  {
> +	struct swap_info_struct *si;
>  	pgoff_t offset = swp_offset(entry);
> +	int count = 0;
>  
> -	return swap_count(si->swap_map[offset]);
> +	si = get_swap_device(entry);
> +	if (si) {
> +		count = swap_count(si->swap_map[offset]);
> +		put_swap_device(si);
> +	}
> +	return count;
>  }
>  
>  static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
> @@ -1357,9 +1399,11 @@ int __swp_swapcount(swp_entry_t entry)
>  	int count = 0;
>  	struct swap_info_struct *si;
>  
> -	si = __swap_info_get(entry);
> -	if (si)
> +	si = get_swap_device(entry);
> +	if (si) {
>  		count = swap_swapcount(si, entry);
> +		put_swap_device(si);
> +	}
>  	return count;
>  }
>  
> @@ -2451,9 +2495,9 @@ static int swap_node(struct swap_info_struct *p)
>  	return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
>  }
>  
> -static void _enable_swap_info(struct swap_info_struct *p, int prio,
> -				unsigned char *swap_map,
> -				struct swap_cluster_info *cluster_info)
> +static void setup_swap_info(struct swap_info_struct *p, int prio,
> +			    unsigned char *swap_map,
> +			    struct swap_cluster_info *cluster_info)
>  {
>  	int i;
>  
> @@ -2478,7 +2522,11 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
>  	}
>  	p->swap_map = swap_map;
>  	p->cluster_info = cluster_info;
> -	p->flags |= SWP_WRITEOK;
> +}
> +
> +static void _enable_swap_info(struct swap_info_struct *p)
> +{
> +	p->flags |= SWP_WRITEOK | SWP_VALID;
>  	atomic_long_add(p->pages, &nr_swap_pages);
>  	total_swap_pages += p->pages;
>  
> @@ -2505,7 +2553,17 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
>  	frontswap_init(p->type, frontswap_map);
>  	spin_lock(&swap_lock);
>  	spin_lock(&p->lock);
> -	 _enable_swap_info(p, prio, swap_map, cluster_info);
> +	setup_swap_info(p, prio, swap_map, cluster_info);
> +	spin_unlock(&p->lock);
> +	spin_unlock(&swap_lock);
> +	/*
> +	 * Guarantee swap_map, cluster_info, etc. fields are used
> +	 * between get/put_swap_device() only if SWP_VALID bit is set
> +	 */
> +	synchronize_rcu();
> +	spin_lock(&swap_lock);
> +	spin_lock(&p->lock);
> +	_enable_swap_info(p);
>  	spin_unlock(&p->lock);
>  	spin_unlock(&swap_lock);
>  }
> @@ -2514,7 +2572,8 @@ static void reinsert_swap_info(struct swap_info_struct *p)
>  {
>  	spin_lock(&swap_lock);
>  	spin_lock(&p->lock);
> -	_enable_swap_info(p, p->prio, p->swap_map, p->cluster_info);
> +	setup_swap_info(p, p->prio, p->swap_map, p->cluster_info);
> +	_enable_swap_info(p);
>  	spin_unlock(&p->lock);
>  	spin_unlock(&swap_lock);
>  }
> @@ -2617,6 +2676,17 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  
>  	reenable_swap_slots_cache_unlock();
>  
> +	spin_lock(&swap_lock);
> +	spin_lock(&p->lock);
> +	p->flags &= ~SWP_VALID;		/* mark swap device as invalid */
> +	spin_unlock(&p->lock);
> +	spin_unlock(&swap_lock);
> +	/*
> +	 * wait for swap operations protected by get/put_swap_device()
> +	 * to complete
> +	 */
> +	synchronize_rcu();
> +
>  	flush_work(&p->discard_work);
>  
>  	destroy_swap_extents(p);
> @@ -3356,22 +3426,16 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>  {
>  	struct swap_info_struct *p;
>  	struct swap_cluster_info *ci;
> -	unsigned long offset, type;
> +	unsigned long offset;
>  	unsigned char count;
>  	unsigned char has_cache;
>  	int err = -EINVAL;
>  
> -	if (non_swap_entry(entry))
> +	p = get_swap_device(entry);
> +	if (!p)
>  		goto out;
>  
> -	type = swp_type(entry);
> -	if (type >= nr_swapfiles)
> -		goto bad_file;
> -	p = swap_info[type];
>  	offset = swp_offset(entry);
> -	if (unlikely(offset >= p->max))
> -		goto out;
> -
>  	ci = lock_cluster_or_swap_info(p, offset);
>  
>  	count = p->swap_map[offset];
> @@ -3417,11 +3481,9 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>  unlock_out:
>  	unlock_cluster_or_swap_info(p, ci);
>  out:
> +	if (p)
> +		put_swap_device(p);
>  	return err;
> -
> -bad_file:
> -	pr_err("swap_dup: %s%08lx\n", Bad_file, entry.val);
> -	goto out;
>  }
>  
>  /*
> @@ -3513,6 +3575,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>  	struct page *list_page;
>  	pgoff_t offset;
>  	unsigned char count;
> +	int ret = 0;
>  
>  	/*
>  	 * When debugging, it's easier to use __GFP_ZERO here; but it's better
> @@ -3520,15 +3583,15 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>  	 */
>  	page = alloc_page(gfp_mask | __GFP_HIGHMEM);
>  
> -	si = swap_info_get(entry);
> +	si = get_swap_device(entry);
>  	if (!si) {
>  		/*
>  		 * An acceptable race has occurred since the failing
> -		 * __swap_duplicate(): the swap entry has been freed,
> -		 * perhaps even the whole swap_map cleared for swapoff.
> +		 * __swap_duplicate(): the swap device may be swapoff
>  		 */
>  		goto outer;
>  	}
> +	spin_lock(&si->lock);
>  
>  	offset = swp_offset(entry);
>  
> @@ -3546,9 +3609,8 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>  	}
>  
>  	if (!page) {
> -		unlock_cluster(ci);
> -		spin_unlock(&si->lock);
> -		return -ENOMEM;
> +		ret = -ENOMEM;
> +		goto out;
>  	}
>  
>  	/*
> @@ -3600,10 +3662,11 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>  out:
>  	unlock_cluster(ci);
>  	spin_unlock(&si->lock);
> +	put_swap_device(si);
>  outer:
>  	if (page)
>  		__free_page(page);
> -	return 0;
> +	return ret;
>  }
>  
>  /*
> -- 
> 2.15.0
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
       [not found]   ` <871sjopllj.fsf@yhuang-dev.intel.com>
@ 2017-12-21 23:58     ` Minchan Kim
  2017-12-22 14:14       ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2017-12-21 23:58 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Paul E . McKenney, Johannes Weiner, Tim Chen, Shaohua Li,
	Mel Gorman, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara,
	Dave Jiang, Aaron Lu

On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> >> From: Huang Ying <ying.huang@intel.com>
> >> 
> >> When the swapin is performed, after getting the swap entry information
> >> from the page table, system will swap in the swap entry, without any
> >> lock held to prevent the swap device from being swapoff.  This may
> >> cause the race like below,
> >> 
> >> CPU 1				CPU 2
> >> -----				-----
> >> 				do_swap_page
> >> 				  swapin_readahead
> >> 				    __read_swap_cache_async
> >> swapoff				      swapcache_prepare
> >>   p->swap_map = NULL		        __swap_duplicate
> >> 					  p->swap_map[?] /* !!! NULL pointer access */
> >> 
> >> Because swapoff is usually done when system shutdown only, the race
> >> may not hit many people in practice.  But it is still a race need to
> >> be fixed.
> >> 
> >> To fix the race, get_swap_device() is added to check whether the
> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> the swap entry valid via preventing the swap device from being
> >> swapoff, until put_swap_device() is called.
> >> 
> >> Because swapoff() is very race code path, to make the normal path runs
> >> as fast as possible, RCU instead of reference count is used to
> >> implement get/put_swap_device().  From get_swap_device() to
> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> >> swapoff() will wait until put_swap_device() is called.
> >> 
> >> In addition to swap_map, cluster_info, etc. data structure in the
> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> swapoff, so this patch fixes the race between swap cache looking up
> >> and swapoff too.
> >> 
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> Cc: Minchan Kim <minchan@kernel.org>
> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >> Cc: Shaohua Li <shli@fb.com>
> >> Cc: Mel Gorman <mgorman@techsingularity.net>
> >> Cc: "Jrme Glisse" <jglisse@redhat.com>
> >> Cc: Michal Hocko <mhocko@suse.com>
> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >> Cc: David Rientjes <rientjes@google.com>
> >> Cc: Rik van Riel <riel@redhat.com>
> >> Cc: Jan Kara <jack@suse.cz>
> >> Cc: Dave Jiang <dave.jiang@intel.com>
> >> Cc: Aaron Lu <aaron.lu@intel.com>
> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> >> 
> >> Changelog:
> >> 
> >> v4:
> >> 
> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
> >>   normal paths further.
> >
> > Hi Huang,
> 
> Hi, Minchan,
> 
> > This version is much better than old. To me, it's due to not rcu,
> > srcu, refcount thing but it adds swap device dependency(i.e., get/put)
> > into every swap related functions so users who don't interested on swap
> > don't need to care of it. Good.
> >
> > The problem is caused by freeing by swap related-data structure
> > *dynamically* while old swap logic was based on static data
> > structure(i.e., never freed and the verify it's stale).
> > So, I reviewed some places where use PageSwapCache and swp_entry_t
> > which could make access of swap related data structures.
> >
> > A example is __isolate_lru_page
> >
> > It calls page_mapping to get a address_space.
> > What happens if the page is on SwapCache and raced with swapoff?
> > The mapping got could be disappeared by the race. Right?
> 
> Yes.  We should think about that.  Considering the file cache pages, the
> address_space backing the file cache pages may be freed dynamically too.
> So to use page_mapping() return value for the file cache pages, some
> kind of locking is needed to guarantee the address_space isn't freed
> under us.  Page may be locked, or under writeback, or some other locks

I didn't look at the code in detail, but I guess every file page should
be freed before the address space is destroyed, and page_lock/lru_lock
makes that work safe.  So it wouldn't be a problem.

However, in the case of swapoff, it doesn't remove pages from the LRU
list, so there is no lock to prevent the race at this moment. :(

> need to be held, for example, page table lock, or lru_lock, etc.  For
> __isolate_lru_page(), lru_lock will be held when it is called.  And we
> will call synchronize_rcu() between clear PageSwapCache and free swap
> cache, so the usage of swap cache in __isolate_lru_page() should be
> safe.  Do you think my analysis makes sense?

I don't understand how synchronize_rcu() closes the race with
spin_lock().  Paul might help with that.

Even if we solve it, there is another problem I spotted.
When I look at migrate_vma_pages(), it passes the mapping to
migrate_page(), which accesses mapping->tree_lock unconditionally even
though the address_space is already gone.

Hmm, I didn't check all the sites that use PageSwapCache and
swp_entry_t, but my gut feeling is it would not be simple.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-21 23:58     ` Minchan Kim
@ 2017-12-22 14:14       ` Huang, Ying
  2017-12-22 16:14         ` Paul E. McKenney
  2017-12-23  1:36         ` Minchan Kim
  0 siblings, 2 replies; 15+ messages in thread
From: Huang, Ying @ 2017-12-22 14:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Paul E . McKenney, Johannes Weiner, Tim Chen, Shaohua Li,
	Mel Gorman, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara,
	Dave Jiang, Aaron Lu

Minchan Kim <minchan@kernel.org> writes:

> On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
>> >> From: Huang Ying <ying.huang@intel.com>
>> >> 
>> >> When the swapin is performed, after getting the swap entry information
>> >> from the page table, system will swap in the swap entry, without any
>> >> lock held to prevent the swap device from being swapoff.  This may
>> >> cause the race like below,
>> >> 
>> >> CPU 1				CPU 2
>> >> -----				-----
>> >> 				do_swap_page
>> >> 				  swapin_readahead
>> >> 				    __read_swap_cache_async
>> >> swapoff				      swapcache_prepare
>> >>   p->swap_map = NULL		        __swap_duplicate
>> >> 					  p->swap_map[?] /* !!! NULL pointer access */
>> >> 
>> >> Because swapoff is usually done when system shutdown only, the race
>> >> may not hit many people in practice.  But it is still a race need to
>> >> be fixed.
>> >> 
>> >> To fix the race, get_swap_device() is added to check whether the
>> >> specified swap entry is valid in its swap device.  If so, it will keep
>> >> the swap entry valid via preventing the swap device from being
>> >> swapoff, until put_swap_device() is called.
>> >> 
>> >> Because swapoff() is very race code path, to make the normal path runs
>> >> as fast as possible, RCU instead of reference count is used to
>> >> implement get/put_swap_device().  From get_swap_device() to
>> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
>> >> swapoff() will wait until put_swap_device() is called.
>> >> 
>> >> In addition to swap_map, cluster_info, etc. data structure in the
>> >> struct swap_info_struct, the swap cache radix tree will be freed after
>> >> swapoff, so this patch fixes the race between swap cache looking up
>> >> and swapoff too.
>> >> 
>> >> Cc: Hugh Dickins <hughd@google.com>
>> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >> Cc: Minchan Kim <minchan@kernel.org>
>> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>> >> Cc: Shaohua Li <shli@fb.com>
>> >> Cc: Mel Gorman <mgorman@techsingularity.net>
>> >> Cc: "Jrme Glisse" <jglisse@redhat.com>
>> >> Cc: Michal Hocko <mhocko@suse.com>
>> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> >> Cc: David Rientjes <rientjes@google.com>
>> >> Cc: Rik van Riel <riel@redhat.com>
>> >> Cc: Jan Kara <jack@suse.cz>
>> >> Cc: Dave Jiang <dave.jiang@intel.com>
>> >> Cc: Aaron Lu <aaron.lu@intel.com>
>> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> >> 
>> >> Changelog:
>> >> 
>> >> v4:
>> >> 
>> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
>> >>   normal paths further.
>> >
>> > Hi Huang,
>> 
>> Hi, Minchan,
>> 
>> > This version is much better than old. To me, it's due to not rcu,
>> > srcu, refcount thing but it adds swap device dependency(i.e., get/put)
>> > into every swap related functions so users who don't interested on swap
>> > don't need to care of it. Good.
>> >
>> > The problem is caused by freeing by swap related-data structure
>> > *dynamically* while old swap logic was based on static data
>> > structure(i.e., never freed and the verify it's stale).
>> > So, I reviewed some places where use PageSwapCache and swp_entry_t
>> > which could make access of swap related data structures.
>> >
>> > A example is __isolate_lru_page
>> >
>> > It calls page_mapping to get a address_space.
>> > What happens if the page is on SwapCache and raced with swapoff?
>> > The mapping got could be disappeared by the race. Right?
>> 
>> Yes.  We should think about that.  Considering the file cache pages, the
>> address_space backing the file cache pages may be freed dynamically too.
>> So to use page_mapping() return value for the file cache pages, some
>> kind of locking is needed to guarantee the address_space isn't freed
>> under us.  Page may be locked, or under writeback, or some other locks
>
> I didn't look at the code in detail but I guess every file page should
> be freed before the address space destruction and page_lock/lru_lock makes
> the work safe, I guess. So, it wouldn't be a problem.
>
> However, in case of swapoff, it doesn't remove pages from LRU list
> so there is no lock to prevent the race at this moment. :(

Take a look at the file cache pages and the freeing code path of the
file cache address_space.  It appears that a similar situation is
possible for them too.

The file cache pages will be deleted from the file cache address_space
before the address_space (embedded in the inode) is freed.  But they
will be deleted from the LRU list only when their refcount drops to
zero; please take a look at put_page() and release_pages().  Meanwhile,
the address_space will be freed after its references to all the file
cache pages have been put.  If someone holds a reference to a file
cache page for quite a long time, it is possible for that file cache
page to still be in the LRU list after the inode/address_space is freed.

And I found that the inode/address_space is freed with call_rcu().  I
don't know whether this is related to page_mapping().

This is just my understanding.

>> need to be held, for example, page table lock, or lru_lock, etc.  For
>> __isolate_lru_page(), lru_lock will be held when it is called.  And we
>> will call synchronize_rcu() between clear PageSwapCache and free swap
>> cache, so the usage of swap cache in __isolate_lru_page() should be
>> safe.  Do you think my analysis makes sense?
>
> I don't understand how synchronize_rcu closes the race with spin_lock.
> Paul might help it.

Per my understanding, spin_lock() calls preempt_disable(), so
synchronize_rcu() will wait until spin_unlock() is called.

> Even if we solve it, there is a other problem I spot.
> When I see migrate_vma_pages, it pass mapping to migrate_page which
> accesses mapping->tree_lock unconditionally even though the address_space
> is already gone.

Before migrate_vma_pages() is called, migrate_vma_prepare() is called,
where pages are locked.  So it is safe.

> Hmm, I didn't check all sites where uses PageSwapCache, swp_entry_t
> but gut feeling is it would be not simple.

Yes.  We should check all sites.  Thanks for your help!

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-22 14:14       ` Huang, Ying
@ 2017-12-22 16:14         ` Paul E. McKenney
  2017-12-25  1:28           ` Huang, Ying
  2017-12-23  1:36         ` Minchan Kim
  1 sibling, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2017-12-22 16:14 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Minchan Kim, Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Fri, Dec 22, 2017 at 10:14:43PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> >> >> From: Huang Ying <ying.huang@intel.com>
> >> >> 
> >> >> When the swapin is performed, after getting the swap entry information
> >> >> from the page table, system will swap in the swap entry, without any
> >> >> lock held to prevent the swap device from being swapoff.  This may
> >> >> cause the race like below,
> >> >> 
> >> >> CPU 1				CPU 2
> >> >> -----				-----
> >> >> 				do_swap_page
> >> >> 				  swapin_readahead
> >> >> 				    __read_swap_cache_async
> >> >> swapoff				      swapcache_prepare
> >> >>   p->swap_map = NULL		        __swap_duplicate
> >> >> 					  p->swap_map[?] /* !!! NULL pointer access */
> >> >> 
> >> >> Because swapoff is usually done when system shutdown only, the race
> >> >> may not hit many people in practice.  But it is still a race need to
> >> >> be fixed.
> >> >> 
> >> >> To fix the race, get_swap_device() is added to check whether the
> >> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> >> the swap entry valid via preventing the swap device from being
> >> >> swapoff, until put_swap_device() is called.
> >> >> 
> >> >> Because swapoff() is very race code path, to make the normal path runs
> >> >> as fast as possible, RCU instead of reference count is used to
> >> >> implement get/put_swap_device().  From get_swap_device() to
> >> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> >> >> swapoff() will wait until put_swap_device() is called.
> >> >> 
> >> >> In addition to swap_map, cluster_info, etc. data structure in the
> >> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> >> swapoff, so this patch fixes the race between swap cache looking up
> >> >> and swapoff too.
> >> >> 
> >> >> Cc: Hugh Dickins <hughd@google.com>
> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> >> Cc: Minchan Kim <minchan@kernel.org>
> >> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >> >> Cc: Shaohua Li <shli@fb.com>
> >> >> Cc: Mel Gorman <mgorman@techsingularity.net>
> >> >> Cc: "Jrme Glisse" <jglisse@redhat.com>
> >> >> Cc: Michal Hocko <mhocko@suse.com>
> >> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >> >> Cc: David Rientjes <rientjes@google.com>
> >> >> Cc: Rik van Riel <riel@redhat.com>
> >> >> Cc: Jan Kara <jack@suse.cz>
> >> >> Cc: Dave Jiang <dave.jiang@intel.com>
> >> >> Cc: Aaron Lu <aaron.lu@intel.com>
> >> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> >> >> 
> >> >> Changelog:
> >> >> 
> >> >> v4:
> >> >> 
> >> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
> >> >>   normal paths further.
> >> >
> >> > Hi Huang,
> >> 
> >> Hi, Minchan,
> >> 
> >> > This version is much better than old. To me, it's due to not rcu,
> >> > srcu, refcount thing but it adds swap device dependency(i.e., get/put)
> >> > into every swap related functions so users who don't interested on swap
> >> > don't need to care of it. Good.
> >> >
> >> > The problem is caused by freeing by swap related-data structure
> >> > *dynamically* while old swap logic was based on static data
> >> > structure(i.e., never freed and the verify it's stale).
> >> > So, I reviewed some places where use PageSwapCache and swp_entry_t
> >> > which could make access of swap related data structures.
> >> >
> >> > A example is __isolate_lru_page
> >> >
> >> > It calls page_mapping to get a address_space.
> >> > What happens if the page is on SwapCache and raced with swapoff?
> >> > The mapping got could be disappeared by the race. Right?
> >> 
> >> Yes.  We should think about that.  Considering the file cache pages, the
> >> address_space backing the file cache pages may be freed dynamically too.
> >> So to use page_mapping() return value for the file cache pages, some
> >> kind of locking is needed to guarantee the address_space isn't freed
> >> under us.  Page may be locked, or under writeback, or some other locks
> >
> > I didn't look at the code in detail but I guess every file page should
> > be freed before the address space destruction and page_lock/lru_lock makes
> > the work safe, I guess. So, it wouldn't be a problem.
> >
> > However, in case of swapoff, it doesn't remove pages from LRU list
> > so there is no lock to prevent the race at this moment. :(
> 
> Take a look at file cache pages and file cache address_space freeing
> code path.  It appears that similar situation is possible for them too.
> 
> The file cache pages will be delete from file cache address_space before
> address_space (embedded in inode) is freed.  But they will be deleted
> from LRU list only when its refcount dropped to zero, please take a look
> at put_page() and release_pages().  While address_space will be freed
> after putting reference to all file cache pages.  If someone holds a
> reference to a file cache page for quite long time, it is possible for a
> file cache page to be in LRU list after the inode/address_space is
> freed.
> 
> And I found inode/address_space is freed witch call_rcu().  I don't know
> whether this is related to page_mapping().
> 
> This is just my understanding.
> 
> >> need to be held, for example, page table lock, or lru_lock, etc.  For
> >> __isolate_lru_page(), lru_lock will be held when it is called.  And we
> >> will call synchronize_rcu() between clear PageSwapCache and free swap
> >> cache, so the usage of swap cache in __isolate_lru_page() should be
> >> safe.  Do you think my analysis makes sense?
> >
> > I don't understand how synchronize_rcu closes the race with spin_lock.
> > Paul might help it.
> 
> Per my understanding, spin_lock() will preempt_disable(), so
> synchronize_rcu() will wait until spin_unlock() is called.

Only when CONFIG_PREEMPT=n!

In CONFIG_PREEMPT=y kernels, preempt_disable() won't necessarily prevent
synchronize_rcu() from completing.

Now, preempt_disable() does prevent synchronize_sched() from
completing, but that would require changing the rcu_read_lock() and
rcu_read_unlock() to rcu_read_lock_sched()/rcu_read_unlock_sched()
or preempt_disable()/preempt_enable().

Another fix would be to invoke rcu_read_lock() just after acquiring
the spinlock and rcu_read_unlock() just before releasing it.
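
For example (a minimal sketch of that last option; do_isolate_work()
is just a stand-in for whatever dereferences page_mapping() under the
lock):

	spin_lock_irq(&pgdat->lru_lock);
	rcu_read_lock();	/* synchronize_rcu() in swapoff() now waits for us */

	do_isolate_work(page);	/* may look at a swap cache page's mapping */

	rcu_read_unlock();
	spin_unlock_irq(&pgdat->lru_lock);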

							Thanx, Paul

> > Even if we solve it, there is a other problem I spot.
> > When I see migrate_vma_pages, it pass mapping to migrate_page which
> > accesses mapping->tree_lock unconditionally even though the address_space
> > is already gone.
> 
> Before migrate_vma_pages() is called, migrate_vma_prepare() is called,
> where pages are locked.  So it is safe.
> 
> > Hmm, I didn't check all sites where uses PageSwapCache, swp_entry_t
> > but gut feeling is it would be not simple.
> 
> Yes.  We should check all sites.  Thanks for your help!
> 
> Best Regards,
> Huang, Ying
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-22 14:14       ` Huang, Ying
  2017-12-22 16:14         ` Paul E. McKenney
@ 2017-12-23  1:36         ` Minchan Kim
  2017-12-26  5:33           ` Huang, Ying
  2018-01-02 10:21           ` Mel Gorman
  1 sibling, 2 replies; 15+ messages in thread
From: Minchan Kim @ 2017-12-23  1:36 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Paul E . McKenney, Johannes Weiner, Tim Chen, Shaohua Li,
	Mel Gorman, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara,
	Dave Jiang, Aaron Lu, Mel Gorman

On Fri, Dec 22, 2017 at 10:14:43PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> >> >> From: Huang Ying <ying.huang@intel.com>
> >> >> 
> >> >> When the swapin is performed, after getting the swap entry information
> >> >> from the page table, system will swap in the swap entry, without any
> >> >> lock held to prevent the swap device from being swapoff.  This may
> >> >> cause the race like below,
> >> >> 
> >> >> CPU 1				CPU 2
> >> >> -----				-----
> >> >> 				do_swap_page
> >> >> 				  swapin_readahead
> >> >> 				    __read_swap_cache_async
> >> >> swapoff				      swapcache_prepare
> >> >>   p->swap_map = NULL		        __swap_duplicate
> >> >> 					  p->swap_map[?] /* !!! NULL pointer access */
> >> >> 
> >> >> Because swapoff is usually done when system shutdown only, the race
> >> >> may not hit many people in practice.  But it is still a race need to
> >> >> be fixed.
> >> >> 
> >> >> To fix the race, get_swap_device() is added to check whether the
> >> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> >> the swap entry valid via preventing the swap device from being
> >> >> swapoff, until put_swap_device() is called.
> >> >> 
> >> >> Because swapoff() is very race code path, to make the normal path runs
> >> >> as fast as possible, RCU instead of reference count is used to
> >> >> implement get/put_swap_device().  From get_swap_device() to
> >> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> >> >> swapoff() will wait until put_swap_device() is called.
> >> >> 
> >> >> In addition to swap_map, cluster_info, etc. data structure in the
> >> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> >> swapoff, so this patch fixes the race between swap cache looking up
> >> >> and swapoff too.
> >> >> 
> >> >> Cc: Hugh Dickins <hughd@google.com>
> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> >> Cc: Minchan Kim <minchan@kernel.org>
> >> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> >> >> Cc: Shaohua Li <shli@fb.com>
> >> >> Cc: Mel Gorman <mgorman@techsingularity.net>
> >> >> Cc: "Jrme Glisse" <jglisse@redhat.com>
> >> >> Cc: Michal Hocko <mhocko@suse.com>
> >> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >> >> Cc: David Rientjes <rientjes@google.com>
> >> >> Cc: Rik van Riel <riel@redhat.com>
> >> >> Cc: Jan Kara <jack@suse.cz>
> >> >> Cc: Dave Jiang <dave.jiang@intel.com>
> >> >> Cc: Aaron Lu <aaron.lu@intel.com>
> >> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> >> >> 
> >> >> Changelog:
> >> >> 
> >> >> v4:
> >> >> 
> >> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
> >> >>   normal paths further.
> >> >
> >> > Hi Huang,
> >> 
> >> Hi, Minchan,
> >> 
> >> > This version is much better than old. To me, it's due to not rcu,
> >> > srcu, refcount thing but it adds swap device dependency(i.e., get/put)
> >> > into every swap related functions so users who don't interested on swap
> >> > don't need to care of it. Good.
> >> >
> >> > The problem is caused by freeing by swap related-data structure
> >> > *dynamically* while old swap logic was based on static data
> >> > structure(i.e., never freed and the verify it's stale).
> >> > So, I reviewed some places where use PageSwapCache and swp_entry_t
> >> > which could make access of swap related data structures.
> >> >
> >> > A example is __isolate_lru_page
> >> >
> >> > It calls page_mapping to get a address_space.
> >> > What happens if the page is on SwapCache and raced with swapoff?
> >> > The mapping got could be disappeared by the race. Right?
> >> 
> >> Yes.  We should think about that.  Considering the file cache pages, the
> >> address_space backing the file cache pages may be freed dynamically too.
> >> So to use page_mapping() return value for the file cache pages, some
> >> kind of locking is needed to guarantee the address_space isn't freed
> >> under us.  Page may be locked, or under writeback, or some other locks
> >
> > I didn't look at the code in detail but I guess every file page should
> > be freed before the address space destruction and page_lock/lru_lock makes
> > the work safe, I guess. So, it wouldn't be a problem.
> >
> > However, in case of swapoff, it doesn't remove pages from LRU list
> > so there is no lock to prevent the race at this moment. :(
> 
> Take a look at file cache pages and file cache address_space freeing
> code path.  It appears that similar situation is possible for them too.
> 
> The file cache pages will be delete from file cache address_space before
> address_space (embedded in inode) is freed.  But they will be deleted
> from LRU list only when its refcount dropped to zero, please take a look
> at put_page() and release_pages().  While address_space will be freed
> after putting reference to all file cache pages.  If someone holds a
> reference to a file cache page for quite long time, it is possible for a
> file cache page to be in LRU list after the inode/address_space is
> freed.
> 
> And I found inode/address_space is freed witch call_rcu().  I don't know
> whether this is related to page_mapping().
> 
> This is just my understanding.

Hmm, it smells like a bug in __isolate_lru_page().

Ccing Mel:

What lock protects the address_space from being destroyed when a race
happens between inode truncation and __isolate_lru_page()?

> 
> >> need to be held, for example, page table lock, or lru_lock, etc.  For
> >> __isolate_lru_page(), lru_lock will be held when it is called.  And we
> >> will call synchronize_rcu() between clear PageSwapCache and free swap
> >> cache, so the usage of swap cache in __isolate_lru_page() should be
> >> safe.  Do you think my analysis makes sense?
> >
> > I don't understand how synchronize_rcu closes the race with spin_lock.
> > Paul might help it.
> 
> Per my understanding, spin_lock() will preempt_disable(), so
> synchronize_rcu() will wait until spin_unlock() is called.
> 
> > Even if we solve it, there is a other problem I spot.
> > When I see migrate_vma_pages, it pass mapping to migrate_page which
> > accesses mapping->tree_lock unconditionally even though the address_space
> > is already gone.
> 
> Before migrate_vma_pages() is called, migrate_vma_prepare() is called,
> where pages are locked.  So it is safe.

I missed that. You're right. It's no problem. Thanks.

> 
> > Hmm, I didn't check all sites where uses PageSwapCache, swp_entry_t
> > but gut feeling is it would be not simple.
> 
> Yes.  We should check all sites.  Thanks for your help!

You might have started checking already and found them.
Many architectures use page_mapping() in their cache flush code, so we
should check there, too.
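
For example, an arch cache flush path typically looks roughly like
below (a generic sketch, not any particular architecture's exact code;
the two helpers are made up):

	void flush_dcache_page(struct page *page)
	{
		struct address_space *mapping = page_mapping(page);

		/*
		 * For a swap cache page, 'mapping' points into the swap
		 * address space, which swapoff can free underneath us.
		 */
		if (mapping && mapping_mapped(mapping))
			flush_dcache_aliases(mapping, page);	/* made up */
		else
			flush_dcache_page_kernel(page);		/* made up */
	}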

Thanks!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-22 16:14         ` Paul E. McKenney
@ 2017-12-25  1:28           ` Huang, Ying
  0 siblings, 0 replies; 15+ messages in thread
From: Huang, Ying @ 2017-12-25  1:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Minchan Kim, Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Fri, Dec 22, 2017 at 10:14:43PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >> 
>> >> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
>> >> >> From: Huang Ying <ying.huang@intel.com>
>> >> >> 
>> >> >> When the swapin is performed, after getting the swap entry information
>> >> >> from the page table, system will swap in the swap entry, without any
>> >> >> lock held to prevent the swap device from being swapoff.  This may
>> >> >> cause the race like below,
>> >> >> 
>> >> >> CPU 1				CPU 2
>> >> >> -----				-----
>> >> >> 				do_swap_page
>> >> >> 				  swapin_readahead
>> >> >> 				    __read_swap_cache_async
>> >> >> swapoff				      swapcache_prepare
>> >> >>   p->swap_map = NULL		        __swap_duplicate
>> >> >> 					  p->swap_map[?] /* !!! NULL pointer access */
>> >> >> 
>> >> >> Because swapoff is usually done when system shutdown only, the race
>> >> >> may not hit many people in practice.  But it is still a race need to
>> >> >> be fixed.
>> >> >> 
>> >> >> To fix the race, get_swap_device() is added to check whether the
>> >> >> specified swap entry is valid in its swap device.  If so, it will keep
>> >> >> the swap entry valid via preventing the swap device from being
>> >> >> swapoff, until put_swap_device() is called.
>> >> >> 
>> >> >> Because swapoff() is very race code path, to make the normal path runs
>> >> >> as fast as possible, RCU instead of reference count is used to
>> >> >> implement get/put_swap_device().  From get_swap_device() to
>> >> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
>> >> >> swapoff() will wait until put_swap_device() is called.
>> >> >> 
>> >> >> In addition to swap_map, cluster_info, etc. data structure in the
>> >> >> struct swap_info_struct, the swap cache radix tree will be freed after
>> >> >> swapoff, so this patch fixes the race between swap cache looking up
>> >> >> and swapoff too.
>> >> >> 
>> >> >> Cc: Hugh Dickins <hughd@google.com>
>> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >> >> Cc: Minchan Kim <minchan@kernel.org>
>> >> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> >> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>> >> >> Cc: Shaohua Li <shli@fb.com>
>> >> >> Cc: Mel Gorman <mgorman@techsingularity.net>
>> >> >> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> >> >> Cc: Michal Hocko <mhocko@suse.com>
>> >> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> >> >> Cc: David Rientjes <rientjes@google.com>
>> >> >> Cc: Rik van Riel <riel@redhat.com>
>> >> >> Cc: Jan Kara <jack@suse.cz>
>> >> >> Cc: Dave Jiang <dave.jiang@intel.com>
>> >> >> Cc: Aaron Lu <aaron.lu@intel.com>
>> >> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> >> >> 
>> >> >> Changelog:
>> >> >> 
>> >> >> v4:
>> >> >> 
>> >> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
>> >> >>   normal paths further.
>> >> >
>> >> > Hi Huang,
>> >> 
>> >> Hi, Minchan,
>> >> 
>> >> > This version is much better than old. To me, it's due to not rcu,
>> >> > srcu, refcount thing but it adds swap device dependency(i.e., get/put)
>> >> > into every swap related functions so users who don't interested on swap
>> >> > don't need to care of it. Good.
>> >> >
>> >> > The problem is caused by freeing by swap related-data structure
>> >> > *dynamically* while old swap logic was based on static data
>> >> > structure(i.e., never freed and the verify it's stale).
>> >> > So, I reviewed some places where use PageSwapCache and swp_entry_t
>> >> > which could make access of swap related data structures.
>> >> >
>> >> > A example is __isolate_lru_page
>> >> >
>> >> > It calls page_mapping to get a address_space.
>> >> > What happens if the page is on SwapCache and raced with swapoff?
>> >> > The mapping got could be disappeared by the race. Right?
>> >> 
>> >> Yes.  We should think about that.  Considering the file cache pages, the
>> >> address_space backing the file cache pages may be freed dynamically too.
>> >> So to use page_mapping() return value for the file cache pages, some
>> >> kind of locking is needed to guarantee the address_space isn't freed
>> >> under us.  Page may be locked, or under writeback, or some other locks
>> >
>> > I didn't look at the code in detail but I guess every file page should
>> > be freed before the address space destruction and page_lock/lru_lock makes
>> > the work safe, I guess. So, it wouldn't be a problem.
>> >
>> > However, in case of swapoff, it doesn't remove pages from LRU list
>> > so there is no lock to prevent the race at this moment. :(
>> 
>> Take a look at file cache pages and file cache address_space freeing
>> code path.  It appears that similar situation is possible for them too.
>> 
>> The file cache pages will be delete from file cache address_space before
>> address_space (embedded in inode) is freed.  But they will be deleted
>> from LRU list only when its refcount dropped to zero, please take a look
>> at put_page() and release_pages().  While address_space will be freed
>> after putting reference to all file cache pages.  If someone holds a
>> reference to a file cache page for quite long time, it is possible for a
>> file cache page to be in LRU list after the inode/address_space is
>> freed.
>> 
>> And I found inode/address_space is freed witch call_rcu().  I don't know
>> whether this is related to page_mapping().
>> 
>> This is just my understanding.
>> 
>> >> need to be held, for example, page table lock, or lru_lock, etc.  For
>> >> __isolate_lru_page(), lru_lock will be held when it is called.  And we
>> >> will call synchronize_rcu() between clear PageSwapCache and free swap
>> >> cache, so the usage of swap cache in __isolate_lru_page() should be
>> >> safe.  Do you think my analysis makes sense?
>> >
>> > I don't understand how synchronize_rcu closes the race with spin_lock.
>> > Paul might help it.
>> 
>> Per my understanding, spin_lock() will preempt_disable(), so
>> synchronize_rcu() will wait until spin_unlock() is called.
>
> Only when CONFIG_PREEMPT=n!
>
> In CONFIG_PREEMPT=y kernels, preempt_disable() won't necessarily prevent
> synchronize_rcu() from completing.
>
> Now, preempt_disable() does prevent synchronize_sched() from
> completing, but that would require changing the rcu_read_lock() and
> rcu_read_unlock() to rcu_read_lock_sched()/rcu_read_unlock_sched()
> or preempt_enable()/preempt_disable().
>
> Another fix would be to invoke rcu_read_lock() just after acquiring
> the spinlock and rcu_read_unlock() just before releasing it.

Hi, Paul,

Thanks for your correction.  This makes the important RCU semantics much
clearer to me.
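
Just to make sure I understand the last option, a minimal sketch (assuming
the lru_lock critical section in mm/vmscan.c that ends up calling
__isolate_lru_page()) would be something like:

	spin_lock_irq(&pgdat->lru_lock);
	rcu_read_lock();	/* explicit RCU reader, works for CONFIG_PREEMPT=y too */

	/* ... __isolate_lru_page() and other swap cache accesses ... */

	rcu_read_unlock();
	spin_unlock_irq(&pgdat->lru_lock);

Then synchronize_rcu() in swapoff() is guaranteed to wait for this section
regardless of the preemption model.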

Best Regards,
Huang, Ying

> 							Thanx, Paul
>
>> > Even if we solve it, there is a other problem I spot.
>> > When I see migrate_vma_pages, it pass mapping to migrate_page which
>> > accesses mapping->tree_lock unconditionally even though the address_space
>> > is already gone.
>> 
>> Before migrate_vma_pages() is called, migrate_vma_prepare() is called,
>> where pages are locked.  So it is safe.
>> 
>> > Hmm, I didn't check all sites where uses PageSwapCache, swp_entry_t
>> > but gut feeling is it would be not simple.
>> 
>> Yes.  We should check all sites.  Thanks for your help!
>> 
>> Best Regards,
>> Huang, Ying
>> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-23  1:36         ` Minchan Kim
@ 2017-12-26  5:33           ` Huang, Ying
  2018-01-02 10:21           ` Mel Gorman
  1 sibling, 0 replies; 15+ messages in thread
From: Huang, Ying @ 2017-12-26  5:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Paul E . McKenney, Johannes Weiner, Tim Chen, Shaohua Li,
	Mel Gorman, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara,
	Dave Jiang, Aaron Lu, Mel Gorman

Minchan Kim <minchan@kernel.org> writes:

> On Fri, Dec 22, 2017 at 10:14:43PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >> 
>> >> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
>> >> >> From: Huang Ying <ying.huang@intel.com>
>> >> >> 
>> >> >> When the swapin is performed, after getting the swap entry information
>> >> >> from the page table, system will swap in the swap entry, without any
>> >> >> lock held to prevent the swap device from being swapoff.  This may
>> >> >> cause the race like below,
>> >> >> 
>> >> >> CPU 1				CPU 2
>> >> >> -----				-----
>> >> >> 				do_swap_page
>> >> >> 				  swapin_readahead
>> >> >> 				    __read_swap_cache_async
>> >> >> swapoff				      swapcache_prepare
>> >> >>   p->swap_map = NULL		        __swap_duplicate
>> >> >> 					  p->swap_map[?] /* !!! NULL pointer access */
>> >> >> 
>> >> >> Because swapoff is usually done when system shutdown only, the race
>> >> >> may not hit many people in practice.  But it is still a race need to
>> >> >> be fixed.
>> >> >> 
>> >> >> To fix the race, get_swap_device() is added to check whether the
>> >> >> specified swap entry is valid in its swap device.  If so, it will keep
>> >> >> the swap entry valid via preventing the swap device from being
>> >> >> swapoff, until put_swap_device() is called.
>> >> >> 
>> >> >> Because swapoff() is very race code path, to make the normal path runs
>> >> >> as fast as possible, RCU instead of reference count is used to
>> >> >> implement get/put_swap_device().  From get_swap_device() to
>> >> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
>> >> >> swapoff() will wait until put_swap_device() is called.
>> >> >> 
>> >> >> In addition to swap_map, cluster_info, etc. data structure in the
>> >> >> struct swap_info_struct, the swap cache radix tree will be freed after
>> >> >> swapoff, so this patch fixes the race between swap cache looking up
>> >> >> and swapoff too.
>> >> >> 
>> >> >> Cc: Hugh Dickins <hughd@google.com>
>> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >> >> Cc: Minchan Kim <minchan@kernel.org>
>> >> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> >> >> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>> >> >> Cc: Shaohua Li <shli@fb.com>
>> >> >> Cc: Mel Gorman <mgorman@techsingularity.net>
>> >> >> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> >> >> Cc: Michal Hocko <mhocko@suse.com>
>> >> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> >> >> Cc: David Rientjes <rientjes@google.com>
>> >> >> Cc: Rik van Riel <riel@redhat.com>
>> >> >> Cc: Jan Kara <jack@suse.cz>
>> >> >> Cc: Dave Jiang <dave.jiang@intel.com>
>> >> >> Cc: Aaron Lu <aaron.lu@intel.com>
>> >> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> >> >> 
>> >> >> Changelog:
>> >> >> 
>> >> >> v4:
>> >> >> 
>> >> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
>> >> >>   normal paths further.
>> >> >
>> >> > Hi Huang,
>> >> 
>> >> Hi, Minchan,
>> >> 
>> >> > This version is much better than old. To me, it's due to not rcu,
>> >> > srcu, refcount thing but it adds swap device dependency(i.e., get/put)
>> >> > into every swap related functions so users who don't interested on swap
>> >> > don't need to care of it. Good.
>> >> >
>> >> > The problem is caused by freeing by swap related-data structure
>> >> > *dynamically* while old swap logic was based on static data
>> >> > structure(i.e., never freed and the verify it's stale).
>> >> > So, I reviewed some places where use PageSwapCache and swp_entry_t
>> >> > which could make access of swap related data structures.
>> >> >
>> >> > A example is __isolate_lru_page
>> >> >
>> >> > It calls page_mapping to get a address_space.
>> >> > What happens if the page is on SwapCache and raced with swapoff?
>> >> > The mapping got could be disappeared by the race. Right?
>> >> 
>> >> Yes.  We should think about that.  Considering the file cache pages, the
>> >> address_space backing the file cache pages may be freed dynamically too.
>> >> So to use page_mapping() return value for the file cache pages, some
>> >> kind of locking is needed to guarantee the address_space isn't freed
>> >> under us.  Page may be locked, or under writeback, or some other locks
>> >
>> > I didn't look at the code in detail but I guess every file page should
>> > be freed before the address space destruction and page_lock/lru_lock makes
>> > the work safe, I guess. So, it wouldn't be a problem.
>> >
>> > However, in case of swapoff, it doesn't remove pages from LRU list
>> > so there is no lock to prevent the race at this moment. :(
>> 
>> Take a look at file cache pages and file cache address_space freeing
>> code path.  It appears that similar situation is possible for them too.
>> 
>> The file cache pages will be delete from file cache address_space before
>> address_space (embedded in inode) is freed.  But they will be deleted
>> from LRU list only when its refcount dropped to zero, please take a look
>> at put_page() and release_pages().  While address_space will be freed
>> after putting reference to all file cache pages.  If someone holds a
>> reference to a file cache page for quite long time, it is possible for a
>> file cache page to be in LRU list after the inode/address_space is
>> freed.
>> 
>> And I found inode/address_space is freed witch call_rcu().  I don't know
>> whether this is related to page_mapping().
>> 
>> This is just my understanding.
>
> Hmm, it smells like a bug of __isolate_lru_page.
>
> Ccing Mel:
>
> What locks protects address_space destroying when race happens between
> inode trauncation and __isolate_lru_page?
>
>> 
>> >> need to be held, for example, page table lock, or lru_lock, etc.  For
>> >> __isolate_lru_page(), lru_lock will be held when it is called.  And we
>> >> will call synchronize_rcu() between clear PageSwapCache and free swap
>> >> cache, so the usage of swap cache in __isolate_lru_page() should be
>> >> safe.  Do you think my analysis makes sense?
>> >
>> > I don't understand how synchronize_rcu closes the race with spin_lock.
>> > Paul might help it.
>> 
>> Per my understanding, spin_lock() will preempt_disable(), so
>> synchronize_rcu() will wait until spin_unlock() is called.
>> 
>> > Even if we solve it, there is a other problem I spot.
>> > When I see migrate_vma_pages, it pass mapping to migrate_page which
>> > accesses mapping->tree_lock unconditionally even though the address_space
>> > is already gone.
>> 
>> Before migrate_vma_pages() is called, migrate_vma_prepare() is called,
>> where pages are locked.  So it is safe.
>
> I missed that. You're right. It's no problem. Thanks.
>
>> 
>> > Hmm, I didn't check all sites where uses PageSwapCache, swp_entry_t
>> > but gut feeling is it would be not simple.
>> 
>> Yes.  We should check all sites.  Thanks for your help!
>
> You might start checking already and found it.
> Many architectures use page_mapping in cache flush code so we should
> check there, too.

Thanks for the reminder!  I will check them.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-23  1:36         ` Minchan Kim
  2017-12-26  5:33           ` Huang, Ying
@ 2018-01-02 10:21           ` Mel Gorman
  2018-01-02 11:29             ` Jan Kara
  1 sibling, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2018-01-02 10:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, Hugh Dickins,
	Paul E . McKenney, Johannes Weiner, Tim Chen, Shaohua Li,
	Mel Gorman, Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
> > code path.  It appears that similar situation is possible for them too.
> > 
> > The file cache pages will be delete from file cache address_space before
> > address_space (embedded in inode) is freed.  But they will be deleted
> > from LRU list only when its refcount dropped to zero, please take a look
> > at put_page() and release_pages().  While address_space will be freed
> > after putting reference to all file cache pages.  If someone holds a
> > reference to a file cache page for quite long time, it is possible for a
> > file cache page to be in LRU list after the inode/address_space is
> > freed.
> > 
> > And I found inode/address_space is freed witch call_rcu().  I don't know
> > whether this is related to page_mapping().
> > 
> > This is just my understanding.
> 
> Hmm, it smells like a bug of __isolate_lru_page.
> 
> Ccing Mel:
> 
> What locks protects address_space destroying when race happens between
> inode trauncation and __isolate_lru_page?
> 

I'm just back online and have a lot of catching up to do, so this is a
rushed answer and I didn't read the background of this. However, the
question is somewhat ambiguous and the scope is broad as I'm not sure
which race you refer to. For file cache pages, I wouldn't expect the
address_space to be destroyed as long as the inode exists, which is the
structure containing the address_space in this case. A page on the LRU
being isolated in __isolate_lru_page will have an elevated reference count,
which will pin the inode until remove_mapping is called, which holds the
page lock, while inode truncation looking at a page for truncation also
only checks page_mapping under the page lock. Very broadly speaking, pages
avoid being added back to an inode being freed by checking the I_FREEING
state.

Hopefully that helps while I go back to the TODO mountain.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2018-01-02 10:21           ` Mel Gorman
@ 2018-01-02 11:29             ` Jan Kara
  2018-01-02 13:29               ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2018-01-02 11:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Minchan Kim, Huang, Ying, Andrew Morton, linux-mm, linux-kernel,
	Hugh Dickins, Paul E . McKenney, Johannes Weiner, Tim Chen,
	Shaohua Li, Mel Gorman, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Jan Kara,
	Dave Jiang, Aaron Lu

On Tue 02-01-18 10:21:03, Mel Gorman wrote:
> On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
> > > code path.  It appears that similar situation is possible for them too.
> > > 
> > > The file cache pages will be delete from file cache address_space before
> > > address_space (embedded in inode) is freed.  But they will be deleted
> > > from LRU list only when its refcount dropped to zero, please take a look
> > > at put_page() and release_pages().  While address_space will be freed
> > > after putting reference to all file cache pages.  If someone holds a
> > > reference to a file cache page for quite long time, it is possible for a
> > > file cache page to be in LRU list after the inode/address_space is
> > > freed.
> > > 
> > > And I found inode/address_space is freed witch call_rcu().  I don't know
> > > whether this is related to page_mapping().
> > > 
> > > This is just my understanding.
> > 
> > Hmm, it smells like a bug of __isolate_lru_page.
> > 
> > Ccing Mel:
> > 
> > What locks protects address_space destroying when race happens between
> > inode trauncation and __isolate_lru_page?
> > 
> 
> I'm just back online and have a lot of catching up to do so this is a rushed
> answer and I didn't read the background of this. However the question is
> somewhat ambiguous and the scope is broad as I'm not sure which race you
> refer to. For file cache pages, I wouldnt' expect the address_space to be
> destroyed specifically as long as the inode exists which is the structure
> containing the address_space in this case. A page on the LRU being isolated
> in __isolate_lru_page will have an elevated reference count which will
> pin the inode until remove_mapping is called which holds the page lock
> while inode truncation looking at a page for truncation also only checks
> page_mapping under the page lock. Very broadly speaking, pages avoid being
> added back to an inode being freed by checking the I_FREEING state.

So I'm wondering what prevents the following:

CPU1						CPU2

truncate(inode)					__isolate_lru_page()
  ...
  truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
      spin_lock_irqsave(&mapping->tree_lock, flags);
        __delete_from_page_cache(page, NULL)
          page_cache_tree_delete(..)
            ...					  mapping = page_mapping(page);
            page->mapping = NULL;
            ...
      spin_unlock_irqrestore(&mapping->tree_lock, flags);
      page_cache_free_page(mapping, page)
        put_page(page)
          if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space

						  if (mapping && !mapping->a_ops->migratepage)
- we've dereferenced mapping which is potentially already free.

This all seems very theoretical but in principle possible...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2018-01-02 11:29             ` Jan Kara
@ 2018-01-02 13:29               ` Mel Gorman
  2018-01-03  0:42                 ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2018-01-02 13:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mel Gorman, Minchan Kim, Huang, Ying, Andrew Morton, linux-mm,
	linux-kernel, Hugh Dickins, Paul E . McKenney, Johannes Weiner,
	Tim Chen, Shaohua Li, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Dave Jiang,
	Aaron Lu

On Tue, Jan 02, 2018 at 12:29:55PM +0100, Jan Kara wrote:
> On Tue 02-01-18 10:21:03, Mel Gorman wrote:
> > On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
> > > > code path.  It appears that similar situation is possible for them too.
> > > > 
> > > > The file cache pages will be delete from file cache address_space before
> > > > address_space (embedded in inode) is freed.  But they will be deleted
> > > > from LRU list only when its refcount dropped to zero, please take a look
> > > > at put_page() and release_pages().  While address_space will be freed
> > > > after putting reference to all file cache pages.  If someone holds a
> > > > reference to a file cache page for quite long time, it is possible for a
> > > > file cache page to be in LRU list after the inode/address_space is
> > > > freed.
> > > > 
> > > > And I found inode/address_space is freed witch call_rcu().  I don't know
> > > > whether this is related to page_mapping().
> > > > 
> > > > This is just my understanding.
> > > 
> > > Hmm, it smells like a bug of __isolate_lru_page.
> > > 
> > > Ccing Mel:
> > > 
> > > What locks protects address_space destroying when race happens between
> > > inode trauncation and __isolate_lru_page?
> > > 
> > 
> > I'm just back online and have a lot of catching up to do so this is a rushed
> > answer and I didn't read the background of this. However the question is
> > somewhat ambiguous and the scope is broad as I'm not sure which race you
> > refer to. For file cache pages, I wouldnt' expect the address_space to be
> > destroyed specifically as long as the inode exists which is the structure
> > containing the address_space in this case. A page on the LRU being isolated
> > in __isolate_lru_page will have an elevated reference count which will
> > pin the inode until remove_mapping is called which holds the page lock
> > while inode truncation looking at a page for truncation also only checks
> > page_mapping under the page lock. Very broadly speaking, pages avoid being
> > added back to an inode being freed by checking the I_FREEING state.
> 
> So I'm wondering what prevents the following:
> 
> CPU1						CPU2
> 
> truncate(inode)					__isolate_lru_page()
>   ...
>   truncate_inode_page(mapping, page);
>     delete_from_page_cache(page)
>       spin_lock_irqsave(&mapping->tree_lock, flags);
>         __delete_from_page_cache(page, NULL)
>           page_cache_tree_delete(..)
>             ...					  mapping = page_mapping(page);
>             page->mapping = NULL;
>             ...
>       spin_unlock_irqrestore(&mapping->tree_lock, flags);
>       page_cache_free_page(mapping, page)
>         put_page(page)
>           if (put_page_testzero(page)) -> false
> - inode now has no pages and can be freed including embedded address_space
> 
> 						  if (mapping && !mapping->a_ops->migratepage)
> - we've dereferenced mapping which is potentially already free.
> 

Hmm, possible if unlikely.

Before delete_from_page_cache, we called truncate_cleanup_page, so the
page is likely to be !PageDirty or PageWriteback, which gets skipped by
the only caller that checks the mapping in __isolate_lru_page. The race
is tiny but it does exist. One way of closing it is to check the mapping
under the page lock, which will prevent races with truncation. The
overhead is minimal as the calling context (compaction) is quite a heavy
operation anyway.

Build tested only for review


diff --git a/mm/vmscan.c b/mm/vmscan.c
index c02c850ea349..61bf0bc60d96 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1433,14 +1433,24 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 
 		if (PageDirty(page)) {
 			struct address_space *mapping;
+			bool migrate_dirty;
 
 			/*
 			 * Only pages without mappings or that have a
 			 * ->migratepage callback are possible to migrate
-			 * without blocking
+			 * without blocking. However, we can be racing with
+			 * truncation so it's necessary to lock the page
+			 * to stabilise the mapping as truncation holds
+			 * the page lock until after the page is removed
+			 * from the page cache.
 			 */
+			if (!trylock_page(page))
+				return ret;
+
 			mapping = page_mapping(page);
-			if (mapping && !mapping->a_ops->migratepage)
+			migrate_dirty = !mapping || mapping->a_ops->migratepage;
+			unlock_page(page);
+			if (!migrate_dirty)
 				return ret;
 		}
 	}

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2018-01-02 13:29               ` Mel Gorman
@ 2018-01-03  0:42                 ` Huang, Ying
  2018-01-03  9:54                   ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2018-01-03  0:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jan Kara, Mel Gorman, Minchan Kim, Andrew Morton, linux-mm,
	linux-kernel, Hugh Dickins, Paul E . McKenney, Johannes Weiner,
	Tim Chen, Shaohua Li, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Dave Jiang,
	Aaron Lu

Mel Gorman <mgorman@techsingularity.net> writes:

> On Tue, Jan 02, 2018 at 12:29:55PM +0100, Jan Kara wrote:
>> On Tue 02-01-18 10:21:03, Mel Gorman wrote:
>> > On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
>> > > > code path.  It appears that similar situation is possible for them too.
>> > > > 
>> > > > The file cache pages will be delete from file cache address_space before
>> > > > address_space (embedded in inode) is freed.  But they will be deleted
>> > > > from LRU list only when its refcount dropped to zero, please take a look
>> > > > at put_page() and release_pages().  While address_space will be freed
>> > > > after putting reference to all file cache pages.  If someone holds a
>> > > > reference to a file cache page for quite long time, it is possible for a
>> > > > file cache page to be in LRU list after the inode/address_space is
>> > > > freed.
>> > > > 
>> > > > And I found inode/address_space is freed witch call_rcu().  I don't know
>> > > > whether this is related to page_mapping().
>> > > > 
>> > > > This is just my understanding.
>> > > 
>> > > Hmm, it smells like a bug of __isolate_lru_page.
>> > > 
>> > > Ccing Mel:
>> > > 
>> > > What locks protects address_space destroying when race happens between
>> > > inode trauncation and __isolate_lru_page?
>> > > 
>> > 
>> > I'm just back online and have a lot of catching up to do so this is a rushed
>> > answer and I didn't read the background of this. However the question is
>> > somewhat ambiguous and the scope is broad as I'm not sure which race you
>> > refer to. For file cache pages, I wouldnt' expect the address_space to be
>> > destroyed specifically as long as the inode exists which is the structure
>> > containing the address_space in this case. A page on the LRU being isolated
>> > in __isolate_lru_page will have an elevated reference count which will
>> > pin the inode until remove_mapping is called which holds the page lock
>> > while inode truncation looking at a page for truncation also only checks
>> > page_mapping under the page lock. Very broadly speaking, pages avoid being
>> > added back to an inode being freed by checking the I_FREEING state.
>> 
>> So I'm wondering what prevents the following:
>> 
>> CPU1						CPU2
>> 
>> truncate(inode)					__isolate_lru_page()
>>   ...
>>   truncate_inode_page(mapping, page);
>>     delete_from_page_cache(page)
>>       spin_lock_irqsave(&mapping->tree_lock, flags);
>>         __delete_from_page_cache(page, NULL)
>>           page_cache_tree_delete(..)
>>             ...					  mapping = page_mapping(page);
>>             page->mapping = NULL;
>>             ...
>>       spin_unlock_irqrestore(&mapping->tree_lock, flags);
>>       page_cache_free_page(mapping, page)
>>         put_page(page)
>>           if (put_page_testzero(page)) -> false
>> - inode now has no pages and can be freed including embedded address_space
>> 
>> 						  if (mapping && !mapping->a_ops->migratepage)
>> - we've dereferenced mapping which is potentially already free.
>> 
>
> Hmm, possible if unlikely.
>
> Before delete_from_page_cache, we called truncate_cleanup_page so the
> page is likely to be !PageDirty or PageWriteback which gets skipped by
> the only caller that checks the mappping in __isolate_lru_page. The race
> is tiny but it does exist. One way of closing it is to check the mapping
> under the page lock which will prevent races with truncation. The
> overhead is minimal as the calling context (compaction) is quite a heavy
> operation anyway.
>

I think another possible fix is to use call_rcu_sched() to free the inode
(and address_space).  Because __isolate_lru_page() is called with the LRU
spinlock held and IRQs disabled, call_rcu_sched() will wait until the LRU
spinlock is released and IRQs are enabled again.
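
That is, something along the lines of the sketch below in fs/inode.c (a
sketch only, assuming the existing call_rcu() based freeing there; only
the default !->destroy_inode path is shown, other free paths would need
the same treatment):

	static void i_callback(struct rcu_head *head)
	{
		struct inode *inode = container_of(head, struct inode, i_rcu);

		kmem_cache_free(inode_cachep, inode);
	}

	static void destroy_inode(struct inode *inode)
	{
		__destroy_inode(inode);
		/*
		 * call_rcu_sched() also waits for preempt-disabled regions,
		 * e.g. __isolate_lru_page() running under lru_lock with
		 * IRQs disabled.
		 */
		call_rcu_sched(&inode->i_rcu, i_callback);
	}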

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2018-01-03  0:42                 ` Huang, Ying
@ 2018-01-03  9:54                   ` Mel Gorman
  2018-01-04  1:17                     ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2018-01-03  9:54 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Jan Kara, Mel Gorman, Minchan Kim, Andrew Morton, linux-mm,
	linux-kernel, Hugh Dickins, Paul E . McKenney, Johannes Weiner,
	Tim Chen, Shaohua Li, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Dave Jiang,
	Aaron Lu

On Wed, Jan 03, 2018 at 08:42:15AM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@techsingularity.net> writes:
> 
> > On Tue, Jan 02, 2018 at 12:29:55PM +0100, Jan Kara wrote:
> >> On Tue 02-01-18 10:21:03, Mel Gorman wrote:
> >> > On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
> >> > > > code path.  It appears that similar situation is possible for them too.
> >> > > > 
> >> > > > The file cache pages will be delete from file cache address_space before
> >> > > > address_space (embedded in inode) is freed.  But they will be deleted
> >> > > > from LRU list only when its refcount dropped to zero, please take a look
> >> > > > at put_page() and release_pages().  While address_space will be freed
> >> > > > after putting reference to all file cache pages.  If someone holds a
> >> > > > reference to a file cache page for quite long time, it is possible for a
> >> > > > file cache page to be in LRU list after the inode/address_space is
> >> > > > freed.
> >> > > > 
> >> > > > And I found inode/address_space is freed witch call_rcu().  I don't know
> >> > > > whether this is related to page_mapping().
> >> > > > 
> >> > > > This is just my understanding.
> >> > > 
> >> > > Hmm, it smells like a bug of __isolate_lru_page.
> >> > > 
> >> > > Ccing Mel:
> >> > > 
> >> > > What locks protects address_space destroying when race happens between
> >> > > inode trauncation and __isolate_lru_page?
> >> > > 
> >> > 
> >> > I'm just back online and have a lot of catching up to do so this is a rushed
> >> > answer and I didn't read the background of this. However the question is
> >> > somewhat ambiguous and the scope is broad as I'm not sure which race you
> >> > refer to. For file cache pages, I wouldnt' expect the address_space to be
> >> > destroyed specifically as long as the inode exists which is the structure
> >> > containing the address_space in this case. A page on the LRU being isolated
> >> > in __isolate_lru_page will have an elevated reference count which will
> >> > pin the inode until remove_mapping is called which holds the page lock
> >> > while inode truncation looking at a page for truncation also only checks
> >> > page_mapping under the page lock. Very broadly speaking, pages avoid being
> >> > added back to an inode being freed by checking the I_FREEING state.
> >> 
> >> So I'm wondering what prevents the following:
> >> 
> >> CPU1						CPU2
> >> 
> >> truncate(inode)					__isolate_lru_page()
> >>   ...
> >>   truncate_inode_page(mapping, page);
> >>     delete_from_page_cache(page)
> >>       spin_lock_irqsave(&mapping->tree_lock, flags);
> >>         __delete_from_page_cache(page, NULL)
> >>           page_cache_tree_delete(..)
> >>             ...					  mapping = page_mapping(page);
> >>             page->mapping = NULL;
> >>             ...
> >>       spin_unlock_irqrestore(&mapping->tree_lock, flags);
> >>       page_cache_free_page(mapping, page)
> >>         put_page(page)
> >>           if (put_page_testzero(page)) -> false
> >> - inode now has no pages and can be freed including embedded address_space
> >> 
> >> 						  if (mapping && !mapping->a_ops->migratepage)
> >> - we've dereferenced mapping which is potentially already free.
> >> 
> >
> > Hmm, possible if unlikely.
> >
> > Before delete_from_page_cache, we called truncate_cleanup_page so the
> > page is likely to be !PageDirty or PageWriteback which gets skipped by
> > the only caller that checks the mappping in __isolate_lru_page. The race
> > is tiny but it does exist. One way of closing it is to check the mapping
> > under the page lock which will prevent races with truncation. The
> > overhead is minimal as the calling context (compaction) is quite a heavy
> > operation anyway.
> >
> 
> I think another possible fix is to use call_rcu_sched() to free inode
> (and address_space).  Because __isolate_lru_page() will be called with
> LRU spinlock held and IRQ disabled, call_rcu_sched() will wait
> LRU spin_unlock and IRQ enabled.
> 

Maybe, but in this particular case, I would prefer to go with something
more conventional unless there is strong evidence that it's an improvement
(which I doubt in this case given the cost of migration overall and the
corner case of migrating a dirty page).

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2018-01-03  9:54                   ` Mel Gorman
@ 2018-01-04  1:17                     ` Huang, Ying
  2018-01-04 10:21                       ` Mel Gorman
  0 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2018-01-04  1:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jan Kara, Mel Gorman, Minchan Kim, Andrew Morton, linux-mm,
	linux-kernel, Hugh Dickins, Paul E . McKenney, Johannes Weiner,
	Tim Chen, Shaohua Li, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Dave Jiang,
	Aaron Lu

Mel Gorman <mgorman@techsingularity.net> writes:

> On Wed, Jan 03, 2018 at 08:42:15AM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@techsingularity.net> writes:
>> 
>> > On Tue, Jan 02, 2018 at 12:29:55PM +0100, Jan Kara wrote:
>> >> On Tue 02-01-18 10:21:03, Mel Gorman wrote:
>> >> > On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
>> >> > > > code path.  It appears that similar situation is possible for them too.
>> >> > > > 
>> >> > > > The file cache pages will be delete from file cache address_space before
>> >> > > > address_space (embedded in inode) is freed.  But they will be deleted
>> >> > > > from LRU list only when its refcount dropped to zero, please take a look
>> >> > > > at put_page() and release_pages().  While address_space will be freed
>> >> > > > after putting reference to all file cache pages.  If someone holds a
>> >> > > > reference to a file cache page for quite long time, it is possible for a
>> >> > > > file cache page to be in LRU list after the inode/address_space is
>> >> > > > freed.
>> >> > > > 
>> >> > > > And I found inode/address_space is freed witch call_rcu().  I don't know
>> >> > > > whether this is related to page_mapping().
>> >> > > > 
>> >> > > > This is just my understanding.
>> >> > > 
>> >> > > Hmm, it smells like a bug of __isolate_lru_page.
>> >> > > 
>> >> > > Ccing Mel:
>> >> > > 
>> >> > > What locks protects address_space destroying when race happens between
>> >> > > inode trauncation and __isolate_lru_page?
>> >> > > 
>> >> > 
>> >> > I'm just back online and have a lot of catching up to do so this is a rushed
>> >> > answer and I didn't read the background of this. However the question is
>> >> > somewhat ambiguous and the scope is broad as I'm not sure which race you
>> >> > refer to. For file cache pages, I wouldnt' expect the address_space to be
>> >> > destroyed specifically as long as the inode exists which is the structure
>> >> > containing the address_space in this case. A page on the LRU being isolated
>> >> > in __isolate_lru_page will have an elevated reference count which will
>> >> > pin the inode until remove_mapping is called which holds the page lock
>> >> > while inode truncation looking at a page for truncation also only checks
>> >> > page_mapping under the page lock. Very broadly speaking, pages avoid being
>> >> > added back to an inode being freed by checking the I_FREEING state.
>> >> 
>> >> So I'm wondering what prevents the following:
>> >> 
>> >> CPU1						CPU2
>> >> 
>> >> truncate(inode)					__isolate_lru_page()
>> >>   ...
>> >>   truncate_inode_page(mapping, page);
>> >>     delete_from_page_cache(page)
>> >>       spin_lock_irqsave(&mapping->tree_lock, flags);
>> >>         __delete_from_page_cache(page, NULL)
>> >>           page_cache_tree_delete(..)
>> >>             ...					  mapping = page_mapping(page);
>> >>             page->mapping = NULL;
>> >>             ...
>> >>       spin_unlock_irqrestore(&mapping->tree_lock, flags);
>> >>       page_cache_free_page(mapping, page)
>> >>         put_page(page)
>> >>           if (put_page_testzero(page)) -> false
>> >> - inode now has no pages and can be freed including embedded address_space
>> >> 
>> >> 						  if (mapping && !mapping->a_ops->migratepage)
>> >> - we've dereferenced mapping which is potentially already free.
>> >> 
>> >
>> > Hmm, possible if unlikely.
>> >
>> > Before delete_from_page_cache, we called truncate_cleanup_page so the
>> > page is likely to be !PageDirty or PageWriteback which gets skipped by
>> > the only caller that checks the mappping in __isolate_lru_page. The race
>> > is tiny but it does exist. One way of closing it is to check the mapping
>> > under the page lock which will prevent races with truncation. The
>> > overhead is minimal as the calling context (compaction) is quite a heavy
>> > operation anyway.
>> >
>> 
>> I think another possible fix is to use call_rcu_sched() to free inode
>> (and address_space).  Because __isolate_lru_page() will be called with
>> LRU spinlock held and IRQ disabled, call_rcu_sched() will wait
>> LRU spin_unlock and IRQ enabled.
>> 
>
> Maybe, but in this particular case, I would prefer to go with something
> more conventional unless there is strong evidence that it's an improvement
> (which I doubt in this case given the cost of migration overall and the
> corner case of migrating a dirty page).

So you like page_lock() more than RCU?  Is there any problem with RCU?
Is it that the object to be protected isn't clear?

Another way to fix this with RCU is to replace
trylock_page()/unlock_page() with rcu_read_lock()/rcu_read_unlock() in
your fix.
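
That is, something like this on top of your patch (a sketch only, and it
assumes the inode/address_space is always freed after an RCU grace
period):

	if (PageDirty(page)) {
		struct address_space *mapping;
		bool migrate_dirty;

		/*
		 * An RCU read-side critical section keeps the
		 * address_space from being freed under us while we
		 * check for a ->migratepage callback.
		 */
		rcu_read_lock();
		mapping = page_mapping(page);
		migrate_dirty = !mapping || mapping->a_ops->migratepage;
		rcu_read_unlock();
		if (!migrate_dirty)
			return ret;
	}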

JFYI, please keep your fix if you think that is more appropriate.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations
  2018-01-04  1:17                     ` Huang, Ying
@ 2018-01-04 10:21                       ` Mel Gorman
  0 siblings, 0 replies; 15+ messages in thread
From: Mel Gorman @ 2018-01-04 10:21 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Jan Kara, Mel Gorman, Minchan Kim, Andrew Morton, linux-mm,
	linux-kernel, Hugh Dickins, Paul E . McKenney, Johannes Weiner,
	Tim Chen, Shaohua Li, Jérôme Glisse, Michal Hocko,
	Andrea Arcangeli, David Rientjes, Rik van Riel, Dave Jiang,
	Aaron Lu

On Thu, Jan 04, 2018 at 09:17:36AM +0800, Huang, Ying wrote:
> > Maybe, but in this particular case, I would prefer to go with something
> > more conventional unless there is strong evidence that it's an improvement
> > (which I doubt in this case given the cost of migration overall and the
> > corner case of migrating a dirty page).
> 
> So you like page_lock() more than RCU? 

In this instance, yes.

> Is there any problem of RCU?
> The object to be protected isn't clear?
> 

It's not clear what object is being protected or how it's protected, and
it's not the usual means by which a mapping is pinned. Furthermore, in the
event a page is being truncated, we really do not want to bother doing any
migration work for compaction purposes as it's a waste.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-01-04 10:21 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-20  1:26 [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations Huang, Ying
2017-12-21  2:16 ` Minchan Kim
     [not found]   ` <871sjopllj.fsf@yhuang-dev.intel.com>
2017-12-21 23:58     ` Minchan Kim
2017-12-22 14:14       ` Huang, Ying
2017-12-22 16:14         ` Paul E. McKenney
2017-12-25  1:28           ` Huang, Ying
2017-12-23  1:36         ` Minchan Kim
2017-12-26  5:33           ` Huang, Ying
2018-01-02 10:21           ` Mel Gorman
2018-01-02 11:29             ` Jan Kara
2018-01-02 13:29               ` Mel Gorman
2018-01-03  0:42                 ` Huang, Ying
2018-01-03  9:54                   ` Mel Gorman
2018-01-04  1:17                     ` Huang, Ying
2018-01-04 10:21                       ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).