* [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
@ 2017-12-07  1:14 Huang, Ying
  2017-12-08  0:29 ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Huang, Ying @ 2017-12-07  1:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

From: Huang Ying <ying.huang@intel.com>

When swapin is performed, after getting the swap entry information
from the page table, the PTL (page table lock) is released, then the
system goes on to swap in the swap entry without any lock held to
prevent the swap device from being swapped off.  This may cause a
race like the one below,

CPU 1				CPU 2
-----				-----
				do_swap_page
				  swapin_readahead
				    __read_swap_cache_async
swapoff				      swapcache_prepare
  p->swap_map = NULL		        __swap_duplicate
					  p->swap_map[?] /* !!! NULL pointer access */

Because swapoff is usually done only at system shutdown, the race may
not hit many people in practice.  But it is still a race that needs to
be fixed.

To fix the race, a reference count is added to the SIS (struct
swap_info_struct).  The data structures referenced by the SIS will be
freed only when the reference count indicates that there are no other
users.

The refcount is implemented with atomic_inc_not_zero() instead of the
"normal" refcount_inc(), because when a swap entry is swapped in for
shmem, the swap entry is found in the page mapping radix tree
locklessly, so we cannot take a valid reference to the SIS under a
lock.  Instead, the refcount is set to 0 when the SIS is swapped off
or uninitialized, so atomic_inc_not_zero() will return false in those
cases.

Several other code paths besides swapin have a similar race; they are
fixed in the same way.
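
A rough sketch of the intended caller pattern (illustrative only; the
real helpers and their callers are in the patch below):

	struct swap_info_struct *si;

	si = get_swap_device(entry);
	if (!si)
		return;		/* device swapped off or uninitialized */
	/* si->swap_map, si->cluster_info, etc. are safe to use here */
	put_swap_device(si);	/* swapoff may free them after this */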

Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 include/linux/swap.h | 13 +++++++-
 mm/madvise.c         | 15 +++++++---
 mm/memory.c          | 26 ++++++++++++----
 mm/shmem.c           | 11 +++++++
 mm/swapfile.c        | 83 ++++++++++++++++++++++++++++++++++++++++++++--------
 5 files changed, 125 insertions(+), 23 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..ba9ced8a6a1d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -173,7 +173,6 @@ enum {
 	SWP_STABLE_WRITES = (1 << 10),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 11),	/* synchronous IO is efficient */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 12),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
@@ -230,6 +229,7 @@ struct swap_cluster_list {
  */
 struct swap_info_struct {
 	unsigned long	flags;		/* SWP_USED etc: see above */
+	atomic_t	count;		/* reference count of the swap device */
 	signed short	prio;		/* swap priority of this type */
 	struct plist_node list;		/* entry in swap_active_head */
 	struct plist_node avail_lists[MAX_NUMNODES];/* entry in swap_avail_heads */
@@ -470,6 +470,8 @@ extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
+extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
+extern void put_swap_device(struct swap_info_struct *si);
 
 #else /* CONFIG_SWAP */
 
@@ -605,6 +607,15 @@ static inline swp_entry_t get_swap_page(struct page *page)
 	return entry;
 }
 
+static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline void put_swap_device(struct swap_info_struct *si)
+{
+}
+
 #endif /* CONFIG_SWAP */
 
 #ifdef CONFIG_THP_SWAP
diff --git a/mm/madvise.c b/mm/madvise.c
index 751e97aa2210..74544424e96d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -205,21 +205,28 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 		swp_entry_t entry;
 		struct page *page;
 		spinlock_t *ptl;
+		struct swap_info_struct *si;
 
 		orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
 		pte = *(orig_pte + ((index - start) / PAGE_SIZE));
-		pte_unmap_unlock(orig_pte, ptl);
-
 		if (pte_present(pte) || pte_none(pte))
-			continue;
+			goto unlock_continue;
 		entry = pte_to_swp_entry(pte);
 		if (unlikely(non_swap_entry(entry)))
-			continue;
+			goto unlock_continue;
+		si = get_swap_device(entry);
+		if (!si)
+			goto unlock_continue;
+		pte_unmap_unlock(orig_pte, ptl);
 
 		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
 							vma, index, false);
 		if (page)
 			put_page(page);
+		put_swap_device(si);
+		continue;
+unlock_continue:
+		pte_unmap_unlock(orig_pte, ptl);
 	}
 
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 1c9f3c03be15..68bbbed084e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1066,6 +1066,7 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	int progress = 0;
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
+	struct swap_info_struct *si = NULL;
 
 again:
 	init_rss_vec(rss);
@@ -1097,8 +1098,10 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		}
 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
 							vma, addr, rss);
-		if (entry.val)
+		if (entry.val) {
+			si = get_swap_device(entry);
 			break;
+		}
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -1109,8 +1112,12 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
 	cond_resched();
 
-	if (entry.val) {
-		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
+	if (entry.val && si) {
+		int ret;
+
+		ret = add_swap_count_continuation(entry, GFP_KERNEL);
+		put_swap_device(si);
+		if (ret < 0)
 			return -ENOMEM;
 		progress = 0;
 	}
@@ -2849,6 +2856,7 @@ int do_swap_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page = NULL, *swapcache = NULL;
 	struct mem_cgroup *memcg;
+	struct swap_info_struct *si = NULL;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
@@ -2880,12 +2888,13 @@ int do_swap_page(struct vm_fault *vmf)
 		goto out;
 	}
 
+	si = get_swap_device(entry);
+	if (unlikely(!si))
+		goto out;
 
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry, vma, vmf->address);
 	if (!page) {
-		struct swap_info_struct *si = swp_swap_info(entry);
-
 		if (si->flags & SWP_SYNCHRONOUS_IO &&
 				__swap_count(si, entry) == 1) {
 			/* skip swapcache */
@@ -2932,6 +2941,9 @@ int do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
+	put_swap_device(si);
+	si = NULL;
+
 	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
 
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -3042,6 +3054,8 @@ int do_swap_page(struct vm_fault *vmf)
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
+	if (si)
+		put_swap_device(si);
 	return ret;
 out_nomap:
 	mem_cgroup_cancel_charge(page, memcg, false);
@@ -3049,6 +3063,8 @@ int do_swap_page(struct vm_fault *vmf)
 out_page:
 	unlock_page(page);
 out_release:
+	if (si)
+		put_swap_device(si);
 	put_page(page);
 	if (page != swapcache && swapcache) {
 		unlock_page(swapcache);
diff --git a/mm/shmem.c b/mm/shmem.c
index 2b157bd55326..5b0c9d9cdc86 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1603,6 +1603,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	struct mm_struct *charge_mm;
 	struct mem_cgroup *memcg;
 	struct page *page;
+	struct swap_info_struct *si = NULL;
 	swp_entry_t swap;
 	enum sgp_type sgp_huge = sgp;
 	pgoff_t hindex = index;
@@ -1619,6 +1620,10 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	page = find_lock_entry(mapping, index);
 	if (radix_tree_exceptional_entry(page)) {
 		swap = radix_to_swp_entry(page);
+		si = get_swap_device(swap);
+		/* Make sure the swap device is not swapped off/on under us */
+		if (!si || !shmem_confirm_swap(mapping, index, swap))
+			goto repeat;
 		page = NULL;
 	}
 
@@ -1668,6 +1673,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 				goto failed;
 			}
 		}
+		put_swap_device(si);
+		si = NULL;
 
 		/* We have to do this with page locked to prevent races */
 		lock_page(page);
@@ -1895,6 +1902,10 @@ alloc_nohuge:		page = shmem_alloc_and_acct_page(gfp, inode,
 	if (swap.val && !shmem_confirm_swap(mapping, index, swap))
 		error = -EEXIST;
 unlock:
+	if (si) {
+		put_swap_device(si);
+		si = NULL;
+	}
 	if (page) {
 		unlock_page(page);
 		put_page(page);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3074b02eaa09..d219cd66d479 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -505,6 +505,57 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 		free_cluster(p, idx);
 }
 
+/*
+ * Must be called when the swap device is guaranteed to be valid, such
+ * as with swap_info_struct->lock held, etc.
+ */
+static void __get_swap_device(struct swap_info_struct *si)
+{
+	atomic_inc(&si->count);
+}
+
+/*
+ * Get a reference to the swap_info_struct to prevent it from being
+ * swapped off.  Return NULL if the swap device has been swapped off,
+ * is uninitialized, or the entry is invalid.
+ */
+struct swap_info_struct *get_swap_device(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	unsigned long type;
+
+	if (!entry.val)
+		return NULL;
+	type = swp_type(entry);
+	if (type >= nr_swapfiles)
+		goto bad_nofile;
+	si = swap_info[type];
+	/*
+	 * If the swap device is uninitialized or destroyed, the
+	 * reference count will be 0
+	 */
+	if (!atomic_inc_not_zero(&si->count))
+		si = NULL;
+	/*
+	 * If atomic_inc_not_zero() increments successfully, it will be
+	 * fully ordered with the following accesses.
+	 */
+
+	return si;
+bad_nofile:
+	pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
+	return NULL;
+}
+
+/*
+ * Put the reference to the swap_info_struct
+ */
+void put_swap_device(struct swap_info_struct *si)
+{
+	/* RELEASE: make all accesses to si complete before the refcount drop */
+	atomic_dec_return_release(&si->count);
+}
+
 /*
  * It's possible scan_swap_map() uses a free cluster in the middle of free
  * cluster list. Avoiding such abuse to avoid list corruption.
@@ -691,7 +742,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	 * And we let swap pages go all over an SSD partition.  Hugh
 	 */
 
-	si->flags += SWP_SCANNING;
+	__get_swap_device(si);
 	scan_base = offset = si->cluster_next;
 
 	/* SSD algorithm */
@@ -821,7 +872,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	}
 
 done:
-	si->flags -= SWP_SCANNING;
+	put_swap_device(si);
 	return n_ret;
 
 scan:
@@ -859,7 +910,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	spin_lock(&si->lock);
 
 no_page:
-	si->flags -= SWP_SCANNING;
+	put_swap_device(si);
 	return n_ret;
 }
 
@@ -2479,6 +2530,8 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
 	p->swap_map = swap_map;
 	p->cluster_info = cluster_info;
 	p->flags |= SWP_WRITEOK;
+	/* RELEASE: make the stores above visible before the refcount inc */
+	atomic_inc_return_release(&p->count);
 	atomic_long_add(p->pages, &nr_swap_pages);
 	total_swap_pages += p->pages;
 
@@ -2599,6 +2652,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	atomic_long_sub(p->pages, &nr_swap_pages);
 	total_swap_pages -= p->pages;
 	p->flags &= ~SWP_WRITEOK;
+	p->highest_bit = 0;		/* cuts scans short */
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 
@@ -2617,6 +2671,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	reenable_swap_slots_cache_unlock();
 
+	/*
+	 * Wait for anyone still operating on the swap device, e.g. scan,
+	 * swapin, page table copy, etc.
+	 */
+	atomic_dec(&p->count);
+	while (atomic_read(&p->count) != 0)
+		schedule_timeout_uninterruptible(1);
+	/*
+	 * The control dependency on p->count above guarantees the following
+	 * stores will not occur before p->count is decremented in other threads.
+	 */
+
 	flush_work(&p->discard_work);
 
 	destroy_swap_extents(p);
@@ -2631,16 +2697,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_lock(&p->lock);
 	drain_mmlist();
 
-	/* wait for anyone still in scan_swap_map */
-	p->highest_bit = 0;		/* cuts scans short */
-	while (p->flags >= SWP_SCANNING) {
-		spin_unlock(&p->lock);
-		spin_unlock(&swap_lock);
-		schedule_timeout_uninterruptible(1);
-		spin_lock(&swap_lock);
-		spin_lock(&p->lock);
-	}
-
 	swap_file = p->swap_file;
 	old_block_size = p->old_block_size;
 	p->swap_file = NULL;
@@ -2861,6 +2917,7 @@ static struct swap_info_struct *alloc_swap_info(void)
 		 */
 		smp_wmb();
 		nr_swapfiles++;
+		atomic_set(&p->count, 0);
 	} else {
 		kfree(p);
 		p = swap_info[type];
-- 
2.15.0


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-07  1:14 [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations Huang, Ying
@ 2017-12-08  0:29 ` Andrew Morton
  2017-12-08  1:43   ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2017-12-08  0:29 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Hugh Dickins, Minchan Kim,
	Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Thu,  7 Dec 2017 09:14:26 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> When the swapin is performed, after getting the swap entry information
> from the page table, the PTL (page table lock) will be released, then
> system will go to swap in the swap entry, without any lock held to
> prevent the swap device from being swapoff.  This may cause the race
> like below,
> 
> CPU 1				CPU 2
> -----				-----
> 				do_swap_page
> 				  swapin_readahead
> 				    __read_swap_cache_async
> swapoff				      swapcache_prepare
>   p->swap_map = NULL		        __swap_duplicate
> 					  p->swap_map[?] /* !!! NULL pointer access */
> 
> Because swap off is usually done when system shutdown only, the race
> may not hit many people in practice.  But it is still a race need to
> be fixed.

swapoff is so rare that it's hard to get motivated about any fix which
adds overhead to the regular codepaths.

Is there something we can do to ensure that all the overhead of this
fix is placed into the swapoff side?  stop_machine() may be a bit
brutal, but a surprising amount of code uses it.  Any other ideas?


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-08  0:29 ` Andrew Morton
@ 2017-12-08  1:43   ` Minchan Kim
       [not found]     ` <87po7pg4jt.fsf@yhuang-dev.intel.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2017-12-08  1:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Huang, Ying, linux-mm, linux-kernel, Hugh Dickins,
	Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Thu, Dec 07, 2017 at 04:29:37PM -0800, Andrew Morton wrote:
> On Thu,  7 Dec 2017 09:14:26 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
> 
> > When the swapin is performed, after getting the swap entry information
> > from the page table, the PTL (page table lock) will be released, then
> > system will go to swap in the swap entry, without any lock held to
> > prevent the swap device from being swapoff.  This may cause the race
> > like below,
> > 
> > CPU 1				CPU 2
> > -----				-----
> > 				do_swap_page
> > 				  swapin_readahead
> > 				    __read_swap_cache_async
> > swapoff				      swapcache_prepare
> >   p->swap_map = NULL		        __swap_duplicate
> > 					  p->swap_map[?] /* !!! NULL pointer access */
> > 
> > Because swap off is usually done when system shutdown only, the race
> > may not hit many people in practice.  But it is still a race need to
> > be fixed.
> 
> swapoff is so rare that it's hard to get motivated about any fix which
> adds overhead to the regular codepaths.

That was my concern, too, when I saw this patch.

> 
> Is there something we can do to ensure that all the overhead of this
> fix is placed into the swapoff side?  stop_machine() may be a bit
> brutal, but a surprising amount of code uses it.  Any other ideas?

How about this?

I think it's the same approach as the old one, where we used si->lock
everywhere instead of the more fine-grained cluster lock.

The reason I repeatedly reset p->max to zero in the loop is to avoid
needing a lockdep annotation (maybe spin_lock_nested(something)) to
prevent a false positive.

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 42fe5653814a..9ce007a42bbc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2644,6 +2644,19 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	swap_file = p->swap_file;
 	old_block_size = p->old_block_size;
 	p->swap_file = NULL;
+
+	if (p->flags & SWP_SOLIDSTATE) {
+		unsigned long ci, nr_cluster;
+
+		nr_cluster = DIV_ROUND_UP(p->max, SWAPFILE_CLUSTER);
+		for (ci = 0; ci < nr_cluster; ci++) {
+			struct swap_cluster_info *sci;
+
+			sci = lock_cluster(p, ci * SWAPFILE_CLUSTER);
+			p->max = 0;
+			unlock_cluster(sci);
+		}
+	}
 	p->max = 0;
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
@@ -3369,10 +3382,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 		goto bad_file;
 	p = swap_info[type];
 	offset = swp_offset(entry);
-	if (unlikely(offset >= p->max))
-		goto out;
 
 	ci = lock_cluster_or_swap_info(p, offset);
+	if (unlikely(offset >= p->max))
+		goto unlock_out;
 
 	count = p->swap_map[offset];
 


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
       [not found]     ` <87po7pg4jt.fsf@yhuang-dev.intel.com>
@ 2017-12-08  8:26       ` Minchan Kim
  2017-12-08  8:41         ` Huang, Ying
  0 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2017-12-08  8:26 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Paul E. McKenney, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Fri, Dec 08, 2017 at 01:41:10PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Thu, Dec 07, 2017 at 04:29:37PM -0800, Andrew Morton wrote:
> >> On Thu,  7 Dec 2017 09:14:26 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
> >> 
> >> > When the swapin is performed, after getting the swap entry information
> >> > from the page table, the PTL (page table lock) will be released, then
> >> > system will go to swap in the swap entry, without any lock held to
> >> > prevent the swap device from being swapoff.  This may cause the race
> >> > like below,
> >> > 
> >> > CPU 1				CPU 2
> >> > -----				-----
> >> > 				do_swap_page
> >> > 				  swapin_readahead
> >> > 				    __read_swap_cache_async
> >> > swapoff				      swapcache_prepare
> >> >   p->swap_map = NULL		        __swap_duplicate
> >> > 					  p->swap_map[?] /* !!! NULL pointer access */
> >> > 
> >> > Because swap off is usually done when system shutdown only, the race
> >> > may not hit many people in practice.  But it is still a race need to
> >> > be fixed.
> >> 
> >> swapoff is so rare that it's hard to get motivated about any fix which
> >> adds overhead to the regular codepaths.
> >
> > That was my concern, too when I see this patch.
> >
> >> 
> >> Is there something we can do to ensure that all the overhead of this
> >> fix is placed into the swapoff side?  stop_machine() may be a bit
> >> brutal, but a surprising amount of code uses it.  Any other ideas?
> >
> > How about this?
> >
> > I think It's same approach with old where we uses si->lock everywhere
> > instead of more fine-grained cluster lock.
> >
> > The reason I repeated to reset p->max to zero in the loop is to avoid
> > using lockdep annotation(maybe, spin_lock_nested(something) to prevent
> > false positive.
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 42fe5653814a..9ce007a42bbc 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -2644,6 +2644,19 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >  	swap_file = p->swap_file;
> >  	old_block_size = p->old_block_size;
> >  	p->swap_file = NULL;
> > +
> > +	if (p->flags & SWP_SOLIDSTATE) {
> > +		unsigned long ci, nr_cluster;
> > +
> > +		nr_cluster = DIV_ROUND_UP(p->max, SWAPFILE_CLUSTER);
> > +		for (ci = 0; ci < nr_cluster; ci++) {
> > +			struct swap_cluster_info *sci;
> > +
> > +			sci = lock_cluster(p, ci * SWAPFILE_CLUSTER);
> > +			p->max = 0;
> > +			unlock_cluster(sci);
> > +		}
> > +	}
> >  	p->max = 0;
> >  	swap_map = p->swap_map;
> >  	p->swap_map = NULL;
> > @@ -3369,10 +3382,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> >  		goto bad_file;
> >  	p = swap_info[type];
> >  	offset = swp_offset(entry);
> > -	if (unlikely(offset >= p->max))
> > -		goto out;
> >  
> >  	ci = lock_cluster_or_swap_info(p, offset);
> > +	if (unlikely(offset >= p->max))
> > +		goto unlock_out;
> >  
> >  	count = p->swap_map[offset];
> >  
> 
> Sorry, this doesn't work, because
> 
> lock_cluster_or_swap_info()
> 
> Need to read p->cluster_info, which may be freed during swapoff too.
> 
> 
> To reduce the added overhead in regular code path, Maybe we can use SRCU
> to implement get_swap_device() and put_swap_device()?  There is only
> increment/decrement on CPU local variable in srcu_read_lock/unlock().
> Should be acceptable in not so hot swap path?
> 
> This needs to select CONFIG_SRCU if CONFIG_SWAP is enabled.  But I guess
> that should be acceptable too?
> 

Why do we need srcu here? Is rcu, like below, not enough?

It might have a bug or room to be optimized regarding performance/naming.
I just wanted to show my intention.

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2417d288e016..bfe493f3bcb8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -273,6 +273,7 @@ struct swap_info_struct {
 					 */
 	struct work_struct discard_work; /* discard worker */
 	struct swap_cluster_list discard_clusters; /* discard clusters list */
+	struct rcu_head	rcu;
 };
 
 #ifdef CONFIG_64BIT
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 42fe5653814a..ecec064f9b20 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -302,6 +302,7 @@ static inline struct swap_cluster_info *lock_cluster_or_swap_info(
 {
 	struct swap_cluster_info *ci;
 
+	rcu_read_lock();
 	ci = lock_cluster(si, offset);
 	if (!ci)
 		spin_lock(&si->lock);
@@ -316,6 +317,7 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
 		unlock_cluster(ci);
 	else
 		spin_unlock(&si->lock);
+	rcu_read_unlock();
 }
 
 static inline bool cluster_list_empty(struct swap_cluster_list *list)
@@ -2530,11 +2532,17 @@ bool has_usable_swap(void)
 	return ret;
 }
 
+static void swap_cluster_info_free(struct rcu_head *rcu)
+{
+	struct swap_info_struct *si = container_of(rcu, struct swap_info_struct, rcu);
+	kvfree(si->cluster_info);
+	si->cluster_info = NULL;
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
 	unsigned char *swap_map;
-	struct swap_cluster_info *cluster_info;
 	unsigned long *frontswap_map;
 	struct file *swap_file, *victim;
 	struct address_space *mapping;
@@ -2542,6 +2550,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	struct filename *pathname;
 	int err, found = 0;
 	unsigned int old_block_size;
+	unsigned long ci, nr_cluster;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -2631,6 +2640,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_lock(&p->lock);
 	drain_mmlist();
 
+	nr_cluster = DIV_ROUND_UP(p->max, SWAPFILE_CLUSTER);
 	/* wait for anyone still in scan_swap_map */
 	p->highest_bit = 0;		/* cuts scans short */
 	while (p->flags >= SWP_SCANNING) {
@@ -2641,14 +2651,33 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		spin_lock(&p->lock);
 	}
 
+	if (p->flags & SWP_SOLIDSTATE) {
+		struct swap_cluster_info *sci, *head_cluster;
+
+		head_cluster = p->cluster_info;
+		spin_lock(&head_cluster->lock);
+		sci = head_cluster + 1;
+		for (ci = 1; ci < nr_cluster; ci++, sci++)
+			spin_lock_nest_lock(&sci->lock, &head_cluster->lock);
+	}
+
 	swap_file = p->swap_file;
 	old_block_size = p->old_block_size;
 	p->swap_file = NULL;
 	p->max = 0;
 	swap_map = p->swap_map;
 	p->swap_map = NULL;
-	cluster_info = p->cluster_info;
-	p->cluster_info = NULL;
+
+	if (p->flags & SWP_SOLIDSTATE) {
+		struct swap_cluster_info *sci, *head_cluster;
+
+		head_cluster = p->cluster_info;
+		sci = head_cluster + 1;
+		for (ci = 1; ci < nr_cluster; ci++, sci++)
+			spin_unlock(&sci->lock);
+		spin_unlock(&head_cluster->lock);
+	}
+
 	frontswap_map = frontswap_map_get(p);
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
@@ -2658,7 +2687,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	free_percpu(p->percpu_cluster);
 	p->percpu_cluster = NULL;
 	vfree(swap_map);
-	kvfree(cluster_info);
+	call_rcu(&p->rcu, swap_cluster_info_free);
+	synchronize_rcu();
 	kvfree(frontswap_map);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
@@ -3369,10 +3399,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 		goto bad_file;
 	p = swap_info[type];
 	offset = swp_offset(entry);
-	if (unlikely(offset >= p->max))
-		goto out;
 
 	ci = lock_cluster_or_swap_info(p, offset);
+	if (unlikely(offset >= p->max))
+		goto unlock_out;
 
 	count = p->swap_map[offset];
 


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-08  8:26       ` Minchan Kim
@ 2017-12-08  8:41         ` Huang, Ying
  2017-12-08  9:10           ` Minchan Kim
  2017-12-08 22:09           ` Andrew Morton
  0 siblings, 2 replies; 16+ messages in thread
From: Huang, Ying @ 2017-12-08  8:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, Paul E. McKenney, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Tim Chen,
	Shaohua Li, Mel Gorman, Jérôme Glisse,
	Michal Hocko, Andrea Arcangeli, David Rientjes, Rik van Riel,
	Jan Kara, Dave Jiang, Aaron Lu

Minchan Kim <minchan@kernel.org> writes:

> On Fri, Dec 08, 2017 at 01:41:10PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Thu, Dec 07, 2017 at 04:29:37PM -0800, Andrew Morton wrote:
>> >> On Thu,  7 Dec 2017 09:14:26 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>> >> 
>> >> > When the swapin is performed, after getting the swap entry information
>> >> > from the page table, the PTL (page table lock) will be released, then
>> >> > system will go to swap in the swap entry, without any lock held to
>> >> > prevent the swap device from being swapoff.  This may cause the race
>> >> > like below,
>> >> > 
>> >> > CPU 1				CPU 2
>> >> > -----				-----
>> >> > 				do_swap_page
>> >> > 				  swapin_readahead
>> >> > 				    __read_swap_cache_async
>> >> > swapoff				      swapcache_prepare
>> >> >   p->swap_map = NULL		        __swap_duplicate
>> >> > 					  p->swap_map[?] /* !!! NULL pointer access */
>> >> > 
>> >> > Because swap off is usually done when system shutdown only, the race
>> >> > may not hit many people in practice.  But it is still a race need to
>> >> > be fixed.
>> >> 
>> >> swapoff is so rare that it's hard to get motivated about any fix which
>> >> adds overhead to the regular codepaths.
>> >
>> > That was my concern, too when I see this patch.
>> >
>> >> 
>> >> Is there something we can do to ensure that all the overhead of this
>> >> fix is placed into the swapoff side?  stop_machine() may be a bit
>> >> brutal, but a surprising amount of code uses it.  Any other ideas?
>> >
>> > How about this?
>> >
>> > I think It's same approach with old where we uses si->lock everywhere
>> > instead of more fine-grained cluster lock.
>> >
>> > The reason I repeated to reset p->max to zero in the loop is to avoid
>> > using lockdep annotation(maybe, spin_lock_nested(something) to prevent
>> > false positive.
>> >
>> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > index 42fe5653814a..9ce007a42bbc 100644
>> > --- a/mm/swapfile.c
>> > +++ b/mm/swapfile.c
>> > @@ -2644,6 +2644,19 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> >  	swap_file = p->swap_file;
>> >  	old_block_size = p->old_block_size;
>> >  	p->swap_file = NULL;
>> > +
>> > +	if (p->flags & SWP_SOLIDSTATE) {
>> > +		unsigned long ci, nr_cluster;
>> > +
>> > +		nr_cluster = DIV_ROUND_UP(p->max, SWAPFILE_CLUSTER);
>> > +		for (ci = 0; ci < nr_cluster; ci++) {
>> > +			struct swap_cluster_info *sci;
>> > +
>> > +			sci = lock_cluster(p, ci * SWAPFILE_CLUSTER);
>> > +			p->max = 0;
>> > +			unlock_cluster(sci);
>> > +		}
>> > +	}
>> >  	p->max = 0;
>> >  	swap_map = p->swap_map;
>> >  	p->swap_map = NULL;
>> > @@ -3369,10 +3382,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>> >  		goto bad_file;
>> >  	p = swap_info[type];
>> >  	offset = swp_offset(entry);
>> > -	if (unlikely(offset >= p->max))
>> > -		goto out;
>> >  
>> >  	ci = lock_cluster_or_swap_info(p, offset);
>> > +	if (unlikely(offset >= p->max))
>> > +		goto unlock_out;
>> >  
>> >  	count = p->swap_map[offset];
>> >  
>> 
>> Sorry, this doesn't work, because
>> 
>> lock_cluster_or_swap_info()
>> 
>> Need to read p->cluster_info, which may be freed during swapoff too.
>> 
>> 
>> To reduce the added overhead in regular code path, Maybe we can use SRCU
>> to implement get_swap_device() and put_swap_device()?  There is only
>> increment/decrement on CPU local variable in srcu_read_lock/unlock().
>> Should be acceptable in not so hot swap path?
>> 
>> This needs to select CONFIG_SRCU if CONFIG_SWAP is enabled.  But I guess
>> that should be acceptable too?
>> 
>
> Why do we need srcu here? Is it enough with rcu like below?
>
> It might have a bug/room to be optimized about performance/naming.
> I just wanted to show my intention.

Yes.  rcu should work too.  But if we use rcu, it may need to be called
several times to make sure the swap device under us doesn't go away, for
example, when checking si->max in __swp_swapcount() and
add_swap_count_continuation().  And I found we need rcu to protect the
swap cache radix tree array too.  So I think it may be better to use one
call to srcu_read_lock/unlock() instead of multiple calls to
rcu_read_lock/unlock().
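
Just to sketch what I mean (swapoff_srcu is a made-up name here):

	DEFINE_STATIC_SRCU(swapoff_srcu);
	int idx;

	/* reader side, e.g. inside get_swap_device()/put_swap_device() */
	idx = srcu_read_lock(&swapoff_srcu);
	/* ... check si->max, si->swap_map, swap cache radix tree array ... */
	srcu_read_unlock(&swapoff_srcu, idx);

	/* swapoff side, after the device is made invisible to new users */
	synchronize_srcu(&swapoff_srcu);
	/* now swap_map, cluster_info, swap cache array can be freed */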

Best Regards,
Huang, Ying


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-08  8:41         ` Huang, Ying
@ 2017-12-08  9:10           ` Minchan Kim
  2017-12-08 12:32             ` Huang, Ying
  2017-12-08 22:09           ` Andrew Morton
  1 sibling, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2017-12-08  9:10 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Paul E. McKenney, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Fri, Dec 08, 2017 at 04:41:38PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Fri, Dec 08, 2017 at 01:41:10PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > On Thu, Dec 07, 2017 at 04:29:37PM -0800, Andrew Morton wrote:
> >> >> On Thu,  7 Dec 2017 09:14:26 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
> >> >> 
> >> >> > When the swapin is performed, after getting the swap entry information
> >> >> > from the page table, the PTL (page table lock) will be released, then
> >> >> > system will go to swap in the swap entry, without any lock held to
> >> >> > prevent the swap device from being swapoff.  This may cause the race
> >> >> > like below,
> >> >> > 
> >> >> > CPU 1				CPU 2
> >> >> > -----				-----
> >> >> > 				do_swap_page
> >> >> > 				  swapin_readahead
> >> >> > 				    __read_swap_cache_async
> >> >> > swapoff				      swapcache_prepare
> >> >> >   p->swap_map = NULL		        __swap_duplicate
> >> >> > 					  p->swap_map[?] /* !!! NULL pointer access */
> >> >> > 
> >> >> > Because swap off is usually done when system shutdown only, the race
> >> >> > may not hit many people in practice.  But it is still a race need to
> >> >> > be fixed.
> >> >> 
> >> >> swapoff is so rare that it's hard to get motivated about any fix which
> >> >> adds overhead to the regular codepaths.
> >> >
> >> > That was my concern, too when I see this patch.
> >> >
> >> >> 
> >> >> Is there something we can do to ensure that all the overhead of this
> >> >> fix is placed into the swapoff side?  stop_machine() may be a bit
> >> >> brutal, but a surprising amount of code uses it.  Any other ideas?
> >> >
> >> > How about this?
> >> >
> >> > I think It's same approach with old where we uses si->lock everywhere
> >> > instead of more fine-grained cluster lock.
> >> >
> >> > The reason I repeated to reset p->max to zero in the loop is to avoid
> >> > using lockdep annotation(maybe, spin_lock_nested(something) to prevent
> >> > false positive.
> >> >
> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> > index 42fe5653814a..9ce007a42bbc 100644
> >> > --- a/mm/swapfile.c
> >> > +++ b/mm/swapfile.c
> >> > @@ -2644,6 +2644,19 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >> >  	swap_file = p->swap_file;
> >> >  	old_block_size = p->old_block_size;
> >> >  	p->swap_file = NULL;
> >> > +
> >> > +	if (p->flags & SWP_SOLIDSTATE) {
> >> > +		unsigned long ci, nr_cluster;
> >> > +
> >> > +		nr_cluster = DIV_ROUND_UP(p->max, SWAPFILE_CLUSTER);
> >> > +		for (ci = 0; ci < nr_cluster; ci++) {
> >> > +			struct swap_cluster_info *sci;
> >> > +
> >> > +			sci = lock_cluster(p, ci * SWAPFILE_CLUSTER);
> >> > +			p->max = 0;
> >> > +			unlock_cluster(sci);
> >> > +		}
> >> > +	}
> >> >  	p->max = 0;
> >> >  	swap_map = p->swap_map;
> >> >  	p->swap_map = NULL;
> >> > @@ -3369,10 +3382,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> >> >  		goto bad_file;
> >> >  	p = swap_info[type];
> >> >  	offset = swp_offset(entry);
> >> > -	if (unlikely(offset >= p->max))
> >> > -		goto out;
> >> >  
> >> >  	ci = lock_cluster_or_swap_info(p, offset);
> >> > +	if (unlikely(offset >= p->max))
> >> > +		goto unlock_out;
> >> >  
> >> >  	count = p->swap_map[offset];
> >> >  
> >> 
> >> Sorry, this doesn't work, because
> >> 
> >> lock_cluster_or_swap_info()
> >> 
> >> Need to read p->cluster_info, which may be freed during swapoff too.
> >> 
> >> 
> >> To reduce the added overhead in regular code path, Maybe we can use SRCU
> >> to implement get_swap_device() and put_swap_device()?  There is only
> >> increment/decrement on CPU local variable in srcu_read_lock/unlock().
> >> Should be acceptable in not so hot swap path?
> >> 
> >> This needs to select CONFIG_SRCU if CONFIG_SWAP is enabled.  But I guess
> >> that should be acceptable too?
> >> 
> >
> > Why do we need srcu here? Is it enough with rcu like below?
> >
> > It might have a bug/room to be optimized about performance/naming.
> > I just wanted to show my intention.
> 
> Yes.  rcu should work too.  But if we use rcu, it may need to be called
> several times to make sure the swap device under us doesn't go away, for
> example, when checking si->max in __swp_swapcount() and

I think it's not a big concern from a performance POV, and the benefit is
good abstraction through the current locking functions, so we don't need
much churn.

> add_swap_count_continuation().  And I found we need rcu to protect swap
> cache radix tree array too.  So I think it may be better to use one

Could you elaborate more on the swap cache array problem?

> calling to srcu_read_lock/unlock() instead of multiple callings to
> rcu_read_lock/unlock().
> 
> Best Regards,
> Huang, Ying
> 


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-08  9:10           ` Minchan Kim
@ 2017-12-08 12:32             ` Huang, Ying
  2017-12-13  7:15               ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Huang, Ying @ 2017-12-08 12:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, Andrew Morton, Paul E. McKenney, linux-mm,
	linux-kernel, Hugh Dickins, Johannes Weiner, Tim Chen,
	Shaohua Li, Mel Gorman, Jérôme Glisse,
	Michal Hocko, Andrea Arcangeli, David Rientjes, Rik van Riel,
	Jan Kara, Dave Jiang, Aaron Lu

Minchan Kim <minchan@kernel.org> writes:

> On Fri, Dec 08, 2017 at 04:41:38PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Fri, Dec 08, 2017 at 01:41:10PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >> 
>> >> > On Thu, Dec 07, 2017 at 04:29:37PM -0800, Andrew Morton wrote:
>> >> >> On Thu,  7 Dec 2017 09:14:26 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>> >> >> 
>> >> >> > When the swapin is performed, after getting the swap entry information
>> >> >> > from the page table, the PTL (page table lock) will be released, then
>> >> >> > system will go to swap in the swap entry, without any lock held to
>> >> >> > prevent the swap device from being swapoff.  This may cause the race
>> >> >> > like below,
>> >> >> > 
>> >> >> > CPU 1				CPU 2
>> >> >> > -----				-----
>> >> >> > 				do_swap_page
>> >> >> > 				  swapin_readahead
>> >> >> > 				    __read_swap_cache_async
>> >> >> > swapoff				      swapcache_prepare
>> >> >> >   p->swap_map = NULL		        __swap_duplicate
>> >> >> > 					  p->swap_map[?] /* !!! NULL pointer access */
>> >> >> > 
>> >> >> > Because swap off is usually done when system shutdown only, the race
>> >> >> > may not hit many people in practice.  But it is still a race need to
>> >> >> > be fixed.
>> >> >> 
>> >> >> swapoff is so rare that it's hard to get motivated about any fix which
>> >> >> adds overhead to the regular codepaths.
>> >> >
>> >> > That was my concern, too when I see this patch.
>> >> >
>> >> >> 
>> >> >> Is there something we can do to ensure that all the overhead of this
>> >> >> fix is placed into the swapoff side?  stop_machine() may be a bit
>> >> >> brutal, but a surprising amount of code uses it.  Any other ideas?
>> >> >
>> >> > How about this?
>> >> >
>> >> > I think It's same approach with old where we uses si->lock everywhere
>> >> > instead of more fine-grained cluster lock.
>> >> >
>> >> > The reason I repeated to reset p->max to zero in the loop is to avoid
>> >> > using lockdep annotation(maybe, spin_lock_nested(something) to prevent
>> >> > false positive.
>> >> >
>> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> > index 42fe5653814a..9ce007a42bbc 100644
>> >> > --- a/mm/swapfile.c
>> >> > +++ b/mm/swapfile.c
>> >> > @@ -2644,6 +2644,19 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> >> >  	swap_file = p->swap_file;
>> >> >  	old_block_size = p->old_block_size;
>> >> >  	p->swap_file = NULL;
>> >> > +
>> >> > +	if (p->flags & SWP_SOLIDSTATE) {
>> >> > +		unsigned long ci, nr_cluster;
>> >> > +
>> >> > +		nr_cluster = DIV_ROUND_UP(p->max, SWAPFILE_CLUSTER);
>> >> > +		for (ci = 0; ci < nr_cluster; ci++) {
>> >> > +			struct swap_cluster_info *sci;
>> >> > +
>> >> > +			sci = lock_cluster(p, ci * SWAPFILE_CLUSTER);
>> >> > +			p->max = 0;
>> >> > +			unlock_cluster(sci);
>> >> > +		}
>> >> > +	}
>> >> >  	p->max = 0;
>> >> >  	swap_map = p->swap_map;
>> >> >  	p->swap_map = NULL;
>> >> > @@ -3369,10 +3382,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>> >> >  		goto bad_file;
>> >> >  	p = swap_info[type];
>> >> >  	offset = swp_offset(entry);
>> >> > -	if (unlikely(offset >= p->max))
>> >> > -		goto out;
>> >> >  
>> >> >  	ci = lock_cluster_or_swap_info(p, offset);
>> >> > +	if (unlikely(offset >= p->max))
>> >> > +		goto unlock_out;
>> >> >  
>> >> >  	count = p->swap_map[offset];
>> >> >  
>> >> 
>> >> Sorry, this doesn't work, because
>> >> 
>> >> lock_cluster_or_swap_info()
>> >> 
>> >> Need to read p->cluster_info, which may be freed during swapoff too.
>> >> 
>> >> 
>> >> To reduce the added overhead in regular code path, Maybe we can use SRCU
>> >> to implement get_swap_device() and put_swap_device()?  There is only
>> >> increment/decrement on CPU local variable in srcu_read_lock/unlock().
>> >> Should be acceptable in not so hot swap path?
>> >> 
>> >> This needs to select CONFIG_SRCU if CONFIG_SWAP is enabled.  But I guess
>> >> that should be acceptable too?
>> >> 
>> >
>> > Why do we need srcu here? Is it enough with rcu like below?
>> >
>> > It might have a bug/room to be optimized about performance/naming.
>> > I just wanted to show my intention.
>> 
>> Yes.  rcu should work too.  But if we use rcu, it may need to be called
>> several times to make sure the swap device under us doesn't go away, for
>> example, when checking si->max in __swp_swapcount() and
>
> I think it's not a big concern performance pov and benefit is good
> abstraction through current locking function so we don't need much churn.

I think get/put_something() is a common practice in the Linux kernel to
prevent something from going away under us.  That makes the programming
model easier to understand than checking whether the swap entry is valid
here and there.

>> add_swap_count_continuation().  And I found we need rcu to protect swap
>> cache radix tree array too.  So I think it may be better to use one
>
> Could you elaborate it more about swap cache arrary problem?

Like swap_map and cluster_info, the swap cache radix tree array for a
swap device will be freed at the end of swapoff.  So when we look up the
swap cache, we need to make sure the swap cache array is still valid
first, too.
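
That is, the swap cache lookup itself needs to be covered too, something
like (just a sketch):

	si = get_swap_device(entry);
	if (si) {
		page = find_get_page(swap_address_space(entry),
				     swp_offset(entry));
		put_swap_device(si);
	}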

Best Regards,
Huang, Ying

>> calling to srcu_read_lock/unlock() instead of multiple callings to
>> rcu_read_lock/unlock().
>> 
>> Best Regards,
>> Huang, Ying


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-08  8:41         ` Huang, Ying
  2017-12-08  9:10           ` Minchan Kim
@ 2017-12-08 22:09           ` Andrew Morton
  2017-12-11  5:30             ` Huang, Ying
  1 sibling, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2017-12-08 22:09 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Minchan Kim, Paul E. McKenney, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Fri, 08 Dec 2017 16:41:38 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:

> > Why do we need srcu here? Is it enough with rcu like below?
> >
> > It might have a bug/room to be optimized about performance/naming.
> > I just wanted to show my intention.
> 
> Yes.  rcu should work too.  But if we use rcu, it may need to be called
> several times to make sure the swap device under us doesn't go away, for
> example, when checking si->max in __swp_swapcount() and
> add_swap_count_continuation().  And I found we need rcu to protect swap
> cache radix tree array too.  So I think it may be better to use one
> calling to srcu_read_lock/unlock() instead of multiple callings to
> rcu_read_lock/unlock().

Or use stop_machine() ;)  It's very crude but it sure is simple.  Does
anyone have a swapoff-intensive workload?


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-08 22:09           ` Andrew Morton
@ 2017-12-11  5:30             ` Huang, Ying
  2017-12-11 17:04               ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Huang, Ying @ 2017-12-11  5:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Paul E. McKenney, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

Andrew Morton <akpm@linux-foundation.org> writes:

> On Fri, 08 Dec 2017 16:41:38 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
>
>> > Why do we need srcu here? Is it enough with rcu like below?
>> >
>> > It might have a bug/room to be optimized about performance/naming.
>> > I just wanted to show my intention.
>> 
>> Yes.  rcu should work too.  But if we use rcu, it may need to be called
>> several times to make sure the swap device under us doesn't go away, for
>> example, when checking si->max in __swp_swapcount() and
>> add_swap_count_continuation().  And I found we need rcu to protect swap
>> cache radix tree array too.  So I think it may be better to use one
>> calling to srcu_read_lock/unlock() instead of multiple callings to
>> rcu_read_lock/unlock().
>
> Or use stop_machine() ;)  It's very crude but it sure is simple.  Does
> anyone have a swapoff-intensive workload?

Sorry, I don't know how to solve the problem with stop_machine().

The problem we are trying to resolve is that we have a swap entry, but
that swap entry can become invalid because of a swapoff between the time
we check it and the time we use it.  So we need to prevent swapoff from
running between checking and using.

I don't know how to use stop_machine() in swapoff to wait for all users
of the swap entry to finish.  Can anyone help me with this?

Best Regards,
Huang, Ying


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-11  5:30             ` Huang, Ying
@ 2017-12-11 17:04               ` Paul E. McKenney
  2017-12-12  1:12                 ` Huang, Ying
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2017-12-11 17:04 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Hugh Dickins,
	Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Mon, Dec 11, 2017 at 01:30:03PM +0800, Huang, Ying wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> 
> > On Fri, 08 Dec 2017 16:41:38 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
> >
> >> > Why do we need srcu here? Is it enough with rcu like below?
> >> >
> >> > It might have a bug/room to be optimized about performance/naming.
> >> > I just wanted to show my intention.
> >> 
> >> Yes.  rcu should work too.  But if we use rcu, it may need to be called
> >> several times to make sure the swap device under us doesn't go away, for
> >> example, when checking si->max in __swp_swapcount() and
> >> add_swap_count_continuation().  And I found we need rcu to protect swap
> >> cache radix tree array too.  So I think it may be better to use one
> >> calling to srcu_read_lock/unlock() instead of multiple callings to
> >> rcu_read_lock/unlock().
> >
> > Or use stop_machine() ;)  It's very crude but it sure is simple.  Does
> > anyone have a swapoff-intensive workload?
> 
> Sorry, I don't know how to solve the problem with stop_machine().
> 
> The problem we try to resolved is that, we have a swap entry, but that
> swap entry can become invalid because of swappoff between we check it
> and we use it.  So we need to prevent swapoff to be run between checking
> and using.
> 
> I don't know how to use stop_machine() in swapoff to wait for all users
> of swap entry to finish.  Anyone can help me on this?

You can think of stop_machine() as being sort of like a reader-writer
lock.  The readers can be any section of code with preemption disabled,
and the writer is the function passed to stop_machine().

Users running real-time applications on Linux don't tend to like
stop_machine() much, but perhaps it is nevertheless the right tool
for this particular job.
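
In other words, roughly (a sketch, not the actual swap code):

	/* reader side: any region that runs with preemption disabled */
	preempt_disable();
	/* ... use the swap data structures ... */
	preempt_enable();

	/* writer side: fn runs while all other CPUs spin, so no readers */
	static int swapoff_stop_fn(void *data)	/* made-up name */
	{
		/* invalidate the swap data structures here */
		return 0;
	}
	...
	stop_machine(swapoff_stop_fn, p, NULL);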

							Thanx, Paul


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-11 17:04               ` Paul E. McKenney
@ 2017-12-12  1:12                 ` Huang, Ying
  2017-12-12 17:11                   ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Huang, Ying @ 2017-12-12  1:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Hugh Dickins,
	Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

Hi, Paul,

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Mon, Dec 11, 2017 at 01:30:03PM +0800, Huang, Ying wrote:
>> Andrew Morton <akpm@linux-foundation.org> writes:
>> 
>> > On Fri, 08 Dec 2017 16:41:38 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
>> >
>> >> > Why do we need srcu here? Is it enough with rcu like below?
>> >> >
>> >> > It might have a bug/room to be optimized about performance/naming.
>> >> > I just wanted to show my intention.
>> >> 
>> >> Yes.  rcu should work too.  But if we use rcu, it may need to be called
>> >> several times to make sure the swap device under us doesn't go away, for
>> >> example, when checking si->max in __swp_swapcount() and
>> >> add_swap_count_continuation().  And I found we need rcu to protect swap
>> >> cache radix tree array too.  So I think it may be better to use one
>> >> calling to srcu_read_lock/unlock() instead of multiple callings to
>> >> rcu_read_lock/unlock().
>> >
>> > Or use stop_machine() ;)  It's very crude but it sure is simple.  Does
>> > anyone have a swapoff-intensive workload?
>> 
>> Sorry, I don't know how to solve the problem with stop_machine().
>> 
>> The problem we try to resolved is that, we have a swap entry, but that
>> swap entry can become invalid because of swappoff between we check it
>> and we use it.  So we need to prevent swapoff to be run between checking
>> and using.
>> 
>> I don't know how to use stop_machine() in swapoff to wait for all users
>> of swap entry to finish.  Anyone can help me on this?
>
> You can think of stop_machine() as being sort of like a reader-writer
> lock.  The readers can be any section of code with preemption disabled,
> and the writer is the function passed to stop_machine().
>
> Users running real-time applications on Linux don't tend to like
> stop_machine() much, but perhaps it is nevertheless the right tool
> for this particular job.

Thanks a lot for the explanation!  Now I understand this.

Another question: for this specific problem, I think both a stop_machine()
based solution and an rcu_read_lock/unlock() + synchronize_rcu() based
solution would work.  If so, what is the difference between them?  I guess
the rcu based solution will be a little better for real-time applications?
So what is the advantage of the stop_machine() based solution?

Best Regards,
Huang, Ying


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-12  1:12                 ` Huang, Ying
@ 2017-12-12 17:11                   ` Paul E. McKenney
  2017-12-13  2:17                     ` Huang, Ying
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2017-12-12 17:11 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Hugh Dickins,
	Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Tue, Dec 12, 2017 at 09:12:20AM +0800, Huang, Ying wrote:
> Hi, Pual,
> 
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> 
> > On Mon, Dec 11, 2017 at 01:30:03PM +0800, Huang, Ying wrote:
> >> Andrew Morton <akpm@linux-foundation.org> writes:
> >> 
> >> > On Fri, 08 Dec 2017 16:41:38 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
> >> >
> >> >> > Why do we need srcu here? Is it enough with rcu like below?
> >> >> >
> >> >> > It might have a bug/room to be optimized about performance/naming.
> >> >> > I just wanted to show my intention.
> >> >> 
> >> >> Yes.  rcu should work too.  But if we use rcu, it may need to be called
> >> >> several times to make sure the swap device under us doesn't go away, for
> >> >> example, when checking si->max in __swp_swapcount() and
> >> >> add_swap_count_continuation().  And I found we need rcu to protect swap
> >> >> cache radix tree array too.  So I think it may be better to use one
> >> >> calling to srcu_read_lock/unlock() instead of multiple callings to
> >> >> rcu_read_lock/unlock().
> >> >
> >> > Or use stop_machine() ;)  It's very crude but it sure is simple.  Does
> >> > anyone have a swapoff-intensive workload?
> >> 
> >> Sorry, I don't know how to solve the problem with stop_machine().
> >> 
> >> The problem we try to resolved is that, we have a swap entry, but that
> >> swap entry can become invalid because of swappoff between we check it
> >> and we use it.  So we need to prevent swapoff to be run between checking
> >> and using.
> >> 
> >> I don't know how to use stop_machine() in swapoff to wait for all users
> >> of swap entry to finish.  Anyone can help me on this?
> >
> > You can think of stop_machine() as being sort of like a reader-writer
> > lock.  The readers can be any section of code with preemption disabled,
> > and the writer is the function passed to stop_machine().
> >
> > Users running real-time applications on Linux don't tend to like
> > stop_machine() much, but perhaps it is nevertheless the right tool
> > for this particular job.
> 
> Thanks a lot for explanation!  Now I understand this.
> 
> Another question, for this specific problem, I think both stop_machine()
> based solution and rcu_read_lock/unlock() + synchronize_rcu() based
> solution work.  If so, what is the difference between them?  I guess rcu
> based solution will be a little better for real-time applications?  So
> what is the advantage of stop_machine() based solution?

The stop_machine() solution places similar restrictions on readers as
does rcu_read_lock/unlock() + synchronize_rcu(), if that is what you
are asking.

More precisely, the stop_machine() solution places exactly the
same restrictions on readers as does preempt_disable/enable() and
synchronize_sched().

I would expect stop_machine() to be faster than either synchronize_rcu(),
synchronize_sched(), or synchronize_srcu(), but stop_machine() operates
by making each CPU spin with interrupts disabled until all the other CPUs
arrive.  This normally does not make real-time people happy.

A compromise position is available in the form of
synchronize_rcu_expedited() and synchronize_sched_expedited().  These
are faster than their non-expedited counterparts, and only momentarily
disturb each CPU, rather than spinning with interrupts disabled.  However,
stop_machine() is probably a bit faster.

Finally, synchronize_srcu_expedited() is reasonably fast, but
avoids disturbing other CPUs.  Last I checked, not quite as fast as
synchronize_rcu_expedited() and synchronize_sched_expedited(), though.

You asked!  ;-)

							Thanx, Paul


* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-12 17:11                   ` Paul E. McKenney
@ 2017-12-13  2:17                     ` Huang, Ying
  2017-12-13  3:27                       ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Huang, Ying @ 2017-12-13  2:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Hugh Dickins,
	Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Tue, Dec 12, 2017 at 09:12:20AM +0800, Huang, Ying wrote:
>> Hi, Pual,
>> 
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>> 
>> > On Mon, Dec 11, 2017 at 01:30:03PM +0800, Huang, Ying wrote:
>> >> Andrew Morton <akpm@linux-foundation.org> writes:
>> >> 
>> >> > On Fri, 08 Dec 2017 16:41:38 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
>> >> >
>> >> >> > Why do we need srcu here? Is it enough with rcu like below?
>> >> >> >
>> >> >> > It might have a bug/room to be optimized about performance/naming.
>> >> >> > I just wanted to show my intention.
>> >> >> 
>> >> >> Yes.  rcu should work too.  But if we use rcu, it may need to be called
>> >> >> several times to make sure the swap device under us doesn't go away, for
>> >> >> example, when checking si->max in __swp_swapcount() and
>> >> >> add_swap_count_continuation().  And I found we need rcu to protect swap
>> >> >> cache radix tree array too.  So I think it may be better to use one
>> >> >> calling to srcu_read_lock/unlock() instead of multiple callings to
>> >> >> rcu_read_lock/unlock().
>> >> >
>> >> > Or use stop_machine() ;)  It's very crude but it sure is simple.  Does
>> >> > anyone have a swapoff-intensive workload?
>> >> 
>> >> Sorry, I don't know how to solve the problem with stop_machine().
>> >> 
>> >> The problem we try to resolved is that, we have a swap entry, but that
>> >> swap entry can become invalid because of swappoff between we check it
>> >> and we use it.  So we need to prevent swapoff to be run between checking
>> >> and using.
>> >> 
>> >> I don't know how to use stop_machine() in swapoff to wait for all users
>> >> of swap entry to finish.  Anyone can help me on this?
>> >
>> > You can think of stop_machine() as being sort of like a reader-writer
>> > lock.  The readers can be any section of code with preemption disabled,
>> > and the writer is the function passed to stop_machine().
>> >
>> > Users running real-time applications on Linux don't tend to like
>> > stop_machine() much, but perhaps it is nevertheless the right tool
>> > for this particular job.
>> 
>> Thanks a lot for explanation!  Now I understand this.
>> 
>> Another question, for this specific problem, I think both stop_machine()
>> based solution and rcu_read_lock/unlock() + synchronize_rcu() based
>> solution work.  If so, what is the difference between them?  I guess rcu
>> based solution will be a little better for real-time applications?  So
>> what is the advantage of stop_machine() based solution?
>
> The stop_machine() solution places similar restrictions on readers as
> does rcu_read_lock/unlock() + synchronize_rcu(), if that is what you
> are asking.
>
> More precisely, the stop_machine() solution places exactly the
> same restrictions on readers as does preempt_disable/enable() and
> synchronize_sched().
>
> I would expect stop_machine() to be faster than either synchronize_rcu()
> synchronize_sched(), or synchronize_srcu(), but stop_machine() operates
> by making each CPU spin with interrupts until all the other CPUs arrive.
> This normally does not make real-time people happy.
>
> An compromise position is available in the form of
> synchronize_rcu_expedited() and synchronize_sched_expedited().  These
> are faster than their non-expedited counterparts, and only momentarily
> disturb each CPU, rather than spinning with interrupts disabled.  However,
> stop_machine() is probably a bit faster.
>
> Finally, syncrhonize_srcu_expedited() is reasonably fast, but
> avoids disturbing other CPUs.  Last I checked, not quite as fast as
> synchronize_rcu_expedited() and synchronize_sched_expedited(), though.
>
> You asked!  ;-)

Thanks a lot, Paul!  That exceeds my expectations!

The performance of swapoff() isn't very important, so it's probably not
necessary to accelerate it at the cost of real-time latency.  I think it
is better to use an RCU- or SRCU-based solution.  The cost on the reader
side should be almost the same between RCU and SRCU, right?  To use SRCU,
we need to select CONFIG_SRCU when CONFIG_SWAP is enabled in Kconfig.  I
think that should be OK?
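
For example, a rough sketch of the SRCU-based get/put pair I have in mind
(illustration only -- swap_srcu and the SWP_VALID check are placeholders,
not code from this patch):

#include <linux/srcu.h>
#include <linux/swap.h>

DEFINE_STATIC_SRCU(swap_srcu);

/* Reader side: pin the swap device across the whole swap operation. */
static int get_swap_device(struct swap_info_struct *si)
{
        int idx = srcu_read_lock(&swap_srcu);

        if (!(READ_ONCE(si->flags) & SWP_VALID)) {      /* placeholder flag */
                srcu_read_unlock(&swap_srcu, idx);
                return -ENODEV;
        }
        return idx;     /* handed back to put_swap_device() */
}

static void put_swap_device(struct swap_info_struct *si, int idx)
{
        srcu_read_unlock(&swap_srcu, idx);
}

/* swapoff side: clear the placeholder flag, then wait for all readers. */
static void swapoff_wait_for_users(void)
{
        synchronize_srcu(&swap_srcu);
}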

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-13  2:17                     ` Huang, Ying
@ 2017-12-13  3:27                       ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2017-12-13  3:27 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Minchan Kim, linux-mm, linux-kernel, Hugh Dickins,
	Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

On Wed, Dec 13, 2017 at 10:17:41AM +0800, Huang, Ying wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> 
> > On Tue, Dec 12, 2017 at 09:12:20AM +0800, Huang, Ying wrote:
> >> Hi, Pual,
> >> 
> >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> >> 
> >> > On Mon, Dec 11, 2017 at 01:30:03PM +0800, Huang, Ying wrote:
> >> >> Andrew Morton <akpm@linux-foundation.org> writes:
> >> >> 
> >> >> > On Fri, 08 Dec 2017 16:41:38 +0800 "Huang\, Ying" <ying.huang@intel.com> wrote:
> >> >> >
> >> >> >> > Why do we need srcu here? Is it enough with rcu like below?
> >> >> >> >
> >> >> >> > It might have a bug/room to be optimized about performance/naming.
> >> >> >> > I just wanted to show my intention.
> >> >> >> 
> >> >> >> Yes.  rcu should work too.  But if we use rcu, it may need to be called
> >> >> >> several times to make sure the swap device under us doesn't go away, for
> >> >> >> example, when checking si->max in __swp_swapcount() and
> >> >> >> add_swap_count_continuation().  And I found we need rcu to protect swap
> >> >> >> cache radix tree array too.  So I think it may be better to use one
> >> >> >> calling to srcu_read_lock/unlock() instead of multiple callings to
> >> >> >> rcu_read_lock/unlock().
> >> >> >
> >> >> > Or use stop_machine() ;)  It's very crude but it sure is simple.  Does
> >> >> > anyone have a swapoff-intensive workload?
> >> >> 
> >> >> Sorry, I don't know how to solve the problem with stop_machine().
> >> >> 
> >> >> The problem we try to resolved is that, we have a swap entry, but that
> >> >> swap entry can become invalid because of swappoff between we check it
> >> >> and we use it.  So we need to prevent swapoff to be run between checking
> >> >> and using.
> >> >> 
> >> >> I don't know how to use stop_machine() in swapoff to wait for all users
> >> >> of swap entry to finish.  Anyone can help me on this?
> >> >
> >> > You can think of stop_machine() as being sort of like a reader-writer
> >> > lock.  The readers can be any section of code with preemption disabled,
> >> > and the writer is the function passed to stop_machine().
> >> >
> >> > Users running real-time applications on Linux don't tend to like
> >> > stop_machine() much, but perhaps it is nevertheless the right tool
> >> > for this particular job.
> >> 
> >> Thanks a lot for explanation!  Now I understand this.
> >> 
> >> Another question, for this specific problem, I think both stop_machine()
> >> based solution and rcu_read_lock/unlock() + synchronize_rcu() based
> >> solution work.  If so, what is the difference between them?  I guess rcu
> >> based solution will be a little better for real-time applications?  So
> >> what is the advantage of stop_machine() based solution?
> >
> > The stop_machine() solution places similar restrictions on readers as
> > does rcu_read_lock/unlock() + synchronize_rcu(), if that is what you
> > are asking.
> >
> > More precisely, the stop_machine() solution places exactly the
> > same restrictions on readers as does preempt_disable/enable() and
> > synchronize_sched().
> >
> > I would expect stop_machine() to be faster than either synchronize_rcu()
> > synchronize_sched(), or synchronize_srcu(), but stop_machine() operates
> > by making each CPU spin with interrupts until all the other CPUs arrive.
> > This normally does not make real-time people happy.
> >
> > An compromise position is available in the form of
> > synchronize_rcu_expedited() and synchronize_sched_expedited().  These
> > are faster than their non-expedited counterparts, and only momentarily
> > disturb each CPU, rather than spinning with interrupts disabled.  However,
> > stop_machine() is probably a bit faster.
> >
> > Finally, syncrhonize_srcu_expedited() is reasonably fast, but
> > avoids disturbing other CPUs.  Last I checked, not quite as fast as
> > synchronize_rcu_expedited() and synchronize_sched_expedited(), though.
> >
> > You asked!  ;-)
> 
> Thanks a lot Paul!  That exceeds my expectation!
> 
> The performance of swapoff() isn't very important, probably it's not
> necessary to accelerate it at the cost of realtime.  I think it is
> better to use a rcu or srcu based solution.  I think the cost at reader
> side should be almost same between rcu and srcu?  To use srcu, we need
> to select CONFIG_SRCU when CONFIG_SWAP is enabled in Kconfig.  I think
> that should be OK?

The thing to do is to try SRCU and see whether you can measure any
significant performance degradation.  Given that there is swapping
involved, I would be surprised if the added read-side overhead of SRCU
were even measurable, but then again I have been surprised before.

And yes, just select CONFIG_SRCU when you need it.
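
For example, something along these lines in mm/Kconfig (sketch only --
double-check the real CONFIG_SWAP entry's dependencies):

config SWAP
	bool "Support for paging of anonymous memory (swap)"
	depends on MMU && BLOCK
	select SRCU
	default y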

							Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-08 12:32             ` Huang, Ying
@ 2017-12-13  7:15               ` Minchan Kim
  2017-12-13  8:52                 ` Huang, Ying
  0 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2017-12-13  7:15 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Paul E. McKenney, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

Hi Huang,
 
Sorry for the late response.  I'm in the middle of a long vacation.

On Fri, Dec 08, 2017 at 08:32:16PM +0800, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
> > On Fri, Dec 08, 2017 at 04:41:38PM +0800, Huang, Ying wrote:
> >> Minchan Kim <minchan@kernel.org> writes:
> >> 
> >> > On Fri, Dec 08, 2017 at 01:41:10PM +0800, Huang, Ying wrote:
> >> >> Minchan Kim <minchan@kernel.org> writes:
> >> >> 
> >> >> > On Thu, Dec 07, 2017 at 04:29:37PM -0800, Andrew Morton wrote:
> >> >> >> On Thu,  7 Dec 2017 09:14:26 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
> >> >> >> 
> >> >> >> > When the swapin is performed, after getting the swap entry information
> >> >> >> > from the page table, the PTL (page table lock) will be released, then
> >> >> >> > system will go to swap in the swap entry, without any lock held to
> >> >> >> > prevent the swap device from being swapoff.  This may cause the race
> >> >> >> > like below,
> >> >> >> > 
> >> >> >> > CPU 1				CPU 2
> >> >> >> > -----				-----
> >> >> >> > 				do_swap_page
> >> >> >> > 				  swapin_readahead
> >> >> >> > 				    __read_swap_cache_async
> >> >> >> > swapoff				      swapcache_prepare
> >> >> >> >   p->swap_map = NULL		        __swap_duplicate
> >> >> >> > 					  p->swap_map[?] /* !!! NULL pointer access */
> >> >> >> > 
> >> >> >> > Because swap off is usually done when system shutdown only, the race
> >> >> >> > may not hit many people in practice.  But it is still a race need to
> >> >> >> > be fixed.
> >> >> >> 
> >> >> >> swapoff is so rare that it's hard to get motivated about any fix which
> >> >> >> adds overhead to the regular codepaths.
> >> >> >
> >> >> > That was my concern, too when I see this patch.
> >> >> >
> >> >> >> 
> >> >> >> Is there something we can do to ensure that all the overhead of this
> >> >> >> fix is placed into the swapoff side?  stop_machine() may be a bit
> >> >> >> brutal, but a surprising amount of code uses it.  Any other ideas?
> >> >> >
> >> >> > How about this?
> >> >> >
> >> >> > I think It's same approach with old where we uses si->lock everywhere
> >> >> > instead of more fine-grained cluster lock.
> >> >> >
> >> >> > The reason I repeated to reset p->max to zero in the loop is to avoid
> >> >> > using lockdep annotation(maybe, spin_lock_nested(something) to prevent
> >> >> > false positive.
> >> >> >
> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> > index 42fe5653814a..9ce007a42bbc 100644
> >> >> > --- a/mm/swapfile.c
> >> >> > +++ b/mm/swapfile.c
> >> >> > @@ -2644,6 +2644,19 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >> >> >  	swap_file = p->swap_file;
> >> >> >  	old_block_size = p->old_block_size;
> >> >> >  	p->swap_file = NULL;
> >> >> > +
> >> >> > +	if (p->flags & SWP_SOLIDSTATE) {
> >> >> > +		unsigned long ci, nr_cluster;
> >> >> > +
> >> >> > +		nr_cluster = DIV_ROUND_UP(p->max, SWAPFILE_CLUSTER);
> >> >> > +		for (ci = 0; ci < nr_cluster; ci++) {
> >> >> > +			struct swap_cluster_info *sci;
> >> >> > +
> >> >> > +			sci = lock_cluster(p, ci * SWAPFILE_CLUSTER);
> >> >> > +			p->max = 0;
> >> >> > +			unlock_cluster(sci);
> >> >> > +		}
> >> >> > +	}
> >> >> >  	p->max = 0;
> >> >> >  	swap_map = p->swap_map;
> >> >> >  	p->swap_map = NULL;
> >> >> > @@ -3369,10 +3382,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
> >> >> >  		goto bad_file;
> >> >> >  	p = swap_info[type];
> >> >> >  	offset = swp_offset(entry);
> >> >> > -	if (unlikely(offset >= p->max))
> >> >> > -		goto out;
> >> >> >  
> >> >> >  	ci = lock_cluster_or_swap_info(p, offset);
> >> >> > +	if (unlikely(offset >= p->max))
> >> >> > +		goto unlock_out;
> >> >> >  
> >> >> >  	count = p->swap_map[offset];
> >> >> >  
> >> >> 
> >> >> Sorry, this doesn't work, because
> >> >> 
> >> >> lock_cluster_or_swap_info()
> >> >> 
> >> >> Need to read p->cluster_info, which may be freed during swapoff too.
> >> >> 
> >> >> 
> >> >> To reduce the added overhead in regular code path, Maybe we can use SRCU
> >> >> to implement get_swap_device() and put_swap_device()?  There is only
> >> >> increment/decrement on CPU local variable in srcu_read_lock/unlock().
> >> >> Should be acceptable in not so hot swap path?
> >> >> 
> >> >> This needs to select CONFIG_SRCU if CONFIG_SWAP is enabled.  But I guess
> >> >> that should be acceptable too?
> >> >> 
> >> >
> >> > Why do we need srcu here? Is it enough with rcu like below?
> >> >
> >> > It might have a bug/room to be optimized about performance/naming.
> >> > I just wanted to show my intention.
> >> 
> >> Yes.  rcu should work too.  But if we use rcu, it may need to be called
> >> several times to make sure the swap device under us doesn't go away, for
> >> example, when checking si->max in __swp_swapcount() and
> >
> > I think it's not a big concern performance pov and benefit is good
> > abstraction through current locking function so we don't need much churn.
> 
> I think get/put_something() is common practice in Linux kernel to
> prevent something to go away under us.  That makes the programming model
> easier to be understood than checking whether swap entry is valid here
> and there.
> 
> >> add_swap_count_continuation().  And I found we need rcu to protect swap
> >> cache radix tree array too.  So I think it may be better to use one
> >
> > Could you elaborate it more about swap cache arrary problem?
> 
> Like swap_map, cluster_info, swap cache radix tree array for a swap
> device will be freed at the end of swapoff.  So when we look up swap
> cache, we need to make sure the swap cache array is valid firstly too.
> 

Thanks for the clarification.
 
I'm not saying the refcount approach you suggested is wrong; I just
wanted to find an easier way that only touches the cold path instead of
the hot path.  To me, the thought comes from a logical sense of
maintainability rather than from a performance problem.
 
I still need some time to think it over, and that will have to wait
until after the vacation, so I don't want to keep you stuck.  One thing
I want to suggest is that we evaluate the solution candidates from a
maintenance point of view.  I don't like putting get/put calls outside
of the swap code.  Instead, let's encapsulate the locking inside the
swap functions themselves, so that users of the swap functions don't
need to know the details.
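
For example (just a sketch of the direction, not tested code), an existing
entry point like __swp_swapcount() could do the pinning internally, using a
hypothetical get/put_swap_device() pair such as the SRCU-based one discussed
earlier in the thread, so that callers outside mm/swapfile.c stay unchanged:

/* mm/swapfile.c sketch: assumes swp_type(entry) was already validated */
int __swp_swapcount(swp_entry_t entry)
{
        struct swap_info_struct *si = swap_info[swp_type(entry)];
        int count = 0;
        int idx;

        idx = get_swap_device(si);      /* fails if the device is being swapped off */
        if (idx >= 0) {
                count = swap_swapcount(si, entry);
                put_swap_device(si, idx);
        }
        return count;
}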
 
I think which of the several approaches is best depends on how well the
solution keeps the code simple without exposing the internals, rather
than on performance at the moment.

Just my two cents.  Sorry for the vague review; I'm looking forward to
seeing the new patches.
 
Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations
  2017-12-13  7:15               ` Minchan Kim
@ 2017-12-13  8:52                 ` Huang, Ying
  0 siblings, 0 replies; 16+ messages in thread
From: Huang, Ying @ 2017-12-13  8:52 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Paul E. McKenney, linux-mm, linux-kernel,
	Hugh Dickins, Johannes Weiner, Tim Chen, Shaohua Li, Mel Gorman,
	Jérôme Glisse, Michal Hocko, Andrea Arcangeli,
	David Rientjes, Rik van Riel, Jan Kara, Dave Jiang, Aaron Lu

Minchan Kim <minchan@kernel.org> writes:

> Hi Huang,
>  
> Sorry for the late response. I'm in middle of long vacation.
>
> On Fri, Dec 08, 2017 at 08:32:16PM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>> > On Fri, Dec 08, 2017 at 04:41:38PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >> 
>> >> > On Fri, Dec 08, 2017 at 01:41:10PM +0800, Huang, Ying wrote:
>> >> >> Minchan Kim <minchan@kernel.org> writes:
>> >> >> 
>> >> >> > On Thu, Dec 07, 2017 at 04:29:37PM -0800, Andrew Morton wrote:
>> >> >> >> On Thu,  7 Dec 2017 09:14:26 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>> >> >> >> 
>> >> >> >> > When the swapin is performed, after getting the swap entry information
>> >> >> >> > from the page table, the PTL (page table lock) will be released, then
>> >> >> >> > system will go to swap in the swap entry, without any lock held to
>> >> >> >> > prevent the swap device from being swapoff.  This may cause the race
>> >> >> >> > like below,
>> >> >> >> > 
>> >> >> >> > CPU 1				CPU 2
>> >> >> >> > -----				-----
>> >> >> >> > 				do_swap_page
>> >> >> >> > 				  swapin_readahead
>> >> >> >> > 				    __read_swap_cache_async
>> >> >> >> > swapoff				      swapcache_prepare
>> >> >> >> >   p->swap_map = NULL		        __swap_duplicate
>> >> >> >> > 					  p->swap_map[?] /* !!! NULL pointer access */
>> >> >> >> > 
>> >> >> >> > Because swap off is usually done when system shutdown only, the race
>> >> >> >> > may not hit many people in practice.  But it is still a race need to
>> >> >> >> > be fixed.
>> >> >> >> 
>> >> >> >> swapoff is so rare that it's hard to get motivated about any fix which
>> >> >> >> adds overhead to the regular codepaths.
>> >> >> >
>> >> >> > That was my concern, too when I see this patch.
>> >> >> >
>> >> >> >> 
>> >> >> >> Is there something we can do to ensure that all the overhead of this
>> >> >> >> fix is placed into the swapoff side?  stop_machine() may be a bit
>> >> >> >> brutal, but a surprising amount of code uses it.  Any other ideas?
>> >> >> >
>> >> >> > How about this?
>> >> >> >
>> >> >> > I think It's same approach with old where we uses si->lock everywhere
>> >> >> > instead of more fine-grained cluster lock.
>> >> >> >
>> >> >> > The reason I repeated to reset p->max to zero in the loop is to avoid
>> >> >> > using lockdep annotation(maybe, spin_lock_nested(something) to prevent
>> >> >> > false positive.
>> >> >> >
>> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> > index 42fe5653814a..9ce007a42bbc 100644
>> >> >> > --- a/mm/swapfile.c
>> >> >> > +++ b/mm/swapfile.c
>> >> >> > @@ -2644,6 +2644,19 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>> >> >> >  	swap_file = p->swap_file;
>> >> >> >  	old_block_size = p->old_block_size;
>> >> >> >  	p->swap_file = NULL;
>> >> >> > +
>> >> >> > +	if (p->flags & SWP_SOLIDSTATE) {
>> >> >> > +		unsigned long ci, nr_cluster;
>> >> >> > +
>> >> >> > +		nr_cluster = DIV_ROUND_UP(p->max, SWAPFILE_CLUSTER);
>> >> >> > +		for (ci = 0; ci < nr_cluster; ci++) {
>> >> >> > +			struct swap_cluster_info *sci;
>> >> >> > +
>> >> >> > +			sci = lock_cluster(p, ci * SWAPFILE_CLUSTER);
>> >> >> > +			p->max = 0;
>> >> >> > +			unlock_cluster(sci);
>> >> >> > +		}
>> >> >> > +	}
>> >> >> >  	p->max = 0;
>> >> >> >  	swap_map = p->swap_map;
>> >> >> >  	p->swap_map = NULL;
>> >> >> > @@ -3369,10 +3382,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>> >> >> >  		goto bad_file;
>> >> >> >  	p = swap_info[type];
>> >> >> >  	offset = swp_offset(entry);
>> >> >> > -	if (unlikely(offset >= p->max))
>> >> >> > -		goto out;
>> >> >> >  
>> >> >> >  	ci = lock_cluster_or_swap_info(p, offset);
>> >> >> > +	if (unlikely(offset >= p->max))
>> >> >> > +		goto unlock_out;
>> >> >> >  
>> >> >> >  	count = p->swap_map[offset];
>> >> >> >  
>> >> >> 
>> >> >> Sorry, this doesn't work, because
>> >> >> 
>> >> >> lock_cluster_or_swap_info()
>> >> >> 
>> >> >> Need to read p->cluster_info, which may be freed during swapoff too.
>> >> >> 
>> >> >> 
>> >> >> To reduce the added overhead in regular code path, Maybe we can use SRCU
>> >> >> to implement get_swap_device() and put_swap_device()?  There is only
>> >> >> increment/decrement on CPU local variable in srcu_read_lock/unlock().
>> >> >> Should be acceptable in not so hot swap path?
>> >> >> 
>> >> >> This needs to select CONFIG_SRCU if CONFIG_SWAP is enabled.  But I guess
>> >> >> that should be acceptable too?
>> >> >> 
>> >> >
>> >> > Why do we need srcu here? Is it enough with rcu like below?
>> >> >
>> >> > It might have a bug/room to be optimized about performance/naming.
>> >> > I just wanted to show my intention.
>> >> 
>> >> Yes.  rcu should work too.  But if we use rcu, it may need to be called
>> >> several times to make sure the swap device under us doesn't go away, for
>> >> example, when checking si->max in __swp_swapcount() and
>> >
>> > I think it's not a big concern performance pov and benefit is good
>> > abstraction through current locking function so we don't need much churn.
>> 
>> I think get/put_something() is common practice in Linux kernel to
>> prevent something to go away under us.  That makes the programming model
>> easier to be understood than checking whether swap entry is valid here
>> and there.
>> 
>> >> add_swap_count_continuation().  And I found we need rcu to protect swap
>> >> cache radix tree array too.  So I think it may be better to use one
>> >
>> > Could you elaborate it more about swap cache arrary problem?
>> 
>> Like swap_map, cluster_info, swap cache radix tree array for a swap
>> device will be freed at the end of swapoff.  So when we look up swap
>> cache, we need to make sure the swap cache array is valid firstly too.
>> 
>
> Thanks for the clarification.
>  
> I'm not saying refcount approach you suggested is wrong but just wanted
> to find more easier way with just fixing cold path instead of hot path.
> To me, the thought came from from logical sense for the maintainance
> rather than performan problem.
>  
> I still need a time to think over it and it would be made after the vacation
> so don't want to make you stuck. A thing I want to suggest is that let's
> think about maintanaince point of view for solution candidates.
> I don't like to put get/put into out of swap code. Instead, let's
> encapsulate the locking code into swap functions inside so any user
> of swap function doesn't need to know the detail.
>  
> I think which approach is best for the solution among several
> approaches depends on that how the solution makes code simple without
> exposing the internal much rather than performance at the moment.
>
> Just my two cents.
> Sorry for the vague review. I'm looking forward to seeing new patches.

Thanks!  Yes, long-term maintenance is the key point.  I just thought
that a get/put_swap_device() based method makes the whole picture
simpler, so that we don't need to worry about the swap device going
away under us.  Will send out a new version soon.
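
For illustration, the caller-side pattern I have in mind looks roughly like
this (sketch only; the helper names and signatures may change in the posted
version, and swap_duplicate_safe() is just a made-up wrapper for the example):

/* Hypothetical wrapper showing how a __swap_duplicate() user is protected. */
static int swap_duplicate_safe(swp_entry_t entry, unsigned char usage)
{
        struct swap_info_struct *si = swap_info[swp_type(entry)];
        int err = -ENODEV;
        int idx;

        idx = get_swap_device(si);              /* hypothetical helper */
        if (idx >= 0) {
                /* si->swap_map and si->cluster_info cannot be freed here. */
                err = __swap_duplicate(entry, usage);
                put_swap_device(si, idx);
        }
        return err;
}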

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-12-13  8:53 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-07  1:14 [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations Huang, Ying
2017-12-08  0:29 ` Andrew Morton
2017-12-08  1:43   ` Minchan Kim
     [not found]     ` <87po7pg4jt.fsf@yhuang-dev.intel.com>
2017-12-08  8:26       ` Minchan Kim
2017-12-08  8:41         ` Huang, Ying
2017-12-08  9:10           ` Minchan Kim
2017-12-08 12:32             ` Huang, Ying
2017-12-13  7:15               ` Minchan Kim
2017-12-13  8:52                 ` Huang, Ying
2017-12-08 22:09           ` Andrew Morton
2017-12-11  5:30             ` Huang, Ying
2017-12-11 17:04               ` Paul E. McKenney
2017-12-12  1:12                 ` Huang, Ying
2017-12-12 17:11                   ` Paul E. McKenney
2017-12-13  2:17                     ` Huang, Ying
2017-12-13  3:27                       ` Paul E. McKenney
