* [PATCH 1/5] mm, swap: Fix comment in __read_swap_cache_async
@ 2017-03-17  6:46 ` Huang, Ying
  0 siblings, 0 replies; 19+ messages in thread
From: Huang, Ying @ 2017-03-17  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Dave Hansen, Shaohua Li, Rik van Riel, Huang Ying,
	Rafael Aquini, Tim Chen, Michal Hocko, Mel Gorman, Aaron Lu,
	Kirill A. Shutemov, Gerald Schaefer, linux-mm, linux-kernel

From: Huang Ying <ying.huang@intel.com>

Commit cbab0e4eec29 ("swap: avoid read_swap_cache_async() race to
deadlock while waiting on discard I/O completion") fixed a deadlock
in read_swap_cache_async().  At that time, in the swap allocation
path, a swap entry could be marked SWAP_HAS_CACHE and then wait for
discard I/O to complete before the page for the swap entry was added
to the swap cache.  But since commit 815c2c543d3a ("swap: make swap
discard async"), swap discard has become asynchronous, and the wait
for discard completion is done before the swap entry is marked
SWAP_HAS_CACHE.  So the comment in the code is out of date now.  This
patch fixes the comment.

The cond_resched() added in commit cbab0e4eec29 is no longer strictly
necessary either.  But if a sleep were ever added to the swap
allocation path in the future, a hard to debug/reproduce deadlock
could appear again, so the cond_resched() is kept.

Cc: Shaohua Li <shli@kernel.org>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/swap_state.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 473b71e052a8..7bfb9bd1ca21 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -360,17 +360,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			/*
 			 * We might race against get_swap_page() and stumble
 			 * across a SWAP_HAS_CACHE swap_map entry whose page
-			 * has not been brought into the swapcache yet, while
-			 * the other end is scheduled away waiting on discard
-			 * I/O completion at scan_swap_map().
-			 *
-			 * In order to avoid turning this transitory state
-			 * into a permanent loop around this -EEXIST case
-			 * if !CONFIG_PREEMPT and the I/O completion happens
-			 * to be waiting on the CPU waitqueue where we are now
-			 * busy looping, we just conditionally invoke the
-			 * scheduler here, if there are some more important
-			 * tasks to run.
+			 * has not been brought into the swapcache yet.
 			 */
 			cond_resched();
 			continue;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread
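
A minimal sketch of the retry loop that the fixed comment lives in may
help place the change: swapcache_prepare() returns -EEXIST while a
racing get_swap_page() owns SWAP_HAS_CACHE but has not inserted the
page yet, and cond_resched() only keeps the retry from busy looping on
!CONFIG_PREEMPT.  This is a simplified illustration, not the exact
kernel code; swapcache_prepare() and cond_resched() are the real
interfaces, the surrounding function is hypothetical.

static int claim_swap_cache_slot(swp_entry_t entry)
{
        for (;;) {
                int err = swapcache_prepare(entry); /* try to set SWAP_HAS_CACHE */

                if (!err)
                        return 0;       /* we own the slot; add the page next */
                if (err != -EEXIST)
                        return err;     /* entry was freed under us; give up */
                /*
                 * -EEXIST: another task set SWAP_HAS_CACHE but has not
                 * brought the page into the swap cache yet.  With
                 * asynchronous discard this state is short-lived, so
                 * just yield the CPU instead of spinning on it.
                 */
                cond_resched();
        }
}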

* [PATCH 2/5] mm, swap: Improve readability via make spin_lock/unlock balanced
  2017-03-17  6:46 ` Huang, Ying
@ 2017-03-17  6:46   ` Huang, Ying
  -1 siblings, 0 replies; 19+ messages in thread
From: Huang, Ying @ 2017-03-17  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Dave Hansen, Shaohua Li, Rik van Riel, Huang Ying,
	Tim Chen, Michal Hocko, Hugh Dickins, Vegard Nossum, Ingo Molnar,
	Kirill A. Shutemov, linux-mm, linux-kernel

From: Huang Ying <ying.huang@intel.com>

This is just a cleanup patch, with no functional change.  In
cluster_list_add_tail(), spin_lock_nested() is used to lock the
cluster, while unlock_cluster() is used to unlock it.  To improve
code readability and make the lock/unlock pair balanced, use
spin_unlock() directly to unlock the cluster.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Tim Chen <tim.c.chen@intel.com>
---
 mm/swapfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6b6bb1bb6209..42fd620dcf4c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -335,7 +335,7 @@ static void cluster_list_add_tail(struct swap_cluster_list *list,
 		ci_tail = ci + tail;
 		spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
 		cluster_set_next(ci_tail, idx);
-		unlock_cluster(ci_tail);
+		spin_unlock(&ci_tail->lock);
 		cluster_set_next_flag(&list->tail, idx, 0);
 	}
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread
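
For context on why spin_unlock() is the natural counterpart here: the
lock taken with spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING)
is the same per-cluster spinlock that unlock_cluster() wraps.  The
helper is roughly the following in mm/swapfile.c of this era (shown
here for illustration):

static inline void unlock_cluster(struct swap_cluster_info *ci)
{
        if (ci)
                spin_unlock(&ci->lock);
}

Since ci_tail cannot be NULL at this point, spin_unlock(&ci_tail->lock)
is equivalent and makes the lock/unlock pair visibly symmetric.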

* [PATCH 3/5] mm, swap: Avoid lock swap_avail_lock when held cluster lock
  2017-03-17  6:46 ` Huang, Ying
@ 2017-03-17  6:46   ` Huang, Ying
  -1 siblings, 0 replies; 19+ messages in thread
From: Huang, Ying @ 2017-03-17  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Dave Hansen, Shaohua Li, Rik van Riel, Huang Ying,
	Tim Chen, Michal Hocko, Kirill A. Shutemov, Ingo Molnar,
	Vegard Nossum, linux-mm, linux-kernel

From: Huang Ying <ying.huang@intel.com>

The cluster lock protects swap_cluster_info and the corresponding
elements of swap_info_struct->swap_map[].  But it turns out that in
scan_swap_map_slots(), swap_avail_lock may be acquired while the
cluster lock is held.  This does no good except making the locking
more complex and increasing the potential for lock contention,
because swap_info_struct->lock already protects the data structures
operated on in that code.  Fix this by moving the corresponding
operations in scan_swap_map_slots() out of the cluster lock.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Tim Chen <tim.c.chen@intel.com>
---
 mm/swapfile.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 42fd620dcf4c..53b5881ee0d6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -672,6 +672,9 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		else
 			goto done;
 	}
+	si->swap_map[offset] = usage;
+	inc_cluster_info_page(si, si->cluster_info, offset);
+	unlock_cluster(ci);
 
 	if (offset == si->lowest_bit)
 		si->lowest_bit++;
@@ -685,9 +688,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		plist_del(&si->avail_list, &swap_avail_head);
 		spin_unlock(&swap_avail_lock);
 	}
-	si->swap_map[offset] = usage;
-	inc_cluster_info_page(si, si->cluster_info, offset);
-	unlock_cluster(ci);
 	si->cluster_next = offset + 1;
 	slots[n_ret++] = swp_entry(si->type, offset);
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread
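
A simplified sketch of the resulting ordering in scan_swap_map_slots()
(si->lock is held by the caller throughout; error paths and most
details are elided) shows what the move buys: swap_avail_lock is no
longer taken while the cluster lock is held.

        ci = lock_cluster(si, offset);          /* per-cluster spinlock */
        /* ... verify and claim the slot ... */
        si->swap_map[offset] = usage;
        inc_cluster_info_page(si, si->cluster_info, offset);
        unlock_cluster(ci);                     /* dropped before avail-list work */

        /* bookkeeping below is protected by si->lock, not the cluster lock */
        if (si->inuse_pages == si->pages) {     /* device just became full */
                spin_lock(&swap_avail_lock);
                plist_del(&si->avail_list, &swap_avail_head);
                spin_unlock(&swap_avail_lock);
        }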

* [PATCH 4/5] mm, swap: Try kzalloc before vzalloc
  2017-03-17  6:46 ` Huang, Ying
@ 2017-03-17  6:46   ` Huang, Ying
  -1 siblings, 0 replies; 19+ messages in thread
From: Huang, Ying @ 2017-03-17  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Dave Hansen, Shaohua Li, Rik van Riel, Huang Ying,
	Johannes Weiner, Michal Hocko, Tim Chen, Mel Gorman,
	Kirill A. Shutemov, Jérôme Glisse, Aaron Lu,
	Hugh Dickins, Minchan Kim, Ingo Molnar, Vegard Nossum,
	linux-kernel, linux-mm

From: Huang Ying <ying.huang@intel.com>

vzalloc() is currently used in the swap code to allocate various data
structures, such as the swap cache, the swap slots cache, cluster
info, etc., because on some systems the sizes may be too large for a
normal kzalloc() to succeed.  But kzalloc() has some advantages, for
example less memory fragmentation and less TLB pressure.  So change
the data structure allocation in the swap code to try kzalloc()
first, and fall back to vzalloc() if kzalloc() fails.

The allocation of swap_map[] in struct swap_info_struct is not
changed, because it is usually quite large and vmalloc_to_page() is
used on it, which makes it a little harder to change.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Tim Chen <tim.c.chen@intel.com>
---
 include/linux/swap.h |  2 ++
 mm/swap_slots.c      | 20 +++++++++++---------
 mm/swap_state.c      |  2 +-
 mm/swapfile.c        | 20 ++++++++++++++++----
 4 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f59d6b077401..35d5b626c4bc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -426,6 +426,8 @@ extern void exit_swap_address_space(unsigned int type);
 extern int get_swap_slots(int n, swp_entry_t *slots);
 extern void swapcache_free_batch(swp_entry_t *entries, int n);
 
+extern void *swap_kvzalloc(size_t size);
+
 #else /* CONFIG_SWAP */
 
 #define swap_address_space(entry)		(NULL)
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 9b5bc86f96ad..7ae10e6f757d 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -31,6 +31,8 @@
 #include <linux/cpumask.h>
 #include <linux/vmalloc.h>
 #include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
 
 #ifdef CONFIG_SWAP
 
@@ -118,17 +120,17 @@ static int alloc_swap_slot_cache(unsigned int cpu)
 	swp_entry_t *slots, *slots_ret;
 
 	/*
-	 * Do allocation outside swap_slots_cache_mutex
-	 * as vzalloc could trigger reclaim and get_swap_page,
+	 * Do allocation outside swap_slots_cache_mutex as
+	 * kzalloc/vzalloc could trigger reclaim and get_swap_page,
 	 * which can lock swap_slots_cache_mutex.
 	 */
-	slots = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+	slots = swap_kvzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
 	if (!slots)
 		return -ENOMEM;
 
-	slots_ret = vzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
+	slots_ret = swap_kvzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE);
 	if (!slots_ret) {
-		vfree(slots);
+		kvfree(slots);
 		return -ENOMEM;
 	}
 
@@ -152,9 +154,9 @@ static int alloc_swap_slot_cache(unsigned int cpu)
 out:
 	mutex_unlock(&swap_slots_cache_mutex);
 	if (slots)
-		vfree(slots);
+		kvfree(slots);
 	if (slots_ret)
-		vfree(slots_ret);
+		kvfree(slots_ret);
 	return 0;
 }
 
@@ -171,7 +173,7 @@ static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
 		cache->cur = 0;
 		cache->nr = 0;
 		if (free_slots && cache->slots) {
-			vfree(cache->slots);
+			kvfree(cache->slots);
 			cache->slots = NULL;
 		}
 		mutex_unlock(&cache->alloc_lock);
@@ -186,7 +188,7 @@ static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
 		}
 		spin_unlock_irq(&cache->free_lock);
 		if (slots)
-			vfree(slots);
+			kvfree(slots);
 	}
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 7bfb9bd1ca21..d31017532ad5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -523,7 +523,7 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
 	unsigned int i, nr;
 
 	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
-	spaces = vzalloc(sizeof(struct address_space) * nr);
+	spaces = swap_kvzalloc(sizeof(struct address_space) * nr);
 	if (!spaces)
 		return -ENOMEM;
 	for (i = 0; i < nr; i++) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 53b5881ee0d6..1fb966cf2175 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2272,8 +2272,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	free_percpu(p->percpu_cluster);
 	p->percpu_cluster = NULL;
 	vfree(swap_map);
-	vfree(cluster_info);
-	vfree(frontswap_map);
+	kvfree(cluster_info);
+	kvfree(frontswap_map);
 	/* Destroy swap account information */
 	swap_cgroup_swapoff(p->type);
 	exit_swap_address_space(p->type);
@@ -2796,7 +2796,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
 		nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 
-		cluster_info = vzalloc(nr_cluster * sizeof(*cluster_info));
+		cluster_info = swap_kvzalloc(nr_cluster * sizeof(*cluster_info));
 		if (!cluster_info) {
 			error = -ENOMEM;
 			goto bad_swap;
@@ -2829,7 +2829,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	}
 	/* frontswap enabled? set up bit-per-page map for frontswap */
 	if (IS_ENABLED(CONFIG_FRONTSWAP))
-		frontswap_map = vzalloc(BITS_TO_LONGS(maxpages) * sizeof(long));
+		frontswap_map =
+			swap_kvzalloc(BITS_TO_LONGS(maxpages) * sizeof(long));
 
 	if (p->bdev &&(swap_flags & SWAP_FLAG_DISCARD) && swap_discardable(p)) {
 		/*
@@ -3308,3 +3309,14 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
 		}
 	}
 }
+
+void *swap_kvzalloc(size_t size)
+{
+	void *p;
+
+	p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!p)
+		p = vzalloc(size);
+
+	return p;
+}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/5] mm, swap: Sort swap entries before free
  2017-03-17  6:46 ` Huang, Ying
@ 2017-03-17  6:46   ` Huang, Ying
  -1 siblings, 0 replies; 19+ messages in thread
From: Huang, Ying @ 2017-03-17  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Dave Hansen, Shaohua Li, Rik van Riel, Huang Ying,
	Tim Chen, Michal Hocko, Kirill A. Shutemov, Vegard Nossum,
	Ingo Molnar, linux-mm, linux-kernel

From: Huang Ying <ying.huang@intel.com>

To reduce contention on swap_info_struct->lock when freeing swap
entries, freed entries are first collected in a per-CPU buffer and
really freed later in a batch.  During the batch free, if consecutive
swap entries in the per-CPU buffer belong to the same swap device,
swap_info_struct->lock needs to be acquired/released only once, so
lock contention is greatly reduced.  But if there are multiple swap
devices, the lock may be released/acquired unnecessarily, because
entries belonging to the same swap device may be non-consecutive in
the per-CPU buffer.

To solve this, the per-CPU buffer is sorted by swap device before the
swap entries are freed.  Tests show that the time spent in
swapcache_free_entries() is reduced by the patch.

The patch was tested by measuring the run time of
swapcache_free_entries() during the exit phase of applications that
use a lot of swap space.  The results show that the average run time
of swapcache_free_entries() is reduced by about 20% after applying
the patch.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Tim Chen <tim.c.chen@intel.com>
---
 mm/swapfile.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1fb966cf2175..dc1716c7997f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -37,6 +37,7 @@
 #include <linux/swapfile.h>
 #include <linux/export.h>
 #include <linux/swap_slots.h>
+#include <linux/sort.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -1065,6 +1066,13 @@ void swapcache_free(swp_entry_t entry)
 	}
 }
 
+static int swp_entry_cmp(const void *ent1, const void *ent2)
+{
+	const swp_entry_t *e1 = ent1, *e2 = ent2;
+
+	return (long)(swp_type(*e1) - swp_type(*e2));
+}
+
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
 	struct swap_info_struct *p, *prev;
@@ -1075,6 +1083,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
 
 	prev = NULL;
 	p = NULL;
+	sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
 	for (i = 0; i < n; ++i) {
 		p = swap_info_get_cont(entries[i], prev);
 		if (p)
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread
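
To make the effect concrete, here is a small illustration (hypothetical
entry layout) of what the sort does to the per-CPU buffer before the
free loop runs; swap_info_get_cont() keeps swap_info_struct->lock held
across consecutive entries of the same device:

/*
 * Entries as they were freed, interleaved across two swap devices
 * (swp_type() of each entry shown):
 *
 *     1 0 1 0 1 0 1 0    lock dropped and retaken at every transition
 *
 * After sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL):
 *
 *     0 0 0 0 1 1 1 1    each device's lock is taken only once
 */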

* Re: [PATCH 4/5] mm, swap: Try kzalloc before vzalloc
  2017-03-17  6:46   ` Huang, Ying
@ 2017-03-17  8:52     ` David Rientjes
  -1 siblings, 0 replies; 19+ messages in thread
From: David Rientjes @ 2017-03-17  8:52 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Andi Kleen, Dave Hansen, Shaohua Li, Rik van Riel,
	Johannes Weiner, Michal Hocko, Tim Chen, Mel Gorman,
	Kirill A. Shutemov, Jérôme Glisse, Aaron Lu,
	Hugh Dickins, Minchan Kim, Ingo Molnar, Vegard Nossum,
	linux-kernel, linux-mm

On Fri, 17 Mar 2017, Huang, Ying wrote:

> From: Huang Ying <ying.huang@intel.com>
> 
> vzalloc() is currently used in the swap code to allocate various
> data structures, such as the swap cache, the swap slots cache,
> cluster info, etc., because on some systems the sizes may be too
> large for a normal kzalloc() to succeed.  But kzalloc() has some
> advantages, for example less memory fragmentation and less TLB
> pressure.  So change the data structure allocation in the swap code
> to try kzalloc() first, and fall back to vzalloc() if kzalloc()
> fails.
> 

I'm concerned about preferring kzalloc() with __GFP_RECLAIM since the page 
allocator will try to do memory compaction for high-order allocations when 
the vzalloc() would have succeeded immediately.  Do we necessarily want to 
spend time doing memory compaction and direct reclaim for contiguous 
memory if it's not needed?

^ permalink raw reply	[flat|nested] 19+ messages in thread
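
One way to address this concern, and the approach the generic kvmalloc
helpers later took for costly high-order sizes, is to keep the kmalloc
attempt but forbid its expensive retry loop, so the vmalloc fallback is
reached quickly.  A sketch only, not part of the posted patch:

        /*
         * __GFP_NORETRY: do not retry hard (or OOM kill) for physically
         * contiguous memory that we can live without; fall back to
         * vzalloc() instead.
         */
        p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
        if (!p)
                p = vzalloc(size);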

* Re: [PATCH 4/5] mm, swap: Try kzalloc before vzalloc
  2017-03-17  6:46   ` Huang, Ying
@ 2017-03-17 11:47     ` Michal Hocko
  -1 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2017-03-17 11:47 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Andi Kleen, Dave Hansen, Shaohua Li, Rik van Riel,
	Johannes Weiner, Tim Chen, Mel Gorman, Kirill A. Shutemov,
	Jérôme Glisse, Aaron Lu, Hugh Dickins, Minchan Kim,
	Ingo Molnar, Vegard Nossum, linux-kernel, linux-mm

On Fri 17-03-17 14:46:22, Huang, Ying wrote:
> +void *swap_kvzalloc(size_t size)
> +{
> +	void *p;
> +
> +	p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
> +	if (!p)
> +		p = vzalloc(size);
> +
> +	return p;
> +}

please do not invent your own kvmalloc implementation when we
already have one in the mmotm tree.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread
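
For reference, with the kvmalloc family Michal refers to, the
open-coded swap_kvzalloc()/kvfree() pairing reduces to direct calls.
A sketch of one call site, assuming the kvzalloc(size, gfp) and
kvfree(ptr) interface that eventually landed in <linux/mm.h>:

        spaces = kvzalloc(sizeof(struct address_space) * nr, GFP_KERNEL);
        if (!spaces)
                return -ENOMEM;
        /* ... */
        kvfree(spaces);         /* frees kmalloc()ed or vmalloc()ed memory */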

* Re: [PATCH 1/5] mm, swap: Fix comment in __read_swap_cache_async
  2017-03-17  6:46 ` Huang, Ying
@ 2017-03-17 12:42   ` Rafael Aquini
  -1 siblings, 0 replies; 19+ messages in thread
From: Rafael Aquini @ 2017-03-17 12:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Andi Kleen, Dave Hansen, Shaohua Li, Rik van Riel,
	Tim Chen, Michal Hocko, Mel Gorman, Aaron Lu, Kirill A. Shutemov,
	Gerald Schaefer, linux-mm, linux-kernel

On Fri, Mar 17, 2017 at 02:46:19PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> Commit cbab0e4eec29 ("swap: avoid read_swap_cache_async() race to
> deadlock while waiting on discard I/O completion") fixed a deadlock
> in read_swap_cache_async().  At that time, in the swap allocation
> path, a swap entry could be marked SWAP_HAS_CACHE and then wait for
> discard I/O to complete before the page for the swap entry was added
> to the swap cache.  But since commit 815c2c543d3a ("swap: make swap
> discard async"), swap discard has become asynchronous, and the wait
> for discard completion is done before the swap entry is marked
> SWAP_HAS_CACHE.  So the comment in the code is out of date now.  This
> patch fixes the comment.
> 
> The cond_resched() added in commit cbab0e4eec29 is no longer strictly
> necessary either.  But if a sleep were ever added to the swap
> allocation path in the future, a hard to debug/reproduce deadlock
> could appear again, so the cond_resched() is kept.
>

^ this is a rather disconcerting way to describe why you left that part
behind, and I recollect telling you about it in a private discussion.

The fact is that __read_swap_cache_async() still races against
get_swap_page(), though with a much narrower window, due to the
asynchronous SSD wear leveling now done for swap and other changes
made within __read_swap_cache_async()'s while loop, which make that
old deadlock scenario very improbable to strike again.

All seems legit, apart from that last paragraph in the commit log
message.


Acked-by: Rafael Aquini <aquini@redhat.com>
 
> Cc: Shaohua Li <shli@kernel.org>
> Cc: Rafael Aquini <aquini@redhat.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  mm/swap_state.c | 12 +-----------
>  1 file changed, 1 insertion(+), 11 deletions(-)
> 
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 473b71e052a8..7bfb9bd1ca21 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -360,17 +360,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  			/*
>  			 * We might race against get_swap_page() and stumble
>  			 * across a SWAP_HAS_CACHE swap_map entry whose page
> -			 * has not been brought into the swapcache yet, while
> -			 * the other end is scheduled away waiting on discard
> -			 * I/O completion at scan_swap_map().
> -			 *
> -			 * In order to avoid turning this transitory state
> -			 * into a permanent loop around this -EEXIST case
> -			 * if !CONFIG_PREEMPT and the I/O completion happens
> -			 * to be waiting on the CPU waitqueue where we are now
> -			 * busy looping, we just conditionally invoke the
> -			 * scheduler here, if there are some more important
> -			 * tasks to run.
> +			 * has not been brought into the swapcache yet.
>  			 */
>  			cond_resched();
>  			continue;
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/5] mm, swap: Try kzalloc before vzalloc
  2017-03-17 11:47     ` Michal Hocko
@ 2017-03-20  1:01     ` Huang, Ying
  -1 siblings, 0 replies; 19+ messages in thread
From: Huang, Ying @ 2017-03-20  1:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Huang, Ying, Andrew Morton, Andi Kleen, Dave Hansen, Shaohua Li,
	Rik van Riel, Johannes Weiner, Tim Chen, Mel Gorman,
	Kirill A. Shutemov, Jerome Glisse, Aaron Lu, Hugh Dickins,
	Minchan Kim, Ingo Molnar, Vegard Nossum, linux-kernel, linux-mm

Michal Hocko <mhocko@kernel.org> writes:

> On Fri 17-03-17 14:46:22, Huang, Ying wrote:
>> +void *swap_kvzalloc(size_t size)
>> +{
>> +	void *p;
>> +
>> +	p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
>> +	if (!p)
>> +		p = vzalloc(size);
>> +
>> +	return p;
>> +}
>
> please do not invent your own kvmalloc implementation when we already
> have on in mmotm tree.

Thanks for pointing that out!  I will use it.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/5] mm, swap: Fix comment in __read_swap_cache_async
  2017-03-17 12:42   ` Rafael Aquini
@ 2017-03-20  2:07     ` Huang, Ying
  -1 siblings, 0 replies; 19+ messages in thread
From: Huang, Ying @ 2017-03-20  2:07 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Huang, Ying, Andrew Morton, Andi Kleen, Dave Hansen, Shaohua Li,
	Rik van Riel, Tim Chen, Michal Hocko, Mel Gorman, Aaron Lu,
	Kirill A. Shutemov, Gerald Schaefer, linux-mm, linux-kernel

Hi, Rafael,

Rafael Aquini <aquini@redhat.com> writes:

> On Fri, Mar 17, 2017 at 02:46:19PM +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> Commit cbab0e4eec29 ("swap: avoid read_swap_cache_async() race to
>> deadlock while waiting on discard I/O completion") fixed a deadlock
>> in read_swap_cache_async().  At that time, in the swap allocation
>> path, a swap entry could be marked SWAP_HAS_CACHE and then wait for
>> discard I/O to complete before the page for the swap entry was added
>> to the swap cache.  But since commit 815c2c543d3a ("swap: make swap
>> discard async"), swap discard has become asynchronous, and the wait
>> for discard completion is done before the swap entry is marked
>> SWAP_HAS_CACHE.  So the comment in the code is out of date now.  This
>> patch fixes the comment.
>> 
>> The cond_resched() added in commit cbab0e4eec29 is no longer strictly
>> necessary either.  But if a sleep were ever added to the swap
>> allocation path in the future, a hard to debug/reproduce deadlock
>> could appear again, so the cond_resched() is kept.
>>
>
> ^ this is a rather disconcerting way to describe why you left that part
> behind, and I recollect telling you about it in a private discussion.
>
> The fact is that __read_swap_cache_async() still races against get_swap_page()
> with a way narrower window due to the async fashioned SSD wear leveling 
> done for swap nowadays and other changes made within __read_swap_cache_async()'s
> while loop thus making that old deadlock scenario very improbable to strike again.

Thanks for your comments!  Could you tell me which kind of race
remains?

> All seems legit, apart from that last paragraph in the commit log
> message
>
>
> Acked-by: Rafael Aquini <aquini@redhat.com>

Thanks!

Best Regards,
Huang, Ying

>> Cc: Shaohua Li <shli@kernel.org>
>> Cc: Rafael Aquini <aquini@redhat.com>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> ---
>>  mm/swap_state.c | 12 +-----------
>>  1 file changed, 1 insertion(+), 11 deletions(-)
>> 
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 473b71e052a8..7bfb9bd1ca21 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -360,17 +360,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>>  			/*
>>  			 * We might race against get_swap_page() and stumble
>>  			 * across a SWAP_HAS_CACHE swap_map entry whose page
>> -			 * has not been brought into the swapcache yet, while
>> -			 * the other end is scheduled away waiting on discard
>> -			 * I/O completion at scan_swap_map().
>> -			 *
>> -			 * In order to avoid turning this transitory state
>> -			 * into a permanent loop around this -EEXIST case
>> -			 * if !CONFIG_PREEMPT and the I/O completion happens
>> -			 * to be waiting on the CPU waitqueue where we are now
>> -			 * busy looping, we just conditionally invoke the
>> -			 * scheduler here, if there are some more important
>> -			 * tasks to run.
>> +			 * has not been brought into the swapcache yet.
>>  			 */
>>  			cond_resched();
>>  			continue;
>> -- 
>> 2.11.0
>> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, newest message: ~2017-03-20  2:08 UTC

Thread overview:
2017-03-17  6:46 [PATCH 1/5] mm, swap: Fix comment in __read_swap_cache_async Huang, Ying
2017-03-17  6:46 ` [PATCH 2/5] mm, swap: Improve readability via make spin_lock/unlock balanced Huang, Ying
2017-03-17  6:46 ` [PATCH 3/5] mm, swap: Avoid lock swap_avail_lock when held cluster lock Huang, Ying
2017-03-17  6:46 ` [PATCH 4/5] mm, swap: Try kzalloc before vzalloc Huang, Ying
2017-03-17  8:52   ` David Rientjes
2017-03-17 11:47   ` Michal Hocko
2017-03-20  1:01     ` Huang, Ying
2017-03-17  6:46 ` [PATCH 5/5] mm, swap: Sort swap entries before free Huang, Ying
2017-03-17 12:42 ` [PATCH 1/5] mm, swap: Fix comment in __read_swap_cache_async Rafael Aquini
2017-03-20  2:07   ` Huang, Ying
