* [PATCH 1/2 v2] mm/zsmalloc.c: Migration can leave pages in ZS_EMPTY indefinitely
@ 2019-08-09 18:17 Henry Burns
From: Henry Burns @ 2019-08-09 18:17 UTC
  To: Minchan Kim, Nitin Gupta
  Cc: Sergey Senozhatsky, Shakeel Butt, Jonathan Adams, Henry Burns,
	linux-mm, linux-kernel, Henry Burns

In zs_page_migrate() we call putback_zspage() after we have finished
migrating all pages in this zspage. However, the return value is ignored.
If zs_free() races in between zs_page_isolate() and zs_page_migrate(),
freeing the last object in the zspage, putback_zspage() will leave the
zspage in the ZS_EMPTY fullness group for an unbounded amount of time.
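
Roughly, the race looks like this (a hypothetical interleaving, with
the call paths simplified):

  CPU A (migration)                  CPU B
  -----------------------------------------------------------------
  zs_page_isolate(page)
                                     zs_free(last object in zspage)
                                     /* zspage is now empty, but it
                                        is isolated, so zs_free()
                                        cannot free it */
  zs_page_migrate(...)
    putback_zspage()                 /* returns ZS_EMPTY; ignored */

  /* Nothing ever schedules free_work, so the empty zspage stays in
     the ZS_EMPTY fullness group. */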

To fix this, we need to do the same thing as zs_page_putback() does:
schedule free_work to occur.  To avoid duplicated code, move the
sequence to a new putback_zspage_deferred() function which both
zs_page_migrate() and zs_page_putback() call.

Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
Signed-off-by: Henry Burns <henryburns@google.com>
---
 Changelog since v1:
 - Moved the comment from putback_zspage_deferred() to
   zs_page_putback().

 mm/zsmalloc.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 57fbb7ced69f..5105b9b66653 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1862,6 +1862,18 @@ static void dec_zspage_isolation(struct zspage *zspage)
 	zspage->isolated--;
 }
 
+static void putback_zspage_deferred(struct zs_pool *pool,
+				    struct size_class *class,
+				    struct zspage *zspage)
+{
+	enum fullness_group fg;
+
+	fg = putback_zspage(class, zspage);
+	if (fg == ZS_EMPTY)
+		schedule_work(&pool->free_work);
+
+}
+
 static void replace_sub_page(struct size_class *class, struct zspage *zspage,
 				struct page *newpage, struct page *oldpage)
 {
@@ -2031,7 +2043,7 @@ static int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	 * the list if @page is final isolated subpage in the zspage.
 	 */
 	if (!is_zspage_isolated(zspage))
-		putback_zspage(class, zspage);
+		putback_zspage_deferred(pool, class, zspage);
 
 	reset_page(page);
 	put_page(page);
@@ -2077,14 +2089,13 @@ static void zs_page_putback(struct page *page)
 	spin_lock(&class->lock);
 	dec_zspage_isolation(zspage);
 	if (!is_zspage_isolated(zspage)) {
-		fg = putback_zspage(class, zspage);
 		/*
 		 * Due to page_lock, we cannot free zspage immediately
 		 * so let's defer.
 		 */
-		if (fg == ZS_EMPTY)
-			schedule_work(&pool->free_work);
+		putback_zspage_deferred(pool, class, zspage);
 	}
+
 	spin_unlock(&class->lock);
 }
 
-- 
2.23.0.rc1.153.gdeed80330f-goog


* [PATCH 2/2 v2] mm/zsmalloc.c: Fix race condition in zs_destroy_pool
From: Henry Burns @ 2019-08-09 18:17 UTC
  To: Minchan Kim, Nitin Gupta
  Cc: Sergey Senozhatsky, Shakeel Butt, Jonathan Adams, Henry Burns,
	linux-mm, linux-kernel, Henry Burns

In zs_destroy_pool() we call flush_work(&pool->free_work). However, we
have no guarantee that migration isn't happening in the background
at that time.

Since migration can't directly free pages, it relies on free_work
being scheduled to free the pages.  But there's nothing preventing an
in-progress migration from queuing the work *after*
zs_unregister_migration() has called flush_work(), which would leave
pages still pointing at the inode when we free it.
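
Roughly (a hypothetical interleaving, simplified):

  CPU A (zs_destroy_pool)            CPU B (migration)
  -----------------------------------------------------------------
                                     zs_page_migrate() in progress
  zs_unregister_migration()
    flush_work(&pool->free_work)     /* nothing queued yet */
  iput(pool->inode)                  /* inode freed */
                                     putback_zspage_deferred()
                                       schedule_work(&pool->free_work)

  /* free_work later runs against pages that still point at the
     freed inode. */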

Since we know that all objects should be free at destroy time, no new
migrations can come in (zs_page_isolate() fails for fully-free
zspages).  This means it is sufficient to track a pool-level count of
isolated zspages, and have the destroy logic wait for that count to
drain to zero before tearing down the inode.  The count is updated
under the class spinlock at the isolate and putback sites, which keeps
the logic straightforward.
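
For illustration only, here is a userspace analogue of the drain
protocol using pthreads and C11 atomics (a minimal sketch, not the
kernel code; pool_drain, drain_inc, drain_dec and drain_wait are names
made up for this example):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the pool's isolation-drain state. */
struct pool_drain {
    atomic_long isolated_pages;   /* like pool->isolated_pages */
    atomic_bool destroying;       /* like pool->destroying */
    pthread_mutex_t lock;
    pthread_cond_t drained;       /* plays the role of migration_wait */
};

static struct pool_drain drain = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .drained = PTHREAD_COND_INITIALIZER,
};

/* Analogue of the isolated_pages increment in zs_page_isolate(). */
static void drain_inc(struct pool_drain *d)
{
    atomic_fetch_add(&d->isolated_pages, 1);
}

/* Analogue of zs_pool_dec_isolated(). */
static void drain_dec(struct pool_drain *d)
{
    /* Only the decrement that reaches zero can need to wake the
     * destroyer, and only if destruction has already started. */
    if (atomic_fetch_sub(&d->isolated_pages, 1) == 1 &&
        atomic_load(&d->destroying)) {
        pthread_mutex_lock(&d->lock);
        pthread_cond_broadcast(&d->drained);
        pthread_mutex_unlock(&d->lock);
    }
}

/* Analogue of the wait in zs_unregister_migration(). */
static void drain_wait(struct pool_drain *d)
{
    /* The seq_cst store is visible before we re-read the count;
     * it plays the role of the smp_mb() in the patch. */
    atomic_store(&d->destroying, true);
    pthread_mutex_lock(&d->lock);
    while (atomic_load(&d->isolated_pages) != 0)
        pthread_cond_wait(&d->drained, &d->lock);
    pthread_mutex_unlock(&d->lock);
}

The mutex/condvar pair stands in for the kernel wait queue: the
destroyer re-checks the count while holding the lock, so the wakeup
from the final decrement cannot slip in between the check and the
sleep.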

Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
Signed-off-by: Henry Burns <henryburns@google.com>
---
 Changelog since v1:
 - Changed the class level isolated count to a pool level isolated count
   (of zspages). Also added a pool level flag for when destruction
   starts and a memory barrier to ensure this flag has global
   visibility.

 mm/zsmalloc.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 59 insertions(+), 2 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 5105b9b66653..08def3a0d200 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -54,6 +54,7 @@
 #include <linux/mount.h>
 #include <linux/pseudo_fs.h>
 #include <linux/migrate.h>
+#include <linux/wait.h>
 #include <linux/pagemap.h>
 #include <linux/fs.h>
 
@@ -268,6 +269,10 @@ struct zs_pool {
 #ifdef CONFIG_COMPACTION
 	struct inode *inode;
 	struct work_struct free_work;
+	/* A wait queue for when migration races with async_free_zspage() */
+	struct wait_queue_head migration_wait;
+	atomic_long_t isolated_pages;
+	bool destroying;
 #endif
 };
 
@@ -1874,6 +1879,19 @@ static void putback_zspage_deferred(struct zs_pool *pool,
 
 }
 
+static inline void zs_pool_dec_isolated(struct zs_pool *pool)
+{
+	VM_BUG_ON(atomic_long_read(&pool->isolated_pages) <= 0);
+	atomic_long_dec(&pool->isolated_pages);
+	/*
+	 * There's no possibility of racing, since wait_for_isolated_drain()
+	 * re-checks the isolated count after enqueuing itself on
+	 * migration_wait, so a wakeup here cannot be missed.
+	 */
+	if (atomic_long_read(&pool->isolated_pages) == 0 && pool->destroying)
+		wake_up_all(&pool->migration_wait);
+}
+
 static void replace_sub_page(struct size_class *class, struct zspage *zspage,
 				struct page *newpage, struct page *oldpage)
 {
@@ -1943,6 +1961,7 @@ static bool zs_page_isolate(struct page *page, isolate_mode_t mode)
 	 */
 	if (!list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
 		get_zspage_mapping(zspage, &class_idx, &fullness);
+		atomic_long_inc(&pool->isolated_pages);
 		remove_zspage(class, zspage, fullness);
 	}
 
@@ -2042,8 +2061,16 @@ static int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	 * Page migration is done so let's putback isolated zspage to
 	 * the list if @page is final isolated subpage in the zspage.
 	 */
-	if (!is_zspage_isolated(zspage))
+	if (!is_zspage_isolated(zspage)) {
+		/*
+		 * We cannot race with zs_destroy_pool() here because we wait
+		 * for isolation to hit zero before we start destroying.
+		 * Also, we ensure that everyone can see pool->destroying before
+		 * we start waiting.
+		 */
 		putback_zspage_deferred(pool, class, zspage);
+		zs_pool_dec_isolated(pool);
+	}
 
 	reset_page(page);
 	put_page(page);
@@ -2094,8 +2121,8 @@ static void zs_page_putback(struct page *page)
 		 * so let's defer.
 		 */
 		putback_zspage_deferred(pool, class, zspage);
+		zs_pool_dec_isolated(pool);
 	}
-
 	spin_unlock(&class->lock);
 }
 
@@ -2118,8 +2145,36 @@ static int zs_register_migration(struct zs_pool *pool)
 	return 0;
 }
 
+static bool pool_isolated_are_drained(struct zs_pool *pool)
+{
+	return atomic_long_read(&pool->isolated_pages) == 0;
+}
+
+/* Block until all isolated zspages have been put back (destroy path). */
+static void wait_for_isolated_drain(struct zs_pool *pool)
+{
+
+	/*
+	 * We're in the process of destroying the pool, so there are no
+	 * active allocations. zs_page_isolate() fails for completely free
+	 * zspages, so we need only wait for the zs_pool's isolated
+	 * count to hit zero.
+	 */
+	wait_event(pool->migration_wait,
+		   pool_isolated_are_drained(pool));
+}
+
 static void zs_unregister_migration(struct zs_pool *pool)
 {
+	pool->destroying = true;
+	/*
+	 * We need a memory barrier here to ensure global visibility of
+	 * pool->destroying. Thus pool->isolated_pages will either be 0, in which
+	 * case we don't care, or it will be > 0 and pool->destroying will
+	 * ensure that we wake up once isolation hits 0.
+	 */
+	smp_mb();
+	wait_for_isolated_drain(pool); /* This can block */
 	flush_work(&pool->free_work);
 	iput(pool->inode);
 }
@@ -2357,6 +2412,8 @@ struct zs_pool *zs_create_pool(const char *name)
 	if (!pool->name)
 		goto err;
 
+	init_waitqueue_head(&pool->migration_wait);
+
 	if (create_cache(pool))
 		goto err;
 
-- 
2.23.0.rc1.153.gdeed80330f-goog


* Re: [PATCH 1/2 v2] mm/zsmalloc.c: Migration can leave pages in ZS_EMPTY indefinitely
From: Sergey Senozhatsky @ 2019-08-20  2:53 UTC
  To: Henry Burns, Andrew Morton
  Cc: Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Shakeel Butt,
	Jonathan Adams, Henry Burns, linux-mm, linux-kernel

On (08/09/19 11:17), Henry Burns wrote:
> In zs_page_migrate() we call putback_zspage() after we have finished
> migrating all pages in this zspage. However, the return value is ignored.
> If zs_free() races in between zs_page_isolate() and zs_page_migrate(),
> freeing the last object in the zspage, putback_zspage() will leave the
> zspage in the ZS_EMPTY fullness group for an unbounded amount of time.
> 
> To fix this, we need to do the same thing as zs_page_putback() does:
> schedule free_work to occur.  To avoid duplicated code, move the
> sequence to a new putback_zspage_deferred() function which both
> zs_page_migrate() and zs_page_putback() call.
> 
> Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
> Signed-off-by: Henry Burns <henryburns@google.com>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

+ Andrew

	-ss

* Re: [PATCH 2/2 v2] mm/zsmalloc.c: Fix race condition in zs_destroy_pool
From: Sergey Senozhatsky @ 2019-08-20  2:59 UTC
  To: Henry Burns, Andrew Morton
  Cc: Minchan Kim, Nitin Gupta, Sergey Senozhatsky, Shakeel Butt,
	Jonathan Adams, Henry Burns, linux-mm, linux-kernel

On (08/09/19 11:17), Henry Burns wrote:
> In zs_destroy_pool() we call flush_work(&pool->free_work). However, we
> have no guarantee that migration isn't happening in the background
> at that time.
> 
> Since migration can't directly free pages, it relies on free_work
> being scheduled to free the pages.  But there's nothing preventing an
> in-progress migration from queuing the work *after*
> zs_unregister_migration() has called flush_work(), which would leave
> pages still pointing at the inode when we free it.
> 
> Since we know that all objects should be free at destroy time, no new
> migrations can come in (zs_page_isolate() fails for fully-free
> zspages).  This means it is sufficient to track a pool-level count of
> isolated zspages, and have the destroy logic wait for that count to
> drain to zero before tearing down the inode.  The count is updated
> under the class spinlock at the isolate and putback sites, which keeps
> the logic straightforward.
> 
> Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
> Signed-off-by: Henry Burns <henryburns@google.com>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

+ Andrew

	-ss

* Re: [PATCH 2/2 v2] mm/zsmalloc.c: Fix race condition in zs_destroy_pool
From: Andrew Morton @ 2019-08-22 23:23 UTC
  To: Sergey Senozhatsky
  Cc: Henry Burns, Minchan Kim, Nitin Gupta, Shakeel Butt,
	Jonathan Adams, Henry Burns, linux-mm, linux-kernel

On Tue, 20 Aug 2019 11:59:39 +0900 Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:

> On (08/09/19 11:17), Henry Burns wrote:
> > In zs_destroy_pool() we call flush_work(&pool->free_work). However, we
> > have no guarantee that migration isn't happening in the background
> > at that time.
> > 
> > Since migration can't directly free pages, it relies on free_work
> > being scheduled to free the pages.  But there's nothing preventing an
> > in-progress migration from queuing the work *after*
> > zs_unregister_migration() has called flush_work(), which would leave
> > pages still pointing at the inode when we free it.
> > 
> > Since we know that all objects should be free at destroy time, no new
> > migrations can come in (zs_page_isolate() fails for fully-free
> > zspages).  This means it is sufficient to track a pool-level count of
> > isolated zspages, and have the destroy logic wait for that count to
> > drain to zero before tearing down the inode.  The count is updated
> > under the class spinlock at the isolate and putback sites, which keeps
> > the logic straightforward.
> > 
> > Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
> > Signed-off-by: Henry Burns <henryburns@google.com>
> 
> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> 

Thanks.  So we have a couple of races which result in memory leaks?  Do
we feel this is serious enough to justify a -stable backport of the
fixes?


* Re: [PATCH 2/2 v2] mm/zsmalloc.c: Fix race condition in zs_destroy_pool
From: Henry Burns @ 2019-08-23  8:10 UTC
  To: Andrew Morton
  Cc: Sergey Senozhatsky, Henry Burns, Minchan Kim, Nitin Gupta,
	Shakeel Butt, Jonathan Adams, linux-mm, linux-kernel

On Thu, Aug 22, 2019 at 7:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 20 Aug 2019 11:59:39 +0900 Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:
>
> > On (08/09/19 11:17), Henry Burns wrote:
> > > In zs_destroy_pool() we call flush_work(&pool->free_work). However, we
> > > have no guarantee that migration isn't happening in the background
> > > at that time.
> > >
> > > Since migration can't directly free pages, it relies on free_work
> > > being scheduled to free the pages.  But there's nothing preventing an
> > > in-progress migration from queuing the work *after*
> > > zs_unregister_migration() has called flush_work(), which would leave
> > > pages still pointing at the inode when we free it.
> > >
> > > Since we know that all objects should be free at destroy time, no new
> > > migrations can come in (zs_page_isolate() fails for fully-free
> > > zspages).  This means it is sufficient to track a pool-level count of
> > > isolated zspages, and have the destroy logic wait for that count to
> > > drain to zero before tearing down the inode.  The count is updated
> > > under the class spinlock at the isolate and putback sites, which keeps
> > > the logic straightforward.
> > >
> > > Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
> > > Signed-off-by: Henry Burns <henryburns@google.com>
> >
> > Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> >
>
> Thanks.  So we have a couple of races which result in memory leaks?  Do
> we feel this is serious enough to justify a -stable backport of the
> fixes?

In this case a memory leak could lead to an eventual crash if
compaction hits the leaked page.  I don't know what a -stable
backport entails, but the crash would only occur if people are
changing their zswap backend at runtime (which is what eventually
triggers pool destruction).

* Re: [PATCH 2/2 v2] mm/zsmalloc.c: Fix race condition in zs_destroy_pool
From: Sergey Senozhatsky @ 2019-08-26  5:13 UTC
  To: Henry Burns
  Cc: Andrew Morton, Sergey Senozhatsky, Henry Burns, Minchan Kim,
	Nitin Gupta, Shakeel Butt, Jonathan Adams, linux-mm,
	linux-kernel

On (08/23/19 04:10), Henry Burns wrote:
> > Thanks.  So we have a couple of races which result in memory leaks?  Do
> > we feel this is serious enough to justify a -stable backport of the
> > fixes?
> 
> In this case a memory leak could lead to an eventual crash if
> compaction hits the leaked page.  I don't know what a -stable
> backport entails, but the crash would only occur if people are
> changing their zswap backend at runtime (which is what eventually
> triggers pool destruction).

Well, zram/zsmalloc is not only used for swap; zram is also a virtual
block device which can be created or destroyed dynamically.  So it
looks like potential -stable material.

Minchan?

	-ss
