Re: [PATCH] zsmalloc: Fix races between asynchronous zspage free and page migration

From: Minchan Kim <minchan@kernel.org>
To: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: stable@vger.kernel.org, Nitin Gupta <ngupta@vflare.org>,
	Sergey Senozhatsky <senozhatsky@chromium.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] zsmalloc: Fix races between asynchronous zspage free and page migration
Date: Wed, 11 May 2022 14:07:19 -0700	[thread overview]
Message-ID: <Ynwlh3RT0PAYpWpD@google.com> (raw)
In-Reply-To: <YnwTfBLn+6vYSe59@sultan-box.localdomain>

On Wed, May 11, 2022 at 12:50:20PM -0700, Sultan Alsawaf wrote:
> On Wed, May 11, 2022 at 11:01:01AM -0700, Minchan Kim wrote:
> > On Sun, May 08, 2022 at 07:47:02PM -0700, Sultan Alsawaf wrote:
> > > Cc: stable@vger.kernel.org
> > > Fixes: 48b4800a1c6a ("zsmalloc: page migration support")
> > 
> > Shouldn't the fix be Fixes: 77ff465799c6 ("zsmalloc: zs_page_migrate: skip
> > unnecessary loops but not return -EBUSY if zspage is not inuse)?
> > Because we didn't migrate ZS_EMPTY pages before.
> 
> Hi,
> 
> Yeah, 77ff465799c6 indeed seems like the commit that introduced the bug.
> 
> > I couldn't get the point here. Why couldn't we simple lock zspage migration?
> > 
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index 9152fbde33b5..05ff2315b7b1 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -1987,7 +1987,10 @@ static void async_free_zspage(struct work_struct *work)
> >  
> >         list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
> >                 list_del(&zspage->list);
> > +
> > +               migrate_read_lock(zspage);
> >                 lock_zspage(zspage);
> > +               migrate_read_unlock(zspage);
> >  
> >                 get_zspage_mapping(zspage, &class_idx, &fullness);
> >                 VM_BUG_ON(fullness != ZS_EMPTY);
> 
> There are two problems with this:
> 1. migrate_read_lock() is a rwlock and lock_page() can sleep.
> 2. This will cause a deadlock because it violates the lock ordering in
>    zs_page_migrate(), since zs_page_migrate() takes migrate_write_lock() under
>    the page lock.
> 

That's true. Thanks!

Then, how about this?

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 9152fbde33b5..2f205c18aee4 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1716,12 +1716,31 @@ static enum fullness_group putback_zspage(struct size_class *class,
  * To prevent zspage destroy during migration, zspage freeing should
  * hold locks of all pages in the zspage.
  */
-static void lock_zspage(struct zspage *zspage)
+static void lock_zspage(struct zs_pool *pool, struct zspage *zspage)
 {
-       struct page *page = get_first_page(zspage);
-
+       struct page *page;
+       int nr_locked;
+       struct page *locked_pages[ZS_MAX_PAGES_PER_ZSPAGE];
+       struct address_space *mapping;
+retry:
+       nr_locked = 0;
+       memset(locked_pages, 0, sizeof(struct page) * ARRAY_SIZE(locked_pages));
+       page = get_first_page(zspage);
        do {
                lock_page(page);
+               locked_pages[nr_locked++] = page;
+               /*
+                * subpage in the zspage could be migrated under us so
+                * verify it. Once it happens, retry the lock sequence.
+                */
+               mapping = page_mapping(page)
+               if (mapping != pool->inode->i_mapping ||
+                   page_private(page) != (unsigned long)zspage) {
+                       do {
+                               unlock_page(locked_pages[--nr_locked]);
+                       } while (nr_locked > 0)
+                       goto retry;
+               }
        } while ((page = get_next_page(page)) != NULL);
 }

@@ -1987,7 +2006,7 @@ static void async_free_zspage(struct work_struct *work)

        list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
                list_del(&zspage->list);
-               lock_zspage(zspage);
+               lock_zspage(pool, zspage);

                get_zspage_mapping(zspage, &class_idx, &fullness);
                VM_BUG_ON(fullness != ZS_EMPTY);