Re: [External] Re: [RFC PATCH] mm: add folio in swapcache if swapin from zswap

From: Zhongkun He <hezhongkun.hzk@bytedance.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Barry Song <21cnbao@gmail.com>,
	chengming.zhou@linux.dev,  nphamcs@gmail.com,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	 linux-kernel@vger.kernel.org
Subject: Re: [External] Re: [RFC PATCH] mm: add folio in swapcache if swapin from zswap
Date: Sat, 23 Mar 2024 10:40:38 +0800
Message-ID: <CACSyD1P2GNmfhU4jghqEnrHi-9kHjLMJM_j_kxcQ=fsEgg7cwA@mail.gmail.com>
In-Reply-To: <CAJD7tkYDh39_Pp-_TFFvduGbirx0MTRpC3p+Z6NuY14xtXiOYA@mail.gmail.com>

On Sat, Mar 23, 2024 at 10:03 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Mar 22, 2024 at 6:55 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, Mar 22, 2024 at 05:14:37PM -0700, Yosry Ahmed wrote:
> > > [..]
> > > > > > I don't think we want to stop doing exclusive loads in zswap due to this
> > > > > > interaction with zram, which shouldn't be common.
> > > > > >
> > > > > > I think we can solve this by just writing the folio back to zswap upon
> > > > > > failure as I mentioned.
> > > > >
> > > > > Instead of storing again, can we avoid invalidating the entry in the
> > > > > first place if the load is not "exclusive"?
> > > > >
> > > > > The reason for exclusive loads is that the ownership is transferred to
> > > > > the swapcache, so there is no point in keeping our copy. With an
> > > > > optimistic read that doesn't transfer ownership, this doesn't
> > > > > apply. And we can easily tell inside zswap_load() if we're dealing
> > > > > with a swapcache read or not by testing the folio.
> > > > >
> > > > > The synchronous read already has to pin the swp_entry_t to be safe,
> > > > > using swapcache_prepare(). That blocks __read_swap_cache_async() which
> > > > > means no other (exclusive) loads and no invalidates can occur.
> > > > >
> > > > > The zswap entry is freed during the regular swap_free() path, which
> > > > > the sync fault calls on success. Otherwise we keep it.
> > > >
> > > > I thought about this, but I was particularly worried about the need to
> > > > bring back the refcount that was removed when we switched to only
> > > > supporting exclusive loads:
> > > > https://lore.kernel.org/lkml/20240201-b4-zswap-invalidate-entry-v2-6-99d4084260a0@bytedance.com/
> > > >
> > > > > It seems to me that we don't need it, because swap_free() will free
> > > > the entry as you mentioned before anyone else has the chance to load
> > > > it or invalidate it. Writeback used to grab a reference as well, but
> > > > it removes the entry from the tree anyway and takes full ownership of
> > > > it then frees it, so that should be okay.
> > > >
> > > > It makes me nervous though to be honest. For example, not long ago
> > > > > swap_free() didn't call zswap_invalidate() directly (it used to happen
> > > > > during swap slots cache draining). Without it, a subsequent load could race
> > > > with writeback without refcount protection, right? We would need to
> > > > make sure to backport 0827a1fb143f ("mm/zswap: invalidate zswap entry
> > > > when swap entry free") with the fix to stable for instance.
> > > >
> > > > I can't find a problem with your diff, but it just makes me nervous to
> > > > have non-exclusive loads without a refcount.
> > > >
> > > > >
> > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > index 535c907345e0..686364a6dd86 100644
> > > > > --- a/mm/zswap.c
> > > > > +++ b/mm/zswap.c
> > > > > @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
> > > > >         swp_entry_t swp = folio->swap;
> > > > >         pgoff_t offset = swp_offset(swp);
> > > > >         struct page *page = &folio->page;
> > > > > +       bool swapcache = folio_test_swapcache(folio);
> > > > >         struct zswap_tree *tree = swap_zswap_tree(swp);
> > > > >         struct zswap_entry *entry;
> > > > >         u8 *dst;
> > > > > @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
> > > > >                 spin_unlock(&tree->lock);
> > > > >                 return false;
> > > > >         }
> > > > > -       zswap_rb_erase(&tree->rbroot, entry);
> > > > > +       if (swapcache)
> > > > > +               zswap_rb_erase(&tree->rbroot, entry);
> > >
> > > On second thought, if we don't remove the entry from the tree here,
> > > writeback could free the entry from under us after we drop the lock
> > > here, right?
> >
> > The sync-swapin does swapcache_prepare() and holds SWAP_HAS_CACHE, so
> > racing writeback would loop on the -EEXIST in __read_swap_cache_async().
> > (Or, if writeback wins the race, sync-swapin fails on swapcache_prepare()
> > instead and bails on the fault.)
> >
> > This isn't coincidental. The sync-swapin needs to, and does, serialize
> > against the swap entry moving into swapcache or being invalidated for
> > it to be safe. Which is the same requirement that zswap ops have.
>
> You are right. Even if swap_free() isn't called under SWAP_HAS_CACHE's
> protection, a subsequent load will also be protected by SWAP_HAS_CACHE
> (whether it's swapped in with sync swapin or through the swapcache)
> -- so it would be protected against writeback as well. Now it seems
> like we may have been able to drop the refcount even without exclusive
> loads..?
>
> Anyway, I think your fix is sound. Zhongkun, do you mind confirming
> that the diff Johannes sent fixes the problem for you?

OK, I will try it and report back in a few hours.

Thanks for the solution; it sounds great.
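
For anyone following along, below is a rough user-space sketch of the serialization argument quoted above, as I read it. It is only a toy model under stated assumptions: struct model_slot and every model_* function are hypothetical names invented for illustration, not kernel APIs, and writeback is reduced to "pin, drop the entry, unpin". It only shows why a non-exclusive load under the SWAP_HAS_CACHE pin should not race with writeback, and that the zswap entry survives a failed sync swapin.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for one swap slot; NOT the kernel's data structures. */
struct model_slot {
	bool has_cache;   /* models SWAP_HAS_CACHE on the swap entry */
	bool zswap_entry; /* models a zswap entry present in the tree */
};

/* Models swapcache_prepare(): pin the slot, or fail if already pinned. */
static int model_swapcache_prepare(struct model_slot *s)
{
	if (s->has_cache)
		return -EEXIST;
	s->has_cache = true;
	return 0;
}

/* Drops the pin taken by model_swapcache_prepare(). */
static void model_unpin(struct model_slot *s)
{
	assert(s->has_cache);
	s->has_cache = false;
}

/*
 * Models zswap_load() with the quoted diff applied: the entry is erased
 * only when the folio is in the swapcache (an exclusive load).
 */
static bool model_zswap_load(struct model_slot *s, bool swapcache)
{
	if (!s->zswap_entry)
		return false;
	if (swapcache)
		s->zswap_entry = false;
	return true;
}

/*
 * Models writeback (heavily simplified): it has to take the pin first,
 * so it backs off with -EEXIST while a sync swapin holds it.
 */
static int model_writeback(struct model_slot *s)
{
	int err = model_swapcache_prepare(s);

	if (err)
		return err;
	s->zswap_entry = false;
	model_unpin(s);
	return 0;
}

/* Models swap_free() invalidating the zswap entry on a successful fault. */
static void model_swap_free(struct model_slot *s)
{
	s->zswap_entry = false;
}
```

In this model, a sync swapin that pins the slot and loads with swapcache == false leaves zswap_entry set, and a concurrent model_writeback() returns -EEXIST until the pin is dropped. If the fault succeeds, model_swap_free() removes the entry; if it fails, the entry is still there for the next attempt.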