linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Barry Song <21cnbao@gmail.com>,
	chengming.zhou@linux.dev, nphamcs@gmail.com,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Zhongkun He <hezhongkun.hzk@bytedance.com>
Subject: Re: [RFC PATCH] mm: add folio in swapcache if swapin from zswap
Date: Fri, 22 Mar 2024 21:55:43 -0400	[thread overview]
Message-ID: <20240323015543.GB448621@cmpxchg.org> (raw)
In-Reply-To: <CAJD7tkZqrrXuYTMYOAP+arMLeNayafFeLocWu7bJtDFHCYjwDA@mail.gmail.com>

On Fri, Mar 22, 2024 at 05:14:37PM -0700, Yosry Ahmed wrote:
> [..]
> > > > I don't think we want to stop doing exclusive loads in zswap due to this
> > > > interaction with zram, which shouldn't be common.
> > > >
> > > > I think we can solve this by just writing the folio back to zswap upon
> > > > failure as I mentioned.
> > >
> > > Instead of storing again, can we avoid invalidating the entry in the
> > > first place if the load is not "exclusive"?
> > >
> > > The reason for exclusive loads is that the ownership is transferred to
> > > the swapcache, so there is no point in keeping our copy. With an
> > > optimistic read that doesn't transfer ownership, this doesn't
> > > apply. And we can easily tell inside zswap_load() if we're dealing
> > > with a swapcache read or not by testing the folio.
> > >
> > > The synchronous read already has to pin the swp_entry_t to be safe,
> > > using swapcache_prepare(). That blocks __read_swap_cache_async() which
> > > means no other (exclusive) loads and no invalidates can occur.
> > >
> > > The zswap entry is freed during the regular swap_free() path, which
> > > the sync fault calls on success. Otherwise we keep it.
> >
> > I thought about this, but I was particularly worried about the need to
> > bring back the refcount that was removed when we switched to only
> > supporting exclusive loads:
> > https://lore.kernel.org/lkml/20240201-b4-zswap-invalidate-entry-v2-6-99d4084260a0@bytedance.com/
> >
> > It seems to be that we don't need it, because swap_free() will free
> > the entry as you mentioned before anyone else has the chance to load
> > it or invalidate it. Writeback used to grab a reference as well, but
> > it removes the entry from the tree anyway and takes full ownership of
> > it then frees it, so that should be okay.
> >
> > It makes me nervous though to be honest. For example, not long ago
> > swap_free() didn't call zswap_invalidate() directly (used to happen to
> > swap slots cache draining). Without it, a subsequent load could race
> > with writeback without refcount protection, right? We would need to
> > make sure to backport 0827a1fb143f ("mm/zswap: invalidate zswap entry
> > when swap entry free") with the fix to stable for instance.
> >
> > I can't find a problem with your diff, but it just makes me nervous to
> > have non-exclusive loads without a refcount.
> >
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 535c907345e0..686364a6dd86 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
> > >         swp_entry_t swp = folio->swap;
> > >         pgoff_t offset = swp_offset(swp);
> > >         struct page *page = &folio->page;
> > > +       bool swapcache = folio_test_swapcache(folio);
> > >         struct zswap_tree *tree = swap_zswap_tree(swp);
> > >         struct zswap_entry *entry;
> > >         u8 *dst;
> > > @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
> > >                 spin_unlock(&tree->lock);
> > >                 return false;
> > >         }
> > > -       zswap_rb_erase(&tree->rbroot, entry);
> > > +       if (swapcache)
> > > +               zswap_rb_erase(&tree->rbroot, entry);
> 
> On second thought, if we don't remove the entry from the tree here,
> writeback could free the entry from under us after we drop the lock
> here, right?

The sync-swapin does swapcache_prepare() and holds SWAP_HAS_CACHE, so
racing writeback would loop on the -EEXIST in __read_swap_cache_async().
(Or, if writeback wins the race, sync-swapin fails on swapcache_prepare()
instead and bails on the fault.)

This isn't coincidental. The sync-swapin needs to, and does, serialize
against the swap entry moving into swapcache or being invalidated for
it to be safe. Which is the same requirement that zswap ops have.


  reply	other threads:[~2024-03-23  1:55 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-22 16:39 [RFC PATCH] mm: add folio in swapcache if swapin from zswap chengming.zhou
2024-03-22 19:37 ` Yosry Ahmed
2024-03-22 21:41   ` Barry Song
2024-03-22 22:33     ` Yosry Ahmed
2024-03-22 23:48       ` Johannes Weiner
2024-03-23  0:12         ` Yosry Ahmed
2024-03-23  0:14           ` Yosry Ahmed
2024-03-23  1:55             ` Johannes Weiner [this message]
2024-03-23  2:02               ` Yosry Ahmed
2024-03-23  2:40                 ` [External] " Zhongkun He
2024-03-23  8:38         ` Zhongkun He
2024-03-23  2:49   ` Chengming Zhou

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240323015543.GB448621@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=chengming.zhou@linux.dev \
    --cc=hezhongkun.hzk@bytedance.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).