All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yosry Ahmed <yosryahmed@google.com>
To: Chris Li <chrisl@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,  linux-mm@kvack.org,
	Nhat Pham <nphamcs@gmail.com>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	 Chengming Zhou <zhouchengming@bytedance.com>,
	Barry Song <v-songbaohua@oppo.com>
Subject: Re: [PATCH v4] zswap: replace RB tree with xarray
Date: Thu, 7 Mar 2024 09:07:05 +0000	[thread overview]
Message-ID: <ZemDuW25YxjqAjm-@google.com> (raw)
In-Reply-To: <CAF8kJuPcPi22hWhJFGAf0RW2Q8PS_WHKXfUYrcwnpHY2VDVhYg@mail.gmail.com>

[..]
> > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > > -{
> > > -     rb_erase(&entry->rbnode, root);
> > > -     RB_CLEAR_NODE(&entry->rbnode);
> > > +     e = xa_store(tree, offset, entry, GFP_KERNEL);
> > > +     err = xa_err(e);
> > > +
> > > +     if (err) {
> > > +             e = xa_erase(tree, offset);
> > > +             if (err == -ENOMEM)
> > > +                     zswap_reject_alloc_fail++;
> > > +             else
> > > +                     zswap_reject_xarray_fail++;
> >
> > I think this is too complicated, and as Chengming pointed out, I believe
> > we can use xa_store() directly in zswap_store().
> 
> Sure.
> 
> > I am also not sure what the need for zswap_reject_xarray_fail is. Are
> > there any reasons why the store here can fail other than -ENOMEM? The
> > docs say the only other option is -EINVAL, and looking at __xa_store(),
> > it seems like this is only possible if xa_is_internal() is true (which
> > means we are not passing in a properly aligned pointer IIUC).
> 
> Because the xa_store document said it can return two error codes. I
> see zswap try to classify the error count it hit, that is why I add
> the zswap_reject_xarray_fail.

Right, but I think we should not get -EINVAL in this case. I think it
would be more appropriate to have WARN_ON() or VM_WARN_ON() in this
case?

[..]
> > > @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> > >  static int zswap_writeback_entry(struct zswap_entry *entry,
> > >                                swp_entry_t swpentry)
> > >  {
> > > -     struct zswap_tree *tree;
> > > +     struct xarray *tree;
> > > +     pgoff_t offset = swp_offset(swpentry);
> > > +     struct zswap_entry *e;
> > >       struct folio *folio;
> > >       struct mempolicy *mpol;
> > >       bool folio_was_allocated;
> > > @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > >        * be dereferenced.
> > >        */
> > >       tree = swap_zswap_tree(swpentry);
> > > -     spin_lock(&tree->lock);
> > > -     if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> > > -             spin_unlock(&tree->lock);
> > > +     e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> > > +     if (e != entry) {
> >
> > I think we can avoid adding 'e' and 'offset' local variables here and
> > just do everything in the if condition. If you want to avoid the line
> > break, then introducing 'offset' is fine, but I don't see any value from
> > 'e'.
> 
> As I said in my other email. I don't think having this type of local
> variable affects the compiler negatively. The compiler generally uses
> their own local variable to track the expression anyway. So I am not
> sure about the motivation to remove local variables alone, if it helps
> the reading. I feel the line  "if (xa_cmpxchg(tree, offset, entry,
> NULL, GFP_KERNEL) != entry)" is too long and complicated inside the if
> condition. That is just me. Not a big deal.

I just think 'e' is not providing any readability improvements. If
anything, people need to pay closer attention to figure out 'e' is only
a temp variable and 'entry' is the real deal.

I vote for:
	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL))

[..]
> > > @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
> > >  {
> > >       swp_entry_t swp = folio->swap;
> > >       pgoff_t offset = swp_offset(swp);
> > > -     struct zswap_tree *tree = swap_zswap_tree(swp);
> > > -     struct zswap_entry *entry, *dupentry;
> > > +     struct xarray *tree = swap_zswap_tree(swp);
> > > +     struct zswap_entry *entry, *old;
> > >       struct obj_cgroup *objcg = NULL;
> > >       struct mem_cgroup *memcg = NULL;
> > > +     int err;
> > > +     bool old_erased = false;
> > >
> > >       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > >       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
> > >                       kunmap_local(src);
> > >                       entry->length = 0;
> > >                       entry->value = value;
> > > +                     entry->pool = NULL;
> >
> > Why do we need to initialize the pool here? Is this is a bug fix for an
> > existing problem or just keeping things clean? Either way I think it
> > should be done separately, unless it is related to a change in this
> > patch.
> 
> I notice the entry->pool will leave uninitialized. I think it should
> be cleaned up. It is a clean up, it does not need to happen in this
> patch. I can do that as a separate patch.

Yes please.
 
[..]
> >
> > >       /*
> > >        * The folio may have been dirtied again, invalidate the
> > >        * possibly stale entry before inserting the new entry.
> > >        */
> > > -     if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > > -             zswap_invalidate_entry(tree, dupentry);
> > > -             WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > > +     err = zswap_xa_insert(tree, entry, &old);
> > > +     if (old)
> > > +             zswap_entry_free(old);
> > > +     if (err) {
> > > +             old_erased = true;
> >
> > I think this can be made simpler if we open code xa_store() here,
> > especially that we already have cleanup code below under 'check_old'
> > that removes the exisitng old entry. So zswap_xa_insert() replicates
> > this cleanup, then we add this 'old_erased' boolean to avoid doing the
> > cleanup below. It seems like it would much more straightforward with
> > open-coding xa_store() here and relying on the existing cleanup for the
> > old entry.  Also, if we initialize 'old' to NULL, we can use its value
> > to figure out whether any cleanup is needed under 'check_old' or not.
> 
> I think that is very similar to what Chengming was suggesting.
> 
> >
> > Taking a step back, I think we can further simplify this. What if we
> > move the tree insertion to right after we allocate the zswap entry? In
> > this case, if the tree insertion fails, we don't need to decrement the
> > same filled counter. If the tree insertion succeeds and then something
> > else fails, the existing cleanup code under 'check_old' will already
> > clean up the tree insertion for us.
> 
> That will create complications that, if the zswap compression fails
> the compression ratio, you will have to remove the entry from the tree
> as clean up. You have both xa_store() and xa_erase() where the current
> code just does one xa_erase() on compression failure.

Not really. If xa_store() fails because of -ENOMEM, then I think by
definition we do not need xa_erase() as there shouldn't be any stale
entries. I also think -ENOMEM should be the only valid errno from
xa_store() in this context. So we can avoid the check_old code if
xa_store() is called (whether it fails or succeeds) IIUC.

I prefer calling xa_store() entry and avoiding the extra 'insert_failed'
cleanup code, especially that unlike other cleanup code, it has its own
branching based on entry->length. I am also planning a cleanup for
zswap_store() to split the code better for the same_filled case and
avoid some unnecessary checks and failures, so it would be useful to
keep the common code path together.

> 
> >
> > If this works, we don't need to add extra cleanup code or move any code
> > around. Something like:
> 
> Due to the extra xa_insert() on compression failure, I think
> Chengming's or your earlier suggestion is better.
> 
> BTW, while you are here, can you confirm this race discussed in
> earlier email can't happen? Chengming convinced me this shouldn't
> happen. Like to hear your thoughts.
> 
>  CPU1                         CPU2
> 
> xa_store()
>                                    entry = xa_erase()
>                                    zswap_free_entry(entry)
> 
> if (entry->length)
>      ...
> CPU1 is using entry after free.

IIUC, CPU1 is in zswap_store(), CPU2 could either in zswap_invalidate()
or zswap_load().

For zswap_load(), I think synchronization is done in the core swap code
ensure we are not doing parallel swapin/swapout at the same entry,
right? In this specific case, I think the folio would be in the
swapcache while swapout (i.e. zswap_store()) is ongoing, so any swapins
will read the folio and not call zswap_load().

Actually, if we do not prevent parallel swapin/swapou at the same entry,
I suspect we may have problems even outside of zswap. For example, we
may read a partially written swap entry from disk, right? Or does the
block layer synchronize this somehow?

For zswap_invalidate(), the core swap code calls it when the swap entry
is no longer used and before we free it for reuse, so IIUC parallel
swapouts (i.e. zswap_store()) should not be possible here as well.

  reply	other threads:[~2024-03-07  9:07 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-04 21:32 [PATCH v4] zswap: replace RB tree with xarray Chris Li
2024-03-05  2:52 ` Chengming Zhou
2024-03-06 21:12   ` Chris Li
2024-03-07  2:17     ` Chengming Zhou
2024-03-07 18:45       ` Chris Li
2024-03-06 23:43   ` Chris Li
2024-03-05  5:08 ` Yosry Ahmed
2024-03-06 21:37   ` Chris Li
2024-03-07  9:07     ` Yosry Ahmed [this message]
2024-03-07 19:39       ` Chris Li
2024-03-07 21:28         ` Yosry Ahmed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZemDuW25YxjqAjm-@google.com \
    --to=yosryahmed@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=chrisl@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=v-songbaohua@oppo.com \
    --cc=willy@infradead.org \
    --cc=zhouchengming@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.