From: Yosry Ahmed <yosryahmed@google.com>
To: Minchan Kim <minchan@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>,
lsf-pc@lists.linux-foundation.org, Linux-MM <linux-mm@kvack.org>,
Michal Hocko <mhocko@kernel.org>,
Shakeel Butt <shakeelb@google.com>,
David Rientjes <rientjes@google.com>,
Hugh Dickins <hughd@google.com>,
Seth Jennings <sjenning@redhat.com>,
Dan Streetman <ddstreet@ieee.org>,
Vitaly Wool <vitaly.wool@konsulko.com>,
Yang Shi <shy828301@gmail.com>, Peter Xu <peterx@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
Date: Wed, 1 Mar 2023 17:25:00 -0800
Message-ID: <CAJD7tkafuzg23kCfVhHVQzN4zQXNwM5dXSgRsWigAHabV0j_7g@mail.gmail.com>
In-Reply-To: <CAJD7tkYbLG1=YbBCF7bj9MbRcH0FkdjTi4tRZ+vAxE5DUodFAg@mail.gmail.com>
On Wed, Mar 1, 2023 at 4:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Feb 28, 2023 at 3:29 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > Hi Yosry,
> >
> > On Tue, Feb 28, 2023 at 12:12:05AM -0800, Yosry Ahmed wrote:
> > > On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky
> > > <senozhatsky@chromium.org> wrote:
> > > >
> > > > On (23/02/18 14:38), Yosry Ahmed wrote:
> > > > [..]
> > > > > ==================== Idea ====================
> > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > entry. It can also include all non-backend specific operations, such
> > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > > code and the actual swapping implementation.
> > > > >
> > > > > ==================== Benefits ====================
> > > > > This work enables using zswap without a backing swapfile and increases
> > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > a separation that allows us to skip code paths that don't make sense
> > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > > which might result in better performance (fewer lookups, less lock
> > > > > contention).
> > > > >
> > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > removing swapper address spaces, removing swap count continuation
> > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > > things that are stored in page tables / caches, and one for actual swap
> > > > > entries. In the future, we can potentially further optimize how we use
> > > > > the bits in the page tables instead of sticking everything into the
> > > > > current type/offset format.
> > > > >
> > > > > Another potential win here can be swapoff, which can be more practical
> > > > > by directly scanning all swap_desc's instead of going through page
> > > > > tables and shmem page caches.
> > > > >
> > > > > Overall zswap becomes more accessible and available to a wider range
> > > > > of use cases.
> > > >
> > > > I assume this also brings us closer to a proper writeback LRU handling?
> > >
> > > I assume by proper LRU handling you mean:
> > > > - Swap writeback LRU that lives outside of the zpool backends (i.e. in
> > > zswap itself or even outside zswap).
> >
> > > Even outside zswap, to support any combination of devices in a
> > > heterogeneous multi-device swap configuration.
>
> Agreed, this is the end goal for the writeback LRU.
>
> >
> > > The indirection layer would be essential to support it, but it would
> > > also be great if we didn't waste any memory for users who don't
> > > want the feature.
>
> I can't currently think of a way to eliminate overhead for people only
> using swapfiles, as a lot of the core implementation changes, unless
> we want to maintain considerably more code with a lot of repeated
> functionality implemented differently. Perhaps this will change as I
> implement this; maybe things are better (or worse) than I think they
> are. I am actively working on a proof-of-concept right now. Maybe
> a discussion in LSF/MM/BPF will help come up with optimizations as
> well :)
>
> >
> > > Just FYI, there was a similar discussion a long time ago about the
> > > indirection layer.
> > https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
>
> Yeah, Hugh shared this one with me earlier, but there are a few things
> in it that I don't understand how to make work, at least in today's
> world.
>
> Firstly, the proposal suggests that we store a radix tree index in the
> page tables, and in the radix tree store the swap entry AND the swap
> count. I am not really sure how they would fit in 8 bytes, especially
> if we need continuation and 1 byte is not enough for the swap count.
> Continuation logic now depends on linking vmalloc'd pages using the
> lru field in struct page/folio. Perhaps we can figure out a split that
> gives enough space for swap count without continuation while also not
> limiting swapfile sizes too much.
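
A rough userspace sketch of what such a split could look like, purely
for illustration. The 16/48 bit split and all of the names (slot_pack(),
slot_get(), SLOT_COUNT_BITS, ...) are assumptions made up for this
example, not an existing kernel layout. Every bit given to the count is
a bit taken away from the swap entry (type plus offset), which is the
tradeoff described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of one 8-byte slot: high bits count, low bits entry. */
#define SLOT_COUNT_BITS 16u
#define SLOT_ENTRY_BITS (64u - SLOT_COUNT_BITS)
#define SLOT_ENTRY_MASK ((UINT64_C(1) << SLOT_ENTRY_BITS) - 1)
#define SLOT_COUNT_MAX  ((UINT64_C(1) << SLOT_COUNT_BITS) - 1)

/* Pack a swap entry value and its swap count into a single 64-bit slot. */
static uint64_t slot_pack(uint64_t entry, uint64_t count)
{
	assert(entry <= SLOT_ENTRY_MASK);
	assert(count <= SLOT_COUNT_MAX);
	return (count << SLOT_ENTRY_BITS) | entry;
}

static uint64_t slot_entry(uint64_t slot)
{
	return slot & SLOT_ENTRY_MASK;
}

static uint64_t slot_count(uint64_t slot)
{
	return slot >> SLOT_ENTRY_BITS;
}

/*
 * Take one more reference on the slot. Returns 0 on success, or -1 if
 * the count field is already full, which is the point where something
 * like continuation would otherwise be needed.
 */
static int slot_get(uint64_t *slot)
{
	if (slot_count(*slot) == SLOT_COUNT_MAX)
		return -1;
	*slot += UINT64_C(1) << SLOT_ENTRY_BITS;
	return 0;
}
```

With this particular split the count saturates at 65535 references, and
the entry is limited to 48 bits. A slot on a 32-bit system would only
have 4 bytes to divide up, so the same trick gets much tighter there.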
>
> Secondly, IIUC in that proposal once we swap a page in, we free the
> swap entry and add the swapcache page to the radix tree instead. In
> that case, where does the swap count go? IIUC we still need to
> maintain it to be able to tell when all processes mapping the page
> have faulted it back, otherwise the radix tree entry is maintained
> indefinitely. We can maybe stash the swap count somewhere else in this
> case, and bring it back to the radix tree if we swap the page out
> again. Not really sure where, we can have a separate radix tree for
> swap counts when the page is in swapcache, or we can always have it in
> a separate radix tree so that the swap entry fits comfortably in the
> first radix tree.
>
> To be able to accommodate zswap in this design, I think we always need
> a separate radix tree for swap counts. In that case, one radix tree
> contains swap_entry/zswap_entry/swapcache, and the other radix tree
> contains the swap count. I think this may work, but I am not sure if
> the overhead of always doing a lookup to read the swap count is okay.
> I am also sure there would be some fun synchronization problems
> between both trees (but we already need to synchronize today between
> the swapcache and swap counts?).
>
> It sounds like it is possible to make it work. I will spend some time
> thinking about it. Having 2 radix trees also solves the 32-bit systems
> problem, but I am not sure if it's a generally better design. Radix
> trees also take up some extra space other than the entry size itself,
> so I am not sure how much memory we would end up actually saving.
>
> Johannes, I am curious if you have any thoughts about this alternative design?
I completely forgot about shadow entries here. I don't think any of
this works with shadow entries as we still need to maintain them while
the page is swapped out.
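
To illustrate the concern: a slot that holds only a swap entry, or only
a swapcache page, has no room for a shadow entry next to it. A
per-entry descriptor along the lines of the swap_desc idea could carry
one. Below is a minimal userspace sketch under that assumption. The
field and function names are hypothetical and chosen for this example
only; this is not the layout from the proposal or from any kernel tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backends a descriptor can point at. */
enum swap_backend { SWAP_BACKEND_SWAPFILE, SWAP_BACKEND_ZSWAP };

/* Hypothetical descriptor; all fields are stand-ins for illustration. */
struct swap_desc {
	enum swap_backend backend;
	union {
		uint64_t swapfile_slot;	/* stand-in for a swapfile entry */
		void *zswap_entry;	/* stand-in for a zswap entry pointer */
	};
	bool in_swapcache;
	union {
		void *folio;		/* valid while in_swapcache is true */
		uintptr_t shadow;	/* valid while in_swapcache is false */
	};
	unsigned int swap_count;	/* references from page tables etc. */
};

/* The page left the swapcache: remember its shadow entry instead. */
static void swap_desc_set_shadow(struct swap_desc *desc, uintptr_t shadow)
{
	desc->in_swapcache = false;
	desc->shadow = shadow;
}

/* Swap-in: install the folio and hand back the shadow it replaces. */
static uintptr_t swap_desc_swapin(struct swap_desc *desc, void *folio)
{
	uintptr_t shadow = desc->in_swapcache ? 0 : desc->shadow;

	desc->in_swapcache = true;
	desc->folio = folio;
	return shadow;
}

/* Drop one reference; returns true when the descriptor can be freed. */
static bool swap_desc_put(struct swap_desc *desc)
{
	assert(desc->swap_count > 0);
	return --desc->swap_count == 0;
}
```

The point of the sketch is only that the backend pointer, the
swapcache/shadow slot, and the swap count live side by side, so none of
them has to be evicted from an 8-byte slot to make room for another.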