linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Yosry Ahmed <yosryahmed@google.com>
To: Minchan Kim <minchan@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>,
	lsf-pc@lists.linux-foundation.org,  Linux-MM <linux-mm@kvack.org>,
	Michal Hocko <mhocko@kernel.org>,
	 Shakeel Butt <shakeelb@google.com>,
	David Rientjes <rientjes@google.com>,
	 Hugh Dickins <hughd@google.com>,
	Seth Jennings <sjenning@redhat.com>,
	 Dan Streetman <ddstreet@ieee.org>,
	Vitaly Wool <vitaly.wool@konsulko.com>,
	 Yang Shi <shy828301@gmail.com>, Peter Xu <peterx@redhat.com>,
	 Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
Date: Wed, 1 Mar 2023 17:25:00 -0800	[thread overview]
Message-ID: <CAJD7tkafuzg23kCfVhHVQzN4zQXNwM5dXSgRsWigAHabV0j_7g@mail.gmail.com> (raw)
In-Reply-To: <CAJD7tkYbLG1=YbBCF7bj9MbRcH0FkdjTi4tRZ+vAxE5DUodFAg@mail.gmail.com>

On Wed, Mar 1, 2023 at 4:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Feb 28, 2023 at 3:29 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > Hi Yosry,
> >
> > On Tue, Feb 28, 2023 at 12:12:05AM -0800, Yosry Ahmed wrote:
> > > On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky
> > > <senozhatsky@chromium.org> wrote:
> > > >
> > > > On (23/02/18 14:38), Yosry Ahmed wrote:
> > > > [..]
> > > > > ==================== Idea ====================
> > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > entry. It can also include all non-backend specific operations, such
> > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > > code and the actual swapping implementation.
> > > > >
> > > > > ==================== Benefits ====================
> > > > > This work enables using zswap without a backing swapfile and increases
> > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > a separation that allows us to skip code paths that don't make sense
> > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > which might result in better performance (less lookups, less lock
> > > > > contention).
> > > > >
> > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > removing swapper address spaces, removing swap count continuation
> > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > things that are stored in page tables / caches, and for actual swap
> > > > > entries. In the future, we can potentially further optimize how we use
> > > > > the bits in the page tables instead of sticking everything into the
> > > > > current type/offset format.
> > > > >
> > > > > Another potential win here can be swapoff, which can be more practical
> > > > > by directly scanning all swap_desc's instead of going through page
> > > > > tables and shmem page caches.
> > > > >
> > > > > Overall zswap becomes more accessible and available to a wider range
> > > > > of use cases.
> > > >
> > > > I assume this also brings us closer to a proper writeback LRU handling?
> > >
> > > I assume by proper LRU handling you mean:
> > > - Swap writeback LRU that lives outside of the zpool backends (i.e in
> > > zswap itself or even outside zswap).
> >
> > Even outside zswap to support any combination on any heterogenous
> > multiple swap device configuration.
>
> Agreed, this is the end goal for the writeback LRU.
>
> >
> > The indirection layer would be essential to support it but it would
> > be also great if we don't waste any memory for the user who don't
> > want the feature.
>
> I can't currently think of a way to eliminate overhead for people only
> using swapfiles, as a lot of the core implementation changes, unless
> we want to maintain considerably more code with a lot of repeated
> functionality implemented differently. Perhaps this will change as I
> implement this, maybe things are better (or worse) than what I think
> they are, I am actively working on a proof-of-concept right now. Maybe
> a discussion in LSF/MM/BPF will help come up with optimizations as
> well :)
>
> >
> > Just FYI, there was similar discussion long time ago about the
> > indirection layer.
> > https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
>
> Yeah Hugh shared this one with me earlier, but there are a few things
> that I don't understand how they would work, at least in today's
> world.
>
> Firstly, the proposal suggests that we store a radix tree index in the
> page tables, and in the radix tree store the swap entry AND the swap
> count. I am not really sure how they would fit in 8 bytes, especially
> if we need continuation and 1 byte is not enough for the swap count.
> Continuation logic now depends on linking vmalloc'd pages using the
> lru field in struct page/folio. Perhaps we can figure out a split that
> gives enough space for swap count without continuation while also not
> limiting swapfile sizes too much.
>
> Secondly, IIUC in that proposal once we swap a page in, we free the
> swap entry and add the swapcache page to the radix tree instead. In
> that case, where does the swap count go? IIUC we still need to
> maintain it to be able to tell when all processes mapping the page
> have faulted it back, otherwise the radix tree entry is maintained
> indefinitely. We can maybe stash the swap count somewhere else in this
> case, and bring it back to the radix tree if we swap the page out
> again. Not really sure where, we can have a separate radix tree for
> swap counts when the page is in swapcache, or we can always have it in
> a separate radix tree so that the swap entry fits comfortably in the
> first radix tree.
>
> To be able to accomodate zswap in this design, I think we always need
> a separate radix tree for swap counts. In that case, one radix tree
> contains swap_entry/zswap_entry/swapcache, and the other radix tree
> contains the swap count. I think this may work, but I am not sure if
> the overhead of always doing a lookup to read the swap count is okay.
> I am also sure there would be some fun synchronization problems
> between both trees (but we already need to synchronize today between
> the swapcache and swap counts?).
>
> It sounds like it is possible to make it work. I will spend some time
> thinking about it. Having 2 radix trees also solves the 32-bit systems
> problem, but I am not sure if it's a generally better design. Radix
> trees also take up some extra space other than the entry size itself,
> so I am not sure how much memory we would end up actually saving.
>
> Johannes, I am curious if you have any thoughts about this alternative design?

I completely forgot about shadow entries here. I don't think any of
this works with shadow entries as we still need to maintain them while
the page is swapped out.


  reply	other threads:[~2023-03-02  1:25 UTC|newest]

Thread overview: 105+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-18 22:38 [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap Yosry Ahmed
2023-02-19  4:31 ` Matthew Wilcox
2023-02-19  9:34   ` Yosry Ahmed
2023-02-28 23:22   ` Chris Li
2023-03-01  0:08     ` Matthew Wilcox
2023-03-01 23:22       ` Chris Li
2023-02-21 18:39 ` Yang Shi
2023-02-21 18:56   ` Yosry Ahmed
2023-02-21 19:26     ` Yang Shi
2023-02-21 19:46       ` Yosry Ahmed
2023-02-21 23:34         ` Yang Shi
2023-02-21 23:38           ` Yosry Ahmed
2023-02-22 16:57             ` Johannes Weiner
2023-02-22 22:46               ` Yosry Ahmed
2023-02-28  4:29                 ` Kalesh Singh
2023-02-28  8:09                   ` Yosry Ahmed
2023-02-28  4:54 ` Sergey Senozhatsky
2023-02-28  8:12   ` Yosry Ahmed
2023-02-28 23:29     ` Minchan Kim
2023-03-02  0:58       ` Yosry Ahmed
2023-03-02  1:25         ` Yosry Ahmed [this message]
2023-03-02 17:05         ` Chris Li
2023-03-02 17:47         ` Chris Li
2023-03-02 18:15           ` Johannes Weiner
2023-03-02 18:56             ` Chris Li
2023-03-02 18:23           ` Rik van Riel
2023-03-02 21:42             ` Chris Li
2023-03-02 22:36               ` Rik van Riel
2023-03-02 22:55                 ` Yosry Ahmed
2023-03-03  4:05                   ` Chris Li
2023-03-03  0:01                 ` Chris Li
2023-03-02 16:58       ` Chris Li
2023-03-01 10:44     ` Sergey Senozhatsky
2023-03-02  1:01       ` Yosry Ahmed
2023-02-28 23:11 ` Chris Li
2023-03-02  0:30   ` Yosry Ahmed
2023-03-02  1:00     ` Yosry Ahmed
2023-03-02 16:51     ` Chris Li
2023-03-03  0:33     ` Minchan Kim
2023-03-03  0:49       ` Yosry Ahmed
2023-03-03  1:25         ` Minchan Kim
2023-03-03 17:15           ` Yosry Ahmed
2023-03-09 12:48     ` Huang, Ying
2023-03-09 19:58       ` Chris Li
2023-03-09 20:19       ` Yosry Ahmed
2023-03-10  3:06         ` Huang, Ying
2023-03-10 23:14           ` Chris Li
2023-03-13  1:10             ` Huang, Ying
2023-03-15  7:41               ` Yosry Ahmed
2023-03-16  1:42                 ` Huang, Ying
2023-03-11  1:06           ` Yosry Ahmed
2023-03-13  2:12             ` Huang, Ying
2023-03-15  8:01               ` Yosry Ahmed
2023-03-16  7:50                 ` Huang, Ying
2023-03-17 10:19                   ` Yosry Ahmed
2023-03-17 18:19                     ` Chris Li
2023-03-17 18:23                       ` Yosry Ahmed
2023-03-20  2:55                     ` Huang, Ying
2023-03-20  6:25                       ` Chris Li
2023-03-23  0:56                         ` Huang, Ying
2023-03-23  6:46                           ` Chris Li
2023-03-23  6:56                             ` Huang, Ying
2023-03-23 18:28                               ` Chris Li
2023-03-23 18:40                                 ` Yosry Ahmed
2023-03-23 19:49                                   ` Chris Li
2023-03-23 19:54                                     ` Yosry Ahmed
2023-03-23 21:10                                       ` Chris Li
2023-03-24 17:28                                       ` Chris Li
2023-03-22  5:56                       ` Yosry Ahmed
2023-03-23  1:48                         ` Huang, Ying
2023-03-23  2:21                           ` Yosry Ahmed
2023-03-23  3:16                             ` Huang, Ying
2023-03-23  3:27                               ` Yosry Ahmed
2023-03-23  5:37                                 ` Huang, Ying
2023-03-23 15:18                                   ` Yosry Ahmed
2023-03-24  2:37                                     ` Huang, Ying
2023-03-24  7:28                                       ` Yosry Ahmed
2023-03-24 17:23                                         ` Chris Li
2023-03-27  1:23                                           ` Huang, Ying
2023-03-28  5:54                                             ` Yosry Ahmed
2023-03-28  6:20                                               ` Huang, Ying
2023-03-28  6:29                                                 ` Yosry Ahmed
2023-03-28  6:59                                                   ` Huang, Ying
2023-03-28  7:59                                                     ` Yosry Ahmed
2023-03-28 14:14                                                       ` Johannes Weiner
2023-03-28 19:59                                                         ` Yosry Ahmed
2023-03-28 21:22                                                           ` Chris Li
2023-03-28 21:30                                                             ` Yosry Ahmed
2023-03-28 20:50                                                       ` Chris Li
2023-03-28 21:01                                                         ` Yosry Ahmed
2023-03-28 21:32                                                           ` Chris Li
2023-03-28 21:44                                                             ` Yosry Ahmed
2023-03-28 22:01                                                               ` Chris Li
2023-03-28 22:02                                                                 ` Yosry Ahmed
2023-03-29  1:31                                                               ` Huang, Ying
2023-03-29  1:41                                                                 ` Yosry Ahmed
2023-03-29 16:04                                                                   ` Chris Li
2023-04-04  8:24                                                                     ` Huang, Ying
2023-04-04  8:10                                                                   ` Huang, Ying
2023-04-04  8:47                                                                     ` Yosry Ahmed
2023-04-06  1:40                                                                       ` Huang, Ying
2023-03-29 15:22                                                                 ` Chris Li
2023-03-10  2:07 ` Luis Chamberlain
2023-03-10  2:15   ` Yosry Ahmed
2023-05-12  3:07 ` Yosry Ahmed

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJD7tkafuzg23kCfVhHVQzN4zQXNwM5dXSgRsWigAHabV0j_7g@mail.gmail.com \
    --to=yosryahmed@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=ddstreet@ieee.org \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=mhocko@kernel.org \
    --cc=minchan@kernel.org \
    --cc=peterx@redhat.com \
    --cc=rientjes@google.com \
    --cc=senozhatsky@chromium.org \
    --cc=shakeelb@google.com \
    --cc=shy828301@gmail.com \
    --cc=sjenning@redhat.com \
    --cc=vitaly.wool@konsulko.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).