From: Yosry Ahmed <yosryahmed@google.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Chris Li <chrisl@kernel.org>,
lsf-pc@lists.linux-foundation.org,
Johannes Weiner <hannes@cmpxchg.org>,
Linux-MM <linux-mm@kvack.org>, Michal Hocko <mhocko@kernel.org>,
Shakeel Butt <shakeelb@google.com>,
David Rientjes <rientjes@google.com>,
Hugh Dickins <hughd@google.com>,
Seth Jennings <sjenning@redhat.com>,
Dan Streetman <ddstreet@ieee.org>,
Vitaly Wool <vitaly.wool@konsulko.com>,
Yang Shi <shy828301@gmail.com>, Peter Xu <peterx@redhat.com>,
Minchan Kim <minchan@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>,
Michal Hocko <mhocko@suse.com>, Wei Xu <weixugc@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
Date: Fri, 10 Mar 2023 17:06:35 -0800
Message-ID: <CAJD7tkamf8TtY0wOjhZKsWBJLL4pMsUhkwPtwCuroWcipRZ3CA@mail.gmail.com>
In-Reply-To: <87y1o571aa.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Thu, Mar 9, 2023 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Thu, Mar 9, 2023 at 4:49 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
> >> >>
> >> >> Hi Yosry,
> >> >>
> >> >> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> >> >> > Hello everyone,
> >> >> >
> >> >> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> >> >> > 2023 about swap & zswap (hope I am not too late).
> >> >>
> >> >> I am very interested in participating in this discussion as well.
> >> >
> >> > That's great to hear!
> >> >
> >> >>
> >> >> > ==================== Objective ====================
> >> >> > Enabling the use of zswap without a backing swapfile, which makes
> >> >> > zswap useful for a wider variety of use cases. Also, when zswap is
> >> >> > used with a swapfile, the pages in zswap do not use up space in the
> >> >> > swapfile, so the overall swapping capacity increases.
> >> >>
> >> >> Agree.
> >> >>
> >> >> >
> >> >> > ==================== Idea ====================
> >> >> > Introduce a data structure, which I currently call a swap_desc, as an
> >> >> > abstraction layer between swapping implementation and the rest of MM
> >> >> > code. Page tables & page caches would store a swap id (encoded as a
> >> >> > swp_entry_t) instead of directly storing the swap entry associated
> >> >> > with the swapfile. This swap id maps to a struct swap_desc, which acts
> >> >>
> >> >> Can you provide a bit more detail? I am curious how this swap id
> >> >> maps into the swap_desc? Is the swp_entry_t cast into "struct
> >> >> swap_desc*" or going through some lookup table/tree?
> >> >
> >> > swap id would be an index in a radix tree (aka xarray), which contains
> >> > a pointer to the swap_desc struct. This lookup should be free with
> >> > this design as we also use swap_desc to directly store the swap cache
> >> > pointer, so this lookup essentially replaces the swap cache lookup.
> >> >
> >> >>
> >> >> > as our abstraction layer. All MM code not concerned with swapping
> >> >> > details would operate in terms of swap descs. The swap_desc can point
> >> >> > to either a normal swap entry (associated with a swapfile) or a zswap
> >> >> > entry. It can also include all non-backend specific operations, such
> >> >> > as the swapcache (which would be a simple pointer in swap_desc), swap
> >> >>
> >> >> Does the zswap entry still use the swap slot cache and swap_info_struct?
> >> >
> >> > In this design no, it shouldn't.
> >> >
> >> >>
> >> >> > This work enables using zswap without a backing swapfile and increases
> >> >> > the swap capacity when zswap is used with a swapfile. It also creates
> >> >> > a separation that allows us to skip code paths that don't make sense
> >> >> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> >> >> > which might result in better performance (fewer lookups, less lock
> >> >> > contention).
> >> >> >
> >> >> > The abstraction layer also opens the door for multiple cleanups (e.g.
> >> >> > removing swapper address spaces, removing swap count continuation
> >> >> > code, etc). Another nice cleanup that this work enables would be
> >> >> > separating the overloaded swp_entry_t into two distinct types: one for
> >> >> > things that are stored in page tables / caches, and one for actual swap
> >> >> > entries. In the future, we can potentially further optimize how we use
> >> >> > the bits in the page tables instead of sticking everything into the
> >> >> > current type/offset format.
> >> >>
> >> >> Looking forward to seeing more details in the upcoming discussion.
> >> >> >
> >> >> > ==================== Cost ====================
> >> >> > The obvious downside of this is added memory overhead, specifically
> >> >> > for users that use swapfiles without zswap. Instead of paying one byte
> >> >> > (swap_map) for every potential page in the swapfile (+ swap count
> >> >> > continuation), we pay the size of the swap_desc for every page that is
> >> >> > actually in the swapfile, which I am estimating can be roughly around
> >> >> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> >> >> > scales with pages actually swapped out. For zswap users, it should be
> >> >>
> >> >> Is there a way to avoid turning 1 byte into 24 byte per swapped
> >> >> pages? For the users that use swap but no zswap, this is pure overhead.
> >> >
> >> > That's what I could think of at this point. My idea was something like this:
> >> >
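> >> > struct swap_desc {
> >> >         union { /* Use one bit to distinguish them */
> >> >                 swp_entry_t swap_entry;          /* swapfile-backed */
> >> >                 struct zswap_entry *zswap_entry; /* zswap-backed */
> >> >         };
> >> >         struct folio *swapcache; /* replaces the swap cache lookup */
> >> >         atomic_t swap_count;
> >> >         u32 id;                  /* maps back to the swp_entry_t in page tables */
> >> > };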
> >> >
> >> > Having the id in the swap_desc is convenient as we can directly map
> >> > the swap_desc to a swp_entry_t to place in the page tables, but I
> >> > don't think it's necessary. Without it, the struct size is 20 bytes,
> >> > so I think the extra 4 bytes are okay to use anyway if the slab
> >> > allocator only allocates multiples of 8 bytes.
> >> >
> >> > The idea here is to unify the swapcache and swap_count implementation
> >> > between different swap backends (swapfiles, zswap, etc), which would
> >> > create a better abstraction and reduce reinventing the wheel.
> >> >
> >> > We can reduce to only 8 bytes and only store the swap/zswap entry, but
> >> > we still need the swap cache anyway so might as well just store the
> >> > pointer in the struct and have a unified lookup-free swapcache, so
> >> > really 16 bytes is the minimum.
> >> >
> >> > If we stop at 16 bytes, then we need to handle swap count separately
> >> > in swapfiles and zswap. This is not the end of the world, but are the
> >> > 8 bytes worth this?
> >>
> >> If my understanding is correct, the current implementation also needs
> >> one swap cache pointer per swapped-out page. Even after calling
> >> __delete_from_swap_cache(), we store the "shadow" entry there. Although
> >> it's possible to implement shadow entry reclaiming like that for file
> >> cache shadow entry (workingset_shadow_shrinker), we haven't done that
> >> yet. And, it appears that we can live with that. So, in current
> >> implementation, for each swapped out page, we use 9 bytes. If so, the
> >> memory usage ratio is 24 / 9 = 2.667, still not trivial, but not as
> >> horrible as 24 / 1 = 24.
> >
> > Unfortunately it's a little bit more. 24 is the extra overhead.
> >
> > Today we have an xarray entry for each swapped out page, that either
> > has the swapcache pointer or the shadow entry.
> >
> > With this implementation, we have an xarray entry for each swapped out
> > page, that has a pointer to the swap_desc.
> >
> > Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.
>
> OK. I see. We can only hold 8 bytes for each xarray entry. To save
> memory usage, we can allocate multiple swap_desc (e.g., 16) for each
> xarray entry. Then the memory usage of xarray becomes 1/N.
>
> > For rotating disks, this might be even higher (8 + 32) / (8 + 1) = 4.444
> >
> > This is because we need to maintain a reverse mapping between
> > swp_entry_t and the swap_desc to use for cluster readahead. I am
> > assuming we can limit cluster readahead for rotating disks only.
>
> If the reverse mapping cannot be avoided in enough situations, it's
> better to keep only swap_entry in swap_desc, and create another xarray
> indexed by swap_entry to store swap_cache, swap_count, etc.
My current idea is to have one xarray that stores the swap_descs
(which include swap_entry, swapcache, swap_count, etc.), and, only for
rotating disks, an additional xarray that maps swap_entry ->
swap_desc for cluster readahead, assuming we can eliminate all other
situations requiring a reverse mapping.

I am not sure how having separate xarrays helps. If we have one xarray,
we might as well save the other lookups and put everything in swap_desc.
In fact, this should improve on today's locking, as swapcache /
swap_count operations can be lockless or very lightly contended.
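To illustrate the single-lookup idea, here is a minimal user-space sketch, not kernel code. Plain arrays stand in for the xarrays, and every name in it (id_to_desc, slot_to_id, desc_install(), desc_move_to_swapfile(), and so on) is hypothetical. It only shows that moving a page from zswap to a swapfile changes the descriptor, while the id stored in the page tables stays the same.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IDS   64	/* toy capacity; the real structure would be an xarray */
#define MAX_SLOTS 64	/* toy swapfile size */

enum swap_backend { BACKEND_SWAPFILE, BACKEND_ZSWAP };

/* Simplified descriptor: an explicit tag replaces the "one bit" in the union. */
struct swap_desc {
	enum swap_backend backend;
	unsigned long slot;	/* swapfile offset, or a toy zswap handle */
	void *swapcache;	/* would be struct folio * in the kernel */
	int swap_count;
};

/* id -> swap_desc: stand-in for the single xarray. Id 0 is unused. */
static struct swap_desc *id_to_desc[MAX_IDS];
/* swapfile slot -> id: stand-in for the optional reverse xarray. 0 = none. */
static uint32_t slot_to_id[MAX_SLOTS];

static inline struct swap_desc *desc_lookup(uint32_t id)
{
	return (id > 0 && id < MAX_IDS) ? id_to_desc[id] : NULL;
}

static inline int desc_install(uint32_t id, struct swap_desc *desc)
{
	assert(desc != NULL);
	if (id == 0 || id >= MAX_IDS || id_to_desc[id])
		return -1;
	id_to_desc[id] = desc;
	/* Only swapfile-backed entries need the reverse mapping. */
	if (desc->backend == BACKEND_SWAPFILE && desc->slot < MAX_SLOTS)
		slot_to_id[desc->slot] = id;
	return 0;
}

/* Writeback from zswap to a swapfile: the id seen by page tables is untouched. */
static inline void desc_move_to_swapfile(uint32_t id, unsigned long slot)
{
	struct swap_desc *desc = desc_lookup(id);

	if (!desc || slot >= MAX_SLOTS)
		return;
	desc->backend = BACKEND_SWAPFILE;
	desc->slot = slot;
	slot_to_id[slot] = id;
}

/* Reverse lookup, as cluster readahead would need it. Returns 0 if unmapped. */
static inline uint32_t slot_lookup_id(unsigned long slot)
{
	return slot < MAX_SLOTS ? slot_to_id[slot] : 0;
}
```

In this toy model the reverse array is populated only for swapfile-backed entries, which mirrors the intent of keeping the extra mapping only where readahead needs it.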
If the point is to store the swap_desc directly inside the xarray to
save 8 bytes, I am concerned that having multiple xarrays for
swapcache, swap_count, etc. will use more than the 8 bytes saved.
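To put numbers on the overhead figures quoted above, here is a rough user-space sketch, again not kernel code. The typedefs are stand-ins for the kernel types, the helper names overhead_ratio() and overhead_fraction() are made up for illustration, and the 24-byte size assumes a typical 64-bit (LP64) build with 4 KiB pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for kernel types; illustrative only. */
typedef struct { unsigned long val; } swp_entry_t;
typedef struct { int counter; } atomic_t;
struct zswap_entry;	/* opaque here */
struct folio;		/* opaque here */

struct swap_desc {
	union {	/* one bit would distinguish the two in the real design */
		swp_entry_t swap_entry;
		struct zswap_entry *zswap_entry;
	};
	struct folio *swapcache;
	atomic_t swap_count;
	uint32_t id;
};

/*
 * Ratio of per-page metadata with the indirection to metadata today:
 * (xarray slot + new bytes) / (xarray slot + old bytes).
 */
static inline double overhead_ratio(size_t xa_slot, size_t new_bytes,
				    size_t old_bytes)
{
	assert(xa_slot + old_bytes > 0);
	return (double)(xa_slot + new_bytes) / (double)(xa_slot + old_bytes);
}

/* Metadata bytes per swapped-out page as a fraction of the page size. */
static inline double overhead_fraction(size_t bytes_per_page, size_t page_size)
{
	assert(page_size > 0);
	return (double)bytes_per_page / (double)page_size;
}
```

Under those assumptions, overhead_ratio(8, 24, 1) comes out at roughly 3.56 and overhead_ratio(8, 32, 1) at roughly 4.44, while overhead_fraction(24, 4096) is about 0.6% and overhead_fraction(32, 4096) about 0.8%.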
>
>
> >>
> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
> >> > continuation pages. If we do, it may end up being more. We also
> >> > allocate continuation in full 4k pages, so even if one swap_map
> >> > element in a page requires continuation, we will allocate an entire
> >> > page. What I am trying to say is that to get an actual comparison you
> >> > need to also factor in the swap utilization and the rate of usage of
> >> > swap continuation. I don't know how to come up with a formula for this
> >> > tbh.
> >> >
> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> >> > overhead for people not using zswap, but it is not very awful.
> >> >
> >> >>
> >> >> It seems what you really need is one bit of information to indicate
> >> >> this page is backed by zswap. Then you can have a separate pointer
> >> >> for the zswap entry.
> >> >
> >> > If you use one bit in swp_entry_t (or one of the available swap types)
> >> > to indicate whether the page is backed with a swapfile or zswap it
> >> > doesn't really work. We lose the indirection layer. How do we move the
> >> > page from zswap to swapfile? We need to go update the page tables and
> >> > the shmem page cache, similar to swapoff.
> >> >
> >> > Instead, if we store a key in swp_entry_t and use it to look up
> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
> >> > the swap_desc does. It just goes the extra mile of unifying the
> >> > swapcache as well and storing it directly in the swap_desc instead of
> >> > storing it in another lookup structure.
> >>
> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> >> swap_entry in swap_desc. The added indirection appears to be another
> >> level of page table with 1 entry. Then, we may use the similar method
> >> as supporting system with 2 level and 3 level page tables, like the code
> >> in include/asm-generic/pgtable-nopmd.h. But I haven't thought about
> >> this deeply.
> >
> > Can you expand further on this idea? I am not sure I fully understand.
>
> OK. The goal is to avoid the overhead if indirection isn't enabled via
> kconfig.
>
> If indirection isn't enabled, store swap_entry in PTE directly.
> Otherwise, store index of swap_desc in PTE. Different functions (e.g.,
> to get/set swap_entry in PTE) are implemented based on kconfig.
I thought about this. The problem is that we would end up with multiple
implementations of multiple things. For example, without the
indirection layer, swap_count lives in the swap_map (with continuation
logic); with the indirection layer, it lives in the swap_desc (or
somewhere else). The same goes for the swapcache: even if we keep the
swapcache in an xarray and not inside swap_desc, it would be indexed by
swap_entry if the indirection is disabled, and by swap_desc (or
similar) if the indirection is enabled. I think maintaining separate
implementations for the enabled and disabled cases would add too much
complexity.

WDYT?
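A toy sketch of the duplication I am worried about, in user-space C. CONFIG_SWAP_INDIRECTION is a hypothetical option name, and the arrays and helpers are made up for illustration; this is not a proposal for the actual code layout.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 16	/* toy size */

/* Path 1: no indirection. The count lives in a per-swapfile byte array. */
static unsigned char swap_map[NR_SLOTS];

static inline int swap_count_direct(unsigned long offset)
{
	assert(offset < NR_SLOTS);
	return swap_map[offset];
}

/* Path 2: indirection. The count lives in the descriptor. */
struct swap_desc {
	int swap_count;
};

static struct swap_desc *descs[NR_SLOTS];

static inline int swap_count_indirect(unsigned long id)
{
	assert(id < NR_SLOTS);
	return descs[id] ? descs[id]->swap_count : 0;
}

/*
 * Every caller now depends on which key type the build uses.
 * CONFIG_SWAP_INDIRECTION is a hypothetical option name.
 */
#ifdef CONFIG_SWAP_INDIRECTION
#define swap_count_get(key) swap_count_indirect(key)
#else
#define swap_count_get(key) swap_count_direct(key)
#endif
```

The same split would have to be repeated for the swapcache and for anything else keyed by the swap entry, which is where the complexity adds up.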
>
>
> >> >>
> >> >> Depending on how much you are going to reuse the swap cache, you might
> >> >> need to have something like a swap_info_struct to keep the locks happy.
> >> >
> >> > My current intention is to reimplement the swapcache completely as a
> >> > pointer in struct swap_desc. This would eliminate this need and a lot
> >> > of the locking we do today if I get things right.
> >> >
> >> >>
> >> >> > Another potential concern is readahead. With this design, we have no
> >> >>
> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> >> use some modernization.
> >> >
> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> > swap_desc reverse mapping for readahead, and that we could store it
> >> > only for spinning disks, but I was wrong. We need it for other things
> >> > as well today: swapoff, and reclaiming swap slots used only by the
> >> > swapcache while searching for an empty swap slot. However, I
> >> > think both of these cases can be fixed (I can share more details if
> >> > you want). If everything goes well we should only need to maintain the
> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
> >> > spinning disks for readahead.
> >> >
> >> >>
> >> >> Looking forward to your discussion.
>
> Per my understanding, the indirection is to make it easy to move
> (swapped) pages among swap devices based on hot/cold. This is similar
> to the goal of memory tiering. It appears that we can extend the
> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> Is it possible for zswap to be faster than some slow memory media?
I agree with Chris that this may require a much larger overhaul. A slow
memory tier is still addressable memory, whereas swap/zswap requires a
page fault to read the pages. I think (at least for now) that is a
fundamental difference. We want reclaim to eventually treat slow
memory & swap as just different tiers, with different characteristics,
in which to place cold memory, but otherwise I think the swapping
implementation itself is very different. Am I missing something?
>
>
> Best Regards,
> Huang, Ying