Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap

From: "Huang, Ying" <ying.huang@intel.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Chris Li <chrisl@kernel.org>,
	 lsf-pc@lists.linux-foundation.org,
	Johannes Weiner <hannes@cmpxchg.org>,
	 Linux-MM <linux-mm@kvack.org>, Michal Hocko <mhocko@kernel.org>,
	 Shakeel Butt <shakeelb@google.com>,
	David Rientjes <rientjes@google.com>,
	 Hugh Dickins <hughd@google.com>,
	Seth Jennings <sjenning@redhat.com>,
	 Dan Streetman <ddstreet@ieee.org>,
	Vitaly Wool <vitaly.wool@konsulko.com>,
	 Yang Shi <shy828301@gmail.com>, Peter Xu <peterx@redhat.com>,
	 Minchan Kim <minchan@kernel.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>,
	Michal Hocko <mhocko@suse.com>, Wei Xu <weixugc@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
Date: Fri, 10 Mar 2023 11:06:37 +0800
Message-ID: <87y1o571aa.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <CAJD7tkYj-=-tYyfqk_t2c6WMtcPLHYc4teRNtE2H8G8igEGrpA@mail.gmail.com> (Yosry Ahmed's message of "Thu, 9 Mar 2023 12:19:03 -0800")

Yosry Ahmed <yosryahmed@google.com> writes:

> On Thu, Mar 9, 2023 at 4:49 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
>> >>
>> >> Hi Yosry,
>> >>
>> >> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
>> >> > Hello everyone,
>> >> >
>> >> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
>> >> > 2023 about swap & zswap (hope I am not too late).
>> >>
>> >> I am very interested in participating in this discussion as well.
>> >
>> > That's great to hear!
>> >
>> >>
>> >> > ==================== Objective ====================
>> >> > Enabling the use of zswap without a backing swapfile, which makes
>> >> > zswap useful for a wider variety of use cases. Also, when zswap is
>> >> > used with a swapfile, the pages in zswap do not use up space in the
>> >> > swapfile, so the overall swapping capacity increases.
>> >>
>> >> Agree.
>> >>
>> >> >
>> >> > ==================== Idea ====================
>> >> > Introduce a data structure, which I currently call a swap_desc, as an
>> >> > abstraction layer between swapping implementation and the rest of MM
>> >> > code. Page tables & page caches would store a swap id (encoded as a
>> >> > swp_entry_t) instead of directly storing the swap entry associated
>> >> > with the swapfile. This swap id maps to a struct swap_desc, which acts
>> >>
>> >> Can you provide a bit more detail? I am curious how this swap id
>> >> maps into the swap_desc? Is the swp_entry_t cast into "struct
>> >> swap_desc*" or going through some lookup table/tree?
>> >
>> > swap id would be an index in a radix tree (aka xarray), which contains
>> > a pointer to the swap_desc struct. This lookup should be free with
>> > this design as we also use swap_desc to directly store the swap cache
>> > pointer, so this lookup essentially replaces the swap cache lookup.
>> >
>> >>
>> >> > as our abstraction layer. All MM code not concerned with swapping
>> >> > details would operate in terms of swap descs. The swap_desc can point
>> >> > to either a normal swap entry (associated with a swapfile) or a zswap
>> >> > entry. It can also include all non-backend specific operations, such
>> >> > as the swapcache (which would be a simple pointer in swap_desc), swap
>> >>
>> >> Does the zswap entry still use the swap slot cache and swap_info_struct?
>> >
>> > In this design no, it shouldn't.
>> >
>> >>
>> >> > This work enables using zswap without a backing swapfile and increases
>> >> > the swap capacity when zswap is used with a swapfile. It also creates
>> >> > a separation that allows us to skip code paths that don't make sense
>> >> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
>> >> > which might result in better performance (less lookups, less lock
>> >> > contention).
>> >> >
>> >> > The abstraction layer also opens the door for multiple cleanups (e.g.
>> >> > removing swapper address spaces, removing swap count continuation
>> >> > code, etc). Another nice cleanup that this work enables would be
>> >> > separating the overloaded swp_entry_t into two distinct types: one for
>> >> > things that are stored in page tables / caches, and one for actual swap
>> >> > entries. In the future, we can potentially further optimize how we use
>> >> > the bits in the page tables instead of sticking everything into the
>> >> > current type/offset format.
>> >>
>> >> Looking forward to seeing more details in the upcoming discussion.
>> >> >
>> >> > ==================== Cost ====================
>> >> > The obvious downside of this is added memory overhead, specifically
>> >> > for users that use swapfiles without zswap. Instead of paying one byte
>> >> > (swap_map) for every potential page in the swapfile (+ swap count
>> >> > continuation), we pay the size of the swap_desc for every page that is
>> >> > actually in the swapfile, which I am estimating can be roughly around
>> >> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
>> >> > scales with pages actually swapped out. For zswap users, it should be
>> >>
>> >> Is there a way to avoid turning 1 byte into 24 byte per swapped
>> >> pages? For the users that use swap but no zswap, this is pure overhead.
>> >
>> > That's what I could think of at this point. My idea was something like this:
>> >
>> > struct swap_desc {
>> >     union { /* Use one bit to distinguish them */
>> >         swp_entry_t swap_entry;
>> >         struct zswap_entry *zswap_entry;
>> >     };
>> >     struct folio *swapcache;
>> >     atomic_t swap_count;
>> >     u32 id;
>> > };
>> >
>> > Having the id in the swap_desc is convenient as we can directly map
>> > the swap_desc to a swp_entry_t to place in the page tables, but I
>> > don't think it's necessary. Without it, the struct size is 20 bytes,
>> > so I think the extra 4 bytes are okay to use anyway if the slab
>> > allocator only allocates multiples of 8 bytes.
>> >
>> > The idea here is to unify the swapcache and swap_count implementation
>> > between different swap backends (swapfiles, zswap, etc), which would
>> > create a better abstraction and reduce reinventing the wheel.
>> >
>> > We can reduce to only 8 bytes and only store the swap/zswap entry, but
>> > we still need the swap cache anyway so might as well just store the
>> > pointer in the struct and have a unified lookup-free swapcache, so
>> > really 16 bytes is the minimum.
>> >
>> > If we stop at 16 bytes, then we need to handle swap count separately
>> > in swapfiles and zswap. This is not the end of the world, but are the
>> > 8 bytes worth this?
>>
>> If my understanding is correct, the current implementation also needs
>> one swap cache pointer per swapped out page.  Even after calling
>> __delete_from_swap_cache(), we store the "shadow" entry there.  Although
>> it's possible to implement shadow entry reclaiming like that for file
>> cache shadow entries (workingset_shadow_shrinker), we haven't done that
>> yet.  And it appears that we can live with that.  So, in the current
>> implementation, we use 9 bytes for each swapped out page.  If so, the
>> memory usage ratio is 24 / 9 = 2.667, still not trivial, but not as
>> horrible as 24 / 1 = 24.
>
> Unfortunately it's a little bit more. 24 is the extra overhead.
>
> Today we have an xarray entry for each swapped out page, that either
> has the swapcache pointer or the shadow entry.
>
> With this implementation, we have an xarray entry for each swapped out
> page, that has a pointer to the swap_desc.
>
> Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.

OK, I see.  Each xarray entry can hold only 8 bytes.  To save memory,
we can allocate multiple swap_descs (e.g., 16) per xarray entry.  With
N swap_descs per entry, the xarray's memory usage drops to 1/N of what
it would be with one swap_desc per entry.
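
A minimal sketch of the batching arithmetic, in userspace C rather than
kernel code.  All names here are hypothetical, the 24-byte swap_desc size
is the estimate quoted in this thread, and the calculation assumes every
block of swap_descs is full:

```c
#include <assert.h>

#define XA_SLOT_BYTES    8UL   /* one pointer-sized xarray slot */
#define SWAP_DESC_BYTES  24UL  /* swap_desc size estimated in this thread */
#define DESCS_PER_BLOCK  16UL  /* the "e.g., 16" above; an assumption */

/* xarray index of the block that holds swap id `id`. */
static inline unsigned long toy_block_index(unsigned long id)
{
	return id / DESCS_PER_BLOCK;
}

/* Position of swap id `id` inside its block. */
static inline unsigned long toy_block_slot(unsigned long id)
{
	return id % DESCS_PER_BLOCK;
}

/*
 * Bytes per swapped-out page, scaled by 1000 to stay in integer math,
 * when n swap_descs share one xarray slot and every block is full.
 */
static inline unsigned long toy_millibytes_per_page(unsigned long n)
{
	assert(n > 0);
	return XA_SLOT_BYTES * 1000UL / n + SWAP_DESC_BYTES * 1000UL;
}
```

With n = 1 this gives 32 bytes per page (the 8 + 24 above); with n = 16
it gives 24.5 bytes.  Sparsely used blocks would waste their unused
swap_desc slots, so the real saving depends on how densely ids are
allocated.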

> For rotating disks, this might be even higher (8 + 32) / (8 + 1) = 4.444
>
> This is because we need to maintain a reverse mapping between
> swp_entry_t and the swap_desc to use for cluster readahead. I am
> assuming we can limit cluster readahead for rotating disks only.

If the reverse mapping cannot be avoided in enough situations, it would
be better to keep only swap_entry in swap_desc, and to create another
xarray, indexed by swap_entry, that stores swap_cache, swap_count, etc.
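
A rough userspace sketch of that split layout.  The struct and function
names are hypothetical, plain arrays stand in for the two xarrays, and
the desc_id field is one assumed way to carry the reverse mapping:

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the kernel's swp_entry_t; redefined so this compiles alone. */
typedef struct { unsigned long val; } swp_entry_t;
struct folio;                        /* opaque here */

/* Reduced descriptor: only the backend location. */
struct swap_desc_min {
	swp_entry_t swap_entry;
};

/* Per-slot state, looked up by swap_entry instead of by swap id. */
struct swap_slot_state {
	struct folio *swapcache;
	int swap_count;
	uint32_t desc_id;            /* reverse mapping: slot -> swap id */
};

#define TOY_SLOTS 8
static struct swap_desc_min toy_descs[TOY_SLOTS];    /* indexed by swap id */
static struct swap_slot_state toy_slots[TOY_SLOTS];  /* indexed by swap offset */

/* Bind swap id `id` to swap offset `offset` with one reference. */
static inline void toy_bind(uint32_t id, unsigned long offset)
{
	assert(id < TOY_SLOTS && offset < TOY_SLOTS);
	toy_descs[id].swap_entry.val = offset;
	toy_slots[offset].desc_id = id;
	toy_slots[offset].swap_count = 1;
}

/* Forward lookup: swap id -> per-slot state (two steps). */
static inline struct swap_slot_state *toy_lookup(uint32_t id)
{
	assert(id < TOY_SLOTS);
	return &toy_slots[toy_descs[id].swap_entry.val];
}

/* Reverse lookup: swap offset -> swap id, as cluster readahead would need. */
static inline uint32_t toy_reverse(unsigned long offset)
{
	assert(offset < TOY_SLOTS);
	return toy_slots[offset].desc_id;
}
```

The trade-off is that the forward path now takes two lookups, while the
reverse mapping comes for free from the second structure.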

>>
>> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
>> > O(swapped). Also, 1 byte is assuming we do not use the swap
>> > continuation pages. If we do, it may end up being more. We also
>> > allocate continuation in full 4k pages, so even if one swap_map
>> > element in a page requires continuation, we will allocate an entire
>> > page. What I am trying to say is that to get an actual comparison you
>> > need to also factor in the swap utilization and the rate of usage of
>> > swap continuation. I don't know how to come up with a formula for this
>> > tbh.
>> >
>> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> > overhead for people not using zswap, but it is not very awful.
>> >
>> >>
>> >> It seems what you really need is one bit of information to indicate
>> >> this page is backed by zswap. Then you can have a separate pointer
>> >> for the zswap entry.
>> >
>> > If you use one bit in swp_entry_t (or one of the available swap types)
>> > to indicate whether the page is backed with a swapfile or zswap it
>> > doesn't really work. We lose the indirection layer. How do we move the
>> > page from zswap to swapfile? We need to go update the page tables and
>> > the shmem page cache, similar to swapoff.
>> >
>> > Instead, if we store some other key in swp_entry_t and use it to look up
>> > the swp_entry_t or zswap_entry pointer then that's essentially what
>> > the swap_desc does. It just goes the extra mile of unifying the
>> > swapcache as well and storing it directly in the swap_desc instead of
>> > storing it in another lookup structure.
>>
>> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> swap_entry in swap_desc, the added indirection appears to be another
>> level of page table with 1 entry.  Then we may use a method similar to
>> the one used to support systems with 2-level and 3-level page tables,
>> like the code in include/asm-generic/pgtable-nopmd.h.  But I haven't
>> thought about this deeply.
>
> Can you expand further on this idea? I am not sure I fully understand.

OK.  The goal is to avoid the overhead when indirection isn't enabled
via Kconfig.

If indirection isn't enabled, store the swap_entry in the PTE directly.
Otherwise, store the index of the swap_desc in the PTE.  The accessor
functions (e.g., to get/set the swap_entry in the PTE) are implemented
differently depending on the Kconfig option.
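
A minimal sketch of such a compile-time switch, in the spirit of
pgtable-nopmd.h.  CONFIG_SWAP_INDIRECTION and the toy_* names are
hypothetical, a plain array stands in for the swap_desc lookup, and the
PTE payload is modelled as an unsigned long:

```c
#include <assert.h>

/* Same shape as the kernel's swp_entry_t; redefined so this compiles alone. */
typedef struct { unsigned long val; } swp_entry_t;

#ifdef CONFIG_SWAP_INDIRECTION       /* hypothetical Kconfig symbol */

#define TOY_DESCS 16
static swp_entry_t toy_desc_table[TOY_DESCS];   /* stand-in for the xarray */

/* The PTE holds a swap_desc index; resolve it to the real swap entry. */
static inline swp_entry_t toy_pte_to_swap_entry(unsigned long pte_val)
{
	assert(pte_val < TOY_DESCS);
	return toy_desc_table[pte_val];
}

#else

/* No indirection: the PTE payload already is the swap entry. */
static inline swp_entry_t toy_pte_to_swap_entry(unsigned long pte_val)
{
	return (swp_entry_t){ .val = pte_val };
}

#endif
```

Callers use the same function either way, so the non-indirect build
pays nothing for the extra level.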

>> >>
>> >> Depending on how much you are going to reuse the swap cache, you might
>> >> need to have something like a swap_info_struct to keep the locks happy.
>> >
>> > My current intention is to reimplement the swapcache completely as a
>> > pointer in struct swap_desc. This would eliminate this need and a lot
>> > of the locking we do today if I get things right.
>> >
>> >>
>> >> > Another potential concern is readahead. With this design, we have no
>> >>
>> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
>> >> use some modernization.
>> >
>> > Yeah, I initially thought we would only need the swp_entry_t ->
>> > swap_desc reverse mapping for readahead, and that we can only store
>> > that for spinning disks, but I was wrong. We need it for other things as
>> > well today: swapoff, when trying to find an empty swap slot and we
>> > start trying to free swap slots used only by the swapcache. However, I
>> > think both of these cases can be fixed (I can share more details if
>> > you want). If everything goes well we should only need to maintain the
>> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> > spinning disks for readahead.
>> >
>> >>
>> >> Looking forward to your discussion.

Per my understanding, the indirection is meant to make it easy to move
(swapped) pages among swap devices based on how hot or cold they are.
This is similar to the goal of memory tiering.  Could we extend the
memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
Is it possible for zswap to be faster than some slow memory media?

Best Regards,
Huang, Ying