All of lore.kernel.org
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
@ 2023-02-18 22:38 Yosry Ahmed
  2023-02-19  4:31 ` Matthew Wilcox
                   ` (5 more replies)
  0 siblings, 6 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-02-18 22:38 UTC (permalink / raw)
  To: lsf-pc, Johannes Weiner
  Cc: Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Minchan Kim, Andrew Morton

Hello everyone,

I would like to propose a topic for the upcoming LSF/MM/BPF in May
2023 about swap & zswap (hope I am not too late).

==================== Intro ====================
Currently, using zswap is dependent on swapfiles in an unnecessary
way. To use zswap, you need a swapfile configured (even if the space
will not be used) and zswap is restricted by its size. When pages
reside in zswap, the corresponding swap entry in the swapfile cannot
be used, and is essentially wasted. We also go through unnecessary
code paths when using zswap, such as finding and allocating a swap
entry on the swapout path, or readahead in the swapin path. I am
proposing a swapping abstraction layer that would allow us to remove
zswap's dependency on swapfiles. This can be done by introducing a
data structure between the actual swapping implementation (swapfiles,
zswap) and the rest of the MM code.

==================== Objective ====================
Enabling the use of zswap without a backing swapfile, which makes
zswap useful for a wider variety of use cases. Also, when zswap is
used with a swapfile, the pages in zswap do not use up space in the
swapfile, so the overall swapping capacity increases.

==================== Idea ====================
Introduce a data structure, which I currently call a swap_desc, as an
abstraction layer between swapping implementation and the rest of MM
code. Page tables & page caches would store a swap id (encoded as a
swp_entry_t) instead of directly storing the swap entry associated
with the swapfile. This swap id maps to a struct swap_desc, which acts
as our abstraction layer. All MM code not concerned with swapping
details would operate in terms of swap descs. The swap_desc can point
to either a normal swap entry (associated with a swapfile) or a zswap
entry. It can also include all non-backend specific operations, such
as the swapcache (which would be a simple pointer in swap_desc), swap
counting, etc. It creates a clear, nice abstraction layer between MM
code and the actual swapping implementation.

==================== Benefits ====================
This work enables using zswap without a backing swapfile and increases
the swap capacity when zswap is used with a swapfile. It also creates
a separation that allows us to skip code paths that don't make sense
in the zswap path (e.g. readahead). We get to drop zswap's rbtree
which might result in better performance (less lookups, less lock
contention).

The abstraction layer also opens the door for multiple cleanups (e.g.
removing swapper address spaces, removing swap count continuation
code, etc). Another nice cleanup that this work enables would be
separating the overloaded swp_entry_t into two distinct types: one for
things that are stored in page tables / caches, and for actual swap
entries. In the future, we can potentially further optimize how we use
the bits in the page tables instead of sticking everything into the
current type/offset format.

Another potential win here can be swapoff, which can be more practical
by directly scanning all swap_desc's instead of going through page
tables and shmem page caches.

Overall zswap becomes more accessible and available to a wider range
of use cases.

==================== Cost ====================
The obvious downside of this is added memory overhead, specifically
for users that use swapfiles without zswap. Instead of paying one byte
(swap_map) for every potential page in the swapfile (+ swap count
continuation), we pay the size of the swap_desc for every page that is
actually in the swapfile, which I am estimating can be roughly around
24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
scales with pages actually swapped out. For zswap users, it should be
a win (or at least even) because we get to drop a lot of fields from
struct zswap_entry (e.g. rbtree, index, etc).

Another potential concern is readahead. With this design, we have no
way to get a swap_desc given a swap entry (type & offset). We would
need to maintain a reverse mapping, adding a little bit more overhead,
or search all swapped out pages instead :). A reverse mapping might
pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
memory).

==================== Bottom Line ====================
It would be nice to discuss the potential here and the tradeoffs. I
know that other folks using zswap (or interested in using it) may find
this very useful. I am sure I am missing some context on why things
are the way they are, and perhaps some obvious holes in my story.
Looking forward to discussing this with anyone interested :)

I think Johannes may be interested in attending this discussion, since
a lot of ideas here are inspired by discussions I had with him :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-18 22:38 [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap Yosry Ahmed
@ 2023-02-19  4:31 ` Matthew Wilcox
  2023-02-19  9:34   ` Yosry Ahmed
  2023-02-28 23:22   ` Chris Li
  2023-02-21 18:39 ` Yang Shi
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 105+ messages in thread
From: Matthew Wilcox @ 2023-02-19  4:31 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton,
	Huang Ying, NeilBrown

On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> Hello everyone,
> 
> I would like to propose a topic for the upcoming LSF/MM/BPF in May
> 2023 about swap & zswap (hope I am not too late).

Submissions are due March 1st, I believe, so not too late.

> ==================== Bottom Line ====================
> It would be nice to discuss the potential here and the tradeoffs. I
> know that other folks using zswap (or interested in using it) may find
> this very useful. I am sure I am missing some context on why things
> are the way they are, and perhaps some obvious holes in my story.
> Looking forward to discussing this with anyone interested :)
> 
> I think Johannes may be interested in attending this discussion, since
> a lot of ideas here are inspired by discussions I had with him :)

I think an overhaul of the swap code is long overdue.  I appreciate
you're very much focused on zswap, but there are many other problems.
For example, swap does not work on zoned devices.  Swap readahead is
generally physical (ie optimised for spinning discs) rather than logical
(more appropriate for SSDs).  Swap's management of free space is crude
compared to real filesystems.  The way that swap bypasses the filesystem
when writing to swap files is awful.  I haven't even started to look at
what changes need to be made to swap in order to swap out arbitrary-order
folios (instead of PMD-sized + PTE-sized).

I'm probably not a great person to participate in the design of a
replacement system.  I don't know nearly enough about anonymous memory.
I'd be sitting in the back shouting unhelpful things like, "Can't you
see an anon_vma is the exact same thing as an inode?"  and "Why don't
we steal the block allocation functions from XFS?"  and "Why do tmpfs
pages have to move to the swap cache; can't we just leave them in the
page cache and pass them to the swap code directly?"

Maybe Neil Brown or Huang Ying would be good participants, although
I don't recall seeing either of them at an LSFMM recently.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-19  4:31 ` Matthew Wilcox
@ 2023-02-19  9:34   ` Yosry Ahmed
  2023-02-28 23:22   ` Chris Li
  1 sibling, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-02-19  9:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton,
	Huang Ying, NeilBrown

On Sat, Feb 18, 2023 at 8:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > Hello everyone,
> >
> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > 2023 about swap & zswap (hope I am not too late).
>
> Submissions are due March 1st, I believe, so not too late.
>
> > ==================== Bottom Line ====================
> > It would be nice to discuss the potential here and the tradeoffs. I
> > know that other folks using zswap (or interested in using it) may find
> > this very useful. I am sure I am missing some context on why things
> > are the way they are, and perhaps some obvious holes in my story.
> > Looking forward to discussing this with anyone interested :)
> >
> > I think Johannes may be interested in attending this discussion, since
> > a lot of ideas here are inspired by discussions I had with him :)
>
> I think an overhaul of the swap code is long overdue.  I appreciate
> you're very much focused on zswap, but there are many other problems.

Fully agree. I spent more time than I care to admit just figuring out
the difference between all the functions that have "swap" and "free"
in their names :/

I cannot claim that I am trying to do that, like you said
I am focused on zswap, but we can discuss the direction that swap
should head in, and where zswap would fit in the picture. We can at
least make sure that this zswap work would be aligned with any future
plans for swap, so that we don't step on each other's toes.

> For example, swap does not work on zoned devices.  Swap readahead is
> generally physical (ie optimised for spinning discs) rather than logical
> (more appropriate for SSDs).  Swap's management of free space is crude

We have swap_vma_readahead() which should be on by default for anon
memory on non-rotating devices, but it's only for anon. shmem only
uses swap_cluster_readahead(), which I am not sure if it makes sense
for all cases, especially zswap.

> compared to real filesystems.  The way that swap bypasses the filesystem
> when writing to swap files is awful.  I haven't even started to look at
> what changes need to be made to swap in order to swap out arbitrary-order
> folios (instead of PMD-sized + PTE-sized).

I don't know a lot about file systems so I can't chip in here.

>
> I'm probably not a great person to participate in the design of a
> replacement system.  I don't know nearly enough about anonymous memory.

Any input would be helpful, I am sure you know more than I do :)

> I'd be sitting in the back shouting unhelpful things like, "Can't you
> see an anon_vma is the exact same thing as an inode?"  and "Why don't
> we steal the block allocation functions from XFS?"  and "Why do tmpfs
> pages have to move to the swap cache; can't we just leave them in the
> page cache and pass them to the swap code directly?"

For that last one at least, the proposed design makes the swap cache
much less similar to the page cache, so at least we can stop worrying
about whether we really need to use the swap cache for tmpfs ;)

>
> Maybe Neil Brown or Huang Ying would be good participants, although
> I don't recall seeing either of them at an LSFMM recently.

Looking forward to talking about this to everyone who's interested :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-18 22:38 [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap Yosry Ahmed
  2023-02-19  4:31 ` Matthew Wilcox
@ 2023-02-21 18:39 ` Yang Shi
  2023-02-21 18:56   ` Yosry Ahmed
  2023-02-28  4:54 ` Sergey Senozhatsky
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 105+ messages in thread
From: Yang Shi @ 2023-02-21 18:39 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton

Hi Yosry,

Thanks for proposing this topic. I was thinking about this before but
I didn't make too much progress due to some other distractions, and I
got a couple of follow up questions about your design. Please see the
inline comments below.


On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Hello everyone,
>
> I would like to propose a topic for the upcoming LSF/MM/BPF in May
> 2023 about swap & zswap (hope I am not too late).
>
> ==================== Intro ====================
> Currently, using zswap is dependent on swapfiles in an unnecessary
> way. To use zswap, you need a swapfile configured (even if the space
> will not be used) and zswap is restricted by its size. When pages
> reside in zswap, the corresponding swap entry in the swapfile cannot
> be used, and is essentially wasted. We also go through unnecessary
> code paths when using zswap, such as finding and allocating a swap
> entry on the swapout path, or readahead in the swapin path. I am
> proposing a swapping abstraction layer that would allow us to remove
> zswap's dependency on swapfiles. This can be done by introducing a
> data structure between the actual swapping implementation (swapfiles,
> zswap) and the rest of the MM code.
>
> ==================== Objective ====================
> Enabling the use of zswap without a backing swapfile, which makes
> zswap useful for a wider variety of use cases. Also, when zswap is
> used with a swapfile, the pages in zswap do not use up space in the
> swapfile, so the overall swapping capacity increases.
>
> ==================== Idea ====================
> Introduce a data structure, which I currently call a swap_desc, as an
> abstraction layer between swapping implementation and the rest of MM
> code. Page tables & page caches would store a swap id (encoded as a
> swp_entry_t) instead of directly storing the swap entry associated
> with the swapfile. This swap id maps to a struct swap_desc, which acts
> as our abstraction layer. All MM code not concerned with swapping
> details would operate in terms of swap descs. The swap_desc can point
> to either a normal swap entry (associated with a swapfile) or a zswap
> entry. It can also include all non-backend specific operations, such
> as the swapcache (which would be a simple pointer in swap_desc), swap
> counting, etc. It creates a clear, nice abstraction layer between MM
> code and the actual swapping implementation.

How will the swap_desc be allocated? Dynamically or preallocated? Is
it 1:1 mapped to the swap slots on swap devices (whatever it is
backed, for example, zswap, swap partition, swapfile, etc)?

>
> ==================== Benefits ====================
> This work enables using zswap without a backing swapfile and increases
> the swap capacity when zswap is used with a swapfile. It also creates
> a separation that allows us to skip code paths that don't make sense
> in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> which might result in better performance (less lookups, less lock
> contention).
>
> The abstraction layer also opens the door for multiple cleanups (e.g.
> removing swapper address spaces, removing swap count continuation
> code, etc). Another nice cleanup that this work enables would be
> separating the overloaded swp_entry_t into two distinct types: one for
> things that are stored in page tables / caches, and for actual swap
> entries. In the future, we can potentially further optimize how we use
> the bits in the page tables instead of sticking everything into the
> current type/offset format.
>
> Another potential win here can be swapoff, which can be more practical
> by directly scanning all swap_desc's instead of going through page
> tables and shmem page caches.
>
> Overall zswap becomes more accessible and available to a wider range
> of use cases.

How will you handle zswap writeback? Zswap may writeback to the backed
swap device IIUC. Assuming you have both zswap and swapfile, they are
separate devices with this design, right? If so, is the swapfile still
the writeback target of zswap? And if it is the writeback target, what
if swapfile is full?

Anyway I'm interested in attending the discussion for this topic.

>
> ==================== Cost ====================
> The obvious downside of this is added memory overhead, specifically
> for users that use swapfiles without zswap. Instead of paying one byte
> (swap_map) for every potential page in the swapfile (+ swap count
> continuation), we pay the size of the swap_desc for every page that is
> actually in the swapfile, which I am estimating can be roughly around
> 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> scales with pages actually swapped out. For zswap users, it should be
> a win (or at least even) because we get to drop a lot of fields from
> struct zswap_entry (e.g. rbtree, index, etc).
>
> Another potential concern is readahead. With this design, we have no
> way to get a swap_desc given a swap entry (type & offset). We would
> need to maintain a reverse mapping, adding a little bit more overhead,
> or search all swapped out pages instead :). A reverse mapping might
> pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> memory).
>
> ==================== Bottom Line ====================
> It would be nice to discuss the potential here and the tradeoffs. I
> know that other folks using zswap (or interested in using it) may find
> this very useful. I am sure I am missing some context on why things
> are the way they are, and perhaps some obvious holes in my story.
> Looking forward to discussing this with anyone interested :)
>
> I think Johannes may be interested in attending this discussion, since
> a lot of ideas here are inspired by discussions I had with him :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-21 18:39 ` Yang Shi
@ 2023-02-21 18:56   ` Yosry Ahmed
  2023-02-21 19:26     ` Yang Shi
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-02-21 18:56 UTC (permalink / raw)
  To: Yang Shi
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton

On Tue, Feb 21, 2023 at 10:40 AM Yang Shi <shy828301@gmail.com> wrote:
>
> Hi Yosry,
>
> Thanks for proposing this topic. I was thinking about this before but
> I didn't make too much progress due to some other distractions, and I
> got a couple of follow up questions about your design. Please see the
> inline comments below.

Great to see interested folks, thanks!

>
>
> On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Hello everyone,
> >
> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > 2023 about swap & zswap (hope I am not too late).
> >
> > ==================== Intro ====================
> > Currently, using zswap is dependent on swapfiles in an unnecessary
> > way. To use zswap, you need a swapfile configured (even if the space
> > will not be used) and zswap is restricted by its size. When pages
> > reside in zswap, the corresponding swap entry in the swapfile cannot
> > be used, and is essentially wasted. We also go through unnecessary
> > code paths when using zswap, such as finding and allocating a swap
> > entry on the swapout path, or readahead in the swapin path. I am
> > proposing a swapping abstraction layer that would allow us to remove
> > zswap's dependency on swapfiles. This can be done by introducing a
> > data structure between the actual swapping implementation (swapfiles,
> > zswap) and the rest of the MM code.
> >
> > ==================== Objective ====================
> > Enabling the use of zswap without a backing swapfile, which makes
> > zswap useful for a wider variety of use cases. Also, when zswap is
> > used with a swapfile, the pages in zswap do not use up space in the
> > swapfile, so the overall swapping capacity increases.
> >
> > ==================== Idea ====================
> > Introduce a data structure, which I currently call a swap_desc, as an
> > abstraction layer between swapping implementation and the rest of MM
> > code. Page tables & page caches would store a swap id (encoded as a
> > swp_entry_t) instead of directly storing the swap entry associated
> > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > as our abstraction layer. All MM code not concerned with swapping
> > details would operate in terms of swap descs. The swap_desc can point
> > to either a normal swap entry (associated with a swapfile) or a zswap
> > entry. It can also include all non-backend specific operations, such
> > as the swapcache (which would be a simple pointer in swap_desc), swap
> > counting, etc. It creates a clear, nice abstraction layer between MM
> > code and the actual swapping implementation.
>
> How will the swap_desc be allocated? Dynamically or preallocated? Is
> it 1:1 mapped to the swap slots on swap devices (whatever it is
> backed, for example, zswap, swap partition, swapfile, etc)?

I imagine swap_desc's would be dynamically allocated when we need to
swap something out. When allocated, a swap_desc would either point to
a zswap_entry (if available), or a swap slot otherwise. In this case,
it would be 1:1 mapped to swapped out pages, not the swap slots on
devices.

I know that it might not be ideal to make allocations on the reclaim
path (although it would be a small-ish slab allocation so we might be
able to get away with it), but otherwise we would have statically
allocated swap_desc's for all swap slots on a swap device, even unused
ones, which I imagine is too expensive. Also for things like zswap, it
doesn't really make sense to preallocate at all.

WDYT?

>
> >
> > ==================== Benefits ====================
> > This work enables using zswap without a backing swapfile and increases
> > the swap capacity when zswap is used with a swapfile. It also creates
> > a separation that allows us to skip code paths that don't make sense
> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > which might result in better performance (less lookups, less lock
> > contention).
> >
> > The abstraction layer also opens the door for multiple cleanups (e.g.
> > removing swapper address spaces, removing swap count continuation
> > code, etc). Another nice cleanup that this work enables would be
> > separating the overloaded swp_entry_t into two distinct types: one for
> > things that are stored in page tables / caches, and for actual swap
> > entries. In the future, we can potentially further optimize how we use
> > the bits in the page tables instead of sticking everything into the
> > current type/offset format.
> >
> > Another potential win here can be swapoff, which can be more practical
> > by directly scanning all swap_desc's instead of going through page
> > tables and shmem page caches.
> >
> > Overall zswap becomes more accessible and available to a wider range
> > of use cases.
>
> How will you handle zswap writeback? Zswap may writeback to the backed
> swap device IIUC. Assuming you have both zswap and swapfile, they are
> separate devices with this design, right? If so, is the swapfile still
> the writeback target of zswap? And if it is the writeback target, what
> if swapfile is full?

When we try to writeback from zswap, we try to allocate a swap slot in
the swapfile, and switch the swap_desc to point to that instead. The
process would be transparent to the rest of MM (page tables, page
cache, etc). If the swapfile is full, then there's really nothing we
can do, reclaim fails and we start OOMing. I imagine this is the same
behavior as today when swap is full, the difference would be that we
have to fill both zswap AND the swapfile to get to the OOMing point,
so an overall increased swapping capacity.

>
> Anyway I'm interested in attending the discussion for this topic.

Great! Looking forward to discuss this more!

>
> >
> > ==================== Cost ====================
> > The obvious downside of this is added memory overhead, specifically
> > for users that use swapfiles without zswap. Instead of paying one byte
> > (swap_map) for every potential page in the swapfile (+ swap count
> > continuation), we pay the size of the swap_desc for every page that is
> > actually in the swapfile, which I am estimating can be roughly around
> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > scales with pages actually swapped out. For zswap users, it should be
> > a win (or at least even) because we get to drop a lot of fields from
> > struct zswap_entry (e.g. rbtree, index, etc).
> >
> > Another potential concern is readahead. With this design, we have no
> > way to get a swap_desc given a swap entry (type & offset). We would
> > need to maintain a reverse mapping, adding a little bit more overhead,
> > or search all swapped out pages instead :). A reverse mapping might
> > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> > memory).
> >
> > ==================== Bottom Line ====================
> > It would be nice to discuss the potential here and the tradeoffs. I
> > know that other folks using zswap (or interested in using it) may find
> > this very useful. I am sure I am missing some context on why things
> > are the way they are, and perhaps some obvious holes in my story.
> > Looking forward to discussing this with anyone interested :)
> >
> > I think Johannes may be interested in attending this discussion, since
> > a lot of ideas here are inspired by discussions I had with him :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-21 18:56   ` Yosry Ahmed
@ 2023-02-21 19:26     ` Yang Shi
  2023-02-21 19:46       ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Yang Shi @ 2023-02-21 19:26 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton

On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Feb 21, 2023 at 10:40 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > Hi Yosry,
> >
> > Thanks for proposing this topic. I was thinking about this before but
> > I didn't make too much progress due to some other distractions, and I
> > got a couple of follow up questions about your design. Please see the
> > inline comments below.
>
> Great to see interested folks, thanks!
>
> >
> >
> > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > Hello everyone,
> > >
> > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > 2023 about swap & zswap (hope I am not too late).
> > >
> > > ==================== Intro ====================
> > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > way. To use zswap, you need a swapfile configured (even if the space
> > > will not be used) and zswap is restricted by its size. When pages
> > > reside in zswap, the corresponding swap entry in the swapfile cannot
> > > be used, and is essentially wasted. We also go through unnecessary
> > > code paths when using zswap, such as finding and allocating a swap
> > > entry on the swapout path, or readahead in the swapin path. I am
> > > proposing a swapping abstraction layer that would allow us to remove
> > > zswap's dependency on swapfiles. This can be done by introducing a
> > > data structure between the actual swapping implementation (swapfiles,
> > > zswap) and the rest of the MM code.
> > >
> > > ==================== Objective ====================
> > > Enabling the use of zswap without a backing swapfile, which makes
> > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > used with a swapfile, the pages in zswap do not use up space in the
> > > swapfile, so the overall swapping capacity increases.
> > >
> > > ==================== Idea ====================
> > > Introduce a data structure, which I currently call a swap_desc, as an
> > > abstraction layer between swapping implementation and the rest of MM
> > > code. Page tables & page caches would store a swap id (encoded as a
> > > swp_entry_t) instead of directly storing the swap entry associated
> > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > as our abstraction layer. All MM code not concerned with swapping
> > > details would operate in terms of swap descs. The swap_desc can point
> > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > entry. It can also include all non-backend specific operations, such
> > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > code and the actual swapping implementation.
> >
> > How will the swap_desc be allocated? Dynamically or preallocated? Is
> > it 1:1 mapped to the swap slots on swap devices (whatever it is
> > backed, for example, zswap, swap partition, swapfile, etc)?
>
> I imagine swap_desc's would be dynamically allocated when we need to
> swap something out. When allocated, a swap_desc would either point to
> a zswap_entry (if available), or a swap slot otherwise. In this case,
> it would be 1:1 mapped to swapped out pages, not the swap slots on
> devices.

It makes sense to be 1:1 mapped to swapped out pages if the swapfile
is used as the back of zswap.

>
> I know that it might not be ideal to make allocations on the reclaim
> path (although it would be a small-ish slab allocation so we might be
> able to get away with it), but otherwise we would have statically
> allocated swap_desc's for all swap slots on a swap device, even unused
> ones, which I imagine is too expensive. Also for things like zswap, it
> doesn't really make sense to preallocate at all.

Yeah, it is not perfect to allocate memory in the reclamation path. We
do have such cases, but the fewer the better IMHO.

>
> WDYT?
>
> >
> > >
> > > ==================== Benefits ====================
> > > This work enables using zswap without a backing swapfile and increases
> > > the swap capacity when zswap is used with a swapfile. It also creates
> > > a separation that allows us to skip code paths that don't make sense
> > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > which might result in better performance (less lookups, less lock
> > > contention).
> > >
> > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > removing swapper address spaces, removing swap count continuation
> > > code, etc). Another nice cleanup that this work enables would be
> > > separating the overloaded swp_entry_t into two distinct types: one for
> > > things that are stored in page tables / caches, and for actual swap
> > > entries. In the future, we can potentially further optimize how we use
> > > the bits in the page tables instead of sticking everything into the
> > > current type/offset format.
> > >
> > > Another potential win here can be swapoff, which can be more practical
> > > by directly scanning all swap_desc's instead of going through page
> > > tables and shmem page caches.
> > >
> > > Overall zswap becomes more accessible and available to a wider range
> > > of use cases.
> >
> > How will you handle zswap writeback? Zswap may writeback to the backed
> > swap device IIUC. Assuming you have both zswap and swapfile, they are
> > separate devices with this design, right? If so, is the swapfile still
> > the writeback target of zswap? And if it is the writeback target, what
> > if swapfile is full?
>
> When we try to writeback from zswap, we try to allocate a swap slot in
> the swapfile, and switch the swap_desc to point to that instead. The
> process would be transparent to the rest of MM (page tables, page
> cache, etc). If the swapfile is full, then there's really nothing we
> can do, reclaim fails and we start OOMing. I imagine this is the same
> behavior as today when swap is full, the difference would be that we
> have to fill both zswap AND the swapfile to get to the OOMing point,
> so an overall increased swapping capacity.

When zswap is full, but swapfile is not yet, will the swap try to
writeback zswap to swapfile to make more room for zswap or just swap
out to swapfile directly?

>
> >
> > Anyway I'm interested in attending the discussion for this topic.
>
> Great! Looking forward to discuss this more!
>
> >
> > >
> > > ==================== Cost ====================
> > > The obvious downside of this is added memory overhead, specifically
> > > for users that use swapfiles without zswap. Instead of paying one byte
> > > (swap_map) for every potential page in the swapfile (+ swap count
> > > continuation), we pay the size of the swap_desc for every page that is
> > > actually in the swapfile, which I am estimating can be roughly around
> > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > scales with pages actually swapped out. For zswap users, it should be
> > > a win (or at least even) because we get to drop a lot of fields from
> > > struct zswap_entry (e.g. rbtree, index, etc).
> > >
> > > Another potential concern is readahead. With this design, we have no
> > > way to get a swap_desc given a swap entry (type & offset). We would
> > > need to maintain a reverse mapping, adding a little bit more overhead,
> > > or search all swapped out pages instead :). A reverse mapping might
> > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> > > memory).
> > >
> > > ==================== Bottom Line ====================
> > > It would be nice to discuss the potential here and the tradeoffs. I
> > > know that other folks using zswap (or interested in using it) may find
> > > this very useful. I am sure I am missing some context on why things
> > > are the way they are, and perhaps some obvious holes in my story.
> > > Looking forward to discussing this with anyone interested :)
> > >
> > > I think Johannes may be interested in attending this discussion, since
> > > a lot of ideas here are inspired by discussions I had with him :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-21 19:26     ` Yang Shi
@ 2023-02-21 19:46       ` Yosry Ahmed
  2023-02-21 23:34         ` Yang Shi
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-02-21 19:46 UTC (permalink / raw)
  To: Yang Shi
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton

On Tue, Feb 21, 2023 at 11:26 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > Hi Yosry,
> > >
> > > Thanks for proposing this topic. I was thinking about this before but
> > > I didn't make too much progress due to some other distractions, and I
> > > got a couple of follow up questions about your design. Please see the
> > > inline comments below.
> >
> > Great to see interested folks, thanks!
> >
> > >
> > >
> > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > Hello everyone,
> > > >
> > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > 2023 about swap & zswap (hope I am not too late).
> > > >
> > > > ==================== Intro ====================
> > > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > > way. To use zswap, you need a swapfile configured (even if the space
> > > > will not be used) and zswap is restricted by its size. When pages
> > > > reside in zswap, the corresponding swap entry in the swapfile cannot
> > > > be used, and is essentially wasted. We also go through unnecessary
> > > > code paths when using zswap, such as finding and allocating a swap
> > > > entry on the swapout path, or readahead in the swapin path. I am
> > > > proposing a swapping abstraction layer that would allow us to remove
> > > > zswap's dependency on swapfiles. This can be done by introducing a
> > > > data structure between the actual swapping implementation (swapfiles,
> > > > zswap) and the rest of the MM code.
> > > >
> > > > ==================== Objective ====================
> > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > swapfile, so the overall swapping capacity increases.
> > > >
> > > > ==================== Idea ====================
> > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > abstraction layer between swapping implementation and the rest of MM
> > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > as our abstraction layer. All MM code not concerned with swapping
> > > > details would operate in terms of swap descs. The swap_desc can point
> > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > entry. It can also include all non-backend specific operations, such
> > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > code and the actual swapping implementation.
> > >
> > > How will the swap_desc be allocated? Dynamically or preallocated? Is
> > > it 1:1 mapped to the swap slots on swap devices (whatever it is
> > > backed, for example, zswap, swap partition, swapfile, etc)?
> >
> > I imagine swap_desc's would be dynamically allocated when we need to
> > swap something out. When allocated, a swap_desc would either point to
> > a zswap_entry (if available), or a swap slot otherwise. In this case,
> > it would be 1:1 mapped to swapped out pages, not the swap slots on
> > devices.
>
> It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> is used as the back of zswap.
>
> >
> > I know that it might not be ideal to make allocations on the reclaim
> > path (although it would be a small-ish slab allocation so we might be
> > able to get away with it), but otherwise we would have statically
> > allocated swap_desc's for all swap slots on a swap device, even unused
> > ones, which I imagine is too expensive. Also for things like zswap, it
> > doesn't really make sense to preallocate at all.
>
> Yeah, it is not perfect to allocate memory in the reclamation path. We
> do have such cases, but the fewer the better IMHO.

Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
slab cache, idk if that makes sense, or if there is a way to tell slab
to proactively refill a cache.

I am open to suggestions here. I don't think we should/can preallocate
the swap_desc's, and we cannot completely eliminate the allocations in
the reclaim path. We can only try to minimize them through caching,
etc. Right?

>
> >
> > WDYT?
> >
> > >
> > > >
> > > > ==================== Benefits ====================
> > > > This work enables using zswap without a backing swapfile and increases
> > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > a separation that allows us to skip code paths that don't make sense
> > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > which might result in better performance (less lookups, less lock
> > > > contention).
> > > >
> > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > removing swapper address spaces, removing swap count continuation
> > > > code, etc). Another nice cleanup that this work enables would be
> > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > things that are stored in page tables / caches, and for actual swap
> > > > entries. In the future, we can potentially further optimize how we use
> > > > the bits in the page tables instead of sticking everything into the
> > > > current type/offset format.
> > > >
> > > > Another potential win here can be swapoff, which can be more practical
> > > > by directly scanning all swap_desc's instead of going through page
> > > > tables and shmem page caches.
> > > >
> > > > Overall zswap becomes more accessible and available to a wider range
> > > > of use cases.
> > >
> > > How will you handle zswap writeback? Zswap may writeback to the backed
> > > swap device IIUC. Assuming you have both zswap and swapfile, they are
> > > separate devices with this design, right? If so, is the swapfile still
> > > the writeback target of zswap? And if it is the writeback target, what
> > > if swapfile is full?
> >
> > When we try to writeback from zswap, we try to allocate a swap slot in
> > the swapfile, and switch the swap_desc to point to that instead. The
> > process would be transparent to the rest of MM (page tables, page
> > cache, etc). If the swapfile is full, then there's really nothing we
> > can do, reclaim fails and we start OOMing. I imagine this is the same
> > behavior as today when swap is full, the difference would be that we
> > have to fill both zswap AND the swapfile to get to the OOMing point,
> > so an overall increased swapping capacity.
>
> When zswap is full, but swapfile is not yet, will the swap try to
> writeback zswap to swapfile to make more room for zswap or just swap
> out to swapfile directly?
>

The current behavior is that we swap to swapfile directly in this
case, which is far from ideal as we break LRU ordering by skipping
zswap. I believe this should be addressed, but not as part of this
effort. The work to make zswap respect the LRU ordering by writing
back from zswap to make room can be done orthogonal to this effort. I
believe Johannes was looking into this at some point.

> >
> > >
> > > Anyway I'm interested in attending the discussion for this topic.
> >
> > Great! Looking forward to discuss this more!
> >
> > >
> > > >
> > > > ==================== Cost ====================
> > > > The obvious downside of this is added memory overhead, specifically
> > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > continuation), we pay the size of the swap_desc for every page that is
> > > > actually in the swapfile, which I am estimating can be roughly around
> > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > scales with pages actually swapped out. For zswap users, it should be
> > > > a win (or at least even) because we get to drop a lot of fields from
> > > > struct zswap_entry (e.g. rbtree, index, etc).
> > > >
> > > > Another potential concern is readahead. With this design, we have no
> > > > way to get a swap_desc given a swap entry (type & offset). We would
> > > > need to maintain a reverse mapping, adding a little bit more overhead,
> > > > or search all swapped out pages instead :). A reverse mapping might
> > > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> > > > memory).
> > > >
> > > > ==================== Bottom Line ====================
> > > > It would be nice to discuss the potential here and the tradeoffs. I
> > > > know that other folks using zswap (or interested in using it) may find
> > > > this very useful. I am sure I am missing some context on why things
> > > > are the way they are, and perhaps some obvious holes in my story.
> > > > Looking forward to discussing this with anyone interested :)
> > > >
> > > > I think Johannes may be interested in attending this discussion, since
> > > > a lot of ideas here are inspired by discussions I had with him :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-21 19:46       ` Yosry Ahmed
@ 2023-02-21 23:34         ` Yang Shi
  2023-02-21 23:38           ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Yang Shi @ 2023-02-21 23:34 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton

On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Feb 21, 2023 at 11:26 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > Hi Yosry,
> > > >
> > > > Thanks for proposing this topic. I was thinking about this before but
> > > > I didn't make too much progress due to some other distractions, and I
> > > > got a couple of follow up questions about your design. Please see the
> > > > inline comments below.
> > >
> > > Great to see interested folks, thanks!
> > >
> > > >
> > > >
> > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > Hello everyone,
> > > > >
> > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > > 2023 about swap & zswap (hope I am not too late).
> > > > >
> > > > > ==================== Intro ====================
> > > > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > > > way. To use zswap, you need a swapfile configured (even if the space
> > > > > will not be used) and zswap is restricted by its size. When pages
> > > > > reside in zswap, the corresponding swap entry in the swapfile cannot
> > > > > be used, and is essentially wasted. We also go through unnecessary
> > > > > code paths when using zswap, such as finding and allocating a swap
> > > > > entry on the swapout path, or readahead in the swapin path. I am
> > > > > proposing a swapping abstraction layer that would allow us to remove
> > > > > zswap's dependency on swapfiles. This can be done by introducing a
> > > > > data structure between the actual swapping implementation (swapfiles,
> > > > > zswap) and the rest of the MM code.
> > > > >
> > > > > ==================== Objective ====================
> > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > > swapfile, so the overall swapping capacity increases.
> > > > >
> > > > > ==================== Idea ====================
> > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > entry. It can also include all non-backend specific operations, such
> > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > > code and the actual swapping implementation.
> > > >
> > > > How will the swap_desc be allocated? Dynamically or preallocated? Is
> > > > it 1:1 mapped to the swap slots on swap devices (whatever it is
> > > > backed, for example, zswap, swap partition, swapfile, etc)?
> > >
> > > I imagine swap_desc's would be dynamically allocated when we need to
> > > swap something out. When allocated, a swap_desc would either point to
> > > a zswap_entry (if available), or a swap slot otherwise. In this case,
> > > it would be 1:1 mapped to swapped out pages, not the swap slots on
> > > devices.
> >
> > It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> > is used as the back of zswap.
> >
> > >
> > > I know that it might not be ideal to make allocations on the reclaim
> > > path (although it would be a small-ish slab allocation so we might be
> > > able to get away with it), but otherwise we would have statically
> > > allocated swap_desc's for all swap slots on a swap device, even unused
> > > ones, which I imagine is too expensive. Also for things like zswap, it
> > > doesn't really make sense to preallocate at all.
> >
> > Yeah, it is not perfect to allocate memory in the reclamation path. We
> > do have such cases, but the fewer the better IMHO.
>
> Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
> slab cache, idk if that makes sense, or if there is a way to tell slab
> to proactively refill a cache.
>
> I am open to suggestions here. I don't think we should/can preallocate
> the swap_desc's, and we cannot completely eliminate the allocations in
> the reclaim path. We can only try to minimize them through caching,
> etc. Right?

Yeah, reallocation should not work. But I'm not sure whether caching
works well for this case or not either. I'm supposed that you were
thinking about something similar with pcp. When the available number
of elements is lower than a threshold, refill the cache. It should
work well with moderate memory pressure. But I'm not sure how it would
behave with severe memory pressure, particularly when  anonymous
memory dominated the memory usage. Or maybe dynamic allocation works
well, we are just over-engineered.

>
> >
> > >
> > > WDYT?
> > >
> > > >
> > > > >
> > > > > ==================== Benefits ====================
> > > > > This work enables using zswap without a backing swapfile and increases
> > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > a separation that allows us to skip code paths that don't make sense
> > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > which might result in better performance (less lookups, less lock
> > > > > contention).
> > > > >
> > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > removing swapper address spaces, removing swap count continuation
> > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > things that are stored in page tables / caches, and for actual swap
> > > > > entries. In the future, we can potentially further optimize how we use
> > > > > the bits in the page tables instead of sticking everything into the
> > > > > current type/offset format.
> > > > >
> > > > > Another potential win here can be swapoff, which can be more practical
> > > > > by directly scanning all swap_desc's instead of going through page
> > > > > tables and shmem page caches.
> > > > >
> > > > > Overall zswap becomes more accessible and available to a wider range
> > > > > of use cases.
> > > >
> > > > How will you handle zswap writeback? Zswap may writeback to the backed
> > > > swap device IIUC. Assuming you have both zswap and swapfile, they are
> > > > separate devices with this design, right? If so, is the swapfile still
> > > > the writeback target of zswap? And if it is the writeback target, what
> > > > if swapfile is full?
> > >
> > > When we try to writeback from zswap, we try to allocate a swap slot in
> > > the swapfile, and switch the swap_desc to point to that instead. The
> > > process would be transparent to the rest of MM (page tables, page
> > > cache, etc). If the swapfile is full, then there's really nothing we
> > > can do, reclaim fails and we start OOMing. I imagine this is the same
> > > behavior as today when swap is full, the difference would be that we
> > > have to fill both zswap AND the swapfile to get to the OOMing point,
> > > so an overall increased swapping capacity.
> >
> > When zswap is full, but swapfile is not yet, will the swap try to
> > writeback zswap to swapfile to make more room for zswap or just swap
> > out to swapfile directly?
> >
>
> The current behavior is that we swap to swapfile directly in this
> case, which is far from ideal as we break LRU ordering by skipping
> zswap. I believe this should be addressed, but not as part of this
> effort. The work to make zswap respect the LRU ordering by writing
> back from zswap to make room can be done orthogonal to this effort. I
> believe Johannes was looking into this at some point.

Other than breaking LRU ordering, I'm also concerned about the
potential deteriorating performance when writing/reading from swapfile
when zswap is full. The zswap->swapfile order should be able to
maintain a consistent performance for userspace.

But anyway I don't have the data from real life workload to back the
above points. If you or Johannes could share some real data, that
would be very helpful to make the decisions.

>
> > >
> > > >
> > > > Anyway I'm interested in attending the discussion for this topic.
> > >
> > > Great! Looking forward to discuss this more!
> > >
> > > >
> > > > >
> > > > > ==================== Cost ====================
> > > > > The obvious downside of this is added memory overhead, specifically
> > > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > > continuation), we pay the size of the swap_desc for every page that is
> > > > > actually in the swapfile, which I am estimating can be roughly around
> > > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > > scales with pages actually swapped out. For zswap users, it should be
> > > > > a win (or at least even) because we get to drop a lot of fields from
> > > > > struct zswap_entry (e.g. rbtree, index, etc).
> > > > >
> > > > > Another potential concern is readahead. With this design, we have no
> > > > > way to get a swap_desc given a swap entry (type & offset). We would
> > > > > need to maintain a reverse mapping, adding a little bit more overhead,
> > > > > or search all swapped out pages instead :). A reverse mapping might
> > > > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> > > > > memory).
> > > > >
> > > > > ==================== Bottom Line ====================
> > > > > It would be nice to discuss the potential here and the tradeoffs. I
> > > > > know that other folks using zswap (or interested in using it) may find
> > > > > this very useful. I am sure I am missing some context on why things
> > > > > are the way they are, and perhaps some obvious holes in my story.
> > > > > Looking forward to discussing this with anyone interested :)
> > > > >
> > > > > I think Johannes may be interested in attending this discussion, since
> > > > > a lot of ideas here are inspired by discussions I had with him :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-21 23:34         ` Yang Shi
@ 2023-02-21 23:38           ` Yosry Ahmed
  2023-02-22 16:57             ` Johannes Weiner
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-02-21 23:38 UTC (permalink / raw)
  To: Yang Shi
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton

On Tue, Feb 21, 2023 at 3:34 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > Hi Yosry,
> > > > >
> > > > > Thanks for proposing this topic. I was thinking about this before but
> > > > > I didn't make too much progress due to some other distractions, and I
> > > > > got a couple of follow up questions about your design. Please see the
> > > > > inline comments below.
> > > >
> > > > Great to see interested folks, thanks!
> > > >
> > > > >
> > > > >
> > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > Hello everyone,
> > > > > >
> > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > > > 2023 about swap & zswap (hope I am not too late).
> > > > > >
> > > > > > ==================== Intro ====================
> > > > > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > > > > way. To use zswap, you need a swapfile configured (even if the space
> > > > > > will not be used) and zswap is restricted by its size. When pages
> > > > > > reside in zswap, the corresponding swap entry in the swapfile cannot
> > > > > > be used, and is essentially wasted. We also go through unnecessary
> > > > > > code paths when using zswap, such as finding and allocating a swap
> > > > > > entry on the swapout path, or readahead in the swapin path. I am
> > > > > > proposing a swapping abstraction layer that would allow us to remove
> > > > > > zswap's dependency on swapfiles. This can be done by introducing a
> > > > > > data structure between the actual swapping implementation (swapfiles,
> > > > > > zswap) and the rest of the MM code.
> > > > > >
> > > > > > ==================== Objective ====================
> > > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > > > swapfile, so the overall swapping capacity increases.
> > > > > >
> > > > > > ==================== Idea ====================
> > > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > > entry. It can also include all non-backend specific operations, such
> > > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > > > code and the actual swapping implementation.
> > > > >
> > > > > How will the swap_desc be allocated? Dynamically or preallocated? Is
> > > > > it 1:1 mapped to the swap slots on swap devices (whatever it is
> > > > > backed, for example, zswap, swap partition, swapfile, etc)?
> > > >
> > > > I imagine swap_desc's would be dynamically allocated when we need to
> > > > swap something out. When allocated, a swap_desc would either point to
> > > > a zswap_entry (if available), or a swap slot otherwise. In this case,
> > > > it would be 1:1 mapped to swapped out pages, not the swap slots on
> > > > devices.
> > >
> > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> > > is used as the back of zswap.
> > >
> > > >
> > > > I know that it might not be ideal to make allocations on the reclaim
> > > > path (although it would be a small-ish slab allocation so we might be
> > > > able to get away with it), but otherwise we would have statically
> > > > allocated swap_desc's for all swap slots on a swap device, even unused
> > > > ones, which I imagine is too expensive. Also for things like zswap, it
> > > > doesn't really make sense to preallocate at all.
> > >
> > > Yeah, it is not perfect to allocate memory in the reclamation path. We
> > > do have such cases, but the fewer the better IMHO.
> >
> > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
> > slab cache, idk if that makes sense, or if there is a way to tell slab
> > to proactively refill a cache.
> >
> > I am open to suggestions here. I don't think we should/can preallocate
> > the swap_desc's, and we cannot completely eliminate the allocations in
> > the reclaim path. We can only try to minimize them through caching,
> > etc. Right?
>
> Yeah, reallocation should not work. But I'm not sure whether caching
> works well for this case or not either. I'm supposed that you were
> thinking about something similar with pcp. When the available number
> of elements is lower than a threshold, refill the cache. It should
> work well with moderate memory pressure. But I'm not sure how it would
> behave with severe memory pressure, particularly when  anonymous
> memory dominated the memory usage. Or maybe dynamic allocation works
> well, we are just over-engineered.

Yeah it would be interesting to look into whether the swap_desc
allocation will be a bottleneck. Definitely something to look out for.
I share your thoughts about wanting to do something about it but also
not wanting to over-engineer it.

>
> >
> > >
> > > >
> > > > WDYT?
> > > >
> > > > >
> > > > > >
> > > > > > ==================== Benefits ====================
> > > > > > This work enables using zswap without a backing swapfile and increases
> > > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > > a separation that allows us to skip code paths that don't make sense
> > > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > > which might result in better performance (less lookups, less lock
> > > > > > contention).
> > > > > >
> > > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > > removing swapper address spaces, removing swap count continuation
> > > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > > things that are stored in page tables / caches, and for actual swap
> > > > > > entries. In the future, we can potentially further optimize how we use
> > > > > > the bits in the page tables instead of sticking everything into the
> > > > > > current type/offset format.
> > > > > >
> > > > > > Another potential win here can be swapoff, which can be more practical
> > > > > > by directly scanning all swap_desc's instead of going through page
> > > > > > tables and shmem page caches.
> > > > > >
> > > > > > Overall zswap becomes more accessible and available to a wider range
> > > > > > of use cases.
> > > > >
> > > > > How will you handle zswap writeback? Zswap may writeback to the backed
> > > > > swap device IIUC. Assuming you have both zswap and swapfile, they are
> > > > > separate devices with this design, right? If so, is the swapfile still
> > > > > the writeback target of zswap? And if it is the writeback target, what
> > > > > if swapfile is full?
> > > >
> > > > When we try to writeback from zswap, we try to allocate a swap slot in
> > > > the swapfile, and switch the swap_desc to point to that instead. The
> > > > process would be transparent to the rest of MM (page tables, page
> > > > cache, etc). If the swapfile is full, then there's really nothing we
> > > > can do, reclaim fails and we start OOMing. I imagine this is the same
> > > > behavior as today when swap is full, the difference would be that we
> > > > have to fill both zswap AND the swapfile to get to the OOMing point,
> > > > so an overall increased swapping capacity.
> > >
> > > When zswap is full, but swapfile is not yet, will the swap try to
> > > writeback zswap to swapfile to make more room for zswap or just swap
> > > out to swapfile directly?
> > >
> >
> > The current behavior is that we swap to swapfile directly in this
> > case, which is far from ideal as we break LRU ordering by skipping
> > zswap. I believe this should be addressed, but not as part of this
> > effort. The work to make zswap respect the LRU ordering by writing
> > back from zswap to make room can be done orthogonal to this effort. I
> > believe Johannes was looking into this at some point.
>
> Other than breaking LRU ordering, I'm also concerned about the
> potential deteriorating performance when writing/reading from swapfile
> when zswap is full. The zswap->swapfile order should be able to
> maintain a consistent performance for userspace.

Right. This happens today anyway AFAICT, when zswap is full we just
fallback to writing to swapfile, so this would not be a behavior
change. I agree it should be addressed anyway.

>
> But anyway I don't have the data from real life workload to back the
> above points. If you or Johannes could share some real data, that
> would be very helpful to make the decisions.

I actually don't, since we mostly run zswap without a backing
swapfile. Perhaps Johannes might be able to have some data on this (or
anyone using zswap with a backing swapfile).

>
> >
> > > >
> > > > >
> > > > > Anyway I'm interested in attending the discussion for this topic.
> > > >
> > > > Great! Looking forward to discuss this more!
> > > >
> > > > >
> > > > > >
> > > > > > ==================== Cost ====================
> > > > > > The obvious downside of this is added memory overhead, specifically
> > > > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > > > continuation), we pay the size of the swap_desc for every page that is
> > > > > > actually in the swapfile, which I am estimating can be roughly around
> > > > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > > > scales with pages actually swapped out. For zswap users, it should be
> > > > > > a win (or at least even) because we get to drop a lot of fields from
> > > > > > struct zswap_entry (e.g. rbtree, index, etc).
> > > > > >
> > > > > > Another potential concern is readahead. With this design, we have no
> > > > > > way to get a swap_desc given a swap entry (type & offset). We would
> > > > > > need to maintain a reverse mapping, adding a little bit more overhead,
> > > > > > or search all swapped out pages instead :). A reverse mapping might
> > > > > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> > > > > > memory).
> > > > > >
> > > > > > ==================== Bottom Line ====================
> > > > > > It would be nice to discuss the potential here and the tradeoffs. I
> > > > > > know that other folks using zswap (or interested in using it) may find
> > > > > > this very useful. I am sure I am missing some context on why things
> > > > > > are the way they are, and perhaps some obvious holes in my story.
> > > > > > Looking forward to discussing this with anyone interested :)
> > > > > >
> > > > > > I think Johannes may be interested in attending this discussion, since
> > > > > > a lot of ideas here are inspired by discussions I had with him :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-21 23:38           ` Yosry Ahmed
@ 2023-02-22 16:57             ` Johannes Weiner
  2023-02-22 22:46               ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Johannes Weiner @ 2023-02-22 16:57 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Yang Shi, lsf-pc, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton, Nhat Pham

Hello,

thanks for proposing this, Yosry. I'm very interested in this
work. Unfortunately, I won't be able to attend LSFMMBPF myself this
time around due to a scheduling conflict :(

On Tue, Feb 21, 2023 at 03:38:57PM -0800, Yosry Ahmed wrote:
> On Tue, Feb 21, 2023 at 3:34 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > > >
> > > > > > Hi Yosry,
> > > > > >
> > > > > > Thanks for proposing this topic. I was thinking about this before but
> > > > > > I didn't make too much progress due to some other distractions, and I
> > > > > > got a couple of follow up questions about your design. Please see the
> > > > > > inline comments below.
> > > > >
> > > > > Great to see interested folks, thanks!
> > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > Hello everyone,
> > > > > > >
> > > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > > > > 2023 about swap & zswap (hope I am not too late).
> > > > > > >
> > > > > > > ==================== Intro ====================
> > > > > > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > > > > > way. To use zswap, you need a swapfile configured (even if the space
> > > > > > > will not be used) and zswap is restricted by its size. When pages
> > > > > > > reside in zswap, the corresponding swap entry in the swapfile cannot
> > > > > > > be used, and is essentially wasted. We also go through unnecessary
> > > > > > > code paths when using zswap, such as finding and allocating a swap
> > > > > > > entry on the swapout path, or readahead in the swapin path. I am
> > > > > > > proposing a swapping abstraction layer that would allow us to remove
> > > > > > > zswap's dependency on swapfiles. This can be done by introducing a
> > > > > > > data structure between the actual swapping implementation (swapfiles,
> > > > > > > zswap) and the rest of the MM code.
> > > > > > >
> > > > > > > ==================== Objective ====================
> > > > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > > > > swapfile, so the overall swapping capacity increases.
> > > > > > >
> > > > > > > ==================== Idea ====================
> > > > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > > > entry. It can also include all non-backend specific operations, such
> > > > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > > > > code and the actual swapping implementation.
> > > > > >
> > > > > > How will the swap_desc be allocated? Dynamically or preallocated? Is
> > > > > > it 1:1 mapped to the swap slots on swap devices (whatever it is
> > > > > > backed, for example, zswap, swap partition, swapfile, etc)?
> > > > >
> > > > > I imagine swap_desc's would be dynamically allocated when we need to
> > > > > swap something out. When allocated, a swap_desc would either point to
> > > > > a zswap_entry (if available), or a swap slot otherwise. In this case,
> > > > > it would be 1:1 mapped to swapped out pages, not the swap slots on
> > > > > devices.
> > > >
> > > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> > > > is used as the back of zswap.
> > > >
> > > > >
> > > > > I know that it might not be ideal to make allocations on the reclaim
> > > > > path (although it would be a small-ish slab allocation so we might be
> > > > > able to get away with it), but otherwise we would have statically
> > > > > allocated swap_desc's for all swap slots on a swap device, even unused
> > > > > ones, which I imagine is too expensive. Also for things like zswap, it
> > > > > doesn't really make sense to preallocate at all.
> > > >
> > > > Yeah, it is not perfect to allocate memory in the reclamation path. We
> > > > do have such cases, but the fewer the better IMHO.
> > >
> > > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
> > > slab cache, idk if that makes sense, or if there is a way to tell slab
> > > to proactively refill a cache.
> > >
> > > I am open to suggestions here. I don't think we should/can preallocate
> > > the swap_desc's, and we cannot completely eliminate the allocations in
> > > the reclaim path. We can only try to minimize them through caching,
> > > etc. Right?
> >
> > Yeah, reallocation should not work. But I'm not sure whether caching
> > works well for this case or not either. I'm supposed that you were
> > thinking about something similar with pcp. When the available number
> > of elements is lower than a threshold, refill the cache. It should
> > work well with moderate memory pressure. But I'm not sure how it would
> > behave with severe memory pressure, particularly when  anonymous
> > memory dominated the memory usage. Or maybe dynamic allocation works
> > well, we are just over-engineered.
> 
> Yeah it would be interesting to look into whether the swap_desc
> allocation will be a bottleneck. Definitely something to look out for.
> I share your thoughts about wanting to do something about it but also
> not wanting to over-engineer it.

I'm not too concerned by this. It's a PF_MEMALLOC allocation, meaning
it's not subject to watermarks. And the swapped page is freed right
afterwards. As long as the compression delta exceeds the size of
swap_desc, the process is a net reduction in allocated memory. For
regular swap, the only requirement is that swap_desc < page_size() :-)

To put this into perspective, the zswap backends allocate backing
pages on-demand during reclaim. zsmalloc also kmallocs metadata in
that path. We haven't had any issues with this in production, even
under fairly severe memory pressure scenarios.

> > > > > > > ==================== Benefits ====================
> > > > > > > This work enables using zswap without a backing swapfile and increases
> > > > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > > > a separation that allows us to skip code paths that don't make sense
> > > > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > > > which might result in better performance (less lookups, less lock
> > > > > > > contention).
> > > > > > >
> > > > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > > > removing swapper address spaces, removing swap count continuation
> > > > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > > > things that are stored in page tables / caches, and for actual swap
> > > > > > > entries. In the future, we can potentially further optimize how we use
> > > > > > > the bits in the page tables instead of sticking everything into the
> > > > > > > current type/offset format.
> > > > > > >
> > > > > > > Another potential win here can be swapoff, which can be more practical
> > > > > > > by directly scanning all swap_desc's instead of going through page
> > > > > > > tables and shmem page caches.
> > > > > > >
> > > > > > > Overall zswap becomes more accessible and available to a wider range
> > > > > > > of use cases.
> > > > > >
> > > > > > How will you handle zswap writeback? Zswap may writeback to the backed
> > > > > > swap device IIUC. Assuming you have both zswap and swapfile, they are
> > > > > > separate devices with this design, right? If so, is the swapfile still
> > > > > > the writeback target of zswap? And if it is the writeback target, what
> > > > > > if swapfile is full?
> > > > >
> > > > > When we try to writeback from zswap, we try to allocate a swap slot in
> > > > > the swapfile, and switch the swap_desc to point to that instead. The
> > > > > process would be transparent to the rest of MM (page tables, page
> > > > > cache, etc). If the swapfile is full, then there's really nothing we
> > > > > can do, reclaim fails and we start OOMing. I imagine this is the same
> > > > > behavior as today when swap is full, the difference would be that we
> > > > > have to fill both zswap AND the swapfile to get to the OOMing point,
> > > > > so an overall increased swapping capacity.
> > > >
> > > > When zswap is full, but swapfile is not yet, will the swap try to
> > > > writeback zswap to swapfile to make more room for zswap or just swap
> > > > out to swapfile directly?
> > > >
> > >
> > > The current behavior is that we swap to swapfile directly in this
> > > case, which is far from ideal as we break LRU ordering by skipping
> > > zswap. I believe this should be addressed, but not as part of this
> > > effort. The work to make zswap respect the LRU ordering by writing
> > > back from zswap to make room can be done orthogonal to this effort. I
> > > believe Johannes was looking into this at some point.

Actually, zswap already does LRU writeback when the pool is full. Nhat
Pham (CCd) recently upstreamed the LRU implementation for zsmalloc, so
as of today all backends support this.

There are still a few quirks in zswap that can cause rejections which
bypass the LRU that need fixing. But for the most part LRU writeback
to the backing file is the default behavior.

> > Other than breaking LRU ordering, I'm also concerned about the
> > potential deteriorating performance when writing/reading from swapfile
> > when zswap is full. The zswap->swapfile order should be able to
> > maintain a consistent performance for userspace.
> 
> Right. This happens today anyway AFAICT, when zswap is full we just
> fallback to writing to swapfile, so this would not be a behavior
> change. I agree it should be addressed anyway.
> 
> >
> > But anyway I don't have the data from real life workload to back the
> > above points. If you or Johannes could share some real data, that
> > would be very helpful to make the decisions.
> 
> I actually don't, since we mostly run zswap without a backing
> swapfile. Perhaps Johannes might be able to have some data on this (or
> anyone using zswap with a backing swapfile).

Due to LRU writeback, the latency increase when zswap spills its
coldest entries into backing swap is fairly linear, as you may
expect. We have some limited production data on this from the
webservers.

The biggest challenge in this space is properly sizing the zswap pool,
such that it's big enough to hold the warm set that the workload is
most latency-sensitive too, yet small enough such that the cold pages
get spilled to backing swap. Nhat is working on improving this.

That said, I think this discussion is orthogonal to the proposed
topic. zswap spills to backing swap in LRU order as of today. The
LRU/pool size tweaking is an optimization to get smarter zswap/swap
placement according to access frequency. The proposed swap descriptor
is an optimization to get better disk utilization, the ability to run
zswap without backing swap, and a dramatic speedup in swapoff time.

> > > > > > Anyway I'm interested in attending the discussion for this topic.
> > > > >
> > > > > Great! Looking forward to discuss this more!
> > > > >
> > > > > >
> > > > > > >
> > > > > > > ==================== Cost ====================
> > > > > > > The obvious downside of this is added memory overhead, specifically
> > > > > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > > > > continuation), we pay the size of the swap_desc for every page that is
> > > > > > > actually in the swapfile, which I am estimating can be roughly around
> > > > > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > > > > scales with pages actually swapped out. For zswap users, it should be
> > > > > > > a win (or at least even) because we get to drop a lot of fields from
> > > > > > > struct zswap_entry (e.g. rbtree, index, etc).

Shifting the cost from O(swapspace) to O(swapped) could be a win for
many regular swap users too.

There are the legacy setups that provision 2*RAM worth of swap as an
emergency overflow that is then rarely used.

We have a setups that swap to disk more proactively, but we also
overprovision those in terms of swap space due to the cliff behavior
when swap fills up and the VM runs out of options.

To make a fair comparison, you really have to take average swap
utilization into account. And I doubt that's very high.

In terms of worst-case behavior, +0.8% per swapped page doesn't sound
like a show-stopper to me. Especially when compared to zswap's current
O(swapped) waste of disk space.

> > > > > > > Another potential concern is readahead. With this design, we have no
> > > > > > > way to get a swap_desc given a swap entry (type & offset). We would
> > > > > > > need to maintain a reverse mapping, adding a little bit more overhead,
> > > > > > > or search all swapped out pages instead :). A reverse mapping might
> > > > > > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> > > > > > > memory).
> > > > > > >
> > > > > > > ==================== Bottom Line ====================
> > > > > > > It would be nice to discuss the potential here and the tradeoffs. I
> > > > > > > know that other folks using zswap (or interested in using it) may find
> > > > > > > this very useful. I am sure I am missing some context on why things
> > > > > > > are the way they are, and perhaps some obvious holes in my story.
> > > > > > > Looking forward to discussing this with anyone interested :)
> > > > > > >
> > > > > > > I think Johannes may be interested in attending this discussion, since
> > > > > > > a lot of ideas here are inspired by discussions I had with him :)

Thanks!


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-22 16:57             ` Johannes Weiner
@ 2023-02-22 22:46               ` Yosry Ahmed
  2023-02-28  4:29                 ` Kalesh Singh
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-02-22 22:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yang Shi, lsf-pc, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton, Nhat Pham

On Wed, Feb 22, 2023 at 8:57 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hello,
>
> thanks for proposing this, Yosry. I'm very interested in this
> work. Unfortunately, I won't be able to attend LSFMMBPF myself this
> time around due to a scheduling conflict :(

Ugh, would have been great to have you, I guess there might be a
remote option, or we will end up discussing on the mailing list
eventually anyway.

>
> On Tue, Feb 21, 2023 at 03:38:57PM -0800, Yosry Ahmed wrote:
> > On Tue, Feb 21, 2023 at 3:34 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Yosry,
> > > > > > >
> > > > > > > Thanks for proposing this topic. I was thinking about this before but
> > > > > > > I didn't make too much progress due to some other distractions, and I
> > > > > > > got a couple of follow up questions about your design. Please see the
> > > > > > > inline comments below.
> > > > > >
> > > > > > Great to see interested folks, thanks!
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > >
> > > > > > > > Hello everyone,
> > > > > > > >
> > > > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > > > > > 2023 about swap & zswap (hope I am not too late).
> > > > > > > >
> > > > > > > > ==================== Intro ====================
> > > > > > > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > > > > > > way. To use zswap, you need a swapfile configured (even if the space
> > > > > > > > will not be used) and zswap is restricted by its size. When pages
> > > > > > > > reside in zswap, the corresponding swap entry in the swapfile cannot
> > > > > > > > be used, and is essentially wasted. We also go through unnecessary
> > > > > > > > code paths when using zswap, such as finding and allocating a swap
> > > > > > > > entry on the swapout path, or readahead in the swapin path. I am
> > > > > > > > proposing a swapping abstraction layer that would allow us to remove
> > > > > > > > zswap's dependency on swapfiles. This can be done by introducing a
> > > > > > > > data structure between the actual swapping implementation (swapfiles,
> > > > > > > > zswap) and the rest of the MM code.
> > > > > > > >
> > > > > > > > ==================== Objective ====================
> > > > > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > > > > > swapfile, so the overall swapping capacity increases.
> > > > > > > >
> > > > > > > > ==================== Idea ====================
> > > > > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > > > > entry. It can also include all non-backend specific operations, such
> > > > > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > > > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > > > > > code and the actual swapping implementation.
> > > > > > >
> > > > > > > How will the swap_desc be allocated? Dynamically or preallocated? Is
> > > > > > > it 1:1 mapped to the swap slots on swap devices (whatever it is
> > > > > > > backed, for example, zswap, swap partition, swapfile, etc)?
> > > > > >
> > > > > > I imagine swap_desc's would be dynamically allocated when we need to
> > > > > > swap something out. When allocated, a swap_desc would either point to
> > > > > > a zswap_entry (if available), or a swap slot otherwise. In this case,
> > > > > > it would be 1:1 mapped to swapped out pages, not the swap slots on
> > > > > > devices.
> > > > >
> > > > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> > > > > is used as the back of zswap.
> > > > >
> > > > > >
> > > > > > I know that it might not be ideal to make allocations on the reclaim
> > > > > > path (although it would be a small-ish slab allocation so we might be
> > > > > > able to get away with it), but otherwise we would have statically
> > > > > > allocated swap_desc's for all swap slots on a swap device, even unused
> > > > > > ones, which I imagine is too expensive. Also for things like zswap, it
> > > > > > doesn't really make sense to preallocate at all.
> > > > >
> > > > > Yeah, it is not perfect to allocate memory in the reclamation path. We
> > > > > do have such cases, but the fewer the better IMHO.
> > > >
> > > > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
> > > > slab cache, idk if that makes sense, or if there is a way to tell slab
> > > > to proactively refill a cache.
> > > >
> > > > I am open to suggestions here. I don't think we should/can preallocate
> > > > the swap_desc's, and we cannot completely eliminate the allocations in
> > > > the reclaim path. We can only try to minimize them through caching,
> > > > etc. Right?
> > >
> > > Yeah, reallocation should not work. But I'm not sure whether caching
> > > works well for this case or not either. I'm supposed that you were
> > > thinking about something similar with pcp. When the available number
> > > of elements is lower than a threshold, refill the cache. It should
> > > work well with moderate memory pressure. But I'm not sure how it would
> > > behave with severe memory pressure, particularly when  anonymous
> > > memory dominated the memory usage. Or maybe dynamic allocation works
> > > well, we are just over-engineered.
> >
> > Yeah it would be interesting to look into whether the swap_desc
> > allocation will be a bottleneck. Definitely something to look out for.
> > I share your thoughts about wanting to do something about it but also
> > not wanting to over-engineer it.
>
> I'm not too concerned by this. It's a PF_MEMALLOC allocation, meaning
> it's not subject to watermarks. And the swapped page is freed right
> afterwards. As long as the compression delta exceeds the size of
> swap_desc, the process is a net reduction in allocated memory. For
> regular swap, the only requirement is that swap_desc < page_size() :-)
>
> To put this into perspective, the zswap backends allocate backing
> pages on-demand during reclaim. zsmalloc also kmallocs metadata in
> that path. We haven't had any issues with this in production, even
> under fairly severe memory pressure scenarios.

Right. The only problem would be for pages that do not compress well
in zswap, in which case we might not end up freeing memory. As you
said, this is already happening today with zswap tho.

>
> > > > > > > > ==================== Benefits ====================
> > > > > > > > This work enables using zswap without a backing swapfile and increases
> > > > > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > > > > a separation that allows us to skip code paths that don't make sense
> > > > > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > > > > which might result in better performance (less lookups, less lock
> > > > > > > > contention).
> > > > > > > >
> > > > > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > > > > removing swapper address spaces, removing swap count continuation
> > > > > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > > > > things that are stored in page tables / caches, and for actual swap
> > > > > > > > entries. In the future, we can potentially further optimize how we use
> > > > > > > > the bits in the page tables instead of sticking everything into the
> > > > > > > > current type/offset format.
> > > > > > > >
> > > > > > > > Another potential win here can be swapoff, which can be more practical
> > > > > > > > by directly scanning all swap_desc's instead of going through page
> > > > > > > > tables and shmem page caches.
> > > > > > > >
> > > > > > > > Overall zswap becomes more accessible and available to a wider range
> > > > > > > > of use cases.
> > > > > > >
> > > > > > > How will you handle zswap writeback? Zswap may writeback to the backed
> > > > > > > swap device IIUC. Assuming you have both zswap and swapfile, they are
> > > > > > > separate devices with this design, right? If so, is the swapfile still
> > > > > > > the writeback target of zswap? And if it is the writeback target, what
> > > > > > > if swapfile is full?
> > > > > >
> > > > > > When we try to writeback from zswap, we try to allocate a swap slot in
> > > > > > the swapfile, and switch the swap_desc to point to that instead. The
> > > > > > process would be transparent to the rest of MM (page tables, page
> > > > > > cache, etc). If the swapfile is full, then there's really nothing we
> > > > > > can do, reclaim fails and we start OOMing. I imagine this is the same
> > > > > > behavior as today when swap is full, the difference would be that we
> > > > > > have to fill both zswap AND the swapfile to get to the OOMing point,
> > > > > > so an overall increased swapping capacity.
> > > > >
> > > > > When zswap is full, but swapfile is not yet, will the swap try to
> > > > > writeback zswap to swapfile to make more room for zswap or just swap
> > > > > out to swapfile directly?
> > > > >
> > > >
> > > > The current behavior is that we swap to swapfile directly in this
> > > > case, which is far from ideal as we break LRU ordering by skipping
> > > > zswap. I believe this should be addressed, but not as part of this
> > > > effort. The work to make zswap respect the LRU ordering by writing
> > > > back from zswap to make room can be done orthogonal to this effort. I
> > > > believe Johannes was looking into this at some point.
>
> Actually, zswap already does LRU writeback when the pool is full. Nhat
> Pham (CCd) recently upstreamed the LRU implementation for zsmalloc, so
> as of today all backends support this.
>
> There are still a few quirks in zswap that can cause rejections which
> bypass the LRU that need fixing. But for the most part LRU writeback
> to the backing file is the default behavior.

Right, I was specifically talking about this case. When zswap is full
it rejects incoming pages and they go directly to the swapfile, but we
also kickoff writeback, so this only happens until we do some LRU
writeback. I guess I should have been more clear here. Thanks for
clarifying and correcting.

>
> > > Other than breaking LRU ordering, I'm also concerned about the
> > > potential deteriorating performance when writing/reading from swapfile
> > > when zswap is full. The zswap->swapfile order should be able to
> > > maintain a consistent performance for userspace.
> >
> > Right. This happens today anyway AFAICT, when zswap is full we just
> > fallback to writing to swapfile, so this would not be a behavior
> > change. I agree it should be addressed anyway.
> >
> > >
> > > But anyway I don't have the data from real life workload to back the
> > > above points. If you or Johannes could share some real data, that
> > > would be very helpful to make the decisions.
> >
> > I actually don't, since we mostly run zswap without a backing
> > swapfile. Perhaps Johannes might be able to have some data on this (or
> > anyone using zswap with a backing swapfile).
>
> Due to LRU writeback, the latency increase when zswap spills its
> coldest entries into backing swap is fairly linear, as you may
> expect. We have some limited production data on this from the
> webservers.
>
> The biggest challenge in this space is properly sizing the zswap pool,
> such that it's big enough to hold the warm set that the workload is
> most latency-sensitive too, yet small enough such that the cold pages
> get spilled to backing swap. Nhat is working on improving this.
>
> That said, I think this discussion is orthogonal to the proposed
> topic. zswap spills to backing swap in LRU order as of today. The
> LRU/pool size tweaking is an optimization to get smarter zswap/swap
> placement according to access frequency. The proposed swap descriptor
> is an optimization to get better disk utilization, the ability to run
> zswap without backing swap, and a dramatic speedup in swapoff time.

Fully agree.

>
> > > > > > > Anyway I'm interested in attending the discussion for this topic.
> > > > > >
> > > > > > Great! Looking forward to discuss this more!
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > ==================== Cost ====================
> > > > > > > > The obvious downside of this is added memory overhead, specifically
> > > > > > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > > > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > > > > > continuation), we pay the size of the swap_desc for every page that is
> > > > > > > > actually in the swapfile, which I am estimating can be roughly around
> > > > > > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > > > > > scales with pages actually swapped out. For zswap users, it should be
> > > > > > > > a win (or at least even) because we get to drop a lot of fields from
> > > > > > > > struct zswap_entry (e.g. rbtree, index, etc).
>
> Shifting the cost from O(swapspace) to O(swapped) could be a win for
> many regular swap users too.
>
> There are the legacy setups that provision 2*RAM worth of swap as an
> emergency overflow that is then rarely used.
>
> We have a setups that swap to disk more proactively, but we also
> overprovision those in terms of swap space due to the cliff behavior
> when swap fills up and the VM runs out of options.
>
> To make a fair comparison, you really have to take average swap
> utilization into account. And I doubt that's very high.

Yeah I was looking for some data here, but it varies heavily based on
the use case, so I opted to only state the overhead of the swap
descriptor without directly comparing it to the current overhead.

>
> In terms of worst-case behavior, +0.8% per swapped page doesn't sound
> like a show-stopper to me. Especially when compared to zswap's current
> O(swapped) waste of disk space.

Yeah for zswap users this should be a win on most/all fronts, even
memory overhead, as we will end up trimming struct zswap_entry which
is also O(swapped) memory overhead. It should also make zswap
available for more use cases. You don't need to provision and
configure swap space, you just need to turn zswap on.

>
> > > > > > > > Another potential concern is readahead. With this design, we have no
> > > > > > > > way to get a swap_desc given a swap entry (type & offset). We would
> > > > > > > > need to maintain a reverse mapping, adding a little bit more overhead,
> > > > > > > > or search all swapped out pages instead :). A reverse mapping might
> > > > > > > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> > > > > > > > memory).
> > > > > > > >
> > > > > > > > ==================== Bottom Line ====================
> > > > > > > > It would be nice to discuss the potential here and the tradeoffs. I
> > > > > > > > know that other folks using zswap (or interested in using it) may find
> > > > > > > > this very useful. I am sure I am missing some context on why things
> > > > > > > > are the way they are, and perhaps some obvious holes in my story.
> > > > > > > > Looking forward to discussing this with anyone interested :)
> > > > > > > >
> > > > > > > > I think Johannes may be interested in attending this discussion, since
> > > > > > > > a lot of ideas here are inspired by discussions I had with him :)
>
> Thanks!


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-22 22:46               ` Yosry Ahmed
@ 2023-02-28  4:29                 ` Kalesh Singh
  2023-02-28  8:09                   ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Kalesh Singh @ 2023-02-28  4:29 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Yang Shi, lsf-pc, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton,
	Nhat Pham, Akilesh Kailash

On Wed, Feb 22, 2023 at 2:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Feb 22, 2023 at 8:57 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Hello,
> >
> > thanks for proposing this, Yosry. I'm very interested in this
> > work. Unfortunately, I won't be able to attend LSFMMBPF myself this
> > time around due to a scheduling conflict :(
>
> Ugh, would have been great to have you, I guess there might be a
> remote option, or we will end up discussing on the mailing list
> eventually anyway.
>
> >
> > On Tue, Feb 21, 2023 at 03:38:57PM -0800, Yosry Ahmed wrote:
> > > On Tue, Feb 21, 2023 at 3:34 PM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi Yosry,
> > > > > > > >
> > > > > > > > Thanks for proposing this topic. I was thinking about this before but
> > > > > > > > I didn't make too much progress due to some other distractions, and I
> > > > > > > > got a couple of follow up questions about your design. Please see the
> > > > > > > > inline comments below.
> > > > > > >
> > > > > > > Great to see interested folks, thanks!
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > >
> > > > > > > > > Hello everyone,
> > > > > > > > >
> > > > > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > > > > > > 2023 about swap & zswap (hope I am not too late).
> > > > > > > > >
> > > > > > > > > ==================== Intro ====================
> > > > > > > > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > > > > > > > way. To use zswap, you need a swapfile configured (even if the space
> > > > > > > > > will not be used) and zswap is restricted by its size. When pages
> > > > > > > > > reside in zswap, the corresponding swap entry in the swapfile cannot
> > > > > > > > > be used, and is essentially wasted. We also go through unnecessary
> > > > > > > > > code paths when using zswap, such as finding and allocating a swap
> > > > > > > > > entry on the swapout path, or readahead in the swapin path. I am
> > > > > > > > > proposing a swapping abstraction layer that would allow us to remove
> > > > > > > > > zswap's dependency on swapfiles. This can be done by introducing a
> > > > > > > > > data structure between the actual swapping implementation (swapfiles,
> > > > > > > > > zswap) and the rest of the MM code.
> > > > > > > > >
> > > > > > > > > ==================== Objective ====================
> > > > > > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > > > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > > > > > > swapfile, so the overall swapping capacity increases.
> > > > > > > > >
> > > > > > > > > ==================== Idea ====================
> > > > > > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > > > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > > > > > entry. It can also include all non-backend specific operations, such
> > > > > > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > > > > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > > > > > > code and the actual swapping implementation.
> > > > > > > >
> > > > > > > > How will the swap_desc be allocated? Dynamically or preallocated? Is
> > > > > > > > it 1:1 mapped to the swap slots on swap devices (whatever it is
> > > > > > > > backed, for example, zswap, swap partition, swapfile, etc)?
> > > > > > >
> > > > > > > I imagine swap_desc's would be dynamically allocated when we need to
> > > > > > > swap something out. When allocated, a swap_desc would either point to
> > > > > > > a zswap_entry (if available), or a swap slot otherwise. In this case,
> > > > > > > it would be 1:1 mapped to swapped out pages, not the swap slots on
> > > > > > > devices.
> > > > > >
> > > > > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> > > > > > is used as the back of zswap.
> > > > > >
> > > > > > >
> > > > > > > I know that it might not be ideal to make allocations on the reclaim
> > > > > > > path (although it would be a small-ish slab allocation so we might be
> > > > > > > able to get away with it), but otherwise we would have statically
> > > > > > > allocated swap_desc's for all swap slots on a swap device, even unused
> > > > > > > ones, which I imagine is too expensive. Also for things like zswap, it
> > > > > > > doesn't really make sense to preallocate at all.
> > > > > >
> > > > > > Yeah, it is not perfect to allocate memory in the reclamation path. We
> > > > > > do have such cases, but the fewer the better IMHO.
> > > > >
> > > > > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
> > > > > slab cache, idk if that makes sense, or if there is a way to tell slab
> > > > > to proactively refill a cache.
> > > > >
> > > > > I am open to suggestions here. I don't think we should/can preallocate
> > > > > the swap_desc's, and we cannot completely eliminate the allocations in
> > > > > the reclaim path. We can only try to minimize them through caching,
> > > > > etc. Right?
> > > >
> > > > Yeah, reallocation should not work. But I'm not sure whether caching
> > > > works well for this case or not either. I'm supposed that you were
> > > > thinking about something similar with pcp. When the available number
> > > > of elements is lower than a threshold, refill the cache. It should
> > > > work well with moderate memory pressure. But I'm not sure how it would
> > > > behave with severe memory pressure, particularly when  anonymous
> > > > memory dominated the memory usage. Or maybe dynamic allocation works
> > > > well, we are just over-engineered.
> > >
> > > Yeah it would be interesting to look into whether the swap_desc
> > > allocation will be a bottleneck. Definitely something to look out for.
> > > I share your thoughts about wanting to do something about it but also
> > > not wanting to over-engineer it.
> >
> > I'm not too concerned by this. It's a PF_MEMALLOC allocation, meaning
> > it's not subject to watermarks. And the swapped page is freed right
> > afterwards. As long as the compression delta exceeds the size of
> > swap_desc, the process is a net reduction in allocated memory. For
> > regular swap, the only requirement is that swap_desc < page_size() :-)
> >
> > To put this into perspective, the zswap backends allocate backing
> > pages on-demand during reclaim. zsmalloc also kmallocs metadata in
> > that path. We haven't had any issues with this in production, even
> > under fairly severe memory pressure scenarios.
>
> Right. The only problem would be for pages that do not compress well
> in zswap, in which case we might not end up freeing memory. As you
> said, this is already happening today with zswap tho.
>
> >
> > > > > > > > > ==================== Benefits ====================
> > > > > > > > > This work enables using zswap without a backing swapfile and increases
> > > > > > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > > > > > a separation that allows us to skip code paths that don't make sense
> > > > > > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > > > > > which might result in better performance (less lookups, less lock
> > > > > > > > > contention).
> > > > > > > > >
> > > > > > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > > > > > removing swapper address spaces, removing swap count continuation
> > > > > > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > > > > > things that are stored in page tables / caches, and for actual swap
> > > > > > > > > entries. In the future, we can potentially further optimize how we use
> > > > > > > > > the bits in the page tables instead of sticking everything into the
> > > > > > > > > current type/offset format.
> > > > > > > > >
> > > > > > > > > Another potential win here can be swapoff, which can be more practical
> > > > > > > > > by directly scanning all swap_desc's instead of going through page
> > > > > > > > > tables and shmem page caches.
> > > > > > > > >
> > > > > > > > > Overall zswap becomes more accessible and available to a wider range
> > > > > > > > > of use cases.
> > > > > > > >
> > > > > > > > How will you handle zswap writeback? Zswap may writeback to the backed
> > > > > > > > swap device IIUC. Assuming you have both zswap and swapfile, they are
> > > > > > > > separate devices with this design, right? If so, is the swapfile still
> > > > > > > > the writeback target of zswap? And if it is the writeback target, what
> > > > > > > > if swapfile is full?
> > > > > > >
> > > > > > > When we try to writeback from zswap, we try to allocate a swap slot in
> > > > > > > the swapfile, and switch the swap_desc to point to that instead. The
> > > > > > > process would be transparent to the rest of MM (page tables, page
> > > > > > > cache, etc). If the swapfile is full, then there's really nothing we
> > > > > > > can do, reclaim fails and we start OOMing. I imagine this is the same
> > > > > > > behavior as today when swap is full, the difference would be that we
> > > > > > > have to fill both zswap AND the swapfile to get to the OOMing point,
> > > > > > > so an overall increased swapping capacity.
> > > > > >
> > > > > > When zswap is full, but swapfile is not yet, will the swap try to
> > > > > > writeback zswap to swapfile to make more room for zswap or just swap
> > > > > > out to swapfile directly?
> > > > > >
> > > > >
> > > > > The current behavior is that we swap to swapfile directly in this
> > > > > case, which is far from ideal as we break LRU ordering by skipping
> > > > > zswap. I believe this should be addressed, but not as part of this
> > > > > effort. The work to make zswap respect the LRU ordering by writing
> > > > > back from zswap to make room can be done orthogonal to this effort. I
> > > > > believe Johannes was looking into this at some point.
> >
> > Actually, zswap already does LRU writeback when the pool is full. Nhat
> > Pham (CCd) recently upstreamed the LRU implementation for zsmalloc, so
> > as of today all backends support this.
> >
> > There are still a few quirks in zswap that can cause rejections which
> > bypass the LRU that need fixing. But for the most part LRU writeback
> > to the backing file is the default behavior.
>
> Right, I was specifically talking about this case. When zswap is full
> it rejects incoming pages and they go directly to the swapfile, but we
> also kickoff writeback, so this only happens until we do some LRU
> writeback. I guess I should have been more clear here. Thanks for
> clarifying and correcting.
>
> >
> > > > Other than breaking LRU ordering, I'm also concerned about the
> > > > potential deteriorating performance when writing/reading from swapfile
> > > > when zswap is full. The zswap->swapfile order should be able to
> > > > maintain a consistent performance for userspace.
> > >
> > > Right. This happens today anyway AFAICT, when zswap is full we just
> > > fallback to writing to swapfile, so this would not be a behavior
> > > change. I agree it should be addressed anyway.
> > >
> > > >
> > > > But anyway I don't have the data from real life workload to back the
> > > > above points. If you or Johannes could share some real data, that
> > > > would be very helpful to make the decisions.
> > >
> > > I actually don't, since we mostly run zswap without a backing
> > > swapfile. Perhaps Johannes might be able to have some data on this (or
> > > anyone using zswap with a backing swapfile).
> >
> > Due to LRU writeback, the latency increase when zswap spills its
> > coldest entries into backing swap is fairly linear, as you may
> > expect. We have some limited production data on this from the
> > webservers.
> >
> > The biggest challenge in this space is properly sizing the zswap pool,
> > such that it's big enough to hold the warm set that the workload is
> > most latency-sensitive too, yet small enough such that the cold pages
> > get spilled to backing swap. Nhat is working on improving this.
> >
> > That said, I think this discussion is orthogonal to the proposed
> > topic. zswap spills to backing swap in LRU order as of today. The
> > LRU/pool size tweaking is an optimization to get smarter zswap/swap
> > placement according to access frequency. The proposed swap descriptor
> > is an optimization to get better disk utilization, the ability to run
> > zswap without backing swap, and a dramatic speedup in swapoff time.
>
> Fully agree.
>
> >
> > > > > > > > Anyway I'm interested in attending the discussion for this topic.
> > > > > > >
> > > > > > > Great! Looking forward to discuss this more!
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > ==================== Cost ====================
> > > > > > > > > The obvious downside of this is added memory overhead, specifically
> > > > > > > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > > > > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > > > > > > continuation), we pay the size of the swap_desc for every page that is
> > > > > > > > > actually in the swapfile, which I am estimating can be roughly around
> > > > > > > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > > > > > > scales with pages actually swapped out. For zswap users, it should be
> > > > > > > > > a win (or at least even) because we get to drop a lot of fields from
> > > > > > > > > struct zswap_entry (e.g. rbtree, index, etc).
> >
> > Shifting the cost from O(swapspace) to O(swapped) could be a win for
> > many regular swap users too.
> >
> > There are the legacy setups that provision 2*RAM worth of swap as an
> > emergency overflow that is then rarely used.
> >
> > We have a setups that swap to disk more proactively, but we also
> > overprovision those in terms of swap space due to the cliff behavior
> > when swap fills up and the VM runs out of options.
> >
> > To make a fair comparison, you really have to take average swap
> > utilization into account. And I doubt that's very high.
>
> Yeah I was looking for some data here, but it varies heavily based on
> the use case, so I opted to only state the overhead of the swap
> descriptor without directly comparing it to the current overhead.
>
> >
> > In terms of worst-case behavior, +0.8% per swapped page doesn't sound
> > like a show-stopper to me. Especially when compared to zswap's current
> > O(swapped) waste of disk space.
>
> Yeah for zswap users this should be a win on most/all fronts, even
> memory overhead, as we will end up trimming struct zswap_entry which
> is also O(swapped) memory overhead. It should also make zswap
> available for more use cases. You don't need to provision and
> configure swap space, you just need to turn zswap on.
>
> >
> > > > > > > > > Another potential concern is readahead. With this design, we have no
> > > > > > > > > way to get a swap_desc given a swap entry (type & offset). We would
> > > > > > > > > need to maintain a reverse mapping, adding a little bit more overhead,
> > > > > > > > > or search all swapped out pages instead :). A reverse mapping might
> > > > > > > > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> > > > > > > > > memory).
> > > > > > > > >
> > > > > > > > > ==================== Bottom Line ====================
> > > > > > > > > It would be nice to discuss the potential here and the tradeoffs. I
> > > > > > > > > know that other folks using zswap (or interested in using it) may find
> > > > > > > > > this very useful. I am sure I am missing some context on why things
> > > > > > > > > are the way they are, and perhaps some obvious holes in my story.
> > > > > > > > > Looking forward to discussing this with anyone interested :)
> > > > > > > > >
> > > > > > > > > I think Johannes may be interested in attending this discussion, since
> > > > > > > > > a lot of ideas here are inspired by discussions I had with him :)

Hi everyone,

I came across this interesting proposal and I would like to
participate in the discussion. I think it will be useful/overlap with
some projects we are currently planning in Android.

Thanks,
Kalesh

> >
> > Thanks!
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-18 22:38 [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap Yosry Ahmed
  2023-02-19  4:31 ` Matthew Wilcox
  2023-02-21 18:39 ` Yang Shi
@ 2023-02-28  4:54 ` Sergey Senozhatsky
  2023-02-28  8:12   ` Yosry Ahmed
  2023-02-28 23:11 ` Chris Li
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 105+ messages in thread
From: Sergey Senozhatsky @ 2023-02-28  4:54 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton

On (23/02/18 14:38), Yosry Ahmed wrote:
[..]
> ==================== Idea ====================
> Introduce a data structure, which I currently call a swap_desc, as an
> abstraction layer between swapping implementation and the rest of MM
> code. Page tables & page caches would store a swap id (encoded as a
> swp_entry_t) instead of directly storing the swap entry associated
> with the swapfile. This swap id maps to a struct swap_desc, which acts
> as our abstraction layer. All MM code not concerned with swapping
> details would operate in terms of swap descs. The swap_desc can point
> to either a normal swap entry (associated with a swapfile) or a zswap
> entry. It can also include all non-backend specific operations, such
> as the swapcache (which would be a simple pointer in swap_desc), swap
> counting, etc. It creates a clear, nice abstraction layer between MM
> code and the actual swapping implementation.
> 
> ==================== Benefits ====================
> This work enables using zswap without a backing swapfile and increases
> the swap capacity when zswap is used with a swapfile. It also creates
> a separation that allows us to skip code paths that don't make sense
> in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> which might result in better performance (less lookups, less lock
> contention).
> 
> The abstraction layer also opens the door for multiple cleanups (e.g.
> removing swapper address spaces, removing swap count continuation
> code, etc). Another nice cleanup that this work enables would be
> separating the overloaded swp_entry_t into two distinct types: one for
> things that are stored in page tables / caches, and for actual swap
> entries. In the future, we can potentially further optimize how we use
> the bits in the page tables instead of sticking everything into the
> current type/offset format.
> 
> Another potential win here can be swapoff, which can be more practical
> by directly scanning all swap_desc's instead of going through page
> tables and shmem page caches.
> 
> Overall zswap becomes more accessible and available to a wider range
> of use cases.

I assume this also brings us closer to a proper writeback LRU handling?


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-28  4:29                 ` Kalesh Singh
@ 2023-02-28  8:09                   ` Yosry Ahmed
  0 siblings, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-02-28  8:09 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: Johannes Weiner, Yang Shi, lsf-pc, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton,
	Nhat Pham, Akilesh Kailash

On Mon, Feb 27, 2023 at 8:29 PM Kalesh Singh <kaleshsingh@google.com> wrote:
>
> On Wed, Feb 22, 2023 at 2:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Feb 22, 2023 at 8:57 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Hello,
> > >
> > > thanks for proposing this, Yosry. I'm very interested in this
> > > work. Unfortunately, I won't be able to attend LSFMMBPF myself this
> > > time around due to a scheduling conflict :(
> >
> > Ugh, would have been great to have you, I guess there might be a
> > remote option, or we will end up discussing on the mailing list
> > eventually anyway.
> >
> > >
> > > On Tue, Feb 21, 2023 at 03:38:57PM -0800, Yosry Ahmed wrote:
> > > > On Tue, Feb 21, 2023 at 3:34 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Yosry,
> > > > > > > > >
> > > > > > > > > Thanks for proposing this topic. I was thinking about this before but
> > > > > > > > > I didn't make too much progress due to some other distractions, and I
> > > > > > > > > got a couple of follow up questions about your design. Please see the
> > > > > > > > > inline comments below.
> > > > > > > >
> > > > > > > > Great to see interested folks, thanks!
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello everyone,
> > > > > > > > > >
> > > > > > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > > > > > > > 2023 about swap & zswap (hope I am not too late).
> > > > > > > > > >
> > > > > > > > > > ==================== Intro ====================
> > > > > > > > > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > > > > > > > > way. To use zswap, you need a swapfile configured (even if the space
> > > > > > > > > > will not be used) and zswap is restricted by its size. When pages
> > > > > > > > > > reside in zswap, the corresponding swap entry in the swapfile cannot
> > > > > > > > > > be used, and is essentially wasted. We also go through unnecessary
> > > > > > > > > > code paths when using zswap, such as finding and allocating a swap
> > > > > > > > > > entry on the swapout path, or readahead in the swapin path. I am
> > > > > > > > > > proposing a swapping abstraction layer that would allow us to remove
> > > > > > > > > > zswap's dependency on swapfiles. This can be done by introducing a
> > > > > > > > > > data structure between the actual swapping implementation (swapfiles,
> > > > > > > > > > zswap) and the rest of the MM code.
> > > > > > > > > >
> > > > > > > > > > ==================== Objective ====================
> > > > > > > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > > > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > > > > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > > > > > > > swapfile, so the overall swapping capacity increases.
> > > > > > > > > >
> > > > > > > > > > ==================== Idea ====================
> > > > > > > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > > > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > > > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > > > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > > > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > > > > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > > > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > > > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > > > > > > entry. It can also include all non-backend specific operations, such
> > > > > > > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > > > > > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > > > > > > > code and the actual swapping implementation.
> > > > > > > > >
> > > > > > > > > How will the swap_desc be allocated? Dynamically or preallocated? Is
> > > > > > > > > it 1:1 mapped to the swap slots on swap devices (whatever it is
> > > > > > > > > backed, for example, zswap, swap partition, swapfile, etc)?
> > > > > > > >
> > > > > > > > I imagine swap_desc's would be dynamically allocated when we need to
> > > > > > > > swap something out. When allocated, a swap_desc would either point to
> > > > > > > > a zswap_entry (if available), or a swap slot otherwise. In this case,
> > > > > > > > it would be 1:1 mapped to swapped out pages, not the swap slots on
> > > > > > > > devices.
> > > > > > >
> > > > > > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> > > > > > > is used as the back of zswap.
> > > > > > >
> > > > > > > >
> > > > > > > > I know that it might not be ideal to make allocations on the reclaim
> > > > > > > > path (although it would be a small-ish slab allocation so we might be
> > > > > > > > able to get away with it), but otherwise we would have statically
> > > > > > > > allocated swap_desc's for all swap slots on a swap device, even unused
> > > > > > > > ones, which I imagine is too expensive. Also for things like zswap, it
> > > > > > > > doesn't really make sense to preallocate at all.
> > > > > > >
> > > > > > > Yeah, it is not perfect to allocate memory in the reclamation path. We
> > > > > > > do have such cases, but the fewer the better IMHO.
> > > > > >
> > > > > > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
> > > > > > slab cache, idk if that makes sense, or if there is a way to tell slab
> > > > > > to proactively refill a cache.
> > > > > >
> > > > > > I am open to suggestions here. I don't think we should/can preallocate
> > > > > > the swap_desc's, and we cannot completely eliminate the allocations in
> > > > > > the reclaim path. We can only try to minimize them through caching,
> > > > > > etc. Right?
> > > > >
> > > > > Yeah, reallocation should not work. But I'm not sure whether caching
> > > > > works well for this case or not either. I'm supposed that you were
> > > > > thinking about something similar with pcp. When the available number
> > > > > of elements is lower than a threshold, refill the cache. It should
> > > > > work well with moderate memory pressure. But I'm not sure how it would
> > > > > behave with severe memory pressure, particularly when  anonymous
> > > > > memory dominated the memory usage. Or maybe dynamic allocation works
> > > > > well, we are just over-engineered.
> > > >
> > > > Yeah it would be interesting to look into whether the swap_desc
> > > > allocation will be a bottleneck. Definitely something to look out for.
> > > > I share your thoughts about wanting to do something about it but also
> > > > not wanting to over-engineer it.
> > >
> > > I'm not too concerned by this. It's a PF_MEMALLOC allocation, meaning
> > > it's not subject to watermarks. And the swapped page is freed right
> > > afterwards. As long as the compression delta exceeds the size of
> > > swap_desc, the process is a net reduction in allocated memory. For
> > > regular swap, the only requirement is that swap_desc < page_size() :-)
> > >
> > > To put this into perspective, the zswap backends allocate backing
> > > pages on-demand during reclaim. zsmalloc also kmallocs metadata in
> > > that path. We haven't had any issues with this in production, even
> > > under fairly severe memory pressure scenarios.
> >
> > Right. The only problem would be for pages that do not compress well
> > in zswap, in which case we might not end up freeing memory. As you
> > said, this is already happening today with zswap tho.
> >
> > >
> > > > > > > > > > ==================== Benefits ====================
> > > > > > > > > > This work enables using zswap without a backing swapfile and increases
> > > > > > > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > > > > > > a separation that allows us to skip code paths that don't make sense
> > > > > > > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > > > > > > which might result in better performance (less lookups, less lock
> > > > > > > > > > contention).
> > > > > > > > > >
> > > > > > > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > > > > > > removing swapper address spaces, removing swap count continuation
> > > > > > > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > > > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > > > > > > things that are stored in page tables / caches, and for actual swap
> > > > > > > > > > entries. In the future, we can potentially further optimize how we use
> > > > > > > > > > the bits in the page tables instead of sticking everything into the
> > > > > > > > > > current type/offset format.
> > > > > > > > > >
> > > > > > > > > > Another potential win here can be swapoff, which can be more practical
> > > > > > > > > > by directly scanning all swap_desc's instead of going through page
> > > > > > > > > > tables and shmem page caches.
> > > > > > > > > >
> > > > > > > > > > Overall zswap becomes more accessible and available to a wider range
> > > > > > > > > > of use cases.
> > > > > > > > >
> > > > > > > > > How will you handle zswap writeback? Zswap may writeback to the backed
> > > > > > > > > swap device IIUC. Assuming you have both zswap and swapfile, they are
> > > > > > > > > separate devices with this design, right? If so, is the swapfile still
> > > > > > > > > the writeback target of zswap? And if it is the writeback target, what
> > > > > > > > > if swapfile is full?
> > > > > > > >
> > > > > > > > When we try to writeback from zswap, we try to allocate a swap slot in
> > > > > > > > the swapfile, and switch the swap_desc to point to that instead. The
> > > > > > > > process would be transparent to the rest of MM (page tables, page
> > > > > > > > cache, etc). If the swapfile is full, then there's really nothing we
> > > > > > > > can do, reclaim fails and we start OOMing. I imagine this is the same
> > > > > > > > behavior as today when swap is full, the difference would be that we
> > > > > > > > have to fill both zswap AND the swapfile to get to the OOMing point,
> > > > > > > > so an overall increased swapping capacity.
> > > > > > >
> > > > > > > When zswap is full, but swapfile is not yet, will the swap try to
> > > > > > > writeback zswap to swapfile to make more room for zswap or just swap
> > > > > > > out to swapfile directly?
> > > > > > >
> > > > > >
> > > > > > The current behavior is that we swap to swapfile directly in this
> > > > > > case, which is far from ideal as we break LRU ordering by skipping
> > > > > > zswap. I believe this should be addressed, but not as part of this
> > > > > > effort. The work to make zswap respect the LRU ordering by writing
> > > > > > back from zswap to make room can be done orthogonal to this effort. I
> > > > > > believe Johannes was looking into this at some point.
> > >
> > > Actually, zswap already does LRU writeback when the pool is full. Nhat
> > > Pham (CCd) recently upstreamed the LRU implementation for zsmalloc, so
> > > as of today all backends support this.
> > >
> > > There are still a few quirks in zswap that can cause rejections which
> > > bypass the LRU that need fixing. But for the most part LRU writeback
> > > to the backing file is the default behavior.
> >
> > Right, I was specifically talking about this case. When zswap is full
> > it rejects incoming pages and they go directly to the swapfile, but we
> > also kickoff writeback, so this only happens until we do some LRU
> > writeback. I guess I should have been more clear here. Thanks for
> > clarifying and correcting.
> >
> > >
> > > > > Other than breaking LRU ordering, I'm also concerned about the
> > > > > potential deteriorating performance when writing/reading from swapfile
> > > > > when zswap is full. The zswap->swapfile order should be able to
> > > > > maintain a consistent performance for userspace.
> > > >
> > > > Right. This happens today anyway AFAICT, when zswap is full we just
> > > > fallback to writing to swapfile, so this would not be a behavior
> > > > change. I agree it should be addressed anyway.
> > > >
> > > > >
> > > > > But anyway I don't have the data from real life workload to back the
> > > > > above points. If you or Johannes could share some real data, that
> > > > > would be very helpful to make the decisions.
> > > >
> > > > I actually don't, since we mostly run zswap without a backing
> > > > swapfile. Perhaps Johannes might be able to have some data on this (or
> > > > anyone using zswap with a backing swapfile).
> > >
> > > Due to LRU writeback, the latency increase when zswap spills its
> > > coldest entries into backing swap is fairly linear, as you may
> > > expect. We have some limited production data on this from the
> > > webservers.
> > >
> > > The biggest challenge in this space is properly sizing the zswap pool,
> > > such that it's big enough to hold the warm set that the workload is
> > > most latency-sensitive too, yet small enough such that the cold pages
> > > get spilled to backing swap. Nhat is working on improving this.
> > >
> > > That said, I think this discussion is orthogonal to the proposed
> > > topic. zswap spills to backing swap in LRU order as of today. The
> > > LRU/pool size tweaking is an optimization to get smarter zswap/swap
> > > placement according to access frequency. The proposed swap descriptor
> > > is an optimization to get better disk utilization, the ability to run
> > > zswap without backing swap, and a dramatic speedup in swapoff time.
> >
> > Fully agree.
> >
> > >
> > > > > > > > > Anyway I'm interested in attending the discussion for this topic.
> > > > > > > >
> > > > > > > > Great! Looking forward to discuss this more!
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ==================== Cost ====================
> > > > > > > > > > The obvious downside of this is added memory overhead, specifically
> > > > > > > > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > > > > > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > > > > > > > continuation), we pay the size of the swap_desc for every page that is
> > > > > > > > > > actually in the swapfile, which I am estimating can be roughly around
> > > > > > > > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > > > > > > > scales with pages actually swapped out. For zswap users, it should be
> > > > > > > > > > a win (or at least even) because we get to drop a lot of fields from
> > > > > > > > > > struct zswap_entry (e.g. rbtree, index, etc).
> > >
> > > Shifting the cost from O(swapspace) to O(swapped) could be a win for
> > > many regular swap users too.
> > >
> > > There are the legacy setups that provision 2*RAM worth of swap as an
> > > emergency overflow that is then rarely used.
> > >
> > > We have a setups that swap to disk more proactively, but we also
> > > overprovision those in terms of swap space due to the cliff behavior
> > > when swap fills up and the VM runs out of options.
> > >
> > > To make a fair comparison, you really have to take average swap
> > > utilization into account. And I doubt that's very high.
> >
> > Yeah I was looking for some data here, but it varies heavily based on
> > the use case, so I opted to only state the overhead of the swap
> > descriptor without directly comparing it to the current overhead.
> >
> > >
> > > In terms of worst-case behavior, +0.8% per swapped page doesn't sound
> > > like a show-stopper to me. Especially when compared to zswap's current
> > > O(swapped) waste of disk space.
> >
> > Yeah for zswap users this should be a win on most/all fronts, even
> > memory overhead, as we will end up trimming struct zswap_entry which
> > is also O(swapped) memory overhead. It should also make zswap
> > available for more use cases. You don't need to provision and
> > configure swap space, you just need to turn zswap on.
> >
> > >
> > > > > > > > > > Another potential concern is readahead. With this design, we have no
> > > > > > > > > > way to get a swap_desc given a swap entry (type & offset). We would
> > > > > > > > > > need to maintain a reverse mapping, adding a little bit more overhead,
> > > > > > > > > > or search all swapped out pages instead :). A reverse mapping might
> > > > > > > > > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> > > > > > > > > > memory).
> > > > > > > > > >
> > > > > > > > > > ==================== Bottom Line ====================
> > > > > > > > > > It would be nice to discuss the potential here and the tradeoffs. I
> > > > > > > > > > know that other folks using zswap (or interested in using it) may find
> > > > > > > > > > this very useful. I am sure I am missing some context on why things
> > > > > > > > > > are the way they are, and perhaps some obvious holes in my story.
> > > > > > > > > > Looking forward to discussing this with anyone interested :)
> > > > > > > > > >
> > > > > > > > > > I think Johannes may be interested in attending this discussion, since
> > > > > > > > > > a lot of ideas here are inspired by discussions I had with him :)
>
> Hi everyone,
>
> I came across this interesting proposal and I would like to
> participate in the discussion. I think it will be useful/overlap with
> some projects we are currently planning in Android.

Great to see more interested folks! Looking forward to discussing that!

>
> Thanks,
> Kalesh
>
> > >
> > > Thanks!
> >


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-28  4:54 ` Sergey Senozhatsky
@ 2023-02-28  8:12   ` Yosry Ahmed
  2023-02-28 23:29     ` Minchan Kim
  2023-03-01 10:44     ` Sergey Senozhatsky
  0 siblings, 2 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-02-28  8:12 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton

On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (23/02/18 14:38), Yosry Ahmed wrote:
> [..]
> > ==================== Idea ====================
> > Introduce a data structure, which I currently call a swap_desc, as an
> > abstraction layer between swapping implementation and the rest of MM
> > code. Page tables & page caches would store a swap id (encoded as a
> > swp_entry_t) instead of directly storing the swap entry associated
> > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > as our abstraction layer. All MM code not concerned with swapping
> > details would operate in terms of swap descs. The swap_desc can point
> > to either a normal swap entry (associated with a swapfile) or a zswap
> > entry. It can also include all non-backend specific operations, such
> > as the swapcache (which would be a simple pointer in swap_desc), swap
> > counting, etc. It creates a clear, nice abstraction layer between MM
> > code and the actual swapping implementation.
> >
> > ==================== Benefits ====================
> > This work enables using zswap without a backing swapfile and increases
> > the swap capacity when zswap is used with a swapfile. It also creates
> > a separation that allows us to skip code paths that don't make sense
> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > which might result in better performance (less lookups, less lock
> > contention).
> >
> > The abstraction layer also opens the door for multiple cleanups (e.g.
> > removing swapper address spaces, removing swap count continuation
> > code, etc). Another nice cleanup that this work enables would be
> > separating the overloaded swp_entry_t into two distinct types: one for
> > things that are stored in page tables / caches, and for actual swap
> > entries. In the future, we can potentially further optimize how we use
> > the bits in the page tables instead of sticking everything into the
> > current type/offset format.
> >
> > Another potential win here can be swapoff, which can be more practical
> > by directly scanning all swap_desc's instead of going through page
> > tables and shmem page caches.
> >
> > Overall zswap becomes more accessible and available to a wider range
> > of use cases.
>
> I assume this also brings us closer to a proper writeback LRU handling?

I assume by proper LRU handling you mean:
- Swap writeback LRU that lives outside of the zpool backends (i.e in
zswap itself or even outside zswap).
- Fix the case where we temporarily skip zswap and write directly to
the backing swapfile while zswap is full, until it performs some
writeback in the background.

This work is orthogonal to that, but it is on the list of things that
we would like to do for zswap.

I guess you are mainly eager to move the writeback logic outside of
zsmalloc, or is there a different motivation? :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-18 22:38 [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap Yosry Ahmed
                   ` (2 preceding siblings ...)
  2023-02-28  4:54 ` Sergey Senozhatsky
@ 2023-02-28 23:11 ` Chris Li
  2023-03-02  0:30   ` Yosry Ahmed
  2023-03-10  2:07 ` Luis Chamberlain
  2023-05-12  3:07 ` Yosry Ahmed
  5 siblings, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-02-28 23:11 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton

Hi Yosry,

On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> Hello everyone,
> 
> I would like to propose a topic for the upcoming LSF/MM/BPF in May
> 2023 about swap & zswap (hope I am not too late).

I am very interested in participating in this discussion as well.

> ==================== Objective ====================
> Enabling the use of zswap without a backing swapfile, which makes
> zswap useful for a wider variety of use cases. Also, when zswap is
> used with a swapfile, the pages in zswap do not use up space in the
> swapfile, so the overall swapping capacity increases.

Agree.

> 
> ==================== Idea ====================
> Introduce a data structure, which I currently call a swap_desc, as an
> abstraction layer between swapping implementation and the rest of MM
> code. Page tables & page caches would store a swap id (encoded as a
> swp_entry_t) instead of directly storing the swap entry associated
> with the swapfile. This swap id maps to a struct swap_desc, which acts

Can you provide a bit more detail? I am curious how this swap id
maps into the swap_desc? Is the swp_entry_t cast into "struct
swap_desc*" or going through some lookup table/tree?

> as our abstraction layer. All MM code not concerned with swapping
> details would operate in terms of swap descs. The swap_desc can point
> to either a normal swap entry (associated with a swapfile) or a zswap
> entry. It can also include all non-backend specific operations, such
> as the swapcache (which would be a simple pointer in swap_desc), swap

Does the zswap entry still use the swap slot cache and swap_info_struct?

> This work enables using zswap without a backing swapfile and increases
> the swap capacity when zswap is used with a swapfile. It also creates
> a separation that allows us to skip code paths that don't make sense
> in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> which might result in better performance (less lookups, less lock
> contention).
> 
> The abstraction layer also opens the door for multiple cleanups (e.g.
> removing swapper address spaces, removing swap count continuation
> code, etc). Another nice cleanup that this work enables would be
> separating the overloaded swp_entry_t into two distinct types: one for
> things that are stored in page tables / caches, and for actual swap
> entries. In the future, we can potentially further optimize how we use
> the bits in the page tables instead of sticking everything into the
> current type/offset format.

Looking forward to seeing more details in the upcoming discussion.
> 
> ==================== Cost ====================
> The obvious downside of this is added memory overhead, specifically
> for users that use swapfiles without zswap. Instead of paying one byte
> (swap_map) for every potential page in the swapfile (+ swap count
> continuation), we pay the size of the swap_desc for every page that is
> actually in the swapfile, which I am estimating can be roughly around
> 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> scales with pages actually swapped out. For zswap users, it should be

Is there a way to avoid turning 1 byte into 24 byte per swapped
pages? For the users that use swap but no zswap, this is pure overhead.

It seems what you really need is one bit of information to indicate
this page is backed by zswap. Then you can have a seperate pointer
for the zswap entry.

Depending on how much you are going to reuse the swap cache, you might
need to have something like a swap_info_struct to keep the locks happy.

> Another potential concern is readahead. With this design, we have no

Readahead is for spinning disk :-) Even a normal swap file with an SSD can
use some modernization.

Looking forward to your discussion.

Chris



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-19  4:31 ` Matthew Wilcox
  2023-02-19  9:34   ` Yosry Ahmed
@ 2023-02-28 23:22   ` Chris Li
  2023-03-01  0:08     ` Matthew Wilcox
  1 sibling, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-02-28 23:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Huang Ying, NeilBrown

Hi Matthew,

On Sun, Feb 19, 2023 at 04:31:33AM +0000, Matthew Wilcox wrote:
> 
> I think an overhaul of the swap code is long overdue.  I appreciate
> you're very much focused on zswap, but there are many other problems.
> For example, swap does not work on zoned devices.  Swap readahead is
> generally physical (ie optimised for spinning discs) rather than logical
> (more appropriate for SSDs).  Swap's management of free space is crude
> compared to real filesystems.  The way that swap bypasses the filesystem
> when writing to swap files is awful.  I haven't even started to look at

Can you expand a bit on that? I assume you want to see the swap file
behavior more like a normal file system and reuse more of the readpage()
and writepage() path.

> what changes need to be made to swap in order to swap out arbitrary-order
> folios (instead of PMD-sized + PTE-sized).

When the page fault happens, does the whole folios get swapped in or break
into smaller pages?

> I'm probably not a great person to participate in the design of a
> replacement system.  I don't know nearly enough about anonymous memory.
> I'd be sitting in the back shouting unhelpful things like, "Can't you
> see an anon_vma is the exact same thing as an inode?"  and "Why don't
> we steal the block allocation functions from XFS?"  and "Why do tmpfs

I notice the swap_map has one byte per swap entry even the swap is not
used.

> pages have to move to the swap cache; can't we just leave them in the
> page cache and pass them to the swap code directly?"

All great suggestions and I am very interested in that.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-28  8:12   ` Yosry Ahmed
@ 2023-02-28 23:29     ` Minchan Kim
  2023-03-02  0:58       ` Yosry Ahmed
  2023-03-02 16:58       ` Chris Li
  2023-03-01 10:44     ` Sergey Senozhatsky
  1 sibling, 2 replies; 105+ messages in thread
From: Minchan Kim @ 2023-02-28 23:29 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Sergey Senozhatsky, lsf-pc, Johannes Weiner, Linux-MM,
	Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
	Andrew Morton

Hi Yosry,

On Tue, Feb 28, 2023 at 12:12:05AM -0800, Yosry Ahmed wrote:
> On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky
> <senozhatsky@chromium.org> wrote:
> >
> > On (23/02/18 14:38), Yosry Ahmed wrote:
> > [..]
> > > ==================== Idea ====================
> > > Introduce a data structure, which I currently call a swap_desc, as an
> > > abstraction layer between swapping implementation and the rest of MM
> > > code. Page tables & page caches would store a swap id (encoded as a
> > > swp_entry_t) instead of directly storing the swap entry associated
> > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > as our abstraction layer. All MM code not concerned with swapping
> > > details would operate in terms of swap descs. The swap_desc can point
> > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > entry. It can also include all non-backend specific operations, such
> > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > code and the actual swapping implementation.
> > >
> > > ==================== Benefits ====================
> > > This work enables using zswap without a backing swapfile and increases
> > > the swap capacity when zswap is used with a swapfile. It also creates
> > > a separation that allows us to skip code paths that don't make sense
> > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > which might result in better performance (less lookups, less lock
> > > contention).
> > >
> > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > removing swapper address spaces, removing swap count continuation
> > > code, etc). Another nice cleanup that this work enables would be
> > > separating the overloaded swp_entry_t into two distinct types: one for
> > > things that are stored in page tables / caches, and for actual swap
> > > entries. In the future, we can potentially further optimize how we use
> > > the bits in the page tables instead of sticking everything into the
> > > current type/offset format.
> > >
> > > Another potential win here can be swapoff, which can be more practical
> > > by directly scanning all swap_desc's instead of going through page
> > > tables and shmem page caches.
> > >
> > > Overall zswap becomes more accessible and available to a wider range
> > > of use cases.
> >
> > I assume this also brings us closer to a proper writeback LRU handling?
> 
> I assume by proper LRU handling you mean:
> - Swap writeback LRU that lives outside of the zpool backends (i.e in
> zswap itself or even outside zswap).

Even outside zswap to support any combination on any heterogenous
multiple swap device configuration.

The indirection layer would be essential to support it but it would
be also great if we don't waste any memory for the user who don't
want the feature.

Just FYI, there was similar discussion long time ago about the
indirection layer.
https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-28 23:22   ` Chris Li
@ 2023-03-01  0:08     ` Matthew Wilcox
  2023-03-01 23:22       ` Chris Li
  0 siblings, 1 reply; 105+ messages in thread
From: Matthew Wilcox @ 2023-03-01  0:08 UTC (permalink / raw)
  To: Chris Li
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Huang Ying, NeilBrown

On Tue, Feb 28, 2023 at 03:22:20PM -0800, Chris Li wrote:
> Hi Matthew,
> 
> On Sun, Feb 19, 2023 at 04:31:33AM +0000, Matthew Wilcox wrote:
> > 
> > I think an overhaul of the swap code is long overdue.  I appreciate
> > you're very much focused on zswap, but there are many other problems.
> > For example, swap does not work on zoned devices.  Swap readahead is
> > generally physical (ie optimised for spinning discs) rather than logical
> > (more appropriate for SSDs).  Swap's management of free space is crude
> > compared to real filesystems.  The way that swap bypasses the filesystem
> > when writing to swap files is awful.  I haven't even started to look at
> 
> Can you expand a bit on that? I assume you want to see the swap file
> behavior more like a normal file system and reuse more of the readpage()
> and writepage() path.

Actually, no, readpage() and writepage() should be reserved for
page cache.  We now have a ->swap_rw(), but it's only implemented by
nfs so far.  Instead of constructing its own BIOs, swap should invoke
->swap_rw for every filesystem.  I suspect we can do a fairly generic
block_swap_rw() for the vast majority of filesystems.

> > what changes need to be made to swap in order to swap out arbitrary-order
> > folios (instead of PMD-sized + PTE-sized).
> 
> When the page fault happens, does the whole folios get swapped in or break
> into smaller pages?

I think the whole folio should be swapped in.  See my proposal for
determining the correct size folio to use here:
https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/

Assuming something like that gets implemented, for a large folio to
be swapped out, we've had a selection of page faults on the folio,
followed by a period of no faults.  All of a sudden we have a fault,
so I think we should bring the whole folio back in.  The algorithm I
outline in that email would then take care of breaking down the folio
into smaller folios if it turns out they're not used.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-28  8:12   ` Yosry Ahmed
  2023-02-28 23:29     ` Minchan Kim
@ 2023-03-01 10:44     ` Sergey Senozhatsky
  2023-03-02  1:01       ` Yosry Ahmed
  1 sibling, 1 reply; 105+ messages in thread
From: Sergey Senozhatsky @ 2023-03-01 10:44 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Sergey Senozhatsky, lsf-pc, Johannes Weiner, Linux-MM,
	Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
	Minchan Kim, Andrew Morton

On (23/02/28 00:12), Yosry Ahmed wrote:
> 
> I assume by proper LRU handling you mean:
> - Swap writeback LRU that lives outside of the zpool backends (i.e in
> zswap itself or even outside zswap).
> - Fix the case where we temporarily skip zswap and write directly to
> the backing swapfile while zswap is full, until it performs some
> writeback in the background.
> 
> This work is orthogonal to that, but it is on the list of things that
> we would like to do for zswap.

Oh, sorry for the noise then. I somehow thought that one leads to
another in some way, probably got that impression from offline
discussions.

> I guess you are mainly eager to move the writeback logic outside of
> zsmalloc, or is there a different motivation? :)

Not eager, but we've been promised that! :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-01  0:08     ` Matthew Wilcox
@ 2023-03-01 23:22       ` Chris Li
  0 siblings, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-01 23:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Huang Ying, NeilBrown

On Wed, Mar 01, 2023 at 12:08:17AM +0000, Matthew Wilcox wrote:
> > Can you expand a bit on that? I assume you want to see the swap file
> > behavior more like a normal file system and reuse more of the readpage()
> > and writepage() path.
> 
> Actually, no, readpage() and writepage() should be reserved for
> page cache.  We now have a ->swap_rw(), but it's only implemented by
> nfs so far.  Instead of constructing its own BIOs, swap should invoke
> ->swap_rw for every filesystem.  I suspect we can do a fairly generic
> block_swap_rw() for the vast majority of filesystems.

The swap_rw() is for the file system backing the swap file. That is more
close to the back end IO side.

In the case of zswap, it it can't be implemented as a simple file system
layer because the vma can only belong to one file system. Zswap can back
some of the page in a vmx but not the other. It will require some support
before hitting the swap_rw() paging path.

BTW, current code the swap_rw() is called from swap_writepage() which
is part of the writepage() call as well.

> > When the page fault happens, does the whole folios get swapped in or break
> > into smaller pages?
> 
> I think the whole folio should be swapped in.  See my proposal for
> determining the correct size folio to use here:
> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
> 
> Assuming something like that gets implemented, for a large folio to
> be swapped out, we've had a selection of page faults on the folio,
> followed by a period of no faults.  All of a sudden we have a fault,
> so I think we should bring the whole folio back in.  The algorithm I
> outline in that email would then take care of breaking down the folio
> into smaller folios if it turns out they're not used.

One side effect is that the fault might bring in more pages than it
absolutely necessary. Might want to collect some data on that to
see the real impact.

Chris




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-28 23:11 ` Chris Li
@ 2023-03-02  0:30   ` Yosry Ahmed
  2023-03-02  1:00     ` Yosry Ahmed
                       ` (3 more replies)
  0 siblings, 4 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-02  0:30 UTC (permalink / raw)
  To: Chris Li
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton

On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > Hello everyone,
> >
> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > 2023 about swap & zswap (hope I am not too late).
>
> I am very interested in participating in this discussion as well.

That's great to hear!

>
> > ==================== Objective ====================
> > Enabling the use of zswap without a backing swapfile, which makes
> > zswap useful for a wider variety of use cases. Also, when zswap is
> > used with a swapfile, the pages in zswap do not use up space in the
> > swapfile, so the overall swapping capacity increases.
>
> Agree.
>
> >
> > ==================== Idea ====================
> > Introduce a data structure, which I currently call a swap_desc, as an
> > abstraction layer between swapping implementation and the rest of MM
> > code. Page tables & page caches would store a swap id (encoded as a
> > swp_entry_t) instead of directly storing the swap entry associated
> > with the swapfile. This swap id maps to a struct swap_desc, which acts
>
> Can you provide a bit more detail? I am curious how this swap id
> maps into the swap_desc? Is the swp_entry_t cast into "struct
> swap_desc*" or going through some lookup table/tree?

swap id would be an index in a radix tree (aka xarray), which contains
a pointer to the swap_desc struct. This lookup should be free with
this design as we also use swap_desc to directly store the swap cache
pointer, so this lookup essentially replaces the swap cache lookup.

>
> > as our abstraction layer. All MM code not concerned with swapping
> > details would operate in terms of swap descs. The swap_desc can point
> > to either a normal swap entry (associated with a swapfile) or a zswap
> > entry. It can also include all non-backend specific operations, such
> > as the swapcache (which would be a simple pointer in swap_desc), swap
>
> Does the zswap entry still use the swap slot cache and swap_info_struct?

In this design no, it shouldn't.

>
> > This work enables using zswap without a backing swapfile and increases
> > the swap capacity when zswap is used with a swapfile. It also creates
> > a separation that allows us to skip code paths that don't make sense
> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > which might result in better performance (less lookups, less lock
> > contention).
> >
> > The abstraction layer also opens the door for multiple cleanups (e.g.
> > removing swapper address spaces, removing swap count continuation
> > code, etc). Another nice cleanup that this work enables would be
> > separating the overloaded swp_entry_t into two distinct types: one for
> > things that are stored in page tables / caches, and for actual swap
> > entries. In the future, we can potentially further optimize how we use
> > the bits in the page tables instead of sticking everything into the
> > current type/offset format.
>
> Looking forward to seeing more details in the upcoming discussion.
> >
> > ==================== Cost ====================
> > The obvious downside of this is added memory overhead, specifically
> > for users that use swapfiles without zswap. Instead of paying one byte
> > (swap_map) for every potential page in the swapfile (+ swap count
> > continuation), we pay the size of the swap_desc for every page that is
> > actually in the swapfile, which I am estimating can be roughly around
> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > scales with pages actually swapped out. For zswap users, it should be
>
> Is there a way to avoid turning 1 byte into 24 byte per swapped
> pages? For the users that use swap but no zswap, this is pure overhead.

That's what I could think of at this point. My idea was something like this:

struct swap_desc {
    union { /* Use one bit to distinguish them */
        swp_entry_t swap_entry;
        struct zswap_entry *zswap_entry;
    };
    struct folio *swapcache;
    atomic_t swap_count;
    u32 id;
}

Having the id in the swap_desc is convenient as we can directly map
the swap_desc to a swp_entry_t to place in the page tables, but I
don't think it's necessary. Without it, the struct size is 20 bytes,
so I think the extra 4 bytes are okay to use anyway if the slab
allocator only allocates multiples of 8 bytes.

The idea here is to unify the swapcache and swap_count implementation
between different swap backends (swapfiles, zswap, etc), which would
create a better abstraction and reduce reinventing the wheel.

We can reduce to only 8 bytes and only store the swap/zswap entry, but
we still need the swap cache anyway so might as well just store the
pointer in the struct and have a unified lookup-free swapcache, so
really 16 bytes is the minimum.

If we stop at 16 bytes, then we need to handle swap count separately
in swapfiles and zswap. This is not the end of the world, but are the
8 bytes worth this?

Keep in mind that the current overhead is 1 byte O(max swap pages) not
O(swapped). Also, 1 byte is assuming we do not use the swap
continuation pages. If we do, it may end up being more. We also
allocate continuation in full 4k pages, so even if one swap_map
element in a page requires continuation, we will allocate an entire
page. What I am trying to say is that to get an actual comparison you
need to also factor in the swap utilization and the rate of usage of
swap continuation. I don't know how to come up with a formula for this
tbh.

Also, like Johannes said, the worst case overhead (32 bytes if you
count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
1G swapped. It doesn't sound *very* bad. I understand that it is pure
overhead for people not using zswap, but it is not very awful.

>
> It seems what you really need is one bit of information to indicate
> this page is backed by zswap. Then you can have a seperate pointer
> for the zswap entry.

If you use one bit in swp_entry_t (or one of the available swap types)
to indicate whether the page is backed with a swapfile or zswap it
doesn't really work. We lose the indirection layer. How do we move the
page from zswap to swapfile? We need to go update the page tables and
the shmem page cache, similar to swapoff.

Instead, if we store a key else in swp_entry_t and use this to lookup
the swp_entry_t or zswap_entry pointer then that's essentially what
the swap_desc does. It just goes the extra mile of unifying the
swapcache as well and storing it directly in the swap_desc instead of
storing it in another lookup structure.

>
> Depending on how much you are going to reuse the swap cache, you might
> need to have something like a swap_info_struct to keep the locks happy.

My current intention is to reimplement the swapcache completely as a
pointer in struct swap_desc. This would eliminate this need and a lot
of the locking we do today if I get things right.

>
> > Another potential concern is readahead. With this design, we have no
>
> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> use some modernization.

Yeah, I initially thought we would only need the swp_entry_t ->
swap_desc reverse mapping for readahead, and that we can only store
that for spinning disks, but I was wrong. We need for other things as
well today: swapoff, when trying to find an empty swap slot and we
start trying to free swap slots used only by the swapcache. However, I
think both of these cases can be fixed (I can share more details if
you want). If everything goes well we should only need to maintain the
reverse mapping (extra overhead above 24 bytes) for swap files on
spinning disks for readahead.

>
> Looking forward to your discussion.
>
> Chris
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-28 23:29     ` Minchan Kim
@ 2023-03-02  0:58       ` Yosry Ahmed
  2023-03-02  1:25         ` Yosry Ahmed
                           ` (2 more replies)
  2023-03-02 16:58       ` Chris Li
  1 sibling, 3 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-02  0:58 UTC (permalink / raw)
  To: Minchan Kim, Johannes Weiner
  Cc: Sergey Senozhatsky, lsf-pc, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Andrew Morton

On Tue, Feb 28, 2023 at 3:29 PM Minchan Kim <minchan@kernel.org> wrote:
>
> Hi Yosry,
>
> On Tue, Feb 28, 2023 at 12:12:05AM -0800, Yosry Ahmed wrote:
> > On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky
> > <senozhatsky@chromium.org> wrote:
> > >
> > > On (23/02/18 14:38), Yosry Ahmed wrote:
> > > [..]
> > > > ==================== Idea ====================
> > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > abstraction layer between swapping implementation and the rest of MM
> > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > as our abstraction layer. All MM code not concerned with swapping
> > > > details would operate in terms of swap descs. The swap_desc can point
> > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > entry. It can also include all non-backend specific operations, such
> > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > code and the actual swapping implementation.
> > > >
> > > > ==================== Benefits ====================
> > > > This work enables using zswap without a backing swapfile and increases
> > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > a separation that allows us to skip code paths that don't make sense
> > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > which might result in better performance (less lookups, less lock
> > > > contention).
> > > >
> > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > removing swapper address spaces, removing swap count continuation
> > > > code, etc). Another nice cleanup that this work enables would be
> > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > things that are stored in page tables / caches, and for actual swap
> > > > entries. In the future, we can potentially further optimize how we use
> > > > the bits in the page tables instead of sticking everything into the
> > > > current type/offset format.
> > > >
> > > > Another potential win here can be swapoff, which can be more practical
> > > > by directly scanning all swap_desc's instead of going through page
> > > > tables and shmem page caches.
> > > >
> > > > Overall zswap becomes more accessible and available to a wider range
> > > > of use cases.
> > >
> > > I assume this also brings us closer to a proper writeback LRU handling?
> >
> > I assume by proper LRU handling you mean:
> > - Swap writeback LRU that lives outside of the zpool backends (i.e in
> > zswap itself or even outside zswap).
>
> Even outside zswap to support any combination on any heterogenous
> multiple swap device configuration.

Agreed, this is the end goal for the writeback LRU.

>
> The indirection layer would be essential to support it but it would
> be also great if we don't waste any memory for the user who don't
> want the feature.

I can't currently think of a way to eliminate overhead for people only
using swapfiles, as a lot of the core implementation changes, unless
we want to maintain considerably more code with a lot of repeated
functionality implemented differently. Perhaps this will change as I
implement this, maybe things are better (or worse) than what I think
they are, I am actively working on a proof-of-concept right now. Maybe
a discussion in LSF/MM/BPF will help come up with optimizations as
well :)

>
> Just FYI, there was similar discussion long time ago about the
> indirection layer.
> https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/

Yeah Hugh shared this one with me earlier, but there are a few things
that I don't understand how they would work, at least in today's
world.

Firstly, the proposal suggests that we store a radix tree index in the
page tables, and in the radix tree store the swap entry AND the swap
count. I am not really sure how they would fit in 8 bytes, especially
if we need continuation and 1 byte is not enough for the swap count.
Continuation logic now depends on linking vmalloc'd pages using the
lru field in struct page/folio. Perhaps we can figure out a split that
gives enough space for swap count without continuation while also not
limiting swapfile sizes too much.

Secondly, IIUC in that proposal once we swap a page in, we free the
swap entry and add the swapcache page to the radix tree instead. In
that case, where does the swap count go? IIUC we still need to
maintain it to be able to tell when all processes mapping the page
have faulted it back, otherwise the radix tree entry is maintained
indefinitely. We can maybe stash the swap count somewhere else in this
case, and bring it back to the radix tree if we swap the page out
again. Not really sure where, we can have a separate radix tree for
swap counts when the page is in swapcache, or we can always have it in
a separate radix tree so that the swap entry fits comfortably in the
first radix tree.

To be able to accomodate zswap in this design, I think we always need
a separate radix tree for swap counts. In that case, one radix tree
contains swap_entry/zswap_entry/swapcache, and the other radix tree
contains the swap count. I think this may work, but I am not sure if
the overhead of always doing a lookup to read the swap count is okay.
I am also sure there would be some fun synchronization problems
between both trees (but we already need to synchronize today between
the swapcache and swap counts?).

It sounds like it is possible to make it work. I will spend some time
thinking about it. Having 2 radix trees also solves the 32-bit systems
problem, but I am not sure if it's a generally better design. Radix
trees also take up some extra space other than the entry size itself,
so I am not sure how much memory we would end up actually saving.

Johannes, I am curious if you have any thoughts about this alternative design?


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02  0:30   ` Yosry Ahmed
@ 2023-03-02  1:00     ` Yosry Ahmed
  2023-03-02 16:51     ` Chris Li
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-02  1:00 UTC (permalink / raw)
  To: Chris Li
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton

On Wed, Mar 1, 2023 at 4:30 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Yosry,
> >
> > On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > > Hello everyone,
> > >
> > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > 2023 about swap & zswap (hope I am not too late).
> >
> > I am very interested in participating in this discussion as well.
>
> That's great to hear!
>
> >
> > > ==================== Objective ====================
> > > Enabling the use of zswap without a backing swapfile, which makes
> > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > used with a swapfile, the pages in zswap do not use up space in the
> > > swapfile, so the overall swapping capacity increases.
> >
> > Agree.
> >
> > >
> > > ==================== Idea ====================
> > > Introduce a data structure, which I currently call a swap_desc, as an
> > > abstraction layer between swapping implementation and the rest of MM
> > > code. Page tables & page caches would store a swap id (encoded as a
> > > swp_entry_t) instead of directly storing the swap entry associated
> > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> >
> > Can you provide a bit more detail? I am curious how this swap id
> > maps into the swap_desc? Is the swp_entry_t cast into "struct
> > swap_desc*" or going through some lookup table/tree?
>
> swap id would be an index in a radix tree (aka xarray), which contains
> a pointer to the swap_desc struct. This lookup should be free with
> this design as we also use swap_desc to directly store the swap cache
> pointer, so this lookup essentially replaces the swap cache lookup.
>
> >
> > > as our abstraction layer. All MM code not concerned with swapping
> > > details would operate in terms of swap descs. The swap_desc can point
> > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > entry. It can also include all non-backend specific operations, such
> > > as the swapcache (which would be a simple pointer in swap_desc), swap
> >
> > Does the zswap entry still use the swap slot cache and swap_info_struct?
>
> In this design no, it shouldn't.
>
> >
> > > This work enables using zswap without a backing swapfile and increases
> > > the swap capacity when zswap is used with a swapfile. It also creates
> > > a separation that allows us to skip code paths that don't make sense
> > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > which might result in better performance (less lookups, less lock
> > > contention).
> > >
> > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > removing swapper address spaces, removing swap count continuation
> > > code, etc). Another nice cleanup that this work enables would be
> > > separating the overloaded swp_entry_t into two distinct types: one for
> > > things that are stored in page tables / caches, and for actual swap
> > > entries. In the future, we can potentially further optimize how we use
> > > the bits in the page tables instead of sticking everything into the
> > > current type/offset format.
> >
> > Looking forward to seeing more details in the upcoming discussion.
> > >
> > > ==================== Cost ====================
> > > The obvious downside of this is added memory overhead, specifically
> > > for users that use swapfiles without zswap. Instead of paying one byte
> > > (swap_map) for every potential page in the swapfile (+ swap count
> > > continuation), we pay the size of the swap_desc for every page that is
> > > actually in the swapfile, which I am estimating can be roughly around
> > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > scales with pages actually swapped out. For zswap users, it should be
> >
> > Is there a way to avoid turning 1 byte into 24 byte per swapped
> > pages? For the users that use swap but no zswap, this is pure overhead.
>
> That's what I could think of at this point. My idea was something like this:
>
> struct swap_desc {
>     union { /* Use one bit to distinguish them */
>         swp_entry_t swap_entry;
>         struct zswap_entry *zswap_entry;
>     };
>     struct folio *swapcache;
>     atomic_t swap_count;
>     u32 id;
> }
>
> Having the id in the swap_desc is convenient as we can directly map
> the swap_desc to a swp_entry_t to place in the page tables, but I
> don't think it's necessary. Without it, the struct size is 20 bytes,
> so I think the extra 4 bytes are okay to use anyway if the slab
> allocator only allocates multiples of 8 bytes.
>
> The idea here is to unify the swapcache and swap_count implementation
> between different swap backends (swapfiles, zswap, etc), which would
> create a better abstraction and reduce reinventing the wheel.
>
> We can reduce to only 8 bytes and only store the swap/zswap entry, but
> we still need the swap cache anyway so might as well just store the
> pointer in the struct and have a unified lookup-free swapcache, so
> really 16 bytes is the minimum.
>
> If we stop at 16 bytes, then we need to handle swap count separately
> in swapfiles and zswap. This is not the end of the world, but are the
> 8 bytes worth this?
>
> Keep in mind that the current overhead is 1 byte O(max swap pages) not
> O(swapped). Also, 1 byte is assuming we do not use the swap
> continuation pages. If we do, it may end up being more. We also
> allocate continuation in full 4k pages, so even if one swap_map
> element in a page requires continuation, we will allocate an entire
> page. What I am trying to say is that to get an actual comparison you
> need to also factor in the swap utilization and the rate of usage of
> swap continuation. I don't know how to come up with a formula for this
> tbh.
>
> Also, like Johannes said, the worst case overhead (32 bytes if you
> count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> overhead for people not using zswap, but it is not very awful.

Oh I forgot. I think the 24 bytes *might* actually be reduced to 16
bytes if we free the underlying swap entry / zswap entry once we add
the page to the swapcache. I did not post anything about it yet as I
am still thinking if there might be any synchronization problems with
this approach, but I will try it out.

>
> >
> > It seems what you really need is one bit of information to indicate
> > this page is backed by zswap. Then you can have a seperate pointer
> > for the zswap entry.
>
> If you use one bit in swp_entry_t (or one of the available swap types)
> to indicate whether the page is backed with a swapfile or zswap it
> doesn't really work. We lose the indirection layer. How do we move the
> page from zswap to swapfile? We need to go update the page tables and
> the shmem page cache, similar to swapoff.
>
> Instead, if we store a key else in swp_entry_t and use this to lookup
> the swp_entry_t or zswap_entry pointer then that's essentially what
> the swap_desc does. It just goes the extra mile of unifying the
> swapcache as well and storing it directly in the swap_desc instead of
> storing it in another lookup structure.
>
> >
> > Depending on how much you are going to reuse the swap cache, you might
> > need to have something like a swap_info_struct to keep the locks happy.
>
> My current intention is to reimplement the swapcache completely as a
> pointer in struct swap_desc. This would eliminate this need and a lot
> of the locking we do today if I get things right.
>
> >
> > > Another potential concern is readahead. With this design, we have no
> >
> > Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> > use some modernization.
>
> Yeah, I initially thought we would only need the swp_entry_t ->
> swap_desc reverse mapping for readahead, and that we can only store
> that for spinning disks, but I was wrong. We need for other things as
> well today: swapoff, when trying to find an empty swap slot and we
> start trying to free swap slots used only by the swapcache. However, I
> think both of these cases can be fixed (I can share more details if
> you want). If everything goes well we should only need to maintain the
> reverse mapping (extra overhead above 24 bytes) for swap files on
> spinning disks for readahead.
>
> >
> > Looking forward to your discussion.
> >
> > Chris
> >


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-01 10:44     ` Sergey Senozhatsky
@ 2023-03-02  1:01       ` Yosry Ahmed
  0 siblings, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-02  1:01 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton

On Wed, Mar 1, 2023 at 2:45 AM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (23/02/28 00:12), Yosry Ahmed wrote:
> >
> > I assume by proper LRU handling you mean:
> > - Swap writeback LRU that lives outside of the zpool backends (i.e in
> > zswap itself or even outside zswap).
> > - Fix the case where we temporarily skip zswap and write directly to
> > the backing swapfile while zswap is full, until it performs some
> > writeback in the background.
> >
> > This work is orthogonal to that, but it is on the list of things that
> > we would like to do for zswap.
>
> Oh, sorry for the noise then. I somehow thought that one leads to
> another in some way, probably got that impression from offline
> discussions.
>
> > I guess you are mainly eager to move the writeback logic outside of
> > zsmalloc, or is there a different motivation? :)
>
> Not eager, but we've been promised that! :)

It still stands as the end goal for swap writeback LRUs :) As Minchan
said, the abstraction layer helps with a generic writeback LRU outside
of zswap.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02  0:58       ` Yosry Ahmed
@ 2023-03-02  1:25         ` Yosry Ahmed
  2023-03-02 17:05         ` Chris Li
  2023-03-02 17:47         ` Chris Li
  2 siblings, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-02  1:25 UTC (permalink / raw)
  To: Minchan Kim, Johannes Weiner
  Cc: Sergey Senozhatsky, lsf-pc, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Andrew Morton

On Wed, Mar 1, 2023 at 4:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Feb 28, 2023 at 3:29 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > Hi Yosry,
> >
> > On Tue, Feb 28, 2023 at 12:12:05AM -0800, Yosry Ahmed wrote:
> > > On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky
> > > <senozhatsky@chromium.org> wrote:
> > > >
> > > > On (23/02/18 14:38), Yosry Ahmed wrote:
> > > > [..]
> > > > > ==================== Idea ====================
> > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > entry. It can also include all non-backend specific operations, such
> > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > > code and the actual swapping implementation.
> > > > >
> > > > > ==================== Benefits ====================
> > > > > This work enables using zswap without a backing swapfile and increases
> > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > a separation that allows us to skip code paths that don't make sense
> > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > which might result in better performance (less lookups, less lock
> > > > > contention).
> > > > >
> > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > removing swapper address spaces, removing swap count continuation
> > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > things that are stored in page tables / caches, and for actual swap
> > > > > entries. In the future, we can potentially further optimize how we use
> > > > > the bits in the page tables instead of sticking everything into the
> > > > > current type/offset format.
> > > > >
> > > > > Another potential win here can be swapoff, which can be more practical
> > > > > by directly scanning all swap_desc's instead of going through page
> > > > > tables and shmem page caches.
> > > > >
> > > > > Overall zswap becomes more accessible and available to a wider range
> > > > > of use cases.
> > > >
> > > > I assume this also brings us closer to a proper writeback LRU handling?
> > >
> > > I assume by proper LRU handling you mean:
> > > - Swap writeback LRU that lives outside of the zpool backends (i.e in
> > > zswap itself or even outside zswap).
> >
> > Even outside zswap to support any combination on any heterogenous
> > multiple swap device configuration.
>
> Agreed, this is the end goal for the writeback LRU.
>
> >
> > The indirection layer would be essential to support it but it would
> > be also great if we don't waste any memory for the user who don't
> > want the feature.
>
> I can't currently think of a way to eliminate overhead for people only
> using swapfiles, as a lot of the core implementation changes, unless
> we want to maintain considerably more code with a lot of repeated
> functionality implemented differently. Perhaps this will change as I
> implement this, maybe things are better (or worse) than what I think
> they are, I am actively working on a proof-of-concept right now. Maybe
> a discussion in LSF/MM/BPF will help come up with optimizations as
> well :)
>
> >
> > Just FYI, there was similar discussion long time ago about the
> > indirection layer.
> > https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
>
> Yeah Hugh shared this one with me earlier, but there are a few things
> that I don't understand how they would work, at least in today's
> world.
>
> Firstly, the proposal suggests that we store a radix tree index in the
> page tables, and in the radix tree store the swap entry AND the swap
> count. I am not really sure how they would fit in 8 bytes, especially
> if we need continuation and 1 byte is not enough for the swap count.
> Continuation logic now depends on linking vmalloc'd pages using the
> lru field in struct page/folio. Perhaps we can figure out a split that
> gives enough space for swap count without continuation while also not
> limiting swapfile sizes too much.
>
> Secondly, IIUC in that proposal once we swap a page in, we free the
> swap entry and add the swapcache page to the radix tree instead. In
> that case, where does the swap count go? IIUC we still need to
> maintain it to be able to tell when all processes mapping the page
> have faulted it back, otherwise the radix tree entry is maintained
> indefinitely. We can maybe stash the swap count somewhere else in this
> case, and bring it back to the radix tree if we swap the page out
> again. Not really sure where, we can have a separate radix tree for
> swap counts when the page is in swapcache, or we can always have it in
> a separate radix tree so that the swap entry fits comfortably in the
> first radix tree.
>
> To be able to accomodate zswap in this design, I think we always need
> a separate radix tree for swap counts. In that case, one radix tree
> contains swap_entry/zswap_entry/swapcache, and the other radix tree
> contains the swap count. I think this may work, but I am not sure if
> the overhead of always doing a lookup to read the swap count is okay.
> I am also sure there would be some fun synchronization problems
> between both trees (but we already need to synchronize today between
> the swapcache and swap counts?).
>
> It sounds like it is possible to make it work. I will spend some time
> thinking about it. Having 2 radix trees also solves the 32-bit systems
> problem, but I am not sure if it's a generally better design. Radix
> trees also take up some extra space other than the entry size itself,
> so I am not sure how much memory we would end up actually saving.
>
> Johannes, I am curious if you have any thoughts about this alternative design?

I completely forgot about shadow entries here. I don't think any of
this works with shadow entries as we still need to maintain them while
the page is swapped out.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02  0:30   ` Yosry Ahmed
  2023-03-02  1:00     ` Yosry Ahmed
@ 2023-03-02 16:51     ` Chris Li
  2023-03-03  0:33     ` Minchan Kim
  2023-03-09 12:48     ` Huang, Ying
  3 siblings, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-02 16:51 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton

Hi Yosry,

On Wed, Mar 01, 2023 at 04:30:22PM -0800, Yosry Ahmed wrote:
> > Can you provide a bit more detail? I am curious how this swap id
> > maps into the swap_desc? Is the swp_entry_t cast into "struct
> > swap_desc*" or going through some lookup table/tree?
> 
> swap id would be an index in a radix tree (aka xarray), which contains
> a pointer to the swap_desc struct. This lookup should be free with
> this design as we also use swap_desc to directly store the swap cache
> pointer, so this lookup essentially replaces the swap cache lookup.


Thanks for the additional clarification. If you don't mind, I have some follow
up questions.

Is this radix tree global or has multiple small trees (e.g per swap device)?

> > > as our abstraction layer. All MM code not concerned with swapping
> > > details would operate in terms of swap descs. The swap_desc can point
> > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > entry. It can also include all non-backend specific operations, such
> > > as the swapcache (which would be a simple pointer in swap_desc), swap
> >
> > Does the zswap entry still use the swap slot cache and swap_info_struct?
> 
> In this design no, it shouldn't.

So the zswap entry only shares the swap cache with normal swap entry.
That help me paint a better picture how you are going to do the indirection
layers.

> That's what I could think of at this point. My idea was something like this:
> 
> struct swap_desc {
>     union { /* Use one bit to distinguish them */
>         swp_entry_t swap_entry;
>         struct zswap_entry *zswap_entry;
>     };
>     struct folio *swapcache;
>     atomic_t swap_count;
>     u32 id;
> }
> 
> Having the id in the swap_desc is convenient as we can directly map
> the swap_desc to a swp_entry_t to place in the page tables, but I
> don't think it's necessary. Without it, the struct size is 20 bytes,
> so I think the extra 4 bytes are okay to use anyway if the slab
> allocator only allocates multiples of 8 bytes.

The whole complexity of the swap_count continues is trying to save
a few bytes one swap entry has one low count numbers.

This seems much more heavy weight compare to that.

> The idea here is to unify the swapcache and swap_count implementation
> between different swap backends (swapfiles, zswap, etc), which would
> create a better abstraction and reduce reinventing the wheel.

Same goal here. I am just trying to find ways to use less memory,
for users who don't use the indirection.

> Keep in mind that the current overhead is 1 byte O(max swap pages) not
> O(swapped). Also, 1 byte is assuming we do not use the swap
> continuation pages. If we do, it may end up being more. We also
> allocate continuation in full 4k pages, so even if one swap_map
> element in a page requires continuation, we will allocate an entire
> page. What I am trying to say is that to get an actual comparison you
> need to also factor in the swap utilization and the rate of usage of
> swap continuation. I don't know how to come up with a formula for this
> tbh.

I would consider two extreme cases of memory usage first.
1) Have swap file size N and no swapping at all.
2) The swap file is full. (aka per page swapping memory overhead).

Your proposal will likely do well on 1) because it is dynamic allocated.
but worse in 2) due to the extra 20 or so bytes per swap_desc.

> Also, like Johannes said, the worst case overhead (32 bytes if you
> count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> overhead for people not using zswap, but it is not very awful.

I might have an alternative to avoid increasing memory usage if not
use the zswap.

> > It seems what you really need is one bit of information to indicate
> > this page is backed by zswap. Then you can have a seperate pointer
> > for the zswap entry.
> 
> If you use one bit in swp_entry_t (or one of the available swap types)
> to indicate whether the page is backed with a swapfile or zswap it
> doesn't really work. We lose the indirection layer. How do we move the
> page from zswap to swapfile? We need to go update the page tables and
> the shmem page cache, similar to swapoff.

How about I make a proposal and you can help me poke holes on it?
I fail to see how it can't move a page from zswap to swapfile, yet.

Chris



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-28 23:29     ` Minchan Kim
  2023-03-02  0:58       ` Yosry Ahmed
@ 2023-03-02 16:58       ` Chris Li
  1 sibling, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-02 16:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Yosry Ahmed, Sergey Senozhatsky, lsf-pc, Johannes Weiner,
	Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Andrew Morton

Hi Minchan,

On Tue, Feb 28, 2023 at 03:29:00PM -0800, Minchan Kim wrote:
> > - Swap writeback LRU that lives outside of the zpool backends (i.e in
> > zswap itself or even outside zswap).
> 
> Even outside zswap to support any combination on any heterogenous
> multiple swap device configuration.
> 
> The indirection layer would be essential to support it but it would
> be also great if we don't waste any memory for the user who don't
> want the feature.

I feel the same way as well.

> 
> Just FYI, there was similar discussion long time ago about the
> indirection layer.
> https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/

Thanks for the pointer, that is an interesting read. I need some time
to think about it.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02  0:58       ` Yosry Ahmed
  2023-03-02  1:25         ` Yosry Ahmed
@ 2023-03-02 17:05         ` Chris Li
  2023-03-02 17:47         ` Chris Li
  2 siblings, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-02 17:05 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Minchan Kim, Johannes Weiner, Sergey Senozhatsky, lsf-pc,
	Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Andrew Morton, Rik van Riel


On Wed, Mar 01, 2023 at 04:58:08PM -0800, Yosry Ahmed wrote:
> > The indirection layer would be essential to support it but it would
> > be also great if we don't waste any memory for the user who don't
> > want the feature.
> 
> I can't currently think of a way to eliminate overhead for people only
> using swapfiles, as a lot of the core implementation changes, unless
> we want to maintain considerably more code with a lot of repeated
> functionality implemented differently. Perhaps this will change as I
> implement this, maybe things are better (or worse) than what I think
> they are, I am actively working on a proof-of-concept right now. Maybe
> a discussion in LSF/MM/BPF will help come up with optimizations as
> well :)
> 
> >
> > Just FYI, there was similar discussion long time ago about the
> > indirection layer.
> > https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
> 
> Yeah Hugh shared this one with me earlier, but there are a few things
> that I don't understand how they would work, at least in today's
> world.

Let's add Rik into the discussion, maybe he can help refresh some details.

Chris

> 
> Firstly, the proposal suggests that we store a radix tree index in the
> page tables, and in the radix tree store the swap entry AND the swap
> count. I am not really sure how they would fit in 8 bytes, especially
> if we need continuation and 1 byte is not enough for the swap count.
> Continuation logic now depends on linking vmalloc'd pages using the
> lru field in struct page/folio. Perhaps we can figure out a split that
> gives enough space for swap count without continuation while also not
> limiting swapfile sizes too much.
> 
> Secondly, IIUC in that proposal once we swap a page in, we free the
> swap entry and add the swapcache page to the radix tree instead. In
> that case, where does the swap count go? IIUC we still need to
> maintain it to be able to tell when all processes mapping the page
> have faulted it back, otherwise the radix tree entry is maintained
> indefinitely. We can maybe stash the swap count somewhere else in this
> case, and bring it back to the radix tree if we swap the page out
> again. Not really sure where, we can have a separate radix tree for
> swap counts when the page is in swapcache, or we can always have it in
> a separate radix tree so that the swap entry fits comfortably in the
> first radix tree.
> 
> To be able to accomodate zswap in this design, I think we always need
> a separate radix tree for swap counts. In that case, one radix tree
> contains swap_entry/zswap_entry/swapcache, and the other radix tree
> contains the swap count. I think this may work, but I am not sure if
> the overhead of always doing a lookup to read the swap count is okay.
> I am also sure there would be some fun synchronization problems
> between both trees (but we already need to synchronize today between
> the swapcache and swap counts?).
> 
> It sounds like it is possible to make it work. I will spend some time
> thinking about it. Having 2 radix trees also solves the 32-bit systems
> problem, but I am not sure if it's a generally better design. Radix
> trees also take up some extra space other than the entry size itself,
> so I am not sure how much memory we would end up actually saving.
> 
> Johannes, I am curious if you have any thoughts about this alternative design?
> 


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02  0:58       ` Yosry Ahmed
  2023-03-02  1:25         ` Yosry Ahmed
  2023-03-02 17:05         ` Chris Li
@ 2023-03-02 17:47         ` Chris Li
  2023-03-02 18:15           ` Johannes Weiner
  2023-03-02 18:23           ` Rik van Riel
  2 siblings, 2 replies; 105+ messages in thread
From: Chris Li @ 2023-03-02 17:47 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Minchan Kim, Johannes Weiner, Sergey Senozhatsky, lsf-pc,
	Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Andrew Morton, Rik van Riel

On Wed, Mar 01, 2023 at 04:58:08PM -0800, Yosry Ahmed wrote:
> > The indirection layer would be essential to support it but it would
> > be also great if we don't waste any memory for the user who don't
> > want the feature.
> 
> I can't currently think of a way to eliminate overhead for people only
> using swapfiles, as a lot of the core implementation changes, unless
> we want to maintain considerably more code with a lot of repeated
> functionality implemented differently. Perhaps this will change as I
> implement this, maybe things are better (or worse) than what I think
> they are, I am actively working on a proof-of-concept right now. Maybe
> a discussion in LSF/MM/BPF will help come up with optimizations as
> well :)

How about we just put the indirection layer into the swap device?

For the zswap, it registered its own swap device type, as many as needed.
The indirection is that, it is up to the swap device type to interpret
the offset and swap counts. Maybe it even has its own address space
and address space ops.

The zswap does not have swap_map and has its own xtree to look up
the zswap entry. 

The common swap device related operation can have some swap related
swap_device_ops function call back. e.g. get swap_cout. 
That way, obviously there will not be much memory overhead for
the devicethat doesn't use zswap. 

The zswap entry can still do something similar to your swap_desc, save
some pointer to its nested backing device(normal swap file).
That way the swap_desc overhead is purely on the zswap side.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02 17:47         ` Chris Li
@ 2023-03-02 18:15           ` Johannes Weiner
  2023-03-02 18:56             ` Chris Li
  2023-03-02 18:23           ` Rik van Riel
  1 sibling, 1 reply; 105+ messages in thread
From: Johannes Weiner @ 2023-03-02 18:15 UTC (permalink / raw)
  To: Chris Li
  Cc: Yosry Ahmed, Minchan Kim, Sergey Senozhatsky, lsf-pc, Linux-MM,
	Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
	Andrew Morton, Rik van Riel

On Thu, Mar 02, 2023 at 09:47:06AM -0800, Chris Li wrote:
> On Wed, Mar 01, 2023 at 04:58:08PM -0800, Yosry Ahmed wrote:
> > > The indirection layer would be essential to support it but it would
> > > be also great if we don't waste any memory for the user who don't
> > > want the feature.
> > 
> > I can't currently think of a way to eliminate overhead for people only
> > using swapfiles, as a lot of the core implementation changes, unless
> > we want to maintain considerably more code with a lot of repeated
> > functionality implemented differently. Perhaps this will change as I
> > implement this, maybe things are better (or worse) than what I think
> > they are, I am actively working on a proof-of-concept right now. Maybe
> > a discussion in LSF/MM/BPF will help come up with optimizations as
> > well :)
> 
> How about we just put the indirection layer into the swap device?
> 
> For the zswap, it registered its own swap device type, as many as needed.
> The indirection is that, it is up to the swap device type to interpret
> the offset and swap counts. Maybe it even has its own address space
> and address space ops.
> 
> The zswap does not have swap_map and has its own xtree to look up
> the zswap entry. 
> 
> The common swap device related operation can have some swap related
> swap_device_ops function call back. e.g. get swap_cout. 
> That way, obviously there will not be much memory overhead for
> the devicethat doesn't use zswap. 
> 
> The zswap entry can still do something similar to your swap_desc, save
> some pointer to its nested backing device(normal swap file).
> That way the swap_desc overhead is purely on the zswap side.

The problem with that is that zswap needs to be able to write its cold
entries to flash. If zswap and the backing file don't live in a shared
swap id space, it means that zswap writeback would have to allocate a
new device/offset tuple and update all the page table references. It
would impose the complexity of swapoff on every zswap writeout.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02 17:47         ` Chris Li
  2023-03-02 18:15           ` Johannes Weiner
@ 2023-03-02 18:23           ` Rik van Riel
  2023-03-02 21:42             ` Chris Li
  1 sibling, 1 reply; 105+ messages in thread
From: Rik van Riel @ 2023-03-02 18:23 UTC (permalink / raw)
  To: Chris Li, Yosry Ahmed
  Cc: Minchan Kim, Johannes Weiner, Sergey Senozhatsky, lsf-pc,
	Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2283 bytes --]

On Thu, 2023-03-02 at 09:47 -0800, Chris Li wrote:
> On Wed, Mar 01, 2023 at 04:58:08PM -0800, Yosry Ahmed wrote:
> > > The indirection layer would be essential to support it but it
> > > would
> > > be also great if we don't waste any memory for the user who don't
> > > want the feature.
> > 
> > I can't currently think of a way to eliminate overhead for people
> > only
> > using swapfiles, as a lot of the core implementation changes,
> > unless
> > we want to maintain considerably more code with a lot of repeated
> > functionality implemented differently. Perhaps this will change as
> > I
> > implement this, maybe things are better (or worse) than what I
> > think
> > they are, I am actively working on a proof-of-concept right now.
> > Maybe
> > a discussion in LSF/MM/BPF will help come up with optimizations as
> > well :)
> 
> How about we just put the indirection layer into the swap device?

The indirection layer needs to be higher up, in order to allow
for easy movement of swap entries between different devices.

For example, we could initially store something in zswap, and
then later decide we want to write it out to disk swap.

This could also be used to quickly free up swap space at swapin
time, by freeing the backing storage (eg. in zswap, or when disk
swap is starting to get full)) when placing an uncompressed copy
of the data in the swap cache. We could apply per-device policies
on whether or not to free swap space at swapin time, because the
tradeoffs are just different between eg disk and zswap.

Doing that would also allow us to turn swapoff into a simple
"load everything from this device into the swap cache" operation.
The pageout code can move that data from the swap cache into
another swap device, without ever having to look up page tables.

One possible implementation might be to have swap page table entries
point to a swap address in this indirection layer, and the indirection
layer can be an xarray containing the actual swap entries specifying
at which position in which swap device the data can be found.

That might be a net reduction in the code over what we have today,
because it gets rid of some ugly corner cases.

kind regards,

Rik van Riel
-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02 18:15           ` Johannes Weiner
@ 2023-03-02 18:56             ` Chris Li
  0 siblings, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-02 18:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yosry Ahmed, Minchan Kim, Sergey Senozhatsky, lsf-pc, Linux-MM,
	Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
	Andrew Morton, Rik van Riel

On Thu, Mar 02, 2023 at 01:15:32PM -0500, Johannes Weiner wrote:
> > The common swap device related operation can have some swap related
> > swap_device_ops function call back. e.g. get swap_cout. 
> > That way, obviously there will not be much memory overhead for
> > the devicethat doesn't use zswap. 
> > 
> > The zswap entry can still do something similar to your swap_desc, save
> > some pointer to its nested backing device(normal swap file).
> > That way the swap_desc overhead is purely on the zswap side.
> 
> The problem with that is that zswap needs to be able to write its cold
> entries to flash. If zswap and the backing file don't live in a shared
> swap id space, it means that zswap writeback would have to allocate a
> new device/offset tuple and update all the page table references. It
> would impose the complexity of swapoff on every zswap writeout.
>

When writing back to flash, allocating a new device/offset is unavoidable.
Otherwise it means zswap will reserve/waste a device/offset even when
it is not writing back to the flash device.

Update the page table reference can be avoided. By keeping the zswap entry.
When lookup the original zswap offset, it knows that the data has been write
to a flash device, it keeps a pointer of the flash swap entry, returning that instead.
The mechanism is similar to the swap_desc Yosary description.

Will that address your concern?

Basically the indirection layer is on demand.

Chris
  


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02 18:23           ` Rik van Riel
@ 2023-03-02 21:42             ` Chris Li
  2023-03-02 22:36               ` Rik van Riel
  0 siblings, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-02 21:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Yosry Ahmed, Minchan Kim, Johannes Weiner, Sergey Senozhatsky,
	lsf-pc, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Andrew Morton

On Thu, Mar 02, 2023 at 01:23:14PM -0500, Rik van Riel wrote:
> > How about we just put the indirection layer into the swap device?
> 
> The indirection layer needs to be higher up, in order to allow
> for easy movement of swap entries between different devices.

In my mind this "swap device" has some upper layer parts.
Maybe I should not call it a swap device. Let's say it is a
redirect on the swap_info layer.

> 
> For example, we could initially store something in zswap, and
> then later decide we want to write it out to disk swap.
>

Right. We might even want to move swap from SSD into a slower
spinning disk swap space.
 
> This could also be used to quickly free up swap space at swapin
> time, by freeing the backing storage (eg. in zswap, or when disk
> swap is starting to get full)) when placing an uncompressed copy
> of the data in the swap cache. We could apply per-device policies
> on whether or not to free swap space at swapin time, because the
> tradeoffs are just different between eg disk and zswap.
> 
> Doing that would also allow us to turn swapoff into a simple
> "load everything from this device into the swap cache" operation.
> The pageout code can move that data from the swap cache into
> another swap device, without ever having to look up page tables.

I am with you on that.

> 
> One possible implementation might be to have swap page table entries
> point to a swap address in this indirection layer, and the indirection
> layer can be an xarray containing the actual swap entries specifying
> at which position in which swap device the data can be found.

The questions, do we have this indirection layer apply to all swap
entries?

My small tweak is to limit the indirection layer only to non leaf
swap devices. Then it is actually very close to what I am proposing.
Just your "indirection layer" is my "special swap device".

Again, "special swap device" is a very bad name, let's name it something
more useful.

> That might be a net reduction in the code over what we have today,
> because it gets rid of some ugly corner cases.

Great.

Chris



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02 21:42             ` Chris Li
@ 2023-03-02 22:36               ` Rik van Riel
  2023-03-02 22:55                 ` Yosry Ahmed
  2023-03-03  0:01                 ` Chris Li
  0 siblings, 2 replies; 105+ messages in thread
From: Rik van Riel @ 2023-03-02 22:36 UTC (permalink / raw)
  To: Chris Li
  Cc: Yosry Ahmed, Minchan Kim, Johannes Weiner, Sergey Senozhatsky,
	lsf-pc, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1402 bytes --]

On Thu, 2023-03-02 at 13:42 -0800, Chris Li wrote:
> On Thu, Mar 02, 2023 at 01:23:14PM -0500, Rik van Riel wrote:
> > 
> > 
> > One possible implementation might be to have swap page table
> > entries
> > point to a swap address in this indirection layer, and the
> > indirection
> > layer can be an xarray containing the actual swap entries
> > specifying
> > at which position in which swap device the data can be found.
> 
> The questions, do we have this indirection layer apply to all swap
> entries?
> 
I believe we should have a system that tracks every swap entry
the same, data structure wise. Otherwise we will have two sets
of code in the kernel, and it will be too easy to get corner
cases wrong.

> My small tweak is to limit the indirection layer only to non leaf
> swap devices. Then it is actually very close to what I am proposing.
> Just your "indirection layer" is my "special swap device".
> 
> Again, "special swap device" is a very bad name, let's name it
> something
> more useful.
> 
> > That might be a net reduction in the code over what we have today,
> > because it gets rid of some ugly corner cases.
> 
> Great.

... but that won't happen if the indirection layer only applies
to some swap devices, because we will still need to keep around
the crazy code to deal with the swap devices that don't have it.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02 22:36               ` Rik van Riel
@ 2023-03-02 22:55                 ` Yosry Ahmed
  2023-03-03  4:05                   ` Chris Li
  2023-03-03  0:01                 ` Chris Li
  1 sibling, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-02 22:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Chris Li, Minchan Kim, Johannes Weiner, Sergey Senozhatsky,
	lsf-pc, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Andrew Morton

On Thu, Mar 2, 2023 at 2:36 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Thu, 2023-03-02 at 13:42 -0800, Chris Li wrote:
> > On Thu, Mar 02, 2023 at 01:23:14PM -0500, Rik van Riel wrote:
> > >
> > >
> > > One possible implementation might be to have swap page table
> > > entries
> > > point to a swap address in this indirection layer, and the
> > > indirection
> > > layer can be an xarray containing the actual swap entries
> > > specifying
> > > at which position in which swap device the data can be found.
> >
> > The questions, do we have this indirection layer apply to all swap
> > entries?
> >
> I believe we should have a system that tracks every swap entry
> the same, data structure wise. Otherwise we will have two sets
> of code in the kernel, and it will be too easy to get corner
> cases wrong.
>
> > My small tweak is to limit the indirection layer only to non leaf
> > swap devices. Then it is actually very close to what I am proposing.
> > Just your "indirection layer" is my "special swap device".
> >
> > Again, "special swap device" is a very bad name, let's name it
> > something
> > more useful.
> >
> > > That might be a net reduction in the code over what we have today,
> > > because it gets rid of some ugly corner cases.
> >
> > Great.
>
> ... but that won't happen if the indirection layer only applies
> to some swap devices, because we will still need to keep around
> the crazy code to deal with the swap devices that don't have it.

I agree with Rik here. We can certainly special case the indirection
layer and only apply to some swap backends (e.g. zswap), but this
makes things more complicated. For example, if each swap backend
maintains swap_count in their own way, we have to hand over the swap
count when we move a swapped page between backends.

With a common data structure like the proposed swap_desc, everything
becomes easier to reason about. The core swapping logic that is
agnostic to the backend like swapcache and swap counting lives in one
common place and becomes easier to reason about. Swap backends like
swapfiles or zswap can then implement a common interface to do backend
specific operations, like allocating entries, reading/writing pages,
etc.

This, of course, isn't free. There is an associated overhead. It's a
trade off like most things are. We want to work towards the outcome of
that tradeoff that makes sense, we don't want to incur too much
overhead, but we also don't want a very complicated and error-prone
implementation.

Rik, I am wondering about your thoughts on this proposal and how you
think it can be improved?

>
> --
> All Rights Reversed.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02 22:36               ` Rik van Riel
  2023-03-02 22:55                 ` Yosry Ahmed
@ 2023-03-03  0:01                 ` Chris Li
  1 sibling, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-03  0:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Yosry Ahmed, Minchan Kim, Johannes Weiner, Sergey Senozhatsky,
	lsf-pc, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Andrew Morton

On Thu, Mar 02, 2023 at 05:36:20PM -0500, Rik van Riel wrote:
> > The questions, do we have this indirection layer apply to all swap
> > entries?
> > 
> I believe we should have a system that tracks every swap entry
> the same, data structure wise. Otherwise we will have two sets
> of code in the kernel, and it will be too easy to get corner
> cases wrong.

It is all about trade offs.  The original proposal is adding 20-30
bytes per swapped out page, compared to existing swap devices.
That is why I want to trade some code complexity for less memory
usage. I will mind less if indirect overhead is much smaller,
for the user who doesn't use it.

> > > That might be a net reduction in the code over what we have today,
> > > because it gets rid of some ugly corner cases.
> > 
> > Great.
> 
> ... but that won't happen if the indirection layer only applies
> to some swap devices, because we will still need to keep around
> the crazy code to deal with the swap devices that don't have it.

I understand your concern.

It is important to have good abstraction between the different
swap devices. I was hoping to good abstraction can isolate out
the difference in different swap devices.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02  0:30   ` Yosry Ahmed
  2023-03-02  1:00     ` Yosry Ahmed
  2023-03-02 16:51     ` Chris Li
@ 2023-03-03  0:33     ` Minchan Kim
  2023-03-03  0:49       ` Yosry Ahmed
  2023-03-09 12:48     ` Huang, Ying
  3 siblings, 1 reply; 105+ messages in thread
From: Minchan Kim @ 2023-03-03  0:33 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Andrew Morton

On Wed, Mar 01, 2023 at 04:30:22PM -0800, Yosry Ahmed wrote:
> On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Yosry,
> >
> > On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > > Hello everyone,
> > >
> > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > 2023 about swap & zswap (hope I am not too late).
> >
> > I am very interested in participating in this discussion as well.
> 
> That's great to hear!
> 
> >
> > > ==================== Objective ====================
> > > Enabling the use of zswap without a backing swapfile, which makes
> > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > used with a swapfile, the pages in zswap do not use up space in the
> > > swapfile, so the overall swapping capacity increases.
> >
> > Agree.
> >
> > >
> > > ==================== Idea ====================
> > > Introduce a data structure, which I currently call a swap_desc, as an
> > > abstraction layer between swapping implementation and the rest of MM
> > > code. Page tables & page caches would store a swap id (encoded as a
> > > swp_entry_t) instead of directly storing the swap entry associated
> > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> >
> > Can you provide a bit more detail? I am curious how this swap id
> > maps into the swap_desc? Is the swp_entry_t cast into "struct
> > swap_desc*" or going through some lookup table/tree?
> 
> swap id would be an index in a radix tree (aka xarray), which contains
> a pointer to the swap_desc struct. This lookup should be free with
> this design as we also use swap_desc to directly store the swap cache
> pointer, so this lookup essentially replaces the swap cache lookup.
> 
> >
> > > as our abstraction layer. All MM code not concerned with swapping
> > > details would operate in terms of swap descs. The swap_desc can point
> > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > entry. It can also include all non-backend specific operations, such
> > > as the swapcache (which would be a simple pointer in swap_desc), swap
> >
> > Does the zswap entry still use the swap slot cache and swap_info_struct?
> 
> In this design no, it shouldn't.
> 
> >
> > > This work enables using zswap without a backing swapfile and increases
> > > the swap capacity when zswap is used with a swapfile. It also creates
> > > a separation that allows us to skip code paths that don't make sense
> > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > which might result in better performance (less lookups, less lock
> > > contention).
> > >
> > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > removing swapper address spaces, removing swap count continuation
> > > code, etc). Another nice cleanup that this work enables would be
> > > separating the overloaded swp_entry_t into two distinct types: one for
> > > things that are stored in page tables / caches, and for actual swap
> > > entries. In the future, we can potentially further optimize how we use
> > > the bits in the page tables instead of sticking everything into the
> > > current type/offset format.
> >
> > Looking forward to seeing more details in the upcoming discussion.
> > >
> > > ==================== Cost ====================
> > > The obvious downside of this is added memory overhead, specifically
> > > for users that use swapfiles without zswap. Instead of paying one byte
> > > (swap_map) for every potential page in the swapfile (+ swap count
> > > continuation), we pay the size of the swap_desc for every page that is
> > > actually in the swapfile, which I am estimating can be roughly around
> > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > scales with pages actually swapped out. For zswap users, it should be
> >
> > Is there a way to avoid turning 1 byte into 24 byte per swapped
> > pages? For the users that use swap but no zswap, this is pure overhead.
> 
> That's what I could think of at this point. My idea was something like this:
> 
> struct swap_desc {
>     union { /* Use one bit to distinguish them */
>         swp_entry_t swap_entry;
>         struct zswap_entry *zswap_entry;
>     };
>     struct folio *swapcache;
>     atomic_t swap_count;
>     u32 id;
> }
> 
> Having the id in the swap_desc is convenient as we can directly map
> the swap_desc to a swp_entry_t to place in the page tables, but I
> don't think it's necessary. Without it, the struct size is 20 bytes,
> so I think the extra 4 bytes are okay to use anyway if the slab
> allocator only allocates multiples of 8 bytes.
> 
> The idea here is to unify the swapcache and swap_count implementation
> between different swap backends (swapfiles, zswap, etc), which would
> create a better abstraction and reduce reinventing the wheel.
> 
> We can reduce to only 8 bytes and only store the swap/zswap entry, but
> we still need the swap cache anyway so might as well just store the
> pointer in the struct and have a unified lookup-free swapcache, so
> really 16 bytes is the minimum.
> 
> If we stop at 16 bytes, then we need to handle swap count separately
> in swapfiles and zswap. This is not the end of the world, but are the
> 8 bytes worth this?
> 
> Keep in mind that the current overhead is 1 byte O(max swap pages) not
> O(swapped). Also, 1 byte is assuming we do not use the swap

Just to share info:

Android usually used swap space fully most of times via Compacting
background Apps so O(swapped) ~= O(max swap pages).


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-03  0:33     ` Minchan Kim
@ 2023-03-03  0:49       ` Yosry Ahmed
  2023-03-03  1:25         ` Minchan Kim
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-03  0:49 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Andrew Morton

On Thu, Mar 2, 2023 at 4:33 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Wed, Mar 01, 2023 at 04:30:22PM -0800, Yosry Ahmed wrote:
> > On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > Hi Yosry,
> > >
> > > On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > > > Hello everyone,
> > > >
> > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > 2023 about swap & zswap (hope I am not too late).
> > >
> > > I am very interested in participating in this discussion as well.
> >
> > That's great to hear!
> >
> > >
> > > > ==================== Objective ====================
> > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > swapfile, so the overall swapping capacity increases.
> > >
> > > Agree.
> > >
> > > >
> > > > ==================== Idea ====================
> > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > abstraction layer between swapping implementation and the rest of MM
> > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > >
> > > Can you provide a bit more detail? I am curious how this swap id
> > > maps into the swap_desc? Is the swp_entry_t cast into "struct
> > > swap_desc*" or going through some lookup table/tree?
> >
> > swap id would be an index in a radix tree (aka xarray), which contains
> > a pointer to the swap_desc struct. This lookup should be free with
> > this design as we also use swap_desc to directly store the swap cache
> > pointer, so this lookup essentially replaces the swap cache lookup.
> >
> > >
> > > > as our abstraction layer. All MM code not concerned with swapping
> > > > details would operate in terms of swap descs. The swap_desc can point
> > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > entry. It can also include all non-backend specific operations, such
> > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > >
> > > Does the zswap entry still use the swap slot cache and swap_info_struct?
> >
> > In this design no, it shouldn't.
> >
> > >
> > > > This work enables using zswap without a backing swapfile and increases
> > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > a separation that allows us to skip code paths that don't make sense
> > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > which might result in better performance (less lookups, less lock
> > > > contention).
> > > >
> > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > removing swapper address spaces, removing swap count continuation
> > > > code, etc). Another nice cleanup that this work enables would be
> > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > things that are stored in page tables / caches, and for actual swap
> > > > entries. In the future, we can potentially further optimize how we use
> > > > the bits in the page tables instead of sticking everything into the
> > > > current type/offset format.
> > >
> > > Looking forward to seeing more details in the upcoming discussion.
> > > >
> > > > ==================== Cost ====================
> > > > The obvious downside of this is added memory overhead, specifically
> > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > continuation), we pay the size of the swap_desc for every page that is
> > > > actually in the swapfile, which I am estimating can be roughly around
> > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > scales with pages actually swapped out. For zswap users, it should be
> > >
> > > Is there a way to avoid turning 1 byte into 24 byte per swapped
> > > pages? For the users that use swap but no zswap, this is pure overhead.
> >
> > That's what I could think of at this point. My idea was something like this:
> >
> > struct swap_desc {
> >     union { /* Use one bit to distinguish them */
> >         swp_entry_t swap_entry;
> >         struct zswap_entry *zswap_entry;
> >     };
> >     struct folio *swapcache;
> >     atomic_t swap_count;
> >     u32 id;
> > }
> >
> > Having the id in the swap_desc is convenient as we can directly map
> > the swap_desc to a swp_entry_t to place in the page tables, but I
> > don't think it's necessary. Without it, the struct size is 20 bytes,
> > so I think the extra 4 bytes are okay to use anyway if the slab
> > allocator only allocates multiples of 8 bytes.
> >
> > The idea here is to unify the swapcache and swap_count implementation
> > between different swap backends (swapfiles, zswap, etc), which would
> > create a better abstraction and reduce reinventing the wheel.
> >
> > We can reduce to only 8 bytes and only store the swap/zswap entry, but
> > we still need the swap cache anyway so might as well just store the
> > pointer in the struct and have a unified lookup-free swapcache, so
> > really 16 bytes is the minimum.
> >
> > If we stop at 16 bytes, then we need to handle swap count separately
> > in swapfiles and zswap. This is not the end of the world, but are the
> > 8 bytes worth this?
> >
> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> > O(swapped). Also, 1 byte is assuming we do not use the swap
>
> Just to share info:
>
> Android usually used swap space fully most of times via Compacting
> background Apps so O(swapped) ~= O(max swap pages).

Thanks for sharing this, that's definitely interesting.

What percentage of memory is usually provisioned as swap in such
cases? Would you consider an extra overhead of ~8M per 1G of swapped
memory particularly unacceptable?


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-03  0:49       ` Yosry Ahmed
@ 2023-03-03  1:25         ` Minchan Kim
  2023-03-03 17:15           ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Minchan Kim @ 2023-03-03  1:25 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Andrew Morton

On Thu, Mar 02, 2023 at 04:49:01PM -0800, Yosry Ahmed wrote:
> On Thu, Mar 2, 2023 at 4:33 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Wed, Mar 01, 2023 at 04:30:22PM -0800, Yosry Ahmed wrote:
> > > On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
> > > >
> > > > Hi Yosry,
> > > >
> > > > On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > > > > Hello everyone,
> > > > >
> > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > > 2023 about swap & zswap (hope I am not too late).
> > > >
> > > > I am very interested in participating in this discussion as well.
> > >
> > > That's great to hear!
> > >
> > > >
> > > > > ==================== Objective ====================
> > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > > swapfile, so the overall swapping capacity increases.
> > > >
> > > > Agree.
> > > >
> > > > >
> > > > > ==================== Idea ====================
> > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > >
> > > > Can you provide a bit more detail? I am curious how this swap id
> > > > maps into the swap_desc? Is the swp_entry_t cast into "struct
> > > > swap_desc*" or going through some lookup table/tree?
> > >
> > > swap id would be an index in a radix tree (aka xarray), which contains
> > > a pointer to the swap_desc struct. This lookup should be free with
> > > this design as we also use swap_desc to directly store the swap cache
> > > pointer, so this lookup essentially replaces the swap cache lookup.
> > >
> > > >
> > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > entry. It can also include all non-backend specific operations, such
> > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > >
> > > > Does the zswap entry still use the swap slot cache and swap_info_struct?
> > >
> > > In this design no, it shouldn't.
> > >
> > > >
> > > > > This work enables using zswap without a backing swapfile and increases
> > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > a separation that allows us to skip code paths that don't make sense
> > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > which might result in better performance (less lookups, less lock
> > > > > contention).
> > > > >
> > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > removing swapper address spaces, removing swap count continuation
> > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > things that are stored in page tables / caches, and for actual swap
> > > > > entries. In the future, we can potentially further optimize how we use
> > > > > the bits in the page tables instead of sticking everything into the
> > > > > current type/offset format.
> > > >
> > > > Looking forward to seeing more details in the upcoming discussion.
> > > > >
> > > > > ==================== Cost ====================
> > > > > The obvious downside of this is added memory overhead, specifically
> > > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > > continuation), we pay the size of the swap_desc for every page that is
> > > > > actually in the swapfile, which I am estimating can be roughly around
> > > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > > scales with pages actually swapped out. For zswap users, it should be
> > > >
> > > > Is there a way to avoid turning 1 byte into 24 byte per swapped
> > > > pages? For the users that use swap but no zswap, this is pure overhead.
> > >
> > > That's what I could think of at this point. My idea was something like this:
> > >
> > > struct swap_desc {
> > >     union { /* Use one bit to distinguish them */
> > >         swp_entry_t swap_entry;
> > >         struct zswap_entry *zswap_entry;
> > >     };
> > >     struct folio *swapcache;
> > >     atomic_t swap_count;
> > >     u32 id;
> > > }
> > >
> > > Having the id in the swap_desc is convenient as we can directly map
> > > the swap_desc to a swp_entry_t to place in the page tables, but I
> > > don't think it's necessary. Without it, the struct size is 20 bytes,
> > > so I think the extra 4 bytes are okay to use anyway if the slab
> > > allocator only allocates multiples of 8 bytes.
> > >
> > > The idea here is to unify the swapcache and swap_count implementation
> > > between different swap backends (swapfiles, zswap, etc), which would
> > > create a better abstraction and reduce reinventing the wheel.
> > >
> > > We can reduce to only 8 bytes and only store the swap/zswap entry, but
> > > we still need the swap cache anyway so might as well just store the
> > > pointer in the struct and have a unified lookup-free swapcache, so
> > > really 16 bytes is the minimum.
> > >
> > > If we stop at 16 bytes, then we need to handle swap count separately
> > > in swapfiles and zswap. This is not the end of the world, but are the
> > > 8 bytes worth this?
> > >
> > > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> > > O(swapped). Also, 1 byte is assuming we do not use the swap
> >
> > Just to share info:
> >
> > Android usually used swap space fully most of times via Compacting
> > background Apps so O(swapped) ~= O(max swap pages).
> 
> Thanks for sharing this, that's definitely interesting.
> 
> What percentage of memory is usually provisioned as swap in such
> cases? Would you consider an extra overhead of ~8M per 1G of swapped
> memory particularly unacceptable?

Vendors have different sizes and usually decide by how many apps
they want to cache at the cost of foreground app's trouble.
I couldn't speak for all vendors but half of DRAM may be small
size on the market.

I recall the ~80M less free memory made huge differene of jank
ratio for memory hungry app launchhing on the memory pressure
state so we needed to cut down the additional memory in the end.

I cannot say the ~8M per 1G is acceptible or not since it's
depends on workload with the device's RAM size(worried more on
entry level devices) so I am not sure what's the code complexity
we will bring in the end but considering swap metadata is one of
the largest area from memory consumption view(struct page is working
toward to shrink - folio, Yay!), IMHO, it's worthwhile to see
whether we could take the complexity.

Thanks.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02 22:55                 ` Yosry Ahmed
@ 2023-03-03  4:05                   ` Chris Li
  0 siblings, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-03  4:05 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Rik van Riel, Minchan Kim, Johannes Weiner, Sergey Senozhatsky,
	lsf-pc, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Andrew Morton

On Thu, Mar 02, 2023 at 02:55:09PM -0800, Yosry Ahmed wrote:
> This, of course, isn't free. There is an associated overhead. It's a
> trade off like most things are. We want to work towards the outcome of
> that tradeoff that makes sense, we don't want to incur too much
> overhead, but we also don't want a very complicated and error-prone
> implementation.

I agree it is all about trade offs. I can't resist the chance to save
memory :-)

Chris



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-03  1:25         ` Minchan Kim
@ 2023-03-03 17:15           ` Yosry Ahmed
  0 siblings, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-03 17:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Andrew Morton

On Thu, Mar 2, 2023 at 5:25 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, Mar 02, 2023 at 04:49:01PM -0800, Yosry Ahmed wrote:
> > On Thu, Mar 2, 2023 at 4:33 PM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Wed, Mar 01, 2023 at 04:30:22PM -0800, Yosry Ahmed wrote:
> > > > On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
> > > > >
> > > > > Hi Yosry,
> > > > >
> > > > > On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > > > > > Hello everyone,
> > > > > >
> > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > > > 2023 about swap & zswap (hope I am not too late).
> > > > >
> > > > > I am very interested in participating in this discussion as well.
> > > >
> > > > That's great to hear!
> > > >
> > > > >
> > > > > > ==================== Objective ====================
> > > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > > > swapfile, so the overall swapping capacity increases.
> > > > >
> > > > > Agree.
> > > > >
> > > > > >
> > > > > > ==================== Idea ====================
> > > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > >
> > > > > Can you provide a bit more detail? I am curious how this swap id
> > > > > maps into the swap_desc? Is the swp_entry_t cast into "struct
> > > > > swap_desc*" or going through some lookup table/tree?
> > > >
> > > > swap id would be an index in a radix tree (aka xarray), which contains
> > > > a pointer to the swap_desc struct. This lookup should be free with
> > > > this design as we also use swap_desc to directly store the swap cache
> > > > pointer, so this lookup essentially replaces the swap cache lookup.
> > > >
> > > > >
> > > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > > entry. It can also include all non-backend specific operations, such
> > > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > >
> > > > > Does the zswap entry still use the swap slot cache and swap_info_struct?
> > > >
> > > > In this design no, it shouldn't.
> > > >
> > > > >
> > > > > > This work enables using zswap without a backing swapfile and increases
> > > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > > a separation that allows us to skip code paths that don't make sense
> > > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> > > > > > which might result in better performance (less lookups, less lock
> > > > > > contention).
> > > > > >
> > > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > > removing swapper address spaces, removing swap count continuation
> > > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > > things that are stored in page tables / caches, and for actual swap
> > > > > > entries. In the future, we can potentially further optimize how we use
> > > > > > the bits in the page tables instead of sticking everything into the
> > > > > > current type/offset format.
> > > > >
> > > > > Looking forward to seeing more details in the upcoming discussion.
> > > > > >
> > > > > > ==================== Cost ====================
> > > > > > The obvious downside of this is added memory overhead, specifically
> > > > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > > > continuation), we pay the size of the swap_desc for every page that is
> > > > > > actually in the swapfile, which I am estimating can be roughly around
> > > > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > > > scales with pages actually swapped out. For zswap users, it should be
> > > > >
> > > > > Is there a way to avoid turning 1 byte into 24 byte per swapped
> > > > > pages? For the users that use swap but no zswap, this is pure overhead.
> > > >
> > > > That's what I could think of at this point. My idea was something like this:
> > > >
> > > > struct swap_desc {
> > > >     union { /* Use one bit to distinguish them */
> > > >         swp_entry_t swap_entry;
> > > >         struct zswap_entry *zswap_entry;
> > > >     };
> > > >     struct folio *swapcache;
> > > >     atomic_t swap_count;
> > > >     u32 id;
> > > > }
> > > >
> > > > Having the id in the swap_desc is convenient as we can directly map
> > > > the swap_desc to a swp_entry_t to place in the page tables, but I
> > > > don't think it's necessary. Without it, the struct size is 20 bytes,
> > > > so I think the extra 4 bytes are okay to use anyway if the slab
> > > > allocator only allocates multiples of 8 bytes.
> > > >
> > > > The idea here is to unify the swapcache and swap_count implementation
> > > > between different swap backends (swapfiles, zswap, etc), which would
> > > > create a better abstraction and reduce reinventing the wheel.
> > > >
> > > > We can reduce to only 8 bytes and only store the swap/zswap entry, but
> > > > we still need the swap cache anyway so might as well just store the
> > > > pointer in the struct and have a unified lookup-free swapcache, so
> > > > really 16 bytes is the minimum.
> > > >
> > > > If we stop at 16 bytes, then we need to handle swap count separately
> > > > in swapfiles and zswap. This is not the end of the world, but are the
> > > > 8 bytes worth this?
> > > >
> > > > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> > > > O(swapped). Also, 1 byte is assuming we do not use the swap
> > >
> > > Just to share info:
> > >
> > > Android usually used swap space fully most of times via Compacting
> > > background Apps so O(swapped) ~= O(max swap pages).
> >
> > Thanks for sharing this, that's definitely interesting.
> >
> > What percentage of memory is usually provisioned as swap in such
> > cases? Would you consider an extra overhead of ~8M per 1G of swapped
> > memory particularly unacceptable?
>
> Vendors have different sizes and usually decide by how many apps
> they want to cache at the cost of foreground app's trouble.
> I couldn't speak for all vendors but half of DRAM may be small
> size on the market.
>
> I recall the ~80M less free memory made huge differene of jank
> ratio for memory hungry app launchhing on the memory pressure
> state so we needed to cut down the additional memory in the end.
>
> I cannot say the ~8M per 1G is acceptible or not since it's
> depends on workload with the device's RAM size(worried more on
> entry level devices) so I am not sure what's the code complexity
> we will bring in the end but considering swap metadata is one of
> the largest area from memory consumption view(struct page is working
> toward to shrink - folio, Yay!), IMHO, it's worthwhile to see
> whether we could take the complexity.

We can do something similar to what Chris suggested, and have the swap
entries act the same way as today if frontswap/zswap is not
configured. If frontswap/zswap is configured, all swap entries go
through frontswap, we can have an xarray there (instead of today's
rbtree) that either points to a zswap entry or a swap entry. So the
indirection lives in frontswap essentially.

A few problems with this alternative approach vs. the initial proposal:

a) The abstraction layer is within frontswap/zswap, so we cannot move
the writeback LRU logic outside zswap like you mentioned before to
support different combinations of swapping backends.

b) Today we do one lookup in the fault path (in the swapcache) for
swapfiles, and 2 lookups for zswap (an extra lookup in the zswap
rbtree). The initial proposal allows us to remove the rbtree and have
a single lookup in both cases to get the swap_desc, in which the
swapcache and zswap_entry are just pointers we can access directly.
Furthermore, adding the swapped in page to the swapcache is an O(1)
operation now instead of storing into an xarray (IIUC we did an
optimization to skip the swapcache for zram single-mapping fault to
avoid this).

With the alternative approach, we have to do 2 lookups in the fault
path if zswap is enabled (one in the swapcache, and one in the zswap
tree to get the underlying zswap_entry / swap entry).

c) The initial proposal allows us to simplify the swap counting and
swapcache logic, and perhaps reduce the locking we have to do (one
lock to update the swapcache, one lock to update swap count). I even
think we might be able to do such updates locklessly with atomic
operations on the swap_desc. With the alternative approach, we need to
have swap counting logic in swapfiles (the one we have today), and
another swap counting logic in zswap. When a page is moved from zswap
to swapfile, we may need to hand over the swap count. Instead of
simplifying the complicated swap counting logic we double down by
having two implementations.

I am not saying one or the alternative approach is worse, I am just
saying there is more to it than complexity vs. memory savings.

>
> Thanks.
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-02  0:30   ` Yosry Ahmed
                       ` (2 preceding siblings ...)
  2023-03-03  0:33     ` Minchan Kim
@ 2023-03-09 12:48     ` Huang, Ying
  2023-03-09 19:58       ` Chris Li
  2023-03-09 20:19       ` Yosry Ahmed
  3 siblings, 2 replies; 105+ messages in thread
From: Huang, Ying @ 2023-03-09 12:48 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton

Yosry Ahmed <yosryahmed@google.com> writes:

> On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
>>
>> Hi Yosry,
>>
>> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
>> > Hello everyone,
>> >
>> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
>> > 2023 about swap & zswap (hope I am not too late).
>>
>> I am very interested in participating in this discussion as well.
>
> That's great to hear!
>
>>
>> > ==================== Objective ====================
>> > Enabling the use of zswap without a backing swapfile, which makes
>> > zswap useful for a wider variety of use cases. Also, when zswap is
>> > used with a swapfile, the pages in zswap do not use up space in the
>> > swapfile, so the overall swapping capacity increases.
>>
>> Agree.
>>
>> >
>> > ==================== Idea ====================
>> > Introduce a data structure, which I currently call a swap_desc, as an
>> > abstraction layer between swapping implementation and the rest of MM
>> > code. Page tables & page caches would store a swap id (encoded as a
>> > swp_entry_t) instead of directly storing the swap entry associated
>> > with the swapfile. This swap id maps to a struct swap_desc, which acts
>>
>> Can you provide a bit more detail? I am curious how this swap id
>> maps into the swap_desc? Is the swp_entry_t cast into "struct
>> swap_desc*" or going through some lookup table/tree?
>
> swap id would be an index in a radix tree (aka xarray), which contains
> a pointer to the swap_desc struct. This lookup should be free with
> this design as we also use swap_desc to directly store the swap cache
> pointer, so this lookup essentially replaces the swap cache lookup.
>
>>
>> > as our abstraction layer. All MM code not concerned with swapping
>> > details would operate in terms of swap descs. The swap_desc can point
>> > to either a normal swap entry (associated with a swapfile) or a zswap
>> > entry. It can also include all non-backend specific operations, such
>> > as the swapcache (which would be a simple pointer in swap_desc), swap
>>
>> Does the zswap entry still use the swap slot cache and swap_info_struct?
>
> In this design no, it shouldn't.
>
>>
>> > This work enables using zswap without a backing swapfile and increases
>> > the swap capacity when zswap is used with a swapfile. It also creates
>> > a separation that allows us to skip code paths that don't make sense
>> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
>> > which might result in better performance (less lookups, less lock
>> > contention).
>> >
>> > The abstraction layer also opens the door for multiple cleanups (e.g.
>> > removing swapper address spaces, removing swap count continuation
>> > code, etc). Another nice cleanup that this work enables would be
>> > separating the overloaded swp_entry_t into two distinct types: one for
>> > things that are stored in page tables / caches, and for actual swap
>> > entries. In the future, we can potentially further optimize how we use
>> > the bits in the page tables instead of sticking everything into the
>> > current type/offset format.
>>
>> Looking forward to seeing more details in the upcoming discussion.
>> >
>> > ==================== Cost ====================
>> > The obvious downside of this is added memory overhead, specifically
>> > for users that use swapfiles without zswap. Instead of paying one byte
>> > (swap_map) for every potential page in the swapfile (+ swap count
>> > continuation), we pay the size of the swap_desc for every page that is
>> > actually in the swapfile, which I am estimating can be roughly around
>> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
>> > scales with pages actually swapped out. For zswap users, it should be
>>
>> Is there a way to avoid turning 1 byte into 24 byte per swapped
>> pages? For the users that use swap but no zswap, this is pure overhead.
>
> That's what I could think of at this point. My idea was something like this:
>
> struct swap_desc {
>     union { /* Use one bit to distinguish them */
>         swp_entry_t swap_entry;
>         struct zswap_entry *zswap_entry;
>     };
>     struct folio *swapcache;
>     atomic_t swap_count;
>     u32 id;
> }
>
> Having the id in the swap_desc is convenient as we can directly map
> the swap_desc to a swp_entry_t to place in the page tables, but I
> don't think it's necessary. Without it, the struct size is 20 bytes,
> so I think the extra 4 bytes are okay to use anyway if the slab
> allocator only allocates multiples of 8 bytes.
>
> The idea here is to unify the swapcache and swap_count implementation
> between different swap backends (swapfiles, zswap, etc), which would
> create a better abstraction and reduce reinventing the wheel.
>
> We can reduce to only 8 bytes and only store the swap/zswap entry, but
> we still need the swap cache anyway so might as well just store the
> pointer in the struct and have a unified lookup-free swapcache, so
> really 16 bytes is the minimum.
>
> If we stop at 16 bytes, then we need to handle swap count separately
> in swapfiles and zswap. This is not the end of the world, but are the
> 8 bytes worth this?

If my understanding were correct, for current implementation, we need
one swap cache pointer per swapped out page too.  Even after calling
__delete_from_swap_cache(), we store the "shadow" entry there.  Although
it's possible to implement shadow entry reclaiming like that for file
cache shadow entry (workingset_shadow_shrinker), we haven't done that
yet.  And, it appears that we can live with that.  So, in current
implementation, for each swapped out page, we use 9 bytes.  If so, the
memory usage ratio is 24 / 9 = 2.667, still not trivial, but not as
horrible as 24 / 1 = 24.

> Keep in mind that the current overhead is 1 byte O(max swap pages) not
> O(swapped). Also, 1 byte is assuming we do not use the swap
> continuation pages. If we do, it may end up being more. We also
> allocate continuation in full 4k pages, so even if one swap_map
> element in a page requires continuation, we will allocate an entire
> page. What I am trying to say is that to get an actual comparison you
> need to also factor in the swap utilization and the rate of usage of
> swap continuation. I don't know how to come up with a formula for this
> tbh.
>
> Also, like Johannes said, the worst case overhead (32 bytes if you
> count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> overhead for people not using zswap, but it is not very awful.
>
>>
>> It seems what you really need is one bit of information to indicate
>> this page is backed by zswap. Then you can have a seperate pointer
>> for the zswap entry.
>
> If you use one bit in swp_entry_t (or one of the available swap types)
> to indicate whether the page is backed with a swapfile or zswap it
> doesn't really work. We lose the indirection layer. How do we move the
> page from zswap to swapfile? We need to go update the page tables and
> the shmem page cache, similar to swapoff.
>
> Instead, if we store a key else in swp_entry_t and use this to lookup
> the swp_entry_t or zswap_entry pointer then that's essentially what
> the swap_desc does. It just goes the extra mile of unifying the
> swapcache as well and storing it directly in the swap_desc instead of
> storing it in another lookup structure.

If we choose to make sizeof(struct swap_desc) == 8, that is, store only
swap_entry in swap_desc.  The added indirection appears to be another
level of page table with 1 entry.  Then, we may use the similar method
as supporting system with 2 level and 3 level page tables, like the code
in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
this deeply.

>>
>> Depending on how much you are going to reuse the swap cache, you might
>> need to have something like a swap_info_struct to keep the locks happy.
>
> My current intention is to reimplement the swapcache completely as a
> pointer in struct swap_desc. This would eliminate this need and a lot
> of the locking we do today if I get things right.
>
>>
>> > Another potential concern is readahead. With this design, we have no
>>
>> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
>> use some modernization.
>
> Yeah, I initially thought we would only need the swp_entry_t ->
> swap_desc reverse mapping for readahead, and that we can only store
> that for spinning disks, but I was wrong. We need for other things as
> well today: swapoff, when trying to find an empty swap slot and we
> start trying to free swap slots used only by the swapcache. However, I
> think both of these cases can be fixed (I can share more details if
> you want). If everything goes well we should only need to maintain the
> reverse mapping (extra overhead above 24 bytes) for swap files on
> spinning disks for readahead.
>
>>
>> Looking forward to your discussion.
>>
>> Chris
>>

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-09 12:48     ` Huang, Ying
@ 2023-03-09 19:58       ` Chris Li
  2023-03-09 20:19       ` Yosry Ahmed
  1 sibling, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-09 19:58 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton

On Thu, Mar 09, 2023 at 08:48:28PM +0800, Huang, Ying wrote:
> Yosry Ahmed <yosryahmed@google.com> writes:
> 
> >
> > struct swap_desc {
> >     union { /* Use one bit to distinguish them */
> >         swp_entry_t swap_entry;
> >         struct zswap_entry *zswap_entry;
> >     };
> >     struct folio *swapcache;
> >     atomic_t swap_count;
> >     u32 id;
> > }
> >
> > Having the id in the swap_desc is convenient as we can directly map
> > the swap_desc to a swp_entry_t to place in the page tables, but I
> > don't think it's necessary. Without it, the struct size is 20 bytes,
> > so I think the extra 4 bytes are okay to use anyway if the slab
> > allocator only allocates multiples of 8 bytes.
> >
> > The idea here is to unify the swapcache and swap_count implementation
> > between different swap backends (swapfiles, zswap, etc), which would
> > create a better abstraction and reduce reinventing the wheel.
> >
> > We can reduce to only 8 bytes and only store the swap/zswap entry, but
> > we still need the swap cache anyway so might as well just store the
> > pointer in the struct and have a unified lookup-free swapcache, so
> > really 16 bytes is the minimum.
> >
> > If we stop at 16 bytes, then we need to handle swap count separately
> > in swapfiles and zswap. This is not the end of the world, but are the
> > 8 bytes worth this?
> 
> If my understanding were correct, for current implementation, we need
> one swap cache pointer per swapped out page too.  Even after calling
> __delete_from_swap_cache(), we store the "shadow" entry there.  Although

That is correct. We have the "shadow" entry.

> it's possible to implement shadow entry reclaiming like that for file
> cache shadow entry (workingset_shadow_shrinker), we haven't done that
> yet.  And, it appears that we can live with that.  So, in current
> implementation, for each swapped out page, we use 9 bytes.  If so, the
> memory usage ratio is 24 / 9 = 2.667, still not trivial, but not as
> horrible as 24 / 1 = 24.


The swap_desc proposal did not explicit save the shadow entry in swap_desc.
So the math should be (24 + 8) vs ( 1 + 8). There is about 20 byte extra
per page frame.

> > Instead, if we store a key else in swp_entry_t and use this to lookup
> > the swp_entry_t or zswap_entry pointer then that's essentially what
> > the swap_desc does. It just goes the extra mile of unifying the
> > swapcache as well and storing it directly in the swap_desc instead of
> > storing it in another lookup structure.
> 
> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> swap_entry in swap_desc.  The added indirection appears to be another
> level of page table with 1 entry.  Then, we may use the similar method
> as supporting system with 2 level and 3 level page tables, like the code
> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
> this deeply.

I would like to explore other possibility as well. More idea and discussion is
welcome.

Chris



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-09 12:48     ` Huang, Ying
  2023-03-09 19:58       ` Chris Li
@ 2023-03-09 20:19       ` Yosry Ahmed
  2023-03-10  3:06         ` Huang, Ying
  1 sibling, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-09 20:19 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton

On Thu, Mar 9, 2023 at 4:49 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
> >>
> >> Hi Yosry,
> >>
> >> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> >> > Hello everyone,
> >> >
> >> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> >> > 2023 about swap & zswap (hope I am not too late).
> >>
> >> I am very interested in participating in this discussion as well.
> >
> > That's great to hear!
> >
> >>
> >> > ==================== Objective ====================
> >> > Enabling the use of zswap without a backing swapfile, which makes
> >> > zswap useful for a wider variety of use cases. Also, when zswap is
> >> > used with a swapfile, the pages in zswap do not use up space in the
> >> > swapfile, so the overall swapping capacity increases.
> >>
> >> Agree.
> >>
> >> >
> >> > ==================== Idea ====================
> >> > Introduce a data structure, which I currently call a swap_desc, as an
> >> > abstraction layer between swapping implementation and the rest of MM
> >> > code. Page tables & page caches would store a swap id (encoded as a
> >> > swp_entry_t) instead of directly storing the swap entry associated
> >> > with the swapfile. This swap id maps to a struct swap_desc, which acts
> >>
> >> Can you provide a bit more detail? I am curious how this swap id
> >> maps into the swap_desc? Is the swp_entry_t cast into "struct
> >> swap_desc*" or going through some lookup table/tree?
> >
> > swap id would be an index in a radix tree (aka xarray), which contains
> > a pointer to the swap_desc struct. This lookup should be free with
> > this design as we also use swap_desc to directly store the swap cache
> > pointer, so this lookup essentially replaces the swap cache lookup.
> >
> >>
> >> > as our abstraction layer. All MM code not concerned with swapping
> >> > details would operate in terms of swap descs. The swap_desc can point
> >> > to either a normal swap entry (associated with a swapfile) or a zswap
> >> > entry. It can also include all non-backend specific operations, such
> >> > as the swapcache (which would be a simple pointer in swap_desc), swap
> >>
> >> Does the zswap entry still use the swap slot cache and swap_info_struct?
> >
> > In this design no, it shouldn't.
> >
> >>
> >> > This work enables using zswap without a backing swapfile and increases
> >> > the swap capacity when zswap is used with a swapfile. It also creates
> >> > a separation that allows us to skip code paths that don't make sense
> >> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> >> > which might result in better performance (less lookups, less lock
> >> > contention).
> >> >
> >> > The abstraction layer also opens the door for multiple cleanups (e.g.
> >> > removing swapper address spaces, removing swap count continuation
> >> > code, etc). Another nice cleanup that this work enables would be
> >> > separating the overloaded swp_entry_t into two distinct types: one for
> >> > things that are stored in page tables / caches, and for actual swap
> >> > entries. In the future, we can potentially further optimize how we use
> >> > the bits in the page tables instead of sticking everything into the
> >> > current type/offset format.
> >>
> >> Looking forward to seeing more details in the upcoming discussion.
> >> >
> >> > ==================== Cost ====================
> >> > The obvious downside of this is added memory overhead, specifically
> >> > for users that use swapfiles without zswap. Instead of paying one byte
> >> > (swap_map) for every potential page in the swapfile (+ swap count
> >> > continuation), we pay the size of the swap_desc for every page that is
> >> > actually in the swapfile, which I am estimating can be roughly around
> >> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> >> > scales with pages actually swapped out. For zswap users, it should be
> >>
> >> Is there a way to avoid turning 1 byte into 24 byte per swapped
> >> pages? For the users that use swap but no zswap, this is pure overhead.
> >
> > That's what I could think of at this point. My idea was something like this:
> >
> > struct swap_desc {
> >     union { /* Use one bit to distinguish them */
> >         swp_entry_t swap_entry;
> >         struct zswap_entry *zswap_entry;
> >     };
> >     struct folio *swapcache;
> >     atomic_t swap_count;
> >     u32 id;
> > }
> >
> > Having the id in the swap_desc is convenient as we can directly map
> > the swap_desc to a swp_entry_t to place in the page tables, but I
> > don't think it's necessary. Without it, the struct size is 20 bytes,
> > so I think the extra 4 bytes are okay to use anyway if the slab
> > allocator only allocates multiples of 8 bytes.
> >
> > The idea here is to unify the swapcache and swap_count implementation
> > between different swap backends (swapfiles, zswap, etc), which would
> > create a better abstraction and reduce reinventing the wheel.
> >
> > We can reduce to only 8 bytes and only store the swap/zswap entry, but
> > we still need the swap cache anyway so might as well just store the
> > pointer in the struct and have a unified lookup-free swapcache, so
> > really 16 bytes is the minimum.
> >
> > If we stop at 16 bytes, then we need to handle swap count separately
> > in swapfiles and zswap. This is not the end of the world, but are the
> > 8 bytes worth this?
>
> If my understanding were correct, for current implementation, we need
> one swap cache pointer per swapped out page too.  Even after calling
> __delete_from_swap_cache(), we store the "shadow" entry there.  Although
> it's possible to implement shadow entry reclaiming like that for file
> cache shadow entry (workingset_shadow_shrinker), we haven't done that
> yet.  And, it appears that we can live with that.  So, in current
> implementation, for each swapped out page, we use 9 bytes.  If so, the
> memory usage ratio is 24 / 9 = 2.667, still not trivial, but not as
> horrible as 24 / 1 = 24.

Unfortunately it's a little bit more. 24 is the extra overhead.

Today we have an xarray entry for each swapped out page, that either
has the swapcache pointer or the shadow entry.

With this implementation, we have an xarray entry for each swapped out
page, that has a pointer to the swap_desc.

Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.

For rotating disks, this might be even higher (8 + 32) / (8 + 1) = 4.444

This is because we need to maintain a reverse mapping between
swp_entry_t and the swap_desc to use for cluster readahead. I am
assuming we can limit cluster readahead for rotating disks only.

>
> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> > O(swapped). Also, 1 byte is assuming we do not use the swap
> > continuation pages. If we do, it may end up being more. We also
> > allocate continuation in full 4k pages, so even if one swap_map
> > element in a page requires continuation, we will allocate an entire
> > page. What I am trying to say is that to get an actual comparison you
> > need to also factor in the swap utilization and the rate of usage of
> > swap continuation. I don't know how to come up with a formula for this
> > tbh.
> >
> > Also, like Johannes said, the worst case overhead (32 bytes if you
> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> > overhead for people not using zswap, but it is not very awful.
> >
> >>
> >> It seems what you really need is one bit of information to indicate
> >> this page is backed by zswap. Then you can have a seperate pointer
> >> for the zswap entry.
> >
> > If you use one bit in swp_entry_t (or one of the available swap types)
> > to indicate whether the page is backed with a swapfile or zswap it
> > doesn't really work. We lose the indirection layer. How do we move the
> > page from zswap to swapfile? We need to go update the page tables and
> > the shmem page cache, similar to swapoff.
> >
> > Instead, if we store a key else in swp_entry_t and use this to lookup
> > the swp_entry_t or zswap_entry pointer then that's essentially what
> > the swap_desc does. It just goes the extra mile of unifying the
> > swapcache as well and storing it directly in the swap_desc instead of
> > storing it in another lookup structure.
>
> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> swap_entry in swap_desc.  The added indirection appears to be another
> level of page table with 1 entry.  Then, we may use the similar method
> as supporting system with 2 level and 3 level page tables, like the code
> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
> this deeply.

Can you expand further on this idea? I am not sure I fully understand.

>
> >>
> >> Depending on how much you are going to reuse the swap cache, you might
> >> need to have something like a swap_info_struct to keep the locks happy.
> >
> > My current intention is to reimplement the swapcache completely as a
> > pointer in struct swap_desc. This would eliminate this need and a lot
> > of the locking we do today if I get things right.
> >
> >>
> >> > Another potential concern is readahead. With this design, we have no
> >>
> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> use some modernization.
> >
> > Yeah, I initially thought we would only need the swp_entry_t ->
> > swap_desc reverse mapping for readahead, and that we can only store
> > that for spinning disks, but I was wrong. We need for other things as
> > well today: swapoff, when trying to find an empty swap slot and we
> > start trying to free swap slots used only by the swapcache. However, I
> > think both of these cases can be fixed (I can share more details if
> > you want). If everything goes well we should only need to maintain the
> > reverse mapping (extra overhead above 24 bytes) for swap files on
> > spinning disks for readahead.
> >
> >>
> >> Looking forward to your discussion.
> >>
> >> Chris
> >>
>
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-18 22:38 [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap Yosry Ahmed
                   ` (3 preceding siblings ...)
  2023-02-28 23:11 ` Chris Li
@ 2023-03-10  2:07 ` Luis Chamberlain
  2023-03-10  2:15   ` Yosry Ahmed
  2023-05-12  3:07 ` Yosry Ahmed
  5 siblings, 1 reply; 105+ messages in thread
From: Luis Chamberlain @ 2023-03-10  2:07 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton,
	Anthony Iliopoulos, Davidlohr Bueso, Adam Manzanares

On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> ==================== Intro ====================
> Currently, using zswap is dependent on swapfiles in an unnecessary
> way. To use zswap, you need a swapfile configured (even if the space
> will not be used) and zswap is restricted by its size.

There's also overlap with zram too. zswap uses zpool and so does its
compression backends (zbud, zsmalloc, z3fold), but zram does not. Vitaly
did a great job at presenting some overlap between all this before, so
perhaps we can summon him as I'd imagine he'd be interested. 

https://lpc.events/event/4/contributions/551/attachments/364/597/zram-decouple.pdf

FWIW we've been using zswap with the whole intention of *never* touching
disk on purpose for years now on kdevops with the goal to just do compression
of memory and avoid having to mess with a block device as preamble. This is
done on the hypervisor with tons of guests, that together with KSM saves *huge*
amount of memory:

https://github.com/linux-kdevops/kdevops/blob/master/docs/kernel-ci/kernel-ci-hypervisor-tuning.md

  Luis


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-10  2:07 ` Luis Chamberlain
@ 2023-03-10  2:15   ` Yosry Ahmed
  0 siblings, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-10  2:15 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt,
	David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman,
	Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton,
	Anthony Iliopoulos, Davidlohr Bueso, Adam Manzanares

On Thu, Mar 9, 2023 at 6:08 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > ==================== Intro ====================
> > Currently, using zswap is dependent on swapfiles in an unnecessary
> > way. To use zswap, you need a swapfile configured (even if the space
> > will not be used) and zswap is restricted by its size.
>
> There's also overlap with zram too. zswap uses zpool and so does its
> compression backends (zbud, zsmalloc, z3fold), but zram does not. Vitaly
> did a great job at presenting some overlap between all this before, so
> perhaps we can summon him as I'd imagine he'd be interested.
>
> https://lpc.events/event/4/contributions/551/attachments/364/597/zram-decouple.pdf

Yes. Hopefully by making zswap independent from swapfiles we partially
bridge the gap between zswap and zram.

>
> FWIW we've been using zswap with the whole intention of *never* touching
> disk on purpose for years now on kdevops with the goal to just do compression
> of memory and avoid having to mess with a block device as preamble. This is
> done on the hypervisor with tons of guests, that together with KSM saves *huge*
> amount of memory:
>
> https://github.com/linux-kdevops/kdevops/blob/master/docs/kernel-ci/kernel-ci-hypervisor-tuning.md

At Google, we have been also using zswap without writeback for years.
It's great to hear that others share the same use case. Perhaps it's
finally time to properly support this use case upstream.

>
>   Luis


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-09 20:19       ` Yosry Ahmed
@ 2023-03-10  3:06         ` Huang, Ying
  2023-03-10 23:14           ` Chris Li
  2023-03-11  1:06           ` Yosry Ahmed
  0 siblings, 2 replies; 105+ messages in thread
From: Huang, Ying @ 2023-03-10  3:06 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Thu, Mar 9, 2023 at 4:49 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
>> >>
>> >> Hi Yosry,
>> >>
>> >> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
>> >> > Hello everyone,
>> >> >
>> >> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
>> >> > 2023 about swap & zswap (hope I am not too late).
>> >>
>> >> I am very interested in participating in this discussion as well.
>> >
>> > That's great to hear!
>> >
>> >>
>> >> > ==================== Objective ====================
>> >> > Enabling the use of zswap without a backing swapfile, which makes
>> >> > zswap useful for a wider variety of use cases. Also, when zswap is
>> >> > used with a swapfile, the pages in zswap do not use up space in the
>> >> > swapfile, so the overall swapping capacity increases.
>> >>
>> >> Agree.
>> >>
>> >> >
>> >> > ==================== Idea ====================
>> >> > Introduce a data structure, which I currently call a swap_desc, as an
>> >> > abstraction layer between swapping implementation and the rest of MM
>> >> > code. Page tables & page caches would store a swap id (encoded as a
>> >> > swp_entry_t) instead of directly storing the swap entry associated
>> >> > with the swapfile. This swap id maps to a struct swap_desc, which acts
>> >>
>> >> Can you provide a bit more detail? I am curious how this swap id
>> >> maps into the swap_desc? Is the swp_entry_t cast into "struct
>> >> swap_desc*" or going through some lookup table/tree?
>> >
>> > swap id would be an index in a radix tree (aka xarray), which contains
>> > a pointer to the swap_desc struct. This lookup should be free with
>> > this design as we also use swap_desc to directly store the swap cache
>> > pointer, so this lookup essentially replaces the swap cache lookup.
>> >
>> >>
>> >> > as our abstraction layer. All MM code not concerned with swapping
>> >> > details would operate in terms of swap descs. The swap_desc can point
>> >> > to either a normal swap entry (associated with a swapfile) or a zswap
>> >> > entry. It can also include all non-backend specific operations, such
>> >> > as the swapcache (which would be a simple pointer in swap_desc), swap
>> >>
>> >> Does the zswap entry still use the swap slot cache and swap_info_struct?
>> >
>> > In this design no, it shouldn't.
>> >
>> >>
>> >> > This work enables using zswap without a backing swapfile and increases
>> >> > the swap capacity when zswap is used with a swapfile. It also creates
>> >> > a separation that allows us to skip code paths that don't make sense
>> >> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
>> >> > which might result in better performance (less lookups, less lock
>> >> > contention).
>> >> >
>> >> > The abstraction layer also opens the door for multiple cleanups (e.g.
>> >> > removing swapper address spaces, removing swap count continuation
>> >> > code, etc). Another nice cleanup that this work enables would be
>> >> > separating the overloaded swp_entry_t into two distinct types: one for
>> >> > things that are stored in page tables / caches, and for actual swap
>> >> > entries. In the future, we can potentially further optimize how we use
>> >> > the bits in the page tables instead of sticking everything into the
>> >> > current type/offset format.
>> >>
>> >> Looking forward to seeing more details in the upcoming discussion.
>> >> >
>> >> > ==================== Cost ====================
>> >> > The obvious downside of this is added memory overhead, specifically
>> >> > for users that use swapfiles without zswap. Instead of paying one byte
>> >> > (swap_map) for every potential page in the swapfile (+ swap count
>> >> > continuation), we pay the size of the swap_desc for every page that is
>> >> > actually in the swapfile, which I am estimating can be roughly around
>> >> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
>> >> > scales with pages actually swapped out. For zswap users, it should be
>> >>
>> >> Is there a way to avoid turning 1 byte into 24 byte per swapped
>> >> pages? For the users that use swap but no zswap, this is pure overhead.
>> >
>> > That's what I could think of at this point. My idea was something like this:
>> >
>> > struct swap_desc {
>> >     union { /* Use one bit to distinguish them */
>> >         swp_entry_t swap_entry;
>> >         struct zswap_entry *zswap_entry;
>> >     };
>> >     struct folio *swapcache;
>> >     atomic_t swap_count;
>> >     u32 id;
>> > }
>> >
>> > Having the id in the swap_desc is convenient as we can directly map
>> > the swap_desc to a swp_entry_t to place in the page tables, but I
>> > don't think it's necessary. Without it, the struct size is 20 bytes,
>> > so I think the extra 4 bytes are okay to use anyway if the slab
>> > allocator only allocates multiples of 8 bytes.
>> >
>> > The idea here is to unify the swapcache and swap_count implementation
>> > between different swap backends (swapfiles, zswap, etc), which would
>> > create a better abstraction and reduce reinventing the wheel.
>> >
>> > We can reduce to only 8 bytes and only store the swap/zswap entry, but
>> > we still need the swap cache anyway so might as well just store the
>> > pointer in the struct and have a unified lookup-free swapcache, so
>> > really 16 bytes is the minimum.
>> >
>> > If we stop at 16 bytes, then we need to handle swap count separately
>> > in swapfiles and zswap. This is not the end of the world, but are the
>> > 8 bytes worth this?
>>
>> If my understanding were correct, for current implementation, we need
>> one swap cache pointer per swapped out page too.  Even after calling
>> __delete_from_swap_cache(), we store the "shadow" entry there.  Although
>> it's possible to implement shadow entry reclaiming like that for file
>> cache shadow entry (workingset_shadow_shrinker), we haven't done that
>> yet.  And, it appears that we can live with that.  So, in current
>> implementation, for each swapped out page, we use 9 bytes.  If so, the
>> memory usage ratio is 24 / 9 = 2.667, still not trivial, but not as
>> horrible as 24 / 1 = 24.
>
> Unfortunately it's a little bit more. 24 is the extra overhead.
>
> Today we have an xarray entry for each swapped out page, that either
> has the swapcache pointer or the shadow entry.
>
> With this implementation, we have an xarray entry for each swapped out
> page, that has a pointer to the swap_desc.
>
> Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.

OK.  I see.  We can only hold 8 bytes for each xarray entry.  To save
memory usage, we can allocate multiple swap_desc (e.g., 16) for each
xarray entry.  Then the memory usage of xarray becomes 1/N.

> For rotating disks, this might be even higher (8 + 32) / (8 + 1) = 4.444
>
> This is because we need to maintain a reverse mapping between
> swp_entry_t and the swap_desc to use for cluster readahead. I am
> assuming we can limit cluster readahead for rotating disks only.

If reverse mapping cannot be avoided for enough situation, it's better
to only keep swap_entry in swap_desc, and create another xarray indexed
by swap_entry and store swap_cache, swap_count etc.

>>
>> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
>> > O(swapped). Also, 1 byte is assuming we do not use the swap
>> > continuation pages. If we do, it may end up being more. We also
>> > allocate continuation in full 4k pages, so even if one swap_map
>> > element in a page requires continuation, we will allocate an entire
>> > page. What I am trying to say is that to get an actual comparison you
>> > need to also factor in the swap utilization and the rate of usage of
>> > swap continuation. I don't know how to come up with a formula for this
>> > tbh.
>> >
>> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> > overhead for people not using zswap, but it is not very awful.
>> >
>> >>
>> >> It seems what you really need is one bit of information to indicate
>> >> this page is backed by zswap. Then you can have a seperate pointer
>> >> for the zswap entry.
>> >
>> > If you use one bit in swp_entry_t (or one of the available swap types)
>> > to indicate whether the page is backed with a swapfile or zswap it
>> > doesn't really work. We lose the indirection layer. How do we move the
>> > page from zswap to swapfile? We need to go update the page tables and
>> > the shmem page cache, similar to swapoff.
>> >
>> > Instead, if we store a key else in swp_entry_t and use this to lookup
>> > the swp_entry_t or zswap_entry pointer then that's essentially what
>> > the swap_desc does. It just goes the extra mile of unifying the
>> > swapcache as well and storing it directly in the swap_desc instead of
>> > storing it in another lookup structure.
>>
>> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> swap_entry in swap_desc.  The added indirection appears to be another
>> level of page table with 1 entry.  Then, we may use the similar method
>> as supporting system with 2 level and 3 level page tables, like the code
>> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
>> this deeply.
>
> Can you expand further on this idea? I am not sure I fully understand.

OK.  The goal is to avoid the overhead if indirection isn't enabled via
kconfig.

If indirection isn't enabled, store swap_entry in PTE directly.
Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
to get/set swap_entry in PTE) are implemented based on kconfig.

>> >>
>> >> Depending on how much you are going to reuse the swap cache, you might
>> >> need to have something like a swap_info_struct to keep the locks happy.
>> >
>> > My current intention is to reimplement the swapcache completely as a
>> > pointer in struct swap_desc. This would eliminate this need and a lot
>> > of the locking we do today if I get things right.
>> >
>> >>
>> >> > Another potential concern is readahead. With this design, we have no
>> >>
>> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
>> >> use some modernization.
>> >
>> > Yeah, I initially thought we would only need the swp_entry_t ->
>> > swap_desc reverse mapping for readahead, and that we can only store
>> > that for spinning disks, but I was wrong. We need for other things as
>> > well today: swapoff, when trying to find an empty swap slot and we
>> > start trying to free swap slots used only by the swapcache. However, I
>> > think both of these cases can be fixed (I can share more details if
>> > you want). If everything goes well we should only need to maintain the
>> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> > spinning disks for readahead.
>> >
>> >>
>> >> Looking forward to your discussion.

Per my understanding, the indirection is to make it easy to move
(swapped) pages among swap devices based on hot/cold.  This is similar
as the target of memory tiering.  It appears that we can extend the
memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
Is it possible for zswap to be faster than some slow memory media?

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-10  3:06         ` Huang, Ying
@ 2023-03-10 23:14           ` Chris Li
  2023-03-13  1:10             ` Huang, Ying
  2023-03-11  1:06           ` Yosry Ahmed
  1 sibling, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-10 23:14 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Fri, Mar 10, 2023 at 11:06:37AM +0800, Huang, Ying wrote:
> > Unfortunately it's a little bit more. 24 is the extra overhead.
> >
> > Today we have an xarray entry for each swapped out page, that either
> > has the swapcache pointer or the shadow entry.
> >
> > With this implementation, we have an xarray entry for each swapped out
> > page, that has a pointer to the swap_desc.
> >
> > Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.
> 
> OK.  I see.  We can only hold 8 bytes for each xarray entry.  To save
> memory usage, we can allocate multiple swap_desc (e.g., 16) for each
> xarray entry.  Then the memory usage of xarray becomes 1/N.

The xarray look up key is the swap offset from the swap entry. If you
put more than one swap_desc under the one xarray entry. It will mean
all those different swap_descs will share a swap offset.

> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> > swap_desc reverse mapping for readahead, and that we can only store
> >> > that for spinning disks, but I was wrong. We need for other things as
> >> > well today: swapoff, when trying to find an empty swap slot and we
> >> > start trying to free swap slots used only by the swapcache. However, I
> >> > think both of these cases can be fixed (I can share more details if
> >> > you want). If everything goes well we should only need to maintain the
> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
> >> > spinning disks for readahead.
> >> >
> >> >>
> >> >> Looking forward to your discussion.
> 
> Per my understanding, the indirection is to make it easy to move
> (swapped) pages among swap devices based on hot/cold.  This is similar
> as the target of memory tiering.  It appears that we can extend the
> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> Is it possible for zswap to be faster than some slow memory media?

I just took a look at the mm/memory-tier.c. It does not look like it
can cover block device swap without major overhauling.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-10  3:06         ` Huang, Ying
  2023-03-10 23:14           ` Chris Li
@ 2023-03-11  1:06           ` Yosry Ahmed
  2023-03-13  2:12             ` Huang, Ying
  1 sibling, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-11  1:06 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Thu, Mar 9, 2023 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Thu, Mar 9, 2023 at 4:49 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
> >> >>
> >> >> Hi Yosry,
> >> >>
> >> >> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> >> >> > Hello everyone,
> >> >> >
> >> >> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> >> >> > 2023 about swap & zswap (hope I am not too late).
> >> >>
> >> >> I am very interested in participating in this discussion as well.
> >> >
> >> > That's great to hear!
> >> >
> >> >>
> >> >> > ==================== Objective ====================
> >> >> > Enabling the use of zswap without a backing swapfile, which makes
> >> >> > zswap useful for a wider variety of use cases. Also, when zswap is
> >> >> > used with a swapfile, the pages in zswap do not use up space in the
> >> >> > swapfile, so the overall swapping capacity increases.
> >> >>
> >> >> Agree.
> >> >>
> >> >> >
> >> >> > ==================== Idea ====================
> >> >> > Introduce a data structure, which I currently call a swap_desc, as an
> >> >> > abstraction layer between swapping implementation and the rest of MM
> >> >> > code. Page tables & page caches would store a swap id (encoded as a
> >> >> > swp_entry_t) instead of directly storing the swap entry associated
> >> >> > with the swapfile. This swap id maps to a struct swap_desc, which acts
> >> >>
> >> >> Can you provide a bit more detail? I am curious how this swap id
> >> >> maps into the swap_desc? Is the swp_entry_t cast into "struct
> >> >> swap_desc*" or going through some lookup table/tree?
> >> >
> >> > swap id would be an index in a radix tree (aka xarray), which contains
> >> > a pointer to the swap_desc struct. This lookup should be free with
> >> > this design as we also use swap_desc to directly store the swap cache
> >> > pointer, so this lookup essentially replaces the swap cache lookup.
> >> >
> >> >>
> >> >> > as our abstraction layer. All MM code not concerned with swapping
> >> >> > details would operate in terms of swap descs. The swap_desc can point
> >> >> > to either a normal swap entry (associated with a swapfile) or a zswap
> >> >> > entry. It can also include all non-backend specific operations, such
> >> >> > as the swapcache (which would be a simple pointer in swap_desc), swap
> >> >>
> >> >> Does the zswap entry still use the swap slot cache and swap_info_struct?
> >> >
> >> > In this design no, it shouldn't.
> >> >
> >> >>
> >> >> > This work enables using zswap without a backing swapfile and increases
> >> >> > the swap capacity when zswap is used with a swapfile. It also creates
> >> >> > a separation that allows us to skip code paths that don't make sense
> >> >> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> >> >> > which might result in better performance (less lookups, less lock
> >> >> > contention).
> >> >> >
> >> >> > The abstraction layer also opens the door for multiple cleanups (e.g.
> >> >> > removing swapper address spaces, removing swap count continuation
> >> >> > code, etc). Another nice cleanup that this work enables would be
> >> >> > separating the overloaded swp_entry_t into two distinct types: one for
> >> >> > things that are stored in page tables / caches, and for actual swap
> >> >> > entries. In the future, we can potentially further optimize how we use
> >> >> > the bits in the page tables instead of sticking everything into the
> >> >> > current type/offset format.
> >> >>
> >> >> Looking forward to seeing more details in the upcoming discussion.
> >> >> >
> >> >> > ==================== Cost ====================
> >> >> > The obvious downside of this is added memory overhead, specifically
> >> >> > for users that use swapfiles without zswap. Instead of paying one byte
> >> >> > (swap_map) for every potential page in the swapfile (+ swap count
> >> >> > continuation), we pay the size of the swap_desc for every page that is
> >> >> > actually in the swapfile, which I am estimating can be roughly around
> >> >> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> >> >> > scales with pages actually swapped out. For zswap users, it should be
> >> >>
> >> >> Is there a way to avoid turning 1 byte into 24 byte per swapped
> >> >> pages? For the users that use swap but no zswap, this is pure overhead.
> >> >
> >> > That's what I could think of at this point. My idea was something like this:
> >> >
> >> > struct swap_desc {
> >> >     union { /* Use one bit to distinguish them */
> >> >         swp_entry_t swap_entry;
> >> >         struct zswap_entry *zswap_entry;
> >> >     };
> >> >     struct folio *swapcache;
> >> >     atomic_t swap_count;
> >> >     u32 id;
> >> > }
> >> >
> >> > Having the id in the swap_desc is convenient as we can directly map
> >> > the swap_desc to a swp_entry_t to place in the page tables, but I
> >> > don't think it's necessary. Without it, the struct size is 20 bytes,
> >> > so I think the extra 4 bytes are okay to use anyway if the slab
> >> > allocator only allocates multiples of 8 bytes.
> >> >
> >> > The idea here is to unify the swapcache and swap_count implementation
> >> > between different swap backends (swapfiles, zswap, etc), which would
> >> > create a better abstraction and reduce reinventing the wheel.
> >> >
> >> > We can reduce to only 8 bytes and only store the swap/zswap entry, but
> >> > we still need the swap cache anyway so might as well just store the
> >> > pointer in the struct and have a unified lookup-free swapcache, so
> >> > really 16 bytes is the minimum.
> >> >
> >> > If we stop at 16 bytes, then we need to handle swap count separately
> >> > in swapfiles and zswap. This is not the end of the world, but are the
> >> > 8 bytes worth this?
> >>
> >> If my understanding were correct, for current implementation, we need
> >> one swap cache pointer per swapped out page too.  Even after calling
> >> __delete_from_swap_cache(), we store the "shadow" entry there.  Although
> >> it's possible to implement shadow entry reclaiming like that for file
> >> cache shadow entry (workingset_shadow_shrinker), we haven't done that
> >> yet.  And, it appears that we can live with that.  So, in current
> >> implementation, for each swapped out page, we use 9 bytes.  If so, the
> >> memory usage ratio is 24 / 9 = 2.667, still not trivial, but not as
> >> horrible as 24 / 1 = 24.
> >
> > Unfortunately it's a little bit more. 24 is the extra overhead.
> >
> > Today we have an xarray entry for each swapped out page, that either
> > has the swapcache pointer or the shadow entry.
> >
> > With this implementation, we have an xarray entry for each swapped out
> > page, that has a pointer to the swap_desc.
> >
> > Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.
>
> OK.  I see.  We can only hold 8 bytes for each xarray entry.  To save
> memory usage, we can allocate multiple swap_desc (e.g., 16) for each
> xarray entry.  Then the memory usage of xarray becomes 1/N.
>
> > For rotating disks, this might be even higher (8 + 32) / (8 + 1) = 4.444
> >
> > This is because we need to maintain a reverse mapping between
> > swp_entry_t and the swap_desc to use for cluster readahead. I am
> > assuming we can limit cluster readahead for rotating disks only.
>
> If reverse mapping cannot be avoided for enough situation, it's better
> to only keep swap_entry in swap_desc, and create another xarray indexed
> by swap_entry and store swap_cache, swap_count etc.


My current idea is to have one xarray that stores the swap_descs
(which include swap_entry, swapcache, swap_count, etc), and only for
rotating disks have an additional xarray that maps swap_entry ->
swap_desc for cluster readahead, assuming we can eliminate all other
situations requiring a reverse mapping.

I am not sure how having separate xarrays help? If we have one xarray,
might as well save the other lookups on put everything in swap_desc.
In fact, this should improve the locking today as swapcache /
swap_count operations can be lockless or very lightly contended.

If the point is to store the swap_desc directly inside the xarray to
save 8 bytes, I am concerned that having multiple xarrays for
swapcache, swap_count, etc will use more than that.

>
>
> >>
> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
> >> > continuation pages. If we do, it may end up being more. We also
> >> > allocate continuation in full 4k pages, so even if one swap_map
> >> > element in a page requires continuation, we will allocate an entire
> >> > page. What I am trying to say is that to get an actual comparison you
> >> > need to also factor in the swap utilization and the rate of usage of
> >> > swap continuation. I don't know how to come up with a formula for this
> >> > tbh.
> >> >
> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> >> > overhead for people not using zswap, but it is not very awful.
> >> >
> >> >>
> >> >> It seems what you really need is one bit of information to indicate
> >> >> this page is backed by zswap. Then you can have a seperate pointer
> >> >> for the zswap entry.
> >> >
> >> > If you use one bit in swp_entry_t (or one of the available swap types)
> >> > to indicate whether the page is backed with a swapfile or zswap it
> >> > doesn't really work. We lose the indirection layer. How do we move the
> >> > page from zswap to swapfile? We need to go update the page tables and
> >> > the shmem page cache, similar to swapoff.
> >> >
> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
> >> > the swap_desc does. It just goes the extra mile of unifying the
> >> > swapcache as well and storing it directly in the swap_desc instead of
> >> > storing it in another lookup structure.
> >>
> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> >> swap_entry in swap_desc.  The added indirection appears to be another
> >> level of page table with 1 entry.  Then, we may use the similar method
> >> as supporting system with 2 level and 3 level page tables, like the code
> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
> >> this deeply.
> >
> > Can you expand further on this idea? I am not sure I fully understand.
>
> OK.  The goal is to avoid the overhead if indirection isn't enabled via
> kconfig.
>
> If indirection isn't enabled, store swap_entry in PTE directly.
> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
> to get/set swap_entry in PTE) are implemented based on kconfig.


I thought about this, the problem is that we will have multiple
implementations of multiple things. For example, swap_count without
the indirection layer lives in the swap_map (with continuation logic).
With the indirection layer, it lives in the swap_desc (or somewhere
else). Same for the swapcache. Even if we keep the swapcache in an
xarray and not inside swap_desc, it would be indexed by swap_entry if
the indirection is disabled, and by swap_desc (or similar) if the
indirection is enabled. I think maintaining separate implementations
for when the indirection is enabled/disabled would be adding too much
complexity.

WDYT?

>
>
> >> >>
> >> >> Depending on how much you are going to reuse the swap cache, you might
> >> >> need to have something like a swap_info_struct to keep the locks happy.
> >> >
> >> > My current intention is to reimplement the swapcache completely as a
> >> > pointer in struct swap_desc. This would eliminate this need and a lot
> >> > of the locking we do today if I get things right.
> >> >
> >> >>
> >> >> > Another potential concern is readahead. With this design, we have no
> >> >>
> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> >> use some modernization.
> >> >
> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> > swap_desc reverse mapping for readahead, and that we can only store
> >> > that for spinning disks, but I was wrong. We need for other things as
> >> > well today: swapoff, when trying to find an empty swap slot and we
> >> > start trying to free swap slots used only by the swapcache. However, I
> >> > think both of these cases can be fixed (I can share more details if
> >> > you want). If everything goes well we should only need to maintain the
> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
> >> > spinning disks for readahead.
> >> >
> >> >>
> >> >> Looking forward to your discussion.
>
> Per my understanding, the indirection is to make it easy to move
> (swapped) pages among swap devices based on hot/cold.  This is similar
> as the target of memory tiering.  It appears that we can extend the
> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> Is it possible for zswap to be faster than some slow memory media?


Agree with Chris that this may require a much larger overhaul. A slow
memory tier is still addressable memory, swap/zswap requires a page
fault to read the pages. I think (at least for now) there is a
fundamental difference. We want reclaim to eventually treat slow
memory & swap as just different tiers to place cold memory in with
different characteristics, but otherwise I think the swapping
implementation itself is very different.  Am I missing something?

>
>
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-10 23:14           ` Chris Li
@ 2023-03-13  1:10             ` Huang, Ying
  2023-03-15  7:41               ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-13  1:10 UTC (permalink / raw)
  To: Chris Li
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Chris Li <chrisl@kernel.org> writes:

> On Fri, Mar 10, 2023 at 11:06:37AM +0800, Huang, Ying wrote:
>> > Unfortunately it's a little bit more. 24 is the extra overhead.
>> >
>> > Today we have an xarray entry for each swapped out page, that either
>> > has the swapcache pointer or the shadow entry.
>> >
>> > With this implementation, we have an xarray entry for each swapped out
>> > page, that has a pointer to the swap_desc.
>> >
>> > Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.
>> 
>> OK.  I see.  We can only hold 8 bytes for each xarray entry.  To save
>> memory usage, we can allocate multiple swap_desc (e.g., 16) for each
>> xarray entry.  Then the memory usage of xarray becomes 1/N.
>
> The xarray look up key is the swap offset from the swap entry. If you
> put more than one swap_desc under the one xarray entry. It will mean
> all those different swap_descs will share a swap offset.

For example, if we allocate 16 swap_desc for each xarray entry.  Then,
we can use (swap_desc_index >> 4) as key to lookup xarray, and
(swap_desc_index & 0x15) to index inside 16 swap_desc for the xarray
entry.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-11  1:06           ` Yosry Ahmed
@ 2023-03-13  2:12             ` Huang, Ying
  2023-03-15  8:01               ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-13  2:12 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Thu, Mar 9, 2023 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Thu, Mar 9, 2023 at 4:49 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> > On Tue, Feb 28, 2023 at 3:11 PM Chris Li <chrisl@kernel.org> wrote:
>> >> >>
>> >> >> Hi Yosry,
>> >> >>
>> >> >> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
>> >> >> > Hello everyone,
>> >> >> >
>> >> >> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
>> >> >> > 2023 about swap & zswap (hope I am not too late).
>> >> >>
>> >> >> I am very interested in participating in this discussion as well.
>> >> >
>> >> > That's great to hear!
>> >> >
>> >> >>
>> >> >> > ==================== Objective ====================
>> >> >> > Enabling the use of zswap without a backing swapfile, which makes
>> >> >> > zswap useful for a wider variety of use cases. Also, when zswap is
>> >> >> > used with a swapfile, the pages in zswap do not use up space in the
>> >> >> > swapfile, so the overall swapping capacity increases.
>> >> >>
>> >> >> Agree.
>> >> >>
>> >> >> >
>> >> >> > ==================== Idea ====================
>> >> >> > Introduce a data structure, which I currently call a swap_desc, as an
>> >> >> > abstraction layer between swapping implementation and the rest of MM
>> >> >> > code. Page tables & page caches would store a swap id (encoded as a
>> >> >> > swp_entry_t) instead of directly storing the swap entry associated
>> >> >> > with the swapfile. This swap id maps to a struct swap_desc, which acts
>> >> >>
>> >> >> Can you provide a bit more detail? I am curious how this swap id
>> >> >> maps into the swap_desc? Is the swp_entry_t cast into "struct
>> >> >> swap_desc*" or going through some lookup table/tree?
>> >> >
>> >> > swap id would be an index in a radix tree (aka xarray), which contains
>> >> > a pointer to the swap_desc struct. This lookup should be free with
>> >> > this design as we also use swap_desc to directly store the swap cache
>> >> > pointer, so this lookup essentially replaces the swap cache lookup.
>> >> >
>> >> >>
>> >> >> > as our abstraction layer. All MM code not concerned with swapping
>> >> >> > details would operate in terms of swap descs. The swap_desc can point
>> >> >> > to either a normal swap entry (associated with a swapfile) or a zswap
>> >> >> > entry. It can also include all non-backend specific operations, such
>> >> >> > as the swapcache (which would be a simple pointer in swap_desc), swap
>> >> >>
>> >> >> Does the zswap entry still use the swap slot cache and swap_info_struct?
>> >> >
>> >> > In this design no, it shouldn't.
>> >> >
>> >> >>
>> >> >> > This work enables using zswap without a backing swapfile and increases
>> >> >> > the swap capacity when zswap is used with a swapfile. It also creates
>> >> >> > a separation that allows us to skip code paths that don't make sense
>> >> >> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
>> >> >> > which might result in better performance (less lookups, less lock
>> >> >> > contention).
>> >> >> >
>> >> >> > The abstraction layer also opens the door for multiple cleanups (e.g.
>> >> >> > removing swapper address spaces, removing swap count continuation
>> >> >> > code, etc). Another nice cleanup that this work enables would be
>> >> >> > separating the overloaded swp_entry_t into two distinct types: one for
>> >> >> > things that are stored in page tables / caches, and for actual swap
>> >> >> > entries. In the future, we can potentially further optimize how we use
>> >> >> > the bits in the page tables instead of sticking everything into the
>> >> >> > current type/offset format.
>> >> >>
>> >> >> Looking forward to seeing more details in the upcoming discussion.
>> >> >> >
>> >> >> > ==================== Cost ====================
>> >> >> > The obvious downside of this is added memory overhead, specifically
>> >> >> > for users that use swapfiles without zswap. Instead of paying one byte
>> >> >> > (swap_map) for every potential page in the swapfile (+ swap count
>> >> >> > continuation), we pay the size of the swap_desc for every page that is
>> >> >> > actually in the swapfile, which I am estimating can be roughly around
>> >> >> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
>> >> >> > scales with pages actually swapped out. For zswap users, it should be
>> >> >>
>> >> >> Is there a way to avoid turning 1 byte into 24 byte per swapped
>> >> >> pages? For the users that use swap but no zswap, this is pure overhead.
>> >> >
>> >> > That's what I could think of at this point. My idea was something like this:
>> >> >
>> >> > struct swap_desc {
>> >> >     union { /* Use one bit to distinguish them */
>> >> >         swp_entry_t swap_entry;
>> >> >         struct zswap_entry *zswap_entry;
>> >> >     };
>> >> >     struct folio *swapcache;
>> >> >     atomic_t swap_count;
>> >> >     u32 id;
>> >> > }
>> >> >
>> >> > Having the id in the swap_desc is convenient as we can directly map
>> >> > the swap_desc to a swp_entry_t to place in the page tables, but I
>> >> > don't think it's necessary. Without it, the struct size is 20 bytes,
>> >> > so I think the extra 4 bytes are okay to use anyway if the slab
>> >> > allocator only allocates multiples of 8 bytes.
>> >> >
>> >> > The idea here is to unify the swapcache and swap_count implementation
>> >> > between different swap backends (swapfiles, zswap, etc), which would
>> >> > create a better abstraction and reduce reinventing the wheel.
>> >> >
>> >> > We can reduce to only 8 bytes and only store the swap/zswap entry, but
>> >> > we still need the swap cache anyway so might as well just store the
>> >> > pointer in the struct and have a unified lookup-free swapcache, so
>> >> > really 16 bytes is the minimum.
>> >> >
>> >> > If we stop at 16 bytes, then we need to handle swap count separately
>> >> > in swapfiles and zswap. This is not the end of the world, but are the
>> >> > 8 bytes worth this?
>> >>
>> >> If my understanding were correct, for current implementation, we need
>> >> one swap cache pointer per swapped out page too.  Even after calling
>> >> __delete_from_swap_cache(), we store the "shadow" entry there.  Although
>> >> it's possible to implement shadow entry reclaiming like that for file
>> >> cache shadow entry (workingset_shadow_shrinker), we haven't done that
>> >> yet.  And, it appears that we can live with that.  So, in current
>> >> implementation, for each swapped out page, we use 9 bytes.  If so, the
>> >> memory usage ratio is 24 / 9 = 2.667, still not trivial, but not as
>> >> horrible as 24 / 1 = 24.
>> >
>> > Unfortunately it's a little bit more. 24 is the extra overhead.
>> >
>> > Today we have an xarray entry for each swapped out page, that either
>> > has the swapcache pointer or the shadow entry.
>> >
>> > With this implementation, we have an xarray entry for each swapped out
>> > page, that has a pointer to the swap_desc.
>> >
>> > Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.
>>
>> OK.  I see.  We can only hold 8 bytes for each xarray entry.  To save
>> memory usage, we can allocate multiple swap_desc (e.g., 16) for each
>> xarray entry.  Then the memory usage of xarray becomes 1/N.
>>
>> > For rotating disks, this might be even higher (8 + 32) / (8 + 1) = 4.444
>> >
>> > This is because we need to maintain a reverse mapping between
>> > swp_entry_t and the swap_desc to use for cluster readahead. I am
>> > assuming we can limit cluster readahead for rotating disks only.
>>
>> If reverse mapping cannot be avoided for enough situation, it's better
>> to only keep swap_entry in swap_desc, and create another xarray indexed
>> by swap_entry and store swap_cache, swap_count etc.
>
>
> My current idea is to have one xarray that stores the swap_descs
> (which include swap_entry, swapcache, swap_count, etc), and only for
> rotating disks have an additional xarray that maps swap_entry ->
> swap_desc for cluster readahead, assuming we can eliminate all other
> situations requiring a reverse mapping.
>
> I am not sure how having separate xarrays help? If we have one xarray,
> might as well save the other lookups on put everything in swap_desc.
> In fact, this should improve the locking today as swapcache /
> swap_count operations can be lockless or very lightly contended.

The condition of the proposal is "reverse mapping cannot be avoided for
enough situation".  So, if reverse mapping (or cluster readahead) can be
avoided for enough situations, I think your proposal is good.  Otherwise,
I propose to use 2 xarrays.  You don't need another reverse mapping
xarray, because you just need to read the next several swap_entry into
the swap cache for cluster readahead.  swap_desc isn't needed for
cluster readahead.

> If the point is to store the swap_desc directly inside the xarray to
> save 8 bytes, I am concerned that having multiple xarrays for
> swapcache, swap_count, etc will use more than that.

The idea is to save the memory used by reverse mapping xarray.

>> >>
>> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
>> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
>> >> > continuation pages. If we do, it may end up being more. We also
>> >> > allocate continuation in full 4k pages, so even if one swap_map
>> >> > element in a page requires continuation, we will allocate an entire
>> >> > page. What I am trying to say is that to get an actual comparison you
>> >> > need to also factor in the swap utilization and the rate of usage of
>> >> > swap continuation. I don't know how to come up with a formula for this
>> >> > tbh.
>> >> >
>> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> >> > overhead for people not using zswap, but it is not very awful.
>> >> >
>> >> >>
>> >> >> It seems what you really need is one bit of information to indicate
>> >> >> this page is backed by zswap. Then you can have a seperate pointer
>> >> >> for the zswap entry.
>> >> >
>> >> > If you use one bit in swp_entry_t (or one of the available swap types)
>> >> > to indicate whether the page is backed with a swapfile or zswap it
>> >> > doesn't really work. We lose the indirection layer. How do we move the
>> >> > page from zswap to swapfile? We need to go update the page tables and
>> >> > the shmem page cache, similar to swapoff.
>> >> >
>> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
>> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
>> >> > the swap_desc does. It just goes the extra mile of unifying the
>> >> > swapcache as well and storing it directly in the swap_desc instead of
>> >> > storing it in another lookup structure.
>> >>
>> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> >> swap_entry in swap_desc.  The added indirection appears to be another
>> >> level of page table with 1 entry.  Then, we may use the similar method
>> >> as supporting system with 2 level and 3 level page tables, like the code
>> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
>> >> this deeply.
>> >
>> > Can you expand further on this idea? I am not sure I fully understand.
>>
>> OK.  The goal is to avoid the overhead if indirection isn't enabled via
>> kconfig.
>>
>> If indirection isn't enabled, store swap_entry in PTE directly.
>> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
>> to get/set swap_entry in PTE) are implemented based on kconfig.
>
>
> I thought about this, the problem is that we will have multiple
> implementations of multiple things. For example, swap_count without
> the indirection layer lives in the swap_map (with continuation logic).
> With the indirection layer, it lives in the swap_desc (or somewhere
> else). Same for the swapcache. Even if we keep the swapcache in an
> xarray and not inside swap_desc, it would be indexed by swap_entry if
> the indirection is disabled, and by swap_desc (or similar) if the
> indirection is enabled. I think maintaining separate implementations
> for when the indirection is enabled/disabled would be adding too much
> complexity.
>
> WDYT?

If we go this way, swap cache and swap_count will always be indexed by
swap_entry.  swap_desc just provides a indirection to make it possible
to move between swap devices.

Why must we index swap cache and swap_count by swap_desc if indirection
is enabled?  Yes, we can save one xarray indexing if we do so, but I
don't think the overhead of one xarray indexing is a showstopper.

I think this can be one intermediate step towards your final target.
The changes to current implementation can be smaller.

>> >> >>
>> >> >> Depending on how much you are going to reuse the swap cache, you might
>> >> >> need to have something like a swap_info_struct to keep the locks happy.
>> >> >
>> >> > My current intention is to reimplement the swapcache completely as a
>> >> > pointer in struct swap_desc. This would eliminate this need and a lot
>> >> > of the locking we do today if I get things right.
>> >> >
>> >> >>
>> >> >> > Another potential concern is readahead. With this design, we have no
>> >> >>
>> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
>> >> >> use some modernization.
>> >> >
>> >> > Yeah, I initially thought we would only need the swp_entry_t ->
>> >> > swap_desc reverse mapping for readahead, and that we can only store
>> >> > that for spinning disks, but I was wrong. We need for other things as
>> >> > well today: swapoff, when trying to find an empty swap slot and we
>> >> > start trying to free swap slots used only by the swapcache. However, I
>> >> > think both of these cases can be fixed (I can share more details if
>> >> > you want). If everything goes well we should only need to maintain the
>> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> >> > spinning disks for readahead.
>> >> >
>> >> >>
>> >> >> Looking forward to your discussion.
>>
>> Per my understanding, the indirection is to make it easy to move
>> (swapped) pages among swap devices based on hot/cold.  This is similar
>> as the target of memory tiering.  It appears that we can extend the
>> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
>> Is it possible for zswap to be faster than some slow memory media?
>
>
> Agree with Chris that this may require a much larger overhaul. A slow
> memory tier is still addressable memory, swap/zswap requires a page
> fault to read the pages. I think (at least for now) there is a
> fundamental difference. We want reclaim to eventually treat slow
> memory & swap as just different tiers to place cold memory in with
> different characteristics, but otherwise I think the swapping
> implementation itself is very different.  Am I missing something?

Is it possible that zswap is faster than a really slow memory
addressable device backed by NAND?  TBH, I don't have the answer.

Anyway, do you need a way to describe the tiers of the swap devices?
So, you can move the cold pages among the swap devices based on that?

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-13  1:10             ` Huang, Ying
@ 2023-03-15  7:41               ` Yosry Ahmed
  2023-03-16  1:42                 ` Huang, Ying
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-15  7:41 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Sun, Mar 12, 2023 at 6:11 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Fri, Mar 10, 2023 at 11:06:37AM +0800, Huang, Ying wrote:
> >> > Unfortunately it's a little bit more. 24 is the extra overhead.
> >> >
> >> > Today we have an xarray entry for each swapped out page, that either
> >> > has the swapcache pointer or the shadow entry.
> >> >
> >> > With this implementation, we have an xarray entry for each swapped out
> >> > page, that has a pointer to the swap_desc.
> >> >
> >> > Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.
> >>
> >> OK.  I see.  We can only hold 8 bytes for each xarray entry.  To save
> >> memory usage, we can allocate multiple swap_desc (e.g., 16) for each
> >> xarray entry.  Then the memory usage of xarray becomes 1/N.
> >
> > The xarray look up key is the swap offset from the swap entry. If you
> > put more than one swap_desc under the one xarray entry. It will mean
> > all those different swap_descs will share a swap offset.
>
> For example, if we allocate 16 swap_desc for each xarray entry.  Then,
> we can use (swap_desc_index >> 4) as key to lookup xarray, and
> (swap_desc_index & 0x15) to index inside 16 swap_desc for the xarray
> entry.

With this approach we save (16 - 1) * 8 = 120 bytes per 16 swap_descs,
but only if we use the space for 16 swap_desc's fully. As pages are
swapped in, we can easily end up with less than 16 swap_descs for one
xarray entry, how do we deal with such fragmentation?

At 11 swap_descs per xarray entry, we are not saving any memory
(wasting 24 * 5 = 120 bytes). Below 11 swap_descs, we are wasting more
memory than we are saving.

>
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-13  2:12             ` Huang, Ying
@ 2023-03-15  8:01               ` Yosry Ahmed
  2023-03-16  7:50                 ` Huang, Ying
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-15  8:01 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> <snip>
> >
> > My current idea is to have one xarray that stores the swap_descs
> > (which include swap_entry, swapcache, swap_count, etc), and only for
> > rotating disks have an additional xarray that maps swap_entry ->
> > swap_desc for cluster readahead, assuming we can eliminate all other
> > situations requiring a reverse mapping.
> >
> > I am not sure how having separate xarrays help? If we have one xarray,
> > might as well save the other lookups on put everything in swap_desc.
> > In fact, this should improve the locking today as swapcache /
> > swap_count operations can be lockless or very lightly contended.
>
> The condition of the proposal is "reverse mapping cannot be avoided for
> enough situation".  So, if reverse mapping (or cluster readahead) can be
> avoided for enough situations, I think your proposal is good.  Otherwise,
> I propose to use 2 xarrays.  You don't need another reverse mapping
> xarray, because you just need to read the next several swap_entry into
> the swap cache for cluster readahead.  swap_desc isn't needed for
> cluster readahead.

swap_desc would be needed for cluster readahead in my original
proposal as the swap cache lives in swap_descs. Based on the current
implementation, we would need a reverse mapping (swap entry ->
swap_desc) in 3 situations:

1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
failing, we fallback to trying to find swap entries that only have a
page in the swap cache (no references in page tables or page cache)
and free them. This would require a reverse mapping.

2) swapoff: we need to swap in all entries in a swapfile, so we need
to get all swap_descs associated with that swapfile.

3) swap cluster readahead.

For (1), I think we can drop the dependency of a reverse mapping if we
free swap entries once we swap a page in and add it to the swap cache,
even if the swap count does not drop to 0.

For (2), instead of scanning page tables and shmem page cache to find
swapped out pages for the swapfile, we can scan all swap_descs
instead, we should be more efficient. This is one of the proposal's
potential advantages.

(3) is the one that would still need a reverse mapping with the
current proposal. Today we use swap cluster readahead for anon pages
if we have a spinning disk or vma readahead is disabled. For shmem, we
always use cluster readahead. If we can limit cluster readahead to
only rotating disks, then the reverse mapping can only be maintained
for swapfiles on rotating disks. Otherwise, we will need to maintain a
reverse mapping for all swapfiles.

>
> > If the point is to store the swap_desc directly inside the xarray to
> > save 8 bytes, I am concerned that having multiple xarrays for
> > swapcache, swap_count, etc will use more than that.
>
> The idea is to save the memory used by reverse mapping xarray.

I see.

>
> >> >>
> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
> >> >> > continuation pages. If we do, it may end up being more. We also
> >> >> > allocate continuation in full 4k pages, so even if one swap_map
> >> >> > element in a page requires continuation, we will allocate an entire
> >> >> > page. What I am trying to say is that to get an actual comparison you
> >> >> > need to also factor in the swap utilization and the rate of usage of
> >> >> > swap continuation. I don't know how to come up with a formula for this
> >> >> > tbh.
> >> >> >
> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> >> >> > overhead for people not using zswap, but it is not very awful.
> >> >> >
> >> >> >>
> >> >> >> It seems what you really need is one bit of information to indicate
> >> >> >> this page is backed by zswap. Then you can have a seperate pointer
> >> >> >> for the zswap entry.
> >> >> >
> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
> >> >> > to indicate whether the page is backed with a swapfile or zswap it
> >> >> > doesn't really work. We lose the indirection layer. How do we move the
> >> >> > page from zswap to swapfile? We need to go update the page tables and
> >> >> > the shmem page cache, similar to swapoff.
> >> >> >
> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
> >> >> > the swap_desc does. It just goes the extra mile of unifying the
> >> >> > swapcache as well and storing it directly in the swap_desc instead of
> >> >> > storing it in another lookup structure.
> >> >>
> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> >> >> swap_entry in swap_desc.  The added indirection appears to be another
> >> >> level of page table with 1 entry.  Then, we may use the similar method
> >> >> as supporting system with 2 level and 3 level page tables, like the code
> >> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
> >> >> this deeply.
> >> >
> >> > Can you expand further on this idea? I am not sure I fully understand.
> >>
> >> OK.  The goal is to avoid the overhead if indirection isn't enabled via
> >> kconfig.
> >>
> >> If indirection isn't enabled, store swap_entry in PTE directly.
> >> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
> >> to get/set swap_entry in PTE) are implemented based on kconfig.
> >
> >
> > I thought about this, the problem is that we will have multiple
> > implementations of multiple things. For example, swap_count without
> > the indirection layer lives in the swap_map (with continuation logic).
> > With the indirection layer, it lives in the swap_desc (or somewhere
> > else). Same for the swapcache. Even if we keep the swapcache in an
> > xarray and not inside swap_desc, it would be indexed by swap_entry if
> > the indirection is disabled, and by swap_desc (or similar) if the
> > indirection is enabled. I think maintaining separate implementations
> > for when the indirection is enabled/disabled would be adding too much
> > complexity.
> >
> > WDYT?
>
> If we go this way, swap cache and swap_count will always be indexed by
> swap_entry.  swap_desc just provides a indirection to make it possible
> to move between swap devices.
>
> Why must we index swap cache and swap_count by swap_desc if indirection
> is enabled?  Yes, we can save one xarray indexing if we do so, but I
> don't think the overhead of one xarray indexing is a showstopper.
>
> I think this can be one intermediate step towards your final target.
> The changes to current implementation can be smaller.

IIUC, the idea is to have two xarrays:
(a) xarray that stores a pointer to a struct containing swap_count and
swap cache.
(b) xarray that stores the underlying swap entry or zswap entry.

When indirection is disabled:
page tables & page cache have swap entry directly like today, xarray
(a) is indexed by swap entry, xarray (b) does not exist. No reverse
mapping needed.

In this case we have an extra overhead of 12-16 bytes (the struct
containing swap_count and swap cache) vs. 24 bytes of the swap_desc.

When indirection is enabled:
page tables & page cache have a swap id (or swap_desc index), xarray
(a) is indexed by swap id, xarray (b) is indexed by swap id as well
and contain swap entry or zswap entry. Reverse mapping might be
needed.

In this case we have an extra overhead of 12-16 bytes + 8 bytes for
xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
where needed.

There is also the extra cpu overhead for an extra lookup in certain paths.

Is my analysis correct? If yes, I agree that the original proposal is
good if the reverse mapping can be avoided in enough situations, and
that we should consider such alternatives otherwise. As I mentioned
above, I think it comes down to whether we can completely restrict
cluster readahead to rotating disks or not -- in which case we need to
decide what to do for shmem and for anon when vma readahead is
disabled.

>
> >> >> >>
> >> >> >> Depending on how much you are going to reuse the swap cache, you might
> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
> >> >> >
> >> >> > My current intention is to reimplement the swapcache completely as a
> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
> >> >> > of the locking we do today if I get things right.
> >> >> >
> >> >> >>
> >> >> >> > Another potential concern is readahead. With this design, we have no
> >> >> >>
> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> >> >> use some modernization.
> >> >> >
> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> >> > swap_desc reverse mapping for readahead, and that we can only store
> >> >> > that for spinning disks, but I was wrong. We need for other things as
> >> >> > well today: swapoff, when trying to find an empty swap slot and we
> >> >> > start trying to free swap slots used only by the swapcache. However, I
> >> >> > think both of these cases can be fixed (I can share more details if
> >> >> > you want). If everything goes well we should only need to maintain the
> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
> >> >> > spinning disks for readahead.
> >> >> >
> >> >> >>
> >> >> >> Looking forward to your discussion.
> >>
> >> Per my understanding, the indirection is to make it easy to move
> >> (swapped) pages among swap devices based on hot/cold.  This is similar
> >> as the target of memory tiering.  It appears that we can extend the
> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> >> Is it possible for zswap to be faster than some slow memory media?
> >
> >
> > Agree with Chris that this may require a much larger overhaul. A slow
> > memory tier is still addressable memory, swap/zswap requires a page
> > fault to read the pages. I think (at least for now) there is a
> > fundamental difference. We want reclaim to eventually treat slow
> > memory & swap as just different tiers to place cold memory in with
> > different characteristics, but otherwise I think the swapping
> > implementation itself is very different.  Am I missing something?
>
> Is it possible that zswap is faster than a really slow memory
> addressable device backed by NAND?  TBH, I don't have the answer.

I am not sure either.

>
> Anyway, do you need a way to describe the tiers of the swap devices?
> So, you can move the cold pages among the swap devices based on that?

For now I think the "tiers" in this proposal are just zswap and normal
swapfiles. We can later extend it to support more explicit tiering.

>
> Best Regards,
> Huang, Ying
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-15  7:41               ` Yosry Ahmed
@ 2023-03-16  1:42                 ` Huang, Ying
  0 siblings, 0 replies; 105+ messages in thread
From: Huang, Ying @ 2023-03-16  1:42 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Sun, Mar 12, 2023 at 6:11 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Fri, Mar 10, 2023 at 11:06:37AM +0800, Huang, Ying wrote:
>> >> > Unfortunately it's a little bit more. 24 is the extra overhead.
>> >> >
>> >> > Today we have an xarray entry for each swapped out page, that either
>> >> > has the swapcache pointer or the shadow entry.
>> >> >
>> >> > With this implementation, we have an xarray entry for each swapped out
>> >> > page, that has a pointer to the swap_desc.
>> >> >
>> >> > Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.
>> >>
>> >> OK.  I see.  We can only hold 8 bytes for each xarray entry.  To save
>> >> memory usage, we can allocate multiple swap_desc (e.g., 16) for each
>> >> xarray entry.  Then the memory usage of xarray becomes 1/N.
>> >
>> > The xarray look up key is the swap offset from the swap entry. If you
>> > put more than one swap_desc under the one xarray entry. It will mean
>> > all those different swap_descs will share a swap offset.
>>
>> For example, if we allocate 16 swap_desc for each xarray entry.  Then,
>> we can use (swap_desc_index >> 4) as key to lookup xarray, and
>> (swap_desc_index & 0x15) to index inside 16 swap_desc for the xarray
>> entry.
>
> With this approach we save (16 - 1) * 8 = 120 bytes per 16 swap_descs,
> but only if we use the space for 16 swap_desc's fully. As pages are
> swapped in, we can easily end up with less than 16 swap_descs for one
> xarray entry, how do we deal with such fragmentation?
>
> At 11 swap_descs per xarray entry, we are not saving any memory
> (wasting 24 * 5 = 120 bytes). Below 11 swap_descs, we are wasting more
> memory than we are saving.

Yes.  This approach isn't very fragmentation friendly.  xarray isn't
too.  So, I think we should try to reduce fragmentation anyway, for
example, reuse freed swap_desc IDs ASAP, or allocate less swap_desc per
xarray entry (e.g., 4), etc.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-15  8:01               ` Yosry Ahmed
@ 2023-03-16  7:50                 ` Huang, Ying
  2023-03-17 10:19                   ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-16  7:50 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> <snip>
>> >
>> > My current idea is to have one xarray that stores the swap_descs
>> > (which include swap_entry, swapcache, swap_count, etc), and only for
>> > rotating disks have an additional xarray that maps swap_entry ->
>> > swap_desc for cluster readahead, assuming we can eliminate all other
>> > situations requiring a reverse mapping.
>> >
>> > I am not sure how having separate xarrays help? If we have one xarray,
>> > might as well save the other lookups on put everything in swap_desc.
>> > In fact, this should improve the locking today as swapcache /
>> > swap_count operations can be lockless or very lightly contended.
>>
>> The condition of the proposal is "reverse mapping cannot be avoided for
>> enough situation".  So, if reverse mapping (or cluster readahead) can be
>> avoided for enough situations, I think your proposal is good.  Otherwise,
>> I propose to use 2 xarrays.  You don't need another reverse mapping
>> xarray, because you just need to read the next several swap_entry into
>> the swap cache for cluster readahead.  swap_desc isn't needed for
>> cluster readahead.
>
> swap_desc would be needed for cluster readahead in my original
> proposal as the swap cache lives in swap_descs. Based on the current
> implementation, we would need a reverse mapping (swap entry ->
> swap_desc) in 3 situations:
>
> 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
> failing, we fallback to trying to find swap entries that only have a
> page in the swap cache (no references in page tables or page cache)
> and free them. This would require a reverse mapping.
>
> 2) swapoff: we need to swap in all entries in a swapfile, so we need
> to get all swap_descs associated with that swapfile.
>
> 3) swap cluster readahead.
>
> For (1), I think we can drop the dependency of a reverse mapping if we
> free swap entries once we swap a page in and add it to the swap cache,
> even if the swap count does not drop to 0.

Now, we will not drop the swap cache even if the swap count becomes 0 if
swap space utility < 50%.  Per my understanding, this avoid swap page
writing for read accesses.  So I don't think we can change this directly
without necessary discussion firstly.

> For (2), instead of scanning page tables and shmem page cache to find
> swapped out pages for the swapfile, we can scan all swap_descs
> instead, we should be more efficient. This is one of the proposal's
> potential advantages.

Good.

> (3) is the one that would still need a reverse mapping with the
> current proposal. Today we use swap cluster readahead for anon pages
> if we have a spinning disk or vma readahead is disabled. For shmem, we
> always use cluster readahead. If we can limit cluster readahead to
> only rotating disks, then the reverse mapping can only be maintained
> for swapfiles on rotating disks. Otherwise, we will need to maintain a
> reverse mapping for all swapfiles.

For shmem, I think that it should be good to readahead based on shmem
file offset instead of swap device offset.

It's possible that some pages in the readahead window are from HDD while
some other pages aren't.  So it's a little hard to enable cluster read
for HDD only.  Anyway, it's not common to use HDD for swap now.

>>
>> > If the point is to store the swap_desc directly inside the xarray to
>> > save 8 bytes, I am concerned that having multiple xarrays for
>> > swapcache, swap_count, etc will use more than that.
>>
>> The idea is to save the memory used by reverse mapping xarray.
>
> I see.
>
>>
>> >> >>
>> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
>> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
>> >> >> > continuation pages. If we do, it may end up being more. We also
>> >> >> > allocate continuation in full 4k pages, so even if one swap_map
>> >> >> > element in a page requires continuation, we will allocate an entire
>> >> >> > page. What I am trying to say is that to get an actual comparison you
>> >> >> > need to also factor in the swap utilization and the rate of usage of
>> >> >> > swap continuation. I don't know how to come up with a formula for this
>> >> >> > tbh.
>> >> >> >
>> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> >> >> > overhead for people not using zswap, but it is not very awful.
>> >> >> >
>> >> >> >>
>> >> >> >> It seems what you really need is one bit of information to indicate
>> >> >> >> this page is backed by zswap. Then you can have a seperate pointer
>> >> >> >> for the zswap entry.
>> >> >> >
>> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
>> >> >> > to indicate whether the page is backed with a swapfile or zswap it
>> >> >> > doesn't really work. We lose the indirection layer. How do we move the
>> >> >> > page from zswap to swapfile? We need to go update the page tables and
>> >> >> > the shmem page cache, similar to swapoff.
>> >> >> >
>> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
>> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
>> >> >> > the swap_desc does. It just goes the extra mile of unifying the
>> >> >> > swapcache as well and storing it directly in the swap_desc instead of
>> >> >> > storing it in another lookup structure.
>> >> >>
>> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> >> >> swap_entry in swap_desc.  The added indirection appears to be another
>> >> >> level of page table with 1 entry.  Then, we may use the similar method
>> >> >> as supporting system with 2 level and 3 level page tables, like the code
>> >> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
>> >> >> this deeply.
>> >> >
>> >> > Can you expand further on this idea? I am not sure I fully understand.
>> >>
>> >> OK.  The goal is to avoid the overhead if indirection isn't enabled via
>> >> kconfig.
>> >>
>> >> If indirection isn't enabled, store swap_entry in PTE directly.
>> >> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
>> >> to get/set swap_entry in PTE) are implemented based on kconfig.
>> >
>> >
>> > I thought about this, the problem is that we will have multiple
>> > implementations of multiple things. For example, swap_count without
>> > the indirection layer lives in the swap_map (with continuation logic).
>> > With the indirection layer, it lives in the swap_desc (or somewhere
>> > else). Same for the swapcache. Even if we keep the swapcache in an
>> > xarray and not inside swap_desc, it would be indexed by swap_entry if
>> > the indirection is disabled, and by swap_desc (or similar) if the
>> > indirection is enabled. I think maintaining separate implementations
>> > for when the indirection is enabled/disabled would be adding too much
>> > complexity.
>> >
>> > WDYT?
>>
>> If we go this way, swap cache and swap_count will always be indexed by
>> swap_entry.  swap_desc just provides a indirection to make it possible
>> to move between swap devices.
>>
>> Why must we index swap cache and swap_count by swap_desc if indirection
>> is enabled?  Yes, we can save one xarray indexing if we do so, but I
>> don't think the overhead of one xarray indexing is a showstopper.
>>
>> I think this can be one intermediate step towards your final target.
>> The changes to current implementation can be smaller.
>
> IIUC, the idea is to have two xarrays:
> (a) xarray that stores a pointer to a struct containing swap_count and
> swap cache.
> (b) xarray that stores the underlying swap entry or zswap entry.
>
> When indirection is disabled:
> page tables & page cache have swap entry directly like today, xarray
> (a) is indexed by swap entry, xarray (b) does not exist. No reverse
> mapping needed.
>
> In this case we have an extra overhead of 12-16 bytes (the struct
> containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
>
> When indirection is enabled:
> page tables & page cache have a swap id (or swap_desc index), xarray
> (a) is indexed by swap id,

xarray (a) is indexed by swap entry.

> xarray (b) is indexed by swap id as well
> and contain swap entry or zswap entry. Reverse mapping might be
> needed.

Reverse mapping isn't needed.

> In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
> where needed.
>
> There is also the extra cpu overhead for an extra lookup in certain paths.
>
> Is my analysis correct? If yes, I agree that the original proposal is
> good if the reverse mapping can be avoided in enough situations, and
> that we should consider such alternatives otherwise. As I mentioned
> above, I think it comes down to whether we can completely restrict
> cluster readahead to rotating disks or not -- in which case we need to
> decide what to do for shmem and for anon when vma readahead is
> disabled.

We can even have a minimal indirection implementation.  Where, swap
cache and swap_map[] are kept as they ware before, just one xarray is
added.  The xarray is indexed by swap id (or swap_desc index) to store
the corresponding swap entry.

When indirection is disabled, no extra overhead.

When indirection is enabled, the extra overhead is just 8 bytes per
swapped page.

The basic migration support can be build on top of this.

I think that this could be a baseline for indirection support.  Then
further optimization can be built on top of it step by step with
supporting data.

>>
>> >> >> >>
>> >> >> >> Depending on how much you are going to reuse the swap cache, you might
>> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
>> >> >> >
>> >> >> > My current intention is to reimplement the swapcache completely as a
>> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
>> >> >> > of the locking we do today if I get things right.
>> >> >> >
>> >> >> >>
>> >> >> >> > Another potential concern is readahead. With this design, we have no
>> >> >> >>
>> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
>> >> >> >> use some modernization.
>> >> >> >
>> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
>> >> >> > swap_desc reverse mapping for readahead, and that we can only store
>> >> >> > that for spinning disks, but I was wrong. We need for other things as
>> >> >> > well today: swapoff, when trying to find an empty swap slot and we
>> >> >> > start trying to free swap slots used only by the swapcache. However, I
>> >> >> > think both of these cases can be fixed (I can share more details if
>> >> >> > you want). If everything goes well we should only need to maintain the
>> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> >> >> > spinning disks for readahead.
>> >> >> >
>> >> >> >>
>> >> >> >> Looking forward to your discussion.
>> >>
>> >> Per my understanding, the indirection is to make it easy to move
>> >> (swapped) pages among swap devices based on hot/cold.  This is similar
>> >> as the target of memory tiering.  It appears that we can extend the
>> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
>> >> Is it possible for zswap to be faster than some slow memory media?
>> >
>> >
>> > Agree with Chris that this may require a much larger overhaul. A slow
>> > memory tier is still addressable memory, swap/zswap requires a page
>> > fault to read the pages. I think (at least for now) there is a
>> > fundamental difference. We want reclaim to eventually treat slow
>> > memory & swap as just different tiers to place cold memory in with
>> > different characteristics, but otherwise I think the swapping
>> > implementation itself is very different.  Am I missing something?
>>
>> Is it possible that zswap is faster than a really slow memory
>> addressable device backed by NAND?  TBH, I don't have the answer.
>
> I am not sure either.
>
>>
>> Anyway, do you need a way to describe the tiers of the swap devices?
>> So, you can move the cold pages among the swap devices based on that?
>
> For now I think the "tiers" in this proposal are just zswap and normal
> swapfiles. We can later extend it to support more explicit tiering.

IIUC, in original zswap implementation, there's 1:1 relationship between
zswap and normal swapfile.  But now, you make demoting among swap
devices more general.  Then we need some general way to specify which
swap devices are fast and which are slow, and the demoting relationship
among them.  It can be memory tiers or something else, but we need one.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-16  7:50                 ` Huang, Ying
@ 2023-03-17 10:19                   ` Yosry Ahmed
  2023-03-17 18:19                     ` Chris Li
  2023-03-20  2:55                     ` Huang, Ying
  0 siblings, 2 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-17 10:19 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> <snip>
> >> >
> >> > My current idea is to have one xarray that stores the swap_descs
> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
> >> > rotating disks have an additional xarray that maps swap_entry ->
> >> > swap_desc for cluster readahead, assuming we can eliminate all other
> >> > situations requiring a reverse mapping.
> >> >
> >> > I am not sure how having separate xarrays help? If we have one xarray,
> >> > might as well save the other lookups on put everything in swap_desc.
> >> > In fact, this should improve the locking today as swapcache /
> >> > swap_count operations can be lockless or very lightly contended.
> >>
> >> The condition of the proposal is "reverse mapping cannot be avoided for
> >> enough situation".  So, if reverse mapping (or cluster readahead) can be
> >> avoided for enough situations, I think your proposal is good.  Otherwise,
> >> I propose to use 2 xarrays.  You don't need another reverse mapping
> >> xarray, because you just need to read the next several swap_entry into
> >> the swap cache for cluster readahead.  swap_desc isn't needed for
> >> cluster readahead.
> >
> > swap_desc would be needed for cluster readahead in my original
> > proposal as the swap cache lives in swap_descs. Based on the current
> > implementation, we would need a reverse mapping (swap entry ->
> > swap_desc) in 3 situations:
> >
> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
> > failing, we fallback to trying to find swap entries that only have a
> > page in the swap cache (no references in page tables or page cache)
> > and free them. This would require a reverse mapping.
> >
> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
> > to get all swap_descs associated with that swapfile.
> >
> > 3) swap cluster readahead.
> >
> > For (1), I think we can drop the dependency of a reverse mapping if we
> > free swap entries once we swap a page in and add it to the swap cache,
> > even if the swap count does not drop to 0.
>
> Now, we will not drop the swap cache even if the swap count becomes 0 if
> swap space utility < 50%.  Per my understanding, this avoid swap page
> writing for read accesses.  So I don't think we can change this directly
> without necessary discussion firstly.


Right. I am not sure I understand why we do this today, is it to save
the overhead of allocating a new swap entry if the page is swapped out
again soon? I am not sure I understand this statement "this avoid swap
page
writing for read accesses".

>
>
> > For (2), instead of scanning page tables and shmem page cache to find
> > swapped out pages for the swapfile, we can scan all swap_descs
> > instead, we should be more efficient. This is one of the proposal's
> > potential advantages.
>
> Good.
>
> > (3) is the one that would still need a reverse mapping with the
> > current proposal. Today we use swap cluster readahead for anon pages
> > if we have a spinning disk or vma readahead is disabled. For shmem, we
> > always use cluster readahead. If we can limit cluster readahead to
> > only rotating disks, then the reverse mapping can only be maintained
> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
> > reverse mapping for all swapfiles.
>
> For shmem, I think that it should be good to readahead based on shmem
> file offset instead of swap device offset.
>
> It's possible that some pages in the readahead window are from HDD while
> some other pages aren't.  So it's a little hard to enable cluster read
> for HDD only.  Anyway, it's not common to use HDD for swap now.
>
> >>
> >> > If the point is to store the swap_desc directly inside the xarray to
> >> > save 8 bytes, I am concerned that having multiple xarrays for
> >> > swapcache, swap_count, etc will use more than that.
> >>
> >> The idea is to save the memory used by reverse mapping xarray.
> >
> > I see.
> >
> >>
> >> >> >>
> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
> >> >> >> > continuation pages. If we do, it may end up being more. We also
> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
> >> >> >> > element in a page requires continuation, we will allocate an entire
> >> >> >> > page. What I am trying to say is that to get an actual comparison you
> >> >> >> > need to also factor in the swap utilization and the rate of usage of
> >> >> >> > swap continuation. I don't know how to come up with a formula for this
> >> >> >> > tbh.
> >> >> >> >
> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> >> >> >> > overhead for people not using zswap, but it is not very awful.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> It seems what you really need is one bit of information to indicate
> >> >> >> >> this page is backed by zswap. Then you can have a seperate pointer
> >> >> >> >> for the zswap entry.
> >> >> >> >
> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it
> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
> >> >> >> > the shmem page cache, similar to swapoff.
> >> >> >> >
> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
> >> >> >> > storing it in another lookup structure.
> >> >> >>
> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> >> >> >> swap_entry in swap_desc.  The added indirection appears to be another
> >> >> >> level of page table with 1 entry.  Then, we may use the similar method
> >> >> >> as supporting system with 2 level and 3 level page tables, like the code
> >> >> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
> >> >> >> this deeply.
> >> >> >
> >> >> > Can you expand further on this idea? I am not sure I fully understand.
> >> >>
> >> >> OK.  The goal is to avoid the overhead if indirection isn't enabled via
> >> >> kconfig.
> >> >>
> >> >> If indirection isn't enabled, store swap_entry in PTE directly.
> >> >> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
> >> >> to get/set swap_entry in PTE) are implemented based on kconfig.
> >> >
> >> >
> >> > I thought about this, the problem is that we will have multiple
> >> > implementations of multiple things. For example, swap_count without
> >> > the indirection layer lives in the swap_map (with continuation logic).
> >> > With the indirection layer, it lives in the swap_desc (or somewhere
> >> > else). Same for the swapcache. Even if we keep the swapcache in an
> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
> >> > the indirection is disabled, and by swap_desc (or similar) if the
> >> > indirection is enabled. I think maintaining separate implementations
> >> > for when the indirection is enabled/disabled would be adding too much
> >> > complexity.
> >> >
> >> > WDYT?
> >>
> >> If we go this way, swap cache and swap_count will always be indexed by
> >> swap_entry.  swap_desc just provides a indirection to make it possible
> >> to move between swap devices.
> >>
> >> Why must we index swap cache and swap_count by swap_desc if indirection
> >> is enabled?  Yes, we can save one xarray indexing if we do so, but I
> >> don't think the overhead of one xarray indexing is a showstopper.
> >>
> >> I think this can be one intermediate step towards your final target.
> >> The changes to current implementation can be smaller.
> >
> > IIUC, the idea is to have two xarrays:
> > (a) xarray that stores a pointer to a struct containing swap_count and
> > swap cache.
> > (b) xarray that stores the underlying swap entry or zswap entry.
> >
> > When indirection is disabled:
> > page tables & page cache have swap entry directly like today, xarray
> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
> > mapping needed.
> >
> > In this case we have an extra overhead of 12-16 bytes (the struct
> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
> >
> > When indirection is enabled:
> > page tables & page cache have a swap id (or swap_desc index), xarray
> > (a) is indexed by swap id,
>
> xarray (a) is indexed by swap entry.


How so? With the indirection enabled, the page tables & page cache
have the swap id (or swap_desc index), which can point to a swap entry
or a zswap entry -- which can change when the page is moved between
zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
case? Shouldn't be indexed by the abstract swap id so that the
writeback from zswap is transparent?

>
>
> > xarray (b) is indexed by swap id as well
> > and contain swap entry or zswap entry. Reverse mapping might be
> > needed.
>
> Reverse mapping isn't needed.


It would be needed if xarray (a) is indexed by the swap id. I am not
sure I understand how it can be indexed by the swap entry if the
indirection is enabled.

>
>
> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
> > where needed.
> >
> > There is also the extra cpu overhead for an extra lookup in certain paths.
> >
> > Is my analysis correct? If yes, I agree that the original proposal is
> > good if the reverse mapping can be avoided in enough situations, and
> > that we should consider such alternatives otherwise. As I mentioned
> > above, I think it comes down to whether we can completely restrict
> > cluster readahead to rotating disks or not -- in which case we need to
> > decide what to do for shmem and for anon when vma readahead is
> > disabled.
>
> We can even have a minimal indirection implementation.  Where, swap
> cache and swap_map[] are kept as they ware before, just one xarray is
> added.  The xarray is indexed by swap id (or swap_desc index) to store
> the corresponding swap entry.
>
> When indirection is disabled, no extra overhead.
>
> When indirection is enabled, the extra overhead is just 8 bytes per
> swapped page.
>
> The basic migration support can be build on top of this.
>
> I think that this could be a baseline for indirection support.  Then
> further optimization can be built on top of it step by step with
> supporting data.


I am not sure how this works with zswap. Currently swap_map[]
implementation is specific for swapfiles, it does not work for zswap
unless we implement separate swap counting logic for zswap &
swapfiles. Same for the swapcache, it currently supports being indexed
by a swap entry, it would need to support being indexed by a swap id,
or have a separate swap cache for zswap. Having separate
implementation would add complexity, and we would need to perform
handoffs of the swap count/cache when a page is moved from zswap to a
swapfile.

>
>
> >>
> >> >> >> >>
> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
> >> >> >> >
> >> >> >> > My current intention is to reimplement the swapcache completely as a
> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
> >> >> >> > of the locking we do today if I get things right.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> > Another potential concern is readahead. With this design, we have no
> >> >> >> >>
> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> >> >> >> use some modernization.
> >> >> >> >
> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
> >> >> >> > that for spinning disks, but I was wrong. We need for other things as
> >> >> >> > well today: swapoff, when trying to find an empty swap slot and we
> >> >> >> > start trying to free swap slots used only by the swapcache. However, I
> >> >> >> > think both of these cases can be fixed (I can share more details if
> >> >> >> > you want). If everything goes well we should only need to maintain the
> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
> >> >> >> > spinning disks for readahead.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> Looking forward to your discussion.
> >> >>
> >> >> Per my understanding, the indirection is to make it easy to move
> >> >> (swapped) pages among swap devices based on hot/cold.  This is similar
> >> >> as the target of memory tiering.  It appears that we can extend the
> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> >> >> Is it possible for zswap to be faster than some slow memory media?
> >> >
> >> >
> >> > Agree with Chris that this may require a much larger overhaul. A slow
> >> > memory tier is still addressable memory, swap/zswap requires a page
> >> > fault to read the pages. I think (at least for now) there is a
> >> > fundamental difference. We want reclaim to eventually treat slow
> >> > memory & swap as just different tiers to place cold memory in with
> >> > different characteristics, but otherwise I think the swapping
> >> > implementation itself is very different.  Am I missing something?
> >>
> >> Is it possible that zswap is faster than a really slow memory
> >> addressable device backed by NAND?  TBH, I don't have the answer.
> >
> > I am not sure either.
> >
> >>
> >> Anyway, do you need a way to describe the tiers of the swap devices?
> >> So, you can move the cold pages among the swap devices based on that?
> >
> > For now I think the "tiers" in this proposal are just zswap and normal
> > swapfiles. We can later extend it to support more explicit tiering.
>
> IIUC, in original zswap implementation, there's 1:1 relationship between
> zswap and normal swapfile.  But now, you make demoting among swap
> devices more general.  Then we need some general way to specify which
> swap devices are fast and which are slow, and the demoting relationship
> among them.  It can be memory tiers or something else, but we need one.


I think for this proposal, there are only 2 hardcoded tiers. Zswap is
fast, swapfile is slow. In the future, we can support more dynamic
tiering if the need arises.

>
>
> Best Regards,
> Huang, Ying
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-17 10:19                   ` Yosry Ahmed
@ 2023-03-17 18:19                     ` Chris Li
  2023-03-17 18:23                       ` Yosry Ahmed
  2023-03-20  2:55                     ` Huang, Ying
  1 sibling, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-17 18:19 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Fri, Mar 17, 2023 at 03:19:09AM -0700, Yosry Ahmed wrote:
> > Now, we will not drop the swap cache even if the swap count becomes 0 if
> > swap space utility < 50%.  Per my understanding, this avoid swap page
> > writing for read accesses.  So I don't think we can change this directly
> > without necessary discussion firstly.
> 
> 
> Right. I am not sure I understand why we do this today, is it to save
> the overhead of allocating a new swap entry if the page is swapped out
> again soon? I am not sure I understand this statement "this avoid swap
> page
> writing for read accesses".

When the page is swapped out again soon. If the swap slot has been recycled,
then the page need to assign to a new swap slot, most likely different than
the previous slot. Then the page write out would need to write that page
to the swap device, even though that swap device might already has the
same page data in the previous slot.

If keeping the previous swap slot cache when swap space utility < 50%,
then the swap code can avoid writing out to the same slot with the same
data, if the page is not dirty (read only access).

The saving is in 1) avoid allocating a new slot. 2) For read access,
avoid page io write the same data to the same slot.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-17 18:19                     ` Chris Li
@ 2023-03-17 18:23                       ` Yosry Ahmed
  0 siblings, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-17 18:23 UTC (permalink / raw)
  To: Chris Li
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Fri, Mar 17, 2023 at 11:19 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Fri, Mar 17, 2023 at 03:19:09AM -0700, Yosry Ahmed wrote:
> > > Now, we will not drop the swap cache even if the swap count becomes 0 if
> > > swap space utility < 50%.  Per my understanding, this avoid swap page
> > > writing for read accesses.  So I don't think we can change this directly
> > > without necessary discussion firstly.
> >
> >
> > Right. I am not sure I understand why we do this today, is it to save
> > the overhead of allocating a new swap entry if the page is swapped out
> > again soon? I am not sure I understand this statement "this avoid swap
> > page
> > writing for read accesses".
>
> When the page is swapped out again soon. If the swap slot has been recycled,
> then the page need to assign to a new swap slot, most likely different than
> the previous slot. Then the page write out would need to write that page
> to the swap device, even though that swap device might already has the
> same page data in the previous slot.
>
> If keeping the previous swap slot cache when swap space utility < 50%,
> then the swap code can avoid writing out to the same slot with the same
> data, if the page is not dirty (read only access).
>
> The saving is in 1) avoid allocating a new slot. 2) For read access,
> avoid page io write the same data to the same slot.

I see. Makes sense. Thanks, Chris.

I guess in this case we shouldn't just unconditionally free the swap
entry once the page is swapped in without giving it some thought.

>
> Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-17 10:19                   ` Yosry Ahmed
  2023-03-17 18:19                     ` Chris Li
@ 2023-03-20  2:55                     ` Huang, Ying
  2023-03-20  6:25                       ` Chris Li
  2023-03-22  5:56                       ` Yosry Ahmed
  1 sibling, 2 replies; 105+ messages in thread
From: Huang, Ying @ 2023-03-20  2:55 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> <snip>
>> >> >
>> >> > My current idea is to have one xarray that stores the swap_descs
>> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
>> >> > rotating disks have an additional xarray that maps swap_entry ->
>> >> > swap_desc for cluster readahead, assuming we can eliminate all other
>> >> > situations requiring a reverse mapping.
>> >> >
>> >> > I am not sure how having separate xarrays help? If we have one xarray,
>> >> > might as well save the other lookups on put everything in swap_desc.
>> >> > In fact, this should improve the locking today as swapcache /
>> >> > swap_count operations can be lockless or very lightly contended.
>> >>
>> >> The condition of the proposal is "reverse mapping cannot be avoided for
>> >> enough situation".  So, if reverse mapping (or cluster readahead) can be
>> >> avoided for enough situations, I think your proposal is good.  Otherwise,
>> >> I propose to use 2 xarrays.  You don't need another reverse mapping
>> >> xarray, because you just need to read the next several swap_entry into
>> >> the swap cache for cluster readahead.  swap_desc isn't needed for
>> >> cluster readahead.
>> >
>> > swap_desc would be needed for cluster readahead in my original
>> > proposal as the swap cache lives in swap_descs. Based on the current
>> > implementation, we would need a reverse mapping (swap entry ->
>> > swap_desc) in 3 situations:
>> >
>> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
>> > failing, we fallback to trying to find swap entries that only have a
>> > page in the swap cache (no references in page tables or page cache)
>> > and free them. This would require a reverse mapping.
>> >
>> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
>> > to get all swap_descs associated with that swapfile.
>> >
>> > 3) swap cluster readahead.
>> >
>> > For (1), I think we can drop the dependency of a reverse mapping if we
>> > free swap entries once we swap a page in and add it to the swap cache,
>> > even if the swap count does not drop to 0.
>>
>> Now, we will not drop the swap cache even if the swap count becomes 0 if
>> swap space utility < 50%.  Per my understanding, this avoid swap page
>> writing for read accesses.  So I don't think we can change this directly
>> without necessary discussion firstly.
>
>
> Right. I am not sure I understand why we do this today, is it to save
> the overhead of allocating a new swap entry if the page is swapped out
> again soon? I am not sure I understand this statement "this avoid swap
> page
> writing for read accesses".
>
>>
>>
>> > For (2), instead of scanning page tables and shmem page cache to find
>> > swapped out pages for the swapfile, we can scan all swap_descs
>> > instead, we should be more efficient. This is one of the proposal's
>> > potential advantages.
>>
>> Good.
>>
>> > (3) is the one that would still need a reverse mapping with the
>> > current proposal. Today we use swap cluster readahead for anon pages
>> > if we have a spinning disk or vma readahead is disabled. For shmem, we
>> > always use cluster readahead. If we can limit cluster readahead to
>> > only rotating disks, then the reverse mapping can only be maintained
>> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
>> > reverse mapping for all swapfiles.
>>
>> For shmem, I think that it should be good to readahead based on shmem
>> file offset instead of swap device offset.
>>
>> It's possible that some pages in the readahead window are from HDD while
>> some other pages aren't.  So it's a little hard to enable cluster read
>> for HDD only.  Anyway, it's not common to use HDD for swap now.
>>
>> >>
>> >> > If the point is to store the swap_desc directly inside the xarray to
>> >> > save 8 bytes, I am concerned that having multiple xarrays for
>> >> > swapcache, swap_count, etc will use more than that.
>> >>
>> >> The idea is to save the memory used by reverse mapping xarray.
>> >
>> > I see.
>> >
>> >>
>> >> >> >>
>> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
>> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
>> >> >> >> > continuation pages. If we do, it may end up being more. We also
>> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
>> >> >> >> > element in a page requires continuation, we will allocate an entire
>> >> >> >> > page. What I am trying to say is that to get an actual comparison you
>> >> >> >> > need to also factor in the swap utilization and the rate of usage of
>> >> >> >> > swap continuation. I don't know how to come up with a formula for this
>> >> >> >> > tbh.
>> >> >> >> >
>> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> >> >> >> > overhead for people not using zswap, but it is not very awful.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> It seems what you really need is one bit of information to indicate
>> >> >> >> >> this page is backed by zswap. Then you can have a seperate pointer
>> >> >> >> >> for the zswap entry.
>> >> >> >> >
>> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
>> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it
>> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
>> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
>> >> >> >> > the shmem page cache, similar to swapoff.
>> >> >> >> >
>> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
>> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
>> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
>> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
>> >> >> >> > storing it in another lookup structure.
>> >> >> >>
>> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> >> >> >> swap_entry in swap_desc.  The added indirection appears to be another
>> >> >> >> level of page table with 1 entry.  Then, we may use the similar method
>> >> >> >> as supporting system with 2 level and 3 level page tables, like the code
>> >> >> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
>> >> >> >> this deeply.
>> >> >> >
>> >> >> > Can you expand further on this idea? I am not sure I fully understand.
>> >> >>
>> >> >> OK.  The goal is to avoid the overhead if indirection isn't enabled via
>> >> >> kconfig.
>> >> >>
>> >> >> If indirection isn't enabled, store swap_entry in PTE directly.
>> >> >> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
>> >> >> to get/set swap_entry in PTE) are implemented based on kconfig.
>> >> >
>> >> >
>> >> > I thought about this, the problem is that we will have multiple
>> >> > implementations of multiple things. For example, swap_count without
>> >> > the indirection layer lives in the swap_map (with continuation logic).
>> >> > With the indirection layer, it lives in the swap_desc (or somewhere
>> >> > else). Same for the swapcache. Even if we keep the swapcache in an
>> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
>> >> > the indirection is disabled, and by swap_desc (or similar) if the
>> >> > indirection is enabled. I think maintaining separate implementations
>> >> > for when the indirection is enabled/disabled would be adding too much
>> >> > complexity.
>> >> >
>> >> > WDYT?
>> >>
>> >> If we go this way, swap cache and swap_count will always be indexed by
>> >> swap_entry.  swap_desc just provides a indirection to make it possible
>> >> to move between swap devices.
>> >>
>> >> Why must we index swap cache and swap_count by swap_desc if indirection
>> >> is enabled?  Yes, we can save one xarray indexing if we do so, but I
>> >> don't think the overhead of one xarray indexing is a showstopper.
>> >>
>> >> I think this can be one intermediate step towards your final target.
>> >> The changes to current implementation can be smaller.
>> >
>> > IIUC, the idea is to have two xarrays:
>> > (a) xarray that stores a pointer to a struct containing swap_count and
>> > swap cache.
>> > (b) xarray that stores the underlying swap entry or zswap entry.
>> >
>> > When indirection is disabled:
>> > page tables & page cache have swap entry directly like today, xarray
>> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
>> > mapping needed.
>> >
>> > In this case we have an extra overhead of 12-16 bytes (the struct
>> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
>> >
>> > When indirection is enabled:
>> > page tables & page cache have a swap id (or swap_desc index), xarray
>> > (a) is indexed by swap id,
>>
>> xarray (a) is indexed by swap entry.
>
>
> How so? With the indirection enabled, the page tables & page cache
> have the swap id (or swap_desc index), which can point to a swap entry
> or a zswap entry -- which can change when the page is moved between
> zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
> case? Shouldn't be indexed by the abstract swap id so that the
> writeback from zswap is transparent?

In my mind,

- swap core will define a abstract interface to swap implementations
  (zswap, swap device/file, maybe more in the future), like VFS.

- zswap will be a special swap implementation (compressing instead of
  writing to disk).

- swap core will manage the indirection layer and swap cache.

- swap core can move swap pages between swap implementations (e.g., from
  zswap to a swap device, or from one swap device to another swap
  device) with the help of the indirection layer.

In this design, the writeback from zswap becomes moving swapped pages
from zswap to a swap device.

If my understanding were correct, your suggestion is kind of moving
zswap logic to the swap core?  And zswap will be always at a higher
layer on top of swap device/file?

>>
>>
>> > xarray (b) is indexed by swap id as well
>> > and contain swap entry or zswap entry. Reverse mapping might be
>> > needed.
>>
>> Reverse mapping isn't needed.
>
>
> It would be needed if xarray (a) is indexed by the swap id. I am not
> sure I understand how it can be indexed by the swap entry if the
> indirection is enabled.
>
>>
>>
>> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
>> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
>> > where needed.
>> >
>> > There is also the extra cpu overhead for an extra lookup in certain paths.
>> >
>> > Is my analysis correct? If yes, I agree that the original proposal is
>> > good if the reverse mapping can be avoided in enough situations, and
>> > that we should consider such alternatives otherwise. As I mentioned
>> > above, I think it comes down to whether we can completely restrict
>> > cluster readahead to rotating disks or not -- in which case we need to
>> > decide what to do for shmem and for anon when vma readahead is
>> > disabled.
>>
>> We can even have a minimal indirection implementation.  Where, swap
>> cache and swap_map[] are kept as they ware before, just one xarray is
>> added.  The xarray is indexed by swap id (or swap_desc index) to store
>> the corresponding swap entry.
>>
>> When indirection is disabled, no extra overhead.
>>
>> When indirection is enabled, the extra overhead is just 8 bytes per
>> swapped page.
>>
>> The basic migration support can be build on top of this.
>>
>> I think that this could be a baseline for indirection support.  Then
>> further optimization can be built on top of it step by step with
>> supporting data.
>
>
> I am not sure how this works with zswap. Currently swap_map[]
> implementation is specific for swapfiles, it does not work for zswap
> unless we implement separate swap counting logic for zswap &
> swapfiles. Same for the swapcache, it currently supports being indexed
> by a swap entry, it would need to support being indexed by a swap id,
> or have a separate swap cache for zswap. Having separate
> implementation would add complexity, and we would need to perform
> handoffs of the swap count/cache when a page is moved from zswap to a
> swapfile.

We can allocate a swap entry for each swapped page in zswap.

>>
>>
>> >>
>> >> >> >> >>
>> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
>> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
>> >> >> >> >
>> >> >> >> > My current intention is to reimplement the swapcache completely as a
>> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
>> >> >> >> > of the locking we do today if I get things right.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> > Another potential concern is readahead. With this design, we have no
>> >> >> >> >>
>> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
>> >> >> >> >> use some modernization.
>> >> >> >> >
>> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
>> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
>> >> >> >> > that for spinning disks, but I was wrong. We need for other things as
>> >> >> >> > well today: swapoff, when trying to find an empty swap slot and we
>> >> >> >> > start trying to free swap slots used only by the swapcache. However, I
>> >> >> >> > think both of these cases can be fixed (I can share more details if
>> >> >> >> > you want). If everything goes well we should only need to maintain the
>> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> >> >> >> > spinning disks for readahead.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> Looking forward to your discussion.
>> >> >>
>> >> >> Per my understanding, the indirection is to make it easy to move
>> >> >> (swapped) pages among swap devices based on hot/cold.  This is similar
>> >> >> as the target of memory tiering.  It appears that we can extend the
>> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
>> >> >> Is it possible for zswap to be faster than some slow memory media?
>> >> >
>> >> >
>> >> > Agree with Chris that this may require a much larger overhaul. A slow
>> >> > memory tier is still addressable memory, swap/zswap requires a page
>> >> > fault to read the pages. I think (at least for now) there is a
>> >> > fundamental difference. We want reclaim to eventually treat slow
>> >> > memory & swap as just different tiers to place cold memory in with
>> >> > different characteristics, but otherwise I think the swapping
>> >> > implementation itself is very different.  Am I missing something?
>> >>
>> >> Is it possible that zswap is faster than a really slow memory
>> >> addressable device backed by NAND?  TBH, I don't have the answer.
>> >
>> > I am not sure either.
>> >
>> >>
>> >> Anyway, do you need a way to describe the tiers of the swap devices?
>> >> So, you can move the cold pages among the swap devices based on that?
>> >
>> > For now I think the "tiers" in this proposal are just zswap and normal
>> > swapfiles. We can later extend it to support more explicit tiering.
>>
>> IIUC, in original zswap implementation, there's 1:1 relationship between
>> zswap and normal swapfile.  But now, you make demoting among swap
>> devices more general.  Then we need some general way to specify which
>> swap devices are fast and which are slow, and the demoting relationship
>> among them.  It can be memory tiers or something else, but we need one.
>
>
> I think for this proposal, there are only 2 hardcoded tiers. Zswap is
> fast, swapfile is slow. In the future, we can support more dynamic
> tiering if the need arises.

We can start from a simple implementation.  And I think that it's better
to consider the general design too.  Try not to make it impossible now.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-20  2:55                     ` Huang, Ying
@ 2023-03-20  6:25                       ` Chris Li
  2023-03-23  0:56                         ` Huang, Ying
  2023-03-22  5:56                       ` Yosry Ahmed
  1 sibling, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-20  6:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Mon, Mar 20, 2023 at 10:55:03AM +0800, Huang, Ying wrote:
> >
> > How so? With the indirection enabled, the page tables & page cache
> > have the swap id (or swap_desc index), which can point to a swap entry
> > or a zswap entry -- which can change when the page is moved between
> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
> > case? Shouldn't be indexed by the abstract swap id so that the
> > writeback from zswap is transparent?
> 
> In my mind,
> 
> - swap core will define a abstract interface to swap implementations
>   (zswap, swap device/file, maybe more in the future), like VFS.

I like your idea very much.

> 
> - zswap will be a special swap implementation (compressing instead of
>   writing to disk).

Agree.

> 
> - swap core will manage the indirection layer and swap cache.

Agree, those are very good points.

> 
> - swap core can move swap pages between swap implementations (e.g., from
>   zswap to a swap device, or from one swap device to another swap
>   device) with the help of the indirection layer.
 
We need to carefully design the swap cache that, when moving between
swap implementaions, there will be one shared swap cache. The current
swap cache belongs to swap devices, so two devices will have the same
page in two swap caches.

> In this design, the writeback from zswap becomes moving swapped pages
> from zswap to a swap device.

Ack.

> 
> If my understanding were correct, your suggestion is kind of moving
> zswap logic to the swap core?  And zswap will be always at a higher
> layer on top of swap device/file?

It seems that way to me. I will let Yosry confirm that.

> > I am not sure how this works with zswap. Currently swap_map[]
> > implementation is specific for swapfiles, it does not work for zswap
> > unless we implement separate swap counting logic for zswap &
> > swapfiles. Same for the swapcache, it currently supports being indexed
> > by a swap entry, it would need to support being indexed by a swap id,
> > or have a separate swap cache for zswap. Having separate
> > implementation would add complexity, and we would need to perform
> > handoffs of the swap count/cache when a page is moved from zswap to a
> > swapfile.
> 
> We can allocate a swap entry for each swapped page in zswap.

One thing to consider when moving page from zswap to swap file, is the
zswap swap entry the same entry as the swap file entry.

> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is
> > fast, swapfile is slow. In the future, we can support more dynamic
> > tiering if the need arises.
> 
> We can start from a simple implementation.  And I think that it's better
> to consider the general design too.  Try not to make it impossible now.

In my mind there are a few usage cases:
1) using only swap file.
2) using only zswap, no swap file.
3) Using zswap + swap file (SSD).

The swap core should handle both 3 cases well with minial memory waste.

Chris



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-20  2:55                     ` Huang, Ying
  2023-03-20  6:25                       ` Chris Li
@ 2023-03-22  5:56                       ` Yosry Ahmed
  2023-03-23  1:48                         ` Huang, Ying
  1 sibling, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-22  5:56 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >>
> >> >> <snip>
> >> >> >
> >> >> > My current idea is to have one xarray that stores the swap_descs
> >> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
> >> >> > rotating disks have an additional xarray that maps swap_entry ->
> >> >> > swap_desc for cluster readahead, assuming we can eliminate all other
> >> >> > situations requiring a reverse mapping.
> >> >> >
> >> >> > I am not sure how having separate xarrays help? If we have one xarray,
> >> >> > might as well save the other lookups on put everything in swap_desc.
> >> >> > In fact, this should improve the locking today as swapcache /
> >> >> > swap_count operations can be lockless or very lightly contended.
> >> >>
> >> >> The condition of the proposal is "reverse mapping cannot be avoided for
> >> >> enough situation".  So, if reverse mapping (or cluster readahead) can be
> >> >> avoided for enough situations, I think your proposal is good.  Otherwise,
> >> >> I propose to use 2 xarrays.  You don't need another reverse mapping
> >> >> xarray, because you just need to read the next several swap_entry into
> >> >> the swap cache for cluster readahead.  swap_desc isn't needed for
> >> >> cluster readahead.
> >> >
> >> > swap_desc would be needed for cluster readahead in my original
> >> > proposal as the swap cache lives in swap_descs. Based on the current
> >> > implementation, we would need a reverse mapping (swap entry ->
> >> > swap_desc) in 3 situations:
> >> >
> >> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
> >> > failing, we fallback to trying to find swap entries that only have a
> >> > page in the swap cache (no references in page tables or page cache)
> >> > and free them. This would require a reverse mapping.
> >> >
> >> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
> >> > to get all swap_descs associated with that swapfile.
> >> >
> >> > 3) swap cluster readahead.
> >> >
> >> > For (1), I think we can drop the dependency of a reverse mapping if we
> >> > free swap entries once we swap a page in and add it to the swap cache,
> >> > even if the swap count does not drop to 0.
> >>
> >> Now, we will not drop the swap cache even if the swap count becomes 0 if
> >> swap space utility < 50%.  Per my understanding, this avoid swap page
> >> writing for read accesses.  So I don't think we can change this directly
> >> without necessary discussion firstly.
> >
> >
> > Right. I am not sure I understand why we do this today, is it to save
> > the overhead of allocating a new swap entry if the page is swapped out
> > again soon? I am not sure I understand this statement "this avoid swap
> > page
> > writing for read accesses".
> >
> >>
> >>
> >> > For (2), instead of scanning page tables and shmem page cache to find
> >> > swapped out pages for the swapfile, we can scan all swap_descs
> >> > instead, we should be more efficient. This is one of the proposal's
> >> > potential advantages.
> >>
> >> Good.
> >>
> >> > (3) is the one that would still need a reverse mapping with the
> >> > current proposal. Today we use swap cluster readahead for anon pages
> >> > if we have a spinning disk or vma readahead is disabled. For shmem, we
> >> > always use cluster readahead. If we can limit cluster readahead to
> >> > only rotating disks, then the reverse mapping can only be maintained
> >> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
> >> > reverse mapping for all swapfiles.
> >>
> >> For shmem, I think that it should be good to readahead based on shmem
> >> file offset instead of swap device offset.
> >>
> >> It's possible that some pages in the readahead window are from HDD while
> >> some other pages aren't.  So it's a little hard to enable cluster read
> >> for HDD only.  Anyway, it's not common to use HDD for swap now.
> >>
> >> >>
> >> >> > If the point is to store the swap_desc directly inside the xarray to
> >> >> > save 8 bytes, I am concerned that having multiple xarrays for
> >> >> > swapcache, swap_count, etc will use more than that.
> >> >>
> >> >> The idea is to save the memory used by reverse mapping xarray.
> >> >
> >> > I see.
> >> >
> >> >>
> >> >> >> >>
> >> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> >> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
> >> >> >> >> > continuation pages. If we do, it may end up being more. We also
> >> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
> >> >> >> >> > element in a page requires continuation, we will allocate an entire
> >> >> >> >> > page. What I am trying to say is that to get an actual comparison you
> >> >> >> >> > need to also factor in the swap utilization and the rate of usage of
> >> >> >> >> > swap continuation. I don't know how to come up with a formula for this
> >> >> >> >> > tbh.
> >> >> >> >> >
> >> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
> >> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> >> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> >> >> >> >> > overhead for people not using zswap, but it is not very awful.
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> It seems what you really need is one bit of information to indicate
> >> >> >> >> >> this page is backed by zswap. Then you can have a seperate pointer
> >> >> >> >> >> for the zswap entry.
> >> >> >> >> >
> >> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
> >> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it
> >> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
> >> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
> >> >> >> >> > the shmem page cache, similar to swapoff.
> >> >> >> >> >
> >> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
> >> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
> >> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
> >> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
> >> >> >> >> > storing it in another lookup structure.
> >> >> >> >>
> >> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> >> >> >> >> swap_entry in swap_desc.  The added indirection appears to be another
> >> >> >> >> level of page table with 1 entry.  Then, we may use the similar method
> >> >> >> >> as supporting system with 2 level and 3 level page tables, like the code
> >> >> >> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
> >> >> >> >> this deeply.
> >> >> >> >
> >> >> >> > Can you expand further on this idea? I am not sure I fully understand.
> >> >> >>
> >> >> >> OK.  The goal is to avoid the overhead if indirection isn't enabled via
> >> >> >> kconfig.
> >> >> >>
> >> >> >> If indirection isn't enabled, store swap_entry in PTE directly.
> >> >> >> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
> >> >> >> to get/set swap_entry in PTE) are implemented based on kconfig.
> >> >> >
> >> >> >
> >> >> > I thought about this, the problem is that we will have multiple
> >> >> > implementations of multiple things. For example, swap_count without
> >> >> > the indirection layer lives in the swap_map (with continuation logic).
> >> >> > With the indirection layer, it lives in the swap_desc (or somewhere
> >> >> > else). Same for the swapcache. Even if we keep the swapcache in an
> >> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
> >> >> > the indirection is disabled, and by swap_desc (or similar) if the
> >> >> > indirection is enabled. I think maintaining separate implementations
> >> >> > for when the indirection is enabled/disabled would be adding too much
> >> >> > complexity.
> >> >> >
> >> >> > WDYT?
> >> >>
> >> >> If we go this way, swap cache and swap_count will always be indexed by
> >> >> swap_entry.  swap_desc just provides a indirection to make it possible
> >> >> to move between swap devices.
> >> >>
> >> >> Why must we index swap cache and swap_count by swap_desc if indirection
> >> >> is enabled?  Yes, we can save one xarray indexing if we do so, but I
> >> >> don't think the overhead of one xarray indexing is a showstopper.
> >> >>
> >> >> I think this can be one intermediate step towards your final target.
> >> >> The changes to current implementation can be smaller.
> >> >
> >> > IIUC, the idea is to have two xarrays:
> >> > (a) xarray that stores a pointer to a struct containing swap_count and
> >> > swap cache.
> >> > (b) xarray that stores the underlying swap entry or zswap entry.
> >> >
> >> > When indirection is disabled:
> >> > page tables & page cache have swap entry directly like today, xarray
> >> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
> >> > mapping needed.
> >> >
> >> > In this case we have an extra overhead of 12-16 bytes (the struct
> >> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
> >> >
> >> > When indirection is enabled:
> >> > page tables & page cache have a swap id (or swap_desc index), xarray
> >> > (a) is indexed by swap id,
> >>
> >> xarray (a) is indexed by swap entry.
> >
> >
> > How so? With the indirection enabled, the page tables & page cache
> > have the swap id (or swap_desc index), which can point to a swap entry
> > or a zswap entry -- which can change when the page is moved between
> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
> > case? Shouldn't be indexed by the abstract swap id so that the
> > writeback from zswap is transparent?
>
> In my mind,
>
> - swap core will define a abstract interface to swap implementations
>   (zswap, swap device/file, maybe more in the future), like VFS.
>
> - zswap will be a special swap implementation (compressing instead of
>   writing to disk).
>
> - swap core will manage the indirection layer and swap cache.
>
> - swap core can move swap pages between swap implementations (e.g., from
>   zswap to a swap device, or from one swap device to another swap
>   device) with the help of the indirection layer.
>
> In this design, the writeback from zswap becomes moving swapped pages
> from zswap to a swap device.


All the above matches my understanding of this proposal. swap_desc is
the proposed indirection layer, and the swap implementations are zswap
& swap devices. For now, we only have 2 static swap implementations
(zswap->swapfile). In the future, we can make this more dynamic as the
need arises.

>
>
> If my understanding were correct, your suggestion is kind of moving
> zswap logic to the swap core?  And zswap will be always at a higher
> layer on top of swap device/file?


We do not want to move the zswap logic into the swap core, we want to
make the swap core independent of the swap implementation, and zswap
is just one possible implementation.

>
>
> >>
> >>
> >> > xarray (b) is indexed by swap id as well
> >> > and contain swap entry or zswap entry. Reverse mapping might be
> >> > needed.
> >>
> >> Reverse mapping isn't needed.
> >
> >
> > It would be needed if xarray (a) is indexed by the swap id. I am not
> > sure I understand how it can be indexed by the swap entry if the
> > indirection is enabled.
> >
> >>
> >>
> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> >> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
> >> > where needed.
> >> >
> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
> >> >
> >> > Is my analysis correct? If yes, I agree that the original proposal is
> >> > good if the reverse mapping can be avoided in enough situations, and
> >> > that we should consider such alternatives otherwise. As I mentioned
> >> > above, I think it comes down to whether we can completely restrict
> >> > cluster readahead to rotating disks or not -- in which case we need to
> >> > decide what to do for shmem and for anon when vma readahead is
> >> > disabled.
> >>
> >> We can even have a minimal indirection implementation.  Where, swap
> >> cache and swap_map[] are kept as they ware before, just one xarray is
> >> added.  The xarray is indexed by swap id (or swap_desc index) to store
> >> the corresponding swap entry.
> >>
> >> When indirection is disabled, no extra overhead.
> >>
> >> When indirection is enabled, the extra overhead is just 8 bytes per
> >> swapped page.
> >>
> >> The basic migration support can be build on top of this.
> >>
> >> I think that this could be a baseline for indirection support.  Then
> >> further optimization can be built on top of it step by step with
> >> supporting data.
> >
> >
> > I am not sure how this works with zswap. Currently swap_map[]
> > implementation is specific for swapfiles, it does not work for zswap
> > unless we implement separate swap counting logic for zswap &
> > swapfiles. Same for the swapcache, it currently supports being indexed
> > by a swap entry, it would need to support being indexed by a swap id,
> > or have a separate swap cache for zswap. Having separate
> > implementation would add complexity, and we would need to perform
> > handoffs of the swap count/cache when a page is moved from zswap to a
> > swapfile.
>
> We can allocate a swap entry for each swapped page in zswap.


This is exactly what the current implementation does and what we want
to move away from. The current implementation uses zswap as an
in-memory compressed cache on top of an actual swap device, and each
swapped page in zswap has a swap entry allocated. With this
implementation, zswap cannot be used without a swap device.

>
>
> >>
> >>
> >> >>
> >> >> >> >> >>
> >> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
> >> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
> >> >> >> >> >
> >> >> >> >> > My current intention is to reimplement the swapcache completely as a
> >> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
> >> >> >> >> > of the locking we do today if I get things right.
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> > Another potential concern is readahead. With this design, we have no
> >> >> >> >> >>
> >> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> >> >> >> >> use some modernization.
> >> >> >> >> >
> >> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
> >> >> >> >> > that for spinning disks, but I was wrong. We need for other things as
> >> >> >> >> > well today: swapoff, when trying to find an empty swap slot and we
> >> >> >> >> > start trying to free swap slots used only by the swapcache. However, I
> >> >> >> >> > think both of these cases can be fixed (I can share more details if
> >> >> >> >> > you want). If everything goes well we should only need to maintain the
> >> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
> >> >> >> >> > spinning disks for readahead.
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> Looking forward to your discussion.
> >> >> >>
> >> >> >> Per my understanding, the indirection is to make it easy to move
> >> >> >> (swapped) pages among swap devices based on hot/cold.  This is similar
> >> >> >> as the target of memory tiering.  It appears that we can extend the
> >> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> >> >> >> Is it possible for zswap to be faster than some slow memory media?
> >> >> >
> >> >> >
> >> >> > Agree with Chris that this may require a much larger overhaul. A slow
> >> >> > memory tier is still addressable memory, swap/zswap requires a page
> >> >> > fault to read the pages. I think (at least for now) there is a
> >> >> > fundamental difference. We want reclaim to eventually treat slow
> >> >> > memory & swap as just different tiers to place cold memory in with
> >> >> > different characteristics, but otherwise I think the swapping
> >> >> > implementation itself is very different.  Am I missing something?
> >> >>
> >> >> Is it possible that zswap is faster than a really slow memory
> >> >> addressable device backed by NAND?  TBH, I don't have the answer.
> >> >
> >> > I am not sure either.
> >> >
> >> >>
> >> >> Anyway, do you need a way to describe the tiers of the swap devices?
> >> >> So, you can move the cold pages among the swap devices based on that?
> >> >
> >> > For now I think the "tiers" in this proposal are just zswap and normal
> >> > swapfiles. We can later extend it to support more explicit tiering.
> >>
> >> IIUC, in original zswap implementation, there's 1:1 relationship between
> >> zswap and normal swapfile.  But now, you make demoting among swap
> >> devices more general.  Then we need some general way to specify which
> >> swap devices are fast and which are slow, and the demoting relationship
> >> among them.  It can be memory tiers or something else, but we need one.
> >
> >
> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is
> > fast, swapfile is slow. In the future, we can support more dynamic
> > tiering if the need arises.
>
> We can start from a simple implementation.  And I think that it's better
> to consider the general design too.  Try not to make it impossible now.


Right. I am proposing we come up with an abstract generic interface
for swap implementations, and have 2 implementations statically
defined (swapfiles and zswap). If the need arises, we can make swap
implementations more dynamic in the future.

>
>
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-20  6:25                       ` Chris Li
@ 2023-03-23  0:56                         ` Huang, Ying
  2023-03-23  6:46                           ` Chris Li
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-23  0:56 UTC (permalink / raw)
  To: Chris Li
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Chris Li <chrisl@kernel.org> writes:

> On Mon, Mar 20, 2023 at 10:55:03AM +0800, Huang, Ying wrote:
>> >
>> > How so? With the indirection enabled, the page tables & page cache
>> > have the swap id (or swap_desc index), which can point to a swap entry
>> > or a zswap entry -- which can change when the page is moved between
>> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
>> > case? Shouldn't be indexed by the abstract swap id so that the
>> > writeback from zswap is transparent?
>> 
>> In my mind,
>> 
>> - swap core will define a abstract interface to swap implementations
>>   (zswap, swap device/file, maybe more in the future), like VFS.
>
> I like your idea very much.

Thanks!

>> 
>> - zswap will be a special swap implementation (compressing instead of
>>   writing to disk).
>
> Agree.
>
>> 
>> - swap core will manage the indirection layer and swap cache.
>
> Agree, those are very good points.
>
>> 
>> - swap core can move swap pages between swap implementations (e.g., from
>>   zswap to a swap device, or from one swap device to another swap
>>   device) with the help of the indirection layer.
>  
> We need to carefully design the swap cache that, when moving between
> swap implementaions, there will be one shared swap cache. The current
> swap cache belongs to swap devices, so two devices will have the same
> page in two swap caches.

We can remove a page from the swap cache for the swap device A, then
insert the page into the swap cache for the swap device B.  The swap
entry will be changed too.

>> In this design, the writeback from zswap becomes moving swapped pages
>> from zswap to a swap device.
>
> Ack.
>
>> 
>> If my understanding were correct, your suggestion is kind of moving
>> zswap logic to the swap core?  And zswap will be always at a higher
>> layer on top of swap device/file?
>
> It seems that way to me. I will let Yosry confirm that.
>
>> > I am not sure how this works with zswap. Currently swap_map[]
>> > implementation is specific for swapfiles, it does not work for zswap
>> > unless we implement separate swap counting logic for zswap &
>> > swapfiles. Same for the swapcache, it currently supports being indexed
>> > by a swap entry, it would need to support being indexed by a swap id,
>> > or have a separate swap cache for zswap. Having separate
>> > implementation would add complexity, and we would need to perform
>> > handoffs of the swap count/cache when a page is moved from zswap to a
>> > swapfile.
>> 
>> We can allocate a swap entry for each swapped page in zswap.
>
> One thing to consider when moving page from zswap to swap file, is the
> zswap swap entry the same entry as the swap file entry.

I think that the swap entry will be changed after moving.  Swap entry is
kind of local to a swap device.  While the swap desc ID isn't changed,
that is why we need the indirection layer.

>> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is
>> > fast, swapfile is slow. In the future, we can support more dynamic
>> > tiering if the need arises.
>> 
>> We can start from a simple implementation.  And I think that it's better
>> to consider the general design too.  Try not to make it impossible now.
>
> In my mind there are a few usage cases:
> 1) using only swap file.
> 2) using only zswap, no swap file.
> 3) Using zswap + swap file (SSD).
>
> The swap core should handle both 3 cases well with minial memory waste.

Yes.  Agree.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-22  5:56                       ` Yosry Ahmed
@ 2023-03-23  1:48                         ` Huang, Ying
  2023-03-23  2:21                           ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-23  1:48 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >>
>> >> >> <snip>
>> >> >> >
>> >> >> > My current idea is to have one xarray that stores the swap_descs
>> >> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
>> >> >> > rotating disks have an additional xarray that maps swap_entry ->
>> >> >> > swap_desc for cluster readahead, assuming we can eliminate all other
>> >> >> > situations requiring a reverse mapping.
>> >> >> >
>> >> >> > I am not sure how having separate xarrays help? If we have one xarray,
>> >> >> > might as well save the other lookups on put everything in swap_desc.
>> >> >> > In fact, this should improve the locking today as swapcache /
>> >> >> > swap_count operations can be lockless or very lightly contended.
>> >> >>
>> >> >> The condition of the proposal is "reverse mapping cannot be avoided for
>> >> >> enough situation".  So, if reverse mapping (or cluster readahead) can be
>> >> >> avoided for enough situations, I think your proposal is good.  Otherwise,
>> >> >> I propose to use 2 xarrays.  You don't need another reverse mapping
>> >> >> xarray, because you just need to read the next several swap_entry into
>> >> >> the swap cache for cluster readahead.  swap_desc isn't needed for
>> >> >> cluster readahead.
>> >> >
>> >> > swap_desc would be needed for cluster readahead in my original
>> >> > proposal as the swap cache lives in swap_descs. Based on the current
>> >> > implementation, we would need a reverse mapping (swap entry ->
>> >> > swap_desc) in 3 situations:
>> >> >
>> >> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
>> >> > failing, we fallback to trying to find swap entries that only have a
>> >> > page in the swap cache (no references in page tables or page cache)
>> >> > and free them. This would require a reverse mapping.
>> >> >
>> >> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
>> >> > to get all swap_descs associated with that swapfile.
>> >> >
>> >> > 3) swap cluster readahead.
>> >> >
>> >> > For (1), I think we can drop the dependency of a reverse mapping if we
>> >> > free swap entries once we swap a page in and add it to the swap cache,
>> >> > even if the swap count does not drop to 0.
>> >>
>> >> Now, we will not drop the swap cache even if the swap count becomes 0 if
>> >> swap space utility < 50%.  Per my understanding, this avoid swap page
>> >> writing for read accesses.  So I don't think we can change this directly
>> >> without necessary discussion firstly.
>> >
>> >
>> > Right. I am not sure I understand why we do this today, is it to save
>> > the overhead of allocating a new swap entry if the page is swapped out
>> > again soon? I am not sure I understand this statement "this avoid swap
>> > page
>> > writing for read accesses".
>> >
>> >>
>> >>
>> >> > For (2), instead of scanning page tables and shmem page cache to find
>> >> > swapped out pages for the swapfile, we can scan all swap_descs
>> >> > instead, we should be more efficient. This is one of the proposal's
>> >> > potential advantages.
>> >>
>> >> Good.
>> >>
>> >> > (3) is the one that would still need a reverse mapping with the
>> >> > current proposal. Today we use swap cluster readahead for anon pages
>> >> > if we have a spinning disk or vma readahead is disabled. For shmem, we
>> >> > always use cluster readahead. If we can limit cluster readahead to
>> >> > only rotating disks, then the reverse mapping can only be maintained
>> >> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
>> >> > reverse mapping for all swapfiles.
>> >>
>> >> For shmem, I think that it should be good to readahead based on shmem
>> >> file offset instead of swap device offset.
>> >>
>> >> It's possible that some pages in the readahead window are from HDD while
>> >> some other pages aren't.  So it's a little hard to enable cluster read
>> >> for HDD only.  Anyway, it's not common to use HDD for swap now.
>> >>
>> >> >>
>> >> >> > If the point is to store the swap_desc directly inside the xarray to
>> >> >> > save 8 bytes, I am concerned that having multiple xarrays for
>> >> >> > swapcache, swap_count, etc will use more than that.
>> >> >>
>> >> >> The idea is to save the memory used by reverse mapping xarray.
>> >> >
>> >> > I see.
>> >> >
>> >> >>
>> >> >> >> >>
>> >> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
>> >> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
>> >> >> >> >> > continuation pages. If we do, it may end up being more. We also
>> >> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
>> >> >> >> >> > element in a page requires continuation, we will allocate an entire
>> >> >> >> >> > page. What I am trying to say is that to get an actual comparison you
>> >> >> >> >> > need to also factor in the swap utilization and the rate of usage of
>> >> >> >> >> > swap continuation. I don't know how to come up with a formula for this
>> >> >> >> >> > tbh.
>> >> >> >> >> >
>> >> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> >> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> >> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> >> >> >> >> > overhead for people not using zswap, but it is not very awful.
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> It seems what you really need is one bit of information to indicate
>> >> >> >> >> >> this page is backed by zswap. Then you can have a seperate pointer
>> >> >> >> >> >> for the zswap entry.
>> >> >> >> >> >
>> >> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
>> >> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it
>> >> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
>> >> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
>> >> >> >> >> > the shmem page cache, similar to swapoff.
>> >> >> >> >> >
>> >> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
>> >> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
>> >> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
>> >> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
>> >> >> >> >> > storing it in another lookup structure.
>> >> >> >> >>
>> >> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> >> >> >> >> swap_entry in swap_desc.  The added indirection appears to be another
>> >> >> >> >> level of page table with 1 entry.  Then, we may use the similar method
>> >> >> >> >> as supporting system with 2 level and 3 level page tables, like the code
>> >> >> >> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
>> >> >> >> >> this deeply.
>> >> >> >> >
>> >> >> >> > Can you expand further on this idea? I am not sure I fully understand.
>> >> >> >>
>> >> >> >> OK.  The goal is to avoid the overhead if indirection isn't enabled via
>> >> >> >> kconfig.
>> >> >> >>
>> >> >> >> If indirection isn't enabled, store swap_entry in PTE directly.
>> >> >> >> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
>> >> >> >> to get/set swap_entry in PTE) are implemented based on kconfig.
>> >> >> >
>> >> >> >
>> >> >> > I thought about this, the problem is that we will have multiple
>> >> >> > implementations of multiple things. For example, swap_count without
>> >> >> > the indirection layer lives in the swap_map (with continuation logic).
>> >> >> > With the indirection layer, it lives in the swap_desc (or somewhere
>> >> >> > else). Same for the swapcache. Even if we keep the swapcache in an
>> >> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
>> >> >> > the indirection is disabled, and by swap_desc (or similar) if the
>> >> >> > indirection is enabled. I think maintaining separate implementations
>> >> >> > for when the indirection is enabled/disabled would be adding too much
>> >> >> > complexity.
>> >> >> >
>> >> >> > WDYT?
>> >> >>
>> >> >> If we go this way, swap cache and swap_count will always be indexed by
>> >> >> swap_entry.  swap_desc just provides a indirection to make it possible
>> >> >> to move between swap devices.
>> >> >>
>> >> >> Why must we index swap cache and swap_count by swap_desc if indirection
>> >> >> is enabled?  Yes, we can save one xarray indexing if we do so, but I
>> >> >> don't think the overhead of one xarray indexing is a showstopper.
>> >> >>
>> >> >> I think this can be one intermediate step towards your final target.
>> >> >> The changes to current implementation can be smaller.
>> >> >
>> >> > IIUC, the idea is to have two xarrays:
>> >> > (a) xarray that stores a pointer to a struct containing swap_count and
>> >> > swap cache.
>> >> > (b) xarray that stores the underlying swap entry or zswap entry.
>> >> >
>> >> > When indirection is disabled:
>> >> > page tables & page cache have swap entry directly like today, xarray
>> >> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
>> >> > mapping needed.
>> >> >
>> >> > In this case we have an extra overhead of 12-16 bytes (the struct
>> >> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
>> >> >
>> >> > When indirection is enabled:
>> >> > page tables & page cache have a swap id (or swap_desc index), xarray
>> >> > (a) is indexed by swap id,
>> >>
>> >> xarray (a) is indexed by swap entry.
>> >
>> >
>> > How so? With the indirection enabled, the page tables & page cache
>> > have the swap id (or swap_desc index), which can point to a swap entry
>> > or a zswap entry -- which can change when the page is moved between
>> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
>> > case? Shouldn't be indexed by the abstract swap id so that the
>> > writeback from zswap is transparent?
>>
>> In my mind,
>>
>> - swap core will define a abstract interface to swap implementations
>>   (zswap, swap device/file, maybe more in the future), like VFS.
>>
>> - zswap will be a special swap implementation (compressing instead of
>>   writing to disk).
>>
>> - swap core will manage the indirection layer and swap cache.
>>
>> - swap core can move swap pages between swap implementations (e.g., from
>>   zswap to a swap device, or from one swap device to another swap
>>   device) with the help of the indirection layer.
>>
>> In this design, the writeback from zswap becomes moving swapped pages
>> from zswap to a swap device.
>
>
> All the above matches my understanding of this proposal. swap_desc is
> the proposed indirection layer, and the swap implementations are zswap
> & swap devices. For now, we only have 2 static swap implementations
> (zswap->swapfile). In the future, we can make this more dynamic as the
> need arises.

Great to align with you on this.

>>
>> If my understanding were correct, your suggestion is kind of moving
>> zswap logic to the swap core?  And zswap will be always at a higher
>> layer on top of swap device/file?
>
>
> We do not want to move the zswap logic into the swap core, we want to
> make the swap core independent of the swap implementation, and zswap
> is just one possible implementation.

Good!

I found that you put zswap related data structure inside struct
swap_desc directly.  I think that we should avoid that as much as
possible.

>>
>> >>
>> >>
>> >> > xarray (b) is indexed by swap id as well
>> >> > and contain swap entry or zswap entry. Reverse mapping might be
>> >> > needed.
>> >>
>> >> Reverse mapping isn't needed.
>> >
>> >
>> > It would be needed if xarray (a) is indexed by the swap id. I am not
>> > sure I understand how it can be indexed by the swap entry if the
>> > indirection is enabled.
>> >
>> >>
>> >>
>> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
>> >> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
>> >> > where needed.
>> >> >
>> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
>> >> >
>> >> > Is my analysis correct? If yes, I agree that the original proposal is
>> >> > good if the reverse mapping can be avoided in enough situations, and
>> >> > that we should consider such alternatives otherwise. As I mentioned
>> >> > above, I think it comes down to whether we can completely restrict
>> >> > cluster readahead to rotating disks or not -- in which case we need to
>> >> > decide what to do for shmem and for anon when vma readahead is
>> >> > disabled.
>> >>
>> >> We can even have a minimal indirection implementation.  Where, swap
>> >> cache and swap_map[] are kept as they ware before, just one xarray is
>> >> added.  The xarray is indexed by swap id (or swap_desc index) to store
>> >> the corresponding swap entry.
>> >>
>> >> When indirection is disabled, no extra overhead.
>> >>
>> >> When indirection is enabled, the extra overhead is just 8 bytes per
>> >> swapped page.
>> >>
>> >> The basic migration support can be build on top of this.
>> >>
>> >> I think that this could be a baseline for indirection support.  Then
>> >> further optimization can be built on top of it step by step with
>> >> supporting data.
>> >
>> >
>> > I am not sure how this works with zswap. Currently swap_map[]
>> > implementation is specific for swapfiles, it does not work for zswap
>> > unless we implement separate swap counting logic for zswap &
>> > swapfiles. Same for the swapcache, it currently supports being indexed
>> > by a swap entry, it would need to support being indexed by a swap id,
>> > or have a separate swap cache for zswap. Having separate
>> > implementation would add complexity, and we would need to perform
>> > handoffs of the swap count/cache when a page is moved from zswap to a
>> > swapfile.
>>
>> We can allocate a swap entry for each swapped page in zswap.
>
>
> This is exactly what the current implementation does and what we want
> to move away from. The current implementation uses zswap as an
> in-memory compressed cache on top of an actual swap device, and each
> swapped page in zswap has a swap entry allocated. With this
> implementation, zswap cannot be used without a swap device.

I totally agree that we should avoid to use an actual swap device under
zswap.  And, as an swap implementation, zswap can manage the swap entry
inside zswap without an underlying actual swap device.  For example,
when we swap a page to zswap (actually compress), we can allocate a
(virtual) swap entry in the zswap.  I understand that there's overhead
to manage the swap entry in zswap.  We can consider how to reduce the
overhead.

Best Regards,
Huang, Ying

>>
>>
>> >>
>> >>
>> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
>> >> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
>> >> >> >> >> >
>> >> >> >> >> > My current intention is to reimplement the swapcache completely as a
>> >> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
>> >> >> >> >> > of the locking we do today if I get things right.
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> > Another potential concern is readahead. With this design, we have no
>> >> >> >> >> >>
>> >> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
>> >> >> >> >> >> use some modernization.
>> >> >> >> >> >
>> >> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
>> >> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
>> >> >> >> >> > that for spinning disks, but I was wrong. We need for other things as
>> >> >> >> >> > well today: swapoff, when trying to find an empty swap slot and we
>> >> >> >> >> > start trying to free swap slots used only by the swapcache. However, I
>> >> >> >> >> > think both of these cases can be fixed (I can share more details if
>> >> >> >> >> > you want). If everything goes well we should only need to maintain the
>> >> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> >> >> >> >> > spinning disks for readahead.
>> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> Looking forward to your discussion.
>> >> >> >>
>> >> >> >> Per my understanding, the indirection is to make it easy to move
>> >> >> >> (swapped) pages among swap devices based on hot/cold.  This is similar
>> >> >> >> as the target of memory tiering.  It appears that we can extend the
>> >> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
>> >> >> >> Is it possible for zswap to be faster than some slow memory media?
>> >> >> >
>> >> >> >
>> >> >> > Agree with Chris that this may require a much larger overhaul. A slow
>> >> >> > memory tier is still addressable memory, swap/zswap requires a page
>> >> >> > fault to read the pages. I think (at least for now) there is a
>> >> >> > fundamental difference. We want reclaim to eventually treat slow
>> >> >> > memory & swap as just different tiers to place cold memory in with
>> >> >> > different characteristics, but otherwise I think the swapping
>> >> >> > implementation itself is very different.  Am I missing something?
>> >> >>
>> >> >> Is it possible that zswap is faster than a really slow memory
>> >> >> addressable device backed by NAND?  TBH, I don't have the answer.
>> >> >
>> >> > I am not sure either.
>> >> >
>> >> >>
>> >> >> Anyway, do you need a way to describe the tiers of the swap devices?
>> >> >> So, you can move the cold pages among the swap devices based on that?
>> >> >
>> >> > For now I think the "tiers" in this proposal are just zswap and normal
>> >> > swapfiles. We can later extend it to support more explicit tiering.
>> >>
>> >> IIUC, in original zswap implementation, there's 1:1 relationship between
>> >> zswap and normal swapfile.  But now, you make demoting among swap
>> >> devices more general.  Then we need some general way to specify which
>> >> swap devices are fast and which are slow, and the demoting relationship
>> >> among them.  It can be memory tiers or something else, but we need one.
>> >
>> >
>> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is
>> > fast, swapfile is slow. In the future, we can support more dynamic
>> > tiering if the need arises.
>>
>> We can start from a simple implementation.  And I think that it's better
>> to consider the general design too.  Try not to make it impossible now.
>
>
> Right. I am proposing we come up with an abstract generic interface
> for swap implementations, and have 2 implementations statically
> defined (swapfiles and zswap). If the need arises, we can make swap
> implementations more dynamic in the future.
>
>>
>>
>> Best Regards,
>> Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23  1:48                         ` Huang, Ying
@ 2023-03-23  2:21                           ` Yosry Ahmed
  2023-03-23  3:16                             ` Huang, Ying
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-23  2:21 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >>
> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >> >>
> >> >> >> <snip>
> >> >> >> >
> >> >> >> > My current idea is to have one xarray that stores the swap_descs
> >> >> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
> >> >> >> > rotating disks have an additional xarray that maps swap_entry ->
> >> >> >> > swap_desc for cluster readahead, assuming we can eliminate all other
> >> >> >> > situations requiring a reverse mapping.
> >> >> >> >
> >> >> >> > I am not sure how having separate xarrays help? If we have one xarray,
> >> >> >> > might as well save the other lookups on put everything in swap_desc.
> >> >> >> > In fact, this should improve the locking today as swapcache /
> >> >> >> > swap_count operations can be lockless or very lightly contended.
> >> >> >>
> >> >> >> The condition of the proposal is "reverse mapping cannot be avoided for
> >> >> >> enough situation".  So, if reverse mapping (or cluster readahead) can be
> >> >> >> avoided for enough situations, I think your proposal is good.  Otherwise,
> >> >> >> I propose to use 2 xarrays.  You don't need another reverse mapping
> >> >> >> xarray, because you just need to read the next several swap_entry into
> >> >> >> the swap cache for cluster readahead.  swap_desc isn't needed for
> >> >> >> cluster readahead.
> >> >> >
> >> >> > swap_desc would be needed for cluster readahead in my original
> >> >> > proposal as the swap cache lives in swap_descs. Based on the current
> >> >> > implementation, we would need a reverse mapping (swap entry ->
> >> >> > swap_desc) in 3 situations:
> >> >> >
> >> >> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
> >> >> > failing, we fallback to trying to find swap entries that only have a
> >> >> > page in the swap cache (no references in page tables or page cache)
> >> >> > and free them. This would require a reverse mapping.
> >> >> >
> >> >> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
> >> >> > to get all swap_descs associated with that swapfile.
> >> >> >
> >> >> > 3) swap cluster readahead.
> >> >> >
> >> >> > For (1), I think we can drop the dependency of a reverse mapping if we
> >> >> > free swap entries once we swap a page in and add it to the swap cache,
> >> >> > even if the swap count does not drop to 0.
> >> >>
> >> >> Now, we will not drop the swap cache even if the swap count becomes 0 if
> >> >> swap space utility < 50%.  Per my understanding, this avoid swap page
> >> >> writing for read accesses.  So I don't think we can change this directly
> >> >> without necessary discussion firstly.
> >> >
> >> >
> >> > Right. I am not sure I understand why we do this today, is it to save
> >> > the overhead of allocating a new swap entry if the page is swapped out
> >> > again soon? I am not sure I understand this statement "this avoid swap
> >> > page
> >> > writing for read accesses".
> >> >
> >> >>
> >> >>
> >> >> > For (2), instead of scanning page tables and shmem page cache to find
> >> >> > swapped out pages for the swapfile, we can scan all swap_descs
> >> >> > instead, we should be more efficient. This is one of the proposal's
> >> >> > potential advantages.
> >> >>
> >> >> Good.
> >> >>
> >> >> > (3) is the one that would still need a reverse mapping with the
> >> >> > current proposal. Today we use swap cluster readahead for anon pages
> >> >> > if we have a spinning disk or vma readahead is disabled. For shmem, we
> >> >> > always use cluster readahead. If we can limit cluster readahead to
> >> >> > only rotating disks, then the reverse mapping can only be maintained
> >> >> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
> >> >> > reverse mapping for all swapfiles.
> >> >>
> >> >> For shmem, I think that it should be good to readahead based on shmem
> >> >> file offset instead of swap device offset.
> >> >>
> >> >> It's possible that some pages in the readahead window are from HDD while
> >> >> some other pages aren't.  So it's a little hard to enable cluster read
> >> >> for HDD only.  Anyway, it's not common to use HDD for swap now.
> >> >>
> >> >> >>
> >> >> >> > If the point is to store the swap_desc directly inside the xarray to
> >> >> >> > save 8 bytes, I am concerned that having multiple xarrays for
> >> >> >> > swapcache, swap_count, etc will use more than that.
> >> >> >>
> >> >> >> The idea is to save the memory used by reverse mapping xarray.
> >> >> >
> >> >> > I see.
> >> >> >
> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> >> >> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
> >> >> >> >> >> > continuation pages. If we do, it may end up being more. We also
> >> >> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
> >> >> >> >> >> > element in a page requires continuation, we will allocate an entire
> >> >> >> >> >> > page. What I am trying to say is that to get an actual comparison you
> >> >> >> >> >> > need to also factor in the swap utilization and the rate of usage of
> >> >> >> >> >> > swap continuation. I don't know how to come up with a formula for this
> >> >> >> >> >> > tbh.
> >> >> >> >> >> >
> >> >> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
> >> >> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> >> >> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> >> >> >> >> >> > overhead for people not using zswap, but it is not very awful.
> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> It seems what you really need is one bit of information to indicate
> >> >> >> >> >> >> this page is backed by zswap. Then you can have a seperate pointer
> >> >> >> >> >> >> for the zswap entry.
> >> >> >> >> >> >
> >> >> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
> >> >> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it
> >> >> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
> >> >> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
> >> >> >> >> >> > the shmem page cache, similar to swapoff.
> >> >> >> >> >> >
> >> >> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
> >> >> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
> >> >> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
> >> >> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
> >> >> >> >> >> > storing it in another lookup structure.
> >> >> >> >> >>
> >> >> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> >> >> >> >> >> swap_entry in swap_desc.  The added indirection appears to be another
> >> >> >> >> >> level of page table with 1 entry.  Then, we may use the similar method
> >> >> >> >> >> as supporting system with 2 level and 3 level page tables, like the code
> >> >> >> >> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
> >> >> >> >> >> this deeply.
> >> >> >> >> >
> >> >> >> >> > Can you expand further on this idea? I am not sure I fully understand.
> >> >> >> >>
> >> >> >> >> OK.  The goal is to avoid the overhead if indirection isn't enabled via
> >> >> >> >> kconfig.
> >> >> >> >>
> >> >> >> >> If indirection isn't enabled, store swap_entry in PTE directly.
> >> >> >> >> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
> >> >> >> >> to get/set swap_entry in PTE) are implemented based on kconfig.
> >> >> >> >
> >> >> >> >
> >> >> >> > I thought about this, the problem is that we will have multiple
> >> >> >> > implementations of multiple things. For example, swap_count without
> >> >> >> > the indirection layer lives in the swap_map (with continuation logic).
> >> >> >> > With the indirection layer, it lives in the swap_desc (or somewhere
> >> >> >> > else). Same for the swapcache. Even if we keep the swapcache in an
> >> >> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
> >> >> >> > the indirection is disabled, and by swap_desc (or similar) if the
> >> >> >> > indirection is enabled. I think maintaining separate implementations
> >> >> >> > for when the indirection is enabled/disabled would be adding too much
> >> >> >> > complexity.
> >> >> >> >
> >> >> >> > WDYT?
> >> >> >>
> >> >> >> If we go this way, swap cache and swap_count will always be indexed by
> >> >> >> swap_entry.  swap_desc just provides a indirection to make it possible
> >> >> >> to move between swap devices.
> >> >> >>
> >> >> >> Why must we index swap cache and swap_count by swap_desc if indirection
> >> >> >> is enabled?  Yes, we can save one xarray indexing if we do so, but I
> >> >> >> don't think the overhead of one xarray indexing is a showstopper.
> >> >> >>
> >> >> >> I think this can be one intermediate step towards your final target.
> >> >> >> The changes to current implementation can be smaller.
> >> >> >
> >> >> > IIUC, the idea is to have two xarrays:
> >> >> > (a) xarray that stores a pointer to a struct containing swap_count and
> >> >> > swap cache.
> >> >> > (b) xarray that stores the underlying swap entry or zswap entry.
> >> >> >
> >> >> > When indirection is disabled:
> >> >> > page tables & page cache have swap entry directly like today, xarray
> >> >> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
> >> >> > mapping needed.
> >> >> >
> >> >> > In this case we have an extra overhead of 12-16 bytes (the struct
> >> >> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
> >> >> >
> >> >> > When indirection is enabled:
> >> >> > page tables & page cache have a swap id (or swap_desc index), xarray
> >> >> > (a) is indexed by swap id,
> >> >>
> >> >> xarray (a) is indexed by swap entry.
> >> >
> >> >
> >> > How so? With the indirection enabled, the page tables & page cache
> >> > have the swap id (or swap_desc index), which can point to a swap entry
> >> > or a zswap entry -- which can change when the page is moved between
> >> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
> >> > case? Shouldn't be indexed by the abstract swap id so that the
> >> > writeback from zswap is transparent?
> >>
> >> In my mind,
> >>
> >> - swap core will define a abstract interface to swap implementations
> >>   (zswap, swap device/file, maybe more in the future), like VFS.
> >>
> >> - zswap will be a special swap implementation (compressing instead of
> >>   writing to disk).
> >>
> >> - swap core will manage the indirection layer and swap cache.
> >>
> >> - swap core can move swap pages between swap implementations (e.g., from
> >>   zswap to a swap device, or from one swap device to another swap
> >>   device) with the help of the indirection layer.
> >>
> >> In this design, the writeback from zswap becomes moving swapped pages
> >> from zswap to a swap device.
> >
> >
> > All the above matches my understanding of this proposal. swap_desc is
> > the proposed indirection layer, and the swap implementations are zswap
> > & swap devices. For now, we only have 2 static swap implementations
> > (zswap->swapfile). In the future, we can make this more dynamic as the
> > need arises.
>
> Great to align with you on this.

Great!

>
> >>
> >> If my understanding were correct, your suggestion is kind of moving
> >> zswap logic to the swap core?  And zswap will be always at a higher
> >> layer on top of swap device/file?
> >
> >
> > We do not want to move the zswap logic into the swap core, we want to
> > make the swap core independent of the swap implementation, and zswap
> > is just one possible implementation.
>
> Good!
>
> I found that you put zswap related data structure inside struct
> swap_desc directly.  I think that we should avoid that as much as
> possible.

That was just an initial draft to show what it would be like, I do not
intend to hardcode zswap-specific items in swap_desc.

>
> >>
> >> >>
> >> >>
> >> >> > xarray (b) is indexed by swap id as well
> >> >> > and contain swap entry or zswap entry. Reverse mapping might be
> >> >> > needed.
> >> >>
> >> >> Reverse mapping isn't needed.
> >> >
> >> >
> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
> >> > sure I understand how it can be indexed by the swap entry if the
> >> > indirection is enabled.
> >> >
> >> >>
> >> >>
> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> >> >> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
> >> >> > where needed.
> >> >> >
> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
> >> >> >
> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
> >> >> > good if the reverse mapping can be avoided in enough situations, and
> >> >> > that we should consider such alternatives otherwise. As I mentioned
> >> >> > above, I think it comes down to whether we can completely restrict
> >> >> > cluster readahead to rotating disks or not -- in which case we need to
> >> >> > decide what to do for shmem and for anon when vma readahead is
> >> >> > disabled.
> >> >>
> >> >> We can even have a minimal indirection implementation.  Where, swap
> >> >> cache and swap_map[] are kept as they ware before, just one xarray is
> >> >> added.  The xarray is indexed by swap id (or swap_desc index) to store
> >> >> the corresponding swap entry.
> >> >>
> >> >> When indirection is disabled, no extra overhead.
> >> >>
> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
> >> >> swapped page.
> >> >>
> >> >> The basic migration support can be build on top of this.
> >> >>
> >> >> I think that this could be a baseline for indirection support.  Then
> >> >> further optimization can be built on top of it step by step with
> >> >> supporting data.
> >> >
> >> >
> >> > I am not sure how this works with zswap. Currently swap_map[]
> >> > implementation is specific for swapfiles, it does not work for zswap
> >> > unless we implement separate swap counting logic for zswap &
> >> > swapfiles. Same for the swapcache, it currently supports being indexed
> >> > by a swap entry, it would need to support being indexed by a swap id,
> >> > or have a separate swap cache for zswap. Having separate
> >> > implementation would add complexity, and we would need to perform
> >> > handoffs of the swap count/cache when a page is moved from zswap to a
> >> > swapfile.
> >>
> >> We can allocate a swap entry for each swapped page in zswap.
> >
> >
> > This is exactly what the current implementation does and what we want
> > to move away from. The current implementation uses zswap as an
> > in-memory compressed cache on top of an actual swap device, and each
> > swapped page in zswap has a swap entry allocated. With this
> > implementation, zswap cannot be used without a swap device.
>
> I totally agree that we should avoid to use an actual swap device under
> zswap.  And, as an swap implementation, zswap can manage the swap entry
> inside zswap without an underlying actual swap device.  For example,
> when we swap a page to zswap (actually compress), we can allocate a
> (virtual) swap entry in the zswap.  I understand that there's overhead
> to manage the swap entry in zswap.  We can consider how to reduce the
> overhead.

I see. So we can (for example) use one of the swap types for zswap,
and then have zswap code handle this entry according to its
implementation. We can then have an xarray that maps swap ID -> swap
entry, and this swap entry is used to index the swap cache and such.
When a swapped page is moved between backends we update the swap ID ->
swap entry xarray.

This is one possible implementation that I thought of (very briefly
tbh), but it does have its problems:
For zswap:
- Managing swap entries inside zswap unnecessarily.
- We need to maintain a swap entry -> zswap entry mapping in zswap --
similar to the current rbtree, which is something that we can get rid
of with the initial proposal if we embed the zswap_entry pointer
directly in the swap_desc (it can be encoded to avoid breaking the
abstraction).

For mm/swap in general:
- When we allocate a swap entry today, we store it in folio->private
(or page->private), which is used by the unmapping code to be placed
in the page tables or shmem page cache. With this implementation, we
need to store the swap ID in page->private instead, which means that
every time we need to access the swap cache during reclaim/swapout we
need to lookup the swap entry first.
- On the fault path, we need two lookups instead of one (swap ID ->
swap entry, swap entry -> swap cache), not sure how this affects fault
latency.
- Each swap backend will have its own separate implementation of swap
counting, which is hard to maintain and very error-prone since the
logic is backend-agnostic.
- Handing over a page from one swap backend to another includes
handing over swap cache entries and swap counts, which I imagine will
involve considerable synchronization.

Do you have any thoughts on this?

>
> Best Regards,
> Huang, Ying
>
> >>
> >>
> >> >>
> >> >>
> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
> >> >> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
> >> >> >> >> >> >
> >> >> >> >> >> > My current intention is to reimplement the swapcache completely as a
> >> >> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
> >> >> >> >> >> > of the locking we do today if I get things right.
> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> > Another potential concern is readahead. With this design, we have no
> >> >> >> >> >> >>
> >> >> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> >> >> >> >> >> use some modernization.
> >> >> >> >> >> >
> >> >> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> >> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
> >> >> >> >> >> > that for spinning disks, but I was wrong. We need for other things as
> >> >> >> >> >> > well today: swapoff, when trying to find an empty swap slot and we
> >> >> >> >> >> > start trying to free swap slots used only by the swapcache. However, I
> >> >> >> >> >> > think both of these cases can be fixed (I can share more details if
> >> >> >> >> >> > you want). If everything goes well we should only need to maintain the
> >> >> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
> >> >> >> >> >> > spinning disks for readahead.
> >> >> >> >> >> >
> >> >> >> >> >> >>
> >> >> >> >> >> >> Looking forward to your discussion.
> >> >> >> >>
> >> >> >> >> Per my understanding, the indirection is to make it easy to move
> >> >> >> >> (swapped) pages among swap devices based on hot/cold.  This is similar
> >> >> >> >> as the target of memory tiering.  It appears that we can extend the
> >> >> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> >> >> >> >> Is it possible for zswap to be faster than some slow memory media?
> >> >> >> >
> >> >> >> >
> >> >> >> > Agree with Chris that this may require a much larger overhaul. A slow
> >> >> >> > memory tier is still addressable memory, swap/zswap requires a page
> >> >> >> > fault to read the pages. I think (at least for now) there is a
> >> >> >> > fundamental difference. We want reclaim to eventually treat slow
> >> >> >> > memory & swap as just different tiers to place cold memory in with
> >> >> >> > different characteristics, but otherwise I think the swapping
> >> >> >> > implementation itself is very different.  Am I missing something?
> >> >> >>
> >> >> >> Is it possible that zswap is faster than a really slow memory
> >> >> >> addressable device backed by NAND?  TBH, I don't have the answer.
> >> >> >
> >> >> > I am not sure either.
> >> >> >
> >> >> >>
> >> >> >> Anyway, do you need a way to describe the tiers of the swap devices?
> >> >> >> So, you can move the cold pages among the swap devices based on that?
> >> >> >
> >> >> > For now I think the "tiers" in this proposal are just zswap and normal
> >> >> > swapfiles. We can later extend it to support more explicit tiering.
> >> >>
> >> >> IIUC, in original zswap implementation, there's 1:1 relationship between
> >> >> zswap and normal swapfile.  But now, you make demoting among swap
> >> >> devices more general.  Then we need some general way to specify which
> >> >> swap devices are fast and which are slow, and the demoting relationship
> >> >> among them.  It can be memory tiers or something else, but we need one.
> >> >
> >> >
> >> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is
> >> > fast, swapfile is slow. In the future, we can support more dynamic
> >> > tiering if the need arises.
> >>
> >> We can start from a simple implementation.  And I think that it's better
> >> to consider the general design too.  Try not to make it impossible now.
> >
> >
> > Right. I am proposing we come up with an abstract generic interface
> > for swap implementations, and have 2 implementations statically
> > defined (swapfiles and zswap). If the need arises, we can make swap
> > implementations more dynamic in the future.
> >
> >>
> >>
> >> Best Regards,
> >> Huang, Ying
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23  2:21                           ` Yosry Ahmed
@ 2023-03-23  3:16                             ` Huang, Ying
  2023-03-23  3:27                               ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-23  3:16 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >>
>> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >> >>
>> >> >> >> <snip>
>> >> >> >> >
>> >> >> >> > My current idea is to have one xarray that stores the swap_descs
>> >> >> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
>> >> >> >> > rotating disks have an additional xarray that maps swap_entry ->
>> >> >> >> > swap_desc for cluster readahead, assuming we can eliminate all other
>> >> >> >> > situations requiring a reverse mapping.
>> >> >> >> >
>> >> >> >> > I am not sure how having separate xarrays help? If we have one xarray,
>> >> >> >> > might as well save the other lookups on put everything in swap_desc.
>> >> >> >> > In fact, this should improve the locking today as swapcache /
>> >> >> >> > swap_count operations can be lockless or very lightly contended.
>> >> >> >>
>> >> >> >> The condition of the proposal is "reverse mapping cannot be avoided for
>> >> >> >> enough situation".  So, if reverse mapping (or cluster readahead) can be
>> >> >> >> avoided for enough situations, I think your proposal is good.  Otherwise,
>> >> >> >> I propose to use 2 xarrays.  You don't need another reverse mapping
>> >> >> >> xarray, because you just need to read the next several swap_entry into
>> >> >> >> the swap cache for cluster readahead.  swap_desc isn't needed for
>> >> >> >> cluster readahead.
>> >> >> >
>> >> >> > swap_desc would be needed for cluster readahead in my original
>> >> >> > proposal as the swap cache lives in swap_descs. Based on the current
>> >> >> > implementation, we would need a reverse mapping (swap entry ->
>> >> >> > swap_desc) in 3 situations:
>> >> >> >
>> >> >> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
>> >> >> > failing, we fallback to trying to find swap entries that only have a
>> >> >> > page in the swap cache (no references in page tables or page cache)
>> >> >> > and free them. This would require a reverse mapping.
>> >> >> >
>> >> >> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
>> >> >> > to get all swap_descs associated with that swapfile.
>> >> >> >
>> >> >> > 3) swap cluster readahead.
>> >> >> >
>> >> >> > For (1), I think we can drop the dependency of a reverse mapping if we
>> >> >> > free swap entries once we swap a page in and add it to the swap cache,
>> >> >> > even if the swap count does not drop to 0.
>> >> >>
>> >> >> Now, we will not drop the swap cache even if the swap count becomes 0 if
>> >> >> swap space utility < 50%.  Per my understanding, this avoid swap page
>> >> >> writing for read accesses.  So I don't think we can change this directly
>> >> >> without necessary discussion firstly.
>> >> >
>> >> >
>> >> > Right. I am not sure I understand why we do this today, is it to save
>> >> > the overhead of allocating a new swap entry if the page is swapped out
>> >> > again soon? I am not sure I understand this statement "this avoid swap
>> >> > page
>> >> > writing for read accesses".
>> >> >
>> >> >>
>> >> >>
>> >> >> > For (2), instead of scanning page tables and shmem page cache to find
>> >> >> > swapped out pages for the swapfile, we can scan all swap_descs
>> >> >> > instead, we should be more efficient. This is one of the proposal's
>> >> >> > potential advantages.
>> >> >>
>> >> >> Good.
>> >> >>
>> >> >> > (3) is the one that would still need a reverse mapping with the
>> >> >> > current proposal. Today we use swap cluster readahead for anon pages
>> >> >> > if we have a spinning disk or vma readahead is disabled. For shmem, we
>> >> >> > always use cluster readahead. If we can limit cluster readahead to
>> >> >> > only rotating disks, then the reverse mapping can only be maintained
>> >> >> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
>> >> >> > reverse mapping for all swapfiles.
>> >> >>
>> >> >> For shmem, I think that it should be good to readahead based on shmem
>> >> >> file offset instead of swap device offset.
>> >> >>
>> >> >> It's possible that some pages in the readahead window are from HDD while
>> >> >> some other pages aren't.  So it's a little hard to enable cluster read
>> >> >> for HDD only.  Anyway, it's not common to use HDD for swap now.
>> >> >>
>> >> >> >>
>> >> >> >> > If the point is to store the swap_desc directly inside the xarray to
>> >> >> >> > save 8 bytes, I am concerned that having multiple xarrays for
>> >> >> >> > swapcache, swap_count, etc will use more than that.
>> >> >> >>
>> >> >> >> The idea is to save the memory used by reverse mapping xarray.
>> >> >> >
>> >> >> > I see.
>> >> >> >
>> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
>> >> >> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
>> >> >> >> >> >> > continuation pages. If we do, it may end up being more. We also
>> >> >> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
>> >> >> >> >> >> > element in a page requires continuation, we will allocate an entire
>> >> >> >> >> >> > page. What I am trying to say is that to get an actual comparison you
>> >> >> >> >> >> > need to also factor in the swap utilization and the rate of usage of
>> >> >> >> >> >> > swap continuation. I don't know how to come up with a formula for this
>> >> >> >> >> >> > tbh.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> >> >> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> >> >> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> >> >> >> >> >> > overhead for people not using zswap, but it is not very awful.
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> It seems what you really need is one bit of information to indicate
>> >> >> >> >> >> >> this page is backed by zswap. Then you can have a seperate pointer
>> >> >> >> >> >> >> for the zswap entry.
>> >> >> >> >> >> >
>> >> >> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
>> >> >> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it
>> >> >> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
>> >> >> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
>> >> >> >> >> >> > the shmem page cache, similar to swapoff.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
>> >> >> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
>> >> >> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
>> >> >> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
>> >> >> >> >> >> > storing it in another lookup structure.
>> >> >> >> >> >>
>> >> >> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> >> >> >> >> >> swap_entry in swap_desc.  The added indirection appears to be another
>> >> >> >> >> >> level of page table with 1 entry.  Then, we may use the similar method
>> >> >> >> >> >> as supporting system with 2 level and 3 level page tables, like the code
>> >> >> >> >> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
>> >> >> >> >> >> this deeply.
>> >> >> >> >> >
>> >> >> >> >> > Can you expand further on this idea? I am not sure I fully understand.
>> >> >> >> >>
>> >> >> >> >> OK.  The goal is to avoid the overhead if indirection isn't enabled via
>> >> >> >> >> kconfig.
>> >> >> >> >>
>> >> >> >> >> If indirection isn't enabled, store swap_entry in PTE directly.
>> >> >> >> >> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
>> >> >> >> >> to get/set swap_entry in PTE) are implemented based on kconfig.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > I thought about this, the problem is that we will have multiple
>> >> >> >> > implementations of multiple things. For example, swap_count without
>> >> >> >> > the indirection layer lives in the swap_map (with continuation logic).
>> >> >> >> > With the indirection layer, it lives in the swap_desc (or somewhere
>> >> >> >> > else). Same for the swapcache. Even if we keep the swapcache in an
>> >> >> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
>> >> >> >> > the indirection is disabled, and by swap_desc (or similar) if the
>> >> >> >> > indirection is enabled. I think maintaining separate implementations
>> >> >> >> > for when the indirection is enabled/disabled would be adding too much
>> >> >> >> > complexity.
>> >> >> >> >
>> >> >> >> > WDYT?
>> >> >> >>
>> >> >> >> If we go this way, swap cache and swap_count will always be indexed by
>> >> >> >> swap_entry.  swap_desc just provides a indirection to make it possible
>> >> >> >> to move between swap devices.
>> >> >> >>
>> >> >> >> Why must we index swap cache and swap_count by swap_desc if indirection
>> >> >> >> is enabled?  Yes, we can save one xarray indexing if we do so, but I
>> >> >> >> don't think the overhead of one xarray indexing is a showstopper.
>> >> >> >>
>> >> >> >> I think this can be one intermediate step towards your final target.
>> >> >> >> The changes to current implementation can be smaller.
>> >> >> >
>> >> >> > IIUC, the idea is to have two xarrays:
>> >> >> > (a) xarray that stores a pointer to a struct containing swap_count and
>> >> >> > swap cache.
>> >> >> > (b) xarray that stores the underlying swap entry or zswap entry.
>> >> >> >
>> >> >> > When indirection is disabled:
>> >> >> > page tables & page cache have swap entry directly like today, xarray
>> >> >> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
>> >> >> > mapping needed.
>> >> >> >
>> >> >> > In this case we have an extra overhead of 12-16 bytes (the struct
>> >> >> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
>> >> >> >
>> >> >> > When indirection is enabled:
>> >> >> > page tables & page cache have a swap id (or swap_desc index), xarray
>> >> >> > (a) is indexed by swap id,
>> >> >>
>> >> >> xarray (a) is indexed by swap entry.
>> >> >
>> >> >
>> >> > How so? With the indirection enabled, the page tables & page cache
>> >> > have the swap id (or swap_desc index), which can point to a swap entry
>> >> > or a zswap entry -- which can change when the page is moved between
>> >> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
>> >> > case? Shouldn't be indexed by the abstract swap id so that the
>> >> > writeback from zswap is transparent?
>> >>
>> >> In my mind,
>> >>
>> >> - swap core will define a abstract interface to swap implementations
>> >>   (zswap, swap device/file, maybe more in the future), like VFS.
>> >>
>> >> - zswap will be a special swap implementation (compressing instead of
>> >>   writing to disk).
>> >>
>> >> - swap core will manage the indirection layer and swap cache.
>> >>
>> >> - swap core can move swap pages between swap implementations (e.g., from
>> >>   zswap to a swap device, or from one swap device to another swap
>> >>   device) with the help of the indirection layer.
>> >>
>> >> In this design, the writeback from zswap becomes moving swapped pages
>> >> from zswap to a swap device.
>> >
>> >
>> > All the above matches my understanding of this proposal. swap_desc is
>> > the proposed indirection layer, and the swap implementations are zswap
>> > & swap devices. For now, we only have 2 static swap implementations
>> > (zswap->swapfile). In the future, we can make this more dynamic as the
>> > need arises.
>>
>> Great to align with you on this.
>
> Great!
>
>>
>> >>
>> >> If my understanding were correct, your suggestion is kind of moving
>> >> zswap logic to the swap core?  And zswap will be always at a higher
>> >> layer on top of swap device/file?
>> >
>> >
>> > We do not want to move the zswap logic into the swap core, we want to
>> > make the swap core independent of the swap implementation, and zswap
>> > is just one possible implementation.
>>
>> Good!
>>
>> I found that you put zswap related data structure inside struct
>> swap_desc directly.  I think that we should avoid that as much as
>> possible.
>
> That was just an initial draft to show what it would be like, I do not
> intend to hardcode zswap-specific items in swap_desc.
>
>>
>> >>
>> >> >>
>> >> >>
>> >> >> > xarray (b) is indexed by swap id as well
>> >> >> > and contain swap entry or zswap entry. Reverse mapping might be
>> >> >> > needed.
>> >> >>
>> >> >> Reverse mapping isn't needed.
>> >> >
>> >> >
>> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
>> >> > sure I understand how it can be indexed by the swap entry if the
>> >> > indirection is enabled.
>> >> >
>> >> >>
>> >> >>
>> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
>> >> >> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
>> >> >> > where needed.
>> >> >> >
>> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
>> >> >> >
>> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
>> >> >> > good if the reverse mapping can be avoided in enough situations, and
>> >> >> > that we should consider such alternatives otherwise. As I mentioned
>> >> >> > above, I think it comes down to whether we can completely restrict
>> >> >> > cluster readahead to rotating disks or not -- in which case we need to
>> >> >> > decide what to do for shmem and for anon when vma readahead is
>> >> >> > disabled.
>> >> >>
>> >> >> We can even have a minimal indirection implementation.  Where, swap
>> >> >> cache and swap_map[] are kept as they ware before, just one xarray is
>> >> >> added.  The xarray is indexed by swap id (or swap_desc index) to store
>> >> >> the corresponding swap entry.
>> >> >>
>> >> >> When indirection is disabled, no extra overhead.
>> >> >>
>> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
>> >> >> swapped page.
>> >> >>
>> >> >> The basic migration support can be build on top of this.
>> >> >>
>> >> >> I think that this could be a baseline for indirection support.  Then
>> >> >> further optimization can be built on top of it step by step with
>> >> >> supporting data.
>> >> >
>> >> >
>> >> > I am not sure how this works with zswap. Currently swap_map[]
>> >> > implementation is specific for swapfiles, it does not work for zswap
>> >> > unless we implement separate swap counting logic for zswap &
>> >> > swapfiles. Same for the swapcache, it currently supports being indexed
>> >> > by a swap entry, it would need to support being indexed by a swap id,
>> >> > or have a separate swap cache for zswap. Having separate
>> >> > implementation would add complexity, and we would need to perform
>> >> > handoffs of the swap count/cache when a page is moved from zswap to a
>> >> > swapfile.
>> >>
>> >> We can allocate a swap entry for each swapped page in zswap.
>> >
>> >
>> > This is exactly what the current implementation does and what we want
>> > to move away from. The current implementation uses zswap as an
>> > in-memory compressed cache on top of an actual swap device, and each
>> > swapped page in zswap has a swap entry allocated. With this
>> > implementation, zswap cannot be used without a swap device.
>>
>> I totally agree that we should avoid to use an actual swap device under
>> zswap.  And, as an swap implementation, zswap can manage the swap entry
>> inside zswap without an underlying actual swap device.  For example,
>> when we swap a page to zswap (actually compress), we can allocate a
>> (virtual) swap entry in the zswap.  I understand that there's overhead
>> to manage the swap entry in zswap.  We can consider how to reduce the
>> overhead.
>
> I see. So we can (for example) use one of the swap types for zswap,
> and then have zswap code handle this entry according to its
> implementation. We can then have an xarray that maps swap ID -> swap
> entry, and this swap entry is used to index the swap cache and such.
> When a swapped page is moved between backends we update the swap ID ->
> swap entry xarray.
>
> This is one possible implementation that I thought of (very briefly
> tbh), but it does have its problems:
> For zswap:
> - Managing swap entries inside zswap unnecessarily.
> - We need to maintain a swap entry -> zswap entry mapping in zswap --
> similar to the current rbtree, which is something that we can get rid
> of with the initial proposal if we embed the zswap_entry pointer
> directly in the swap_desc (it can be encoded to avoid breaking the
> abstraction).
>
> For mm/swap in general:
> - When we allocate a swap entry today, we store it in folio->private
> (or page->private), which is used by the unmapping code to be placed
> in the page tables or shmem page cache. With this implementation, we
> need to store the swap ID in page->private instead, which means that
> every time we need to access the swap cache during reclaim/swapout we
> need to lookup the swap entry first.
> - On the fault path, we need two lookups instead of one (swap ID ->
> swap entry, swap entry -> swap cache), not sure how this affects fault
> latency.
> - Each swap backend will have its own separate implementation of swap
> counting, which is hard to maintain and very error-prone since the
> logic is backend-agnostic.
> - Handing over a page from one swap backend to another includes
> handing over swap cache entries and swap counts, which I imagine will
> involve considerable synchronization.
>
> Do you have any thoughts on this?

Yes.  I understand there's additional overhead.  I have no clear idea
about how to reduce this now.  We need to think about that in depth.

The bottom line is whether this is worse than the current zswap
implementation?

Best Regards,
Huang, Ying

>>
>> >>
>> >>
>> >> >>
>> >> >>
>> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
>> >> >> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
>> >> >> >> >> >> >
>> >> >> >> >> >> > My current intention is to reimplement the swapcache completely as a
>> >> >> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
>> >> >> >> >> >> > of the locking we do today if I get things right.
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> > Another potential concern is readahead. With this design, we have no
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
>> >> >> >> >> >> >> use some modernization.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
>> >> >> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
>> >> >> >> >> >> > that for spinning disks, but I was wrong. We need for other things as
>> >> >> >> >> >> > well today: swapoff, when trying to find an empty swap slot and we
>> >> >> >> >> >> > start trying to free swap slots used only by the swapcache. However, I
>> >> >> >> >> >> > think both of these cases can be fixed (I can share more details if
>> >> >> >> >> >> > you want). If everything goes well we should only need to maintain the
>> >> >> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> >> >> >> >> >> > spinning disks for readahead.
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Looking forward to your discussion.
>> >> >> >> >>
>> >> >> >> >> Per my understanding, the indirection is to make it easy to move
>> >> >> >> >> (swapped) pages among swap devices based on hot/cold.  This is similar
>> >> >> >> >> as the target of memory tiering.  It appears that we can extend the
>> >> >> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
>> >> >> >> >> Is it possible for zswap to be faster than some slow memory media?
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > Agree with Chris that this may require a much larger overhaul. A slow
>> >> >> >> > memory tier is still addressable memory, swap/zswap requires a page
>> >> >> >> > fault to read the pages. I think (at least for now) there is a
>> >> >> >> > fundamental difference. We want reclaim to eventually treat slow
>> >> >> >> > memory & swap as just different tiers to place cold memory in with
>> >> >> >> > different characteristics, but otherwise I think the swapping
>> >> >> >> > implementation itself is very different.  Am I missing something?
>> >> >> >>
>> >> >> >> Is it possible that zswap is faster than a really slow memory
>> >> >> >> addressable device backed by NAND?  TBH, I don't have the answer.
>> >> >> >
>> >> >> > I am not sure either.
>> >> >> >
>> >> >> >>
>> >> >> >> Anyway, do you need a way to describe the tiers of the swap devices?
>> >> >> >> So, you can move the cold pages among the swap devices based on that?
>> >> >> >
>> >> >> > For now I think the "tiers" in this proposal are just zswap and normal
>> >> >> > swapfiles. We can later extend it to support more explicit tiering.
>> >> >>
>> >> >> IIUC, in original zswap implementation, there's 1:1 relationship between
>> >> >> zswap and normal swapfile.  But now, you make demoting among swap
>> >> >> devices more general.  Then we need some general way to specify which
>> >> >> swap devices are fast and which are slow, and the demoting relationship
>> >> >> among them.  It can be memory tiers or something else, but we need one.
>> >> >
>> >> >
>> >> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is
>> >> > fast, swapfile is slow. In the future, we can support more dynamic
>> >> > tiering if the need arises.
>> >>
>> >> We can start from a simple implementation.  And I think that it's better
>> >> to consider the general design too.  Try not to make it impossible now.
>> >
>> >
>> > Right. I am proposing we come up with an abstract generic interface
>> > for swap implementations, and have 2 implementations statically
>> > defined (swapfiles and zswap). If the need arises, we can make swap
>> > implementations more dynamic in the future.
>> >
>> >>
>> >>
>> >> Best Regards,
>> >> Huang, Ying
>>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23  3:16                             ` Huang, Ying
@ 2023-03-23  3:27                               ` Yosry Ahmed
  2023-03-23  5:37                                 ` Huang, Ying
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-23  3:27 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Wed, Mar 22, 2023 at 8:17 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >>
> >> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >> >>
> >> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >> >> >>
> >> >> >> >> <snip>
> >> >> >> >> >
> >> >> >> >> > My current idea is to have one xarray that stores the swap_descs
> >> >> >> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
> >> >> >> >> > rotating disks have an additional xarray that maps swap_entry ->
> >> >> >> >> > swap_desc for cluster readahead, assuming we can eliminate all other
> >> >> >> >> > situations requiring a reverse mapping.
> >> >> >> >> >
> >> >> >> >> > I am not sure how having separate xarrays help? If we have one xarray,
> >> >> >> >> > might as well save the other lookups on put everything in swap_desc.
> >> >> >> >> > In fact, this should improve the locking today as swapcache /
> >> >> >> >> > swap_count operations can be lockless or very lightly contended.
> >> >> >> >>
> >> >> >> >> The condition of the proposal is "reverse mapping cannot be avoided for
> >> >> >> >> enough situation".  So, if reverse mapping (or cluster readahead) can be
> >> >> >> >> avoided for enough situations, I think your proposal is good.  Otherwise,
> >> >> >> >> I propose to use 2 xarrays.  You don't need another reverse mapping
> >> >> >> >> xarray, because you just need to read the next several swap_entry into
> >> >> >> >> the swap cache for cluster readahead.  swap_desc isn't needed for
> >> >> >> >> cluster readahead.
> >> >> >> >
> >> >> >> > swap_desc would be needed for cluster readahead in my original
> >> >> >> > proposal as the swap cache lives in swap_descs. Based on the current
> >> >> >> > implementation, we would need a reverse mapping (swap entry ->
> >> >> >> > swap_desc) in 3 situations:
> >> >> >> >
> >> >> >> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
> >> >> >> > failing, we fallback to trying to find swap entries that only have a
> >> >> >> > page in the swap cache (no references in page tables or page cache)
> >> >> >> > and free them. This would require a reverse mapping.
> >> >> >> >
> >> >> >> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
> >> >> >> > to get all swap_descs associated with that swapfile.
> >> >> >> >
> >> >> >> > 3) swap cluster readahead.
> >> >> >> >
> >> >> >> > For (1), I think we can drop the dependency of a reverse mapping if we
> >> >> >> > free swap entries once we swap a page in and add it to the swap cache,
> >> >> >> > even if the swap count does not drop to 0.
> >> >> >>
> >> >> >> Now, we will not drop the swap cache even if the swap count becomes 0 if
> >> >> >> swap space utility < 50%.  Per my understanding, this avoid swap page
> >> >> >> writing for read accesses.  So I don't think we can change this directly
> >> >> >> without necessary discussion firstly.
> >> >> >
> >> >> >
> >> >> > Right. I am not sure I understand why we do this today, is it to save
> >> >> > the overhead of allocating a new swap entry if the page is swapped out
> >> >> > again soon? I am not sure I understand this statement "this avoid swap
> >> >> > page
> >> >> > writing for read accesses".
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> > For (2), instead of scanning page tables and shmem page cache to find
> >> >> >> > swapped out pages for the swapfile, we can scan all swap_descs
> >> >> >> > instead, we should be more efficient. This is one of the proposal's
> >> >> >> > potential advantages.
> >> >> >>
> >> >> >> Good.
> >> >> >>
> >> >> >> > (3) is the one that would still need a reverse mapping with the
> >> >> >> > current proposal. Today we use swap cluster readahead for anon pages
> >> >> >> > if we have a spinning disk or vma readahead is disabled. For shmem, we
> >> >> >> > always use cluster readahead. If we can limit cluster readahead to
> >> >> >> > only rotating disks, then the reverse mapping can only be maintained
> >> >> >> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
> >> >> >> > reverse mapping for all swapfiles.
> >> >> >>
> >> >> >> For shmem, I think that it should be good to readahead based on shmem
> >> >> >> file offset instead of swap device offset.
> >> >> >>
> >> >> >> It's possible that some pages in the readahead window are from HDD while
> >> >> >> some other pages aren't.  So it's a little hard to enable cluster read
> >> >> >> for HDD only.  Anyway, it's not common to use HDD for swap now.
> >> >> >>
> >> >> >> >>
> >> >> >> >> > If the point is to store the swap_desc directly inside the xarray to
> >> >> >> >> > save 8 bytes, I am concerned that having multiple xarrays for
> >> >> >> >> > swapcache, swap_count, etc will use more than that.
> >> >> >> >>
> >> >> >> >> The idea is to save the memory used by reverse mapping xarray.
> >> >> >> >
> >> >> >> > I see.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> >> >> >> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
> >> >> >> >> >> >> > continuation pages. If we do, it may end up being more. We also
> >> >> >> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
> >> >> >> >> >> >> > element in a page requires continuation, we will allocate an entire
> >> >> >> >> >> >> > page. What I am trying to say is that to get an actual comparison you
> >> >> >> >> >> >> > need to also factor in the swap utilization and the rate of usage of
> >> >> >> >> >> >> > swap continuation. I don't know how to come up with a formula for this
> >> >> >> >> >> >> > tbh.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
> >> >> >> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> >> >> >> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> >> >> >> >> >> >> > overhead for people not using zswap, but it is not very awful.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> It seems what you really need is one bit of information to indicate
> >> >> >> >> >> >> >> this page is backed by zswap. Then you can have a seperate pointer
> >> >> >> >> >> >> >> for the zswap entry.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
> >> >> >> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it
> >> >> >> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
> >> >> >> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
> >> >> >> >> >> >> > the shmem page cache, similar to swapoff.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
> >> >> >> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
> >> >> >> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
> >> >> >> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
> >> >> >> >> >> >> > storing it in another lookup structure.
> >> >> >> >> >> >>
> >> >> >> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> >> >> >> >> >> >> swap_entry in swap_desc.  The added indirection appears to be another
> >> >> >> >> >> >> level of page table with 1 entry.  Then, we may use the similar method
> >> >> >> >> >> >> as supporting system with 2 level and 3 level page tables, like the code
> >> >> >> >> >> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
> >> >> >> >> >> >> this deeply.
> >> >> >> >> >> >
> >> >> >> >> >> > Can you expand further on this idea? I am not sure I fully understand.
> >> >> >> >> >>
> >> >> >> >> >> OK.  The goal is to avoid the overhead if indirection isn't enabled via
> >> >> >> >> >> kconfig.
> >> >> >> >> >>
> >> >> >> >> >> If indirection isn't enabled, store swap_entry in PTE directly.
> >> >> >> >> >> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
> >> >> >> >> >> to get/set swap_entry in PTE) are implemented based on kconfig.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > I thought about this, the problem is that we will have multiple
> >> >> >> >> > implementations of multiple things. For example, swap_count without
> >> >> >> >> > the indirection layer lives in the swap_map (with continuation logic).
> >> >> >> >> > With the indirection layer, it lives in the swap_desc (or somewhere
> >> >> >> >> > else). Same for the swapcache. Even if we keep the swapcache in an
> >> >> >> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
> >> >> >> >> > the indirection is disabled, and by swap_desc (or similar) if the
> >> >> >> >> > indirection is enabled. I think maintaining separate implementations
> >> >> >> >> > for when the indirection is enabled/disabled would be adding too much
> >> >> >> >> > complexity.
> >> >> >> >> >
> >> >> >> >> > WDYT?
> >> >> >> >>
> >> >> >> >> If we go this way, swap cache and swap_count will always be indexed by
> >> >> >> >> swap_entry.  swap_desc just provides a indirection to make it possible
> >> >> >> >> to move between swap devices.
> >> >> >> >>
> >> >> >> >> Why must we index swap cache and swap_count by swap_desc if indirection
> >> >> >> >> is enabled?  Yes, we can save one xarray indexing if we do so, but I
> >> >> >> >> don't think the overhead of one xarray indexing is a showstopper.
> >> >> >> >>
> >> >> >> >> I think this can be one intermediate step towards your final target.
> >> >> >> >> The changes to current implementation can be smaller.
> >> >> >> >
> >> >> >> > IIUC, the idea is to have two xarrays:
> >> >> >> > (a) xarray that stores a pointer to a struct containing swap_count and
> >> >> >> > swap cache.
> >> >> >> > (b) xarray that stores the underlying swap entry or zswap entry.
> >> >> >> >
> >> >> >> > When indirection is disabled:
> >> >> >> > page tables & page cache have swap entry directly like today, xarray
> >> >> >> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
> >> >> >> > mapping needed.
> >> >> >> >
> >> >> >> > In this case we have an extra overhead of 12-16 bytes (the struct
> >> >> >> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
> >> >> >> >
> >> >> >> > When indirection is enabled:
> >> >> >> > page tables & page cache have a swap id (or swap_desc index), xarray
> >> >> >> > (a) is indexed by swap id,
> >> >> >>
> >> >> >> xarray (a) is indexed by swap entry.
> >> >> >
> >> >> >
> >> >> > How so? With the indirection enabled, the page tables & page cache
> >> >> > have the swap id (or swap_desc index), which can point to a swap entry
> >> >> > or a zswap entry -- which can change when the page is moved between
> >> >> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
> >> >> > case? Shouldn't be indexed by the abstract swap id so that the
> >> >> > writeback from zswap is transparent?
> >> >>
> >> >> In my mind,
> >> >>
> >> >> - swap core will define a abstract interface to swap implementations
> >> >>   (zswap, swap device/file, maybe more in the future), like VFS.
> >> >>
> >> >> - zswap will be a special swap implementation (compressing instead of
> >> >>   writing to disk).
> >> >>
> >> >> - swap core will manage the indirection layer and swap cache.
> >> >>
> >> >> - swap core can move swap pages between swap implementations (e.g., from
> >> >>   zswap to a swap device, or from one swap device to another swap
> >> >>   device) with the help of the indirection layer.
> >> >>
> >> >> In this design, the writeback from zswap becomes moving swapped pages
> >> >> from zswap to a swap device.
> >> >
> >> >
> >> > All the above matches my understanding of this proposal. swap_desc is
> >> > the proposed indirection layer, and the swap implementations are zswap
> >> > & swap devices. For now, we only have 2 static swap implementations
> >> > (zswap->swapfile). In the future, we can make this more dynamic as the
> >> > need arises.
> >>
> >> Great to align with you on this.
> >
> > Great!
> >
> >>
> >> >>
> >> >> If my understanding were correct, your suggestion is kind of moving
> >> >> zswap logic to the swap core?  And zswap will be always at a higher
> >> >> layer on top of swap device/file?
> >> >
> >> >
> >> > We do not want to move the zswap logic into the swap core, we want to
> >> > make the swap core independent of the swap implementation, and zswap
> >> > is just one possible implementation.
> >>
> >> Good!
> >>
> >> I found that you put zswap related data structure inside struct
> >> swap_desc directly.  I think that we should avoid that as much as
> >> possible.
> >
> > That was just an initial draft to show what it would be like, I do not
> > intend to hardcode zswap-specific items in swap_desc.
> >
> >>
> >> >>
> >> >> >>
> >> >> >>
> >> >> >> > xarray (b) is indexed by swap id as well
> >> >> >> > and contain swap entry or zswap entry. Reverse mapping might be
> >> >> >> > needed.
> >> >> >>
> >> >> >> Reverse mapping isn't needed.
> >> >> >
> >> >> >
> >> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
> >> >> > sure I understand how it can be indexed by the swap entry if the
> >> >> > indirection is enabled.
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> >> >> >> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
> >> >> >> > where needed.
> >> >> >> >
> >> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
> >> >> >> >
> >> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
> >> >> >> > good if the reverse mapping can be avoided in enough situations, and
> >> >> >> > that we should consider such alternatives otherwise. As I mentioned
> >> >> >> > above, I think it comes down to whether we can completely restrict
> >> >> >> > cluster readahead to rotating disks or not -- in which case we need to
> >> >> >> > decide what to do for shmem and for anon when vma readahead is
> >> >> >> > disabled.
> >> >> >>
> >> >> >> We can even have a minimal indirection implementation.  Where, swap
> >> >> >> cache and swap_map[] are kept as they ware before, just one xarray is
> >> >> >> added.  The xarray is indexed by swap id (or swap_desc index) to store
> >> >> >> the corresponding swap entry.
> >> >> >>
> >> >> >> When indirection is disabled, no extra overhead.
> >> >> >>
> >> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
> >> >> >> swapped page.
> >> >> >>
> >> >> >> The basic migration support can be build on top of this.
> >> >> >>
> >> >> >> I think that this could be a baseline for indirection support.  Then
> >> >> >> further optimization can be built on top of it step by step with
> >> >> >> supporting data.
> >> >> >
> >> >> >
> >> >> > I am not sure how this works with zswap. Currently swap_map[]
> >> >> > implementation is specific for swapfiles, it does not work for zswap
> >> >> > unless we implement separate swap counting logic for zswap &
> >> >> > swapfiles. Same for the swapcache, it currently supports being indexed
> >> >> > by a swap entry, it would need to support being indexed by a swap id,
> >> >> > or have a separate swap cache for zswap. Having separate
> >> >> > implementation would add complexity, and we would need to perform
> >> >> > handoffs of the swap count/cache when a page is moved from zswap to a
> >> >> > swapfile.
> >> >>
> >> >> We can allocate a swap entry for each swapped page in zswap.
> >> >
> >> >
> >> > This is exactly what the current implementation does and what we want
> >> > to move away from. The current implementation uses zswap as an
> >> > in-memory compressed cache on top of an actual swap device, and each
> >> > swapped page in zswap has a swap entry allocated. With this
> >> > implementation, zswap cannot be used without a swap device.
> >>
> >> I totally agree that we should avoid to use an actual swap device under
> >> zswap.  And, as an swap implementation, zswap can manage the swap entry
> >> inside zswap without an underlying actual swap device.  For example,
> >> when we swap a page to zswap (actually compress), we can allocate a
> >> (virtual) swap entry in the zswap.  I understand that there's overhead
> >> to manage the swap entry in zswap.  We can consider how to reduce the
> >> overhead.
> >
> > I see. So we can (for example) use one of the swap types for zswap,
> > and then have zswap code handle this entry according to its
> > implementation. We can then have an xarray that maps swap ID -> swap
> > entry, and this swap entry is used to index the swap cache and such.
> > When a swapped page is moved between backends we update the swap ID ->
> > swap entry xarray.
> >
> > This is one possible implementation that I thought of (very briefly
> > tbh), but it does have its problems:
> > For zswap:
> > - Managing swap entries inside zswap unnecessarily.
> > - We need to maintain a swap entry -> zswap entry mapping in zswap --
> > similar to the current rbtree, which is something that we can get rid
> > of with the initial proposal if we embed the zswap_entry pointer
> > directly in the swap_desc (it can be encoded to avoid breaking the
> > abstraction).
> >
> > For mm/swap in general:
> > - When we allocate a swap entry today, we store it in folio->private
> > (or page->private), which is used by the unmapping code to be placed
> > in the page tables or shmem page cache. With this implementation, we
> > need to store the swap ID in page->private instead, which means that
> > every time we need to access the swap cache during reclaim/swapout we
> > need to lookup the swap entry first.
> > - On the fault path, we need two lookups instead of one (swap ID ->
> > swap entry, swap entry -> swap cache), not sure how this affects fault
> > latency.
> > - Each swap backend will have its own separate implementation of swap
> > counting, which is hard to maintain and very error-prone since the
> > logic is backend-agnostic.
> > - Handing over a page from one swap backend to another includes
> > handing over swap cache entries and swap counts, which I imagine will
> > involve considerable synchronization.
> >
> > Do you have any thoughts on this?
>
> Yes.  I understand there's additional overhead.  I have no clear idea
> about how to reduce this now.  We need to think about that in depth.
>
> The bottom line is whether this is worse than the current zswap
> implementation?

It's not just zswap, as I note above, this design would introduce some
overheads to the core swapping code as well as long as the indirection
layer is active. I am particularly worried about the extra lookups on
the fault path.

For zswap, we already have a lookup today, so maintaining swap entry
-> zswap entry mapping would not be a regression, but I am not sure
about the extra overhead to manage swap entries within zswap. Keep in
mind that using swap entries for zswap probably implies having a
fixed/max size for zswap (to be able to manage the swap entries
efficiently similar to swap devices), which is a limitation that the
initial proposal was hoping to overcome.

>
> Best Regards,
> Huang, Ying
>
> >>
> >> >>
> >> >>
> >> >> >>
> >> >> >>
> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
> >> >> >> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > My current intention is to reimplement the swapcache completely as a
> >> >> >> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
> >> >> >> >> >> >> > of the locking we do today if I get things right.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > Another potential concern is readahead. With this design, we have no
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> >> >> >> >> >> >> use some modernization.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> >> >> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
> >> >> >> >> >> >> > that for spinning disks, but I was wrong. We need for other things as
> >> >> >> >> >> >> > well today: swapoff, when trying to find an empty swap slot and we
> >> >> >> >> >> >> > start trying to free swap slots used only by the swapcache. However, I
> >> >> >> >> >> >> > think both of these cases can be fixed (I can share more details if
> >> >> >> >> >> >> > you want). If everything goes well we should only need to maintain the
> >> >> >> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
> >> >> >> >> >> >> > spinning disks for readahead.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Looking forward to your discussion.
> >> >> >> >> >>
> >> >> >> >> >> Per my understanding, the indirection is to make it easy to move
> >> >> >> >> >> (swapped) pages among swap devices based on hot/cold.  This is similar
> >> >> >> >> >> as the target of memory tiering.  It appears that we can extend the
> >> >> >> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> >> >> >> >> >> Is it possible for zswap to be faster than some slow memory media?
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > Agree with Chris that this may require a much larger overhaul. A slow
> >> >> >> >> > memory tier is still addressable memory, swap/zswap requires a page
> >> >> >> >> > fault to read the pages. I think (at least for now) there is a
> >> >> >> >> > fundamental difference. We want reclaim to eventually treat slow
> >> >> >> >> > memory & swap as just different tiers to place cold memory in with
> >> >> >> >> > different characteristics, but otherwise I think the swapping
> >> >> >> >> > implementation itself is very different.  Am I missing something?
> >> >> >> >>
> >> >> >> >> Is it possible that zswap is faster than a really slow memory
> >> >> >> >> addressable device backed by NAND?  TBH, I don't have the answer.
> >> >> >> >
> >> >> >> > I am not sure either.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> Anyway, do you need a way to describe the tiers of the swap devices?
> >> >> >> >> So, you can move the cold pages among the swap devices based on that?
> >> >> >> >
> >> >> >> > For now I think the "tiers" in this proposal are just zswap and normal
> >> >> >> > swapfiles. We can later extend it to support more explicit tiering.
> >> >> >>
> >> >> >> IIUC, in original zswap implementation, there's 1:1 relationship between
> >> >> >> zswap and normal swapfile.  But now, you make demoting among swap
> >> >> >> devices more general.  Then we need some general way to specify which
> >> >> >> swap devices are fast and which are slow, and the demoting relationship
> >> >> >> among them.  It can be memory tiers or something else, but we need one.
> >> >> >
> >> >> >
> >> >> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is
> >> >> > fast, swapfile is slow. In the future, we can support more dynamic
> >> >> > tiering if the need arises.
> >> >>
> >> >> We can start from a simple implementation.  And I think that it's better
> >> >> to consider the general design too.  Try not to make it impossible now.
> >> >
> >> >
> >> > Right. I am proposing we come up with an abstract generic interface
> >> > for swap implementations, and have 2 implementations statically
> >> > defined (swapfiles and zswap). If the need arises, we can make swap
> >> > implementations more dynamic in the future.
> >> >
> >> >>
> >> >>
> >> >> Best Regards,
> >> >> Huang, Ying
> >>
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23  3:27                               ` Yosry Ahmed
@ 2023-03-23  5:37                                 ` Huang, Ying
  2023-03-23 15:18                                   ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-23  5:37 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Wed, Mar 22, 2023 at 8:17 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >>
>> >> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >> >>
>> >> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >> >> >>

[snip]

>> >>
>> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> > xarray (b) is indexed by swap id as well
>> >> >> >> > and contain swap entry or zswap entry. Reverse mapping might be
>> >> >> >> > needed.
>> >> >> >>
>> >> >> >> Reverse mapping isn't needed.
>> >> >> >
>> >> >> >
>> >> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
>> >> >> > sure I understand how it can be indexed by the swap entry if the
>> >> >> > indirection is enabled.
>> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
>> >> >> >> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
>> >> >> >> > where needed.
>> >> >> >> >
>> >> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
>> >> >> >> >
>> >> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
>> >> >> >> > good if the reverse mapping can be avoided in enough situations, and
>> >> >> >> > that we should consider such alternatives otherwise. As I mentioned
>> >> >> >> > above, I think it comes down to whether we can completely restrict
>> >> >> >> > cluster readahead to rotating disks or not -- in which case we need to
>> >> >> >> > decide what to do for shmem and for anon when vma readahead is
>> >> >> >> > disabled.
>> >> >> >>
>> >> >> >> We can even have a minimal indirection implementation.  Where, swap
>> >> >> >> cache and swap_map[] are kept as they ware before, just one xarray is
>> >> >> >> added.  The xarray is indexed by swap id (or swap_desc index) to store
>> >> >> >> the corresponding swap entry.
>> >> >> >>
>> >> >> >> When indirection is disabled, no extra overhead.
>> >> >> >>
>> >> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
>> >> >> >> swapped page.
>> >> >> >>
>> >> >> >> The basic migration support can be build on top of this.
>> >> >> >>
>> >> >> >> I think that this could be a baseline for indirection support.  Then
>> >> >> >> further optimization can be built on top of it step by step with
>> >> >> >> supporting data.
>> >> >> >
>> >> >> >
>> >> >> > I am not sure how this works with zswap. Currently swap_map[]
>> >> >> > implementation is specific for swapfiles, it does not work for zswap
>> >> >> > unless we implement separate swap counting logic for zswap &
>> >> >> > swapfiles. Same for the swapcache, it currently supports being indexed
>> >> >> > by a swap entry, it would need to support being indexed by a swap id,
>> >> >> > or have a separate swap cache for zswap. Having separate
>> >> >> > implementation would add complexity, and we would need to perform
>> >> >> > handoffs of the swap count/cache when a page is moved from zswap to a
>> >> >> > swapfile.
>> >> >>
>> >> >> We can allocate a swap entry for each swapped page in zswap.
>> >> >
>> >> >
>> >> > This is exactly what the current implementation does and what we want
>> >> > to move away from. The current implementation uses zswap as an
>> >> > in-memory compressed cache on top of an actual swap device, and each
>> >> > swapped page in zswap has a swap entry allocated. With this
>> >> > implementation, zswap cannot be used without a swap device.
>> >>
>> >> I totally agree that we should avoid to use an actual swap device under
>> >> zswap.  And, as an swap implementation, zswap can manage the swap entry
>> >> inside zswap without an underlying actual swap device.  For example,
>> >> when we swap a page to zswap (actually compress), we can allocate a
>> >> (virtual) swap entry in the zswap.  I understand that there's overhead
>> >> to manage the swap entry in zswap.  We can consider how to reduce the
>> >> overhead.
>> >
>> > I see. So we can (for example) use one of the swap types for zswap,
>> > and then have zswap code handle this entry according to its
>> > implementation. We can then have an xarray that maps swap ID -> swap
>> > entry, and this swap entry is used to index the swap cache and such.
>> > When a swapped page is moved between backends we update the swap ID ->
>> > swap entry xarray.
>> >
>> > This is one possible implementation that I thought of (very briefly
>> > tbh), but it does have its problems:
>> > For zswap:
>> > - Managing swap entries inside zswap unnecessarily.
>> > - We need to maintain a swap entry -> zswap entry mapping in zswap --
>> > similar to the current rbtree, which is something that we can get rid
>> > of with the initial proposal if we embed the zswap_entry pointer
>> > directly in the swap_desc (it can be encoded to avoid breaking the
>> > abstraction).
>> >
>> > For mm/swap in general:
>> > - When we allocate a swap entry today, we store it in folio->private
>> > (or page->private), which is used by the unmapping code to be placed
>> > in the page tables or shmem page cache. With this implementation, we
>> > need to store the swap ID in page->private instead, which means that
>> > every time we need to access the swap cache during reclaim/swapout we
>> > need to lookup the swap entry first.
>> > - On the fault path, we need two lookups instead of one (swap ID ->
>> > swap entry, swap entry -> swap cache), not sure how this affects fault
>> > latency.
>> > - Each swap backend will have its own separate implementation of swap
>> > counting, which is hard to maintain and very error-prone since the
>> > logic is backend-agnostic.
>> > - Handing over a page from one swap backend to another includes
>> > handing over swap cache entries and swap counts, which I imagine will
>> > involve considerable synchronization.
>> >
>> > Do you have any thoughts on this?
>>
>> Yes.  I understand there's additional overhead.  I have no clear idea
>> about how to reduce this now.  We need to think about that in depth.
>>
>> The bottom line is whether this is worse than the current zswap
>> implementation?
>
> It's not just zswap, as I note above, this design would introduce some
> overheads to the core swapping code as well as long as the indirection
> layer is active. I am particularly worried about the extra lookups on
> the fault path.

Maybe you can measure the time for the radix tree lookup?  And compare
it with the total fault time?

> For zswap, we already have a lookup today, so maintaining swap entry
> -> zswap entry mapping would not be a regression, but I am not sure
> about the extra overhead to manage swap entries within zswap. Keep in
> mind that using swap entries for zswap probably implies having a
> fixed/max size for zswap (to be able to manage the swap entries
> efficiently similar to swap devices), which is a limitation that the
> initial proposal was hoping to overcome.

We have limited bits in PTE, so the max number of zswap entries will be
limited anyway.  And, we don't need to manage swap entries in the same
way as disks (which need to consider sequential writing etc.).

Best Regards,
Huang, Ying

[snip]


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23  0:56                         ` Huang, Ying
@ 2023-03-23  6:46                           ` Chris Li
  2023-03-23  6:56                             ` Huang, Ying
  0 siblings, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-23  6:46 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Thu, Mar 23, 2023 at 08:56:43AM +0800, Huang, Ying wrote:
> > We need to carefully design the swap cache that, when moving between
> > swap implementaions, there will be one shared swap cache. The current
> > swap cache belongs to swap devices, so two devices will have the same
> > page in two swap caches.
> 
> We can remove a page from the swap cache for the swap device A, then
> insert the page into the swap cache for the swap device B.  The swap
> entry will be changed too.
It is possible, however very tricky.

Let's assume the swap entry in device A has more than one user.
e.g. Swap entry A1 on device A is shared by three different
process. It is installed in three PTE locations. When A1 page data
write to device B, gain a new swap entry B1.

We will need to walk the three process page table to hunt down
PTE point to the swap entry A1, change that to B1. There will
be a short time window some processes have B1 and other processes
havee A1.  If both of them trigger page fault to swap in the
page. You will have the same page in both A and B's swap cache.
It needs to be back by the same physical page.

That seems to suggest that it needs to merge the swap cache look
up somehow.

> >> We can allocate a swap entry for each swapped page in zswap.
> >
> > One thing to consider when moving page from zswap to swap file, is the
> > zswap swap entry the same entry as the swap file entry.
> 
> I think that the swap entry will be changed after moving.  Swap entry is
> kind of local to a swap device.  While the swap desc ID isn't changed,
> that is why we need the indirection layer.

Due to the above mentioned  complication. I am also evaluating
a design have the zswap just share the same swap entry as the
underlying swap device.

The price to pay is one extra look up in zswap. That is similar
to the current frontswap.


Chris



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23  6:46                           ` Chris Li
@ 2023-03-23  6:56                             ` Huang, Ying
  2023-03-23 18:28                               ` Chris Li
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-23  6:56 UTC (permalink / raw)
  To: Chris Li
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Chris Li <chrisl@kernel.org> writes:

> On Thu, Mar 23, 2023 at 08:56:43AM +0800, Huang, Ying wrote:
>> > We need to carefully design the swap cache that, when moving between
>> > swap implementaions, there will be one shared swap cache. The current
>> > swap cache belongs to swap devices, so two devices will have the same
>> > page in two swap caches.
>> 
>> We can remove a page from the swap cache for the swap device A, then
>> insert the page into the swap cache for the swap device B.  The swap
>> entry will be changed too.
> It is possible, however very tricky.
>
> Let's assume the swap entry in device A has more than one user.
> e.g. Swap entry A1 on device A is shared by three different
> process. It is installed in three PTE locations.

With indirection, the swap ID (swap_desc index) will be installed in
PTEs instead of the swap entry itself.

Best Regards,
Huang, Ying

> When A1 page data
> write to device B, gain a new swap entry B1.
>
> We will need to walk the three process page table to hunt down
> PTE point to the swap entry A1, change that to B1. There will
> be a short time window some processes have B1 and other processes
> havee A1.  If both of them trigger page fault to swap in the
> page. You will have the same page in both A and B's swap cache.
> It needs to be back by the same physical page.
>
> That seems to suggest that it needs to merge the swap cache look
> up somehow.
>
>> >> We can allocate a swap entry for each swapped page in zswap.
>> >
>> > One thing to consider when moving page from zswap to swap file, is the
>> > zswap swap entry the same entry as the swap file entry.
>> 
>> I think that the swap entry will be changed after moving.  Swap entry is
>> kind of local to a swap device.  While the swap desc ID isn't changed,
>> that is why we need the indirection layer.
>
> Due to the above mentioned  complication. I am also evaluating
> a design have the zswap just share the same swap entry as the
> underlying swap device.
>
> The price to pay is one extra look up in zswap. That is similar
> to the current frontswap.
>
>
> Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23  5:37                                 ` Huang, Ying
@ 2023-03-23 15:18                                   ` Yosry Ahmed
  2023-03-24  2:37                                     ` Huang, Ying
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-23 15:18 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Wed, Mar 22, 2023 at 10:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Wed, Mar 22, 2023 at 8:17 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >>
> >> >> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >> >>
> >> >> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >> >> >>
> >> >> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >> >> >> >>
>
> [snip]
>
> >> >>
> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> > xarray (b) is indexed by swap id as well
> >> >> >> >> > and contain swap entry or zswap entry. Reverse mapping might be
> >> >> >> >> > needed.
> >> >> >> >>
> >> >> >> >> Reverse mapping isn't needed.
> >> >> >> >
> >> >> >> >
> >> >> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
> >> >> >> > sure I understand how it can be indexed by the swap entry if the
> >> >> >> > indirection is enabled.
> >> >> >> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> >> >> >> >> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
> >> >> >> >> > where needed.
> >> >> >> >> >
> >> >> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
> >> >> >> >> >
> >> >> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
> >> >> >> >> > good if the reverse mapping can be avoided in enough situations, and
> >> >> >> >> > that we should consider such alternatives otherwise. As I mentioned
> >> >> >> >> > above, I think it comes down to whether we can completely restrict
> >> >> >> >> > cluster readahead to rotating disks or not -- in which case we need to
> >> >> >> >> > decide what to do for shmem and for anon when vma readahead is
> >> >> >> >> > disabled.
> >> >> >> >>
> >> >> >> >> We can even have a minimal indirection implementation.  Where, swap
> >> >> >> >> cache and swap_map[] are kept as they ware before, just one xarray is
> >> >> >> >> added.  The xarray is indexed by swap id (or swap_desc index) to store
> >> >> >> >> the corresponding swap entry.
> >> >> >> >>
> >> >> >> >> When indirection is disabled, no extra overhead.
> >> >> >> >>
> >> >> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
> >> >> >> >> swapped page.
> >> >> >> >>
> >> >> >> >> The basic migration support can be build on top of this.
> >> >> >> >>
> >> >> >> >> I think that this could be a baseline for indirection support.  Then
> >> >> >> >> further optimization can be built on top of it step by step with
> >> >> >> >> supporting data.
> >> >> >> >
> >> >> >> >
> >> >> >> > I am not sure how this works with zswap. Currently swap_map[]
> >> >> >> > implementation is specific for swapfiles, it does not work for zswap
> >> >> >> > unless we implement separate swap counting logic for zswap &
> >> >> >> > swapfiles. Same for the swapcache, it currently supports being indexed
> >> >> >> > by a swap entry, it would need to support being indexed by a swap id,
> >> >> >> > or have a separate swap cache for zswap. Having separate
> >> >> >> > implementation would add complexity, and we would need to perform
> >> >> >> > handoffs of the swap count/cache when a page is moved from zswap to a
> >> >> >> > swapfile.
> >> >> >>
> >> >> >> We can allocate a swap entry for each swapped page in zswap.
> >> >> >
> >> >> >
> >> >> > This is exactly what the current implementation does and what we want
> >> >> > to move away from. The current implementation uses zswap as an
> >> >> > in-memory compressed cache on top of an actual swap device, and each
> >> >> > swapped page in zswap has a swap entry allocated. With this
> >> >> > implementation, zswap cannot be used without a swap device.
> >> >>
> >> >> I totally agree that we should avoid to use an actual swap device under
> >> >> zswap.  And, as an swap implementation, zswap can manage the swap entry
> >> >> inside zswap without an underlying actual swap device.  For example,
> >> >> when we swap a page to zswap (actually compress), we can allocate a
> >> >> (virtual) swap entry in the zswap.  I understand that there's overhead
> >> >> to manage the swap entry in zswap.  We can consider how to reduce the
> >> >> overhead.
> >> >
> >> > I see. So we can (for example) use one of the swap types for zswap,
> >> > and then have zswap code handle this entry according to its
> >> > implementation. We can then have an xarray that maps swap ID -> swap
> >> > entry, and this swap entry is used to index the swap cache and such.
> >> > When a swapped page is moved between backends we update the swap ID ->
> >> > swap entry xarray.
> >> >
> >> > This is one possible implementation that I thought of (very briefly
> >> > tbh), but it does have its problems:
> >> > For zswap:
> >> > - Managing swap entries inside zswap unnecessarily.
> >> > - We need to maintain a swap entry -> zswap entry mapping in zswap --
> >> > similar to the current rbtree, which is something that we can get rid
> >> > of with the initial proposal if we embed the zswap_entry pointer
> >> > directly in the swap_desc (it can be encoded to avoid breaking the
> >> > abstraction).
> >> >
> >> > For mm/swap in general:
> >> > - When we allocate a swap entry today, we store it in folio->private
> >> > (or page->private), which is used by the unmapping code to be placed
> >> > in the page tables or shmem page cache. With this implementation, we
> >> > need to store the swap ID in page->private instead, which means that
> >> > every time we need to access the swap cache during reclaim/swapout we
> >> > need to lookup the swap entry first.
> >> > - On the fault path, we need two lookups instead of one (swap ID ->
> >> > swap entry, swap entry -> swap cache), not sure how this affects fault
> >> > latency.
> >> > - Each swap backend will have its own separate implementation of swap
> >> > counting, which is hard to maintain and very error-prone since the
> >> > logic is backend-agnostic.
> >> > - Handing over a page from one swap backend to another includes
> >> > handing over swap cache entries and swap counts, which I imagine will
> >> > involve considerable synchronization.
> >> >
> >> > Do you have any thoughts on this?
> >>
> >> Yes.  I understand there's additional overhead.  I have no clear idea
> >> about how to reduce this now.  We need to think about that in depth.

I agree that we need to think deeper about the tradeoff here. It seems
like the extra xarray lookup may not be a huge problem, but there are
other concerns such as having separate implementations of swap
counting that are basically doing the same thing in different ways for
different backends.

> >>
> >> The bottom line is whether this is worse than the current zswap
> >> implementation?
> >
> > It's not just zswap, as I note above, this design would introduce some
> > overheads to the core swapping code as well as long as the indirection
> > layer is active. I am particularly worried about the extra lookups on
> > the fault path.
>
> Maybe you can measure the time for the radix tree lookup?  And compare
> it with the total fault time?

I ran a simple test with perf swapping in a 1G shmem file:

                       |--1.91%--swap_cache_get_folio
                       |          |
                       |           --1.32%--__filemap_get_folio
                       |                     |
                       |                      --0.66%--xas_load

Seems like the swap cache lookup is ~2%, and < 1% is coming from the
xarray lookup. I am not sure if the lookup time varies a lot with
fragmentation and different access patterns, but it seems like it's
generally not a major contributor to the latency.

>
> > For zswap, we already have a lookup today, so maintaining swap entry
> > -> zswap entry mapping would not be a regression, but I am not sure
> > about the extra overhead to manage swap entries within zswap. Keep in
> > mind that using swap entries for zswap probably implies having a
> > fixed/max size for zswap (to be able to manage the swap entries
> > efficiently similar to swap devices), which is a limitation that the
> > initial proposal was hoping to overcome.
>
> We have limited bits in PTE, so the max number of zswap entries will be
> limited anyway.  And, we don't need to manage swap entries in the same
> way as disks (which need to consider sequential writing etc.).

Right, the number of bits allowed would impose a maximum on the swap
ID, which would imply a maximum on the number of zswap entries. The
concern is about managing swap entries within zswap. If zswap needs to
keep track of the entries it allocated and the entries that are free,
it needs a data structure to do so (e.g. a bitmap). The size of this
data structure can potentially scale with the maximum number of
entries, so we would want to impose a virtual limit on zswap entries
to limit the size of the data structure. Alternatively, we can have a
dynamic data structure, but this also comes with its complexities.

>
> Best Regards,
> Huang, Ying
>
> [snip]
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23  6:56                             ` Huang, Ying
@ 2023-03-23 18:28                               ` Chris Li
  2023-03-23 18:40                                 ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-23 18:28 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Thu, Mar 23, 2023 at 02:56:24PM +0800, Huang, Ying wrote:
> >
> > Let's assume the swap entry in device A has more than one user.
> > e.g. Swap entry A1 on device A is shared by three different
> > process. It is installed in three PTE locations.
> 
> With indirection, the swap ID (swap_desc index) will be installed in
> PTEs instead of the swap entry itself.

Thanks for the clarification. If we have swap_desc index in the PTE,
let's call it S1. I assume S1 will contain A1 and B1 as part of the
swap_desc struct.

Now we are getting to some very interesting details.

What is the life cycle of the S1? Does S1 share the same index as A1?

Chris




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23 18:28                               ` Chris Li
@ 2023-03-23 18:40                                 ` Yosry Ahmed
  2023-03-23 19:49                                   ` Chris Li
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-23 18:40 UTC (permalink / raw)
  To: Chris Li
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Thu, Mar 23, 2023 at 11:28 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Mar 23, 2023 at 02:56:24PM +0800, Huang, Ying wrote:
> > >
> > > Let's assume the swap entry in device A has more than one user.
> > > e.g. Swap entry A1 on device A is shared by three different
> > > process. It is installed in three PTE locations.
> >
> > With indirection, the swap ID (swap_desc index) will be installed in
> > PTEs instead of the swap entry itself.
>
> Thanks for the clarification. If we have swap_desc index in the PTE,
> let's call it S1. I assume S1 will contain A1 and B1 as part of the
> swap_desc struct.
>
> Now we are getting to some very interesting details.
>
> What is the life cycle of the S1? Does S1 share the same index as A1?

The idea that we are currently discussing does not involve a swap_desc
struct. There is only a swap ID that indexes into an xarray that
points to a swap_entry. This swap ID and the xarray formulate the
indirection layer.

I am guessing in this design the swap ID is allocated when unmapping a
page to be swapped out, and freed when the underlying swap_entry's
swap count falls to 0.

Moving a page from a swap backend A to another swap backend B should
not be a problem in terms of the swap cache, as we will add it to the
swap cache of B, modify the swap ID mapping to point to B, then remove
it from the swap cache of A.

There are some concerns with this design that I outlined in one of my
previous emails, such as having separate swap counting implementation
in different swap backends, which is a maintenance burden and
error-prone.

>
> Chris
>
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23 18:40                                 ` Yosry Ahmed
@ 2023-03-23 19:49                                   ` Chris Li
  2023-03-23 19:54                                     ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-23 19:49 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Thu, Mar 23, 2023 at 11:40:35AM -0700, Yosry Ahmed wrote:
> > Thanks for the clarification. If we have swap_desc index in the PTE,
> > let's call it S1. I assume S1 will contain A1 and B1 as part of the
> > swap_desc struct.
> >
> > Now we are getting to some very interesting details.
> >
> > What is the life cycle of the S1? Does S1 share the same index as A1?
> 
> The idea that we are currently discussing does not involve a swap_desc
> struct. There is only a swap ID that indexes into an xarray that
> points to a swap_entry. This swap ID and the xarray formulate the
> indirection layer.

I see. The swap_entry is the same size as the swp_entry_t, which is
unsigned long.

> I am guessing in this design the swap ID is allocated when unmapping a
> page to be swapped out, and freed when the underlying swap_entry's
> swap count falls to 0.

If none of the swap pages on A need to move to B (yet). Does it still
need to allocate an entry in the swap ID xarray, only pointing to A1?

At the time of swap out page into A, we will not know if it will
move to B in a later time. I guess the swap ID xarray look up always
needs to be there?

> Moving a page from a swap backend A to another swap backend B should
> not be a problem in terms of the swap cache, as we will add it to the
> swap cache of B, modify the swap ID mapping to point to B, then remove
> it from the swap cache of A.

That means when B swap in a page, it will always look up the swap ID
xarray first, then resolve to the actual swap_entry B1.

> There are some concerns with this design that I outlined in one of my
> previous emails, such as having separate swap counting implementation
> in different swap backends, which is a maintenance burden and
> error-prone.

I agree that allocating the swap ID and maintaining the free swap ID
would be some extra complexity if we are not reusing the existing swap
count code path.

My other concern would be the swap ID xarray indirection is always there
regardless if you need to use the indirection or not.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23 19:49                                   ` Chris Li
@ 2023-03-23 19:54                                     ` Yosry Ahmed
  2023-03-23 21:10                                       ` Chris Li
  2023-03-24 17:28                                       ` Chris Li
  0 siblings, 2 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-23 19:54 UTC (permalink / raw)
  To: Chris Li
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Thu, Mar 23, 2023 at 12:50 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Mar 23, 2023 at 11:40:35AM -0700, Yosry Ahmed wrote:
> > > Thanks for the clarification. If we have swap_desc index in the PTE,
> > > let's call it S1. I assume S1 will contain A1 and B1 as part of the
> > > swap_desc struct.
> > >
> > > Now we are getting to some very interesting details.
> > >
> > > What is the life cycle of the S1? Does S1 share the same index as A1?
> >
> > The idea that we are currently discussing does not involve a swap_desc
> > struct. There is only a swap ID that indexes into an xarray that
> > points to a swap_entry. This swap ID and the xarray formulate the
> > indirection layer.
>
> I see. The swap_entry is the same size as the swp_entry_t, which is
> unsigned long.
>
> > I am guessing in this design the swap ID is allocated when unmapping a
> > page to be swapped out, and freed when the underlying swap_entry's
> > swap count falls to 0.
>
> If none of the swap pages on A need to move to B (yet). Does it still
> need to allocate an entry in the swap ID xarray, only pointing to A1?
>
> At the time of swap out page into A, we will not know if it will
> move to B in a later time. I guess the swap ID xarray look up always
> needs to be there?

If the indirection is enabled, yes.

>
> > Moving a page from a swap backend A to another swap backend B should
> > not be a problem in terms of the swap cache, as we will add it to the
> > swap cache of B, modify the swap ID mapping to point to B, then remove
> > it from the swap cache of A.
>
> That means when B swap in a page, it will always look up the swap ID
> xarray first, then resolve to the actual swap_entry B1.

Yes. There is an extra lookup.

>
> > There are some concerns with this design that I outlined in one of my
> > previous emails, such as having separate swap counting implementation
> > in different swap backends, which is a maintenance burden and
> > error-prone.
>
> I agree that allocating the swap ID and maintaining the free swap ID
> would be some extra complexity if we are not reusing the existing swap
> count code path.
>
> My other concern would be the swap ID xarray indirection is always there
> regardless if you need to use the indirection or not.

I think the idea is that this design is more minimal than the proposed
swap_desc, so we can have it behind a config option and remove the
indirection layer if it is not configured.
However, I am not yet sure if this would be straightforward. I need to
give this more thought.

>
> Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23 19:54                                     ` Yosry Ahmed
@ 2023-03-23 21:10                                       ` Chris Li
  2023-03-24 17:28                                       ` Chris Li
  1 sibling, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-23 21:10 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Thu, Mar 23, 2023 at 12:54:42PM -0700, Yosry Ahmed wrote:
> >
> > At the time of swap out page into A, we will not know if it will
> > move to B in a later time. I guess the swap ID xarray look up always
> > needs to be there?
> 
> If the indirection is enabled, yes.

Ack.

> 
> >
> > That means when B swap in a page, it will always look up the swap ID
> > xarray first, then resolve to the actual swap_entry B1.
> 
> Yes. There is an extra lookup.

Ack.

> > > There are some concerns with this design that I outlined in one of my
> > > previous emails, such as having separate swap counting implementation
> > > in different swap backends, which is a maintenance burden and
> > > error-prone.
> >
> > I agree that allocating the swap ID and maintaining the free swap ID
> > would be some extra complexity if we are not reusing the existing swap
> > count code path.
> >
> > My other concern would be the swap ID xarray indirection is always there
> > regardless if you need to use the indirection or not.
> 
> I think the idea is that this design is more minimal than the proposed
> swap_desc, so we can have it behind a config option and remove the
> indirection layer if it is not configured.
> However, I am not yet sure if this would be straightforward. I need to
> give this more thought.

Thanks for the clarification. I will give it some thought as well.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23 15:18                                   ` Yosry Ahmed
@ 2023-03-24  2:37                                     ` Huang, Ying
  2023-03-24  7:28                                       ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-24  2:37 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Wed, Mar 22, 2023 at 10:39 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Wed, Mar 22, 2023 at 8:17 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> > On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >>
>> >> >> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >> >>
>> >> >> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >> >> >>
>> >> >> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >> >> >> >> >>
>>
>> [snip]
>>
>> >> >>
>> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > xarray (b) is indexed by swap id as well
>> >> >> >> >> > and contain swap entry or zswap entry. Reverse mapping might be
>> >> >> >> >> > needed.
>> >> >> >> >>
>> >> >> >> >> Reverse mapping isn't needed.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
>> >> >> >> > sure I understand how it can be indexed by the swap entry if the
>> >> >> >> > indirection is enabled.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
>> >> >> >> >> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
>> >> >> >> >> > where needed.
>> >> >> >> >> >
>> >> >> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
>> >> >> >> >> >
>> >> >> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
>> >> >> >> >> > good if the reverse mapping can be avoided in enough situations, and
>> >> >> >> >> > that we should consider such alternatives otherwise. As I mentioned
>> >> >> >> >> > above, I think it comes down to whether we can completely restrict
>> >> >> >> >> > cluster readahead to rotating disks or not -- in which case we need to
>> >> >> >> >> > decide what to do for shmem and for anon when vma readahead is
>> >> >> >> >> > disabled.
>> >> >> >> >>
>> >> >> >> >> We can even have a minimal indirection implementation.  Where, swap
>> >> >> >> >> cache and swap_map[] are kept as they ware before, just one xarray is
>> >> >> >> >> added.  The xarray is indexed by swap id (or swap_desc index) to store
>> >> >> >> >> the corresponding swap entry.
>> >> >> >> >>
>> >> >> >> >> When indirection is disabled, no extra overhead.
>> >> >> >> >>
>> >> >> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
>> >> >> >> >> swapped page.
>> >> >> >> >>
>> >> >> >> >> The basic migration support can be build on top of this.
>> >> >> >> >>
>> >> >> >> >> I think that this could be a baseline for indirection support.  Then
>> >> >> >> >> further optimization can be built on top of it step by step with
>> >> >> >> >> supporting data.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > I am not sure how this works with zswap. Currently swap_map[]
>> >> >> >> > implementation is specific for swapfiles, it does not work for zswap
>> >> >> >> > unless we implement separate swap counting logic for zswap &
>> >> >> >> > swapfiles. Same for the swapcache, it currently supports being indexed
>> >> >> >> > by a swap entry, it would need to support being indexed by a swap id,
>> >> >> >> > or have a separate swap cache for zswap. Having separate
>> >> >> >> > implementation would add complexity, and we would need to perform
>> >> >> >> > handoffs of the swap count/cache when a page is moved from zswap to a
>> >> >> >> > swapfile.
>> >> >> >>
>> >> >> >> We can allocate a swap entry for each swapped page in zswap.
>> >> >> >
>> >> >> >
>> >> >> > This is exactly what the current implementation does and what we want
>> >> >> > to move away from. The current implementation uses zswap as an
>> >> >> > in-memory compressed cache on top of an actual swap device, and each
>> >> >> > swapped page in zswap has a swap entry allocated. With this
>> >> >> > implementation, zswap cannot be used without a swap device.
>> >> >>
>> >> >> I totally agree that we should avoid to use an actual swap device under
>> >> >> zswap.  And, as an swap implementation, zswap can manage the swap entry
>> >> >> inside zswap without an underlying actual swap device.  For example,
>> >> >> when we swap a page to zswap (actually compress), we can allocate a
>> >> >> (virtual) swap entry in the zswap.  I understand that there's overhead
>> >> >> to manage the swap entry in zswap.  We can consider how to reduce the
>> >> >> overhead.
>> >> >
>> >> > I see. So we can (for example) use one of the swap types for zswap,
>> >> > and then have zswap code handle this entry according to its
>> >> > implementation. We can then have an xarray that maps swap ID -> swap
>> >> > entry, and this swap entry is used to index the swap cache and such.
>> >> > When a swapped page is moved between backends we update the swap ID ->
>> >> > swap entry xarray.
>> >> >
>> >> > This is one possible implementation that I thought of (very briefly
>> >> > tbh), but it does have its problems:
>> >> > For zswap:
>> >> > - Managing swap entries inside zswap unnecessarily.
>> >> > - We need to maintain a swap entry -> zswap entry mapping in zswap --
>> >> > similar to the current rbtree, which is something that we can get rid
>> >> > of with the initial proposal if we embed the zswap_entry pointer
>> >> > directly in the swap_desc (it can be encoded to avoid breaking the
>> >> > abstraction).
>> >> >
>> >> > For mm/swap in general:
>> >> > - When we allocate a swap entry today, we store it in folio->private
>> >> > (or page->private), which is used by the unmapping code to be placed
>> >> > in the page tables or shmem page cache. With this implementation, we
>> >> > need to store the swap ID in page->private instead, which means that
>> >> > every time we need to access the swap cache during reclaim/swapout we
>> >> > need to lookup the swap entry first.
>> >> > - On the fault path, we need two lookups instead of one (swap ID ->
>> >> > swap entry, swap entry -> swap cache), not sure how this affects fault
>> >> > latency.
>> >> > - Each swap backend will have its own separate implementation of swap
>> >> > counting, which is hard to maintain and very error-prone since the
>> >> > logic is backend-agnostic.
>> >> > - Handing over a page from one swap backend to another includes
>> >> > handing over swap cache entries and swap counts, which I imagine will
>> >> > involve considerable synchronization.
>> >> >
>> >> > Do you have any thoughts on this?
>> >>
>> >> Yes.  I understand there's additional overhead.  I have no clear idea
>> >> about how to reduce this now.  We need to think about that in depth.
>
> I agree that we need to think deeper about the tradeoff here. It seems
> like the extra xarray lookup may not be a huge problem, but there are
> other concerns such as having separate implementations of swap
> counting that are basically doing the same thing in different ways for
> different backends.

In fact, I just suggest to use the minimal design on top of the current
implementation as the first step.  Then, you can improve it step by
step.

The first step could be the minimal effort to implement indirection
layer and moving swapped pages between swap implementations.  Based on
that, you can build other optimizations, such as pulling swap counting
to the swap core.  For each step, we can evaluate the gain and cost with
data.

Anyway, I don't think you can just implement all your final solution in
one step.  And, I think the minimal design suggested could be a starting
point.

>> >>
>> >> The bottom line is whether this is worse than the current zswap
>> >> implementation?
>> >
>> > It's not just zswap, as I note above, this design would introduce some
>> > overheads to the core swapping code as well as long as the indirection
>> > layer is active. I am particularly worried about the extra lookups on
>> > the fault path.
>>
>> Maybe you can measure the time for the radix tree lookup?  And compare
>> it with the total fault time?
>
> I ran a simple test with perf swapping in a 1G shmem file:
>
>                        |--1.91%--swap_cache_get_folio
>                        |          |
>                        |           --1.32%--__filemap_get_folio
>                        |                     |
>                        |                      --0.66%--xas_load
>
> Seems like the swap cache lookup is ~2%, and < 1% is coming from the
> xarray lookup. I am not sure if the lookup time varies a lot with
> fragmentation and different access patterns, but it seems like it's
> generally not a major contributor to the latency.

Thanks for data!

>>
>> > For zswap, we already have a lookup today, so maintaining swap entry
>> > -> zswap entry mapping would not be a regression, but I am not sure
>> > about the extra overhead to manage swap entries within zswap. Keep in
>> > mind that using swap entries for zswap probably implies having a
>> > fixed/max size for zswap (to be able to manage the swap entries
>> > efficiently similar to swap devices), which is a limitation that the
>> > initial proposal was hoping to overcome.
>>
>> We have limited bits in PTE, so the max number of zswap entries will be
>> limited anyway.  And, we don't need to manage swap entries in the same
>> way as disks (which need to consider sequential writing etc.).
>
> Right, the number of bits allowed would impose a maximum on the swap
> ID, which would imply a maximum on the number of zswap entries. The
> concern is about managing swap entries within zswap. If zswap needs to
> keep track of the entries it allocated and the entries that are free,
> it needs a data structure to do so (e.g. a bitmap). The size of this
> data structure can potentially scale with the maximum number of
> entries, so we would want to impose a virtual limit on zswap entries
> to limit the size of the data structure. Alternatively, we can have a
> dynamic data structure, but this also comes with its complexities.

Yes.  We will need that.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-24  2:37                                     ` Huang, Ying
@ 2023-03-24  7:28                                       ` Yosry Ahmed
  2023-03-24 17:23                                         ` Chris Li
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-24  7:28 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Thu, Mar 23, 2023 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Wed, Mar 22, 2023 at 10:39 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Wed, Mar 22, 2023 at 8:17 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >>
> >> >> > On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >> >>
> >> >> >> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >> >> >>
> >> >> >> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> >> >> >> >> >>
> >>
> >> [snip]
> >>
> >> >> >>
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> > xarray (b) is indexed by swap id as well
> >> >> >> >> >> > and contain swap entry or zswap entry. Reverse mapping might be
> >> >> >> >> >> > needed.
> >> >> >> >> >>
> >> >> >> >> >> Reverse mapping isn't needed.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
> >> >> >> >> > sure I understand how it can be indexed by the swap entry if the
> >> >> >> >> > indirection is enabled.
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> >> >> >> >> >> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
> >> >> >> >> >> > where needed.
> >> >> >> >> >> >
> >> >> >> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
> >> >> >> >> >> >
> >> >> >> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
> >> >> >> >> >> > good if the reverse mapping can be avoided in enough situations, and
> >> >> >> >> >> > that we should consider such alternatives otherwise. As I mentioned
> >> >> >> >> >> > above, I think it comes down to whether we can completely restrict
> >> >> >> >> >> > cluster readahead to rotating disks or not -- in which case we need to
> >> >> >> >> >> > decide what to do for shmem and for anon when vma readahead is
> >> >> >> >> >> > disabled.
> >> >> >> >> >>
> >> >> >> >> >> We can even have a minimal indirection implementation.  Where, swap
> >> >> >> >> >> cache and swap_map[] are kept as they ware before, just one xarray is
> >> >> >> >> >> added.  The xarray is indexed by swap id (or swap_desc index) to store
> >> >> >> >> >> the corresponding swap entry.
> >> >> >> >> >>
> >> >> >> >> >> When indirection is disabled, no extra overhead.
> >> >> >> >> >>
> >> >> >> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
> >> >> >> >> >> swapped page.
> >> >> >> >> >>
> >> >> >> >> >> The basic migration support can be build on top of this.
> >> >> >> >> >>
> >> >> >> >> >> I think that this could be a baseline for indirection support.  Then
> >> >> >> >> >> further optimization can be built on top of it step by step with
> >> >> >> >> >> supporting data.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > I am not sure how this works with zswap. Currently swap_map[]
> >> >> >> >> > implementation is specific for swapfiles, it does not work for zswap
> >> >> >> >> > unless we implement separate swap counting logic for zswap &
> >> >> >> >> > swapfiles. Same for the swapcache, it currently supports being indexed
> >> >> >> >> > by a swap entry, it would need to support being indexed by a swap id,
> >> >> >> >> > or have a separate swap cache for zswap. Having separate
> >> >> >> >> > implementation would add complexity, and we would need to perform
> >> >> >> >> > handoffs of the swap count/cache when a page is moved from zswap to a
> >> >> >> >> > swapfile.
> >> >> >> >>
> >> >> >> >> We can allocate a swap entry for each swapped page in zswap.
> >> >> >> >
> >> >> >> >
> >> >> >> > This is exactly what the current implementation does and what we want
> >> >> >> > to move away from. The current implementation uses zswap as an
> >> >> >> > in-memory compressed cache on top of an actual swap device, and each
> >> >> >> > swapped page in zswap has a swap entry allocated. With this
> >> >> >> > implementation, zswap cannot be used without a swap device.
> >> >> >>
> >> >> >> I totally agree that we should avoid to use an actual swap device under
> >> >> >> zswap.  And, as an swap implementation, zswap can manage the swap entry
> >> >> >> inside zswap without an underlying actual swap device.  For example,
> >> >> >> when we swap a page to zswap (actually compress), we can allocate a
> >> >> >> (virtual) swap entry in the zswap.  I understand that there's overhead
> >> >> >> to manage the swap entry in zswap.  We can consider how to reduce the
> >> >> >> overhead.
> >> >> >
> >> >> > I see. So we can (for example) use one of the swap types for zswap,
> >> >> > and then have zswap code handle this entry according to its
> >> >> > implementation. We can then have an xarray that maps swap ID -> swap
> >> >> > entry, and this swap entry is used to index the swap cache and such.
> >> >> > When a swapped page is moved between backends we update the swap ID ->
> >> >> > swap entry xarray.
> >> >> >
> >> >> > This is one possible implementation that I thought of (very briefly
> >> >> > tbh), but it does have its problems:
> >> >> > For zswap:
> >> >> > - Managing swap entries inside zswap unnecessarily.
> >> >> > - We need to maintain a swap entry -> zswap entry mapping in zswap --
> >> >> > similar to the current rbtree, which is something that we can get rid
> >> >> > of with the initial proposal if we embed the zswap_entry pointer
> >> >> > directly in the swap_desc (it can be encoded to avoid breaking the
> >> >> > abstraction).
> >> >> >
> >> >> > For mm/swap in general:
> >> >> > - When we allocate a swap entry today, we store it in folio->private
> >> >> > (or page->private), which is used by the unmapping code to be placed
> >> >> > in the page tables or shmem page cache. With this implementation, we
> >> >> > need to store the swap ID in page->private instead, which means that
> >> >> > every time we need to access the swap cache during reclaim/swapout we
> >> >> > need to lookup the swap entry first.
> >> >> > - On the fault path, we need two lookups instead of one (swap ID ->
> >> >> > swap entry, swap entry -> swap cache), not sure how this affects fault
> >> >> > latency.
> >> >> > - Each swap backend will have its own separate implementation of swap
> >> >> > counting, which is hard to maintain and very error-prone since the
> >> >> > logic is backend-agnostic.
> >> >> > - Handing over a page from one swap backend to another includes
> >> >> > handing over swap cache entries and swap counts, which I imagine will
> >> >> > involve considerable synchronization.
> >> >> >
> >> >> > Do you have any thoughts on this?
> >> >>
> >> >> Yes.  I understand there's additional overhead.  I have no clear idea
> >> >> about how to reduce this now.  We need to think about that in depth.
> >
> > I agree that we need to think deeper about the tradeoff here. It seems
> > like the extra xarray lookup may not be a huge problem, but there are
> > other concerns such as having separate implementations of swap
> > counting that are basically doing the same thing in different ways for
> > different backends.
>
> In fact, I just suggest to use the minimal design on top of the current
> implementation as the first step.  Then, you can improve it step by
> step.
>
> The first step could be the minimal effort to implement indirection
> layer and moving swapped pages between swap implementations.  Based on
> that, you can build other optimizations, such as pulling swap counting
> to the swap core.  For each step, we can evaluate the gain and cost with
> data.

Right, I understand that, but to implement the indirection layer on
top of the current implementation, then we will need to support using
zswap without a backing swap device. In order to do this without
pulling swap counting to the swap core, we need to implement swap
counting logic in zswap. We need to implement swap entry management in
zswap as well. Right?

>
> Anyway, I don't think you can just implement all your final solution in
> one step.  And, I think the minimal design suggested could be a starting
> point.

I agree that's a great point, I am just afraid that we will avoid
implementing that full final solution and instead do a lot of work
inside zswap to make up for the difference (e.g. swap entry
management, swap counting). Also, that work in zswap may end up being
unacceptable due to the maintenance burden and/or complexity.

>
> >> >>
> >> >> The bottom line is whether this is worse than the current zswap
> >> >> implementation?
> >> >
> >> > It's not just zswap, as I note above, this design would introduce some
> >> > overheads to the core swapping code as well as long as the indirection
> >> > layer is active. I am particularly worried about the extra lookups on
> >> > the fault path.
> >>
> >> Maybe you can measure the time for the radix tree lookup?  And compare
> >> it with the total fault time?
> >
> > I ran a simple test with perf swapping in a 1G shmem file:
> >
> >                        |--1.91%--swap_cache_get_folio
> >                        |          |
> >                        |           --1.32%--__filemap_get_folio
> >                        |                     |
> >                        |                      --0.66%--xas_load
> >
> > Seems like the swap cache lookup is ~2%, and < 1% is coming from the
> > xarray lookup. I am not sure if the lookup time varies a lot with
> > fragmentation and different access patterns, but it seems like it's
> > generally not a major contributor to the latency.
>
> Thanks for data!
>
> >>
> >> > For zswap, we already have a lookup today, so maintaining swap entry
> >> > -> zswap entry mapping would not be a regression, but I am not sure
> >> > about the extra overhead to manage swap entries within zswap. Keep in
> >> > mind that using swap entries for zswap probably implies having a
> >> > fixed/max size for zswap (to be able to manage the swap entries
> >> > efficiently similar to swap devices), which is a limitation that the
> >> > initial proposal was hoping to overcome.
> >>
> >> We have limited bits in PTE, so the max number of zswap entries will be
> >> limited anyway.  And, we don't need to manage swap entries in the same
> >> way as disks (which need to consider sequential writing etc.).
> >
> > Right, the number of bits allowed would impose a maximum on the swap
> > ID, which would imply a maximum on the number of zswap entries. The
> > concern is about managing swap entries within zswap. If zswap needs to
> > keep track of the entries it allocated and the entries that are free,
> > it needs a data structure to do so (e.g. a bitmap). The size of this
> > data structure can potentially scale with the maximum number of
> > entries, so we would want to impose a virtual limit on zswap entries
> > to limit the size of the data structure. Alternatively, we can have a
> > dynamic data structure, but this also comes with its complexities.
>
> Yes.  We will need that.
>
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-24  7:28                                       ` Yosry Ahmed
@ 2023-03-24 17:23                                         ` Chris Li
  2023-03-27  1:23                                           ` Huang, Ying
  0 siblings, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-24 17:23 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Fri, Mar 24, 2023 at 12:28:31AM -0700, Yosry Ahmed wrote:
> > In fact, I just suggest to use the minimal design on top of the current
> > implementation as the first step.  Then, you can improve it step by
> > step.
> >
> > The first step could be the minimal effort to implement indirection
> > layer and moving swapped pages between swap implementations.  Based on
> > that, you can build other optimizations, such as pulling swap counting
> > to the swap core.  For each step, we can evaluate the gain and cost with
> > data.
> 
> Right, I understand that, but to implement the indirection layer on
> top of the current implementation, then we will need to support using
> zswap without a backing swap device. In order to do this without

Agree with Ying on the minimal approach here as well.

There are two ways to approach this.

1) Forget zswap, make a minimal implementation to move the page between
two swapfile device. It can be swapfile back to two loop back files.

Any indirect layer you design will need to convert this usage case
any way.

2) Make zswap work without a swapfile.
You can implement the zswap on a fake ghosts swap file.

If you keep the zswap as frontswap, just make zswap can work without
a real swapfile.

Make that as your first minimal step. Then it does not need to touch
the swap count changes.

I view make that step is independent of moving pages between swap device.

That patch exists and I consider it has value to some users.

> > Anyway, I don't think you can just implement all your final solution in
> > one step.  And, I think the minimal design suggested could be a starting
> > point.
> 
> I agree that's a great point, I am just afraid that we will avoid
> implementing that full final solution and instead do a lot of work
> inside zswap to make up for the difference (e.g. swap entry
> management, swap counting). Also, that work in zswap may end up being
> unacceptable due to the maintenance burden and/or complexity.

If you do either 1) or 2), you can keep these two paths separate.

Even if you want to move the page between zswap and swapfile.

Idea 3)
You don't have to change the swap count code, you can do a
minimal change moves the page between zswap and another block
device. That way you can get two differenet swap entry with
existing code.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-23 19:54                                     ` Yosry Ahmed
  2023-03-23 21:10                                       ` Chris Li
@ 2023-03-24 17:28                                       ` Chris Li
  1 sibling, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-24 17:28 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Thu, Mar 23, 2023 at 12:54:42PM -0700, Yosry Ahmed wrote:
> >
> > At the time of swap out page into A, we will not know if it will
> > move to B in a later time. I guess the swap ID xarray look up always
> > needs to be there?
> 
> If the indirection is enabled, yes.
> 
> >
> > > Moving a page from a swap backend A to another swap backend B should
> > > not be a problem in terms of the swap cache, as we will add it to the
> > > swap cache of B, modify the swap ID mapping to point to B, then remove
> > > it from the swap cache of A.
> >
> > That means when B swap in a page, it will always look up the swap ID
> > xarray first, then resolve to the actual swap_entry B1.
> 
> Yes. There is an extra lookup.

Ack.

> 
> >
> > > There are some concerns with this design that I outlined in one of my
> > > previous emails, such as having separate swap counting implementation
> > > in different swap backends, which is a maintenance burden and
> > > error-prone.
> >
> > I agree that allocating the swap ID and maintaining the free swap ID
> > would be some extra complexity if we are not reusing the existing swap
> > count code path.
> >
> > My other concern would be the swap ID xarray indirection is always there
> > regardless if you need to use the indirection or not.
> 
> I think the idea is that this design is more minimal than the proposed
> swap_desc, so we can have it behind a config option and remove the
> indirection layer if it is not configured.
> However, I am not yet sure if this would be straightforward. I need to
> give this more thought.

I have given it some thought. I am still concerned about this always on
redirect layer in terms of memory usage and lookup overhead.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-24 17:23                                         ` Chris Li
@ 2023-03-27  1:23                                           ` Huang, Ying
  2023-03-28  5:54                                             ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-27  1:23 UTC (permalink / raw)
  To: Chris Li
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Chris Li <chrisl@kernel.org> writes:

> On Fri, Mar 24, 2023 at 12:28:31AM -0700, Yosry Ahmed wrote:
>> > In fact, I just suggest to use the minimal design on top of the current
>> > implementation as the first step.  Then, you can improve it step by
>> > step.
>> >
>> > The first step could be the minimal effort to implement indirection
>> > layer and moving swapped pages between swap implementations.  Based on
>> > that, you can build other optimizations, such as pulling swap counting
>> > to the swap core.  For each step, we can evaluate the gain and cost with
>> > data.
>> 
>> Right, I understand that, but to implement the indirection layer on
>> top of the current implementation, then we will need to support using
>> zswap without a backing swap device. In order to do this without
>
> Agree with Ying on the minimal approach here as well.
>
> There are two ways to approach this.
>
> 1) Forget zswap, make a minimal implementation to move the page between
> two swapfile device. It can be swapfile back to two loop back files.
>
> Any indirect layer you design will need to convert this usage case
> any way.
>
> 2) Make zswap work without a swapfile.
> You can implement the zswap on a fake ghosts swap file.
>
> If you keep the zswap as frontswap, just make zswap can work without
> a real swapfile.
>
> Make that as your first minimal step. Then it does not need to touch
> the swap count changes.
>
> I view make that step is independent of moving pages between swap device.
>
> That patch exists and I consider it has value to some users.

This sounds like an even smaller approach as the first step.  Further
improvement can be built on top of it.

Best Regards,
Huang, Ying

>> > Anyway, I don't think you can just implement all your final solution in
>> > one step.  And, I think the minimal design suggested could be a starting
>> > point.
>> 
>> I agree that's a great point, I am just afraid that we will avoid
>> implementing that full final solution and instead do a lot of work
>> inside zswap to make up for the difference (e.g. swap entry
>> management, swap counting). Also, that work in zswap may end up being
>> unacceptable due to the maintenance burden and/or complexity.
>
> If you do either 1) or 2), you can keep these two paths separate.
>
> Even if you want to move the page between zswap and swapfile.
>
> Idea 3)
> You don't have to change the swap count code, you can do a
> minimal change moves the page between zswap and another block
> device. That way you can get two differenet swap entry with
> existing code.
>
> Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-27  1:23                                           ` Huang, Ying
@ 2023-03-28  5:54                                             ` Yosry Ahmed
  2023-03-28  6:20                                               ` Huang, Ying
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-28  5:54 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Sun, Mar 26, 2023 at 6:24 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Fri, Mar 24, 2023 at 12:28:31AM -0700, Yosry Ahmed wrote:
> >> > In fact, I just suggest to use the minimal design on top of the current
> >> > implementation as the first step.  Then, you can improve it step by
> >> > step.
> >> >
> >> > The first step could be the minimal effort to implement indirection
> >> > layer and moving swapped pages between swap implementations.  Based on
> >> > that, you can build other optimizations, such as pulling swap counting
> >> > to the swap core.  For each step, we can evaluate the gain and cost with
> >> > data.
> >>
> >> Right, I understand that, but to implement the indirection layer on
> >> top of the current implementation, then we will need to support using
> >> zswap without a backing swap device. In order to do this without
> >
> > Agree with Ying on the minimal approach here as well.
> >
> > There are two ways to approach this.
> >
> > 1) Forget zswap, make a minimal implementation to move the page between
> > two swapfile device. It can be swapfile back to two loop back files.
> >
> > Any indirect layer you design will need to convert this usage case
> > any way.
> >
> > 2) Make zswap work without a swapfile.
> > You can implement the zswap on a fake ghosts swap file.
> >
> > If you keep the zswap as frontswap, just make zswap can work without
> > a real swapfile.
> >
> > Make that as your first minimal step. Then it does not need to touch
> > the swap count changes.
> >
> > I view make that step is independent of moving pages between swap device.
> >
> > That patch exists and I consider it has value to some users.
>
> This sounds like an even smaller approach as the first step.  Further
> improvement can be built on top of it.

I am not sure how this would be a step towards the abstraction goal we
have been discussing.

We have been discussing starting out with a minimal indirection layer,
in the shape of an xarray that maps a swap ID to a swap entry, and
that can be disabled with a config option.

For such a design to work, we have to implement swap entry management
& swap counting in zswap, right? Am I missing something?

>
> Best Regards,
> Huang, Ying
>
> >> > Anyway, I don't think you can just implement all your final solution in
> >> > one step.  And, I think the minimal design suggested could be a starting
> >> > point.
> >>
> >> I agree that's a great point, I am just afraid that we will avoid
> >> implementing that full final solution and instead do a lot of work
> >> inside zswap to make up for the difference (e.g. swap entry
> >> management, swap counting). Also, that work in zswap may end up being
> >> unacceptable due to the maintenance burden and/or complexity.
> >
> > If you do either 1) or 2), you can keep these two paths separate.
> >
> > Even if you want to move the page between zswap and swapfile.
> >
> > Idea 3)
> > You don't have to change the swap count code, you can do a
> > minimal change moves the page between zswap and another block
> > device. That way you can get two differenet swap entry with
> > existing code.
> >
> > Chris
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28  5:54                                             ` Yosry Ahmed
@ 2023-03-28  6:20                                               ` Huang, Ying
  2023-03-28  6:29                                                 ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-28  6:20 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Sun, Mar 26, 2023 at 6:24 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Fri, Mar 24, 2023 at 12:28:31AM -0700, Yosry Ahmed wrote:
>> >> > In fact, I just suggest to use the minimal design on top of the current
>> >> > implementation as the first step.  Then, you can improve it step by
>> >> > step.
>> >> >
>> >> > The first step could be the minimal effort to implement indirection
>> >> > layer and moving swapped pages between swap implementations.  Based on
>> >> > that, you can build other optimizations, such as pulling swap counting
>> >> > to the swap core.  For each step, we can evaluate the gain and cost with
>> >> > data.
>> >>
>> >> Right, I understand that, but to implement the indirection layer on
>> >> top of the current implementation, then we will need to support using
>> >> zswap without a backing swap device. In order to do this without
>> >
>> > Agree with Ying on the minimal approach here as well.
>> >
>> > There are two ways to approach this.
>> >
>> > 1) Forget zswap, make a minimal implementation to move the page between
>> > two swapfile device. It can be swapfile back to two loop back files.
>> >
>> > Any indirect layer you design will need to convert this usage case
>> > any way.
>> >
>> > 2) Make zswap work without a swapfile.
>> > You can implement the zswap on a fake ghosts swap file.
>> >
>> > If you keep the zswap as frontswap, just make zswap can work without
>> > a real swapfile.
>> >
>> > Make that as your first minimal step. Then it does not need to touch
>> > the swap count changes.
>> >
>> > I view make that step is independent of moving pages between swap device.
>> >
>> > That patch exists and I consider it has value to some users.
>>
>> This sounds like an even smaller approach as the first step.  Further
>> improvement can be built on top of it.
>
> I am not sure how this would be a step towards the abstraction goal we
> have been discussing.
>
> We have been discussing starting out with a minimal indirection layer,
> in the shape of an xarray that maps a swap ID to a swap entry, and
> that can be disabled with a config option.
>
> For such a design to work, we have to implement swap entry management
> & swap counting in zswap, right? Am I missing something?

Chris suggested to avoid to implement the swap entry management & swap
counting in zswap via using a "fake ghost swap file".  Copied his
suggestion as below,

"
>> > 2) Make zswap work without a swapfile.
>> > You can implement the zswap on a fake ghosts swap file.
>> >
>> > If you keep the zswap as frontswap, just make zswap can work without
>> > a real swapfile.
>> >
>> > Make that as your first minimal step. Then it does not need to touch
>> > the swap count changes.
"

Best Regards,
Huang, Ying

>>
>> >> > Anyway, I don't think you can just implement all your final solution in
>> >> > one step.  And, I think the minimal design suggested could be a starting
>> >> > point.
>> >>
>> >> I agree that's a great point, I am just afraid that we will avoid
>> >> implementing that full final solution and instead do a lot of work
>> >> inside zswap to make up for the difference (e.g. swap entry
>> >> management, swap counting). Also, that work in zswap may end up being
>> >> unacceptable due to the maintenance burden and/or complexity.
>> >
>> > If you do either 1) or 2), you can keep these two paths separate.
>> >
>> > Even if you want to move the page between zswap and swapfile.
>> >
>> > Idea 3)
>> > You don't have to change the swap count code, you can do a
>> > minimal change moves the page between zswap and another block
>> > device. That way you can get two differenet swap entry with
>> > existing code.
>> >
>> > Chris
>>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28  6:20                                               ` Huang, Ying
@ 2023-03-28  6:29                                                 ` Yosry Ahmed
  2023-03-28  6:59                                                   ` Huang, Ying
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-28  6:29 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Mon, Mar 27, 2023 at 11:22 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Sun, Mar 26, 2023 at 6:24 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > On Fri, Mar 24, 2023 at 12:28:31AM -0700, Yosry Ahmed wrote:
> >> >> > In fact, I just suggest to use the minimal design on top of the current
> >> >> > implementation as the first step.  Then, you can improve it step by
> >> >> > step.
> >> >> >
> >> >> > The first step could be the minimal effort to implement indirection
> >> >> > layer and moving swapped pages between swap implementations.  Based on
> >> >> > that, you can build other optimizations, such as pulling swap counting
> >> >> > to the swap core.  For each step, we can evaluate the gain and cost with
> >> >> > data.
> >> >>
> >> >> Right, I understand that, but to implement the indirection layer on
> >> >> top of the current implementation, then we will need to support using
> >> >> zswap without a backing swap device. In order to do this without
> >> >
> >> > Agree with Ying on the minimal approach here as well.
> >> >
> >> > There are two ways to approach this.
> >> >
> >> > 1) Forget zswap, make a minimal implementation to move the page between
> >> > two swapfile device. It can be swapfile back to two loop back files.
> >> >
> >> > Any indirect layer you design will need to convert this usage case
> >> > any way.
> >> >
> >> > 2) Make zswap work without a swapfile.
> >> > You can implement the zswap on a fake ghosts swap file.
> >> >
> >> > If you keep the zswap as frontswap, just make zswap can work without
> >> > a real swapfile.
> >> >
> >> > Make that as your first minimal step. Then it does not need to touch
> >> > the swap count changes.
> >> >
> >> > I view make that step is independent of moving pages between swap device.
> >> >
> >> > That patch exists and I consider it has value to some users.
> >>
> >> This sounds like an even smaller approach as the first step.  Further
> >> improvement can be built on top of it.
> >
> > I am not sure how this would be a step towards the abstraction goal we
> > have been discussing.
> >
> > We have been discussing starting out with a minimal indirection layer,
> > in the shape of an xarray that maps a swap ID to a swap entry, and
> > that can be disabled with a config option.
> >
> > For such a design to work, we have to implement swap entry management
> > & swap counting in zswap, right? Am I missing something?
>
> Chris suggested to avoid to implement the swap entry management & swap
> counting in zswap via using a "fake ghost swap file".  Copied his
> suggestion as below,

Right, we have been using ghost swapfiles at Google for a while. They
are basically sparse files that you can never actually write to, they
are just used so that we can use zswap without a backing swap device.

What I do not understand is how this is a step towards the ultimate
goal of swap abstraction. Is the idea to have the indirection layer
only support moving swapped pages between swapfiles, and have those
"ghost" swapfiles be on a higher tier than normal swapfiles? In this
case, I am guessing we eliminate the writeback logic from zswap itself
and move it to this indirection layer.

I don't have a problem with this approach, it is not really clean as
we still treat zswap as a swapfile and have to deal with a lot of
unnecessary code like swap slots handling and whatnot. We also have to
unnecessarily limit the size of zswap with the size of this fake
swapfile. In other words, we retain a lot of limitations that we have
today. Keep in mind that supporting ghost swapfiles is something that
is exposed to userspace, so we have to commit to supporting it -- it
can't just be an incremental step that we will change later.

With all that said, it is certainly a much simpler "solution".
Interested to hear thoughts on this, we can certainly pursue it if
people think it is the right way to move forward.

>
> "
> >> > 2) Make zswap work without a swapfile.
> >> > You can implement the zswap on a fake ghosts swap file.
> >> >
> >> > If you keep the zswap as frontswap, just make zswap can work without
> >> > a real swapfile.
> >> >
> >> > Make that as your first minimal step. Then it does not need to touch
> >> > the swap count changes.
> "
>
> Best Regards,
> Huang, Ying
>
> >>
> >> >> > Anyway, I don't think you can just implement all your final solution in
> >> >> > one step.  And, I think the minimal design suggested could be a starting
> >> >> > point.
> >> >>
> >> >> I agree that's a great point, I am just afraid that we will avoid
> >> >> implementing that full final solution and instead do a lot of work
> >> >> inside zswap to make up for the difference (e.g. swap entry
> >> >> management, swap counting). Also, that work in zswap may end up being
> >> >> unacceptable due to the maintenance burden and/or complexity.
> >> >
> >> > If you do either 1) or 2), you can keep these two paths separate.
> >> >
> >> > Even if you want to move the page between zswap and swapfile.
> >> >
> >> > Idea 3)
> >> > You don't have to change the swap count code, you can do a
> >> > minimal change moves the page between zswap and another block
> >> > device. That way you can get two differenet swap entry with
> >> > existing code.
> >> >
> >> > Chris
> >>
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28  6:29                                                 ` Yosry Ahmed
@ 2023-03-28  6:59                                                   ` Huang, Ying
  2023-03-28  7:59                                                     ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-03-28  6:59 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Mar 27, 2023 at 11:22 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Sun, Mar 26, 2023 at 6:24 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >> > On Fri, Mar 24, 2023 at 12:28:31AM -0700, Yosry Ahmed wrote:
>> >> >> > In fact, I just suggest to use the minimal design on top of the current
>> >> >> > implementation as the first step.  Then, you can improve it step by
>> >> >> > step.
>> >> >> >
>> >> >> > The first step could be the minimal effort to implement indirection
>> >> >> > layer and moving swapped pages between swap implementations.  Based on
>> >> >> > that, you can build other optimizations, such as pulling swap counting
>> >> >> > to the swap core.  For each step, we can evaluate the gain and cost with
>> >> >> > data.
>> >> >>
>> >> >> Right, I understand that, but to implement the indirection layer on
>> >> >> top of the current implementation, then we will need to support using
>> >> >> zswap without a backing swap device. In order to do this without
>> >> >
>> >> > Agree with Ying on the minimal approach here as well.
>> >> >
>> >> > There are two ways to approach this.
>> >> >
>> >> > 1) Forget zswap, make a minimal implementation to move the page between
>> >> > two swapfile device. It can be swapfile back to two loop back files.
>> >> >
>> >> > Any indirect layer you design will need to convert this usage case
>> >> > any way.
>> >> >
>> >> > 2) Make zswap work without a swapfile.
>> >> > You can implement the zswap on a fake ghosts swap file.
>> >> >
>> >> > If you keep the zswap as frontswap, just make zswap can work without
>> >> > a real swapfile.
>> >> >
>> >> > Make that as your first minimal step. Then it does not need to touch
>> >> > the swap count changes.
>> >> >
>> >> > I view make that step is independent of moving pages between swap device.
>> >> >
>> >> > That patch exists and I consider it has value to some users.
>> >>
>> >> This sounds like an even smaller approach as the first step.  Further
>> >> improvement can be built on top of it.
>> >
>> > I am not sure how this would be a step towards the abstraction goal we
>> > have been discussing.
>> >
>> > We have been discussing starting out with a minimal indirection layer,
>> > in the shape of an xarray that maps a swap ID to a swap entry, and
>> > that can be disabled with a config option.
>> >
>> > For such a design to work, we have to implement swap entry management
>> > & swap counting in zswap, right? Am I missing something?
>>
>> Chris suggested to avoid to implement the swap entry management & swap
>> counting in zswap via using a "fake ghost swap file".  Copied his
>> suggestion as below,
>
> Right, we have been using ghost swapfiles at Google for a while. They
> are basically sparse files that you can never actually write to, they
> are just used so that we can use zswap without a backing swap device.
>
> What I do not understand is how this is a step towards the ultimate
> goal of swap abstraction. Is the idea to have the indirection layer
> only support moving swapped pages between swapfiles, and have those
> "ghost" swapfiles be on a higher tier than normal swapfiles? In this
> case, I am guessing we eliminate the writeback logic from zswap itself
> and move it to this indirection layer.

Yes.  I think the suggested minimal first step includes replacing the
writeback logic of zswap itself with moving swapped page of swap core
(indirectly layer).

> I don't have a problem with this approach, it is not really clean as
> we still treat zswap as a swapfile and have to deal with a lot of
> unnecessary code like swap slots handling and whatnot.

These are existing code?

> We also have to unnecessarily limit the size of zswap with the size of
> this fake swapfile.

I guess you need to limit the size of zswap anyway, because you need to
decide when to start to writeback or moving to the lower tiers.

> In other words, we retain a lot of limitations that we have today.

As the minimal first step, not the final state.

> Keep in mind that supporting ghost swapfiles is something that
> is exposed to userspace, so we have to commit to supporting it -- it
> can't just be an incremental step that we will change later.

Yes.  We should really care about ABI.  It's not a good idea to add ABI
for an intermediate step.  Do we need to change ABI to use a sparse file
to backing zswap?

> With all that said, it is certainly a much simpler "solution".
> Interested to hear thoughts on this, we can certainly pursue it if
> people think it is the right way to move forward.

Personally, I have no problem to change the design of swap code to add
useful features.  Just want to check whether we can do that step by step
and show benefit and cost clearly in each step.

Best Regards,
Huang, Ying

>>
>> "
>> >> > 2) Make zswap work without a swapfile.
>> >> > You can implement the zswap on a fake ghosts swap file.
>> >> >
>> >> > If you keep the zswap as frontswap, just make zswap can work without
>> >> > a real swapfile.
>> >> >
>> >> > Make that as your first minimal step. Then it does not need to touch
>> >> > the swap count changes.
>> "
>>
>> Best Regards,
>> Huang, Ying
>>
>> >>
>> >> >> > Anyway, I don't think you can just implement all your final solution in
>> >> >> > one step.  And, I think the minimal design suggested could be a starting
>> >> >> > point.
>> >> >>
>> >> >> I agree that's a great point, I am just afraid that we will avoid
>> >> >> implementing that full final solution and instead do a lot of work
>> >> >> inside zswap to make up for the difference (e.g. swap entry
>> >> >> management, swap counting). Also, that work in zswap may end up being
>> >> >> unacceptable due to the maintenance burden and/or complexity.
>> >> >
>> >> > If you do either 1) or 2), you can keep these two paths separate.
>> >> >
>> >> > Even if you want to move the page between zswap and swapfile.
>> >> >
>> >> > Idea 3)
>> >> > You don't have to change the swap count code, you can do a
>> >> > minimal change moves the page between zswap and another block
>> >> > device. That way you can get two differenet swap entry with
>> >> > existing code.
>> >> >
>> >> > Chris
>> >>
>>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28  6:59                                                   ` Huang, Ying
@ 2023-03-28  7:59                                                     ` Yosry Ahmed
  2023-03-28 14:14                                                       ` Johannes Weiner
  2023-03-28 20:50                                                       ` Chris Li
  0 siblings, 2 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-28  7:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 12:01 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Mon, Mar 27, 2023 at 11:22 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Sun, Mar 26, 2023 at 6:24 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Chris Li <chrisl@kernel.org> writes:
> >> >>
> >> >> > On Fri, Mar 24, 2023 at 12:28:31AM -0700, Yosry Ahmed wrote:
> >> >> >> > In fact, I just suggest to use the minimal design on top of the current
> >> >> >> > implementation as the first step.  Then, you can improve it step by
> >> >> >> > step.
> >> >> >> >
> >> >> >> > The first step could be the minimal effort to implement indirection
> >> >> >> > layer and moving swapped pages between swap implementations.  Based on
> >> >> >> > that, you can build other optimizations, such as pulling swap counting
> >> >> >> > to the swap core.  For each step, we can evaluate the gain and cost with
> >> >> >> > data.
> >> >> >>
> >> >> >> Right, I understand that, but to implement the indirection layer on
> >> >> >> top of the current implementation, then we will need to support using
> >> >> >> zswap without a backing swap device. In order to do this without
> >> >> >
> >> >> > Agree with Ying on the minimal approach here as well.
> >> >> >
> >> >> > There are two ways to approach this.
> >> >> >
> >> >> > 1) Forget zswap, make a minimal implementation to move the page between
> >> >> > two swapfile device. It can be swapfile back to two loop back files.
> >> >> >
> >> >> > Any indirect layer you design will need to convert this usage case
> >> >> > any way.
> >> >> >
> >> >> > 2) Make zswap work without a swapfile.
> >> >> > You can implement the zswap on a fake ghosts swap file.
> >> >> >
> >> >> > If you keep the zswap as frontswap, just make zswap can work without
> >> >> > a real swapfile.
> >> >> >
> >> >> > Make that as your first minimal step. Then it does not need to touch
> >> >> > the swap count changes.
> >> >> >
> >> >> > I view make that step is independent of moving pages between swap device.
> >> >> >
> >> >> > That patch exists and I consider it has value to some users.
> >> >>
> >> >> This sounds like an even smaller approach as the first step.  Further
> >> >> improvement can be built on top of it.
> >> >
> >> > I am not sure how this would be a step towards the abstraction goal we
> >> > have been discussing.
> >> >
> >> > We have been discussing starting out with a minimal indirection layer,
> >> > in the shape of an xarray that maps a swap ID to a swap entry, and
> >> > that can be disabled with a config option.
> >> >
> >> > For such a design to work, we have to implement swap entry management
> >> > & swap counting in zswap, right? Am I missing something?
> >>
> >> Chris suggested to avoid to implement the swap entry management & swap
> >> counting in zswap via using a "fake ghost swap file".  Copied his
> >> suggestion as below,
> >
> > Right, we have been using ghost swapfiles at Google for a while. They
> > are basically sparse files that you can never actually write to, they
> > are just used so that we can use zswap without a backing swap device.
> >
> > What I do not understand is how this is a step towards the ultimate
> > goal of swap abstraction. Is the idea to have the indirection layer
> > only support moving swapped pages between swapfiles, and have those
> > "ghost" swapfiles be on a higher tier than normal swapfiles? In this
> > case, I am guessing we eliminate the writeback logic from zswap itself
> > and move it to this indirection layer.
>
> Yes.  I think the suggested minimal first step includes replacing the
> writeback logic of zswap itself with moving swapped page of swap core
> (indirectly layer).
>
> > I don't have a problem with this approach, it is not really clean as
> > we still treat zswap as a swapfile and have to deal with a lot of
> > unnecessary code like swap slots handling and whatnot.
>
> These are existing code?

I was referring to the fact that today with zswap being tied to
swapfiles we do some necessary work such as searching for swap slots
during swapout. The initial swap_desc approach aimed to avoid that.
With this minimal ghost swapfile approach we retain this unfavorable
behavior.

>
> > We also have to unnecessarily limit the size of zswap with the size of
> > this fake swapfile.
>
> I guess you need to limit the size of zswap anyway, because you need to
> decide when to start to writeback or moving to the lower tiers.

zswap has a knob to limit its size, but based on the actual memory
usage of zswap (i.e the size of compressed pages). There is ongoing
work as well to autotune this if I remember correctly. Having to deal
with both the limit on compressed memory and the limited on the
uncompressed size of swapped pages is cumbersome. Again, we already
have this behavior today, but the initial swap_desc proposal aimed to
avoid it.

>
> > In other words, we retain a lot of limitations that we have today.
>
> As the minimal first step, not the final state.

I am assuming the first step here is using ghost swapfiles to support
zswap without a backing swap device and having an indirection layer
between swapfiles.

A following step can be making this indirection layer support zswap
directly without having to use such a ghost swapfile, if the
cost-benefit analysis proves that this is worthwhile. Is my
understanding correct?

>
> > Keep in mind that supporting ghost swapfiles is something that
> > is exposed to userspace, so we have to commit to supporting it -- it
> > can't just be an incremental step that we will change later.
>
> Yes.  We should really care about ABI.  It's not a good idea to add ABI
> for an intermediate step.  Do we need to change ABI to use a sparse file
> to backing zswap?

I think so. At Google we identify ghost swapfiles by having 0 block
length and mark them in the kernel. The upstream kernel would reject
such a file because it doesn't have a proper swap header. If we use
mkswap to have a proper swap header then swapon rejects the file
because it has holes.

>
> > With all that said, it is certainly a much simpler "solution".
> > Interested to hear thoughts on this, we can certainly pursue it if
> > people think it is the right way to move forward.
>
> Personally, I have no problem to change the design of swap code to add
> useful features.  Just want to check whether we can do that step by step
> and show benefit and cost clearly in each step.

Right. I understand and totally agree, even from a development point
of view it's much better to make big changes incrementally to avoid
doing a lot of work that ends up going nowhere. I am just trying to
make sure that whatever we decide is indeed a step in the right
direction.

Thanks for a very insightful discussion.

>
> Best Regards,
> Huang, Ying
>
> >>
> >> "
> >> >> > 2) Make zswap work without a swapfile.
> >> >> > You can implement the zswap on a fake ghosts swap file.
> >> >> >
> >> >> > If you keep the zswap as frontswap, just make zswap can work without
> >> >> > a real swapfile.
> >> >> >
> >> >> > Make that as your first minimal step. Then it does not need to touch
> >> >> > the swap count changes.
> >> "
> >>
> >> Best Regards,
> >> Huang, Ying
> >>
> >> >>
> >> >> >> > Anyway, I don't think you can just implement all your final solution in
> >> >> >> > one step.  And, I think the minimal design suggested could be a starting
> >> >> >> > point.
> >> >> >>
> >> >> >> I agree that's a great point, I am just afraid that we will avoid
> >> >> >> implementing that full final solution and instead do a lot of work
> >> >> >> inside zswap to make up for the difference (e.g. swap entry
> >> >> >> management, swap counting). Also, that work in zswap may end up being
> >> >> >> unacceptable due to the maintenance burden and/or complexity.
> >> >> >
> >> >> > If you do either 1) or 2), you can keep these two paths separate.
> >> >> >
> >> >> > Even if you want to move the page between zswap and swapfile.
> >> >> >
> >> >> > Idea 3)
> >> >> > You don't have to change the swap count code, you can do a
> >> >> > minimal change moves the page between zswap and another block
> >> >> > device. That way you can get two differenet swap entry with
> >> >> > existing code.
> >> >> >
> >> >> > Chris
> >> >>
> >>
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28  7:59                                                     ` Yosry Ahmed
@ 2023-03-28 14:14                                                       ` Johannes Weiner
  2023-03-28 19:59                                                         ` Yosry Ahmed
  2023-03-28 20:50                                                       ` Chris Li
  1 sibling, 1 reply; 105+ messages in thread
From: Johannes Weiner @ 2023-03-28 14:14 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Huang, Ying, Chris Li, lsf-pc, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
> On Tue, Mar 28, 2023 at 12:01 AM Huang, Ying <ying.huang@intel.com> wrote:
> > Yosry Ahmed <yosryahmed@google.com> writes:
> > > We also have to unnecessarily limit the size of zswap with the size of
> > > this fake swapfile.
> >
> > I guess you need to limit the size of zswap anyway, because you need to
> > decide when to start to writeback or moving to the lower tiers.
> 
> zswap has a knob to limit its size, but based on the actual memory
> usage of zswap (i.e the size of compressed pages). There is ongoing
> work as well to autotune this if I remember correctly. Having to deal
> with both the limit on compressed memory and the limited on the
> uncompressed size of swapped pages is cumbersome. Again, we already
> have this behavior today, but the initial swap_desc proposal aimed to
> avoid it.

Right.

The optimal size of the zswap pool on top of a swapfile depends on the
size and compressibility of the warm set of the workload: data that's
too cold for regular memory yet too hot for swap. This is obviously
highly dynamic, and even varies over time within individual jobs.

With this proposal, we'd have to provision a static swap map for the
highest expected offloading rate and compression ratio on every host
of a shared pool. On 256G machines that would put the fixed overhead
at a couple of hundred MB if I counted right.

Not the end of the world I guess. And I agree it would make for
simpler initial patches. OTOH, it would add more quirks to the swap
code instead of cleaning it up. And given how common compressed memory
setups are nowadays, it still feels like it's trading off too far in
favor of regular swap setups at the expense of compression.

So it wouldn't be my first preference. But it sounds workable.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28 14:14                                                       ` Johannes Weiner
@ 2023-03-28 19:59                                                         ` Yosry Ahmed
  2023-03-28 21:22                                                           ` Chris Li
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-28 19:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Huang, Ying, Chris Li, lsf-pc, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 7:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
> > On Tue, Mar 28, 2023 at 12:01 AM Huang, Ying <ying.huang@intel.com> wrote:
> > > Yosry Ahmed <yosryahmed@google.com> writes:
> > > > We also have to unnecessarily limit the size of zswap with the size of
> > > > this fake swapfile.
> > >
> > > I guess you need to limit the size of zswap anyway, because you need to
> > > decide when to start to writeback or moving to the lower tiers.
> >
> > zswap has a knob to limit its size, but based on the actual memory
> > usage of zswap (i.e the size of compressed pages). There is ongoing
> > work as well to autotune this if I remember correctly. Having to deal
> > with both the limit on compressed memory and the limited on the
> > uncompressed size of swapped pages is cumbersome. Again, we already
> > have this behavior today, but the initial swap_desc proposal aimed to
> > avoid it.
>
> Right.
>
> The optimal size of the zswap pool on top of a swapfile depends on the
> size and compressibility of the warm set of the workload: data that's
> too cold for regular memory yet too hot for swap. This is obviously
> highly dynamic, and even varies over time within individual jobs.
>
> With this proposal, we'd have to provision a static swap map for the
> highest expected offloading rate and compression ratio on every host
> of a shared pool. On 256G machines that would put the fixed overhead
> at a couple of hundred MB if I counted right.
>
> Not the end of the world I guess. And I agree it would make for
> simpler initial patches. OTOH, it would add more quirks to the swap
> code instead of cleaning it up. And given how common compressed memory
> setups are nowadays, it still feels like it's trading off too far in
> favor of regular swap setups at the expense of compression.

Right, I don't like adding more quirks to the swap code. I guess for
Android and ChromeOS, even though they are using compressed memory, it
is zram not zswap, so any extra overhead by swap_descs for normal swap
setups would also affect Android -- so that's something to think
about.

>
> So it wouldn't be my first preference. But it sounds workable.

If we settle on this as a first step, perhaps to avoid any ABI changes
we can have the kernel create a virtual swap device for zswap if it is
enabled, without userspace interfering or having to do swapon on a
sparse swapfile like we do today with ghost swapfiles at Google. We
can then implement indirection logic that only supports moving pages
between swap devices -- and perhaps only restrict it to only support
the virtual zswap swap device as a top tier initially.

The only user visible effect would be that if the user has zswap
enabled and did not configure a swapfile, zswap would start
compressing pages regardless, but that's what we're hoping for anyway
-- I wouldn't think this is a breaking change.

This also wouldn't be my first preference, but it seems like a smaller
step from what we have today. As long as we don't have ABI
dependencies we can always come back and change it later I suppose.


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28  7:59                                                     ` Yosry Ahmed
  2023-03-28 14:14                                                       ` Johannes Weiner
@ 2023-03-28 20:50                                                       ` Chris Li
  2023-03-28 21:01                                                         ` Yosry Ahmed
  1 sibling, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-28 20:50 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
> > > I don't have a problem with this approach, it is not really clean as
> > > we still treat zswap as a swapfile and have to deal with a lot of
> > > unnecessary code like swap slots handling and whatnot.
> >
> > These are existing code?

Yes. The ghost swap file are existing code used in Google for many years.
 
> I was referring to the fact that today with zswap being tied to
> swapfiles we do some necessary work such as searching for swap slots
> during swapout. The initial swap_desc approach aimed to avoid that.
> With this minimal ghost swapfile approach we retain this unfavorable
> behavior.

Can you explain how you can avoid the free swap entry search
in the swap descriptor world?

The swap entry space is smaller than the memory address space and
there are other swapfiles can use the swap entry. You do need to do
some swap entry space management work to get a free entry. That to
me seems unavoidable, am I missing something?

> > Personally, I have no problem to change the design of swap code to add
> > useful features.  Just want to check whether we can do that step by step
> > and show benefit and cost clearly in each step.
> 
> Right. I understand and totally agree, even from a development point
> of view it's much better to make big changes incrementally to avoid
> doing a lot of work that ends up going nowhere. I am just trying to
> make sure that whatever we decide is indeed a step in the right
> direction.

The ghost swap file patch already exists. We might just share it here
for the discussion purpose, list the pros and cons.

Chris



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28 20:50                                                       ` Chris Li
@ 2023-03-28 21:01                                                         ` Yosry Ahmed
  2023-03-28 21:32                                                           ` Chris Li
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-28 21:01 UTC (permalink / raw)
  To: Chris Li
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 1:50 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
> > > > I don't have a problem with this approach, it is not really clean as
> > > > we still treat zswap as a swapfile and have to deal with a lot of
> > > > unnecessary code like swap slots handling and whatnot.
> > >
> > > These are existing code?
>
> Yes. The ghost swap file are existing code used in Google for many years.
>
> > I was referring to the fact that today with zswap being tied to
> > swapfiles we do some necessary work such as searching for swap slots
> > during swapout. The initial swap_desc approach aimed to avoid that.
> > With this minimal ghost swapfile approach we retain this unfavorable
> > behavior.
>
> Can you explain how you can avoid the free swap entry search
> in the swap descriptor world?

For zswap, in the swap descriptor world, you just need to allocate a
struct zswap_entry and have the swap descriptor point to it. No need
for swap slot management since we are not tied to a swapfile and pages
in zswap do not have a specific position.

>
> The swap entry space is smaller than the memory address space and
> there are other swapfiles can use the swap entry. You do need to do
> some swap entry space management work to get a free entry. That to
> me seems unavoidable, am I missing something?
>
> > > Personally, I have no problem to change the design of swap code to add
> > > useful features.  Just want to check whether we can do that step by step
> > > and show benefit and cost clearly in each step.
> >
> > Right. I understand and totally agree, even from a development point
> > of view it's much better to make big changes incrementally to avoid
> > doing a lot of work that ends up going nowhere. I am just trying to
> > make sure that whatever we decide is indeed a step in the right
> > direction.
>
> The ghost swap file patch already exists. We might just share it here
> for the discussion purpose, list the pros and cons.

The ghost swap file patch that we have does not work upstream because:
(a) It involves ABI changes (we start supporting swapon on a sparse
0-block-length file). Hence, it cannot be an incremental/intermediate
step because once we start supporting it we cannot take it back.
(b) It requires modifications to the swapon utility to work.

In my response to Johannes [1] I described an alternative that can
work upstream and wouldn't have ABI changes or need changes to swapon.

[1]https://lore.kernel.org/linux-mm/CAJD7tkbudmPTEumgKJZ5pXy6O79ySbGiCnAZXnUUuEmfZ6KCtQ@mail.gmail.com/

>
> Chris
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28 19:59                                                         ` Yosry Ahmed
@ 2023-03-28 21:22                                                           ` Chris Li
  2023-03-28 21:30                                                             ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-28 21:22 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Huang, Ying, lsf-pc, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 12:59:55PM -0700, Yosry Ahmed wrote:
> > So it wouldn't be my first preference. But it sounds workable.
> 
> If we settle on this as a first step, perhaps to avoid any ABI changes

Turn on zswap without a real swapfile is user space visible change.
Some small change is unavoidable, ABI or not.

> we can have the kernel create a virtual swap device for zswap if it is
> enabled, without userspace interfering or having to do swapon on a
> sparse swapfile like we do today with ghost swapfiles at Google. We
> can then implement indirection logic that only supports moving pages
> between swap devices -- and perhaps only restrict it to only support
> the virtual zswap swap device as a top tier initially.

One more things to consider is that, Google use more than one ghost
swapfiles for zswap due to scalability. There might be scalability
implication if there can be only one zswap device.

Chris

 


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28 21:22                                                           ` Chris Li
@ 2023-03-28 21:30                                                             ` Yosry Ahmed
  0 siblings, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-28 21:30 UTC (permalink / raw)
  To: Chris Li
  Cc: Johannes Weiner, Huang, Ying, lsf-pc, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 2:22 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Mar 28, 2023 at 12:59:55PM -0700, Yosry Ahmed wrote:
> > > So it wouldn't be my first preference. But it sounds workable.
> >
> > If we settle on this as a first step, perhaps to avoid any ABI changes
>
> Turn on zswap without a real swapfile is user space visible change.
> Some small change is unavoidable, ABI or not.

Yes, but to support the ghost swapfile use case we have to make swapon
support files with holes and/or 0-block length -- which is a bigger
change user-visible change and also includes a change to the swapon
utility.

>
> > we can have the kernel create a virtual swap device for zswap if it is
> > enabled, without userspace interfering or having to do swapon on a
> > sparse swapfile like we do today with ghost swapfiles at Google. We
> > can then implement indirection logic that only supports moving pages
> > between swap devices -- and perhaps only restrict it to only support
> > the virtual zswap swap device as a top tier initially.
>
> One more things to consider is that, Google use more than one ghost
> swapfiles for zswap due to scalability. There might be scalability
> implication if there can be only one zswap device.

Yes, but I believe this can be (and should be) handled within the
kernel. We can find the sources of scalability problems with a single
swapfile and mitigate them. I suspect the tree lock in zswap is one of
them. The swap cache lock might be another one, but the swap cache is
already partitioned within each swapfile to address that.

>
> Chris
>
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28 21:01                                                         ` Yosry Ahmed
@ 2023-03-28 21:32                                                           ` Chris Li
  2023-03-28 21:44                                                             ` Yosry Ahmed
  0 siblings, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-28 21:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 02:01:09PM -0700, Yosry Ahmed wrote:
> On Tue, Mar 28, 2023 at 1:50 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
> > > > > I don't have a problem with this approach, it is not really clean as
> > > > > we still treat zswap as a swapfile and have to deal with a lot of
> > > > > unnecessary code like swap slots handling and whatnot.
> > > >
> > > > These are existing code?
> >
> > Yes. The ghost swap file are existing code used in Google for many years.
> >
> > > I was referring to the fact that today with zswap being tied to
> > > swapfiles we do some necessary work such as searching for swap slots
> > > during swapout. The initial swap_desc approach aimed to avoid that.
> > > With this minimal ghost swapfile approach we retain this unfavorable
> > > behavior.
> >
> > Can you explain how you can avoid the free swap entry search
> > in the swap descriptor world?
> 
> For zswap, in the swap descriptor world, you just need to allocate a
> struct zswap_entry and have the swap descriptor point to it. No need
> for swap slot management since we are not tied to a swapfile and pages
> in zswap do not have a specific position.

Your swap descriptor will be using one swp_entry_t, which get from the PTE 
to lookup, right? That is the swap entry I am talking about. You just
substitute zswap swap entry with the swap descriptor swap entry.
You still need to allocate from the free swap entry space at least once.

Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28 21:32                                                           ` Chris Li
@ 2023-03-28 21:44                                                             ` Yosry Ahmed
  2023-03-28 22:01                                                               ` Chris Li
  2023-03-29  1:31                                                               ` Huang, Ying
  0 siblings, 2 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-28 21:44 UTC (permalink / raw)
  To: Chris Li
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 2:32 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Mar 28, 2023 at 02:01:09PM -0700, Yosry Ahmed wrote:
> > On Tue, Mar 28, 2023 at 1:50 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
> > > > > > I don't have a problem with this approach, it is not really clean as
> > > > > > we still treat zswap as a swapfile and have to deal with a lot of
> > > > > > unnecessary code like swap slots handling and whatnot.
> > > > >
> > > > > These are existing code?
> > >
> > > Yes. The ghost swap file are existing code used in Google for many years.
> > >
> > > > I was referring to the fact that today with zswap being tied to
> > > > swapfiles we do some necessary work such as searching for swap slots
> > > > during swapout. The initial swap_desc approach aimed to avoid that.
> > > > With this minimal ghost swapfile approach we retain this unfavorable
> > > > behavior.
> > >
> > > Can you explain how you can avoid the free swap entry search
> > > in the swap descriptor world?
> >
> > For zswap, in the swap descriptor world, you just need to allocate a
> > struct zswap_entry and have the swap descriptor point to it. No need
> > for swap slot management since we are not tied to a swapfile and pages
> > in zswap do not have a specific position.
>
> Your swap descriptor will be using one swp_entry_t, which get from the PTE
> to lookup, right? That is the swap entry I am talking about. You just
> substitute zswap swap entry with the swap descriptor swap entry.
> You still need to allocate from the free swap entry space at least once.

Oh, you mean the swap ID space. We just need to find an unused ID, we
can simply use an allocating xarray
(https://docs.kernel.org/core-api/xarray.html#allocating-xarrays).
This is simpler than keeping track of swap slots in a swapfile.

>
> Chris


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28 21:44                                                             ` Yosry Ahmed
@ 2023-03-28 22:01                                                               ` Chris Li
  2023-03-28 22:02                                                                 ` Yosry Ahmed
  2023-03-29  1:31                                                               ` Huang, Ying
  1 sibling, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-28 22:01 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 02:44:27PM -0700, Yosry Ahmed wrote:
> > Your swap descriptor will be using one swp_entry_t, which get from the PTE
> > to lookup, right? That is the swap entry I am talking about. You just
> > substitute zswap swap entry with the swap descriptor swap entry.
> > You still need to allocate from the free swap entry space at least once.
> 
> Oh, you mean the swap ID space. We just need to find an unused ID, we
> can simply use an allocating xarray
> (https://docs.kernel.org/core-api/xarray.html#allocating-xarrays).
> This is simpler than keeping track of swap slots in a swapfile.

Ah I see. That makes sense. Thanks for explaining it to me.

The real block swap device will still need to scan the swap_map
to find an empty space to write the page.

Chris




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28 22:01                                                               ` Chris Li
@ 2023-03-28 22:02                                                                 ` Yosry Ahmed
  0 siblings, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:02 UTC (permalink / raw)
  To: Chris Li
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 3:01 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Mar 28, 2023 at 02:44:27PM -0700, Yosry Ahmed wrote:
> > > Your swap descriptor will be using one swp_entry_t, which get from the PTE
> > > to lookup, right? That is the swap entry I am talking about. You just
> > > substitute zswap swap entry with the swap descriptor swap entry.
> > > You still need to allocate from the free swap entry space at least once.
> >
> > Oh, you mean the swap ID space. We just need to find an unused ID, we
> > can simply use an allocating xarray
> > (https://docs.kernel.org/core-api/xarray.html#allocating-xarrays).
> > This is simpler than keeping track of swap slots in a swapfile.
>
> Ah I see. That makes sense. Thanks for explaining it to me.
>
> The real block swap device will still need to scan the swap_map
> to find an empty space to write the page.

Yes, exactly. It's a tradeoff, and I think as long as whatever we
decide can be considered as a step towards the right direction it
should be fine.

>
> Chris
>
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-28 21:44                                                             ` Yosry Ahmed
  2023-03-28 22:01                                                               ` Chris Li
@ 2023-03-29  1:31                                                               ` Huang, Ying
  2023-03-29  1:41                                                                 ` Yosry Ahmed
  2023-03-29 15:22                                                                 ` Chris Li
  1 sibling, 2 replies; 105+ messages in thread
From: Huang, Ying @ 2023-03-29  1:31 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Tue, Mar 28, 2023 at 2:32 PM Chris Li <chrisl@kernel.org> wrote:
>>
>> On Tue, Mar 28, 2023 at 02:01:09PM -0700, Yosry Ahmed wrote:
>> > On Tue, Mar 28, 2023 at 1:50 PM Chris Li <chrisl@kernel.org> wrote:
>> > >
>> > > On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
>> > > > > > I don't have a problem with this approach, it is not really clean as
>> > > > > > we still treat zswap as a swapfile and have to deal with a lot of
>> > > > > > unnecessary code like swap slots handling and whatnot.
>> > > > >
>> > > > > These are existing code?
>> > >
>> > > Yes. The ghost swap file are existing code used in Google for many years.
>> > >
>> > > > I was referring to the fact that today with zswap being tied to
>> > > > swapfiles we do some necessary work such as searching for swap slots
>> > > > during swapout. The initial swap_desc approach aimed to avoid that.
>> > > > With this minimal ghost swapfile approach we retain this unfavorable
>> > > > behavior.
>> > >
>> > > Can you explain how you can avoid the free swap entry search
>> > > in the swap descriptor world?
>> >
>> > For zswap, in the swap descriptor world, you just need to allocate a
>> > struct zswap_entry and have the swap descriptor point to it. No need
>> > for swap slot management since we are not tied to a swapfile and pages
>> > in zswap do not have a specific position.
>>
>> Your swap descriptor will be using one swp_entry_t, which get from the PTE
>> to lookup, right? That is the swap entry I am talking about. You just
>> substitute zswap swap entry with the swap descriptor swap entry.
>> You still need to allocate from the free swap entry space at least once.
>
> Oh, you mean the swap ID space. We just need to find an unused ID, we
> can simply use an allocating xarray
> (https://docs.kernel.org/core-api/xarray.html#allocating-xarrays).
> This is simpler than keeping track of swap slots in a swapfile.

If we want to implement the swap entry management inside the zswap
implementation (instead of reusing swap_map[]), then the allocating
xarray can be used too.  Some per-entry data (such as swap count, etc.)
can be stored there.  I understanding that this isn't perfect (one more
xarray looking up, one more data structure, etc.), but this is a choice
too.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-29  1:31                                                               ` Huang, Ying
@ 2023-03-29  1:41                                                                 ` Yosry Ahmed
  2023-03-29 16:04                                                                   ` Chris Li
  2023-04-04  8:10                                                                   ` Huang, Ying
  2023-03-29 15:22                                                                 ` Chris Li
  1 sibling, 2 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-03-29  1:41 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 6:33 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Tue, Mar 28, 2023 at 2:32 PM Chris Li <chrisl@kernel.org> wrote:
> >>
> >> On Tue, Mar 28, 2023 at 02:01:09PM -0700, Yosry Ahmed wrote:
> >> > On Tue, Mar 28, 2023 at 1:50 PM Chris Li <chrisl@kernel.org> wrote:
> >> > >
> >> > > On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
> >> > > > > > I don't have a problem with this approach, it is not really clean as
> >> > > > > > we still treat zswap as a swapfile and have to deal with a lot of
> >> > > > > > unnecessary code like swap slots handling and whatnot.
> >> > > > >
> >> > > > > These are existing code?
> >> > >
> >> > > Yes. The ghost swap file are existing code used in Google for many years.
> >> > >
> >> > > > I was referring to the fact that today with zswap being tied to
> >> > > > swapfiles we do some necessary work such as searching for swap slots
> >> > > > during swapout. The initial swap_desc approach aimed to avoid that.
> >> > > > With this minimal ghost swapfile approach we retain this unfavorable
> >> > > > behavior.
> >> > >
> >> > > Can you explain how you can avoid the free swap entry search
> >> > > in the swap descriptor world?
> >> >
> >> > For zswap, in the swap descriptor world, you just need to allocate a
> >> > struct zswap_entry and have the swap descriptor point to it. No need
> >> > for swap slot management since we are not tied to a swapfile and pages
> >> > in zswap do not have a specific position.
> >>
> >> Your swap descriptor will be using one swp_entry_t, which get from the PTE
> >> to lookup, right? That is the swap entry I am talking about. You just
> >> substitute zswap swap entry with the swap descriptor swap entry.
> >> You still need to allocate from the free swap entry space at least once.
> >
> > Oh, you mean the swap ID space. We just need to find an unused ID, we
> > can simply use an allocating xarray
> > (https://docs.kernel.org/core-api/xarray.html#allocating-xarrays).
> > This is simpler than keeping track of swap slots in a swapfile.
>
> If we want to implement the swap entry management inside the zswap
> implementation (instead of reusing swap_map[]), then the allocating
> xarray can be used too.  Some per-entry data (such as swap count, etc.)
> can be stored there.  I understanding that this isn't perfect (one more
> xarray looking up, one more data structure, etc.), but this is a choice
> too.

My main concern here would be having two separate swap counting
implementations -- although it might not be the end of the world. It
would be useful to consider all the options. So far, I think we have
been discussing 3 alternatives:

(a) The initial swap_desc proposal.
(b) Add an optional indirection layer that can move swap entries
between swap devices and add a virtual swap device for zswap in the
kernel.
(c) Add an optional indirection layer that can move entries between
different swap backends. Swap backends would be zswap & swap devices
for now. Zswap needs to implement swap entry management, swap
counting, etc.

Does this accurately summarize what we have discussed so far?

>
> Best Regards,
> Huang, Ying
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-29  1:31                                                               ` Huang, Ying
  2023-03-29  1:41                                                                 ` Yosry Ahmed
@ 2023-03-29 15:22                                                                 ` Chris Li
  1 sibling, 0 replies; 105+ messages in thread
From: Chris Li @ 2023-03-29 15:22 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Wed, Mar 29, 2023 at 09:31:59AM +0800, Huang, Ying wrote:
> > Oh, you mean the swap ID space. We just need to find an unused ID, we
> > can simply use an allocating xarray
> > (https://docs.kernel.org/core-api/xarray.html#allocating-xarrays).
> > This is simpler than keeping track of swap slots in a swapfile.
> 
> If we want to implement the swap entry management inside the zswap
> implementation (instead of reusing swap_map[]), then the allocating
> xarray can be used too.  Some per-entry data (such as swap count, etc.)
> can be stored there.  I understanding that this isn't perfect (one more

Just want to confirm that "there" means the zswap entry, not the xarray
itself, right? If you have ways to store both the zswap entry
pointer and swap count in xarray, I am very much interested to know :-)

> xarray looking up, one more data structure, etc.), but this is a choice
> too.

Chris




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-29  1:41                                                                 ` Yosry Ahmed
@ 2023-03-29 16:04                                                                   ` Chris Li
  2023-04-04  8:24                                                                     ` Huang, Ying
  2023-04-04  8:10                                                                   ` Huang, Ying
  1 sibling, 1 reply; 105+ messages in thread
From: Chris Li @ 2023-03-29 16:04 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Huang, Ying, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Mar 28, 2023 at 06:41:54PM -0700, Yosry Ahmed wrote:
> My main concern here would be having two separate swap counting
> implementations -- although it might not be the end of the world. It
> would be useful to consider all the options. So far, I think we have

Agree.

> been discussing 3 alternatives:
> 
> (a) The initial swap_desc proposal.
> (b) Add an optional indirection layer that can move swap entries
> between swap devices and add a virtual swap device for zswap in the
> kernel.

For the completeness sake let me add some option that have both pros
and cons.

(d) There is the google's ghost swap file. I understand it mean a bit
ABI change. It has the advantange that it allow more than one
zswap swapfile. Google use it that way. Another consideration is
that ghost swap file compatible with exisiting swapon behavior.
You can see how much swap entry was used from swapon summary.
Some application might depend on that.

We might able to find some way to break ABI less. 

> (c) Add an optional indirection layer that can move entries between
> different swap backends. Swap backends would be zswap & swap devices
> for now. Zswap needs to implement swap entry management, swap
> counting, etc.
(f) I have been thinking of variants of (b) without adding a virtual
swap device for zswap, using the ghost swap file instead.

Also the indirection is optional per swap entry at run time.
Some swap devices can have some entries move to another swap device.
Only those swap entries pay the price of the indirection layer.

(e) This is the long term goal I have in mind. A VFS like
implementation for swap file. Let's call it VSW.
This allows different swap devices using different
swap file system implementations.

A lot of the difficult trade off we have right now:
Smaller per entry up front allocate like swap_map[] for all
entry vs only allocating memory for swap entry that has been
swap out, but a larger per entry allocation.

I believe some of those trade offs can be addressed by having a
different swap file system. I do mean a different "mkswap"
that kind of file system. We can write out some of the swap
entry meta data to the swap file system as well. It means
we don't have to pay the larger per swap entry allocation overhead
for very cold pages. it might need to take two reads to swap
in some of the very cold swap entries. But that should be rare.

It can offer benefits for swapping out larger folio as well.
Right now swapping out large folios still needs to go through
the per 4k page swap index allocation and break down.

Basically, modernized the swap file system.

The redirection layer should be able to implement within VSW
as well.

I know that is a very ambitious plan :-)

We can do that incrementally. The swap file system doesn't have
much backward compatibility cross reboot, should be easier than
the normal file system.

Chris



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-29  1:41                                                                 ` Yosry Ahmed
  2023-03-29 16:04                                                                   ` Chris Li
@ 2023-04-04  8:10                                                                   ` Huang, Ying
  2023-04-04  8:47                                                                     ` Yosry Ahmed
  1 sibling, 1 reply; 105+ messages in thread
From: Huang, Ying @ 2023-04-04  8:10 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Tue, Mar 28, 2023 at 6:33 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Tue, Mar 28, 2023 at 2:32 PM Chris Li <chrisl@kernel.org> wrote:
>> >>
>> >> On Tue, Mar 28, 2023 at 02:01:09PM -0700, Yosry Ahmed wrote:
>> >> > On Tue, Mar 28, 2023 at 1:50 PM Chris Li <chrisl@kernel.org> wrote:
>> >> > >
>> >> > > On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
>> >> > > > > > I don't have a problem with this approach, it is not really clean as
>> >> > > > > > we still treat zswap as a swapfile and have to deal with a lot of
>> >> > > > > > unnecessary code like swap slots handling and whatnot.
>> >> > > > >
>> >> > > > > These are existing code?
>> >> > >
>> >> > > Yes. The ghost swap file are existing code used in Google for many years.
>> >> > >
>> >> > > > I was referring to the fact that today with zswap being tied to
>> >> > > > swapfiles we do some necessary work such as searching for swap slots
>> >> > > > during swapout. The initial swap_desc approach aimed to avoid that.
>> >> > > > With this minimal ghost swapfile approach we retain this unfavorable
>> >> > > > behavior.
>> >> > >
>> >> > > Can you explain how you can avoid the free swap entry search
>> >> > > in the swap descriptor world?
>> >> >
>> >> > For zswap, in the swap descriptor world, you just need to allocate a
>> >> > struct zswap_entry and have the swap descriptor point to it. No need
>> >> > for swap slot management since we are not tied to a swapfile and pages
>> >> > in zswap do not have a specific position.
>> >>
>> >> Your swap descriptor will be using one swp_entry_t, which get from the PTE
>> >> to lookup, right? That is the swap entry I am talking about. You just
>> >> substitute zswap swap entry with the swap descriptor swap entry.
>> >> You still need to allocate from the free swap entry space at least once.
>> >
>> > Oh, you mean the swap ID space. We just need to find an unused ID, we
>> > can simply use an allocating xarray
>> > (https://docs.kernel.org/core-api/xarray.html#allocating-xarrays).
>> > This is simpler than keeping track of swap slots in a swapfile.
>>
>> If we want to implement the swap entry management inside the zswap
>> implementation (instead of reusing swap_map[]), then the allocating
>> xarray can be used too.  Some per-entry data (such as swap count, etc.)
>> can be stored there.  I understanding that this isn't perfect (one more
>> xarray looking up, one more data structure, etc.), but this is a choice
>> too.
>
> My main concern here would be having two separate swap counting
> implementations -- although it might not be the end of the world.

This isn't a big issue for me.  For file systems, there are duplicated
functionality in different file system implementation, such as free
block space management.  Instead, I hope we can design better swap
implementation in the future.

> It would be useful to consider all the options. So far, I think we
> have been discussing 3 alternatives:
>
> (a) The initial swap_desc proposal.

My main concern for the initial swap_desc proposal is that the zswap
code is put in swap core instead of zswap implementation per my
understanding.  So zswap isn't another swap implementation encapsulated
with a common interface.  Please correct me if my understanding isn't
correct.

If so, the flexibility of the swap system is the cost.  For example,
zswap may be always at the highest priority among all swap devices.  We
can move the cold page from zswap to some swap device.  But we cannot
move the cold page from some swap device to zswap.

Maybe compression is always faster than any other swap devices, so we
will never need the flexibility.  Maybe the cost to hide zswap behind a
common interface is unacceptable.  I'm open to these.  But please
provide the evidence, and maybe data.

Best Regards,
Huang, Ying

> (b) Add an optional indirection layer that can move swap entries
> between swap devices and add a virtual swap device for zswap in the
> kernel.
> (c) Add an optional indirection layer that can move entries between
> different swap backends. Swap backends would be zswap & swap devices
> for now. Zswap needs to implement swap entry management, swap
> counting, etc.
>
> Does this accurately summarize what we have discussed so far?
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-03-29 16:04                                                                   ` Chris Li
@ 2023-04-04  8:24                                                                     ` Huang, Ying
  0 siblings, 0 replies; 105+ messages in thread
From: Huang, Ying @ 2023-04-04  8:24 UTC (permalink / raw)
  To: Chris Li
  Cc: Yosry Ahmed, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Chris Li <chrisl@kernel.org> writes:

> On Tue, Mar 28, 2023 at 06:41:54PM -0700, Yosry Ahmed wrote:
>> My main concern here would be having two separate swap counting
>> implementations -- although it might not be the end of the world. It
>> would be useful to consider all the options. So far, I think we have
>
> Agree.
>
>> been discussing 3 alternatives:
>> 
>> (a) The initial swap_desc proposal.
>> (b) Add an optional indirection layer that can move swap entries
>> between swap devices and add a virtual swap device for zswap in the
>> kernel.
>
> For the completeness sake let me add some option that have both pros
> and cons.
>
> (d) There is the google's ghost swap file. I understand it mean a bit
> ABI change. It has the advantange that it allow more than one
> zswap swapfile. Google use it that way. Another consideration is
> that ghost swap file compatible with exisiting swapon behavior.
> You can see how much swap entry was used from swapon summary.
> Some application might depend on that.
>
> We might able to find some way to break ABI less. 
>
>> (c) Add an optional indirection layer that can move entries between
>> different swap backends. Swap backends would be zswap & swap devices
>> for now. Zswap needs to implement swap entry management, swap
>> counting, etc.
> (f) I have been thinking of variants of (b) without adding a virtual
> swap device for zswap, using the ghost swap file instead.
>
> Also the indirection is optional per swap entry at run time.
> Some swap devices can have some entries move to another swap device.
> Only those swap entries pay the price of the indirection layer.
>
> (e) This is the long term goal I have in mind. A VFS like
> implementation for swap file. Let's call it VSW.
> This allows different swap devices using different
> swap file system implementations.

I like this too!

> A lot of the difficult trade off we have right now:
> Smaller per entry up front allocate like swap_map[] for all
> entry vs only allocating memory for swap entry that has been
> swap out, but a larger per entry allocation.

Yes.

> I believe some of those trade offs can be addressed by having a
> different swap file system. I do mean a different "mkswap"
> that kind of file system.

We may don't need that, because the swap on-disk format needn't to be
permanent across rebooting.

> We can write out some of the swap
> entry meta data to the swap file system as well. It means
> we don't have to pay the larger per swap entry allocation overhead
> for very cold pages. it might need to take two reads to swap
> in some of the very cold swap entries. But that should be rare.

Sound like a good idea.  At least can be investigated further.

> It can offer benefits for swapping out larger folio as well.
> Right now swapping out large folios still needs to go through
> the per 4k page swap index allocation and break down.
>
> Basically, modernized the swap file system.
>
> The redirection layer should be able to implement within VSW
> as well.
>
> I know that is a very ambitious plan :-)

Yes.

> We can do that incrementally. The swap file system doesn't have
> much backward compatibility cross reboot, should be easier than
> the normal file system.

Agree.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-04-04  8:10                                                                   ` Huang, Ying
@ 2023-04-04  8:47                                                                     ` Yosry Ahmed
  2023-04-06  1:40                                                                       ` Huang, Ying
  0 siblings, 1 reply; 105+ messages in thread
From: Yosry Ahmed @ 2023-04-04  8:47 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

On Tue, Apr 4, 2023 at 1:12 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Tue, Mar 28, 2023 at 6:33 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >>
> >> > On Tue, Mar 28, 2023 at 2:32 PM Chris Li <chrisl@kernel.org> wrote:
> >> >>
> >> >> On Tue, Mar 28, 2023 at 02:01:09PM -0700, Yosry Ahmed wrote:
> >> >> > On Tue, Mar 28, 2023 at 1:50 PM Chris Li <chrisl@kernel.org> wrote:
> >> >> > >
> >> >> > > On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
> >> >> > > > > > I don't have a problem with this approach, it is not really clean as
> >> >> > > > > > we still treat zswap as a swapfile and have to deal with a lot of
> >> >> > > > > > unnecessary code like swap slots handling and whatnot.
> >> >> > > > >
> >> >> > > > > These are existing code?
> >> >> > >
> >> >> > > Yes. The ghost swap file are existing code used in Google for many years.
> >> >> > >
> >> >> > > > I was referring to the fact that today with zswap being tied to
> >> >> > > > swapfiles we do some necessary work such as searching for swap slots
> >> >> > > > during swapout. The initial swap_desc approach aimed to avoid that.
> >> >> > > > With this minimal ghost swapfile approach we retain this unfavorable
> >> >> > > > behavior.
> >> >> > >
> >> >> > > Can you explain how you can avoid the free swap entry search
> >> >> > > in the swap descriptor world?
> >> >> >
> >> >> > For zswap, in the swap descriptor world, you just need to allocate a
> >> >> > struct zswap_entry and have the swap descriptor point to it. No need
> >> >> > for swap slot management since we are not tied to a swapfile and pages
> >> >> > in zswap do not have a specific position.
> >> >>
> >> >> Your swap descriptor will be using one swp_entry_t, which get from the PTE
> >> >> to lookup, right? That is the swap entry I am talking about. You just
> >> >> substitute zswap swap entry with the swap descriptor swap entry.
> >> >> You still need to allocate from the free swap entry space at least once.
> >> >
> >> > Oh, you mean the swap ID space. We just need to find an unused ID, we
> >> > can simply use an allocating xarray
> >> > (https://docs.kernel.org/core-api/xarray.html#allocating-xarrays).
> >> > This is simpler than keeping track of swap slots in a swapfile.
> >>
> >> If we want to implement the swap entry management inside the zswap
> >> implementation (instead of reusing swap_map[]), then the allocating
> >> xarray can be used too.  Some per-entry data (such as swap count, etc.)
> >> can be stored there.  I understanding that this isn't perfect (one more
> >> xarray looking up, one more data structure, etc.), but this is a choice
> >> too.
> >
> > My main concern here would be having two separate swap counting
> > implementations -- although it might not be the end of the world.
>
> This isn't a big issue for me.  For file systems, there are duplicated
> functionality in different file system implementation, such as free
> block space management.  Instead, I hope we can design better swap
> implementation in the future.
>
> > It would be useful to consider all the options. So far, I think we
> > have been discussing 3 alternatives:
> >
> > (a) The initial swap_desc proposal.
>
> My main concern for the initial swap_desc proposal is that the zswap
> code is put in swap core instead of zswap implementation per my
> understanding.  So zswap isn't another swap implementation encapsulated
> with a common interface.  Please correct me if my understanding isn't
> correct.
>
> If so, the flexibility of the swap system is the cost.  For example,
> zswap may be always at the highest priority among all swap devices.  We
> can move the cold page from zswap to some swap device.  But we cannot
> move the cold page from some swap device to zswap.


Not really. In the swap_desc proposal, I intended to have struct
swap_desc contain either a swap device entry (swp_entry_t) or a
frontswap entry (a pointer). zswap implementation would not be in the
swap core, instead, we would have two swap implementations: swap
devices and frontswap/zswap -- each of which implement a common swap
API. We can use one of the free bits to distinguish the type of the
underlying entry (swp_entry_t or pointer to frontswap/zswap entry).

We can start by only supporting moving pages from frontswap/zswap to
swap devices, but I don't see why the same design would not support
pages moving in the other direction if the need arises.

The number of free bits in swp_entry_t and pointers is limited (2 bits
on 32-bit systems, 3 bits on 64-bit systems), so there are only a
handful of different swap types we can support with the swap_desc
design, but we only need two to begin with. If in the future we need
more, we can add an indirection layer then or expand swap_desc -- or
we can encode the data within the swap device itself (how it compares
to frontswap/zswap).

In summary, the swap_desc proposal does NOT involve moving zswap code
to core swap, it involves a generic swap API with two implementations:
swap devices and frontswap/zswap.

The only problems I see with the swap_desc design are:
- Extra overhead for users using swapfiles only.
- A bigger leap from what we have today than other ideas proposed
(e.g. virtual swap device for zswap).

>
>
> Maybe compression is always faster than any other swap devices, so we
> will never need the flexibility.  Maybe the cost to hide zswap behind a
> common interface is unacceptable.  I'm open to these.  But please
> provide the evidence, and maybe data.
>
> Best Regards,
> Huang, Ying
>
> > (b) Add an optional indirection layer that can move swap entries
> > between swap devices and add a virtual swap device for zswap in the
> > kernel.
> > (c) Add an optional indirection layer that can move entries between
> > different swap backends. Swap backends would be zswap & swap devices
> > for now. Zswap needs to implement swap entry management, swap
> > counting, etc.
> >
> > Does this accurately summarize what we have discussed so far?
> >


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-04-04  8:47                                                                     ` Yosry Ahmed
@ 2023-04-06  1:40                                                                       ` Huang, Ying
  0 siblings, 0 replies; 105+ messages in thread
From: Huang, Ying @ 2023-04-06  1:40 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chris Li, lsf-pc, Johannes Weiner, Linux-MM, Michal Hocko,
	Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
	Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
	Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu

Yosry Ahmed <yosryahmed@google.com> writes:

> On Tue, Apr 4, 2023 at 1:12 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Tue, Mar 28, 2023 at 6:33 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> > On Tue, Mar 28, 2023 at 2:32 PM Chris Li <chrisl@kernel.org> wrote:
>> >> >>
>> >> >> On Tue, Mar 28, 2023 at 02:01:09PM -0700, Yosry Ahmed wrote:
>> >> >> > On Tue, Mar 28, 2023 at 1:50 PM Chris Li <chrisl@kernel.org> wrote:
>> >> >> > >
>> >> >> > > On Tue, Mar 28, 2023 at 12:59:31AM -0700, Yosry Ahmed wrote:
>> >> >> > > > > > I don't have a problem with this approach, it is not really clean as
>> >> >> > > > > > we still treat zswap as a swapfile and have to deal with a lot of
>> >> >> > > > > > unnecessary code like swap slots handling and whatnot.
>> >> >> > > > >
>> >> >> > > > > These are existing code?
>> >> >> > >
>> >> >> > > Yes. The ghost swap file are existing code used in Google for many years.
>> >> >> > >
>> >> >> > > > I was referring to the fact that today with zswap being tied to
>> >> >> > > > swapfiles we do some necessary work such as searching for swap slots
>> >> >> > > > during swapout. The initial swap_desc approach aimed to avoid that.
>> >> >> > > > With this minimal ghost swapfile approach we retain this unfavorable
>> >> >> > > > behavior.
>> >> >> > >
>> >> >> > > Can you explain how you can avoid the free swap entry search
>> >> >> > > in the swap descriptor world?
>> >> >> >
>> >> >> > For zswap, in the swap descriptor world, you just need to allocate a
>> >> >> > struct zswap_entry and have the swap descriptor point to it. No need
>> >> >> > for swap slot management since we are not tied to a swapfile and pages
>> >> >> > in zswap do not have a specific position.
>> >> >>
>> >> >> Your swap descriptor will be using one swp_entry_t, which get from the PTE
>> >> >> to lookup, right? That is the swap entry I am talking about. You just
>> >> >> substitute zswap swap entry with the swap descriptor swap entry.
>> >> >> You still need to allocate from the free swap entry space at least once.
>> >> >
>> >> > Oh, you mean the swap ID space. We just need to find an unused ID, we
>> >> > can simply use an allocating xarray
>> >> > (https://docs.kernel.org/core-api/xarray.html#allocating-xarrays).
>> >> > This is simpler than keeping track of swap slots in a swapfile.
>> >>
>> >> If we want to implement the swap entry management inside the zswap
>> >> implementation (instead of reusing swap_map[]), then the allocating
>> >> xarray can be used too.  Some per-entry data (such as swap count, etc.)
>> >> can be stored there.  I understanding that this isn't perfect (one more
>> >> xarray looking up, one more data structure, etc.), but this is a choice
>> >> too.
>> >
>> > My main concern here would be having two separate swap counting
>> > implementations -- although it might not be the end of the world.
>>
>> This isn't a big issue for me.  For file systems, there are duplicated
>> functionality in different file system implementation, such as free
>> block space management.  Instead, I hope we can design better swap
>> implementation in the future.
>>
>> > It would be useful to consider all the options. So far, I think we
>> > have been discussing 3 alternatives:
>> >
>> > (a) The initial swap_desc proposal.
>>
>> My main concern for the initial swap_desc proposal is that the zswap
>> code is put in swap core instead of zswap implementation per my
>> understanding.  So zswap isn't another swap implementation encapsulated
>> with a common interface.  Please correct me if my understanding isn't
>> correct.
>>
>> If so, the flexibility of the swap system is the cost.  For example,
>> zswap may be always at the highest priority among all swap devices.  We
>> can move the cold page from zswap to some swap device.  But we cannot
>> move the cold page from some swap device to zswap.
>
>
> Not really. In the swap_desc proposal, I intended to have struct
> swap_desc contain either a swap device entry (swp_entry_t) or a
> frontswap entry (a pointer). zswap implementation would not be in the
> swap core, instead, we would have two swap implementations: swap
> devices and frontswap/zswap -- each of which implement a common swap
> API. We can use one of the free bits to distinguish the type of the
> underlying entry (swp_entry_t or pointer to frontswap/zswap entry).
>
> We can start by only supporting moving pages from frontswap/zswap to
> swap devices, but I don't see why the same design would not support
> pages moving in the other direction if the need arises.
>
> The number of free bits in swp_entry_t and pointers is limited (2 bits
> on 32-bit systems, 3 bits on 64-bit systems), so there are only a
> handful of different swap types we can support with the swap_desc
> design, but we only need two to begin with. If in the future we need
> more, we can add an indirection layer then or expand swap_desc -- or
> we can encode the data within the swap device itself (how it compares
> to frontswap/zswap).
>
> In summary, the swap_desc proposal does NOT involve moving zswap code
> to core swap, it involves a generic swap API with two implementations:
> swap devices and frontswap/zswap.

This eliminate the main concerns for me!  Thanks!

> The only problems I see with the swap_desc design are:
> - Extra overhead for users using swapfiles only.
> - A bigger leap from what we have today than other ideas proposed
> (e.g. virtual swap device for zswap).

Yes.

Best Regards,
Huang, Ying

>>
>>
>> Maybe compression is always faster than any other swap devices, so we
>> will never need the flexibility.  Maybe the cost to hide zswap behind a
>> common interface is unacceptable.  I'm open to these.  But please
>> provide the evidence, and maybe data.
>>
>> Best Regards,
>> Huang, Ying
>>
>> > (b) Add an optional indirection layer that can move swap entries
>> > between swap devices and add a virtual swap device for zswap in the
>> > kernel.
>> > (c) Add an optional indirection layer that can move entries between
>> > different swap backends. Swap backends would be zswap & swap devices
>> > for now. Zswap needs to implement swap entry management, swap
>> > counting, etc.
>> >
>> > Does this accurately summarize what we have discussed so far?
>> >


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
  2023-02-18 22:38 [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap Yosry Ahmed
                   ` (4 preceding siblings ...)
  2023-03-10  2:07 ` Luis Chamberlain
@ 2023-05-12  3:07 ` Yosry Ahmed
  5 siblings, 0 replies; 105+ messages in thread
From: Yosry Ahmed @ 2023-05-12  3:07 UTC (permalink / raw)
  To: lsf-pc, Johannes Weiner
  Cc: Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Minchan Kim, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 5184 bytes --]

On Sat, Feb 18, 2023 at 2:38 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Hello everyone,
>
> I would like to propose a topic for the upcoming LSF/MM/BPF in May
> 2023 about swap & zswap (hope I am not too late).
>
> ==================== Intro ====================
> Currently, using zswap is dependent on swapfiles in an unnecessary
> way. To use zswap, you need a swapfile configured (even if the space
> will not be used) and zswap is restricted by its size. When pages
> reside in zswap, the corresponding swap entry in the swapfile cannot
> be used, and is essentially wasted. We also go through unnecessary
> code paths when using zswap, such as finding and allocating a swap
> entry on the swapout path, or readahead in the swapin path. I am
> proposing a swapping abstraction layer that would allow us to remove
> zswap's dependency on swapfiles. This can be done by introducing a
> data structure between the actual swapping implementation (swapfiles,
> zswap) and the rest of the MM code.
>
> ==================== Objective ====================
> Enabling the use of zswap without a backing swapfile, which makes
> zswap useful for a wider variety of use cases. Also, when zswap is
> used with a swapfile, the pages in zswap do not use up space in the
> swapfile, so the overall swapping capacity increases.
>
> ==================== Idea ====================
> Introduce a data structure, which I currently call a swap_desc, as an
> abstraction layer between swapping implementation and the rest of MM
> code. Page tables & page caches would store a swap id (encoded as a
> swp_entry_t) instead of directly storing the swap entry associated
> with the swapfile. This swap id maps to a struct swap_desc, which acts
> as our abstraction layer. All MM code not concerned with swapping
> details would operate in terms of swap descs. The swap_desc can point
> to either a normal swap entry (associated with a swapfile) or a zswap
> entry. It can also include all non-backend specific operations, such
> as the swapcache (which would be a simple pointer in swap_desc), swap
> counting, etc. It creates a clear, nice abstraction layer between MM
> code and the actual swapping implementation.
>
> ==================== Benefits ====================
> This work enables using zswap without a backing swapfile and increases
> the swap capacity when zswap is used with a swapfile. It also creates
> a separation that allows us to skip code paths that don't make sense
> in the zswap path (e.g. readahead). We get to drop zswap's rbtree
> which might result in better performance (less lookups, less lock
> contention).
>
> The abstraction layer also opens the door for multiple cleanups (e.g.
> removing swapper address spaces, removing swap count continuation
> code, etc). Another nice cleanup that this work enables would be
> separating the overloaded swp_entry_t into two distinct types: one for
> things that are stored in page tables / caches, and for actual swap
> entries. In the future, we can potentially further optimize how we use
> the bits in the page tables instead of sticking everything into the
> current type/offset format.
>
> Another potential win here can be swapoff, which can be more practical
> by directly scanning all swap_desc's instead of going through page
> tables and shmem page caches.
>
> Overall zswap becomes more accessible and available to a wider range
> of use cases.
>
> ==================== Cost ====================
> The obvious downside of this is added memory overhead, specifically
> for users that use swapfiles without zswap. Instead of paying one byte
> (swap_map) for every potential page in the swapfile (+ swap count
> continuation), we pay the size of the swap_desc for every page that is
> actually in the swapfile, which I am estimating can be roughly around
> 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> scales with pages actually swapped out. For zswap users, it should be
> a win (or at least even) because we get to drop a lot of fields from
> struct zswap_entry (e.g. rbtree, index, etc).
>
> Another potential concern is readahead. With this design, we have no
> way to get a swap_desc given a swap entry (type & offset). We would
> need to maintain a reverse mapping, adding a little bit more overhead,
> or search all swapped out pages instead :). A reverse mapping might
> pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> memory).
>
> ==================== Bottom Line ====================
> It would be nice to discuss the potential here and the tradeoffs. I
> know that other folks using zswap (or interested in using it) may find
> this very useful. I am sure I am missing some context on why things
> are the way they are, and perhaps some obvious holes in my story.
> Looking forward to discussing this with anyone interested :)
>
> I think Johannes may be interested in attending this discussion, since
> a lot of ideas here are inspired by discussions I had with him :)

For the record, here are the slides that were presented for this
discussion (attached).

[-- Attachment #2: [LSF_MM_BPF 2023] Swap Abstraction _ Native Zswap.pdf --]
[-- Type: application/pdf, Size: 133190 bytes --]

^ permalink raw reply	[flat|nested] 105+ messages in thread

end of thread, other threads:[~2023-05-12  3:08 UTC | newest]

Thread overview: 105+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-18 22:38 [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap Yosry Ahmed
2023-02-19  4:31 ` Matthew Wilcox
2023-02-19  9:34   ` Yosry Ahmed
2023-02-28 23:22   ` Chris Li
2023-03-01  0:08     ` Matthew Wilcox
2023-03-01 23:22       ` Chris Li
2023-02-21 18:39 ` Yang Shi
2023-02-21 18:56   ` Yosry Ahmed
2023-02-21 19:26     ` Yang Shi
2023-02-21 19:46       ` Yosry Ahmed
2023-02-21 23:34         ` Yang Shi
2023-02-21 23:38           ` Yosry Ahmed
2023-02-22 16:57             ` Johannes Weiner
2023-02-22 22:46               ` Yosry Ahmed
2023-02-28  4:29                 ` Kalesh Singh
2023-02-28  8:09                   ` Yosry Ahmed
2023-02-28  4:54 ` Sergey Senozhatsky
2023-02-28  8:12   ` Yosry Ahmed
2023-02-28 23:29     ` Minchan Kim
2023-03-02  0:58       ` Yosry Ahmed
2023-03-02  1:25         ` Yosry Ahmed
2023-03-02 17:05         ` Chris Li
2023-03-02 17:47         ` Chris Li
2023-03-02 18:15           ` Johannes Weiner
2023-03-02 18:56             ` Chris Li
2023-03-02 18:23           ` Rik van Riel
2023-03-02 21:42             ` Chris Li
2023-03-02 22:36               ` Rik van Riel
2023-03-02 22:55                 ` Yosry Ahmed
2023-03-03  4:05                   ` Chris Li
2023-03-03  0:01                 ` Chris Li
2023-03-02 16:58       ` Chris Li
2023-03-01 10:44     ` Sergey Senozhatsky
2023-03-02  1:01       ` Yosry Ahmed
2023-02-28 23:11 ` Chris Li
2023-03-02  0:30   ` Yosry Ahmed
2023-03-02  1:00     ` Yosry Ahmed
2023-03-02 16:51     ` Chris Li
2023-03-03  0:33     ` Minchan Kim
2023-03-03  0:49       ` Yosry Ahmed
2023-03-03  1:25         ` Minchan Kim
2023-03-03 17:15           ` Yosry Ahmed
2023-03-09 12:48     ` Huang, Ying
2023-03-09 19:58       ` Chris Li
2023-03-09 20:19       ` Yosry Ahmed
2023-03-10  3:06         ` Huang, Ying
2023-03-10 23:14           ` Chris Li
2023-03-13  1:10             ` Huang, Ying
2023-03-15  7:41               ` Yosry Ahmed
2023-03-16  1:42                 ` Huang, Ying
2023-03-11  1:06           ` Yosry Ahmed
2023-03-13  2:12             ` Huang, Ying
2023-03-15  8:01               ` Yosry Ahmed
2023-03-16  7:50                 ` Huang, Ying
2023-03-17 10:19                   ` Yosry Ahmed
2023-03-17 18:19                     ` Chris Li
2023-03-17 18:23                       ` Yosry Ahmed
2023-03-20  2:55                     ` Huang, Ying
2023-03-20  6:25                       ` Chris Li
2023-03-23  0:56                         ` Huang, Ying
2023-03-23  6:46                           ` Chris Li
2023-03-23  6:56                             ` Huang, Ying
2023-03-23 18:28                               ` Chris Li
2023-03-23 18:40                                 ` Yosry Ahmed
2023-03-23 19:49                                   ` Chris Li
2023-03-23 19:54                                     ` Yosry Ahmed
2023-03-23 21:10                                       ` Chris Li
2023-03-24 17:28                                       ` Chris Li
2023-03-22  5:56                       ` Yosry Ahmed
2023-03-23  1:48                         ` Huang, Ying
2023-03-23  2:21                           ` Yosry Ahmed
2023-03-23  3:16                             ` Huang, Ying
2023-03-23  3:27                               ` Yosry Ahmed
2023-03-23  5:37                                 ` Huang, Ying
2023-03-23 15:18                                   ` Yosry Ahmed
2023-03-24  2:37                                     ` Huang, Ying
2023-03-24  7:28                                       ` Yosry Ahmed
2023-03-24 17:23                                         ` Chris Li
2023-03-27  1:23                                           ` Huang, Ying
2023-03-28  5:54                                             ` Yosry Ahmed
2023-03-28  6:20                                               ` Huang, Ying
2023-03-28  6:29                                                 ` Yosry Ahmed
2023-03-28  6:59                                                   ` Huang, Ying
2023-03-28  7:59                                                     ` Yosry Ahmed
2023-03-28 14:14                                                       ` Johannes Weiner
2023-03-28 19:59                                                         ` Yosry Ahmed
2023-03-28 21:22                                                           ` Chris Li
2023-03-28 21:30                                                             ` Yosry Ahmed
2023-03-28 20:50                                                       ` Chris Li
2023-03-28 21:01                                                         ` Yosry Ahmed
2023-03-28 21:32                                                           ` Chris Li
2023-03-28 21:44                                                             ` Yosry Ahmed
2023-03-28 22:01                                                               ` Chris Li
2023-03-28 22:02                                                                 ` Yosry Ahmed
2023-03-29  1:31                                                               ` Huang, Ying
2023-03-29  1:41                                                                 ` Yosry Ahmed
2023-03-29 16:04                                                                   ` Chris Li
2023-04-04  8:24                                                                     ` Huang, Ying
2023-04-04  8:10                                                                   ` Huang, Ying
2023-04-04  8:47                                                                     ` Yosry Ahmed
2023-04-06  1:40                                                                       ` Huang, Ying
2023-03-29 15:22                                                                 ` Chris Li
2023-03-10  2:07 ` Luis Chamberlain
2023-03-10  2:15   ` Yosry Ahmed
2023-05-12  3:07 ` Yosry Ahmed

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.