From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 1 Mar 2023 16:30:22 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Chris Li
Cc: lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM,
 Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
 Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
 Minchan Kim, Andrew Morton
On Tue, Feb 28, 2023 at 3:11 PM Chris Li wrote:
>
> Hi Yosry,
>
> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > Hello everyone,
> >
> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > 2023 about swap & zswap (hope I am not too late).
>
> I am very interested in participating in this discussion as well.

That's great to hear!
>
> > ==================== Objective ====================
> > Enabling the use of zswap without a backing swapfile, which makes
> > zswap useful for a wider variety of use cases. Also, when zswap is
> > used with a swapfile, the pages in zswap do not use up space in the
> > swapfile, so the overall swapping capacity increases.
>
> Agree.
>
> > ==================== Idea ====================
> > Introduce a data structure, which I currently call a swap_desc, as
> > an abstraction layer between the swapping implementation and the
> > rest of the MM code. Page tables & page caches would store a swap
> > id (encoded as a swp_entry_t) instead of directly storing the swap
> > entry associated with the swapfile. This swap id maps to a struct
> > swap_desc, which acts
>
> Can you provide a bit more detail? I am curious how this swap id
> maps into the swap_desc. Is the swp_entry_t cast into "struct
> swap_desc *", or does it go through some lookup table/tree?

The swap id would be an index into a radix tree (aka xarray) that
contains a pointer to the swap_desc struct. This lookup should be
essentially free with this design: we also use the swap_desc to
directly store the swap cache pointer, so this lookup replaces the
swap cache lookup we do today.

> > as our abstraction layer. All MM code not concerned with swapping
> > details would operate in terms of swap descs. The swap_desc can
> > point to either a normal swap entry (associated with a swapfile)
> > or a zswap entry. It can also include all non-backend-specific
> > operations, such as the swapcache (which would be a simple pointer
> > in swap_desc), swap
>
> Does the zswap entry still use the swap slot cache and
> swap_info_struct?

In this design, no, it shouldn't.

> > This work enables using zswap without a backing swapfile and
> > increases the swap capacity when zswap is used with a swapfile. It
> > also creates a separation that allows us to skip code paths that
> > don't make sense in the zswap path (e.g. readahead).
> > We get to drop zswap's rbtree, which might result in better
> > performance (fewer lookups, less lock contention).
> >
> > The abstraction layer also opens the door for multiple cleanups
> > (e.g. removing swapper address spaces, removing swap count
> > continuation code, etc). Another nice cleanup that this work
> > enables would be separating the overloaded swp_entry_t into two
> > distinct types: one for things that are stored in page tables /
> > caches, and one for actual swap entries. In the future, we can
> > potentially further optimize how we use the bits in the page
> > tables instead of sticking everything into the current type/offset
> > format.
>
> Looking forward to seeing more details in the upcoming discussion.
>
> > ==================== Cost ====================
> > The obvious downside of this is added memory overhead,
> > specifically for users that use swapfiles without zswap. Instead
> > of paying one byte (swap_map) for every potential page in the
> > swapfile (+ swap count continuation), we pay the size of the
> > swap_desc for every page that is actually in the swapfile, which I
> > am estimating can be roughly around 24 bytes or so, so maybe 0.6%
> > of swapped out memory. The overhead only scales with pages
> > actually swapped out. For zswap users, it should be
>
> Is there a way to avoid turning 1 byte into 24 bytes per swapped
> page? For the users that use swap but no zswap, this is pure
> overhead.

That's what I could think of at this point. My idea was something
like this:

struct swap_desc {
	union { /* Use one bit to distinguish them */
		swp_entry_t swap_entry;
		struct zswap_entry *zswap_entry;
	};
	struct folio *swapcache;
	atomic_t swap_count;
	u32 id;
};

Having the id in the swap_desc is convenient as we can directly map
the swap_desc to a swp_entry_t to place in the page tables, but I
don't think it's necessary.
Without it, the struct size is 20 bytes, so I think the extra 4 bytes
are okay to use anyway if the slab allocator only allocates in
multiples of 8 bytes.

The idea here is to unify the swapcache and swap_count implementation
between different swap backends (swapfiles, zswap, etc), which would
create a better abstraction and reduce reinventing the wheel.

We can reduce to only 8 bytes and only store the swap/zswap entry,
but we still need the swap cache anyway, so we might as well store
the pointer in the struct and have a unified lookup-free swapcache;
so really 16 bytes is the minimum. If we stop at 16 bytes, then we
need to handle the swap count separately in swapfiles and zswap. This
is not the end of the world, but are the 8 bytes worth this?

Keep in mind that the current overhead is 1 byte per slot, i.e.
O(max swap pages), not O(swapped pages). Also, 1 byte is assuming we
do not use the swap continuation pages. If we do, it may end up being
more. We also allocate continuations in full 4k pages, so even if one
swap_map element in a page requires continuation, we will allocate an
entire page. What I am trying to say is that to get an actual
comparison you need to also factor in the swap utilization and the
rate of usage of swap continuation. I don't know how to come up with
a formula for this tbh.

Also, like Johannes said, the worst case overhead (32 bytes if you
count the reverse mapping) is 0.8% of swapped memory, aka 8M for
every 1G swapped. It doesn't sound *very* bad. I understand that it
is pure overhead for people not using zswap, but it is not very
awful.

> It seems what you really need is one bit of information to indicate
> this page is backed by zswap. Then you can have a separate pointer
> for the zswap entry.

If you use one bit in swp_entry_t (or one of the available swap
types) to indicate whether the page is backed by a swapfile or by
zswap, it doesn't really work: we lose the indirection layer. How do
we move a page from zswap to the swapfile?
We would need to go update the page tables and the shmem page cache,
similar to swapoff. Instead, if we store a key in swp_entry_t and use
it to look up the swp_entry_t or zswap_entry pointer, then that's
essentially what the swap_desc does. It just goes the extra mile of
unifying the swapcache as well and storing it directly in the
swap_desc instead of storing it in another lookup structure.

> Depending on how much you are going to reuse the swap cache, you
> might need to have something like a swap_info_struct to keep the
> locks happy.

My current intention is to reimplement the swapcache completely as a
pointer in struct swap_desc. This would eliminate this need and a lot
of the locking we do today, if I get things right.

> > Another potential concern is readahead. With this design, we have no
>
> Readahead is for spinning disk :-) Even a normal swap file with an
> SSD can use some modernization.

Yeah, I initially thought we would only need the swp_entry_t ->
swap_desc reverse mapping for readahead, and that we could maintain
it only for spinning disks, but I was wrong. We need it for other
things as well today: swapoff, and the case where we are trying to
find an empty swap slot and start trying to free swap slots used only
by the swapcache. However, I think both of these cases can be fixed
(I can share more details if you want). If everything goes well, we
should only need to maintain the reverse mapping (extra overhead
above 24 bytes) for swap files on spinning disks, for readahead.

> Looking forward to your discussion.
>
> Chris