From: Yosry Ahmed
Date: Fri, 3 Mar 2023 09:15:10 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Minchan Kim
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Andrew Morton

On Thu, Mar 2, 2023 at 5:25 PM Minchan Kim wrote:
>
> On Thu, Mar 02, 2023 at 04:49:01PM -0800, Yosry Ahmed wrote:
> > On Thu, Mar 2, 2023 at 4:33 PM Minchan Kim wrote:
> > >
> > > On Wed, Mar 01, 2023 at 04:30:22PM -0800, Yosry Ahmed wrote:
> > > > On Tue, Feb 28, 2023 at 3:11 PM Chris Li wrote:
> > > > >
> > > > > Hi Yosry,
> > > > >
> > > > > On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > > > > > Hello everyone,
> > > > > >
> > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > > > 2023 about swap & zswap (hope I am not too late).
> > > > >
> > > > > I am very interested in participating in this discussion as well.
> > > >
> > > > That's great to hear!
> > > >
> > > > > > ==================== Objective ====================
> > > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > > > swapfile, so the overall swapping capacity increases.
> > > > >
> > > > > Agree.
> > > > >
> > > > > > ==================== Idea ====================
> > > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > > abstraction layer between the swapping implementation and the rest of
> > > > > > MM code. Page tables & page caches would store a swap id (encoded as
> > > > > > a swp_entry_t) instead of directly storing the swap entry associated
> > > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > >
> > > > > Can you provide a bit more detail? I am curious how this swap id
> > > > > maps to the swap_desc. Is the swp_entry_t cast into "struct
> > > > > swap_desc *", or does it go through some lookup table/tree?
> > > >
> > > > The swap id would be an index in a radix tree (aka xarray) that contains
> > > > a pointer to the swap_desc struct. This lookup should be free with
> > > > this design, as we also use the swap_desc to directly store the swap
> > > > cache pointer, so this lookup essentially replaces the swap cache lookup.
> > > >
> > > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > > entry. It can also include all non-backend-specific operations, such
> > > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > >
> > > > > Does the zswap entry still use the swap slot cache and swap_info_struct?
> > > >
> > > > In this design, no, it shouldn't.
> > > >
> > > > > > This work enables using zswap without a backing swapfile and increases
> > > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > > a separation that allows us to skip code paths that don't make sense
> > > > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree,
> > > > > > which might result in better performance (fewer lookups, less lock
> > > > > > contention).
> > > > > >
> > > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > > removing swapper address spaces, removing swap count continuation
> > > > > > code, etc.). Another nice cleanup that this work enables would be
> > > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > > things that are stored in page tables / caches, and one for actual
> > > > > > swap entries. In the future, we can potentially further optimize how
> > > > > > we use the bits in the page tables instead of sticking everything into
> > > > > > the current type/offset format.
> > > > >
> > > > > Looking forward to seeing more details in the upcoming discussion.
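
To make the swap id -> swap_desc mapping above concrete, here is a
minimal sketch of what the lookup could look like, assuming a single
global xarray. The names below (swap_desc_tree, swap_id_to_desc) are
made up for illustration and are not existing kernel API:

/* Hypothetical global xarray mapping swap ids to swap_desc pointers. */
static DEFINE_XARRAY(swap_desc_tree);

/*
 * The swap id stored in the page tables (encoded as a swp_entry_t)
 * indexes the xarray, which holds a pointer to the swap_desc. Since
 * the swap_desc also holds the swapcache pointer, this one lookup
 * would replace today's swap cache lookup.
 */
static struct swap_desc *swap_id_to_desc(swp_entry_t entry)
{
	return xa_load(&swap_desc_tree, swp_offset(entry));
}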
> > > > > >
> > > > > > ==================== Cost ====================
> > > > > > The obvious downside of this is added memory overhead, specifically
> > > > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > > > continuation), we pay the size of the swap_desc for every page that is
> > > > > > actually in the swapfile, which I estimate at roughly 24 bytes,
> > > > > > so maybe 0.6% of swapped-out memory. The overhead only
> > > > > > scales with pages actually swapped out. For zswap users, it should be
> > > > >
> > > > > Is there a way to avoid turning 1 byte into 24 bytes per swapped
> > > > > page? For the users that use swap but no zswap, this is pure overhead.
> > > >
> > > > That's what I could think of at this point. My idea was something like this:
> > > >
> > > > struct swap_desc {
> > > >         union { /* Use one bit to distinguish them */
> > > >                 swp_entry_t swap_entry;
> > > >                 struct zswap_entry *zswap_entry;
> > > >         };
> > > >         struct folio *swapcache;
> > > >         atomic_t swap_count;
> > > >         u32 id;
> > > > };
> > > >
> > > > Having the id in the swap_desc is convenient, as we can directly map
> > > > the swap_desc to a swp_entry_t to place in the page tables, but I
> > > > don't think it's necessary. Without it, the struct size is 20 bytes,
> > > > so I think the extra 4 bytes are okay to use anyway if the slab
> > > > allocator only allocates in multiples of 8 bytes.
> > > >
> > > > The idea here is to unify the swapcache and swap_count implementations
> > > > across the different swap backends (swapfiles, zswap, etc.), which would
> > > > create a better abstraction and reduce reinventing the wheel.
> > > >
> > > > We could reduce this to only 8 bytes and store only the swap/zswap entry,
> > > > but we still need the swap cache anyway, so we might as well store the
> > > > pointer in the struct and have a unified, lookup-free swapcache; so
> > > > really 16 bytes is the minimum.
> > > >
> > > > If we stop at 16 bytes, then we need to handle the swap count separately
> > > > in swapfiles and zswap. This is not the end of the world, but are the
> > > > 8 bytes worth this?
> > > >
> > > > Keep in mind that the current overhead is 1 byte O(max swap pages), not
> > > > O(swapped). Also, 1 byte is assuming we do not use the swap
> > >
> > > Just to share info:
> > > Android usually uses swap space fully most of the time via compacting
> > > background apps, so O(swapped) ~= O(max swap pages).
> >
> > Thanks for sharing this, that's definitely interesting.
> >
> > What percentage of memory is usually provisioned as swap in such
> > cases? Would you consider an extra overhead of ~8M per 1G of swapped
> > memory particularly unacceptable?
>
> Vendors have different sizes, usually decided by how many apps they
> want to cache at the cost of trouble for the foreground app. I can't
> speak for all vendors, but half of DRAM may be a common size on the
> market.
>
> I recall that ~80M less free memory made a huge difference in the jank
> ratio when launching a memory-hungry app under memory pressure, so we
> needed to cut down the additional memory in the end.
>
> I cannot say whether the ~8M per 1G is acceptable or not, since it
> depends on the workload and the device's RAM size (I worry more about
> entry-level devices), so I am not sure what code complexity we will
> end up with. But considering that swap metadata is one of the largest
> areas of memory consumption (struct page is working toward shrinking -
> folio, Yay!), IMHO it's worthwhile to see whether we can take on the
> complexity.

We can do something similar to what Chris suggested, and have the swap
entries act the same way as today if frontswap/zswap is not configured.
If frontswap/zswap is configured, all swap entries go through frontswap;
we can have an xarray there (instead of today's rbtree) that points to
either a zswap entry or a swap entry. So the indirection essentially
lives in frontswap.

A few problems with this alternative approach vs. the initial proposal:

a) The abstraction layer is within frontswap/zswap, so we cannot move
the writeback LRU logic outside zswap, as you mentioned before, to
support different combinations of swapping backends.

b) Today we do one lookup in the fault path (in the swapcache) for
swapfiles, and two lookups for zswap (an extra lookup in the zswap
rbtree). The initial proposal allows us to remove the rbtree and do a
single lookup in both cases to get the swap_desc, in which the
swapcache and zswap_entry are just pointers we can access directly (see
the sketch at the end of this mail). Furthermore, adding the swapped-in
page to the swapcache becomes an O(1) operation instead of storing into
an xarray (IIUC we did an optimization to skip the swapcache for the
zram single-mapping fault to avoid this). With the alternative
approach, we have to do two lookups in the fault path if zswap is
enabled (one in the swapcache, and one in the zswap tree to get the
underlying zswap_entry / swap entry).

c) The initial proposal allows us to simplify the swap counting and
swapcache logic, and perhaps reduce the locking we have to do (one lock
to update the swapcache, one lock to update the swap count). I even
think we might be able to do such updates locklessly with atomic
operations on the swap_desc (see the second sketch at the end of this
mail). With the alternative approach, we need swap counting logic in
swapfiles (the one we have today) and another swap counting logic in
zswap, and when a page is moved from zswap to a swapfile, we may need
to hand over the swap count. Instead of simplifying the complicated
swap counting logic, we would double down by having two implementations.

I am not saying that one approach or the other is worse; I am just
saying there is more to it than complexity vs. memory savings.

>
> Thanks.
>
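
To make point (b) concrete, here is a rough sketch of what the fault
path could look like under the initial proposal. This is only an
illustration under the proposal's assumptions: swap_desc_is_zswap(),
zswap_load_folio() and swapfile_read_folio() are hypothetical helpers,
and swap_id_to_desc() is the hypothetical xarray lookup sketched
earlier:

/* One xarray lookup, then direct pointer dereferences. */
static struct folio *swap_fault_lookup(swp_entry_t entry)
{
	struct swap_desc *desc = swap_id_to_desc(entry);

	/*
	 * The swapcache is a direct pointer in the swap_desc,
	 * not a second tree lookup.
	 */
	if (desc->swapcache)
		return desc->swapcache;

	/*
	 * Not cached: read from whichever backend the descriptor
	 * points at; one tag bit could distinguish the union members.
	 */
	if (swap_desc_is_zswap(desc))
		return zswap_load_folio(desc->zswap_entry);

	return swapfile_read_folio(desc->swap_entry);
}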
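
Similarly for point (c), a sketch of how swap count updates could
become lockless atomics on the swap_desc, using the atomic_t swap_count
field from the struct above (the helpers are hypothetical, not existing
kernel API):

static void swap_desc_get(struct swap_desc *desc)
{
	atomic_inc(&desc->swap_count);
}

/*
 * Returns true when the last reference is dropped, i.e. when the
 * descriptor and its backend slot (swap slot or zswap entry) can be
 * freed. No swap_map byte array, no swap count continuation, and no
 * lock taken to update the count.
 */
static bool swap_desc_put(struct swap_desc *desc)
{
	return atomic_dec_and_test(&desc->swap_count);
}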