From: Kalesh Singh
Date: Mon, 27 Feb 2023 20:29:00 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Yosry Ahmed
Cc: Johannes Weiner, Yang Shi, lsf-pc@lists.linux-foundation.org, Linux-MM,
 Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
 Dan Streetman, Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton,
 Nhat Pham, Akilesh Kailash

On Wed, Feb 22, 2023 at 2:47 PM Yosry Ahmed wrote:
>
> On Wed, Feb 22, 2023 at 8:57 AM Johannes Weiner wrote:
> >
> > Hello,
> >
> > thanks for proposing this, Yosry. I'm very interested in this work.
> > Unfortunately, I won't be able to attend LSF/MM/BPF myself this time
> > around due to a scheduling conflict :(
>
> Ugh, would have been great to have you. I guess there might be a remote
> option, or we will end up discussing it on the mailing list eventually
> anyway.
>
> > On Tue, Feb 21, 2023 at 03:38:57PM -0800, Yosry Ahmed wrote:
> > > On Tue, Feb 21, 2023 at 3:34 PM Yang Shi wrote:
> > > >
> > > > On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed wrote:
> > > > >
> > > > > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi wrote:
> > > > > >
> > > > > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed wrote:
> > > > > > >
> > > > > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi wrote:
> > > > > > > >
> > > > > > > > Hi Yosry,
> > > > > > > >
> > > > > > > > Thanks for proposing this topic. I was thinking about this before,
> > > > > > > > but I didn't make too much progress due to some other distractions,
> > > > > > > > and I have a couple of follow-up questions about your design.
> > > > > > > > Please see the inline comments below.
> > > > > > >
> > > > > > > Great to see interested folks, thanks!
> > > > > > >
> > > > > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed wrote:
> > > > > > > > >
> > > > > > > > > Hello everyone,
> > > > > > > > >
> > > > > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in
> > > > > > > > > May 2023 about swap & zswap (hope I am not too late).
> > > > > > > > >
> > > > > > > > > ==================== Intro ====================
> > > > > > > > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > > > > > > > way. To use zswap, you need a swapfile configured (even if the
> > > > > > > > > space will not be used) and zswap is restricted by the swapfile's
> > > > > > > > > size. When pages reside in zswap, the corresponding swap entry in
> > > > > > > > > the swapfile cannot be used, and is essentially wasted. We also go
> > > > > > > > > through unnecessary code paths when using zswap, such as finding
> > > > > > > > > and allocating a swap entry on the swapout path, or readahead in
> > > > > > > > > the swapin path. I am proposing a swapping abstraction layer that
> > > > > > > > > would allow us to remove zswap's dependency on swapfiles. This can
> > > > > > > > > be done by introducing a data structure between the actual
> > > > > > > > > swapping implementation (swapfiles, zswap) and the rest of the MM
> > > > > > > > > code.
> > > > > > > > >
> > > > > > > > > ==================== Objective ====================
> > > > > > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > > > > > used with a swapfile, the pages in zswap do not use up space in
> > > > > > > > > the swapfile, so the overall swapping capacity increases.
> > > > > > > > >
> > > > > > > > > ==================== Idea ====================
> > > > > > > > > Introduce a data structure, which I currently call a swap_desc, as
> > > > > > > > > an abstraction layer between the swapping implementation and the
> > > > > > > > > rest of the MM code.
> > > > > > > > > Page tables & page caches would store a swap id (encoded as a
> > > > > > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > > > > > with the swapfile. This swap id maps to a struct swap_desc, which
> > > > > > > > > acts as our abstraction layer. All MM code not concerned with
> > > > > > > > > swapping details would operate in terms of swap descs. The
> > > > > > > > > swap_desc can point to either a normal swap entry (associated with
> > > > > > > > > a swapfile) or a zswap entry. It can also include all non-backend
> > > > > > > > > specific operations, such as the swapcache (which would be a
> > > > > > > > > simple pointer in swap_desc), swap counting, etc. It creates a
> > > > > > > > > clear, nice abstraction layer between MM code and the actual
> > > > > > > > > swapping implementation.
> > > > > > > >
> > > > > > > > How will the swap_desc be allocated? Dynamically or preallocated?
> > > > > > > > Is it 1:1 mapped to the swap slots on swap devices (whatever it is
> > > > > > > > backed by, for example, zswap, swap partition, swapfile, etc)?
> > > > > > >
> > > > > > > I imagine swap_desc's would be dynamically allocated when we need to
> > > > > > > swap something out. When allocated, a swap_desc would either point to
> > > > > > > a zswap_entry (if available), or a swap slot otherwise. In this case,
> > > > > > > it would be 1:1 mapped to swapped out pages, not the swap slots on
> > > > > > > devices.
> > > > > >
> > > > > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> > > > > > is used as the backing of zswap.
> > > > > >
> > > > > > > I know that it might not be ideal to make allocations on the reclaim
> > > > > > > path (although it would be a small-ish slab allocation so we might
> > > > > > > be able to get away with it), but otherwise we would have statically
> > > > > > > allocated swap_desc's for all swap slots on a swap device, even
> > > > > > > unused ones, which I imagine is too expensive. Also for things like
> > > > > > > zswap, it doesn't really make sense to preallocate at all.
> > > > > >
> > > > > > Yeah, it is not perfect to allocate memory in the reclamation path. We
> > > > > > do have such cases, but the fewer the better IMHO.
> > > > >
> > > > > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
> > > > > slab cache, idk if that makes sense, or if there is a way to tell slab
> > > > > to proactively refill a cache.
> > > > >
> > > > > I am open to suggestions here. I don't think we should/can preallocate
> > > > > the swap_desc's, and we cannot completely eliminate the allocations in
> > > > > the reclaim path. We can only try to minimize them through caching,
> > > > > etc. Right?
> > > >
> > > > Yeah, preallocation would not work. But I'm not sure whether caching
> > > > works well for this case either. I suppose you were thinking about
> > > > something similar to pcp: when the available number of elements is
> > > > lower than a threshold, refill the cache. It should work well with
> > > > moderate memory pressure. But I'm not sure how it would behave with
> > > > severe memory pressure, particularly when anonymous memory dominates
> > > > the memory usage. Or maybe dynamic allocation works well and we are
> > > > just over-engineering.
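
(Interjecting with a rough sketch to check my understanding of the object
being sized here. Everything below is my own guess for illustration -- the
field names and layout are not from the proposal, just one way the pieces
described above could fit together:)

/*
 * Illustrative sketch only. Assumes the pieces described in the proposal:
 * a backend reference (swapfile slot or zswap entry), a swapcache pointer,
 * and swap counting. swp_entry_t and struct folio are existing kernel
 * types; struct zswap_entry is zswap's internal entry.
 */
struct swap_desc {
        union {
                swp_entry_t slot;               /* backed by a swapfile slot */
                struct zswap_entry *zswap;      /* backed by a zswap entry */
        };
        struct folio *swapcache;                /* swapcache, a simple pointer here */
        unsigned int swap_count;                /* replaces swap_map + continuations */
        unsigned int flags;                     /* e.g. which backend is in use */
};

On 64-bit something of that shape lands in the ballpark of 24-32 bytes per
swapped-out page, i.e. the kind of small slab object being discussed here.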
> > >
> > > Yeah it would be interesting to look into whether the swap_desc
> > > allocation will be a bottleneck. Definitely something to look out for.
> > > I share your thoughts about wanting to do something about it but also
> > > not wanting to over-engineer it.
> >
> > I'm not too concerned by this. It's a PF_MEMALLOC allocation, meaning
> > it's not subject to watermarks. And the swapped page is freed right
> > afterwards. As long as the compression delta exceeds the size of
> > swap_desc, the process is a net reduction in allocated memory. For
> > regular swap, the only requirement is that swap_desc < page_size() :-)
> >
> > To put this into perspective, the zswap backends allocate backing
> > pages on-demand during reclaim. zsmalloc also kmallocs metadata in
> > that path. We haven't had any issues with this in production, even
> > under fairly severe memory pressure scenarios.
>
> Right. The only problem would be for pages that do not compress well
> in zswap, in which case we might not end up freeing memory. As you
> said, this is already happening today with zswap tho.
>
> > > > > > > > > ==================== Benefits ====================
> > > > > > > > > This work enables using zswap without a backing swapfile and
> > > > > > > > > increases the swap capacity when zswap is used with a swapfile.
> > > > > > > > > It also creates a separation that allows us to skip code paths
> > > > > > > > > that don't make sense in the zswap path (e.g. readahead). We get
> > > > > > > > > to drop zswap's rbtree, which might result in better performance
> > > > > > > > > (fewer lookups, less lock contention).
> > > > > > > > >
> > > > > > > > > The abstraction layer also opens the door for multiple cleanups
> > > > > > > > > (e.g. removing swapper address spaces, removing swap count
> > > > > > > > > continuation code, etc). Another nice cleanup that this work
> > > > > > > > > enables would be separating the overloaded swp_entry_t into two
> > > > > > > > > distinct types: one for things that are stored in page tables /
> > > > > > > > > caches, and one for actual swap entries. In the future, we can
> > > > > > > > > potentially further optimize how we use the bits in the page
> > > > > > > > > tables instead of sticking everything into the current
> > > > > > > > > type/offset format.
> > > > > > > > >
> > > > > > > > > Another potential win here can be swapoff, which can be made more
> > > > > > > > > practical by directly scanning all swap_desc's instead of going
> > > > > > > > > through page tables and shmem page caches.
> > > > > > > > >
> > > > > > > > > Overall zswap becomes more accessible and available to a wider
> > > > > > > > > range of use cases.
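
(A strawman from my side on the "two distinct types" cleanup mentioned
above, only to make sure I read it right -- these names are invented for
illustration and are not part of the proposal:)

/* Handle stored in page tables / page caches; resolves to a swap_desc. */
typedef struct { unsigned long val; } swp_id_t;

/* Position in a swapfile (type + offset); only the swapfile backend sees it. */
typedef struct { unsigned long val; } swp_slot_t;

If I understand correctly, generic MM code would then only ever deal in the
first type, and the second would stay internal to the swapfile code.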
> > > > > > > >
> > > > > > > > How will you handle zswap writeback? Zswap may write back to the
> > > > > > > > backing swap device IIUC. Assuming you have both zswap and a
> > > > > > > > swapfile, they are separate devices with this design, right? If
> > > > > > > > so, is the swapfile still the writeback target of zswap? And if it
> > > > > > > > is the writeback target, what if the swapfile is full?
> > > > > > >
> > > > > > > When we try to write back from zswap, we try to allocate a swap slot
> > > > > > > in the swapfile, and switch the swap_desc to point to that instead.
> > > > > > > The process would be transparent to the rest of MM (page tables,
> > > > > > > page cache, etc). If the swapfile is full, then there's really
> > > > > > > nothing we can do, reclaim fails and we start OOMing. I imagine this
> > > > > > > is the same behavior as today when swap is full; the difference
> > > > > > > would be that we have to fill both zswap AND the swapfile to get to
> > > > > > > the OOMing point, so an overall increased swapping capacity.
> > > > > >
> > > > > > When zswap is full but the swapfile is not yet, will swap try to write
> > > > > > back zswap to the swapfile to make more room for zswap, or just swap
> > > > > > out to the swapfile directly?
> > > > >
> > > > > The current behavior is that we swap to the swapfile directly in this
> > > > > case, which is far from ideal as we break LRU ordering by skipping
> > > > > zswap. I believe this should be addressed, but not as part of this
> > > > > effort. The work to make zswap respect the LRU ordering by writing
> > > > > back from zswap to make room can be done orthogonally to this effort.
> > > > > I believe Johannes was looking into this at some point.
> >
> > Actually, zswap already does LRU writeback when the pool is full. Nhat
> > Pham (CCd) recently upstreamed the LRU implementation for zsmalloc, so
> > as of today all backends support this.
> >
> > There are still a few quirks in zswap that can cause rejections which
> > bypass the LRU; those need fixing. But for the most part LRU writeback
> > to the backing file is the default behavior.
>
> Right, I was specifically talking about this case. When zswap is full
> it rejects incoming pages and they go directly to the swapfile, but we
> also kick off writeback, so this only happens until we do some LRU
> writeback. I guess I should have been more clear here. Thanks for
> clarifying and correcting.
>
> > > > Other than breaking LRU ordering, I'm also concerned about the
> > > > potential for deteriorating performance when writing/reading from the
> > > > swapfile when zswap is full. The zswap->swapfile order should be able
> > > > to maintain consistent performance for userspace.
> > >
> > > Right. This happens today anyway AFAICT: when zswap is full we just
> > > fall back to writing to the swapfile, so this would not be a behavior
> > > change. I agree it should be addressed anyway.
> > > >
> > > > But anyway I don't have data from real-life workloads to back the
> > > > above points. If you or Johannes could share some real data, that
> > > > would be very helpful for making the decisions.
> > >
> > > I actually don't, since we mostly run zswap without a backing
> > > swapfile. Perhaps Johannes might have some data on this (or anyone
> > > using zswap with a backing swapfile).
> >
> > Due to LRU writeback, the latency increase when zswap spills its
> > coldest entries into backing swap is fairly linear, as you may
> > expect. We have some limited production data on this from the
> > webservers.
> >
> > The biggest challenge in this space is properly sizing the zswap pool,
> > such that it's big enough to hold the warm set that the workload is
> > most latency-sensitive to, yet small enough that the cold pages get
> > spilled to backing swap. Nhat is working on improving this.
> >
> > That said, I think this discussion is orthogonal to the proposed
> > topic. zswap spills to backing swap in LRU order as of today. The
> > LRU/pool size tweaking is an optimization to get smarter zswap/swap
> > placement according to access frequency. The proposed swap descriptor
> > is an optimization to get better disk utilization, the ability to run
> > zswap without backing swap, and a dramatic speedup in swapoff time.
>
> Fully agree.
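
(To check that I follow the writeback flow described above, a purely
illustrative sketch of "allocate a swap slot and switch the swap_desc to
point to it". None of these helpers exist -- they are placeholders for the
steps described in the thread, and the fields match my earlier guess at a
swap_desc layout:)

/* Hypothetical writeback of one zswap-backed page to the backing swapfile. */
static int swap_desc_writeback(struct swap_desc *desc)
{
        swp_entry_t slot;
        int err;

        /* Reserve a slot in the backing swapfile. */
        err = swap_slot_alloc(&slot);
        if (err)
                return err;     /* swapfile full: writeback fails, entry stays in zswap */

        /* Decompress the zswap entry and write it to the reserved slot. */
        err = zswap_entry_write(desc->zswap, slot);
        if (err) {
                swap_slot_free(slot);
                return err;
        }

        /* Repoint the descriptor; the swap id seen by page tables is unchanged. */
        zswap_entry_free(desc->zswap);
        desc->slot = slot;
        desc->flags &= ~SWAP_DESC_ZSWAP;        /* invented flag, see earlier sketch */

        return 0;
}

The key property, as I read it, is that only the swap_desc changes; the swap
id stored in page tables and the page cache stays the same, which is what
makes the move transparent to the rest of MM.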
> > > > > > > >
> > > > > > > > Anyway, I'm interested in attending the discussion for this topic.
> > > > > > >
> > > > > > > Great! Looking forward to discussing this more!
> > > > > > >
> > > > > > > > > ==================== Cost ====================
> > > > > > > > > The obvious downside of this is added memory overhead,
> > > > > > > > > specifically for users that use swapfiles without zswap. Instead
> > > > > > > > > of paying one byte (swap_map) for every potential page in the
> > > > > > > > > swapfile (+ swap count continuation), we pay the size of the
> > > > > > > > > swap_desc for every page that is actually in the swapfile, which
> > > > > > > > > I am estimating can be roughly around 24 bytes or so, so maybe
> > > > > > > > > 0.6% of swapped out memory. The overhead only scales with pages
> > > > > > > > > actually swapped out. For zswap users, it should be a win (or at
> > > > > > > > > least even) because we get to drop a lot of fields from struct
> > > > > > > > > zswap_entry (e.g. rbtree, index, etc).
> >
> > Shifting the cost from O(swapspace) to O(swapped) could be a win for
> > many regular swap users too.
> >
> > There are legacy setups that provision 2*RAM worth of swap as an
> > emergency overflow that is then rarely used.
> >
> > We have setups that swap to disk more proactively, but we also
> > overprovision those in terms of swap space due to the cliff behavior
> > when swap fills up and the VM runs out of options.
> >
> > To make a fair comparison, you really have to take average swap
> > utilization into account. And I doubt that's very high.
>
> Yeah, I was looking for some data here, but it varies heavily based on
> the use case, so I opted to only state the overhead of the swap
> descriptor without directly comparing it to the current overhead.
>
> > In terms of worst-case behavior, +0.8% per swapped page doesn't sound
> > like a show-stopper to me. Especially when compared to zswap's current
> > O(swapped) waste of disk space.
>
> Yeah, for zswap users this should be a win on most/all fronts, even
> memory overhead, as we will end up trimming struct zswap_entry, which
> is also O(swapped) memory overhead. It should also make zswap
> available for more use cases: you don't need to provision and
> configure swap space, you just need to turn zswap on.
>
> > > > > > > > > Another potential concern is readahead. With this design, we
> > > > > > > > > have no way to get a swap_desc given a swap entry (type &
> > > > > > > > > offset). We would need to maintain a reverse mapping, adding a
> > > > > > > > > little bit more overhead, or search all swapped out pages
> > > > > > > > > instead :). A reverse mapping might pump the per-swapped page
> > > > > > > > > overhead to ~32 bytes (~0.8% of swapped out memory).
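
(Quick sanity check on the numbers above, plus a guess at what the reverse
mapping could be. With 4 KiB pages, 24 bytes per swapped-out page is
24/4096 ~= 0.6% and 32 bytes is 32/4096 ~= 0.8%, so the estimates line up.
For the reverse map itself I would imagine something xarray-based, keyed by
the swapfile slot -- again my own assumption, not something from the
proposal:)

/* Hypothetical reverse map: swapfile slot value -> swap_desc, for readahead. */
static DEFINE_XARRAY(swap_desc_rmap);

static int swap_desc_rmap_add(swp_entry_t slot, struct swap_desc *desc)
{
        return xa_err(xa_store(&swap_desc_rmap, slot.val, desc, GFP_KERNEL));
}

static struct swap_desc *swap_desc_rmap_lookup(swp_entry_t slot)
{
        return xa_load(&swap_desc_rmap, slot.val);
}

Readahead could then resolve neighboring slots back to their swap_descs via
the lookup, at the cost of roughly one extra pointer-sized entry per swapped
page, which seems to be where the ~32 byte figure comes from.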
> > > > > > > > >
> > > > > > > > > ==================== Bottom Line ====================
> > > > > > > > > It would be nice to discuss the potential here and the tradeoffs.
> > > > > > > > > I know that other folks using zswap (or interested in using it)
> > > > > > > > > may find this very useful. I am sure I am missing some context on
> > > > > > > > > why things are the way they are, and perhaps some obvious holes
> > > > > > > > > in my story. Looking forward to discussing this with anyone
> > > > > > > > > interested :)
> > > > > > > > >
> > > > > > > > > I think Johannes may be interested in attending this discussion,
> > > > > > > > > since a lot of ideas here are inspired by discussions I had with
> > > > > > > > > him :)

Hi everyone,

I came across this interesting proposal and I would like to participate in
the discussion. I think it will overlap with, and be useful for, some
projects we are currently planning in Android.

Thanks,
Kalesh

> > > > Thanks!