From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yosry Ahmed <yosryahmed@google.com>
To: "Huang, Ying"
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM,
 Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
 Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
 Andrew Morton, Aneesh Kumar K V, Wei Xu
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
Date: Fri, 17 Mar 2023 03:19:09 -0700
In-Reply-To: <87bkkt5e4o.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <87356e850j.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87y1o571aa.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o7ox762m.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87bkkt5e4o.fsf@yhuang6-desk2.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying wrote:
>
> Yosry Ahmed writes:
>
> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying wrote:
> >>
> >> Yosry Ahmed writes:
> >>
> >> >
> >> > My current idea is to have one xarray that stores the swap_descs
> >> > (which include swap_entry, swapcache, swap_count, etc.), and only for
> >> > rotating disks have an additional xarray that maps swap_entry ->
> >> > swap_desc for cluster readahead, assuming we can eliminate all other
> >> > situations requiring a reverse mapping.
> >> >
> >> > I am not sure how having separate xarrays helps. If we have one xarray,
> >> > we might as well save the other lookups and put everything in swap_desc.
> >> > In fact, this should improve the locking today as swapcache /
> >> > swap_count operations can be lockless or very lightly contended.
> >>
> >> The condition of the proposal is "reverse mapping cannot be avoided for
> >> enough situations". So, if reverse mapping (or cluster readahead) can be
> >> avoided for enough situations, I think your proposal is good. Otherwise,
> >> I propose to use 2 xarrays. You don't need another reverse mapping
> >> xarray, because you just need to read the next several swap_entry into
> >> the swap cache for cluster readahead. swap_desc isn't needed for
> >> cluster readahead.
> >
> > swap_desc would be needed for cluster readahead in my original
> > proposal, as the swap cache lives in swap_descs. Based on the current
> > implementation, we would need a reverse mapping (swap entry ->
> > swap_desc) in 3 situations:
> >
> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
> > failing, we fall back to trying to find swap entries that only have a
> > page in the swap cache (no references in page tables or page cache)
> > and free them. This would require a reverse mapping.
> >
> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
> > to get all swap_descs associated with that swapfile.
> >
> > 3) swap cluster readahead.
> >
> > For (1), I think we can drop the dependency on a reverse mapping if we
> > free swap entries once we swap a page in and add it to the swap cache,
> > even if the swap count does not drop to 0.
>
> Now, we will not drop the swap cache even if the swap count becomes 0 if
> swap space utilization is < 50%. Per my understanding, this avoids swap
> page writing for read accesses. So I don't think we can change this
> directly without the necessary discussion first.

Right.
I am not sure I understand why we do this today. Is it to save the
overhead of allocating a new swap entry if the page is swapped out again
soon? I am also not sure I understand the statement "this avoids swap
page writing for read accesses".

>
> > For (2), instead of scanning page tables and shmem page cache to find
> > swapped out pages for the swapfile, we can scan all swap_descs
> > instead, which should be more efficient. This is one of the proposal's
> > potential advantages.
>
> Good.
>
> > (3) is the one that would still need a reverse mapping with the
> > current proposal. Today we use swap cluster readahead for anon pages
> > if we have a spinning disk or vma readahead is disabled. For shmem, we
> > always use cluster readahead. If we can limit cluster readahead to
> > only rotating disks, then the reverse mapping can only be maintained
> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
> > reverse mapping for all swapfiles.
>
> For shmem, I think that it should be good to readahead based on shmem
> file offset instead of swap device offset.
>
> It's possible that some pages in the readahead window are from HDD while
> some other pages aren't. So it's a little hard to enable cluster read
> for HDD only. Anyway, it's not common to use HDD for swap now.
>
> >>
> >> > If the point is to store the swap_desc directly inside the xarray to
> >> > save 8 bytes, I am concerned that having multiple xarrays for
> >> > swapcache, swap_count, etc. will use more than that.
> >>
> >> The idea is to save the memory used by the reverse mapping xarray.
> >
> > I see.
> >
> >> >> >>
> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
> >> >> >> > continuation pages. If we do, it may end up being more.
> >> >> >> > We also
> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
> >> >> >> > element in a page requires continuation, we will allocate an entire
> >> >> >> > page. What I am trying to say is that to get an actual comparison you
> >> >> >> > need to also factor in the swap utilization and the rate of usage of
> >> >> >> > swap continuation. I don't know how to come up with a formula for this
> >> >> >> > tbh.
> >> >> >> >
> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> >> >> >> > overhead for people not using zswap, but it is not very awful.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> It seems what you really need is one bit of information to indicate
> >> >> >> >> this page is backed by zswap. Then you can have a separate pointer
> >> >> >> >> for the zswap entry.
> >> >> >> >
> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
> >> >> >> > to indicate whether the page is backed by a swapfile or zswap, it
> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
> >> >> >> > the shmem page cache, similar to swapoff.
> >> >> >> >
> >> >> >> > Instead, if we store some other key in swp_entry_t and use it to look up
> >> >> >> > the swp_entry_t or zswap_entry pointer, then that's essentially what
> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
> >> >> >> > storing it in another lookup structure.
> >> >> >>
> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> >> >> >> swap_entry in swap_desc, the added indirection appears to be another
> >> >> >> level of page table with 1 entry. Then, we may use a similar method
> >> >> >> as supporting systems with 2-level and 3-level page tables, like the code
> >> >> >> in include/asm-generic/pgtable-nopmd.h. But I haven't thought about
> >> >> >> this deeply.
> >> >> >
> >> >> > Can you expand further on this idea? I am not sure I fully understand.
> >> >>
> >> >> OK. The goal is to avoid the overhead if indirection isn't enabled via
> >> >> kconfig.
> >> >>
> >> >> If indirection isn't enabled, store swap_entry in the PTE directly.
> >> >> Otherwise, store the index of the swap_desc in the PTE. Different
> >> >> functions (e.g., to get/set swap_entry in the PTE) are implemented
> >> >> based on kconfig.
> >> >
> >> > I thought about this; the problem is that we will have multiple
> >> > implementations of multiple things. For example, swap_count without
> >> > the indirection layer lives in the swap_map (with continuation logic).
> >> > With the indirection layer, it lives in the swap_desc (or somewhere
> >> > else). Same for the swapcache. Even if we keep the swapcache in an
> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
> >> > the indirection is disabled, and by swap_desc (or similar) if the
> >> > indirection is enabled. I think maintaining separate implementations
> >> > for when the indirection is enabled/disabled would be adding too much
> >> > complexity.
> >> >
> >> > WDYT?
> >>
> >> If we go this way, swap cache and swap_count will always be indexed by
> >> swap_entry. swap_desc just provides an indirection to make it possible
> >> to move between swap devices.
> >>
> >> Why must we index swap cache and swap_count by swap_desc if indirection
> >> is enabled? Yes, we can save one xarray indexing if we do so, but I
> >> don't think the overhead of one xarray indexing is a showstopper.
> >>
> >> I think this can be one intermediate step towards your final target.
> >> The changes to the current implementation can be smaller.
> >
> > IIUC, the idea is to have two xarrays:
> > (a) xarray that stores a pointer to a struct containing swap_count and
> > swap cache.
> > (b) xarray that stores the underlying swap entry or zswap entry.
> >
> > When indirection is disabled:
> > page tables & page cache have swap entry directly like today, xarray
> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
> > mapping needed.
> >
> > In this case we have an extra overhead of 12-16 bytes (the struct
> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
> >
> > When indirection is enabled:
> > page tables & page cache have a swap id (or swap_desc index), xarray
> > (a) is indexed by swap id,
>
> xarray (a) is indexed by swap entry.

How so? With the indirection enabled, the page tables & page cache have
the swap id (or swap_desc index), which can point to a swap entry or a
zswap entry, and this can change when the page is moved between zswap &
swapfiles. How is xarray (a) indexed by the swap entry in this case?
Shouldn't it be indexed by the abstract swap id so that the writeback
from zswap is transparent?

>
> > xarray (b) is indexed by swap id as well
> > and contains the swap entry or zswap entry. Reverse mapping might be
> > needed.
>
> Reverse mapping isn't needed.

It would be needed if xarray (a) is indexed by the swap id. I am not
sure I understand how it can be indexed by the swap entry if the
indirection is enabled.

>
> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> > the xarray (b) entry + memory overhead from the 2nd xarray + reverse
> > mapping where needed.
> >
> > There is also the extra cpu overhead for an extra lookup in certain paths.
> >
> > Is my analysis correct? If yes, I agree that the original proposal is
> > good if the reverse mapping can be avoided in enough situations, and
> > that we should consider such alternatives otherwise.
> > As I mentioned above, I think it comes down to whether we can
> > completely restrict cluster readahead to rotating disks or not -- in
> > which case we need to decide what to do for shmem and for anon when
> > vma readahead is disabled.
>
> We can even have a minimal indirection implementation, where swap
> cache and swap_map[] are kept as they were before, and just one xarray
> is added. The xarray is indexed by swap id (or swap_desc index) to
> store the corresponding swap entry.
>
> When indirection is disabled, there is no extra overhead.
>
> When indirection is enabled, the extra overhead is just 8 bytes per
> swapped page.
>
> The basic migration support can be built on top of this.
>
> I think that this could be a baseline for indirection support. Then
> further optimization can be built on top of it step by step with
> supporting data.

I am not sure how this works with zswap. Currently the swap_map[]
implementation is specific to swapfiles; it does not work for zswap
unless we implement separate swap counting logic for zswap & swapfiles.
Same for the swapcache: it currently supports being indexed by a swap
entry, so it would need to support being indexed by a swap id, or have
a separate swap cache for zswap. Having separate implementations would
add complexity, and we would need to perform handoffs of the swap
count/cache when a page is moved from zswap to a swapfile.

>
> >>
> >> >> >> >>
> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
> >> >> >> >
> >> >> >> > My current intention is to reimplement the swapcache completely as a
> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
> >> >> >> > of the locking we do today if I get things right.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> > Another potential concern is readahead.
> >> >> >> >> > With this design, we have no
> >> >> >> >>
> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> >> >> >> use some modernization.
> >> >> >> >
> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
> >> >> >> > that for spinning disks, but I was wrong. We need it for other things
> >> >> >> > as well today: swapoff, and the case where we fail to find an empty
> >> >> >> > swap slot and start trying to free swap slots used only by the
> >> >> >> > swapcache. However, I think both of these cases can be fixed (I can
> >> >> >> > share more details if you want). If everything goes well we should
> >> >> >> > only need to maintain the reverse mapping (extra overhead above 24
> >> >> >> > bytes) for swap files on spinning disks for readahead.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> Looking forward to your discussion.
> >> >>
> >> >> Per my understanding, the indirection is to make it easy to move
> >> >> (swapped) pages among swap devices based on hot/cold. This is similar
> >> >> to the target of memory tiering. It appears that we can extend the
> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> >> >> Is it possible for zswap to be faster than some slow memory media?
> >> >
> >> > Agree with Chris that this may require a much larger overhaul. A slow
> >> > memory tier is still addressable memory; swap/zswap requires a page
> >> > fault to read the pages. I think (at least for now) there is a
> >> > fundamental difference. We want reclaim to eventually treat slow
> >> > memory & swap as just different tiers to place cold memory in with
> >> > different characteristics, but otherwise I think the swapping
> >> > implementation itself is very different. Am I missing something?
> >>
> >> Is it possible that zswap is faster than a really slow memory-addressable
> >> device backed by NAND? TBH, I don't have the answer.
> >
> > I am not sure either.
> >
> >>
> >> Anyway, do you need a way to describe the tiers of the swap devices?
> >> So, you can move the cold pages among the swap devices based on that?
> >
> > For now I think the "tiers" in this proposal are just zswap and normal
> > swapfiles. We can later extend it to support more explicit tiering.
>
> IIUC, in the original zswap implementation, there's a 1:1 relationship
> between zswap and the normal swapfile. But now, you make demoting among
> swap devices more general. Then we need some general way to specify which
> swap devices are fast and which are slow, and the demoting relationship
> among them. It can be memory tiers or something else, but we need one.

I think for this proposal, there are only 2 hardcoded tiers. Zswap is
fast, swapfile is slow. In the future, we can support more dynamic
tiering if the need arises.

>
> Best Regards,
> Huang, Ying