From: "Huang, Ying" <ying.huang@intel.com>
To: Yosry Ahmed
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner,
 Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
 Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
 Minchan Kim, Andrew Morton, Aneesh Kumar K V, Wei Xu
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
References: <87356e850j.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87y1o571aa.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 13 Mar 2023 10:12:17 +0800
In-Reply-To: (Yosry Ahmed's message of "Fri, 10 Mar 2023 17:06:35 -0800")
Message-ID: <87o7ox762m.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yosry Ahmed writes:

> On Thu, Mar 9, 2023 at 7:07 PM Huang, Ying wrote:
>>
>> Yosry Ahmed writes:
>>
>> > On Thu, Mar 9, 2023 at 4:49 AM Huang, Ying wrote:
>> >>
>> >> Yosry Ahmed writes:
>> >>
>> >> > On Tue, Feb 28, 2023 at 3:11 PM Chris Li wrote:
>> >> >>
>> >> >> Hi Yosry,
>> >> >>
>> >> >> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
>> >> >> > Hello everyone,
>> >> >> >
>> >> >> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
>> >> >> > 2023 about swap & zswap (hope I am not too late).
>> >> >>
>> >> >> I am very interested in participating in this discussion as well.
>> >> >
>> >> > That's great to hear!
>> >> >
>> >> >>
>> >> >> > ==================== Objective ====================
>> >> >> > Enabling the use of zswap without a backing swapfile, which makes
>> >> >> > zswap useful for a wider variety of use cases. Also, when zswap is
>> >> >> > used with a swapfile, the pages in zswap do not use up space in the
>> >> >> > swapfile, so the overall swapping capacity increases.
>> >> >>
>> >> >> Agree.
>> >> >>
>> >> >> >
>> >> >> > ==================== Idea ====================
>> >> >> > Introduce a data structure, which I currently call a swap_desc, as an
>> >> >> > abstraction layer between the swapping implementation and the rest of
>> >> >> > MM code. Page tables & page caches would store a swap id (encoded as a
>> >> >> > swp_entry_t) instead of directly storing the swap entry associated
>> >> >> > with the swapfile. This swap id maps to a struct swap_desc, which acts
>> >> >>
>> >> >> Can you provide a bit more detail? I am curious how this swap id
>> >> >> maps into the swap_desc. Is the swp_entry_t cast into "struct
>> >> >> swap_desc*", or does it go through some lookup table/tree?
>> >> >
>> >> > The swap id would be an index in a radix tree (aka xarray), which
>> >> > contains a pointer to the swap_desc struct. This lookup should be free
>> >> > with this design, as we also use swap_desc to directly store the swap
>> >> > cache pointer, so this lookup essentially replaces the swap cache lookup.
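(To make the lookup being described concrete, a minimal sketch; the
global xarray name swap_descs is made up:

        /*
         * "id" is the swap id stored in page tables / page caches,
         * encoded as a swp_entry_t.  One xa_load() replaces what is
         * a swap cache lookup today.
         */
        struct swap_desc *desc = xa_load(&swap_descs, swp_offset(id));

so the indirection costs a single, RCU-protected xarray walk.)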
>> >> >
>> >> >>
>> >> >> > as our abstraction layer. All MM code not concerned with swapping
>> >> >> > details would operate in terms of swap descs. The swap_desc can point
>> >> >> > to either a normal swap entry (associated with a swapfile) or a zswap
>> >> >> > entry. It can also include all non-backend-specific operations, such
>> >> >> > as the swapcache (which would be a simple pointer in swap_desc), swap
>> >> >>
>> >> >> Does the zswap entry still use the swap slot cache and swap_info_struct?
>> >> >
>> >> > In this design no, it shouldn't.
>> >> >
>> >> >>
>> >> >> > This work enables using zswap without a backing swapfile and increases
>> >> >> > the swap capacity when zswap is used with a swapfile. It also creates
>> >> >> > a separation that allows us to skip code paths that don't make sense
>> >> >> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree,
>> >> >> > which might result in better performance (fewer lookups, less lock
>> >> >> > contention).
>> >> >> >
>> >> >> > The abstraction layer also opens the door for multiple cleanups (e.g.
>> >> >> > removing swapper address spaces, removing swap count continuation
>> >> >> > code, etc). Another nice cleanup that this work enables would be
>> >> >> > separating the overloaded swp_entry_t into two distinct types: one for
>> >> >> > things that are stored in page tables / caches, and one for actual swap
>> >> >> > entries. In the future, we can potentially further optimize how we use
>> >> >> > the bits in the page tables instead of sticking everything into the
>> >> >> > current type/offset format.
>> >> >>
>> >> >> Looking forward to seeing more details in the upcoming discussion.
>> >> >> >
>> >> >> > ==================== Cost ====================
>> >> >> > The obvious downside of this is added memory overhead, specifically
>> >> >> > for users that use swapfiles without zswap. Instead of paying one byte
>> >> >> > (swap_map) for every potential page in the swapfile (+ swap count
>> >> >> > continuation), we pay the size of the swap_desc for every page that is
>> >> >> > actually in the swapfile, which I am estimating can be roughly around
>> >> >> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
>> >> >> > scales with pages actually swapped out. For zswap users, it should be
>> >> >>
>> >> >> Is there a way to avoid turning 1 byte into 24 bytes per swapped
>> >> >> page? For the users that use swap but no zswap, this is pure overhead.
>> >> >
>> >> > That's what I could think of at this point. My idea was something like this:
>> >> >
>> >> > struct swap_desc {
>> >> >         union { /* Use one bit to distinguish them */
>> >> >                 swp_entry_t swap_entry;
>> >> >                 struct zswap_entry *zswap_entry;
>> >> >         };
>> >> >         struct folio *swapcache;
>> >> >         atomic_t swap_count;
>> >> >         u32 id;
>> >> > };
>> >> >
>> >> > Having the id in the swap_desc is convenient as we can directly map
>> >> > the swap_desc to a swp_entry_t to place in the page tables, but I
>> >> > don't think it's necessary. Without it, the struct size is 20 bytes,
>> >> > so I think the extra 4 bytes are okay to use anyway if the slab
>> >> > allocator only allocates multiples of 8 bytes.
>> >> >
>> >> > The idea here is to unify the swapcache and swap_count implementation
>> >> > between different swap backends (swapfiles, zswap, etc), which would
>> >> > create a better abstraction and reduce reinventing the wheel.
>> >> >
>> >> > We can reduce to only 8 bytes and only store the swap/zswap entry, but
>> >> > we still need the swap cache anyway, so we might as well just store the
>> >> > pointer in the struct and have a unified lookup-free swapcache, so
>> >> > really 16 bytes is the minimum.
>> >> >
>> >> > If we stop at 16 bytes, then we need to handle swap count separately
>> >> > in swapfiles and zswap. This is not the end of the world, but are the
>> >> > 8 bytes worth this?
>> >>
>> >> If my understanding is correct, in the current implementation we need
>> >> one swap cache pointer per swapped out page too. Even after calling
>> >> __delete_from_swap_cache(), we store the "shadow" entry there. Although
>> >> it's possible to implement shadow entry reclaiming like that for the
>> >> file cache shadow entry (workingset_shadow_shrinker), we haven't done
>> >> that yet. And it appears that we can live with that. So, in the current
>> >> implementation, for each swapped out page, we use 9 bytes. If so, the
>> >> memory usage ratio is 24 / 9 = 2.667, still not trivial, but not as
>> >> horrible as 24 / 1 = 24.
>> >
>> > Unfortunately it's a little bit more. 24 is the extra overhead.
>> >
>> > Today we have an xarray entry for each swapped out page, that either
>> > has the swapcache pointer or the shadow entry.
>> >
>> > With this implementation, we have an xarray entry for each swapped out
>> > page, that has a pointer to the swap_desc.
>> >
>> > Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.
>>
>> OK. I see. We can only hold 8 bytes for each xarray entry. To save
>> memory usage, we can allocate multiple swap_descs (e.g., 16) for each
>> xarray entry. Then the memory usage of the xarray becomes 1/N.
>>
>> > For rotating disks, this might be even higher: (8 + 32) / (8 + 1) = 4.444
>> >
>> > This is because we need to maintain a reverse mapping between
>> > swp_entry_t and the swap_desc to use for cluster readahead. I am
>> > assuming we can limit cluster readahead to rotating disks only.
>>
>> If reverse mapping cannot be avoided in enough situations, it's better
>> to only keep swap_entry in swap_desc, and create another xarray indexed
>> by swap_entry that stores swap_cache, swap_count, etc.
>
> My current idea is to have one xarray that stores the swap_descs
> (which include swap_entry, swapcache, swap_count, etc), and only for
> rotating disks have an additional xarray that maps swap_entry ->
> swap_desc for cluster readahead, assuming we can eliminate all other
> situations requiring a reverse mapping.
>
> I am not sure how having separate xarrays helps. If we have one xarray,
> we might as well save the other lookups and put everything in swap_desc.
> In fact, this should improve the locking today as swapcache /
> swap_count operations can be lockless or very lightly contended.

The condition of the proposal is "reverse mapping cannot be avoided in
enough situations". So, if reverse mapping (or cluster readahead) can
be avoided in enough situations, I think your proposal is good.
Otherwise, I propose to use 2 xarrays. You don't need another reverse
mapping xarray, because you just need to read the next several
swap_entry values into the swap cache for cluster readahead. swap_desc
isn't needed for cluster readahead.
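Something like the following sketch is what I mean; swap_ra_read_one()
is a made-up placeholder for whatever allocates a folio and submits the
read for one slot:

        /*
         * Cluster readahead sketch: read the slots that follow "entry"
         * on the same swap device.  Only consecutive swp_entry_t values
         * are needed, so no swap_entry -> swap_desc reverse mapping is
         * required.
         */
        static void swap_cluster_readahead(swp_entry_t entry, int nr)
        {
                unsigned long offset = swp_offset(entry);
                int i;

                for (i = 1; i <= nr; i++)
                        swap_ra_read_one(swp_entry(swp_type(entry), offset + i));
        }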
> If the point is to store the swap_desc directly inside the xarray to
> save 8 bytes, I am concerned that having multiple xarrays for
> swapcache, swap_count, etc will use more than that.

The idea is to save the memory used by the reverse mapping xarray.
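Roughly like this (a sketch; swap_slot_state and the field layout are
made up):

        /*
         * 1st xarray: swap id -> swap_desc.  Only the backing entry is
         * kept, so sizeof(struct swap_desc) == 8.
         */
        struct swap_desc {
                union { /* one bit distinguishes the two */
                        swp_entry_t swap_entry;
                        struct zswap_entry *zswap_entry;
                };
        };

        /*
         * 2nd xarray: swap_entry -> per-slot state (swap cache folio or
         * shadow entry, swap count).  Because it is indexed by
         * swap_entry, it also serves cluster readahead, so no third
         * reverse-mapping xarray is needed.
         */
        struct swap_slot_state {
                struct folio *swapcache;
                atomic_t swap_count;
        };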
>>
>> >> > Keep in mind that the current overhead is 1 byte O(max swap pages),
>> >> > not O(swapped). Also, 1 byte is assuming we do not use the swap
>> >> > continuation pages. If we do, it may end up being more. We also
>> >> > allocate continuation in full 4k pages, so even if one swap_map
>> >> > element in a page requires continuation, we will allocate an entire
>> >> > page. What I am trying to say is that to get an actual comparison you
>> >> > need to also factor in the swap utilization and the rate of usage of
>> >> > swap continuation. I don't know how to come up with a formula for this
>> >> > tbh.
>> >> >
>> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> >> > overhead for people not using zswap, but it is not very awful.
>> >> >
>> >> >>
>> >> >> It seems what you really need is one bit of information to indicate
>> >> >> that this page is backed by zswap. Then you can have a separate
>> >> >> pointer for the zswap entry.
>> >> >
>> >> > If you use one bit in swp_entry_t (or one of the available swap types)
>> >> > to indicate whether the page is backed with a swapfile or zswap, it
>> >> > doesn't really work. We lose the indirection layer. How do we move the
>> >> > page from zswap to swapfile? We need to go update the page tables and
>> >> > the shmem page cache, similar to swapoff.
>> >> >
>> >> > Instead, if we store a key in swp_entry_t and use it to look up the
>> >> > swap entry or zswap_entry pointer, then that's essentially what
>> >> > the swap_desc does. It just goes the extra mile of unifying the
>> >> > swapcache as well and storing it directly in the swap_desc instead of
>> >> > storing it in another lookup structure.
>> >>
>> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> >> the swap_entry in swap_desc, the added indirection is like another
>> >> level of page table with 1 entry. Then, we may use a method similar to
>> >> the one used to support systems with 2-level and 3-level page tables,
>> >> like the code in include/asm-generic/pgtable-nopmd.h. But I haven't
>> >> thought about this deeply.
>> >
>> > Can you expand further on this idea? I am not sure I fully understand.
>>
>> OK. The goal is to avoid the overhead if indirection isn't enabled via
>> kconfig.
>>
>> If indirection isn't enabled, store swap_entry in the PTE directly.
>> Otherwise, store the index of the swap_desc in the PTE. Different
>> functions (e.g., to get/set swap_entry in the PTE) are implemented
>> based on kconfig.
>
> I thought about this; the problem is that we will have multiple
> implementations of multiple things. For example, swap_count without
> the indirection layer lives in the swap_map (with continuation logic).
> With the indirection layer, it lives in the swap_desc (or somewhere
> else). Same for the swapcache. Even if we keep the swapcache in an
> xarray and not inside swap_desc, it would be indexed by swap_entry if
> the indirection is disabled, and by swap_desc (or similar) if the
> indirection is enabled. I think maintaining separate implementations
> for when the indirection is enabled/disabled would be adding too much
> complexity.
>
> WDYT?

If we go this way, the swap cache and swap_count will always be indexed
by swap_entry. swap_desc just provides an indirection to make it
possible to move between swap devices. Why must we index the swap cache
and swap_count by swap_desc if indirection is enabled? Yes, we can save
one xarray indexing if we do so, but I don't think the overhead of one
xarray indexing is a showstopper.

I think this can be one intermediate step towards your final target.
The changes to the current implementation can be smaller.
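A sketch of the kconfig part, in the style of pgtable-nopmd.h (the
config symbol and the helper names are made up):

        #ifdef CONFIG_SWAP_DESC
        /*
         * The PTE stores a swap_desc id; one extra lookup gets the
         * backing entry.
         */
        static inline swp_entry_t pte_to_swap_entry(pte_t pte)
        {
                struct swap_desc *desc = swap_desc_lookup(pte_to_swp_entry(pte));

                return desc->swap_entry;
        }
        #else
        /*
         * The PTE stores the swap entry directly, as today; zero
         * overhead when the indirection is compiled out.
         */
        static inline swp_entry_t pte_to_swap_entry(pte_t pte)
        {
                return pte_to_swp_entry(pte);
        }
        #endif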
>>
>> >> >> Depending on how much you are going to reuse the swap cache, you
>> >> >> might need to have something like a swap_info_struct to keep the
>> >> >> locks happy.
>> >> >
>> >> > My current intention is to reimplement the swapcache completely as a
>> >> > pointer in struct swap_desc. This would eliminate this need and a lot
>> >> > of the locking we do today if I get things right.
>> >> >
>> >> >>
>> >> >> > Another potential concern is readahead. With this design, we have no
>> >> >>
>> >> >> Readahead is for spinning disk :-) Even a normal swap file with an
>> >> >> SSD can use some modernization.
>> >> >
>> >> > Yeah, I initially thought we would only need the swp_entry_t ->
>> >> > swap_desc reverse mapping for readahead, and that we could store that
>> >> > only for spinning disks, but I was wrong. We need it for other things
>> >> > as well today: swapoff, and when trying to find an empty swap slot we
>> >> > start trying to free swap slots used only by the swapcache. However, I
>> >> > think both of these cases can be fixed (I can share more details if
>> >> > you want). If everything goes well we should only need to maintain the
>> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> >> > spinning disks for readahead.
>> >> >
>> >> >> Looking forward to your discussion.
>>
>> Per my understanding, the indirection is to make it easy to move
>> (swapped) pages among swap devices based on hot/cold. This is similar
>> to the target of memory tiering. It appears that we can extend the
>> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
>> Is it possible for zswap to be faster than some slow memory media?
>
> Agree with Chris that this may require a much larger overhaul. A slow
> memory tier is still addressable memory, while swap/zswap requires a
> page fault to read the pages. I think (at least for now) there is a
> fundamental difference. We want reclaim to eventually treat slow
> memory & swap as just different tiers to place cold memory in with
> different characteristics, but otherwise I think the swapping
> implementation itself is very different. Am I missing something?

Is it possible that zswap is faster than a really slow memory
addressable device backed by NAND? TBH, I don't have the answer.

Anyway, do you need a way to describe the tiers of the swap devices,
so that you can move the cold pages among the swap devices based on
that?

Best Regards,
Huang, Ying