From: "Huang, Ying" <ying.huang@intel.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Chris Li <chrisl@kernel.org>,  lsf-pc@lists.linux-foundation.org,
  Johannes Weiner <hannes@cmpxchg.org>,  Linux-MM <linux-mm@kvack.org>,
  Michal Hocko <mhocko@kernel.org>,  Shakeel Butt <shakeelb@google.com>,
  David Rientjes <rientjes@google.com>,  Hugh Dickins <hughd@google.com>,
  Seth Jennings <sjenning@redhat.com>,  Dan Streetman <ddstreet@ieee.org>,
  Vitaly Wool <vitaly.wool@konsulko.com>,  Yang Shi <shy828301@gmail.com>,
  Peter Xu <peterx@redhat.com>,  Minchan Kim <minchan@kernel.org>,  Andrew
 Morton <akpm@linux-foundation.org>,  Aneesh Kumar K V
 <aneesh.kumar@linux.ibm.com>,  Michal Hocko <mhocko@suse.com>,  Wei Xu
 <weixugc@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
References: <CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com>
	<Y/6KOFMZaE0yOj/1@google.com>
	<CAJD7tkbvGvhTKCOqRpcht797Uw41fWgNd3r2kpN3ObfnUuaUxw@mail.gmail.com>
	<87356e850j.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CAJD7tkYj-=-tYyfqk_t2c6WMtcPLHYc4teRNtE2H8G8igEGrpA@mail.gmail.com>
	<87y1o571aa.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CAJD7tkamf8TtY0wOjhZKsWBJLL4pMsUhkwPtwCuroWcipRZ3CA@mail.gmail.com>
	<87o7ox762m.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CAJD7tka992TT4-nkPckhSgVXcTnJq8YPYt2CzupZgGGe72NTRw@mail.gmail.com>
	<87bkkt5e4o.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<CAJD7tkbgS1NHbmvNH8B9f-E7b6tSoE2hFssoDD54KG7u6eBWkw@mail.gmail.com>
Date: Mon, 20 Mar 2023 10:55:03 +0800
In-Reply-To: <CAJD7tkbgS1NHbmvNH8B9f-E7b6tSoE2hFssoDD54KG7u6eBWkw@mail.gmail.com>
	(Yosry Ahmed's message of "Fri, 17 Mar 2023 03:19:09 -0700")
Message-ID: <87y1ns3zeg.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

Yosry Ahmed <yosryahmed@google.com> writes:

> On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>>
>> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yosry Ahmed <yosryahmed@google.com> writes:
>> >>
>> >> <snip>
>> >> >
>> >> > My current idea is to have one xarray that stores the swap_descs
>> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
>> >> > rotating disks have an additional xarray that maps swap_entry ->
>> >> > swap_desc for cluster readahead, assuming we can eliminate all other
>> >> > situations requiring a reverse mapping.
>> >> >
>> >> > I am not sure how having separate xarrays help? If we have one xarray,
>> >> > might as well save the other lookups on put everything in swap_desc.
>> >> > In fact, this should improve the locking today as swapcache /
>> >> > swap_count operations can be lockless or very lightly contended.
>> >>
>> >> The condition of the proposal is "reverse mapping cannot be avoided for
>> >> enough situation".  So, if reverse mapping (or cluster readahead) can be
>> >> avoided for enough situations, I think your proposal is good.  Otherwise,
>> >> I propose to use 2 xarrays.  You don't need another reverse mapping
>> >> xarray, because you just need to read the next several swap_entry into
>> >> the swap cache for cluster readahead.  swap_desc isn't needed for
>> >> cluster readahead.
>> >
>> > swap_desc would be needed for cluster readahead in my original
>> > proposal as the swap cache lives in swap_descs. Based on the current
>> > implementation, we would need a reverse mapping (swap entry ->
>> > swap_desc) in 3 situations:
>> >
>> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
>> > failing, we fallback to trying to find swap entries that only have a
>> > page in the swap cache (no references in page tables or page cache)
>> > and free them. This would require a reverse mapping.
>> >
>> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
>> > to get all swap_descs associated with that swapfile.
>> >
>> > 3) swap cluster readahead.
>> >
>> > For (1), I think we can drop the dependency of a reverse mapping if we
>> > free swap entries once we swap a page in and add it to the swap cache,
>> > even if the swap count does not drop to 0.
>>
>> Now, we will not drop the swap cache even if the swap count becomes 0 if
>> swap space utility < 50%.  Per my understanding, this avoid swap page
>> writing for read accesses.  So I don't think we can change this directly
>> without necessary discussion firstly.
>
>
> Right. I am not sure I understand why we do this today, is it to save
> the overhead of allocating a new swap entry if the page is swapped out
> again soon? I am not sure I understand this statement "this avoid swap
> page
> writing for read accesses".
>
>>
>>
>> > For (2), instead of scanning page tables and shmem page cache to find
>> > swapped out pages for the swapfile, we can scan all swap_descs
>> > instead, we should be more efficient. This is one of the proposal's
>> > potential advantages.
>>
>> Good.
>>
>> > (3) is the one that would still need a reverse mapping with the
>> > current proposal. Today we use swap cluster readahead for anon pages
>> > if we have a spinning disk or vma readahead is disabled. For shmem, we
>> > always use cluster readahead. If we can limit cluster readahead to
>> > only rotating disks, then the reverse mapping can only be maintained
>> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
>> > reverse mapping for all swapfiles.
>>
>> For shmem, I think that it should be good to readahead based on shmem
>> file offset instead of swap device offset.
>>
>> It's possible that some pages in the readahead window are from HDD while
>> some other pages aren't.  So it's a little hard to enable cluster read
>> for HDD only.  Anyway, it's not common to use HDD for swap now.
>>
>> >>
>> >> > If the point is to store the swap_desc directly inside the xarray to
>> >> > save 8 bytes, I am concerned that having multiple xarrays for
>> >> > swapcache, swap_count, etc will use more than that.
>> >>
>> >> The idea is to save the memory used by reverse mapping xarray.
>> >
>> > I see.
>> >
>> >>
>> >> >> >>
>> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
>> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
>> >> >> >> > continuation pages. If we do, it may end up being more. We also
>> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
>> >> >> >> > element in a page requires continuation, we will allocate an entire
>> >> >> >> > page. What I am trying to say is that to get an actual comparison you
>> >> >> >> > need to also factor in the swap utilization and the rate of usage of
>> >> >> >> > swap continuation. I don't know how to come up with a formula for this
>> >> >> >> > tbh.
>> >> >> >> >
>> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> >> >> >> > overhead for people not using zswap, but it is not very awful.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> It seems what you really need is one bit of information to indicate
>> >> >> >> >> this page is backed by zswap. Then you can have a seperate pointer
>> >> >> >> >> for the zswap entry.
>> >> >> >> >
>> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
>> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it
>> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
>> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
>> >> >> >> > the shmem page cache, similar to swapoff.
>> >> >> >> >
>> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to lookup
>> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
>> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
>> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
>> >> >> >> > storing it in another lookup structure.
>> >> >> >>
>> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> >> >> >> swap_entry in swap_desc.  The added indirection appears to be another
>> >> >> >> level of page table with 1 entry.  Then, we may use the similar method
>> >> >> >> as supporting system with 2 level and 3 level page tables, like the code
>> >> >> >> in include/asm-generic/pgtable-nopmd.h.  But I haven't thought about
>> >> >> >> this deeply.
>> >> >> >
>> >> >> > Can you expand further on this idea? I am not sure I fully understand.
>> >> >>
>> >> >> OK.  The goal is to avoid the overhead if indirection isn't enabled via
>> >> >> kconfig.
>> >> >>
>> >> >> If indirection isn't enabled, store swap_entry in PTE directly.
>> >> >> Otherwise, store index of swap_desc in PTE.  Different functions (e.g.,
>> >> >> to get/set swap_entry in PTE) are implemented based on kconfig.
>> >> >
>> >> >
>> >> > I thought about this, the problem is that we will have multiple
>> >> > implementations of multiple things. For example, swap_count without
>> >> > the indirection layer lives in the swap_map (with continuation logic).
>> >> > With the indirection layer, it lives in the swap_desc (or somewhere
>> >> > else). Same for the swapcache. Even if we keep the swapcache in an
>> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
>> >> > the indirection is disabled, and by swap_desc (or similar) if the
>> >> > indirection is enabled. I think maintaining separate implementations
>> >> > for when the indirection is enabled/disabled would be adding too much
>> >> > complexity.
>> >> >
>> >> > WDYT?
>> >>
>> >> If we go this way, swap cache and swap_count will always be indexed by
>> >> swap_entry.  swap_desc just provides a indirection to make it possible
>> >> to move between swap devices.
>> >>
>> >> Why must we index swap cache and swap_count by swap_desc if indirection
>> >> is enabled?  Yes, we can save one xarray indexing if we do so, but I
>> >> don't think the overhead of one xarray indexing is a showstopper.
>> >>
>> >> I think this can be one intermediate step towards your final target.
>> >> The changes to current implementation can be smaller.
>> >
>> > IIUC, the idea is to have two xarrays:
>> > (a) xarray that stores a pointer to a struct containing swap_count and
>> > swap cache.
>> > (b) xarray that stores the underlying swap entry or zswap entry.
>> >
>> > When indirection is disabled:
>> > page tables & page cache have swap entry directly like today, xarray
>> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
>> > mapping needed.
>> >
>> > In this case we have an extra overhead of 12-16 bytes (the struct
>> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
>> >
>> > When indirection is enabled:
>> > page tables & page cache have a swap id (or swap_desc index), xarray
>> > (a) is indexed by swap id,
>>
>> xarray (a) is indexed by swap entry.
>
>
> How so? With the indirection enabled, the page tables & page cache
> have the swap id (or swap_desc index), which can point to a swap entry
> or a zswap entry -- which can change when the page is moved between
> zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
> case? Shouldn't be indexed by the abstract swap id so that the
> writeback from zswap is transparent?

In my mind,

- swap core will define an abstract interface to swap implementations
  (zswap, swap device/file, maybe more in the future), similar to VFS.

- zswap will be a special swap implementation (compressing instead of
  writing to disk).

- swap core will manage the indirection layer and swap cache.

- swap core can move swap pages between swap implementations (e.g., from
  zswap to a swap device, or from one swap device to another swap
  device) with the help of the indirection layer.

In this design, the writeback from zswap becomes moving swapped pages
from zswap to a swap device.
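
To make the layering concrete, here is a rough userspace sketch in C.
It is only an illustration of the idea, not proposed kernel code: all
names (swap_backend_ops, swap_core_move(), the tiny in-memory store that
stands in for both zswap and a swap device) are made up for this
example, and locking, swap cache and swap counts are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ  16	/* tiny "page" to keep the sketch small */
#define NR_SLOTS 8	/* slots per backend */
#define NR_IDS   8	/* size of the indirection table */

/* Hypothetical per-implementation operations, in the spirit of VFS ops. */
struct swap_backend_ops {
	int  (*store)(void *priv, const uint8_t *page);	/* slot or -1 */
	int  (*load)(void *priv, int slot, uint8_t *page);
	void (*free_slot)(void *priv, int slot);
};

struct swap_backend {
	const char *name;
	const struct swap_backend_ops *ops;
	void *priv;
};

/* Stand-in storage, used here for both "zswap" and "swap device". */
struct mem_store {
	uint8_t data[NR_SLOTS][PAGE_SZ];
	uint8_t used[NR_SLOTS];
};

static int mem_store_page(void *priv, const uint8_t *page)
{
	struct mem_store *s = priv;

	for (int i = 0; i < NR_SLOTS; i++) {
		if (!s->used[i]) {
			memcpy(s->data[i], page, PAGE_SZ);
			s->used[i] = 1;
			return i;
		}
	}
	return -1;	/* backend is full */
}

static int mem_load_page(void *priv, int slot, uint8_t *page)
{
	struct mem_store *s = priv;

	if (slot < 0 || slot >= NR_SLOTS || !s->used[slot])
		return -1;
	memcpy(page, s->data[slot], PAGE_SZ);
	return 0;
}

static void mem_free_slot(void *priv, int slot)
{
	struct mem_store *s = priv;

	if (slot >= 0 && slot < NR_SLOTS)
		s->used[slot] = 0;
}

static const struct swap_backend_ops mem_ops = {
	.store = mem_store_page,
	.load = mem_load_page,
	.free_slot = mem_free_slot,
};

/* Indirection layer owned by the swap core: swap id -> (backend, slot). */
struct swap_loc {
	struct swap_backend *backend;
	int slot;
};

static struct swap_loc swap_table[NR_IDS];

int swap_core_out(int id, struct swap_backend *b, const uint8_t *page)
{
	int slot;

	if (id < 0 || id >= NR_IDS || swap_table[id].backend)
		return -1;
	slot = b->ops->store(b->priv, page);
	if (slot < 0)
		return -1;
	swap_table[id].backend = b;
	swap_table[id].slot = slot;
	return 0;
}

int swap_core_in(int id, uint8_t *page)
{
	struct swap_loc *loc;

	if (id < 0 || id >= NR_IDS || !swap_table[id].backend)
		return -1;
	loc = &swap_table[id];
	return loc->backend->ops->load(loc->backend->priv, loc->slot, page);
}

/* "Writeback" is just a move; the swap id held by users stays the same. */
int swap_core_move(int id, struct swap_backend *dst)
{
	uint8_t buf[PAGE_SZ];
	struct swap_loc old;
	int slot;

	if (swap_core_in(id, buf) < 0)
		return -1;
	slot = dst->ops->store(dst->priv, buf);
	if (slot < 0)
		return -1;
	old = swap_table[id];
	swap_table[id].backend = dst;
	swap_table[id].slot = slot;
	old.backend->ops->free_slot(old.backend->priv, old.slot);
	return 0;
}
```

With two swap_backend instances (one playing zswap, one playing a swap
device), swap_core_out() followed by swap_core_move() relocates the
page while the swap id stays unchanged, so nothing that refers to the
id has to be updated.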

If my understanding is correct, your suggestion amounts to moving the
zswap logic into the swap core?  And zswap will always sit at a higher
layer on top of the swap device/file?

>>
>>
>> > xarray (b) is indexed by swap id as well
>> > and contain swap entry or zswap entry. Reverse mapping might be
>> > needed.
>>
>> Reverse mapping isn't needed.
>
>
> It would be needed if xarray (a) is indexed by the swap id. I am not
> sure I understand how it can be indexed by the swap entry if the
> indirection is enabled.
>
>>
>>
>> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
>> > xarray (b) entry + memory overhead from 2nd xarray + reverse mapping
>> > where needed.
>> >
>> > There is also the extra cpu overhead for an extra lookup in certain paths.
>> >
>> > Is my analysis correct? If yes, I agree that the original proposal is
>> > good if the reverse mapping can be avoided in enough situations, and
>> > that we should consider such alternatives otherwise. As I mentioned
>> > above, I think it comes down to whether we can completely restrict
>> > cluster readahead to rotating disks or not -- in which case we need to
>> > decide what to do for shmem and for anon when vma readahead is
>> > disabled.
>>
>> We can even have a minimal indirection implementation.  Where, swap
>> cache and swap_map[] are kept as they ware before, just one xarray is
>> added.  The xarray is indexed by swap id (or swap_desc index) to store
>> the corresponding swap entry.
>>
>> When indirection is disabled, no extra overhead.
>>
>> When indirection is enabled, the extra overhead is just 8 bytes per
>> swapped page.
>>
>> The basic migration support can be build on top of this.
>>
>> I think that this could be a baseline for indirection support.  Then
>> further optimization can be built on top of it step by step with
>> supporting data.
>
>
> I am not sure how this works with zswap. Currently swap_map[]
> implementation is specific for swapfiles, it does not work for zswap
> unless we implement separate swap counting logic for zswap &
> swapfiles. Same for the swapcache, it currently supports being indexed
> by a swap entry, it would need to support being indexed by a swap id,
> or have a separate swap cache for zswap. Having separate
> implementation would add complexity, and we would need to perform
> handoffs of the swap count/cache when a page is moved from zswap to a
> swapfile.

We can allocate a swap entry for each swapped page in zswap.
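
As a rough userspace illustration in C (a sketch under assumptions, not
kernel code): if zswap is treated as one more swap "type", the existing
per-entry counting can stay keyed by swap entry, and moving a page only
needs a new entry plus an update of the id -> entry lookup.  The entry
encoding, TYPE_ZSWAP, id_to_entry[] and move_id() below are all
hypothetical, and swap_map[] here is a simplified stand-in.

```c
#include <assert.h>
#include <stdint.h>

#define SWP_TYPE_SHIFT  24
#define SWP_OFFSET_MASK ((1u << SWP_TYPE_SHIFT) - 1)
#define TYPE_ZSWAP 0	/* hypothetical: zswap exposed as a swap type */
#define TYPE_FILE  1
#define NR_TYPES   2
#define NR_OFFSETS 16
#define NR_IDS     16
#define NO_ENTRY   UINT32_MAX

typedef uint32_t swp_entry;	/* simplified stand-in for swp_entry_t */

static swp_entry mk_entry(unsigned type, unsigned off)
{
	return (type << SWP_TYPE_SHIFT) | off;
}

static unsigned entry_type(swp_entry e)
{
	return e >> SWP_TYPE_SHIFT;
}

static unsigned entry_offset(swp_entry e)
{
	return e & SWP_OFFSET_MASK;
}

/* Use counts indexed by swap entry for every type; 0 means free. */
static uint8_t swap_map[NR_TYPES][NR_OFFSETS];

/* The single added lookup: swap id -> current swap entry. */
static swp_entry id_to_entry[NR_IDS];

static uint8_t *count_of(swp_entry e)
{
	return &swap_map[entry_type(e)][entry_offset(e)];
}

/* Allocate a free offset of the given type with an initial count of 1. */
static swp_entry alloc_entry(unsigned type)
{
	for (unsigned off = 0; off < NR_OFFSETS; off++) {
		if (swap_map[type][off] == 0) {
			swap_map[type][off] = 1;
			return mk_entry(type, off);
		}
	}
	return NO_ENTRY;
}

/* Take one more reference on the page behind a swap id. */
static void swap_dup_id(unsigned id)
{
	(*count_of(id_to_entry[id]))++;
}

/* Move a swapped page to another type; the count follows the page. */
static int move_id(unsigned id, unsigned dst_type)
{
	swp_entry old = id_to_entry[id];
	swp_entry dst = alloc_entry(dst_type);

	if (dst == NO_ENTRY)
		return -1;
	*count_of(dst) = *count_of(old);
	*count_of(old) = 0;		/* old entry becomes free */
	id_to_entry[id] = dst;
	return 0;
}
```

The count handoff is a plain copy here because both sides use the same
swap_map[] layout; the data copy and swap cache handling are omitted.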

>>
>>
>> >>
>> >> >> >> >>
>> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
>> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
>> >> >> >> >
>> >> >> >> > My current intention is to reimplement the swapcache completely as a
>> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
>> >> >> >> > of the locking we do today if I get things right.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> > Another potential concern is readahead. With this design, we have no
>> >> >> >> >>
>> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
>> >> >> >> >> use some modernization.
>> >> >> >> >
>> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
>> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
>> >> >> >> > that for spinning disks, but I was wrong. We need for other things as
>> >> >> >> > well today: swapoff, when trying to find an empty swap slot and we
>> >> >> >> > start trying to free swap slots used only by the swapcache. However, I
>> >> >> >> > think both of these cases can be fixed (I can share more details if
>> >> >> >> > you want). If everything goes well we should only need to maintain the
>> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> >> >> >> > spinning disks for readahead.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> Looking forward to your discussion.
>> >> >>
>> >> >> Per my understanding, the indirection is to make it easy to move
>> >> >> (swapped) pages among swap devices based on hot/cold.  This is similar
>> >> >> as the target of memory tiering.  It appears that we can extend the
>> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
>> >> >> Is it possible for zswap to be faster than some slow memory media?
>> >> >
>> >> >
>> >> > Agree with Chris that this may require a much larger overhaul. A slow
>> >> > memory tier is still addressable memory, swap/zswap requires a page
>> >> > fault to read the pages. I think (at least for now) there is a
>> >> > fundamental difference. We want reclaim to eventually treat slow
>> >> > memory & swap as just different tiers to place cold memory in with
>> >> > different characteristics, but otherwise I think the swapping
>> >> > implementation itself is very different.  Am I missing something?
>> >>
>> >> Is it possible that zswap is faster than a really slow memory
>> >> addressable device backed by NAND?  TBH, I don't have the answer.
>> >
>> > I am not sure either.
>> >
>> >>
>> >> Anyway, do you need a way to describe the tiers of the swap devices?
>> >> So, you can move the cold pages among the swap devices based on that?
>> >
>> > For now I think the "tiers" in this proposal are just zswap and normal
>> > swapfiles. We can later extend it to support more explicit tiering.
>>
>> IIUC, in original zswap implementation, there's 1:1 relationship between
>> zswap and normal swapfile.  But now, you make demoting among swap
>> devices more general.  Then we need some general way to specify which
>> swap devices are fast and which are slow, and the demoting relationship
>> among them.  It can be memory tiers or something else, but we need one.
>
>
> I think for this proposal, there are only 2 hardcoded tiers. Zswap is
> fast, swapfile is slow. In the future, we can support more dynamic
> tiering if the need arises.

We can start from a simple implementation.  But I think it's better to
consider the general design too, and to avoid choices now that would make
it impossible later.

Best Regards,
Huang, Ying