From: "Huang, Ying" <ying.huang@intel.com>
To: Yosry Ahmed
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM,
 Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
 Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim,
 Andrew Morton, Aneesh Kumar K V, Wei Xu
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
Date: Thu, 23 Mar 2023 11:16:11 +0800
Message-ID: <87v8isrwck.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: (Yosry Ahmed's message of "Wed, 22 Mar 2023 19:21:04 -0700")
References: <87356e850j.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87y1o571aa.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o7ox762m.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87bkkt5e4o.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87y1ns3zeg.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <874jqcteyq.fsf@yhuang6-desk2.ccr.corp.intel.com>
Yosry Ahmed writes:

> On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying wrote:
>>
>> Yosry Ahmed writes:
>>
>> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying wrote:
>> >>
>> >> Yosry Ahmed writes:
>> >>
>> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying wrote:
>> >> >>
>> >> >> Yosry Ahmed writes:
>> >> >>
>> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying wrote:
>> >> >> >>
>> >> >> >> Yosry Ahmed writes:
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> > My current idea is to have one xarray that stores the swap_descs
>> >> >> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
>> >> >> >> > rotating disks have an additional xarray that maps swap_entry ->
>> >> >> >> > swap_desc for cluster readahead, assuming we can eliminate all other
>> >> >> >> > situations requiring a reverse mapping.
>> >> >> >> >
>> >> >> >> > I am not sure how having separate xarrays helps. If we have one xarray,
>> >> >> >> > we might as well save the other lookups and put everything in the
>> >> >> >> > swap_desc. In fact, this should improve the locking today, as swapcache /
>> >> >> >> > swap_count operations can be lockless or very lightly contended.
>> >> >> >>
>> >> >> >> The condition of the proposal is "reverse mapping cannot be avoided for
>> >> >> >> enough situations". So, if reverse mapping (or cluster readahead) can be
>> >> >> >> avoided for enough situations, I think your proposal is good. Otherwise,
>> >> >> >> I propose to use 2 xarrays. You don't need another reverse mapping
>> >> >> >> xarray, because you just need to read the next several swap_entry values
>> >> >> >> into the swap cache for cluster readahead. swap_desc isn't needed for
>> >> >> >> cluster readahead.
>> >> >> >
>> >> >> > swap_desc would be needed for cluster readahead in my original
>> >> >> > proposal, as the swap cache lives in swap_descs. Based on the current
>> >> >> > implementation, we would need a reverse mapping (swap entry ->
>> >> >> > swap_desc) in 3 situations:
>> >> >> >
>> >> >> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
>> >> >> > failing, we fall back to trying to find swap entries that only have a
>> >> >> > page in the swap cache (no references in page tables or page cache)
>> >> >> > and free them. This would require a reverse mapping.
>> >> >> >
>> >> >> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
>> >> >> > to get all swap_descs associated with that swapfile.
>> >> >> >
>> >> >> > 3) swap cluster readahead.
>> >> >> >
>> >> >> > For (1), I think we can drop the dependency on a reverse mapping if we
>> >> >> > free swap entries once we swap a page in and add it to the swap cache,
>> >> >> > even if the swap count does not drop to 0.
>> >> >>
>> >> >> Now, we will not drop the swap cache even if the swap count becomes 0 if
>> >> >> swap space utilization < 50%. Per my understanding, this avoids swap page
>> >> >> writing for read accesses. So I don't think we can change this directly
>> >> >> without the necessary discussion first.
>> >> >
>> >> >
>> >> > Right. I am not sure I understand why we do this today. Is it to save
>> >> > the overhead of allocating a new swap entry if the page is swapped out
>> >> > again soon? I am not sure I understand the statement "this avoids swap
>> >> > page writing for read accesses".
>> >> >
>> >> >>
>> >> >>
>> >> >> > For (2), instead of scanning page tables and shmem page cache to find
>> >> >> > swapped out pages for the swapfile, we can scan all swap_descs
>> >> >> > instead, which should be more efficient. This is one of the proposal's
>> >> >> > potential advantages.
>> >> >>
>> >> >> Good.
>> >> >>
>> >> >> > (3) is the one that would still need a reverse mapping with the
>> >> >> > current proposal. Today we use swap cluster readahead for anon pages
>> >> >> > if we have a spinning disk or vma readahead is disabled. For shmem, we
>> >> >> > always use cluster readahead. If we can limit cluster readahead to
>> >> >> > only rotating disks, then the reverse mapping only needs to be maintained
>> >> >> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
>> >> >> > reverse mapping for all swapfiles.
>> >> >>
>> >> >> For shmem, I think that it should be good to readahead based on shmem
>> >> >> file offset instead of swap device offset.
>> >> >>
>> >> >> It's possible that some pages in the readahead window are from HDD while
>> >> >> some other pages aren't. So it's a little hard to enable cluster read
>> >> >> for HDD only. Anyway, it's not common to use HDD for swap now.
>> >> >>
>> >> >>
>> >> >> >> > If the point is to store the swap_desc directly inside the xarray to
>> >> >> >> > save 8 bytes, I am concerned that having multiple xarrays for
>> >> >> >> > swapcache, swap_count, etc will use more than that.
>> >> >> >>
>> >> >> >> The idea is to save the memory used by the reverse mapping xarray.
>> >> >> >
>> >> >> > I see.
>> >> >> >
>> >> >> >>
>> >> >> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
>> >> >> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
>> >> >> >> >> >> > continuation pages. If we do, it may end up being more. We also
>> >> >> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
>> >> >> >> >> >> > element in a page requires continuation, we will allocate an entire
>> >> >> >> >> >> > page. What I am trying to say is that to get an actual comparison you
>> >> >> >> >> >> > need to also factor in the swap utilization and the rate of usage of
>> >> >> >> >> >> > swap continuation. I don't know how to come up with a formula for this
>> >> >> >> >> >> > tbh.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> >> >> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> >> >> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> >> >> >> >> >> > overhead for people not using zswap, but it is not very awful.
>> >> >> >> >> >> >
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> It seems what you really need is one bit of information to indicate
>> >> >> >> >> >> >> this page is backed by zswap. Then you can have a separate pointer
>> >> >> >> >> >> >> for the zswap entry.
>> >> >> >> >> >> >
>> >> >> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
>> >> >> >> >> >> > to indicate whether the page is backed by a swapfile or zswap, it
>> >> >> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
>> >> >> >> >> >> > page from zswap to a swapfile? We need to go update the page tables and
>> >> >> >> >> >> > the shmem page cache, similar to swapoff.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Instead, if we store some other key in swp_entry_t and use this to look
>> >> >> >> >> >> > up the swp_entry_t or zswap_entry pointer, then that's essentially what
>> >> >> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
>> >> >> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
>> >> >> >> >> >> > storing it in another lookup structure.
>> >> >> >> >> >>
>> >> >> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> >> >> >> >> >> swap_entry in swap_desc, the added indirection appears to be another
>> >> >> >> >> >> level of page table with 1 entry. Then, we may use a method similar to
>> >> >> >> >> >> how systems with 2-level and 3-level page tables are supported, like the
>> >> >> >> >> >> code in include/asm-generic/pgtable-nopmd.h. But I haven't thought about
>> >> >> >> >> >> this deeply.
>> >> >> >> >> >
>> >> >> >> >> > Can you expand further on this idea? I am not sure I fully understand.
>> >> >> >> >>
>> >> >> >> >> OK. The goal is to avoid the overhead if indirection isn't enabled via
>> >> >> >> >> kconfig.
>> >> >> >> >>
>> >> >> >> >> If indirection isn't enabled, store swap_entry in the PTE directly.
>> >> >> >> >> Otherwise, store the index of the swap_desc in the PTE. Different
>> >> >> >> >> functions (e.g., to get/set swap_entry in the PTE) are implemented
>> >> >> >> >> based on kconfig.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > I thought about this; the problem is that we will have multiple
>> >> >> >> > implementations of multiple things. For example, swap_count without
>> >> >> >> > the indirection layer lives in the swap_map (with continuation logic).
>> >> >> >> > With the indirection layer, it lives in the swap_desc (or somewhere
>> >> >> >> > else). Same for the swapcache. Even if we keep the swapcache in an
>> >> >> >> > xarray and not inside the swap_desc, it would be indexed by swap_entry
>> >> >> >> > if the indirection is disabled, and by swap_desc (or similar) if the
>> >> >> >> > indirection is enabled. I think maintaining separate implementations
>> >> >> >> > for when the indirection is enabled/disabled would add too much
>> >> >> >> > complexity.
>> >> >> >> >
>> >> >> >> > WDYT?
>> >> >> >>
>> >> >> >> If we go this way, swap cache and swap_count will always be indexed by
>> >> >> >> swap_entry. swap_desc just provides an indirection to make it possible
>> >> >> >> to move between swap devices.
>> >> >> >>
>> >> >> >> Why must we index swap cache and swap_count by swap_desc if indirection
>> >> >> >> is enabled? Yes, we can save one xarray indexing if we do so, but I
>> >> >> >> don't think the overhead of one xarray indexing is a showstopper.
>> >> >> >>
>> >> >> >> I think this can be one intermediate step towards your final target.
>> >> >> >> The changes to the current implementation can be smaller.
>> >> >> >
>> >> >> > IIUC, the idea is to have two xarrays:
>> >> >> > (a) an xarray that stores a pointer to a struct containing swap_count
>> >> >> > and the swap cache.
>> >> >> > (b) an xarray that stores the underlying swap entry or zswap entry.
>> >> >> >
>> >> >> > When indirection is disabled:
>> >> >> > page tables & page cache have the swap entry directly like today, xarray
>> >> >> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
>> >> >> > mapping needed.
>> >> >> >
>> >> >> > In this case we have an extra overhead of 12-16 bytes (the struct
>> >> >> > containing swap_count and the swap cache) vs. 24 bytes for the swap_desc.
>> >> >> >
>> >> >> > When indirection is enabled:
>> >> >> > page tables & page cache have a swap id (or swap_desc index), xarray
>> >> >> > (a) is indexed by swap id,
>> >> >>
>> >> >> xarray (a) is indexed by swap entry.
>> >> >
>> >> >
>> >> > How so? With the indirection enabled, the page tables & page cache
>> >> > have the swap id (or swap_desc index), which can point to a swap entry
>> >> > or a zswap entry -- which can change when the page is moved between
>> >> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
>> >> > case? Shouldn't it be indexed by the abstract swap id so that the
>> >> > writeback from zswap is transparent?
>> >>
>> >> In my mind,
>> >>
>> >> - swap core will define an abstract interface to swap implementations
>> >> (zswap, swap device/file, maybe more in the future), like VFS.
>> >>
>> >> - zswap will be a special swap implementation (compressing instead of
>> >> writing to disk).
>> >>
>> >> - swap core will manage the indirection layer and swap cache.
>> >>
>> >> - swap core can move swap pages between swap implementations (e.g., from
>> >> zswap to a swap device, or from one swap device to another swap
>> >> device) with the help of the indirection layer.
>> >>
>> >> In this design, the writeback from zswap becomes moving swapped pages
>> >> from zswap to a swap device.
>> >
>> >
>> > All the above matches my understanding of this proposal. swap_desc is
>> > the proposed indirection layer, and the swap implementations are zswap
>> > & swap devices. For now, we only have 2 static swap implementations
>> > (zswap->swapfile). In the future, we can make this more dynamic as the
>> > need arises.
>>
>> Great to align with you on this.
>
> Great!
>
>> >>
>> >> If my understanding is correct, your suggestion is kind of moving the
>> >> zswap logic into the swap core? And zswap will always be at a higher
>> >> layer on top of a swap device/file?
>> >
>> >
>> > We do not want to move the zswap logic into the swap core; we want to
>> > make the swap core independent of the swap implementation, and zswap
>> > is just one possible implementation.
>>
>> Good!
>>
>> I found that you put zswap-related data structures inside struct
>> swap_desc directly. I think that we should avoid that as much as
>> possible.
>
> That was just an initial draft to show what it would be like, I do not
> intend to hardcode zswap-specific items in swap_desc.
>
>>
>> >> >> >
>> >> >> > xarray (b) is indexed by swap id as well
>> >> >> > and contains the swap entry or zswap entry. Reverse mapping might be
>> >> >> > needed.
>> >> >>
>> >> >> Reverse mapping isn't needed.
>> >> >
>> >> >
>> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
>> >> > sure I understand how it can be indexed by the swap entry if the
>> >> > indirection is enabled.
>> >> >
>> >> >>
>> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
>> >> >> > the xarray (b) entry + memory overhead from the 2nd xarray + reverse
>> >> >> > mapping where needed.
>> >> >> >
>> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
>> >> >> >
>> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
>> >> >> > good if the reverse mapping can be avoided in enough situations, and
>> >> >> > that we should consider such alternatives otherwise. As I mentioned
>> >> >> > above, I think it comes down to whether we can completely restrict
>> >> >> > cluster readahead to rotating disks or not -- in which case we need to
>> >> >> > decide what to do for shmem and for anon when vma readahead is
>> >> >> > disabled.
>> >> >>
>> >> >> We can even have a minimal indirection implementation, where swap
>> >> >> cache and swap_map[] are kept as they were before, and just one xarray
>> >> >> is added. The xarray is indexed by swap id (or swap_desc index) to store
>> >> >> the corresponding swap entry.
>> >> >>
>> >> >> When indirection is disabled, no extra overhead.
>> >> >>
>> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
>> >> >> swapped page.
>> >> >>
>> >> >> The basic migration support can be built on top of this.
>> >> >>
>> >> >> I think that this could be a baseline for indirection support. Then
>> >> >> further optimization can be built on top of it step by step with
>> >> >> supporting data.
>> >> >
>> >> >
>> >> > I am not sure how this works with zswap. Currently the swap_map[]
>> >> > implementation is specific to swapfiles; it does not work for zswap
>> >> > unless we implement separate swap counting logic for zswap &
>> >> > swapfiles. Same for the swapcache: it currently supports being indexed
>> >> > by a swap entry, and it would need to support being indexed by a swap
>> >> > id, or have a separate swap cache for zswap. Having separate
>> >> > implementations would add complexity, and we would need to perform
>> >> > handoffs of the swap count/cache when a page is moved from zswap to a
>> >> > swapfile.
>> >>
>> >> We can allocate a swap entry for each swapped page in zswap.
>> >
>> >
>> > This is exactly what the current implementation does and what we want
>> > to move away from. The current implementation uses zswap as an
>> > in-memory compressed cache on top of an actual swap device, and each
>> > swapped page in zswap has a swap entry allocated. With this
>> > implementation, zswap cannot be used without a swap device.
>>
>> I totally agree that we should avoid using an actual swap device under
>> zswap. And, as a swap implementation, zswap can manage the swap entry
>> inside zswap without an underlying actual swap device. For example,
>> when we swap a page to zswap (actually compress it), we can allocate a
>> (virtual) swap entry in zswap. I understand that there's overhead
>> to manage the swap entry in zswap. We can consider how to reduce the
>> overhead.
>
> I see. So we can (for example) use one of the swap types for zswap,
> and then have the zswap code handle this entry according to its
> implementation. We can then have an xarray that maps swap ID -> swap
> entry, and this swap entry is used to index the swap cache and such.
> When a swapped page is moved between backends we update the swap ID ->
> swap entry xarray.
>
> This is one possible implementation that I thought of (very briefly
> tbh), but it does have its problems:
>
> For zswap:
> - Managing swap entries inside zswap unnecessarily.
> - We need to maintain a swap entry -> zswap entry mapping in zswap --
> similar to the current rbtree, which is something that we can get rid
> of with the initial proposal if we embed the zswap_entry pointer
> directly in the swap_desc (it can be encoded to avoid breaking the
> abstraction).
>
> For mm/swap in general:
> - When we allocate a swap entry today, we store it in folio->private
> (or page->private), which is used by the unmapping code to be placed
> in the page tables or shmem page cache. With this implementation, we
> need to store the swap ID in page->private instead, which means that
> every time we need to access the swap cache during reclaim/swapout we
> need to look up the swap entry first.
> - On the fault path, we need two lookups instead of one (swap ID ->
> swap entry, swap entry -> swap cache); not sure how this affects fault
> latency.
> - Each swap backend will have its own separate implementation of swap
> counting, which is hard to maintain and very error-prone since the
> logic is backend-agnostic.
> - Handing over a page from one swap backend to another includes
> handing over swap cache entries and swap counts, which I imagine will
> involve considerable synchronization.
>
> Do you have any thoughts on this?

Yes. I understand there's additional overhead. I have no clear idea
about how to reduce this now. We need to think about that in depth.

The bottom line is whether this is worse than the current zswap
implementation.

Best Regards,
Huang, Ying

>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
>> >> >> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
>> >> >> >> >> >> >
>> >> >> >> >> >> > My current intention is to reimplement the swapcache completely as a
>> >> >> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
>> >> >> >> >> >> > of the locking we do today if I get things right.
>> >> >> >> >> >> >
>> >> >> >> >> >> >> > Another potential concern is readahead. With this design, we have no
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD
>> >> >> >> >> >> >> can use some modernization.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
>> >> >> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
>> >> >> >> >> >> > that for spinning disks, but I was wrong. We need it for other things
>> >> >> >> >> >> > as well today: swapoff, and when trying to find an empty swap slot and
>> >> >> >> >> >> > we start trying to free swap slots used only by the swapcache. However,
>> >> >> >> >> >> > I think both of these cases can be fixed (I can share more details if
>> >> >> >> >> >> > you want). If everything goes well we should only need to maintain the
>> >> >> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> >> >> >> >> >> > spinning disks for readahead.
>> >> >> >> >> >> >
>> >> >> >> >> >>
>> >> >> >> >> >> Looking forward to your discussion.
>> >> >> >> >>
>> >> >> >> >> Per my understanding, the indirection is to make it easy to move
>> >> >> >> >> (swapped) pages among swap devices based on hot/cold. This is similar
>> >> >> >> >> to the target of memory tiering. It appears that we can extend the
>> >> >> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
>> >> >> >> >> Is it possible for zswap to be faster than some slow memory media?
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > Agree with Chris that this may require a much larger overhaul. A slow
>> >> >> >> > memory tier is still addressable memory; swap/zswap requires a page
>> >> >> >> > fault to read the pages. I think (at least for now) there is a
>> >> >> >> > fundamental difference. We want reclaim to eventually treat slow
>> >> >> >> > memory & swap as just different tiers to place cold memory in with
>> >> >> >> > different characteristics, but otherwise I think the swapping
>> >> >> >> > implementation itself is very different. Am I missing something?
>> >> >> >>
>> >> >> >> Is it possible that zswap is faster than a really slow memory
>> >> >> >> addressable device backed by NAND? TBH, I don't have the answer.
>> >> >> >
>> >> >> > I am not sure either.
>> >> >> >
>> >> >> >>
>> >> >> >> Anyway, do you need a way to describe the tiers of the swap devices?
>> >> >> >> So, you can move the cold pages among the swap devices based on that?
>> >> >> >
>> >> >> > For now I think the "tiers" in this proposal are just zswap and normal
>> >> >> > swapfiles. We can later extend it to support more explicit tiering.
>> >> >>
>> >> >> IIUC, in the original zswap implementation, there's a 1:1 relationship
>> >> >> between zswap and the normal swapfile. But now, you make demoting among
>> >> >> swap devices more general. Then we need some general way to specify
>> >> >> which swap devices are fast and which are slow, and the demoting
>> >> >> relationship among them. It can be memory tiers or something else, but
>> >> >> we need one.
>> >> >
>> >> >
>> >> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is
>> >> > fast, swapfile is slow. In the future, we can support more dynamic
>> >> > tiering if the need arises.
>> >>
>> >> We can start from a simple implementation. And I think that it's better
>> >> to consider the general design too. Try not to make it impossible now.
>> >
>> >
>> > Right. I am proposing we come up with an abstract generic interface
>> > for swap implementations, and have 2 implementations statically
>> > defined (swapfiles and zswap). If the need arises, we can make swap
>> > implementations more dynamic in the future.
>> >
>>
>> Best Regards,
>> Huang, Ying
>>