From: "Huang, Ying" <ying.huang@intel.com>
To: Yosry Ahmed
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner,
	Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
	Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
	Minchan Kim, Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
Date: Thu, 23 Mar 2023 13:37:59 +0800
Message-ID: <87bkkkrps8.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: (Yosry Ahmed's message of "Wed, 22 Mar 2023 20:27:24 -0700")
References: <87356e850j.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87y1o571aa.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87o7ox762m.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87bkkt5e4o.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87y1ns3zeg.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<874jqcteyq.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87v8isrwck.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yosry Ahmed writes:

> On Wed, Mar 22, 2023 at 8:17 PM Huang, Ying wrote:
>>
>> Yosry Ahmed writes:
>>
>> > On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying wrote:
>> >>
>> >> Yosry Ahmed writes:
>> >>
>> >> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying wrote:
>> >> >>
>> >> >> Yosry Ahmed writes:
>> >> >>
>> >> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying wrote:
>> >> >> >>
>> >> >> >> Yosry Ahmed writes:
>> >> >> >>
>> >> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying wrote:
>> >> >> >> >>
>> >> >> >> >> Yosry Ahmed writes:
>> >> >> >> >>
[snip]
>> >> >> >> >
>> >> >> >> > xarray (b) is indexed by swap id as well and contains a swap
>> >> >> >> > entry or a zswap entry. Reverse mapping might be needed.
>> >> >> >>
>> >> >> >> Reverse mapping isn't needed.
>> >> >> >
>> >> >> > It would be needed if xarray (a) is indexed by the swap id. I am
>> >> >> > not sure I understand how it can be indexed by the swap entry if
>> >> >> > the indirection is enabled.
>> >> >> >
>> >> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes
>> >> >> >> > for the xarray (b) entry + memory overhead from the 2nd xarray
>> >> >> >> > + reverse mapping where needed.
>> >> >> >> >
>> >> >> >> > There is also the extra cpu overhead for an extra lookup in
>> >> >> >> > certain paths.
>> >> >> >> >
>> >> >> >> > Is my analysis correct? If yes, I agree that the original
>> >> >> >> > proposal is good if the reverse mapping can be avoided in
>> >> >> >> > enough situations, and that we should consider such
>> >> >> >> > alternatives otherwise. As I mentioned above, I think it comes
>> >> >> >> > down to whether we can completely restrict cluster readahead
>> >> >> >> > to rotating disks or not -- in which case we need to decide
>> >> >> >> > what to do for shmem and for anon when vma readahead is
>> >> >> >> > disabled.
>> >> >> >>
>> >> >> >> We can even have a minimal indirection implementation, where the
>> >> >> >> swap cache and swap_map[] are kept as they were before and just
>> >> >> >> one xarray is added. The xarray is indexed by swap id (or
>> >> >> >> swap_desc index) to store the corresponding swap entry.
>> >> >> >>
>> >> >> >> When indirection is disabled, there is no extra overhead.
>> >> >> >>
>> >> >> >> When indirection is enabled, the extra overhead is just 8 bytes
>> >> >> >> per swapped page.
>> >> >> >>
>> >> >> >> Basic migration support can be built on top of this.
>> >> >> >>
>> >> >> >> I think that this could be a baseline for indirection support.
>> >> >> >> Then further optimization can be built on top of it step by step
>> >> >> >> with supporting data.
>> >> >> >
>> >> >> > I am not sure how this works with zswap. Currently the swap_map[]
>> >> >> > implementation is specific to swapfiles; it does not work for
>> >> >> > zswap unless we implement separate swap counting logic for zswap
>> >> >> > and swapfiles. Same for the swap cache: it currently supports
>> >> >> > being indexed by a swap entry, so it would need to support being
>> >> >> > indexed by a swap id, or we would need a separate swap cache for
>> >> >> > zswap. Having separate implementations would add complexity, and
>> >> >> > we would need to perform handoffs of the swap count/cache when a
>> >> >> > page is moved from zswap to a swapfile.
>> >> >>
>> >> >> We can allocate a swap entry for each swapped page in zswap.
>> >> >
>> >> > This is exactly what the current implementation does and what we want
>> >> > to move away from. The current implementation uses zswap as an
>> >> > in-memory compressed cache on top of an actual swap device, and each
>> >> > swapped page in zswap has a swap entry allocated. With this
>> >> > implementation, zswap cannot be used without a swap device.
>> >>
>> >> I totally agree that we should avoid using an actual swap device under
>> >> zswap.
>> >> And, as a swap implementation, zswap can manage the swap entry inside
>> >> zswap without an underlying actual swap device. For example, when we
>> >> swap a page to zswap (actually compress it), we can allocate a
>> >> (virtual) swap entry in zswap. I understand that there's overhead to
>> >> manage the swap entry in zswap. We can consider how to reduce the
>> >> overhead.
>> >
>> > I see. So we can (for example) use one of the swap types for zswap,
>> > and then have the zswap code handle this entry according to its
>> > implementation. We can then have an xarray that maps swap ID -> swap
>> > entry, and this swap entry is used to index the swap cache and such.
>> > When a swapped page is moved between backends we update the swap ID ->
>> > swap entry xarray.
>> >
>> > This is one possible implementation that I thought of (very briefly
>> > tbh), but it does have its problems:
>> >
>> > For zswap:
>> > - Managing swap entries inside zswap unnecessarily.
>> > - We need to maintain a swap entry -> zswap entry mapping in zswap --
>> > similar to the current rbtree, which is something that we can get rid
>> > of with the initial proposal if we embed the zswap_entry pointer
>> > directly in the swap_desc (it can be encoded to avoid breaking the
>> > abstraction).
>> >
>> > For mm/swap in general:
>> > - When we allocate a swap entry today, we store it in folio->private
>> > (or page->private), which is used by the unmapping code to place it in
>> > the page tables or the shmem page cache. With this implementation, we
>> > need to store the swap ID in page->private instead, which means that
>> > every time we need to access the swap cache during reclaim/swapout we
>> > need to look up the swap entry first.
>> > - On the fault path, we need two lookups instead of one (swap ID ->
>> > swap entry, then swap entry -> swap cache); I am not sure how this
>> > affects fault latency.
>> > - Each swap backend will have its own separate implementation of swap
>> > counting, which is hard to maintain and very error-prone since the
>> > logic is backend-agnostic.
>> > - Handing over a page from one swap backend to another includes handing
>> > over swap cache entries and swap counts, which I imagine will involve
>> > considerable synchronization.
>> >
>> > Do you have any thoughts on this?
>>
>> Yes. I understand there's additional overhead. I have no clear idea
>> about how to reduce this now. We need to think about that in depth.
>>
>> The bottom line is whether this is worse than the current zswap
>> implementation?
>
> It's not just zswap; as I noted above, this design would introduce some
> overhead to the core swapping code as well, as long as the indirection
> layer is active. I am particularly worried about the extra lookups on
> the fault path.

Maybe you can measure the time for the radix tree lookup and compare it
with the total fault time?

> For zswap, we already have a lookup today, so maintaining a swap entry
> -> zswap entry mapping would not be a regression, but I am not sure
> about the extra overhead to manage swap entries within zswap. Keep in
> mind that using swap entries for zswap probably implies having a
> fixed/max size for zswap (to be able to manage the swap entries
> efficiently, similar to swap devices), which is a limitation that the
> initial proposal was hoping to overcome.

We have limited bits in the PTE, so the max number of zswap entries will
be limited anyway. And we don't need to manage swap entries in the same
way as disks (which need to consider sequential writing, etc.).
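To make the minimal indirection idea discussed earlier in this thread a
bit more concrete, here is a rough, untested sketch. All names below
(swap_id_map, swap_id_alloc, swap_id_lookup, swap_id_move) are made up
for illustration only, not existing kernel interfaces, and locking and
error handling are omitted:

/*
 * Sketch of the "one extra xarray" indirection: swap_map[] and the swap
 * cache are untouched; a single xarray maps swap id -> swap entry.
 */
#include <linux/xarray.h>
#include <linux/swap.h>

/* swap id -> swp_entry_t, stored as an xarray value entry */
static DEFINE_XARRAY_ALLOC(swap_id_map);

/* Swap-out: allocate a swap id for a newly allocated swap entry. */
static int swap_id_alloc(swp_entry_t entry, u32 *id)
{
	/* assumes entry.val fits in an xarray value entry */
	return xa_alloc(&swap_id_map, id, xa_mk_value(entry.val),
			xa_limit_31b, GFP_KERNEL);
}

/* Fault path: the one extra lookup, swap id -> swap entry. */
static swp_entry_t swap_id_lookup(u32 id)
{
	void *e = xa_load(&swap_id_map, id);

	return (swp_entry_t){ .val = e ? xa_to_value(e) : 0 };
}

/* Migration between backends: only the xarray slot is rewritten. */
static void swap_id_move(u32 id, swp_entry_t new_entry)
{
	xa_store(&swap_id_map, id, xa_mk_value(new_entry.val), GFP_KERNEL);
}

The point of the sketch is that the fault path gains exactly one
xa_load(), and moving a page between backends rewrites only one xarray
slot, which is where the "8 bytes per swapped page" estimate comes from.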
Best Regards,
Huang, Ying

[snip]