From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 22 Mar 2023 20:27:24 -0700
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: "Huang, Ying"
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM,
 Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
 Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton,
 Aneesh Kumar K V, Wei Xu
In-Reply-To: <87v8isrwck.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Wed, Mar 22, 2023 at 8:17 PM Huang, Ying wrote:
>
> Yosry Ahmed writes:
>
> > On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying wrote:
> >>
> >> Yosry Ahmed writes:
> >>
> >> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying wrote:
> >> >>
> >> >> Yosry Ahmed writes:
> >> >>
> >> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yosry Ahmed writes:
> >> >> >>
> >> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Yosry Ahmed writes:
> >> >> >> >>
> >> >> >> >> > My current idea is to have one xarray that stores the swap_descs
> >> >> >> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
> >> >> >> >> > rotating disks have an additional xarray that maps swap_entry ->
> >> >> >> >> > swap_desc for cluster readahead, assuming we can eliminate all other
> >> >> >> >> > situations requiring a reverse mapping.
> >> >> >> >> >
> >> >> >> >> > I am not sure how having separate xarrays helps? If we have one xarray,
> >> >> >> >> > might as well save the other lookups and put everything in swap_desc.
> >> >> >> >> > In fact, this should improve the locking today as swapcache /
> >> >> >> >> > swap_count operations can be lockless or very lightly contended.
> >> >> >> >>
> >> >> >> >> The condition of the proposal is "reverse mapping cannot be avoided for
> >> >> >> >> enough situations". So, if reverse mapping (or cluster readahead) can be
> >> >> >> >> avoided for enough situations, I think your proposal is good. Otherwise,
> >> >> >> >> I propose to use 2 xarrays. You don't need another reverse mapping
> >> >> >> >> xarray, because you just need to read the next several swap_entry into
> >> >> >> >> the swap cache for cluster readahead. swap_desc isn't needed for
> >> >> >> >> cluster readahead.
> >> >> >> >
> >> >> >> > swap_desc would be needed for cluster readahead in my original
> >> >> >> > proposal as the swap cache lives in swap_descs. Based on the current
> >> >> >> > implementation, we would need a reverse mapping (swap entry ->
> >> >> >> > swap_desc) in 3 situations:
> >> >> >> >
> >> >> >> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
> >> >> >> > failing, we fall back to trying to find swap entries that only have a
> >> >> >> > page in the swap cache (no references in page tables or page cache)
> >> >> >> > and free them. This would require a reverse mapping.
> >> >> >> >
> >> >> >> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
> >> >> >> > to get all swap_descs associated with that swapfile.
> >> >> >> >
> >> >> >> > 3) swap cluster readahead.
> >> >> >> >
> >> >> >> > For (1), I think we can drop the dependency on a reverse mapping if we
> >> >> >> > free swap entries once we swap a page in and add it to the swap cache,
> >> >> >> > even if the swap count does not drop to 0.
> >> >> >>
> >> >> >> Now, we will not drop the swap cache even if the swap count becomes 0 if
> >> >> >> swap space utility < 50%. Per my understanding, this avoids swap page
> >> >> >> writing for read accesses. So I don't think we can change this directly
> >> >> >> without necessary discussion first.
> >> >> >
> >> >> > Right. I am not sure I understand why we do this today, is it to save
> >> >> > the overhead of allocating a new swap entry if the page is swapped out
> >> >> > again soon? I am not sure I understand this statement "this avoids swap
> >> >> > page writing for read accesses".
> >> >> >
> >> >> >> > For (2), instead of scanning page tables and shmem page cache to find
> >> >> >> > swapped out pages for the swapfile, we can scan all swap_descs
> >> >> >> > instead, which should be more efficient. This is one of the proposal's
> >> >> >> > potential advantages.
> >> >> >>
> >> >> >> Good.
> >> >> >>
> >> >> >> > (3) is the one that would still need a reverse mapping with the
> >> >> >> > current proposal. Today we use swap cluster readahead for anon pages
> >> >> >> > if we have a spinning disk or vma readahead is disabled. For shmem, we
> >> >> >> > always use cluster readahead. If we can limit cluster readahead to
> >> >> >> > only rotating disks, then the reverse mapping can only be maintained
> >> >> >> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
> >> >> >> > reverse mapping for all swapfiles.
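To make the shape of this concrete, here is a toy userspace Python model of the idea discussed above: a forward map from swap ID to swap_desc, with a reverse map (swap entry -> swap ID) maintained only for rotating disks, where cluster readahead still needs to walk adjacent device slots. This is purely illustrative — all names are hypothetical stand-ins, not the kernel data structures.

```python
# Toy model: forward map (swap ID -> swap_desc) plus an optional reverse
# map (swap entry -> swap ID) kept only for rotating disks, where cluster
# readahead needs entry-order lookups. Not kernel code.
from dataclasses import dataclass
from itertools import count
from typing import Optional

@dataclass
class SwapDesc:
    swap_entry: Optional[int]            # slot on a swap device (None if in zswap)
    swap_count: int = 1
    swapcache_page: Optional[object] = None

class SwapTable:
    def __init__(self, rotating_disk=False):
        self.descs = {}                  # models the swap_desc xarray
        self.reverse = {}                # swap_entry -> swap ID (rotating only)
        self.rotating = rotating_disk
        self._ids = count()

    def alloc(self, swap_entry):
        sid = next(self._ids)
        self.descs[sid] = SwapDesc(swap_entry)
        if self.rotating:
            self.reverse[swap_entry] = sid
        return sid

    def desc(self, sid):
        return self.descs[sid]

    def readahead_neighbors(self, swap_entry, window=2):
        # Cluster readahead walks adjacent device slots, hence the reverse map.
        return [self.reverse[e]
                for e in range(swap_entry + 1, swap_entry + 1 + window)
                if e in self.reverse]
```

On an SSD-backed table (`rotating_disk=False`) the reverse map stays empty, which is the memory saving the proposal is after.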
> >> >> >>
> >> >> >> For shmem, I think that it should be good to readahead based on shmem
> >> >> >> file offset instead of swap device offset.
> >> >> >>
> >> >> >> It's possible that some pages in the readahead window are from HDD while
> >> >> >> some other pages aren't. So it's a little hard to enable cluster read
> >> >> >> for HDD only. Anyway, it's not common to use HDD for swap now.
> >> >> >>
> >> >> >> >> > If the point is to store the swap_desc directly inside the xarray to
> >> >> >> >> > save 8 bytes, I am concerned that having multiple xarrays for
> >> >> >> >> > swapcache, swap_count, etc will use more than that.
> >> >> >> >>
> >> >> >> >> The idea is to save the memory used by the reverse mapping xarray.
> >> >> >> >
> >> >> >> > I see.
> >> >> >> >
> >> >> >> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> >> >> >> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
> >> >> >> >> >> >> > continuation pages. If we do, it may end up being more. We also
> >> >> >> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
> >> >> >> >> >> >> > element in a page requires continuation, we will allocate an entire
> >> >> >> >> >> >> > page. What I am trying to say is that to get an actual comparison you
> >> >> >> >> >> >> > need to also factor in the swap utilization and the rate of usage of
> >> >> >> >> >> >> > swap continuation. I don't know how to come up with a formula for this
> >> >> >> >> >> >> > tbh.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
> >> >> >> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> >> >> >> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> >> >> >> >> >> >> > overhead for people not using zswap, but it is not very awful.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> It seems what you really need is one bit of information to indicate
> >> >> >> >> >> >> >> this page is backed by zswap. Then you can have a separate pointer
> >> >> >> >> >> >> >> for the zswap entry.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
> >> >> >> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it
> >> >> >> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
> >> >> >> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
> >> >> >> >> >> >> > the shmem page cache, similar to swapoff.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to look up
> >> >> >> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
> >> >> >> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
> >> >> >> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
> >> >> >> >> >> >> > storing it in another lookup structure.
> >> >> >> >> >> >>
> >> >> >> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> >> >> >> >> >> >> swap_entry in swap_desc. The added indirection appears to be another
> >> >> >> >> >> >> level of page table with 1 entry. Then, we may use the similar method
> >> >> >> >> >> >> as supporting systems with 2 level and 3 level page tables, like the code
> >> >> >> >> >> >> in include/asm-generic/pgtable-nopmd.h. But I haven't thought about
> >> >> >> >> >> >> this deeply.
> >> >> >> >> >> >
> >> >> >> >> >> > Can you expand further on this idea? I am not sure I fully understand.
> >> >> >> >> >>
> >> >> >> >> >> OK. The goal is to avoid the overhead if indirection isn't enabled via
> >> >> >> >> >> kconfig.
> >> >> >> >> >>
> >> >> >> >> >> If indirection isn't enabled, store swap_entry in the PTE directly.
> >> >> >> >> >> Otherwise, store the index of swap_desc in the PTE. Different functions
> >> >> >> >> >> (e.g., to get/set swap_entry in the PTE) are implemented based on kconfig.
> >> >> >> >> >
> >> >> >> >> > I thought about this, the problem is that we will have multiple
> >> >> >> >> > implementations of multiple things. For example, swap_count without
> >> >> >> >> > the indirection layer lives in the swap_map (with continuation logic).
> >> >> >> >> > With the indirection layer, it lives in the swap_desc (or somewhere
> >> >> >> >> > else). Same for the swapcache. Even if we keep the swapcache in an
> >> >> >> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
> >> >> >> >> > the indirection is disabled, and by swap_desc (or similar) if the
> >> >> >> >> > indirection is enabled. I think maintaining separate implementations
> >> >> >> >> > for when the indirection is enabled/disabled would be adding too much
> >> >> >> >> > complexity.
> >> >> >> >> >
> >> >> >> >> > WDYT?
> >> >> >> >>
> >> >> >> >> If we go this way, swap cache and swap_count will always be indexed by
> >> >> >> >> swap_entry. swap_desc just provides an indirection to make it possible
> >> >> >> >> to move between swap devices.
> >> >> >> >>
> >> >> >> >> Why must we index swap cache and swap_count by swap_desc if indirection
> >> >> >> >> is enabled? Yes, we can save one xarray indexing if we do so, but I
> >> >> >> >> don't think the overhead of one xarray indexing is a showstopper.
> >> >> >> >>
> >> >> >> >> I think this can be one intermediate step towards your final target.
> >> >> >> >> The changes to the current implementation can be smaller.
> >> >> >> >
> >> >> >> > IIUC, the idea is to have two xarrays:
> >> >> >> > (a) xarray that stores a pointer to a struct containing swap_count and
> >> >> >> > swap cache.
> >> >> >> > (b) xarray that stores the underlying swap entry or zswap entry.
> >> >> >> >
> >> >> >> > When indirection is disabled:
> >> >> >> > page tables & page cache have the swap entry directly like today, xarray
> >> >> >> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
> >> >> >> > mapping needed.
> >> >> >> >
> >> >> >> > In this case we have an extra overhead of 12-16 bytes (the struct
> >> >> >> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
> >> >> >> >
> >> >> >> > When indirection is enabled:
> >> >> >> > page tables & page cache have a swap id (or swap_desc index), xarray
> >> >> >> > (a) is indexed by swap id,
> >> >> >>
> >> >> >> xarray (a) is indexed by swap entry.
> >> >> >
> >> >> > How so? With the indirection enabled, the page tables & page cache
> >> >> > have the swap id (or swap_desc index), which can point to a swap entry
> >> >> > or a zswap entry -- which can change when the page is moved between
> >> >> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
> >> >> > case? Shouldn't it be indexed by the abstract swap id so that the
> >> >> > writeback from zswap is transparent?
> >> >>
> >> >> In my mind,
> >> >>
> >> >> - swap core will define an abstract interface to swap implementations
> >> >>   (zswap, swap device/file, maybe more in the future), like VFS.
> >> >>
> >> >> - zswap will be a special swap implementation (compressing instead of
> >> >>   writing to disk).
> >> >>
> >> >> - swap core will manage the indirection layer and swap cache.
> >> >>
> >> >> - swap core can move swap pages between swap implementations (e.g., from
> >> >>   zswap to a swap device, or from one swap device to another swap
> >> >>   device) with the help of the indirection layer.
> >> >>
> >> >> In this design, the writeback from zswap becomes moving swapped pages
> >> >> from zswap to a swap device.
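The layering sketched above can be illustrated with a toy userspace model: a swap core that owns the indirection table and talks to swap implementations through one abstract interface, VFS-style. All class and method names here are hypothetical illustrations; the dicts stand in for xarrays and the strings for page contents.

```python
# Toy sketch: swap core + abstract backend interface. "Writeback" is just
# moving a page between backends; PTEs hold only the stable swap ID.
class SwapBackend:
    def store(self, page): ...       # returns a backend-private handle
    def load(self, handle): ...
    def free(self, handle): ...

class ZswapBackend(SwapBackend):
    def __init__(self):
        self.pool, self._next = {}, 0
    def store(self, page):
        h, self._next = self._next, self._next + 1
        self.pool[h] = ("compressed", page)   # tuple stands in for compression
        return h
    def load(self, handle):
        return self.pool[handle][1]           # "decompress"
    def free(self, handle):
        del self.pool[handle]

class SwapDeviceBackend(SwapBackend):
    def __init__(self):
        self.slots, self._next = {}, 0
    def store(self, page):
        h, self._next = self._next, self._next + 1
        self.slots[h] = page                  # stands in for disk I/O
        return h
    def load(self, handle):
        return self.slots[handle]
    def free(self, handle):
        del self.slots[handle]

class SwapCore:
    """Owns the indirection layer: swap ID -> (backend, handle)."""
    def __init__(self):
        self.table, self._ids = {}, 0
    def swap_out(self, page, backend):
        sid, self._ids = self._ids, self._ids + 1
        self.table[sid] = (backend, backend.store(page))
        return sid                            # this ID is what goes in the PTE
    def swap_in(self, sid):
        backend, handle = self.table[sid]
        return backend.load(handle)
    def move(self, sid, dst):
        # zswap writeback: move between implementations without touching PTEs.
        src, handle = self.table[sid]
        page = src.load(handle)
        src.free(handle)
        self.table[sid] = (dst, dst.store(page))
```

The point of the sketch is that `move()` only rewrites the indirection table entry, which is why writeback becomes transparent to page tables and the shmem page cache.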
> >> >
> >> > All the above matches my understanding of this proposal. swap_desc is
> >> > the proposed indirection layer, and the swap implementations are zswap
> >> > & swap devices. For now, we only have 2 static swap implementations
> >> > (zswap->swapfile). In the future, we can make this more dynamic as the
> >> > need arises.
> >>
> >> Great to align with you on this.
> >
> > Great!
> >
> >> >> If my understanding were correct, your suggestion is kind of moving
> >> >> zswap logic to the swap core? And zswap will always be at a higher
> >> >> layer on top of swap device/file?
> >> >
> >> > We do not want to move the zswap logic into the swap core, we want to
> >> > make the swap core independent of the swap implementation, and zswap
> >> > is just one possible implementation.
> >>
> >> Good!
> >>
> >> I found that you put zswap related data structures inside struct
> >> swap_desc directly. I think that we should avoid that as much as
> >> possible.
> >
> > That was just an initial draft to show what it would be like, I do not
> > intend to hardcode zswap-specific items in swap_desc.
> >
> >> >> >> > xarray (b) is indexed by swap id as well
> >> >> >> > and contains the swap entry or zswap entry. Reverse mapping might be
> >> >> >> > needed.
> >> >> >>
> >> >> >> Reverse mapping isn't needed.
> >> >> >
> >> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
> >> >> > sure I understand how it can be indexed by the swap entry if the
> >> >> > indirection is enabled.
> >> >> >
> >> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> >> >> >> > the xarray (b) entry + memory overhead from the 2nd xarray + reverse
> >> >> >> > mapping where needed.
> >> >> >> >
> >> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
> >> >> >> >
> >> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
> >> >> >> > good if the reverse mapping can be avoided in enough situations, and
> >> >> >> > that we should consider such alternatives otherwise. As I mentioned
> >> >> >> > above, I think it comes down to whether we can completely restrict
> >> >> >> > cluster readahead to rotating disks or not -- in which case we need to
> >> >> >> > decide what to do for shmem and for anon when vma readahead is
> >> >> >> > disabled.
> >> >> >>
> >> >> >> We can even have a minimal indirection implementation, where swap
> >> >> >> cache and swap_map[] are kept as they were before and just one xarray
> >> >> >> is added. The xarray is indexed by swap id (or swap_desc index) to
> >> >> >> store the corresponding swap entry.
> >> >> >>
> >> >> >> When indirection is disabled, no extra overhead.
> >> >> >>
> >> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
> >> >> >> swapped page.
> >> >> >>
> >> >> >> The basic migration support can be built on top of this.
> >> >> >>
> >> >> >> I think that this could be a baseline for indirection support. Then
> >> >> >> further optimization can be built on top of it step by step with
> >> >> >> supporting data.
> >> >> >
> >> >> > I am not sure how this works with zswap. Currently the swap_map[]
> >> >> > implementation is specific to swapfiles, it does not work for zswap
> >> >> > unless we implement separate swap counting logic for zswap &
> >> >> > swapfiles. Same for the swapcache, it currently supports being indexed
> >> >> > by a swap entry, it would need to support being indexed by a swap id,
> >> >> > or have a separate swap cache for zswap. Having separate
> >> >> > implementations would add complexity, and we would need to perform
> >> >> > handoffs of the swap count/cache when a page is moved from zswap to a
> >> >> > swapfile.
> >> >>
> >> >> We can allocate a swap entry for each swapped page in zswap.
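The "minimal indirection" baseline described above can be sketched as a toy model: swap cache and swap_map[] stay keyed by swap entry as today, and the only addition is one table mapping swap ID -> swap entry, which disappears entirely when indirection is disabled. Names are hypothetical; a dict stands in for the added xarray.

```python
# Toy model of the minimal-indirection baseline: one added swap ID ->
# swap entry table, compiled out when indirection is disabled. Not kernel code.
class MinimalIndirection:
    def __init__(self, enabled):
        self.enabled = enabled
        self.id_to_entry = {}          # the single added xarray (~8 bytes/page)
        self._next = 0

    def record(self, swap_entry):
        """Called at swap-out; returns what gets stored in the PTE."""
        if not self.enabled:
            return swap_entry          # PTE holds the entry directly: no overhead
        sid, self._next = self._next, self._next + 1
        self.id_to_entry[sid] = swap_entry
        return sid

    def resolve(self, pte_value):
        """Called at fault time to get the entry keying swap cache/swap_map."""
        if not self.enabled:
            return pte_value
        return self.id_to_entry[pte_value]

    def migrate(self, sid, new_entry):
        """Move the page to another device slot without touching PTEs."""
        assert self.enabled
        self.id_to_entry[sid] = new_entry
```

With `enabled=False` the table is never populated, which models the "no extra overhead" case; with `enabled=True`, `migrate()` is the basic migration support the baseline enables.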
> >> >
> >> > This is exactly what the current implementation does and what we want
> >> > to move away from. The current implementation uses zswap as an
> >> > in-memory compressed cache on top of an actual swap device, and each
> >> > swapped page in zswap has a swap entry allocated. With this
> >> > implementation, zswap cannot be used without a swap device.
> >>
> >> I totally agree that we should avoid using an actual swap device under
> >> zswap. And, as a swap implementation, zswap can manage the swap entry
> >> inside zswap without an underlying actual swap device. For example,
> >> when we swap a page to zswap (actually compress), we can allocate a
> >> (virtual) swap entry in zswap. I understand that there's overhead
> >> to manage the swap entry in zswap. We can consider how to reduce the
> >> overhead.
> >
> > I see. So we can (for example) use one of the swap types for zswap,
> > and then have zswap code handle this entry according to its
> > implementation. We can then have an xarray that maps swap ID -> swap
> > entry, and this swap entry is used to index the swap cache and such.
> > When a swapped page is moved between backends we update the swap ID ->
> > swap entry xarray.
> >
> > This is one possible implementation that I thought of (very briefly
> > tbh), but it does have its problems:
> >
> > For zswap:
> > - Managing swap entries inside zswap unnecessarily.
> > - We need to maintain a swap entry -> zswap entry mapping in zswap --
> > similar to the current rbtree, which is something that we can get rid
> > of with the initial proposal if we embed the zswap_entry pointer
> > directly in the swap_desc (it can be encoded to avoid breaking the
> > abstraction).
> >
> > For mm/swap in general:
> > - When we allocate a swap entry today, we store it in folio->private
> > (or page->private), which is used by the unmapping code to be placed
> > in the page tables or shmem page cache. With this implementation, we
> > need to store the swap ID in page->private instead, which means that
> > every time we need to access the swap cache during reclaim/swapout we
> > need to look up the swap entry first.
> > - On the fault path, we need two lookups instead of one (swap ID ->
> > swap entry, swap entry -> swap cache), not sure how this affects fault
> > latency.
> > - Each swap backend will have its own separate implementation of swap
> > counting, which is hard to maintain and very error-prone since the
> > logic is backend-agnostic.
> > - Handing over a page from one swap backend to another includes
> > handing over swap cache entries and swap counts, which I imagine will
> > involve considerable synchronization.
> >
> > Do you have any thoughts on this?
>
> Yes. I understand there's additional overhead. I have no clear idea
> about how to reduce this now. We need to think about that in depth.
>
> The bottom line is whether this is worse than the current zswap
> implementation?

It's not just zswap, as I note above, this design would introduce some
overheads to the core swapping code as well as long as the indirection
layer is active. I am particularly worried about the extra lookups on
the fault path.

For zswap, we already have a lookup today, so maintaining a swap entry ->
zswap entry mapping would not be a regression, but I am not sure about
the extra overhead to manage swap entries within zswap. Keep in mind
that using swap entries for zswap probably implies having a fixed/max
size for zswap (to be able to manage the swap entries efficiently
similar to swap devices), which is a limitation that the initial
proposal was hoping to overcome.

> Best Regards,
> Huang, Ying
>
> >> >> >> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
> >> >> >> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
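The fault-path concern raised above (two lookups instead of one) can be made concrete with a toy model where a counter stands in for lookup cost: with the extra mapping active, a fault resolves swap ID -> swap entry and then swap entry -> swap cache, versus a single swap-cache lookup today. All structures are hypothetical stand-ins for xarrays.

```python
# Toy illustration of the fault-path lookup count with and without the
# added swap ID -> swap entry mapping. Not kernel code.
class FaultPath:
    def __init__(self, indirection):
        self.indirection = indirection
        self.id_to_entry = {}    # present only with indirection enabled
        self.swap_cache = {}     # keyed by swap entry, as today
        self.lookups = 0

    def _lookup(self, table, key):
        self.lookups += 1        # counter stands in for the cost of one xarray walk
        return table.get(key)

    def fault(self, pte_value):
        if self.indirection:
            entry = self._lookup(self.id_to_entry, pte_value)
        else:
            entry = pte_value    # PTE already holds the swap entry
        return self._lookup(self.swap_cache, entry)
```

Whether the second walk matters in practice would depend on cache behavior of the real xarray, which this sketch deliberately does not model.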
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > My current intention is to reimplement the swapcache completely as a
> >> >> >> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
> >> >> >> >> >> >> > of the locking we do today if I get things right.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > Another potential concern is readahead. With this design, we have no
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> >> >> >> >> >> >> use some modernization.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> >> >> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
> >> >> >> >> >> >> > that for spinning disks, but I was wrong. We need it for other things as
> >> >> >> >> >> >> > well today: swapoff, and when trying to find an empty swap slot and we
> >> >> >> >> >> >> > start trying to free swap slots used only by the swapcache. However, I
> >> >> >> >> >> >> > think both of these cases can be fixed (I can share more details if
> >> >> >> >> >> >> > you want). If everything goes well we should only need to maintain the
> >> >> >> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
> >> >> >> >> >> >> > spinning disks for readahead.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> Looking forward to your discussion.
> >> >> >> >> >>
> >> >> >> >> >> Per my understanding, the indirection is to make it easy to move
> >> >> >> >> >> (swapped) pages among swap devices based on hot/cold. This is similar
> >> >> >> >> >> to the target of memory tiering. It appears that we can extend the
> >> >> >> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> >> >> >> >> >> Is it possible for zswap to be faster than some slow memory media?
> >> >> >> >> >
> >> >> >> >> > Agree with Chris that this may require a much larger overhaul. A slow
> >> >> >> >> > memory tier is still addressable memory, swap/zswap requires a page
> >> >> >> >> > fault to read the pages. I think (at least for now) there is a
> >> >> >> >> > fundamental difference. We want reclaim to eventually treat slow
> >> >> >> >> > memory & swap as just different tiers to place cold memory in with
> >> >> >> >> > different characteristics, but otherwise I think the swapping
> >> >> >> >> > implementation itself is very different. Am I missing something?
> >> >> >> >>
> >> >> >> >> Is it possible that zswap is faster than a really slow memory
> >> >> >> >> addressable device backed by NAND? TBH, I don't have the answer.
> >> >> >> >
> >> >> >> > I am not sure either.
> >> >> >> >
> >> >> >> >> Anyway, do you need a way to describe the tiers of the swap devices?
> >> >> >> >> So, you can move the cold pages among the swap devices based on that?
> >> >> >> >
> >> >> >> > For now I think the "tiers" in this proposal are just zswap and normal
> >> >> >> > swapfiles. We can later extend it to support more explicit tiering.
> >> >> >>
> >> >> >> IIUC, in the original zswap implementation, there's a 1:1 relationship
> >> >> >> between zswap and the normal swapfile. But now, you make demoting among
> >> >> >> swap devices more general. Then we need some general way to specify which
> >> >> >> swap devices are fast and which are slow, and the demoting relationship
> >> >> >> among them. It can be memory tiers or something else, but we need one.
> >> >> >
> >> >> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is
> >> >> > fast, swapfile is slow. In the future, we can support more dynamic
> >> >> > tiering if the need arises.
> >> >>
> >> >> We can start from a simple implementation. And I think that it's better
> >> >> to consider the general design too. Try not to make it impossible now.
> >> >
> >> > Right. I am proposing we come up with an abstract generic interface
> >> > for swap implementations, and have 2 implementations statically
> >> > defined (swapfiles and zswap). If the need arises, we can make swap
> >> > implementations more dynamic in the future.
> >> >
> >> >> Best Regards,
> >> >> Huang, Ying
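The "two hardcoded tiers" idea discussed above can be sketched as a toy model: zswap as the fast tier, a swapfile as the slow tier, with demotion moving cold pages down while the swap ID stays stable. Entirely illustrative; the names and the dict-based "tiers" are hypothetical stand-ins.

```python
# Toy sketch of two hardcoded swap tiers: zswap (fast) over a swapfile
# (slow), with demotion that never touches page tables. Not kernel code.
class TwoTierSwap:
    def __init__(self):
        self.fast = {}   # stands in for zswap (compressed, in memory)
        self.slow = {}   # stands in for a swapfile

    def swap_out(self, sid, page):
        self.fast[sid] = page          # cold pages land in the fast tier first

    def demote(self, sid):
        # Writeback: move from zswap to the swapfile. The swap ID is stable,
        # so nothing that references the page (PTEs, page cache) changes.
        self.slow[sid] = self.fast.pop(sid)

    def swap_in(self, sid):
        return self.fast.get(sid, self.slow.get(sid))
```

More dynamic tiering would replace the two dicts with an ordered list of backends plus a demotion relationship between them, which is the generalization the thread leaves for later.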