From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 1 Mar 2023 17:00:23 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Chris Li
Cc: lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM,
 Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
 Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
 Minchan Kim, Andrew Morton

On Wed, Mar 1, 2023 at 4:30 PM Yosry Ahmed wrote:
>
> On Tue, Feb 28, 2023 at 3:11 PM Chris Li wrote:
> >
> > Hi Yosry,
> >
> > On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
> > > Hello everyone,
> > >
> > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > 2023 about swap & zswap (hope I am not too late).
> >
> > I am very interested in participating in this discussion as well.
>
> That's great to hear!
> > > ==================== Objective ====================
> > > Enabling the use of zswap without a backing swapfile, which makes
> > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > used with a swapfile, the pages in zswap do not use up space in the
> > > swapfile, so the overall swapping capacity increases.
> >
> > Agree.
> >
> > > ==================== Idea ====================
> > > Introduce a data structure, which I currently call a swap_desc, as
> > > an abstraction layer between the swapping implementation and the
> > > rest of the MM code. Page tables & page caches would store a swap id
> > > (encoded as a swp_entry_t) instead of directly storing the swap
> > > entry associated with the swapfile. This swap id maps to a struct
> > > swap_desc, which acts
> >
> > Can you provide a bit more detail? I am curious how this swap id
> > maps into the swap_desc. Is the swp_entry_t cast into "struct
> > swap_desc*", or does it go through some lookup table/tree?
>
> The swap id would be an index in a radix tree (aka xarray), which
> contains a pointer to the swap_desc struct. This lookup should be free
> with this design, as we also use the swap_desc to directly store the
> swap cache pointer, so this lookup essentially replaces the swap cache
> lookup.
>
> > > as our abstraction layer. All MM code not concerned with swapping
> > > details would operate in terms of swap descs. The swap_desc can
> > > point to either a normal swap entry (associated with a swapfile) or
> > > a zswap entry. It can also include all non-backend-specific
> > > operations, such as the swapcache (which would be a simple pointer
> > > in swap_desc), swap
> >
> > Does the zswap entry still use the swap slot cache and swap_info_struct?
>
> In this design, no, it shouldn't.
>
> > > This work enables using zswap without a backing swapfile and
> > > increases the swap capacity when zswap is used with a swapfile. It
> > > also creates a separation that allows us to skip code paths that
> > > don't make sense in the zswap path (e.g. readahead). We get to drop
> > > zswap's rbtree, which might result in better performance (fewer
> > > lookups, less lock contention).
> > >
> > > The abstraction layer also opens the door for multiple cleanups
> > > (e.g. removing swapper address spaces, removing swap count
> > > continuation code, etc). Another nice cleanup that this work enables
> > > would be separating the overloaded swp_entry_t into two distinct
> > > types: one for things that are stored in page tables / caches, and
> > > one for actual swap entries. In the future, we can potentially
> > > further optimize how we use the bits in the page tables instead of
> > > sticking everything into the current type/offset format.
> >
> > Looking forward to seeing more details in the upcoming discussion.
> >
> > > ==================== Cost ====================
> > > The obvious downside of this is added memory overhead, specifically
> > > for users that use swapfiles without zswap. Instead of paying one
> > > byte (swap_map) for every potential page in the swapfile (+ swap
> > > count continuation), we pay the size of the swap_desc for every page
> > > that is actually in the swapfile, which I am estimating can be
> > > roughly around 24 bytes or so, so maybe 0.6% of swapped out memory.
> > > The overhead only scales with pages actually swapped out. For zswap
> > > users, it should be
> >
> > Is there a way to avoid turning 1 byte into 24 bytes per swapped
> > page? For users that use swap but not zswap, this is pure overhead.
>
> That's what I could think of at this point. My idea was something like
> this:
>
> struct swap_desc {
>     union { /* Use one bit to distinguish them */
>         swp_entry_t swap_entry;
>         struct zswap_entry *zswap_entry;
>     };
>     struct folio *swapcache;
>     atomic_t swap_count;
>     u32 id;
> };
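To make the xarray mapping above concrete, here is a rough sketch of how
id allocation and lookup could work. This is purely illustrative: the
function names and details below are hypothetical, not from an actual
patch.

#include <linux/xarray.h>
#include <linux/slab.h>
#include <linux/swap.h>

static DEFINE_XARRAY_ALLOC(swap_descs);

/*
 * Allocate a descriptor and a free id for it. The id (encoded as a
 * swp_entry_t) is what page tables and page caches would store.
 */
static struct swap_desc *swap_desc_alloc(swp_entry_t backing)
{
	struct swap_desc *desc = kzalloc(sizeof(*desc), GFP_KERNEL);

	if (!desc)
		return NULL;
	desc->swap_entry = backing;
	atomic_set(&desc->swap_count, 1);
	if (xa_alloc(&swap_descs, &desc->id, desc, xa_limit_32b,
		     GFP_KERNEL)) {
		kfree(desc);
		return NULL;
	}
	return desc;
}

/*
 * The id recovered from a page table entry finds the descriptor, and
 * desc->swapcache is then reachable without a separate swap cache
 * lookup.
 */
static struct swap_desc *swap_desc_lookup(u32 id)
{
	return xa_load(&swap_descs, id);
}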
> Having the id in the swap_desc is convenient, as we can directly map
> the swap_desc to a swp_entry_t to place in the page tables, but I
> don't think it's necessary. Without it, the struct size is 20 bytes,
> so I think the extra 4 bytes are okay to use anyway if the slab
> allocator only allocates in multiples of 8 bytes.
>
> The idea here is to unify the swapcache and swap_count implementations
> across different swap backends (swapfiles, zswap, etc), which would
> create a better abstraction and reduce reinventing the wheel.
>
> We can reduce the struct to only 8 bytes and only store the swap/zswap
> entry, but we still need the swap cache anyway, so we might as well
> store the pointer in the struct and have a unified lookup-free
> swapcache; so really 16 bytes is the minimum.
>
> If we stop at 16 bytes, then we need to handle the swap count
> separately in swapfiles and zswap. This is not the end of the world,
> but are the 8 bytes worth this?
>
> Keep in mind that the current overhead is 1 byte O(max swap pages),
> not O(swapped). Also, the 1 byte assumes we do not use the swap
> continuation pages. If we do, it may end up being more. We also
> allocate continuations in full 4k pages, so even if one swap_map
> element in a page requires continuation, we allocate an entire page.
> What I am trying to say is that to get an actual comparison you need
> to also factor in the swap utilization and the rate of usage of swap
> continuation. I don't know how to come up with a formula for this tbh.
>
> Also, like Johannes said, the worst case overhead (32 bytes if you
> count the reverse mapping) is 0.8% of swapped memory, i.e. 8M for
> every 1G swapped. It doesn't sound *very* bad. I understand that it is
> pure overhead for people not using zswap, but it is not very awful.

Oh, I forgot: I think the 24 bytes *might* actually be reduced to 16
bytes if we free the underlying swap entry / zswap entry once we add
the page to the swapcache. I have not posted anything about it yet, as
I am still considering whether there might be synchronization problems
with this approach, but I will try it out.

> > It seems what you really need is one bit of information to indicate
> > that this page is backed by zswap. Then you can have a separate
> > pointer for the zswap entry.
>
> If you use one bit in swp_entry_t (or one of the available swap types)
> to indicate whether the page is backed by a swapfile or by zswap, it
> doesn't really work: we lose the indirection layer. How do we move the
> page from zswap to the swapfile? We would need to go update the page
> tables and the shmem page cache, similar to swapoff.
>
> Instead, if we store a key in swp_entry_t and use it to look up the
> swp_entry_t or the zswap_entry pointer, then that's essentially what
> the swap_desc does. It just goes the extra mile of unifying the
> swapcache as well, storing it directly in the swap_desc instead of in
> another lookup structure.
>
> > Depending on how much you are going to reuse the swap cache, you
> > might need to have something like a swap_info_struct to keep the
> > locks happy.
>
> My current intention is to reimplement the swapcache completely as a
> pointer in struct swap_desc. This would eliminate this need and a lot
> of the locking we do today, if I get things right.
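As an illustration of that last point, with the swap cache collapsed
into a per-descriptor folio pointer, lookup and insertion become simple
field accesses instead of an address_space xarray walk. Another rough,
hypothetical sketch, glossing over the synchronization questions
mentioned above:

/*
 * Sketch only: a real version would need a locking/cmpxchg discipline
 * to handle races between page faults adding a folio and reclaim
 * removing one; this just shows the shape of a lookup-free swap cache.
 */
static struct folio *swap_desc_cache_get(struct swap_desc *desc)
{
	return desc->swapcache;		/* NULL if not cached */
}

static bool swap_desc_cache_add(struct swap_desc *desc, struct folio *folio)
{
	/* Install only if nothing is cached yet. */
	return cmpxchg(&desc->swapcache, NULL, folio) == NULL;
}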
>
> > > Another potential concern is readahead. With this design, we have no
> >
> > Readahead is for spinning disks :-) Even a normal swap file with an
> > SSD can use some modernization.
>
> Yeah, I initially thought we would only need the swp_entry_t ->
> swap_desc reverse mapping for readahead, and that we could store it
> only for spinning disks, but I was wrong. We need it for other things
> as well today: swapoff, and the path where, failing to find an empty
> swap slot, we start trying to free swap slots used only by the
> swapcache. However, I think both of these cases can be fixed (I can
> share more details if you want). If everything goes well, we should
> only need to maintain the reverse mapping (extra overhead above the 24
> bytes) for swap files on spinning disks, for readahead.
>
> > Looking forward to your discussion.
> >
> > Chris
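To illustrate the reverse mapping being weighed above, a final rough
sketch: a side xarray, maintained only for rotational devices and keyed
by the raw swap entry value, that lets readahead find the descriptors
of physically adjacent slots. As before, all names are hypothetical.

static DEFINE_XARRAY(swap_rmap);

/*
 * Remember which descriptor owns a disk slot, but only where physical
 * adjacency matters (i.e. rotational media).
 */
static void swap_rmap_note(swp_entry_t entry, struct swap_desc *desc,
			   bool rotational)
{
	if (rotational)
		xa_store(&swap_rmap, entry.val, desc, GFP_KERNEL);
}

/* Readahead: find the descriptor for a neighboring slot, if any. */
static struct swap_desc *swap_rmap_neighbor(swp_entry_t entry,
					    unsigned long delta)
{
	return xa_load(&swap_rmap, entry.val + delta);
}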