From: "Huang, Ying" <ying.huang@intel.com>
To: Yosry Ahmed
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner,
	Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes,
	Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool,
	Yang Shi, Peter Xu, Minchan Kim, Andrew Morton
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
Date: Thu, 09 Mar 2023 20:48:28 +0800
In-Reply-To: (Yosry Ahmed's message of "Wed, 1 Mar 2023 16:30:22 -0800")
Message-ID: <87356e850j.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yosry Ahmed writes:

> On Tue, Feb 28, 2023 at 3:11 PM Chris Li wrote:
>>
>> Hi Yosry,
>>
>> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
>> > Hello everyone,
>> >
>> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
>> > 2023 about swap & zswap (hope I am not too late).
>>
>> I am very interested in participating in this discussion as well.
>
> That's great to hear!
>
>>
>> > ==================== Objective ====================
>> > Enabling the use of zswap without a backing swapfile, which makes
>> > zswap useful for a wider variety of use cases. Also, when zswap is
>> > used with a swapfile, the pages in zswap do not use up space in the
>> > swapfile, so the overall swapping capacity increases.
>>
>> Agree.
>>
>> >
>> > ==================== Idea ====================
>> > Introduce a data structure, which I currently call a swap_desc, as
>> > an abstraction layer between the swapping implementation and the
>> > rest of MM code. Page tables & page caches would store a swap id
>> > (encoded as a swp_entry_t) instead of directly storing the swap
>> > entry associated with the swapfile. This swap id maps to a struct
>> > swap_desc, which acts
>>
>> Can you provide a bit more detail? I am curious how this swap id
>> maps into the swap_desc. Is the swp_entry_t cast into "struct
>> swap_desc *", or does it go through some lookup table/tree?
>
> The swap id would be an index in a radix tree (aka xarray) that
> contains a pointer to the swap_desc struct. This lookup should be
> free with this design, because we also use the swap_desc to directly
> store the swap cache pointer, so this lookup essentially replaces the
> swap cache lookup.
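To make the lookup concrete, a minimal sketch of such a mapping with
the kernel XArray API could look like the following. (This is only an
illustration of the idea as I understand it; swap_desc_install() and
swap_desc_lookup() are invented names, not existing kernel functions.)

#include <linux/xarray.h>

struct swap_desc;	/* the structure proposed in this thread */

/* swap id -> struct swap_desc *; ids are handed out by the xarray */
static DEFINE_XARRAY_ALLOC(swap_descs);

/* Assign a fresh swap id to a newly created swap_desc. */
static int swap_desc_install(struct swap_desc *desc, u32 *id)
{
	/* 32-bit ids leave spare bits in swp_entry_t for type/flags */
	return xa_alloc(&swap_descs, id, desc, xa_limit_32b, GFP_KERNEL);
}

/* The lookup that would replace today's swap cache lookup. */
static struct swap_desc *swap_desc_lookup(u32 id)
{
	return xa_load(&swap_descs, id);
}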
>
>>
>> > as our abstraction layer. All MM code not concerned with swapping
>> > details would operate in terms of swap descs. The swap_desc can
>> > point to either a normal swap entry (associated with a swapfile) or
>> > a zswap entry. It can also include all non-backend-specific
>> > operations, such as the swapcache (which would be a simple pointer
>> > in swap_desc), swap
>>
>> Does the zswap entry still use the swap slot cache and
>> swap_info_struct?
>
> In this design, no, it shouldn't.
>
>>
>> > This work enables using zswap without a backing swapfile and
>> > increases the swap capacity when zswap is used with a swapfile. It
>> > also creates a separation that allows us to skip code paths that
>> > don't make sense in the zswap path (e.g. readahead). We get to drop
>> > zswap's rbtree, which might result in better performance (fewer
>> > lookups, less lock contention).
>> >
>> > The abstraction layer also opens the door for multiple cleanups
>> > (e.g. removing swapper address spaces, removing swap count
>> > continuation code, etc). Another nice cleanup that this work
>> > enables would be separating the overloaded swp_entry_t into two
>> > distinct types: one for things that are stored in page tables /
>> > caches, and one for actual swap entries. In the future, we can
>> > potentially further optimize how we use the bits in the page tables
>> > instead of sticking everything into the current type/offset format.
>>
>> Looking forward to seeing more details in the upcoming discussion.
>>
>> >
>> > ==================== Cost ====================
>> > The obvious downside of this is added memory overhead, specifically
>> > for users that use swapfiles without zswap. Instead of paying one
>> > byte (swap_map) for every potential page in the swapfile (+ swap
>> > count continuation), we pay the size of the swap_desc for every
>> > page that is actually in the swapfile, which I estimate to be
>> > roughly 24 bytes or so, so maybe 0.6% of swapped out memory. The
>> > overhead only scales with pages actually swapped out. For zswap
>> > users, it should be
>>
>> Is there a way to avoid turning 1 byte into 24 bytes per swapped-out
>> page? For users that use swap but no zswap, this is pure overhead.
>
> That's what I could think of at this point. My idea was something
> like this:
>
> struct swap_desc {
>         union { /* Use one bit to distinguish them */
>                 swp_entry_t swap_entry;
>                 struct zswap_entry *zswap_entry;
>         };
>         struct folio *swapcache;
>         atomic_t swap_count;
>         u32 id;
> };
>
> Having the id in the swap_desc is convenient, as we can directly map
> the swap_desc to a swp_entry_t to place in the page tables, but I
> don't think it's necessary. Without it, the struct size is 20 bytes,
> so I think the extra 4 bytes are okay to use anyway, given that the
> slab allocator only allocates in multiples of 8 bytes.
>
> The idea here is to unify the swapcache and swap_count
> implementations across the different swap backends (swapfiles, zswap,
> etc), which would create a better abstraction and reduce reinventing
> the wheel.
>
> We could reduce the struct to only 8 bytes and store only the
> swap/zswap entry, but we still need the swap cache anyway, so we
> might as well store the pointer in the struct and have a unified,
> lookup-free swapcache; so really 16 bytes is the minimum.
>
> If we stop at 16 bytes, then we need to handle the swap count
> separately in swapfiles and zswap. This is not the end of the world,
> but are the 8 bytes worth it?

If my understanding is correct, the current implementation also needs
one swap cache pointer per swapped out page: even after calling
__delete_from_swap_cache(), we store the "shadow" entry there.
Although it's possible to implement shadow entry reclaim like what is
done for the file cache shadow entries (workingset_shadow_shrinker),
we haven't done that yet, and it appears that we can live with that.
So, in the current implementation, we use 9 bytes for each swapped out
page. If so, the memory usage ratio is 24 / 9 = 2.667 -- still not
trivial, but not as horrible as 24 / 1 = 24.
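To put rough numbers on that ratio (assuming 4KB pages and a fully
utilized swapfile):

	1GiB swapped out = 262144 pages
	proposed: 262144 * 24 bytes      = 6 MiB of swap_desc
	current:  262144 * (1 + 8) bytes = 2.25 MiB of swap_map plus
	          swap cache / shadow entry slots

With low swapfile utilization the comparison shifts in favor of the
swap_desc approach, because swap_map is paid even for unused slots.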
> Keep in mind that the current overhead is 1 byte O(max swap pages),
> not O(swapped pages). Also, the 1 byte assumes we do not use swap
> count continuation pages. If we do, it may end up being more. We also
> allocate continuations in full 4k pages, so even if one swap_map
> element in a page requires continuation, we allocate an entire page.
> What I am trying to say is that to get an actual comparison you need
> to also factor in the swap utilization and the rate of usage of swap
> continuation. I don't know how to come up with a formula for this,
> tbh.
>
> Also, like Johannes said, the worst case overhead (32 bytes if you
> count the reverse mapping) is 0.8% of swapped memory, i.e. 8M for
> every 1G swapped. It doesn't sound *very* bad. I understand that it
> is pure overhead for people not using zswap, but it is not awful.
>
>>
>> It seems what you really need is one bit of information to indicate
>> that this page is backed by zswap. Then you can have a separate
>> pointer for the zswap entry.
>
> If we use one bit in swp_entry_t (or one of the available swap types)
> to indicate whether the page is backed by a swapfile or by zswap, it
> doesn't really work: we lose the indirection layer. How would we move
> a page from zswap to the swapfile? We would need to go update the
> page tables and the shmem page cache, similar to swapoff.
>
> Instead, if we store some other key in swp_entry_t and use it to look
> up the swp_entry_t or the zswap_entry pointer, then that's
> essentially what the swap_desc does. It just goes the extra mile of
> unifying the swapcache as well and storing it directly in the
> swap_desc instead of in another lookup structure.

If we choose to make sizeof(struct swap_desc) == 8, that is, store
only the swap entry in the swap_desc, then the added indirection
appears to be another level of page table with 1 entry. Then, we may
use a method similar to the one used to support systems with 2-level
and 3-level page tables, like the code in
include/asm-generic/pgtable-nopmd.h. But I haven't thought about this
deeply.
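As a rough sketch of that 8-byte variant (again with invented function
names, ignoring locking, and assuming the backend entry fits in an
xarray value entry), the xarray slot itself could hold the backend
entry, with xa_is_value() acting as the one distinguishing bit you
mention. Moving a page between zswap and a swapfile would then update
a single slot while the id in the page tables stays stable:

#include <linux/xarray.h>
#include <linux/mm_types.h>	/* swp_entry_t */

struct zswap_entry;		/* private to mm/zswap.c today */

static DEFINE_XARRAY_ALLOC(swap_descs);

/* Swapfile-backed id: store entry.val as an xarray value entry. */
static int swap_desc_set_swapfile(u32 id, swp_entry_t entry)
{
	return xa_err(xa_store(&swap_descs, id,
			       xa_mk_value(entry.val), GFP_KERNEL));
}

/* zswap-backed id: store the zswap_entry pointer directly. */
static int swap_desc_set_zswap(u32 id, struct zswap_entry *zentry)
{
	return xa_err(xa_store(&swap_descs, id, zentry, GFP_KERNEL));
}

/* xa_is_value() serves as the "one bit" telling the two apart. */
static bool swap_desc_is_zswap(u32 id)
{
	void *slot = xa_load(&swap_descs, id);

	return slot && !xa_is_value(slot);
}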
>>
>> Depending on how much you are going to reuse the swap cache, you
>> might need to have something like a swap_info_struct to keep the
>> locks happy.
>
> My current intention is to reimplement the swapcache completely as a
> pointer in struct swap_desc. This would eliminate that need, and a
> lot of the locking we do today, if I get things right.
>
>>
>> > Another potential concern is readahead. With this design, we have
>> > no
>>
>> Readahead is for spinning disks :-) Even a normal swap file on an
>> SSD could use some modernization.
>
> Yeah, I initially thought we would only need the swp_entry_t ->
> swap_desc reverse mapping for readahead, and that we could maintain
> it only for spinning disks, but I was wrong. We need it for other
> things as well today: swapoff, and the case where, when trying to
> find an empty swap slot, we start trying to free swap slots used only
> by the swapcache. However, I think both of these cases can be fixed
> (I can share more details if you want). If everything goes well, we
> should only need to maintain the reverse mapping (extra overhead
> above 24 bytes) for swap files on spinning disks, for readahead.
>
>>
>> Looking forward to your discussion.
>>
>> Chris

Best Regards,
Huang, Ying