From: "Huang, Ying"
To: Yosry Ahmed
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
References: <87356e850j.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 10 Mar 2023 11:06:37 +0800
In-Reply-To: (Yosry Ahmed's message of "Thu, 9 Mar 2023 12:19:03 -0800")
Message-ID: <87y1o571aa.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yosry Ahmed writes:

> On Thu, Mar 9, 2023 at 4:49 AM Huang, Ying wrote:
>>
>> Yosry Ahmed writes:
>>
>> > On Tue, Feb 28, 2023 at 3:11 PM Chris Li wrote:
>> >>
>> >> Hi Yosry,
>> >>
>> >> On Sat, Feb 18, 2023 at 02:38:40PM -0800, Yosry Ahmed wrote:
>> >> > Hello everyone,
>> >> >
>> >> > I would like to propose a topic for the upcoming LSF/MM/BPF in May
>> >> > 2023 about swap & zswap (hope I am not too late).
>> >>
>> >> I am very interested in participating in this discussion as well.
>> >
>> > That's great to hear!
>> >
>> >>
>> >> > ==================== Objective ====================
>> >> > Enabling the use of zswap without a backing swapfile, which makes
>> >> > zswap useful for a wider variety of use cases. Also, when zswap is
>> >> > used with a swapfile, the pages in zswap do not use up space in the
>> >> > swapfile, so the overall swapping capacity increases.
>> >>
>> >> Agree.
>> >>
>> >> >
>> >> > ==================== Idea ====================
>> >> > Introduce a data structure, which I currently call a swap_desc, as an
>> >> > abstraction layer between swapping implementation and the rest of MM
>> >> > code. Page tables & page caches would store a swap id (encoded as a
>> >> > swp_entry_t) instead of directly storing the swap entry associated
>> >> > with the swapfile. This swap id maps to a struct swap_desc, which acts
>> >>
>> >> Can you provide a bit more detail? I am curious how this swap id
>> >> maps into the swap_desc? Is the swp_entry_t cast into "struct
>> >> swap_desc*" or going through some lookup table/tree?
>> >
>> > swap id would be an index in a radix tree (aka xarray), which contains
>> > a pointer to the swap_desc struct. This lookup should be free with
>> > this design as we also use swap_desc to directly store the swap cache
>> > pointer, so this lookup essentially replaces the swap cache lookup.
>> >
>> >>
>> >> > as our abstraction layer. All MM code not concerned with swapping
>> >> > details would operate in terms of swap descs. The swap_desc can point
>> >> > to either a normal swap entry (associated with a swapfile) or a zswap
>> >> > entry. It can also include all non-backend specific operations, such
>> >> > as the swapcache (which would be a simple pointer in swap_desc), swap
>> >>
>> >> Does the zswap entry still use the swap slot cache and swap_info_struct?
>> >
>> > In this design no, it shouldn't.
>> >
>> >>
>> >> > This work enables using zswap without a backing swapfile and increases
>> >> > the swap capacity when zswap is used with a swapfile. It also creates
>> >> > a separation that allows us to skip code paths that don't make sense
>> >> > in the zswap path (e.g. readahead). We get to drop zswap's rbtree
>> >> > which might result in better performance (less lookups, less lock
>> >> > contention).
>> >> >
>> >> > The abstraction layer also opens the door for multiple cleanups (e.g.
>> >> > removing swapper address spaces, removing swap count continuation
>> >> > code, etc). Another nice cleanup that this work enables would be
>> >> > separating the overloaded swp_entry_t into two distinct types: one for
>> >> > things that are stored in page tables / caches, and for actual swap
>> >> > entries. In the future, we can potentially further optimize how we use
>> >> > the bits in the page tables instead of sticking everything into the
>> >> > current type/offset format.
>> >>
>> >> Looking forward to seeing more details in the upcoming discussion.
>> >> >
>> >> > ==================== Cost ====================
>> >> > The obvious downside of this is added memory overhead, specifically
>> >> > for users that use swapfiles without zswap. Instead of paying one byte
>> >> > (swap_map) for every potential page in the swapfile (+ swap count
>> >> > continuation), we pay the size of the swap_desc for every page that is
>> >> > actually in the swapfile, which I am estimating can be roughly around
>> >> > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
>> >> > scales with pages actually swapped out. For zswap users, it should be
>> >>
>> >> Is there a way to avoid turning 1 byte into 24 byte per swapped
>> >> pages? For the users that use swap but no zswap, this is pure overhead.
>> >
>> > That's what I could think of at this point. My idea was something like this:
>> >
>> > struct swap_desc {
>> >     union { /* Use one bit to distinguish them */
>> >         swp_entry_t swap_entry;
>> >         struct zswap_entry *zswap_entry;
>> >     };
>> >     struct folio *swapcache;
>> >     atomic_t swap_count;
>> >     u32 id;
>> > }
>> >
>> > Having the id in the swap_desc is convenient as we can directly map
>> > the swap_desc to a swp_entry_t to place in the page tables, but I
>> > don't think it's necessary. Without it, the struct size is 20 bytes,
>> > so I think the extra 4 bytes are okay to use anyway if the slab
>> > allocator only allocates multiples of 8 bytes.
>> >
>> > The idea here is to unify the swapcache and swap_count implementation
>> > between different swap backends (swapfiles, zswap, etc), which would
>> > create a better abstraction and reduce reinventing the wheel.
>> >
>> > We can reduce to only 8 bytes and only store the swap/zswap entry, but
>> > we still need the swap cache anyway so might as well just store the
>> > pointer in the struct and have a unified lookup-free swapcache, so
>> > really 16 bytes is the minimum.
>> >
>> > If we stop at 16 bytes, then we need to handle swap count separately
>> > in swapfiles and zswap. This is not the end of the world, but are the
>> > 8 bytes worth this?
>>
>> If my understanding were correct, for current implementation, we need
>> one swap cache pointer per swapped out page too. Even after calling
>> __delete_from_swap_cache(), we store the "shadow" entry there. Although
>> it's possible to implement shadow entry reclaiming like that for file
>> cache shadow entry (workingset_shadow_shrinker), we haven't done that
>> yet. And, it appears that we can live with that. So, in current
>> implementation, for each swapped out page, we use 9 bytes. If so, the
>> memory usage ratio is 24 / 9 = 2.667, still not trivial, but not as
>> horrible as 24 / 1 = 24.
>
> Unfortunately it's a little bit more. 24 is the extra overhead.
>
> Today we have an xarray entry for each swapped out page, that either
> has the swapcache pointer or the shadow entry.
>
> With this implementation, we have an xarray entry for each swapped out
> page, that has a pointer to the swap_desc.
>
> Ignoring the overhead of the xarray itself, we have (8 + 24) / (8 + 1) = 3.5556.

OK.  I see.  We can only hold 8 bytes in each xarray entry.  To save
memory, we can allocate multiple swap_descs (e.g., 16) for each xarray
entry.  Then the memory usage of the xarray becomes 1/N, where N is the
number of swap_descs per entry.

> For rotating disks, this might be even higher (8 + 32) / (8 + 1) = 4.444
>
> This is because we need to maintain a reverse mapping between
> swp_entry_t and the swap_desc to use for cluster readahead. I am
> assuming we can limit cluster readahead for rotating disks only.

If the reverse mapping cannot be avoided in enough situations, it's
better to keep only swap_entry in swap_desc, and create another xarray
indexed by swap_entry to store swap_cache, swap_count, etc.

>>
>> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
>> > O(swapped). Also, 1 byte is assuming we do not use the swap
>> > continuation pages. If we do, it may end up being more. We also
>> > allocate continuation in full 4k pages, so even if one swap_map
>> > element in a page requires continuation, we will allocate an entire
>> > page. What I am trying to say is that to get an actual comparison you
>> > need to also factor in the swap utilization and the rate of usage of
>> > swap continuation. I don't know how to come up with a formula for this
>> > tbh.
>> >
>> > Also, like Johannes said, the worst case overhead (32 bytes if you
>> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
>> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
>> > overhead for people not using zswap, but it is not very awful.
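
As a rough check of the figures quoted above, the following sketch
recomputes them.  It assumes 4 KiB pages, 8 bytes per xarray slot, 1
byte of swap_map per slot, and the 24-byte and 32-byte swap_desc
estimates from this thread.  The helper names are made up for
illustration, and batched_ratio() is only one possible reading of the
"multiple swap_descs per xarray entry" suggestion.

```c
#include <assert.h>

/* Assumptions for this sketch, taken from the figures in the thread. */
#define PAGE_BYTES     4096u  /* 4 KiB pages */
#define XA_SLOT_BYTES  8u     /* one xarray slot per swapped-out page */
#define SWAP_MAP_BYTES 1u     /* today's swap_map byte per swap slot */

/* Per-page metadata with a swap_desc, relative to today's 8 + 1 bytes. */
static double overhead_ratio(unsigned int desc_bytes)
{
	return (double)(XA_SLOT_BYTES + desc_bytes) /
	       (double)(XA_SLOT_BYTES + SWAP_MAP_BYTES);
}

/* Same ratio when one xarray slot is shared by descs_per_slot swap_descs. */
static double batched_ratio(unsigned int desc_bytes,
			    unsigned int descs_per_slot)
{
	assert(descs_per_slot > 0);
	return ((double)XA_SLOT_BYTES / descs_per_slot + desc_bytes) /
	       (double)(XA_SLOT_BYTES + SWAP_MAP_BYTES);
}

/* swap_desc bytes as a percentage of the swapped-out page it describes. */
static double overhead_percent(unsigned int desc_bytes)
{
	return 100.0 * desc_bytes / PAGE_BYTES;
}

/* swap_desc bytes needed to describe swapped_bytes of swapped-out memory. */
static unsigned long long desc_bytes_for(unsigned long long swapped_bytes,
					 unsigned int desc_bytes)
{
	assert(swapped_bytes % PAGE_BYTES == 0);
	return swapped_bytes / PAGE_BYTES * desc_bytes;
}
```

With these assumptions, overhead_ratio(24) and overhead_ratio(32) give
the 3.5556 and 4.444 ratios, overhead_percent() gives roughly 0.6% and
0.8%, and desc_bytes_for(1ULL << 30, 32) gives 8 MiB per GiB swapped.
None of this accounts for swap utilization or swap count continuation
pages.
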
>> >
>> >>
>> >> It seems what you really need is one bit of information to indicate
>> >> this page is backed by zswap. Then you can have a seperate pointer
>> >> for the zswap entry.
>> >
>> > If you use one bit in swp_entry_t (or one of the available swap types)
>> > to indicate whether the page is backed with a swapfile or zswap it
>> > doesn't really work. We lose the indirection layer. How do we move the
>> > page from zswap to swapfile? We need to go update the page tables and
>> > the shmem page cache, similar to swapoff.
>> >
>> > Instead, if we store a key else in swp_entry_t and use this to lookup
>> > the swp_entry_t or zswap_entry pointer then that's essentially what
>> > the swap_desc does. It just goes the extra mile of unifying the
>> > swapcache as well and storing it directly in the swap_desc instead of
>> > storing it in another lookup structure.
>>
>> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
>> swap_entry in swap_desc. The added indirection appears to be another
>> level of page table with 1 entry. Then, we may use the similar method
>> as supporting system with 2 level and 3 level page tables, like the code
>> in include/asm-generic/pgtable-nopmd.h. But I haven't thought about
>> this deeply.
>
> Can you expand further on this idea? I am not sure I fully understand.

OK.  The goal is to avoid the overhead when indirection isn't enabled
via kconfig.  If indirection isn't enabled, store the swap_entry in the
PTE directly.  Otherwise, store the index of the swap_desc in the PTE.
Different functions (e.g., to get/set the swap_entry in the PTE) are
implemented based on kconfig.

>> >>
>> >> Depending on how much you are going to reuse the swap cache, you might
>> >> need to have something like a swap_info_struct to keep the locks happy.
>> >
>> > My current intention is to reimplement the swapcache completely as a
>> > pointer in struct swap_desc. This would eliminate this need and a lot
>> > of the locking we do today if I get things right.
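
The kconfig-selected get/set functions described above could look
roughly like the following user-space sketch.  Everything here is
hypothetical: CONFIG_SWAP_INDIRECTION, pte_swp_t, pte_swp_set(),
pte_swp_get() and the fixed-size swap_desc_table (a toy stand-in for the
xarray) are invented for illustration and are not existing kernel
interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins; only the name swp_entry_t comes from the thread. */
typedef struct { uint64_t val; } swp_entry_t;
typedef struct { uint64_t val; } pte_swp_t;  /* hypothetical: swap bits kept in a PTE */

#ifdef CONFIG_SWAP_INDIRECTION  /* hypothetical Kconfig symbol */

/* Toy replacement for the xarray: index -> 8-byte swap_desc (swap_entry only). */
#define SWAP_DESC_MAX 1024u
static swp_entry_t swap_desc_table[SWAP_DESC_MAX];
static uint32_t swap_desc_next;

/* Indirection enabled: the PTE holds the index of the swap_desc. */
static pte_swp_t pte_swp_set(swp_entry_t entry)
{
	assert(swap_desc_next < SWAP_DESC_MAX);
	swap_desc_table[swap_desc_next] = entry;
	return (pte_swp_t){ .val = swap_desc_next++ };
}

static swp_entry_t pte_swp_get(pte_swp_t pte)
{
	assert(pte.val < swap_desc_next);
	return swap_desc_table[pte.val];
}

#else

/* Indirection disabled: the PTE holds the swap_entry itself, with no extra cost. */
static inline pte_swp_t pte_swp_set(swp_entry_t entry)
{
	return (pte_swp_t){ .val = entry.val };
}

static inline swp_entry_t pte_swp_get(pte_swp_t pte)
{
	return (swp_entry_t){ .val = pte.val };
}

#endif
```

Callers would use only pte_swp_set() and pte_swp_get(), so they are
unaffected by whether the sketch is built with or without
-DCONFIG_SWAP_INDIRECTION.  This is similar in spirit to how
include/asm-generic/pgtable-nopmd.h folds away a page table level.
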
>> >
>> >>
>> >> > Another potential concern is readahead. With this design, we have no
>> >>
>> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
>> >> use some modernization.
>> >
>> > Yeah, I initially thought we would only need the swp_entry_t ->
>> > swap_desc reverse mapping for readahead, and that we can only store
>> > that for spinning disks, but I was wrong. We need for other things as
>> > well today: swapoff, when trying to find an empty swap slot and we
>> > start trying to free swap slots used only by the swapcache. However, I
>> > think both of these cases can be fixed (I can share more details if
>> > you want). If everything goes well we should only need to maintain the
>> > reverse mapping (extra overhead above 24 bytes) for swap files on
>> > spinning disks for readahead.
>> >
>> >>
>> >> Looking forward to your discussion.

Per my understanding, the indirection is to make it easy to move
(swapped) pages among swap devices based on whether they are hot or
cold.  This is similar to the goal of memory tiering.  Could we extend
the memory tiering (mm/memory-tiers.c) framework to cover swap devices
too?  Is it possible for zswap to be faster than some slow memory media?

Best Regards,
Huang, Ying