From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 15 Mar 2023 01:01:39 -0700
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: "Huang, Ying"
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner,
 Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
 Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
 Minchan Kim, Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu
In-Reply-To: <87o7ox762m.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <87356e850j.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87y1o571aa.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o7ox762m.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying wrote:
>
> Yosry Ahmed writes:
>
> > My current idea is to have one xarray that stores the swap_descs
> > (which include swap_entry, swapcache, swap_count, etc.), and only for
> > rotating disks have an additional xarray that maps swap_entry ->
> > swap_desc for cluster readahead, assuming we can eliminate all other
> > situations requiring a reverse mapping.
> >
> > I am not sure how having separate xarrays helps? If we have one
> > xarray, we might as well save the other lookups and put everything in
> > the swap_desc. In fact, this should improve the locking today, as
> > swapcache / swap_count operations can be lockless or very lightly
> > contended.
>
> The condition of the proposal is "reverse mapping cannot be avoided in
> enough situations". So, if reverse mapping (or cluster readahead) can
> be avoided in enough situations, I think your proposal is good.
> Otherwise, I propose to use 2 xarrays. You don't need another reverse
> mapping xarray, because you just need to read the next several swap
> entries into the swap cache for cluster readahead. swap_desc isn't
> needed for cluster readahead.

swap_desc would be needed for cluster readahead in my original proposal
as the swap cache lives in swap_descs. Based on the current
implementation, we would need a reverse mapping (swap entry ->
swap_desc) in 3 situations:

1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
failing, we fall back to trying to find swap entries that only have a
page in the swap cache (no references in page tables or page cache) and
free them. This would require a reverse mapping.

2) swapoff: we need to swap in all entries in a swapfile, so we need to
get all swap_descs associated with that swapfile.

3) swap cluster readahead.

For (1), I think we can drop the dependency on a reverse mapping if we
free swap entries once we swap a page in and add it to the swap cache,
even if the swap count does not drop to 0.

For (2), instead of scanning page tables and the shmem page cache to
find swapped out pages belonging to the swapfile, we can scan all
swap_descs instead, which should be more efficient. This is one of the
proposal's potential advantages.

(3) is the one that would still need a reverse mapping with the current
proposal. Today we use swap cluster readahead for anon pages if we have
a spinning disk or if vma readahead is disabled. For shmem, we always
use cluster readahead. If we can limit cluster readahead to rotating
disks only, then the reverse mapping needs to be maintained only for
swapfiles on rotating disks. Otherwise, we will need to maintain a
reverse mapping for all swapfiles.
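
For concreteness, this is roughly the shape I have in mind. It is a
sketch only; all names, types, and field sizes below are tentative:

/* Sketch only; names and layout are tentative, not actual kernel code. */
struct swap_desc {
	union {
		swp_entry_t entry;		/* slot on a swapfile */
		struct zswap_entry *zswap;	/* or compressed copy in zswap */
	};
	struct folio *swapcache;	/* swap cache lives here directly */
	unsigned int swap_count;
	unsigned int flags;		/* e.g. which union member is valid */
};

/* swap id (what PTEs / shmem page cache would store) -> swap_desc */
static DEFINE_XARRAY(swap_descs);

/*
 * Reverse mapping (swap entry -> swap_desc), maintained only for
 * swapfiles on rotating disks if cluster readahead is limited to them.
 */
static DEFINE_XARRAY(swap_desc_reverse);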

> > If the point is to store the swap_desc directly inside the xarray to
> > save 8 bytes, I am concerned that having multiple xarrays for
> > swapcache, swap_count, etc. will use more than that.
>
> The idea is to save the memory used by the reverse mapping xarray.

I see.

> >> >>
> >> >> > Keep in mind that the current overhead is 1 byte O(max swap
> >> >> > pages), not O(swapped). Also, 1 byte is assuming we do not use
> >> >> > the swap continuation pages. If we do, it may end up being more.
> >> >> > We also allocate continuations in full 4k pages, so even if one
> >> >> > swap_map element in a page requires continuation, we will
> >> >> > allocate an entire page. What I am trying to say is that to get
> >> >> > an actual comparison you need to also factor in the swap
> >> >> > utilization and the rate of usage of swap continuation. I don't
> >> >> > know how to come up with a formula for this tbh.
> >> >> >
> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if
> >> >> > you count the reverse mapping) is 0.8% of swapped memory, aka 8M
> >> >> > for every 1G swapped. It doesn't sound *very* bad. I understand
> >> >> > that it is pure overhead for people not using zswap, but it is
> >> >> > not very awful.
> >> >> >
> >> >> >>
> >> >> >> It seems what you really need is one bit of information to
> >> >> >> indicate that this page is backed by zswap. Then you can have a
> >> >> >> separate pointer for the zswap entry.
> >> >> >
> >> >> > If you use one bit in swp_entry_t (or one of the available swap
> >> >> > types) to indicate whether the page is backed by a swapfile or
> >> >> > zswap, it doesn't really work. We lose the indirection layer. How
> >> >> > do we move the page from zswap to the swapfile? We need to go
> >> >> > update the page tables and the shmem page cache, similar to
> >> >> > swapoff.
> >> >> >
> >> >> > Instead, if we store a key in swp_entry_t and use it to look up
> >> >> > the swp_entry_t or zswap_entry pointer, then that's essentially
> >> >> > what the swap_desc does. It just goes the extra mile of unifying
> >> >> > the swapcache as well and storing it directly in the swap_desc
> >> >> > instead of storing it in another lookup structure.
> >> >>
> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store
> >> >> only swap_entry in swap_desc, the added indirection appears to be
> >> >> another level of page table with 1 entry. Then, we may use a method
> >> >> similar to supporting systems with 2-level and 3-level page tables,
> >> >> like the code in include/asm-generic/pgtable-nopmd.h. But I haven't
> >> >> thought about this deeply.
> >> >
> >> > Can you expand further on this idea? I am not sure I fully
> >> > understand.
> >>
> >> OK. The goal is to avoid the overhead if indirection isn't enabled
> >> via kconfig.
> >>
> >> If indirection isn't enabled, store swap_entry in the PTE directly.
> >> Otherwise, store the index of the swap_desc in the PTE. Different
> >> functions (e.g., to get/set swap_entry in the PTE) are implemented
> >> based on kconfig.
> >
> > I thought about this; the problem is that we will have multiple
> > implementations of multiple things. For example, swap_count without
> > the indirection layer lives in the swap_map (with continuation logic).
> > With the indirection layer, it lives in the swap_desc (or somewhere
> > else). Same for the swapcache. Even if we keep the swapcache in an
> > xarray and not inside the swap_desc, it would be indexed by swap_entry
> > if the indirection is disabled, and by swap_desc (or similar) if the
> > indirection is enabled. I think maintaining separate implementations
> > for when the indirection is enabled/disabled would add too much
> > complexity.
> >
> > WDYT?
>
> If we go this way, the swap cache and swap_count will always be indexed
> by swap_entry. swap_desc just provides an indirection to make it
> possible to move pages between swap devices.
>
> Why must we index the swap cache and swap_count by swap_desc if
> indirection is enabled? Yes, we can save one xarray lookup if we do so,
> but I don't think the overhead of one xarray lookup is a showstopper.
>
> I think this can be one intermediate step towards your final target.
> The changes to the current implementation can be smaller.

IIUC, the idea is to have two xarrays:
(a) an xarray that stores a pointer to a struct containing swap_count
and the swap cache.
(b) an xarray that stores the underlying swap entry or zswap entry.
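
Something like the below, I assume? This is a sketch of my
understanding only; the names are made up:

/* Sketch of my understanding; hypothetical names. */
struct swap_cache_info {		/* entry of xarray (a) */
	struct folio *swapcache;
	unsigned int swap_count;
};

/* (a): indexed by swap entry (indirection off) or swap id (on). */
static DEFINE_XARRAY(swap_info_xa);

/*
 * (b): swap id -> swap entry or zswap entry; only exists with
 * indirection enabled.
 */
static DEFINE_XARRAY(swap_indirection_xa);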
When indirection is disabled: page tables & page cache have the swap
entry directly like today, xarray (a) is indexed by swap entry, and
xarray (b) does not exist. No reverse mapping is needed. In this case we
have an extra overhead of 12-16 bytes (the struct containing swap_count
and the swap cache) vs. 24 bytes for the swap_desc.

When indirection is enabled: page tables & page cache have a swap id (or
swap_desc index), xarray (a) is indexed by swap id, and xarray (b) is
indexed by swap id as well and contains the swap entry or zswap entry. A
reverse mapping might be needed. In this case we have an extra overhead
of 12-16 bytes + 8 bytes per xarray (b) entry + the memory overhead of
the 2nd xarray + the reverse mapping where needed. There is also the
extra cpu overhead of an extra lookup in certain paths.

Is my analysis correct? If yes, I agree that the original proposal is
good if the reverse mapping can be avoided in enough situations, and
that we should consider such alternatives otherwise. As I mentioned
above, I think it comes down to whether we can completely restrict
cluster readahead to rotating disks or not -- in which case we need to
decide what to do for shmem and for anon when vma readahead is disabled.
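
To spell out the lookup difference between the two modes, reusing the
made-up names from my sketch above (CONFIG_SWAP_INDIRECTION is also a
made-up kconfig name; locking and error handling omitted):

/* Sketch only: resolving what backs a swapped-out page in both modes. */
static void *swap_backing(unsigned long pte_swap_val)
{
	struct swap_cache_info *info;

	/* xarray (a) exists in both modes: swap cache + swap count. */
	info = xa_load(&swap_info_xa, pte_swap_val);
	if (info && info->swapcache)
		return info->swapcache;		/* still in the swap cache */

#ifdef CONFIG_SWAP_INDIRECTION
	/* Extra lookup: translate the swap id through xarray (b). */
	return xa_load(&swap_indirection_xa, pte_swap_val);
#else
	/* The PTE value already encodes the swap entry, as today. */
	return (void *)pte_swap_val;
#endif
}

The cost of the indirection shows up as the second xa_load() on paths
that need the backing entry, which is the extra cpu overhead I mention
above.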

> >> >> >>
> >> >> >> Depending on how much you are going to reuse the swap cache,
> >> >> >> you might need to have something like a swap_info_struct to
> >> >> >> keep the locks happy.
> >> >> >
> >> >> > My current intention is to reimplement the swapcache completely
> >> >> > as a pointer in struct swap_desc. This would eliminate this need
> >> >> > and a lot of the locking we do today if I get things right.
> >> >> >
> >> >> >>
> >> >> >> > Another potential concern is readahead. With this design, we
> >> >> >> > have no
> >> >> >>
> >> >> >> Readahead is for spinning disks :-) Even a normal swap file
> >> >> >> with an SSD can use some modernization.
> >> >> >
> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> >> > swap_desc reverse mapping for readahead, and that we could
> >> >> > maintain it only for spinning disks, but I was wrong. We need it
> >> >> > for other things as well today: swapoff, and when trying to find
> >> >> > an empty swap slot and we start trying to free swap slots used
> >> >> > only by the swapcache. However, I think both of these cases can
> >> >> > be fixed (I can share more details if you want). If everything
> >> >> > goes well we should only need to maintain the reverse mapping
> >> >> > (extra overhead above 24 bytes) for swap files on spinning disks
> >> >> > for readahead.
> >> >> >
> >> >> >>
> >> >> >> Looking forward to your discussion.
> >>
> >> Per my understanding, the indirection is to make it easy to move
> >> (swapped) pages among swap devices based on hot/cold. This is similar
> >> to the goal of memory tiering. It appears that we can extend the
> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices
> >> too? Is it possible for zswap to be faster than some slow memory
> >> media?
> >
> > Agree with Chris that this may require a much larger overhaul. A slow
> > memory tier is still addressable memory; swap/zswap requires a page
> > fault to read the pages. I think (at least for now) there is a
> > fundamental difference. We want reclaim to eventually treat slow
> > memory & swap as just different tiers to place cold memory in, with
> > different characteristics, but otherwise I think the swapping
> > implementation itself is very different. Am I missing something?
>
> Is it possible that zswap is faster than a really slow memory
> addressable device backed by NAND? TBH, I don't have the answer.

I am not sure either.

> Anyway, do you need a way to describe the tiers of the swap devices?
> So, you can move the cold pages among the swap devices based on that?

For now I think the "tiers" in this proposal are just zswap and normal
swapfiles. We can later extend it to support more explicit tiering.

> Best Regards,
> Huang, Ying