From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 21 Feb 2023 15:38:57 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Yang Shi
Cc: lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton

On Tue, Feb 21, 2023 at 3:34 PM Yang Shi wrote:
>
> On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed wrote:
> >
> > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi wrote:
> > >
> > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed wrote:
> > > >
> > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi wrote:
> > > > >
> > > > > Hi Yosry,
> > > > >
> > > > > Thanks for proposing this topic.
> > > > > I had been thinking about this before, but I didn't make much
> > > > > progress due to some other distractions, and I have a couple of
> > > > > follow-up questions about your design. Please see the inline
> > > > > comments below.
> > > >
> > > > Great to see interested folks, thanks!
> > > >
> > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed wrote:
> > > > > >
> > > > > > Hello everyone,
> > > > > >
> > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in
> > > > > > May 2023 about swap & zswap (hope I am not too late).
> > > > > >
> > > > > > ==================== Intro ====================
> > > > > > Currently, using zswap depends on swapfiles in an unnecessary
> > > > > > way. To use zswap, you need a swapfile configured (even if its
> > > > > > space will not be used), and zswap is restricted by its size.
> > > > > > When pages reside in zswap, the corresponding swap entry in the
> > > > > > swapfile cannot be used and is essentially wasted. We also go
> > > > > > through unnecessary code paths when using zswap, such as finding
> > > > > > and allocating a swap entry on the swapout path, or readahead in
> > > > > > the swapin path. I am proposing a swapping abstraction layer that
> > > > > > would allow us to remove zswap's dependency on swapfiles. This
> > > > > > can be done by introducing a data structure between the actual
> > > > > > swapping implementation (swapfiles, zswap) and the rest of the MM
> > > > > > code.
> > > > > >
> > > > > > ==================== Objective ====================
> > > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > > zswap useful for a wider variety of use cases. Also, when zswap
> > > > > > is used with a swapfile, the pages in zswap do not use up space
> > > > > > in the swapfile, so the overall swapping capacity increases.
> > > > > >
> > > > > > ==================== Idea ====================
> > > > > > Introduce a data structure, which I currently call a swap_desc,
> > > > > > as an abstraction layer between the swapping implementation and
> > > > > > the rest of the MM code. Page tables & page caches would store a
> > > > > > swap id (encoded as a swp_entry_t) instead of directly storing
> > > > > > the swap entry associated with the swapfile. This swap id maps to
> > > > > > a struct swap_desc, which acts as our abstraction layer. All MM
> > > > > > code not concerned with swapping details would operate in terms
> > > > > > of swap descs. The swap_desc can point to either a normal swap
> > > > > > entry (associated with a swapfile) or a zswap entry. It can also
> > > > > > include all non-backend-specific state and operations, such as
> > > > > > the swapcache (which would be a simple pointer in swap_desc),
> > > > > > swap counting, etc. It creates a clear, clean abstraction layer
> > > > > > between MM code and the actual swapping implementation.
> > > > >
> > > > > How will the swap_desc be allocated? Dynamically or preallocated?
> > > > > Is it 1:1 mapped to the swap slots on swap devices (whatever backs
> > > > > it, for example zswap, a swap partition, a swapfile, etc.)?
> > > >
> > > > I imagine swap_desc's would be dynamically allocated when we need to
> > > > swap something out. When allocated, a swap_desc would either point
> > > > to a zswap_entry (if available), or a swap slot otherwise. In this
> > > > case, it would be 1:1 mapped to swapped out pages, not the swap
> > > > slots on devices.
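To make that a bit more concrete, here's a purely illustrative sketch
of what such a swap_desc could look like. None of these names exist in
any tree; it is only meant to show the "points at either a zswap entry
or a swap slot" idea plus the backend-agnostic state:

enum swap_backing {
        SWAP_BACKING_ZSWAP,             /* compressed copy lives in zswap */
        SWAP_BACKING_SWAPFILE,          /* slot on a swap device / swapfile */
};

/*
 * Illustrative only -- one swap_desc per swapped-out page, allocated at
 * swap-out time.  It either points at a zswap entry or records a slot
 * on a swap device, and carries the backend-agnostic state (swap cache,
 * swap count) that the rest of MM needs.
 */
struct swap_desc {
        enum swap_backing backing;
        union {
                struct zswap_entry *zswap_entry;  /* backing == ZSWAP */
                swp_entry_t slot;                 /* backing == SWAPFILE */
        };
        struct folio *swapcache;        /* swap cache entry, if any */
        atomic_t swap_count;            /* replaces swap_map + continuations */
};

Page tables and the page cache would only ever store the opaque swap id
that resolves to one of these, so they never need to know which backing
is currently in use.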
> > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> > > is used as the backing store of zswap.
> > >
> > > > I know that it might not be ideal to make allocations on the
> > > > reclaim path (although it would be a small-ish slab allocation so
> > > > we might be able to get away with it), but otherwise we would have
> > > > statically allocated swap_desc's for all swap slots on a swap
> > > > device, even unused ones, which I imagine is too expensive. Also
> > > > for things like zswap, it doesn't really make sense to preallocate
> > > > at all.
> > >
> > > Yeah, it is not ideal to allocate memory in the reclamation path. We
> > > do have such cases, but the fewer the better IMHO.
> >
> > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
> > slab cache; I don't know if that makes sense, or if there is a way to
> > tell slab to proactively refill a cache.
> >
> > I am open to suggestions here. I don't think we should/can preallocate
> > the swap_desc's, and we cannot completely eliminate the allocations in
> > the reclaim path. We can only try to minimize them through caching,
> > etc. Right?
>
> Yeah, preallocation would not work. But I'm not sure whether caching
> works well for this case either. I suppose you were thinking about
> something similar to pcp: when the available number of elements drops
> below a threshold, refill the cache. That should work well under
> moderate memory pressure, but I'm not sure how it would behave under
> severe memory pressure, particularly when anonymous memory dominates
> memory usage. Or maybe dynamic allocation works well enough and we are
> just over-engineering.

Yeah, it would be interesting to look into whether the swap_desc
allocation will be a bottleneck. Definitely something to look out for.
I share your thoughts about wanting to do something about it but also
not wanting to over-engineer it.
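Just to make the pcp-style refill idea a bit more concrete, below is a
rough sketch of what I am picturing. It is purely illustrative:
swap_desc_slab and all of the function names are made up for this
discussion, not code from any tree.

struct swap_desc_cache {
        struct swap_desc *descs[64];
        unsigned int nr;
};

static DEFINE_PER_CPU(struct swap_desc_cache, swap_desc_cache);
static struct kmem_cache *swap_desc_slab;       /* hypothetical slab cache */

/* Refill from a non-reclaim context (e.g. a scheduled worker). */
static void swap_desc_cache_refill(void)
{
        for (;;) {
                struct swap_desc_cache *cache;
                struct swap_desc *desc;

                desc = kmem_cache_alloc(swap_desc_slab, GFP_KERNEL);
                if (!desc)
                        return;

                cache = get_cpu_ptr(&swap_desc_cache);
                if (cache->nr >= ARRAY_SIZE(cache->descs)) {
                        put_cpu_ptr(&swap_desc_cache);
                        kmem_cache_free(swap_desc_slab, desc);
                        return;
                }
                cache->descs[cache->nr++] = desc;
                put_cpu_ptr(&swap_desc_cache);
        }
}

/* Fast path at swap-out time; falls back to the slab allocator. */
static struct swap_desc *swap_desc_alloc(gfp_t gfp)
{
        struct swap_desc_cache *cache = get_cpu_ptr(&swap_desc_cache);
        struct swap_desc *desc = NULL;

        if (cache->nr)
                desc = cache->descs[--cache->nr];
        put_cpu_ptr(&swap_desc_cache);

        return desc ? desc : kmem_cache_alloc(swap_desc_slab, gfp);
}

The point is just that the refill happens outside of reclaim, so the
swap-out fast path usually only pops from the per-CPU array and only
falls back to the slab allocator when the cache runs dry.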
> > > > WDYT?
> > > >
> > > > > > ==================== Benefits ====================
> > > > > > This work enables using zswap without a backing swapfile and
> > > > > > increases the swap capacity when zswap is used with a swapfile.
> > > > > > It also creates a separation that allows us to skip code paths
> > > > > > that don't make sense in the zswap path (e.g. readahead). We get
> > > > > > to drop zswap's rbtree, which might result in better performance
> > > > > > (fewer lookups, less lock contention).
> > > > > >
> > > > > > The abstraction layer also opens the door for multiple cleanups
> > > > > > (e.g. removing swapper address spaces, removing swap count
> > > > > > continuation code, etc). Another nice cleanup that this work
> > > > > > enables would be separating the overloaded swp_entry_t into two
> > > > > > distinct types: one for things that are stored in page tables /
> > > > > > caches, and one for actual swap entries. In the future, we can
> > > > > > potentially further optimize how we use the bits in the page
> > > > > > tables instead of sticking everything into the current
> > > > > > type/offset format.
> > > > > >
> > > > > > Another potential win here can be swapoff, which can be made
> > > > > > more practical by directly scanning all swap_desc's instead of
> > > > > > going through page tables and shmem page caches.
> > > > > >
> > > > > > Overall zswap becomes more accessible and available to a wider
> > > > > > range of use cases.
> > > > >
> > > > > How will you handle zswap writeback? Zswap may write back to the
> > > > > backing swap device IIUC. Assuming you have both zswap and a
> > > > > swapfile, they are separate devices with this design, right? If
> > > > > so, is the swapfile still the writeback target of zswap? And if it
> > > > > is the writeback target, what if the swapfile is full?
> > > >
> > > > When we try to write back from zswap, we try to allocate a swap slot
> > > > in the swapfile, and switch the swap_desc to point to that instead.
> > > > The process would be transparent to the rest of MM (page tables,
> > > > page cache, etc). If the swapfile is full, then there's really
> > > > nothing we can do, reclaim fails and we start OOMing. I imagine this
> > > > is the same behavior as today when swap is full; the difference
> > > > would be that we have to fill both zswap AND the swapfile to get to
> > > > the OOMing point, so an overall increased swapping capacity.
> > >
> > > When zswap is full, but the swapfile is not yet, will we try to write
> > > back from zswap to the swapfile to make more room for zswap, or just
> > > swap out to the swapfile directly?
> >
> > The current behavior is that we swap to the swapfile directly in this
> > case, which is far from ideal as we break LRU ordering by skipping
> > zswap. I believe this should be addressed, but not as part of this
> > effort. The work to make zswap respect the LRU ordering by writing
> > back from zswap to make room can be done orthogonally to this effort.
> > I believe Johannes was looking into this at some point.
>
> Other than breaking LRU ordering, I'm also concerned about the
> potential performance degradation when writing to / reading from the
> swapfile once zswap is full. The zswap -> swapfile ordering should be
> able to maintain consistent performance for userspace.

Right. This happens today anyway AFAICT: when zswap is full we just
fall back to writing to the swapfile, so this would not be a behavior
change. I agree it should be addressed anyway.

> But anyway I don't have data from real-life workloads to back the
> above points. If you or Johannes could share some real data, that
> would be very helpful for making the decisions.

I actually don't, since we mostly run zswap without a backing swapfile.
Perhaps Johannes (or anyone using zswap with a backing swapfile) might
have some data on this.
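Circling back to the writeback flow above, the swap_desc re-pointing I
have in mind would look roughly like the sketch below. Again, this is
only an illustration; every helper name here (swapfile_alloc_slot() and
friends) is invented for the discussion.

/*
 * Illustrative zswap writeback with swap_descs: move a page's backing
 * from zswap to a real swap slot by re-pointing its swap_desc.  The
 * swap id seen by page tables / page cache does not change, so the rest
 * of MM never notices the migration.  All helpers are hypothetical.
 */
static int swap_desc_writeback(struct swap_desc *desc)
{
        swp_entry_t slot;
        int err;

        if (desc->backing != SWAP_BACKING_ZSWAP)
                return 0;

        /* Find room on a configured swap device; fails if swap is full. */
        err = swapfile_alloc_slot(&slot);
        if (err)
                return err;

        /* Decompress the zswap copy and write it out to the new slot. */
        err = zswap_write_entry_to_slot(desc->zswap_entry, slot);
        if (err) {
                swapfile_free_slot(slot);
                return err;
        }

        /* Re-point the descriptor and drop the compressed copy. */
        zswap_free_entry(desc->zswap_entry);
        desc->backing = SWAP_BACKING_SWAPFILE;
        desc->slot = slot;
        return 0;
}

The important property is that the swap id stored in page tables and
the page cache never changes; only what the swap_desc points at does.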
A reverse mapping might > > > > > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out > > > > > > memory). > > > > > > > > > > > > ==================== Bottom Line ==================== > > > > > > It would be nice to discuss the potential here and the tradeoffs. I > > > > > > know that other folks using zswap (or interested in using it) may find > > > > > > this very useful. I am sure I am missing some context on why things > > > > > > are the way they are, and perhaps some obvious holes in my story. > > > > > > Looking forward to discussing this with anyone interested :) > > > > > > > > > > > > I think Johannes may be interested in attending this discussion, since > > > > > > a lot of ideas here are inspired by discussions I had with him :)