From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yang Shi <shy828301@gmail.com>
Date: Tue, 21 Feb 2023 15:34:35 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Yosry Ahmed
Cc: lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton
Content-Type: text/plain; charset="UTF-8"

On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed wrote:
>
> On Tue, Feb 21, 2023 at 11:26 AM Yang Shi wrote:
> >
> > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed wrote:
> > >
> > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi wrote:
> > > >
> > > > Hi Yosry,
> > > >
> > > > Thanks for proposing this topic.
> > > > I was thinking about this before but
> > > > I didn't make too much progress due to some other distractions, and I
> > > > got a couple of follow up questions about your design. Please see the
> > > > inline comments below.
> > >
> > > Great to see interested folks, thanks!
> > >
> > > >
> > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed wrote:
> > > > >
> > > > > Hello everyone,
> > > > >
> > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May
> > > > > 2023 about swap & zswap (hope I am not too late).
> > > > >
> > > > > ==================== Intro ====================
> > > > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > > > way. To use zswap, you need a swapfile configured (even if the space
> > > > > will not be used) and zswap is restricted by its size. When pages
> > > > > reside in zswap, the corresponding swap entry in the swapfile cannot
> > > > > be used, and is essentially wasted. We also go through unnecessary
> > > > > code paths when using zswap, such as finding and allocating a swap
> > > > > entry on the swapout path, or readahead in the swapin path. I am
> > > > > proposing a swapping abstraction layer that would allow us to remove
> > > > > zswap's dependency on swapfiles. This can be done by introducing a
> > > > > data structure between the actual swapping implementation (swapfiles,
> > > > > zswap) and the rest of the MM code.
> > > > >
> > > > > ==================== Objective ====================
> > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > used with a swapfile, the pages in zswap do not use up space in the
> > > > > swapfile, so the overall swapping capacity increases.
> > > > >
> > > > > ==================== Idea ====================
> > > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > > abstraction layer between swapping implementation and the rest of MM
> > > > > code. Page tables & page caches would store a swap id (encoded as a
> > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > > as our abstraction layer. All MM code not concerned with swapping
> > > > > details would operate in terms of swap descs. The swap_desc can point
> > > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > > entry. It can also include all non-backend specific operations, such
> > > > > as the swapcache (which would be a simple pointer in swap_desc), swap
> > > > > counting, etc. It creates a clear, nice abstraction layer between MM
> > > > > code and the actual swapping implementation.
> > > >
> > > > How will the swap_desc be allocated? Dynamically or preallocated? Is
> > > > it 1:1 mapped to the swap slots on swap devices (whatever it is
> > > > backed, for example, zswap, swap partition, swapfile, etc)?
> > >
> > > I imagine swap_desc's would be dynamically allocated when we need to
> > > swap something out. When allocated, a swap_desc would either point to
> > > a zswap_entry (if available), or a swap slot otherwise. In this case,
> > > it would be 1:1 mapped to swapped out pages, not the swap slots on
> > > devices.
> >
> > It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> > is used as the back of zswap.
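To make the swap_desc idea above concrete, here is a minimal userspace C sketch. All names here (swap_desc fields, the backend enum, the allocation helper) are hypothetical illustrations of the design being discussed, not actual kernel code: the descriptor points at either a swapfile slot or a zswap entry, and the rest of MM only ever sees the swap id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical backend tag: a swap_desc points at exactly one backend. */
enum swap_backend { SWAP_BACKEND_FILE, SWAP_BACKEND_ZSWAP };

/* Stand-in for struct zswap_entry with most fields dropped, as the
 * proposal suggests (no rbtree link, no index). */
struct zswap_entry_model {
	void *compressed;
	size_t len;
};

/* The proposed abstraction layer: page tables / page caches store a
 * swap id that maps to one of these, instead of a (type, offset)
 * swap entry tied to a swapfile. */
struct swap_desc {
	enum swap_backend backend;
	union {
		unsigned long swap_slot;         /* offset in the swapfile */
		struct zswap_entry_model *zswap; /* in-memory zswap entry  */
	};
	void *swapcache_page; /* swapcache becomes a simple pointer */
	int swap_count;       /* replaces swap_map + continuation   */
};

/* Allocate a descriptor pointing at a zswap entry, modeling the
 * "dynamically allocated at swap-out, 1:1 with swapped out pages"
 * answer above. */
static struct swap_desc *swap_desc_alloc_zswap(struct zswap_entry_model *z)
{
	struct swap_desc *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->backend = SWAP_BACKEND_ZSWAP;
	d->zswap = z;
	d->swap_count = 1;
	return d;
}
```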
> > >
> > > I know that it might not be ideal to make allocations on the reclaim
> > > path (although it would be a small-ish slab allocation so we might be
> > > able to get away with it), but otherwise we would have statically
> > > allocated swap_desc's for all swap slots on a swap device, even unused
> > > ones, which I imagine is too expensive. Also for things like zswap, it
> > > doesn't really make sense to preallocate at all.
> >
> > Yeah, it is not perfect to allocate memory in the reclamation path. We
> > do have such cases, but the fewer the better IMHO.
>
> Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
> slab cache, idk if that makes sense, or if there is a way to tell slab
> to proactively refill a cache.
>
> I am open to suggestions here. I don't think we should/can preallocate
> the swap_desc's, and we cannot completely eliminate the allocations in
> the reclaim path. We can only try to minimize them through caching,
> etc. Right?

Yeah, preallocation would not work. But I'm not sure whether caching
works well for this case either. I suppose you were thinking about
something similar to the pcp lists: when the available number of
elements drops below a threshold, refill the cache. That should work
well under moderate memory pressure, but I'm not sure how it would
behave under severe memory pressure, particularly when anonymous
memory dominates the memory usage. Or maybe dynamic allocation works
well and we are just over-engineering this.

> > >
> > >
> > > WDYT?
> > > >
> > > > >
> > > > > ==================== Benefits ====================
> > > > > This work enables using zswap without a backing swapfile and increases
> > > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > > a separation that allows us to skip code paths that don't make sense
> > > > > in the zswap path (e.g. readahead).
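The pcp-style refill idea discussed above could look roughly like the following userspace sketch. The pool type, capacity, and watermark are all hypothetical; a real implementation would sit on top of the slab allocator and deal with per-CPU data and concurrency, none of which is modeled here.

```c
#include <assert.h>
#include <stdlib.h>

#define DESC_CACHE_SIZE 64 /* hypothetical pool capacity */
#define DESC_CACHE_LOW  16 /* hypothetical refill watermark */

struct desc_cache {
	void *slots[DESC_CACHE_SIZE];
	int nr;
};

/* Refill the pool up to capacity once it drops below the watermark,
 * so reclaim usually finds a preallocated descriptor instead of
 * allocating while under memory pressure. */
static void desc_cache_refill(struct desc_cache *c)
{
	if (c->nr >= DESC_CACHE_LOW)
		return;
	while (c->nr < DESC_CACHE_SIZE) {
		void *d = malloc(64); /* stand-in for a slab allocation */

		if (!d)
			break; /* under severe pressure the pool may stay
				* short, which is exactly the concern
				* raised in the thread */
		c->slots[c->nr++] = d;
	}
}

/* Take one descriptor from the pool, refilling opportunistically. */
static void *desc_cache_get(struct desc_cache *c)
{
	desc_cache_refill(c);
	return c->nr ? c->slots[--c->nr] : NULL;
}
```

The open question from the discussion maps directly onto the `break` above: when the backing allocation fails under severe pressure, the cache empties and reclaim is back to allocating inline.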
> > > > > We get to drop zswap's rbtree
> > > > > which might result in better performance (less lookups, less lock
> > > > > contention).
> > > > >
> > > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > > removing swapper address spaces, removing swap count continuation
> > > > > code, etc). Another nice cleanup that this work enables would be
> > > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > > things that are stored in page tables / caches, and for actual swap
> > > > > entries. In the future, we can potentially further optimize how we use
> > > > > the bits in the page tables instead of sticking everything into the
> > > > > current type/offset format.
> > > > >
> > > > > Another potential win here can be swapoff, which can be more practical
> > > > > by directly scanning all swap_desc's instead of going through page
> > > > > tables and shmem page caches.
> > > > >
> > > > > Overall zswap becomes more accessible and available to a wider range
> > > > > of use cases.
> > > >
> > > > How will you handle zswap writeback? Zswap may writeback to the backed
> > > > swap device IIUC. Assuming you have both zswap and swapfile, they are
> > > > separate devices with this design, right? If so, is the swapfile still
> > > > the writeback target of zswap? And if it is the writeback target, what
> > > > if swapfile is full?
> > >
> > > When we try to writeback from zswap, we try to allocate a swap slot in
> > > the swapfile, and switch the swap_desc to point to that instead. The
> > > process would be transparent to the rest of MM (page tables, page
> > > cache, etc). If the swapfile is full, then there's really nothing we
> > > can do, reclaim fails and we start OOMing. I imagine this is the same
> > > behavior as today when swap is full, the difference would be that we
> > > have to fill both zswap AND the swapfile to get to the OOMing point,
> > > so an overall increased swapping capacity.
> >
> > When zswap is full, but swapfile is not yet, will the swap try to
> > writeback zswap to swapfile to make more room for zswap or just swap
> > out to swapfile directly?
>
> The current behavior is that we swap to swapfile directly in this
> case, which is far from ideal as we break LRU ordering by skipping
> zswap. I believe this should be addressed, but not as part of this
> effort. The work to make zswap respect the LRU ordering by writing
> back from zswap to make room can be done orthogonal to this effort. I
> believe Johannes was looking into this at some point.

Other than breaking LRU ordering, I'm also concerned about the
potential deteriorating performance when writing/reading from swapfile
when zswap is full. The zswap->swapfile order should be able to
maintain a consistent performance for userspace. But anyway I don't
have the data from real life workload to back the above points. If you
or Johannes could share some real data, that would be very helpful to
make the decisions.

> >
> > > >
> > > > Anyway I'm interested in attending the discussion for this topic.
> > >
> > > Great! Looking forward to discuss this more!
> > > >
> > > > >
> > > > > ==================== Cost ====================
> > > > > The obvious downside of this is added memory overhead, specifically
> > > > > for users that use swapfiles without zswap. Instead of paying one byte
> > > > > (swap_map) for every potential page in the swapfile (+ swap count
> > > > > continuation), we pay the size of the swap_desc for every page that is
> > > > > actually in the swapfile, which I am estimating can be roughly around
> > > > > 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only
> > > > > scales with pages actually swapped out. For zswap users, it should be
> > > > > a win (or at least even) because we get to drop a lot of fields from
> > > > > struct zswap_entry (e.g. rbtree, index, etc).
> > > > >
> > > > > Another potential concern is readahead.
> > > > > With this design, we have no
> > > > > way to get a swap_desc given a swap entry (type & offset). We would
> > > > > need to maintain a reverse mapping, adding a little bit more overhead,
> > > > > or search all swapped out pages instead :). A reverse mapping might
> > > > > pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out
> > > > > memory).
> > > > >
> > > > > ==================== Bottom Line ====================
> > > > > It would be nice to discuss the potential here and the tradeoffs. I
> > > > > know that other folks using zswap (or interested in using it) may find
> > > > > this very useful. I am sure I am missing some context on why things
> > > > > are the way they are, and perhaps some obvious holes in my story.
> > > > > Looking forward to discussing this with anyone interested :)
> > > > >
> > > > > I think Johannes may be interested in attending this discussion, since
> > > > > a lot of ideas here are inspired by discussions I had with him :)
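The reverse mapping raised in the readahead discussion could be as simple as a per-device table from swapfile slot back to the swap id, which is the extra per-page overhead the proposal estimates. A userspace sketch (table size and names are hypothetical, and a real version would need locking and lifetime rules):

```c
#include <assert.h>

#define NR_SLOTS 128 /* hypothetical swapfile size, for illustration */

/* One entry per swapfile slot: the swap id of the descriptor that
 * currently owns the slot, so readahead can go from a neighboring
 * (type, offset) entry back to its swap_desc. */
static long slot_to_desc[NR_SLOTS];

static void rmap_set(unsigned long slot, long desc_id)
{
	if (slot < NR_SLOTS)
		slot_to_desc[slot] = desc_id;
}

/* Returns the owning swap id, or -1 for an out-of-range slot. */
static long rmap_lookup(unsigned long slot)
{
	return slot < NR_SLOTS ? slot_to_desc[slot] : -1;
}
```

This is where the quoted ~32-byte estimate comes from: the base swap_desc plus one extra pointer-sized entry per swapped-out page for the reverse lookup.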