From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 1 Mar 2023 17:25:00 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Minchan Kim, Johannes Weiner
Cc: Sergey Senozhatsky, lsf-pc@lists.linux-foundation.org, Linux-MM,
 Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
 Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
 Andrew Morton

On Wed, Mar 1, 2023 at 4:58 PM Yosry Ahmed wrote:
>
> On Tue, Feb 28, 2023 at 3:29 PM Minchan Kim wrote:
> >
> > Hi Yosry,
> >
> > On Tue, Feb 28, 2023 at 12:12:05AM -0800, Yosry Ahmed wrote:
> > > On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky wrote:
> > > >
> > > > On (23/02/18 14:38), Yosry Ahmed wrote:
> > > > [..]
> > > > > ==================== Idea ====================
> > > > > Introduce a data structure, which I currently call a swap_desc,
> > > > > as an abstraction layer between the swapping implementation and
> > > > > the rest of the MM code. Page tables & page caches would store
> > > > > a swap id (encoded as a swp_entry_t) instead of directly
> > > > > storing the swap entry associated with the swapfile. This swap
> > > > > id maps to a struct swap_desc, which acts as our abstraction
> > > > > layer. All MM code not concerned with swapping details would
> > > > > operate in terms of swap descs. The swap_desc can point to
> > > > > either a normal swap entry (associated with a swapfile) or a
> > > > > zswap entry. It can also hold all non-backend-specific state,
> > > > > such as the swapcache (which would be a simple pointer in
> > > > > swap_desc), swap counting, etc. It creates a clear, nice
> > > > > abstraction layer between MM code and the actual swapping
> > > > > implementation.
> > > > >
> > > > > ==================== Benefits ====================
> > > > > This work enables using zswap without a backing swapfile and
> > > > > increases the swap capacity when zswap is used with a swapfile.
> > > > > It also creates a separation that allows us to skip code paths
> > > > > that don't make sense in the zswap path (e.g. readahead). We
> > > > > get to drop zswap's rbtree, which might result in better
> > > > > performance (fewer lookups, less lock contention).
> > > > >
> > > > > The abstraction layer also opens the door for multiple cleanups
> > > > > (e.g. removing swapper address spaces, removing swap count
> > > > > continuation code, etc). Another nice cleanup that this work
> > > > > enables would be separating the overloaded swp_entry_t into two
> > > > > distinct types: one for things that are stored in page tables /
> > > > > caches, and one for actual swap entries. In the future, we can
> > > > > potentially further optimize how we use the bits in the page
> > > > > tables instead of sticking everything into the current
> > > > > type/offset format.
> > > > >
> > > > > Another potential win here is swapoff, which can be made more
> > > > > practical by directly scanning all swap_descs instead of going
> > > > > through page tables and shmem page caches.
> > > > >
> > > > > Overall zswap becomes more accessible and available to a wider
> > > > > range of use cases.
> > > >
> > > > I assume this also brings us closer to a proper writeback LRU
> > > > handling?
> > >
> > > I assume by proper LRU handling you mean:
> > > - Swap writeback LRU that lives outside of the zpool backends
> > >   (i.e. in zswap itself or even outside zswap).
> >
> > Even outside zswap, to support any combination on any heterogeneous
> > multiple-swap-device configuration.
>
> Agreed, this is the end goal for the writeback LRU.
>
> > The indirection layer would be essential to support it, but it
> > would also be great if we don't waste any memory for users who
> > don't want the feature.
>
> I can't currently think of a way to eliminate overhead for people
> only using swapfiles, as a lot of the core implementation changes,
> unless we want to maintain considerably more code with a lot of
> repeated functionality implemented differently. Perhaps this will
> change as I implement this; maybe things are better (or worse) than
> what I think they are. I am actively working on a proof-of-concept
> right now. Maybe a discussion at LSF/MM/BPF will help come up with
> optimizations as well :)
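(For concreteness, here is a rough sketch of the swap_desc idea
quoted above. All names and fields are tentative and purely
illustrative; this is not code from the proof-of-concept, and it
assumes the usual kernel types:)

	struct swap_desc {
		union {
			swp_entry_t slot;	   /* backed by a swapfile slot */
			struct zswap_entry *zswap; /* backed by zswap only */
		};
		struct folio *swapcache;  /* swapcache as a simple pointer,
					   * replacing swapper address spaces */
		atomic_t swap_count;	  /* unified swap counting, with no
					   * continuation machinery */
		unsigned long flags;	  /* e.g. which backend holds the page */
	};

Page tables and page caches would store only the swap id that locates
a swap_desc; nothing outside the backend would need to know about
(type, offset) swapfile slots.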
> > Just FYI, there was a similar discussion a long time ago about the
> > indirection layer:
> > https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
>
> Yeah, Hugh shared this one with me earlier, but there are a few
> things that I don't understand how they would work, at least in
> today's world.
>
> Firstly, the proposal suggests that we store a radix tree index in
> the page tables, and in the radix tree store the swap entry AND the
> swap count. I am not really sure how they would fit in 8 bytes,
> especially if we need continuation and 1 byte is not enough for the
> swap count. Continuation logic now depends on linking vmalloc'd
> pages using the lru field in struct page/folio. Perhaps we can
> figure out a split that gives enough space for the swap count
> without continuation while also not limiting swapfile sizes too
> much.
>
> Secondly, IIUC in that proposal once we swap a page in, we free the
> swap entry and add the swapcache page to the radix tree instead. In
> that case, where does the swap count go? IIUC we still need to
> maintain it to be able to tell when all processes mapping the page
> have faulted it back, otherwise the radix tree entry is maintained
> indefinitely. We can maybe stash the swap count somewhere else in
> this case, and bring it back to the radix tree if we swap the page
> out again. I am not really sure where; we could have a separate
> radix tree for swap counts while the page is in the swapcache, or we
> could always keep the count in a separate radix tree so that the
> swap entry fits comfortably in the first radix tree.
>
> To be able to accommodate zswap in this design, I think we always
> need a separate radix tree for swap counts. In that case, one radix
> tree contains swap_entry/zswap_entry/swapcache, and the other radix
> tree contains the swap count. I think this may work, but I am not
> sure if the overhead of always doing a lookup to read the swap count
> is okay. I am also sure there would be some fun synchronization
> problems between the two trees (but we already need to synchronize
> today between the swapcache and swap counts?).
>
> It sounds like it is possible to make it work. I will spend some
> time thinking about it. Having two radix trees also solves the
> 32-bit systems problem, but I am not sure if it's a generally better
> design. Radix trees also take up some extra space beyond the entry
> size itself, so I am not sure how much memory we would actually end
> up saving.
>
> Johannes, I am curious if you have any thoughts about this
> alternative design?

I completely forgot about shadow entries here. I don't think any of
this works with shadow entries, as we still need to maintain them
while the page is swapped out.
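For concreteness, the two-radix-tree layout discussed above would be
roughly the following (purely illustrative, with made-up names; it
uses today's xarray API, assumes <linux/xarray.h>, and stores counts
as value entries via xa_mk_value()):

	/* swap id -> swap_entry / zswap_entry / swapcache folio,
	 * distinguished by a type tag in the low bits */
	static DEFINE_XARRAY(swap_objects);

	/* swap id -> swap count, kept even while the page sits in the
	 * swapcache so we can tell when everyone has faulted it back */
	static DEFINE_XARRAY(swap_counts);

	static unsigned long swap_id_count(unsigned long id)
	{
		/* every count read is an extra lookup; this is the
		 * overhead I am unsure about */
		return xa_to_value(xa_load(&swap_counts, id));
	}

It also shows where shadow entries get awkward, per the point above:
the swap_objects slot is already occupied while the page is swapped
out, so the workingset shadow would need a home of its own.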