From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 1 Mar 2023 16:58:08 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Minchan Kim, Johannes Weiner
Cc: Sergey Senozhatsky, lsf-pc@lists.linux-foundation.org, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Andrew Morton

On Tue, Feb 28, 2023 at 3:29 PM Minchan Kim wrote:
>
> Hi Yosry,
>
> On Tue, Feb 28, 2023 at 12:12:05AM -0800, Yosry Ahmed wrote:
> > On Mon, Feb 27, 2023 at 8:54 PM Sergey Senozhatsky wrote:
> > >
> > > On (23/02/18 14:38), Yosry Ahmed wrote:
> > > [..]
> > > > ==================== Idea ====================
> > > > Introduce a data structure, which I currently call a swap_desc, as an
> > > > abstraction layer between the swapping implementation and the rest of
> > > > MM code. Page tables & page caches would store a swap id (encoded as
> > > > a swp_entry_t) instead of directly storing the swap entry associated
> > > > with the swapfile. This swap id maps to a struct swap_desc, which acts
> > > > as our abstraction layer. All MM code not concerned with swapping
> > > > details would operate in terms of swap descs. The swap_desc can point
> > > > to either a normal swap entry (associated with a swapfile) or a zswap
> > > > entry. It can also hold all non-backend-specific state, such as the
> > > > swapcache (which would be a simple pointer in swap_desc), swap
> > > > counting, etc. It creates a clear, clean abstraction layer between MM
> > > > code and the actual swapping implementation.
> > > >
> > > > ==================== Benefits ====================
> > > > This work enables using zswap without a backing swapfile and increases
> > > > the swap capacity when zswap is used with a swapfile. It also creates
> > > > a separation that allows us to skip code paths that don't make sense
> > > > in the zswap path (e.g. readahead). We get to drop zswap's rbtree,
> > > > which might result in better performance (fewer lookups, less lock
> > > > contention).
> > > >
> > > > The abstraction layer also opens the door for multiple cleanups (e.g.
> > > > removing swapper address spaces, removing swap count continuation
> > > > code, etc.). Another nice cleanup that this work enables would be
> > > > separating the overloaded swp_entry_t into two distinct types: one for
> > > > things that are stored in page tables / caches, and one for actual
> > > > swap entries.
> > > > In the future, we can potentially further optimize how we use
> > > > the bits in the page tables instead of sticking everything into the
> > > > current type/offset format.
> > > >
> > > > Another potential win here can be swapoff, which can be made more
> > > > practical by directly scanning all swap_desc's instead of going
> > > > through page tables and shmem page caches.
> > > >
> > > > Overall zswap becomes more accessible and available to a wider range
> > > > of use cases.
> > >
> > > I assume this also brings us closer to a proper writeback LRU handling?
> >
> > I assume by proper LRU handling you mean:
> > - Swap writeback LRU that lives outside of the zpool backends (i.e. in
> > zswap itself or even outside zswap).
>
> Even outside zswap, to support any combination on any heterogeneous
> multiple swap device configuration.

Agreed, this is the end goal for the writeback LRU.

>
> The indirection layer would be essential to support it, but it would
> also be great if we don't waste any memory for users who don't
> want the feature.

I can't currently think of a way to eliminate the overhead for people
only using swapfiles, as a lot of the core implementation changes,
unless we want to maintain considerably more code with a lot of
repeated functionality implemented differently. Perhaps this will
change as I implement this; maybe things are better (or worse) than
what I think they are. I am actively working on a proof-of-concept
right now. Maybe a discussion at LSF/MM/BPF will help come up with
optimizations as well :)

>
> Just FYI, there was a similar discussion a long time ago about the
> indirection layer.
> https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/

Yeah, Hugh shared this one with me earlier, but there are a few things
about it that I don't understand how they would work, at least in
today's world.

Firstly, the proposal suggests that we store a radix tree index in the
page tables, and in the radix tree store the swap entry AND the swap
count.
I am not really sure how they would fit in 8 bytes, especially if we
need continuation and 1 byte is not enough for the swap count.
Continuation logic currently depends on linking vmalloc'd pages using
the lru field in struct page/folio. Perhaps we can figure out a split
that gives enough space for the swap count without continuation while
also not limiting swapfile sizes too much.

Secondly, IIUC in that proposal, once we swap a page in, we free the
swap entry and add the swapcache page to the radix tree instead. In
that case, where does the swap count go? IIUC we still need to
maintain it to be able to tell when all processes mapping the page
have faulted it back; otherwise the radix tree entry is maintained
indefinitely. We could maybe stash the swap count somewhere else in
this case and bring it back to the radix tree if we swap the page out
again. I am not really sure where: we could have a separate radix tree
for swap counts when the page is in the swapcache, or we could always
keep the count in a separate radix tree so that the swap entry fits
comfortably in the first radix tree.

To be able to accommodate zswap in this design, I think we always need
a separate radix tree for swap counts. In that case, one radix tree
contains swap_entry/zswap_entry/swapcache, and the other radix tree
contains the swap count. I think this may work, but I am not sure if
the overhead of always doing a lookup to read the swap count is okay.
I am also sure there would be some fun synchronization problems
between both trees (but we already need to synchronize today between
the swapcache and swap counts?). It sounds like it is possible to make
it work, so I will spend some time thinking about it.

Having 2 radix trees also solves the 32-bit systems problem, but I am
not sure if it's a generally better design. Radix trees also take up
some extra space beyond the entry size itself, so I am not sure how
much memory we would end up actually saving.
Johannes, I am curious if you have any thoughts about this alternative design?