Date: Thu, 2 Mar 2023 09:05:28 -0800
From: Chris Li
To: Yosry Ahmed
Cc: Minchan Kim, Johannes Weiner, Sergey Senozhatsky,
    lsf-pc@lists.linux-foundation.org, Linux-MM, Michal Hocko,
    Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
    Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Andrew Morton,
    Rik van Riel
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap

On Wed, Mar 01, 2023 at 04:58:08PM -0800, Yosry Ahmed wrote:
> > The indirection layer would be
> > essential to support it but it would
> > be also great if we don't waste any memory for users who don't
> > want the feature.
>
> I can't currently think of a way to eliminate overhead for people only
> using swapfiles, as a lot of the core implementation changes, unless
> we want to maintain considerably more code with a lot of repeated
> functionality implemented differently. Perhaps this will change as I
> implement this, maybe things are better (or worse) than what I think
> they are; I am actively working on a proof-of-concept right now. Maybe
> a discussion in LSF/MM/BPF will help come up with optimizations as
> well :)
>
> >
> > Just FYI, there was a similar discussion a long time ago about the
> > indirection layer.
> > https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
>
> Yeah Hugh shared this one with me earlier, but there are a few things
> that I don't understand how they would work, at least in today's
> world.

Let's add Rik into the discussion, maybe he can help refresh some
details.

Chris

>
> Firstly, the proposal suggests that we store a radix tree index in the
> page tables, and in the radix tree store the swap entry AND the swap
> count. I am not really sure how they would fit in 8 bytes, especially
> if we need continuation and 1 byte is not enough for the swap count.
> Continuation logic now depends on linking vmalloc'd pages using the
> lru field in struct page/folio. Perhaps we can figure out a split that
> gives enough space for swap count without continuation while also not
> limiting swapfile sizes too much.
>
> Secondly, IIUC in that proposal once we swap a page in, we free the
> swap entry and add the swapcache page to the radix tree instead. In
> that case, where does the swap count go? IIUC we still need to
> maintain it to be able to tell when all processes mapping the page
> have faulted it back, otherwise the radix tree entry is maintained
> indefinitely.
> We can maybe stash the swap count somewhere else in this
> case, and bring it back to the radix tree if we swap the page out
> again. Not really sure where, we can have a separate radix tree for
> swap counts when the page is in swapcache, or we can always have it in
> a separate radix tree so that the swap entry fits comfortably in the
> first radix tree.
>
> To be able to accommodate zswap in this design, I think we always need
> a separate radix tree for swap counts. In that case, one radix tree
> contains swap_entry/zswap_entry/swapcache, and the other radix tree
> contains the swap count. I think this may work, but I am not sure if
> the overhead of always doing a lookup to read the swap count is okay.
> I am also sure there would be some fun synchronization problems
> between both trees (but we already need to synchronize today between
> the swapcache and swap counts?).
>
> It sounds like it is possible to make it work. I will spend some time
> thinking about it. Having 2 radix trees also solves the 32-bit systems
> problem, but I am not sure if it's a generally better design. Radix
> trees also take up some extra space other than the entry size itself,
> so I am not sure how much memory we would end up actually saving.
>
> Johannes, I am curious if you have any thoughts about this alternative design?
>
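
To make the "how would they fit in 8 bytes" question above a bit more
concrete, here is a rough userspace sketch of one possible split of a
64-bit slot into swap count, swap type and swap offset. It is not taken
from the kernel or from the older proposal: the SLOT_* widths and the
slot_*() helper names are made up for illustration, and a real radix
tree slot would likely lose a few more bits to tagging.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout of one 64-bit slot (illustration only, none of
 * these widths or names exist in the kernel):
 *
 *   [ count : 16 bits ][ type : 5 bits ][ offset : 43 bits ]
 *
 * A wider count field avoids continuation but shrinks the offset
 * field, which caps the number of pages one swapfile can address.
 */
#define SLOT_COUNT_BITS   16u
#define SLOT_TYPE_BITS     5u
#define SLOT_OFFSET_BITS  (64u - SLOT_COUNT_BITS - SLOT_TYPE_BITS)

#define SLOT_COUNT_MAX    ((UINT64_C(1) << SLOT_COUNT_BITS) - 1)
#define SLOT_TYPE_MAX     ((UINT64_C(1) << SLOT_TYPE_BITS) - 1)
#define SLOT_OFFSET_MAX   ((UINT64_C(1) << SLOT_OFFSET_BITS) - 1)

/* Pack the three fields into one 64-bit value. */
static inline uint64_t slot_pack(uint64_t type, uint64_t offset,
                                 uint64_t count)
{
    assert(type <= SLOT_TYPE_MAX);
    assert(offset <= SLOT_OFFSET_MAX);
    assert(count <= SLOT_COUNT_MAX);

    return (count << (SLOT_TYPE_BITS + SLOT_OFFSET_BITS)) |
           (type << SLOT_OFFSET_BITS) |
           offset;
}

/* Extract the swap count (top bits). */
static inline uint64_t slot_count(uint64_t slot)
{
    return slot >> (SLOT_TYPE_BITS + SLOT_OFFSET_BITS);
}

/* Extract the swap type (middle bits). */
static inline uint64_t slot_type(uint64_t slot)
{
    return (slot >> SLOT_OFFSET_BITS) & SLOT_TYPE_MAX;
}

/* Extract the page offset within the swapfile (low bits). */
static inline uint64_t slot_offset(uint64_t slot)
{
    return slot & SLOT_OFFSET_MAX;
}
```

With these assumed widths the offset field still has 43 bits, so the
count can grow well past one byte before swapfile size becomes the
limiting factor. Keeping the count in a second tree, as discussed in
the quoted text, would instead leave all 64 bits to the entry itself at
the cost of an extra lookup.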