From: Chris Li
Date: Wed, 29 Mar 2023 09:04:20 -0700
To: Yosry Ahmed
Cc: "Huang, Ying", lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap

On Tue, Mar 28, 2023 at 06:41:54PM -0700, Yosry Ahmed wrote:
> My main concern here would be having two separate swap counting
> implementations -- although it might not be the end of the world. It
> would be useful to consider all the options. So far, I think we have

Agree.

> been discussing 3 alternatives:
>
> (a) The initial swap_desc proposal.
> (b) Add an optional indirection layer that can move swap entries
> between swap devices and add a virtual swap device for zswap in the
> kernel.

For completeness' sake, let me add some options that have both
pros and cons.

(d) There is Google's ghost swap file. I understand it means a bit of an
ABI change. It has the advantage that it allows more than one zswap
swapfile; Google uses it that way. Another consideration is that the ghost
swap file is compatible with the existing swapon behavior: you can see how
many swap entries are in use from the swapon summary, and some
applications might depend on that. We might be able to find some way to
break the ABI less.

> (c) Add an optional indirection layer that can move entries between
> different swap backends. Swap backends would be zswap & swap devices
> for now. Zswap needs to implement swap entry management, swap
> counting, etc.

(f) I have been thinking of a variant of (b) that does not add a virtual
swap device for zswap and uses the ghost swap file instead. Also, the
indirection is optional per swap entry at run time. A swap device can
have some of its entries moved to another swap device, and only those
swap entries pay the price of the indirection layer.

(e) This is the long-term goal I have in mind: a VFS-like implementation
for swap files. Let's call it VSW. It allows different swap devices to use
different swap file system implementations. We face a lot of difficult
trade-offs right now, for example: a smaller per-entry allocation made up
front for all entries, like swap_map[], versus allocating memory only for
swap entries that have actually been swapped out, but with a larger
per-entry allocation.
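To put rough numbers on that trade-off, here is a back-of-the-envelope
sketch. The 1 byte per slot corresponds to the unsigned char entries of
swap_map[]; the 64-byte on-demand entry size and the 64 GiB device are
purely illustrative assumptions, not numbers from any of the proposals,
and the function names are made up for this example.

```python
# Back-of-the-envelope comparison of the two allocation strategies.
# All sizes are illustrative assumptions, not measured kernel numbers.

PAGE_SIZE = 4096  # bytes covered by one swap slot (4 KiB pages)


def upfront_overhead(device_bytes, bytes_per_slot=1):
    """Metadata allocated at swapon time for every slot, used or not."""
    slots = device_bytes // PAGE_SIZE
    return slots * bytes_per_slot


def on_demand_overhead(device_bytes, used_fraction, bytes_per_entry=64):
    """Metadata allocated only for slots that hold swapped-out pages."""
    slots = device_bytes // PAGE_SIZE
    used = int(slots * used_fraction)
    return used * bytes_per_entry


def break_even_fraction(bytes_per_slot=1, bytes_per_entry=64):
    """Utilization above which the on-demand scheme costs more memory."""
    return bytes_per_slot / bytes_per_entry


if __name__ == "__main__":
    device = 64 * 1024**3  # hypothetical 64 GiB swap device
    print("up-front bytes:", upfront_overhead(device))
    print("on-demand bytes at 50% used:", on_demand_overhead(device, 0.5))
    print("break-even utilization:", break_even_fraction())
```

With these assumed sizes, the on-demand scheme uses less memory only
while fewer than 1/64 of the slots (about 1.6%) are in use. A mostly
empty swap device favors on-demand allocation and a mostly full one
favors the up-front array.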
I believe some of those trade-offs can be addressed by having a
different swap file system. I do mean a different "mkswap", that kind of
file system. We can write some of the swap entry metadata out to the
swap file system as well. That means we don't have to pay the larger
per-entry allocation overhead for very cold pages. It might take two
reads to swap in some of the very cold swap entries, but that should be
rare.

It can offer benefits for swapping out large folios as well. Right now,
swapping out a large folio still needs to go through the per-4K-page swap
index allocation and break-down.

Basically, modernize the swap file system.

The redirection layer should be implementable within VSW as well.

I know that is a very ambitious plan :-)

We can do that incrementally. The swap file system doesn't carry much
backward-compatibility burden across reboots, so it should be easier than
a normal file system.

Chris
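
P.S. A toy model of the "optional per swap entry" indirection from (f),
in case it helps the discussion. This is only a sketch: every name in it
is hypothetical, a swap entry is modeled as a plain (device, offset)
tuple, and none of it is existing kernel API.

```python
# Toy model of option (f): indirection that is optional per swap entry.
# Every name here is hypothetical; none of it is existing kernel API.


class SwapSpace:
    def __init__(self):
        # Only entries that have been moved get a record here, so only
        # they pay the memory and lookup cost of the indirection layer.
        self.redirect = {}

    def move(self, old, new):
        """Record that the data for entry `old` now lives at `new`."""
        self.redirect[old] = new

    def resolve(self, entry):
        """Return the (device, offset) that currently backs `entry`."""
        return self.redirect.get(entry, entry)


if __name__ == "__main__":
    space = SwapSpace()
    entry = ("ghost", 5)              # entry originally on a ghost swap file
    print(space.resolve(entry))       # not moved: resolves to itself
    space.move(entry, ("ssd", 42))    # e.g. written back to a real device
    print(space.resolve(entry))       # moved: resolves through the table
```

Entries that were never moved resolve to themselves without any extra
state, which is the point of making the indirection per entry rather
than per device.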