From: "Huang, Ying"
To: Yosry Ahmed
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner,
    Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
    Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
    Minchan Kim, Andrew Morton, Aneesh Kumar K V, Wei Xu
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
References: <87y1ns3zeg.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <874jqcteyq.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87v8isrwck.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87bkkkrps8.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87sfduri1j.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87edpbq96g.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87jzz1pfb3.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 28 Mar 2023 14:59:52 +0800
In-Reply-To: (Yosry Ahmed's message of "Mon, 27 Mar 2023 23:29:20 -0700")
Message-ID: <87fs9ppdhz.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yosry Ahmed writes:

> On Mon, Mar 27, 2023 at 11:22 PM Huang, Ying wrote:
>>
>> Yosry Ahmed writes:
>>
>> > On Sun, Mar 26, 2023 at 6:24 PM Huang, Ying wrote:
>> >>
>> >> Chris Li writes:
>> >>
>> >> > On Fri, Mar 24, 2023 at 12:28:31AM -0700, Yosry Ahmed wrote:
>> >> >> > In fact, I just suggest to use the minimal design on top of the current
>> >> >> > implementation as the first step.  Then, you can improve it step by
>> >> >> > step.
>> >> >> >
>> >> >> > The first step could be the minimal effort to implement indirection
>> >> >> > layer and moving swapped pages between swap implementations.  Based on
>> >> >> > that, you can build other optimizations, such as pulling swap counting
>> >> >> > to the swap core.  For each step, we can evaluate the gain and cost with
>> >> >> > data.
>> >> >>
>> >> >> Right, I understand that, but to implement the indirection layer on
>> >> >> top of the current implementation, then we will need to support using
>> >> >> zswap without a backing swap device. In order to do this without
>> >> >
>> >> > Agree with Ying on the minimal approach here as well.
>> >> >
>> >> > There are two ways to approach this.
>> >> >
>> >> > 1) Forget zswap, make a minimal implementation to move the page between
>> >> > two swapfile device. It can be swapfile back to two loop back files.
>> >> >
>> >> > Any indirect layer you design will need to convert this usage case
>> >> > any way.
>> >> >
>> >> > 2) Make zswap work without a swapfile.
>> >> > You can implement the zswap on a fake ghosts swap file.
>> >> >
>> >> > If you keep the zswap as frontswap, just make zswap can work without
>> >> > a real swapfile.
>> >> >
>> >> > Make that as your first minimal step. Then it does not need to touch
>> >> > the swap count changes.
>> >> >
>> >> > I view make that step is independent of moving pages between swap device.
>> >> >
>> >> > That patch exists and I consider it has value to some users.
>> >>
>> >> This sounds like an even smaller approach as the first step.  Further
>> >> improvement can be built on top of it.
>> >
>> > I am not sure how this would be a step towards the abstraction goal we
>> > have been discussing.
>> >
>> > We have been discussing starting out with a minimal indirection layer,
>> > in the shape of an xarray that maps a swap ID to a swap entry, and
>> > that can be disabled with a config option.
>> >
>> > For such a design to work, we have to implement swap entry management
>> > & swap counting in zswap, right? Am I missing something?
>>
>> Chris suggested to avoid to implement the swap entry management & swap
>> counting in zswap via using a "fake ghost swap file".  Copied his
>> suggestion as below,
>
> Right, we have been using ghost swapfiles at Google for a while. They
> are basically sparse files that you can never actually write to, they
> are just used so that we can use zswap without a backing swap device.
>
> What I do not understand is how this is a step towards the ultimate
> goal of swap abstraction. Is the idea to have the indirection layer
> only support moving swapped pages between swapfiles, and have those
> "ghost" swapfiles be on a higher tier than normal swapfiles? In this
> case, I am guessing we eliminate the writeback logic from zswap itself
> and move it to this indirection layer.

Yes.  I think the suggested minimal first step includes replacing
zswap's own writeback logic with the swap core's mechanism for moving
swapped pages (the indirection layer).

> I don't have a problem with this approach, it is not really clean as
> we still treat zswap as a swapfile and have to deal with a lot of
> unnecessary code like swap slots handling and whatnot.

Isn't that existing code?

> We also have to unnecessarily limit the size of zswap with the size of
> this fake swapfile.

I guess you need to limit the size of zswap anyway, because you need to
decide when to start writing back or moving pages to the lower tiers.

> In other words, we retain a lot of limitations that we have today.

That is for the minimal first step, not the final state.

> Keep in mind that supporting ghost swapfiles is something that
> is exposed to userspace, so we have to commit to supporting it -- it
> can't just be an incremental step that we will change later.

Yes.  We should really care about the ABI.  It's not a good idea to add
ABI for an intermediate step.
Do we need to change the ABI to use a sparse file to back zswap?

> With all that said, it is certainly a much simpler "solution".
> Interested to hear thoughts on this, we can certainly pursue it if
> people think it is the right way to move forward.

Personally, I have no problem with changing the design of the swap code
to add useful features.  I just want to check whether we can do that
step by step and show the benefit and cost clearly at each step.

Best Regards,
Huang, Ying

>>
>> "
>> >> > 2) Make zswap work without a swapfile.
>> >> > You can implement the zswap on a fake ghosts swap file.
>> >> >
>> >> > If you keep the zswap as frontswap, just make zswap can work without
>> >> > a real swapfile.
>> >> >
>> >> > Make that as your first minimal step. Then it does not need to touch
>> >> > the swap count changes.
>> "
>>
>> Best Regards,
>> Huang, Ying
>>
>> >>
>> >> >> > Anyway, I don't think you can just implement all your final solution in
>> >> >> > one step.  And, I think the minimal design suggested could be a starting
>> >> >> > point.
>> >> >>
>> >> >> I agree that's a great point, I am just afraid that we will avoid
>> >> >> implementing that full final solution and instead do a lot of work
>> >> >> inside zswap to make up for the difference (e.g. swap entry
>> >> >> management, swap counting). Also, that work in zswap may end up being
>> >> >> unacceptable due to the maintenance burden and/or complexity.
>> >> >
>> >> > If you do either 1) or 2), you can keep these two paths separate.
>> >> >
>> >> > Even if you want to move the page between zswap and swapfile.
>> >> >
>> >> > Idea 3)
>> >> > You don't have to change the swap count code, you can do a
>> >> > minimal change moves the page between zswap and another block
>> >> > device. That way you can get two different swap entry with
>> >> > existing code.
>> >> >
>> >> > Chris
>> >>
>>
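
For readers following the thread, the minimal indirection layer discussed
above (a map from a swap ID to a swap entry, so that a swapped page can be
moved between swap backends) can be modeled in a few lines.  This is only
a userspace sketch in Python, not kernel code and not anyone's proposed
patch.  The names Backend, SwapIndirection, swap_out, swap_in, move and
backend_of are all hypothetical, and a plain dict stands in for the xarray
mentioned in the thread.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Backend:
    """Toy swap backend (hypothetical): a compressed tier or a swapfile."""
    name: str
    slots: Dict[int, bytes] = field(default_factory=dict)
    next_slot: int = 0

    def store(self, data: bytes) -> int:
        # Allocate the next free slot and keep the page contents in it.
        slot = self.next_slot
        self.next_slot += 1
        self.slots[slot] = data
        return slot

    def load(self, slot: int) -> bytes:
        return self.slots[slot]

    def free(self, slot: int) -> None:
        del self.slots[slot]


class SwapIndirection:
    """Maps a stable swap ID to a (backend, slot) pair (assumed design)."""

    def __init__(self) -> None:
        # A dict stands in for the xarray mentioned in the thread.
        self._map: Dict[int, Tuple[Backend, int]] = {}
        self._next_id = 0

    def swap_out(self, backend: Backend, data: bytes) -> int:
        # Store the page and hand back an ID that never changes afterwards.
        swap_id = self._next_id
        self._next_id += 1
        self._map[swap_id] = (backend, backend.store(data))
        return swap_id

    def swap_in(self, swap_id: int) -> bytes:
        backend, slot = self._map[swap_id]
        return backend.load(slot)

    def move(self, swap_id: int, dest: Backend) -> None:
        # Copy the page to the destination tier, repoint the map entry,
        # then release the old slot.  Holders of swap_id see no change.
        src, slot = self._map[swap_id]
        data = src.load(slot)
        self._map[swap_id] = (dest, dest.store(data))
        src.free(slot)

    def backend_of(self, swap_id: int) -> str:
        return self._map[swap_id][0].name
```

In this model, move() rewrites only the map entry, so whoever holds the
swap ID does not need to be updated.  That is the property that would let
writeback from an upper tier to a lower tier live in the swap core rather
than in zswap itself.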