From: Yosry Ahmed
Date: Tue, 28 Mar 2023 00:59:31 -0700
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: "Huang, Ying"
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner,
 Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
 Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
 Minchan Kim, Andrew Morton, Aneesh Kumar K V, Michal Hocko, Wei Xu
In-Reply-To: <87fs9ppdhz.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <87y1ns3zeg.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <874jqcteyq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87v8isrwck.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87bkkkrps8.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sfduri1j.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87edpbq96g.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87jzz1pfb3.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87fs9ppdhz.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Tue, Mar 28, 2023 at 12:01 AM Huang, Ying wrote:
>
> Yosry Ahmed writes:
>
> > On Mon, Mar 27, 2023 at 11:22 PM Huang, Ying wrote:
> >>
> >> Yosry Ahmed writes:
> >>
> >> > On Sun, Mar 26, 2023 at 6:24 PM Huang, Ying wrote:
> >> >>
> >> >> Chris Li writes:
> >> >>
> >> >> > On Fri, Mar 24, 2023 at 12:28:31AM -0700, Yosry Ahmed wrote:
> >> >> >> > In fact, I just suggest to use the minimal design on top of the current
> >> >> >> > implementation as the
> >> >> >> > first step. Then, you can improve it step by
> >> >> >> > step.
> >> >> >> >
> >> >> >> > The first step could be the minimal effort to implement indirection
> >> >> >> > layer and moving swapped pages between swap implementations. Based on
> >> >> >> > that, you can build other optimizations, such as pulling swap counting
> >> >> >> > to the swap core. For each step, we can evaluate the gain and cost with
> >> >> >> > data.
> >> >> >>
> >> >> >> Right, I understand that, but to implement the indirection layer on
> >> >> >> top of the current implementation, then we will need to support using
> >> >> >> zswap without a backing swap device. In order to do this without
> >> >> >
> >> >> > Agree with Ying on the minimal approach here as well.
> >> >> >
> >> >> > There are two ways to approach this.
> >> >> >
> >> >> > 1) Forget zswap, make a minimal implementation to move the page between
> >> >> > two swapfile device. It can be swapfile back to two loop back files.
> >> >> >
> >> >> > Any indirect layer you design will need to convert this usage case
> >> >> > any way.
> >> >> >
> >> >> > 2) Make zswap work without a swapfile.
> >> >> > You can implement the zswap on a fake ghosts swap file.
> >> >> >
> >> >> > If you keep the zswap as frontswap, just make zswap can work without
> >> >> > a real swapfile.
> >> >> >
> >> >> > Make that as your first minimal step. Then it does not need to touch
> >> >> > the swap count changes.
> >> >> >
> >> >> > I view make that step is independent of moving pages between swap device.
> >> >> >
> >> >> > That patch exists and I consider it has value to some users.
> >> >>
> >> >> This sounds like an even smaller approach as the first step. Further
> >> >> improvement can be built on top of it.
> >> >
> >> > I am not sure how this would be a step towards the abstraction goal we
> >> > have been discussing.
> >> >
> >> > We have been discussing starting out with a minimal indirection layer,
> >> > in the shape of an xarray that maps a swap ID to a swap entry, and
> >> > that can be disabled with a config option.
> >> >
> >> > For such a design to work, we have to implement swap entry management
> >> > & swap counting in zswap, right? Am I missing something?
> >>
> >> Chris suggested to avoid to implement the swap entry management & swap
> >> counting in zswap via using a "fake ghost swap file". Copied his
> >> suggestion as below,
> >
> > Right, we have been using ghost swapfiles at Google for a while. They
> > are basically sparse files that you can never actually write to, they
> > are just used so that we can use zswap without a backing swap device.
> >
> > What I do not understand is how this is a step towards the ultimate
> > goal of swap abstraction. Is the idea to have the indirection layer
> > only support moving swapped pages between swapfiles, and have those
> > "ghost" swapfiles be on a higher tier than normal swapfiles? In this
> > case, I am guessing we eliminate the writeback logic from zswap itself
> > and move it to this indirection layer.
>
> Yes. I think the suggested minimal first step includes replacing the
> writeback logic of zswap itself with moving swapped page of swap core
> (indirectly layer).
>
> > I don't have a problem with this approach, it is not really clean as
> > we still treat zswap as a swapfile and have to deal with a lot of
> > unnecessary code like swap slots handling and whatnot.
>
> These are existing code?

I was referring to the fact that today, with zswap being tied to
swapfiles, we do some unnecessary work such as searching for swap slots
during swapout. The initial swap_desc approach aimed to avoid that.
With this minimal ghost swapfile approach we retain this unfavorable
behavior.

>
> > We also have to unnecessarily limit the size of zswap with the size of
> > this fake swapfile.
>
> I guess you need to limit the size of zswap anyway, because you need to
> decide when to start to writeback or moving to the lower tiers.

zswap has a knob to limit its size, but it is based on the actual
memory usage of zswap (i.e., the size of compressed pages). There is
ongoing work as well to autotune this, if I remember correctly. Having
to deal with both the limit on compressed memory and the limit on the
uncompressed size of swapped pages is cumbersome. Again, we already
have this behavior today, but the initial swap_desc proposal aimed to
avoid it.

>
> > In other words, we retain a lot of limitations that we have today.
>
> As the minimal first step, not the final state.

I am assuming the first step here is using ghost swapfiles to support
zswap without a backing swap device and having an indirection layer
between swapfiles. A following step can be making this indirection
layer support zswap directly without having to use such a ghost
swapfile, if the cost-benefit analysis proves that this is worthwhile.
Is my understanding correct?

>
> > Keep in mind that supporting ghost swapfiles is something that
> > is exposed to userspace, so we have to commit to supporting it -- it
> > can't just be an incremental step that we will change later.
>
> Yes. We should really care about ABI. It's not a good idea to add ABI
> for an intermediate step. Do we need to change ABI to use a sparse file
> to backing zswap?

I think so. At Google we identify ghost swapfiles by their 0 block
length and mark them in the kernel. The upstream kernel would reject
such a file because it doesn't have a proper swap header. If we use
mkswap to have a proper swap header, then swapon rejects the file
because it has holes.

>
> > With all that said, it is certainly a much simpler "solution".
> > Interested to hear thoughts on this, we can certainly pursue it if
> > people think it is the right way to move forward.
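
As a rough userspace illustration of the "holes" problem mentioned
above (not what the kernel or swapon actually runs), the sketch below
classifies a file from its stat() data. It assumes Linux's 512-byte
st_blocks units, and it reads "0 block length" as "no blocks
allocated", which is only one possible reading. All function names are
made up for this example.

```python
import os

SECTOR = 512  # assumption: st_blocks is counted in 512-byte units (Linux)


def has_holes(size_bytes: int, allocated_blocks: int) -> bool:
    """Heuristic: fewer bytes allocated on disk than the file's length."""
    return allocated_blocks * SECTOR < size_bytes


def is_ghost_candidate(size_bytes: int, allocated_blocks: int) -> bool:
    """Hypothetical 'ghost' test: a non-empty file with nothing allocated."""
    return size_bytes > 0 and allocated_blocks == 0


def classify(path: str) -> str:
    """Classify a would-be swapfile as 'ghost', 'sparse' or 'dense'."""
    st = os.stat(path)
    if is_ghost_candidate(st.st_size, st.st_blocks):
        return "ghost"
    if has_holes(st.st_size, st.st_blocks):
        return "sparse"
    return "dense"
```

A file created with truncate and never written would typically come
out as "ghost" here, and the same file after mkswap would typically be
"sparse" (only the header allocated), though the exact st_blocks values
depend on the filesystem.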
>
> Personally, I have no problem to change the design of swap code to add
> useful features. Just want to check whether we can do that step by step
> and show benefit and cost clearly in each step.

Right. I understand and totally agree. Even from a development point of
view, it's much better to make big changes incrementally to avoid doing
a lot of work that ends up going nowhere. I am just trying to make sure
that whatever we decide is indeed a step in the right direction.

Thanks for a very insightful discussion.

>
> Best Regards,
> Huang, Ying
>
> >>
> >> "
> >> >> > 2) Make zswap work without a swapfile.
> >> >> > You can implement the zswap on a fake ghosts swap file.
> >> >> >
> >> >> > If you keep the zswap as frontswap, just make zswap can work without
> >> >> > a real swapfile.
> >> >> >
> >> >> > Make that as your first minimal step. Then it does not need to touch
> >> >> > the swap count changes.
> >> "
> >>
> >> Best Regards,
> >> Huang, Ying
> >>
> >> >>
> >> >> >> > Anyway, I don't think you can just implement all your final solution in
> >> >> >> > one step. And, I think the minimal design suggested could be a starting
> >> >> >> > point.
> >> >> >>
> >> >> >> I agree that's a great point, I am just afraid that we will avoid
> >> >> >> implementing that full final solution and instead do a lot of work
> >> >> >> inside zswap to make up for the difference (e.g. swap entry
> >> >> >> management, swap counting). Also, that work in zswap may end up being
> >> >> >> unacceptable due to the maintenance burden and/or complexity.
> >> >> >
> >> >> > If you do either 1) or 2), you can keep these two paths separate.
> >> >> >
> >> >> > Even if you want to move the page between zswap and swapfile.
> >> >> >
> >> >> > Idea 3)
> >> >> > You don't have to change the swap count code, you can do a
> >> >> > minimal change moves the page between zswap and another block
> >> >> > device.
> >> >> > That way you can get two different swap entry with
> >> >> > existing code.
> >> >> >
> >> >> > Chris
> >> >>
> >> >
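
For reference, the indirection layer discussed in this thread (a swap
ID mapped to a swap entry, with pages movable between swap
implementations) can be modelled in a few lines of userspace code. This
is a toy sketch only: a dict stands in for the xarray, and the
Backend and SwapIndirection classes and their methods are hypothetical
names, not kernel interfaces.

```python
from typing import Dict, Tuple


class Backend:
    """One swap implementation (e.g. zswap or a swapfile); illustrative only."""

    def __init__(self, name: str, nr_slots: int) -> None:
        self.name = name
        self.nr_slots = nr_slots
        self.slots: Dict[int, bytes] = {}

    def store(self, data: bytes) -> int:
        # Linear search for a free slot, standing in for a slot allocator.
        for slot in range(self.nr_slots):
            if slot not in self.slots:
                self.slots[slot] = data
                return slot
        raise MemoryError(f"{self.name} is full")

    def load(self, slot: int) -> bytes:
        return self.slots[slot]

    def free(self, slot: int) -> None:
        del self.slots[slot]


class SwapIndirection:
    """Maps a stable swap ID to (backend, slot)."""

    def __init__(self) -> None:
        self.map: Dict[int, Tuple[Backend, int]] = {}
        self.next_id = 0

    def swap_out(self, backend: Backend, data: bytes) -> int:
        slot = backend.store(data)
        swap_id = self.next_id
        self.next_id += 1
        self.map[swap_id] = (backend, slot)
        return swap_id

    def swap_in(self, swap_id: int) -> bytes:
        backend, slot = self.map[swap_id]
        return backend.load(slot)

    def move(self, swap_id: int, dst: Backend) -> None:
        """Move a swapped page to another backend; its swap ID is unchanged."""
        src, slot = self.map[swap_id]
        new_slot = dst.store(src.load(slot))
        src.free(slot)
        self.map[swap_id] = (dst, new_slot)
```

In this model, whatever refers to the page holds only the swap ID, so
writeback from a higher tier to a lower one amounts to a move() call
and no reference to the page needs updating.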