Subject: Re: [RFC mm/zswap 1/2] mm/zswap: add the flag can_sleep_mapped
From: "tiantao (H)"
Date: Tue, 19 Jan 2021 09:28:45 +0800
To: Vitaly Wool, Shakeel Butt
CC: Minchan Kim, "tiantao (H)", Seth Jennings, Dan Streetman, Andrew Morton, "Song Bao Hua (Barry Song)", Linux-MM, Sergey Senozhatsky
References: <1608894171-54174-1-git-send-email-tiantao6@hisilicon.com> <1608894171-54174-2-git-send-email-tiantao6@hisilicon.com>

On 2021/1/15 6:41, Vitaly Wool wrote:
> On Thu, Jan 14, 2021 at 10:09 PM Shakeel Butt wrote:
>> On Thu, Jan 14, 2021 at 11:53 AM Vitaly Wool wrote:
>>> On Thu, Jan 14, 2021 at 8:21 PM Shakeel Butt wrote:
>>>> On Thu, Jan 14, 2021 at 11:05 AM Vitaly Wool wrote:
>>>>> On Thu, Jan 14, 2021 at 7:56 PM Minchan Kim wrote:
>>>>>> On Thu, Jan 14, 2021 at 07:40:50PM +0100, Vitaly Wool wrote:
>>>>>>> On Thu, Jan 14, 2021 at 7:29 PM Minchan Kim wrote:
>>>>>>>> On Fri, Dec 25, 2020 at 07:02:50PM +0800, Tian Tao wrote:
>>>>>>>>> add a flag to zpool, named "can_sleep_mapped", and have it set true
>>>>>>>>> for zbud/z3fold, set false for zsmalloc. Then zswap could go the current
>>>>>>>>> path if the flag is true; and if it's false, copy data from src to a
>>>>>>>>> temporary buffer, then unmap the handle, take the mutex, and process the
>>>>>>>>> buffer instead of src, to avoid a sleeping function being called from atomic
>>>>>>>>> context.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Tian Tao
>>>>>>>>> ---
>>>>>>>>>  include/linux/zpool.h |  3 +++
>>>>>>>>>  mm/zpool.c            | 13 +++++++++++++
>>>>>>>>>  mm/zswap.c            | 50 +++++++++++++++++++++++++++++++++++++++++++++-----
>>>>>>>>>  3 files changed, 61 insertions(+), 5 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/zpool.h b/include/linux/zpool.h
>>>>>>>>> index 51bf430..e899701 100644
>>>>>>>>> --- a/include/linux/zpool.h
>>>>>>>>> +++ b/include/linux/zpool.h
>>>>>>>>> @@ -73,6 +73,7 @@ u64 zpool_get_total_size(struct zpool *pool);
>>>>>>>>>   * @malloc: allocate mem from a pool.
>>>>>>>>>   * @free: free mem from a pool.
>>>>>>>>>   * @shrink: shrink the pool.
>>>>>>>>> + * @sleep_mapped: whether zpool driver can sleep during map.
>>>>>>>> I don't think it's a good idea. It just breaks the zpool abstraction
>>>>>>>> in that it exposes an internal implementation detail to the user to avoid
>>>>>>>> an issue zswap recently introduced. It also conflicts with zpool_map_handle's
>>>>>>>> semantics.
>>>>>>>>
>>>>>>>> Rather than introducing another break in zpool due to the new
>>>>>>>> zswap feature recently introduced, zswap could introduce
>>>>>>>> CONFIG_ZSWAP_HW_COMPRESSOR. Once it's configured, zsmalloc could
>>>>>>>> be disabled. And with CONFIG_ZSWAP_HW_COMPRESSOR disabled, zswap
>>>>>>>> doesn't need to make any bounce buffer copy, so no existing
>>>>>>>> zsmalloc user will see a performance regression.
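To make the commit message above concrete, the copy-then-unmap flow it describes looks roughly like this. This is a userspace sketch, not the actual kernel code: zpool_can_sleep_mapped() mirrors the helper the patch adds, and everything else (the decompression step, the load function) is a stand-in for illustration only.

```c
#include <stdbool.h>
#include <string.h>

struct zpool {
	bool can_sleep_mapped;
};

/* Mirrors the accessor the patch adds: zbud/z3fold report true,
 * zsmalloc reports false. */
static bool zpool_can_sleep_mapped(struct zpool *pool)
{
	return pool->can_sleep_mapped;
}

/* Stand-in for the decompression step, which may sleep (it runs
 * under a mutex in zswap). */
static void decompress_may_sleep(const char *src, char *dst, size_t len)
{
	memcpy(dst, src, len);
}

static void zswap_load_sketch(struct zpool *pool, const char *mapped,
			      size_t len, char *page, char *tmp)
{
	const char *src = mapped;

	if (!zpool_can_sleep_mapped(pool)) {
		/* Atomic section: copy out of the mapping; the handle
		 * would be unmapped here, before anything can sleep. */
		memcpy(tmp, mapped, len);
		src = tmp;
	}
	/* The mutex is taken here; sleeping is now safe either way,
	 * because src no longer points into an atomic mapping when
	 * the backend is zsmalloc. */
	decompress_may_sleep(src, page, len);
}
```

The point of the branch is that the extra memcpy is only paid on backends that cannot sleep while mapped; zbud/z3fold keep the existing zero-copy path.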
>>>>>>> I believe it won't help that much -- the new compressor API presumes
>>>>>>> that the caller may sleep during compression, and that will be an
>>>>>>> accident waiting to happen as long as we use it and keep the handle
>>>>>>> mapped in the zsmalloc case.
>>>>>>>
>>>>>>> Or maybe I interpreted you wrong and you are suggesting re-introducing
>>>>>>> calls to the old API under this #ifdef, is that the case?
>>>>>> Yup. zswap could abstract that part under an #ifdef to keep the old behavior.
>>>>> We can reconsider this option when zsmalloc implements a reclaim
>>>>> callback. So far it's obviously too much of a mess for a reason so weak.
>>>>>
>>>> Sorry, I don't understand the link between zsmalloc implementing a shrink
>>>> callback and this patch.
>>> There is none. There is a link between taking on all the burden of reviving
>>> zsmalloc for zswap at the cost of extra zswap complexity, and zsmalloc
>>> not being fully zswap compatible.
>>>
>>> The ultimate zswap goal is to cache the hottest swapped-out pages in a
>>> compressed format. zsmalloc doesn't implement a reclaim callback, and
>>> therefore zswap can *not* fulfill its goal since old pages are there
>>> to stay, and it can't accept new, hotter ones. So, let me make it
>>> clear: the zswap/zsmalloc combo is a legacy corner case.
>>>
>> This is the first time I am hearing that the zswap/zsmalloc combo is a
>> legacy configuration. We use zswap in production and intentionally
>> size the zswap pool to never have to go to the backing device. So the
>> absence of a reclaim callback is not an issue for us. Please let me know
>> if the zswap/zsmalloc combo is officially on its way to deprecation.
> No, the zswap/zsmalloc combo is not on its way to deprecation. I generally
> would not advise using it, but your particular case does make sense
> (although using frontswap/zswap without a backing device *is* a corner
> case).
>
>>>> This patch is adding an overhead for all
>>>> zswap+zsmalloc users irrespective of the availability of hardware.
>>>> If we want to add support for new hardware, please add it without impacting the
>>>> current users.
>>> No, it's not like that. The zswap+zsmalloc combination is currently
>>> already broken
>> By broken do you mean the missing reclaim callback?
> No. I mean deadlocks in the -rt kernel / scheduling-while-atomic bugs in
> the mainline. The missing reclaim callback goes into the "not fully
> compatible" section in my world.
>
>>> and this patch implements a way to work around that.
>>> The users are already impacted, and that is of course a problem.
>> Are you talking about -rt users, or was there some other report?
> I don't think the issue is specific to the -rt series. There may (and
> will) still be "scheduling while atomic" bugs occurring, just not as
> often.
>
>>> The workaround is not perfect but looks good enough to me.
>> I would like to see a page fault perf experiment for the non-hardware case.
> I second that. @tiantao (H), would it be possible to provide one?

No problem, but can you tell me how to provide the data you want for page
faults?

> Also, correct me if I'm wrong, but from what I recall, the acomp API
> does one redundant copy less than the old comp API, so zsmalloc should
> be back to square one even with the buffer implemented in this patch.
> The other backends should do a little better, though; if so, it's
> the upside of not taking too many spinlocks.
>
> Best regards,
> Vitaly
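For my own understanding of Minchan's compile-time suggestion upthread, the split would look roughly like this. A userspace sketch only: CONFIG_ZSWAP_HW_COMPRESSOR is the *proposed* Kconfig symbol, not an existing one, and the copy/decompress steps are stand-ins; the return value just makes the copy count visible.

```c
#include <string.h>

/* Uncomment to select the may-sleep acomp path:
 * #define CONFIG_ZSWAP_HW_COMPRESSOR
 */

/* Returns the number of data movements, so the cost is visible. */
static int load_copies(const char *mapped, char *page, size_t len)
{
#ifdef CONFIG_ZSWAP_HW_COMPRESSOR
	/* acomp path: the caller may sleep, so copy out of the atomic
	 * mapping first (as in the patch), then decompress. */
	char tmp[4096];

	memcpy(tmp, mapped, len);   /* bounce copy */
	memcpy(page, tmp, len);     /* decompress stand-in */
	return 2;
#else
	/* legacy crypto_comp path: decompress directly from the
	 * mapping, no extra copy. */
	memcpy(page, mapped, len);  /* decompress stand-in */
	return 1;
#endif
}
```

That is, with the symbol disabled, the legacy path keeps its single data movement, which is how existing zsmalloc users would avoid any regression under this proposal.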