From: Vitaly Wool <vitaly.wool@konsulko.com>
Date: Tue, 22 Dec 2020 02:57:08 +0100
Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
To: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
Cc: Shakeel Butt <shakeelb@google.com>, Minchan Kim <minchan@kernel.org>,
    Mike Galbraith <efault@gmx.de>, LKML <linux-kernel@vger.kernel.org>,
    linux-mm <linux-mm@kvack.org>, Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
    NitinGupta <ngupta@vflare.org>, Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>,
    Andrew Morton <akpm@linux-foundation.org>, "tiantao (H)"

On Tue, 22 Dec 2020, 02:42 Song Bao Hua (Barry Song), <song.bao.hua@hisilicon.com> wrote:
>
> > -----Original Message-----
> > From: Song Bao Hua (Barry Song)
> > Sent: Tuesday, December 22, 2020 2:06 PM
> >
> > > -----Original Message-----
> > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
> > > Sent: Tuesday, December 22, 2020 2:00 PM
> > >
> > > On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song)
> > > <song.bao.hua@hisilicon.com> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Song Bao Hua (Barry Song)
> > > > > Sent: Tuesday, December 22, 2020 11:38 AM
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Vitaly Wool [mailto:vitaly.wool@konsulko.com]
> > > > > > Sent: Tuesday, December 22, 2020 11:12 AM
> > > > > >
> > > > > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
> > > > > > <song.bao.hua@hisilicon.com> wrote:
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com]
> > > > > > > > Sent: Tuesday, December 22, 2020 10:03 AM
> > > > > > > >
> > > > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > > > > > > > <song.bao.hua@hisilicon.com> wrote:
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Shakeel Butt [mailto:shakeelb@google.com]
> > > > > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool
> > > > > > > > > > <vitaly.wool@konsulko.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > > > > > > > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > > > > > > > > about scheduling in atomic context.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To fix that and to improve RT properties of zsmalloc, remove that
> > > > > > > > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > > > > > > > >
> > > > > > > > > > > > I don't want to use such open code for the lock.
> > > > > > > > > > > >
> > > > > > > > > > > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > > > > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug and
> > > > > > > > > > > > introduce this patch with allowing preemption enabling.
> > > > > > > > > > >
> > > > > > > > > > > This understanding is upside down. The code in zswap you are referring
> > > > > > > > > > > to is not buggy. You may claim that it is suboptimal but there is
> > > > > > > > > > > nothing wrong in taking a mutex.
> > > > > > > > > >
> > > > > > > > > > Is this suboptimal for all or just the hardware accelerators? Sorry, I
> > > > > > > > > > am not very familiar with the crypto API. If I select lzo or lz4 as a
> > > > > > > > > > zswap compressor will the [de]compression be async or sync?
> > > > > > > > >
> > > > > > > > > Right now, in crypto subsystem, new drivers are required to write based
> > > > > > > > > on async APIs. The old sync API can't work in new accelerator drivers as
> > > > > > > > > they are not supported at all.
> > > > > > > > >
> > > > > > > > > Old drivers are used to sync, but they've got async wrappers to support
> > > > > > > > > async APIs. Eg.
> > > > > > > > > crypto: acomp - add support for lz4 via scomp
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > > > > > > > >
> > > > > > > > > crypto: acomp - add support for lzo via scomp
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > > > > > > > >
> > > > > > > > > so they are supporting async APIs but they are still working in sync
> > > > > > > > > mode as those old drivers don't sleep.
> > > > > > > >
> > > > > > > > Good to know that those are sync because I want them to be sync.
> > > > > > > > Please note that zswap is a cache in front of a real swap and the load
> > > > > > > > operation is latency sensitive as it comes in the page fault path and
> > > > > > > > directly impacts the applications. I doubt decompressing synchronously
> > > > > > > > a 4k page on a cpu will be costlier than asynchronously decompressing
> > > > > > > > the same page from hardware accelerators.
> > > > > > >
> > > > > > > If you read the old paper:
> > > > > > > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
> > > > > > > Because the hardware accelerator speeds up compression, looking at the
> > > > > > > zswap metrics we observed that there were more store and load requests in
> > > > > > > a given amount of time, which filled up the zswap pool faster than a
> > > > > > > software compression run. Because of this behavior, we set the
> > > > > > > max_pool_percent parameter to 30 for the hardware compression runs - this
> > > > > > > means that zswap can use up to 30% of the 10GB of total memory.
> > > > > > >
> > > > > > > So using hardware accelerators, we get a chance to speed up compression
> > > > > > > while decreasing cpu utilization.
> > > > > > >
> > > > > > > BTW, if it is not easy to change zsmalloc, one quick workaround we might
> > > > > > > do in zswap is adding the below after applying Mike's original patch:
> > > > > > >
> > > > > > > if (in_atomic()) /* for zsmalloc */
> > > > > > >         while (!try_wait_for_completion(&req->done));
> > > > > > > else /* for zbud, z3fold */
> > > > > > >         crypto_wait_req(....);
> > > > > >
> > > > > > I don't think I'm going to ack this, sorry.
> > > > >
> > > > > Fair enough. And I am also thinking if we can move zpool_unmap_handle()
> > > > > right after zpool_map_handle() as below:
> > > > >
> > > > >        dlen = PAGE_SIZE;
> > > > >        src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > > > >        if (zpool_evictable(entry->pool->zpool))
> > > > >                src += sizeof(struct zswap_header);
> > > > > +      zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > > >
> > > > >        acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > > > >        mutex_lock(acomp_ctx->mutex);
> > > > >        sg_init_one(&input, src, entry->length);
> > > > >        sg_init_table(&output, 1);
> > > > >        sg_set_page(&output, page, PAGE_SIZE, 0);
> > > > >        acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > > > >        ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > > > >        mutex_unlock(acomp_ctx->mutex);
> > > > >
> > > > > -      zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > > >
> > > > > Since src is always low memory, we only need its virtual address
> > > > > to get the page of src in sg_init_one(). We don't actually read it
> > > > > by CPU anywhere.
> > > >
> > > > The below code might be better:
> > > >
> > > >         dlen = PAGE_SIZE;
> > > >         src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > > >         if (zpool_evictable(entry->pool->zpool))
> > > >                 src += sizeof(struct zswap_header);
> > > >
> > > >         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > > >
> > > > +       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > >
> > > >         mutex_lock(acomp_ctx->mutex);
> > > >         sg_init_one(&input, src, entry->length);
> > > >         sg_init_table(&output, 1);
> > > >         sg_set_page(&output, page, PAGE_SIZE, 0);
> > > >         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > > >         ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > > >         mutex_unlock(acomp_ctx->mutex);
> > > >
> > > > -       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > >
> > > I don't see how this is going to work since we can't guarantee src
> > > will be a valid pointer after the zpool_unmap_handle() call, can we?
> > > Could you please elaborate?
> >
> > A valid pointer is for cpu to read and write. Here, cpu doesn't read
> > and write it, we only need to get the page struct from the address.
> >
> > void sg_init_one(struct scatterlist *sg, const void *buf, unsigned int buflen)
> > {
> >        sg_init_table(sg, 1);
> >        sg_set_buf(sg, buf, buflen);
> > }
> >
> > static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
> >                               unsigned int buflen)
> > {
> > #ifdef CONFIG_DEBUG_SG
> >        BUG_ON(!virt_addr_valid(buf));
> > #endif
> >        sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
> > }
> >
> > sg_init_one() is always using an address which has a linear mapping
> > with physical address. So once we get the value of src, we can get
> > the page struct.
> >
> > src has a linear mapping with physical address. It doesn't require
> > the page table walk which vmalloc_to_page() wants.
> >
> > The req only requires the page to initialize the sg table; I think if
> > we are going to use a cpu-based (de)compression, the crypto
> > driver will kmap it again.
>
> Probably I made another bug here: for zsmalloc, it is possible to
> get highmem for the zpool since its malloc_support_movable = true.
>
>         if (zpool_malloc_support_movable(entry->pool->zpool))
>                 gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
>         ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle);
>
> For a 64-bit system, there is never highmem. For a 32-bit system, we may
> trigger this bug.
>
> So actually zswap should have used kmap_to_page(), which can support
> both linear mapping and non-linear mapping; sg_init_one() only supports
> linear mapping.
>
> But it doesn't change the fact: once req is initialized with the page
> struct, we can unmap src. If we are going to use a HW accelerator,
> it would be a DMA; if we are going to use CPU decompression, the crypto
> driver will kmap() again.

I'm still not convinced. Will kmap what, src? At this point src might
become just a bogus pointer. Why couldn't the object have been moved
somewhere else (due to the compaction mechanism, for instance) by the
time DMA kicks in?

> > ~Vitaly
>
> Thanks
> Barry
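A minimal sketch of the kmap_to_page()-based scatterlist setup discussed
above, for illustration only: it is hypothetical and untested, the helper
name is invented, and it merely shows how the page behind src could be
obtained for both lowmem (linear-mapped) and kmap'd highmem addresses.
kmap_to_page(), sg_init_table(), sg_set_page() and offset_in_page() are
existing kernel interfaces, but this is not the actual zswap code, and it
does not address the object-migration (compaction) concern raised above.

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/*
 * Hypothetical helper: build a single-entry scatterlist for src.
 * Unlike sg_init_one(), which assumes virt_to_page() is valid for src,
 * kmap_to_page() also resolves kmap'd highmem addresses to the right
 * struct page, falling back to the linear mapping for lowmem addresses.
 */
static void sg_set_src_page(struct scatterlist *input,
			    const void *src, unsigned int len)
{
	sg_init_table(input, 1);
	sg_set_page(input, kmap_to_page((void *)src), len,
		    offset_in_page(src));
}

Whether the underlying zsmalloc object may then be unmapped (and possibly
migrated) before an asynchronous decompression completes is exactly the
question left open in the exchange above.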