From: Konstantin Khlebnikov
Date: Sat, 12 Dec 2020 10:32:37 +0300
Subject: Re: [PATCH RFC 0/8] dcache: increase poison resistance
To: Junxiao Bi
Cc: Konstantin Khlebnikov, Linux Kernel Mailing List, linux-fsdevel,
    linux-mm@kvack.org, Alexander Viro, Waiman Long,
    Gautham Ananthakrishna, matthew.wilcox@oracle.com
In-Reply-To: <97ece625-2799-7ae6-28b5-73c52c7c497b@oracle.com>

On Thu, Dec 10, 2020 at 2:01 AM Junxiao Bi wrote:
> Hi Konstantin,
>
> We tested this patch set recently and found that it limits negative
> dentries to a small part of total memory, and we don't see any
> performance regression from it. Do you have any plans to integrate it
> into mainline? It would help a lot with the memory fragmentation caused
> by the dentry slab. There were a lot of customer cases where sys% was
> very high because most CPUs were doing memory compaction: the dentry
> slab was taking too much memory and nearly all dentries there were
> negative.

Right now I don't have any plans for this. I suspect such problems will
appear much more often since machines are getting bigger.
So, somebody will take care of it.

The first part, which collects negative dentries at the end of the list
of siblings, could be done in a more obvious way by splitting the list
in two. But this touches much more code.

The last patch isn't very rigid but makes non-trivial changes.
It's probably better to call some garbage collector thingy periodically.
The LRU list needs pressure to age and reorder entries properly.

GC could be off by default or have thresholds set very high (50% of RAM,
for example). Final setup could be left up to owners of large systems,
which need fine tuning.
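To illustrate the kind of threshold check mentioned above, here is a minimal userspace sketch. It is not part of the patch set. It assumes the six-field layout of /proc/sys/fs/dentry-state, with nr_negative as the fifth field (which, as far as I know, holds on kernels 5.0 and later). The names parse_dentry_state, read_dentry_state and negative_over_threshold, and the 192-byte dentry size, are assumptions made for the example.

```c
#include <stdio.h>

/* Fields of /proc/sys/fs/dentry-state, in file order. */
struct dentry_state {
    long nr_dentry;
    long nr_unused;
    long age_limit;
    long want_pages;
    long nr_negative;   /* assumed: negative dentry count, kernels >= 5.0 */
    long last;          /* spare field */
};

/* Assumption for this sketch: approximate size of one dentry in bytes. */
#define ASSUMED_DENTRY_SIZE 192ULL

/* Parse one line in dentry-state format; returns 0 on success, -1 otherwise. */
int parse_dentry_state(const char *line, struct dentry_state *st)
{
    int n = sscanf(line, "%ld %ld %ld %ld %ld %ld",
                   &st->nr_dentry, &st->nr_unused, &st->age_limit,
                   &st->want_pages, &st->nr_negative, &st->last);
    return n == 6 ? 0 : -1;
}

/* Read and parse the live sysctl file; returns 0 on success, -1 otherwise. */
int read_dentry_state(struct dentry_state *st)
{
    char buf[256];
    FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

    if (!f)
        return -1;
    char *ok = fgets(buf, sizeof(buf), f);
    fclose(f);
    return ok ? parse_dentry_state(buf, st) : -1;
}

/*
 * Return 1 if negative dentries are estimated to hold more than
 * `percent` percent of `total_ram` bytes, 0 otherwise.
 */
int negative_over_threshold(const struct dentry_state *st,
                            unsigned long long total_ram, unsigned percent)
{
    if (st->nr_negative < 0)
        return 0;
    unsigned long long used =
        (unsigned long long)st->nr_negative * ASSUMED_DENTRY_SIZE;
    return used > total_ram / 100 * percent;
}
```

A monitoring script could call read_dentry_state() periodically and pass the result to negative_over_threshold() with the machine's RAM size and a limit such as 50. This only reports the situation; it does not trim anything.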
>
> The following are the test results from two types of servers: one has
> 256G of memory and 24 CPUs, the other has 3T of memory and 384 CPUs.
> The test case uses a lot of processes to generate negative dentries in
> parallel. These are the results after 72 hours; the negative dentry
> count stays stable around these numbers even when running longer.
> Without the patch set, 197G was taken by negative dentries in less than
> half an hour on the 256G system, and 2.4T was taken in 1 day on the 3T
> system.
>
>          neg-dentry-number    neg-dentry-mem-usage
> 256G     55259084             10.6G
> 3T       202306756            38.8G
>
> For the perf test, we ran the following and found no regression:
>
> - create 1M negative dentries and then touch them to convert them to
>   positive dentries
> - create 10K/100K/1M files
> - remove 10K/100K/1M files
> - kernel compile
>
> To verify the fsnotify fix, we used inotifywait to watch file
> create/open in a directory with a lot of negative dentries. Without
> the patch set, the system runs into a soft lockup; with it, there is no
> soft lockup.
>
> We also tried to defeat the limitation by making different processes
> generate negative dentries with the same naming scheme, so that one
> negative dentry is accessed a couple of times around the same time;
> DCACHE_REFERENCED is then set on it and it can't be trimmed easily.
> We do see negative dentries slowly take all the memory on one of our
> systems with 120G of memory. On the two systems above, memory usage
> increased but was still a small part of total memory. This looks OK,
> since the common negative dentry use case is creating some temp files
> and then removing them; it will be rare to access the same negative
> dentry around the same time.
>
> Thanks,
>
> Junxiao.
>
> On 5/8/20 5:23 AM, Konstantin Khlebnikov wrote:
> > For most filesystems, the result of every negative lookup is cached;
> > the content of directories is usually cached too. Production of
> > negative dentries isn't limited by disk speed.
> > It's really easy to generate millions of them if the system has
> > enough memory.
> >
> > Getting this memory back isn't that easy because slab frees pages only
> > when all related objects are gone, while the dcache shrinker works in
> > LRU order.
> >
> > A typical scenario is an idle system where some process periodically
> > creates temporary files and removes them. After some time, memory will
> > be filled with negative dentries for these random file names.
> >
> > Simple lookup of random names also generates negative dentries very
> > fast. A constant flow of such negative dentries drains all other
> > inactive caches.
> >
> > Negative dentries are linked into the siblings list along with normal
> > positive dentries. Some operations walk the dcache tree but look only
> > for positive dentries: the most important is fsnotify/inotify. Hordes
> > of negative dentries slow down these operations significantly.
> >
> > Time of dentry lookup is usually unaffected because the hash table
> > grows along with the size of memory, unless somebody deliberately
> > crafts hash collisions.
> >
> > This patch set solves all of these problems:
> >
> > Move negative dentries to the end of the siblings list, so walkers can
> > skip them at first sight (patches 3-6).
> >
> > Keep in dcache at most three unreferenced negative dentries in a row
> > in each hash bucket (patches 7-8).
> >
> > ---
> >
> > Konstantin Khlebnikov (8):
> >       dcache: show count of hash buckets in sysctl fs.dentry-state
> >       selftests: add stress testing tool for dcache
> >       dcache: sweep cached negative dentries to the end of list of siblings
> >       fsnotify: stop walking child dentries if remaining tail is negative
> >       dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
> >       dcache: stop walking siblings if remaining dentries all negative
> >       dcache: push releasing dentry lock into sweep_negative
> >       dcache: prevent flooding with negative dentries
> >
> >
> >  fs/dcache.c                                   | 144 +++++++++++-
> >  fs/libfs.c                                    |  10 +-
> >  fs/notify/fsnotify.c                          |   6 +-
> >  include/linux/dcache.h                        |   6 +
> >  tools/testing/selftests/filesystems/Makefile  |   1 +
> >  .../selftests/filesystems/dcache_stress.c     | 210 ++++++++++++++++++
> >  6 files changed, 370 insertions(+), 7 deletions(-)
> >  create mode 100644 tools/testing/selftests/filesystems/dcache_stress.c
> >
> > --
> > Signature
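The cover letter quoted above notes that simple lookups of random names generate negative dentries very fast. A minimal sketch of such a generator follows. It is not the dcache_stress.c selftest from patch 2 (its source is not shown in this thread); the function name lookup_missing_names and the ".neg-dentry-" name pattern are made up for the example. It assumes the filesystem caches negative lookups, which the cover letter says most do.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Look up `count` names that are not expected to exist under `dir`.
 * Each failed lookup may leave one negative dentry in the dcache.
 * Returns the number of lookups that failed with ENOENT.
 */
size_t lookup_missing_names(const char *dir, size_t count)
{
    char path[4096];
    struct stat sb;
    size_t missing = 0;
    long pid = (long)getpid();

    for (size_t i = 0; i < count; i++) {
        /* The pid makes names differ between parallel processes. */
        int len = snprintf(path, sizeof(path), "%s/.neg-dentry-%ld-%zu",
                           dir, pid, i);
        if (len < 0 || (size_t)len >= sizeof(path))
            break;  /* directory name too long for the buffer */
        if (stat(path, &sb) == -1 && errno == ENOENT)
            missing++;
    }
    return missing;
}
```

Running this from many processes in parallel with a large count roughly matches the test case described in the quoted report. Dropping the pid from the name pattern would make processes hit the same names, which is the "same naming scheme" variation mentioned there.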
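As a sanity check on the test results table quoted above, the memory figures can be divided by the dentry counts. If "G" is read as 10^9 bytes (an assumption; the report does not state the unit), both rows come out at roughly 192 bytes per dentry. As far as I know, this matches the usual size of struct dentry on 64-bit builds. The helper names below are made up for the example.

```c
#include <stdio.h>

/* Average bytes per dentry for a memory figure in bytes and a dentry count. */
double bytes_per_dentry(double mem_bytes, unsigned long long count)
{
    return count ? mem_bytes / (double)count : 0.0;
}

/* Print one table row with the derived size, treating G as 10^9 bytes. */
void print_row(const char *label, unsigned long long count, double mem_g)
{
    printf("%-5s %12llu dentries  %6.1fG  %.1f bytes/dentry\n",
           label, count, mem_g, bytes_per_dentry(mem_g * 1e9, count));
}
```

Calling print_row("256G", 55259084ULL, 10.6) and print_row("3T", 202306756ULL, 38.8) prints the two rows with the derived per-dentry size.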