From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marinko Catovic
Date: Tue, 21 Aug 2018 02:36:05 +0200
Subject: Re: Caching/buffers become useless after some time
To: Marinko Catovic
Cc: Christopher Lameter, Vlastimil Babka, linux-mm@kvack.org
Sender: owner-linux-mm@kvack.org
References: <20180730144048.GW24267@dhcp22.suse.cz>
 <1f862d41-1e9f-5324-fb90-b43f598c3955@suse.cz>
 <30f7ec9a-e090-06f1-1851-b18b3214f5e3@suse.cz>
 <20180806120042.GL19540@dhcp22.suse.cz>
 <010001650fe29e66-359ffa28-9290-4e83-a7e2-b6d1d8d2ee1d-000000@email.amazonses.com>
 <20180806181638.GE10003@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0000000000001e44390573e73628"

--0000000000001e44390573e73628
Content-Type: text/plain; charset="UTF-8"

>> The only way how kmemcg limit could help I can think of would be to
>> enforce metadata reclaim much more often. But that is rather a bad
>> workaround.
>
> would that have some significant performance impact?
> I would be willing to try if you think the idea is not thaaat bad.
> If so, could you please explain what to do?
>
>> > > Because a lot of FS metadata is fragmenting the memory and a large
>> > > number of high order allocations which want to be served reclaim a lot
>> > > of memory to achieve their goal. Considering a large part of memory is
>> > > fragmented by unmovable objects there is no other way than to use
>> > > reclaim to release that memory.
>> >
>> > Well it looks like the fragmentation issue gets worse. Is that enough to
>> > consider merging the slab defrag patchset and get some work done on inodes
>> > and dentries to make them movable (or use targeted reclaim)?
>
>> Is there anything to test?
>
> Are you referring to some known issue there, possibly directly related to
> mine? If so, I would be willing to test that patchset, if it makes it into
> the kernel.org sources, or if I'd have to patch that manually.
>
>> Well, there are some drivers (mostly out-of-tree) which are high order
>> hungry. You can try to trace all allocations with order > 0 and
>> see who that might be.
>> # mount -t tracefs none /debug/trace/
>> # echo stacktrace > /debug/trace/trace_options
>> # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
>> # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
>> # cat /debug/trace/trace_pipe
>>
>> And later this to disable tracing:
>> # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable
>
> I just had a major cache-useless situation, with like 100M/8G usage only
> and horrible performance. There you go:
>
> https://nofile.io/f/mmwVedaTFsd
>
> I think mysql occurs mostly; regardless of the binary name this is actually
> mariadb in version 10.1.
>
>> You do not have to drop all caches. echo 2 > /proc/sys/vm/drop_caches
>> should be sufficient to drop metadata only.
>
> that is exactly what I am doing; I already mentioned that echo 1 does not
> make any difference at all, echo 2 is the only way that helps.
> just 5 minutes after doing that the usage grew to 2GB/10GB and is steadily
> going up, as usual.
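As a side note, trace_pipe dumps like the one linked above get huge quickly;
counting the `order=` fields is a quick way to see which allocation orders
dominate before digging into individual stack traces. A minimal sketch, where
the sample trace lines and the /tmp paths are made up for illustration, not
taken from the actual dump:

```shell
# Tally mm_page_alloc events per order from a saved trace.
# The sample lines below are fabricated so the sketch is self-contained;
# on a live box you would save trace_pipe output to a file first.
cat > /tmp/trace_sample.txt <<'EOF'
mysqld-1234  [000] .... 100.1: mm_page_alloc: page=... pfn=10 order=3 migratetype=0 gfp_flags=GFP_KERNEL
kswapd0-55   [001] .... 100.2: mm_page_alloc: page=... pfn=11 order=3 migratetype=0 gfp_flags=GFP_KERNEL
mysqld-1234  [000] .... 100.3: mm_page_alloc: page=... pfn=12 order=1 migratetype=0 gfp_flags=GFP_KERNEL
EOF

# Count occurrences of each order, most frequent first
# (prints "2 order=3" then "1 order=1", with counts padded by uniq -c)
grep -o 'order=[0-9]*' /tmp/trace_sample.txt | sort | uniq -c | sort -rn
```

The same pipeline with `grep -o 'comm=[^ ]*'` (or the leading task name
column) shows which processes are behind the high-order requests.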
Is there anything you can read from these results?

The issue keeps occurring; the latest one was even totally unexpected in the
morning hours, causing downtime the entire morning until noon, when I could
check and drop the caches again.

I also changed mariadb from O_DIRECT to `fsync`, the new default in their
latest release, hoping that this would help, but it did not.

Before giving up entirely I'd like to know whether there is any solution for
this, where again I cannot believe that I am the only one affected. This
*has* to affect anyone with a similar use case; I do not see what is so
special about mine. This is simply many users with many files; every larger
shared hosting provider should experience the exact same behaviour with the
4.x kernel branch.
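For completeness, the fragmentation being discussed is directly visible in
/proc/buddyinfo (free-block counts per order, per zone). A small sketch that
flags zones with nothing free at order >= 4; the sample lines, file path, and
the order-4 threshold are illustrative assumptions, not values from this
thread:

```shell
# Column layout of /proc/buddyinfo: "Node N, zone NAME" followed by
# free-block counts for orders 0..10 (fields 5..15 for awk).
# Using a fabricated sample so the sketch is self-contained.
cat > /tmp/buddyinfo_sample.txt <<'EOF'
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone   Normal   4096   2048    512     64      0      0      0      0      0      0      0
EOF

# Sum the counts for orders >= 4 (fields 9..15) and flag empty zones
# (prints "DMA ok" and "Normal FRAGMENTED" for the sample above)
awk '{
    free_high = 0
    for (i = 9; i <= 15; i++) free_high += $i
    print $4, (free_high == 0) ? "FRAGMENTED" : "ok"
}' /tmp/buddyinfo_sample.txt
```

On a live system you would feed /proc/buddyinfo itself to the awk script
instead of the sample file, e.g. before and after `echo 2 > /proc/sys/vm/drop_caches`.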
--0000000000001e44390573e73628--