> > The only way how kmemcg limit could help I can think of would be to
> > enforce metadata reclaim much more often. But that is rather a bad
> > workaround.
>
> Would that have a significant performance impact? I would be willing
> to try it if you think the idea is not thaaat bad. If so, could you
> please explain what to do?
>
> > > > Because a lot of FS metadata is fragmenting the memory, and a
> > > > large number of high-order allocations which want to be served
> > > > reclaim a lot of memory to achieve their goal. Considering a
> > > > large part of memory is fragmented by unmovable objects, there
> > > > is no other way than to use reclaim to release that memory.
> > >
> > > Well, it looks like the fragmentation issue gets worse. Is that
> > > enough to consider merging the slab defrag patchset and get some
> > > work done on inodes and dentries to make them movable (or use
> > > targeted reclaim)?
> >
> > Is there anything to test?
>
> Are you referring to some known issue there, possibly directly
> related to mine? If so, I would be willing to test that patchset,
> whether it makes it into the kernel.org sources or I'd have to patch
> it in manually.
>
> > Well, there are some drivers (mostly out-of-tree) which are
> > high-order hungry. You can try to trace all allocations with
> > order > 0 and see who that might be.
> > # mount -t tracefs none /debug/trace/
> > # echo stacktrace > /debug/trace/trace_options
> > # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
> > # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
> > # cat /debug/trace/trace_pipe
> >
> > And later this to disable tracing.
> > # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable
>
> I just had a major cache-useless situation, with only about 100M/8G
> usage and horrible performance. There you go:
>
> https://nofile.io/f/mmwVedaTFsd
>
> I think mysql occurs most often; regardless of the binary name, this
> is actually mariadb in version 10.1.
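
In case it helps with the analysis: the capture can be boiled down by
counting the requested orders and ranking the functions that show up in
the stack traces. Something like the following should do it (untested
sketch; it assumes the tracefs mount point from above, and that the
default stacktrace output prints each frame on a line starting with
"=>"; /tmp/order.trace is just an example path):

# timeout 60 cat /debug/trace/trace_pipe > /tmp/order.trace
# grep -o 'order=[0-9]*' /tmp/order.trace | sort | uniq -c | sort -rn
# awk '$1 == "=>" { print $2 }' /tmp/order.trace | sort | uniq -c | sort -rn | head -20

The first command captures a minute of order > 0 allocation events, the
second counts how many events fall into each order, and the third lists
the twenty most frequent stack frames, which should point at the
allocation-heavy code paths.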
> > You do not have to drop all caches. echo 2 > /proc/sys/vm/drop_caches
> > should be sufficient to drop metadata only.
>
> That is exactly what I am doing; I already mentioned that echo 1 does
> not make any difference at all, and echo 2 is the only way that helps.
> Just 5 minutes after doing that, the usage grew to 2GB/10GB and is
> steadily going up, as usual.

Is there anything you can read from these results? The issue keeps
occurring; the latest one was totally unexpected, in the morning hours,
and caused downtime the entire morning until noon, when I could check
and drop the caches again.

I also reset O_DIRECT from mariadb to `fsync`, the new default in their
latest release, hoping that this would help, but it did not.

Before giving up entirely, I'd like to know whether there is any
solution for this. Again, I cannot believe that I am the only one
affected; this *has* to affect anyone with a similar use case, and I do
not see what is so special about mine. It is simply many users with
many files, so every larger shared hosting provider should see exactly
the same behaviour with the 4.x kernel branch.
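
P.S. Until there is a real fix, I suppose the "enforce metadata reclaim
much more often" workaround mentioned above could at least be automated,
instead of me dropping caches by hand after every incident. A crude
sketch (the half-hour interval and the paths are made-up examples; only
the echo 2 > /proc/sys/vm/drop_caches part comes from this thread):

#!/bin/sh
# Workaround only: drop reclaimable metadata (dentries and inodes),
# leaving the page cache alone. sync first so that as much of the
# metadata as possible is clean and therefore droppable.
sync
echo 2 > /proc/sys/vm/drop_caches

run from cron, e.g. in /etc/cron.d/drop-metadata:

*/30 * * * * root /usr/local/sbin/drop-metadata.sh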