* Caching/buffers become useless after some time
@ 2018-07-11 13:18 Marinko Catovic
  2018-07-12 11:34 ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-07-11 13:18 UTC (permalink / raw)
  To: linux-mm


hello guys


I tried asking in a few IRC channels; people told me to ask here, so I'll
give it a try.


I have a very weird issue with mm on several hosts.
The systems are used for shared hosting, so there are lots of users with
lots of files - maybe 5TB of files per host, several million files at
least. There is lots of I/O, which is normally handled perfectly fine by
the buffers/cache.

The kernel version is the latest stable, 4.17.4. I had 3.x before and did
not notice any issues until now; the same goes for 4.16, which was in use
before.

The hosts each have 64G of RAM and operate with SSD+HDD.
The HDDs are the issue here: those 5TB of data are stored on them, so that
is where the high I/O goes.
Running applications need about 15GB, so say 40GB of RAM are left for
buffers/caching.

Usually this works perfectly fine. The buffers take about 1-3G of RAM and
the cache the rest, say 35GB as an example.
But every now and then, maybe every 2 days, both drop to really low
values, say 100MB buffers and 3GB cache, and the rest of the RAM is not in
use, so there are about 35GB+ of totally free RAM.

The performance of the host then goes down significantly, to the point of
becoming unusable, since it behaves as if the buffers/cache were totally
useless.
After lots and lots of playing around I noticed that shutting down all
services that access the HDDs on the system and restarting them does *not*
make any difference.

But what did make a difference was stopping and umounting the fs, mounting
it again and starting the services.
Then the buffers+cache built up to 5GB/35GB as usual after a while and
everything was perfectly fine again!

I noticed that when umount is called, the caches are dropped. So I gave
it a try:

sync; echo 2 > /proc/sys/vm/drop_caches

has exactly the same effect. Note that echo 1 > .. does not.
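
(For reference, Documentation/sysctl/vm.txt describes the values:

echo 1 > /proc/sys/vm/drop_caches   # frees the page cache only
echo 2 > /proc/sys/vm/drop_caches   # frees reclaimable slab objects (dentries, inodes)
echo 3 > /proc/sys/vm/drop_caches   # frees both

so in my case it is apparently the slab objects that matter.)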

So when that low usage like 100MB/3GB occurs, I have to drop the caches
by echoing 2 to drop_caches. The 3GB then become even lower, which is
expected, but at least the buffers/cache build up again to ordinary values
and the usual performance is restored after a few minutes.
I have never seen this before; it started after I switched the systems to
newer ones - the old ones ran kernel 3.x and never showed this behavior.

Do you have *any idea* at all what could be causing this? The issue has
been bugging me for over a month and seriously disturbs everything I'm
doing, since lots of people access that data and all of them start to
complain at the point where the caches become useless, forcing me to drop
them so they rebuild again.

Some guys in IRC suggested that this could be a fragmentation problem or
something, or about slab shrinking.

The problem is that I can not reproduce this; I have to wait a while,
maybe 2 days, to observe it: the buffers/caches are fully in use, and at
some point they decrease within a few hours to those useless values.
Sadly this is a production system and I can not play around that much, and
dropping the caches already causes downtime (repopulating them takes maybe
5-10 minutes until the performance is ok again).

Please tell me whatever info you need me to pastebin and when (before/after
what event).
Any hints are appreciated a lot; this really gives me a lot of headaches,
since I am really busy with other things. Thank you very much!


Marinko


* Re: Caching/buffers become useless after some time
  2018-07-11 13:18 Caching/buffers become useless after some time Marinko Catovic
@ 2018-07-12 11:34 ` Michal Hocko
  2018-07-13 15:48   ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-07-12 11:34 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: linux-mm

On Wed 11-07-18 15:18:30, Marinko Catovic wrote:
> hello guys
> 
> 
> I tried in a few IRC, people told me to ask here, so I'll give it a try.
> 
> 
> I have a very weird issue with mm on several hosts.
> The systems are for shared hosting, so lots of users there with lots of
> files.
> Maybe 5TB of files per host, several million at least, there is lots of I/O
> which can be handled perfectly fine with buffers/cache
> 
> The kernel version is the latest stable, 4.17.4, I had 3.x before and did
> not notice any issues until now. the same is for 4.16 which was in use
> before:
> 
> The hosts altogether have 64G of RAM and operate with SSD+HDD.
> HDDs are the issue here, since those 5TB of data are stored there, there
> goes the high I/O.
> Running applications need about 15GB, so say 40GB of RAM are left for
> buffers/caching.
> 
> Usually this works perfectly fine. The buffers take about 1-3G of RAM, the
> cache the rest, say 35GB as an example.
> But every now and then, maybe every 2 days it happens that both drop to
> really low values, say 100MB buffers, 3GB caches and the rest of the RAM is
> not in use, so there are about 35GB+ of totally free RAM.
> 
> The performance of the host goes down significantly then, as it becomes
> unusable at some point, since it behaves as if the buffers/cache were
> totally useless.
> After lots and lots of playing around I noticed that when shutting down all
> services that access the HDDs on the system and restarting them, that this
> does *not* make any difference.
> 
> But what did make a difference was stopping and umounting the fs, mounting
> it again and starting the services.
> Then the buffers+cache built up to 5GB/35GB as usual after a while and
> everything was perfectly fine again!
> 
> I noticed that what happens when umount is called, that the caches are
> being dropped. So I gave it a try:
> 
> sync; echo 2 > /proc/sys/vm/drop_caches
> 
> has the exactly same effect. Note that echo 1 > .. does not.
> 
> So if that low usage like 100MB/3GB occurs I'd have to drop the caches by
> echoing 2 to drop_caches. The 3GB then become even lower, which is
> expected, but then at least the buffers/cache built up again to ordinary
> values and the usual performance is restored after a few minutes.
> I have never seen this before, this happened after I switched the systems
> to newer ones, where the old ones had kernel 3.x, this behavior was never
> observed before.
> 
> Do you have *any idea* at all what could be causing this? that issue is
> bugging me since over a month and seriously really disturbs everything I'm
> doing since lot of people access that data and all of them start to
> complain at some point where I see that the caches became useless at that
> time, having to drop them to rebuild again.
> 
> Some guys in IRC suggested that his could be a fragmentation problem or
> something, or about slab shrinking.

Well, the page cache shouldn't really care about fragmentation because
single pages are used. Btw. what is the filesystem that you are using?

> The problem is that I can not reproduce this, I have to wait a while, maybe
> 2 days to observe that, until that the buffers/caches are fully in use and
> at some point they decrease within a few hours to those useless values.
> Sadly this is a production system and I can not play that much around,
> already causing downtime when dropping caches (populating caches needs
> maybe 5-10 minutes until the performance is ok again).

This doesn't really ring bells for me.

> Please tell me whatever info you need me to pastebin and when (before/after
> what event).
> Any hints are appreciated a lot, it really gives me lots of headache, since
> I am really busy with other things. Thank you very much!

Could you collect /proc/vmstat every few seconds over that time period?
Maybe it will tell us more.
-- 
Michal Hocko
SUSE Labs


* Re: Caching/buffers become useless after some time
  2018-07-12 11:34 ` Michal Hocko
@ 2018-07-13 15:48   ` Marinko Catovic
  2018-07-16 15:53     ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-07-13 15:48 UTC (permalink / raw)
  To: linux-mm


hello Michal


Well, these hints were just ideas mentioned by some people. It took me
weeks just to figure out that 2>drop_caches helps, and I still do not know
why this happens.

Right now I am observing ~18GB of unused RAM, since yesterday, so this is
not always
about 100MB/3.5GB, but right now it may be in the process of shrinking.
I really can not tell for sure, this is so nondeterministic - I just wish I
could reproduce it for better testing.

Right now top shows:
KiB Mem : 65892044 total, 18169232 free, 11879604 used, 35843208 buff/cache
Of that, 1GB goes to buffers and the rest to cache. The host *is* busy,
and buff/cache had consumed all the RAM yesterday; I did 2>drop_caches
about one day before that.

Another host (still) shows full usage. That other one is 1:1 the same by
software and config,
but with different data/users; the use-cases and load are pretty much
similar.

Affected host at this time:
https://pastebin.com/fRQMPuwb
https://pastebin.com/tagXJRi1  .. 3 minutes later
https://pastebin.com/8YNFfKXf  .. 3 minutes later
https://pastebin.com/UEq7NKR4 .. 3 minutes later

To compare - this is the other host, that is still showing full
buffers/cache usage by now:
https://pastebin.com/Jraux2gy

Usually both show this more or less at the same time, sometimes it is the
one, sometimes
the other. Other hosts I have are currently not under similar high load,
making it even harder
to compare.

However, right now I can not observe this dropping towards really low
values, but I am sure it will come.

fs is ext4, mount options are
auto,rw,data=writeback,noatime,nodiratime,nodev,nosuid,async
previous mount options with the same behavior also had max_dir_size_kb,
quotas and the default for data=, so I also played around with these, but
that made no difference.

---------

Follow-up (sorry, I messed up the reply-to for this mailing list):


https://pastebin.com/0v4ZFNCv .. one hour later, right after my last
report, 22GB free
https://pastebin.com/rReWnHtE .. one day later, 28GB free

It is interesting to see, however, that this did not get as low as
mentioned before. So I am not sure where this is going right now, but
nevertheless the RAM is not fully occupied; there should be no reason to
leave 28GB free at all.

Still lots of I/O, and I am 100% positive that if I did echo 2 >
drop_caches, the buffers/cache would fill up the entire RAM again.


What I can see is that buffers are around 500-700MB; the values increase
and decrease all the time, really "oscillating" around 600. Afaik this
should get as high as possible as long as there is free RAM - the other
host that is still healthy has about 2GB/48GB, fully occupying RAM.

Currently I have set vm.dirty_ratio = 15, vm.dirty_background_ratio = 3
and vm.vfs_cache_pressure = 1, and the low usage occurred 3 days ago.
Other values - the defaults, or when I was playing around with
vm.dirty_ratio = 90, vm.dirty_background_ratio = 80 and whatever
cache_pressure - showed similar results.
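
(For completeness, as sysctl.conf-style entries - whether they are set via
/etc/sysctl.conf or sysctl -w makes no difference here:

vm.dirty_ratio = 15
vm.dirty_background_ratio = 3
vm.vfs_cache_pressure = 1
)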


2018-07-12 13:34 GMT+02:00 Michal Hocko <mhocko@kernel.org>:

> On Wed 11-07-18 15:18:30, Marinko Catovic wrote:
> > hello guys
> >
> >
> > I tried in a few IRC, people told me to ask here, so I'll give it a try.
> >
> >
> > I have a very weird issue with mm on several hosts.
> > The systems are for shared hosting, so lots of users there with lots of
> > files.
> > Maybe 5TB of files per host, several million at least, there is lots of
> I/O
> > which can be handled perfectly fine with buffers/cache
> >
> > The kernel version is the latest stable, 4.17.4, I had 3.x before and did
> > not notice any issues until now. the same is for 4.16 which was in use
> > before:
> >
> > The hosts altogether have 64G of RAM and operate with SSD+HDD.
> > HDDs are the issue here, since those 5TB of data are stored there, there
> > goes the high I/O.
> > Running applications need about 15GB, so say 40GB of RAM are left for
> > buffers/caching.
> >
> > Usually this works perfectly fine. The buffers take about 1-3G of RAM,
> the
> > cache the rest, say 35GB as an example.
> > But every now and then, maybe every 2 days it happens that both drop to
> > really low values, say 100MB buffers, 3GB caches and the rest of the RAM
> is
> > not in use, so there are about 35GB+ of totally free RAM.
> >
> > The performance of the host goes down significantly then, as it becomes
> > unusable at some point, since it behaves as if the buffers/cache were
> > totally useless.
> > After lots and lots of playing around I noticed that when shutting down
> all
> > services that access the HDDs on the system and restarting them, that
> this
> > does *not* make any difference.
> >
> > But what did make a difference was stopping and umounting the fs,
> mounting
> > it again and starting the services.
> > Then the buffers+cache built up to 5GB/35GB as usual after a while and
> > everything was perfectly fine again!
> >
> > I noticed that what happens when umount is called, that the caches are
> > being dropped. So I gave it a try:
> >
> > sync; echo 2 > /proc/sys/vm/drop_caches
> >
> > has the exactly same effect. Note that echo 1 > .. does not.
> >
> > So if that low usage like 100MB/3GB occurs I'd have to drop the caches by
> > echoing 2 to drop_caches. The 3GB then become even lower, which is
> > expected, but then at least the buffers/cache built up again to ordinary
> > values and the usual performance is restored after a few minutes.
> > I have never seen this before, this happened after I switched the systems
> > to newer ones, where the old ones had kernel 3.x, this behavior was never
> > observed before.
> >
> > Do you have *any idea* at all what could be causing this? that issue is
> > bugging me since over a month and seriously really disturbs everything
> I'm
> > doing since lot of people access that data and all of them start to
> > complain at some point where I see that the caches became useless at that
> > time, having to drop them to rebuild again.
> >
> > Some guys in IRC suggested that his could be a fragmentation problem or
> > something, or about slab shrinking.
>
> Well, the page cache shouldn't really care about fragmentation because
> single pages are used. Btw. what is the filesystem that you are using?
>
> > The problem is that I can not reproduce this, I have to wait a while,
> maybe
> > 2 days to observe that, until that the buffers/caches are fully in use
> and
> > at some point they decrease within a few hours to those useless values.
> > Sadly this is a production system and I can not play that much around,
> > already causing downtime when dropping caches (populating caches needs
> > maybe 5-10 minutes until the performance is ok again).
>
> This doesn't really ring bells for me.
>
> > Please tell me whatever info you need me to pastebin and when
> (before/after
> > what event).
> > Any hints are appreciated a lot, it really gives me lots of headache,
> since
> > I am really busy with other things. Thank you very much!
>
> Could you collect /proc/vmstat every few seconds over that time period?
> Maybe it will tell us more.
> --
> Michal Hocko
> SUSE Labs
>


* Re: Caching/buffers become useless after some time
  2018-07-13 15:48   ` Marinko Catovic
@ 2018-07-16 15:53     ` Marinko Catovic
  2018-07-16 16:23       ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-07-16 15:53 UTC (permalink / raw)
  To: linux-mm


I can provide further data now, monitoring vmstat:

https://pastebin.com/j0dMGBe4 .. 1 day later, 600MB/13GB in use, 35GB free
https://pastebin.com/N011kYyd .. 1 day later, 300MB/10GB in use, 40GB free,
performance becomes even worse

The issue is coming up again; I would have to drop caches by now to
restore normal usage for another day or two.

Afaik there should be no reason at all not to have the buffers/cache fill
up the entire memory, isn't that true?
To my knowledge there is almost no O_DIRECT involved. Also, as mentioned
before: after dropping caches the buffers/cache usage eats up all RAM
within the hour and stays that way, as usual, for 1-2 days until it starts
to go crazy again.

As mentioned, the usage oscillates up and down instead of up until all RAM
is consumed.

Please tell me if there is anything else I can do to help investigate this.


2018-07-13 17:48 GMT+02:00 Marinko Catovic <marinko.catovic@gmail.com>:

> hello Michal
>
>
> well these hints were just ideas mentioned by some people, it took me
> weeks just to figure
> out that 2>drop_caches helps, still not knowing why this happens.
>
> Right now I am observing ~18GB of unused RAM, since yesterday, so this is
> not always
> about 100MB/3.5GB, but right now it may be in the process of shrinking.
> I really can not tell for sure, this is so nondeterministic - I just wish
> I could reproduce it for better testing.
>
> Right now top shows:
> KiB Mem : 65892044 total, 18169232 free, 11879604 used, 35843208 buff/cache
> Where 1GB goes to buffers, the rest to cache, the host *is* busy and the
> buff/cache consumed
> all RAM yesterday, where I did 2>drop_caches about one day before.
>
> Another host (still) shows full usage. That other one is 1:1 the same by
> software and config,
> but with different data/users; the use-cases and load are pretty much
> similar.
>
> Affected host at this time:
> https://pastebin.com/fRQMPuwb
> https://pastebin.com/tagXJRi1  .. 3 minutes later
> https://pastebin.com/8YNFfKXf  .. 3 minutes later
> https://pastebin.com/UEq7NKR4 .. 3 minutes later
>
> To compare - this is the other host, that is still showing full
> buffers/cache usage by now:
> https://pastebin.com/Jraux2gy
>
> Usually both show this more or less at the same time, sometimes it is the
> one, sometimes
> the other. Other hosts I have are currently not under similar high load,
> making it even harder
> to compare.
>
> However, right now I can not observe this dropping towards really low
> values, but I am sure it will come.
>
> fs is ext4, mount options are auto,rw,data=writeback,noatime
> ,nodiratime,nodev,nosuid,async
> previous mount options with same behavior also had max_dir_size_kb, quotas
> and defaults for data=
> so I also played around with these, but that made no difference.
>
> ---------
>
> follow up (sorry, messed up with reply-to this mailing list):
>
>
> https://pastebin.com/0v4ZFNCv .. one hour later, right after my last
> report, 22GB free
> https://pastebin.com/rReWnHtE .. one day later, 28GB free
>
> It is interesting to see however, that this did not get that low as
> mentioned before.
> So not sure where this is going right now, but nevertheless, the RAM is
> not occupied fully,
> there should be no reason to allow 28GB to be free at all.
>
> Still lots I/O, and I am 100% positive that if I'd echo 2 > drop_caches,
> this would fill up the
> entire RAM again.
>
>
> What I can see is that buffers are around 500-700MB, the values increase
> and decrease
> all the time, really "oscillating" around 600. afaik this should get as
> high as possible, as long
> there is free ram - the other host that is still healthy has about
> 2GB/48GB fully occupying RAM.
>
> Currently I have set vm.dirty_ratio = 15, vm.dirty_background_ratio = 3,
> vm.vfs_cache_pressure = 1
> and the low usage occurred 3 days before, other values like the defaults
> or when I was playing
> around with vm.dirty_ratio = 90, vm.dirty_background_ratio = 80 and
> whatever cache_pressure
> showed similar results.
>
>
> 2018-07-12 13:34 GMT+02:00 Michal Hocko <mhocko@kernel.org>:
>
>> On Wed 11-07-18 15:18:30, Marinko Catovic wrote:
>> > hello guys
>> >
>> >
>> > I tried in a few IRC, people told me to ask here, so I'll give it a try.
>> >
>> >
>> > I have a very weird issue with mm on several hosts.
>> > The systems are for shared hosting, so lots of users there with lots of
>> > files.
>> > Maybe 5TB of files per host, several million at least, there is lots of
>> I/O
>> > which can be handled perfectly fine with buffers/cache
>> >
>> > The kernel version is the latest stable, 4.17.4, I had 3.x before and
>> did
>> > not notice any issues until now. the same is for 4.16 which was in use
>> > before:
>> >
>> > The hosts altogether have 64G of RAM and operate with SSD+HDD.
>> > HDDs are the issue here, since those 5TB of data are stored there, there
>> > goes the high I/O.
>> > Running applications need about 15GB, so say 40GB of RAM are left for
>> > buffers/caching.
>> >
>> > Usually this works perfectly fine. The buffers take about 1-3G of RAM,
>> the
>> > cache the rest, say 35GB as an example.
>> > But every now and then, maybe every 2 days it happens that both drop to
>> > really low values, say 100MB buffers, 3GB caches and the rest of the
>> RAM is
>> > not in use, so there are about 35GB+ of totally free RAM.
>> >
>> > The performance of the host goes down significantly then, as it becomes
>> > unusable at some point, since it behaves as if the buffers/cache were
>> > totally useless.
>> > After lots and lots of playing around I noticed that when shutting down
>> all
>> > services that access the HDDs on the system and restarting them, that
>> this
>> > does *not* make any difference.
>> >
>> > But what did make a difference was stopping and umounting the fs,
>> mounting
>> > it again and starting the services.
>> > Then the buffers+cache built up to 5GB/35GB as usual after a while and
>> > everything was perfectly fine again!
>> >
>> > I noticed that what happens when umount is called, that the caches are
>> > being dropped. So I gave it a try:
>> >
>> > sync; echo 2 > /proc/sys/vm/drop_caches
>> >
>> > has the exactly same effect. Note that echo 1 > .. does not.
>> >
>> > So if that low usage like 100MB/3GB occurs I'd have to drop the caches
>> by
>> > echoing 2 to drop_caches. The 3GB then become even lower, which is
>> > expected, but then at least the buffers/cache built up again to ordinary
>> > values and the usual performance is restored after a few minutes.
>> > I have never seen this before, this happened after I switched the
>> systems
>> > to newer ones, where the old ones had kernel 3.x, this behavior was
>> never
>> > observed before.
>> >
>> > Do you have *any idea* at all what could be causing this? that issue is
>> > bugging me since over a month and seriously really disturbs everything
>> I'm
>> > doing since lot of people access that data and all of them start to
>> > complain at some point where I see that the caches became useless at
>> that
>> > time, having to drop them to rebuild again.
>> >
>> > Some guys in IRC suggested that his could be a fragmentation problem or
>> > something, or about slab shrinking.
>>
>> Well, the page cache shouldn't really care about fragmentation because
>> single pages are used. Btw. what is the filesystem that you are using?
>>
>> > The problem is that I can not reproduce this, I have to wait a while,
>> maybe
>> > 2 days to observe that, until that the buffers/caches are fully in use
>> and
>> > at some point they decrease within a few hours to those useless values.
>> > Sadly this is a production system and I can not play that much around,
>> > already causing downtime when dropping caches (populating caches needs
>> > maybe 5-10 minutes until the performance is ok again).
>>
>> This doesn't really ring bells for me.
>>
>> > Please tell me whatever info you need me to pastebin and when
>> (before/after
>> > what event).
>> > Any hints are appreciated a lot, it really gives me lots of headache,
>> since
>> > I am really busy with other things. Thank you very much!
>>
>> Could you collect /proc/vmstat every few seconds over that time period?
>> Maybe it will tell us more.
>> --
>> Michal Hocko
>> SUSE Labs
>>
>
>


* Re: Caching/buffers become useless after some time
  2018-07-16 15:53     ` Marinko Catovic
@ 2018-07-16 16:23       ` Michal Hocko
  2018-07-16 16:33         ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-07-16 16:23 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: linux-mm

On Mon 16-07-18 17:53:42, Marinko Catovic wrote:
> I can provide further data now, monitoring vmstat:
> 
> https://pastebin.com/j0dMGBe4 .. 1 day later, 600MB/13GB in use, 35GB free
> https://pastebin.com/N011kYyd .. 1 day later, 300MB/10GB in use, 40GB free,
> performance becomes even worse
> 
> the issue raises up again, I would have to drop caches by now to restore
> normal usage for another day or two.
> 
> Afaik there should be no reason at all to not have the buffers/cache fill
> up the entire memory, isn't that true?
> There is to my knowledge almost no O_DIRECT involved, also as mentioned
> before: when dropping caches
> the buffers/cache usage would eat up all RAM within the hour as usual for
> 1-2 days until it starts to go crazy again.
> 
> As mentioned, the usage oscillates up and down instead of up until all RAM
> is consumed.
> 
> Please tell me if there is anything else I can do to help investigate this.

Do you have periodic /proc/vmstat snapshots I have asked before?
-- 
Michal Hocko
SUSE Labs


* Re: Caching/buffers become useless after some time
  2018-07-16 16:23       ` Michal Hocko
@ 2018-07-16 16:33         ` Marinko Catovic
  2018-07-16 16:45           ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-07-16 16:33 UTC (permalink / raw)
  To: linux-mm


How frequently do you want them? I assumed the snapshots every few hours
or days would be sufficient.
Any particular command, with or without grep perhaps?

I just had to drop caches, right before your response; the performance
was simply too bad.

For your information, this is how it looked right after dropping and
0+5+25 minutes later:

https://pastebin.com/LcjKgQkg .. this is what it looks like just after
sync; echo 2 > /proc/sys/vm/drop_caches
https://pastebin.com/ZCeFCKrb .. 5 minutes later, when performance is
starting to get better again
https://pastebin.com/8hij8Lid .. 20 minutes after that, you can expect this
to consume all the available ram within 1-2 hours


2018-07-16 18:23 GMT+02:00 Michal Hocko <mhocko@kernel.org>:

> On Mon 16-07-18 17:53:42, Marinko Catovic wrote:
> > I can provide further data now, monitoring vmstat:
> >
> > https://pastebin.com/j0dMGBe4 .. 1 day later, 600MB/13GB in use, 35GB
> free
> > https://pastebin.com/N011kYyd .. 1 day later, 300MB/10GB in use, 40GB
> free,
> > performance becomes even worse
> >
> > the issue raises up again, I would have to drop caches by now to restore
> > normal usage for another day or two.
> >
> > Afaik there should be no reason at all to not have the buffers/cache fill
> > up the entire memory, isn't that true?
> > There is to my knowledge almost no O_DIRECT involved, also as mentioned
> > before: when dropping caches
> > the buffers/cache usage would eat up all RAM within the hour as usual for
> > 1-2 days until it starts to go crazy again.
> >
> > As mentioned, the usage oscillates up and down instead of up until all
> RAM
> > is consumed.
> >
> > Please tell me if there is anything else I can do to help investigate
> this.
>
> Do you have periodic /proc/vmstat snapshots I have asked before?
> --
> Michal Hocko
> SUSE Labs
>


* Re: Caching/buffers become useless after some time
  2018-07-16 16:33         ` Marinko Catovic
@ 2018-07-16 16:45           ` Michal Hocko
  2018-07-20 22:03             ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-07-16 16:45 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: linux-mm

On Mon 16-07-18 18:33:57, Marinko Catovic wrote:
> how periodically do you want them? I assumed this some-hours and days
> snapshots would be sufficient.

Every 10s should be reasonable even for long-term monitoring.

> any particular command with or without grep perhaps?

while true
do
	cp /proc/vmstat vmstat.$(date +%s)
	sleep 10s
done
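
If the session drops in between, something like this (untested, just a
sketch) keeps it running in the background:

nohup sh -c 'while true; do cp /proc/vmstat vmstat.$(date +%s); sleep 10s; done' >/dev/null 2>&1 &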
-- 
Michal Hocko
SUSE Labs


* Re: Caching/buffers become useless after some time
  2018-07-16 16:45           ` Michal Hocko
@ 2018-07-20 22:03             ` Marinko Catovic
  2018-07-27 11:15               ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-07-20 22:03 UTC (permalink / raw)
  To: linux-mm


I let this run for 3 days now, so it is quite a lot, there you go:
https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz

There is one thing I forgot to mention: the hosts perform find and du (I
mean the commands, finding files and disk usage) on the HDDs every night,
from 00:20 AM until 07:45 AM, for maintenance and stats.

During this period the buffers/caches rise again, as you may see from the
logs, so find/du do fill them.
Nevertheless, as the day passes both decrease again until low values are
reached.
I disabled find/du for the night of 19->20th July to compare.

I have to say that this really low usage (300MB/xGB) occurred just once
after I upgraded from 4.16 to 4.17, not sure why; one can still see from
the logs that the buffers/cache are not using up the entire available RAM.

This low usage last occurred on that one host when I mentioned in my
previous message that I had to 2>drop_caches again, so this is still an
issue even on the latest kernel.

The other host (the one not covered by the vmstat logs) currently has
600MB/14GB, with 34GB of free RAM.
Both were reset with drop_caches at the same time. From the looks of it
the really low usage will occur again fairly soon, it just did not come up
during the measurement. However, the RAM should be full anyway, true?





2018-07-16 18:45 GMT+02:00 Michal Hocko <mhocko@kernel.org>:

> On Mon 16-07-18 18:33:57, Marinko Catovic wrote:
> > how periodically do you want them? I assumed this some-hours and days
> > snapshots would be sufficient.
>
> Every 10s should be reasonable even for a long term monitoring.
>
> > any particular command with or without grep perhaps?
>
> while true
> do
>         cp /proc/vmstat vmstat.$(date +%s)
>         sleep 10s
> done
> --
> Michal Hocko
> SUSE Labs
>


* Re: Caching/buffers become useless after some time
  2018-07-20 22:03             ` Marinko Catovic
@ 2018-07-27 11:15               ` Vlastimil Babka
  2018-07-30 14:40                 ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-07-27 11:15 UTC (permalink / raw)
  To: Marinko Catovic, linux-mm

On 07/21/2018 12:03 AM, Marinko Catovic wrote:
> I let this run for 3 days now, so it is quite a lot, there you go:
> https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz

The stats show that compaction has very bad results. Between first and
last snapshot, compact_fail grew by 80k and compact_success by 1300.
High-order allocations will thus cycle between (failing) compaction and
reclaim that removes the buffer/caches from memory.
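
(For reference, those deltas can be read directly from the snapshot
series, e.g. first vs. last file:

grep compact_fail $(ls vmstat.* | sort | sed -n '1p;$p')
grep compact_success $(ls vmstat.* | sort | sed -n '1p;$p')

the difference of the two values per counter is the growth.)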

Since dropping slab caches helps, I suspect it's either the slab pages
(which cannot be migrated for compaction) being spread over all memory,
making it impossible to assemble high-order pages, or some slab objects
are pinning file pages making them also impossible to be migrated.

> There is one thing I forgot to mention: the hosts perform find and du (I
> mean the commands, finding files and disk usage)
> on the HDDs every night, starting from 00:20 AM up until in the morning
> 07:45 AM, for maintenance and stats.
> 
> During this period the buffers/caches raise again as you may see from
> the logs, so find/du do fill them.
> Nevertheless as the day passes both decrease again until low values are
> reached.
> I disabled find/du for the night on 19->20th July to compare.
> 
> I have to say that this really low usage (300MB/xGB) occured just once
> after I upgraded from 4.16 to 4.17, not sure
> why, where one can still see from the logs that the buffers/cache is not
> using up the entire available RAM.
> 
> This low usage occured the last time on that one host when I mentioned
> that I had to 2>drop_caches again in my
> previous message, so this is still an issue even on the latest kernel.
> 
> The other host (the one that was not measured with the vmstat logs) has
> currently 600MB/14GB, 34GB of free RAM.
> Both were reset with drop_caches at the same time. From the looks of
> this the really low usage will occur again
> somewhat shortly, it just did not come up during measurement. However,
> the RAM should be full anyway, true?

Can you provide (a single snapshot) /proc/pagetypeinfo and
/proc/slabinfo from a system that's currently experiencing the issue,
also with /proc/vmstat and /proc/zoneinfo to verify? Thanks.
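
Something along these lines should do, run once while the problem is
visible (the output file names are just a suggestion):

for f in pagetypeinfo slabinfo vmstat zoneinfo; do cp /proc/$f $f.$(date +%s); done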


* Re: Caching/buffers become useless after some time
  2018-07-27 11:15               ` Vlastimil Babka
@ 2018-07-30 14:40                 ` Michal Hocko
  2018-07-30 22:08                   ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-07-30 14:40 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Marinko Catovic, linux-mm

On Fri 27-07-18 13:15:33, Vlastimil Babka wrote:
> On 07/21/2018 12:03 AM, Marinko Catovic wrote:
> > I let this run for 3 days now, so it is quite a lot, there you go:
> > https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz
> 
> The stats show that compaction has very bad results. Between first and
> last snapshot, compact_fail grew by 80k and compact_success by 1300.
> High-order allocations will thus cycle between (failing) compaction and
> reclaim that removes the buffer/caches from memory.

I guess you are right. I've just looked at a random large direct reclaim activity:
$ grep -w pgscan_direct  vmstat*| awk  '{diff=$2-old; if (old && diff > 100000) printf "%s %d\n", $1, diff; old=$2}'
vmstat.1531957422:pgscan_direct 114334
vmstat.1532047588:pgscan_direct 111796

$ paste-with-diff.sh vmstat.1532047578 vmstat.1532047588 | grep "pgscan\|pgsteal\|compact\|pgalloc" | sort
# counter			value1		value2-value1
compact_daemon_free_scanned     2628160139      0
compact_daemon_migrate_scanned  797948703       0
compact_daemon_wake     23634   0
compact_fail    124806  108
compact_free_scanned    226181616304    295560271
compact_isolated        2881602028      480577
compact_migrate_scanned 147900786550    27834455
compact_stall   146749  108
compact_success 21943   0
pgalloc_dma     0       0
pgalloc_dma32   1577060946      10752
pgalloc_movable 0       0
pgalloc_normal  29389246430     343249
pgscan_direct   737335028       111796
pgscan_direct_throttle  0       0
pgscan_kswapd   1177909394      0
pgsteal_direct  704542843       111784
pgsteal_kswapd  898170720       0

There is zero kswapd activity so this must have been higher order
allocation activity and all the direct compaction failed so we keep
reclaiming.
-- 
Michal Hocko
SUSE Labs


* Re: Caching/buffers become useless after some time
  2018-07-30 14:40                 ` Michal Hocko
@ 2018-07-30 22:08                   ` Marinko Catovic
  2018-08-02 16:15                     ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-07-30 22:08 UTC (permalink / raw)
  To: linux-mm


> Can you provide (a single snapshot) /proc/pagetypeinfo and
> /proc/slabinfo from a system that's currently experiencing the issue,
> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks.

Your request came in just one day after I did 2>drop_caches again, when
the RAM usage was really, really low. Up until now it has not reoccurred
on either of the 2 hosts; one currently shows 550MB/11G with 37G of
totally free RAM - so not as low as last time when I dropped it (I think
it was around 300M/8G), but I hope it helps:

/proc/pagetypeinfo  https://pastebin.com/6QWEZagL
/proc/slabinfo  https://pastebin.com/81QAFgke
/proc/vmstat  https://pastebin.com/S7mrQx1s
/proc/zoneinfo  https://pastebin.com/csGeqNyX

Also please note, in case this makes any difference: there is no swap
file/partition; I am running without swap space. Imho this should not be
necessary, since the applications running on the hosts would not consume
more than 20GB and the rest should be used by buffers/cache.

2018-07-30 16:40 GMT+02:00 Michal Hocko <mhocko@suse.com>:

> On Fri 27-07-18 13:15:33, Vlastimil Babka wrote:
> > On 07/21/2018 12:03 AM, Marinko Catovic wrote:
> > > I let this run for 3 days now, so it is quite a lot, there you go:
> > > https://nofile.io/f/egGyRjf0NPs/vmstat.tar.gz
> >
> > The stats show that compaction has very bad results. Between first and
> > last snapshot, compact_fail grew by 80k and compact_success by 1300.
> > High-order allocations will thus cycle between (failing) compaction and
> > reclaim that removes the buffer/caches from memory.
>
> I guess you are right. I've just looked at random large direct reclaim
> activity
> $ grep -w pgscan_direct  vmstat*| awk  '{diff=$2-old; if (old && diff >
> 100000) printf "%s %d\n", $1, diff; old=$2}'
> vmstat.1531957422:pgscan_direct 114334
> vmstat.1532047588:pgscan_direct 111796
>
> $ paste-with-diff.sh vmstat.1532047578 vmstat.1532047588 | grep
> "pgscan\|pgsteal\|compact\|pgalloc" | sort
> # counter                       value1          value2-value1
> compact_daemon_free_scanned     2628160139      0
> compact_daemon_migrate_scanned  797948703       0
> compact_daemon_wake     23634   0
> compact_fail    124806  108
> compact_free_scanned    226181616304    295560271
> compact_isolated        2881602028      480577
> compact_migrate_scanned 147900786550    27834455
> compact_stall   146749  108
> compact_success 21943   0
> pgalloc_dma     0       0
> pgalloc_dma32   1577060946      10752
> pgalloc_movable 0       0
> pgalloc_normal  29389246430     343249
> pgscan_direct   737335028       111796
> pgscan_direct_throttle  0       0
> pgscan_kswapd   1177909394      0
> pgsteal_direct  704542843       111784
> pgsteal_kswapd  898170720       0
>
> There is zero kswapd activity so this must have been higher order
> allocation activity and all the direct compaction failed so we keep
> reclaiming.
> --
> Michal Hocko
> SUSE Labs
>


* Re: Caching/buffers become useless after some time
  2018-07-30 22:08                   ` Marinko Catovic
@ 2018-08-02 16:15                     ` Vlastimil Babka
  2018-08-03 14:13                       ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-08-02 16:15 UTC (permalink / raw)
  To: Marinko Catovic, linux-mm, Michal Hocko

On 07/31/2018 12:08 AM, Marinko Catovic wrote:
> 
>> Can you provide (a single snapshot) /proc/pagetypeinfo and
>> /proc/slabinfo from a system that's currently experiencing the issue,
>> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks.
> 
> your request came in just one day after I 2>drop_caches again when the
> ram usage
> was really really low again. Up until now it did not reoccur on any of
> the 2 hosts,
> where one shows 550MB/11G with 37G of totally free ram for now - so not
> that low
> like last time when I dropped it, I think it was like 300M/8G or so, but
> I hope it helps:

Thanks.
 
> /proc/pagetypeinfo  https://pastebin.com/6QWEZagL

Yep, looks like fragmented by reclaimable slabs:

Node    0, zone   Normal, type    Unmovable  29101  32754   8372   2790   1334    354     23      3      4      0      0 
Node    0, zone   Normal, type      Movable 142449  83386  99426  69177  36761  12931   1378     24      0      0      0 
Node    0, zone   Normal, type  Reclaimable 467195 530638 355045 192638  80358  15627   2029    231     18      0      0 

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic      Isolate 
Node 0, zone      DMA            1            7            0            0            0 
Node 0, zone    DMA32           34          703          375            0            0 
Node 0, zone   Normal         1672        14276        15659            1            0

Half of the memory is marked as reclaimable (2 megabyte) pageblocks.
zoneinfo has nr_slab_reclaimable 1679817 so the reclaimable slabs occupy
only 3280 (6G) pageblocks, yet they are spread over 5 times as much.
It's also possible they pollute the Movable pageblocks as well, but the
stats can't tell us. Either the page grouping mobility heuristics are
broken here, or the worst case scenario happened - memory was at some point
really wholly filled with reclaimable slabs, and the rather random reclaim
did not result in whole pageblocks being freed.
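
(For the arithmetic: a 2MB pageblock is 512 4k pages, so 1679817 / 512 =~
3280 pageblocks, i.e. ~6.4GB of actual reclaimable slab, versus the
~16000 pageblocks marked Reclaimable above.)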

> /proc/slabinfo  https://pastebin.com/81QAFgke

Largest caches seem to be:
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ext4_inode_cache  3107754 3759573   1080    3    1 : tunables   24   12    8 : slabdata 1253191 1253191      0
dentry            2840237 7328181    192   21    1 : tunables  120   60    8 : slabdata 348961 348961    120

The internal fragmentation of the dentry cache is significant as well
(only ~2.8M of the 7.3M allocated dentry objects are active).
Dunno if some of those objects pin movable pages as well...

So looks like there's insufficient slab reclaim (shrinker activity), and
possibly problems with page grouping by mobility heuristics as well...

> /proc/vmstat  https://pastebin.com/S7mrQx1s
> /proc/zoneinfo  https://pastebin.com/csGeqNyX
> 
> also please note - whether this makes any difference: there is no swap
> file/partition
> I am using this without swap space. imho this should not be necessary since
> applications running on the hosts would not consume more than 20GB, the rest
> should be used by buffers/cache.
> 


* Re: Caching/buffers become useless after some time
  2018-08-02 16:15                     ` Vlastimil Babka
@ 2018-08-03 14:13                       ` Marinko Catovic
  2018-08-06  9:40                         ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-08-03 14:13 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, Michal Hocko


Thanks for the analysis.

So, since I am not a memory management dev, what exactly does this mean?
Is there any workaround or quick fix, or something that can/will be fixed
at some point in time?

I can not imagine that I am the only one who is affected by this, nor do I
know why my use case would be so much different from any other.
Most 'cloud' services should be affected as well.

Tell me if you need any other snapshots or whatever info.

2018-08-02 18:15 GMT+02:00 Vlastimil Babka <vbabka@suse.cz>:

> On 07/31/2018 12:08 AM, Marinko Catovic wrote:
> >
> >> Can you provide (a single snapshot) /proc/pagetypeinfo and
> >> /proc/slabinfo from a system that's currently experiencing the issue,
> >> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks.
> >
> > your request came in just one day after I 2>drop_caches again when the
> > ram usage
> > was really really low again. Up until now it did not reoccur on any of
> > the 2 hosts,
> > where one shows 550MB/11G with 37G of totally free ram for now - so not
> > that low
> > like last time when I dropped it, I think it was like 300M/8G or so, but
> > I hope it helps:
>
> Thanks.
>
> > /proc/pagetypeinfo  https://pastebin.com/6QWEZagL
>
> Yep, looks like fragmented by reclaimable slabs:
>
> Node    0, zone   Normal, type    Unmovable  29101  32754   8372   2790
>  1334    354     23      3      4      0      0
> Node    0, zone   Normal, type      Movable 142449  83386  99426  69177
> 36761  12931   1378     24      0      0      0
> Node    0, zone   Normal, type  Reclaimable 467195 530638 355045 192638
> 80358  15627   2029    231     18      0      0
>
> Number of blocks type     Unmovable      Movable  Reclaimable
>  HighAtomic      Isolate
> Node 0, zone      DMA            1            7            0            0
>           0
> Node 0, zone    DMA32           34          703          375            0
>           0
> Node 0, zone   Normal         1672        14276        15659            1
>           0
>
> Half of the memory is marked as reclaimable (2 megabyte) pageblocks.
> zoneinfo has nr_slab_reclaimable 1679817 so the reclaimable slabs occupy
> only 3280 (6G) pageblocks, yet they are spread over 5 times as much.
> It's also possible they pollute the Movable pageblocks as well, but the
> stats can't tell us. Either the page grouping mobility heuristics are
> broken here, or the worst case scenario happened - memory was at some point
> really wholly filled with reclaimable slabs, and the rather random reclaim
> did not result in whole pageblocks being freed.
>
> > /proc/slabinfo  https://pastebin.com/81QAFgke
>
> Largest caches seem to be:
> # name            <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata
> <active_slabs> <num_slabs> <sharedavail>
> ext4_inode_cache  3107754 3759573   1080    3    1 : tunables   24   12
> 8 : slabdata 1253191 1253191      0
> dentry            2840237 7328181    192   21    1 : tunables  120   60
> 8 : slabdata 348961 348961    120
>
> The internal framentation of dentry cache is significant as well.
> Dunno if some of those objects pin movable pages as well...
>
> So looks like there's insufficient slab reclaim (shrinker activity), and
> possibly problems with page grouping by mobility heuristics as well...
>
> > /proc/vmstat  https://pastebin.com/S7mrQx1s
> > /proc/zoneinfo  https://pastebin.com/csGeqNyX
> >
> > also please note - whether this makes any difference: there is no swap
> > file/partition
> > I am using this without swap space. imho this should not be necessary
> since
> > applications running on the hosts would not consume more than 20GB, the
> rest
> > should be used by buffers/cache.
> >
>


* Re: Caching/buffers become useless after some time
  2018-08-03 14:13                       ` Marinko Catovic
@ 2018-08-06  9:40                         ` Vlastimil Babka
  2018-08-06 10:29                           ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-08-06  9:40 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: linux-mm, Michal Hocko

On 08/03/2018 04:13 PM, Marinko Catovic wrote:
> Thanks for the analysis.
> 
> So since I am no mem management dev, what exactly does this mean?
> Is there any way of workaround or quickfix or something that can/will
> be fixed at some point in time?

Workaround would be the manual / periodic cache flushing, unfortunately.

Maybe a memcg with kmemcg limit? Michal could know more.
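
(Very roughly, with v1 memcg, something like this - an untested sketch,
the group name and the limit value are just placeholders:

mkdir /sys/fs/cgroup/memory/hosting
echo 4G > /sys/fs/cgroup/memory/hosting/memory.kmem.limit_in_bytes
echo <pid of the service> > /sys/fs/cgroup/memory/hosting/cgroup.procs

so that kernel memory, i.e. the fs metadata, charged to that group gets
reclaimed once the kmem limit is hit.)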

A long-term generic solution will be much harder to find :(

> I can not imagine that I am the only one who is affected by this, nor do I
> know why my use case would be so much different from any other.
> Most 'cloud' services should be affected as well.

Hmm, either your workload is specific in being hungry for fs metadata
and not much data (page cache). And/Or there's some source of the
high-order allocations that others don't have, possibly related to some
piece of hardware?

> Tell me if you need any other snapshots or whatever info.
> 
> 2018-08-02 18:15 GMT+02:00 Vlastimil Babka <vbabka@suse.cz
> <mailto:vbabka@suse.cz>>:
> 
>     On 07/31/2018 12:08 AM, Marinko Catovic wrote:
>     > 
>     >> Can you provide (a single snapshot) /proc/pagetypeinfo and
>     >> /proc/slabinfo from a system that's currently experiencing the issue,
>     >> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks.
>     > 
>     > your request came in just one day after I 2>drop_caches again when the
>     > ram usage
>     > was really really low again. Up until now it did not reoccur on any of
>     > the 2 hosts,
>     > where one shows 550MB/11G with 37G of totally free ram for now - so not
>     > that low
>     > like last time when I dropped it, I think it was like 300M/8G or so, but
>     > I hope it helps:
> 
>     Thanks.
> 
>     > /proc/pagetypeinfo  https://pastebin.com/6QWEZagL
>
>     Yep, looks like fragmented by reclaimable slabs:
>
>     Node    0, zone   Normal, type    Unmovable  29101  32754   8372   2790   1334    354     23      3      4      0      0
>     Node    0, zone   Normal, type      Movable 142449  83386  99426  69177  36761  12931   1378     24      0      0      0
>     Node    0, zone   Normal, type  Reclaimable 467195 530638 355045 192638  80358  15627   2029    231     18      0      0
>
>     Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic      Isolate
>     Node 0, zone      DMA            1            7            0            0            0
>     Node 0, zone    DMA32           34          703          375            0            0
>     Node 0, zone   Normal         1672        14276        15659            1            0
>
>     Half of the memory is marked as reclaimable (2 megabyte) pageblocks.
>     zoneinfo has nr_slab_reclaimable 1679817 so the reclaimable slabs occupy
>     only 3280 (6G) pageblocks, yet they are spread over 5 times as much.
>     It's also possible they pollute the Movable pageblocks as well, but the
>     stats can't tell us. Either the page grouping mobility heuristics are
>     broken here, or the worst case scenario happened - memory was at some point
>     really wholly filled with reclaimable slabs, and the rather random reclaim
>     did not result in whole pageblocks being freed.
>
>     > /proc/slabinfo  https://pastebin.com/81QAFgke
>
>     Largest caches seem to be:
>     # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
>     ext4_inode_cache  3107754 3759573   1080    3    1 : tunables   24   12    8 : slabdata 1253191 1253191      0
>     dentry            2840237 7328181    192   21    1 : tunables  120   60    8 : slabdata 348961 348961    120
>
>     The internal framentation of dentry cache is significant as well.
>     Dunno if some of those objects pin movable pages as well...
>
>     So looks like there's insufficient slab reclaim (shrinker activity), and
>     possibly problems with page grouping by mobility heuristics as well...
>
>     > /proc/vmstat  https://pastebin.com/S7mrQx1s
>     > /proc/zoneinfo  https://pastebin.com/csGeqNyX
>     >
>     > also please note - whether this makes any difference: there is no swap
>     > file/partition
>     > I am using this without swap space. imho this should not be
>     necessary since
>     > applications running on the hosts would not consume more than
>     20GB, the rest
>     > should be used by buffers/cache.
>     >
> 
> 


* Re: Caching/buffers become useless after some time
  2018-08-06  9:40                         ` Vlastimil Babka
@ 2018-08-06 10:29                           ` Marinko Catovic
  2018-08-06 12:00                             ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-08-06 10:29 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, Michal Hocko


> Maybe a memcg with kmemcg limit? Michal could know more.

Could you/Michal explain this perhaps?

The hardware is pretty much high-end datacenter grade; I really would not
know how this could be related to the hardware :(

I do not understand why the caching apparently works perfectly fine at
the beginning, after a drop_caches, and then degrades to low usage
somewhat later. I can not simply drop caches automatically, since that
requires monitoring for overload and temporarily dropping traffic on
specific ports until the writes/reads cool down.


2018-08-06 11:40 GMT+02:00 Vlastimil Babka <vbabka@suse.cz>:

> On 08/03/2018 04:13 PM, Marinko Catovic wrote:
> > Thanks for the analysis.
> >
> > So since I am no mem management dev, what exactly does this mean?
> > Is there any way of workaround or quickfix or something that can/will
> > be fixed at some point in time?
>
> Workaround would be the manual / periodic cache flushing, unfortunately.
>
> Maybe a memcg with kmemcg limit? Michal could know more.
>
> A long-term generic solution will be much harder to find :(
>
> > I can not imagine that I am the only one who is affected by this, nor do
> I
> > know why my use case would be so much different from any other.
> > Most 'cloud' services should be affected as well.
>
> Hmm, either your workload is specific in being hungry for fs metadata
> and not much data (page cache). And/Or there's some source of the
> high-order allocations that others don't have, possibly related to some
> piece of hardware?
>
> > Tell me if you need any other snapshots or whatever info.
> >
> > 2018-08-02 18:15 GMT+02:00 Vlastimil Babka <vbabka@suse.cz
> > <mailto:vbabka@suse.cz>>:
> >
> >     On 07/31/2018 12:08 AM, Marinko Catovic wrote:
> >     >
> >     >> Can you provide (a single snapshot) /proc/pagetypeinfo and
> >     >> /proc/slabinfo from a system that's currently experiencing the
> issue,
> >     >> also with /proc/vmstat and /proc/zoneinfo to verify? Thanks.
> >     >
> >     > your request came in just one day after I 2>drop_caches again when
> the
> >     > ram usage
> >     > was really really low again. Up until now it did not reoccur on
> any of
> >     > the 2 hosts,
> >     > where one shows 550MB/11G with 37G of totally free ram for now -
> so not
> >     > that low
> >     > like last time when I dropped it, I think it was like 300M/8G or
> so, but
> >     > I hope it helps:
> >
> >     Thanks.
> >
> >     > /proc/pagetypeinfo  https://pastebin.com/6QWEZagL
> >
> >     Yep, looks like fragmented by reclaimable slabs:
> >
> >     Node    0, zone   Normal, type    Unmovable  29101  32754   8372
> >      2790   1334    354     23      3      4      0      0
> >     Node    0, zone   Normal, type      Movable 142449  83386  99426
> >     69177  36761  12931   1378     24      0      0      0
> >     Node    0, zone   Normal, type  Reclaimable 467195 530638 355045
> >     192638  80358  15627   2029    231     18      0      0
> >
> >     Number of blocks type     Unmovable      Movable  Reclaimable
> >      HighAtomic      Isolate
> >     Node 0, zone      DMA            1            7            0
> >         0            0
> >     Node 0, zone    DMA32           34          703          375
> >         0            0
> >     Node 0, zone   Normal         1672        14276        15659
> >         1            0
> >
> >     Half of the memory is marked as reclaimable (2 megabyte) pageblocks.
> >     zoneinfo has nr_slab_reclaimable 1679817 so the reclaimable slabs
> occupy
> >     only 3280 (6G) pageblocks, yet they are spread over 5 times as much.
> >     It's also possible they pollute the Movable pageblocks as well, but
> the
> >     stats can't tell us. Either the page grouping mobility heuristics are
> >     broken here, or the worst case scenario happened - memory was at
> >     some point
> >     really wholly filled with reclaimable slabs, and the rather random
> >     reclaim
> >     did not result in whole pageblocks being freed.
> >
> >     > /proc/slabinfo  https://pastebin.com/81QAFgke
> >
> >     Largest caches seem to be:
> >     # name            <active_objs> <num_objs> <objsize> <objperslab>
> >     <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> :
> >     slabdata <active_slabs> <num_slabs> <sharedavail>
> >     ext4_inode_cache  3107754 3759573   1080    3    1 : tunables   24
> >      12    8 : slabdata 1253191 1253191      0
> >     dentry            2840237 7328181    192   21    1 : tunables  120
> >      60    8 : slabdata 348961 348961    120
> >
> >     The internal framentation of dentry cache is significant as well.
> >     Dunno if some of those objects pin movable pages as well...
> >
> >     So looks like there's insufficient slab reclaim (shrinker activity),
> and
> >     possibly problems with page grouping by mobility heuristics as
> well...
> >
> >     > /proc/vmstat  https://pastebin.com/S7mrQx1s
> >     > /proc/zoneinfo  https://pastebin.com/csGeqNyX
> >     >
> >     > also please note - whether this makes any difference: there is no
> swap
> >     > file/partition
> >     > I am using this without swap space. imho this should not be
> >     necessary since
> >     > applications running on the hosts would not consume more than
> >     20GB, the rest
> >     > should be used by buffers/cache.
> >     >
> >
> >
>
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-06 10:29                           ` Marinko Catovic
@ 2018-08-06 12:00                             ` Michal Hocko
  2018-08-06 15:37                               ` Christopher Lameter
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-08-06 12:00 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm

[Please do not top-post]

On Mon 06-08-18 12:29:43, Marinko Catovic wrote:
> > Maybe a memcg with kmemcg limit? Michal could know more.
> 
> Could you/Michael explain this perhaps?

The only way I can think of in which a kmemcg limit could help would be to
enforce metadata reclaim much more often. But that is rather a bad
workaround.
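
For illustration only, a rough sketch of such a setup with the cgroup v1
memory controller (assuming it is mounted at /sys/fs/cgroup/memory; the
group name and the 16G limit are arbitrary examples, and the kmem limit
has to be set before tasks are moved into the group):

# mkdir /sys/fs/cgroup/memory/fsheavy
# echo 16G > /sys/fs/cgroup/memory/fsheavy/memory.kmem.limit_in_bytes
# echo <pid of the service> > /sys/fs/cgroup/memory/fsheavy/cgroup.procs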

> The hardware is pretty much high end datacenter grade, I really would
> not know how this is to be related with the hardware :(

Well, there are some drivers (mostly out-of-tree) which are high-order
hungry. You can try to trace all allocations with order > 0 and
see who that might be.
# mount -t tracefs none /debug/trace/
# echo stacktrace > /debug/trace/trace_options
# echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
# echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
# cat /debug/trace/trace_pipe

And later this to disable tracing.
# echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable
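
(Note: if debugfs/tracefs is already mounted, as it is on most distributions,
the same files live under /sys/kernel/debug/tracing/ rather than the
/debug/trace/ mount point used above; `mount | grep -E 'debugfs|tracefs'`
shows where.)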

> I do not understand why apparently the caching is working very much
> fine for the beginning after a drop_caches, then degrades to low usage
> somewhat later.

Because a lot of FS metadata is fragmenting the memory and a large
number of high order allocations which want to be served reclaim a lot
of memory to achieve their goal. Considering a large part of memory is
fragmented by unmovable objects there is no other way than to use
reclaim to release that memory.

> I can not possibly drop caches automatically, since
> this requires monitoring for overload with temporary dropping traffic
> on specific ports until the writes/reads cool down.

You do not have to drop all caches. echo 2 > /proc/sys/vm/drop_caches
should be sufficient to drop metadata only.
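
If you end up scripting that, a minimal cron-able sketch (the 8 GB
threshold is an arbitrary example, not a recommendation) could look like:

#!/bin/sh
# drop reclaimable metadata caches only when reclaimable slab has grown large
threshold_kb=$((8 * 1024 * 1024))
sreclaimable_kb=$(awk '/^SReclaimable:/ {print $2}' /proc/meminfo)
if [ "$sreclaimable_kb" -gt "$threshold_kb" ]; then
        sync
        echo 2 > /proc/sys/vm/drop_caches
fi
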
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-06 12:00                             ` Michal Hocko
@ 2018-08-06 15:37                               ` Christopher Lameter
  2018-08-06 18:16                                 ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Christopher Lameter @ 2018-08-06 15:37 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Marinko Catovic, Vlastimil Babka, linux-mm

On Mon, 6 Aug 2018, Michal Hocko wrote:

> Because a lot of FS metadata is fragmenting the memory and a large
> number of high order allocations which want to be served reclaim a lot
> of memory to achieve their gol. Considering a large part of memory is
> fragmented by unmovable objects there is no other way than to use
> reclaim to release that memory.

Well it looks like the fragmentation issue gets worse. Is that enough to
consider merging the slab defrag patchset and get some work done on inodes
and dentries to make them movable (or use targeted reclaim)?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-06 15:37                               ` Christopher Lameter
@ 2018-08-06 18:16                                 ` Michal Hocko
  2018-08-09  8:29                                   ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-08-06 18:16 UTC (permalink / raw)
  To: Christopher Lameter; +Cc: Marinko Catovic, Vlastimil Babka, linux-mm

On Mon 06-08-18 15:37:14, Christopher Lameter wrote:
> On Mon, 6 Aug 2018, Michal Hocko wrote:
> 
> > Because a lot of FS metadata is fragmenting the memory and a large
> > number of high order allocations which want to be served reclaim a lot
> > of memory to achieve their gol. Considering a large part of memory is
> > fragmented by unmovable objects there is no other way than to use
> > reclaim to release that memory.
> 
> Well it looks like the fragmentation issue gets worse. Is that enough to
> consider merging the slab defrag patchset and get some work done on inodes
> and dentries to make them movable (or use targetd reclaim)?

Is there anything to test?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-06 18:16                                 ` Michal Hocko
@ 2018-08-09  8:29                                   ` Marinko Catovic
  2018-08-21  0:36                                     ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-08-09  8:29 UTC (permalink / raw)
  Cc: Christopher Lameter, Vlastimil Babka, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2996 bytes --]

On Mon 06-08-18 15:37:14, Cristopher Lameter wrote:
> > On Mon, 6 Aug 2018, Michal Hocko wrote:
> >
> > > Because a lot of FS metadata is fragmenting the memory and a large
> > > number of high order allocations which want to be served reclaim a lot
> > > of memory to achieve their gol. Considering a large part of memory is
> > > fragmented by unmovable objects there is no other way than to use
> > > reclaim to release that memory.
> >
> > Well it looks like the fragmentation issue gets worse. Is that enough to
> > consider merging the slab defrag patchset and get some work done on
> inodes
> > and dentries to make them movable (or use targetd reclaim)?
>
> Is there anything to test?
> --
> Michal Hocko
> SUSE Labs
>

> [Please do not top-post]

like this?

> The only way how kmemcg limit could help I can think of would be to
> enforce metadata reclaim much more often. But that is rather a bad
> workaround.

would that have some significant performance impact?
I would be willing to try if you think the idea is not thaaat bad.
If so, could you please explain what to do?

> > > Because a lot of FS metadata is fragmenting the memory and a large
> > > number of high order allocations which want to be served reclaim a lot
> > > of memory to achieve their gol. Considering a large part of memory is
> > > fragmented by unmovable objects there is no other way than to use
> > > reclaim to release that memory.
> >
> > Well it looks like the fragmentation issue gets worse. Is that enough to
> > consider merging the slab defrag patchset and get some work done on
inodes
> > and dentries to make them movable (or use targetd reclaim)?

> Is there anything to test?

Are you referring to some known issue there, possibly directly related to
mine? If so, I would be willing to test that patchset, whether it makes it
into the kernel.org sources or I'd have to apply the patches manually.


> Well, there are some drivers (mostly out-of-tree) which are high order
> hungry. You can try to trace all allocations which with order > 0 and
> see who that might be.
> # mount -t tracefs none /debug/trace/
> # echo stacktrace > /debug/trace/trace_options
> # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
> # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
> # cat /debug/trace/trace_pipe
>
> And later this to disable tracing.
> # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable

I just had a major cache-useless situation, with like 100M/8G usage only
and horrible performance. There you go:

https://nofile.io/f/mmwVedaTFsd

I think mysql occurs most often; regardless of the binary name, this is
actually mariadb, version 10.1.
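
That impression is based on a rough count over the captured trace, something
like the following (assuming the gzipped capture from above and the default
trace line format, where the task name and PID form the first field):

zcat trace_pipe.gz | grep 'mm_page_alloc:' | grep -v 'order=0' | \
        awk '{print $1}' | sort | uniq -c | sort -rn | head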

> You do not have to drop all caches. echo 2 > /proc/sys/vm/drop_caches
> should be sufficient to drop metadata only.

That is exactly what I am doing; I already mentioned that echo 1 does not
make any difference at all, echo 2 is the only thing that helps.
Just 5 minutes after doing that the usage grew to 2GB/10GB and is steadily
going up, as usual.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-09  8:29                                   ` Marinko Catovic
@ 2018-08-21  0:36                                     ` Marinko Catovic
  2018-08-21  6:49                                       ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-08-21  0:36 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Christopher Lameter, Vlastimil Babka, linux-mm

[-- Attachment #1: Type: text/plain, Size: 6465 bytes --]

> The only way how kmemcg limit could help I can think of would be to
>> enforce metadata reclaim much more often. But that is rather a bad
>> workaround.
>
>would that have some significant performance impact?
>I would be willing to try if you think the idea is not thaaat bad.
>If so, could you please explain what to do?
>
>> > > Because a lot of FS metadata is fragmenting the memory and a large
>> > > number of high order allocations which want to be served reclaim a
lot
>> > > of memory to achieve their gol. Considering a large part of memory is
>> > > fragmented by unmovable objects there is no other way than to use
>> > > reclaim to release that memory.
>> >
>> > Well it looks like the fragmentation issue gets worse. Is that enough
to
>> > consider merging the slab defrag patchset and get some work done on
inodes
>> > and dentries to make them movable (or use targetd reclaim)?
>
>> Is there anything to test?
>
>Are you referring to some known issue there, possibly directly related to
mine?
>If so, I would be willing to test that patchset, if it makes into the
kernel.org sources,
>or if I'd have to patch that manually.
>
>
>> Well, there are some drivers (mostly out-of-tree) which are high order
>> hungry. You can try to trace all allocations which with order > 0 and
>> see who that might be.
>> # mount -t tracefs none /debug/trace/
>> # echo stacktrace > /debug/trace/trace_options
>> # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
>> # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
>> # cat /debug/trace/trace_pipe
>>
>> And later this to disable tracing.
>> # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable
>
>I just had a major cache-useless situation, with like 100M/8G usage only
>and horrible performance. There you go:
>
>https://nofile.io/f/mmwVedaTFsd
>
>I think mysql occurs mostly, regardless of the binary name this is actually
>mariadb in version 10.1.
>
>> You do not have to drop all caches. echo 2 > /proc/sys/vm/drop_caches
>> should be sufficient to drop metadata only.
>
>that is exactly what I am doing, I already mentioned that 1> does not
>make any difference at all 2> is the only way that helps.
>just 5 minutes after doing that the usage grew to 2GB/10GB and is steadily
>going up, as usual.
>

Is there anything you can read from these results?
The issue keeps occurring; the latest episode was even totally unexpected in
the morning hours, causing downtime the entire morning until noon, when I
could check and drop the caches again.

I also reset O_DIRECT from mariadb to `fsync`, the new default in their
latest release, hoping that this would help, but it did not.

Before giving up completely I'd like to know whether there is any solution
for this, where again I can not believe that I am the only one affected.
This *has* to affect anyone with a similar use case, and I do not see what
is so special about mine. It is simply many users with many files; every
larger shared hosting provider should experience exactly the same behaviour
with the 4.x kernel branch.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-21  0:36                                     ` Marinko Catovic
@ 2018-08-21  6:49                                       ` Michal Hocko
  2018-08-21  7:19                                         ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-08-21  6:49 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Christopher Lameter, Vlastimil Babka, linux-mm

On Tue 21-08-18 02:36:05, Marinko Catovic wrote:
[...]
> > > Well, there are some drivers (mostly out-of-tree) which are high order
> > > hungry. You can try to trace all allocations which with order > 0 and
> > > see who that might be.
> > > # mount -t tracefs none /debug/trace/
> > > # echo stacktrace > /debug/trace/trace_options
> > > # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
> > > # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
> > > # cat /debug/trace/trace_pipe
> > >
> > > And later this to disable tracing.
> > > # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable
> >
> > I just had a major cache-useless situation, with like 100M/8G usage only
> > and horrible performance. There you go:
> >
> > https://nofile.io/f/mmwVedaTFsd

$ grep mm_page_alloc: trace_pipe | sed 's@.*order=\([0-9]*\) .*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
    428 1 __GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
     10 1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
      6 1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
   3061 1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
   8672 1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT
   2547 1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
      4 2 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
      5 2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
  20030 2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT
   1528 3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
   2476 3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
   6512 3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT
    277 9 GFP_TRANSHUGE|__GFP_THISNODE

This only covers ~90s of the allocator activity. Most of those requests
are not triggering any reclaim (GFP_NOWAIT/ATOMIC). Vlastimil will
know better but this might mean that we are not invoking kcompactd
enough. But considering that we have suspected that an overly eager
reclaim triggers the page cache reduction I am not really sure the
above matches that theory.

Btw. I was probably not specific enough. This data should be collected
_during_ the time when the page cache is disappearing. I suspect you
have started collecting after the fact.

Btw. vast majority of order-3 requests come from the network layer. Are
you using a large MTU (jumbo packets)?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-21  6:49                                       ` Michal Hocko
@ 2018-08-21  7:19                                         ` Vlastimil Babka
  2018-08-22 20:02                                           ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-08-21  7:19 UTC (permalink / raw)
  To: Michal Hocko, Marinko Catovic; +Cc: Christopher Lameter, linux-mm

On 8/21/18 8:49 AM, Michal Hocko wrote:
> On Tue 21-08-18 02:36:05, Marinko Catovic wrote:
> [...]
>>>> Well, there are some drivers (mostly out-of-tree) which are high order
>>>> hungry. You can try to trace all allocations which with order > 0 and
>>>> see who that might be.
>>>> # mount -t tracefs none /debug/trace/
>>>> # echo stacktrace > /debug/trace/trace_options
>>>> # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
>>>> # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
>>>> # cat /debug/trace/trace_pipe
>>>>
>>>> And later this to disable tracing.
>>>> # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable
>>>
>>> I just had a major cache-useless situation, with like 100M/8G usage only
>>> and horrible performance. There you go:
>>>
>>> https://nofile.io/f/mmwVedaTFsd
> 
> $ grep mm_page_alloc: trace_pipe | sed 's@.*order=\([0-9]*\) .*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
>     428 1 __GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>      10 1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>       6 1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>    3061 1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>    8672 1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT
>    2547 1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>       4 2 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>       5 2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>   20030 2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT
>    1528 3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
>    2476 3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
>    6512 3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ACCOUNT
>     277 9 GFP_TRANSHUGE|__GFP_THISNODE
> 
> This only covers ~90s of the allocator activity. Most of those requests
> are not troggering any reclaim (GFP_NOWAIT/ATOMIC). Vlastimil will
> know better but this might mean that we are not envoking kcompactd
> enough.

Earlier vmstat data showed that it's invoked but responsible for less
than 1% of compaction activity.

> But considering that we have suspected that an overly eager
> reclaim triggers the page cache reduction I am not really sure I see the
> above to match that theory.

Yeah, the GFP_NOWAIT/GFP_ATOMIC above shouldn't be responsible for such
overreclaim?

> Btw. I was probably not specific enough. This data should be collected
> _during_ the time when the page cache is disappearing. I suspect you
> have started collecting after the fact.

It might be also interesting to do in the problematic state, instead of
dropping caches:

- save snapshot of /proc/vmstat and /proc/pagetypeinfo
- echo 1 > /proc/sys/vm/compact_memory
- save new snapshot of /proc/vmstat and /proc/pagetypeinfo

That would show if compaction is able to help at all.
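
Something along these lines would capture both snapshots around the trigger
(the file names are just examples):

# cp /proc/vmstat vmstat.before; cp /proc/pagetypeinfo pagetypeinfo.before
# echo 1 > /proc/sys/vm/compact_memory
# cp /proc/vmstat vmstat.after; cp /proc/pagetypeinfo pagetypeinfo.after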

> Btw. vast majority of order-3 requests come from the network layer. Are
> you using a large MTU (jumbo packets)?
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-21  7:19                                         ` Vlastimil Babka
@ 2018-08-22 20:02                                           ` Marinko Catovic
  2018-08-23 12:10                                             ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-08-22 20:02 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Michal Hocko, Christopher Lameter, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1763 bytes --]

> It might be also interesting to do in the problematic state, instead of
> dropping caches:
>
> - save snapshot of /proc/vmstat and /proc/pagetypeinfo
> - echo 1 > /proc/sys/vm/compact_memory
> - save new snapshot of /proc/vmstat and /proc/pagetypeinfo

There was just a worst case in progress, about 100MB/10GB were used,
super-low performance, but I could not see any improvement there after
echo 1; I watched this for about 3 minutes, the cache usage did not change.

pagetypeinfo before echo https://pastebin.com/MjSgiMRL
pagetypeinfo 3min after echo https://pastebin.com/uWM6xGDd

vmstat before echo https://pastebin.com/TjYSKNdE
vmstat 3min after echo https://pastebin.com/MqTibEKi

> Btw. vast majority of order-3 requests come from the network layer. Are
> you using a large MTU (jumbo packets)?

Not that I know of - how would I figure that out?
I have not touched sysctl net.* besides a few values not related to MTU,
afaik.
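
(For the record, the interface MTU can be checked with something like
`ip -o link show | awk '{print $2, $4, $5}'`, which prints each interface
together with its mtu value; ~1500 is the usual non-jumbo default.)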

> Btw. I was probably not specific enough. This data should be collected
> _during_ the time when the page cache is disappearing. I suspect you
> have started collecting after the fact.

Meh, I just messed up that output with the latest drop_caches, but I am
pretty sure that the data you see is from while the usage was around
300MB/10GB, before dropping caches.

I was thinking it might really help if one of you guys connected to the
hosts in that state so that you can see for yourself. Due to privacy issues
(GDPR and such) I'd like to monitor this, so the ssh login would have to go
over something like TeamViewer on my host or whatever. Please let me know if
anyone is willing, since I really see no progress with anything I have tried
for 3 months by now. Thanks for the efforts; surely any diagnosis would be
easier this way.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-22 20:02                                           ` Marinko Catovic
@ 2018-08-23 12:10                                             ` Vlastimil Babka
  2018-08-23 12:21                                               ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-08-23 12:10 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Michal Hocko, Christopher Lameter, linux-mm

On 08/22/2018 10:02 PM, Marinko Catovic wrote:
>> It might be also interesting to do in the problematic state, instead of
>> dropping caches:
>>
>> - save snapshot of /proc/vmstat and /proc/pagetypeinfo
>> - echo 1 > /proc/sys/vm/compact_memory
>> - save new snapshot of /proc/vmstat and /proc/pagetypeinfo
> 
> There was just a worstcase in progress, about 100MB/10GB were used,
> super-low perfomance, but could not see any improvement there after echo 1,
> I watches this for about 3 minutes, the cache usage did not change.
> 
> pagetypeinfo before echo https://pastebin.com/MjSgiMRL
> pagetypeinfo 3min after echo https://pastebin.com/uWM6xGDd
> 
> vmstat before echo https://pastebin.com/TjYSKNdE
> vmstat 3min after echo https://pastebin.com/MqTibEKi

OK, that confirms compaction is useless here. Thanks.

It also shows that all orders except order-9 are in fact plentiful.
Michal's earlier summary of the trace shows that most allocations are up
to order-3 and should be fine, the exception is THP:

    277 9 GFP_TRANSHUGE|__GFP_THISNODE

Hmm it's actually interesting to see GFP_TRANSHUGE there and not
GFP_TRANSHUGE_LIGHT. What's your thp defrag setting? (cat
/sys/kernel/mm/transparent_hugepage/enabled). Maybe it's set to
"always", or there's a heavily faulting process that's using
madvise(MADV_HUGEPAGE). If that's the case, setting it to "defer" or
even "never" could be a workaround.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-23 12:10                                             ` Vlastimil Babka
@ 2018-08-23 12:21                                               ` Michal Hocko
  2018-08-24  0:11                                                 ` Marinko Catovic
  2018-08-24  6:24                                                 ` Vlastimil Babka
  0 siblings, 2 replies; 66+ messages in thread
From: Michal Hocko @ 2018-08-23 12:21 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Marinko Catovic, Christopher Lameter, linux-mm

On Thu 23-08-18 14:10:28, Vlastimil Babka wrote:
> On 08/22/2018 10:02 PM, Marinko Catovic wrote:
> >> It might be also interesting to do in the problematic state, instead of
> >> dropping caches:
> >>
> >> - save snapshot of /proc/vmstat and /proc/pagetypeinfo
> >> - echo 1 > /proc/sys/vm/compact_memory
> >> - save new snapshot of /proc/vmstat and /proc/pagetypeinfo
> > 
> > There was just a worstcase in progress, about 100MB/10GB were used,
> > super-low perfomance, but could not see any improvement there after echo 1,
> > I watches this for about 3 minutes, the cache usage did not change.
> > 
> > pagetypeinfo before echo https://pastebin.com/MjSgiMRL
> > pagetypeinfo 3min after echo https://pastebin.com/uWM6xGDd
> > 
> > vmstat before echo https://pastebin.com/TjYSKNdE
> > vmstat 3min after echo https://pastebin.com/MqTibEKi
> 
> OK, that confirms compaction is useless here. Thanks.
> 
> It also shows that all orders except order-9 are in fact plentiful.
> Michal's earlier summary of the trace shows that most allocations are up
> to order-3 and should be fine, the exception is THP:
> 
>     277 9 GFP_TRANSHUGE|__GFP_THISNODE

But please note that this is not from the time when the page cache
dropped to the observed values. So we do not know what happened at the
time.

Anyway 277 THP pages paging out such a large page cache amount would be
more than unexpected even for explicitly costly THP fault in methods.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-23 12:21                                               ` Michal Hocko
@ 2018-08-24  0:11                                                 ` Marinko Catovic
  2018-08-24  6:34                                                   ` Vlastimil Babka
  2018-08-24  6:24                                                 ` Vlastimil Babka
  1 sibling, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-08-24  0:11 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vlastimil Babka, Christopher Lameter, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3349 bytes --]

> Hmm it's actually interesting to see GFP_TRANSHUGE there and not
> GFP_TRANSHUGE_LIGHT. What's your thp defrag setting? (cat
> /sys/kernel/mm/transparent_hugepage/enabled). Maybe it's set to
> "always", or there's a heavily faulting process that's using
> madvise(MADV_HUGEPAGE). If that's the case, setting it to "defer" or
> even "never" could be a workaround.

cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

according to the docs this is the default
> "madvise" will enter direct reclaim like "always" but only for regions
> that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

would any change there kick in immediately, even when in the 100M/10G case?

> or there's a heavily faulting process that's using madvise(MADV_HUGEPAGE)

Are you suggesting that a single process can cause this?
How would one be able to identify it? Should killing it allow the cache to
be populated again instantly? If yes, then I could start killing processes
on the host until there is an improvement to observe.
So far I can tell that it is not the database server, since restarting it
did not help at all.

Please remember that, as I described, I can watch how the buffers (the 100MB
value) are `oscillating`. When in the cache-useless state the value jumps
around literally every second, e.g. from 100 to 102, then 99, 104, 85, 101,
105, 98, and so on, while over the days it drifts down from the several
well-populated GB in the beginning to those 100MB.
So anything that has an effect should be easily and instantly measurable,
which to date is only achieved by dropping caches.
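
The oscillation itself is easy to watch with a trivial loop (just an
example, reading /proc/meminfo once per second):

while true; do
        printf '%s ' "$(date +%T)"
        awk '/^Buffers:|^Cached:/ {printf "%s %s kB  ", $1, $2}' /proc/meminfo
        echo
        sleep 1
done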

Please tell me if you need any measurements again, when or at what state,
with code
snippets perhaps to fit your needs.


Am Do., 23. Aug. 2018 um 14:21 Uhr schrieb Michal Hocko <mhocko@suse.com>:

> On Thu 23-08-18 14:10:28, Vlastimil Babka wrote:
> > On 08/22/2018 10:02 PM, Marinko Catovic wrote:
> > >> It might be also interesting to do in the problematic state, instead
> of
> > >> dropping caches:
> > >>
> > >> - save snapshot of /proc/vmstat and /proc/pagetypeinfo
> > >> - echo 1 > /proc/sys/vm/compact_memory
> > >> - save new snapshot of /proc/vmstat and /proc/pagetypeinfo
> > >
> > > There was just a worstcase in progress, about 100MB/10GB were used,
> > > super-low perfomance, but could not see any improvement there after
> echo 1,
> > > I watches this for about 3 minutes, the cache usage did not change.
> > >
> > > pagetypeinfo before echo https://pastebin.com/MjSgiMRL
> > > pagetypeinfo 3min after echo https://pastebin.com/uWM6xGDd
> > >
> > > vmstat before echo https://pastebin.com/TjYSKNdE
> > > vmstat 3min after echo https://pastebin.com/MqTibEKi
> >
> > OK, that confirms compaction is useless here. Thanks.
> >
> > It also shows that all orders except order-9 are in fact plentiful.
> > Michal's earlier summary of the trace shows that most allocations are up
> > to order-3 and should be fine, the exception is THP:
> >
> >     277 9 GFP_TRANSHUGE|__GFP_THISNODE
>
> But please note that this is not from the time when the page cache
> dropped to the observed values. So we do not know what happened at the
> time.
>
> Anyway 277 THP pages paging out such a large page cache amount would be
> more than unexpected even for explicitly costly THP fault in methods.
> --
> Michal Hocko
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-23 12:21                                               ` Michal Hocko
  2018-08-24  0:11                                                 ` Marinko Catovic
@ 2018-08-24  6:24                                                 ` Vlastimil Babka
  1 sibling, 0 replies; 66+ messages in thread
From: Vlastimil Babka @ 2018-08-24  6:24 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Marinko Catovic, Christopher Lameter, linux-mm

On 08/23/2018 02:21 PM, Michal Hocko wrote:
> On Thu 23-08-18 14:10:28, Vlastimil Babka wrote:
>> It also shows that all orders except order-9 are in fact plentiful.
>> Michal's earlier summary of the trace shows that most allocations are up
>> to order-3 and should be fine, the exception is THP:
>>
>>     277 9 GFP_TRANSHUGE|__GFP_THISNODE
> 
> But please note that this is not from the time when the page cache
> dropped to the observed values. So we do not know what happened at the
> time.

Okay, we didn't observe it drop, but there must still be something going
on that keeps it from growing back?

> Anyway 277 THP pages paging out such a large page cache amount would be
> more than unexpected even for explicitly costly THP fault in methods.

It's 277 in 90 seconds. But it seems no reclaim should happen there
anyway, because shrink_zones() should evaluate compaction_ready() as
true and skip the zones. Unless there is some kind of bug, maybe e.g.
ZONE_DMA returns compaction_ready() as false, causing the whole node to
be reclaimed? Hmm.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-24  0:11                                                 ` Marinko Catovic
@ 2018-08-24  6:34                                                   ` Vlastimil Babka
  2018-08-24  8:11                                                     ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-08-24  6:34 UTC (permalink / raw)
  To: Marinko Catovic, Michal Hocko; +Cc: Christopher Lameter, linux-mm

On 08/24/2018 02:11 AM, Marinko Catovic wrote:
>> Hmm it's actually interesting to see GFP_TRANSHUGE there and not
>> GFP_TRANSHUGE_LIGHT. What's your thp defrag setting? (cat
>> /sys/kernel/mm/transparent_hugepage/enabled). Maybe it's set to
>> "always", or there's a heavily faulting process that's using
>> madvise(MADV_HUGEPAGE). If that's the case, setting it to "defer" or
>> even "never" could be a workaround.
> 
> cat /sys/kernel/mm/transparent_hugepage/enabled
> always [madvise] never

Hmm my mistake. I was actually interested in
/sys/kernel/mm/transparent_hugepage/defrag

> according to the docs this is the default
>> "madvise" will enter direct reclaim like "always" but only for regions
>> that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

Yeah but that's about 'defrag'. For 'enabled', the default should be
always. But it's a kernel config option I think? Let's see what you have
for 'defrag'...

> would any change there kick in immediately, even when in the 100M/10G case?

If it's indeed preventing the cache from growing back, changing that
should result in gradual increase. Note that it doesn't look probable
that THP is the cause, but the trace didn't contain any other
allocations that could be responsible for high-order direct reclaim.

>> or there's a heavily faulting process that's using madvise(MADV_HUGEPAGE)
> 
> are you suggesting that a/one process can cause this?
> how would one be able to identify it..? should killing it allow the
> cache to be
> populated again instantly? if yes, then I could start killing all
> processes on the
> host until there is improvement to observe.

It's not the process' fault, and killing it might disrupt the
observation in unexpected ways. It's simpler to change the global
setting to "never" to confirm or rule out this.

Ah, checked the trace and it seems to be "php-cgi". Interesting that
they use madvise(MADV_HUGEPAGE). Anyway the above still applies.

> so far I can tell that it is not the database server, since restarting
> it did not help at all.
> 
> Please remember that, suggesting this, I can see how buffers (the 100MB
> value)
> are `oscillating`. When in the cache-useless state it jumps around
> literally every second
> from e.g. 100 to 102, then 99, 104, 85, 101, 105, 98, .. and so on,
> where it always gets
> closer from well-populated several GB in the beginning to those 100MB
> over the days.
> so doing anything that should cause an effect would be easily measurable
> instantly,
> which is to date only achieved by dropping caches.
> 
> Please tell me if you need any measurements again, when or at what
> state, with code
> snippets perhaps to fit your needs.

1. Send the current value of /sys/kernel/mm/transparent_hugepage/defrag
2. Unless it's 'defer' or 'never' already, try changing it to 'defer'.

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-24  6:34                                                   ` Vlastimil Babka
@ 2018-08-24  8:11                                                     ` Marinko Catovic
  2018-08-24  8:36                                                       ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-08-24  8:11 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Michal Hocko, Christopher Lameter, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1891 bytes --]

>
> 1. Send the current value of /sys/kernel/mm/transparent_hugepage/defrag
> 2. Unless it's 'defer' or 'never' already, try changing it to 'defer'.
>

 /sys/kernel/mm/transparent_hugepage/defrag is
always defer defer+madvise [madvise] never

I *think* I already played around with these values; as far as I remember,
`never` almost caused the system to hang, or at least it did until I
switched back to madvise.
Shall I switch it to defer now and observe (all hosts are running fine
right now), or switch to defer while it is in the bad state?
And when doing this, should improvement be measurable immediately?
I need to know how long to hold this before dropping caches becomes
necessary.

> Ah, checked the trace and it seems to be "php-cgi". Interesting that
> they use madvise(MADV_HUGEPAGE). Anyway the above still applies.

you know, that's at least an interesting hint. look at this:
https://ckon.wordpress.com/2015/09/18/php7-opcache-performance/

This was experimental there, but a more recent version seems to have it on
by default, since I need to disable it on request (which implies to me that
it is on by default).
It is however *disabled* in the runtime configuration here (and not in
effect; I just confirmed that).

It would be interesting to know whether madvise(MADV_HUGEPAGE) is then
active
somewhere else, since it is in the dump as you observed.

Please note that `killing` php-cgi would not make any difference then,
since these processes are started per request for every user and killed
after whatever script is finished. This may invoke about 10-50 forks (with
different system users) every second, depending on load.

That also *may* explain why it is not so deterministic (sometimes sooner,
sometimes later, sometimes on one host and not on the other), since there
are multiple php-cgi versions available and not everyone is using the same
version - most people stick to legacy versions.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-24  8:11                                                     ` Marinko Catovic
@ 2018-08-24  8:36                                                       ` Vlastimil Babka
  2018-08-29 14:54                                                         ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-08-24  8:36 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Michal Hocko, Christopher Lameter, linux-mm

On 08/24/2018 10:11 AM, Marinko Catovic wrote:
>     1. Send the current value of /sys/kernel/mm/transparent_hugepage/defrag
>     2. Unless it's 'defer' or 'never' already, try changing it to 'defer'.
> 
> 
> /sys/kernel/mm/transparent_hugepage/defrag is
> always defer defer+madvise [madvise] never

Yeah that's the default.

> I *think* I already played around with these values, as far as I
> remember `never`
> almost caused the system to hang, or at least while I switched back to
> madvise.

That would be unexpected for the 'defrag' file, but maybe possible for
'enabled' file where mm structs are put on/removed from a list
system-wide, AFAIK.

> shall I switch it to defer and observe (all hosts are running fine by
> just now) or
> switch to defer while it is in the bad state?

You could do it immediately and see if no problems appear for long
enough, OTOH...

> and when doing this, should improvement be measurable immediately?

I would expect that. It would be a more direct proof that that was the
cause.

> I need to know how long to hold this, before dropping caches becomes
> necessary.

If it keeps oscillating and doesn't start growing, it means it didn't
help. Few minutes should be enough.

>> Ah, checked the trace and it seems to be "php-cgi". Interesting that
>> they use madvise(MADV_HUGEPAGE). Anyway the above still applies.
> 
> you know, that's at least an interesting hint. look at this:
> https://ckon.wordpress.com/2015/09/18/php7-opcache-performance/
> 
> this was experimental there, but a more recent version seems to have it on
> by default, since I need to disable it on request (implies to me that it
> is on by default).
> it is however *disabled* in the runtime configuration (and not in
> effect, I just confirmed that)
> 
> It would be interesting to know whether madvise(MADV_HUGEPAGE) is then
> active
> somewhere else, since it is in the dump as you observed.

The trace points to php-cgi so either disabling it doesn't work, or they
started using the madvise also for other stuff than opcache. But that
doesn't matter, it would be kernel's fault if a program using the
madvise would effectively kill the system like this. Let's just stick
with the global 'defrag'='defer' change and not tweak several things at
once.

> Please note that `killing` php-cgi would not make any difference then,
> since these processes
> are started by request for every user and killed after whatever script
> is finished. this may
> invoke about 10-50 forks, depending on load, (with different system
> users) every second.

Yep.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-24  8:36                                                       ` Vlastimil Babka
@ 2018-08-29 14:54                                                         ` Marinko Catovic
  2018-08-29 15:01                                                           ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-08-29 14:54 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Michal Hocko, Christopher Lameter, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1034 bytes --]

> > shall I switch it to defer and observe (all hosts are running fine by
> > just now) or
> > switch to defer while it is in the bad state?
>
> You could do it immediately and see if no problems appear for long
> enough, OTOH...
>

Well, cat /sys/kernel/mm/transparent_hugepage/defrag
always [defer] defer+madvise madvise never
has been active since your reply; however, I can not tell that it helped.

This was set on 2 hosts; one has 20GB of unused RAM now.
Yesterday there was a similar picture for both, with several GB unused, one
with up to 10GB - I just checked once, this is what I recall.

Tell me if someone would like to log in remotely; I can set up TeamViewer
or something for this at any time, just drop a message here and I'll
contact you.
I have hopes that one can investigate things even on the host that has
20GB unused; it is just a matter of time until this gets to the low values,
so the problem has surely already kicked in there.

Also if the remote login is not an option, I'm always happy to provide
whatever info you need.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-29 14:54                                                         ` Marinko Catovic
@ 2018-08-29 15:01                                                           ` Michal Hocko
  2018-08-29 15:13                                                             ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-08-29 15:01 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Vlastimil Babka, Christopher Lameter, linux-mm

On Wed 29-08-18 16:54:32, Marinko Catovic wrote:
[...]
> Also if the remote login is not an option, I'm always happy to provide
> whatever info you need.

Trace data which starts _before_ the cache dropdown begins and runs while it
is decreasing should be the first step. Ideally along with /proc/vmstat
gathered at the same time. I am pretty sure you have some high order
memory consumer which forces the reclaim and we over-reclaim. The last data
was not really conclusive as it didn't really capture the dropdown
IIRC.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-29 15:01                                                           ` Michal Hocko
@ 2018-08-29 15:13                                                             ` Marinko Catovic
  2018-08-29 15:27                                                               ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-08-29 15:13 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vlastimil Babka, Christopher Lameter, linux-mm

[-- Attachment #1: Type: text/plain, Size: 755 bytes --]

>
> trace data which starts _before_ the cache dropdown starts and while it
> is decreasing should be the first step. Ideally along with /proc/vmstat
> gathered at the same time. I am pretty sure you have some high order
> memory consumer which forces the reclaim and we over reclaim. Last data
> was not really conclusive as it didn't really captured the dropdown
> IIRC.
>

With 'before' you mean in a totally healthy state?
As I can not tell when the decrease starts, this would mean collecting data
over days perhaps; however, I have no issue with that.
As I do not want to miss anything that might help you, could you please
provide the commands for all the data you require?
One host is in a healthy state right now; I'd run that over there
immediately.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-29 15:13                                                             ` Marinko Catovic
@ 2018-08-29 15:27                                                               ` Michal Hocko
  2018-08-29 16:44                                                                 ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-08-29 15:27 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Vlastimil Babka, Christopher Lameter, linux-mm

On Wed 29-08-18 17:13:59, Marinko Catovic wrote:
> >
> > trace data which starts _before_ the cache dropdown starts and while it
> > is decreasing should be the first step. Ideally along with /proc/vmstat
> > gathered at the same time. I am pretty sure you have some high order
> > memory consumer which forces the reclaim and we over reclaim. Last data
> > was not really conclusive as it didn't really captured the dropdown
> > IIRC.
> >
> 
> with before you mean in a totally healthy state?

yep

> as I can not tell when decreasing starts this would mean collecting data
> over days perhaps. however, I have no issue with that.

yeah, you can pipe the trace buffer to gzip and reduce the output
considerably.

> As I do not want to miss anything that might help you, could you please
> provide the commands for all the data you require?

Use the same set of commands for tracing I have provided earlier + add
the compresssion

cat /debug/trace/trace_pipe | gzip > file.gz

+ the loop to gather vmstat

while true
do
	cp /proc/vmstat vmstat.$(date +%s)
	sleep 5s
done

> one host is at a healthy state right now, I'd run that over there immediately.

Let's see what we can get from here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-29 15:27                                                               ` Michal Hocko
@ 2018-08-29 16:44                                                                 ` Marinko Catovic
  2018-10-22  1:19                                                                   ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-08-29 16:44 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vlastimil Babka, Christopher Lameter, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1029 bytes --]

> > one host is at a healthy state right now, I'd run that over there
> immediately.
>
> Let's see what we can get from here.
>

Oh well, that went fast. Even with low values for buffers (around 100MB)
and caches around 20G or so, the performance was nevertheless super low,
and I really had to drop the caches just now. This is the first time I see
it happening with caches >10G, but hopefully this also provides a clue for
you.

Just after starting the stats I reset the setting from defer back to
madvise - I suspect that this somehow caused the rapid reaction, since a
few minutes later I saw the free RAM jump from 5GB to 10GB. After that I
went afk, returning to the pc only when my monitoring systems went crazy
telling me about downtime.

If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to
its default, after it had been on defer for days, was a mistake, then
please tell me.

here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-08-29 16:44                                                                 ` Marinko Catovic
@ 2018-10-22  1:19                                                                   ` Marinko Catovic
  2018-10-23 17:41                                                                     ` Marinko Catovic
                                                                                       ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: Marinko Catovic @ 2018-10-22  1:19 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, Vlastimil Babka, Christopher Lameter

Am Mi., 29. Aug. 2018 um 18:44 Uhr schrieb Marinko Catovic
<marinko.catovic@gmail.com>:
>
>
>> > one host is at a healthy state right now, I'd run that over there immediately.
>>
>> Let's see what we can get from here.
>
>
> oh well, that went fast. actually with having low values for buffers (around 100MB) with caches
> around 20G or so, the performance was nevertheless super-low, I really had to drop
> the caches right now. This is the first time I see it with caches >10G happening, but hopefully
> this also provides a clue for you.
>
> Just after starting the stats I reset from previously defer to madvise - I suspect that this somehow
> caused the rapid reaction, since a few minutes later I saw that the free RAM jumped from 5GB to 10GB,
> after that I went afk, returning to the pc since my monitoring systems went crazy telling me about downtime.
>
> If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to its default, while it was
> on defer now for days, was a mistake, then please tell me.
>
> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
>

There we go again.

First of all, I had set up this monitoring on one host only; as a matter of
fact it did not occur on that single one for days and weeks, so I set it up
again on all the hosts, and now it just happened again on another one.

This issue is far from over, even after upgrading to the latest 4.18.12.

https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz

Please note: the trace_pipe is quite big in size, but it covers a full-RAM
to unused-RAM cycle within just ~24 hours; the measurements were initiated
right after echo 3 > drop_caches and stopped when the RAM was unused,
aka re-used, after another echo 3 in the end.

This issue has been alive for about half a year now, so any suggestions,
hints or solutions are greatly appreciated. Again, I can not possibly be
the only one experiencing this; I just may be among the few who actually
notice it and are indeed suffering from very poor performance with lots
of I/O on cache/buffers.

Also, I'd like to ask for a workaround until this is fixed someday:
echo 3 > drop_caches can take a very long time when the host is busy with
I/O in the background. According to some resources on the net, dropping
caches operates until some lower threshold is reached, which is less and
less likely to happen when the host is really busy. Could someone point out
which threshold this is, perhaps? I was thinking of e.g. this in
mm/vmscan.c:

void drop_slab_node(int nid)
{
        unsigned long freed;

        do {
                struct mem_cgroup *memcg = NULL;

                freed = 0;
                do {
                        freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
                } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
        } while (freed > 10);
}

Would it make sense to increase the `> 10` here to, for example, `> 100`?
I could easily adjust this, or any other relevant threshold, since I am
compiling the kernel in use anyway.

I'd just like dropping caches to be able to finish, as a workaround until
this issue is fixed. As mentioned, it can take hours on a busy host, and
during that time the host effectively hangs (has low performance), since
buffers/caches are not used while drop_caches is being set to 3, until the
freeing up is finished.
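
In the meantime, one way to at least see whether a running drop is still
making progress rather than hanging is to watch the slab counters from a
second shell while echo 2 > drop_caches runs, e.g.:

while true; do
        grep -E '^nr_slab_(reclaimable|unreclaimable)' /proc/vmstat
        sleep 5
done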

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-10-22  1:19                                                                   ` Marinko Catovic
@ 2018-10-23 17:41                                                                     ` Marinko Catovic
  2018-10-26  5:48                                                                       ` Marinko Catovic
  2018-10-26  8:01                                                                     ` Michal Hocko
                                                                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-10-23 17:41 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, Vlastimil Babka, Christopher Lameter

Am Mo., 22. Okt. 2018 um 03:19 Uhr schrieb Marinko Catovic
<marinko.catovic@gmail.com>:
>
> Am Mi., 29. Aug. 2018 um 18:44 Uhr schrieb Marinko Catovic
> <marinko.catovic@gmail.com>:
> >
> >
> >> > one host is at a healthy state right now, I'd run that over there immediately.
> >>
> >> Let's see what we can get from here.
> >
> >
> > oh well, that went fast. actually with having low values for buffers (around 100MB) with caches
> > around 20G or so, the performance was nevertheless super-low, I really had to drop
> > the caches right now. This is the first time I see it with caches >10G happening, but hopefully
> > this also provides a clue for you.
> >
> > Just after starting the stats I reset from previously defer to madvise - I suspect that this somehow
> > caused the rapid reaction, since a few minutes later I saw that the free RAM jumped from 5GB to 10GB,
> > after that I went afk, returning to the pc since my monitoring systems went crazy telling me about downtime.
> >
> > If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to its default, while it was
> > on defer now for days, was a mistake, then please tell me.
> >
> > here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
> > trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
> >
>
> There we go again.
>
> First of all, I have set up this monitoring on 1 host, as a matter of
> fact it did not occur on that single
> one for days and weeks now, so I set this up again on all the hosts
> and it just happened again on another one.
>
> This issue is far from over, even when upgrading to the latest 4.18.12
>
> https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz
>
> Please note: the trace_pipe is quite big in size, but it covers a
> full-RAM to unused-RAM within just ~24 hours,
> the measurements were initiated right after echo 3 > drop_caches and
> stopped when the RAM was unused
> aka re-used after another echo 3 in the end.
>
> This issue is alive for about half a year now, any suggestions, hints
> or solutions are greatly appreciated,
> again, I can not possibly be the only one experiencing this, I just
> may be among the few ones who actually
> notice this and are indeed suffering from very poor performance with
> lots of I/O on cache/buffers.
>
> Also, I'd like to ask for a workaround until this is fixed someday:
> echo 3 > drop_caches can take a very
> long time when the host is busy with I/O in the background. According
> to some resources in the net I discovered
> that dropping caches operates until some lower threshold is reached,
> which is less and less likely, when the
> host is really busy. Could one point out what threshold this is perhaps?
> I was thinking of e.g. mm/vmscan.c
>
>  549 void drop_slab_node(int nid)
>  550 {
>  551         unsigned long freed;
>  552
>  553         do {
>  554                 struct mem_cgroup *memcg = NULL;
>  555
>  556                 freed = 0;
>  557                 do {
>  558                         freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>  559                 } while ((memcg = mem_cgroup_iter(NULL, memcg,
> NULL)) != NULL);
>  560         } while (freed > 10);
>  561 }
>
> ..would it make sense to increase > 10 here with, for example, > 100 ?
> I could easily adjust this, or any other relevant threshold, since I
> am compiling the kernel in use.
>
> I'd just like it to be able to finish dropping caches to achieve the
> workaround here until this issue is fixed,
> which as mentioned, can take hours on a busy host, causing the host to
> hang (having low performance) since
> buffers/caches are not used at that time while drop_caches is being
> set to 3, until that freeing up is finished.

By the way, it now seems to happen on the mentioned host on a daily
basis, dropping to 100M/10G every 24 hours or so, which actually makes
it a lot easier to capture relevant data/stats, since it keeps
occurring over and over right now.

Strangely, the other hosts have not been affected for days now.
So if there is anything else you need to know besides the vmstat and
trace_pipe files, please let me know.


* Re: Caching/buffers become useless after some time
  2018-10-23 17:41                                                                     ` Marinko Catovic
@ 2018-10-26  5:48                                                                       ` Marinko Catovic
  0 siblings, 0 replies; 66+ messages in thread
From: Marinko Catovic @ 2018-10-26  5:48 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, Vlastimil Babka, Christopher Lameter

Am Di., 23. Okt. 2018 um 19:41 Uhr schrieb Marinko Catovic
<marinko.catovic@gmail.com>:
>
> Am Mo., 22. Okt. 2018 um 03:19 Uhr schrieb Marinko Catovic
> <marinko.catovic@gmail.com>:
> >
> > Am Mi., 29. Aug. 2018 um 18:44 Uhr schrieb Marinko Catovic
> > <marinko.catovic@gmail.com>:
> > >
> > >
> > >> > one host is at a healthy state right now, I'd run that over there immediately.
> > >>
> > >> Let's see what we can get from here.
> > >
> > >
> > > oh well, that went fast. actually with having low values for buffers (around 100MB) with caches
> > > around 20G or so, the performance was nevertheless super-low, I really had to drop
> > > the caches right now. This is the first time I see it with caches >10G happening, but hopefully
> > > this also provides a clue for you.
> > >
> > > Just after starting the stats I reset from previously defer to madvise - I suspect that this somehow
> > > caused the rapid reaction, since a few minutes later I saw that the free RAM jumped from 5GB to 10GB,
> > > after that I went afk, returning to the pc since my monitoring systems went crazy telling me about downtime.
> > >
> > > If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to its default, while it was
> > > on defer now for days, was a mistake, then please tell me.
> > >
> > > here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
> > > trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
> > >
> >
> > There we go again.
> >
> > First of all, I have set up this monitoring on 1 host, as a matter of
> > fact it did not occur on that single
> > one for days and weeks now, so I set this up again on all the hosts
> > and it just happened again on another one.
> >
> > This issue is far from over, even when upgrading to the latest 4.18.12
> >
> > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz
> >
> > Please note: the trace_pipe is quite big in size, but it covers a
> > full-RAM to unused-RAM within just ~24 hours,
> > the measurements were initiated right after echo 3 > drop_caches and
> > stopped when the RAM was unused
> > aka re-used after another echo 3 in the end.
> >
> > This issue is alive for about half a year now, any suggestions, hints
> > or solutions are greatly appreciated,
> > again, I can not possibly be the only one experiencing this, I just
> > may be among the few ones who actually
> > notice this and are indeed suffering from very poor performance with
> > lots of I/O on cache/buffers.
> >
> > Also, I'd like to ask for a workaround until this is fixed someday:
> > echo 3 > drop_caches can take a very
> > long time when the host is busy with I/O in the background. According
> > to some resources in the net I discovered
> > that dropping caches operates until some lower threshold is reached,
> > which is less and less likely, when the
> > host is really busy. Could one point out what threshold this is perhaps?
> > I was thinking of e.g. mm/vmscan.c
> >
> >  549 void drop_slab_node(int nid)
> >  550 {
> >  551         unsigned long freed;
> >  552
> >  553         do {
> >  554                 struct mem_cgroup *memcg = NULL;
> >  555
> >  556                 freed = 0;
> >  557                 do {
> >  558                         freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
> >  559                 } while ((memcg = mem_cgroup_iter(NULL, memcg,
> > NULL)) != NULL);
> >  560         } while (freed > 10);
> >  561 }
> >
> > ..would it make sense to increase > 10 here with, for example, > 100 ?
> > I could easily adjust this, or any other relevant threshold, since I
> > am compiling the kernel in use.
> >
> > I'd just like it to be able to finish dropping caches to achieve the
> > workaround here until this issue is fixed,
> > which as mentioned, can take hours on a busy host, causing the host to
> > hang (having low performance) since
> > buffers/caches are not used at that time while drop_caches is being
> > set to 3, until that freeing up is finished.
>
> by the way, it seems to happen on the one mentioned host on a daily
> basis now, like dropping
> to 100M/10G every 24 hours, so it is actually a lot easier now to
> capture relevant data/stats, since
> it occurs again and again right now.
>
> strangely, other hosts are currently not affected for days.
> So if there is anything you need to know, beside the vmstat and
> trace_pipe files, please let me know.

It has now happened again, for the 2nd time within 2 days, mainly on
the very same host I mentioned before and covered by the reports in my
previous reply, so I wanted to point out something I observed: earlier
I stated that the buffers were really low and the caches as well.
However, I have now seen for the second or third time that this applies
to the buffers far more significantly than to the caches. As an
example: 50MB of buffers were in use, yet 10GB of caches, still leaving
around 20GB of RAM totally unused. Note: buffers/caches were reliably
around 5GB/35GB in the healthy state before, so both are still getting
lower.

The performance dropped so much that all services on the host
basically stopped working because of the I/O wait, again. I tried to
summarize the file contents people asked me to post, so besides the
trace_pipe and vmstat folder from my previous post, here is another set
captured while in the 50MB-buffers state:

cat /proc/pagetypeinfo https://pastebin.com/W1sJscsZ
cat /proc/slabinfo     https://pastebin.com/9ZPU3q7X
cat /proc/zoneinfo     https://pastebin.com/RMTwtXGr

Hopefully you can read something from this.
As always, feel free to ask whatever info you'd like me to share.


* Re: Caching/buffers become useless after some time
  2018-10-22  1:19                                                                   ` Marinko Catovic
  2018-10-23 17:41                                                                     ` Marinko Catovic
@ 2018-10-26  8:01                                                                     ` Michal Hocko
  2018-10-26 23:31                                                                       ` Marinko Catovic
       [not found]                                                                     ` <6e3a9434-32f2-0388-e0c7-2bd1c2ebc8b1@suse.cz>
  2018-10-31 13:12                                                                     ` Vlastimil Babka
  3 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-10-26  8:01 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: linux-mm, Vlastimil Babka, Christopher Lameter

Sorry for the late reply. Busy as always...

On Mon 22-10-18 03:19:57, Marinko Catovic wrote:
[...]
> There we go again.
> 
> First of all, I have set up this monitoring on 1 host, as a matter of
> fact it did not occur on that single
> one for days and weeks now, so I set this up again on all the hosts
> and it just happened again on another one.
> 
> This issue is far from over, even when upgrading to the latest 4.18.12
> 
> https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz

I cannot download these. I am getting an invalid certificate and
a 403 when ignoring it.

[...]

> Also, I'd like to ask for a workaround until this is fixed someday:
> echo 3 > drop_caches can take a very
> long time when the host is busy with I/O in the background. According
> to some resources in the net I discovered
> that dropping caches operates until some lower threshold is reached,
> which is less and less likely, when the
> host is really busy. Could one point out what threshold this is perhaps?
> I was thinking of e.g. mm/vmscan.c
> 
>  549 void drop_slab_node(int nid)
>  550 {
>  551         unsigned long freed;
>  552
>  553         do {
>  554                 struct mem_cgroup *memcg = NULL;
>  555
>  556                 freed = 0;
>  557                 do {
>  558                         freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>  559                 } while ((memcg = mem_cgroup_iter(NULL, memcg,
> NULL)) != NULL);
>  560         } while (freed > 10);
>  561 }
> 
> ..would it make sense to increase > 10 here with, for example, > 100 ?
> I could easily adjust this, or any other relevant threshold, since I
> am compiling the kernel in use.
> 
> I'd just like it to be able to finish dropping caches to achieve the
> workaround here until this issue is fixed,
> which as mentioned, can take hours on a busy host, causing the host to
> hang (having low performance) since
> buffers/caches are not used at that time while drop_caches is being
> set to 3, until that freeing up is finished.

This is worth a separate discussion. Please start a new email thread.

-- 
Michal Hocko
SUSE Labs


* Re: Caching/buffers become useless after some time
  2018-10-26  8:01                                                                     ` Michal Hocko
@ 2018-10-26 23:31                                                                       ` Marinko Catovic
  2018-10-27  6:42                                                                         ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-10-26 23:31 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Vlastimil Babka, Christopher Lameter

Am Fr., 26. Okt. 2018 um 10:02 Uhr schrieb Michal Hocko <mhocko@suse.com>:
>
> Sorry for late reply. Busy as always...
>
> On Mon 22-10-18 03:19:57, Marinko Catovic wrote:
> [...]
> > There we go again.
> >
> > First of all, I have set up this monitoring on 1 host, as a matter of
> > fact it did not occur on that single
> > one for days and weeks now, so I set this up again on all the hosts
> > and it just happened again on another one.
> >
> > This issue is far from over, even when upgrading to the latest 4.18.12
> >
> > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz
>
> I cannot download these. I am getting an invalid certificate and
> 403 when ignoring it

Are you sure about that? I can download both just fine with different
browsers; the cert seems fine, and there is no 403.

> This is worth a separate discussion. Please start a new email thread.

I was merely looking for a real quick hotfix there in the meantime,
and also wondering why the '10' is hardcoded.


* Re: Caching/buffers become useless after some time
  2018-10-26 23:31                                                                       ` Marinko Catovic
@ 2018-10-27  6:42                                                                         ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2018-10-27  6:42 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: linux-mm, Vlastimil Babka, Christopher Lameter

On Sat 27-10-18 01:31:05, Marinko Catovic wrote:
> Am Fr., 26. Okt. 2018 um 10:02 Uhr schrieb Michal Hocko <mhocko@suse.com>:
> >
> > Sorry for late reply. Busy as always...
> >
> > On Mon 22-10-18 03:19:57, Marinko Catovic wrote:
> > [...]
> > > There we go again.
> > >
> > > First of all, I have set up this monitoring on 1 host, as a matter of
> > > fact it did not occur on that single
> > > one for days and weeks now, so I set this up again on all the hosts
> > > and it just happened again on another one.
> > >
> > > This issue is far from over, even when upgrading to the latest 4.18.12
> > >
> > > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> > > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz
> >
> > I cannot download these. I am getting an invalid certificate and
> > 403 when ignoring it
> 
> are you sure about that? I can download both just fine, different
> browsers, the cert seems fine, no 403 there.

Interesting. It works now from my home network. Something must have
been fishy in the office network when I tried the same thing there.

I have the files now. Will have a look on Monday at the earliest.
-- 
Michal Hocko
SUSE Labs


* Re: Caching/buffers become useless after some time
       [not found]                                                                     ` <6e3a9434-32f2-0388-e0c7-2bd1c2ebc8b1@suse.cz>
@ 2018-10-30 15:30                                                                       ` Michal Hocko
  2018-10-30 16:08                                                                         ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-10-30 15:30 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Marinko Catovic, linux-mm, Christopher Lameter

On Tue 30-10-18 14:44:27, Vlastimil Babka wrote:
> On 10/22/18 3:19 AM, Marinko Catovic wrote:
> > Am Mi., 29. Aug. 2018 um 18:44 Uhr schrieb Marinko Catovic
[...]
> >> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
> >> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
> >>
> > 
> > There we go again.
> > 
> > First of all, I have set up this monitoring on 1 host, as a matter of
> > fact it did not occur on that single
> > one for days and weeks now, so I set this up again on all the hosts
> > and it just happened again on another one.
> > 
> > This issue is far from over, even when upgrading to the latest 4.18.12
> > 
> > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz
> 
> I have plot the vmstat using the attached script, and got the attached
> plots. X axis are the vmstat snapshots, almost 14k of them, each for 5
> seconds, so almost 19 hours. I can see the following phases:

Thanks a lot. I like the script very much!

[...]

> 12000 - end:
> - free pages growing sharply
> - page cache declining sharply
> - slab still slowly declining

$ cat filter 
pgfree
pgsteal_
pgscan_
compact
nr_free_pages

$ grep -f filter -h vmstat.1539866837 vmstat.1539874353 | awk '{if (c[$1]) {printf "%s %d\n", $1, $2-c[$1]}; c[$1]=$2}'
nr_free_pages 4216371
pgfree 267884025
pgsteal_kswapd 0
pgsteal_direct 11890416
pgscan_kswapd 0
pgscan_direct 11937805
compact_migrate_scanned 2197060121
compact_free_scanned 4747491606
compact_isolated 54281848
compact_stall 1797
compact_fail 1721
compact_success 76

So we ended up with 16G of freed pages in that last time period.
Kswapd was sleeping the whole time, but direct reclaim was quite
active: ~46GB of pages recycled. Note that many more pages were freed,
which suggests there was quite a lot of memory allocation/free
activity.
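
(Assuming 4 KiB pages, the GB figures above follow directly from the
counter deltas:)

$ echo '4216371 * 4096 / 1024^3' | bc -l    # nr_free_pages delta -> ~16.1 GiB freed
$ echo '11890416 * 4096 / 1024^3' | bc -l   # pgsteal_direct      -> ~45.4 GiB directly reclaimed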

One notable thing here is that there shouldn't be any reason to do
direct reclaim when kswapd itself doesn't do anything. Either kswapd is
blocked on something (though I find it quite surprising to see it in
that state for the whole 1500s period), or we are simply not low on
free memory at all. The latter would point towards compaction-triggered
memory reclaim, which is accounted as direct reclaim as well. Direct
compaction triggered more than once a second on average. We shouldn't
really reclaim unless we are low on memory, but repeatedly failing
compaction could just add up and reclaim a lot in the end. There seem
to be quite a lot of low-order requests in your trace buffer:

$ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
   1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
  93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
That leaves us with 
   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

By and large the kernel stack allocations are in the lead. You can get
some relief by enabling CONFIG_VMAP_STACK. There is also a notable
number of THP page allocations. Just curious: are you running on a NUMA
machine? If yes, [1] might be relevant. Other than that nothing really
jumped out at me.

[1] http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
-- 
Michal Hocko
SUSE Labs


* Re: Caching/buffers become useless after some time
  2018-10-30 15:30                                                                       ` Michal Hocko
@ 2018-10-30 16:08                                                                         ` Marinko Catovic
  2018-10-30 17:00                                                                           ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-10-30 16:08 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

Am Di., 30. Okt. 2018 um 16:30 Uhr schrieb Michal Hocko <mhocko@suse.com>:
>
> On Tue 30-10-18 14:44:27, Vlastimil Babka wrote:
> > On 10/22/18 3:19 AM, Marinko Catovic wrote:
> > > Am Mi., 29. Aug. 2018 um 18:44 Uhr schrieb Marinko Catovic
> [...]
> > >> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
> > >> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
> > >>
> > >
> > > There we go again.
> > >
> > > First of all, I have set up this monitoring on 1 host, as a matter of
> > > fact it did not occur on that single
> > > one for days and weeks now, so I set this up again on all the hosts
> > > and it just happened again on another one.
> > >
> > > This issue is far from over, even when upgrading to the latest 4.18.12
> > >
> > > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> > > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz
> >
> > I have plot the vmstat using the attached script, and got the attached
> > plots. X axis are the vmstat snapshots, almost 14k of them, each for 5
> > seconds, so almost 19 hours. I can see the following phases:
>
> Thanks a lot. I like the script much!
>
> [...]
>
> > 12000 - end:
> > - free pages growing sharply
> > - page cache declining sharply
> > - slab still slowly declining
>
> $ cat filter
> pgfree
> pgsteal_
> pgscan_
> compact
> nr_free_pages
>
> $ grep -f filter -h vmstat.1539866837 vmstat.1539874353 | awk '{if (c[$1]) {printf "%s %d\n", $1, $2-c[$1]}; c[$1]=$2}'
> nr_free_pages 4216371
> pgfree 267884025
> pgsteal_kswapd 0
> pgsteal_direct 11890416
> pgscan_kswapd 0
> pgscan_direct 11937805
> compact_migrate_scanned 2197060121
> compact_free_scanned 4747491606
> compact_isolated 54281848
> compact_stall 1797
> compact_fail 1721
> compact_success 76
>
> So we have ended up with 16G freed pages in that last time period.
> Kswapd was sleeping throughout the time but direct reclaim was quite
> active. ~46GB pages recycled. Note that much more pages were freed which
> suggests there was quite a large memory allocation/free activity.
>
> One notable thing here is that there shouldn't be any reason to do the
> direct reclaim when kswapd itself doesn't do anything. It could be
> either blocked on something but I find it quite surprising to see it in
> that state for the whole 1500s time period or we are simply not low on
> free memory at all. That would point towards compaction triggered memory
> reclaim which account as the direct reclaim as well. The direct
> compaction triggered more than once a second in average. We shouldn't
> really reclaim unless we are low on memory but repeatedly failing
> compaction could just add up and reclaim a lot in the end. There seem to
> be quite a lot of low order request as per your trace buffer
>
> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
>    1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>  783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>  797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>   93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
>  498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>  243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>
> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
> That leaves us with
>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>
> by large the kernel stack allocations are in lead. You can put some
> relief by enabling CONFIG_VMAP_STACK. There is alos a notable number of
> THP pages allocations. Just curious are you running on a NUMA machine?
> If yes [1] might be relevant. Other than that nothing really jumped at
> me.
>
> [1] http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
> --
> Michal Hocko
> SUSE Labs

thanks a lot Vlastimil!

I would not really know whether this is a NUMA machine; it is a fairly
ordinary server running with an i7-8700 and ECC RAM. How would I find
out?
So I should set CONFIG_VMAP_STACK=y and try that..?


* Re: Caching/buffers become useless after some time
  2018-10-30 16:08                                                                         ` Marinko Catovic
@ 2018-10-30 17:00                                                                           ` Vlastimil Babka
  2018-10-30 18:26                                                                             ` Marinko Catovic
                                                                                               ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Vlastimil Babka @ 2018-10-30 17:00 UTC (permalink / raw)
  To: Marinko Catovic, Michal Hocko; +Cc: linux-mm, Christopher Lameter

On 10/30/18 5:08 PM, Marinko Catovic wrote:
>> One notable thing here is that there shouldn't be any reason to do the
>> direct reclaim when kswapd itself doesn't do anything. It could be
>> either blocked on something but I find it quite surprising to see it in
>> that state for the whole 1500s time period or we are simply not low on
>> free memory at all. That would point towards compaction triggered memory
>> reclaim which account as the direct reclaim as well. The direct
>> compaction triggered more than once a second in average. We shouldn't
>> really reclaim unless we are low on memory but repeatedly failing
>> compaction could just add up and reclaim a lot in the end. There seem to
>> be quite a lot of low order request as per your trace buffer

I realized that the fact that the slabs grew so large might be very
relevant. It means a lot of unmovable pages, and while they are slowly
being freed, the remaining ones are scattered all over memory, making
it impossible to compact successfully until the slabs are almost
*completely* freed. It is in fact the theoretical worst-case scenario
for compaction and fragmentation avoidance. Next time it would be nice
to also gather /proc/pagetypeinfo and /proc/slabinfo to see what grew
so much there (probably dentries and inodes).

The question is why the problems happened only some time after the
unmovable pollution. The trace showed me that the structure of the
allocations wrt order+flags, as Michal breaks them down below, is not
significantly different in the last phase than in the whole trace.
Possibly the state of memory gradually changed so that the various
heuristics (fragindex, pageblock skip bits etc.) resulted in compaction
being tried more often than initially, eventually hitting a very bad
corner case.

>> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
>>    1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>>  783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>  797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>>   93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
>>  498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>>  243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
>>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>>
>> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
>> That leaves us with
>>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO

I suspect there are lots of short-lived processes, so these are
probably rapidly recycled and not causing compaction. It also seems to
be pgd allocations (2 pages due to PTI), not kernel stacks?

>>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

I would again suspect those. IIRC we already confirmed earlier that
the THP defrag setting is madvise or madvise+defer, and that there are
processes using madvise(MADV_HUGEPAGE)? Did you ever try changing
defrag to plain 'defer'?
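
For reference, the current mode and a runtime switch (no reboot needed)
would look like this; the bracketed entry is the active one:

$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
$ echo defer > /sys/kernel/mm/transparent_hugepage/defrag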

>>
>> by large the kernel stack allocations are in lead. You can put some
>> relief by enabling CONFIG_VMAP_STACK. There is alos a notable number of
>> THP pages allocations. Just curious are you running on a NUMA machine?
>> If yes [1] might be relevant. Other than that nothing really jumped at
>> me.


> thanks a lot Vlastimil!

And Michal :)

> I would not really know whether this is a NUMA, it is some usual
> server running with a i7-8700
> and ECC RAM. How would I find out?

Please provide /proc/zoneinfo and we'll see.

> So I should do CONFIG_VMAP_STACK=y and try that..?

I suspect you already have it.
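
Both can be checked on the running system; the config file location is
an assumption and differs per distro/build:

$ grep CONFIG_VMAP_STACK /boot/config-$(uname -r)    # or: zcat /proc/config.gz | grep VMAP_STACK
$ ls -d /sys/devices/system/node/node*               # only node0 present means a single NUMA node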


* Re: Caching/buffers become useless after some time
  2018-10-30 17:00                                                                           ` Vlastimil Babka
@ 2018-10-30 18:26                                                                             ` Marinko Catovic
  2018-10-31  7:34                                                                               ` Michal Hocko
  2018-10-31  7:32                                                                             ` Michal Hocko
  2018-10-31 13:40                                                                             ` Vlastimil Babka
  2 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-10-30 18:26 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter

Am Di., 30. Okt. 2018 um 18:03 Uhr schrieb Vlastimil Babka <vbabka@suse.cz>:
>
> On 10/30/18 5:08 PM, Marinko Catovic wrote:
> >> One notable thing here is that there shouldn't be any reason to do the
> >> direct reclaim when kswapd itself doesn't do anything. It could be
> >> either blocked on something but I find it quite surprising to see it in
> >> that state for the whole 1500s time period or we are simply not low on
> >> free memory at all. That would point towards compaction triggered memory
> >> reclaim which account as the direct reclaim as well. The direct
> >> compaction triggered more than once a second in average. We shouldn't
> >> really reclaim unless we are low on memory but repeatedly failing
> >> compaction could just add up and reclaim a lot in the end. There seem to
> >> be quite a lot of low order request as per your trace buffer
>
> I realized that the fact that slabs grew so large might be very
> relevant. It means a lot of unmovable pages, and while they are slowly
> being freed, the remaining are scattered all over the memory, making it
> impossible to successfully compact, until the slabs are almost
> *completely* freed. It's in fact the theoretical worst case scenario for
> compaction and fragmentation avoidance. Next time it would be nice to
> also gather /proc/pagetypeinfo, and /proc/slabinfo to see what grew so
> much there (probably dentries and inodes).

How would you like the results? As a job collecting those every 5
seconds from echo 3 > drop_caches until the worst case occurs (which
may take up to 24 hours), or at some specific point in time?
Please note that I already provided them (see my earlier reply) as a
one-time snapshot taken while in the worst case:

cat /proc/pagetypeinfo https://pastebin.com/W1sJscsZ
cat /proc/slabinfo     https://pastebin.com/9ZPU3q7X
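
If a periodic job is preferred, a minimal sketch could look like this
(the 5-minute interval and the output directory are just assumptions):

mkdir -p /root/mm-snapshots
while true; do
    for f in pagetypeinfo slabinfo zoneinfo; do
        cp /proc/$f /root/mm-snapshots/$f.$(date +%s)
    done
    sleep 300
done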

> The question is why the problems happened some time later after the
> unmovable pollution. The trace showed me that the structure of
> allocations wrt order+flags as Michal breaks them down below, is not
> significanly different in the last phase than in the whole trace.
> Possibly the state of memory gradually changed so that the various
> heuristics (fragindex, pageblock skip bits etc) resulted in compaction
> being tried more than initially, eventually hitting a very bad corner case.
>
> >> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
> >>    1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
> >>  783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>  797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >>   93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
> >>  498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >>  243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
> >>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
> >>
> >> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
> >> That leaves us with
> >>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>
> I suspect there are lots of short-lived processes, so these are probably
> rapidly recycled and not causing compaction.

Well yes, since this is shared hosting there are lots of users running
lots of scripts, perhaps 5-50 new forks and kills every second
depending on load; hard to tell exactly.

> It also seems to be pgd allocation (2 pages due to PTI) not kernel stack?

Plain English, please? :)

> >>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>
> I would again suspect those. IIRC we already confirmed earlier that THP
> defrag setting is madvise or madvise+defer, and there are
> madvise(MADV_HUGEPAGE) using processes? Did you ever try changing defrag
> to plain 'defer'?

Yes, I think I mentioned this before. AFAIK it did not make an
(immediate) difference; madvise is the current setting.

> and there are madvise(MADV_HUGEPAGE) using processes?

Can't tell you that, I'm afraid.

> >>
> >> by large the kernel stack allocations are in lead. You can put some
> >> relief by enabling CONFIG_VMAP_STACK. There is alos a notable number of
> >> THP pages allocations. Just curious are you running on a NUMA machine?
> >> If yes [1] might be relevant. Other than that nothing really jumped at
> >> me.
>
>
> > thanks a lot Vlastimil!
>
> And Michal :)
>
> > I would not really know whether this is a NUMA, it is some usual
> > server running with a i7-8700
> > and ECC RAM. How would I find out?
>
> Please provide /proc/zoneinfo and we'll see.

there you go: cat /proc/zoneinfo     https://pastebin.com/RMTwtXGr

> > So I should do CONFIG_VMAP_STACK=y and try that..?
>
> I suspect you already have it.

Yes, true: the currently running kernel has that set to =y.


* Re: Caching/buffers become useless after some time
  2018-10-30 17:00                                                                           ` Vlastimil Babka
  2018-10-30 18:26                                                                             ` Marinko Catovic
@ 2018-10-31  7:32                                                                             ` Michal Hocko
  2018-10-31 13:40                                                                             ` Vlastimil Babka
  2 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2018-10-31  7:32 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Marinko Catovic, linux-mm, Christopher Lameter

On Tue 30-10-18 18:00:23, Vlastimil Babka wrote:
[...]
> I suspect there are lots of short-lived processes, so these are probably
> rapidly recycled and not causing compaction. It also seems to be pgd
> allocation (2 pages due to PTI) not kernel stack?

I guess you are right. I misread the order=2 yesterday; an order=1
stack would be quite unexpected.
-- 
Michal Hocko
SUSE Labs


* Re: Caching/buffers become useless after some time
  2018-10-30 18:26                                                                             ` Marinko Catovic
@ 2018-10-31  7:34                                                                               ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2018-10-31  7:34 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

On Tue 30-10-18 19:26:32, Marinko Catovic wrote:
[...]
> > > I would not really know whether this is a NUMA, it is some usual
> > > server running with a i7-8700
> > > and ECC RAM. How would I find out?
> >
> > Please provide /proc/zoneinfo and we'll see.
> 
> there you go: cat /proc/zoneinfo     https://pastebin.com/RMTwtXGr

Nope, a single-node machine, so no NUMA.
-- 
Michal Hocko
SUSE Labs


* Caching/buffers become useless after some time
  2018-10-22  1:19                                                                   ` Marinko Catovic
                                                                                       ` (2 preceding siblings ...)
       [not found]                                                                     ` <6e3a9434-32f2-0388-e0c7-2bd1c2ebc8b1@suse.cz>
@ 2018-10-31 13:12                                                                     ` Vlastimil Babka
  3 siblings, 0 replies; 66+ messages in thread
From: Vlastimil Babka @ 2018-10-31 13:12 UTC (permalink / raw)
  To: Marinko Catovic, Michal Hocko, linux-mm, Christopher Lameter

Resending for lists which dropped my mail due to attachments. Sorry.
plots: https://nofile.io/f/ogwbrwhwBU7/plots.tar.bz2
R script:


files <- Sys.glob("vmstat.1*")               # all /proc/vmstat snapshot files

results <- read.table(files[1], row.names=1) # counter names become the row names

for (file in files[-1]) {
	tmp2 <- read.table(file)$V2          # counter values of this snapshot
	results <- cbind(results, tmp2)      # one column per snapshot
}

for (row in row.names(results)) {
	# one PNG per vmstat counter, plotted over the snapshot index
	png(paste("plots/", row, ".png", sep=""), width=1900, height=1150)
	plot(t(as.vector(results[row,])), main=row)
	dev.off()
}

On 10/22/18 3:19 AM, Marinko Catovic wrote:
> Am Mi., 29. Aug. 2018 um 18:44 Uhr schrieb Marinko Catovic
> <marinko.catovic@gmail.com>:
>>
>>
>>>> one host is at a healthy state right now, I'd run that over there immediately.
>>>
>>> Let's see what we can get from here.
>>
>>
>> oh well, that went fast. actually with having low values for buffers (around 100MB) with caches
>> around 20G or so, the performance was nevertheless super-low, I really had to drop
>> the caches right now. This is the first time I see it with caches >10G happening, but hopefully
>> this also provides a clue for you.
>>
>> Just after starting the stats I reset from previously defer to madvise - I suspect that this somehow
>> caused the rapid reaction, since a few minutes later I saw that the free RAM jumped from 5GB to 10GB,
>> after that I went afk, returning to the pc since my monitoring systems went crazy telling me about downtime.
>>
>> If you think changing /sys/kernel/mm/transparent_hugepage/defrag back to its default, while it was
>> on defer now for days, was a mistake, then please tell me.
>>
>> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
>> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
>>
> 
> There we go again.
> 
> First of all, I have set up this monitoring on 1 host, as a matter of
> fact it did not occur on that single
> one for days and weeks now, so I set this up again on all the hosts
> and it just happened again on another one.
> 
> This issue is far from over, even when upgrading to the latest 4.18.12
> 
> https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz

I have plotted the vmstat data using the attached script and got the
attached plots. The X axis is the vmstat snapshots, almost 14k of them
at 5 seconds each, so almost 19 hours. I can see the following phases:

0 - 2000:
- free memory (nr_free_pages) drops from 48GB to the minimum allowed
by watermarks
- page cache (nr_file_pages) grows correspondingly

2000 - 6000:
- reclaimable slab (nr_slab_reclaimable) grows up to 40GB; unreclaimable
slab shows the same trend but much less
- page cache shrinks correspondingly
- free memory remains at the minimum

6000 - 12000:
- slab usage is slowly declining
- page cache slowly growing but there are hiccups
- free pages at minimum, growing after 9000, oscillating between 10000
and 12000

12000 - end:
- free pages growing sharply
- page cache declining sharply
- slab still slowly declining

I guess the original problem manifests in the last phase. There might
be a secondary issue with the slab usage between 2000 and 6000, but it
doesn't seem immediately connected (?).

I can see that compaction activity (but not compaction success)
increased a lot in the last phase, while direct reclaim is steady from
2000 onwards. This would again suggest high-order allocations. THP
doesn't seem to be the cause.

Vlastimil


* Re: Caching/buffers become useless after some time
  2018-10-30 17:00                                                                           ` Vlastimil Babka
  2018-10-30 18:26                                                                             ` Marinko Catovic
  2018-10-31  7:32                                                                             ` Michal Hocko
@ 2018-10-31 13:40                                                                             ` Vlastimil Babka
  2018-10-31 14:53                                                                               ` Marinko Catovic
  2 siblings, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-10-31 13:40 UTC (permalink / raw)
  To: Marinko Catovic, Michal Hocko; +Cc: linux-mm, Christopher Lameter

On 10/30/18 6:00 PM, Vlastimil Babka wrote:
> On 10/30/18 5:08 PM, Marinko Catovic wrote:
>>> One notable thing here is that there shouldn't be any reason to do the
>>> direct reclaim when kswapd itself doesn't do anything. It could be
>>> either blocked on something but I find it quite surprising to see it in
>>> that state for the whole 1500s time period or we are simply not low on
>>> free memory at all. That would point towards compaction triggered memory
>>> reclaim which account as the direct reclaim as well. The direct
>>> compaction triggered more than once a second in average. We shouldn't
>>> really reclaim unless we are low on memory but repeatedly failing
>>> compaction could just add up and reclaim a lot in the end. There seem to
>>> be quite a lot of low order request as per your trace buffer
> 
> I realized that the fact that slabs grew so large might be very
> relevant. It means a lot of unmovable pages, and while they are slowly
> being freed, the remaining are scattered all over the memory, making it
> impossible to successfully compact, until the slabs are almost
> *completely* freed. It's in fact the theoretical worst case scenario for
> compaction and fragmentation avoidance. Next time it would be nice to
> also gather /proc/pagetypeinfo, and /proc/slabinfo to see what grew so
> much there (probably dentries and inodes).

I went through the whole thread again as it was spread over months, and
finally connected some dots. In one mail you said:

> There is one thing I forgot to mention: the hosts perform find and du (I mean the commands, finding files and disk usage)
> on the HDDs every night, starting from 00:20 AM up until in the morning 07:45 AM, for maintenance and stats.

The timespan above roughly matches the phase where the reclaimable
slab grows (samples 2000-6000 at 5 seconds each is roughly 5.5 hours).
The find will fetch a lot of metadata into dentries, inodes etc., which
are part of the reclaimable slabs. In another mail you posted a
slabinfo https://pastebin.com/81QAFgke from the phase where it is
already being slowly reclaimed, but it still occupies 6.5GB, and it is
mostly ext4_inode_cache and dentry cache (also very much internally
fragmented). In yet another mail I suggested that maybe fragmentation
happened because the slab filled up much more at some point, and I
think we now have that solidly confirmed from the vmstat plots.

I think one workaround is for you to perform echo 2 > drop_caches (not
3) right after the find/du maintenance finishes. At that point you
don't have much page cache anyway, since the slabs have pushed it out.
It's also overnight, so there are not many users yet?
Alternatively the find/du could run in a memcg limiting its slab use.
Michal would know the details.

Long term, we should do something about these slab objects that are
only used briefly (once?); there's no point in caching them and letting
the cache grow like this.

> The question is why the problems happened some time later after the
> unmovable pollution. The trace showed me that the structure of
> allocations wrt order+flags as Michal breaks them down below, is not
> significanly different in the last phase than in the whole trace.
> Possibly the state of memory gradually changed so that the various
> heuristics (fragindex, pageblock skip bits etc) resulted in compaction
> being tried more than initially, eventually hitting a very bad corner case.

This is still an open question. Why do we overreclaim that much? If we
can trust one of the older pagetypeinfo snapshots
https://pastebin.com/6QWEZagL then of those below, only the THP
allocations should need reclaim/compaction. Maybe the order-7 ones as
well, but there are just a few of those and they are __GFP_NORETRY.

Maybe also enable the tracing events (in addition to page alloc)
compaction/mm_compaction_try_to_compact_pages and
compaction/mm_compaction_suitable?
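
Assuming the same tracefs setup already used for the page alloc events,
enabling them would be roughly:

cd /sys/kernel/debug/tracing
echo 1 > events/compaction/mm_compaction_try_to_compact_pages/enable
echo 1 > events/compaction/mm_compaction_suitable/enable
# the page alloc event (events/kmem/mm_page_alloc) stays enabled as before;
# everything still shows up in trace_pipe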

>>> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
>>> That leaves us with
>>>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
> 
> I suspect there are lots of short-lived processes, so these are probably
> rapidly recycled and not causing compaction. It also seems to be pgd
> allocation (2 pages due to PTI) not kernel stack?
> 
>>>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
> 
> I would again suspect those. IIRC we already confirmed earlier that THP
> defrag setting is madvise or madvise+defer, and there are
> madvise(MADV_HUGEPAGE) using processes? Did you ever try changing defrag
> to plain 'defer'?
> 
>>>
>>> by large the kernel stack allocations are in lead. You can put some
>>> relief by enabling CONFIG_VMAP_STACK. There is alos a notable number of
>>> THP pages allocations. Just curious are you running on a NUMA machine?
>>> If yes [1] might be relevant. Other than that nothing really jumped at
>>> me.
> 
> 
>> thanks a lot Vlastimil!
> 
> And Michal :)
> 
>> I would not really know whether this is a NUMA, it is some usual
>> server running with a i7-8700
>> and ECC RAM. How would I find out?
> 
> Please provide /proc/zoneinfo and we'll see.
> 
>> So I should do CONFIG_VMAP_STACK=y and try that..?
> 
> I suspect you already have it.
> 


* Re: Caching/buffers become useless after some time
  2018-10-31 13:40                                                                             ` Vlastimil Babka
@ 2018-10-31 14:53                                                                               ` Marinko Catovic
  2018-10-31 17:01                                                                                 ` Michal Hocko
  2018-11-02 14:59                                                                                 ` Vlastimil Babka
  0 siblings, 2 replies; 66+ messages in thread
From: Marinko Catovic @ 2018-10-31 14:53 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter

> I went through the whole thread again as it was spread over months, and
> finally connected some dots. In one mail you said:
>
> > There is one thing I forgot to mention: the hosts perform find and du (I mean the commands, finding files and disk usage)
> > on the HDDs every night, starting from 00:20 AM up until in the morning 07:45 AM, for maintenance and stats.
>
> The timespan above roughly matches the phase where reclaimable slab grow
> (samples 2000-6000 over 5 seconds is roughly 5.5 hours). The find will
> fetch a lots of metadata in dentries, inodes etc. which are part of
> reclaimable slabs. In other mail you posted a slabinfo
> https://pastebin.com/81QAFgke in the phase where it's already being
> slowly reclaimed, but still occupies 6.5GB, and mostly it's
> ext4_inode_cache, and dentry cache (also very much internally fragmented).
> In another mail I suggest that maybe fragmentation happened because the
> slab filled up much more at some point, and I think we now have that
> solidly confirmed from the vmstat plots.
> I think one workaround is for you to perform echo 2 > drop_caches (not
> 3) right after the find/du maintenance finishes. At that point you don't
> have too much page cache anyway, since the slabs have pushed it out.
> It's also overnight so there are not many users yet?
> Alternatively the find/du could run in a memcg limiting its slab use.
> Michal would know the details.
>
> Long term we should do something about these slab objects that are only
> used briefly (once?) so there's no point in caching them and letting the
> cache grow like this.
>

Well, caching anything the find/du operations touch is not necessary
imho anyway, since walking over all these millions of files in that
time period is really not worth caching at all. If there is a way, as
you mentioned, to limit those commands, that would be great.
Also I want to mention that these operations were in use with 3.x
kernels as well, for years, with absolutely zero issues.

2 > drop_caches right after that is something I considered, I just had
some bad experience with this, since I tried it around 5:00 AM in the
first place to give it enough spare time to finish, since sync; echo 2
> drop_caches can take some time, hence my question about lowering the
limits in mm/vmscan.c, void drop_slab_node(int nid)

I could do this effectively right after find/du at 07:45, just hoping
that this is finished soon enough - in one worst case it took over 2
hours (from 05:00 AM to 07:00 AM), since the host was busy during that
time with find/du, never having freed enough caches to continue, hence
my question to let it stop earlier with the modification of
drop_slab_node ... it was just an idea, nevermind if you believe that
it was a bad one :)
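
Just to make that workaround concrete, what I have in mind is a single cron
entry right after the maintenance window, e.g. (the 07:50 time is only an
example, assuming find/du is really done by then):

# root's crontab (hypothetical schedule)
50 7 * * * /bin/sh -c 'sync; echo 2 > /proc/sys/vm/drop_caches'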

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-10-31 14:53                                                                               ` Marinko Catovic
@ 2018-10-31 17:01                                                                                 ` Michal Hocko
  2018-10-31 19:21                                                                                   ` Marinko Catovic
  2018-11-02 14:59                                                                                 ` Vlastimil Babka
  1 sibling, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-10-31 17:01 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

On Wed 31-10-18 15:53:44, Marinko Catovic wrote:
[...]
> Well caching of any operations with find/du is not necessary imho
> anyway, since walking over all these millions of files in that time
> period is really not worth caching at all - if there is a way you
> mentioned to limit the commands there, that would be great.

One possible way would be to run this find/du workload inside a memory
cgroup with high limit set to something reasonable (that will likely
require some tuning). I am not 100% sure that will behave for metadata
mostly workload without almost any pagecache to reclaim so it might turn
out this will result in other issues. But it is definitely worth trying.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-10-31 17:01                                                                                 ` Michal Hocko
@ 2018-10-31 19:21                                                                                   ` Marinko Catovic
  2018-11-01 13:23                                                                                     ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-10-31 19:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>:
>
> On Wed 31-10-18 15:53:44, Marinko Catovic wrote:
> [...]
> > Well caching of any operations with find/du is not necessary imho
> > anyway, since walking over all these millions of files in that time
> > period is really not worth caching at all - if there is a way you
> > mentioned to limit the commands there, that would be great.
>
> One possible way would be to run this find/du workload inside a memory
> cgroup with high limit set to something reasonable (that will likely
> require some tuning). I am not 100% sure that will behave for metadata
> mostly workload without almost any pagecache to reclaim so it might turn
> out this will result in other issues. But it is definitely worth trying.

hm, how would that be possible..? every user has its UID, the group
can also not be a factor, since this memory restriction would apply to
all users then, find/du are running as UID 0 to have access to
everyone's data.

so what is the conclusion from this issue now btw? is it something
that will be changed/fixed at any time?
As I understand everyone would have this issue when extensive walking
over files is performed, basically any `cloud`, shared hosting or
storage systems should experience it, true?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-10-31 19:21                                                                                   ` Marinko Catovic
@ 2018-11-01 13:23                                                                                     ` Michal Hocko
  2018-11-01 22:46                                                                                       ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-11-01 13:23 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

On Wed 31-10-18 20:21:42, Marinko Catovic wrote:
> Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>:
> >
> > On Wed 31-10-18 15:53:44, Marinko Catovic wrote:
> > [...]
> > > Well caching of any operations with find/du is not necessary imho
> > > anyway, since walking over all these millions of files in that time
> > > period is really not worth caching at all - if there is a way you
> > > mentioned to limit the commands there, that would be great.
> >
> > One possible way would be to run this find/du workload inside a memory
> > cgroup with high limit set to something reasonable (that will likely
> > require some tuning). I am not 100% sure that will behave for metadata
> > mostly workload without almost any pagecache to reclaim so it might turn
> > out this will result in other issues. But it is definitely worth trying.
> 
> hm, how would that be possible..? every user has its UID, the group
> can also not be a factor, since this memory restriction would apply to
> all users then, find/du are running as UID 0 to have access to
> everyone's data.

I thought you have a dedicated script(s) to do all the stats. All you
need is to run that particular script(s) within a memory cgroup
 
> so what is the conclusion from this issue now btw? is it something
> that will be changed/fixed at any time?

It is likely that you are triggering pathological memory fragmentation
with a lot of unmovable objects that prevent it from getting resolved. That
leads to memory over-reclaim to make forward progress. A hard nut to
crack, but something that is definitely on the radar to be solved
eventually. So far we have been quite lucky not to trigger it that
badly.

> As I understand everyone would have this issue when extensive walking
> over files is performed, basically any `cloud`, shared hosting or
> storage systems should experience it, true?

Not really. You need also a high demand for high order allocations to
require contiguous physical memory. Maybe there is something in your
workload triggering this particular pattern.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-01 13:23                                                                                     ` Michal Hocko
@ 2018-11-01 22:46                                                                                       ` Marinko Catovic
  2018-11-02  8:05                                                                                         ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-11-01 22:46 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

Am Do., 1. Nov. 2018 um 14:23 Uhr schrieb Michal Hocko <mhocko@suse.com>:
>
> On Wed 31-10-18 20:21:42, Marinko Catovic wrote:
> > Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>:
> > >
> > > On Wed 31-10-18 15:53:44, Marinko Catovic wrote:
> > > [...]
> > > > Well caching of any operations with find/du is not necessary imho
> > > > anyway, since walking over all these millions of files in that time
> > > > period is really not worth caching at all - if there is a way you
> > > > mentioned to limit the commands there, that would be great.
> > >
> > > One possible way would be to run this find/du workload inside a memory
> > > cgroup with high limit set to something reasonable (that will likely
> > > require some tuning). I am not 100% sure that will behave for metadata
> > > mostly workload without almost any pagecache to reclaim so it might turn
> > > out this will result in other issues. But it is definitely worth trying.
> >
> > hm, how would that be possible..? every user has its UID, the group
> > can also not be a factor, since this memory restriction would apply to
> > all users then, find/du are running as UID 0 to have access to
> > everyone's data.
>
> I thought you have a dedicated script(s) to do all the stats. All you
> need is to run that particular script(s) within a memory cgroup

yes, that is the case - the scripts are running as root, since as
mentioned all users have own UIDs and specific groups, so to have
access one would need root privileges.
My question was how to limit this using cgroups, since afaik limits
there apply to given UIDs/GIDs

> > so what is the conclusion from this issue now btw? is it something
> > that will be changed/fixed at any time?
>
> It is likely that you are triggering pathological memory fragmentation
> with a lot of unmovable objects that prevent it from getting resolved. That
> leads to memory over-reclaim to make forward progress. A hard nut to
> crack, but something that is definitely on the radar to be solved
> eventually. So far we have been quite lucky not to trigger it that
> badly.

good to hear :)

> > As I understand everyone would have this issue when extensive walking
> > over files is performed, basically any `cloud`, shared hosting or
> > storage systems should experience it, true?
>
> Not really. You need also a high demand for high order allocations to
> require contiguous physical memory. Maybe there is something in your
> workload triggering this particular pattern.

I would not even know what triggers it, nor what it has to do with
high order, I'm just running find/du, nothing special I'd say.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-01 22:46                                                                                       ` Marinko Catovic
@ 2018-11-02  8:05                                                                                         ` Michal Hocko
  2018-11-02 11:31                                                                                           ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-11-02  8:05 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

On Thu 01-11-18 23:46:27, Marinko Catovic wrote:
> Am Do., 1. Nov. 2018 um 14:23 Uhr schrieb Michal Hocko <mhocko@suse.com>:
> >
> > On Wed 31-10-18 20:21:42, Marinko Catovic wrote:
> > > Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>:
> > > >
> > > > On Wed 31-10-18 15:53:44, Marinko Catovic wrote:
> > > > [...]
> > > > > Well caching of any operations with find/du is not necessary imho
> > > > > anyway, since walking over all these millions of files in that time
> > > > > period is really not worth caching at all - if there is a way you
> > > > > mentioned to limit the commands there, that would be great.
> > > >
> > > > One possible way would be to run this find/du workload inside a memory
> > > > cgroup with high limit set to something reasonable (that will likely
> > > > require some tuning). I am not 100% sure that will behave for metadata
> > > > mostly workload without almost any pagecache to reclaim so it might turn
> > > > out this will result in other issues. But it is definitely worth trying.
> > >
> > > hm, how would that be possible..? every user has its UID, the group
> > > can also not be a factor, since this memory restriction would apply to
> > > all users then, find/du are running as UID 0 to have access to
> > > everyone's data.
> >
> > I thought you have a dedicated script(s) to do all the stats. All you
> > need is to run that particular script(s) within a memory cgroup
> 
> yes, that is the case - the scripts are running as root, since as
> mentioned all users have own UIDs and specific groups, so to have
> access one would need root privileges.
> My question was how to limit this using cgroups, since afaik limits
> there apply to given UIDs/GIDs

No. Limits apply to a specific memory cgroup and all tasks which are
associated with it. There are many tutorials on how to configure/use
memory cgroups or cgroups in general. If I were you I would simply do
this

mount -t cgroup -o memory none $SOME_MOUNTPOINT
mkdir $SOME_MOUNTPOINT/A
echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes

Your script then just do
echo $$ > $SOME_MOUNTPOINT/A/tasks
# rest of your script
echo 1 > $SOME_MOUNTPOINT/A/memory.force_empty

That should drop the memory cached on behalf of the memcg A including the
metadata.
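
To be concrete, the nightly job could then be wrapped like this (just a
sketch; $SOME_MOUNTPOINT and the 500M limit are the example values from
above, and the maintenance script name is hypothetical):

#!/bin/sh
# confine this shell and everything it spawns to memcg A
echo $$ > $SOME_MOUNTPOINT/A/tasks
# run the nightly find/du maintenance (hypothetical script name)
/usr/local/sbin/nightly_stats.sh
# drop what the job cached on behalf of memcg A, including the metadata
echo 1 > $SOME_MOUNTPOINT/A/memory.force_empty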


[...]
> > > As I understand everyone would have this issue when extensive walking
> > > over files is performed, basically any `cloud`, shared hosting or
> > > storage systems should experience it, true?
> >
> > Not really. You need also a high demand for high order allocations to
> > require contiguous physical memory. Maybe there is something in your
> > workload triggering this particular pattern.
> 
> I would not even know what triggers it, nor what it has to do with
> high order, I'm just running find/du, nothing special I'd say.

Please note that find/du is mostly a fragmentation generator. It
seems there is other system activity which requires those high order
allocations.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-02  8:05                                                                                         ` Michal Hocko
@ 2018-11-02 11:31                                                                                           ` Marinko Catovic
  2018-11-02 11:49                                                                                             ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-11-02 11:31 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

Am Fr., 2. Nov. 2018 um 09:05 Uhr schrieb Michal Hocko <mhocko@suse.com>:
>
> On Thu 01-11-18 23:46:27, Marinko Catovic wrote:
> > Am Do., 1. Nov. 2018 um 14:23 Uhr schrieb Michal Hocko <mhocko@suse.com>:
> > >
> > > On Wed 31-10-18 20:21:42, Marinko Catovic wrote:
> > > > Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>:
> > > > >
> > > > > On Wed 31-10-18 15:53:44, Marinko Catovic wrote:
> > > > > [...]
> > > > > > Well caching of any operations with find/du is not necessary imho
> > > > > > anyway, since walking over all these millions of files in that time
> > > > > > period is really not worth caching at all - if there is a way you
> > > > > > mentioned to limit the commands there, that would be great.
> > > > >
> > > > > One possible way would be to run this find/du workload inside a memory
> > > > > cgroup with high limit set to something reasonable (that will likely
> > > > > require some tuning). I am not 100% sure that will behave for metadata
> > > > > mostly workload without almost any pagecache to reclaim so it might turn
> > > > > out this will result in other issues. But it is definitely worth trying.
> > > >
> > > > hm, how would that be possible..? every user has its UID, the group
> > > > can also not be a factor, since this memory restriction would apply to
> > > > all users then, find/du are running as UID 0 to have access to
> > > > everyone's data.
> > >
> > > I thought you have a dedicated script(s) to do all the stats. All you
> > > need is to run that particular script(s) within a memory cgroup
> >
> > yes, that is the case - the scripts are running as root, since as
> > mentioned all users have own UIDs and specific groups, so to have
> > access one would need root privileges.
> > My question was how to limit this using cgroups, since afaik limits
> > there apply to given UIDs/GIDs
>
> No. Limits apply to a specific memory cgroup and all tasks which are
> associated with it. There are many tutorials on how to configure/use
> memory cgroups or cgroups in general. If I were you I would simply do
> this
>
> mount -t cgroup -o memory none $SOME_MOUNTPOINT
> mkdir $SOME_MOUNTPOINT/A
> echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes
>
> Your script then just do
> echo $$ > $SOME_MOUNTPOINT/A/tasks
> # rest of your script
> echo 1 > $SOME_MOUNTPOINT/A/memory.force_empty
>
> That should drop the memory cached on behalf of the memcg A including the
> metadata.

well, that's an interesting approach, I did not know that this was
possible to assign cgroups to PIDs, without additionally explicitly
defining UID/GID. This way memory.force_empty basically acts like echo
3 > drop_caches, but only for the memory affected by the PIDs and its
children/forks from the A/tasks-list, true?

I'll give it a try with the nightly du/find jobs, thank you!

>
>
> [...]
> > > > As I understand everyone would have this issue when extensive walking
> > > > over files is performed, basically any `cloud`, shared hosting or
> > > > storage systems should experience it, true?
> > >
> > > Not really. You need also a high demand for high order allocations to
> > > require contiguous physical memory. Maybe there is something in your
> > > workload triggering this particular pattern.
> >
> > I would not even know what triggers it, nor what it has to do with
> > high order, I'm just running find/du, nothing special I'd say.
>
> Please note that find/du is mostly a fragmentation generator. It
> seems there is other system activity which requires those high order
> allocations.

any idea how to find out what that might be? I'd really have no idea,
I also wonder why this never was an issue with 3.x
find uses regex patterns, that's the only thing that may be unusual.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-02 11:31                                                                                           ` Marinko Catovic
@ 2018-11-02 11:49                                                                                             ` Michal Hocko
  2018-11-02 12:22                                                                                               ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2018-11-02 11:49 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

On Fri 02-11-18 12:31:09, Marinko Catovic wrote:
> Am Fr., 2. Nov. 2018 um 09:05 Uhr schrieb Michal Hocko <mhocko@suse.com>:
> >
> > On Thu 01-11-18 23:46:27, Marinko Catovic wrote:
> > > Am Do., 1. Nov. 2018 um 14:23 Uhr schrieb Michal Hocko <mhocko@suse.com>:
> > > >
> > > > On Wed 31-10-18 20:21:42, Marinko Catovic wrote:
> > > > > Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>:
> > > > > >
> > > > > > On Wed 31-10-18 15:53:44, Marinko Catovic wrote:
> > > > > > [...]
> > > > > > > Well caching of any operations with find/du is not necessary imho
> > > > > > > anyway, since walking over all these millions of files in that time
> > > > > > > period is really not worth caching at all - if there is a way you
> > > > > > > mentioned to limit the commands there, that would be great.
> > > > > >
> > > > > > One possible way would be to run this find/du workload inside a memory
> > > > > > cgroup with high limit set to something reasonable (that will likely
> > > > > > require some tuning). I am not 100% sure that will behave for metadata
> > > > > > mostly workload without almost any pagecache to reclaim so it might turn
> > > > > > out this will result in other issues. But it is definitely worth trying.
> > > > >
> > > > > hm, how would that be possible..? every user has its UID, the group
> > > > > can also not be a factor, since this memory restriction would apply to
> > > > > all users then, find/du are running as UID 0 to have access to
> > > > > everyone's data.
> > > >
> > > > I thought you have a dedicated script(s) to do all the stats. All you
> > > > need is to run that particular script(s) within a memory cgroup
> > >
> > > yes, that is the case - the scripts are running as root, since as
> > > mentioned all users have own UIDs and specific groups, so to have
> > > access one would need root privileges.
> > > My question was how to limit this using cgroups, since afaik limits
> > > there apply to given UIDs/GIDs
> >
> > No. Limits apply to a specific memory cgroup and all tasks which are
> > associated with it. There are many tutorials on how to configure/use
> > memory cgroups or cgroups in general. If I were you I would simply do
> > this
> >
> > mount -t cgroup -o memory none $SOME_MOUNTPOINT
> > mkdir $SOME_MOUNTPOINT/A
> > echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes
> >
> > Your script then just do
> > echo $$ > $SOME_MOUNTPOINT/A/tasks
> > # rest of your script
> > echo 1 > $SOME_MOUNTPOINT/A/memory.force_empty
> >
> > That should drop the memory cached on behalf of the memcg A including the
> > metadata.
> 
> well, that's an interesting approach, I did not know that this was
> possible to assign cgroups to PIDs, without additionally explicitly
> defining UID/GID. This way memory.force_empty basically acts like echo
> 3 > drop_caches, but only for the memory affected by the PIDs and its
> children/forks from the A/tasks-list, true?

Yup
 
> I'll give it a try with the nightly du/find jobs, thank you!

I am still a bit curious how that will work out on metadata mostly
workload because we usually have quite a lot of memory on normal LRUs to
reclaim (page cache, anonymous memory) and slab reclaim is just to
balance kmem. But let's see. Watch for memcg OOM killer invocations if
the reclaim is not sufficient.
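
A quick way to watch for that, assuming cgroup v1 and the memcg A from the
earlier example:

# how many times the group has hit its limit so far
cat $SOME_MOUNTPOINT/A/memory.failcnt
# memcg OOM kills show up in the kernel log
dmesg | grep -i 'memory cgroup out of memory'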

> > [...]
> > > > > As I understand everyone would have this issue when extensive walking
> > > > > over files is performed, basically any `cloud`, shared hosting or
> > > > > storage systems should experience it, true?
> > > >
> > > > Not really. You need also a high demand for high order allocations to
> > > > require contiguous physical memory. Maybe there is something in your
> > > > workload triggering this particular pattern.
> > >
> > > I would not even know what triggers it, nor what it has to do with
> > > high order, I'm just running find/du, nothing special I'd say.
> >
> > Please note that find/du is mostly a fragmentation generator. It
> > seems there is other system activity which requires those high order
> > allocations.
> 
> any idea how to find out what that might be? I'd really have no idea,
> I also wonder why this never was an issue with 3.x
> find uses regex patterns, that's the only thing that may be unusual.

The allocation tracepoint has the stack trace so that might help. This
is quite a lot of work to pinpoint and find a pattern though. This is
way out of the time scope I can devote to this unfortunately. This might be
some driver asking for more, or even the core kernel being more high-order
memory hungry.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-02 11:49                                                                                             ` Michal Hocko
@ 2018-11-02 12:22                                                                                               ` Vlastimil Babka
  2018-11-02 12:41                                                                                                 ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-11-02 12:22 UTC (permalink / raw)
  To: Michal Hocko, Marinko Catovic; +Cc: linux-mm, Christopher Lameter

On 11/2/18 12:49 PM, Michal Hocko wrote:
> On Fri 02-11-18 12:31:09, Marinko Catovic wrote:
>> Am Fr., 2. Nov. 2018 um 09:05 Uhr schrieb Michal Hocko <mhocko@suse.com>:
>>>
>>> On Thu 01-11-18 23:46:27, Marinko Catovic wrote:
>>>> Am Do., 1. Nov. 2018 um 14:23 Uhr schrieb Michal Hocko <mhocko@suse.com>:
>>>>>
>>>>> On Wed 31-10-18 20:21:42, Marinko Catovic wrote:
>>>>>> Am Mi., 31. Okt. 2018 um 18:01 Uhr schrieb Michal Hocko <mhocko@suse.com>:
>>>>>>>
>>>>>>> On Wed 31-10-18 15:53:44, Marinko Catovic wrote:
>>>>>>> [...]
>>>>>>>> Well caching of any operations with find/du is not necessary imho
>>>>>>>> anyway, since walking over all these millions of files in that time
>>>>>>>> period is really not worth caching at all - if there is a way you
>>>>>>>> mentioned to limit the commands there, that would be great.
>>>>>>>
>>>>>>> One possible way would be to run this find/du workload inside a memory
>>>>>>> cgroup with high limit set to something reasonable (that will likely
>>>>>>> require some tuning). I am not 100% sure that will behave for metadata
>>>>>>> mostly workload without almost any pagecache to reclaim so it might turn
>>>>>>> out this will result in other issues. But it is definitely worth trying.
>>>>>>
>>>>>> hm, how would that be possible..? every user has its UID, the group
>>>>>> can also not be a factor, since this memory restriction would apply to
>>>>>> all users then, find/du are running as UID 0 to have access to
>>>>>> everyone's data.
>>>>>
>>>>> I thought you have a dedicated script(s) to do all the stats. All you
>>>>> need is to run that particular script(s) within a memory cgroup
>>>>
>>>> yes, that is the case - the scripts are running as root, since as
>>>> mentioned all users have own UIDs and specific groups, so to have
>>>> access one would need root privileges.
>>>> My question was how to limit this using cgroups, since afaik limits
>>>> there apply to given UIDs/GIDs
>>>
>>> No. Limits apply to a specific memory cgroup and all tasks which are
>>> associated with it. There are many tutorials on how to configure/use
>>> memory cgroups or cgroups in general. If I were you I would simply do
>>> this
>>>
>>> mount -t cgroup -o memory none $SOME_MOUNTPOINT
>>> mkdir $SOME_MOUNTPOINT/A
>>> echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes
>>>
>>> Your script then just do
>>> echo $$ > $SOME_MOUNTPOINT/A/tasks
>>> # rest of your script
>>> echo 1 > $SOME_MOUNTPOINT/A/memory.force_empty
>>>
>>> That should drop the memory cached on behalf of the memcg A including the
>>> metadata.
>>
>> well, that's an interesting approach, I did not know that this was
>> possible to assign cgroups to PIDs, without additionally explicitly
>> defining UID/GID. This way memory.force_empty basically acts like echo
>> 3 > drop_caches, but only for the memory affected by the PIDs and its
>> children/forks from the A/tasks-list, true?
> 
> Yup
>  
>> I'll give it a try with the nightly du/find jobs, thank you!
> 
> I am still a bit curious how that will work out on metadata mostly
> workload because we usually have quite a lot of memory on normal LRUs to
> reclaim (page cache, anonymous memory) and slab reclaim is just to
> balance kmem. But let's see. Watch for memcg OOM killer invocations if
> the reclaim is not sufficient.
> 
>>> [...]
>>>>>> As I understand everyone would have this issue when extensive walking
>>>>>> over files is performed, basically any `cloud`, shared hosting or
>>>>>> storage systems should experience it, true?
>>>>>
>>>>> Not really. You need also a high demand for high order allocations to
>>>>> require contiguous physical memory. Maybe there is something in your
>>>>> workload triggering this particular pattern.
>>>>
>>>> I would not even know what triggers it, nor what it has to do with
>>>> high order, I'm just running find/du, nothing special I'd say.
>>>
>>> Please note that find/du is mostly a fragmentation generator. It
>>> seems there is other system activity which requires those high order
>>> allocations.
>>
>> any idea how to find out what that might be? I'd really have no idea,
>> I also wonder why this never was an issue with 3.x
>> find uses regex patterns, that's the only thing that may be unusual.
> 
> The allocation tracepoint has the stack trace so that might help. This

Well we already checked the mm_page_alloc traces and it seemed that only
THP allocations could be the culprit. But apparently defrag=defer made
no difference. I would still recommend it so we can see the effects on
the traces. And adding tracepoints
compaction/mm_compaction_try_to_compact_pages and
compaction/mm_compaction_suitable as I suggested should show which
high-order allocations actually invoke the compaction.

> is quite a lot of work to pinpoint and find a pattern though. This is
> way out of the time scope I can devote to this unfortunately. This might be
> some driver asking for more, or even the core kernel being more high
> order memory hungry.
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-02 12:22                                                                                               ` Vlastimil Babka
@ 2018-11-02 12:41                                                                                                 ` Marinko Catovic
  2018-11-02 13:13                                                                                                   ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-11-02 12:41 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter

> >> any idea how to find out what that might be? I'd really have no idea,
> >> I also wonder why this never was an issue with 3.x
> >> find uses regex patterns, that's the only thing that may be unusual.
> >
> > The allocation tracepoint has the stack trace so that might help. This
>
> Well we already checked the mm_page_alloc traces and it seemed that only
> THP allocations could be the culprit. But apparently defrag=defer made
> no difference. I would still recommend it so we can see the effects on
> the traces. And adding tracepoints
> compaction/mm_compaction_try_to_compact_pages and
> compaction/mm_compaction_suitable as I suggested should show which
> high-order allocations actually invoke the compaction.

Anything in particular I should do to figure this out?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-02 12:41                                                                                                 ` Marinko Catovic
@ 2018-11-02 13:13                                                                                                   ` Vlastimil Babka
  2018-11-02 13:50                                                                                                     ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-11-02 13:13 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Michal Hocko, linux-mm, Christopher Lameter

On 11/2/18 1:41 PM, Marinko Catovic wrote:
>>>> any idea how to find out what that might be? I'd really have no idea,
>>>> I also wonder why this never was an issue with 3.x
>>>> find uses regex patterns, that's the only thing that may be unusual.
>>>
>>> The allocation tracepoint has the stack trace so that might help. This
>>
>> Well we already checked the mm_page_alloc traces and it seemed that only
>> THP allocations could be the culprit. But apparently defrag=defer made
>> no difference. I would still recommend it so we can see the effects on
>> the traces. And adding tracepoints
>> compaction/mm_compaction_try_to_compact_pages and
>> compaction/mm_compaction_suitable as I suggested should show which
>> high-order allocations actually invoke the compaction.
> 
> Anything in particular I should do to figure this out?

Setup the same monitoring as before, but with two additional tracepoints
(echo 1 > .../enable) and once the problem appears, provide the tracing
output.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-02 13:13                                                                                                   ` Vlastimil Babka
@ 2018-11-02 13:50                                                                                                     ` Marinko Catovic
  2018-11-02 14:49                                                                                                       ` Vlastimil Babka
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-11-02 13:50 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter

Am Fr., 2. Nov. 2018 um 14:13 Uhr schrieb Vlastimil Babka <vbabka@suse.cz>:
>
> On 11/2/18 1:41 PM, Marinko Catovic wrote:
> >>>> any idea how to find out what that might be? I'd really have no idea,
> >>>> I also wonder why this never was an issue with 3.x
> >>>> find uses regex patterns, that's the only thing that may be unusual.
> >>>
> >>> The allocation tracepoint has the stack trace so that might help. This
> >>
> >> Well we already checked the mm_page_alloc traces and it seemed that only
> >> THP allocations could be the culprit. But apparently defrag=defer made
> >> no difference. I would still recommend it so we can see the effects on
> >> the traces. And adding tracepoints
> >> compaction/mm_compaction_try_to_compact_pages and
> >> compaction/mm_compaction_suitable as I suggested should show which
> >> high-order allocations actually invoke the compaction.
> >
> > Anything in particular I should do to figure this out?
>
> Setup the same monitoring as before, but with two additional tracepoints
> (echo 1 > .../enable) and once the problem appears, provide the tracing
> output.

I think I'll need more details about that setup  :)
also, do you want the tracing output every 5sec or just once when it
is around the worst case? what files exactly?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-02 13:50                                                                                                     ` Marinko Catovic
@ 2018-11-02 14:49                                                                                                       ` Vlastimil Babka
  0 siblings, 0 replies; 66+ messages in thread
From: Vlastimil Babka @ 2018-11-02 14:49 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Michal Hocko, linux-mm, Christopher Lameter

On 11/2/18 2:50 PM, Marinko Catovic wrote:
> Am Fr., 2. Nov. 2018 um 14:13 Uhr schrieb Vlastimil Babka <vbabka@suse.cz>:
>>
>> On 11/2/18 1:41 PM, Marinko Catovic wrote:
>>>>>> any idea how to find out what that might be? I'd really have no idea,
>>>>>> I also wonder why this never was an issue with 3.x
>>>>>> find uses regex patterns, that's the only thing that may be unusual.
>>>>>
>>>>> The allocation tracepoint has the stack trace so that might help. This
>>>>
>>>> Well we already checked the mm_page_alloc traces and it seemed that only
>>>> THP allocations could be the culprit. But apparently defrag=defer made
>>>> no difference. I would still recommend it so we can see the effects on
>>>> the traces. And adding tracepoints
>>>> compaction/mm_compaction_try_to_compact_pages and
>>>> compaction/mm_compaction_suitable as I suggested should show which
>>>> high-order allocations actually invoke the compaction.
>>>
>>> Anything in particular I should do to figure this out?
>>
>> Setup the same monitoring as before, but with two additional tracepoints
>> (echo 1 > .../enable) and once the problem appears, provide the tracing
>> output.
> 
> I think I'll need more details about that setup  :)

It's like what you already did based on the suggestion from Michal Hocko:

# mount -t tracefs none /debug/trace/
# echo stacktrace > /debug/trace/trace_options
# echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
# echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
# echo 1 > /debug/trace/events/compaction/mm_compaction_try_to_compact_pages/enable
# echo 1 > /debug/trace/events/compaction/mm_compaction_suitable/enable
# cat /debug/trace/trace_pipe | gzip > /path/to/trace_pipe.txt.gz

And later this to disable tracing.
# echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable

> also, do you want the tracing output every 5sec or just once when it
> is around the worst case? what files exactly?

Collect vmstat periodically every 5 secs as you already did. Tracing is
continuous and results in the single trace_pipe.txt.gz file.
The trace should cover at least some time while you're experiencing the
too much free memory/too little pagecache phase. Might be enough to
enable the collection only after you detect the situation, and before
you e.g. drop caches to restore the system.

To remove THP allocations from the picture, it would be nice if the
system was configured with:
echo defer > /sys/kernel/mm/transparent_hugepage/defrag

Again you can do that only after detecting the problematic situation,
before starting to collect trace.
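
The vmstat collection itself can stay as simple as before (a sketch,
assuming /proc/vmstat is what was sampled; adjust the output path):

while true; do date; cat /proc/vmstat; sleep 5; done > /path/to/vmstat.log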

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-10-31 14:53                                                                               ` Marinko Catovic
  2018-10-31 17:01                                                                                 ` Michal Hocko
@ 2018-11-02 14:59                                                                                 ` Vlastimil Babka
  2018-11-30 12:01                                                                                   ` Marinko Catovic
  1 sibling, 1 reply; 66+ messages in thread
From: Vlastimil Babka @ 2018-11-02 14:59 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Michal Hocko, linux-mm, Christopher Lameter

Forgot to answer this:

On 10/31/18 3:53 PM, Marinko Catovic wrote:
> Well caching of any operations with find/du is not necessary imho
> anyway, since walking over all these millions of files in that time
> period is really not worth caching at all - if there is a way you
> mentioned to limit the commands there, that would be great.
> Also I want to mention that these operations were in use with 3.x
> kernels as well, for years, with absolutely zero issues.

Yep, something had to change at some point. Possibly the
reclaim/compaction loop. Probably not the way dentries/inodes are being
cached though.

> 2 > drop_caches right after that is something I considered, I just had
> some bad experience with this, since I tried it around 5:00 AM in the
> first place to give it enough spare time to finish, since sync; echo 2
>> drop_caches can take some time, hence my question about lowering the
> limits in mm/vmscan.c, void drop_slab_node(int nid)
> 
> I could do this effectively right after find/du at 07:45, just hoping
> that this is finished soon enough - in one worst case it took over 2
> hours (from 05:00 AM to 07:00 AM), since the host was busy during that
> time with find/du, never having freed enough caches to continue, hence

Dropping caches while find/du is still running would be
counter-productive. If done after it's already finished, it shouldn't be
so disruptive.

> my question to let it stop earlier with the modification of
> drop_slab_node ... it was just an idea, nevermind if you believe that
> it was a bad one :)

Finding a universally "correct" threshold could easily be impossible. I
guess the proper solution would be to drop the while loop and
restructure the shrinking so that it would do a single pass through all
objects.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-02 14:59                                                                                 ` Vlastimil Babka
@ 2018-11-30 12:01                                                                                   ` Marinko Catovic
  2018-12-10 21:30                                                                                     ` Marinko Catovic
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-11-30 12:01 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter

Am Fr., 2. Nov. 2018 um 15:59 Uhr schrieb Vlastimil Babka <vbabka@suse.cz>:
>
> Forgot to answer this:
>
> On 10/31/18 3:53 PM, Marinko Catovic wrote:
> > Well caching of any operations with find/du is not necessary imho
> > anyway, since walking over all these millions of files in that time
> > period is really not worth caching at all - if there is a way you
> > mentioned to limit the commands there, that would be great.
> > Also I want to mention that these operations were in use with 3.x
> > kernels as well, for years, with absolutely zero issues.
>
> Yep, something had to change at some point. Possibly the
> reclaim/compaction loop. Probably not the way dentries/inodes are being
> cached though.
>
> > 2 > drop_caches right after that is something I considered, I just had
> > some bad experience with this, since I tried it around 5:00 AM in the
> > first place to give it enough spare time to finish, since sync; echo 2
> >> drop_caches can take some time, hence my question about lowering the
> > limits in mm/vmscan.c, void drop_slab_node(int nid)
> >
> > I could do this effectively right after find/du at 07:45, just hoping
> > that this is finished soon enough - in one worst case it took over 2
> > hours (from 05:00 AM to 07:00 AM), since the host was busy during that
> > time with find/du, never having freed enough caches to continue, hence
>
> Dropping caches while find/du is still running would be
> counter-productive. If done after it's already finished, it shouldn't be
> so disruptive.
>
> > my question to let it stop earlier with the modification of
> > drop_slab_node ... it was just an idea, nevermind if you believe that
> > it was a bad one :)
>
> Finding a universally "correct" threshold could easily be impossible. I
> guess the proper solution would be to drop the while loop and
> restructure the shrinking so that it would do a single pass through all
> objects.

well after a few weeks to make sure, the results seem very promising.
There were no issues any more after setting up the cgroup with the limit.

This workaround is a good idea anyway, since it prevents the nightly processes
from eating up all the caching/buffers, which become useless in
the morning anyway, so performance got even better - although the issue is
not fixed with that workaround.
Since other people will be affected sooner or later as well imho,
hopefully you'll figure out a fix soon.

Nevertheless I also ran into a new problem there.
While writing the PID into the tasks file (echo $$ > ../tasks), or a
direct fprintf(tasks_fp, "%d", getpid()); from C,
works very well, I also had problems with daemons that I wanted to
start (e.g. a SQL server) from within that cgroup-controlled binary.
This results in the SQL server's task being killed, since the memory limit is
exceeded. I would not like to set memory.limit_in_bytes to
something huge, such as 30G, just to be safe; I'd rather use a
wrapper script to handle this, for example:
1) the cgroup-controlled instance starts the wrapper script
2) which excludes itself from the tasks PID list (so the wrapper
script is not controlled any more)
3) it starts or does whatever is necessary, which should then continue
normally without the memory restriction

Currently I fail to manage this, since I do not know how to do step 2.
echo $PID > tasks writes into it and adds the PID, but how would one
remove the wrapper script's PID from there?
I came up with: cat /cgpath/A/tasks | sed "/$$/d" | cat >
/cgpath/A/tasks ..which results in a list without the current PID,
however, it fails to write to tasks with cat: write error: Invalid
argument, since this is not a regular file.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-11-30 12:01                                                                                   ` Marinko Catovic
@ 2018-12-10 21:30                                                                                     ` Marinko Catovic
  2018-12-10 21:47                                                                                       ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Marinko Catovic @ 2018-12-10 21:30 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Michal Hocko, linux-mm, Christopher Lameter

> Currently I fail to manage this, since I do not know how to do step 2.
> echo $PID > tasks writes into it and adds the PID, but how would one
> remove the wrapper script's PID from there?

any ideas on this perhaps?
The workaround, otherwise working perfectly fine, causes huge problems there
since I have to exclude certain processes from that tasklist.

Basically I'd need to know how to remove a PID from the mountpoint, created by

mount -t cgroup -o memory none $SOME_MOUNTPOINT
mkdir $SOME_MOUNTPOINT/A
echo 500M > $SOME_MOUNTPOINT/A/memory.limit_in_bytes

aka remove a specific PID from $SOME_MOUNTPOINT/A/tasks

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Caching/buffers become useless after some time
  2018-12-10 21:30                                                                                     ` Marinko Catovic
@ 2018-12-10 21:47                                                                                       ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2018-12-10 21:47 UTC (permalink / raw)
  To: Marinko Catovic; +Cc: Vlastimil Babka, linux-mm, Christopher Lameter

On Mon 10-12-18 22:30:40, Marinko Catovic wrote:
> > Currently I fail to manage this, since I do not know how to do step 2.
> > echo $PID > tasks writes into it and adds the PID, but how would one
> > remove the wrapper script's PID from there?
> 
> any ideas on this perhaps?
> The workaround, otherwise working perfectly fine, causes huge problems there
> since I have to exclude certain processes from that tasklist.

I am sorry, I didn't get to your previous email. But this is quite
simple. You just echo those pids to a different cgroup, e.g. the root
one at the top of the mounted hierarchy. There are also wrappers to
execute a task in a specific cgroup in the libcgroup package, and I am
pretty sure systemd has its own mechanisms to achieve the same.
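
For your wrapper script that would look something like this (a sketch; it
assumes the same $SOME_MOUNTPOINT as before, where the root group's tasks
file sits directly at the mountpoint):

# move this shell (the wrapper) back to the root memcg of the hierarchy
echo $$ > $SOME_MOUNTPOINT/tasks
# from here on, anything the wrapper starts is no longer limited by A,
# e.g. the SQL server daemon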
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2018-12-10 21:47 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-11 13:18 Caching/buffers become useless after some time Marinko Catovic
2018-07-12 11:34 ` Michal Hocko
2018-07-13 15:48   ` Marinko Catovic
2018-07-16 15:53     ` Marinko Catovic
2018-07-16 16:23       ` Michal Hocko
2018-07-16 16:33         ` Marinko Catovic
2018-07-16 16:45           ` Michal Hocko
2018-07-20 22:03             ` Marinko Catovic
2018-07-27 11:15               ` Vlastimil Babka
2018-07-30 14:40                 ` Michal Hocko
2018-07-30 22:08                   ` Marinko Catovic
2018-08-02 16:15                     ` Vlastimil Babka
2018-08-03 14:13                       ` Marinko Catovic
2018-08-06  9:40                         ` Vlastimil Babka
2018-08-06 10:29                           ` Marinko Catovic
2018-08-06 12:00                             ` Michal Hocko
2018-08-06 15:37                               ` Christopher Lameter
2018-08-06 18:16                                 ` Michal Hocko
2018-08-09  8:29                                   ` Marinko Catovic
2018-08-21  0:36                                     ` Marinko Catovic
2018-08-21  6:49                                       ` Michal Hocko
2018-08-21  7:19                                         ` Vlastimil Babka
2018-08-22 20:02                                           ` Marinko Catovic
2018-08-23 12:10                                             ` Vlastimil Babka
2018-08-23 12:21                                               ` Michal Hocko
2018-08-24  0:11                                                 ` Marinko Catovic
2018-08-24  6:34                                                   ` Vlastimil Babka
2018-08-24  8:11                                                     ` Marinko Catovic
2018-08-24  8:36                                                       ` Vlastimil Babka
2018-08-29 14:54                                                         ` Marinko Catovic
2018-08-29 15:01                                                           ` Michal Hocko
2018-08-29 15:13                                                             ` Marinko Catovic
2018-08-29 15:27                                                               ` Michal Hocko
2018-08-29 16:44                                                                 ` Marinko Catovic
2018-10-22  1:19                                                                   ` Marinko Catovic
2018-10-23 17:41                                                                     ` Marinko Catovic
2018-10-26  5:48                                                                       ` Marinko Catovic
2018-10-26  8:01                                                                     ` Michal Hocko
2018-10-26 23:31                                                                       ` Marinko Catovic
2018-10-27  6:42                                                                         ` Michal Hocko
     [not found]                                                                     ` <6e3a9434-32f2-0388-e0c7-2bd1c2ebc8b1@suse.cz>
2018-10-30 15:30                                                                       ` Michal Hocko
2018-10-30 16:08                                                                         ` Marinko Catovic
2018-10-30 17:00                                                                           ` Vlastimil Babka
2018-10-30 18:26                                                                             ` Marinko Catovic
2018-10-31  7:34                                                                               ` Michal Hocko
2018-10-31  7:32                                                                             ` Michal Hocko
2018-10-31 13:40                                                                             ` Vlastimil Babka
2018-10-31 14:53                                                                               ` Marinko Catovic
2018-10-31 17:01                                                                                 ` Michal Hocko
2018-10-31 19:21                                                                                   ` Marinko Catovic
2018-11-01 13:23                                                                                     ` Michal Hocko
2018-11-01 22:46                                                                                       ` Marinko Catovic
2018-11-02  8:05                                                                                         ` Michal Hocko
2018-11-02 11:31                                                                                           ` Marinko Catovic
2018-11-02 11:49                                                                                             ` Michal Hocko
2018-11-02 12:22                                                                                               ` Vlastimil Babka
2018-11-02 12:41                                                                                                 ` Marinko Catovic
2018-11-02 13:13                                                                                                   ` Vlastimil Babka
2018-11-02 13:50                                                                                                     ` Marinko Catovic
2018-11-02 14:49                                                                                                       ` Vlastimil Babka
2018-11-02 14:59                                                                                 ` Vlastimil Babka
2018-11-30 12:01                                                                                   ` Marinko Catovic
2018-12-10 21:30                                                                                     ` Marinko Catovic
2018-12-10 21:47                                                                                       ` Michal Hocko
2018-10-31 13:12                                                                     ` Vlastimil Babka
2018-08-24  6:24                                                 ` Vlastimil Babka
